Cytokine-induced Immune Programs and custom gene progam enrichment analysis#

From the Human Cytokine Dictionary, we derived 82 Cytokine-induced Immune Programs (CIP). This tutorial guides you through the enrichment analysis of these gene sets. Your query data should be a transcriptomic data object in AnnData’s .h5ad format with gene symbols in .var axis and metadata describing immune celltypes and experimental conditions of samples.

This tutorial runs you through the core analysis of this package, which computes enrichment scores for CIPs in your chosen condition. The user can also define their own gene programs.

# Import core packages and the hucira package.

import os
import time
import warnings

import pandas as pd
from tqdm import TqdmWarning

import hucira as hc

warnings.simplefilter("ignore", TqdmWarning)

start = time.time()
current_wd = os.getcwd()
print(current_wd)
/ictstr01/home/icb/jenni.liu/all_projects/cytokine_dict_project/huCIRA/docs/notebooks

1. Load input data#

The two main input files for this analysis are:

  1. the Cytokine-induced Immune Programs (CIP)

  2. your transcriptome data object

Explore their metadata annotation for cell types and disease condition. The files are saved to or loaded from the current working directory. The path is explicitly specified here, although it’s also the default saving location if left unspecified.

#### Load the file containing our cytokine-induced immune programs derived from DRVI
# https://www.biorxiv.org/content/10.1101/2024.11.06.622266v3

cip_sig = hc.load_CIP_signatures(save_dir=current_wd)
cip_sig.head()
gene CIP effect_size celltype
0 MEOX1 MD-05 CellInteraction 0.120686 Treg
1 MEOX1 MD-05 CellInteraction 0.120686 CD4_T_cell
2 MEOX1 MD-05 CellInteraction 0.120686 CD8_T_cell
3 MEOX1 MD-05 CellInteraction 0.120686 cDC
4 MEOX1 MD-05 CellInteraction 0.120686 MAIT
#### Load the query adata object

adata = hc.load_multiple_sclerosis_data()
adata
AnnData object with n_obs × n_vars = 65326 × 10266
    obs: 'labels', 'MS', 'CSF', 'valid_clusters', 'CD4_labels'
    obsm: 'X_umap'
#### Optional for easy workflow:
# Enter celltype column name and condition column name of your query adata

your_celltype_colname = "labels"
your_contrast_colname = "MS"

2. Process and prepare input data parameters#

Follow these steps:

  1. Check if features are in HGNC format. Must match gene symbol nomenclature of dictionary.

  2. Because there is no standard nomenclature for cell types, we have to manually create celltype_combos, a mapping that links the cell types in the query dataset to the corresponding cell types in the human cytokine dictionary.

  3. Choose the experimental condition of interest for your query data (contrast)

Steps 2 and 3, just like defining cell type and contrast columns earlier, are optional. However, it can be helpful to get an overview of your data and run the pipeline more smoothly.

adata.var.head()
FO538757.2
AP006222.2
RP11-206L10.9
FAM41C
NOC2L

The Enrichment analysis needs two main pieces of information from the query adata: cell types and experimental (disease) conditions.

Annotations have to be examined manually, because naming conventions between objects differ. Compare the cell type annotation of the dictionary with the naming convention of your query adata and re-annotate your query adata’s annotation if necessary.

print(f"All celltypes in signatures:\n  {cip_sig.celltype.unique()}\n")
print(f"All celltypes in query data:\n  {sorted(adata.obs[your_celltype_colname].unique())}\n")
print(
    f"All experimental states (contrasts/conditions) in query data:\n  {sorted(adata.obs[your_contrast_colname].unique())}\n"
)
All celltypes in signatures:
  ['Treg' 'CD4_T_cell' 'CD8_T_cell' 'cDC' 'MAIT' 'CD14_Mono' 'CD16_Mono'
 'B_cell' 'NK_CD56hi' 'NK_CD56low' 'NKT' 'HSPC']

All celltypes in query data:
  ['B cell doublets', 'B1', 'B2', 'CD4', 'CD8a', 'CD8n', 'Gran', 'MegaK', 'Mono', 'Mono Doublet', 'NK1', 'NK2', 'RBC', 'Tdg', 'Tregs', 'contamination1', 'doublet', 'mDC1', 'mDC2', 'ncMono', 'pDC', 'plasma']

All experimental states (contrasts/conditions) in query data:
  ['False', 'True']

To create celltype_combos, enter ordered lists that match the same cell types across different naming conventions in your data vs the human cytokine dictionary. “create_celltype_combos()” can help you to create the right format, but you can also format manually.

# Define celltype_combos, an input parameter for enrichment testing
adata_celltypes = ["B1", "CD8a", "Mono"]
hcd_celltypes = ["B_cell", "CD8_T_cell", "CD14_Mono"]

celltype_combos = hc.create_celltype_combos(adata_celltypes, hcd_celltypes)
print(celltype_combos)

# Define condition of interest
contrast = ("True", "False")
print(contrast)
[('B1', 'B_cell'), ('CD8a', 'CD8_T_cell'), ('Mono', 'CD14_Mono')]
('True', 'False')

3.1. Run enrichment analysis#

The main analysis is done by run_one_enrichment_test(), which computes enrichment scores of one query celltype and different conditions. This simple example returns an example of the main outcome of this cytokine enrichment score analysis.

  1. run_one_enrichment_test() returns enrichment results for one celltype and contrast.

  2. check_robustness() returns the robust and significant results from that initial enrichment analysis using a single threshold to filter FDR q-val (statistical significance of enrichment result) and making sure that a gene program is considered robust only if enough tests are evaluable and a sufficient fraction of them are significant. More about this in the advanced tutorial.

enrichment_results = hc.run_one_enrichment_test(
    adata=adata,
    df=cip_sig,
    contrast_combo=contrast,
    celltype_combo=celltype_combos[0],
    contrast_column=your_contrast_colname,
    celltype_column=your_celltype_colname,
    direction="upregulated",
    threshold_lfc=1.0,
    threshold_expression=0.05,
)

enrichment_results[
    [
        "celltype_combo",
        "CIP",
        "contrast",
        "direction",
        "ES",
        "NES",
        "NOM p-val",
        "FDR q-val",
        "FWER p-val",
        "frac_shared_genes_signature",
        "Lead_genes",
    ]
].sort_values("NES", ascending=False)
celltype_combo CIP contrast direction ES NES NOM p-val FDR q-val FWER p-val frac_shared_genes_signature Lead_genes
0 B1 (B_cell) PD-05 BcellAct True_vs_False upregulated 0.554781 1.95207 0.0 0.0 0.0 0.508380 MS4A1;CORO1A;PTPN6;TLR10;SYPL1;LYL1;CXCR5;ANXA...
1 B1 (B_cell) PD-08 BcellProlif True_vs_False upregulated 0.517229 1.912568 0.0 0.0 0.0 0.268102 CD79B;MS4A1;FCMR;CD180;CORO1A;ID3;PTPN6;HVCN1;...
2 B1 (B_cell) PD-11 BcellSignal True_vs_False upregulated 0.546199 1.798589 0.0 0.0 0.0 0.358025 CD79B;MS4A1;FCMR;IGHM;ID3;HVCN1;PARP1;TCL1A;BT...
3 B1 (B_cell) FC-05 IgE-Humoral True_vs_False upregulated 0.424371 1.583758 0.001056 0.001586 0.006 0.386700 CD79B;MS4A1;IL4R;IGHM;HVCN1;TLR10;SYPL1;SELL;B...
4 B1 (B_cell) PD-03 BcellDiff-1 True_vs_False upregulated 0.372912 1.499887 0.0 0.005288 0.022 0.301955 CCR7;STX7;MS4A1;FCMR;IFI16;IGHM;SNX2;HVCN1;CLE...
5 B1 (B_cell) PD-04 BcellReg True_vs_False upregulated 0.369261 1.498489 0.0 0.004583 0.023 0.262981 CD79B;PLEKHF2;STX7;STK17A;MS4A1;FCMR;IFI16;IL4...
6 B1 (B_cell) PD-10 Transcriptional True_vs_False upregulated 0.330519 1.28669 0.024896 0.046533 0.24 0.340220 HLA-DQA1;MS4A1;FCMR;IL4R;ID3;PTPN6;HVCN1;CLEC2...
# Filter enrichment results for robust and significant outputs.

robust_results = hc.check_robustness(
    all_results=enrichment_results, threshold_qval=0.1, threshold_valid=0.1, threshold_below_alpha=0.9
)
robust_results
celltype_combo contrast CIP frac_valid frac_significant is_robust NES_min NES_max FDR q-val qval_threshold threshold_frac_below_alpha
0 B1 (B_cell) True_vs_False PD-05 BcellAct 1.0 1.0 True 1.952070 1.952070 0.000000 0.1 0.9
1 B1 (B_cell) True_vs_False PD-08 BcellProlif 1.0 1.0 True 1.912568 1.912568 0.000000 0.1 0.9
2 B1 (B_cell) True_vs_False PD-11 BcellSignal 1.0 1.0 True 1.798589 1.798589 0.000000 0.1 0.9
3 B1 (B_cell) True_vs_False FC-05 IgE-Humoral 1.0 1.0 True 1.583758 1.583758 0.001586 0.1 0.9
4 B1 (B_cell) True_vs_False PD-03 BcellDiff-1 1.0 1.0 True 1.499887 1.499887 0.005288 0.1 0.9
5 B1 (B_cell) True_vs_False PD-04 BcellReg 1.0 1.0 True 1.498489 1.498489 0.004583 0.1 0.9
6 B1 (B_cell) True_vs_False PD-10 Transcriptional 1.0 1.0 True 1.286690 1.286690 0.046533 0.1 0.9

3.2 Run enrichment analysis of custom gene programs#

The main analysis is performed by run_one_enrichment_test(), which computes enrichment scores for a single query cell type across different conditions. Instead of using the cytokine-induced gene programs identified in the paper, you can also provide your own gene programs (gene sets) of interest.

These columns are required in the input table: gene, query_program, and celltype.

#### Creating toy data (derived from CIP) to show functionality of custom gene sets.
# These columns are necessary: ["gene", "query_program", "celltype"]

genes = [
    "MEOX1",
    "FAS",
    "SYNE2",
    "TESK2",
    "ALPK1",
    "IPCEF1",
    "TXK",
    "KIFAP3",
    "CEACAM1",
    "MAGI3",
    "IL12RB2",
    "DOCK9",
    "LAG3",
    "RGS1",
    "ESR1",
    "FKBP5",
    "APOL1",
    "RANGAP1",
    "RPS6KA5",
    "NFATC2",
    "ISM1",
    "LPIN2",
    "NPDC1",
    "BMPR1A",
    "ACTA2",
    "SIAE",
    "DUSP16",
    "FAM184A",
    "CEP70",
    "XRN1",
    "SLC25A12",
    "PRG4",
    "MAN1C1",
    "PKD2",
    "KDSR",
    "EPHX2",
    "BCL2L14",
    "HYCC1",
    "BATF3",
    "FAM110A",
    "POLN",
    "RFXAP",
    "GIMAP4",
    "PHF11",
    "IRF4",
    "STAMBPL1",
    "FAM13A",
    "HAPLN3",
    "GAS8",
    "TAF4B",
    "GALM",
    "IGFBP3",
    "RAB19",
    "MSRB2",
    "ABTB3",
    "PRKCA",
    "LY96",
    "CHODL",
    "GOLGA7B",
    "SUSD3",
    "SLC37A3",
    "UBASH3A",
    "BDH1",
    "WDR64",
    "EIF4E3",
    "TRAT1",
    "ICOS",
    "F2RL1",
    "ERAP2",
    "ENOX2",
    "GLB1L3",
    "SCG5",
    "PRRX2",
    "TSPAN5",
    "TAPT1",
    "CMTM8",
    "DLGAP1",
    "TMEM126B",
    "MLLT3",
    "PTPRM",
    "GPR171",
    "CLVS1",
    "GRAMD1C",
    "GIMAP7",
    "TSHZ2",
    "GPR19",
    "GBP6",
    "KCNQ3",
    "STING1",
    "GPRIN3",
    "PLA2G2C",
    "PCDH18",
    "KIAA0408",
    "GIMAP5",
    "AFAP1",
    "PGAP1",
    "DPP4",
    "CD247",
    "ARHGAP11A",
    "PCMTD2",
    "ATP10A",
    "SOGA3",
    "HAUS3",
]

custom_gene_sets = pd.DataFrame(
    {
        "gene": genes,
        "query_program": ["some_gene_program"] * len(genes),
        "celltype": ["CD8_T_cell"] * len(genes),
    }
)

custom_gene_sets.head()
gene query_program celltype
0 MEOX1 some_gene_program CD8_T_cell
1 FAS some_gene_program CD8_T_cell
2 SYNE2 some_gene_program CD8_T_cell
3 TESK2 some_gene_program CD8_T_cell
4 ALPK1 some_gene_program CD8_T_cell
enrichment_results = hc.run_one_enrichment_test(
    adata=adata,
    df=custom_gene_sets,
    contrast_combo=("True", "False"),
    celltype_combo=("CD8a", "CD8_T_cell"),
    contrast_column=your_contrast_colname,
    celltype_column=your_celltype_colname,
    direction="upregulated",
    threshold_lfc=1.0,
    threshold_expression=0.05,
    threshold_pval=0.01,
)

enrichment_results.sort_values("NES", ascending=False)
Name query_program ES NES NOM p-val FDR q-val FWER p-val Tag % Gene % Lead_genes ... threshold_expression min_size max_size permutation_num weight seed threads num_genes_signature num_shared_genes_signature frac_shared_genes_signature
0 prerank some_gene_program 0.502597 1.412391 0.043728 0.043728 0.038 4/30 8.21% GIMAP7;GIMAP4;CD247;GIMAP5 ... 0.05 10 1000 1000 1.0 2025 6 103.0 30.0 0.291262

1 rows × 30 columns

end = time.time()
print(f"Time: {int((end - start) / 60)} min {int((end - start) % 60)} sec")
Time: 0 min 4 sec
## --END-- ##