Cytokine-induced Immune Programs and custom gene progam enrichment analysis#
From the Human Cytokine Dictionary, we derived 82 Cytokine-induced Immune Programs (CIP). This tutorial guides you through the enrichment analysis of these gene sets. Your query data should be a transcriptomic data object in AnnData’s .h5ad format with gene symbols in .var axis and metadata describing immune celltypes and experimental conditions of samples.
This tutorial runs you through the core analysis of this package, which computes enrichment scores for CIPs in your chosen condition. The user can also define their own gene programs.
# Import core packages and the hucira package.
import os
import time
import warnings
import pandas as pd
from tqdm import TqdmWarning
import hucira as hc
warnings.simplefilter("ignore", TqdmWarning)
start = time.time()
current_wd = os.getcwd()
print(current_wd)
/ictstr01/home/icb/jenni.liu/all_projects/cytokine_dict_project/huCIRA/docs/notebooks
1. Load input data#
The two main input files for this analysis are:
the Cytokine-induced Immune Programs (CIP)
your transcriptome data object
Explore their metadata annotation for cell types and disease condition. The files are saved to or loaded from the current working directory. The path is explicitly specified here, although it’s also the default saving location if left unspecified.
#### Load the file containing our cytokine-induced immune programs derived from DRVI
# https://www.biorxiv.org/content/10.1101/2024.11.06.622266v3
cip_sig = hc.load_CIP_signatures(save_dir=current_wd)
cip_sig.head()
| gene | CIP | effect_size | celltype | |
|---|---|---|---|---|
| 0 | MEOX1 | MD-05 CellInteraction | 0.120686 | Treg |
| 1 | MEOX1 | MD-05 CellInteraction | 0.120686 | CD4_T_cell |
| 2 | MEOX1 | MD-05 CellInteraction | 0.120686 | CD8_T_cell |
| 3 | MEOX1 | MD-05 CellInteraction | 0.120686 | cDC |
| 4 | MEOX1 | MD-05 CellInteraction | 0.120686 | MAIT |
#### Load the query adata object
adata = hc.load_multiple_sclerosis_data()
adata
AnnData object with n_obs × n_vars = 65326 × 10266
obs: 'labels', 'MS', 'CSF', 'valid_clusters', 'CD4_labels'
obsm: 'X_umap'
#### Optional for easy workflow:
# Enter celltype column name and condition column name of your query adata
your_celltype_colname = "labels"
your_contrast_colname = "MS"
2. Process and prepare input data parameters#
Follow these steps:
Check if features are in HGNC format. Must match gene symbol nomenclature of dictionary.
Because there is no standard nomenclature for cell types, we have to manually create
celltype_combos, a mapping that links the cell types in the query dataset to the corresponding cell types in the human cytokine dictionary.Choose the experimental condition of interest for your query data (
contrast)
Steps 2 and 3, just like defining cell type and contrast columns earlier, are optional. However, it can be helpful to get an overview of your data and run the pipeline more smoothly.
adata.var.head()
| FO538757.2 |
|---|
| AP006222.2 |
| RP11-206L10.9 |
| FAM41C |
| NOC2L |
The Enrichment analysis needs two main pieces of information from the query adata: cell types and experimental (disease) conditions.
Annotations have to be examined manually, because naming conventions between objects differ. Compare the cell type annotation of the dictionary with the naming convention of your query adata and re-annotate your query adata’s annotation if necessary.
print(f"All celltypes in signatures:\n {cip_sig.celltype.unique()}\n")
print(f"All celltypes in query data:\n {sorted(adata.obs[your_celltype_colname].unique())}\n")
print(
f"All experimental states (contrasts/conditions) in query data:\n {sorted(adata.obs[your_contrast_colname].unique())}\n"
)
All celltypes in signatures:
['Treg' 'CD4_T_cell' 'CD8_T_cell' 'cDC' 'MAIT' 'CD14_Mono' 'CD16_Mono'
'B_cell' 'NK_CD56hi' 'NK_CD56low' 'NKT' 'HSPC']
All celltypes in query data:
['B cell doublets', 'B1', 'B2', 'CD4', 'CD8a', 'CD8n', 'Gran', 'MegaK', 'Mono', 'Mono Doublet', 'NK1', 'NK2', 'RBC', 'Tdg', 'Tregs', 'contamination1', 'doublet', 'mDC1', 'mDC2', 'ncMono', 'pDC', 'plasma']
All experimental states (contrasts/conditions) in query data:
['False', 'True']
To create celltype_combos, enter ordered lists that match the same cell types across different naming conventions in your data vs the human cytokine dictionary. “create_celltype_combos()” can help you to create the right format, but you can also format manually.
# Define celltype_combos, an input parameter for enrichment testing
adata_celltypes = ["B1", "CD8a", "Mono"]
hcd_celltypes = ["B_cell", "CD8_T_cell", "CD14_Mono"]
celltype_combos = hc.create_celltype_combos(adata_celltypes, hcd_celltypes)
print(celltype_combos)
# Define condition of interest
contrast = ("True", "False")
print(contrast)
[('B1', 'B_cell'), ('CD8a', 'CD8_T_cell'), ('Mono', 'CD14_Mono')]
('True', 'False')
3.1. Run enrichment analysis#
The main analysis is done by run_one_enrichment_test(), which computes enrichment scores of one query celltype and different conditions.
This simple example returns an example of the main outcome of this cytokine enrichment score analysis.
run_one_enrichment_test() returns enrichment results for one celltype and contrast.
check_robustness() returns the robust and significant results from that initial enrichment analysis using a single threshold to filter FDR q-val (statistical significance of enrichment result) and making sure that a gene program is considered robust only if enough tests are evaluable and a sufficient fraction of them are significant. More about this in the advanced tutorial.
enrichment_results = hc.run_one_enrichment_test(
adata=adata,
df=cip_sig,
contrast_combo=contrast,
celltype_combo=celltype_combos[0],
contrast_column=your_contrast_colname,
celltype_column=your_celltype_colname,
direction="upregulated",
threshold_lfc=1.0,
threshold_expression=0.05,
)
enrichment_results[
[
"celltype_combo",
"CIP",
"contrast",
"direction",
"ES",
"NES",
"NOM p-val",
"FDR q-val",
"FWER p-val",
"frac_shared_genes_signature",
"Lead_genes",
]
].sort_values("NES", ascending=False)
| celltype_combo | CIP | contrast | direction | ES | NES | NOM p-val | FDR q-val | FWER p-val | frac_shared_genes_signature | Lead_genes | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | B1 (B_cell) | PD-05 BcellAct | True_vs_False | upregulated | 0.554781 | 1.95207 | 0.0 | 0.0 | 0.0 | 0.508380 | MS4A1;CORO1A;PTPN6;TLR10;SYPL1;LYL1;CXCR5;ANXA... |
| 1 | B1 (B_cell) | PD-08 BcellProlif | True_vs_False | upregulated | 0.517229 | 1.912568 | 0.0 | 0.0 | 0.0 | 0.268102 | CD79B;MS4A1;FCMR;CD180;CORO1A;ID3;PTPN6;HVCN1;... |
| 2 | B1 (B_cell) | PD-11 BcellSignal | True_vs_False | upregulated | 0.546199 | 1.798589 | 0.0 | 0.0 | 0.0 | 0.358025 | CD79B;MS4A1;FCMR;IGHM;ID3;HVCN1;PARP1;TCL1A;BT... |
| 3 | B1 (B_cell) | FC-05 IgE-Humoral | True_vs_False | upregulated | 0.424371 | 1.583758 | 0.001056 | 0.001586 | 0.006 | 0.386700 | CD79B;MS4A1;IL4R;IGHM;HVCN1;TLR10;SYPL1;SELL;B... |
| 4 | B1 (B_cell) | PD-03 BcellDiff-1 | True_vs_False | upregulated | 0.372912 | 1.499887 | 0.0 | 0.005288 | 0.022 | 0.301955 | CCR7;STX7;MS4A1;FCMR;IFI16;IGHM;SNX2;HVCN1;CLE... |
| 5 | B1 (B_cell) | PD-04 BcellReg | True_vs_False | upregulated | 0.369261 | 1.498489 | 0.0 | 0.004583 | 0.023 | 0.262981 | CD79B;PLEKHF2;STX7;STK17A;MS4A1;FCMR;IFI16;IL4... |
| 6 | B1 (B_cell) | PD-10 Transcriptional | True_vs_False | upregulated | 0.330519 | 1.28669 | 0.024896 | 0.046533 | 0.24 | 0.340220 | HLA-DQA1;MS4A1;FCMR;IL4R;ID3;PTPN6;HVCN1;CLEC2... |
# Filter enrichment results for robust and significant outputs.
robust_results = hc.check_robustness(
all_results=enrichment_results, threshold_qval=0.1, threshold_valid=0.1, threshold_below_alpha=0.9
)
robust_results
| celltype_combo | contrast | CIP | frac_valid | frac_significant | is_robust | NES_min | NES_max | FDR q-val | qval_threshold | threshold_frac_below_alpha | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | B1 (B_cell) | True_vs_False | PD-05 BcellAct | 1.0 | 1.0 | True | 1.952070 | 1.952070 | 0.000000 | 0.1 | 0.9 |
| 1 | B1 (B_cell) | True_vs_False | PD-08 BcellProlif | 1.0 | 1.0 | True | 1.912568 | 1.912568 | 0.000000 | 0.1 | 0.9 |
| 2 | B1 (B_cell) | True_vs_False | PD-11 BcellSignal | 1.0 | 1.0 | True | 1.798589 | 1.798589 | 0.000000 | 0.1 | 0.9 |
| 3 | B1 (B_cell) | True_vs_False | FC-05 IgE-Humoral | 1.0 | 1.0 | True | 1.583758 | 1.583758 | 0.001586 | 0.1 | 0.9 |
| 4 | B1 (B_cell) | True_vs_False | PD-03 BcellDiff-1 | 1.0 | 1.0 | True | 1.499887 | 1.499887 | 0.005288 | 0.1 | 0.9 |
| 5 | B1 (B_cell) | True_vs_False | PD-04 BcellReg | 1.0 | 1.0 | True | 1.498489 | 1.498489 | 0.004583 | 0.1 | 0.9 |
| 6 | B1 (B_cell) | True_vs_False | PD-10 Transcriptional | 1.0 | 1.0 | True | 1.286690 | 1.286690 | 0.046533 | 0.1 | 0.9 |
3.2 Run enrichment analysis of custom gene programs#
The main analysis is performed by run_one_enrichment_test(), which computes enrichment scores for a single query cell type across different conditions. Instead of using the cytokine-induced gene programs identified in the paper, you can also provide your own gene programs (gene sets) of interest.
These columns are required in the input table: gene, query_program, and celltype.
#### Creating toy data (derived from CIP) to show functionality of custom gene sets.
# These columns are necessary: ["gene", "query_program", "celltype"]
genes = [
"MEOX1",
"FAS",
"SYNE2",
"TESK2",
"ALPK1",
"IPCEF1",
"TXK",
"KIFAP3",
"CEACAM1",
"MAGI3",
"IL12RB2",
"DOCK9",
"LAG3",
"RGS1",
"ESR1",
"FKBP5",
"APOL1",
"RANGAP1",
"RPS6KA5",
"NFATC2",
"ISM1",
"LPIN2",
"NPDC1",
"BMPR1A",
"ACTA2",
"SIAE",
"DUSP16",
"FAM184A",
"CEP70",
"XRN1",
"SLC25A12",
"PRG4",
"MAN1C1",
"PKD2",
"KDSR",
"EPHX2",
"BCL2L14",
"HYCC1",
"BATF3",
"FAM110A",
"POLN",
"RFXAP",
"GIMAP4",
"PHF11",
"IRF4",
"STAMBPL1",
"FAM13A",
"HAPLN3",
"GAS8",
"TAF4B",
"GALM",
"IGFBP3",
"RAB19",
"MSRB2",
"ABTB3",
"PRKCA",
"LY96",
"CHODL",
"GOLGA7B",
"SUSD3",
"SLC37A3",
"UBASH3A",
"BDH1",
"WDR64",
"EIF4E3",
"TRAT1",
"ICOS",
"F2RL1",
"ERAP2",
"ENOX2",
"GLB1L3",
"SCG5",
"PRRX2",
"TSPAN5",
"TAPT1",
"CMTM8",
"DLGAP1",
"TMEM126B",
"MLLT3",
"PTPRM",
"GPR171",
"CLVS1",
"GRAMD1C",
"GIMAP7",
"TSHZ2",
"GPR19",
"GBP6",
"KCNQ3",
"STING1",
"GPRIN3",
"PLA2G2C",
"PCDH18",
"KIAA0408",
"GIMAP5",
"AFAP1",
"PGAP1",
"DPP4",
"CD247",
"ARHGAP11A",
"PCMTD2",
"ATP10A",
"SOGA3",
"HAUS3",
]
custom_gene_sets = pd.DataFrame(
{
"gene": genes,
"query_program": ["some_gene_program"] * len(genes),
"celltype": ["CD8_T_cell"] * len(genes),
}
)
custom_gene_sets.head()
| gene | query_program | celltype | |
|---|---|---|---|
| 0 | MEOX1 | some_gene_program | CD8_T_cell |
| 1 | FAS | some_gene_program | CD8_T_cell |
| 2 | SYNE2 | some_gene_program | CD8_T_cell |
| 3 | TESK2 | some_gene_program | CD8_T_cell |
| 4 | ALPK1 | some_gene_program | CD8_T_cell |
enrichment_results = hc.run_one_enrichment_test(
adata=adata,
df=custom_gene_sets,
contrast_combo=("True", "False"),
celltype_combo=("CD8a", "CD8_T_cell"),
contrast_column=your_contrast_colname,
celltype_column=your_celltype_colname,
direction="upregulated",
threshold_lfc=1.0,
threshold_expression=0.05,
threshold_pval=0.01,
)
enrichment_results.sort_values("NES", ascending=False)
| Name | query_program | ES | NES | NOM p-val | FDR q-val | FWER p-val | Tag % | Gene % | Lead_genes | ... | threshold_expression | min_size | max_size | permutation_num | weight | seed | threads | num_genes_signature | num_shared_genes_signature | frac_shared_genes_signature | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | prerank | some_gene_program | 0.502597 | 1.412391 | 0.043728 | 0.043728 | 0.038 | 4/30 | 8.21% | GIMAP7;GIMAP4;CD247;GIMAP5 | ... | 0.05 | 10 | 1000 | 1000 | 1.0 | 2025 | 6 | 103.0 | 30.0 | 0.291262 |
1 rows × 30 columns
end = time.time()
print(f"Time: {int((end - start) / 60)} min {int((end - start) % 60)} sec")
Time: 0 min 4 sec
## --END-- ##