Cytokine-induced Immune Programs and custom gene progam enrichment analysis#

From the Human Cytokine Dictionary, we derived 82 Cytokine-induced Immune Programs (CIP). This tutorial guides you through the enrichment analysis of these gene sets. Your query data should be a transcriptomic data object in AnnData’s .h5ad format with gene symbols in .var axis and metadata describing immune celltypes and experimental conditions of samples.

This tutorial runs you through the core analysis of this package, which computes enrichment scores for CIPs in your chosen condition. The user can also define their own gene programs.

# Import core packages and the hucira package.

import os
import time
import warnings

import pandas as pd
from tqdm import TqdmWarning

import hucira as hc

warnings.simplefilter("ignore", TqdmWarning)

start = time.time()

current_wd = os.getcwd()
print(current_wd)

/ictstr01/home/icb/jenni.liu/all_projects/cytokine_dict_project/huCIRA/docs/notebooks

1. Load input data#

The two main input files for this analysis are:

the Cytokine-induced Immune Programs (CIP)
your transcriptome data object

Explore their metadata annotation for cell types and disease condition. The files are saved to or loaded from the current working directory. The path is explicitly specified here, although it’s also the default saving location if left unspecified.

#### Load the file containing our cytokine-induced immune programs derived from DRVI
# https://www.biorxiv.org/content/10.1101/2024.11.06.622266v3

cip_sig = hc.load_CIP_signatures(save_dir=current_wd)
cip_sig.head()

	gene	CIP	effect_size	celltype
0	MEOX1	MD-05 CellInteraction	0.120686	Treg
1	MEOX1	MD-05 CellInteraction	0.120686	CD4_T_cell
2	MEOX1	MD-05 CellInteraction	0.120686	CD8_T_cell
3	MEOX1	MD-05 CellInteraction	0.120686	cDC
4	MEOX1	MD-05 CellInteraction	0.120686	MAIT

#### Load the query adata object

adata = hc.load_multiple_sclerosis_data()
adata

AnnData object with n_obs × n_vars = 65326 × 10266
    obs: 'labels', 'MS', 'CSF', 'valid_clusters', 'CD4_labels'
    obsm: 'X_umap'

#### Optional for easy workflow:
# Enter celltype column name and condition column name of your query adata

your_celltype_colname = "labels"
your_contrast_colname = "MS"

2. Process and prepare input data parameters#

Follow these steps:

Check if features are in HGNC format. Must match gene symbol nomenclature of dictionary.
Because there is no standard nomenclature for cell types, we have to manually create celltype_combos, a mapping that links the cell types in the query dataset to the corresponding cell types in the human cytokine dictionary.
Choose the experimental condition of interest for your query data (contrast)

Steps 2 and 3, just like defining cell type and contrast columns earlier, are optional. However, it can be helpful to get an overview of your data and run the pipeline more smoothly.

adata.var.head()


FO538757.2
AP006222.2
RP11-206L10.9
FAM41C
NOC2L

The Enrichment analysis needs two main pieces of information from the query adata: cell types and experimental (disease) conditions.

Annotations have to be examined manually, because naming conventions between objects differ. Compare the cell type annotation of the dictionary with the naming convention of your query adata and re-annotate your query adata’s annotation if necessary.

print(f"All celltypes in signatures:\n  {cip_sig.celltype.unique()}\n")
print(f"All celltypes in query data:\n  {sorted(adata.obs[your_celltype_colname].unique())}\n")
print(
    f"All experimental states (contrasts/conditions) in query data:\n  {sorted(adata.obs[your_contrast_colname].unique())}\n"
)

All celltypes in signatures:
  ['Treg' 'CD4_T_cell' 'CD8_T_cell' 'cDC' 'MAIT' 'CD14_Mono' 'CD16_Mono'
 'B_cell' 'NK_CD56hi' 'NK_CD56low' 'NKT' 'HSPC']

All celltypes in query data:
  ['B cell doublets', 'B1', 'B2', 'CD4', 'CD8a', 'CD8n', 'Gran', 'MegaK', 'Mono', 'Mono Doublet', 'NK1', 'NK2', 'RBC', 'Tdg', 'Tregs', 'contamination1', 'doublet', 'mDC1', 'mDC2', 'ncMono', 'pDC', 'plasma']

All experimental states (contrasts/conditions) in query data:
  ['False', 'True']

To create celltype_combos, enter ordered lists that match the same cell types across different naming conventions in your data vs the human cytokine dictionary. “create_celltype_combos()” can help you to create the right format, but you can also format manually.

# Define celltype_combos, an input parameter for enrichment testing
adata_celltypes = ["B1", "CD8a", "Mono"]
hcd_celltypes = ["B_cell", "CD8_T_cell", "CD14_Mono"]

celltype_combos = hc.create_celltype_combos(adata_celltypes, hcd_celltypes)
print(celltype_combos)

# Define condition of interest
contrast = ("True", "False")
print(contrast)

[('B1', 'B_cell'), ('CD8a', 'CD8_T_cell'), ('Mono', 'CD14_Mono')]
('True', 'False')

3.1. Run enrichment analysis#

The main analysis is done by run_one_enrichment_test(), which computes enrichment scores of one query celltype and different conditions. This simple example returns an example of the main outcome of this cytokine enrichment score analysis.

run_one_enrichment_test() returns enrichment results for one celltype and contrast.
check_robustness() returns the robust and significant results from that initial enrichment analysis using a single threshold to filter FDR q-val (statistical significance of enrichment result) and making sure that a gene program is considered robust only if enough tests are evaluable and a sufficient fraction of them are significant. More about this in the advanced tutorial.

enrichment_results = hc.run_one_enrichment_test(
    adata=adata,
    df=cip_sig,
    contrast_combo=contrast,
    celltype_combo=celltype_combos[0],
    contrast_column=your_contrast_colname,
    celltype_column=your_celltype_colname,
    direction="upregulated",
    threshold_lfc=1.0,
    threshold_expression=0.05,
)

enrichment_results[
    [
        "celltype_combo",
        "CIP",
        "contrast",
        "direction",
        "ES",
        "NES",
        "NOM p-val",
        "FDR q-val",
        "FWER p-val",
        "frac_shared_genes_signature",
        "Lead_genes",
    ]
].sort_values("NES", ascending=False)

	celltype_combo	CIP	contrast	direction	ES	NES	NOM p-val	FDR q-val	FWER p-val	frac_shared_genes_signature	Lead_genes
0	B1 (B_cell)	PD-05 BcellAct	True_vs_False	upregulated	0.554781	1.95207	0.0	0.0	0.0	0.508380	MS4A1;CORO1A;PTPN6;TLR10;SYPL1;LYL1;CXCR5;ANXA...
1	B1 (B_cell)	PD-08 BcellProlif	True_vs_False	upregulated	0.517229	1.912568	0.0	0.0	0.0	0.268102	CD79B;MS4A1;FCMR;CD180;CORO1A;ID3;PTPN6;HVCN1;...
2	B1 (B_cell)	PD-11 BcellSignal	True_vs_False	upregulated	0.546199	1.798589	0.0	0.0	0.0	0.358025	CD79B;MS4A1;FCMR;IGHM;ID3;HVCN1;PARP1;TCL1A;BT...
3	B1 (B_cell)	FC-05 IgE-Humoral	True_vs_False	upregulated	0.424371	1.583758	0.001056	0.001586	0.006	0.386700	CD79B;MS4A1;IL4R;IGHM;HVCN1;TLR10;SYPL1;SELL;B...
4	B1 (B_cell)	PD-03 BcellDiff-1	True_vs_False	upregulated	0.372912	1.499887	0.0	0.005288	0.022	0.301955	CCR7;STX7;MS4A1;FCMR;IFI16;IGHM;SNX2;HVCN1;CLE...
5	B1 (B_cell)	PD-04 BcellReg	True_vs_False	upregulated	0.369261	1.498489	0.0	0.004583	0.023	0.262981	CD79B;PLEKHF2;STX7;STK17A;MS4A1;FCMR;IFI16;IL4...
6	B1 (B_cell)	PD-10 Transcriptional	True_vs_False	upregulated	0.330519	1.28669	0.024896	0.046533	0.24	0.340220	HLA-DQA1;MS4A1;FCMR;IL4R;ID3;PTPN6;HVCN1;CLEC2...

# Filter enrichment results for robust and significant outputs.

robust_results = hc.check_robustness(
    all_results=enrichment_results, threshold_qval=0.1, threshold_valid=0.1, threshold_below_alpha=0.9
)
robust_results

	celltype_combo	contrast	CIP	frac_valid	frac_significant	is_robust	NES_min	NES_max	FDR q-val	qval_threshold	threshold_frac_below_alpha
0	B1 (B_cell)	True_vs_False	PD-05 BcellAct	1.0	1.0	True	1.952070	1.952070	0.000000	0.1	0.9
1	B1 (B_cell)	True_vs_False	PD-08 BcellProlif	1.0	1.0	True	1.912568	1.912568	0.000000	0.1	0.9
2	B1 (B_cell)	True_vs_False	PD-11 BcellSignal	1.0	1.0	True	1.798589	1.798589	0.000000	0.1	0.9
3	B1 (B_cell)	True_vs_False	FC-05 IgE-Humoral	1.0	1.0	True	1.583758	1.583758	0.001586	0.1	0.9
4	B1 (B_cell)	True_vs_False	PD-03 BcellDiff-1	1.0	1.0	True	1.499887	1.499887	0.005288	0.1	0.9
5	B1 (B_cell)	True_vs_False	PD-04 BcellReg	1.0	1.0	True	1.498489	1.498489	0.004583	0.1	0.9
6	B1 (B_cell)	True_vs_False	PD-10 Transcriptional	1.0	1.0	True	1.286690	1.286690	0.046533	0.1	0.9

3.2 Run enrichment analysis of custom gene programs#

The main analysis is performed by run_one_enrichment_test(), which computes enrichment scores for a single query cell type across different conditions. Instead of using the cytokine-induced gene programs identified in the paper, you can also provide your own gene programs (gene sets) of interest.

These columns are required in the input table: gene, query_program, and celltype.

#### Creating toy data (derived from CIP) to show functionality of custom gene sets.
# These columns are necessary: ["gene", "query_program", "celltype"]

genes = [
    "MEOX1",
    "FAS",
    "SYNE2",
    "TESK2",
    "ALPK1",
    "IPCEF1",
    "TXK",
    "KIFAP3",
    "CEACAM1",
    "MAGI3",
    "IL12RB2",
    "DOCK9",
    "LAG3",
    "RGS1",
    "ESR1",
    "FKBP5",
    "APOL1",
    "RANGAP1",
    "RPS6KA5",
    "NFATC2",
    "ISM1",
    "LPIN2",
    "NPDC1",
    "BMPR1A",
    "ACTA2",
    "SIAE",
    "DUSP16",
    "FAM184A",
    "CEP70",
    "XRN1",
    "SLC25A12",
    "PRG4",
    "MAN1C1",
    "PKD2",
    "KDSR",
    "EPHX2",
    "BCL2L14",
    "HYCC1",
    "BATF3",
    "FAM110A",
    "POLN",
    "RFXAP",
    "GIMAP4",
    "PHF11",
    "IRF4",
    "STAMBPL1",
    "FAM13A",
    "HAPLN3",
    "GAS8",
    "TAF4B",
    "GALM",
    "IGFBP3",
    "RAB19",
    "MSRB2",
    "ABTB3",
    "PRKCA",
    "LY96",
    "CHODL",
    "GOLGA7B",
    "SUSD3",
    "SLC37A3",
    "UBASH3A",
    "BDH1",
    "WDR64",
    "EIF4E3",
    "TRAT1",
    "ICOS",
    "F2RL1",
    "ERAP2",
    "ENOX2",
    "GLB1L3",
    "SCG5",
    "PRRX2",
    "TSPAN5",
    "TAPT1",
    "CMTM8",
    "DLGAP1",
    "TMEM126B",
    "MLLT3",
    "PTPRM",
    "GPR171",
    "CLVS1",
    "GRAMD1C",
    "GIMAP7",
    "TSHZ2",
    "GPR19",
    "GBP6",
    "KCNQ3",
    "STING1",
    "GPRIN3",
    "PLA2G2C",
    "PCDH18",
    "KIAA0408",
    "GIMAP5",
    "AFAP1",
    "PGAP1",
    "DPP4",
    "CD247",
    "ARHGAP11A",
    "PCMTD2",
    "ATP10A",
    "SOGA3",
    "HAUS3",
]

custom_gene_sets = pd.DataFrame(
    {
        "gene": genes,
        "query_program": ["some_gene_program"] * len(genes),
        "celltype": ["CD8_T_cell"] * len(genes),
    }
)

custom_gene_sets.head()

	gene	query_program	celltype
0	MEOX1	some_gene_program	CD8_T_cell
1	FAS	some_gene_program	CD8_T_cell
2	SYNE2	some_gene_program	CD8_T_cell
3	TESK2	some_gene_program	CD8_T_cell
4	ALPK1	some_gene_program	CD8_T_cell

enrichment_results = hc.run_one_enrichment_test(
    adata=adata,
    df=custom_gene_sets,
    contrast_combo=("True", "False"),
    celltype_combo=("CD8a", "CD8_T_cell"),
    contrast_column=your_contrast_colname,
    celltype_column=your_celltype_colname,
    direction="upregulated",
    threshold_lfc=1.0,
    threshold_expression=0.05,
    threshold_pval=0.01,
)

enrichment_results.sort_values("NES", ascending=False)

	Name	query_program	ES	NES	NOM p-val	FDR q-val	FWER p-val	Tag %	Gene %	Lead_genes	...	threshold_expression	min_size	max_size	permutation_num	weight	seed	threads	num_genes_signature	num_shared_genes_signature	frac_shared_genes_signature
0	prerank	some_gene_program	0.502597	1.412391	0.043728	0.043728	0.038	4/30	8.21%	GIMAP7;GIMAP4;CD247;GIMAP5	...	0.05	10	1000	1000	1.0	2025	6	103.0	30.0	0.291262

1 rows × 30 columns

end = time.time()
print(f"Time: {int((end - start) / 60)} min {int((end - start) % 60)} sec")

Time: 0 min 4 sec

## --END-- ##

Cytokine-induced Immune Programs and custom gene progam enrichment analysis

Contents

Cytokine-induced Immune Programs and custom gene progam enrichment analysis#

1. Load input data#

2. Process and prepare input data parameters#

3.1. Run enrichment analysis#

3.2 Run enrichment analysis of custom gene programs#