Cytokine enrichment analysis using Human Cytokine Dictionary#

The Human Cytokine Dictionary (hcd) can be accessed through this module. Your query data should be a transcriptomic data object in AnnData’s .h5ad format with gene symbols in .var axis and metadata describing immune celltypes and experimental conditions of samples. Using the Human Cytokine Dictionary as reference, you can “look up” cytokine responses of the disease states in your own dataset.

This tutorial runs you through the core analysis of this package, which computes enrichment scores for cytokines in your chosen condition. The enrichment analysis is based on internally computed pre-ranked genes, but the user has the option to define their own ranking statistics.

# Import core packages and the hucira package.

import os
import time
import warnings

import numpy as np
from tqdm import TqdmWarning

import hucira as hc

warnings.simplefilter("ignore", TqdmWarning)

start = time.time()

current_wd = os.getcwd()
print(current_wd)

/ictstr01/home/icb/jenni.liu/all_projects/cytokine_dict_project/huCIRA/docs/notebooks

1. Load input data#

The two main input files for this analysis are:

the Human Cytokine Dictionary
your transcriptome data object

Explore their metadata annotation for cell types and disease condition. The files are saved to or loaded from the current working directory. The path is explicitly specified here, although it’s also the default saving location if left unspecified.

#### Load the human cytokine dictionary (mini version)

df_hcd_all = hc.load_human_cytokine_dict(save_dir=current_wd, mini_version=True)
df_hcd_all.head()

	index	gene	log_fc	logCPM	F	p_value	adj_p_value	contrast	celltype	cytokine	...	median_num_cells_pbs	median_num_cells_cytokine	mean_num_cells_pbs	mean_num_cells_cytokine	min_num_cells_pbs	min_num_cells_cytokine	max_num_cells_pbs	max_num_cells_cytokine	num_DE_pbs_wells	well_biased
0	182	AACS	0.506575	4.404342	19.205548	3.586696e-04	5.992502e-03	NaN	Intermediate_B_cell	IL-32-beta	...	1435.5	162.0	1814.333333	201.666667	367.0	43.0	4609.0	414.0	4.0	False
1	415	AAMDC	-0.592411	4.404723	28.952278	5.144025e-05	7.192403e-04	NaN	Intermediate_B_cell	IFN-omega	...	1435.5	181.0	1814.333333	282.916667	367.0	113.0	4609.0	838.0	4.0	False
2	423	AAMDC	-0.574430	4.394900	29.933840	4.591638e-05	5.666800e-04	NaN	Intermediate_B_cell	IL-15	...	1435.5	305.5	1814.333333	382.916667	367.0	120.0	4609.0	1296.0	4.0	False
3	437	AAMDC	1.123624	5.038960	157.966713	7.296158e-11	1.083115e-07	NaN	Intermediate_B_cell	IL-23	...	1435.5	282.0	1814.333333	325.250000	367.0	95.0	4609.0	901.0	6.0	False
4	445	AAMDC	0.558964	4.716018	35.557344	2.372101e-06	9.600669e-05	NaN	Intermediate_B_cell	IL-34	...	1435.5	248.0	1814.333333	311.500000	367.0	83.0	4609.0	655.0	5.0	False

5 rows × 21 columns

#### Load the query adata object

adata = hc.load_multiple_sclerosis_data()
adata

AnnData object with n_obs × n_vars = 65326 × 10266
    obs: 'labels', 'MS', 'CSF', 'valid_clusters', 'CD4_labels'
    obsm: 'X_umap'

#### Optional for easy workflow:
# Enter celltype column name and condition column name of your query adata, which you can read from previous cell output

your_celltype_colname = "labels"
your_contrast_colname = "MS"

2. Process and prepare input data parameters#

Follow these steps:

Check if features are in HGNC format. Must match gene symbol nomenclature of dictionary.
Because there is no standard nomenclature for cell types, we have to manually create celltype_combos, a mapping that links the cell types in the query dataset to the corresponding cell types in the human cytokine dictionary.
Choose the experimental condition of interest for your query data (contrast)

Steps 2 and 3, just like defining cell type and contrast columns earlier, are optional. However, it can be helpful to get an overview of your data and run the pipeline more smoothly.

adata.var.head()


FO538757.2
AP006222.2
RP11-206L10.9
FAM41C
NOC2L

The Enrichment analysis needs two main pieces of information from the query adata: cell types and experimental (disease) conditions.

Annotations have to be examined manually, because naming conventions between objects differ. Compare the cell type annotation of the dictionary with the naming convention of your query adata and re-annotate your query adata’s annotation if necessary.

print(f"All celltypes in dictionary:\n  {df_hcd_all.celltype.unique()}\n")
print(f"All celltypes in query data:\n  {sorted(adata.obs[your_celltype_colname].unique())}\n")
print(
    f"All experimental states (contrasts/conditions) in query data:\n  {sorted(adata.obs[your_contrast_colname].unique())}\n"
)

All celltypes in dictionary:
  ['Intermediate_B_cell' 'NKT' 'CD8_Memory_T_cell' 'NK_CD56low' 'CD16_Mono'
 'NK_CD56hi' 'CD8_Naive_T_cell' 'pDC' 'ILC' 'MAIT' 'Naive_B_cell'
 'CD4_Naive_T_cell' 'Treg' 'Plasmablast' 'Granulocyte' 'B_cell'
 'CD4_T_cell' 'HSPC' 'CD8_T_cell' 'CD14_Mono' 'cDC' 'CD4_Memory_T_cell'
 'NK' 'Mono']

All celltypes in query data:
  ['B cell doublets', 'B1', 'B2', 'CD4', 'CD8a', 'CD8n', 'Gran', 'MegaK', 'Mono', 'Mono Doublet', 'NK1', 'NK2', 'RBC', 'Tdg', 'Tregs', 'contamination1', 'doublet', 'mDC1', 'mDC2', 'ncMono', 'pDC', 'plasma']

All experimental states (contrasts/conditions) in query data:
  ['False', 'True']

To create celltype_combos, enter ordered lists that match the same cell types across different naming conventions in your data vs the human cytokine dictionary. “create_celltype_combos()” can help you to create the right format, but you can also format manually.

# Define celltype_combos, an input parameter for enrichment testing
adata_celltypes = ["B1", "CD8a", "Mono"]
hcd_celltypes = ["B_cell", "CD8_T_cell", "CD14_Mono"]

celltype_combos = hc.create_celltype_combos(adata_celltypes, hcd_celltypes)
print(celltype_combos)

# Define condition of interest
contrast = ("True", "False")
print(contrast)

[('B1', 'B_cell'), ('CD8a', 'CD8_T_cell'), ('Mono', 'CD14_Mono')]
('True', 'False')

3.1. Run enrichment analysis#

The main analysis is done by run_one_enrichment_test(), which computes enrichment scores of one query celltype and different conditions. This simple example demonstrates the main outcome of the cytokine enrichment score analysis.

run_one_enrichment_test() returns enrichment results for one celltype and contrast.
check_robustness() returns the robust and significant results from that initial enrichment analysis using a single threshold to filter FDR q-val (statistical significance of enrichment result) and making sure that a cytokine enrichment is considered robust only if enough tests are evaluable and a sufficient fraction of them are significant. More about this in the advanced tutorial.

# Run enrichment analysis

enrichment_results = hc.run_one_enrichment_test(
    adata=adata,
    df=df_hcd_all,
    contrast_combo=contrast,
    celltype_combo=celltype_combos[1],
    contrast_column=your_contrast_colname,
    celltype_column=your_celltype_colname,
    direction="upregulated",
    threshold_lfc=1.0,
    threshold_expression=0.05,
    threshold_pval=0.01,
    ranking_statistic_fn=None,
)

# Look at all enrichment results (reduced view to columns of interest, not all statistics)
enrichment_results[
    [
        "celltype_combo",
        "cytokine",
        "contrast",
        "direction",
        "ES",
        "NES",
        "NOM p-val",
        "FDR q-val",
        "FWER p-val",
        "frac_shared_genes_signature",
        "Lead_genes",
        "threshold_lfc",
        "threshold_expression",
    ]
].sort_values("NES", ascending=False)

	celltype_combo	cytokine	contrast	direction	ES	NES	NOM p-val	FDR q-val	FWER p-val	frac_shared_genes_signature	Lead_genes	threshold_lfc	threshold_expression
0	CD8a (CD8_T_cell)	IL-15	True_vs_False	upregulated	0.599907	2.052186	0.0	0.0	0.0	0.484568	ACTB;HSPA8;PRF1;IL32;PFN1;ACTG1;MYL12A;GZMA;TA...	1.0	0.05
1	CD8a (CD8_T_cell)	IL-2	True_vs_False	upregulated	0.62742	2.024465	0.0	0.0	0.0	0.502994	HSPA8;PRF1;IL32;MYL12A;GZMA;TAP1;ENO1;GZMB;PSM...	1.0	0.05
2	CD8a (CD8_T_cell)	IL-7	True_vs_False	upregulated	0.475401	1.518688	0.005325	0.025716	0.06	0.334802	HSPA8;GZMB;PSME2;GZMH;CTSC;ISG15;GBP1;CALR;HSP...	1.0	0.05
3	CD8a (CD8_T_cell)	IFN-omega	True_vs_False	upregulated	0.42752	1.374484	0.03178	0.091201	0.233	0.417989	HSPA8;PRF1;TAP1;PSME2;IFITM2;PGAM1;ISG15;GBP1;...	1.0	0.05
4	CD8a (CD8_T_cell)	IL-1-beta	True_vs_False	upregulated	0.515314	1.353885	0.073027	0.091697	0.274	0.500000	TAP1;MT-CO1;ISG15;GBP1;IFITM1;UBE2L6;EPSTI1;PI...	1.0	0.05
5	CD8a (CD8_T_cell)	IL-4	True_vs_False	upregulated	0.483863	1.267079	0.152745	0.174504	0.468	0.259259	PTGER2;IMMT;PTGDR;CISH;VCL;XBP1;LIMA1;TMEM71	1.0	0.05
6	CD8a (CD8_T_cell)	IFN-beta	True_vs_False	upregulated	0.36094	1.187676	0.185224	0.258371	0.645	0.391489	TAP1;PSMB8;STOM;IFI16;IFITM2;ALOX5AP;ISG15;GBP...	1.0	0.05
7	CD8a (CD8_T_cell)	IFN-gamma	True_vs_False	upregulated	0.422746	1.179638	0.236721	0.238198	0.664	0.452055	TAP1;ISG15;GBP1;OASL;IFITM1;EPSTI1;MX1;GBP5;LA...	1.0	0.05
8	CD8a (CD8_T_cell)	IFN-alpha1	True_vs_False	upregulated	0.450718	1.114153	0.320047	0.308474	0.779	0.457143	ISG15;EPSTI1;MX1;PARP9;RTP4;OAS2;IFI6;OAS1;SAMD9L	1.0	0.05

# Filter enrichment results for robust and significant outputs.

robust_results = hc.check_robustness(
    all_results=enrichment_results, threshold_qval=0.1, threshold_valid=0.1, threshold_below_alpha=0.9
)
robust_results

	celltype_combo	contrast	cytokine	frac_valid	frac_significant	is_robust	NES_min	NES_max	FDR q-val	qval_threshold	threshold_frac_below_alpha
0	CD8a (CD8_T_cell)	True_vs_False	IL-15	1.0	1.0	True	2.052186	2.052186	0.000000	0.1	0.9
1	CD8a (CD8_T_cell)	True_vs_False	IL-2	1.0	1.0	True	2.024465	2.024465	0.000000	0.1	0.9
2	CD8a (CD8_T_cell)	True_vs_False	IL-7	1.0	1.0	True	1.518688	1.518688	0.025716	0.1	0.9
3	CD8a (CD8_T_cell)	True_vs_False	IFN-omega	1.0	1.0	True	1.374484	1.374484	0.091201	0.1	0.9
4	CD8a (CD8_T_cell)	True_vs_False	IL-1-beta	1.0	1.0	True	1.353885	1.353885	0.091697	0.1	0.9

3.2. Run enrichment analysis using custom ranking function#

The main analysis is performed by run_one_enrichment_test(), which computes enrichment scores using an internal ranking statistic. It uses the signal-to-noise ratio to generate a pre-ranked list of genes for the enrichment analysis.

Alternatively, you can provide your own ranking statistic. In this example, the ranking is based on a simple log fold change calculation. As a result, the enrichment results change. This shows that the way genes are ranked can strongly influence the enrichment analysis.

from hucira.tl.enrichment_test import _compute_log_fold_change

enrichment_results = hc.run_one_enrichment_test(
    adata=adata,
    df=df_hcd_all,
    contrast_combo=contrast,
    celltype_combo=celltype_combos[1],
    contrast_column=your_contrast_colname,
    celltype_column=your_celltype_colname,
    direction="upregulated",
    threshold_lfc=1.0,
    threshold_expression=0.05,
    threshold_pval=0.01,
    ranking_statistic_fn=_compute_log_fold_change,
)
enrichment_results.sort_values("NES", ascending=False)

	Name	cytokine	ES	NES	NOM p-val	FDR q-val	FWER p-val	Tag %	Gene %	Lead_genes	...	threshold_expression	min_size	max_size	permutation_num	weight	seed	threads	num_genes_signature	num_shared_genes_signature	frac_shared_genes_signature
0	prerank	IFN-gamma	0.565919	1.649477	0.00206	0.010223	0.01	16/33	15.43%	EPSTI1;OASL;TAP1;GBP1;RTP4;LAP3;PARP9;MX1;FBXO...	...	0.05	10	1000	1000	1.0	2025	6	73.0	33.0	0.452055
1	prerank	IFN-alpha1	0.638917	1.648862	0.002119	0.005111	0.01	7/16	11.38%	EPSTI1;RTP4;PARP9;MX1;OAS2;ISG15;OAS1	...	0.05	10	1000	1000	1.0	2025	6	35.0	16.0	0.457143
2	prerank	IL-1-beta	0.504161	1.397566	0.036881	0.064062	0.155	11/24	22.27%	EPSTI1;TAP1;GBP1;CTSA;FBXO6;ISG15;GBP5;TNFSF10...	...	0.05	10	1000	1000	1.0	2025	6	48.0	24.0	0.500000
3	prerank	IL-15	0.416504	1.374959	0.001	0.061591	0.194	59/157	20.34%	CX3CR1;PCNA;EPSTI1;OASL;TAP1;PRF1;GBP1;MCM5;IF...	...	0.05	10	1000	1000	1.0	2025	6	324.0	157.0	0.484568
4	prerank	IL-4	0.478325	1.30142	0.10865	0.101613	0.327	7/21	18.21%	IMMT;LIMA1;CISH;VCL;TMEM71;PTGDR;PTGER2	...	0.05	10	1000	1000	1.0	2025	6	81.0	21.0	0.259259
5	prerank	IFN-omega	0.384821	1.224033	0.09519	0.171059	0.507	37/79	31.13%	EPSTI1;OASL;TAP1;PRF1;GBP1;IFITM3;RTP4;LAP3;PA...	...	0.05	10	1000	1000	1.0	2025	6	189.0	79.0	0.417989
6	prerank	IL-2	0.380176	1.204854	0.096192	0.171302	0.564	33/84	23.19%	EPSTI1;TAP1;PRF1;GBP1;IFITM3;GZMB;LAP3;PARP9;M...	...	0.05	10	1000	1000	1.0	2025	6	167.0	84.0	0.502994
7	prerank	IFN-beta	0.365799	1.172695	0.142142	0.194613	0.654	31/92	19.50%	EPSTI1;OASL;TAP1;GBP1;PRKD2;IFITM3;RTP4;STOM;L...	...	0.05	10	1000	1000	1.0	2025	6	235.0	92.0	0.391489
8	prerank	IL-7	0.348978	1.099694	0.271815	0.295434	0.825	27/76	24.02%	EPSTI1;STIP1;GBP1;IFITM3;C1GALT1C1;GZMB;LAP3;M...	...	0.05	10	1000	1000	1.0	2025	6	227.0	76.0	0.334802

9 rows × 30 columns

The custom ranking function must follow a specific format:
It should accept four input parameters: adata, contrast_column, condition_1, and condition_2. The function must return a pandas DataFrame containing ranked genes, where gene names are used as the index, the ranking statistic forms the values, and the column name specifies the contrast (e.g., condition_1_vs_condition_2). This example shows a toy ranking function as an illustration.

from anndata import AnnData
import pandas as pd


def toy_ranking_statistic(
    adata: AnnData,
    contrast_column: str,
    condition_1: str,
    condition_2: str,
) -> pd.DataFrame:
    """Minimal toy ranking function."""
    # Random ranking statistic for each gene
    ranking_stat = np.random.randn(adata.n_vars)

    return pd.DataFrame(
        ranking_stat,
        index=adata.var_names,
        columns=[f"{condition_1}_vs_{condition_2}"],
    )

enrichment_results = hc.run_one_enrichment_test(
    adata=adata,
    df=df_hcd_all,
    contrast_combo=contrast,
    celltype_combo=celltype_combos[1],
    contrast_column=your_contrast_colname,
    celltype_column=your_celltype_colname,
    direction="upregulated",
    threshold_lfc=1.0,
    threshold_expression=0.05,
    threshold_pval=0.01,
    ranking_statistic_fn=toy_ranking_statistic,
)
enrichment_results.sort_values("NES", ascending=False)

	Name	cytokine	ES	NES	NOM p-val	FDR q-val	FWER p-val	Tag %	Gene %	Lead_genes	...	threshold_expression	min_size	max_size	permutation_num	weight	seed	threads	num_genes_signature	num_shared_genes_signature	frac_shared_genes_signature
3	prerank	IL-2	0.203674	0.890107	0.669159	0.661717	0.834	29/84	28.87%	IFITM3;GBP1;TNFSF14;PSMB8;TPM4;LTB;FBXO6;IFITM...	...	0.05	10	1000	1000	1.0	2025	6	167.0	84.0	0.502994
8	prerank	IFN-alpha1	-0.233137	-0.686907	0.876768	0.944809	0.980749	3/16	16.27%	TYMP;HELZ2;SAMD9L	...	0.05	10	1000	1000	1.0	2025	6	35.0	16.0	0.457143
7	prerank	IFN-gamma	-0.209036	-0.748128	0.883768	1.0	0.970053	7/33	16.27%	TYMP;HELZ2;OASL;RGS1;PARP14;MT2A;SAMD9L	...	0.05	10	1000	1000	1.0	2025	6	73.0	33.0	0.452055
6	prerank	IFN-omega	-0.190154	-0.835762	0.799235	1.0	0.936898	15/79	16.27%	SP140;PPM1K;TAP2;UBA7;UBE2L6;TRANK1;TYMP;GTPBP...	...	0.05	10	1000	1000	1.0	2025	6	189.0	79.0	0.417989
5	prerank	IL-7	-0.196657	-0.838665	0.791411	1.0	0.935829	14/76	16.27%	NFKBIZ;C1GALT1C1;GZMH;PPA1;CTSC;TYMP;PIM1;GCH1...	...	0.05	10	1000	1000	1.0	2025	6	227.0	76.0	0.334802
4	prerank	IFN-beta	-0.188577	-0.845274	0.798793	1.0	0.931551	18/92	16.27%	SP140;PPM1K;TAP2;UBE2L6;CASP1;TRANK1;TYMP;GTPB...	...	0.05	10	1000	1000	1.0	2025	6	235.0	92.0	0.391489
2	prerank	IL-1-beta	-0.287848	-0.965184	0.484663	1.0	0.804278	10/24	26.21%	MT-CO2;UBE2L6;TYMP;PIM1;HELZ2;MT2A;MT-CO1;APOL...	...	0.05	10	1000	1000	1.0	2025	6	48.0	24.0	0.500000
1	prerank	IL-15	-0.243178	-1.184948	0.119835	0.750057	0.449198	34/157	16.87%	VAMP5;NFKBIZ;C1GALT1C1;SQLE;GZMH;ILK;TUBA1C;CT...	...	0.05	10	1000	1000	1.0	2025	6	324.0	157.0	0.484568
0	prerank	IL-4	-0.422944	-1.366421	0.097917	0.51572	0.20107	9/21	22.71%	JAML;GAB3;VCL;TMEM71;LIMA1;MAPKAPK3;IMMT;CISH;...	...	0.05	10	1000	1000	1.0	2025	6	81.0	21.0	0.259259