Cytokine enrichment analysis using Human Cytokine Dictionary#
The Human Cytokine Dictionary (hcd) can be accessed through this module. Your query data should be a transcriptomic data object in AnnData’s .h5ad format with gene symbols in .var axis and metadata describing immune celltypes and experimental conditions of samples. Using the Human Cytokine Dictionary as reference, you can “look up” cytokine responses of the disease states in your own dataset.
This tutorial runs you through the core analysis of this package, which computes enrichment scores for cytokines in your chosen condition. The enrichment analysis is based on internally computed pre-ranked genes, but the user has the option to define their own ranking statistics.
# Import core packages and the hucira package.
import os
import time
import warnings
import numpy as np
from tqdm import TqdmWarning
import hucira as hc
warnings.simplefilter("ignore", TqdmWarning)
start = time.time()
current_wd = os.getcwd()
print(current_wd)
/ictstr01/home/icb/jenni.liu/all_projects/cytokine_dict_project/huCIRA/docs/notebooks
1. Load input data#
The two main input files for this analysis are:
the Human Cytokine Dictionary
your transcriptome data object
Explore their metadata annotation for cell types and disease condition. The files are saved to or loaded from the current working directory. The path is explicitly specified here, although it’s also the default saving location if left unspecified.
#### Load the human cytokine dictionary (mini version)
df_hcd_all = hc.load_human_cytokine_dict(save_dir=current_wd, mini_version=True)
df_hcd_all.head()
| index | gene | log_fc | logCPM | F | p_value | adj_p_value | contrast | celltype | cytokine | ... | median_num_cells_pbs | median_num_cells_cytokine | mean_num_cells_pbs | mean_num_cells_cytokine | min_num_cells_pbs | min_num_cells_cytokine | max_num_cells_pbs | max_num_cells_cytokine | num_DE_pbs_wells | well_biased | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 182 | AACS | 0.506575 | 4.404342 | 19.205548 | 3.586696e-04 | 5.992502e-03 | NaN | Intermediate_B_cell | IL-32-beta | ... | 1435.5 | 162.0 | 1814.333333 | 201.666667 | 367.0 | 43.0 | 4609.0 | 414.0 | 4.0 | False |
| 1 | 415 | AAMDC | -0.592411 | 4.404723 | 28.952278 | 5.144025e-05 | 7.192403e-04 | NaN | Intermediate_B_cell | IFN-omega | ... | 1435.5 | 181.0 | 1814.333333 | 282.916667 | 367.0 | 113.0 | 4609.0 | 838.0 | 4.0 | False |
| 2 | 423 | AAMDC | -0.574430 | 4.394900 | 29.933840 | 4.591638e-05 | 5.666800e-04 | NaN | Intermediate_B_cell | IL-15 | ... | 1435.5 | 305.5 | 1814.333333 | 382.916667 | 367.0 | 120.0 | 4609.0 | 1296.0 | 4.0 | False |
| 3 | 437 | AAMDC | 1.123624 | 5.038960 | 157.966713 | 7.296158e-11 | 1.083115e-07 | NaN | Intermediate_B_cell | IL-23 | ... | 1435.5 | 282.0 | 1814.333333 | 325.250000 | 367.0 | 95.0 | 4609.0 | 901.0 | 6.0 | False |
| 4 | 445 | AAMDC | 0.558964 | 4.716018 | 35.557344 | 2.372101e-06 | 9.600669e-05 | NaN | Intermediate_B_cell | IL-34 | ... | 1435.5 | 248.0 | 1814.333333 | 311.500000 | 367.0 | 83.0 | 4609.0 | 655.0 | 5.0 | False |
5 rows × 21 columns
#### Load the query adata object
adata = hc.load_multiple_sclerosis_data()
adata
AnnData object with n_obs × n_vars = 65326 × 10266
obs: 'labels', 'MS', 'CSF', 'valid_clusters', 'CD4_labels'
obsm: 'X_umap'
#### Optional for easy workflow:
# Enter celltype column name and condition column name of your query adata, which you can read from previous cell output
your_celltype_colname = "labels"
your_contrast_colname = "MS"
2. Process and prepare input data parameters#
Follow these steps:
Check if features are in HGNC format. Must match gene symbol nomenclature of dictionary.
Because there is no standard nomenclature for cell types, we have to manually create
celltype_combos, a mapping that links the cell types in the query dataset to the corresponding cell types in the human cytokine dictionary.Choose the experimental condition of interest for your query data (
contrast)
Steps 2 and 3, just like defining cell type and contrast columns earlier, are optional. However, it can be helpful to get an overview of your data and run the pipeline more smoothly.
adata.var.head()
| FO538757.2 |
|---|
| AP006222.2 |
| RP11-206L10.9 |
| FAM41C |
| NOC2L |
The Enrichment analysis needs two main pieces of information from the query adata: cell types and experimental (disease) conditions.
Annotations have to be examined manually, because naming conventions between objects differ. Compare the cell type annotation of the dictionary with the naming convention of your query adata and re-annotate your query adata’s annotation if necessary.
print(f"All celltypes in dictionary:\n {df_hcd_all.celltype.unique()}\n")
print(f"All celltypes in query data:\n {sorted(adata.obs[your_celltype_colname].unique())}\n")
print(
f"All experimental states (contrasts/conditions) in query data:\n {sorted(adata.obs[your_contrast_colname].unique())}\n"
)
All celltypes in dictionary:
['Intermediate_B_cell' 'NKT' 'CD8_Memory_T_cell' 'NK_CD56low' 'CD16_Mono'
'NK_CD56hi' 'CD8_Naive_T_cell' 'pDC' 'ILC' 'MAIT' 'Naive_B_cell'
'CD4_Naive_T_cell' 'Treg' 'Plasmablast' 'Granulocyte' 'B_cell'
'CD4_T_cell' 'HSPC' 'CD8_T_cell' 'CD14_Mono' 'cDC' 'CD4_Memory_T_cell'
'NK' 'Mono']
All celltypes in query data:
['B cell doublets', 'B1', 'B2', 'CD4', 'CD8a', 'CD8n', 'Gran', 'MegaK', 'Mono', 'Mono Doublet', 'NK1', 'NK2', 'RBC', 'Tdg', 'Tregs', 'contamination1', 'doublet', 'mDC1', 'mDC2', 'ncMono', 'pDC', 'plasma']
All experimental states (contrasts/conditions) in query data:
['False', 'True']
To create celltype_combos, enter ordered lists that match the same cell types across different naming conventions in your data vs the human cytokine dictionary. “create_celltype_combos()” can help you to create the right format, but you can also format manually.
# Define celltype_combos, an input parameter for enrichment testing
adata_celltypes = ["B1", "CD8a", "Mono"]
hcd_celltypes = ["B_cell", "CD8_T_cell", "CD14_Mono"]
celltype_combos = hc.create_celltype_combos(adata_celltypes, hcd_celltypes)
print(celltype_combos)
# Define condition of interest
contrast = ("True", "False")
print(contrast)
[('B1', 'B_cell'), ('CD8a', 'CD8_T_cell'), ('Mono', 'CD14_Mono')]
('True', 'False')
3.1. Run enrichment analysis#
The main analysis is done by run_one_enrichment_test(), which computes enrichment scores of one query celltype and different conditions.
This simple example demonstrates the main outcome of the cytokine enrichment score analysis.
run_one_enrichment_test()returns enrichment results for one celltype and contrast.check_robustness()returns the robust and significant results from that initial enrichment analysis using a single threshold to filter FDR q-val (statistical significance of enrichment result) and making sure that a cytokine enrichment is considered robust only if enough tests are evaluable and a sufficient fraction of them are significant. More about this in the advanced tutorial.
# Run enrichment analysis
enrichment_results = hc.run_one_enrichment_test(
adata=adata,
df=df_hcd_all,
contrast_combo=contrast,
celltype_combo=celltype_combos[1],
contrast_column=your_contrast_colname,
celltype_column=your_celltype_colname,
direction="upregulated",
threshold_lfc=1.0,
threshold_expression=0.05,
threshold_pval=0.01,
ranking_statistic_fn=None,
)
# Look at all enrichment results (reduced view to columns of interest, not all statistics)
enrichment_results[
[
"celltype_combo",
"cytokine",
"contrast",
"direction",
"ES",
"NES",
"NOM p-val",
"FDR q-val",
"FWER p-val",
"frac_shared_genes_signature",
"Lead_genes",
"threshold_lfc",
"threshold_expression",
]
].sort_values("NES", ascending=False)
| celltype_combo | cytokine | contrast | direction | ES | NES | NOM p-val | FDR q-val | FWER p-val | frac_shared_genes_signature | Lead_genes | threshold_lfc | threshold_expression | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | CD8a (CD8_T_cell) | IL-15 | True_vs_False | upregulated | 0.599907 | 2.052186 | 0.0 | 0.0 | 0.0 | 0.484568 | ACTB;HSPA8;PRF1;IL32;PFN1;ACTG1;MYL12A;GZMA;TA... | 1.0 | 0.05 |
| 1 | CD8a (CD8_T_cell) | IL-2 | True_vs_False | upregulated | 0.62742 | 2.024465 | 0.0 | 0.0 | 0.0 | 0.502994 | HSPA8;PRF1;IL32;MYL12A;GZMA;TAP1;ENO1;GZMB;PSM... | 1.0 | 0.05 |
| 2 | CD8a (CD8_T_cell) | IL-7 | True_vs_False | upregulated | 0.475401 | 1.518688 | 0.005325 | 0.025716 | 0.06 | 0.334802 | HSPA8;GZMB;PSME2;GZMH;CTSC;ISG15;GBP1;CALR;HSP... | 1.0 | 0.05 |
| 3 | CD8a (CD8_T_cell) | IFN-omega | True_vs_False | upregulated | 0.42752 | 1.374484 | 0.03178 | 0.091201 | 0.233 | 0.417989 | HSPA8;PRF1;TAP1;PSME2;IFITM2;PGAM1;ISG15;GBP1;... | 1.0 | 0.05 |
| 4 | CD8a (CD8_T_cell) | IL-1-beta | True_vs_False | upregulated | 0.515314 | 1.353885 | 0.073027 | 0.091697 | 0.274 | 0.500000 | TAP1;MT-CO1;ISG15;GBP1;IFITM1;UBE2L6;EPSTI1;PI... | 1.0 | 0.05 |
| 5 | CD8a (CD8_T_cell) | IL-4 | True_vs_False | upregulated | 0.483863 | 1.267079 | 0.152745 | 0.174504 | 0.468 | 0.259259 | PTGER2;IMMT;PTGDR;CISH;VCL;XBP1;LIMA1;TMEM71 | 1.0 | 0.05 |
| 6 | CD8a (CD8_T_cell) | IFN-beta | True_vs_False | upregulated | 0.36094 | 1.187676 | 0.185224 | 0.258371 | 0.645 | 0.391489 | TAP1;PSMB8;STOM;IFI16;IFITM2;ALOX5AP;ISG15;GBP... | 1.0 | 0.05 |
| 7 | CD8a (CD8_T_cell) | IFN-gamma | True_vs_False | upregulated | 0.422746 | 1.179638 | 0.236721 | 0.238198 | 0.664 | 0.452055 | TAP1;ISG15;GBP1;OASL;IFITM1;EPSTI1;MX1;GBP5;LA... | 1.0 | 0.05 |
| 8 | CD8a (CD8_T_cell) | IFN-alpha1 | True_vs_False | upregulated | 0.450718 | 1.114153 | 0.320047 | 0.308474 | 0.779 | 0.457143 | ISG15;EPSTI1;MX1;PARP9;RTP4;OAS2;IFI6;OAS1;SAMD9L | 1.0 | 0.05 |
# Filter enrichment results for robust and significant outputs.
robust_results = hc.check_robustness(
all_results=enrichment_results, threshold_qval=0.1, threshold_valid=0.1, threshold_below_alpha=0.9
)
robust_results
| celltype_combo | contrast | cytokine | frac_valid | frac_significant | is_robust | NES_min | NES_max | FDR q-val | qval_threshold | threshold_frac_below_alpha | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | CD8a (CD8_T_cell) | True_vs_False | IL-15 | 1.0 | 1.0 | True | 2.052186 | 2.052186 | 0.000000 | 0.1 | 0.9 |
| 1 | CD8a (CD8_T_cell) | True_vs_False | IL-2 | 1.0 | 1.0 | True | 2.024465 | 2.024465 | 0.000000 | 0.1 | 0.9 |
| 2 | CD8a (CD8_T_cell) | True_vs_False | IL-7 | 1.0 | 1.0 | True | 1.518688 | 1.518688 | 0.025716 | 0.1 | 0.9 |
| 3 | CD8a (CD8_T_cell) | True_vs_False | IFN-omega | 1.0 | 1.0 | True | 1.374484 | 1.374484 | 0.091201 | 0.1 | 0.9 |
| 4 | CD8a (CD8_T_cell) | True_vs_False | IL-1-beta | 1.0 | 1.0 | True | 1.353885 | 1.353885 | 0.091697 | 0.1 | 0.9 |
3.2. Run enrichment analysis using custom ranking function#
The main analysis is performed by run_one_enrichment_test(), which computes enrichment scores using an internal ranking statistic. It uses the signal-to-noise ratio to generate a pre-ranked list of genes for the enrichment analysis.
Alternatively, you can provide your own ranking statistic. In this example, the ranking is based on a simple log fold change calculation. As a result, the enrichment results change. This shows that the way genes are ranked can strongly influence the enrichment analysis.
from hucira.tl.enrichment_test import _compute_log_fold_change
enrichment_results = hc.run_one_enrichment_test(
adata=adata,
df=df_hcd_all,
contrast_combo=contrast,
celltype_combo=celltype_combos[1],
contrast_column=your_contrast_colname,
celltype_column=your_celltype_colname,
direction="upregulated",
threshold_lfc=1.0,
threshold_expression=0.05,
threshold_pval=0.01,
ranking_statistic_fn=_compute_log_fold_change,
)
enrichment_results.sort_values("NES", ascending=False)
| Name | cytokine | ES | NES | NOM p-val | FDR q-val | FWER p-val | Tag % | Gene % | Lead_genes | ... | threshold_expression | min_size | max_size | permutation_num | weight | seed | threads | num_genes_signature | num_shared_genes_signature | frac_shared_genes_signature | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | prerank | IFN-gamma | 0.565919 | 1.649477 | 0.00206 | 0.010223 | 0.01 | 16/33 | 15.43% | EPSTI1;OASL;TAP1;GBP1;RTP4;LAP3;PARP9;MX1;FBXO... | ... | 0.05 | 10 | 1000 | 1000 | 1.0 | 2025 | 6 | 73.0 | 33.0 | 0.452055 |
| 1 | prerank | IFN-alpha1 | 0.638917 | 1.648862 | 0.002119 | 0.005111 | 0.01 | 7/16 | 11.38% | EPSTI1;RTP4;PARP9;MX1;OAS2;ISG15;OAS1 | ... | 0.05 | 10 | 1000 | 1000 | 1.0 | 2025 | 6 | 35.0 | 16.0 | 0.457143 |
| 2 | prerank | IL-1-beta | 0.504161 | 1.397566 | 0.036881 | 0.064062 | 0.155 | 11/24 | 22.27% | EPSTI1;TAP1;GBP1;CTSA;FBXO6;ISG15;GBP5;TNFSF10... | ... | 0.05 | 10 | 1000 | 1000 | 1.0 | 2025 | 6 | 48.0 | 24.0 | 0.500000 |
| 3 | prerank | IL-15 | 0.416504 | 1.374959 | 0.001 | 0.061591 | 0.194 | 59/157 | 20.34% | CX3CR1;PCNA;EPSTI1;OASL;TAP1;PRF1;GBP1;MCM5;IF... | ... | 0.05 | 10 | 1000 | 1000 | 1.0 | 2025 | 6 | 324.0 | 157.0 | 0.484568 |
| 4 | prerank | IL-4 | 0.478325 | 1.30142 | 0.10865 | 0.101613 | 0.327 | 7/21 | 18.21% | IMMT;LIMA1;CISH;VCL;TMEM71;PTGDR;PTGER2 | ... | 0.05 | 10 | 1000 | 1000 | 1.0 | 2025 | 6 | 81.0 | 21.0 | 0.259259 |
| 5 | prerank | IFN-omega | 0.384821 | 1.224033 | 0.09519 | 0.171059 | 0.507 | 37/79 | 31.13% | EPSTI1;OASL;TAP1;PRF1;GBP1;IFITM3;RTP4;LAP3;PA... | ... | 0.05 | 10 | 1000 | 1000 | 1.0 | 2025 | 6 | 189.0 | 79.0 | 0.417989 |
| 6 | prerank | IL-2 | 0.380176 | 1.204854 | 0.096192 | 0.171302 | 0.564 | 33/84 | 23.19% | EPSTI1;TAP1;PRF1;GBP1;IFITM3;GZMB;LAP3;PARP9;M... | ... | 0.05 | 10 | 1000 | 1000 | 1.0 | 2025 | 6 | 167.0 | 84.0 | 0.502994 |
| 7 | prerank | IFN-beta | 0.365799 | 1.172695 | 0.142142 | 0.194613 | 0.654 | 31/92 | 19.50% | EPSTI1;OASL;TAP1;GBP1;PRKD2;IFITM3;RTP4;STOM;L... | ... | 0.05 | 10 | 1000 | 1000 | 1.0 | 2025 | 6 | 235.0 | 92.0 | 0.391489 |
| 8 | prerank | IL-7 | 0.348978 | 1.099694 | 0.271815 | 0.295434 | 0.825 | 27/76 | 24.02% | EPSTI1;STIP1;GBP1;IFITM3;C1GALT1C1;GZMB;LAP3;M... | ... | 0.05 | 10 | 1000 | 1000 | 1.0 | 2025 | 6 | 227.0 | 76.0 | 0.334802 |
9 rows × 30 columns
The custom ranking function must follow a specific format:
It should accept four input parameters: adata, contrast_column, condition_1, and condition_2. The function must return a pandas DataFrame containing ranked genes, where gene names are used as the index, the ranking statistic forms the values, and the column name specifies the contrast (e.g., condition_1_vs_condition_2). This example shows a toy ranking function as an illustration.
from anndata import AnnData
import pandas as pd
def toy_ranking_statistic(
adata: AnnData,
contrast_column: str,
condition_1: str,
condition_2: str,
) -> pd.DataFrame:
"""Minimal toy ranking function."""
# Random ranking statistic for each gene
ranking_stat = np.random.randn(adata.n_vars)
return pd.DataFrame(
ranking_stat,
index=adata.var_names,
columns=[f"{condition_1}_vs_{condition_2}"],
)
enrichment_results = hc.run_one_enrichment_test(
adata=adata,
df=df_hcd_all,
contrast_combo=contrast,
celltype_combo=celltype_combos[1],
contrast_column=your_contrast_colname,
celltype_column=your_celltype_colname,
direction="upregulated",
threshold_lfc=1.0,
threshold_expression=0.05,
threshold_pval=0.01,
ranking_statistic_fn=toy_ranking_statistic,
)
enrichment_results.sort_values("NES", ascending=False)
| Name | cytokine | ES | NES | NOM p-val | FDR q-val | FWER p-val | Tag % | Gene % | Lead_genes | ... | threshold_expression | min_size | max_size | permutation_num | weight | seed | threads | num_genes_signature | num_shared_genes_signature | frac_shared_genes_signature | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3 | prerank | IL-2 | 0.203674 | 0.890107 | 0.669159 | 0.661717 | 0.834 | 29/84 | 28.87% | IFITM3;GBP1;TNFSF14;PSMB8;TPM4;LTB;FBXO6;IFITM... | ... | 0.05 | 10 | 1000 | 1000 | 1.0 | 2025 | 6 | 167.0 | 84.0 | 0.502994 |
| 8 | prerank | IFN-alpha1 | -0.233137 | -0.686907 | 0.876768 | 0.944809 | 0.980749 | 3/16 | 16.27% | TYMP;HELZ2;SAMD9L | ... | 0.05 | 10 | 1000 | 1000 | 1.0 | 2025 | 6 | 35.0 | 16.0 | 0.457143 |
| 7 | prerank | IFN-gamma | -0.209036 | -0.748128 | 0.883768 | 1.0 | 0.970053 | 7/33 | 16.27% | TYMP;HELZ2;OASL;RGS1;PARP14;MT2A;SAMD9L | ... | 0.05 | 10 | 1000 | 1000 | 1.0 | 2025 | 6 | 73.0 | 33.0 | 0.452055 |
| 6 | prerank | IFN-omega | -0.190154 | -0.835762 | 0.799235 | 1.0 | 0.936898 | 15/79 | 16.27% | SP140;PPM1K;TAP2;UBA7;UBE2L6;TRANK1;TYMP;GTPBP... | ... | 0.05 | 10 | 1000 | 1000 | 1.0 | 2025 | 6 | 189.0 | 79.0 | 0.417989 |
| 5 | prerank | IL-7 | -0.196657 | -0.838665 | 0.791411 | 1.0 | 0.935829 | 14/76 | 16.27% | NFKBIZ;C1GALT1C1;GZMH;PPA1;CTSC;TYMP;PIM1;GCH1... | ... | 0.05 | 10 | 1000 | 1000 | 1.0 | 2025 | 6 | 227.0 | 76.0 | 0.334802 |
| 4 | prerank | IFN-beta | -0.188577 | -0.845274 | 0.798793 | 1.0 | 0.931551 | 18/92 | 16.27% | SP140;PPM1K;TAP2;UBE2L6;CASP1;TRANK1;TYMP;GTPB... | ... | 0.05 | 10 | 1000 | 1000 | 1.0 | 2025 | 6 | 235.0 | 92.0 | 0.391489 |
| 2 | prerank | IL-1-beta | -0.287848 | -0.965184 | 0.484663 | 1.0 | 0.804278 | 10/24 | 26.21% | MT-CO2;UBE2L6;TYMP;PIM1;HELZ2;MT2A;MT-CO1;APOL... | ... | 0.05 | 10 | 1000 | 1000 | 1.0 | 2025 | 6 | 48.0 | 24.0 | 0.500000 |
| 1 | prerank | IL-15 | -0.243178 | -1.184948 | 0.119835 | 0.750057 | 0.449198 | 34/157 | 16.87% | VAMP5;NFKBIZ;C1GALT1C1;SQLE;GZMH;ILK;TUBA1C;CT... | ... | 0.05 | 10 | 1000 | 1000 | 1.0 | 2025 | 6 | 324.0 | 157.0 | 0.484568 |
| 0 | prerank | IL-4 | -0.422944 | -1.366421 | 0.097917 | 0.51572 | 0.20107 | 9/21 | 22.71% | JAML;GAB3;VCL;TMEM71;LIMA1;MAPKAPK3;IMMT;CISH;... | ... | 0.05 | 10 | 1000 | 1000 | 1.0 | 2025 | 6 | 81.0 | 21.0 | 0.259259 |
9 rows × 30 columns
end = time.time()
print(f"Time: {int((end - start) / 60)} min {int((end - start) % 60)} sec")
Time: 0 min 12 sec
## --END-- ##