Cytokine enrichment analysis using Human Cytokine Dictionary#

The Human Cytokine Dictionary (hcd) can be accessed through this module. Your query data should be a transcriptomic data object in AnnData’s .h5ad format with gene symbols in .var axis and metadata describing immune celltypes and experimental conditions of samples. Using the Human Cytokine Dictionary as reference, you can “look up” cytokine responses of the disease states in your own dataset.

This tutorial runs you through the core analysis of this package, which computes enrichment scores for cytokines in your chosen condition. The enrichment analysis is based on internally computed pre-ranked genes, but the user has the option to define their own ranking statistics.

# Import core packages and the hucira package.

import os
import time
import warnings

import numpy as np
from tqdm import TqdmWarning

import hucira as hc

warnings.simplefilter("ignore", TqdmWarning)

start = time.time()
current_wd = os.getcwd()
print(current_wd)
/ictstr01/home/icb/jenni.liu/all_projects/cytokine_dict_project/huCIRA/docs/notebooks

1. Load input data#

The two main input files for this analysis are:

  1. the Human Cytokine Dictionary

  2. your transcriptome data object

Explore their metadata annotation for cell types and disease condition. The files are saved to or loaded from the current working directory. The path is explicitly specified here, although it’s also the default saving location if left unspecified.

#### Load the human cytokine dictionary (mini version)

df_hcd_all = hc.load_human_cytokine_dict(save_dir=current_wd, mini_version=True)
df_hcd_all.head()
index gene log_fc logCPM F p_value adj_p_value contrast celltype cytokine ... median_num_cells_pbs median_num_cells_cytokine mean_num_cells_pbs mean_num_cells_cytokine min_num_cells_pbs min_num_cells_cytokine max_num_cells_pbs max_num_cells_cytokine num_DE_pbs_wells well_biased
0 182 AACS 0.506575 4.404342 19.205548 3.586696e-04 5.992502e-03 NaN Intermediate_B_cell IL-32-beta ... 1435.5 162.0 1814.333333 201.666667 367.0 43.0 4609.0 414.0 4.0 False
1 415 AAMDC -0.592411 4.404723 28.952278 5.144025e-05 7.192403e-04 NaN Intermediate_B_cell IFN-omega ... 1435.5 181.0 1814.333333 282.916667 367.0 113.0 4609.0 838.0 4.0 False
2 423 AAMDC -0.574430 4.394900 29.933840 4.591638e-05 5.666800e-04 NaN Intermediate_B_cell IL-15 ... 1435.5 305.5 1814.333333 382.916667 367.0 120.0 4609.0 1296.0 4.0 False
3 437 AAMDC 1.123624 5.038960 157.966713 7.296158e-11 1.083115e-07 NaN Intermediate_B_cell IL-23 ... 1435.5 282.0 1814.333333 325.250000 367.0 95.0 4609.0 901.0 6.0 False
4 445 AAMDC 0.558964 4.716018 35.557344 2.372101e-06 9.600669e-05 NaN Intermediate_B_cell IL-34 ... 1435.5 248.0 1814.333333 311.500000 367.0 83.0 4609.0 655.0 5.0 False

5 rows × 21 columns

#### Load the query adata object

adata = hc.load_multiple_sclerosis_data()
adata
AnnData object with n_obs × n_vars = 65326 × 10266
    obs: 'labels', 'MS', 'CSF', 'valid_clusters', 'CD4_labels'
    obsm: 'X_umap'
#### Optional for easy workflow:
# Enter celltype column name and condition column name of your query adata, which you can read from previous cell output

your_celltype_colname = "labels"
your_contrast_colname = "MS"

2. Process and prepare input data parameters#

Follow these steps:

  1. Check if features are in HGNC format. Must match gene symbol nomenclature of dictionary.

  2. Because there is no standard nomenclature for cell types, we have to manually create celltype_combos, a mapping that links the cell types in the query dataset to the corresponding cell types in the human cytokine dictionary.

  3. Choose the experimental condition of interest for your query data (contrast)

Steps 2 and 3, just like defining cell type and contrast columns earlier, are optional. However, it can be helpful to get an overview of your data and run the pipeline more smoothly.

adata.var.head()
FO538757.2
AP006222.2
RP11-206L10.9
FAM41C
NOC2L

The Enrichment analysis needs two main pieces of information from the query adata: cell types and experimental (disease) conditions.

Annotations have to be examined manually, because naming conventions between objects differ. Compare the cell type annotation of the dictionary with the naming convention of your query adata and re-annotate your query adata’s annotation if necessary.

print(f"All celltypes in dictionary:\n  {df_hcd_all.celltype.unique()}\n")
print(f"All celltypes in query data:\n  {sorted(adata.obs[your_celltype_colname].unique())}\n")
print(
    f"All experimental states (contrasts/conditions) in query data:\n  {sorted(adata.obs[your_contrast_colname].unique())}\n"
)
All celltypes in dictionary:
  ['Intermediate_B_cell' 'NKT' 'CD8_Memory_T_cell' 'NK_CD56low' 'CD16_Mono'
 'NK_CD56hi' 'CD8_Naive_T_cell' 'pDC' 'ILC' 'MAIT' 'Naive_B_cell'
 'CD4_Naive_T_cell' 'Treg' 'Plasmablast' 'Granulocyte' 'B_cell'
 'CD4_T_cell' 'HSPC' 'CD8_T_cell' 'CD14_Mono' 'cDC' 'CD4_Memory_T_cell'
 'NK' 'Mono']

All celltypes in query data:
  ['B cell doublets', 'B1', 'B2', 'CD4', 'CD8a', 'CD8n', 'Gran', 'MegaK', 'Mono', 'Mono Doublet', 'NK1', 'NK2', 'RBC', 'Tdg', 'Tregs', 'contamination1', 'doublet', 'mDC1', 'mDC2', 'ncMono', 'pDC', 'plasma']

All experimental states (contrasts/conditions) in query data:
  ['False', 'True']

To create celltype_combos, enter ordered lists that match the same cell types across different naming conventions in your data vs the human cytokine dictionary. “create_celltype_combos()” can help you to create the right format, but you can also format manually.

# Define celltype_combos, an input parameter for enrichment testing
adata_celltypes = ["B1", "CD8a", "Mono"]
hcd_celltypes = ["B_cell", "CD8_T_cell", "CD14_Mono"]

celltype_combos = hc.create_celltype_combos(adata_celltypes, hcd_celltypes)
print(celltype_combos)

# Define condition of interest
contrast = ("True", "False")
print(contrast)
[('B1', 'B_cell'), ('CD8a', 'CD8_T_cell'), ('Mono', 'CD14_Mono')]
('True', 'False')

3.1. Run enrichment analysis#

The main analysis is done by run_one_enrichment_test(), which computes enrichment scores of one query celltype and different conditions. This simple example demonstrates the main outcome of the cytokine enrichment score analysis.

  1. run_one_enrichment_test() returns enrichment results for one celltype and contrast.

  2. check_robustness() returns the robust and significant results from that initial enrichment analysis using a single threshold to filter FDR q-val (statistical significance of enrichment result) and making sure that a cytokine enrichment is considered robust only if enough tests are evaluable and a sufficient fraction of them are significant. More about this in the advanced tutorial.

# Run enrichment analysis

enrichment_results = hc.run_one_enrichment_test(
    adata=adata,
    df=df_hcd_all,
    contrast_combo=contrast,
    celltype_combo=celltype_combos[1],
    contrast_column=your_contrast_colname,
    celltype_column=your_celltype_colname,
    direction="upregulated",
    threshold_lfc=1.0,
    threshold_expression=0.05,
    threshold_pval=0.01,
    ranking_statistic_fn=None,
)

# Look at all enrichment results (reduced view to columns of interest, not all statistics)
enrichment_results[
    [
        "celltype_combo",
        "cytokine",
        "contrast",
        "direction",
        "ES",
        "NES",
        "NOM p-val",
        "FDR q-val",
        "FWER p-val",
        "frac_shared_genes_signature",
        "Lead_genes",
        "threshold_lfc",
        "threshold_expression",
    ]
].sort_values("NES", ascending=False)
celltype_combo cytokine contrast direction ES NES NOM p-val FDR q-val FWER p-val frac_shared_genes_signature Lead_genes threshold_lfc threshold_expression
0 CD8a (CD8_T_cell) IL-15 True_vs_False upregulated 0.599907 2.052186 0.0 0.0 0.0 0.484568 ACTB;HSPA8;PRF1;IL32;PFN1;ACTG1;MYL12A;GZMA;TA... 1.0 0.05
1 CD8a (CD8_T_cell) IL-2 True_vs_False upregulated 0.62742 2.024465 0.0 0.0 0.0 0.502994 HSPA8;PRF1;IL32;MYL12A;GZMA;TAP1;ENO1;GZMB;PSM... 1.0 0.05
2 CD8a (CD8_T_cell) IL-7 True_vs_False upregulated 0.475401 1.518688 0.005325 0.025716 0.06 0.334802 HSPA8;GZMB;PSME2;GZMH;CTSC;ISG15;GBP1;CALR;HSP... 1.0 0.05
3 CD8a (CD8_T_cell) IFN-omega True_vs_False upregulated 0.42752 1.374484 0.03178 0.091201 0.233 0.417989 HSPA8;PRF1;TAP1;PSME2;IFITM2;PGAM1;ISG15;GBP1;... 1.0 0.05
4 CD8a (CD8_T_cell) IL-1-beta True_vs_False upregulated 0.515314 1.353885 0.073027 0.091697 0.274 0.500000 TAP1;MT-CO1;ISG15;GBP1;IFITM1;UBE2L6;EPSTI1;PI... 1.0 0.05
5 CD8a (CD8_T_cell) IL-4 True_vs_False upregulated 0.483863 1.267079 0.152745 0.174504 0.468 0.259259 PTGER2;IMMT;PTGDR;CISH;VCL;XBP1;LIMA1;TMEM71 1.0 0.05
6 CD8a (CD8_T_cell) IFN-beta True_vs_False upregulated 0.36094 1.187676 0.185224 0.258371 0.645 0.391489 TAP1;PSMB8;STOM;IFI16;IFITM2;ALOX5AP;ISG15;GBP... 1.0 0.05
7 CD8a (CD8_T_cell) IFN-gamma True_vs_False upregulated 0.422746 1.179638 0.236721 0.238198 0.664 0.452055 TAP1;ISG15;GBP1;OASL;IFITM1;EPSTI1;MX1;GBP5;LA... 1.0 0.05
8 CD8a (CD8_T_cell) IFN-alpha1 True_vs_False upregulated 0.450718 1.114153 0.320047 0.308474 0.779 0.457143 ISG15;EPSTI1;MX1;PARP9;RTP4;OAS2;IFI6;OAS1;SAMD9L 1.0 0.05
# Filter enrichment results for robust and significant outputs.

robust_results = hc.check_robustness(
    all_results=enrichment_results, threshold_qval=0.1, threshold_valid=0.1, threshold_below_alpha=0.9
)
robust_results
celltype_combo contrast cytokine frac_valid frac_significant is_robust NES_min NES_max FDR q-val qval_threshold threshold_frac_below_alpha
0 CD8a (CD8_T_cell) True_vs_False IL-15 1.0 1.0 True 2.052186 2.052186 0.000000 0.1 0.9
1 CD8a (CD8_T_cell) True_vs_False IL-2 1.0 1.0 True 2.024465 2.024465 0.000000 0.1 0.9
2 CD8a (CD8_T_cell) True_vs_False IL-7 1.0 1.0 True 1.518688 1.518688 0.025716 0.1 0.9
3 CD8a (CD8_T_cell) True_vs_False IFN-omega 1.0 1.0 True 1.374484 1.374484 0.091201 0.1 0.9
4 CD8a (CD8_T_cell) True_vs_False IL-1-beta 1.0 1.0 True 1.353885 1.353885 0.091697 0.1 0.9

3.2. Run enrichment analysis using custom ranking function#

The main analysis is performed by run_one_enrichment_test(), which computes enrichment scores using an internal ranking statistic. It uses the signal-to-noise ratio to generate a pre-ranked list of genes for the enrichment analysis.

Alternatively, you can provide your own ranking statistic. In this example, the ranking is based on a simple log fold change calculation. As a result, the enrichment results change. This shows that the way genes are ranked can strongly influence the enrichment analysis.

from hucira.tl.enrichment_test import _compute_log_fold_change

enrichment_results = hc.run_one_enrichment_test(
    adata=adata,
    df=df_hcd_all,
    contrast_combo=contrast,
    celltype_combo=celltype_combos[1],
    contrast_column=your_contrast_colname,
    celltype_column=your_celltype_colname,
    direction="upregulated",
    threshold_lfc=1.0,
    threshold_expression=0.05,
    threshold_pval=0.01,
    ranking_statistic_fn=_compute_log_fold_change,
)
enrichment_results.sort_values("NES", ascending=False)
Name cytokine ES NES NOM p-val FDR q-val FWER p-val Tag % Gene % Lead_genes ... threshold_expression min_size max_size permutation_num weight seed threads num_genes_signature num_shared_genes_signature frac_shared_genes_signature
0 prerank IFN-gamma 0.565919 1.649477 0.00206 0.010223 0.01 16/33 15.43% EPSTI1;OASL;TAP1;GBP1;RTP4;LAP3;PARP9;MX1;FBXO... ... 0.05 10 1000 1000 1.0 2025 6 73.0 33.0 0.452055
1 prerank IFN-alpha1 0.638917 1.648862 0.002119 0.005111 0.01 7/16 11.38% EPSTI1;RTP4;PARP9;MX1;OAS2;ISG15;OAS1 ... 0.05 10 1000 1000 1.0 2025 6 35.0 16.0 0.457143
2 prerank IL-1-beta 0.504161 1.397566 0.036881 0.064062 0.155 11/24 22.27% EPSTI1;TAP1;GBP1;CTSA;FBXO6;ISG15;GBP5;TNFSF10... ... 0.05 10 1000 1000 1.0 2025 6 48.0 24.0 0.500000
3 prerank IL-15 0.416504 1.374959 0.001 0.061591 0.194 59/157 20.34% CX3CR1;PCNA;EPSTI1;OASL;TAP1;PRF1;GBP1;MCM5;IF... ... 0.05 10 1000 1000 1.0 2025 6 324.0 157.0 0.484568
4 prerank IL-4 0.478325 1.30142 0.10865 0.101613 0.327 7/21 18.21% IMMT;LIMA1;CISH;VCL;TMEM71;PTGDR;PTGER2 ... 0.05 10 1000 1000 1.0 2025 6 81.0 21.0 0.259259
5 prerank IFN-omega 0.384821 1.224033 0.09519 0.171059 0.507 37/79 31.13% EPSTI1;OASL;TAP1;PRF1;GBP1;IFITM3;RTP4;LAP3;PA... ... 0.05 10 1000 1000 1.0 2025 6 189.0 79.0 0.417989
6 prerank IL-2 0.380176 1.204854 0.096192 0.171302 0.564 33/84 23.19% EPSTI1;TAP1;PRF1;GBP1;IFITM3;GZMB;LAP3;PARP9;M... ... 0.05 10 1000 1000 1.0 2025 6 167.0 84.0 0.502994
7 prerank IFN-beta 0.365799 1.172695 0.142142 0.194613 0.654 31/92 19.50% EPSTI1;OASL;TAP1;GBP1;PRKD2;IFITM3;RTP4;STOM;L... ... 0.05 10 1000 1000 1.0 2025 6 235.0 92.0 0.391489
8 prerank IL-7 0.348978 1.099694 0.271815 0.295434 0.825 27/76 24.02% EPSTI1;STIP1;GBP1;IFITM3;C1GALT1C1;GZMB;LAP3;M... ... 0.05 10 1000 1000 1.0 2025 6 227.0 76.0 0.334802

9 rows × 30 columns

The custom ranking function must follow a specific format:
It should accept four input parameters: adata, contrast_column, condition_1, and condition_2. The function must return a pandas DataFrame containing ranked genes, where gene names are used as the index, the ranking statistic forms the values, and the column name specifies the contrast (e.g., condition_1_vs_condition_2). This example shows a toy ranking function as an illustration.

from anndata import AnnData
import pandas as pd


def toy_ranking_statistic(
    adata: AnnData,
    contrast_column: str,
    condition_1: str,
    condition_2: str,
) -> pd.DataFrame:
    """Minimal toy ranking function."""
    # Random ranking statistic for each gene
    ranking_stat = np.random.randn(adata.n_vars)

    return pd.DataFrame(
        ranking_stat,
        index=adata.var_names,
        columns=[f"{condition_1}_vs_{condition_2}"],
    )
enrichment_results = hc.run_one_enrichment_test(
    adata=adata,
    df=df_hcd_all,
    contrast_combo=contrast,
    celltype_combo=celltype_combos[1],
    contrast_column=your_contrast_colname,
    celltype_column=your_celltype_colname,
    direction="upregulated",
    threshold_lfc=1.0,
    threshold_expression=0.05,
    threshold_pval=0.01,
    ranking_statistic_fn=toy_ranking_statistic,
)
enrichment_results.sort_values("NES", ascending=False)
Name cytokine ES NES NOM p-val FDR q-val FWER p-val Tag % Gene % Lead_genes ... threshold_expression min_size max_size permutation_num weight seed threads num_genes_signature num_shared_genes_signature frac_shared_genes_signature
3 prerank IL-2 0.203674 0.890107 0.669159 0.661717 0.834 29/84 28.87% IFITM3;GBP1;TNFSF14;PSMB8;TPM4;LTB;FBXO6;IFITM... ... 0.05 10 1000 1000 1.0 2025 6 167.0 84.0 0.502994
8 prerank IFN-alpha1 -0.233137 -0.686907 0.876768 0.944809 0.980749 3/16 16.27% TYMP;HELZ2;SAMD9L ... 0.05 10 1000 1000 1.0 2025 6 35.0 16.0 0.457143
7 prerank IFN-gamma -0.209036 -0.748128 0.883768 1.0 0.970053 7/33 16.27% TYMP;HELZ2;OASL;RGS1;PARP14;MT2A;SAMD9L ... 0.05 10 1000 1000 1.0 2025 6 73.0 33.0 0.452055
6 prerank IFN-omega -0.190154 -0.835762 0.799235 1.0 0.936898 15/79 16.27% SP140;PPM1K;TAP2;UBA7;UBE2L6;TRANK1;TYMP;GTPBP... ... 0.05 10 1000 1000 1.0 2025 6 189.0 79.0 0.417989
5 prerank IL-7 -0.196657 -0.838665 0.791411 1.0 0.935829 14/76 16.27% NFKBIZ;C1GALT1C1;GZMH;PPA1;CTSC;TYMP;PIM1;GCH1... ... 0.05 10 1000 1000 1.0 2025 6 227.0 76.0 0.334802
4 prerank IFN-beta -0.188577 -0.845274 0.798793 1.0 0.931551 18/92 16.27% SP140;PPM1K;TAP2;UBE2L6;CASP1;TRANK1;TYMP;GTPB... ... 0.05 10 1000 1000 1.0 2025 6 235.0 92.0 0.391489
2 prerank IL-1-beta -0.287848 -0.965184 0.484663 1.0 0.804278 10/24 26.21% MT-CO2;UBE2L6;TYMP;PIM1;HELZ2;MT2A;MT-CO1;APOL... ... 0.05 10 1000 1000 1.0 2025 6 48.0 24.0 0.500000
1 prerank IL-15 -0.243178 -1.184948 0.119835 0.750057 0.449198 34/157 16.87% VAMP5;NFKBIZ;C1GALT1C1;SQLE;GZMH;ILK;TUBA1C;CT... ... 0.05 10 1000 1000 1.0 2025 6 324.0 157.0 0.484568
0 prerank IL-4 -0.422944 -1.366421 0.097917 0.51572 0.20107 9/21 22.71% JAML;GAB3;VCL;TMEM71;LIMA1;MAPKAPK3;IMMT;CISH;... ... 0.05 10 1000 1000 1.0 2025 6 81.0 21.0 0.259259

9 rows × 30 columns

end = time.time()
print(f"Time: {int((end - start) / 60)} min {int((end - start) % 60)} sec")
Time: 0 min 12 sec
## --END-- ##