An integrated tumor, immune and microbiome atlas of colon cancer


Samples used in this observational cohort study (tumor tissue and matched healthy colon tissue, AC-ICAM cohort) are from patients with colon cancer diagnosed at Leiden University Medical Center, the Netherlands, from 2001 to 2015 that did not object for future use of human tissues for scientific research and that were consented on biospecimen protocol ‘Immunology and Genetic of colon Cancer’ approved by the Committee on Medical Ethics of Leiden University Medical Center (study protocol no. P00.193 (06/2001)). Snap-frozen tumor and healthy colon tissue were stored at −80°C until processing for DNA and RNA extraction. DNA and RNA from those samples were extracted at Leiden University Medical Center and then transferred to Sidra Medicine for sequencing together with de-identified clinicopathological data of the corresponding patients (Sidra Medicine IRB study protocols no. 1768087-1 (04/2016)/1602002725 (06/2022)). All genomic assays (WES, WGS, 16S RNA gene sequencing, RNA-seq, TCR sequencing and PCR) were performed at Sidra Medicine.

Patient information was de-identified and patient samples were anonymized and handled according to the medical guidelines described in the Code of Conduct for Proper Secondary Use of Human Tissue of The Federation of Dutch Medical Scientific Societies. This research was performed according to the recommendations outlined in the Helsinki Declaration.

For each assay we included all samples that had sufficient material (for example, DNA or RNA) available at the time of processing considering the need to preserve aliquots for additional/future assays.

Collection of biological samples

Snap-frozen tumor and healthy colon tissue were collected from patients with colon cancer who underwent surgical resection of the primary tumor between 2001 and 2015 at Leiden University Medical Center. Patients who received radiotherapy and/or chemotherapy before resection and patients with a primary tumor of non-epithelial origin were excluded. Based on tissue availability, successful nucleic acid extraction and subsequent sequencing quality control (QC), data from 348 patients were retained in the final AC-ICAM cohort (Extended Data Fig. 1). Clinicopathological and follow-up data were retrospectively collected from hospital records. Patient information was de-identified and patient samples were anonymized and handled according to the medical guidelines described in the Code of Conduct for Proper Secondary Use of Human Tissue of The Federation of Dutch Medical Scientific Societies. Extensive clinicopathological and survival data of the cohort are available (Supplementary Table 1).

Statistical analysis

Details of the statistical analysis are described in each method section. All P values were two-sided. Multiple testing corrections were performed by calculating the FDR using the Benjamini–Hochberg method, as appropriate. For missing data, no data imputations were used.

Survival analysis

Kaplan–Meier curves were generated using ggsurvplot from R package survminer (v.0.4.9). HRs between any two groups of interest and corresponding P values based on a Cox proportional hazard regression analysis and 95% confidence intervals (95% CI), were calculated using R package survival (v.2.41–3). Cox proportional hazard analysis was only computed when both groups of comparison consisted of at least ten patients. Overall P value for comparison of survival between two or more groups was also calculated by log-rank test.

Multivariate Cox regression was performed using conventional clinical and biological variables, as explained in the specific section. Separate multivariate Cox regression analyses were run including age (continuous), pathological stage (ordinal), MSI status (binary) and CMS (categorical). Additional variables that were found significant in univariate Cox proportional hazard regression analysis were added to these models. These variables included, ICR score (continuous) or ICR cluster (ordinal), GIE (binary) and MBR group (binary). Forest plots were generated using ‘forestplot’ (v.1.7.2).

Tissue processing

Tumor and healthy tissue samples (unselected for tumor cell purity) were sectioned in a cryostat until the surface area was sufficient to assess tissue morphology by H&E staining. Non-target tissue was removed by macrodissection, including necrotic or adipose tissue and for tumor tissue samples, healthy colon tissue. When macrodissection was required, an H&E-stained slide was examined after this to confirm removal of unwanted tissue types. Frozen tissue was then sectioned at 20 µm until approximately ~10–15 mg was collected per sample. A final section post-sample processing was made for H&E staining. The collected tissue was stored at −80 °C for a few months until DNA and RNA extraction.

QC metrics of RNA and DNA data were superimposable between samples collected over the years (Supplementary Figs. 13 and 14).

DNA and RNA extraction

Nucleic acid extraction from fresh-frozen tissue sections was performed using the QIAGEN AllPrep DNA/RNA Mini kit following the manufacturer’s protocol. This process was fully automated on a QIAGEN QIAcube. β-mercaptoethanol (β-ME) was added to the lysis buffer on the day of use. Lysis was performed by completely submerging the sections in 350 µl lysis buffer. Tubes were rotated for at least 1 h at room temperature to allow complete homogenization. QIAcube AllPrep DNA/RNA Mini kit Standard (v.2) program was run, after which DNA and RNA samples were stored at −80 °C. The same DNA was used for human and microbiome sequencing. Samples were shipped from Leiden University Medical Centre (LUMC), The Netherlands to Sidra Medicine, Qatar under a temperature-controlled environment at −80 °C (for 4 d). Samples from 361 patients were sequenced by WES and RNA-seq. Samples from 13 patients were excluded as they did not pass QC, including concordance between healthy and tumor samples (Extended Data Fig. 1). The final cohort included 348 patients, for which RNA-seq for tumor samples was possible and passed QC. A subset of samples from these patients were processed with additional assays including WGS, TCR sequencing and 16S RNA gene sequencing, based on the availability of samples for these assays, as described in the following sections.

RNA sequencing

The integrity and concentration of the extracted RNA was assessed on the LabChip GXII Touch HT using the RNA Assay and the DNA 5K/RNA/Charge Variant Assay LabChip (PerkinElmer). Sequencing mRNA libraries were constructed from 500 ng of total RNA using the Illumina TruSeqStranded mRNA kit (Illumina). cDNA was synthesized using Superscript IV Reverse Transcriptase (Thermo Fisher) and amplified for 15 cycles after ligating with TruSeq RNA Combinatorial Dual-Index adapters. Clonal amplification and cluster generation was performed using Illumina’s cBot 2 System. Sequencing libraries were run on Illumina HiSeq platforms using 75 bp (93% of samples) or 150 bp (7% of samples) paired-end reads at the Clinical Genomics Laboratory, Sidra Medicine. We targeted a coverage of 20 M reads per sample. Obtained coverage was 18.4 M (s.d. 4.7 M).

Transcriptomic data processing

Data conversion and demultiplexing was performed using bcl2fastq2 conversion software (v.2.20). FastQC was run to perform QC checks on the raw sequence data (Python v.2.7.1, FastQC v.0.11.2). Trimming of adaptor sequences was performed using flexbar (v.3.0.3) using Illumina primers FASTA file. Subsequently, reads were aligned to reference genome GRCh38.93 by Hisat2 (v.2.1.0) using SAMtools (v.1.3). After alignment, QC was performed to verify quality of the alignment and paired-end mapping overlap (Bowtie2, v.2.3.4.2). Finally, the featureCounts function of subreads (v.1.5.1) was used to count paired reads per genes. Gene expression normalization was performed within lanes, to correct for gene-specific effects (including GC content) and between lanes, to correct for sample-related differences (including sequencing depth) using R package EDASeq (Exploratory Data Analysis and Normalization for RNA-seq) (v.2.12.0). The resulting expression values were quantile normalized using R package preprocessCore (v.1.36.0). All downstream analysis of the expression data was performed using R (v.3.5.1, or later).

Whole-exome sequencing

DNA concentrations were quantified using Quant-iT broad range dsDNA Assay (Thermo Fisher) on the FlexStation 3 Microplate reader (Molecular Devices). DNA of both tumor and matched normal samples was available for 294 patients. Whole-exome libraries were constructed with the Agilent SureSelect XT Target enrichment kit and the exonic DNA was captured using the Agilent SureSelect XT Human All Exon V6r2 capture library for 60-Mb exonic regions. Libraries were constructed using 250 ng of DNA and were sequenced on Illumina’s HiSeq 4000 platform using 150 bp paired-end reads (150PE) at the Genomics Core, Sidra Medicine. Reads were mapped to reference genome hs37d5 (1000 Genomes Phase2 Reference Genome Sequence) based on GRCh37/hg19 using BWA (v.0.7.12)65. WES (200× for tumor and 100× for normal) had an on-target sequencing rate of 65–70%. The median (across samples) of the average target coverage (per sample) was 129× (interquartile range (IQR) 18) for tumor samples and 69× (IQR 10) for normal samples (Extended Data Fig. 5a). In tumors, sequencing achieved >20-fold coverage of at least 99% of targeted exons and >70-fold in at least 81% targeted exons. In healthy samples, sequencing achieved >20-fold coverage of at least 94% of targeted exons and >30-fold in at least 84% targeted exons. Adaptor trimming was performed using the tool trimadap (v.0.1.3). ConPair was run to evaluate concordance and estimate contamination between matched tumor–normal pairs. In eight of the pairs a mismatch was detected and for five pairs, a potential contamination was indicated. HLA typing data were used to validate these results. All potential mismatches and contaminations were excluded, retaining 281 patients for data analysis.

TCGA data

RNA sequencing

RNA-seq data (raw counts) from TCGA were downloaded and processed using R package TCGAbiolinks (v.2.18.0). Gene symbols were converted to official HGNC gene symbols and genes without symbol or gene information were excluded. Normalization was performed within lanes, to correct for gene-specific effects (including GC content) and between lanes, to correct for sample-related differences (including sequencing depth) using R package EDASeq (v.2.12.0) and quantile normalized using preprocessCore (v.1.36.0). After normalization, samples were extracted to obtain a single primary tumor tissue (TP) sample per patient. Clinical data were sourced from the TCGA Pan-Cancer Clinical Data Resource11 and survival events OS and progression-free interval (relabeled here as PFS) were used. ICR clustering and calculation of ICR score was performed exactly as described for the AC-ICAM cohort. For the TCGA-COAD cohort, the optimal number of clusters for best segregation based on the Calinski–Harabasz criterion was three. CMS classification of TCGA-COAD samples was performed as described for the AC-ICAM cohort. The Single Sample Predictor by ‘CMSclassifier’ (v.1.0) was used for comparison of CMS classification between AC-ICAM and TCGA-COAD.

A renormalized matrix of both TCGA-COAD and AC-ICAM datasets was generated by merging the raw counts matrices and performing the EDASeq normalization, as described above, on this combined matrix. These data were used to calculate ssGSEA scores for deconvoluted immune cell subpopulations, immune signatures and oncogenic pathways, to compare between cohorts.

Somatic mutation data

Somatic mutation calls from the TCGA MC3 Project were downloaded using R package TCGAmutations (v.0.3.0) using the function tcga_load() with parameters ‘COAD’ for study and ‘MC3’ for source. The downloaded Mutation Annotation Format (MAF) file contained 406 distinct TCGA tumor sample barcodes and 18,183 genes (Hugo Symbol). This file was filtered to only include nonsynonymous mutations (‘Frame_Shift_Del’, ‘Frame_Shift_Ins’, ‘In_Frame_Del’, ‘In_Frame_Ins’, ‘Missense_Mutation’, ‘Nonsense_Mutation’, ‘Splice_Site’, ‘Translation_Start_Site’, ‘Nonstop_Mutation’), analogous to the variant filter applied to the AC-ICAM somatic mutation calls.

Microbiome

Microbiome genus relative abundance matrix for TCGA-COAD cohort (125 tumor samples and 221 genera, WGS data) was downloaded from The Cancer Microbiome Atlas website 13. TCGA-COAD relative abundance matrix was filtered to exclude duplicated samples (samples from vial B, eight samples). Overall, 81 genera were present with a nonzero abundance in at least one of the 117 samples (main matrix). When we applied the same filter as the one used for AC-ICAM 16S RNA gene-sequencing data (presence in at least 10% of the samples with at least 1% relative abundance in one sample), 27 taxa at the genus level were retained.

NHS and the HPFS study data

Somatic mutation data

Somatic mutations in NHS and HPFS Colorectal Cancers were downloaded from the supplementary data of the Giannakis et al. study (Giannakis, Supplementary Table 3). The downloaded file contained 619 distinct tumor sample barcodes and 19,208 genes (Hugo Symbol). We excluded the samples with tumor anatomic site specified as rectum (anatomic site is available in Giannakis Supplementary Table 1) and retained 482 colon cancer samples. Only nonsynonymous mutations were included at the variant filter (‘Frame_Shift_Del’, ‘Frame_Shift_Ins’, ‘In_Frame_Del’, ‘In_Frame_Ins’, ‘Missense_Mutation’, ‘Nonsense_Mutation’, ‘Splice_Site’, ‘Translation_Start_Site’, ‘Nonstop_Mutation’), analogous to the variant filter applied to the AC-ICAM and TCGA-COAD somatic mutation files.

Cancer-related gene annotation

A cancer-related gene list was constructed from using different sources, as previously described:35 (1) genes used by two consortia to define germline genetic variations in pediatric cancers (n = 159;34 n = 565 (ref. 33)); (2) genes with at least one pathogenic or likely pathogenic germline variants in the TCGA cohort (n = 99)66; (3) genes classified as driver genes according to the most updated TCGA analysis (n = 299)32; (4) genes included in the MSK-IMPACT (n = 505), MSK-IMPACT HEME (n = 575), Foundation One CDx (n = 324) and Foundation One Heme (n = 593) panels; (5) cancer genes cataloged as tier 1 by the Sanger Cancer Gene Census (n = 576); and (6) cancer genes defined as such by Vogelstein et al.67. Sources 4–6 were downloaded from OncoKB68. Original sources’ gene names were converted into Ensemble GRCh37 gene symbols. The final list included 1,219 unique cancer genes and is provided in the Supplementary Information.

Transcriptome analysis

ICR score and clustering

Consensus clustering based on 20 a priori selected ICR genes (IFNG, IRF1, STAT1, IL12B, TBX21, CD8A, CD8B, CXCL9, CXCL10, CCL5, GZMB, GNLY, PRF1, GZMH, GZMA, CD274/PDL1, PDCD1, CTLA4, FOXP3 and IDO1)21, was applied to the normalized log2-transformed expression matrix using R package ConsensusClusterPlus (v.1.42.0)69 using 5,000 repeats, agglomerative hierarchical clustering with Ward criterion inner and complete outer linkage. The optimal number of clusters allowing for the best segregation of samples was based on the Calinski–Harabasz criterion. Optimal number of clusters used for segregation was three. Colon cancer samples in the cluster with the highest expression of ICR genes were designated as ‘ICR high’, the intermediate cluster as ‘ICR medium’ and the cluster with the lowest expression was designated ‘ICR low’. The mean log2-transformed expression value of the 20 ICR genes is referred to as the ICR score.

CMS classification

Samples were classified according to CMS by R package ‘CMSclassifier’ (v.1.0) using random forest method16. The obtained CMS labels (from the column ‘RF.predictedCMS’ in output dataframe) were used for all downstream analyses with the exception of the comparison of CMS subtypes between AC-ICAM and TCGA cohort. To allow between-cohort comparison, we ran the CMSclassifier using the ‘single-sample predictor’ method. This method makes it possible to predict unique samples, with a constant output whether the sample is predicted alone or within a series of samples16 and can therefore be used for comparison across cohorts.

Dimension-reduction of the complete expression matrix was performed using t-SNE by ‘Rtsne’ (v.0.15) and visualized using ggplot2 (v.3.3.2). The t-SNE plot was annotated with distinct colors to visualize the distribution of samples of different CMS (using random forest method) in high-dimensional space. The same t-SNE plot was annotated by ICR cluster in a separate panel. A circos plot to visualize the relation between CMS and ICR classifications was generated using the chordDiagram function from R package ‘circlize’ (v.0.4.8).

Immune cell deconvolution and ESTIMATE

Consensus tumor microenvironment cell estimation (ConsensusTME)70 was performed to estimate relative abundancies of specific immune cell subsets from bulk transcriptome data. This method relies on integrated gene sets from multiple sources that have been curated and validated on a per-cancer-type basis, using benchmark datasets and seems to outperform previously published methods70. We applied ConsensusTME using R package ConsensusTME (v.0.0.1.9) using parameters ‘COAD’ to specify cancer type and ‘ssgsea’ as statistical method.

The median of each ConsensusTME score was calculated per CMS stratified by ICR cluster and was displayed in a dotted heat map using R package ComplexHeatmap (v.2.1.2). The association of each ConsensusTME score with OS and PFS was calculated by Cox proportional hazard regression. HR and corresponding 95% CIs as are displayed as forest plots (forestplot v.2.0.1).

To infer estimated levels of overall stromal and immune cell infiltration to the tumor, the ESTIMATE algorithm (v.1.0.13) was applied to the expression data in R. ESTIMATE was run for both TCGA-COAD dataset and the AC-ICAM cohort. The combined ESTIMATE score for both the stromal and immune signature was compared between cohorts and a box-plot was generated using ggplot2 (v.3.3.2).

Analysis of tumor-related signatures and immune traits

Single-sample gene set enrichment analysis (ssGSEA) was applied to the log2-transformed, normalized gene expression matrix71 (GSVA, v.1.38.2). Gene sets that reflect specific tumor-related pathways were selected from multiple sources as described in detail in Roelands et al.10 and Supplementary Source Data Table 6a. Enrichment scores of each of these 48 pathways by CMS were visualized using ComplexHeatmap (v.2.1.2). To better understand the interactions between tumor-intrinsic signaling and the immune microenvironment, we calculated the Pearson correlation between the ICR score and the scores of the 48 tumor-related pathways. This analysis was performed in the total cohort as well as across CMS subtypes.

Immune traits considered for analysis were based on a collection of well-characterized immune traits3,72. This collection includes 68 gene signatures related to immunomodulatory signaling, including IFN signaling, TGF-β, wound healing (core serum response) and T cell/B cell response3,73. Gene expression values were median centered and gene symbols were mapped to EntrezIDs (org.Hs.eg.db_3.6.0). Signatures scores were then mean centered and their s.d. values were scaled to one. For all other immune traits, ssGSEA was applied. These included signatures for antigen-presenting machinery (APM1 and APM2) and angiogenesis and nine TCGA-based coexpression signatures (metagene attractors). This collection was supplemented with the tumor inflammation signature9 and two non-overlapping signatures of IFN-stimulated genes (ISGs), including IFNG hallmark gene set IFNG.GS and ISG resistance signature (ISG.RS)74, calculated using ssGSEA. Finally, the deconvoluted immune cell abundancies by ConsensusTME70 and ICR score10 were included among the immune traits. In total we used 103 immune traits (including ConsensusTME) (Supplementary Source Data 6 provides gene signatures and corresponding references).

The pairwise Pearson correlation between all immune traits was calculated and the resulting correlation matrix was plotted using ComplexHeatmap (v.2.1.2) with hierarchical clustering. Co-clustering immune traits that formed distinct modules were visualized and labeled according to the immune traits’ enrichment. The clustering was compared to previously defined immune trait modules within a pan-cancer setting, by annotation of the correlation matrix with the previously defined clusters in Sayaman et al.3.

Survival analysis on AC-ICAM subsampling

We subsampled AC-ICAM hundreds of times in two ways, one was random, the other was on a subgroup of samples with an ESTIMATE distribution that approximates that of the TCGA-COAD. The function ‘approxfun’ in R was used to generate a function to approximate the density of ESTIMATE scores in TCGA-COAD. Cases were sampled from AC-ICAM using the ‘sample’ function in R with prob argument set to sample points with probability distribution of the TCGA-COAD. Each subsampled cohort consisted of 200 samples. The number of subsets in which the Cox proportional regression for ICR score was significant was compared between the two ways of subsampling, statistical significance was determined using a chi-squared test.

TCR targeted sequencing by immunoSEQ assay

This sensitive and specific dedicated assay requires high quantity of genomic DNA (>2 µg) and sample selection was exclusively based on DNA availability. TCR sequencing was performed using extracted DNA of 114 primary tissue samples and ten matched healthy colon tissues with sufficient DNA available.

DNA samples were normalized to a concentration of 125 ng μl−1 using 3.840 μg of DNA as input per sample. The immunoSEQ assay from Adaptive Biotechnologies was used to amplify all possible variable, diversity and joining (VDJ) gene rearrangements of the TCRβ locus (TRB) using a multiplex PCR method. PCR and magnetic bead cleanup were performed according to manufacturer’s instructions. Recommended QC was performed after the first PCR and second PCR amplification steps by running the PCR product on an agarose gel. Purified second PCR amplification products were pooled and the library pool was quantified using Agilent Bioanalyzer 2100. Subsequently, pools were diluted to a concentration of 1 pM and sequenced on Illumina NextSeq 500/550 system with Mid Output kit (150 cycles) and Custom NextSeq Sequencing Primer (P/N, M150) (read 1, 156 cycles and read 2, 9 cycles). Sequencing was performed using survey resolution (two replicates per sample). A sample manifest was created in immunoSEQ Analyzer and the raw sequencing data were uploaded to the Adaptive Biotechnologies cloud following the manufacturer’s instructions. Data were processed using the company’s proprietary pipeline. Number of total templates analyzed per sample ranged 1,906–95,834 (median 21,258). The average read coverage per sample ranged 11.4–80.6 (median 36.2).

TCR analysis

TCR immunoSEQ data analysis

ImmunoSEQ sample-based output variables, as made available by the immunoSEQ Analyzer, include the total number of templates analyzed, number of productive templates, fraction productive templates, number of total rearrangements, number of productive rearrangements, productive clonality and the maximum productive frequency. Herein, the total number of templates reflects the total number of T cells analyzed, of which only the productive templates can produce a functional protein receptor (rearrangement in the sample are inframe and do not contain a stop codon). The total number of productive rearrangements is the total number of unique T cell clones and clonality is calculated by normalizing the productive entropy using the total number of productive rearrangements and subtracting the result from 1. Values for (productive) clonality range from 0 to 1, with values near 0 reflecting more polyclonal samples and values near 1 representing samples with just few predominant rearrangements dominating the observed T cell repertoire (TRB gene). A high T cell clonality implies presence of expanded T cell clones.

Relationships between ICR score, immune traits, number of productive templates and productive clonality were tested using Pearson’s correlation and visualized by scatter-plots using ggplot2 (v.3.3.2). Similarly, Pearson’s correlation coefficient was calculated between productive clonality and each of the 18,270 genes in the expression matrix. A volcano plot was used to visualize significant results (ggplot2). The top 50 genes with the highest correlation with TCR productive clonality were mapped to the Global Molecular Network and core network analysis was performed using Ingenuity Pathway Analysis software.

Data on all productive rearrangements per sample were exported from the immunoSEQ Analyzer Rearrangement Details View. This file includes the exact nucleotide sequence generated through V(D)J recombination, corresponding amino acid sequence, number of templates and productive frequency. Overlapping TCR sequences between tumor samples and matched healthy colon tissues (n = 9) were evaluated and visualized by scatter-plots (ggplot2). Sequences with a productive frequency at least 32-fold higher in the tumor compared to the healthy colon tissue and a tumor productive frequency >0.1% were defined as tumor-enriched sequences, as previously implemented by Beausang et al.75. The fraction of tumor-enriched TCR sequences in the tumor was calculated by dividing the number of productive templates of tumor-enriched sequences by the total number of productive templates per tumor sample. Pearson’s correlation coefficient between the fraction tumor-enriched TCR sequences and ICR score was calculated.

MiXCR for TCR repertoire derived from bulk RNA-seq

The software MiXCR (v.3.0.13)30 was used to retrieve the VDJ repertoire from bulk RNA-seq data aligned to reference genome GRCh37. MiXCR was run through docker and with the single command analyze shotgun. The R package ‘immunarch’ was used to analyze the MiXCR output into the R environment. For the TCRβ locus (TRB), the TCR clonality was calculated as 1 − normalized Shannon entropy (see Calculation section for details) for all samples, except seven cases for which MiXCR failed to identify clones.

Whole-exome-sequencing data analysis

Somatic mutation calling and small insertions and deletions

SNVs were called using mutect (v.1.1.7) and somatic small insertions and deletions (indels) using strelka2 (bcbio-nextgen v.1.1.1). We applied an optimized variant filtering pipeline (Extended Data Fig. 5b). To filter out false-positive single-nucleotide polymorphism calls, fpfilter was used, the applied filtering parameters are specified in the fpfiler.pl script shared on GitHub. Subsequently, MAF files were generated using VCFtoMAF tool (v.1.6.16), which also appended the SIFT (sorting intolerant from tolerant), PolyPhen and Exome Aggregation Consortium annotations. MAF files were loaded into R where indels with low complexity regions were excluded. For both SNVs and indels, a cutoff for minimum allele fraction of 5% and tumor depth of more than three reads was applied. The Exome Aggregation Consortium data were then used to filter out common variants that are encountered in >1% in the general population. After these technical exclusion criteria, biological filters were applied, including selection of nonsynonymous mutations (frame shift deletions, frame shift insertions, inframe deletions, inframe insertions, missense mutations, nonsense mutations, nonstop mutations, splice site and translation start site mutations). The resulting number of variants/mutations per Mb (capture size is 40 Mb) per sample is referred to as the nonsynonymous TMB. Next, to identify most frequently mutated genes in our cohort that might play a role in cancer, we excluded variants that are predicted to be tolerated according to SIFT annotation or benign according to PolyPhen (polymorphism phenotyping). Finally, all artifact genes, which are typically encountered as bystander mutations in cancer that are mutated for example as a consequence of a high homology of sequences in the gene, were excluded76. The OncoPlot function from ComplexHeatmap (v.2.1.2) was used to visualize the most frequent somatic mutations.

Comparison of TMB with TCGA datasets

To compare the TMB in the AC-ICAM with all 33 TCGA cohorts derived from the MC3 project, we used the tcgaCompare function from maftools (v.2.6.05, R). For AC-ICAM, the filtered MAF for nonsynonymous mutations was used as input with specified capture size of 40.

Comparison of somatic mutations with other cohorts

To define mutated genes in the AC-ICAM that were not previously described in colon cancer, we performed a comparison of the most frequently mutated genes in AC-ICAM (>5% of the tumor samples) with frequencies detected in previously published datasets containing colon cancer samples (TCGA-COAD and NHS-HPFS) as well as reported cancer driver genes32 or colon oncogenic mediators38. First, we extracted genes with a nonsynonymous mutation frequency >5% in the AC-ICAM cohort. Subsequently, only genes that are likely involved in cancer development, as described in the section ‘Cancer-related gene annotation’, were retained. All artifact genes (mutations typically encountered as bystander mutations in cancer that are mutated for example as a consequence of a high homology of sequences in the gene), were excluded. Genes that have previously been reported as colon cancer oncogenic mediator38 or cancer driver gene for colorectal cancer (COADREAD)32 were also excluded. Finally, only genes with a mutation frequency <5% in the NHS-HPFS colon cancer cohort37 and <5% in TCGA-COAD36 were maintained. As a final filter, only genes that had a nonsynonymous mutation frequency of at least twofold in AC-ICAM compared to TCGA-COAD were labeled as potentially new in colon cancer.

Estimation of MSI from whole-exome sequencing data

We applied MANTIS (v.1.0.4), a tool for rapid detection of microsatellite instability on our WES data77. Briefly, a bed file suitable for use by MANTIS was created using RepeatFinder function of the MANTIS tool, to find microsatellites regions within the reference genome (GRCh37). MANTIS was then run for each tumor and matched normal BAM file pair using these detected microsatellite loci. The instability score between the two samples within the pair was used to classify samples either as MSI-H (MANTIS score > 0.4) or MSS (MANTIS score ≤ 0.4).

Somatic mutations associated with ICR

We investigated the association of specific somatic alterations, including SNVs and small insertions or deletions (indels) and ICR immune phenotype. Binomial linear regression models were fitted to define which specific mutations associate with ICR score using the glm function with family ‘binomial’ (R). This analysis was performed in the total cohort (n = 281) as well as within hypermutated (n = 69) and non-hypermutated (n = 212) subgroups separately. The estimate and P value were extracted for each gene and FDR was calculated using the Benjamini–Hochberg method. Significant genes with an FDR < 0.1 and that were mutated in at least five patients in the analysis subgroup were plotted as OncoPrints (ComplexHeatmap, v.2.1.2).

Mutations in homologous recombination genes, mucinous histology, and ICR

Genes with an inverse association with ICR score within hypermutated colon cancer included genes involved in homologous recombination repair. The frequency of mutations in either of the identified genes (BRCA2, BRCA1 and FANCA) genes were compared between hypermutated cases of mucinous histology with hypermutated cases with other histological classifications. An unpaired Student’s t-test was used to compare ICR score between hypermutated cases of mucinous histology with hypermutated cases with other histological classifications.

Somatic copy-number alteration segmentation

A segmentation file was generated for each sample and later a merged file for all samples was uploaded to IGV (v.2.11.0). We have used a pipeline using GATK (gatk-package-4. beta.6) to generate each tumor sample’s segmentation file. We performed the below steps:

  1. 1.

    Calculated the coverage of tumor and normal BAM files for each interval using GATK CalculateTargetCoverage.

  2. 2.

    Generated the panel of ‘normal’ using normal samples by GATK CreatePanelOfNormals options.

  3. 3.

    Normalizing the tumor data using GATK NormalizeSomaticReadCounts methods using PON generated during the above step.

  4. 4.

    Performed the segmentation of tumor data using input files from the above steps using GATK PerformSegmentation.

  5. 5.

    The merged segmentation file of all the samples was uploaded to IGV and snapshots were generated.

Overview of SCNAs

We explored the prevalence of SCNAs among ICR clusters and hypermutated and non-hypermutated subgroups by exploration of the segmentation file in IGV. Briefly, the log2-transformed segmentation file was loaded in IGV with reference genome GRCh37, including an annotation text file including mutational load category (hypermutated, non-hypermutated), POLE mutation status, ICR cluster, CMS and MSI status. The samples were ordered consecutively by MSI status, CMS, ICR, POLE and mutational load category. Prevalence of amplification and deletions was visually inspected and compared between groups.

Genetic immunoediting and immunoediting score

HLA typing, neoantigen prediction and GIE

HLA typing was performed on both WES and RNA-seq data using OptiType (bcbio-nextgen v.1.1.5 in Python v.2.7.0)78. Neoantigen prediction tool pVACseq from pVACtools was run using the following predictors: MHCnuggetsI, NNalign, NetMHC, SMM, SMMPMBEC and SMMalign. The obtained vcfs from our somatic mutation calling pipeline were used as input for pVACseq, along with the predicted HLA type from WES data. Gene expression data aligned to GRCh37 in transcripts per million was annotated to the vcfs using vcf-expression-annotator. Mutant-specific binders, relevant to the restricted HLA-I allele, are referred to as neoantigens, as described in detail by Zhang et al.79. Mutated epitopes with a median IC50 binding affinity across all prediction algorithms used <500 nM, with a corresponding wild-type epitope with a median IC50 binding affinity > 500 nM, were used as criteria to infer neoantigens. Predicted neoantigens were used to calculate the GIE value. We calculated the GIE value by taking the ratio between the number of observed versus the number of expected neoantigens. The expected number of neoantigens was based on the assumption of a linearity between TMB and the number of neoantigens. We therefore assumed that samples that have a lower frequency of neoantigens than expected (lower GIE values), display evidence of immunoediting. A higher frequency of neoantigens than expected indicates a lack of immunoediting, see calculations section for details.

IES classification and analysis

The IES is a composite score based on both ICR and GIE. Tumors of IES4 are those predicted to be the most immune active, as they are ICR high and display GIE. Tumors of IES1 are expected to be most immune silent, classified both as ICR low and an absence of GIE. Tumors of the intermediate groups IES2 and IES3 reflect ICR-low and GIE and ICR-high and non-GIE tumors, respectively. Mutational load category, MSI status and pathological stage distribution was compared between IES groups using a chi-squared test. The OS was compared between patients with different IES and between GIE and non-GIE tumors in the ICR-medium group using Cox proportional hazard regression analysis. A Cox proportional hazard’s multivariate model was fitted with IES (ordinal) and pathological stage (ordinal).

Association between IES and TCR clonality

The Spearman correlation between IES as ordinal variable and TCR clonality from immunoSEQ as well as MiXCR-based clonality was calculated. We performed several additional analyses to assess whether the relationship between TCR productive clonality and IES was driven by ICR. Multiple regression analysis was performed with ICR score and immunoSEQ TCR clonality as continuous variables to predict productive TCR clonality (immunoSEQ). Second, the data were modeled through local polynomial regression fitting of the productive TCR clonality (immunoSEQ) by IES category (ordinal variable).

Microbiome: bacterial 16S rRNA PCR sequencing

This study complies with the STORM reporting guidelines; the completed checklist can be found in Supplementary Table 12.

The 16S rRNA gene sequencing was performed at the Host–microbe Interaction Laboratory, Sidra Medicine.

Hypervariable regions V3–V4 of 16S rRNA gene were amplified with PCR using the amplicon primers with Illumina adaptors (underlined):

Forward:

5′TCGTCGGCAGCGTCAGATGTGTATAAGAGACAGCCTACGGGNGGCWGCAG′3

Reverse:

5′GTCTCGTGGGCTCGGAGATGTGTATAAGAGACAGGACTACHVGGGTATCTAATCC′3.

In brief, PCR was performed in a 25-μl reaction mixture containing 5 μl each forward and reverse primer (1 μM), 2.5 μl template DNA for the samples and 12.5 μl 1× Hot Master Mix (Phusion Hot Start Master Mix). No human DNA depletion was used. The amplifications were performed on a Veriti 96-well Thermal Cycler (Thermo Scientific) with the following program: initial denaturation at 95 °C for 2 min, followed by 30 cycles of denaturation at 95 °C for 30 s, primer annealing at 60 °C for 30 s and extension at 72 °C for 30 s, with a final elongation at 72 °C for 5 min. The presence of PCR products was confirmed by electrophoresis in a 1.5% agarose gel conducted at 80 V/cm in Tris-borate–EDTA (TBE) buffer. Amplicons were then purified using AgenCourt AMPure XP magnetic beads (Beckman Coulter) according to the Illumina MiSeq 16S Metagenomic Sequencing Library Preparation protocol. As positive controls, we included DNA from stool samples (extracted with QIAGEN QIAmp Fast DNA Stool Mini kit), using the same input of DNA as the one used for the AC-ICAM samples. We obtained similar 16S rRNA amplicon PCR products across the tissue samples and the positive controls, indicating that the DNA extraction protocol used resulted in enough recovery of the microbial DNA from our specimens.

Samples were multiplexed using a dual-index approach with the Nextera XT Index kit (Illumina) according to the manufacturer’s instructions. The concentration of amplicons was determined using the Qubit HS dsDNA assay kit (Life Technologies,) followed by pooling to achieve an equimolar library concentration. The final pooled product was paired-end sequenced at 2 × 300 bp using a MiSeq Reagent kit v3 on Illumina MiSeq platform (Illumina) at the Sidra Medicine research facility. Sequencing was also performed on 27 empty wells across plates to exclude the occurrence of large-scale cross-contamination among samples during sequencing procedures: the minimum and maximum read counts were 2 and 234, respectively and the average and median reads counts were 37 and 18, respectively. No negative controls for sampling or DNA extractions were included. Samples were aliquoted randomly in the plate.

Microbiome: 16S rRNA gene sequencing and data processing

Sequenced data were demultiplexed using MiSeq Control Software. The overall quality of sequencing quality was evaluated using FastQC and the demultiplexed sequencing data were imported into Quantitative Insights into Microbial Ecology (QIIME2; v.2019.4.0) software package. The data were denoised with DADA2, which includes a multi-step process, including read filtering, dereplication and chimera removal. Paired 250-bp reads were trimmed of the initial five low-quality bases and further processed to generate the amplicon sequence variant, interchangeably called operational taxonomic units (OTUs). The data were subsampled at a depth of 22,704 and then normalized using the rarefaction on OTUs count at even depth. Taxonomic classification was performed utilizing 16S rRNA gene database from Silva classifier (silva-132-99-515-806-nb-classifier). The data were imported into R in a Biological Observation Matrix (biom) format, before further evaluation with Phyloseq (v.1.34.0). The 16S rRNA gene sequencing was performed on all samples with sufficient DNA available: 246 tumor samples and 246 matched healthy colon tissues from the same patients (AC-ICAM246) and on additional 42 tumor samples (ICAM42) for which there was no sufficient DNA available from the healthy colon counterpart.

The minimum and maximum read counts were 25,868 and 351,069, respectively. The average and median reads counts were 82,506 and 75,668, respectively. No samples were excluded from the analysis.

Alpha diversity (within sample community) was assessed by observed OTUs (sum of unique OTUs per sample), Chao1 (Chao 1987) an abundance-based richness estimator that is sensitive to rare OTUs, Shannon (Shannon 1948) and inverse Simpson (InvSimpson) (Simpson 1949), the last one being more dependent on highly abundant OTUs and less sensitive to rare OTUs. Indices were read into R using R package vegan (v.2.5–6).

Relative abundance of distinct microbiome elements was determined using the transform_sample_counts function from Phyloseq, such that sum of all abundance values per sample is equal to one (Microbiome_Relative = transform_sample_counts (pyloseq_object, function(x) x / sum(x))). OTU tables were aggregated by taxonomic ranks including phylum (26 unique phyla), class (48 classes), order (97 orders), family (207 families), genus (562 genera) and species (846 species). As the confidence for annotation of reads decreases with decreasing rank, some reads were only annotated with higher ranks.

Microbiome: WGS and data processing

Library construction and sequencing was performed at the Sidra Clinical Genomics Laboratory Sequencing Facility. DNA was quantified using the Quant-iT dsDNA Assay (Invitrogen) on the FlexStation 3 (Molecular Devices). The library was constructed from 250 ng of DNA with the Illumina TruSeq DNA Nano kit. Library quality and concentration was assessed using the DNA 1k assay on a PerkinElmer GX2 and qPCR using the KAPA Library quantification kit on a Roche LightCycler 480 II. Genomic libraries were sequenced with paired-end 150 bp on HiSeq X (32% of samples) and Novaseq 6000 (68% of samples) systems (Illumina) following the manufacturer’s recommended protocol to achieve a minimum average coverage 60× for tumor samples. Quality passed reads were aligned to the human reference genome GRCh38 using BWA. Human sequencing reads were removed and unaligned nonhost reads were extracted using SAMtools. Low-quality unaligned reads were trimmed and samples were processed for taxonomic profiling using MetaPhlAn2 (ref. 80). MetaPhlAn2 uses a library of unique clade-specific marker genes to estimate bacterial relative abundance at the species level. The program was run with default parameters except analysis type set to relative abundance and restricted to bacterial organisms only. WGS was targeted to achieve >60× coverage per sample. The median (across samples) of the average target coverage (per sample) was 76× (range of 50–92).

Of 3.2 × 1011 total reads (median 1.9 × 109 reads per sample; IQR 2.1 ×108), 1.5 × 108 (median 1×105 reads per sample; IQR 3.4 × 105) were aligned to bacteria. A total of 132 taxa, at genus level were detected, of which 3 were excluded as possible contaminants (Deinococcus, Ralstonia and Enhydrobacter)12 (main matrix). When we applied the same filter as the one used for 16S RNA gene-sequencing data (presence in at least 10% of the samples with at least 1% relative abundance in one sample), 54 taxa at the genus level were retained. WGS was performed in all samples with sufficient DNA available (n = 167).

Ruminococcus
bromii PCR

PCR was performed based on Wang et al.81 using R.bromii 16S rDNA forward primer (GAAGTAGAGATACATTAGGTG) and R.bromii 16S rDNA reverse primer (ACGAGGTTGGACTACTGA). PCR was performed using AmpliTaq Gold 360 Master Mix (Thermo Fisher, 4398881), 20 ng of sample DNA and 5 nM of each primer. The amplification conditions were one cycle of 95 °C for 10 min, then 35 cycles of 95 °C for 30 s, 50 °C for 30 s and 72 °C for 30 s and finally one cycle of 72 °C for 7 min before storing at 4 °C. PCR products (10 μl each) were separated by electrophoresis in 2% agarose gels (Sigma, A4718) containing ethidium bromide (1 μg ml−1) (Sigma, E1510) using a 100-bp DNA ladder (New England Biolabs, N0551G) for size verification. PCR band intensity was defined as negative when intensity was absent or extremely faint. PCR was considered positive if band was gradually more intense (graded from 2 to 4). PCR was performed in all samples from the AC-ICAM246 cohort with sufficient amounts of DNA available (n = 126).

Microbiome data analysis

Genus-level filtering

On tumor samples, microbiome genera were filtered to include genera which are present in at least 10% of the samples with at least 1% relative abundance in one sample; 138 out of 562 were retained. These included 137 genera and the genus labeled ‘unknown’ that reflects all reads for taxa with insufficient confidence at the genus level. The same filtering was applied to normal samples; 129 genera were retained. A total of 120 genera overlapped between normal and tumor samples, 9 genera were unique in normal samples and 18 genera were unique in tumor samples.

This set of filtered genera were used for all downstream analysis except for the comparison between tumor and normal pairs. For this analysis we include any genera that passed the filtering approach described above for either normal or tumor groups (if taxa passed the filtered in tumor samples they were retained in normal samples and vice versa; total 147 genera).

Contaminant assessment

To remove putative contaminants from the 16S rRNA gene-sequencing data, we used a list of microbial taxa that are typically found in negative blank reagents, as described by Salter et al.82. This list has previously been curated and annotated by Poore et al.12 by manual review of the literature. This curation allowed the discrimination of taxa that are ‘likely contaminants’, ‘potentially pathological or commensal genera’ and ‘mixed evidence’ genera that have been described both as pathogens as well as contaminants. We flagged those taxa that were ‘likely contaminants’ as well as ‘mixed evidence’ for potential exclusion from our 16S rRNA gene-sequencing microbiome abundance matrix.

In total, we detected 25 taxa that were ‘likely contaminants’ and 10 taxa with ‘mixed evidence’ in at least one out of the 492 samples. To evaluate the extent of potential contamination by these 35 taxa, we calculated the sum of these taxa for each sample. On average, only 0.04732% of the total microbial abundance per sample consisted of ‘flagged’ taxa (min, 0%; first quartile, 0%; median, 0%; third quartile, 0.03485%; and max, 4.46%). Furthermore, most of these putative contaminant taxa (n = 33) were detected in only fewer than 20 (out of 492) samples. Potential contaminating bacteria that we detected in the highest numbers of samples were Oxalobacter in 39 samples and Micrococcus in 28 samples. Detected putative contaminants and taxa with mixed evidence from the 16S rRNA-sequencing data were removed when we applied the minimal abundance filter (presence in at least 10% of the samples with at least 1% relative abundance in one sample).

Microbiome comparison between tumor and healthy colon tissue

At the phylum level, the overall distribution of microbiome composition was visualized using stacked bar charts. The order of samples was determined by descending relative abundance of the phylum Fusobacteria in tumor samples and the matching healthy colon samples from corresponding patients were ranked in the same order as the tumor stacked bar chart.

A paired Mann–Whitney U-test (two-sided) was used to determine microbial phyla/genera with significantly different relative abundance between tumor and paired normal samples. FDR was calculated using the Benjamini–Hochberg method. Results were visualized in volcano plots.

Microbiome comparison between ICR groups

An unpaired Mann–Whitney U-test (two-sided) was used to calculate which filtered genera (n = 138) were differentially abundant between ICR-high and ICR-low samples. FDR was calculated using the Benjamini–Hochberg method. Results were visualized in volcano plots.

Co-abundance network inference

We performed co-abundance analysis in tumor samples from the AC-ICAM246 cohort. Co-abundance analysis, which involves studying the presence of multiple components within a composition, can be difficult to perform accurately when using relative abundance. This is because the relative abundance of the different components is constrained to sum to 1, which can lead to the appearance of false correlations. To address this issue, techniques such as co-abundance network inference can be used to more accurately infer relationships between the components.

Before co-occurrence analysis, the genus labeled ‘unknown’ was excluded. SparCC48,83 was used to calculate the co-occurrences between the 137 remaining taxa using centered log-ratio (clr)-transformed OTU counts in tumor samples (Python, SparCC3). A total of 500 inference and 10 exclusion iterations were used to estimate the median correlation of each pairwise. The statistical significance of the correlations was calculated using a bootstrapping procedure to generate 500 simulated data83. For each component pair, pseudo P values (two-sided) were assigned as the proportion of simulated bootstrapped data with a correlation at least as extreme as the one computed for the original data. Benjamini–Hochberg FDR was used for multiple testing correction. All the correlations were then sorted using a statistically significant cutoff (FDR < 0.05) and SparCC correlation coefficient > ±0.3. Clusters among the networks (groups of at least three correlated genera using the cutoffs specified above) were defined via a fast greedy clustering algorithm. All co-occurrence networks were made using the R package ‘NetCoMI (v.1.1.0) – Network Construction and Comparison for Microbiome Data’84 and visualized using Cytoscape (v.3.9.1).

Within each cluster, the total relative abundance was calculated by summing up the relative abundance values for genera that positively correlated with each other. For each of the identified clusters, survival analysis was performed by binarizing each sample into high and low abundance based on the median total relative abundance of each cluster.

MBR model development, training set

We first normalized the genus abundance matrix by converting each genus column into a z score using mean and s.d. and treating the normalized abundance matrix as the training set. We built a relaxed multivariable elastic-net OS Cox regression model using the glmnet R package (v.4.1.4) on the training set. The optimal hyper parameters (γ and λ) for the best model were identified through fivefold cross-validation via a grid-search technique using the ‘cv.glmnet’ function. We used the concordance index as a performance metric. The parameters for which the mean cross-validation concordance index was the highest were selected as optimal hyper parameters. Next, the final model was built using these hyper parameters on the complete training set. To calculate risk scores in the training dataset (MBR scores), we passed the training set and best model to the ‘predict’ function. A total of 41 features (genera) were present in the best model with nonzero coefficients; we refer to these features as the ‘MBR classifier’, which represents the final model. A positive or negative coefficient of each genus of the MBR classifier can be binarized into ‘high-risk’ and ‘low-risk’ groups using the cutoff threshold of 0 and attributed to the strength of association with survival. A higher positive coefficient means high hazard risk of death, whereas a negative coefficient corresponds to lower risk of death.

MBR model validation, testing sets

We validated the final model on two datasets. Both datasets consist of samples that were not used for model training (unseen data). One is an independent internal (ICAM42) dataset, referred to as testing cohort 1 and the other is an external cohort (TCGA-COAD), referred to as testing cohort 2. The ICAM42 consists of 42 samples and TCGA-COAD consists of 117 samples. We processed the two datasets to convert the abundance values for each genus into z scores using the mean and s.d. derived from the training set. These abundance matrices were passed to the ‘predict’ function along with the best model to estimate corresponding risk scores. The risk score (MBR score) of any tested sample is only dependent on the relative abundance of the list of genera that overlap with the ones included in the MBR classifier (the risk score for each sample is not dependent on one of the other samples). Finally, the MBR scores are binarized using the cutoff threshold 0 to categorize the test sample into ‘high-risk’ (>0) and ‘low-risk’ (<0) groups as performed in the training set. Therefore, no cutoff optimization occurred in the validation phase.

MBR model performance assessment

We tested the concordance index (1) in the training set using the final MBR model; (2) in the training set using the cross-validation of the best MBR model (five permutations, 80% training and 20% validation partition); and (3) in each test set cohort separately (ICAM42 and TCGA-COAD) and in the full test set (ICAM42 and TCGA-COAD combined) using the final MBR model.

Taxa used for the MBR score calculation in other cohorts

To calculate the MBR score in each additional dataset we used taxa that overlapped with the 41 genera of the MBR classifier, which was developed using 16S rRNA gene sequencing on tumor samples.

There were 16 of 41 taxa in the TCGA-COAD (WGS data) and 18 of 41 taxa in the AC-ICAM WGS data (tumor sample) main matrices. All the 41 taxa were available in the ICAM42 cohort (tumor samples) and the MBR score for AC-ICAM healthy colon tissue samples was based on 36 genera that passed the applied genus-level filtering for healthy tissue (the list of the taxa used for each platform is available in Supplementary Table 11).

The Silva classifier used for genus attribution in the 16S rRNA gene-sequencing data includes ‘Ruminococcus 1’ and ‘Ruminococcus 2’, whereas WGS-WES TCGA data only include ‘Ruminococcus’ as genus-level taxa. Therefore, for matching purposes, when calculating the risk score, we replaced the labeling of ‘Ruminococcus 1’ and ‘Ruminococcus 2’ with ‘Ruminococcus’. In WGS AC-ICAM ‘R.bromii’ was used instead.

R.
bromii validation analysis

We characterized the specific species underlying the reads supporting the Ruminococcus 2 taxon from 16S sequencing data. Previously, a high degree of sequence similarity was reported between the Ruminococcus 2 taxa from the Silva classifier and the species R.bromii85. The subset of samples that had both 16S sequencing and WGS data available was used to calculate the Spearman correlation between each Ruminococcus species (from WGS data) and the 16S Ruminococcus 2 (16S) relative abundance. In addition, the proportion of WGS reads that mapped to each specific Ruminococcus species was calculated as fraction of all WGS reads that were assigned to the Ruminococcus genus.

To confirm the presence of R.bromii, we performed a PCR specific to R.bromii on the 126 AC-ICAM tumor samples for which sufficient DNA was still available (see section R.bromii PCR for technical details on PCR). The concordance between detection of R.bromii in PCR and 16S Ruminococcus 2 was defined as the percentage of samples for which both methods had identical results. The discordant cases were further examined by evaluation of WGS results. Furthermore, the concordance between detection of R.bromii in PCR and R.bromii in WGS was assessed in the 86 samples for which data from both methods were available.

mICRoScore development

In view of the individual contribution of analytes extrapolated by individual platforms such as the ICR (RNA-seq data), the GIE (WES data) and the MBR scores (16S data) and TCR clonality (immunoSEQ and MiXCR) we sought to develop a multi-omics parameter that could capture a subgroup of patients with exceptional survival.

Each parameter that was significant in the univariate Cox regression analysis (ICR, as ordinal variable, low, medium, high; GIE as binary variable, non-GIE versus GIE; and MBR score, as binary variable, low versus high), was assessed within a multivariable Cox regression model adjusted for age (as continuous variable), CMS subtypes (as categorical variable, CMS1–CMS3, versus CMS4), stage (as ordinal variable, I, II, III and IV) and MSI status (as binary variable, MSS versus MSI-H). The parameters that were retained by the multivariable Cox models were combined into an integrated score. For univariate analysis we used RNA-seq, WES and TCR clonality data from the entire AC-ICAM cohort and MBR score derived from 16S rRNA gene-sequencing data of the AC-ICAM246 cohort.

The mICRoScore reflects the co-presence of ICR high and MBR low risk, defined as mICRoScore high. On the AC-ICAM246 (training set), all samples with MBR-high risk and/or in ICR-medium or ICR-low group are defined as mICRoScore low. The survival between patients with mICRoScore high and mICRoScore low was compared using Cox proportional hazard regression analysis and a log-rank test.

mICRoScore validation

We used data from TCGA-COAD as external validation cohort to test the mICRoScore (testing set). The TCGA-COAD cohort includes 107 patients with both tumor microbiome data (downloaded from Dohlman et al.10) and RNA-seq data available (used for ICR estimation). ICR assignments from this cohort (see section TCGA data) were combined with the MBR classification to classify patients as mICRoScore high and mICRoScore low. The survival between patients with mICRoScore high and mICRoScore low was compared using a log-rank test.

Sample size considerations

Sample size calculation is challenging in multi-omics studies due to the multitude of parameters that could be examined (implying the use of different tests from different platforms generating data with different data distribution) and empirical methods have been used by many consortia. Correlation between ICR and survival was declared as a primary objective in the research proposal submitted to the funding agency before any genomic data were generated, representing therefore a prospective–retrospective validation (JSREP07-010-3-005).

In the submitted proposal (2015), we planned to profile 400 tumors for gene expression analysis (samples from 456 patients were screened, samples from 391 patients were available for processing and samples from 348 patients retained after QC in the final cohort, see Extended Data Fig. 1) and at least 100 tumor–normal pairs for WES analysis (initially planned only for a subgroup of ICR-high versus -low tumors) and 100 for TCR sequencing using the immunoSEQ assay considering the high amount of DNA that is necessary (>2 μg). Securing additional funds allowed us to perform WGS and 16S rRNA sequencing and to expand the WES and TCR analyses to any sample with sufficient DNA available. No specific power calculation was performed at that time and the targeted sample size was based on the estimated number of samples that could be retrieved from LUMC (n = 400), which compared favorably with the sample size of similar studies in the field.

Regarding detection of somatic mutations and considering the overall somatic mutations frequency in colon cancer, 150 tumor exomes will give a power >90% to detect a 10% mutational frequency in 90% of genes86.

Regarding survival analysis, in terms of ICR (the primary objective in the submitted proposal), for the comparison between ICR high versus ICR low, with 77 OS events detected, our study has a power >80% for an HR of 0.5 with a two-sided α of 0.05. With 154 OS events in the whole cohort, our study has a power of 90% for an HR of 0.59 (assuming two group of equal size c) and a power of 90% for an HR of 0.57 (assuming groups with unequal sample size, 2:1) with a two-sided α of 0.05.

Calculations

TCR clonality calculation by immunoSEQ assay data (targeted DNA)

Entropy (H) is calculated by a standard Shannon entropy calculation with log base 2. Clonality is the inverse of the normalized entropy calculation. The equations are below:

$${{{\mathrm{Shannon}}}}\;{{{\mathrm{entropy:}}}}\;{{{{H}}}}\left( {{{\mathrm{x}}}} \right){{{\mathrm{ = – }}}}\Sigma {{{{P}}}}\left( {{{\mathrm{x}}}} \right){{{\mathrm{log}}}}_{2}\left[ {{{{{P}}}}\left( {{{\mathrm{x}}}} \right)} \right]$$

Specifically, for our data:

For a productive (inframe) sequence x,

$${{{{P}}}}\left( {{{\mathrm{x}}}} \right) = {{{\mathrm{sequence}}}}\;{{{\mathrm{count}}}}/{{{\mathrm{total}}}}\;{{{\mathrm{productive}}}}\;{{{\mathrm{count}}}}$$

$$\begin{array}{l}{{{\mathrm{Entropy}}}}=\\ {{{\mathrm{ – 1}}}}\; \times \;{{{\mathrm{the}}}}\;{{{\mathrm{sum}}}}\;{{{\mathrm{over}}}}\;{{{\mathrm{all}}}}\;{{{\mathrm{unique}}}}\;{{{\mathrm{productive}}}}\;\left( {{{{\mathrm{inframe}}}}} \right)\;{{{\mathrm{sequences}}}}\;{{{\mathrm{of}}}}\\ \left(\right. \left( {{{{\mathrm{sequence}}}}\;{{{\mathrm{count}}}}/{{{\mathrm{total}}}}\;{{{\mathrm{productive}}}}\;{{{\mathrm{count}}}}} \right)\\ \times {{{\mathrm{log}}}}_{2}\left( {{{{\mathrm{sequence}}}}\;{{{\mathrm{count}}}}/{{{\mathrm{total}}}}\;{{{\mathrm{productive}}}}\;{{{\mathrm{count}}}}} \right) \left.\right)\end{array}$$

$$\begin{array}{l}{{{\mathrm{Normalized}}}}\;{{{\mathrm{entropy}}}} =\\{{{\mathrm{entropy}}}}/{{{\mathrm{log}}}}_{2}\left( {{{{\mathrm{productive}}}}\;{{{\mathrm{unique}}}}\;{{{\mathrm{inframe}}}}\;{{{\mathrm{sequences}}}}} \right)\end{array}$$

$${{{\mathrm{Clonality}}}} = 1 – {{{\mathrm{normalized}}}}\;{{{\mathrm{entropy}}}}$$

TCR clonality calculation on bulk RNA-seq data (MiXCR)

Entropy (H) is calculated by a standard Shannon entropy calculation with log base 2. The equations are below:

$${{{\mathrm{Shannon}}}}\;{{{\mathrm{entropy}}}\;{{{{H}}}}}\left( {{{\mathrm{x}}}} \right){{{\mathrm{ = – }}}}\Sigma {{{{P}}}}\left( {{{\mathrm{x}}}} \right){{{\mathrm{log}}}}_2\left( {{{{{P}}}}\left( {{{\mathrm{x}}}} \right)} \right)$$

For a sequence x,

$${{{{P}}}}\left( {{{\mathrm{x}}}} \right){{{\mathrm{ = sequence}}}}\;{{{\mathrm{count}}}}/{{{\mathrm{total}}}}\;{{{\mathrm{count}}}}$$

The Shannon entropy was normalized so that it can assume a value between 0 and 1. The normalized Shannon entropy is referred to as Pielou’s evenness and is calculated as below:

$${\mathrm{Pielou}}\hbox{‘}{\mathrm{s}}\;{{{\mathrm{evenness:}}}\;{{{{J = H}}}}/{{{\mathrm{log}}}}}\left( {{{{S}}}} \right)$$

where S is the number of unique TCR/CDR3 sequences.

Clonality is calculated as the inverse of the normalized entropy (J) calculation:

$${{{\mathrm{Clonality = 1 – }{J}}}}$$

Genetic immunoediting value

The GIE value is calculated by taking the ratio between the observed (O) versus the expected (E) number of neoantigens:

$${{{\mathrm{GIE}}}}\;{{{\mathrm{value = O/E}}}}$$

in which E is a function of the number of nonsynonymous mutations in that specific sample (x):

$${{{\mathrm{E}}}}\left( {{{\mathrm{x}}}} \right){{{\mathrm{ = – 2}}}}{{{\mathrm{.38770 + 0}}}}{{{\mathrm{.09171}}}} \times {{{\mathrm{x}}}}$$

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.



Source link

Leave a Reply

Your email address will not be published. Required fields are marked *