Thanks to visit codestin.com
Credit goes to www.scribd.com

0% found this document useful (0 votes)
62 views37 pages

Cancer Microbiome Atlas

The Cancer Microbiome Atlas (TCMA) is a public database that provides decontaminated microbial profiles from gastrointestinal cancer tissues in The Cancer Genome Atlas (TCGA), facilitating the identification of prognostic species and enabling multi-omic analyses of host-microbe interactions. The study reveals that many microbial species previously thought to be tissue-resident are actually contaminants, and it presents a statistical model to differentiate between these groups. This resource aims to enhance understanding of the microbiome's role in cancer and guide future research in host-microbe interactions.

Uploaded by

daher melo
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
62 views37 pages

Cancer Microbiome Atlas

The Cancer Microbiome Atlas (TCMA) is a public database that provides decontaminated microbial profiles from gastrointestinal cancer tissues in The Cancer Genome Atlas (TCGA), facilitating the identification of prognostic species and enabling multi-omic analyses of host-microbe interactions. The study reveals that many microbial species previously thought to be tissue-resident are actually contaminants, and it presents a statistical model to differentiate between these groups. This resource aims to enhance understanding of the microbiome's role in cancer and guide future research in host-microbe interactions.

Uploaded by

daher melo
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 37

Resource

The cancer microbiome atlas: a pan-cancer


comparative analysis to distinguish tissue-resident
microbiota from contaminants
Graphical abstract Authors
Anders B. Dohlman,
Diana Arguijo Mendoza,
Shengli Ding, ..., Iliyan D. Iliev,
Steven M. Lipkin, Xiling Shen

Correspondence
[email protected] (A.B.D.),
[email protected] (X.S.)

In Brief
Dohlman et al. present The Cancer
Microbiome Atlas, a public database of
decontaminated, tissue-resident
microbial profiles of TCGA
gastrointestinal cancer tissues. As these
profiles are matched to specific TCGA
tissue samples, this work allows
identification of prognostic species and
provides a resource for performing multi-
omic, pan-cancer analyses of host-
microbe interactions.

Highlights
d Decontaminated microbial compositions for 3,689
gastrointestinal cancer samples

d Resolved ‘‘mixed-evidence’’ species with gene and


nucleotide resolution

d Identified prognostic species and blood signatures of


mucosal barrier injury

d Enabled matched multi-omic, pan-cancer analyses of host-


microbe interactions

Dohlman et al., 2021, Cell Host & Microbe 29, 281–298


February 10, 2021 ª 2020 Elsevier Inc.
https://doi.org/10.1016/j.chom.2020.12.001 ll
ll

Resource
The cancer microbiome atlas: a pan-cancer
comparative analysis to distinguish tissue-resident
microbiota from contaminants
Anders B. Dohlman,1,* Diana Arguijo Mendoza,1 Shengli Ding,1 Michael Gao,2 Holly Dressman,3 Iliyan D. Iliev,4
Steven M. Lipkin,4 and Xiling Shen1,5,*
1Department of Biomedical Engineering, Center for Genomics and Computational Biology, Duke Microbiome Center, Duke University,

Durham, NC 27708, USA


2Duke Institute for Health Innovation, Duke University, Durham, NC 27701, USA
3Department of Molecular Genetics and Microbiology, Director of Duke Microbiome Center, Duke University, Durham, NC 27708, USA
4Department of Medicine, Weill Cornell Medical College, Cornell University, New York City, NY 10065, USA
5Lead contact

*Correspondence: [email protected] (A.B.D.), [email protected] (X.S.)


https://doi.org/10.1016/j.chom.2020.12.001

SUMMARY

Studying the microbial composition of internal organs and their associations with disease remains chal-
lenging due to the difficulty of acquiring clinical biopsies. We designed a statistical model to analyze the prev-
alence of species across sample types from The Cancer Genome Atlas (TCGA), revealing that species
equiprevalent across sample types are predominantly contaminants, bearing unique signatures from each
TCGA-designated sequencing center. Removing such species mitigated batch effects and isolated the tis-
sue-resident microbiome, which was validated by original matched TCGA samples. Gene copies and nucle-
otide variants can further distinguish mixed-evidence species. We, thus, present The Cancer Microbiome
Atlas (TCMA), a collection of curated, decontaminated microbial compositions of oropharyngeal, esopha-
geal, gastrointestinal, and colorectal tissues. This led to the discovery of prognostic species and blood
signatures of mucosal barrier injuries and enabled systematic matched microbe-host multi-omic analyses,
which will help guide future studies of the microbiome’s role in human health and disease.

INTRODUCTION sites and disease states, and these sequencing datasets contain
a significant number of sequencing reads of microbial origin
The human body supports an ecosystem of 10–100 trillion micro- (Kostic et al., 2011; Poore et al., 2020; Robinson et al., 2017).
organisms (Luckey, 1972; Sender et al., 2016), representing 500– Large sequencing projects can, thus, be mined to promote un-
1,000 unique species per individual (Human Microbiome Project derstanding of host-microbe interactions in both healthy and
Consortium, 2012; Qin et al., 2010). Perturbations to this diseased human tissue. To that end, the bioinformatics tool
ecosystem, termed dysbiosis, can impact human health: micro- PathSeq (Kostic et al., 2011) was used to identify enrichment
bial alterations have been implicated in a variety of health condi- of Fusobacterium nucleatum in The Cancer Genome Atlas
tions, including obesity, diabetes, inflammatory bowel disease, (TCGA) colorectal cancer (CRC) tumors (Kostic et al., 2013; Kos-
cancer, and other diseases (Elinav et al., 2019; Iliev and Leo- tic et al., 2012). Since then, dozens of research articles explored
nardi, 2017; Levy et al., 2017; Schirmer et al., 2019). While public the role of F. nucleatum in tumorigenesis, finding associations
microbiome projects such as the Human Microbiome Project with stage, survival, metastasis, and even drug response (Bull-
(HMP) and MetaHIT have helped bring tremendous insights man et al., 2017; Flanagan et al., 2014; Yu et al., 2017). More
into the diversity and function of human flora, these databases broadly, sequencing data from TCGA has been used ad hoc to
are dominated by tissue swabs and stool samples that do not screen for viral and bacterial presence in stomach adenoma
necessarily reflect the microbial composition of internal organs (Cancer Genome Atlas Research Network, 2014) and cervical
(Grice et al., 2008; Prast-Nielsen et al., 2019). Collection of clin- cancer (Cancer Genome Atlas Research Network et al., 2017)
ical biopsies specifically dedicated to microbial profiling remains specifically, as well as viromes (Tang et al., 2013) and bacter-
difficult despite many disease-related host-microbe interactions iomes (Robinson et al., 2017). Recently, analysis of TCGA
occurring at the epithelium of internal body sites. sequencing data has been used to demonstrate the potential
Next-generation sequencing (NGS) is frequently used to pro- for bloodborne microbial DNA to diagnose certain cancers
file biopsied human tissue samples at a broad range of body (Poore et al., 2020). Given that even low-biomass tumors contain

Cell Host & Microbe 29, 281–298, February 10, 2021 ª 2020 Elsevier Inc. 281
ll
Resource

Figure 1. WGS and WXS harbor colorectal bacterial reads distinct from blood and brain
See also Figure S1
(A) Matched analysis of bacterial sequencing reads per million (RPM) in normal tissue (yellow), tumor tissue (blue), and blood (red) from CRC and BC patients in
TCGA. Significance is given by paired, one-sided t tests.
(B) Abundance data from (A) but comparing solid tissue (pooled tumor and normal) with blood samples from BC (green) and CRC (brown) patients. Significance is
given by one-sided t tests.
(C) Comparison of bacterial species prevalence in WGS data for CRC blood and CRC tissue samples reveals populations of tissue-enriched species (blue) and
species that are equiprevalent in blood and tissue (red). Black circles denote species associated with MBI.
(legend continued on next page)
282 Cell Host & Microbe 29, 281–298, February 10, 2021
ll
Resource

tissue-specific microbiomes (Nejman et al., 2020), analysis of tracted from TCGA sequencing data matched the microbial
microbial DNA and RNA in TCGA sequencing data has great po- composition of the original tissue samples.
tential for diagnostic applications, as well as for exploring host- Finally, we ran the vetted decontamination algorithm to estab-
microbe interactions along molecular and clinical axes. lish the TCMA database, which users can access from an inter-
However, few actionable microbiota targets like F. nucleatum active website (https://tcma.pratt.duke.edu). The database
have emerged from such analyses. When examining a subset of contains curated tissue-resident microbial profiles for 4,937
TCGA sequencing data, previous analyses (Robinson et al., sequencing runs on 3,689 unique samples from 1,772 patients
2017) found that microbial reads from a number of species representing 5 TCGA projects and 21 anatomic sites with tis-
were the result of contamination, and that distinguishing sue-resident populations. As proof-of-principle, we used
contamination from tissue-embedded microbes remained an TCMA to identify two bacterial co-abundance groups in CRC tis-
outstanding challenge for use of this dataset. Indeed, while con- sue, including species enriched in CRC tumors compared with
cerns over contamination are less pressing for samples with high matched adjacent normal tissue and species prognostic of
microbial biomass such as stool or swabs, microbiome studies patient survival. TCMA enabled a matched microbe-host tran-
on low-biomass samples suffer from contamination during sam- scriptomic, proteomic, and epigenetic analysis that identified as-
ple collection and DNA extraction. Contamination can originate sociations between microbes and host gene expression patterns
from the laboratory environment, including nucleic acid extrac- and pathways. Finally, by comparing TCMA-curated blood sam-
tion kits (Davis et al., 2018; Eisenhofer et al., 2019; Glassing ples of CRC and brain cancer (BC) patients, we identified a bac-
et al., 2016). Thus, controlling for contamination in these data- terial signature associated with colorectal mucosal barrier injury
sets is a crucial step that must precede downstream analyses unique to CRC blood samples. Thus, TCMA constitutes a power-
of host-microbe interactions. Samples for multi-institutional pro- ful resource for validation and hypothesis generation in future
jects are acquired, processed, and sequenced at different sites, studies of host-microbe interactions relevant to cancer.
each of which may introduce its own contaminants that influence
the extracted microbial profiles, impede reproducibility, and RESULTS
complicate discovery of microbial biomarkers. Thus, a number
of strategies have been deployed to identify contamination in WGS and WXS harbor colorectal bacterial reads distinct
TCGA sequencing data, through examination of batch effects, from blood and brain
sample analyte concentrations, and through manual curation To explore the microbial populations of sequenced TCGA tissue,
(Poore et al., 2020; Robinson et al., 2017). To date, such analyses we began by analyzing multi-platform sequencing data for 730
have never been validated by original TCGA tissue or blood sam- tissue and 555 blood samples from 617 CRC (TCGA projects
ples, nor have decontaminated TCGA microbiome datasets COAD/READ) patients and for 958 tissue and 914 blood samples
been made readily available. from 923 BC (TCGA projects GBM/LGG) patients. For several
Sequencing data in TCGA provide a unique opportunity for thousand whole-genome sequencing (WGS) and whole-exome
identifying tissue-specific microbiota, since matched tissue sequencing (WXS) experiments, we retrieved raw sequencing
and blood samples from various cancer types are processed data from the TCGA database and extracted and mapped
and sequenced in parallel using various sequencing platforms high-quality reads of bacterial origin using PathSeq (Kostic
at designated centers (Choi et al., 2017). Using an unbiased sta- et al., 2011). We found that microbial reads were more abundant
tistical model comparing the prevalence of microbial species in and more diverse in solid tissue than in matched blood samples
tissue and blood samples, we isolated the tissue-resident micro- from CRC patients; in contrast, the abundance and diversity of
biome in TCGA sequencing data. We found that species equally microbial reads were no greater in BC tissue than matched blood
prevalent across tissue types and blood samples are mostly ar- samples (Figures 1A and S1A). Furthermore, CRC tissue had
tifacts or contaminants that (1) bear unique signatures from the more abundant and diverse microbiota than BC tissue (Figures
designated TCGA sequencing center and (2) comprise more 1B and S1B), consistent with the notion that blood and brain tis-
than half of all detectible microbial sequencing reads in many tis- sue are more sterile than colorectal tissues. Notably, microbial
sue samples. With gene- and nucleotide-level resolution, our reads were also more abundant and diverse in blood samples
model is also capable of normalizing read counts for ‘‘mixed-ev- from CRC patients than in those of BC patients (Figures 1B
idence’’ cases, in which sequencing reads aligning a given and S1B).
species may come from a combination of endogenous and Comparative analysis of microbial reads between CRC tissue
contaminant microbiota. To validate our approach, we obtained and blood samples revealed two distinct groups of bacterial spe-
original matched CRC tissue and plasma samples that were pre- cies: those enriched in tissue and those equally prevalent in tis-
viously sequenced by TCGA and performed 16S rRNA amplicon sue and blood (Figures 1C and S1C). Species were seldom more
sequencing. This independently confirmed not only the absence prevalent in blood than in tissue. Species that were more preva-
of putative contaminants but also that the tissue-resident, lent in CRC tissue than in CRC blood included many species
computationally decontaminated microbial profiles that we ex- known to be associated with mucosal barrier injury (MBI)

(D) Comparison of bacterial species prevalence in WGS data for CRC and BC samples reveals populations of CRC-enriched species (brown) and species that are
equiprevalent in CRC and BC (green).
(E) Relative abundance of bacterial phyla in WGS data for tissue (top) and blood (bottom) samples from CRC (left) and BC (right) patients.
(F and G) Heat tree comparing relative abundance of bacteria in WGS data for (F) matched blood samples (red) versus tissue samples (blue) and (G) CRC tissue
(brown) versus BC tissue (green).

Cell Host & Microbe 29, 281–298, February 10, 2021 283
ll
Resource

Figure 2. Most equiprevalent taxa are common contaminants and associated with particular sequencing centers
See also Figure S2
(A) Genera commonly found in negative controls of metagenomic sequencing experiments (Eisenhofer et al., 2019) are highly prevalent in blood samples.

(legend continued on next page)


284 Cell Host & Microbe 29, 281–298, February 10, 2021
ll
Resource

(CDC, 2019), whereas the group equally present in CRC tissue negative controls of metagenomic sequencing experiments (Ei-
and blood contained very few such species (Figures 1C and senhofer et al., 2019). Overall, genera in this ‘‘common contam-
S1C). By comparison, nearly all species detected in samples inant’’ set were more prevalent in blood samples (p = 4.45e 10;
from BC patients were equiprevalent in tissue and blood, with Figures 2B and 2A) than genera not in the list.
few enriched in tissue (Figures 1C and S1C). Similar comparative Species equiprevalent in CRC tissue and blood were also
analyses of tissue and blood from CRC and BC patients showed considerably more genetically and phenotypically diverse than
significant populations of disease-enriched species for CRC but species enriched in CRC with respect to their G-C content,
few for BC (Figures 1D and S1D). We then repeated this compar- genome size, and optimal growth conditions. Conversely,
ative prevalence analysis using samples from ovarian cancer CRC-enriched species were much less tolerant to extreme
(OVC; TCGA project OV; Figures S1E and S1F). growth conditions, with optimal temperature, pH, and NaCl
The microbial composition of CRC tissue samples was also levels more closely resembling those of human homeostasis
markedly distinct from that of matched blood samples or BC tis- (Figures 2C and S2B). Together, these results suggest that the
sue samples. Among the most dominant phyla in CRC tissue equiprevalent group may contain contaminant species, which
were Bacteroidetes and Firmicutes, which were relatively absent have larger genomes—a signature of ‘‘generalist’’ bacteria that
in blood and brain tissue samples (Figures 1E and S1G). Next, we must endure more variable and unstable environmental condi-
compared the relative abundance of bacterial taxa in CRC tissue tions than gut microbiota (Sriswasdi et al., 2017).
versus matched blood samples (Figure 1F) and CRC tissue
versus BC tissues (Figure 1G) from different donors. These ana- Equiprevalent species are associated with particular
lyses were largely consistent, with taxa from Bacteroidetes, sequencing centers
Firmicutes, and Fusobacteria clades being consistently overrep- Principle coordinate analysis (PCoA) of UniFrac distances be-
resented in CRC tissue, compared with Proteobacteria and Ac- tween CRC samples demonstrated that the primary axis of
tinobacteria, which accounted for a relatively greater fraction of variation in TCGA microbiome data could be attributed to differ-
reads in CRC blood samples and BC tissue samples. Genera ences between blood samples and tissue samples (41.43%)
that were relatively more abundant in CRC blood samples or (Figures 2D and S2C). Interestingly, the second axis of microbial
BC tissue samples compared with CRC tissue samples consis- variation was determined by the sequencing center at which the
tently included Acinetobacter, Mycobacterium, and Ralstonia, samples were processed (17.20%), regardless of sample type.
among others (Figures 1F and 1G). Metagenomic profiling of All TCGA samples (tissue and blood) were harvested at a tissue
TCGA samples using Kraken2 (Wood et al., 2019) largely recapit- source site and then sent to designated genome sequencing
ulated PathSeq results (Figures S1H–S1J). centers (Figure S2D). While the first PCoA axis captured differ-
Together, these comparative analyses were capable of distin- ences in the presence of tissue-enriched species that are more
guishing species enriched in CRC tissues from those with similar abundant in CRC tissue than in blood, the second axis captured
prevalence across different blood samples and disease types. species found in both tissue and blood samples at similar levels,
The analyses confirmed that bacteria in CRC tissues were (1) many of which were associated with sequencing center (Figures
more diverse and abundant and (2) were enriched for mucosa- 2D, S2E, and S2F). We then examined the abundance of equi-
related species. prevalent species in blood samples and found significant clus-
tering according to the sequencing centers at which the samples
Species equiprevalent in tissue and blood are were processed (Figures 2E and S2F). For comparison, we per-
predominantly contaminants formed the same analysis on tissue and blood samples from BC
Besides species enriched in CRC tissues, a significant number of patients, which revealed no discernible variation between tissue
detected species were equally prevalent in blood, CRC tissue, and blood samples, but rather significant clustering by
BC tissue, and OV tissue (Figures 1C, 1D, and S1C–S1F). While sequencing center (Figures S2G and S2H).
compromised epithelial barrier function may allow the transloca- Therefore, the majority of species equiprevalent in tissue and
tion of microorganisms to the bloodstream at low levels (Chelak- blood are not endogenous but are mostly artifacts introduced
kot et al., 2018), we expected that such species would be prev- during processing and profiling at respective sequencing cen-
alent at much lower levels in blood than in CRC tissue. To ters. For ease of description, we will refer to equiprevalent
analyze equiprevalent species and determine their origin, we first species as ‘‘contaminants’’ and tissue-enriched species as ‘‘tis-
examined a set of 70 bacterial genera known to be enriched in sue-resident’’ for the remainder of the article. However, the

(B) Prevalence of common contaminants in blood correlates with absolute abundance.


(C) Genome size and temperature tolerance of equiprevalent species are differential (pW, Wilcoxon’s test) and more variable (pL, Levine’s test) than tissue-en-
riched species.
(D) PCoA of WGS data for CRC samples reveals considerable variation between blood samples and tissue samples along the first axis of variation and batch
effects along the second axis.
(E) Heatmap clustering of bacterial species’ abundance in blood samples demonstrates the presence of center-specific contamination. The left vertical axis
shows each species’ prevalence (gray).
(F) The fraction of all bacterial reads that is contamination in normal (yellow), tumor (blue), and blood (red) samples from CRC patients.
(G) The fraction of bacterial reads that is contamination in WGS data of normal (yellow), tumor (blue), and blood (red) from CRC patients, broken down by the five
most prevalent phyla.
(H) Correlations between centered log ratio (CLR)-transformed relative abundances of WGS and WXS data for the five most prevalent phyla in tissue samples.
Phyla contributing the most contaminant reads have the lowest correlation between assays.

Cell Host & Microbe 29, 281–298, February 10, 2021 285
ll
Resource

Figure 3. Detecting tissue-resident and contaminant species with gene-level resolution


See also Figure S3
(A–C) Prevalence of genes belonging to B. vulgatus (A; tissue-resident), A. junii (B; contaminant), and E. coli (C; mixed-evidence) in blood versus tissue. The large
dot indicates species-level prevalence.
(D–F) Kernel-density estimate of gene prevalence in blood (red) and tissue (blue) for B. vulgatus (A), A. junii (B), and E. coli (F).
(legend continued on next page)
286 Cell Host & Microbe 29, 281–298, February 10, 2021
ll
Resource

equiprevalent population may still contain biologically relevant Detecting tissue-resident and contaminant species with
species that are detected in both tissue and blood. gene-level resolution
For species designated as tissue-resident (e.g., B. vulgatus) or
A generalizable model for isolating tissue-resident as contamination (e.g., A. junii), we subsequently explored the
microbiota in TCGA tumor samples extent to which microbial genes could be reliably detected in
Based on the comparative analyses of prevalence in tissue and TCGA sequencing data. Using annotated genomes to search
blood, we developed a generalizable statistical model to distin- for gene-level assignments, we found that for many such spe-
guish tissue-resident microbiota from contaminant species cies, sequencing alignments provided coverage of the full micro-
across cancer types in TCGA. Of the 1,136 bacterial species bial genome. As expected, gene prevalence profiles of tissue
detectable in more than 5% of CRC tissue samples, this model and blood samples largely recapitulated those of species-level
classified 769 species as tissue-resident (67.69%) and 367 spe- assignments (Figures 3A, 3B, S3A, and S3B). For tissue-resident
cies as contaminants (32.31%). Tissue-resident populations species, the gene prevalence distribution was much lower in
identified by comparing prevalence in tissue and blood were blood samples than tissue, while for contaminants, the gene
largely consistent with prevalence comparisons of CRC tissue prevalence distribution of blood and tissue samples were nearly
with BC tissue, as well as analogous comparisons made with identical (Figures 3D, 3E, S3C, and S3D); genome coverage was
WXS data (Figure S2I). The model was used to perform binary greater in tissue than blood for tissue-resident species but iden-
classifications for all observed species. For each sample in the tical for contaminant species (Figures 3G, 3H, S3E, and S3F).
cohort, and at each subsequent taxonomic level, we then used Likewise, genome coverage in tissue samples was nearly equal
species-level classifications to design a mixture model esti- at Harvard and Baylor for tissue-resident species but not for
mating the fraction of contaminant read counts within a given contaminant species (Figures S3G and S3H). These results sug-
clade. This method provides a generalizable approach for de- gest that gene- and nucleotide-level analyses of microbial
composing observed microbial populations into their tissue-resi- sequencing reads may be leveraged to help distinguish contam-
dent and contaminant fractions on a sample-by-sample basis at ination from tissue-resident populations.
multiple taxonomic levels.
Distinguishing tissue-resident Escherichia reads from
Proteobacteria and Actinobacteria contribute the contamination
largest fraction of contaminant reads An outstanding challenge in controlling contamination is the
As expected, the removal of contaminant species resulted in a problem of mixed-evidence cases in which detected sequencing
reduction in the number of bacterial reads in all sample types. reads come from an unknown combination of endogenous and
Specifically, bacterial species classified as contaminants ac- contaminant sources (Poore et al., 2020; Robinson et al.,
counted for a median of 16.27% bacterial read counts in tissue 2017). For example, although Escherichia coli is ubiquitous
but varied considerably (Figure 2F). Contaminants consistently among human microbiomes, species-level E. coli reads were
dominated blood samples, with a median of 99.45% of detected present in tissue (64.68%) and blood (66.29%) at nearly equal
WGS reads being the result of contamination. The phyla Proteo- rates and were strongly associated with sequencing center (Fig-
bacteria and Actinobacteria contributed the greatest fraction of ure 2E). We therefore explored whether gene-level read align-
contaminant reads in WGS data, with medians of 76.67% and ments could provide greater resolution and could be used to
80.95% of reads, respectively, found in CRC tissue samples being estimate the fraction of sequencing reads resulting from contam-
the result of contamination (Figure 2G). By contrast, only small ination versus endogenous microbiota. To test this, we mapped
fractions of Firmicutes (1.70%), Bacteroidetes (0.02%), and Fuso- microbial sequencing reads from TCGA tissue and blood sam-
bacteria (0.00%) reads were predicted to be the result of contam- ples to genes in the annotated E. coli genome.
ination (Figure 2G). Contamination rates were largely similar for Overall, reads aligning to E. coli genes in tissue and blood sam-
WXS and across sequencing centers (Figures S2J and S2K). ples were detected at up to the same rates as species-level E. coli
Additionally, correlation between the normalized relative alignments (Figure 3C) and had similar genome coverage (Fig-
abundances of taxa in matched WGS and WXS samples was ure 3I). However, a small number of E. coli genes displayed a
predictive of contamination rates. Correlations between Bacter- signature analogous to tissue-resident microbiota in our spe-
oidetes, Fusobacteria, and Firmicutes abundances in WGS and cies-level prevalence analysis (Figure 3C). Moreover, we
WXS were consistently high, in contrast to Actinobacteria and observed bimodality in the blood prevalence of E. coli genes (Fig-
Proteobacteria (Figure 2H). For blood samples, the normalized ure 3F), suggesting the presence of distinct tissue-resident and
relative abundances of these five phyla were wholly uncorrelated contaminant E. coli populations. We identified a set of 119
between matched WGS and WXS (Figure S2L). Overall, these re- E. coli genes significantly enriched in tissue samples (q < 0.01; Fig-
sults show that significant fractions of the bacterial reads in WGS ure 3J), several of which have credible reasons for being enriched
data for CRC tissue and blood samples are the result of contam- in tissue samples. The top candidate, cadA (q = 4.44E-9), is a gene
ination from Actinobacteria and Proteobacteria species. encoding one of two lysine decarboxylases (Kikuchi et al., 1997;

(G–I) Coverage of WGS reads aligning to genomes of B. vulgatus (G), A. junii (H), and E. coli (I) in blood (red) and tissue (blue).
(J) Top 25 E. coli genes most significantly enriched in tissue.
(K) Comparison of the prevalence of E. coli genes, cadA and ldcC, in blood (red) and tissue (blue).
(L) Results of GO pathway analysis of tissue-enriched E. coli genes.
*Indicates tissue-enriched E. coli genes

Cell Host & Microbe 29, 281–298, February 10, 2021 287
ll
Resource

Figure 4. Decontamination removes sequencing center artifacts and original TCGA tissue and blood samples validate tissue-resident micro-
bial compositions and equiprevalent species as contaminants, see also Figure S4; Table S1
(A) Abundance of WGS bacteria before and after decontamination. Samples with no reduction in bacterial reads lie along the gray line. Experiments with low
microbial biomass a priori are disproportionally affected by decontamination.
(B) Relative abundance of bacterial phyla in tissue samples before and after decontamination, sorted by their a priori abundance of Actinobacteria.
(C) PCoA of the decontaminated, tissue-resident microbial component reveals retention of variation related to sample type but not sequencing center.
(D) PCoA of the contaminant microbial component reveals retention of variation related to sequencing center but not sample type.
(E and F) Prevalence of bacterial species in tissue samples sequenced at Baylor versus Harvard (E) before and (F) after removing contamination.

(legend continued on next page)


288 Cell Host & Microbe 29, 281–298, February 10, 2021
ll
Resource

Yamamoto et al., 1997) produced by E. coli; the other is ldcC, most prominently by the removal of contaminant Actinobacteria
which is not enriched in tissue samples (q = 0.16) (Figures 3K and Proteobacteria reads (Figure 4B).
and S3J). While ldcC encodes a gene that is constitutively ex- Despite being naive to sequencing center, our prevalence-
pressed, cadA transcription is induced under conditions of anaer- based model for decomposing the TCGA microbiome data into
obic growth at low pH and its gene product displays greater ther- tissue-resident and contaminant fractions also mitigated cen-
mostability and acid tolerance (Lemonnier and Lane, 1998). ter-related batch effects. Prior to removing contamination,
Additionally, genes in the pks island encoding colibactin were TCGA microbiome data clustered by both sample type and
significantly more prevalent in tissue than blood, matching previ- sequencing center (Figure 2D). However, unsupervised clus-
ous reports that E. coli strains expressing this gene are associated tering of the tissue-resident component extracted from the orig-
with CRC tissues (Figure S3L) (Arthur et al., 2012). inal TCGA sequencing data showed no dependency on
Discrepancies in intraspecies genome content may be ex- sequencing center and maintained variation related to sample
plained by adaptive gene loss, an evolutionary mechanism type (Figure 4C). Examining the contamination component, we
whereby bacteria dispense with genes that are unnecessary found the opposite: samples no longer clustered by sample
for their environmental conditions (Koskiniemi et al., 2012; type but rather organized exclusively according to sequencing
Mira et al., 2001). Pathway analysis (Huang da et al., 2009) re- center (Figure 4D). These results reflected the removal of species
vealed that tissue-enriched E. coli genes were significantly that were uniquely prevalent in tissue samples from either Baylor
associated with processes including iron-ion homeostasis, en- or Harvard (Figures 4E, 4F, and S4A).
terobactin biosynthesis, ion transport, ferric-enterobactin trans- Finally, our algorithm greatly increased the similarity between
port, and copper-ion response (p < 0.01) (Figure 3L). Iron (Fe3+) the microbial populations in patient-matched tissue samples
and copper (Cu2+) are abundant in the host and can be toxic to sequenced at both Harvard and Baylor, while maintaining diversity
E. coli in acidic, aerobic conditions; therefore, strains of E. coli among samples overall (Figure 4G). Thus, our prevalence-based
must tightly regulate intracellular concentrations of these model is able to homogenize matched samples sequenced at
metals and undergo selection to do so (Porcheron et al., different centers and mitigate sequencing center artifacts.
2013; Rensing and Grass, 2003). Given that hypothetically
bloodborne E. coli would also have to contend with high con- Original TCGA tissue and blood samples validate tissue-
centrations of copper and iron, enrichment of these genes resident microbial compositions and equiprevalent
and processes in tissue relative to blood suggests that the species as contaminants
majority of E. coli reads detected in blood samples are not To benchmark our analysis, we obtained five primary CRC tumor
endogenous but rather the result of contamination. However, samples and matched plasma samples from an original TCGA tis-
tissue-enriched genes such as cadA and others serve as sue provider (Table S1). These samples were specifically chosen to
benchmarks for distinguishing the two. ensure that each tissue and plasma sample was profiled by WGS at
both Baylor and Harvard. For controls, we also procured three
Tissue-enriched sequencing reads can be identified plasma samples from healthy individuals and spiked one plasma
with nucleotide precision sample with E. coli. We then used 16S amplicon sequencing to vali-
We then examined microbial sequencing reads at nucleotide- date that the bacterial composition of the original TCGA samples
level resolution. Given that gene-level alignments helped resolve resembled the decontaminated compositions extracted from
mixed-evidence cases, we explored whether bacterial sequence TCGA sequencing data for matched tumor samples (Figure 4H).
variants, such as SNPs, could be used in a similar fashion. We found that the original TCGA tumor samples contained
Variant prevalence across CRC tissue and blood samples largely ample bacterial diversity and read counts (Figures 4I and S4B)
recapitulated the results from species- and gene-level profiles and that their bacterial composition largely recapitulated the de-
(Figures S3L–S3M). Interestingly, we also found populations of contaminated, tissue-resident microbial population our decon-
apparent tissue-enriched and equiprevalent variants in E. coli tamination model extracted from TCGA WGS data on matched
genomes, suggesting that analyses of sequence variants may samples (Figures 4J and S4D). In addition to increasing the sim-
prove useful in distinguishing between endogenous and contam- ilarity between WGS and 16S validation results (Figure S4C),
inant sequencing reads in mixed-evidence cases. decontamination greatly improved the concordance between
microbial compositions of matched WGS experiments per-
Decontamination removes sequencing center artifacts formed at Harvard and Baylor (Figure 4G). Despite detecting a
Removing contamination affected all samples, but samples with large number of bacterial sequencing reads in tumor samples,
low bacterial abundance a priori were the most affected (Fig- bacterial diversity of CRC plasma was not significantly greater
ure 4A), consistent with observations that low-biomass samples than healthy plasma (p = 0.30) or water controls (p = 0.44). More-
are the most profoundly affected by contamination (Eisenhofer over, the 16S bacterial composition of original TCGA plasma
et al., 2019; Glassing et al., 2016). Decontamination also regular- samples was distinct from the bacterial composition of WGS
ized the relative abundance profiles of CRC tissue samples, data from the same samples (Figures S4E and S4F), supporting

(G) Comparison of weighted UniFrac distances before and after removing contamination among all tissues (left) and specifically matched tissues sequenced at
both Baylor and Harvard (right).
(H) Design of the validation experiment. Data are represented as mean ± 95% CI.
(I) Bacterial diversity of 16S rRNA-seq results from tissue (blue), plasma (red), and controls (bottom panel).
(J) Relative abundances in 16S results for tissue compared with tissue samples sequenced using WGS at Harvard and Baylor, before and after contamination.

Cell Host & Microbe 29, 281–298, February 10, 2021 289
ll
Resource

(legend on next page)

290 Cell Host & Microbe 29, 281–298, February 10, 2021
ll
Resource

the notion that the majority of bacterial reads detected in TCGA sue types. As proof-of-principle, we used batch-normalized
blood samples are contamination introduced during DNA extrac- RPPA, mRNA-seq, miRNA-seq, and methylation 27-K data
tion and sequencing, rather than at the time of procurement. from TCGA to compute correlations between features in these
These validation results demonstrate that our model accurately datasets with genera in the Fusobacterium and Bacteroides
identifies and removes contaminants and that computational clusters identified previously (Figures 5D, 5E, S5C, and S5D).
decontamination produced profiles that represent the true mi- For each of these assays, we found that these bacterial co-abun-
crobial composition of tissue. dance groups were predictive of host gene expression patterns.
For instance, in RPPA protein expression data we found that
Colorectal tissue microbiomes cluster into ADAR1 and PARP1 expression appeared to distinguish these
Fusobacterium and Bacteroides co-abundance groups co-abundance groups (Figure 5E). The protein ADAR1 is upregu-
Having validated the contamination-adjusted microbial profiles, lated by inflammatory mediators such as TNF-alpha and IFN-
we sought to leverage the decontaminated TCMA dataset to gamma (Yang et al., 2003) and regulates pathogen detection
investigate whether certain subgroups of microbiota were more and autoinflammation by discriminating self from non-self RNA
likely to be found together in tissue from CRC patients. Using (Chung et al., 2018), while PARP1 regulates DNA repair and is
the bootstrapping procedure, SparCC (Friedman and Alm, activated by Helicobacter pylori in gastric cancer (Nossa et al.,
2012), we found two anticorrelated co-abundance groups (Fig- 2009). Independently, we found that ADAR1 expression corre-
ures 5A–5C, S5A, and S5B): the ‘‘Fusobacterium cluster’’ con- lated with expression of PARP1, TNF-alpha, and IFN-gamma in
tained Porphyromonas, Prevotella, Peptostreptococcus, and TCGA RNA-seq data for CRC (Figure S5E). These results sug-
Campylobacter, among other species that were associated with gest that genes regulating inflammation and pathogen response
tumor samples; the second ‘‘Bacteroides cluster’’ is larger and may distinguish the Fusobacterium and Bacteroides co-abun-
contained a highly correlated set of microbes, including Parabac- dance groups in CRC. More broadly, these analyses illustrate
teroides, Clostridium, and Alistipes. This group may represent a the utility of TCMA as a unique resource for comparing microbial
more normal/healthy microbiome, as several of these species and multi-omic host profiles from matched tissue samples.
were positively associated with normal tissue samples. Taxa in
the Fusobacterium cluster were significantly associated with colo- Matched tumor-normal analysis reveals species
rectal neoplasms, while taxa in the Bacteroides cluster were asso- associated with colorectal neoplasms
ciated with C. difficile infection, irritable bowel syndrome, and The TCGA database contains detailed annotations on each tis-
cirrhosis (q < 0.05). These co-abundance groups may represent sue donor, including statistics on tumor stage, size, morphology,
two distinct ‘‘enterotypes’’ of CRC tissue microbiomes. and location, as well information on patient survival, treatment
history, and therapeutic response. To identify microbes predic-
Bacterial co-abundance groups are predictive of the tive of pathological and prognostic characteristics of CRC tissue,
host tissue molecular environment we used matched normal tissue and primary tumor samples to
Next, we evaluated whether the bacterial co-abundance groups perform a paired comparison of tissue-resident microbes (Fig-
we identified had discernible effects on host gene expression or ures 5F, S5F, and S5G). This analysis identified 37 species that
regulation. Microbiota and host cells are known to engage with were significantly enriched in either normal (n = 14) or tumor
one another through a complex variety of molecular interactions. (n = 23) samples (p < 0.05) (Table S2).
Host-derived nutrients and dietary macromolecules are utilized The species most significantly associated with CRC tumors
by microorganisms as a food source, while microbial byproducts compared with matched normal tissue was F. nucleatum
including short-chain fatty acids (SCFAs) are known to modulate (p = 1.82E-3), which is known to promote intestinal tumorigen-
gene expression, cell differentiation, and inflammatory response esis. Overall, approximately half of tumor-associated species
(Donohoe et al., 2011; Furusawa et al., 2013). belonged to the genus Fusobacterium, including F. hwasookii,
The TCGA database contains a dense cube of molecular F. massiliense, and a number of unclassified Fusobacterium
profiling data, including matched genetic, epigenetic, transcrip- spp. (p < 0.01). Non-Fusobacterium species associated
tional, and proteomic assays performed on thousands of sam- with CRC tumors included P. micra, S. moorei, and P. stomatis
ples. Thus, the ability to compare microbial profiles with (p < 0.05), several of which belonged to the Fusobacterium
matched host molecular profiles represents an unprecedented co-abundance group (Figure S5A) and have previously been
opportunity for querying host-microbe interactions in various tis- implicated in CRC (Kostic et al., 2013; Purcell et al., 2017; Warren

Figure 5. Colorectal tissue microbiomes cluster into Fusobacterium and Bacteroides co-abundance groups predictive of host tissue molec-
ular environment
See also Figure S5; Table S2
(A) Heatmap clustering of correlations between bacterial genera reveals anticorrelated clusters of genera, characterized by Bacteroides and Fusobacterium
(purple triangles). Axes are colored according to species’ association with tumor (blue) or matched adjacent normal tissue (yellow).
(B and C) (B) Bacteroides- and (C) Fusobacterium-associated co-abundance networks. Node size is proportional to the prevalence of the genera in tissue
samples, and node hue is proportional to abundance.
(D and E) Co-abundance groups are predictive of gene expression (D; RNA-seq) and protein expression (E; RPPA).
(F) Heat tree comparing bacterial taxa abundance in tumor samples (blue) or matched normal tissue (yellow).
(G) Survival analysis p values of species in the Bacteroides and Fusobacterium co-abundance groups.
(H) OS curves for Bacteroides spp.

Cell Host & Microbe 29, 281–298, February 10, 2021 291
ll
Resource

Figure 6. Microbial presence in CRC tissue is predictive of host gene expression pathways and MBI See also Figure S6
(A) Correlation between host gene expression (columns) and CLR-transformed species abundances (rows). Rows are colored according to each species’ as-
sociation with tumor (blue) or normal tissue (yellow).

(legend continued on next page)


292 Cell Host & Microbe 29, 281–298, February 10, 2021
ll
Resource

et al., 2013). Other species, including several Campylobacter spective of their association with tumor or normal tissue (Figures
spp. did not have extant links to the disease. Of these, 6B and S6B) and (2) processes related to inflammatory cancer
C. ureolyticus (Log2FC = 1.97; p = 2.19e 2) is an emerging pathways and cell-cell adhesion were enriched among genes
gastrointestinal pathogen implicated in inflammatory bowel dis- correlated with tumor-associated and normal tissue-associated
ease and colitis (Bullman et al., 2013; O’Donovan et al., 2014), species, respectively (Figures 6C, 6D, S6C, and S6D). Specif-
prompting further examination. C. ureolyticus abundance corre- ically, both tumor- and normal tissue-associated species were
lated with expression of several genes, including CAMK2D and enriched for processes relating to intestinal IgA production, anti-
UGDH (Figures S5J, S5H, and S5I), and several genes expressed gen presentation, natural killer cell-mediated cytotoxicity, cyto-
by C. ureolyticus were significantly associated with tumor sam- kine signaling, and primary immunodeficiency, suggesting
ples compared with normal tissue (Figure S5J). Additionally, near-universal activation of an immunogenic transcriptional
C. ureolyticus was associated with worse progression-free inter- response to the presence of these bacteria (Figures 6B and S6B).
val (PFI) in recurrent CRC patients (Figure S5K). We also found that pathways including DNA replication, DNA
Taxa that were significantly more abundant in adjacent normal repair, oxidative phosphorylation, p53 signaling, and ribosome
tissue compared with matched tumor tissue were dominated by activity were all negatively enriched among normal tissue-asso-
Bacteroides and Parabacteroides spp. (p < 0.05) (Figure S5G), ciated species, and positively enriched among tumor-associ-
many of which belonged to the Bacteroides co-abundance ated species, particularly for Fusobacterium spp. (Figures 6C
group (Figure S5A). By leveraging patient-matched tumor and and S6C). Conversely, genes involved in the regulation of cellular
normal tissue samples, TCMA may thus be used identify bacte- adhesion were positively enriched among normal tissue-associ-
rial associations with CRC and other gastrointestinal cancers. ated species and negatively enriched among tumor-associated
species (Figures 6D and S6D). Together, these results indicate
Survival analysis reveals candidate microbial that within this cohort of CRC tumor samples, tumor-associated
biomarkers predictive of clinical outcomes species may be associated with proinflammatory, neoplastic
Using survival data collected by the PanCanAtlas (Liu et al., transformations and loss of epithelial integrity.
2018), we next examined whether co-abundance groups were
predicative of overall survival (OS). For each species, we used Microbial presence in CRC blood samples indicate
a log-rank test to assess its individual prognostic value. Interest- mucosal barrier injury
ingly, species in the Bacteroides co-abundance group were As shown in Figures 1B and S1B, bacteria were significantly more
generally more prognostic of survival than the Fusobacterium abundant and diverse in blood samples from CRC patients than
co-abundance group (Figure 5G). We found over a dozen Bac- from BC patients (p < 0.01). The presence of transient, endoge-
teroides spp. that were prognostic of survival, including nous microbial DNA in the bloodstream has been reported in pri-
B. cellulosilyticus and several unclassified Bacteroides spp. (Fig- mary CRC patients, often before diagnosis, and may even be pre-
ures 5H and S5L). These findings demonstrate the utility of dictive of tumor stage and location (Abdulamir et al., 2011; Poore
TCMA for the identification of prognostic microbial biomarkers et al., 2020). Loss of mucosal barrier function is a common feature
relevant to CRC and other cancers. of CRC and other chronic inflammatory conditions and may lead
to microbial translocations from CRC tumors to the lamina propria
Microbial presence in CRC tissue is predictive of host and bloodstream (Oshima and Miwa, 2016; Yu, 2018).
immunogenic response, inflammatory cancer pathways, To explore this possibility, we examined the abundance of
and cell-cell adhesion subsets of bacterial species and genera designated as common
Next, we explored whether the 37 species that we identified as commensal (n = 407) or MBI-associated (n = 693) (CDC, 2019).
significantly associated with either tumor or normal tissue sam- Examining MBI-associated species among decontaminated
ples had identifiable effects on host gene expression or related blood samples within the CRC cohort, we found that species
biological pathways. Comparing normalized abundances of associated with MBI were considerably more prevalent than
these species with matched mRNA expression data from 159 those that were not (p = 2.42e 7; Figure 6E). We then compared
CRC tumor samples, we computed correlations and found tran- the abundance of MBI-associated genera with that of common
scriptional patterns that were associated with both tumor- and commensals and discovered that genera associated with MBI
normal tissue-associated species (Figures 6A and S6A). Given were frequently more abundant in the blood of CRC patients
the observed differences in the transcriptional correlations of tu- than BC patients (Figures 6F and S6E). These results point to-
mor- and normal tissue-associated bacteria, we subsequently ward the potential utility of bloodborne bacterial DNA from
performed gene-set enrichment analysis (Subramanian et al., MBI-associated organisms as a potential biomarker for CRC.
2005) to identify biological pathways associated with the abun-
dance of these species. Contamination-adjusted tissue microbiome profiles for
Pathway analysis revealed that (1) genes correlated with the all gastrointestinal cancers in TCGA
abundance of bacterial species were consistently enriched for Having successfully identified the CRC tissue-associated mi-
the activation of immune system pathways and processes, irre- crobial component in the TCGA dataset, we analyzed samples

(B–D) Comparison of differentially abundant species and their association with tissue type (x axis) versus enrichment score (y axis) for KEGG terms (A) ‘‘natural
killer cell-mediated cytotoxicity’’ (B), ‘‘DNA replication’’ (C), and ‘‘cell adhesion molecules’’ (D).
(E) Bacterial species implicated in MBI are more prevalent in decontaminated blood samples than other species.
(F) Bacterial genera implicated in MBI (red) are more abundant in CRC blood (brown) than BC blood (green), in contrast to some commensal species (blue).

Cell Host & Microbe 29, 281–298, February 10, 2021 293
ll
Resource

Figure 7. Contamination-adjusted tissue microbiome profiles for all gastrointestinal cancers in TCGA
See also Figure S7
(A) Pan-cancer abundance of bacteria in solid tissue samples from TCGA projects prior to decontamination. Data are represented as mean ± 95% CI.
(B) Estimated fraction of contaminant reads for sequencing experiments on tumor (blue), normal (yellow), and blood (red) samples for each sequencing project
in TCMA.
(C) Classification of tissue-resident (blue) and contaminant (red) species across TCGA gastrointestinal tissues by comparison of prevalence in blood and tissue.
(D) Labeling of tissue-resident (blue) and contaminant (red) species across gastrointestinal tissues by comparison of prevalence in brain tissue and disease-
specific tissue, using classification from (C).
(E) Estimated proportions of tissue-resident (blue) and contaminant (red) species for each TCGA project.

from other cancer types to search for tissue-resident micro- signature of tissue-resident species in HNSC, STAD, and
biota. All sequencing datasets contained some bacterial ESCA projects by comparing species prevalence in tissue with
reads but as expected, they were most abundant in gastroin- blood and brain samples (Figures 7C and 7D), and we estimated
testinal cancers (Figure 7A). In particular, tissue samples the fraction of minimally detectable species that were tissue-
from head and neck cancer (HNSC), colon cancer (COAD), resident or contaminants (Figure 7E). For each of these cancer
rectal cancer (READ), esophageal cancer (ESCA), and types, we applied our decomposition model to isolate tissue-
stomach cancer (STAD) had the greatest number of bacterial resident populations and establish TCMA. Few statistically sig-
reads prior to decontamination, whereas uveal melanoma nificant tissue-resident populations in bladder (BLCA), breast
(UVM), lung squamous-cell (LUSC), and glioblastoma had (BRCA), uterine (UCEC), cervical (CESC), or prostate (PRAD)
the fewest. cancers could be detected (Figures S7A and S7B). The microbial
Given the abundance of microbial reads in gastrointestinal biomass in these tissues is known to be magnitudes less
cancer types, we used our prevalence-based approach to deter- than that of gastrointestinal tissues despite similar levels of
mine whether tissue-resident microbiota were present and esti- contamination (p = 0.56), hence it may be more challenging to
mate the fraction of contaminant reads in each sample type (Fig- distinguish the few tissue-resident species from the over-
ure 7B). In addition to COAD and READ, we found a strong whelming proportion of contaminants.

294 Cell Host & Microbe 29, 281–298, February 10, 2021
ll
Resource

DISCUSSION Beyond CRC, the TCMA database will allow interrogations of


pan-cancer relationships between the microbiome and tumor
By comparing and integrating data from multiple NGS platforms development. In most cases, the role of microbiota in cancer is
and various sample types, we isolated and experimentally vali- context-specific. For example, H. pylori is known to advance
dated the tissue-resident component of these datasets, thus gastric cancers but seemingly offer a protective effect in esoph-
producing a public resource of computationally decontaminated ageal adenocarcinoma (Islami and Kamangar, 2008). However,
microbial profiles in TCGA tissue samples. This examination of certain pathogenic processes, such as chronic inflammation,
equiprevalence provides a blueprint for future analyses of altered metabolic states, and abrogation of viral latency display
sequencing data for metagenomic profiling of tissue-resident mi- commonality across cancers (Plottel and Blaser, 2011). Since
crobiota. Putative contaminant species are more likely to origi- TCGA samples were collected and analyzed with common
nate from a single source and are also expected to demonstrate methodologies, the decontaminated metagenomic profiles for
a lesser degree of intraspecies genetic variation, meaning that thousands of tissue samples presented here provide an ideal
additional analyses of gene- and nucleotide-level prevalence platform for examining host-microbe relationships that span
may be helpful for controlling contamination, as we demon- cancer types, in contrast to meta-analyses, which must integrate
strated for mixed-evidence cases such as E. coli. Prevalence- data from disparate sources. Thus, in addition to providing a
based analyses are likely to supplement standard batch-correc- methodology for comprehensively identifying and removing
tion tools, which control technical variation but do not explicitly contamination, TCMA represents an unprecedented resource
model contamination. More statistically rigorous tools that for exploring the role of tissue-resident microbiota in various
leverage prevalence and other technical variables to explicitly cancer types and identifying predictive microbial biomarkers.
define observed metagenomic data as some linear combination
of endogenous and contaminant read counts, may therefore be STAR+METHODS
warranted.
The ability to retroactively remove contaminant species Detailed methods are provided in the online version of this paper
from NGS sequencing datasets will greatly expand the and include the following:
breadth and accessibility of metagenomic profiles for down-
stream analyses. Multi-institutional initiatives such as TCGA d KEY RESOURCES TABLE
and GTEx have collected tens of thousands of tissue samples d RESOURCE AVAILABILITY
for sequencing, many of which are from internal organs and B Lead contact
tissue types known to harbor microbiota. Most of these sam- B Materials availability
ples have been characterized extensively along genetic, B Data and code availability
epigenetic, transcriptional, and proteomic axes or provide d METHOD DETAILS
detailed clinical profiles on patient donors. Meanwhile, a B Acquisition and metagenomic profiling of TCGA
growing body of evidence suggests that alterations to the mi- sequencing data
crobiome are associated with cancer development, progres- B Decomposition of observed TCGA microbial profiles
sion, and drug response (Gopalakrishnan et al., 2018; Sivan into tissue-resident and contaminant fractions
et al., 2015; Viaud et al., 2013). Therefore, obtaining robust B Gene-level sequencing analysis of representative
profiles of the microbial composition of human tissues in these species
sequencing databases will provide new insights into multi- B Nucleotide-level analysis of bacterial sequence
omic host-microbe interactions in human tissue samples variants
that would otherwise be difficult to acquire and analyze. B Acquisition and analysis of original TCGA tissue and
As proof-of-principle, we used TCMA to identify two domi- plasma samples
nant clusters of tissue-resident bacteria in CRC samples, as B Estimation of bacterial co-abundance groups and
well as their associated molecular expression patterns and associated molecular signatures
prognostic significance. Pathway analysis of matched tran- B Identification of tumor- and normal tissue-associated
scriptional data demonstrated that tumor-associated species microbiota
were positively correlated with cancer-related inflammatory B Survival analysis
pathways and negatively associated with cellular adhesion ma- B Pathway analysis of species associated with tumors or
chinery. Specifically, enrichment of ribosome, p53 signaling, adjacent normal tissue
DNA repair, oxidative phosphorylation, and cellular adhesion d QUANTIFICATION AND STATISTICAL ANALYSIS
pathways may point to a previously described mechanism
wherein inflammatory cytokines downregulate p53 by stimu- SUPPLEMENTAL INFORMATION
lating ribosome biogenesis in colonic epithelial cells, leading
to downregulation of E-cadherin and epithelial-mesenchymal Supplemental Information can be found online at https://doi.org/10.1016/j.
chom.2020.12.001.
transition (Brighenti et al., 2014). Given the established ability
of many of these species to induce inflammation (Kostic
ACKNOWLEDGMENTS
et al., 2013), stimulate cytokine activity (Gemmell and Seymour,
1993), and modulate E-cadherin (Rubinstein et al., 2013) in co- The authors would like to thank Mark Walker and Daniel Somers for technical
lonocytes, the contribution of these species to this inflamma- guidance in deploying the PathSeq pipeline as well as Mark Turner for help with
tory cancer pathway necessitates further exploration. developing the TCMA website. We also thank Sayan Mukherjee, Lingchong

Cell Host & Microbe 29, 281–298, February 10, 2021 295
ll
Resource
You, Lawrence David, Justin Silverman, Kimberly Roche, and Darien Lombardi Choi, J.H., Hong, S.E., and Woo, H.G. (2017). Pan-cancer analysis of system-
for constructive discussion and advice. We are grateful to Caroline Connor for atic batch effects on somatic sequence variations. BMC Bioinformatics
reviewing and editing the manuscript. This study was supported by NIH 18, 211.
R35GM122465, DK119795, and DARPA W911NF1920111. Chung, H., Calis, J.J.A., Wu, X., Sun, T., Yu, Y., Sarbanes, S.L., Dao Thi, V.L.,
Shilvock, A.R., Hoffmann, H.H., Rosenberg, B.R., and Rice, C.M. (2018).
AUTHOR CONTRIBUTIONS Human ADAR1 prevents endogenous RNA from triggering translational shut-
down. Cell 172, 811–824.e14.
A.B.D. and X.S. initiated the concept, designed the analyses, and interpreted
the data. A.B.D. performed all the analyses. D.A.M. assisted A.B.D. with gene- Davidson-Pilon, C., Kalderstam, J., Jacobson, N., Zivich, P., Kuhn, B.,
and sequence-level analyses. A.B.D., S.D., and H.D. designed and performed Williamson, M., Sean-Reed, J.K., Fiore-Gartland, A., Datta, D., and Moneda,
the experiments. A.B.D. and M.G. designed and deployed the TCMA website. L. (2020). CamDavidsonPilon/lifelines: 0.24.6 (Zenodo).
A.B.D. and X.S. wrote the manuscript. S.M.L. and I.D.I. assisted with concep- Davis, N.M., Proctor, D.M., Holmes, S.P., Relman, D.A., and Callahan, B.J.
tion, scientific direction, and writing. (2018). Simple statistical identification and removal of contaminant sequences
in marker-gene and metagenomics data. Microbiome 6, 226.
DECLARATION OF INTERESTS DeSantis, T.Z., Hugenholtz, P., Larsen, N., Rojas, M., Brodie, E.L., Keller, K.,
Huber, T., Dalevi, D., Hu, P., and Andersen, G.L. (2006). Greengenes, a
The authors declare no competing interests.
chimera-checked 16S rRNA gene database and workbench compatible with
ARB. Appl. Environ. Microbiol. 72, 5069–5072.
Received: June 12, 2020
Revised: September 29, 2020 Dobin, A., Davis, C.A., Schlesinger, F., Drenkow, J., Zaleski, C., Jha, S., Batut,
Accepted: December 1, 2020 P., Chaisson, M., and Gingeras, T.R. (2013). STAR: ultrafast universal RNA-seq
Published: December 30, 2020 aligner. Bioinformatics 29, 15–21.
Donohoe, D.R., Garge, N., Zhang, X., Sun, W., O’Connell, T.M., Bunger, M.K.,
REFERENCES and Bultman, S.J. (2011). The microbiome and butyrate regulate energy meta-
bolism and autophagy in the mammalian colon. Cell Metab 13, 517–526.
Abdulamir, A.S., Hafidh, R.R., and Abu Bakar, F. (2011). The association of Eisenhofer, R., Minich, J.J., Marotz, C., Cooper, A., Knight, R., and Weyrich,
Streptococcus bovis/gallolyticus with colorectal tumors: the nature and the L.S. (2019). Contamination in low microbial biomass microbiome studies: is-
underlying mechanisms of its etiological role. J. Exp. Clin. Cancer Res. 30, 11. sues and recommendations. Trends Microbiol 27, 105–117.
€hlbauer, M., Tomkovich, S., Uronis, J.M.,
Arthur, J.C., Perez-Chanona, E., Mu Elinav, E., Garrett, W.S., Trinchieri, G., and Wargo, J. (2019). The cancer micro-
Fan, T.J., Campbell, B.J., Abujamel, T., Dogan, B., Rogers, A.B., et al. (2012). biome. Nat. Rev. Cancer 19, 371–376.
Intestinal inflammation targets cancer-inducing activity of the microbiota.
Flanagan, L., Schmid, J., Ebert, M., Soucek, P., Kunicka, T., Liska, V., Bruha,
Science 338, 120–123.
J., Neary, P., Dezeeuw, N., Tommasino, M., et al. (2014). Fusobacterium nucle-
Bolyen, E., Rideout, J.R., Dillon, M.R., Bokulich, N.A., Abnet, C.C., Al-Ghalith, atum associates with stages of colorectal neoplasia development, colorectal
G.A., Alexander, H., Alm, E.J., Arumugam, M., Asnicar, F., et al. (2019). cancer and disease outcome. Eur. J. Clin. Microbiol. Infect. Dis. 33,
Reproducible, interactive, scalable and extensible microbiome data science 1381–1390.
using QIIME 2. Nat. Biotechnol. 37, 852–857.
€nwald, N.J. (2017). Metacoder: an R pack-
Foster, Z.S., Sharpton, T.J., and Gru
Brighenti, E., Calabrese, C., Liguori, G., Giannone, F.A., Trerè, D., Montanaro, age for visualization and manipulation of community taxonomic diversity data.
L., and Derenzini, M. (2014). Interleukin 6 downregulates p53 expression and PLoS Comput. Biol. 13, e1005404.
activity by stimulating ribosome biogenesis: a new pathway connecting inflam-
mation to cancer. Oncogene 33, 4396–4406. Friedman, J., and Alm, E.J. (2012). Inferring correlation networks from genomic
survey data. PLOS Comput. Biol. 8, e1002687.
Bullman, S., Lucid, A., Corcoran, D., Sleator, R.D., and Lucey, B. (2013).
Genomic investigation into strain heterogeneity and pathogenic potential of Furusawa, Y., Obata, Y., Fukuda, S., Endo, T.A., Nakato, G., Takahashi, D.,
the emerging gastrointestinal pathogen Campylobacter ureolyticus. PLoS Nakanishi, Y., Uetake, C., Kato, K., Kato, T., et al. (2013). Commensal
One 8, e71515. microbe-derived butyrate induces the differentiation of colonic regulatory
T cells. Nature 504, 446–450.
Bullman, S., Pedamallu, C.S., Sicinska, E., Clancy, T.E., Zhang, X., Cai, D.,
Neuberg, D., Huang, K., Guevara, F., Nelson, T., et al. (2017). Analysis of Gemmell, E., and Seymour, G.J. (1993). Interleukin 1, interleukin 6 and trans-
Fusobacterium persistence and antibiotic response in colorectal cancer. forming growth factor-beta production by human gingival mononuclear cells
Science 358, 1443–1448. following stimulation with Porphyromonas gingivalis and Fusobacterium nu-
cleatum. J. Periodont. Res. 28, 122–129.
Bush, S.J., Foster, D., Eyre, D.W., Clark, E.L., De Maio, N., Shaw, L.P.,
Stoesser, N., Peto, T.E.A., Crook, D.W., and Walker, A.S. (2020). Genomic di- Glassing, A., Dowd, S.E., Galandiuk, S., Davis, B., and Chiodini, R.J. (2016).
versity affects the accuracy of bacterial single-nucleotide polymorphism-call- Inherent bacterial DNA contamination of extraction and sequencing reagents
ing pipelines. GigaScience 9, giaa007. may affect interpretation of microbiota in low bacterial biomass samples.
Gut Pathog 8, 24.
Cancer Genome Atlas Research Network (2014). Comprehensive molecular
characterization of gastric adenocarcinoma. Nature 513, 202–209. Gopalakrishnan, V., Spencer, C.N., Nezi, L., Reuben, A., Andrews, M.C.,
Karpinets, T.V., Prieto, P.A., Vicente, D., Hoffman, K., Wei, S.C., et al.
Cancer Genome Atlas Research Network; Albert Einstein College of Medicine;
(2018). Gut microbiome modulates response to anti-PD-1 immunotherapy in
Analytical Biological Service; Barretos Cancer Hospital; Baylor College of
melanoma patients. Science 359, 97–103.
Medicine; Beckman Research Institute of City of Hope; Buck Institute for
Research on Aging; Canada’s Michael Smith Genome Sciences Center; Grice, E.A., Kong, H.H., Renaud, G., Young, A.C., NISC Comparative
Harvard Medical School; Helen F. Graham Cancer Center &Research Sequencing Program, Bouffard, G.G., Blakesley, R.W., Wolfsberg, T.G.,
Institute at Christiana Care Health Services, et al. (2017). Integrated genomic Turner, M.L., and Segre, J.A. (2008). A diversity profile of the human skin mi-
and molecular characterization of cervical cancer. Nature 543, 378–384. crobiota. Genome Res 18, 1043–1050.
CDC (2019). NHSN organism list. In National Healthcare Safety Network Huang, da W., Sherman, B.T., and Lempicki, R.A. (2009). Systematic and inte-
(NHSN) patient safety component manual (Centers for Disease Control and grative analysis of large gene lists using DAVID bioinformatics resources. Nat.
Prevention). Protoc. 4, 44–57.
Chelakkot, C., Ghim, J., and Ryu, S.H. (2018). Mechanisms regulating intesti- Human Microbiome Project Consortium (2012). Structure, function and diver-
nal barrier integrity and its pathological implications. Exp. Mol. Med. 50, 103. sity of the healthy human microbiome. Nature 486, 207–214.

296 Cell Host & Microbe 29, 281–298, February 10, 2021
ll
Resource
Iliev, I.D., and Leonardi, I. (2017). Fungal dysbiosis: immunity and interactions Oshima, T., and Miwa, H. (2016). Gastrointestinal mucosal barrier function and
at mucosal barriers. Nat. Rev. Immunol. 17, 635–646. diseases. J. Gastroenterol. 51, 768–778.
Islami, F., and Kamangar, F. (2008). Helicobacter pylori and esophageal can- Pertea, G., and Pertea, M. (2020). GFF Utilities: GffRead and GffCompare
cer risk: a meta-analysis. Cancer Prev. Res. (Phila) 1, 329–338. [version 1; peer review: 3 approved. F1000Res 9, 304.
Jiang, W. (2018). A protocol for quantizing total bacterial 16S rDNA in plasma Plottel, C.S., and Blaser, M.J. (2011). Microbiome and malignancy. Cell Host
as a marker of microbial translocation in vivo. Cell. Mol. Immunol. 15, 937–939. Microbe 10, 324–335.
Kikuchi, Y., Kojima, H., Tanaka, T., Takatsuka, Y., and Kamio, Y. (1997). Poore, G.D., Kopylova, E., Zhu, Q., Carpenter, C., Fraraccio, S., Wandro, S.,
Characterization of a second lysine decarboxylase isolated from Escherichia Kosciolek, T., Janssen, S., Metcalf, J., Song, S.J., et al. (2020). Microbiome an-
coli. J. Bacteriol. 179, 4486–4492. alyses of blood and tissues suggest cancer diagnostic approach. Nature 579,
Koskiniemi, S., Sun, S., Berg, O.G., and Andersson, D.I. (2012). Selection- 567–574.
driven gene loss in bacteria. PLoS Genet 8, e1002787. Poplin, R., Ruano-Rubio, V., DePristo, M.A., Fennell, T.J., Carneiro, M.O., Van
Kostic, A.D., Chun, E., Robertson, L., Glickman, J.N., Gallini, C.A., Michaud, der Auwera, G.A., Kling, D.E., Gauthier, L.D., Levy-Moonshine, A., Roazen, D.,
M., Clancy, T.E., Chung, D.C., Lochhead, P., Hold, G.L., et al. (2013). et al. (2017). Scaling accurate genetic variant discovery to tens of thousands of
Fusobacterium nucleatum potentiates intestinal tumorigenesis and modulates samples. bioRxiv https://www.biorxiv.org/content/10.1101/201178v3.
the tumor-immune microenvironment. Cell Host Microbe 14, 207–215.
Porcheron, G., Garénaux, A., Proulx, J., Sabri, M., and Dozois, C.M. (2013).
Kostic, A.D., Gevers, D., Pedamallu, C.S., Michaud, M., Duke, F., Earl, A.M., Iron, copper, zinc, and manganese transport and regulation in pathogenic
Ojesina, A.I., Jung, J., Bass, A.J., Tabernero, J., et al. (2012). Genomic analysis Enterobacteria: correlations between strains, site of infection and the relative
identifies association of Fusobacterium with colorectal carcinoma. Genome importance of the different metal transport systems for virulence. Front. Cell.
Res 22, 292–298. Infect. Microbiol. 3, 90.
Kostic, A.D., Ojesina, A.I., Pedamallu, C.S., Jung, J., Verhaak, R.G., Getz, G., Prast-Nielsen, S., Tobin, A.M., Adamzik, K., Powles, A., Hugerth, L.W.,
and Meyerson, M. (2011). PathSeq: software to identify or discover microbes Sweeney, C., Kirby, B., Engstrand, L., and Fry, L. (2019). Investigation of the
by deep sequencing of human tissue. Nat. Biotechnol. 29, 393–396. skin microbiome: swabs vs. biopsies. Br. J. Dermatol. 181, 572–579.
Krzywinski, M., Schein, J., Birol, I., Connors, J., Gascoyne, R., Horsman, D., Purcell, R.V., Visnovska, M., Biggs, P.J., Schmeier, S., and Frizelle, F.A. (2017).
Jones, S.J., and Marra, M.A. (2009). Circos: an information aesthetic for Distinct gut microbiome patterns associate with consensus molecular sub-
comparative genomics. Genome Res 19, 1639–1645. types of colorectal cancer. Sci. Rep. 7, 11590.
Lemonnier, M., and Lane, D. (1998). Expression of the second lysine decar-
Qin, J., Li, R., Raes, J., Arumugam, M., Burgdorf, K.S., Manichanh, C., Nielsen,
boxylase gene of Escherichia coli. Microbiology 144, 751–760.
T., Pons, N., Levenez, F., Yamada, T., et al. (2010). A human gut microbial gene
Levy, M., Kolodziejczyk, A.A., Thaiss, C.A., and Elinav, E. (2017). Dysbiosis and catalogue established by metagenomic sequencing. Nature 464, 59–65.
the immune system. Nat. Rev. Immunol. 17, 219–232.
Quinlan, A.R., and Hall, I.M. (2010). BEDTools: a flexible suite of utilities for
Li, H., Handsaker, B., Wysoker, A., Fennell, T., Ruan, J., Homer, N., Marth, G., comparing genomic features. Bioinformatics 26, 841–842.
Abecasis, G., and Durbin, R.; 1000 Genome Project Data Processing
€ndar, F., Diehl, S., Gru
Ramı́rez, F., Du €ning, B.A., and Manke, T. (2014).
Subgroup (2009). The Sequence Alignment/Map format and SAMtools.
deepTools: a flexible platform for exploring deep-sequencing data. Nucleic
Bioinformatics 25, 2078–2079.
Acids Res 42, W187–W191.
Liao, Y., Smyth, G.K., and Shi, W. (2014). featureCounts: an efficient general
Rensing, C., and Grass, G. (2003). Escherichia coli mechanisms of copper ho-
purpose program for assigning sequence reads to genomic features.
meostasis in a changing environment. FEMS Microbiol. Rev. 27, 197–213.
Bioinformatics 30, 923–930.
Liu, J., Lichtenberg, T., Hoadley, K.A., Poisson, L.M., Lazar, A.J., Cherniack, Robinson, K.M., Crabtree, J., Mattick, J.S., Anderson, K.E., and Dunning
A.D., Kovatich, A.J., Benz, C.C., Levine, D.A., Lee, A.V., et al. (2018). An inte- Hotopp, J.C. (2017). Distinguishing potential bacteria-tumor associations
grated TCGA pan-cancer clinical data resource to drive high-quality survival from contamination in a secondary data analysis of public cancer genome
outcome analytics. Cell 173, 400–416.e11. sequence data. Microbiome 5, 9.

Luckey, T.D. (1972). Introduction to intestinal microecology. Am. J. Clin. Nutr. Rubinstein, M.R., Wang, X., Liu, W., Hao, Y., Cai, G., and Han, Y.W. (2013).
25, 1292–1294. Fusobacterium nucleatum promotes colorectal carcinogenesis by modulating
E-cadherin/beta-catenin signaling via its FadA adhesin. Cell Host Microbe 14,
Ma, W., Huang, C., Zhou, Y., Li, J., and Cui, Q. (2017). Micropattern: a web-
195–206.
based tool for microbe set enrichment analysis and disease similarity calcula-
tion based on a list of microbes. Sci. Rep. 7, 40200. Schirmer, M., Garner, A., Vlamakis, H., and Xavier, R.J. (2019). Microbial genes
and pathways in inflammatory bowel disease. Nat. Rev. Microbiol. 17,
Marotz, C., Amir, A., Humphrey, G., Gaffney, J., Gogul, G., and Knight, R.
497–511.
(2017). DNA extraction for streamlined metagenomics of diverse environ-
mental samples. BioTechniques 62, 290–293. Sender, R., Fuchs, S., and Milo, R. (2016). Revised estimates for the number of
McMurdie, P.J., and Holmes, S. (2013). phyloseq: an R package for reproduc- human and bacteria cells in the body. PLoS Biol 14, e1002533.
ible interactive analysis and graphics of microbiome census data. PLoS One 8, Sivan, A., Corrales, L., Hubert, N., Williams, J.B., Aquino-Michaels, K., Earley,
e61217. Z.M., Benyamin, F.W., Lei, Y.M., Jabri, B., Alegre, M.-L., et al. (2015).
Mira, A., Ochman, H., and Moran, N.A. (2001). Deletional bias and the evolution Commensal Bifidobacterium promotes antitumor immunity and facilitates
of bacterial genomes. Trends Genet 17, 589–596. anti-PD-L1 efficacy. Science 350, 1084–1089.

Nejman, D., Livyatan, I., Fuks, G., Gavert, N., Zwang, Y., Geller, L.T., Rotter- Sriswasdi, S., Yang, C.C., and Iwasaki, W. (2017). Generalist species drive mi-
Maskowitz, A., Weiser, R., Mallel, G., Gigi, E., et al. (2020). The human tumor crobial dispersion and evolution. Nat. Commun. 8, 1162.
microbiome is composed of tumor type-specific intracellular bacteria. Subramanian, A., Tamayo, P., Mootha, V.K., Mukherjee, S., Ebert, B.L.,
Science 368, 973–980. Gillette, M.A., Paulovich, A., Pomeroy, S.L., Golub, T.R., Lander, E.S., and
Nossa, C.W., Jain, P., Tamilselvam, B., Gupta, V.R., Chen, L.F., Schreiber, V., Mesirov, J.P. (2005). Gene set enrichment analysis: a knowledge-based
Desnoyers, S., and Blanke, S.R. (2009). Activation of the abundant nuclear fac- approach for interpreting genome-wide expression profiles. Proc. Natl.
tor poly(ADP-ribose) polymerase-1 by Helicobacter pylori. Proc. Natl. Acad. Acad. Sci. USA 102, 15545–15550.
Sci. USA 106, 19998–20003. Tang, K.W., Alaei-Mahabadi, B., Samuelsson, T., Lindh, M., and Larsson, E.
O’Donovan, D., Corcoran, G.D., Lucey, B., and Sleator, R.D. (2014). (2013). The landscape of viral expression and host gene fusion and adaptation
Campylobacter ureolyticus: a portrait of the pathogen. Virulence 5, 498–506. in human cancer. Nat. Commun. 4, 2513.

Cell Host & Microbe 29, 281–298, February 10, 2021 297
ll
Resource
Thorvaldsdóttir, H., Robinson, J.T., and Mesirov, J.P. (2013). Integrative Yamamoto, Y., Miwa, Y., Miyoshi, K., Furuyama, J.-i., and Ohmori, H. (1997).
Genomics Viewer (IGV): high-performance genomics data visualization and The Escherichia coli ldcC gene encodes another lysine decarboxylase, prob-
exploration. Brief. Bioinform. 14, 178–192. ably a constitutive enzyme. Genes Genet. Syst. 72, 167–172.

Viaud, S., Saccheri, F., Mignot, G., Yamazaki, T., Daillère, R., Hannani, D., Yang, J.H., Luo, X., Nie, Y., Su, Y., Zhao, Q., Kabir, K., Zhang, D., and
Enot, D.P., Pfirschke, C., Engblom, C., Pittet, M.J., et al. (2013). The intestinal Rabinovici, R. (2003). Widespread inosine-containing mRNA in lymphocytes
microbiota modulates the anticancer immune effects of cyclophosphamide. regulated by ADAR1 in response to inflammation. Immunology 109, 15–23.
Science 342, 971–976. Yu, L.C. (2018). Microbiota dysbiosis and barrier dysfunction in inflammatory
bowel disease and colorectal cancers: exploring a common ground hypothe-
Warren, R.L., Freeman, D.J., Pleasance, S., Watson, P., Moore, R.A., sis. J. Biomed. Sci. 25, 79.
Cochrane, K., Allen-Vercoe, E., and Holt, R.A. (2013). Co-occurrence of anaer-
Yu, T., Guo, F., Yu, Y., Sun, T., Ma, D., Han, J., Qian, Y., Kryczek, I., Sun, D.,
obic bacteria in colorectal carcinomas. Microbiome 1, 16.
Nagarsheth, N., et al. (2017). Fusobacterium nucleatum promotes chemore-
Wood, D.E., Lu, J., and Langmead, B. (2019). Improved metagenomic analysis sistance to colorectal cancer by modulating autophagy. Cell 170, 548–
with Kraken 2. Genome Biol 20, 257. 563.e16.

298 Cell Host & Microbe 29, 281–298, February 10, 2021
ll
Resource

STAR+METHODS

KEY RESOURCES TABLE

REAGENT or RESOURCE SOURCE IDENTIFIER


Biological Samples
Original TCGA tissue and plasma samples Indivumed N/A
Healthy plasma samples Duke Hospital N/A
Critical Commercial Assays
MagAttract PowerSoil DNA KF Kit Qiagen Cat# 27000-4-KF
QIAamp UCP Pathogen Mini Kit Qiagen Cat# 50214
Deposited Data
TCGA WGS bam files GDC API https://api.gdc.cancer.gov/
TCGA WXS bam files GDC API https://api.gdc.cancer.gov/
TCGA sequencing metadata GDC API https://api.gdc.cancer.gov/
TCGA sample metadata (biotab) GDC web portal https://portal.gdc.cancer.gov/
TCGA patient metadata PanCanAtlas https://gdc.cancer.gov/about-data/
publications/pancanatlas
TCGA clinical data resource outcomes PanCanAtlas https://gdc.cancer.gov/about-data/
publications/pancanatlas
Human and microbe reference genomes PathSeq bundle ftp://gsapubftp-anonymous@ftp.
broadinstitute.org/bundle/pathseq/
Human and microbe reference genomes PathSeq bundle ftp://gsapubftp-anonymous@ftp.
broadinstitute.org/bundle/pathseq/
Species and genera designated as commensal (CDC, 2019) https://www.cdc.gov/nhsn/pdfs/pscmanual/
or MBI-associated (NHSN Organism List) pcsmanual_current.pdf (Chapter 4 p.33)
Genera designated as common contaminants (Eisenhofer et al., 2019) Table 1 in Eisenhofer et al. 2019
Genome for B. vulgatus GenBank CP000139.1
Genome for A. junii GenBank NZ_CP019041.1
Genome for E. coli GenBank U00096.3
Genome for C. provencense GenBank NZ_CP024988.1
Genome for F. nucleatum GenBank AE009951.2
Pks gene cluster (colibactin) GenBank AM229678.1
Greengenes classifier (gg-13-8-99-515- (DeSantis et al., 2006) https://docs.qiime2.org/
806-nb-classifier.qza)
TCGA mRNA-seq data PanCanAtlas https://gdc.cancer.gov/about-data/
publications/pancanatlas
TCGA miRNA-seq data PanCanAtlas https://gdc.cancer.gov/about-data/
publications/pancanatlas
TCGA RPPA data PanCanAtlas https://gdc.cancer.gov/about-data/
publications/pancanatlas
TCGA DNA Methylation 27K data PanCanAtlas https://gdc.cancer.gov/about-data/
publications/pancanatlas
PARADIGM Pathway inference matrix PanCanAtlas https://gdc.cancer.gov/about-data/
publications/pancanatlas
Gene sets for GSEA (KEGG, GO) MSigDB v7.1 https://www.gsea-msigdb.org/gsea/
msigdb/genesets.jsp
TCMA tissue-resident profiles (COAD, This paper https://doi.org/10.7924/r4rn36833
READ, HNSC, ESCA, STAD)
16S sequencing results This paper N/A
(Continued on next page)

Cell Host & Microbe 29, 281–298.e1–e5, February 10, 2021 e1


ll
Resource

Continued
REAGENT or RESOURCE SOURCE IDENTIFIER
Oligonucleotides
Primers for 16S analysis IDT N/A
FWD:GTGYCAGCMGCCGCGGTAA
REV:GGACTACNVGGGTWTCTAAT
Software and Algorithms
GATK 4.0.3 (PathSeq & HaplotypeCaller) (Kostic et al., 2011; Poplin et al., 2017) https://github.com/broadinstitute/gatk/
SAMtools 1.9 (Li et al., 2009) http://samtools.sourceforge.net/
phyloseq 1.30.0 (McMurdie and Holmes, 2013) https://github.com/joey711/phyloseq
metacoder 0.3.3 (Foster et al., 2017) https://grunwaldlab.github.io/
metacoder_documentation/
gffread 0.11.6 (Pertea and Pertea, 2020) https://github.com/gpertea/gffread
STAR 2.7.3a (Dobin et al., 2013) https://github.com/alexdobin/STAR/
subread 1.6.4 (Liao et al., 2014) http://subread.sourceforge.net/
deepTools 3.3.0 (Ramı́rez et al., 2014) https://github.com/deeptools/deepTools
bedtools 2.29.0 (Quinlan and Hall, 2010) https://github.com/arq5x/bedtools2
circos 0.69.8 (Krzywinski et al., 2009) http://circos.ca/software/download/
IGV 2.4.14 (Thorvaldsdóttir et al., 2013) http://software.broadinstitute.org/
software/igv/
QIIME2 2019.7 (Bolyen et al., 2019) https://qiime2.org/
SparCC (Friedman and Alm, 2012) https://bitbucket.org/yonatanf/
sparcc/src/default/
MicroPattern (Ma et al., 2017) http://www.cuilab.cn/micropattern
lifelines 0.23.8 (Davidson-Pilon et al., 2020) https://github.com/CamDavidsonPilon/
lifelines/tree/0.24.6
GSEA 4.0.3 (Subramanian et al., 2005) https://www.gsea-msigdb.org/gsea/
Other
Patient metadata for TCGA validation This paper Table S1
samples
Tumor- and normal tissue-associated This paper Table S2
taxa

RESOURCE AVAILABILITY

Lead contact
Further information and requests for resources should be directed to and will be fulfilled by the lead contact, Xiling Shen (xiling.shen@
duke.edu).

Materials availability
The Cancer Genome Atlas (TCGA) collected biospecimens and associated clinical information from human subjects, under informed
consent and authorization of local institutional review boards. All TCGA sequencing data were accessed from the Genomic Data
Commons (GDC) portal in accordance with the TCGA Data Use Certification Agreement and under authorization of Duke’s campus
institutional review board. Original TCGA tissue and plasma samples were acquired from Indivumed, a third-party vendor. Healthy
patient samples used for validation analyses were collected and analyzed under authorization of Duke’s campus institutional review
board. Primers used from 16S sequencing were acquired from IDT. This study did not generate new unique reagents.

Data and code availability


The TCMA database can be accessed via the website (https://tcma.pratt.duke.edu). The accession number for the data reported in
this paper is https://doi.org/10.7924/r4rn36833. The TCGA sequencing data and associated aliquot, sample, and patient metadata
on which this work was based were accessed from the GDC API. Molecular profiling data and clinical endpoints used for the survival
analysis were obtained from the PanCanAtlas publication page. Scripts used for this work are available on request. More detailed
information about data and code can be found in the Key resources table.

e2 Cell Host & Microbe 29, 281–298.e1–e5, February 10, 2021


ll
Resource

METHOD DETAILS

Acquisition and metagenomic profiling of TCGA sequencing data


The raw TCGA bam files and metadata associated with each sequencing run were obtained from the NCI Genomic Data Commons
(GDC) via the GDC’s application programming interface (API). Specifically, WXS data were accessed from the GDC data repository
and WGS data were accessed from the GDC’s legacy archive. Overall, we acquired bam files from 19,409 sequencing runs (WGS:
n = 4,608; WXS: n = 15,066) for all TCGA cancer types with WGS or WXS data available. Sample-specific metadata were obtained
from the GDC web portal, and patient-specific metadata were obtained from the PanCanAtlas publication page.
All WGS and WXS data from TCGA samples were screened for microbial content using the PathSeq pipeline (Kostic et al., 2011),
which is made available as part of the Broad Institute’s Genome Analysis Toolkit (GATK 4.0). The PathSeq analysis was performed
using prebuilt human and microbial reference genomes and the NCBI taxonomy database from the PathSeq resource bundle, which
were accessed via ftp from the Broad Institute in December 2017. PathSeq was used with default settings, with the exception of the
minimum clipped read length, which was set to 50 to minimize the false positive rate. All sequencing data were analyzed on a local
high-performance computing (HPC) cluster, which is comprised of 60 compute nodes, 1,512 CPU cores, and approximately 15TB
of RAM.
Unambiguously aligned sequencing reads for bacteria at each taxonomic level were aggregated for available WGS and WXS data
from 22 TCGA sequencing projects representing a total of 19,409 sequencing runs (4,608 WGS and 15,066 WXS). Total read counts
for TCGA input bam files and PathSeq output bam files were calculated using SAMtools’ flagstats function for RPM normalization.
Total bacterial abundance values were then normalized to the total read count (in millions) of the input bam files. Aggregated PathSeq
results and associated metadata for each sequencing run were then deposited as phyloseq objects (McMurdie and Holmes, 2013) for
downstream analyses in R.

Decomposition of observed TCGA microbial profiles into tissue-resident and contaminant fractions
The classification of tissue-resident microbiota for each TCGA project was performed at the species-level using WGS sequencing
data. To assess whether a species deviated significantly from equiprevalence and identify a tissue-resident population, we found
the most generalizable criteria combined a statistical test of proportions with a hard cutoff on blood prevalence. Species were defined
as tissue-resident if they were prevalent in fewer than 20% of blood samples and significantly more prevalent in tissue than blood by a
one-sided Fisher exact test (q < 0.05).
Fisher’s test offered two major benefits: (1) it performs well for low prevalence cases, meaning that it naturally removed low-prev-
alence species which could not be statistically distinguished from contamination, and (2) it is sensitive to the group sample sizes pro-
vided (tissue and blood), making it sufficiently generalizable across sequencing projects which had varying numbers of tissue and
blood samples. Because of the very large total number of detectible species (n = 11,745 in CRC), we used FDR-correction to adjust
for multiple tests (q < 0.05). The second filter was hard cutoff on blood prevalence (<20%). This was effective for classifying high-
prevalence species, which were statistically more prevalent in tissue than blood but still detectible in more blood samples than
was plausible for endogenous blood-borne bacteria. Ultimately, for CRC data the second cutoff was only relevant for four species
from the Enterobacteriaceae family, which likely represent mixed-evidence species. Finally, we defined ‘‘detectible’’ as having more
than one sequencing read aligning to a given taxon (R2 reads). Singletons (taxa with a single read) are known to frequently be
sequencing artifacts or false positives and are commonly removed to reduce noise in downstream metagenomic analyses.
Many reads are not aligned at the species level. For example, unambiguous genus-level alignments are not necessarily equal to the
sum of unambiguous species-level alignments from species within that genus. Therefore, in order to preserve read counts at taxo-
nomic ranks above species-level, we adjusted read counts to reflect that a given clade could be comprised of a combination of
contaminant and tissue-resident species. The decomposition of observed metagenomic data (K) into tissue-resident (T) and contam-
inant (C) components for a given taxon in a given sample can be described using a mixture with two components of the form K = mT +
nC, where m and n represent the estimated fractions of tissue-resident or contaminant sequencing reads belonging to a given taxon,
respectively (such that m + n = 1). For all taxa above the species-level, we assigned m and n using the relative fractions of unambig-
uously aligned sequencing reads from species classified as tissue-resident or contaminant within the corresponding clade. For taxa
with fewer than 5 unambiguously assigned reads, we imputed mixtures from other sequencing runs processed on the same plate or
center. Defining these mixtures thus allowed us to propagate the classification of tissue-resident species to higher taxonomic ranks
on a sample-by-sample basis while preserving read counts that were unambiguously aligned above the species level.

Gene-level sequencing analysis of representative species


For each of bacterial genomes of interest, the fasta sequences and gff3 files were downloaded from GenBank. The gff3 files were
converted to gtf using GffRead (Pertea and Pertea, 2020). The fasta and gtf files were then analyzed with STAR (Dobin et al.,
2013) genomeGenerate to make genome files for the alignment process. For each sequencing run, the output bam file from PathSeq
was used to align to annotated bacterial genomes. Using subread’s featureCounts (Liao et al., 2014), the output from the STAR
aligner was then used to determine the read counts for each gene. For the genome coverage analysis, outputs from STAR were
sorted and converted to bam files using SAMtools (Li et al., 2009). Deeptools (Ramı́rez et al., 2014) bamCoverage function was
then used to generate the bedgraph files using RPKM normalization. The files were intersected using bedtools (Quinlan and Hall,
2010). The log10 read counts of each of the samples were summed and divided by the total number of samples per track. These

Cell Host & Microbe 29, 281–298.e1–e5, February 10, 2021 e3


ll
Resource

genome tracks were then plotted using the Circos software (Krzywinski et al., 2009). For visualization of cadA and ldcC alignments,
counts from each sample and bin (10bp) were summed and divided by the total number of samples per track. The bedgraph files were
converted to bigwig using UCSC bedGraphToBigWig, then bigwigs were plotted using The Integrated Genomics Viewer (IGV) (Thor-
valdsdóttir et al., 2013). Each track was scaled to the max bin height within the viewable region.

Nucleotide-level analysis of bacterial sequence variants


For each bacterial genome in the PathSeq reference, we screened each BAM file from the PathSeq output (COAD-READ, WGS) for
sequence variants using the GATK HaplotypeCaller pipeline (Poplin et al., 2017). Variant calling and quality filtering parameters were
chosen according to previously described methodology for bacterial sequencing data (Bush et al., 2020). The output VCF files were
then converted to TSV format and were aggregated across sequencing runs. We defined each sequencing variant as a unique com-
bination of genome accession ID, nucleotide position, reference base, and alternative base, producing a total of 3,445,630 unique
variants across all genomes and sequencing datasets. For downstream analysis, this total was further filtered to select 143,215
(4%) features that were present in at least three sequencing runs. Strain-level genome accession IDs were then mapped to NCBI
taxonomy IDs and associated lineage using the ete3 python package, then aggregated by species and genus for comparative prev-
alence analysis.

Acquisition and analysis of original TCGA tissue and plasma samples


For validation of TCMA we obtained original, matched tissue and plasma samples from a total of five CRC patients from Indivumed,
an original TCGA tissue provider. Plasma samples from three healthy subjects were obtained from patients at Duke University Hos-
pital. For tissue samples, microbial DNA was extracted from tissue samples using the MoBio PowerMag Soil DNA isolation kit (Qiagen
Cat# 27000-4-KF), following the Earth Microbiome Project (EMP) protocol (http://www.earthmicrobiome.org/) (Marotz et al., 2017).
Microbial DNA was extracted from plasma using the QIAamp UCP Pathogen Mini Kit (Qiagen Cat# 50214) following a protocol devel-
oped by Jiang et. al (Jiang, 2018). Briefly, plasma samples were pre-treated with proteinase K, followed by lysing and spin-down
through QIA amp UCP spin column. After washing with AW1 and AW2 buffers, microbial DNA was eluted in 50uL buffer AVE for down-
stream 16S library preparation and sequencing.
Bacterial compositions of isolated DNA samples were determined by amplification of the V4 variable region of the 16S rRNA gene
by polymerase chain reaction using the forward primer 515 and reverse primer 806, following the EMP protocol. These primers were
obtained from IDT and carry unique barcodes that allow for multiplexed sequencing. Equimolar 16S rRNA PCR products from all
samples were quantified and pooled prior to sequencing. Sequencing was performed on a 250bp PE MiSeq lane at the Duke Uni-
versity Center for Genomic and Computational Biology sequencing core. The 16S sequencing results were analyzed using QIIME2
(Bolyen et al., 2019). Paired-end sequencing reads (250bp) were demultiplexed, denoised, and forward reads were trimmed at 10bp
from the left and at 240bp on the right, while reverse reads were trimmed at 10bp from the left and at 220bp on the right. Taxonomic
assignments were performed using the GreenGenes database with 99% OTUs at all taxonomic levels (DeSantis et al., 2006). Read
counts for all observed taxa were summed over all assigned operational taxonomic units.

Estimation of bacterial co-abundance groups and associated molecular signatures


Compositional effects in microbiome data often complicate the calculation of correlations between microbiota; we therefore used
SparCC (Friedman and Alm, 2012) to estimate taxa that are coabundant. This method relies on a bootstrapping procedure to control
for spurious results common in microbiome survey data. Following the filtering criteria recommended in the SparCC paper, we
removed samples with fewer than 500 reads and taxa with an average abundance of fewer than 2 reads per sample prior to calcu-
lating correlations. We ran SparCC with default parameters on decontaminated CRC tissue sequencing data for 100 iterations to
identify coabundant taxa. The results of MicroPattern (Ma et al., 2017) pathway and disease-association enrichment analysis
were obtained by identifying the top 20 genera most correlated with each of Fusobacterium and Bacteroides.
To estimate molecular signatures associated with these co-abundance groups, we collected batch-normalized molecular profiling
data from the PanCanAtlas publication page, including RPPA, miRNA-seq, mRNA-seq, and Methylation 27K experiments performed
on matched TCGA samples. Prior to calculating Pearson correlations between matched samples, we performed preliminary normal-
izations on both molecular profiling data and decontaminated tissue profiles. A log10 transform was used to ensure RNA-seq and
miRNA-seq expression profiles were normally distributed. The RPPA and Methylation 27K data were left unchanged. The relative
abundances of decontaminated CRC tissue profiles were normalized using pseudocaounts and a centered log-ratio (CLR) transform.

Identification of tumor- and normal tissue-associated microbiota


Microbes associated with tumor samples or matched normal tissue were calculated in R, using a custom paired analysis function
written for metacoder (Foster et al., 2017). We filtered decontaminated microbial compositions in TCMA by selecting taxa using
filtering criteria suggested by the PhyloSeq preprocessing tutorial. Such filters are standard when preparing for downstream meta-
genomic analyses as they remove low-abundance and low-prevalence taxa which frequently have small means and large coefficients
of variation, contributing unnecessary noise for downstream differential abundance comparisons. After adding pseudocounts, we
calculated the relative abundance of microbiota for each sequencing run. Across all patients with matched tumor and normal tissue,

e4 Cell Host & Microbe 29, 281–298.e1–e5, February 10, 2021


ll
Resource

we then calculated the median log2 ratio between the relative abundance of each taxa in each tissue type. Significance values were
calculated using Wilcoxon’s rank-sums test and corrected for false discovery rate. Taxa with significant p-values (p < 0.05) were
selected for downstream analysis.

Survival analysis
We performed our survival analyses using a log-rank test, using the lifelines survival analysis python package (Davidson-Pilon et al.,
2020). Relative abundances of decontaminated bacterial compositions for all tissue samples belonging to a given patient were used
for both models. Data on patient survival, disease-free interval, and progression-free interval were collected from the PanCanAtlas’
clinical follow-up data (Liu et al., 2018). For log-rank tests, patients were segregated into two groups: one with taxa relative abun-
dance below the bottom quartile (‘‘low’’ or ’’absent’’), and one with above the top quartile (‘‘high’’). In many cases, particularly for
species-level alignments, the bottom quartile was zero and therefore may include more than a quarter of patients. To ensure quartiles
were non-equal, taxa that were present in fewer than 25% of samples were excluded from the analysis. The CPH test was performed
with default parameters and 10-fold cross-validation.

Pathway analysis of species associated with tumors or adjacent normal tissue


We used GSEA (Subramanian et al., 2005) to analyze gene expression pathways associated with species of interest. For each spe-
cies, we defined a continuous phenotype (cls) using CLR-transformed abundance values of decontaminated TCMA data. Using RNA-
seq expression data from the PanCanAtlas as the expression dataset and gene lists obtained from MSigDB v7.1 (KEGG, GO Biolog-
ical Process, GO Molecular Function), we ran GSEA for 1000 iterations. Analysis was performed for 158 matched tumor samples, as
well as for each subset with pathological stage (I: n = 33; II: n = 60; III: n = 44; IV: n = 19) within this cohort.

QUANTIFICATION AND STATISTICAL ANALYSIS

All statistical tests between unmatched groups were performed using a Wilcoxon rank-sums test (p-value), and all statistical tests
between matched groups were performed using a Wilcoxon signed-rank test (p-value) unless otherwise specified. Statistical tests
of prevalence were performed using a one-sided Fisher’s exact test. Statistical tests of variance for microbial compositions were
performed using PERMANOVA. Statistical tests for survival analyses were performed using the log-rank test. For multiple tests,
the false discovery rate (FDR; q-value) was calculated using the Benjamini-Hochberg method. All analyses were performed in python
3.7.1 and R 3.6.1. P-values are indicated as follows: *, <0.05; **, <0.01; ***, <0.001.

Cell Host & Microbe 29, 281–298.e1–e5, February 10, 2021 e5


Cell Host & Microbe, Volume 29

Supplemental Information

The cancer microbiome atlas: a pan-cancer


comparative analysis to distinguish tissue-resident
microbiota from contaminants
Anders B. Dohlman, Diana Arguijo Mendoza, Shengli Ding, Michael Gao, Holly
Dressman, Iliyan D. Iliev, Steven M. Lipkin, and Xiling Shen
SUPPLEMENTAL FIGURE LEGENDS

Figure S1. WGS and WXS harbor colorectal bacterial reads distinct from blood and brain, Related to

Figure 1

(A) Matched analysis of Shannon diversity of bacteria in normal tissue (yellow), tumor tissue (blue), and

blood (red) from CRC and BC patients in TCGA. By a paired, one-sided t-test, tumor and normal

tissue from CRC patients have more bacterial diversity than CRC blood, but BC tumor tissue does not

have more bacterial diversity than BC blood.

(B) Shows diversity data from (A), but pools together tumor and normal tissue samples from BC (green)

and CRC (brown) patients. By an unpaired, one-sided t-test CRC tissue has greater bacterial diversity

than BC tissue, and CRC blood has greater bacterial diversity than BC blood.

(C) Comparison of bacterial species prevalence in WXS data for blood and tissue samples from CRC and

BC.

(D) Comparison of bacterial species prevalence in WXS data for BC and CRC samples from tissue and

blood.

(E) Comparison of bacterial species prevalence in WGS and WXS data for blood and tissue samples from

OVC.

(F) Comparison of bacterial species prevalence in WGS data for OVC and CRC samples from tissue and

blood.

(G) Relative abundance of bacterial phyla in WXS data for tissue and blood samples from CRC and BC

patients.

(H) Comparison of bacterial species prevalence in tissue and blood using Kraken2 results.

(C – D) Correlation between PathSeq and Kraken2 read counts estimated for selected phyla (I) and genera (J)

in matched sequencing experiments from tumors (blue) and normal tissue (yellow).

Figure S2. Most equiprevalent taxa are common contaminants and are associated with particular

sequencing centers, Related to Figure 2

1
(A) Genera ranked by their prevalence in CRC blood samples. Common contaminants are marked in red.

(B) Panel comparing gene count, G-C content, pH optima, and NaCl optima for tissue-enriched (blue) and

equiprevalent (red) species. Dashed lines represent median and quartiles.

(C) PCoA of WXS data for CRC samples, labeled by sample type (left) and sequencing center (right).

(D) Schematic showing the path of TCGA sequencing samples from their tissue source site (TSS) to their

genome characterization centers (GSC) for sequencing by various platforms.

(E) Abundance of selected tissue-resident species overlaid on PCoA plot of WGS sequencing data (Figure

2D).

(F) Abundance of putative contaminant species overlaid on PCoA plot of WGS sequencing data (Figure

2D).

(G) PCoA of WGS data for BC samples, labeled by sample type (top) and sequencing center (bottom).

(H) Heatmap clustering of bacterial species’ normalized log10-abundance in blood samples from BC

patients.

(I) Species classified as tissue-resident by comparing WGS prevalence in tissue and blood samples (blue)

are consistent with prevalence comparisons of BC tissue vs. CRC tissue and for analogous

comparisons made with WXS data.

(J) Fraction of contaminant bacterial reads from WXS data for normal (yellow), tumor (blue), and blood

(red) samples from CRC patients, broken down by the five most prevalent phyla.

(K) Fraction of contaminant bacterial reads from WGS data for each sample type at Harvard, Baylor, and

Broad.

(L) The CLR-normalized relative abundances of the five most prevalent phyla are uncorrelated between

matched WGS and WXS blood samples. Dashed lines represent

Figure S3. Detecting tissue-resident and contaminant species with gene-level resolution, Related to

Figure 3

(A) Prevalence of genes belonging to F. nucleatum genes in blood vs. tissue; representative of tissue-

resident species. The large blue dot indicates species-level prevalence.

2
(B) Prevalence of genes belonging to C. provencense genes in blood vs. tissue; representative of

contaminant species. The large red dot indicates species-level prevalence.

(C – D) Distribution of gene prevalence in blood (red) and tissue (blue) for F. nucleatum (C) and C. provencense

(D).

(E – F) Genome coverage of reads aligning to F. nucleatum (E) and C. provencense (F) in blood (red) and

tissue (blue).

(G – I) Genome coverage of B. vulgatus (G), A. junii (H), E. coli (I) in samples from Baylor and Harvard.

(J) Distribution of sequencing reads aligned to cadA (left) and ldcC (right) genes in blood (red) and tissue

(blue).

(K) Prevalence of genes in the pks island encoding colibactin in tissue and blood samples. These genes

are enriched in tissue from CRC patients, consistent with previous reports.

(L) Barplot showing average prevalence of variants from species defined as tissue-resident and

equiprevalent species in blood (red) and tissue (blue). Data are represented as mean ± 95% CI.

(M) Prevalence of nucleotide variants in selected tissue-resident (B. vulgatus), contaminant

(Mycobacterium spp.), and mixed-evidence species (E. coli). E. coli appears to harbor both tissue-

enriched and equiprevalent sequencing variants.

* Denotes tissue-resident variants.

Figure S4. Metagenomic analysis of original TCGA tissue and blood samples validate tissue-resident

microbial compositions and equivalent species as contaminants, Related to Figure 4

(A) Prevalence of species in blood samples sequenced at Baylor vs. Harvard.

(B) Total bacterial reads from 16S results of tissue (blue), plasma (red) and controls (bottom panel). Data

are represented as mean ± 95% CI.

(C) Weighted UniFrac distances between microbial compositions of tissue sequenced at Harvard and

Baylor (WGS) compared with patient-matched tissue analyzed at Duke (16S), before and after

correcting for contamination.

3
(D) Relative abundances of bacterial phyla in 16S results for tissue compared with tissue samples

sequenced using WGS at Harvard and Baylor.

(E – F) Relative abundances of bacterial phyla in 16S results for blood compared with blood samples

sequenced using WGS at Harvard and Baylor (E) and WXS (F).

Figure S5. Analysis of microbe-microbe associations in contamination-adjusted data yields clinically

relevant bacterial coabundance groups, Related to Figure 5

(A) T-distributed stochastic neighborhood embedding (T-SNE) visualization showing coabundance of

tumor-associated (blue) and normal tissue-associated (yellow) species. Points are sized proportional to

the species’ prevalence in tissue.

(B) T-SNE visualization showing coabundance of detected species and their relative abundances (purple).

Points are sized according to the species prevalence in tissue.

(C) Coabundance groups are predictive of gene methylation (Methylation 27K).

(D) Coabundance groups are predictive of micro-RNA expression (miRNA-seq).

(E) ADAR1, which distinguished of Fusobacterium and Bacteroides coabundance groups, is positively

correlated with PARP1, IFN-gamma, and TNF-alpha.

(F) Volcano plot showing species that are more abundant in tumor samples (blue) or matched normal tissue

(yellow).

(G) Heat-tree showing Bacteroides spp. and their associations with normal (yellow) and tumor (blue) tissue.

(H) Barplot showing genes significantly correlated with C. ureolyticus expression in tissue samples from

CRC patients.

(I) Scatterplots showing correlation of C. ureolyticus abundance (CLR) and gene expression of CAMK2D

and UGDH.

(J) Several in the C. ureolyticus genome are significantly associated with tumor samples (blue) relative to

adjacent normal tissue (yellow).

(K) C. ureolyticus is predictive of progression-free interval (PFI) among among patients with CRC

recurrence.

4
(L) Several species in the Bacteroides cluster are negatively predictive of patient survival (OS).

Figure S6. Bacteria in TCGA sequencing data are prognostic of mucosal barrier injury, immune

infiltration, and tumor staging, Related to Figure 6

(A) Heatmap showing correlations between CLR-transformed abundances of tumor- and normal tissue-

associated species with features in the PARADIGM pathway matrix. PARADIGM scores are inferred

from gene expression data.

(B – D) Comparison of differentially abundant species and their association with tissue type (x-axis) versus

enrichment score (y-axis) for KEGG terms associated with immune pathways (B), cancer-related

pathways (C), and cellular adhesion pathways (D).

(E) Enrichment scores for tumor-associated (blue) and normal tissue-associated (yellow) species across

pathologic stages for selected pathways.

(F) Panel showing enrichment scores across pathologic stage for all tumor-associated (blue) and normal

tissue-associated (yellow) species and selected pathways.

(G) Abundance of species in BC blood (green) and CRC blood (brown) samples, ordered by statistical

significance.

Figure S7. Contamination-adjusted tissue microbiome profiles for all gastrointestinal cancers in TCGA,

Related to Figure 7

(A) Prevalence of species in blood and tissue from non-gastrointestinal cancers.

(B) Prevalence of species in BC tissue and disease-specific tissue from non-gastrointestinal cancers.

You might also like