Article https://doi.org/10.1038/s41467-024-54771-4

Synthetic augmentation of cancer cell line multi-omic datasets using unsupervised deep learning
Zhaoxiang Cai 1,5, Sofia Apolinário 2,3,5, Ana R. Baião 2,3, Clare Pacini 4, Miguel D. Sousa 2,3, Susana Vinga 2,3, Roger R. Reddel 1, Phillip J. Robinson 1, Mathew J. Garnett 4, Qing Zhong 1 & Emanuel Gonçalves 2,3

Received: 21 December 2023
Accepted: 18 November 2024

Integrating diverse types of biological data is essential for a holistic understanding of cancer biology, yet it remains challenging due to data heterogeneity, complexity, and sparsity. Addressing this, our study introduces an unsupervised deep learning model, MOSA (Multi-Omic Synthetic Augmentation), specifically designed to integrate and augment the Cancer Dependency Map (DepMap). Harnessing orthogonal multi-omic information, this model successfully generates molecular and phenotypic profiles, resulting in an increase of 32.7% in the number of multi-omic profiles and thereby generating a complete DepMap for 1523 cancer cell lines. The synthetically enhanced data increases statistical power, uncovering less studied mechanisms associated with drug resistance, and refines the identification of genetic associations and clustering of cancer cell lines. By applying SHapley Additive exPlanations (SHAP) for model interpretation, MOSA reveals multi-omic features essential for cell clustering and biomarker identification related to drug and gene dependencies. This understanding is crucial for developing much-needed effective strategies to prioritize cancer targets.

The growing molecular and phenotypic characterization of cancer cell lines makes them one of the most studied human cell models [1]. This ever-growing and rich multi-omic data continues to drive the identification of cancer genes and the discovery of therapeutic targets [2–4]. Although genomics has been a primary focus in the search for predictive biomarkers in cancer, recent functional genetic screens conducted by the Cancer Dependency Map (DepMap) consortium revealed that less than 20% of RNAi cancer dependencies could be explained by mutations and copy number alterations [5]. This highlights the importance of developing holistic machine learning models capable of vertically integrating orthogonal datasets. In this case, vertical integration involves not only genomics but also other types of omics data [6].

Despite recent successes of deep learning [7], multi-omics integration faces several limitations, most importantly high heterogeneity of different data types (e.g., discrete vs. continuous distributions), intrinsic technological limitations (e.g., missing values), and limited data availability (e.g., in this study, only 25.8% of the cancer cell lines have a complete set of all seven omic datasets under consideration) [8]. Unsupervised machine learning has been successful in multi-omics integration, capturing patterns of data variation shared across different omics [9,10]. This approach highlighted cancer cellular states associated with epithelial-to-mesenchymal transition (EMT), a key process in drug resistance and metastasis [11]. Unsupervised deep learning-based models can generate improved versions of input datasets by reconstructing missing measurements and correcting experimental error, and

1 ProCan®, Children’s Medical Research Institute, Faculty of Medicine and Health, The University of Sydney, Westmead, NSW, Australia.
2 INESC-ID, 1000-029 Lisboa, Portugal.
3 Instituto Superior Técnico (IST), Universidade de Lisboa, 1049-001 Lisboa, Portugal.
4 Wellcome Sanger Institute, Wellcome Genome Campus, Cambridge CB10 1SA, UK.
5 These authors contributed equally: Zhaoxiang Cai, Sofia Apolinário.
e-mail: [email protected]; [email protected]

Nature Communications | (2024)15:10390 1



thereby augmenting downstream analysis [12,13]. Although linear dimensionality reduction models [10,14] have been designed for similar purposes, the application of deep generative models to large-scale multi-omic cancer cell models is lagging behind. This leaves a gap in the utilization of these non-linear approaches to augment datasets and perform statistical analysis to improve the characterization of cancer mechanisms, biomarkers and drug targets [5,15,16]. Deep learning models, such as variational autoencoders (VAE), provide more complex formulations of the underlying biological data. Moreover, VAEs have highly flexible designs that can handle data sparsity robustly and are easily extensible to incorporate different data types. In particular, methods based on VAE models have demonstrated significant success in the field of single-cell multi-omics integration and augmentation. However, these methods often presuppose the presence of specific data types, such as count data from scRNA-seq and scATAC-seq, limiting their applicability across broader omic landscapes [17–21].

Here, we developed a Multi-Omic Synthetic Augmentation (MOSA) VAE model that integrates and synthetically augments multi-omic datasets from >1500 cancer cell lines of the DepMap. MOSA provides a generative unsupervised deep learning model for cancer discovery that utilizes SHapley Additive exPlanations (SHAP) [22] values for model explainability, facilitating the identification of underlying biological mechanisms and drug targets. In our study, we systematically evaluated and benchmarked MOSA, demonstrating its generative capacity across independent drug response and proteomic datasets and accurately recovering cancer tissue-of-origin clustering. Additionally, MOSA increased the statistical power to find genomic associations with CRISPR-Cas9 gene essentiality screens. Synthetically screened cancer cell lines revealed vulnerabilities consistent with genomic profiles, such as FLI1-EWSR1 fusion dependency. With MOSA, we generated a complete multi-omic profile across all seven different omics, increasing by 32.7% the number of available screens.

Results

Unifying deep generative model for cancer multi-omics
Taking advantage of the DepMap project [5,6,23,24], we assembled seven different cancer cell line datasets, i.e., genomics [2,3], methylomics [25], transcriptomics [26], proteomics [11], metabolomics [27], drug response [2,25,28,29], and CRISPR-Cas9 gene essentiality [4,30] (Fig. 1a). This comprises a total of 1523 cancer cell lines for which at least two datasets were available (Supplementary Data 1). We designed MOSA tailored to the cancer cell line multi-omic datasets, performed robust data augmentation, and provided model explanations for biomarker discovery (Fig. 1b, see Methods).

First, following a late integration [31] approach, we trained a separate encoder for each dataset to derive latent embeddings specific to each omic layer. These embeddings were then concatenated and further reduced to formulate a joint multi-omic latent representation (Fig. 1c, Supplementary Data 2). Here, a latent representation is a learned, abstracted feature set (embeddings) within the hidden layers of the neural network that encapsulates the major information from the input data. Compared to a multi-omic linear dimensionality reduction method, MOFA [10,14], and another VAE-based method, MOVE [32], our model provides better separation of cell lines by tissue in the multi-omic latent space (Fig. 1c, Supplementary Fig. 1).

Second, genomics presents a unique challenge due to the sparsity and qualitative nature of its data. To address this, we use only cancer driver events and split genomics into copy number alterations and mutations. While copy number events are integrated as ordinal data through a separate encoder/decoder akin to other omics, mutations are integrated as binary conditionals to each encoder (Fig. 1b, see Methods). The rationale is that genetic backgrounds influence cellular profiles and phenotypes, thereby conditioning other omic layers. The conditional matrix contains genetic alterations in cancer driver genes (including gene fusions), cell line tissue of origin, cell line growth rate

Fig. 1 | Cancer multi-omics integration with MOSA. a Cancer cell line multi-omic datasets across the 1523 cancer cell lines. Purple represents measured screens, while orange represents gaps, i.e., missing screens, which were synthetically generated with MOSA. b Schematic of the autoencoder, MOSA, where encoders are represented at the top and decoders at the bottom. For simplicity, the integration of only two datasets is represented. Highlighted designs of MOSA are illustrated on the right. Created in BioRender. Cai, Z. (2023) BioRender.com/m96b457. c Dimensionality reduction visualized using Uniform Manifold Approximation and Projection (UMAP) representation of the trained MOSA joint latent space, where each dot represents a cancer cell line colored according to its tissue of origin.
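
The design elements described for Fig. 1b (one encoder per omic, concatenation into a joint latent space, binary conditionals fed to the encoders and appended to the joint latent space before decoding, whole-view dropout, and encoders that see fewer features than the decoders reconstruct) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the layer sizes, the `tanh` activations, the untrained random weights, the deterministic forward pass (no variational sampling), and the choice to apply view dropout to a whole embedding per pass are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def dense(n_in, n_out):
    """Random weights for one linear layer (illustrative, untrained)."""
    return rng.normal(scale=0.1, size=(n_in, n_out))

def forward(views, conditionals, enc_w, joint_w, dec_w,
            view_dropout=0.0, training=False):
    """Late-integration forward pass (hypothetical sketch).

    views:        dict name -> (n_samples, n_encoder_features) array
    conditionals: (n_samples, n_cond) array of conditional variables
    """
    embeddings = []
    for name, x in views.items():
        # Each omic has its own encoder; conditionals are appended to its input.
        h = np.tanh(np.concatenate([x, conditionals], axis=1) @ enc_w[name])
        if training and rng.random() < view_dropout:
            # Whole-view dropout (assumed form): mask the entire omic embedding.
            h = np.zeros_like(h)
        embeddings.append(h)
    # Concatenate the per-omic embeddings and reduce them to a joint latent space.
    z = np.tanh(np.concatenate(embeddings, axis=1) @ joint_w)
    # Decoders receive the joint latent space concatenated with the conditionals.
    z_cond = np.concatenate([z, conditionals], axis=1)
    return z, {name: z_cond @ dec_w[name] for name in views}

# Toy dimensions: encoders see fewer features than decoders reconstruct,
# mimicking the asymmetric design (all sizes below are made up).
n, n_cond, d_view, d_joint = 8, 5, 4, 3
sizes = {"transcriptomics": (20, 50), "drug_response": (10, 15)}
conditionals = rng.integers(0, 2, size=(n, n_cond)).astype(float)
views = {k: rng.normal(size=(n, n_in)) for k, (n_in, _) in sizes.items()}
enc_w = {k: dense(n_in + n_cond, d_view) for k, (n_in, _) in sizes.items()}
joint_w = dense(d_view * len(sizes), d_joint)
dec_w = {k: dense(d_joint + n_cond, n_out) for k, (_, n_out) in sizes.items()}

z, recon = forward(views, conditionals, enc_w, joint_w, dec_w)
print(z.shape, {k: v.shape for k, v in recon.items()})
```

Because the decoders only depend on the joint latent space and the conditionals, a cell line with a missing omic can still receive a reconstruction for that omic, which is the property that makes synthetic augmentation possible in this kind of design.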


measurements, and microsatellite instability information (MSI high), totaling 237 conditional variables (Supplementary Data 3). This conditional matrix is further concatenated to the learned multi-omic joint latent space that works as input for the decoders. Hence, the genetic background and cellular information are crucial for generating latent representations and reconstructing each omic dataset.

Third, compared with similar models for single-cell data [12,21,33], the limited number of samples and heterogeneity of the omics available in the DepMap pose significant challenges to training a generalizable model for cancer cell lines. To reduce model complexity, MOSA only considers the most variable features as input for the encoders, while all features are reconstructed by the decoders for synthetic data generation, resulting in an asymmetrical design of VAE (Fig. 1b, Supplementary Fig. 2a, b, Supplementary Data 4). This unique design of MOSA allows us to discard features with low information content, such as genes with constant expression and non-essential genes across all cancer cell lines. This reduces the number of trainable parameters by 39.2% while maintaining low reconstruction error.

Fourth, the diverse sizes of the multi-omic datasets may lead to some datasets dominating during training, diminishing the model’s generalizability and explainability. We developed a whole-omic (view) dropout layer, which masks a complete omic layer based on a hyperparameter. This provides a significant improvement in the model’s generalization, providing better reconstructions for cancer cell lines by specific omics (see Methods, Fig. 1b). We then perform a multi-omic model explanation by calculating SHAP values [22] for all omic input features to assess their importance for the latent space integration and the reconstruction of omic features (see Methods). This provides a systematic resource to explore potential nonlinear cancer genotype-phenotype associations.

Taken together, MOSA provides an unsupervised model that integrates all cancer cell line omics simultaneously. Using a 10-fold cross-validation strategy, MOSA’s reconstructed hold-out folds for CRISPR-Cas9 and drug responses were robustly correlated with the original data (mean feature Pearson’s r of 0.35 and 0.65, respectively) (Fig. 2a, Supplementary Fig. 3, Supplementary Data 5). MOSA performed better than a similar systematic supervised analysis designed to predict each CRISPR-Cas9 gene dependency either using core-omics (e.g., genomics, transcriptomics), only genomics, or only functionally related genes (mean feature best Pearson’s r = 0.25) [34].

Evaluation of multi-omics synthetic data generation
A significant advantage of multi-omic vertical integration and unsupervised deep-generative models is their ability to synthetically generate datasets that are missing in specific samples, such as reconstructing a dataset that is entirely absent for certain cell lines. This is particularly crucial given the pervasive dataset gaps, even in well-characterized models such as cancer cell lines (Fig. 1a). Multi-omic profiling is both costly and labor-intensive, thus data-driven generative models are key to prioritizing the design of the most informative experiments. However, benchmarking generative models is challenging as it requires independent, ideally large-scale, datasets to validate the model’s predictions. We initially tested 16 multi-omics integration methods (Supplementary Data 6, see Methods), but due to constraints such as the number of omics supported, type of data distribution, and limitations in design and implementation, we narrowed the comparison to three state-of-the-art methods: MOFA [10,14], MOVE [32], and mixOmics [35,36]. These methods, encompassing linear, VAE-based, and correlation analysis approaches, were capable of integrating all seven omics datasets considered here. The following sections present a series of benchmarks in order of increasing model complexity.

MOSA reconstructs the input data matrices by leveraging the multi-omic latent space learned from the original data. Data reconstruction generates complete omic matrices, thus handling both missing values (partial dataset augmentation), and more importantly, reconstructing whole-omics (full dataset augmentation) through vertical integration (at least two omics are required for a cell line to be considered in this study). For partial dataset augmentation, MOSA imputes incomplete features, e.g., measurements for certain proteins are sparse due to technical limitations commonly found in mass-spectrometry-based proteomics data [11,37]. A recent and independent drug response dataset, which was completely absent during model training, was accurately reconstructed (IC50s, Pearson’s r = 0.87, n = 32,659) (Fig. 2b), outperforming MOFA [10,14], MOVE [32] and naive mean imputation (Fig. 2c–e). Pronounced discrepancies between MOSA’s reconstruction and the original datasets revealed likely inaccurate experimental measurements. For example, the response to the MEK1/2 inhibitor trametinib was not consistent with replicate measurements and drugs with the same canonical target in the same cell line (Supplementary Fig. 4a). Such discrepancies also spotlighted drugs (e.g., venetoclax) or classes of drugs (e.g., antiapoptotic inhibitors) for which no effective molecular biomarkers are available (Supplementary Fig. 4b), underlining the challenge of devising reliable predictive models for their response. Additionally, proteomics is riddled with missing values (Supplementary Fig. 5a), which predominantly affect lowly abundant proteins. MOSA augmented the proteomics data by filling approximately 32% of the original matrix using information from all omics, while preserving sample correlations with an independent proteomic dataset (CCLE [38]) (Supplementary Fig. 5b, Supplementary Data 7). Notably, MOSA effectively reconstructed the protein profiles of SMAD4 in cell lines characterized by SMAD4 gene deletions, which are typically associated with low SMAD4 gene expression and protein abundance (Supplementary Fig. 5c). The MOSA-augmented proteomic matrix preserves the ability to identify protein interactions through protein pairwise correlations [39] (Supplementary Fig. 5d). In contrast to the original matrix that has missing values, MOSA’s augmented protein matrix is complete and directly usable for downstream analysis, such as generalized linear models [40], which improved the recall of protein complex interactions (Supplementary Fig. 5d).

Subsequently, full dataset augmentation was assessed. Synthetic proteomic data generated by MOSA for cancer cell lines lacking proteomic measurements showed correlations with independent proteomic measurements comparable to those of cell lines that had actual proteomic data (Fig. 3a). For drug response, reconstructions of 107 overlapping drugs correlated robustly with measurements in an independent dataset (CTD2 [41,42]) (Fig. 3b). Lastly, we performed a similar analysis using independently processed transcriptomics, which included data for 272 cancer cell lines that did not have transcriptomics data during the training of MOSA [26]. MOSA’s transcriptomic reconstructions were strongly correlated with real data even for cell lines with no transcriptomics data for training (mean Pearson’s r = 0.90) (Supplementary Fig. 5e). Crucially, this shows the capacity of MOSA as a generative model for synthetic cancer cell line multi-omic and phenotypic screening.

We evaluated downstream analysis by comparing the original data matrices with the augmented ones. MOSA increased by 34.9% the number of CRISPR-Cas9 cell line screens, and the augmented dataset improved the statistical power to find genetic associations (Fig. 3c, Supplementary Data 8). Gene essentiality specificity (Fisher’s skewness test), which can be used to identify selective cancer vulnerabilities, showed a moderate positive correlation (Pearson’s r = 0.52) between the synthetic CRISPR-Cas9 screened cell lines and the previously available screens (Fig. 3d). Nonetheless, this correlation is likely underestimated due to the presence of potential outlier non-essential genes. MOSA accurately reconstructed gene dependencies, for example, BRAF dependency in BRAF gain-of-function mutant cancer cell lines (Fig. 3e), and FLI1 dependency in cell lines harboring an FLI1-EWSR1 fusion gene (Fig. 3f).

Lastly, we aimed to assess the advantages of developing a method capable of natively integrating more than two omics. Specifically, we


Fig. 2 | MOSA reconstruction of drug response and CRISPR-Cas9 datasets. a MOSA reconstruction quality measured using a 10-fold cross-validation. After reconstructing all test folds, they are concatenated and the reconstruction quality score is calculated as the Pearson’s r between the reconstructed and actual measured values. Features ranked by their reconstruction quality are shown for the drug response (left) and the CRISPR-Cas9 (right) datasets. Duplicated drug names represent replicated screens for the same drug. Representative examples of strongly selective CRISPR-Cas9 and drug responses are labeled. b MOSA’s partial dataset augmentation (missing value imputation) of drug IC50s compared to recent independent drug response screens. c–e Similar to b, using MOFA, MOVE and mean imputed values, respectively.
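
The reconstruction quality score described in panel a (concatenate the held-out folds, then compute Pearson's r per feature between reconstructed and measured values) could be computed along the following lines. This is a hedged sketch rather than the authors' code: the function name `feature_pearson`, the `min_obs` cut-off, and the convention that missing measurements are stored as NaN and excluded from the correlation are assumptions.

```python
import numpy as np

def feature_pearson(measured, reconstructed, min_obs=3):
    """Pearson's r per feature (column), using only entries that were measured."""
    scores = np.full(measured.shape[1], np.nan)
    for j in range(measured.shape[1]):
        observed = ~np.isnan(measured[:, j])
        if observed.sum() < min_obs:
            continue  # too few measurements to estimate a correlation
        x, y = measured[observed, j], reconstructed[observed, j]
        if x.std() == 0 or y.std() == 0:
            continue  # correlation is undefined for constant vectors
        scores[j] = np.corrcoef(x, y)[0, 1]
    return scores

# Toy data: rows are cell lines, columns are features; NaN marks a missing value.
measured = np.array([[1.0, 4.0, np.nan],
                     [2.0, 3.0, np.nan],
                     [3.0, np.nan, 1.0],
                     [4.0, 1.0, np.nan]])
reconstructed = np.array([[3.0, -4.0, 0.2],
                          [5.0, -3.0, 0.1],
                          [7.0, 0.0, 0.9],
                          [9.0, -1.0, 0.3]])

# Pretend the rows came from two held-out folds and concatenate them first.
folds = [(measured[:2], reconstructed[:2]), (measured[2:], reconstructed[2:])]
all_measured = np.vstack([m for m, _ in folds])
all_reconstructed = np.vstack([r for _, r in folds])

scores = feature_pearson(all_measured, all_reconstructed)
print(scores)
```

Averaging `scores` with `np.nanmean` would give a single summary comparable in spirit to the mean feature Pearson's r reported in the text.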

focused on transcriptomics and drug response datasets, which represent molecular and phenotypic datasets, respectively. These are also commonly utilized in multi-omics integration and are among the most informative omic types for our benchmarks. From the list of methods we evaluated, we considered iClusterPlus [43], JAMIE [18], scVAEIT [44], and moCluster [45] (Supplementary Data 6, see Methods). MOSA provided a better reconstruction of transcriptomics and drug response data (Supplementary Fig. 6a, b). Particularly, adding more omics to MOSA provided a significant improvement over existing methods, supporting the utility of using holistic multi-omics models. Furthermore, MOSA consistently outperformed the other methods in tissue of origin clustering (Supplementary Fig. 6c). Considering only


Fig. 3 | Multi-omics benchmark of MOSA. a Distribution of proteomics cancer cell line correlations with an independent dataset (CCLE [38]) grouped by whether the cancer cell line had proteomic data for the model training (orange, n = 291) versus cell lines without any proteomics prior (light blue, n = 78). b Distribution of cancer cell line correlations (Pearson’s r) between an independent drug response dataset (CTD2 [41,42]) and the MOSA reconstructed dataset, grouped by whether the cancer cell line had prior availability of drug response in the datasets for the model training (orange, n = 571) versus cell lines without drug response data (light blue, n = 239). c One-sided log-ratio test p-value of genetic associations with CRISPR-Cas9 gene essentiality with the original dataset (x-axis) and the augmented MOSA dataset (y-axis). False discovery rate (FDR) correction is applied using the Benjamini-Hochberg method to adjust for multiple comparisons. d Fisher skew test per gene across the original CRISPR-Cas9 dataset (x-axis) and the MOSA augmented dataset (y-axis). Dot size represents the number of cell lines that have the gene as essential (scaled log2 fold-change < −0.5) in the original dataset. e Correlation between BRAF and MAPK1 CRISPR-Cas9 gene essentialities using both previously measured (Observed) and synthetically reconstructed (Reconstructed) values. Gene essentiality scores are represented using copy-number corrected [78] log2 fold-changes scaled by the median of common essential (score = −1) and non-essential (score = 0) genes [30]. Gene essentialities are also grouped according to the presence or absence of a BRAF mutation, mostly V600E gain-of-function mutations. f CRISPR-Cas9 gene essentiality association with FLI1-EWSR1 fusion. Confidence intervals of 95% are displayed for the regression lines in panels d, e, and f. Box-and-whisker plots show 1.5× interquartile ranges, centers indicate medians in panels e and f.
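
Two computations named in this caption lend themselves to a short illustration: scaling log2 fold-changes so that the median of common essential genes is −1 and the median of non-essential genes is 0 in each sample, and the Benjamini-Hochberg adjustment of p-values. The sketch below is an assumption-laden toy, not the pipeline used in the paper: the function names, the genes-by-samples orientation, and the example numbers are all hypothetical.

```python
import numpy as np

def scale_essentiality(lfc, essential_idx, nonessential_idx):
    """Scale each sample (column) so that the non-essential median is 0
    and the essential median is -1. `lfc` is a genes x samples matrix."""
    med_ess = np.nanmedian(lfc[essential_idx, :], axis=0)
    med_non = np.nanmedian(lfc[nonessential_idx, :], axis=0)
    return (lfc - med_non) / (med_non - med_ess)

def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    p = np.asarray(pvalues, dtype=float)
    n = p.size
    order = np.argsort(p)
    ranked = p[order] * n / np.arange(1, n + 1)
    # Enforce monotonicity starting from the largest p-value, then cap at 1.
    ranked = np.minimum.accumulate(ranked[::-1])[::-1]
    adjusted = np.empty(n)
    adjusted[order] = np.minimum(ranked, 1.0)
    return adjusted

# Toy fold-changes: 4 genes x 2 samples; genes 0-1 essential, genes 2-3 non-essential.
lfc = np.array([[-2.0, -4.0],
                [-1.0, -2.0],
                [0.0, 0.0],
                [0.5, 1.0]])
scaled = scale_essentiality(lfc, essential_idx=[0, 1], nonessential_idx=[2, 3])

adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
print(scaled)
print(adjusted)
```

After scaling, a threshold such as the −0.5 used for dot sizes in panel d sits halfway between the typical non-essential and essential gene in every sample, which is what makes scores comparable across cell lines.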

transcriptomics and drug response resulted in the best tissue of origin clustering, reflecting the strong structuring of these omics by tissue of origin [46]. In contrast, other omics such as proteomics and metabolomics are more loosely structured by tissue [11,47]. Consequently, including omics that are less strongly structured by tissue will naturally result in looser tissue clustering.

Taken together, these diverse examples demonstrate MOSA’s ability to perform both partial and full dataset augmentation validated using various independent datasets and from different laboratories. The generation of large-scale multi-omic datasets is both time and resource-intensive, thereby positioning MOSA as a valuable tool for in silico testing and prioritization of drug targets for experimental validation.

Model interpretation reveals cancer cell states
To prioritize the most promising targets, a model needs to be explainable beyond producing reliable predictions. Hence, we used the SHAP [22] algorithm to calculate the feature importance, defined as the amount of contribution of each feature to the latent space (Fig. 1b, Supplementary Data 9, see Methods). When grouping features by their corresponding omic datasets, we observed that metabolomics, drug response, and copy number alterations exhibited the highest average feature importance (Supplementary Fig. 7a). Regarding conditional features, although their average feature importance was modest, certain key features, such as TP53 mutation, growth rate, and tissue of hematopoietic and lymphoid origin emerged as highly significant, even when compared with other omic datasets (Fig. 4a, Supplementary Data 9). This underscores the importance of incorporating conditional variables into the model. Features ranked in the top five from each omic dataset also validated the capacity of our approach to recover well-established molecular processes associated with cancer (Fig. 4a), for example, CDKN2A copy number alterations, as well as sensitivity to the SRC family inhibitor, dasatinib. Interestingly, other less obvious features that were highly ranked shed light on previously less explored biological mechanisms. One specific example is the metabolite, 1-methylnicotinamide involved in the nicotinate and nicotinamide metabolism, which was calculated to be the most important feature in the metabolomics towards the multi-omics latent representation (Fig. 4a). We observed a strong relation between increased 1-methylnicotinamide intracellular abundance and the overexpression of Nicotinamide N-Methyltransferase (NNMT) enzyme, which catalyzes the production of this metabolite (Supplementary Fig. 7b). We also observed an association between


Fig. 4 | SHapley Additive exPlanations (SHAP) model explanation of MOSA. a Top features from each omic layer that contribute the most to the multi-omic latent space. b Top drugs that have the highest feature importance from metabolite 1-methylnicotinamide. c Top features that contribute the most to the reconstruction of the drug response of Daraprim (Pyrimethamine).
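
SHAP values approximate Shapley values, which attribute a model output to its input features. For a model with only a few inputs, Shapley values can be enumerated exactly, which helps to see what a feature importance of this kind means. The toy below is purely illustrative and is not the procedure applied to MOSA: the model `toy_model`, the single all-zero baseline used to represent "absent" features, and the function name `exact_shapley` are assumptions.

```python
from itertools import combinations
from math import factorial

import numpy as np

def exact_shapley(f, x, baseline):
    """Exact Shapley values of f at x, replacing absent features by `baseline`."""
    n = len(x)
    phi = np.zeros(n)

    def value(subset):
        # Model output when only the features in `subset` take their real values.
        z = baseline.copy()
        for j in subset:
            z[j] = x[j]
        return f(z)

    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Weight of coalitions of this size in the Shapley average.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                phi[i] += weight * (value(subset + (i,)) - value(subset))
    return phi

def toy_model(z):
    # One additive term and one interaction term between features 1 and 2.
    return 2.0 * z[0] + z[1] * z[2]

x = np.array([1.0, 2.0, 3.0])
baseline = np.zeros(3)
phi = exact_shapley(toy_model, x, baseline)
print(phi, phi.sum(), toy_model(x) - toy_model(baseline))
```

The attributions sum to the difference between the model output at `x` and at the baseline, and the interaction term is shared equally between the two features that produce it. Exact enumeration scales exponentially with the number of features, which is why approximations such as SHAP are needed for models with thousands of omic inputs.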

1-methylnicotinamide and the EMT state of cancer cell lines, as corroborated by the expression of VIM and CDH1 [11] (Supplementary Fig. 7c, d). This confirms a recent single-cell study’s finding that the PC-9 non-small cell lung carcinoma line, which harbors an activating EGFR mutation, develops a cellular state resistant to EGFR inhibitors through expression of EMT markers as well as accumulation of 1-methylnicotinamide [48]. Additionally, 1-methylnicotinamide was observed to increase significantly during the early stages of EMT in the A549 cell line, and this increase was associated with changes in glycolytic metabolites and histone post-translational modifications, indicating a link between 1-methylnicotinamide and epigenetic modifications during EMT [49]. While further experimental validation is necessary, this could pave the way for the identification of cancer cellular states underlying drug resistance.

To delve deeper, we subsequently used the SHAP algorithm to calculate the feature importance specifically for the reconstruction of drug response, thereby facilitating the discovery of the most promising biomarkers (see Methods). As expected, the drug response features themselves were the most important features on average (Supplementary Fig. 8a, Supplementary Data 10). Notably, the conditionals emerged as the second most important omics, reflecting the critical role of tissue of origin, mutations, and growth rate in influencing drug responses (Supplementary Fig. 8a). Centering on the metabolite 1-methylnicotinamide, drugs known to be EMT-related were ranked as the top drugs showing high feature importance from 1-methylnicotinamide (Fig. 4b). All the top five drugs, except Daraprim (Pyrimethamine) which was not included in the dataset as an anti-cancer drug, were found to be related to EMT in recent studies. Specifically, UNC0638 [50], Entinostat [51], and BIX02189 [52] suppress EMT, while methotrexate [53] shows the ability to induce EMT. This finding suggests that the top-ranked drug Daraprim may also harbor a close relation to EMT, presenting a potential avenue for repurposing in cancer treatment. Other EMT-related features such as GPX1 protein intensity [54] also ranked as top features for Daraprim, indicating the potential to utilize other features in the list for the discovery of the most promising biomarkers for drug response (Fig. 4c). Among the other top features for the top drugs, KRAS and KMT2D were consistently identified as being of high importance, and both of these genes have been implicated in EMT [55,56] (Supplementary Fig. 8b–f). Lastly, we utilized an external metabolomic dataset [47] to validate the drugs associated with 1-methylnicotinamide by SHAP values. Although the abundance of 1-methylnicotinamide was not directly measured in their study, we analyzed the drugs linked to nicotinate and nicotinamide metabolism, where 1-methylnicotinamide is a direct product of nicotinamide methylation. Several highlighted drugs identified by SHAP values, including Daraprim, UNC0638, Entinostat (MS-275), and PAC-1, were also ranked highly as either resistant or sensitive drugs (Supplementary Fig. 9).

Taken together, our findings suggest a broad association of 1-methylnicotinamide and EMT across hundreds of cancer cell lines with a potential role in drug resistance. While further assessment is needed to substantiate this, more generally, it unveils the possibility of using MOSA as a holistic model that integrates molecular and


phenotypic data of cancer cells to investigate cancer cell states, drug CCLE proteomic characterization of 375 cancer lines38, of which 291
resistance and their underlying mechanisms. comprise the proteomic dataset11 used for training. The second dataset
represents recent drug response screens with the same platform as the
Discussion drug screens used for training25,28,29 that were obtained from the
The application of deep generative models, including MOSA, in cancer Genomics of Drug Sensitivity in Cancer (GDSC) portal (https://www.
research is promising but comes with limitations, mainly related to the cancerrxgene.org/)66 comprising a total of 32,659 IC50s measured
restricted sample size which impaired exploring more complex VAE across 313 unique drugs and 781 overlapping cancer cell lines. The
designs, and more complex designs led to worse dataset reconstruc- third dataset is an independent drug response dataset (CTD2)41,42,
tions. While the overall reconstruction of the datasets was robust, comprising a total of 545 drugs and 887 cancer cell lines, for which 106
there are examples where it could be improved, particularly for pro- and 575, respectively, overlap with the drug response data used for
teomics where intrinsic data sparseness makes it more challenging for training25,28,29.
the model to train successfully. Thus, the addition of more characterized cancer models will likely allow us to train better models and reduce reconstruction error. Future efforts should leverage multi-omic resources from cancer patients and derived models, such as organoids and patient derived xenografts (PDXs), to enhance training and explore transfer learning opportunities. In addition to tabular omic data, VAEs have demonstrated great success integrating image and text-based data57,58, and MOSA can be further enhanced to integrate these types of data and enable multi-modal data augmentation. We also aim to address the complex challenge of data missing not at random (MNAR), a scenario commonly encountered in omics datasets, by adapting VAE architectures to more accurately identify and handle MNAR scenarios59–61. SHAP analysis offers an explanation for deep learning models, however, there are still some obstacles in verifying the biological significance of certain highlighted features. These challenges could be associated with the inherent limitations of SHAP and Shapley values62, thus additional research is required to ascertain the importance of these emphasized features. Furthermore, while this has provided strong initial support for the EMT-related associations, further experimental work is necessary to validate and confirm these findings across different cancer cell models.

In summary, MOSA augmented the multi-omic profiles of 1523 cancer cell lines by robustly filling in gaps in the existing experimental screens. Deep learning-based synthetic data generation can augment experimental screens by facilitating the creation of realistic datasets to guide experimental design and accelerate the validation of the most promising targets. Looking ahead, this model is readily adaptable to integrate other types of data modalities, such as imaging, further enabling the discovery of molecular/phenotype associations.

Methods
Cancer cell line multi-omic data collection
The aim was to assemble the most up-to-date and comprehensive molecular, phenotypic and cancer cell line sample information. All datasets were downloaded from the DepMap (https://depmap.org/), and the CellModelPassports (https://cellmodelpassports.sanger.ac.uk/)23 portals, with the exception of the metabolomics data which were taken directly from the original publication supplementary materials27. For reproducibility, all data used in this study are provided in a figshare repository (see Code and data availability).

We integrated genomics2,63, transcriptomics26, methylomics25, proteomics11, metabolomics27, drug response25,28,29, and CRISPR-Cas9 gene essentiality4,64. This comprised a total of 1523 cancer cell lines with at least two datasets available for each cell line. All datasets have been previously processed, normalized/scaled, and batch corrected in each of their individual publications addressing technical and design aspects important to each dataset (e.g., integration of CRISPR-Cas9 screens across different laboratories65, driver mutations and copy number alterations, and gene expression samples from different datasets26).

Cancer cell line validation datasets
Three independent datasets were used in this study for validation, i.e. they were not used for model training. The first dataset presents the

Data preprocessing
A total of seven datasets were considered: copy number (n = 777 features); methylome (n = 14,608); transcriptome (n = 15,278); proteome (n = 4922); metabolome (n = 225); drug response (n = 810); and CRISPR-Cas9 gene essentiality (n = 17,931). A total of 1523 cancer cell lines were profiled with each cell line having at least two of these datasets.

For CRISPR-Cas9 gene essentiality, transcriptomic and methylomic feature reduction was performed to exclude lowly variable features. For gene essentiality, samples were scaled using essential and non-essential genes making their median per sample -1 and 0, respectively. Never essential genes were discarded, i.e., genes that do not have an essentiality profile lower than 50% of the median log2 fold-change of essential genes in at least one cell line were removed. For transcriptomics and methylomics, a standard deviation filter was applied. By taking the standard deviation of all genes across samples, a Gaussian mixture model (k = 2) was fitted, identifying lowly variable genes and the rest. A standard deviation threshold was defined as the rightmost intercept of the two Gaussian distributions (Supplementary Fig. 2a), and any gene with a standard deviation lower than that was discarded. Moreover, for the proteomic, drug response, metabolomic and CRISPR-Cas9 datasets, any feature with a missing rate higher than 85% was discarded. All datasets were standardized by z-score, except copy number. Missing values were replaced with 0 and their position in the original dataset was stored for use in the model (e.g., to exclude them from the loss functions). In addition to these seven datasets, driver gene mutations, fusion genes, microsatellite instability, growth rate, cancer and tissue type information were concatenated into a single matrix to be used as labels of the cancer cell lines.

Multi-omics synthetic augmentation (MOSA)
MOSA is a conditional multi-view variational autoencoder implemented using PyTorch (v2.0)67. In the next section, we describe MOSA’s architecture, use of conditionals, dropout layer and SHAP explainability analysis.

Architecture. MOSA follows a traditional design of conditional VAEs (Fig. 1b). For each of the seven datasets (views), an encoder is trained with multiple fully connected layers, which are all proportional to the number of input features of the dataset plus the number of labels (concatenated conditionals). First, joint fully connected layers take as input each dataset and reduce them to a fixed number of joint latent dimensions. Different techniques were tested to integrate the omic-specific latent dimensions (e.g., product of experts), but concatenation obtained the smallest reconstruction loss. The multi-omics joint latent dimensions are further reduced to a specified number of latent dimensions (hyperparameter). Then, the joint layer outputs two layers representing Gaussian distribution mean and variance. These are important for the regularization of the latent space and are used to sample the latent dimensions. Finally, the latent dimensions (z) are concatenated with the conditionals and provided to the decoders of each dataset. The decoders have a similar but inverse architecture to the encoders.
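The standard deviation filter described under Data preprocessing places its threshold at the rightmost intercept of the two fitted Gaussian components. The following is a minimal sketch of that one step, not the authors' implementation. It assumes the two-component mixture has already been fitted, so that each component's weight, mean and standard deviation are known, and that the intercept is taken between the weighted densities (set the two weights equal to compare unweighted densities instead). The function names, the feature names and the numeric parameters are hypothetical.

```python
import math


def normal_pdf(x, mean, sd):
    """Density of a univariate normal distribution at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def rightmost_intercept(w1, m1, s1, w2, m2, s2):
    """Largest x at which the two weighted component densities are equal.

    Equating log(w1 * N(x; m1, s1)) and log(w2 * N(x; m2, s2)) gives the
    quadratic a*x^2 + b*x + c = 0 that is solved below.
    """
    a = 1.0 / (2.0 * s2**2) - 1.0 / (2.0 * s1**2)
    b = m1 / s1**2 - m2 / s2**2
    c = (m2**2 / (2.0 * s2**2) - m1**2 / (2.0 * s1**2)
         + math.log((w1 * s2) / (w2 * s1)))
    if abs(a) < 1e-12:  # equal variances: the equation becomes linear
        if abs(b) < 1e-12:
            raise ValueError("components coincide or never intersect")
        return -c / b
    disc = b * b - 4.0 * a * c
    if disc < 0:
        raise ValueError("the weighted densities do not intersect")
    root = math.sqrt(disc)
    return max((-b + root) / (2.0 * a), (-b - root) / (2.0 * a))


def keep_variable_features(feature_sd, threshold):
    """Names of features whose standard deviation exceeds the threshold."""
    return [name for name, sd in feature_sd.items() if sd > threshold]


# Hypothetical fitted parameters (weight, mean, sd) for the lowly and highly
# variable components; real values would come from fitting the mixture.
threshold = rightmost_intercept(0.6, 0.5, 0.2, 0.4, 1.5, 0.5)
selected = keep_variable_features({"GENE_A": 0.3, "GENE_B": 1.4}, threshold)
```

Solving the intercept in closed form avoids a grid search over standard deviation values; fitting the mixture itself would need an expectation-maximization routine, which is outside this sketch.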


Conditionals. We introduced a conditional architecture to enhance the model’s reconstruction performance and biological relevance. Conditionals (n = 237) include key biological features, such as cancer driver mutations, tissue types, gene fusions, MSI status, and cell line growth rate. These were used at two stages of the model architecture: 1) concatenated to each omic layer prior to encoding; 2) concatenated to the multi-omic joint latent representation before decoding. The conditional concatenation serves two crucial purposes: it contextualizes the input data within specific cellular or genetic backgrounds, and it allows the decoder to generate condition-specific reconstructions of the data. The inclusion of conditionals offered several advantages. First, it ensured that the model was not merely capturing patterns within individual omic layers in isolation. Instead, complex interactions among multi-omic data and genomic and physiological variables were accounted for, facilitating a more holistic understanding of the underlying biological processes and phenomena. Second, by embedding these conditionals into the decoder, the model can generate data reconstructions contextualized to specific cell line conditions.

View dropout layer. A special dropout strategy, namely the view dropout layer, was included in MOSA to improve both the model’s predictive power and its interpretability. Unlike traditional dropout layers, which randomly set individual features to zero, the view dropout layer zeroes out all the input features of a single omic layer. This approach encouraged the model to reconstruct the data by learning the relationships among multiple omic layers, rather than relying on one specific omic layer. For example, in generating drug response predictions, the MOSA model could disproportionately emphasize the input drug response data, neglecting the potential contributions from other omic layers, such as transcriptomic and proteomic data. By using the view dropout layer, we significantly improved the latent space cell line separation (Fig. 1c, Supplementary Figs. 1b, 10a, b) and reconstruction for both the proteomic (Fig. 3a, Supplementary Fig. 10c) and drug response data (Supplementary Fig. 10d). The dropout rate for this layer is controlled by the hyperparameter view_dropout, which was optimally set as 0.5 for the final model.

Model explanation via SHapley Additive exPlanations (SHAP). For model explanation, we used the Python package SHAP22 (v0.42.1) with technical modifications to support the multi-omic data as the input to MOSA. Specifically, the GradientExplainer, which combines IntegratedGradient68 and SmoothGrad69, was used to calculate the changes of the gradients on the model’s output regarding its input to attribute an importance value to each feature. The SHAP calculation was performed in two ways.

First, SHAP was run to explain the encoder part of MOSA, treating the integrated latent dimensions as the output. The result contains SHAP values in a multidimensional array of shape $(N_{\text{latent dim}}, N_{\text{samples}}, N_{\text{features}})$, where each $N$ represents the number of latent dimensions, samples and features, respectively. To obtain global feature importance for analysis, the absolute value of the array was first taken to account for both positive and negative impact, and the result was then summed across latent dimensions and averaged across samples. This resulted in a list of length $N_{\text{features}}$, representing the overall feature importance contributing to the latent space (Fig. 4a).

Second, SHAP was run to explain MOSA’s reconstruction of each omic dataset. Taking drug response as an example, similarly to explaining the latent space, the shape of the SHAP values is $(N_{\text{drugs}}, N_{\text{samples}}, N_{\text{features}})$, where $N_{\text{drugs}}$ represents the number of drugs, and $N_{\text{samples}}$, $N_{\text{features}}$ are as described above. In this analysis, the array was only averaged across samples, resulting in a 2D array of shape $(N_{\text{drugs}}, N_{\text{features}})$, which measures the feature importance for each drug. The feature 1-methylnicotinamide metabolite was first selected and the drugs that had the highest SHAP values were analyzed to identify EMT-related drugs (Fig. 4b). Other important features for the drugs of interest were then ranked by selecting the row of the drug and then ranking the features in descending order (Fig. 4c). Owing to limited computational resources, 20% of the samples were randomly selected to compute the feature importance for reconstructing the drug response and copy number datasets, while 20 samples were randomly selected for the other omic datasets, which have a much larger number of dimensions.

Overall, the SHAP analysis allowed us to identify features that are important for the multi-omic latent dimension and for explaining the reconstruction of features, such as drug response. Feature importance aggregated across all the samples and output dimensions can be found in Supplementary Data 9 and 10. More granular feature importances for each output dimension can be downloaded from the figshare repository provided in the code and data availability section.

Loss function. The loss function is the summation of three components: 1) reconstruction error across all input datasets; 2) weighted variational Kullback–Leibler (KL) regularization term of the multi-omic joint latent dimensions70; and 3) a contrastive loss using tissue types as labels:

$$\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{reconstruction}} + \lambda \mathcal{L}_{\text{KL}} + \alpha \mathcal{L}_{\text{contrastive}} \tag{1}$$

The reconstruction loss $\mathcal{L}_{\text{reconstruction}}$ is defined as:

$$\mathcal{L}_{\text{reconstruction}} = \sum_{d} l_d \tag{2}$$

where $l_d$ represents the reconstruction loss for dataset $d$, calculated using the mean squared error (MSE)71,72.

In Eq. (1), $\lambda$ and $\alpha$ are optimized hyperparameters that weight the KL divergence and contrastive loss terms, respectively. $\mathcal{L}_{\text{KL}}$ calculates the KL divergence between the learned Gaussian distribution of the VAE, with mean ($\mu$) and variance ($\sigma^2$), and a standard normal prior distribution70.

The last part of the loss function is a contrastive loss defined as:

$$\mathcal{L}_{\text{contrastive}} = [m_{\text{pos}} - s_p]_+ + [s_n - m_{\text{neg}}]_+ \tag{3}$$

where $s_p$ and $s_n$ represent the cosine similarity between positive pairs and negative pairs, which are defined by whether two samples have the same tissue type. $m_{\text{pos}}$ and $m_{\text{neg}}$ are positive and negative margins, which are hyperparameters tuned as described in the section below.

Asymmetrical VAE. MOSA was also engineered with an asymmetrical structure to optimize model efficiency by reducing the number of parameters. Specifically, feature selection was conducted in a data-type-specific manner before the encoding process. For transcriptomic and methylation data, only features that exhibited high variability were selected as input to the model. Highly variable features were defined using a Gaussian mixture model with two components fitted to the standard deviation of all features, thus capturing two distributions of lowly and highly variable features. The standard deviation threshold is defined as the biggest value at which the densities of the two distributions are equal, hence features with a standard deviation greater than 1.122 for transcriptomic data and 0.064 for methylation data are considered highly variable and selected as input for the encoder. For CRISPR-Cas9 data, gene knock-outs that did not significantly impact any cell line, as indicated by a gene fitness score higher than −0.5 in every cell line, were excluded from the input layer. This targeted feature selection effectively reduced the model’s computational burden. Despite this reduction in input complexity, all available features were included during the decoding process to reconstruct the data. This asymmetrical design was chosen for its ability to maintain the model’s predictive and reconstructive capacities while streamlining its architecture.
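The three-term loss of Eqs. (1)–(3) can be illustrated with a small, dependency-free sketch. It is an illustration under stated assumptions rather than the MOSA implementation: missing entries are skipped via a mask (as the preprocessing description suggests), the KL term uses the standard closed form for a diagonal Gaussian against a standard normal prior, and the hinge terms are averaged over sample pairs, which the text does not specify. All function names are hypothetical; the default weights and margins are the values listed in Table 1.

```python
import math


def masked_mse(x, x_hat, observed):
    """Mean squared error over observed entries only (missing ones are skipped)."""
    sq = [(a - b) ** 2 for a, b, keep in zip(x, x_hat, observed) if keep]
    return sum(sq) / len(sq) if sq else 0.0


def kl_standard_normal(mu, log_var):
    """KL divergence between N(mu, diag(exp(log_var))) and N(0, I)."""
    return -0.5 * sum(1.0 + lv - m * m - math.exp(lv) for m, lv in zip(mu, log_var))


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def contrastive_loss(z, tissues, m_pos=0.85, m_neg=0.15):
    """Eq. (3): [m_pos - s_p]_+ for same-tissue pairs, [s_n - m_neg]_+ otherwise.

    Averaging each hinge term over its pairs is an assumption of this sketch.
    """
    pos, neg = [], []
    for i in range(len(z)):
        for j in range(i + 1, len(z)):
            s = cosine(z[i], z[j])
            if tissues[i] == tissues[j]:
                pos.append(max(0.0, m_pos - s))
            else:
                neg.append(max(0.0, s - m_neg))
    mean_pos = sum(pos) / len(pos) if pos else 0.0
    mean_neg = sum(neg) / len(neg) if neg else 0.0
    return mean_pos + mean_neg


def total_loss(view_losses, kl, contrastive, lam=1e-4, alpha=5e-3):
    """Eq. (1): sum of per-view reconstruction losses (Eq. 2) plus weighted terms."""
    return sum(view_losses) + lam * kl + alpha * contrastive
```

With the margins from Table 1, same-tissue pairs are penalized only while their cosine similarity is below 0.85, and different-tissue pairs only while it is above 0.15.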


Table 1 | Optimized hyperparameters used for training MOSA

Parameter                                     Value
Number of epochs                              500
Batch size                                    256
Learning rate                                 3e−4
Number of cross-validation folds              3
Feature missing rate threshold (%)            0.85
Latent dimensions of each view (%)            0.25
Number of joint latent dimensions             200
Hidden dimensions (%)                         0.7
Dropout probability                           0.4
View dropout probability                      0.3
Weight of Kullback–Leibler (KL) loss term     0.0001
Weight of the contrastive loss term           0.005
Positive margin of the contrastive loss       0.85
Negative margin of the contrastive loss       0.15
Optimizer                                     Adam
Weight decay                                  5e−4
Activation function                           PReLU
Scheduler                                     Plateau
Scheduler threshold                           1e−4
Scheduler factor                              0.6
Scheduler patience                            7
Scheduler minimum learning rate               1e−7

Hyperparameters. The choice of hyperparameters (Table 1) was guided by an automatic optimization framework based on parallel trials (Optuna73) and then manually adjusted. For each run a stratified shuffle split is performed, stratifying by hematopoietic and lymphoid cell lines, leaving 20% of the samples for testing. A total of 600 trials were performed, where each trial was capped to 150 epochs.

Benchmark state-of-the-art methods
For comparison with the unsupervised multi-omics approach taken by MOSA, 16 multi-omics integration methods were tested. However, as listed in Supplementary Data 6, for most of the models, we encountered issues related to intrinsic design and implementation choices. These included, for example, limitations on the number of supported omic modalities and specific designs tailored only for count data processing, since many of the models are designed for single-cell data. Therefore, we were able to run and systematically benchmark our results for all seven omics datasets considered against three other state-of-the-art methods for multi-omics data integration, including MOFA10,14 as a linear multi-omics dimensionality reduction approach, MOVE32 as a VAE-based approach, and mixOmics35,36, which is based on generalized canonical correlation analysis. To make comparisons as close as possible and focus solely on methodological differences, the same data preprocessing was used for MOSA, MOFA, MOVE and mixOmics. Similarly, the number of factors for MOFA was initially set to the same optimal number of joint latent dimensions, i.e., 200 (Table 1). However, this generated poor results, i.e., poorly reconstructed datasets. Through manual exploration, the optimal number of factors was set to 100, which was automatically reduced during training to 97 by discarding factors with variance explained lower than 0.0001. Each view was scaled independently, and the model was run until it converged (convergence_mode = slow). The number of dimensions was successfully set as 200 in MOVE. Since mixOmics requires the number of dimensions separately associated with each omic dataset, 210 dimensions were used in mixOmics to achieve the closest comparison. The conditionals layer in MOSA contains both binary, e.g., mutations and tissue of origin, and continuous, e.g., cancer cell line growth rates and doubling times from independent studies, features. These data were included as a separate layer in MOVE; however, MOFA does not support a mixed distributed view, e.g., Gaussian and Bernoulli. Thus we could not integrate the growth rate and doubling time in the conditional view, which apart from these two has only binary features, and therefore the prior likelihood distribution was set to Bernoulli. These configurations produced the best multi-omic dimensionality reduction and view reconstruction using MOFA. The optimized model was saved as an HDF5 file and is also provided in the figshare repository. Similarly, the tissue-of-origin data was used as the target variable in mixOmics following the documentation of the package. Specifically, the DIABLO mode36 (N-integration) in the mixOmics suite, which was an extension to the original mixOmics toolset35, was used for the multi-omics data integration task in this study.

In order to evaluate MOSA more comprehensively and to address the limitations of many state-of-the-art methods that support only a limited number of omic modalities, we conducted a separate benchmarking analysis considering only two omic modalities as input. This approach allowed us to include four additional methods: JAMIE18, scVAEIT44, iClusterPlus74 and moCluster45. Transcriptomic and drug response data were used to train and benchmark MOSA against seven other methods, as these omic types are the focus of our benchmarks. Since only two omics were included, we removed the requirement for a sample to have data from at least two omics. JAMIE, iClusterPlus and moCluster were successfully run using 200 dimensions for the integrated latent space, which was the same as MOSA. However, scVAEIT with 200 latent dimensions generated poor results, especially for the latent space clustering comparison. Similar to MOFA, we manually searched for the optimal setting and decided on using 100 latent dimensions for scVAEIT. Additionally, a Gaussian distribution was set for the dist_block hyperparameter for both the transcriptomic and drug response data. MOSA, MOFA, MOVE, JAMIE and scVAEIT were evaluated for both synthetic data reconstruction and clustering performance, while mixOmics, iClusterPlus, and moCluster were only included in the latent space clustering comparison as they are not generative models. To ensure fair comparisons, conditionals were not incorporated into any of the selected methods during training.

Protein-protein interaction co-abundance analysis
Protein-protein interactions (PPIs) were estimated using two methods to compare the ability of the MOSA-augmented proteomics matrix and the original proteomic matrix to recapitulate PPIs present in specific protein interaction resources: CORUM75, BioGRID76 and STRING77. The first method, Pearson’s r, has been previously used for this task39. Due to its inherent limitations, e.g., its inability to account for confounding effects and data structure, a new method, similar to that described by Wainberg et al.40 and based on a generalized linear model (GLM), was tested. This method applies Cholesky’s whitening transformation to proteomics data by using the inverse of its covariance matrix, which decorrelates samples and pushes data towards normality. The transformed data are then used in an ordinary least squares (OLS) regression, whose calculated weights are a correlative metric between two proteins. Each method was calculated for every protein pair, with all pairs being ordered for each method by ascending p-value. Afterwards, a curve was drawn based on the cumulative sum of the presence of each pair in a PPI set, either 1/k (presence) or 0 (absence), where k is the total number of present pairs. Thus, the better the method, the greater the AUC of the recall curve.

Statistics & reproducibility
Sample sizes were determined by the availability of cancer cell lines and the associated multi-omic datasets from the Cancer Dependency Map (DepMap). A total of 1523 cancer cell lines were included, for which at least two datasets were available. No statistical method was


used to predetermine sample size. Data exclusion details can be found under the Data preprocessing section in Methods. Data were randomly split for cross-validation purposes. The randomization was stratified by hematopoietic and lymphoid cell lines to ensure balanced representation across cell line types and distinct culture conditions, i.e. suspension vs adherent. Blinding was not applicable to this study, as all analyses were conducted using publicly available in vitro data from cancer cell lines. The findings were validated using independent datasets for proteomics (CCLE), drug response (GDSC and CTD2), and transcriptomics (DepMap). A 10-fold cross-validation strategy was applied to assess the reproducibility of the MOSA model across multiple omics layers.

Inclusion and ethics
All authors have committed to upholding the principles of research ethics and inclusion as advocated by the Nature Portfolio journals.

Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Data availability
All data were assembled from the Cancer DepMap and synthetic datasets generated have been deposited in figshare under the following URLs:
DepMap datasets: https://doi.org/10.6084/m9.figshare.24420580. https://doi.org/10.6084/m9.figshare.24420598.
MOSA augmented datasets and latent representation: https://doi.org/10.6084/m9.figshare.24562765.
MOSA feature importance: https://doi.org/10.6084/m9.figshare.24473005.
MOFA multi-omics reconstruction and latent representation: https://doi.org/10.6084/m9.figshare.24420631.
MixOmics multi-omics latent representation: https://doi.org/10.6084/m9.figshare.25764408.
MOVE diabetes multi-omics reconstruction and latent representation: https://doi.org/10.6084/m9.figshare.25764438.

Code availability
All code is available at https://github.com/QuantitativeBiology/PhenPred (https://doi.org/10.5281/zenodo.13945138). The pretrained weights of MOSA are available at https://huggingface.co/QuantitativeBiology/MOSA_pretrained (https://doi.org/10.57967/hf/3634).

References
1. Trastulla, L., Noorbakhsh, J., Vazquez, F., McFarland, J. & Iorio, F. Computational estimation of quality and clinical relevance of cancer cell lines. Mol. Syst. Biol. 18, e11017 (2022).
2. Garnett, M. J. et al. Systematic identification of genomic markers of drug sensitivity in cancer cells. Nature 483, 570–575 (2012).
3. Barretina, J. et al. The Cancer Cell Line Encyclopedia enables predictive modelling of anticancer drug sensitivity. Nature 483, 603–607 (2012).
4. Behan, F. M. et al. Prioritization of cancer therapeutic targets using CRISPR–Cas9 screens. Nature 568, 511–516 (2019).
5. Tsherniak, A. et al. Defining a Cancer Dependency Map. Cell 170, 564–576.e16 (2017).
6. Pacini, C. et al. A comprehensive clinically informed map of dependencies in cancer cells and framework for target prioritization. Cancer Cell 42, 301–316.e9 (2024).
7. Wekesa, J. S. & Kimwele, M. A review of multi-omics data integration through deep learning approaches for disease diagnosis, prognosis, and treatment. Front. Genet. 14, 1199087 (2023).
8. Cai, Z., Poulos, R. C., Liu, J. & Zhong, Q. Machine learning for multi-omics data integration in cancer. iScience 25, 103798 (2022).
9. Argelaguet, R. et al. Multi-omics profiling of mouse gastrulation at single-cell resolution. Nature 576, 487–491 (2019).
10. Argelaguet, R. et al. MOFA+: a statistical framework for comprehensive integration of multi-modal single-cell data. Genome Biol. 21, 111 (2020).
11. Gonçalves, E. et al. Pan-cancer proteomic map of 949 human cell lines. Cancer Cell 40, 835–849.e8 (2022).
12. Eraslan, G., Simon, L. M., Mircea, M., Mueller, N. S. & Theis, F. J. Single-cell RNA-seq denoising using a deep count autoencoder. Nat. Commun. 10, 390 (2019).
13. Freeman, B. A. et al. MIRTH: Metabolite Imputation via Rank-Transformation and Harmonization. Genome Biol. 23, 184 (2022).
14. Argelaguet, R. et al. Multi-Omics Factor Analysis-a framework for unsupervised integration of multi-omics data sets. Mol. Syst. Biol. 14, e8124 (2018).
15. Boehm, J. S. et al. Cancer research needs a better map. Nature 589, 514–516 (2021).
16. Poulos, R. C., Cai, Z., Robinson, P. J., Reddel, R. R. & Zhong, Q. Opportunities for pharmacoproteomics in biomarker discovery. Proteomics 23, e2200031 (2023).
17. Minoura, K., Abe, K., Nam, H., Nishikawa, H. & Shimamura, T. A mixture-of-experts deep generative model for integrated analysis of single-cell multiomics data. Cell Rep. Methods 1, 100071 (2021).
18. Cohen Kalafut, N., Huang, X. & Wang, D. Joint variational autoencoders for multimodal imputation and embedding. Nat. Mach. Intell. 5, 631–642 (2023).
19. He, Z. et al. Mosaic integration and knowledge transfer of single-cell multimodal data with MIDAS. Nat. Biotechnol. 42, 1594–1605 (2024).
20. Ghazanfar, S., Guibentif, C. & Marioni, J. C. Stabilized mosaic single-cell data integration using unshared features. Nat. Biotechnol. 42, 284–292 (2024).
21. Ashuach, T. et al. MultiVI: deep generative model for the integration of multimodal data. Nat. Methods 20, 1222–1231 (2023).
22. Lundberg, S. M. & Lee, S.-I. A Unified Approach to Interpreting Model Predictions. In Advances in Neural Information Processing Systems (eds. Guyon, I. et al.) vol. 30 (Curran Associates, Inc., 2017).
23. van der Meer, D. et al. Cell Model Passports-a hub for clinical, genetic and functional datasets of preclinical cancer models. Nucleic Acids Res. 47, D923–D929 (2019).
24. Dwane, L. et al. Project Score database: a resource for investigating cancer cell dependencies and prioritizing therapeutic targets. Nucleic Acids Res. 49, D1365–D1372 (2021).
25. Iorio, F. et al. A Landscape of Pharmacogenomic Interactions in Cancer. Cell 166, 740–754 (2016).
26. Garcia-Alonso, L. et al. Transcription Factor Activities Enhance Markers of Drug Sensitivity in Cancer. Cancer Res. 78, 769–780 (2018).
27. Li, H. et al. The landscape of cancer cell line metabolism. Nat. Med. 25, 850–860 (2019).
28. Picco, G. et al. Functional linkage of gene fusions to cancer cell fitness assessed by pharmacological and CRISPR-Cas9 screening. Nat. Commun. 10, 2198 (2019).
29. Gonçalves, E. et al. Drug mechanism-of-action discovery through the integration of pharmacological and CRISPR screens. bioRxiv, https://doi.org/10.1101/2020.01.14.905729 (2020).
30. Meyers, R. M. et al. Computational correction of copy number effect improves specificity of CRISPR-Cas9 essentiality screens in cancer cells. Nat. Genet. 49, 1779–1784 (2017).
31. Zampieri, G., Vijayakumar, S., Yaneske, E. & Angione, C. Machine and deep learning meet genome-scale metabolic modeling. PLoS Comput. Biol. 15, e1007084 (2019).
32. Allesøe, R. L. et al. Discovery of drug-omics associations in type 2 diabetes with generative deep-learning models. Nat. Biotechnol. 41, 399–408 (2023).
33. Lotfollahi, M., Wolf, F. A. & Theis, F. J. scGen predicts single-cell perturbation responses. Nat. Methods 16, 715–721 (2019).


34. Dempster, J. M., Krill-Burger, J., Warren, A. & McFarland, J. Gene expression has more power for predicting in vitro cancer cell vulnerabilities than genomics. bioRxiv, https://doi.org/10.1101/2020.02.21.959627 (2020).
35. Rohart, F., Gautier, B., Singh, A. & Lê Cao, K.-A. mixOmics: An R package for ’omics feature selection and multiple data integration. PLoS Comput. Biol. 13, e1005752 (2017).
36. Singh, A. et al. DIABLO: an integrative approach for identifying key molecular drivers from multi-omics assays. Bioinformatics 35, 3055–3062 (2019).
37. Poulos, R. C. et al. Strategies to enable large-scale proteomics for reproducible research. Nat. Commun. 11, 3793 (2020).
38. Nusinow, D. P. et al. Quantitative Proteomics of the Cancer Cell Line Encyclopedia. Cell 180, 387–402.e16 (2020).
39. Gonçalves, E. et al. Widespread Post-transcriptional Attenuation of Genomic Copy-Number Variation in Cancer. Cell Syst. 5, 386–398.e4 (2017).
40. Wainberg, M. et al. A genome-wide atlas of co-essential modules assigns function to uncharacterized genes. Nat. Genet. 53, 638–649 (2021).
41. Seashore-Ludlow, B. et al. Harnessing Connectivity in a Large-Scale Small-Molecule Sensitivity Dataset. Cancer Discov. 5, 1210–1223 (2015).
42. Rees, M. G. et al. Correlating chemical sensitivity and basal gene expression reveals mechanism of action. Nat. Chem. Biol. 12, 109–116 (2016).
43. Mo, Q. et al. Pattern discovery and cancer gene identification in integrated cancer genomic data. Proc. Natl Acad. Sci. USA 110, 4245–4250 (2013).
44. Du, J.-H., Cai, Z. & Roeder, K. Robust probabilistic modeling for single-cell multimodal mosaic integration and imputation via scVAEIT. Proc. Natl Acad. Sci. USA 119, e2214414119 (2022).
45. Meng, C., Helm, D., Frejno, M. & Kuster, B. MoCluster: Identifying joint patterns across multiple omics data sets. J. Proteome Res. 15, 755–765 (2016).
46. Menden, M. P. et al. Machine learning prediction of cancer cell sensitivity to drugs based on genomic and chemical properties. PLoS One 8, e61318 (2013).
47. Shorthouse, D., Bradley, J., Critchlow, S. E., Bendtsen, C. & Hall, B. A. Heterogeneity of the cancer cell line metabolic landscape. Mol. Syst. Biol. 18, e11006 (2022).
48. Oren, Y. et al. Cycling cancer persister cells arise from lineages with distinct programs. Nature 596, 576–582 (2021).
49. Campit, S. E. et al. An Ensemble Metabolome-Epigenome Interaction Network Identifies Metabolite Modulators of Epigenetic Drugs. bioRxiv, https://doi.org/10.1101/2023.02.27.530260 (2024).
50. Liu, X.-R. et al. UNC0638, a G9a inhibitor, suppresses epithelial-mesenchymal transition-mediated cellular migration and invasion in triple negative breast cancer. Mol. Med. Rep. 17, 2239–2244 (2018).
51. Du, L., Xie, F., Han, H. & Zhang, L. Targeting SALL4 by Entinostat Inhibits the Malignant Phenotype of Gastric Cancer Cells by Reducing EMT Signaling. Anticancer Res. 43, 4389–4401 (2023).
52. Park, S. J. et al. BIX02189 inhibits TGF-β1-induced lung cancer cell metastasis by directly targeting TGF-β type I receptor. Cancer Lett. 381, 314–322 (2016).
53. Ojima, T., Kawami, M., Yumoto, R. & Takano, M. Differential mechanisms underlying methotrexate-induced cell death and epithelial-mesenchymal transition in A549 cells. Toxicol. Res. 37, 293–300 (2021).
54. Meng, Q. et al. Abrogation of glutathione peroxidase−1 drives EMT and chemoresistance in pancreatic cancer by activating ROS-mediated Akt/GSK3β/Snail signaling. Oncogene 37,
55. Pan, L.-N., Ma, Y.-F., Li, Z., Hu, J.-A. & Xu, Z.-H. KRAS G12V mutation upregulates PD-L1 expression via TGF-β/EMT signaling pathway in human non-small-cell lung cancer. Cell Biol. Int. 45, 795–803 (2021).
56. Zhang, Y. et al. Genome-wide CRISPR screen identifies PRC2 and KMT2D-COMPASS as regulators of distinct EMT trajectories that contribute differentially to metastasis. Nat. Cell Biol. 24, 554–564 (2022).
57. Hao, X. et al. MixGen: A New Multi-Modal Data Augmentation. arXiv, https://doi.org/10.48550/arXiv.2206.08358 (2022).
58. Liu, Z. et al. Learning multimodal data augmentation in feature space. arXiv, https://doi.org/10.48550/arXiv.2212.14453 (2022).
59. Pereira, R. C., Santos, M. S., Rodrigues, P. P. & Abreu, P. H. Reviewing Autoencoders for Missing Data Imputation: Technical Trends, Applications and Outcomes. JAIR 69, 1255–1285 (2020).
60. Ipsen, N. B., Mattei, P.-A. & Frellsen, J. not-MIWAE: Deep Generative Modelling with Missing not at Random Data. arXiv, https://doi.org/10.48550/arXiv.2006.12871 (2020).
61. Chen, J., Xu, Y., Wang, P. & Yang, Y. Deep Generative Imputation Model for Missing Not At Random Data. In Proceedings of the 32nd ACM International Conference on Information and Knowledge Management 316–325 (Association for Computing Machinery, New York, NY, USA, 2023). https://doi.org/10.1145/3583780.3614835.
62. Marques-Silva, J. & Huang, X. Explainability is NOT a Game. arXiv, https://doi.org/10.48550/arXiv.2307.07514 (2023).
63. Ghandi, M. et al. Next-generation characterization of the Cancer Cell Line Encyclopedia. Nature 569, 503–508 (2019).
64. Pacini, C. et al. Integrated cross-study datasets of genetic dependencies in cancer. Nat. Commun. 12, 1661 (2021).
65. Dempster, J. M. et al. Agreement between two large pan-cancer CRISPR-Cas9 gene dependency data sets. Nat. Commun. 10, 5817 (2019).
66. Yang, W. et al. Genomics of Drug Sensitivity in Cancer (GDSC): a resource for therapeutic biomarker discovery in cancer cells. Nucleic Acids Res. 41, D955–D961 (2013).
67. Paszke, A. et al. Pytorch: An imperative style, high-performance deep learning library. Adv. Neural Inf. Process. Syst. 32, 8026–8037 (2019).
68. Sundararajan, M., Taly, A. & Yan, Q. Axiomatic Attribution for Deep Networks. In Proceedings of the 34th International Conference on Machine Learning (eds. Precup, D. & Teh, Y. W.) vol. 70 3319–3328 (PMLR, 2017).
69. Smilkov, D., Thorat, N., Kim, B., Viégas, F. & Wattenberg, M. SmoothGrad: removing noise by adding noise. arXiv, https://doi.org/10.48550/arXiv.1706.03825 (2017).
70. Asperti, A. & Trentin, M. Balancing Reconstruction Error and Kullback-Leibler Divergence in Variational Autoencoders. IEEE Access 8, 199440–199448 (2020).
71. Kingma, D. P. & Welling, M. Auto-Encoding Variational Bayes. arXiv, https://doi.org/10.48550/arXiv.1312.6114 (2013).
72. Kingma, D. P. & Welling, M. An Introduction to Variational Autoencoders. arXiv, https://doi.org/10.48550/arXiv.1906.02691 (2019).
73. Akiba, T., Sano, S., Yanase, T., Ohta, T. & Koyama, M. Optuna: A Next-generation Hyperparameter Optimization Framework. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining 2623–2631 (Association for Computing Machinery, 2019). https://doi.org/10.1145/3292500.3330701.
74. Mo, Q. et al. A fully Bayesian latent variable model for integrative clustering analysis of multi-type omics data. Biostatistics 19, 71–86 (2018).
75. Ruepp, A. et al. CORUM: the comprehensive resource of mammalian protein complexes. Nucleic Acids Res. 36, D646–D650 (2008).
76. Chatr-Aryamontri, A. et al. The BioGRID interaction database: 2015
5843–5857 (2018). update. Nucleic Acids Res. 43, D470–D478 (2015).


Acknowledgements
We thank the Broad Institute and the Wellcome Sanger Institute for, through the Cancer Dependency Map consortium, making their data freely available and readily accessible to the scientific community and thereby enabling this work. This research was funded in part by the Wellcome Trust Grant 206194. ProCan® is supported by the Australian Cancer Research Foundation, Cancer Institute New South Wales (NSW) (2017/TPG001, REG171150), NSW Ministry of Health (CMP-01), The University of Sydney, Cancer Council NSW (IG 18-01), Ian Potter Foundation, the Medical Research Futures Fund (MRFF-PD), National Health and Medical Research Council (NHMRC) of Australia European Union grant (GNT1170739, a companion grant to support the European Commission’s Horizon 2020 Program, H2020-SC1-DTH-2018-1, ’iPC - individualized Paediatric Cure’ [ref. 826121]), and National Breast Cancer Foundation (IIRS-18-164). Work at ProCan® is done under the auspices of a Memorandum of Understanding between Children’s Medical Research Institute and the U.S. National Cancer Institute’s International Cancer Proteogenome Consortium (ICPC), that encourages cooperation among institutions and nations in proteogenomic cancer research in which datasets are made available to the public. Z.C. is the recipient of a PhD Scholarship from Sydney Cancer Partners with funding from Cancer Institute NSW (2021/CBG0002). A.R.B. is funded by the Portuguese national agency Fundação para a Ciência e a Tecnologia (FCT) through the research grant UI/BD/154599/2022. This work has received funding from the European Union’s Horizon 2020 research and innovation program under grant agreement no. 951970 (OLISSIPO project). For open access, the authors have applied a CC BY public copyright license to any Author Accepted Manuscript version arising from this submission. This work was supported by national funds through FCT, under project UIDB/50021/2020 (https://doi.org/10.54499/UIDB/50021/2020). The authors acknowledge the OSCARS project, funded by the European Commission’s Horizon Europe Research and Innovation Programme under grant agreement No. 101129751.

Author contributions
Z.C., S.A., A.R.B., M.D.S., C.P. and E.G. implemented analyses. Z.C., S.A. and E.G. wrote the software. E.G. supervised and conceptualized the study. S.V., P.J.R., R.R.R., M.J.G., Q.Z. and E.G. acquired funding and contributed to methodology. Z.C., S.A., A.R.B., Q.Z. and E.G. wrote the manuscript. All authors have revised and approved the manuscript.

Competing interests
AstraZeneca, GlaxoSmithKline, and Astex Pharmaceuticals have awarded M.J.G. research grants and M.J.G. is founder and advisor at Mosaic Therapeutics. All other authors declare no competing interests.

Additional information
Supplementary information The online version contains supplementary material available at https://doi.org/10.1038/s41467-024-54771-4.

Correspondence and requests for materials should be addressed to Qing Zhong or Emanuel Gonçalves.

Peer review information Nature Communications thanks Yejin Kim, and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. A peer review file is available.

Reprints and permissions information is available at http://www.nature.com/reprints

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

© The Author(s) 2024
