Deep Generative Modeling For Single-Cell Transcriptomics
https://doi.org/10.1038/s41592-018-0229-2
Single-cell transcriptome measurements can reveal unexplored biological diversity, but they suffer from technical noise and bias that must be modeled to account for the resulting uncertainty in downstream analyses. Here we introduce single-cell variational inference (scVI), a ready-to-use scalable framework for the probabilistic representation and analysis of gene expression in single cells (https://github.com/YosefLab/scVI). scVI uses stochastic optimization and deep neural networks to aggregate information across similar cells and genes and to approximate the distributions that underlie observed expression values, while accounting for batch effects and limited sensitivity. We used scVI for a range of fundamental analysis tasks including batch correction, visualization, clustering, and differential expression, and achieved high accuracy for each task.
Single-cell RNA sequencing (scRNA-seq) is a powerful tool that is beginning to make important contributions to diverse research areas such as development1, autoimmunity2, and cancer3. The interpretation of scRNA-seq data remains challenging, however, as it is confounded by nuisance factors such as limited4 and variable5 sensitivity, batch effects6, and transcriptional noise7. Several recent studies modeled scRNA-seq bias and uncertainty by fitting a probabilistic model for each gene measurement in each cell, which represents the data in a lower and potentially less noisy dimension8–10. Once these models have been fit, they can be used for various tasks such as clustering11, imputation12, and differential expression analysis13.
Although these methods have provided new insights into the biological variation between cells, they assume that a generalized linear model can be used to accurately map onto a low-dimensional manifold underlying the data, which is not necessarily justified. Also, different models are currently used for different tasks, whereas the application of a single distributional model to a range of downstream tasks would help to ensure consistency and interpretability. Finally, most existing methods cannot be applied to more than tens of thousands of cells, but recent datasets include hundreds of thousands of cells or more14.
To address these limitations, we developed scVI, a fully probabilistic approach for the normalization and analysis of scRNA-seq data. scVI is based on a hierarchical Bayesian model15 with conditional distributions specified by deep neural networks, which can be trained very efficiently even for very large datasets. The transcriptome of each cell is encoded through a nonlinear transformation into a low-dimensional latent vector of normal random variables. This latent representation is then decoded by another nonlinear transformation to generate a posterior estimate of the distributional parameters of each gene in each cell. The transformation assumes a zero-inflated negative binomial distribution, which accounts for the observed overdispersion and limited sensitivity10,16,17.
Several recent papers have also demonstrated the utility of neural networks for embedding scRNA-seq datasets in a scalable manner18–21. scVI stands out from these as the only method that explicitly models the two key nuisance factors in scRNA-seq data (library size8,22 and batch effects10,23) and the only readily available solution for a range of analysis tasks using the same generative model (Methods, Supplementary Note 1, Supplementary Table 1). To demonstrate its flexibility, we carried out batch removal, normalization, dimensionality reduction, clustering, and differential expression. We show here that for each of these tasks, scVI compared favorably to current state-of-the-art methods.

Results
The scVI model. We modeled the observed expression xng of each gene g in each cell n as a sample drawn from a zero-inflated negative binomial (ZINB) distribution p(xng ∣ zn, sn, ℓn) conditioned on the batch annotation sn of each cell (if available), as well as two additional, unobserved random variables10,16,17 (Methods). The first variable, ℓn, is a one-dimensional Gaussian that represents nuisance variation due to differences in capture efficiency and sequencing depth, and serves as a cell-specific scaling factor. The second variable, zn, is a low-dimensional vector of Gaussians (set here to ten dimensions; Supplementary Fig. 1) representing the remaining variation, which should better reflect biological differences between cells24. We used it to represent each cell as a point in a low-dimensional latent space that served for visualization and clustering. In the scVI model, a neural network maps the latent variables to the parameters of the ZINB distribution (Fig. 1a, neural networks 5 and 6). This mapping goes through intermediate values ρgn, which provide a batch-corrected, normalized estimate of the percentage of transcripts in each cell n that originate from each gene g. We used these estimates for differential expression analysis and its scaled version (multiplying ρgn by the estimated library size ℓn) for imputation. We derived an approximation for the posterior distribution of the latent variables q(zn, log ℓn ∣ xn, sn) by training another neural network using variational inference and a scalable stochastic optimization procedure25–27 (Fig. 1a, neural networks 1–4).

Model evaluation. We evaluated scVI along with a set of benchmark methods for probabilistic modeling and imputation of scRNA-seq data using a collection of published datasets spanning a range of technical and biological characteristics (Supplementary Table 2
1Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, Berkeley, CA, USA. 2Department of Physics, University of California, Berkeley, Berkeley, CA, USA. 3Department of Statistics, University of California, Berkeley, Berkeley, CA, USA. 4Ragon Institute of MGH, MIT, and Harvard, Cambridge, MA, USA. 5Chan Zuckerberg BioHub, San Francisco, CA, USA. *e-mail: [email protected]
[Fig. 1 graphic. a, Schematic: raw expression data (xn,1 … xn,G) and batch ID (sn) → nonlinear mapping → variational distribution (mean and s.d. of ℓn, the cell-specific scaling, and of zn,d) → sampling → nonlinear mapping (NN5, NN6) → generative distribution parameters (expected counts, expected dropout fh(zn, sn)); downstream uses: imputation, clustering, visualization, differential expression, batch removal. b, Running time (min) versus dataset size (thousands of cells).]
Fig. 1 | Overview of scVI. Given a gene expression matrix with batch annotations as input, scVI learns a nonlinear embedding of the cells that can be used
for multiple analysis tasks. a, The neural networks used to compute the embedding and the distribution of gene expression. NN, neural network. fw and fh
are functional representations of NN5 and NN6, respectively. b, Running times for fitting models on the BRAIN-LARGE data with a set of 720 genes and
increasing input sizes subsampled randomly from the complete dataset. Algorithms were tested on a machine with one eight-core Intel i7-6820HQ CPU
addressing 32 GB RAM, and one NVIDIA Tesla K80 (GK210GL) GPU addressing 24 GB RAM. Basic matrix factorization with FA acted as a control. For the
1-million-cell dataset, we report the results of scVI with and without early stopping (ES).
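
As a reading aid for Fig. 1a, the generative half of the model (latent variables to ZINB counts) can be sketched in NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the decoder weights are random stand-ins for the trained NN5 and NN6, and the names `make_toy_decoder` and `sample_cells`, the hidden width, the inverse dispersion `theta`, and the log-library-size mean and s.d. are hypothetical. The softmax that makes each cell's ρ sum to one is also an assumption, chosen to match the description of ρgn as a percentage of transcripts.

```python
import numpy as np

def softmax(a):
    # Numerically stable softmax over the last axis.
    a = a - a.max(axis=-1, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=-1, keepdims=True)

def make_toy_decoder(rng, d, n_batches, n_genes, hidden=32):
    """Random weights standing in for trained decoder networks (hypothetical)."""
    in_dim = d + n_batches
    return {
        "W_h": rng.normal(0.0, 0.5, (in_dim, hidden)),
        "W_rho": rng.normal(0.0, 0.5, (hidden, n_genes)),   # NN5-like head
        "W_drop": rng.normal(0.0, 0.5, (hidden, n_genes)),  # NN6-like head
    }

def sample_cells(rng, dec, batch_ids, n_batches, d=10, theta=2.0,
                 log_lib_mean=7.0, log_lib_sd=0.5):
    """Draw ZINB counts for len(batch_ids) cells."""
    n = len(batch_ids)
    z = rng.normal(size=(n, d))                        # z_n: standard Gaussian vector
    lib = np.exp(rng.normal(log_lib_mean, log_lib_sd, size=(n, 1)))  # Gaussian log l_n
    s = np.eye(n_batches)[batch_ids]                   # one-hot batch annotation s_n
    h = np.maximum(np.concatenate([z, s], axis=1) @ dec["W_h"], 0.0)  # ReLU layer
    rho = softmax(h @ dec["W_rho"])                    # per-cell gene frequencies
    p_drop = 1.0 / (1.0 + np.exp(-(h @ dec["W_drop"])))  # dropout probabilities
    mean = lib * rho                                   # negative binomial mean
    # Negative binomial drawn as a gamma-Poisson mixture.
    counts = rng.poisson(rng.gamma(shape=theta, scale=mean / theta))
    keep = rng.random(counts.shape) >= p_drop          # zero inflation
    return counts * keep, rho, lib

rng = np.random.default_rng(0)
dec = make_toy_decoder(rng, d=10, n_batches=2, n_genes=50)
x, rho, lib = sample_cells(rng, dec, np.array([0, 1, 0, 1]), n_batches=2)
```

In this sketch `rho` plays the role of the normalized expression estimate and `lib * rho` the role of its scaled version; inference of the latent variables (NN1 to NN4) is not shown.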
and Methods). To assess the scalability of training, we randomly subsampled a dataset of 1.3 million mouse brain cells28 (BRAIN-LARGE). To facilitate comparison to state-of-the-art algorithms for probabilistic modeling and dimensionality reduction of single-cell data8–12, which may be less scalable, we limited this analysis to the 720 genes with the largest s.d. across all cells (Fig. 1b). We found that most methods were capable of processing up to 50,000 cells before running out of memory (using 32 GB RAM). In contrast, scVI was generally faster and scaled to 1 million cells, thanks to its reliance on a fixed number of cells at each iteration of iterative stochastic optimization (Methods). We observed similar scalability with DCA20, a denoising autoencoder that also uses stochastic optimization. Notably, as the dataset size approached 1 million cells, fewer training iterations (or epochs) were needed, and thus heuristics for stopping the learning process may save time. Indeed, standard scVI, which uses a fixed number of epochs, was slower than DCA, which uses the stopping heuristic by default, but scVI’s early-stopping option greatly enhanced its speed (it trains in under 1 h) without affecting data fit (Supplementary Fig. 2).
Next, we evaluated the extent to which the methods fit the data by assessing their ability to accurately impute missing values. On five datasets of different sizes (BRAIN-LARGE28, CORTEX29, PBMC30, RETINA31, and HEMATO32; 3–27,000 cells; Supplementary Table 2), we set 9% of nonzero entries (chosen randomly (Supplementary Figs. 3 and 4) or with a preference for low values (Supplementary Figs. 5 and 6)) to zero and tested the ability of each method to
[Fig. 2 graphic: panels for CORTEX, HEMATO, and RETINA; methods shown are scVI, SIMLR, and MNNs + PCA.]
Fig. 2 | Biological signal is retained in the scVI latent space. scVI was applied to three datasets (CORTEX, n = 3,005 cells; HEMATO, n = 4,016 cells; and
RETINA, n = 27,499 cells). For CORTEX and HEMATO, distance matrices in the latent space and 2D cell embeddings are shown for scVI and SIMLR. Distance
matrix scales are in relative units from low to high similarity over the range of values in the entire matrix; cells are grouped using labels provided in the
original studies. CORTEX cell subsets were ordered by hierarchical clustering as in the original study. The embedding plot layout was determined by
t-distributed stochastic neighbor embedding (t-SNE) (CORTEX) or a five-nearest-neighbors graph visualized with a Fruchterman–Reingold force-directed
algorithm (HEMATO) (see Supplementary Fig. 10d for original SIMLR embedding). The color-coding is the same for embeddings and distance matrices.
For RETINA, scVI is compared with principal component analysis followed by the mutual nearest neighbors method. t-SNE on the latent space provides
embeddings. Left, cells are color-coded by batch. Right, cells are color-coded by subpopulation annotations from the original study31.
recover the values. In most cases, methods based on a ZINB distribution (namely scVI, DCA, and, when it scales to the dataset size, ZINB-WaVE10) performed better than methods that use alternative distributions8,12 (e.g., log normal in ZIFA9), thus supporting the suitability of ZINB for current scRNA-seq datasets. In one important exception, scVI was outperformed by MAGIC12 (which imputes by means of propagation in a cell–cell similarity graph) on the HEMATO hematopoietic differentiation dataset, which includes fewer cells (4,016) than genes (7,397). In such cases, scVI is expected to underfit the data, potentially leading to worse imputation accuracy. However, restriction of the analysis to the top 700 variable genes improved imputation (Supplementary Fig. 3c).
As an additional evaluation of model fit, we tested the likelihood of data that were held out during training, and obtained results in agreement with those of the synthetic dropout test (Supplementary Fig. 7, Supplementary Table 3). Furthermore, scVI, like ZIFA and factor analysis (FA), can also be used to generate unseen data via sampling of the latent space. As evidence of the validity of this procedure, we sampled from the posterior distribution given the perturbed training data, and observed that the samples were largely consistent with the unperturbed data (Supplementary Fig. 8).

Capturing biological structure in latent space. We next evaluated the extent to which the latent space inferred by scVI reflects biological variability between cells. One way to assess this is to rely on prior stratification of the cells into biologically meaningful subpopulations, which is normally done by unsupervised clustering followed by manual inspection and annotation29,30. We evaluated accuracy with respect to these stratifications (available for the CORTEX and PBMC datasets) by applying k-means clustering on the latent space and testing for overlap with the annotated subpopulations (using the same k as in the annotated data), or by comparing the proximity of cells in the same subpopulation to the proximity of cells from different subpopulations (Methods). A dataset that included single-cell protein measurements in addition to mRNA (CBMC)33 served as an alternative benchmark, by allowing us to evaluate the extent to which the similarity between cells in the mRNA latent space resembled their similarity at the protein level (Methods).
In these tests, scVI grouped cells that were from the same annotated subpopulation or that expressed similar proteins, and it compared favorably to other methods that aim to infer a biologically meaningful latent space (ZIFA, ZINB-WaVE, DCA, and FA; Supplementary Fig. 9). Notably, a simpler version of scVI that does not explicitly model library size did not perform as well as the standard scVI, thus supporting our modeling choice.
Next, we benchmarked scVI with SIMLR11, a method that couples clustering with learning of a cell–cell similarity matrix and a respective low-dimensional (latent) representation. SIMLR outperformed scVI by providing a tighter representation of the computationally annotated subpopulations. This result was expected, as SIMLR explicitly aims to produce a tight representation in a target number of clusters; however, a possible consequence of this action is that SIMLR may not capture higher-resolution structural properties of the cell–cell similarity map. Indeed, in the protein-versus-mRNA test, scVI and DCA performed best, albeit by a small margin (Supplementary Fig. 9c). scVI also more accurately captured hierarchical structure among cell subsets, such as was reported for cortical cells (CORTEX)29; cells from related subpopulations tended to be closer to each other in scVI’s latent space (Supplementary Fig. 9e–g). Another important case is when variation is continuous rather than discrete, as reported for differentiating hematopoietic cells (HEMATO)32. SIMLR identified several discrete clusters and did not reflect the continuous nature of this system as well as scVI or PCA did (Fig. 2, Supplementary Fig. 10). Finally, the data may be almost entirely dominated by noise and lack structure. On a noisy dataset that we generated by sampling at random from a vector of ZINB distributions, SIMLR erroneously reported 11 distinct clusters, which were not perceived by other methods (Supplementary Fig. 11). Altogether, these results suggest that the latent space of
[Fig. 3 graphic. a,b, Box plots of mixture weight for reproducibility and AUC across testing procedures (DESeq2, DCA + DESeq2, MAST, edgeR, scVI, scVI (NoLib)). c–f, Gene-level scatter plots; recoverable axis labels: Bayes factor (scVI) and signed log corrected P value (edgeR); recoverable panel annotations: P = 22.7%, P = 20.9%.]
Fig. 3 | Benchmarking of differential expression analysis. Performance was evaluated on the PBMC dataset (n = 12,039 cells) on the basis of consistency
with published bulk data. a,b, Comparison of B cells and dendritic cells (a) and of CD4+ and CD8+ T cells (b) evaluated for consistency with the IDR41
framework (blue) and using AUROC (green). scVI (NoLib) refers to a simpler version of scVI that does not include the cell-specific scaling factor.
The range of values was derived from subsampling of 100 cells from each cluster n = 20 times to determine robustness. Box plots indicate the median
(center lines), interquartile range (hinges), and 5th to 95th percentiles (whiskers). c–f, Significance levels of differential expression between B cells
and dendritic cells. Points represent individual genes (n = 3,346). Bayes factors or BH-corrected P values on scRNA-seq data are compared with bulk
microarray-based BH-corrected P values. Horizontal lines denote the significance threshold of 0.05 for corrected P values. Vertical lines denote the
significance threshold for the Bayes factor of scVI (c) or 0.05 for corrected P values for DESeq2 (d), edgeR (e), and MAST (f). We also report the median
mixture weight for reproducibility p (higher values are better).
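
The BH-corrected P values referred to in the Fig. 3 legend come from the standard Benjamini–Hochberg procedure, which can be sketched as below. The function name `benjamini_hochberg` is illustrative; the sketch shows the generic correction only and makes no claim about the exact settings used by DESeq2, edgeR, or MAST in this benchmark.

```python
import numpy as np

def benjamini_hochberg(pvals):
    """Return BH-adjusted P values in the original order of `pvals`."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    # Scale the i-th smallest P value by m / i.
    ranked = p[order] * m / np.arange(1, m + 1)
    # Enforce monotonicity, working down from the largest P value.
    ranked = np.minimum.accumulate(ranked[::-1])[::-1]
    adjusted = np.empty(m)
    adjusted[order] = np.clip(ranked, 0.0, 1.0)
    return adjusted

adj = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
significant = adj < 0.05  # the 0.05 threshold drawn in Fig. 3c-f
```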
scVI is flexible and describes the data well, even when the data do not fit in a simple structure of discrete cell states.

Accounting for technical variability. scVI provides a parametric distribution designed to decouple biological signal from the effects of sample-level categorical nuisance factors such as batch annotations and variation in sequencing depth. To evaluate the capacity of scVI to correct batch effects, we used a mouse retinal bipolar neuron dataset consisting of two batches (RETINA). We defined an entropy measure to evaluate the mixing of cells from different batches in any local neighborhood of the latent space (abstracted using a k-nearest-neighbor graph; Methods). In this dataset, scVI aligned the batches considerably better than ComBat34 (which uses linear models and empirical Bayes shrinkage) and a recent method based on matching of mutual nearest neighbors35, while still maintaining a tight representation of preannotated subpopulations (Fig. 2, Supplementary Figs. 9d and 12). Algorithms that do not account for batch effects in their models provided poor mixing of batches, as expected. Specifically, although SIMLR and DCA were capable of clustering the cells well within each batch, the respective clusters from each batch remained largely separated. We obtained similar results when we applied a simplified version of scVI with no batch variable, thus supporting our modeling choice.
Turning to confounding due to variation in sequencing depth, we found, as expected, that in relatively homogeneous populations the library size factor inferred by scVI (ℓn) strongly correlated with the observed depth per cell (for example, in a subpopulation of peripheral blood mononuclear cells (PBMCs); Supplementary Fig. 13a). A related technical issue is low sensitivity due to limited mRNA capture efficiency and (to a lesser extent) sequencing depth, which exacerbates the number of zero entries and can distort similarity among homogeneous cells. We found that most zero entries could be explained by the negative binomial component (Supplementary Fig. 14a,b) rather than the ‘inflation’ of additional unexplained Bernoulli-distributed zeros. Consistently, the occurrence of zero entries largely agreed with a process of random sampling of genes from each cell, in a manner proportional to their expected frequency (as inferred in the matrix ρ of our model, which is proportional to the negative binomial mean) and with no additional bias (Supplementary Fig. 13b and Supplementary Note 2). Indeed, we found that zero probabilities from the negative binomial distribution correlated more with cell-specific quality factors related
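
The entropy measure of batch mixing described in this section (batch composition within k-nearest-neighbor neighborhoods of the latent space) can be illustrated with a short NumPy sketch. The exact procedure is given in the Methods, which this excerpt does not reproduce, so the details here are assumptions: the function name `batch_mixing_entropy`, the neighborhood size `k`, the use of Euclidean distance, and averaging over every cell rather than over a random subsample.

```python
import numpy as np

def batch_mixing_entropy(latent, batches, k=50):
    """Mean Shannon entropy of batch labels among each cell's k nearest neighbors."""
    latent = np.asarray(latent, dtype=float)
    batches = np.asarray(batches)
    n = latent.shape[0]
    # Pairwise squared Euclidean distances (O(n^2) memory; fine for a sketch).
    sq = (latent ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * latent @ latent.T
    np.fill_diagonal(d2, np.inf)                 # a cell is not its own neighbor
    nn = np.argpartition(d2, k, axis=1)[:, :k]   # indices of k nearest cells (k < n)
    entropy = np.zeros(n)
    for b in np.unique(batches):
        freq = (batches[nn] == b).mean(axis=1)   # local frequency of batch b
        nz = freq > 0
        entropy[nz] -= freq[nz] * np.log(freq[nz])
    return entropy.mean()

rng = np.random.default_rng(0)
z = rng.normal(size=(200, 2))
separated = np.repeat([0, 1], 100)               # batch label per cell
mixed = np.tile([0, 1], 100)
e_separated = batch_mixing_entropy(z + 100.0 * separated[:, None], separated, k=10)
e_mixed = batch_mixing_entropy(z, mixed, k=10)
```

With two batches the score ranges from 0 (every neighborhood contains a single batch) to log 2 (every neighborhood is evenly split), so higher values indicate better mixing.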
BRAIN-LARGE. This dataset contains 1.3 million brain cells from 10x Genomics28. We randomly shuffled the data to get a subset of 1 million cells and ordered genes by variance to retain first 10,000 and then 720 sampled variable genes. This dataset was then sampled multiple times in cells for the runtime and goodness-of-fit analysis. We report imputation scores for the 10,000 cells and 720 gene samples only.
RETINA. After filtering according to the original pipeline, the dataset of bipolar cells from ref. 31 contained 27,499 cells and 13,166 genes from two batches. We used the cluster annotation of 15 cell types from the authors. We also extracted their normalized data with ComBat and used it for benchmarking.
HEMATO. This dataset with continuous gene expression variations from hematopoietic progenitor cells32 contains 4,016 cells and 7,397 genes. We removed the library basal-bm1, which was of poor quality, on the basis of the authors’ recommendation. We used their population balance analysis result as a potential function for differentiation.
CBMC*. This dataset includes 8,617 cord blood mononuclear cells33 profiled using 10x, along with 13 well-characterized mononuclear antibodies for each cell. We kept the top 600 genes by variance.
Protein abundance/mRNA expression. Take the similarity matrix for the normalized protein abundance (centered log-ratio transformation; see ref. 33). Compute a 100-nearest-neighbors graph. Fix a similarity matrix for the cells and compute a 100-nearest-neighbors graph. Report the Spearman correlation of the flattened matrices and the fold enrichment.
Let A be the set of edges in the protein nearest neighbors (NN) graph, B be the set of edges in the cell NN graph, and C be the entire set of possible edges. The fold enrichment is defined as
$$\frac{|A \cap B| \times |C|}{|A|\,|B|}$$
$$s(i) = \frac{b(i) - a(i)}{\max\{a(i), b(i)\}}$$
where a(i) is the average distance from i to all data points in the same cluster ci, and b(i) is the lowest average distance from i to all data points in a cluster c, taken over all clusters c other than ci. Clusters can be replaced with batches if one is estimating the silhouette width to assess batch effects23.
Clustering metrics. The following metrics require clustering and not simply a similarity matrix. For these, we use k-means clustering on the given latent space of dimension 10 with T = 200 random initializations to achieve a stable score.
Adjusted Rand index. This index requires clustering. For most indexes,
$$\mathrm{ARI} = \frac{\sum_{ij}\binom{n_{ij}}{2} - \left[\sum_{i}\binom{a_i}{2}\sum_{j}\binom{b_j}{2}\right]\Big/\binom{n}{2}}{\frac{1}{2}\left[\sum_{i}\binom{a_i}{2} + \sum_{j}\binom{b_j}{2}\right] - \left[\sum_{i}\binom{a_i}{2}\sum_{j}\binom{b_j}{2}\right]\Big/\binom{n}{2}}$$
where nij, ai, and bj are values from the contingency table.
Reporting Summary. Further information on research design is available in the Nature Research Reporting Summary linked to this article.
Software availability. An open-source software implementation of scVI is available on Github (https://github.com/YosefLab/scVI). All code for the reproduction of results and figures in this article has been deposited at https://zenodo.org/badge/latestdoi/125294792 and is included as Supplementary Software.
Data availability
All of the datasets analyzed in this paper are public and can be referenced at https://github.com/romain-lopez/scVI-reproducibility.
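
The three evaluation quantities defined in the Methods text above (silhouette width s(i), adjusted Rand index, and fold enrichment) can be written out in a few lines of Python. This is a minimal sketch for checking the formulas, not the code used in the paper. The function names are illustrative, b(i) is taken over the clusters other than the cell's own (the standard silhouette definition), and treating nearest-neighbor edges as directed pairs, so that |C| = n(n − 1), is an assumption.

```python
import numpy as np
from math import comb

def silhouette_widths(dist, labels):
    """s(i) = (b(i) - a(i)) / max(a(i), b(i)); needs at least two clusters."""
    dist = np.asarray(dist, dtype=float)
    labels = np.asarray(labels)
    s = np.zeros(len(labels))
    for i in range(len(labels)):
        same = labels == labels[i]
        same[i] = False
        if not same.any():
            continue  # singleton cluster: s(i) left at 0 by convention
        a = dist[i, same].mean()
        b = min(dist[i, labels == c].mean()
                for c in np.unique(labels) if c != labels[i])
        s[i] = (b - a) / max(a, b)
    return s

def adjusted_rand_index(labels_true, labels_pred):
    """ARI from the contingency table n_ij with row sums a_i and column sums b_j."""
    t, ti = np.unique(labels_true, return_inverse=True)
    p, pi = np.unique(labels_pred, return_inverse=True)
    table = np.zeros((len(t), len(p)), dtype=int)
    np.add.at(table, (ti, pi), 1)
    sum_ij = sum(comb(int(v), 2) for v in table.ravel())
    sum_a = sum(comb(int(v), 2) for v in table.sum(axis=1))
    sum_b = sum(comb(int(v), 2) for v in table.sum(axis=0))
    expected = sum_a * sum_b / comb(len(ti), 2)
    return (sum_ij - expected) / (0.5 * (sum_a + sum_b) - expected)

def knn_edges(dist, k):
    """Directed edges (i, j) from each cell i to its k nearest neighbors j."""
    dist = np.array(dist, dtype=float)
    np.fill_diagonal(dist, np.inf)
    nn = np.argsort(dist, axis=1)[:, :k]
    return {(i, int(j)) for i in range(len(dist)) for j in nn[i]}

def fold_enrichment(edges_a, edges_b, n_cells):
    """|A ∩ B| x |C| / (|A| |B|), with C all ordered pairs of distinct cells."""
    n_possible = n_cells * (n_cells - 1)
    return len(edges_a & edges_b) * n_possible / (len(edges_a) * len(edges_b))

points = np.array([0.0, 1.0, 10.0, 11.0])
dist = np.abs(points[:, None] - points[None, :])
s = silhouette_widths(dist, [0, 0, 1, 1])
ari_same = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
ari_crossed = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
line = np.abs(np.arange(5)[:, None] - np.arange(5)[None, :])
edges = knn_edges(line, k=2)
fe = fold_enrichment(edges, edges, n_cells=5)
```

A fold enrichment of 1 corresponds to the overlap expected if the two graphs were unrelated; identical graphs give the maximum, (n − 1)/k under the directed-edge assumption.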