Articles

https://doi.org/10.1038/s41592-018-0229-2

Deep generative modeling for single-cell transcriptomics

Romain Lopez1, Jeffrey Regier1, Michael B. Cole2, Michael I. Jordan1,3 and Nir Yosef1,4,5*

Single-cell transcriptome measurements can reveal unexplored biological diversity, but they suffer from technical noise and bias that must be modeled to account for the resulting uncertainty in downstream analyses. Here we introduce single-cell variational inference (scVI), a ready-to-use scalable framework for the probabilistic representation and analysis of gene expression in single cells (https://github.com/YosefLab/scVI). scVI uses stochastic optimization and deep neural networks to aggregate information across similar cells and genes and to approximate the distributions that underlie observed expression values, while accounting for batch effects and limited sensitivity. We used scVI for a range of fundamental analysis tasks including batch correction, visualization, clustering, and differential expression, and achieved high accuracy for each task.

Single-cell RNA sequencing (scRNA-seq) is a powerful tool that is beginning to make important contributions to diverse research areas such as development1, autoimmunity2, and cancer3. The interpretation of scRNA-seq data remains challenging, however, as it is confounded by nuisance factors such as limited4 and variable5 sensitivity, batch effects6, and transcriptional noise7. Several recent studies modeled scRNA-seq bias and uncertainty by fitting a probabilistic model for each gene measurement in each cell, which represents the data in a lower and potentially less noisy dimension8–10. Once these models have been fit, they can be used for various tasks such as clustering11, imputation12, and differential expression analysis13.

Although these methods have provided new insights into the biological variation between cells, they assume that a generalized linear model can be used to accurately map onto a low-dimensional manifold underlying the data, which is not necessarily justified. Also, different models are currently used for different tasks, whereas the application of a single distributional model to a range of downstream tasks would help to ensure consistency and interpretability. Finally, most existing methods cannot be applied to more than tens of thousands of cells, but recent datasets include hundreds of thousands of cells or more14.

To address these limitations, we developed scVI, a fully probabilistic approach for the normalization and analysis of scRNA-seq data. scVI is based on a hierarchical Bayesian model15 with conditional distributions specified by deep neural networks, which can be trained very efficiently even for very large datasets. The transcriptome of each cell is encoded through a nonlinear transformation into a low-dimensional latent vector of normal random variables. This latent representation is then decoded by another nonlinear transformation to generate a posterior estimate of the distributional parameters of each gene in each cell. The transformation assumes a zero-inflated negative binomial distribution, which accounts for the observed overdispersion and limited sensitivity10,16,17.

Several recent papers have also demonstrated the utility of neural networks for embedding scRNA-seq datasets in a scalable manner18–21. scVI stands out from these as the only method that explicitly models the two key nuisance factors in scRNA-seq data (library size8,22 and batch effects10,23) and the only readily available solution for a range of analysis tasks using the same generative model (Methods, Supplementary Note 1, Supplementary Table 1). To demonstrate its flexibility, we carried out batch removal, normalization, dimensionality reduction, clustering, and differential expression. We show here that for each of these tasks, scVI compared favorably to current state-of-the-art methods.

Results

The scVI model. We modeled the observed expression $x_{ng}$ of each gene $g$ in each cell $n$ as a sample drawn from a zero-inflated negative binomial (ZINB) distribution $p(x_{ng} \mid z_n, s_n, \ell_n)$ conditioned on the batch annotation $s_n$ of each cell (if available), as well as two additional, unobserved random variables10,16,17 (Methods). The first variable, $\ell_n$, is a one-dimensional Gaussian that represents nuisance variation due to differences in capture efficiency and sequencing depth, and serves as a cell-specific scaling factor. The second variable, $z_n$, is a low-dimensional vector of Gaussians (set here to ten dimensions; Supplementary Fig. 1) representing the remaining variation, which should better reflect biological differences between cells24. We used it to represent each cell as a point in a low-dimensional latent space that served for visualization and clustering. In the scVI model, a neural network maps the latent variables to the parameters of the ZINB distribution (Fig. 1a, neural networks 5 and 6). This mapping goes through intermediate values $\rho_{gn}$, which provide a batch-corrected, normalized estimate of the percentage of transcripts in each cell $n$ that originate from each gene $g$. We used these estimates for differential expression analysis and its scaled version (multiplying $\rho_{gn}$ by the estimated library size $\ell_n$) for imputation. We derived an approximation for the posterior distribution of the latent variables $q(z_n, \log \ell_n \mid x_n, s_n)$ by training another neural network using variational inference and a scalable stochastic optimization procedure25–27 (Fig. 1a, neural networks 1–4).

Model evaluation. We evaluated scVI along with a set of benchmark methods for probabilistic modeling and imputation of scRNA-seq data using a collection of published datasets spanning a range of technical and biological characteristics (Supplementary Table 2

1Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, Berkeley, CA, USA. 2Department of Physics, University of California, Berkeley, Berkeley, CA, USA. 3Department of Statistics, University of California, Berkeley, Berkeley, CA, USA. 4Ragon Institute of MGH, MIT, and Harvard, Cambridge, MA, USA. 5Chan Zuckerberg BioHub, San Francisco, CA, USA. *e-mail: [email protected]

Nature Methods | VOL 15 | DECEMBER 2018 | 1053–1058 | www.nature.com/naturemethods 1053


[Figure 1. Panel a: schematic of the variational posterior $q(z_n, \ell_n \mid x_n, s_n)$ (NN1–NN4, mapping raw expression data and batch ID to the mean and s.d. of the size factor and of the latent space) and of the generative model $p(x_n \mid z_n, s_n, \ell_n)$ (NN5, expected frequency $f_w(z_n, s_n)$; NN6, expected dropout $f_h(z_n, s_n)$; cell-specific scaling gives the expected counts), with the downstream tasks clustering, visualization, batch removal, imputation and differential expression. Panel b: running time (min) versus dataset size (thousands of cells) for factor analysis, ZIFA, SIMLR, scVI, scVI w/ ES, DCA, ZINB-WaVE, BISCUIT and MAGIC.]

Fig. 1 | Overview of scVI. Given a gene expression matrix with batch annotations as input, scVI learns a nonlinear embedding of the cells that can be used
for multiple analysis tasks. a, The neural networks used to compute the embedding and the distribution of gene expression. NN, neural network. fw and fh
are functional representations of NN5 and NN6, respectively. b, Running times for fitting models on the BRAIN-LARGE data with a set of 720 genes and
increasing input sizes subsampled randomly from the complete dataset. Algorithms were tested on a machine with one eight-core Intel i7-6820HQ CPU
addressing 32 GB RAM, and one NVIDIA Tesla K80 (GK210GL) GPU addressing 24 GB RAM. Basic matrix factorization with FA acted as a control. For the
1-million-cell dataset, we report the results of scVI with and without early stopping (ES).
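
As a rough illustration of the generative side of Fig. 1a, the sketch below evaluates a ZINB log-probability for a single count and scales normalized frequencies by a library size. It assumes a mean and inverse-dispersion parameterization of the negative binomial and a single dropout probability `pi`. The function names and this parameterization are illustrative assumptions, not taken from the scVI code base, and the neural networks that would produce `rho`, `theta` and `pi` are omitted.

```python
import math


def nb_log_pmf(x, mu, theta):
    """Log-probability of count x under a negative binomial with
    mean mu > 0 and inverse dispersion theta > 0 (assumed parameterization)."""
    log_theta_mu = math.log(theta + mu)
    return (
        math.lgamma(x + theta) - math.lgamma(theta) - math.lgamma(x + 1)
        + theta * (math.log(theta) - log_theta_mu)
        + x * (math.log(mu) - log_theta_mu)
    )


def zinb_log_pmf(x, mu, theta, pi):
    """Log-probability of count x under a zero-inflated negative binomial.
    pi in [0, 1) is the probability of an extra ("inflated") zero."""
    nb = nb_log_pmf(x, mu, theta)
    if x == 0:
        # A zero is either an inflated zero or a zero drawn from the NB itself.
        return math.log(pi + (1.0 - pi) * math.exp(nb))
    return math.log1p(-pi) + nb


def expected_counts(rho, library_size):
    """Scale normalized frequencies rho (one per gene) by a library size,
    mirroring the 'cell-specific scaling' step in the schematic."""
    return [library_size * r for r in rho]


if __name__ == "__main__":
    # Toy values, chosen only for illustration.
    print(math.exp(zinb_log_pmf(0, mu=5.0, theta=2.0, pi=0.3)))
    print(expected_counts([0.25, 0.75], 1000))
```

With `pi = 0` the zero probability reduces to the negative binomial term alone, which is the quantity the text later contrasts with the "inflated" Bernoulli zeros.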

and Methods). To assess the scalability of training, we randomly subsampled a dataset of 1.3 million mouse brain cells28 (BRAIN-LARGE). To facilitate comparison to state-of-the-art algorithms for probabilistic modeling and dimensionality reduction of single-cell data8–12, which may be less scalable, we limited this analysis to the 720 genes with the largest s.d. across all cells (Fig. 1b). We found that most methods were capable of processing up to 50,000 cells before running out of memory (using 32 GB RAM). In contrast, scVI was generally faster and scaled to 1 million cells, thanks to its reliance on a fixed number of cells at each iteration of iterative stochastic optimization (Methods). We observed similar scalability with DCA20, a denoising autoencoder that also uses stochastic optimization. Notably, as the dataset size approached 1 million cells, fewer training iterations (or epochs) were needed, and thus heuristics for stopping the learning process may save time. Indeed, standard scVI, which uses a fixed number of epochs, was slower than DCA, which uses the stopping heuristic by default, but scVI's early-stopping option greatly enhanced its speed (it trains in under 1 h) without affecting data fit (Supplementary Fig. 2).

Next, we evaluated the extent to which the methods fit the data by assessing their ability to accurately impute missing values. On five datasets of different sizes (BRAIN-LARGE28, CORTEX29, PBMC30, RETINA31, and HEMATO32; 3–27,000 cells; Supplementary Table 2), we set 9% of nonzero entries (chosen randomly (Supplementary Figs. 3 and 4) or with a preference for low values (Supplementary Figs. 5 and 6)) to zero and tested the ability of each method to

[Figure 2. Distance matrices and 2D embeddings for CORTEX and HEMATO (scVI and SIMLR), and embeddings for RETINA (scVI and MNNs + PCA). Legend entries: astrocytes ependymal, endothelial mural, interneurons, microglia, oligodendrocytes, pyramidal CA1, pyramidal SS; erythroblasts, granulocytes, other; RBC, MG, bipolar 1–4, bipolar 5–6, BC1A, BC1B, BC2, BC3A, BC3B, BC4, BC5A, BC5B, BC5C, BC5D, BC6, BC7, BC8_9.]

Fig. 2 | Biological signal is retained in the scVI latent space. scVI was applied to three datasets (CORTEX, n = 3,005 cells; HEMATO, n = 4,016 cells; and RETINA, n = 27,499 cells). For CORTEX and HEMATO, distance matrices in the latent space and 2D cell embeddings are shown for scVI and SIMLR. Distance matrix scales are in relative units from low to high similarity over the range of values in the entire matrix; cells are grouped using labels provided in the original studies. CORTEX cell subsets were ordered by hierarchical clustering as in the original study. The embedding plot layout was determined by t-distributed stochastic neighbor embedding (t-SNE) (CORTEX) or a five-nearest-neighbors graph visualized with a Fruchterman–Reingold force-directed algorithm (HEMATO) (see Supplementary Fig. 10d for original SIMLR embedding). The color-coding is the same for embeddings and distance matrices. For RETINA, scVI is compared with principal component analysis followed by the mutual nearest neighbors method, and the embeddings are obtained by t-SNE on the latent space. Left, cells are color-coded by batch. Right, cells are color-coded by subpopulation annotations from the original study31.
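
The distance matrices in Fig. 2 can also be summarized numerically. The sketch below is an illustrative assumption and not the evaluation procedure of the paper: it compares the mean Euclidean distance between cells that share a label with the mean distance between cells that carry different labels. The function name `within_between_distances` and the toy coordinates are hypothetical.

```python
import math
from itertools import combinations


def within_between_distances(latent, labels):
    """Return (mean within-label distance, mean between-label distance)
    for points in a latent space, using Euclidean distance."""
    if len(latent) != len(labels):
        raise ValueError("latent and labels must have the same length")
    within, between = [], []
    for i, j in combinations(range(len(latent)), 2):
        d = math.dist(latent[i], latent[j])
        if labels[i] == labels[j]:
            within.append(d)
        else:
            between.append(d)
    if not within or not between:
        raise ValueError("need at least one same-label and one cross-label pair")
    return sum(within) / len(within), sum(between) / len(between)


if __name__ == "__main__":
    # Two tight toy groups that are far apart in a 2D latent space.
    latent = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
    labels = ["a", "a", "b", "b"]
    w, b = within_between_distances(latent, labels)
    print(w, b)  # a tight representation has w much smaller than b
```

A small within-to-between ratio indicates a tight representation of the annotated groups. As the text notes for SIMLR, tightness alone does not show that finer or continuous structure is preserved.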

recover the values. In most cases, methods based on a ZINB distribution (namely scVI, DCA, and, when it scales to the dataset size, ZINB-WaVE10) performed better than methods that use alternative distributions8,12 (e.g., log normal in ZIFA9), thus supporting the suitability of ZINB for current scRNA-seq datasets. In one important exception, scVI was outperformed by MAGIC12 (which imputes by means of propagation in a cell–cell similarity graph) on the HEMATO hematopoietic differentiation dataset, which includes fewer cells (4,016) than genes (7,397). In such cases, scVI is expected to underfit the data, potentially leading to worse imputation accuracy. However, restriction of the analysis to the top 700 variable genes improved imputation (Supplementary Fig. 3c).

As an additional evaluation of model fit, we tested the likelihood of data that were held out during training, and obtained results in agreement with those of the synthetic dropout test (Supplementary Fig. 7, Supplementary Table 3). Furthermore, scVI, like ZIFA and factor analysis (FA), can also be used to generate unseen data via sampling of the latent space. As evidence of the validity of this procedure, we sampled from the posterior distribution given the perturbed training data, and observed that the samples were largely consistent with the unperturbed data (Supplementary Fig. 8).

Capturing biological structure in latent space. We next evaluated the extent to which the latent space inferred by scVI reflects biological variability between cells. One way to assess this is to rely on prior stratification of the cells into biologically meaningful subpopulations, which is normally done by unsupervised clustering followed by manual inspection and annotation29,30. We evaluated accuracy with respect to these stratifications (available for the CORTEX and PBMC datasets) by applying k-means clustering on the latent space and testing for overlap with the annotated subpopulations (using the same k as in the annotated data), or by comparing the proximity of cells in the same subpopulation to the proximity of cells from different subpopulations (Methods). A dataset that included single-cell protein measurements in addition to mRNA (CBMC)33 served as an alternative benchmark, by allowing us to evaluate the extent to which the similarity between cells in the mRNA latent space resembled their similarity at the protein level (Methods).

In these tests, scVI grouped cells that were from the same annotated subpopulation or that expressed similar proteins, and it compared favorably to other methods that aim to infer a biologically meaningful latent space (ZIFA, ZINB-WaVE, DCA, and FA; Supplementary Fig. 9). Notably, a simpler version of scVI that does not explicitly model library size did not perform as well as the standard scVI, thus supporting our modeling choice.

Next, we benchmarked scVI with SIMLR11, a method that couples clustering with learning of a cell–cell similarity matrix and a respective low-dimensional (latent) representation. SIMLR outperformed scVI by providing a tighter representation of the computationally annotated subpopulations. This result was expected, as SIMLR explicitly aims to produce a tight representation in a target number of clusters; however, a possible consequence of this action is that SIMLR may not capture higher-resolution structural properties of the cell–cell similarity map. Indeed, in the protein-versus-mRNA test, scVI and DCA performed best, albeit by a small margin (Supplementary Fig. 9c). scVI also more accurately captured hierarchical structure among cell subsets, such as was reported for cortical cells (CORTEX)29; cells from related subpopulations tended to be closer to each other in scVI's latent space (Supplementary Fig. 9e–g). Another important case is when variation is continuous rather than discrete, as reported for differentiating hematopoietic cells (HEMATO)32. SIMLR identified several discrete clusters and did not reflect the continuous nature of this system as well as scVI or PCA did (Fig. 2, Supplementary Fig. 10). Finally, the data may be almost entirely dominated by noise and lack structure. On a noisy dataset that we generated by sampling at random from a vector of ZINB distributions, SIMLR erroneously reported 11 distinct clusters, which were not perceived by other methods (Supplementary Fig. 11). Altogether, these results suggest that the latent space of


[Figure 3. Panels a and b: box plots of the mixture weight for reproducibility and of AUC across testing procedures (DESeq2, DCA + DESeq2, MAST, edgeR, scVI, scVI (NoLib)) for the B/DC (a) and CD4/CD8 (b) comparisons. Panels c–f: signed log corrected P value from microarray data plotted against the scVI Bayes factor (c) and against the signed log corrected P value from DESeq (d), edgeR (e) and MAST (f); reported P = 22.7% (c), 11.8% (d), 20.9% (e) and 17.8% (f).]

Fig. 3 | Benchmarking of differential expression analysis. Performance was evaluated on the PBMC dataset (n = 12,039 cells) on the basis of consistency with published bulk data. a,b, Comparison of B cells and dendritic cells (a) and of CD4+ and CD8+ T cells (b) evaluated for consistency with the IDR41 framework (blue) and using AUROC (green). scVI (NoLib) refers to a simpler version of scVI that does not include the cell-specific scaling factor. The range of values was derived from subsampling of 100 cells from each cluster n = 20 times to determine robustness. Box plots indicate the median (center lines), interquartile range (hinges), and 5th to 95th percentiles (whiskers). c–f, Significance levels of differential expression between B cells and dendritic cells. Points represent individual genes (n = 3,346). Bayes factors or BH-corrected P values on scRNA-seq data are compared with bulk microarray-based BH-corrected P values. Horizontal lines denote the significance threshold of 0.05 for corrected P values. Vertical lines denote the significance threshold for the Bayes factor of scVI (c) or 0.05 for corrected P values for DESeq2 (d), edgeR (e), and MAST (f). We also report the median mixture weight for reproducibility p (higher values are better).
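
The two quantities compared in Fig. 3, a Bayes factor per gene and an AUROC against bulk-derived true positives, can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure of the paper: the hypotheses are taken to be "expression in group A exceeds expression in group B" versus its complement, the posterior probability is estimated as the fraction of paired posterior samples that satisfy the first hypothesis, and the AUROC is computed by pairwise comparison with ties counted as one half. The function names, the `eps` clipping and the toy numbers are hypothetical.

```python
import math


def log_bayes_factor(samples_a, samples_b, eps=1e-8):
    """Estimate log[P(H1) / P(H0)] for one gene from paired posterior samples,
    where H1 is 'sample from A > sample from B' (assumed hypotheses)."""
    if not samples_a or len(samples_a) != len(samples_b):
        raise ValueError("need two non-empty sample lists of equal length")
    p = sum(a > b for a, b in zip(samples_a, samples_b)) / len(samples_a)
    p = min(max(p, eps), 1.0 - eps)  # keep the log-ratio finite
    return math.log(p) - math.log1p(-p)


def auroc(scores, is_true_positive):
    """Area under the ROC curve: the probability that a true-positive gene
    receives a higher score than a negative gene (ties count one half)."""
    pos = [s for s, t in zip(scores, is_true_positive) if t]
    neg = [s for s, t in zip(scores, is_true_positive) if not t]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative gene")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Toy posterior samples of normalized expression for one gene.
    bf = log_bayes_factor([0.3, 0.4, 0.5, 0.2], [0.1, 0.5, 0.2, 0.1])
    print(bf)
    # Toy per-gene scores and bulk-derived labels.
    print(auroc([3.0, 2.0, 1.0, 0.5], [True, True, False, False]))
```

When genes are ranked for a two-sided comparison, one plausible choice is to pass the absolute value of the log Bayes factor as the score, so that strong evidence in either direction ranks highly.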

scVI is flexible and describes the data well, even when the data do not fit in a simple structure of discrete cell states.

Accounting for technical variability. scVI provides a parametric distribution designed to decouple biological signal from the effects of sample-level categorical nuisance factors such as batch annotations and variation in sequencing depth. To evaluate the capacity of scVI to correct batch effects, we used a mouse retinal bipolar neuron dataset consisting of two batches (RETINA). We defined an entropy measure to evaluate the mixing of cells from different batches in any local neighborhood of the latent space (abstracted using a k-nearest-neighbor graph; Methods). In this dataset, scVI aligned the batches considerably better than ComBat34 (which uses linear models and empirical Bayes shrinkage) and a recent method based on matching of mutual nearest neighbors35, while still maintaining a tight representation of preannotated subpopulations (Fig. 2, Supplementary Figs. 9d and 12). Algorithms that do not account for batch effects in their models provided poor mixing of batches, as expected. Specifically, although SIMLR and DCA were capable of clustering the cells well within each batch, the respective clusters from each batch remained largely separated. We obtained similar results when we applied a simplified version of scVI with no batch variable, thus supporting our modeling choice.

Turning to confounding due to variation in sequencing depth, we found, as expected, that in relatively homogeneous populations the library size factor inferred by scVI ($\ell_n$) strongly correlated with the observed depth per cell (for example, in a subpopulation of peripheral blood mononuclear cells (PBMCs); Supplementary Fig. 13a). A related technical issue is low sensitivity due to limited mRNA capture efficiency and (to a lesser extent) sequencing depth, which exacerbates the number of zero entries and can distort similarity among homogeneous cells. We found that most zero entries could be explained by the negative binomial component (Supplementary Fig. 14a,b) rather than the 'inflation' of additional unexplained Bernoulli-distributed zeros. Consistently, the occurrence of zero entries largely agreed with a process of random sampling of genes from each cell, in a manner proportional to their expected frequency (as inferred in the matrix $\rho$ of our model, which is proportional to the negative binomial mean) and with no additional bias (Supplementary Fig. 13b and Supplementary Note 2). Indeed, we found that zero probabilities from the negative binomial distribution correlated more with cell-specific quality factors related

to library size (e.g., number of reads per unique molecular identifier), whereas zero probabilities from the Bernoulli correlated more with quality factors indicative of alignment errors (using subpopulations of BRAIN-SMALL cortical cells or PBMCs; Supplementary Figs. 13c,d and 14c,d), possibly because of contamination or mRNA degradation. Taken together, these results corroborate the idea that most zeros, at least in the datasets explored here, can be explained by low (or zero) 'biological' abundance of the respective transcript, exacerbated by limited sampling.

Differential expression. With its probabilistic representation of the data, scVI provides a natural way of performing various types of hypothesis testing, while intrinsically controlling for nuisance factors. In the case of differential expression between two sets of cells, one can use the model to approximate the posterior probability of the alternative hypotheses (genes are different) and that of the null hypotheses through repeated sampling from the variational distribution, thus obtaining a low variance estimate of their ratio (i.e., Bayes factor36,37; Methods).

To evaluate scVI in comparison with other differential expression methods13,17,20,38, we used a dataset of 12,039 PBMCs from a healthy human donor (PBMC) and undertook comparisons between B cell and dendritic cell clusters, and between CD4+ and CD8+ T cell clusters. Bulk-level comparative analysis of similar cell subsets served as a gold standard39,40. For evaluation, we first defined genes as true positives (Benjamini–Hochberg (BH)-adjusted P value < 0.05 in bulk data) and then calculated the area under the receiver operating characteristic curve (AUROC) on the basis of the Bayes factor for scVI or BH-corrected P value for the other methods. Because the definition of true positives requires a somewhat arbitrary threshold, we also used a second score that evaluated the reproducibility of gene ranking (bulk reference versus single cell, considering all genes), using the irreproducible discovery rate (IDR)41. scVI had the highest AUROC in the T cell comparison, whereas edgeR outperformed scVI by a smaller margin in the comparison of B cells versus dendritic cells. scVI performed best with respect to IDR in both comparisons (Fig. 3, Supplementary Fig. 15a–e). We noted that the use of DCA followed by DESeq2 constituted a solid improvement over the direct application of DESeq2, which was designed for bulk data, thus supporting the need for single-cell-adapted models. Furthermore, a simpler variant of scVI that does not include the library size factor performed extremely poorly in the comparison of B cells versus dendritic cells, thus supporting the usefulness of explicit inclusion of library size normalization in the model.

Discussion

scVI was designed to address an important need in the rapidly evolving field of single-cell transcriptomics, namely, to account for measurement uncertainty and bias in tertiary analysis tasks through a common, scalable statistical model. As a result, it provides a computationally efficient and 'all-inclusive' tool that couples low-dimensional probabilistic representation of gene expression data with downstream analysis capabilities, comparing favorably to state-of-the-art methods in each of a range of tasks, including batch-effect correction, imputation, clustering, and differential expression.

scVI takes raw count data as input and includes an effective normalization procedure that is integrated into its model. First, it learns a cell-specific scaling factor as a hidden variable, with the objective of maximizing the likelihood of the data8,10,22, which is more justifiable than a posteriori correction of the observed counts5. Second, scVI explicitly accounts for batch annotations, via a mild assumption of conditional independence. We demonstrated that both of these components are essential for the method's performance.

The scVI deep learning architecture is built on several canonical building blocks such as nonlinearities, regularization, and mean-field approximation to the posterior25 (Methods). The exploration of other, possibly better, architectures42 and procedures for parameter and hyperparameter tuning43 might in some instances provide a better model fit and more suitable approximate inference. Notably, because our procedure has a random component and optimizes a nonconvex objective function, it might give alternative results with different initializations. To address this, we demonstrated the stability of scVI in terms of its objective function, as well as imputation and clustering (Supplementary Fig. 1). A related issue is that if there are few observations (cells) for each gene, the prior and the inductive bias of the neural network might keep scVI from fitting the data closely. Indeed, gene prefiltering may be warranted in cases where there are fewer cells than genes. A complementary approach would make use of techniques such as Bayesian shrinkage17 or regularization and second-order optimization10. However, we were able to show that for a range of datasets of varying sizes, scVI fit the data well and captured relevant biological diversity between cells.

Because it provides a general probabilistic representation of gene expression, scVI could enable other forms of scRNA-seq analysis not explored in this study, such as lineage inference1 and cell-state annotation7,44. Furthermore, because it requires only the latent space and the model specification (which both have a low memory footprint) to generate any data point (cell × gene) of interest, scVI can be used as an effective baseline for scalable and interactive visualization tools45–47. Finally, scVI can be extended to merge multiple datasets from a given tissue while integrating prior biological annotations of cell types. We therefore expect this work to be of immediate interest, especially where dataset harmonization needs to be scalable and conducive to various forms of downstream analysis14.

Online content

Any methods, additional references, Nature Research reporting summaries, source data, statements of data availability and associated accession codes are available at https://doi.org/10.1038/s41592-018-0229-2.

Received: 30 March 2018; Accepted: 26 October 2018; Published online: 30 November 2018

References

1. Semrau, S. et al. Dynamics of lineage commitment revealed by single-cell transcriptomics of differentiating embryonic stem cells. Nat. Commun. 8, 1096 (2017).
2. Gaublomme, J. T. et al. Single-cell genomics unveils critical regulators of Th17 cell pathogenicity. Cell 163, 1400–1412 (2015).
3. Patel, A. P. et al. Single-cell RNA-seq highlights intratumoral heterogeneity in primary glioblastoma. Science 344, 1396–1401 (2014).
4. Kharchenko, P. V., Silberstein, L. & Scadden, D. T. Bayesian approach to single-cell differential expression analysis. Nat. Methods 11, 740–742 (2014).
5. Vallejos, C. A., Risso, D., Scialdone, A., Dudoit, S. & Marioni, J. C. Normalizing single-cell RNA sequencing data: challenges and opportunities. Nat. Methods 14, 565–571 (2017).
6. Shaham, U. et al. Removal of batch effects using distribution-matching residual networks. Bioinformatics 33, 2539–2546 (2017).
7. Wagner, A., Regev, A. & Yosef, N. Revealing the vectors of cellular identity with single-cell genomics. Nat. Biotechnol. 34, 1145–1160 (2016).
8. Prabhakaran, S., Azizi, E., Carr, A. & Pe'er, D. Dirichlet process mixture model for correcting technical variation in single-cell gene expression data. PMLR 48, 1070–1079 (2016).
9. Pierson, E. & Yau, C. ZIFA: dimensionality reduction for zero-inflated single-cell gene expression analysis. Genome Biol. 16, 241 (2015).
10. Risso, D., Perraudeau, F., Gribkova, S., Dudoit, S. & Vert, J.-P. A general and flexible method for signal extraction from single-cell RNA-seq data. Nat. Commun. 9, 284 (2018).
11. Wang, B., Zhu, J., Pierson, E., Ramazzotti, D. & Batzoglou, S. Visualization and analysis of single-cell RNA-seq data by kernel-based similarity learning. Nat. Methods 14, 414–416 (2017).
12. van Dijk, D. et al. MAGIC: a diffusion-based imputation method reveals gene-gene interactions in single-cell RNA-sequencing data. bioRxiv Preprint at https://www.biorxiv.org/content/early/2017/02/25/111591 (2017).

13. Finak, G. et al. MAST: a flexible statistical framework for assessing transcriptional changes and characterizing heterogeneity in single-cell RNA sequencing data. Genome Biol. 16, 278 (2015).
14. Regev, A. et al. The Human Cell Atlas. eLife 6, e27041 (2017).
15. Gelman, A. & Hill, J. Data Analysis Using Regression and Multilevel/Hierarchical Models (Cambridge University Press, New York, 2007).
16. Grün, D., Kester, L. & van Oudenaarden, A. Validation of noise models for single-cell transcriptomics. Nat. Methods 11, 637–640 (2014).
17. Love, M. I., Huber, W. & Anders, S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol. 15, 550 (2014).
18. Ding, J., Condon, A. & Shah, S. P. Interpretable dimensionality reduction of single cell transcriptome data with deep generative models. Nat. Commun. 9, 2002 (2018).
19. Wang, D. & Gu, J. VASC: dimension reduction and visualization of single cell RNA sequencing data by deep variational autoencoder. bioRxiv Preprint at https://www.biorxiv.org/content/early/2017/10/06/199315 (2017).
20. Eraslan, G., Simon, L. M., Mircea, M., Mueller, N. S. & Theis, F. J. Single cell RNA-seq denoising using a deep count autoencoder. bioRxiv Preprint at https://www.biorxiv.org/content/early/2018/04/13/300681 (2018).
21. Grønbech, C. H. et al. scVAE: variational auto-encoders for single-cell gene expression data. bioRxiv Preprint at https://www.biorxiv.org/content/early/2018/05/16/318295 (2018).
36. Kass, R. E. & Raftery, A. E. Bayes factors. J. Am. Stat. Assoc. 90, 773–795 (1995).
37. Held, L. & Ott, M. On p-values and Bayes factors. Annu. Rev. Stat. Appl. 5, 393–419 (2018).
38. Robinson, M. D., McCarthy, D. J. & Smyth, G. K. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics 26, 139–140 (2010).
39. Nakaya, H. I. et al. Systems biology of vaccination for seasonal influenza in humans. Nat. Immunol. 12, 786–795 (2011).
40. Görgün, G., Holderried, T. A. W., Zahrieh, D., Neuberg, D. & Gribben, J. G. Chronic lymphocytic leukemia cells induce changes in gene expression of CD4 and CD8 T cells. J. Clin. Invest. 115, 1797–1805 (2005).
41. Li, Q., Brown, J. B., Huang, H. & Bickel, P. J. Measuring reproducibility of high-throughput experiments. Ann. Appl. Stat. 5, 1752–1779 (2011).
42. Zoph, B. & Le, Q. Neural architecture search with reinforcement learning. Oral presentation at the International Conference on Learning Representations, Toulon, France, 24–26 April 2017.
43. Bergstra, J. S., Bardenet, R., Bengio, Y. & Kégl, B. Algorithms for hyper-parameter optimization. In Advances in Neural Information Processing Systems 24 (eds Shawe-Taylor, J. et al.) 2546–2554 (NIPS Foundation, La Jolla, CA, 2011).
22. Vallejos, C. A., Marioni, J. C. & Richardson, S. BASiCS: Bayesian analysis of 44. Tanay, A. & Regev, A. Scaling single-cell genomics from phenomenology to
single-cell sequencing data. PLoS Comput. Biol. 11, e1004333 (2015). mechanism. Nature 541, 331–338 (2017).
23. Cole, M. B. et al. Performance assessment and selection of normalization 45. DeTomaso, D. & Yosef, N. FastProject: a tool for low-dimensional analysis of
procedures for single-cell RNA-seq. bioRxiv Preprint at https://www.biorxiv. single-cell RNA-Seq data. BMC Bioinformatics 17, 315 (2016).
org/content/early/2018/05/18/235382 (2017). 46. Fan, J. et al. Characterizing transcriptional heterogeneity through
24. Louizos, C., Swersky, K., Li, Y., Welling, M. & Zemel, R. The variational fair pathway and gene set overdispersion analysis. Nat. Methods 13,
autoencoder. Oral presentation at the International Conference on Learning 241–244 (2016).
Representations, San Juan, Puerto Rico, 2–4 May 2016. 47. Wolf, F. A., Angerer, P. & Theis, F. J. SCANPY: large-scale single-cell gene
25. Kingma, D. P. & Welling, M. Auto-encoding variational Bayes. Oral expression data analysis. Genome. Biol. 19, 15 (2018).
presentation at the International Conference on Learning Representations,
Banff, Alberta, Canada, 14–16 April 2014.
26. Blei, D. M., Kucukelbir, A. & McAuliffe, J. D. Variational inference: a review Acknowledgements
for statisticians. J. Am. Stat. Assoc. 112, 859–877 (2017). N.Y. and R.L. were supported by NIH–NIAID (grant U19 AI090023). We thank A. Klein,
27. Sønderby, C. K., Raiko, T., Maaløe, L., Sønderby, S. K. & Winther, O. Ladder S. Dudoit, and J. Listgarten for helpful discussions.
variational autoencoders. In Advances in Neural Information Processing Systems
(eds Lee, D. D. et al.) 3738–3746 (NIPS Foundation, La Jolla, CA, 2016). Author contributions
28. 10x Genomics. Support: single cell gene expression datasets. 10x Genomics R.L., J.R., and N.Y. conceived the statistical model. R.L. developed the software.
https://support.10xgenomics.com/single-cell-gene-expression/datasets (2017). R.L. and M.B.C. applied the software to real data analysis. R.L., J.R., N.Y., and M.I.J.
29. Zeisel, A. et al. Cell types in the mouse cortex and hippocampus revealed by wrote the manuscript. N.Y. and M.I.J. supervised the work.
single-cell RNA-seq. Science 347, 1138–1142 (2015).
30. Zheng, G. X. Y. et al. Massively parallel digital transcriptional profiling of
single cells. Nat. Commun. 8, 14049 (2017). Competing interests
31. Shekhar, K. et al. Comprehensive classification of retinal bipolar neurons by The authors declare no competing interests.
single-cell transcriptomics. Cell 166, 1308–1323 (2016).
32. Tusi, B. K. et al. Population snapshots predict early haematopoietic and
erythroid hierarchies. Nature 555, 54–60 (2018). Additional information
33. Stoeckius, M. et al. Simultaneous epitope and transcriptome measurement in Supplementary information is available for this paper at https://doi.org/10.1038/
single cells. Nat. Methods 14, 865–868 (2017). s41592-018-0229-2.
34. Johnson, W. E., Li, C. & Rabinovic, A. Adjusting batch effects in Reprints and permissions information is available at www.nature.com/reprints.
microarray expression data using empirical Bayes methods. Biostatistics 8, Correspondence and requests for materials should be addressed to N.Y.
118–127 (2007).
Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in
35. Haghverdi, L., Lun, A. T. L., Morgan, M. D. & Marioni, J. C. Batch effects in
published maps and institutional affiliations.
single-cell RNA-sequencing data are corrected by matching mutual nearest
neighbors. Nat. Biotechnol. 36, 421–427 (2018). © The Author(s), under exclusive licence to Springer Nature America, Inc. 2018

Methods

The scVI probabilistic model. First, we present in more detail the generative process for scVI. Each expression value $x_{ng}$ is drawn independently through the following process:

$$
\begin{aligned}
z_n &\sim \mathrm{Normal}(0, I) \\
\ell_n &\sim \mathrm{LogNormal}(\ell_\mu, \ell_\sigma^2) \\
\rho_n &= f_w(z_n, s_n) \\
w_{ng} &\sim \mathrm{Gamma}(\rho_n^g, \theta) \\
y_{ng} &\sim \mathrm{Poisson}(\ell_n w_{ng}) \\
h_{ng} &\sim \mathrm{Bernoulli}(f_h^g(z_n, s_n)) \\
x_{ng} &= \begin{cases} y_{ng} & \text{if } h_{ng} = 0 \\ 0 & \text{otherwise} \end{cases}
\end{aligned}
$$

A standard multivariate normal prior for $z$ is commonly used in variational autoencoders because it can be reparameterized in a differentiable way into any arbitrary multivariate Gaussian random variable²⁵, which turns out to be extremely convenient in the inference process.

$B$ denotes the number of batches and $\ell_\mu, \ell_\sigma \in \mathbb{R}_+^B$ parameterize the prior for the scaling factor (on a log scale). $\ell_\mu, \ell_\sigma$ are set to be the empirical mean and variance of the log-library size per batch. We note that the random variable $\ell_n$ is not the log-library size (scaling the sampled observation) itself but a scaling factor that is expected to correlate strongly with log-library size (hence the choice of the parameters). The parameter $\theta \in \mathbb{R}_+^G$ denotes a gene-specific inverse dispersion, estimated via variational Bayesian inference.

$f_w$ and $f_h$ are neural networks that map the latent space and batch annotation back to the full dimension of all genes: $\mathbb{R}^d \times \{0, 1\}^B \to \mathbb{R}^G$. We use superscript annotation (for example, $f_w^g(z_n, s_n)$) to refer to a single entry that corresponds to a specific gene $g$. Neural network $f_w$ is constrained during the inference to encode the mean proportion of transcripts expressed across all genes by the use of a softmax activation at the last layer. Namely, for each cell $n$, the sum of $f_w^g(z_n, s_n)$ values over all genes $g$ is 1. Neural network $f_h$ encodes whether a particular entry has dropped out owing to technical effects⁹,¹⁰. These intermediate vectors can therefore be interpreted as expected frequencies. Importantly, we note that neural networks allow users to go beyond the generalized linear model framework and provide a more flexible model of gene expression. All neural networks use dropout regularization and batch normalization. Each network has one, two, or three fully connected layers, with 128 or 256 nodes each. Details are provided in Supplementary Table 2. The activation functions between two hidden layers are all ReLU. We use a standard link function to parameterize the distribution parameters (exponential, logarithmic, or softmax). Weights for some layers are shared between $f_w$ and $f_h$.

Fast inference via stochastic optimization. The posterior distribution combines the prior knowledge with information acquired from the data matrix $X$. We cannot directly apply Bayes' rule to determine the posterior because the denominator (the marginal distribution, integrated over the latent variables) $p(x_n \mid s_n)$ is intractable. Inference over the whole graphical model is not needed. We can integrate out the latent variables $w_{ng}$, $h_{ng}$, and $y_{ng}$ because $p(x_{ng} \mid z_n, \ell_n, s_n)$ has a closed-form density. Notably, the distribution $p(x_{ng} \mid z_n, s_n, \ell_n)$ is a ZINB¹⁶ with mean $\ell_n \rho_n^g$, gene-specific dispersion $\theta_g$ and zero-inflation probability $f_h^g(z_n, s_n)$ (Supplementary Note 3). We discuss the numerical stability and parameterization of the ZINB distribution in Supplementary Note 4. Having simplified our model, we use variational inference²⁶ to approximate the posterior $p(z_n, \ell_n \mid x_n, s_n)$. Our variational distribution $q(z_n, \ell_n \mid x_n, s_n)$ is mean-field:

$$q(z_n, \ell_n \mid x_n, s_n) = q(z_n \mid x_n, s_n)\, q(\ell_n \mid x_n, s_n)$$

The variational distribution $q(z_n \mid x_n, s_n)$ is chosen to be Gaussian with a diagonal covariance matrix, with mean and covariance given by an encoder network applied to $(x_n, s_n)$, as in ref. 25. The variational distribution $q(\ell_n \mid x_n, s_n)$ is chosen to be log-normal with the scalar mean and variance also given by an encoder network applied to $(x_n, s_n)$. The variational lower bound is

$$
\log p(x \mid s) \ge \mathbb{E}_{q(z, \ell \mid x, s)} \log p(x \mid z, \ell, s) - D_{\mathrm{KL}}\big(q(z \mid x, s) \,\|\, p(z)\big) - D_{\mathrm{KL}}\big(q(\ell \mid x, s) \,\|\, p(\ell)\big) \qquad (1)
$$

In this objective function, the dispersion parameters $\theta_g$ for each gene are treated as global variables to be optimized in a variational Bayesian inference fashion.

To optimize the lower bound, we use the analytic expression for $p(x \mid z, \ell, s)$ and analytic expressions for the Kullback–Leibler divergences. We use the reparameterization trick to compute low-variance Monte Carlo estimates of the expectations' gradients. A closed-form Kullback–Leibler divergence and the reparameterization trick are available only for certain distributions, which include multivariate Gaussians²⁵. The reparameterization trick is a specific sampling scheme from the variational distribution, which makes our objective function stochastic. Remarkably, this sampling step coupled with neural network approximation to the posterior is what makes it possible to go beyond restrictive 'conditional conjugacy' properties often needed for sampling or variational inference. This allows us to efficiently carry out inference with arbitrary models, including those with conditional distributions specified by neural networks²⁵.

A second level of stochasticity comes from subsampling from the training set (possible because the cells are independent and identically distributed when conditioned on the latent variables). We then have an online optimization procedure that can handle massive datasets, used by both scVI and other methods that exploit neural networks¹⁸⁻²¹. At each iteration, we focus only on a small, randomly sampled subset of the data ($M = 128$ data points) and do not need to go through the entire dataset. Therefore, there is no need to store the entire dataset in memory. Because the number of genes is in practice limited to a few tens of thousands, these mini-batches of cells can be handled easily by a GPU. Now, our objective function is continuous and end-to-end differentiable, which allows us to use automatic differentiation operators.

Throughout the paper, we use Adam (a first-order stochastic optimizer) with $\varepsilon = 0.01$. As indicated in ref. 27, we use deterministic warm-up and batch normalization during learning to learn an expressive model. A complete list of hyperparameters is provided in Supplementary Table 2. The hyperparameters were chosen via a small grid search that maximized held-out log likelihood, a common practice for training deep generative models. One of the strengths of scVI is that there are only three dataset-specific hyperparameters to set (learning rate, number of layers, and layer width). We optimize the objective function until convergence, usually between 120 and 250 epochs, where each epoch is a complete pass through the dataset (we note that bigger datasets require fewer epochs). For the larger subset of the BRAIN-LARGE dataset, we also ran with the early-stopping criterion: the algorithm stopped after 12 consecutive epochs with no improvement on the validation loss.

Because the encoder network $q(z \mid x, s)$ might still produce output correlated with the batch $s$, one could use in principle a maximum mean discrepancy (MMD)-based penalty as in ref. 24 to correct the variational distribution. For this paper, however, we did not explicitly enforce the MMD penalty and simply retained the conditional independence property, which has been shown to be sufficiently efficient. The MMD penalty may be useful with other datasets, though it explicitly assumes that the exact same biological signal is present in the datasets.

Bayesian differential expression. For each gene $g$ and pair of cells $(z_a, z_b)$ with observed gene expression $(x_a, x_b)$ and batch ID $(s_a, s_b)$, we can formulate two mutually exclusive hypotheses:

$$
H_1^g := \mathbb{E}_s f_w^g(z_a, s) > \mathbb{E}_s f_w^g(z_b, s) \quad \text{versus} \quad H_2^g := \mathbb{E}_s f_w^g(z_a, s) \le \mathbb{E}_s f_w^g(z_b, s)
$$

where the expectation $\mathbb{E}_s$ is taken with the empirical frequencies. Notably, we propose a hypothesis test that does not calibrate the data to one batch but will find genes that are consistently differentially expressed. Evaluation of which hypothesis is more probable amounts to evaluation of a Bayes factor³⁷ (Bayesian generalization of the P value). Its sign indicates which of $H_1^g$ and $H_2^g$ is more likely. Its magnitude is a significance level, and throughout the paper, we consider a Bayes factor as strong evidence in favor of a hypothesis if $|K| > 3$ (ref. 36) (equivalent to an odds ratio of $\exp(3) \approx 20$).

$$K = \log_e \frac{p(H_1^g \mid x_a, x_b)}{p(H_2^g \mid x_a, x_b)}$$

where the posterior of these models can be approximated via the variational distribution

$$
p(H_1^g \mid x_a, x_b) \approx \sum_s \iint_{z_a, z_b} p\big(f_w^g(z_a, s) > f_w^g(z_b, s)\big)\, p(s)\, dq(z_a \mid x_a)\, dq(z_b \mid x_b)
$$

where $p(s)$ designates the relative abundance of cells in batch $s$ and all of the measures are low-dimensional, so we can use naive Monte Carlo to compute these integrals. We can then use a Bayes factor for the test.

Because we assume that all the cells are independent, we can average the Bayes factors across a large set of randomly sampled cell pairs, one from each subpopulation. The average factor will provide an estimate of whether cells from one subpopulation tend to express $g$ at a higher frequency.

We demonstrated the robustness of our method by repeating the entire evaluation process and comparing the results (Fig. 3a,b). We also ensured that our Bayes factors were well calibrated by running the differential expression analysis across cells from the same cluster and making sure no genes reached the significance threshold (Supplementary Fig. 15f).

Modeling choices. In this section, we consider the extent to which each of a sequence of modeling choices in the design of scVI contributes to its performance. As a baseline approach, consider normalizing scRNA-seq data as in the literature⁹

and reducing the dimensionality of the data by using a variational autoencoder with a Gaussian prior and a Gaussian conditional probability.

One way to enhance a model is to change the Gaussian conditional probability to one of the many available count distributions, such as ZINB, negative binomial (NB), Poisson, and others. Recent work by Eraslan et al. using simulated data shows that when the dropout effect drives the signal-to-noise ratio to a less favorable regime, a denoising autoencoder with mean squared error (i.e., Gaussian conditional likelihood) cannot recover cell types from expression data, whereas an autoencoder with ZINB conditional likelihood can²⁰. This result points to the importance of at least modeling the sparsity of the data and is in agreement with previous contributions⁹,¹⁰.

The next question is which count distribution to use. In scVI we have chosen to use the ZINB, a choice motivated by published literature (for example, ref. 10). First, the choice of negative binomial is common with RNA-seq data, as they are overdispersed¹⁷. Furthermore, under some assumptions this distribution captures the steady-state form of the canonical two-state promoter-activation model¹⁶. Finally, recent work by Grønbech et al.²¹ proposes an analysis based on Bayesian model selection (held-out log-likelihood as in this paper). In that analysis, the NB and ZINB distributions stand out with similarly high scores. We demonstrate that the addition of a zero-inflation (Bernoulli) component is important for explaining a subset of the zero values in the data (Supplementary Fig. 14) and that it captures important aspects of technical variability that are not captured by the NB component (Supplementary Fig. 13).

To enhance the model further, we added terms to account for library size as a nuisance factor, which can be considered as a Bayesian approach to normalization as in refs 8,22. We showed how this contributes to our model by increasing clustering scores and differential expression analysis accuracy on the PBMC dataset.

As a further enhancement, we designed the generative model to explain data from different experimental batches. This is not a trivial task, as a substantial covariate shift may exist between the observed transcript measurements. We showed how this modification to our model is crucial when dealing with batch effects in the subsection on the RETINA dataset.

Datasets and preprocessing. Below we describe all of the datasets and the preprocessing steps used in the current work. We focused on relatively large datasets (3,000 cells or more) with unique molecular identifiers, thus providing enough information during training and avoiding the problem of overcounting due to amplification. An asterisk after the dataset name indicates that we used it as an auxiliary dataset; these datasets were used not for general benchmarking but rather to support specific points presented in the paper. The only case where we subsampled the data multiple times was that of the BRAIN-LARGE dataset. However, we simply used one instance of it to report all possible scores (further details are presented in Supplementary Table 2).

CORTEX. The Mouse Cortex Cells dataset from ref. 29 contains 3,005 mouse cortex cells and gold-standard labels for seven distinct cell types. Each cell type corresponds to a cluster to recover (Supplementary Table 4). We retained the top 558 genes ordered by variance as in ref. 8.

PBMC. We considered scRNA-seq data from two batches of PBMCs from a healthy donor (4,000 and 8,000 PBMCs, respectively)³⁰. We derived quality control metrics using the cellrangerRkit R package (v. 1.1.0). Quality metrics were extracted from Cell Ranger through the molecule-specific information file. After filtering as in ref. 23, we extracted 12,039 cells with 10,310 sampled genes and generated biologically meaningful clusters with the software Seurat (Supplementary Table 5). We then filtered genes that we could not match with the bulk data used for differential expression, which resulted in g = 3,346.

BRAIN-LARGE. This dataset contains 1.3 million brain cells from 10x Genomics²⁸. We randomly shuffled the data to get a subset of 1 million cells and ordered genes by variance to retain first 10,000 and then 720 sampled variable genes. This dataset was then sampled multiple times in cells for the runtime and goodness-of-fit analysis. We report imputation scores for the 10,000 cells and 720 gene samples only.

RETINA. After filtering according to the original pipeline, the dataset of bipolar cells from ref. 31 contained 27,499 cells and 13,166 genes from two batches. We used the authors' cluster annotation of 15 cell types. We also extracted their normalized data with ComBat and used it for benchmarking.

HEMATO. This dataset with continuous gene expression variations from hematopoietic progenitor cells³² contains 4,016 cells and 7,397 genes. We removed the library basal-bm1, which was of poor quality, on the basis of the authors' recommendation. We used their population balance analysis result as a potential function for differentiation.

CBMC*. This dataset includes 8,617 cord blood mononuclear cells³³ profiled using 10x, along with 13 well-characterized monoclonal antibodies for each cell. We kept the top 600 genes by variance.

BRAIN-SMALL*. This dataset, which consists of 9,128 mouse brain cells profiled using 10x²⁸, was used as a complement to PBMC for our study of zero abundance and quality control metric correlation with our generative posterior parameters. We derived quality control metrics by using the cellrangerRkit R package (v. 1.1.0). Quality metrics were extracted from Cell Ranger through the molecule-specific information file. We kept the top 3,000 genes by variance. We used the clusters provided by Cell Ranger for the correlation analysis of zero probabilities.

Statistics. Differential expression for bulk datasets. We assembled a set of genes that are differentially expressed between human B cells and dendritic cells (microarrays; n = 10 in each group³⁹; GSE29618) and between CD4+ and CD8+ T cells (microarrays; n = 12 in each group⁴⁰; GSE8835). For GSE29618, we first loaded bulk human expression array data using the GEOquery package, selecting all B cell and myeloid dendritic cell samples from the baseline ("Day0") time point. We retained all expression features described by exactly one gene symbol and regressed these expression measures on the cell-type covariate (B cell versus myeloid dendritic cell) using lmFit linear modeling in limma. P values were derived from empirical Bayes moderated t-tests for differences between the two cell types, using eBayes in limma. We conducted an identical study on GSE8835 for the CD4+ and CD8+ T cell comparison. These P values were then corrected via the standard BH procedure.

Differential expression for scRNA-seq datasets. We used the packages as detailed above. The P values were then corrected via the standard BH procedure.

Capturing technical variability. We computed the average probability of zero from the NB distribution and from the Bernoulli across all genes for a particular cell. We tested for a correlation between these cell-specific zero probabilities and cell-specific quality control metrics by using a Pearson-correlation test.

Evaluation. We describe below how we computed the metrics used in the study. Further details of the algorithms used for benchmarking in this work are provided in Supplementary Note 5.

Log-likelihood on held-out data. We provide a multivariate metric of goodness of fit on the data in Supplementary Note 6.

Corrupting the datasets for imputation benchmarking. In this study we used two different approaches to measure the robustness of algorithms to noise in the data:
• Uniform zero introduction: we randomly selected 10% of the nonzero entries and multiplied the entry n with a Ber(0.9) random variable.
• Binomial data corruption: we randomly selected 10% of the matrix and replaced an entry n with a Bin(n, 0.2) random variable.

Accuracy of imputing missing data. As imputation is tantamount to replacing missing data by its mean conditioned on being observed, we used the median L1 distance between the original dataset and the imputed values for corrupted entries only. For MAGIC, we used the output of the associated software. For BISCUIT, we used the imputed counts. For ZIFA, we used the mean of the generative distribution conditioned on the nonzero event (mean of the factor analysis part) that we projected back into count space. For scVI and ZINB-WaVE, we used the mean of the NB distribution.

Silhouette width. The silhouette width requires either a similarity matrix or a latent space. We can define a silhouette score for each sample $i$ with

$$s(i) = \frac{b(i) - a(i)}{\max\{a(i), b(i)\}}$$

where $a(i)$ is the average distance from $i$ to all data points in the same cluster $c_i$, and $b(i)$ is the lowest average distance from $i$ to all data points in a cluster $c$, taken over all clusters $c$ other than $c_i$. Clusters can be replaced with batches if one is estimating the silhouette width to assess batch effects²³.

Clustering metrics. The following metrics require clustering and not simply a similarity matrix. For these, we use k-means clustering on the given latent space of dimension 10 with T = 200 random initializations to achieve a stable score.

Adjusted Rand index. This index requires clustering. It is computed as

$$
\mathrm{ARI} = \frac{\sum_{ij} \binom{n_{ij}}{2} - \left[\sum_i \binom{a_i}{2} \sum_j \binom{b_j}{2}\right] \Big/ \binom{n}{2}}{\frac{1}{2}\left[\sum_i \binom{a_i}{2} + \sum_j \binom{b_j}{2}\right] - \left[\sum_i \binom{a_i}{2} \sum_j \binom{b_j}{2}\right] \Big/ \binom{n}{2}}
$$

where $n_{ij}$, $a_i$, and $b_j$ are values from the contingency table.
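As a rough illustration of the clustering metrics above, the following sketch scores a latent space with k-means, the adjusted Rand index, and the silhouette width. It assumes scikit-learn (listed in the reporting summary, although the exact calls used by the authors are not given in the text), Euclidean distance in the latent space, and a hypothetical helper name `clustering_scores`; it is not the authors' benchmarking code.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score, silhouette_score


def clustering_scores(latent, labels, n_init=200, seed=0):
    """Score a latent space against reference cluster labels.

    latent: (n_cells, d) array of latent coordinates.
    labels: (n_cells,) array of reference cluster (or batch) labels.
    n_init: number of random k-means initializations (T = 200 in the text).
    """
    latent = np.asarray(latent)
    labels = np.asarray(labels)
    n_clusters = len(np.unique(labels))
    # k-means is restarted n_init times; scikit-learn keeps the best run.
    kmeans = KMeans(n_clusters=n_clusters, n_init=n_init, random_state=seed)
    predicted = kmeans.fit_predict(latent)
    return {
        # ARI compares the k-means partition with the reference labels.
        "ari": adjusted_rand_score(labels, predicted),
        # The silhouette width uses the reference labels directly,
        # so it needs no clustering step (Euclidean distance assumed).
        "silhouette": silhouette_score(latent, labels),
    }
```

Passing batch labels instead of cluster labels gives the batch-effect variant of the silhouette width mentioned above; in that case a lower value indicates better mixing.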

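The two corruption schemes and the imputation error described under Evaluation can be sketched as follows. This is a minimal NumPy illustration under stated assumptions: the function names are hypothetical, "10% of the matrix" is read as 10% of all entries sampled without replacement, and the count matrix is a dense integer array.

```python
import numpy as np


def uniform_zero_corruption(counts, rate=0.1, keep_prob=0.9, seed=0):
    """Multiply a random `rate` fraction of nonzero entries by Ber(keep_prob)."""
    rng = np.random.default_rng(seed)
    corrupted = np.array(counts, copy=True)
    rows, cols = np.nonzero(corrupted)
    n_pick = int(np.floor(rate * len(rows)))
    pick = rng.choice(len(rows), size=n_pick, replace=False)
    r, c = rows[pick], cols[pick]
    # Each selected entry is kept with probability keep_prob, else set to 0.
    corrupted[r, c] *= rng.binomial(1, keep_prob, size=n_pick)
    return corrupted, (r, c)


def binomial_corruption(counts, rate=0.1, p=0.2, seed=0):
    """Replace a random `rate` fraction of entries n by a Bin(n, p) draw."""
    rng = np.random.default_rng(seed)
    corrupted = np.array(counts, copy=True)
    n_pick = int(np.floor(rate * corrupted.size))
    flat = rng.choice(corrupted.size, size=n_pick, replace=False)
    r, c = np.unravel_index(flat, corrupted.shape)
    corrupted[r, c] = rng.binomial(corrupted[r, c], p)
    return corrupted, (r, c)


def median_l1_error(original, imputed, index):
    """Median absolute error, restricted to the corrupted entries."""
    r, c = index
    return float(np.median(np.abs(original[r, c] - imputed[r, c])))
```

A typical use would corrupt the training matrix, fit a model on the corrupted copy, and pass the model's imputed values together with the returned index to `median_l1_error`.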
Normalized mutual information.

$$\mathrm{NMI} = \frac{I(P; T)}{\sqrt{H(P)\, H(T)}}$$

where $P$, $T$ designate the empirical categorical distributions for the predicted and real clustering, $I$ is the mutual information, and $H$ is the Shannon entropy.

Entropy of batch mixing. Fix a similarity matrix for the cells and take $U$ to be a uniform random variable on the population of cells. Take $B_U$ as the empirical frequencies of the 50 nearest neighbors of cell $U$ being in batch $b$. Report the entropy of this categorical variable and average over T = 100 values of $U$.

Protein abundance/mRNA expression. Take the similarity matrix for the normalized protein abundance (centered log-ratio transformation; see ref. 33). Compute a 100-nearest-neighbors graph. Fix a similarity matrix for the cells and compute a 100-nearest-neighbors graph. Report the Spearman correlation of the flattened matrices and the fold enrichment.

Let $A$ be the set of edges in the protein nearest neighbors (NN) graph, $B$ be the set of edges in the cell NN graph, and $C$ be the entire set of possible edges. The fold enrichment is defined as

$$\frac{|A \cap B| \times |C|}{|A|\,|B|}$$

Differential expression metrics. We used 100 cells from each cluster. In scVI, we draw 200 samples from the variational posterior; subsampling ensures that our results are stable.

Area under the curve. We assign each gene a label of differentially expressed (DE) or non-DE on the basis of its P value from the reference data (genes with BH-corrected P values < 0.05 are positive, and the rest are negative); then we use these labels to compute the AUROC.

Irreproducible discovery rate. The IDR is computed with the corresponding R package. We adjust the prior for the mixture weight to be the fraction of genes detected in the microarray data.

Reporting Summary. Further information on research design is available in the Nature Research Reporting Summary linked to this article.

Software availability. An open-source software implementation of scVI is available on GitHub (https://github.com/YosefLab/scVI). All code for the reproduction of results and figures in this article has been deposited at https://zenodo.org/badge/latestdoi/125294792 and is included as Supplementary Software.

Data availability
All of the datasets analyzed in this paper are public and can be referenced at https://github.com/romain-lopez/scVI-reproducibility.
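The entropy of batch mixing described under Evaluation can be sketched as below. This is an illustrative version only: it assumes Euclidean distance in a latent space in place of a generic similarity matrix, uses scikit-learn's `NearestNeighbors`, assumes no exact duplicate cells, and the function name `entropy_of_batch_mixing` is hypothetical.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors


def entropy_of_batch_mixing(latent, batches, n_neighbors=50, n_samples=100, seed=0):
    """Average entropy of batch frequencies among each sampled cell's neighbors."""
    latent = np.asarray(latent)
    batches = np.asarray(batches)
    rng = np.random.default_rng(seed)
    batch_ids = np.unique(batches)
    # Ask for one extra neighbor: a query cell is its own nearest neighbor
    # (assuming no exact duplicates), and that first hit is dropped below.
    nn = NearestNeighbors(n_neighbors=n_neighbors + 1).fit(latent)
    n_draws = min(n_samples, len(latent))
    chosen = rng.choice(len(latent), size=n_draws, replace=False)
    _, neighbor_idx = nn.kneighbors(latent[chosen])
    entropies = []
    for row in neighbor_idx[:, 1:]:
        # Empirical frequency of each batch among the neighbors.
        freqs = np.array([(batches[row] == b).mean() for b in batch_ids])
        freqs = freqs[freqs > 0]
        entropies.append(-(freqs * np.log(freqs)).sum())
    return float(np.mean(entropies))
```

With two equally sized, well-mixed batches the score approaches log 2 (about 0.69), and it drops to 0 when every neighborhood contains a single batch.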



nature research | reporting summary
Corresponding author(s): Nir Yosef

Reporting Summary
Nature Research wishes to improve the reproducibility of the work that we publish. This form provides structure for consistency and transparency
in reporting. For further information on Nature Research policies, see Authors & Referees and the Editorial Policy Checklist.

Statistical parameters
When statistical analyses are reported, confirm that the following items are present in the relevant location (e.g. figure legend, table legend, main
text, or Methods section).
n/a Confirmed
The exact sample size (n) for each experimental group/condition, given as a discrete number and unit of measurement
An indication of whether measurements were taken from distinct samples or whether the same sample was measured repeatedly
The statistical test(s) used AND whether they are one- or two-sided
Only common tests should be described solely by name; describe more complex techniques in the Methods section.

A description of all covariates tested


A description of any assumptions or corrections, such as tests of normality and adjustment for multiple comparisons
A full description of the statistics including central tendency (e.g. means) or other basic estimates (e.g. regression coefficient) AND
variation (e.g. standard deviation) or associated estimates of uncertainty (e.g. confidence intervals)

For null hypothesis testing, the test statistic (e.g. F, t, r) with confidence intervals, effect sizes, degrees of freedom and P value noted
Give P values as exact values whenever suitable.

For Bayesian analysis, information on the choice of priors and Markov chain Monte Carlo settings
For hierarchical and complex designs, identification of the appropriate level for tests and full reporting of outcomes
Estimates of effect sizes (e.g. Cohen's d, Pearson's r), indicating how they were calculated

Clearly defined error bars


State explicitly what error bars represent (e.g. SD, SE, CI)

Our web collection on statistics for biologists may be useful.

Software and code


Policy information about availability of computer code
Data collection No software was used to collect the data.

Data analysis https://github.com/romain-lopez/scVI-reproducibility (version 0.1)


Python packages: scikit-learn v0.19.0
R packages: IDR v1.2 (cran)
For manuscripts utilizing custom algorithms or software that are central to the research but not yet described in published literature, software must be made available to editors/reviewers
upon request. We strongly encourage code deposition in a community repository (e.g. GitHub). See the Nature Research guidelines for submitting code & software for further information.

Data
April 2018

Policy information about availability of data


All manuscripts must include a data availability statement. This statement should provide the following information, where applicable:
- Accession codes, unique identifiers, or web links for publicly available datasets
- A list of figures that have associated raw data
- A description of any restrictions on data availability
All of the datasets analyzed in this manuscript are public and referenced at https://github.com/romain-lopez/scVI-reproducibility

Field-specific reporting
Please select the best fit for your research. If you are not sure, read the appropriate sections before making your selection.
Life sciences Behavioural & social sciences Ecological, evolutionary & environmental sciences
For a reference copy of the document with all sections, see nature.com/authors/policies/ReportingSummary-flat.pdf

Life sciences study design


All studies must disclose on these points even when the disclosure is negative.
Sample size No experiments in study

Data exclusions No experiments in study

Replication No experiments in study

Randomization No experiments in study

Blinding No experiments in study

Reporting for specific materials, systems and methods

Materials & experimental systems Methods


n/a Involved in the study n/a Involved in the study
Unique biological materials ChIP-seq
Antibodies Flow cytometry
Eukaryotic cell lines MRI-based neuroimaging
Palaeontology
Animals and other organisms
Human research participants
