Published online 1 February 2021 Nucleic Acids Research, 2021, Vol. 49, No. 7 e42
doi: 10.1093/nar/gkab004
Flexible comparison of batch correction methods for
single-cell RNA-seq using BatchBench
* * *
Ruben Chazarra-Gil, Stijn van Dongen, Vladimir Yu Kiselev and Martin Hemberg
Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton CB10 1SA, UK
Received May 28, 2020; Revised December 11, 2020; Editorial Decision December 28, 2020; Accepted January 29, 2021
Downloaded from https://academic.oup.com/nar/article/49/7/e42/6125660 by guest on 27 April 2023
ABSTRACT
As the cost of single-cell RNA-seq experiments has decreased, an increasing number of datasets are now available. Combining newly generated and publicly accessible datasets is challenging due to non-biological signals, commonly known as batch effects. Although there are several computational methods available that can remove batch effects, evaluating which method performs best is not straightforward. Here, we present BatchBench (https://github.com/cellgeni/batchbench), a modular and flexible pipeline for comparing batch correction methods for single-cell RNA-seq data. We apply BatchBench to eight methods, highlighting their methodological differences and assessing their performance and computational requirements through a compendium of well-studied datasets. This systematic comparison guides users in the choice of batch correction tool, and the pipeline makes it easy to evaluate other datasets.
INTRODUCTION
Single-cell RNA sequencing (scRNA-seq) technologies have made it possible to address biological questions that were not accessible using bulk RNA sequencing (1), e.g. identification of rare cell types (2,3), discovery of developmental trajectories (4–6), characterization of the variability in splicing (7–11), investigations into allele specific expression (12–15) and analysis of stochastic gene expression and transcriptional kinetics (11,16). There is currently a plethora of different protocols and experimental platforms available (17,18). Considerable differences exist among scRNA-seq protocols with regard to mRNA capture efficiency, transcript coverage, strand specificity, UMI inclusion and other potential biases (17,18). It is well known that these and other technical differences can impact the observed expression values, and if not properly accounted for they could be confounded with biological signals (19). Such differences arising due to non-biological factors are commonly known as batch effects.
Fortunately, with appropriate experimental design it is possible to remove a portion of the batch effects computationally, and recently there has been a large degree of interest in developing such methods for scRNA-seq. We group the methods into three categories depending on what space they operate on with respect to the expression matrix (Figure 1A). The expression matrix represents the number of reads found for each cell and gene, and it is central to computational analyses. The first set of methods, mnnCorrect, limma, ComBat, Seurat 3 (hereafter referred to as Seurat) and Scanorama, produce a merged, corrected expression matrix. The second set, Harmony and fastMNN, instead operate on a low-dimensional embedding of the original expression matrices. As such, their output cannot be used for downstream analyses which require the expression matrix, limiting their use for some applications. Finally, the BBKNN method operates on the k-nearest neighbor graph constructed from the expression matrices and consequently its output is restricted to downstream analyses where only the cell label can be used.
As the choice of batch correction method may impact the downstream analyses, the decision of which one to use can be consequential. To decide what method to use, most researchers rely on benchmarking studies. Traditionally such comparisons are carried out using a compendium of relevant datasets. The downside of this approach is that methods published after the benchmark was carried out are not included and that the comparison may not have featured datasets that contain all the relevant features required to evaluate the methods. To overcome these issues we have developed BatchBench (Figure 1B), a flexible computational pipeline which makes it easy to compare both new methods and datasets using a variety of criteria. Here we report on the comparison of eight popular batch effect removal methods (Table 1) using three well-studied scRNA-seq datasets. BatchBench is implemented in Nextflow (20) and it is freely available at https://github.com/cellgeni/batchbench under the MIT Licence.
* To whom correspondence should be addressed. Tel: +44 1223499955; Email: [email protected]
Correspondence may also be addressed to Vladimir Yu Kiselev. Email: [email protected]
Correspondence may also be addressed to Martin Hemberg. Email: [email protected]
© The Author(s) 2021. Published by Oxford University Press on behalf of Nucleic Acids Research.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which
permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
Figure 1. Overview of the BatchBench pipeline workflow and schematic representation of the conventional scRNA-seq data analysis pipeline from the expression matrix. (A) BatchBench first carries out QC on the input dataset prior to performing batch correction with the eight methods selected. After this, a series of downstream analyses are computed, including UMAP coordinates, Shannon entropies, clustering and marker gene analysis, and resource consumption metrics of each of the processes. (B) Central and lower panels depict the conventional scRNA-seq data analysis pipeline and the analyses that can be carried out with the output of each step. The upper panel represents the space over which each of the batch correction methods operates. The initial expression matrix typically undergoes feature selection and is then the source for gene-based analyses such as marker gene analysis, pseudotime analysis or gene networks. The methods mnnCorrect, Limma, ComBat, Seurat and Scanorama operate in the expression matrix space. Next, a dimensionality reduction step is performed. Harmony and fastMNN operate in this space. The low-dimensional embedding is then converted into a matrix of cell-cell distances, which in turn can be converted to a graph. These are inputs for cell-based analyses such as clustering, visualization and trajectory inference of cells. BBKNN operates in this graph space.
Table 1. Summary of the eight batch correction methods considered in this study. Programming language of the method, type of output object, tool’s
batch correction principle as well as installation source and license type are listed
By default, BatchBench evaluates batch correction methods based on two different entropy metrics. The normalized Shannon entropy is used to quantify how well batches are aligned while preserving the separation of different cell populations. However, the entropy measures do not provide a complete picture of how the batch correction impacts downstream analyses. Therefore, BatchBench has a modular design to allow users to incorporate additional metrics, and we provide two examples of such metrics: unsupervised clustering and identification of marker genes. Five different unsupervised clustering methods are applied to the merged cells to afford the user a better understanding of how the different methods affect this step, which is often central to the analysis. We also compare cell-type specific marker genes to understand how different batch correction methods affect the expression levels.
METHODS
Datasets
Pancreas dataset. We consider three published pancreas datasets: Baron (GSE84133) (39), Muraro (GSE85241) (27) and Segerstolpe (E-MTAB-5061) (28) generated using inDrop, CEL-Seq2 and Smart-Seq2 technologies, respectively. Initially, quality control was performed on each of the datasets to remove cells with <200 counts and genes that were present in <3 cells, along with spike-ins and anti-sense transcripts. Furthermore, we only retained cells that had been assigned a biologically meaningful cell type (e.g. removing cells from the ‘unclassified’ category).
For Figure 3, we wanted to represent the pancreas results as a boxplot similar to the other datasets. To ensure that we got a distribution we considered three additional versions of the data. One of these versions contained all of the genes expressed across the three batches rather than just the highly variable ones. The second contained 1000 cells selected randomly from each batch using the highly variable genes. The third version contained only six cell types (acinar, alpha, beta, gamma, delta and ductal) from each batch downsampled to 50% of the original number of cells and information from the highly variable genes.
Mouse cell atlas datasets. Individual MCA datasets were downloaded from the paper’s Figshare site and merged by tissue, generating 37 organ datasets. From these, 18 datasets containing more than one batch and with a reasonable proportion of cells across batches were selected. Through further preprocessing we removed cells expressing <250 genes, genes expressed in <50 cells, cell types representing <1% of the total cell population in a tissue, and batches containing <5% of the total number of cells in a tissue (Supplementary Table S2).
Tabula muris datasets. The data was downloaded from the paper’s Figshare site. For all analyses except Figure 4, individual datasets representing the same tissue across the two platforms were merged into 11 organ datasets (Supplementary Table S1). We set workflow quality control parameters to remove cells expressing <1000 genes and genes expressed in <50 cells. Again, cell types representing <1% of the total cell population in a tissue, and batches containing <5% of the total number of cells in a tissue, were excluded from further analyses. For the scaling analysis in Figure 4, the previous tissues were merged into a Tabula Muris atlas dataset, which was filtered to retain cells with >200 genes expressed and genes expressed in >3 cells. Cells assigned to NA or unknown cell types were excluded. Cell types representing <1% of the total cell population in a tissue, and batches containing <5% of the total number of cells in a tissue, were excluded from further analyses. This resulted in an object of 4168 genes and 60 828 cells (40 058 from 10X and 20 770 from Smart-Seq2).
Batch and cell type entropy
The output of each tool is transformed into a k-nearest neighbour graph with each node i representing a cell (buildKNNGraph, scran package). Each cell is connected to its k = 30 nearest neighbors as defined by the similarity of expression profiles calculated using the Euclidean distance. Using the graph, we calculate for each cell i the probability that a neighbor has cell type c, P_ic, as well as the probability that a neighbor comes from batch b, P_ib. From these joint probabilities we can calculate cell type and batch entropies. We report the average value across all cells divided by the theoretical maximum to ensure a value in the interval [0, 1]. For the datasets considered in this study, the results are robust with respect to the choice of k (Supplementary Figure S16).
UMAP
Uniform Manifold Approximation and Projection (UMAP) is computed through the scanpy.api.tl.umap function, which uses the implementation of umap-learn (38). For the batch removal methods implemented in R, the rds objects are first converted into h5ad objects using the sce2anndata function from the sceasy package (https://github.com/cellgeni/sceasy/).
Downsampling
The filtered Tabula Muris dataset was sampled using uniform selection and no replacement to 1, 2, 5, 10, 20 and 50% of its cells, resulting in objects of 4168 genes and 608, 1217, 3041, 6083, 12 166 and 30 414 cells. The initial proportion of the batches (0.64, 0.36) was maintained through the different subsets.
Artificial batches
We work with a reduced version of the Tabula Muris atlas object. We first removed all the Smart-seq2 cells and then retained only the 10 largest cell types. From this, 1001 cells are randomly sampled to serve as input to the artificial batch generation. All 4168 initial genes are considered. We base our simulation of batch effects on a normal distribution. For each batch to be simulated, we define: (i) a fraction f of cells sampled with uniform probability from the sequence [0.05, 0.1, 0.15, . . . 1.0]; (ii) a value d representing the dispersion of the effect to be simulated sampled with uniform probability from the sequence [0.5, 1.0, 1.5, . . . n], where n is the number of batches to simulate. For each of the 10 cell types in the input data we add count values by drawing values from a normal distribution with a standard deviation d. The artificial batch effect is only applied to those genes expressed in >f of the cells. If a gene is assigned a negative value, then it is replaced by 0. The result is a simulated data set of 1001 cells and 4168 genes which is appended to the input data set. We followed this approach to simulate data sets with 2, 3, 5, 10, 20 and 50 equally sized batches.
Feature selection
We rank genes in descending order by their coefficient of variation, establishing five fractions of features: 0.05, 0.1, 0.2, 0.5 and 1.0 (all of the features). Feature selection is performed as a first step in each clustering algorithm script prior to any processing of the input data.
Clustering analysis
The merged samples were clustered using five different clustering algorithms: SC3 (35) from the Bioconductor package of the same name, Louvain and Leiden as implemented in Seurat (23), RaceID (2) and standard hierarchical clustering using Ward’s agglomeration method. SC3 and RaceID require a count matrix as input. For SC3 we set k to the number of cell populations of each dataset. If the dataset had >5000 cells we enable sc3_run_svm to speed up the processing. RaceID uses Euclidean distances based on the Pearson correlation distance. All three RaceID clustering options (k-means, k-medoids and hclust) are implemented in our clustering step. The other clustering algorithms can be applied to all batch correction methods in our study. Louvain and Leiden methods were implemented with the Seurat function FindClusters, with other parameters set to their default values. We also implemented standard hierarchical clustering using Ward’s agglomeration method with the hclust function from the stats package.
As a pre-processing step after feature selection and prior to SC3, RaceID and hierarchical clustering, all cells and features with zero variance are removed. Moreover, in the case of RaceID clustering, negative values in the expression matrix that may result from the batch effect removal step are set to zero.
To assess the similarity of each corrected output clustering annotation with the provided ground truth, the Adjusted Rand Index and the Variation of Information are computed with the arandi and vi.dist functions from the mcclust package, respectively.
To compare the results across datasets we select the best similarity metric value across all feature selection fractions considered (Figure 5, Supplementary Figure S12). Since variation of information is a distance metric, we perform a min–max normalization to scale the data across datasets (Supplementary Figure S12). For both metrics the feature fraction matching the best value is stored (Supplementary Figures S10 and S13). Additionally, we examine the correlation (as Pearson’s correlation coefficient) between the similarity values and the
feature range for which they were obtained (Supplementary Figures S11 and S14).
Marker gene analysis
To obtain marker genes we use the FindMarkers function from the Seurat package, which restricts the comparison to methods that output a normalised count matrix. For a gene to be considered as a marker, we require that the absolute value of the log fold-change is >2, and that the gene is expressed in at least half of the cells in each population. We use the default Wilcoxon Rank Sum test to find genes that are significantly different (adjusted P-value < 0.05) in the merged dataset and in each of the individual batches.
To compare the overlap of the sets of marker genes identified across batches and the merged data we used the multiple site generalized Jaccard index (36). We restricted the comparison to the cell populations that are common to all individual batches. We also investigate the proportion of cell populations of the dataset for which marker genes can be found.
BatchBench pipeline
As an input, BatchBench (https://github.com/cellgeni/batchbench) requires a SingleCellExperiment (R based) or AnnData object (python based). The input object is then converted into its counterpart in the other language. This object must contain log-normalized counts, and the batch and cell type annotation of its cells as Batch and cell_type1 respectively, in the object metadata. The workflow performs an initial QC step where cells, genes, batches or cell types can be filtered according to user-defined parameters. Cells not assigned to any batch or cell type are also excluded in this step. Each dataset is then sent in parallel as input to each of the batch effect correction tools, after which rds and h5ad objects containing the output are saved and made available for the user. Each of the batch corrected outputs serves as input for a series of downstream analyses: (i) UMAP coordinates are computed and saved as a csv file for visualization of the different batch corrections, (ii) entropies are computed and saved as a csv file, (iii) clustering analysis, (iv) marker gene analysis and any module optionally added by the user.
RESULTS
Entropy measures quantify integration of batches and separation of cell types
To illustrate the use of BatchBench we first considered three scRNA-seq studies of the human pancreas (27–29). Even though the samples were collected, processed and annotated independently, several comparisons have shown that batch effects can be overcome (19,30). Visualization of the uncorrected data using UMAP reveals a clear separation of the major cell types across batches (Figure 2A). As expected, all of the methods in our study were able to merge equivalent cell populations from different batches while ensuring their separation from other cell types. Visual inspection suggests that Seurat and Harmony achieve groupings mainly driven by the cell types, whereas the other methods tend to aggregate the different batches. It is notable that BBKNN brings cell populations closer but is unable to superimpose the batches.
To evaluate how well the batch correction methods mix cells from different batches while keeping cell types separate, we computed the normalized Shannon entropy (16,29) based on the batch and cell type annotations provided by the original authors (Methods). The desired outcome is a high batch entropy, indicating a homogeneous mixture of the batches, and a low cell type entropy, suggesting that cell populations remain distinct. While all the methods were able to keep the distinct cell populations separate, we observed greater differences for the batch entropy (Figure 3). Based on this metric we consider Seurat and Harmony as the best methods. As intermediate performers, Scanorama and fastMNN show a wider distribution of batch entropy values. Finally, mnnCorrect, Limma and ComBat can be considered the poorer performers in aligning the different batches.
We carried out similar investigations for the Mouse Cell Atlas (MCA) (31) and Tabula Muris (32) datasets. In the MCA the batches correspond to the eight different animals (31), and as the mice all come from the same genetic background and were raised in the same environment we expect the batch effects to be smaller than for the pancreas data. The batch entropy for the uncorrected data is indeed higher than for the pancreas data (Figure 3), and most methods are able to mix the batches of the MCA better, as confirmed by visual inspection. The cell type entropies are higher than for the pancreas data, and we hypothesize that this is a consequence of the fine-grained annotation which makes it difficult to separate cell types. For example, the bone marrow contains six different types of neutrophils and the testes five types of spermatocytes. Overall across MCA data, Seurat and Harmony show the best batch mixing, although at the cost of slightly increasing cell type mixing compared to the uncorrected counts and the other methods. Scanorama can also be considered a good performer, followed by fastMNN.
Next, we investigated another mouse cell atlas, Tabula Muris (32), and our analysis shows a greater sample effect as evidenced by a very low batch entropy for the uncorrected data (Figure 3). Since the batches correspond to two different experimental platforms (32), it is not surprising that there are larger differences than for the MCA. Furthermore, all methods perform better with regard to the cell type entropy, potentially due to a more coherent annotation. For all three datasets, we note that for most methods there is greater variation in batch entropy than cell type entropy. Closer inspection reveals that the batch entropies vary substantially across tissues (Supplementary Table S1). Interestingly, all methods, except for Seurat and BBKNN, are unable to achieve high batch entropy for datasets with a small number of cell types. Closer inspection reveals that all methods except Seurat and BBKNN show a significant correlation between cell type entropy and number of cell types, suggesting poorer performance with more fine-grained annotation (Supplementary Figure S1). Taken together, Seurat consistently succeeds in mixing the batches, again at the cost of slightly mixing distinct cell populations. Scanorama performs well although with higher variation across datasets.
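The entropy measure described in the Methods can be illustrated with a small, self-contained sketch. It is not the BatchBench implementation (which builds the graph with scran in R); the function names `knn_indices` and `normalized_entropy` are hypothetical, the neighbour search is brute force, and the "theoretical maximum" is assumed here to be the logarithm of the number of distinct labels.

```python
import math
from collections import Counter

def knn_indices(points, k):
    """For each point, return the indices of its k nearest neighbours (Euclidean)."""
    neighbours = []
    for i, p in enumerate(points):
        dists = sorted((math.dist(p, q), j) for j, q in enumerate(points) if j != i)
        neighbours.append([j for _, j in dists[:k]])
    return neighbours

def normalized_entropy(labels, neighbours):
    """Mean Shannon entropy of the neighbours' labels, scaled to [0, 1].

    Assumption: the maximum entropy is log(number of distinct labels).
    """
    n_labels = len(set(labels))
    if n_labels < 2:
        return 0.0
    total = 0.0
    for nbrs in neighbours:
        counts = Counter(labels[j] for j in nbrs)
        size = len(nbrs)
        # Shannon entropy of the label frequencies among this cell's neighbours
        total += -sum((c / size) * math.log(c / size) for c in counts.values())
    return total / len(neighbours) / math.log(n_labels)

# Toy data: two well-separated cell types, two batches mixed within each type.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
cell_types = ["alpha"] * 4 + ["beta"] * 4
batches = ["b1", "b2"] * 4
nbrs = knn_indices(points, k=3)
cell_type_entropy = normalized_entropy(cell_types, nbrs)
batch_entropy = normalized_entropy(batches, nbrs)
```

In this toy case the cell type entropy is zero (every neighbourhood contains a single cell type) while the batch entropy is close to one, which is the pattern the text describes as the desired outcome.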
Figure 2. UMAP visualization of the human pancreas dataset before and after batch correction with the eight different methods considered. (A) Original uncorrected data. (B–H) Corrected data. Each pair of panels shows the cells labeled either by dataset of origin (left) or cell type (right). A good batch correction should ensure that cells from different batches are grouped together while cells from distinct cell populations are kept separate.
Figure 3. Batch and cell type entropies before and after batch correction with the eight different methods considered. The boxplots show the Shannon entropy over batch (black) and cell type (gray) of the different batch effect correction methods for pancreas data (red), Mouse Cell Atlas (green) and Tabula Muris (blue). The black line represents the mean across the cells, the box the upper and lower quartiles, the whiskers the 95th percentiles and the dots show outliers.
Surprisingly, Harmony is unable to properly align the Tabula Muris batches.
Batch correction becomes harder as the number of cells and the number of batches increase
To determine how the number of cells in each sample influences batch correction performance and running times we considered the Tabula Muris dataset, and downsampled it to 1, 5, 10, 20 and 50% of the original 60 828 cells (Methods). Across all subsets, the input objects contain 64% of 10X cells and 36% of FACS-sorted Smart-Seq2 cells. Note that this batch correction task is more challenging than the one in Figure 3 as we now merge cells from different tissues. The number of cells has a strong impact on performance and it becomes more difficult to align the two batches with increasing cell numbers. All methods except Scanorama, Harmony and Seurat reduce the batch entropy by >50% as the number of cells increases from 608 to 60 828 (Figure 4A). Unfortunately, Scanorama mixes the cell types as well as batches, and surprisingly neither of the entropies changes as the number of cells increases. Harmony is the only method that, after an initial drop, increases the batch entropy with the number of cells. For all methods except Scanorama, the cell type entropy is also reduced, suggesting that it becomes easier to group cells from the same origin for larger datasets. With the exception of Scanorama, the majority of the methods do not significantly increase the cell type entropy above the value of the uncorrected counts, even decreasing it for the smaller subsets.
The main goal of the investigation involving different numbers of cells is to learn how the computational resource requirements change, as this is an important factor when choosing a method. Considering the time required to perform the integration, we found substantial differences, as ComBat, Limma, Harmony and BBKNN have more or less constant run times as the number of cells grows. By contrast, mnnCorrect and fastMNN grow exponentially, with the former being the slowest method in our study. Seurat initially has a stable runtime before it starts to grow exponentially (Figure 4B). For all methods we found that memory usage increases exponentially with the number of cells. The differences are smaller than for the run-time, with Seurat, mnnCorrect, ComBat and fastMNN consuming the most resources, while Harmony, Scanorama and BBKNN have the lowest requirements (Figure 4C). The memory requirements and runtimes observed in the scaling experiments are similar to what we found for the previous section (Supplementary Figure S2).
As sequencing costs decrease, the number of different samples that can be processed will increase. Thus, we also evaluated how well each method handles an increasing number of batches. For this study we considered subsets of the Tabula Muris 10X dataset with 4168 genes and 18 347 cells. As the batches created by subsampling this dataset are entirely artificial, we added small batch-specific random counts to each gene to ensure that there are differences that require correction (Methods). In our simulations, cell types are well separated whereas the batches are more overlapping.
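The artificial batch procedure from the Methods can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: the function name `simulate_batch` is hypothetical, the matrix is a plain list of per-cell lists, the normal distribution is assumed to have zero mean, and one offset is assumed to be drawn per cell type and gene and added to all cells of that type.

```python
import random

def simulate_batch(counts, cell_types, n_batches, rng):
    """Return a perturbed copy of counts (cells x genes) plus the sampled f and d."""
    n_cells, n_genes = len(counts), len(counts[0])
    fractions = [round(0.05 * i, 2) for i in range(1, 21)]         # 0.05, 0.10, ..., 1.0
    dispersions = [0.5 * i for i in range(1, 2 * n_batches + 1)]   # 0.5, 1.0, ..., n
    f = rng.choice(fractions)
    d = rng.choice(dispersions)
    # Only genes expressed in more than a fraction f of the cells are perturbed.
    affected = [
        g for g in range(n_genes)
        if sum(1 for row in counts if row[g] > 0) > f * n_cells
    ]
    # Assumption: one N(0, d) offset per (cell type, gene) pair.
    offsets = {
        ct: {g: rng.gauss(0.0, d) for g in affected}
        for ct in sorted(set(cell_types))
    }
    simulated = []
    for row, ct in zip(counts, cell_types):
        new_row = list(row)
        for g in affected:
            # Negative values are replaced by 0, as described in the Methods.
            new_row[g] = max(0.0, row[g] + offsets[ct][g])
        simulated.append(new_row)
    return simulated, f, d

rng = random.Random(0)
counts = [[1.0, 0.0, 3.0], [2.0, 0.0, 0.0], [4.0, 0.0, 5.0], [1.0, 0.0, 2.0]]
cell_types = ["a", "a", "b", "b"]
simulated, f, d = simulate_batch(counts, cell_types, n_batches=2, rng=rng)
```

The simulated matrix would then be appended to the input cells as a new batch; repeating the call yields the additional batches.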
Figure 4. Entropy measures and resource consumption of methods as a function of the number of cells and simulated batches. (A) Batch and cell type
entropies, (B) running time and (C) RAM usage over different subsets of the Tabula Muris atlas object with ∼61 000 cells in total. (D) Batch and cell type
entropies, (E) running time and (F) RAM usage over an increasing number of simulated batches of 1001 cells each, generated from Tabula Muris atlas 10X
cells.
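The subsets in panels A to C come from uniform sampling without replacement while keeping the batch proportions fixed. One way to guarantee fixed proportions is to sample within each batch; whether the original pipeline does exactly this is an assumption, and the function name `downsample_by_batch` is hypothetical.

```python
import random
from collections import defaultdict

def downsample_by_batch(batch_labels, fraction, rng):
    """Return sorted cell indices sampled uniformly, without replacement, per batch."""
    by_batch = defaultdict(list)
    for idx, batch in enumerate(batch_labels):
        by_batch[batch].append(idx)
    keep = []
    for batch in sorted(by_batch):
        members = by_batch[batch]
        n_keep = round(len(members) * fraction)
        keep.extend(rng.sample(members, n_keep))
    return sorted(keep)

# Toy example with the 0.64 / 0.36 batch split reported for the atlas object.
labels = ["10X"] * 640 + ["Smart-Seq2"] * 360
kept = downsample_by_batch(labels, fraction=0.10, rng=random.Random(1))
share_10x = sum(1 for i in kept if labels[i] == "10X") / len(kept)
```

Because rounding is applied per batch in this sketch, subset sizes may differ by a cell or two from those obtained by rounding the total.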
We fixed the batch size to 1001 cells and we created datasets including 2, 3, 5, 10, 20 and 50 batches, introducing small artificial batch effects. Cell type entropies remain low as the number of batches increases for all methods, highlighting the capacity of our batch simulating procedure to not mix distinct cell populations as batches are included. Regarding batch entropy (Figure 4D), BBKNN, Seurat and Harmony show the most stable performance as the number of batches increases. Although all methods have an exponential increase in both memory use and runtime, mnnCorrect stands out again as the slowest method. As before, we find that Seurat consumes the most memory, and along with mnnCorrect it fails to integrate 50 batches.
Impact of batch correction on unsupervised clustering and identification of marker genes
A key advantage of the entropy measures is that they can easily be calculated for any dataset containing discrete cell state clusters and that they are easy to interpret. However, they only evaluate the mixing of the cells as represented by the nearest neighbor graph, and they do not directly assess how the batch correction will impact downstream analyses based on the corrected data. To understand how specific aspects of the analysis are affected, tailored benchmarks are required. BatchBench allows users to add customized modules to evaluate the aspect they find most relevant. Here, we consider two common types of analyses, unsupervised clustering and identification of marker genes.
To evaluate the effect on unsupervised clustering, we apply four published methods, Leiden (33), Louvain (34), SC3 (35), RaceID (2) and hierarchical clustering, to the corrected data, and we then compare the merged cluster labels to the ones that were assigned prior to merging. To assess the proximity between clusterings we used a distance metric, variation of information, and a similarity metric, Adjusted Rand Index (ARI). The two measurements are by definition inversely correlated, and because they are consistent (Spearman’s rho = –0.87) we will mainly refer to the ARI results. A common question regarding clustering refers to the choice of which features to include. To determine the effect of feature selection on clustering performance after batch correction we establish five fractions of features: 0.05, 0.1, 0.2, 0.5 and 1.0 (all of the features) by ranking genes in descending order by their coefficient of variation. Note that the feature selection could not be applied to BBKNN, Harmony and fastMNN since they do not operate in the gene space.
Our analysis of the MCA suggested small differences in cell type entropy, but large differences in how well the batches were mixed (Figure 3). By contrast, when running unsupervised clustering the batch correction methods achieve similar ARI values, except RaceID k-means and k-medoids, which perform worse. Closer inspection reveals large differences between tissues, something that is not evident from the entropy measures (Supplementary Table S1). In general a greater clustering similarity for this dataset is achieved by clustering with all genes (Supplementary Figure S10), except for RaceID k-means, which shows very poor clustering similarity. Note that RaceID clustering for Bone Marrow was interrupted after running for a week, and hence is not displayed.
For the Tabula Muris we observe a similar pattern with large differences in ARI between tissues and relatively small differences across methods. Compared to MCA, we observe an improvement in similarity values for RaceID, SC3 and hierarchical clustering algorithms, whereas Leiden and Louvain algorithms show worse performance. Closer inspection reveals that the Leiden and Louvain methods perform poorly for datasets with a small number of clusters (Supplementary Figures S3–S9). Surprisingly, for heart and mammary glands the best results are achieved by hierarchical clustering, RaceID and SC3 applied to the uncorrected data. There is a higher diversity in the feature fraction displaying the best similarity (Supplementary Figure S10). For TM datasets, the use of a smaller fraction of features with a higher coefficient of variation results in an enhanced clustering.
For the pancreas dataset, hierarchical clustering together with RaceID and SC3 algorithms tend to have a higher ARI. Inclusion of all features in the clustering tend to
ports marker genes for fewer cell types than the other methods. A similar problem stems from the fact that sometimes the individual batches do not share any or only few marker genes prior to merging, e.g. the neonatal calvaria from the MCA, which explains the grey boxes in Figure 5c.
DISCUSSION
We have developed BatchBench, a customizable pipeline for comparing scRNA-seq batch correction methods. We have assessed the performance of eight popular batch correction methods based on entropy measurements across three datasets that suffer from donor and platform effects. Our results highlight Seurat as the top performer as it correctly merges batches while maintaining the separation of distinct cell populations. Harmony also shows very good results in pancreas and MCA but surprisingly fails in correcting the Tabula Muris batch effects. Scanorama and fastMNN can be considered consistently good performers. Regarding BBKNN, we note that the entropies are not suitable for evaluating its performance as the method operates by identifying nearest neighbours in each of the provided batches (26) and adjusting neighbors to maximize the batch entropy. Hence, a different metric should be established to evaluate the performance of BBKNN. We also evaluated how the methods perform as the number of cells and the number of batches are varied. Here, we highlight Harmony as a method that provides good performance while being economical in its use of computational resources. However, our analyses suggest that all methods, with the possible excep-
yield better similarity results (Supplementary Figure S10). tions of BBKNN and Harmony, will struggle to integrate
We also highlight that hierarchical clustering applied to hundreds of batches even if each batch is relatively small.
BBKNN distance matrix is not a good approach. Addition- Thus, improving scalability is a central requirement for fu-
ally, Scanorama shows highly variable performance across ture methods.
the clustering algorithms and datasets considered. A key insight from our study is that the entropy mea-
The main objective of batch correction methods is to en- sures do not fully reflect how the choice of batch correc-
sure that cells with similar expression profiles end up near tion method will impact downstream computational analy-
each other. The most widely used metrics, e.g. mixing en- ses. We applied five different unsupervised clustering meth-
tropies or inverse Simpson index (16,19,29), are designed to ods to the merged datasets, and the results are not as clear
evaluate this aspect. However, if a researcher is interested in as for the entropy analyses. No single method emerges as
analyzing the expression values for other purposes then it is the best performer, and in some cases the best results were
important to make sure that the corrected values are close to obtained using the uncorrected data. This result highlights
the original ones. To investigate how much expression ma- the importance of using benchmarks that are more closely
trices are distorted by the different methods, we compared linked to the analysis that will be carried out for the merged
the marker genes identified before and after batch correc- dataset.
tion for the five methods that modify the expression ma- Our attempt to identify marker genes from the corrected
trix (Table 1). We identified marker genes for each batch dataset demonstrates the difficulty of using the merged ex-
individually as well as for the merged datasets from each pression matrix for downstream analyses. As none of the
method that outputs a modified expression matrix. Unlike methods considered in our study performed adequately in
the entropy and clustering analyses, we observed stark dif- this benchmark, we highlight this as an area where improve-
ferences between batch correction methods. Remarkably, ments are required. Since marker genes are not preserved,
after merging using Scanorama or mnnCorrect, not a sin- we stress the importance for users to monitor how expres-
gle marker gene is identified. Only ComBat and Limma are sion levels change. Any analysis based on the expression
able to identify marker genes for most cell types, while Seu- levels, e.g. identification of marker genes or differentially
rat only reports markers for a minority of cell types in most expressed genes, will need to be verified to ensure that the
tissues (Figure 5B). Comparing the similarity between the result was not distorted due to the alterations introduced
marker genes identified in the individual batches and the by the batch correction method. An important limitation
merged dataset using a generalized Jaccard index (36), we of our marker gene analysis is that it only quantifies con-
find that Seurat provides the highest degree of consistency sistency as there is not yet an established ground truth for
(Figure 5C). However, it is important to keep in mind that what marker genes are represented for the cell types in our
Seurat’s good performance is biased by the fact that it re- study. We tried to use marker gene lists from the literature as
Figure 5. Evaluation of the impact of batch correction on unsupervised clustering and marker gene identification. (A) Clustering similarity of batch
corrected output to cell labels as evaluated by the Adjusted Rand Index. The highest ARI value from the five fractions of features considered for the
clustering is displayed. MCA: Mouse Cell Atlas, P: Pancreas, TM: Tabula Muris. (B) Fraction of total cell types over which marker genes are detected. (C)
Similarity of marker genes between merged dataset and individual batches as evaluated by the generalized Jaccard Index.
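The two scores named in this caption can be illustrated with a minimal Python sketch. It is not the BatchBench implementation. The Adjusted Rand Index follows the standard pair-counting formula. The multi-set similarity is a simple intersection-over-union across all marker sets, which is only an assumed stand-in: the generalized Jaccard index used in the study follows reference (36) and may be defined differently. The function names and the gene symbols in the usage note are hypothetical.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand Index between two labelings of the same cells."""
    if len(labels_a) != len(labels_b):
        raise ValueError("labelings must have the same length")
    total_pairs = comb(len(labels_a), 2)
    if total_pairs == 0:
        return 1.0
    # Count pairs of cells that fall together in each contingency cell,
    # in each cluster of the first labeling, and in each of the second.
    pairs_joint = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    pairs_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    pairs_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = pairs_a * pairs_b / total_pairs
    max_index = 0.5 * (pairs_a + pairs_b)
    if max_index == expected:
        # Degenerate case, e.g. both labelings place all cells in one cluster.
        return 1.0
    return (pairs_joint - expected) / (max_index - expected)


def multi_set_jaccard(marker_sets):
    """Assumed simplification: markers shared by all sets over markers in any set."""
    sets = [set(s) for s in marker_sets]
    if not sets:
        raise ValueError("at least one marker set is required")
    union = set().union(*sets)
    if not union:
        return 0.0
    return len(set.intersection(*sets)) / len(union)
```

For example, `adjusted_rand_index(cell_type_labels, cluster_labels)` compares a clustering with the annotated cell types, and `multi_set_jaccard([{"Ins1", "Gcg"}, {"Ins1", "Ppy"}])` compares marker lists from two batches. A score that ignores cell types with no reported markers would share the bias described in the text for Seurat.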
represented by the CellMarker database (37), but we found that all pancreas datasets provided poor overlap, even before batch correction (Supplementary Figure S15).

Benchmark studies are important as they help guide researchers in their choice of methods. They are also helpful for developers as they can highlight limitations of existing methods and provide guidance as to where improvements are needed. One shortcoming of traditional benchmarks, however, is that they are static in nature and that they only consider the datasets that the authors of the benchmark study had chosen to include. A related issue is that the metrics used to evaluate methods may not be relevant to all datasets and research questions. Along with a similar study by Luecken et al. (40), BatchBench will serve as a useful platform to the community as it enables benchmarks to be tailored to specific needs.

SUPPLEMENTARY DATA

Supplementary Data are available at NAR Online.

ACKNOWLEDGEMENTS

We would like to thank members of the Cellular Genetics Informatics team and the Hemberg lab for constructive feedback and comments. S.vD., V.Y.K. and M.H. were funded by a core grant from the Wellcome Trust. R.C.G. was funded by the Polytechnic University of Valencia under an Erasmus+ studentship, and by the Wellcome Trust.
Author contributions: M.H. and V.Y.K. conceived of the study and supervised the work. R.C.G. and S.vD. developed the Nextflow pipeline. R.C.G. carried out the benchmarking of the three datasets and eight methods. R.C.G. and M.H. wrote the manuscript with inputs from V.Y.K. and S.vD.

FUNDING

Wellcome Trust; Polytechnic University of Valencia [Erasmus+]. Funding for open access charge: Wellcome Trust.
Conflict of interest statement. None declared.

REFERENCES

1. Rostom,R., Svensson,V., Teichmann,S.A. and Kar,G. (2017) Computational approaches for interpreting scRNA-seq data. FEBS Lett., 591, 2213–2225.
2. Grün,D., Lyubimova,A., Kester,L., Wiebrands,K., Basak,O., Sasaki,N., Clevers,H. and van Oudenaarden,A. (2015) Single-cell messenger RNA sequencing reveals rare intestinal cell types. Nature, 525, 251–255.
3. Zeisel,A., Muñoz-Manchado,A.B., Codeluppi,S., Lönnerberg,P., La Manno,G., Juréus,A., Marques,S., Munguba,H., He,L., Betsholtz,C. et al. (2015) Brain structure. Cell types in the mouse cortex and hippocampus revealed by single-cell RNA-seq. Science, 347, 1138–1142.
4. Bendall,S.C., Davis,K.L., Amir,E.-A.D., Tadmor,M.D., Simonds,E.F., Chen,T.J., Shenfeld,D.K., Nolan,G.P. and Pe’er,D. (2014) Single-cell trajectory detection uncovers progression and regulatory coordination in human B cell development. Cell, 157, 714–725.
5. Haghverdi,L., Büttner,M., Wolf,F.A., Buettner,F. and Theis,F.J. (2016) Diffusion pseudotime robustly reconstructs lineage branching. Nat. Methods, 13, 845–848.
6. Lönnberg,T., Svensson,V., James,K.R., Fernandez-Ruiz,D., Sebina,I., Montandon,R., Soon,M.S.F., Fogg,L.G., Nair,A.S., Liligeto,U. et al. (2017) Single-cell RNA-seq and computational analysis using temporal mixture modelling resolves Th1/Tfh fate bifurcation in malaria. Sci. Immunol., 2, eaal2.
7. Shalek,A.K., Satija,R., Adiconis,X., Gertner,R.S., Gaublomme,J.T., Raychowdhury,R., Schwartz,S., Yosef,N., Malboeuf,C., Lu,D. et al. (2013) Single-cell transcriptomics reveals bimodality in expression and splicing in immune cells. Nature, 498, 236–240.
8. Marinov,G.K., Williams,B.A., McCue,K., Schroth,G.P., Gertz,J., Myers,R.M. and Wold,B.J. (2014) From single-cell to cell-pool transcriptomes: stochasticity in gene expression and RNA splicing. Genome Res., 24, 496–510.
9. Qiu,X., Hill,A., Packer,J., Lin,D., Ma,Y.-A. and Trapnell,C. (2017) Single-cell mRNA quantification and differential analysis with Census. Nat. Methods, 14, 309–315.
10. Welch,J.D., Hu,Y. and Prins,J.F. (2016) Robust detection of alternative splicing in a population of single cells. Nucleic Acids Res., 44, e73.
11. Trapnell,C., Hendrickson,D.G., Sauvageau,M., Goff,L., Rinn,J.L. and Pachter,L. (2013) Differential analysis of gene regulation at transcript resolution with RNA-seq. Nat. Biotechnol., 31, 46–53.
12. Deng,Q., Ramsköld,D., Reinius,B. and Sandberg,R. (2014) Single-cell RNA-seq reveals dynamic, random monoallelic gene expression in mammalian cells. Science, 343, 193–196.
13. Kim,J.K., Kolodziejczyk,A.A., Ilicic,T., Teichmann,S.A. and Marioni,J.C. (2015) Characterizing noise structure in single-cell RNA-seq distinguishes genuine from technical stochastic allelic expression. Nat. Commun., 6, 8687.
14. Reinius,B., Mold,J.E., Ramsköld,D., Deng,Q., Johnsson,P., Michaëlsson,J., Frisén,J. and Sandberg,R. (2016) Analysis of allelic expression patterns in clonal somatic cells by single-cell RNA-seq. Nat. Genet., 48, 1430–1435.
15. Kim,J.K. and Marioni,J.C. (2013) Inferring the kinetics of stochastic gene expression from single-cell RNA-sequencing data. Genome Biol., 14, R7.
16. Haghverdi,L., Lun,A.T.L., Morgan,M.D. and Marioni,J.C. (2018) Batch effects in single-cell RNA-sequencing data are corrected by matching mutual nearest neighbors. Nat. Biotechnol., 36, 421–427.
17. Hwang,B., Lee,J.H. and Bang,D. (2018) Single-cell RNA sequencing technologies and bioinformatics pipelines. Exp. Mol. Med., 50, 96.
18. Chen,G., Ning,B. and Shi,T. (2019) Single-Cell RNA-Seq technologies and related computational data analysis. Front. Genet., 10, 317.
19. Tran,H.T.N., Ang,K.S., Chevrier,M., Zhang,X., Lee,N.Y.S., Goh,M. and Chen,J. (2020) A benchmark of batch-effect correction methods for single-cell RNA sequencing data. Genome Biol., 21, 12.
20. Di Tommaso,P., Chatzou,M., Floden,E.W., Barja,P.P., Palumbo,E. and Notredame,C. (2017) Nextflow enables reproducible computational workflows. Nat. Biotechnol., 35, 316–319.
21. Ritchie,M.E., Phipson,B., Wu,D., Hu,Y., Law,C.W., Shi,W. and Smyth,G.K. (2015) limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Res., 43, e47.
22. Leek,J.T., Johnson,W.E., Parker,H.S., Jaffe,A.E. and Storey,J.D. (2012) The sva package for removing batch effects and other unwanted variation in high-throughput experiments. Bioinformatics, 28, 882–883.
23. Stuart,T., Butler,A., Hoffman,P., Hafemeister,C., Papalexi,E., Mauck,W.M., Hao,Y., Stoeckius,M., Smibert,P. and Satija,R. (2019) Comprehensive integration of single-cell data. Cell, 177, 1888–1902.
24. Hie,B., Bryson,B. and Berger,B. (2019) Efficient integration of heterogeneous single-cell transcriptomes using Scanorama. Nat. Biotechnol., 37, 685–691.
25. Korsunsky,I., Millard,N., Fan,J., Slowikowski,K., Zhang,F., Wei,K., Baglaenko,Y., Brenner,M., Loh,P.R. and Raychaudhuri,S. (2019) Fast, sensitive and accurate integration of single-cell data with Harmony. Nat. Methods, 16, 1289–1296.
26. Polański,K., Young,M.D., Miao,Z., Meyer,K.B., Teichmann,S.A. and Park,J.-E. (2020) BBKNN: fast batch alignment of single cell transcriptomes. Bioinformatics, 36, 964–965.
27. Muraro,M.J., Dharmadhikari,G., Grün,D., Groen,N., Dielen,T., Jansen,E., van Gurp,L., Engelse,M.A., Carlotti,F., de Koning,E.J. et al. (2016) A single-cell transcriptome atlas of the human pancreas. Cell Syst., 3, 385–394.
28. Segerstolpe,Å., Palasantza,A., Eliasson,P., Andersson,E.-M., Andréasson,A.-C., Sun,X., Picelli,S., Sabirsh,A., Clausen,M.,
Bjursell,M.K. et al. (2016) Single-cell transcriptome profiling of human pancreatic islets in health and type 2 diabetes. Cell Metab., 24, 593–607.
29. Azizi,E., Carr,A.J., Plitas,G., Cornish,A.E., Konopacki,C., Prabhakaran,S., Nainys,J., Wu,K., Kiseliovas,V., Setty,M. et al. (2018) Single-cell map of diverse immune phenotypes in the breast tumor microenvironment. Cell, 174, 1293–1308.
30. Kiselev,V.Y., Yiu,A. and Hemberg,M. (2018) scmap: projection of single-cell RNA-seq data across data sets. Nat. Methods, 15, 359–362.
31. Han,X., Wang,R., Zhou,Y., Fei,L., Sun,H., Lai,S., Saadatpour,A., Zhou,Z., Chen,H., Ye,F. et al. (2018) Mapping the mouse cell atlas by microwell-seq. Cell, 172, 1091–1107.
32. Tabula Muris Consortium (2018) Single-cell transcriptomics of 20 mouse organs creates a Tabula Muris. Nature, 562, 367–372.
33. Traag,V.A., Waltman,L. and van Eck,N.J. (2019) From Louvain to Leiden: guaranteeing well-connected communities. Sci. Rep., 9, 5233.
34. Blondel,V.D., Guillaume,J.-L., Lambiotte,R. and Lefebvre,E. (2008) Fast unfolding of communities in large networks. J. Stat. Mech., 2008, P10008.
35. Kiselev,V.Y., Kirschner,K., Schaub,M.T., Andrews,T., Yiu,A., Chandra,T., Natarajan,K.N., Reik,W., Barahona,M., Green,A.R. et al. (2017) SC3: consensus clustering of single-cell RNA-seq data. Nat. Methods, 14, 483–486.
36. Diserud,O.H. and Odegaard,F. (2007) A multiple-site similarity measure. Biol. Lett., 3, 20–22.
37. Zhang,X., Lan,Y., Xu,J., Quan,F., Zhao,E., Deng,C., Luo,T., Xu,L., Liao,G., Yan,M. et al. (2019) CellMarker: a manually curated resource of cell markers in human and mouse. Nucleic Acids Res., 47, D721–D728.
38. McInnes,L., Healy,J., Saul,N. and Großberger,L. (2018) UMAP: uniform manifold approximation and projection. JOSS, 3, 861.
39. Baron,M., Veres,A., Wolock,S.L., Faust,A.L., Gaujoux,R., Vetere,A., Ryu,J.H., Wagner,B.K., Shen-Orr,S.S., Klein,A.M. et al. (2016) A single-cell transcriptomic map of the human and mouse pancreas reveals inter- and intra-cell population structure. Cell Syst., 3, 346–360.
40. Luecken,M.D. and Theis,F.J. (2019) Current best practices in single-cell RNA-seq analysis: a tutorial. Mol. Syst. Biol., 15, e8746.