scICER is an R package that implements a systematic and efficient workflow to evaluate clustering consistency in single-cell RNA-seq data. This is the R port of the original scICE Julia package, designed to be fully compatible with the Seurat ecosystem.
- 🔬 Automated Cluster Evaluation: Systematically tests multiple cluster numbers to find consistent results
- ⚡ Element-Centric Similarity (ECS): Efficient algorithm for comparing clustering results
- 🧬 Seurat Integration: Seamless integration with Seurat objects and workflows
- 🚀 Parallel Processing: Leverages multiple cores for faster computation
- 📊 Comprehensive Visualization: Built-in plotting functions for result interpretation
- 📈 Robust Statistics: Bootstrap-based confidence intervals and stability assessment
- 🎯 Reproducible Analysis: Optional seed parameter for deterministic results
- 🔍 Smart Parameter Recommendations: Automatically suggests optimal parameters based on dataset size
- ✅ Preprocessing Validation: Checks and guides users through proper Seurat preprocessing
- 📋 Comprehensive Reporting: Generates detailed analysis summaries with interpretation guidance
scICER evaluates clustering consistency by:
- Multiple Clustering: Runs Leiden clustering multiple times with different parameters
- Binary Search: Automatically finds resolution parameters for target cluster numbers
- ECS Calculation: Uses Element-Centric Similarity to efficiently compare clustering results
- Optimization: Iteratively refines clustering to minimize inconsistency
- Bootstrap Validation: Assesses stability through bootstrap sampling
- Threshold Application: Identifies cluster numbers meeting consistency criteria
- Reproducibility Control: Optional seed parameter for deterministic results
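To make this concrete, the sketch below clusters a Seurat SNN graph several times and compares the runs pairwise. It is a conceptual illustration of the idea behind steps 1, 3, and 4 only, not scICER's internal code: scICER uses Element-Centric Similarity (via ClustAssess) and a binary search over resolutions, whereas this sketch substitutes igraph's NMI as the similarity measure.

```r
# Conceptual illustration only (NOT scICER's implementation)
library(Seurat)
library(igraph)

data(pbmc_small)
pbmc_small <- FindNeighbors(pbmc_small, dims = 1:10)
g <- graph_from_adjacency_matrix(pbmc_small@graphs$RNA_snn,
                                 mode = "undirected", weighted = TRUE, diag = FALSE)

# Run Leiden several times with different random initializations
set.seed(1)
n_trials <- 5
runs <- lapply(seq_len(n_trials), function(i) {
  membership(cluster_leiden(g, objective_function = "modularity"))
})

# Pairwise similarity between runs (NMI as a stand-in for ECS);
# values near 1 across the matrix indicate consistent clustering
sim <- sapply(runs, function(a) sapply(runs, function(b) compare(a, b, method = "nmi")))
round(sim, 3)
```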
Performance Considerations vs scICE Julia package
- Speed: R implementation typically runs slower than the native Julia version
- Memory: Requires more memory for equivalent analyses
- Accuracy: While highly consistent, results may differ slightly from the original scICE due to implementation differences in R vs Julia, random number generation variations, and floating-point precision handling.
Note: These tradeoffs come with the benefit of tighter integration with the Seurat ecosystem and standard R workflows.
scICER requires the ClustAssess library for Element-Centric Similarity (ECS) calculations, which provides:
- ~150x faster performance for similarity calculations
- Professional, well-tested implementation by Cambridge Core Bioinformatics
- Identical results to reference implementations
ClustAssess will be automatically installed when you install scICER.
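For reference, ECS can also be computed directly on two label vectors. The snippet below is a minimal usage sketch and assumes ClustAssess exports `element_sim()` and `element_sim_elscore()` (check the help pages of your installed version); it is not part of the scICER API.

```r
library(ClustAssess)

labels_a <- c(1, 1, 1, 2, 2, 2, 3, 3, 3)
labels_b <- c(1, 1, 2, 2, 2, 2, 3, 3, 1)

# Overall element-centric similarity between the two clusterings (1 = identical)
element_sim(labels_a, labels_b)

# Per-element (per-cell) similarity scores, useful for spotting disagreeing cells
element_sim_elscore(labels_a, labels_b)
```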
Ensure you have R ≥ 4.0.0 and the required dependencies:
# Install required CRAN packages
install.packages(c(
"Seurat", "SeuratObject", "igraph", "Matrix",
"parallel", "foreach", "doParallel", "dplyr",
"ggplot2", "devtools", "ClustAssess"
))
# Optional: Install leiden package for better performance
install.packages("leiden")

# Install devtools if not already installed
if (!requireNamespace("devtools", quietly = TRUE)) {
  install.packages("devtools")
}
# Install scICER from GitHub
devtools::install_github("ATPs/scICER", upgrade = "never") # avoid upgrading packages; use existing ones

# Load required packages
library(scICER)
library(Seurat)
library(foreach)
library(doParallel)
library(ggplot2)
# Set up parallel processing (optional, but recommended)
n_workers <- 4 # adjust based on your system
# Load your Seurat object
data(pbmc_small) # Example dataset
seurat_obj <- pbmc_small
# Ensure preprocessing is complete
seurat_obj <- NormalizeData(seurat_obj)
seurat_obj <- FindVariableFeatures(seurat_obj)
seurat_obj <- ScaleData(seurat_obj)
seurat_obj <- RunPCA(seurat_obj)
seurat_obj <- FindNeighbors(seurat_obj, dims = 1:20)
# Run scICER analysis
scice_results <- scICE_clustering(
object = seurat_obj,
cluster_range = 3:20,
n_workers = n_workers,
remove_threshold = Inf,
verbose = TRUE
)
# For reproducible results, use the seed parameter
scice_results <- scICE_clustering(
object = seurat_obj,
cluster_range = 3:20,
n_workers = n_workers,
remove_threshold = Inf,
verbose = TRUE,
seed = 123 # Set seed for reproducible results
)
# Visualize results
plot_ic(scice_results, threshold = 1.005)
# Extract consistent clustering labels
seurat_obj <- get_robust_labels(scice_results, return_seurat = TRUE)
# Visualize with consistent clusters, cluster range from 3 to 15
xl.fig <- list()
for (n in 3:15) {
if (paste0("clusters_", n) %in% colnames(seurat_obj@meta.data)){
xl.fig[[paste0("clusters_", n)]] <- DimPlot(seurat_obj, group.by = paste0("clusters_", n), label = TRUE) + ggtitle(paste("Clusters:", n))
}
}
cowplot::plot_grid(plotlist = xl.fig, ncol = 3)

# Comprehensive scICER analysis
results <- scICE_clustering(
object = your_seurat_object,
cluster_range = 2:20, # Range of cluster numbers to test
n_workers = 8, # Number of parallel workers
n_trials = 15, # Clustering trials per resolution
n_bootstrap = 100, # Bootstrap iterations
seed = 42, # Optional: Set for reproducible results
ic_threshold = 1.005, # Consistency threshold
verbose = TRUE # Progress messages
)

# Fine-tuned parameters for large datasets
results <- scICE_clustering(
object = large_seurat_object,
graph_name = "RNA_snn", # Optional: specify graph name
cluster_range = 5:12, # Focused range for efficiency
objective_function = "CPM", # "CPM" or "modularity"
beta = 0.1, # Leiden beta parameter
n_iterations = 10, # Initial Leiden iterations
max_iterations = 150, # Maximum optimization iterations
remove_threshold = 1.15, # Filter inconsistent results
resolution_tolerance = 1e-8, # Resolution search precision
n_trials = 10, # Reduced for speed
n_bootstrap = 50, # Reduced for speed
seed = 12345 # For reproducible results
)

It is fine to set a large `remove_threshold` here, because `plot_ic()` and `get_robust_labels()` both use a default threshold of 1.005.
# Plot Inconsistency Coefficient (IC) scores
plot_ic(results,
threshold = 1.005,
title = "Clustering Consistency Analysis")
# Plot stability analysis
plot_stability(results)
# Extract summary of consistent clusters
summary <- extract_consistent_clusters(results, threshold = 1.005)
print(summary)
# Get clustering labels as data frame
labels_df <- get_robust_labels(results)
# Add results directly to Seurat object
seurat_obj <- get_robust_labels(results, return_seurat = TRUE)

# Complete Seurat + scICER workflow
seurat_obj <- CreateSeuratObject(counts = your_counts)
seurat_obj <- NormalizeData(seurat_obj)
seurat_obj <- FindVariableFeatures(seurat_obj)
seurat_obj <- ScaleData(seurat_obj)
seurat_obj <- RunPCA(seurat_obj)
seurat_obj <- FindNeighbors(seurat_obj, dims = 1:20)
# Run scICER with reproducibility
scice_results <- scICE_clustering(
seurat_obj,
cluster_range = 3:15,
seed = 42 # For reproducible results
)
# Add consistent clusters to Seurat object
seurat_obj <- get_robust_labels(scice_results, return_seurat = TRUE)
# Continue with standard Seurat analysis
seurat_obj <- RunUMAP(seurat_obj, dims = 1:20)
DimPlot(seurat_obj, group.by = "clusters_8") # Use consistent clustering
# Find markers for consistent clusters
Idents(seurat_obj) <- "clusters_8"
markers <- FindAllMarkers(seurat_obj)

The code used to generate the results in the scICER paper is available on GitHub (https://github.com/yerry77/scICER-workflow).
scICER also includes scLENS, a dimensionality reduction method that uses Random Matrix Theory (RMT) to distinguish signal from noise and performs robustness testing to identify stable signals.
- 🧮 Random Matrix Theory: Uses RMT to automatically identify signal vs. noise eigenvalues
- 🔄 Robustness Testing: Performs multiple perturbations to test signal stability
- 🎯 Noise Filtering: Automatically filters out noise-dominated components
- ⚡ Parallel Processing: Supports multi-core computation for faster analysis
- 🔧 Seurat Integration: Seamlessly works with Seurat objects and workflows
- 📊 Flexible Normalization: Works with raw counts or pre-normalized data (set `is_normalized = TRUE` for the "data" or "scale.data" slots)
# Load required packages
library(scICER)
library(Seurat)
library(ggplot2)
# Load your data (example with pbmc dataset)
data(pbmc_small)
seurat_obj <- pbmc_small
# Basic scLENS analysis with default parameters
seurat_obj <- sclens(seurat_obj)
# Check the new reductions added
Reductions(seurat_obj)
# Visualize scLENS results
seurat_obj <- RunUMAP(seurat_obj,
dims = 1:ncol(seurat_obj@reductions$sclens_pca_filtered),
reduction = "sclens_pca_filtered",
reduction.name = "sclens_umap")
# Compare with standard PCA
seurat_obj <- NormalizeData(seurat_obj)
seurat_obj <- FindVariableFeatures(seurat_obj)
seurat_obj <- ScaleData(seurat_obj)
seurat_obj <- RunPCA(seurat_obj)
seurat_obj <- RunUMAP(seurat_obj, dims = 1:30, reduction.name = "standard_umap")
# Plot comparison
library(patchwork)
p1 <- DimPlot(seurat_obj, reduction = "sclens_umap") + ggtitle("scLENS UMAP")
p2 <- DimPlot(seurat_obj, reduction = "standard_umap") + ggtitle("Standard PCA UMAP")
print(p1 | p2)

# Customized scLENS analysis
seurat_obj <- sclens(
seurat_obj,
assay = "RNA", # Which assay to use
slot = "counts", # Which slot to use
th = 60, # Threshold angle in degrees for robustness
p_step = 0.001, # Decrement level for sparsity testing
n_perturb = 20, # Number of perturbations for robustness
centering = "mean", # Centering method ("mean" or "median")
is_normalized = FALSE, # Whether data is already normalized
reduction_name_all = "sclens_all", # Name for all signals reduction
reduction_name_filtered = "sclens_robust", # Name for robust signals reduction
n_threads = 4, # Number of parallel threads
verbose = TRUE # Show detailed progress
)
# Working with pre-normalized data
# If your data is already normalized (e.g., from "data" or "scale.data" slots),
# set is_normalized = TRUE to skip scLENS normalization steps
seurat_normalized <- NormalizeData(seurat_obj)
seurat_normalized <- sclens(
seurat_normalized,
slot = "data", # Use normalized data
is_normalized = TRUE, # Skip scLENS normalization
verbose = TRUE
)
# Or with scaled data
seurat_scaled <- seurat_obj %>% NormalizeData() %>% ScaleData()
seurat_scaled <- sclens(
seurat_scaled,
slot = "scale.data", # Use scaled data
is_normalized = TRUE, # Skip scLENS normalization
verbose = TRUE
)
# Access scLENS results
sclens_metadata <- seurat_obj@misc$sclens_results
cat("Total signals detected:", sclens_metadata$n_signals_total, "\n")
cat("Robust signals found:", sclens_metadata$n_signals_robust, "\n")
cat("Signal retention rate:",
round(sclens_metadata$n_signals_robust / sclens_metadata$n_signals_total * 100, 1), "%\n")

scLENS adds two reductions to your Seurat object:
- `sclens_pca_all`: All signals detected by RMT filtering
- `sclens_pca_filtered`: Only robust signals that passed perturbation testing
# Check dimensions of each reduction
cat("All signals:", ncol(seurat_obj@reductions$sclens_pca_all), "components\n")
cat("Robust signals:", ncol(seurat_obj@reductions$sclens_pca_filtered), "components\n")
# Access eigenvalues and robustness scores
eigenvalues <- seurat_obj@misc$sclens_results$signal_eigenvalues
robustness_scores <- seurat_obj@misc$sclens_results$robustness_scores
robust_indices <- seurat_obj@misc$sclens_results$robust_signal_indices
# Plot eigenvalue spectrum
eigenvalue_df <- data.frame(
Component = 1:length(eigenvalues),
Eigenvalue = eigenvalues,
Robust = 1:length(eigenvalues) %in% robust_indices
)
ggplot(eigenvalue_df, aes(x = Component, y = Eigenvalue, color = Robust)) +
geom_point(size = 2) +
scale_color_manual(values = c("FALSE" = "gray", "TRUE" = "red")) +
labs(title = "scLENS Eigenvalue Spectrum",
subtitle = "Red points are robust signals") +
theme_minimal()

# Use scLENS for improved clustering
seurat_obj <- FindNeighbors(seurat_obj,
reduction = "sclens_pca_filtered",
dims = 1:ncol(seurat_obj@reductions$sclens_pca_filtered))
seurat_obj <- FindClusters(seurat_obj, resolution = 0.5)
# Store the scLENS-based clustering before the next FindClusters() call overwrites seurat_clusters
seurat_obj@meta.data$sclens_clusters <- seurat_obj@meta.data$seurat_clusters
# Compare scLENS-based clustering with standard clustering
seurat_obj <- FindNeighbors(seurat_obj,
reduction = "pca",
dims = 1:30,
graph.name = c("standard_nn", "standard_snn"))
seurat_obj <- FindClusters(seurat_obj,
graph.name = "standard_snn",
resolution = 0.5)
# Store the standard clustering results
seurat_obj@meta.data$standard_clusters <- seurat_obj@meta.data$seurat_clusters
# Visualize clustering comparison
p1 <- DimPlot(seurat_obj, group.by = "sclens_clusters", reduction = "sclens_umap") +
ggtitle("scLENS-based Clustering")
p2 <- DimPlot(seurat_obj, group.by = "standard_clusters", reduction = "standard_umap") +
ggtitle("Standard PCA Clustering")
print(p1 | p2)

- Input Data: For optimal performance, we recommend using raw count data as input. Our testing shows this produces the best results.
- Batch Effect Handling: scLENS does not inherently remove batch effects. While integration with batch correction tools may be possible, this functionality has not been thoroughly tested.
- Integration with Harmony: Combining scLENS with Harmony for batch correction is not a validated workflow. Our preliminary tests, which involved running Harmony on scLENS-generated embeddings, did not yield satisfactory results, and this approach is not recommended.
- Performance and Memory Management: The R version of scLENS is significantly slower and uses more memory than the Julia version. Additionally, R may not fully release memory after the analysis completes.
  - To reduce memory usage and increase speed: Consider reducing `n_perturb` to 10 and using fewer `n_threads` (a minimal example follows this list).
  - To manage memory: We recommend saving your results, restarting the R session, and then reloading your data.
# Save the Seurat object
qs::qsave(seurat_obj, "your_seurat_object.qs")
# --> Restart your R session here <--
# Load the data to continue your work
seurat_obj <- qs::qread("your_seurat_object.qs")
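As a minimal example of the memory-saving settings mentioned above (the values are illustrative suggestions, not package defaults):

```r
# Memory-lighter scLENS run (illustrative parameter values)
seurat_obj <- sclens(
  seurat_obj,
  n_perturb = 10,  # fewer perturbations -> less memory and time
  n_threads = 2,   # fewer threads -> lower peak memory
  verbose = TRUE
)
```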
# Complete workflow: scLENS for dimensionality reduction + scICER for clustering evaluation
library(scICER)
# Step 1: Preprocessing
seurat_obj <- NormalizeData(seurat_obj)
seurat_obj <- FindVariableFeatures(seurat_obj)
seurat_obj <- ScaleData(seurat_obj)
# Step 2: scLENS dimensionality reduction
seurat_obj <- sclens(seurat_obj, verbose = TRUE)
# Step 3: Use scLENS embedding for neighborhood graph
n_sclens_dims <- ncol(seurat_obj@reductions$sclens_pca_filtered)
seurat_obj <- FindNeighbors(seurat_obj,
reduction = "sclens_pca_filtered",
dims = 1:n_sclens_dims)
# Step 4: scICER clustering evaluation
scice_results <- scICE_clustering(
object = seurat_obj,
cluster_range = 3:20,
n_workers = 4,
seed = 42,
verbose = TRUE
)
# Step 5: Visualization and analysis
plot_ic(scice_results, threshold = 1.005)
seurat_obj <- get_robust_labels(scice_results, return_seurat = TRUE)
# Step 6: Final UMAP with scLENS
seurat_obj <- RunUMAP(seurat_obj,
reduction = "sclens_pca_filtered",
dims = 1:n_sclens_dims,
reduction.name = "sclens_umap")
# Visualize final results
DimPlot(seurat_obj, reduction = "sclens_umap", group.by = "clusters_8") +
  ggtitle("scLENS + scICER: Robust Dimensionality Reduction and Clustering")

| Parameter | Default | Description |
|---|---|---|
| `th` | 60 | Threshold angle (degrees) for signal robustness. Lower = stricter |
| `p_step` | 0.001 | Decrement level for sparsity in robustness testing |
| `n_perturb` | 20 | Number of perturbations for robustness testing |
| `centering` | "mean" | Centering method: "mean" or "median" |
| `n_threads` | 5 | Number of parallel threads to use |
Recommended for:
- Datasets with high noise levels
- When standard PCA shows unclear structure
- Need for automatic signal/noise separation
- Robust dimensionality reduction for downstream analysis
- Integration with Random Matrix Theory principles
Consider alternatives when:
- Dataset is very small (<500 cells)
- Standard PCA already works well
- Computational resources are very limited
- Need exact reproducibility of published analyses using standard methods
The `scICE_clustering()` function returns a list containing (a short example of inspecting these components follows the IC interpretation guide below):
- `gamma`: Resolution parameters for each cluster number
- `labels`: Clustering results for each cluster number
- `ic`: Inconsistency scores (lower = more consistent)
- `ic_vec`: Bootstrap IC distributions
- `n_cluster`: Number of clusters tested
- `best_labels`: Best clustering labels for each cluster number
- `consistent_clusters`: Cluster numbers meeting the consistency threshold
- `mei`: Mutual Element-wise Information scores
- IC < 1.005: Highly consistent clustering (recommended); fewer than 0.5% of cells are inconsistently assigned.
- IC 1.005-1.01: Moderately consistent clustering; 0.5% to 1% of cells are inconsistently assigned.
- IC > 1.01: Low consistency (not recommended); more than 1% of cells are inconsistently assigned.
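A quick way to inspect these components and see which cluster numbers clear a given IC threshold is sketched below; it assumes `results$n_cluster` and `results$ic` are parallel vectors and `results$consistent_clusters` holds the passing cluster numbers, as listed above.

```r
# Tabulate IC per tested cluster number and flag the consistent ones
ic_table <- data.frame(
  n_clusters = results$n_cluster,
  ic = results$ic
)
ic_table$consistent <- ic_table$ic < 1.005
print(ic_table[order(ic_table$n_clusters), ])

# Cluster numbers scICER itself reports as consistent
results$consistent_clusters
```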
- IC Plot: Shows consistency scores across cluster numbers
- Stability Plot: Displays bootstrap distributions
- UMAP/t-SNE: Visualize clusters with consistent labels
# Get recommended parameters based on dataset size
recommended <- get_recommended_parameters(seurat_obj)
# Optimized settings for large datasets
results <- scICE_clustering(
object = large_seurat_obj,
cluster_range = 5:15, # Focused range
n_trials = 8, # Fewer trials
n_bootstrap = 50, # Fewer bootstrap samples
n_workers = detectCores() - 1, # Maximum parallel processing
seed = 42 # For reproducibility
)

- Reduce `n_trials` and `n_bootstrap` for memory-limited systems
- Use a focused `cluster_range` based on biological expectations
- Install the `leiden` package for improved performance
- Use `get_recommended_parameters()` for optimal settings
Troubleshooting:

- "Graph not found" error:
  - Run `FindNeighbors()` on your Seurat object first
  - Use `check_seurat_ready()` to validate preprocessing
  - By default, scICER uses the SNN graph from the active assay (e.g., "RNA_snn")
  - You can check available graphs with `names(your_seurat_object@graphs)`
  - Specify a different graph with the `graph_name` parameter if needed (see the sketch after this list)
- No consistent clusters found:
  - Try adjusting `ic_threshold` (e.g., 1.01)
  - Expand `cluster_range`
  - Check preprocessing quality
  - Use `create_results_summary()` for detailed analysis
- Memory issues:
  - Reduce `n_workers`, `n_trials`, or `cluster_range`
  - Use `get_recommended_parameters()` for guidance
- Slow performance:
  - Install the `leiden` package
  - Increase `n_workers`
  - Use recommended parameters for your dataset size
- Reproducibility issues:
  - Always set the `seed` parameter
  - Keep all other parameters constant
  - Document the parameter values used
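For the "Graph not found" item above, the usual pattern is to list the graphs present in the object and then pass one explicitly:

```r
# List the graphs stored in the Seurat object (e.g., "RNA_nn", "RNA_snn")
names(seurat_obj@graphs)

# Point scICER at a specific SNN graph if the default is not the one you want
scice_results <- scICE_clustering(
  object = seurat_obj,
  graph_name = "RNA_snn",  # replace with a name reported above
  cluster_range = 3:15
)
```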
# Check if leiden package is available
if (requireNamespace("leiden", quietly = TRUE)) {
message("leiden package detected - enhanced performance available")
} else {
message("Consider installing 'leiden' package for better performance")
}
# Validate Seurat object preprocessing
check_seurat_ready(seurat_obj)
# Get recommended parameters
params <- get_recommended_parameters(seurat_obj)

If you use scICER in your research, please cite:
Cao, X. (2024). scICER: Single-cell Inconsistency-based Clustering Evaluation in R.
R package version 1.0.0. https://github.com/ATPs/scICER
The original scICE algorithm citation will be added upon publication.
For each number of clusters (e.g., when testing 3-10 clusters):
- The algorithm runs `n_trials` independent clustering attempts (default: 15)
- Each trial uses the Leiden algorithm with a different random initialization
- For each cluster number:
  - Pairwise similarities between all trial results are calculated using Element-Centric Similarity (ECS)
  - An Inconsistency (IC) score is computed, measuring how different the clustering solutions are
  - A lower IC score indicates more consistent clustering across trials
- For cluster numbers with IC scores below your threshold (e.g., < 1.05):
  - The "best" clustering solution is selected from the trials
  - The "best" solution is the one with the highest average similarity to the other solutions (the most representative one); a conceptual sketch of this selection step follows below
  - This becomes the reported clustering for that number of clusters
So when you get results showing clusters 3-10 have IC scores < 1.05, it means:
- Each of these cluster numbers produced stable results across different random initializations
- The reported clustering solution is the most representative one from all trials
- You can be confident in using these clustering solutions for downstream analysis
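The selection step can be pictured as follows. This is a conceptual sketch, not scICER's internal code; it assumes ClustAssess's `element_sim()` as the pairwise similarity measure.

```r
library(ClustAssess)  # provides element_sim(); installed together with scICER

# Return the index of the trial labeling with the highest mean similarity
# to all other trials (the "most representative" solution)
pick_representative <- function(trial_labels, sim_fun = element_sim) {
  n <- length(trial_labels)
  sim <- sapply(trial_labels, function(a) sapply(trial_labels, function(b) sim_fun(a, b)))
  avg_sim <- (rowSums(sim) - diag(sim)) / (n - 1)
  which.max(avg_sim)
}

# Toy example: trial 2 disagrees with the other two, so trial 1 (tied with 3) is picked
trials <- list(c(1, 1, 2, 2, 3), c(1, 2, 2, 3, 3), c(1, 1, 2, 2, 3))
pick_representative(trials)
```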
`n_trials`:
- Increasing `n_trials` (default: 15) can improve the robustness of results
- More trials = more chances to find truly stable clustering solutions
- For large datasets, you might want to reduce trials to improve speed
- Recommended ranges:
  - Small datasets (<10k cells): 15-20 trials
  - Medium datasets (10k-50k cells): 10-15 trials
  - Large datasets (>50k cells): 8-10 trials
`n_bootstrap`:
- Increasing `n_bootstrap` (default: 100) improves confidence in IC score estimates
- More bootstrap iterations = more accurate stability assessment
- The default of 100 iterations is usually sufficient
- Consider increasing it for:
  - Critical analyses requiring high confidence
  - Publications or benchmark studies
  - Unusual or complex datasets
- Recommended ranges:
  - Standard analysis: 100 iterations
  - High-confidence analysis: 200-500 iterations
- Note: Higher values increase computation time; one way to scale both parameters with dataset size is sketched below
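One way to turn the ranges above into code is a small helper keyed on cell count; the cut-offs below simply mirror the recommendations (within scICER, `get_recommended_parameters()` serves the same purpose):

```r
# Illustrative helper mirroring the recommended ranges above
suggest_scice_params <- function(seurat_obj) {
  n_cells <- ncol(seurat_obj)
  if (n_cells < 10000) {
    list(n_trials = 20, n_bootstrap = 100)
  } else if (n_cells <= 50000) {
    list(n_trials = 15, n_bootstrap = 100)
  } else {
    list(n_trials = 10, n_bootstrap = 100)
  }
}

params <- suggest_scice_params(seurat_obj)
results <- scICE_clustering(
  object = seurat_obj,
  cluster_range = 3:15,
  n_trials = params$n_trials,
  n_bootstrap = params$n_bootstrap,
  seed = 42
)
```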
Remember: Quality preprocessing (normalization, feature selection, dimensionality reduction) often has more impact than increasing these parameters.
- 📋 Issues: Report bugs or request features on GitHub Issues
- 📧 Contact: [email protected]
- 🤝 Contributing: Pull requests welcome!
This package is licensed under the MIT License. See LICENSE file for details.
- Original scICE Julia package developers
- Seurat team for the excellent single-cell framework
- R community for the robust package ecosystem