
Rigorous integration of single-cell ATAC-seq data using regularized barycentric mapping

Package installation

It is recommended to create a new environment for Fountain.

conda create -n Fountain python==3.8
conda activate Fountain

Fountain is available on PyPI and can be installed using

pip install scFountain

Installation from the GitHub repository is also supported.
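
For example (the exact commands below are an assumption; adapt them to the repository's own instructions):

git clone https://github.com/BioX-NKU/Fountain.git
cd Fountain
pip install .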

Installation takes approximately 2 to 10 minutes, depending on the user's hardware and internet connection.

Tutorial

Usage and examples of Fountain's main functions are shown in the tutorial.

Quick Start

Fountain is a deep learning framework for batch integration of scATAC-seq data based on regularized barycentric mapping. Fountain supports generating batch-corrected low-dimensional embeddings, generating batch-corrected and enhanced ATAC profiles in the original dimensionality, and online batch integration.

Input format

  • h5ad file:

    • AnnData object of shape n_obs × n_vars.
  • count matrix file:

    • Rows correspond to peaks and columns to cells.
  • batch label and cell type label:

    • The batch labels and cell type labels are stored in anndata.obs. Cell type labels are used only for evaluation and are not required for training; see the sketch after this list.
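
A minimal sketch of an AnnData object in the expected layout (illustrative only; the file name and label values below are hypothetical):

    import numpy as np
    import anndata as ad

    # Toy example: 200 cells x 1,000 peaks of binarized accessibility counts
    X = (np.random.rand(200, 1000) < 0.05).astype(np.float32)
    adata = ad.AnnData(X)
    adata.obs['batch'] = ['batch1'] * 100 + ['batch2'] * 100      # required for integration
    adata.obs['cell_type'] = np.random.choice(['A', 'B'], 200)    # optional, evaluation only
    adata.write("./example_input.h5ad")                           # h5ad file used as input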

1. Data preprocessing

import scanpy as sc
import episcanpy as epi
import numpy as np
import sklearn
import pandas as pd 
import torch
from Fountain.data import create_dataloader, create_batchind_dict
from Fountain.fountain import Fountain
import scib
import matplotlib.pyplot as plt
  • You can click MB.h5ad to download the example dataset.

  • First, load and preprocess the raw scATAC-seq count matrix, including binarization and filtering out peaks detected in too few cells (typically fewer than 1-5% of cells) to reduce noise. Stricter filtering reduces training time and memory usage, but may discard biological signal; a quick way to gauge this trade-off is sketched at the end of this section.

    adata = sc.read("./MB.h5ad")
    fpeak = 0.04
    # Binarize the count matrix and keep peaks detected in at least a fraction fpeak of cells.
    epi.pp.binarize(adata)
    epi.pp.filter_features(adata, min_cells=np.ceil(fpeak*adata.shape[0]))

    An AnnData object is a container for single-cell data provided by the Python package anndata, which integrates seamlessly with scanpy, a widely used Python library for single-cell data analysis.
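
  • To gauge how aggressive a candidate filtering threshold is before committing to it, one can count how many peaks each value of fpeak would retain on the raw, binarized matrix. This is an illustrative check, not part of the original workflow:

    # Illustrative only: peak retention at several candidate thresholds,
    # computed on a freshly loaded (unfiltered) copy of the raw data.
    raw = sc.read("./MB.h5ad")
    epi.pp.binarize(raw)
    cells_per_peak = np.asarray(raw.X.sum(axis=0)).ravel()
    for f in [0.01, 0.02, 0.04, 0.05]:
        kept = int((cells_per_peak >= np.ceil(f * raw.shape[0])).sum())
        print(f"fpeak={f}: {kept}/{raw.shape[1]} peaks retained")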

2. Model training

  • Next, initialize the dataloader and define the model architecture.

    batchind_dict=create_batchind_dict(adata,batch_name='batch')
    batchsize=min(128*len(batchind_dict),1024)
    dataloader=create_dataloader(adata,batch_size=batchsize,batchind_dict=batchind_dict,batch_name='batch',num_worker=4,droplast=False)
    # Define the encoder and decoder architectures. This example uses a three-layer MLP encoder.
    # ['fc', 1024, '', 'gelu'] denotes a fully connected layer with output dimension 1024 and GELU activation.
    enc=[['fc', 1024, '', 'gelu'],['fc', 256, '', 'gelu'],['fc', 16, '', '']]
    # Decoder: Simple single-layer architecture matching input dimension
    # Note: For complex cases (e.g., severe batch effects), consider deeper architectures like: dec = [['fc', 256, '', 'gelu'], ['fc', adata.X.shape[1], '', '']] to enhance the model's capacity to capture batch-specific variations.
    dec=[['fc', adata.X.shape[1], '', '']]
    
    early_stopping = None   # Early stopping is omitted here for brevity.
    device = 'cuda:0'       # Running on a GPU is recommended for better performance.
    # Instantiate the model with the encoder/decoder defined above.
    # (The constructor arguments shown here are an assumption; see the tutorial for the exact signature.)
    model = Fountain(enc=enc, dec=dec)
  • Train the Fountain model. The training process is divided into two phases:

    • Phase 1 (0 to mid_iteration): training without batch correction (VAE loss only).
    • Phase 2 (mid_iteration to max_iteration): training with batch correction (VAE loss + MSE loss).

    model.train(
        dataloader,
        lambda_mse=0.005,
        lambda_Eigenvalue=0.5,
        max_iteration=30000,
        mid_iteration=3000,
        early_stopping=early_stopping,
        device=device,
    )

3. Generating batch-corrected low-dimensional embeddings

  • After training, one can extract the batch-corrected embeddings as follows.

    # Get the latent embeddings and store them in adata.obsm
    emb='fountain'
    adata.obsm[emb]=model.get_latent(dataloader,device=device)
  • Visualize the results using UMAP:

    sc.pp.neighbors(adata, use_rep='fountain')
    sc.tl.umap(adata)
    sc.pl.umap(adata, color=['cell_type','batch'])
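
  • As noted in the input-format section, cell type labels can be used to evaluate the integration. A quick, illustrative check (not part of the original quick start) is a cell-type silhouette score on the Fountain embedding, reusing the sklearn dependency imported in step 1:

    from sklearn.metrics import silhouette_score

    # Higher values indicate better preservation of cell-type structure in the embedding.
    score = silhouette_score(adata.obsm['fountain'], adata.obs['cell_type'])
    print(f"Cell-type silhouette on the Fountain embedding: {score:.3f}")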

4. Generating batch-corrected and enhanced ATAC profiles in the original dimension

  • Fountain can also generate corrected and denoised ATAC profiles in the original feature space. For a detailed guide, see the tutorial. Here is a basic workflow:

    adata.layers['enhance']=model.enhance(adata,device=device,batch_name='batch')
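
    The enhanced profiles can then be treated like any other accessibility matrix. For instance (an illustrative follow-up, not part of the original workflow), one can wrap them in their own AnnData object and save them for downstream analysis:

    # Wrap the enhanced profiles with the original metadata and persist them.
    adata_enhanced = sc.AnnData(adata.layers['enhance'], obs=adata.obs.copy(), var=adata.var.copy())
    adata_enhanced.write("./MB_enhanced.h5ad")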

5. Using Fountain-enhanced ATAC profiles for data analysis

  • One can download a Fountain-enhanced example dataset from enhanced_MB.h5ad to explore Fountain's capabilities. This pre-processed dataset is ready for immediate analysis using our tutorial.

6. Achieving online integration

  • One can achieve online integration through the model.get_latent function. Please refer to the tutorial for more details.
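
    A minimal sketch of how this might look (assumptions: the hypothetical file new_batch.h5ad shares the training peak set after alignment, its batch labels are handled as during training, and the dataloader arguments mirror those in step 2; the tutorial is the authoritative reference):

    # Hypothetical new batch to integrate online, aligned to the training peaks.
    adata_new = sc.read("./new_batch.h5ad")
    epi.pp.binarize(adata_new)
    adata_new = adata_new[:, adata.var_names].copy()
    new_dict = create_batchind_dict(adata_new, batch_name='batch')
    new_loader = create_dataloader(adata_new, batch_size=batchsize, batchind_dict=new_dict,
                                   batch_name='batch', num_worker=4, droplast=False)
    # Project the new cells into the existing Fountain latent space without retraining.
    adata_new.obsm['fountain'] = model.get_latent(new_loader, device=device)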

7. Extending Fountain to scRNA-seq batch correction

  • While originally developed for scATAC-seq data, Fountain's flexible architecture also makes it effective for other omics, such as scRNA-seq data. We provide an example tutorial on applying Fountain to scRNA-seq data.
