A genomic language model to identify chimera artifacts introduced by whole genome amplification (WGA).
ChimeraLM is a deep learning-powered genomic language model that detects artificial chimeric reads arising from whole genome amplification (WGA) processes. Built with PyTorch Lightning and optimized for modern GPUs, it provides fast and accurate identification of chimeric artifacts in BAM files.
- High Accuracy: Deep learning model trained on real WGA data
- GPU Accelerated: Optimized for CUDA, MPS (Apple Silicon), and CPU
- Easy to Use: Simple CLI with sensible defaults
- Fast Processing: Batch inference with configurable parallelism
- Web Interface: Interactive web UI for visualization and analysis
- Production Ready: Includes filtering, sorting, and indexing of BAM files
```bash
pip install chimeralm
```
```bash
# Clone the repository
git clone https://github.com/ylab-hi/ChimeraLM.git
cd ChimeraLM

# Install in development mode with uv
uv sync
uv run chimeralm --version
```
Detect chimeric reads in your BAM file:
```bash
# Install ChimeraLM
pip install chimeralm

# Predict chimeric reads (CPU)
chimeralm predict your_data.bam

# Predict with GPU acceleration
chimeralm predict your_data.bam --gpus 1 --batch-size 24
```
ChimeraLM provides a command-line interface with three main commands:

- `predict`: Detect chimeric reads using pre-trained models
- `filter`: Filter BAM files based on predictions
- `web`: Launch an interactive web interface

```bash
chimeralm [OPTIONS] COMMAND [ARGS]...
```
Predict chimeric reads in a BAM file using the pre-trained ChimeraLM model.
```bash
chimeralm predict [OPTIONS] DATA_PATH
```

Arguments:

- `DATA_PATH`: Path to the input BAM file

Options:

- `-g, --gpus INTEGER`: Number of GPUs to use (default: 0)
- `-o, --output PATH`: Output path for predictions (default: `{input}.predictions`)
- `-b, --batch-size INTEGER`: Batch size for processing (default: 12)
- `-w, --workers INTEGER`: Number of worker threads (default: 0)
- `-v, --verbose`: Enable verbose output
- `-m, --max-sample INTEGER`: Maximum number of samples to process
- `-l, --limit-batches INTEGER`: Limit the number of prediction batches
- `-p, --progress-bar`: Show a progress bar
- `--random-seed`: Make prediction non-deterministic
Examples:

```bash
# Basic prediction on CPU
chimeralm predict input.bam

# Prediction with GPU acceleration
chimeralm predict input.bam --gpus 1 --batch-size 24

# Prediction with custom output path and progress bar
chimeralm predict input.bam --output results/ --progress-bar --verbose
```
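When driving `chimeralm predict` from a pipeline script, it can help to assemble the command programmatically. The sketch below is a hypothetical helper (`build_predict_cmd` is not part of ChimeraLM); it only builds the argument list from the options documented above.

```python
import subprocess

def build_predict_cmd(bam_path, gpus=0, batch_size=12, output=None, progress_bar=False):
    """Assemble a `chimeralm predict` invocation from the documented options.

    This helper is illustrative only; flag names mirror the CLI reference above.
    """
    cmd = [
        "chimeralm", "predict", str(bam_path),
        "--gpus", str(gpus),
        "--batch-size", str(batch_size),
    ]
    if output is not None:
        cmd += ["--output", str(output)]
    if progress_bar:
        cmd.append("--progress-bar")
    return cmd

cmd = build_predict_cmd("sample.bam", gpus=1, batch_size=24, progress_bar=True)
print(" ".join(cmd))
# To actually run it: subprocess.run(cmd, check=True)
```

Keeping the command as a list (rather than a shell string) avoids quoting issues with paths that contain spaces.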
Filter BAM files based on prediction results.
```bash
chimeralm filter [OPTIONS] INPUT_BAM PREDICTIONS_DIR
```

Arguments:

- `INPUT_BAM`: Path to the input BAM file
- `PREDICTIONS_DIR`: Directory containing prediction results

Options:

- `-o, --output-prediction PATH`: Output path for the filtered BAM file

Example:

```bash
chimeralm filter input.bam predictions/ --output-prediction filtered.bam
```
Launch an interactive web interface for visualizing and analyzing chimeric reads.
```bash
chimeralm web
```
This command starts a local web server that provides:
- Interactive visualization of predictions
- Analysis dashboards and metrics
- Easy-to-use interface for non-technical users
- GPU Usage: Use `--gpus 1` for faster processing if CUDA is available
- Batch Size: Increase `--batch-size` for better GPU utilization (e.g., 24-32)
- Memory: Monitor memory usage with large batch sizes
- Threading: Adjust `--workers` based on your system's CPU cores
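One simple way to pick a `--workers` value is to derive it from the machine's core count. The heuristic below (`suggested_workers`) is my own sketch, not part of ChimeraLM:

```python
import os

def suggested_workers(reserve=1):
    """Hypothetical heuristic: leave `reserve` cores free for the main
    process and I/O, never going below 0 (the CLI's default)."""
    cores = os.cpu_count() or 1
    return max(0, cores - reserve)

print(suggested_workers())
```

On a 4-core machine this suggests `--workers 3`; benchmark on your own data, since the optimal value depends on storage speed and batch size.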
Prediction outputs:

- `{output_dir}/predictions.txt`: Tab-separated file with read names and predicted labels (0 = biological, 1 = chimeric)
- `{input}.filtered.sorted.bam`: BAM file with chimeric reads removed (auto-generated by the `filter` command)
- `{input}.filtered.sorted.bam.bai`: BAM index file
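If you want to post-process the predictions yourself rather than use `chimeralm filter`, the tab-separated layout described above is easy to parse. A minimal sketch, assuming each line is `read_name<TAB>label` with `1` marking chimeric reads (the helper and sample data are illustrative, not from ChimeraLM):

```python
import tempfile

def chimeric_read_names(predictions_path):
    """Collect names of reads the model labeled chimeric (label == '1')."""
    chimeric = set()
    with open(predictions_path) as fh:
        for line in fh:
            name, label = line.rstrip("\n").split("\t")
            if label == "1":
                chimeric.add(name)
    return chimeric

# Hypothetical predictions.txt content for demonstration
sample = "read_001\t0\nread_002\t1\nread_003\t0\n"
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as fh:
    fh.write(sample)

print(chimeric_read_names(fh.name))  # → {'read_002'}
```

The resulting set can then be used with a BAM library such as pysam to drop those reads from the original alignment file.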
Common Issues:

- CUDA out of memory: Reduce `--batch-size` or use CPU mode
- Slow processing: Enable GPU acceleration with `--gpus 1`
- Missing dependencies: Run `uv sync` to install all dependencies

Debug Mode:

Use the `--verbose` flag to get detailed logging information about the prediction process.
```bash
# Show version
chimeralm --version

# General help
chimeralm --help

# Command-specific help
chimeralm predict --help
```
If you use ChimeraLM in your research, please cite:
```bibtex
@software{chimeralm2025,
  title={ChimeraLM: A genomic language model to identify chimera artifacts},
  author={Li, Yangyang and Guo, Qingxiang and Yang, Rendong},
  year={2025},
  url={https://github.com/ylab-hi/ChimeraLM}
}
```
This project is licensed under the Apache License - see the LICENSE file for details.
Contributions are welcome! Please feel free to submit a Pull Request.