Deep‑learning sequence aligner for immunoglobulin & T‑cell receptor repertoires
Step 1: Pull the Docker image
docker pull thomask90/alignair:latest

Step 2: Start container with your data volumes
# Mount your input data and output directories
docker run -it --rm \
-v /path/to/your/input/data:/data \
-v /path/to/your/output/downloads:/downloads \
thomask90/alignair:latest

Step 3: Run AlignAIR inside the container
# Example: Heavy Chain analysis with extended model
python app.py run \
--model-checkpoint=/app/pretrained_models/IGH_S5F_576_EXTENDED \
--genairr-dataconfig=HUMAN_IGH_EXTENDED \
--sequences=/data/your_sequences.csv \
--save-path=/downloads/
AlignAIR v2.0 introduces a unified architecture:
- SingleChainAlignAIR: Optimized for single receptor type analysis
- MultiChainAlignAIR: Native multi-chain support with chain type classification
- Universal compatibility: Works with any GenAIRR dataconfig combination
- Mixed receptor processing: Analyze IGK + IGL light chains simultaneously
- Chain type classification: Automatic receptor type identification
- Optimized batch processing: Equal partitioning across chain types
- Built-in dataconfigs: `HUMAN_IGH_OGRDB`, `HUMAN_IGK_OGRDB`, `HUMAN_IGL_OGRDB`, `HUMAN_TCRB_IMGT`
- Custom config support: Use your own GenAIRR dataconfigs
- Automatic detection: Single vs. multi-chain mode based on input
- Streamlined architecture: Single codebase for all receptor types
- Memory optimization: Efficient processing for large datasets
- GPU acceleration: Optimized tensor operations
- State‑of‑the‑art accuracy for V, D, J allele calling and junction segmentation
- Unified multi‑chain architecture supporting any chain combinations with dynamic GenAIRR integration
- Multi‑task deep network jointly optimises alignment, productivity, indel detection, and chain type classification
- Scales to millions of AIRR‑seq reads with GPU support
- Universal model architecture that adapts to single-chain or multi-chain scenarios
- Dynamic data configuration with built-in GenAIRR dataconfigs for major species and receptors
- Drop‑in integration with AIRR schema & downstream tools
# Pull the latest image
docker pull thomask90/alignair:latest
# Start interactive container (mount local data to /data)
docker run -it --rm -v /path/to/local/data:/data thomask90/alignair:latest

Prerequisites: NVIDIA GPU + CUDA 11 recommended (CPU works, but slower).
git clone https://github.com/MuteJester/AlignAIR.git
cd AlignAIR && pip install -e ./

- Note: the local installation ships without pretrained model weights. It is intended for custom model and pipeline development, testing, and debugging, and is recommended mainly for developers, contributors, and advanced users.
python app.py run \
--model-checkpoint=/app/pretrained_models/IGH_S5F_576 \
--genairr-dataconfig=HUMAN_IGH_OGRDB \
--sequences=/data/input/sequences.csv \
--save-path=/data/output

Heavy Chain Analysis (Extended Model):
python app.py run \
--model-checkpoint=/app/pretrained_models/IGH_S5F_576_EXTENDED \
--genairr-dataconfig=HUMAN_IGH_EXTENDED \
--sequences=/data/heavy_sequences.csv \
--save-path=/downloads/ \
--v-allele-threshold=0.75 \
--d-allele-threshold=0.3 \
--j-allele-threshold=0.8

Light Chain Multi-Chain Analysis (Lambda + Kappa with Chain Type Prediction):
python app.py run \
--model-checkpoint=/app/pretrained_models/IGL_S5F_576 \
--genairr-dataconfig=HUMAN_IGL_OGRDB,HUMAN_IGK_OGRDB \
--sequences=/data/mixed_light_sequences.csv \
--save-path=/downloads/ \
--airr-format

Output includes: chain type prediction in the `chain_type` column (Lambda or Kappa)
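As a quick sanity check on a mixed light-chain run, the `chain_type` column can be tallied after the run finishes. The sketch below is illustrative only: it assumes the output is a delimited text file with a header row, and the in-memory sample stands in for a real result file, whose name and full column set are not specified here. The helper name `count_chain_types` is hypothetical.

```python
import csv
import io
from collections import Counter


def count_chain_types(handle, column="chain_type", delimiter=","):
    """Tally the values of the chain type column in a delimited output file.

    Assumption: the file has a header row that contains `column`.
    Pass delimiter="\t" if the output is tab-separated.
    """
    reader = csv.DictReader(handle, delimiter=delimiter)
    if not reader.fieldnames or column not in reader.fieldnames:
        raise ValueError(f"column {column!r} not found in output header")
    return Counter(row[column] for row in reader)


# Illustrative stand-in for an output file; real output has more columns.
sample = io.StringIO(
    "sequence,chain_type\n"
    "ACGT,Lambda\n"
    "TTGA,Kappa\n"
    "GGCA,Lambda\n"
)
counts = count_chain_types(sample)
print(dict(counts))
```

With a real result file, replace the `io.StringIO` sample with `open(path, newline="")` on a file in the mounted output directory.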
Single Light Chain Analysis:
# Lambda only (using multi-chain model)
python app.py run \
--model-checkpoint=/app/pretrained_models/IGL_S5F_576 \
--genairr-dataconfig=HUMAN_IGL_OGRDB,HUMAN_IGK_OGRDB \
--sequences=/data/lambda_sequences.csv \
--save-path=/downloads/ \
--airr-format \
--fix-orientation
# Kappa only (using multi-chain model)
python app.py run \
--model-checkpoint=/app/pretrained_models/IGL_S5F_576 \
--genairr-dataconfig=HUMAN_IGL_OGRDB,HUMAN_IGK_OGRDB \
--sequences=/data/kappa_sequences.csv \
--save-path=/downloads/ \
--airr-format \
--fix-orientation

Note: The MultiChainAlignAIR model requires both dataconfigs but will predict the correct chain type for each sequence.
T-Cell Receptor Beta Chain:
python app.py run \
--model-checkpoint=/app/pretrained_models/TCRB_Uniform_576 \
--genairr-dataconfig=HUMAN_TCRB_IMGT \
--sequences=/data/tcr_sequences.csv \
--save-path=/downloads/

AlignAIR v2.0 introduces a unified architecture that dynamically adapts to different chain types and configurations using GenAIRR dataconfigs:
| Architecture | Use Case | DataConfig Support | Multi-Chain |
|---|---|---|---|
| SingleChainAlignAIR | Single receptor type analysis | Single GenAIRR dataconfig | No |
| MultiChainAlignAIR | Mixed receptor analysis | Multiple GenAIRR dataconfigs | Yes |

| DataConfig | Chain Type | Species | Reference | D Gene | Model Compatibility |
|---|---|---|---|---|---|
| `HUMAN_IGH_OGRDB` | Heavy Chain | Human | OGRDB | ✓ | IGH_S5F_576 |
| `HUMAN_IGH_EXTENDED` | Heavy Chain Extended | Human | OGRDB + Custom | ✓ | IGH_S5F_576_EXTENDED |
| `HUMAN_IGK_OGRDB` | Kappa Light | Human | OGRDB | ✗ | IGL_S5F_576 (multi-chain) |
| `HUMAN_IGL_OGRDB` | Lambda Light | Human | OGRDB | ✗ | IGL_S5F_576 (multi-chain only) |
| `HUMAN_TCRB_IMGT` | TCR Beta | Human | IMGT | ✓ | TCRB_Uniform_576 |

The Docker container ships with optimized models for common use cases:
| Model | Architecture | Supported Configs | Checkpoint Path | Use Case |
|---|---|---|---|---|
| Heavy Chain Extended | SingleChainAlignAIR | `HUMAN_IGH_EXTENDED` | `/app/pretrained_models/IGH_S5F_576_EXTENDED` | Enhanced heavy chain with extended allele coverage |
| Heavy Chain Standard | SingleChainAlignAIR | `HUMAN_IGH_OGRDB` | `/app/pretrained_models/IGH_S5F_576` | Standard heavy chain analysis |
| Multi-Light | MultiChainAlignAIR | `HUMAN_IGL_OGRDB,HUMAN_IGK_OGRDB` | `/app/pretrained_models/IGL_S5F_576` | Lambda + Kappa analysis with chain type prediction |
| TCR Beta | SingleChainAlignAIR | `HUMAN_TCRB_IMGT` | `/app/pretrained_models/TCRB_Uniform_576` | T-cell receptor beta chain |

Note: The Multi-Light model (`IGL_S5F_576`) is a MultiChainAlignAIR instance that requires both Lambda and Kappa dataconfigs and always outputs chain type predictions.
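The pairing of checkpoints and dataconfigs in the tables above can be captured in a small lookup that refuses mismatched or reordered dataconfigs before a run starts. This is a minimal sketch, not part of AlignAIR: `PRETRAINED` and `build_run_command` are hypothetical names, and only flags that appear in the examples in this README are emitted.

```python
# Checkpoint path -> dataconfigs in the documented order (copied from the tables).
PRETRAINED = {
    "/app/pretrained_models/IGH_S5F_576_EXTENDED": ["HUMAN_IGH_EXTENDED"],
    "/app/pretrained_models/IGH_S5F_576": ["HUMAN_IGH_OGRDB"],
    "/app/pretrained_models/IGL_S5F_576": ["HUMAN_IGL_OGRDB", "HUMAN_IGK_OGRDB"],
    "/app/pretrained_models/TCRB_Uniform_576": ["HUMAN_TCRB_IMGT"],
}


def build_run_command(checkpoint, dataconfigs, sequences, save_path):
    """Return the argument list for `python app.py run`.

    For the bundled checkpoints, the dataconfig list must equal the
    documented list, including its order. Unknown (custom) checkpoints
    are passed through without checks.
    """
    dataconfigs = list(dataconfigs)
    expected = PRETRAINED.get(checkpoint)
    if expected is not None and dataconfigs != expected:
        raise ValueError(
            f"{checkpoint} expects dataconfigs {expected}, got {dataconfigs}"
        )
    return [
        "python", "app.py", "run",
        f"--model-checkpoint={checkpoint}",
        f"--genairr-dataconfig={','.join(dataconfigs)}",
        f"--sequences={sequences}",
        f"--save-path={save_path}",
    ]


cmd = build_run_command(
    "/app/pretrained_models/IGL_S5F_576",
    ["HUMAN_IGL_OGRDB", "HUMAN_IGK_OGRDB"],
    "/data/mixed_light_sequences.csv",
    "/downloads/",
)
print(" ".join(cmd))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` inside the container; building a list rather than a shell string avoids quoting problems with paths that contain spaces.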
You can use custom GenAIRR dataconfigs by providing a path to a pickled DataConfig object:
python app.py run \
--model-checkpoint=path/to/custom/model \
--genairr-dataconfig=/path/to/custom_dataconfig.pkl \
--sequences=input.csv \
--save-path=output/

For multi-chain custom configs:
python app.py run \
--model-checkpoint=path/to/multichain/model \
--genairr-dataconfig=/path/to/config1.pkl,/path/to/config2.pkl \
--sequences=input.csv \
--save-path=output/

1. Pull the latest AlignAIR image
docker pull thomask90/alignair:latest

2. Prepare your data
- Ensure your input sequences are in CSV format with a `sequence` column
- Create directories for input data and output results

3. Start the container with volume mounts
# Windows example:
docker run -it --rm \
-v C:/path/to/your/data:/data \
-v C:/path/to/your/downloads:/downloads \
thomask90/alignair:latest
# Linux/Mac example:
docker run -it --rm \
-v /path/to/your/data:/data \
-v /path/to/your/downloads:/downloads \
thomask90/alignair:latest

4. Run AlignAIR with the appropriate model
python app.py run \
--model-checkpoint=/app/pretrained_models/IGH_S5F_576_EXTENDED \
--genairr-dataconfig=HUMAN_IGH_EXTENDED \
--sequences=/data/sample_HeavyChain_dataset.csv \
--save-path=/downloads/

python app.py run \
--model-checkpoint=/app/pretrained_models/IGL_S5F_576 \
--genairr-dataconfig=HUMAN_IGL_OGRDB,HUMAN_IGK_OGRDB \
--sequences=/data/sample_LightChain_dataset.csv \
--save-path=/downloads/

Important: This MultiChainAlignAIR model predicts both Lambda and Kappa chains. The output includes a `chain_type` column indicating the predicted chain type for each sequence. The order of dataconfigs (Lambda first, then Kappa) must match the training order.
# Lambda or Kappa only (using multi-chain model with both dataconfigs)
python app.py run \
--model-checkpoint=/app/pretrained_models/IGL_S5F_576 \
--genairr-dataconfig=HUMAN_IGL_OGRDB,HUMAN_IGK_OGRDB \
--sequences=/data/light_chain_sequences.csv \
--save-path=/downloads/

Note: Even for single chain type analysis, the MultiChainAlignAIR model requires both dataconfigs but will correctly predict and classify each sequence's chain type.
python app.py run \
--model-checkpoint=/app/pretrained_models/TCRB_Uniform_576 \
--genairr-dataconfig=HUMAN_TCRB_IMGT \
--sequences=/data/tcr_sequences.csv \
--save-path=/downloads/

- Always use the same GenAIRR dataconfig during prediction as was used during model training
- Never use modified dataconfigs with pre-trained models
- For multi-chain models: The order of dataconfigs must match the training order exactly
- Custom dataconfigs: Provide the path to your pickled DataConfig object instead of built-in names
python app.py run \
--model-checkpoint=/path/to/your/custom/model \
--genairr-dataconfig=/data/your_custom_dataconfig.pkl \
--sequences=/data/sequences.csv \
--save-path=/downloads/

5. Check results
Your results will be saved in the mounted /downloads directory and can be accessed from your host system.
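Since the workflow above expects a CSV input with a `sequence` column, a short pre-flight check on the host can catch a malformed file before the container is started. The sketch below is an illustration, not part of AlignAIR; the helper name `find_empty_sequences` is hypothetical, and it assumes one record per physical line (no quoted multi-line fields).

```python
import csv
import io


def find_empty_sequences(handle, column="sequence"):
    """Return the file line numbers of rows whose sequence field is empty.

    Raises ValueError if the header has no `column`. Line numbers count
    the header as line 1, so the first data row is line 2.
    """
    reader = csv.DictReader(handle)
    if not reader.fieldnames or column not in reader.fieldnames:
        raise ValueError(f"input has no {column!r} column")
    return [
        line_no
        for line_no, row in enumerate(reader, start=2)
        if not (row[column] or "").strip()
    ]


# Illustrative input: the second data row has an empty sequence.
sample = io.StringIO("id,sequence\n1,ACGT\n2,\n3,TTGA\n")
empty_rows = find_empty_sequences(sample)
print(empty_rows)
```

For a real file, call it as `find_empty_sequences(open(path, newline=""))` and fix or drop the reported rows before mounting the directory.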

| Parameter | Description | Default |
|---|---|---|
| `--model-checkpoint` | Path to model weights | Required |
| `--chain-type` | Specify heavy, light, or tcrb | Required |
| `--sequences` | Input file path (CSV/TSV/FASTA) | Required |
| `--save-path` | Output directory | Required |

| Parameter | Description | Default |
|---|---|---|
| `--max-input-size` | Maximum input window size | 576 |
| `--batch-size` | Sequences per batch | 2048 |
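
To get a feel for what the default batch size means for a large repertoire, the number of batches is simply the read count divided by the batch size, rounded up. The sketch below assumes the final partial batch is processed as one smaller batch, which is not stated in this README; `num_batches` is a hypothetical helper.

```python
def num_batches(n_sequences, batch_size=2048):
    """Number of batches needed for n_sequences, rounding up.

    Assumption: a trailing partial batch counts as one batch.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    if n_sequences < 0:
        raise ValueError("n_sequences must be non-negative")
    # Ceiling division using integers only, to avoid float rounding.
    return -(-n_sequences // batch_size)


print(num_batches(1_000_000))        # one million reads at the default 2048
print(num_batches(1_000_000, 512))   # the same reads with a smaller batch
```

One million reads at the default size come to 489 batches. A smaller `--batch-size` lowers peak memory per batch at the cost of more batches.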

| Parameter | Description | Default |
|---|---|---|
| `--v-allele-threshold` | V allele calling threshold | 0.75 |
| `--d-allele-threshold` | D allele calling threshold | 0.30 |
| `--j-allele-threshold` | J allele calling threshold | 0.80 |
| `--v-cap` / `--d-cap` / `--j-cap` | Maximum calls per segment | 3 |
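
The interaction between a threshold and a cap can be illustrated with a small selection routine. This is one plausible reading, stated as an assumption rather than AlignAIR's confirmed behavior: an allele is kept when its likelihood is at least `threshold` times the highest likelihood for that segment, and at most `cap` alleles are reported. The function name `select_alleles` and the likelihood values are illustrative.

```python
def select_alleles(likelihoods, threshold, cap=3):
    """Pick allele calls from a {allele_name: likelihood} mapping.

    Assumed rule (not confirmed by the README): keep alleles whose
    likelihood is >= threshold * max likelihood, best first, then
    truncate the list to at most `cap` entries.
    """
    if not likelihoods:
        return []
    ranked = sorted(likelihoods.items(), key=lambda kv: kv[1], reverse=True)
    cutoff = threshold * ranked[0][1]
    kept = [name for name, score in ranked if score >= cutoff]
    return kept[:cap]


# Illustrative V-segment likelihoods for one read.
v_scores = {"IGHV1-2*02": 0.90, "IGHV1-2*04": 0.70, "IGHV3-23*01": 0.10}
v_calls = select_alleles(v_scores, threshold=0.75, cap=3)
print(v_calls)
```

Under this reading, a higher threshold yields fewer, more confident calls per read, while the cap bounds the output when many alleles score similarly.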

| Parameter | Description | Default |
|---|---|---|
| `--airr-format` | Output full AIRR Schema | false |
| `--fix-orientation` | Auto-correct orientations | true |
| `--translate-to-asc` | Output ASC allele names | false |

For complete parameter list: python app.py run --help
See the examples/ folder for Jupyter notebooks:
- End‑to‑end heavy‑chain pipeline
- Benchmark vs. IgBLAST on 10 K reads
- Batch processing workflows
Training & benchmark datasets are archived on Zenodo: doi:10.5281/zenodo.XXXXXXXX
For comprehensive documentation, examples, and technical details, visit: https://alignair.ai/docs
Pull requests are welcome! Please:
- Run `pre-commit run --all-files`
- Ensure `pytest` passes
- Update `CHANGELOG.md`
See CONTRIBUTING.md for full guidelines.
This project is licensed under the terms of the GNU General Public License v3.0 or later (GPLv3+).
Open an issue or email [email protected].
For announcements, visit https://alignair.ai or join our Slack.