AlignAIR

Deep‑learning sequence aligner for immunoglobulin & T‑cell receptor repertoires

Quick Start

Step 1: Pull the Docker image

docker pull thomask90/alignair:latest

Step 2: Start container with your data volumes

# Mount your input data and output directories
docker run -it --rm \
  -v /path/to/your/input/data:/data \
  -v /path/to/your/output/downloads:/downloads \
  thomask90/alignair:latest

Step 3: Run AlignAIR inside the container

# Example: Heavy Chain analysis with extended model
python app.py run \
  --model-checkpoint=/app/pretrained_models/IGH_S5F_576_EXTENDED \
  --genairr-dataconfig=HUMAN_IGH_EXTENDED \
  --sequences=/data/your_sequences.csv \
  --save-path=/downloads/

What's New in v2.0

AlignAIR v2.0 introduces a unified architecture:

Unified Models

SingleChainAlignAIR: Optimized for single receptor type analysis
MultiChainAlignAIR: Native multi-chain support with chain type classification
Universal compatibility: Works with any GenAIRR dataconfig combination

Multi-Chain Analysis

Mixed receptor processing: Analyze IGK + IGL light chains simultaneously
Chain type classification: Automatic receptor type identification
Optimized batch processing: Equal partitioning across chain types
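
As a rough illustration of what "equal partitioning across chain types" could mean, the sketch below splits a batch as evenly as possible between chain types. The helper name `partition_batch` is hypothetical and is not part of AlignAIR; the actual partitioning logic may differ.

```python
def partition_batch(batch_size: int, chain_types: list[str]) -> dict[str, int]:
    """Split batch_size as evenly as possible across chain types (hypothetical helper)."""
    if not chain_types:
        raise ValueError("at least one chain type is required")
    base, remainder = divmod(batch_size, len(chain_types))
    # The first `remainder` chain types each receive one extra sequence.
    return {
        chain: base + (1 if index < remainder else 0)
        for index, chain in enumerate(chain_types)
    }


shares = partition_batch(2048, ["IGK", "IGL"])
print(shares)
```

With an odd batch size the shares differ by at most one, so no chain type dominates a batch.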

Dynamic GenAIRR Integration

Built-in dataconfigs: HUMAN_IGH_OGRDB, HUMAN_IGK_OGRDB, HUMAN_IGL_OGRDB, HUMAN_TCRB_IMGT
Custom config support: Use your own GenAIRR dataconfigs
Automatic detection: Single vs. multi-chain mode based on input
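
The single vs. multi-chain decision can be pictured as counting the entries in the comma-separated `--genairr-dataconfig` value. The following is a minimal sketch under that assumption; `parse_dataconfig_arg` and `detect_mode` are illustrative names, not AlignAIR functions.

```python
def parse_dataconfig_arg(value: str) -> list[str]:
    """Split a comma-separated dataconfig argument into its entries."""
    entries = [part.strip() for part in value.split(",") if part.strip()]
    if not entries:
        raise ValueError("no dataconfig given")
    return entries


def detect_mode(value: str) -> str:
    """Return 'multi-chain' when more than one dataconfig is listed."""
    return "multi-chain" if len(parse_dataconfig_arg(value)) > 1 else "single-chain"


print(detect_mode("HUMAN_IGH_OGRDB"))
print(detect_mode("HUMAN_IGL_OGRDB,HUMAN_IGK_OGRDB"))
```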

Enhanced Performance

Streamlined architecture: Single codebase for all receptor types
Memory optimization: Efficient processing for large datasets
GPU acceleration: Optimized tensor operations

Key features

State‑of‑the‑art accuracy for V, D, J allele calling and junction segmentation
Unified multi‑chain architecture supporting any chain combination with dynamic GenAIRR integration
Multi‑task deep network jointly optimises alignment, productivity, indel detection, and chain type classification
Scales to millions of AIRR‑seq reads with GPU support
Universal model architecture that adapts to single-chain or multi-chain scenarios
Dynamic data configuration with built-in GenAIRR dataconfigs for major species and receptors
Drop‑in integration with AIRR schema & downstream tools

Installation

Docker (recommended)

# Pull the latest image
docker pull thomask90/alignair:latest

# Start interactive container (mount local data to /data)
docker run -it --rm -v /path/to/local/data:/data thomask90/alignair:latest

Prerequisites: an NVIDIA GPU with CUDA 11 is recommended (CPU also works, but is slower).

Local (advanced)

git clone https://github.com/MuteJester/AlignAIR.git
cd AlignAIR && pip install -e ./

Note that the local installation ships without pretrained model weights. It is intended for custom model and pipeline development, testing, and debugging, and is recommended mainly for developers, contributors, and advanced users.

Usage

Basic Usage

python app.py run \
    --model-checkpoint=/app/pretrained_models/IGH_S5F_576 \
    --genairr-dataconfig=HUMAN_IGH_OGRDB \
    --sequences=/data/input/sequences.csv \
    --save-path=/data/output

Example Commands

Heavy Chain Analysis (Extended Model):

python app.py run \
  --model-checkpoint=/app/pretrained_models/IGH_S5F_576_EXTENDED \
  --genairr-dataconfig=HUMAN_IGH_EXTENDED \
  --sequences=/data/heavy_sequences.csv \
  --save-path=/downloads/ \
  --v-allele-threshold=0.75 \
  --d-allele-threshold=0.3 \
  --j-allele-threshold=0.8

Light Chain Multi-Chain Analysis (Lambda + Kappa with Chain Type Prediction):

python app.py run \
  --model-checkpoint=/app/pretrained_models/IGL_S5F_576 \
  --genairr-dataconfig=HUMAN_IGL_OGRDB,HUMAN_IGK_OGRDB \
  --sequences=/data/mixed_light_sequences.csv \
  --save-path=/downloads/ \
  --airr-format

The output includes a chain_type column with the predicted chain type (Lambda or Kappa) for each sequence.
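
Once a run has finished, the chain_type column can be summarised with a few lines of Python. This is a minimal sketch that assumes the output is a CSV file; the exact label strings (`Lambda`, `Kappa`) and the sample rows are assumptions made for illustration.

```python
import csv
import io
from collections import Counter


def count_chain_types(csv_text: str) -> Counter:
    """Count predicted chain types in CSV text that has a chain_type column."""
    reader = csv.DictReader(io.StringIO(csv_text))
    if reader.fieldnames is None or "chain_type" not in reader.fieldnames:
        raise KeyError("expected a 'chain_type' column")
    return Counter(row["chain_type"] for row in reader)


# Made-up rows standing in for a real AlignAIR output file.
sample = "sequence,chain_type\nACGT,Lambda\nTTGA,Kappa\nGGCA,Lambda\n"
counts = count_chain_types(sample)
print(counts)
```

For a real file, read its contents with `Path(...).read_text()` and pass them to `count_chain_types`.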

Single Light Chain Analysis:

# Lambda only (using multi-chain model)
python app.py run \
  --model-checkpoint=/app/pretrained_models/IGL_S5F_576 \
  --genairr-dataconfig=HUMAN_IGL_OGRDB,HUMAN_IGK_OGRDB \
  --sequences=/data/lambda_sequences.csv \
  --save-path=/downloads/ \
  --airr-format \
  --fix-orientation

# Kappa only (using multi-chain model)  
python app.py run \
  --model-checkpoint=/app/pretrained_models/IGL_S5F_576 \
  --genairr-dataconfig=HUMAN_IGL_OGRDB,HUMAN_IGK_OGRDB \
  --sequences=/data/kappa_sequences.csv \
  --save-path=/downloads/ \
  --airr-format \
  --fix-orientation

Note: The MultiChainAlignAIR model requires both dataconfigs even when the input contains only one chain type; it still predicts the chain type of each sequence.

T-Cell Receptor Beta Chain:

python app.py run \
  --model-checkpoint=/app/pretrained_models/TCRB_Uniform_576 \
  --genairr-dataconfig=HUMAN_TCRB_IMGT \
  --sequences=/data/tcr_sequences.csv \
  --save-path=/downloads/
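
When the commands above are launched from a script, it can help to assemble the argument list programmatically. The sketch below uses only flags that appear in this README; the helper `build_alignair_command` is a hypothetical wrapper, not part of AlignAIR.

```python
import shlex


def build_alignair_command(
    model_checkpoint: str,
    dataconfig: str,
    sequences: str,
    save_path: str,
    airr_format: bool = False,
) -> list[str]:
    """Assemble the argv list for `python app.py run` (hypothetical wrapper)."""
    argv = [
        "python", "app.py", "run",
        f"--model-checkpoint={model_checkpoint}",
        f"--genairr-dataconfig={dataconfig}",
        f"--sequences={sequences}",
        f"--save-path={save_path}",
    ]
    if airr_format:
        argv.append("--airr-format")
    return argv


argv = build_alignair_command(
    "/app/pretrained_models/TCRB_Uniform_576",
    "HUMAN_TCRB_IMGT",
    "/data/tcr_sequences.csv",
    "/downloads/",
)
print(shlex.join(argv))
```

Inside the container, the list can be passed to `subprocess.run(argv, check=True)`.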

Available Models and Configurations

AlignAIR v2.0 introduces a unified architecture that dynamically adapts to different chain types and configurations using GenAIRR dataconfigs:

Model Architecture Types

Architecture	Use Case	DataConfig Support	Multi-Chain
SingleChainAlignAIR	Single receptor type analysis	Single GenAIRR dataconfig	No
MultiChainAlignAIR	Mixed receptor analysis	Multiple GenAIRR dataconfigs	Yes

Built-in GenAIRR DataConfigs

DataConfig	Chain Type	Species	Reference	D Gene	Model Compatibility
`HUMAN_IGH_OGRDB`	Heavy Chain	Human	OGRDB	✓	IGH_S5F_576
`HUMAN_IGH_EXTENDED`	Heavy Chain Extended	Human	OGRDB + Custom	✓	IGH_S5F_576_EXTENDED
`HUMAN_IGK_OGRDB`	Kappa Light	Human	OGRDB	✗	IGL_S5F_576 (multi-chain only)
`HUMAN_IGL_OGRDB`	Lambda Light	Human	OGRDB	✗	IGL_S5F_576 (multi-chain only)
`HUMAN_TCRB_IMGT`	TCR Beta	Human	IMGT	✓	TCRB_Uniform_576

Pre-trained Model Checkpoints

The Docker container ships with optimized models for common use cases:

Model	Architecture	Supported Configs	Checkpoint Path	Use Case
Heavy Chain Extended	SingleChainAlignAIR	`HUMAN_IGH_EXTENDED`	`/app/pretrained_models/IGH_S5F_576_EXTENDED`	Enhanced heavy chain with extended allele coverage
Heavy Chain Standard	SingleChainAlignAIR	`HUMAN_IGH_OGRDB`	`/app/pretrained_models/IGH_S5F_576`	Standard heavy chain analysis
Multi-Light	MultiChainAlignAIR	`HUMAN_IGL_OGRDB,HUMAN_IGK_OGRDB`	`/app/pretrained_models/IGL_S5F_576`	Lambda + Kappa analysis with chain type prediction
TCR Beta	SingleChainAlignAIR	`HUMAN_TCRB_IMGT`	`/app/pretrained_models/TCRB_Uniform_576`	T-cell receptor beta chain

Note: The Multi-Light model (IGL_S5F_576) is a MultiChainAlignAIR instance that requires both Lambda and Kappa dataconfigs and always outputs chain type predictions.
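
The checkpoint table can be encoded as a small lookup so that a script always pairs a pretrained model with its matching dataconfig string. This is a sketch built from the table above; `PRETRAINED` and `expected_dataconfig` are illustrative names only.

```python
# Checkpoint path -> dataconfigs, in the order listed in the table above.
PRETRAINED = {
    "/app/pretrained_models/IGH_S5F_576_EXTENDED": ("HUMAN_IGH_EXTENDED",),
    "/app/pretrained_models/IGH_S5F_576": ("HUMAN_IGH_OGRDB",),
    "/app/pretrained_models/IGL_S5F_576": ("HUMAN_IGL_OGRDB", "HUMAN_IGK_OGRDB"),
    "/app/pretrained_models/TCRB_Uniform_576": ("HUMAN_TCRB_IMGT",),
}


def expected_dataconfig(checkpoint: str) -> str:
    """Return the --genairr-dataconfig value that goes with a bundled checkpoint."""
    try:
        return ",".join(PRETRAINED[checkpoint])
    except KeyError:
        raise ValueError(f"not a bundled checkpoint: {checkpoint}") from None


print(expected_dataconfig("/app/pretrained_models/IGL_S5F_576"))
```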

Custom DataConfigs

You can use custom GenAIRR dataconfigs by providing a path to a pickled DataConfig object:

python app.py run \
  --model-checkpoint=path/to/custom/model \
  --genairr-dataconfig=/path/to/custom_dataconfig.pkl \
  --sequences=input.csv \
  --save-path=output/

For multi-chain custom configs:

python app.py run \
  --model-checkpoint=path/to/multichain/model \
  --genairr-dataconfig=/path/to/config1.pkl,/path/to/config2.pkl \
  --sequences=input.csv \
  --save-path=output/
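
To check that custom dataconfig files can be read before starting a run, they can be loaded with Python's `pickle` module. The sketch below uses plain dictionaries as stand-ins for real DataConfig objects, since unpickling those requires GenAIRR to be installed; `load_dataconfigs` is a hypothetical helper.

```python
import pickle
import tempfile
from pathlib import Path


def load_dataconfigs(arg: str) -> list:
    """Load one pickled object per comma-separated path, preserving order."""
    loaded = []
    for raw_path in arg.split(","):
        with open(raw_path.strip(), "rb") as handle:
            # Only unpickle files you trust: pickle can execute arbitrary code.
            loaded.append(pickle.load(handle))
    return loaded


# Demo with stand-in dictionaries instead of real DataConfig objects.
with tempfile.TemporaryDirectory() as tmp:
    paths = []
    for name in ("config1", "config2"):
        path = Path(tmp) / f"{name}.pkl"
        path.write_bytes(pickle.dumps({"name": name}))
        paths.append(str(path))
    configs = load_dataconfigs(",".join(paths))

print([cfg["name"] for cfg in configs])
```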

Docker in depth

Step-by-Step Docker Usage Guide

1. Pull the latest AlignAIR image

docker pull thomask90/alignair:latest

2. Prepare your data

Ensure your input sequences are in CSV format with a sequence column
Create directories for input data and output results
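
A quick pre-flight check can catch a missing sequence column before the container is started. The following is a minimal sketch using the standard `csv` module; `validate_sequences_csv` is an illustrative helper, and the sample rows are made up.

```python
import csv
import io


def validate_sequences_csv(handle) -> int:
    """Return the number of data rows, or raise if the sequence column is unusable."""
    reader = csv.DictReader(handle)
    if reader.fieldnames is None or "sequence" not in reader.fieldnames:
        raise ValueError("input CSV must contain a 'sequence' column")
    count = 0
    for row_number, row in enumerate(reader, start=1):
        if not (row["sequence"] or "").strip():
            raise ValueError(f"empty sequence in data row {row_number}")
        count += 1
    return count


sample = io.StringIO("id,sequence\nseq1,ACGTACGT\nseq2,TTGACCA\n")
n_rows = validate_sequences_csv(sample)
print(n_rows)
```

For a file on disk, open it with `open(path, newline="")` and pass the handle to the function.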

3. Start the container with volume mounts

# Windows example (PowerShell uses a backtick, not a backslash, for line continuation):
docker run -it --rm `
  -v C:/path/to/your/data:/data `
  -v C:/path/to/your/downloads:/downloads `
  thomask90/alignair:latest

# Linux/Mac example:
docker run -it --rm \
  -v /path/to/your/data:/data \
  -v /path/to/your/downloads:/downloads \
  thomask90/alignair:latest

4. Run AlignAIR with the appropriate model

Heavy Chain Analysis (Extended Model)

python app.py run \
  --model-checkpoint=/app/pretrained_models/IGH_S5F_576_EXTENDED \
  --genairr-dataconfig=HUMAN_IGH_EXTENDED \
  --sequences=/data/sample_HeavyChain_dataset.csv \
  --save-path=/downloads/

Light Chain Multi-Chain Analysis (Lambda + Kappa)

python app.py run \
  --model-checkpoint=/app/pretrained_models/IGL_S5F_576 \
  --genairr-dataconfig=HUMAN_IGL_OGRDB,HUMAN_IGK_OGRDB \
  --sequences=/data/sample_LightChain_dataset.csv \
  --save-path=/downloads/

Important: This MultiChainAlignAIR model predicts both Lambda and Kappa chains. The output includes a chain_type column indicating the predicted chain type for each sequence. The order of dataconfigs (Lambda first, then Kappa) must match the training order.

Single Light Chain Analysis

# Lambda or Kappa only (using multi-chain model with both dataconfigs)
python app.py run \
  --model-checkpoint=/app/pretrained_models/IGL_S5F_576 \
  --genairr-dataconfig=HUMAN_IGL_OGRDB,HUMAN_IGK_OGRDB \
  --sequences=/data/light_chain_sequences.csv \
  --save-path=/downloads/

Note: Even for single chain type analysis, the MultiChainAlignAIR model requires both dataconfigs but will correctly predict and classify each sequence's chain type.

T-Cell Receptor Beta Chain

python app.py run \
  --model-checkpoint=/app/pretrained_models/TCRB_Uniform_576 \
  --genairr-dataconfig=HUMAN_TCRB_IMGT \
  --sequences=/data/tcr_sequences.csv \
  --save-path=/downloads/

Critical Notes for Custom Models

Always use the same GenAIRR dataconfig during prediction as was used during model training
Never use modified dataconfigs with pre-trained models
For multi-chain models: The order of dataconfigs must match the training order exactly
Custom dataconfigs: Provide the path to your pickled DataConfig object instead of built-in names

Custom DataConfig Example

python app.py run \
  --model-checkpoint=/path/to/your/custom/model \
  --genairr-dataconfig=/data/your_custom_dataconfig.pkl \
  --sequences=/data/sequences.csv \
  --save-path=/downloads/

5. Check results

Your results are saved in the mounted /downloads directory and can be accessed from your host system.

Parameter Reference

Core Parameters

Parameter	Description	Default
`--model-checkpoint`	Path to model weights	Required
`--genairr-dataconfig`	Built-in GenAIRR dataconfig name or path to a pickled DataConfig; comma-separated for multi-chain models	See `--help`
`--sequences`	Input file path (CSV/TSV/FASTA)	Required
`--save-path`	Output directory	Required

Model Settings

Parameter	Description	Default
`--max-input-size`	Maximum input window size	`576`
`--batch-size`	Sequences per batch	`2048`

Thresholds

Parameter	Description	Default
`--v-allele-threshold`	V allele calling threshold	`0.75`
`--d-allele-threshold`	D allele calling threshold	`0.30`
`--j-allele-threshold`	J allele calling threshold	`0.80`
`--v-cap` / `--d-cap` / `--j-cap`	Maximum calls per segment	`3`
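
The table does not say how a threshold and a cap interact. One plausible scheme, assumed here purely for illustration and not confirmed as AlignAIR's method, keeps every allele whose score is at least the threshold times the top score, then truncates the list to the cap. The function name, allele names, and scores below are hypothetical.

```python
def select_alleles(scores: dict[str, float], threshold: float, cap: int) -> list[str]:
    """Keep alleles scoring >= threshold * best score, limited to `cap` calls.

    This relative-threshold rule is an assumption for illustration only.
    """
    if not scores:
        return []
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    best_score = ranked[0][1]
    kept = [name for name, score in ranked if score >= threshold * best_score]
    return kept[:cap]


# Made-up V allele scores.
v_scores = {"IGHV1-2*01": 0.9, "IGHV1-2*02": 0.7, "IGHV1-3*01": 0.1}
print(select_alleles(v_scores, threshold=0.75, cap=3))
```

Under this reading, raising the threshold yields fewer, more confident calls, while the cap bounds how many ambiguous alleles are reported.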

Output Options

Parameter	Description	Default
`--airr-format`	Output full AIRR Schema	`false`
`--fix-orientation`	Auto-correct orientations	`true`
`--translate-to-asc`	Output ASC allele names	`false`

For complete parameter list: python app.py run --help

Examples

See the examples/ folder for Jupyter notebooks:

End‑to‑end heavy‑chain pipeline
Benchmark vs. IgBLAST on 10 K reads
Batch processing workflows

Data availability

Training & benchmark datasets are archived on Zenodo: doi:10.5281/zenodo.XXXXXXXX

Documentation

For comprehensive documentation, examples, and technical details, visit: https://alignair.ai/docs

Contributing

Pull requests are welcome! Please:

Run pre-commit run --all-files
Ensure pytest passes
Update CHANGELOG.md

See CONTRIBUTING.md for full guidelines.

License

This project is licensed under the terms of the GNU General Public License v3.0 or later (GPLv3+).

Contact

Open an issue or email [email protected].
For announcements, visit https://alignair.ai or join our Slack.
