IntegrateALL is a machine learning-based pipeline for multilevel data extraction and subtype classification in B-cell precursor ALL. Built on the Snakemake workflow management system, it provides the user with an interactive report containing subtype predictions, a virtual karyotype, fusions, single nucleotide variants, and quality control metrics.
- Machine Learning Classification: Automated subtype prediction for B-cell precursor ALL using ALLCatchR and automated karyotype prediction using KaryALL
- Comprehensive Analysis: Fusion detection, CNV analysis, variant calling, and quality control
- Interactive Reporting: HTML reports with visualizations and predictions
- Two-Workflow Architecture: Separate setup and analysis workflows for improved stability
- Automatic Classification: Assignment according to the WHO-HAEM5 and ICC classifications
IntegrateALL uses a two-workflow architecture with multiple Snakefiles:
- Setup Workflow (`setup.smk`): One-time installation of reference data and tools (~21GB)
- Analysis Workflows:
  - `Snakefile`: Standard analysis workflow for local/single-node execution
  - `Snakefile.cluster`: Optimized workflow for cluster environments with job-specific resource allocation
  - `Snakefile.conditional`: Cluster workflow with an Arriba-first approach (FusionCatcher only runs if the initial classification fails)
- Linux operating system (tested on Ubuntu/CentOS)
- Snakemake >= 7.3
- Conda/Mamba for dependency management
- 50GB free disk space (21GB for references + analysis space)
- Minimum 50GB RAM for STAR alignment
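Before installing, a quick check that the host meets these requirements can save a failed run later. A minimal sketch (the version checks assume the tools are already on your PATH):

```bash
# Check free disk space (~50 GB needed) and available RAM (~50 GB for STAR)
df -h .
free -g

# Confirm tool versions once installed
snakemake --version   # should report >= 7.3
conda --version
```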
```bash
# Download and install Miniconda
cd ~/software
wget https://repo.anaconda.com/miniconda/Miniconda3-py311_23.11.0-2-Linux-x86_64.sh
sh Miniconda3-py311_23.11.0-2-Linux-x86_64.sh

# Install mamba for faster dependency resolution
conda install mamba=1.5.8 -n base -c conda-forge
```
```bash
# Clone the repository
git clone https://github.com/NadineWolgast/IntegrateALL.git
cd IntegrateALL

# Or download and extract the zip file
wget https://github.com/NadineWolgast/IntegrateALL/archive/refs/heads/main.zip
unzip main.zip && cd IntegrateALL-main
```
```bash
# Create and activate the conda environment
conda activate base
mamba env create --name integrateall --file environment.yaml
conda activate integrateall
```
Edit the `config.yaml` file to match your system:
```yaml
absolute_path: /absolute/path/to/IntegrateALL  # No trailing slash!
star_mem: 50000                                # Minimum 50GB RAM for STAR
threads: 4                                     # Adjust to your CPU cores
```
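To catch configuration mistakes early, a minimal sanity check is sketched below; it assumes `config.yaml` sits in the repository root and that `absolute_path` is given as a single token on its line:

```bash
# Read absolute_path from config.yaml (assumes it is on a single line)
path=$(grep '^absolute_path:' config.yaml | awk '{print $2}')

# The pipeline expects no trailing slash
[[ "$path" == */ ]] && echo "WARNING: absolute_path has a trailing slash: $path"

# The directory should exist
[[ -d "$path" ]] || echo "WARNING: absolute_path does not exist: $path"
```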
Install all required reference data and tools:
```bash
snakemake --snakefile setup.smk --cores 4 --use-conda --conda-frontend conda
```
Use `--use-conda` to ensure the correct tool versions are used for reference data creation.
This setup downloads (~21GB total):
- Reference genome and annotations (~16GB)
- STAR genome index (~1GB)
- RNAseqCNV reference data (~50MB)
- FusionCatcher database (~4.4GB)
- R packages (ALLCatchR, RNAseqCNV)
- Arriba draw_fusions tool (~10MB)
⏱️ Time: 1-3 hours (depending on internet speed)
Setup only runs once - subsequent executions skip existing components.
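To verify which setup steps remain (or confirm that everything is already in place), you can combine the setup Snakefile with a dry run:

```bash
# Dry run of the setup workflow: lists any reference data or tools still missing
snakemake --snakefile setup.smk -n --use-conda --conda-frontend conda
```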
- Prepare sample sheet: Edit `samples.csv` with your FASTQ file paths (an optional path check is sketched below):

  ```csv
  sample_id,left,right
  Sample_01,/path/to/sample_01_R1.fastq.gz,/path/to/sample_01_R2.fastq.gz
  Sample_02,/path/to/sample_02_R1.fastq.gz,/path/to/sample_02_R2.fastq.gz
  ```
- Test with provided data (optional):

  ```csv
  sample_id,left,right
  Test,/path/to/IntegrateALL/data/samples/sub1_new.fq.gz,/path/to/IntegrateALL/data/samples/sub2_new.fq.gz
  ```
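Before launching the analysis, it can help to confirm that every FASTQ path in `samples.csv` actually exists; a small sketch, assuming the header row shown above (`sample_id,left,right`) and comma-separated values without quoting:

```bash
# Check that all FASTQ files listed in samples.csv are readable
tail -n +2 samples.csv | while IFS=',' read -r sample left right; do
    [[ -f "$left"  ]] || echo "Missing left reads for $sample: $left"
    [[ -f "$right" ]] || echo "Missing right reads for $sample: $right"
done
```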
Choose the appropriate Snakefile for your environment:
- `Snakefile`: Standard workflow for local execution or single-node processing
- `Snakefile.cluster`: Standard cluster workflow that always runs both Arriba and FusionCatcher for comprehensive fusion detection
- `Snakefile.conditional`: Optimized cluster workflow with an Arriba-first approach that only runs FusionCatcher if the initial classification fails, significantly reducing runtime for samples with clear driver fusions
- Dry run (preview jobs):

  ```bash
  snakemake -n --use-conda --conda-frontend conda
  ```
- Full analysis:

  ```bash
  snakemake --cores 20 --use-conda --conda-frontend conda
  ```

Note: The `--conda-frontend conda` flag prevents libmamba-related errors during conda environment creation.
- Single sample:

  ```bash
  snakemake --cores 4 --use-conda --conda-frontend conda STAR_output/YOUR_SAMPLE_ID/Aligned.sortedByCoord.out.bam
  ```

Option 1: Standard cluster workflow (comprehensive fusion detection)
```bash
snakemake -s Snakefile.cluster --cores 20 --use-conda --conda-frontend conda
```
Option 2: Optimized cluster workflow (Arriba-first approach)
```bash
snakemake -s Snakefile.conditional --cores 20 --use-conda --conda-frontend conda
```
Recommendation: Use `Snakefile.conditional` for faster processing when you expect clear driver fusions. Use `Snakefile.cluster` for comprehensive analysis when you need maximum sensitivity or are analyzing challenging samples.
Option 3: SBATCH submission script
First, configure the submission script for your cluster:
```bash
# Edit submit_snakemake_job.sh to match your cluster configuration
cp submit_snakemake_job.sh my_submit_script.sh
# Adjust partition name, memory, CPU count, and conda paths
```
Then submit the job:
```bash
sbatch my_submit_script.sh
```
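The contents of `submit_snakemake_job.sh` depend on your cluster; as a rough orientation only, a minimal SBATCH wrapper could look like the sketch below (partition name, resource values, conda path, and repository path are placeholders, not the repository's actual script):

```bash
#!/bin/bash
#SBATCH --job-name=integrateall
#SBATCH --partition=YOUR_PARTITION     # placeholder: your cluster's partition
#SBATCH --cpus-per-task=20
#SBATCH --mem=100G
#SBATCH --time=48:00:00

# Activate the conda environment created during installation (adjust the conda path)
source "$HOME/software/miniconda3/etc/profile.d/conda.sh"
conda activate integrateall

# Run the cluster workflow from the repository root (adjust the path)
cd /absolute/path/to/IntegrateALL
snakemake -s Snakefile.cluster --cores 20 --use-conda --conda-frontend conda
```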
Option 4: Simple SLURM:
```bash
srun -c 20 --mem 100G snakemake --cores 20 --use-conda --conda-frontend conda
```
Option 5: SLURM Executor:
```bash
snakemake --slurm --default-resources mem_mb=5000 threads=4 slurm_partition=YOUR_PARTITION --jobs 200 --use-conda --conda-frontend conda --keep-going
```
The pipeline writes its results to the following locations:
- Interactive Report: `interactive_output/{SAMPLE_ID}/output_report_{SAMPLE_ID}.html`
- Alignments: `STAR_output/{SAMPLE_ID}/`
- Fusions: `fusions/{SAMPLE_ID}.pdf`, `fusioncatcher_output/{SAMPLE_ID}/`
- Fusion Intersects: `data/fusion_intersect/{SAMPLE_ID}.csv`
- Variants: `Variants_RNA_Seq_Reads/{SAMPLE_ID}/`
- CNV Analysis: `RNAseqCNV_output/{SAMPLE_ID}/`
- Classification: `allcatch_output/{SAMPLE_ID}/predictions.tsv`
- Karyotype Prediction: `karyotype_prediction/{SAMPLE_ID}.csv`
- Quality Control: `qc/fastqc/{SAMPLE_ID}/`, `qc/multiqc/{SAMPLE_ID}/`
- Individual Classification: `Final_classification/{SAMPLE_ID}_output_report.csv`
- Aggregated Summary: `Final_classification/Aggregated_output_curation.csv`
Count and expression matrices:
- TPM: `data/tpm/{SAMPLE_ID}.tsv`
- CPM: `data/cpm/{SAMPLE_ID}.tsv`
- Counts: `data/counts/{SAMPLE_ID}.tsv`
- Raw counts: `STAR_output/{SAMPLE_ID}/ReadsPerGene.out.tab`
- Combined counts: `data/combined_counts/ensemble_counts.tsv`, `data/combined_counts/gene_counts.tsv`
Run only specific analysis modules by editing the `rule all` section in the Snakefile:
```python
rule all:
    input:
        "check_samples.txt",
        expand("fusioncatcher_output/{sample_id}/final-list_candidate-fusion-genes.txt",
               sample_id=list(samples.keys())),
        # Uncomment desired outputs:
        # expand("STAR_output/{sample_id}/Aligned.sortedByCoord.out.bam", sample_id=list(samples.keys())),
        # expand("allcatch_output/{sample_id}/predictions.tsv", sample_id=samples.keys()),
        # expand("interactive_output/{sample}/output_report_{sample}.html", sample=list(samples.keys()))
```
```bash
# Deactivate environment
conda deactivate

# Reactivate for analysis
conda activate integrateall
```
Common issues:
- "Snakefile not found": Ensure you're in the IntegrateALL directory
- Memory errors: Increase `star_mem` in `config.yaml` (minimum 50000)
- Permission denied: Check file paths and permissions in `samples.csv`
- Environment conflicts: Remove and recreate the conda environment (see the sketch below)
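For environment conflicts, removing and rebuilding the environment usually resolves the issue; a minimal sketch using the same `environment.yaml` as during installation:

```bash
# Remove the broken environment and recreate it from environment.yaml
conda deactivate
conda env remove --name integrateall
mamba env create --name integrateall --file environment.yaml
conda activate integrateall
```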
For issues and bug reports, please use the GitHub issue tracker.
If you use IntegrateALL in your research, please cite: https://doi.org/10.1101/2025.09.25.673987