3 releases
| Version | Released |
|---|---|
| 0.1.3 | Nov 7, 2025 |
| 0.1.1 | Nov 5, 2025 |
| 0.1.0 | Nov 5, 2025 |
seqwish 🦀
A variation graph inducer - Build pangenome graphs from pairwise sequence alignments.
Seqwish implements a lossless conversion from pairwise alignments between sequences to a variation graph that encodes both the sequences and their alignments. The input is typically a set of all-versus-all alignments, but the exact structure of the alignment set may be defined in an application-specific way.
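To make the input side concrete, here is a minimal sketch of reading one PAF alignment record in Rust. It relies only on the standard PAF layout (12 mandatory tab-separated columns, with the CIGAR string carried in an optional `cg:Z:` tag). The `PafRecord` type and `parse_paf_line` function are hypothetical names for illustration and are not taken from the seqwish code base.

```rust
/// One alignment record, keeping only the fields needed to locate aligned bases.
/// Hypothetical type for illustration; not seqwish's internal representation.
#[derive(Debug, PartialEq)]
pub struct PafRecord {
    pub query_name: String,
    pub query_len: u64,
    pub query_start: u64,
    pub query_end: u64,
    pub reverse: bool,
    pub target_name: String,
    pub target_len: u64,
    pub target_start: u64,
    pub target_end: u64,
    /// CIGAR string from the optional `cg:Z:` tag, if present.
    pub cigar: Option<String>,
}

/// Parse a single tab-separated PAF line.
pub fn parse_paf_line(line: &str) -> Result<PafRecord, String> {
    let f: Vec<&str> = line.trim_end().split('\t').collect();
    if f.len() < 12 {
        return Err(format!("expected at least 12 columns, found {}", f.len()));
    }
    // Helper that parses a numeric column and reports its 1-based index on failure.
    let num = |i: usize| f[i].parse::<u64>().map_err(|e| format!("column {}: {e}", i + 1));
    let reverse = match f[4] {
        "+" => false,
        "-" => true,
        s => return Err(format!("bad strand '{s}'")),
    };
    // Optional tags follow the 12 mandatory columns.
    let cigar = f[12..]
        .iter()
        .find_map(|t| t.strip_prefix("cg:Z:"))
        .map(str::to_string);
    Ok(PafRecord {
        query_name: f[0].to_string(),
        query_len: num(1)?,
        query_start: num(2)?,
        query_end: num(3)?,
        reverse,
        target_name: f[5].to_string(),
        target_len: num(6)?,
        target_start: num(7)?,
        target_end: num(8)?,
        cigar,
    })
}
```

A record without a `cg:Z:` tag parses with `cigar: None`; base-level graph induction presumably needs the CIGAR, which is why the example workflow runs minimap2 with `-c`.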
✨ Features
- Memory-safe - Rewritten in Rust with compile-time safety guarantees
- Parallel - Multi-threaded processing throughout the pipeline
- Scalable - Disk-backed data structures for processing large genomes
- Fast - Performance comparable to the highly optimized C++ version
- Verified - Produces byte-for-byte identical output to the reference implementation
🚀 Installation
# From source (recommended)
git clone https://github.com/pangenome/seqwish
cd seqwish
cargo build --release
# Binary will be in target/release/seqwish
# Alternatively, install a published release from crates.io:
# cargo install seqwish
📖 Usage
# Basic usage
seqwish -s sequences.fa -p alignments.paf -g output.gfa
# With options for large datasets:
#   -s  input sequences (FASTA/FASTQ)
#   -p  pairwise alignments (PAF format)
#   -g  output variation graph (GFA format)
#   -t  use 16 threads
#   -k  filter matches shorter than 19 bp
#   -P  show progress
seqwish \
  -s sequences.fa \
  -p alignments.paf \
  -g output.gfa \
  -t 16 \
  -k 19 \
  -P
Options
-s, --seqs <FILE> Input sequences (FASTA/FASTQ, optionally gzipped)
-p, --paf-alns <FILE> Input alignments (PAF format, optionally gzipped)
-g, --gfa <FILE> Output graph (GFA v1.0 format)
-t, --threads <N> Number of threads [default: 1]
-k, --min-match-len <N> Minimum match length [default: 0]
-r, --repeat-max <N> Maximum repeat copies in transitive closure [default: 0]
-l, --min-repeat-distance <N> Minimum distance for repeat handling [default: 0]
-B, --transclose-batch <N> Transitive closure batch size [default: 1000000]
-b, --temp-dir <PATH> Temporary file directory
-T, --keep-temp Keep temporary files
-P, --show-progress Show progress messages
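The effect of `-k, --min-match-len` can be illustrated with a small sketch. It assumes the filter applies to runs of exact matches (`=` operations in an extended CIGAR), which the option table above does not spell out, and it ignores strand handling. `MatchRun` and `exact_match_runs` are hypothetical names, not part of seqwish's API.

```rust
/// A run of exactly matching bases, given as 0-based offsets into the aligned
/// query and target regions. Hypothetical type for illustration.
#[derive(Debug, PartialEq)]
pub struct MatchRun {
    pub query_pos: usize,
    pub target_pos: usize,
    pub len: usize,
}

/// Walk an extended CIGAR string (`=`, `X`, `I`, `D`) and keep the `=` runs
/// whose length is at least `min_len`.
pub fn exact_match_runs(cigar: &str, min_len: usize) -> Result<Vec<MatchRun>, String> {
    let mut runs = Vec::new();
    let (mut q, mut t) = (0usize, 0usize); // current query / target offsets
    let mut num = 0usize;                  // length of the operation being read
    let mut have_digits = false;
    for c in cigar.chars() {
        if let Some(d) = c.to_digit(10) {
            num = num * 10 + d as usize;
            have_digits = true;
            continue;
        }
        if !have_digits {
            return Err(format!("operation '{c}' has no length"));
        }
        match c {
            '=' => {
                if num >= min_len {
                    runs.push(MatchRun { query_pos: q, target_pos: t, len: num });
                }
                q += num;
                t += num;
            }
            // A mismatch consumes both sequences but never yields a match run.
            'X' => {
                q += num;
                t += num;
            }
            'I' => q += num, // insertion: present in the query only
            'D' => t += num, // deletion: present in the target only
            // `M` does not distinguish matches from mismatches, so resolving it
            // would require comparing the underlying bases.
            other => return Err(format!("unsupported CIGAR operation '{other}'")),
        }
        num = 0;
        have_digits = false;
    }
    if have_digits {
        return Err("trailing length without an operation".to_string());
    }
    Ok(runs)
}
```

With `min_len` set to 19, as in the `-k 19` example above, the CIGAR `30=1X5=2I40=` keeps the 30 bp and 40 bp runs and drops the 5 bp run.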
🔬 Algorithm Overview
The algorithm proceeds in stages:
1. Sequence Indexing - Build FM-index of input sequences
2. Alignment Indexing - Parse PAF alignments into interval trees
3. Transitive Closure - Compute equivalence classes of aligned positions
4. Graph Emission - Write graph sequence from closures
5. Node Compaction - Merge non-bifurcating regions
6. Link Derivation - Extract edges between nodes
7. GFA Output - Emit final variation graph
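The transitive closure and graph emission stages can be pictured with a toy in-memory version. The sketch below uses a plain union-find (disjoint-set) structure over base offsets, which is a deliberate simplification and not the disk-backed, batched approach the feature list and the `-B` option suggest seqwish uses. `UnionFind` and `induce_graph` are hypothetical names, and matches are assumed to be forward-strand `(offset_a, offset_b, length)` triples over the concatenated input.

```rust
use std::cmp::Ordering;
use std::collections::HashMap;

/// Disjoint-set forest over base offsets in the concatenated input.
pub struct UnionFind {
    parent: Vec<usize>,
    rank: Vec<u8>,
}

impl UnionFind {
    pub fn new(n: usize) -> Self {
        Self { parent: (0..n).collect(), rank: vec![0; n] }
    }

    /// Find the class representative, shortening the path as we go (path halving).
    pub fn find(&mut self, mut x: usize) -> usize {
        while self.parent[x] != x {
            self.parent[x] = self.parent[self.parent[x]];
            x = self.parent[x];
        }
        x
    }

    /// Merge the classes containing `a` and `b` (union by rank).
    pub fn union(&mut self, a: usize, b: usize) {
        let (ra, rb) = (self.find(a), self.find(b));
        if ra == rb {
            return;
        }
        match self.rank[ra].cmp(&self.rank[rb]) {
            Ordering::Less => self.parent[ra] = rb,
            Ordering::Greater => self.parent[rb] = ra,
            Ordering::Equal => {
                self.parent[rb] = ra;
                self.rank[ra] += 1;
            }
        }
    }
}

/// Collapse aligned bases into shared graph positions.
/// Returns the graph sequence and, for every input base, its graph offset.
pub fn induce_graph(
    seqs: &[&[u8]],
    matches: &[(usize, usize, usize)],
) -> (Vec<u8>, Vec<usize>) {
    let input: Vec<u8> = seqs.concat();
    let mut uf = UnionFind::new(input.len());
    // Transitive closure: every matched base pair joins one equivalence class.
    for &(a, b, len) in matches {
        for i in 0..len {
            uf.union(a + i, b + i);
        }
    }
    // Graph emission: the first base seen in each class defines a graph position.
    let mut class_to_graph: HashMap<usize, usize> = HashMap::new();
    let mut graph_seq = Vec::new();
    let mut base_to_graph = Vec::with_capacity(input.len());
    for (offset, &base) in input.iter().enumerate() {
        let root = uf.find(offset);
        let next_id = graph_seq.len();
        let id = *class_to_graph.entry(root).or_insert_with(|| {
            graph_seq.push(base);
            next_id
        });
        base_to_graph.push(id);
    }
    (graph_seq, base_to_graph)
}
```

For the inputs `ACGT` and `ACTT`, with matches covering the shared `AC` prefix and the final `T`, the eight input bases collapse to the five graph bases `ACGTT`. Node compaction and link derivation would then operate on the per-base mapping.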
📊 Example Workflow
# 1. Generate all-to-all alignments
minimap2 -cx asm20 -X sequences.fa sequences.fa > alignments.paf
# 2. Build the variation graph
seqwish -s sequences.fa -p alignments.paf -g graph.gfa -t 16 -P
# 3. Visualize (requires vg and graphviz)
vg view -dp graph.gfa | dot -Tpng > graph.png
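After step 2, a quick sanity check on the output can be useful before visualizing. The sketch below counts segment (`S`), link (`L`), and path (`P`) records in GFA v1 text and sums the segment lengths. `GfaSummary` and `summarize_gfa` are hypothetical helper names for illustration, not something seqwish provides.

```rust
/// Record counts for a GFA v1 file. Hypothetical helper type for illustration.
#[derive(Debug, Default, PartialEq)]
pub struct GfaSummary {
    pub segments: usize,
    pub links: usize,
    pub paths: usize,
    /// Total number of bases stored across all segment records.
    pub segment_bases: usize,
}

/// Count S, L, and P records in GFA v1 text; other record types are ignored.
pub fn summarize_gfa(text: &str) -> GfaSummary {
    let mut summary = GfaSummary::default();
    for line in text.lines() {
        let mut fields = line.split('\t');
        match fields.next() {
            Some("S") => {
                summary.segments += 1;
                // Layout is `S <name> <sequence>`; `*` means the sequence is not stored.
                if let Some(seq) = fields.nth(1) {
                    if seq != "*" {
                        summary.segment_bases += seq.len();
                    }
                }
            }
            Some("L") => summary.links += 1,
            Some("P") => summary.paths += 1,
            _ => {}
        }
    }
    summary
}
```

For a real run, the text could be loaded with `std::fs::read_to_string("graph.gfa")`. Since the conversion is described as lossless, the number of `P` records would be expected to match the number of input sequences, though that expectation is an inference from the description above rather than a documented guarantee.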
🏗️ Use Cases
- Pangenome construction - Build graphs from multiple related genomes
- Structural variation - Capture large-scale genomic rearrangements
- Population genomics - Represent variation across many samples
- Reference graphs - Create enhanced reference structures
🔗 Related Tools
- minimap2 - Generate PAF alignments
- vg - Variation graph toolkit
- odgi - Graph optimization and manipulation
- gfaffix - GFA graph manipulation
📚 Citation
If you use seqwish, please cite:
Garrison E, Guarracino A. (2023)
Unbiased pangenome graphs
Bioinformatics, Volume 39, Issue 1, btac743
https://doi.org/10.1093/bioinformatics/btac743
🏛️ Implementation
This is a complete Rust reimplementation of seqwish using the "Ship of Theseus" pattern - incrementally replacing components while maintaining correctness. The Rust version:
- ✅ Produces byte-for-byte identical output to the C++ implementation
- ✅ Passes all original test suites
- ✅ Provides memory safety guarantees
- ✅ Offers comparable performance
The original C++ implementation is preserved in cpp/ for reference. See README_CPP.md for C++ documentation.
🤝 Contributing
Contributions welcome! See CONTRIBUTING.md for guidelines.
📄 License
MIT - see LICENSE file
💬 Support
Made with 🦀 Rust