
#genomics #bioinformatics #pangenome #gfa #variation-graph

bin+lib seqwish

A variation graph inducer - build pangenome graphs from pairwise alignments

3 releases

0.1.3 Nov 7, 2025
0.1.1 Nov 5, 2025
0.1.0 Nov 5, 2025

#180 in Biology

MIT license

235KB
5.5K SLoC

seqwish 🦀


A variation graph inducer - Build pangenome graphs from pairwise sequence alignments.

Seqwish implements a lossless conversion from pairwise alignments between sequences to a variation graph that encodes the sequences and their alignments. The input is typically a set of all-versus-all alignments, but the exact structure of the alignment set may be defined in an application-specific way.

✨ Features

  • Memory-safe - Rewritten in Rust with compile-time safety guarantees
  • Parallel - Multi-threaded processing throughout the pipeline
  • Scalable - Disk-backed data structures for processing large genomes
  • Fast - Performance comparable to the highly optimized C++ version
  • Verified - Produces byte-for-byte identical output to the reference C++ implementation

🚀 Installation

# From source (recommended)
git clone https://github.com/pangenome/seqwish
cd seqwish
cargo build --release
# Binary will be in target/release/seqwish

# Future: cargo install seqwish (once published to crates.io)

📖 Usage

# Basic usage
seqwish -s sequences.fa -p alignments.paf -g output.gfa

# With options for large datasets:
#   -s  input sequences (FASTA/FASTQ)
#   -p  pairwise alignments (PAF format)
#   -g  output variation graph (GFA format)
#   -t  use 16 threads
#   -k  filter matches < 19bp
#   -P  show progress
seqwish \
  -s sequences.fa \
  -p alignments.paf \
  -g output.gfa \
  -t 16 \
  -k 19 \
  -P
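
For readers who want to inspect the `-p` input programmatically: PAF is a tab-separated format with 12 mandatory columns, and aligners such as minimap2 (with `-c`) append the CIGAR string in a `cg:Z:` tag. The following is a minimal sketch of reading one record, not seqwish's own parser; `PafRecord` and `parse_paf_line` are hypothetical names introduced here for illustration.

```rust
/// One PAF alignment record: the coordinate columns plus the optional CIGAR tag.
/// Illustrative only: this struct is an assumption, not a type from the seqwish crate.
#[derive(Debug, PartialEq)]
pub struct PafRecord {
    pub query_name: String,
    pub query_len: u64,
    pub query_start: u64,
    pub query_end: u64,
    pub is_reverse: bool,
    pub target_name: String,
    pub target_len: u64,
    pub target_start: u64,
    pub target_end: u64,
    pub cigar: Option<String>,
}

/// Parse a single PAF line. Returns None if the line has fewer than
/// 12 columns, an unknown strand symbol, or a non-numeric coordinate.
pub fn parse_paf_line(line: &str) -> Option<PafRecord> {
    let f: Vec<&str> = line.trim_end().split('\t').collect();
    if f.len() < 12 {
        return None;
    }
    let is_reverse = match f[4] {
        "+" => false,
        "-" => true,
        _ => return None,
    };
    // Optional tags follow the 12 mandatory columns; pick out the CIGAR if present.
    let cigar = f[12..]
        .iter()
        .find_map(|tag| tag.strip_prefix("cg:Z:"))
        .map(str::to_owned);
    Some(PafRecord {
        query_name: f[0].to_owned(),
        query_len: f[1].parse().ok()?,
        query_start: f[2].parse().ok()?,
        query_end: f[3].parse().ok()?,
        is_reverse,
        target_name: f[5].to_owned(),
        target_len: f[6].parse().ok()?,
        target_start: f[7].parse().ok()?,
        target_end: f[8].parse().ok()?,
        cigar,
    })
}
```

Columns 10 to 12 (matching bases, alignment block length, mapping quality) are skipped in this sketch because the examples here only need coordinates and the CIGAR.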

Options

-s, --seqs <FILE>              Input sequences (FASTA/FASTQ, optionally gzipped)
-p, --paf-alns <FILE>          Input alignments (PAF format, optionally gzipped)
-g, --gfa <FILE>               Output graph (GFA v1.0 format)
-t, --threads <N>              Number of threads [default: 1]
-k, --min-match-len <N>        Minimum match length [default: 0]
-r, --repeat-max <N>           Maximum repeat copies in transitive closure [default: 0]
-l, --min-repeat-distance <N>  Minimum distance for repeat handling [default: 0]
-B, --transclose-batch <N>     Transitive closure batch size [default: 1000000]
-b, --temp-dir <PATH>          Temporary file directory
-T, --keep-temp                Keep temporary files
-P, --show-progress            Show progress messages

🔬 Algorithm Overview

The algorithm proceeds in stages:

  1. Sequence Indexing - Build FM-index of input sequences
  2. Alignment Indexing - Parse PAF alignments into interval trees
  3. Transitive Closure - Compute equivalence classes of aligned positions
  4. Graph Emission - Write graph sequence from closures
  5. Node Compaction - Merge non-bifurcating regions
  6. Link Derivation - Extract edges between nodes
  7. GFA Output - Emit final variation graph
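
The transitive closure stage above can be illustrated with a union-find (disjoint-set) structure, a standard way to compute equivalence classes. This is a simplified sketch under stated assumptions: all input sequences are imagined as concatenated into one coordinate space, each match is a forward-strand triple `(a, b, len)` meaning positions `a+i` and `b+i` are aligned, and everything fits in memory. It is not a description of how seqwish implements this stage; `DisjointSets` and `closure_classes` are hypothetical names.

```rust
/// Minimal union-find over positions 0..n.
pub struct DisjointSets {
    parent: Vec<usize>,
}

impl DisjointSets {
    pub fn new(n: usize) -> Self {
        Self { parent: (0..n).collect() }
    }

    /// Find the representative of x, shortening the path as we go (path halving).
    pub fn find(&mut self, mut x: usize) -> usize {
        while self.parent[x] != x {
            let grandparent = self.parent[self.parent[x]];
            self.parent[x] = grandparent;
            x = grandparent;
        }
        x
    }

    /// Merge the classes of a and b. The smaller root always wins, so the
    /// representative of each class is its minimum position.
    pub fn union(&mut self, a: usize, b: usize) {
        let (ra, rb) = (self.find(a), self.find(b));
        if ra != rb {
            let (lo, hi) = if ra < rb { (ra, rb) } else { (rb, ra) };
            self.parent[hi] = lo;
        }
    }
}

/// For each position in the concatenated sequence space, return the
/// representative (minimum position) of its equivalence class.
pub fn closure_classes(total_len: usize, matches: &[(usize, usize, usize)]) -> Vec<usize> {
    let mut sets = DisjointSets::new(total_len);
    for &(a, b, len) in matches {
        for i in 0..len {
            sets.union(a + i, b + i);
        }
    }
    (0..total_len).map(|p| sets.find(p)).collect()
}
```

In this picture, each distinct class corresponds to one base of the graph sequence: positions that are aligned to each other, directly or through a chain of alignments, collapse into a single graph position.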

📊 Example Workflow

# 1. Generate all-to-all alignments
minimap2 -cx asm20 -X sequences.fa sequences.fa > alignments.paf

# 2. Build the variation graph
seqwish -s sequences.fa -p alignments.paf -g graph.gfa -t 16 -P

# 3. Visualize (requires vg and graphviz)
vg view -dp graph.gfa | dot -Tpng > graph.png
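
As a quick sanity check on the resulting graph, the GFA output can be summarized without extra tooling. GFA v1 is line-oriented and tab-separated, with the record type in the first field (`S` for segments, `L` for links, `P` for paths). The sketch below counts those records; `GfaStats` and `gfa_stats` are hypothetical names, not part of seqwish.

```rust
use std::io::{self, BufRead};

/// Simple record counts for a GFA v1 file.
#[derive(Debug, Default, PartialEq)]
pub struct GfaStats {
    pub segments: usize,
    pub links: usize,
    pub paths: usize,
    /// Total number of bases stored on S lines.
    pub segment_bases: usize,
}

/// Count S, L, and P records from any buffered reader.
pub fn gfa_stats<R: BufRead>(reader: R) -> io::Result<GfaStats> {
    let mut stats = GfaStats::default();
    for line in reader.lines() {
        let line = line?;
        let mut fields = line.split('\t');
        match fields.next() {
            Some("S") => {
                stats.segments += 1;
                // S lines are: S <name> <sequence> [tags]; "*" means no sequence stored.
                if let Some(seq) = fields.nth(1) {
                    if seq != "*" {
                        stats.segment_bases += seq.len();
                    }
                }
            }
            Some("L") => stats.links += 1,
            Some("P") => stats.paths += 1,
            _ => {} // header and other record types are not counted
        }
    }
    Ok(stats)
}
```

To run it on `graph.gfa`, wrap a `std::fs::File` in `std::io::BufReader` and pass that to `gfa_stats`.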

🏗️ Use Cases

  • Pangenome construction - Build graphs from multiple related genomes
  • Structural variation - Capture large-scale genomic rearrangements
  • Population genomics - Represent variation across many samples
  • Reference graphs - Create enhanced reference structures

🔗 Related Tools

  • minimap2 - Generate PAF alignments
  • vg - Variation graph toolkit
  • odgi - Graph optimization and manipulation
  • gfaffix - GFA graph manipulation

📚 Citation

If you use seqwish, please cite:

Garrison E, Guarracino A. (2023)
Unbiased pangenome graphs
Bioinformatics, Volume 39, Issue 1, btac743
https://doi.org/10.1093/bioinformatics/btac743

🏛️ Implementation

This is a complete Rust reimplementation of seqwish using the "Ship of Theseus" pattern - incrementally replacing components while maintaining correctness. The Rust version:

  • ✅ Produces byte-for-byte identical output to C++
  • ✅ Passes all original test suites
  • ✅ Provides memory safety guarantees
  • ✅ Offers comparable performance

The original C++ implementation is preserved in cpp/ for reference. See README_CPP.md for C++ documentation.

🤝 Contributing

Contributions welcome! See CONTRIBUTING.md for guidelines.

📄 License

MIT - see LICENSE file

💬 Support


Made with 🦀 Rust

Dependencies

~6–13MB
~257K SLoC