3 unstable releases
Uses new Rust 2024
| 0.9.10 | Feb 9, 2026 |
|---|---|
| 0.1.0-beta.2 | Nov 22, 2023 |
jam-rs
Just another minhash (jam). A high-performance FracMinHash implementation for genomic sequence similarity analysis, optimized for searching plasmids, phages, and other small genomic elements in large datasets.
jam uses a custom hash function (jamhash) that provides lower collision rates, 2-10x higher speed and better uniformity than murmur3. It also includes a compact memory-mapped database format (.jam) for fast random access, and a bias filtering system based on Count-Min Sketches to selectively increase sensitivity for target sequences.
Installation
From crates.io:
cargo install jam-rs
From source:
cargo install --git https://github.com/St4NNi/jam-rs
Key Features
- Custom hash function: jamhash has fewer collisions, better uniformity, and higher speed than murmur3
- Bias-aware sketching: Count-Min Sketch based compositional filtering with automatic background extraction
- Complexity filtering: Shannon entropy threshold to exclude low-complexity k-mers
- Memory-efficient: External sorting for processing datasets larger than available RAM
- Compact storage: 256-bucket memory-mapped .jam format with binary fuse filters for fast random access
- Parallel execution: File-level parallelization via rayon with configurable thread count
- Tuned for speed: jemalloc allocator, LTO, single codegen unit, opt-level = 3
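To illustrate the complexity filter listed above, here is a minimal sketch of a Shannon entropy check on a single k-mer. It assumes the entropy is computed over single-nucleotide frequencies (so values range from 0.0 to 2.0 bits) and that a k-mer is kept only when its entropy is strictly above the cut-off. The function names are hypothetical and are not taken from the jam-rs source.

```rust
/// Shannon entropy (in bits) of the base composition of a k-mer.
/// Assumption: computed over single-nucleotide frequencies, so the result
/// is 0.0 for a homopolymer and 2.0 for a uniform A/C/G/T mix.
fn shannon_entropy(kmer: &[u8]) -> f64 {
    let mut counts = [0usize; 4];
    let mut total = 0usize;
    for &b in kmer {
        let idx = match b.to_ascii_uppercase() {
            b'A' => 0,
            b'C' => 1,
            b'G' => 2,
            b'T' => 3,
            _ => continue, // skip ambiguous bases such as N
        };
        counts[idx] += 1;
        total += 1;
    }
    if total == 0 {
        return 0.0;
    }
    counts
        .iter()
        .filter(|&&c| c > 0)
        .map(|&c| {
            let p = c as f64 / total as f64;
            -p * p.log2()
        })
        .sum()
}

/// Hypothetical helper: keep a k-mer only if its entropy is above the cut-off.
fn passes_complexity(kmer: &[u8], cutoff: f64) -> bool {
    shannon_entropy(kmer) > cutoff
}
```

Under these assumptions, a cut-off of 1.5 rejects homopolymer runs and simple dinucleotide repeats (entropy at most 1.0) while keeping k-mers that use all four bases in roughly equal proportion.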
Usage
$ jam --help
Just another (genomic) minhasher (jam), obviously blazingly fast
Usage: jam [OPTIONS] <COMMAND>
Commands:
sketch Sketch one or more files and write the result to an output file
dist Estimate containment of a query sequence against a sketch database
bias Build and analyze hash bias tables for filtering
stats Display statistics about a JAM database
help Print this message or the help of the given subcommand(s)
Options:
-t, --threads <THREADS> Number of threads to use [default: 1]
-f, --force Overwrite output files
-s, --silent Silent mode, no (additional) output to stdout
-m, --memory <MEMORY> Maximum memory usage in GB [default: 2]
-h, --help Print help
-V, --version Print version
Sketching
Create .jam databases from FASTA/FASTQ files (plain or gzip/bzip2/xz/zstd compressed). Supports single files, multiple files, or directories.
$ jam sketch --help
Sketch one or more files and write the result to an output file
Usage: jam sketch [OPTIONS] --output <OUTPUT> [INPUT]...
Arguments:
[INPUT]... Input file(s), directories, or file with list of files to be hashed
Options:
-o, --output <OUTPUT> Output file (.jam format)
-k, --kmer-size <KMER_SIZE> K-mer size, all sketches must have the same size to be compared and below 32 [default: 21]
--fscale <FSCALE> Scale the hash space to a minimum fraction of the maximum hash value (FracMinHash)
--complexity <COMPLEXITY> Complexity cut-off, only hash sequences with complexity above this value [default: 0.0]
--singleton Create a separate sketch for each sequence record
--temp-dir <TEMP_DIR> Custom temporary directory for intermediate files during sorting
--bias-table <BIAS_TABLE> Path to a bias table file (.bias) for compositional filtering
-h, --help Print help
Examples:
# Sketch a single file
jam sketch input.fasta -o sketch.jam
# Sketch a directory with 8 threads and FracMinHash scaling
jam sketch genomes/ -o db.jam --fscale 1000 -t 8
# Filter low-complexity k-mers by Shannon entropy
jam sketch genomes/ -o db.jam --fscale 1000 --complexity 1.5
# One sketch per sequence record
jam sketch multi.fasta -o db.jam --singleton
# Apply bias filtering during sketching
jam sketch plasmids/ -o filtered.jam --bias-table host_filter.bias
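The effect of --fscale can be sketched in a few lines. The example below is a simplified FracMinHash, not the jam-rs implementation: it uses the standard library's DefaultHasher as a stand-in for jamhash, assumes canonical k-mers (the lexicographic minimum of a k-mer and its reverse complement), and skips k-mers containing non-ACGT bases. All function names are hypothetical.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::Hasher;

/// Reverse complement of a DNA k-mer; None if a non-ACGT base is present.
fn reverse_complement(kmer: &[u8]) -> Option<Vec<u8>> {
    kmer.iter()
        .rev()
        .map(|b| match b.to_ascii_uppercase() {
            b'A' => Some(b'T'),
            b'C' => Some(b'G'),
            b'G' => Some(b'C'),
            b'T' => Some(b'A'),
            _ => None,
        })
        .collect()
}

/// Stand-in 64-bit hash (NOT jamhash): std's DefaultHasher with fixed keys.
fn hash_kmer(kmer: &[u8]) -> u64 {
    let mut h = DefaultHasher::new();
    h.write(kmer);
    h.finish()
}

/// FracMinHash: keep every canonical k-mer hash at or below u64::MAX / fscale.
/// Returns the retained hashes sorted and deduplicated.
fn frac_minhash(seq: &[u8], k: usize, fscale: u64) -> Vec<u64> {
    assert!(k > 0 && fscale > 0);
    let threshold = u64::MAX / fscale;
    let mut hashes: Vec<u64> = seq
        .windows(k)
        .filter_map(|kmer| {
            let fwd = kmer.to_ascii_uppercase();
            let rc = reverse_complement(&fwd)?; // skips k-mers with N etc.
            let canonical = if fwd <= rc { fwd } else { rc };
            let h = hash_kmer(&canonical);
            (h <= threshold).then_some(h)
        })
        .collect();
    hashes.sort_unstable();
    hashes.dedup();
    hashes
}
```

With a uniform hash, an fscale of 1000 retains roughly one in a thousand distinct k-mers, so sketch size grows with sequence length instead of being fixed as in classic MinHash.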
Querying
Estimate containment of query sequences against a sketch database.
$ jam dist --help
Estimate containment of a query sequence against a sketch database
Usage: jam dist [OPTIONS] --input <INPUT> --database <DATABASE>
Options:
-i, --input <INPUT> Input FASTA/FASTQ file to query
-d, --database <DATABASE> Database sketch (.jam file)
-o, --output <OUTPUT> Output to file instead of stdout
-c, --cutoff <CUTOFF> Cut-off value for similarity/containment [default: 0.0]
--singleton Singleton mode, process each query sequence separately
-h, --help Print help
Examples:
# Query against a database with a containment cutoff
jam dist -i query.fasta -d db.jam -c 0.1 -o results.tsv
# Per-sequence queries
jam dist -i multi_query.fasta -d db.jam --singleton -c 0.1
Output is tab-separated: query, sample_id, hit_count, containment.
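As a rough illustration of how the hit_count and containment columns relate, the sketch below intersects two sorted hash lists and assumes containment is the number of shared hashes divided by the number of query hashes. jam itself looks hashes up in the .jam database, so the merge-style intersection, the function names, and the four-decimal formatting are all assumptions made for this example.

```rust
use std::cmp::Ordering;

/// Count shared hashes between two sorted, deduplicated slices and return
/// (hit_count, containment). Assumption: containment = |Q ∩ R| / |Q|.
fn containment(query: &[u64], reference: &[u64]) -> (usize, f64) {
    let (mut i, mut j, mut hits) = (0, 0, 0usize);
    while i < query.len() && j < reference.len() {
        match query[i].cmp(&reference[j]) {
            Ordering::Less => i += 1,
            Ordering::Greater => j += 1,
            Ordering::Equal => {
                hits += 1;
                i += 1;
                j += 1;
            }
        }
    }
    let c = if query.is_empty() {
        0.0
    } else {
        hits as f64 / query.len() as f64
    };
    (hits, c)
}

/// Format one row as query, sample_id, hit_count, containment (tab-separated).
/// The number of decimals is an assumption, not jam's actual output format.
fn format_row(query: &str, sample_id: &str, hits: usize, containment: f64) -> String {
    format!("{query}\t{sample_id}\t{hits}\t{containment:.4}")
}
```

Because containment is normalized by the query size only, a small plasmid fully contained in a large database entry still scores close to 1.0, which is the property that makes containment more useful than Jaccard similarity for this kind of search.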
Bias Table Construction
Bias tables allow compositional filtering to increase sensitivity for target sequences while suppressing background noise. They work by scoring k-mers based on their enrichment in a positive (target) set relative to a negative (background) set.
The underlying data structure is a Count-Min Sketch (CMS), a probabilistic structure that approximates k-mer frequencies using multiple independent hash functions mapped to a fixed-width table. This keeps memory usage constant regardless of the number of distinct k-mers. By default, the CMS uses 1,048,576 columns and 5 hash functions (~5 MB).
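A minimal Count-Min Sketch along these lines might look as follows. This is a generic textbook version, not the structure used inside jam: the u32 counters, the row-seeded DefaultHasher standing in for independent hash functions, and all names are assumptions. The ~5 MB figure quoted above works out to one byte per cell, whereas this sketch spends four bytes per counter for simplicity.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Minimal Count-Min Sketch: `depth` rows of `width` saturating counters.
struct CountMinSketch {
    width: usize,
    depth: usize,
    table: Vec<u32>,
}

impl CountMinSketch {
    fn new(width: usize, depth: usize) -> Self {
        assert!(width > 0 && depth > 0);
        Self { width, depth, table: vec![0; width * depth] }
    }

    /// Flat index of the cell for `item` in `row`.
    /// The row index seeds the hash (stand-in for independent hash functions).
    fn cell(&self, row: usize, item: u64) -> usize {
        let mut h = DefaultHasher::new();
        (row as u64, item).hash(&mut h);
        row * self.width + (h.finish() as usize % self.width)
    }

    /// Increment one counter per row.
    fn add(&mut self, item: u64) {
        for row in 0..self.depth {
            let idx = self.cell(row, item);
            self.table[idx] = self.table[idx].saturating_add(1);
        }
    }

    /// Estimated count: the minimum across rows. Collisions can only
    /// inflate a counter, so the estimate never under-counts.
    fn estimate(&self, item: u64) -> u32 {
        (0..self.depth)
            .map(|row| self.table[self.cell(row, item)])
            .min()
            .unwrap_or(0)
    }
}
```

Memory is fixed at width times depth cells no matter how many distinct k-mer hashes are added; a wider table lowers the chance of collisions, and more rows make it less likely that every row collides for the same item.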
How it works:
- K-mer frequencies from both the positive and negative input sets are counted into separate CMS tables.
- Background extraction: The positive counts are subtracted from the negative counts (floored at zero). This prevents k-mers naturally shared between target and background from being penalized.
- A log-ratio weight is computed per CMS cell: log((pos + alpha) / (adjusted_neg + alpha)), where alpha is a smoothing parameter.
- Weights are quantized to i8 (-127 to +127) for compact storage.
- Threshold calibration: All 255 possible thresholds are evaluated. The threshold that maximizes fold enrichment (positive retention / negative retention) is selected. If a target fold enrichment is specified, the closest achievable threshold is used instead.
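The steps above can be sketched as follows. Only the background subtraction and the log-ratio formula come from the description; the rest is assumed for illustration: weights are scaled linearly by the largest absolute value before rounding to i8, a k-mer is retained when its weight is at or above the threshold, and thresholds that retain nothing from either set are skipped to avoid dividing by zero. Function names are hypothetical.

```rust
/// Log-ratio weight for one CMS cell after background extraction.
fn log_ratio(pos: u32, neg: u32, alpha: f64) -> f64 {
    // Subtract the target signal from the background count, floored at zero.
    let adjusted_neg = neg.saturating_sub(pos);
    ((pos as f64 + alpha) / (adjusted_neg as f64 + alpha)).ln()
}

/// Quantize weights to i8 in -127..=127.
/// Assumption: linear scaling by the largest absolute weight.
fn quantize(weights: &[f64]) -> Vec<i8> {
    let max_abs = weights.iter().fold(0.0_f64, |m, w| m.max(w.abs()));
    if max_abs == 0.0 {
        return vec![0; weights.len()];
    }
    weights
        .iter()
        .map(|w| (w / max_abs * 127.0).round().clamp(-127.0, 127.0) as i8)
        .collect()
}

/// Fraction of weights at or above `threshold` (assumed retention rule).
fn retention(weights: &[i8], threshold: i8) -> f64 {
    if weights.is_empty() {
        return 0.0;
    }
    let kept = weights.iter().filter(|&&w| w >= threshold).count();
    kept as f64 / weights.len() as f64
}

/// Evaluate all 255 thresholds and return (threshold, fold_enrichment)
/// for the one with the highest positive/negative retention ratio.
/// Thresholds retaining nothing from either set are skipped (assumption).
fn calibrate(pos: &[i8], neg: &[i8]) -> Option<(i8, f64)> {
    (-127i8..=127)
        .filter_map(|t| {
            let (p, n) = (retention(pos, t), retention(neg, t));
            (p > 0.0 && n > 0.0).then(|| (t, p / n))
        })
        .max_by(|a, b| a.1.total_cmp(&b.1))
}
```

Note how a k-mer seen 5 times in both sets gets an adjusted background of zero and therefore a positive weight, which is the "not penalized" behavior described in the background extraction step.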
$ jam bias create --help
Create a bias table from positive (target) and negative (background) FASTA files.
Target signal is always subtracted from background before computing bias weights.
Usage: jam bias create [OPTIONS] --positive <POSITIVE> --negative <NEGATIVE> --output <OUTPUT>
Options:
--positive <POSITIVE> Positive (target) FASTA file(s)
--negative <NEGATIVE> Negative (background) FASTA file(s)
-o, --output <OUTPUT> Output bias table file (.bias)
-k, --kmer-size <KMER_SIZE> K-mer size (must match sketch) [default: 21]
--fscale <FSCALE> FracMinHash scale (must match sketch) [default: 1000]
--cms-width <CMS_WIDTH> CMS columns, power of 2 recommended [default: 1048576]
--cms-depth <CMS_DEPTH> CMS hash functions [default: 5]
--alpha <ALPHA> Smoothing parameter for log-ratio [default: 1.0]
--fold-enrichment <FOLD_ENRICHMENT> Target fold enrichment (auto-maximized if not set)
--threads <THREADS> Number of threads
-h, --help Print help
Examples:
# Build a bias table to filter out host sequences
jam bias create --positive plasmids.fasta --negative host_genome.fasta -o host_filter.bias
# With custom fold enrichment target
jam bias create --positive targets.fasta --negative background.fasta -o filter.bias --fold-enrichment 10.0
# Inspect a bias table
jam bias stats filter.bias
jam bias stats filter.bias -o report.json
Statistics
Display database statistics including hash counts and distribution analysis.
$ jam stats --help
Display statistics about a JAM database
Usage: jam stats [OPTIONS] --input <INPUT>
Options:
-i, --input <INPUT> Input JAM database (.jam file)
--short Short summary only
--full Include the full entry statistics
-h, --help Print help
Examples:
jam stats -i db.jam --short
jam stats -i db.jam --full
License
This project is licensed under the MIT license. See the LICENSE file for more info.
Feedback & Contributions
If you have any ideas, suggestions, or issues, please don't hesitate to open an issue and/or PR. Contributions to this project are always welcome! We appreciate your help in making this project better.
Credits
This tool is inspired by finch-rs and sourmash. Check them out if you need a more mature ecosystem.