Onika is an efficient Rust MinHash sketcher and similarity engine built around compressed inverted indexes.
Please cite: Onika preprint – https://www.biorxiv.org/content/10.1101/2025.11.21.689685v1
git clone https://github.com/Malfoy/Onika.git
cd Onika
cargo build --release
# Binary is at target/release/Onika
target/release/Onika --help- Single FASTA/Q:
--input-fasta some_reads.fq[.gz|.zst]treats each record as one document. - File-of-files (FOF):
--input-fof samples.fofwheresamples.foflists one FASTA/Q path per line. The order defines the genome IDs used in outputs (0-indexed). - Both modes accept gzipped or zstd-compressed input transparently.
- Inspect the bundled sample FOF (two genomes in
data/):cat data/fof.txt
- Build a sketch index from the sample FOF:
./target/release/Onika sketch --input-fof data/fof.txt --k_size 31 --s_size 10 --w_size 16 --reorder-similarity -o index.bin
- Compare the built index against itself:
By default the output is Zstandard-compressed. Use
./target/release/Onika compare --ref-sketch index.bin --query-sketch index.bin -o out.tsv.zst
--zstd-level 0to write plain text.
Common
--threads <INT>: override the auto-detected thread count.--zstd-level <LEVEL>: compression level for the index (sketch) or comparison output;0disables compression.
sketch
-
--input-fof <FILE>|--input-fasta <FILE>: input source (one required). -
-o, --output <FILE>: where to write the sketch index (.binis typical). -
-k, --k-size <INT>: k-mer size (default 31). -
-s, --s-size <INT>: log2 of sketch partitions; sketch has2^sbuckets (default 10 → 1024 partitions). -
-w, --w-size <INT>: fingerprint width in bits; values are reduced into a2^wrange (default 16). -
--reorder-similarity: greedily reorder sketches by self-similarity.
compare
- Reference inputs:
--ref-sketch <FILE>|--ref-fof <FILE>|--ref-fasta <FILE>. - Query inputs:
--query-sketch <FILE>|--query-fof <FILE>|--query-fasta <FILE>. -o, --output <FILE>: destination for similarity results (Zstandard-compressed unless--zstd-level 0).--threshold <FLOAT>: similarity cutoff. Below this value hits are pruned in sparse mode; also controls how many bucket matches are required.--matrix: emit a full similarity matrix row per query (comma-separated floats). Without it, output is sparse.--prob-threshold-probability <FLOAT>: probabilistic pruning tail probability; set to0to disable the heuristic (default 0.001).
- Index (
.bin): binary file containing the sketch table and metadata; compressed per-position blocks with stream-vbyte deltas and optional Zstandard. Load with--ref-sketchor--query-sketch. - Sparse comparison (default): each line is
query_<qid>\tref_gid:score[,ref_gid:score...], sorted by score descending. IDs are zero-based in the order the inputs were read.scoreis the fraction of matching buckets (matches / sketch_size). - Matrix mode: each line is
query_<qid>\tfollowed by comma-separated similarity scores for every reference (values below the threshold are written as0.0000).
- Larger
s-sizeadds resolution but increases memory and runtime;w-sizecontrols fingerprint collision rate in the index. --reorder-similaritycan improve spatial locality of sketches before compression; leave it off for the fastest build.- Outputs and indexes are Zstandard-compressed by default; set
--zstd-level 0if you need plain text for downstream tools.
This project is licensed under the GNU Affero General Public License v3.0 (AGPL-3.0). See LICENSE for details.