A high-performance Rust tool for filtering DNA/RNA reads based on a set of reference k-mers. Inspired by BBDuk by Brian Bushnell. It provides fast, memory-efficient read processing and supports both paired and unpaired FASTA/FASTQ input, supplied either as multiple files or in interleaved format.
K-mer based read filtering:
- Reads are compared to reference sequences by matching k-mers.
- If a read sequence has at least x k-mers that are also found in the reference dataset, it is a match
- x is 1 by default; change it with `--minhits <int>`
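The matching rule above can be sketched in a few lines of Rust. This is only an illustration, not the tool's actual code: the function names `build_index` and `is_match` are hypothetical, k-mers are stored as plain byte vectors rather than packed bit k-mers, and reverse complements are ignored.

```rust
use std::collections::HashSet;

/// Collects every k-mer of every reference sequence into a set.
/// (Hypothetical helper; the real index layout is not described in this README.)
fn build_index(refs: &[&[u8]], k: usize) -> HashSet<Vec<u8>> {
    let mut index = HashSet::new();
    for seq in refs {
        if k == 0 || seq.len() < k {
            continue; // no k-mers can be drawn from this sequence
        }
        for kmer in seq.windows(k) {
            index.insert(kmer.to_vec());
        }
    }
    index
}

/// Returns true as soon as `min_hits` k-mers of `read` are found in `index`.
/// Assumes `min_hits >= 1`, mirroring the default of 1.
fn is_match(read: &[u8], index: &HashSet<Vec<u8>>, k: usize, min_hits: usize) -> bool {
    if k == 0 || read.len() < k {
        return false;
    }
    let mut hits = 0;
    for kmer in read.windows(k) {
        if index.contains(kmer) {
            hits += 1;
            if hits >= min_hits {
                return true; // stop early once the threshold is reached
            }
        }
    }
    false
}
```

Returning early once the threshold is reached means that with the default of one hit, most matching reads are decided after a handful of lookups.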
Piping:
- Reads from stdin by default (or use `--in -` explicitly)
- Use `--outm/--outu/--outm2/--outu2` `stdout.fa`/`stdout.fq` to pipe results to stdout
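One way a tool could route output to stdout based on these special names is sketched below. The helper names `is_stdout_target` and `open_output` are assumptions made for illustration; how Nucleaze actually selects its writer is not shown in this README.

```rust
use std::fs::File;
use std::io::{self, BufWriter, Write};

/// Returns true when the output name requests stdout (`stdout.fa` or `stdout.fq`).
fn is_stdout_target(path: &str) -> bool {
    matches!(path, "stdout.fa" | "stdout.fq")
}

/// Opens a buffered writer on stdout or on a regular file (hypothetical helper).
fn open_output(path: &str) -> io::Result<Box<dyn Write>> {
    if is_stdout_target(path) {
        Ok(Box::new(BufWriter::new(io::stdout())))
    } else {
        Ok(Box::new(BufWriter::new(File::create(path)?)))
    }
}
```

Returning a boxed `Write` lets the rest of the pipeline stay unaware of whether it is writing to a file or to a pipe.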
Paired reads support:
- Paired inputs and outputs can be specified by adding more input/output files
- Interleaved inputs or outputs; signify interleaved input with `--interinput`
- Automatic detection of input/output mode
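To make the interleaved layout concrete, the sketch below groups a stream ordered R1, R2, R1, R2, ... into mate pairs. The `Read` struct and `pair_interleaved` function are hypothetical stand-ins; the tool's own record type comes from its parser and is not shown here.

```rust
/// A minimal read record used only for this illustration.
#[derive(Debug, Clone, PartialEq)]
struct Read {
    id: String,
    seq: String,
}

/// Groups an interleaved stream (R1, R2, R1, R2, ...) into mate pairs.
/// Returns an error if the last read has no mate.
fn pair_interleaved(reads: Vec<Read>) -> Result<Vec<(Read, Read)>, String> {
    if reads.len() % 2 != 0 {
        return Err(format!(
            "interleaved input has an odd number of reads ({})",
            reads.len()
        ));
    }
    let mut pairs = Vec::with_capacity(reads.len() / 2);
    let mut iter = reads.into_iter();
    // Take two records at a time: the first is R1, the second is its mate R2.
    while let (Some(r1), Some(r2)) = (iter.next(), iter.next()) {
        pairs.push((r1, r2));
    }
    Ok(pairs)
}
```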
Multithreading with Rayon:
- Adjustable thread count via the `--threads` argument
- Defaults to all available CPU cores
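The default-to-all-cores behaviour can be sketched as follows. This version uses the standard library's `available_parallelism` as a stand-in for the Num-Cpus crate listed among the dependencies, and `resolve_threads` is a hypothetical name.

```rust
use std::num::NonZeroUsize;
use std::thread;

/// Picks the worker count: the requested value if positive, otherwise
/// the number of threads the system reports (falling back to 1).
fn resolve_threads(requested: Option<usize>) -> usize {
    match requested {
        Some(n) if n > 0 => n,
        _ => thread::available_parallelism()
            .map(NonZeroUsize::get)
            .unwrap_or(1),
    }
}
```

The resulting number would then be handed to whatever thread pool is in use, for example when configuring a Rayon pool.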
Memory Limit:
- Specify maximum memory usage with `--maxmem <String>` (e.g., `5G` for 5 gigabytes, `500M` for 500 megabytes)
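A size string like `5G` or `500M` has to be turned into a byte count somewhere. The sketch below shows one plausible parser; `parse_mem` is a hypothetical name, and treating the suffixes as binary multiples (1K = 1024 bytes) is an assumption, since the README does not say whether the tool uses binary or decimal units.

```rust
/// Parses sizes such as "5G" or "500M" into bytes.
/// Assumption: suffixes are binary multiples; a bare number means bytes.
fn parse_mem(s: &str) -> Result<u64, String> {
    let s = s.trim();
    let (digits, multiplier) = match s.chars().last() {
        Some(c) if c.is_ascii_alphabetic() => {
            let mult: u64 = match c.to_ascii_uppercase() {
                'K' => 1 << 10,
                'M' => 1 << 20,
                'G' => 1 << 30,
                'T' => 1 << 40,
                other => return Err(format!("unknown size suffix '{}'", other)),
            };
            // The suffix is ASCII, so it occupies exactly one byte.
            (&s[..s.len() - 1], mult)
        }
        Some(_) => (s, 1),
        None => return Err("empty size".to_string()),
    };
    let value: u64 = digits
        .parse()
        .map_err(|e| format!("invalid number '{}': {}", digits, e))?;
    value
        .checked_mul(multiplier)
        .ok_or_else(|| "size overflows u64".to_string())
}
```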
Automatic Reference Indexing:
- Builds a serialized reference k-mer index using Bincode from the references given with `--ref <file>` when `--binref <file>` is provided
- Uses the saved index on subsequent runs if `--binref <file>` is included
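The build-once, reuse-later pattern can be sketched as below. To keep the example free of external crates, it writes k-mers as raw little-endian `u64` values instead of using Bincode, so the file format here is a stand-in and not the tool's real one. All function names are hypothetical.

```rust
use std::collections::HashSet;
use std::fs;
use std::io;
use std::path::Path;

/// Writes each k-mer as 8 little-endian bytes (stand-in for Bincode).
fn save_index(path: &Path, index: &HashSet<u64>) -> io::Result<()> {
    let mut bytes = Vec::with_capacity(index.len() * 8);
    for kmer in index {
        bytes.extend_from_slice(&kmer.to_le_bytes());
    }
    fs::write(path, bytes)
}

/// Reads the file back into a set, rejecting truncated input.
fn load_index(path: &Path) -> io::Result<HashSet<u64>> {
    let bytes = fs::read(path)?;
    if bytes.len() % 8 != 0 {
        return Err(io::Error::new(
            io::ErrorKind::InvalidData,
            "truncated index file",
        ));
    }
    Ok(bytes
        .chunks_exact(8)
        .map(|chunk| {
            let mut buf = [0u8; 8];
            buf.copy_from_slice(chunk);
            u64::from_le_bytes(buf)
        })
        .collect())
}

/// Loads the saved index if the file exists; otherwise builds it
/// with `build` and saves it for subsequent runs.
fn load_or_build<F>(binref: &Path, build: F) -> io::Result<HashSet<u64>>
where
    F: FnOnce() -> HashSet<u64>,
{
    if binref.exists() {
        load_index(binref)
    } else {
        let index = build();
        save_index(binref, &index)?;
        Ok(index)
    }
}
```

On a second call with the same path the `build` closure is never invoked, which is the point of keeping a serialized index around for large reference sets.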
If using UNIX, run this command and follow the ensuing instructions:

`curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh`

If using Windows, download the correct installer from Rustup.

`brew install nucleaze`

Nucleaze requires at least one of `--ref` or `--binref` to be provided. Input defaults to stdin if `--in` is not specified.
See more parameter documentation at ./src/main.rs
`./nucleaze --in reads.fq --ref refs.fa --outm matched.fq --outu unmatched.fq --k 21`

This command:
- Builds a 21-mer index from the `refs.fa` sequences
- Reads input reads from `reads.fq` in chunks of 10,000
- Processes each read into 21-mers and checks them against the reference index
- Outputs matched reads to `matched.fq` and unmatched reads to `unmatched.fq`
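The steps above can be approximated with a small, single-threaded sketch. It packs each k-mer into a `u64` at 2 bits per base, which is one common way to represent k-mers up to length 32, and then splits reads into matched and unmatched groups one chunk at a time. The names `encode_kmers`, `filter_reads` and `CHUNK_SIZE` are hypothetical, reverse complements are ignored, and the real tool processes chunks in parallel with Rayon rather than sequentially.

```rust
use std::collections::HashSet;

/// Number of reads handled per batch, taken from the description above.
const CHUNK_SIZE: usize = 10_000;

/// Encodes every k-mer of `seq` at 2 bits per base (A=0, C=1, G=2, T/U=3).
/// K-mers spanning any other character (such as N) are skipped. Requires 1 <= k <= 32.
fn encode_kmers(seq: &[u8], k: usize) -> Vec<u64> {
    assert!(k >= 1 && k <= 32, "k must be between 1 and 32");
    let mask: u64 = if k == 32 { u64::MAX } else { (1u64 << (2 * k)) - 1 };
    let mut out = Vec::new();
    let mut kmer = 0u64;
    let mut valid = 0usize; // consecutive valid bases seen so far
    for &base in seq {
        let code = match base.to_ascii_uppercase() {
            b'A' => 0,
            b'C' => 1,
            b'G' => 2,
            b'T' | b'U' => 3,
            _ => {
                // Ambiguous base: restart the rolling window.
                valid = 0;
                kmer = 0;
                continue;
            }
        };
        kmer = ((kmer << 2) | code) & mask;
        valid += 1;
        if valid >= k {
            out.push(kmer);
        }
    }
    out
}

/// Splits reads into (matched, unmatched), working through CHUNK_SIZE reads at a time.
fn filter_reads<'a>(
    reads: &[&'a [u8]],
    index: &HashSet<u64>,
    k: usize,
    min_hits: usize,
) -> (Vec<&'a [u8]>, Vec<&'a [u8]>) {
    let mut matched = Vec::new();
    let mut unmatched = Vec::new();
    for chunk in reads.chunks(CHUNK_SIZE) {
        for &read in chunk {
            let hits = encode_kmers(read, k)
                .iter()
                .filter(|kmer| index.contains(*kmer))
                .count();
            if hits >= min_hits {
                matched.push(read);
            } else {
                unmatched.push(read);
            }
        }
    }
    (matched, unmatched)
}
```

Packing k-mers into integers keeps the index compact and makes lookups cheap, which matters when the reference set contains many millions of k-mers.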
This project is licensed under the MIT License; see LICENSE for details.

There is lots of room for improvement here, so new additions or suggestions are welcome!
- Needletail — FASTA/FASTQ file parsing and bitkmer operations
- Bincode — K-mer hashset serialization/deserialization
- Rayon — Multithreading
- Clap — CLI
- Num-Cpus — detection of available threads
- Sysinfo — system memory information
- Crossbeam — asynchronous channels