UPDATE: JULY 2019
I no longer work for UKRI. As a result, all versions of HULK pre 1.0.0 have been renamed and archived to the UKRI GitHub.
This repo now hosts HULK >= version 1.0.0, which is a complete re-implementation of HULK based solely on the method described in the open-access paper.
I've tried to keep much of the syntax and existing functionality, but make sure to check the change log below. It's a work in progress, but the master branch should be a close drop-in replacement for the old HULK (for sketching at least). There are a few algorithmic differences, mainly that HULK now uses minimizer frequencies to represent the underlying microbiome sample.
Importantly, this project is now fully open source and I can develop freely on it!
HULK is a tool that creates small, fixed-size sketches from streaming microbiome sequencing data, enabling rapid metagenomic dissimilarity analysis. HULK approximates a k-mer spectrum from a FASTQ data stream, incrementally sketches it and makes similarity search queries against other microbiome sketches.
HULK works by collecting minimizers from sequences. Minimizers are assigned to a finite number of histogram bins using a consistent jump hash; these bins are incremented as their corresponding minimizers are found. At set intervals (i.e. after X sequences have been processed), the bins are histosketched by HULK. Similarly to MinHash sketches, histosketches can be used to estimate similarity between sequence data sets.
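The binning step described above can be sketched roughly as follows. This is a minimal illustration, not HULK's actual code: the function names (`hashKmer`, `minimizers`, `binMinimizers`), the FNV-1a hash, the window scheme and the parameter values are all assumptions made for the example. The `jumpHash` function follows the published jump consistent hash algorithm (Lamping and Veach).

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// hashKmer returns a 64-bit FNV-1a hash of a k-mer.
// The choice of hash function is an assumption for this example.
func hashKmer(kmer string) uint64 {
	h := fnv.New64a()
	h.Write([]byte(kmer))
	return h.Sum64()
}

// minimizers returns, for every window of w consecutive k-mers, the hash of
// the smallest-hashing k-mer. A k-mer position that wins several consecutive
// windows is reported only once.
func minimizers(seq string, k, w int) []uint64 {
	if k <= 0 || w <= 0 || len(seq)-k+1 < w {
		return nil
	}
	numKmers := len(seq) - k + 1
	hashes := make([]uint64, numKmers)
	for i := range hashes {
		hashes[i] = hashKmer(seq[i : i+k])
	}
	var out []uint64
	lastPos := -1
	for start := 0; start+w <= numKmers; start++ {
		minPos := start
		for i := start + 1; i < start+w; i++ {
			if hashes[i] < hashes[minPos] {
				minPos = i
			}
		}
		if minPos != lastPos {
			out = append(out, hashes[minPos])
			lastPos = minPos
		}
	}
	return out
}

// jumpHash maps a 64-bit key to a bucket in [0, numBuckets) using the
// jump consistent hash algorithm. numBuckets must be positive.
func jumpHash(key uint64, numBuckets int32) int32 {
	var b int64 = -1
	var j int64
	for j < int64(numBuckets) {
		b = j
		key = key*2862933555777941757 + 1
		j = int64(float64(b+1) * (float64(int64(1)<<31) / float64((key>>33)+1)))
	}
	return int32(b)
}

// binMinimizers increments the histogram bin chosen by jumpHash for each
// minimizer found in seq.
func binMinimizers(seq string, k, w int, bins []uint32) {
	if len(bins) == 0 {
		return
	}
	for _, m := range minimizers(seq, k, w) {
		bins[jumpHash(m, int32(len(bins)))]++
	}
}

func main() {
	bins := make([]uint32, 16) // fixed-size histogram (size chosen arbitrarily)
	binMinimizers("ACGTTGCATGCCGATAGCTAGGCTA", 7, 4, bins)
	fmt.Println(bins)
}
```

Because the number of bins is fixed up front, memory use stays constant however many sequences pass through the stream.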
The advantages of HULK include:
- it's fast and can run on a laptop
- hulk sketches are compact, fixed size and incorporate k-mer frequency information
- it works on data streams and does not require complete data instances
- it can use concept drift for histosketching
- you get to type `hulk smash` into the command line...
Finally, you can use hulk sketches with a machine learning classifier to predict microbiome sample origin (see the paper and BANNER).
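The concept drift point in the list above is about down-weighting older observations so that a sketch tracks a changing stream. One generic way to do this is exponential decay of the histogram at each interval. The sketch below illustrates that idea only: `decayBins`, `streamWithDecay`, the decay factor and the toy stream are hypothetical and are not taken from HULK.

```go
package main

import "fmt"

// decayBins multiplies every bin by a decay factor in (0, 1], so counts from
// earlier intervals contribute less than recent ones.
func decayBins(bins []float64, decay float64) {
	for i := range bins {
		bins[i] *= decay
	}
}

// streamWithDecay processes successive intervals of bin hits. Before each
// interval is added, the existing histogram is decayed.
// Bin indices outside [0, numBins) are ignored.
func streamWithDecay(intervals [][]int, numBins int, decay float64) []float64 {
	bins := make([]float64, numBins)
	for _, hits := range intervals {
		decayBins(bins, decay)
		for _, b := range hits {
			if b >= 0 && b < numBins {
				bins[b]++
			}
		}
	}
	return bins
}

func main() {
	// Hypothetical stream: the bins hit in three successive intervals.
	intervals := [][]int{{0, 0, 1}, {1, 2}, {3, 3, 3}}
	fmt.Println(streamWithDecay(intervals, 4, 0.5)) // prints [0.5 0.75 0.5 3]
}
```

With a decay factor of 1.0 nothing is forgotten and the histogram is a plain running count. Smaller values make the histogram favour the most recent part of the stream.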
- WASM interface
  - run HULK locally and from a browser
  - based on my baby-GROOT user interface
- HULK will output additional sketches
  - KMV MinHash
  - HyperMinHash
- Indexing
  - re-implementation of the LSH Forest index
- fully re-written codebase
  - I've aimed for it to be largely backwards compatible with previous releases
- fully open-sourced!
  - MIT license (OSI approved)
- algorithm changes
  - underlying histogram is now based on minimizer frequencies
  - count-min sketch for k-mer frequencies is now replaced with a fixed-size array and a jump-hash for minimizer placement
- changes to the `sketch` subcommand:
  - sketches saved to JSON by default (ala sourmash)
  - histosketch count-min sketch is no longer configurable by the user (this was Epsilon and Delta)
  - spectrum size is determined based on k-mer size
  - minCount for k-mer frequencies is removed
- changes to the `smash` subcommand:
  - operates on JSON input
  - outputs matrix as csv
- replaced some unnecessary features
  - the functionality of the `print` and `distance` subcommands is available in the `smash` subcommand
- all versions of HULK (and BANNER) pre v1.0.0 have been moved to the UKRI GitHub and renamed. I can no longer work on these code bases.
Check out the releases to download a binary. Alternatively, install using Bioconda or compile the software from source.
For versions <1.0.0, use bioconda. I will add the recipe for HULK 1.0.0 asap.
```bash
conda install -c bioconda hulk
```

HULK is written in Go (v1.12). To compile from source you will first need the Go toolchain. Once you have it, try something like this to compile:

```bash
# Clone this repository
git clone https://github.com/will-rowe/hulk.git
# Go into the repository and get the package dependencies
cd hulk
go get -d -t -v ./...
# Run the unit tests
go test -v ./...
# Compile the program
go build ./
# Call the program
./hulk --help
```

HULK is called by typing `hulk`, followed by the subcommand you wish to run. The main subcommands are `sketch` and `smash`:
```bash
# Create a hulk sketch
gunzip -c microbiome.fq.gz | hulk sketch -o sketches/sampleA
# Get a pairwise weighted Jaccard similarity matrix for a set of hulk histosketches
hulk smash -k 31 -m weightedjaccard -d ./sketches -o myOutfile
```

I'm working on some new documentation and this will be available on readthedocs soon.
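To give a feel for the kind of pairwise matrix the `smash` example above produces, here is a minimal sketch of the textbook weighted Jaccard similarity, sum of element-wise minima over sum of element-wise maxima, written out as CSV. It makes several assumptions: the measure is applied to plain frequency vectors rather than to real histosketches, and the function names, sample names and CSV layout are hypothetical rather than HULK's actual output format.

```go
package main

import (
	"encoding/csv"
	"errors"
	"io"
	"math"
	"os"
	"strconv"
)

// weightedJaccard returns sum(min(a_i, b_i)) / sum(max(a_i, b_i)) for two
// equal-length, non-negative vectors. Two all-zero vectors score 0 here.
func weightedJaccard(a, b []float64) (float64, error) {
	if len(a) != len(b) {
		return 0, errors.New("vectors must have equal length")
	}
	var minSum, maxSum float64
	for i := range a {
		minSum += math.Min(a[i], b[i])
		maxSum += math.Max(a[i], b[i])
	}
	if maxSum == 0 {
		return 0, nil
	}
	return minSum / maxSum, nil
}

// writeMatrix writes a pairwise similarity matrix as CSV, with sample names
// in the header row and in the first column (an assumed layout).
func writeMatrix(w io.Writer, names []string, vecs [][]float64) error {
	if len(names) != len(vecs) {
		return errors.New("need exactly one name per vector")
	}
	cw := csv.NewWriter(w)
	if err := cw.Write(append([]string{""}, names...)); err != nil {
		return err
	}
	for i, name := range names {
		row := []string{name}
		for j := range vecs {
			sim, err := weightedJaccard(vecs[i], vecs[j])
			if err != nil {
				return err
			}
			row = append(row, strconv.FormatFloat(sim, 'f', 4, 64))
		}
		if err := cw.Write(row); err != nil {
			return err
		}
	}
	cw.Flush()
	return cw.Error()
}

func main() {
	names := []string{"sampleA", "sampleB"}
	vecs := [][]float64{{3, 1, 0, 2}, {1, 1, 2, 2}} // toy frequency vectors
	if err := writeMatrix(os.Stdout, names, vecs); err != nil {
		panic(err)
	}
}
```

For the two toy vectors the minima sum to 4 and the maxima to 8, so the off-diagonal similarity is 0.5 and the diagonal is 1.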
A paper describing the HULK method is published in Microbiome:
Rowe WPM et al. Streaming histogram sketching for rapid microbiome analytics. Microbiome. 2019.