will-rowe/hulk

Histosketching Using Little Kmers



UPDATE: JULY 2019

I no longer work for UKRI. As a result, all versions of HULK pre 1.0.0 have been renamed and archived to the UKRI github.

This repo now hosts HULK >= version 1.0.0, which is a complete re-implementation of HULK based solely on the method described in the open-access paper.

I've tried to keep much of the syntax and existing functionality, but make sure to check the change log below. It's a work in progress, but the master branch should be a close drop-in replacement for the old HULK (for sketching at least). There are a few algorithmic differences, mainly that HULK now uses minimizer frequencies to represent the underlying microbiome sample.

Importantly, this project is now fully open source and I can develop freely on it!

Overview

HULK is a tool that creates small, fixed-size sketches from streaming microbiome sequencing data, enabling rapid metagenomic dissimilarity analysis. HULK approximates a k-mer spectrum from a FASTQ data stream, incrementally sketches it and makes similarity search queries against other microbiome sketches.

HULK works by collecting minimizers from sequences. Minimizers are assigned to a finite number of histogram bins using a consistent jump hash; these bins are incremented as their corresponding minimizers are found. At set intervals (i.e. after X sequences have been processed), the bins are histosketched by HULK. Similarly to MinHash sketches, histosketches can be used to estimate similarity between sequence data sets.
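The binning step described above can be sketched roughly as follows. This is an illustration, not HULK's actual code: the function names (jumpHash, hashKmer, minimizers), the FNV-1a k-mer hash, the window size and the bin count are all assumptions made for the example, and reverse complements are ignored. Only jumpHash follows a published algorithm (the jump consistent hash of Lamping and Veach). The histosketching of the bins is not shown.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// jumpHash is the jump consistent hash of Lamping and Veach. It maps a
// 64-bit key to a bucket in [0, numBuckets) without storing a lookup table.
func jumpHash(key uint64, numBuckets int) int {
	var b, j int64 = -1, 0
	for j < int64(numBuckets) {
		b = j
		key = key*2862933555777941757 + 1
		j = int64(float64(b+1) * (float64(int64(1)<<31) / float64((key>>33)+1)))
	}
	return int(b)
}

// hashKmer hashes a k-mer with FNV-1a (an assumption for this example;
// any reasonable 64-bit hash would do).
func hashKmer(kmer string) uint64 {
	h := fnv.New64a()
	h.Write([]byte(kmer))
	return h.Sum64()
}

// minimizers returns the minimizer hashes of seq: for every window of w
// consecutive k-mers, the k-mer with the smallest hash is selected. A
// minimizer shared by neighbouring windows is reported once.
func minimizers(seq string, k, w int) []uint64 {
	numKmers := len(seq) - k + 1
	if k <= 0 || w <= 0 || numKmers < w {
		return nil
	}
	hashes := make([]uint64, numKmers)
	for i := range hashes {
		hashes[i] = hashKmer(seq[i : i+k])
	}
	var mins []uint64
	lastPos := -1
	for start := 0; start+w <= numKmers; start++ {
		minPos := start
		for i := start + 1; i < start+w; i++ {
			if hashes[i] < hashes[minPos] {
				minPos = i
			}
		}
		if minPos != lastPos {
			mins = append(mins, hashes[minPos])
			lastPos = minPos
		}
	}
	return mins
}

func main() {
	const numBins = 16 // hypothetical histogram size
	bins := make([]uint32, numBins)

	// Two made-up reads standing in for a FASTQ stream.
	reads := []string{"ACGTTGCATGCCGATAGCTA", "TTGCATGCCGATAGCTAGGC"}
	for _, read := range reads {
		for _, m := range minimizers(read, 7, 4) {
			bins[jumpHash(m, numBins)]++ // increment the bin for this minimizer
		}
	}
	fmt.Println(bins)
}
```

Because the jump hash needs no table, the histogram stays a plain fixed-size array regardless of how many distinct minimizers are seen in the stream.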

The advantages of HULK include:

  • it's fast and can run on a laptop
  • hulk sketches are compact, fixed size and incorporate k-mer frequency information
  • it works on data streams and does not require complete data instances
  • it can use concept drift for histosketching
  • you get to type hulk smash into the command line...

Finally, you can use hulk sketches with a machine learning classifier to predict microbiome sample origin (see the paper and BANNER).

Change log

version 1.0.1 (dev branch)

  • WASM interface
    • run HULK locally and from a browser
    • based on my baby-GROOT user interface
  • HULK will output additional sketches
    • KMV MinHash
    • HyperMinHash
  • Indexing
    • re-implementation of the LSH Forest index

version 1.0.0 (current release)

  • fully re-written codebase
    • I've aimed for it to be largely backwards compatible with previous releases
  • fully open-sourced!
  • algorithm changes
    • underlying histogram is now based on minimizer frequencies
    • count-min sketch for k-mer frequencies is now replaced with a fixed-size array and a jump-hash for minimizer placement
  • changes to the sketch subcommand:
    • sketches saved to JSON by default (à la sourmash)
    • histosketch count-min sketch is no longer configurable by the user (this was Epsilon and Delta)
    • spectrum size is determined based on k-mer size
    • minCount for k-mer frequencies is removed
  • changes to the smash subcommand:
    • operates on JSON input
    • outputs matrix as csv
  • replaced some unnecessary features
    • the functionality of the print and distance subcommands is available in the smash subcommand

pre version 1.0.0

  • all versions of HULK (and BANNER) pre v1.0.0 have been moved to the UKRI github and renamed. I can no longer work on these code bases.

Installation

Check out the releases to download a binary. Alternatively, install using Bioconda or compile the software from source.

Bioconda

For versions <1.0.0, use bioconda. I will add the recipe for HULK 1.0.0 asap.

conda install -c bioconda hulk

Source

HULK is written in Go (v1.12) - to compile from source you will first need the Go tool chain. Once you have it, try something like this to compile:

# Clone this repository
git clone https://github.com/will-rowe/hulk.git

# Go into the repository and get the package dependencies
cd hulk
go get -d -t -v ./...

# Run the unit tests
go test -v ./...

# Compile the program
go build ./

# Call the program
./hulk --help

Quick Start

HULK is called by typing hulk, followed by the subcommand you wish to run. The main subcommands are sketch and smash:

# Create a hulk sketch
gunzip -c microbiome.fq.gz | hulk sketch -o sketches/sampleA

# Get a pairwise weighted Jaccard similarity matrix for a set of hulk histosketches
hulk smash -k 31 -m weightedjaccard -d ./sketches -o myOutfile
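For reference, the weighted Jaccard similarity named in the smash command above can be written in a few lines of Go. This is the textbook definition computed on raw histogram counts (sum of element-wise minima over sum of element-wise maxima), shown only to make the metric concrete. It is not HULK's implementation, which compares histosketches, and the function name weightedJaccard and the example vectors are made up for illustration.

```go
package main

import (
	"fmt"
	"math"
)

// weightedJaccard returns sum(min(a_i, b_i)) / sum(max(a_i, b_i)) for two
// non-negative vectors of equal length. The result lies in [0, 1].
func weightedJaccard(a, b []float64) (float64, error) {
	if len(a) != len(b) {
		return 0, fmt.Errorf("length mismatch: %d vs %d", len(a), len(b))
	}
	var sumMin, sumMax float64
	for i := range a {
		if a[i] < 0 || b[i] < 0 {
			return 0, fmt.Errorf("negative weight at index %d", i)
		}
		sumMin += math.Min(a[i], b[i])
		sumMax += math.Max(a[i], b[i])
	}
	if sumMax == 0 {
		// Convention chosen for this example: two empty histograms are identical.
		return 1, nil
	}
	return sumMin / sumMax, nil
}

func main() {
	// Hypothetical bin counts for two samples.
	sampleA := []float64{2, 0, 1, 3}
	sampleB := []float64{1, 1, 1, 0}

	sim, err := weightedJaccard(sampleA, sampleB)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%.3f\n", sim) // prints 0.286 (2/7)
}
```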

Further Information & Citing

I'm working on some new documentation and this will be available on readthedocs soon.

A paper describing the HULK method is published in Microbiome:

Rowe WPM et al. Streaming histogram sketching for rapid microbiome analytics. Microbiome. 2019.
