StrainEst is a novel, reference-based method that uses the Single Nucleotide Variants (SNV) profiles of the available genomes of selected species to determine the number and identity of coexisting strains and their relative abundances in mixed metagenomic samples.
The easiest way to run StrainEst is through Docker. Docker works similarly to a virtual machine image, providing a container in which all the software has already been installed, configured and tested.
Download the latest version of StrainEst:
docker pull compmetagen/strainest
Run an instance of the image, mounting the host working directory (e.g.
/Users/davide/strainest) on to the container working directory/strainest:docker run --rm -t -i -v /Users/davide/strainest:/strainest -w /strainest compmetagen/strainest /bin/bash
You need to write something like
-v //c/Users/davide/strainest:/strainestif you are in Windows or-v /home/davide/micca:/miccain Linux. The--rmoption automatically removes the container when it exits.Now you can use strainest:
root@68f6784e1101:/micca# strainest --help
Sickle, Bowtie2 and samtools are preinstalled in the Docker image.
We suggest to install the following packages through the package manager:
sudo apt-get update
sudo apt-get install build-essential \
pkg-config \
python2.7 \
python-dev \
python-pip \
python-numpy \
python-scipy \
python-matplotlib \
gcc \
gfortran \
libblas-dev \
liblapack-dev \
libfreetype6 libfreetype6-dev \
libpng-dev \
liblzma-dev \
libbz2-devThen, upgrade pip and install the following packages:
sudo pip install --upgrade pip
pip install 'Click>=5.1' 'pandas' 'pysam>=0.12' 'scikit-learn>=0.16.1,<0.20' 'biopython>=1.50'Download the latest version from https://github.com/compmetagen/strainest/releases and complete the installation:
tar -zxvf strainest-X.Y.Z.tar.gz
cd strainest-X.Y.Z
sudo python setup.py installThis tutorial requires Sickle (https://github.com/najoshi/sickle), Bowtie2 (http://bowtie-bio.sourceforge.net/bowtie2/index.shtml) and samtools (http://samtools.sourceforge.net/) to be installed in your system.
Download the example data (Illumina paired-end reads):
wget ftp://ftp.fmach.it/metagenomics/strainest/example/reads.tar.gz
tar zxvf reads.tar.gzNow the raw reads will be quality trimmed (e.g. using sickle):
sickle pe -f reads1.fastq -r reads2.fastq -t sanger -o \
reads1.trim.fastq -p reads2.trim.fastq -s reads.singles.fastq -q 20Given the species of interest (e.g. P. acnes), download and untar the
precomputed Bowtie reference database available at
ftp://ftp.fmach.it/metagenomics/strainest/ref/ (e.g. pacnes.tar.gz):
wget ftp://ftp.fmach.it/metagenomics/strainest/ref/pacnes.tar.gz
tar zxvf pacnes.tar.gzThe Bowtie2 database is available in the P_acnes/bowtie directory. At this
point we can align the metagenome against the database:
bowtie2 --very-fast --no-unal -x P_acnes/bowtie/align -1 reads1.trim.fastq \
-2 reads2.trim.fastq -S reads.samNow we can sort and index the BAM file:
samtools view -b reads.sam > reads.bam
samtools sort reads.bam -o reads.sorted.bam
samtools index reads.sorted.bamFinally, run the strainest est command to predict the strain abundances:
strainest est P_acnes/snp_clust.dgrp reads.sorted.bam outputdirIn the output directory we can find:
- abund.txt
- the predicted abundances for each reference genome;
- max_ident.txt
- for each reference genome, the percentage of alleles that are present in the metagenome;
- info.txt
- information about the prediction, including the prediction Pearson R;
- counts.txt
- number of counts for each SNV position/base pairs;
- mse.pdf
- Lasso cross-validation plot as a function of the shrinkage coefficient.
For a detailed description of the methods to build a custom reference database see the Methods section of the paper.
Reference databases for a number of species are available at ftp://ftp.fmach.it/metagenomics/strainest/ref2/
Each tar.gz archive contains the following files:
- assembly_list.txt
- tab separated table containing a list of the sequences that we have used with basic info. The column “Rep” tells you if it is one of the representative strains and the column ClustID tells you in which cluster it is.
- log.txt
- basic info on the clustering procedure, i.e number of clusters, number of SNVs and % of core genome
- map_align.fasta
- contains the reference sequences for read alignment. Use this file to build the bowtie or bwa database. For all species we used the default, i.e. 10 assemblies (if available)
- snv.txt
- contains the snv matrix for the "strainest est” command.
The list of species for which pre-built databases are available is given below, with the number of genomes used to build the database, the number of representative SNV profiles, the number of SNVs and the percentage of the reference genome that is common to all genomes (core)
| Species | # of genomes | # of repr | # of SNVs | % of core |
|---|---|---|---|---|
| Bacteroides dorei | 13 | 10 | 35506 | 46.580 |
| Bacteroides eggerthii | 4 | 3 | 24686 | 70.537 |
| Bacteroides fragilis | 124 | 49 | 176883 | 21.561 |
| Bacteroides intestinalis | 6 | 4 | 205290 | 27.325 |
| Bacteroides ovatus | 17 | 11 | 220623 | 46.294 |
| Bacteroides thetaiotaomicron | 17 | 14 | 370873 | 49.412 |
| Bacteroides uniformis | 23 | 16 | 97560 | 48.607 |
| Bacteroides vulgatus | 17 | 16 | 32365 | 40.987 |
| Bifidobacterium adolescentis | 26 | 24 | 43230 | 52.126 |
| Bifidobacterium bifidum | 36 | 24 | 41129 | 61.021 |
| Bifidobacterium catenulatum | 5 | 2 | 25621 | 83.984 |
| Bifidobacterium longum | 114 | 84 | 76932 | 34.662 |
| Bifidobacterium pseudocatenulatum | 17 | 13 | 74842 | 66.678 |
| Clostridium clostridioforme | 16 | 9 | 233709 | 43.286 |
| Collinsella aerofaciens | 5 | 5 | 184750 | 66.271 |
| Dorea longicatena | 12 | 9 | 104873 | 32.200 |
| Escherichia coli | 10321 | 433 | 124403 | 10.071 |
| Eubacterium siraeum | 5 | 4 | 90669 | 75.687 |
| Parabacteroides distasonis | 17 | 11 | 117483 | 58.044 |
| Staphylococcus epidermidis | 512 | 85 | 34349 | 19.165 |
| Streptococcus salivarius | 45 | 32 | 246121 | 58.651 |