StrainMake

Set up

Local version

Clone the repository and install StrainMake:

git clone https://github.com/UMMISCO/strainmake.git
cd strainmake/
pip install -e .
strainmake --version

Make sure that at least Snakemake and Conda are installed.

Using Docker

You can use the Docker image and run everything via a Docker container (strainmake is the entrypoint):

docker pull bapt931894/strainmake:latest
# show CLI help
docker run --rm bapt931894/strainmake --help

# run the pipeline (example)
docker run --rm bapt931894/strainmake run --cores 16

If you need raw Snakemake in the container, override the entrypoint:

docker run --rm --entrypoint /opt/conda/envs/snakemake8.24.1/bin/snakemake bapt931894/strainmake -h

Mount volumes to keep the generated data on the host. Docker options such as -v must come before the image name; anything placed after it is passed to the strainmake entrypoint:

docker run \
    -v /where/to/keep/results:/opt/strainmake/results \
    -v /where/to/keep/logs:/opt/strainmake/logs \
    -v /where/to/keep/benchmarks:/opt/strainmake/benchmarks \
    bapt931894/strainmake ...

Paths in the configuration file and other inputs must also match the file system as seen from inside the container.
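
As a rough illustration of how such a command is laid out, the Python sketch below assembles a docker run invocation with the volume mounts placed before the image name. The helper docker_run_command is hypothetical (it is not part of StrainMake); the image name, the container paths and the run --cores 16 arguments are taken from the examples above.

```python
def docker_run_command(image, volumes, args=()):
    """Build a `docker run` argument list with -v options before the image."""
    cmd = ["docker", "run", "--rm"]
    for host_dir, container_dir in volumes.items():
        # Docker options must precede the image name.
        cmd += ["-v", f"{host_dir}:{container_dir}"]
    cmd.append(image)
    # Everything after the image is handed to the container entrypoint.
    cmd += list(args)
    return cmd


cmd = docker_run_command(
    "bapt931894/strainmake",
    {
        "/where/to/keep/results": "/opt/strainmake/results",
        "/where/to/keep/logs": "/opt/strainmake/logs",
    },
    ["run", "--cores", "16"],
)
print(" ".join(cmd))
```

The resulting list can be passed to subprocess.run(cmd, check=True) on a machine where Docker is available.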

How to run

A step-by-step usage example is available on the wiki.

Overview of integrated tools

Quality control, preprocessing

Tool    First release  Conda available?  Link                                                       Implemented?
fastp   2018           Yes               https://github.com/OpenGene/fastp                          Yes
FastQC  2010           Yes               http://www.bioinformatics.babraham.ac.uk/projects/fastqc/  Yes

Human decontamination

Tool     First release  Conda available?  Link                                    Implemented?
bowtie2  2012           Yes               https://github.com/BenLangmead/bowtie2  Yes

Human assembly for mapping:

Assembly

Tool          First release  Conda available?  Link                                     Implemented?
MEGAHIT       2015           Yes               https://github.com/voutcn/megahit        Yes
(Meta)SPAdes  2017           Yes               https://github.com/ablab/spades          Yes
(Meta)Flye    2020           Yes               https://github.com/mikolmogorov/Flye     Yes
HyLight       2024           Yes               https://github.com/LuoGroup2023/HyLight  Yes

Assembly quality assessment

Tool   First release  Conda available?  Link                            Implemented?
QUAST  2013           Yes               https://github.com/ablab/quast  Yes

Non-redundant gene catalog

Tool      First release  Conda available?  Link                                 Implemented?
Prodigal  2010           Yes               https://github.com/hyattpd/Prodigal  Yes
CD-HIT    2012           Yes               https://github.com/weizhongli/cdhit  Yes

Binning

Tool       First release  Conda available?  Link                                                   Implemented?
MetaBAT 2  2019           Yes               https://bitbucket.org/berkeleylab/metabat/src/master/  Yes
SemiBin2   2023           Yes               https://github.com/BigDataBiology/SemiBin              Yes
VAMB       2019           Yes               https://github.com/RasmussenLab/vamb                   Yes

Bin quality assessment

Tool     First release  Conda available?  Link                                  Implemented?
CheckM2  2023           Yes               https://github.com/chklovski/CheckM2  Yes

Bin refinement

Tool     First release  Conda available?  Link                                         Implemented?
Binette  2024           Yes               https://github.com/genotoul-bioinfo/Binette  Yes

Bin post-processing (gene prediction, functional annotation and taxonomic classification)

Taxonomic annotation

Tool     First release  Conda available?  Link                                   Implemented?
GTDB-Tk  2022           Yes               https://github.com/Ecogenomics/GTDBTk  Yes

Dereplication

Tool  First release  Conda available?  Link                           Implemented?
dRep  2017           Yes               https://github.com/MrOlm/drep  Yes

Gene prediction

Tool      First release  Conda available?  Link                                 Implemented?
Prodigal  2010           Yes               https://github.com/hyattpd/Prodigal  Yes

Coverage estimation

Tool    First release  Conda available?  Link                                   Implemented?
CheckM  2015           Yes               https://github.com/Ecogenomics/CheckM  Yes

Metabolic modeling

Tool     First release  Conda available?  Link                                       Implemented?
CarveMe  2018           Yes               https://github.com/cdanielmachado/carveme  Yes

Taxonomic profiling

Tool        First release  Conda available?  Link                                     Implemented?
MetaPhlAn   2023           Yes               https://github.com/biobakery/MetaPhlAn   Yes
Meteor2     2024           Yes               https://github.com/metagenopolis/meteor  Yes
StrainScan  2023           Yes               https://github.com/liaoherui/StrainScan  Yes

Strain profiling

Tool      First release  Conda available?  Link                                     Implemented?
inStrain  2021           Yes               https://github.com/MrOlm/inStrain        Yes (short reads only)
Floria    2024           Yes               https://github.com/bluenote-1577/floria  Yes (short reads only)

More scripts

The workflow/scripts/other_scripts directory contains scripts for processing some of the results produced by this pipeline.

skani_analysis.py performs pairwise comparisons of bins using Skani (https://doi.org/10.1038/s41592-023-02018-3). It can also produce a Venn diagram for results derived from dereplicated bins.

usage: skani_analysis.py compare [-h] --bins {refined,dereplicated} --tmp TMP --output_file OUTPUT_FILE --tsv_output TSV_OUTPUT --ani_threshold ANI_THRESHOLD --json_output JSON_OUTPUT --venn_diagram
                                 VENN_DIAGRAM --cpu CPU

options:
  -h, --help            show this help message and exit
  --bins {refined,dereplicated}
                        Type of bins to analyze
  --tmp TMP             Temporary directory for intermediate files
  --output_file OUTPUT_FILE
                        File to save the output results (Skani matrix)
  --tsv_output TSV_OUTPUT
                        File to save the Skani matrix in TSV format
  --ani_threshold ANI_THRESHOLD
                        Minimal ANI to consider two bins as the same
  --json_output JSON_OUTPUT
                        File to save the bins similarity results according to assembly methods (JSON)
  --venn_diagram VENN_DIAGRAM
                        Where to save the Venn diagram
  --cpu CPU             Number of CPU cores to use

The check subcommand then lists the bins recovered from a single assembly method only.

usage: skani_analysis.py check [-h] --json_results JSON_RESULTS --tsv_output TSV_OUTPUT --assembly {unique,megahit,metaflye,metaspades,hybridspades}

options:
  -h, --help            show this help message and exit
  --json_results JSON_RESULTS
                        Path to the JSON produced using "skani_analysis.py compare"
  --tsv_output TSV_OUTPUT
                        File to save the results in TSV format
  --assembly {unique,megahit,metaflye,metaspades,hybridspades}
                        Choose 'unique' to get a list of bins that were not found from at least a second assembly method, at the given ANI threshold you used with the 'compare' subcommand. Choose any other possible assembly method to
                        get a list of bins recovered from the given assembly (it won't return the redundant bins coming from other assemblies)
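
To make the 'unique' option more concrete, here is a minimal Python sketch of the idea: a bin counts as unique when no bin from a different assembly method reaches the ANI threshold against it. This is an illustration only, not the logic of skani_analysis.py itself; the function name, the bin names and the ANI values are made up for the example.

```python
def bins_unique_to_one_assembly(pairs, assembly_of, ani_threshold):
    """Return bins with no counterpart (ANI >= threshold) from another assembly.

    pairs: iterable of (bin_a, bin_b, ani) pairwise comparisons.
    assembly_of: dict mapping each bin name to its assembly method.
    """
    matched = set()
    for bin_a, bin_b, ani in pairs:
        same_genome = ani >= ani_threshold
        other_method = assembly_of[bin_a] != assembly_of[bin_b]
        if same_genome and other_method:
            matched.update((bin_a, bin_b))
    return sorted(name for name in assembly_of if name not in matched)


# Hypothetical bins and ANI values, for illustration only.
assembly_of = {
    "megahit.bin1": "megahit",
    "metaflye.bin1": "metaflye",
    "metaspades.bin1": "metaspades",
}
pairs = [
    ("megahit.bin1", "metaflye.bin1", 99.2),
    ("megahit.bin1", "metaspades.bin1", 81.0),
    ("metaflye.bin1", "metaspades.bin1", 80.5),
]
unique = bins_unique_to_one_assembly(pairs, assembly_of, ani_threshold=95.0)
print(unique)
```

With these toy values, only the metaSPAdes bin has no match at 95% ANI from another assembly method, so it is the only one reported.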

calculate_binned_contigs.py computes the binned rate of contigs, i.e. the percentage of contigs from an assembly that end up in at least one bin. It reports this rate for each sample and its contigs, for a given assembly method.

usage: calculate_binned_contigs.py [-h] --assembler {megahit,metaflye,hybridspades,metaspades} --results-dir RESULTS_DIR --type {binette,dereplicated_and_filtered} --tsv_output_binned_contigs TSV_OUTPUT_BINNED_CONTIGS
                                   --tsv_output_binned_rate TSV_OUTPUT_BINNED_RATE

Count assembly contigs assigned to a bin.

options:
  -h, --help            show this help message and exit
  --assembler {megahit,metaflye,hybridspades,metaspades}
                        The assembly we should use
  --results-dir RESULTS_DIR
                        Folder storing the pipeline results. Typically named 'results': /path/to/pipeline/results
  --type {binette,dereplicated_and_filtered}
                        Type of bins
  --tsv_output_binned_contigs TSV_OUTPUT_BINNED_CONTIGS
                        File to save the list of contigs and their number of assignation in bins in TSV format
  --tsv_output_binned_rate TSV_OUTPUT_BINNED_RATE
                        File to save the binning rate of contigs in TSV format
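
The binned rate itself is simple arithmetic. The sketch below shows one way to compute it in Python for a single sample, together with the per-contig assignment counts; it is not the script's actual code, and the contig and bin names are invented for the example.

```python
def binned_rate(assembly_contigs, bins):
    """Count bin assignments per contig and return (counts, rate in percent).

    assembly_contigs: iterable of contig names from one assembly.
    bins: dict mapping a bin name to the list of contigs it contains.
    """
    counts = {contig: 0 for contig in assembly_contigs}
    for contigs in bins.values():
        for contig in contigs:
            if contig in counts:
                counts[contig] += 1
    binned = sum(1 for n in counts.values() if n > 0)
    rate = 100.0 * binned / len(counts) if counts else 0.0
    return counts, rate


# Hypothetical sample: four contigs, two of which end up in a bin.
counts, rate = binned_rate(
    ["c1", "c2", "c3", "c4"],
    {"bin1": ["c1", "c2"], "bin2": ["c2"]},
)
print(counts, rate)
```

Here c2 is assigned twice but still counts once towards the rate, which is 2 binned contigs out of 4.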

Using preprocessed reads

If your sequencing reads have already been preprocessed, you can use strainmake prepare import-preprocessed to set up the results directory so that the pipeline starts directly from the assembly step, using your preprocessed FASTQ files.

To do this, provide a TSV file formatted like config_data.tsv. The script will create symbolic links in the results folder that point to your preprocessed FASTQ files, saving storage space by avoiding unnecessary duplication.

This approach ensures that the pipeline can use your preprocessed data without needing to process the FASTQ files again.
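
The linking step can be pictured with the following Python sketch. It is only an approximation of what such a command might do, under loud assumptions: the TSV is assumed to have a header with a sample column holding the FASTQ path, and the destination directory is chosen by the caller. The real column semantics and the layout of the results folder are defined by StrainMake, not by this example.

```python
import csv
from pathlib import Path


def link_preprocessed_reads(tsv_path, dest_dir, path_column="sample"):
    """Symlink every FASTQ listed in a TSV into dest_dir; return the links."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    created = []
    with open(tsv_path, newline="") as handle:
        for row in csv.DictReader(handle, delimiter="\t"):
            source = Path(row[path_column]).resolve()
            if not source.is_file():
                raise FileNotFoundError(f"Missing FASTQ: {source}")
            link = dest / source.name
            if link.is_symlink():
                link.unlink()  # replace a link left over from a previous run
            link.symlink_to(source)  # no data is copied, only a pointer
            created.append(link)
    return created
```

Because only links are created, the results folder takes almost no extra space, and deleting a link never touches the original FASTQ file.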

Help with configuration file

Use strainmake init to build a YAML configuration interactively.

 Usage: strainmake init [OPTIONS]                                                                                                                                                                                                                              
                                                                                                                                                                                                                                                               
 Interactively generate a config.yaml for the StrainMake pipeline.                                                                                                                                                                                             
                                                                                                                                                                                                                                                               
 Walks through each pipeline section (preprocessing, assembly, …) and lets                                                                                                                                                                                     
 you accept defaults, edit values, or skip a section entirely.                                                                                                                                                                                                 
                                                                                                                                                                                                                                                               
╭─ Options ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ *  --samples    -s      PATH  Path to the sample metadata TSV (columns: sample, sample_id, type). [required]                                                                                                                                                │
│    --lr-format          TEXT  Format of long-read sequences: 'fastq' or 'fasta'. [default: fastq]                                                                                                                                                           │
│    --output     -o      PATH  Path to write the generated configuration YAML. [default: config.yaml]                                                                                                                                                        │
│    --help                     Show this message and exit.                                                                                                                                                                                                   │
╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
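
Before running strainmake init, it can be handy to check that the sample metadata TSV carries the three columns named in the help above (sample, sample_id, type). The Python sketch below does only that header check; the helper name and the example rows are hypothetical, and the values allowed in each column are not covered here.

```python
import csv
import io

REQUIRED_COLUMNS = ("sample", "sample_id", "type")


def missing_columns(tsv_text):
    """Return the required column names absent from the TSV header."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    header = next(reader, [])
    return [name for name in REQUIRED_COLUMNS if name not in header]


# Hypothetical file contents, for illustration only.
good = "sample\tsample_id\ttype\nreads/S1_R1.fastq.gz\tS1\tR1\n"
bad = "sample\ttype\nreads/S1_R1.fastq.gz\tR1\n"
print(missing_columns(good), missing_columns(bad))
```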

Then use strainmake build to generate, from that YAML, a Snakefile that targets the outputs the pipeline should produce.

 Usage: strainmake build [OPTIONS]                                                                                                                                                                                                                             
                                                                                                                                                                                                                                                               
 Generate a Snakefile from a StrainMake configuration YAML.                                                                                                                                                                                                    
                                                                                                                                                                                                                                                               
 By default all sections present in the config are rendered. Use --perform                                                                                                                                                                                     
 to restrict to specific pipeline steps, and --post-processing,                                                                                                                                                                                                
 --taxonomic-profiling, or --strain-profiling                                                                                                                                                                                                                  
 for finer granularity within those steps.                                                                                                                                                                                                                     
                                                                                                                                                                                                                                                               
                                                                                                                                                                                                                                                               
 Examples:                                                                                                                                                                                                                                                     
     # All steps (config-driven, default):                                                                                                                                                                                                                     
     strainmake build --config config.yaml                                                                                                                                                                                                                     
                                                                                                                                                                                                                                                               
     # Only assembly + binning:                                                                                                                                                                                                                                
     strainmake build --config config.yaml --perform assembly --perform binning                                                                                                                                                                                
                                                                                                                                                                                                                                                               
     # Post-processing with a subset of tools:                                                                                                                                                                                                                 
     strainmake build --config config.yaml \                                                                                                                                                                                                                   
         --perform assembly --perform binning --perform postprocessing \                                                                                                                                                                                       
         --post-processing gtdbtk --post-processing carveme                                                                                                                                                                                                    
                                                                                                                                                                                                                                                               
     # Taxonomic profiling with Meteor only:                                                                                                                                                                                                                   
     strainmake build --config config.yaml \                                                                                                                                                                                                                   
         --perform taxo_profiling --taxonomic-profiling meteor                                                                                                                                                                                                 
                                                                                                                                                                                                                                                               
     # Strain profiling with only inStrain:                                                                                                                                                                                                                    
     strainmake build --config config.yaml \                                                                                                                                                                                                                   
         --perform assembly --perform binning --perform postprocessing \                                                                                                                                                                                       
         --perform strain_profiling --strain-profiling instrain                                                                                                                                                                                                
                                                                                                                                                                                                                                                               
╭─ Options ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ *  --config               -c      PATH                                                                             Path to the StrainMake configuration YAML (generated by `strainmake init`). [required]                                                   │
│    --output               -o      PATH                                                                             Path to write the generated Snakefile. [default: Snakefile]                                                                              │
│    --perform                      [preprocessing|assembly|binning|postprocessing|taxo_profiling|strain_profiling]  Pipeline step(s) to include. Pass once per step, e.g. `--perform assembly --perform binning`. If omitted, all steps present in the       │
│                                                                                                                    config are included.                                                                                                                     │
│    --strain-profiling             [instrain|floria]                                                                Strain profiling tool(s) to include: instrain, floria. Pass once per tool. Automatically activates 'strain_profiling' even without an    │
│                                                                                                                    explicit --perform flag. If omitted, all tools are included.                                                                             │
│    --post-processing              [gtdbtk|coverage|carveme|bakta]                                                  Post-processing tool(s) to include: gtdbtk, coverage (checkm1), carveme, bakta. Pass once per tool. Automatically activates              │
│                                                                                                                    'postprocessing' even without an explicit --perform flag. If omitted, all tools are included.                                            │
│    --taxonomic-profiling          [meteor|metaphlan|strainscan]                                                    Taxonomic profiling tool(s) to include: meteor, metaphlan, strainscan. Pass once per tool. Automatically activates 'taxo_profiling' even │
│                                                                                                                    without an explicit --perform flag. If omitted, all tools are included.                                                                  │
│    --help                                                                                                          Show this message and exit.                                                                                                              │
╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
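
When strainmake build is driven from a script, the repeated --perform flags are easy to get wrong. The Python sketch below assembles the argument list from a list of steps and rejects step names that are not among the choices shown in the help above. The build_command helper is hypothetical; only the flags and step names come from the help text.

```python
PERFORM_CHOICES = (
    "preprocessing",
    "assembly",
    "binning",
    "postprocessing",
    "taxo_profiling",
    "strain_profiling",
)


def build_command(config, perform=(), output=None):
    """Build the argument list for `strainmake build`."""
    unknown = [step for step in perform if step not in PERFORM_CHOICES]
    if unknown:
        raise ValueError(f"Unknown pipeline step(s): {', '.join(unknown)}")
    cmd = ["strainmake", "build", "--config", str(config)]
    if output is not None:
        cmd += ["--output", str(output)]
    for step in perform:
        cmd += ["--perform", step]  # the flag is passed once per step
    return cmd


cmd = build_command("config.yaml", perform=["assembly", "binning"])
print(" ".join(cmd))
```

The list can then be handed to subprocess.run(cmd, check=True) in an environment where StrainMake is installed.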

MultiQC report

Use strainmake report to generate MultiQC report(s) from the pipeline results.

 Usage: strainmake report [OPTIONS]                                                                                                                                                                                                                            
                                                                                                                                                                                                                                                               
 Generate MultiQC report(s) from StrainMake pipeline results.                                                                                                                                                                                                  
                                                                                                                                                                                                                                                               
 Collects QC artefacts from preprocessing, assembly, binning and annotation                                                                                                                                                                                    
 steps, then runs MultiQC.  One report is produced per assembler when                                                                                                                                                                                          
 multiple assemblers were used.                                                                                                                                                                                                                                
                                                                                                                                                                                                                                                               
╭─ Options ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ *  --results         -r      DIRECTORY  Directory containing StrainMake pipeline results (the `results/` folder). [required]                                                                                                                                │
│ *  --logs            -l      DIRECTORY  Directory containing pipeline logs (the `logs/` folder). [required]                                                                                                                                                 │
│ *  --output          -o      PATH       Directory where the MultiQC report(s) will be written. [required]                                                                                                                                                   │
│    --ani                     INTEGER    ANI threshold used for MAG dereplication (must match the pipeline run). [default: 95]                                                                                                                               │
│    --multiqc-config          PATH       Path to the MultiQC YAML configuration file. [default: /path/to/strainmake/workflow/scripts/multiqc_results/multiqc_config.yaml]                                   │
│    --dry-run         -n                 Print the MultiQC commands without executing them.                                                                                                                                                                  │
│    --help                               Show this message and exit.                                                                                                                                                                                         │
╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯

About

Reproducible hybrid metagenomics with MAG recovery and strain-level resolution
