Spriggan is a Nextflow pipeline for the assembly of bacterial whole genome sequence data and identification of antibiotic resistance genes.
Usage
Input
Parameters
Workflow outline
Read trimming and quality assessment
Genome assembly
Assembly quality assessment
Genome coverage
Antimicrobial resistance gene detection
MLST scheme
Contamination detection
Summary
Output
Credits
Contributions and Support
Citations
The pipeline is designed to start from raw, paired-end Illumina reads. Start the pipeline using:
nextflow spriggan/main.nf --input [path-to-samplesheet] --outdir [path-to-outdir] -profile [docker,singularity,aws]
You can specify a version of the pipeline and run it directly from the github repository by using:
nextflow wslh-bio/spriggan -r [version] --input [path-to-samplesheet] --outdir [path-to-outdir] -profile [docker,singularity,aws]
You can also test the pipeline with example data using -profile test or -profile test_full:
nextflow spriggan/main.nf --outdir [path-to-outdir] -profile test[_full],[docker/singularity]
Spriggan's inputs are paired Illumina FASTQ files for each sample and a comma separated sample sheet containing the sample name, the path to the forward reads file, and the path to the reverse reads file for each sample. An example of the sample sheet's format can be seen in the table below and found here.
| sample | fastq_1 | fastq_2 |
|---|---|---|
| sample_name | /path/to/sample_name_R1.fastq.gz | /path/to/sample_name_R2.fastq.gz |
Spriggan's main parameters and their defaults are shown in the table below:
| Parameter | Parameter description and default |
|---|---|
| input | Path to comma-separated file containing information about the samples in the experiment. |
| outdir | Output directory where the results will be saved. Absolute path must be used for storage on cloud infrastructure. |
| qualitytrimscore | Minimum read quality for trimming (default: 10) |
| trimdirection | Read trimming direction (default: "lr") |
| minlength | Minimum read length for trimming (default: 10) |
| contaminants | Path to FASTA file of contaminating sequences for trimming |
| mincoverage | Minimum coverage threshold to pass a sample (default: 40) |
| kraken_db | Path to Kraken database for classification |
| plus | Use AMRFinderPlus' --plus option (default: false) |
| selected_genes | Genes of interest to pull from AMRFinderPlus output (default: 'NDM|OXA|KPC|IMP|VIM') |
| ncbi_assembly_stats | Path to NCBI database (default: NCBI_Assembly_stats_20240124.txt) |
| min_quast_contig | Lower threshold for contig length in bp (default: 500) |
Read repair, trimming, and cleaning are performed using BBtools v38.76 to repair fastqs with mismatched read numbers, trim reads of low quality bases, and remove PhiX contamination. Then FastQC v0.11.8 is used assess the quality of the raw and cleaned reads.
Assembly of the cleaned and trimmed reads is performed using Shovill v1.1.0.
Quality assessment of the assemblies is performed using QUAST v5.0.2.
Mean and median genome coverage is determined by mapping the cleaned reads back their the assembly using BWA v0.7.17-r1188 and calculating depth using samtools v1.10.
Antimicrobial resistance genes, as well as point mutations, are identified using AMRFinderPlus v3.10.30. Using the plus parameter provides results from the AMRFinderPlus "--plus" option, which includes genes such as virulence factors, stress-response, etc.
Spriggan can also generate a table of results for genes of interest with the selected_genes parameter. Spriggan will search for matches to the gene(s) of interest in the AMRFinderPlus results and make a separate table called selected_ar_genes.tsv. The list of genes must be separated by | and enclosed in single quotes in the config file.
MLST scheme is classified using MLST v2.17.6. Multiple schemes are available for specific organisms, and STs from all available schemes are reported for those organisms.
Contamination is detected by classifying reads using Kraken2 v2.0.8 with the Minikraken2_v1_8GB database. A custom Kraken database can be used with the kraken_db parameter.
Calculations are performed with Pandas v1.3.2 on Kraken2 v2.0.8 and QUAST v5.0.2 data to determine the expected : actual assembly length ratio, actual : expected assembly length ratio, and GC content statistics. The NCBI Assembly statistics database is referenced during these calculations. The expected : actual length ratio is included in the results/spriggan_report.csv.
Results are summarized using MultiQC v1.11 and Pandas v1.3.2. The main outputs of Spriggan are a csv file named spriggan_report.csv and an HTML report file named spriggan_multiqc_report.html. The spriggan_report.csv file summarizes the results of the QC, classification, and MLST steps of the pipeline. The spriggan_multiqc_report.html file contains tables and figures of quality metrics from the FastQC, BBDuk, Samtools, Kraken, and QUAST steps of the pipeline.
An example of Spriggan's output directory structure and its output files can be seen below:
spriggan_results
├── amrfinder
│ ├── *.amr.tsv
│ ├── *.fa
│ ├── amrfinder_predictions.tsv
│ ├── amrfinder_summary.tsv
│ └── selected_ar_genes.tsv
├── assembly
│ └── assembly_stats_results.tsv
├── bbduk
│ ├── *.adapter.stats.txt
│ ├── *.bbduk.log
│ ├── *_repaired_1.fastq.gz
│ ├── *_repaired_2.fastq.gz
│ ├── *.repair.log
│ ├── *_singletons.fastq.gz
│ ├── *_trimmed_1.fastq.gz
│ ├── *_trimmed_2.fastq.gz
│ └── *.trim.txt
├── calculate
│ ├── *_Assembly_ratio_*.tsv
│ ├── *_GC_content_*.tsv
├── coverage
│ └── coverage_stats.tsv
├── fastqc
│ ├── *.html
│ ├── *.zip
│ └── fastqc_summary.tsv
├── gc
│ └── gc_stats_results.tsv
├── kraken
│ ├── *.kraken2.txt
│ ├── kraken_results.tsv
│ └── kraken2.log
├── mlst
│ ├── *.alleles.tsv
│ ├── *.mlst.tsv
│ └── mlst_results.tsv
├── multiqc
│ ├── multiqc_data
│ │ ├── *.json
│ │ ├── *.txt
│ │ └── multiqc.log
│ ├── multiqc_plots
│ │ ├── pdf
│ │ │ └── *.pdf
│ │ ├── png
│ │ │ └── *.png
│ │ └── svg
│ │ └── *.svg
│ └── spriggan_multiqc_report.html
├── pipeline_info
│ ├── *.html
│ ├── *.txt
│ ├── samplesheet.valid.csv
│ └── software_versions.yml
├── quast
│ ├── *.quast.report.tsv
│ ├── *.transposed.quast.report.tsv
│ └── quast_results.tsv
├── rejected_samples
│ └── Empty_samples.csv
├── report
│ └── spriggan_report.csv
├── samtools
│ ├── *.bam
│ ├── *.depth.tsv
│ └── *.stats.txt
└── shovill
├── *.contigs.fa
├── *.sam
└── shovill_output
├── contigs.gfa
├── shovill.corrections
├── shovill.log
└── spades.fasta
Notable output files:
spriggan_report.csv - Summary table of each step in Spriggan.
spriggan_multiqc_report.html - HTML report generated by MultiQC.
*.contigs.fa - Shovill assembly for each sample.
*.amr.tsv - AMR genes identified in each sample by AMRFinderPlus.
*.mlst.tsv - MLST scheme identified for each sample.
Empty_samples.csv - Lists any samples that do not have any reads and were removed from the pipeline.
Spriggan was written by Dr. Kelsey Florek, Dr. Abigail C. Shockey, Eva Gunawan, and Dustin Lyfoung.
We thank the bioinformatics group at the Wisconsin State Laboratory of Hygiene for all of their contributions.
If you would like to contribute to this pipeline, please see the contributing guidelines.
If you use Spriggan for your analysis, please cite it using the following:
A.C. Shockey, K. Florek, E. Gunawan, & D. Lyfoung (2021). Spriggan (Version 1.7.1) [https://github.com/wslh-bio/spriggan/].
This pipeline uses code and infrastructure developed and maintained by the nf-core community, reused here under the MIT license.
The nf-core framework for community-curated bioinformatics pipelines.
Philip Ewels, Alexander Peltzer, Sven Fillinger, Harshil Patel, Johannes Alneberg, Andreas Wilm, Maxime Ulysse Garcia, Paolo Di Tommaso & Sven Nahnsen.
Nat Biotechnol. 2020 Feb 13. doi: 10.1038/s41587-020-0439-x.