docs/output.md (109 changes: 88 additions & 21 deletions)

## Introduction

This document describes the output produced by the pipeline.

The directories listed below will be created in the results directory after the pipeline has finished. All paths are relative to the top-level results directory.

<!-- TODO nf-core: Write this documentation describing your workflow's output -->

## Pipeline overview

The pipeline is built using [Nextflow](https://www.nextflow.io/) and the results are organized as follows:

- [Module output](#module-output)
  - [Preprocessing](#preprocessing)
    - [FastQC](#fastqc) - Read quality control
    - [Trim Galore!](#trim-galore) - Primer trimming
    - [MultiQC](#multiqc) - Aggregate report describing results
    - [BBduk](#bbduk) - Filter out sequences from samples that match sequences in a user-provided fasta file (optional)
  - [Filtering genomes](#filter-genomes-step) - Generate a list of genomes that will be used for the mapping
    - [Sourmash](#sourmash) - Output from Sourmash filtering of genomes
  - [ORF caller step](#orf-caller-step) - Identify protein-coding genes (ORFs) with an ORF caller
    - [Prokka](#prokka) - Output from Prokka (optional)
  - [Mapping step](#mapping-reads-to-genomes) - Map reads to the genomes and count reads per feature
    - [BBmap](#bbmap) - Output from BBmap
    - [FeatureCounts](#featurecounts) - Output from FeatureCounts
- [Custom magmap output](#magmap-output)
  - [Summary tables folder](#summary-tables) - Tab-separated tables ready for further analysis in tools like R and Python
- [Pipeline information](#pipeline-information) - Report metrics generated during the workflow execution

## Module output

### Preprocessing

#### FastQC

[FastQC](http://www.bioinformatics.babraham.ac.uk/projects/fastqc/) gives general quality metrics about your sequenced reads. It provides information about the quality score distribution across your reads, per base sequence content (%A/T/G/C), adapter contamination and overrepresented sequences. For further reading and documentation see the [FastQC help pages](http://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/). FastQC is run as part of Trim Galore!, so its output can be found in the Trim Galore! folder.

<details markdown="1">
<summary>Output files</summary>

- `trimgalore/fastqc/`
  - `*_fastqc.html`: FastQC report containing quality metrics for your untrimmed raw fastq files.

</details>

#### Trim Galore!

[Trim Galore!](https://github.com/FelixKrueger/TrimGalore) trims primer sequences from sequencing reads. Primer sequences are non-biological sequences that often introduce point mutations that do not reflect sample sequences. This is especially true for degenerate PCR primers.

<details markdown="1">
<summary>Output files</summary>

- `trimgalore/`: directory containing log files with retained reads, trimming percentage, etc. for each sample.
  - `*trimming_report.txt`: report of read numbers that pass Trim Galore!.

</details>
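
As a quick sanity check after a run, you can pull the adapter summary line out of every trimming report. A minimal sketch, assuming the reports live under `trimgalore/` in your results directory and contain the standard cutadapt summary lines:

```bash
# Print the "Reads with adapters" summary line from each Trim Galore! report.
# Adjust the path to match your --outdir.
grep -H "Reads with adapters" results/trimgalore/*trimming_report.txt
```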

#### MultiQC

[MultiQC](http://multiqc.info) is a visualization tool that generates a single HTML report summarising all samples in your project. Most of the pipeline QC results are visualised in the report and further statistics are available in the report data directory.

Results generated by MultiQC collate pipeline QC from supported tools, e.g. FastQC. The pipeline has special steps which also allow the software versions to be reported in the MultiQC output for future traceability. For more information about how to use MultiQC reports, see <http://multiqc.info>.

<details markdown="1">
<summary>Output files</summary>
- `multiqc/`
  - `multiqc_report.html`: a standalone HTML file that can be viewed in your web browser.
  - `multiqc_data/`: directory containing parsed statistics from the different tools used in the pipeline.
  - `multiqc_plots/`: directory containing static images from the report in various formats.

</details>

:::note
The FastQC plots displayed in the MultiQC report show _untrimmed_ reads. They may contain adapter sequence and potentially regions with low quality.
:::

#### BBduk

[BBduk](https://jgi.doe.gov/data-and-tools/software-tools/bbtools/bb-tools-user-guide/bbduk-guide/) is a filtering tool that removes specific sequences from the samples using a reference fasta file.
BBduk is a built-in tool from the BBmap suite.

<details markdown="1">
<summary>Output files</summary>

- `bbmap/`
  - `*.bbduk.log`: a text file with the results of the BBduk analysis. The number of filtered reads can be seen in this log.

</details>
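
To see how many reads BBduk removed per sample, you can scan the logs. A minimal sketch, assuming the logs follow BBduk's usual summary format (a `Result:` line reporting retained reads and bases):

```bash
# Print each sample's BBduk summary line (reads and bases retained).
for log in results/bbmap/*.bbduk.log; do
  printf '%s\t' "$(basename "$log" .bbduk.log)"
  grep "Result:" "$log"
done
```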

### ORF caller step

#### Prokka

You can use [Prokka](https://github.com/tseemann/prokka) to identify ORFs in any genomes for which a gff file is not provided.
In addition to calling ORFs (done with Prodigal), Prokka filters the ORFs to retain only high-quality ones and functionally annotates them.

<details markdown="1">
<summary>Output files</summary>

- `prokka/`
  - `*.ffn.gz`: nucleotide fasta file of the called ORFs
  - `*.faa.gz`: amino acid fasta file of the called ORFs
  - `*.gff.gz`: genome feature file (GFF) of the called ORFs

</details>
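
A quick way to see how many ORFs Prokka called per genome is to count the fasta headers in the compressed amino acid files. A minimal sketch, assuming the `prokka/` directory sits under your results folder:

```bash
# Count called ORFs per genome by counting fasta headers in the .faa files.
for faa in results/prokka/*.faa.gz; do
  printf '%s\t%s\n' "$(basename "$faa" .faa.gz)" "$(zcat "$faa" | grep -c '^>')"
done
```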

## Magmap output

### Summary tables

Consistently named and formatted output tables in TSV format, ready for further analysis.
Filenames start with the assembly program and ORF caller, to allow reruns of the pipeline with different parameter settings without overwriting output files.

<details markdown="1">
<summary>Output files</summary>

- `summary_tables/`
  - `magmap.overall_stats.tsv.gz`: overall statistics from the pipeline, e.g. number of reads, number of called ORFs, number of reads mapping back to contigs/ORFs, etc.
  - `magmap.counts.tsv.gz`: read counts per ORF and sample.
  - `summary_table.taxonomy.tsv.gz`: for each genome, this TSV file provides metrics and taxonomy.

</details>
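
Since the tables are gzipped TSVs, they can be inspected directly from the shell before loading them into R or Python. A minimal sketch, assuming the default `summary_tables/` directory under your results folder:

```bash
# Peek at the header and first rows of the per-ORF count table.
zcat results/summary_tables/magmap.counts.tsv.gz | head -n 5

# Count the number of ORFs (rows minus the header line).
zcat results/summary_tables/magmap.counts.tsv.gz | tail -n +2 | wc -l
```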

### Pipeline information

docs/usage.md (93 changes: 88 additions & 5 deletions)

## Introduction

Magmap is a workflow designed for mapping metatranscriptomic and metagenomic reads onto a collection of genomes.
The collection of genomes can either be specified directly using a table (see the `--genomeinfo` parameter) or be the result of filtering with Sourmash.
The latter can use the genomes specified by `--genomeinfo`, a "sketch index" pointing to genomes available, for instance, at NCBI (see the `--indexes` parameter), or a combination of the two, to identify a smaller set to map to.
Genome files provided with `--genomeinfo` must include contigs in fasta format and, optionally, gff files (Prokka format).
Any genome for which a gff file is missing will be annotated with Prokka.
The pipeline can take output files from CheckM, CheckM2 and GTDB-Tk as input, and will provide processed output from these tools.
Note that the pipeline can map to any collection of genomes, including single genomes and isolates.

## Running the workflow

### Quickstart

A typical command for running the workflow is:

```bash
nextflow run nf-core/magmap -profile docker --outdir results/ --input samples.csv --genomeinfo localgenomes.csv
```

### Samplesheet input

You will need to create a samplesheet with information about the samples you would like to analyse before running the pipeline. Use this parameter to specify its location. It must be a comma-separated file with 3 columns, and a header row as shown in the examples below.

```csv title="samplesheet.csv"
sample,fastq_1,fastq_2
T0a,AEG588A1_S1_L002_R1_001.fastq.gz,AEG588A1_S1_L002_R2_001.fastq.gz
T0b,AEG588A2_S2_L002_R1_001.fastq.gz,AEG588A2_S2_L002_R2_001.fastq.gz
T0c,AEG588A3_S3_L002_R1_001.fastq.gz,AEG588A3_S3_L002_R2_001.fastq.gz
```

### Multiple runs of the same sample

An [example samplesheet](../assets/samplesheet.csv) has been provided with the pipeline.

### Genomes input

Magmap needs two mandatory inputs to run. The first, the samplesheet, was explained above; the second, the genome input, needs to be specified with the `--genomeinfo` option. The file is a `.csv` file and it requires three columns: accnos, genomes_fna and genome_gff. It looks as follows:

```csv title="genomes.csv"
GCA_002688505.1,https://github.com/nf-core/test-datasets/raw/magmap/testdata/GCA_002688505.1_ASM268850v1_genomic.fna.gz,https://github.com/nf-core/test-datasets/raw/magmap/testdata/GCA_002688505.1_ASM268850v1_genomic.gff.gz
```

N.B.: you don't need to provide a gff file for every genome; Prokka can handle genomes for which you only provide a fasta file. When you do add a gff file, we recommend that it was generated with Prokka, to avoid conflicts with the gff files the pipeline generates.
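
To illustrate the optional gff column, the sketch below mixes the remote test genome above with a hypothetical local genome that has no gff file; the accession `local_mag_001` and its path are made up for the example:

```bash
# Hypothetical genomes.csv: the second row's accession and path are
# illustrative, and its empty third column means Prokka will annotate it.
cat > genomes.csv <<'EOF'
GCA_002688505.1,https://github.com/nf-core/test-datasets/raw/magmap/testdata/GCA_002688505.1_ASM268850v1_genomic.fna.gz,https://github.com/nf-core/test-datasets/raw/magmap/testdata/GCA_002688505.1_ASM268850v1_genomic.gff.gz
local_mag_001,/path/to/local_mag_001.fna.gz,
EOF
```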

### Other inputs (optional)

Magmap can handle several types of input that can be used for different purposes.

#### Indexes input

The indexes input is used by Sourmash to select genomes that can be downloaded in a second step and added to the pipeline.
It is provided with the `--indexes` parameter and it is a path (local or remote), e.g.:

```
https://github.com/nf-core/test-datasets/raw/magmap/testdata/sourmash_test.index.sbt.zip
```

N.B.: the sbt files can be generated with Sourmash (see the [documentation](https://sourmash.readthedocs.io/en/latest/index.html)), and some prebuilt indexes, e.g. for GTDB, can be found on the Sourmash documentation website.

#### Metadata input

Magmap accepts several metadata files in `.csv` format that provide information about the genomes that you will use in the pipeline.

##### gtdb_metadata

This file contains information that can be found in the GTDB metadata on the official [website](https://data.ace.uq.edu.au/public/gtdb/data/releases/release80/80.0/). You can either use the file found there directly or make a custom one.

##### GTDB-Tk metadata

This file contains information that can be obtained as GTDB-Tk output. You can either use the GTDB-Tk output file directly or make a custom one.

##### CheckM metadata

This file contains information that can be obtained as CheckM output. You can either use the CheckM output file directly or make a custom one.
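
The metadata files are passed to the pipeline as parameters. The flags in the sketch below are hypothetical, not confirmed parameter names; check the pipeline's parameter documentation for the real ones:

```bash
# Hypothetical sketch: --gtdb_metadata and --checkm_summary are illustrative
# parameter names; consult the pipeline's parameter docs for the exact flags.
nextflow run nf-core/magmap -profile docker \
  --outdir results/ \
  --input samples.csv \
  --genomeinfo localgenomes.csv \
  --gtdb_metadata gtdb_metadata.tsv \
  --checkm_summary checkm_output.tsv
```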

### Filter/remove sequences from the samples (e.g. rRNA sequences with the SILVA database)

The pipeline can remove potential contaminants using the BBduk program.
Specify a fasta file, gzipped or not, with the `--sequence_filter sequences.fasta` parameter.
For further documentation, see the [BBduk official website](https://jgi.doe.gov/data-and-tools/software-tools/bbtools/bb-tools-user-guide/bbduk-guide/).

```bash
nextflow run nf-core/magmap -profile docker --outdir results/ --input samples.csv --genomeinfo localgenomes.csv --sequence_filter path/to/file
```

### Sourmash (optional)

With [Sourmash](https://sourmash.readthedocs.io/en/latest/index.html) you can filter the genomes to be used by Magmap in the mapping step. This step is optional, but it can speed up the process and give you a better genome/read mapping ratio, since all the genomes that do not pass the threshold (which you can select) are removed.

```bash
nextflow run nf-core/magmap -profile docker --outdir results/ --input samples.csv --genomeinfo localgenomes.csv --sourmash true
```
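
Combining `--sourmash` with an index (see [Indexes input](#indexes-input)) lets Sourmash search genomes beyond those listed in `--genomeinfo`. A minimal sketch using the test index shown earlier:

```bash
nextflow run nf-core/magmap -profile docker \
  --outdir results/ \
  --input samples.csv \
  --genomeinfo localgenomes.csv \
  --indexes https://github.com/nf-core/test-datasets/raw/magmap/testdata/sourmash_test.index.sbt.zip \
  --sourmash true
```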

### ORF caller option

The pipeline uses [Prokka](https://github.com/tseemann/prokka) to call genes/ORFs from the genomes. This is suitable for prokaryotes and provides a gff file as output for downstream analysis. It also performs functional annotation of ORFs.

## Running the pipeline

The typical command for running the pipeline is as follows:

```bash
nextflow run nf-core/magmap --input ./samplesheet.csv --outdir ./results --genomeinfo ./genomes.csv -profile docker
```

This will launch the pipeline with the `docker` configuration profile. See below for more information about profiles.
modules.json (2 changes: 1 addition & 1 deletion)
},
"utils_nfcore_pipeline": {
"branch": "master",
"git_sha": "5caf7640a9ef1d18d765d55339be751bb0969dfa",
"git_sha": "92de218a329bfc9a9033116eb5f65fd270e72ba3",
"installed_by": ["subworkflows"]
},
"utils_nfvalidation_plugin": {
subworkflows/nf-core/utils_nfcore_pipeline/main.nf (8 changes: 7 additions & 1 deletion)
