From a48953e59b0b3b48d58046b850cc4c2d4395b9f9 Mon Sep 17 00:00:00 2001 From: James Fellows Yates Date: Fri, 21 May 2021 15:51:16 +0200 Subject: [PATCH 1/8] Add basic functionality for barcode trimming - --- README.md | 3 +- assets/multiqc_config.yaml | 12 ++--- docs/output.md | 6 ++- main.nf | 93 ++++++++++++++++++++++++++++++++++++-- nextflow.config | 5 ++ nextflow_schema.json | 36 ++++++++++++--- 6 files changed, 134 insertions(+), 21 deletions(-) diff --git a/README.md b/README.md index 5997625c6..2e28463d2 100644 --- a/README.md +++ b/README.md @@ -65,7 +65,7 @@ By default the pipeline currently performs the following: * Create reference genome indices for mapping (`bwa`, `samtools`, and `picard`) * Sequencing quality control (`FastQC`) -* Sequencing adapter removal and for paired end data merging (`AdapterRemoval`) +* Sequencing adapter removal, paired-end data merging (`AdapterRemoval`) * Read mapping to reference using (`bwa aln`, `bwa mem`, `CircularMapper`, or `bowtie2`) * Post-mapping processing, statistics and conversion to bam (`samtools`) * Ancient DNA C-to-T damage pattern visualisation (`DamageProfiler`) @@ -85,6 +85,7 @@ Additional functionality contained by the pipeline currently includes: #### Preprocessing * Illumina two-coloured sequencer poly-G tail removal (`fastp`) +* Post-AdapterRemoval trimming of FASTQ files prior to mapping (`fastp`) * Automatic conversion of unmapped reads to FASTQ (`samtools`) * Host DNA (mapped reads) stripping from input FASTQ files (for sensitive samples) diff --git a/assets/multiqc_config.yaml b/assets/multiqc_config.yaml index 0d8c7c28a..7bab487a8 100644 --- a/assets/multiqc_config.yaml +++ b/assets/multiqc_config.yaml @@ -60,13 +60,13 @@ extra_fn_clean_exts: top_modules: - 'fastqc': - name: 'FastQC (pre-AdapterRemoval)' + name: 'FastQC (pre-Trimming)' path_filters: - '*_raw_fastqc.zip' - 'fastp' - 'adapterRemoval' - 'fastqc': - name: 'FastQC (post-AdapterRemoval)' + name: 'FastQC (post-Trimming)' path_filters: - 
'*.truncated_fastqc.zip' - '*.combined*_fastqc.zip' @@ -106,7 +106,7 @@ remove_sections: - sexdeterrmine-snps table_columns_visible: - FastQC (pre-AdapterRemoval): + FastQC (pre-Trimming): percent_duplicates: False percent_gc: True avg_sequence_length: True @@ -117,7 +117,7 @@ table_columns_visible: Adapter Removal: aligned_total: False percent_aligned: True - FastQC (post-AdapterRemoval): + FastQC (post-Trimming): avg_sequence_length: True percent_duplicates: False total_sequences: True @@ -180,7 +180,7 @@ table_columns_visible: Total_Snps: False table_columns_placement: - FastQC (pre-AdapterRemoval): + FastQC (pre-Trimming): total_sequences: 100 avg_sequence_length: 110 percent_gc: 120 @@ -188,7 +188,7 @@ table_columns_placement: after_filtering_gc_content: 200 Adapter Removal: percent_aligned: 300 - FastQC (post-AdapterRemoval): + FastQC (post-Trimming): total_sequences: 400 avg_sequence_length: 410 percent_gc: 420 diff --git a/docs/output.md b/docs/output.md index cc07d9a69..978b01f70 100644 --- a/docs/output.md +++ b/docs/output.md @@ -112,7 +112,8 @@ When dealing with ancient DNA data the MultiQC plots for FastQC will often show For further reading and documentation see the [FastQC help pages](http://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/). -> **NB:** The FastQC (pre-AdapterRemoval) plots displayed in the MultiQC report shows *untrimmed* reads. They may contain adapter sequence and potentially regions with low quality. To see how your reads look after trimming, look at the FastQC reports in the FastQC (post-AdapterRemoval). You should expect after AdapterRemoval, that most of the artefacts are removed. +> **NB:** The FastQC (pre-Trimming) plots displayed in the MultiQC report shows *untrimmed* reads. They may contain adapter sequence and potentially regions with low quality. To see how your reads look after trimming, look at the FastQC reports in the FastQC (post-Trimming) section. 
You should expect after AdapterRemoval, that most of the artefacts are removed. +> :warning: If you turned on `post_ar_fastq_trimming` your 'post-Trimming' report will _include_ reads that were additionally trimmed. There is no separate report for the post-AdapterRemoval trimming. #### Sequence Counts @@ -648,7 +649,8 @@ Each module has it's own output directory which sit alongside the `MultiQC/` dir * `reference_genome/`: this directory contains the indexing files of your input reference genome (i.e. the various `bwa` indices, a `samtools`' `.fai` file, and a picard `.dict`), if you used the `--saveReference` flag. * `fastqc/`: this contains the original per-FASTQ FastQC reports that are summarised with MultiQC. These occur in both `html` (the report) and `.zip` format (raw data). The `after_clipping` folder contains the same but for after AdapterRemoval. -* `adapterremoval/`: this contains the log files (ending with `.settings`) with raw trimming (and merging) statistics after AdapterRemoval. In the `output` sub-directory, are the output trimmed (and merged) FASTQ files. These you can use for downstream applications such as taxonomic binning for metagenomic studies. +* `adapterremoval/`: this contains the log files (ending with `.settings`) with raw trimming (and merging) statistics after AdapterRemoval. In the `output` sub-directory, are the output trimmed (and merged) `fastq` files. These you can use for downstream applications such as taxonomic binning for metagenomic studies. +* `post_ar_fastq_trimmed/`: this contains `fastq` files that have been additionally trimmed after AdapterRemoval (if turned on). These are usually reads that had internal barcodes, or damage, that needed to be removed before mapping. * `mapping/`: this contains a sub-directory corresponding to the mapping tool you used, inside of which will be the initial BAM files containing the reads that mapped to your reference genome with no modification (see below). 
You will also find a corresponding BAM index file (ending in `.csi` or `.bam`), and if running the `bowtie2` mapper: a log ending in `_bt2.log`. You can use these for downstream applications e.g. if you wish to use a different de-duplication tool not included in nf-core/eager (although please feel free to add a new module request on the Github repository's [issue page](https://github.com/nf-core/eager/issues)!). * `samtools/`: this contains two sub-directories. `stats/` contain the raw mapping statistics files (ending in `.stats`) from directly after mapping. `filter/` contains BAM files that have had a mapping quality filter applied (set by the `--bam_mapping_quality_threshold` flag) and a corresponding index file. Furthermore, if you selected `--bam_discard_unmapped`, you will find your separate file with only unmapped reads in the format you selected. Note unmapped read BAM files will _not_ have an index file. * `deduplication/`: this contains a sub-directory called `dedup/`, inside here are sample specific directories. Each directory contains a BAM file containing mapped reads but with PCR duplicates removed, a corresponding index file and two stats file. `.hist.` contains raw data for a deduplication histogram used for tools like preseq (see below), and the `.log` contains overall summary deduplication statistics. 
diff --git a/main.nf b/main.nf index a45bc8943..2e9173d0d 100644 --- a/main.nf +++ b/main.nf @@ -932,14 +932,97 @@ if (!params.skip_adapterremoval) { ch_output_from_adapterremoval.mix(ch_fastp_for_skipadapterremoval) .filter { it =~/.*combined.fq.gz|.*truncated.gz/ } .dump(tag: "AR Bypass") - .into { ch_adapterremoval_for_fastqc_after_clipping; ch_adapterremoval_for_lanemerge; } + .into { ch_adapterremoval_for_post_ar_trimming; ch_adapterremoval_for_skip_post_ar_trimming; } } else { ch_fastp_for_skipadapterremoval - .into { ch_adapterremoval_for_fastqc_after_clipping; ch_adapterremoval_for_lanemerge; } + .into { ch_adapterremoval_for_post_ar_trimming; ch_adapterremoval_for_skip_post_ar_trimming; } } +// Post AR fastq trimming + +process post_ar_fastq_trimming { + label 'mc_small' + tag "${libraryid}" + publishDir "${params.outdir}/post_ar_fastq_trimmed", mode: params.publish_dir_mode + + when: params.run_post_ar_trimming + + input: + tuple samplename, libraryid, lane, seqtype, organism, strandedness, udg, path(r1), path(r2) from ch_adapterremoval_for_post_ar_trimming + + output: + tuple samplename, libraryid, lane, seqtype, organism, strandedness, udg, path("*_R1_postartrimmed.fq.gz") into ch_post_ar_trimming_for_lanemerge_r1 + tuple samplename, libraryid, lane, seqtype, organism, strandedness, udg, path("*_R2_postartrimmed.fq.gz") optional true into ch_post_ar_trimming_for_lanemerge_r2 + + script: + if ( seqtype == 'SE' | (seqtype == 'PE' && !params.skip_collapse) ) { + """ + fastp --in1 ${r1} --trim_front1 ${params.post_ar_trim_front} --trim_tail1 ${params.post_ar_trim_tail} -A -G -Q -L -w ${task.cpus} --out1 "${libraryid}"_R1_postartrimmed.fq.gz + """ + } else if ( seqtype == 'PE' && params.skip_collapse ) { + """ + fastp --in1 ${r1} --in2 ${r2} --trim_front1 ${params.post_ar_trim_front} --trim_tail1 ${params.post_ar_trim_tail} --trim_front2 ${params.post_ar_trim_front2} --trim_tail2 ${params.post_ar_trim_tail2} -A -G -Q -L -w ${task.cpus} --out1 
"${libraryid}"_R1_postartrimmed.fq.gz --out2 "${libraryid}"_R2_postartrimmed.fq.gz + """ + } + +} + +// When not collapsing paired-end data, re-merge the R1 and R2 files into single map. Otherwise if SE or collapsed PE, R2 now becomes NA +// Sort to make sure we get consistent R1 and R2 ordered when using `-resume`, even if not needed for FastQC +if ( params.skip_collapse ){ + ch_post_ar_trimming_for_lanemerge_r1 + .mix(ch_post_ar_trimming_for_lanemerge_r2) + .groupTuple(by: [0,1,2,3,4,5,6]) + .map{ + it -> + def samplename = it[0] + def libraryid = it[1] + def lane = it[2] + def seqtype = it[3] + def organism = it[4] + def strandedness = it[5] + def udg = it[6] + def r1 = file(it[7].sort()[0]) + def r2 = seqtype == "PE" ? file(it[7].sort()[1]) : file("$projectDir/assets/nf-core_eager_dummy.txt") + + [ samplename, libraryid, lane, seqtype, organism, strandedness, udg, r1, r2 ] + + } + .set { ch_post_ar_trimming_for_lanemerge; } +} else { + ch_post_ar_trimming_for_lanemerge_r1 + .map{ + it -> + def samplename = it[0] + def libraryid = it[1] + def lane = it[2] + def seqtype = it[3] + def organism = it[4] + def strandedness = it[5] + def udg = it[6] + def r1 = file(it[7]) + def r2 = file("$projectDir/assets/nf-core_eager_dummy.txt") + + [ samplename, libraryid, lane, seqtype, organism, strandedness, udg, r1, r2 ] + } + .set { ch_post_ar_trimming_for_lanemerge; } +} + + +// Inline barcode removal bypass when not running it +if (params.run_post_ar_trimming) { + ch_post_ar_trimming_for_lanemerge.mix(ch_adapterremoval_for_skip_post_ar_trimming) + .dump(tag: "Inline Removal Bypass") + .into { ch_inlinebarcoderemoval_for_fastqc_after_clipping; ch_inlinebarcoderemoval_for_lanemerge; } +} else { + ch_adapterremoval_for_skip_post_ar_trimming + .into { ch_inlinebarcoderemoval_for_fastqc_after_clipping; ch_inlinebarcoderemoval_for_lanemerge; } +} + + + // Lane merging for libraries sequenced over multiple lanes (e.g. 
NextSeq) -ch_branched_for_lanemerge = ch_adapterremoval_for_lanemerge +ch_branched_for_lanemerge = ch_inlinebarcoderemoval_for_lanemerge .groupTuple(by: [0,1,3,4,5,6]) .map { it -> @@ -1100,7 +1183,7 @@ process lanemerge_hostremoval_fastq { } -// Post-preprocessing QC to help user check pre-processing removed all sequencing artefacts +// Post-preprocessing QC to help user check pre-processing removed all sequencing artefacts. If doing post-AR trimming includes this step in output. process fastqc_after_clipping { label 'mc_small' @@ -1114,7 +1197,7 @@ process fastqc_after_clipping { when: !params.skip_adapterremoval && !params.skip_fastqc input: - tuple samplename, libraryid, lane, seqtype, organism, strandedness, udg, file(r1), file(r2) from ch_adapterremoval_for_fastqc_after_clipping + tuple samplename, libraryid, lane, seqtype, organism, strandedness, udg, file(r1), file(r2) from ch_inlinebarcoderemoval_for_fastqc_after_clipping output: path("*_fastqc.{zip,html}") into ch_fastqc_after_clipping diff --git a/nextflow.config b/nextflow.config index 3e85581f4..bda8ba8f5 100644 --- a/nextflow.config +++ b/nextflow.config @@ -74,6 +74,11 @@ params { preserve5p = false mergedonly = false qualitymax = 41 + run_post_ar_trimming = false + post_ar_trim_front = 7 + post_ar_trim_tail = 7 + post_ar_trim_front2 = 7 + post_ar_trim_tail2 = 7 //Mapping algorithm mapper = 'bwaaln' diff --git a/nextflow_schema.json b/nextflow_schema.json index 64814061c..23b12ab51 100644 --- a/nextflow_schema.json +++ b/nextflow_schema.json @@ -516,6 +516,35 @@ "help_text": "Specify maximum Phred score of the quality field of FASTQ files. The quality-score range can vary depending on the machine and version (e.g. 
see diagram [here](https://en.wikipedia.org/wiki/FASTQ_format#Encoding), and this allows you to increase from the default AdapterRemoval value of `41`.\n\n> Modifies AdapterRemoval parameters: `--qualitymax`", "default": 41, "fa_icon": "fas fa-arrow-up" + }, + "run_post_ar_trimming": { + "type": "boolean", + "description": "Turn on trimming of inline barcodes (i.e. internal barcodes after adapter removal)", + "help_text": "In some cases, you may want to additionally trim reads in a FASTQ file after adapter removal.\n\nThis could be to remove short 'inline' or 'internal' barcodes that are ligated directly onto DNA molecules prior to ligation of adapters and indices (the former of which allow ultra-multiplexing and/or checks for barcode hopping).\n\nIn other cases, you may wish to already remove known high-frequency damage bases to allow stricter mapping.\n\nTurning on this module uses `fastp` to trim one, or both ends of a merged read, or in cases where you have not collapsed your read, R1 and R2.\n" + }, + "post_ar_trim_front": { + "type": "integer", + "default": 7, + "description": "Specify the number of bases to trim off the front of a merged read or R1", + "help_text": "Specify the number of bases to trim off the start of a read in a merged- or forward read FASTQ file.\n\n> Modifies fastp parameters: `--trim_front1`" + }, + "post_ar_trim_tail": { + "type": "integer", + "default": 7, + "description": "Specify the number of bases to trim off the tail of a merged read or R1", + "help_text": "Specify the number of bases to trim off the end of a read in a merged- or forward read FASTQ file.\n\n> Modifies fastp parameters: `--trim_tail1`" + }, + "post_ar_trim_front2": { + "type": "integer", + "default": 7, + "description": "Specify the number of bases to trim off the front of R2", + "help_text": "Specify the number of bases to trim off the start of a read in an unmerged reverse read (R2) FASTQ file.\n\n> Modifies fastp parameters: `--trim_front2`" + }, + 
"post_ar_trim_tail2": { + "type": "integer", + "default": 7, + "description": "Specify the number of bases to trim off the tail of R2", + "help_text": "Specify the number of bases to trim off the end of a read in an unmerged reverse read (R2) FASTQ file.\n\n> Modifies fastp parameters: `--trim_tail2`" } }, "fa_icon": "fas fa-cut", @@ -616,7 +645,6 @@ }, "bt2n": { "type": "integer", - "default": 0, "description": "Specify the -N parameter for bowtie2 (mismatches in seed). This will override defaults from alignmode/sensitivity.", "fa_icon": "fas fa-sort-numeric-down", "help_text": "The number of mismatches allowed in the seed during seed-and-extend procedure of Bowtie2. This will override any values set with `--bt2_sensitivity`. Can either be 0 or 1. Default: 0 (i.e. use`--bt2_sensitivity` defaults).\n\n> Modifies Bowtie2 parameters: `-N`", @@ -627,21 +655,18 @@ }, "bt2l": { "type": "integer", - "default": 0, "description": "Specify the -L parameter for bowtie2 (length of seed substrings). This will override defaults from alignmode/sensitivity.", "fa_icon": "fas fa-ruler-horizontal", "help_text": "The length of the seed sub-string to use during seeding. This will override any values set with `--bt2_sensitivity`. Default: 0 (i.e. use`--bt2_sensitivity` defaults: [20 for local and 22 for end-to-end](http://bowtie-bio.sourceforge.net/bowtie2/manual.shtml#command-line).\n\n> Modifies Bowtie2 parameters: `-L`" }, "bt2_trim5": { "type": "integer", - "default": 0, "description": "Specify number of bases to trim off from 5' (left) end of read before alignment.", "fa_icon": "fas fa-cut", "help_text": "Number of bases to trim at the 5' (left) end of read prior alignment. 
Maybe useful when left-over sequencing artefacts of in-line barcodes present Default: 0\n\n> Modifies Bowtie2 parameters: `-bt2_trim5`" }, "bt2_trim3": { "type": "integer", - "default": 0, "description": "Specify number of bases to trim off from 3' (right) end of read before alignment.", "fa_icon": "fas fa-cut", "help_text": "Number of bases to trim at the 3' (right) end of read prior alignment. Maybe useful when left-over sequencing artefacts of in-line barcodes present Default: 0.\n\n> Modifies Bowtie2 parameters: `-bt2_trim3`" @@ -699,14 +724,12 @@ }, "bam_mapping_quality_threshold": { "type": "integer", - "default": 0, "description": "Minimum mapping quality for reads filter.", "fa_icon": "fas fa-greater-than-equal", "help_text": "Specify a mapping quality threshold for mapped reads to be kept for downstream analysis. By default keeps all reads and is therefore set to `0` (basically doesn't filter anything).\n\n> Modifies samtools view parameter: `-q`" }, "bam_filter_minreadlength": { "type": "integer", - "default": 0, "fa_icon": "fas fa-ruler-horizontal", "description": "Specify minimum read length to be kept after mapping.", "help_text": "Specify minimum length of mapped reads. This filtering will apply at the same time as mapping quality filtering.\n\nIf used _instead_ of minimum length read filtering at AdapterRemoval, this can be useful to get more realistic endogenous DNA percentages, when most of your reads are very short (e.g. in single-stranded libraries) and would otherwise be discarded by AdapterRemoval (thus making an artificially small denominator for a typical endogenous DNA calculation). 
Note in this context you should not perform mapping quality filtering nor discarding of unmapped reads to ensure a correct denominator of all reads, for the endogenous DNA calculation.\n\n> Modifies filter_bam_fragment_length.py parameter: `-l`" @@ -1071,7 +1094,6 @@ }, "freebayes_g": { "type": "integer", - "default": 0, "description": "Specify to skip over regions of high depth by discarding alignments overlapping positions where total read depth is greater than specified in --freebayes_C.", "fa_icon": "fab fa-think-peaks", "help_text": "Specify to skip over regions of high depth by discarding alignments overlapping positions where total read depth is greater than specified C. Not set by default.\n\n> Modifies freebayes parameter: `-g`" From 5ed9de6233c6dca561da2b2fa7d6d65b05fe5ab2 Mon Sep 17 00:00:00 2001 From: James Fellows Yates Date: Wed, 26 May 2021 09:23:59 +0200 Subject: [PATCH 2/8] Add CI tests --- .github/workflows/ci.yml | 3 +++ docs/output.md | 2 +- 2 files changed, 4 insertions(+), 1 deletion(-) diff --git a/.github/workflows/ci.yml b/.github/workflows/ci.yml index 213c2ac69..1ab27185b 100644 --- a/.github/workflows/ci.yml +++ b/.github/workflows/ci.yml @@ -102,6 +102,9 @@ jobs: - name: ADAPTERREMOVAL Run the basic pipeline with preserve5p end and merged reads only options run: | nextflow run ${GITHUB_WORKSPACE} -profile test_tsv,docker --preserve5p --mergedonly + - name: POST_AR_FASTQ_TRIMMING Run the basic pipeline post-adapterremoval FASTQ trimming + run: | + nextflow run ${GITHUB_WORKSPACE} -profile test_tsv,docker --run_post_ar_trimming - name: MAPPER_CIRCULARMAPPER Test running with CircularMapper run: | nextflow run ${GITHUB_WORKSPACE} -profile test_tsv,docker --mapper 'circularmapper' --circulartarget 'NC_007596.2' diff --git a/docs/output.md b/docs/output.md index 978b01f70..c7b7c6d3a 100644 --- a/docs/output.md +++ b/docs/output.md @@ -113,7 +113,7 @@ When dealing with ancient DNA data the MultiQC plots for FastQC will often show For further 
reading and documentation see the [FastQC help pages](http://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/). > **NB:** The FastQC (pre-Trimming) plots displayed in the MultiQC report shows *untrimmed* reads. They may contain adapter sequence and potentially regions with low quality. To see how your reads look after trimming, look at the FastQC reports in the FastQC (post-Trimming) section. You should expect after AdapterRemoval, that most of the artefacts are removed. -> :warning: If you turned on `post_ar_fastq_trimming` your 'post-Trimming' report will _include_ reads that were additionally trimmed. There is no separate report for the post-AdapterRemoval trimming. +> :warning: If you turned on `--run_post_ar_trimming`, your 'post-Trimming' report shows the statistics _after_ this trimming. There is no separate report for the post-AdapterRemoval trimming. #### Sequence Counts From bdfc5a19a94fb0abf6b0d9beca0c1f9981be788b Mon Sep 17 00:00:00 2001 From: James Fellows Yates Date: Mon, 7 Jun 2021 15:33:33 +0200 Subject: [PATCH 3/8] dd additional CI --- .github/workflows/ci.yml | 3 +++ 1 file changed, 3 insertions(+) diff --git a/.github/workflows/ci.yml b/.github/workflows/ci.yml index 1ab27185b..45e921907 100644 --- a/.github/workflows/ci.yml +++ b/.github/workflows/ci.yml @@ -105,6 +105,9 @@ jobs: - name: POST_AR_FASTQ_TRIMMING Run the basic pipeline post-adapterremoval FASTQ trimming run: | nextflow run ${GITHUB_WORKSPACE} -profile test_tsv,docker --run_post_ar_trimming + - name: POST_AR_FASTQ_TRIMMING Run the basic pipeline post-adapterremoval FASTQ trimming, but skip adapterremoval + run: | + nextflow run ${GITHUB_WORKSPACE} -profile test_tsv,docker --run_post_ar_trimming --skip_adapterremoval - name: MAPPER_CIRCULARMAPPER Test running with CircularMapper run: | nextflow run ${GITHUB_WORKSPACE} -profile test_tsv,docker --mapper 'circularmapper' --circulartarget 'NC_007596.2' From 231ae781bb750eb9d91f5173693f17b3ae9739a2 Mon Sep 17 00:00:00 2001 From: James 
Fellows Yates Date: Tue, 8 Jun 2021 09:50:26 +0200 Subject: [PATCH 4/8] Fix naming conflict when skip AR + running post_ar trim --- main.nf | 24 ++++++++++++------------ 1 file changed, 12 insertions(+), 12 deletions(-) diff --git a/main.nf b/main.nf index 2e9173d0d..6056a6fe8 100644 --- a/main.nf +++ b/main.nf @@ -931,7 +931,7 @@ if ( params.skip_collapse ){ if (!params.skip_adapterremoval) { ch_output_from_adapterremoval.mix(ch_fastp_for_skipadapterremoval) .filter { it =~/.*combined.fq.gz|.*truncated.gz/ } - .dump(tag: "AR Bypass") + .dump(tag: "ar_bypass") .into { ch_adapterremoval_for_post_ar_trimming; ch_adapterremoval_for_skip_post_ar_trimming; } } else { ch_fastp_for_skipadapterremoval @@ -957,11 +957,11 @@ process post_ar_fastq_trimming { script: if ( seqtype == 'SE' | (seqtype == 'PE' && !params.skip_collapse) ) { """ - fastp --in1 ${r1} --trim_front1 ${params.post_ar_trim_front} --trim_tail1 ${params.post_ar_trim_tail} -A -G -Q -L -w ${task.cpus} --out1 "${libraryid}"_R1_postartrimmed.fq.gz + fastp --in1 ${r1} --trim_front1 ${params.post_ar_trim_front} --trim_tail1 ${params.post_ar_trim_tail} -A -G -Q -L -w ${task.cpus} --out1 "${libraryid}"_L"${lane}"_R1_postartrimmed.fq.gz """ } else if ( seqtype == 'PE' && params.skip_collapse ) { """ - fastp --in1 ${r1} --in2 ${r2} --trim_front1 ${params.post_ar_trim_front} --trim_tail1 ${params.post_ar_trim_tail} --trim_front2 ${params.post_ar_trim_front2} --trim_tail2 ${params.post_ar_trim_tail2} -A -G -Q -L -w ${task.cpus} --out1 "${libraryid}"_R1_postartrimmed.fq.gz --out2 "${libraryid}"_R2_postartrimmed.fq.gz + fastp --in1 ${r1} --in2 ${r2} --trim_front1 ${params.post_ar_trim_front} --trim_tail1 ${params.post_ar_trim_tail} --trim_front2 ${params.post_ar_trim_front2} --trim_tail2 ${params.post_ar_trim_tail2} -A -G -Q -L -w ${task.cpus} --out1 "${libraryid}"_L"${lane}"_R1_postartrimmed.fq.gz --out2 "${libraryid}"_L"${lane}"_R2_postartrimmed.fq.gz """ } @@ -1012,7 +1012,7 @@ if ( params.skip_collapse ){ // Inline 
barcode removal bypass when not running it if (params.run_post_ar_trimming) { ch_post_ar_trimming_for_lanemerge.mix(ch_adapterremoval_for_skip_post_ar_trimming) - .dump(tag: "Inline Removal Bypass") + .dump(tag: "inline_removal_bypass") .into { ch_inlinebarcoderemoval_for_fastqc_after_clipping; ch_inlinebarcoderemoval_for_lanemerge; } } else { ch_adapterremoval_for_skip_post_ar_trimming @@ -1039,7 +1039,7 @@ ch_branched_for_lanemerge = ch_inlinebarcoderemoval_for_lanemerge [ samplename, libraryid, lane, seqtype, organism, strandedness, udg, r1, r2 ] } - .dump(tag: "LaneMerge Bypass") + .dump(tag: "lanemerge_bypass_decision") .branch { skip_merge: it[7].size() == 1 // Can skip merging if only single lanes merge_me: it[7].size() > 1 @@ -1060,7 +1060,7 @@ ch_branched_for_lanemerge_skipme = ch_branched_for_lanemerge.skip_merge [ samplename, libraryid, lane, seqtype, organism, strandedness, udg, r1, r2 ] } - .dump(tag: "LaneMerge Reconfigure") + .dump(tag: "lanemerge_reconfigure") ch_branched_for_lanemerge_ready = ch_branched_for_lanemerge.merge_me @@ -1088,7 +1088,7 @@ process lanemerge { publishDir "${params.outdir}/lanemerging", mode: params.publish_dir_mode input: - tuple samplename, libraryid, lane, seqtype, organism, strandedness, udg, path(r1), path(r2) from ch_branched_for_lanemerge_ready + tuple samplename, libraryid, lane, seqtype, organism, strandedness, udg, path(r1), path(r2) from ch_branched_for_lanemerge_ready.dump(tag: "lange_merge_input") output: tuple samplename, libraryid, lane, seqtype, organism, strandedness, udg, path("*_R1_lanemerged.fq.gz") into ch_lanemerge_for_mapping_r1 @@ -1112,7 +1112,7 @@ process lanemerge { // Ensuring always valid R2 file even if doesn't exist for AWS if ( ( params.skip_collapse || params.skip_adapterremoval ) ) { ch_lanemerge_for_mapping_r1 - .dump(tag: "Post LaneMerge Reconfigure") + .dump(tag: "post_lanemerge_reconfigure") .mix(ch_lanemerge_for_mapping_r2) .groupTuple(by: [0,1,2,3,4,5,6]) .map{ @@ -1227,7 +1227,7 @@ 
process bwa { publishDir "${params.outdir}/mapping/bwa", mode: params.publish_dir_mode input: - tuple samplename, libraryid, lane, seqtype, organism, strandedness, udg, path(r1), path(r2) from ch_lanemerge_for_bwa.dump(tag: "input_tuple") + tuple samplename, libraryid, lane, seqtype, organism, strandedness, udg, path(r1), path(r2) from ch_lanemerge_for_bwa.dump(tag: "bwa_input_reads") path index from bwa_index.collect().dump(tag: "input_index") output: @@ -1544,7 +1544,7 @@ ch_branched_for_seqtypemerge = ch_mapping_for_seqtype_merging [ samplename, libraryid, lane, seqtype_new, organism, strandedness, udg, r1, r2 ] } - .dump(tag: "Seqtype") + .dump(tag: "pre_seqtype_decision") .branch { skip_merge: it[7].size() == 1 // Can skip merging if only single lanes merge_me: it[7].size() > 1 @@ -1927,7 +1927,7 @@ process library_merge { publishDir "${params.outdir}/merged_bams/initial", mode: params.publish_dir_mode input: - tuple samplename, libraryid, lane, seqtype, organism, strandedness, udg, file(bam), file(bai) from ch_fixedinput_for_librarymerging.dump(tag: "Input Tuple Library Merge") + tuple samplename, libraryid, lane, seqtype, organism, strandedness, udg, file(bam), file(bai) from ch_fixedinput_for_librarymerging.dump(tag: "library_merge_input") output: tuple samplename, val("${samplename}_libmerged"), lane, seqtype, organism, strandedness, udg, path("*_libmerged_rg_rmdup.bam"), path("*_libmerged_rg_rmdup.bam.{bai,csi}") into ch_output_from_librarymerging @@ -2435,7 +2435,7 @@ process genotyping_pileupcaller { file fai from ch_fai_for_pileupcaller.collect() file dict from ch_dict_for_pileupcaller.collect() path(bed) from ch_bed_for_pileupcaller.collect() - path(snp) from ch_snp_for_pileupcaller.collect().dump(tag: "Pileupcaller SNP file") + path(snp) from ch_snp_for_pileupcaller.collect().dump(tag: "pileupcaller_snp_file") output: tuple samplename, libraryid, lane, seqtype, organism, strandedness, udg, path("pileupcaller.${strandedness}.*") into 
ch_for_eigenstrat_snp_coverage From 6741b0079bc5c667f422b90352d4d5609af2c28b Mon Sep 17 00:00:00 2001 From: James Fellows Yates Date: Wed, 9 Jun 2021 21:52:59 +0200 Subject: [PATCH 5/8] Remove accidently merged pre-fastq triming file from channel of trimmed files --- main.nf | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/main.nf b/main.nf index 6056a6fe8..bf23ebfcf 100644 --- a/main.nf +++ b/main.nf @@ -930,11 +930,13 @@ if ( params.skip_collapse ){ // AdapterRemoval bypass when not running it if (!params.skip_adapterremoval) { ch_output_from_adapterremoval.mix(ch_fastp_for_skipadapterremoval) + .dump(tag: "post_ar_adapterremoval_decision_skipar") .filter { it =~/.*combined.fq.gz|.*truncated.gz/ } .dump(tag: "ar_bypass") .into { ch_adapterremoval_for_post_ar_trimming; ch_adapterremoval_for_skip_post_ar_trimming; } } else { ch_fastp_for_skipadapterremoval + .dump(tag: "post_ar_adapterremoval_decision_withar") .into { ch_adapterremoval_for_post_ar_trimming; ch_adapterremoval_for_skip_post_ar_trimming; } } @@ -1011,7 +1013,7 @@ if ( params.skip_collapse ){ // Inline barcode removal bypass when not running it if (params.run_post_ar_trimming) { - ch_post_ar_trimming_for_lanemerge.mix(ch_adapterremoval_for_skip_post_ar_trimming) + ch_post_ar_trimming_for_lanemerge .dump(tag: "inline_removal_bypass") .into { ch_inlinebarcoderemoval_for_fastqc_after_clipping; ch_inlinebarcoderemoval_for_lanemerge; } } else { @@ -1019,8 +1021,6 @@ if (params.run_post_ar_trimming) { .into { ch_inlinebarcoderemoval_for_fastqc_after_clipping; ch_inlinebarcoderemoval_for_lanemerge; } } - - // Lane merging for libraries sequenced over multiple lanes (e.g. NextSeq) ch_branched_for_lanemerge = ch_inlinebarcoderemoval_for_lanemerge .groupTuple(by: [0,1,3,4,5,6]) From 863570e38d7a4e82da1782536f95e35e0a6b8c6d Mon Sep 17 00:00:00 2001 From: "James A. 
Fellows Yates" Date: Fri, 11 Jun 2021 13:31:10 +0200 Subject: [PATCH 6/8] Update ci.yml --- .github/workflows/ci.yml | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/.github/workflows/ci.yml b/.github/workflows/ci.yml index 45e921907..264d63f4c 100644 --- a/.github/workflows/ci.yml +++ b/.github/workflows/ci.yml @@ -59,7 +59,7 @@ jobs: git clone --single-branch --branch eager https://github.com/nf-core/test-datasets.git data - name: DELAY to try address some odd behaviour with what appears to be a conflict between parallel htslib jobs leading to CI hangs run: | - if [[ $NXF_VER = '' ]]; then sleep 360; fi + if [[ $NXF_VER = '' ]]; then sleep 600; fi - name: BASIC Run the basic pipeline with directly supplied single-end FASTQ run: | nextflow run ${GITHUB_WORKSPACE} -profile test,docker --input 'data/testdata/Mammoth/fastq/*_R1_*.fq.gz' --single_end @@ -200,4 +200,4 @@ jobs: nextflow run ${GITHUB_WORKSPACE} -profile test_tsv_humanbam,docker --skip_fastqc --skip_adapterremoval --skip_deduplication --skip_qualimap --skip_preseq --skip_damage_calculation --run_mtnucratio - name: RESCALING Run basic pipeline with basic pipeline but with mapDamage rescaling of BAM files. Note this will be slow run: | - nextflow run ${GITHUB_WORKSPACE} -profile test_tsv,docker --run_mapdamage_rescaling --run_genotyping --genotyping_tool hc --genotyping_source 'rescaled' \ No newline at end of file + nextflow run ${GITHUB_WORKSPACE} -profile test_tsv,docker --run_mapdamage_rescaling --run_genotyping --genotyping_tool hc --genotyping_source 'rescaled' From 707b5bcb0edac605fb2a245a28ce977ce984e56a Mon Sep 17 00:00:00 2001 From: "James A. 
Fellows Yates" Date: Fri, 11 Jun 2021 14:37:00 +0200 Subject: [PATCH 7/8] Update ci.yml --- .github/workflows/ci.yml | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/.github/workflows/ci.yml b/.github/workflows/ci.yml index 264d63f4c..7b0315a4b 100644 --- a/.github/workflows/ci.yml +++ b/.github/workflows/ci.yml @@ -59,7 +59,7 @@ jobs: git clone --single-branch --branch eager https://github.com/nf-core/test-datasets.git data - name: DELAY to try address some odd behaviour with what appears to be a conflict between parallel htslib jobs leading to CI hangs run: | - if [[ $NXF_VER = '' ]]; then sleep 600; fi + if [[ $NXF_VER = '' ]]; then sleep 1200; fi - name: BASIC Run the basic pipeline with directly supplied single-end FASTQ run: | nextflow run ${GITHUB_WORKSPACE} -profile test,docker --input 'data/testdata/Mammoth/fastq/*_R1_*.fq.gz' --single_end From 9207374b2ca7fc50f80a2934f7f03ccfbdafee50 Mon Sep 17 00:00:00 2001 From: James Fellows Yates Date: Mon, 26 Jul 2021 10:07:01 +0200 Subject: [PATCH 8/8] update changelgo --- CHANGELOG.md | 13 +++++++++++++ 1 file changed, 13 insertions(+) diff --git a/CHANGELOG.md b/CHANGELOG.md index 3907b4cde..ec1b2d98e 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -3,6 +3,19 @@ The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/) and this project adheres to [Semantic Versioning](http://semver.org/spec/v2.0.0.html). +## v2.4dev - [unreleased] + +### `Added` + +- [#642](https://github.com/nf-core/eager/issues/642) and [#431](https://github.com/nf-core/eager/issues/431) adds post-adapter removal barcode/fastq trimming +### `Fixed` + +- [#771](https://github.com/nf-core/eager/issues/771) Remove legacy code +- Improved output documentation for MultiQC general stats table (thanks to @KathrinNaegele and @esalmela) + +### `Dependencies` + +### `Deprecated` ## v2.3.5dev - [date] ### `Added`