A pipeline for genome annotation (On Polar2020) consisting of repeat masking, gene prediction, and functional annotation.
- Nextflow: conda activate /data2/work/local/miniconda/envs/nextflow
- Access to reference data (protein references, RNA-seq reads for genome annotation)
- RNA-seq reads for annotation must all be in one directory, which is passed to the script with the --rna_reads flag
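
If your RNA-seq reads are spread over several sequencing-run folders, one way to satisfy the single-directory requirement is to symlink them into one place. The sketch below is only an illustration: `collect_reads` is a hypothetical helper that is not part of FeatureFlow, the FASTQ extensions it matches are assumptions about your data, and whether the pipeline accepts symlinked or gzipped reads has not been verified here.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper (not part of FeatureFlow): symlink every FASTQ file found
# under one or more source directories into a single flat directory.
# Usage: collect_reads DEST_DIR SRC_DIR [SRC_DIR...]
collect_reads() {
    local dest="$1"; shift
    mkdir -p "$dest"
    local src f
    for src in "$@"; do
        # Assumed extensions; adjust the patterns to match your read files.
        while IFS= read -r -d '' f; do
            # -f overwrites an existing link, so files that share a basename
            # across source directories will collide; rename them first.
            ln -sf "$(realpath "$f")" "$dest/$(basename "$f")"
        done < <(find "$src" -type f \( -name '*.fastq' -o -name '*.fq' \
                 -o -name '*.fastq.gz' -o -name '*.fq.gz' \) -print0)
    done
}

# Example with placeholder paths:
# collect_reads /path/to/rna/reads /path/to/run1 /path/to/run2
```

The resulting directory is what you would pass to --rna_reads.
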
# Load nextflow onto your environment
conda activate /data2/work/local/miniconda/envs/nextflow
# run FeatureFlow
nextflow run artorias111/FeatureFlow --genome_assembly /path/to/my_genome.fa --rna_reads /path/to/rna/reads --species_id my_species --nthreads 64
# run FeatureFlow in a specific mode (see below for the full list of modes)
nextflow run artorias111/FeatureFlow --runMode interPro --braker_aa /path/to/braker.aa --braker_gff /path/to/braker.gff3
# Load nextflow onto your environment
conda activate /data2/work/local/miniconda/envs/nextflow
# clone this repo
git clone https://github.com/artorias111/FeatureFlow.git
# rename the folder to make more sense
mv FeatureFlow Dmaw12_annotations
cd Dmaw12_annotations
# Run FeatureFlow's entire pipeline
nextflow run main.nf --genome_assembly /path/to/my_genome.fa --rna_reads /path/to/rna/reads --nthreads 64
# Run FeatureFlow in interpro mode
nextflow run main.nf --runMode interPro --braker_aa /path/to/braker.aa --braker_gff /path/to/braker.gff3
# Show help message
nextflow run main.nf --help
There are situations where you've already run RepeatMasker after assembling your genome and now want to run the rest of this pipeline without running RepeatMasker again. That's possible via FeatureFlow's run modes, selected with the --runMode flag. A list of all run modes is in the "Run Modes" section below.
If you want to run the pipeline starting from braker:
nextflow run main.nf --runMode braker_interpro --genome_assembly /path/to/masked/assembly.fa --rna_reads /path/to/rna_seq/read/dir
Two modes that can be specified through --runMode (see the "Run Modes" section) let you train braker with only a protein reference instead of protein + RNA-seq.
FeatureFlow can be run in different modes, depending on the use case. A list of available Run Modes and use cases:
| Run mode | Flag | Description |
|---|---|---|
| full | --runMode full | Entire pipeline, requires --genome_assembly and --rna_reads |
| repeatMask | --runMode repeatMask | Run RepeatModeler, RepeatMasker, and KimuraDiverge, only requires --genome_assembly |
| braker+interpro | --runMode braker_interpro | Run braker, followed by interproscan, and combine the two results. Ideal if you've already run repeatmasker |
| braker | --runMode braker | Only braker, expects you to provide a masked genome assembly via --genome_assembly. Also requires --rna_reads |
| interPro | --runMode interPro | Only interPro, expects you to provide an amino acid sequence file via --braker_aa, and the braker gff to combine the interpro results with via --braker_gff |
| braker_bam | --runMode braker_bam | You have a merged bam file that already contains the aligned reads, and you want to run braker with this pre-aligned file. Requires a masked assembly with --genome_assembly and the bam file with --braker_bam |
| protein_only | --runMode protein_only | Same as --runMode full, but without RNA-seq reads, and only uses protein reference to train braker. Requires only the path to genome assembly via --genome_assembly |
| brakerP_only | --runMode braker_protein_only | Same as braker+interpro, but with only protein reference. Requires a masked genome assembly via --genome_assembly |
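
Because each run mode expects a different set of inputs, it can help to check them before launching a long job. The lookup below is a sketch transcribed from the table above and from the braker_interpro example command, not from the pipeline's own code. `required_flags` is a hypothetical helper name, and the pipeline's help message remains the authoritative reference.

```shell
# Hypothetical pre-flight lookup (not part of FeatureFlow): print the flags a
# run mode needs, as described in the Run Modes table.
required_flags() {
    case "$1" in
        full|braker|braker_interpro)
            echo "--genome_assembly --rna_reads" ;;
        repeatMask|protein_only|braker_protein_only)
            echo "--genome_assembly" ;;
        interPro)
            echo "--braker_aa --braker_gff" ;;
        braker_bam)
            echo "--genome_assembly --braker_bam" ;;
        *)
            echo "unknown run mode: $1" >&2
            return 1 ;;
    esac
}

# Example:
# required_flags braker_bam
```

Keeping the mapping in a single `case` statement makes it easy to update if the modes change.
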
| Parameter | Description |
|---|---|
| --genome_assembly | Path to genome assembly FASTA file (or the masked assembly, depending on the runMode) |
| --nthreads | Number of CPU threads to use (default: 64) |
| --rna_reads | Path to RNA-seq reads directory |
| --protein_ref | Path to reference proteins FASTA file |
| --braker_aa | Path to a braker.aa (or any) amino acid fasta file |
| --braker_gff | Path to a braker.gff3 file, to combine interpro results with the braker annotations |
If you're interested in the final annotated gff3, you will find it in results/agat/agat_out
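
A quick sanity check on the final annotation is to tally how many features of each type it contains. The snippet below is a minimal sketch that relies only on the standard tab-separated GFF3 layout, in which column 3 holds the feature type. `count_features` is a hypothetical helper, and the file path in the example is a placeholder, since the exact file name under results/agat/agat_out is not stated here.

```shell
# Hypothetical helper: count GFF3 features by type (column 3).
# Usage: count_features FILE.gff3
count_features() {
    # Skip comment/directive lines and anything without the 9 GFF3 columns,
    # then tally the feature type and print "type<TAB>count", sorted by type.
    awk -F'\t' '!/^#/ && NF >= 9 { n[$3]++ }
                END { for (t in n) printf "%s\t%d\n", t, n[t] }' "$1" | sort
}

# Example (replace with the actual GFF3 path in results/agat/agat_out):
# count_features /path/to/final_annotation.gff3
```
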
When run in full mode, the pipeline generates annotated genome files in the results directory, including:
- Repeat-masked genome sequences
- Gene predictions
- Functional annotations
- Protein and transcript sequences
When run in any of the other modes, the outputs are all in the results directory, with subdirectories named accordingly.
Another set of outputs includes the standard work directory produced by Nextflow pipelines. The directories are named according to the executor name of the process, and contain all the output files, and command logs (as .command.*) for each process.
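
When a run fails, the work directory is usually the fastest place to look. Nextflow writes a `.exitcode` file alongside the `.command.*` files in each task directory, so a short scan can point you to the tasks that did not finish cleanly. The function below is a sketch: `failed_tasks` is a hypothetical helper, and it assumes the default `work` directory unless you pass another path.

```shell
# Hypothetical helper: list Nextflow task directories whose recorded exit
# status is non-zero. Prints "exit_code<TAB>task_directory" per failed task.
# Usage: failed_tasks [WORK_DIR]   (defaults to ./work)
failed_tasks() {
    local work_dir="${1:-work}"
    local code_file code
    while IFS= read -r -d '' code_file; do
        # Strip whitespace so a trailing newline does not affect the comparison.
        code=$(tr -d '[:space:]' < "$code_file")
        if [ "$code" != "0" ]; then
            printf '%s\t%s\n' "$code" "$(dirname "$code_file")"
        fi
    done < <(find "$work_dir" -type f -name '.exitcode' -print0)
}

# Example:
# failed_tasks work
```

From there, reading `.command.err` and `.command.log` in a listed directory shows what the process printed, and `.command.sh` shows the exact command that was run.
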
For advanced configuration, you can edit the parameters directly in the nextflow.config file. However, this is not recommended and is intended primarily for development use. There are also some workflows that are only available for development. For more information, see the help message:
nextflow run main.nf --help
- RepeatMasker
- RepeatModeler
- HISAT2
- BRAKER3
- gffread
- interProScan
- AGAT