Genome Feature annotations. Annotate a genome assembly with TEs (EarlGrey) and protein-coding genes (BRAKER3)

artorias111/FeatureFlow

FeatureFlow

A pipeline for genome annotation (on Polar2020), consisting of repeat masking, gene prediction, and functional annotation.

Prerequisites

  • Nextflow: conda activate /data2/work/local/miniconda/envs/nextflow
  • Access to reference data (protein references, RNA-seq reads for genome annotation)
  • RNA-seq reads for annotation must all be in one directory, which is passed to the script with the --rna_reads flag
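
Before launching a run, it can help to confirm that the reads directory really holds the files directly, since everything is expected in one directory. The sketch below is a hypothetical pre-flight check, not part of FeatureFlow; the function name `count_reads` and the FASTQ extensions it looks for are assumptions.

```shell
# Hypothetical pre-flight check (not part of FeatureFlow).
# Assumption: reads are FASTQ files with one of the extensions below.
count_reads() {
    local dir="$1"
    if [ ! -d "$dir" ]; then
        echo "not a directory: $dir" >&2
        return 1
    fi
    # Count only files directly inside the directory, not in subdirectories.
    find "$dir" -maxdepth 1 -type f \
        \( -name '*.fastq' -o -name '*.fq' -o -name '*.fastq.gz' -o -name '*.fq.gz' \) \
        | wc -l | tr -d ' '
}
```

Calling `count_reads /path/to/rna/reads` prints the number of read files found; a count of 0 suggests the reads sit in subdirectories or use a different extension.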

Quick Start

# Load nextflow onto your environment
conda activate /data2/work/local/miniconda/envs/nextflow

# run FeatureFlow
nextflow run artorias111/FeatureFlow --genome_assembly /path/to/my_genome.fa --rna_reads /path/to/rna/reads --species_id my_species --nthreads 64

# run FeatureFlow in a specific mode (see below for the full list of modes)
nextflow run artorias111/FeatureFlow --runMode interPro --braker_aa /path/to/braker.aa --braker_gff /path/to/braker.gff3

A (slightly) longer way to go about it, if you want to explore and change local code

# Load nextflow onto your environment
conda activate /data2/work/local/miniconda/envs/nextflow

# clone this repo
git clone https://github.com/artorias111/FeatureFlow.git
# rename the folder to make more sense
mv FeatureFlow Dmaw12_annotations
cd Dmaw12_annotations

# Run FeatureFlow's entire pipeline
nextflow run main.nf --genome_assembly /path/to/my_genome.fa --rna_reads /path/to/rna/reads --nthreads 64

# Run FeatureFlow in interpro mode
nextflow run main.nf --runMode interPro --braker_aa /path/to/braker.aa --braker_gff /path/to/braker.gff3

# Show help message
nextflow run main.nf --help

Usage scenarios

Sometimes you have already run RepeatMasker after assembling your genome and want to run the rest of this pipeline without running RepeatMasker again. FeatureFlow supports this through its run modes, selected with the --runMode flag. A list of all run modes is in the "Run Modes" section below.
If you want to run the pipeline starting from BRAKER:

nextflow run main.nf --runMode braker_interpro --genome_assembly /path/to/masked/assembly.fa --rna_reads /path/to/rna_seq/read/dir

If you only have protein data and no RNA-seq for annotation

Two modes, selected through --runMode (see the section "Run Modes"), allow you to train BRAKER with only a protein reference instead of protein + RNA-seq data.
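
As a sketch of how to pick between the two protein-only modes: the choice depends on whether the assembly is already repeat-masked. The helper `protein_only_cmd` below is hypothetical and only prints the command; the paths are placeholders.

```shell
# Hypothetical helper (not part of FeatureFlow): print the protein-only
# command that matches the state of the assembly.
protein_only_cmd() {
    local assembly="$1" masked="$2" mode
    if [ "$masked" = "yes" ]; then
        mode=braker_protein_only   # assembly is already repeat-masked
    else
        mode=protein_only          # unmasked: full pipeline without RNA-seq
    fi
    echo "nextflow run main.nf --runMode $mode --genome_assembly $assembly"
}

# Placeholder path; prints the command instead of running it.
protein_only_cmd /path/to/my_genome.fa no
```

Copy the printed line into your shell to start the run.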

Run Modes

FeatureFlow can be run in different modes, depending on the use case. A list of available Run Modes and use cases:

| Run mode | Flag | Description |
|---|---|---|
| full | --runMode full | entire pipeline, requires --genome_assembly and --rna_reads |
| repeatMask | --runMode repeatMask | run RepeatModeler, RepeatMasker, and KimuraDiverge, only requires --genome_assembly |
| braker+interpro | --runMode braker_interpro | Run braker, followed by interproscan, and combine the two results. Ideal if you've already run repeatmasker |
| braker | --runMode braker | only braker, expects you to provide a masked genome assembly via --genome_assembly. Also requires --rna_reads |
| interPro | --runMode interPro | only interPro, expects you to provide an amino acid sequence file via --braker_aa, and the braker gff to combine the interpro results with --braker_gff |
| braker_bam | --runMode braker_bam | You have a merged bam file that already contains the aligned reads, and you want to run Braker with this pre-aligned file. Requires a masked assembly with --genome_assembly and the bam file with --braker_bam |
| protein_only | --runMode protein_only | Same as --runMode full, but without RNA-seq reads, and only uses protein reference to train braker. Requires only the path to genome assembly via --genome_assembly |
| brakerP_only | --runMode braker_protein_only | Same as braker+interpro, but with only protein reference. Requires a masked genome assembly via --genome_assembly |
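
The requirements in the list of run modes can be checked before a long run starts. The sketch below is a hypothetical wrapper, not part of FeatureFlow: `required_flags` and `check_mode_args` are illustrative names, and the flags for braker_interpro are taken from the example command in "Usage scenarios" rather than from the run mode description.

```shell
# Hypothetical helpers (not part of FeatureFlow).
# Required flags per run mode, transcribed from this README.
required_flags() {
    case "$1" in
        full|braker|braker_interpro)
            echo "--genome_assembly --rna_reads" ;;
        repeatMask|protein_only|braker_protein_only)
            echo "--genome_assembly" ;;
        interPro)
            echo "--braker_aa --braker_gff" ;;
        braker_bam)
            echo "--genome_assembly --braker_bam" ;;
        *)
            echo "unknown run mode: $1" >&2
            return 1 ;;
    esac
}

# Usage: check_mode_args <runMode> <remaining nextflow arguments...>
# Returns non-zero if a flag required by the mode is missing.
check_mode_args() {
    local mode="$1"
    shift
    local required flag
    required=$(required_flags "$mode") || return 1
    for flag in $required; do
        case " $* " in
            *" $flag "*) ;;
            *) echo "run mode '$mode' needs $flag" >&2
               return 1 ;;
        esac
    done
}
```

For example, `check_mode_args interPro --braker_aa /path/to/braker.aa --braker_gff /path/to/braker.gff3 && nextflow run main.nf --runMode interPro --braker_aa /path/to/braker.aa --braker_gff /path/to/braker.gff3` starts the run only if both required files were named.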

Common Parameters

| Parameter | Description |
|---|---|
| --genome_assembly | Path to genome assembly FASTA file (or the masked assembly, depending on the runMode) |
| --nthreads | Number of CPU threads to use (default: 64) |
| --rna_reads | Path to RNA-seq reads directory |
| --protein_ref | Path to reference proteins FASTA file |
| --braker_aa | Path to a braker.aa (or any) amino acid FASTA file |
| --braker_gff | Path to a braker.gff3 file, to combine interpro results with the braker annotations |

Output

If you're interested in the final annotated gff3, you will find it in results/agat/agat_out

When run in full mode, the pipeline generates annotated genome files in the results directory, including:

  • Repeat-masked genome sequences
  • Gene predictions
  • Functional annotations
  • Protein and transcript sequences

When run in any of the other modes, the outputs are all in the results directory, with subdirectories named accordingly.

Another set of outputs is the standard work directory produced by Nextflow pipelines. Each task gets its own subdirectory there (named by Nextflow's task hash), which contains the task's output files and command logs (as .command.*).
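
When a run fails, the work directory is usually the quickest place to look. The sketch below assumes the standard Nextflow layout, in which each task directory holds an `.exitcode` file next to its `.command.*` logs; `failed_tasks` is a hypothetical helper name, not part of FeatureFlow.

```shell
# Hypothetical helper (not part of FeatureFlow): list task directories
# under the Nextflow work directory whose recorded exit code is not 0.
failed_tasks() {
    local work_dir="${1:-work}"
    local f code
    find "$work_dir" -type f -name .exitcode | sort | while IFS= read -r f; do
        # Strip whitespace in case the file ends with a newline.
        code=$(tr -d '[:space:]' < "$f")
        if [ "$code" != "0" ]; then
            echo "$(dirname "$f") exit=$code"
        fi
    done
}
```

Running `failed_tasks work` from the launch directory prints one line per failed task; the `.command.log` and `.command.err` files in each listed directory show what went wrong.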

Advanced Usage

For advanced configuration, you can edit the parameters directly in the nextflow.config file. However, this is not recommended and is intended primarily for development use. Some workflows are also available only for development. For more information, see the help message:

nextflow run main.nf --help

Tools used in this annotation pipeline

  • RepeatMasker
  • RepeatModeler
  • HISAT2
  • BRAKER3
  • gffread
  • InterProScan
  • AGAT
