Filtering workflow for identifying mis-annotations in genome annotations using a trusted protein dataset and Helixer

Andy-B-123/AnnotationSplitter

AnnotationSplitter Documentation

Overview

AnnotationSplitter is a bioinformatics tool for comparing reference gene annotations with those generated by machine-learning annotation tools such as Helixer. Any valid GFF (".gff") annotation file can be supplied as the alternate annotation, provided it contains only a single isoform per gene, which is what Helixer produces by default. The pipeline parses both annotations, extracts protein sequences, compares annotation structure with gffcompare, and runs sequence similarity searches with MMseqs2.
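Because the alternate annotation must contain a single isoform per gene, it can be worth checking a non-Helixer GFF before running the tool. The following is a minimal sketch, not part of AnnotationSplitter: it assumes GFF3-style attributes where each `mRNA` (or `transcript`) feature names its gene through a `Parent=` attribute, and the function names are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable


def parse_attributes(column9: str) -> Dict[str, str]:
    """Split a GFF3 attribute column (key=value;key=value) into a dict."""
    attrs = {}
    for field in column9.strip().split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs


def genes_with_multiple_isoforms(lines: Iterable[str]) -> Dict[str, int]:
    """Return {gene_id: transcript_count} for genes with more than one transcript."""
    counts: Counter = Counter()
    for line in lines:
        # Skip blank lines, directives, and comments.
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        # Assumption: transcripts are typed "mRNA" or "transcript" in column 3.
        if len(cols) != 9 or cols[2] not in ("mRNA", "transcript"):
            continue
        parents = parse_attributes(cols[8]).get("Parent", "")
        # GFF3 allows several comma-separated parents.
        for parent in filter(None, parents.split(",")):
            counts[parent] += 1
    return {gene: n for gene, n in counts.items() if n > 1}
```

To check a file, pass an open handle, for example `genes_with_multiple_isoforms(open("/path/to/alternate.gff"))`. An empty result means every gene has at most one transcript.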

Features

  • Compare gene annotations from different sources (primarily RefSeq; Ensembl annotations also work).
  • Extract protein sequences from GFF and FASTA files.
  • Perform sequence similarity analysis using MMseqs2.
  • Generate summary tables and potential chimera reports.

Installation

Prerequisites

Ensure the following dependencies are installed:

  • Python 3.8 or later
  • Required Python libraries (install via requirements.txt):
    pip install -r requirements.txt
  • MMseqs2: Ensure the mmseqs binary is available in the system PATH or provide its location using the --mmseqs_path argument.
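A quick way to confirm the MMseqs2 prerequisite is to look the binary up the same way a shell would. This is a small sketch using only the standard library; `find_executable` is a hypothetical helper, not a function from this repository.

```python
import shutil
from typing import Optional


def find_executable(name: str = "mmseqs", explicit_path: Optional[str] = None) -> Optional[str]:
    """Return a usable path to the executable, or None if it cannot be found.

    If explicit_path is given (as with --mmseqs_path), it is checked directly;
    otherwise the name is searched for on the system PATH.
    """
    return shutil.which(explicit_path or name)


location = find_executable()
if location is None:
    print("mmseqs not found on PATH; pass its location with --mmseqs_path")
else:
    print(f"mmseqs found at {location}")
```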

Clone the Repository

git clone https://github.com/Andy-B-123/AnnotationSplitter.git
cd AnnotationSplitter

Databases

Please download the relevant database for your organism from https://doi.org/10.6084/m9.figshare.28236284

Usage

Command-Line Arguments

python src/main.py [-h] reference_gff helixer_gff reference_fasta output_directory
                   --mmseqs_db MMSEQS_DB
                   [--mmseqs_path MMSEQS_PATH]
                   [--mmseqs_threads MMSEQS_THREADS]

Positional Arguments

  • reference_gff: Path to the reference GFF file.
  • helixer_gff: Path to the Helixer GFF file.
  • reference_fasta: Path to the reference FASTA file.
  • output_directory: Directory to store output files.

Named Arguments

  • --mmseqs_db: Path to the MMseqs2 SwissProt database (required).
  • --mmseqs_path: Path to the MMseqs2 executable (default: uses system PATH).
  • --mmseqs_threads: Number of threads for MMseqs2 (default: 16).
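For readers who want to wrap or script the tool, the interface described above could be declared with `argparse` roughly as follows. This is a sketch reconstructed from the usage text, not the actual contents of `src/main.py`; in particular, representing "uses system PATH" as a default of `None` is an assumption.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented command-line interface."""
    parser = argparse.ArgumentParser(prog="main.py")
    # Positional arguments, in the documented order.
    parser.add_argument("reference_gff", help="Path to the reference GFF file.")
    parser.add_argument("helixer_gff", help="Path to the Helixer GFF file.")
    parser.add_argument("reference_fasta", help="Path to the reference FASTA file.")
    parser.add_argument("output_directory", help="Directory to store output files.")
    # --mmseqs_db is a named argument but is required.
    parser.add_argument("--mmseqs_db", required=True,
                        help="Path to the MMseqs2 SwissProt database.")
    # Assumption: None stands for "look up mmseqs on the system PATH".
    parser.add_argument("--mmseqs_path", default=None,
                        help="Path to the MMseqs2 executable.")
    parser.add_argument("--mmseqs_threads", type=int, default=16,
                        help="Number of threads for MMseqs2.")
    return parser


args = build_parser().parse_args(
    ["ref.gff", "helixer.gff", "ref.fasta", "out", "--mmseqs_db", "db"]
)
print(args.mmseqs_threads)
```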

Example Command

python src/main.py \
    /path/to/reference.gff \
    /path/to/helixer.gff \
    /path/to/reference.fasta \
    /output/directory \
    --mmseqs_db /path/to/mmseqs_db \
    --mmseqs_threads 16

Output

The program generates the following outputs:

GFFCompare Results

  • gffcompare_output: Outputs from gffcompare summarizing annotation comparisons.

Protein Sequences

  • reference_proteins.fasta: Extracted protein sequences from the reference annotation.
  • helixer_proteins.fasta: Extracted protein sequences from the Helixer annotation.
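The protein FASTA files can be inspected with a few lines of Python. The reader below is a minimal sketch that assumes standard FASTA (a `>` header line followed by one or more sequence lines) and uses the first whitespace-delimited token of the header as the record ID; `read_fasta` is a hypothetical helper, not part of this repository.

```python
from typing import Dict, Iterable, List


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Return {record_id: sequence} from FASTA-formatted lines."""
    chunks: Dict[str, List[str]] = {}
    current_id = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first token after ">" as the ID; fall back to "" if absent.
            tokens = line[1:].split()
            current_id = tokens[0] if tokens else ""
            chunks[current_id] = []
        elif current_id is not None:
            chunks[current_id].append(line)
    # Join wrapped sequence lines into a single string per record.
    return {rec_id: "".join(parts) for rec_id, parts in chunks.items()}
```

For example, `read_fasta(open("/output/directory/helixer_proteins.fasta"))` would give a dictionary whose length is the number of extracted Helixer proteins, assuming the file sits directly in the output directory.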

MMseqs2 Analysis

  • reference_proteins.mmseqs.out: MMseqs2 output for reference proteins.
  • helixer_proteins.mmseqs.out: MMseqs2 output for Helixer proteins.
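If the `.mmseqs.out` files use MMseqs2's default tab-separated alignment format (query, target, fident, alnlen, mismatch, gapopen, qstart, qend, tstart, tend, evalue, bits), they can be grouped by query protein as sketched below. That column layout is an assumption about these files; if AnnotationSplitter requests a custom output format, the indices need adjusting. `Hit` and `parse_m8` are hypothetical names.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple


class Hit(NamedTuple):
    query: str
    target: str
    identity: float
    qstart: int
    qend: int
    evalue: float
    bits: float


def parse_m8(lines: Iterable[str]) -> Dict[str, List[Hit]]:
    """Group alignment rows by query ID, assuming the default 12-column layout."""
    hits: Dict[str, List[Hit]] = defaultdict(list)
    for line in lines:
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 12:
            continue  # skip blank or malformed rows
        hits[cols[0]].append(
            Hit(
                query=cols[0],
                target=cols[1],
                identity=float(cols[2]),
                qstart=int(cols[6]),
                qend=int(cols[7]),
                evalue=float(cols[10]),
                bits=float(cols[11]),
            )
        )
    return dict(hits)
```

Grouping by query makes it easy to see, for each protein, which database entries it matches and which part of the query each match covers.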

Summary Reports

  • PotentialChimeras.RefSeqs.Summary.csv: Summary of potential chimeras in reference proteins.
  • PotentialChimeras.Helixer.Summary.csv: Summary of potential chimeras in Helixer proteins.
  • reference_proteins.potential_chimeras.fasta: FASTA sequences of potential chimeras in reference proteins.
  • helixer_proteins.potential_chimeras.fasta: FASTA sequences of potential chimeras in Helixer proteins.
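The summary CSVs can be loaded with the standard `csv` module. Since the column names are not documented here, the sketch below makes no assumption about them and simply returns whatever header the file contains along with its rows; `load_summary` is a hypothetical helper.

```python
import csv
from typing import Dict, Iterable, List, Tuple


def load_summary(lines: Iterable[str]) -> Tuple[List[str], List[Dict[str, str]]]:
    """Return (column_names, rows) from CSV lines with a header row."""
    reader = csv.DictReader(lines)
    rows = list(reader)
    # fieldnames is None for an empty file, so fall back to an empty list.
    return list(reader.fieldnames or []), rows
```

With a real file this would be called as `load_summary(open(path, newline=""))`, where `path` points at one of the summary CSVs; printing the returned column names is a reasonable first step before filtering rows.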

Development

Directory Structure

.
├── LICENSE.TXT
├── README.md
├── requirements.txt
├── setup.py
├── Databases/
├── src/
│   ├── AnnotationParsing.py
│   ├── MMSeqs_analysis.py
│   ├── program_runners.py
│   └── main.py

Key Files

  • main.py: Entry point for the program.
  • AnnotationParsing.py: Functions for parsing annotation data.
  • program_runners.py: Interfaces for running external programs (e.g., gffcompare).
  • MMSeqs_analysis.py: Functions for running and processing MMseqs2 analysis.

License

This project is licensed under the terms of the license specified in LICENSE.TXT.

Contributing

Feel free to submit issues or pull requests to improve the program.

Contact

For questions or feedback, please reach out to the repository maintainer.
