AnnotationSplitter Documentation

Overview

AnnotationSplitter is a bioinformatics tool for comparing reference gene annotations with those generated by machine-learning annotation tools such as Helixer. Any valid annotation file in GFF format (".gff") can be supplied as the alternate annotation, as long as it contains only a single isoform per gene, which is the default output from Helixer. The pipeline consists of annotation parsing, protein sequence extraction, and sequence similarity comparisons using MMseqs2.
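The single-isoform requirement can be checked before running the tool. The following is a minimal sketch, not part of AnnotationSplitter itself; it assumes GFF3-style attributes (ID=...;Parent=...) and that isoforms appear as "mRNA" or "transcript" features. The function names are hypothetical.

```python
from collections import Counter


def count_transcripts_per_gene(gff_lines):
    """Count mRNA/transcript features per parent gene in GFF3 lines."""
    counts = Counter()
    for line in gff_lines:
        # Skip blank lines, directives, and comments.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        # Assumption: isoforms are typed "mRNA" or "transcript" in column 3.
        if len(fields) != 9 or fields[2] not in ("mRNA", "transcript"):
            continue
        attributes = dict(
            item.strip().split("=", 1)
            for item in fields[8].split(";")
            if "=" in item
        )
        parent = attributes.get("Parent")
        if parent:
            counts[parent] += 1
    return counts


def genes_with_multiple_isoforms(gff_lines):
    """Return the sorted gene IDs that have more than one transcript."""
    counts = count_transcripts_per_gene(gff_lines)
    return sorted(gene for gene, n in counts.items() if n > 1)


# Small in-memory example; for a real file, pass an open file handle instead.
example = [
    "##gff-version 3",
    "chr1\tHelixer\tgene\t1\t900\t.\t+\t.\tID=gene1",
    "chr1\tHelixer\tmRNA\t1\t900\t.\t+\t.\tID=gene1.1;Parent=gene1",
    "chr1\tHelixer\tgene\t1000\t2000\t.\t+\t.\tID=gene2",
    "chr1\tHelixer\tmRNA\t1000\t2000\t.\t+\t.\tID=gene2.1;Parent=gene2",
    "chr1\tHelixer\tmRNA\t1000\t1800\t.\t+\t.\tID=gene2.2;Parent=gene2",
]
print(genes_with_multiple_isoforms(example))
```

An empty list means every gene has at most one transcript, which matches what the tool expects from the alternate annotation.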

Features

Compare gene annotations from different sources (primarily RefSeq; Ensembl annotations also work).
Extract protein sequences from GFF and FASTA files.
Perform sequence similarity analysis using MMseqs2.
Generate summary tables and potential chimera reports.

Installation

Prerequisites

Ensure the following dependencies are installed:

Python 3.8 or later
Required Python libraries (install via requirements.txt):
```
pip install -r requirements.txt
```
MMseqs2: Ensure the mmseqs binary is available in the system PATH or provide its location using the --mmseqs_path argument.
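The prerequisites above can be verified with a short script before the first run. This is a sketch using only the Python standard library; the helper name find_mmseqs is hypothetical and not part of the repository.

```python
import shutil
import sys


def find_mmseqs(mmseqs_path=None):
    """Return a usable path to the mmseqs binary, or None if not found.

    If mmseqs_path is given (the value you would pass to --mmseqs_path),
    it is checked directly; otherwise the system PATH is searched.
    """
    return shutil.which(mmseqs_path or "mmseqs")


# The README asks for Python 3.8 or later.
python_ok = sys.version_info >= (3, 8)
mmseqs_location = find_mmseqs()

print("Python version OK:", python_ok)
print("mmseqs found at:", mmseqs_location)
```

If the second line prints None, either add the MMseqs2 binary to PATH or pass its location with --mmseqs_path.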

Clone the Repository

git clone https://github.com/Andy-B-123/AnnotationSplitter.git
cd AnnotationSplitter

Databases

Please download the relevant database for your organism from https://doi.org/10.6084/m9.figshare.28236284

Usage

Command-Line Arguments

python src/main.py [-h] reference_gff helixer_gff reference_fasta output_directory
                   --mmseqs_db MMSEQS_DB
                   [--mmseqs_path MMSEQS_PATH]
                   [--mmseqs_threads MMSEQS_THREADS]

Positional Arguments

reference_gff: Path to the reference GFF file.
helixer_gff: Path to the Helixer GFF file.
reference_fasta: Path to the reference FASTA file.
output_directory: Directory to store output files.

Optional Arguments

--mmseqs_db: Path to the MMseqs2 SwissProt database (required, even though it is passed as a named option).
--mmseqs_path: Path to the MMseqs2 executable (default: uses system PATH).
--mmseqs_threads: Number of threads for MMseqs2 (default: 16).
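The interface described above can be expressed with argparse roughly as follows. This is an illustrative sketch reconstructed from the usage text, not the actual contents of src/main.py; the help strings are taken from the argument descriptions, and the None default for --mmseqs_path is an assumption standing in for "use the system PATH".

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented command-line interface."""
    parser = argparse.ArgumentParser(prog="main.py")
    parser.add_argument("reference_gff", help="Path to the reference GFF file.")
    parser.add_argument("helixer_gff", help="Path to the Helixer GFF file.")
    parser.add_argument("reference_fasta", help="Path to the reference FASTA file.")
    parser.add_argument("output_directory", help="Directory to store output files.")
    # Named option, but mandatory: argparse reports an error if it is missing.
    parser.add_argument("--mmseqs_db", required=True,
                        help="Path to the MMseqs2 SwissProt database.")
    # Assumption: None means the binary is looked up on the system PATH.
    parser.add_argument("--mmseqs_path", default=None,
                        help="Path to the MMseqs2 executable.")
    parser.add_argument("--mmseqs_threads", type=int, default=16,
                        help="Number of threads for MMseqs2.")
    return parser


# Parse an explicit argument list so the example runs without a shell.
args = build_parser().parse_args([
    "/path/to/reference.gff",
    "/path/to/helixer.gff",
    "/path/to/reference.fasta",
    "/output/directory",
    "--mmseqs_db", "/path/to/mmseqs_db",
])
print(args.mmseqs_db, args.mmseqs_threads)
```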

Example Command

python src/main.py \
    /path/to/reference.gff \
    /path/to/helixer.gff \
    /path/to/reference.fasta \
    /output/directory \
    --mmseqs_db /path/to/mmseqs_db \
    --mmseqs_threads 16

Output

The program generates the following outputs:

GFFCompare Results

gffcompare_output: Outputs from gffcompare summarizing annotation comparisons.

Protein Sequences

reference_proteins.fasta: Extracted protein sequences from the reference annotation.
helixer_proteins.fasta: Extracted protein sequences from the Helixer annotation.

MMseqs2 Analysis

reference_proteins.mmseqs.out: MMseqs2 output for reference proteins.
helixer_proteins.mmseqs.out: MMseqs2 output for Helixer proteins.
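For downstream inspection, the MMseqs2 output files can be loaded with a few lines of Python. The sketch below assumes the files use the default 12-column tab-separated format written by MMseqs2 search tools (query, target, fident, alnlen, mismatch, gapopen, qstart, qend, tstart, tend, evalue, bits); this README does not state the format, so check your files first. The disjoint-hit check is a simple illustrative heuristic and is not necessarily how AnnotationSplitter itself identifies potential chimeras.

```python
import csv
import io
from collections import defaultdict

# Assumed column layout: the MMseqs2 default tabular output.
COLUMNS = ["query", "target", "fident", "alnlen", "mismatch", "gapopen",
           "qstart", "qend", "tstart", "tend", "evalue", "bits"]


def read_hits(handle):
    """Group tab-separated hits by query ID, converting numeric columns."""
    hits = defaultdict(list)
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) != len(COLUMNS):
            continue  # skip malformed or unexpected lines
        hit = dict(zip(COLUMNS, row))
        for key in ("alnlen", "mismatch", "gapopen",
                    "qstart", "qend", "tstart", "tend"):
            hit[key] = int(hit[key])
        for key in ("fident", "evalue", "bits"):
            hit[key] = float(hit[key])
        hits[hit["query"]].append(hit)
    return hits


def has_disjoint_hits(query_hits):
    """True if two hits to different targets cover non-overlapping
    regions of the query (hypothetical heuristic for illustration)."""
    for i, a in enumerate(query_hits):
        for b in query_hits[i + 1:]:
            different_targets = a["target"] != b["target"]
            disjoint = a["qend"] < b["qstart"] or b["qend"] < a["qstart"]
            if different_targets and disjoint:
                return True
    return False


# In-memory example; for a real run, open helixer_proteins.mmseqs.out instead.
example = io.StringIO(
    "protA\tsp|P1\t0.95\t200\t10\t0\t1\t200\t1\t200\t1e-100\t400\n"
    "protA\tsp|P2\t0.90\t180\t18\t0\t250\t430\t5\t185\t1e-80\t350\n"
    "protB\tsp|P3\t0.99\t300\t3\t0\t1\t300\t1\t300\t1e-150\t600\n"
)
hits = read_hits(example)
flagged = [query for query, rows in hits.items() if has_disjoint_hits(rows)]
print(flagged)
```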

Summary Reports

PotentialChimeras.RefSeqs.Summary.csv: Summary of potential chimeras in reference proteins.
PotentialChimeras.Helixer.Summary.csv: Summary of potential chimeras in Helixer proteins.
reference_proteins.potential_chimeras.fasta: FASTA sequences of potential chimeras in reference proteins.
helixer_proteins.potential_chimeras.fasta: FASTA sequences of potential chimeras in Helixer proteins.

Development

Directory Structure

.
├── LICENSE.txt
├── README.md
├── requirements.txt
├── setup.py
├── Databases/
├── src/
│   ├── AnnotationParsing.py
│   ├── MMSeqs_analysis.py
│   ├── program_runners.py
│   └── main.py

Key Files

main.py: Entry point for the program.
AnnotationParsing.py: Functions for parsing annotation data.
program_runners.py: Interfaces for running external programs (e.g., gffcompare).
MMSeqs_analysis.py: Functions for running and processing MMseqs2 analysis.

License

This project is licensed under the terms of the license specified in LICENSE.txt.

Contributing

Feel free to submit issues or pull requests to improve the program.

Contact

For questions or feedback, please reach out to the repository maintainer.
