AnnotationSplitter is a bioinformatics tool for comparing reference gene annotations with those generated by machine-learning annotation tools such as Helixer. Any valid annotation file in GFF (".gff") format can be supplied as the alternate annotation, as long as it contains only a single isoform per gene, which is the default output from Helixer. The pipeline includes annotation parsing, protein sequence extraction, annotation comparison with gffcompare, and sequence similarity searches using MMseqs2.
- Compare gene annotations from different sources (mainly RefSeq but Ensembl works too).
- Extract protein sequences from GFF and FASTA files.
- Perform sequence similarity analysis using MMseqs2.
- Generate summary tables and potential chimera reports.
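Because the alternate annotation is expected to contain a single isoform per gene, it can help to check a GFF file before running the tool. The following is a minimal sketch, not part of AnnotationSplitter itself: the function name `genes_with_multiple_isoforms` is hypothetical, and it assumes GFF3-style attributes in which each `mRNA` or `transcript` feature points to its gene through a `Parent=` attribute.

```python
from collections import defaultdict


def genes_with_multiple_isoforms(gff_lines):
    """Map each gene ID to its transcript IDs, keeping only genes with >1 transcript.

    Assumption: transcripts are 'mRNA' or 'transcript' features whose
    'Parent' attribute names the gene (GFF3 convention).
    """
    transcripts = defaultdict(list)
    for line in gff_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, directives, and comments
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 9 or fields[2] not in ("mRNA", "transcript"):
            continue
        # Column 9 holds semicolon-separated key=value attributes.
        attrs = dict(
            item.split("=", 1) for item in fields[8].split(";") if "=" in item
        )
        parent = attrs.get("Parent")
        if parent:
            transcripts[parent].append(attrs.get("ID", "?"))
    return {gene: ids for gene, ids in transcripts.items() if len(ids) > 1}


# Small in-memory example: gene1 has one isoform, gene2 has two.
example = [
    "##gff-version 3",
    "chr1\tHelixer\tgene\t1\t900\t.\t+\t.\tID=gene1",
    "chr1\tHelixer\tmRNA\t1\t900\t.\t+\t.\tID=gene1.1;Parent=gene1",
    "chr1\tHelixer\tgene\t2000\t3000\t.\t-\t.\tID=gene2",
    "chr1\tHelixer\tmRNA\t2000\t3000\t.\t-\t.\tID=gene2.1;Parent=gene2",
    "chr1\tHelixer\tmRNA\t2000\t2800\t.\t-\t.\tID=gene2.2;Parent=gene2",
]
print(genes_with_multiple_isoforms(example))
```

For a real file, pass an open file handle instead of the list. An empty result suggests the file meets the single-isoform requirement under the assumptions above.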
Ensure the following dependencies are installed:
- Python 3.8 or later
- Required Python libraries, listed in requirements.txt (install with: pip install -r requirements.txt)
- MMseqs2: Ensure the mmseqs binary is available in the system PATH, or provide its location using the --mmseqs_path argument.
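The requirements above can be checked up front with a few lines of standard-library Python. This is an illustrative sketch only: `check_dependencies` is a hypothetical helper, not a function shipped with AnnotationSplitter, and it only verifies the Python version and that an mmseqs executable can be located.

```python
import shutil
import sys


def check_dependencies(mmseqs_path=None, min_python=(3, 8)):
    """Return a list of problems found; an empty list means the checks passed.

    mmseqs_path: optional explicit location of the mmseqs binary; when it is
    None, the executable named 'mmseqs' is looked up on the system PATH.
    """
    problems = []
    if sys.version_info < min_python:
        problems.append(
            "Python %d.%d or later is required" % min_python
        )
    candidate = mmseqs_path or "mmseqs"
    # shutil.which accepts either a bare command name or a full path.
    if shutil.which(candidate) is None:
        problems.append(
            "mmseqs executable not found (looked for: %s)" % candidate
        )
    return problems


for problem in check_dependencies():
    print("Dependency problem:", problem)
```

If the binary lives outside the PATH, the same location passed to `check_dependencies(mmseqs_path=...)` is the one to give to --mmseqs_path.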
git clone https://github.com/Andy-B-123/AnnotationSplitter.git
cd AnnotationSplitter
Download the relevant MMseqs2 database for your organism from https://doi.org/10.6084/m9.figshare.28236284 and supply its path with the --mmseqs_db argument.
python src/main.py [-h] reference_gff helixer_gff reference_fasta output_directory
--mmseqs_db MMSEQS_DB
[--mmseqs_path MMSEQS_PATH]
[--mmseqs_threads MMSEQS_THREADS]
- reference_gff: Path to the reference GFF file.
- helixer_gff: Path to the Helixer GFF file.
- reference_fasta: Path to the reference FASTA file.
- output_directory: Directory to store output files.
- --mmseqs_db: Path to the MMseqs2 SwissProt database (required).
- --mmseqs_path: Path to the MMseqs2 executable (default: uses system PATH).
- --mmseqs_threads: Number of threads for MMseqs2 (default: 16).
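For readers who want to see how this interface maps onto Python's argparse, here is a minimal sketch that mirrors the documented arguments. It is not the actual contents of src/main.py; in particular, using `None` to represent "use the system PATH" for --mmseqs_path is an assumption made for illustration.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented command-line interface."""
    parser = argparse.ArgumentParser(
        prog="main.py",
        description="Compare a reference annotation with a Helixer annotation.",
    )
    parser.add_argument("reference_gff", help="Path to the reference GFF file.")
    parser.add_argument("helixer_gff", help="Path to the Helixer GFF file.")
    parser.add_argument("reference_fasta", help="Path to the reference FASTA file.")
    parser.add_argument("output_directory", help="Directory to store output files.")
    parser.add_argument(
        "--mmseqs_db", required=True,
        help="Path to the MMseqs2 SwissProt database.",
    )
    # Assumption: None stands for 'look up mmseqs on the system PATH'.
    parser.add_argument(
        "--mmseqs_path", default=None,
        help="Path to the MMseqs2 executable.",
    )
    parser.add_argument(
        "--mmseqs_threads", type=int, default=16,
        help="Number of threads for MMseqs2.",
    )
    return parser


# Parse an explicit argument list instead of sys.argv for demonstration.
args = build_parser().parse_args(
    ["ref.gff", "helixer.gff", "ref.fasta", "out", "--mmseqs_db", "db/swissprot"]
)
print(args.mmseqs_db, args.mmseqs_threads)
```

Omitting --mmseqs_db in this sketch makes argparse exit with an error, which matches the argument being documented as required.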
python src/main.py \
/path/to/reference.gff \
/path/to/helixer.gff \
/path/to/reference.fasta \
/output/directory \
--mmseqs_db /path/to/mmseqs_db \
--mmseqs_threads 16
The program generates the following outputs:
- gffcompare_output: Outputs from gffcompare summarizing annotation comparisons.
- reference_proteins.fasta: Extracted protein sequences from the reference annotation.
- helixer_proteins.fasta: Extracted protein sequences from the Helixer annotation.
- reference_proteins.mmseqs.out: MMseqs2 output for reference proteins.
- helixer_proteins.mmseqs.out: MMseqs2 output for Helixer proteins.
- PotentialChimeras.RefSeqs.Summary.csv: Summary of potential chimeras in reference proteins.
- PotentialChimeras.Helixer.Summary.csv: Summary of potential chimeras in Helixer proteins.
- reference_proteins.potential_chimeras.fasta: FASTA sequences of potential chimeras in reference proteins.
- helixer_proteins.potential_chimeras.fasta: FASTA sequences of potential chimeras in Helixer proteins.
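Once a run has finished, a quick way to compare the two annotations is to count how many potential chimeras were reported for each. The sketch below counts FASTA records in the two chimera files named above; the helper names `count_fasta_records` and `chimera_counts` are hypothetical, and the code assumes the files sit directly inside the output directory.

```python
from pathlib import Path

# File names as listed in the outputs above.
CHIMERA_FASTAS = (
    "reference_proteins.potential_chimeras.fasta",
    "helixer_proteins.potential_chimeras.fasta",
)


def count_fasta_records(path):
    """Count sequences in a FASTA file by counting header lines ('>')."""
    with open(path) as handle:
        return sum(1 for line in handle if line.startswith(">"))


def chimera_counts(output_directory):
    """Return {file name: record count}, with None for files that are missing."""
    counts = {}
    for name in CHIMERA_FASTAS:
        path = Path(output_directory) / name
        counts[name] = count_fasta_records(path) if path.is_file() else None
    return counts
```

For example, `chimera_counts("/output/directory")` returns a dictionary with one entry per chimera FASTA file; a value of `None` indicates that the file was not found at the assumed location.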
.
├── LICENSE.TXT
├── README.md
├── requirements.txt
├── setup.py
├── Databases/
├── src/
│ ├── AnnotationParsing.py
│ ├── MMSeqs_analysis.py
│ ├── program_runners.py
│ └── main.py
- main.py: Entry point for the program.
- AnnotationParsing.py: Functions for parsing annotation data.
- program_runners.py: Interfaces for running external programs (e.g., gffcompare).
- MMSeqs_analysis.py: Functions for running and processing MMseqs2 analysis.
This project is licensed under the terms specified in LICENSE.TXT.
Feel free to submit issues or pull requests to improve the program.
For questions or feedback, please reach out to the repository maintainer.