A software suite for accurate identification, annotation, translation, and feature characterization of annotated transcripts. Reference:
- Juan C Entizne, Wenbin Guo, Cristiane P.G. Calixto, Mark Spensley, Nikoleta Tzioutziou, Runxuan Zhang, and John W. S. Brown; [TranSuite: a software suite for accurate translation and characterization of transcripts](https://www.biorxiv.org/content/10.1101/2020.12.15.422989v1)
Original TranSuite repository: https://github.com/anonconda/TranSuite
A modified version of TranSuite with additional features for comprehensive transcript analysis.
Modified by:
- Mojtaba Bagherian, The University of Western Australia
- James Lloyd, The University of Western Australia
This modified TranSuite builds upon the original TranSuite software with the following changes:
- Chimeric Gene Handling: Using the Araport11 GTF as the input file raised an error related to the --chimeric parameter in the FindLORF and TransFeat modules. The parameter is now handled consistently with a default value of None, which prevents AttributeError exceptions when it is not provided and makes chimeric gene processing more robust across all TranSuite modules.
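The default-to-None pattern described above can be sketched as follows. This is a minimal illustration, not the actual TranSuite code: `build_parser` and `get_chimeric_table` are hypothetical names, and the real parser defines many more options.

```python
import argparse


def build_parser():
    # Hypothetical, simplified parser; only two options are shown.
    parser = argparse.ArgumentParser(prog="transuite.py")
    parser.add_argument("--gtf", required=True)
    # default=None guarantees that args.chimeric exists even when the
    # flag is omitted on the command line.
    parser.add_argument("--chimeric", default=None)
    return parser


def get_chimeric_table(args):
    # getattr with a fallback also covers namespaces built by hand
    # (for example by another module) that lack the attribute entirely.
    return getattr(args, "chimeric", None)


args = build_parser().parse_args(["--gtf", "annotation.gtf"])
print(get_chimeric_table(args))
```

With this pattern, downstream code can test `if chimeric is not None:` instead of assuming the attribute is present.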
Precise measurements of splice junction distances and 3' UTR lengths are useful in studying nonsense-mediated mRNA decay (NMD). These calculations measure exonic distances between stop codons and splice junctions to aid in the prediction of transcript fate.
Calculates the precise distance between genomic positions along the mature transcript, excluding intronic sequences. This biological accuracy is critical for NMD analysis, where distances are measured on processed mRNA.
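As a rough sketch of the idea (not the TranSuite implementation), the exonic distance between two genomic positions can be computed by summing only the exon bases that lie between them. The function name `exonic_distance` and the 1-based inclusive coordinate convention are assumptions made for this example.

```python
def exonic_distance(exons, pos_a, pos_b):
    """Count transcript bases separating two genomic positions.

    exons: list of (start, end) tuples, 1-based inclusive as in GTF.
    Both positions are assumed to fall inside an exon. The result is
    symmetric, so it does not depend on the strand.
    """
    lo, hi = sorted((pos_a, pos_b))
    total = 0
    for start, end in exons:
        # Keep only the part of the exon inside the interval (lo, hi].
        overlap_start = max(start, lo + 1)
        overlap_end = min(end, hi)
        if overlap_start <= overlap_end:
            total += overlap_end - overlap_start + 1
    return total


# Two exons separated by a 99-nt intron (201..299), which is skipped.
print(exonic_distance([(100, 200), (300, 400)], 190, 310))  # 21
```

The genomic span between the two positions is 120 nt, but only 21 of those bases are exonic, which is the distance that matters on the processed mRNA.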
Measures the exonic distance from the nearest upstream exon-exon junction to the stop codon. Identifies the closest splice junction before the stop codon and calculates the distance along the mature transcript. This measurement is important for understanding exon junction complex (EJC) positioning and its regulatory effects in mammals.
Calculates the exonic distance from the stop codon to the farthest downstream junction in the 3' UTR. This is the primary NMD determinant - transcripts with stop codons >50 nucleotides upstream of a downstream exon junction are typically NMD substrates in mammals. The measurement focuses on the most distant junction for conservative NMD prediction.
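The downstream-junction measurement can be illustrated in transcript coordinates. The sketch below is an assumption-laden example rather than the module's code: the function names are hypothetical, the stop codon is represented by the transcript position of its last base, and the cut-off uses ≥50 nt to match the PTC_dEJ column.

```python
def junction_offsets(exons, strand):
    """Transcript position (1-based) of the last base before each junction."""
    # Order exons 5' to 3' along the transcript.
    ordered = sorted(exons, reverse=(strand == "-"))
    offsets, running = [], 0
    for start, end in ordered[:-1]:  # the last exon has no downstream junction
        running += end - start + 1
        offsets.append(running)
    return offsets


def farthest_downstream_junction(exons, strand, stop_end):
    """Distance from the stop codon end to the farthest downstream junction.

    stop_end: transcript position (1-based) of the last base of the stop codon.
    Returns None when no junction lies downstream of the stop codon.
    """
    distances = [off - stop_end
                 for off in junction_offsets(exons, strand)
                 if off > stop_end]
    return max(distances) if distances else None


def is_nmd_candidate(distance, threshold=50):
    # Assumed cut-off: at least `threshold` nt between stop and junction.
    return distance is not None and distance >= threshold


exons = [(1, 60), (201, 250), (401, 500)]
dsj = farthest_downstream_junction(exons, "+", stop_end=40)
print(dsj, is_nmd_candidate(dsj))  # 70 True
```

Working in transcript coordinates keeps the strand handling in one place (the exon ordering) and makes the threshold comparison a plain subtraction.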
Measures the total length of the 3' untranslated region by summing all exonic sequences downstream of the stop codon. This measurement is crucial for understanding transcript regulation, as 3' UTR length affects mRNA stability, localization, and translation efficiency. Long 3' UTRs often contain regulatory elements and can influence NMD susceptibility when combined with other features.
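A strand-aware 3' UTR length calculation might look like the sketch below. The function name and the convention that the stop codon itself is excluded from the UTR are assumptions for illustration only.

```python
def three_prime_utr_length(exons, strand, stop_codon_end):
    """Sum the exonic bases downstream of the stop codon.

    exons: list of (start, end) tuples, 1-based inclusive genomic coordinates.
    stop_codon_end: genomic coordinate of the last base of the stop codon in
    the direction of translation (highest coordinate on '+', lowest on '-').
    """
    total = 0
    for start, end in exons:
        if strand == "+":
            # Downstream means higher genomic coordinates.
            seg_start, seg_end = max(start, stop_codon_end + 1), end
        else:
            # On the minus strand, downstream means lower coordinates.
            seg_start, seg_end = start, min(end, stop_codon_end - 1)
        if seg_start <= seg_end:
            total += seg_end - seg_start + 1
    return total


exons = [(100, 200), (300, 400)]
print(three_prime_utr_length(exons, "+", 180))  # 121
print(three_prime_utr_length(exons, "-", 320))  # 121
```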
The modular design includes:
- calculate_exon_distance(): Core distance calculation excluding introns
- calculate_utr_lengths(): Accurate 3' UTR length determination
- calculate_splice_junction_distances(): Comprehensive USJ/DSJ analysis
- write_splice_junction_data(): Structured output generation
All functions are strand-aware and handle edge cases robustly.
New file: <outname>_splice_junctions.csv
Contains: T_ID, Strand, UpstreamEJ, DownstreamEJ, 3UTRlength, PTC_dEJ
The PTC_dEJ column provides binary NMD classification (Yes/No for ≥50nt threshold), enabling both automated analysis and custom threshold studies.
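For custom threshold studies, the table can be re-classified with a few lines of Python. The sketch below assumes that the DownstreamEJ column holds the distance in nucleotides and that a missing junction appears as an empty or non-numeric cell; both points should be checked against a real output file. The two data rows are made up purely for illustration.

```python
import csv
import io


def reclassify(rows, threshold):
    """Yield (T_ID, call) pairs using a custom DownstreamEJ threshold."""
    for row in rows:
        try:
            distance = float(row["DownstreamEJ"])
        except (TypeError, ValueError):
            # Assumption: non-numeric cells mean "no downstream junction".
            yield row["T_ID"], "No"
            continue
        yield row["T_ID"], "Yes" if distance >= threshold else "No"


# Invented example rows that only mimic the documented column layout.
example = io.StringIO(
    "T_ID,Strand,UpstreamEJ,DownstreamEJ,3UTRlength,PTC_dEJ\n"
    "tx1,+,12,80,300,Yes\n"
    "tx2,-,40,,150,No\n"
)
calls = dict(reclassify(csv.DictReader(example), threshold=100))
print(calls)
```

For a real run, replace the `io.StringIO` object with `open("<outname>_splice_junctions.csv", newline="")`.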
These measurements enable:
- Accurate NMD prediction based on the established 50-nucleotide rule
- Alternative splicing impact assessment on transcript stability
- Transcript quality evaluation for identifying aberrant isoforms
- Post-transcriptional regulation studies with quantitative metrics
The modified splice junction analysis makes TranSuite a comprehensive tool for understanding how transcript structure influences gene expression regulation through NMD and related quality control mechanisms.
TranSuite is software for identifying coding sequences of transcripts, selecting translation start sites at gene-level, generating accurate translations of transcript isoforms, and identifying and characterizing multiple coding-related features, such as coding potential, similar-translation features, and multiple NMD-related signals. TranSuite consists of three independent modules: FindLORF, TransFix and TransFeat. Each module can be run independently or as a pipeline with a single command.
- FindLORF - finds and annotates the longest ORF of each transcript
- TransFix - "fixes" the same translation start codon AUG in all the transcripts in a gene and re-annotates the resulting ORF of the transcripts
- TransFeat - identifies structural features of transcripts, coding potential and NMD signals
- Auto - executes the whole pipeline (FindLORF, TransFix, and TransFeat) in tandem
TranSuite has been developed in Python 3.6.
TranSuite requires the following packages:
Package installation commands:
conda install -c anaconda biopython
TranSuite is ready to use. The compressed file can be directly downloaded from the GitHub repository. Once uncompressed, TranSuite can be used directly from the command line by specifying the path to the main executable transuite.py
Additionally, in the future it will also be possible to install TranSuite through the popular Python installation managers PyPI and Anaconda.
TranSuite works with a module / module-options structure:
transuite.py module options
where the modules are:
- FindLORF : Finds and annotates the longest ORF of each transcript.
- TransFix : Fix the same translation start codon AUG in all the transcripts in a gene and re-annotates the resulting ORF of the transcripts.
- TransFeat : Identify structural features of transcripts, coding potential and NMD signals.
- Auto : Execute the whole pipeline (FindLORF, TransFix, and TransFeat) in tandem.
Each module can be executed as follows:
python /path/to/transuite.py module options
For example, to view the help documentation of the Auto module:
python /path/to/transuite.py Auto --help
All TranSuite modules (FindLORF, TransFix and TransFeat) use the same input file formats:
- The transcriptome annotation to analyze, in GTF format
- The transcripts nucleotide sequences, in FASTA format
Please note that the programs assume that the nucleotide sequences in the FASTA file represent the exonic region of each transcript. Errors will occur if the user provides a FASTA file of the coding-region (CDS) sequences instead.
Additional notes:
- FindLORF will parse only the "exon" feature information from the GTF file
- TransFix and TransFeat will extract "CDS" feature information from the GTF file
- Transcripts without annotated "CDS" features will remain unprocessed by TransFix, and they will be identified as "No ORF" by TransFeat
- When executing the whole pipeline (Auto module), TranSuite will automatically forward the appropriate GTF file to each module. That is: the FindLORF output GTF will be used as input by TransFix, and the TransFix output GTF will be the TransFeat input
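The GTF hand-off between modules can be reproduced manually when running the modules one by one. The sketch below only builds the three command lines; the helper names are hypothetical, and the output GTF naming pattern (`<outname>_<module>/<outname>_<module>.gtf`) is inferred from the example commands in this README rather than taken from the code.

```python
from pathlib import Path


def module_gtf(outpath, outname, suffix):
    # Assumed naming pattern, inferred from the README examples.
    return Path(outpath) / f"{outname}_{suffix}" / f"{outname}_{suffix}.gtf"


def build_commands(gtf, fasta, outpath, outname):
    """Return the FindLORF, TransFix and TransFeat commands in pipeline order."""
    common = ["--fasta", fasta, "--outpath", outpath, "--outname", outname]
    longorf_gtf = str(module_gtf(outpath, outname, "longorf"))
    transfix_gtf = str(module_gtf(outpath, outname, "transfix"))
    return [
        ["python", "transuite.py", "FindLORF", "--gtf", gtf, *common],
        # TransFix reads the GTF written by FindLORF.
        ["python", "transuite.py", "TransFix", "--gtf", longorf_gtf, *common],
        # TransFeat reads the GTF written by TransFix.
        ["python", "transuite.py", "TransFeat", "--gtf", transfix_gtf, *common],
    ]


for command in build_commands("input.gtf", "input.fa", "out", "test_run"):
    print(" ".join(command))
```

Each list can be passed to `subprocess.run(command, check=True)` to execute the steps in order.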
FindLORF identifies and annotates ORF information in new transcriptome annotations. Firstly, FindLORF translates each transcript sequence in its three frames of translation according to its annotated strand and stores the relative start and stop codon positions of all the resulting ORFs. Secondly, FindLORF selects the longest ORF for each transcript as its putative CDS region. Finally, FindLORF annotates the CDS using the genomic information contained in the transcriptome annotation to convert the relative ORF start-stop codon positions in the transcript sequence into genomic co-ordinates. The FindLORF module takes as input the transcriptome annotation to be curated (GTF format) and the transcripts exon sequences (FASTA format). See Input files above.
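The first two steps (scanning three frames and keeping the longest ORF) can be sketched in plain Python. This is a simplified illustration, not FindLORF itself: the function name is hypothetical, the sequence is assumed to be already in transcript orientation, and ORFs lacking a stop codon are ignored.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def longest_orf(sequence, min_aa=30):
    """Return (start, end) of the longest ATG..stop ORF, or None.

    Coordinates are 0-based, half-open, relative to the transcript sequence.
    min_aa mirrors the --cds option: the minimum number of amino acids,
    not counting the stop codon.
    """
    seq = sequence.upper()
    best = None
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first start codon of a new ORF in this frame
            elif start is not None and codon in STOP_CODONS:
                end = i + 3
                n_aa = (end - start) // 3 - 1
                if n_aa >= min_aa and (best is None or end - start > best[1] - best[0]):
                    best = (start, end)
                start = None  # look for the next ORF in this frame
    return best


seq = "CC" + "ATG" + "GCT" * 3 + "TAA" + "G"
print(longest_orf(seq, min_aa=3))  # (2, 17)
```

The relative coordinates returned here correspond to the positions that FindLORF later converts into genomic co-ordinates using the exon structure in the GTF.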
Command to run FindLORF:
python transuite.py FindLORF [options]
python transuite.py FindLORF --gtf <input-gtf.gtf> --fasta <input-fasta.fa> --outpath </path/for/output-folder> --outname <outname> --cds <30>
List of options available:
- --gtf: Transcriptome annotation file in GTF format
- --fasta: Transcripts nucleotide fasta file
- --outpath: Path of the output folder
- --outname: Prefix for the output files
- --cds: Minimum number of amino-acids an ORF must have to be considered as a potential CDS. Default: 30 AA
Example:
python transuite.py FindLORF --gtf ./test_dataset/subset_AtRTD2.gtf --fasta ./test_dataset/subset_AtRTD2_transcripts.fa --outpath ./test_dataset/test_output --outname test_run --cds 30
FindLORF automatically generates a subfolder to store the output files: /<outpath>/<outname>_longorf/
FindLORF generates the following output files:
- GTF file with the longest ORF in the transcripts annotated as its CDS
- FASTA files of the transcripts CDS regions (nucleotide, and peptide)
- Log CSV files reporting transcripts that could not be annotated, for example: due to lack of an AUG
- A JSON file containing the transcripts ORF relative coordinates
TransFix provides more biologically accurate translations by selecting the authentic translation start site for a gene, "fixing" this location and using it to translate the gene transcripts and annotating the resulting CDS of the translations. We define the authentic translation start site as the site used to produce the full-length protein of the gene. In detail, TransFix firstly extracts the CDS co-ordinates of the transcripts from the transcriptome annotation and groups the transcripts according to their gene of origin. Then, TransFix selects the start codon of the longest annotated CDS in the gene as the representative translation start site and translates all of the transcripts in the gene from the "fixed" translation start site. Finally, TransFix annotates the genomic co-ordinates of the resulting stop codons. In some cases, transcript isoforms do not contain the "fixed" translation start site due to an AS event or an alternative transcription start site. To account for this, TransFix tracks those transcripts that are not translated during the first fix AUG/translation cycle and they are then processed through a second fix AUG/translation cycle to determine and annotate their valid translation start-sites.
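The start-site selection and the split between first-cycle and later-cycle transcripts can be sketched as below. The dictionary layout and function names are hypothetical, and the containment test (the fixed start position falls inside an exon) is a simplification of whatever check TransFix actually performs.

```python
from collections import defaultdict


def select_fixed_starts(transcripts):
    """Map each gene to the start codon position of its longest CDS."""
    by_gene = defaultdict(list)
    for tx in transcripts:
        by_gene[tx["gene_id"]].append(tx)
    return {gene: max(txs, key=lambda t: t["cds_length"])["cds_start"]
            for gene, txs in by_gene.items()}


def split_by_fixed_start(transcripts, fixed_starts):
    """Separate transcripts containing the fixed start from those that need
    a further fix/translation cycle."""
    first_cycle, pending = [], []
    for tx in transcripts:
        start = fixed_starts[tx["gene_id"]]
        contains = any(s <= start <= e for s, e in tx["exons"])
        (first_cycle if contains else pending).append(tx["transcript_id"])
    return first_cycle, pending


# Invented toy gene with three isoforms.
transcripts = [
    {"gene_id": "G1", "transcript_id": "G1.1", "cds_start": 150,
     "cds_length": 900, "exons": [(100, 400)]},
    {"gene_id": "G1", "transcript_id": "G1.2", "cds_start": 520,
     "cds_length": 300, "exons": [(500, 800)]},
    {"gene_id": "G1", "transcript_id": "G1.3", "cds_start": 150,
     "cds_length": 600, "exons": [(120, 300)]},
]
fixed = select_fixed_starts(transcripts)
print(fixed, split_by_fixed_start(transcripts, fixed))
```

Transcripts in the pending list would go through the next cycle, up to the limit set by --iter.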
Command to run TransFix:
python transuite.py TransFix [options]
python transuite.py TransFix --gtf <input-gtf.gtf> --fasta <input-fasta.fa> --outpath </path/for/output-folder> --outname <outname> --iter <5>
List of options available:
- --gtf: Transcriptome annotation file in GTF format
- --fasta: Transcripts nucleotide fasta file
- --outpath: Path of the output folder
- --outname: Prefix for the output files
- --iter: Maximum number of 'start-fixing & translation' cycles to identify alternative start sites. Default: 5
- --chimeric: Table indicating chimeric genes in the annotation (Optional)
Example:
python transuite.py TransFix --gtf ./test_dataset/test_output/test_run_longorf/test_run_longorf.gtf --fasta ./test_dataset/subset_AtRTD2_transcripts.fa --outpath ./test_dataset/test_output --outname test_run --iter 5
TransFix automatically generates a subfolder to store the output files: /<outpath>/<outname>_transfix/
TransFix generates the following output files:
- GTF file with the fixed CDS coordinates at the gene-level
- FASTA files of the transcripts CDS regions (nucleotide, and peptide)
- Multiple log CSV files: a) log files reporting transcripts that could not be annotated, for example for lack of an annotated CDS; and b) a log file tracking the fixing cycle at which the AUG was annotated
TransFeat extracts and processes the transcripts CDS information contained in transcriptome annotations to infer multiple characteristics of the genes, transcripts and their coding potential, and it reports this information in an easily accessible format.
Command to run TransFeat:
python transuite.py TransFeat [options]
python transuite.py TransFeat --gtf <input-gtf.gtf> --fasta <input-fasta.fa> --outpath </path/for/output-folder> --outname <outname> --pep <30> --ptc <70>
List of options available:
- --gtf: Transcriptome annotation file in GTF format
- --fasta: Transcripts nucleotide fasta file
- --outpath: Path of the output folder
- --outname: Prefix for the output files
- --pep: Minimum number of amino-acids a translation must have to be considered a peptide. Default: 100 AA
- --ptc: Minimum CDS length percentage below which a transcript is considered prematurely terminated (PTC). Default: 70%
Example:
python transuite.py TransFeat --gtf ./test_dataset/test_output/test_run_transfix/test_run_transfix.gtf --fasta ./test_dataset/subset_AtRTD2_transcripts.fa --outpath ./test_dataset/test_output --outname test_run --pep 30 --ptc 70
Note: When running the analysis on the above test_dataset you will get a WARNING message regarding a number of features not present in the TransFeat table. This is expected given the small number of transcripts in the test dataset.
TransFeat automatically generates a subfolder to store the output files: /<outpath>/<outname>_transfeat/
TransFeat generates the following output files:
- A main CSV table reporting the transcripts coding features
- FASTA files of: (1) transcripts with CDS (both protein-coding and unproductive), (2) transcripts classified by coding-potentiality (protein-coding transcripts, non-coding genes), (3) transcripts alternative ORFs (uORF, ldORF)
- Multiple CSV tables reporting number of transcripts and transcripts-features subdivided by gene categories
- Multiple CSV tables reporting co-ordinates and sequences of transcripts subdivided by feature categories (ldORF, NMD)
- A JSON file containing the transcripts ORF relative coordinates
This module performs FindLORF, TransFix, and TransFeat analysis in tandem with a single command.
Command to run Auto:
python transuite.py Auto [options]
python transuite.py Auto --gtf <input-gtf.gtf> --fasta <input-fasta.fa> --outpath </path/for/output-folder> --outname <outname> --cds <30> --iter <5> --pep <100> --ptc <70>
List of options available:
- --gtf: Transcriptome annotation file in GTF format
- --fasta: Transcripts nucleotide fasta file
- --outpath: Path of the output folder
- --outname: Prefix for the output files
- --cds: Minimum number of amino-acids an ORF must have to be considered as a potential CDS. Default: 30 AA
- --iter: Maximum number of 'start-fixing & translation' cycles to identify alternative start sites. Default: 5
- --pep: Minimum number of amino-acids a translation must have to be considered a peptide. Default: 100 AA
- --ptc: Minimum CDS length percentage below which a transcript is considered prematurely terminated (PTC). Default: 70%
- --chimeric: Table indicating chimeric genes in the annotation (Optional)
Example:
python transuite.py Auto --gtf ./test_dataset/subset_AtRTD2.gtf --fasta ./test_dataset/subset_AtRTD2_transcripts.fa --outpath ./test_dataset/test_output --outname test_run --cds 30 --iter 5 --pep 100 --ptc 70
The Auto module automatically generates all of the module subfolders and their respective output files:
- /<outpath>/<outname>_longorf/
- /<outpath>/<outname>_transfix/
- /<outpath>/<outname>_transfeat/
The main output files of the TranSuite pipeline are:
- The GTF file generated by TransFix
- The main CSV feature table generated by TransFeat
- The FASTA files generated by TransFix and/or as classified by TransFeat (Coding transcripts, Non-coding genes)
- Any of the multiple log files generated during the analysis
For questions about the modified features, please contact: [email protected]
For questions about the original TranSuite, please contact: [email protected]
When using this TranSuite version, please cite both:
- The original TranSuite paper
- This modified version
This TranSuite version is released under the same MIT license as the original TranSuite.