Incomplete edge case handling with valid GFF3/GTF data #92

@MatthewRalston

Hello,

Preface

I have a genomic.gff for C. acetobutylicum ATCC824 (GCF_000008765.1) that I obtained from NCBI via datasets. I was running your program on it when I encountered the following issue for the first time.

What is the issue?

There are two issues. The first is that valid GFF3 rows aren't handled properly. For most of us this is more of a convenience issue, since we can just remove the rows that don't work, but for some it could be a headache. I refer to this as "edge case handling" because the GFF3 parsing components of rnaseqc fail when genomic elements don't conform to the requirement that 'gene_id' and 'exon_id' be present for every element. Anyway, this part isn't the main point.

The bigger issue is that my otherwise valid GFF3 from NCBI, which I repopulated with 'gene_id' and 'exon_id' terms in the attribute column (the ninth column in the GFF3 spec), still didn't fit some unspecified definition of a valid row.

How to reproduce the issue:

conda run rnaseqc --stranded=FR -s SRR1774150 /ffast/RNAseq/C_aceto_thesis/references//genomic_revised.gff /ffast/RNAseq/C_aceto_thesis/alignments/bowtie2/final_bam/SRR1774150.final.bam /ffast/RNAseq/C_aceto_thesis/summary_files/rnaseqc/SRR1774150

Error message and interpretation

Failed to parse the GTF: Exon missing exon_id and gene_id fields:

After checking the gff from NCBI, I can see there aren't exon_ids or gene_ids for many of the regions of interest. So I wrote a script to repopulate the GFF3 'attribute' field (the ';' separated list in the last field) with exon_ids (where appropriate) and gene_ids using the 'locus_tag' subfield of the 'attribute' column. The issue persisted on a variety of rows, most of which were lncRNAs, tRNAs, rRNAs, etc., which were predicted by the INFERNAL suite or cmsearch engine during annotation.
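For illustration, the repopulation step described above could look roughly like the following. This is a minimal sketch, not the actual script that was used: the function name `add_ids_from_locus_tag` is hypothetical, and it assumes tab-separated GFF3 rows with `key=value` attributes in the ninth column.

```python
def add_ids_from_locus_tag(line: str) -> str:
    """Append gene_id (and exon_id for exon rows) copied from locus_tag.

    Hypothetical helper: assumes a tab-separated GFF3 row whose ninth
    column is a ';'-separated list of key=value attributes.
    """
    # Leave comments, directives and blank lines untouched.
    if line.startswith("#") or not line.strip():
        return line
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        return line
    attrs = dict(
        field.split("=", 1) for field in cols[8].split(";") if "=" in field
    )
    tag = attrs.get("locus_tag")
    if tag is None:
        # Nothing to copy from, so the row is returned unchanged.
        return line
    extra = []
    if "gene_id" not in attrs:
        extra.append(f"gene_id={tag}")
    if cols[2] == "exon" and "exon_id" not in attrs:
        extra.append(f"exon_id={tag}")
    if extra:
        cols[8] = ";".join([cols[8].rstrip(";")] + extra)
    return "\t".join(cols) + ("\n" if line.endswith("\n") else "")
```

Applied to every line of the file, this leaves rows without a `locus_tag` unchanged and only adds `exon_id` to rows whose feature type is `exon`.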

The error still persisted:

Failed to parse the GTF: Exon missing exon_id and gene_id fields: NC_003030.1	cmsearch	exon	9712	11217	.	+	.	ID=exon-CA_RS00040-1;Parent=rna-CA_RS00040;Dbxref=RFAM:RF00177;gbkey=rRNA;inference=COORDINATES: profile:INFERNAL:1.1.5;locus_tag=CA_RS00040;product=16S ribosomal RNA;gene_id=CA_RS00040;exon_id=CA_RS00040
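As a sanity check, the row quoted in the error can be parsed independently to confirm that both fields are present when the ninth column is read as GFF3 `key=value` pairs. The sketch below uses a hypothetical helper, `parse_gff3_attributes`, and says nothing about how rnaseqc itself reads the column.

```python
def parse_gff3_attributes(column9: str) -> dict:
    """Split a GFF3 attribute column into a dict (hypothetical helper)."""
    attrs = {}
    for field in column9.strip().split(";"):
        if "=" in field:
            # Split on the first '=' only, so values may contain '='.
            key, value = field.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs


# The row reported in the error message above.
row = (
    "NC_003030.1\tcmsearch\texon\t9712\t11217\t.\t+\t.\t"
    "ID=exon-CA_RS00040-1;Parent=rna-CA_RS00040;Dbxref=RFAM:RF00177;"
    "gbkey=rRNA;inference=COORDINATES: profile:INFERNAL:1.1.5;"
    "locus_tag=CA_RS00040;product=16S ribosomal RNA;"
    "gene_id=CA_RS00040;exon_id=CA_RS00040"
)
attrs = parse_gff3_attributes(row.split("\t")[8])
print(attrs["gene_id"], attrs["exon_id"])  # prints: CA_RS00040 CA_RS00040
```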

So I removed the rows with such features using grep -v. After doing so, I receive another error: although all records describing INFERNAL-predicted RNAs were removed, rnaseqc asserts there are no valid genes or exons in my GFF3 file, so the definition of a valid GFF3 row is still unclear. Here is some of the GFF3 content with the gene_ids included.

>head references/genomic_revised.gff

...

NC_003030.1	RefSeq	gene	32013	33095	.	-	.	ID=gene-CA_RS00135;Name=asd;gbkey=Gene;gene=asd;gene_biotype=protein_coding;locus_tag=CA_RS00135;old_locus_tag=CA_C0022;gene_id=CA_RS00135
NC_003030.1	Protein Homology	CDS	32013	33095	.	-	0	ID=cds-WP_010963351.1;Parent=gene-CA_RS00135;Dbxref=GenBank:WP_010963351.1;Name=WP_010963351.1;Ontology_term=GO:0009089,GO:0004073,GO:0050661;gbkey=CDS;gene=asd;go_function=aspartate-semialdehyde dehydrogenase activity|0004073||IEA,NADP binding|0050661||IEA;go_process=lysine biosynthetic process via diaminopimelate|0009089||IEA;inference=COORDINATES: similar to AA sequence:RefSeq:WP_010963351.1;locus_tag=CA_RS00135;product=aspartate-semialdehyde dehydrogenase;protein_id=WP_010963351.1;transl_table=11;gene_id=CA_RS00135
NC_003030.1	RefSeq	gene	33214	34113	.	+	.	ID=gene-CA_RS00140;Name=CA_RS00140;gbkey=Gene;gene_biotype=protein_coding;locus_tag=CA_RS00140;old_locus_tag=CA_C0023;gene_id=CA_RS00140
NC_003030.1	Protein Homology	CDS	33214	34113	.	+	0	ID=cds-WP_010963352.1;Parent=gene-CA_RS00140;Dbxref=GenBank:WP_010963352.1;Name=WP_010963352.1;Ontology_term=GO:0006355,GO:0003677,GO:0003700;gbkey=CDS;go_function=DNA binding|0003677||IEA,DNA-binding transcription factor activity|0003700||IEA;go_process=regulation of DNA-templated transcription|0006355||IEA;inference=COORDINATES: similar to AA sequence:RefSeq:WP_010963352.1;locus_tag=CA_RS00140;product=LysR family transcriptional regulator;protein_id=WP_010963352.1;transl_table=11;gene_id=CA_RS00140
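To see which feature types remain in the filtered file, one could tally the third column. The excerpt above contains only gene and CDS rows. This is a minimal sketch with a hypothetical function name, `count_feature_types`, and is offered only as a diagnostic, not as an explanation of the error.

```python
from collections import Counter


def count_feature_types(lines):
    """Count GFF3 feature types (column 3) over an iterable of lines."""
    counts = Counter()
    for line in lines:
        # Skip comments, directives and blank lines.
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) == 9:
            counts[cols[2]] += 1
    return counts


# Example usage on the file from this report (path as shown above):
# with open("references/genomic_revised.gff") as handle:
#     print(count_feature_types(handle))
```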
