Description
Hello,
Preface
I have a genomic.gff for C. acetobutylicum ATCC824 (GCF_000008765.1) that I obtained from NCBI via datasets. I was running your program when I encountered the following issue for the first time.
What is the issue?
There are two issues. The first is that valid GFF3 rows aren't handled properly. For most of us this is a matter of convenience, since we can just remove the rows that don't work, but for some that could be a headache. I refer to this as "edge case handling" because the GFF3 parsing components of rnaseqc do not function if any genomic element fails to conform to the requirement that 'gene_id' and 'exon_id' are present for every element. Anyway, this part isn't the main point.
The bigger issue is that my otherwise valid GFF3 from NCBI, after I repopulated its attribute column with 'gene_id' and 'exon_id' terms, still didn't fit some unspecified definition of a valid row.
How to reproduce the issue:
conda run rnaseqc --stranded=FR -s SRR1774150 /ffast/RNAseq/C_aceto_thesis/references//genomic_revised.gff /ffast/RNAseq/C_aceto_thesis/alignments/bowtie2/final_bam/SRR1774150.final.bam /ffast/RNAseq/C_aceto_thesis/summary_files/rnaseqc/SRR1774150
Error message and interpretation
Failed to parse the GTF: Exon missing exon_id and gene_id fields:
After checking the gff from NCBI, I can see there aren't exon_ids or gene_ids for many of the regions of interest. So I wrote a script to repopulate the GFF3 'attribute' field (the ';' separated list in the last field) with exon_ids (where appropriate) and gene_ids using the 'locus_tag' subfield of the 'attribute' column. The issue persisted on a variety of rows, most of which were lncRNAs, tRNAs, rRNAs, etc., which were predicted by the INFERNAL suite or cmsearch engine during annotation.
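A minimal sketch of such a repopulation step is shown here for illustration. It assumes tab-separated GFF3 columns and reuses locus_tag for both gene_id and exon_id, matching the row shown in this report. The function names are hypothetical, and this is not necessarily the script that was actually run.

```python
def parse_attributes(field):
    """Split a GFF3 attribute column (key=value pairs joined by ';') into a dict."""
    attrs = {}
    for pair in field.strip().split(";"):
        if "=" in pair:
            key, value = pair.split("=", 1)
            attrs[key] = value
    return attrs


def add_ids(line):
    """Append gene_id (and exon_id for exon rows) derived from locus_tag.

    Comment lines, blank lines, rows without nine columns, and rows
    without a locus_tag are returned unchanged.
    """
    if line.startswith("#") or not line.strip():
        return line
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        return line
    attrs = parse_attributes(cols[8])
    locus = attrs.get("locus_tag")
    if locus is None:
        return line
    extra = []
    if "gene_id" not in attrs:
        extra.append(f"gene_id={locus}")
    # Assumption: only rows whose type column is "exon" need an exon_id.
    if cols[2] == "exon" and "exon_id" not in attrs:
        extra.append(f"exon_id={locus}")
    if extra:
        cols[8] = ";".join([cols[8].rstrip(";")] + extra)
    return "\t".join(cols) + "\n"
```

Applied line by line to genomic.gff, this yields rows whose attribute column ends in `gene_id=...;exon_id=...`, like the row quoted in the error message.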
The error still persisted:
Failed to parse the GTF: Exon missing exon_id and gene_id fields: NC_003030.1 cmsearch exon 9712 11217 . + . ID=exon-CA_RS00040-1;Parent=rna-CA_RS00040;Dbxref=RFAM:RF00177;gbkey=rRNA;inference=COORDINATES: profile:INFERNAL:1.1.5;locus_tag=CA_RS00040;product=16S ribosomal RNA;gene_id=CA_RS00040;exon_id=CA_RS00040
So I removed the rows for such features with grep -v. After doing so, rnaseqc still fails: although all records describing INFERNAL-predicted RNAs were removed, it remains unclear what counts as a valid GFF3 row. Here is some of the GFF3 content with the gene_ids included; nevertheless, rnaseqc asserts that there are no valid genes or exons in my GFF3 file.
>head references/genomic_revised.gff
...
NC_003030.1 RefSeq gene 32013 33095 . - . ID=gene-CA_RS00135;Name=asd;gbkey=Gene;gene=asd;gene_biotype=protein_coding;locus_tag=CA_RS00135;old_locus_tag=CA_C0022;gene_id=CA_RS00135
NC_003030.1 Protein Homology CDS 32013 33095 . - 0 ID=cds-WP_010963351.1;Parent=gene-CA_RS00135;Dbxref=GenBank:WP_010963351.1;Name=WP_010963351.1;Ontology_term=GO:0009089,GO:0004073,GO:0050661;gbkey=CDS;gene=asd;go_function=aspartate-semialdehyde dehydrogenase activity|0004073||IEA,NADP binding|0050661||IEA;go_process=lysine biosynthetic process via diaminopimelate|0009089||IEA;inference=COORDINATES: similar to AA sequence:RefSeq:WP_010963351.1;locus_tag=CA_RS00135;product=aspartate-semialdehyde dehydrogenase;protein_id=WP_010963351.1;transl_table=11;gene_id=CA_RS00135
NC_003030.1 RefSeq gene 33214 34113 . + . ID=gene-CA_RS00140;Name=CA_RS00140;gbkey=Gene;gene_biotype=protein_coding;locus_tag=CA_RS00140;old_locus_tag=CA_C0023;gene_id=CA_RS00140
NC_003030.1 Protein Homology CDS 33214 34113 . + 0 ID=cds-WP_010963352.1;Parent=gene-CA_RS00140;Dbxref=GenBank:WP_010963352.1;Name=WP_010963352.1;Ontology_term=GO:0006355,GO:0003677,GO:0003700;gbkey=CDS;go_function=DNA binding|0003677||IEA,DNA-binding transcription factor activity|0003700||IEA;go_process=regulation of DNA-templated transcription|0006355||IEA;inference=COORDINATES: similar to AA sequence:RefSeq:WP_010963352.1;locus_tag=CA_RS00140;product=LysR family transcriptional regulator;protein_id=WP_010963352.1;transl_table=11;gene_id=CA_RS00140
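To narrow down which rows a parser might be rejecting, it may help to tally the feature types (column 3) in the revised file along with whether each row carries gene_id and exon_id. The sketch below is only a diagnostic under the assumption of tab-separated GFF3 input; `tally_features` is a hypothetical helper and does not reflect how rnaseqc itself parses the file.

```python
from collections import Counter


def tally_features(lines):
    """Count rows by (feature type, has gene_id, has exon_id)."""
    counts = Counter()
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip directives, comments, and blank lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            counts[("malformed", False, False)] += 1
            continue
        # Collect the attribute keys from the ';'-separated key=value list.
        keys = {p.split("=", 1)[0] for p in cols[8].split(";") if "=" in p}
        counts[(cols[2], "gene_id" in keys, "exon_id" in keys)] += 1
    return counts


if __name__ == "__main__":
    import sys

    # Usage: python tally.py genomic_revised.gff
    with open(sys.argv[1]) as handle:
        for key, n in sorted(tally_features(handle).items()):
            print(key, n)
```

If the tally shows no rows of type exon after the grep -v step, as in the excerpt above where only gene and CDS rows appear, that would be one place to look.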