Thanks to visit codestin.com
Credit goes to github.com

Skip to content

bcftools annotate throws unhandled assertion error in some cases #1957

@rickymagner

Description

@rickymagner

I'll outline:

  • The Issue
  • Debugging
  • Possible Fix

The Issue

I came across a curious exception when manipulating a few SV VCFs. Here's a minimal example that should demonstrate the error with the latest bcftools annotate.

We start with the following files:
annotations.hdr :

##INFO=<ID=END,Number=1,Type=Integer,Description="End coordinate in reference for SV">

minimal_annotations.tsv.gz :

chr21	8914240	8914240	8914295
chr21	8914680	8914680	8914681
chr21	8914690	8914690	8914691

minimal_example.vcf.gz :

##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##contig=<ID=chr1,length=248387328>
##contig=<ID=chr2,length=242696752>
##contig=<ID=chr3,length=201105948>
##contig=<ID=chr4,length=193574945>
##contig=<ID=chr5,length=182045439>
##contig=<ID=chr6,length=172126628>
##contig=<ID=chr7,length=160567428>
##contig=<ID=chr8,length=146259331>
##contig=<ID=chr9,length=150617247>
##contig=<ID=chr10,length=134758134>
##contig=<ID=chr11,length=135127769>
##contig=<ID=chr12,length=133324548>
##contig=<ID=chr13,length=113566686>
##contig=<ID=chr14,length=101161492>
##contig=<ID=chr15,length=99753195>
##contig=<ID=chr16,length=96330374>
##contig=<ID=chr17,length=84276897>
##contig=<ID=chr18,length=80542538>
##contig=<ID=chr19,length=61707364>
##contig=<ID=chr20,length=66210255>
##contig=<ID=chr21,length=45090682>
##contig=<ID=chr22,length=51324926>
##contig=<ID=chrX,length=154259566>
##contig=<ID=chrY,length=62460029>
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">
##FILTER=<ID=HET1,Description="Heterozygous in the first haplotype">
##FILTER=<ID=HET2,Description="Heterozygous in the second haplotype">
##FILTER=<ID=GAP1,Description="Uncalled in the first haplotype">
##FILTER=<ID=GAP2,Description="Uncalled in the second haplotype">
##FILTER=<ID=DIPX,Description="Diploid chrX in non-PAR">
##FILTER=<ID=DIPY,Description="Diploid chrY in non-PAR">
##bcftools_normVersion=1.16+htslib-1.16
##bcftools_normCommand=norm -m -any -o split.vcf.gz /cromwell_root/fc-37e70ab7-6023-490d-824d-5997ff71fbb3/submissions/fcc4a6da-c981-4dd9-a736-06f23f1e0af0/runDipcall/21a8a4fa-67db-406b-ba61-ae6a7aafa8e9/call-IndexVCF/cleaned-indexed.vcf.gz; Date=Thu Jul  6 18:40:20 2023
##INFO=<ID=SVTYPE,Number=1,Type=String,Description="SVTYPE">
##INFO=<ID=SVLEN,Number=1,Type=Integer,Description="SVLEN">
##bcftools_viewVersion=1.17+htslib-1.17
##bcftools_viewCommand=view -i INFO/SVTYPE!="." -o truvari-SVs.vcf.gz truvari-annotated.vcf.gz; Date=Fri Jul  7 10:17:14 2023
##bcftools_viewCommand=view --regions-file bad_region.bed -o minimal_example.vcf.gz truvari-SVs.vcf.gz; Date=Fri Jul  7 18:11:13 2023
##bcftools_viewCommand=view minimal_example.vcf.gz; Date=Fri Jul  7 18:26:30 2023
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	syndip
chr21	8914240	.	gttccattccattccattcaattccattccattgcattccattccattccattcca	G	30	HET2	SVTYPE=DEL;SVLEN=55	GT:AD	0|.:3,1
chr21	8914680	.	tattccattccattcc	TATTCCATTCCATTCCATTCCATTCCATTCCATTCCATTCCATTCTAGTTGATTCCATTCCATTCCATCCCGTTCCATTCCATTCCGTTACTTTCTATTCCATTCCATTCCATTCC	30	HET2	SVTYPE=INS;SVLEN=100	GT:AD	0|.:1,1
chr21	8914690	.	c	CATTCCATTCCATTCCATTCCATTCTAGTTGATTCCATTCCATTCCATCCCGTTCCATTCCATTCCGTTACTTTCT	30	HET2	SVTYPE=INS;SVLEN=75	GT:AD	0|.:2,1

Make sure you run bgzip on minimal_annotations.tsv to get the compressed version, then run tabix -s1 -b2 -e3 minimal_annotations.tsv.gz.

Finally, the command to produce the error is now:

bcftools annotate -a minimal_annotations.tsv.gz -h annotations.hdr -c CHROM,FROM,TO,INFO/END -o minimal_annotated.vcf.gz minimal_example.vcf.gz

The error output is:

Assertion failed: (isec > 0), function annotate, file vcfannotate.c, line 3086.
[1]    98724 abort      bcftools annotate -a minimal_annotations.tsv.gz -h annotations.hdr -c  -o

Even stranger: I tried doing my original analysis with bcftools 1.14 and it seems to have worked fine. The above was done with version 1.17.

Debugging

I've tracked the problem down to this line here. There is an assertion made that the value isec is positive, and in this case it turns out not to be. What's confusing to me is why this number ends up negative. I ended up editing the code around there to be:

//assert( isec > 0 );
if ( isec < 1 )
    error("Problem at %s, POS: %"PRId64" and RLEN: %lld\nAlso have START: %d and END: %d", bcf_seqname(args->hdr,line), line->pos, line->rlen, tmp->start, tmp->end);

Rebuilding/rerunning gave the output:

Problem at chr21, POS: 8914679 and RLEN: 2
Also have START: 8914689 and END: 8914689

Note that START and END are computed using the annotations file whereas the POS and RLEN are computed from the VCF.

Computing the expression for isec, we see that

int isec = (tmp->end < line->pos+line->rlen-1 ? tmp->end : line->pos+line->rlen-1) - (tmp->start > line->pos ? tmp->start : line->pos) + 1;

translates in math to

isec = MIN(END, POS+RLEN-1) - MAX(START, POS) + 1

So with our above values, it's negative because the annotations region is too far to the right of the variant position, making the left term POS + 1 and the right term START.

Conceptually, this expression should always be > 0 if the actual entries in annotations.tsv.gz get applied to the correct variants that correspond to that region. But here it seems that the 3rd line of annotations ends up getting applied to the second variant, causing this problem. This seems incredibly bizarre to me since I've run this exact sequence of commands on around 50 samples over the whole genome, and only a couple had this rare exception. Also, maybe I don't understand how RLEN gets computed, but it seems incorrect for this example to be 2 rather than LEN(tattccattccattcc).

Possible Fix?

Commenting out assert( isec > 0 ) lets the program run fine and seems to output the correct things. In fact, strangely enough, after removing this line it adds the correct INFO/END fields to each of the three entries. In particular, the second variant gets END=8914681 and NOT 8914691 as one might expect from the above error that seemed to match up the third annotation with the second variant. I have no idea why this would happen and would appreciate some insight.

In either case, it's probably best to make sure this assertion is handled properly so the user can have a more helpful error message to debug.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions