-
Notifications
You must be signed in to change notification settings - Fork 260
Description
I'll outline:
- The Issue
- Debugging
- Possible Fix
The Issue
I came across a curious exception when manipulating a few SV VCFs. Here's a minimal example that should demonstrate the error with the latest bcftools annotate.
We start with the following files:
annotations.hdr :
##INFO=<ID=END,Number=1,Type=Integer,Description="End coordinate in reference for SV">
minimal_annotations.tsv.gz :
chr21 8914240 8914240 8914295
chr21 8914680 8914680 8914681
chr21 8914690 8914690 8914691
minimal_example.vcf.gz :
##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##contig=<ID=chr1,length=248387328>
##contig=<ID=chr2,length=242696752>
##contig=<ID=chr3,length=201105948>
##contig=<ID=chr4,length=193574945>
##contig=<ID=chr5,length=182045439>
##contig=<ID=chr6,length=172126628>
##contig=<ID=chr7,length=160567428>
##contig=<ID=chr8,length=146259331>
##contig=<ID=chr9,length=150617247>
##contig=<ID=chr10,length=134758134>
##contig=<ID=chr11,length=135127769>
##contig=<ID=chr12,length=133324548>
##contig=<ID=chr13,length=113566686>
##contig=<ID=chr14,length=101161492>
##contig=<ID=chr15,length=99753195>
##contig=<ID=chr16,length=96330374>
##contig=<ID=chr17,length=84276897>
##contig=<ID=chr18,length=80542538>
##contig=<ID=chr19,length=61707364>
##contig=<ID=chr20,length=66210255>
##contig=<ID=chr21,length=45090682>
##contig=<ID=chr22,length=51324926>
##contig=<ID=chrX,length=154259566>
##contig=<ID=chrY,length=62460029>
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">
##FILTER=<ID=HET1,Description="Heterozygous in the first haplotype">
##FILTER=<ID=HET2,Description="Heterozygous in the second haplotype">
##FILTER=<ID=GAP1,Description="Uncalled in the first haplotype">
##FILTER=<ID=GAP2,Description="Uncalled in the second haplotype">
##FILTER=<ID=DIPX,Description="Diploid chrX in non-PAR">
##FILTER=<ID=DIPY,Description="Diploid chrY in non-PAR">
##bcftools_normVersion=1.16+htslib-1.16
##bcftools_normCommand=norm -m -any -o split.vcf.gz /cromwell_root/fc-37e70ab7-6023-490d-824d-5997ff71fbb3/submissions/fcc4a6da-c981-4dd9-a736-06f23f1e0af0/runDipcall/21a8a4fa-67db-406b-ba61-ae6a7aafa8e9/call-IndexVCF/cleaned-indexed.vcf.gz; Date=Thu Jul 6 18:40:20 2023
##INFO=<ID=SVTYPE,Number=1,Type=String,Description="SVTYPE">
##INFO=<ID=SVLEN,Number=1,Type=Integer,Description="SVLEN">
##bcftools_viewVersion=1.17+htslib-1.17
##bcftools_viewCommand=view -i INFO/SVTYPE!="." -o truvari-SVs.vcf.gz truvari-annotated.vcf.gz; Date=Fri Jul 7 10:17:14 2023
##bcftools_viewCommand=view --regions-file bad_region.bed -o minimal_example.vcf.gz truvari-SVs.vcf.gz; Date=Fri Jul 7 18:11:13 2023
##bcftools_viewCommand=view minimal_example.vcf.gz; Date=Fri Jul 7 18:26:30 2023
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT syndip
chr21 8914240 . gttccattccattccattcaattccattccattgcattccattccattccattcca G 30 HET2 SVTYPE=DEL;SVLEN=55 GT:AD 0|.:3,1
chr21 8914680 . tattccattccattcc TATTCCATTCCATTCCATTCCATTCCATTCCATTCCATTCCATTCTAGTTGATTCCATTCCATTCCATCCCGTTCCATTCCATTCCGTTACTTTCTATTCCATTCCATTCCATTCC 30 HET2 SVTYPE=INS;SVLEN=100 GT:AD 0|.:1,1
chr21 8914690 . c CATTCCATTCCATTCCATTCCATTCTAGTTGATTCCATTCCATTCCATCCCGTTCCATTCCATTCCGTTACTTTCT 30 HET2 SVTYPE=INS;SVLEN=75 GT:AD 0|.:2,1
Make sure you run bgzip on minimal_annotations.tsv to get the compressed version, then run tabix -s1 -b2 -e3 minimal_annotations.tsv.gz.
Finally, the command to produce the error is now:
bcftools annotate -a minimal_annotations.tsv.gz -h annotations.hdr -c CHROM,FROM,TO,INFO/END -o minimal_annotated.vcf.gz minimal_example.vcf.gz
The error output is:
Assertion failed: (isec > 0), function annotate, file vcfannotate.c, line 3086.
[1] 98724 abort bcftools annotate -a minimal_annotations.tsv.gz -h annotations.hdr -c -o
Even stranger: I tried doing my original analysis with bcftools 1.14 and it seems to have worked fine. The above was done with version 1.17.
Debugging
I've tracked the problem down to this line here. There is an assertion made that the value isec is positive, and in this case it turns out not to be. What's confusing to me is why this number ends up negative. I ended up editing the code around there to be:
//assert( isec > 0 );
if ( isec < 1 )
error("Problem at %s, POS: %"PRId64" and RLEN: %lld\nAlso have START: %d and END: %d", bcf_seqname(args->hdr,line), line->pos, line->rlen, tmp->start, tmp->end);
Rebuilding/rerunning gave the output:
Problem at chr21, POS: 8914679 and RLEN: 2
Also have START: 8914689 and END: 8914689
Note that START and END are computed using the annotations file whereas the POS and RLEN are computed from the VCF.
Computing the expression for isec, we see that
int isec = (tmp->end < line->pos+line->rlen-1 ? tmp->end : line->pos+line->rlen-1) - (tmp->start > line->pos ? tmp->start : line->pos) + 1;
translates in math to
isec = MIN(END, POS+RLEN-1) - MAX(START, POS) + 1
So with our above values, it's negative because the annotations region is too far to the right of the variant position, making the left term POS + 1 and the right term START.
Conceptually, this expression should always be > 0 if the actual entries in annotations.tsv.gz get applied to the correct variants that correspond to that region. But here it seems that the 3rd line of annotations ends up getting applied to the second variant, causing this problem. This seems incredibly bizarre to me since I've run this exact sequence of commands on around 50 samples over the whole genome, and only a couple had this rare exception. Also, maybe I don't understand how RLEN gets computed, but it seems incorrect for this example to be 2 rather than LEN(tattccattccattcc).
Possible Fix?
Commenting out assert( isec > 0 ) lets the program run fine and seems to output the correct things. In fact, strangely enough, after removing this line it adds the correct INFO/END fields to each of the three entries. In particular, the second variant gets END=8914681 and NOT 8914691 as one might expect from the above error that seemed to match up the third annotation with the second variant. I have no idea why this would happen and would appreciate some insight.
In either case, it's probably best to make sure this assertion is handled properly so the user can have a more helpful error message to debug.