-
Notifications
You must be signed in to change notification settings - Fork 455
converting bcf 1st phase data as in v44 #1938
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: develop
Are you sure you want to change the base?
Conversation
455a96e
to
aebd6ee
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If updatephasing()
is changed as suggested, all of HTSlib's tests still pass.
Most of bcftools' do as well, apart from bcftools convert --haplegendsample
which starts incorrectly marking lots of genotypes as partially phased. This is because the process_gt_to_hap()
and process_gt_to_hap2()
functions currently assume the first phasing bit is always zero. As they're likely broken for VCF4.4, they'll need to be fixed irrespective of this change. Adjusting them to ignore the first phasing bit gets them working again, and after that all bcftools tests pass.
vcf.c
Outdated
* If the version in header is >= 4.4, no change is made. Otherwise 1st phasing | ||
* is set if there are no other unphased ones. | ||
*/ | ||
HTSLIB_EXPORT int update44phasing(bcf_hdr_t *h, bcf1_t *b, int setreset) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we only allow the phasing to be set, we can drop setreset
here too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated
f3a5f06
to
07c0c5e
Compare
The reset made on bcf write is removed. |
fce7ac9
to
8af52e2
Compare
synced_bcf_reader.c
Outdated
@@ -708,6 +708,11 @@ static int _reader_fill_buffer(bcf_srs_t *files, bcf_sr_t *reader) | |||
ret = bcf_itr_next(reader->file, reader->itr, reader->buffer[reader->nbuffer+1]); | |||
if ( ret < -1 ) files->errnum = bcf_read_error; | |||
if ( ret < 0 ) break; // no more lines or an error | |||
//set phasing of 1st value as in vcf v44 | |||
if ((ret = update44phasing(reader->header, reader->buffer[reader->nbuffer+1]))) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am curious how this affects speed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Initially it was observed to make ~15% overhead. On recheck, last update had more overhead and tweaked further. It shows ~13% overhead with bcftools view with the latest changes
.
e94005e
to
ad77218
Compare
The BCF conversion is removed, to avoid the performance overhead, based on discussions. With this, BCF and VCF will have different internal binary values/representations. As conversion is made for vcf, merge issue mentioned is fixed for vcf/vcf.gz. When input is bcf, the issue still exists and conversion may still be required. Performance issue will be looked further that it can be used for BCF as well. |
ad77218
to
0335a05
Compare
The motivation for this is to enable passing of a pointer to a bcf_hdr_t structure to bcf_readrec(), which currently does not get one. It does always get a pointer for the BGZF handle, so a header struct could be passed in via that if it can be stored somewhere. To enable this while not changing the bgzf API or ABI, extra fields are added to the opaque bgzf_cache_t field. The BGZF_CACHE macro that could be use to disable addition of the cache feature removed as it was always turned on anyway. The cache struct now has to be created for files open for write, although the cache part is not used. The hash type used by the cache is renamed from "cache" to "bgzf_cache" to improve its name-spacing. The interfaces to add, get, and remove private data are put in a new bgzf_internal.h header. The bgzf_cache_t struct definition is also moved there so that the get function can be inlined for faster access to the private data field. The bgzf_cache_t definition is rewritten slightly so that it's not necessary to invoke KHASH_MAP_INIT_INT64() before it in the header file, as doing that would require struct cache_t to be moved from bgzf.c to the new header as well. Instead, typedef kh_bgzf_cache_t is used in place of khash(bgzf_cache), and unsigned int instead of khint_t.
For bcf files, the header pointer hasn't always been passed into bcf_read(), especially when using iterators. As having it available would be useful for VCF 4.4+ support, this works around its absence by attaching a pointer to the header in BGZF private data, which was previously unused for vcf/bcf. It also adds reference counting to the header so that it can be cleaned up safely irrespective of whether hts_close() or bcf_hdr_destroy() was called first. To avoid ABI breakage, the reference count is stored in the bcf_hdr_aux_t struct.
BCF saved by versions of HTSlib before 1.22 will always store the first phasing bit as 0. For consistency with the VCF reader, update this bit when reading BCF so that is is set if all other phasing bits are also set.
Phasing should now be fixed up in bcf_read()/vcf_read(), so there's no need to try again in bcf_get_format_values().
By noting that we're only interested in the least-significant bit of each GT value, it's possible to reduce the number of branches in this function by doing bit manipulations on the first byte of each stored value. The common haploid and diploid cases are also specialised so the inner loop on ploidy can be avoided for those cases.
Make bcf_readrec able to update phasing, and faster updatephasing() function
updated to have BCF and VCF output same binary values (updatephasing only for bcf with version <v44) |
Fix for #1932
It was discussed earlier to keep the phasing data in VCF44 format internally as that easily resolves the issue mentioned (will necessitate some changes in bcftools as there are a few checks with v4.3 type phase values - in convert tests) with VCF checks.
Data needs a consistent binary representation, irrespective of VCF/BCF source, so the same conversion is needed for BCF data as well, from v4.x to v4.4 in terms of phasing values. This conversion is made on read, for bcf with version < v44. While writing, it is again converted that data is stored without the change made, if version < v44. This adds ~15% overhead it seems.
--removed the mention of a few changes that seemed required in bcftools, it worked fine without any such change with later checks--