Contributor

@vasudeva8 vasudeva8 commented Jul 22, 2025

Fixes #1932.
As discussed earlier, the phasing data is kept internally in VCF 4.4 format, as that easily resolves the issue mentioned with VCF checks. (This will necessitate some changes in bcftools, as a few checks in the convert tests expect v4.3-style phase values.)
The data needs a consistent binary representation irrespective of whether the source is VCF or BCF, so BCF phasing values need the same conversion from v4.x to v4.4. This conversion is made on read, for BCF with version < v4.4. On write, the conversion is reversed if the version is < v4.4, so that the data is stored without the change. This seems to add ~15% overhead.
(Edit: removed the mention of a few changes that seemed to be required in bcftools; later checks showed it works fine without any such change.)

@daviesrob daviesrob self-assigned this Jul 24, 2025
@vasudeva8 vasudeva8 marked this pull request as ready for review August 4, 2025 17:13
Member

@daviesrob daviesrob left a comment

If updatephasing() is changed as suggested, all of HTSlib's tests still pass.

Most of bcftools' do as well, apart from bcftools convert --haplegendsample which starts incorrectly marking lots of genotypes as partially phased. This is because the process_gt_to_hap() and process_gt_to_hap2() functions currently assume the first phasing bit is always zero. As they're likely broken for VCF4.4, they'll need to be fixed irrespective of this change. Adjusting them to ignore the first phasing bit gets them working again, and after that all bcftools tests pass.
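The adjustment described above can be illustrated with a minimal sketch of a phasing check that ignores the first phasing bit. This is not the code of process_gt_to_hap(): the helper name gt_is_phased_ignoring_first() and the GT_* macros are hypothetical, with the macros mirroring the encoding used by HTSlib's bcf_gt_phased() and bcf_gt_unphased(), i.e. (allele + 1) << 1 with the low bit as the phase flag. Treating a haploid call as phased is an assumption of this sketch.

```c
#include <assert.h>
#include <stdint.h>

/* Local stand-ins mirroring HTSlib's GT encoding: ((allele + 1) << 1) | phase bit. */
#define GT_PHASED(a)    ((((a) + 1) << 1) | 1)
#define GT_UNPHASED(a)  (((a) + 1) << 1)
#define GT_IS_PHASED(v) ((v) & 1)

/*
 * Hypothetical helper: returns 1 if every allele after the first carries
 * the phase bit, 0 otherwise.  The first value's phase bit is ignored, so
 * "0|1" gives the same answer whether it was stored with the first bit
 * clear (pre-4.4 style) or set (4.4 style).  With ploidy 1 the loop body
 * never runs, so a haploid call counts as phased (assumption).
 */
static int gt_is_phased_ignoring_first(const int32_t *gt, int ploidy)
{
    assert(ploidy > 0);
    for (int i = 1; i < ploidy; i++)
        if (!GT_IS_PHASED(gt[i]))
            return 0;
    return 1;
}
```

With a check of this shape, a caller no longer depends on the first phasing bit being zero, which is the assumption the review comment identifies as the cause of the "partially phased" misclassification.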

vcf.c Outdated
* If the version in header is >= 4.4, no change is made. Otherwise 1st phasing
* is set if there are no other unphased ones.
*/
HTSLIB_EXPORT int update44phasing(bcf_hdr_t *h, bcf1_t *b, int setreset)
Member

If we only allow the phasing to be set, we can drop setreset here too.

Contributor Author

updated

Contributor Author

vasudeva8 commented Aug 7, 2025

The reset previously made on BCF write has been removed.
As a result, the output will differ from the input. BCF is not widely used because of its inherent compatibility issues, so this difference can be ignored.

@vasudeva8 vasudeva8 force-pushed the phase44update1 branch 2 times, most recently from fce7ac9 to 8af52e2 Compare August 8, 2025 11:37
@@ -708,6 +708,11 @@ static int _reader_fill_buffer(bcf_srs_t *files, bcf_sr_t *reader)
ret = bcf_itr_next(reader->file, reader->itr, reader->buffer[reader->nbuffer+1]);
if ( ret < -1 ) files->errnum = bcf_read_error;
if ( ret < 0 ) break; // no more lines or an error
//set phasing of 1st value as in vcf v44
if ((ret = update44phasing(reader->header, reader->buffer[reader->nbuffer+1]))) {
Member

I am curious how this affects speed?

Contributor Author

@vasudeva8 vasudeva8 Aug 14, 2025

Initially, an overhead of ~15% was observed. On rechecking, the last update turned out to have more overhead, so it was tweaked further. With the latest changes, bcftools view shows ~13% overhead.

@vasudeva8 vasudeva8 force-pushed the phase44update1 branch 2 times, most recently from e94005e to ad77218 Compare August 15, 2025 14:25
Contributor Author

vasudeva8 commented Aug 15, 2025

The BCF conversion has been removed to avoid the performance overhead, based on discussions.
For VCF with versions up to v4.3, the first phasing value is set based on the rest of the values in the sample.
The bcf_get_format_values API is modified to do the same conversion when the retrieved field is GT, limiting any overhead to cases where GT is retrieved.

With this, BCF and VCF will have different internal binary representations.
When the bcf_get_format_values API is used, phasing data is returned as per v4.4.
When the values are accessed directly, without the get_format API, the user has to determine the phasing of the first value from the phasing of the other alleles.

As the conversion is made for VCF, the merge issue mentioned is fixed for vcf/vcf.gz. When the input is BCF, the issue still exists and the conversion may still be required. The performance issue will be investigated further so that the conversion can be used for BCF as well.
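For code that reads the stored values directly, the rule described above (the first phasing value follows from the other alleles) could be applied along these lines. This is a sketch under stated assumptions, not the PR's implementation: gt_set_first_phase() is a hypothetical name, the array is assumed to hold int32 values for one sample in HTSlib's GT encoding, and the two sentinel constants are local stand-ins with the same values as bcf_int32_missing and bcf_int32_vector_end. The phase bit is only ever set, never cleared.

```c
#include <assert.h>
#include <stdint.h>

/* Local stand-ins mirroring HTSlib's GT encoding and int32 sentinels. */
#define GT_PHASED(a)    ((((a) + 1) << 1) | 1)
#define GT_UNPHASED(a)  (((a) + 1) << 1)
#define I32_MISSING     INT32_MIN         /* same value as bcf_int32_missing    */
#define I32_VECTOR_END  (INT32_MIN + 1)   /* same value as bcf_int32_vector_end */

/*
 * Hypothetical helper: set the phase bit of the first GT value of one
 * sample if none of the remaining alleles is unphased.
 */
static void gt_set_first_phase(int32_t *gt, int ploidy)
{
    assert(ploidy > 0);
    /* Leave sentinel values untouched; OR-ing in a bit would corrupt them. */
    if (gt[0] == I32_MISSING || gt[0] == I32_VECTOR_END)
        return;
    for (int i = 1; i < ploidy; i++) {
        if (gt[i] == I32_VECTOR_END)
            break;              /* this sample has lower ploidy than the row */
        if (!(gt[i] & 1))
            return;             /* an unphased allele: first bit stays as is */
    }
    gt[0] |= 1;
}
```

Because the helper only sets the bit, it has no effect on values that already follow the v4.4 convention.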

daviesrob and others added 7 commits September 2, 2025 09:04
The motivation for this is to enable passing of a pointer to
a bcf_hdr_t structure to bcf_readrec(), which currently does
not get one.  It does always get a pointer for the BGZF handle,
so a header struct could be passed in via that if it can be
stored somewhere.

To enable this while not changing the bgzf API or ABI, extra
fields are added to the opaque bgzf_cache_t field.  The BGZF_CACHE
macro that could be used to disable addition of the cache feature
is removed as it was always turned on anyway.  The cache struct now
has to be created for files open for write, although the cache
part is not used.  The hash type used by the cache is renamed from
"cache" to "bgzf_cache" to improve its name-spacing.

The interfaces to add, get, and remove private data are put in
a new bgzf_internal.h header.  The bgzf_cache_t struct definition
is also moved there so that the get function can be inlined for
faster access to the private data field.

The bgzf_cache_t definition is rewritten slightly so that it's
not necessary to invoke KHASH_MAP_INIT_INT64() before it in the
header file, as doing that would require struct cache_t to be
moved from bgzf.c to the new header as well.  Instead, typedef
kh_bgzf_cache_t is used in place of khash(bgzf_cache), and
unsigned int instead of khint_t.
For bcf files, the header pointer hasn't always been passed
into bcf_read(), especially when using iterators.  As having
it available would be useful for VCF 4.4+ support, this works
around its absence by attaching a pointer to the header in
BGZF private data, which was previously unused for vcf/bcf.
It also adds reference counting to the header so that it can
be cleaned up safely irrespective of whether hts_close() or
bcf_hdr_destroy() was called first.  To avoid ABI breakage,
the reference count is stored in the bcf_hdr_aux_t struct.
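The reference-counting idea in the commit message above can be sketched generically as follows. The names hdr_t, hdr_new(), hdr_ref() and hdr_unref() are hypothetical and the struct is a placeholder; this is not HTSlib's bcf_hdr_t or bcf_hdr_aux_t code, only an illustration of why either owner can release the object first.

```c
#include <assert.h>
#include <stdlib.h>

/* Placeholder object shared between two owners (e.g. caller and file handle). */
typedef struct {
    int refcount;   /* number of owners still holding the object */
    int n_samples;  /* stand-in payload */
} hdr_t;

/* Create the object with a single owner; returns NULL on allocation failure. */
static hdr_t *hdr_new(void)
{
    hdr_t *h = calloc(1, sizeof(*h));
    if (h) h->refcount = 1;
    return h;
}

/* Register an additional owner. */
static hdr_t *hdr_ref(hdr_t *h)
{
    h->refcount++;
    return h;
}

/* Drop one owner; frees the object and returns 1 when the last owner lets go. */
static int hdr_unref(hdr_t *h)
{
    if (!h) return 0;
    assert(h->refcount > 0);
    if (--h->refcount > 0) return 0;
    free(h);
    return 1;
}
```

Whichever owner calls hdr_unref() last triggers the free, so the order of the two clean-up calls does not matter.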
BCF saved by versions of HTSlib before 1.22 will always store the
first phasing bit as 0.  For consistency with the VCF reader,
update this bit when reading BCF so that it is set if all other
phasing bits are also set.
Phasing should now be fixed up in bcf_read()/vcf_read(), so
there's no need to try again in bcf_get_format_values().
By noting that we're only interested in the least-significant
bit of each GT value, it's possible to reduce the number of
branches in this function by doing bit manipulations on the
first byte of each stored value.  The common haploid and diploid
cases are also specialised so the inner loop on ploidy can
be avoided for those cases.
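The bit-manipulation approach described in this commit message might look roughly like the sketch below for GT values stored as one byte each. It is an illustration written for this discussion, not the updatephasing() code from the PR: set_first_phase_i8() is a hypothetical name and the layout (n_samples * ploidy contiguous bytes) is an assumption. The sketch relies on two numeric facts: the int8 vector-end sentinel (0x81) has its low bit set, so padding does not clear the result, and the int8 missing sentinel (0x80) must not have a bit OR-ed into it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define I8_MISSING 0x80u  /* bit pattern of bcf_int8_missing (-128) */

/*
 * Hypothetical helper: for each sample, OR the AND of the low bits of
 * alleles 1..ploidy-1 into the low bit of allele 0.  Comparisons yield
 * 0 or 1, so the sentinel guard needs no explicit branch.
 */
static void set_first_phase_i8(uint8_t *gt, size_t n_samples, int ploidy)
{
    assert(ploidy > 0);
    if (ploidy == 1) {
        /* Haploid: no other alleles, so the bit is always set. */
        for (size_t s = 0; s < n_samples; s++)
            gt[s] |= (uint8_t)(gt[s] != I8_MISSING);
    } else if (ploidy == 2) {
        /* Diploid: copy the second allele's phase bit onto the first. */
        for (size_t s = 0; s < n_samples; s++)
            gt[2 * s] |= (uint8_t)(gt[2 * s + 1] & 1 & (gt[2 * s] != I8_MISSING));
    } else {
        /* General case: inner loop over the remaining alleles. */
        for (size_t s = 0; s < n_samples; s++) {
            uint8_t *p = gt + s * (size_t)ploidy;
            uint8_t all = 1;
            for (int i = 1; i < ploidy; i++)
                all &= p[i];
            p[0] |= (uint8_t)(all & (p[0] != I8_MISSING));
        }
    }
}
```

Only the least-significant bit of each value takes part in the computation, which is what allows the haploid and diploid loops to avoid per-allele branching.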
Make bcf_readrec able to update phasing, and faster updatephasing() function
@vasudeva8
Contributor Author

Updated so that BCF and VCF produce the same binary values (updatephasing is applied only to BCF with version < v4.4)
cleanup
bcftools will need an update to use bcf_gt_is_missing instead of bcf_gt_missing in the Mendelian check.
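The reason a comparison against bcf_gt_missing can stop working is sketched below. The GT_* macros are local stand-ins that mirror HTSlib's bcf_gt_missing (0) and bcf_gt_is_missing() (true when the value shifted right by one is zero); count_missing_alleles() is a hypothetical helper and is not taken from the bcftools Mendelian check.

```c
#include <assert.h>
#include <stdint.h>

#define GT_MISSING        0                     /* mirrors bcf_gt_missing      */
#define GT_IS_MISSING(v)  (((v) >> 1) ? 0 : 1)  /* mirrors bcf_gt_is_missing() */

/*
 * Hypothetical helper: count missing alleles in one sample's GT values.
 * A missing allele whose phase bit is set is stored as 1, not 0, so an
 * equality test against GT_MISSING would overlook it, whereas
 * GT_IS_MISSING() ignores the phase bit.
 */
static int count_missing_alleles(const int32_t *gt, int ploidy)
{
    assert(ploidy > 0);
    int n = 0;
    for (int i = 0; i < ploidy; i++)
        n += GT_IS_MISSING(gt[i]);
    return n;
}
```

For example, a sample stored as { 1, 5 } (missing first allele with the phase bit set, then a phased allele 1) has one missing allele according to this helper, although its first value is not equal to GT_MISSING.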
