This repo contains the manuscript for the publication describing the vcf-zarr specification and its compression and query performance on several datasets. All code required to generate the figures and run the example analyses is also included.
- The main text is in the `paper.tex` / `paper.bib` files.
- Building the document and creating the plots is automated using the `Makefile`. Building datasets and running benchmarks are also semi-automated via the main `Makefile` (but not entirely, as these benchmarks take a long time to run and had to be done bit-by-bit). Run `python3 src/collect_data.py --help` to see the available commands.
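  For example (the assumption that the default `make` target builds the manuscript and plots is mine; check the `Makefile` for the actual targets):

  ```sh
  # Build the manuscript and regenerate the plots
  # (default target is an assumption -- see the Makefile).
  make

  # List the available data-collection and benchmark commands.
  python3 src/collect_data.py --help
  ```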
- Code for generating all plots is in `src/plot.py`, and the data for these plots is stored in CSV form in `plot_data`.
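  To inspect the plot inputs directly, something like the following works (the CSV filename is purely illustrative; list the directory to see the real ones):

  ```sh
  # See which CSV datasets are available.
  ls plot_data/

  # Peek at one of them (hypothetical filename -- substitute a real one).
  head plot_data/some_benchmark.csv
  ```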
- Code for running the data compression analysis on the 1000 Genomes data is in `src/compression_benchmarks.py`. The dataset is downloaded and managed in the `real_data` directory; see the `Makefile` there for details on downloading and converting to the initial Zarr.
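  A sketch of that flow, assuming the default target in `real_data/Makefile` covers both the download and the conversion:

  ```sh
  # Download the 1000 Genomes data and convert it to the initial Zarr.
  # (Assumption: the default target does both -- see real_data/Makefile.)
  cd real_data && make
  ```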
- The `gel_example` directory contains the Jupyter notebooks used to run the benchmarks for the Genomics England example.
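  To browse these notebooks locally (assuming a standard Jupyter install; note that the underlying Genomics England data is not distributed with this repo):

  ```sh
  # Open the Genomics England benchmark notebooks.
  jupyter notebook gel_example/
  ```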
- Code for running the simulation-based scaling benchmarks is in `src/collect_data.py`, and the actual Zarr-Python benchmarks it uses are in `src/zarr_afdist.py`. The `software` directory contains the `savvy-afdist` C++ code used in the benchmarks, along with downloaded versions of bcftools etc.; the `Makefile` in that directory should take care of downloading and compiling all of these. The simulated dataset and various copies of it are stored in `scaling`, along with some utilities; the `Makefile` there is the starting point, and should take care of downloading the dataset, doing the subsetting and running the various conversion tools.
To run the simulation-based benchmarks:

- `cd` to the `software` directory and run `make` (you may need to install some dependencies).
- `cd` to the `scaling` directory and run `make`. This will take a long time and needs a lot of storage space.
- Run the various benchmarks using `python src/collect_data.py`.
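Putting those steps together, a session might look like this (specific subcommand names are omitted; `--help` lists them):

```sh
# 1. Build savvy-afdist, bcftools, etc.
#    (you may need to install some dependencies first).
cd software && make && cd ..

# 2. Build the simulated dataset and its converted copies.
#    Warning: slow, and needs a lot of storage space.
cd scaling && make && cd ..

# 3. List, then run, the individual benchmarks.
python src/collect_data.py --help
```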