This repo contains the manuscript for the publication describing the vcf-zarr specification and its compression and query performance on several datasets. All code required to generate the figures and run the example analyses is also included.
- The main text is in the `paper.tex` / `paper.bib` files.
- Building the document and creating the plots is automated using the `Makefile`. Building datasets and running benchmarks are also semi-automated via the main `Makefile` (but not entirely, as these benchmarks take a long time to run and had to be done bit-by-bit). Run `python3 src/collect_data.py --help` to see the available commands.
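  For example (the assumption that the default `make` target builds the manuscript and plots is mine; check the `Makefile` for the actual targets):

  ```sh
  # Build the manuscript and regenerate the plots
  # (default target is an assumption -- see the Makefile).
  make

  # List the available data-collection and benchmark commands.
  python3 src/collect_data.py --help
  ```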
- Code for generating all plots is in `src/plot.py`, and the data for these plots is stored in CSV form in `plot_data`.
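  To inspect the plot inputs directly, something like the following works (the CSV filename is purely illustrative; list the directory to see the real ones):

  ```sh
  # See which CSV datasets are available.
  ls plot_data/

  # Peek at one of them (hypothetical filename -- substitute a real one).
  head plot_data/some_benchmark.csv
  ```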
- Code for running the data compression analysis on the 1000 Genomes data is in `src/compression_benchmarks.py`. The dataset is downloaded and managed in the `real_data` directory; see the `Makefile` there for details on downloading and converting to the initial Zarr.
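  A sketch of that flow, assuming the default target in `real_data/Makefile` covers both the download and the conversion:

  ```sh
  # Download the 1000 Genomes data and convert it to the initial Zarr.
  # (Assumption: the default target does both -- see real_data/Makefile.)
  cd real_data && make
  ```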
- The `gel_example` directory contains the Jupyter notebooks used to run the benchmarks for the Genomics England example.
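  To browse these notebooks locally (assuming a standard Jupyter install; note that the underlying Genomics England data is not distributed with this repo):

  ```sh
  # Open the Genomics England benchmark notebooks.
  jupyter notebook gel_example/
  ```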
- Code for running the simulation-based scaling benchmarks is in `src/collect_data.py`, and the actual Zarr-Python benchmarks it uses are in `src/zarr_afdist.py`. The `software` directory contains the `savvy-afdist` C++ code used in the benchmarks, along with downloaded versions of bcftools etc.; the `Makefile` in that directory should take care of downloading and compiling all of these. The simulated dataset and various copies of it are stored in `scaling`, along with some utilities; the `Makefile` there is the starting point, and should take care of downloading the dataset, doing the subsetting and running the various conversion tools.
To run the simulation-based benchmarks:

- `cd` to the `software` directory and run `make` (you may need to install some dependencies).
- `cd` to the `scaling` directory and run `make`. This will take a long time and needs a lot of storage space.
- Run the various benchmarks using `python src/collect_data.py`.
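Putting those steps together, a session might look like this (specific subcommand names are omitted; `--help` lists them):

```sh
# 1. Build savvy-afdist, bcftools, etc.
#    (you may need to install some dependencies first).
cd software && make && cd ..

# 2. Build the simulated dataset and its converted copies.
#    Warning: slow, and needs a lot of storage space.
cd scaling && make && cd ..

# 3. List, then run, the individual benchmarks.
python src/collect_data.py --help
```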