A fully automated and reproducible pipeline for building large reference panels of jointly-called and phased human
genomes, aligned to GRCh38.
This pipeline was inspired by the alignment and SNP calling workflow used by the New York Genome Center (NYGC) in their recent paper High-coverage whole-genome sequencing of the expanded 1000 Genomes Project cohort including 602 trios; implemented here, with improvements, using snakemake and conda for full reproducibility.
Download the refpanel source code
git clone [email protected]:ekirving/refpanel.git && cd refpanelThis pipeline uses the conda package manager (or the
faster mamba front-end) to handle installation of all software
dependencies. If you do not already have conda or mamba installed, then please install one first.
Once conda is setup, build and activate a new environment for the refpanel pipeline
conda env create --name refpanel --file environment.yamlconda activate refpanelThis pipeline comes preconfigured to build a joint-callset, called refpanel-v2 (n=5,100), involving all publicly
available samples from:
- 1000 Genomes Project (1000G), 30x NYGC version (n=3,202; inc. 2,504 phase 3 samples + 698 additional related samples);
The 1000 Genomes Project Consortium. (2015) Nature doi:10.1038/nature15393
Byrska-Bishop et al. (2022) Cell doi:10.1016/j.cell.2022.08.004 - Simons Genome Diversity Project (SGDP) (n=256; excluding overlap with 1000G);
Mallick et al. (2016) Nature doi:10.1038/nature18964 - Gambian Genome Variation Project (GGVP) (n=400; excluding overlap with 1000G);
Band et al. (2019) Nature Communications doi:10.1038/s41467-019-13480-z - Human Genome Diversity Project (HGDP) (n=816; excluding overlap with SGDP);
Bergström et al. (2020) Science doi:10.1126/science.aay5012 - Arabian Peninsula Population Genomic Study (APPG) (n=137);
Almarri et al. (2021) Cell doi:10.1016/j.cell.2021.07.013
Plus additional public genomes from:
- Meyer et al. (2012) Science (n=8); doi:10.1126/science.1224344
- Mondal et al. (2016) Nature Genetics (n=60); doi:10.1038/ng.3621
- Rodriguez-Flores et al. (2016) Genome Research (n=108); doi:10.1101/gr.191478.115
- de Barros Damgaard et al. (2018) Science (n=41); 10.1126/science.aar7711
- McColl et al. (2018) Science (n=2); doi:10.1126/science.aat3628
- Lindo et al. (2018) Science Advances (n=25); 10.1126/sciadv.aau4921
- Gelabert et al. (2019) BMC Genomics (n=12); doi:10.1186/s12864-019-5529-0
- Serra-Vidal et al. (2019) Current Biology (n=21); doi:10.1016/j.cub.2019.09.050
- Lorente-Galdos et al. (2019) Genome Biology (n=9); doi:10.1186/s13059-019-1684-5
- Crooks et al. (2020) BMC Genetics (n=3); doi:10.1186/s12863-020-00917-4
The data from these projects is hosted by the International Genome Sample Resource (IGSR) database (doi:10.1093/nar/gkw829) and the European Nucleotide Archive (ENA) (doi:10.1093/nar/gkq967).
If there are publicly available whole-genome sequencing data that you would like incorporated into refpanel-v3 please
raise an issue on GitHub with the details of the publication and they will be considered for inclusion in future releases.
If you wish to build a customised joint-callset (e.g., including non-public samples), please refer to the configuration docs.
To ensure all data is processed consistently, refpanel downloads gVCF files for 1000G; CRAM files for 1000G, HGDP,
SGDP and GGVP; and fastq files for all other data sources.
To (optionally) pre-fetch all the data dependencies, run:
./refpanel download_data &All output will be automatically written to a log file refpanel-<YYYY-MM-DD-HHMM.SS>.log
Superpopulation assignments are based on the original 1000G, HGDP and SGDP metadata.
| Superpopulation | Code | Samples |
|---|---|---|
| African Ancestry | AFR | 1,460 |
| American Ancestry | AMR | 589 |
| Central Asian and Siberian Ancestry | CAS | 66 |
| Central and South Asian Ancestry | CSA | 199 |
| East Asian Ancestry | EAS | 826 |
| European Ancestry | EUR | 790 |
| Middle Eastern Ancestry | MEA | 407 |
| Oceanian Ancestry | OCE | 38 |
| South Asian Ancestry | SAS | 678 |
| West Eurasian Ancestry | WEA | 47 |
In brief, refpanel produces a jointly-called and phased callset via the following steps:
- Adapter trimming with
fastp(v.0.23.2) - Alignment to
GRCh38withbwa mem(v0.7.15) - Fix-mate, merge, sort,
and mark duplicates with
picard(v2.5.0) - Base recalibration with
gatk BaseRecalibrator(v3.5) - Conversion to
cramwithsamtools(v1.14) - Per-sample calling of
gVCFswithgatk HaplotypeCaller(with sex-dependent ploidy) - Merging samples with
gatk CombineGVCFs - Joint-calling of all samples with
gatk GenotypeGVCFs - Variant quality score recalibration with
gatk VariantRecalibrator - Annotation with dbSNP build 155 with
bcftools(v1.14) - Hard-filtering of SNPs and INDELs with
bcftools:- VQSR PASS;
- GT missingness < 5%;
- HWE p-value > 1e-10 in at least one super-population;
- Mendelian error rate < 5%, using trios from 1000G (n=602) and GGVP (n=133);
- MAC ≥ 2 (i.e., no singletons);
- Read-based phasing with
whatshap(v1.2.1) using:- Illumina paired-end reads from all projects;
- 10x Genomics linked-read sequencing from HGDP (n=26) and APPG (n=137);
- Pedigree phasing with
whatshapusing:- Trios from 1000G (n=602) and GGVP (n=130);
- Statistical phasing with
shapeit4(v4.2.2) - Variant effect prediction with
ensembl-vep(v105.0)
For more information, refer to the DAG of the rule graph or the code itself.
To execute the full pipeline, end-to-end, run:
./refpanel &All output will be automatically written to a log file refpanel-<YYYY-MM-DD-HHMM.SS>.log
The pipeline can also be broken down into separate steps, for distribution across multiple nodes in a cluster.