lh3/OpenHGL

Introduction

The Open Human Genome Library (OpenHGL) is a collection of high-quality de novo human assemblies that are publicly available in genomic databases (e.g. NCBI and CNCB) or from individual research papers. It provides consistent naming and uniform formats across datasets, supporting efficient subsequence retrieval and approximate string search.

The dataset currently consists of 579 human genomes with 1.7 trillion basepairs. The full dataset is available on AWS as Open Data. The primary data is also archived at Zenodo.

Downloading OpenHGL Data

OpenHGL is available in the S3 bucket s3://openhgl. The fastest way to download bulk data is with the AWS command-line interface (aws-cli), for example:

# list all files (there are tens of files in total)
aws s3 ls --no-sign-request --recursive s3://openhgl

# download a small sample file (24.8MB in size)
aws s3 cp --no-sign-request s3://openhgl/misc/mtb/mtb152.tar.gz .

If you are not familiar with aws-cli, you can browse the files, find their links and download them with wget or curl. Alternatively, you can download the primary data from Zenodo. However, because Zenodo provides limited space, derived files (e.g. the FM-index in the static format) are not available there. Downloading from Zenodo is also much slower than from AWS.
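
For wget or curl, an S3 path has to be rewritten as an HTTPS link. The sketch below assumes the URL pattern that appears in the download commands of this README (bucket openhgl, region us-east-1). The helper name s3_to_https is hypothetical and not part of any OpenHGL tooling.

```python
def s3_to_https(s3_uri: str, region: str = "us-east-1") -> str:
    """Rewrite s3://bucket/key as a virtual-hosted-style HTTPS URL.

    The default region is an assumption taken from the URLs in this README.
    """
    prefix = "s3://"
    if not s3_uri.startswith(prefix):
        raise ValueError(f"not an S3 URI: {s3_uri}")
    # Split "bucket/key/with/slashes" at the first slash only.
    bucket, _, key = s3_uri[len(prefix):].partition("/")
    if not bucket or not key:
        raise ValueError(f"expected s3://bucket/key, got: {s3_uri}")
    return f"https://{bucket}.s3.{region}.amazonaws.com/{key}"


print(s3_to_https("s3://openhgl/human/human579/human579.agc"))
```

The printed link can be passed directly to wget or curl.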

Using OpenHGL Data

File description

At present, OpenHGL provides genome sequences in the AGC format and the corresponding FM-index in the ropebwt3 format:

  • human579.agc: AGC archive of assembly sequences
  • human579.fmd: BWT in the static ropebwt3 format (AWS only)
  • human579.fmd.ssa: sampled suffix array (AWS only)
  • human579.fmd.len.gz: contig names and lengths
  • human579.fmr.gz: BWT sequence in the dynamic ropebwt3 format
  • human579.fmd.ssa.gz: sampled suffix array (Zenodo only)
  • human579.meta.tsv: metadata including 1) assembly name, 2) the sex chromosome in the assembly, 3) sample name, 4) sample sex, 5) SGDP region code, 6) 1KG population code and 7) country.
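
The metadata table can be loaded with a few lines of Python. This is a minimal sketch that assumes the file is tab-separated with exactly the seven columns listed above, in that order, and that any header or comment line starts with "#". The field names and the demo row are hypothetical placeholders, not real metadata values.

```python
import csv
import io
from typing import Dict, Iterator, TextIO

# Hypothetical field names, following the column order described above.
FIELDS = ["assembly", "asm_sex_chr", "sample", "sample_sex",
          "sgdp_region", "kg_population", "country"]


def read_meta(handle: TextIO) -> Iterator[Dict[str, str]]:
    """Yield one dict per assembly; skip blank lines and '#' lines."""
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue
        if len(row) != len(FIELDS):
            raise ValueError(f"expected {len(FIELDS)} columns, got {len(row)}")
        yield dict(zip(FIELDS, row))


# Placeholder row for illustration only; not taken from human579.meta.tsv.
demo = io.StringIO("000000_SAMPLE1.pat\tY\tSAMPLE1\tM\tREGION\tPOP\tCOUNTRY\n")
records = list(read_meta(demo))
print(records[0]["sample"], records[0]["asm_sex_chr"])
```

To read the real file, pass an open file handle instead of the in-memory demo, e.g. `with open("human579.meta.tsv") as fh: records = list(read_meta(fh))`.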

Retrieving genomic sequences

It is recommended to download a precompiled AGC binary from its release page. After copying the agc binary to a directory in your PATH, you can download and retrieve sequences with:

# download AGC archive
wget https://openhgl.s3.us-east-1.amazonaws.com/human/human579/human579.agc
# or with aws-cli
aws s3 cp --no-sign-request s3://openhgl/human/human579/human579.agc .

# list assembly names
agc listset human579.agc

# list contig names in assembly 200125_HG02129.pat
agc listctg human579.agc 200125_HG02129.pat

# retrieve all sequences in assembly 200125_HG02129.pat
agc getctg human579.agc 200125_HG02129.pat > HG02129.pat.fa

# retrieve the first 100bp of contig HG02129#1#CM085853.1
agc getctg human579.agc HG02129#1#CM085853.1:0-99

Importantly, with AGC, the coordinate of the first base is 0 and start-end is a closed interval. This differs from common tools such as samtools faidx, which also use closed intervals but place the first base at coordinate 1.
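
Because both conventions use closed intervals, converting between them only shifts start and end by one. The sketch below illustrates this with region strings of the form name:start-end; the function names are hypothetical helpers, not part of AGC or samtools.

```python
import re

# name:start-end, where the name itself may contain '#' or ':'.
_REGION = re.compile(r"^(.+):(\d+)-(\d+)$")


def _split(region: str):
    m = _REGION.match(region)
    if m is None:
        raise ValueError(f"expected name:start-end, got: {region}")
    return m.group(1), int(m.group(2)), int(m.group(3))


def agc_to_samtools(region: str) -> str:
    """0-based closed (AGC) -> 1-based closed (samtools faidx)."""
    name, start, end = _split(region)
    return f"{name}:{start + 1}-{end + 1}"


def samtools_to_agc(region: str) -> str:
    """1-based closed (samtools faidx) -> 0-based closed (AGC)."""
    name, start, end = _split(region)
    if start < 1:
        raise ValueError("1-based coordinates must start at 1 or above")
    return f"{name}:{start - 1}-{end - 1}"


# The first 100 bp of the contig used in the example above.
print(agc_to_samtools("HG02129#1#CM085853.1:0-99"))
print(samtools_to_agc("HG02129#1#CM085853.1:1-100"))
```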

Finding sequence matches

Ropebwt3 is required for string search:

# install ropebwt3
git clone https://github.com/lh3/ropebwt3
cd ropebwt3; make               # add "omp=0" if you see errors

# download FM-index
wget https://openhgl.s3.us-east-1.amazonaws.com/human/human579/human579.fmd
wget https://openhgl.s3.us-east-1.amazonaws.com/human/human579/human579.fmd.ssa
wget https://openhgl.s3.us-east-1.amazonaws.com/human/human579/human579.fmd.len.gz

# exact match
echo CCAGGACCCCTGTCCAGTGTTAGACAGGAGCATGCAG | ropebwt3 mem -L human579.fmd -

# inexact match
echo CCAGGACCCCTGTCCAGTGTTAGACAGGAGCATGCAG | ropebwt3 sw -eN200 -Lm10 human579.fmd -

More use cases

The following command lines show more use cases:

# Locate up to 100 exact matches
ropebwt3 mem -t16 -p100 human579.fmd seq.fa.gz > out.bed

# Find non-human sequences/contaminations
ropebwt3 mem -t16 -l101 --gap=10k human579.fmd seq.fastq.gz > out.bed

# Count 101-mers occurring over 20 times per genome on average
ropebwt3 kount -k101 -m 11580 human579.fmd > k101-20.txt

The ropebwt3 paper provides additional examples.
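
The value 11580 passed to -m in the kount example equals 20 occurrences per genome multiplied by the 579 genomes in the index. The sketch below reproduces that arithmetic for other thresholds. It assumes, as the example suggests, that -m is a minimum total occurrence count across the whole index; the function name is hypothetical.

```python
N_GENOMES = 579  # number of genomes in human579, as stated in the introduction


def kount_min_count(per_genome: int, n_genomes: int = N_GENOMES) -> int:
    """Total occurrence threshold for an average per-genome count."""
    if per_genome < 0 or n_genomes <= 0:
        raise ValueError("counts must be non-negative and n_genomes positive")
    return per_genome * n_genomes


print(kount_min_count(20))  # 11580, the value used in the kount example
```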

Data Description

Data sources

Name    Version     nAsm  Description
CHM13   2.0         1     Analysis set with HG002 chrY and rCRS chrM
CN1     1.0.1       2     Chinese Han
KSA001  1.1.0       2     Saudi Arabia
I002C   0.7         2     Indian
KOREF1  2025        2     Korean
YAO     2.0         2     Chinese
HPRC    r2-v1.0.1   464   Human Pangenome Reference Consortium
APR     v1          104   UAE-based Arab Pangenome Reference

Criteria in sample selection:

  • Publicly available
  • Requiring PacBio HiFi for base accuracy
  • Requiring ultra-long Nanopore reads for assembly through difficult regions
  • Requiring trio or Hi-C data for chromosome-scale phasing
  • Independent samples

Additional procedure:

  • APR male samples were processed with the yak X/Y partition pipeline such that Y is placed in hap1 and X in hap2. HPRC samples were processed the same way by the consortium.

Naming convention

An assembly name (e.g. 200125_HG02129.pat) matches the regular expression ([0-9]{6})_([A-Z0-9]+)\.(pri|pat|mat|hap1|hap2). The leading digits are a unique identifier for the contig set. The alphanumeric string after the first underscore is the sample name. If the assembly of a sample is updated, the sample name stays the same but the identifier changes. The ending code specifies the assembly type:

  • pri: primary assembly (for CHM13 only)
  • pat: paternal assembly from trio phasing, with chrY
  • mat: maternal assembly from trio phasing, with chrX
  • hap1: haplotype 1 from Hi-C phasing, with chrY (partitioned with yak)
  • hap2: haplotype 2 from Hi-C phasing, with chrX (partitioned with yak)
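
The naming rule above can be checked programmatically. This is a minimal sketch that applies the regular expression exactly as given and splits a name into its three parts; the class and function names are hypothetical.

```python
import re
from typing import NamedTuple

# Regular expression copied from the naming convention above.
ASM_RE = re.compile(r"([0-9]{6})_([A-Z0-9]+)\.(pri|pat|mat|hap1|hap2)")


class AssemblyName(NamedTuple):
    set_id: str    # six-digit identifier of the contig set
    sample: str    # sample name
    asm_type: str  # pri, pat, mat, hap1 or hap2


def parse_assembly_name(name: str) -> AssemblyName:
    """Split an assembly name into identifier, sample and assembly type."""
    m = ASM_RE.fullmatch(name)
    if m is None:
        raise ValueError(f"not a valid assembly name: {name}")
    return AssemblyName(*m.groups())


print(parse_assembly_name("200125_HG02129.pat"))
```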

A contig name matches the regular expression ([^\s#]+)#[012]#([^\s#]+), where the first field is the sample name and the last field is the contig or chromosome name. The number in the middle indicates the haplotype: 0 for the primary assembly, 1 for paternal or haplotype 1, and 2 for maternal or haplotype 2.
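
A contig name can be split into its fields in the same way. The sketch below uses the pattern above with one change, made here for illustration: the haplotype digit is wrapped in a capture group so it can be extracted. The function name and the label table are hypothetical.

```python
import re
from typing import Tuple

# Pattern from the naming convention, with the haplotype digit captured.
CONTIG_RE = re.compile(r"([^\s#]+)#([012])#([^\s#]+)")

# Meaning of the middle field, as described above.
HAPLOTYPE_LABEL = {
    0: "primary",
    1: "paternal or haplotype 1",
    2: "maternal or haplotype 2",
}


def parse_contig_name(name: str) -> Tuple[str, int, str]:
    """Return (sample, haplotype, contig) for a contig name."""
    m = CONTIG_RE.fullmatch(name)
    if m is None:
        raise ValueError(f"not a valid contig name: {name}")
    sample, hap, contig = m.groups()
    return sample, int(hap), contig


sample, hap, contig = parse_contig_name("HG02129#1#CM085853.1")
print(sample, HAPLOTYPE_LABEL[hap], contig)
```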

Known issues

  • HG002 from HPRC and CHM13 share the same Y chromosome
  • HG00272 has a ~50Mb inversion misassembly on the X chromosome
  • NA20806 has X and Y chromosomes mispartitioned to the same haplotype
  • HG02145 has a fragmented Y chromosome (see HPRC noteworthy samples)

ChangeLogs

  • 4.0: added I002C, KOREF1 and APR; updated YAO to v2.0
  • 3.0: updated HPRC assemblies to r2-v1.0.1
