
Introduction

The Open Human Genome Library (OpenHGL) is a collection of high-quality de novo human assemblies that are publicly available in genomic databases (e.g. NCBI and CNCB) or from individual research papers. It provides consistent naming and uniform formats across datasets, supporting efficient subsequence retrieval and approximate string search.

The dataset currently consists of 579 human genomes totaling 1.7 trillion base pairs. The full dataset is available on AWS as Open Data. The primary data is also archived at Zenodo.

Downloading OpenHGL Data

OpenHGL is available in the S3 bucket s3://openhgl. The fastest way to download bulk data is with the AWS command-line interface (aws-cli), for example:

# list all files (there are tens of files in total)
aws s3 ls --no-sign-request --recursive s3://openhgl

# download a small sample file (24.8MB in size)
aws s3 cp --no-sign-request s3://openhgl/misc/mtb/mtb152.tar.gz .

If you are not familiar with aws-cli, you can browse the files, find their links and download with wget or curl. Alternatively, you can download primary data from Zenodo. However, due to limited space provided by Zenodo, derived files (e.g. FM-index in the static format) are not available. Downloading from Zenodo is also much slower than from AWS.
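For readers who prefer wget or curl, the sketch below turns an s3://openhgl/ URI into an HTTPS link. The helper name s3_to_https is hypothetical, and the host name is copied from the wget examples in this README; it is an assumption that every object in the bucket is reachable under that same host.

```shell
# s3_to_https: hypothetical helper, not part of aws-cli or OpenHGL.
# Assumes all objects share the host used by the wget examples in this README.
s3_to_https() {
    case "$1" in
        s3://openhgl/*)
            # Strip the bucket prefix and append the remaining key to the HTTPS host.
            printf 'https://openhgl.s3.us-east-1.amazonaws.com/%s\n' "${1#s3://openhgl/}"
            ;;
        *)
            echo "not an s3://openhgl/ URI: $1" >&2
            return 1
            ;;
    esac
}

s3_to_https s3://openhgl/human/human579/human579.agc
# prints https://openhgl.s3.us-east-1.amazonaws.com/human/human579/human579.agc
```

The printed link can then be passed to a downloader, for example curl -O "$(s3_to_https s3://openhgl/misc/mtb/mtb152.tar.gz)".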

Using OpenHGL Data

File description

At present, OpenHGL provides genome sequences in the AGC format and the corresponding FM-index in the ropebwt3 format:

human579.agc: AGC archive of assembly sequences
human579.fmd: BWT in the static ropebwt3 format (AWS only)
human579.fmd.ssa: sampled suffix array (AWS only)
human579.fmd.len.gz: contig names and lengths
human579.fmr.gz: BWT sequence in the dynamic ropebwt3 format
human579.fmd.ssa.gz: sampled suffix array (Zenodo only)
human579.meta.tsv: metadata including 1) assembly name, 2) the sex chromosome in the assembly, 3) sample name, 4) sample sex, 5) SGDP region code, 6) 1KG population code and 7) country.
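As a rough illustration of working with the metadata table, the sketch below tallies the values in one tab-separated column. The helper name count_by_column is hypothetical. The column numbers follow the list above, but whether the file carries a header line is not stated here, so the sketch only skips lines starting with '#'; treat that as an assumption.

```shell
# count_by_column FILE COL: hypothetical helper.
# Prints "value<TAB>count" for each distinct value in column COL, sorted by value.
# Assumption: any header line, if present, starts with '#'.
count_by_column() {
    awk -F'\t' -v col="$2" '
        /^#/ { next }                 # skip commented lines
        { n[$col]++ }                 # tally the chosen column
        END { for (v in n) printf "%s\t%d\n", v, n[v] }
    ' "$1" | sort
}
```

With the column order given above, count_by_column human579.meta.tsv 6 would count assemblies per 1KG population code, and column 7 would count them per country.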

Retrieving genomic sequences

It is recommended to download the precompiled AGC binary from its release page. After copying the agc binary to a directory in your PATH, you can download the archive and retrieve sequences with:

# download AGC archive
wget https://openhgl.s3.us-east-1.amazonaws.com/human/human579/human579.agc
# or with aws-cli
aws s3 cp --no-sign-request s3://openhgl/human/human579/human579.agc .

# list assembly names
agc listset human579.agc

# list contig names in assembly 200125_HG02129.pat
agc listctg human579.agc 200125_HG02129.pat

# retrieve all sequences in assembly 200125_HG02129.pat
agc getctg human579.agc 200125_HG02129.pat > HG02129.pat.fa

# retrieve the first 100bp of contig HG02129#1#CM085853.1
agc getctg human579.agc HG02129#1#CM085853.1:0-99

Importantly, AGC coordinates are 0-based: the first base is at coordinate 0, and start-end is a closed interval, so 0-99 covers 100 bases. This differs from common tools such as samtools faidx, which also uses closed intervals but puts the first base at coordinate 1.
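Because both conventions use closed intervals, converting an AGC region to a samtools faidx region only requires adding 1 to both ends. A minimal sketch, with agc_to_faidx as a hypothetical helper name:

```shell
# agc_to_faidx: hypothetical helper, not part of agc or samtools.
# Rewrites NAME:START-END from AGC coordinates (0-based, closed)
# to samtools faidx coordinates (1-based, closed).
agc_to_faidx() {
    name=${1%:*}       # everything before the last ':'
    range=${1##*:}     # START-END
    start=${range%-*}
    end=${range#*-}
    printf '%s:%d-%d\n' "$name" "$(($start + 1))" "$(($end + 1))"
}

agc_to_faidx 'HG02129#1#CM085853.1:0-99'
# prints HG02129#1#CM085853.1:1-100
```

Quote the region on the command line so that the shell never treats '#' specially.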

Finding sequence matches

Ropebwt3 is required for string search:

# install ropebwt3
git clone https://github.com/lh3/ropebwt3
cd ropebwt3; make               # add "omp=0" if you see errors

# download FM-index
wget https://openhgl.s3.us-east-1.amazonaws.com/human/human579/human579.fmd
wget https://openhgl.s3.us-east-1.amazonaws.com/human/human579/human579.fmd.ssa
wget https://openhgl.s3.us-east-1.amazonaws.com/human/human579/human579.fmd.len.gz

# exact match
echo CCAGGACCCCTGTCCAGTGTTAGACAGGAGCATGCAG | ropebwt3 mem -L human579.fmd -

# inexact match
echo CCAGGACCCCTGTCCAGTGTTAGACAGGAGCATGCAG | ropebwt3 sw -eN200 -Lm10 human579.fmd -

More use cases

The following command lines show more use cases:

# Locate up to 100 exact matches
ropebwt3 mem -t16 -p100 human579.fmd seq.fa.gz > out.bed

# Find non-human sequences/contaminations
ropebwt3 mem -t16 -l101 --gap=10k human579.fmd seq.fastq.gz > out.bed

# Count 101-mers occurring over 20 times per genome on average
ropebwt3 kount -k101 -m 11580 human579.fmd > k101-20.txt

The ropebwt3 paper provides additional examples.
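The value 11580 passed to -m in the kount example appears to be the number of genomes multiplied by the average copy number per genome (579 x 20). The sketch below recomputes such a threshold for other settings; min_count is a hypothetical helper name, and the interpretation of -m as a total-occurrence threshold is inferred from the comment in that example rather than from the ropebwt3 documentation.

```shell
# min_count GENOMES COPIES: hypothetical helper.
# Total occurrences = number of genomes x average copies per genome.
min_count() {
    echo $(( $1 * $2 ))
}

min_count 579 20
# prints 11580
```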

Data Description

Data sources

Name	Version	nAsm	Description
CHM13	2.0	1	Analysis set with HG002 chrY and rCRS chrM
CN1	1.0.1	2	Chinese Han
KSA001	1.1.0	2	Saudi Arabia
I002C	0.7	2	Indian
KOREF1	2025	2	Korean
YAO	2.0	2	Chinese
HPRC	r2-v1.0.1	464	Human Pangenome Reference Consortium
APR	v1	104	UAE-based Arab Pangenome Reference

Criteria in sample selection:

Publicly available
Requiring PacBio HiFi for base accuracy
Requiring ultra-long Nanopore reads for assembly through difficult regions
Requiring trio or Hi-C data for chromosome-scale phasing
Independent samples

Additional procedure:

APR male samples were processed with the yak X/Y partition pipeline such that Y is placed in hap1 and X in hap2. HPRC samples were processed the same way by the consortium.

Naming convention

An assembly name matches the regular expression ([0-9]{6})_([A-Z0-9]+)\.(pri|pat|mat|hap1|hap2). The leading six digits are a unique identifier for the contig set. The alphanumeric string after the underscore is the sample name. If the assembly of a sample is updated, the sample name stays the same but the identifier changes. The ending code specifies the assembly type:

pri: primary assembly (for CHM13 only)
pat: paternal assembly from trio phasing, with chrY
mat: maternal assembly from trio phasing, with chrX
hap1: haplotype 1 from Hi-C phasing, with chrY (partitioned with yak)
hap2: haplotype 2 from Hi-C phasing, with chrX (partitioned with yak)

A contig name matches ([^\s#]+)#[012]#([^\s#]+), where the first field is the sample name and the last field is the contig or chromosome name. The number in the middle indicates the haplotype: 0 for the primary assembly, 1 for paternal or haplotype 1, and 2 for maternal or haplotype 2.
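The two patterns above can be used to validate and split names in scripts. The sketch below is one way to do it in POSIX shell; parse_assembly and parse_contig are hypothetical helper names, and the patterns are written with [[:space:]] in place of \s for portability across grep implementations.

```shell
# parse_assembly: hypothetical helper for names such as 200125_HG02129.pat.
# Prints identifier, sample name and assembly type, tab-separated.
parse_assembly() {
    if printf '%s\n' "$1" | grep -Eq '^[0-9]{6}_[A-Z0-9]+\.(pri|pat|mat|hap1|hap2)$'; then
        id=${1%%_*}            # six-digit identifier
        rest=${1#*_}           # SAMPLE.TYPE
        printf '%s\t%s\t%s\n' "$id" "${rest%.*}" "${rest##*.}"
    else
        echo "name does not match the assembly pattern: $1" >&2
        return 1
    fi
}

# parse_contig: hypothetical helper for names such as HG02129#1#CM085853.1.
# Prints sample name, haplotype number and contig/chromosome name, tab-separated.
parse_contig() {
    if printf '%s\n' "$1" | grep -Eq '^[^[:space:]#]+#[012]#[^[:space:]#]+$'; then
        printf '%s\n' "$1" | awk -F'#' -v OFS='\t' '{ print $1, $2, $3 }'
    else
        echo "name does not match the contig pattern: $1" >&2
        return 1
    fi
}

parse_assembly 200125_HG02129.pat
parse_contig 'HG02129#1#CM085853.1'
```

Both helpers return a non-zero status for names that do not match, so they can double as a sanity check before calling agc.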

Known issues

HG002 from HPRC and CHM13 share the same Y chromosome
HG00272 has a ~50Mb inversion misassembly on the X chromosome
NA20806 has X and Y chromosomes mispartitioned to the same haplotype
HG02145 has a fragmented Y chromosome (see HPRC noteworthy samples)

ChangeLogs

4.0: added I002C, KOREF1 and APR; updated YAO to v2.0
3.0: updated HPRC assemblies to r2-v1.0.1
