Abstract
The ultimate goal of de novo assembly of reads sequenced from a diploid individual is the separate reconstruction of the sequences corresponding to the two copies of each chromosome. Unfortunately, the allele linkage information needed to perform phased genome assemblies has been difficult to generate. Hence, most current genome assemblies are a haploid mixture of the two underlying chromosome copies present in the sequenced individual. Sequencing technologies providing long (20 kb) and accurate reads are the basis to generate phased genome assemblies. This chapter provides a brief overview of the main milestones in traditional genome assembly, focusing on the bioinformatic techniques developed to generate haplotype information from different specialized protocols. Using these techniques as a knowledge background, the chapter reviews the current algorithms to generate phased assemblies from long reads with low error rates. Current techniques perform haplotype-aware error correction steps to increase the quality of the raw reads. In addition, variations on the traditional overlap-layout-consensus (OLC) graph have been developed in an effort to eliminate edges between reads sequenced from different chromosome copies. This allows for large presence–absence variants between the chromosome copies to be taken into account. The development of these algorithms, along with the improved sequencing technologies has been crucial to finish chromosome-level assemblies of complex genomes.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
Similar content being viewed by others
References
Koren S, Phillippy AM (2015) One chromosome, one contig: complete microbial genomes from long-read sequencing and assembly. Curr Opin Microbiol 23:110–120. https://doi.org/10.1016/j.mib.2014.11.014
Mewes HW, Albermann K, Bähr M et al (1997) Overview of the yeast genome. Nature 387(6632 Suppl):7–65. https://doi.org/10.1038/42755
Adams MD, Celniker SE, Holt RA et al (2000) The genome sequence of Drosophila melanogaster. Science 287:2185–2195. https://doi.org/10.1126/science.287.5461.2185
Myers EW, Sutton GG, Delcher AL et al (2000) A whole-genome assembly of Drosophila. Science 287(5461):2196–2204. https://doi.org/10.1126/science.287.5461.2196
The C. elegans Sequencing Consortium (1998) Genome sequence of the nematode C. elegans: a platform for investigating biology. Science 282:2012–2046. https://doi.org/10.1126/science.282.5396.2012
The Arabidopsis Genome Initiative (2000) Analysis of the genome sequence of the flowering plant Arabidopsis thaliana. Nature 408:796–815. https://doi.org/10.1038/35048692
International Rice Genome Sequencing Project (2005) The map-based sequence of the rice genome. Nature 436:793–800. https://doi.org/10.1038/nature03895
The Mouse Genome Sequencing Consortium (2002) Initial sequencing and comparative analysis of the mouse genome. Nature 420:520–562. https://doi.org/10.1038/nature01262
The Genome Sequencing Consortium (2001) Initial sequencing and analysis of the human genome. Nature 409(6822):860–921. https://doi.org/10.1038/35057062
Goodwin S, McPherson JD, McCombie WR (2016) Coming of age: ten years of next-generation sequencing technologies. Nat Rev Genet 17:333–351. https://doi.org/10.1038/nrg.2016.49
Li R, Fan W, Tian G et al (2010) The sequence and de novo assembly of the giant panda genome. Nature 463:311–317. https://doi.org/10.1038/nature08696
Schmutz J, McClean P, Mamidi S et al (2014) A reference genome for common bean and genome-wide analysis of dual domestications. Nat Genet 46:707–713. https://doi.org/10.1038/ng.3008
The Potato Genome Sequencing Consortium (2011) Genome sequence and analysis of the tuber crop potato. Nature 475:189–195. https://doi.org/10.1038/nature10158
Schnable PS, Ware D, Fulton RS et al (2009) The B73 maize genome: complexity, diversity, and dynamics. Science 326(5956):1112–1115. https://doi.org/10.1126/science.1178534
Denoeud F, Carretero-Paulet L, Dereeper A et al (2014) The coffee genome provides insight into the convergent evolution of caffeine biosynthesis. Science 345(6201):1181–1184. https://doi.org/10.1126/science.1255274
Rhoads A, Au KF (2015) PacBio sequencing and its applications. Genomics Proteomics Bioinformatics 13(5):278–289. https://doi.org/10.1016/j.gpb.2015.08.002
Eid J, Fehr A, Gray J et al (2009) Real-time DNA sequencing from single polymerase molecules. Science 323(5910):133–138. https://doi.org/10.1126/science.1162986
Clarke J, Wu HC, Jayasinghe L et al (2009) Continuous base identification for single-molecule nanopore DNA sequencing. Nat Nanotechnol 4:265–270. https://doi.org/10.1038/nnano.2009.12
Jain M, Olsen HE, Paten B, Akeson M (2016) The Oxford Nanopore MinION: delivery of nanopore sequencing to the genomics community. Genome Biol 17(1):239. https://doi.org/10.1186/s13059-016-1103-0
Chen Y, Nie F, Xie SQ et al (2021) Efficient assembly of nanopore reads via highly accurate and intact error correction. Nat Commun 12:60. https://doi.org/10.1038/s41467-020-20236-7
Jain M, Koren S, Miga KH et al (2018) Nanopore sequencing and assembly of a human genome with ultra-long reads. Nat Biotechnol 36:338–345. https://doi.org/10.1038/nbt.4060
Wenger AM, Peluso P, Rowell WJ et al (2019) Accurate circular consensus long-read sequencing improves variant detection and assembly of a human genome. Nat Biotechnol 37:1155–1162. https://doi.org/10.1038/s41587-019-0217-9
Marks RA, Hotaling S, Frandsen PB et al (2021) Representation and participation across 20 years of plant genome sequencing. Nat Plants 7:1571–1578. https://doi.org/10.1038/s41477-021-01031-8
Kitzman J, MacKenzie A, Adey A et al (2011) Haplotype-resolved genome sequencing of a Gujarati Indian individual. Nat Biotechnol 29:59–63. https://doi.org/10.1038/nbt.1740
Suk EK, McEwen GK, Duitama J et al (2011) A comprehensively molecular haplotype-resolved genome of a European individual. Genome Res 21:1672–1685. https://doi.org/10.1101/gr.125047.111
Duitama J, McEwen GK, Huebsch T et al (2011) Fosmid-based whole genome haplotyping of a HapMap trio child: evaluation of single individual haplotyping techniques. Nucleic Acids Res 40(5):2041–2053. https://doi.org/10.1093/nar/gkr1042
Peters BA, Kermani BG, Sparks AB et al (2012) Accurate whole-genome sequencing and haplotyping from 10 to 20 human cells. Nature 487(7406):190–195. https://doi.org/10.1038/nature11236
Peters BA, Liu J, Drmanac R (2014) Co-barcoded sequence reads from long DNA fragments: a cost-effective solution for “perfect genome” sequencing. Front Genet 5:466. https://doi.org/10.3389/fgene.2014.00466
Redin D, Frick T, Aghelpasand H et al (2019) High throughput barcoding method for genome-scale phasing. Sci Rep 9(1):18116. https://doi.org/10.1038/s41598-019-54446-x
Wang O, Chin R, Cheng X et al (2019) Efficient and unique cobarcoding of second-generation sequencing reads from long DNA molecules enabling cost-effective and accurate sequencing, haplotyping, and de novo assembly. Genome Res 29(5):798–808. https://doi.org/10.1101/gr.245126.118
Lieberman-Aiden E, van Berkum NL, Williams L et al (2009) Comprehensive mapping of long-range interactions reveals folding principles of the human genome. Science 326(5950):289–293. https://doi.org/10.1126/science.1181369
Bickhart DM, Rosen BD, Koren S et al (2017) Single-molecule sequencing and chromatin conformation capture enable de novo reference assembly of the domestic goat genome. Nat Genet 49:643–650. https://doi.org/10.1038/ng.3802
Trujillo-Montenegro JH, Rodríguez Cubillos MJ, Loaiza CD et al (2021) Unraveling the genome of a high yielding Colombian sugarcane hybrid. Front Plant Sci 12:694859. https://doi.org/10.3389/fpls.2021.694859
Browning S, Browning B (2011) Haplotype phasing: existing methods and new developments. Nat Rev Genet 12:703–714. https://doi.org/10.1038/nrg3054
Delaneau O, Zagury JF, Robinson MR et al (2019) Accurate, scalable and integrative haplotype estimation. Nat Commun 10:5436. https://doi.org/10.1038/s41467-019-13225-y
Ma L, Xiao Y, Huang H et al (2010) Direct determination of molecular haplotypes by chromosome microdissection. Nat Methods 7(4):299–301. https://doi.org/10.1038/nmeth.1443
Porubsky D, Garg S, Sanders AD et al (2017) Dense and accurate whole-chromosome haplotyping of individual genomes. Nat Commun 8(1):1293. https://doi.org/10.1038/s41467-017-01389-4
Campoy JA, Sun H, Goel M et al (2020) Gamete binning: chromosome-level and haplotype-resolved genome assembly enabled by high-throughput single-cell sequencing of gamete genomes. Genome Biol 21(1):306. https://doi.org/10.1186/s13059-020-02235-5
Miller JR, Koren S, Sutton G (2010) Assembly algorithms for next-generation sequencing data. Genomics 95:315–327. https://doi.org/10.1016/j.ygeno.2010.03.001
Li Z, Chen Y, Mu D et al (2012) Comparison of the two major classes of assembly algorithms: overlap–layout–consensus and de-bruijn-graph. Brief Funct Genomics 11(1):25–37. https://doi.org/10.1093/bfgp/elr035
Pevzner PA, Tang H, Tesler G (2004) De novo repeat classification and fragment assembly. Genome Res 14:1786–1796. https://doi.org/10.1101/gr.2395204
Zerbino DR, Birney E (2008) Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome Res 18:821–829. https://doi.org/10.1101/gr.074492.107
Li R, Zhu H, Ruan J et al (2009) De novo assembly of human genomes with massively parallel short read sequencing. Genome Res 20:265–272. https://doi.org/10.1101/gr.097261.109
Butler J, MacCallum I, Kleber M et al (2008) ALLPATHS: de novo assembly of whole-genome shotgun microreads. Genome Res 18:810–820. https://doi.org/10.1101/gr.7337908
Bankevich A, Nurk S, Antipov D et al (2012) SPAdes: a new genome assembly algorithm and its applications to single-cell sequencing. J Comput Biol 19(5):455–477. https://doi.org/10.1089/cmb.2012.0021
Koren S, Walenz BP, Berlin K et al (2017) Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation. Genome Res 27:722–736. https://doi.org/10.1101/gr.215087.116
Chin CS, Peluso P, Sedlazeck FJ et al (2016) Phased diploid genome assembly with single-molecule real-time sequencing. Nat Methods 13(12):1050–1054. https://doi.org/10.1038/nmeth.4035
Li H (2016) Minimap and miniasm: fast mapping and de novo assembly for noisy long sequences. Bioinformatics 32(14):2103–2110. https://doi.org/10.1093/bioinformatics/btw152
Vaser R, Sović I, Nagarajan N, Šikić M (2017) Fast and accurate de novo genome assembly from long uncorrected reads. Genome Res 27:737–746. https://doi.org/10.1101/gr.214270.116
Kolmogorov M, Yuan J, Lin Y et al (2019) Assembly of long, error-prone reads using repeat graphs. Nat Biotechnol 37:540–546. https://doi.org/10.1038/s41587-019-0072-8
Bansal V, Bafna V (2008) HapCUT: an efficient and accurate algorithm for the haplotype assembly problem. Bioinformatics 24(16):i153–i159. https://doi.org/10.1093/bioinformatics/btn298
Geraci F (2010) A comparison of several algorithms for the single individual SNP haplotyping reconstruction problem. Bioinformatics 26(18):2217–2225. https://doi.org/10.1093/bioinformatics/btq411
Edge P, Bafna V, Bansal V (2017) HapCUT2: robust and accurate haplotype assembly for diverse sequencing technologies. Genome Res 27:801–812. https://doi.org/10.1101/gr.213462.116
Nurk S, Koren S, Rhie A, et al (2021) The complete sequence of a human genome. https://www.biorxiv.org. https://doi.org/10.1101/2021.05.26.445798
Hon T, Mars K, Young G et al (2020) Highly accurate long-read HiFi sequencing data for five complex genomes. Sci Data 7:399. https://doi.org/10.1038/s41597-020-00743-4
Myers EW (2005) The fragment assembly string graph. Bioinformatics 21:ii79–ii85. https://doi.org/10.1093/bioinformatics/bti1114
Chaisson MJ, Tesler G (2012) Mapping single molecule sequencing reads using basic local alignment with successive refinement (BLASR): application and theory. BMC Bioinform 13:238. https://doi.org/10.1186/1471-2105-13-238
Nurk S, Walenz BP, Rhie A et al (2020) HiCanu: accurate assembly of segmental duplications, satellites, and allelic variants from high-fidelity long reads. Genome Res 30(9):1291–1305. https://doi.org/10.1101/gr.263566.120
Guan D, McCarthy SA, Wood J et al (2020) Identifying and removing haplotypic duplication in primary genome assemblies. Bioinformatics 36:2896–2898. https://doi.org/10.1093/bioinformatics/btaa025
Cheng H, Concepcion GT, Feng X et al (2021) Haplotype-resolved de novo assembly using phased assembly graphs with hifiasm. Nat Methods 18:170–175. https://doi.org/10.1038/s41592-020-01056-5
Myers G (1999) A fast bit-vector algorithm for approximate string matching based on dynamic programming. J ACM 46:395–415. https://doi.org/10.1145/316542.316550
Koren S, Rhie A, Walenz B et al (2018) De novo assembly of haplotype-resolved genomes with trio binning. Nat Biotechnol 36:1174–1182. https://doi.org/10.1038/nbt.4277
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2023 The Author(s), under exclusive license to Springer Science+Business Media, LLC, part of Springer Nature
About this protocol
Cite this protocol
Duitama, J. (2023). Phased Genome Assemblies. In: Peters, B.A., Drmanac, R. (eds) Haplotyping. Methods in Molecular Biology, vol 2590. Humana, New York, NY. https://doi.org/10.1007/978-1-0716-2819-5_16
Download citation
DOI: https://doi.org/10.1007/978-1-0716-2819-5_16
Published:
Publisher Name: Humana, New York, NY
Print ISBN: 978-1-0716-2818-8
Online ISBN: 978-1-0716-2819-5
eBook Packages: Springer Protocols