Memory-Bound and Taxonomy-Aware K-Mer Selection for Ultra-Large Reference Libraries

Şapcı, Ali Osman Berk; Mirarab, Siavash

doi:10.1007/978-1-0716-3989-4_26


Part of the book series: Lecture Notes in Computer Science (LNCS, volume 14758)

Included in the following conference series:

International Conference on Research in Computational Molecular Biology


Abstract

Classifying sequencing reads based on \(k\)-mer matches to a reference library is widely used in applications such as taxonomic profiling. As the number of publicly available genomes keeps growing, it is becoming infeasible to keep all, or even a majority, of their \(k\)-mers in memory. Thus, there is a growing need for methods that select a subset of \(k\)-mers while accounting for taxonomic relationships. We propose \(k\)-mer RANKer (KRANK), a method that uses a set of heuristics to efficiently and effectively select a size-constrained subset of \(k\)-mers from a diverse and imbalanced taxonomy that suffers from biased sampling. Empirical evaluations demonstrate that a fraction of all \(k\)-mers in large reference libraries can achieve accuracy comparable to that of the full set.
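
To make the setting concrete, the following is a minimal Python sketch of \(k\)-mer-based read classification against a size-constrained library. It is not KRANK's ranking procedure, whose heuristics are not described in this abstract. The function names (`kmers`, `build_library`, `classify`), the even split of the budget across taxa, and the choice to drop \(k\)-mers shared between taxa are all assumptions made only for illustration.

```python
from collections import Counter, defaultdict


def kmers(seq, k):
    """Yield every length-k substring of seq, in order."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def build_library(genomes, k, budget):
    """Map k-mer -> taxon, keeping at most `budget` k-mers in total.

    Toy selection rule (an assumption, not KRANK's heuristics): split the
    budget evenly across taxa and keep only k-mers seen in a single taxon.
    """
    owners = defaultdict(set)
    for taxon, seq in genomes.items():
        for km in kmers(seq, k):
            owners[km].add(taxon)

    per_taxon = budget // len(genomes)
    kept = Counter()
    library = {}
    for km in sorted(owners):  # sorted only to make the result deterministic
        taxa = owners[km]
        if len(taxa) != 1:
            continue  # skip k-mers shared between taxa
        (taxon,) = taxa
        if kept[taxon] < per_taxon:
            library[km] = taxon
            kept[taxon] += 1
    return library


def classify(read, library, k):
    """Return the taxon with the most k-mer matches, or None if no match."""
    votes = Counter(library[km] for km in kmers(read, k) if km in library)
    return votes.most_common(1)[0][0] if votes else None


# Hypothetical toy genomes, used only to exercise the functions above.
genomes = {"A": "ACGTACGTAA", "B": "TTGCATTGCC"}
library = build_library(genomes, k=4, budget=6)
print(len(library), classify("CGTAC", library, 4), classify("GCATT", library, 4))
```

Even in this toy form, the library never exceeds the budget regardless of how many distinct \(k\)-mers the genomes contain, which is the memory-bound constraint the abstract refers to. A taxonomy-aware method would replace the even split with a rule informed by the taxonomic tree.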



Availability

The tool is available at https://github.com/bo1929/KRANK, and the data are available at https://github.com/bo1929/shared.KRANK. The full paper is available at http://doi.org/10.1101/2024.02.12.580015.


Author information

Authors and Affiliations

Bioinformatics and Systems Biology Graduate Program, UC San Diego, San Diego, CA, 92093, USA
Siavash Mirarab
Department of Electrical and Computer Engineering, UC San Diego, San Diego, CA, 92093, USA
Ali Osman Berk Şapcı & Siavash Mirarab


Corresponding author

Correspondence to Siavash Mirarab.

Editor information

Editors and Affiliations

Carnegie Mellon University, Pittsburgh, PA, USA
Jian Ma


Cite this paper

Şapcı, A.O.B., Mirarab, S. (2024). Memory-Bound and Taxonomy-Aware K-Mer Selection for Ultra-Large Reference Libraries. In: Ma, J. (eds) Research in Computational Molecular Biology. RECOMB 2024. Lecture Notes in Computer Science, vol 14758. Springer, Cham. https://doi.org/10.1007/978-1-0716-3989-4_26


DOI: https://doi.org/10.1007/978-1-0716-3989-4_26
Published: 17 May 2024
Publisher Name: Springer, Cham
Print ISBN: 978-1-0716-3988-7
Online ISBN: 978-1-0716-3989-4
eBook Packages: Computer Science, Computer Science (R0)
