Abstract
Classifying sequencing reads based on \(k\)-mer matches to a reference library is widely used in applications such as taxonomic profiling. Given the ever-growing number of publicly available genomes, it is increasingly infeasible to keep all or even a majority of their \(k\)-mers in memory. Thus, there is a growing need for methods that select a subset of \(k\)-mers while accounting for taxonomic relationships. We propose \(k\)-mer RANKer (KRANK), a method that uses a set of heuristics to efficiently and effectively select a size-constrained subset of \(k\)-mers from a diverse and imbalanced taxonomy that suffers from biased sampling. Empirical evaluations demonstrate that a fraction of all \(k\)-mers in large reference libraries can achieve accuracy comparable to that of the full set.
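To make the setting concrete, the following is a minimal sketch of \(k\)-mer-based read classification against a size-constrained library. It is not KRANK's method: the ranking used here (prefer \(k\)-mers found in fewer taxa) is a placeholder assumption, and the function names `build_library`, `select_subset`, and `classify`, as well as the toy sequences, are hypothetical.

```python
from collections import defaultdict


def kmers(seq, k):
    """Yield every length-k substring of seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def build_library(references, k):
    """Map each k-mer to the set of taxa whose reference sequence contains it."""
    library = defaultdict(set)
    for taxon, seq in references.items():
        for kmer in kmers(seq, k):
            library[kmer].add(taxon)
    return library


def select_subset(library, budget):
    """Keep at most `budget` k-mers.

    Placeholder ranking (an assumption, not KRANK's heuristics):
    k-mers seen in fewer taxa come first, ties broken alphabetically.
    """
    ranked = sorted(library, key=lambda km: (len(library[km]), km))
    return {km: library[km] for km in ranked[:budget]}


def classify(read, library, k):
    """Return the taxon with the most k-mer matches, or None if nothing matches."""
    votes = defaultdict(int)
    for kmer in kmers(read, k):
        for taxon in library.get(kmer, ()):
            votes[taxon] += 1
    if not votes:
        return None
    # Sort first so that ties are resolved deterministically.
    return max(sorted(votes), key=votes.get)


# Toy reference sequences (hypothetical data).
references = {"A": "ACGTACGT", "B": "TTTTGGGG"}
full = build_library(references, k=4)
subset = select_subset(full, budget=6)
```

With these toy inputs the full library holds nine distinct 4-mers and the subset keeps six of them, so a read can still be assigned to a taxon as long as some of its \(k\)-mers survive the selection.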
Availability
The tool is available under https://github.com/bo1929/KRANK and data are available under https://github.com/bo1929/shared.KRANK. The full paper is available at http://doi.org/10.1101/2024.02.12.580015.
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Şapcı, A.O.B., Mirarab, S. (2024). Memory-Bound and Taxonomy-Aware K-Mer Selection for Ultra-Large Reference Libraries. In: Ma, J. (eds) Research in Computational Molecular Biology. RECOMB 2024. Lecture Notes in Computer Science, vol 14758. Springer, Cham. https://doi.org/10.1007/978-1-0716-3989-4_26
DOI: https://doi.org/10.1007/978-1-0716-3989-4_26
Publisher Name: Springer, Cham
Print ISBN: 978-1-0716-3988-7
Online ISBN: 978-1-0716-3989-4
eBook Packages: Computer Science (R0)