Enabling Fast, Accurate, and Efficient Real-Time Genome Analysis via New Algorithms and Techniques
OPEN ACCESS
Author / Producer
Date
2024
Publication Type
Doctoral Thesis
ETH Bibliography
yes
Rights / License
Abstract
The advent of high-throughput sequencing (HTS) technologies has revolutionized genome analysis by enabling the rapid and cost-effective sequencing of large genomes. Despite these advancements, the increasing complexity and volume of genomic data present significant challenges related to accuracy, scalability, and computational efficiency. These challenges are mainly due to various forms of unwanted and unhandled variations in sequencing data, collectively referred to as noise. Addressing these challenges requires a deep understanding of different types of noise in genomic data and the development of techniques to mitigate the impact of noise on genome analysis.
In this dissertation, we aim to understand the types of noise that affect the genome analysis pipeline and address challenges posed by such noise by developing new computational techniques to tolerate or reduce noise for faster, more accurate, and scalable analysis of different types of sequencing data (e.g., raw electrical signals from nanopore sequencing).
First, we introduce BLEND, a noise-tolerant hashing mechanism that quickly identifies both exactly matching and highly similar sequences with arbitrary differences using a single lookup of their hash values. Second, to enable scalable and accurate analysis of noisy raw nanopore signals, we propose RawHash, a novel mechanism that effectively reduces noise in raw nanopore signals and enables accurate, real-time analysis through the first hash-based similarity search technique for raw nanopore signals. Third, we extend the capabilities of RawHash with RawHash2, an improved mechanism that 1) provides a better understanding of noise in raw nanopore signals to reduce it more effectively and 2) improves the robustness of mapping decisions. Fourth, we explore the broader implications and new applications of raw nanopore signal analysis by introducing Rawsamble, the first mechanism for all-vs-all overlapping of raw signals using hash-based search. Rawsamble enables the construction of de novo assemblies directly from raw signals without basecalling, which opens up new directions and uses for raw nanopore signal analysis.
This dissertation builds a comprehensive understanding of how noise in different types of genomic data affects the genome analysis pipeline and provides novel solutions to mitigate the impact of noise. Our findings demonstrate that by effectively tolerating and reducing noise using new computational techniques, we can 1) significantly improve the performance, accuracy, and scalability of genome analysis and 2) expand the scope of raw signal analysis by enabling new applications and directions. We hope and believe that the methods and insights presented in this dissertation will contribute to the invention and development of more robust, efficient, and capable genomic analysis tools, especially in the field of raw signal analysis.
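As an illustration of the noise-tolerant hashing idea named in the abstract (similar sequences sharing, or nearly sharing, a hash value), the following is a minimal sketch that assumes a SimHash-style construction over k-mers. The abstract does not specify how BLEND builds its hash values, so the function names (`kmers`, `item_hash`, `simhash`, `hamming`), the choice of BLAKE2b, and the parameters `k` and `bits` are all hypothetical choices made for this example, not the thesis's implementation.

```python
import hashlib


def kmers(seq, k):
    """Yield every overlapping substring of length k."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def item_hash(item, bits):
    """Hash a string to an unsigned integer of `bits` bits (bits <= 64)."""
    digest = hashlib.blake2b(item.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big") & ((1 << bits) - 1)


def simhash(seq, k=4, bits=16):
    """SimHash-style fingerprint (illustrative assumption, not BLEND itself).

    Each k-mer votes +1 or -1 on every bit position; the fingerprint keeps
    the majority vote per bit, so a few changed k-mers flip few or no bits.
    """
    counts = [0] * bits
    for kmer in kmers(seq, k):
        h = item_hash(kmer, bits)
        for b in range(bits):
            counts[b] += 1 if (h >> b) & 1 else -1
    value = 0
    for b in range(bits):
        if counts[b] > 0:
            value |= 1 << b
    return value


def hamming(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


if __name__ == "__main__":
    original = "ACGTACGTTAGCCGATAGGCTTAACGGATC"
    # Same sequence with a single substitution at position 10.
    noisy = original[:10] + "T" + original[11:]
    print(hamming(simhash(original), simhash(noisy)))
```

In a scheme of this kind, two sequences that differ in a few positions share most of their k-mers, so their fingerprints tend to differ in few bits and may be identical, in which case a single exact-match lookup would find both. Whether that happens for a given pair depends on `k`, `bits`, and the sequences themselves.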
Permanent link
Publication status
published
Contributors
Examiner: Mutlu, Onur
Examiner: Das, Reetuparna
Examiner: Gamaarachchi, Hasindu
Examiner: Langmead, Benjamin
Examiner: Li, Heng
Publisher
ETH Zurich
Subject
Bioinformatics; Data Science; Genomics; Computer Science; High throughput sequencing; Sequence analysis; Nanopore sequencing
Organisational unit
09483 - Mutlu, Onur / Mutlu, Onur