
Enabling Fast, Accurate, and Efficient Real-Time Genome Analysis via New Algorithms and Techniques


Author / Producer

Date

2024

Publication Type

Doctoral Thesis

ETH Bibliography

yes

Abstract

The advent of high-throughput sequencing (HTS) technologies has revolutionized genome analysis by enabling the rapid and cost-effective sequencing of large genomes. Despite these advancements, the increasing complexity and volume of genomic data present significant challenges related to accuracy, scalability, and computational efficiency. These challenges are mainly due to various forms of unwanted and unhandled variations in sequencing data, collectively referred to as noise. Addressing these challenges requires a deep understanding of different types of noise in genomic data and the development of techniques to mitigate the impact of noise on genome analysis. In this dissertation, we aim to understand the types of noise that affect the genome analysis pipeline and address the challenges posed by such noise by developing new computational techniques that tolerate or reduce noise for faster, more accurate, and more scalable analysis of different types of sequencing data (e.g., raw electrical signals from nanopore sequencing). First, we introduce BLEND, a noise-tolerant hashing mechanism that quickly identifies both exactly matching and highly similar sequences with arbitrary differences using a single lookup of their hash values. Second, to enable scalable and accurate analysis of noisy raw nanopore signals, we propose RawHash, a novel mechanism that effectively reduces noise in raw nanopore signals and enables accurate, real-time analysis through the first hash-based similarity search technique for raw nanopore signals. Third, we extend the capabilities of RawHash with RawHash2, an improved mechanism that 1) provides a better understanding of noise in raw nanopore signals to reduce it more effectively and 2) improves the robustness of mapping decisions. Fourth, we explore the broader implications and new applications of raw nanopore signal analysis by introducing Rawsamble, the first mechanism for all-vs-all overlapping of raw signals using hash-based search.
Rawsamble enables the construction of de novo assemblies directly from raw signals without basecalling, which opens up new directions and uses for raw nanopore signal analysis. This dissertation builds a comprehensive understanding of how noise in different types of genomic data affects the genome analysis pipeline and provides novel solutions to mitigate the impact of noise. Our findings demonstrate that by effectively tolerating and reducing noise using new computational techniques, we can 1) significantly improve the performance, accuracy, and scalability of genome analysis and 2) expand the scope of raw signal analysis by enabling new applications and directions. We hope and believe that the methods and insights presented in this dissertation will contribute to the invention and development of more robust, efficient, and capable genomic analysis tools, especially in the field of raw signal analysis.
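
The idea of a noise-tolerant hash that lets similar sequences be found with one hash-table lookup can be illustrated with a small sketch. The abstract does not specify how BLEND computes its hash values, so the SimHash-style bit voting below, the function names (`kmer_hash`, `simhash`, `hamming`), and the parameters (`k`, `bits`) are all assumptions chosen for illustration, not the dissertation's implementation.

```python
import hashlib
from collections import defaultdict


def kmer_hash(kmer: str, bits: int) -> int:
    """Deterministic hash of one k-mer, truncated to `bits` bits (bits <= 64)."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big") & ((1 << bits) - 1)


def simhash(seq: str, k: int = 4, bits: int = 16) -> int:
    """SimHash-style fingerprint (illustrative assumption, not BLEND itself).

    Every k-mer votes +1 or -1 on each bit position; the sign of the total
    decides the output bit, so a few changed k-mers flip few or no bits.
    """
    votes = [0] * bits
    for i in range(len(seq) - k + 1):
        h = kmer_hash(seq[i:i + k], bits)
        for b in range(bits):
            votes[b] += 1 if (h >> b) & 1 else -1
    fingerprint = 0
    for b in range(bits):
        if votes[b] > 0:
            fingerprint |= 1 << b
    return fingerprint


def hamming(x: int, y: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(x ^ y).count("1")


# Hypothetical example sequences: `b` is `a` with a single substitution.
a = "ACGTACGTTAGCCGATAGCTAGGCTA"
b = a[:10] + "T" + a[11:]

# A hash table keyed by fingerprint: sequences whose fingerprints collide
# are retrieved together with a single lookup.
index = defaultdict(list)
for name, seq in (("a", a), ("b", b)):
    index[simhash(seq)].append(name)

print("bits differing between a and b:", hamming(simhash(a), simhash(b)))
print("sequences sharing a's bucket:", index[simhash(a)])
```

Whether the two sequences land in the same bucket depends on the sequences and on `k` and `bits`; the sketch only shows why such a fingerprint tends to change little under small edits, which is the property a single-lookup similarity search relies on.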

Publication status

published

Contributors

Examiner: Mutlu, Onur
Examiner: Das, Reetuparna
Examiner: Gamaarachchi, Hasindu
Examiner: Langmead, Benjamin
Examiner: Li, Heng

Publisher

ETH Zurich

Subject

Bioinformatics; Data Science; Genomics; Computer Science; High throughput sequencing; Sequence analysis; Nanopore sequencing

Organisational unit

09483 - Mutlu, Onur / Mutlu, Onur
