HIGH THROUGHPUT SEQUENCING
Sequence mapping algorithms (1 with many)
1. Read mapping
• Read alignment (mapping) is the next step after sequencing: you must determine where each sequenced read is located
with respect to a reference genome.
• Mapping hundreds of millions of reads back to the reference genome is CPU- and RAM-intensive and slow.
• Most mappers allow approximately 2 mismatches within the first 30 bp (4^28 possible sequences could still uniquely identify
most 30 bp sequences in a 3 Gb genome); mapping is slower when INDELs are allowed.
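The arithmetic behind this claim can be checked with a quick calculation. This is only a sketch: it assumes that a 30 bp seed with 2 mismatches leaves 28 matching bases, and it uses a round figure of 3 × 10^9 positions for the genome. Real genomes are repetitive, which is why the notes say "most" rather than "all" sequences.

```python
# Number of distinct sequences of length 28 over the alphabet {A, C, G, T}.
distinct_28mers = 4 ** 28

# Assumed round figure for the genome size in bases (3 Gb).
genome_positions = 3_000_000_000

# How many possible 28-mers exist per genome position.
ratio = distinct_28mers / genome_positions

print(f"4^28 = {distinct_28mers:.3e}")
print(f"possible 28-mers per genome position = {ratio:.3e}")
```

The number of possible 28-mers (about 7.2 × 10^16) exceeds the number of genome positions by a factor of roughly 2 × 10^7, so a random 28-mer is very unlikely to occur at more than one place by chance.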
2. Seed
• Break the database sequences (FASTA) into k-mer words (seeds) and hash their locations to speed up later searches.
• K-mer: a substring of length k. For example, a 5-mer index stores the k-mers of length 5 bp.
• You can index query sequences against a template or reference genome by recording where these k-mers
appear in the reference genome.
Note: index 0 corresponds to position 1 of the reference genome.
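A minimal sketch of such a k-mer hash index, using a Python dictionary as the hash table. The function names `build_kmer_index` and `lookup_seeds` and the toy reference string are made up for illustration; positions are 0-based, as in the note above.

```python
from collections import defaultdict


def build_kmer_index(reference, k):
    """Map every k-mer of the reference to the list of its 0-based start positions."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index


def lookup_seeds(index, query, k):
    """Return (query_offset, reference_position) pairs for every shared k-mer."""
    hits = []
    for q in range(len(query) - k + 1):
        for r in index.get(query[q:q + k], []):
            hits.append((q, r))
    return hits


reference = "ACGTACGTTAGC"          # toy reference, not real data
index5 = build_kmer_index(reference, 5)
index3 = build_kmer_index(reference, 3)

print(index5["ACGTA"])                      # [0]
print(index3["ACG"])                        # [0, 4]: a repeated 3-mer
print(lookup_seeds(index5, "GTACG", 5))     # [(0, 2)]
```

Shorter k-mers occur at more positions (here "ACG" appears twice), so the choice of k trades sensitivity against the number of candidate locations to check.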
3. Blast algorithm
• Seed-and-extend paradigm.
• For each k-mer in a query, find the possible database k-mers that match well with it.
• Only words with a score ≥ the cutoff T are kept.
• Steps:
1. For each DB sequence with a high-scoring word, try to extend it at both ends:
§ HSP (High-scoring Segment Pair): a region of the long sequence where there is a match.
2. Keep only statistically significant HSPs (E-value):
§ Based on the scores obtained by aligning 2 random sequences: the E-value tells you how many matches scoring
this well you would expect to find by chance in this database for the query and reference sequences.
3. Use the Smith-Waterman (local alignment) algorithm to join the HSPs and obtain the optimal alignment.
Sequence A is the database sequence (reference) and sequence B is the query sequence. The thick black
lines are the HSPs; using the Smith-Waterman algorithm you can join these significant HSPs
and generate the final alignment.
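The seed-and-extend idea can be sketched as follows. This is a simplified illustration, not BLAST itself: it assumes exact k-mer seeds instead of neighbourhood words scored against T, a fixed score cutoff (`min_score`) instead of an E-value, and ungapped extension only, with no Smith-Waterman joining step. The scoring values and the `x_drop` stopping rule are illustrative choices.

```python
def extend_seed(query, ref, q_pos, r_pos, k, match=1, mismatch=-1, x_drop=3):
    """Extend an exact k-mer seed without gaps in both directions.

    Extension in a direction stops once the running score falls more than
    x_drop below the best score seen. Returns (score, q_start, r_start, length).
    """
    best = score = k * match
    right = 0                      # bases kept to the right of the seed
    i = 0
    while q_pos + k + i < len(query) and r_pos + k + i < len(ref):
        score += match if query[q_pos + k + i] == ref[r_pos + k + i] else mismatch
        i += 1
        if score > best:
            best, right = score, i
        elif best - score > x_drop:
            break

    score = best                   # restart from the best right-hand extension
    left = 0                       # bases kept to the left of the seed
    i = 0
    while q_pos - i - 1 >= 0 and r_pos - i - 1 >= 0:
        score += match if query[q_pos - i - 1] == ref[r_pos - i - 1] else mismatch
        i += 1
        if score > best:
            best, left = score, i
        elif best - score > x_drop:
            break

    return best, q_pos - left, r_pos - left, left + k + right


def find_hsps(query, ref, k=4, min_score=6):
    """Seed with exact k-mers, extend each hit, keep segments scoring >= min_score."""
    index = {}
    for r in range(len(ref) - k + 1):
        index.setdefault(ref[r:r + k], []).append(r)

    hsps = set()                   # different seeds can extend to the same HSP
    for q in range(len(query) - k + 1):
        for r in index.get(query[q:q + k], []):
            hsp = extend_seed(query, ref, q, r, k)
            if hsp[0] >= min_score:
                hsps.add(hsp)
    return sorted(hsps, reverse=True)


hsps = find_hsps("ACGTTCGGA", "TTTACGTACGGATTT")
print(hsps)   # [(7, 0, 3, 9)]
```

In the toy example two different seeds (ACGT and CGGA) extend across a single mismatch into the same 9 bp segment, which is why the result is stored in a set.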
4. Suffix tree
• A tree of all the suffixes of the reference sequence:
§ “Banana” has the suffixes: BANANA, ANANA, NANA, ANA, NA, A.
• Used in alignment tools such as MUMmer.
• Order(n) time to build:
§ n=genome length.
• Order(m) time to search:
§ m=query length.
• The genome index is big: for the human genome the tree would be huge (around 50 GB).
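To make the structure concrete, here is a minimal sketch of an uncompressed suffix trie built from nested dictionaries. It is a simplification: this naive construction takes quadratic time and space, whereas the Order(n) build time quoted above requires a compressed suffix tree and a linear-time construction such as Ukkonen's algorithm. The search, however, does walk one character at a time, so it takes Order(m) time. The function names are made up for illustration.

```python
def build_suffix_trie(text):
    """Insert every suffix of text into a trie of nested dicts (naive, O(n^2))."""
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
    return root


def contains(trie, pattern):
    """Follow one edge per pattern character: O(m) for a pattern of length m."""
    node = trie
    for ch in pattern:
        if ch not in node:
            return False
        node = node[ch]
    return True


text = "BANANA"
suffixes = [text[i:] for i in range(len(text))]
trie = build_suffix_trie(text)

print(suffixes)                 # ['BANANA', 'ANANA', 'NANA', 'ANA', 'NA', 'A']
print(contains(trie, "NAN"))    # True
print(contains(trie, "NAB"))    # False
```

Every substring of the text is a prefix of some suffix, which is why a walk from the root answers any substring query.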
5. Suffix array
• The ith entry corresponds to the ith smallest suffix. In this example the text is BANANA, and the whole string
BANANA is the suffix that starts at position 0.
• Used in alignment tools such as STAR.
• Order(n) time to build:
§ n=genome length.
• Order(m log n) time to search:
§ Binary search.
§ m=query length.
• Index size is moderate, around 15GB.
• It is used for RNA sequencing (RNA-seq) data.
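A minimal sketch of a suffix array and its binary search, using BANANA as the text. The construction below simply sorts the suffixes, which is easy to read but slower than the Order(n) builds mentioned above. The search does two binary searches, each comparing up to m characters per step, which gives the Order(m log n) search time. This only illustrates the data structure and is not how STAR is implemented; the function names are made up.

```python
def build_suffix_array(text):
    """Return suffix start positions sorted by the suffix they begin (simple sort)."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, sa, pattern):
    """Binary-search the block of suffixes that start with pattern."""
    m = len(pattern)

    # Lower bound: first suffix whose first m characters are >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo

    # Upper bound: first suffix whose first m characters are > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid

    return sorted(sa[first:lo])


text = "BANANA"
sa = build_suffix_array(text)

print(sa)                                   # [5, 3, 1, 0, 4, 2]
print(find_occurrences(text, sa, "ANA"))    # [1, 3]
print(find_occurrences(text, sa, "NA"))     # [2, 4]
```

The array stores only one integer per text position, which is why its index is smaller than a suffix tree built over the same genome.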