Lecture 2 - Sequencing

Evolution of Sequencing Technology and the Human Genome Sequencing Project

CMMB 461
University of Calgary
Gordon Chua

http://www.sciencedaily.com/
Suggested readings
•Green (2001) Nature Reviews Genetics 2: 573-582 (genome sequencing review)
http://www.ncbi.nlm.nih.gov/pubmed/11483982

•Mardis (2008) Trends in Genetics 24: 133-141 (next-generation sequencing review)
http://www.ncbi.nlm.nih.gov/pubmed/18262675

•Kremkow and Lee (2015) Biotechnol. Lett. 37: 55-65 (next-generation sequencing review)
http://www.ncbi.nlm.nih.gov/pubmed/25214225

•Pennisi (2008) Science 322: 838 (personal genomics review)
http://www.ncbi.nlm.nih.gov/pubmed/18988816

•Venter et al. (2001) Science 291: 1304-1351 (human genome sequence)
http://www.ncbi.nlm.nih.gov/pubmed/11181995

Sequencing a genome:
“Obtaining the parts list of the cell/organism”

ACGT
DNA synthesis/replication
Requires DNA polymerase, primer, template DNA and dNTPs

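Template strand: 3’-ATGTTCCGCGATAAGCTTTAA-5’
Primer:          5’-TACAA-3’-OH

DNA polymerase adds dGTP to the 3’-OH of the primer, releasing pyrophosphate (PPi):
3’-ATGTTCCGCGATAAGCTTTAA-5’
5’-TACAAG-3’-OH

With the remaining dNTPs, the strand is extended to the end of the template, releasing one pyrophosphate for each nucleotide added:
3’-ATGTTCCGCGATAAGCTTTAA-5’
5’-TACAAGGCGCTATTCGAAATT-3’
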
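The extension shown on this slide can be sketched in a few lines of Python. This is only an illustration of base pairing, not a model of the enzyme; the function name `extend_primer` and the dictionary `PAIR` are made up for this example.

```python
# Watson-Crick pairing used to pick the incoming dNTP (illustrative helper)
PAIR = {"A": "T", "T": "A", "G": "C", "C": "G"}

def extend_primer(template_3to5, primer_5to3):
    """Extend a primer along a template written 3'->5'; return the new strand 5'->3'."""
    # The primer must already be base-paired with the start of the template
    for t, p in zip(template_3to5, primer_5to3):
        if PAIR[t] != p:
            raise ValueError("primer does not anneal to the template")
    strand = primer_5to3
    # Each step adds one nucleotide to the free 3'-OH (one pyrophosphate released)
    for t in template_3to5[len(primer_5to3):]:
        strand += PAIR[t]
    return strand

template = "ATGTTCCGCGATAAGCTTTAA"  # written 3'->5', as on the slide
primer = "TACAA"                    # written 5'->3'
print(extend_primer(template, primer))  # TACAAGGCGCTATTCGAAATT
```

The printed strand matches the fully extended product shown on the slide.
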
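DNA sequencing by the chain termination method (Sanger Sequencing)
Different-sized fragments are generated during DNA synthesis depending on where a ddNTP is incorporated and terminates the chain

Deoxynucleotide (dNTP): 5’ triphosphate (P-P-P), base, 3’-OH
Dideoxynucleotide (ddNTP): 5’ triphosphate (P-P-P), base, 3’-H
If a ddNTP is incorporated into the growing strand, the strand cannot be extended any further because there is no 3’-OH for the next nucleotide.

Reaction: template DNA, primer, polymerase, dNTPs, ddCTP
Template: 3’-GGCCAATGAACCGTCACAGTTA-5’

DNA fragments (synthesis ends wherever ddCTP is incorporated):
5’-CCGGTTACTTGGCAGTGTC-3’-H
5’-CCGGTTACTTGGC-3’-H
5’-CCGGTTAC-3’-H

Gel electrophoresis: DNA fragments migrate from the negative (-) electrode towards the positive (+) electrode; larger fragments migrate more slowly and stay closer to the (-) end.
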
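As a rough check of the ddCTP example above, the following Python sketch lists every product that ends in a given dideoxynucleotide. The function name and the `primer_len=4` setting are assumptions made for illustration; the slide does not say how long the primer is.

```python
PAIR = {"A": "T", "T": "A", "G": "C", "C": "G"}

def termination_fragments(template_3to5, dd_base, primer_len=0):
    """List every product (5'->3') that ends with the given ddNTP base."""
    new_strand = "".join(PAIR[b] for b in template_3to5)
    fragments = []
    for i, base in enumerate(new_strand, start=1):
        # Termination can only happen past the primer, where new bases are added
        if base == dd_base and i > primer_len:
            fragments.append(new_strand[:i])
    return fragments

template = "GGCCAATGAACCGTCACAGTTA"  # written 3'->5', from the slide
# primer_len=4 is an assumption: the first four bases are treated as primer
for frag in termination_fragments(template, "C", primer_len=4):
    print(len(frag), frag)
```

With these settings the sketch returns the three fragments on the slide (8, 13 and 19 nucleotides long), which is the size order in which they would separate on the gel.
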
Fluorescent-labelled ddNTPs and Capillary Electrophoresis
•Fluorescent ddNTPs determine which ddNTP has been incorporated in the sequencing reaction: each ddNTP carries a different colour, so the colour shows which nucleotide is present.

•Sequencing reactions are run on capillary gel electrophoresis, in capillaries rather than in a slab gel (better heat dissipation and resolution, less sample required, more parallel reactions run at a time).

Figure labels: Applied Biosystems; silica capillaries (Celera Genomics); slab sequencing gel (GibcoBRL S2 apparatus)
www.sanger.ac.uk
Swerdlow and Gesteland (1990) Nucleic Acids Res. 18: 1415-1419
http://www.abrf.org/JBT/1999/September99/sep99mardis.html

Automated base calling
Applied Biosystems

•Scanner records coloured images of different-sized termination fragments for each fluorescent-labelled ddNTP (four trace profiles)

•Computer processes fluorescent signals to generate an electropherogram, assigning a base to each peak (A, C, G, T, N)

•Phred (Phil's Read Editor): accuracy of base calling

•Estimates a probability of error for each base call in the electropherogram

•Error % is based on parameters such as the shape of a peak, the spacing between peaks and the height of a peak
Ewing & Green (1998) Genome Res. 8: 186-194

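The slide only states that Phred estimates an error probability per base call. The quality-score scale conventionally used to report that probability, Q = -10 log10(P), is added here as background; the function names below are illustrative only.

```python
import math

def phred_quality(p_error):
    """Convert an error probability into a Phred-style quality score."""
    return -10 * math.log10(p_error)

def error_probability(q):
    """Convert a quality score back into an error probability."""
    return 10 ** (-q / 10)

# A base call with a 1 in 1000 chance of being wrong has a quality of about 30
print(round(phred_quality(0.001)))
# A quality of 20 corresponds to a 1 in 100 chance of error
print(error_probability(20))
```
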
Automated sequencers
•All steps from sample loading to base calling are automated

•Sequencing reactions are usually performed manually in 96-well microplates in a thermal cycler (denaturing, annealing, extension)

•Process 96 samples at a time

•Found in most sequencing facilities/centres

•Process up to 3000 samples/day

•Read 2 million base pairs per day (it would take 7 people one week to do this manually)

•Obtain up to 800 bp of sequence/reaction

Figure: Applied Biosystems 3730xl DNA Analyzer (Joint Genome Institute)

Human Genome Sequencing Project
•Advances in automated sequencing allowed for genome projects such as the human genome (3.2 Gb)

•Project formally proposed in 1985 with the NIH and the US Department of Energy, with a 15-year, $3 billion plan (public) consisting of international genome sequencing centers

•Private company (Celera Genomics) started a second project in 1998 to complete the genome sequence in three years

•Anonymous donors of diverse ethnic backgrounds

•Draft of the human genome published in 2001

Sequencing Strategy
•Human chromosomes (23 pairs) range in size from tens to hundreds of millions of bases

•Note that sequencing reads are still only up to 800 bp using the ABI 3700 DNA Analyzer; regions longer than 800 bp can be covered by chromosome walking

•Strategy requires breaking the genome into smaller fragments or clones and sequencing these fragments (shotgun approach)

•Fragmentation is by random shearing/sonication

Cloning vectors
Sequencing requires many copies of the same DNA fragment.
Figure: vector carrying YFG (gene/DNA sequence)

Purpose: To store and make many copies of a DNA fragment of interest in E. coli

Common Features:
•Promoter: constitutive/inducible
•Multicloning site: unique restriction sites for inserting gene
•Epitope tag: protein purification/localization
•Origin of replication: determines copy number
•Selectable marker: antibiotic resistance

Types (differ in the size of DNA fragment they carry):
Phagemid: 1 kb insert
Plasmid: up to 10 kb insert
P1 clone: 100 kb insert
Bacterial Artificial Chromosome: up to 300 kb insert

Hierarchical Shotgun Sequencing
Chromosome

Fragment into large fragments by partial restriction digest (restriction enzymes) or shearing (sonication)

Clone fragments into BACs (300 kb), PACs (100 kb) and cosmids (50 kb) and transform into E. coli (DNA library). Each E. coli colony contains many copies of the same BAC.

Map the correct order of the cloned fragments to select BACs for sequencing (so that the whole genome is represented)

Mapping the correct order of BAC clones
•A genomic library is composed of many vectors (e.g. BACs) containing DNA fragments covering the entire genome several times.

•Therefore, a region of the genome can be represented many times in the genomic library (i.e. the same region is found in many BACs).

•The goal is to sequence the minimum number of nucleotides needed to cover the entire genome, to cut costs (i.e. we don’t want to sequence multiple BACs containing the same region of the genome).

•Two ways to detect BACs with overlapping genome sequences:
1. BAC library screening by hybridization
2. Restriction fingerprinting BAC clones

Alignment of BAC clones
1. BAC library screening by hybridization
•Rapid identification of overlapping clones using a random sequence/probe (single-stranded DNA)

•BAC colonies are robotically transferred to a nitrocellulose/nylon membrane. The DNA is first denatured with alkali so that the probe can bind, and the membrane is then screened with a radioactive probe.

•The probe will only hybridize to BAC colonies with overlapping fragments, i.e. clones containing the region of the chromosome that it recognizes.

•The sequence at the end of a clone can be used as a probe in a subsequent screen to look for overlapping fragments: “chromosome walking.”

Figure: membranes from screens 1 and 2; each black spot is a colony that hybridized to the probe
http://www.roswellpark.org

Alignment of BAC clones
2. Restriction fingerprinting BAC clones
•Complete restriction digest of BAC clones followed by gel electrophoresis to determine the restriction fragment profile for each BAC clone

•Identify BAC clones with common restriction fragments: this indicates that the clones overlap (usually done computationally)

Figure: restriction fragments A-E (by fragment size) for BAC clones 1-4
Nature (2001) 409: 860-921

-Clones 1 & 4 overlap in fragments B and C, so they cover a common region of the chromosome

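The computational comparison of fingerprints can be sketched as a set intersection. The fingerprints below are hypothetical: the slide only tells us that clones 1 and 4 share fragments B and C, so every other entry is invented for illustration. Real fingerprinting compares measured fragment sizes within a tolerance rather than letter labels.

```python
from itertools import combinations

def overlapping_clones(fingerprints, min_shared=2):
    """Return clone pairs that share at least min_shared restriction fragments."""
    overlaps = {}
    for a, b in combinations(sorted(fingerprints), 2):
        shared = fingerprints[a] & fingerprints[b]
        if len(shared) >= min_shared:
            overlaps[(a, b)] = sorted(shared)
    return overlaps

# Hypothetical fingerprints; only the B/C overlap of clones 1 and 4 is from the slide
fingerprints = {
    1: {"A", "B", "C"},
    2: {"D", "E"},
    3: {"A", "E"},
    4: {"B", "C", "D"},
}
print(overlapping_clones(fingerprints))  # {(1, 4): ['B', 'C']}
```

Requiring at least two shared fragments (`min_shared=2`) is a design choice of this sketch: a single shared fragment could be a coincidence.
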
Hierarchical Shotgun Sequencing

Shear BACs by sonication into smaller fragments (unique fragments)

Clone fragments into phagemids (1 kb) or plasmids (2-10 kb) and transform into E. coli (“shotgun library”)

Sequence library clones (10-fold coverage) and assemble genome

Whole-genome shotgun sequencing
DNA extraction

DNA fragmentation (sonication): the entire genome is broken up at once and all of the fragments are sequenced

Clone into vectors, transform bacteria, purify vectors

Sequence library clones and assemble genome

What are the advantages and disadvantages of WGSS compared to HSS?
Advantage: faster than HSS
Disadvantage: harder to assemble the genome

Hierarchical and whole-genome shotgun sequencing bakeoff

•HSS: easier to assemble the genome sequence, but a physical map has to be built (labor intensive)

•WGSS: bypasses the physical map, but assembly of the genome is more difficult, especially for more complex genomes

Genome coverage
How many times should the genome be sequenced (coverage) to ensure a high degree of accuracy?
- The human genome is about 3.2 billion bp long. Because reads are sampled at random and overlap one another, sequencing one genome's worth of bases (1X) does not cover every base, so the genome cannot be sequenced just once.

Assumption: sequencing reads will be randomly distributed in the genome (i.e. the ability to sequence a particular region of the genome does not differ)

C (coverage) = LN/G, where
L = sequence read length in bp (how much sequence is obtained per sequencing reaction, 500-800 bp)
N = number of reads sequenced (number of clones)
G = haploid genome length in bp

Given a genome size of 5 Mb, 1X coverage = 5 Mb sequenced; 2X coverage = 10 Mb sequenced, etc.

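The coverage formula C = LN/G can be checked with a short Python sketch. The function name and the read counts are illustrative assumptions; only the 5 Mb genome comes from the slide.

```python
def coverage(read_length_bp, n_reads, genome_length_bp):
    """Coverage C = L * N / G."""
    return read_length_bp * n_reads / genome_length_bp

# 5 Mb genome from the slide; 500 bp reads and the read counts are assumed
print(coverage(500, 10_000, 5_000_000))  # 1.0 (5 Mb sequenced)
print(coverage(500, 20_000, 5_000_000))  # 2.0 (10 Mb sequenced)
```
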
Sequencing Reads
•Sequencing reads by the ABI 3700 DNA Analyzer are up to 800 bp

•Insert is usually sequenced from both ends (paired reads/mate pairs) with a pair of universal primers, giving 2 reads from a single clone

•Greater length of sequencing reads is better for aligning sequences and gives better coverage of the genome

How many clones would have to be screened for 1X coverage of a 4 Mb genome with paired reads of 500 bp each?
C = LN/G, so N = CG/L = (1 × 4 × 10^6)/(500 × 2) = 4000 clones
(L = 500 × 2 because each clone is read from both ends)

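The worked example above rearranges C = LN/G to N = CG/L. A minimal Python sketch of the same arithmetic follows; `clones_needed` and its parameter names are made up for illustration.

```python
import math

def clones_needed(target_coverage, genome_length_bp, read_length_bp, reads_per_clone=2):
    """Number of clones N = C * G / L, where L counts all reads from one clone."""
    bases_per_clone = read_length_bp * reads_per_clone
    # Round up, because a fraction of a clone cannot be sequenced
    return math.ceil(target_coverage * genome_length_bp / bases_per_clone)

# 1X coverage of a 4 Mb genome with paired 500 bp reads
print(clones_needed(1, 4_000_000, 500))  # 4000
```

Setting `reads_per_clone=1` gives the answer for single-end reads (8000 clones for the same genome).
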
Poisson distribution
Assuming a random library, the sequence coverage of a genome roughly follows a Poisson distribution

$P(y) = \frac{\lambda^{y} e^{-\lambda}}{y!}$ where:
P(y) = probability of observing y events
y = number of events in a given interval
λ = mean number of events in a given interval

For genome sequencing, the same formula applies with:
y = number of times a nucleotide is sequenced
λ = genome coverage

Figure: Poisson distribution, probability vs. number of events in a given interval (y)
http://en.wikipedia.org/wiki/Poisson_distribution

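The Poisson formula can be evaluated directly. The sketch below uses 5X coverage as an assumed example value and prints the probability that a nucleotide is sequenced exactly y times.

```python
import math

def poisson_pmf(y, lam):
    """P(y) = lam**y * exp(-lam) / y!"""
    return lam ** y * math.exp(-lam) / math.factorial(y)

coverage = 5  # assumed example value for the mean depth (lambda)
for y in range(0, 11):
    # Probability that a given nucleotide is sequenced exactly y times
    print(y, round(poisson_pmf(y, coverage), 4))
```

The probabilities over all values of y sum to 1, and the most likely depths sit around the mean coverage.
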
Genome sequencing and the Poisson distribution
For genome sequencing, we only want to determine the probability that any base is NOT sequenced (i.e. y = 0)

$P(0) = e^{-\lambda}$ where:
λ = genome coverage

Coverage   % Not Sequenced   % Sequenced
1          37                63
2          13.5              86.5
3          5                 95
4          1.8               98.2
5          0.67              99.33
6          0.25              99.75
7          0.09              99.91
8          0.03              99.97
9          0.01              99.99
10         0.0045            99.9955

Genomes are usually sequenced with 5X-8X coverage

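The table follows from P(0) = e^(-coverage). The relationship can also be inverted to ask how much coverage is needed for a target fraction of the genome: coverage = -ln(1 - fraction). The sketch below shows both directions; the function names are illustrative.

```python
import math

def fraction_sequenced(coverage):
    """Fraction of bases sequenced at least once: 1 - P(0) = 1 - exp(-coverage)."""
    return 1 - math.exp(-coverage)

def coverage_for(target_fraction):
    """Coverage needed so that target_fraction of bases is sequenced at least once."""
    return -math.log(1 - target_fraction)

for c in range(1, 11):
    print(c, round(100 * fraction_sequenced(c), 4))

# Coverage needed to sequence 99% of the bases at least once
print(round(coverage_for(0.99), 2))
```

Under this model, reaching 99% of the genome takes between 4X and 5X coverage, which is consistent with the table.
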
Genome assembly
DNA fragments with overlapping sequences must be adjacent to one another

Figure: clone inserts with primers for sequencing reads at each end; sequencing reads (400-700 bp) assemble into contigs; contigs are linked into a scaffold/supercontig

Contigs: a contiguous sequence made of overlapping reads with no gaps

Scaffold: an ordered set of contigs, usually derived from mate pairs

Finding overlapping sequences
•Find the best match between the suffix of one sequence read and the prefix of another
Figure: Read A and Read B overlap well (good); Read C and Read D overlap poorly (bad, misassembly?)

•Because sequencing errors occur, filter out pairs of fragments that do not share a long string (>24-mer)

•Find an overlapping k-mer and extend to a full alignment; if the alignment is not >94% identical, then discard it
CTG TAATCTTGTTATTTCCGG TAC
CT  TAATCTTGTTATTTCCGG GAT

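A simplified version of the suffix-prefix test can be sketched as follows. It is not the k-mer seed-and-extend procedure an assembler would use: it tries every overlap length from longest to shortest and accepts the first one that meets the identity threshold. The reads, function name and the shortened `min_overlap` are assumptions for illustration.

```python
def best_overlap(read_a, read_b, min_overlap=25, min_identity=0.94):
    """Longest suffix of read_a matching a prefix of read_b at >= min_identity.

    Returns (overlap_length, identity), or None if no overlap qualifies.
    Gaps are not considered; only mismatches are tolerated.
    """
    longest = min(len(read_a), len(read_b))
    for length in range(longest, min_overlap - 1, -1):
        suffix = read_a[-length:]
        prefix = read_b[:length]
        matches = sum(1 for x, y in zip(suffix, prefix) if x == y)
        identity = matches / length
        if identity >= min_identity:
            return length, identity
    return None

# Hypothetical reads built around the 18-mer shown on the slide
read_a = "GGCTAATCTTGTTATTTCCGG"
read_b = "TAATCTTGTTATTTCCGGTACCA"
print(best_overlap(read_a, read_b, min_overlap=10))  # (18, 1.0)
```

The default `min_overlap=25` mirrors the ">24-mer" rule on the slide; it is lowered to 10 in the example only because the demonstration reads are short.
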
Chromosomes and genes
Figure: double-stranded chromosome (top strand 5’ to 3’, bottom strand 3’ to 5’)

•Genes can be coded from either strand

•However, a genome sequence usually shows the bases of only one strand (FASTA file)

•Therefore, the sequence of a gene on the opposite strand is the reverse complement

5’ ATG...TAA TTA...CAT 3’

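The reverse complement is simple to compute. In the sketch below the short example gene is invented; it shows how a gene that reads ATG...TAA on one strand appears as TTA...CAT in the sequence of the other strand.

```python
# Translation table mapping each base to its complement
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Complement each base, then reverse, to read the opposite strand 5'->3'."""
    return seq.translate(COMPLEMENT)[::-1]

gene = "ATGAAATAA"  # invented example: start codon, one codon, stop codon
print(reverse_complement(gene))  # TTATTTCAT
```
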
Mining genomic data (bioinformatics)
Gene-finding approaches

•Ab initio (intrinsic) approach: the genomic DNA sequence alone is systematically searched for protein-coding genes

•Extrinsic (evidence-based) approach: the target genome is compared to other genomes to look for similarity to known mRNA and protein sequences in databases (NCBI, EMBL)

•Usually, both approaches are used since they are highly complementary

How do you find genes that encode proteins?

•Presence of an ORF (start [ATG] and stop codons [TAA, TAG, TGA]) > 300 bp

•Presence of CpG islands (60-70% GC content) associated with the 5’ end of transcribed genes

•Splicing sites (exons and introns)

•Sequence contains known protein domains (BLAST)

•Reading frame conserved over multiple species (genome alignment)

•Estimates of the number of human genes range from 20,000 to 35,000, although the consensus is ~25,000

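The first criterion (an ORF longer than 300 bp) can be sketched as a six-frame scan. This is a minimal ab initio illustration, not a real gene finder: it ignores introns, so it suits prokaryote-like sequences, and the function names and test sequence are invented.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]

def find_orfs(seq, min_len=300):
    """Return (strand, start, end, length) for each ORF longer than min_len bp.

    Coordinates are 0-based, end-exclusive, on the strand that was scanned.
    """
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i  # first start codon opens the ORF
                elif start is not None and codon in STOP_CODONS:
                    length = i + 3 - start  # includes the stop codon
                    if length > min_len:
                        orfs.append((strand, start, i + 3, length))
                    start = None
    return orfs

# Invented sequence: ATG, 100 alanine codons, stop codon, with short flanks
seq = "CC" + "ATG" + "GCT" * 100 + "TAA" + "GG"
print(find_orfs(seq))  # [('+', 2, 308, 306)]
```
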
Molecular functions of 26,383 genes

•>40% of the genes have unknown functions

•The “function” of many known genes is preliminary at best (interactions, localization, regulation are unknown)

•Which genes are associated with human diseases?

Venter et al. (2001) Science 291: 1304-1351

Next-Generation Sequencers (SGS and TGS)

1st generation: Sanger
2nd generation: SGS
3rd generation: TGS

Schuster (2008) Nature Methods 5: 16-18
Science 343: 829-830 (2014)

Main differences between second-generation sequencers and capillary-based sequencers
•Library construction: fragment genomic DNA and amplify by PCR, bypassing vector cloning

•Number of parallel reads: up to 4 billion compared to 96

•Read lengths: generally shorter, 35-200 bp compared to >800 bp for Sanger

•Amount of genomic template: need a few micrograms

PCR: gene amplification for analysis
The number of molecules grows exponentially with the number of cycles (x): 2^x copies per starting molecule

Figure: forward primer and reverse primer; Taq polymerase synthesizes both DNA strands simultaneously (users.ugent.be)

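The doubling rule can be written as a short Python sketch. It assumes ideal amplification, where every molecule is copied in every cycle; real reactions fall short of this and eventually plateau. The function names are illustrative.

```python
import math

def copies_after(cycles, starting_copies=1):
    """Copies after a number of cycles, assuming perfect doubling each cycle."""
    return starting_copies * 2 ** cycles

def cycles_needed(target_copies, starting_copies=1):
    """Smallest number of ideal cycles that reaches target_copies."""
    return math.ceil(math.log2(target_copies / starting_copies))

print(copies_after(20))          # 1048576
print(cycles_needed(1_000_000))  # 20
```

Under this idealization, about 20 cycles are enough to turn one template molecule into a million copies.
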
1. Roche 454 Sequencer (SGS, 2004)
https://www.roche-applied-science.com/sis/sequencing/flx/index.jsp
Step 1:
•Fragment DNA and ligate adaptors to the ends

•Select fragments with two different adaptors (for PCR)

•Nick the nonbiotinylated strand to get an sstDNA library

Step 2:
•Bind the sstDNA library to beads

•Emulsion PCR to amplify the sstDNA (1,000,000 copies/bead)

Step 3:
•Put beads in wells of the picotiter plate

•Add sequencing reaction components including adenosine 5’-phosphosulfate (APS), luciferin and luciferase

Step 4:
•Flood dNTPs one at a time over the picotiter plate

•If the nucleotide is added to the new DNA strand, pyrophosphate is given off, which results in light emission.

•Take an image of the picotiter plate and repeat with the next dNTP
http://www.youtube.com/watch?v=bFNjxKHP8Jc

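The one-dNTP-at-a-time cycle in Step 4 can be mimicked with a toy simulation. It assumes an idealized signal in which the light intensity equals the number of identical nucleotides incorporated in a row. The flow order "TACG", the function names and the example sequence are assumptions for illustration.

```python
def flowgram(new_strand, flow_order="TACG", max_flows=400):
    """Simulate flooding one dNTP at a time; return (base, signal) per flow."""
    signals = []
    pos = 0
    flow = 0
    while pos < len(new_strand) and flow < max_flows:
        base = flow_order[flow % len(flow_order)]
        run = 0
        # Every consecutive matching base is incorporated in the same flow
        while pos < len(new_strand) and new_strand[pos] == base:
            run += 1
            pos += 1
        signals.append((base, run))  # run == 0 means no light in this flow
        flow += 1
    return signals

def decode(signals):
    """Rebuild the sequence from the flowgram."""
    return "".join(base * n for base, n in signals)

flows = flowgram("TTAGC")
print(flows)
print(decode(flows))  # TTAGC
```

A flow with a signal of 2 (the "TT" here) shows why runs of the same base are harder to call: the base caller must tell a signal of 2 from 3, not just light from no light.
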
2. Illumina Solexa sequencing (SGS)
Flow cell http://www.illumina.com/
1. Sample preparation and amplification
•Fragment DNA and add linkers (adaptors) at the ends.

•Denature and bind one end of the ssDNA fragments to the surface of the flow cell (glass).

•Free end of fragments hybridizes to adaptors on the flow cell surface (bridging reaction).

•Add PCR components (e.g. dNTPs, Taq polymerase) and carry out PCR in the flow cell.

•DNA fragments are amplified, generating clusters of multiple copies of the same molecule.

2. Sequencing by synthesis
•Initiate sequencing of clusters by adding primers, DNA polymerase and reversible terminator nucleotides (fluorescently labelled dNTPs with a removable 3’ block, rather than the irreversible ddNTPs used in Sanger sequencing).

•Each type of nucleotide (A, T, C, G) is labelled with a different fluorophore.

•Add all four labelled nucleotides at once, allow incorporation in the sequencing reaction and image the flow cell.

•Remove the fluorophore and the 3’ block from each incorporated nucleotide, then add new labelled nucleotides and continue sequencing.

•Repeat n times to create a read length of n nucleotides.
https://www.youtube.com/watch?v=HMyCqWhwB8E

3. Applied Biosystems SOLiD sequencing (SGS)
http://www3.appliedbiosystems.com

•Library preparation similar to Roche 454 (beads and emPCR)

•Universal primer hybridizes to the P1 adapter sequence at the end of the fragments

•A set of 16 fluorescently labelled 8-mers is flooded over the fragments.

•First two bases of each 8-mer are fixed (dibase probes), and the remaining 6 bases are degenerate

•Allow probe to bind template and ligate to primer (sequencing by ligation)

•Following several ligation cycles, the extended strand is removed from the template and the process is repeated with a new primer (offset by one nucleotide)
http://www.youtube.com/watch?v=nlvyF8bFDwM

4. Life Technologies: Ion Torrent (SGS)
•Emulsion PCR is carried out and beads are placed on a chip (similar to 454)

•Sequencing reaction involves flooding of dNTPs one at a time on the chip

•Nucleotide incorporation causes an H+ to be released, thereby changing the pH of the well solution (no fluorescence)

•pH change is measured by an ion-sensitive detector at the bottom of the well, converted into voltage and measured by the chip.
https://www.thermofisher.com/ca/en/home/life-science/sequencing/next-generation-sequencing/ion-torrent-next-generation-sequencing-technology.html

Main differences between second-generation and third-generation sequencers

•No PCR amplification required (sequencing of single DNA molecules)

•Read lengths: much longer (1000’s to 10,000’s bp)

•Error rate and costs are still much higher than for 2nd generation platforms

5. Pacific Biosciences Single Molecule Real Time (SMRT) Sequencing (TGS)
•Sequencing reaction is carried out in extremely small wells (50 nm) called zero-mode waveguides (ZMWs), allowing for high sensitivity to measure fluorescence

•DNA and polymerase are immobilized at the bottom of the ZMWs

•Fluorescent dNTPs are incorporated one at a time and each incorporation is measured by the intensity and colour of fluorescence.

https://www.youtube.com/watch?v=v8p4ph2MAvI

6. Oxford Nanopore Technologies (TGS)
•Nanopore is the bacterial α-hemolysin protein embedded in a synthetic membrane on an array chip

•Membrane has high electrical resistance, and the application of a potential across the membrane causes a current to flow through the aperture of the nanopore.

•DNA is inserted in a nanopore by a DNA polymerase (brown in the figure) and travels through the nanopore one nucleotide at a time.

•As each type of nucleotide travels through the nanopore, it causes a unique current disruption.

•The current changes are measured to identify the nucleotide sequence.
https://www.youtube.com/watch?v=GUb1TZvMWsw

Next-generation sequencers bake-off
Platform            Read length (bp)   Number of reads/run   Cost ($/Mb)   Run time (days)   Error rate (%)
454 (SGS)           450-900            1,000,000             10            0.95              1
Illumina (SGS)      100-150            4,000,000,000         0.1           5-6               0.1-0.5
SOLiD (SGS)         35-75              700,000               0.1           7                 < 0.1
Ion Torrent (SGS)   150-200            80,000,000            5             2-4               1-2
PacBio (TGS)        5500-8500          50,000                50-150        0.5-3             15
Nanopore (TGS)      10000-30000        2,000,000             N/A           2-48              1-10

Genome sequenced (publication year)       HGP (2003)     Venter (2007)   Watson (2008)
Time taken (start to finish)              13 years       4 years         4.5 months
Number of scientists listed as authors    > 2,800        31              27
Cost of sequencing (start to finish)      $2.7 billion   $100 million    < $1.5 million
Coverage                                  8-10 ×         7.5 ×           7.4 ×
Number of institutes involved             16             5               2
Number of countries involved              6              3               1
Technology                                Sanger         Sanger          Roche

•Illumina announced in 2014 that their HiSeq X can sequence a complete human genome for $998.

•However, this cost does not include assembly and annotation.

Wadman (2008) Nature 452: 788
Kremkow and Lee (2015) Biotechnol Lett 37: 55-65

Applications of next-generation sequencing
•Large number of sequence reads of genomes makes it easier to identify SNPs linked to polygenic diseases and interesting traits

•Metagenomics: sequence genetic material from environmental samples to determine identity and diversity of microbes (gut, volcanic vents, oil sands)

•Sequencing capacity allows for greater coverage of a genome that is present in a low proportion of the total genetic material in a sample (e.g. Neanderthal: <5% of sample; woolly mammoth).

•Transcriptomics (RNA-Seq): global identification of low-abundance transcripts (including microRNAs) with higher sensitivity than microarrays

•Chromatin Immunoprecipitation (ChIP)-Seq: global identification of binding sites of nucleic-acid binding proteins and chemical modifications (e.g. histone occupancy and acetylation, transcription factors)

Limitations of next-generation sequencing

•Shorter reads (35-100 bp) make de novo genome assembly difficult (ok for prokaryotes with few repetitive sequences or for resequencing projects)

•Have to increase coverage (35X-40X) due to short reads

•Error rates are still too high for some platforms.

•Few centers with strong infrastructure and support for assembly and analysis

Routine Human Genome Sequencing/Personalized Medicine

Personal genomic information can lay out the health roadmap of the individual

•Provide advanced screening for disease

•Select safer and more effective medications and dosages

•Create better vaccines (DNA/RNA vaccines)

•Lower health care costs

Human genome projects
•Personal Genome Project (Harvard Medical School and other countries including Canada): can volunteer to “share your genome information for the greater good”

•1000 Genomes Project (phase 1 published in 2012): genomes of 1,092 anonymous people from 14 populations around the world were sequenced

•The Cancer Genome Atlas (TCGA) started as a three-year pilot in 2006 funded by NCI and NHGRI to focus on the molecular understanding of brain, lung and ovarian cancer (210,000 new cases annually in US).

•UK10K: sequence the genomes of 4000 healthy humans and the exomes of 6000 people currently living with a genetic disease (obesity, schizophrenia and congenital heart disease)

•10K Autism Genome Project (Stephen Scherer, U of T)

Human Genome Variation
•Genome comparison indicates that 99.5% of sequence is identical among individuals (15-16 million nucleotide differences).

•38 million common SNPs, 1.4 million short insertions and deletions, and more than 14,000 larger deletions among humans.

•Estimated 76-190 rare deleterious non-synonymous variants and up to 20 loss-of-function and disease-associated variants

•As more disease genes are discovered, we will gain a better molecular understanding of the disorder and develop better diagnostics and therapeutics.

Nature 491: 56-65 (2012)


Learning objectives: you should be able to…

•Explain the advantages and disadvantages of hierarchical shotgun and whole-genome shotgun sequencing.

•Define coverage and explain why Sanger sequencing of a genome requires 5-7X coverage.

•Explain how next-generation sequencing technology has increased the capacity to sequence genomes over Sanger sequencing.

•Describe the three main steps of sequencing a genome.

•Describe the different types of next-generation sequencing platforms.

•Describe the benefits of human genome sequencing.
