LECTURE 2
Selected Topics in Computer Science
CS409 Introduction to Bioinformatics
Dr. Ashraf Hendam
OUTLINE
• Genome Sequencing
• Genetic variations
• Biological Data File Format
Genome Sequencing
Sequencing by Reference
Reference-based assembly maps each read to the position in a reference genome
where the read's nucleotides best match the reference.
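A minimal sketch of the mapping idea, assuming a toy reference and reads invented for illustration. The function name `map_read` and the `max_mismatches` parameter are hypothetical. Real read mappers use indexed search rather than this brute-force scan.

```python
def map_read(read, reference, max_mismatches=1):
    """Return (1-based position, mismatches) of the best placement, or None.

    Slides the read along the reference and counts mismatching bases
    in every window. Brute force, for illustration only.
    """
    best = None
    for start in range(len(reference) - len(read) + 1):
        window = reference[start:start + len(read)]
        mismatches = sum(1 for a, b in zip(read, window) if a != b)
        if mismatches <= max_mismatches and (best is None or mismatches < best[1]):
            best = (start + 1, mismatches)
    return best


reference = "GACTTCGATCA"                      # toy reference (assumption)
reads = ["CTTCG", "GATCA", "ACGTC", "TTTTT"]   # toy reads (assumption)
placements = {read: map_read(read, reference) for read in reads}
```

Here `ACGTC` is placed at position 2 with one mismatch, and `TTTTT` is left unmapped because no window is similar enough.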
Genome Sequencing
De novo sequencing
• Refers to sequencing a novel genome where there is no reference
sequence available for alignment.
• Sequence reads are assembled into contigs, and the coverage
quality of de novo sequence data depends on the size and
continuity of the contigs (i.e., the number of gaps in the data).
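Contig size and continuity are often summarized with the N50 statistic, which the slide does not name; it is shown here only as one possible measure. N50 is the contig length L such that contigs of length at least L together hold at least half of the assembled bases. The contig lengths below are invented.

```python
def n50(contig_lengths):
    """Return the N50 of a list of contig lengths (0 for an empty list)."""
    total = sum(contig_lengths)
    running = 0
    # Walk from the longest contig down until half of all bases are covered.
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0


contig_lengths = [80, 70, 50, 40, 30, 20]   # hypothetical assembly
assembly_n50 = n50(contig_lengths)
```

A more fragmented assembly of the same genome (more gaps, shorter contigs) would give a lower N50.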
Genetic variations
Causes of variations
1. Mistakes in DNA replication
2. Environmental agents (radiation, chemical agents)
3. Transposable elements (transposons)
• A segment of DNA is moved or copied to another location in the genome
4. Horizontal transfer of DNA
• Organism obtains genetic material from another organism that
is not its parent
• Utilized in genetic engineering
Genetic variations Cont.
Types of variations
1. SNP (Single Nucleotide Polymorphisms)
2. Indels (Insertion-Deletion)
3. Inversion
4. Duplication
Genetic variations Cont.
SNP (Single Nucleotide Polymorphisms)
Ref
G A C T T C G A T C A
Sample
G A C G T C G A T C A
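The comparison on this slide can be sketched in a few lines, assuming the two sequences are already aligned and have the same length. The function name `find_snps` is hypothetical.

```python
def find_snps(ref, sample):
    """Return (1-based position, ref base, sample base) for each mismatch."""
    if len(ref) != len(sample):
        raise ValueError("sequences must be aligned to the same length")
    return [(i + 1, r, s)
            for i, (r, s) in enumerate(zip(ref, sample))
            if r != s]


ref = "GACTTCGATCA"
sample = "GACGTCGATCA"
snps = find_snps(ref, sample)
```

For the slide's sequences this reports a single SNP at position 4, where T in the reference becomes G in the sample.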
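Genetic variations Cont.
Non-synonymous SNP (nsSNP) and Synonymous SNP (sSNP)
Ref      GAC TTC GAT CAG    D F D Q
Sample   GAC GTC GAT CAA    D V D Q
• nsSNP: TTC → GTC changes the amino acid (F → V)
• sSNP: CAG → CAA leaves the amino acid unchanged (Q → Q)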
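A minimal sketch of how each SNP could be classified, assuming the reading frame starts at position 1, the sequence length is a multiple of 3, and the standard genetic code applies. The function names are hypothetical.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate(dna):
    """Translate complete codons of a DNA string, frame starting at base 1."""
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


def classify_snps(ref, sample):
    """Label each mismatch as 'sSNP' (same amino acid) or 'nsSNP'."""
    results = []
    for i, (r, s) in enumerate(zip(ref, sample)):
        if r == s:
            continue
        start = i - i % 3                      # first base of the codon
        ref_aa = CODON_TABLE[ref[start:start + 3]]
        sample_aa = CODON_TABLE[sample[start:start + 3]]
        kind = "sSNP" if ref_aa == sample_aa else "nsSNP"
        results.append((i + 1, f"{r}>{s}", f"{ref_aa}>{sample_aa}", kind))
    return results


ref = "GACTTCGATCAG"
sample = "GACGTCGATCAA"
calls = classify_snps(ref, sample)
```

With the slide's sequences, position 4 is reported as an nsSNP (F to V) and position 12 as an sSNP (Q stays Q).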
Genetic variations Cont.
Indels (Insertion-Deletion)
Insertion
Ref
G A C T - - - - T C G
Sample
G A C T C G A T T C G
Genetic variations Cont.
Deletion
Ref
G A C T T C G A T C A
Sample
G A C - - C G A T C A
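Both indel slides can be read off a gapped pairwise alignment: a run of '-' in the reference row is an insertion, and a run of '-' in the sample row is a deletion. The sketch below assumes the alignment is already given; the function name `find_indels` and the position convention are assumptions made for this example.

```python
def find_indels(ref_aln, sample_aln):
    """Return (type, reference position, bases) for each gap run.

    Insertions report the reference position they follow;
    deletions report the first deleted reference position (1-based).
    """
    events = []
    ref_pos = 0          # reference bases consumed so far
    i = 0
    while i < len(ref_aln):
        if ref_aln[i] == "-":                      # gap in reference
            j = i
            while j < len(ref_aln) and ref_aln[j] == "-":
                j += 1
            events.append(("insertion", ref_pos, sample_aln[i:j]))
            i = j
        elif sample_aln[i] == "-":                 # gap in sample
            j = i
            while j < len(sample_aln) and sample_aln[j] == "-":
                j += 1
            events.append(("deletion", ref_pos + 1, ref_aln[i:j]))
            ref_pos += j - i
            i = j
        else:
            ref_pos += 1
            i += 1
    return events


insertion = find_indels("GACT----TCG", "GACTCGATTCG")
deletion = find_indels("GACTTCGATCA", "GAC--CGATCA")
```

For the two slides this gives an insertion of CGAT after reference position 4 and a deletion of TT starting at reference position 4.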
Genetic variations Cont.
Inversion
REF
G A C T T C G A T C A
Sample
G A C G C T T A T C A
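A small sketch of how the inverted segment in this example could be located, assuming equal-length sequences with a single inverted region. The slide shows the segment simply reversed; in double-stranded DNA an inversion usually reads as the reverse complement, so the hypothetical function `find_inversion` checks both cases.

```python
def find_inversion(ref, sample):
    """Return (start, end, ref segment, kind) for one inverted span, or None.

    Positions are 1-based and inclusive.
    """
    if len(ref) != len(sample):
        raise ValueError("sequences must have the same length")
    diffs = [i for i, (r, s) in enumerate(zip(ref, sample)) if r != s]
    if not diffs:
        return None
    start, end = diffs[0], diffs[-1] + 1
    segment = ref[start:end]
    observed = sample[start:end]
    if observed == segment[::-1]:
        return (start + 1, end, segment, "reversed")
    complement = str.maketrans("ACGT", "TGCA")
    if observed == segment[::-1].translate(complement):
        return (start + 1, end, segment, "reverse complement")
    return None


inversion = find_inversion("GACTTCGATCA", "GACGCTTATCA")
```

For the slide's sequences the segment TTCG at positions 4 to 7 appears reversed as GCTT in the sample.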
Genetic variations Cont.
Duplication
REF
G A C G T C G A T C A
ReSeq
G A C G T C G T C C A
Biological Data File Format
• FASTA
• FASTQ
• SAM
• BAM
• VCF
• GFF
• PDB
Biological Data File Format
FASTA
• File extensions : file.fa, file.fasta, file.fna
• FASTA format is a simple way of representing nucleotide sequences or
amino acid (protein) sequences.
• It is a very basic format with a minimum of two lines. The first line, the header
(comment) line, starts with '>' and gives basic information about the sequence.
• After the header line, the nucleic acid or protein sequence follows in standard
one-letter code.
• Example (nucleotide):
>XR_002086427.1 Candida albicans SC5314 uncharacterized ncRNA (SCR1), ncRNA
TGGCTGTGATGGCTTTTAGCGGAAGCGCGCTGTTCGCGTACCTGCTGTTTGTTGAAAATTTAA
GAGCAAAGTGTCCGGCTCGATCCCTGCGAATTGAATTCTGAACGCTAGAGTAATCAGTGTCTT
TCAAGTTCTGGTAATGTTTAGCATAACCACTGGAGGGAAGCAATTCAGCACAGTAATGCTAAT
CGTGGTGGAGGCGAATCCGGATGGCACCTTGTTTGTTGATAAATAGTGCGGTATCTAGTGTTG
CAACTCTATTTTT
• Example (protein):
>7S4F1|Chain A|Tyrosine-protein phosphatase non-receptor type 1|Homo sapiens
MEMEKEFEQIDKSGSWAAIYQDIRHEASDFPCRVAKLPKNKNRNRYRDVSPFDHSRIK
LHQEDNDYINASLIKMEEAQRSYILTQGPLPNTCGHFWEMVWEQKSRGVVMLNRV
MEKGSLKCAQYWPQKEEKE
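The header-plus-sequence layout described above can be read with a short parser. This is a minimal sketch using only the standard library; the function name `parse_fasta` and the toy records are assumptions, and sequence lines wrapped over several rows are joined back together.

```python
def parse_fasta(text):
    """Return {header: sequence} for every record in FASTA-formatted text."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):          # header (comment) line
            header = line[1:]
            records[header] = []
        elif header is not None:          # sequence line, possibly wrapped
            records[header].append(line)
    return {name: "".join(parts) for name, parts in records.items()}


# Toy input invented for this example, not taken from a database.
fasta_text = """>seq1 toy nucleotide record
GACTTCGA
TCA
>seq2 toy protein record
DFDQ
"""
sequences = parse_fasta(fasta_text)
```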
Biological Data File Format
FASTQ
• File extensions : file.fastq, file.sanfastq, file.fq
• FASTQ format was developed by the Sanger Institute to store a sequence together
with its quality scores (Q: Phred quality score). Each entry in a FASTQ file
occupies 4 lines.
@K00188:2:N:0:CTTGTA
ATAATAGGATCCCTTTTCCTGGAGCTGCCTTTAGGTA
+
AAAFFJJJJJJJJJJJJJJJJJFJJFJJJJJFJJJJJ
1. Line 1 begins with a '@' character, followed by a sequence identifier and an optional
description.
2. Line 2 is the sequence in standard one-letter code.
3. Line 3 begins with a '+' character and is optionally followed by the same sequence
identifier (and any additional description) again.
4. Line 4 encodes the quality values for the sequence in Line 2 and must contain the
same number of symbols as there are letters in the sequence.
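The four-line layout above can be parsed as follows. This is a sketch that assumes the common Phred+33 encoding, where each quality symbol's ASCII code minus 33 is the Phred score Q; the function name, the `offset` parameter, and the toy record are assumptions.

```python
def parse_fastq(text, offset=33):
    """Return a list of records with id, sequence, and Phred scores."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if len(lines) % 4 != 0:
        raise ValueError("each FASTQ entry must have exactly 4 lines")
    records = []
    for i in range(0, len(lines), 4):
        header, seq, plus, qual = lines[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed FASTQ entry")
        if len(seq) != len(qual):
            raise ValueError("sequence and quality lengths differ")
        records.append({
            "id": header[1:].split()[0],
            "sequence": seq,
            "quality": [ord(c) - offset for c in qual],  # Phred scores
        })
    return records


# Toy record invented for this example.
fastq_text = "@read1 toy record\nGATTACA\n+\nAAFFJJJ\n"
reads = parse_fastq(fastq_text)
```

A Phred score Q corresponds to an error probability of 10^(-Q/10), so the symbol J (Q = 41) marks a far more reliable base call than A (Q = 32).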
Biological Data File Format
SAM
• File extensions : file.sam
• SAM (Sequence Alignment/Map) is a text format for
storing sequence alignment data in a series of TAB-delimited ASCII columns.
• SAM files are generated after the reads are mapped to a
reference sequence.
• A SAM file has a header section and an alignment section.
• The header section is optional; header lines start with '@'.
• Each alignment line has 11 mandatory fields for essential
alignment information such as mapping position, and a
variable number of optional fields.
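The sketch below splits SAM text into header lines and alignment records using the 11 mandatory field names from the SAM specification (QNAME through QUAL). The function name `parse_sam` and the toy alignment are assumptions; the FLAG bits tested are 4 (read unmapped) and 16 (read on the reverse strand).

```python
SAM_FIELDS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
              "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"]
INT_FIELDS = {"FLAG", "POS", "MAPQ", "PNEXT", "TLEN"}


def parse_sam(text):
    """Return (header_lines, alignment_records) from SAM-formatted text."""
    header, alignments = [], []
    for line in text.splitlines():
        if not line:
            continue
        if line.startswith("@"):              # header section
            header.append(line)
            continue
        cols = line.split("\t")
        if len(cols) < 11:
            raise ValueError("alignment line needs 11 mandatory fields")
        rec = {name: (int(value) if name in INT_FIELDS else value)
               for name, value in zip(SAM_FIELDS, cols)}
        rec["OPTIONAL"] = cols[11:]           # e.g. NM:i:0
        rec["is_unmapped"] = bool(rec["FLAG"] & 4)
        rec["is_reverse"] = bool(rec["FLAG"] & 16)
        alignments.append(rec)
    return header, alignments


# Toy SAM content invented for this example.
sam_text = "\n".join([
    "@HD\tVN:1.6\tSO:coordinate",
    "@SQ\tSN:chr1\tLN:11",
    "read1\t0\tchr1\t3\t60\t5M\t*\t0\t0\tCTTCG\tJJJJJ\tNM:i:0",
    "read2\t16\tchr1\t7\t60\t5M\t*\t0\t0\tGATCA\tJJJJJ",
])
header, alignments = parse_sam(sam_text)
```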
Biological Data File Format
BAM
• File extensions : file.bam
• A BAM (Binary Alignment/Map) file is the compressed
binary version of a Sequence Alignment/Map (SAM) file: a
compact and indexable representation of nucleotide
sequence alignments.
• SAM and BAM files hold exactly the same data.
• Being binary and compressed, BAM files are smaller than SAM files
and well suited to storing alignments.
Biological Data File Format
VCF (Variant Call Format)
• File extensions : file.vcf
• VCF is a text file format with a header (VCF version, sample
information, etc.) followed by data lines that form the body of the file.
• A VCF file records genetic variation, including SNPs
and indels.
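A minimal sketch of reading VCF data lines and telling SNPs from indels by comparing the lengths of the REF and ALT alleles. Only the first five fixed columns (CHROM, POS, ID, REF, ALT) are used. The function names and the three variant records are invented for illustration.

```python
def classify(ref, alt):
    """Classify one REF/ALT pair by allele length."""
    if len(ref) == 1 and len(alt) == 1:
        return "SNP"
    if len(alt) > len(ref):
        return "insertion"
    if len(alt) < len(ref):
        return "deletion"
    return "other"


def parse_vcf(text):
    """Return one dict per ALT allele; header lines ('#') are skipped."""
    variants = []
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue
        chrom, pos, _id, ref, alts = line.split("\t")[:5]
        for alt in alts.split(","):           # ALT may list several alleles
            variants.append({"chrom": chrom, "pos": int(pos),
                             "ref": ref, "alt": alt,
                             "type": classify(ref, alt)})
    return variants


# Toy VCF content; the records are hypothetical.
vcf_text = "\n".join([
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "NW_008246507.1\t4\t.\tT\tG\t50\tPASS\t.",
    "NW_008246507.1\t11\t.\tT\tTCGA\t50\tPASS\t.",
    "NW_008246507.1\t20\t.\tCGG\tC\t50\tPASS\t.",
])
variants = parse_vcf(vcf_text)
```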
Biological Data File Format
VCF (SNPs)
Biological Data File Format
VCF(Indels)
Biological Data File Format
Reference NW_008246507.1, positions 1-40:
CATTACGTAATTCCGCTTCCGGCATCTGGCTCAGTTCCGC
[Figure: VCF example on this reference]
Biological Data File Format
GFF (General Feature Format or Gene Finding Format)
• File extensions : file.gff2, file.gff3, file.gff
• GFF3 has the same first 8 fields as GFF2 but differs in how
field 9 assigns attributes.
• A GFF file can be used to annotate the variants (SNPs and
indels) in a VCF file.
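The sketch below parses GFF3 lines into their 9 tab-separated fields (seqid, source, type, start, end, score, strand, phase, attributes) and then looks up which features overlap a variant position, which is the annotation step mentioned above. The function names and the two features are hypothetical.

```python
GFF_FIELDS = ["seqid", "source", "type", "start", "end",
              "score", "strand", "phase", "attributes"]


def parse_gff3(text):
    """Return a list of feature dicts; comment lines ('#') are skipped."""
    features = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 9:
            raise ValueError("a GFF3 feature line has 9 fields")
        feat = dict(zip(GFF_FIELDS, cols))
        feat["start"], feat["end"] = int(feat["start"]), int(feat["end"])
        # Field 9 in GFF3: semicolon-separated tag=value pairs.
        feat["attributes"] = dict(pair.split("=", 1)
                                  for pair in feat["attributes"].split(";")
                                  if pair)
        features.append(feat)
    return features


def annotate(features, seqid, pos):
    """Return IDs of features on seqid whose range contains pos (1-based)."""
    return [f["attributes"].get("ID", f["type"])
            for f in features
            if f["seqid"] == seqid and f["start"] <= pos <= f["end"]]


# Toy annotation invented for this example.
gff_text = "\n".join([
    "##gff-version 3",
    "NW_008246507.1\texample\tgene\t3\t25\t.\t+\t.\tID=gene1;Name=toyGene",
    "NW_008246507.1\texample\texon\t3\t12\t.\t+\t.\tID=exon1;Parent=gene1",
])
features = parse_gff3(gff_text)
hits_at_4 = annotate(features, "NW_008246507.1", 4)
hits_at_20 = annotate(features, "NW_008246507.1", 20)
```

A variant at position 4 falls inside both the gene and its exon, while one at position 20 lies in the gene only.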
Biological Data File Format
Reference NW_008246507.1, positions 1-40:
CATTACGTAATTCCGCTTCCGGCATCTGGCTCAGTTCCGC
[Figure: GFF example on this reference]
Biological Data File Format
Protein Data Bank (PDB)
• PDB is a standard file format for the atomic coordinates
of proteins.
• Structures deposited in the Protein Data Bank at the Research
Collaboratory for Structural Bioinformatics (RCSB) are
written in this standardized format.
• Several methods are currently used to determine the
structure of a protein, including X-ray crystallography, NMR
spectroscopy, and electron microscopy.
Biological Data File Format
Protein Data Bank (PDB)
ATOM records hold the atoms of the protein
Biological Data File Format
Protein Data Bank (PDB)
HETATM records hold atoms that are not part of the protein, such as drugs (ligands)
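PDB coordinate records are fixed-width, so ATOM and HETATM lines can be separated by their first six columns and read by column position. The sketch below covers only a subset of the columns. The helper `format_record` is a hypothetical convenience that builds correctly aligned toy lines; the atoms, the residue name `LIG`, and the coordinates are invented.

```python
def format_record(kind, serial, name, res, chain, res_seq, x, y, z):
    """Hypothetical helper: lay out one toy record in PDB column positions."""
    return (f"{kind:<6}{serial:>5} {name:<4} {res:>3} {chain}{res_seq:>4}"
            f"    {x:8.3f}{y:8.3f}{z:8.3f}")


def parse_record(line):
    """Read selected fixed-width columns of an ATOM or HETATM line."""
    return {
        "record": line[0:6].strip(),      # columns 1-6
        "serial": int(line[6:11]),        # columns 7-11
        "atom": line[12:16].strip(),      # columns 13-16
        "residue": line[17:20].strip(),   # columns 18-20
        "chain": line[21],                # column 22
        "res_seq": int(line[22:26]),      # columns 23-26
        "xyz": (float(line[30:38]),       # columns 31-38
                float(line[38:46]),       # columns 39-46
                float(line[46:54])),      # columns 47-54
    }


# Toy records invented for this example.
lines = [
    format_record("ATOM", 1, " N", "MET", "A", 1, 27.340, 24.430, 2.614),
    format_record("HETATM", 2, " C1", "LIG", "A", 301, 10.000, 11.500, 12.250),
]
atoms = [parse_record(line) for line in lines]
protein_atoms = [a for a in atoms if a["record"] == "ATOM"]
ligand_atoms = [a for a in atoms if a["record"] == "HETATM"]
```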
Biological Data File Format
Molecular visualization software of PDB