Bioinformatics - Session 1
Course Provider: Dr. Tam Tran
Bachelor in PMAB & MST – Year 2
Lecturers
Prof. Chi-Ying Huang
Institute of Biopharmaceutical Sciences
National Yang-Ming University, Taiwan
E-mail: [email protected]
Dr. Tam Tran
Department of Life Sciences (LS) – USTH
E-mail: [email protected]
Scoring
1. Attendance (10%)
2. Homework (30%) – submit your homework via USTH Moodle before 10am the day before the next class
3. Exam (60%) – multiple-choice exam
OR
1. Attendance (20%) (Note: 3 absences or late arrivals out of 5 classes = 0)
2. Homework (80%) – submit your homework via USTH Moodle before 10am the day before the next class
Important Administrative Notes
You must log in to your USTH Moodle account. Both the homework and the final exam are submitted via Moodle.
https://moodle.usth.edu.vn
All handouts, assignments, lecture slides, and announcements are posted on the Bioinformatics course page in USTH Moodle.
Outline
1. What is Bioinformatics?
2. Primary databases
3. Biomedical databases search using NCBI
Why Study Bioinformatics?
1. What is Bioinformatics?
Bioinformatics (NCBI definition) is the field of science in which biology, computer science, and information technology merge to form a single discipline. The ultimate goal of the field is to enable the discovery of new biological insights as well as to create a global perspective from which unifying principles in biology can be discerned.
Applications of Bioinformatics
- Basic molecular biology
- Clinical/medical diagnostics
- Development of pharmaceuticals
- Agricultural biotechnology
Key research in bioinformatics
- Sequence bioinformatics
- Structural bioinformatics
- Systems biology: integrated analysis of biological pathways and interacting networks
2. Primary databases
Primary databases
Labs
Experimental results
Information stored in the genetic code (nucleotide sequences)
Protein sequences (amino acid sequences)
Primary databases (Raw data)
Nucleic acid databases
Three databases that accept direct submission:
- GenBank (NCBI, USA)
- ENA (European Molecular Biology Laboratory)
- DDBJ (National Institute of Genetics, Japan)
Fig. Information stored at GenBank, ENA and DDBJ is shared during daily updates
Exponential growth of the sequence databases
(European Nucleotide Archive, https://www.ebi.ac.uk/ena/about/statistics)
Protein sequence origin
More than 95% of protein sequences are derived from the translation of nucleotide sequences
About 5% are obtained experimentally by direct protein sequencing
Fig. Beckman protein/peptide sequencer
HOMEWORK 1
DNA sequencing vs Protein sequencing
a. What is the difference between DNA sequencing and protein sequencing?
b. Why don't we sequence proteins the way we sequence DNA?
Protein sequence databases
Main protein databases (combined in UniProt, established in 2003, http://www.uniprot.org):
Swiss-Prot (Europe)
- Reviewed
- Manually annotated
- Records with information extracted from literature & curator-evaluated computational analysis
TrEMBL (Translated EMBL)
- Unreviewed
- Computationally annotated
- Records that await full manual annotation
- A "waiting list" database for Swiss-Prot
A number of available complete genomes
Organism                   Published/completed   Nucleotides (bp)   Genes
T7 bacteriophage           1983                  39,937             59
Escherichia coli           1997                  4,639,221          4,293
Saccharomyces cerevisiae   1996                  12,069,252         5,800
Caenorhabditis elegans     1998                  95,078,296         19,099
Drosophila melanogaster    2000                  116,117,226        13,601
What was the first mammal to be fully sequenced?
(Compeau and Pevzner, 2015)
Homo sapiens (a 13-year, $2.7 billion project)
1st draft completed in 2001
3,160,079,000 bp; 31,780 genes
The Genome Sequencing Era
Fig. Timeline of genome sequencing milestones, 1996 to 2020 (NCBI data). Milestones shown:
- First microbial genome: H. influenzae
- First eukaryote genome: yeast
- E. coli
- 18 microbial genomes
- First multicellular animal: C. elegans
- Fruit fly
- First higher plant: Arabidopsis
- 40 microbial genomes
- First mammal: Homo sapiens
- Mouse
- First fish: Fugu
- Malaria: mosquito and parasite
- >9000 complete microbial genomes
EXERCISE BREAK
Exercise 1: Understand forward/reverse strands and the reverse complementary strand
Example sequence (Forward strand): 5’-GCATGCAT-3’
Question:
• Write down the reverse strand
• Write down the reverse complementary strand
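The two operations in Exercise 1 can be checked with a few lines of Python. This is a minimal sketch, assuming that "reverse" means the same bases read in the opposite order and that "reverse complementary" means the complement of each base, then read in the opposite order. The function names are illustrative and not part of the course material.

```python
# Watson-Crick complement of each DNA base (upper and lower case).
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse(seq: str) -> str:
    """Return the same bases read in the opposite order."""
    return seq[::-1]


def complement(seq: str) -> str:
    """Return the complementary base at each position (order unchanged)."""
    return seq.translate(COMPLEMENT)


def reverse_complement(seq: str) -> str:
    """Return the complement of the sequence, read in the opposite order."""
    return complement(seq)[::-1]


forward = "GCATGCAT"  # forward strand from the exercise, written 5' to 3'
print("reverse:           ", reverse(forward))
print("reverse complement:", reverse_complement(forward))
```

Running the sketch on your own answer is a quick way to confirm which strand you actually wrote down.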
EXERCISE BREAK
Exercise 2: Translate DNA sequence into protein
> Example Sequence
ATGACAGGGTGGGAGAGCTTATATAAGGATGCAATCGAGAAGGCAATAAAATCAGTTCCAAAGGTTAAAGGA
GTTCTCCTAGGCTATAACACAAACATAGATGCCATAAAATACCTAGACTCTAAGGATCTCGAAGAGAGAATA
GAGAAAGTCGGTAAGGAGGAAGTATTAAAGTACTCCGAAGAGCTTCCAGAAAAAATCACTTCAATCCCGCAG
Question: Translate the DNA sequence in 3 frames, and determine the reading frame that contains an open reading frame (ORF).
Suggested bioinformatics tools:
- Sequence Manipulation Suite (SMS) (recommended):
http://annotathon.org/sms2/orf_find.html
- NCBI ORF finder: https://www.ncbi.nlm.nih.gov/orffinder/
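Besides the web tools listed above, the three-frame translation can be sketched in plain Python. The example below assumes the standard genetic code (NCBI translation table 1) and only the three forward frames. The names `CODON_TABLE`, `translate`, and `SEQ` are illustrative, and a frame is treated here as "open" simply when it contains no stop codon (`*`).

```python
# Standard genetic code, with bases ordered T, C, A, G at each codon position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS = [a + b + c for a in BASES for b in BASES for c in BASES]
CODON_TABLE = dict(zip(CODONS, AMINO_ACIDS))

# Example sequence from Exercise 2.
SEQ = (
    "ATGACAGGGTGGGAGAGCTTATATAAGGATGCAATCGAGAAGGCAATAAAATCAGTTCCAAAGGTTAAAGGA"
    "GTTCTCCTAGGCTATAACACAAACATAGATGCCATAAAATACCTAGACTCTAAGGATCTCGAAGAGAGAATA"
    "GAGAAAGTCGGTAAGGAGGAAGTATTAAAGTACTCCGAAGAGCTTCCAGAAAAAATCACTTCAATCCCGCAG"
)


def translate(seq: str, frame: int) -> str:
    """Translate seq starting at offset frame (0, 1 or 2); '*' marks a stop codon."""
    seq = seq.upper()
    protein = []
    for i in range(frame, len(seq) - 2, 3):
        # Codons with unexpected characters (e.g. N) are reported as X.
        protein.append(CODON_TABLE.get(seq[i:i + 3], "X"))
    return "".join(protein)


for frame in range(3):
    protein = translate(SEQ, frame)
    status = "no stop codon" if "*" not in protein else "contains stop codon(s)"
    print(f"Frame {frame + 1} ({status}): {protein}")
```

The same function can be reused for the reverse strand by passing in the reverse complement of the sequence, which gives the remaining three of the six possible frames.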
Translated ORFs
3. Biomedical databases search using NCBI
The National Center for Biotechnology Information (NCBI), USA
https://www.ncbi.nlm.nih.gov/
Databases & Tools
Fig. NCBI databases and tools:
- Article abstracts: MedLine
- Taxonomy: Taxonomy Browser
- Genomes: Genome Data Viewer
- 3-D structure: MMDB, VAST
- Nucleotide sequences: BLAST
- Protein sequences: BLAST
Other Databases
- Genetic variation: dbSNP
- Cancer chromosome aberration: CCAP
- Cancer gene expression: CGAP
- Gene expression: SAGE
- Genetic disease: OMIM
- Protein: Swiss-Prot
Reference sequence collections at NCBI (RefSeq Database)
Collection of unique sequences (one gene, one sequence)
Accession prefixes:
- NT_ - genomic contigs
- NM_ - cDNA/mRNAs
- NP_ - protein sequences
NR (Non-redundant Prot) = UniProt = TrEMBL + Swiss-Prot
The sequences are deduced (semi-)automatically and later human-curated
Status: provisional or reviewed
Available at: http://www.ncbi.nlm.nih.gov/RefSeq
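The accession prefixes listed above can be used to tell at a glance what kind of record a RefSeq accession refers to. The following is a small sketch that covers only the three prefixes named on this slide. The function name `refseq_type` and the example accession strings are illustrative assumptions, and RefSeq defines further prefixes that are not handled here.

```python
# Only the three prefixes named on the slide; RefSeq defines more than these.
REFSEQ_PREFIXES = {
    "NT_": "genomic contig",
    "NM_": "cDNA/mRNA",
    "NP_": "protein sequence",
}


def refseq_type(accession: str) -> str:
    """Classify an accession by its first three characters (e.g. 'NM_')."""
    return REFSEQ_PREFIXES.get(accession[:3].upper(), "unknown prefix")


# Illustrative accession strings, used only to exercise the lookup.
for acc in ["NM_000546", "NP_000537", "NT_000001", "M60978"]:
    print(acc, "->", refseq_type(acc))
```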
Why NCBI, why not Google?
The NCBI search engine: Entrez
Let's try searching for "DNA mismatch repair" in all NCBI databases and in Google Web search.
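Entrez can also be queried programmatically through the NCBI E-utilities. The sketch below only builds the request URL for an ESearch query and does not send it. It assumes a search of a single database (PubMed by default) rather than all NCBI databases at once, and the helper name `build_esearch_url` is hypothetical.

```python
from urllib.parse import urlencode

# Base URL of the NCBI E-utilities ESearch service.
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(term: str, db: str = "pubmed", retmax: int = 5) -> str:
    """Build an ESearch URL for one Entrez database (no request is sent)."""
    params = {"db": db, "term": term, "retmax": retmax, "retmode": "json"}
    return f"{EUTILS_ESEARCH}?{urlencode(params)}"


url = build_esearch_url("DNA mismatch repair")
print(url)
```

Opening the printed URL in a browser returns the matching record IDs. Changing `db` (for example to `gene` or `protein`) repeats the same search in another Entrez database.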
GenBank Record Fields
- Locus name
- Division
- Accession number
- gi number
- Version number
- Medline ID
- Protein ID
- Protein sequence
- Nucleotide sequence
FASTA Format
FASTA definition line:
>gi|193425|gb|M60978.1|MUSGAPDS
- gi number: 193425
- DB identifier: gb
- Accession number (with version): M60978.1
- Locus name: MUSGAPDS
DB identifiers:
- gb: GenBank
- emb: EMBL
- dbj: DDBJ
- sp: Swiss-Prot
- pdb: Protein Data Bank
- pir: PIR
- ref: RefSeq
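The definition line above can be split into its parts on the `|` separator. The following sketch assumes the layout shown on this slide (`gi`, gi number, DB identifier, accession.version, locus name) and does not handle other definition-line styles. The function name `parse_defline` and the dictionary keys are illustrative.

```python
# DB identifiers as listed on the slide.
DB_NAMES = {
    "gb": "GenBank",
    "emb": "EMBL",
    "dbj": "DDBJ",
    "sp": "Swiss-Prot",
    "pdb": "Protein Data Bank",
    "pir": "PIR",
    "ref": "RefSeq",
}


def parse_defline(defline: str) -> dict:
    """Parse a definition line of the form >gi|<gi>|<db>|<accession.version>|<locus>."""
    fields = defline.strip().lstrip(">").split("|")
    if len(fields) < 5 or fields[0] != "gi":
        raise ValueError(f"unexpected definition line layout: {defline!r}")
    # Split 'M60978.1' into the accession and its version number.
    accession, _, version = fields[3].partition(".")
    return {
        "gi_number": fields[1],
        "database": DB_NAMES.get(fields[2], fields[2]),
        "accession": accession,
        "version": version,
        "locus": fields[4],
    }


record = parse_defline(">gi|193425|gb|M60978.1|MUSGAPDS")
for key, value in record.items():
    print(f"{key}: {value}")
```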
HOMEWORK 2
Homework 2 (USTH Moodle): Figure out how the genes assigned to each of you are implicated in cancers and/or immunity (File: Gene List.xlsx)
Requirements: get the following information about each of the 3 genes assigned to you
• Gene symbol, full name, reviewed by RefSeq
• Summary of its function
• Location on the human genome (based on GRCh38)
– e.g. chromosome, start, end, strand
• How this gene is related to cancer
– Get the one open-access reference that is, in your opinion, most relevant to cancers and/or immunity. Please list the article title, the authors, their institutions, the publication year, and the journal name.
• Any alterations (mutations, over-expression, etc.) of this gene associated with other (non-cancer and non-immune) diseases
• Extract the DNA sequences of these genes, translate them in 3 frames, and determine the reading frame that contains an open reading frame (ORF).
Take home message
There are a large number of primary databases
Use appropriate databases
Know what kind of information to expect