Lecture 1

The Bioinformatics course, led by Prof. Chi-Ying Huang and Dr. Tam Tran, covers the integration of biology, computer science, and information technology to enhance biological insights. Key components include attendance, homework, and exams, with a focus on primary databases and applications in molecular biology and diagnostics. Students will engage in practical exercises and assignments related to DNA and protein sequencing, utilizing resources like NCBI and various bioinformatics tools.

Bioinformatics - Session 1

Course Provider: Dr. Tam Tran

Bachelor in PMAB & MST, Year 2


Lecturers

Prof. Chi-Ying Huang


Institute of Biopharmaceutical Sciences
National Yang-Ming University, Taiwan
E-mail: [email protected]

Dr. Tam Tran


Department of Life Sciences (LS) – USTH
Email: [email protected]

Scoring

1. Attendance (10%)

2. Homework (30%) – submit your homework via USTH Moodle before 10am the day before the next class

3. Exam (60%) - Multiple Choice Exams

OR

1. Attendance (20%) (Note: 3 absences or late arrivals out of 5 classes = 0)

2. Homework (80%) – submit your homework via USTH Moodle before 10am the day before the next class

Important Administrative Notes

• You must log in to your USTH Moodle account. Both the homework and the final are submitted via Moodle.

https://moodle.usth.edu.vn

• All handouts, assignments, lecture slides, and announcements are posted on the Bioinformatics course page in USTH Moodle.

Outline

1. What is Bioinformatics?

2. Primary data

3. Biomedical databases search using NCBI

Why Study Bioinformatics?
1. What is Bioinformatics?

Bioinformatics (NCBI definition) is the field of science in which biology, computer science, and information technology merge to form a single discipline. The ultimate goal of the field is to enable the discovery of new biological insights as well as to create a global perspective from which unifying principles in biology can be discerned.

Applications of Bioinformatics

• Basic molecular biology
• Clinical/medical diagnostics
• Development of pharmaceuticals
• Agricultural biotechnology

Key research in bioinformatics

• Sequence bioinformatics
• Structural bioinformatics
• Systems biology: integrated analysis of biological pathways and interacting networks

2. Primary databases

Primary databases

Labs produce experimental results:
• Information stored in the genetic code (nucleotide sequences)
• Protein sequences (amino acid sequences)

Primary databases (Raw data)
Nucleic acids databases

Three databases that accept direct submission

• GenBank (NCBI, USA)
• ENA (European Molecular Biology Laboratory)
• DDBJ (National Institute of Genetics, Japan)

Fig. Information stored at GenBank, ENA and DDBJ is shared during daily updates

Exponential growth of the sequence databases

(European Nucleotide Archive, https://www.ebi.ac.uk/ena/about/statistics)


Protein sequence origin
• > 95% of protein sequences are derived from the translation of nucleotide sequences
• ~5% of protein sequences are obtained experimentally by direct protein sequencing

Beckman protein/peptide sequencer
HOMEWORK 1
DNA sequencing vs Protein sequencing
a. What is the difference between DNA sequencing and protein sequencing?
b. Why don't we sequence proteins the way we sequence DNA?

Protein sequence databases

Main protein databases

Swiss-Prot (Europe)
• Reviewed
• Manually annotated
• Records with information extracted from literature & curator-evaluated computational analysis

TrEMBL (Translated EMBL): a “waiting list” database for Swiss-Prot
• Unreviewed
• Computationally annotated
• Records that await full manual annotation

UniProt
Established in 2003
http://www.uniprot.org

A number of available complete genomes

Organism                   Published   Nucleotides (bp)   Genes
T7 bacteriophage           1983        39,937             59
Escherichia coli           1997        4,639,221          4,293
Saccharomyces cerevisiae   1996        12,069,252         5,800
Caenorhabditis elegans     1998        95,078,296         19,099
Drosophila melanogaster    2000        116,117,226        13,601

What was the first mammal to be fully sequenced?

(Compeau and Pevzner, 2015)

Homo sapiens (13-year project, $2.7 billion)
• 1st draft completed in 2001
• 3,160,079,000 bp; 31,780 genes

The Genome Sequencing Era
[Figure: timeline of genome sequencing milestones; axis 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2020 (NCBI data)]

Milestones labelled on the timeline:
• First microbial genome: H. influenzae
• E. coli
• First eukaryote genome: yeast
• First multicellular animal: C. elegans
• Fruit fly
• First higher plant: Arabidopsis
• First mammal: Homo sapiens
• Mouse
• First fish: Fugu
• Malaria: mosquito and parasite
• Microbial genome counts along the timeline: 18, then 40, then >9000 complete microbial genomes

EXERCISE BREAK

Exercise 1: Understand the forward and reverse strands and the reverse complementary strand


Example sequence (Forward strand): 5’-GCATGCAT-3’

Question:
• Write down the reverse strand
• Write down the reverse complementary strand
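
A minimal Python sketch for checking answers to Exercise 1. It assumes that "reverse strand" means the forward sequence read backwards and that "reverse complementary strand" means the complementary strand written 5’ to 3’. The function names are illustrative and not part of the lecture.

```python
# Complement of each DNA base (upper case, unambiguous bases only).
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def reverse(seq: str) -> str:
    """Return the sequence read backwards."""
    return seq[::-1]


def reverse_complement(seq: str) -> str:
    """Return the complement of each base, read backwards."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


forward = "GCATGCAT"  # forward strand from the exercise, 5' to 3'
print("reverse:           ", reverse(forward))
print("reverse complement:", reverse_complement(forward))
```

Applying `reverse_complement` twice returns the original sequence, which is a quick way to check a hand-written answer.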

EXERCISE BREAK

Exercise 2: Translate DNA sequence into protein


> Example Sequence
ATGACAGGGTGGGAGAGCTTATATAAGGATGCAATCGAGAAGGCAATAAAATCAGTTCCAAAGGTTAAAGGA
GTTCTCCTAGGCTATAACACAAACATAGATGCCATAAAATACCTAGACTCTAAGGATCTCGAAGAGAGAATA
GAGAAAGTCGGTAAGGAGGAAGTATTAAAGTACTCCGAAGAGCTTCCAGAAAAAATCACTTCAATCCCGCAG

Question: Translate the DNA sequence in 3 frames, and determine the reading frame which contains an open reading frame (ORF).

Suggestion: Bioinformatic Tools


- Sequence Manipulation Suite (SMS) (recommended):
http://annotathon.org/sms2/orf_find.html
- NCBI ORF finder: https://www.ncbi.nlm.nih.gov/orffinder/
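
For readers who prefer to script the exercise, the three-frame translation can be sketched in Python as below. This is a sketch under stated assumptions, not how the SMS or NCBI tools work: it uses the standard genetic code, translates only the forward strand, drops incomplete trailing codons, and takes the longest stop-free stretch as a rough indicator of an ORF. All function names are hypothetical.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G ("*" marks a stop).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(seq: str) -> str:
    """Translate a DNA sequence codon by codon, ignoring an incomplete last codon."""
    seq = seq.upper()
    return "".join(CODON_TABLE[seq[i:i + 3]] for i in range(0, len(seq) - 2, 3))


def three_frames(seq: str) -> list:
    """Translate the forward strand starting at offsets 0, 1 and 2."""
    return [translate(seq[offset:]) for offset in range(3)]


def longest_open_stretch(protein: str) -> int:
    """Length of the longest run of residues not interrupted by a stop codon."""
    return max(len(part) for part in protein.split("*"))


toy = "ATGAAATAG"  # short made-up sequence, not the exercise sequence
for number, protein in enumerate(three_frames(toy), start=1):
    print(f"frame {number}: {protein} (longest open stretch: {longest_open_stretch(protein)})")
```

To work on the exercise, replace `toy` with the example sequence (joined into one string without line breaks) and compare the three frames.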

Translated ORFs

3. Biomedical databases search using NCBI

The National Center for Biotechnology Information (NCBI), USA

https://www.ncbi.nlm.nih.gov/

Databases & Tools

• Article abstracts: MEDLINE
• Taxonomy: Taxonomy Browser
• 3-D structure: MMDB, VAST
• Genomes: Genome Data Viewer
• Nucleotide sequences: BLAST
• Protein sequences: BLAST

Other Databases

• Genetic variation: dbSNP
• Cancer chromosome aberration: CCAP
• Cancer gene expression: CGAP
• Gene expression: SAGE
• Genetic disease: OMIM
• Protein: Swiss-Prot

Reference sequence collections at NCBI (RefSeq database)

• Collection of unique sequences (one gene, one sequence)

NT_ - genomic contigs
NM_ - cDNA/mRNAs
NP_ - protein sequences

NR (non-redundant protein database)
UniProt = TrEMBL + Swiss-Prot

• The sequences are deduced (semi-)automatically and later human-curated
• Status: provisional or reviewed
• Available at: http://www.ncbi.nlm.nih.gov/RefSeq
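
The accession prefixes listed above can be recognised with a small lookup, sketched here in Python. Only the three prefixes from this slide are covered; RefSeq defines others. The function name and the zero-padded accessions are made-up placeholders for illustration.

```python
# RefSeq accession prefixes named on the slide and what they denote.
REFSEQ_PREFIXES = {
    "NT_": "genomic contig",
    "NM_": "cDNA/mRNA",
    "NP_": "protein sequence",
}


def refseq_type(accession: str) -> str:
    """Classify an accession by its first three characters."""
    return REFSEQ_PREFIXES.get(accession[:3].upper(), "unknown")


# Placeholder accessions (not real records), plus one non-RefSeq GenBank accession.
for accession in ["NM_000000.1", "NP_000000.1", "NT_000000.1", "M60978.1"]:
    print(accession, "->", refseq_type(accession))
```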

Why NCBI, why not Google?

The NCBI search engine: Entrez


Let's try searching for "DNA mismatch repair" in all NCBI databases and on Google.
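
Entrez can also be queried programmatically through NCBI's E-utilities. The Python sketch below only builds an ESearch request URL for the same search term and does not send it. The choice of the `pubmed` database, the `retmax` value and the function name are assumptions made for illustration; the lecture itself uses the web interface.

```python
from urllib.parse import urlencode

# Base address of the E-utilities ESearch service.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(term: str, db: str = "pubmed", retmax: int = 5) -> str:
    """Build (but do not send) an ESearch query URL for one Entrez database."""
    params = {"db": db, "term": term, "retmax": retmax, "retmode": "json"}
    return ESEARCH_URL + "?" + urlencode(params)


print(build_esearch_url("DNA mismatch repair"))
```

Opening the printed URL in a browser returns the matching record IDs; changing `db` targets a different Entrez database.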

GenBank Record Fields
• Locus name
• Division
• Accession number
• gi number
• Version number
• Medline ID
• Protein ID
• Protein sequence
• Nucleotide sequence

FASTA Format
FASTA definition line
>gi|193425|gb|M60978.1|MUSGAPDS

• gi number: 193425
• DB identifier: gb
• Accession number (with version): M60978.1
• Locus name: MUSGAPDS

DB identifiers
gb   GenBank
emb  EMBL
dbj  DDBJ
sp   Swiss-Prot
pdb  Protein Data Bank
pir  PIR
ref  RefSeq
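
A minimal Python sketch of how the pipe-separated definition line above can be split into its fields. It assumes the five-field `gi|number|db|accession|locus` layout shown on this slide; other definition-line layouts exist and are not handled. The function name is hypothetical.

```python
# Database identifiers from the slide.
DB_NAMES = {
    "gb": "GenBank", "emb": "EMBL", "dbj": "DDBJ", "sp": "Swiss-Prot",
    "pdb": "Protein Data Bank", "pir": "PIR", "ref": "RefSeq",
}


def parse_defline(defline: str) -> dict:
    """Split a 'gi|number|db|accession|locus' definition line into named fields."""
    fields = defline.lstrip(">").split("|")
    if len(fields) < 5 or fields[0] != "gi":
        raise ValueError("expected a line of the form >gi|number|db|accession|locus")
    _, gi_number, db, accession, locus = fields[:5]
    return {
        "gi_number": gi_number,
        "database": DB_NAMES.get(db, db),    # fall back to the raw tag if unknown
        "accession": accession,
        "locus_name": locus.split(" ", 1)[0],  # drop any free-text description
    }


print(parse_defline(">gi|193425|gb|M60978.1|MUSGAPDS"))
```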
HOMEWORK 2
Homework 2 (USTH Moodle): Figure out how the genes assigned to each of you are implicated
in cancers and/or immunity (File: Gene List.xlsx)

Requirements: get the following information about each of the 3 genes assigned to you
• Gene symbol, full name, reviewed by RefSeq
• Summary of its function
• Location on the human genome (based on GRCh38)
– e.g. chromosome, start, end, strand
• How this gene is related to cancer
– Get one open-access reference that is most relevant to cancers and/or immunity in your
opinion. Please list the article title, the authors, their institutions, publication year, journal
name.
• Any situations (mutations, over-expression, etc.) of this gene associated with other (non-cancer
and non-immune) diseases
• Extract DNA sequence of these genes and translate the DNA sequences in 3 frames, and
determine the reading frame which contains an open reading frame (ORF).

Take home message

• There are a large number of primary databases
• Use appropriate databases
• Know what kind of information to expect
