
Lecture3 4

The document provides an overview of bioinformatics databases, explaining their structure, types, and management systems. It highlights the importance of primary and secondary databases for storing biological data, with examples like GenBank, EMBL, and DDBJ. Additionally, it discusses the role of the National Center for Biotechnology Information (NCBI) in maintaining these databases and providing tools for data retrieval and analysis.

Uploaded by

sundus waseem

Introduction to Bioinformatics

Lecture 3+4
What is a database?
A database is an organized collection of related information.
A computerized archive used to store and organize data in such a way that information can be retrieved easily.
A database is a repository of information with a specific structure that enables the entry and extraction of data.
In general, this database structure consists of files or tables, each containing numerous records and fields.
A database system (DBS) is an integrated collection of related files, along with the details of their definition, interpretation, manipulation and maintenance.
A database system protects the data from unauthorized access.
What are the advantages of using databases?
• Easy and quick retrieval of information
• Provide backup support


Databases
• Sequence information is stored in databases so that it can be manipulated easily
• The databases (next slide) are located at different places
• They exchange information on a daily basis so that they stay up to date and in sync
• Primary db: sequence data
Biological databases
Biological data and its associated knowledge need to be collected and stored in databases
Fundamental to the survival of science
Each year, the journal Nucleic Acids Research (NAR) dedicates an entire issue to the available databases!
Database management systems
Database management systems provide several functions in addition to simple file management:
• control security
• maintain data integrity
• provide for backup and recovery
• control redundancy
• allow data independence
• perform automatic query optimization
Organisation
• Flat files
• Relational databases
Flat-file databases
The simplest form of a database, where collections of data, such as nucleotide and amino acid sequences, are stored as a single large text file.
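As a rough illustration of the flat-file idea, the sketch below reads several sequence records out of one block of text. The FASTA-style layout, the record names and the `read_flat_file` helper are assumptions made for this example; the slides do not specify a particular file format.

```python
from io import StringIO

# Hypothetical flat file: one text stream holding several sequence records.
# The ">" header convention (FASTA-style) is an assumption for illustration.
FLAT_FILE = """\
>seq1 toy nucleotide record
ATGGCCATTG
TAATGGGCCG
>seq2 toy protein record
MAIVMGR
"""

def read_flat_file(handle):
    """Return {record_id: sequence} from FASTA-style text."""
    records = {}
    current_id = None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Header line: the first word after ">" is used as the record id.
            current_id = line[1:].split()[0]
            records[current_id] = ""
        elif current_id is not None:
            # Sequence line: append to the record currently being read.
            records[current_id] += line
    return records

records = read_flat_file(StringIO(FLAT_FILE))
print(sorted(records))
```

Every lookup has to scan the text from the top, which is one reason larger collections tend to move to a database management system.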
Conti…
Conti..
A relational database stores the data within a number of tables.
Each table consists of records and fields (rows and columns)
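A minimal sketch of the relational idea, using Python's built-in `sqlite3` module with an in-memory database. The table names, column names and the two toy records are invented for illustration and do not reflect the schema of any real biological database.

```python
import sqlite3

# In-memory database; nothing is written to disk.
conn = sqlite3.connect(":memory:")

# Two related tables: each row is a record, each column is a field.
conn.execute("CREATE TABLE organism (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute(
    "CREATE TABLE sequence ("
    " accession TEXT PRIMARY KEY,"
    " organism_id INTEGER NOT NULL REFERENCES organism(id),"
    " residues TEXT NOT NULL)"
)

# Toy data (made-up accessions, not real records).
conn.execute("INSERT INTO organism VALUES (1, 'Homo sapiens')")
conn.executemany(
    "INSERT INTO sequence VALUES (?, ?, ?)",
    [("TOY00001", 1, "ATGGCCTAA"), ("TOY00002", 1, "ATGAAATGA")],
)

# A join pulls fields from both tables into one result.
rows = conn.execute(
    "SELECT s.accession, o.name, LENGTH(s.residues)"
    " FROM sequence AS s JOIN organism AS o ON o.id = s.organism_id"
    " ORDER BY s.accession"
).fetchall()
print(rows)
```

Storing the organism name once and referring to it by id is a small instance of the redundancy control listed among the DBMS functions.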
Two kinds of biological databases

1. Primary
• Contain primary sequence information (nucleotide or protein) and associated annotations

2. Secondary
• Summarize the results from primary databases
Types of Database
The databases can be classified into three categories on the basis of the information stored: primary, secondary and composite databases.
Primary databases contain data that is derived experimentally. They usually store information related to the sequences or structures of biological components.
They can be further divided into protein or nucleotide databases.
Primary databases
• Nucleotide sequence databases
• Protein sequence databases
Conti…
Useful Database
Primary (archival)
• GenBank/EMBL/DDBJ (seqs)
• PDB (protein structures)
• Medline (literature)
• IMEx databases (protein interactions)
Secondary (curated)
• RefSeq (seqs)
• UniProt - SwissProt (seqs)
• Taxon (taxonomy)
• PROSITE (binding sites)
• OMIM (genetics literature/reviews)
• IMEx databases (protein interactions)
Major Primary DB
Nucleic Acid
• EMBL (Europe)
• GenBank (USA)
• DDBJ (Japan)
Protein
• PIR - Protein Information Resource
• MIPS
• SWISS-PROT (University of Geneva, now with EBI)
• TrEMBL (a supplement to SWISS-PROT)
• NRL-3D
Primary Database
These databases contain the raw nucleic acid sequence data which are produced and submitted by researchers worldwide.
• NCBI (National Center for Biotechnology Information)
• GenBank
• DDBJ (DNA Data Bank of Japan)
Protein
• SWISS-PROT (Swiss-Prot)
• PIR (Protein Information Resource)
• MIPS
• PDB (Protein Data Bank)
• TrEMBL (Translated European Molecular Biology Laboratory)
NATIONAL CENTER FOR BIOTECHNOLOGY INFORMATION (NCBI)
Established in 1988 as part of the National Library of Medicine at the National Institutes of Health (NIH).
Provides access to a large amount of biomedical and genomic information (www.ncbi.nlm.nih.gov/home/about/mission.shtml).
It maintains a large collection of databases and bioinformatics tools as well as services.
One of the most popular databases is GenBank.
Conti…
Mission or role
The aim is to find novel techniques and methodologies for dealing with huge and complex data and to provide better accessibility to analytical and computational tools.
Maintenance of biological databases, whether primary or secondary; these include GenBank.
NCBI provides data retrieval systems such as Entrez.
Provides computational resources for the analysis of GenBank data and other biological data.
Conti…
Resources
The resources that are present on this site can be divided into
two major categories:
1) databases
2) tools
The major databases maintained at NCBI are GenBank and PubMed (bibliographic database for biomedical literature).
Other databases include Gene, Genome, Epigenomics, Gene Expression, RefSeq, Structure, Database of Short Genetic Variation (dbSNP), Taxonomy, etc.
TOOLS at NCBI
The NCBI also provides a variety of tools for database search.
Entrez is the search engine of NCBI.
The other tools include
Genomes Browser,
BLAST,
CDTree,
Genetic Codes,
Open Reading Frame Finder (ORF Finder),
SNP Database Specialized Search Tools.
NCBI
• Created in 1988 as a part of the National Library of Medicine at NIH
– Establish public databases
– Research in computational biology
– Develop software tools for sequence analysis
– Disseminate biomedical information
Web Access: www.ncbi.nlm.nih.gov
Coding sequences: CDS
• A CDS is a sequence of nucleotides that corresponds to the sequence of amino acids in a protein. A typical CDS starts with ATG and ends with a stop codon.
• The identification of coding sequences (CDS) is an important step in the functional annotation of genes.
• "Complete CDS" means the presence of both a start codon (ATG) and a stop codon (TAA, TGA or TAG).
• "Partial" means the start codon, the stop codon, or both are missing from the sequence.
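The complete/partial distinction can be sketched as a small check on the first and last codons. This is a simplified illustration, not the rule GenBank annotators apply: the function name `classify_cds` is hypothetical, ATG is assumed to be the only start codon, and anything that is not complete is reported as partial.

```python
# Stop codons as listed on the slide.
STOP_CODONS = {"TAA", "TAG", "TGA"}

def classify_cds(seq):
    """Label a nucleotide sequence as a 'complete' or 'partial' CDS.

    Assumptions for this sketch: ATG is the only start codon, and the
    sequence must be a whole number of codons to count as complete.
    """
    seq = seq.upper()
    has_start = seq.startswith("ATG")
    has_stop = seq[-3:] in STOP_CODONS
    in_frame = len(seq) % 3 == 0
    if has_start and has_stop and in_frame:
        return "complete"
    return "partial"

print(classify_cds("ATGGCCTAA"))
```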
GenBank
GenBank (Genetic Sequence Databank)
GenBank® is the genetic sequence database at the National Center for Biotechnology Information (NCBI).
It was established in 1982 and is now maintained by NCBI.
It contains publicly available nucleotide sequences.
DNA sequences can be submitted to GenBank using several different methods:
BankIt: Web-based form for submission of a small number of sequences
Sequin: More appropriate for complicated submissions containing many sequences
Nucleotide sequence databases
• GenBank
• Perhaps the best known database
• Contains all publicly available annotated DNA sequences
• Exchanges data daily with the DNA Data Bank of Japan (DDBJ) and the European Molecular Biology Laboratory (EMBL)
• Contained roughly 179 million sequence entries (Dec 2014)
• In October 2024, GenBank contained 34 trillion base pairs from over 4.7 billion nucleotide sequences and more than 580,000 formally described species. The database was started in 1982 by Walter Goad and Los Alamos National Laboratory.
• Prior submission of a sequence to GenBank/DDBJ/EMBL is a prerequisite for publishing a new sequence in most scientific journals
• Submission is easy and can be done electronically
• Each entry has a unique ID known as the “Accession Number (AN)”
Structure of Genbank
A detailed structure of a nucleotide sequence file format in this database includes the following:
1. Locus: A title given by GenBank itself to name the sequence entry. It includes the following:
   a. Locus Name: Similar to the accession number for the sequence.
   b. Sequence Length: The number of bases in the sequence.
   c. Molecule Type: Identifies the type of nucleic acid sequence. The various types are mRNA (which is presented as cDNA), rRNA, snRNA, and DNA.
   d. GB Division: The class of the data according to the classification criteria of GenBank.
   e. Modification Date: The date on which the record was last modified.
2. Definition: A brief description of the nucleotide sequence.
3. Accession: This covers the accession number, accession version, and GI number. The accession number is the unique identifier associated with each nucleotide sequence record in the database.
4. Version: Identification number assigned to a single, specific sequence in the database. This number is in the format “accession.version”.
5. GI: Also a sequence identification number. Whenever a sequence is changed, the version number is increased and a new GI is assigned.
6. Keywords: Defined words that were used to index the entries.
7. Source: Describes the organism from which the sequence was obtained.
8. Organism: The scientific name (usually genus and species) and phylogenetic lineage.
9. Reference: Citations of publications by the sequence authors, including the journal in which the sequence was reported.
10. Features: Information derived from the sequence, such as biological source, exon, intron, promoters, CDS, alternate splice, Base Count, and Origin.
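To make the header fields concrete, the sketch below pulls the locus name, length, molecule type, division, date and "accession.version" out of a simplified record header. The header text is a made-up placeholder (not a real GenBank entry), its LOCUS line carries only the fields listed above, and `parse_header` is a hypothetical helper; real records contain more fields and are better handled with an established parser.

```python
# Placeholder header modelled loosely on the GenBank flat-file layout.
# All values are invented for illustration.
RECORD_HEADER = """\
LOCUS       ZZ999999     120 bp    DNA     PLN     01-JAN-2000
DEFINITION  Hypothetical example gene, complete cds.
ACCESSION   ZZ999999
VERSION     ZZ999999.2
KEYWORDS    .
"""

def parse_header(text):
    """Extract a few header fields from the simplified layout above."""
    fields = {}
    for line in text.splitlines():
        keyword, _, value = line.partition(" ")
        value = value.strip()
        if keyword == "LOCUS":
            # Assumed order: name, length, unit, molecule type, division, date.
            name, length, _unit, mol_type, division, date = value.split()
            fields["locus"] = {
                "name": name,
                "length": int(length),
                "molecule_type": mol_type,
                "division": division,
                "modification_date": date,
            }
        elif keyword == "DEFINITION":
            fields["definition"] = value
        elif keyword == "VERSION":
            # "accession.version": split on the first dot.
            accession, _, version = value.split()[0].partition(".")
            fields["accession"] = accession
            fields["version"] = int(version)
    return fields

header = parse_header(RECORD_HEADER)
print(header["locus"]["length"], header["accession"], header["version"])
```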
Growth of Genbank
(1982-2014)
Accession number
• A unique identifier of each record in the database
• Usually alpha-numeric in nature
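A small sketch of what "alpha-numeric" looks like in practice. The pattern below covers only the classic GenBank nucleotide accession layouts (one letter plus five digits, or two letters plus six or eight digits) with an optional ".version" suffix. That restriction is an assumption of this example: protein, RefSeq and other accession styles are deliberately left out, and `looks_like_accession` is a hypothetical helper name.

```python
import re

# Classic nucleotide accession shapes, optional ".version" suffix.
NUCLEOTIDE_ACCESSION = re.compile(
    r"^(?:[A-Z]\d{5}|[A-Z]{2}\d{6}|[A-Z]{2}\d{8})(?:\.\d+)?$"
)

def looks_like_accession(text):
    """Return True if text has the shape of a nucleotide accession number."""
    return bool(NUCLEOTIDE_ACCESSION.match(text.strip().upper()))

for query in ["U49845", "AB123456.1", "insulin"]:
    print(query, looks_like_accession(query))
```

The check only tests the shape of the string; it says nothing about whether a record with that accession exists.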


Why do we need accession numbers?
• Common names lead to non-specific results
• A search on “Cytochrome” will output many different types of cytochromes (a, b, c, and others)
• Cannot distinguish among species
• A search on “Insulin” will return insulin sequences from many organisms
Example Genbank entry
How to search Genbank?
• Can be performed via the NCBI online system Entrez
• Search can be general, or specific using keywords
• For example: Homo sapiens[ORGN] AND 3260:3270[SLEN] (SLEN = sequence length)
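The same kind of query can be sent programmatically through NCBI's E-utilities. The sketch below only builds an ESearch URL and makes no network request. The helper name `build_esearch_url` and the default values for `db` and `retmax` are choices made for this example; NCBI's usage guidelines (rate limits, API keys) are not covered here.

```python
from urllib.parse import urlencode

# Base address of the E-utilities ESearch service.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def build_esearch_url(organism, min_len, max_len, db="nuccore", retmax=20):
    """Build an ESearch URL restricting by organism and sequence length."""
    # Same field tags as in the slide example: [ORGN] and [SLEN].
    term = f"{organism}[ORGN] AND {min_len}:{max_len}[SLEN]"
    query = urlencode({"db": db, "term": term, "retmax": retmax})
    return f"{ESEARCH}?{query}"

url = build_esearch_url("Homo sapiens", 3260, 3270)
print(url)
```

`urlencode` takes care of escaping the spaces, brackets and colon in the search term, so the query string can be written exactly as it would be typed into the Entrez search box.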
Sister databases to Genbank
1. EMBL
• Sequence provider for Europe
• Maintained at the European Bioinformatics Institute (EBI)

2. DDBJ
• Sequence provider for Asia
• Operated by the Center for Information Biology (CIB) in Japan

• The three databases are part of the International Nucleotide Sequence Database Collaboration
• They sync their data every 24 hours
• Querying the three databases separately is therefore unnecessary
• They differ slightly in format and representation
European Molecular Biology Laboratory (EMBL)
The EMBL Nucleotide Sequence Database is maintained by EBI, UK.
EMBL, the parent laboratory, was founded in 1974.
EBI develops and maintains a large number of databases, and scientists can access the data free of cost.
This database serves as the primary source of nucleotide sequences for Europe.
Nucleotide sequence data generated by large-scale genome-sequencing projects, and those available from the European Patent Office, can be submitted to this database.
Conti…
Data collection is done in collaboration with GenBank (USA) and the DNA Data Bank of Japan (DDBJ).
The other genomic databases held at EBI are
Ensembl (a database of genome annotation)
Genome Reviews.
Daily releases of the database contain new submissions and updated sequence data, while the entire database is released every 3 months.
DDBJ
DDBJ (DNA Data Bank of Japan) is a biological database that collects DNA sequences submitted by researchers.
• It is run by the National Institute of Genetics, Japan.
DDBJ Flat File Format
The data submitted to DDBJ are managed and retrieved according to the DDBJ format (flat file).
The flat file includes the sequence, information on who submitted the data, references, source organisms, and feature information.
Ensembl Genome Database
Ensembl is one of several well-known genome browsers for the retrieval of genomic information from several organisms, including human, plants, bacteria and animals.
Created and maintained by the EBI and the Sanger Center (UK)
The obvious problem with manually curating the database?
• It is difficult to keep pace with the amount of sequence data generated these days, so manual curation needs to be supplemented with an automatic alternative
Protein Databases
Swiss-Prot
• A protein sequence database which strives to provide a high level of annotation:
* the function of a protein
* domain structure
* post-translational modifications
* variants
• Complete, curated, non-redundant and cross-referenced with 34 other databases
• Its repository contains the amino acid sequence, the protein name and description, taxonomic data, and citation information
Protein sequence databases - SwissProt
• A collection of annotated protein sequences
• Operated by the Swiss Institute of Bioinformatics (SIB)
• Manually curated by specialists and verified from literature
• High quality database, gold standard for protein annotation
• The gold standard for the accurate determination of the total protein content in complex samples is total amino acid analysis (AAA). During AAA, the intact proteins are hydrolyzed to individual amino acids, usually by acidic hydrolysis.
(Amos Bairoch, creator of SwissProt during his Ph.D., 1986)
Conti…

TrEMBL: TrEMBL (translation of EMBL nucleotide sequence database) was introduced by the European Bioinformatics Institute in collaboration with Swiss-Prot.
• Created in 1996 as a computer-annotated supplement to SWISS-PROT.
• Contains translations of all coding sequences (CDS) in EMBL.
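The phrase "translations of coding sequences" can be illustrated with a short conceptual translation routine. This is a sketch of translation using the standard genetic code only, not TrEMBL's actual pipeline; the function name `translate_cds` and the choice to stop at the first stop codon and to write unknown codons as "X" are assumptions of this example.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G
# (TTT, TTC, TTA, TTG, TCT, ...); "*" marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def translate_cds(cds):
    """Translate a DNA coding sequence, stopping at the first stop codon."""
    cds = cds.upper()
    protein = []
    for i in range(0, len(cds) - 2, 3):
        amino_acid = CODON_TABLE.get(cds[i:i + 3], "X")  # X = unknown codon
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)

print(translate_cds("ATGGCCAAATAA"))
```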
PIR: The Protein Information Resource (PIR) is an integrated public
bioinformatics resource that supports genomic and proteomic research
and scientific studies
The PIR serves the scientific community through online access and by performing offline sequence identification services for researchers.
It is a database of freely accessible protein sequences which contains high-quality data and functional information for the proteins.
TrEMBL
• Translated EMBL
• Contains all translations of the EMBL nucleotide database that have not yet been verified by the SwissProt specialists
• Completely automatic, so a less reliable source of information
UniprotKB
Protein Information Resource (PIR)
• Located at Georgetown University Medical Center, USA
• Public bioinformatics resource to support genomic and proteomic research
• Established in 1984 by the National Biomedical Research Foundation (NBRF)
• The NBRF previously compiled a comprehensive collection of sequences in the “Atlas of Protein Sequence and Structure” (edited by Dr. Margaret Dayhoff, 1965-1978)
Universal Protein Resource (UniProt)
• Unites the information in three databases: Swiss-Prot, TrEMBL, and PIR
• Consists of three parts
1. UniProtKB: based on Swiss-Prot and TrEMBL; a comprehensive directory of protein annotations
2. UniRef: allows for fast similarity searches, such as a search for sequences that are 90% identical
3. UniProt Archive: collection of UniProt sequences and their history
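To show what "90% identical" means numerically, the sketch below computes percent identity between two sequences that are already aligned to the same length. It is a simplified illustration and not the clustering method UniRef uses; the function name, the toy sequences and the convention of skipping columns where both sequences have a gap are assumptions of this example.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent of compared alignment columns holding the same residue.

    Both inputs must already be aligned (equal length, "-" for gaps).
    Columns where both sequences have a gap are ignored.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a.upper(), aligned_b.upper()):
        if a == "-" and b == "-":
            continue
        compared += 1
        if a == b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0

# Toy sequences differing at one position out of ten.
print(percent_identity("MKTAYIAKQR", "MKTAYIAKQK"))
```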


Composite Databases
Composite databases are collections of several primary database resources.
They provide users with various tools and software for analysis of data.
NCBI, being a composite database, stores a large number of nucleotide and protein sequences on its servers and therefore suffers from high redundancy in the data deposited.
Composite DB
• With so many db available, which one should be searched? Some are good in some aspects and weak in others.
• A composite db is the answer: it has several db as its base data
• Search on these db is indexed and streamlined so that the same stored sequence is not searched twice in different db
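One way to picture "the same stored sequence is not searched twice" is to merge records from several sources and key them by the sequence itself. The source names, identifiers and sequences below are invented, and `merge_sources` is a hypothetical helper; real composite databases use more elaborate redundancy criteria than exact string identity.

```python
def merge_sources(sources):
    """Merge {source: {identifier: sequence}} into {sequence: [origins]}.

    Identical sequences from different sources collapse into one entry
    that remembers where each copy came from.
    """
    merged = {}
    for source_name, records in sources.items():
        for identifier, sequence in records.items():
            key = sequence.upper()
            merged.setdefault(key, []).append(f"{source_name}:{identifier}")
    return merged

# Toy input: two hypothetical source databases sharing one sequence.
sources = {
    "db_one": {"P1": "MKTAYIAKQR", "P2": "MAIVMGR"},
    "db_two": {"Q9": "mktayiakqr"},
}
merged = merge_sources(sources)
print(len(merged), merged["MKTAYIAKQR"])
```

A search then runs once per unique sequence, and each hit can still be traced back to every source entry it represents.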
Secondary db
• Store secondary structure info or results of searches of the primary db

Secondary DB    Primary Source
PROSITE         SWISS-PROT
PRINTS          OWL
Secondary Databases
Secondary databases:
• contain information derived from primary databases.
• store information such as conserved sequences, active site residues, and signature sequences.
Data derived from the Protein Data Bank are also stored in secondary databases. Examples include:
• Class Architecture Topology Homology (CATH),
• Kyoto Encyclopedia of Genes and Genomes (KEGG),
• Protein Families (Pfam)
• and Structural Classification of Proteins (SCOP)
PFAM
A database of protein families, Pfam contains
annotations as well as multiple sequence
alignments generated using hidden Markov models
PROSITE
• Sometimes a newly sequenced protein gives no hits to sequence databases
• How do we determine its function then?

“In some cases, the structure and function of an unknown protein which is too distantly related to any protein of known structure to detect its affinity by overall sequence alignment may be identified by its possession of a particular cluster of residue types classified as a motif. The motifs, or templates, or fingerprints, arise because of particular requirements of binding sites that impose very tight constraint on the evolution of portions of a protein sequence” - A. M. Lesk, 1988
Contd.
• Patterns are inferred from multiple sequence alignments
• Look for regions that are conserved in evolution
• Could be important binding sites, attachment sites or catalytic sites
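Such conserved patterns are written in PROSITE's own pattern notation, which maps closely onto regular expressions. The sketch below converts a pattern and scans a toy sequence with it. The converter handles only the basic notation (x, [..], {..}, repeat counts, and the < and > anchors) and is an assumption-laden simplification; `prosite_to_regex` is a hypothetical helper, and the protein sequence is invented. The example pattern N-{P}-[ST]-{P} is the N-glycosylation site pattern.

```python
import re

def prosite_to_regex(pattern):
    """Convert a basic PROSITE pattern into a Python regular expression."""
    regex = pattern.rstrip(".").replace("-", "")
    # {P} means "any residue except P"  ->  [^P]
    regex = regex.replace("{", "[^").replace("}", "]")
    # x(2,4) means "2 to 4 of any residue"  ->  .{2,4}
    regex = regex.replace("(", "{").replace(")", "}")
    regex = regex.replace("x", ".")
    # < and > anchor the pattern to the sequence ends.
    regex = regex.replace("<", "^").replace(">", "$")
    return regex

glyco = prosite_to_regex("N-{P}-[ST]-{P}")
match = re.search(glyco, "MKNGSAL")  # toy sequence, not a real protein
print(glyco, match.start() if match else None)
```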
Biological databases
Biological databases can be broadly classified into
sequence databases,
structure databases,
and pathway databases.
Sequence databases are applicable to both nucleic acid sequences and protein sequences, whereas structure databases mainly cover proteins (the PDB also holds nucleic acid structures).
Sequence databases
Nucleotide and protein sequence databases represent the most widely used and some of the best established biological databases.
They serve as repositories for wet lab results and the primary source for experimental results.
Major public data banks included in this type are
GenBank in USA,
EMBL (European Molecular Biology Laboratory) in Europe
and DDBJ (DNA Data Bank of Japan) in Japan
Conti….
Protein databases include
ExPASy
UniProt
PIR
PDB
Swiss-Prot
TrEMBL
Advantages of Databases
provides information on
genomic context of genes,
Gene homologues, and paralogues,
RNA transcripts from the given genes,
peptide sequences, and
functions of gene families.
It allows access to complete genome sequences available in the
database.
Structure databases
There are many structural databases, including the Protein Data Bank (PDB)
Important in solving real problems in molecular biology
PDB was established in 1971 at Brookhaven National Laboratory (BNL)
It contains structural information on macromolecules determined by X-ray crystallography and NMR methods
PDB is maintained by the Research Collaboratory for Structural Bioinformatics (RCSB).
Conti…
PROSITE: is a database of protein domains and families.
PROSITE contains biologically significant sites, patterns and
profiles that help to reliably identify to which known protein
family a new sequence belongs.
CATH: The CATH database (Class, Architecture, Topology, Homologous superfamily) is a hierarchical classification of protein domain structures, which clusters proteins at four major structural levels.
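The four-level hierarchy can be pictured as a dotted code with one number per level. The sketch below splits such a code and checks how deep two domains agree. The codes used here are illustrative values in the dotted C.A.T.H style, and `CathCode`, `parse_cath_code` and `shared_depth` are hypothetical names introduced for this example.

```python
from typing import NamedTuple

class CathCode(NamedTuple):
    """One number per level: Class, Architecture, Topology, Homologous superfamily."""
    cath_class: int
    architecture: int
    topology: int
    homologous_superfamily: int

def parse_cath_code(code):
    """Split a dotted four-part code such as '3.40.50.300'."""
    parts = code.split(".")
    if len(parts) != 4:
        raise ValueError("expected four dot-separated levels")
    return CathCode(*(int(part) for part in parts))

def shared_depth(code_a, code_b):
    """Number of leading levels (0 to 4) on which two codes agree."""
    depth = 0
    for a, b in zip(code_a, code_b):
        if a != b:
            break
        depth += 1
    return depth

first = parse_cath_code("3.40.50.300")
second = parse_cath_code("3.40.50.720")
print(shared_depth(first, second))
```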
Pathway databases
A pathway database (DB) is a DB that describes biochemical pathways, reactions, and enzymes.
Some examples of pathway databases are
KEGG (The Kyoto Encyclopedia of Genes and Genomes),
BRENDA,
BioCyc.
Conti…

KEGG: The Kyoto Encyclopedia of Genes and Genomes (KEGG) is the primary resource for the Japanese GenomeNet service.
It is a collection of online databases dealing with genomes, enzymatic pathways, and biological chemicals.
KEGG contains three databases: PATHWAY, GENES, and LIGAND.
The PATHWAY database stores computerized knowledge on
molecular interaction networks.
The GENES database contains data concerning sequences of
genes and proteins generated by the genome projects.
The LIGAND database holds information about the chemical
compounds and chemical reactions that are relevant to cellular
processes.
Conti…
BioCyc: The BioCyc Database Collection is a compilation of
pathway and genome information for different organisms.
It includes two other databases,
 EcoCyc which describes Escherichia coli K-12;
 MetaCyc, which describes pathways for more than 300
organisms.
