Introduction to
Bioinformatics
Lecture 3+4
What is a database?
A database is an organized collection of related
information
A computerized archive used to store and organize
data in such a way that information can be retrieved
easily.
A database is a repository of information that has a
specific structure that enables the entering and
extraction of data
In general, this database structure consists of files or
tables,
each containing numerous records and fields
A database system (DBS) is an integrated
collection of related files along with details
of their definition, interpretation,
manipulation and maintenance
A database system protects the data from unauthorized
access.
What are the advantages of using
databases?
Easy and quick retrieval of information
Provide backup support
Databases
Sequence information is stored in databases
so that it can be manipulated easily
The databases are located at different sites
They exchange information on a daily basis so that they are
up to date and in sync
Primary databases – sequence data
Biological databases
Need to collect and store biological data and its
associated knowledge into databases
Fundamental to the survival of science
Each year, the journal Nucleic Acids Research (NAR)
dedicates an entire issue to the available databases!
Database management systems
Database management systems provide several functions in
addition to simple file management:
control security
maintain data integrity
provide for backup and recovery
control redundancy
allow data independence
perform automatic query optimization
Organisation
Databases are organised as:
flat files
relational databases
Flat-file databases
The simplest form of a database,
in which collections of data, such as nucleotide and amino acid
sequences, are stored as a single large text file
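As a rough illustration of the flat-file idea, the sketch below reads sequence records from a single block of text. It assumes the FASTA layout (a header line starting with `>` followed by sequence lines), which the slide does not name, and the two records are made up.

```python
from io import StringIO


def read_fasta(handle):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


# Hypothetical two-record flat file held in memory instead of on disk.
flat_file = StringIO(
    ">seq1 example nucleotide\nATGGCC\nTAA\n>seq2 example protein\nMKT\n"
)
records = dict(read_fasta(flat_file))
```

Every lookup has to scan the text from the top, which is why flat files become slow to query as they grow.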
Relational databases
A relational database stores the data within a number of tables.
Each table consists of records and fields (rows and columns)
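A minimal sketch of tables, records and fields, using Python's built-in `sqlite3` module. The table names, column names and accession values are hypothetical and chosen only for illustration.

```python
import sqlite3

# In-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")

# Two tables: each row is a record, each column is a field.
conn.execute("CREATE TABLE organism (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute(
    "CREATE TABLE sequence ("
    "accession TEXT PRIMARY KEY, "
    "organism_id INTEGER REFERENCES organism(id), "
    "residues TEXT)"
)

conn.execute("INSERT INTO organism VALUES (1, 'Homo sapiens')")
conn.executemany(
    "INSERT INTO sequence VALUES (?, ?, ?)",
    [("EX0001", 1, "ATGGCCTAA"), ("EX0002", 1, "ATGTTTTGA")],
)

# A join links records in the two tables through the shared organism id.
rows = conn.execute(
    "SELECT s.accession, o.name, LENGTH(s.residues) "
    "FROM sequence AS s JOIN organism AS o ON o.id = s.organism_id "
    "ORDER BY s.accession"
).fetchall()
```

Storing the organism once and referring to it by id is the kind of redundancy control that a single flat file cannot offer.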
Two kinds of biological databases
1. Primary
Contain primary sequence information (nucleotide or protein)
and associated annotations
2. Secondary
Summarize the results from primary databases
Types of Database
The databases can be classified
into three categories on the basis
of the information stored.
They are Primary, Secondary and
Composite databases.
Primary databases contain data
that is derived experimentally.
They usually store information
related to the sequences or
structures of biological
components
They can be further divided into
protein or nucleotide databases
Primary databases
Nucleotide sequence databases
Protein sequence databases
Useful Databases
Primary (archival)
GenBank/EMBL/DDBJ (sequences)
PDB (protein structures)
Medline (literature)
IMEx databases (protein interactions)
Secondary (curated)
RefSeq (sequences)
UniProt - SwissProt (sequences)
Taxon (taxonomy)
PROSITE (binding sites)
OMIM (genetics literature/reviews)
IMEx databases (protein interactions)
Major Primary DB
Nucleic acid
EMBL (Europe)
GenBank (USA)
DDBJ (Japan)
Protein
PIR - Protein Information Resource
MIPS
SWISS-PROT (University of Geneva, now with EBI)
TrEMBL (a supplement to SWISS-PROT)
NRL-3D
Primary Database
These databases contain the raw nucleic
acid sequence data that are produced
and submitted by researchers worldwide.
Nucleotide
NCBI (National Center for Biotechnology Information) GenBank
DDBJ (DNA Data Bank of Japan)
Protein
SWISS-PROT (Swiss-Prot)
PIR (Protein Information Resource)
MIPS
PDB (Protein Data Bank)
TrEMBL (Translated EMBL, European Molecular Biology Laboratory)
NATIONAL CENTER FOR BIOTECHNOLOGY
INFORMATION (NCBI)
Established in 1988 at the National Institutes of Health (NIH)
Part of the National Library of Medicine at the National Institutes of Health
Provides access to a large amount of biomedical and genomic
information (www.ncbi.nlm.nih.gov/home/about/mission.shtml).
It maintains a large collection of databases and bioinformatics tools
as well as services.
One of the most popular databases is GenBank
Mission or role
The aim is to find novel techniques and methodologies for
dealing with huge and complex data
and provide better accessibility to analytical and computational
tools.
Maintenance of biological databases whether primary or
secondary.
These include GenBank
NCBI provides data retrieval systems such as Entrez
It provides computational resources for the analysis of GenBank
data and other biological data
Resources
The resources that are present on this site can be divided into
two major categories:
1) databases
2) tools
The major databases maintained at NCBI are
GenBank and PubMed (bibliographic database for biomedical
literature).
Other databases include the
Gene,
Genome,
Epigenomics,
Gene Expression,
RefSeq,
Structure, Database of Short Genetic Variation (dbSNP),
TAXONOMY, etc.
TOOLS at NCBI
The NCBI also provides a variety of tools for database search
Entrez is the search engine of NCBI
The other tools include
Genomes Browser,
BLAST,
CDTree,
Genetic Codes,
Open Reading Frame Finder (ORF Finder),
SNP Database Specialized Search Tools,
NCBI
Created in 1988 as a part of the
National Library of Medicine at NIH
– Establish public databases
– Research in computational biology
– Develop software tools for sequence analysis
– Disseminate biomedical information
Web Access: www.ncbi.nlm.nih.gov
Coding sequences: CDS
A CDS is a sequence of nucleotides that corresponds to
the sequence of amino acids in a protein. A
typical CDS starts with ATG and ends with a stop codon.
The identification of coding sequences (CDS) is an
important step in the functional annotation of genes.
"Complete CDS" means that both a start codon
("ATG") and a stop codon ("TAA", "TGA" or "TAG") are present.
"Partial" means that the start codon, the stop codon, or both are missing
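The complete/partial distinction can be sketched as a small check. This is a simplification under stated assumptions: the sequence is given on the coding strand, and anything that lacks the ATG start, a terminal stop codon, or a whole number of codons is treated as partial. The function name is hypothetical.

```python
STOP_CODONS = {"TAA", "TGA", "TAG"}


def classify_cds(seq: str) -> str:
    """Return 'complete' or 'partial' for a coding-strand DNA sequence."""
    seq = seq.upper()
    has_start = seq.startswith("ATG")
    has_stop = len(seq) >= 3 and seq[-3:] in STOP_CODONS
    in_frame = len(seq) % 3 == 0  # length must be a whole number of codons
    if has_start and has_stop and in_frame:
        return "complete"
    return "partial"
```

For example, `classify_cds("ATGGCCTAA")` has both a start and a stop codon, while `classify_cds("GCCGCCTAA")` lacks the start codon. The check ignores internal stop codons and alternative start codons.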
GenBank
GenBank (Genetic Sequence Databank)
GenBank® is the genetic sequence database at the National
Center for Biotechnology Information (NCBI).
It was established in 1982 and is now maintained by the
National Center for Biotechnology Information (NCBI).
It contains publicly available nucleotide sequences
DNA sequences can be submitted to GenBank using several
different methods.
BankIt: Web-based form for submission of a small number of
sequences
Sequin: More appropriate for complicated submissions containing
many sequences
Nucleotide sequence databases
Genbank
Perhaps the best known database
Contains all publicly available annotated DNA sequences
Exchanges data daily with the DNA Data Bank of Japan (DDBJ) and
European Molecular Biology Laboratory (EMBL)
Contains roughly 179 million sequence entries (Dec 2014)
In October 2024, GenBank contained 34 trillion base pairs from over 4.7
billion nucleotide sequences and more than 580,000 formally described
species. The database was started in 1982 by Walter Goad and Los Alamos
National Laboratory.
Prior submission of a sequence to GenBank/DDBJ/EMBL is a prerequisite for
publishing a new sequence in most scientific journals
Submission is easy and can be done electronically
Each entry has a unique id known as the “Accession Number (AN)”
Structure of Genbank
A detailed structure of a nucleotide sequence record
in this database includes the following:
1. Locus: A title given by GenBank itself to
name the sequence entry. It includes the following:
a. Locus Name: Often the same as the accession number for the
sequence.
b. Sequence Length: The number of bases in the sequence.
c. Molecule Type: The type of nucleic acid sequenced,
for example mRNA (which is presented as cDNA),
rRNA, snRNA, or DNA.
d. GB Division: The GenBank division to which the record
belongs.
e. Modification Date: The date on which the record was
last modified.
2. Definition: A brief description of the nucleotide sequence.
3. Accession: This covers the accession number, accession
version, and GI number. The accession number is
the unique identifier associated with each nucleotide sequence
record in the database.
4. Version: Identification number assigned to a single, specific
sequence in the database. This number is in the format
"accession.version".
5. GI: Also a sequence identification number. Whenever a
sequence is changed, the version number is increased and a new GI
is assigned.
6. Keywords: Defined words that were used to index the entries.
7. Source: The organism from which the sequence was obtained.
8. Organism: The scientific name (usually genus and species) and
phylogenetic lineage.
9. Reference: Citations of publications by the sequence authors,
including the journal in which the sequence was reported.
10. Features: Information derived from the sequence, such as:
biological source,
exons,
introns,
promoters,
CDS,
alternative splicing.
Base Count and Origin are separate fields that follow the feature table.
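The Locus fields listed above can be pulled out of the first line of a record. The sketch below assumes a whitespace-separated layout (name, length, "bp", molecule type, optional topology, division, date). The example line is made up, and the function name is hypothetical.

```python
def parse_locus(line: str) -> dict:
    """Split a LOCUS line into the fields described in the slides."""
    tokens = line.split()
    if len(tokens) < 7 or tokens[0] != "LOCUS" or tokens[3] != "bp":
        raise ValueError("not a LOCUS line in the assumed layout")
    return {
        "locus_name": tokens[1],
        "length": int(tokens[2]),
        "molecule_type": tokens[4],
        # Division and date are the last two tokens; a topology token
        # such as 'linear' may or may not sit in between.
        "division": tokens[-2],
        "modification_date": tokens[-1],
    }


# Hypothetical LOCUS line, for illustration only.
example = "LOCUS       EX000001     1200 bp    mRNA    linear   PRI 01-JAN-2020"
locus = parse_locus(example)
```

Real flat files are column-oriented, so production parsers (for example in Biopython) are stricter than this token split.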
Growth of Genbank
(1982-2014)
Accession number
A unique identifier of each record in the database
Usually alpha-numeric in nature
Why do we need accession
numbers?
Common names lead to non-specific results
A search on “Cytochrome” will output many different types of
cytochromes (a, b, c, and others)
Cannot distinguish among species
Search on “Insulin” will return insulin sequences from many
organisms
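Because the VERSION field uses the "accession.version" format, an identifier can be split into its stable accession and its revision number. A minimal sketch follows. The helper name is hypothetical and the identifiers are used only as example strings.

```python
def split_accession(identifier: str):
    """Split 'accession.version' into (accession, version).

    Returns (identifier, None) when no numeric version suffix is present.
    """
    accession, sep, version = identifier.rpartition(".")
    if not sep or not version.isdigit():
        return identifier, None
    return accession, int(version)


with_version = split_accession("U49845.1")
without_version = split_accession("U49845")
```

Searching by accession retrieves one specific record, whereas a common name such as "insulin" matches sequences from many organisms.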
Example Genbank entry
How to search Genbank?
Can be performed via NCBI
online system Entrez
Search could be general or
specific using keywords
For example: Homo sapiens[ORGN] AND 3260:3270[SLEN]
(ORGN = organism, SLEN = sequence length)
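The same kind of fielded query can also be sent programmatically. The sketch below only builds a request URL for NCBI's E-utilities ESearch service and does not contact the network. The choice of the `nuccore` database and the helper name are assumptions made for illustration.

```python
from urllib.parse import parse_qs, urlencode, urlparse

EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(organism, min_len, max_len, db="nuccore"):
    """Build an ESearch URL for an organism plus sequence-length range."""
    term = f"{organism}[ORGN] AND {min_len}:{max_len}[SLEN]"
    # urlencode escapes spaces, brackets and colons in the query term.
    return EUTILS_ESEARCH + "?" + urlencode({"db": db, "term": term})


url = build_esearch_url("Homo sapiens", 3260, 3270)

# Decoding the URL again recovers the original Entrez query string.
decoded_term = parse_qs(urlparse(url).query)["term"][0]
```

Typing the decoded term into the Entrez search box should be equivalent to requesting the URL.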
Sister databases to Genbank
1. EMBL
Sequence provider for Europe
Maintained at European Bioinformatics Institute (EBI)
2. DDBJ
Sequence provider for Asia
Operated by the Center for Information Biology (CBI) in Japan
The three databases are part of the International Nucleotide Sequence
Database Collaboration
They sync their data every 24 hours
Querying the three databases separately is therefore unnecessary
They differ slightly in format and representation
European Molecular Biology
Laboratory (EMBL)
The EMBL Nucleotide Sequence Database is maintained by EBI,
UK
EMBL, the parent organisation, was founded in 1974
EBI develops and maintains a large number of databases, and
scientists can access the data free of cost.
This database serves as the primary source of nucleotide
sequences for Europe.
Nucleotide sequence data generated by
large-scale genome-sequencing projects and those available from
the European Patent Office can be submitted to this database
Data collection is done in collaboration with GenBank (USA) and
the DNA Data Bank of Japan (DDBJ).
The other genomic databases held at EBI are
Ensembl (a database of genome annotation)
Genome Reviews.
The daily releases of the database contain new submissions and
updated sequence data
while every 3 months the entire database is released.
DDBJ
DDBJ (DNA Data Bank of Japan) is a biological database that
collects DNA sequences submitted by researchers.
It is run by the National Institute of Genetics, Japan.
DDBJ Flat File Format
Data submitted to DDBJ are managed and retrieved in
the DDBJ flat file format.
The flat file includes the sequence together with information about
the submitter, references, source organism, and
sequence features.
Ensembl Genome Database
Ensembl is one of several well known genome browsers for the
retrieval of genomic information from several organisms
including human, plants, bacteria and animals.
Created and maintained by the EBI and the Sanger Center (UK)
The obvious problem with manually
curating the database?
Difficult to keep pace with amount of sequence data
generated these days. Necessary to supplement with an
automatic alternative
Protein Databases
Swiss-Prot
A protein sequence database which strives to provide a high level
of annotation:
* the function of a protein
* domain structure
* post-translational modifications
* variants
Complete, curated, non-redundant and cross-referenced with 34
other databases
Its repository contains the amino acid sequence, the protein name
and description, taxonomic data, and citation information
Protein sequence databases -
SwissProt
A collection of annotated protein sequences
Operated by the Swiss Institute of
Bioinformatics (SIB)
Manually curated by a specialist and
verified from literature
High quality database, gold standard for
protein annotation
The gold standard for the accurate
determination of the total protein content
in complex samples is the total amino acid
analysis (AAA). During AAA, the intact
proteins are hydrolyzed to individual amino
acids, usually by acidic hydrolysis.
(Amos Bairoch, creator of SwissProt during his Ph.D., 1986)
TrEMBL: TrEMBL (translation of EMBL nucleotide sequence database)
was introduced by the European Bioinformatics Institute in collaboration
with Swiss-Prot
• Created in 1996 as a computer annotated supplement to SWISS-PROT.
• Contains translations of all coding sequences (CDS) in EMBL.
PIR: The Protein Information Resource (PIR) is an integrated public
bioinformatics resource that supports genomic and proteomic research
and scientific studies
The PIR serves the scientific community through on-line access and
by performing off-line sequence identification services for researchers.
It is a database of freely accessible protein sequences which contains
high-quality data and functional information for the proteins
TrEMBL
Translated EMBL
Contains all translations of the EMBL nucleotide
database that have not yet been verified by the
SwissProt specialists
Completely automatic, so a less reliable source of
information
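Translating a CDS into a protein sequence, the step that TrEMBL automates, can be sketched as follows. The sketch assumes the standard genetic code and a coding-strand DNA input. The function name is hypothetical, and this is not the pipeline TrEMBL itself uses.

```python
from itertools import product

# Standard genetic code, listed in T, C, A, G order for each codon position.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_cds(cds: str) -> str:
    """Translate a coding sequence, stopping at the first stop codon."""
    cds = cds.upper()
    protein = []
    for i in range(0, len(cds) - 2, 3):
        aa = CODON_TABLE.get(cds[i:i + 3], "X")  # X marks an unknown codon
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


protein = translate_cds("ATGGCCAAATAA")
```

Translation of this kind needs no human input, which is why automatically produced entries can far outnumber manually curated ones.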
UniprotKB
Protein Information Resource (PIR)
Located at Georgetown University Medical Center, USA
Public bioinformatics resource to support genomic and
proteomic research
Established in 1984 by the National Biomedical Research
Foundation
The NBRF previously compiled a comprehensive collection of
sequences in the “Atlas of protein sequence and structure”
(edited by Dr. Margaret Dayhoff, 1965-1978)
Universal Protein Resource (Uniprot)
Unites the information in three databases: Swiss-Prot,
TrEMBL, and PIR
Consists of three parts
1. UniProtKB – based on Swiss-Prot and TrEMBL; a
comprehensive directory of protein annotations
2. UniRef – allows for fast similarity searches, such as a search for
sequences that are 90% identical
3. UniProt Archive (UniParc) – collection of UniProt sequences and their history
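To make "90% identical" concrete, the sketch below computes percent identity between two sequences that are already aligned to equal length. Counting only ungapped columns is an assumption of this example. UniRef's own clustering procedure is more involved and is not reproduced here.

```python
def percent_identity(a: str, b: str) -> float:
    """Percent of ungapped alignment columns where the residues match."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    # Ignore columns where either sequence has a gap character.
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    matches = sum(x == y for x, y in pairs)
    return 100.0 * matches / len(pairs)


# Two made-up 10-residue peptides that differ at the last position.
identity = percent_identity("MKTAYIAKQR", "MKTAYIAKQL")
```

A clustering threshold of 90% would place these two made-up peptides in the same group.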
Composite Databases
Composite databases
are collections of several primary database resources.
They provide users with various tools and software for analysis of data.
NCBI, being a composite database, stores a large number of nucleotide
and protein sequences on its servers and therefore suffers from
high redundancy in the deposited data
Composite DB
With so many databases, which one should be searched? Some are
strong in some aspects and weak in others.
A composite database is the answer: it draws on several databases for
its base data
Searches on these databases are indexed and streamlined so that
the same stored sequence is not searched twice in
different databases
Secondary db
Store secondary structure information or results of searches of
the primary databases
Database        Primary source
PROSITE         SWISS-PROT
PRINTS          OWL
Secondary databases
Secondary Databases:
contain information derived from primary
databases.
store information such as conserved sequences,
active site residues, and signature sequences.
Protein Databank data is stored in secondary
databases. Examples include:
Class Architecture Topology Homology (CATH),
Kyoto Encyclopedia of Genes and Genomes
(KEGG),
Protein Families (Pfam)
and Structural Classification of Proteins (SCOP)
PFAM
A database of protein families, Pfam contains
annotations as well as multiple sequence
alignments generated using hidden Markov models
PROSITE
Sometimes a newly sequenced protein gives no hits to sequence
databases
How do we determine its function then?
“In some cases, the structure and function of an unknown protein
which is too distantly related to any protein of known structure to
detect its affinity by overall sequence alignment may be identified by
its possession of a particular cluster of residue types classified as a
motif. The motifs, or templates, or fingerprints, arise because of
particular requirements of binding sites that impose very tight
constraints on the evolution of portions of a protein sequence” - A. M.
Lesk, 1988
Patterns are inferred from
multiple sequence
alignments
Look for regions that are
conserved in evolution
Could be important
binding sites, attachment
sites or catalytic sites
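A conserved pattern of this kind can be scanned for with a regular expression. The sketch below converts a subset of PROSITE pattern syntax (`x` for any residue, `[..]` for allowed residues, `{..}` for excluded residues, `(n)` or `(n,m)` for repeats) into a Python regex. The N- and C-terminal anchors `<` and `>` are not handled, and the test sequence is made up.

```python
import re

# One pattern element, optionally followed by a repeat such as (2) or (2,4).
ELEMENT = re.compile(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?")


def prosite_to_regex(pattern: str) -> str:
    """Convert a PROSITE-style pattern (subset of the syntax) to a regex."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        core, low, high = ELEMENT.fullmatch(element).groups()
        if core == "x":
            core = "."                        # any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"    # any residue except these
        # A single letter or an [ABC] set is already valid regex.
        if low and high:
            core += "{%s,%s}" % (low, high)
        elif low:
            core += "{%s}" % low
        parts.append(core)
    return "".join(parts)


# PROSITE N-glycosylation site pattern (PS00001).
regex = prosite_to_regex("N-{P}-[ST]-{P}")

# Made-up sequence: NGSA matches, NPSA does not (P follows the N).
hits = [m.start() for m in re.finditer(regex, "AANGSAANPSA")]
```

Short patterns like this one match by chance quite often, so a hit suggests a candidate site and is not proof of function.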
Biological databases
Biological databases can be broadly classified into
sequence databases,
structure databases,
and pathway databases.
Sequence databases are applicable to both nucleic acid
sequences and protein sequences, whereas structure databases
are applicable mainly to proteins.
Sequence databases
Nucleotide and protein sequence databases
represent the most widely used and some of the
best established biological databases.
They serve as repositories for wet lab results and the
primary source for experimental results.
Major public data banks included in this type are
GenBank in USA,
EMBL (European Molecular Biology Laboratory)
in Europe
and DDBJ (DNA Data Bank of Japan) in Japan
Protein databases include
ExPASy
UniProt
PIR
PDB
Swiss-Prot
TrEMBL
Advantages of Databases
Databases provide information on
the genomic context of genes,
gene homologues and paralogues,
RNA transcripts from the given genes,
peptide sequences, and
functions of gene families.
They allow access to the complete genome sequences available in the
database.
Structure databases
There are many structure databases, including the
Protein Data Bank (PDB)
Important in solving real problems in molecular biology
PDB was established in 1971 at Brookhaven National Laboratory
(BNL)
It contains structural information on macromolecules
determined by X-ray crystallography and NMR methods
PDB is maintained by the Research Collaboratory for Structural
Bioinformatics (RCSB).
PROSITE: is a database of protein domains and families.
PROSITE contains biologically significant sites, patterns and
profiles that help to reliably identify to which known protein
family a new sequence belongs.
CATH: The CATH database (Class, Architecture, Topology,
Homologous superfamily) is a hierarchical classification of protein
domain structures, which clusters proteins at four major
structural levels.
Pathway databases
A pathway database (DB) is a DB that describes biochemical
pathways, reactions, and enzymes
Some examples of the pathway databases are
KEGG (The Kyoto Encyclopedia of Genes and Genomes)
BRENDA,
BioCyc.
KEGG: The Kyoto Encyclopedia of Genes and Genomes (KEGG)
is the primary resource for the Japanese GenomeNet service.
It is a collection of online databases dealing with genomes,
enzymatic pathways, and biological chemicals
KEGG contains three databases: PATHWAY, GENES, and LIGAND.
The PATHWAY database stores computerized knowledge on
molecular interaction networks.
The GENES database contains data concerning sequences of
genes and proteins generated by the genome projects.
The LIGAND database holds information about the chemical
compounds and chemical reactions that are relevant to cellular
processes.
BioCyc: The BioCyc Database Collection is a compilation of
pathway and genome information for different organisms.
It includes two databases:
EcoCyc, which describes Escherichia coli K-12;
MetaCyc, which describes pathways for more than 300
organisms.