Biological Databases
What is a database?
• A database is a computerized archive used to
store and organize data in such a way that
information can be retrieved easily via a
variety of search criteria.
• Databases are composed of computer software
and hardware for data management.
Objective of database development
• The chief objective of the development of a
database is to organize data in a set of
structured records to enable easy retrieval of
information
Process of making a query
• Each record, also called an entry, should contain a number of
fields that hold the actual data items, for example fields for
names, phone numbers, addresses, and dates.
• To retrieve a particular record from the database, a user can
specify a particular piece of information, called a value, to be
found in a particular field and expect the computer to retrieve
the whole data record.
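The query process described above can be sketched in a few lines of Python. This is a minimal illustration, not how any particular database system is implemented; the records, field names, and the `query` helper are all hypothetical.

```python
# Hypothetical records: each dict is one entry, each key is one field.
records = [
    {"name": "Alice Smith", "phone": "555-0101",
     "address": "12 Oak St", "date": "2021-03-04"},
    {"name": "Bob Jones", "phone": "555-0102",
     "address": "9 Elm Ave", "date": "2022-11-30"},
]


def query(records, field, value):
    """Return every whole record whose `field` holds exactly `value`."""
    return [record for record in records if record.get(field) == value]


# The user supplies a value for one field and gets the full record back.
for record in query(records, "name", "Alice Smith"):
    print(record)
```

The user specifies only one field and one value, but the whole entry is returned, which is the behaviour the slide describes.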
Another feature of databases
• Knowledge discovery: the identification of connections
between pieces of information that were not known
when the information was first entered.
• For example, databases containing raw sequence
information can perform extra computational tasks to
identify sequence homology or conserved motifs.
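As a rough sketch of such an extra computational task, the Python below scans stored protein sequences for a short motif. The sequences are invented for illustration, and the pattern used (N, then any residue except P, then S or T, then any residue except P) is only one example of a motif written as a regular expression; real databases use far more elaborate methods.

```python
import re

# Toy protein sequences, invented for illustration only.
sequences = {
    "seq1": "MKNGSALLNPTQ",
    "seq2": "MANVTGGH",
    "seq3": "MKKLLPPA",
}

# Example motif as a regular expression: N, not P, S or T, not P.
MOTIF = re.compile(r"N[^P][ST][^P]")


def find_motif(sequences, pattern):
    """Map each sequence name to the 0-based start positions of the motif."""
    hits = {}
    for name, seq in sequences.items():
        positions = [match.start() for match in pattern.finditer(seq)]
        if positions:
            hits[name] = positions
    return hits


print(find_motif(sequences, MOTIF))
```

The fact that seq1 and seq2 share the motif was never entered by a submitter; it is derived from the stored data.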
• Flat-file databases are simple and are
essentially “free” but limit data access to
manual processes and/or structured
programs.
• Relational databases are generally more
complex with varying costs but provide
advanced capabilities and more efficient
access options
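The contrast between the two designs can be sketched as follows, assuming a made-up pipe-delimited flat file and Python's built-in `sqlite3` module as a stand-in for a relational system. The entry IDs, the `entry` table, and its columns are hypothetical.

```python
import sqlite3

# A made-up flat file: one entry per line, fields separated by "|".
FLAT_FILE = """\
ID1|Homo sapiens|ATGGCC
ID2|Mus musculus|ATGCCC
ID3|Homo sapiens|ATGAAA
"""


def flat_file_lookup(text, organism):
    """Flat file: a program must read and parse every line itself."""
    hits = []
    for line in text.splitlines():
        entry_id, org, _sequence = line.split("|")
        if org == organism:
            hits.append(entry_id)
    return hits


def relational_lookup(text, organism):
    """Relational: load the rows into a table, then ask with SQL."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE entry (id TEXT PRIMARY KEY, organism TEXT, sequence TEXT)"
    )
    rows = [tuple(line.split("|")) for line in text.splitlines()]
    conn.executemany("INSERT INTO entry VALUES (?, ?, ?)", rows)
    found = conn.execute(
        "SELECT id FROM entry WHERE organism = ? ORDER BY id", (organism,)
    ).fetchall()
    conn.close()
    return [row[0] for row in found]


print(flat_file_lookup(FLAT_FILE, "Homo sapiens"))
print(relational_lookup(FLAT_FILE, "Homo sapiens"))
```

Both functions return the same entries; the difference is that the flat-file version needs custom parsing code for every new question, while the relational version only needs a different query.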
• OOP is a programming paradigm based on the concept of
objects, which contain data and code.
• Object-oriented programming (OOP) is a programming
paradigm based on the concept of "objects", which may
contain data, in the form of fields, often known as
attributes; and code, in the form of procedures, often
known as methods. For example, a person is an object
which has certain properties such as height, gender, age,
etc. It also has certain methods such as move, talk, and
so on.
• Encapsulation is a concept used in object-oriented programming (OOP) to
bundle data and methods into an easy-to-use unit.
• Encapsulation is defined as the wrapping up of data under a single unit. It
is the mechanism that binds together code and the data it manipulates.
Another way to think about encapsulation is, that it is a protective shield
that prevents the data from being accessed by the code outside this shield.
• Inheritance is a core element of object-oriented programming that serves
as a powerful instrument for reusing code. Inheritance is a mechanism
that allows us to create hierarchies of classes, facilitating the sharing of
properties and methods among them. It supports code reusability and
extensibility, making it possible to build efficient and flexible
software solutions.
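The ideas of objects, encapsulation, and inheritance described above can be sketched in Python as follows. The class names `SequenceRecord` and `ProteinRecord` and their fields are hypothetical, chosen only to fit the database theme. Note that Python marks data as internal by convention (a leading underscore) rather than strictly blocking access.

```python
class SequenceRecord:
    """An object bundling data (attributes) with code (methods)."""

    def __init__(self, identifier, sequence):
        self.identifier = identifier
        # Encapsulation: the leading underscore marks this as internal.
        self._sequence = sequence.upper()

    @property
    def sequence(self):
        """Read-only access to the stored sequence."""
        return self._sequence

    def length(self):
        return len(self._sequence)


class ProteinRecord(SequenceRecord):
    """Inheritance: reuses everything in SequenceRecord and extends it."""

    def __init__(self, identifier, sequence, function):
        super().__init__(identifier, sequence)
        self.function = function

    def describe(self):
        return f"{self.identifier}: {self.length()} residues, {self.function}"


protein = ProteinRecord("P1", "mkv", "hypothetical kinase")
print(protein.describe())
```

`ProteinRecord` never defines `length()` itself; it inherits the method from `SequenceRecord`, which is the code reuse the slide refers to.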
• Despite the obvious drawbacks of using flat
files in database management, many biological
databases still use this format. The justification
is that this system involves a minimum
amount of database design and the search output
can be easily understood by working biologists.
Types of biological databases based on their contents
Three types
Primary databases
• Contain original biological data.
• They are archives of raw sequence or structural
data submitted by the scientific community.
• GenBank and the Protein Data Bank (PDB) are
examples of primary databases.
Secondary databases
• Contain computationally processed or manually
curated information based on original information
from primary databases.
• Translated protein sequence databases containing
functional annotation belong to this category, for example
Swiss-Prot and the Protein Information Resource (PIR).
Specialized Databases
• Are those that cater to a particular research
interest.
• Examples include FlyBase, the HIV sequence database,
the Ribosomal Database Project, and other databases
that specialize in a particular organism or a
particular type of data.
Primary Databases
(Introduction)
• GenBank
• European Molecular Biology Laboratory (EMBL) database.
• DNA Data Bank of Japan (DDBJ).
• Most of the data in the databases are contributed directly by authors
with a minimal level of annotation.
• A small number of sequences, especially those published in
the 1980s, were entered manually from published literature by
database management staff.
• Presently, sequence submission to GenBank, EMBL, or
DDBJ is a precondition for publication in most scientific journals,
to ensure that the fundamental molecular data are made freely
available.
• These three public databases closely collaborate and exchange
new data daily.
• They together constitute the International Nucleotide Sequence
Database Collaboration.
• This means that by connecting to any one of
the three databases, one should have access to
the same nucleotide sequence data.
• Although the three databases all contain the
same sets of raw data, each of the individual
databases has a slightly different kind of
format to represent the data.
Secondary databases
• Sequence annotation information in the primary
database is often minimal.
• To turn the raw sequence information into more
sophisticated biological knowledge, much
post-processing of the sequence information is needed.
• This creates the need for secondary databases, which
contain computationally processed sequence
information derived from the primary databases.
• Sequence annotations describe regions or sites of
interest in the protein sequence, such as post-
translational modifications, binding sites, enzyme
active sites, local secondary structure or other
characteristics reported in the cited references, or
predicted.
• Annotation is the process of identifying and
describing the locations and functions of genes
and other biological features within a DNA, RNA,
or protein sequence.
• The amount of computational processing work varies greatly
among the secondary databases; some are simple archives of
translated sequence data from identified open reading frames in
DNA, whereas others provide additional annotation and
information related to higher levels of information regarding
structure and functions.
• A prominent example is SWISS-PROT which provides detailed
sequence annotation that includes structure, function and
protein family assignment.
• The sequence data are mainly derived
from TrEMBL, a database of translated nucleic
acid sequences stored in the EMBL database.
• The annotation of each entry is carefully curated by
human experts and thus is of good quality.
• The protein annotation includes function, domain
structure, catalytic sites, cofactor binding,
posttranslational modification, metabolic pathway
information, disease association, and similarity with
other sequences.
• Much of this information is obtained from scientific literature
and entered by database curators.
• The annotation provides significant added value to each
original sequence record. The data record also provides cross
referencing links to other online resources of interest.
• Other features such as very low redundancy and high level of
integration with other primary and secondary databases make
SWISS-PROT very popular.
• PIR database = Protein Information
Resource database
• Pfam = collection of protein families
• DALI = Distance Matrix Alignment
Specialized databases
• Specialized databases normally serve a specific
research community or focus on a particular organism.
• The content of these databases may be sequences or
other types of information.
• The sequences in these databases may overlap with a
primary database but may also have new data
submitted directly by authors.
• Because they are often curated by experts in the
field, they may have unique organizations and
additional annotations associated with the
sequences.
• Many genome databases that are taxonomically
specific fall within this category.
• Examples include FlyBase, WormBase, AceDB,
and TAIR.
• WormBase is an online biological database about the
biology and genome of the nematode model organism
Caenorhabditis elegans and contains information about
other related nematodes.
• AceDB is a biological database for handling genomic
data. It was developed by Richard M. Durbin and Jean
Thierry-Mieg in 1989.[1] AceDB stands for "A C. elegans
Database". Although AceDB was initially created as a
database specifically for the nematode worm, the name has also
come to mean the database software itself, which has
been used to store information for other species.
• TAIR (The Arabidopsis Information
Resource) is a highly sophisticated,
extensive, user-friendly, Web-based resource
for researchers working on the model plant
Arabidopsis thaliana.
• In addition, there are also specialized
databases that contain original data derived
from functional analysis.
• Examples include the GenBank EST (Expressed
Sequence Tag) database and the Microarray Gene
Expression Database at the European
Bioinformatics Institute.
Interconnection between biological
databases.
• As mentioned, primary databases are primary repositories and
distributors of raw sequence and structure information.
• They support nearly all other types of biological databases in a way
akin to the Associated Press providing news feeds to local news
media, which then tailor the news to suit their own particular needs.
Therefore, in the biological community, there is a frequent need for
the secondary and specialized databases to connect to the primary
databases and to keep uploading sequence information.
• In addition, a user often needs to get information from both
primary and secondary databases to complete a task because the
information in a single database is often insufficient.
• Instead of letting users visit multiple databases, it is
convenient for entries in a database to be cross-referenced and
linked to related entries in other databases that contain additional
information. All these create a demand for linking different
databases.
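Cross-referencing between databases can be pictured with the small Python sketch below. The two in-memory "databases", the entry IDs, and the `xrefs` field are all hypothetical; real databases expose such links through their own record formats and web interfaces.

```python
# Hypothetical primary database: raw sequence plus a link to another database.
primary_db = {
    "NUC001": {
        "sequence": "ATGGCCAAAGGTTAA",
        "xrefs": {"protein_db": "PROT001"},
    },
}

# Hypothetical secondary database: annotation plus a link back.
secondary_db = {
    "PROT001": {
        "function": "hypothetical kinase",
        "xrefs": {"nucleotide_db": "NUC001"},
    },
}


def follow_xref(entry_id, source_db, target_db, link_name):
    """Follow a cross-reference from one database to the linked entry."""
    linked_id = source_db[entry_id]["xrefs"].get(link_name)
    return target_db.get(linked_id)  # None if the link is missing


annotation = follow_xref("NUC001", primary_db, secondary_db, "protein_db")
print(annotation["function"])
```

Starting from a single entry, the user reaches the related information without searching the second database separately.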
Pitfalls of Biological databases
• One of the problems associated with biological databases is
overreliance on sequence information and related annotations,
without understanding the reliability of the information.
• What is often ignored is the fact that there are many errors in
sequence databases. Annotations of genes can also occasionally
be false or incomplete. All these types of errors can be passed on
to other databases causing propagation of errors.
• Most errors in nucleotide sequences are caused by sequencing errors. Some of
these errors cause frame shifts that make whole gene identification difficult or
protein translation impossible.
• Sometimes, gene sequences are contaminated with sequences from cloning vectors.
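The effect of a frame shift can be illustrated with a short Python sketch. The DNA sequence is invented, and the `translate` helper is a simplified illustration that uses the standard genetic code, reads only the first reading frame, and stops at the first stop codon.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate frame 1 of `dna`, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


# Invented example: ATG GCC AAA GGT TAA
correct = "ATGGCCAAAGGTTAA"
# Simulated sequencing error: one extra base inserted after position 4.
with_error = correct[:4] + "T" + correct[4:]

print(translate(correct))
print(translate(with_error))
```

A single inserted base changes every codon downstream of the error, so the translated protein no longer resembles the original and the stop codon is lost from the reading frame.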
• Generally speaking, errors are more common for sequences produced before the
1990s; sequence quality has been greatly improved since. Therefore, exceptional
care should be taken when dealing with more dated sequences.
• Redundancy is another major problem affecting primary databases. There is
tremendous duplication of information in the databases for various reasons.
• Causes of redundancy include repeated submission of
identical or overlapping sequences by the same or different
authors, revision of annotations, dumping of expressed
sequence tag (EST) data, and poor database management
that fails to detect the redundancy.
• This makes some primary databases excessively large and
unwieldy for information retrieval.
• Steps have been taken to reduce the redundancy. NCBI has
created a nonredundant database called RefSeq, in
which identical sequences from the same organism and associated
sequence fragments are merged into a single entry.
• Protein sequences derived from the same DNA sequences are
explicitly linked as related entries. Sequence variants from the
same organism with very minor differences, which may well be
caused by sequencing errors, are treated as distinctly related
entries.
• This carefully curated database can be considered a
secondary database.
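The simplest part of this idea, merging identical sequences from the same organism into a single entry, can be sketched in Python as below. This is only an illustration with invented entries; it is not the actual RefSeq procedure, and it does not handle sequence fragments or near-identical variants.

```python
from collections import defaultdict

# Invented entries: (entry ID, organism, sequence).
entries = [
    ("A1", "E. coli", "ATGGCC"),
    ("A2", "E. coli", "ATGGCC"),     # duplicate submission of A1
    ("A3", "H. sapiens", "ATGGCC"),  # same sequence, different organism
    ("A4", "E. coli", "ATGAAA"),
]


def merge_identical(entries):
    """Group entry IDs that share both organism and exact sequence."""
    merged = defaultdict(list)
    for entry_id, organism, sequence in entries:
        merged[(organism, sequence)].append(entry_id)
    return dict(merged)


for (organism, sequence), ids in merge_identical(entries).items():
    print(organism, sequence, ids)
```

Four submitted entries collapse to three nonredundant ones, because A1 and A2 carry the same sequence from the same organism.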
• As mentioned, SWISS-PROT database also has
minimal redundancy for protein sequences compared
to most other databases.
• Another way to address the redundancy problem is
to create sequence-cluster databases such as
UniGene that coalesce EST sequences that are
derived from the same gene.
• The other common problem is erroneous annotations.
Often the same gene sequence is found under different
names, resulting in multiple entries and confusion about
the data. Conversely, unrelated genes bearing the same
name are found in the databases. To alleviate the problem
of naming genes, reannotation of genes and proteins
using a common, controlled vocabulary to describe
a gene or protein is necessary.
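A common vocabulary of this kind can be pictured as a lookup table from every known synonym to one agreed name. The sketch below is a minimal illustration; the gene names and the `SYNONYMS` table are entirely hypothetical.

```python
# Hypothetical vocabulary: every known synonym maps to one agreed name.
SYNONYMS = {
    "abc1": "ABC1",
    "abc-1": "ABC1",
    "transporter a": "ABC1",
    "xyz2": "XYZ2",
}


def normalize(name):
    """Return the agreed name for `name`, or None if it is not in the vocabulary."""
    return SYNONYMS.get(name.strip().lower())


def group_by_agreed_name(names):
    """Collect the submitted names under their agreed name."""
    groups = {}
    for name in names:
        agreed = normalize(name)
        groups.setdefault(agreed, []).append(name)
    return groups


submitted = ["ABC-1", "Transporter A", "xyz2", "mystery gene"]
print(group_by_agreed_name(submitted))
```

Names that map to the same agreed term are recognized as one gene, while names outside the vocabulary (grouped under `None` here) are flagged for a curator to review.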