Biological Databases

The document provides an overview of biological databases, defining them as computerized archives for data storage and retrieval, with objectives focused on organizing data for easy access. It categorizes databases into primary, secondary, and specialized types, detailing their roles in storing original data, processed information, and specific research interests, respectively. Additionally, it discusses the interconnections between these databases, the challenges of redundancy and erroneous annotations, and the importance of accurate data management in biological research.

Uploaded by

tassera9


Biological Databases

What is a database?
• A database is a computerized archive used to store and organize data in such a way that information can be retrieved easily via a variety of search criteria.
• Databases are composed of computer software and hardware for data management.
Objective of database development

• The chief objective of the development of a database is to organize data in a set of structured records to enable easy retrieval of information.
Process of making a query
• Each record, also called an entry, should contain a number of fields that hold the actual data items, for example fields for names, phone numbers, addresses, and dates.
• To retrieve a particular record from the database, a user can specify a particular piece of information, called a value, to be found in a particular field and expect the computer to retrieve the whole data record.
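The query process described above can be sketched in a few lines of Python. This is only an illustration: the Record class, its field names, the query function, and the sample entries are all invented for the example and do not come from any real database system.

```python
from dataclasses import dataclass


@dataclass
class Record:
    """One entry; each attribute is a field holding a data item."""
    name: str
    phone: str
    address: str
    date: str


def query(records, field, value):
    """Return every whole record whose `field` holds exactly `value`."""
    return [r for r in records if getattr(r, field) == value]


# Hypothetical entries, made up for the illustration.
records = [
    Record("Ada Example", "555-0100", "1 Sample Street", "2024-01-15"),
    Record("Ben Example", "555-0101", "2 Sample Street", "2024-02-20"),
]

# The user supplies a value for one field and gets the full record back.
matches = query(records, "phone", "555-0101")
print(matches[0])
```

The user names a field and a value; the program returns the complete entry, not just the matching field.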
Another feature of databases
• Knowledge discovery: the identification of connections between pieces of information that were not known when the information was first entered.
• For example, databases containing raw sequence information can perform extra computational tasks to identify sequence homology or conserved motifs.
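As a rough sketch of such an extra computational task, the snippet below scans stored protein sequences for a short motif. The pattern is the N-glycosylation site pattern N-{P}-[ST]-{P} written as a regular expression; the sequence IDs and sequences are made up, and the find_motif helper is a name chosen for this example. Real motif and homology searches are considerably more involved.

```python
import re

# N-{P}-[ST]-{P}: Asn, any residue but Pro, Ser or Thr, any residue but Pro.
MOTIF = re.compile(r"N[^P][ST][^P]")


def find_motif(sequences):
    """Map each sequence ID to the 0-based start positions of
    non-overlapping motif matches; IDs without a match are omitted."""
    hits = {}
    for seq_id, seq in sequences.items():
        positions = [m.start() for m in MOTIF.finditer(seq)]
        if positions:
            hits[seq_id] = positions
    return hits


# Hypothetical sequences, invented for the illustration.
sequences = {
    "seq1": "MKNGSAVLLP",
    "seq2": "MKPPLLAVGG",
    "seq3": "AANCTWNPSA",
}

hits = find_motif(sequences)
print(hits)
```

The connection between seq1 and seq3 (both carry the motif) was not entered by anyone; it is discovered by running a computation over the stored data.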
• Flat-file databases are simple and essentially “free”, but they limit data access to manual processes and/or structured programs.
• Relational databases are generally more complex, with varying costs, but provide advanced capabilities and more efficient access options.
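The contrast can be illustrated with a minimal sketch, assuming a made-up pipe-delimited flat file and using Python's built-in sqlite3 module as a stand-in for a relational system. The file layout, table name, and column names are assumptions for the example only.

```python
import sqlite3

# Hypothetical flat file: one entry per line, fields separated by "|".
FLAT_FILE = """\
ID1|Homo sapiens|ATGGCC
ID2|Mus musculus|ATGGCA
ID3|Homo sapiens|ATGTTT
"""


def flat_file_lookup(text, organism):
    """Flat file: the program has to read and split every line itself."""
    ids = []
    for line in text.splitlines():
        entry_id, org, _seq = line.split("|")
        if org == organism:
            ids.append(entry_id)
    return ids


def relational_lookup(text, organism):
    """Relational: load the rows into a table and let SQL select them."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE entry (id TEXT PRIMARY KEY, organism TEXT, seq TEXT)"
    )
    conn.executemany(
        "INSERT INTO entry VALUES (?, ?, ?)",
        [line.split("|") for line in text.splitlines()],
    )
    rows = conn.execute(
        "SELECT id FROM entry WHERE organism = ? ORDER BY id", (organism,)
    ).fetchall()
    conn.close()
    return [row[0] for row in rows]


print(flat_file_lookup(FLAT_FILE, "Homo sapiens"))
print(relational_lookup(FLAT_FILE, "Homo sapiens"))
```

Both functions return the same IDs. The difference is where the search logic lives: in hand-written parsing code for the flat file, or in a declarative query for the relational table.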
• Object-oriented programming (OOP) is a programming paradigm based on the concept of "objects", which may contain data, in the form of fields (often known as attributes), and code, in the form of procedures (often known as methods). For example, a person is an object that has certain properties, such as height, gender, and age, and certain methods, such as move and talk.
• Encapsulation is the bundling of data and the methods that manipulate it into a single, easy-to-use unit. Another way to think about encapsulation is as a protective shield that prevents the data from being accessed by code outside the shield.
• Inheritance is a core element of object-oriented programming and a powerful instrument for reusing code. It is a mechanism for creating hierarchies of classes in which properties and methods are shared, which supports code reusability and extensibility and makes software more efficient and flexible.
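The three ideas (objects, encapsulation, inheritance) can be shown together in a short Python sketch built on the person example. The class names Person and Biologist and their methods are invented for illustration. Note that Python only signals internal data by convention (a leading underscore) rather than strictly enforcing the "protective shield".

```python
class Person:
    """An object bundling data (attributes) with code (methods)."""

    def __init__(self, name, age):
        self.name = name
        self._age = age  # leading underscore: internal by convention

    @property
    def age(self):
        """Read-only view of the age; outside code cannot assign to it."""
        return self._age

    def have_birthday(self):
        """The only sanctioned way to change the age."""
        self._age += 1

    def talk(self):
        return f"{self.name} says hello"


class Biologist(Person):
    """Inherits every attribute and method of Person, then extends them."""

    def __init__(self, name, age, organism):
        super().__init__(name, age)
        self.organism = organism

    def talk(self):
        # Reuse the parent's behaviour and add to it.
        return f"{super().talk()} and studies {self.organism}"


ada = Biologist("Ada", 30, "C. elegans")
ada.have_birthday()
print(ada.age)
print(ada.talk())
```

Biologist does not repeat the age handling; it reuses it through inheritance and only adds what is new.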
• Despite the obvious drawbacks of using flat files in database management, many biological databases still use this format. The justification is that this system involves a minimum amount of database design and the search output can be easily understood by working biologists.
Based on their contents
Three types: primary, secondary, and specialized
Primary databases
• Contain original biological data.
• They are archives of raw sequence or structural data submitted by the scientific community.
• GenBank and the Protein Data Bank (PDB) are examples of primary databases.
Secondary databases

• Contain computationally processed or manually curated information based on original information from primary databases.
• Translated protein sequence databases containing functional annotation belong to this category. Examples are Swiss-Prot and the Protein Information Resource (PIR).
Specialized Databases

• Are those that cater to a particular research interest.
• Examples include FlyBase, the HIV Sequence Database, and the Ribosomal Database Project, which specialize in a particular organism or a particular type of data.
Primary Databases (Introduction)
• GenBank
• European Molecular Biology Laboratory (EMBL) database.
• DNA Data Bank of Japan (DDBJ).
• Most of the data in these databases are contributed directly by authors with a minimal level of annotation.
• A small number of sequences, especially those published in the 1980s, were entered manually from the published literature by database management staff.
• Presently, sequence submission to GenBank, EMBL, or DDBJ is a precondition for publication in most scientific journals, to ensure that the fundamental molecular data are made freely available.
• These three public databases closely collaborate and exchange new data daily.
• Together they constitute the International Nucleotide Sequence Database Collaboration.
• This means that by connecting to any one of the three databases, one should have access to the same nucleotide sequence data.
• Although the three databases all contain the same sets of raw data, each of the individual databases has a slightly different format to represent the data.
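The point about differing formats can be illustrated with a small sketch. GenBank-style flat files label lines with keywords such as LOCUS, DEFINITION, and ACCESSION, while EMBL-style flat files use two-letter line codes such as ID, AC, and DE. The two records below are simplified, hand-written imitations with a made-up accession (ZZ000000), not real entries, and the accession helper is a name chosen for this example.

```python
# Simplified, hand-written records; the accession ZZ000000 is made up.
GENBANK_STYLE = """\
LOCUS       EXAMPLE1      12 bp    DNA
DEFINITION  Hypothetical example record.
ACCESSION   ZZ000000
"""

EMBL_STYLE = """\
ID   ZZ000000; SV 1; linear; genomic DNA; STD; UNC; 12 BP.
AC   ZZ000000;
DE   Hypothetical example record.
"""


def accession(record):
    """Return the accession from either style of record, or None."""
    for line in record.splitlines():
        if line.startswith("ACCESSION"):
            # GenBank style: keyword followed by the accession.
            return line.split()[1]
        if line.startswith("AC "):
            # EMBL style: two-letter code, accession ends with ";".
            return line.split()[1].rstrip(";")
    return None


print(accession(GENBANK_STYLE))
print(accession(EMBL_STYLE))
```

The same underlying data item is recovered from both layouts, but the parsing rule differs for each one.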
Secondary databases
• Sequence annotation information in the primary databases is often minimal.
• To turn the raw sequence information into more sophisticated biological knowledge, much post-processing of the sequence information is needed.
• This creates the need for secondary databases, which contain computationally processed sequence information derived from the primary databases.
• Sequence annotations describe regions or sites of interest in the protein sequence, such as post-translational modifications, binding sites, enzyme active sites, local secondary structure, or other characteristics that are reported in the cited references or predicted.
• Annotation is the process of identifying and describing the locations and functions of genes and other biological features within a DNA, RNA, or protein sequence.
• The amount of computational processing work varies greatly among the secondary databases; some are simple archives of translated sequence data from identified open reading frames in DNA, whereas others provide additional annotation and information related to higher levels of information regarding structure and function.
• A prominent example is SWISS-PROT, which provides detailed sequence annotation that includes structure, function, and protein family assignment.
• The sequence data are mainly derived from TrEMBL, a database of translated nucleic acid sequences stored in the EMBL database.
• The annotation of each entry is carefully curated by human experts and thus is of good quality.
• The protein annotation includes function, domain structure, catalytic sites, cofactor binding, post-translational modification, metabolic pathway information, disease association, and similarity with other sequences.
• Much of this information is obtained from the scientific literature and entered by database curators.
• The annotation provides significant added value to each original sequence record. The data record also provides cross-referencing links to other online resources of interest.
• Other features, such as very low redundancy and a high level of integration with other primary and secondary databases, make SWISS-PROT very popular.

• PIR database = Protein Information Resource database
• Pfam = collection of protein families
• DALI = distance matrix alignment
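The simplest kind of secondary database mentioned in this section stores translations of open reading frames found in DNA. A minimal sketch of that step follows, assuming the standard genetic code, an uppercase A/C/G/T input, and a forward-strand-only search. The translate_first_orf function name and the test sequence are invented for the example, and real ORF finders handle all six reading frames and ambiguous bases.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)
}


def translate_first_orf(dna):
    """Translate from the first ATG to the first in-frame stop codon.

    Returns "" when there is no ATG or no in-frame stop codon.
    Assumes `dna` contains only uppercase A, C, G, T.
    """
    start = dna.find("ATG")
    if start == -1:
        return ""
    protein = []
    for i in range(start, len(dna) - 2, 3):
        aa = CODON_TABLE[dna[i:i + 3]]
        if aa == "*":  # stop codon closes the reading frame
            return "".join(protein)
        protein.append(aa)
    return ""  # ran off the end without meeting a stop codon


print(translate_first_orf("CCATGGCCAAATAGGG"))
```

Here the reading frame is ATG GCC AAA TAG, which translates to Met-Ala-Lys before the stop codon.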
Specialized databases
• Specialized databases normally serve a specific
research community or focus on a particular organism.
• The content of these databases may be sequences or
other types of information.
• The sequences in these databases may overlap with a
primary database but may also have new data
submitted directly by authors.
• Because they are often curated by experts in the field, they may have unique organizations and additional annotations associated with the sequences.
• Many genome databases that are taxonomy-specific fall within this category.
• Examples include FlyBase, WormBase, AceDB, and TAIR.
• WormBase is an online biological database about the biology and genome of the nematode model organism Caenorhabditis elegans and contains information about other related nematodes.
• AceDB is a biological database for handling genomic data. It was developed by Richard M. Durbin and Jean Thierry-Mieg in 1989. AceDB stands for "A C. elegans Database". Although AceDB was initially created as a database specifically for the nematode worm, the name has also come to mean the database software itself, which has been used to store information for other species.
• TAIR: The Arabidopsis Information Resource (TAIR) is a highly sophisticated, extensive, user-friendly, Web-based resource for researchers working on the model plant Arabidopsis thaliana.
• In addition, there are also specialized databases that contain original data derived from functional analysis.
• Examples are the GenBank EST (Expressed Sequence Tag) database and the Microarray Gene Expression Database at the European Bioinformatics Institute.
Interconnection between biological databases
• As mentioned, primary databases are primary repositories and distributors of raw sequence and structure information.
• They support nearly all other types of biological databases in a way akin to the Associated Press providing news feeds to local news media, which then tailor the news to suit their own particular needs. Therefore, in the biological community, there is a frequent need for the secondary and specialized databases to connect to the primary databases and to keep uploading sequence information.
• In addition, a user often needs to get information from both primary and secondary databases to complete a task because the information in a single database is often insufficient.
• Instead of letting users visit multiple databases, it is convenient for entries in a database to be cross-referenced and linked to related entries in other databases that contain additional information. All these create a demand for linking different databases.
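Cross-referencing between entries can be sketched with two in-memory dictionaries standing in for a primary and a secondary database. Every database name, identifier, and field below is hypothetical; the point is only that each entry stores the IDs of related entries elsewhere, so one lookup can pull in the linked information.

```python
# Hypothetical primary (nucleotide) and secondary (protein) databases.
PRIMARY = {
    "NUC0001": {
        "sequence": "ATGGCCAAATAG",
        "xrefs": {"protein_db": "PRO0001"},
    },
}
SECONDARY = {
    "PRO0001": {
        "protein": "MAK",
        "function": "hypothetical example protein",
        "xrefs": {"nucleotide_db": "NUC0001"},
    },
}
DATABASES = {"nucleotide_db": PRIMARY, "protein_db": SECONDARY}


def follow_xrefs(db_name, entry_id):
    """Return an entry together with every entry it links to directly."""
    entry = DATABASES[db_name][entry_id]
    linked = {
        target_db: DATABASES[target_db][target_id]
        for target_db, target_id in entry["xrefs"].items()
    }
    return entry, linked


entry, linked = follow_xrefs("nucleotide_db", "NUC0001")
print(linked["protein_db"]["function"])
```

A user who starts from the nucleotide entry reaches the protein annotation without searching the second database separately.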
Pitfalls of Biological databases
• One of the problems associated with biological databases is overreliance on sequence information and related annotations without understanding the reliability of the information.
• What is often ignored is the fact that there are many errors in sequence databases. Annotations of genes can also occasionally be false or incomplete. All these types of errors can be passed on to other databases, causing propagation of errors.
• Most errors in nucleotide sequences are caused by sequencing errors. Some of these errors cause frameshifts that make whole-gene identification difficult or protein translation impossible.
• Sometimes, gene sequences are contaminated with sequences from cloning vectors.
• Generally speaking, errors are more common for sequences produced before the 1990s; sequence quality has greatly improved since. Therefore, exceptional care should be taken when dealing with more dated sequences.
• Redundancy is another major problem affecting primary databases. There is tremendous duplication of information in the databases, for various reasons.
• Causes of redundancy include repeated submission of identical or overlapping sequences by the same or different authors, revision of annotations, dumping of expressed sequence tag (EST) data, and poor database management that fails to detect the redundancy.
• This makes some primary databases excessively large and unwieldy for information retrieval.
• Steps have been taken to reduce the redundancy. NCBI has created a nonredundant database called RefSeq, in which identical sequences from the same organism and associated sequence fragments are merged into a single entry.
• Protein sequences derived from the same DNA sequences are explicitly linked as related entries. Sequence variants from the same organism with very minor differences, which may well be caused by sequencing errors, are treated as distinctly related entries.
• This carefully curated database can be considered a secondary database.
• As mentioned, the SWISS-PROT database also has minimal redundancy for protein sequences compared to most other databases.
• Another way to address the redundancy problem is to create sequence-cluster databases such as UniGene that coalesce EST sequences derived from the same gene.
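The basic idea of merging identical sequences from the same organism into a single entry can be sketched as follows. This is a toy illustration only: the entry IDs, organisms, and sequences are invented, the merge_identical function is a name chosen for the example, and it handles exact duplicates only, not overlapping fragments or near-identical variants, which real curated resources also have to deal with.

```python
from collections import defaultdict


def merge_identical(entries):
    """Collapse entries that share organism and sequence into one record,
    remembering which source IDs were merged."""
    groups = defaultdict(list)
    for entry in entries:
        key = (entry["organism"], entry["sequence"].upper())
        groups[key].append(entry["id"])
    return [
        {"organism": org, "sequence": seq, "source_ids": sorted(ids)}
        for (org, seq), ids in groups.items()
    ]


# Hypothetical redundant submissions.
entries = [
    {"id": "A1", "organism": "Homo sapiens", "sequence": "ATGGCC"},
    {"id": "A2", "organism": "Homo sapiens", "sequence": "atggcc"},
    {"id": "A3", "organism": "Mus musculus", "sequence": "ATGGCC"},
]

merged = merge_identical(entries)
for record in merged:
    print(record)
```

A1 and A2 collapse into one record because organism and sequence match; A3 stays separate because it comes from a different organism.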
• The other common problem is erroneous annotations. Often the same gene sequence is found under different names, resulting in multiple entries and confusion about the data. Or conversely, unrelated genes bearing the same name are found in the databases. To alleviate the problem of naming genes, reannotation of genes and proteins using a set of common, controlled vocabulary to describe a gene or protein is necessary.
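Reannotation with a common vocabulary can be pictured as mapping every name found in the databases to one approved name. The sketch below assumes a hand-made synonym table; the gene names, entry IDs, and the group_by_approved_name helper are all hypothetical, and names missing from the table are deliberately left unresolved rather than guessed.

```python
# Hypothetical synonym table: every known name maps to one approved name.
SYNONYMS = {
    "GENE1": "GENE1",
    "geneA": "GENE1",
    "gA": "GENE1",
    "GENE2": "GENE2",
    "geneB": "GENE2",
}


def group_by_approved_name(entries):
    """Group entry IDs under the approved gene name.

    Entries whose name is not in the vocabulary are collected under
    the key None so a curator can review them.
    """
    grouped = {}
    for entry_id, name in entries:
        approved = SYNONYMS.get(name)
        grouped.setdefault(approved, []).append(entry_id)
    return grouped


# The same gene submitted under three different names, plus one unknown.
entries = [("E1", "geneA"), ("E2", "gA"), ("E3", "GENE1"), ("E4", "mystery")]
grouped = group_by_approved_name(entries)
print(grouped)
```

Three entries that looked unrelated by name end up under a single approved name, and the unrecognized name is flagged instead of being silently assigned.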
