Biological Databases:
A biological database is an organized collection of data whose contents can be easily accessed,
managed, and updated. Such databases give scientists access to sequence and structure data for
hundreds of thousands of sequences from a broad range of organisms, and they are an extensive
resource in support of biological research. There are different types of databases. Based on data
source, they are classified as:
1) Primary Databases: Experimentally derived data, such as nucleotide sequences, protein sequences,
or molecular structures, are stored here.
Examples: GenBank, PDB, DDBJ
2) Secondary Databases: These store results inferred from primary data through computational
analysis or manual curation.
Examples: PIR, SWISS-PROT and Pfam.
Based on Data type, they are:
3) Protein Databases: (Gerritsen, 2005) The most extensive sources of protein information are
protein sequence databases, which can be categorized into two types: Universal databases,
which aim to gather biological data across a wide range of species, and Specialized databases,
which focus on specific protein groups, families, or particular organisms.
4) Nucleotide Databases: (Lewitter, 1999) Nucleotide databases are described as essential
repositories that store the sequences of nucleotides (the building blocks of DNA and RNA)
from various organisms. These databases allow scientists to access, retrieve, and analyse
genetic information, which is crucial for understanding biological processes, genetic diversity,
and evolutionary relationships.
5) Structural Database: (Oliviero Carugo, 2002) It is a specialized type of database that stores
and organizes detailed 3D structural information about biological macromolecules, such as
proteins, nucleic acids, and protein-nucleic acid complexes. Unlike sequence databases,
which focus on linear genetic or protein sequences, structural databases emphasize the spatial
arrangement of atoms within a molecule, offering atomic-level detail that is crucial for
understanding biological mechanisms, like enzyme activity or molecular interactions.
Structural databases are essential in fields like structural biology and bioinformatics, as they
provide unique insights into molecular functions, such as identifying functionally important
motifs (e.g., the catalytic triad in enzymes), and allow for the visualization and analysis of
macromolecular interactions. One of the most well-known structural databases is the Protein
Data Bank (PDB), which was established in 1971 and contains extensive entries of
macromolecular structures derived from experimental methods like X-ray crystallography and
NMR spectroscopy.
6) Sequence Database: (C Harger, 2000) A sequence database is a specialized database that
stores nucleotide sequences (DNA or RNA) and provides access to associated biological,
bibliographic, and annotation information. The Genome Sequence Database (GSDB) is an
example of such a database, focusing on storing and providing access to publicly available
nucleotide sequences across various organisms.
Examples: UniProt, the Nucleotide database, the EMBL Nucleotide Sequence Database
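To make the contrast between sequence records and structure records concrete, the following is a minimal Python sketch, not the access method of any particular database. It parses one FASTA-formatted sequence record and one fixed-width ATOM line in the legacy PDB file format. The function names, the Atom type, and both example records are made up for illustration and are not real database entries.

```python
from typing import NamedTuple


class Atom(NamedTuple):
    name: str      # atom name, e.g. "CA" for the alpha-carbon
    residue: str   # three-letter residue name
    chain: str     # chain identifier
    res_seq: int   # residue number within the chain
    x: float       # Cartesian coordinates in angstroms
    y: float
    z: float


def parse_fasta_record(text: str) -> tuple[str, str]:
    """Return (header, sequence) for a single FASTA-formatted record."""
    lines = [ln.strip() for ln in text.strip().splitlines()]
    if not lines or not lines[0].startswith(">"):
        raise ValueError("a FASTA record must start with '>'")
    return lines[0][1:], "".join(lines[1:])


def parse_atom_line(line: str) -> Atom:
    """Parse one ATOM record using the fixed column positions of the PDB format."""
    return Atom(
        name=line[12:16].strip(),
        residue=line[17:20].strip(),
        chain=line[21],
        res_seq=int(line[22:26]),
        x=float(line[30:38]),
        y=float(line[38:46]),
        z=float(line[46:54]),
    )


# Hypothetical records for illustration only.
FASTA_EXAMPLE = """>example_1 hypothetical peptide
MKTAYIAKQR
QISFVKSHFS
"""
ATOM_EXAMPLE = "ATOM      2  CA  MET A   1      26.266  25.413   2.842  1.00 10.38           C"

header, sequence = parse_fasta_record(FASTA_EXAMPLE)
atom = parse_atom_line(ATOM_EXAMPLE)
print(header, len(sequence))
print(atom)
```

The sequence record reduces to a header plus a linear string of residues, whereas each structure record carries the position of a single atom in space.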
Sequence Alignment:
(P. Haritha, 2018) Sequence alignment is essential for detecting similarities among protein,
DNA, and RNA sequences, which aids the study of their evolutionary relationships and sheds light on
the connections among groups of related proteins. In a protein sequence alignment, the sequences are
written as rows and their amino acids are compared column by column. The alignment tool places
matching amino acids in the same column, inserting gaps where needed.
Sequence alignment has a wide range of applications, including sequence assembly, gene and protein
annotation, structural and functional predictions, as well as phylogenetic and evolutionary research.
The primary types of alignment are Pairwise Alignment, which compares two sequences; Multiple
Sequence Alignment, which involves more than two sequences to find similarities and conserved
regions; and Structural Alignment, based on structural features.
Sequence alignment software typically inserts gaps between nucleotide or amino acid residues in the
sequences so that similar sites line up in the same columns. The result is a character matrix, with
the rows representing the sequences and the columns reflecting aligned positions across those
sequences.
Pairwise Alignment: This method compares two sequences to find regions of similarity, including
exact matches. The two main approaches are the Dot Plot (dot matrix) method and Dynamic Programming.
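Both pairwise approaches can be sketched in a few lines of Python. The dot plot marks every pair of identical residues, and the dynamic programming routine is a basic Needleman-Wunsch global alignment. The scoring values (match +1, mismatch -1, gap -2) are assumptions chosen for illustration. Practical tools use substitution matrices and more elaborate gap penalties.

```python
def dot_plot(a: str, b: str) -> list[str]:
    """Rows follow `a`, columns follow `b`; '*' marks identical residues."""
    return ["".join("*" if x == y else "." for y in b) for x in a]


def needleman_wunsch(a: str, b: str, match: int = 1, mismatch: int = -1,
                     gap: int = -2) -> tuple[int, str, str]:
    """Global alignment by dynamic programming; returns (score, aligned_a, aligned_b)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # First column and first row: aligning a prefix against nothing costs only gaps.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    # Fill the matrix: each cell takes the best of diagonal, up, and left moves.
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Traceback from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[-1][-1], "".join(reversed(out_a)), "".join(reversed(out_b))


for row in dot_plot("ACG", "ACT"):
    print(row)
print(needleman_wunsch("ACGT", "ACT"))
```

The dot plot is useful for visual inspection of diagonals of similarity, while dynamic programming guarantees an optimal alignment under the chosen scoring scheme.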
Multiple Sequence Alignment: Multiple Sequence Alignment (MSA) is a technique used to align more
than two sequences at once, helping to identify conserved regions that appear consistently across
multiple sequences. This alignment is particularly useful for constructing phylogenetic trees, which
represent the evolutionary relationships among various sequences. There are two main methods for
performing MSA: Progressive alignment, which builds alignments in stages, and Iterative alignment,
which refines the alignment through repeated adjustments.
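As a small illustration of how conserved regions are read off an alignment, the sketch below takes an already aligned set of sequences (a character matrix with one row per sequence) and scores each column. It does not perform progressive or iterative alignment itself. The function names, the conservation measure, and the example rows are assumptions made for illustration.

```python
from collections import Counter


def column_conservation(alignment: list[str]) -> list[float]:
    """Per column, the fraction of sequences sharing the most common non-gap residue."""
    if len({len(row) for row in alignment}) != 1:
        raise ValueError("all aligned rows must have the same length")
    n = len(alignment)
    scores = []
    for column in zip(*alignment):
        counts = Counter(ch for ch in column if ch != "-")
        top = counts.most_common(1)[0][1] if counts else 0
        scores.append(top / n)
    return scores


def conserved_columns(alignment: list[str], threshold: float = 1.0) -> list[int]:
    """Indices (0-based) of columns whose conservation reaches the threshold."""
    return [i for i, s in enumerate(column_conservation(alignment)) if s >= threshold]


# Hypothetical aligned sequences; '-' denotes a gap.
msa = ["MK-LV", "MKALV", "MR-LI"]
print(column_conservation(msa))
print(conserved_columns(msa))
```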
Protein:
The process of converting RNA sequences into amino acid sequences is known as translation. DNA is
first transcribed into RNA, which is then translated into a protein sequence. Because most amino
acids are encoded by more than one codon, a protein sequence cannot be traced back unambiguously to
the original DNA sequence. These sequences consist of amino acids. Protein synthesis occurs in three
phases: Initiation, where the AUG initiator codon is located; Elongation, in which the polypeptide
(a chain of amino acids) grows; and Termination, in which a stop codon ends synthesis and the
polypeptide is released.
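The flow from DNA to RNA to protein can be sketched in Python as follows. This is a simplified model under stated assumptions: the input is the coding strand, it contains only the letters A, C, G, and T, translation starts at the first AUG, and the standard genetic code applies. The example DNA string is made up for illustration.

```python
from itertools import product

# Standard genetic code, with codons enumerated in U, C, A, G order; '*' marks a stop codon.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def transcribe(coding_strand: str) -> str:
    """Return the mRNA corresponding to a DNA coding strand (T replaced by U)."""
    return coding_strand.upper().replace("T", "U")


def translate(mrna: str) -> str:
    """Translate from the first AUG until a stop codon or the end of the sequence."""
    start = mrna.find("AUG")          # initiation: locate the initiator codon
    if start == -1:
        return ""
    protein = []
    for i in range(start, len(mrna) - 2, 3):   # elongation: one codon at a time
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":                   # a stop codon ends the chain
            break
        protein.append(amino_acid)
    return "".join(protein)


dna = "GGATGGTGCACCTGACTCCTGAGTAAGC"   # hypothetical coding-strand fragment
mrna = transcribe(dna)
print(mrna)
print(translate(mrna))

# Several codons map to the same amino acid, so the reverse mapping is not unique.
leucine_codons = sorted(c for c, aa in CODON_TABLE.items() if aa == "L")
print(leucine_codons)
```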
(LaPelusa & Kaushik, 2022) Proteins, often termed the cell’s workhorses, play vital roles in
providing structural support, facilitating movement, driving metabolism, regulating gene expression,
and enabling cell-environment interactions. Although they vary in shape and size, all proteins are
composed of the same fundamental building blocks.
Proteins are formed from twenty standard amino acids. Each amino acid contains an amino group
(NH3+), a carboxylate group (COO−), and a variable side chain, or R group, all attached to a central
carbon, known as the α-carbon. At physiological pH, amino acids carry both positive and negative
charges: the amino group is protonated, while the carboxyl group is deprotonated.
Amino acids are generally categorized based on their R groups: hydrophobic, polar (hydrophilic), or
charged. Within proteins, amino acids are linked via peptide bonds, and the sequence and nature of
these amino acids ultimately determine the protein's chemical and physical properties. The
arrangement of amino acids forms what is known as the protein's primary structure, while the folding
patterns due to backbone interactions constitute the secondary structure. The full 3D shape of the
protein, including the spatial arrangement of all its atoms, is called the tertiary structure. Proteins with
multiple polypeptide chains exhibit a quaternary structure, essential for stability and functionality.
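The grouping of amino acids by R group can be illustrated with a short Python sketch that summarizes the composition of a sequence. The assignment of residues to the three classes below is one common textbook grouping and is an assumption here. Sources differ on borderline residues such as glycine, cysteine, tyrosine, and histidine. The example peptide is made up.

```python
from collections import Counter

# One possible grouping of the twenty standard amino acids (one-letter codes).
R_GROUP_CLASSES = {
    "hydrophobic": "AVLIMFWPG",
    "polar": "STCYNQ",
    "charged": "DEKRH",
}
CLASS_OF = {res: cls for cls, residues in R_GROUP_CLASSES.items() for res in residues}


def r_group_composition(sequence: str) -> dict[str, int]:
    """Count residues per R-group class; raises KeyError on non-standard letters."""
    counts = Counter(CLASS_OF[res] for res in sequence.upper())
    return {cls: counts.get(cls, 0) for cls in R_GROUP_CLASSES}


print(r_group_composition("MKTAYIAKQR"))   # hypothetical peptide
```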
Fundamentals of Protein Structure: The structural hierarchy of proteins—primary, secondary, tertiary,
and quaternary—is fundamental to their diverse roles within organisms.
Primary Structure: (LaPelusa & Kaushik, 2022) A protein’s primary structure, the unique sequence
of amino acids, is crucial to its function. For instance, a single amino acid change in haemoglobin
(replacement of glutamic acid with valine at the sixth position in β-globin) results in sickle-cell
anaemia. This linear sequence forms the foundation upon which secondary, tertiary, and quaternary
structures develop.
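The sickle-cell substitution can be written out on the sequence itself. The sketch below applies a point substitution in the usual "E6V" shorthand to the first eight residues of mature β-globin. The helper name is hypothetical, and the numbering follows the traditional convention that counts from the mature protein, after removal of the initiator methionine.

```python
import re


def apply_substitution(sequence: str, mutation: str) -> str:
    """Apply a point substitution such as 'E6V' (reference residue, 1-based position, new residue)."""
    parsed = re.fullmatch(r"([A-Z])(\d+)([A-Z])", mutation)
    if parsed is None:
        raise ValueError(f"unrecognized mutation format: {mutation!r}")
    ref, pos, alt = parsed.group(1), int(parsed.group(2)), parsed.group(3)
    if not 1 <= pos <= len(sequence) or sequence[pos - 1] != ref:
        raise ValueError(f"{ref} not found at position {pos}")
    return sequence[:pos - 1] + alt + sequence[pos:]


# First eight residues of mature human beta-globin (initiator Met already removed).
BETA_GLOBIN_START = "VHLTPEEK"
sickle = apply_substitution(BETA_GLOBIN_START, "E6V")
print(BETA_GLOBIN_START, "->", sickle)
```

Checking the reference residue before substituting guards against off-by-one errors when different numbering conventions are mixed.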
Secondary Structure: (LaPelusa & Kaushik, 2022) Secondary structure is stabilized by interactions
between the backbone’s peptide groups, particularly hydrogen bonds. The most common secondary
structures are the α-helix (a coiled configuration) and the β-pleated sheet (flat, folded segments). In an
α-helix, hydrogen bonds form between the backbone C=O of one residue and the N-H of the residue four
positions further along the chain, while β-pleated sheets form when hydrogen bonds link adjacent
strands, which may be far apart in the sequence and connected by loops. The primary sequence largely
dictates whether a segment adopts an α-helix or β-pleated sheet conformation. Certain amino acids,
such as proline, are less common in α-helices because proline's ring-shaped side chain is bonded to
the backbone nitrogen, leaving no N-H group to donate a hydrogen bond.
Tertiary Structure: (LaPelusa & Kaushik, 2022) The tertiary structure is the overall 3D shape that
results from various interactions between R-groups and between R-groups and the backbone. These
interactions include hydrogen bonds, hydrophobic effects, van der Waals interactions, covalent
disulfide bridges, and ionic bonds. As the protein folds, nonpolar amino acids often cluster at its core,
stabilized by van der Waals forces, while hydrogen bonds and ionic interactions further solidify the
structure. The primary structure strongly influences the protein’s final shape, which is vital to its
functionality. Denaturation and renaturation experiments, such as those conducted with ribonuclease,
highlight the primary structure’s role in protein folding and stability. Proteins can undergo folding
with the assistance of chaperonin molecules, which shield them from disruptive cellular conditions,
aiding in the attainment of the correct 3D structure. Improper folding is linked to various genetic
disorders.
Quaternary Structure: (LaPelusa & Kaushik, 2022) While primary, secondary, and tertiary
structures involve single polypeptides, some proteins consist of multiple subunits held together by
interactions similar to those seen in the tertiary structure. This assembly, or quaternary structure,
allows such proteins to function cohesively as multi-unit complexes.
FASTA Tool
(P. Haritha, 2018) FASTA, introduced in 1988 as an enhanced version of the FASTP tool from 1985,
provides efficient comparison of protein and nucleotide sequences. It can search DNA as well as
protein sequences and assess the statistical significance of matches. Related programs in the FASTA
package include TFASTX and TFASTY (which compare a protein query against a translated DNA library)
and FASTX and FASTY (which compare a translated DNA query against a protein database). Using a
heuristic algorithm, FASTA first identifies regions of identical residues shared by the sequences. It
then rescores the top regions with a substitution matrix such as PAM-250, joins high-scoring
diagonals, and introduces gaps to achieve an optimal alignment score through the Smith-Waterman
algorithm. FASTA output has four components: database information, a score distribution histogram,
matched sequences with statistical data, and the aligned sequences.
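The first stage of this heuristic, finding identical words and grouping them by diagonal, can be sketched in Python as follows. This is only an illustration of the idea and not the FASTA program's implementation. Rescoring with a substitution matrix, joining diagonals, and the final Smith-Waterman step are left out, and the function names, the word length ktup=2, and the example sequences are assumptions.

```python
from collections import defaultdict


def word_positions(seq: str, ktup: int) -> dict[str, list[int]]:
    """Index every word of length ktup in seq by its start positions."""
    index = defaultdict(list)
    for i in range(len(seq) - ktup + 1):
        index[seq[i:i + ktup]].append(i)
    return index


def diagonal_hits(query: str, subject: str, ktup: int = 2) -> dict[int, int]:
    """Count identical words per diagonal, where diagonal = query offset - subject offset."""
    index = word_positions(subject, ktup)
    hits = defaultdict(int)
    for i in range(len(query) - ktup + 1):
        for j in index.get(query[i:i + ktup], ()):
            hits[i - j] += 1
    return dict(hits)


def best_diagonals(query: str, subject: str, ktup: int = 2, top: int = 3) -> list[tuple[int, int]]:
    """Return up to `top` (diagonal, hit count) pairs, densest diagonals first."""
    hits = diagonal_hits(query, subject, ktup)
    return sorted(hits.items(), key=lambda kv: (-kv[1], kv[0]))[:top]


# Hypothetical sequences: the query occurs in the subject, shifted by two positions.
print(best_diagonals("ACDEFG", "XXACDEFG"))
```

Diagonals with many word hits correspond to ungapped regions of similarity, which is why the later stages concentrate on them.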
BLAST Tool
(P. Haritha, 2018) BLAST (Basic Local Alignment Search Tool) enables efficient comparisons between
protein or nucleotide sequences. Its variants include megaBLAST (highly similar nucleotide
sequences), BLASTN (more distant nucleotide sequences), BLASTP (protein-protein comparisons), and
BLASTX (translated nucleotide queries against protein databases), among others. PSI-BLAST builds
Position-Specific Scoring Matrices (PSSMs) iteratively to refine protein database searches, while
RPS-BLAST and DELTA-BLAST offer rapid searches using PSSMs: RPS-BLAST searches a query against a
database of precomputed PSSMs, and DELTA-BLAST uses such domain matches to build a PSSM before
searching a protein database. BLAST conducts local alignment in three phases: Setup (generating
words from the query), Preliminary Search (scanning the database for word matches and scoring their
extensions), and Traceback (producing gapped alignments for the best hits). Because it relies on
heuristics, BLAST runs much faster than full dynamic programming methods, and it can report multiple
local alignments for a pair of sequences.
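The seed-and-extend idea behind these phases can be sketched in Python. This is a simplified illustration under stated assumptions, not BLAST's actual implementation: words must match exactly, scoring is match +1 and mismatch -1, extension is ungapped and stops once the score falls x_drop below the best seen, and the gapped traceback and statistics are omitted. All function names and example sequences are hypothetical.

```python
def query_words(query: str, w: int) -> dict[str, list[int]]:
    """Setup: index every word of length w in the query by its start positions."""
    words: dict[str, list[int]] = {}
    for i in range(len(query) - w + 1):
        words.setdefault(query[i:i + w], []).append(i)
    return words


def find_seeds(query: str, subject: str, w: int) -> list[tuple[int, int]]:
    """Scan the subject for exact word matches; returns (query_pos, subject_pos) pairs."""
    words = query_words(query, w)
    return [(qi, si)
            for si in range(len(subject) - w + 1)
            for qi in words.get(subject[si:si + w], [])]


def _extend(query, subject, q, s, step, match, mismatch, x_drop):
    """Walk in one direction; return (best score gain, residues kept at that best)."""
    score = best = kept = length = 0
    while 0 <= q < len(query) and 0 <= s < len(subject):
        score += match if query[q] == subject[s] else mismatch
        length += 1
        if score > best:
            best, kept = score, length
        elif best - score >= x_drop:
            break
        q += step
        s += step
    return best, kept


def extend_seed(query, subject, qi, si, w, match=1, mismatch=-1, x_drop=2):
    """Ungapped extension of a seed; returns (score, q_start, q_end, s_start)."""
    right_gain, right_len = _extend(query, subject, qi + w, si + w, 1, match, mismatch, x_drop)
    left_gain, left_len = _extend(query, subject, qi - 1, si - 1, -1, match, mismatch, x_drop)
    score = w * match + left_gain + right_gain
    return score, qi - left_len, qi + w + right_len, si - left_len


query, subject = "GGACGTAC", "TTACGTACTT"
seeds = find_seeds(query, subject, w=4)
print(seeds)
score, q_start, q_end, s_start = extend_seed(query, subject, *seeds[0], w=4)
print(score, query[q_start:q_end], subject[s_start:s_start + (q_end - q_start)])
```

Restricting the expensive alignment work to the neighborhood of seeds is what makes this strategy fast, at the cost of possibly missing similarities that contain no matching word.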
Bibliography
C Harger, G. C. (2000). The Genome Sequence DataBase. Nucleic Acids Research, 31-32.
Gerritsen, V. B. (2005). Protein Databases. Encyclopedia of Life Sciences, 1-7.
LaPelusa, A., & Kaushik, R. (2022). Physiology, Proteins. StatPearls Publishing.
Lewitter, A. P. (1999). Nucleotide sequence databases: a gold mine for biologists. Trends in
Biochemical Sciences, 276-280.
Oliviero Carugo, S. P. (2002). The evolution of structural databases. Trends in Biotechnology,
498-501.
P. Haritha, P. (2018). A Comprehensive Review on Protein Sequence Analysis. International Journal
of Computer Sciences and Engineering, 1433-1442.