Bioinformatics
Transcriptomics is the study of the transcriptome: the complete set of ribonucleic acid (RNA)
molecules (called transcripts) expressed in a given entity, such as a cell, tissue or organism,
under specific conditions or at a specific time. It provides insights into gene expression
patterns, regulatory mechanisms, and functional elements of the genome. Here's a detailed
overview of transcriptomics:
Goals of Transcriptomics
Techniques in Transcriptomics
1. Microarrays:
o Hybridization-based technology to measure the expression of
thousands of genes simultaneously.
o Limited to pre-designed probes and less sensitive for low-
abundance transcripts.
2. RNA Sequencing (RNA-Seq):
o High-throughput sequencing of cDNA derived from RNA.
o Provides a comprehensive and quantitative view of the
transcriptome.
o Can detect novel transcripts, splice variants, and non-coding
RNAs.
3. Single-Cell RNA-Seq:
o Measures gene expression at the single-cell level.
o Reveals cell-to-cell variability and identifies rare cell types.
4. Quantitative PCR (qPCR):
o Quantifies specific RNA transcripts with high sensitivity and
accuracy.
o Often used to validate results from RNA-Seq or microarrays.
5. Long-Read Sequencing:
o Technologies like PacBio or Oxford Nanopore allow sequencing
of full-length RNA transcripts.
o Useful for studying splice variants and complex transcript
structures.
Applications of Transcriptomics
Challenges in Transcriptomics
1. Data Complexity:
o Transcriptomic data is vast and requires advanced
computational tools for analysis.
2. RNA Stability:
o RNA is less stable than DNA, making sample handling and
preparation critical.
3. Alternative Splicing:
o Detecting and quantifying splice variants can be challenging.
4. Single-Cell Variability:
o Single-cell transcriptomics requires specialized techniques to
handle low RNA quantities and technical noise.
Transcriptomics Workflow
Summary
2. Medical informatics can be concisely defined as “the rapidly developing scientific field that deals
with the storage, retrieval, and optimal use of biomedical information, data, and knowledge for
problem solving and decision making”.
3. Comparative genomics is the branch of bioinformatics which determines the genomic structure
and function relation between different biological species. For this purpose, intergenomic maps are
constructed which enable the scientists to trace the processes of evolution that occur in genomes of
different species. These maps contain the information about the point mutations as well as the
information about the duplication of large chromosomal segments.
4. Conserved sequences are sequences which persist in the genome despite forces such as random
mutation, deletion and chromosomal rearrangement, and which have slower rates of mutation than the
background mutation rate. Conservation can occur in coding and non-coding nucleic acid sequences.
5. Artificial intelligence is a field of science concerned with building
computers and machines that can reason, learn, and act in such a way that would normally require
human intelligence or that involves data whose scale exceeds what humans can analyze.
6. Put simply, genomics is the study of an organism's genome – its genetic material – and how that
information is applied. All living things, from single-celled bacteria to multi-cellular plants, animals
and humans, have a genome, and ours is made up of DNA. Genomics is any attempt to analyze or
compare the entire genetic complement of one species or of several species. It is, of course, possible
to compare genomes by comparing more-or-less representative subsets of genes within genomes.
7. In genome annotation, genomes are marked up to identify regulatory sequences and protein-coding
regions. It is a very important part of the Human Genome Project as it determines the regulatory
sequences.
8. Comparative Studies: Analysing and comparing the genetic material of different species is an
important method for studying the functions of genes, the mechanisms of inherited diseases and
species evolution. Bioinformatics tools can be used to make comparisons between the numbers,
locations and biochemical functions of genes in different organisms. Organisms that are suitable for
use in experimental research are termed model organisms. They have a number of properties that
make them ideal for research purposes: short life spans, rapid reproduction, ease of handling, low
cost, and the ability to be manipulated at the genetic level. An example of a model organism for
human biology is the mouse. Mouse and human are very closely related (>98%) and for the most part we
see a one-to-one correspondence between genes in the two species. Manipulation of the mouse at
the molecular level and genome comparisons between the two species are revealing detailed
information on the functions of human genes, the evolutionary relationship between the two species
and the molecular mechanisms of many human diseases.
9. Proteomics: Proteomics is the study of proteins - their location, structure and function. It is the
identification, characterization and quantification of all proteins involved in a particular pathway,
organelle, cell, tissue, organ or organism that can be studied in concert to provide accurate and
comprehensive data about that system. Proteomics is the study of the function of all expressed
proteins. The study of the proteome, called proteomics, now evokes not only all the proteins in any
given cell, but also the set of all protein isoforms and modifications, the interactions between them,
the structural description of proteins and their higher-order complexes, and for that matter almost
everything 'post-genomic'.
11. Bioinformatics is the field of science in which biology, computer science, and information
technology merge into a single discipline. There are three important sub-disciplines within
bioinformatics: the development of new algorithms and statistics with which to assess relationships
among members of large data sets; the analysis and interpretation of various types of data including
nucleotide and amino acid sequences, protein domains, and protein structures; and the
development and implementation of tools that enable efficient access and management of different
types of information.
12. There are three important sub-disciplines within bioinformatics: the development of new
algorithms and statistics with which to assess relationships among members of large data sets; the
analysis and interpretation of various types of data including nucleotide and amino acid sequences,
protein domains, and protein structures; and the development and implementation of tools that
enable efficient access and management of different types of information.
13. A search engine is a software program that helps you find information on the internet. You can
use a search engine to search for websites or content that matches a keyword or phrase you enter.
14. The coding region of a gene, also known as the coding DNA sequence (CDS), is the portion of a
gene's DNA or RNA that codes for a protein. Studying the length, composition, regulation, splicing,
structures, and functions of coding regions compared to non-coding regions over different species
and time periods can provide a significant amount of important information regarding gene
organization and evolution of prokaryotes and eukaryotes. This can further assist in mapping
the human genome and developing gene therapy.
16. Genes and Coding Sequences (CDSs) are both fundamental concepts in genetics and molecular
biology, but they refer to different aspects of DNA and its function. Here's a breakdown of their
differences:
Gene
A gene is a segment of DNA that contains the instructions for producing a functional product,
such as a protein or a functional RNA molecule (e.g., tRNA, rRNA, or regulatory RNAs).
Genes include both coding regions (exons) and non-coding regions (introns, promoters,
enhancers, and other regulatory elements).
A gene can produce multiple products through processes like alternative splicing.
Coding Sequence (CDS)
The CDS refers specifically to the portion of a gene's DNA or RNA sequence that codes for a
protein.
It includes only the exons that are translated into amino acids, excluding introns and
untranslated regions (UTRs).
The CDS begins with a start codon (usually AUG) and ends with a stop codon (UAA, UAG, or
UGA in RNA).
The CDS is the part of the gene that is directly translated into a protein during gene
expression.
Key Differences (Aspect: Gene vs. CDS):
Definition
  Gene: A segment of DNA containing instructions for a functional product (protein or RNA).
  CDS: The portion of a gene's DNA or RNA that codes for a protein.
Components
  Gene: Includes exons, introns, promoters, enhancers, and other regulatory regions.
  CDS: Includes only exons that are translated into protein.
Function
  Gene: Encodes proteins or functional RNAs.
  CDS: Directly specifies the amino acid sequence of a protein.
Size
  Gene: Larger, as it includes both coding and non-coding regions.
  CDS: Smaller, as it includes only the coding regions.
Start and End
  Gene: Begins at the transcription start site and ends at the transcription termination site.
  CDS: Begins at the start codon and ends at the stop codon.
Example:
For any given protein-coding gene, the CDS would only include the exon sequence that is translated
into protein, from the start codon to the stop codon.
In summary, a gene is a broader concept that includes the CDS as one of its components. The CDS is
the specific part of the gene that directly encodes the protein.
17. The Coding Sequence (CDS) is a specific portion of a gene's DNA or RNA sequence that directly
encodes the amino acid sequence of a protein. Here’s a detailed definition:
The CDS is the part of a gene's nucleotide sequence that is translated into protein.
It consists of a series of codons (three-nucleotide sequences) that specify the amino acids to
be incorporated into the protein during translation.
The CDS begins with a start codon (usually AUG in RNA, which corresponds to ATG in DNA)
and ends with a stop codon (UAA, UAG, or UGA in RNA; TAA, TAG, or TGA in DNA).
The CDS excludes:
o Introns (non-coding regions within the gene that are removed during RNA splicing).
o Untranslated regions (UTRs) (regions at the 5' and 3' ends of the mRNA that are not
translated into protein but may play regulatory roles).
1. Start Codon: Marks the beginning of the CDS (e.g., AUG in RNA).
2. Stop Codon: Marks the end of the CDS (e.g., UAA, UAG, or UGA in RNA).
3. Exons Only: The CDS is composed of the protein-coding portions of exons; introns are excluded.
4. Directly Translated: The CDS is the part of the mRNA that is read by the ribosome to
synthesize a protein.
Example:
(Only the exons between the start and stop codons are part of the CDS.)
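As an illustration of the start-codon and stop-codon rule described above, here is a minimal Python sketch. It is not the original example from these notes: the toy mRNA, the function name find_cds and the choice to start at the first AUG are all assumptions made for illustration (real start-site selection is more subtle).

```python
STOP_CODONS = {"UAA", "UAG", "UGA"}


def find_cds(mrna):
    """Return the CDS of a spliced mRNA, or None if no complete CDS is found.

    Simplifying assumption: translation starts at the first AUG and stops at
    the first in-frame stop codon.
    """
    mrna = mrna.upper()
    start = mrna.find("AUG")
    if start == -1:
        return None  # no start codon at all
    # Walk codon by codon in the reading frame set by the start codon.
    for i in range(start, len(mrna) - 2, 3):
        if mrna[i:i + 3] in STOP_CODONS:
            return mrna[start:i + 3]  # include the stop codon itself
    return None  # no in-frame stop codon


# Hypothetical toy transcript: 5' UTR + CDS + 3' UTR (introns already spliced out).
five_prime_utr = "GGCACC"
coding = "AUGGCUUGGAAAUAA"
three_prime_utr = "CCGUUU"
mrna = five_prime_utr + coding + three_prime_utr

print(find_cds(mrna))  # the UTRs on both sides are left out
```

The function returns only the stretch from AUG to the stop codon, so the 5' and 3' UTRs never appear in the result.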
Importance of CDS:
The CDS is critical for determining the amino acid sequence of a protein, which in turn
determines the protein's structure and function.
In summary, the CDS is the portion of a gene that is directly translated into protein, starting at the
start codon and ending at the stop codon. It excludes introns and untranslated regions.
18. The terms gene and genome are fundamental in genetics, but they
refer to different levels of genetic organization. Here's a clear explanation
of their differences:
Gene
Genome
A genome is the complete set of genetic material (DNA or RNA) in
an organism.
It includes all the genes as well as non-coding sequences (e.g.,
regulatory elements, repetitive DNA, and intergenic regions).
The genome represents the entirety of an organism's hereditary
information.
In humans, the genome consists of approximately 3.2 billion base
pairs of DNA, distributed across 23 pairs of chromosomes.
The genome also includes the DNA in mitochondria (in eukaryotes)
or plasmids (in some bacteria).
Example: The human genome contains about 20,000–25,000
genes, along with vast amounts of non-coding DNA.
Analogy
Summary
1) FASTA, which is acceptable for one or more sequences. Please use the FASTA format that
starts with a definition line, followed with a hard return and the sequence. The simplest
Example:
CCTTTAT...
GGTAGGT...
2) The alignment format, which is acceptable for multiple sequences from the same locus or
same genomic region. Accepted alignment formats include FASTA+GAP, Nexus, Phylip,
and Clustal(w).
All sequence files must be in plain text using ASCII characters only. Use IUPAC codes for
your sequences.
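Before submitting, it can help to check a FASTA file locally. The sketch below is only an illustration, not part of any submission tool: the function names and the toy sequences are hypothetical. It assumes each record starts with a definition line beginning with ">" and that the sequence ID is the first word on that line.

```python
# IUPAC nucleotide codes (unambiguous bases plus ambiguity codes).
IUPAC_NUCLEOTIDES = set("ACGTURYSWKMBDHVN")


def parse_fasta(text):
    """Parse FASTA text into a dict mapping sequence ID -> sequence string."""
    records = {}
    seq_id = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:].split()
            if not header:
                raise ValueError("definition line has no sequence ID")
            seq_id = header[0]
            records[seq_id] = []
        elif seq_id is None:
            raise ValueError("sequence data found before any definition line")
        else:
            records[seq_id].append(line)
    return {name: "".join(parts) for name, parts in records.items()}


def non_iupac_characters(sequence):
    """Return the sorted list of characters that are not IUPAC nucleotide codes."""
    return sorted(set(sequence.upper()) - IUPAC_NUCLEOTIDES)


# Hypothetical two-record file; the second sequence spans two lines.
example = ">Seq1 toy record\nCCTTTATCTAATC\n>Seq2\nGGTAGGTGAACC\nTGCGGAAGGATC\n"
for name, seq in parse_fasta(example).items():
    print(name, len(seq), non_iupac_characters(seq))
```

An empty list from non_iupac_characters means the sequence uses only IUPAC nucleotide codes; alignment gap characters would be reported as non-IUPAC by this check.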
Source modifiers will be requested as part of submission and use a controlled vocabulary to
describe how, when, and where you obtained your samples. You can also uniquely identify
your samples from the same organism with source modifier such as isolate, clone, strain or
specimen voucher.
You will be asked to provide values for certain source modifiers based on your organism.
Source modifiers can be provided through the web form or through a tab-delimited table.
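A tab-delimited source modifier table can be generated with a few lines of Python. This is a sketch under assumptions: the "Sequence_ID" header, the column layout and the sample values are hypothetical, so check the submission tool's own documentation for the exact headers it expects. The modifier names (isolate, strain) are taken from the text above.

```python
import csv
import io


def source_modifier_table(rows, modifiers):
    """Build a tab-delimited table: one row per sequence, one column per modifier.

    rows maps a sequence ID to a dict of modifier -> value; missing values
    are written as empty cells.
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(["Sequence_ID"] + list(modifiers))  # assumed header name
    for seq_id, values in rows.items():
        writer.writerow([seq_id] + [values.get(m, "") for m in modifiers])
    return buffer.getvalue()


# Hypothetical samples from the same organism, told apart by isolate/strain.
table = source_modifier_table(
    {"Seq1": {"isolate": "A1", "strain": "X"}, "Seq2": {"isolate": "B7"}},
    ["isolate", "strain"],
)
print(table)
```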
For simple annotation (e.g. same feature for all sequences), follow the web form's
instructions.
Provide feature intervals based on the sequence(s) you are submitting. For protein-coding
sequences, annotate the coding regions (CDS) on your sequence(s), whether they are partial
or complete.
If you submitted an alignment, you will have an option to 'Propagate features' from a single
sequence (longest sequence recommended) to the other sequences in your submission. You
will have the option to manually edit or remove features after propagation.
Not providing complete feature annotation will delay accession number assignment and
processing.
20. Global alignment: Global alignment is a method of comparing two sequences which aligns the
entire length of the sequences by maximizing the overall similarity. This method is used when
comparing sequences that are of approximately the same length.
• In global alignment, an attempt is made to align the entire sequence, using as many characters
as possible, up to both ends of each sequence.
• Sequences that are quite similar and approximately the same length are suitable candidates for
global alignment.
• The global alignment is stretched over the entire sequence length to include as many matching
amino acids as possible up to and including the sequence ends.
• Vertical bars between the sequences indicate the presence of identical amino acids.
• Although there is an obvious region of identity in this example (the sequence GKG preceded by a
commonly observed substitution of T for A), a global alignment may not align such regions so that
more amino acids along the entire sequence lengths can be matched.
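A common way to compute a global alignment is the Needleman-Wunsch dynamic programming method. The notes do not name an algorithm, so treat this as one possible illustration; the scoring values (match +1, mismatch -1, gap -1) and the function name are assumptions.

```python
def global_align(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch global alignment with a linear gap penalty.

    Returns (score, aligned_a, aligned_b); '-' marks a gap.
    """
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    # Aligning a prefix against nothing costs one gap per character.
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,   # align a[i-1] with b[j-1]
                              score[i - 1][j] + gap,       # gap in b
                              score[i][j - 1] + gap)       # gap in a
    # Trace back from the bottom-right corner to the top-left corner,
    # so both sequences are covered end to end.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(top)), "".join(reversed(bottom))


result = global_align("ACGT", "AGT")
print(result)
```

Because the traceback always runs from the last cell to the first, every character of both sequences appears in the output, which is the defining property of a global alignment.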
21. Local alignment: In local alignment, instead of attempting to align the entire length of the
sequences, only the regions with the highest density of matches are aligned. This is useful for
identifying short conserved regions in protein or nucleotide sequences.
• In local alignment, stretches of sequence with the highest density of matches are aligned, thus
generating one or more islands of matches or subalignments in the aligned sequences.
• Local alignments are more suitable for aligning sequences that are similar along some of their
lengths but dissimilar in others, sequences that differ in length, or sequences that share a
conserved region or domain.
• In a local alignment, the alignment stops at the ends of regions of identity or strong similarity,
and a much higher priority is given to finding these local regions than to extending the alignment
to include more neighboring amino acid pairs.
• Dashes indicate sequence not included in the alignment. This type of alignment favors finding
conserved nucleotide patterns in DNA sequences or conserved amino acid patterns in protein
sequences.
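Local alignment is commonly computed with the Smith-Waterman method, which is the same kind of dynamic programming as global alignment except that cell scores are never allowed to fall below zero. The notes do not name an algorithm, so this is an illustrative sketch; the scoring values and the toy sequences (which share only the short GKG region) are assumptions.

```python
def local_align(a, b, match=2, mismatch=-1, gap=-2):
    """Smith-Waterman local alignment with a linear gap penalty.

    Returns (score, aligned_a, aligned_b) for the best-scoring local region.
    """
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            # The extra 0 lets an alignment restart anywhere.
            score[i][j] = max(0,
                              score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)
    # Trace back from the best cell and stop as soon as the score reaches 0,
    # so only the island of strong similarity is reported.
    top, bottom = [], []
    i, j = best_pos
    while i > 0 and j > 0 and score[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + sub:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


local_result = local_align("TTTGKGAAA", "CCCGKGCCC")
print(local_result)
```

The flanking residues, which do not match, are simply left out of the reported alignment.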
22. Global vs. local alignment:
Purpose
  Global alignment: Used when sequences are expected to be similar across their entire length.
  Local alignment: Used when sequences are expected to share only local similarities.
Suitability
  Global alignment: Best for sequences of similar length and high overall similarity.
  Local alignment: Best for sequences of different lengths or with only partial similarity.
Example Use Case
  Global alignment: Comparing two versions of the same gene from different species.
  Local alignment: Identifying a conserved domain in two unrelated proteins.
23. Bioinformatics joins mathematics, statistics, computer science and information technology
to solve complex biological problems. These problems are usually at the molecular level and
cannot be solved by other means. This interesting field of science has many applications.
All the applications of bioinformatics are carried out at the user level: biologists, including
students at various levels, can use these applications and work with their output. The
applications fall into three groups:
Sequence Analysis
Function Analysis
Structure Analysis
Sequence Analysis: All the applications that analyze various types of sequence information
and can compare similar types of information are grouped under Sequence Analysis.
Function Analysis: These applications analyze the function encoded within the sequences and
help predict the functional interaction between various proteins or genes. Also expressional
Structure Analysis: When it comes to the realm of RNA and proteins, structure plays a
vital role in their interactions with other molecules. This gave birth to a whole new branch termed
Structural Bioinformatics, which is devoted to predicting the structure and possible roles of these
Sequence Analysis:
Sequence analysis uses sequencing data to identify the genes that encode regulatory
sequences or peptides. Many powerful tools and computers are available for analyzing the
genomes of various organisms. These tools also detect DNA mutations in an organism and
identify sequences that are related. Shotgun sequencing techniques are also
used for sequence analysis of numerous fragments of DNA. Special software is used to see the
It is easy to determine the primary structure of a protein, that is, its amino acid sequence,
from the DNA sequence that encodes it, but it is difficult to determine the secondary, tertiary or
quaternary structures of proteins. For this purpose either crystallography is used
or bioinformatics tools can be used to predict the complex protein structures.
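Reading the primary structure off a coding sequence is a direct table lookup, which is why it is the easy part. The sketch below uses the standard genetic code; the function name and the toy sequence are assumptions, and the code deliberately does nothing about secondary or higher-order structure.

```python
from itertools import product

# Standard genetic code, listed in TCAG order for each codon position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(cds):
    """Translate a coding sequence (DNA or RNA letters) into one-letter amino acids.

    Translation stops at the first stop codon ('*'). Codons containing
    anything other than A, C, G, T/U raise a KeyError.
    """
    cds = cds.upper().replace("U", "T")
    protein = []
    for i in range(0, len(cds) - 2, 3):
        amino_acid = CODON_TABLE[cds[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


print(translate("ATGGCTTGGAAATAA"))  # hypothetical toy CDS
```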
Genome Annotation:-
In genome annotation, genomes are marked up to identify regulatory sequences and protein-coding
regions. It is a very important part of the Human Genome Project as it determines the regulatory
sequences.
Comparative Genomics:-
Comparative genomics is the branch of bioinformatics which determines the genomic structure
and function relation between different biological species. For this purpose, intergenomic maps
are constructed which enable the scientists to trace the processes of evolution that occur in
genomes of different species. These maps contain the information about the point mutations as
well as the information about the duplication of large chromosomal segments.
The tools of bioinformatics are also helpful in drug discovery, diagnosis and disease
management. Complete sequencing of human genes has enabled scientists to make
medicines and drugs which can target more than 500 genes. Computational tools
and newly identified drug targets have made drug delivery easier and more specific, because
only those cells which are diseased or mutated need to be targeted. It is also easier to
determine the molecular basis of a disease.
The human genome will have profound effects on the fields of biomedical research and clinical
medicine. Every disease has a genetic component. This may be inherited (as is the case with an
estimated 3000-4000 hereditary diseases, including cystic fibrosis and Huntington's disease) or
a result of the body's response to an environmental stress which causes alterations in the
genome (e.g. cancers, heart disease, diabetes). The completion of the human genome
means that we can search for the genes directly associated with different diseases and begin
to understand the molecular basis of these diseases more clearly. This new knowledge of the
molecular mechanisms of disease will enable better treatments, cures and even preventative
tests to be developed.
Personalized medicine
Clinical medicine will become more personalised with the development of the field of
pharmacogenomics. This is the study of how an individual's genetic inheritance affects the
body's response to drugs. At present, some drugs fail to make it to the market because a small
percentage of the clinical patient population show adverse effects to a drug due to sequence
variants in their DNA. As a result, potentially life-saving drugs never make it to the
marketplace. Today, doctors have to use trial and error to find the best drug to treat a particular
patient as those with the same clinical symptoms can show a wide range of responses to the
same treatment. In the future, doctors will be able to analyse a patient's genetic profile and
prescribe the best available drug therapy and dosage from the beginning.
Preventative medicine
With the specific details of the genetic mechanisms of diseases being unravelled, the
become a distinct reality. Preventative actions such as change of lifestyle or having treatment
at the earliest possible stages when they are more likely to be successful, could result in huge
Gene therapy
In the not too distant future, the potential for using genes themselves to treat disease may
become a reality. Gene therapy is the approach used to treat, cure or even prevent disease by
changing the expression of a person's genes. Currently, this field is in its infancy, with
clinical trials for many different types of cancer and other diseases ongoing.
Drug development
At present all drugs on the market target only about 500 proteins. With an improved
understanding of disease mechanisms and using computational tools to identify and validate
new drug targets, more specific medicines that act on the cause, not merely the symptoms, of
the disease can be developed. These highly specific drugs promise to have fewer side effects
Microorganisms are ubiquitous, that is they are found everywhere. They have been found
surviving and thriving in extremes of heat, cold, radiation, salt, acidity and pressure. They are
present in the environment, our bodies, the air, food and water. Traditionally, use has been
made of a variety of microbial properties in the baking, brewing and food industries. The arrival
of the complete genome sequences and their potential to provide a greater insight into the
microbial world and its capacities could have broad and far reaching implications for
environment, health, energy and industrial applications. For these reasons, in 1994, the US
Department of Energy (DOE) initiated the MGP (Microbial Genome Project) to sequence
and toxic waste reduction. By studying the genetic material of these organisms, scientists can
begin to understand these microbes at a very fundamental level and isolate the genes that give
Waste cleanup
Deinococcus radiodurans is known as the world's toughest bacterium and it is the most
radiation resistant organism known. Scientists are interested in this organism because of its
potential usefulness in cleaning up waste sites that contain radiation and toxic chemicals.
Increasing levels of carbon dioxide emission, mainly through the expanding use of fossil
fuels for energy, are thought to contribute to global climate change. Recently, the DOE
Scientists are studying the genome of the microbe Chlorobium tepidum which has an unusual
Biotechnology
The archaeon Archaeoglobus fulgidus and the bacterium Thermotoga maritima have potential
These microorganisms thrive in water temperatures above the boiling point and therefore may
provide the DOE, the Department of Defence, and private companies with heat-stable enzymes
suitable for use in industrial processes. Other industrially useful microbes include,
it is used by the chemical industry for the biotechnological production of the amino acid lysine.
The substance is employed as a source of protein in animal nutrition. Lysine is one of the
essential amino acids in animal nutrition. Biotechnologically produced lysine is added to feed
gum, which is used as a viscosifying and stabilising agent in many industries. Lactococcus
lactis is one of the most important micro-organisms involved in the dairy industry. It is a
non-pathogenic rod-shaped bacterium that is critical for manufacturing dairy products like
buttermilk, yogurt and cheese. This bacterium, Lactococcus lactis ssp., is also used to prepare
pickled vegetables, beer, wine, some breads and sausages and other fermented foods.
Researchers anticipate that understanding the physiology and genetic make-up of this
bacterium will prove invaluable for food manufacturers as well as the pharmaceutical industry,
which is exploring the capacity of L. lactis to serve as a vehicle for delivering drugs.
Antibiotic resistance
Scientists have been examining the genome of Enterococcus faecalis, a leading cause of
bacterial infection among hospital patients. They have discovered a virulence region made up
transformation from harmless gut bacteria to a menacing invader. The discovery of the region,
known as a pathogenicity island, could provide useful markers for detecting pathogenic
strains and help to establish controls to prevent the spread of infection in wards.
Scientists used their genomic tools to help distinguish the strain of Bacillus anthracis
that was used in the summer of 2001 terrorist attack in Florida from closely related
anthrax strains.
Scientists have recently built the poliomyelitis virus using entirely artificial means. They did
this using genomic data available on the Internet and materials from a mail-order chemical
supply. The research was financed by the US Department of Defence as part of a biowarfare
response program to prove to the world the reality of bioweapons. The researchers also hope
their work will discourage officials from ever relaxing programs of immunisation. This project
Evolutionary studies
The sequencing of genomes from all three domains of life, eukaryota, bacteria and archaea
means that evolutionary studies can be performed in a quest to determine the tree of life and
Crop improvement
Comparative genetics of the plant genomes has shown that the organization of their genes
has remained more conserved over evolutionary time than was previously believed. These
findings suggest that information obtained from the model crop systems can be used to
suggest improvements to other food crops. At present the complete genomes of Arabidopsis
Insect resistance
Genes from Bacillus thuringiensis that can control a number of serious pests have been
successfully transferred to cotton, maize and potatoes. This new ability of the plants to resist
insect attack means that the amount of insecticides being used can be reduced and hence the
Scientists have recently succeeded in transferring genes into rice to increase levels of Vitamin
A, iron and other micronutrients. This work could have a profound impact in reducing
respectively. Scientists have inserted a gene from yeast into the tomato, and the result is a plant
whose fruit stays longer on the vine and has an extended shelf life.
Progress has been made in developing cereal varieties that have a greater tolerance for soil
alkalinity, free aluminium and iron toxicities. These varieties will allow agriculture to succeed
in poorer soil areas, thus adding more land to the global production base. Research is also in
Veterinary Science
Sequencing projects of many farm animals including cows, pigs and sheep are now well under
way in the hope that a better understanding of the biology of these organisms will have huge
impacts for improving the production and health of livestock and ultimately have benefits for
human nutrition.
Comparative Studies
Analysing and comparing the genetic material of different species is an important method for
studying the functions of genes, the mechanisms of inherited diseases and species evolution.
Bioinformatics tools can be used to make comparisons between the numbers, locations and
biochemical functions of genes in different organisms.
Organisms that are suitable for use in experimental research are termed model organisms. They
have a number of properties that make them ideal for research purposes: short life
spans, rapid reproduction, ease of handling, low cost, and the ability to be manipulated at
the genetic level.
An example of a model organism for human biology is the mouse. Mouse and human are very closely
related (>98%) and for the most part we see a one-to-one correspondence between genes in the
two species. Manipulation of the mouse at the molecular level and genome comparisons
between the two species are revealing detailed information on the functions of human
genes, the evolutionary relationship between the two species and the molecular mechanisms of
many human diseases.
21.a. FASTA
routinely possible.
• FASTA provides a rapid way to find short stretches of similar sequence between a
• Each sequence is broken down into short words a few sequence characters long, and
these words are organized into a table indicating where they are in the sequence.
• If one or more words are present in both sequences, and especially if several words
• Pearson (1990, 1996) has continued to improve the FASTA method for similarity
• a comment line identified by a “>” character in the first column followed by the name
• an optional “*” which indicates end of sequence and which may or may not be present
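The word-table step described above can be sketched in Python. This only illustrates the lookup idea and is not Pearson's FASTA implementation: the word length k, the function names and the toy sequences are assumptions, and counting shared words by offset (diagonal) is added here as a common way of spotting several matching words that line up.

```python
from collections import defaultdict


def word_table(sequence, k=2):
    """Map every word of length k to the list of positions where it occurs."""
    table = defaultdict(list)
    for pos in range(len(sequence) - k + 1):
        table[sequence[pos:pos + k]].append(pos)
    return dict(table)


def shared_word_offsets(query, subject, k=2):
    """Count words present in both sequences, grouped by offset.

    The offset is (position in query - position in subject). Several shared
    words with the same offset lie on one diagonal, which hints at a
    stretch of similar sequence.
    """
    table = word_table(query, k)
    offsets = defaultdict(int)
    for s_pos in range(len(subject) - k + 1):
        for q_pos in table.get(subject[s_pos:s_pos + k], []):
            offsets[q_pos - s_pos] += 1
    return dict(offsets)


print(word_table("GKGKA"))
print(shared_word_offsets("AAGKGTT", "CGKGC"))
```

In the second call both shared words (GK and KG) have the same offset, so they fall on the same diagonal.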
b. GenBank: The GenBank sequence database is an open-access, annotated collection of nucleotide
sequences and their protein translations, including mRNA sequences with coding regions, segments of
genomic DNA with a single gene or multiple genes, and ribosomal RNA gene clusters. GenBank is
produced and maintained by the National Center for Biotechnology Information (NCBI) as part of an
international collaboration with the EMBL Data Library from the EBI and the DNA Data Bank of Japan
(DDBJ). Individual laboratories can submit sequence data, and large-scale sequencing centres can
make bulk submissions, directly to GenBank by using BankIt or Sequin. BankIt is a web-based form
and Sequin is a stand-alone software tool developed by the NCBI for submitting and updating
sequences in the GenBank, EMBL and DDBJ databases. After sequence submission, the GenBank staff
assign an Accession Number to the newly entered sequence and perform quality assurance checks.
The newly submitted sequence is then released to the database. Data stored in GenBank can
be retrieved through Entrez or downloaded by File Transfer Protocol (FTP). GenBank also holds
Expressed Sequence Tag (EST), Sequence Tagged Site (STS), Genome Survey Sequence (GSS) and
High-Throughput Genomic Sequence (HTGS) records, as well as complete microbial genome sequences.
GenBank can be accessed through the server http://www.ncbi.nlm.nih.gov/genbank/.
There are several ways to search and retrieve data from GenBank:
• Search GenBank for sequence identifiers and annotations with Entrez Nucleotide, which is divided
into three divisions: CoreNucleotide (the main collection), dbEST (Expressed Sequence Tags), and
dbGSS (Genome Survey Sequences).
• Search and align GenBank sequences to a query sequence using BLAST.
• Search, link, and download sequences programmatically using NCBI e-utilities.
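As a sketch of programmatic access, the snippet below builds an E-utilities efetch request for a nucleotide record in FASTA format. The helper name and the example accession are assumptions for illustration; the snippet only constructs the URL and does not contact the server. Check the current NCBI E-utilities documentation for usage limits before sending requests.

```python
from urllib.parse import urlencode

EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def efetch_fasta_url(accession, db="nuccore"):
    """Build an efetch URL asking for one nucleotide record as FASTA text."""
    params = {
        "db": db,            # Entrez database to query
        "id": accession,     # accession number (or UID) of the record
        "rettype": "fasta",  # return the record in FASTA format
        "retmode": "text",   # as plain text
    }
    return f"{EUTILS_BASE}/efetch.fcgi?{urlencode(params)}"


url = efetch_fasta_url("NM_000546")  # example accession, chosen for illustration
print(url)
# To actually download the record (needs network access):
#   from urllib.request import urlopen
#   fasta_text = urlopen(url).read().decode()
```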
c. EMBL: The European Bioinformatics Institute (EBI) is part of the European Molecular Biology
Laboratory (EMBL). Its nucleotide sequence database, known as EMBL-Bank, was established in 1980
at the EMBL in Heidelberg, Germany. It was the world's first nucleotide sequence database. EMBL-EBI
provides freely available data from life science experiments, performs basic research in
computational biology and offers an extensive user training programme for researchers. EMBL-EBI
stores data on DNA and RNA (genes, genomes and variation), gene expression (RNA, protein and
metabolite expression), proteins (sequences, families and motifs), structures (molecular and
cellular structures), systems (reactions, interactions, pathways), chemical biology
(chemogenomics and metabolomics), ontologies (taxonomies and controlled vocabularies) and
literature (scientific publications and patents). EMBL-EBI can be accessed at http://www.ebi.ac.uk.
EMBL format: A sequence file in EMBL format can contain several sequences. Each sequence entry
starts with an identifier line ("ID"), followed by further annotation lines. The start of the
sequence is marked by a line starting with "SQ" and the end of the sequence is marked by two
slashes ("//").
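The entry layout described above can be illustrated with a small Python sketch. It relies only on the three markers named in the text ("ID", "SQ" and "//"); the function name parse_embl and the toy entries are made up for illustration, and real EMBL files carry many more annotation line types that this sketch simply skips.

```python
def parse_embl(text):
    """Split EMBL-format text into (identifier, sequence) pairs.

    Hypothetical helper: it only looks at the ID line, the SQ line and
    the terminating "//" line, and ignores all other annotation lines.
    """
    entries = []
    entry_id, seq_parts, in_sequence = None, [], False
    for line in text.splitlines():
        if line.startswith("ID"):
            # The first token after "ID" is taken as the entry identifier.
            entry_id = line[2:].split()[0].rstrip(";")
            seq_parts, in_sequence = [], False
        elif line.startswith("SQ"):
            in_sequence = True  # sequence lines follow this header
        elif line.startswith("//"):
            if entry_id is not None:
                entries.append((entry_id, "".join(seq_parts)))
            entry_id, seq_parts, in_sequence = None, [], False
        elif in_sequence:
            # Keep letters only; this drops spaces and the running base count.
            seq_parts.append("".join(ch for ch in line if ch.isalpha()))
    return entries


# Toy two-entry file, invented for this example.
SAMPLE_EMBL = """ID   SEQ1; SV 1; linear; DNA; STD; UNC; 12 BP.
DE   toy entry one
SQ   Sequence 12 BP;
     acgtacgtac gt                                                    12
//
ID   SEQ2; SV 1; linear; DNA; STD; UNC; 4 BP.
SQ   Sequence 4 BP;
     ttga                                                              4
//
"""

if __name__ == "__main__":
    for name, seq in parse_embl(SAMPLE_EMBL):
        print(name, seq)
```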
D. BLAST
• An even faster program for similarity searching in sequence databases, called BLAST, was
developed by Altschul et al. (1990).
• This method is widely used from the Web site of the National Center for Biotechnology
Information at the National Library of Medicine in Bethesda, Maryland
(http://www.ncbi.nlm.nih.gov/BLAST).
• The BLAST server is probably the most widely used sequence analysis facility in the world and
provides similarity searching against all currently available sequences.
• Like FASTA, BLAST prepares a table of short sequence words in each sequence, but it also
determines which of these words are most significant, such that they are a good indicator of
similarity between two sequences, and then confines the search to these words (and related ones).
• There are versions of BLAST for searching nucleic acid and protein databases, some of which
translate DNA sequences before comparing them to protein sequence databases (Altschul et al.
1997).
• Recent improvements in BLAST include GAPPED-BLAST, which is threefold faster than the original
BLAST but appears to find as many matches in databases, and PSI-BLAST (position-specific-iterated
BLAST), which can find more distant matches to a test protein sequence by repeatedly searching for
additional sequences that match an alignment of the query and initially matched sequences.
1. BLASTn (Nucleotide BLAST): Compares one or more nucleotide query
sequences to a subject nucleotide sequence or a database of nucleotide sequences. This is useful
while exploring to determine evolutionary relationships among.
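The word-table idea mentioned in the BLAST description above can be sketched in a few lines of Python. This is a simplified illustration, not the BLAST implementation: it indexes every word of a fixed length in the query and reports exact word matches in a subject sequence, whereas BLAST additionally scores related words and extends the hits. The function names and the default word size of 3 are assumptions made for the example.

```python
from collections import defaultdict


def build_word_index(seq, word_size):
    """Map every word of length word_size in seq to its start positions."""
    index = defaultdict(list)
    for i in range(len(seq) - word_size + 1):
        index[seq[i:i + word_size]].append(i)
    return index


def find_word_hits(query, subject, word_size=3):
    """Return (query_pos, subject_pos) pairs where the same word occurs."""
    index = build_word_index(query, word_size)
    hits = []
    for j in range(len(subject) - word_size + 1):
        word = subject[j:j + word_size]
        for i in index.get(word, []):
            hits.append((i, j))
    return hits


if __name__ == "__main__":
    # Each hit would be the seed for an extension step in a full search.
    print(find_word_hits("ACGTAC", "TTACGT"))
```

Hits that fall on the same diagonal (equal difference between the two positions) are the natural candidates for extension into a longer alignment.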
f. AutoDock is a docking tool designed to predict the behavior of small molecules. It lets the user
dock ligands to a set of grids that describe the target, and once docking is complete the result
can be visualized in 3D. AutoDock 4 is freely available under the GNU General Public License.
AutoDock uses a Monte Carlo simulation with rapid energy evaluation based on grid-based molecular
affinity potentials. It is given a volume around the protein, the rotatable bonds of the substrate,
and an arbitrary starting configuration, and the procedure produces a relatively unbiased docking.
Different applications of AutoDock:
• Structure-based drug design
• X-ray crystallography
• Lead optimization
• Combinatorial library design
• Protein-protein docking
• Chemical mechanism studies
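The combination of a Monte Carlo search with a precomputed energy grid can be illustrated with a toy Python sketch. This is not AutoDock's code or scoring function: the "ligand" is reduced to a single point on a small 2-D grid, the energies are invented, and the function name metropolis_search, the temperature and the step count are all assumptions made for the example. It only shows why a grid makes each energy evaluation a cheap lookup and how a Metropolis rule accepts or rejects random moves.

```python
import math
import random


def metropolis_search(energy_grid, start, steps=1000, temperature=1.0, seed=0):
    """Random walk over a 2-D energy grid with Metropolis acceptance.

    Returns the lowest-energy cell visited and its energy.
    """
    rng = random.Random(seed)
    n_rows, n_cols = len(energy_grid), len(energy_grid[0])
    current = start
    current_e = energy_grid[current[0]][current[1]]  # grid lookup, no recomputation
    best, best_e = current, current_e
    for _ in range(steps):
        # Propose a move to a neighbouring cell, clamped to the grid edges.
        d_row, d_col = rng.choice([(-1, 0), (1, 0), (0, -1), (0, 1)])
        row = min(max(current[0] + d_row, 0), n_rows - 1)
        col = min(max(current[1] + d_col, 0), n_cols - 1)
        new_e = energy_grid[row][col]
        delta = new_e - current_e
        # Always accept downhill moves; accept uphill moves with Boltzmann probability.
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            current, current_e = (row, col), new_e
            if current_e < best_e:
                best, best_e = current, current_e
    return best, best_e


if __name__ == "__main__":
    toy_grid = [
        [0.0, 1.0, 2.0],
        [1.0, 3.0, -1.0],
        [2.0, -1.0, -5.0],
    ]
    print(metropolis_search(toy_grid, start=(0, 0)))
```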
g. A. Dot-matrix method
The dot-matrix method, also known as the dot plot method, is a graphical method of sequence
alignment that compares two sequences by plotting them in a two-dimensional matrix. In a dot
matrix, the two sequences to be compared are plotted along the matrix's horizontal and vertical
axes. The method then scans each residue of one sequence to identify similarities with all
residues in the other sequence. If a residue in one sequence matches a residue in the other
sequence, a dot is placed in the corresponding position in the matrix. Otherwise, the matrix
position is left blank. If the two sequences being compared are highly similar, the dot plot will
display a single line along the matrix's main diagonal. However, when the sequences are less
similar, the dot plot will show more scattered dots with fewer diagonal lines, indicating that the
sequences share less similarity. Dot plots can also find repeat elements in a single sequence.
Short parallel lines above and below the main diagonal indicate the presence of repeats.
Figure: Example of comparing two sequences using dot plots (Xiong, J., 2006).
B. Dynamic programming
Dynamic programming is used to find the optimal alignment between two protein or nucleic acid
sequences by comparing all possible pairs of characters in the sequences. Dynamic programming
can be used to produce both global and local alignments. The global pairwise alignment algorithm
using dynamic programming is based on the Needleman-Wunsch algorithm, while the dynamic
programming in local alignment is based on the Smith-Waterman algorithm. This method works in
the following three steps.
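A global alignment by dynamic programming can be sketched in Python as follows. The three stages shown here (initializing the matrix, filling it, and tracing back) are the usual textbook decomposition of Needleman-Wunsch and are assumed rather than taken from these notes; the scoring values (match +1, mismatch -1, linear gap penalty -1) are likewise illustrative choices, not recommended parameters.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment of a and b; returns (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)

    def sub(i, j):
        # Substitution score for aligning a[i-1] with b[j-1].
        return match if a[i - 1] == b[j - 1] else mismatch

    # Step 1: initialization. The first row and column hold leading-gap penalties.
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    # Step 2: matrix fill. Each cell takes the best of diagonal, up and left.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),
                score[i - 1][j] + gap,
                score[i][j - 1] + gap,
            )

    # Step 3: traceback from the bottom-right cell to the top-left cell.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


if __name__ == "__main__":
    print(needleman_wunsch("ACGT", "AGT"))
```

A local (Smith-Waterman) variant would differ mainly in clamping each cell at zero and starting the traceback from the highest-scoring cell instead of the bottom-right corner.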
22. i. Progressive method The progressive method, also known as the tree-based algorithm, is a
step-wise assembly of multiple alignments based on pairwise similarity. This method is called
progressive because it aligns sequences in a step-wise manner. First, it performs pairwise
alignments of all the sequences using the Needleman–Wunsch global alignment method and records
the similarity scores. Then, it converts the scores into evolutionary distances to create a distance
matrix. A guide tree is constructed from the distance matrix using the neighbor-joining method.
The guide tree is used to direct the realignment of sequences based on their relative positions on the
tree, starting with the two most closely related sequences and adding more distant sequences one at
a time until all sequences are aligned. Clustal and T-Coffee are two well-known progressive
alignment programs.
23. Methods of Multiple Sequence Alignment
Multiple sequence alignment can be performed using either exhaustive or heuristic approaches.
A. Exhaustive algorithms
Exhaustive alignment involves examining all possible alignments at once. A multidimensional search
matrix is required to perform multiple sequence alignment using the exhaustive algorithm, similar
to the two-dimensional matrix used in dynamic programming for pairwise alignment. This means that
to align N sequences, an N-dimensional matrix is required. Dynamic programming is a powerful
method for aligning sequences, but as the number of sequences to be aligned increases, the
computational time and memory required grow exponentially. The method therefore becomes
computationally impractical for large data sets. As a result, dynamic programming is typically
only used for small data sets with fewer than ten short sequences. Heuristic approaches are
typically used for larger data sets to achieve a more efficient alignment.
B. Heuristic algorithm
i. Progressive method
The
progressive method, also known as the tree-based algorithm, is a step-wise assembly of multiple
alignments based on pairwise similarity. This method is called progressive because it aligns
sequences in a step-wise manner. First, it performs pairwise alignments of all the sequences using
the Needleman–Wunsch global alignment method and records the similarity scores. Then, it
converts the scores into evolutionary distances to create a distance matrix. A guide tree is
constructed from the distance matrix using the neighbor-joining method. The guide tree is used to
direct the realignment of sequences based on their relative positions on the tree, starting with the
two most closely related sequences and adding more distant sequences one at a time until all
sequences are aligned. Clustal and T-Coffee are two well-known progressive alignment programs.
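The score-to-distance step and the idea of adding sequences from most to least similar can be sketched in Python. This is a deliberately simplified illustration: the pairwise similarities are invented fractions rather than Needleman-Wunsch scores, the distance is assumed to be 1 minus the similarity, and the join order is chosen by average distance to the sequences already placed instead of by a neighbor-joining guide tree. It is not how Clustal or T-Coffee are implemented, and the function names are hypothetical.

```python
from itertools import combinations


def build_distance_matrix(similarity):
    """Convert pairwise similarities (0..1) into distances (assumed 1 - s)."""
    return {pair: 1.0 - s for pair, s in similarity.items()}


def alignment_order(names, distance):
    """Order sequences for step-wise alignment, closest pair first."""

    def dist(x, y):
        # The matrix stores each unordered pair once, so look up either order.
        return distance[(x, y)] if (x, y) in distance else distance[(y, x)]

    # Start with the two most closely related sequences.
    first_pair = min(combinations(names, 2), key=lambda pair: dist(*pair))
    order = list(first_pair)
    remaining = [name for name in names if name not in order]
    # Then repeatedly add the sequence closest, on average, to those already placed.
    while remaining:
        nearest = min(
            remaining,
            key=lambda name: sum(dist(name, placed) for placed in order) / len(order),
        )
        order.append(nearest)
        remaining.remove(nearest)
    return order


if __name__ == "__main__":
    # Invented similarity values for four hypothetical sequences.
    toy_similarity = {
        ("A", "B"): 0.9, ("A", "C"): 0.6, ("A", "D"): 0.3,
        ("B", "C"): 0.7, ("B", "D"): 0.4, ("C", "D"): 0.5,
    }
    toy_distance = build_distance_matrix(toy_similarity)
    print(alignment_order(["A", "B", "C", "D"], toy_distance))
```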
24. Aims of Bioinformatics
In general, the aims of bioinformatics are three-fold.
1. The first aim of bioinformatics is to store biological data in an organized form as a
database. This gives researchers easy access to existing information and allows them to submit
new entries. These data must be annotated to give them a suitable meaning or to assign functional
characteristics. The databases must also be able to correlate different hierarchies of
information. For example: GenBank for nucleotide and protein sequence information, Protein Data
Bank for 3D macromolecular structures, etc.
2. The second aim is to develop tools and resources that aid in the analysis of data. For example:
BLAST to find similar nucleotide/amino-acid sequences, ClustalW to align two or more
nucleotide/amino-acid sequences, Primer3 to design primers and probes for PCR techniques, etc.
3. The third and most important aim of bioinformatics is to exploit these computational tools to
analyze the biological data and interpret the results in a biologically meaningful manner.
25. Goals
To study how normal cellular activities are altered in different disease states, the biological
data must be combined to form a comprehensive picture of these activities. Therefore, the field
of bioinformatics has evolved such that the most pressing task now involves the analysis and
interpretation of various types of data. This includes nucleotide and amino acid sequences,
protein domains, and protein structures. The actual process of analyzing and interpreting data