
FUNDAMENTALS OF BIOINFORMATICS

Module 1: Origin, History and Scope of Bioinformatics

It was in the 17th century that biologists started dealing with problems of information
management. Early biologists were preoccupied with cataloguing and comparing species
of living things. By the middle of the 17th century, John Ray introduced the concept of
distinct species of animals and plants and developed guidelines based on anatomical
features for distinguishing conclusively between species. In 1735 Carl Linnaeus
established the basis for the modern taxonomic naming system of kingdoms, classes,
genera and species. Taxonomy was the first informatics problem in biology, and online
taxonomy projects continue this cataloguing effort today. Over a century ago, the
history of bioinformatics proper began with an Austrian monk named Gregor Mendel.
He cross-fertilized different colors of the same species of flowers. He kept careful records
of the colors of flowers that he cross-fertilized and the color(s) of flowers they produced.
Mendel illustrated that the inheritance of traits could be more easily explained if it was
controlled by factors passed down from generation to generation.

After this discovery of Mendel, bioinformatics and genetic record keeping have come a
long way. Growth in laboratory technology facilitated collection of data at a rate that was
faster than the rate of data interpretation. Biologists reached a similar information
overload and started facing a lot of difficulties in the field of data analysis and
interpretation. Collecting and cataloguing information about the individual genes
(approx. 30,000) in human DNA and determining the sequence of the three billion
chemical bases that make up human DNA became the second informatics problem in
biology. In 1990, the Human Genome Project was initiated as a prominent bioinformatics
solution to this problem, and it marked the 21st century as the era of genomes.

Since Mendel, bioinformatics and genetic record keeping have come a long way. The
understanding of genetics has advanced remarkably in the last thirty years. In 1972, Paul
Berg made the first recombinant DNA molecule using ligase. In that same year, Stanley
Cohen, Annie Chang and Herbert Boyer produced the first recombinant DNA organism.

In 1973, two important things happened in the field of genomics:

1. Joseph Sambrook led a team that refined DNA electrophoresis using agarose gel, and

2. Herbert Boyer and Stanley Cohen invented DNA cloning.

By 1977, a method for sequencing DNA was discovered and the first genetic engineering
company, Genentech, was founded. By 1981, 579 human genes had been mapped and
mapping by in situ hybridization had become a standard method. Marvin Carruthers and
Leroy Hood made a huge leap in bioinformatics when they invented a method for
automated DNA sequencing. In 1988, the Human Genome Organization (HUGO) was
founded; this is an international organization of scientists involved in the Human
Genome Project. In 1989, the first complete genome map of the bacterium Haemophilus
influenzae was published. The following year, the Human Genome Project was started. By
1991, a total of 1,879 human genes had been mapped. In 1993, Genethon, a human
genome research centre in France, produced a physical map of the human genome.
Three years later, Genethon published the final version of the Human Genetic Map,
marking the end of the first phase of the Human Genome Project. In the mid-1970s, it
would take a laboratory at least two months to sequence 150 nucleotides. Ten years ago,
the only way to track genes was to scour large, well-documented family trees of
relatively inbred populations, such as the Ashkenazi Jews of Europe. Such genealogical
searches were slow and laborious; today, a single sequencing company can process
11 million nucleotides a day for its corporate clients and company research.
Bioinformatics was fuelled by the need to create huge databases, such as GenBank,
EMBL and the DNA Database of Japan, to store and compare the DNA sequence data
erupting from the human genome and other genome sequencing projects.
Today, bioinformatics embraces protein structure analysis, gene and protein functional
information, data from patients, pre-clinical and clinical trials, and the metabolic pathways
of numerous species.

Origin of the Internet:

The management and, more importantly, the accessibility of this data are directly
attributable to the development of the Internet, particularly the World Wide Web
(WWW). The Internet was originally developed for military purposes in the 1960s and
expanded by the National Science Foundation in the 1980s; scientific use grew
dramatically following the release of the WWW by CERN in 1992.

HTML: The WWW is a graphical interface based on hypertext by which text and graphics
can be displayed and highlighted. Each highlighted element is a pointer to another
document or an element in another document which can reside on any internet host
computer. Page display, hypertext links and other features are coded using a simple,
cross-platform HyperText Markup Language (HTML) and viewed on UNIX workstations,
PCs and Apple Macs as WWW pages using a browser.

Java: The first graphical WWW browser - Mosaic for X and the first molecular biology
WWW server - ExPASy were made available in 1993. In 1995, Sun Microsystems
released Java, an object-oriented, portable programming language based on C++. In
addition to being a standalone programming language in the classic sense, Java brings
highly interactive, dynamic content to the Internet and offers a uniform operational level
for all types of computers, provided they implement the 'Java Virtual Machine' (JVM).
Thus, programs can be written, transmitted over the internet and executed on any other
type of remote machine running a JVM. Java is also integrated into Netscape and
Microsoft browsers, providing both the common interface and programming capability
which are vital in sorting through and interpreting the gigabytes of bioinformatics data now
available and increasing at an exponential rate.

XML: The XML standard, a project of the World Wide Web Consortium (W3C), extends
the power of the WWW to deliver not only HTML documents but an unlimited range of
document types using customized markup. This will enable the
bioinformatics community to exchange data objects such as sequence alignments,
chemical structures, spectra etc., together with appropriate tools to display them, just as
easily as they exchange HTML documents today. Both Microsoft and Netscape support
this new technology in their latest browsers.
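To make the idea concrete, here is a minimal sketch in Python (standard library only) of reading a data object such as a sequence alignment from customized markup. The element and attribute names are invented for illustration; real bioinformatics exchange formats define their own schemas.

```python
# Hypothetical example: parsing a small, custom XML record describing a
# pairwise sequence alignment. The tag names ("alignment", "sequence") and
# attributes are invented for illustration, not taken from a real schema.
import xml.etree.ElementTree as ET

record = """
<alignment program="demo" length="12">
  <sequence id="seqA">ACGTACGTACGT</sequence>
  <sequence id="seqB">ACGTTCGTACGA</sequence>
</alignment>
"""

root = ET.fromstring(record)
length = int(root.get("length"))
seqs = {s.get("id"): s.text for s in root.findall("sequence")}

# Count identical positions between the two aligned sequences.
matches = sum(a == b for a, b in zip(seqs["seqA"], seqs["seqB"]))
print(f"{matches}/{length} positions identical")
```

Because the markup is self-describing, the same record could carry a chemical structure or a spectrum simply by defining other element types, which is the point the W3C design makes.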

CORBA: Another new technology, called CORBA, provides a way of bringing together
many existing or 'legacy' tools and databases with a common interface that can be used
to drive them and access data. CORBA frameworks for bioinformatics tools and
databases have been developed by, for example, NetGenics and the European
Bioinformatics Institute (EBI). Representatives from industry and the public sector under
the umbrella of the Object Management Group are working on open CORBA-based
standards for biological information representation. The Internet offers a
universal platform on which to share and search for data and the tools to ease data
searching, processing, integration and interpretation. The same hardware and software
tools are also used by companies and organisations in more private yet still global Intranet
networks. One such company, Oxford GlycoSciences in the UK, has developed a
bioinformatics system as a key part of its proteomics activity.

ROSETTA: ROSETTA focuses on protein expression data and sets out to identify the
specific proteins which are up- or down-regulated in a particular disease; characterise
these proteins with respect to their primary structure, post-translational modifications
and biological function; evaluate them as drug targets and markers of disease; and
develop novel drug candidates.

OGS uses a technique called fluorescent IPG-PAGE to separate and measure different
protein types in a biological sample such as a body fluid or purified cell extract. After
separation, each protein is collected and then broken up into many different fragments
using controlled techniques. The mass and sequence of these fragments are determined
with great accuracy using mass spectrometry. The sequence of the original protein can
then be reconstructed by fitting these fragments back together in a kind of jigsaw. This
reassembly of the protein sequence is a task well suited to signal processing and
statistical methods.

ROSETTA is built on an object-relational database system which stores demographic and
clinical data on sample donors and tracks the processing of samples and analytical
results. It also interprets protein sequence data and matches this data with that held in
public, client and proprietary protein and gene databases. ROSETTA comprises a suite of
linked HTML pages which allow data to be entered, modified and searched, and gives
the user easy access to other databases. A high level of intelligence is provided through
a sophisticated suite of proprietary search, analytical and computational algorithms.
These algorithms facilitate searching through the gigabytes of data generated by the
Company's proteome projects, matching sequence data, carrying out de novo peptide
sequencing and correlating results with clinical data. The processing tools are mostly
written in C, C++ or Java to run on a variety of computer platforms, and they use
TCP/IP, the networking protocol of the internet, to co-ordinate the activities of a wide
range of laboratory instrument computers, reliably identifying samples and collecting
data for analysis.

The need to analyse ever-increasing numbers of biological samples using increasingly
complex analytical techniques is insatiable. Searching for signals and trends in noisy
data continues to be a challenging task, requiring great computing power. Fortunately
this power is available with today's computers, but of key importance is the integration
of analytical data, functional data and biostatistics. The protein expression data in
ROSETTA forms only part of an elaborate network of the type of data which can now be
brought to bear in biology. The need to integrate different information systems into a
collaborative network with a friendly face is bringing together an exciting mixture of
talents in the software world and has brought the new science of bioinformatics to life.
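The fragment "jigsaw" reassembly described above can be sketched in a few lines. This is an illustrative greedy overlap merge, not OGS's actual proprietary algorithm, and the peptide fragments are invented examples; real de novo sequencing must also cope with noise and ambiguous masses.

```python
# Illustrative sketch: reassemble a peptide sequence from overlapping
# fragments by repeatedly merging the pair with the longest suffix/prefix
# overlap, jigsaw-style. Not a production assembler.
def overlap(a, b):
    """Length of the longest suffix of a that is a prefix of b."""
    for n in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def assemble(fragments):
    frags = list(fragments)
    while len(frags) > 1:
        # Find the ordered pair with the best overlap and merge it.
        best = max(((overlap(a, b), a, b)
                    for a in frags for b in frags if a is not b),
                   key=lambda t: t[0])
        n, a, b = best
        frags.remove(a)
        frags.remove(b)
        frags.append(a + b[n:])
    return frags[0]

pieces = ["MKTAYIA", "YIAKQRQ", "QRQISFVK"]
print(assemble(pieces))  # MKTAYIAKQRQISFVK
```

The greedy strategy works here because the invented fragments overlap unambiguously; real data needs statistical scoring of candidate joins, which is why the text calls this a signal-processing problem.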

A Chronological History of Bioinformatics

1953 - Watson & Crick proposed the double helix model for DNA, based on X-ray data
obtained by Franklin & Wilkins.

1954 - Perutz's group develops heavy-atom methods to solve the phase problem in
protein crystallography.

1955 - The sequence of the first protein to be analysed, bovine insulin, is announced
by F. Sanger.

1969 - The ARPANET is created by linking computers at Stanford, UCSB, the University
of Utah and UCLA.

1970 - The details of the Needleman-Wunsch algorithm for sequence comparison are
published.

1972 - The first recombinant DNA molecule is created by Paul Berg and his group.

1973 - The Brookhaven Protein DataBank is announced. Robert Metcalfe receives his
Ph.D. from Harvard University; his thesis describes Ethernet.

1974 - Vint Cerf and Robert Kahn develop the concept of connecting networks of
computers into an "internet" and develop the Transmission Control Protocol (TCP).

1975 - Microsoft Corporation is founded by Bill Gates and Paul Allen. Two-dimensional
electrophoresis, where separation of proteins on SDS polyacrylamide gel is combined
with separation according to isoelectric point, is announced by Patrick H. O'Farrell.

1988 - The National Center for Biotechnology Information (NCBI) is established at the
National Library of Medicine. The Human Genome Initiative is started (Commission on
Life Sciences, National Research Council: Mapping and Sequencing the Human Genome,
National Academy Press, Washington, D.C., 1988). The FASTA algorithm for sequence
comparison is published by William R. Pearson and David J. Lipman. An Internet
computer worm designed by a student infects 6,000 military computers in the US.
1989 - The Genetics Computer Group (GCG) becomes a private company. Oxford
Molecular Group, Ltd. (OMG) is founded in the UK by Anthony Marchington, David
Ricketts, James Hiddleston, Anthony Rees and W. Graham Richards. Primary products:
Anaconda, Asp, Cameleon and others (molecular modelling, drug design, protein
design).

1990 - The BLAST program (Altschul et al.) is implemented. Molecular Applications
Group is founded in California by Michael Levitt and Chris Lee; their primary products,
Look and SegMod, are used for molecular modelling and protein design. InforMax is
founded in Bethesda, MD. The company's products address sequence analysis, database
and data management, searching, publication graphics, clone construction, mapping
and primer design.

1991 - The research institute in Geneva (CERN) announces the creation of the protocols
which make up the World Wide Web. The creation and use of expressed sequence tags
(ESTs) is described. Incyte Pharmaceuticals, a genomics company headquartered in
Palo Alto, California, is formed. Myriad Genetics, Inc. is founded in Utah; the company's
goal is to lead in the discovery of major common human disease genes and their related
pathways. The company has discovered and sequenced, with its academic collaborators,
the following major genes: BRCA1, BRCA2, CHD1, MMAC1, MMSC1, MMSC2, CtIP,
p16, p19 and MTS2.

1993 - CuraGen Corporation is formed in New Haven, CT. Affymetrix begins
independent operations in Santa Clara, California.

1994 - Netscape Communications Corporation is founded and releases Navigator, the
commercial version of NCSA's Mosaic. Gene Logic is formed in Maryland. The PRINTS
database of protein motifs is published by Attwood and Beck. Oxford Molecular Group
acquires IntelliGenetics.

1995 - The Haemophilus influenzae genome (1.8 Mb) is sequenced. The Mycoplasma
genitalium genome is sequenced.

1996 - The genome for Saccharomyces cerevisiae (baker's yeast, 12.1 Mb) is
sequenced. The PROSITE database is reported by Bairoch et al. Affymetrix produces
the first commercial DNA chips.

1997 - The genome for E. coli (4.7 Mbp) is published. Oxford Molecular Group acquires
the Genetics Computer Group. LION bioscience AG is founded as an integrated
genomics company with a strong focus on bioinformatics; the company is built from IP
out of the European Molecular Biology Laboratory (EMBL), the European Bioinformatics
Institute (EBI), the German Cancer Research Center (DKFZ) and the University of
Heidelberg. Paradigm Genetics Inc., a company focused on the application of genomic
technologies to enhance worldwide food and fiber production, is founded in Research
Triangle Park, NC. deCODE genetics publishes a paper describing the location of the
FET1 gene, which is responsible for familial essential tremor, on chromosome 13
(Nature Genetics).

1998 - The genomes for Caenorhabditis elegans and baker's yeast are published. The
Swiss Institute of Bioinformatics is established as a non-profit foundation. Craig Venter
forms Celera in Rockville, Maryland. PE Informatics is formed as a centre of excellence
within PE Biosystems; this centre brings together and leverages the complementary
expertise of PE Nelson and Molecular Informatics to further complement the genetic
instrumentation expertise of Applied Biosystems. Inpharmatica, a new genomics and
bioinformatics company, is established by University College London, the Wolfson
Institute for Biomedical Research, five leading scientists from major British academic
centres and Unibio Limited. GeneFormatics, a company dedicated to the analysis and
prediction of protein structure and function, is formed in San Diego. Molecular
Simulations Inc. is acquired by Pharmacopeia.

1999 - deCODE genetics maps the gene linked to pre-eclampsia as a locus on
chromosome 2p13.

2000 - The genome for Pseudomonas aeruginosa (6.3 Mbp) is published. The
A. thaliana genome (100 Mb) is sequenced. The D. melanogaster genome (180 Mb) is
sequenced. Pharmacopeia acquires Oxford Molecular Group.

2001 - The human genome (3,000 Mbp) is published.

What is Bioinformatics?

Bioinformatics is conceptualizing biology in terms of molecules and applying informatics
techniques (derived from disciplines such as applied mathematics, computer science
and statistics) to understand and organize the information associated with these
molecules, on a large scale. Hence bioinformatics is a management information system
for molecular biology and has many practical applications in various fields. In short, it
is the application of mathematical, statistical and computing methods to solve
biological problems.

Bioinformatics is a highly interdisciplinary field of biology, relying on fundamentals from
computer science, biology, physics, chemistry and mathematics (statistics). The subject
enables biological information to be handled with computers. Most of the large
biological molecules are polymers: ordered chains of simpler molecular modules called
monomers.
Monomers can be thought of as beads or building blocks that, despite having different
color and shapes, all have the same thickness and the same way of connecting to one
another. Each monomer molecule is of the same general class, but each kind of monomer
has its own well-defined set of characteristics. Many monomer molecules can be joined
together to form a single, far larger macromolecule which has exquisitely specific
informational content and/or chemical properties. According to this scheme, the
monomers in a given macromolecule of DNA or protein can be treated computationally
as letters of an alphabet, put together in pre-programmed arrangements to carry
messages or do work in a cell.
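The "letters of an alphabet" view is exactly how such molecules are handled computationally: a DNA macromolecule becomes a string over {A, C, G, T}, and ordinary string operations do real biological work. A minimal Python example:

```python
# Treating DNA as a string over the alphabet {A, C, G, T}: computing the
# reverse complement, i.e. the sequence of the opposite strand read in the
# conventional 5'-to-3' direction.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(dna):
    """Complement each base, then reverse the chain's direction."""
    return dna.translate(COMPLEMENT)[::-1]

seq = "ATGGCCATTGTA"
print(reverse_complement(seq))  # TACAATGGCCAT
```

Because the representation is just text, the same sequence can be searched, compared and stored with generic string and database machinery, which is the premise of the databases discussed earlier.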

What does Bioinformatics comprise?

Bioinformatics has moved beyond mere data analysis and now includes the following:

DNA Microarrays: New technologies designed to measure the relative number of copies
of a genetic message (levels of gene expression) at different stages in development or
disease, or in different tissues.

Functional genomics: Large-scale ways of identifying gene functions and associations
(e.g. yeast two-hybrid methods).

Structural genomics: Attempts to crystallize and/or predict the structures of all proteins.

Comparative genomics: The study of multiple whole genomes to understand the
differences and similarities between all the genes of multiple species.

Medical informatics: The management of biomedical experimental data associated with
particular molecules, from mass spectroscopy, to in vitro assays, to clinical
side-effects.
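As a concrete illustration of the microarray idea above, here is a minimal Python sketch with made-up intensity values for three hypothetical genes; a log2 ratio beyond ±1 flags a gene as up- or down-regulated between two conditions.

```python
# Illustrative microarray-style comparison. The intensity values and gene
# labels are invented for this sketch; real analyses also normalise across
# arrays and assess statistical significance over replicates.
import math

expression = {            # gene: (normal, disease) signal intensity
    "geneA": (100.0, 400.0),
    "geneB": (250.0, 240.0),
    "geneC": (800.0, 100.0),
}

for gene, (normal, disease) in expression.items():
    log2fc = math.log2(disease / normal)  # log2 fold change
    status = "up" if log2fc > 1 else "down" if log2fc < -1 else "unchanged"
    print(f"{gene}: log2 fold change {log2fc:+.2f} ({status})")
```

The log2 scale is conventional because it makes a doubling (+1) and a halving (-1) symmetric around zero.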

As a result of the massive surge in data and its complexity, many of the challenges in
biology have actually become challenges in computing. Such an approach is ideal
because of the ease with which computers can handle large quantities of data and probe
the complex dynamics observed in nature.

OBJECTIVES OF BIOINFORMATICS

The fundamental objectives of bioinformatics are the following:

1) Bioinformatics organizes data in a way that allows researchers to access existing
information and to submit new entries, e.g. the Protein Data Bank for 3D
macromolecular structures. While data curation is an essential task, the information
stored in these databases is essentially useless until analysed. Thus, the purpose of
bioinformatics extends beyond mere volume control of data.
2) The second key objective is to develop tools and resources that aid in the analysis
of data. For example, having sequenced a particular protein, it is of interest to
compare it with previously characterized sequences. Development of such
resources requires extensive knowledge of computational theory as well as a
thorough understanding of biology.
3) The third objective is to use the tools to analyze the data and interpret the results
in a biologically meaningful manner. Traditionally, biological studies examined
individual systems in detail and frequently compared those with a few that are
related. In bioinformatics, we can also conduct global analyses of all the available
data with the aim of uncovering common principles that apply across many
systems and highlight features that are unique to some.

Applications of Bioinformatics
Some of the applications related to biological information analysis are:
- Information related to biomolecules can be mapped; for example, sequences can be
parsed to find the sites where so-called restriction enzymes will cut them.
- Sequences can be compared, usually by aligning corresponding segments and looking
for matching and mismatching letters in their sequences. Genes or proteins that are
sufficiently similar are likely to be related, and are therefore called homologous.
- If a homologue exists, then a newly discovered protein may be modelled; that is, the
3D structure of the gene product can be predicted without doing laboratory
experiments.
- Bioinformatics is used in primer design. Primers are short sequences needed to make
many copies of a piece of DNA, as used in PCR.
- Bioinformatics is used to attempt to predict the function of actual gene products.
- Structural biologists use bioinformatics to handle the vast and complex data from
X-ray crystallography, nuclear magnetic resonance (NMR) and electron microscopy
investigations and to create 3D models of molecules.
- There are other fields, for example medical imaging/image analysis, that might also
be considered part of bioinformatics.
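Two of the applications above, sequence comparison and primer design, can be sketched minimally. The sequences and the primer are invented examples; real tools use full alignment algorithms such as Needleman-Wunsch and more accurate thermodynamic melting-temperature models than the simple Wallace rule shown here.

```python
# Sketches of two listed applications, under simplifying assumptions.

def percent_identity(a, b):
    """Naive comparison of two equal-length, pre-aligned sequences."""
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)

def wallace_tm(primer):
    """Rough primer melting temperature by the Wallace rule:
    Tm = 2(A+T) + 4(G+C) degrees C, usable for short primers only."""
    at = primer.count("A") + primer.count("T")
    gc = primer.count("G") + primer.count("C")
    return 2 * at + 4 * gc

print(percent_identity("ACGTTGCA", "ACGTAGCA"))   # 87.5
print(wallace_tm("ATGCATGCATGCATGCAT"), "degC")
```

Percent identity over an alignment is the simplest signal for homology; primer design additionally balances length, GC content and Tm between the two primers of a PCR pair.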

Module 2: IMPORTANCE AND USE OF BIOINFORMATICS

IMPORTANCE OF BIOINFORMATICS

Bioinformatics is a multidisciplinary field that draws on people from many different
working areas. It combines biology and computer science and is an emerging field that
helps in collecting, linking and manipulating different types of biological information to
discover new biological insight. Before bioinformatics appeared, scientists working in
different biological fields, such as human science and ecological science, needed a tool
that would help them work together. Research across the biological fields is interlinked,
and researchers had important information to share with each other, but the problem
was how to integrate it. In these circumstances, bioinformatics emerged to help
scientists and researchers work faster, leading to quick discoveries by providing readily
available information with the help of computer technology.

Scientists and researchers spend their whole lives inventing things for human benefit.
After many years of development, they have collected a huge amount of valuable data
from their experiments, and the collection of data still continues for a better
understanding of human life. Problems arise when they need to repeat research just
because the old data is hard to obtain, or they do not know whether it exists; this
wastes their valuable time. Take the example of DNA identification. Every species,
including human beings, has particular DNA strands that contain the genetic
instructions used in the development and functioning of all known living organisms. By
comparing this genetic information, scientists can find the root of different diseases.
Earlier it was hard to manage this information; to collect and link DNA information from
all over the world and to solve many medical complications, bioinformatics is a very
helpful hand. In addition, scientists also need a tool that can interlink information from
different areas, such as biology, statistics and genomics, to make their research faster.
For instance, they may need data regarding the effects of a particular gene on human
beings and its effects on animals or other species, so that they can interlink the results
and generate some beneficial outcome, or antidote, that helps in human development.
Eventually, bioinformatics provides that help by interlinking information from different
fields, leading to quick results.

Finally, bioinformatics also helps in digitizing information available on paper or in the
form of specimens, so that with the help of the internet it can be easily available to
everyone, everywhere. These days, the computer is an important part of every research
effort, without which such fast development cannot be imagined. Moreover, it is now
important to keep everyone aware of current developments, enabling everyone to
enhance their academic and research skills at a minimum expense of time, money and
materials. In this scenario, bioinformatics makes information readily available by
collecting, linking and manipulating it. In brief, bioinformatics is playing a vital role in
the development of society by supporting quick biological research very efficiently. This
field is going to generate more opportunities in the future for people working in many
different areas.

As you can see, the present challenges facing bioinformaticians are to improve
database design, develop better software for database access and manipulation, and
devise data-entry procedures to compensate for the varied computer procedures and
systems used by different laboratories. A related challenge for companies is to pull
these labs away from their paper-based reporting systems and move them to electronic,
web-based systems. Adding, storing, retrieving and sharing information becomes easy
once the data is made electronic.

USE OF BIOINFORMATICS

As an interface between modern biology and informatics, bioinformatics involves the
discovery, development and implementation of computational algorithms and software
tools that help us to get a clear picture of biological processes. Bioinformatics also has
its own importance in agriculture. In a developing country like India it plays a big role
in agriculture, where it can be used to increase nutritional content, increase production,
confer resistance against diseases, etc. In the pharmaceutical area, bioinformatics
helps to reduce the time and cost involved in analysis, drug discovery, etc.

The use of bioinformatics is threefold. The first is to collect and organise data in a way
that allows researchers to access the information and to enter new results as they are
produced. Just depositing data into a database is useless until it is analysed, so the
second aim is to develop tools to analyse the data in the required way. For example, to
compare a sequenced protein with already existing sequences, a tool or program is
needed for the comparison. Developing this kind of resource can only be done with a
deep understanding of both computational theory and biology. The third aim is to
analyse the data using these tools and interpret the results in a biologically meaningful
manner. In traditional biology it is difficult to examine the results of an experiment in
an extendable way, but using bioinformatics we can conduct a global analysis of all the
available data. Experimental results (the data) from the biological laboratory are stored
in bioinformatics databases, and these data can then be analysed using different
computational techniques and tools, in terms of sequence, structure, function,
evolution, pathways, etc. The results of the analysis are then compared with the
biological data and the experimental results.

Bioinformatics revolves around biological information or data. So what does this
information stand for? Data can be raw DNA sequences, protein sequences,
macromolecular structures, genome sequences, etc. DNA sequences are strings over a
four-letter base alphabet (A, T, G and C) which comprise genes. Protein sequences are
collections of letters from a 20-amino-acid alphabet. The most complex form of
information is macromolecular structural data; more than half of this data is the
structure of proteins. Genome sequence information is the collection of raw DNA
sequences, ranging from 1.6 million bases to 3 billion bases per genome. The next
issue is the organisation of this large amount of data. First of all, most of this data can
be grouped together based on biological similarities; for example, genes can be
grouped according to their function, pathway, etc. So a major aspect of managing this
data is developing methods to assess the similarities between these biomolecules and
to identify those that are related. Different databases are available to deposit this data
according to the information it carries, and each database contains a different type of
data: protein sequence databases, nucleotide sequence databases, structural
databases, etc. So, according to the type of data, it can be stored in different
databases. Now the question is: how can we use or analyse this data in practice?
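The "grouping by biological similarity" idea can be illustrated with a tiny index of gene records bucketed by their annotated pathway, the kind of grouping a database layer maintains so that related entries can be retrieved together. The record format here is invented for the sketch; the gene-pathway pairings themselves are standard textbook annotations.

```python
# Illustrative sketch: bucketing gene records by annotated pathway so that
# functionally related entries can be retrieved together. The flat-record
# format is invented; real databases use richer controlled vocabularies.
from collections import defaultdict

records = [
    {"gene": "HK1",  "pathway": "glycolysis"},
    {"gene": "PFKM", "pathway": "glycolysis"},
    {"gene": "TP53", "pathway": "cell cycle"},
    {"gene": "CDK2", "pathway": "cell cycle"},
]

by_pathway = defaultdict(list)
for rec in records:
    by_pathway[rec["pathway"]].append(rec["gene"])

for pathway, genes in sorted(by_pathway.items()):
    print(f"{pathway}: {', '.join(genes)}")
```

The same pattern scales from this four-record toy to the function- or pathway-based indices that sequence databases maintain internally.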

The information stored in databases is meaningless until it is analysed, and different
types of analysis can be performed on these datasets. Raw DNA sequence data can be
analysed to find the coding and noncoding regions (exons and introns) of a DNA
sequence. Protein sequences can be compared using bioinformatics algorithms; this is
the basis of multiple sequence alignment. Annotation and encoding of sequences is the
way an entire genome is explained. As mentioned before, our bodies contain a great
many proteins, genes, etc. The relationships between these proteins, the biological
pathways of different genes and details about gene products are core research areas in
bioinformatics. One of the major areas of research is at the gene expression level:
gene expression research is a way to understand the proteins and mRNA that are
produced by the cell. A main application of this research is in the area of cancer, whose
causes and cures are still incompletely understood. Bioinformatics algorithms can find
the shared sequences between different genes or proteins, and research on gene
expression is trying to answer these questions.
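One such analysis, finding candidate coding regions, can be sketched as a crude open-reading-frame scan over a raw DNA string: a stretch that starts at ATG and runs to the first in-frame stop codon. Real gene finders are far more sophisticated (splicing, reading-frame statistics, both strands), so this is only a minimal illustration.

```python
# Crude open-reading-frame (ORF) scan as a proxy for coding regions: for
# each ATG, read in-frame codons until the first stop codon (TAA/TAG/TGA).
STOPS = {"TAA", "TAG", "TGA"}

def find_orfs(dna, min_codons=2):
    orfs = []
    for start in range(len(dna) - 2):
        if dna[start:start + 3] != "ATG":
            continue
        for pos in range(start + 3, len(dna) - 2, 3):
            if dna[pos:pos + 3] in STOPS:
                if (pos - start) // 3 >= min_codons:
                    orfs.append(dna[start:pos + 3])
                break
    return orfs

print(find_orfs("CCATGAAATTTTGACCC"))  # ['ATGAAATTTTGA']
```

A practical scanner would repeat this on all six reading frames (three per strand) and filter candidates by length and codon-usage statistics.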

The use of bioinformatics depends on the area or field in which it is applied; it has its
own importance and use in every field of study. Some of these uses are explained here
by branch of study.

STRUCTURAL BIOINFORMATICS

The branch of bioinformatics that deals with the analysis and prediction of the
three-dimensional structure of biological macromolecules such as proteins, RNA and
DNA is known as structural bioinformatics. It deals with generalizations about
macromolecular 3D structure, such as comparisons of overall folds and local motifs,
principles of molecular folding, evolution, binding interactions and structure/function
relationships, working both from experimentally solved structures and from
computational models. Structural bioinformatics can be seen as a part of computational
structural biology.

Structural bioinformatics has existed in some form or other ever since the
determination of the first myoglobin structure. Structural bioinformatics has a
significant role in the field of molecular biology. The main challenge for structural
bioinformatics is the integration of structural information with other biological
information to have a deep understanding of biological function. The success of
genome-sequencing projects has created information about all the structures that
are present in individual organisms, as well as both the shared and unique features
of these organisms. Homology models for the structures identified through genome
sequencing projects can be created with the help of structural bioinformatics. The
resulting structures will be studied with respect to how they interact and perform their
functions. Similarly, the emergence of microarray expression measurements
provides an ability to consider how the expression of macromolecular structures is
regulated at a structural level including the reactions associated with transcription,
translation, and degradation. Alterations in functional characteristics arising from
structural variation, genetic mutation, and post-translational modification can be
studied through structure modelling. Moreover, the 3D structures of biological
molecules can be analysed through this application. Once the organization and
physical structure of entire cells is understood and represented in computational
models, those models can provide insight into how the thousands of structures
within a cell work together to create the functions associated with life. In any case,
structural bioinformatics is very important to biology because structure alignments
provide information that is not available from current sequence alignment
methods.

Structural bioinformatics is now in a renaissance with the success of the genome-
sequencing projects, the emergence of high-throughput methods for expression
analysis, and protein identification via mass spectrometry. There are now
organized efforts in structural genomics to collect and analyse macromolecular
structures in a high-throughput manner. These efforts include challenges in the
selection of molecules to study, the robotic preparation and manipulation of samples,
the automated collection and processing of X-ray diffraction data, and the
annotation of these structures as they are stored in databases. In addition, there
have been advances in the capabilities of NMR structure determination, which
previously could only study proteins in a limited range of sizes. The solution of the
731-residue malate synthase G complex from E. coli has pushed the frontier
for NMR spectroscopy and suggests that NMR is having its own renaissance.
Finally, the emergence of this structural information, when linked to the increasing
amount of genomic information and expression data, provides opportunities for
linking structural information to other data sources to understand how cellular
pathways and processes work at a molecular level. Protein structure prediction,
protein-protein interaction prediction, molecular interaction and protein-protein
docking are some of the applications of structural bioinformatics. All of these can
lead to a better understanding of molecular biology. Two commonly used
databases for structural data are the PDB and the EMDB.

a) PDB:

The World Wide Web site of the Protein Data Bank at the Research Collaboratory for
Structural Bioinformatics (RCSB) offers a number of services for submitting and
retrieving three-dimensional structure data. The home page of the RCSB site
provides links to services for depositing three-dimensional structures, information on
how to obtain the status of structures undergoing processing for submission, ways to
download the PDB database, and links to other relevant sites and software.
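PDB entries downloaded from the RCSB site are plain-text files with fixed-column records. As a small sketch, the coordinates in an ATOM record can be read by column position (column layout per the PDB format; the example line below is invented for illustration):

```python
# Minimal sketch: extract atom name, residue name and x/y/z coordinates
# from a single PDB ATOM record using the format's fixed column positions.

def parse_atom_line(line):
    """Parse one ATOM record of a PDB file into a small dictionary."""
    return {
        "atom": line[12:16].strip(),     # atom name, e.g. CA (alpha carbon)
        "residue": line[17:20].strip(),  # three-letter residue code
        "x": float(line[30:38]),         # orthogonal coordinates in angstroms
        "y": float(line[38:46]),
        "z": float(line[46:54]),
    }

# A made-up example record for an alanine alpha carbon:
record = "ATOM      1  CA  ALA A   1      11.104  13.207   2.100  1.00 20.00"
print(parse_atom_line(record))
```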

b) EMDB:

In 2002, the EMDB was founded at the EBI specifically to archive the non-atomistic
structures (i.e. volume maps, masks and tomograms) determined by different
methods like single-particle methodology, ET and electron crystallography. Today,
the EMDB contains over 1300 released entries and is expected to grow 5- to 10-fold by
2020. Since 2007, the EMDB has been managed jointly by three partners: PDBe,
RCSB PDB and the National Centre for Macromolecular Imaging (NCMI) at Baylor
College of Medicine.

FUNCTIONAL BIOINFORMATICS:

Functional bioinformatics is the development and implementation of algorithms
that facilitate the understanding of biological processes through the application of
statistical and data-mining techniques. The life sciences in general are generating
huge amounts of data with the accelerated development of high-throughput
technologies. Therefore, in parallel with the accumulation of experimental
information, there is an obvious need to analyse this huge amount of data. Data
mining, text mining, gene expression analysis, and algorithm tuning are some of
the main areas under functional bioinformatics.

Functional bioinformatics aims to define gene function by making use of the vast
amount of information now available through high-throughput experimental methods
for mapping and sequencing genomes and approaches for characterising genes'
function, their organisation and expression under different conditions. Together these
functional genomic techniques contribute to the growing field of systems biology.
These techniques are also being increasingly applied to understanding disease in
human populations as part of genetic medicine. The characteristics of a particular
disease, and biomarkers for that disease, can be explored through gene expression
analysis. This is a wide area of bioinformatics, and some of the main techniques
used in the field are supervised and unsupervised methods. Unsupervised learners
are not provided with classifications; in fact, the basic task of unsupervised learning
is to develop classification labels automatically. Unsupervised algorithms seek out
similarity between pieces of data in order to determine whether they can be
characterized as forming a group. These groups are termed clusters, and there is a
whole family of clustering machine-learning techniques. In supervised algorithms,
by contrast, the classes are predetermined and can be conceived of as a finite set
previously arrived at by a human. In practice, a certain segment of data will be
labelled with these classifications, and the machine learner's task is to search for
patterns and construct mathematical models, which are then evaluated on the
basis of their predictive capacity in relation to measures of variance in the data
itself. Many commonly referenced methods (decision-tree induction, naive Bayes,
etc.) are examples of supervised learning techniques. GEO and KEGG are
commonly used databases for functional data, although plenty of other databases
are available.
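The clustering idea above can be sketched with a tiny k-means implementation on made-up expression profiles. The gene values below are invented purely for illustration; real analyses use dedicated packages on normalized microarray or RNA-seq data.

```python
# Minimal k-means sketch: group expression profiles into k clusters with
# no predefined labels (unsupervised learning).

def kmeans(points, k=2, iters=20):
    """Cluster points (lists of floats); return one cluster label per point."""
    centroids = [list(p) for p in points[:k]]  # naive init: first k points
    labels = [0] * len(points)
    for _ in range(iters):
        # assignment step: each point joins its nearest centroid
        for i, p in enumerate(points):
            labels[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
        # update step: each centroid moves to the mean of its members
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(d) / len(members) for d in zip(*members)]
    return labels

# Two "up-regulated" and two "down-regulated" toy profiles (3 conditions each)
profiles = [[5.1, 4.9, 5.3], [4.8, 5.2, 5.0], [0.9, 1.1, 1.0], [1.2, 0.8, 1.1]]
labels = kmeans(profiles)
print(labels)  # the first two profiles cluster together, as do the last two
```

Note that the cluster labels are discovered from similarity alone; a supervised method would instead require each profile to carry a human-assigned class label during training.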

a) GEO (Gene Expression Omnibus)

Gene Expression Omnibus is an international public repository that archives and
freely distributes microarray, next-generation sequencing, and other forms of high-
throughput functional genomics data submitted by the research community.

b) KEGG (Kyoto Encyclopaedia of Genes and Genomes)

Kyoto Encyclopaedia of Genes and Genomes is a database resource for
understanding high-level functions and utilities of the biological system, such as the
cell, the organism and the ecosystem, from molecular-level information, especially
large-scale molecular datasets generated by genome sequencing and other high-
throughput experimental technologies.

SEQUENCE BIOINFORMATICS

Genetic data represent a treasure trove for researchers and companies interested
in how genes contribute to our health and well-being. Almost half of the
genes identified by the Human Genome Project have no known function.
Researchers are using bioinformatics to identify genes, establish their functions, and
develop gene-based strategies for preventing, diagnosing, and treating disease. A
DNA sequencing reaction produces a sequence that is several hundred bases long.
Gene sequences typically run for thousands of bases. The largest known gene is
that associated with Duchenne muscular dystrophy. It is approximately 2.4 million
bases in length. In order to study genes, scientists first assemble long DNA
sequences from a series of shorter overlapping sequences. Scientists enter their
assembled sequences into genetic databases so that other scientists may use the
data. Since the sequences of the two DNA strands are complementary, it is only
necessary to enter the sequence of one DNA strand into a database. By selecting an
appropriate computer program, scientists can use sequence data to look for genes,
get clues to gene functions, examine genetic variation, and explore evolutionary
relationships. Bioinformatics is a young and dynamic science. New bioinformatics
software is being developed while existing software is continually updated.
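The complementarity described above is easy to state in code. A minimal sketch of deriving the second strand from the one strand a database stores:

```python
# Minimal sketch: the reverse complement of a DNA strand. Because the
# two strands are complementary and antiparallel, storing one strand is
# enough; the other is recovered by complementing each base (A<->T,
# C<->G) and reversing the result.

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the reverse complement of a 5'->3' DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]

print(reverse_complement("ATGC"))  # -> GCAT
```

Applying the function twice returns the original sequence, which is a convenient sanity check.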

The most important thing in sequencing is the identification of the gene within
the given DNA sequence. Once the sequence has been assembled, bioinformatics
analysis can be used to determine if the sequence is similar to that of a known gene.
This is where sequences from model organisms are helpful. For example, suppose
we have an unknown human DNA sequence that is associated with prostate
cancer, and a bioinformatics analysis finds a similar mouse sequence belonging to
a gene that codes for a membrane protein. It is a good bet that the human
sequence is also part of a gene that codes for the same membrane protein. In this
way, bioinformatics sequence analysis can be used to identify an unknown gene
or protein of interest. Programs such as Clustal W and BLAST are available for this
kind of sequence comparison. One of the widely used databases for sequence
data is UniProt.
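The idea of scoring how similar two sequences are can be illustrated with a naive percent-identity calculation. This is only a sketch: tools such as BLAST compute statistically scored local alignments, not this position-by-position comparison, and the example sequences are invented.

```python
# Minimal sketch: percent identity between two pre-aligned, equal-length
# sequences, counting the fraction of positions that agree.

def percent_identity(a, b):
    """Percentage of positions at which two equal-length sequences match."""
    if len(a) != len(b):
        raise ValueError("sequences must be the same length")
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)

print(percent_identity("ATGGCC", "ATGACC"))  # 5 of 6 positions match
```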

a) UniProt:

The Universal Protein Resource (UniProt) is a comprehensive resource for protein
sequence and annotation data. UniProt is a collaboration between the European
Bioinformatics Institute (EMBL-EBI), the SIB Swiss Institute of Bioinformatics and
the Protein Information Resource (PIR). Across the three institutes more than 100
people are involved in tasks such as database curation, software development and
support.
BIOINFORMATICS: PRACTICE QUESTIONS
1. Which of the following is an example of Homology and similarity tool?
(a) BLAST
(b) RasMol
(c) EMBOSS
(d) PROSPECT
2. In which year did the SWISSPROT protein sequence database begin?
(a) 1988
(b) 1985
(c) 1986
(d) 1987
3.Which of the following scientists created the first Bioinformatics database?
(a) Dayhoff
(b) Pearson
(c) Richard Durbin
(d) Michael.J.Dunn
4. The human genome contains approximately__________.
(a) 6 billion base pairs
(b) 5 billion base pairs
(c) 3 billion base pairs
(d) 4 billion base pairs
5. Which of the following tools is used for the identification of motifs?
(a) BLAST
(b) COPIA
(c) PROSPECT
(d) Pattern hunter
6. The first molecular biology server, ExPASy, went online in the year __________.
(a) 1992
(b) 1993
(c) 1994
(d) 1995
7. What is the deposition of cDNA into the inert structure called?
(a) DNA probes
(b) DNA polymerase
(c) DNA microarrays
(d) DNA fingerprinting
8. The identification of drugs through the genomic study is called__________.
(a) Genomics
(b) Pharmacogenomics
(c) Pharmacogenetics
(d) Cheminformatics
9. Which of the following compounds has desirable properties to become a drug?
(a) Fit drug
(b) Lead
(c) Fit compound
(d) All of the above
10. Proteomics refers to the study of __________.
(a) Set of proteins in a specific region of the cell
(b) Biomolecules
(c) Set of proteins
(d) The entire set of expressed proteins in the cell
11. The process of finding the relative location of genes on a chromosome is called __________.
(a) Gene tracking
(b) Genome walking
(c) Genome mapping
(d) Chromosome walking
12. The computational methodology that tries to find the best matching between two molecules, a
receptor and ligand are called __________.
(a) Molecular fitting
(b) Molecular matching
(c) Molecular docking
(d) Molecule affinity checking
13. Which of the following are not the application of bioinformatics?
(a) Drug designing
(b) Data storage and management
(c) Understand the relationships between organisms
(d) None of the above
14. The term “in vitro” is the Latin phrase which refers to__________.
(a) Within the lab
(b) Within the glass
(c) Outside the lab
(d) Outside the glass
15. The stepwise method for solving problems in computer science is called__________.
(a) Flowchart
(b) Algorithm
(c) Procedure
(d) Sequential design
16. The term Bioinformatics was coined by __________.
(a) J.D Watson
(b) Pauline Hogeweg
(c) Margaret Dayhoff
(d) Frederic Sanger
17. The laboratory work using computers and associated with web-based analysis generally online is
referred to as __________.
(a) In silico
(b) Dry lab
(c) Wet lab
(d) All of the above
Sol: (a) In silico.
18. Which of the following is the first completed and published gene sequence?
(a) ΦX174
(b) T4 phage
(c) M13 phage
(d) Lambda phage
19. The laboratory work using computers and computer-generated models generally offline is referred
to as __________.
(a) Insilico
(b) Wet lab
(c) Dry lab
(d) All of the above
20. The computer simulation refers to __________.
(a) Dry lab
(b) Invitro
(c) In silico
(d) Wet lab

21. The first bioinformatics database was created by
A. Richard Durbin
B. Dayhoff
C. Michael J. Dunn
D. Pearson
22. SWISSPROT protein sequence database began in
A. 1985
B. 1986
C. 1987
D. 1988
23. An example of Homology & similarity tool?
A. PROSPECT
B. EMBOSS
C. RASMOL
D. BLAST
24. The tool for identification of motifs?
A. COPIA
B. patternhunter
C. PROSPECT
D. BLAST
25. First molecular biology server, ExPASy, was launched in the year?
A. 1991
B. 1992
C. 1993
D. 1994
26. Deposition of cDNA into an inert structure is
A. DNA fingerprinting
B. DNA polymerase
C. DNA probes
D. DNA microarrays
27. The human genome contains about
A. 2 billion base pairs
B. 3 billion base pairs
C. 4 billion base pairs
D. 5 billion base pairs
Answer:- B
28. The identification of drugs through genomic study is
A. Genomics
B. Cheminformatics
C. Pharmacogenomics
D. Pharmacogenetics
Answer:- C
29. Analysing or comparing the entire genome of a species is
A. Bioinformatics
B. Genomics
C. Proteomics
D. Pharmacogenomics
30. Characterizing molecular components is
A. Genomics
B. Cheminformatics
C. Proteomics
D. Bioinformatics
31. If you were using a proteomics approach to find the cause of a muscle disorder,
which of the following techniques might you be using?
a. creating a genomic library
b. sequencing the gene responsible for the disorder
c. developing physical maps from genomic clones
d. determining which environmental factors influence the expression of your gene of
interest
e. annotating the gene sequence

32. Shotgun cloning differs from the clone-by-clone method in which of the following
ways?
A. The location of the clone being sequenced is known relative to other clones within the
genomic library in shotgun cloning.
B. Genetic markers are used to identify clones in shotgun cloning.
C. Computer software assembles the clones in the clone-by-clone method.
D. The entire genome is sequenced in the clone-by-clone method, but not in shotgun
sequencing.
E. No genetic or physical maps of the genome are needed to begin shotgun cloning.

33. CpG islands and codon bias are tools used in eukaryotic genomics to __________.
a. identify open reading frames
b. differentiate between eukaryotic and prokaryotic DNA sequences
c. find regulatory sequences
d. look for DNA-binding domains
e. identify a gene’s function
34. As the complexity of an organism increases, all of the following characteristics
emerge except __________.
a. the gene density decreases
b. the number of introns increases
c. the gene size increases
d. an increase in the number of chromosomes
e. repetitive sequences are present
35. Gene duplication has been found to be one of the major reasons for genome
expansion in eukaryotes. In general, what would be the selective advantage of gene
duplication?
a. If one gene copy is nonfunctional, a backup is available.
b. Larger genomes are more resistant to spontaneous mutations.
c. Duplicated genes will make more of the protein product.
d. Gene duplication will lead to new species evolution.

36. How are so many different antibodies produced from fewer than 300 major genes?
A. gene duplication
B. alternative splicing mechanisms
C. the formation of polyproteins
D. the formation of nonspecific B cells
E. recombination, deletions, and random assortment of DNA segments
37. Two-dimensional gels are used to __________.
A. separate DNA fragments
B. separate RNA fragments
C. separate different proteins
D. observe a protein in two dimensions
E. separate DNA from RNA
38. What would be a likely explanation for the existence of pseudogenes?
a. gene duplication
b. gene duplication and mutation events
c. mutation events
d. unequal crossing over
e. evolutionary pressure
39. If you enter a set of IUPAC codes into BLAST, you are probably trying to
A. find out whether a certain protein has any role in human disease.
B. search for the genes that are located on the same chromosome as a gene whose sequence
you have.
C. find which section of a piece of DNA is transcribed into mRNA.
D. determine the identity of a protein.
40. Your lab partner is using BLAST, and his best E value is 3. This means that
A. he’s found 3 proteins in the database that have the same sequence as his protein.
B. the chance that these similarities arose due to chance is one in 10^3.
C. there would be 3 matches that good in a database of this size by chance alone.
D. the match in amino acid sequences is perfect, except for the amino acids at 3 positions.


61. Which is a model organism database?
(a) GOLD
(b) PROMISE
(c) SGD
(d) SCOP

62. The BLASTX program is used to
(a) translate a protein sequence
(b) translate a DNA database
(c) translate the input sequence
(d) none of these

63. Information on all known nucleotide and protein sequences is available on
(a) EMBL
(b) DDBJ
(c) NCBI's GenBank
(d) All of these

64. GenBank and SWISS-PROT are examples of
(a) primary databases
(b) secondary databases
(c) composite databases
(d) none of these

65. SCOP is
(a) a primary database
(b) a nucleotide sequence database
(c) a hierarchical classification of protein 2D domain structures
(d) a structural database, which identifies structural and evolutionary relationships

66. 'FASTA' was published by
(a) Joseph Sambrook
(b) Pearson and Lipman
(c) Sanger
(d) Altschul et al.

67. PDB is
(a) a primary database for macromolecules
(b) determined by gel electrophoresis
(c) a composite database
(d) a database for the three-dimensional structure of biological macromolecules

68. BLOSUM matrices are used for
(a) phylogenetic analysis
(b) multiple sequence alignment
(c) pairwise sequence alignment
(d) none of these

69. Which is a data retrieval tool?
(a) ENTREZ
(b) EMBL
(c) PHD
(d) All of these

70. Proteomics research can be categorized as
(a) structural and functional proteomics
(b) structural, functional and comparative proteomics
(c) functional and comparative proteomics
(d) none of these

71. Which of these is the most important aspect of planning and designing a good
sequencing experiment?
a) Careful choice of sample(s) and control(s)
b) Effective data analysis
c) Robust and precise experimental methods
d) All of the above

72. You need to use a first-generation sequencing method for de novo sequencing; which
template should give optimum results for this project?
a) Genomic DNA
b) PCR product
c) Bacterial artificial chromosome
d) Plasmid DNA

73. Which of the following statements is correct as to why the quantity of template used is
critical to a sequencing reaction?
a) Excess template reduces the length of a read
b) Too little template will result in no readable sequence
c) Excess template reduces the quality of a read
d) All of the above

74. What will a heterozygous single-nucleotide substitution look like on your chromatogram?
a) Two peaks of equal height at the same position
b) One peak twice the height of those around it
c) Two peaks in the same position, one twice the height of the other
d) Three peaks of equal height at the same position

75. Which of these projects would be best suited for Next Generation Sequencing?
a) To determine if a tumour sample contains a common missense mutation
b) To find the transcriptome of a tumour sample
c) To genotype ten genomic DNA samples for a known single nucleotide polymorphism
d) All of the above

76. Which of these is important for preparing templates for Next Generation Sequencing?
a) Isolating DNA from tissue
b) Breaking DNA up into smaller fragments
c) Checking the quality and quantity of the fragment library
d) All of the above

77. Once the sequences are obtained from your Next Generation Sequencing experiment,
what is the first thing you should do?
a) Perform a bioinformatics analysis of your data
b) Check your data using a different method
c) Publish your results
d) Further investigate the sequences of interest
