SET I
BASIC CONCEPTS IN BIOINFORMATICS
PART A
ANSWER ALL QUESTIONS 10 X 2 = 20
1. NCBI
2. Primary databases
3. PDB
4. Clustal w
5. Xenologous
6. PAM
7. Gap penalties
8. UPGMA
9. RASMOL
10. Propensity values
ANSWER ALL QUESTIONS 5 X 5 = 25
11. Write short note on secondary databases with examples. (OR) Describe the structure of a
database.
12. Write note on Nucleic Acid sequence databases. (OR) Write note on sequence retrieval from
databases.
13. What is sequence alignment. Give examples. (OR) Describe FASTA format.
14. Describe Smith–Waterman algorithm. (OR) Describe Needleman–Wunsch algorithm.
15. Write about different types of phylogenetic trees (OR) How would you predict the secondary
structure of protein.
ANSWER ANY THREE QUESTIONS 3 X 10 = 30
16. Write an essay on uses and scope of bioinformatics.
17. Write an essay on structure and format of specialized databases.
18. Describe the amino acid substitution matrices.
19. Write an essay on Dynamic programming.
20. How would you determine the tertiary structure of protein using Homology Modeling.
KEY TO SET I
BASIC CONCEPTS IN BIOINFORMATICS
PART A
ANSWER ALL QUESTIONS 10 X 2 = 20
1. NCBI: The National Center for Biotechnology Information (NCBI) is a leading source for public
biomedical databases, software tools for analyzing molecular and genomic data, and research in
computational biology. NCBI creates and maintains over 40 integrated databases for the medical and
scientific communities as well as the general public.
2. Primary databases: Primary databases (also known as data repositories) are highly
organised, user-friendly gateways to the huge amount of biological data produced by researchers
around the world. The primary databases were first developed for the storage of experimentally
determined DNA and protein sequences in the 1980s and 90s.
3. PDB: Protein Data Bank (PDB) format is a standard for files containing atomic coordinates.
Structures deposited in the Protein Data Bank at the Research Collaboratory for Structural
Bioinformatics (RCSB) are written in this standardized format.
4. Clustal W: Clustal W is a general purpose multiple sequence alignment program for DNA or
proteins. It produces biologically meaningful multiple sequence alignments of divergent sequences. It
calculates the best match for the selected sequences, and lines them up so that the identities,
similarities and differences can be seen.
5. Xenologous: Xenologs are homologous sequences that are found in different species because of
horizontal gene transfer rather than descent from a common ancestor.
6. PAM: PAM (Point Accepted Mutation) matrices were constructed by Dayhoff. In the PAM1 matrix the
probability of an amino acid changing into another is on average about 1% and the probability of
it not changing is about 99%.
7. Gap penalties: A linear gap penalty is a gap penalty in which each inserted/deleted symbol
in the gap contributes a constant (negative) score to the alignment. As a result, if the gap contains
n symbols, and the penalty for each inserted/deleted symbol is g, then the entire gap is penalized a
total of n × g.
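A linear gap penalty can be sketched in a few lines of Python. This is a minimal illustration, not a standard library routine: the function names, the "-" gap character and the example penalty of 2 per symbol are assumptions chosen here.

```python
def linear_gap_penalty(gap_length: int, per_symbol_penalty: float) -> float:
    """Score contributed by one gap: every gap symbol costs the same amount."""
    if gap_length < 0:
        raise ValueError("gap_length must be non-negative")
    return -per_symbol_penalty * gap_length


def alignment_gap_score(aligned_a: str, aligned_b: str, per_symbol_penalty: float) -> float:
    """Total linear gap penalty over all columns that contain a '-' symbol."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    gap_columns = sum(1 for x, y in zip(aligned_a, aligned_b) if x == "-" or y == "-")
    return linear_gap_penalty(gap_columns, per_symbol_penalty)


# Hypothetical alignment with three gap columns and a penalty of 2 per symbol.
print(alignment_gap_score("AC--GT", "ACTTG-", 2))
```

Because the penalty is linear, one gap of length three costs the same as three separate gaps of length one; affine schemes change exactly this property.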
8. UPGMA: UPGMA (Unweighted Pair Group Method with Arithmetic Mean) is unweighted, so all pairwise
distances contribute equally. For example, when a cluster BFG (three sequences) is compared with a
cluster AD (two sequences), the distance between them is the mean of all six possible pairwise
distances. The remaining BFG entries of the distance matrix are calculated as the mean distances of
B, F and G to the remaining sequences C and E.
9. RASMOL: RasMol is a molecular graphics program intended for the visualisation of proteins,
nucleic acids and small molecules. The program is aimed at display, teaching and the generation of
publication-quality images.
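10. Propensity values: A propensity value measures how often an amino acid occurs in a given
secondary structure (helix, strand or turn) relative to its overall frequency; values above 1 favour
that structure. The Chou–Fasman method uses these values to predict helices and strands in a similar
fashion, first searching linearly through the sequence for a "nucleation" region of high helix or
strand propensity and then extending the region until a subsequent four-residue window carries an
average propensity of less than 1.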
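The four-residue extension test can be sketched as follows. This is a simplified illustration only: the helix propensity numbers are approximate values assumed here for demonstration and should be checked against the published Chou–Fasman table, the function names are hypothetical, and the sketch extends a region to the right only.

```python
# Approximate helix propensities, assumed for illustration (not authoritative).
HELIX_PROPENSITY = {
    "E": 1.51, "M": 1.45, "A": 1.42, "L": 1.21, "K": 1.16,
    "F": 1.13, "Q": 1.11, "W": 1.08, "I": 1.08, "V": 1.06,
    "D": 1.01, "H": 1.00, "R": 0.98, "T": 0.83, "S": 0.77,
    "C": 0.70, "Y": 0.69, "N": 0.67, "P": 0.57, "G": 0.57,
}


def window_mean(sequence: str, start: int, window: int = 4) -> float:
    """Mean helix propensity of the residues in sequence[start:start + window]."""
    residues = sequence[start:start + window]
    return sum(HELIX_PROPENSITY[r] for r in residues) / len(residues)


def extend_helix(sequence: str, start: int, end: int,
                 window: int = 4, threshold: float = 1.0) -> tuple:
    """Grow the helix region [start, end) rightwards one residue at a time
    while the next four-residue window has a mean propensity >= threshold."""
    while end + window <= len(sequence) and window_mean(sequence, end, window) >= threshold:
        end += 1
    return start, end


# Hypothetical peptide: a helix-favouring stretch followed by Gly/Pro residues.
print(extend_helix("AEMLAEMLGPGP", 0, 4))
```

The extension stops as soon as the window runs into the glycine and proline residues, whose low propensities pull the window average below 1.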
ANSWER ALL QUESTIONS 5 X 5 = 25
11. Write short note on secondary databases with examples.
Secondary databases make use of publicly available sequence data in primary databases to
provide layers of information to DNA or protein sequence data. Secondary databases comprise data
derived from analysing entries in primary databases. Examples include PROSITE, Pfam and InterPro.
(OR) Describe the structure of a database.
Biological databases can be broadly classified into sequence, structure and functional databases.
Nucleic acid and protein sequences are stored in sequence databases and structure databases store
solved structures of RNA and proteins.
12. Write note on Nucleic Acid sequence databases.
The principal nucleic acid sequence databases are GenBank, the EMBL Nucleotide Sequence Database and
DDBJ, which exchange data as members of the International Nucleotide Sequence Database
Collaboration. The Nucleic Acid Database (NDB), by contrast, is a relational database containing
information about three-dimensional nucleic acid structures.
(OR) Write note on sequence retrieval from databases.
SRS (Sequence Retrieval System) is an information indexing and retrieval system designed for
libraries with a flat file format such as the EMBL nucleotide sequence databank, the SwissProt
protein sequence databank or the Prosite library of protein subsequence consensus patterns.
13. What is sequence alignment. Give examples.
Sequence alignment is a way of arranging protein (or DNA) sequences to identify regions of similarity
that may be a consequence of evolutionary relationships between the sequences.
(OR) Describe FASTA format.
FASTA format is a text-based format for representing either nucleotide sequences or peptide
sequences, in which nucleotides or amino acids are represented using single-letter codes. A sequence
in FASTA format begins with a single-line description starting with ">", followed by lines of
sequence data.
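A small reader for this layout might look like the sketch below. It assumes the usual convention that each description line starts with ">"; the function name and the example records are made up for illustration.

```python
from typing import Dict, Iterable, List


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Return {description: sequence}, joining wrapped sequence lines."""
    records: Dict[str, List[str]] = {}
    header = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:]  # description line without the '>' marker
            records[header] = []
        elif header is None:
            raise ValueError("sequence data found before the first '>' line")
        else:
            records[header].append(line)
    return {name: "".join(parts) for name, parts in records.items()}


# Hypothetical two-record file held in a string.
example = """>seq1 example nucleotide record
ACGTAC
GTTA
>seq2 example peptide record
MKVL
"""
print(parse_fasta(example.splitlines()))
```

Sequence data wrapped over several lines is concatenated, so "ACGTAC" and "GTTA" become one ten-letter sequence.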
14. Describe Smith–Waterman algorithm.
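The Smith–Waterman algorithm uses dynamic programming to find the optimal local alignment of two
sequences: cell scores are not allowed to fall below zero, and the alignment is traced back from the
highest-scoring cell until a cell with score zero is reached.
For computing the real alignment score, we need to distinguish a leftmost gap (of some sequence of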
gaps) from the other gaps. For this purpose, we consider the following three functions: score0(i, j) =
the score of X[i.. m] and Y[j..n] with no gap before X[i] or Y[j], score1(i, j) = the score of X[i.
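As an illustration of Smith–Waterman local alignment, here is a minimal sketch that uses a simple linear gap penalty rather than separate gap-opening and gap-extension scores. The scoring values (match 2, mismatch -1, gap -2) and the function name are assumptions chosen for the example.

```python
def smith_waterman(a: str, b: str, match: int = 2, mismatch: int = -1, gap: int = -2):
    """Return (best score, aligned part of a, aligned part of b)."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]  # first row and column stay 0
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Local alignment: a cell never drops below zero.
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)
    # Trace back from the highest-scoring cell until a zero cell is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        s = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + s:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


# Hypothetical sequences sharing only the core "ACGT".
print(smith_waterman("TTTACGTAAA", "GGGACGTCCC"))
```

Only the shared core is reported; the unrelated flanking residues are left out, which is what distinguishes local from global alignment.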
(OR) Describe Needleman–Wunsch algorithm.
The Needleman–Wunsch algorithm is an algorithm used in bioinformatics to align protein or
nucleotide sequences. The algorithm assigns a score to every possible alignment, and the purpose
of the algorithm is to find all possible alignments having the highest score.
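A minimal sketch of Needleman–Wunsch global alignment follows. The scoring scheme (match +1, mismatch -1, gap -1) and the function name are assumptions, and the traceback returns only one of the possibly several optimal alignments.

```python
def needleman_wunsch(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -1):
    """Return (optimal global score, aligned a, aligned b)."""
    rows, cols = len(a) + 1, len(b) + 1
    F = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        F[i][0] = i * gap  # a prefix of a aligned against gaps only
    for j in range(1, cols):
        F[0][j] = j * gap  # a prefix of b aligned against gaps only
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if a[i - 1] == b[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s, F[i - 1][j] + gap, F[i][j - 1] + gap)
    # Trace back from the bottom-right corner to the top-left corner.
    i, j = len(a), len(b)
    out_a, out_b = [], []
    while i > 0 or j > 0:
        s = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + s:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return F[-1][-1], "".join(reversed(out_a)), "".join(reversed(out_b))


# Hypothetical short sequences: three matches and one gap.
print(needleman_wunsch("ACGT", "AGT"))
```

Unlike a local alignment, every residue of both sequences appears in the result, so the shorter sequence is padded with gap symbols.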
15. Write about different types of phylogenetic trees.
Dendrogram. Cladogram. Phylogram. Chronogram. Dahlgrenogram. Phylogenetic network. Spindle
diagram. Coral of life.
(OR) How would you predict the secondary structure of protein.
Protein secondary structure refers to the local conformation of a protein's polypeptide backbone. Most
commonly, the secondary structure prediction problem is formulated as follows: given a protein
sequence with amino acids, predict whether each amino acid is in the α-helix (H), β-strand (E), or coil
region (C).
ANSWER ANY THREE QUESTIONS 3 X 10 = 30
16. Write an essay on uses and scope of bioinformatics.
Apart from analysis of genome sequence data, bioinformatics is now being used for a vast array of
other important tasks, including analysis of gene variation and expression, analysis and prediction of
gene and protein structure and function, prediction and detection of gene regulation networks, and
the development of simulation environments.
17. Write an essay on structure and format of specialized databases.
Specialized databases are a collection of focused information on one or more specific fields of study.
This information or data is arranged or indexed so that the user can locate and retrieve it quickly and
easily.
18. Describe the amino acid substitution matrices.
An amino acid substitution scoring matrix encapsulates the rates at which various amino acid
residues in proteins are substituted by other amino acid residues, over time. Database search
methods make use of substitution scoring matrices to identify sequences with homologous
relationships.
19. Write an essay on Dynamic programming.
Dynamic programming computes its solution bottom-up by synthesizing it from smaller
subsolutions, and by trying many possibilities and choices before it arrives at the optimal set of
choices. This matters because there is no a priori litmus test by which one can tell if the greedy
method will lead to an optimal solution.
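The contrast between dynamic programming and a greedy method can be seen in a toy coin-change problem. The example below is an assumption chosen for illustration (it is not a bioinformatics problem): with coin values 1, 3 and 4, the greedy choice of the largest coin first is not optimal for an amount of 6.

```python
def min_coins_dp(coins, amount):
    """Bottom-up dynamic programming: best[v] is the fewest coins that sum to v."""
    INF = float("inf")
    best = [0] + [INF] * amount
    for value in range(1, amount + 1):
        for coin in coins:
            # Build the answer for `value` from the smaller subproblem `value - coin`.
            if coin <= value and best[value - coin] + 1 < best[value]:
                best[value] = best[value - coin] + 1
    return None if best[amount] == INF else best[amount]


def min_coins_greedy(coins, amount):
    """Greedy method: always take as many of the largest remaining coin as possible."""
    count = 0
    for coin in sorted(coins, reverse=True):
        used, amount = divmod(amount, coin)
        count += used
    return count if amount == 0 else None


print(min_coins_dp([1, 3, 4], 6))      # dynamic programming finds 3 + 3
print(min_coins_greedy([1, 3, 4], 6))  # greedy takes 4 + 1 + 1
```

The dynamic programming table considers every coin at every amount, which is why it finds the two-coin solution that the greedy method misses.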
20. How would you determine the tertiary structure of protein using Homology Modeling.
Homology modeling, also known as comparative modeling of protein, refers to constructing an
atomic-resolution model of the "target" protein from its amino acid sequence and an
experimental three-dimensional structure of a related homologous protein (the "template").
SET II
BASIC CONCEPTS IN BIOINFORMATICS
PART A
ANSWER ALL QUESTIONS 10 X 2 = 20
1. Define bioinformatics
2. Secondary databases
3. EMBL
4. Clustal w
5. Orthologous
6. Blosum
7. Mismatch
8. PHYLIP
9. Protein folding
10. NJ method
ANSWER ALL QUESTIONS 5 X 5 = 25
11. Write short note on primary databases with examples. (OR) Describe the structure of a
database.
12. Write note on Protein sequence databases. (OR) Write note on sequence retrieval from
databases.
13. What is multiple sequence alignment. Give examples. (OR) Describe FASTA format.
14. Describe Local alignment. (OR) Describe Global alignment.
15. Write about different types of phylogenetic trees (OR) How would you predict the secondary
structure of protein.
ANSWER ANY THREE QUESTIONS 3 X 10 = 30
16. Write an essay on uses and scope of bioinformatics.
17. Write an essay on structure and format of databases.
18. Describe the Blosum matrices.
19. Write an essay on Dynamic programming.
20. How would you construct a phylogenetic tree using UPGMA method?
KEY TO SET II
BASIC CONCEPTS IN BIOINFORMATICS
PART A
ANSWER ALL QUESTIONS 10 X 2 = 20
1. Bioinformatics: Bioinformatics is a scientific field that is similar to but distinct from
biological computation, and it is often considered synonymous with computational biology.
Biological computation uses bioengineering and biology to build biological computers,
whereas bioinformatics uses computation to better understand biology.
2. Secondary databases: Secondary databases make use of publicly available sequence
data in primary databases to provide layers of information to DNA or protein sequence
data. Secondary databases comprise data derived from analysing entries in
primary databases. Examples include PROSITE, Pfam and InterPro.
3. EMBL: The EMBL Nucleotide Sequence Database ( http://www.ebi.ac.uk/embl.html )
constitutes Europe's primary nucleotide sequence resource. DNA and RNA sequences are
directly submitted from researchers and genome sequencing groups and collected from the
scientific literature and patent applications.
4. Clustal W: Clustal W is a general purpose multiple sequence alignment program for DNA or
proteins. It produces biologically meaningful multiple sequence alignments of divergent sequences. It
calculates the best match for the selected sequences, and lines them up so that the identities,
similarities and differences can be seen.
5. Orthologous: Orthologs are homologous genes that diverged after a
speciation event, with the gene and its main function conserved. By contrast, if a gene is
duplicated in a species, the resulting duplicated genes are paralogs of each other, even though
over time they might become different in sequence composition and function.
6. BLOSUM: In bioinformatics, the BLOSUM (BLOcks SUbstitution Matrix) matrix is a
substitution matrix used for sequence alignment of proteins. BLOSUM matrices are used to
score alignments between evolutionarily divergent protein sequences. They are based on
local alignments.
7. Mismatch: A mismatch is a column of an alignment in which the aligned residues differ.
If two sequences in an alignment share a common
ancestor, mismatches can be interpreted as point mutations and gaps as indels (that is,
insertion or deletion mutations) introduced in one or both lineages in the time since they
diverged from one another.
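8. PHYLIP: PHYLIP (PHYLogeny Inference Package) is a package of programs for inferring
phylogenies. Its input format, PHYLIP format, is a plain text format containing exactly two
sections: a header describing the dimensions of the alignment (the number of sequences and the
number of characters), followed by the multiple sequence alignment itself.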
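The two sections of a PHYLIP file can be illustrated with a short writer. This sketch assumes the strict sequential layout, in which each name is padded or truncated to ten characters; the function name and the example alignment are hypothetical.

```python
def to_phylip(alignment: dict) -> str:
    """Format {name: aligned sequence} as sequential PHYLIP text."""
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) != 1:
        raise ValueError("all sequences must be present and have the same aligned length")
    n_sites = lengths.pop()
    # Header section: number of sequences, then number of characters.
    lines = [f"{len(alignment)} {n_sites}"]
    # Alignment section: a ten-character name field followed by the sequence.
    for name, seq in alignment.items():
        lines.append(f"{name[:10]:<10}{seq}")
    return "\n".join(lines) + "\n"


# Hypothetical two-sequence alignment of four sites.
print(to_phylip({"human": "ACGT", "mouse": "ACGA"}))
```

Programs that read the relaxed variant accept longer names separated from the sequence by whitespace, so the ten-character rule is worth checking against the tool in use.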
9. Protein folding: Protein folding is a process by which a polypeptide chain folds to
become a biologically active protein in its native 3D structure. Protein structure is crucial to
its function. Folded proteins are held together by various molecular interactions.
10. NJ: The neighbor-joining method is a special case of the star decomposition method.
In contrast to cluster analysis neighbor-joining keeps track of nodes on a tree rather than
taxa or clusters of taxa. The raw data are provided as a distance matrix and the initial tree is
a star tree.
ANSWER ALL QUESTIONS 5 X 5 = 25
11. Write short note on primary databases with examples.
Primary databases are populated with experimentally derived data such as nucleotide
sequence, protein sequence or macromolecular structure. Once given
a database accession number, the data in primary databases are never changed: they form
part of the scientific record. Examples include GenBank, the EMBL Nucleotide Sequence Database
and the Protein Data Bank (PDB).
(OR) Describe the structure of a database.
Biological databases can be broadly classified into sequence, structure and functional databases.
Nucleic acid and protein sequences are stored in sequence databases and structure databases store
solved structures of RNA and proteins.
12. Write note on Protein sequence databases.
The Protein database is a collection of sequences from several sources, including
translations from annotated coding regions in GenBank, RefSeq and TPA, as well as records
from SwissProt, PIR, PRF, and PDB. Protein sequences are the fundamental determinants
of biological structure and function.
(OR) Write note on sequence retrieval from databases.
SRS (Sequence Retrieval System) is an information indexing and retrieval system designed for
libraries with a flat file format such as the EMBL nucleotide sequence databank, the SwissProt
protein sequence databank or the Prosite library of protein subsequence consensus patterns.
13. What is Multiple sequence alignment. Give examples.
Multiple Sequence Alignment (MSA) is generally the alignment of three or more biological
sequences (protein or nucleic acid) of similar length. From the output, homology can be inferred and
the evolutionary relationships between the sequences studied. Clustal W is an example of a program
that produces such alignments.
(OR) Describe FASTA format.
FASTA format is a text-based format for representing either nucleotide sequences or peptide
sequences, in which nucleotides or amino acids are represented using single-letter codes. A sequence
in FASTA format begins with a single-line description starting with ">", followed by lines of
sequence data.
14. Describe Local Alignment.
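Local alignment finds the best-matching regions within two sequences rather than aligning them end
to end; the Smith–Waterman algorithm computes it by dynamic programming.
For computing the real alignment score, we need to distinguish a leftmost gap (of some sequence of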
gaps) from the other gaps. For this purpose, we consider the following three functions: score0(i, j) =
the score of X[i.. m] and Y[j..n] with no gap before X[i] or Y[j], score1(i, j) = the score of X[i.
(OR) Describe Global Alignment.
The Needleman–Wunsch algorithm is an algorithm used in bioinformatics to align protein or
nucleotide sequences. The algorithm assigns a score to every possible alignment, and the purpose
of the algorithm is to find all possible alignments having the highest score.
15. Write about different types of phylogenetic trees.
Dendrogram. Cladogram. Phylogram. Chronogram. Dahlgrenogram. Phylogenetic network. Spindle
diagram. Coral of life.
(OR) How would you predict the secondary structure of protein.
Protein secondary structure refers to the local conformation of a protein's polypeptide backbone. Most
commonly, the secondary structure prediction problem is formulated as follows: given a protein
sequence with amino acids, predict whether each amino acid is in the α-helix (H), β-strand (E), or coil
region (C).
ANSWER ANY THREE QUESTIONS 3 X 10 = 30
16. Write an essay on uses and scope of bioinformatics.
Apart from analysis of genome sequence data, bioinformatics is now being used for a vast array of
other important tasks, including analysis of gene variation and expression, analysis and prediction of
gene and protein structure and function, prediction and detection of gene regulation networks, and
the development of simulation environments.
17. Write an essay on structure and format of databases.
Specialized databases are a collection of focused information on one or more specific fields of study.
This information or data is arranged or indexed so that the user can locate and retrieve it quickly and
easily.
18. Describe the Blosum matrices.
In bioinformatics, the BLOSUM (BLOcks SUbstitution Matrix) matrix is a substitution matrix used for
sequence alignment of proteins. BLOSUM matrices are used to score alignments between
evolutionarily divergent protein sequences. They are based on local alignments.
19. Write an essay on Dynamic programming.
Dynamic programming computes its solution bottom-up by synthesizing it from smaller
subsolutions, and by trying many possibilities and choices before it arrives at the optimal set of
choices. This matters because there is no a priori litmus test by which one can tell if the greedy
method will lead to an optimal solution.
20. How would you construct a phylogenetic tree using UPGMA method?
UPGMA is the simplest method for constructing trees. The great disadvantage of UPGMA is that it
assumes the same evolutionary speed on all lineages, i.e. the rate of mutations is constant over time
and for all lineages in the tree. Therefore, UPGMA frequently generates wrong tree topologies.
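The construction itself proceeds by repeatedly merging the two closest clusters and averaging distances. The following is a minimal sketch under stated assumptions: distances are supplied as a dictionary keyed by frozenset pairs, the tree is returned as a nested bracket string without branch lengths, ties are broken by order of appearance, and the four-taxon distance matrix is invented for the example.

```python
from itertools import combinations


def upgma(names, dist):
    """names: list of labels; dist: {frozenset({a, b}): distance}.
    Returns (tree string, list of merge heights)."""
    clusters = {name: 1 for name in names}  # cluster label -> number of leaves
    d = dict(dist)
    heights = []
    while len(clusters) > 1:
        # Step 1: find the closest pair of current clusters.
        a, b = min(combinations(clusters, 2), key=lambda pair: d[frozenset(pair)])
        heights.append(d[frozenset((a, b))] / 2)  # node sits at half the distance
        merged = f"({a},{b})"
        size_a, size_b = clusters.pop(a), clusters.pop(b)
        # Step 2: distance from the new cluster to each other cluster is the
        # size-weighted mean, i.e. the mean over all pairwise leaf distances.
        for other in clusters:
            d[frozenset((merged, other))] = (
                size_a * d[frozenset((a, other))] + size_b * d[frozenset((b, other))]
            ) / (size_a + size_b)
        clusters[merged] = size_a + size_b
    return next(iter(clusters)), heights


# Hypothetical distance matrix for four taxa.
pairs = {("A", "B"): 2, ("A", "C"): 6, ("B", "C"): 6,
         ("A", "D"): 10, ("B", "D"): 10, ("C", "D"): 10}
distances = {frozenset(p): v for p, v in pairs.items()}
print(upgma(["A", "B", "C", "D"], distances))
```

Placing each node at half the merged distance is where the constant-rate assumption enters: both lineages below a node are given equal length.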
SET III
BASIC CONCEPTS IN BIOINFORMATICS
PART A
ANSWER ALL QUESTIONS 10 X 2 = 20
1. PAM
2. Gap penalties
3. UPGMA
4. RASMOL
5. Propensity values
6. Define bioinformatics
7. Secondary databases
8. EMBL
9. Clustal w
10. Orthologous
ANSWER ALL QUESTIONS 5 X 5 = 25
11. Write short note on primary databases with examples. (OR) Describe the structure of a
database.
12. Write note on Protein sequence databases. (OR) Write note on sequence retrieval from
databases.
13. What is multiple sequence alignment. Give examples. (OR) Describe FASTA format.
14. Describe Local alignment. (OR) Describe Global alignment.
15. Write about different types of phylogenetic trees (OR) How would you predict the secondary
structure of protein.
ANSWER ANY THREE QUESTIONS 3 X 10 = 30
16. Write an essay on uses and scope of bioinformatics.
17. Write an essay on structure and format of specialized databases.
18. Describe the Blosum matrices.
19. Write an essay on Dynamic programming.
20. How would you construct a phylogenetic tree using UPGMA method?
KEY TO SET III
BASIC CONCEPTS IN BIOINFORMATICS
PART A
ANSWER ALL QUESTIONS 10 X 2 = 20
1. PAM: PAM (Point Accepted Mutation) matrices were constructed by Dayhoff. In the PAM1 matrix the
probability of an amino acid changing into another is on average about 1% and the probability of
it not changing is about 99%.
2. Gap penalties: A linear gap penalty is a gap penalty in which each inserted/deleted symbol
in the gap contributes a constant (negative) score to the alignment. As a result, if the gap contains
n symbols, and the penalty for each inserted/deleted symbol is g, then the entire gap is penalized a
total of n × g.
3. UPGMA: UPGMA (Unweighted Pair Group Method with Arithmetic Mean) is unweighted, so all pairwise
distances contribute equally. For example, when a cluster BFG (three sequences) is compared with a
cluster AD (two sequences), the distance between them is the mean of all six possible pairwise
distances. The remaining BFG entries of the distance matrix are calculated as the mean distances of
B, F and G to the remaining sequences C and E.
4. RASMOL: RasMol is a molecular graphics program intended for the visualisation of proteins,
nucleic acids and small molecules. The program is aimed at display, teaching and the generation of
publication-quality images.
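5. Propensity values: A propensity value measures how often an amino acid occurs in a given
secondary structure (helix, strand or turn) relative to its overall frequency; values above 1 favour
that structure. The Chou–Fasman method uses these values to predict helices and strands in a similar
fashion, first searching linearly through the sequence for a "nucleation" region of high helix or
strand propensity and then extending the region until a subsequent four-residue window carries an
average propensity of less than 1.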
6. Bioinformatics: Bioinformatics is a scientific field that is similar to but distinct from biological
computation, and it is often considered synonymous with computational biology. Biological
computation uses bioengineering and biology to build biological computers,
whereas bioinformatics uses computation to better understand biology.
7. Secondary databases: Secondary databases make use of publicly available sequence data in
primary databases to provide layers of information to DNA or protein sequence data. Secondary
databases comprise data derived from analysing entries in primary databases. Examples include
PROSITE, Pfam and InterPro.
8. EMBL: The EMBL Nucleotide Sequence Database ( http://www.ebi.ac.uk/embl.html )
constitutes Europe's primary nucleotide sequence resource. DNA and RNA sequences are directly
submitted from researchers and genome sequencing groups and collected from the scientific
literature and patent applications.
9. Clustal W: Clustal W is a general purpose multiple sequence alignment program for DNA or
proteins. It produces biologically meaningful multiple sequence alignments of divergent sequences. It
calculates the best match for the selected sequences, and lines them up so that the identities,
similarities and differences can be seen.
10. Orthologous: Orthologs are homologous genes that diverged after a speciation
event, with the gene and its main function conserved. By contrast, if a gene is duplicated in a
species, the resulting duplicated genes are paralogs of each other, even though over time they might
become different in sequence composition and function.
ANSWER ALL QUESTIONS 5 X 5 = 25
11. Write short note on primary databases with examples.
Primary databases are populated with experimentally derived data such as nucleotide
sequence, protein sequence or macromolecular structure. Once given a database accession number,
the data in primary databases are never changed: they form part of the scientific record. Examples
include GenBank, the EMBL Nucleotide Sequence Database and the Protein Data Bank (PDB).
(OR) Describe the structure of a database.
Biological databases can be broadly classified into sequence, structure and functional databases.
Nucleic acid and protein sequences are stored in sequence databases and structure databases store
solved structures of RNA and proteins.
12. Write note on Protein sequence databases.
The Protein database is a collection of sequences from several sources, including
translations from annotated coding regions in GenBank, RefSeq and TPA, as well as records
from SwissProt, PIR, PRF, and PDB. Protein sequences are the fundamental determinants
of biological structure and function.
(OR) Write note on sequence retrieval from databases.
SRS (Sequence Retrieval System) is an information indexing and retrieval system designed for
libraries with a flat file format such as the EMBL nucleotide sequence databank, the SwissProt
protein sequence databank or the Prosite library of protein subsequence consensus patterns.
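13. What is multiple sequence alignment. Give examples.
Multiple Sequence Alignment (MSA) is generally the alignment of three or more biological
sequences (protein or nucleic acid) of similar length. From the output, homology can be inferred and
the evolutionary relationships between the sequences studied. Clustal W is an example of a program
that produces such alignments.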
(OR) Describe FASTA format.
FASTA format is a text-based format for representing either nucleotide sequences or peptide
sequences, in which nucleotides or amino acids are represented using single-letter codes. A sequence
in FASTA format begins with a single-line description starting with ">", followed by lines of
sequence data.
14. Describe Local Alignment.
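Local alignment finds the best-matching regions within two sequences rather than aligning them end
to end; the Smith–Waterman algorithm computes it by dynamic programming.
For computing the real alignment score, we need to distinguish a leftmost gap (of some sequence of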
gaps) from the other gaps. For this purpose, we consider the following three functions: score0(i, j) =
the score of X[i.. m] and Y[j..n] with no gap before X[i] or Y[j], score1(i, j) = the score of X[i.
(OR) Describe Global Alignment.
The Needleman–Wunsch algorithm is an algorithm used in bioinformatics to align protein or
nucleotide sequences. The algorithm assigns a score to every possible alignment, and the purpose
of the algorithm is to find all possible alignments having the highest score.
15. Write about different types of phylogenetic trees.
Dendrogram. Cladogram. Phylogram. Chronogram. Dahlgrenogram. Phylogenetic network. Spindle
diagram. Coral of life.
(OR) How would you predict the secondary structure of protein.
Protein secondary structure refers to the local conformation of a protein's polypeptide backbone. Most
commonly, the secondary structure prediction problem is formulated as follows: given a protein
sequence with amino acids, predict whether each amino acid is in the α-helix (H), β-strand (E), or coil
region (C).
ANSWER ANY THREE QUESTIONS 3 X 10 = 30
16. Write an essay on uses and scope of bioinformatics.
Apart from analysis of genome sequence data, bioinformatics is now being used for a vast array of
other important tasks, including analysis of gene variation and expression, analysis and prediction of
gene and protein structure and function, prediction and detection of gene regulation networks, and
the development of simulation environments.
17. Write an essay on structure and format of specialized databases.
Specialized databases are a collection of focused information on one or more specific fields of study.
This information or data is arranged or indexed so that the user can locate and retrieve it quickly and
easily.
18. Describe the Blosum matrices.
In bioinformatics, the BLOSUM (BLOcks SUbstitution Matrix) matrix is a substitution matrix used for
sequence alignment of proteins. BLOSUM matrices are used to score alignments between
evolutionarily divergent protein sequences. They are based on local alignments.
19. Write an essay on Dynamic programming.
Dynamic programming computes its solution bottom-up by synthesizing it from smaller
subsolutions, and by trying many possibilities and choices before it arrives at the optimal set of
choices. This matters because there is no a priori litmus test by which one can tell if the greedy
method will lead to an optimal solution.
20. How would you construct a phylogenetic tree using UPGMA method?
UPGMA is the simplest method for constructing trees. The great disadvantage of UPGMA is that it
assumes the same evolutionary speed on all lineages, i.e. the rate of mutations is constant over time
and for all lineages in the tree. Therefore, UPGMA frequently generates wrong tree topologies.