GOR Method For Protein Structure Prediction Using Cluster Analysis
International Journal of Computer Applications (0975 – 8887)
Volume 73– No.1, July 2013
1.1 Cluster Analysis Technique
Cluster analysis groups data objects without consulting a class label. The objects are clustered or grouped based on the principle of maximizing the intra-class similarity and minimizing the inter-class similarity. Clusters of objects are formed so that objects within a cluster have high similarity to one another but are very dissimilar to objects in other clusters. Each cluster that is formed can be viewed as a class of objects from which rules can be derived [13]. Fig. 2 below shows how several clusters might form a hierarchy. When a hierarchy of clusters like this is created, the user can determine the right number of clusters that adequately summarizes the data while still providing useful information (at the other extreme, a single cluster containing all the records is a great summarization but does not contain enough specific information to be useful). Cluster analysis has received significant attention in the area of gene expression. It allows the identification of the structure of a data set, i.e. the identification of groups of similar objects in multidimensional space. Clustering procedures yield a data description in terms of clusters or groups of data points that possess strong internal similarities.

Hierarchical Clustering: These methods start with each point being considered a cluster and recursively combine pairs of clusters (subsequently updating the inter-cluster distances) until all points are part of one hierarchically constructed cluster.

Fig. 2 Hierarchy of Clusters.

1.2 SECONDARY STRUCTURE CLASSIFICATION
The DSSP Code: One method for classifying secondary structure is the "Dictionary of Protein Secondary Structure", commonly referred to as the DSSP code, which defines secondary structure unambiguously based on physical and geometrical properties. The Database of Secondary Structure in Proteins (DSSP) is widely used in protein science to define the secondary structure assignment.
The DSSP classification distinguishes eight elements of secondary structure, denoted by the letters H (α-helix), E (extended β-strand), G (3/10 helix), I (π-helix), B (bridge, a single-residue β-strand), T (β-turn), S (bend), and C (coil).
Eight classes are too many for existing methods of secondary structure prediction; instead, usually only three states are predicted, as in Table 1: helix (H), extended β-sheet (E), and coil (C). According to this requirement, the eight-letter DSSP alphabet is translated into the three-letter code.

Table 1: Reducing DSSP 8 classes code to 3 Classes

DSSP 8-classes                                   3-class
α-helix (H), 3/10 helix (G)                      Helix (H)
β-sheet (E), β-bridge (B)                        Strand (E)
π-helix (I), Turn (T), Bend (S), Coil (C)        Coil (C)

1.2.1 Chief elements of secondary structure are:
1. Alpha helix: The alpha helix, also known as the 4-turn helix, is the most commonly occurring type of element in proteins. The helical structure arranges the amino acids in a helix about 5 Å wide, and each amino acid is translated about 1.5 Å along the helical axis relative to the previous one. A canonical α-helix has 3.6 residues per turn and is built up from a contiguous amino acid segment via hydrogen bond formation between amino acids in positions i and i + 4. The average length of an alpha helix is 10 amino acids, and a minimum of 4 amino acids is required for a structure to be classified as an alpha helix. The residues taking part in an α-helix have φ angles around −60° and ψ angles around −50°. Alpha helices are often present at the surface of protein cores, where they provide an interface with the aqueous environment.
2. β-sheet: The beta sheet is the second most commonly known type of structural element. A beta sheet is formed when two or more amino acid segments (beta strands) in the same protein bond together through hydrogen bonds. A β-strand is a more extended structure with 2.0 residues per turn. Values for φ and ψ vary, with typical values of −140° and 130°. A β-strand interacts via hydrogen bonds with other β-strands, which may be distant in sequence, to form a β-sheet. Strands can bond with adjacent strands in a parallel or anti-parallel configuration. A β-sheet consists of individual β-strands, each of which is made up of contiguous amino acid residues. The dihedral angles are φ = −140° and ψ = 135° in anti-parallel sheets and φ = −120° and ψ = 115° in parallel sheets. The parallel beta sheet is characterized by two peptide strands running in the same direction held together by hydrogen bonding between the strands. The anti-parallel beta sheet is characterized by two peptide strands running in opposite directions held together by hydrogen bonding between the strands.
3. Coils and Loops: Coil or loop regions connect α-helices and β-sheets and have varying lengths and shapes. They do not have regular patterns like alpha helices and beta sheets and can occur in any other part of the protein structure. They are recognized as random coil and are not classified as protein secondary structure. These are also known as local structures and have irregular shape. Residues in loops or coils are usually located on the surface of the protein structure and tend to be charged and polar. Glycine and proline are commonly found in these regions.

2. LITERATURE REVIEW
Previous research discusses a new method for the prediction of protein secondary structure from the amino acid sequence. The method is based on the most recent version of the standard GOR algorithm. A significant improvement is obtained by combining multiple sequence
alignments with the GOR method. Additional improvement in the predictions is obtained by a simple correction of the results when helices or sheets are too short, or if helices and sheets are direct neighbours along the sequence [14]. Requiring the prediction to be strong enough, i.e. requiring that the difference between the probability of the predicted (most probable) state and the probability of the second most probable state be larger than a certain minimum value, also significantly improves secondary structure predictions.

3. GOR METHODOLOGY
The GOR method is based on information theory and Bayesian statistics. Information theory approaches are popular in secondary structure prediction, and these approaches are based on mathematical probability. Information theory is a branch of the mathematical theory of probability and mathematical statistics that defines the concept of information. The steadily increasing amount of protein structural information has motivated researchers to develop several approaches that use information theory to predict protein structure and function. Most commonly, the secondary structure prediction problem is formulated as follows:
Given a protein sequence with amino acids r1 r2 . . . rn, predict whether each amino acid ri is in an α-helix (H), a β-strand (E), or neither (C) [18]. Predictions of secondary structure are typically judged via the 3-state accuracy (Q3), which is the percentage of residues for which the predicted secondary structure (H, E, or C) is correct.

Fig 3: Prediction Scheme

The GOR method is one of the first major methods proposed for prediction of secondary structure from sequence. The three letters of GOR are derived from the first letters of its authors' names (Garnier-Osguthorpe-Robson). In the version of the GOR method used here, a database of 267 proteins containing 63,000 residues is used [11].
The secondary structure prediction method determines its accuracy in terms of the percentage of helix, sheet and coil present. Formation of α-helix, β-sheet and coil is predicted for each amino acid residue in the sequence. The predictions for all secondary structure elements are combined to obtain the predicted secondary structure of the protein, as in Fig. 4. Rather than considering propensities for a single residue, position-dependent propensities have been calculated for all residue types. The GOR method works on various sequence formats and uses information theory to generate the code that relates amino acid sequence and secondary structure of proteins. Three scoring matrices are prepared in the GOR method to hold the probability of each amino acid being present at every position. One matrix corresponds to the central amino acid being found in an α-helix, the second to the amino acid being in a β-strand, and the third to a coil.

Fig 4: General Framework for Protein Secondary Structure Prediction Method

The GOR method works on a window of 17 residues: the eight nearest neighboring residues on each side are included in the calculations for a given residue. One of the three conformational states is predicted, depending on the type of amino acid R as well as the neighboring residues along the window. Information theory provides the information function. The GOR method calculates information from the residues within a sliding window, as in Fig. 5.
To determine the structure for a given amino acid position j, the GOR method looks at a window of 8 amino acids before and 8 after the position of interest. Suppose $a_j$ is the amino acid whose structure we are trying to determine. GOR looks at the residues
$a_{j-8}\, a_{j-7} \ldots a_j \ldots a_{j+7}\, a_{j+8}$
Intuitively, it assigns a structure based on probabilities it has calculated from protein databases. These probabilities are of the form
$\Pr[\text{amino acid } j \text{ is } \alpha \mid a_{j-8} \ldots a_j \ldots a_{j+8}]$
$\Pr[\text{amino acid } j \text{ is } \beta \mid a_{j-8} \ldots a_j \ldots a_{j+8}]$
In the GOR method, three scoring matrices are prepared, in which each column contains the probability of finding each amino acid at one of the 17 positions. Information theory is formulated on the basis of the information function I(S; R), which will be fully represented in mathematical notation together with other functions and formulas. The information function is described as the logarithm of the ratio of the conditional probability P(S|R) of observing conformation S given residue R to the probability P(S) of observing S.
The information available as to the joint occurrence of secondary structural conformation S and amino acid R is given by
$I(S; R) = \log \frac{P(S \mid R)}{P(S)}$
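
The information function and the 17-residue window can be illustrated with a minimal Python sketch. It is not the implementation used in this work (which was done in MATLAB); it assumes that the information contributed by each window position is estimated independently and simply summed, as in the original GOR formulation. The function names (reduce_dssp, train, predict, q3), the pseudocount smoothing, and the toy training data are all hypothetical and chosen only for illustration.

```python
import math
from collections import Counter

# Table 1: reduction of the eight DSSP classes to three states.
DSSP_TO_3 = {"H": "H", "G": "H", "E": "E", "B": "E",
             "I": "C", "T": "C", "S": "C", "C": "C"}
STATES = "HEC"
HALF_WINDOW = 8  # 8 residues before and 8 after: a window of 17


def reduce_dssp(dssp):
    """Translate an 8-letter DSSP string into the 3-letter code."""
    return "".join(DSSP_TO_3[c] for c in dssp)


def train(examples, pseudocount=1.0):
    """Estimate I(S; R) = log[P(S|R) / P(S)] for each window offset.

    examples: list of (sequence, three_state_string) pairs of equal length.
    The pseudocount is an illustrative smoothing choice, not from the paper.
    """
    state_counts = Counter()  # state -> count
    joint = Counter()         # (state, offset, residue) -> count
    context = Counter()       # (offset, residue) -> count
    for seq, states in examples:
        for j, s in enumerate(states):
            state_counts[s] += 1
            for m in range(-HALF_WINDOW, HALF_WINDOW + 1):
                if 0 <= j + m < len(seq):
                    joint[(s, m, seq[j + m])] += 1
                    context[(m, seq[j + m])] += 1
    total = sum(state_counts.values())
    k = len(STATES)

    def info(s, m, r):
        if context[(m, r)] == 0:
            return 0.0  # residue never seen at this offset: no information
        p_s = (state_counts[s] + pseudocount) / (total + k * pseudocount)
        p_s_given_r = ((joint[(s, m, r)] + pseudocount)
                       / (context[(m, r)] + k * pseudocount))
        return math.log(p_s_given_r / p_s)

    return info


def predict(seq, info):
    """Assign to each position the state with the largest summed information."""
    out = []
    for j in range(len(seq)):
        scores = {
            s: sum(info(s, m, seq[j + m])
                   for m in range(-HALF_WINDOW, HALF_WINDOW + 1)
                   if 0 <= j + m < len(seq))
            for s in STATES
        }
        out.append(max(STATES, key=lambda s: scores[s]))
    return "".join(out)


def q3(predicted, observed):
    """3-state accuracy: percentage of residues predicted correctly."""
    correct = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * correct / len(observed)


if __name__ == "__main__":
    # Toy, hypothetical training data; a real run would use a protein database.
    toy = [("AAAAAAAAAA", "HHHHHHHHHH"),
           ("VVVVVVVVVV", "EEEEEEEEEE"),
           ("GGGGGGGGGG", "CCCCCCCCCC")]
    info = train(toy)
    print(predict("AAAAA", info), q3("HHEC", "HHCC"))
```

The three scoring matrices described in the text correspond here to the values info(s, m, r) for s in H, E and C: for each state there is one entry per window offset m (17 positions) and residue type r. Positions near the ends of a sequence simply use the part of the window that exists.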
5. RESULTS AND DISCUSSION
In the present work, we deal with amino acid residues to determine the secondary structure of sequences. The GOR method is used to predict the structure of amino acid sequences. Amino acids combine through peptide bonds to form a protein.
The GOR method is a window-based method. It uses a 17 × 20 window, i.e. 17 positions by 20 amino acid types. It predicts the percentage of the three conformational states according to their presence in the protein sequence. The DSSP code is used in the GOR method, with the 8-class code reduced to 3 classes. Different types of sequence formats are used as input in the present work. Each format has its own layout and significance. Classification trees are generated from the root node down to the leaf nodes; each node checks the value of one predictor or variable. The MATLAB platform is used to carry out the present work. FASTA is a fast alignment format and contains a header and the sequence. The GenBank database is a collection of all publicly available nucleotide sequences and their protein translations. The GCG format contains a sequence that is marked by a line ending with two dot characters ("..").
The GOR method assigns a secondary structure to every protein sequence. The structure is reported as the percentage of each conformational state (helix, sheet and coil) present. The resulting graph is drawn according to these percentage values. The user can edit the input sequence to obtain the results.

6. CONCLUSION & FUTURE WORK
I have studied the GOR method, which is based on information theory and Bayesian statistics, and found it quite successful in its accuracy of secondary structure prediction. Probabilities of the three conformational states are predicted for each residue in the sequence with the help of the GOR method, and this information can be used for further analysis. These results are achieved when predictions are made on a single sequence.
The developed method is highly stable and consistent when tested against the different DSSP secondary structure reduction methods considered in this research. Information regarding the secondary structure elements such as helix, sheet and coil that form for a particular amino acid sequence is distributed across the whole window. This information is retrieved from a database of 267 proteins. Different types of input sequence formats are used to determine the accuracy of secondary structure prediction in the GOR method.
Several recommendations can be made for further work in the domain of protein secondary structure prediction:
1. A variety of different sequence formats can be introduced for further analysis.
2. A variety of bioinformatics tools are available which can be used to incorporate new research in the bioinformatics field.
3. The present GOR method is based on a single sequence, but in future it can be combined with multiple sequence alignment to achieve different results.
4. Since research in the bioinformatics field is increasing rapidly, the requirement is to achieve optimal results in less time.

7. ACKNOWLEDGMENTS
The author wishes to express his sincere gratitude and indebtedness to his supervisor, Prof. Rajbir Singh (Assoc. Prof. & Head, Department of Information Technology), for his valuable guidance, attention-grabbing views and obliging nature, which led to the successful completion of this study. I lack words to express my cordial thanks to the members of the Departmental Research Committee (DRC) for their useful comments and constructive suggestions during all the phases of the present study, as well as for critically going through the manuscript.
Words fail the author to express his deep sense of gratitude towards his family members for their moral and financial support and encouragement, without which the author would not have been able to bring out this thesis.

8. REFERENCES
[1] An, B., et al. (2009) "Accuracy of Protein Secondary Structure Prediction Continues to Rise", International Conference on MASS '09, pp. 1-4.
[2] Akitomi, J. (2007) "Method for predicting Secondary Structure of RNA, an apparatus for predicting and a predicting program", US Patent 0235155.
[3] Balaban, D.J. and Aggarwal, A. (2005) "Method and apparatus for providing a Bioinformatics Database", US Patent 7215804.
[4] Chang, J. and Zhu, X. (2010) "Bioinformatics Database: Intellectual Property Protection Strategy", Journal of Intellectual Property Rights, vol. 15, pp. 447-454.
[5] Chen, X., et al. (2011) "The use for classification trees for bioinformatics", John Wiley & Sons, Inc., WIREs Data Mining Knowledge Discovery, vol. No. 4, pp. 55-63.
[6] Deris, S.B., et al. (2007) "Protein Secondary Structure Prediction From Amino Acid Sequence Using Artificial Intelligence Technique", Journal of Bioinformatics, vol. No. 5, pp. 1-245.
[7] Exarchos, K.P., et al. (2007) "Predicting peptide bond conformation using feature selection and the Naive Bayes approach", IEEE EMBS 2007, pp. 5009-5012.
[8] Fallahi, H. and Yarani, R. (2010) "Positional preferences by 20 amino acids in beta sheets", IEEE BIBMW, pp. 806-807.
[9] Gerhart, J. and Sacan, A. (2011) "BioDB: Integration of biological knowledgebases", IEEE BIBMW 2011, pp. 899.
[10] Greene, L.A. (2011) "Polypeptide Structural Motifs Associated With Cell Signaling Activity", US Patent 0004185.
[11] Garnier, J., et al. (1996) "GOR method for predicting Protein Secondary Structure from Amino Acid Sequence", Methods in Enzymology, vol. 266, pp. 540-553.
[12] Ismail, W.M. and Chowdhury, S. (2010) "Preference of Amino Acids in Different Protein Structural Classes: A Database Analysis", ICBBE 2010, pp. 1-5.
[13] Jiang, D., Tang, C. and Zhang, A. (2004) "Cluster Analysis for Gene Expression Data", IEEE Transactions on Knowledge and Data Engineering, vol. 16, no. 11, pp. 1370-1386.
[14] Kumar, B. and Jani, N.N. (2010) "Prediction of Protein Secondary Structure based on GOR Algorithm Integrating with Multiple Sequences Alignment", International Journal of Advanced Engineering and Applications, pp. 177-182.
[15] Singh, R., et al. (2010) "Chou-Fasman Method for Protein Structure Prediction using Cluster Analysis", World Academy of Science, Engineering and Technology 72, 2010, pp. 982-987.
[16] Singh, M., et al. (2008) "Protein Secondary Structure Prediction", World Academy of Science, Engineering and Technology, pp. 458-461.
[17] Sen, T.Z., et al. (2005) "GOR V server for protein secondary structure prediction", Bioinformatics, vol. 21, no. 11, pp. 2787-2788.
[18] Singh, M. (2001) "Predicting Protein Secondary and Super Secondary Structure", CRC Press, pp. 29.1-29.30.

Singh R is an Associate Professor & Head, Department of Information Technology of Lala Lajpat Rai Institute of Engineering & Technology, Moga, India. He received his B.E. (Honours) degree in Computer Science and Engineering from MD University, Rohtak, Haryana and M.Tech degree in Computer Science and Engineering from Punjab Technical University, Jalandhar, Pb. (INDIA). He has authored 03 books on Computer Science. His main field of research interest is Bio-Informatics and Data Mining. He works on Gene Expression, Phylogenetic Trees and Prediction of Protein Sequence & Structure.

Neha Jain is an Assistant Prof. with the Department of Computer Science & Engg., Northwest Institute of Engineering & Technology, Dhudike, Moga, Punjab, INDIA. She received her B.Tech in Computer Science & Engineering and M.Tech degree in Computer Science & Engineering from Punjab Technical University, Jalandhar, Pb. (INDIA). Her research interests include Bio-Informatics, Software Engineering, and Software Testing. She works on protein structure prediction using Cluster Analysis in MATLAB.
IJCATM : www.ijcaonline.org