
International Journal of Computer Applications (0975 – 8887)

Volume 73– No.1, July 2013

GOR Method for Protein Structure Prediction using Cluster Analysis

Prof. Rajbir Singh
Associate Prof. & Head, Department of IT, LLRIET, Moga

Neha Jain
Assistant Prof. (CSE), Department of CSE, NWIET, Moga

Dheeraj Pal Kaur
Assistant Prof. (ECE), Department of ECE, LLRIET, Moga

ABSTRACT
Protein structure prediction is one of the most important problems in modern computational biology. The emphasis here is on the use of computers, because most of the tasks involved in genomic data analysis are highly repetitive or mathematically complex. This research focuses on predicting the secondary structure of proteins from their amino acid sequences. In the present work, the GOR (Garnier, Osguthorpe, and Robson) method is implemented to predict the secondary structure of amino acid residues using different input sequence formats. Amino acids combine through peptide bonds to form proteins. The practical implementation of protein structure prediction depends entirely on the availability of experimental databases. The analysis and interpretation of bioinformatics databases, which include various types of data such as nucleotide and amino acid sequences, protein domains, and protein structures, is an important step in determining and predicting protein structure so as to understand the biological and chemical activities of organisms. The GOR method uses information theory to generate the code that relates the amino acid sequence to the secondary structure of proteins. Three scoring matrices are prepared in the GOR method to calculate the probability of each amino acid being present at every position. Cluster analysis is used as the data mining model to retrieve the result.

General Terms
Hierarchical clustering, GOR algorithm, Genetic Computer Group, GenBank database, FASTA format.

Keywords
Amino Acid, Protein, Polypeptide, DNA, RNA, DSSP, GOR.

1. INTRODUCTION
Proteins are complex organic macromolecules that are essential for the functioning, structure and regulation of the body's cells, tissues and organs. Proteins consist of amino acids joined together by peptide bonds to form a polypeptide, as shown in Figure 1. Many proteins function as enzymes or form subunits of enzymes, and some play structural or mechanical roles. Some proteins are used for the storage and transport of various ligands, and some function in the immune response. Proteins also serve as nutrients, providing the organism with the amino acids that it cannot synthesize itself. A chain of amino acids joined by peptide bonds is called a polypeptide, and a protein is such a chain.

An amino acid is any molecule that contains both an amino group and a carboxylic acid group. An amino acid residue is what remains of an amino acid after it forms a peptide bond and loses a water molecule. Since we are interested in amino acids that form proteins, it is safe to use the terms residue and amino acid interchangeably. There are 20 different amino acids in nature that form proteins. These 20 are encoded by the universal genetic code. Nine standard amino acids are called "essential" for humans because they cannot be created from other compounds by the human body, and so must be taken in as food.

Fig 1: Formation of Peptide

Protein structure is determined by the sequence of amino acids present in a protein chain. Protein structures may be classified into four levels or classes: primary, secondary, tertiary, and quaternary structure.
1. Primary structure of a protein is the sequence of amino acids, which are held together by covalent bonds. Sequence direction is an important component; it runs from the amine (N) terminal to the carboxyl (C) terminal.
2. Secondary structure: Segments of the primary structure tend to arrange themselves into regular spatial arrangements; these units are referred to as secondary structure. Important factors in protein secondary structure are the angles and hydrogen bond patterns between the backbone atoms. Secondary structure is further divided into three parts: alpha-helix, beta-sheet, and loop.
3. Tertiary structure is the three-dimensional structure of the protein, which is formed from its secondary structure elements. In other words, tertiary structure is the arrangement of all the atoms of the protein in space, without regard to its relationship with neighbouring molecules or subunits.
4. Quaternary structure is the structure of a protein complex. It is the arrangement of subunits in space without regard to the internal geometry of the subunits.


1.1 Cluster Analysis Technique
Cluster analysis examines data objects without consulting a class label. The objects are clustered or grouped based on the principle of maximizing the intra-class similarity and minimizing the inter-class similarity. Clusters are formed so that objects within a cluster are highly similar to one another but very dissimilar to objects in other clusters. Each cluster that is formed can be viewed as a class of objects from which rules can be derived [13]. Fig. 2 below shows how several clusters might form a hierarchy. When a hierarchy of clusters like this is created, the user can determine the right number of clusters that adequately summarizes the data while still providing useful information (at the other extreme, a single cluster containing all the records is a great summarization but does not contain enough specific information to be useful). Cluster analysis has received significant attention in the area of gene expression. It allows the identification of the structure of a data set, i.e. the identification of groups of similar objects in multidimensional space. Clustering procedures yield a data description in terms of clusters or groups of data points that possess strong internal similarities.

Hierarchical Clustering: These methods start with each point being considered a cluster and recursively combine pairs of clusters (subsequently updating the inter-cluster distances) until all points are part of one hierarchically constructed cluster.

Fig. 2 Hierarchy of Clusters.

1.2 SECONDARY STRUCTURE CLASSIFICATION
The DSSP Code: One method of classifying secondary structure is the "Dictionary of Protein Secondary Structure", commonly referred to as the DSSP code, which defines secondary structure unambiguously on the basis of physical and geometrical properties. The Database of Secondary Structure in Proteins (DSSP) is widely used in protein science to define the secondary structure assignment.

The DSSP classification defines eight elements of secondary structure assignment, denoted by the letters H (α-helix), E (extended β-strand), G (3/10 helix), I (π-helix), B (bridge, a single-residue β-strand), T (β-turn), S (bend), and C (coil).

Eight classes are too many for existing secondary structure prediction methods; instead, usually only three states are predicted, as in Table 1: helix (H), extended β-sheet (E), and coil (C). As required, the eight-letter DSSP alphabet is translated into the three-letter code.

Table 1: Reducing DSSP 8 classes code to 3 Classes

DSSP 8-classes                               3-class
α-helix (H), 3/10 helix (G)                  Helix (H)
β-sheet (E), β-bridge (B)                    Strand (E)
π-helix (I), Turn (T), Bend (S), Coil (C)    Coil (C)

1.2.1 Chief elements of secondary structure are:

1. Alpha helix: The alpha helix, also known as the 4-turn helix, is a commonly occurring type of element in proteins. The helical structure arranges the amino acids in a helix about 5 Å wide, and each amino acid is translated about 1.5 Å along the helical axis relative to the previous one. A canonical α-helix has 3.6 residues per turn and is built up from a contiguous amino acid segment via hydrogen bond formation between amino acids in positions i and i + 4. The average length of an alpha helix is 10 amino acids, and a minimum of 4 amino acids is required for a structure to be classified as an alpha helix. The residues taking part in an α-helix have φ angles around −60° and ψ angles around −50°. Alpha helices are present at the surface of protein cores, where they provide an interface with the aqueous environment.
2. β-sheet: The beta sheet is the second commonly known type of structural element. Two or more amino acid segments (beta strands) present in the same protein that bond together through hydrogen bonds form a beta sheet. A β-strand is a more extended structure with 2.0 residues per turn. Values for φ and ψ vary, with typical values of −140° and 130°. A β-strand interacts via hydrogen bonds with other β-strands, which may be distant in sequence, to form a β-sheet. Adjacent strands can bond in a parallel or an anti-parallel configuration. A β-sheet consists of individual β-strands, each of which is made up of contiguous amino acid residues. The dihedral angles are φ = −140° and ψ = 135° in anti-parallel sheets and φ = −120° and ψ = 115° in parallel sheets. The parallel beta sheet is characterized by two peptide strands running in the same direction held together by hydrogen bonding between the strands. The anti-parallel beta sheet is characterized by two peptide strands running in opposite directions held together by hydrogen bonding between the strands.
3. Coils and Loops: Coil or loop regions connect α-helices and β-sheets and have varying lengths and shapes. They do not have regular patterns like alpha helices and beta sheets, and they can be any other part of the protein structure. They are recognized as random coil and are not classified as regular protein secondary structure. They are also known as local structures and have an irregular shape. Residues in loops or coils are located on the surface of the protein structure and tend to be charged and polar. Glycine and proline are commonly found in these regions.

2. LITERATURE REVIEW
Previous research discusses the use of a new method for the prediction of protein secondary structure from the amino acid sequence. The method is based on the most recent version of the standard GOR algorithm. A significant improvement is obtained by combining multiple sequence


alignments with the GOR method. Additional improvement in the predictions is obtained by a simple correction of the results when helices or sheets are too short, or if helices and sheets are direct neighbours along the sequence [14]. Imposing the requirement that the prediction must be strong enough, i.e. that the difference between the probability of the predicted (most probable) state and the probability of the second most probable state must be larger than a certain minimum value, also significantly improves secondary structure predictions.

3. GOR METHODOLOGY
The GOR method is based on information theory and Bayesian statistics. Information theory approaches are popular in secondary structure prediction, and these approaches are based on mathematical probability. Information theory is a branch of the mathematical theory of probability and mathematical statistics that defines the concept of information. The steadily increasing amount of information on protein structure has motivated researchers to develop several approaches that use information theory to generate new ideas for predicting protein structure and function. Most commonly, the secondary structure prediction problem is formulated as follows:
Given a protein sequence with amino acids r_1 r_2 ... r_n, predict whether each amino acid r_i is in an α-helix (H), a β-strand (E), or neither (C) [18]. Predictions of secondary structure are typically judged via the 3-state accuracy (Q3), which is the percentage of residues for which the method's predicted secondary structure (H, E, or C) is correct.

Fig 3: Prediction Scheme (example mapping of residues to predicted states: M→C, E→C, R→C, P→C, Y→E, A→E, C→C, ...)

The GOR method is one of the first major methods proposed for the prediction of secondary structure from sequence. The three letters of GOR are derived from the first letters of the authors' names (Garnier, Osguthorpe, Robson). The version of the GOR method used here relies on a database of 267 proteins containing 63,000 residues [11].

A secondary structure prediction method expresses its result in terms of the percentage of helix, sheet and coil present. The formation of α-helix, β-sheet and coil is predicted for each amino acid residue in the sequence. The predictions for all secondary structure elements are combined to obtain the predicted secondary structure of the protein, as in Fig 4. Rather than considering propensities for a single residue, position-dependent propensities have been calculated for all residue types. The GOR method works on various sequence formats and uses information theory to generate the code that relates the amino acid sequence to the secondary structure of proteins. Three scoring matrices are prepared in the GOR method to calculate the probability of each amino acid being present at every position. One matrix corresponds to the central amino acid being found in an α-helix, the second to the amino acid being in a β-strand, and the third to a coil.

Fig: 4 General Framework for Protein Secondary Structure Prediction Method

The GOR method works on a window of 17 residues: the eight nearest neighbouring residues on each side are included in the calculation for a given residue. The conformational state, one of three states, is predicted from the type of amino acid R as well as the neighbouring residues along the window. Information theory supplies the information function. The GOR method calculates information from the residues within a sliding window, as in Fig 5.

To determine the structure at a given amino acid position j, the GOR method looks at a window of 8 amino acids before and 8 after the position of interest. Suppose a_j is the amino acid whose structure we are trying to determine. GOR looks at the residues

a_{j-8} a_{j-7} ... a_j ... a_{j+7} a_{j+8}

Intuitively, it assigns a structure based on probabilities it has calculated from protein databases. These probabilities are of the form

Pr[amino acid j is α | a_{j-8} a_{j-7} ... a_j ... a_{j+7} a_{j+8}]
Pr[amino acid j is β | a_{j-8} a_{j-7} ... a_j ... a_{j+7} a_{j+8}]

In the GOR method, three scoring matrices are prepared; each column contains the probability of finding each amino acid at one of the 17 positions. The information-theoretic treatment rests on the information function I(S; R), which is given below in mathematical notation together with the related functions and formulae. The information function is described as the logarithm of the ratio of the conditional probability P(S | R) of observing conformation S given residue R to the probability P(S) of that conformation.

The information available as to the joint occurrence of secondary structural conformation S and amino acid R is given by

I(S; R) = \log [ P(S | R) / P(S) ]


where P(S | R) is the conditional probability of conformation S given residue R, and P(S) is the probability of conformation S. The information function is thus the logarithm of the ratio of the conditional probability P(S | R) of observing conformation S, which can be one of three states, helix (H), extended (E), or coil (C), for residue R, which is one of the 20 possible amino acids, to the probability P(S) of the occurrence of conformation S.

By the definition of conditional probability, on which Bayes' rule rests, the probability of conformation S given amino acid R is

P(S | R) = P(S, R) / P(R)

where P(S, R) is the joint probability of S and R and P(R) is the probability of R. These probabilities can be estimated from the frequency of each amino acid found in each structure and the frequency of each amino acid in the structural database. Given these frequencies,

I(S; R) = \log ( f_{S,R} / f_S )

where f_{S,R} is the relative frequency with which amino acid R is found in conformation S (the fraction of all occurrences of R in the database that are in conformation S, an estimate of P(S | R)) and f_S is the relative frequency of conformation S among all amino acid residues in the database (an estimate of P(S)).

Fig: 5 Work Flow Diagram of GOR

Advantages of GOR method are:
1. The GOR method identifies all factors that are included in the analysis and calculates the probabilities of all three conformational states.
2. The GOR algorithm is computationally fast and uses little CPU time and memory.
3. It is possible to perform the full jack-knife procedure. In this procedure a single protein is removed from the database and the frequencies are recalculated.
4. The GOR method reads a protein sequence and predicts its secondary structure. For each residue along the sequence, the program calculates the probabilities for each conformational state (H, E and C), and the secondary structure prediction is made on the basis of these probabilities. Except in very few cases, the state with the highest probability corresponds to the predicted conformational state.

4. SEQUENCE FORMATS
Sequences can be read and written in a variety of formats. Sequence formats are ASCII text; each format defines the arrangement of characters, symbols and keywords that specify how items such as the sequence, ID name and comments appear in a sequence entry.

1. FASTA format begins with a single-line description, followed by lines of sequence data. FASTA is a text-based format for representing nucleotide and peptide sequences, in which single-letter codes are used to represent nucleotides and amino acids. It is the default format and contains the header and the sequence. The description line is distinguished from the sequence data by a greater-than (">") symbol in the first column. The simple FASTA format has the ID name as the first word on its title line. A sequence ends when another line starting with ">" appears, which marks the start of the next sequence. For example, a FASTA format sequence with the ID name 'xyz' is represented as below:

>xyz some other comment
ttcctctttctcgactccatcttcgcggtagctgggaccgccgttcagtcgccaatat
gcagctctttgtccgcgcccaggagctacacaccttcgaggtgaccggccaggaa
acggtcgcccagatcaaggctcatgtagcctcactggagggcatt

2. The GenBank database is a collection of all publicly available nucleotide sequences and their protein translations. The National Center for Biotechnology Information (NCBI) produces and maintains the GenBank database as part of the International Nucleotide Sequence Database Collaboration (INSDC). A GenBank format sequence is represented as below:

LOCUS AAU03518 237 bp DNA PLN 04-FEB-1995
DEFINITION Aspergillus awamori internal transcribed spacer 1 (ITS1) and 18S
rRNA and 5.8S rRNA genes, partial sequence.
ACCESSION U03518
BASE COUNT 41 a 77 c 67 g 52 t
ORIGIN
1 aacctgcgga aggatcatta ccgagtgcgg gtcctttggg cccaacctcc catccgtgtc
61 tattgtaccc tgttgcttcg gcgggcccgc cgcttgtcgg ccgccggggg ggcgcctctg
121 ccccccgggc ccgtgcccgc cggagacccc aacacgaaca ctgtctgaaa gcgtgcagtc
//

3. GCG (Genetic Computer Group) format has the following properties:
   1. It contains exactly one sequence.
   2. It begins with annotation lines.
   3. The start of the sequence is marked by a line ending with ".." (two dot characters).
   4. This line also contains the sequence identifier, the sequence length and a checksum.
   5. GCG format sequences are used by the GCG program suite.
   6. If the sequence is edited or the checksum is changed, the checksum is no longer valid and the sequence file will not work.
A GCG format sequence is represented as below:

XX
SQ Sequence 237 BP; 41 A; 77 C; 67 G; 52 T; 0 other;
AA03518 Length: 237 Check: 4514
1 aacctgcgga aggatcatta ccgagtgcgg gtcctttggg cccaacctcc catccgtgtc
61 tattgtaccc tgttgcttcg gcgggcccgc cgcttgtcgg ccgccggggg ggcgcctctg
121 ccccccgggc ccgtgcccgc cggagacccc aacacgaaca ctgtctgaaa gcgtgcagtc
181 tgagttgatt gaatgcaatc agttaaaact ttcaacaatg gatctcttgg ttccggc
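
As a rough illustration of the FASTA layout described in this section, the Python sketch below splits FASTA text into records. It is not the implementation used in this work (which was done in MATLAB); the function names `parse_fasta` and `_make_record` and the (ID, description, sequence) tuple layout are assumptions made purely for the example.

```python
def _make_record(header, chunks):
    """Build an (id, description, sequence) tuple from one FASTA entry."""
    parts = header.split(None, 1)          # first word of the title line is the ID name
    seq_id = parts[0] if parts else ""
    description = parts[1] if len(parts) > 1 else ""
    return seq_id, description, "".join(chunks)


def parse_fasta(text):
    """Parse FASTA-formatted text into a list of (id, description, sequence) tuples."""
    records = []
    header = None
    chunks = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue                        # ignore blank lines
        if line.startswith(">"):
            # A new ">" line ends the previous sequence and starts the next one.
            if header is not None:
                records.append(_make_record(header, chunks))
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)             # sequence data may span several lines
    if header is not None:
        records.append(_make_record(header, chunks))
    return records


example = """>xyz some other comment
ttcctctttctcgactccatcttcgcggtagctgggaccgccgttcagtcgccaatat
gcagctctttgtccgcgcccaggagctacacaccttcgaggtgaccggccaggaa
acggtcgcccagatcaaggctcatgtagcctcactggagggcatt
"""

records = parse_fasta(example)
seq_id, description, sequence = records[0]
print(seq_id, description, len(sequence))
```

The sequence lines are simply concatenated, so line wrapping in the file does not affect the sequence that is passed on to the prediction step.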


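The calculation described in Section 3 can be sketched in a few lines of Python. This is a simplified, single-residue (GOR I-style) illustration under stated assumptions, not the MATLAB implementation used in this work: it reduces DSSP codes as in Table 1, estimates I(S; R) = log[P(S | R) / P(S)] from counts for each of the 17 window positions, sums the information over the window, and reports the percentage of each state and the Q3 accuracy. All function names, the pseudocount of 1 and the toy training data are hypothetical and chosen only for this example.

```python
import math

STATES = "HEC"       # helix, strand, coil
HALF_WINDOW = 8      # 8 neighbours on each side: a window of 17 residues

# Table 1: DSSP 8-class code reduced to the 3-class code.
DSSP_TO_3 = {"H": "H", "G": "H", "E": "E", "B": "E",
             "I": "C", "T": "C", "S": "C", "C": "C"}


def reduce_dssp(dssp):
    """Translate an 8-letter DSSP string to H/E/C (unknown symbols become C, an assumption)."""
    return "".join(DSSP_TO_3.get(ch, "C") for ch in dssp)


def train(pairs):
    """Count residue R at offset m from a centre residue whose state is S."""
    counts = {}        # (m, S, R) -> count
    state_totals = {}  # S -> number of residues observed in state S
    for seq, dssp in pairs:
        states = reduce_dssp(dssp)
        for j, s in enumerate(states):
            state_totals[s] = state_totals.get(s, 0) + 1
            for m in range(-HALF_WINDOW, HALF_WINDOW + 1):
                if 0 <= j + m < len(seq):
                    key = (m, s, seq[j + m])
                    counts[key] = counts.get(key, 0) + 1
    return counts, state_totals


def information(counts, state_totals, m, s, r, pseudo=1.0):
    """I(S; R) = log[P(S | R) / P(S)] for residue r at window offset m (pseudocounts assumed)."""
    n_r = sum(counts.get((m, t, r), 0) for t in STATES)
    p_s_given_r = (counts.get((m, s, r), 0) + pseudo) / (n_r + pseudo * len(STATES))
    n_all = sum(state_totals.get(t, 0) for t in STATES)
    p_s = (state_totals.get(s, 0) + pseudo) / (n_all + pseudo * len(STATES))
    return math.log(p_s_given_r / p_s)


def predict(seq, counts, state_totals):
    """Assign each residue the state with the largest information summed over the window."""
    predicted = []
    for j in range(len(seq)):
        scores = {}
        for s in STATES:
            scores[s] = sum(
                information(counts, state_totals, m, s, seq[j + m])
                for m in range(-HALF_WINDOW, HALF_WINDOW + 1)
                if 0 <= j + m < len(seq)
            )
        predicted.append(max(STATES, key=scores.get))
    return "".join(predicted)


def composition(states):
    """Percentage of helix, strand and coil in a state string."""
    n = len(states)
    return {s: (100.0 * states.count(s) / n if n else 0.0) for s in STATES}


def q3(predicted, observed):
    """Three-state accuracy: percentage of residues predicted correctly."""
    matches = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * matches / len(observed)


# Toy training data invented for this example (NOT the 267-protein database).
toy_pairs = [
    ("AAAAAAAAAA", "HHHHHHHHHH"),
    ("VVVVVVVVVV", "EEEEEEEEEE"),
    ("GGGGGGGGGG", "TTSSCCCCCC"),
]
counts, state_totals = train(toy_pairs)
prediction = predict("AAAAAA", counts, state_totals)
print(prediction, composition(prediction), q3(prediction, "HHHHHH"))
```

The three count tables indexed by state play the role of the three 17 x 20 scoring matrices. With a real structural database the pseudocount matters less, but on tiny samples it avoids taking the logarithm of zero.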
5. RESULTS AND DISCUSSION
In the present work, we deal with amino acid residues to determine the secondary structure of sequences. The GOR method is used to predict the secondary structure of each amino acid residue. Amino acids combine through peptide bonds to form proteins.

The GOR method is a window-based approach. It uses scoring matrices of size 17 x 20 (17 window positions by 20 amino acids). It predicts the percentage of each of the three conformational states according to its presence in the protein sequence. The DSSP code is used with the GOR method, with the 8-class code reduced to 3 classes. Different sequence formats are used as input in the present work, and each format has its own layout and significance. Classification trees are generated from the root node down to the leaf nodes; each node checks the value of one predictor variable. The present work was carried out on the MATLAB platform. The FASTA format contains a header and a sequence. The GenBank database is a collection of all publicly available nucleotide sequences and their protein translations. The GCG format contains a sequence whose start is marked by a line ending with ".." (two dot characters).

The GOR method assigns a secondary structure to every protein sequence. The structure is reported as the percentage of each conformational state, that is, helix, sheet and coil, and the resulting graph is drawn according to these percentage values. Users can enter and edit their own sequence to obtain results.

6. CONCLUSION & FUTURE WORK
The GOR method, based on information theory and Bayesian statistics, was studied and found to be quite successful in its accuracy of secondary structure prediction. The probabilities of the three conformational states are predicted for each residue in the sequence with the help of the GOR method, and this information can be used for further analysis. These results are achieved when predictions are made on a single sequence.

The developed method is highly stable and consistent when tested against the different DSSP secondary structure reduction methods used in this research. Information regarding the secondary structure elements (helix, sheet and coil) that form for a particular sequence of amino acids is distributed across the whole window. This information is retrieved from a database of 267 proteins. Different input sequence formats are used to determine the accuracy of secondary structure prediction in the GOR method.

Several recommendations can be made for further work in the domain of protein secondary structure prediction:
1. A variety of different sequence formats can be introduced for further analysis.
2. A variety of bioinformatics tools are available which can be used to incorporate new research in the bioinformatics field.
3. The present GOR method is based on a single sequence, but in future it can be combined with multiple sequence alignment to achieve different results.
4. Since research in the bioinformatics field is growing rapidly, future work should aim to achieve optimal results in less time.

7. ACKNOWLEDGMENTS
The author wishes to express his sincere gratitude and indebtedness to his Supervisor, Prof. Rajbir Singh (Assoc. Prof. & Head, Department of Information Technology), for his valuable guidance, insightful views and obliging nature, which led to the successful completion of this study. The author also offers cordial thanks to the members of the Departmental Research Committee (DRC) for their useful comments and constructive suggestions during all phases of the present study, as well as for critically going through the manuscript.

Words fail the author to express his deep sense of gratitude towards his family members for their moral and financial support and encouragement, without which the author would not have been able to bring out this thesis.

8. REFERENCES
[1] An, B., et al (2009) "Accuracy of Protein Secondary Structure Prediction Continues to Rise" International Conference on MASS' 09, pp. 1-4.
[2] Akitomi, J. (2007) "Method for predicting Secondary Structure of RNA, an apparatus for predicting and a predicting program" US Patent 0235155.
[3] Balaban, D.J. and Aggarwal, A. (2005) "Method and apparatus for providing a Bioinformatics Database" US Patent 7215804.
[4] Chang, J. and Zhu, X. (2010) "Bioinformatics Database: Intellectual Property Protection Strategy" Journal of Intellectual Property Rights, Vol 15, pp. 447-454.
[5] Chen, X., et al (2011) "The use for classification trees for bioinformatics", John Wiley & Sons, Inc. WIREs Data Mining Knowledge Discovery, vol. No. 4, pp. 55-63.
[6] Deris, S.B. et al. (2007) "Protein Secondary Structure Prediction From Amino Acid Sequence Using Artificial Intelligence Technique", Journal of bioinformatics, vol. No. 5, pp. 1-245.
[7] Exarchos, K.P. et al (2007) "Predicting peptide bond conformation using feature selection and the Naive Bayes approach" IEEE EMBS 2007, pp. 5009-5012.
[8] Fallahi, H. and Yarani, R. (2010) "Positional preferences by 20 amino acids in beta sheets" IEEE BIBMW, pp. 806-807.
[9] Gerhart, J. and Sacan, A. (2011) "BioDB: Integration of biological knowledgebases" IEEE BIBMW 2011, pp. 899.
[10] Greene, L.A. (2011) "Polypeptide Structural Motifs Associated With Cell Signaling Activity" US Patent 0004185.
[11] Garnier, J. et al (1996) "GOR method for predicting Protein Secondary Structure from Amino Acid Sequence" Methods in Enzymology, vol 266, pp. 540-553.
[12] Ismail, W.M. and Chowdhury, S. (2010) "Preference of Amino Acids in Different Protein Structural Classes: A Database Analysis" ICBBE 2010, pp. 1-5.
[13] Jiang, D., Tang, C. and Zhang, A. (2004) "Cluster Analysis for Gene Expression Data", IEEE Transactions on Knowledge and Data Engineering, vol. 11, pp. 1370-1386.
[14] Kumar, B. and Jani, N.N. (2010) "Prediction of Protein Secondary Structure based on GOR Algorithm Integrating with Multiple Sequences Alignment" International Journal of Advanced Engineering and Applications, pp. 177-182.
[15] Singh, R., et al (2010) "Chou-Fasman Method for Protein Structure Prediction using Cluster Analysis" World Academy of Science, Engineering and Technology 72 2010, pp. 982-987.
[16] Singh, M., et al (2008) "Protein Secondary Structure Prediction" World Academy of Science, Engineering and Technology, pp. 458-461.


[17] Sen Z.T., et al (2005) "GOR V server for protein secondary structure prediction" vol. 21, no. 11, pp. 2787-2788.
[18] Singh, M. (2001) "Predicting Protein Secondary and Super Secondary Structure" CRC Press, pp. 29.1-29.30.

Singh R is an Associate Professor & Head, Department of Information Technology of Lala Lajpat Rai Institute of Engineering & Technology, Moga, India. He received his B.E (Honor) degree in Computer Science and Engineering from MD University, Rohtak, Haryana and M-Tech degree in Computer Science and Engineering from Punjab Technical University, Jalandhar Pb. (INDIA). He has authored 03 books on Computer Science. His main field of research interest is Bio-Informatics and Data mining. He works on Gene Expression, Phylogenetic Trees and Prediction of Protein Sequence & Structure.

Neha Jain is an Assistant Prof. with the Department of Computer Science & Engg., Northwest Institute of Engineering & Technology, Dhudike, Moga, Punjab, INDIA. She received her B.Tech in Computer Science & Engineering and M-Tech degree in Computer Science & Engineering from Punjab Technical University, Jalandhar Pb. (INDIA). Her research interest includes Bio-Informatics, Software Engineering, & Software Testing. She works on protein structure prediction using Cluster Analysis in MATLAB.

IJCATM : www.ijcaonline.org
