Protein Modeling with Discovery Studio
Kimmo Mattila
[email protected]
Discovery Studio
Discovery Studio (DS) is a commercial molecular modeling
program for biological macromolecules (proteins, nucleic acids).
The world is full of good molecular modeling programs, such as:
• Sybyl
• Maestro
• gOpenMol
• VMD
• PyMol
• Discovery Studio
Is Discovery Studio any better?
Discovery Studio at CSC
CSC has a national academic license and installation
files for DS
The program can be installed on a user's own Windows
or Linux PC (requires a fixed IP address)
You can use DS on hippu1.csc.fi through X-term,
NoMachine or a DS client connection
The token license system does not limit the number of
installations but the number of simultaneous users.
(Close your DS if you are not using it)
Installing Discovery Studio
Academic university researchers can install the software on a local
computer
Installation instructions can be found at:
http://www.csc.fi/english/research/sciences/bioscience/programs/ds/ds_install
A group-wise license contract is needed
A fixed IP address is needed
Installation files are downloaded from the scientist's user interface
(CSC user account is needed)
https://sui.csc.fi/group/sui/downloads
DS client only: 300 MB (Windows), 400 MB (Linux)
DS complete: 3.2 GB (Windows), 3.7 GB (Linux)
Structure of Discovery Studio
• Both client and server are normally installed on the
same machine
• Separate server machines can be used too
• In most cases you do not need to worry about this
• If you use Hippu1 as your DS server you do not
need to install the whole package but just the client
DS
DS Client:
• Interface
• Visualization
• Commands and tools
Default ports: 9943, 9944
DS Server:
• Pipeline Pilot
• Apache
• Protocols (BLAST, CHARMM, Modeler, CDOCKER etc.)
Discovery Studio interface
Command menus
Toolbars
Protocols
Tools
3D Window
Hierarchy view (graphics)
Discovery Studio menu commands
File menu: Contains commands for tasks such as opening molecular data
files, saving files to disk, printing, and accessing windows.
Edit menu: Contains commands for tasks such as copying and pasting,
selecting, finding, and setting preferences.
View menu: Contains commands for tasks such as changing the way
objects appear in the various views and for choosing which views should
be shown or hidden.
Chemistry menu: Contains commands for tasks that modify the chemical
makeup of the molecules.
Structure menu: Contains commands for tasks such as adding or
removing labels, adding or removing structure monitors, calculating the
solvent accessibility, cleaning up geometry, and superimposing multiple
molecules.
Sequence menu: Contains submenus and commands to manage protein
sequences and protein sequence alignments.
Window menu: Contains commands that allow you to control the display
of open windows in the current Discovery Studio session.
Help menu: Contains commands to access the Discovery Studio Help
system and the Accelrys website.
DS toolbars
“Toolbars” are mostly shortcuts to menu
commands
However, some functions are used only
through Toolbars
Not all toolbars are normally visible. You can
add or hide toolbars from:
View | Toolbars
DS tools
“Tools” contain methods to analyze
and modify your molecular model
Tools panel can be made visible
from:
View | Explorers | Tools
The CSC license covers most but
not all the tools
Most of the tools are run within the
client but some require connection to
the DS server
DS Protocols
Protocols are more advanced modeling
and analysis tasks that are computed by
the DS server
Protocols panel can be made visible from:
View | Explorers | Protocols
Note that the license of CSC does not
cover all the protocols and tools
Hierarchy panel
The hierarchy panel is opened from:
View | Hierarchy
You can use the hierarchy to select atoms,
amino acids or molecules
Selections can be made in the 3D view,
Data table and the sequence window too.
The logic in Discovery Studio is:
1. First select the target
2. Then select the command
Data table
The Data table is opened from: View | Data table
The Data table shows data and values associated with your molecular model
The data can be viewed, modified and sorted in the data table
If a value has a gray background, it can't be modified
Try right-clicking the data table window (you can find hidden
command menus, opened by right-click, all over Discovery Studio)
Help
Discovery Studio contains a large help system. No
WWW or printed manual is available
Open help from Help | Topics (if search tools are not
visible in the Help window, press CONTROL + s)
Other tricks to find a missing function or parameter:
Use the Feature search toolbar
Try right-clicking the window or object
Check whether your parameter is located in the preferences:
Edit | Preferences
Proteins and PDB
Using PDB data in DS
http://www.rcsb.org/
Experimentally determined protein structures are
stored in the PDB (Protein Data Bank) database
Sources: X-ray diffraction (about 80 %), NMR (15 %),
others (5 %)
Over 65 000 structures (many of them, however, related and
nearly identical)
This is much less than the number of known protein
sequences (UniProtKB contains over 10 million
sequences)
PDBe Database
“Processed” version of PDB
Several search approaches, e.g. ligand
search and interface analysis
PISA to study assemblies, interfaces and
monomers
http://www.ebi.ac.uk/pdbe/
Using PDB data in Discovery Studio
Protein structures can be automatically retrieved
from the PDB database server using the four letter
PDB-code.
• File | Open URL | PDB ID
You can do sequence-based similarity searches against
PDB with BLAST protocols.
Protocol: RCSB Structure search enables metadata
and sequence motif based searches.
Local PDB formatted files can be imported too (e.g.
files retrieved from the PISA database).
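Outside Discovery Studio, the same four-character code can be used to fetch an entry directly. The sketch below only builds and validates the download address; the URL pattern is the public RCSB file download address as commonly used (an assumption, not something Discovery Studio exposes), and the function name is hypothetical.

```python
import re

# Assumed public RCSB download pattern for PDB-format files.
RCSB_DOWNLOAD = "https://files.rcsb.org/download/{pdb_id}.pdb"


def pdb_download_url(pdb_id: str) -> str:
    """Return the RCSB download URL for a four-character PDB ID."""
    code = pdb_id.strip().upper()
    # Classic PDB IDs: one digit followed by three letters or digits.
    if not re.fullmatch(r"[0-9][A-Z0-9]{3}", code):
        raise ValueError(f"not a valid four-character PDB ID: {pdb_id!r}")
    return RCSB_DOWNLOAD.format(pdb_id=code)


print(pdb_download_url("1brp"))
```

The returned address can be passed to any HTTP client; the file it points to can then be imported as a local PDB file.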
Using PDB files
Note that there can be several things that may need editing in
your PDB file before you can start to use modeling tools
Things that often require editing in PDB files:
hydrogens are missing
some side chains are missing
part of main chain is missing
ligand structure is not recognized
multiple conformations for some side chains
several structures in the PDB file
Other things to check in a PDB-file:
main chain omega, phi and psi angles
side chain rotamers
Checking and fixing a protein
The Protein Reports and Utilities tool can be used to get an overall
view of the selected PDB entry
The Validate protein structure tool can be used to check the PDB
structure and report the problematic sites
The Clean function in the Protein Reports and Utilities tool can be
used to fix some of the errors automatically
Hydrogens can be added by Chemistry/H
Protonation states can be fixed with the Build and Edit Protein tool.
The protonation states of neutral histidines are selected using
Edit | Preferences | Protein Utilities
Checking and fixing a protein
Protocol: Protocols | General Purpose | Prepare Protein
• Standardize atom names, insert missing atoms in residues and remove
alternate conformations.
• Remove water and ligand molecules depending on the settings.
• Insert missing loop regions based on either SEQRES data or user
specified loop definitions.
• Optimize short and medium size loop regions with the LOOPER
algorithm (optional).
• Minimize the remaining loop regions (optional).
• Calculate the pK and protonate the structure (optional).
PDB file format
A PDB file contains only information about atom
locations (X, Y, Z).
Data about bonds, partial charges or force field types
is not included
The column after the occupancy normally contains the B-factor
(temperature factor), but it can hold something else too
Common format for molecular data
X-ray structures lack hydrogens and may contain
several copies of the protein.
NMR-structures contain several overlapping
structures
PDB file format
HEADER RETINOL TRANSPORT 27-JUL-92 1BRP 1BRP 2
COMPND RETINOL BINDING PROTEIN (HOLO FORM) 1BRP 3
SOURCE HUMAN (HOMO SAPIENS) PLASMA
…
HELIX 1 1 VAL 6 SER 8 4 ONE SHORT TURN 1BRP 71
HELIX 2 2 PRO 146 GLU 158 1 1BRP 72
SHEET 1 S1 9 GLY 22 LYS 30 0 1BRP 73
…
SEQRES 1 A 182 GLU ARG ASP CYS ARG VAL SER SER PHE ARG VAL LYS GLU
SEQRES 2 A 182 ASN PHE ASP LYS ALA ARG PHE SER GLY THR TRP TYR ALA
SEQRES 3 A 182 MET ALA LYS LYS ASP PRO GLU GLY LEU PHE LEU GLN ASP
…
ATOM 1 N GLU 1 22.826 21.377 -30.151 1.00100.00 1 1BRP 99
ATOM 2 CA GLU 1 23.744 21.686 -29.074 1.00100.00 1 1BRP 100
ATOM 3 C GLU 1 23.395 23.023 -28.464 1.00100.00 1 1BRP 101
ATOM 4 O GLU 1 22.798 23.102 -27.389 1.00100.00 1 1BRP 102
ATOM 5 CB GLU 1 25.225 21.681 -29.508 1.00100.00 1 1BRP 103
ATOM 6 CG GLU 1 26.155 20.992 -28.489 1.00100.00 1 1BRP 104
ATOM 7 CD GLU 1 27.285 21.840 -27.971 1.00100.00 1 1BRP 105
ATOM 8 OE1 GLU 1 28.301 22.075 -28.603 1.00100.00 1 1BRP 106
ATOM 9 OE2 GLU 1 27.087 22.244 -26.741 1.00100.00 1 1BRP 107
ATOM 10 N ARG 2 23.771 24.073 -29.182 1.00100.00 1 1BRP 108
ATOM 11 CA ARG 2 23.485 25.397 -28.690 1.00 86.16 1 1BRP 109
ATOM 12 C ARG 2 22.026 25.784 -28.629 1.00100.00 1 1BRP 110
…
HETATM 1483 O HOH 229 2.848 65.969 -30.833 1.00 53.03 1BRP1581
HETATM 1484 O HOH 230 38.756 38.831 -49.928 1.00 75.69 1BRP1582
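The ATOM and HETATM records above are fixed-column text, so they are usually read by character position rather than by splitting on spaces (note how occupancy and B-factor run together as `1.00100.00`). A minimal sketch of such a reader follows, assuming the standard wwPDB column layout; the `Atom` type and the function name are hypothetical, and the chain ID `A` in the example line is taken from the SEQRES records, since the listing above does not show one in its ATOM lines.

```python
from typing import NamedTuple


class Atom(NamedTuple):
    serial: int
    name: str
    res_name: str
    chain_id: str
    res_seq: int
    x: float
    y: float
    z: float
    occupancy: float
    b_factor: float


def parse_atom_line(line: str) -> Atom:
    """Parse one fixed-column ATOM/HETATM record (standard wwPDB columns)."""
    if not line.startswith(("ATOM", "HETATM")):
        raise ValueError("not an ATOM/HETATM record")
    return Atom(
        serial=int(line[6:11]),        # columns 7-11
        name=line[12:16].strip(),      # columns 13-16
        res_name=line[17:20].strip(),  # columns 18-20
        chain_id=line[21].strip(),     # column 22
        res_seq=int(line[22:26]),      # columns 23-26
        x=float(line[30:38]),          # columns 31-38
        y=float(line[38:46]),          # columns 39-46
        z=float(line[46:54]),          # columns 47-54
        occupancy=float(line[54:60]),  # columns 55-60
        b_factor=float(line[60:66]),   # columns 61-66
    )


# Build a correctly aligned line from the first atom of the listing.
example = "ATOM  {:5d} {:<4s} {:3s} {:1s}{:4d}    {:8.3f}{:8.3f}{:8.3f}{:6.2f}{:6.2f}".format(
    1, " N", "GLU", "A", 1, 22.826, 21.377, -30.151, 1.00, 100.00
)
print(parse_atom_line(example))
```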
Forcefield methods
Force field methods
Quantum mechanics
• electronic structure calculations
• Ab initio methods
• Semi-empirical methods
• time-consuming
• high level accuracy
• < 500 heavy atoms
Molecular mechanics
• atomic level
• simple models of the interactions to calculate the energy of molecule as a function of nuclear positions only
• Not so accurate as QM methods
• >10000 atoms
Force field methods…
Force field methods are usually used for biomolecules.
To study a complex system (e.g. the binding site of a protein),
quantum mechanics and molecular mechanics methods
can be combined (hybrid QM/MM).
Force field methods…
Advantages
• large systems in reasonable calculation time.
• in some cases FF can provide results as accurate as the highest QM in a fraction of the CPU-time.
Limitations
• no atom level electrostatics.
• dependent on the quality and availability of parameters.
• The calculated energy is relative.
General form for Forcefield
The energy is a sum of terms, each describing the energy
required for distorting a molecule
• Ebond energy function for stretching a bond between two atoms
• Eangle energy required for bending an angle
• Etorsion torsional energy for rotation around a bond
• Eelec electrostatic energy (non-bonded interactions due to the distribution of the
electrons)
• Evdw van der Waals energy (repulsion or attraction between non-bonded atoms)
Example: functional form of the CHARMm force field
(equation figure not reproduced; its terms are: bond length, angle, dihedral angle, improper torsion,
electrostatic, van der Waals)
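To make the individual terms concrete, here is a minimal sketch of the commonly published CHARMM-style term forms (harmonic bonds, angles and impropers, cosine dihedrals, 12-6 van der Waals, Coulomb electrostatics). It is a simplified illustration, not the Discovery Studio implementation: all function names and parameter values are made up for the example, and the Coulomb constant is an approximate conversion factor for charges in e, distances in Å and energies in kcal/mol.

```python
import math

COULOMB_CONST = 332.06  # approximate, kcal*Angstrom/(mol*e^2); assumption


def bond_energy(k_b, b, b0):
    """Harmonic bond stretching: k_b * (b - b0)^2."""
    return k_b * (b - b0) ** 2


def angle_energy(k_theta, theta, theta0):
    """Harmonic angle bending (angles in radians)."""
    return k_theta * (theta - theta0) ** 2


def dihedral_energy(k_phi, n, phi, delta):
    """Periodic torsion term: k_phi * (1 + cos(n*phi - delta))."""
    return k_phi * (1.0 + math.cos(n * phi - delta))


def improper_energy(k_omega, omega, omega0):
    """Harmonic improper torsion, keeps groups planar."""
    return k_omega * (omega - omega0) ** 2


def lennard_jones(eps, r_min, r):
    """12-6 van der Waals term with its minimum (-eps) at r = r_min."""
    ratio6 = (r_min / r) ** 6
    return eps * (ratio6 ** 2 - 2.0 * ratio6)


def coulomb(q_i, q_j, r, dielectric=1.0):
    """Electrostatic energy between two partial charges."""
    return COULOMB_CONST * q_i * q_j / (dielectric * r)


# Illustrative values only (not taken from any real parameter file).
total = (
    bond_energy(300.0, 1.55, 1.53)
    + angle_energy(50.0, math.radians(112.0), math.radians(109.5))
    + dihedral_energy(0.2, 3, math.radians(60.0), 0.0)
    + lennard_jones(0.1, 3.8, 4.0)
    + coulomb(0.25, -0.40, 4.0)
)
print(f"total energy of the toy terms: {total:.3f} kcal/mol")
```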
Components of forcefield
The forcefield contains the necessary building
blocks for the calculations of energy and force:
• A list of atom types.
• A list of atomic charges (if not included in the
atom-type information).
• Functional forms for the components of the
energy expression.
• Parameters for the function terms.
Forcefield….
Torsion and non-bonded energy terms are
most important for biomolecules.
A force field is transferable (set of
parameters developed on a small number of
cases can be applied to much wider range of
systems).
Force fields are used in molecular
mechanics and molecular dynamics
calculations.
Forcefield functions and parameters
Goal: a simple function for reproducing
structural properties.
1) Empirically fitted force field: a functional form
and parameters are designed to satisfy
experimental results. (cvff)
2) Ab initio fitted force field: a functional form
and parameters are specified using
theoretical models and calculations. (cff)
Parameter assignment
The forcefield has the same functional form
for all atoms but different parameters for
each atom type.
The atom type and its parameters depend on
how the atom is bonded (for example,
CHARMm has 38 different carbon types).
atom type ≠ atom name
Parameter assignment...
A molecule can be neutral even though its charge
distribution is uneven => partial atomic
charges are determined.
Partial charges are important!
• hydrogen bonds
• ionic bonds
• dipole moment
Figure: isoleucine with its elements, atom names, CHARMm types and partial charges
How to use forcefield ?
Several forcefields are available commercially.
The validation of a force field depends on the purpose
it was designed for and the properties that are
studied.
The quality of the force field parameters is essential.
The complexity of the functional form
Computational power (the computational time for
calculating the force field energy grows as the square
of the number of atoms).
The ability to perform a calculation is no guarantee
that the results can be trusted!
• An unsuitable forcefield gives wrong results.
A common problem: a lack of (good) parameters.
Different forcefields usually cannot be merged, but
the results can be compared.
Forcefield methods are good for predicting properties
for classes of molecules where a lot of information
exists.
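The quadratic cost mentioned above comes from the non-bonded terms: without a cutoff, every atom pair is evaluated once. A small sketch of the pair count (the function name is hypothetical):

```python
def nonbonded_pairs(n_atoms: int) -> int:
    """Number of distinct atom pairs when no cutoff is applied."""
    return n_atoms * (n_atoms - 1) // 2


for n in (100, 1_000, 10_000):
    print(f"{n:>6} atoms -> {nonbonded_pairs(n):>12,} pairs")
```

Ten times more atoms means roughly a hundred times more pairs, which is why cutoffs and similar approximations are used for large systems.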
Forcefields in DS
Discovery Studio can use CHARMM forcefields:
CHARMm, CHARMm polar H, CHARMm19, CHARMm22,
CHARMm27, XPLOLIG, MMFF, cff
Location of forcefield files in Discovery Studio
DiscoveryStudio17/share/forcefield
The InsightII manual chapter “Forcefield based simulations” gives a
good introduction to force fields and their applications:
https://extras.csc.fi/msimanual/doc/insight2005/ffbs/FF_SimulTOC.html
Applications of forcefield methods
Applications of forcefield methods
• Molecular mechanics = Minimisation = Optimisation
• Molecular dynamics = Simulation
Basic assumption for using forcefield methods:
A REAL MOLECULE IS IN A STATE WHICH
CORRESPONDS TO A MODEL NEAR A
POTENTIAL ENERGY MINIMUM
Search strategies
Several different strategies are in use.
Not all methods use plain atomistic
models
Molecular dynamics
Monte Carlo
Genetic algorithm
Fragment based method
Point complementarity methods
Energy minimization
“A REAL MOLECULE IS IN A STATE
WHICH CORRESPONDS TO A MODEL NEAR A
POTENTIAL ENERGY MINIMUM”,
so a search is made for the minimum of the potential
energy surface defined by the energy function.
A minimum energy arrangement of the atoms
corresponds to a stable state of the molecule.
Major protocols for minimization
Steepest Descent
Conjugate Gradient
Adopted Basis Newton-Raphson
Powell
The steepest descents method
The gradient of the potential energy determines
the direction that leads to the largest
reduction in energy. A step is taken in
that direction.
Robust and simple method,
useful when starting far from a minimum
Convergence is slow near a minimum
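The idea can be sketched in a few lines. This is a toy illustration with a fixed step size on a two-variable quadratic "energy", not the CHARMm minimizer: real implementations adjust the step size as they go, and the function names here are hypothetical.

```python
def steepest_descent(grad, x0, step=0.1, tol=1e-6, max_iter=1000):
    """Follow the negative gradient with a fixed step until it vanishes."""
    x = list(x0)
    for iteration in range(max_iter):
        g = grad(x)
        g_norm = sum(gi * gi for gi in g) ** 0.5
        if g_norm < tol:
            return x, iteration
        # Step against the gradient: the direction of steepest energy decrease.
        x = [xi - step * gi for xi, gi in zip(x, g)]
    return x, max_iter


def grad_energy(p):
    """Gradient of E(x, y) = (x - 1)^2 + 10 * (y + 2)^2."""
    x, y = p
    return [2.0 * (x - 1.0), 20.0 * (y + 2.0)]


# For this function a fixed step above 0.1 would diverge along y.
minimum, n_iter = steepest_descent(grad_energy, [5.0, 5.0], step=0.04)
print(minimum, n_iter)
```

The stiff y direction is relaxed within a few steps, while the soft x direction needs most of the iterations, which is the slow convergence near the minimum noted above.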
Conjugate gradient method
The gradient of the previous step is included
More efficient convergence to the minimum → number
of iterations smaller than for steepest descents
Works quickly when the molecule's structure is far
from an energy minimum.
The best choice for general use.
More complex → time per iteration is longer than
for steepest descents.
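As an illustration of how the previous step enters the search direction, the sketch below applies the textbook linear conjugate gradient method to a quadratic energy E(x) = ½ xᵀAx − bᵀx, where the line search can be done exactly. Molecular minimizers use nonlinear variants of this scheme; the matrix, vector and function names here are assumptions chosen for the example.

```python
def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def mat_vec(a, v):
    return [dot(row, v) for row in a]


def conjugate_gradient(a, b, x0, tol=1e-8, max_iter=100):
    """Minimize 0.5*x^T A x - b^T x for a symmetric positive definite A."""
    x = list(x0)
    r = [bi - axi for bi, axi in zip(b, mat_vec(a, x))]  # negative gradient
    p = list(r)                                          # first search direction
    for iteration in range(max_iter):
        if dot(r, r) ** 0.5 < tol:
            return x, iteration
        ap = mat_vec(a, p)
        alpha = dot(r, r) / dot(p, ap)                   # exact line search
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r_new = [ri - alpha * api for ri, api in zip(r, ap)]
        beta = dot(r_new, r_new) / dot(r, r)             # mixes in previous direction
        p = [rn + beta * pi for rn, pi in zip(r_new, p)]
        r = r_new
    return x, max_iter


# Same energy surface as (x - 1)^2 + 10 * (y + 2)^2, up to a constant.
A = [[2.0, 0.0], [0.0, 20.0]]
B = [2.0, -40.0]
cg_minimum, cg_iter = conjugate_gradient(A, B, [5.0, 5.0])
print(cg_minimum, cg_iter)
```

For a two-variable quadratic the method needs about two steps, compared with a couple of hundred fixed-step steepest descent iterations on the same surface.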
Newton-Raphson method
Uses not only the first derivatives (i.e. the gradients)
but also the second derivatives (the curvature of the
function) to locate a minimum.
The initial guess of the structure needs to be close to
a minimum.
Convergence is fast near a minimum
Computationally demanding for systems with many
atoms (suitable for fewer than 100 atoms).
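A one-dimensional sketch shows the principle: each step divides the first derivative by the second derivative. The test function E(x) = x⁴/4 − x and the function names are assumptions made for illustration; for N atoms the second derivative becomes a 3N × 3N matrix, which is what makes the method expensive.

```python
def newton_raphson(d1, d2, x0, tol=1e-10, max_iter=50):
    """Find a stationary point using first (d1) and second (d2) derivatives."""
    x = x0
    for iteration in range(max_iter):
        step = d1(x) / d2(x)  # gradient divided by curvature
        x -= step
        if abs(step) < tol:
            return x, iteration + 1
    return x, max_iter


# E(x) = x^4 / 4 - x has its minimum at x = 1.
def first_derivative(x):
    return x ** 3 - 1.0


def second_derivative(x):
    return 3.0 * x ** 2


nr_minimum, nr_iter = newton_raphson(first_derivative, second_derivative, x0=2.0)
print(nr_minimum, nr_iter)
```

Starting this example at x = 0, where the curvature is zero, would fail with a division by zero, which illustrates why the initial guess has to be reasonably close to the minimum.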
Comparison of minimization methods
what determines which method is best?
1) size of system
2) current state of optimization
robustness (i.e. ability to reach minimum regardless of initial
conditions)
steepest descents > conjugate gradient and Newton-Raphson
number of iterations
steepest descents > conjugate gradient > Newton-Raphson
Strategy in minimization
use steepest descents for first 10-100 steps to
remove bad contacts.
then use Newton-Raphson or conjugate gradients
to complete minimization to convergence
DS Smart minimizer:
1. steepest descents (max 1000 iterations)
2. conjugate gradient (max 1000 iterations)
Strategy in minimization
As a minimum is approached, the rate of
convergence slows down and the minimisation
method crawls toward the minimum at an ever
decreasing speed.
• Convergence criteria (either for energy or
conformation changes) and the number of
minimization cycles are used to end the minimisation.
The step size in a minimisation algorithm is defined
either by the energy change in the previous step or by
a line search method.
Local or global energy minimum?
Optimization methods can only locate the “close by”
minimum for a given set of coordinates, which is normally a local minimum, not the
global minimum.
To check whether a minimum is local or global, all
conformations need to be searched, and the number of
minima typically grows exponentially with the
number of variables.
Systematic search is not possible for complete proteins,
and thus the global minimum cannot be determined.
Local or global energy minimum?
Systematic search can be used to check possible
conformations, but only for small systems.
Table 1: possible conformations for linear alkanes CH3(CH2)n+1CH3 [1]
N    Number of possible conformations    Time
1    3                                   3 s
5    243                                 4 min
15   14348907                            166 d
[1] F. Jensen: Introduction to Computational Chemistry
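The numbers in the table follow from three staggered states per rotatable bond, so the count is 3^N. The times are consistent with about one second per conformation, which is an assumption inferred from the table rather than stated in the source. A short sketch with hypothetical function names:

```python
def conformation_count(n_torsions: int, states_per_torsion: int = 3) -> int:
    """Number of conformations in a systematic search over N torsions."""
    return states_per_torsion ** n_torsions


def search_time_seconds(n_torsions: int, seconds_per_conformation: float = 1.0) -> float:
    """Total search time, assuming a fixed cost per conformation."""
    return conformation_count(n_torsions) * seconds_per_conformation


for n in (1, 5, 15):
    seconds = search_time_seconds(n)
    print(f"N={n:>2}: {conformation_count(n):>10} conformations, "
          f"{seconds / 86400:.2f} days")
```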
Applications of energy minimization
relieve any unfavourable interactions in the initial
configuration.
calculations of the energy of molecular structure
conformational search procedures
normal mode analysis
Molecular dynamics
• Movements of the molecule are simulated at a certain temperature
• Temperature = thermal movement
• Each atom has position, mass and velocity
• The force field acts on the model according to Newton's law
F = ma
or
$$m_i \frac{d^2 \mathbf{r}_i}{dt^2} = -\nabla_i E(\mathbf{r}_1, \mathbf{r}_2, \ldots, \mathbf{r}_N), \quad i = 1, \ldots, N$$
Molecular dynamics
Total energy of the system is
• Etot = Epot + Ekin
Normal dynamics (Verlet/leap frog)
Langevin dynamics
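A minimal sketch of a Verlet-type integrator is given below, using the velocity Verlet form on a one-dimensional harmonic oscillator. It illustrates how positions and velocities are advanced from the forces and how the total energy Etot = Epot + Ekin stays nearly constant; it is a toy model in reduced units with hypothetical names, not the integrator code used by Discovery Studio.

```python
import math


def velocity_verlet(force, x, v, mass, dt, n_steps):
    """Integrate Newton's equation of motion for one particle in 1D."""
    trajectory = [(x, v)]
    f = force(x)
    for _ in range(n_steps):
        v_half = v + 0.5 * dt * f / mass  # half-step velocity update
        x = x + dt * v_half               # full-step position update
        f = force(x)                      # force at the new position
        v = v_half + 0.5 * dt * f / mass  # second half-step velocity update
        trajectory.append((x, v))
    return trajectory


K = 1.0     # spring constant (reduced units, assumption)
MASS = 1.0  # particle mass (reduced units, assumption)


def spring_force(x):
    return -K * x


def total_energy(x, v):
    return 0.5 * K * x * x + 0.5 * MASS * v * v


trajectory = velocity_verlet(spring_force, x=1.0, v=0.0, mass=MASS, dt=0.01, n_steps=1000)
e_start = total_energy(*trajectory[0])
e_end = total_energy(*trajectory[-1])
print(f"x(t=10) = {trajectory[-1][0]:.4f}, exact = {math.cos(10.0):.4f}")
print(f"energy drift = {e_end - e_start:.2e}")
```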
Using molecular dynamics
Proteins are not rigid but flexible objects
• Dynamical models are sometimes needed
• fluctuations
• Conformational search and changes
• ligand binding
• estimating thermodynamical parameters
Accuracy, size and timescales of the simulations
are quite limited (tens of thousands of atoms,
hundreds of nanoseconds)
Molecular dynamics parameters
Handling of nonbonded interactions
• All interactions can’t be explicitly included
• Cutoff
• Ewald summation
• Cell multipole
Time step
• is limited by the highest frequency in the model
• 0.5-5 fs => millions of simulation steps are needed
to reach the nanosecond time scale.
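The step count follows directly from the unit conversion 1 ns = 10^6 fs. A small sketch (the function name is hypothetical):

```python
def steps_needed(simulation_ns: float, time_step_fs: float) -> int:
    """Number of MD steps for a simulation length, using 1 ns = 1e6 fs."""
    return round(simulation_ns * 1e6 / time_step_fs)


for dt_fs in (0.5, 1.0, 2.0):
    print(f"dt = {dt_fs} fs -> {steps_needed(1.0, dt_fs):,} steps per nanosecond")
```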
Molecular dynamics parameters
Temperature
• normally 300 K
• Thermal equilibrium requires temperature control
or careful heating
• Simulated annealing utilizes higher temperatures
Other parameters
• Force field scaling
• dielectric constant
Solvent environment
A solvent (water) environment requires more
computing but makes the model more realistic
Periodic boundary conditions are used to create
a continuous solvent environment
Discovery Studio: Protocol Simulation/Solvation
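With periodic boundary conditions, distances are usually measured to the nearest periodic image of the other atom (the minimum image convention). The sketch below assumes a rectangular box and uses hypothetical function names; it illustrates the principle only and is not taken from Discovery Studio.

```python
def minimum_image(delta: float, box_length: float) -> float:
    """Shift a coordinate difference to the nearest periodic image."""
    return delta - box_length * round(delta / box_length)


def periodic_distance(a, b, box):
    """Distance between points a and b in a rectangular periodic box."""
    return sum(
        minimum_image(bi - ai, length) ** 2
        for ai, bi, length in zip(a, b, box)
    ) ** 0.5


box = (10.0, 10.0, 10.0)
atom_a = (0.5, 0.0, 0.0)
atom_b = (9.5, 0.0, 0.0)
# The atoms are 9.0 apart inside the box but only 1.0 apart across the boundary.
print(periodic_distance(atom_a, atom_b, box))
```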
Analysis of molecular dynamics
The course of the simulation is recorded
(trajectory).
Later on several properties can be analyzed
• Geometry (angles distances)
• Changes and fluctuations
• Energies
• Interactions, hydrogen bonds
• Distributions and correlations
Discovery Studio: Animation toolbar , Analyze
trajectory tool and Analysis Protocols modules.
Docking
many protein related biological processes are
regulated or enabled by specific binding of small
organic molecules (ligands) to the proteins
• signal transduction
• enzyme activity /inhibition
many drugs are known or thought to work by
binding to a target protein
what molecules could bind to the active site or
where and how do the active molecules bind?
Docking
Two basic components:
1. scoring function
2. search strategy
initial position
optimizing search
Systematic search is normally not possible
Docking
computer based docking can be used to predict binding
geometries for large libraries of candidate molecules, if
the protein structure is available
speed is an issue (maximum duration a few
minutes/ligand)
What do we want to know?
• does this molecule bind or not?
• which of these molecules are the most promising ligands?
• what is the binding geometry or site of this molecule?
• what is the binding affinity of this molecule?
docking does not try to simulate the binding process!
Scoring function
the scoring function should distinguish the real
binding modes from other binding modes
• force fields
• empirical free energy functions
• knowledge based functions
scoring can be used together with the
conformation search method or only for ranking
the search results
scoring functions are the most critical issue of
docking
Ligand Preparation in DS
A. Manually
B. Ligand preparation protocol
• Charges are standardized for common groups
• The largest fragment is kept
• Hydrogens are added
• The molecule is represented in Kekule form
• The ionization states may be enumerated
• Tautomers may be generated
• Isomers may be generated
• Duplicates may be removed
• 3D coordinates may be calculated (not at CSC)
Protein Preparation in DS
Protein health tool
• check your structure
Protein report and Build and edit protein
• fix your force field
Force Field tool
• set up the force field
• Check the automation level of the protocols from
Edit/Preferences!
Define and Edit Binding Site tool
• look for cavities
• Define the binding site sphere using a cavity or specific site
CDOCKER
CHARMm force field based docking tool
Uses soft-core potentials
Conformation search using simulated annealing
Grid based energy evaluation
Force field based scoring
Workflow:
1. Generate ligand conformations through high temperature MD
2. Random (rigid-body) rotation in the active site
3. Grid-based simulated annealing (several cycles)
4. Full minimization
5. Output # of refined ligand poses sorted by energy
After CDOCKER
You can
Rescore the structures using protocol:
• Calculate Binding Energies
Optimize binding site with the ligand using protocol:
• Ligand Minimization
Study the results with protocol:
• Analyze Ligand Poses
CDOCKER: Papers to read
Wu G, Robertson DH, Brooks CL 3rd, Vieth M.
Detailed analysis of grid-based molecular docking:
A case study of CDOCKER-A CHARMm-based MD docking algorithm.
J Comput Chem. 2003 Oct;24(13):1549-62.
Erickson JA, Jalaie M, Robertson DH, Lewis RA, Vieth M.
Lessons in molecular recognition:
the effects of ligand and protein flexibility on molecular docking accuracy.
J Med Chem. 2004 Jan 1;47(1):45-55.
Ferrara P, Gohlke H, Price DJ, Klebe G, Brooks CL 3rd.
Assessing scoring functions for protein-ligand interactions.
J Med Chem. 2004 Jun 3;47(12):3032-47.