Genetic Algorithm for QSAR Modeling

The document discusses a genetic algorithm tool for QSAR model development. It describes how the genetic algorithm works, including initializing the population, selecting individuals for breeding through a fitness function, applying genetic operators like crossover and mutation to generate new solutions, and terminating when certain conditions are met. The tool uses these concepts to automatically select significant descriptors during model development.


Genetic Algorithm (GA)

A QSAR model development tool

NANOBRIDGES
-A Collaborative Project

The authors are grateful for the financial support from the European
Commission through the Marie Curie IRSES program, NanoBRIDGES project (FP7-
PEOPLE-2011-IRSES, Grant Agreement number 295128).
Genetic Algorithm (GA) Tool for QSAR model development

A genetic algorithm (GA) is a search heuristic method that mimics the process of natural
selection. Where the exhaustive search is impractical, heuristic methods are used to
speed up the process of finding a satisfactory solution. Genetic algorithms belong to the
larger class of evolutionary algorithms (EA), which generate solutions to optimization
problems using techniques inspired by natural evolution, such as inheritance, crossover,
mutation, and selection.
The evolution usually starts from a population of randomly generated
individuals and is an iterative process; the population in each iteration is known as
a generation. In each generation, the fitness of every individual in the population is
evaluated; the fitness is usually the value of the objective function in the optimization
problem being solved. The more fit individuals are stochastically selected from the
current population, and each individual's genome is modified (recombined and possibly
randomly mutated) to form a new generation. The new generation of candidate
solutions is then used in the next iteration of the algorithm. Commonly, the algorithm
terminates when either a maximum number of generations has been produced, or a
satisfactory fitness level has been reached for the population [1] [2].

Theoretical background and the Algorithm

Initialization of genetic algorithm

Initially, many individual solutions are generated, usually at random, to form an initial
population that covers the entire range of possible solutions. Occasionally, the solutions
may be "seeded" in areas where optimal solutions are likely to be found.

Selection
During each successive generation, a proportion of the existing population is selected to
breed a new generation. A fitness function evaluates each individual solution, and the
individuals with the best fitness values are selected.

The fitness function used in this program is as follows:

$$F = \sum_{i=0}^{k} \frac{P_i - \hat{P}_i}{\hat{P}_{i,\mathrm{Max}} - \hat{P}_{i,\mathrm{Min}}} \qquad (1)$$

where,
$F$ = Fitness score
$k$ = Number of parameters
$P_i$ = Property of an individual
$\hat{P}_i$ = Desired value of that property
$\hat{P}_{i,\mathrm{Max}}$ = Maximum desired value of that property
$\hat{P}_{i,\mathrm{Min}}$ = Minimum desired value of that property

At present, 5 validation parameters (hence $k = 5$) are used, i.e. $r^2$, $r^2_{Adjusted}$, $q^2$, average
$r_m^2$, and delta $r_m^2$, with desired values >0.6, >0.6, >0.6, >0.5, and <0.2, respectively.

Another fitness function, Lack of Fit (LOF), which is computed and displayed in the
output file, is as follows:

$$\mathrm{LOF} = \frac{\mathrm{LSE}}{\left(1 - \frac{c + d\,p}{M}\right)^{2}} \qquad (2)$$

where,
LSE = least-squares error of the model
c = number of descriptors in the model
d = smoothing parameter (default value = 1.0)
p = total number of descriptors
M = total number of compounds
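
As a quick check of equation 2, the LOF value can be computed by hand or with a few lines of Python. The sketch below is only an illustration: the function name `lack_of_fit` and the example numbers are assumptions and are not taken from the tool.

```python
def lack_of_fit(lse, c, p, m, d=1.0):
    """Lack of Fit (equation 2): LSE / (1 - (c + d*p) / M)^2.

    lse: least-squares error of the model
    c:   number of descriptors in the model
    p:   total number of descriptors
    m:   total number of compounds
    d:   smoothing parameter (the manual gives 1.0 as the default)
    """
    denominator = 1.0 - (c + d * p) / m
    if denominator == 0:
        raise ValueError("(c + d*p) equals M, so LOF is undefined")
    return lse / denominator ** 2


# Hypothetical example: LSE = 2.0, 3 descriptors in the model,
# 7 descriptors in total, 40 compounds.
example_lof = lack_of_fit(2.0, c=3, p=7, m=40)
```

Because the denominator shrinks as (c + d*p) approaches M, the LOF value grows quickly when a model uses many descriptors relative to the number of compounds.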

Genetic operators
The next step is to generate a second-generation population of solutions from those
selected, through a combination of genetic operators: crossover (also called
recombination) and mutation.
For each new solution to be produced, a pair of "parent" solutions is selected for
breeding from the pool selected previously. By producing a "child" solution using the
above methods of crossover and mutation, a new solution is created which typically
shares many of the characteristics of its "parents". Generally, the average fitness will have
increased by this procedure for the population, since only the best solutions from the
first generation are selected for breeding, along with a small proportion of less fit
solutions. These less fit solutions ensure genetic diversity within the genetic pool of the
parents and therefore ensure the genetic diversity of the subsequent generation of
children. Although crossover and mutation are known as the main genetic operators, it
is possible to use other operators such as regrouping, colonization-extinction, or
migration in genetic algorithms. It is worth tuning the parameters such as the mutation
probability, crossover probability and population size to find reasonable settings for the
problem class being worked on.

Termination
This process is repeated until a termination condition has been reached. Common
terminating conditions are:
1. A solution is found that satisfies minimum criteria.
2. Fixed number of generations (usually user defined) is reached [1] [2].
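
The steps described above (initialization, selection, crossover, mutation, termination) can be sketched as a small Python program. This is a toy illustration under stated assumptions, not the tool's implementation: the function `run_ga`, its crossover and mutation rules, and the toy fitness function are all hypothetical. An individual is a fixed-size set of descriptor indices. Only the best individuals are kept as parents, so the "small proportion of less fit solutions" mentioned above is not modelled, and the only stopping rule is a fixed number of generations.

```python
import random


def run_ga(fitness, n_descriptors, equation_length=3, population_size=100,
           n_best=20, mutation_prob=0.5, max_generations=100, seed=0):
    """Toy GA that searches for a fixed-size subset of descriptor indices."""
    rng = random.Random(seed)

    def random_individual():
        return tuple(sorted(rng.sample(range(n_descriptors), equation_length)))

    def crossover(parent_a, parent_b):
        # The child draws its genes from the union of both parents' genes.
        pool = sorted(set(parent_a) | set(parent_b))
        return rng.sample(pool, equation_length)

    def mutate(genes):
        # With probability mutation_prob, swap one gene for an unused descriptor.
        unused = [d for d in range(n_descriptors) if d not in genes]
        if unused and rng.random() < mutation_prob:
            genes[rng.randrange(len(genes))] = rng.choice(unused)
        return tuple(sorted(genes))

    # Initialization: a random first generation.
    population = [random_individual() for _ in range(population_size)]
    for _ in range(max_generations):
        # Selection: the n_best distinct individuals become the parents.
        parents = sorted(set(population), key=fitness, reverse=True)[:n_best]
        # Genetic operators: fill the rest of the generation with children.
        children = [
            mutate(crossover(rng.choice(parents), rng.choice(parents)))
            for _ in range(population_size - len(parents))
        ]
        population = parents + children
    return max(population, key=fitness)


# Toy fitness: how many of the "true" descriptors {2, 5, 7} were picked.
target = {2, 5, 7}
best = run_ga(lambda individual: len(target & set(individual)), n_descriptors=10)
```

Carrying the parents over unchanged into the next generation is a design choice made here so that the best solution found so far is never lost.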

Genetic algorithm (GA) Tool

The GA tool (see snapshot 1) uses the genetic algorithm to select significant
variables (descriptors) during QSAR model development. The fitness function (fitness
score; equation 1) used to select the best solutions (i.e. descriptor combinations) is
described above. The fitness score is directly proportional to the quality of the QSAR
model, so the best model is the one with the highest fitness score. The Lack of Fit (LOF)
fitness function (equation 2) is also computed and displayed in the output file. Although
the LOF score is not used by the algorithm to select the best QSAR models in each
generation, it is still useful for judging the quality of a model; note that LOF is
inversely proportional to model quality. Some validation parameters [3], such as R2,
R2 Adjusted, SEE, Q2 (LOO) and SDEP, are also calculated and displayed in the output file.
One can judge the robustness of the QSAR model by analyzing these validation parameters.

Genetic algorithm (GA) Tool Folder

The program folder consists of three folders: "Data", "Lib" and "Output". For
convenience, the user may keep the input file in the "Data" folder and save output files
in the "Output" folder, since by default clicking on the browse button opens these folders.
The "Lib" folder contains the library files required for running the program, so do not
move, delete or rename these library files.

Snapshot 1: Genetic Algorithm Tool

The “Lib” folder also contains a descriptor database file (“DescriptorDatabase.xlsx”;
snapshot 1A) with basic information about descriptors calculated using the Cerius2, Dragon
and PaDEL software. This information is used to display a brief description of each
descriptor selected by GeneticAlgorithm in the output file. The database file
can be updated by the user, either by inserting information about a descriptor in any of
the current sheets or by adding another sheet for descriptors calculated using different
software, if required. The program identifies a descriptor by its Descriptor
Symbol/Name (first column), which is case sensitive, so it is important that the name is
typed or copy-pasted accurately, without any extra spaces. Also take care that no cell
is left blank. If any information is not available, type “Information Not Available”.
Snapshot 1A

Input file format (i.e. training set file format)

Three file types are allowed as input: xlsx, xls and csv. The input file
(see snapshot 2) should consist of compound numbers (first column), descriptor values
and the endpoint values (last column) for each object/compound. The format in which
this information should be placed in the file is as follows:

First row: Header, i.e. a name for each column (for instance, descriptor names and the
endpoint name). Names can be numeric, alphabetic or alphanumeric.
First column: Serial number/Compound number (only numerical values)
Subsequent columns: Property/Independent variables/Descriptor values; each column
holds the values of one descriptor for all the compounds (nanoparticles). These values
must be numerical, not alphabetic or alphanumeric.
Last column: Endpoint values/Dependent variables (only numerical values)
Snapshot 2
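
Before running the tool, it can help to check a csv input file against the layout described above. The following Python sketch is a hypothetical helper that is not part of the GA tool: the function name `check_training_csv` and its messages are assumptions. It reports rows whose length differs from the header and cells that are not numeric.

```python
import csv
import io


def check_training_csv(text):
    """Return a list of problems found in csv text laid out as described above."""
    rows = list(csv.reader(io.StringIO(text)))
    if not rows:
        return ["file is empty"]
    header, data = rows[0], rows[1:]
    problems = []
    if len(header) < 3:
        problems.append("expected compound number, descriptor(s) and endpoint columns")
    for line_no, row in enumerate(data, start=2):
        if len(row) != len(header):
            problems.append(f"row {line_no}: expected {len(header)} cells, found {len(row)}")
            continue
        for name, cell in zip(header, row):
            try:
                float(cell)  # every cell below the header must be numeric
            except ValueError:
                problems.append(f"row {line_no}, column {name!r}: {cell!r} is not numeric")
    return problems


# Hypothetical example files: one well formed, one with a text value.
good = "No,D1,D2,Activity\n1,0.5,1.2,6.3\n2,0.7,1.1,5.9\n"
bad = "No,D1,Activity\n1,abc,6.3\n"
good_problems = check_training_csv(good)
bad_problems = check_training_csv(bad)
```

An empty list means that no problems were found; a blank cell is reported as non-numeric.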

How to run the program

It’s simple! Just click/double-click on the jar file (GeneticAlgorithm.jar) present in the GA
folder. A window will open as shown in Snapshot 1, with a few fields (given below) that
should be filled in by the user before clicking on the ‘Submit’ button to run the program.

“Select Training set File”: Click on the ‘browse’ button to select the training set file. By
default, it will open the “Dataset” folder present in the GA program folder. So, for
convenience, the user can keep the input file in the “Dataset” folder.
“Select Output Directory”: Click on the ‘browse’ button to select the destination/output
directory and define the output file name. By default, it will open the “Output” folder
present in the GA program folder. So, for convenience, the user can save the output files
in the “Output” folder.
Optionally, the user can pretreat the dataset to remove constant and
inter-correlated descriptors prior to Genetic Algorithm execution. If the user selects the
checkbox labeled “Data Pre-treatment”, then the following information has to be
provided:
*Enter Variance cut-off: Enter the variance cut-off value based on which the constant
variables will be removed. By default, the cut-off value is set to 0.0001.
*Enter correlation coefficient cut-off: Enter the inter-correlation coefficient cut-off
value based on which the inter-correlated variables will be removed. By default, the
cut-off value is set to 0.99.
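
The effect of the two pretreatment cut-offs can be illustrated with a short Python sketch. The manual does not specify how the tool computes variance or which of two inter-correlated descriptors it removes, so several choices below are assumptions: the function name `pretreat`, the use of sample variance, the use of the absolute Pearson correlation, and the rule of keeping the descriptor that appears first.

```python
import math


def pretreat(columns, variance_cutoff=0.0001, correlation_cutoff=0.99):
    """Return the descriptor names kept after both filters.

    columns maps a descriptor name to its list of values (one per compound).
    """
    def variance(xs):
        mean = sum(xs) / len(xs)
        return sum((x - mean) ** 2 for x in xs) / (len(xs) - 1)

    def pearson(xs, ys):
        mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
        sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        sxx = sum((x - mx) ** 2 for x in xs)
        syy = sum((y - my) ** 2 for y in ys)
        return sxy / math.sqrt(sxx * syy)

    kept = []
    for name, values in columns.items():
        # Drop (near-)constant descriptors.
        if variance(values) <= variance_cutoff:
            continue
        # Drop a descriptor that is highly correlated with one already kept.
        if any(abs(pearson(values, columns[k])) > correlation_cutoff for k in kept):
            continue
        kept.append(name)
    return kept


# Hypothetical data: d2 is constant and d3 is exactly 2 * d1.
example = {
    "d1": [1.0, 2.0, 3.0, 4.0],
    "d2": [5.0, 5.0, 5.0, 5.0],
    "d3": [2.0, 4.0, 6.0, 8.0],
    "d4": [1.0, 0.0, 1.0, 0.0],
}
kept_descriptors = pretreat(example)
```

In this example d2 fails the variance cut-off and d3 fails the correlation cut-off against d1, so only d1 and d4 remain.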

“Total number of Iterations”: Enter the number of iterations (generations) for which
the algorithm will run. This is one of the stopping criteria (default value:
100). The other stopping criterion used in this program is that the algorithm will stop if
there is no change in the fitness function value (a difference of less than 0.001) for
10 successive generations.
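
The two stopping criteria can be expressed as a small check on the history of best fitness values. The sketch below is an interpretation rather than the tool's code: the name `should_stop` is hypothetical, and "no change for 10 successive generations" is read as 10 consecutive generation-to-generation differences that are all below 0.001.

```python
def should_stop(best_fitness_history, max_generations=100,
                tolerance=0.001, patience=10):
    """Decide whether the GA should stop after the latest generation.

    best_fitness_history holds the best fitness value of each generation so far.
    """
    # Criterion 1: the requested number of generations has been produced.
    if len(best_fitness_history) >= max_generations:
        return True
    # Criterion 2 needs patience + 1 values to form `patience` differences.
    if len(best_fitness_history) <= patience:
        return False
    recent = best_fitness_history[-(patience + 1):]
    return all(abs(later - earlier) < tolerance
               for earlier, later in zip(recent, recent[1:]))


# Hypothetical histories: a stalled run and a steadily improving run.
stalled = should_stop([0.5] * 11)
improving = should_stop([0.1 * i for i in range(11)])
```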

“Equation Length”: This is the number of descriptors that will be present in
the generated models and also in the final best QSAR models (default value: 3).
At present, the algorithm used in this tool does not include addition and deletion as
operators. Hence, the equation length remains the same throughout the algorithm.

“Cross-Over Probability”: This value is the probability (range 0 to 1) of performing
the cross-over operation within chromosomes (in this case, the various
generated/selected equations) in each generation. The default value is 1. In this version,
the user cannot change the default value.

“Mutation Probability”: This value is the probability (range 0 to 1) of performing
the mutation operation within chromosomes in each generation. The default
value is 0.5, chosen because mutation has a low probability of occurring in nature.

“Initial number of equations generated”: This value is the number of
equations generated in the first step (first generation). In this version, the user cannot
change the default value, i.e. 100.

“Number of best equations selected”: This value corresponds to the number of best
equation selected in each generation (default value: 20).

“Smoothing parameters (for LOF calculation)”: This is one of the terms used in the
calculation of the LOF fitness function (equation 2). The default value is 1.

Optionally, the user can also perform process validation by selecting the checkbox labeled
“Process Validation” and then entering the number of random models to be
generated (the default value is 10).
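
The process validation output includes a cRp^2 value, which this manual does not define. Assuming the definition commonly used in the QSAR literature, cRp^2 = R * sqrt(R^2 - Rr^2), where R is the correlation coefficient of the original model and Rr^2 is the average r^2 of the random models, the value can be reproduced as sketched below. The function name `c_rp_squared` and the example numbers are hypothetical.

```python
import math


def c_rp_squared(r, mean_random_r2):
    """cRp^2 = R * sqrt(R^2 - Rr^2), under the assumed definition above.

    r:              correlation coefficient R of the original (non-random) model
    mean_random_r2: average r^2 of the randomized models
    """
    difference = r * r - mean_random_r2
    if difference < 0:
        raise ValueError("random models fit better on average than the original model")
    return r * math.sqrt(difference)


# Hypothetical example: R = 0.9 and an average random r^2 of 0.17.
example_crp2 = c_rp_squared(0.9, 0.17)
```

A value close to R^2 indicates that the random models explain very little, which is the desired outcome of process validation.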
Output
Snapshot 3: Text file

Snapshot 4: _fileCode.xlsx file

Snapshot 5: _XRandom.xlsx file

1. Output text file (_GA.txt) (snapshot 3): After successful execution of the GA, the
generated text file contains the resultant MLR equation, fitness function information
(i.e. fitness score and LOF value), and validation parameters such as r2, r2 adjusted,
SEE, q2, and SDEP. It also contains a brief description of each selected descriptor
(if information on that descriptor is present in the database).
2. _Filecode.xlsx (snapshot 4): This file contains the compound no./serial no. (first
column), the selected descriptors (subsequent columns) and the endpoint (last column).
For each GA run, a new xlsx/xls/csv file with a different file name
(containing a file code) is generated. This file holds the compound numbers, descriptor
values and property/activity values for the respective selected best equation. The
‘file code’ is explained in the note below.
3. _Filecode_XRandomResults.xlsx (snapshot 5): The process validation results
contain the r, r^2 and q^2 values for the original model and all the generated random
models; the average r, r^2 and q^2 of the random models; and the cRp^2 value.
Note: GA is usually executed multiple times (e.g. 10-15 times) until an optimized
solution is found. For user convenience, if the output file name is not changed and
the program is executed a number of times by clicking the ‘Submit’ button, each new GA
result is appended to the same text file (snapshot 3). The user can then select the
best GA result based on the fitness functions and/or validation parameters. To
find the excel file (.xlsx/xls/csv) containing the selected descriptors
and endpoint information for a particular GA run, note
the GA file code (snapshot 3; encircled) that is printed in brackets with every GA
result. The corresponding excel file name bears the same code (snapshot 4;
encircled) as that GA result.

References:

1. http://en.wikipedia.org/wiki/Genetic_algorithm

2. https://engineering.purdue.edu/~lips/Publications/proddesgn/Encyc.pdf

3. Roy, K.; Mitra, I., On various metrics used for validation of predictive QSAR models
with applications in virtual screening and focused library design. Combinatorial
Chemistry & High Throughput Screening 2011, 14 (6), 450-474.

Java External Library Used

Apache POI: the Java API for Microsoft Documents
  Available at http://poi.apache.org/
XMLBeans
  Available at http://xmlbeans.apache.org/

Disclaimer

For academic purpose only.

The program AD-MDI has been developed in the Java language and is platform independent. The
software has been validated on known data sets. Please report any discrepancy in the results
for other data sets. Contact us at any of the following addresses:

Dr. Tomasz Puzyn,
NanoBRIDGES Project Coordinator,
Faculty of Chemistry,
University of Gdansk,
Gdansk,
Poland 80-952
Email Id: [email protected]

Dr. Kunal Roy,
Drug Theoretics and Cheminformatics Lab.,
Dept. of Pharmaceutical Technology,
Jadavpur University,
Kolkata, West Bengal,
INDIA-700032
Email Id: [email protected]

Software Developer details:

Pravin Ambure,
Research Scholar,
Drug Theoretics and Cheminformatics Lab.,
Dept. of Pharmaceutical Technology,
Jadavpur University,
Kolkata, West Bengal,
INDIA-700032
E-mail Id: [email protected] (*for any queries regarding the tool)
