Deterministic Unconstrained
Optimisation – Part I
Rosenbrock's banana function
$J = f(x_1, x_2) = (a - x_1)^2 + b\,(x_2 - x_1^2)^2$
Global minimum is at $x_1 = a$ and $x_2 = a^2$
minimum $f = 0$
(Usually a and b are set to 1 and 100 respectively)
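The function is easy to code directly; a minimal Python/NumPy sketch with the usual a = 1, b = 100:

```python
import numpy as np

def rosenbrock(x, a=1.0, b=100.0):
    """Rosenbrock banana function J = (a - x1)^2 + b*(x2 - x1^2)^2."""
    x1, x2 = x
    return (a - x1)**2 + b * (x2 - x1**2)**2

# The global minimum lies at (a, a^2); for a = 1 this is (1, 1) with J = 0.
print(rosenbrock(np.array([1.0, 1.0])))   # -> 0.0
```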
Contents
Comments on characteristics of real-life problems
Classification of optimization problems
Deterministic optimization methods
General procedure
Gradient & Hessian
General line search methods
Steepest descent method
Conjugate gradient method
Some other methods
Unconstrained minimization
Characteristics of real-life problems:
Design variables are invariably more than one
The objective function may be non-linear
The objective function may be non-deterministic (not an
issue for the time being)
Evaluation of objective function may be expensive
Gradient or Hessian of objective function may not be
available
We discuss various deterministic methods of
optimization when the number of design variables is
more than one
We also assume that the design variables have
only side constraints (unconstrained optimisation)
Brute force method – line search along the $e_i$
Choose the unit vectors $e_i$ as the set of search directions
Minimize J by searching along the unit vectors one after the other till the function is minimum
The method fails if J has a narrow valley at an angle to the unit vectors
[figure: line-search path along $e_1$ and $e_2$ on 2D contours]
Note that a better set of directions than the $e_i$'s
should be possible. Such directions should permit
large step sizes along narrow valleys and be
"non-interfering" directions
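A minimal sketch of this brute-force coordinate search (my own illustration, using scipy.optimize.minimize_scalar for each 1D minimisation and SciPy's built-in Rosenbrock function as the test problem):

```python
import numpy as np
from scipy.optimize import minimize_scalar, rosen

def coordinate_search(J, x0, n_sweeps=50):
    """Minimise J by line searches along the unit vectors e_i, one after the other."""
    x = np.asarray(x0, dtype=float)
    n = x.size
    for _ in range(n_sweeps):
        for i in range(n):
            e_i = np.zeros(n)
            e_i[i] = 1.0
            # 1D minimisation of J along the i-th coordinate direction
            alpha = minimize_scalar(lambda a: J(x + a * e_i)).x
            x = x + alpha * e_i
    return x

# On the Rosenbrock function the narrow curved valley makes progress very slow.
print(coordinate_search(rosen, [-1.2, 1.0]))
```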
Powell’s method
Powell’s method is an extension of brute force line
search method which uses basis vectors as the
search directions.
Powell’s method starts with initial guess P0 and uses
each of the basis vector direction, one after the
other, to minimise the function in n steps to locate
Pn. This step is identical to brute force line search
method.
It then locates the optimal point by line search
method using the vector given by (Pn − P0)
The method is iterative and the each iteration
requires (n+1) line searches
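For reference, SciPy ships a derivative-free implementation of Powell's method; a minimal usage sketch on the Rosenbrock function:

```python
from scipy.optimize import minimize, rosen

# Powell's method: repeated line searches along a set of directions that is
# updated with the (Pn - P0) vector after each cycle; no gradients required.
res = minimize(rosen, x0=[-1.2, 1.0], method="Powell")
print(res.x, res.fun)   # expected to approach (1, 1) with f ~ 0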
Gradient based multidimensional
unconstrained minimization
Optimization methods in n dimensions:
Gradient-based methods
    methods that do not require the Hessian
    methods that require the Hessian
Non-gradient-based methods
    deterministic: Nelder–Mead Simplex, Divided Rectangles (DIRECT) method
    non-deterministic: Genetic Algorithms, Simulated Annealing, Particle Swarm Optimization
Gradient based multidimensional
unconstrained minimization
General procedure
Assume that the mathematical statement of the
problem is ready involving
the objective function
the design variables (they must be independent) and
other parameters
Iteratively search for the optimum by
identifying the search direction along which the optimum lies
searching along that direction with a line-search method to locate
the position of the optimum
Most procedures require the objective function
and its gradient G
Some procedures also require Hessian H
Convex design space
Most optimisation algorithms assume convex
design space
A real-valued function defined on an n-
dimensional interval is convex if the line segment
between any two points on the graph of the
function is above or on the graph in a Euclidean
space
In reality design space can be non-convex.
It is essential to find out if the design space is
convex before attempting optimisation
Convex/concave design space in 2D
[figure: a convex 2D domain and a concave (non-convex) 2D domain, each showing the points $(x_1^1, x_2^1)$ and $(x_1^2, x_2^2)$ joined by a line segment]
Convex Sets
A set $S$ is convex if $a, b \in S \;\Rightarrow\; \lambda a + (1 - \lambda) b \in S$ for all $\lambda \in [0, 1]$
Convex vs non-convex function
Condition for convexity:
$f(\lambda x_1 + (1 - \lambda) x_2) \le \lambda f(x_1) + (1 - \lambda) f(x_2), \quad 0 \le \lambda \le 1$
[figure: a convex function, for which the local optimum is also the global optimum and every chord lies on or above the graph, versus a non-convex function with a local optimum distinct from the global optimum]
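This condition can be spot-checked numerically along the segment between two points; the helper below is my own sketch, assuming the function accepts a NumPy array:

```python
import numpy as np
from scipy.optimize import rosen

def violates_convexity(f, x1, x2, n_lambda=21):
    """True if f(l*x1 + (1-l)*x2) > l*f(x1) + (1-l)*f(x2) for some sampled l in [0, 1]."""
    x1, x2 = np.asarray(x1, float), np.asarray(x2, float)
    for lam in np.linspace(0.0, 1.0, n_lambda):
        lhs = f(lam * x1 + (1.0 - lam) * x2)
        rhs = lam * f(x1) + (1.0 - lam) * f(x2)
        if lhs > rhs + 1e-12:          # chord lies below the graph -> not convex
            return True
    return False

def quad(x):                           # a convex quadratic
    return x @ x

print(violates_convexity(quad, [-2.0, 1.0], [3.0, -1.0]))   # False: condition always holds
print(violates_convexity(rosen, [-1.0, 1.0], [1.0, 1.0]))   # True: Rosenbrock is non-convex
```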
Gradient and Hessian
The gradient of a function is
$G = \nabla J = \begin{bmatrix} \dfrac{\partial J}{\partial x_1} \\ \dfrac{\partial J}{\partial x_2} \\ \vdots \\ \dfrac{\partial J}{\partial x_n} \end{bmatrix}$
The gradient vector is perpendicular to the hyperplane
tangent to the contour surfaces of constant J
For n = 1 and n = 2, the contour surfaces are points and
contour lines respectively
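For the Rosenbrock example SciPy supplies the analytic gradient (rosen_der); a simple forward-difference helper of my own is a common cross-check:

```python
import numpy as np
from scipy.optimize import rosen, rosen_der

def fd_gradient(J, x, h=1e-6):
    """Forward-difference approximation of the gradient G_i = dJ/dx_i."""
    x = np.asarray(x, dtype=float)
    g = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = h
        g[i] = (J(x + e) - J(x)) / h
    return g

x = np.array([-1.2, 1.0])
print(rosen_der(x))            # analytic gradient
print(fd_gradient(rosen, x))   # finite-difference check (should agree closely)
```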
Hessian and its use
The second derivative of the objective function produces
n(n+1)/2 distinct second-order partial derivatives:
$\dfrac{\partial^2 J}{\partial x_i \partial x_j}$ if $i \ne j$, and $\dfrac{\partial^2 J}{\partial x_i^2}$ if $i = j$
The second-order partial derivatives form the
Hessian matrix, a real square symmetric (n × n) matrix:
$H = \nabla^2 J = \begin{bmatrix} \dfrac{\partial^2 J}{\partial x_1^2} & \dfrac{\partial^2 J}{\partial x_1 \partial x_2} & \cdots & \dfrac{\partial^2 J}{\partial x_1 \partial x_n} \\ \dfrac{\partial^2 J}{\partial x_2 \partial x_1} & \dfrac{\partial^2 J}{\partial x_2^2} & \cdots & \dfrac{\partial^2 J}{\partial x_2 \partial x_n} \\ \vdots & \vdots & \ddots & \vdots \\ \dfrac{\partial^2 J}{\partial x_n \partial x_1} & \dfrac{\partial^2 J}{\partial x_n \partial x_2} & \cdots & \dfrac{\partial^2 J}{\partial x_n^2} \end{bmatrix}$
We note that any real square symmetric matrix
has only real eigenvalues
has real, distinct, orthogonal eigenvectors if the eigenvalues are distinct
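As a quick check of these properties, the sketch below evaluates SciPy's Rosenbrock Hessian at the minimum (1, 1) and inspects its symmetry and eigenvalues:

```python
import numpy as np
from scipy.optimize import rosen_hess

H = rosen_hess(np.array([1.0, 1.0]))   # Hessian at the minimum (1, 1)
print(np.allclose(H, H.T))             # real symmetric matrix
print(np.linalg.eigvalsh(H))           # real eigenvalues, all positive here
```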
Hessian and its use
Near the minimum, J can be approximated as a quadratic
in X and expressed in terms of the gradient G and the
Hessian matrix H:
$J(X) = \tfrac{1}{2} X^T H X + G^T X + C$
It can be seen that the condition for a minimum is
$\nabla J(X) = \tfrac{1}{2}(H^T X + H X) + G = 0$
If H is symmetric, then $H X + G = 0$
Thus minimisation of J is identical to the solution of the
linear algebraic equations usually written as
$A X = b$ (with $A = H$ and $b = -G$)
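This equivalence can be exercised directly with a linear solve; a small sketch with an H and G of my own choosing (any symmetric positive definite H works):

```python
import numpy as np

# Quadratic model J(X) = 0.5*X^T H X + G^T X + C, with H symmetric positive definite
H = np.array([[4.0, 1.0],
              [1.0, 3.0]])
G = np.array([-1.0, -2.0])

# Minimisation of J is equivalent to solving A X = b with A = H, b = -G
x_star = np.linalg.solve(H, -G)
print(x_star)            # minimiser of the quadratic
print(H @ x_star + G)    # gradient at x_star, ~ [0, 0]
```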
Hessian and its use
Expanding J(x) about the stationary point $x^*$ in a
direction p, and noting that $G(x^*) = 0$, the behaviour of
the function at the stationary point is determined by H:
$J(x^* + \epsilon p) = J(x^*) + \epsilon\, G(x^*)^T p + \tfrac{1}{2}\epsilon^2 p^T H p = J(x^*) + \tfrac{1}{2}\epsilon^2 p^T H p$
H is a symmetric matrix, and therefore it has real
orthogonal eigenvectors, i.e.
$H u_i = \lambda_i u_i, \quad \|u_i\| = 1$
$J(x^* + \epsilon u_i) = J(x^*) + \tfrac{1}{2}\epsilon^2 u_i^T H u_i = J(x^*) + \tfrac{1}{2}\epsilon^2 \lambda_i$
Gradient and Hessian
Thus $J(x^* + \epsilon u_i)$ increases above, decreases below,
or stays equal to $J(x^*)$ depending on whether $\lambda_i$ is
positive, negative or zero
For J to be a minimum, H must be positive
definite, i.e. all its eigenvalues must be
positive
Gradient based methods
Assume that J is quadratic and G and H are
constants:
$J(x) = a + G^T x + \tfrac{1}{2} x^T H x$ and $\nabla J = G + H x$
Therefore the unique minimum of J is given by
$\nabla J = G + H x^* = 0$, or
$x^* = -H^{-1} G$
If n is very large, the method is not feasible as it
requires the inverse of the (n × n) matrix H
Realistic methods minimize the n-dimensional
function through several 1D line-minimizations.
Line search methods
Start with $X_0$ and a direction (a vector $S_0$ in n
dimensions)
Use a 1D minimization method to minimize $J(\alpha) =
J(X_0 + \alpha S_0)$, where $S_0$ (or $p_0$) is the initial search
direction and $\alpha$ is the step size.
$S_k$ is the search direction for the k-th major iteration,
and $\alpha_k$ is the step length from the line search
The important distinguishing feature of a gradient-
based algorithm is its search direction
Any line search that satisfies sufficient decrease
can be used, but one that satisfies the Strong
Wolfe conditions (on step size) is recommended.
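SciPy provides such a line search; the sketch below asks scipy.optimize.line_search for a step length satisfying the strong Wolfe conditions along the steepest-descent direction of the Rosenbrock function (default Wolfe constants assumed):

```python
import numpy as np
from scipy.optimize import line_search, rosen, rosen_der

xk = np.array([-1.2, 1.0])
pk = -rosen_der(xk)                   # steepest-descent search direction

# Returns a step length satisfying the strong Wolfe conditions (None if it fails)
alpha = line_search(rosen, rosen_der, xk, pk)[0]
if alpha is not None:
    print(alpha, rosen(xk + alpha * pk) < rosen(xk))   # sufficient decrease achieved
```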
A general gradient based method
Input: initial guess $X_0$
Output: optimum $X^*$

k ← 0
while not converged do
    Compute a search direction $S_k$
    Line search: find a step length $\alpha_k$ such that $J(X_k + \alpha_k S_k) < J(X_k)$
      (the curvature condition may also be included)
    Update the design variables: $X_{k+1} \leftarrow X_k + \alpha_k S_k$
    k ← k + 1
end while

[flow chart: start → compute search direction → line search → update X → is J minimum? → if no, repeat; if yes, stop]
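The loop above maps almost one-to-one onto code; a generic sketch (my own skeleton, with the search-direction rule passed in as a function and a simple backtracking decrease check standing in for the line search):

```python
import numpy as np
from scipy.optimize import rosen, rosen_der

def gradient_based_minimize(J, grad, x0, direction_rule, max_iter=500, tol=1e-6):
    """Generic gradient-based loop: search direction -> step length -> update."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:              # converged on the gradient
            break
        s = direction_rule(x, g)                 # compute a search direction S_k
        alpha = 1.0
        while J(x + alpha * s) >= J(x):          # backtrack until J(X_k + a S_k) < J(X_k)
            alpha *= 0.5
            if alpha < 1e-14:                    # no decrease found along s
                return x
        x = x + alpha * s                        # update the design variables
    return x

# With the negative gradient as the direction rule this becomes steepest descent;
# on the Rosenbrock function progress along the curved valley is slow.
print(gradient_based_minimize(rosen, rosen_der, [-1.2, 1.0], lambda x, g: -g))
```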
Standard procedure (flow chart)
Some methods do not need H(X).
[flow chart: Input $X^0$ → Analysis: calculate $J(X)$, $G(X)$, $H(X)$ → Sensitivity analysis: calculate the search direction $S^q$ → Perform 1D search: $X^q = X^{q-1} + \alpha S^q$, $q = q + 1$ → Converged? if no, repeat; if yes, stop]
The search direction
There are many algorithms
Random search
Powell method
Steepest descent
Fletcher–Reeves (FR) method
Davidon–Fletcher–Powell (DFP) method
Broyden–Fletcher–Goldfarb–Shanno (BFGS) method
Newton's method
Some of the above are explained
Newton's method – the simplest variant
If J is twice differentiable, J can be expressed by a
Taylor series in terms of G and H:
$G(X^{k+1}) \approx G(X^k) + H(X^k)\,(X^{k+1} - X^k)$
but $G(X^{k+1}) = 0$ (condition for optimality), so
$\Delta X = X^{k+1} - X^k = -H^{-1} G(X^k)$, or
$X^{k+1} = X^k - H^{-1} G(X^k) = X^k - H^{-1} \nabla J(X^k)$
The above expression is similar to Newton's
method in 1D:
$y'(x^{k+1}) \approx y'(x^k) + y''(x^k)\,(x^{k+1} - x^k)$, or
$x^{k+1} = x^k - y'(x^k)/y''(x^k)$
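A minimal sketch of this Newton iteration on the Rosenbrock function (solving $H\,\Delta X = -G$ rather than forming $H^{-1}$ explicitly); SciPy's rosen_der and rosen_hess supply the gradient and Hessian:

```python
import numpy as np
from scipy.optimize import rosen, rosen_der, rosen_hess

x = np.array([-1.2, 1.0])
for k in range(20):
    G, H = rosen_der(x), rosen_hess(x)
    if np.linalg.norm(G) < 1e-8:
        break
    x = x + np.linalg.solve(H, -G)   # X_{k+1} = X_k - H^{-1} G(X_k)
print(k, x)                          # typically reaches (1, 1) in a handful of steps
```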
A variant of Newton's method -
Method of Steepest descent
In the quasi-Newton method, the Hessian matrix
is approximated to be the identity matrix:
$X^{k+1} = X^k - \alpha\, I\, \nabla J(X^k)$
This is the method of steepest descent. It uses
the negative of the gradient of the objective function
(the steepest direction) as the search direction
Choose $0 < \alpha < 1$ for stability (as is usually done)
We may assume that the change in the magnitude
of X is the same as the one obtained in the
previous iteration. Note that $p^k = -G^k/|G^k|$, which
gives the step-length estimate
$\alpha^k = \alpha^{k-1}\, \dfrac{G^{(k-1)T} p^{k-1}}{G^{(k)T} p^k}$
A variant of Newton's method -
Method of Steepest descent
Alternatively, an analytic formula for $\alpha^k$ can be
found by assuming J is quadratic in x, with $G = -b$
and $H = A$ evaluated at $x^k$:
$J(x) = \tfrac{1}{2} x^T A x - x^T b$
$J(x^k + \alpha p^k) = \tfrac{1}{2}(x^k + \alpha p^k)^T A (x^k + \alpha p^k) - (x^k + \alpha p^k)^T b$
$= \tfrac{1}{2}\alpha^2 p^{(k)T} A p^k + \alpha\, p^{(k)T} A x^k - \alpha\, p^{(k)T} b + \text{constants}$
since A is an (n × n) symmetric and positive definite matrix
To minimise J with respect to $\alpha$, we set $dJ/d\alpha = 0$, which gives
$\alpha^k = -\dfrac{p^{(k)T} (A x^k - b)}{p^{(k)T} A p^k}$
Method of Steepest descent
Justification for quasi Newton method
$X^{k+1} = X^k - \alpha\, I\, \nabla J(X^k), \quad 0 < \alpha < 1$
Consider a Taylor expansion about $X^k$:
$J(X^k + \Delta X) - J(X^k) \approx \nabla J(X^k)^T \Delta X$
The LHS, and hence the RHS, must be negative for the step to reduce J:
$\nabla J(X^k)^T \Delta X < 0$
Choosing $\Delta X = -\alpha \nabla J(X^k)$ guarantees this, since
$\nabla J(X^k)^T \Delta X = -\alpha \|\nabla J(X^k)\|^2 < 0$
It can be seen that the method of steepest descent
uses the negative of the gradient of the
objective function as the search direction
It can be shown that the method does not give fast
convergence when close to a local minimum
Method of Steepest descent
Input: Initial guess, X0, convergence tolerances, εg, εa and εr.
Output: Optimum design variables, X*
k←0
repeat
Compute J and $G(X_k) \equiv \nabla J(X_k)$; if $|G(X_k)| < \varepsilon_g$, stop;
otherwise compute the search direction, $S_k \leftarrow -G(X_k)/|G(X_k)|$
Perform a line search to find the step length $\alpha_k$
Update the current point, $X_{k+1} \leftarrow X_k + \alpha_k S_k$
k ← k + 1
until $|J(X_k) - J(X_{k-1})| \le \varepsilon_a + \varepsilon_r |J(X_{k-1})|$
$\varepsilon_g$: absolute tolerance on the gradient (typically $10^{-6}$)
$\varepsilon_a$: absolute tolerance on the objective function (typically $10^{-2}$)
$\varepsilon_r$: relative tolerance on the objective function (typically $10^{-2}$)
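A sketch of this algorithm, assuming the 1D line search is delegated to scipy.optimize.minimize_scalar; the tolerance defaults here are tighter than the typical values quoted above so that the slow progress is visible:

```python
import numpy as np
from scipy.optimize import minimize_scalar, rosen, rosen_der

def steepest_descent(J, grad, x0, eps_g=1e-6, eps_a=1e-10, eps_r=1e-10, max_iter=5000):
    """Steepest descent with a delegated 1D line search."""
    x = np.asarray(x0, dtype=float)
    J_old = J(x)
    for k in range(max_iter):
        G = grad(x)
        if np.linalg.norm(G) < eps_g:                      # gradient tolerance
            break
        S = -G / np.linalg.norm(G)                         # normalised steepest-descent direction
        alpha = minimize_scalar(lambda a: J(x + a * S)).x  # line search for the step length
        x = x + alpha * S
        J_new = J(x)
        if abs(J_new - J_old) <= eps_a + eps_r * abs(J_old):   # successive-reduction check
            break
        J_old = J_new
    return x, k

x_opt, iters = steepest_descent(rosen, rosen_der, [-1.2, 1.0])
print(x_opt, iters)   # many iterations: slow crawl along the banana valley is expected
```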
Method of Steepest descent
$|J(X_{k+1}) - J(X_k)| \le \varepsilon_a + \varepsilon_r |J(X_k)|$ is a check on the successive
reductions in J
If J is of order 1, $\varepsilon_r$ dominates; if J is much smaller than 1, then
the absolute tolerance $\varepsilon_a$ dominates
The method of steepest descent has the problem that, with
an exact line search, the steepest-descent direction at
each iteration is orthogonal to the previous one:
$\dfrac{dJ(X^{k+1})}{d\alpha^k} = 0$
$\dfrac{\partial J(X^{k+1})}{\partial X^{k+1}} \cdot \dfrac{\partial (X^k + \alpha^k S^k)}{\partial \alpha^k} = 0$
$\nabla J(X^{k+1})^T S^k = 0$
$G^T(X^{k+1})\, G(X^k) = 0$
Method of Steepest descent
Example: $J(X) = \tfrac{1}{2}(x_1^2 + 10\, x_2^2)$
The method is inefficient because successive search directions
are perpendicular to each other.
The error decreases in the first few iterations, but the method is
slow near the minimum.
The algorithm is guaranteed to converge, but the number of
iterations required can be arbitrarily large. The rate of convergence is linear.
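The orthogonality of successive directions can be reproduced in a few lines by applying steepest descent with the exact (analytic) step length to this quadratic, where $A = \mathrm{diag}(1, 10)$ and $b = 0$ so that $G = Ax$ (starting point chosen arbitrarily):

```python
import numpy as np

A = np.diag([1.0, 10.0])          # J(x) = 0.5 * x^T A x, i.e. 0.5*(x1^2 + 10*x2^2)
x = np.array([10.0, 1.0])         # arbitrary starting point
for k in range(8):
    G = A @ x                     # gradient of the quadratic
    p = -G                        # steepest-descent direction
    alpha = -(p @ (A @ x)) / (p @ (A @ p))   # exact line-search step (b = 0)
    x_new = x + alpha * p
    # successive gradients (and hence directions) are orthogonal: ~0 up to round-off
    print(k, x_new, (A @ x_new) @ G)
    x = x_new
```

The iterates zigzag toward the origin, with the error shrinking only linearly per step.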
Steepest descent – graphical interpretation
[figure: contours of the objective with the zigzag path of the steepest-descent iterates]
The method suffers from poor convergence
Lecture Ends