Optimization 2
Lecture outline
Unconstrained continuous optimization:
• Convexity
• Iterative optimization algorithms
• Gradient descent
• Newton’s method
• Gauss-Newton method
New topics:
• Axial iteration
• Levenberg-Marquardt algorithm
• Application
Introduction: Problem specification
Suppose we have a cost function (or objective function)
f(x) : R^n → R
Our aim is to find the value of the parameters x that minimizes this function:

    x* = arg min_x f(x)
subject to the following constraints:
• equality: c_i(x) = 0,  i = 1, ..., m_e
• inequality: c_i(x) ≥ 0,  i = m_e + 1, ..., m
We will start by focussing on unconstrained problems
Unconstrained optimization
[Figure: a function of one variable f(x) and min_x f(x), with a local minimum and the global minimum marked.]
• down-hill search (gradient descent) algorithms can find local minima
• which of the minima is found depends on the starting point
• such minima often occur in real applications
Reminder: convexity
Class of functions
[Figure: an example of a convex function and of a non-convex function.]
• Convexity provides a test for a single extremum
• A non-negative sum of convex functions is convex
Class of functions continued
[Figure: single extremum – convex; single extremum – non-convex; multiple extrema – non-convex; noisy; horrible.]
Optimization algorithm – key ideas
! "#$% δx &'() * ) + * f , x - δx . < f , x .
! /)#& 0 12+%&0* 3 0 +$0#*24+*#520'6%+*20 x n ! " # 7 0 x n - δx
! 82%'(20*)20643912:0* 3 0 +0&24#2&03;0< = 0 1#$20&2+4()2&0δx 7 α p
Optimization algorithm – Random direction
Choosing the direction 1: axial iteration
Alternate minimization over x and y
Optimization algorithm – Axial directions
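Below is a minimal sketch of axial iteration in Python/NumPy, assuming a generic objective f and using scipy.optimize.minimize_scalar for each 1D minimization; the function names and the test quadratic are illustrative, not from the lecture.

import numpy as np
from scipy.optimize import minimize_scalar

def axial_iteration(f, x0, n_sweeps=50, tol=1e-8):
    # Alternate 1D minimizations along each coordinate axis until the point stops moving.
    x = np.asarray(x0, dtype=float).copy()
    for _ in range(n_sweeps):
        x_old = x.copy()
        for i in range(x.size):
            def f_along_axis(t, i=i):
                x_trial = x.copy()
                x_trial[i] = t          # vary only coordinate i
                return f(x_trial)
            x[i] = minimize_scalar(f_along_axis).x
        if np.linalg.norm(x - x_old) < tol:
            break
    return x

# Illustrative quadratic with minimum at (1, -0.5)
f = lambda x: (x[0] - 1.0)**2 + 2.0 * (x[1] + 0.5)**2
print(axial_iteration(f, [4.0, 3.0]))   # approx [ 1.  -0.5]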
Gradient and Partial Derivatives
A function of several variables can be written as f(x1, x2), etc. Often we abbreviate multiple arguments in a single vector as f(x).

Let f : R^n → R. The gradient of f is the column vector of partial derivatives

    ∇f(x) := ( ∂f(x)/∂x1, ..., ∂f(x)/∂xn )^T

Suppose now a function g(x, y) with signature g : R^n × R^m → R. Its derivative with respect to just x is written ∇_x g(x, y).

Gradient and tangent plane (1st degree Taylor expansion):

    τ¹_x(y) = f(x) + (y − x)^T ∇f(x)
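As a quick numerical illustration of the gradient definition above (a Python/NumPy sketch, not part of the lecture; the helper name numerical_gradient is illustrative), the partial derivatives can be checked with central finite differences:

import numpy as np

def numerical_gradient(f, x, h=1e-6):
    # Column vector of partial derivatives df/dx_i, estimated by central differences.
    x = np.asarray(x, dtype=float)
    grad = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = h
        grad[i] = (f(x + e) - f(x - e)) / (2.0 * h)
    return grad

# Example: f(x) = x1^2 + 3*x1*x2 has gradient (2*x1 + 3*x2, 3*x1)
f = lambda x: x[0]**2 + 3.0 * x[0] * x[1]
x = np.array([1.0, 2.0])
print(numerical_gradient(f, x))              # approx [8. 3.]
print(np.array([2*x[0] + 3*x[1], 3*x[0]]))   # analytic gradient for comparison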
Choosing the direction 2: steepest descent
Move in the direction of the negative gradient −∇f(x_n)
Optimization algorithm – Steepest descent
Steepest descent
$ % & ' # ()*+,'-.#,/#'0')12&')'#3')3'-+,456*)#. 7 # .&'#47-.75) 6,-'/8
$ 9:.')#'*4,-'#;,-,;,<*.,7-#.&'#-'2#()*+,'-.#,/#*62*1/ orthogonal
. 7 .&' 3)'0,75/ /.'3 +,)'4.,7- =.)5' 7: *-1 6,-' ;,-,;,<*.,7-8>
$ ?7-/'@5'-.61A#.&'#,.')*.'/#.'-+#. 7 # <,(B<*(#+72-#.&'#0*66'1#,-#*#0')1##
,-'C4,'-. ;*--')
Gradient Descent
• Iterative method starting at an initial point x(0)
• Step to the next point x(k+1) in the direction of the
negative gradient
    x(k+1) = x(k) − ∇f(x(k))

• Repeat until ‖∇f(x(k))‖ < ε for a chosen tolerance ε
• But: no convergence is guaranteed. For convergence, an additional line search is required.

[Figure: gradient descent iterates for f(x) = ½ x1² + 5 x2².]

Line Search
• Take the descent step direction d = −∇f(x)
• Select the step length α as min_{α≥0} f(x + αd)
• In practice, α is selected with heuristics
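To make the update and line-search bullets concrete, here is a hedged Python/NumPy sketch of gradient descent with a backtracking (Armijo) line search, one possible heuristic for choosing α; the constants and function names are illustrative, not the lecture's implementation. It is applied to the quadratic f(x) = ½ x1² + 5 x2² from the figure.

import numpy as np

def backtracking_line_search(f, x, d, g, alpha=1.0, rho=0.5, c=1e-4):
    # Shrink alpha until the Armijo sufficient-decrease condition holds.
    while f(x + alpha * d) > f(x) + c * alpha * (g @ d):
        alpha *= rho
    return alpha

def gradient_descent(f, grad_f, x0, tol=1e-3, max_iter=1000):
    x = np.asarray(x0, dtype=float)
    for k in range(max_iter):
        g = grad_f(x)
        if np.linalg.norm(g) < tol:       # stop when the gradient is small
            break
        d = -g                            # descent direction d = -grad f(x)
        alpha = backtracking_line_search(f, x, d, g)
        x = x + alpha * d                 # x(k+1) = x(k) + alpha * d
    return x, k

f = lambda x: 0.5 * x[0]**2 + 5.0 * x[1]**2
grad_f = lambda x: np.array([x[0], 10.0 * x[1]])
x_min, iters = gradient_descent(f, grad_f, [1.0, 0.75])
print(x_min, iters)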
A harder case: Rosenbrock’s function
    f(x, y) = 100 (y − x²)² + (1 − x)²
[Figure: contour plot of the Rosenbrock function.]
" # $ # % & % ' #(') * ' +,, ,-
Steepest descent on Rosenbrock function
[Figure: steepest descent on the Rosenbrock function – full view and zoomed view of the iterates.]
• The zig-zag behaviour is clear in the zoomed view (100 iterations)
• The algorithm crawls down the valley
Optimization algorithm – Steepest descent 2
Optimization algorithm – Steepest descent for matrices
Conjugate Gradients – sketch only
! " # $ # % " & ' &( c o n j u g a t e g r a d i e n t s )"&&*#* *+))#**,-# '#*)#.% ',/#)0
%,&.* p n *+)" % " 1 % , % ,* 2+1/1.%##' % & /#1)" %"# $ , . , $ + $ ,. 1 3. ,% #
.+$4#/ &( *%#5*6
7 81)"9 p n ,*9)"&*#.9% & 9 4#9)&.:+21%#9% & 9 1;;95/#-,&+*9*#1/)"9',/#)%,&.*99
< , % " 9 /#*5#)%9% & 9 %"#9=#**,1.9 H>
p!nHp j ? @, @?< j < n
7 ! " # 9 /#*+;%,.29*#1/)"9',/#)%,&.*91/#9$+%+1;;C9;,.#1/;C ,.'#5#.'#.%6
7 RemarkablyD p n )1. 4# )"&*#. +*,.2 &.;C E.&<;#'2# &( p n " # , A f F x n " # G 9 9
1.'9A f F x n G 9 F*##9H+$#/,)1; I#),5#*G
Afn!Afn p
pn ? A f n B n" #
A f n!" # A f n " #
Choosing the direction 3: conjugate gradients
Again, uses first derivatives only, but avoids “undoing” previous
work
• An N-dimensional quadratic form can be minimized in at most N conjugate descent steps.
• Example (figure): 3 different starting points; the minimum is reached in exactly 2 steps.
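A sketch of a nonlinear conjugate-gradient iteration using the Fletcher-Reeves form of the update above (Python/NumPy; the 1D minimization via scipy.optimize.minimize_scalar and the test quadratic are assumptions for illustration, not the lecture's implementation):

import numpy as np
from scipy.optimize import minimize_scalar

def conjugate_gradient(f, grad_f, x0, tol=1e-6, max_iter=200):
    # Nonlinear CG: a line minimization along p_n, then the Fletcher-Reeves direction update.
    x = np.asarray(x0, dtype=float)
    g = grad_f(x)
    p = -g                                    # first direction: steepest descent
    for _ in range(max_iter):
        if np.linalg.norm(g) < tol:
            break
        alpha = minimize_scalar(lambda a: f(x + a * p)).x
        x = x + alpha * p                     # x_{n+1} = x_n + alpha * p_n
        g_new = grad_f(x)
        beta = (g_new @ g_new) / (g @ g)      # uses gradients only
        p = -g_new + beta * p                 # conjugate to the previous directions
        g = g_new
    return x

# A 2D quadratic form is minimized in (essentially) 2 conjugate steps
Q = np.array([[3.0, 1.0], [1.0, 2.0]])
f = lambda x: 0.5 * x @ Q @ x
grad_f = lambda x: Q @ x
print(conjugate_gradient(f, grad_f, [4.0, -3.0]))   # approx [0. 0.]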
The Hessian Matrix
Let f : R^n → R be twice differentiable. Its second partial derivatives make up the Hessian matrix ∇²f(x):

    ∇²f(x) := [ ∂²f(x)/∂x1∂x1  ···  ∂²f(x)/∂x1∂xn ]
              [       ⋮          ⋱         ⋮      ]
              [ ∂²f(x)/∂xn∂x1  ···  ∂²f(x)/∂xn∂xn ]

• The order of differentiation does not matter if the function has continuous second-order partial derivatives (Schwarz's theorem).
• Then the Hessian is symmetric:

    ∇²f(x) = [∇²f(x)]^T

2nd degree Taylor expansion:

    τ²_x(y) = f(x) + (y − x)^T ∇f(x) + ½ (y − x)^T ∇²f(x) (y − x)
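As a small numerical aside (a Python sketch, not from the lecture; names are illustrative), the Hessian can be approximated by finite differences of the gradient, and its symmetry checked:

import numpy as np

def numerical_hessian(grad_f, x, h=1e-5):
    # Matrix of second partial derivatives, built column by column from the gradient.
    x = np.asarray(x, dtype=float)
    n = x.size
    H = np.zeros((n, n))
    for j in range(n):
        e = np.zeros(n)
        e[j] = h
        H[:, j] = (grad_f(x + e) - grad_f(x - e)) / (2.0 * h)
    return H

# Example: f(x) = 0.5*x1^2 + 5*x2^2 has constant Hessian diag(1, 10)
grad_f = lambda x: np.array([x[0], 10.0 * x[1]])
H = numerical_hessian(grad_f, np.array([0.3, -0.2]))
print(H)
print(np.allclose(H, H.T))   # symmetric, as Schwarz's theorem predicts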
Choosing the direction 4: Newton’s method
Start from Taylor expansion in 2D
A function may be approximated locally by its Taylor series expansion about a point x:

    f(x + δx) ≈ f(x) + [∂f/∂x  ∂f/∂y] (δx, δy)^T
                     + ½ (δx, δy) [ ∂²f/∂x²   ∂²f/∂x∂y ; ∂²f/∂x∂y   ∂²f/∂y² ] (δx, δy)^T

The expansion to second order is a quadratic function:

    f(x + δx) ≈ a + g^T δx + ½ δx^T H δx

Now minimize this expansion over δx:

    min_δx f(x + δx) ≈ a + g^T δx + ½ δx^T H δx

For a minimum we require that ∇f(x + δx) = 0, and so

    ∇f(x + δx) = g + H δx = 0

with solution δx = −H⁻¹ g. (Matlab: δx = −H\g.)
This gives the iterative update

    x_{n+1} = x_n − H_n⁻¹ g_n
• If f(x) is quadratic, then the solution is found in one step.
• The method has quadratic convergence (as in the 1D case).
• The solution δx = −H_n⁻¹ g_n is guaranteed to be a downhill direction provided that H is positive definite.
• Rather than jumping straight to the predicted solution at x_n − H_n⁻¹ g_n, it is better to perform a line search:

    x_{n+1} = x_n − α_n H_n⁻¹ g_n

• If H = I then this reduces to steepest descent.
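A hedged Python sketch of the Newton update with a very simple backtracking step-length choice (an illustrative stand-in for a full line search); positive definiteness of H is assumed rather than enforced, and the Rosenbrock Hessian below is derived analytically from the formula given earlier.

import numpy as np

def newton_with_line_search(f, grad_f, hess_f, x0, tol=1e-3, max_iter=100):
    x = np.asarray(x0, dtype=float)
    for k in range(max_iter):
        g = grad_f(x)
        if np.linalg.norm(g) < tol:
            break
        dx = np.linalg.solve(hess_f(x), -g)   # Newton step: solve H dx = -g
        alpha = 1.0
        while f(x + alpha * dx) > f(x) and alpha > 1e-8:
            alpha *= 0.5                      # crude backtracking in place of a full line search
        x = x + alpha * dx                    # x_{n+1} = x_n + alpha_n * dx
    return x, k

def rosenbrock_hess(p):
    # Analytic Hessian of the Rosenbrock function from the earlier sketch.
    x, y = p
    return np.array([[1200.0 * x**2 - 400.0 * y + 2.0, -400.0 * x],
                     [-400.0 * x,                       200.0]])

# With rosenbrock and rosenbrock_grad from the earlier sketch:
# print(newton_with_line_search(rosenbrock, rosenbrock_grad, rosenbrock_hess, [-1.0, 1.0]))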
Newton’s method - example
[Figure: Newton's method with line search on the Rosenbrock function – full view and zoomed view; gradient < 1e-3 after 15 iterations; ellipses show successive quadratic approximations.]
• The algorithm converges in only 15 iterations – far superior to steepest descent.
• However, the method requires computing the Hessian matrix at each iteration – this is not always feasible.
Optimization algorithm – Newton method
Optimization algorithm – Newton2 method
Performance issues for optimization algorithms
1. Number of iterations required
2. Cost per iteration
3. Memory footprint
4. Region of convergence
Non-linear least squares
    f(x) = Σ_{i=1}^{M} r_i(x)²
Gradient
    ∇f(x) = 2 Σ_{i=1}^{M} r_i(x) ∇r_i(x)
Hessian
    H = ∇∇^T f(x) = 2 Σ_{i=1}^{M} ∇( r_i(x) ∇^T r_i(x) )
                  = 2 Σ_{i=1}^{M} ( ∇r_i(x) ∇^T r_i(x) + r_i(x) ∇∇^T r_i(x) )

which is approximated as

    H_GN = 2 Σ_{i=1}^{M} ∇r_i(x) ∇^T r_i(x)
! " , * 9 ,*9%"#9G a u s s - N e w t o n 155/&K,$1%,&.
x n ! " " x n # αnH#n"gn $ % & ' Hn ( x ) " H$% ( x n )
[Figure: Gauss-Newton method with line search on the Rosenbrock function – full view and zoomed view; gradient < 1e-3 after 14 iterations.]
• Minimization with the Gauss-Newton approximation and line search takes only 14 iterations.
Comparison
Newton  |  Gauss-Newton
[Figure: Newton method with line search, gradient < 1e-3 after 15 iterations (left); Gauss-Newton method with line search, gradient < 1e-3 after 14 iterations (right).]
Newton:
• requires computing the Hessian
• exact solution if f is quadratic

Gauss-Newton:
• approximates the Hessian by products of the residual gradients
• requires only first derivatives
Summary of minimization methods
&'()*+ x n ! " , x n ! δx
"- %+.*/0-
H δx , # g
1- $)2334%+.*/0-
HVD#δx , # g
5-6$7)(8+0* (+39+0*-
λ δx , # g
Levenberg-Marquardt algorithm
• Away from the minimum, in regions of negative curvature, the Gauss-Newton approximation is not very good.
• In such regions, a simple steepest-descent step is probably the best plan.
• The Levenberg-Marquardt method is a mechanism for varying between steepest-descent and Gauss-Newton steps depending on how good the H_GN approximation is locally.
[Figure: a 1D function with the Newton step and the gradient-descent step indicated.]
$ % & ' # ; ' . & 7 + # 5/'/#.& ' #; 7 + ,R ' + X'//,*-
H= x , λ > M H$% ! λ I
$ T & ' - #λ ,/#/;*66A# H*33)7J,;*.'/#.& ' #V*5//BD'2.7- X'//,*-8
$ T & ' - #λ ,/#6*)('A# H,/#467/'#. 7 # .& ' #,+'-.,.1A#4*5/,-(#/.''3'/.B+'/4'-.##
/.'3/#. 7 # E' .*Y'-8
LM Algorithm
    H(x, λ) = H_GN(x) + λ I

1. Set λ = 0.001 (say).
2. Solve δx = −H(x, λ)⁻¹ g.
3. If f(x_n + δx) > f(x_n), increase λ (×10 say) and go to 2.
4. Otherwise, decrease λ (×0.1 say), let x_{n+1} = x_n + δx, and go to 2.

Note: this algorithm does not require explicit line searches.
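A direct Python transcription of steps 1 to 4 (a sketch, assuming the least-squares quantities g = 2 J^T r and H_GN = 2 J^T J from the Gauss-Newton section; the constants 0.001, ×10 and ×0.1 are the suggested values above):

import numpy as np

def levenberg_marquardt(residuals, jacobian, x0, lam=1e-3, tol=1e-3, max_iter=200):
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        r = residuals(x)
        J = jacobian(x)
        g = 2.0 * J.T @ r
        if np.linalg.norm(g) < tol:
            break
        H = 2.0 * J.T @ J + lam * np.eye(x.size)   # modified Hessian H(x, lambda)
        dx = np.linalg.solve(H, -g)                # step 2: solve for dx
        if np.sum(residuals(x + dx)**2) > np.sum(r**2):
            lam *= 10.0                            # step 3: worse, behave more like gradient descent
        else:
            lam *= 0.1                             # step 4: better, behave more like Gauss-Newton
            x = x + dx
    return x

# With the Rosenbrock residuals/jacobian from the Gauss-Newton sketch:
# print(levenberg_marquardt(residuals, jacobian, [-1.0, 1.0]))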
Example
[Figure: Levenberg-Marquardt method on the Rosenbrock function – full view and zoomed view; gradient < 1e-3 after 31 iterations.]
! "#$#%#&'(#)$*+,#$-*./0/$1/2-3"'24+'25(*6 $ ) *7#$/*,/'289:*(';/,*<=**
#(/2'(#)$,>
Matlab: lsqnonlin
Comparison
Gauss-Newton  |  Levenberg-Marquardt
[Figure: Gauss-Newton method with line search, gradient < 1e-3 after 14 iterations (left); Levenberg-Marquardt method, gradient < 1e-3 after 31 iterations (right).]
• More iterations than Gauss-Newton, but:
• no line search is required,
• and it converges more frequently.