Chapter 7
Continuous Optimization
Recall from Section 1.1.4 that training a machine learning model often boils down to finding a good set of parameters. The notion of "good" is determined by the objective function or the probabilistic model, which we will see examples of in the second part of this book. Given an objective function, finding the best value is done using optimization algorithms. Since we consider data and models in R^D, the optimization problems we face are continuous optimization problems, as opposed to combinatorial optimization problems for discrete variables. We will assume in this chapter that our objective function is differentiable (see Chapter 5); hence we have access to a gradient at each location in the space to help us find the optimum value. By convention, most objective functions in machine learning are intended to be minimized, that is, the best value is the minimum value. Intuitively, finding the best value is like finding the valleys of the objective function, and the gradients point us uphill. The idea is therefore to move downhill (opposite to the gradient) and hope to find the deepest point.

Consider the function in Figure 7.1. The function has a global minimum around the value x = -4.5, which has an objective function value of
around -47. Since the function is "smooth", the gradients can be used to help find the minimum by indicating whether we should take a step to the right or to the left. This assumes that we are in the correct bowl, as there exists another local minimum around the value x = 0.7. Recall that we can solve for all the stationary points of a function (the points with zero gradient) by calculating its derivative and setting it to zero. Let

\ell(x) = x^4 + 7x^3 + 5x^2 - 17x + 3.    (7.1)
Its asymptotic rate of convergence is inferior to many other methods. Using the analogy of a ball rolling down a hill, when the surface is a long, thin valley the problem is poorly conditioned. For poorly conditioned convex problems, gradient descent increasingly "zigzags" as the gradients point nearly orthogonally to the shortest direction to a minimum point; see Figure 7.2.
• When the function value increases after a gradient step, the step size was too large. Undo the step and decrease the step size.
• When the function value decreases, the step could have been larger. Try to increase the step size.
Although the "undo" step seems to be a waste of resources, using this heuristic guarantees monotonic convergence.
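The following Python sketch (not from the text) implements this heuristic for the quartic objective in (7.1); the function names, starting point, and adaptation factors are illustrative choices only.

```python
import numpy as np

def gradient_descent_adaptive(f, grad_f, x0, step_size=0.1, max_iters=100):
    """Gradient descent with the undo/adapt step-size heuristic described above."""
    x = np.asarray(x0, dtype=float)
    fx = f(x)
    for _ in range(max_iters):
        x_new = x - step_size * grad_f(x)
        fx_new = f(x_new)
        if fx_new > fx:
            # Function value increased: the step was too large.
            # Undo the step (keep the old x) and decrease the step size.
            step_size *= 0.5
        else:
            # Function value decreased: accept the step and try a larger step next time.
            x, fx = x_new, fx_new
            step_size *= 1.1
    return x

# Example: the quartic from (7.1), l(x) = x^4 + 7x^3 + 5x^2 - 17x + 3.
l = lambda x: x**4 + 7 * x**3 + 5 * x**2 - 17 * x + 3
dl = lambda x: 4 * x**3 + 21 * x**2 + 10 * x - 17
print(gradient_descent_adaptive(l, dl, x0=-5.0))  # should approach the global minimum near x = -4.5
```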
We can use this gradient directly in a gradient descent algorithm. However, for this particular special case, it turns out that there is an analytic solution, which can be found by setting the gradient to zero. We can see that this analytic solution is given by Ax = b. We will see more on solving squared-error problems in Chapter 9.
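As an illustration (a sketch with assumed values, not from the text), consider a quadratic of the form f(x) = (1/2) x^⊤ A x - b^⊤ x with symmetric positive definite A. Its gradient is Ax - b, so setting the gradient to zero gives the analytic solution Ax = b mentioned above, and gradient descent converges to the same point.

```python
import numpy as np

# Hypothetical quadratic objective f(x) = 0.5 * x^T A x - b^T x; gradient is A x - b.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])

grad = lambda x: A @ x - b

x = np.zeros(2)
step_size = 0.1
for _ in range(200):
    x = x - step_size * grad(x)      # gradient descent iterations

print(x)                      # iterative solution via gradient descent
print(np.linalg.solve(A, b))  # analytic solution of A x = b
```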
where α ∈ [0, 1]. Sometimes we will only know the gradient approximately. In such cases, the momentum term is useful since it averages out different noisy estimates of the gradient. One particularly useful way to obtain an approximate gradient is by using a stochastic approximation, which we discuss next.
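A minimal Python sketch of gradient descent with a momentum term (the test problem, step size, and the value of α are illustrative assumptions, not from the text):

```python
import numpy as np

def gradient_descent_momentum(grad_f, x0, step_size=0.01, alpha=0.9, num_iters=500):
    """Gradient descent with momentum, alpha in [0, 1].

    Each update combines the current negative gradient with a memory of the
    previous update direction, which smooths out noisy gradient estimates."""
    x = np.asarray(x0, dtype=float)
    delta = np.zeros_like(x)  # previous update, initially zero
    for _ in range(num_iters):
        delta = -step_size * grad_f(x) + alpha * delta
        x = x + delta
    return x

# Example on a poorly conditioned quadratic (a long, thin valley), f(x) = 0.5 x^T Q x.
Q = np.diag([1.0, 100.0])
print(gradient_descent_momentum(lambda x: Q @ x, x0=[5.0, 1.0]))  # approaches the minimum at the origin
```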
where x_n ∈ R^D are the training inputs, y_n are the training targets, and θ are the parameters of the regression model.
Standard gradient descent, as introduced previously, is a "batch" optimization method, i.e., optimization is performed using the full training set by updating the vector of parameters according to

\theta_{i+1} = \theta_i - \gamma_i (\nabla L(\theta_i))^\top = \theta_i - \gamma_i \sum_{n=1}^{N} (\nabla L_n(\theta_i))^\top    (7.13)
for a suitable step-size parameter γ_i. Evaluating the sum gradient may require expensive evaluations of the gradients from all individual functions L_n. When the training set is enormous and/or no simple formulas exist, evaluating the sums of gradients becomes very expensive.

Consider the term \sum_{n=1}^{N} (\nabla L_n(\theta_i)) in (7.13) above: we can reduce the amount of computation by taking a sum over a smaller set of L_n. In contrast to batch gradient descent, which uses all L_n for n = 1, ..., N, we randomly choose a subset of the L_n for mini-batch gradient descent. In the extreme case, we randomly select only a single L_n to estimate the gradient.
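The following sketch illustrates mini-batch gradient descent on a hypothetical least-squares regression problem; the data, loss, and hyperparameters are assumptions for illustration and not from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical regression data: N examples x_n in R^D with targets y_n.
N, D = 1000, 5
X = rng.normal(size=(N, D))
theta_true = rng.normal(size=D)
y = X @ theta_true + 0.1 * rng.normal(size=N)

def minibatch_gradient(theta, batch_idx):
    """Mean gradient of the squared loss (y_n - x_n^T theta)^2 over a mini-batch.

    This is a noisy (rescaled) estimate of the sum of gradients appearing in (7.13)."""
    Xb, yb = X[batch_idx], y[batch_idx]
    return -2.0 * Xb.T @ (yb - Xb @ theta) / len(batch_idx)

theta = np.zeros(D)
step_size = 0.01
batch_size = 64
for i in range(500):
    batch_idx = rng.choice(N, size=batch_size, replace=False)  # random subset of the L_n
    theta = theta - step_size * minibatch_gradient(theta, batch_idx)

print(np.linalg.norm(theta - theta_true))  # should be small
```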
Why should one consider using an approximate gradient? A major reason is practical implementation constraints, such as the size of CPU/GPU memory or limits on computational time. We can think of the size of the subset used to estimate the gradient in the same way that we thought of the size of a sample when estimating empirical means (Section 6.4.1). In practice, it is good to keep the size of the mini-batch as large as possible. Large mini-batches reduce the variance in the parameter update, which often leads to more stable convergence since the gradient estimator is less noisy. Furthermore, large mini-batches take advantage of highly optimized matrix operations in vectorized implementations of the cost and gradient. However, when we choose the mini-batch size, we need to make sure it fits into CPU/GPU memory. Typical mini-batch sizes are 64, 128, 256, 512, or 1024, depending on the way computer memory is laid out and accessed.
Remark. When the learning rate decreases at an appropriate rate, and subject to relatively mild assumptions, stochastic gradient descent converges almost surely to a local minimum (Bottou, 1998).

If we keep the mini-batch size small, the noise in our gradient estimate will allow us to get out of some bad local optima, which we may otherwise get stuck in. Stochastic gradient descent is very effective in large-scale machine learning problems (Bottou et al., 2018), such as training deep neural networks on millions of images (Dean et al., 2012), topic models (Hoffman et al., 2013), reinforcement learning (Mnih et al., 2015), or training large-scale Gaussian process models (Hensman et al., 2013; Gal et al., 2014).
where f : R^D → R.

In this section we have additional constraints. That is, for real-valued functions g_i : R^D → R, i = 1, ..., m, we consider the constrained optimization problem

\min_{x} f(x) \quad \text{subject to} \quad g_i(x) \leq 0 \quad \text{for all } i = 1, \dots, m.    (7.15)
Figure 7.3 Illustration of constrained optimization. The unconstrained problem (indicated by the contour lines) has a minimum on the right side (indicated by the circle). The box constraints (-1 ≤ x ≤ 1 and -1 ≤ y ≤ 1) require that the optimal solution lies within the box, resulting in an optimal value indicated by the star.
This gives infinite penalty if the constraint is not satisfied, and hence would provide the same solution. However, this infinite step function is equally difficult to optimize. We can overcome this difficulty by introducing Lagrange multipliers. The idea of Lagrange multipliers is to replace the step function with a linear function.

We associate to the problem (7.15) the Lagrangian, by introducing Lagrange multipliers λ_i corresponding to each inequality constraint respectively (Boyd and Vandenberghe, 2004, Chapter 4):

L(x, \lambda) = f(x) + \sum_{i=1}^{m} \lambda_i g_i(x) = f(x) + \lambda^\top g(x),    (7.18)

where in the last line we have concatenated all constraints g_i(x) into a vector g(x), and all the Lagrange multipliers into a vector λ ∈ R^m.
We now introduce the idea of Lagrangian duality. In general, duality in optimization is the idea of converting an optimization problem in one set of variables x (called the primal variables) into another optimization problem in a different set of variables λ (called the dual variables). We introduce two different approaches to duality: in this section we discuss Lagrangian duality; the duality based on the Legendre–Fenchel transform is discussed later in this chapter.
This is also known as weak duality. Note that the inner part of the right-hand side is the dual objective function D(λ), and the theorem follows.

In contrast to the original optimization problem, which has constraints, min_{x ∈ R^d} L(x, λ) is an unconstrained optimization problem for a given value of λ. If solving min_{x ∈ R^d} L(x, λ) is easy, then the overall problem is easy to solve. The reason is that D(λ) = min_{x ∈ R^d} L(x, λ) is a pointwise minimum over a set of functions that are affine in λ, and hence D(λ) is a concave function, even though f(·) and the g_i(·) may be non-convex. The outer problem, maximization over λ, is therefore the maximization of a concave function and can be computed efficiently.

Assuming f(·) and g_i(·) are differentiable, we find the Lagrange dual problem by differentiating the Lagrangian with respect to x, setting the derivative to zero, and solving for the optimal value. We will discuss two concrete examples in Sections 7.3.1 and 7.3.2, where f(·) and g_i(·) are convex.
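As a small worked illustration (not from the text), consider minimizing f(x) = x² subject to the single inequality constraint g(x) = 1 - x ≤ 0, that is, x ≥ 1. Following the recipe above,

\begin{align}
L(x, \lambda) &= x^2 + \lambda (1 - x), \qquad \lambda \geq 0, \\
\frac{\partial L}{\partial x} &= 2x - \lambda = 0 \quad\Longrightarrow\quad x = \frac{\lambda}{2}, \\
D(\lambda) &= L\!\left(\frac{\lambda}{2}, \lambda\right) = \lambda - \frac{\lambda^2}{4}.
\end{align}

Maximizing the concave dual objective D(λ) over λ ≥ 0 gives λ = 2 and D(2) = 1, which coincides with the primal optimum f(1) = 1 of this convex problem.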
Example
A nonnegative weighted sum of convex functions is convex. Observe that if f is a convex function and α ≥ 0 is a nonnegative scalar, then the function αf is convex. We can see this by multiplying both sides of the inequality in Definition 7.2 by α, and recalling that multiplying by a nonnegative number does not change the direction of the inequality.
If f_1 and f_2 are convex functions, then by the definition we have

f_1(\theta x + (1-\theta)y) \leq \theta f_1(x) + (1-\theta) f_1(y)    (7.27)
f_2(\theta x + (1-\theta)y) \leq \theta f_2(x) + (1-\theta) f_2(y).    (7.28)

Summing up both sides gives us

f_1(\theta x + (1-\theta)y) + f_2(\theta x + (1-\theta)y) \leq \theta f_1(x) + (1-\theta) f_1(y) + \theta f_2(x) + (1-\theta) f_2(y),    (7.29)

where the right-hand side can be rearranged to

\theta \big(f_1(x) + f_2(x)\big) + (1-\theta)\big(f_1(y) + f_2(y)\big),    (7.30)
completing the proof that the sum of two convex functions is convex. Combining the two facts above, we see that α f_1(x) + β f_2(x) is convex for α, β ≥ 0. This closure property can be extended using a similar argument to nonnegative weighted sums of more than two convex functions.
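A quick numerical sanity check of this closure property (a sketch; the particular functions, weights, and sampling scheme are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(1)

# Check that h = alpha*f1 + beta*f2 satisfies the convexity inequality
# h(theta*x + (1-theta)*y) <= theta*h(x) + (1-theta)*h(y) for convex f1, f2.
f1 = lambda x: x**2          # convex
f2 = lambda x: np.abs(x)     # convex
alpha, beta = 2.0, 0.5
h = lambda x: alpha * f1(x) + beta * f2(x)

for _ in range(1000):
    x, y = rng.normal(size=2)
    theta = rng.uniform()
    assert h(theta * x + (1 - theta) * y) <= theta * h(x) + (1 - theta) * h(y) + 1e-12

print("closure property held on all sampled points")
```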
Figure 7.7 Illustration of a linear program. The unconstrained problem (indicated by the contour lines) has a minimum on the right side. The optimal value given the constraints is shown by the star.
Remark. The inequality defining convex functions, see (7.25), is sometimes called Jensen's inequality. In fact, a whole class of inequalities for taking nonnegative weighted sums of convex functions are all called Jensen's inequality.
Consider the special case when all the functions involved are linear, i.e., the linear program

\min_{x \in \mathbb{R}^d} c^\top x \quad \text{subject to} \quad A x \leq b,    (7.31)

where A ∈ R^{m×d} and b ∈ R^m.
\min_{x \in \mathbb{R}^2} \; - \begin{bmatrix} 5 \\ 3 \end{bmatrix}^\top \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}    (7.32)

\text{subject to} \quad \begin{bmatrix} 2 & 2 \\ 2 & -4 \\ -2 & 1 \\ 0 & -1 \\ 0 & 1 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} \leq \begin{bmatrix} 33 \\ 8 \\ 5 \\ -1 \\ 8 \end{bmatrix}    (7.33)
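Assuming the coefficients reconstructed in (7.32)–(7.33) above, this linear program can be solved numerically, for example with SciPy (a sketch, not part of the original text):

```python
import numpy as np
from scipy.optimize import linprog

c = np.array([-5.0, -3.0])                 # objective: minimize -5 x1 - 3 x2
A = np.array([[2.0, 2.0],
              [2.0, -4.0],
              [-2.0, 1.0],
              [0.0, -1.0],
              [0.0, 1.0]])
b = np.array([33.0, 8.0, 5.0, -1.0, 8.0])

# bounds=(None, None) makes both variables free (linprog defaults to x >= 0).
result = linprog(c, A_ub=A, b_ub=b, bounds=(None, None))
print(result.x, result.fun)  # maximizer of 5 x1 + 3 x2 over the feasible polytope
```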
The dual linear program corresponding to (7.31) is

\max_{\lambda \in \mathbb{R}^m} \; -b^\top \lambda \quad \text{subject to} \quad c + A^\top \lambda = 0, \quad \lambda \geq 0.    (7.34)

This is also a linear program, but with m variables. We have the choice of solving the primal (7.31) or the dual (7.34) program depending on whether m or d is larger. Recall that d is the number of variables and m is the number of constraints in the primal linear program.
Consider the case of a convex quadratic objective function with affine constraints,

\min_{x \in \mathbb{R}^d} \; \frac{1}{2} x^\top Q x + c^\top x \quad \text{subject to} \quad A x \leq b,    (7.35)

where A ∈ R^{m×d}, b ∈ R^m, and c ∈ R^d. The square symmetric matrix Q ∈ R^{d×d} is positive definite, and therefore the objective function is convex. This is known as a quadratic program. Observe that it has d variables and m linear constraints.
Example
An example of a quadratic program is illustrated in Figure 7.3, which has two variables. The objective function is quadratic with a positive semidefinite matrix Q, resulting in elliptical contour lines. The optimal value must lie in the shaded (feasible) region, and is indicated by the star.
\min_{x \in \mathbb{R}^2} \; \frac{1}{2} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}^\top \begin{bmatrix} 2 & 1 \\ 1 & 4 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} + \begin{bmatrix} 5 \\ 3 \end{bmatrix}^\top \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}    (7.36)

\text{subject to} \quad \begin{bmatrix} 1 & 0 \\ -1 & 0 \\ 0 & 1 \\ 0 & -1 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} \leq \begin{bmatrix} 1 \\ 1 \\ 1 \\ 1 \end{bmatrix}    (7.37)
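Since the constraints (7.37) describe the box -1 ≤ x₁ ≤ 1, -1 ≤ x₂ ≤ 1, this quadratic program can be solved numerically with a bound-constrained solver (a sketch assuming the values reconstructed above):

```python
import numpy as np
from scipy.optimize import minimize

Q = np.array([[2.0, 1.0], [1.0, 4.0]])
c = np.array([5.0, 3.0])

def objective(x):
    return 0.5 * x @ Q @ x + c @ x

def gradient(x):
    return Q @ x + c

# The constraints (7.37) are the box -1 <= x1 <= 1, -1 <= x2 <= 1.
result = minimize(objective, x0=np.zeros(2), jac=gradient,
                  bounds=[(-1.0, 1.0), (-1.0, 1.0)], method="L-BFGS-B")
print(result.x)  # the unconstrained minimum lies outside the box, so the solution is on the boundary
```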
Note that the definition of the convex conjugate above does not require the function f to be convex, nor differentiable. In the definition, we have used a general inner product (Section 3.2), but in the rest of this section we will consider the standard dot product between finite-dimensional vectors (⟨s, x⟩ = s^⊤ x) to avoid too many technical details.

To understand the definition in a geometric fashion, consider a nice simple one-dimensional convex and differentiable function, for example f(x) = x². (This derivation is easiest to understand by sketching the reasoning as it progresses.) Note that since we are looking at a one-dimensional problem, hyperplanes reduce to lines. Consider a line y = sx + c. Recall that we are able to describe convex functions by their supporting hyperplanes, so let us try to describe this function f(x) by its supporting lines.
Fix the gradient of the line to s ∈ R and, for each point (x_0, f(x_0)) on the graph of f, find the minimum value of c such that the line still intersects (x_0, f(x_0)). Note that the minimum value of c is the place where a line with slope s "just touches" the function f(x) = x². The line passing through (x_0, f(x_0)) with gradient s is given by

y - f(x_0) = s(x - x_0).

The y-intercept of this line is -s x_0 + f(x_0). The minimum of c for which y = sx + c still intersects with the graph of f is therefore

\inf_{x_0} \; -s x_0 + f(x_0).

Note that this minimum changes with the slope s, which is why we can think of it as a function of s; by convention, the convex conjugate f*(s) is defined as the negative of this minimum intercept, i.e., f*(s) = \sup_{x_0} (s x_0 - f(x_0)).
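For the running example f(x) = x², the supremum in the definition of the convex conjugate can be computed in closed form (a short worked calculation, not in the original text):

\begin{align}
f^{*}(s) &= \sup_{x \in \mathbb{R}} \left( s x - x^2 \right), \\
\frac{\mathrm{d}}{\mathrm{d} x}\left( s x - x^2 \right) &= s - 2x = 0 \quad\Longrightarrow\quad x = \frac{s}{2}, \\
f^{*}(s) &= s \cdot \frac{s}{2} - \frac{s^2}{4} = \frac{s^2}{4}.
\end{align}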
The conjugate function has nice properties; for example, for convex functions, applying the Legendre transform again gets us back to the original function. In the same way that the slope of f(x) is s, the slope of f*(s) is x. The following two examples show common uses of convex conjugates in machine learning.
Example
To illustrate the application of convex conjugates, consider the quadratic function based on a positive definite matrix K ∈ R^{n×n}. We denote the primal variable to be y ∈ R^n and the dual variable to be α ∈ R^n:

f(y) = \frac{1}{2} y^\top K y.    (7.47)
Example
In machine learning, we often use sums of functions; for example, the objective function on the training set includes a sum of the losses for each example in the training set. In the following, we derive the convex conjugate of a sum of losses ℓ_i(t_i), where ℓ_i : R → R. This also illustrates the application of the convex conjugate to the vector case. Let L(t) = \sum_{i=1}^{n} \ell_i(t_i). Then

L^*(z) = \sup_{t \in \mathbb{R}^n} \; \langle z, t \rangle - \sum_{i=1}^{n} \ell_i(t_i)    (7.52)
       = \sup_{t \in \mathbb{R}^n} \; \sum_{i=1}^{n} z_i t_i - \ell_i(t_i)        (definition of the dot product)    (7.53)
       = \sum_{i=1}^{n} \sup_{t_i \in \mathbb{R}} \; z_i t_i - \ell_i(t_i)    (7.54)
       = \sum_{i=1}^{n} \ell_i^*(z_i).        (definition of the conjugate)    (7.55)
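A numerical illustration of this decomposition (the losses and the evaluation point are hypothetical choices, and the suprema are approximated on a grid):

```python
import numpy as np

# For separable L(t) = l_1(t_1) + l_2(t_2), the conjugate L*(z) equals l_1*(z_1) + l_2*(z_2).
grid = np.linspace(-10.0, 10.0, 1001)
l1 = lambda t: (t - 1.0) ** 2
l2 = lambda t: (t - 2.0) ** 2

z = np.array([0.5, -1.0])

# Conjugate of the sum: brute-force supremum over a 2D grid of t = (t_1, t_2).
T1, T2 = np.meshgrid(grid, grid, indexing="ij")
L_star = np.max(z[0] * T1 + z[1] * T2 - l1(T1) - l2(T2))

# Sum of the scalar conjugates; (t - a)^2 has the closed-form conjugate z^2/4 + a*z.
sum_of_conjugates = (z[0] ** 2 / 4 + 1.0 * z[0]) + (z[1] ** 2 / 4 + 2.0 * z[1])

print(L_star, sum_of_conjugates)  # should agree up to the grid resolution
```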
Recall that in Section 7.2 we derived a dual optimization problem using Lagrange multipliers. Furthermore, for convex optimization problems we have strong duality, that is, the solutions of the primal and dual problem match. The Legendre–Fenchel transform described here can also be used to derive a dual optimization problem. Furthermore, when the function is convex and differentiable, the supremum is unique.
Example
Let f(y) and g(x) be convex functions, and A a real matrix of appropriate dimensions such that Ax = y. Then

\min_{Ax = y} f(y) + g(x) = \min_{x, y} \max_{u} \; f(y) + g(x) + (Ax - y)^\top u = \max_{u} \min_{x, y} \; f(y) + g(x) + (Ax - y)^\top u,    (7.57)

where the last step of swapping max and min is due to the fact that f(y) and g(x) are convex functions. By splitting up the dot product term and collecting x and y,

\max_{u} \left[ \min_{y} \; -y^\top u + f(y) \; + \; \min_{x} \; (Ax)^\top u + g(x) \right].

Recall the convex conjugate (Definition 7.4) and the fact that dot products are symmetric (for general inner products, A^⊤ is replaced by the adjoint A^*); then

\max_{u} \left[ \min_{y} \; -y^\top u + f(y) \; + \; \min_{x} \; x^\top A^\top u + g(x) \right] = \max_{u} \; -f^*(u) - g^*(-A^\top u),    (7.62)

where the last equality follows by recognizing each inner minimization as a negated convex conjugate.
The Legendre–Fenchel conjugate turns out to be quite useful for machine learning problems that can be expressed as convex optimization problems. In particular, for convex loss functions that apply independently to each example, the conjugate loss is a convenient way to derive a dual problem. We will see such an example in Chapter 10.
Exercises
7.1 Consider the univariate function
f(x) = x^3 + 2x^2 + 5x - 3.
Find its stationary points and indicate whether they are maximum, minimum, or saddle points.
7.2 Consider the update equation for stochastic gradient descent (Equation (7.13)). Write down the update when we use a mini-batch size of one.
7.3 Express the following optimization problem as a standard linear program in matrix notation:

\max_{x \in \mathbb{R}^2, \; \xi \in \mathbb{R}} \; p^\top x + \xi