Continuous Optimization

Recall from Section 1.1.4 that training a machine learning model often boils down to finding a good set of parameters. The notion of "good" is determined by the objective function or the probabilistic model, which we will see examples of in the second part of this book. Given an objective function, finding the best value is done using optimization algorithms. We will assume in this chapter that our objective function is differentiable (see Chapter 5), hence we have access to a gradient at each location in the space to help us find the optimum value. By convention, most objective functions in machine learning are intended to be minimized, that is, the best value is the minimum value. Intuitively, finding the best value is like finding the valleys of the objective function, and the gradients point us uphill. The idea is to move downhill (opposite to the gradient) and hope to find the deepest point. (Since we consider data and models in R^D, the optimization problems we face are continuous optimization problems, as opposed to combinatorial optimization problems for discrete variables.)

Consider the function in Figure 7.1. The function has a global minimum around the value x = -4.5, which has an objective function value of around -47.

Figure 7.1 Example objective function. Gradients are indicated by arrows, and the global minimum is indicated by the dashed blue line.

Since the function is "smooth", the gradients can be used to help find the minimum by indicating whether we should take a step to the right or left. This assumes that we are in the correct bowl, as there exists another local minimum around the value x = 0.7. Recall that we can solve for all the stationary points of a function (points that have zero gradient) by calculating its derivative and setting it to zero. Let

    ℓ(x) = x^4 + 7x^3 + 5x^2 - 17x + 3 .                          (7.1)

Its gradient is given by

    dℓ(x)/dx = 4x^3 + 21x^2 + 10x - 17 .                          (7.2)

Since this is a cubic equation, it has three solutions when set to zero. Two of them are minima and one is a maximum (around x = -1.4). Recall that to check whether a stationary point is a minimum or maximum, we need to take the derivative a second time and check whether the second derivative is positive or negative at the stationary point,

    d^2 ℓ(x)/dx^2 = 12x^2 + 42x + 10 .                            (7.3)

By substituting our visually estimated values of x = -4.5, -1.4, 0.7, we will observe that, as expected, the middle point is a maximum (d^2 ℓ(x)/dx^2 < 0) and the other two stationary points are minima.
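As a quick numerical check of this reasoning, the following Python sketch (not part of the book) finds the stationary points as the roots of (7.2) and classifies each one using the second derivative (7.3); the use of numpy.poly1d is an illustrative choice.

```python
import numpy as np

# l(x) = x^4 + 7x^3 + 5x^2 - 17x + 3 and its first two derivatives,
# represented by polynomial coefficients (highest degree first).
l = np.poly1d([1, 7, 5, -17, 3])
dl = l.deriv()     # 4x^3 + 21x^2 + 10x - 17, equation (7.2)
d2l = dl.deriv()   # 12x^2 + 42x + 10, equation (7.3)

for x in np.sort(dl.roots.real):   # stationary points: roots of the gradient
    kind = "minimum" if d2l(x) > 0 else "maximum"
    print(f"x = {x:6.3f}: l(x) = {l(x):8.3f}, l''(x) = {d2l(x):7.2f} -> {kind}")
```

Running it should report minima near x = -4.5 and x = 0.7 and a maximum near x = -1.4, matching the visual estimates above.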
Note that we have avoided analytically solving for values of x in the previous discussion, although for low-order polynomials such as the above we could. In general, we are unable to find analytic solutions, and hence we need to start at some value, say x_0 = -10, and follow the gradient. (In fact, according to the Abel-Ruffini theorem, also known as Abel's impossibility theorem, there is in general no algebraic solution for polynomials of degree 5 or more.) The gradient indicates that we should go right, but not how far (this is called the step size). Furthermore, if we had started at the right side (e.g., x_0 = 0), the gradient would have led us to the wrong minimum. Figure 7.1 illustrates the fact that for x > -1, the gradient points towards the minimum on the right of the figure, which has a larger objective value.

We will see in Section 7.3 a class of functions, called convex functions, that do not exhibit this tricky dependency on the starting point of the optimization algorithm. For convex functions, all local minima are global minima. It turns out that many machine learning objective functions are designed such that they are convex, and we will see an example in Chapter 10.

The discussion in this chapter so far was about a one-dimensional function, where we are able to visualize the ideas of gradients, descent directions, and optimal values. In the rest of this chapter we develop the same ideas in high dimensions. Unfortunately we can only visualize the concepts in one dimension, but some concepts do not generalize directly to higher dimensions, therefore some care needs to be taken when reading.


7.1 Optimization using Gradient Descent

We now consider the problem of solving for the minimum of a real-valued function

    min_x f(x) ,                                                  (7.4)

where f : R^d → R is an objective function that captures the machine learning problem at hand. We assume that our function f is differentiable, and we are unable to analytically find a solution in closed form.

Gradient descent is a first-order optimization algorithm. To find a local minimum of a function using gradient descent, one takes steps proportional to the negative of the gradient of the function at the current point. Recall from Chapter 5 that the gradient points in the direction of the steepest ascent and is orthogonal to the contour lines of the function we wish to optimize. (We use the convention of row vectors for gradients.)

Consider the quadratic function f(x) = (x - 1)^2 in Figure 7.2. We begin our search for the minimum value at x_0 = -0.7, where the subscript 0 indicates the initial estimate. Observe that the negative gradient -df(x)/dx at x = -0.7 points toward the right-hand side. This means that we should take a step to the right (increasing x), which makes intuitive sense. However, it is unclear how big a step to take, and the example algorithm overshoots the (unknown) minimum to x_1 = 2.5. Now the negative gradient points to the left, so we take a step to the left, and so on.
Let us consider multivariate functions. Imagine a surface (described by the function f(x)) with a ball starting at a particular location x_0. When the ball is released, it will move downhill in the direction of steepest descent. Gradient descent exploits the fact that f(x_0) decreases fastest if one moves from x_0 in the direction of the negative gradient -((∇f)(x_0))^T of f at x_0. We assume in this book that the functions are differentiable, and refer the reader to more general settings in Section 7.4. Then, if

    x_1 = x_0 - γ ((∇f)(x_0))^T                                   (7.5)

for a small step size γ > 0, then f(x_1) ≤ f(x_0). Note that we use the transpose for the gradient since otherwise the dimensions will not work out.

This observation allows us to define a simple gradient descent algorithm: If we want to find a local optimum f(x*) of a function f : R^n → R, x ↦ f(x), we start with an initial guess x_0 of the parameters we wish to optimize and then iterate according to

    x_{i+1} = x_i - γ_i ((∇f)(x_i))^T .                           (7.6)

For suitable step size γ_i, the sequence f(x_0) ≥ f(x_1) ≥ ... converges to a local minimum.
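To make the update rule (7.6) concrete, here is a minimal Python sketch (an illustration, not the book's implementation) applied to the quadratic f(x) = (x - 1)^2 from above; the step size and iteration count are arbitrary choices.

```python
def grad_f(x):
    return 2.0 * (x - 1.0)      # derivative of f(x) = (x - 1)^2

x, gamma = -0.7, 0.4            # initial guess and a fixed step size
for i in range(20):
    x = x - gamma * grad_f(x)   # the update (7.6)
print(x)                        # approaches the minimizer x* = 1
```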
Remark. Gradient descent can be relatively slow close to the minimum: its asymptotic rate of convergence is inferior to many other methods. Using the ball rolling down the hill analogy, when the surface is a long, thin valley, the problem is poorly conditioned. For poorly conditioned convex problems, gradient descent increasingly "zigzags" as the gradients point nearly orthogonally to the shortest direction to a minimum point; see Figure 7.2. ♦

Figure 7.2 Gradient descent can lead to zigzagging and slow convergence.

7.1.1 Stepsize

As mentioned earlier, choosing a good stepsize is important in gradient descent. (The stepsize is also called the learning rate.) If the stepsize is too small, gradient descent can be slow. If the stepsize is chosen too large, gradient descent can overshoot, fail to converge, or even diverge. We will discuss the use of momentum in the next section; it is a method that smooths out erratic behavior of gradient updates and dampens oscillations.

Adaptive gradient methods rescale the stepsize at each iteration, depending on local properties of the function. There are two simple heuristics (Toussaint, 2012):

• When the function value increases after a gradient step, the step size was too large. Undo the step and decrease the stepsize.
• When the function value decreases, the step could have been larger. Try to increase the stepsize.

Although the "undo" step seems to be a waste of resources, using this heuristic guarantees monotonic convergence.
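A sketch of these two heuristics in Python; the test function, shrink/grow factors, and function names are illustrative choices, not values from the book.

```python
import numpy as np

def adaptive_gradient_descent(f, grad_f, x0, gamma=1.0, num_iters=100,
                              shrink=0.5, grow=1.1):
    x = np.asarray(x0, dtype=float)
    fx = f(x)
    for _ in range(num_iters):
        x_new = x - gamma * grad_f(x)
        f_new = f(x_new)
        if f_new > fx:             # function value increased: step was too large
            gamma *= shrink        # undo the step and decrease the stepsize
        else:                      # function value decreased: accept the step
            x, fx = x_new, f_new
            gamma *= grow          # try a larger stepsize next time
    return x

# Example: minimize f(x) = (x - 1)^2 starting from x0 = -0.7
x_star = adaptive_gradient_descent(lambda x: (x - 1)**2,
                                   lambda x: 2 * (x - 1), x0=-0.7)
```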

Example (Solving a Linear Equation System)
When we solve linear equations of the form Ax = b, in practice we solve Ax - b = 0 approximately by finding x* that minimizes the squared error

    ||Ax - b||^2 = (Ax - b)^T (Ax - b) ,                          (7.7)

if we use the Euclidean norm. The gradient of (7.7) with respect to x is

    ∇_x = 2(Ax - b)^T A .                                         (7.8)

We can use this gradient directly in a gradient descent algorithm. However, for this particular special case, it turns out that there is an analytic solution, which can be found by setting the gradient to zero. We can see that this analytic solution is given by Ax = b. We will see more on solving squared error problems in Chapter 9.
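As a sketch with made-up data (not from the book), we can run gradient descent with the gradient (7.8) and compare the result against a direct least-squares solve:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(20, 5))
b = rng.normal(size=20)

x = np.zeros(5)
gamma = 0.005                           # small, fixed step size
for _ in range(2000):
    grad = 2 * A.T @ (A @ x - b)        # (7.8), written as a column vector
    x = x - gamma * grad

x_lstsq = np.linalg.lstsq(A, b, rcond=None)[0]
print(np.allclose(x, x_lstsq, atol=1e-4))   # the two solutions should agree
```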
Remark. When applied to the solution of linear systems of equations Ax = b, gradient descent may converge slowly. The speed of convergence of gradient descent is dependent on the condition number κ = σ_max(A)/σ_min(A), which is the ratio of the maximum to the minimum singular value of A. The condition number essentially measures the ratio of the most curved direction versus the least curved direction, which corresponds to our imagery that poorly conditioned problems are long, thin valleys: they are very curved in one direction, but very flat in the other. Instead of directly solving Ax = b, one could instead solve P^{-1}(Ax - b) = 0, where P is called the preconditioner. The goal is to design P^{-1} such that P^{-1}A has a better condition number, but at the same time P^{-1} is easy to compute. For further information on gradient descent, preconditioning, and convergence we refer to Boyd and Vandenberghe (2004, Chapter 9). ♦

7.1.2 Gradient Descent with Momentum

Gradient descent with momentum (Rumelhart et al., 1986) is a method that introduces an additional term to remember what happened in the previous iteration. This memory dampens oscillations and smooths out the gradient updates. Continuing the ball analogy, the momentum term emulates the phenomenon of a heavy ball that is reluctant to change directions. (Goh (2017) wrote an intuitive blog post on gradient descent with momentum.) The idea is to have a gradient update with memory to implement a moving average. The momentum-based method remembers the update Δx_i at each iteration i and determines the next update as a linear combination of the current and previous gradients,

    x_{i+1} = x_i - γ_i ((∇f)(x_i))^T + α Δx_i                    (7.9)
    Δx_i = x_i - x_{i-1} = -γ_{i-1} ((∇f)(x_{i-1}))^T ,           (7.10)

where α ∈ [0, 1]. Sometimes we will only know the gradient approximately. In such cases, the momentum term is useful since it averages out different noisy estimates of the gradient. One particularly useful way to obtain an approximate gradient is using a stochastic approximation, which we discuss next.
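A minimal sketch of the update (7.9)-(7.10) on a poorly conditioned quadratic; the test function, starting point, and hyperparameters are illustrative choices, not values from the book.

```python
import numpy as np

def gd_momentum(grad_f, x0, gamma=0.02, alpha=0.8, num_iters=200):
    x = np.asarray(x0, dtype=float)
    delta = np.zeros_like(x)                         # remembered update, Delta x_i
    for _ in range(num_iters):
        update = -gamma * grad_f(x) + alpha * delta  # equation (7.9)
        x = x + update
        delta = update                               # equation (7.10)
    return x

# Poorly conditioned quadratic f(x) = 0.5 * (x_1^2 + 50 x_2^2)
grad = lambda x: np.array([1.0, 50.0]) * x
print(gd_momentum(grad, x0=[5.0, 1.0]))              # approaches the minimizer (0, 0)
```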


7.1.3 Stochastic Gradient Descent

Computing the gradient can be very time consuming. However, often it is possible to find a "cheap" approximation of the gradient. Approximating the gradient is still useful as long as it points in roughly the same direction as the true gradient.

Stochastic gradient descent (often shortened as SGD) is a stochastic approximation of the gradient descent method for minimizing an objective function that is written as a sum of differentiable functions. The word stochastic here refers to the fact that we acknowledge that we do not know the gradient precisely, but instead only know a noisy approximation to it. By constraining the probability distribution of the approximate gradients, we can still theoretically guarantee that SGD will converge.

In machine learning, given n = 1, ..., N data points, we often consider objective functions that are the sum of the losses L_n incurred by each example n. In mathematical notation we have the form

    L(θ) = Σ_{n=1}^{N} L_n(θ) ,                                   (7.11)

where θ is the vector of parameters of interest, i.e., we want to find θ that minimizes L. An example from regression (Chapter 9) is the negative log-likelihood, which is expressed as a sum over log-likelihoods of individual examples,

    L(θ) = - Σ_{n=1}^{N} log p(y_n | x_n, θ) ,                    (7.12)

where x_n ∈ R^D are the training inputs, y_n are the training targets, and θ are the parameters of the regression model.

Standard gradient descent, as introduced previously, is a "batch" optimization method, i.e., optimization is performed using the full training set by updating the vector of parameters according to

    θ_{i+1} = θ_i - γ_i (∇L(θ_i))^T = θ_i - γ_i Σ_{n=1}^{N} (∇L_n(θ_i))^T    (7.13)

for a suitable stepsize parameter γ_i. Evaluating the sum gradient may require expensive evaluations of the gradients from all individual functions L_n. When the training set is enormous and/or no simple formulas exist, evaluating the sums of gradients becomes very expensive.

Consider the term Σ_{n=1}^{N} (∇L_n(θ_i)) in (7.13) above: we can reduce the amount of computation by taking a sum over a smaller set of L_n. In contrast to batch gradient descent, which uses all L_n for n = 1, ..., N, we randomly choose a subset of L_n for mini-batch gradient descent. In the extreme case, we randomly select only a single L_n to estimate the gradient.
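A sketch of mini-batch SGD for the sum-of-losses objective (7.11), using an illustrative linear least-squares loss and synthetic data (none of these choices come from the book):

```python
import numpy as np

rng = np.random.default_rng(1)
N, D = 1000, 3
X = rng.normal(size=(N, D))
theta_true = np.array([1.0, -2.0, 0.5])
y = X @ theta_true + 0.1 * rng.normal(size=N)

def grad_minibatch(theta, idx):
    """Gradient of the summed squared-error losses L_n over the examples in idx."""
    return 2 * X[idx].T @ (X[idx] @ theta - y[idx])

theta = np.zeros(D)
batch_size, gamma = 32, 0.001
for step in range(500):
    idx = rng.choice(N, size=batch_size, replace=False)  # random mini-batch
    theta = theta - gamma * grad_minibatch(theta, idx)   # noisy version of (7.13)
print(theta)   # close to theta_true
```

Setting batch_size = N recovers batch gradient descent, and batch_size = 1 gives the extreme case mentioned above.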


Why should one consider using an approximate gradient? A major reason is practical implementation constraints, such as the size of CPU/GPU memory or limits on computational time. We can think of the size of the subset used to estimate the gradient in the same way that we thought of the size of a sample when estimating empirical means (Section 6.4.1). In practice, it is good to keep the size of the mini-batch as large as possible. Large mini-batches reduce the variance in the parameter update; this often leads to more stable convergence since the gradient estimator is less noisy. Furthermore, large mini-batches take advantage of highly optimized matrix operations in vectorized implementations of the cost and gradient. However, when we choose the mini-batch size, we need to make sure it fits into CPU/GPU memory. Typical mini-batch sizes are 64, 128, 256, 512, 1024, which depends on the way computer memory is laid out and accessed.

Remark. When the learning rate decreases at an appropriate rate, and subject to relatively mild assumptions, stochastic gradient descent converges almost surely to a local minimum (Bottou, 1998). ♦

If we keep the mini-batch size small, the noise in our gradient estimate will allow us to get out of some bad local optima, which we may otherwise get stuck in.

Stochastic gradient descent is very effective in large-scale machine learning problems (Bottou et al., 2018), such as training deep neural networks on millions of images (Dean et al., 2012), topic models (Hoffman et al., 2013), reinforcement learning (Mnih et al., 2015), or training large-scale Gaussian process models (Hensman et al., 2013; Gal et al., 2014).

7.2 Constrained Optimization and Lagrange Multipliers

In the previous section, we considered the problem of solving for the minimum of a function

    min_x f(x) ,                                                  (7.14)

where f : R^D → R.

In this section we have additional constraints. That is, for real-valued functions g_i : R^D → R for i = 1, ..., m, we consider the constrained optimization problem

    min_x f(x)                                                    (7.15)
    subject to  g_i(x) ≤ 0  for all  i = 1, ..., m .

It is worth pointing out that the functions f and g_i could be non-convex in general, and we will consider the convex case in the next section.

One obvious, but not very practical, way of converting the constrained problem (7.15) into an unconstrained one is to use an indicator function

    J(x) = f(x) + Σ_{i=1}^{m} 1(g_i(x)) ,                         (7.16)

where 1(z) is an infinite step function

    1(z) = 0 if z ≤ 0, and ∞ otherwise.                           (7.17)

Figure 7.3 Illustration of constrained optimization. The unconstrained problem (indicated by the contour lines) has a minimum on the right side (indicated by the circle). The box constraints (-1 ≤ x ≤ 1 and -1 ≤ y ≤ 1) require that the optimal solution is within the box, resulting in an optimal value indicated by the star.

This gives infinite penalty if the constraint is not satisfied, and hence would provide the same solution. However, this infinite step function is equally difficult to optimize. We can overcome this difficulty by introducing Lagrange multipliers. The idea of Lagrange multipliers is to replace the step function with a linear function.

We associate to problem (7.15) the Lagrangian by introducing the Lagrange multipliers λ_i corresponding to each inequality constraint respectively (Boyd and Vandenberghe, 2004, Chapter 4),

    L(x, λ) = f(x) + Σ_{i=1}^{m} λ_i g_i(x)
            = f(x) + λ^T g(x) ,                                   (7.18)

where in the last line we have concatenated all constraints g_i(x) into a vector g(x), and all the Lagrange multipliers into a vector λ ∈ R^m.

We now introduce the idea of Lagrangian duality. In general, duality in optimization is the idea of converting an optimization problem in one set of variables x (called the primal variables) into another optimization problem in a different set of variables λ (called the dual variables). We introduce two different approaches to duality: in this section we discuss Lagrangian duality, and in Section 7.3.3 we discuss Legendre-Fenchel duality.
Theorem 7.1. The problem in (7.15),

    min_x f(x)
    subject to  g_i(x) ≤ 0  for all  i = 1, ..., m ,

is known as the primal problem, corresponding to the primal variables x. The associated Lagrangian dual problem is given by

    max_{λ ∈ R^m} D(λ)                                            (7.19)
    subject to  λ ≥ 0 ,                                           (7.20)

where λ are the dual variables and D(λ) = min_{x ∈ R^d} L(x, λ).

Proof. Recall that the difference between J(x) in (7.16) and the Lagrangian in (7.18) is that we have relaxed the indicator function to a linear function. Therefore, when λ ≥ 0, the Lagrangian L(x, λ) is a lower bound of J(x). Hence the maximum of L(x, λ) with respect to λ is J(x),

    J(x) = max_{λ ≥ 0} L(x, λ) .                                  (7.21)

Recall that the original problem was minimising J(x),

    min_{x ∈ R^d} max_{λ ≥ 0} L(x, λ) .                           (7.22)

By the minimax inequality (Boyd and Vandenberghe, 2004), it turns out that, for any function, swapping the order of the minimum and maximum above results in a smaller value,

    min_{x ∈ R^d} max_{λ ≥ 0} L(x, λ) ≥ max_{λ ≥ 0} min_{x ∈ R^d} L(x, λ) .   (7.23)

This is also known as weak duality. Note that the inner part of the right-hand side is the dual objective function D(λ), and the theorem follows.

In contrast to the original optimization problem, which has constraints, min_{x ∈ R^d} L(x, λ) is an unconstrained optimization problem for a given value of λ. If solving min_{x ∈ R^d} L(x, λ) is easy, then the overall problem is easy to solve. The reason is that the outer problem (maximization over λ) is a maximum over a set of affine functions, and hence is a concave function, even though f(·) and g_i(·) may be non-convex. The maximum of a concave function can be efficiently computed.

Assuming f(·) and g_i(·) are differentiable, we find the Lagrange dual problem by differentiating the Lagrangian with respect to x, setting the differential to zero, and solving for the optimal value. We will discuss two concrete examples in Sections 7.3.1 and 7.3.2, where f(·) and g_i(·) are convex.


Figure 7.4 Example of a convex function.

Remark (Equality constraints). Consider (7.15) with additional equality constraints,

    min_x f(x)                                                    (7.24)
    subject to  g_i(x) ≤ 0  for all  i = 1, ..., m ,
                h_j(x) = 0  for all  j = 1, ..., n .

We can model equality constraints by replacing them with two inequality constraints. That is, for each equality constraint h_j(x) = 0, we equivalently replace it by the two constraints h_j(x) ≤ 0 and h_j(x) ≥ 0. It turns out that the resulting Lagrange multipliers are then unconstrained.

Therefore we constrain the Lagrange multipliers corresponding to the inequality constraints in (7.24) to be non-negative, and leave the Lagrange multipliers corresponding to the equality constraints unconstrained. ♦

7.3 Convex Optimization

We focus our attention on a particularly useful class of optimization problems, where we can guarantee global optimality. When f(·) is a convex function, and when the constraints involving g(·) and h(·) are convex sets, this is called a convex optimization problem. In this setting, we have strong duality: the optimal solution of the dual problem is the same as the optimal solution of the primal problem. The distinction between convex functions and convex sets is often not strictly presented in machine learning literature, but one can often infer the implied meaning from context.

Convex functions are functions such that a straight line between any two points of the function lies above the function. Figure 7.1 shows a nonconvex function and Figure 7.2 shows a convex function. Another convex function is shown in Figure 7.4.


Definition 7.2. A function f : R^D → R is convex if for all x, y in the domain of f, and for any scalar θ with 0 ≤ θ ≤ 1, we have

    f(θx + (1 - θ)y) ≤ θ f(x) + (1 - θ) f(y) .                    (7.25)

(Technically, the domain of the function f must also be a convex set.)

The constraints involving g(·) and h(·) above truncate functions at a scalar value, resulting in sets. Another relation between convex functions and convex sets is to consider the set obtained by "filling in" a convex function. A convex function is a bowl-like object, and we imagine pouring water into it to fill it up. This resulting filled-in set, called the epigraph of the convex function, is a convex set. Convex sets are sets such that a straight line connecting any two elements of the set lies inside the set. Figure 7.5 and Figure 7.6 illustrate convex and nonconvex sets, respectively.

Figure 7.5 Example of a convex set. Figure 7.6 Example of a nonconvex set.

Definition 7.3. A set C is convex if for any x, y ∈ C and for any scalar θ with 0 ≤ θ ≤ 1, we have

    θx + (1 - θ)y ∈ C .                                           (7.26)

Remark. We can check that a function or set is convex from first principles by recalling the definitions. In practice we often rely on operations that preserve convexity to check that a particular function or set is convex. Although the details are vastly different, this is again the idea of closure that we introduced in Chapter 2 for vector spaces. ♦
Example
A nonnegative weighted sum of convex functions is convex. Observe that if f is a convex function, and α ≥ 0 is a nonnegative scalar, then the function αf is convex. We can see this by multiplying both sides of the equation in Definition 7.2 by α, and recalling that multiplying by a nonnegative number does not change the inequality.

If f_1 and f_2 are convex functions, then we have by the definition

    f_1(θx + (1 - θ)y) ≤ θ f_1(x) + (1 - θ) f_1(y)                (7.27)
    f_2(θx + (1 - θ)y) ≤ θ f_2(x) + (1 - θ) f_2(y) .              (7.28)

Summing up both sides gives us

    f_1(θx + (1 - θ)y) + f_2(θx + (1 - θ)y)
        ≤ θ f_1(x) + (1 - θ) f_1(y) + θ f_2(x) + (1 - θ) f_2(y) ,  (7.29)

where the right-hand side can be rearranged to

    θ (f_1(x) + f_2(x)) + (1 - θ)(f_1(y) + f_2(y)) ,               (7.30)

completing the proof that the sum of convex functions is convex.

Combining the two facts above, we see that α f_1(x) + β f_2(x) is convex for α, β ≥ 0. This closure property can be extended using a similar argument for nonnegative weighted sums of more than two convex functions.
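Although not a substitute for the algebraic closure argument above, the definition can also be checked numerically on examples. A small sketch (not from the book): sample random pairs of points and interpolation weights and look for violations of (7.25).

```python
import numpy as np

def looks_convex(f, dim, num_trials=10_000, seed=0):
    """Return False if a sampled counterexample to (7.25) is found."""
    rng = np.random.default_rng(seed)
    for _ in range(num_trials):
        x, y = rng.normal(size=dim), rng.normal(size=dim)
        theta = rng.uniform()
        lhs = f(theta * x + (1 - theta) * y)
        rhs = theta * f(x) + (1 - theta) * f(y)
        if lhs > rhs + 1e-12:
            return False
    return True   # no counterexample found (this is evidence, not a proof)

print(looks_convex(lambda x: np.sum(x**2), dim=2))        # True: convex
print(looks_convex(lambda x: np.sin(np.sum(x)), dim=2))   # False: nonconvex
```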


Figure 7.7 Illustration of a linear program. The unconstrained problem (indicated by the contour lines) has a minimum on the right side. The optimal value given the constraints is shown by the star.

Remark. The inequality defining convex functions, see (7.25), is sometimes called Jensen's inequality. In fact, a whole class of inequalities for taking nonnegative weighted sums of convex functions are all called Jensen's inequality. ♦

7.3.1 Linear Programming

Consider the special case when all the functions above are linear, that is,

    min_{x ∈ R^d}  c^T x                                          (7.31)
    subject to  Ax ≤ b ,

where A ∈ R^{m×d} and b ∈ R^m. This is known as a linear program. (Linear programs are one of the most widely used approaches in industry.) It has d variables and m linear constraints.

Example
An example of a linear program is illustrated in Figure 7.7, which has two variables. The objective function is linear, resulting in linear contour lines. The constraint set in standard form is translated into the legend. The optimal value must lie in the shaded (feasible) region, and is indicated by the star.

    min_{x ∈ R^2}   - [ 5 ]^T [ x_1 ]
                      [ 3 ]   [ x_2 ]                             (7.32)

    subject to
        [  2   2 ]               [ 33 ]
        [  2  -4 ]   [ x_1 ]     [  8 ]
        [ -2   1 ]   [ x_2 ]  ≤  [  5 ]                           (7.33)
        [  0  -1 ]               [ -1 ]
        [  0   1 ]               [  8 ]

The Lagrangian is given by

    L(x, λ) = c^T x + λ^T (Ax - b) ,

where λ ∈ R^m is the vector of non-negative Lagrange multipliers. It is easier to see what is going on by rearranging the terms corresponding to x,

    L(x, λ) = (c + A^T λ)^T x - λ^T b .

Taking the derivative of L(x, λ) with respect to x and setting it to zero gives us

    c + A^T λ = 0 .

Therefore the dual Lagrangian is D(λ) = -λ^T b. Recall we would like to maximize D(λ). This is traditionally presented as minimising the negative of it. In addition to the constraint due to the derivative of L(x, λ) being zero, we also have the fact that λ ≥ 0, resulting in the following dual optimization problem,

    min_{λ ∈ R^m}  b^T λ                                          (7.34)
    subject to  c + A^T λ = 0 ,
                λ ≥ 0 .

This is also a linear program, but with m variables. We have the choice of solving the primal (7.31) or the dual (7.34) program depending on whether m or d is larger. Recall that d is the number of variables and m is the number of constraints in the primal linear program.
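As a sketch, the primal program (7.32)-(7.33) can be handed directly to an off-the-shelf LP solver; here we assume SciPy's linprog is available (an illustrative choice, not the book's prescribed method).

```python
import numpy as np
from scipy.optimize import linprog

c = np.array([-5.0, -3.0])                # objective of (7.32): minimize -5 x1 - 3 x2
A_ub = np.array([[ 2.0,  2.0],
                 [ 2.0, -4.0],
                 [-2.0,  1.0],
                 [ 0.0, -1.0],
                 [ 0.0,  1.0]])
b_ub = np.array([33.0, 8.0, 5.0, -1.0, 8.0])

# linprog minimizes c^T x subject to A_ub x <= b_ub; allow x to be unbounded below.
res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(None, None), (None, None)])
print(res.x)                              # optimizer of the primal linear program
```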

7.3.2 Quadratic Programming

Consider when the objective function is a convex quadratic function, and the constraints are affine,

    min_{x ∈ R^d}  (1/2) x^T Q x + c^T x                          (7.35)
    subject to  Ax ≤ b ,

where A ∈ R^{m×d}, b ∈ R^m, and c ∈ R^d. The square symmetric matrix Q ∈ R^{d×d} is positive definite, and therefore the objective function is convex. This is known as a quadratic program. Observe that it has d variables and m linear constraints.

Example
An example of a quadratic program is illustrated in Figure 7.3, which has two variables. The objective function is quadratic with a positive semidefinite matrix Q, resulting in elliptical contour lines. The optimal value must lie in the shaded (feasible) region, and is indicated by the star.

    min_{x ∈ R^2}  (1/2) [ x_1 ]^T [ 2  1 ] [ x_1 ]  +  [ 5 ]^T [ x_1 ]
                         [ x_2 ]   [ 1  4 ] [ x_2 ]     [ 3 ]   [ x_2 ]     (7.36)

    subject to
        [  1   0 ]               [ 1 ]
        [ -1   0 ]   [ x_1 ]     [ 1 ]
        [  0   1 ]   [ x_2 ]  ≤  [ 1 ]                                      (7.37)
        [  0  -1 ]               [ 1 ]

The Lagrangian is given by

    L(x, λ) = (1/2) x^T Q x + c^T x + λ^T (Ax - b)
            = (1/2) x^T Q x + (c + A^T λ)^T x - λ^T b ,

where again we have rearranged the terms. Taking the derivative of L(x, λ) with respect to x and setting it to zero gives

    Qx + (c + A^T λ) = 0 .

Assuming that Q is invertible, we get

    x = -Q^{-1} (c + A^T λ) .                                     (7.38)

Substituting (7.38) into the primal Lagrangian L(x, λ), we get the dual Lagrangian

    D(λ) = -(1/2) (c + A^T λ)^T Q^{-1} (c + A^T λ) - λ^T b .

Therefore the dual optimization problem is given by

    min_{λ ∈ R^m}  (1/2) (c + A^T λ)^T Q^{-1} (c + A^T λ) + λ^T b           (7.39)
    subject to  λ ≥ 0 .                                           (7.40)
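As a sketch, we can solve the dual (7.39)-(7.40) for the example (7.36)-(7.37) numerically and recover the primal solution via (7.38); the use of scipy.optimize.minimize with non-negativity bounds on λ is an illustrative choice, not the book's prescribed method.

```python
import numpy as np
from scipy.optimize import minimize

Q = np.array([[2.0, 1.0], [1.0, 4.0]])
c = np.array([5.0, 3.0])
A = np.array([[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]])
b = np.ones(4)
Q_inv = np.linalg.inv(Q)

def dual_objective(lam):
    v = c + A.T @ lam
    return 0.5 * v @ Q_inv @ v + b @ lam     # objective of (7.39)

res = minimize(dual_objective, x0=np.zeros(4),
               bounds=[(0.0, None)] * 4)     # lambda >= 0, constraint (7.40)
x = -Q_inv @ (c + A.T @ res.x)               # recover x via (7.38)
print(x)                                     # approximately [-1.0, -0.5]
```

By strong duality for this convex problem, the recovered x is also the solution of the primal quadratic program.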


We will see an application of Quadratic Programming in machine learning in Chapter 10.

7.3.3 Legendre-Fenchel Transform and Convex Conjugate

Let us revisit the idea of duality, which we saw in Section 7.2, without considering constraints. One useful fact about convex sets is that a convex set can be equivalently described by its supporting hyperplanes. A hyperplane is called a supporting hyperplane of a convex set if it intersects the convex set and the convex set is contained on just one side of it. Recall that for a convex function, we can fill it up to obtain the epigraph, which is a convex set. Therefore we can also describe convex functions in terms of their supporting hyperplanes. Furthermore, observe that the supporting hyperplane just touches the convex function, and is in fact the tangent to the function at that point. And recall that the tangent of a function f(x) at a given point x_0 is the evaluation of the gradient of that function at that point, df(x)/dx evaluated at x = x_0. In summary, because convex sets can be equivalently described by their supporting hyperplanes, convex functions can be equivalently described by a function of their gradient. The Legendre transform formalizes this concept. (Physics students are often introduced to the Legendre transform as relating the Lagrangian and the Hamiltonian in classical mechanics.)

We begin with the most general definition, which unfortunately has a counterintuitive form, and look at special cases to try to relate the definition to the intuition above. The Legendre-Fenchel transform is a transformation (in the sense of a Fourier transform) from a convex differentiable function f(x) to a function that depends on the tangents s(x) = ∇_x f(x). It is worth stressing that this is a transformation of the function f(·) and not of the variable x or of the function evaluated at a value. The Legendre-Fenchel transform is also known as the convex conjugate (for reasons we will see soon) and is closely related to duality.

Definition 7.4. The convex conjugate of a function f : R^D → R is a function f* defined by

    f*(s) = sup_{x ∈ R^D} ( ⟨s, x⟩ - f(x) ) .                     (7.41)

Note that the convex conjugate definition above does not need the function f to be convex nor differentiable. In the definition above, we have used a general inner product (Section 3.2), but in the rest of this section we will consider the standard dot product between finite-dimensional vectors (⟨s, x⟩ = s^T x) to avoid too many technical details.

To understand the above definition in a geometric fashion, consider a nice simple one-dimensional convex and differentiable function, for example f(x) = x^2. (This derivation is easiest to understand by drawing the reasoning as it progresses.) Note that since we are looking at a one-dimensional problem, hyperplanes reduce to a line. Consider a line y = sx + c. Recall that we are able to describe convex functions by their supporting hyperplanes, so let us try to describe this function f(x) by its supporting lines.
Fix the gradient of the line s ∈ R and, for each point (x_0, f(x_0)) on the graph of f, find the minimum value of c such that the line still intersects (x_0, f(x_0)). Note that the minimum value of c is the place where a line with slope s "just touches" the function f(x) = x^2. The line passing through (x_0, f(x_0)) with gradient s is given by

    y - f(x_0) = s(x - x_0) .                                     (7.42)

The y-intercept of this line is -sx_0 + f(x_0). The minimum of c for which y = sx + c intersects with the graph of f is therefore

    inf_{x_0}  -sx_0 + f(x_0) .                                   (7.43)

The convex conjugate above is by convention defined to be the negative of this. The reasoning in this paragraph did not rely on the fact that we chose a one-dimensional convex and differentiable function, and holds for f : R^D → R which are nonconvex and non-differentiable.

However, convex differentiable functions such as the example f(x) = x^2 are a nice special case, where there is no need for the supremum, and there is a one-to-one correspondence between a function and its Legendre transform. (The classical Legendre transform is defined on convex differentiable functions in R^D.) For a convex differentiable function, we know that at x_0 the tangent touches f(x_0), therefore

    f(x_0) = sx_0 + c .                                           (7.44)

Rearranging to get an expression for c,

    c = -sx_0 + f(x_0) .                                          (7.45)

Note that c changes with x_0 and therefore with s; since the conjugate is by convention the negative of this minimal intercept, we can think of it as a function of s, which we call f*(s),

    f*(s) = sx - f(x) .                                           (7.46)

The conjugate function has nice properties; for example, for convex functions, applying the Legendre transform again gets us back to the original function. In the same way that the slope of f(x) is s, the slope of f*(s) is x. The following two examples show common uses of convex conjugates in machine learning.

Example
To illustrate the application of convex conjugates, consider the quadratic function based on a positive definite matrix K ∈ R^{n×n}. We denote the primal variable to be y ∈ R^n and the dual variable to be α ∈ R^n. Consider

    f(y) = (λ/2) y^T K^{-1} y .                                   (7.47)

Applying Definition 7.4, we obtain the function

    f*(α) = sup_{y ∈ R^n}  ⟨y, α⟩ - (λ/2) y^T K^{-1} y .          (7.48)

Observe that the function is differentiable, and hence we can find the maximum by taking the derivative with respect to y and setting it to zero,

    ∂[⟨y, α⟩ - (λ/2) y^T K^{-1} y] / ∂y = (α - λ K^{-1} y)^T ,    (7.49)

and hence when the gradient is zero we have y = (1/λ) K α. Substituting into (7.48) yields

    f*(α) = ⟨(1/λ) K α, α⟩ - (λ/2) ((1/λ) K α)^T K^{-1} ((1/λ) K α)   (7.50)
          = (1/(2λ)) α^T K α .                                        (7.51)
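A small numerical sanity check (not from the book) of this conjugate pair: evaluate the supremum in Definition 7.4 directly for a random positive definite K and compare with the closed form (7.51).

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
B = rng.normal(size=(3, 3))
K = B @ B.T + 3 * np.eye(3)        # a positive definite matrix
lam = 0.7
alpha = rng.normal(size=3)

f = lambda y: 0.5 * lam * y @ np.linalg.solve(K, y)   # f(y) = (lam/2) y^T K^{-1} y

# sup_y <y, alpha> - f(y), computed as the negative of a minimization
res = minimize(lambda y: f(y) - y @ alpha, x0=np.zeros(3))
print(-res.fun)                          # numerical value of f*(alpha)
print(alpha @ K @ alpha / (2 * lam))     # closed form (7.51); should match
```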

Example
In machine learning, we often use sums of functions; for example, the objective function of the training set includes a sum of the losses for each example in the training set. In the following, we derive the convex conjugate of a sum of losses ℓ(t), where ℓ : R → R. This also illustrates the application of the convex conjugate to the vector case. Let L(t) = Σ_{i=1}^{n} ℓ_i(t_i). Then

    L*(z) = sup_{t ∈ R^n}  ⟨z, t⟩ - Σ_{i=1}^{n} ℓ_i(t_i)                    (7.52)
          = sup_{t ∈ R^n}  Σ_{i=1}^{n} ( z_i t_i - ℓ_i(t_i) )               (7.53)   definition of dot product
          = Σ_{i=1}^{n} sup_{t_i ∈ R} ( z_i t_i - ℓ_i(t_i) )                (7.54)
          = Σ_{i=1}^{n} ℓ_i*(z_i) .                                         (7.55)   definition of conjugate

Recall that in Section 7.2 we derived a dual optimization problem using Lagrange multipliers. Furthermore, for convex optimization problems we have strong duality, that is, the solutions of the primal and dual problem match. The Legendre-Fenchel transform described here can also be used to derive a dual optimization problem. Furthermore, when the function is convex and differentiable, the supremum is unique. To further investigate the relation between these two approaches, let us consider a linear equality constrained convex optimization problem.

Example
Let f(y) and g(x) be convex functions, and A a real matrix of appropriate dimensions such that Ax = y. Then

    min_x  f(Ax) + g(x) = min_{Ax=y}  f(y) + g(x) .               (7.56)

By introducing the Lagrange multiplier u for the constraints Ax = y,

    min_{Ax=y}  f(y) + g(x) = min_{x,y} max_u  f(y) + g(x) + (Ax - y)^T u     (7.57)
                            = max_u min_{x,y}  f(y) + g(x) + (Ax - y)^T u ,   (7.58)

where the last step of swapping max and min is due to the fact that f(y) and g(x) are convex functions. By splitting up the dot product term and collecting x and y,

    max_u min_{x,y}  f(y) + g(x) + (Ax - y)^T u                               (7.59)
    = max_u [ min_y ( -y^T u + f(y) ) + min_x ( (Ax)^T u + g(x) ) ]           (7.60)
    = max_u [ min_y ( -y^T u + f(y) ) + min_x ( x^T A^T u + g(x) ) ] .        (7.61)

Recall the convex conjugate (Definition 7.4) and the fact that dot products are symmetric (for general inner products, A^T is replaced by the adjoint A*), so that

    max_u [ min_y ( -y^T u + f(y) ) + min_x ( x^T A^T u + g(x) ) ]            (7.62)
    = max_u  -f*(u) - g*(-A^T u) .                                            (7.63)

Therefore we have shown that

    min_x  f(Ax) + g(x) = max_u  -f*(u) - g*(-A^T u) .                        (7.64)

The Legendre-Fenchel conjugate turns out to be quite useful for machine learning problems that can be expressed as convex optimization problems. In particular, for convex loss functions that apply independently to each example, the conjugate loss is a convenient way to derive a dual problem. We will see such an example in Chapter 10.


7.4 Further Reading

Continuous optimization is an active area of research, and we do not try to provide a comprehensive account of recent advances.

From a gradient descent perspective, there are two major weaknesses which each have their own set of literature. The first challenge is the fact that gradient descent is a first-order algorithm, and does not use information about the curvature of the surface. When there are long valleys, the gradient points perpendicularly to the direction of interest. Conjugate gradient methods avoid the issues faced by gradient descent by taking previous directions into account (Shewchuk, 1994). Second-order methods such as Newton methods use the Hessian to provide information about the curvature. Quasi-Newton methods such as L-BFGS try to use cheaper computational methods to approximate the Hessian (Nocedal and Wright, 2006).

The second challenge is non-differentiable functions. Gradient methods are not well defined when there are kinks in the function. In these cases, subgradient methods can be used (Shor, 1985). For further information and algorithms for optimizing non-differentiable functions, we refer to the book by Bertsekas (1999).

Modern applications of machine learning often mean that the size of datasets prohibits the use of batch gradient descent, and hence stochastic gradient descent is the current workhorse of large-scale machine learning methods. Recent surveys of the literature include Hazan (2015) and Bottou et al. (2018).

For duality and convex optimization, the book by Boyd and Vandenberghe (2004) includes lectures and slides online. A more mathematical treatment is provided by Bertsekas (2009). Convex optimization is based upon convex analysis, and the reader interested in more foundational results about convex functions is referred to Hiriart-Urruty and Lemaréchal (2001), Rockafellar (1970), and Borwein and Lewis (2006). Legendre-Fenchel transforms are also covered in the above books on convex analysis, but more beginner-friendly presentations are available in Zia et al. (2009) and Gonçalves (2014).

Exercises
7.1 Consider the univariate function

        f(x) = x^3 + 2x^2 + 5x - 3 .

    Find its stationary points and indicate whether they are maximum, minimum, or saddle points.

7.2 Consider the update equation for stochastic gradient descent (Equation (7.13)). Write down the update when we use a mini-batch size of one.

7.3 Express the following optimization problem as a standard linear program in matrix notation:

        max_{x ∈ R^2, ξ ∈ R}  p^T x + ξ

    subject to the constraints that ξ ≥ 0, x_0 ≤ 0, and x_1 ≤ 3.

7.4 The hinge loss (which is the loss used by the Support Vector Machine) is given by

        L(α) = max{0, 1 - α} .

    If we are interested in applying gradient methods such as L-BFGS, and do not want to resort to subgradient methods, we need to smooth the kink in the hinge loss. Compute the convex conjugate of the hinge loss L*(β), where β is the dual variable. Add an ℓ2 proximal term, and compute the conjugate of the resulting function

        L*(β) + (γ/2) β^2 ,

    where γ is a given hyperparameter.

Draft (2018-03-15) from "Mathematics for Machine Learning" by Marc Peter Deisenroth, A. Aldo Faisal, and Cheng Soon Ong, to be published by Cambridge University Press. Errata and feedback to mml-book.com.