Derivative Free Optimization
Optimization and AMS Masters - University Paris Saclay
Exercises - Classes 2 and 3
Anne Auger
[email protected]
https://www.cmap.polytechnique.fr/~auger/teaching.html
I Order statistics - Effect of selection
We want to illustrate the effect of selection on the distribution of candidate solutions in a stochastic
algorithm. More precisely, we consider a $(1,\lambda)$-ES algorithm whose state is given by $X_t \in \mathbb{R}^n$. At each
iteration $t$, $\lambda$ candidate solutions are sampled according to
$$X_i^{t+1} = X_t + U_i^{t+1}$$
with $(U_i^{t+1})_{1 \le i \le \lambda}$ i.i.d. and $U_i^{t+1} \sim \mathcal{N}(0, \mathrm{Id})$. Those candidate solutions are evaluated on the function
$f : \mathbb{R}^n \to \mathbb{R}$ to be minimized and then ranked according to their $f$ values:
$$f(X_{1:\lambda}^{t+1}) \le \ldots \le f(X_{\lambda:\lambda}^{t+1})$$
where $i:\lambda$ denotes the index of the $i$-th best candidate solution. The best candidate solution is then selected,
that is
$$X^{t+1} = X_{1:\lambda}^{t+1}.$$
We will compute, for the linear function $f(x) = x_1$ to be minimized, the conditional distribution of $X_{1:\lambda}^{t+1}$
(i.e. after selection) and compare it to the distribution of $X_i^{t+1}$ (i.e. before selection).
1. What is the distribution of $X_i^{t+1}$ conditionally on $X_t$? Deduce the density of each coordinate of $X_i^{t+1}$.
Recall that given $\lambda$ independent and identically distributed random variables $Y_1, Y_2, \ldots, Y_\lambda$, the order
statistics $Y_{(1)}, Y_{(2)}, \ldots, Y_{(\lambda)}$ are the random variables defined by sorting the realizations of $Y_1, Y_2, \ldots, Y_\lambda$ in
increasing order. We assume that each random variable $Y_i$ admits a density $f(x)$ and we denote by $F(x)$
its cumulative distribution function, that is $F(x) = \Pr(Y_1 \le x)$.
2. Compute the cumulative distribution function of $Y_{(1)}$ and deduce the density of $Y_{(1)}$.
3. Let $U_{1:\lambda}^{t+1}$ be the random vector such that
   $$X_{1:\lambda}^{t+1} = X_t + U_{1:\lambda}^{t+1}.$$
   Express, for the minimization of the linear function $f(x) = x_1$, the first coordinate of $U_{1:\lambda}^{t+1}$ as an
   order statistic.
4. Deduce the conditional distribution and conditional density of the random vector $X_{1:\lambda}^{t+1}$.
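For illustration, the effect of selection described above can be checked numerically. The following Python sketch compares the first coordinate of a single candidate (before selection) with that of the selected candidate (after selection) for $f(x) = x_1$, starting from $X_t = 0$; for this function the ranking depends only on the first coordinates of the steps $U_i^{t+1}$. The choice $\lambda = 10$ and the variable names (lam, n_trials) are arbitrary.

import numpy as np

# Effect of selection on f(x) = x_1, starting from X_t = 0: the ranking
# depends only on the first coordinate of the sampled steps U_i ~ N(0, Id),
# so only these first coordinates are sampled here.
rng = np.random.default_rng(0)
lam, n_trials = 10, 100_000                 # arbitrary illustrative choices

u1 = rng.standard_normal((n_trials, lam))   # first coordinates of the lambda steps

before = u1[:, 0]        # first coordinate of one candidate (before selection)
after = u1.min(axis=1)   # first coordinate of the selected candidate (after selection)

print("before selection: mean %+.3f, std %.3f" % (before.mean(), before.std()))
print("after  selection: mean %+.3f, std %.3f" % (after.mean(), after.std()))
# After selection, the first coordinate is the minimum of lambda standard
# normal variables: its distribution is shifted towards negative values.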
II Adaptive step-size algorithms
The implementations can be done in the programming language you prefer (Matlab/Python, ...).
We are going to test the convergence of several algorithms on some test functions, in particular on the
so-called sphere function
$$f_{\mathrm{sphere}}(x) = \sum_{i=1}^{n} x_i^2$$
and the ellipsoid function
$$f_{\mathrm{elli}}(x) = \sum_{i=1}^{n} \left(100^{\frac{i-1}{n-1}} x_i\right)^2.$$
1. What is the condition number associated to the Hessian matrix of the functions above? Are the
functions ill-conditioned?
2. Implement the functions.
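A possible implementation of Question 2 in Python (using numpy; the function names f_sphere and f_elli are illustrative choices) is sketched below.

import numpy as np

def f_sphere(x):
    # Sphere function: sum of the squared coordinates.
    x = np.asarray(x, dtype=float)
    return float(np.sum(x ** 2))

def f_elli(x):
    # Ellipsoid function with axis scaling 100^((i-1)/(n-1)), i = 1, ..., n (n >= 2).
    x = np.asarray(x, dtype=float)
    scales = 100.0 ** (np.arange(x.size) / (x.size - 1))
    return float(np.sum((scales * x) ** 2))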
The (1 + 1)-ES algorithm is one of the simplest stochastic search methods for numerical optimization. We
will start by implementing a (1 + 1)-ES with constant step-size. The pseudo-code of the algorithm is
given by
Initialize x ∈ R^n and σ > 0
while not terminate
    x′ = x + σ N(0, I)
    if f(x′) ≤ f(x)
        x = x′
where N(0, I) denotes a Gaussian vector with mean 0 and covariance matrix equal to the identity.
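For reference, a minimal Python sketch of this pseudo-code, which also records the best objective function value at each iteration as suggested in Question 1 below, could look as follows (the function name es_constant is an arbitrary choice).

import numpy as np

def es_constant(f, x0, sigma, max_evals):
    # (1+1)-ES with constant step-size: sketch of the pseudo-code above.
    # Returns the best objective function value recorded at each iteration.
    x = np.asarray(x0, dtype=float)
    fx = f(x)
    evals = 1
    history = [fx]
    while evals < max_evals:
        x_new = x + sigma * np.random.randn(x.size)   # x' = x + sigma * N(0, I)
        f_new = f(x_new)
        evals += 1
        if f_new <= fx:                               # keep the better of the two
            x, fx = x_new, f_new
        history.append(fx)
    return np.array(history)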
1. Implement the algorithm. You can write a function that takes as input an initial vector x, an initial
step-size σ and a maximum number of function evaluations and returns a vector where you have
recorded at each iteration the best objective function value.
2. Use the algorithm to minimize the sphere function in dimension n = 5. We will take as initial search
   point $x_0 = (1, \ldots, 1)$ [x=ones(1,5)], as initial step-size $\sigma = 10^{-3}$ [sigma=1e-3] and as stopping
   criterion a maximum number of function evaluations equal to $2 \times 10^4$.
3. Plot the evolution of the function value of the best solution versus the number of iterations (or
function evaluations). We will use a log scale for the y-axis (semilogy).
4. Explain the three phases observed on the figure.
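For Questions 2 and 3 above, a possible driver, assuming the es_constant and f_sphere sketches given earlier, is the following.

import numpy as np
import matplotlib.pyplot as plt

# Constant step-size (1+1)-ES on the sphere function in dimension 5.
history = es_constant(f_sphere, x0=np.ones(5), sigma=1e-3, max_evals=2 * 10**4)

plt.semilogy(history)          # log scale on the y-axis
plt.xlabel("number of iterations")
plt.ylabel("best objective function value")
plt.show()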
To accelerate the convergence, we will implement a step-size adaptive algorithm, i.e. σ is not fixed
once and for all. The method used to adapt the step-size is called the one-fifth success rule. The pseudo-code of
the (1 + 1)-ES with one-fifth success rule is given by:
Initialize x ∈ R^n and σ > 0
while not terminate
    x′ = x + σ N(0, I)
    if f(x′) ≤ f(x)
        x = x′
        σ = 1.5 σ
    else
        σ = (1.5)^(−1/4) σ
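A minimal Python sketch of this adaptive variant, with the same structure as the constant step-size sketch above (the function name es_one_fifth is an arbitrary choice; the step-size history is also recorded for the plot asked in Question 5 below):

import numpy as np

def es_one_fifth(f, x0, sigma, max_evals):
    # (1+1)-ES with one-fifth success rule step-size adaptation (sketch).
    # Returns the best f-value and the step-size recorded at each iteration.
    x = np.asarray(x0, dtype=float)
    fx = f(x)
    evals = 1
    f_history, sigma_history = [fx], [sigma]
    while evals < max_evals:
        x_new = x + sigma * np.random.randn(x.size)   # x' = x + sigma * N(0, I)
        f_new = f(x_new)
        evals += 1
        if f_new <= fx:            # success: accept the candidate, increase sigma
            x, fx = x_new, f_new
            sigma *= 1.5
        else:                      # failure: decrease sigma
            sigma *= 1.5 ** (-1.0 / 4.0)
        f_history.append(fx)
        sigma_history.append(sigma)
    return np.array(f_history), np.array(sigma_history)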
5. Implement the (1+1)-ES with one-fifth success rule and test the algorithm on the sphere function
   $f_{\mathrm{sphere}}(x)$ in dimension 5 (n = 5) using $x_0 = (1, \ldots, 1)$, $\sigma_0 = 10^{-3}$ and as stopping criterion a
   maximum number of function evaluations equal to $6 \times 10^2$. Plot the evolution of the square root of
   the best function value at each iteration versus the number of iterations. Use a logarithmic scale for
   the y-axis. Compare to the plot obtained in Question 3. Plot also on the same graph the evolution
   of the step-size.
6. Use the algorithm to minimize the function $f_{\mathrm{elli}}$ in dimension n = 5. Plot the evolution of the
   objective function value of the best solution versus the number of iterations. Why is the (1 + 1)-ES
   with one-fifth success rule much slower on $f_{\mathrm{elli}}$ than on $f_{\mathrm{sphere}}$?
7. Same question with the function
   $$f_{\mathrm{Rosenbrock}}(x) = \sum_{i=1}^{n-1} \left(100\,(x_i^2 - x_{i+1})^2 + (x_i - 1)^2\right).$$
8. We now consider the functions $g(f_{\mathrm{sphere}})$ and $g(f_{\mathrm{elli}})$ where $g : \mathbb{R} \to \mathbb{R},\ y \mapsto y^{1/4}$. Modify your
   implementation in Questions 5 and 6 so as to save at each iteration the distance between x and
   the optimum. Plot the evolution of the distance to the optimum versus the number of function
   evaluations on the functions $f_{\mathrm{sphere}}$ and $g(f_{\mathrm{sphere}})$ as well as on the functions $f_{\mathrm{elli}}$ and $g(f_{\mathrm{elli}})$. What
   do you observe? Explain.
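For Questions 7 and 8, the additional test functions can be sketched as follows, assuming the f_sphere and f_elli sketches given earlier (the names f_rosenbrock, g_sphere and g_elli are arbitrary choices).

import numpy as np

def f_rosenbrock(x):
    # Rosenbrock function: sum over i = 1, ..., n-1 of 100 (x_i^2 - x_{i+1})^2 + (x_i - 1)^2.
    x = np.asarray(x, dtype=float)
    return float(np.sum(100.0 * (x[:-1] ** 2 - x[1:]) ** 2 + (x[:-1] - 1.0) ** 2))

# Composed objectives for Question 8 (g is strictly increasing on [0, inf),
# so it does not change the location of the optimum).
g = lambda y: y ** 0.25
g_sphere = lambda x: g(f_sphere(x))
g_elli = lambda x: g(f_elli(x))

# The optimum of f_sphere and f_elli is the origin, so the distance to the
# optimum can be recorded as np.linalg.norm(x) at each iteration of the ES.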