Midterm 1 Notes

The document discusses unconstrained optimization techniques, focusing on gradient descent and its application to finding the minima of functions. It explains the process of iterating from an initial point and adjusting step sizes based on the gradient, and includes examples with quadratic functions and methods such as Newton's method and the secant method. It also covers momentum in gradient descent and the steepest descent direction derived from Taylor series approximations.


CONTINUOUS OPTIMIZATION

Outline:

1. Unconstrained optimization: gradient descent
2. Constrained optimization and Lagrange multipliers

1. Unconstrained optimization: gradient descent

• Consider the function f(x) = x⁴ + 7x³ + 5x² − 17x + 3. Find min f(x).
• The global minimum of f(x) is at x ≈ −4.5; there is another local minimum at x ≈ 0.7.
• The stationary points of f(x) can be obtained by calculating its derivative and setting it to 0:
  f′(x) = 4x³ + 21x² + 10x − 17
• There are 3 points where f′(x) = 0: x ≈ −4.5, −1.4 and 0.7.
• Check the second derivative at these points:
  f″(x) = 12x² + 42x + 10
  o At x = −1.4, f″(x) < 0, indicating a local maximum.
  o At x = −4.5 and x = 0.7, f″(x) > 0, indicating local minima.

• In general, an analytic solution is not available.
• Start from an initial point, say x₀ = −6, and follow the negative of the gradient.
• How far? Step size λ > 0.
• For x₀ = 0, the negative gradient would lead to a local (not global) minimum.
• Many machine learning objective functions are designed to be convex; for convex optimization, every local minimum is a global minimum.
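As a quick illustration of the iteration just described, here is a minimal R sketch (not part of the original notes; the fixed step size λ = 0.01 and the iteration count are assumptions chosen for this function) of gradient descent on f starting from x₀ = −6:

  # A minimal sketch: gradient descent on the quartic above, starting from x0 = -6.
  f  <- function(x) x^4 + 7*x^3 + 5*x^2 - 17*x + 3
  fp <- function(x) 4*x^3 + 21*x^2 + 10*x - 17   # f'(x)

  x <- -6           # initial point
  lambda <- 0.01    # fixed step size (an assumption for this illustration)
  for (i in 1:1000) {
    x <- x - lambda * fp(x)   # move against the gradient
  }
  x      # ~ -4.48, near the global minimum x ~ -4.5
  f(x)

Starting from x₀ = 0 instead, the same loop converges to the local minimum near 0.7, as noted above.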
1. Unconstrained optimization: gradient descent

• Consider the problem of finding the minimum of a real-valued function
  min_x f(x),   where f: ℝⁿ → ℝ.
• We assume that f is differentiable and that an analytical solution in closed form cannot be found.
• Gradient descent is a first-order optimization algorithm: start with an initial point x₀ and iterate
  x_{i+1} = x_i − λ_i ∇f(x_i)
  where λ_i is the step size at iteration i.
• Take steps proportional to the negative of the gradient of the function at the current point x_i.
• Contour lines: the set of points where the function has a certain value, i.e. f(x) = c for some constant c.
• The gradient points in the direction orthogonal to the contour lines of the function we wish to optimize.

Example 1. Contours of f(x,y) = c for c = {−5, 0, 5, 10, 20}
  min_{x,y} f(x,y) = x² + 2y² − 4x + xy + y
• Suppose we start at (x,y) = (0,0).
• The gradient of f(x,y) is
  ∇f(x,y) = [2x + y − 4, x + 4y + 1]ᵀ, which equals [−4, 1]ᵀ at (0,0).
• The function value at (0,0) is f(0,0) = 0.
• Moving along the negative of the gradient with step size λ = ½, the new point is
  [0, 0]ᵀ + ½ [4, −1]ᵀ = [2, −½]ᵀ
• The new function value is f(2, −½) = x² + 2y² − 4x + xy + y = 4 + ½ − 8 − 1 − ½ = −5.
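The single step worked out in Example 1 can be checked numerically; a minimal R sketch (not from the notes):

  # Check of Example 1's single gradient step.
  f    <- function(v) v[1]^2 + 2*v[2]^2 - 4*v[1] + v[1]*v[2] + v[2]
  grad <- function(v) c(2*v[1] + v[2] - 4, v[1] + 4*v[2] + 1)

  x0 <- c(0, 0)
  x1 <- x0 - 0.5 * grad(x0)   # step size lambda = 1/2
  x1        # (2, -0.5)
  f(x1)     # -5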

1. Unconstrained optimization: gradient descent

Step size:
• When the function value increases after a gradient step, the step size is too large: undo the step and decrease the step size.
• When the function value decreases (too slowly), the step could have been larger: try to increase the step size.
• To find an optimal step size, given the direction d_i, find λ by solving
  min_λ f(x_i + λ d_i)
• If the function f is differentiable, we find λ > 0 such that ∂/∂λ f(x_i + λ d_i) = 0.
• Let g(λ) = f(x_i + λ d_i); the problem now is to find λ > 0 such that g′(λ) = 0.
• An unconstrained optimization in n variables thus becomes a sequence of optimization problems in one variable.

Iterative methods for solving f(x) = 0
• Bisection method:
  Given a function f(x) continuous on the interval [a₀, b₀] and such that f(a₀)f(b₀) ≤ 0,
  for n = 0, 1, 2, … until satisfied do:
  o Set m = (a_n + b_n)/2
  o If f(a_n)f(m) ≤ 0, set a_{n+1} = a_n, b_{n+1} = m
  o Otherwise, set a_{n+1} = m, b_{n+1} = b_n
  Then f(x) has a zero in the interval [a_{n+1}, b_{n+1}].
  (Figure: one bisection step on [a₀, b₀] with m = (a₀ + b₀)/2 giving the new interval [a₁, b₁].)
• Stopping conditions:
  o When the interval length (b_{n+1} − a_{n+1}) ≤ ε₁
  o When the function value |f(½(a_{n+1} + b_{n+1}))| ≤ ε₂
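A minimal R sketch of the bisection method as described above (the tolerance eps and the test interval are assumptions for illustration):

  # Bisection: keep halving the interval that brackets a root of f.
  bisection <- function(f, a, b, eps = 1e-8) {
    stopifnot(f(a) * f(b) <= 0)                # a root must be bracketed
    while ((b - a) > eps) {
      m <- (a + b) / 2
      if (f(a) * f(m) <= 0) b <- m else a <- m # keep the half that still brackets the root
    }
    (a + b) / 2
  }

  # Example: root of f'(x) = 4x^3 + 21x^2 + 10x - 17 near x = 0.7 (from the earlier slide)
  bisection(function(x) 4*x^3 + 21*x^2 + 10*x - 17, 0, 2)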
1. Unconstrained optimization: gradient descent

Iterative methods for solving f(x) = 0
• Newton's method:
  Given a differentiable function f(x) and an initial starting point x₀,
  o For n = 0, 1, 2, … until satisfied do:
    Calculate x_{n+1} = x_n − f(x_n)/f′(x_n)
  If f(x) = g′(x), then the iterates are x_{n+1} = x_n − g′(x_n)/g″(x_n).
  Explanation:
  o The line passing through (x₀, f(x₀)) with slope f′(x₀) is
    y = f′(x₀) x + f(x₀) − f′(x₀) x₀
  o Setting y = 0 gives
    x = −[f(x₀) − f′(x₀) x₀]/f′(x₀) = x₀ − f(x₀)/f′(x₀) ≡ x₁
  (Figure: tangent-line construction producing x₁ and x₂ from x₀.)

• Secant method:
  Given a continuous function f(x) and two points x₋₁, x₀,
  o For n = 0, 1, 2, … until satisfied do:
    Calculate x_{n+1} = [f(x_n) x_{n−1} − f(x_{n−1}) x_n] / [f(x_n) − f(x_{n−1})]
  Explanation:
  o For some functions, the derivative may not be easy to compute or may not be available.
  o Approximate f′(x_n) ≈ [f(x_n) − f(x_{n−1})] / (x_n − x_{n−1}). Hence,
    x_{n+1} = x_n − f(x_n)/f′(x_n)
            ≈ x_n − f(x_n)(x_n − x_{n−1}) / [f(x_n) − f(x_{n−1})]
            = {x_n[f(x_n) − f(x_{n−1})] − f(x_n)(x_n − x_{n−1})} / [f(x_n) − f(x_{n−1})]
            = [f(x_n) x_{n−1} − f(x_{n−1}) x_n] / [f(x_n) − f(x_{n−1})]
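A minimal R sketch of Newton's method for f(x) = 0, applied here to the derivative of the earlier quartic so that the root found is a stationary point (tolerance and iteration limit are assumptions):

  # Newton's method: x <- x - f(x)/f'(x) until the update is tiny.
  newton <- function(f, fprime, x0, tol = 1e-10, maxit = 100) {
    x <- x0
    for (n in 1:maxit) {
      step <- f(x) / fprime(x)
      x <- x - step
      if (abs(step) < tol) break   # stop when the update is negligible
    }
    x
  }

  # Minimizing the earlier quartic by solving f'(x) = 0 with f'' as the derivative of f':
  fp  <- function(x) 4*x^3 + 21*x^2 + 10*x - 17
  fpp <- function(x) 12*x^2 + 42*x + 10
  newton(fp, fpp, x0 = -6)   # converges to the stationary point near -4.5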

1. Unconstrained optimization: gradient descent

Example 2.
• Consider the quadratic function in 2 dimensions:
  f(x₁,x₂) = ½ [x₁ x₂] [2 1; 1 20] [x₁; x₂] − [5 3] [x₁; x₂]
  with gradient
  ∇f(x₁,x₂) = [2 1; 1 20] [x₁; x₂] − [5; 3]
• Starting from the initial point x₀ = [−3; −1], we obtain the gradient:
  ∇f(−3,−1) = [2 1; 1 20] [−3; −1] − [5; 3] = [−7; −23] − [5; 3] = [−12; −26]
• Let the step size be λ = 0.085. Then
  x₁ = x₀ − λ ∇f(x₀) = [−3; −1] + 0.085 [12; 26] = [−1.98; 1.21]

Example 3.
• Let A be an n×n matrix and b an n-dimensional vector, and consider the quadratic function
  f(x) = ½ xᵀAx − bᵀx,   ∇f(x) = Ax − b
  Ax − b = 0 implies x is the solution of the linear equation Ax = b.
• Suppose we want to minimize f(x) by starting from x₀ and computing iteratively x_{i+1} = x_i − λ_i ∇f(x_i), where λ_i > 0 is the solution of min_λ f(x_i + λ d_i) = min_λ g(λ).
• First, compute
  g′(λ) = ∇f(x_i + λ d_i)ᵀ d_i = [A(x_i + λ d_i) − b]ᵀ d_i = d_iᵀA x_i + λ d_iᵀA d_i − bᵀd_i
• Setting g′(λ) = 0 and solving for λ, we obtain
  λ_i = d_iᵀ(b − A x_i) / d_iᵀA d_i
• Let r_i = b − A x_i. Applying gradient descent, d_i = −∇f(x_i) = r_i, and the step size is λ_i = r_iᵀr_i / r_iᵀA r_i.
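Example 3's recipe can be run directly on the quadratic of Example 2. A minimal R sketch of steepest descent with the exact step λ_i = r_iᵀr_i / r_iᵀA r_i (iteration cap and tolerance are assumptions):

  # Steepest descent with exact line search for f(x) = 1/2 x'Ax - b'x.
  A <- matrix(c(2, 1, 1, 20), nrow = 2)
  b <- c(5, 3)

  x <- c(-3, -1)                       # initial point from Example 2
  for (i in 1:100) {
    r <- b - A %*% x                   # residual r_i = b - A x_i = -gradient
    if (sqrt(sum(r^2)) < 1e-10) break
    lambda <- sum(r * r) / sum(r * (A %*% r))   # exact step: r'r / r'Ar
    x <- x + lambda * r
  }
  x              # approaches solve(A, b), the solution of Ax = b
  solve(A, b)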
1. Unconstrained optimization: gradient descent

Example 3 (continued).
• Let A be an n×n symmetric positive-definite matrix and b an n-dimensional vector, and consider the quadratic function
  f(x) = ½ xᵀAx − bᵀx,   ∇f(x) = Ax − b
• Steepest descent method for minimizing the quadratic f(x) with exact line search:
  Start with an initial point x₀.
  o For i = 0, 1, 2, … until satisfied do:
    Let r_i = b − A x_i
    Calculate λ_i = r_iᵀr_i / r_iᵀA r_i
    x_{i+1} = x_i + λ_i r_i
  Stopping conditions:
  o The gradient is sufficiently small: ‖∇f(x_i)‖ = ‖r_i‖ ≤ ε
  o The maximum number of iterations is reached.

Example 4.
Multiple linear regression problem:
• Let A be an m×n matrix of independent/explanatory variables in a data set and b an m-dimensional vector of the dependent/target variable. We aim to find an n-dimensional weight vector x such that the sum of squared errors is minimized:
  minimize E(x) = ½ ‖Ax − b‖²
• The gradient of the error function E(x) is
  ∇E(x) = (Ax − b)ᵀA
• We can apply the gradient descent algorithm to find x.
• Alternatively, setting the gradient to 0, we can solve for x:
  x = (AᵀA)⁻¹Aᵀb

1. Unconstrained optimization: gradient descent

Gradient descent with momentum:
• An additional term (momentum) is added to remember what happened in the previous iteration:
  x_{i+1} = x_i − λ_i ∇f(x_i) + α(x_i − x_{i−1})
  where α > 0 is the momentum parameter and (x_i − x_{i−1}) is the movement in the previous iteration.
  (Figure: the movement in the previous iteration, scaled by α, is added to the plain gradient step at x_i to reach x_{i+1}.)

Example 5. Minimize f(x₁,x₂) = (x₁ − 2)² + (x₂ − 4)²
• Start from x₀ = [0; 6] with λ = 0.25 and α = 0.5.
• ∇f(x) = [2(x₁ − 2); 2(x₂ − 4)], so ∇f(x₀) = [−4; 4].
• First step, with no momentum:
  x₁ = x₀ − λ ∇f(x₀) = [0; 6] − 0.25 [−4; 4] = [1; 5]
• Second step with momentum:
  x₂ = x₁ − λ ∇f(x₁) + α(x₁ − x₀) = [1; 5] − 0.25 [−2; 2] + 0.5 [1; −1] = [2; 4]
• For comparison, without momentum:
  x₂ = x₁ − λ ∇f(x₁) = [1; 5] − 0.25 [−2; 2] = [1.5; 4.5]
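A minimal R sketch of the momentum update of Example 5 (the iteration count is an assumption; the first pass has no previous movement, so the first step matches plain gradient descent):

  # Gradient descent with momentum for f(x1,x2) = (x1-2)^2 + (x2-4)^2.
  grad <- function(x) c(2*(x[1] - 2), 2*(x[2] - 4))

  lambda <- 0.25; alpha <- 0.5
  x_prev <- c(0, 6); x <- c(0, 6)      # start at x0 with no previous movement
  for (i in 1:50) {
    x_new  <- x - lambda * grad(x) + alpha * (x - x_prev)   # momentum update
    x_prev <- x
    x      <- x_new
  }
  x   # converges toward the minimizer (2, 4)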
1. Unconstrained optimization: gradient descent

Steepest descent direction.
• Recall that the Taylor series is a representation of a function f as an infinite sum of terms determined using the derivatives of f evaluated at x₀.
• The Taylor polynomial of degree n of f: ℝ → ℝ at x₀ is defined as
  T_n(x) := Σ_{k=0}^{n} [f^(k)(x₀)/k!] (x − x₀)^k
  where f^(k)(x₀) is the k-th derivative of f at x₀ and f^(k)(x₀)/k! are the coefficients of the polynomial.
• For a function f: ℝⁿ → ℝ, the linear and quadratic approximations of f(x) at x_k may be written as
  o f_k(x) = f(x_k) + ∇f(x_k)ᵀ(x − x_k)
  o f_k(x) = f(x_k) + ∇f(x_k)ᵀ(x − x_k) + ½ (x − x_k)ᵀ H_k (x − x_k)
  where ∇f(x_k) is the gradient and H_k is the Hessian of the function f(x) at x_k.

• Using the linear approximation of f(x) at x_k,
  f_k(x) = f(x_k) + ∇f(x_k)ᵀ(x − x_k) ⇔ f_k(x_k + λd) = f(x_k) + ∇f(x_k)ᵀ(λd) = f(x_k) + λ ∇f(x_k)ᵀd
• A vector d is called a direction of descent of a function f at x if there exists λ > 0 such that
  f(x + λd) < f(x)
• If f is differentiable at x with a nonzero gradient, then d = −∇f(x)/‖∇f(x)‖ is the direction of steepest descent.
• Consider the problem: minimize ∇f(x_k)ᵀd subject to ‖d‖ ≤ 1. The solution is d* = −∇f(x_k)/‖∇f(x_k)‖.

1. Unconstrained optimization: gradient descent

Steepest descent method.
 Initialization step. Let ε > 0 be a termination scalar. Choose a starting point x₀. Let k = 0.
 Main step.
  o If ‖∇f(x_k)‖ ≤ ε, stop.
  o Otherwise, let d_k = −∇f(x_k) and let λ be the optimal solution of the problem: minimize f(x_k + λ d_k) subject to λ ≥ 0.
  o Update x_{k+1} = x_k + λ d_k.
  o Replace k by k + 1, and repeat the main step.

Newton's method.
• For a function f: ℝⁿ → ℝ, the quadratic approximation of f(x) at x_k may be written as
  f_k(x) = f(x_k) + ∇f(x_k)ᵀ(x − x_k) + ½ (x − x_k)ᵀ H_k (x − x_k)
  where ∇f(x_k) is the gradient and H_k is the Hessian of the function f(x) at x_k.
• If the Hessian H_k is invertible, the minimum of f_k(x) is obtained where its gradient is zero:
  ∇f(x_k) + H_k(x − x_k) = 0 ⇔ x − x_k = −H_k⁻¹ ∇f(x_k)
 Initialization step. Let ε > 0 be a termination scalar. Choose a starting point x₀. Let k = 0.
 Main step.
  o If ‖∇f(x_k)‖ ≤ ε, stop. (See page 9 for the case n = 1.)
  o Otherwise, update x_{k+1} = x_k − H_k⁻¹ ∇f(x_k).
  o Replace k by k + 1, and repeat the main step.
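For comparison with the steepest descent loop shown earlier, here is a minimal R sketch of a Newton step on the quadratic of Example 1 of this section, f(x,y) = x² + 2y² − 4x + xy + y; because the function is quadratic, a single step lands on the minimizer:

  # One multivariate Newton step: x <- x - H^{-1} grad(x).
  grad <- function(v) c(2*v[1] + v[2] - 4, v[1] + 4*v[2] + 1)
  H    <- matrix(c(2, 1, 1, 4), nrow = 2)    # constant Hessian of f

  x <- c(0, 0)
  x <- x - solve(H, grad(x))     # Newton update
  x                              # the minimizer (17/7, -6/7)
  grad(x)                        # ~ (0, 0)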
1. Unconstrained optimization: gradient descent

Stochastic gradient descent
• Stochastic gradient descent is a stochastic approximation of the gradient descent method for minimizing an objective function that is written as a sum of differentiable functions.
• Given n = 1, 2, …, N data points, let L_n be the loss incurred by sample n.
• The total loss for N data points is
  L(θ) = Σ_{n=1}^{N} L_n(θ), where θ is the vector of parameters of interest.
• For example, the least-squares loss L_n(θ) = (y_n − f(x_n, θ))² in regression, given the independent variables x_n and target variable y_n.
• Standard gradient descent is a batch optimization method performed using the full training set, updating the parameter vector as
  θ_{i+1} = θ_i − γ_i ∇L(θ_i) = θ_i − γ_i Σ_{n=1}^{N} ∇L_n(θ_i)
• When the training set is large and/or no simple formulas exist, evaluating the sum of the gradients becomes very expensive.
• Stochastic approach: we do not compute the gradient precisely, but only an approximation of it.
• Mini-batch gradient update: randomly choose a subset of the losses L_n for each update.
• In the extreme case, a single sample is randomly selected and the weights/parameters of the model are then updated from it alone.
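A minimal R sketch of mini-batch stochastic gradient descent for least squares (the data, batch size and learning rate are invented for illustration):

  # Mini-batch SGD for linear least squares.
  set.seed(42)
  N <- 1000
  X <- cbind(1, rnorm(N))                       # design matrix with intercept column
  y <- X %*% c(1, 3) + rnorm(N, sd = 0.5)       # "true" weights (1, 3) plus noise

  theta <- c(0, 0); gamma <- 0.01; batch <- 32
  for (i in 1:2000) {
    idx  <- sample(N, batch)                    # random mini-batch
    err  <- X[idx, ] %*% theta - y[idx]         # residuals on the batch
    grad <- t(X[idx, ]) %*% err / batch         # approximate gradient of the squared-error loss
    theta <- theta - gamma * as.numeric(grad)
  }
  theta     # close to (1, 3)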

2. Constrained optimization and Lagrange multipliers

• Consider the problem
  min_x f(x)
  subject to g_i(x) ≤ 0 for all i = 1, 2, …, m
  where the function f: ℝⁿ → ℝ is real-valued and, similarly, g_i: ℝⁿ → ℝ.
• It is assumed that the functions f and g_i are convex.
• One way to solve the constrained optimization problem is to convert it to an unconstrained problem:
  minimize J(x) = f(x) + Σ_{i=1}^{m} I(g_i(x)), where I(z) is an infinite step function:
  I(z) = 0 if z ≤ 0, and ∞ otherwise.
• It is still difficult to find the minimum of J(x).

• For the original primal problem above, the Lagrangian associated with the constrained problem is
  L(x,λ) = f(x) + Σ_{i=1}^{m} λ_i g_i(x) = f(x) + λᵀg(x)
  where λ_i ≥ 0 is the Lagrange multiplier corresponding to each inequality constraint g_i(x) ≤ 0.
• Associated with the original primal problem, we have the dual problem:
  max_λ D(λ)
  subject to λ ≥ 0, where λ ∈ ℝᵐ are the dual variables and
  D(λ) = min_x L(x,λ)
2. Constrained optimization and Lagrange multipliers

• Consider the Lagrangian:
  L(x,λ) = f(x) + Σ_{i=1}^{m} λ_i g_i(x)
• Let J(x) = max_{λ ≥ 0} L(x,λ). The Lagrange multipliers are non-negative, λ ≥ 0.
  o When x satisfies the constraint g_i(x) ≤ 0, take λ_i = 0 to maximize λ_i g_i(x).
  o When x violates the constraint, that is g_i(x) > 0, take λ_i = ∞ to maximize λ_i g_i(x).
• The value of J(x) is therefore:
  o f(x) when all the constraints are satisfied, i.e. g_i(x) ≤ 0 for i = 1, 2, …, m
  o ∞ when at least one of the constraints is not satisfied.
• The original constrained problem
  min_x f(x) subject to g_i(x) ≤ 0 for all i = 1, 2, …, m
  is now min_x J(x), i.e. it becomes the minmax problem
  min_x J(x) = min_x max_{λ ≥ 0} L(x,λ)
• Let us compare the solutions of the above minmax problem and the maxmin problem max_{λ ≥ 0} min_x L(x,λ).
• We have L(x,λ) = f(x) + Σ_{i=1}^{m} λ_i g_i(x) ≤ f(x) for all feasible x and λ ≥ 0.
• Hence f(x) ≥ L(x,λ) ≥ min_x L(x,λ) = D(λ) for all feasible x and λ ≥ 0.
• We conclude: min_x max_{λ ≥ 0} L(x,λ) ≥ max_{λ ≥ 0} min_x L(x,λ) ⇨ primal objective ≥ dual objective.

2. Constrained optimization and Lagrange multipliers

• Weak duality: min_x max_{λ ≥ 0} L(x,λ) ≥ max_{λ ≥ 0} min_x L(x,λ)
  Given x ∈ ℝⁿ that is feasible for the primal problem and λ ≥ 0, the primal objective value f(x) is greater than or equal to the dual objective value D(λ).
• If we find the tightest possible lower bound on the primal objective by finding the optimal λ ≥ 0 that maximizes D(λ), we have strong duality:
  min_x max_{λ ≥ 0} L(x,λ) = max_{λ ≥ 0} min_x L(x,λ)    Recall: D(λ) = min_x L(x,λ)
• In contrast to the primal problem, which is constrained, the problem min_x L(x,λ) is unconstrained for a given λ: the dual problem provides an alternative route to solving the primal problem if we can solve min_x L(x,λ) easily.

Example 1 (Linear programming):
• Consider the constrained optimization problem in which the objective function and the constraints are all linear:
  Primal LP: min_x cᵀx
  subject to Ax ≤ b
  where c ∈ ℝⁿ, A ∈ ℝ^{m×n}, b ∈ ℝᵐ.
• The Lagrangian is
  L(x,λ) = f(x) + Σ_{i=1}^{m} λ_i g_i(x) = cᵀx + λᵀ(Ax − b) = (c + Aᵀλ)ᵀx − λᵀb
  where λ ∈ ℝᵐ is the vector of non-negative Lagrange multipliers.
• Taking the derivative of L(x,λ) with respect to x and setting it to 0 yields:
  c + Aᵀλ = 0
• Recall min_x L(x,λ) = D(λ); hence D(λ) = −λᵀb when c + Aᵀλ = 0 (and D(λ) = −∞ otherwise, since the linear term is then unbounded below).
2. Constrained optimization and Lagrange multipliers

Example 1 (Linear programming) continued:
  Primal LP: min_x cᵀx
  subject to Ax ≤ b
  where c ∈ ℝⁿ, A ∈ ℝ^{m×n}, b ∈ ℝᵐ

  Dual LP: max_λ −λᵀb
  subject to c + Aᵀλ = 0
             λ ≥ 0, where λ ∈ ℝᵐ

  Equivalently, Dual LP: −min_λ bᵀλ
  subject to Aᵀλ = −c
             λ ≥ 0, where λ ∈ ℝᵐ

BT2103/BT3104:
• Weak duality: min_x max_{λ ≥ 0} L(x,λ) ≥ max_{λ ≥ 0} min_x L(x,λ)
• If x is feasible for the primal LP and λ is feasible for the dual LP, then:
  c + Aᵀλ = 0
  ⇒ xᵀ(c + Aᵀλ) = 0 ⇒ cᵀx = −xᵀAᵀλ
  Ax ≤ b and λ ≥ 0
  ⇒ λᵀ(Ax) ≤ λᵀb ⇨ −cᵀx ≤ λᵀb ⇨ cᵀx ≥ −λᵀb ⇨ primal objective ≥ dual objective
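A small numeric illustration of weak duality for this LP form (the matrices A, b, c and the feasible points are invented for illustration): any primal-feasible x and dual-feasible λ satisfy cᵀx ≥ −λᵀb.

  # Weak duality check for min c'x s.t. Ax <= b, dual feasibility c + A'lambda = 0, lambda >= 0.
  A <- matrix(c(1, 2,
                3, 1), nrow = 2, byrow = TRUE)
  b     <- c(4, 6)
  c_vec <- c(-1, -2)

  x <- c(1, 1)                          # primal feasible: A %*% x <= b
  lambda <- c(1, 0)                     # dual feasible: lambda >= 0 and c + A'lambda = 0
  c_vec + t(A) %*% lambda               # = (0, 0)
  sum(c_vec * x)                        # primal objective  c'x       = -3
  -sum(lambda * b)                      # dual objective   -lambda'b  = -4 (a lower bound)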

BT2103/BT3104:
• Consider the linear program:
  max_λ −λᵀb
  subject to c + Aᵀλ = 0 ⇔ c + Aᵀλ ≥ 0 and c + Aᵀλ ≤ 0
             λ ≥ 0
• The same LP:
  max_λ −λᵀb
  subject to [−Aᵀ; Aᵀ] λ ≤ [c; −c]
             λ ≥ 0
• The dual of the above LP:
  min cᵀy₁ − cᵀy₂
  subject to [−A  A] [y₁; y₂] ≥ −b ⇔ −A y₁ + A y₂ ≥ −b
             y₁ ≥ 0, y₂ ≥ 0
• Let x = y₁ − y₂. We obtain the following dual LP:
  min cᵀx
  subject to Ax ≤ b
  (When the constraints are equality constraints, the corresponding dual variables are unrestricted in sign: they can be 0, positive or negative.)

BT2103/BT3104:
• Suppose we have the linear program:
  min cᵀx ⇔ −max −cᵀx
  subject to Ax ≤ b, x ∈ ℝⁿ
• Consider now the equivalent LP:
  −max −cᵀ(x₁ − x₂)
  subject to A(x₁ − x₂) ≤ b ⇔ [A  −A] [x₁; x₂] ≤ b
             x₁ ≥ 0, x₂ ≥ 0
• The dual of the above LP:
  −min bᵀu
  subject to [Aᵀ; −Aᵀ] u ≥ [−c; c]
             u ≥ 0
• We obtain the following dual LP:
  −min bᵀu ⇔ max −bᵀu
  subject to Aᵀu = −c
             u ≥ 0
  (When the variables are unrestricted in sign, the corresponding dual constraints are equality constraints.)
2. Constrained optimization and Lagrange multipliers

Example 1 (Linear programming) continued:
  Primal LP: min_x cᵀx subject to Ax ≤ b
  Dual LP: max_λ −λᵀb subject to c + Aᵀλ = 0, λ ≥ 0, where λ ∈ ℝᵐ
• Strong duality: Suppose x* is feasible for the primal LP, λ* is feasible for the dual LP and cᵀx* = −λ*ᵀb. Then x* is the solution of the primal LP and λ* is the solution of the dual LP:
  o Let x be any point feasible for the primal LP; by weak duality cᵀx ≥ −λ*ᵀb = cᵀx*, so x* must be the solution of the primal LP.
  o Let λ be any point feasible for the dual LP; by weak duality cᵀx* ≥ −λᵀb, that is, −λ*ᵀb ≥ −λᵀb, so λ* must be the solution of the dual LP.

Example 2 (Quadratic programming):
• Consider the quadratic optimization problem
  min_x ½ xᵀQx + cᵀx
  subject to Ax ≤ b
  where c ∈ ℝⁿ, A ∈ ℝ^{m×n}, b ∈ ℝᵐ and Q ∈ ℝ^{n×n} is a symmetric positive definite matrix.
  o The Lagrangian is given by
    L(x,λ) = ½ xᵀQx + cᵀx + λᵀ(Ax − b) = ½ xᵀQx + (c + Aᵀλ)ᵀx − λᵀb
  o The derivative of the Lagrangian with respect to x is ∇_x L(x,λ) = Qx + (c + Aᵀλ).
  o Setting the derivative to 0 and solving for x, we obtain x = −Q⁻¹(c + Aᵀλ).
  o Hence L(x,λ) = ½ xᵀQx + (c + Aᵀλ)ᵀx − λᵀb = −½ (c + Aᵀλ)ᵀQ⁻¹(c + Aᵀλ) − λᵀb.

2. Constrained optimization and Lagrange multipliers

Example 2 (Quadratic programming) continued:
• Consider the primal quadratic optimization problem:
  min_x ½ xᵀQx + cᵀx
  subject to Ax ≤ b
  where c ∈ ℝⁿ, A ∈ ℝ^{m×n}, b ∈ ℝᵐ and Q ∈ ℝ^{n×n} is a symmetric positive definite matrix.
• Its dual is
  max_λ −½ (c + Aᵀλ)ᵀQ⁻¹(c + Aᵀλ) − λᵀb
  subject to λ ≥ 0, where λ ∈ ℝᵐ
  (This form of the dual reappears in Support Vector Machines.)

Example 3 (General non-linear optimization):
• Consider the primal optimization problem:
  min_x f(x)
  subject to g_i(x) ≤ 0 for all i = 1, 2, …, m.
• Its dual is
  max_λ D(λ)
  subject to λ ≥ 0, where D(λ) = min_x L(x,λ) = min_x [f(x) + λᵀg(x)]
  ⇅
  max_{λ,x} f(x) + λᵀg(x)
  subject to ∇_x f(x) + λᵀ∇_x g(x) = 0
             λ ≥ 0
2. Constrained optimization and Lagrange multipliers

Remark 1:
• Problem with equality constraints:
  min_x f(x)
  subject to g_i(x) ≤ 0 for all i = 1, 2, …, m
             h_j(x) = 0 for all j = 1, 2, …, k
• Convert each equality constraint into 2 inequality constraints:
  min_x f(x)
  subject to g_i(x) ≤ 0 for all i = 1, 2, …, m
             h_j(x) ≤ 0 for all j = 1, 2, …, k
             h_j(x) ≥ 0 ⇔ −h_j(x) ≤ 0 for all j = 1, 2, …, k

Remark 2:
• Problem with a maximizing objective function:
  max_x f(x)
  subject to g_i(x) ≤ 0 for all i = 1, 2, …, m
• Convert it into the equivalent minimizing problem:
  −min_x −f(x)
  subject to g_i(x) ≤ 0 for all i = 1, 2, …, m

2. Constrained optimization and Lagrange multipliers

Remark 3:
• Problem with non-negative variables:
  min_x f(x)
  subject to g_i(x) ≤ 0 for all i = 1, 2, …, m
             x ≥ 0, x ∈ ℝⁿ
• Consider the equivalent problem:
  min_x f(x)
  subject to G_i(x) ≤ 0 for all i = 1, 2, …, m + n
  where G_i(x) = g_i(x) for i = 1, 2, …, m, and G_i(x) = −x_j for i = m+1, …, m+n with j = i − m.

Karush-Kuhn-Tucker (KKT) optimality conditions:
• Given the nonlinear optimization problem:
  NLP: min_x f(x)
  subject to g_i(x) ≤ 0 for all i = 1, 2, …, m
• Necessary conditions:
  If x is an optimal solution, then there must exist multipliers λ₁, λ₂, …, λ_m such that
  o Primal feasibility: g_i(x) ≤ 0 for all i = 1, 2, …, m
  o Dual feasibility: λ₁, λ₂, …, λ_m ≥ 0
  o Stationarity: ∇_x f(x) + λᵀ∇_x g(x) = 0 ⇔ ∇_x f(x) + Σ_{i=1}^{m} λ_i ∇_x g_i(x) = 0
  o Complementary slackness: λ_i g_i(x) = 0 for all i = 1, 2, …, m
2. Constrained optimization and Lagrange multipliers

Karush-Kuhn-Tucker (KKT) optimality conditions:
• Complementary slackness conditions: λ_i × g_i(x) = 0 for all i = 1, 2, …, m
• Note that the feasibility conditions require:
  o Primal feasibility: g_i(x) ≤ 0 for all i = 1, 2, …, m
  o Dual feasibility: λ₁, λ₂, …, λ_m ≥ 0
• When λ_i > 0, then to satisfy the complementary slackness condition: g_i(x) = 0.
• When g_i(x) < 0, then to satisfy the complementary slackness condition: λ_i = 0.
• The constraint g_i(x) ≤ 0 is said to be binding if it is satisfied as an equality: g_i(x) = 0. Otherwise, it is said to be non-binding: g_i(x) < 0.
• If the constraint is not binding at x, then there is a positive slack s_i > 0 such that g_i(x) + s_i = 0.
  Complementary slackness in terms of slacks: s_i ≥ 0, λ_i ≥ 0, s_i × λ_i = 0 for all i = 1, 2, …, m.
• If the multiplier λ_i > 0 ⇨ the constraint must be binding: g_i(x) = 0, s_i = 0.
• If the constraint is not binding, g_i(x) < 0, s_i > 0 ⇨ the multiplier λ_i = 0.

• Given the nonlinear optimization problem:
  NLP: min_x f(x)
  subject to g_i(x) ≤ 0 for all i = 1, 2, …, m
• Sufficient conditions:
  If x and λ satisfy the necessary conditions, and the functions f(x), g₁(x), g₂(x), …, g_m(x) are convex, then x is the optimal solution of NLP.

2. Constrained optimization and Lagrange multipliers

Example 1:
  min Z = x₁ + x₂
  subject to x₁ ≥ 1 ⇒ −x₁ ≤ −1
             x₂ ≥ 1 ⇒ −x₂ ≤ −1
KKT necessary conditions:
  o Primal feasibility: −x₁ ≤ −1 and −x₂ ≤ −1
  o Dual feasibility: λ₁, λ₂ ≥ 0
  o Stationarity: [1; 1] + λ₁[−1; 0] + λ₂[0; −1] = [0; 0]
  o Complementary slackness: λ₁(x₁ − 1) = 0, λ₂(x₂ − 1) = 0
Solution: x₁ = x₂ = 1, λ₁ = λ₂ = 1.
(Figure: contour Z = x₁ + x₂ = c with the constraint gradients ∇g₁(x), ∇g₂(x) and −∇f(x) at the corner (1,1).)
Stationarity means the negative gradient of the objective function is a linear combination of the gradients of the constraints:
  −∇f(x) = λ₁ ∇g₁(x) + λ₂ ∇g₂(x)

Example 2:
  min −x₁(30 − x₁) − x₂(50 − 2x₂) + 3x₁ + 5x₂ + 10x₃
  subject to x₁ + x₂ − x₃ ≤ 0
             x₃ ≤ 17.25
KKT necessary conditions:
  o Primal feasibility: x₁ + x₂ − x₃ ≤ 0 and x₃ ≤ 17.25
  o Dual feasibility: λ₁, λ₂ ≥ 0
  o Stationarity: −30 + 2x₁ + 3 + λ₁ = 0
                  −50 + 4x₂ + 5 + λ₁ = 0
                  10 − λ₁ + λ₂ = 0
  o Complementary slackness: λ₁(x₁ + x₂ − x₃) = 0, λ₂(x₃ − 17.25) = 0, with λ₁, λ₂ ≥ 0
Let x₁ = 8.5, x₂ = 8.75, x₃ = 17.25, λ₁ = 10, λ₂ = 0:
  o Primal feasibility: x₁ + x₂ − x₃ = 0 and x₃ = 17.25
  o Dual feasibility: λ₁, λ₂ ≥ 0
  o Stationarity: −30 + 2x₁ + 3 + λ₁ = −30 + 17 + 3 + 10 = 0
                  −50 + 4x₂ + 5 + λ₁ = −50 + 35 + 5 + 10 = 0
                  10 − λ₁ + λ₂ = 10 − 10 + 0 = 0
  o Complementary slackness: λ₁(x₁ + x₂ − x₃) = 10(0) = 0, λ₂(x₃ − 17.25) = 0
All the necessary optimality conditions are satisfied; since the objective function and the 2 constraints are convex, the sufficient conditions hold and x₁ = 8.5, x₂ = 8.75, x₃ = 17.25 is a solution.
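A minimal R check of the KKT conditions for Example 2 at the point given above:

  # Numeric verification of primal/dual feasibility, stationarity and complementary slackness.
  x <- c(8.5, 8.75, 17.25); lambda <- c(10, 0)

  g <- c(x[1] + x[2] - x[3],  x[3] - 17.25)             # constraint values g_i(x) <= 0
  grad_f <- c(-30 + 2*x[1] + 3, -50 + 4*x[2] + 5, 10)   # gradient of the objective
  grad_g <- cbind(c(1, 1, -1), c(0, 0, 1))              # gradients of g_1, g_2 (as columns)

  all(g <= 1e-12)                          # primal feasibility
  all(lambda >= 0)                         # dual feasibility
  grad_f + grad_g %*% lambda               # stationarity: ~ (0, 0, 0)
  lambda * g                               # complementary slackness: ~ (0, 0)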
2. Constrained optimization and Lagrange multipliers

Example 3: Let x₁ = 2, x₂ = 0, x₃ = 8 and u₁ = 0, u₂ = 10, u₃ = 10, u₄ = 0, u₅ = 5, u₆ = 0
  Primal LP: maximize 60x₁ + 30x₂ + 20x₃
  subject to: 8x₁ + 6x₂ + x₃ ≤ 48
              4x₁ + 2x₂ + 1.5x₃ ≤ 20
              2x₁ + 1.5x₂ + 0.5x₃ ≤ 8
              x₁, x₂, x₃ ≥ 0
  Dual LP: minimize 48u₁ + 20u₂ + 8u₃
  subject to: 8u₁ + 4u₂ + 2u₃ ≥ 60
              6u₁ + 2u₂ + 1.5u₃ ≥ 30
              u₁ + 1.5u₂ + 0.5u₃ ≥ 20
              u₁, u₂, u₃ ≥ 0
  o Primal feasibility: 8x₁ + 6x₂ + x₃ = 24, 4x₁ + 2x₂ + 1.5x₃ = 20, 2x₁ + 1.5x₂ + 0.5x₃ = 8
  o Dual feasibility: u₁, u₂, u₃, u₄, u₅, u₆ ≥ 0
  o Stationarity (for the equivalent minimization of −60x₁ − 30x₂ − 20x₃, with u₄, u₅, u₆ the multipliers of −x₁ ≤ 0, −x₂ ≤ 0, −x₃ ≤ 0):
      −60 + 8u₁ + 4u₂ + 2u₃ − u₄ = 0
      −30 + 6u₁ + 2u₂ + 1.5u₃ − u₅ = 0
      −20 + u₁ + 1.5u₂ + 0.5u₃ − u₆ = 0
  o Complementary slackness: (8x₁ + 6x₂ + x₃ − 48)u₁ = 0
                             (4x₁ + 2x₂ + 1.5x₃ − 20)u₂ = 0
                             (2x₁ + 1.5x₂ + 0.5x₃ − 8)u₃ = 0
                             x₁u₄ = x₂u₅ = x₃u₆ = 0
  Hence x₁ = 2, x₂ = 0, x₃ = 8 solves the primal LP, and u₁ = 0, u₂ = 10, u₃ = 10 solves the dual LP, with cᵀx = bᵀu = 280.

Example 4:
  NLP: minimize 2x² + y² − xy − 8x − 3y
  subject to: 3x + y = 10
• Necessary conditions:
  o Primal feasibility: gᵢ(x) ≤ 0, i.e. 3x + y − 10 ≤ 0 and −3x − y + 10 ≤ 0
  o Dual feasibility: u₁, u₂ ≥ 0
  o Stationarity: ∇_x f(x) + Σᵢ uᵢ ∇_x gᵢ(x) = 0, i.e.
      4x − y − 8 − 3u₁ + 3u₂ = 0
      2y − x − 3 − u₁ + u₂ = 0
  o Complementary slackness: uᵢ gᵢ(x) = 0 for i = 1, 2
  o x = 69/28, y = 73/28, u₁ = 0, u₂ = ¼ satisfy all the above conditions.
• Sufficient conditions: the objective function is convex and the constraint is linear, hence x = 69/28, y = 73/28 is the solution of the NLP.

2. Constrained optimization and Lagrange multipliers

Example 5 (Linear programming):
• Consider the linear problem with equality constraints and non-negative variables:
  min_x cᵀx
  subject to Ax = b ⇒ Ax − b ≤ 0, −Ax + b ≤ 0
             x ≥ 0 ⇒ −x ≤ 0
  where c ∈ ℝⁿ, A ∈ ℝ^{m×n}, b ∈ ℝᵐ.
  o The Lagrangian is given by
    L(x,λ,κ,φ) = cᵀx + λᵀ(Ax − b) + κᵀ(−Ax + b) − φᵀx
               = (c + Aᵀλ − Aᵀκ − φ)ᵀx − (λ − κ)ᵀb
  o The derivative of the Lagrangian with respect to x is ∇_x L = c + Aᵀλ − Aᵀκ − φ.
  o Setting the derivative to 0, we obtain L = −(λ − κ)ᵀb.
• Its dual is
  max_{λ,κ,φ} −(λ − κ)ᵀb
  subject to c + Aᵀλ − Aᵀκ − φ = 0
             λ, κ, φ ≥ 0, where λ, κ ∈ ℝᵐ and φ ∈ ℝⁿ
• Let u = −(λ − κ); then the dual problem becomes
  max_{u,φ} bᵀu
  subject to c − Aᵀu − φ = 0
             φ ≥ 0, where u ∈ ℝᵐ and φ ∈ ℝⁿ
  or equivalently
  max_u bᵀu
  subject to Aᵀu ≤ c, where u ∈ ℝᵐ.
ERROR BASED LEARNING: REGRESSION

Outline:

1. Fundamentals of regression problems
2. Gradient descent
3. Matrix approach and statistical analysis
4. Extension and variations
5. Maximum likelihood estimation
6. Maximum a posteriori estimation
7. Logistic regression

1. Fundamentals of regression models

Big ideas:
• A parameterized prediction model is initialized with a set of random parameters, and an error function is used to judge how well this initial model performs when making predictions for instances in a training dataset.
• Based on the value of the error function, the parameters are iteratively adjusted to create a more and more accurate model.
• A simple dataset includes rental prices and a number of descriptive features for 10 Dublin city-center offices:

  ID  Size  Floor  Broadband rate  Energy rating  Rental price
   1   500      4               8              C           320
   2   550      7              50              A           380
   3   620      9               7              A           400
   4   630      5              24              B           390
   5   665      8             100              C           385
   6   700      4               8              B           410
   7   770     10               7              B           480
   8   880     12              50              A           600
   9   920     14               8              C           570
  10  1000      9              24              B           620

• A mathematical model that captures the relationship between two variables x and y:
  y = b + mx
• We can use this model to capture the relationship between two features such as Size and Rental price.
• Intercept = b = 6.47
• Slope = m = 0.62: for every increase of one square foot in size, the rental price increases by 0.62 Euros.
• A 730 square foot office is predicted to be rented for 6.47 + 0.62 × 730 = 459.07 Euros per month.
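A minimal R sketch reproducing the simple Size-vs-Rental-price model with lm() (the data frame name offices is mine; the values are the table above):

  # The office rental dataset and a simple linear regression of rental on size.
  offices <- data.frame(
    size      = c(500, 550, 620, 630, 665, 700, 770, 880, 920, 1000),
    floor     = c(4, 7, 9, 5, 8, 4, 10, 12, 14, 9),
    broadband = c(8, 50, 7, 24, 100, 8, 7, 50, 8, 24),
    energy    = c("C", "A", "A", "B", "C", "B", "B", "A", "C", "B"),
    rental    = c(320, 380, 400, 390, 385, 410, 480, 600, 570, 620)
  )

  simple_fit <- lm(rental ~ size, data = offices)
  coef(simple_fit)                               # intercept ~ 6.47, slope ~ 0.62
  predict(simple_fit, data.frame(size = 730))    # ~ 459 Euros per month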
1. Fundamentals of regression models

• The simple regression model is written as
  M_w(d) = w[0] + w[1] × d[1]
  where
  o w is the weight vector; the parameters w[0] and w[1] are referred to as weights
  o d is an instance defined by a single descriptive feature d[1]
  o M_w(d) is the prediction output by the model for the instance d.
• A set of weights that captures the relationship between the two features well is said to fit the training data.
• To find the optimal set of weights, we need to measure the errors in prediction: e_i = t_i − M_w(d_i).
• The sum of squared errors function:
  L₂(M_w, D) = ½ Σ_{i=1}^{n} e_i² = ½ Σ_{i=1}^{n} (t_i − (w[0] + w[1] × d_i[1]))²
  where the optimal weights are [w[0]; w[1]] = [6.47; 0.62].
• Least squares optimization: find the global minimum of the error surface to obtain the weights of the model that best fits the training dataset.
• Note that the function L₂(M_w, D)
  o is convex
  o has a global minimum (because the function is convex).

1. Fundamentals of regression models

• (Figure: a 3D surface plot and a bird's-eye-view contour plot of the error function
  L₂(M_w, D) = ½ Σ_{i=1}^{n} (t_i − (w[0] + w[1] × d_i[1]))².)
• Each pair of weights w[0] and w[1] defines a point of the x-y plane known as the weight space.
• The sum of squared errors function determines the height of the error surface.
• The model that best fits the training data corresponds to the lowest point on the error surface.
• Compute the partial derivatives with respect to the weights and set them to 0:
  ∂L₂(M_w, D)/∂w[0] = 0,   ∂L₂(M_w, D)/∂w[1] = 0
  to find the optimal weights.
• The vector of partial derivatives is called the gradient of L₂(M_w, D) with respect to the weight vector w:
  ∇L₂(M_w, D) = [∂L₂(M_w, D)/∂w[0]; ∂L₂(M_w, D)/∂w[1]]
1. Fundamentals of regression models

• Example 1. Error surface and contour plot of f(w₀,w₁) = 5(w₁ − w₀)² + (2w₁ + w₀ + 3)²
  ∇f(w₀,w₁) = [∂f/∂w₀; ∂f/∂w₁] = [5(2)(−1)(w₁ − w₀) + 2(2w₁ + w₀ + 3); 5(2)(w₁ − w₀) + 2(2)(2w₁ + w₀ + 3)]
             = [18; 36] for w₀ = 2, w₁ = 2

2. Gradient descent

• There are different ways to find a point in the weight space where the derivatives equal zero.
• One way is to apply a guided search algorithm known as gradient descent.
  (The gradient descent pseudocode on the slide updates each weight by a quantity errorDelta(D, w[j]); what this function is will be made precise on the following slides.)

2. Gradient descent

• The Multivariate Linear Regression model is defined as:
  M_w(d) = w[0] + w[1] × d[1] + … + w[m] × d[m]
         = w[0] + Σ_{j=1}^{m} w[j] × d[j]
         = Σ_{j=0}^{m} w[j] × d[j]
         = wᵀd = w·d   (let d[0] = 1 for all data samples)
• The sum of squared errors loss function:
  L₂(M_w, D) = ½ Σ_{i=1}^{n} e_i²
             = ½ Σ_{i=1}^{n} (t_i − (w[0] + w[1] × d_i[1] + … + w[m] × d_i[m]))²
             = ½ Σ_{i=1}^{n} (t_i − wᵀd_i)²
• To compute the gradient of the loss function, apply the chain rule: first take the derivative of the square function, then take the derivative of (t_i − wᵀd_i) with respect to the weight w[j]. The j-th component of the gradient is
  ∂L₂(M_w, D)/∂w[j] = Σ_{i=1}^{n} (t_i − wᵀd_i) × (−d_i[j]),   with d_i[0] = 1 for all i.

• Weight adjustment is proportional to the gradient:
  w[j] ← w[j] + α errorDelta(D, w[j])
  where errorDelta(D, w[j]) is the negative of the gradient component. Hence,
  w[j] ← w[j] + α (−1) [Σ_{i=1}^{n} (t_i − wᵀd_i) × (−d_i[j])]
  or
  w[j] ← w[j] + α [Σ_{i=1}^{n} (t_i − wᵀd_i) × d_i[j]]
  where α > 0 is a constant learning rate.
• Let the error for data instance i be e_i = t_i − wᵀd_i.
• If the prediction is too low compared to the actual target value, the error e_i will be positive; if d_i[j] is positive, then w[j] will be updated to a new higher value.
• If the prediction is too high compared to the actual target value, the error e_i will be negative; if d_i[j] is positive, then w[j] will be updated to a new lower value.
2. Gradient descent

• Consider the model that includes one descriptive feature x and a target feature y:
  y = w[0] + w[1] x
• (Figure: error-surface contour plots; the lines indicate the paths the gradient descent algorithm would take across the error surface from 4 different starting points to the global minimum.)
• (Figure: a sequence of simple linear regression models developed by the gradient descent approach; the bottom-right panel shows the sum of squared errors.)

2. Gradient descent

• Choosing the learning rate (figures for a: α = 0.002, b: α = 0.08, c: α = 0.18).
• Rule of thumb: choose a learning rate α in the range [0.000001, 10].
• Choose the initial weights randomly in the range [−0.2, 0.2].

Example.
• Rental price = w[0] + w[1] × Size + w[2] × Floor + w[3] × Broadband rate
  (Energy rating is not included in the model for now.)
• Initial weights: [w[0]; w[1]; w[2]; w[3]] = [−0.146; 0.185; −0.444; 0.119]
• Sample ID #1: [d[0]; d[1]; d[2]; d[3]] = [1; 500; 4; 8]
• Learning rate: α = 0.00000002, number of data samples n = 10.
2. Gradient descent

Example (continued).
• Update rule: w[j] ← w[j] + α [Σ_{i=1}^{n} (t_i − wᵀd_i) × d_i[j]]
• When i = 1:
  o Error = t₁ − wᵀd₁ = 320 − 93.26 = 226.74
  o d₁[0] = 1, contribution to errorDelta(D, w[0]): 226.74 × 1 = 226.74
  o d₁[1] = 500, contribution to errorDelta(D, w[1]): 226.74 × 500 = 113370.05
• Batch update: the contributions from all 10 instances are summed before the weights are updated (table of per-instance contributions shown on the slide).
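A minimal R sketch of this batch update rule on the office data (the offices data frame comes from the earlier sketch; the initial weights and learning rate are the ones quoted on the slide, so intermediate numbers may differ slightly from the slide's rounded figures):

  # Batch gradient descent: w[j] <- w[j] + alpha * sum((t_i - w'd_i) * d_i[j]).
  X      <- as.matrix(cbind(1, offices[, c("size", "floor", "broadband")]))
  target <- offices$rental

  w     <- c(-0.146, 0.185, -0.444, 0.119)   # initial weights quoted on the slide
  alpha <- 0.00000002

  for (iter in 1:5) {
    err <- target - as.numeric(X %*% w)              # errors for all 10 instances
    cat("iteration", iter, ": 0.5 * SSE =", 0.5 * sum(err^2), "\n")
    w <- w + alpha * as.numeric(t(X) %*% err)        # update all weights at once
  }
  w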

2. Gradient descent

Example (continued).
• Batch update:
  Iteration 1: ½ × Sum of Squared Errors = 533785.80
  Iteration 2: ½ × Sum of Squared Errors = 443361.52

3. Matrix approach and statistical analysis

• The linear regression model can be expressed as:
  Y = Xβ + e
  where Y is an n×1 vector of the target feature, X is an n×p matrix of descriptive features, β is a p×1 vector of parameters (weights), e is an n×1 vector of errors, n is the number of instances and p is the number of parameters.
• The least squares estimator for β minimizes
  S(β) = ½ (e₁² + e₂² + … + e_n²) = ½ eᵀe = ½ (Y − Xβ)ᵀ(Y − Xβ)
• Take the derivative of S(β) and set it to 0:
  −Xᵀ(Y − Xβ) = 0
• We have XᵀXβ = XᵀY, and
  β = (XᵀX)⁻¹XᵀY   if (XᵀX)⁻¹ exists.
3. Matrix approach and statistical analysis

      1  500  4   8            320
      1  550  7  50            380
      1  620  9   7            400
      1  630  5  24            390
  X = 1  665  8 100        Y = 385
      1  700  4   8            410
      1  770 10   7            480
      1  880 12  50            600
      1  920 14   8            570
      1 1000  9  24            620

          10    7235    82    286              4555
  XᵀX = 7235 5479725 62840 203810     XᵀY = 3447725
          82   62840   772   2395             39770
         286  203810  2395  16442            128300

                     19.5615       w[0] = 19.5616
  β = (XᵀX)⁻¹XᵀY =    0.5487       w[1] = 0.5487
                      4.9635       w[2] = 4.9635
                     −0.0621       w[3] = −0.0621

• Linear function: f(x, β) = βᵀx
• Affine function: f(x, β) = β₀ + βᵀx, where β₀ is a constant.
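A minimal R sketch reproducing the matrix computation above and the corresponding lm() fit (offices data frame from the earlier sketch):

  # Normal equations on the office data.
  X <- as.matrix(cbind(1, offices[, c("size", "floor", "broadband")]))
  Y <- offices$rental

  t(X) %*% X                              # the X'X matrix shown above
  beta <- solve(t(X) %*% X, t(X) %*% Y)   # beta = (X'X)^{-1} X'Y
  beta                                    # ~ (19.56, 0.549, 4.96, -0.062)

  summary(lm(rental ~ size + floor + broadband, data = offices))  # same estimates, R-squared ~ 0.955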

3. Matrix approach and statistical analysis

Interpreting the multivariate linear regression model.
Output summary from R:

  Coefficients:
               Estimate Std. Error t value Pr(>|t|)
  (Intercept) 19.56156   43.64459   0.448 0.669738
  size         0.54874    0.07945   6.907 0.000455 ***
  floor        4.96355    3.93842   1.260 0.254361
  broadband   -0.06210    0.30486  -0.204 0.845332
  ---
  Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

  Residual standard error: 27.34 on 6 degrees of freedom
  Multiple R-squared: 0.9552, Adjusted R-squared: 0.9328
  F-statistic: 42.65 on 3 and 6 DF, p-value: 0.0001932

• The parameter estimate (weight) for Size is 0.54874: for every square foot increase in office size, we can expect the rental price to go up by 0.5487 Euros (provided all the other variables remain the same).
• Its corresponding std. error is 0.07945, so the t-value = 0.54874/0.07945 = 6.907.
• The observed p-value Pr > |t| is the probability of observing a value of the t statistic that is at least as contradictory to the null hypothesis as the one actually computed from the data.
• Degrees of freedom for the t-distribution = n − p − 1 = 10 − 3 − 1 = 6.
• For Size the p-value is 0.000455, so we can reject the null hypothesis H₀: β₁ = 0 at the 5% level of significance and conclude that the weight for Size, β₁, is different from 0.
• Of the 3 descriptive features, only Size is found to have a significant impact in the model based on the t-statistic.
3. Matrix approach and statistical analysis

• Analysis of Variance (ANOVA) table (Excel summary output):

  Regression Statistics
  Multiple R          0.977348054
  R Square            0.955209219
  Adjusted R Square   0.932813829
  Standard Error      27.3391202
  Observations        10

  ANOVA
               df        SS          MS          F       Significance F
  Regression    3   95637.935   31879.311   42.6520     0.000193
  Residual      6    4484.565     747.427
  Total         9  100122.5

  RegressionSS + ResidualSS = TotalSS

• DF Regression = 3: there are 3 independent variables in this model (Size, Floor, Broadband rate).
• DF Total = n − 1 = 9, because Σ_{i=1}^{10} (y_i − ȳ) = 0: knowing 9 of the differences, we know the value of the 10th. Here ȳ is the average of the y_i.
• The value "Significance F" is the p-value for the hypothesis test H₀: β = 0 versus Hₐ: β ≠ 0.
• Since the p-value is very small (= 0.000193), we reject the null hypothesis that β = 0 and conclude that the model is useful. Note: the degrees of freedom for the F distribution are (p, n − p − 1) = (3, 6).
• When there are p independent variables:
  H₀: β₁ = β₂ = … = β_p = 0 versus Hₐ: at least one β_i ≠ 0, i = 1, 2, …, p.

3. Matrix approach and statistical analysis

• RegressionSS + ResidualSS = TotalSS
  (explained variation + unexplained variation = total variation)
• R Square = RegressionSS/TotalSS = 1 − ResidualSS/TotalSS
• R² near zero indicates that very little of the variability in y_i is explained by the linear relationship of Y with X.
• R² near 1 indicates that almost all of the variability in y_i is explained by the linear relationship of Y with X.
• R² is known as the coefficient of determination or multiple R-square. In a simple regression model, R² equals the square of r, the correlation coefficient between X and Y.
• RegressionSS + ResidualSS = TotalSS holds only for the training data.

Validation of a regression model:
• It is assumed that the functional relationship f: ℝⁿ → ℝ is linear, Y = f(X₁, X₂, …, X_n), when we build a multiple linear regression model Y = Xβ + e.
• Assumptions in the model:
  o E(e) = 0
  o Var(e) = Iσ²
  o e₁, e₂, e₃, …, e_n are such that each e_i is normally distributed with mean zero and standard deviation σ.
• Equivalently:
  o The linear relationship is correct
  o The instances are independent of each other
  o The error e (equivalently, the dependent variable Y) follows a normal distribution
  o The variability of Y given X is the same regardless of X (homoscedasticity).
3. Matrix approach and statistical analysis

Validation of a regression model:
• (Figures: the error plot, the prediction plot and the error histogram for the Rental Price regression model with w[0] = 19.5616, w[1] = 0.5487, w[2] = 4.9635 and w[3] = −0.0621.)
• (Figure: an example of errors from a regression that violate the model assumptions.)

4. Extension and variations

Setting the learning rate with weight decay
• A large learning rate α may cause the weights to jump back and forth across the error surface.
• A small value, on the other hand, results in slow convergence.
• Use a weight decay strategy:
  α_τ = α₀ · c / (c + τ)
  where α₀ is the initial learning rate, c is a constant that controls how fast the rate decays, and τ is the current iteration of the gradient descent algorithm.
• (Figures: the office rental dataset trained with α₀ = 0.18, c = 10 and with α₀ = 0.25, c = 100.)
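A minimal R sketch of this decay schedule (the values of α₀ and c are taken from the slide's first figure):

  # Decaying learning rate alpha_tau = alpha0 * c / (c + tau).
  alpha0 <- 0.18; c_decay <- 10
  alpha_tau <- function(tau) alpha0 * c_decay / (c_decay + tau)
  alpha_tau(0:5)     # the step size shrinks as iterations progress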
4. Extension and variations

Handling categorical descriptive features
• Multiplying the value of Energy rating by a weight is not sensible.
• Transform the dataset by replacing Energy rating with indicator columns:
  o Energy rating A = 1 if the original Energy rating is A
  o Energy rating B = 1 if the original Energy rating is B
  o Energy rating C = 1 if the original Energy rating is C
• The three columns are linearly dependent: ERA + ERB + ERC = 1 ⇨ (XᵀX)⁻¹ does not exist.

  lm(formula = rental ~ size + floor + broadband + EratingA + EratingB + EratingC)
  Output:
  Coefficients: (1 not defined because of singularities)
               Estimate Std. Error t value Pr(>|t|)
  (Intercept) -17.01390   36.39712  -0.467   0.6645
  size          0.64315    0.09088   7.077   0.0021 **
  floor         0.01672    4.59138   0.004   0.9973
  broadband    -0.13249    0.24992  -0.530   0.6241
  EratingA     42.09485   17.40362   2.419   0.0729 .
  EratingB     -4.46061   22.89259  -0.195   0.8550
  EratingC           NA         NA      NA       NA
  ---
  Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

  Residual standard error: 20.76 on 4 degrees of freedom
  Multiple R-squared: 0.9828, Adjusted R-squared: 0.9613
  F-statistic: 45.67 on 5 and 4 DF, p-value: 0.001274

4. Extension and variations

Handling categorical descriptive features
• R automatically handles features with categorical values (A, B, C); the value A is used as the 'base' level.

  lm(formula = rental ~ size + floor + broadband + Energy)
  Output:
  Coefficients:
               Estimate Std. Error t value Pr(>|t|)
  (Intercept)  25.08095   34.50268   0.727   0.5075
  size          0.64315    0.09088   7.077   0.0021 **
  floor         0.01672    4.59138   0.004   0.9973
  broadband    -0.13249    0.24992  -0.530   0.6241
  EnergyB     -46.55546   25.51954  -1.824   0.1422
  EnergyC     -42.09485   17.40362  -2.419   0.0729 .
  ---
  Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

  Residual standard error: 20.76 on 4 degrees of freedom
  Multiple R-squared: 0.9828, Adjusted R-squared: 0.9613
  F-statistic: 45.67 on 5 and 4 DF, p-value: 0.001274

• For Energy rating A: Rental = 25.08 + 0.6431 size + 0.0167 floor − 0.1325 broadband
• For Energy rating B: Rental = 25.08 + 0.6431 size + 0.0167 floor − 0.1325 broadband − 46.5555
                              = −21.47 + 0.6431 size + 0.0167 floor − 0.1325 broadband
• For Energy rating C: Rental = 25.08 + 0.6431 size + 0.0167 floor − 0.1325 broadband − 42.09485
                              = −17.01 + 0.6431 size + 0.0167 floor − 0.1325 broadband
• If two office units are identical in size, floor level and broadband rate, the one with an A energy rating is expected to have a rental that is 46.55 Euros per month higher than the one with a B energy rating.
• In general, represent n categorical values by a binary string of size n − 1: [A, B, C] → [(0,0), (1,0), (0,1)].

Handling nonlinear relationships between the descriptive features and the target feature
• Consider a quadratic relationship of the form Y = b + wX + dX².
  Let Z = X²; then we have Y = b + wX + dZ.
• Consider the Cobb-Douglas production function: Q = α C^{β₁} L^{β₂}, where Q is the output, C is capital and L is labor.
  Taking the log of both sides of the equation: log Q = log α + β₁ log C + β₂ log L
  and letting Y = log Q, β₀ = log α, X₁ = log C, X₂ = log L, we have the multiple regression model:
  Y = β₀ + β₁X₁ + β₂X₂
• We can then find the model parameters by the standard linear regression approach.
4. Extension and variations

Ridge regression
• The least squares estimator for β, β = (XᵀX)⁻¹XᵀY, may not exist, or its entries can have very large values.
• Instead of minimizing only the sum of squared errors, we impose the ridge constraint:
  min S′(β): min ½ (Y − Xβ)ᵀ(Y − Xβ)
  subject to Σ_{i=1}^{p} β_i² ≤ t
• The above constrained optimization problem is equivalent to:
  min S″(β): min ½ (Y − Xβ)ᵀ(Y − Xβ) + ½ λ Σ_{i=1}^{p} β_i²
  ⇔ min S″(β): min ½ (Y − Xβ)ᵀ(Y − Xβ) + ½ λ ‖β‖²
  where λ > 0 is the penalty/regularization parameter.
• Take the derivative of S″(β) and set it to 0:
  −Xᵀ(Y − Xβ) + λβ = 0
• We have XᵀXβ + λβ = XᵀY, and
  β_λ = (XᵀX + λI)⁻¹XᵀY, where I is a p×p identity matrix.
• The ridge estimate β_λ depends on the value of λ; select the best value of λ by cross-validation.
• The descriptive features X should be standardized (mean = 0, std = 1) and the target feature Y centered (mean = 0).
• Ridge regression is applied to overcome serious multi-collinearity problems.
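A minimal R sketch of the ridge estimate β_λ = (XᵀX + λI)⁻¹XᵀY on standardized features and a centered target (offices data frame from the earlier sketch; λ = 1 is an arbitrary illustration, not a tuned choice):

  # Ridge estimate on standardized features and centered target.
  Xs <- scale(as.matrix(offices[, c("size", "floor", "broadband")]))  # mean 0, sd 1
  Yc <- offices$rental - mean(offices$rental)                          # centered target

  lambda <- 1
  beta_ridge <- solve(t(Xs) %*% Xs + lambda * diag(ncol(Xs)), t(Xs) %*% Yc)
  beta_ridge    # shrunk coefficients; lambda would normally be chosen by cross-validation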

4. Extension and variations

LASSO regression
• We solve
  min S*(β): min ½ (Y − Xβ)ᵀ(Y − Xβ)
  subject to Σ_{i=1}^{p} |β_i| ≤ t
  or equivalently,
  min S**(β): min ½ (Y − Xβ)ᵀ(Y − Xβ) + ½ λ Σ_{i=1}^{p} |β_i|
  ⇔ min S**(β): min ½ (Y − Xβ)ᵀ(Y − Xβ) + ½ λ ‖β‖₁
  where λ > 0 in LASSO (Least Absolute Shrinkage and Selection Operator) regression.
• When t is sufficiently small (λ is sufficiently large), we can expect many of the parameters β_i to be zero ⇒ feature selection.
• Unlike ordinary least squares regression and ridge regression, there is no closed-form solution for the parameters in LASSO regression.
• (Figure: comparing ridge regression with the constraint ‖β‖² ≤ 1 and LASSO regression with ‖β‖₁ ≤ 1; the corners of the ℓ₁ ball make solutions such as β₁ = 0, β₂ = 1 likely.)
5. Maximum likelihood estimation

• Maximum likelihood estimation (MLE): define a function of the parameters θ that enables us to find a model that fits the data well.
• Let p(x|θ) be the probability density function that gives the probability of observing data x given θ.
• The maximum likelihood estimator gives us the most likely parameter θ.
• Supervised learning:
  o Pairs of observations (x₁,y₁), (x₂,y₂), …, (x_N,y_N) with x_n ∈ ℝ^D and labels y_n ∈ ℝ.
  o Given a vector x_n, we want the probability distribution of the label y_n: specify the conditional probability distribution of the labels given the observations for the particular parameter setting θ.
  o For each observation-label pair (x_n, y_n), find p(y_n | x_n, θ).
• For example, in linear regression we specify a Gaussian likelihood: p(y_n | x_n, θ) = N(y_n | x_nᵀθ, σ²), where observation uncertainty is modelled as Gaussian noise ε_n ∼ N(0, σ²) (see slide page 28).
• Given an input x_n, the target value y_n is predicted to be distributed according to the Gaussian (normal) distribution with mean x_nᵀθ and variance σ². (In the figure, θ = [α; β] and x_n = [1; X_n].)

5. Maximum likelihood estimation

• We assume that the pairs of observations (x₁,y₁), (x₂,y₂), …, (x_N,y_N) are independent and identically distributed (i.i.d.).
• Using the independence assumption, for the entire data set (X,Y) we have
  p(Y|X, θ) = ∏_{n=1}^{N} p(y_n | x_n, θ)
  ⇨ recall P(a,b) = P(a ∩ b) = P(a) × P(b) if a and b are independent events.
• "Identically distributed" means that each term in the product has the same distribution with the same parameters, e.g. N(y_n | x_nᵀθ, σ²).
• We consider the negative log-likelihood:
  L(θ) = −log p(Y|X, θ) = −log ∏_{n=1}^{N} p(y_n | x_n, θ) = −Σ_{n=1}^{N} log p(y_n | x_n, θ)
  ⇨ recall log(a × b) = log(a) + log(b)
• Finding θ that maximizes the likelihood p(Y|X, θ)
  ⇔ minimizing the negative likelihood
  ⇔ minimizing the negative log-likelihood L(θ) (because log(x) is a strictly increasing function).
• Example: minimizing the negative log-likelihood where p(y_n | x_n, θ) is N(y_n | x_nᵀθ, σ²) gives the linear regression least-squares parameter estimates:
  o Take the derivative of L(θ) with respect to θ
  o Set the derivative to 0 and solve for θ
  o Answer: see page 20, i.e. θ = (XᵀX)⁻¹XᵀY.
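A minimal R sketch (with invented data) showing that minimizing the Gaussian negative log-likelihood over θ gives the same estimates as least squares / lm():

  # MLE for linear regression with Gaussian noise reduces to least squares.
  set.seed(7)
  x <- rnorm(50); y <- 1 + 2 * x + rnorm(50, sd = 0.3)
  X <- cbind(1, x)

  nll <- function(theta, sigma = 0.3) {
    mu <- X %*% theta
    -sum(dnorm(y, mean = mu, sd = sigma, log = TRUE))   # negative log-likelihood
  }
  optim(c(0, 0), nll)$par     # ~ (1, 2)
  coef(lm(y ~ x))             # essentially the same estimates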
6. Maximum A Posteriori Estimation (MAP)

• If we have prior knowledge about the distribution of the parameters θ, we can update this distribution after observing some data x.
• Bayes' theorem allows us to compute a posterior distribution p(θ|x):
  p(θ|x) = p(x|θ) p(θ) / p(x)
  where p(x|θ) is the likelihood of the data x given θ and p(x) is the probability distribution of the data x.
  (Bayes' formula: P(A|B) = P(A ∩ B)/P(B) = P(B|A) P(A)/P(B).)
• Since the distribution p(x) does not depend on θ, finding the maximum of the posterior distribution p(θ|x) is equivalent to maximizing the numerator p(x|θ) p(θ):
  max p(θ|x) ∝ max p(x|θ) p(θ)

• Given the training data (X,Y), to find the MAP estimate θ_MAP we compute the log-posterior:
  log p(θ|X,Y) = log [p(Y|X, θ) p(θ) / p(Y|X)] = log p(Y|X, θ) + log p(θ) + const
  and minimize the negative log-posterior with respect to θ:
  θ_MAP ∈ argmin_θ { −log p(Y|X, θ) − log p(θ) }
• For example, with a Gaussian prior p(θ) = N(0, b²I), we have the negative log-posterior:
  −log p(θ|X,Y) = −log p(Y|X, θ) − log p(θ) + const
                = (1/(2σ²)) (Y − Xθ)ᵀ(Y − Xθ) + (1/(2b²)) θᵀθ + const    (see page 44)
• Note: for a single parameter θ with a Gaussian prior of mean 0 and variance b²,
  p(θ) = (1/√(2πb²)) exp(−½ (θ/b)²)
  −log p(θ) = −log(1/√(2πb²)) + ½ (θ/b)² = const + (1/(2b²)) θ²

6. Maximum A Posteriori Estimation (MAP)

• Taking the derivative of the negative log-posterior
  −log p(θ|X,Y) = −log p(Y|X, θ) − log p(θ) = (1/(2σ²)) (Y − Xθ)ᵀ(Y − Xθ) + (1/(2b²)) θᵀθ + const
  and setting it to 0, we obtain the MAP estimate
  θ_MAP = (XᵀX + (σ²/b²) I)⁻¹ XᵀY
• Compare with the parameter estimates without a prior on page 20:
  o There is an additional term (σ²/b²) I added to XᵀX.
  o This ensures that the matrix (XᵀX + (σ²/b²) I) is invertible.
  o Compare with the parameter estimates for ridge regression on page 38.

7. Logistic regression

• It is a technique for converting binary classification problems into linear regression.
• The response variable is assumed to take the values 0 or 1.
• Using logistic regression, the posterior probability P(y|x) of the target variable y conditioned on the input x = (X₁, X₂, …, X_n) is modeled according to the logistic function:
  P(Y = 1 | X₁, …, X_n) = exp(β₀ + β₁X₁ + … + β_nX_n) / [1 + exp(β₀ + β₁X₁ + … + β_nX_n)]
  (e = 2.718281828…, ln e = 1)
7. Logistic regression

• Graph of the logistic function f(x) = 1/(1 + e^{−x}):
  o f(x) is an increasing function
  o f(x) is close to 0 for large negative x
  o f(x) is close to 1 for large positive x
  o f(x) = 0.5 when x = 0
  o Its derivative is f′(x) = f(x)(1 − f(x))      (ln(x) = log_e(x))
• Hence 0 ≤ P(Y=1|X₁,X₂,…,X_n) ≤ 1 and 0 ≤ P(Y=0|X₁,X₂,…,X_n) ≤ 1.
• The above logistic function is also known as the sigmoid function.

• The ratio of the two conditional probabilities,
  P(Y=1|X₁,…,X_n) / P(Y=0|X₁,…,X_n) = exp(β₀ + β₁X₁ + … + β_nX_n),
  is the odds in favor of Y = 1, and its logarithm,
  ln [P(Y=1|X₁,…,X_n) / P(Y=0|X₁,…,X_n)] = β₀ + β₁X₁ + … + β_nX_n,
  is the logit or the log-odds (= z).
• If X_i is increased by 1, e^{β_i} is the odds ratio: the multiplicative increase in the odds when X_i increases by one (other variables remaining constant).

7. Logistic regression

Bernoulli random variable
• A Bernoulli trial is an experiment with 2 possible outcomes, success and failure, with probability of success = π and probability of failure = 1 − π, where 0 ≤ π ≤ 1.
• A Bernoulli random variable Y is defined by the probability mass function:
  P_Y(y) = 1 − π if y = 0,  π if y = 1
• Alternatively, it can be defined by the probability mass function:
  P_Y(y) = π^y (1 − π)^{1−y} for y = 0, 1
• The expected value of a Bernoulli random variable Y is
  E[Y] = 1(π) + 0(1 − π) = π

Likelihood function
• Given a classification problem where the target variable Y_i, i = 1, 2, …, n is binary-valued (0 or 1) with probabilities 1 − π_i and π_i, the logistic regression model can be stated as finding parameters β₀ and β₁ such that:
  E[Y_i | X_i] = π_i = exp(β₀ + β₁X_i) / [1 + exp(β₀ + β₁X_i)]
  (Compare with the Gaussian likelihood for linear regression on page 41.)
• Since each Y_i is a Bernoulli random variable, with P(Y_i = 1 | X_i) = π_i and P(Y_i = 0 | X_i) = 1 − π_i, its probability distribution function can be represented as:
  f_i(Y_i) = π_i^{Y_i} (1 − π_i)^{1−Y_i} for Y_i = 0, 1; i = 1, 2, …, n
• Note:
  o It is assumed that there is only one independent variable/descriptive feature (simple logistic regression); the discussion that follows applies to more than one independent variable. Here Y_i is the response for the i-th data sample and X_i is the corresponding value of the independent variable.
  o See the equation on page 48 and set n = 1.
7. Logistic regression

Maximum likelihood estimation      (log(a) + log(b) = log(ab);  log(aᴺ) = N log(a);  here n is the number of data samples)
• The probability distribution function of Y_i is:
  f_i(Y_i) = π_i^{Y_i} (1 − π_i)^{1−Y_i} for Y_i = 0, 1; i = 1, 2, …, n
• Since the Y_i observations are independent, their joint probability function is:
  g(Y₁,Y₂,…,Y_n) = ∏_{i=1}^{n} f_i(Y_i) = ∏_{i=1}^{n} π_i^{Y_i} (1 − π_i)^{1−Y_i}
• Taking the logarithm of the joint probability function, and using
  π_i = exp(β₀ + β₁X_i)/[1 + exp(β₀ + β₁X_i)],  1 − π_i = 1/[1 + exp(β₀ + β₁X_i)],  π_i/(1 − π_i) = exp(β₀ + β₁X_i):
  log_e g(Y₁,…,Y_n) = log_e ∏ π_i^{Y_i} (1 − π_i)^{1−Y_i}
                    = log_e ∏ π_i^{Y_i} + log_e ∏ (1 − π_i)^{1−Y_i}
                    = log_e ∏ π_i^{Y_i} + log_e ∏ (1 − π_i)^{−Y_i} + log_e ∏ (1 − π_i)
                    = log_e ∏ [π_i/(1 − π_i)]^{Y_i} + log_e ∏ (1 − π_i)
                    = Σ_{i=1}^{n} Y_i log_e[π_i/(1 − π_i)] + log_e ∏ (1 − π_i)
                    = Σ_{i=1}^{n} Y_i (β₀ + β₁X_i) − Σ_{i=1}^{n} log_e(1 + exp(β₀ + β₁X_i)) = log_e L(β₀,β₁) ⇐ Maximize!
• Instead of maximizing the joint probability function g(Y₁,…,Y_n), we find the maximum likelihood estimates β₀, β₁ by maximizing the log-likelihood function:
  log_e L(β₀,β₁) = Σ_{i=1}^{n} Y_i (β₀ + β₁X_i) − Σ_{i=1}^{n} log_e(1 + exp(β₀ + β₁X_i))
• Let b₀ and b₁ be the maximum likelihood estimates. The fitted logistic response function for a data sample with independent variable value X is:
  π̂ = exp(b₀ + b₁X) / [1 + exp(b₀ + b₁X)]
  and
  1 − π̂ = 1 / [1 + exp(b₀ + b₁X)]
• π̂/(1 − π̂) = exp(b₀ + b₁X) — this is the odds in favor of Y = 1.
• log_e[π̂/(1 − π̂)] = b₀ + b₁X — this is the logit or the log-odds (= z).
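A minimal R sketch (with invented data) that maximizes this log-likelihood numerically with optim() and compares the result with glm(..., family = binomial):

  # Logistic-regression MLE by direct maximization of log L(b0, b1).
  set.seed(3)
  X <- rnorm(200)
  p <- 1 / (1 + exp(-(-0.5 + 1.5 * X)))
  Y <- rbinom(200, 1, p)

  negloglik <- function(b) {
    eta <- b[1] + b[2] * X
    -(sum(Y * eta) - sum(log(1 + exp(eta))))   # minus the log-likelihood above
  }
  optim(c(0, 0), negloglik)$par
  coef(glm(Y ~ X, family = binomial))          # essentially the same estimates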

7. Logistic regression
Note on Maximum Likelihood Estimation
• The maximum value that can be achieved by a joint probability function is 1 (best fit).
  g(Y1,Y2,… Yn) = ∏_{i=1}^{n} fi(Yi) = ∏_{i=1}^{n} πi^Yi (1 − πi)^(1−Yi)
• The log likelihood function
  loge L(β0, β1) = ∑_{i=1}^{n} Yi (β0 + β1Xi) − ∑_{i=1}^{n} loge(1 + exp(β0 + β1Xi))
  ranges from minus infinity to 0.
• −LL or −2LL is normally reported as a model’s fit. The closer it is to 0, the better the fit.
• When there is no independent variable in the model, the maximum likelihood is achieved with
  π̂ = (1/n) ∑_{i=1}^{n} Yi ⇐ the proportion of data samples with Yi = 1
55

7. Logistic regression
Example 1.
A system analyst studied the effect of computer programming experience on the ability to complete a complex programming task within a specified time. The persons studied had varying amounts of experience (measured in months). All persons were given the same programming task and their success in the task was recorded: Y = 1 if the task was completed successfully within the allotted time, Y = 0 otherwise.

Person   Months-Experience   Success
1        14                  0
2        29                  0
3        6                   0
4        25                  1
…        …                   …
23       28                  1
24       22                  1
25       8                   1
56
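The maximum likelihood fit described above can also be reproduced numerically. The short R sketch below (not part of the original notes) maximizes loge L(β0, β1) by minimizing its negative with optim(); the vectors x and y are made-up illustrative data, and glm() is included only as a cross-check that should give essentially the same estimates.

# Minimal sketch: maximize loge L(b0, b1) by minimizing its negative with optim().
# x and y below are made-up illustrative data, not the Example 1 data.
negloglik <- function(beta, x, y) {
  eta <- beta[1] + beta[2] * x                 # linear predictor b0 + b1*Xi
  -(sum(y * eta) - sum(log(1 + exp(eta))))     # negative of loge L(b0, b1) from the slide
}
x <- c(14, 29, 6, 25, 18, 4, 17, 12, 22, 8)
y <- c(0, 0, 0, 1, 1, 0, 0, 0, 1, 1)
fit <- optim(c(0, 0), negloglik, x = x, y = y)
fit$par                                        # maximum likelihood estimates (b0, b1)
coef(glm(y ~ x, family = binomial))            # cross-check with glm()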
7. Logistic regression
Example 1.
• A standard logistic regression package was run on the data and the parameter values found are β0 = -3.0595 and β1 = 0.1615.
• The estimated mean response for i = 1, where X1 = 14 months of experience, is
  o a = β0 + β1 X1 = -3.0595 + 0.1615(14) = -0.7985 (logit when X1 = 14)
  o e^a = e^-0.7985 = 0.4500 ⇐
  o P(Y=1|X1=14) = 0.4500/(1.0 + 0.4500) = 0.3103
  o The estimated probability that a person with 14 months of experience will successfully complete the programming task is 0.3103.
  o The odds in favor of completing the task = 0.3103/(1 − 0.3103) = 0.4500 ⇐
57

7. Logistic regression
Example 1.
• Suppose there is another programmer with 15 months of experience, i.e. X1 = 15.
• Recall the parameter values are β0 = -3.0595 and β1 = 0.1615 ⇐
  o b = β0 + β1 X1 = -3.0595 + 0.1615(15) = -0.637 (logit when X1 = 15)
  o e^b = e^-0.637 = 0.5289
  o P(Y=1|X1=15) = 0.5289/(1.0 + 0.5289) = 0.3459
  o The estimated probability that a person with 15 months of experience will successfully complete the programming task is 0.3459.
  o The odds in favor of completing the task = 0.3459/(1 − 0.3459) = 0.5288
  o Comparing the two odds: 0.5288/0.4500 = 1.1753 = e^0.1615 ⇐
  o The odds increase by 17.53% with each additional month of experience.
58
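The arithmetic on these two slides is easy to reproduce. The few R lines below (my own sketch, using the b0 and b1 values quoted on the slides) recompute the probabilities and the odds ratio:

# Reproducing the Example 1 numbers in R with the fitted coefficients.
b0 <- -3.0595; b1 <- 0.1615
logit <- b0 + b1 * c(14, 15)            # logits for 14 and 15 months of experience
p <- exp(logit) / (1 + exp(logit))      # estimated P(Y = 1 | X): 0.3103, 0.3459
odds <- p / (1 - p)                     # odds in favor of completing the task
odds[2] / odds[1]                       # 1.1753, which equals exp(b1)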

7. Logistic regression
Example 2.
• Reference: https://stats.idre.ucla.edu/sas/output/proc-logistic/
• The data were collected from 200 high school students.
• The variables reading, write, math, science, socst are the results of standardized tests on reading, writing, math, science and social studies (respectively), and the variable female is coded 1 if female, 0 if male.
• The response variable is honors with two possible values:
  o high writing test score if the writing score is greater than or equal to 60 (honors = 1),
  o low writing test score, otherwise (honors = 0).
• The predictor variables used: gender (female), reading test score (reading), and science test score (science).
• The dataset can be downloaded from https://stats.idre.ucla.edu/
59

7. Logistic regression
Example 2.
> stdata <- read.table("student-data.txt",header=FALSE)
> science <- stdata$V11
> reading <- stdata$V8
> honors <- stdata$V13
> fitlogreg <- glm(honors ~ female + reading + science, family = binomial)
> fitlogreg

Call: glm(formula = honors ~ female + reading + science, family = binomial)
Coefficients:
(Intercept)       female      reading      science
  -12.77720      1.48250      0.10354      0.09479
Degrees of Freedom: 199 Total (i.e. Null); 196 Residual
Null Deviance: 231.3
Residual Deviance: 160.2    AIC: 168.2

o Estimate for Intercept = β0 = -12.7772
o Estimate for β1 = 1.4825 (corresponds to variable female)
o Estimate for β2 = 0.10354 (corresponds to variable reading)
o Estimate for β3 = 0.09479 (corresponds to variable science)
60
7. Logistic regression
Example 2.
> exp(coef(fitlogreg))
 (Intercept)       female      reading      science
2.824433e-06 4.403931e+00 1.109086e+00 1.099428e+00

o Estimate for Intercept = β0 = -12.7772
o Estimate for β1 = 1.4825 (corresponds to variable female)
o Estimate for β2 = 0.10354 (corresponds to variable reading)
• Estimate for β3 = 0.09479 (corresponds to variable science): the estimated logistic regression coefficient for a one unit change in science score, given the other variables in the model are held constant. If a student were to increase his/her science score by one point, the difference in log-odds (logit response values) for high writing score is expected to increase by 0.09479 units, all other variables held constant (page 54).
• Odds ratio point estimate corresponding to variable female = 4.4039 = e^β1: the odds of a female student getting a high writing test score are more than 4-fold higher than those of a male student (given the same reading and science test scores).
61

7. Logistic regression
Example 2.
> library(InformationValue)
> fitted.value <- predict(fitlogreg,stdata,type="response")
> plotROC(actuals = honors, predictedScores=fitted.value)
> optimalCutoff(honors,fitted.value)
[1] 0.5090678 ⇐ minimize classification error with this cutoff: if (predicted output > Cutoff, then Class 1, else Class 0)
> Concordance(honors,fitted.value)
$Concordance
[1] 0.856116
$Discordance
[1] 0.143884
$Tied
[1] -5.551115e-17
$Pairs
[1] 7791
62
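The Concordance output can also be checked by hand: compare the fitted probabilities across every (class 1, class 0) pair. The sketch below is my own check and assumes the vectors honors and fitted.value exist exactly as created in the code above.

# Hand-rolled concordance check (sketch; assumes honors and fitted.value from above).
p1 <- fitted.value[honors == 1]         # fitted probabilities of the 53 class-1 students
p0 <- fitted.value[honors == 0]         # fitted probabilities of the 147 class-0 students
d <- outer(p1, p0, "-")                 # one entry per pair: 53 x 147 = 7791 pairs
mean(d > 0)                             # proportion concordant (about 0.856)
mean(d < 0)                             # proportion discordant (about 0.144)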

7. Logistic regression
Example 2.
• Percent Concordant - A pair of observations with different observed responses is said to be concordant if the observation with the lower ordered response value (honcomp = 0) has a lower predicted mean score than the observation with the higher ordered response value (honcomp = 1).
• Percent Discordant - If the observation with the lower ordered response value has a higher predicted mean score than the observation with the higher ordered response value, then the pair is discordant.
• Percent Tied - If a pair of observations with different responses is neither concordant nor discordant, it is a tie.
• Pairs - This is the total number of distinct pairs with one case having a positive response (honcomp = 1) and the other having a negative response (honcomp = 0).
  o The total number of ways the 200 observations can be paired up is 200(199)/2 = 19,900.
  o There are 53 honors students and 147 non-honors students.
  o Of the 19,900 possible pairings, 7791 (= 147 × 53) have different values on the response variable and 19,900 − 7791 = 12,109 have the same value on the response variable.
63

7. Logistic regression
Example 2.
Degrees of Freedom: 199 Total (i.e. Null); 196 Residual
Null Deviance: 231.3
Residual Deviance: 160.2    AIC: 168.2
• Null Deviance: indicates how well the response variable is predicted by a model that only includes the intercept = 231.3.
• There are 53 class 1 and 147 class 0 samples. Fit the probability of class 1 with: π̂ = (1/n) ∑_{i=1}^{n} Yi = 53/200.
  g(Y1,Y2,… Yn) = ∏_{i=1}^{n} fi(Yi) = ∏_{i=1}^{n} πi^Yi (1 − πi)^(1−Yi) = (53/200)^53 × (147/200)^147
  loge g(Y1,Y2,… Yn) = (53) loge(53/200) + (147) loge(147/200) = -115.644 = loge L(β0)
  -2 loge L(β0) = (-2)(-115.644) = 231.388 = Null Deviance
• Number of parameters = 3, number of levels in the dependent variable k = 2.
• With the model, the Residual Deviance obtained = 160.2, with the loss of 3 degrees of freedom in this dataset.
• AIC = The Akaike Information Criterion is used to assess the quality of the model through comparison of related models.
  AIC = -2LL + 2 × (# parameters + (k−1)) = 160.2 + 2(3 + 1) = 168.2
64
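The Null Deviance and AIC arithmetic above can be verified in a couple of lines of R (a quick check, not part of the original slides):

# Null deviance of the intercept-only model and the AIC arithmetic from the slide.
n1 <- 53; n0 <- 147; n <- n1 + n0
logL0 <- n1 * log(n1 / n) + n0 * log(n0 / n)   # loge L for the intercept-only fit
-2 * logL0                                     # 231.39 = Null Deviance
160.2 + 2 * (3 + 1)                            # AIC = -2LL + 2 x (# parameters + (k - 1))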
BT4240 Tutorial # 1, January 28, 2025.
1. Starting from initial point x0 = (1, −1)ᵀ, compute x1 using the steepest descent approach with optimal step size (exact line search) to find the minimum of the function

   f(x) = f(x1, x2) = x1³ + x2³ − 2x1² + 3x2² − 8

   Also compute the gradient at x1.

   ∇f(x1, x2) = (3x1² − 4x1, 3x2² + 6x2)ᵀ
   ∇f(1, −1) = (−1, −3)ᵀ
   d = −∇f(1, −1) = (1, 3)ᵀ
   x + λd = (1 + λ, −1 + 3λ)ᵀ
   ∇f(x + λd) = (3(1 + λ)² − 4(1 + λ), 3(−1 + 3λ)² + 6(−1 + 3λ))ᵀ

   The derivative of f along the search direction is

   f′(λ) = ∇f(x + λd)ᵀ d = 3(1 + λ)² − 4(1 + λ) + 9(−1 + 3λ)² + 18(−1 + 3λ) = −10 + 2λ + 84λ²

   Set the derivative to 0, and obtain

   λ = (−2 ± √(4 + 3360)) / 168
   λ1 = 1/3,  λ2 = −5/14

   The updated point x1 is

   x1 = x0 + λ1 d = (1, −1)ᵀ + (1/3)(1, 3)ᵀ = (4/3, 0)ᵀ
   ∇f(x1) = (0, 0)ᵀ
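As a quick numerical cross-check (not part of the original hand-out), the exact line search can be reproduced with R's one-dimensional optimize():

# Exact line search for Question 1, checked numerically.
f <- function(x) x[1]^3 + x[2]^3 - 2 * x[1]^2 + 3 * x[2]^2 - 8
g <- function(x) c(3 * x[1]^2 - 4 * x[1], 3 * x[2]^2 + 6 * x[2])
x0 <- c(1, -1)
d <- -g(x0)                                   # steepest descent direction (1, 3)
phi <- function(lam) f(x0 + lam * d)          # objective along the search line
lam <- optimize(phi, c(0, 1))$minimum         # approximately 1/3
x1 <- x0 + lam * d                            # approximately (4/3, 0)
g(x1)                                         # approximately (0, 0)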

2. Use the steepest descent method to approximate the optimal solution to the problem

   min x1² + 2x2² − 2x1x2 − 2x2

   with fixed step size λ = 0.1. Starting from the point x0 = (0.5, 0.5)ᵀ, show the first few iterations.

   g(x1, x2) = ∇f(x1, x2) = (2x1 − 2x2, 4x2 − 2x1 − 2)ᵀ
The first few iterations:

k x1 x2 g1 g2 f(x1,x2)
1 0.50000 0.50000 0.00000 -1.00000 -0.75000
2 0.50000 0.60000 -0.20000 -0.60000 -0.83000
3 0.52000 0.66000 -0.28000 -0.40000 -0.86480
4 0.54800 0.70000 -0.30400 -0.29600 -0.88690
5 0.57840 0.72960 -0.30240 -0.23840 -0.90402
.....
96 0.99969 0.99981 -0.00024 -0.00015 -1.00000
97 0.99972 0.99982 -0.00022 -0.00013 -1.00000
98 0.99974 0.99984 -0.00020 -0.00012 -1.00000
99 0.99976 0.99985 -0.00019 -0.00011 -1.00000
100 0.99978 0.99986 -0.00017 -0.00011 -1.00000

Solution: x∗ = (1, 1).
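The iteration table above can be reproduced with a short R loop (a sketch using the same step size and starting point):

# Fixed step-size steepest descent for Question 2.
f <- function(x) x[1]^2 + 2 * x[2]^2 - 2 * x[1] * x[2] - 2 * x[2]
g <- function(x) c(2 * x[1] - 2 * x[2], 4 * x[2] - 2 * x[1] - 2)
x <- c(0.5, 0.5); lambda <- 0.1
for (k in 1:100) {
  cat(k, x, g(x), f(x), "\n")                 # k, x1, x2, g1, g2, f(x1, x2)
  x <- x - lambda * g(x)                      # gradient step
}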


3. Use Newton's method to find the optimal solution to the problem

   min x1² + 2x2² − 2x1x2 − 2x2

   starting from x0 = (0.5, 0.5)ᵀ.

   Compute the Hessian and its inverse, then update x:

   H(x1, x2) = [[2, −2], [−2, 4]]
   H⁻¹(x1, x2) = (1/4) [[4, 2], [2, 2]]
   x1 = x0 − H⁻¹(0.5, 0.5) ∇f(0.5, 0.5) = (0.5, 0.5)ᵀ − (1/4)[[4, 2], [2, 2]] (0, −1)ᵀ = (0.5, 0.5)ᵀ − (−0.5, −0.5)ᵀ = (1, 1)ᵀ

   The solution is obtained after just one iteration if the function is quadratic.
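The same Newton step, as a small R verification sketch:

# One Newton step for Question 3; solve(H, g) computes H^{-1} g without forming the inverse.
g <- function(x) c(2 * x[1] - 2 * x[2], 4 * x[2] - 2 * x[1] - 2)
H <- matrix(c(2, -2, -2, 4), nrow = 2)        # constant Hessian of the quadratic
x0 <- c(0.5, 0.5)
x1 <- x0 - solve(H, g(x0))
x1                                            # (1, 1)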
4. Consider the problem to minimize (x1³ − x2)² + 2(x2 − x1)⁴. If Newton's method is applied starting from x0 = (0, 1)ᵀ, what will x1 be?

   ∇f(x1, x2) = (2(x1³ − x2)(3x1²) + 8(x2 − x1)³(−1), 2(x1³ − x2)(−1) + 8(x2 − x1)³)ᵀ
              = (6x1²(x1³ − x2) − 8(x2 − x1)³, −2(x1³ − x2) + 8(x2 − x1)³)ᵀ
   ∇f(0, 1) = (−8, 10)ᵀ

   Compute the Hessian and its inverse, then update x:

   H(x1, x2) = [[12x1(x1³ − x2) + 6x1²(3x1²) + 24(x2 − x1)², −6x1² − 24(x2 − x1)²],
                [−6x1² − 24(x2 − x1)², 2 + 24(x2 − x1)²]]
   H(0, 1) = [[24, −24], [−24, 26]]
   H⁻¹(0, 1) = [[13/24, 1/2], [1/2, 1/2]]

   x1 = x0 − H⁻¹(0, 1) ∇f(0, 1) = (0, 1)ᵀ − [[13/24, 1/2], [1/2, 1/2]] (−8, 10)ᵀ = (−2/3, 0)ᵀ
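A numerical check of this Newton step (my own sketch; the Hessian at (0, 1) is taken from the calculation above):

# Question 4: one Newton step from (0, 1).
g <- function(x) c(6 * x[1]^2 * (x[1]^3 - x[2]) - 8 * (x[2] - x[1])^3,
                   -2 * (x[1]^3 - x[2]) + 8 * (x[2] - x[1])^3)
H0 <- matrix(c(24, -24, -24, 26), nrow = 2)   # Hessian evaluated at (0, 1)
x1 <- c(0, 1) - solve(H0, g(c(0, 1)))
x1                                            # (-2/3, 0)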

5. Consider the problem:

   minimize (x1 − 3)² + (x2 − 2)²
   s.t.
   x1² + x2² ≤ 5
   x1 + 2x2 ≤ 4
   −x1 ≤ 0
   −x2 ≤ 0

   (a) Verify that the necessary optimality conditions hold at the solution point (2, 1).
   First note that the first 2 constraints are binding at x = (2, 1). Thus the Lagrange multipliers λ3 and λ4 associated with the other 2 constraints, −x1 ≤ 0 and −x2 ≤ 0 respectively, are equal to zero.
   Note also that
   ∇f(x) = (2(x1 − 3), 2(x2 − 2))ᵀ = (−2, −2)ᵀ,  ∇g1(x) = (2x1, 2x2)ᵀ = (4, 2)ᵀ,  ∇g2(x) = (1, 2)ᵀ
   Thus λ1 = 1/3 and λ2 = 2/3 will satisfy the KKT conditions:
   ∇f(x) + λ1 ∇g1(x) + λ2 ∇g2(x) = 0
   (−2, −2)ᵀ + (1/3)(4, 2)ᵀ + (2/3)(1, 2)ᵀ = (0, 0)ᵀ
   We obtain the solution as the problem has a convex objective function and the constraints are also all convex.

   (b) Check whether the KKT conditions are satisfied at the point x̂ᵀ = (0, 0).
   Here the 3rd and 4th constraints are binding, while the 1st and 2nd are not. Hence, let λ1 = λ2 = 0. Note that
   ∇f(x̂) = (−6, −4)ᵀ,  ∇g3(x̂) = (−1, 0)ᵀ,  ∇g4(x̂) = (0, −1)ᵀ
   Hence, we want to find λ3 ≥ 0 and λ4 ≥ 0 such that
   ∇f(x̂) + λ3 ∇g3(x̂) + λ4 ∇g4(x̂) = 0.
   This is not possible.
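The stationarity check at (2, 1) is a one-liner in R (a quick verification, not part of the hand-out):

# KKT stationarity check at (2, 1) with lambda1 = 1/3 and lambda2 = 2/3.
grad_f  <- c(2 * (2 - 3), 2 * (1 - 2))        # (-2, -2)
grad_g1 <- c(2 * 2, 2 * 1)                    # (4, 2)
grad_g2 <- c(1, 2)
grad_f + (1/3) * grad_g1 + (2/3) * grad_g2    # (0, 0)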

6. Consider the quadratic optimization problem:

   min −15x1 − 30x2 − 4x1x2 + 2x1² + 4x2²
   subject to
   x1 + 2x2 ≤ 30
   x1 ≥ 0
   x2 ≥ 0

   (a) Find the matrix Q and vector c such that the objective function above can be expressed as (1/2)xᵀQx + cᵀx.

   f(x1, x2) = −15x1 − 30x2 − 4x1x2 + 2x1² + 4x2²
   ∇f(x1, x2) = Qx + c = (−15 − 4x2 + 4x1, −30 − 4x1 + 8x2)ᵀ
   Q = H(x1, x2) = [[4, −4], [−4, 8]]
   c = (−15, −30)ᵀ

   Check that (1/2)xᵀQx + cᵀx = −15x1 − 30x2 − 4x1x2 + 2x1² + 4x2²:

   Qx = [[4, −4], [−4, 8]] (x1, x2)ᵀ = (4x1 − 4x2, −4x1 + 8x2)ᵀ
   xᵀQx = (x1, x2)(4x1 − 4x2, −4x1 + 8x2)ᵀ = 4x1² − 8x1x2 + 8x2²
   (1/2)xᵀQx + cᵀx = −15x1 − 30x2 − 4x1x2 + 2x1² + 4x2²

   (b) Show that the solution of the problem is (x1, x2) = (12, 9).
   We check all the necessary conditions:
   • Primal feasibility:
     x1 + 2x2 − 30 = 12 + 2(9) − 30 ≤ 0
     −x1 ≤ 0
     −x2 ≤ 0
   • Dual feasibility: let λ2 = λ3 = 0. Need to find λ1 ≥ 0 that satisfies the stationarity condition:
     (−15 − 4x2 + 4x1, −30 − 4x1 + 8x2)ᵀ + λ1 (1, 2)ᵀ = (0, 0)ᵀ
     For x1 = 12, x2 = 9,
     (−3, −6)ᵀ + λ1 (1, 2)ᵀ = (0, 0)ᵀ
     is true with λ1 = 3.
   Summary: all the necessary conditions are satisfied. The objective function and constraints are convex, hence (12, 9) is the solution of the QP.
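A short R check of part (a) and of the KKT point in part (b) (a verification sketch, not part of the hand-out):

# Question 6: check the quadratic form and the stationarity condition at (12, 9).
Q <- matrix(c(4, -4, -4, 8), nrow = 2)
cvec <- c(-15, -30)
x <- c(12, 9)
0.5 * t(x) %*% Q %*% x + sum(cvec * x)                    # -270
-15*x[1] - 30*x[2] - 4*x[1]*x[2] + 2*x[1]^2 + 4*x[2]^2    # same value, computed directly
grad <- as.vector(Q %*% x + cvec)                         # gradient at (12, 9): (-3, -6)
grad + 3 * c(1, 2)                                        # stationarity with lambda1 = 3: (0, 0)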
BT4240 Tutorial # 2, February 4, 2025.

1. A study was undertaken to examine the profit Y per sales dollar earned by a construction
company and its relationship to the size X1 of the construction contract (in hundreds
of thousands of dollars) and the number X2 of years of experience of the construction
supervisor. Data were obtained from a sample of n = 18 construction projects undertaken
by the construction company over the past two years. Below is the ANOVA table of a
regression analysis using the model:

Y = β0 + β1X1 + β2X2 + β3X1² + β4X1X2 + e

Analysis of Variance
Source    DF    Sum of Squares    Mean Square    F value    Prob > F
Model 4 77.02982 19.25745 20.440 0.0001
Error 13 12.24796 0.94215
Total 17 89.27778

Parameter Estimates
Variable           DF    Parameter Estimate    Standard Error    T for H0: Parameter = 0    Prob > |T|
INTERCEP 1 19.304957 2.05205668 9.408 0.0001
CSIZE X1 1 -1.486602 1.17773388 -1.262 0.2290
SUPEXP X2 1 -6.371452 1.04230807 -6.113 0.0001
CSIZESQ X1*X1 1 -0.752248 0.22523749 -3.340 0.0063
SIZEEXP X1*X2 1 1.717053 0.25377879 6.766 0.0001

(a) Test the overall adequacy of the model at the 1% level of significance.
Since the p-value of 0.0001 for the F test is less than 0.01 (= 1%), we reject the null
hypothesis that β1 = β2 = β3 = β4 = 0. The model is adequate.
(b) (2 points) Compute and interpret the value of R².
R² = 77.02982/89.27778 = 0.8628, that is, around 86% of the variation in the
profit Y can be explained by the multiple linear regression model.
(c) Explain how the prob-value for testing H0 : β2 = 0 versus Ha : β2 ̸= 0 has been
calculated.
The total area under the t-distribution with DF = 13 with t less than -6.113 and
greater than 6.113 is 0.0001.
(d) By using the prob-value for β3 , determine whether we can reject H0 : β3 = 0 versus
Ha : β3 ̸= 0 at the 5% level of significance.
Since the p-value of 0.0063 is less than 0.05, we reject H0 in favor of Ha .
(e) True or False: the regression model above is not a linear model as it involves X1²
and X1X2.
False. The model is linear with respect to the model parameters β1, β2, β3, β4.

2. We are investigating the starting salary of new graduates from the university. A multiple
linear regression with 4 independent variables is generated. The input variables are:

• CAP (cumulative average point): continuous in the interval [2, 5].
• Faculty: categorical variable with values Faculty1, Faculty2, Faculty3, Faculty4.
• Internship: binary value (0 = no, 1 = yes).

The following is the output from R:

Call:
lm(formula = Salary ~ CAP + as.factor(Faculty) + as.factor(Internship), data=data)

Residuals:
Min 1Q Median 3Q Max
-1297.23 -514.77 -2.33 494.34 1056.74

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2010.9 489.3 4.110 0.000236 ***
CAP 451.3 138.9 3.250 0.002605 **
as.factor(Faculty)2 130.1 313.6 0.415 0.680974
as.factor(Faculty)3 -981.3 334.8 -2.931 0.005997 **
as.factor(Faculty)4 1238.5 305.6 4.052 0.000279 ***
as.factor(Internship)1 629.6 249.9 2.519 0.016622 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 695.9 on 34 degrees of freedom
Multiple R-squared: 0.7048, Adjusted R-squared: 0.6614
F-statistic: 16.23 on 5 and 34 DF, p-value: 3.531e-08

(a) What is the expected starting salary of a student from Faculty1 with CAP = 2.0
and no internship experience?
Ŷ = 2010.9 + 451.3(2) = 2913.5
(b) How do you interpret the coefficient estimate of Faculty4?
Compared to students from Faculty 1, students from Faculty 4 are expected to
have starting salary that is higher by $1238.5 when they have the same CAP and
internship experience.
(c) What can you conclude from the large p-value corresponding to the coefficient of
Faculty2?
The difference in starting salary of students from Faculty 1 and Faculty 2 is not
statistically significant (if they have the same CAP and internship experience).
(d) Explain how the value of Multiple R-squared is computed.
The multiple R-squared is the ratio of the regression sum of squares to the total sum of squares of the data, where
• regression sum of squares = model sum of squares = ∑ᵢ (ŷᵢ − ȳ)²,
• total sum of squares of the data = ∑ᵢ (yᵢ − ȳ)²,
• ȳ = average value of yᵢ in the data,
• ŷᵢ = predicted value for data sample i.
(e) True or False: we will get the same results if we call the R function as follows:
lm(Salary ~ CAP + as.factor(Faculty) + Internship, data = data)
Explain briefly.
True, since Internship has only two possible values, 0 or 1.

3. You are tasked to build a regression model to predict yi using the input variables
xi,1 , xi,2 , ....xi,p , where i = 1, 2, ..., N . For each data sample i, its importance is re-
flected by the corresponding weight wi > 0. The prediction error is (yi − ŷi ), where
the prediction ŷi is computed as ŷi = β0 + β1 xi,1 + β2 xi,2 + . . . βp xi,p . Find the optimal
parameters β = (β0 , β1 , ..., βp ) that will minimize the sum of weighted squared errors:
N
X
wi (yi − ŷi )2
i=1

(Hint: use matrix notation; let the input data be X and the target variable be Y.)
Let W be a diagonal matrix with Wᵢᵢ = wᵢ, and let X contain a leading column of ones so that the intercept β0 is included in β. Then

Ŷ = Xβ
S(β) = (Y − Xβ)ᵀ W (Y − Xβ)
∂S/∂β = −2Xᵀ W (Y − Xβ) = 0
β = (Xᵀ W X)⁻¹ Xᵀ W Y
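The closed form agrees with R's built-in weighted least squares. Below is a small sketch on made-up data (X, Y and the weights w are purely illustrative, not from the question); lm(..., weights = w) should return the same estimates.

# Weighted least squares: closed form versus lm(..., weights = w) on made-up data.
set.seed(1)
n <- 30
x1 <- runif(n); x2 <- runif(n)
X <- cbind(1, x1, x2)                          # design matrix with intercept column
Y <- as.vector(X %*% c(2, 1, -3) + rnorm(n))
w <- runif(n)                                  # positive importance weights
W <- diag(w)
beta_hat <- solve(t(X) %*% W %*% X, t(X) %*% W %*% Y)
t(beta_hat)
coef(lm(Y ~ x1 + x2, weights = w))             # same estimates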

4. The data below show for a consumer finance company operating in six cities, the number
of competing loan companies operating in the city (X) and the number per thousand of
the company’s loans made in that city that are currently delinquent (Y ):

i: 1 2 3 4 5 6
Xi : 4 1 2 3 3 4
Yi : 16 5 10 15 13 22

We are modeling the relationship between X and Y as linear function:

Yi = β0 + β1 Xi

The least squares estimators for β = (β0, β1) are computed using ridge regression with penalty parameter λ = 1. Find the optimal values of the parameters (β0, β1).

S(β) = (1/2)(Y − Xβ)ᵀ(Y − Xβ) + (λ/2)βᵀβ
∇S(β) = −Xᵀ(Y − Xβ) + λβ
0 = −XᵀY + XᵀXβ + λβ
(XᵀX + λI)β = XᵀY

X = [[1, 4], [1, 1], [1, 2], [1, 3], [1, 3], [1, 4]],  Y = (16, 5, 10, 15, 13, 22)ᵀ

XᵀY = (81, 261)ᵀ
XᵀX + λI = [[6, 17], [17, 55]] + [[1, 0], [0, 1]] = [[7, 17], [17, 56]]
(XᵀX + λI)⁻¹ = (1/103)[[56, −17], [−17, 7]]
β = (β0, β1)ᵀ = (1/103)[[56, −17], [−17, 7]] (81, 261)ᵀ = (99/103, 450/103)ᵀ
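The ridge estimate can be verified directly in R (a quick check, not part of the hand-out):

# Ridge regression estimate for Question 4, computed from the normal equations.
X <- cbind(1, c(4, 1, 2, 3, 3, 4))
Y <- c(16, 5, 10, 15, 13, 22)
lambda <- 1
solve(t(X) %*% X + lambda * diag(2), t(X) %*% Y)   # 99/103 = 0.961 and 450/103 = 4.369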

5. Consider the problem of collusive bidding among road construction contractors. Con-
tractors sometimes scheme to set bid prices higher than the fair (or competitive) price.
Suppose an investigator has obtained information on the bid status (fixed or competi-
tive) for a sample of 31 contracts. Two variables thought to be related to bid status are
also recorded for each contract: number of bidders (x1 ), and the difference between the
lowest bid and the estimated competitive bid where the difference x2 is measured as a
percentage of the estimate. The response y is recorded as follows:

y = 1 if fixed bid, and y = 0 if competitive bid

A logistic regression model is obtained to predict the target variable y from the following
31 data samples:
The output from R is as follows:

Call: glm(formula = status ~ numbids + pdiff, family = binomial)

Coefficients:
(Intercept) numbids pdiff
1.4212 -0.7553 0.1122

Degrees of Freedom: 30 Total (i.e. Null); 28 Residual


Residual Deviance: 22.84 AIC: 28.84

Status #bids % diff Status #bids % diff Status #bids % diff
1 4 19.2 1 2 24.1 0 4 -7.1
1 3 3.9 0 9 4.5 0 6 10.6
0 2 -3.0 0 11 16.2 1 6 72.8
0 7 28.7 1 3 11.5 1 2 56.3
0 5 -0.5 0 3 -1.3 0 3 12.9
0 8 34.1 0 10 6.6 1 5 -2.5
0 13 24.2 0 7 2.3 1 3 36.9
0 4 -11.7 1 2 22.1 1 3 10.4
0 2 9.1 0 5 2.0 0 6 12.6
1 5 18.0 0 3 1.5 1 4 27.3
0 10 -8.4

(a) Compute and interpret the odds ratio estimate for the variable number of bidders
(x1 ).
Odds ratio for β1 is e⁻⁰·⁷⁵⁵³ = 0.47. Interpretation: if the number of bids increases
by 1, the odds that the contract is fixed decrease to only about 47% of the original
odds (with the % difference between the lowest bid and the estimated competitive bid
unchanged).
(b) Compute the Null Deviance for this data.
Number of Class 0 samples = 19, number of Class 1 samples = 12. Probability of Class 1 samples = π̂ = 12/31.
Null deviance = (−2) × (12 × ln(12/31) + 19 × ln(19/31)) = 41.38.
(c) What is your prediction for a contract where the number of bids is 3 and the % difference between the lowest bid and the estimated competitive bid is 50?
• Probability of Status = 1 is 1.0/(1.0 + exp(−(1.4212 − 0.7553 × 3 + 0.1122 × 50))) = 0.9916.
• Prediction: Status = fixed.
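The part (c) prediction can be reproduced with the fitted coefficients (a quick check in R):

# Question 5(c): predicted probability for numbids = 3 and pdiff = 50.
b <- c(1.4212, -0.7553, 0.1122)               # intercept, numbids, pdiff
z <- sum(b * c(1, 3, 50))                     # logit
1 / (1 + exp(-z))                             # about 0.9916, so predict Status = fixed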

BT4240

1. (10 points) Consider the minimization problem:

   minimize f(x1, x2) = x1² + x1x2 + 10x2² − 5x1 − 3x2 + 12

   Starting with initial starting point x0 = (1, 1)ᵀ, compute x1 if we
(a) (5 points) apply the gradient descent method with step-size λ = 0.5.
(b) (5 points) apply the Newton’s method.
2. (10 points) Consider the following nonlinear programming problem:

minimize z = (x1 − 1)² + (x2 − 2)²

subject to

−x1 + x2 ≤ 1
x1 + x2 ≤ 2
x1 , x2 ≥ 0

(a) (6 points) Use the KKT conditions to derive an optimal solution.


(b) (4 points) Is the solution you obtain in part (a) a global solution or a local
solution? Explain your answer.
3. (10 points) Consider the quadratic minimization problem:
minimize f(x1, x2, x3, x4) = (1/2)x1² + x2² + x3² + (1/4)x4² + x3 + x4 + 10
subject to

−x1 + x2 ≤ 5
x1 + x 3 + x4 ≤ 2
x1 + x2 − x3 + 3x4 ≤ 4

(a) (5 points) State the KKT optimality conditions for the above problem.
(b) (5 points) State the dual of the above problem.
Note: You do not need to find the optimal solution
4. (10 points) The data below show for a consumer finance company operating in six cities,
the number of competing loan companies operating in the city (X) and the number
per thousand of the company’s loans made in that city that are currently delinquent
(Y ):

i: 1 2 3 4 5 6
Xi : 4 1 2 3 3 4
Yi : 16 5 10 15 13 22

We are modeling the relationship between X and Y as a linear function:

Yi = β0 + β1 Xi

The least squares estimators for (β0 , β1 ) are computed using ridge regression with
penalty parameter λ = 1.
• (5 points) Find the optimal values of parameters (β0 , β1 ).
• (5 points) If we apply the gradient descent method to update the weights, starting
from initial weights (β0 , β1 ) = (1, 0), what will be the new values for (β0 , β1 ) after
one iteration? Set the learning rate α = 0.5.

— END OF PAPER —

1. a) ∇f(x1, x2) = (2x1 + x2 − 5, x1 + 20x2 − 3)ᵀ, so ∇f(1, 1) = (−2, 18)ᵀ.
      x1 = x0 − λ∇f(x0) = (1, 1)ᵀ − 0.5(−2, 18)ᵀ = (2, −8)ᵀ.
   b) H = [[2, 1], [1, 20]], H⁻¹ = (1/39)[[20, −1], [−1, 2]].
      x1 = x0 − H⁻¹∇f(x0) = (1, 1)ᵀ − (1/39)(−58, 38)ᵀ = (97/39, 1/39)ᵀ.
      Since f is quadratic, Newton's method reaches the minimizer in one step.

2. a) ① Primal feasibility: g1(x) = −x1 + x2 − 1 ≤ 0, g2(x) = x1 + x2 − 2 ≤ 0, g3(x) = −x1 ≤ 0, g4(x) = −x2 ≤ 0.
      ② Dual feasibility: λ1, λ2, λ3, λ4 ≥ 0.
      ③ Stationarity: 2(x1 − 1) + λ1(−1) + λ2(1) + λ3(−1) = 0 and 2(x2 − 2) + λ1(1) + λ2(1) + λ4(−1) = 0.
      ④ Complementary slackness: λ1(−x1 + x2 − 1) = 0, λ2(x1 + x2 − 2) = 0, λ3(−x1) = 0, λ4(−x2) = 0.
      ⑤ Take g1 and g2 to be binding: −x1 + x2 = 1 and x1 + x2 = 2 give x1 = 1/2, x2 = 3/2, with λ3 = λ4 = 0.
         Stationarity then gives λ2 − λ1 = 1 and λ1 + λ2 = 1, so λ1 = 0 and λ2 = 1.
      All KKT conditions are satisfied at (1/2, 3/2) with λ1 = λ3 = λ4 = 0 and λ2 = 1.
   b) Global solution: the objective function is convex and all constraints are linear (convex), so the KKT point is a global minimum.

3. a) ① Primal feasibility: g1(x) = −x1 + x2 − 5 ≤ 0, g2(x) = x1 + x3 + x4 − 2 ≤ 0, g3(x) = x1 + x2 − x3 + 3x4 − 4 ≤ 0.
      ② Dual feasibility: λ1, λ2, λ3 ≥ 0.
      ③ Stationarity:
         x1 − λ1 + λ2 + λ3 = 0
         2x2 + λ1 + λ3 = 0
         2x3 + 1 + λ2 − λ3 = 0
         (1/2)x4 + 1 + λ2 + 3λ3 = 0
      ④ Complementary slackness: λ1(−x1 + x2 − 5) = 0, λ2(x1 + x3 + x4 − 2) = 0, λ3(x1 + x2 − x3 + 3x4 − 4) = 0.
   b) The Lagrangian is
      L(x, λ) = (1/2)x1² + x2² + x3² + (1/4)x4² + x3 + x4 + 10 + λ1(−x1 + x2 − 5) + λ2(x1 + x3 + x4 − 2) + λ3(x1 + x2 − x3 + 3x4 − 4).
      For fixed λ, L is a separable quadratic in x: each term of the form ax² + bx is minimized at x = −b/(2a), with minimum value −b²/(4a).
      Minimizing L over x and collecting terms gives the dual problem
      max g(λ1, λ2, λ3) = 10 − 5λ1 − 2λ2 − 4λ3 − (1/2)(λ1 − λ2 − λ3)² − (1/4)(λ1 + λ3)² − (1/4)(1 + λ2 − λ3)² − (1 + λ2 + 3λ3)²
      subject to λ1, λ2, λ3 ≥ 0.

4. • β = (XᵀX + λI)⁻¹XᵀY = (1/103)[[56, −17], [−17, 7]] (81, 261)ᵀ = (99/103, 450/103)ᵀ, i.e. (β0, β1) ≈ (0.96, 4.37) (see Tutorial 2, Question 4).
   • Gradient descent from (β0, β1) = (1, 0) with learning rate α = 0.5:
     β0[1] = β0[0] + α ∑ (yi − (β0[0] + β1[0]xi)) = 1 + 0.5(15 + 4 + 9 + 14 + 12 + 21) = 1 + 0.5(75) = 38.5
     β1[1] = β1[0] + α ∑ (yi − (β0[0] + β1[0]xi)) xi = 0 + 0.5(15·4 + 4·1 + 9·2 + 14·3 + 12·3 + 21·4) = 0.5(244) = 122
     After one iteration, (β0, β1) = (38.5, 122).
BT4240

1. (10 points) Consider the following linearly constrained optimization problem:


maximize f(x) = −ln(x1 + 1) − x2²
subject to
x1 + 2x2 ≤ 3
x1 , x2 ≥ 0
where ln denotes natural logarithm.
(a) Use the KKT conditions to derive an optimal solution.
(b) Is the solution you obtain in part (a) a global solution or a local solution? Explain
your answer.
2. (10 points) Starting from the initial trial solution (x1, x2, x3) = (1, 1, 1), use the gradient descent procedure to solve the following unconstrained problem:
   minimize f(x) = −3x1x2 − 3x2x3 + x1² + 6x2² + x3²
   Set the step size λ = 1/2. You only need to do one iteration of the procedure.
3. (10 points) Consider the linear model:
   E(Y|X) = β1X + β2X²
   Var(Y|X) = Iσ².
   An experiment yields the following data:
   Y = (8, 12, −1, 11, 38, 44)ᵀ,  X = (−1, 1, 0, 1, −2, 2)ᵀ
   Calculate the least squares estimates for the two model parameters β1 and β2.
4. (10 points) Consider the data points in ℝ² as follows:
x1 : (0, 1) d1 = +1 (Class 1)
x2 : (1, 0) d2 = −1 (Class 2)
x3 : (0, −1) d3 = +1 (Class 1)
x4 : (−1, 0) d4 = −1 (Class 2)
Describe how a support vector machine with polynomial kernel (p = 2) would transform
the input space to a higher dimensional feature space such that a linear decision surface
can be constructed.

1. a) ① Convert to a minimization problem:
         min f(x) = ln(x1 + 1) + x2²  subject to  x1 + 2x2 ≤ 3 and x1, x2 ≥ 0.
      ② Primal feasibility: g1(x) = x1 + 2x2 − 3 ≤ 0, g2(x) = −x1 ≤ 0, g3(x) = −x2 ≤ 0.
      ③ Dual feasibility: λ1, λ2, λ3 ≥ 0.
      ④ Stationarity: 1/(x1 + 1) + λ1 − λ2 = 0 and 2x2 + 2λ1 − λ3 = 0.
      ⑤ Complementary slackness: λ1(x1 + 2x2 − 3) = 0, λ2(−x1) = 0, λ3(−x2) = 0.
      Assume g2 and g3 are binding at the solution, so x1 = x2 = 0; then g1 is not binding and λ1 = 0.
      Stationarity gives 1/(0 + 1) − λ2 = 0, so λ2 = 1, and 2(0) + 2(0) − λ3 = 0, so λ3 = 0.
      All conditions are satisfied at (0, 0) with λ1 = 0, λ2 = 1, λ3 = 0.
   b) For the minimization problem, ln(x1 + 1) is concave while x2² is convex. Since x1, x2 ≥ 0, any feasible deviation from (0, 0) would cause the objective function to increase, so even though ln(x1 + 1) is not convex, we have found the global minimum. For the original problem, this means we have found a global maximum.

2. ∇f(x) = (2x1 − 3x2, −3x1 − 3x3 + 12x2, 2x3 − 3x2)ᵀ, so ∇f(1, 1, 1) = (−1, 6, −1)ᵀ.
   x1 = x0 − λ∇f(x0) = (1, 1, 1)ᵀ − (1/2)(−1, 6, −1)ᵀ = (3/2, −2, 3/2)ᵀ.

3. Let X be the design matrix with columns X and X². Then β = (XᵀX)⁻¹XᵀY:
   XᵀX = [[11, 1], [1, 35]],  XᵀY = (27, 359)ᵀ
   β = (1/384)[[35, −1], [−1, 11]] (27, 359)ᵀ = (586/384, 3922/384)ᵀ ≈ (1.526, 10.214)ᵀ.
Matrix derivatives cheat sheet
Kirsty McNaught
October 2017

1 Matrix/vector manipulation
You should be comfortable with these rules. They will come in handy when you want to simplify an
expression before differentiating. All bold capitals are matrices, bold lowercase are vectors.

Rule                            Comments
(AB)ᵀ = BᵀAᵀ                    order is reversed, everything is transposed
(aᵀBc)ᵀ = cᵀBᵀa                 as above
aᵀb = bᵀa                       (the result is a scalar, and the transpose of a scalar is itself)
(A + B)C = AC + BC              multiplication is distributive
(a + b)ᵀC = aᵀC + bᵀC           as above, with vectors
AB ≠ BA                         multiplication is not commutative

2 Common vector derivatives


You should know these by heart. They are presented alongside similar-looking scalar derivatives to help
memory. This doesn’t mean matrix derivatives always look just like scalar ones. In these examples, b is
a constant scalar, and B is a constant matrix.

Scalar derivative          Vector derivative
f(x) → df/dx               f(x) → df/dx
bx → b                     xᵀB → B
bx → b                     xᵀb → b
x² → 2x                    xᵀx → 2x
bx² → 2bx                  xᵀBx → 2Bx
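A quick finite-difference spot check of the last row in R, assuming a symmetric B (for a general B the gradient of xᵀBx is (B + Bᵀ)x, which reduces to 2Bx when B is symmetric):

# Numerical check of d(x'Bx)/dx = 2Bx for a symmetric B.
B <- matrix(c(2, 1, 1, 3), nrow = 2)           # symmetric 2 x 2 matrix
f <- function(x) as.numeric(t(x) %*% B %*% x)
x <- c(0.7, -1.2); eps <- 1e-6
numgrad <- c((f(x + c(eps, 0)) - f(x - c(eps, 0))) / (2 * eps),
             (f(x + c(0, eps)) - f(x - c(0, eps))) / (2 * eps))
numgrad
as.vector(2 * B %*% x)                         # matches the table entry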

For a more comprehensive reference, see https://www.math.uwaterloo.ca/~hwolkowi/matrixcookbook.pdf
