Midterm 1 Notes
1. Unconstrained optimization: gradient descent
• Consider the problem of solving for the minimum of a real-valued function
  min_x f(x)
  where f: ℝⁿ → ℝ.
• We assume that f is differentiable and that an analytical solution in closed form is not possible to find.
• Gradient descent is a first-order optimization algorithm: start with an initial point x0 and iterate
  x_{i+1} = x_i − λ_i ∇f(x_i)
  where λ_i is the step size at iteration i.
• Take steps proportional to the negative of the gradient of the function at the current point x_i.
• Contour lines: the set of lines where the function is at a certain value, i.e. f(x) = c for some constant c.
• The gradient points in the direction that is orthogonal to the contour lines of the function we wish to optimize.

Example 1. Contours of f(x,y) = c for c = {−5, 0, 5, 10, 20}, where
  min_{x,y} f(x,y) = x² + 2y² − 4x + xy + y
• The gradient of f(x,y) is
  ∇_{x,y} f(x,y) = [2x + y − 4, x + 4y + 1]ᵀ
• Suppose we start at (x,y) = (0,0). The function value there is f(0,0) = 0 and the gradient is ∇f(0,0) = [−4, 1]ᵀ.
• Suppose we move along the negative of the gradient with step size λ = ½; the new point is
  [0, 0]ᵀ + ½ [4, −1]ᵀ = [2, −½]ᵀ
• The new function value is x² + 2y² − 4x + xy + y = 4 + ½ − 8 − 1 − ½ = −5.
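A minimal NumPy sketch of this first step, using only the function, gradient and step size quoted in Example 1 above:

```python
import numpy as np

def f(p):
    x, y = p
    return x**2 + 2*y**2 - 4*x + x*y + y

def grad_f(p):
    x, y = p
    return np.array([2*x + y - 4, x + 4*y + 1])

p = np.array([0.0, 0.0])       # starting point (x, y) = (0, 0)
lam = 0.5                      # step size
p_new = p - lam * grad_f(p)    # one gradient descent step
print(p_new, f(p_new))         # expected: [2. -0.5] and -5.0
```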
• Newton's method: given a differentiable function f(x) and an initial starting point x0, if f(x) = g′(x), then the iterates are
  x_{n+1} = x_n − g′(x_n)/g″(x_n)
  for n = 0, 1, 2, … until satisfied.
  Explanation: the line passing through (x_n, f(x_n)) with slope f′(x_n) is y = f(x_n) + f′(x_n)(x − x_n); setting y = 0 gives the next iterate x_{n+1} = x_n − f(x_n)/f′(x_n).
• Secant method: given a continuous function f(x) and two points x_{−1}, x_0, iterate for n = 0, 1, 2, … until satisfied, replacing the derivative in Newton's method by the slope of the line through the two most recent points.
  Explanation: for some functions, the derivative may not be easy to compute or may not be available.
• An additional term (momentum) to remember what happened in the previous iteration is added:
  x_{i+1} = x_i − λ_i ∇f(x_i) + α(x_i − x_{i−1})
  where α > 0 is the momentum parameter; −λ_i ∇f(x_i) is the movement for the current iteration and α(x_i − x_{i−1}) is the movement in the previous iteration scaled by α.

Example 5. Minimize f(x1, x2) = (x1 − 2)² + (x2 − 4)².
Start from x0 = [0, 6]ᵀ with λ = 0.25 and α = 0.5.
  ∇f(x) = [2(x1 − 2), 2(x2 − 4)]ᵀ, so ∇f(x0) = [−4, 4]ᵀ
With no momentum:
  x1 = x0 − λ∇f(x0) = [0, 6]ᵀ − 0.25 [−4, 4]ᵀ = [1, 5]ᵀ
  x2 = x1 − λ∇f(x1) = [1, 5]ᵀ − 0.25 [−2, 2]ᵀ = [1.5, 4.5]ᵀ
With momentum:
  x2 = x1 − λ∇f(x1) + α(x1 − x0) = [1, 5]ᵀ − 0.25 [−2, 2]ᵀ + 0.5 [1, −1]ᵀ = [2, 4]ᵀ
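A short NumPy sketch reproducing Example 5's updates with and without momentum, using the λ, α and starting point given above:

```python
import numpy as np

def grad_f(x):
    # f(x1, x2) = (x1 - 2)^2 + (x2 - 4)^2
    return np.array([2*(x[0] - 2), 2*(x[1] - 4)])

lam, alpha = 0.25, 0.5
x_prev = np.array([0.0, 6.0])             # x0
x_curr = x_prev - lam * grad_f(x_prev)    # x1 = [1, 5]

# x2 with momentum: plain gradient step plus alpha times the previous movement
x_next = x_curr - lam * grad_f(x_curr) + alpha * (x_curr - x_prev)
print(x_curr, x_next)                     # expected: [1. 5.] and [2. 4.]
```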
1. Unconstrained optimization: gradient descent
Steepest descent direction.
• Recall that the Taylor series is a representation of a function f as an infinite sum of terms determined using derivatives of f evaluated at x0.
• The Taylor polynomial of degree n of f: ℝ → ℝ at x0 is defined as
  Tn(x) := Σ_{k=0}^{n} [f^(k)(x0)/k!] (x − x0)^k
  where f^(k)(x0) is the kth derivative of f at x0 and f^(k)(x0)/k! are the coefficients of the polynomial.
• For a function f: ℝⁿ → ℝ, the linear and quadratic approximations of f(x) at xk may be written as
  o fk(x) = f(xk) + ∇f(xk)ᵀ(x − xk)
  o fk(x) = f(xk) + ∇f(xk)ᵀ(x − xk) + ½ (x − xk)ᵀ Hk (x − xk)
  where ∇f(xk) is the gradient and Hk is the Hessian of the function f(x) at xk.
• Writing x = xk + λd, the linear approximation becomes
  fk(xk + λd) = f(xk) + ∇f(xk)ᵀ(λd) = f(xk) + λ ∇f(xk)ᵀd
• A vector d is called a direction of descent of a function f at x if there exists λ > 0 such that
  f(x + λd) < f(x)
• If f is differentiable at x with a nonzero gradient, then d = −∇f(x)/‖∇f(x)‖ is the direction of steepest descent.
• Consider the problem: minimize ∇f(xk)ᵀd subject to ‖d‖ ≤ 1. The solution is d* = −∇f(x)/‖∇f(x)‖.
• Steepest descent algorithm. Initialization step: let ε > 0 be a termination scalar, choose a starting point x0 and let k = 0. At each iteration, update x_{k+1} = x_k + λ d_k.
• The total loss for N data points is
  L(θ) = Σ_{n=1}^{N} Ln(θ), where θ is the vector of parameters of interest.
• For example, the least-squares loss Ln(θ) = (yn − f(xn, θ))² in regression given the independent variables xn and target variable yn.
• Standard gradient descent is a batch optimization method performed using the full training set by updating the vector parameter
  θ_{i+1} = θ_i − γ_i ∇L(θ_i) = θ_i − γ_i Σ_{n=1}^{N} ∇Ln(θ_i)
• When the training set is large and/or no simple formulas exist, evaluating the sum of the gradients becomes very expensive.
• Stochastic approach: we do not compute the gradient precisely, but only compute an approximation of the gradient.
• Mini-batch gradient update: randomly choose a subset of the Ln for the update.
• In the extreme case, one sample is randomly selected and the weights/parameters in the model are then updated.
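A hedged sketch of the mini-batch idea for the least-squares loss above, assuming a small synthetic regression dataset (the data, learning rate and batch size are illustrative choices, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 1000, 3
X = rng.normal(size=(N, D))                     # synthetic inputs (assumed for illustration)
theta_true = np.array([1.0, -2.0, 0.5])
y = X @ theta_true + 0.1 * rng.normal(size=N)   # targets with Gaussian noise

def grad_batch(theta, Xb, yb):
    # gradient of the summed least-squares loss over a mini-batch
    return -2.0 * Xb.T @ (yb - Xb @ theta)

theta = np.zeros(D)
gamma, batch_size = 0.001, 32
for i in range(500):
    idx = rng.choice(N, size=batch_size, replace=False)   # random subset of the data
    theta = theta - gamma * grad_batch(theta, X[idx], y[idx])
print(theta)   # should approach theta_true
```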
2. Constrained optimization and Lagrange multipliers
• One way to solve the constrained optimization problem is to convert it to an unconstrained problem:
  minimize J(x) = f(x) + Σ_{i=1}^{m} I(g_i(x)), where I(z) is an infinite step function:
  I(z) = 0 if z ≤ 0, and ∞ otherwise.
• It is still difficult to find the minimum of J(x).
• Instead, introduce the Lagrangian L(x,λ) = f(x) + Σ_{i=1}^{m} λ_i g_i(x), where λ_i ≥ 0 is the Lagrange multiplier corresponding to each inequality constraint g_i(x) ≤ 0.
• Associated with the original primal problem, we have the dual problem:
  max_λ D(λ) subject to λ ≥ 0,
  where λ ∈ ℝᵐ are the dual variables and D(λ) = min_x L(x,λ).
2. Constrained optimization and Lagrange multipliers
• Note that J(x) = max_{λ≥0} L(x,λ):
  o When x satisfies the constraint g_i(x) ≤ 0, let λ_i = 0 to maximize λ_i g_i(x).
  o When x violates the constraint, that is g_i(x) > 0, we let λ_i = ∞ to maximize λ_i g_i(x).
  o The value of J(x) is therefore f(x) when all the constraints are satisfied, i.e. g_i(x) ≤ 0 for i = 1, 2, …, m, and ∞ when at least one of the constraints is not satisfied.
• Hence min_x J(x) = min_x max_{λ≥0} L(x,λ).
• Let us compare the solutions of the above minmax problem and the maxmin problem max_{λ≥0} min_x L(x,λ).
• We have L(x,λ) = f(x) + Σ_{i=1}^{m} λ_i g_i(x) ≤ f(x) for all feasible x and λ ≥ 0.
• Hence f(x) ≥ L(x,λ) ≥ min_x L(x,λ) = D(λ) for all feasible x and λ ≥ 0.
2. Constrained optimization and Lagrange multipliers
• Weak duality: min_x max_{λ≥0} L(x,λ) ≥ max_{λ≥0} min_x L(x,λ).
  Given x ∈ ℝⁿ that is feasible for the primal problem and λ ≥ 0, the primal objective function value f(x) is greater than or equal to the dual objective function value D(λ).
• If we find the tightest possible lower bound for the primal objective function by finding the optimal λ ≥ 0 such that D(λ) is maximized, we have strong duality:
  min_x max_{λ≥0} L(x,λ) = max_{λ≥0} min_x L(x,λ)    (Recall: D(λ) = min_x L(x,λ).)
• In contrast to the primal problem, which is constrained, the problem min_x L(x,λ) is unconstrained for a given λ: the dual problem provides an alternative for solving the primal problem if we can solve min_x L(x,λ) easily.

Example 1 (Linear programming):
• Consider the constrained optimization problem when the objective function and the constraints are all linear functions:
  Primal LP: min_x cᵀx subject to Ax ≤ b
  where c ∈ ℝⁿ, A ∈ ℝ^{m×n}, b ∈ ℝᵐ.
• The Lagrangian is
  L(x,λ) = f(x) + Σ_{i=1}^{m} λ_i g_i(x) = cᵀx + λᵀ(Ax − b) = (c + Aᵀλ)ᵀx − λᵀb
  where λ ∈ ℝᵐ is the vector of non-negative Lagrange multipliers.
• Taking the derivative of L(x,λ) with respect to x and setting it to 0 yields: c + Aᵀλ = 0.
BT2103/BT3104 (LP duality review):
• Consider the linear program (the dual LP derived above):
  max −λᵀb subject to c + Aᵀλ = 0 (⇔ c + Aᵀλ ≥ 0 and c + Aᵀλ ≤ 0), λ ≥ 0
• The same LP written with only "≤" constraints:
  max −λᵀb subject to [−Aᵀ; Aᵀ] λ ≤ [c; −c], λ ≥ 0
• The dual of the above LP:
  min cᵀy1 − cᵀy2 subject to −A y1 + A y2 ≥ −b, y1 ≥ 0, y2 ≥ 0
• Let x = y1 − y2. We have the following dual LP, which is the original primal:
  min cᵀx subject to Ax ≤ b
• Remark: when the constraints are equality constraints, the corresponding dual variables are unrestricted in sign (can be 0, positive or negative); when the variables are unrestricted in sign, the corresponding dual constraints are equality constraints.

BT2103/BT3104 (LP duality review, continued):
• Suppose we have the linear program:
  min cᵀx ⇔ −max −cᵀx subject to Ax ≤ b, x ∈ ℝⁿ
• Consider now the equivalent LP obtained by writing x = x1 − x2 with non-negative variables:
  −max −cᵀ(x1 − x2) subject to [A  −A] [x1; x2] ≤ b, x1 ≥ 0, x2 ≥ 0
• The dual of the above LP:
  −min bᵀu subject to [Aᵀ; −Aᵀ] u ≥ [−c; c], u ≥ 0,
  i.e. Aᵀu ≥ −c and Aᵀu ≤ −c, so Aᵀu = −c.
• We have the following dual LP:
  −min bᵀu ⇔ max −bᵀu subject to Aᵀu = −c, u ≥ 0
2. Constrained optimization and Lagrange multipliers
Example 1 (Linear programming) continued:
  Primal LP: min_x cᵀx subject to Ax ≤ b
  Dual LP: max_λ −λᵀb subject to c + Aᵀλ = 0, λ ≥ 0, where λ ∈ ℝᵐ.
• Strong duality: suppose x* is feasible for the primal LP, λ* is feasible for the dual LP and cᵀx* = −λ*ᵀb; then x* is the solution of the primal LP and λ* is the solution of the dual LP:
  o Let x be any point feasible for the primal LP; then by weak duality cᵀx ≥ −λ*ᵀb = cᵀx*, so x* must be the solution of the primal LP.
  o Let λ be any point feasible for the dual LP; then by weak duality cᵀx* ≥ −λᵀb, that is, −λ*ᵀb ≥ −λᵀb, so λ* must be the solution of the dual LP.

Example 2 (Quadratic programming):
Consider the quadratic optimization problem
  min_x ½ xᵀQx + cᵀx subject to Ax ≤ b
where c ∈ ℝⁿ, A ∈ ℝ^{m×n}, b ∈ ℝᵐ and Q ∈ ℝ^{n×n} is a symmetric positive definite matrix.
  o The Lagrangian is given by
    L(x,λ) = ½ xᵀQx + cᵀx + λᵀ(Ax − b) = ½ xᵀQx + (c + Aᵀλ)ᵀx − λᵀb
  o The derivative of the Lagrangian with respect to x is ∇x L(x,λ) = Qx + (c + Aᵀλ).
  o Setting the derivative to 0 and solving for x, we obtain x = −Q⁻¹(c + Aᵀλ).
  o Hence, L(x,λ) = ½ xᵀQx + cᵀx + λᵀ(Ax − b) = −½ (c + Aᵀλ)ᵀ Q⁻¹ (c + Aᵀλ) − λᵀb.
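A sketch that evaluates the dual function D(λ) = −½(c + Aᵀλ)ᵀQ⁻¹(c + Aᵀλ) − λᵀb obtained above and checks weak duality at a feasible point; the small Q, c, A, b are made-up illustrative values:

```python
import numpy as np

# Toy instance of  min 1/2 x^T Q x + c^T x  s.t.  A x <= b  (Q symmetric positive definite).
Q = np.array([[2.0, 0.0], [0.0, 4.0]])
c = np.array([-2.0, -4.0])
A = np.array([[1.0, 1.0]])
b = np.array([1.0])

def primal_obj(x):
    return 0.5 * x @ Q @ x + c @ x

def dual_obj(lam):
    # D(lam) = -1/2 (c + A^T lam)^T Q^{-1} (c + A^T lam) - lam^T b
    v = c + A.T @ lam
    return -0.5 * v @ np.linalg.solve(Q, v) - lam @ b

x_feas = np.array([0.5, 0.5])             # feasible: A x = 1 <= b
for lam in [np.array([0.0]), np.array([0.5]), np.array([1.0])]:
    assert dual_obj(lam) <= primal_obj(x_feas) + 1e-12   # weak duality: D(lam) <= f(x)
print(primal_obj(x_feas), dual_obj(np.array([1.0])))
```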
2. Constrained optimization and Lagrange multipliers
Example 2 (Quadratic programming) continued:
• Consider the primal quadratic optimization problem min_x ½ xᵀQx + cᵀx subject to Ax ≤ b, where c ∈ ℝⁿ, A ∈ ℝ^{m×n}, b ∈ ℝᵐ and Q ∈ ℝ^{n×n} is a symmetric positive definite matrix.
• Its dual is max_λ D(λ) = −½ (c + Aᵀλ)ᵀ Q⁻¹ (c + Aᵀλ) − λᵀb subject to λ ≥ 0.

Example 3 (General non-linear optimization):
• Consider the primal problem min_x f(x) subject to g_i(x) ≤ 0, i = 1, 2, …, m; its dual is max_{λ≥0} D(λ), with D(λ) = min_x L(x,λ).
2. Constrained optimization and Lagrange multipliers
Remark 3: problem with non-negative variables:
  min f(x) subject to g_i(x) ≤ 0 for all i = 1, 2, …, m, x ≥ 0
• Consider the equivalent problem in which the non-negativity conditions are written as additional inequality constraints −x ≤ 0.
Complementary slackness:
• The constraint g_i(x) ≤ 0 is said to be binding if it is satisfied as an equality: g_i(x) = 0.
• If the constraint is not binding at x, then there is a positive slack s_i > 0 such that g_i(x) + s_i = 0.
• Slacks and multipliers satisfy s_i ≥ 0 and λ_i ≥ 0.
• If the multiplier λ_i > 0 ⇒ the constraint must be binding: g_i(x) = 0, s_i = 0.
• If the constraint is not binding, g_i(x) < 0 and s_i > 0 ⇒ the multiplier λ_i = 0.

Karush-Kuhn-Tucker (KKT) optimality conditions:
• Given the nonlinear optimization problem:
  min f(x) subject to g_i(x) ≤ 0 for all i = 1, 2, …, m, x ∈ ℝⁿ.
• Necessary conditions: if x is an optimal solution, then there must exist multipliers λ1, λ2, …, λm such that stationarity, primal feasibility, dual feasibility and complementary slackness hold:
  o When λ_i > 0, then to satisfy the complementary slackness condition: g_i(x) = 0.
  o When g_i(x) < 0, then to satisfy the complementary slackness condition: λ_i = 0.
• Sufficient conditions: if x and λ satisfy the necessary conditions, and the functions f(x), g1(x), g2(x), …, gm(x) are convex, then x is the optimal solution of the NLP.
2. Constrained optimization and Lagrange multipliers
Example 5 (Linear programming):
Consider the linear problem with equality constraints and non-negative variables:
  min_x cᵀx
  subject to Ax = b ⇒ Ax − b ≤ 0, −Ax + b ≤ 0
             x ≥ 0 ⇒ −x ≤ 0
  where c ∈ ℝⁿ, A ∈ ℝ^{m×n}, b ∈ ℝᵐ.
o The Lagrangian is given by
  L(x, λ, κ, ϕ) = cᵀx + λᵀ(Ax − b) + κᵀ(−Ax + b) − ϕᵀx = (c + Aᵀλ − Aᵀκ − ϕ)ᵀx − (λ − κ)ᵀb
o The derivative of the Lagrangian with respect to x is ∇x L = c + Aᵀλ − Aᵀκ − ϕ.
o Setting the derivative to 0, we obtain L(x, λ, κ, ϕ) = −(λ − κ)ᵀb.
• Its dual is
  max_{λ,κ,ϕ} −(λ − κ)ᵀb
  subject to c + Aᵀλ − Aᵀκ − ϕ = 0, λ, κ, ϕ ≥ 0,
  where λ, κ ∈ ℝᵐ and ϕ ∈ ℝⁿ.
• Let u = −(λ − κ); then the dual problem is
  max_{u,ϕ} bᵀu subject to c − Aᵀu − ϕ = 0, ϕ ≥ 0, where u ∈ ℝᵐ and ϕ ∈ ℝⁿ
  ⇔ max_u bᵀu subject to Aᵀu ≤ c, where u ∈ ℝᵐ.
ERROR BASED LEARNING: REGRESSION
Outline:
1. Fundamentals of regression models
2. Gradient descent
3. Matrix approach and statistical analysis
4. Extension and variations
5. Maximum likelihood estimation
6. Maximum A Posteriori Estimation
7. Logistic regression
1. Fundamentals of regression models
• Example 1. Error surfaces and plot of f(w0, w1) = 5(w1 − w0)² + (2w1 + w0 + 3)².
• The gradient is
  ∇f(w0, w1) = [∂f/∂w0, ∂f/∂w1]ᵀ
             = [5(2)(−1)(w1 − w0) + 2(2w1 + w0 + 3), 5(2)(w1 − w0) + 2(2)(2w1 + w0 + 3)]ᵀ
             = [18, 36]ᵀ for w[0] = 2, w[1] = 2.

2. Gradient descent
• There are different ways to find a point in the weight space where the derivatives equal zero.
• One way is to apply a guided search algorithm known as gradient descent.
• Lines indicate path that the gradient descent algorithm would take across the error surface from 4 different starting
points to the global minimum.
• Sample ID #1: d = [d[0], d[1], d[2], d[3]]ᵀ = [1, 500, 4, 8]ᵀ
• Rental price = w[0] + w[1] × Size + w[2] × Floor + w[3] × Broadband rate
• Energy rating is not included in the model (for now).
• Rule of thumb: choose some learning rate α in the range [0.000001, 10].
• Choose initial weights randomly in the range [−0.2, 0.2].
• Initial weights: w[0] = −0.146, w[1] = 0.185, w[2] = −0.444, w[3] = 0.119
• Learning rate: α = 0.00000002, number of data samples n = 10
2. Gradient descent
Example (continued). Batch update:
• w[j] ← w[j] + α [Σ_{i=1}^{n} (t_i − wᵀd_i) × d_i[j]]
• When i = 1:
  o Error = t_1 − wᵀd_1 = 320 − 93.26 = 226.74
  o d_1[0] = 1, contribution to errorDelta(D, w[0]): 226.74 × 1 = 226.74
  o d_1[1] = 500, contribution to errorDelta(D, w[1]): 226.74 × 500 = 113370.05
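A sketch of these per-sample quantities with the initial weights listed above; only the first data row appears on the slides, so the full-batch update is indicated only in the final comment:

```python
import numpy as np

w = np.array([-0.146, 0.185, -0.444, 0.119])   # initial weights w[0..3]
d1 = np.array([1.0, 500.0, 4.0, 8.0])          # sample #1: [bias, Size, Floor, Broadband]
t1 = 320.0                                     # its rental price

error = t1 - w @ d1                            # prediction error for sample 1
contributions = error * d1                     # contribution of sample 1 to each errorDelta(D, w[j])
print(error, contributions)

# Full batch rule over a dataset D = (X, t) with learning rate alpha:
#   w <- w + alpha * X.T @ (t - X @ w)
```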
3. Matrix approach and statistical analysis
• Setting the derivative of the sum of squared errors to zero gives the normal equations: −Xᵀ(Y − Xβ) = 0.
• Linear function: f(x, β) = βᵀx. Affine function: f(x, β) = β0 + βᵀx, where β0 is a constant.
• For the office rentals data (columns: 1, Size, Floor, Broadband rate; target: Rental price):

      [1  500   4    8]        [320]
      [1  550   7   50]        [380]
      [1  620   9    7]        [400]
      [1  630   5   24]        [390]
  X = [1  665   8  100]    Y = [385]
      [1  700   4    8]        [410]
      [1  770  10    7]        [480]
      [1  880  12   50]        [600]
      [1  920  14    8]        [570]
      [1 1000   9   24]        [620]

         [ 10     7235     82     286]           [   4555]
  XᵀX =  [7235 5479725  62840  203810]    XᵀY =  [3447725]
         [ 82    62840    772    2395]           [  39770]
         [286   203810   2395  16442 ]           [ 128300]

  β = (XᵀX)⁻¹ XᵀY = [19.5616, 0.5487, 4.9635, −0.0621]ᵀ,
  i.e. w[0] = 19.5616, w[1] = 0.5487, w[2] = 4.9635, w[3] = −0.0621.
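The least squares estimate can be reproduced with a few lines of NumPy by solving the normal equations directly (same X and Y as above):

```python
import numpy as np

# Office-rentals data: columns = [1, Size, Floor, Broadband], target = Rental price.
X = np.array([[1, 500, 4, 8], [1, 550, 7, 50], [1, 620, 9, 7], [1, 630, 5, 24],
              [1, 665, 8, 100], [1, 700, 4, 8], [1, 770, 10, 7], [1, 880, 12, 50],
              [1, 920, 14, 8], [1, 1000, 9, 24]], dtype=float)
Y = np.array([320, 380, 400, 390, 385, 410, 480, 600, 570, 620], dtype=float)

# Solve the normal equations  X^T X beta = X^T Y  instead of forming an explicit inverse.
beta = np.linalg.solve(X.T @ X, X.T @ Y)
print(beta)   # should be close to [19.56, 0.5487, 4.9635, -0.0621]
```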
3. Matrix approach and statistical analysis
Interpreting the multivariate linear regression model.
Output summary from R:

  Coefficients:
               Estimate  Std. Error  t value  Pr(>|t|)
  (Intercept)  19.56156    43.64459    0.448  0.669738
  size          0.54874     0.07945    6.907  0.000455 ***
  floor         4.96355     3.93842    1.260  0.254361
  broadband    -0.06210     0.30486   -0.204  0.845332
  ---
  Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

  Residual standard error: 27.34 on 6 degrees of freedom
  Multiple R-squared: 0.9552, Adjusted R-squared: 0.9328
  F-statistic: 42.65 on 3 and 6 DF, p-value: 0.0001932

• The parameter estimate (weight) for Size is 0.54874: for every square foot increase in office size, we can expect the rental price to go up by 0.5487 Euros (provided all the other variables remain the same).
• Of the 3 descriptive features, only Size is found to have a significant impact on the model based on the t-statistic.
• The parameter estimate for Size is 0.54874, its corresponding std. error is 0.07945, and the t-value = 0.54874/0.07945 = 6.907.
• The observed p-value Pr > |t| is the probability of observing a value of the t statistic that is at least as contradictory to the null hypothesis as the one computed from the data.
• Degrees of freedom for the t-distribution = n − p − 1 = 10 − 3 − 1 = 6.
• For Size the p-value is 0.000455, so we can reject the null hypothesis H0: β1 = 0 at the 5% level of significance and conclude that the weight for Size satisfies β1 ≠ 0.
3. Matrix approach and statistical analysis
• Regression Statistics (spreadsheet summary output):
  Multiple R          0.977348054
  R Square            0.955209219
  Adjusted R Square   0.932813829
  Standard Error      27.3391202
  Observations        10
• Analysis of Variance (ANOVA) Table:
  Source       df   SS          MS          F        Significance F
  Regression    3   95637.935   31879.311   42.6520  0.000193
• R Square is the square of r, the correlation coefficient between X and Y.
• The variability of Y given X (standard deviation σ) is the same regardless of X (homoscedasticity).
• RegressionSS + ResidualSS = TotalSS is true only for the "training data".
3. Matrix approach and statistical analysis
Learning rate decay:
• α0 is the initial learning rate, c is a constant that controls how fast the rate decays, and τ is the current iteration of the gradient descent algorithm.
• The office rental dataset, trained with α0 = 0.25 and c = 100.
4. Extension and variations
Categorical descriptive features:
• Represent n categorical values by a binary string of size n − 1: [A, B, C] → [(0,0), (1,0), (0,1)].
• R model and output: lm(formula = rental ~ size + floor + broadband + Energy)

  Coefficients:
               Estimate   Std. Error  t value  Pr(>|t|)
  (Intercept)  25.08095     34.50268    0.727    0.5075
  size          0.64315      0.09088    7.077    0.0021 **
  floor         0.01672      4.59138    0.004    0.9973
  broadband    -0.13249      0.24992   -0.530    0.6241
  EnergyB     -46.55546     25.51954   -1.824    0.1422
  EnergyC     -42.09485     17.40362   -2.419    0.0729 .
  ---
  Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

  Residual standard error: 20.76 on 4 degrees of freedom
  Multiple R-squared: 0.9828, Adjusted R-squared: 0.9613
  F-statistic: 45.67 on 5 and 4 DF, p-value: 0.001274

• For Energy rating A: Rental = 25.08 + 0.6431 size + 0.0167 floor − 0.1325 broadband
• For Energy rating B: Rental = 25.08 + 0.6431 size + 0.0167 floor − 0.1325 broadband − 46.5555
                              = −21.47 + 0.6431 size + 0.0167 floor − 0.1325 broadband
• For Energy rating C: Rental = 25.08 + 0.6431 size + 0.0167 floor − 0.1325 broadband − 42.09485
                              = −17.01 + 0.6431 size + 0.0167 floor − 0.1325 broadband
• If two office units are identical in size, floor level and broadband rate, the one with an A energy rating is expected to have a rental that is 46.55 Euros per month higher than the one with a B energy rating.

4. Extension and variations
Non-linear relationships:
• Consider the quadratic relationship of the form Y = b + wX + dX². Let Z = X²; then we have Y = b + wX + dZ.
• Consider the Cobb-Douglas production function Q = α C^{β1} L^{β2}, where Q is the output, C is capital and L is labor.
  Taking the log of both sides of the equation: log Q = log α + β1 log C + β2 log L,
  and letting Y = log Q, β0 = log α, X1 = log C, X2 = log L, we have the multiple regression model
  Y = β0 + β1 X1 + β2 X2.
• We can then find the model parameters by the standard linear regression approach.
4. Extension and variations
Ridge regression
• The least squares estimators for β, β = (XᵀX)⁻¹XᵀY, may not exist or they can have very large values.
• Instead of minimizing the sum of squared errors, we impose the ridge constraint:
  min S′(β) = ½ (Y − Xβ)ᵀ(Y − Xβ) subject to Σ_{i=1}^{p} βi² ≤ t
• The above constrained optimization problem is equivalent to:
  min S″(β) = ½ (Y − Xβ)ᵀ(Y − Xβ) + ½ λ Σ_{i=1}^{p} βi² ⇔ min ½ (Y − Xβ)ᵀ(Y − Xβ) + ½ λ ‖β‖²
  where λ > 0 is the penalty/regularization parameter.
• Take the derivative of S″(β) and set it to 0: −Xᵀ(Y − Xβ) + λβ = 0.
• We have XᵀXβ + λβ = XᵀY and β = (XᵀX + λI)⁻¹XᵀY, where I is a p by p identity matrix.
• The parameter estimate βλ for ridge regression, βλ = (XᵀX + λI)⁻¹XᵀY, depends on the value of λ.
• Select the best value of λ by cross validation.
• The descriptive features X should be standardized (mean = 0, std = 1) and the target feature Y centered (mean = 0).
• Ridge regression is applied to overcome serious multi-collinearity problems.

LASSO regression
• When t is sufficiently small (λ is sufficiently large), we can expect many of the parameters βi to be zero ⇒ feature selection (in the illustrated case, β1 = 0 and β2 = 1).
• Unlike original least squares regression and ridge regression, there is no closed form solution for the parameters in LASSO regression.
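A minimal sketch of the ridge estimate on made-up data with two nearly collinear, standardized features (the dataset and λ values are illustrative assumptions):

```python
import numpy as np

def ridge_fit(X, Y, lam):
    """Ridge estimate beta_lambda = (X^T X + lam I)^{-1} X^T Y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ Y)

rng = np.random.default_rng(1)
x1 = rng.normal(size=100)
x2 = x1 + 0.01 * rng.normal(size=100)          # near-duplicate column -> multicollinearity
X = np.column_stack([x1, x2])
X = (X - X.mean(axis=0)) / X.std(axis=0)       # standardize the features
Y = x1 + rng.normal(scale=0.1, size=100)
Y = Y - Y.mean()                               # center the target

print(ridge_fit(X, Y, 0.0))    # ordinary least squares: coefficients can blow up
print(ridge_fit(X, Y, 10.0))   # ridge shrinks them toward more stable values
```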
5. Maximum likelihood estimation
• Maximum likelihood estimation (MLE): define a function of the parameters θ that enables us to find a model that fits the data well.
• Let p(x|θ) be the probability density function that provides the probability of observing data x given θ.
• The maximum likelihood estimator gives us the most likely parameter θ.
• Supervised learning:
  o Pairs of observations (x1,y1), (x2,y2), …, (xN,yN) with xn ∈ ℝᴰ and labels yn ∈ ℝ.
  o Given a vector xn, we want the probability distribution of the label yn: specify the conditional probability distribution of the labels given the observations for the particular parameter setting θ.
  o For each observation-label pair (xn, yn), find p(yn|xn, θ).
• For example, in linear regression we specify a Gaussian likelihood: p(yn|xn, θ) = N(yn | xnᵀθ, σ²), where observation uncertainty is modelled as Gaussian noise εn ∼ N(0, σ²) (see slide page 28).
• Given an input xn, predict the target value yn to be distributed according to the Gaussian (normal) distribution with mean xnᵀθ and variance σ². (In the figure, θ = [α, β]ᵀ and xn = [1, Xn]ᵀ, so N(yn | xnᵀθ, σ²) has mean xnᵀθ and variance σ².)
• Since the observations are independent, the likelihood of the whole dataset factorizes:
  ⇒ recall P(a, b) = P(a ∩ b) = P(a) × P(b) if a and b are independent events.
• "Identically distributed" means that each term in the product is of the same distribution with the same parameters, e.g. N(yn | xnᵀθ, σ²).
• We consider the negative log-likelihood:
  L(θ) = −log p(Y|X, θ) = −log Π_{n=1}^{N} p(yn|xn, θ) = −Σ_{n=1}^{N} log p(yn|xn, θ)
  ⇒ recall log(a × b) = log(a) + log(b).
• Finding θ that maximizes the likelihood function p(Y|X, θ)
  ⇔ minimizing the negative likelihood function
  ⇔ minimizing the negative of the log-likelihood function L(θ) (because log(x) is a strictly increasing function).
• Example: minimizing the negative log-likelihood where p(yn|xn, θ) is N(yn | xnᵀθ, σ²) gives us the linear regression least squares parameter estimates: take the derivative of L(θ) with respect to θ, set the derivative to 0 and solve for θ (answer: see page 20).
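A small sketch, on synthetic data, of the claim that the least squares estimate minimizes the Gaussian negative log-likelihood (the data and σ below are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
N = 200
X = np.column_stack([np.ones(N), rng.normal(size=N)])   # design matrix with a bias column
theta_true = np.array([1.0, 2.0])
sigma = 0.5
y = X @ theta_true + sigma * rng.normal(size=N)

def neg_log_lik(theta):
    # Negative log-likelihood for y_n ~ N(x_n^T theta, sigma^2), up to constant terms.
    r = y - X @ theta
    return (r @ r) / (2 * sigma**2)

theta_ls = np.linalg.solve(X.T @ X, X.T @ y)             # least squares estimate
assert neg_log_lik(theta_ls) <= neg_log_lik(theta_ls + np.array([0.1, -0.1]))
print(theta_ls)
```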
6. Maximum A Posteriori Estimation (MAP)
• If we have prior knowledge about the distribution of the parameters θ, we can update this distribution after observing some data x.
• Bayes' theorem allows us to compute a posterior distribution p(θ|x):
  p(θ|x) = p(x|θ) p(θ) / p(x)
  where p(x|θ) is the likelihood of the data x given θ and p(x) is the probability distribution of the data x.
  (Bayes' formula: P(A|B) = P(A ∩ B)/P(B).)
• Since the distribution p(x) does not depend on θ, finding the maximum of the posterior distribution p(θ|x) is equivalent to maximizing the numerator p(x|θ)p(θ):
  max p(θ|x) ∝ max p(x|θ) p(θ)
• Given the training data (X, Y), to find the MAP estimate θMAP we compute the log-posterior:
  log p(θ|X,Y) = log [ p(Y|X, θ) p(θ) / p(Y|X) ] = log p(Y|X, θ) + log p(θ) + const
  and minimize the negative log-posterior distribution with respect to θ:
  θMAP ∈ argmin_θ { −log p(Y|X, θ) − log p(θ) }
• For example, with a Gaussian prior p(θ) = N(0, b²I), we have the negative log posterior:
  −log p(θ|X,Y) = −log p(Y|X, θ) − log p(θ) + const
                = 1/(2σ²) (Y − Xθ)ᵀ(Y − Xθ) + 1/(2b²) θᵀθ + const   (see page 44)
• Note: consider the prior for a single parameter θ that is Gaussian with mean 0 and variance b²; then
  −log p(θ) = −log ( 1/√(2πb²) ) + ½ (θ/b)² = const + 1/(2b²) θ².
6. Maximum A Posteriori Estimation (MAP)
• Taking the derivative of the negative log posterior
  −log p(θ|X,Y) = −log p(Y|X, θ) − log p(θ) = 1/(2σ²) (Y − Xθ)ᵀ(Y − Xθ) + 1/(2b²) θᵀθ + const
  and setting it to 0, we obtain the MAP estimate
  θMAP = (XᵀX + (σ²/b²) I)⁻¹ XᵀY
  o Compare with the parameter estimates for ridge regression on page 38.

7. Logistic regression
• Logistic regression is a technique for handling binary classification problems within a regression framework.
• The response variables are assumed to have values either 0 or 1.
• Using logistic regression, the posterior probability P(y|x) of the target variable y conditioned on the input x = (X1, X2, …, Xn) is modeled according to the logistic function:
  P(Y = 1 | X1, X2, …, Xn) = exp(β0 + β1X1 + … + βnXn) / [1 + exp(β0 + β1X1 + … + βnXn)]
7. Logistic regression
• Graph of the logistic function f(x) = 1/(1 + e⁻ˣ):
  o f(x) is close to 1 for large positive x
  o f(x) = 0.5 when x = 0
  o Its derivative is f′(x) = f(x)(1 − f(x))
• Hence, 0 ≤ P(Y=1|X1, X2, …, Xn) ≤ 1 and 0 ≤ P(Y=0|X1, X2, …, Xn) ≤ 1.
• The above logistic function is also known as the sigmoid function.
• The ratio of the two conditional probabilities is
  P(Y=1|X1, …, Xn) / P(Y=0|X1, …, Xn) = exp(β0 + β1X1 + … + βnXn)
  and its natural log (ln(x) = loge(x)) is the logit or the log-odds (= z): z = β0 + β1X1 + … + βnXn.
• If Xi is increased by 1, e^{βi} is the odds-ratio: the multiplicative increase in the odds when Xi increases by one (the other variables remaining constant).
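A short numerical check of the sigmoid values and of the derivative identity f′(x) = f(x)(1 − f(x)):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-6, 6, 7)
f = sigmoid(x)
print(f)   # close to 0 for large negative x, 0.5 at x = 0, close to 1 for large positive x

# Compare a central-difference derivative against f(x) * (1 - f(x)).
h = 1e-6
numeric = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)
print(np.max(np.abs(numeric - f * (1 - f))))   # should be roughly 1e-10 or smaller
```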
7. Logistic regression
• A Bernoulli trial is an experiment with 2 possible outcomes, success and failure, with probability of success = π and probability of failure = 1 − π, where 0 ≤ π ≤ 1.
• A Bernoulli random variable Y is defined by the probability mass function:
  PY(y) = 1 − π for y = 0, and PY(y) = π for y = 1.
• Alternatively, it can be defined by the probability mass function: PY(y) = π^y (1 − π)^{1−y} for y = 0, 1.
• The expected value of a Bernoulli random variable Y is E[Y] = 1(π) + 0(1 − π) = π.
• Given a classification problem where the target variable Yi, i = 1, 2, …, n is binary-valued, 0 or 1 with probabilities 1 − πi and πi, the logistic regression model can be stated as finding parameters β0 and β1 such that:
  E[Yi|Xi] = πi = exp(β0 + β1Xi) / [1 + exp(β0 + β1Xi)]   (compare to the Gaussian likelihood for linear regression on page 41)
• Since each Yi is a Bernoulli random variable, where P(Yi = 1|Xi) = πi and P(Yi = 0|Xi) = 1 − πi, the probability distribution function can be represented as follows:
  fi(Yi) = πi^{Yi} (1 − πi)^{1−Yi} for Yi = 0, 1; i = 1, 2, …, n
• Note:
  o It is assumed that there is only one independent variable/descriptive feature: simple logistic regression. The discussion to follow applies to more than one independent variable. Here Yi is the response for the i-th data sample and Xi is the corresponding value of the independent variable.
  o See the equation on page 48, set n = 1.
7. Logistic regression
Maximum likelihood estimation
• The probability distribution function of Yi is:
  fi(Yi) = πi^{Yi} (1 − πi)^{1−Yi} for Yi = 0, 1; i = 1, 2, …, n   (here n is the number of data samples)
• Since the Yi observations are independent, their joint probability function is:
  g(Y1, Y2, …, Yn) = Π_{i=1}^{n} fi(Yi) = Π_{i=1}^{n} πi^{Yi} (1 − πi)^{1−Yi}
• Taking the logarithm of the joint probability function (recall log(a) + log(b) = log(ab) and log(a^N) = N log(a)):
  loge g(Y1, …, Yn) = loge Π πi^{Yi} (1 − πi)^{1−Yi}
                    = loge Π πi^{Yi} + loge Π (1 − πi)^{1−Yi}
                    = loge Π πi^{Yi} + loge Π (1 − πi)^{−Yi} + loge Π (1 − πi)
                    = loge Π [πi/(1 − πi)]^{Yi} + loge Π (1 − πi)
                    = Σ_{i=1}^{n} Yi loge[πi/(1 − πi)] + loge Π (1 − πi)
                    = Σ_{i=1}^{n} Yi (β0 + β1Xi) − Σ_{i=1}^{n} loge(1 + exp(β0 + β1Xi)) = loge L(β0, β1) ⇐ Maximize!
  using πi = exp(β0 + β1Xi)/[1 + exp(β0 + β1Xi)], 1 − πi = 1/[1 + exp(β0 + β1Xi)] and πi/(1 − πi) = exp(β0 + β1Xi).
• Instead of maximizing the joint probability function g(Y1, Y2, …, Yn), we find the maximum likelihood estimates β0, β1 by maximizing the log likelihood function:
  loge L(β0, β1) = Σ_{i=1}^{n} Yi (β0 + β1Xi) − Σ_{i=1}^{n} loge(1 + exp(β0 + β1Xi))
• Let b0 and b1 be the maximum likelihood estimates.
• The fitted logistic response function for a data sample with independent variable value X is:
  π̂ = exp(b0 + b1X) / [1 + exp(b0 + b1X)]   and   1 − π̂ = 1/[1 + exp(b0 + b1X)]
• π̂/(1 − π̂) = exp(b0 + b1X) — this is the odds in favor of Y = 1.
• loge[π̂/(1 − π̂)] = b0 + b1X — this is the logit or the log-odds (= z).
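A hedged sketch of fitting β0, β1 by gradient ascent on the log-likelihood above; the tiny (Xi, Yi) dataset and the learning rate are made up for illustration, and the loop only approximates the maximum likelihood estimates:

```python
import numpy as np

# Made-up (X_i, Y_i) pairs for illustration only.
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
Y = np.array([0, 0, 1, 0, 1, 1])

def log_lik(b0, b1):
    # log L(b0, b1) = sum Y_i (b0 + b1 X_i) - sum log(1 + exp(b0 + b1 X_i))
    z = b0 + b1 * X
    return np.sum(Y * z) - np.sum(np.log1p(np.exp(z)))

b0, b1, lr = 0.0, 0.0, 0.02
for _ in range(20000):
    pi = 1.0 / (1.0 + np.exp(-(b0 + b1 * X)))    # fitted pi_i
    # Gradient of the log-likelihood: sum (Y_i - pi_i) and sum (Y_i - pi_i) X_i.
    b0 += lr * np.sum(Y - pi)
    b1 += lr * np.sum((Y - pi) * X)
print(b0, b1, log_lik(b0, b1))
```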
7. Logistic regression
• The maximum value that can be achieved by a joint probability function
  g(Y1, Y2, …, Yn) = Π_{i=1}^{n} fi(Yi) = Π_{i=1}^{n} πi^{Yi} (1 − πi)^{1−Yi}
  is 1 (best fit).
• The log likelihood function
  loge L(β0, β1) = Σ_{i=1}^{n} Yi (β0 + β1Xi) − Σ_{i=1}^{n} loge(1 + exp(β0 + β1Xi))
  therefore ranges from minus infinity to 0.
• −LL or −2LL is normally reported as a model's fit. The closer it is to 0, the better the fit.
• When there is no independent variable in the model, the maximum likelihood is achieved with
  π̂ = (1/n) Σ_{i=1}^{n} Yi, the proportion of data samples with Yi = 1.

Example 1. A systems analyst studied the effect of computer programming experience on the ability to complete a complex programming task within a specified time. The persons had varying amounts of experience (measured in months). All persons were given the same programming task and their success in the task was recorded: Y = 1 if the task was completed successfully within the allotted time, Y = 0 otherwise.

  Person  Months-Experience  Success
  1       14                 0
  2       29                 0
  3       6                  0
  4       25                 1
  ..      ….                 ….
  23      28                 1
  24      22                 1
  25      8                  1
7. Logistic regression
Example 1 (continued).
• A standard logistic regression package was run on the data and the parameter values found are β0 = −3.0595 and β1 = 0.1615.
• The estimated mean response for i = 1, where X1 = 14 months of experience, is:
  o a = β0 + β1 X1 = −3.0595 + 0.1615 (14) = −0.7985 (logit when X1 = 14)
  o e^a = e^{−0.7985} = 0.4500
  o P(Y=1|X1=14) = 0.4500/(1.0 + 0.4500) = 0.3103
  o The estimated probability that a person with 14 months experience will successfully complete the programming task is 0.3103.
  o The odds in favor of completing the task = 0.3103/(1 − 0.3103) = 0.4500
• Suppose there is another programmer with 15 months experience, i.e. X1 = 15:
  o b = β0 + β1 X1 = −3.0595 + 0.1615 (15) = −0.637 (logit when X1 = 15)
  o e^b = e^{−0.637} = 0.5289
  o P(Y=1|X1=15) = 0.5289/(1.0 + 0.5289) = 0.3459
  o The estimated probability that a person with 15 months experience will successfully complete the programming task is 0.3459.
  o The odds in favor of completing the task = 0.3459/(1 − 0.3459) = 0.5288
  o Comparing the two odds: 0.5288/0.4500 = 1.1753 = e^{0.1615}
  o The odds increase by 17.53% with each additional month of experience.
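The calculations in Example 1 can be checked with a few lines of Python (β0 and β1 taken from the example above):

```python
import math

b0, b1 = -3.0595, 0.1615

def p_success(x):
    z = b0 + b1 * x                          # logit for x months of experience
    return math.exp(z) / (1.0 + math.exp(z))

p14, p15 = p_success(14), p_success(15)
odds14 = p14 / (1 - p14)
odds15 = p15 / (1 - p15)
print(round(p14, 4), round(p15, 4))                       # about 0.3103 and 0.3459
print(round(odds15 / odds14, 4), round(math.exp(b1), 4))  # both about 1.1753
```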
7. Logistic regression
• Target variable (honors):
  o high writing test score if the writing score is greater than or equal to 60 (honors = 1),
  o low writing test score, otherwise (honors = 0).
• The predictor variables used: gender (female), reading test score (reading), and science test score (science).
• The dataset can be downloaded from https://stats.idre.ucla.edu/
• R model and output:

  Call: glm(formula = honors ~ female + reading + science, family = binomial)
  Coefficients:
  (Intercept)     female    reading    science
    -12.77720    1.48250    0.10354    0.09479
  Degrees of Freedom: 199 Total (i.e. Null); 196 Residual
  Null Deviance: 231.3
  Residual Deviance: 160.2    AIC: 168.2

• Estimate for Intercept = β0 = −12.7772
• Estimate for β1 = 1.4825 (corresponds to variable female)
• Estimate for β2 = 0.10354 (corresponds to variable reading)
• Estimate for β3 = 0.09479 (corresponds to variable science)
7. Logistic regression
• Percent Discordant - If the observation with the lower ordered response value has a higher predicted probability than the observation with the higher ordered response value, then the pair is discordant.
• Percent Tied - If a pair of observations with different responses is neither concordant nor discordant, it is a tie.
• Pairs - This is the total number of distinct pairs with one case having a positive response (honcomp = 1) and the other having a negative response (honcomp = 0).
  o The total number of ways the 200 observations can be paired up is 200(199)/2 = 19,900.
• For the null model (no predictors):
  g(Y1, Y2, …, Yn) = Π_{i=1}^{n} πi^{Yi} (1 − πi)^{1−Yi} = (53/200)^{53} × (147/200)^{147}
  loge g(Y1, Y2, …, Yn) = (53) loge(53/200) + (147) loge(147/200) = −115.644 = loge L(β0)
  −2 loge L(β0) = (−2)(−115.644) = 231.388 = Null Deviance
• Number of parameters = 3, number of levels in the dependent variable k = 2.
1. Steepest descent with exact line search, starting from x0 = (1, −1):

   ∇f(x1, x2) = [3x1² − 4x1, 3x2² + 6x2]ᵀ

   ∇f(1, −1) = [−1, −3]ᵀ

   d = −∇f(1, −1) = [1, 3]ᵀ

   x + λd = [1 + λ, −1 + 3λ]ᵀ

   ∇f(x + λd) = [3(1 + λ)² − 4(1 + λ), 3(−1 + 3λ)² + 6(−1 + 3λ)]ᵀ

   f′(λ) = ∇f(x + λd)ᵀ d = 3(1 + λ)² − 4(1 + λ) + 9(−1 + 3λ)² + 18(−1 + 3λ)
         = −10 + 2λ + 84λ² = 0

   Solving 84λ² + 2λ − 10 = 0 for the positive root gives λ1 = 1/3, so

   x1 = x0 + λ1 d = [1, −1]ᵀ + (1/3)[1, 3]ᵀ = [4/3, 0]ᵀ

   ∇f(x1) = [0, 0]ᵀ
2. Use the steepest descent method to approximate the optimal solution to the problem.
   Starting from the point x0 = [0.5, 0.5]ᵀ, show the first few iterations.

   g(x1, x2) = ∇f(x1, x2) = [2x1 − 2x2, 4x2 − 2x1 − 2]ᵀ
The first few iterations:
k x1 x2 g1 g2 f(x1,x2)
1 0.50000 0.50000 0.00000 -1.00000 -0.75000
2 0.50000 0.60000 -0.20000 -0.60000 -0.83000
3 0.52000 0.66000 -0.28000 -0.40000 -0.86480
4 0.54800 0.70000 -0.30400 -0.29600 -0.88690
5 0.57840 0.72960 -0.30240 -0.23840 -0.90402
.....
96 0.99969 0.99981 -0.00024 -0.00015 -1.00000
97 0.99972 0.99982 -0.00022 -0.00013 -1.00000
98 0.99974 0.99984 -0.00020 -0.00012 -1.00000
99 0.99976 0.99985 -0.00019 -0.00011 -1.00000
100 0.99978 0.99986 -0.00017 -0.00011 -1.00000
   Compute the Hessian and its inverse, then update x:

   H(x1, x2) = [ 12x1(x1³ − x2) + 6x1²(3x1²) + 24(x2 − x1)²    −6x1² − 24(x2 − x1)²
                 −6x1² − 24(x2 − x1)²                           2 + 24(x2 − x1)²     ]

   H(0, 1) = [ 24  −24
               −24   26 ]

   H⁻¹(0, 1) = [ 13/24  1/2
                 1/2    1/2 ]

   x1 = x0 − H⁻¹(0, 1) ∇f(0, 1) = [0, 1]ᵀ − [ 13/24  1/2 ; 1/2  1/2 ] [−8, 10]ᵀ = [−2/3, 0]ᵀ
6. Consider the quadratic optimization problem:
   min −15x1 − 30x2 − 4x1x2 + 2x1² + 4x2²
   subject to
   x1 + 2x2 ≤ 30
   x1 ≥ 0
   x2 ≥ 0
   (a) Find the matrix Q and vector c such that the objective function above can be expressed as ½ xᵀQx + cᵀx.
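The question only asks for Q and c; one way to sanity-check a candidate answer numerically is shown below. The candidate Q and c are my own working, not given in the tutorial, chosen so that the cross term −4x1x2 is split across the off-diagonal entries of the symmetric Q:

```python
import numpy as np

# Candidate decomposition (my working, to be checked numerically).
Q = np.array([[4.0, -4.0],
              [-4.0, 8.0]])
c = np.array([-15.0, -30.0])

def objective(x):
    x1, x2 = x
    return -15*x1 - 30*x2 - 4*x1*x2 + 2*x1**2 + 4*x2**2

rng = np.random.default_rng(3)
for _ in range(5):
    x = rng.normal(size=2)
    assert abs(0.5 * x @ Q @ x + c @ x - objective(x)) < 1e-9
print("1/2 x^T Q x + c^T x matches the stated objective")
```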
BT4240 Tutorial # 2, February 4, 2025.
1. A study was undertaken to examine the profit Y per sales dollar earned by a construction
company and its relationship to the size X1 of the construction contract (in hundreds
of thousands of dollars) and the number X2 of years of experience of the construction
supervisor. Data were obtained from a sample of n = 18 construction projects undertaken
by the construction company over the past two years. Below is the ANOVA table of a
regression analysis using the model:
Y = β0 + β1 X1 + β2 X2 + β3 X1² + β4 X1 X2 + e
Analysis of Variance
Source   DF   Sum of Squares   Mean Square   F value   Prob > F
Model     4   77.02982         19.25745      20.440    0.0001
Error    13   12.24796          0.94215
Total    17   89.27778
Parameter Estimates
Variable          DF   Parameter Estimate   Standard Error   T for H0: Parameter = 0   Prob > |T|
INTERCEP           1   19.304957            2.05205668        9.408                    0.0001
CSIZE   X1         1   -1.486602            1.17773388       -1.262                    0.2290
SUPEXP  X2         1   -6.371452            1.04230807       -6.113                    0.0001
CSIZESQ X1*X1      1   -0.752248            0.22523749       -3.340                    0.0063
SIZEEXP X1*X2      1    1.717053            0.25377879        6.766                    0.0001
(a) Test the overall adequacy of the model at the 1% level of significance.
Since the p-value of 0.0001 for the F test is less than 0.01 (= 1%), we reject the null
hypothesis that β1 = β2 = β3 = β4 = 0. The model is adequate.
(b) (2 points) Compute and interpret the value of R2 .
R² = 77.02982/89.27778 = 0.8628, that is, around 86% of the variation in the
profit Y can be explained by the multiple linear regression model.
(c) Explain how the prob-value for testing H0 : β2 = 0 versus Ha : β2 ̸= 0 has been
calculated.
The total area under the t-distribution with DF = 13 with t less than -6.113 and
greater than 6.113 is 0.0001.
(d) By using the prob-value for β3 , determine whether we can reject H0 : β3 = 0 versus
Ha : β3 ̸= 0 at the 5% level of significance.
Since the p-value of 0.0063 is less than 0.05, we reject H0 in favor of Ha .
(e) True or False: the regression model above is not a linear model as it involves X12
and X1 X2 .
False. The model is linear with respect to the model parameters β1 , β2 , β3 , β4 .
2. We are investigating the starting salary of new graduates from the university. A multiple
linear regression with 4 independent variables is generated. The input variables are:
1
CAP (cumulative average point): continuous in the interval [2, 5].
Faculty: categorical variables, Faculty1, Faculty2, Faculty3, Faculty4.
Internship: binary value (0 = no, 1 = yes).
Call:
lm(formula = Salary ~ CAP + as.factor(Faculty) + as.factor(Internship), data=data)
Residuals:
Min 1Q Median 3Q Max
-1297.23 -514.77 -2.33 494.34 1056.74
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2010.9 489.3 4.110 0.000236 ***
CAP 451.3 138.9 3.250 0.002605 **
as.factor(Faculty)2 130.1 313.6 0.415 0.680974
as.factor(Faculty)3 -981.3 334.8 -2.931 0.005997 **
as.factor(Faculty)4 1238.5 305.6 4.052 0.000279 ***
as.factor(Internship)1 629.6 249.9 2.519 0.016622 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 695.9 on 34 degrees of freedom
Multiple R-squared: 0.7048, Adjusted R-squared: 0.6614
F-statistic: 16.23 on 5 and 34 DF, p-value: 3.531e-08
(a) What is the expected starting salary of a student from Faculty1 with CAP = 2.0
and no internship experience?
Ŷ = 2010.9 + 451.3(2) = 2913.5
(b) How do you interpret the coefficient estimate of Faculty4?
Compared to students from Faculty 1, students from Faculty 4 are expected to
have starting salary that is higher by $1238.5 when they have the same CAP and
internship experience.
(c) What can you conclude from the large p-value corresponding to the coefficient of
Faculty2?
The difference in starting salary of students from Faculty 1 and Faculty 2 is not
statistically significant (if they have the same CAP and internship experience).
(d) Explain how the value of Multiple R-squared is computed.
The multiple R-squared is the ratio of the regression sum of squares to the total sum of squares of the data, where
regression sum of squares = model sum of squares = Σᵢ (ŷᵢ − ȳ)².
Explain briefly.
True, since Internship has only two possible values, 0 or 1.
3. You are tasked to build a regression model to predict yi using the input variables
xi,1 , xi,2 , ....xi,p , where i = 1, 2, ..., N . For each data sample i, its importance is re-
flected by the corresponding weight wi > 0. The prediction error is (yi − ŷi ), where
the prediction ŷi is computed as ŷi = β0 + β1 xi,1 + β2 xi,2 + . . . βp xi,p . Find the optimal
parameters β = (β0 , β1 , ..., βp ) that will minimize the sum of weighted squared errors:
N
X
wi (yi − ŷi )2
i=1
(Hint: use matrix notation, let the input data be X and the target variable be Y).
Let W be a diagonal matrix with Wi,i = wi .
Ŷ = Xβ
S(β) = (Y − Xβ)ᵀ W (Y − Xβ)
∂S/∂β = −2Xᵀ W (Y − Xβ) = 0
β = (XᵀWX)⁻¹ XᵀWY
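A small NumPy sketch of the weighted least squares formula just derived; the data and weights are synthetic, generated only for illustration:

```python
import numpy as np

rng = np.random.default_rng(4)
N, p = 50, 2
X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])   # includes the intercept column
beta_true = np.array([1.0, 2.0, -1.0])
y = X @ beta_true + rng.normal(scale=0.5, size=N)
w = rng.uniform(0.5, 2.0, size=N)                            # importance weights w_i > 0

W = np.diag(w)
beta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)             # (X^T W X)^{-1} X^T W Y
print(beta)
```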
4. The data below show for a consumer finance company operating in six cities, the number
of competing loan companies operating in the city (X) and the number per thousand of
the company’s loans made in that city that are currently delinquent (Y ):
i: 1 2 3 4 5 6
Xi : 4 1 2 3 3 4
Yi : 16 5 10 15 13 22
Yi = β0 + β1 Xi
The least squares estimators for β = (β0, β1) are computed using ridge regression with
penalty parameter λ = 1. Find the optimal values of the parameters (β0, β1).
S(β) = ½ (Y − Xβ)ᵀ(Y − Xβ) + (λ/2) βᵀβ
∇S(β) = −Xᵀ(Y − Xβ) + λβ
0 = −XᵀY + XᵀXβ + λβ
(XᵀX + λI) β = XᵀY

    [1 4]
    [1 1]
X = [1 2]        Y = [16, 5, 10, 15, 13, 22]ᵀ
    [1 3]
    [1 3]
    [1 4]

XᵀY = [81, 261]ᵀ

XᵀX + λI = [6 17; 17 55] + [1 0; 0 1] = [7 17; 17 56]

(XᵀX + λI)⁻¹ = (1/103) [56 −17; −17 7]

β = [β0, β1]ᵀ = (1/103) [56 −17; −17 7] [81, 261]ᵀ = [99/103, 450/103]ᵀ
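A quick NumPy check of this ridge calculation (X, Y and λ as in the question):

```python
import numpy as np

X = np.array([[1, 4], [1, 1], [1, 2], [1, 3], [1, 3], [1, 4]], dtype=float)
Y = np.array([16, 5, 10, 15, 13, 22], dtype=float)
lam = 1.0

beta = np.linalg.solve(X.T @ X + lam * np.eye(2), X.T @ Y)
print(beta)                  # should equal [99/103, 450/103]
print(99/103, 450/103)       # about 0.9612 and 4.3689
```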
5. Consider the problem of collusive bidding among road construction contractors. Con-
tractors sometimes scheme to set bid prices higher than the fair (or competitive) price.
Suppose an investigator has obtained information on the bid status (fixed or competi-
tive) for a sample of 31 contracts. Two variables thought to be related to bid status are
also recorded for each contract: number of bidders (x1 ), and the difference between the
lowest bid and the estimated competitive bid where the difference x2 is measured as a
percentage of the estimate. The response y is recorded as follows:
y = 1 if fixed bid, 0 if competitive bid
A logistic regression model is obtained to predict the target variable y from the following
31 data samples:
The output from R is as follows:
Coefficients:
(Intercept) numbids pdiff
1.4212 -0.7553 0.1122
4
Status #bids % diff Status #bids % diff Status #bids % diff
1 4 19.2 1 2 24.1 0 4 -7.1
1 3 3.9 0 9 4.5 0 6 10.6
0 2 -3.0 0 11 16.2 1 6 72.8
0 7 28.7 1 3 11.5 1 2 56.3
0 5 -0.5 0 3 -1.3 0 3 12.9
0 8 34.1 0 10 6.6 1 5 -2.5
0 13 24.2 0 7 2.3 1 3 36.9
0 4 -11.7 1 2 22.1 1 3 10.4
0 2 9.1 0 5 2.0 0 6 12.6
1 5 18.0 0 3 1.5 1 4 27.3
0 10 -8.4
(a) Compute and interpret the odds ratio estimate for the variable number of bidders
(x1 ).
Odds ratio for β1 is e^{−0.7553} = 0.47. Interpretation: if the number of bids increases
by 1, the odds that the contract is fixed decrease to only about 47% of the original odds
(with the % difference between the lowest bid and the estimated competitive bid unchanged).
(b) Compute the Null Deviance for this data.
Number of Class 0 samples = 19, number of Class 1 samples = 12. Probability of Class 1 samples = π = 12/31.
Null deviance = (−2) × (12 × ln(12/31) + 19 × ln(19/31)) = 41.38.
(c) What is your prediction for a contract where the number of bids is 3 and the %
difference between the lowest bid and the estimated competitive bid is 50?
Probability of Status = 1 is 1.0/(1.0 + exp(−1.4212 + 0.7553 × numbids − 0.1122 × pdiff)) = 0.9916.
Prediction: Status = fixed.
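A short check of parts (b) and (c) in Python, using the class counts and coefficients quoted above:

```python
import math

# (b) Null deviance from the class proportions 12/31 and 19/31.
null_dev = -2 * (12 * math.log(12/31) + 19 * math.log(19/31))
print(round(null_dev, 2))        # about 41.38

# (c) Predicted probability that Status = 1 for numbids = 3, pdiff = 50.
z = 1.4212 - 0.7553 * 3 + 0.1122 * 50
p = 1.0 / (1.0 + math.exp(-z))
print(round(p, 4))               # about 0.9916
```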
BT4240
subject to
−x1 + x2 ≤ 1
x1 + x2 ≤ 2
x1 , x2 ≥ 0
−x1 + x2 ≤ 5
x1 + x3 + x4 ≤ 2
x1 + x2 − x3 + 3x4 ≤ 4
(a) (5 points) State the KKT optimality conditions for the above problem.
(b) (5 points) State the dual of the above problem.
Note: You do not need to find the optimal solution
4. (10 points) The data below show for a consumer finance company operating in six cities,
the number of competing loan companies operating in the city (X) and the number
per thousand of the company’s loans made in that city that are currently delinquent
(Y ):
i: 1 2 3 4 5 6
Xi : 4 1 2 3 3 4
Yi : 16 5 10 15 13 22
Yi = β0 + β1 Xi
The least squares estimators for (β0 , β1 ) are computed using ridge regression with
penalty parameter λ = 1.
• (5 points) Find the optimal values of parameters (β0 , β1 ).
• (5 points) If we apply the gradient descent method to update the weights, starting
from initial weights (β0 , β1 ) = (1, 0), what will be the new values for (β0 , β1 ) after
one iteration? Set the learning rate α = 0.5.
— END OF PAPER —
1. a) ∇f(x0) = … , x1 = x0 − λ∇f(x0) = …
   b) x1 = x0 − H⁻¹(x0) ∇f(x0) = …

2. a) ① Primal feasibility:
        g1(x) = −x1 + x2 − 1 ≤ 0
        g2(x) = x1 + x2 − 2 ≤ 0
        g3(x) = −x1 ≤ 0
        g4(x) = −x2 ≤ 0
      ② Dual feasibility: λ1, λ2, λ3, λ4 ≥ 0
      ③ Stationarity:
        ∂f/∂x1 + λ1(−1) + λ2(1) + λ3(−1) = 0
        ∂f/∂x2 + λ1(1) + λ2(1) + λ4(−1) = 0
      ④ Complementary slackness:
        λ1(−x1 + x2 − 1) = 0
        λ2(x1 + x2 − 2) = 0
        λ3(−x1) = 0
        λ4(−x2) = 0
      Taking both inequality constraints as binding (λ1, λ2 > 0, λ3 = λ4 = 0):
        −x1 + x2 = 1 and x1 + x2 = 2.
   b) Global solution, since the objective function is convex and all constraints are linear (convex). Thus the solution is the global minimum.
3. a) ① Primal feasibility:
        g1(x) = −x1 + x2 − 5 ≤ 0
        g2(x) = x1 + x3 + x4 − 2 ≤ 0
        g3(x) = x1 + x2 − x3 + 3x4 − 4 ≤ 0
      ② Dual feasibility: λ1, λ2, λ3 ≥ 0
      ③ Stationarity (one equation per variable x1, …, x4):
        ∂f/∂x1 + λ1(−1) + λ2(1) + λ3(1) = 0
        ∂f/∂x2 + λ1(1) + λ3(1) = 0
        ∂f/∂x3 + λ2(1) + λ3(−1) = 0
        ∂f/∂x4 + λ2(1) + λ3(3) = 0
      ④ Complementary slackness:
        λ1(−x1 + x2 − 5) = 0
        λ2(x1 + x3 + x4 − 2) = 0
        λ3(x1 + x2 − x3 + 3x4 − 4) = 0
   b) ① The Lagrangian is
        L(x, λ) = f(x) + λ1(−x1 + x2 − 5) + λ2(x1 + x3 + x4 − 2) + λ3(x1 + x2 − x3 + 3x4 − 4)
      ② For a quadratic function of the form f(x) = ax² + bx, the minimum is attained at f′(x) = 2ax + b = 0, i.e. x = −b/(2a), with value −b²/(4a).
      ③ Dual function: D(λ1, λ2, λ3) = min_x L(x, λ) = 10 − 5λ1 − 2λ2 − 4λ3 + … , and the dual problem is
        max D(λ1, λ2, λ3) subject to λ1, λ2, λ3 ≥ 0.
4. a) β = (XᵀX + λI)⁻¹ XᵀY = [99/103, 450/103]ᵀ (the same computation as in the tutorial solution above).
   b) One gradient descent iteration starting from (β0, β1) = (1, 0) with learning rate α = 0.5:
      β1[0] = β0[0] + α Σ_i (t_i − (β0[0] + β0[1] X_i)) · X_i[0]
            = 1 + 0.5 [(15)(1) + (4)(1) + (9)(1) + (14)(1) + (12)(1) + (21)(1)] = 1 + 0.5(75) = 38.5
      β1[1] = β0[1] + α Σ_i (t_i − (β0[0] + β0[1] X_i)) · X_i[1]
            = 0 + 0.5 [(15)(4) + (4)(1) + (9)(2) + (14)(3) + (12)(3) + (21)(4)] = 122
      After one iteration, (β0, β1) = (38.5, 122).
BT4240
1. a) ① Convert to a minimization problem:
        min f(x) = ln(x1 + 1) + x2²  subject to  x1 + 2x2 ≤ 3, x1, x2 ≥ 0
      ② Primal feasibility:
        g1(x) = x1 + 2x2 − 3 ≤ 0
        g2(x) = −x1 ≤ 0
        g3(x) = −x2 ≤ 0
      ③ Dual feasibility: λ1, λ2, λ3 ≥ 0
      ④ Stationarity:
        1/(x1 + 1) + λ1(1) + λ2(−1) = 0
        2x2 + λ1(2) + λ3(−1) = 0
      ⑤ Complementary slackness:
        λ1(x1 + 2x2 − 3) = 0
        λ2(−x1) = 0
        λ3(−x2) = 0
      With λ1 = 0 and x1 = x2 = 0:
        1/(0 + 1) − λ2 = 0 ⇒ λ2 = 1
        2(0) + 2(0) − λ3 = 0 ⇒ λ3 = 0
   b) For the minimization problem, ln(x1 + 1) is concave while x2² is convex. Since x1, x2 ≥ 0, any deviation from (0, 0) would cause the objective function to increase, so even though ln(x1 + 1) is not convex, (0, 0) is still the minimum.

2. x1 = x0 − α ∇f(x0) = …

3. β = (XᵀX + λI)⁻¹ XᵀY = …
Matrix derivatives cheat sheet
Kirsty McNaught
October 2017
1 Matrix/vector manipulation
You should be comfortable with these rules. They will come in handy when you want to simplify an
expression before differentiating. All bold capitals are matrices, bold lowercase are vectors.
Rule                        Comments
(AB)ᵀ = BᵀAᵀ                order is reversed, everything is transposed
(aᵀBc)ᵀ = cᵀBᵀa             as above
aᵀb = bᵀa                   (the result is a scalar, and the transpose of a scalar is itself)
(A + B)C = AC + BC          multiplication is distributive
(a + b)ᵀC = aᵀC + bᵀC       as above, with vectors
AB ≠ BA                     multiplication is not commutative

Scalar derivative           Vector derivative
f(x) → df/dx                f(x) → df/dx
bx → b                      xᵀB → B
bx → b                      xᵀb → b
x² → 2x                     xᵀx → 2x
bx² → 2bx                   xᵀBx → 2Bx