Description
There is a mistake in the calculation of the information criteria in LassoLarsIC. It seems that, after a long discussion here, the implementation was modified to incorrect behavior.
The issue stems from unclear notation in (Zou et al., 2007), which is taken as the reference. In Stein's, Akaike's, Mallows's, and similar approaches, the noise variance is unfortunately assumed to be known, whereas in practice it must be estimated. One should also be careful that this variance is the conditional variance, not the unconditional one (this only becomes clear if you read Zou's wording very carefully).
The formula used in the current implementation, eq. 2.15 of Zou et al. (2007), is for **known** conditional variance! The code below computes sigma2 as var(y), which is an estimator of the *unconditional* variance, while our case is the "unknown conditional variance" case. So there are two mistakes in this implementation: it estimates the unconditional variance, and it then plugs that estimate into a formula that is only applicable when the conditional variance is known (not estimated).
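To see the difference numerically, here is a minimal sketch on synthetic data (the data and the plain least-squares fit are illustrative assumptions, not part of LassoLarsIC):

```python
import numpy as np

rng = np.random.RandomState(0)
n = 200
X = rng.randn(n, 3)
beta = np.array([2.0, -1.0, 0.5])
y = X @ beta + 0.1 * rng.randn(n)          # true noise std = 0.1

coef, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ coef

print(np.var(y))      # ~ Var(X @ beta) + 0.01: the *unconditional* variance
print(np.var(resid))  # ~ 0.01: the conditional (noise) variance we need
```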
There is no perfect way to estimate the conditional variance, since the true model is unknown. Two approaches are used in practice (both are sketched in code after this list):
- **Mallows's Cp in practice:** estimate it from the most overfit model, https://en.wikipedia.org/wiki/Mallows%27s_Cp : "and $\hat{\sigma}^2$ refers to an estimate of the variance associated with each response in the linear model (estimated on a model containing all predictors)". One important issue here is to apply the bias correction (degrees-of-freedom adjustment) to the variance estimate computed from the unconstrained model, where df is taken to be the total number of predictors including the intercept.
- **Akaike:** use the conditional variance from the maximum-likelihood-estimated model itself, but in that case the criterion must be
  `n_samples * log(mean_squared_error) + K * df` (for Gaussian models only!)
  or
  `-2 * maximum_log_likelihood + K * df` (for generic application).
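For illustration, here is a minimal sketch of both routes using plain NumPy least squares; `mallows_sigma2` and `gaussian_aic` are hypothetical helper names, not scikit-learn API:

```python
import numpy as np

def mallows_sigma2(X, y):
    """Noise variance from the most overfit (full) model, with the
    degrees-of-freedom bias correction; df counts all p predictors
    plus the intercept."""
    n, p = X.shape
    Xc = np.column_stack([np.ones(n), X])
    coef, *_ = np.linalg.lstsq(Xc, y, rcond=None)
    rss = np.sum((y - Xc @ coef) ** 2)
    return rss / (n - (p + 1))             # bias-corrected estimate

def gaussian_aic(y, y_pred, df, K=2):
    """Akaike route for Gaussian models: the conditional variance is
    profiled out, leaving the in-sample MSE of the candidate model."""
    n = len(y)
    mse = np.mean((y - y_pred) ** 2)
    return n * np.log(mse) + K * df
```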
[The original issue attached a book-page screenshot illustrating the difference between these two formulas.]
For small samples from Gaussian models, a better solution is to use AICc instead of AIC, since it adjusts for the estimation bias in the conditional variance. In fact, for the Gaussian case this is the best-performing option whether the sample is small or large, since the correction is universal. This option could be added.
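As a sketch of what that option could look like (hypothetical `gaussian_aicc` helper; the convention that df counts every estimated parameter is an assumption the implementation would have to settle):

```python
import numpy as np

def gaussian_aicc(y, y_pred, df):
    """AIC plus the standard small-sample correction for Gaussian models.
    df is assumed to count all estimated parameters, including the
    intercept and the noise variance."""
    n = len(y)
    aic = n * np.log(np.mean((y - y_pred) ** 2)) + 2 * df
    return aic + 2 * df * (df + 1) / (n - df - 1)  # small-sample correction
```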
In general, the user-supplied criterion options should be GaussianAIC and GaussianBIC, replacing the current incorrect implementation. On top of that it would be possible to add GenericAIC, GenericBIC, and GaussianAICc. The generic ones will be difficult because the model type is not known.
```python
R = y[:, np.newaxis] - np.dot(X, coef_path_)  # residuals
mean_squared_error = np.mean(R ** 2, axis=0)
sigma2 = np.var(y)  # <-- estimates the *unconditional* variance

df = np.zeros(coef_path_.shape[1], dtype=int)  # Degrees of freedom
for k, coef in enumerate(coef_path_.T):
    mask = np.abs(coef) > np.finfo(coef.dtype).eps
    if not np.any(mask):
        continue
    # get the number of degrees of freedom equal to:
    # Xc = X[:, mask]
    # Trace(Xc * inv(Xc.T, Xc) * Xc.T) ie the number of non-zero coefs
    df[k] = np.sum(mask)

self.alphas_ = alphas_
eps64 = np.finfo('float64').eps
self.criterion_ = (n_samples * mean_squared_error / (sigma2 + eps64) +
                   K * df)  # Eqns. 2.15--16 in (Zou et al, 2007)
```
There is one more issue in this code: whether df includes the intercept and the conditional-variance estimate or not. This may not be an immediate problem, because all models being compared use the same "fit_intercept" option, either True or False. But if the behavior is later modified to compare models with an intercept against models without one, these formulas will need serious rework to use the full likelihood. The correct df will also be required if you decide to add AICc.
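For reference, here is a sketch of the full Gaussian log-likelihood at the MLE that such a rework would build on; `gaussian_max_loglik` is a hypothetical helper, and the constant terms it keeps are exactly the ones that start to matter once model structures differ:

```python
import numpy as np

def gaussian_max_loglik(y, y_pred):
    """Full Gaussian log-likelihood evaluated at the MLE, where the
    conditional variance estimate is the in-sample MSE. Note that
    -2 * loglik + K * df equals n * log(mse) + K * df up to additive
    terms that depend only on n."""
    n = len(y)
    mse = np.mean((y - y_pred) ** 2)  # MLE of the conditional variance
    return -0.5 * n * (np.log(2 * np.pi) + np.log(mse) + 1.0)
```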
Conclusion
The correct formula, ignoring irrelevant constants, is
`n_samples * log(mean_squared_error) + K * df`
where log must be the natural logarithm; other log bases are not correct.
And the options need to be renamed from aic, bic to gaussian_aic, gaussian_bic, so that users understand they are only accurate for Gaussian models.
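Concretely, a sketch of how the quoted lines might change, reusing the snippet's own variable names (this is an illustration of the formula, not a tested patch):

```python
eps64 = np.finfo('float64').eps
# Gaussian AIC/BIC with estimated conditional variance:
# K = 2 for gaussian_aic, K = log(n_samples) for gaussian_bic
self.criterion_ = (n_samples * np.log(mean_squared_error + eps64) +
                   K * df)
```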