
FIX scoring != None for RidgeCV should use unscaled y for evaluation #29842


Merged: 8 commits into scikit-learn:main on Sep 18, 2024

Conversation

@glemaitre (Member) commented Sep 13, 2024

closes #13998
closes #15648

While discussing with @jeromedockes, we recalled having observed something weird in the RidgeCV code. I checked a bit closer and opened this PR to highlight the potential problem.

In RidgeCV, when sample_weight is provided, we scale the data by sqrt(sample_weight):

# rescale X and y by sqrt(sample_weight) when sample weights are provided
if sample_weight is not None:
    X, y, sqrt_sw = _rescale_data(X, y, sample_weight)
else:
    sqrt_sw = np.ones(n_samples, dtype=X.dtype)

The idea is that the mean squared error can be expressed as:

For many linear models, this enables easy support for sample_weight because
(y - X w)' S (y - X w)
with S = diag(sample_weight) becomes
||y_rescaled - X_rescaled w||_2^2
when setting
y_rescaled = sqrt(S) y
X_rescaled = sqrt(S) X
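
This identity is easy to check numerically, e.g. with plain NumPy on made-up data (a standalone illustration, not code from the PR):

import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features = 20, 3
X = rng.normal(size=(n_samples, n_features))
y = rng.normal(size=n_samples)
coef = rng.normal(size=n_features)
sample_weight = rng.uniform(0.5, 2.0, size=n_samples)

# weighted loss: (y - X w)' S (y - X w) with S = diag(sample_weight)
residual = y - X @ coef
weighted_loss = residual @ (sample_weight * residual)

# unweighted squared loss on the rescaled data
sqrt_sw = np.sqrt(sample_weight)
X_rescaled = X * sqrt_sw[:, None]
y_rescaled = y * sqrt_sw
rescaled_loss = np.sum((y_rescaled - X_rescaled @ coef) ** 2)

np.testing.assert_allclose(weighted_loss, rescaled_loss)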

These rescaled (and centered) data are used to optimize the ridge loss. Later in the code, we want to compute a score, which can be an arbitrary metric via a scorer:

# leave-one-out predictions from the GCV closed form, computed on the
# rescaled y, then passed to the scorer as-is
predictions = y - (c / G_inverse_diag)
if self.store_cv_results:
    self.cv_results_[:, i] = predictions.ravel()

score_params = score_params or {}
alpha_score = self._score(
    predictions=predictions,
    y=y,
    n_y=n_y,
    scorer=scorer,
    score_params=score_params,
)

The problem here is that predictions is computed efficiently as given in the GCV paper, but these predictions live in the "scaled" space, and it seems incorrect to evaluate an arbitrary metric in that space. Instead, we should unscale these predictions, as well as the scaled true targets, and compute the metric in the original space.
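
In other words, the scoring step should map predictions and targets back before calling the scorer, roughly along these lines (a sketch only; unscaled_y and unscaled_predictions are hypothetical names and intercept/offset handling is glossed over, so the actual code in this PR may differ):

# sketch: undo the sqrt(sample_weight) scaling before scoring
# (hypothetical names; intercept offset handling is ignored here)
if sample_weight is not None:
    if predictions.ndim > 1:  # multi-output case, y is 2D
        unscaled_predictions = predictions / sqrt_sw[:, None]
        unscaled_y = y / sqrt_sw[:, None]
    else:
        unscaled_predictions = predictions / sqrt_sw
        unscaled_y = y / sqrt_sw
else:
    unscaled_predictions, unscaled_y = predictions, y

alpha_score = self._score(
    predictions=unscaled_predictions,
    y=unscaled_y,
    n_y=n_y,
    scorer=scorer,
    score_params=score_params,
)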

This is what this PR intends to do. I did not add a non-regression test yet (I assume that using MedAE as the metric should lead to some failures) because I wanted to be sure that what I'm saying is correct.

@jeromedockes @ogrisel @lorentzenchr Does the above description make sense to you?

Edit: it seems that this relates to #13998 and #15648.

I should probably check the tests that were written in #15648.
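
For reference, a sketch of the kind of non-regression check this suggests (not the test actually added in this PR): a custom scorer that records the targets it receives, which should be the original, unscaled y once scoring happens in the original space.

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import RidgeCV
from sklearn.metrics import make_scorer

X, y = make_regression(n_samples=30, n_features=5, noise=1.0, random_state=0)
sample_weight = np.random.default_rng(0).uniform(1.0, 5.0, size=X.shape[0])

seen_targets = []

def recording_mae(y_true, y_pred):
    # record the targets the scorer is evaluated on
    seen_targets.append(np.asarray(y_true).ravel().copy())
    return float(np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred))))

scoring = make_scorer(recording_mae, greater_is_better=False)
RidgeCV(alphas=(0.1, 1.0, 10.0), scoring=scoring).fit(X, y, sample_weight=sample_weight)

# with scoring done in the original space, the scorer should see the
# original y rather than sqrt(sample_weight) * y
for y_seen in seen_targets:
    np.testing.assert_allclose(y_seen, y)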

github-actions bot commented Sep 13, 2024

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: e9bd778.

@glemaitre glemaitre marked this pull request as draft September 16, 2024 08:19
@ogrisel (Member) commented Sep 16, 2024

Cross-linking #16298 as it might be related.

@glemaitre glemaitre marked this pull request as ready for review September 17, 2024 21:52
@glemaitre (Member, Author) commented:

I added a test that was failing on main but is passing in this PR, so we should be in better shape. However, I discovered a new bug when dealing with sample_weight when scoring != None and with several targets.

I'm going to open another PR so as not to overload this one.

@glemaitre (Member, Author) commented:

I added a new parametrization to check that we support multioutput properly.
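
For context, a multioutput call exercising that code path could look like the following (a usage sketch only, not the actual parametrization added to the test suite):

from sklearn.datasets import make_regression
from sklearn.linear_model import RidgeCV

# two targets with a non-default scoring metric: exercises the
# multioutput scoring path in the GCV implementation
X, y = make_regression(
    n_samples=40, n_features=6, n_targets=2, noise=1.0, random_state=0
)

ridge = RidgeCV(alphas=(0.1, 1.0, 10.0), scoring="neg_mean_absolute_error")
ridge.fit(X, y)
print(ridge.alpha_)        # a single alpha selected across both targets
print(ridge.coef_.shape)   # (n_targets, n_features) == (2, 6)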

@thomasjpfan thomasjpfan enabled auto-merge (squash) September 18, 2024 17:04
@thomasjpfan thomasjpfan merged commit 69c1d79 into scikit-learn:main Sep 18, 2024
28 checks passed
@lorentzenchr (Member) commented:

@glemaitre Could you fix the typos in the whatsnew entry?

lorentzenchr pushed a commit that referenced this pull request Sep 18, 2024
kbharat1210 pushed a commit to kbharat1210/scikit-learn that referenced this pull request Sep 25, 2024
Successfully merging this pull request may close these issues:

RidgeCV cv_values_ are for preprocessed data: centered and scaled by sample weights.