FIX apply sample weight to RANSAC residual threshold #23371
Conversation
We also need a test to check that we get the proper results.
Since the semantics of sample_weight could be tricky, I think that we need to add some proper narrative to the API documentation regarding the parameter.
sklearn/linear_model/_ransac.py
Outdated
estimator_name = type(estimator).__name__
if sample_weight is not None and not estimator_fit_has_sample_weight:
    raise ValueError(
        "%s does not support sample_weight. Samples"
Can you use an f-string since we make a change here?
updated
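For context, the requested change might look like the minimal sketch below. The names are stand-ins (the real code validates a fitted estimator), and since the diff above truncates the full message after "Samples", only the visible part is rewritten in f-string form:

```python
class DummyEstimator:
    """Hypothetical stand-in for the estimator being validated."""
    pass

estimator = DummyEstimator()
estimator_name = type(estimator).__name__
# f-string form of the %-formatted message shown in the diff above
message = f"{estimator_name} does not support sample_weight."
```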
sklearn/linear_model/_ransac.py
Outdated
if self.residual_threshold is None:
    # MAD (median absolute deviation)
    residual_threshold = np.median(np.abs(y - np.median(y)))
    residual_threshold = np.median(
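The weighted computation is truncated in the diff, but one possible sample_weight-aware analogue of the default MAD threshold can be sketched with a weighted median. This is a hypothetical illustration in plain Python (using the stdlib `statistics` module instead of NumPy), not the scikit-learn implementation:

```python
from statistics import median

def weighted_median(values, weights):
    """Smallest value whose cumulative weight reaches half the total weight.
    Hypothetical helper, not scikit-learn code."""
    pairs = sorted(zip(values, weights))
    total = sum(weights)
    cum = 0.0
    for v, w in pairs:
        cum += w
        if cum >= total / 2:
            return v
    return pairs[-1][0]

def weighted_mad_threshold(y, sample_weight=None):
    """One possible sample_weight-aware analogue of the default MAD threshold."""
    if sample_weight is None:
        m = median(y)
        return median(abs(v - m) for v in y)
    m = weighted_median(y, sample_weight)
    return weighted_median([abs(v - m) for v in y], sample_weight)
```

With uniform weights this reduces to the unweighted MAD, which is the sanity check one would want in a test.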
I don't think that this solves the original issue.
We wanted to impact the loss such that samples do not all contribute in the same way. Here, we only change what we consider an inlier. Intuitively, if we put weights that discard outliers, then the MAD will shrink in your case and we only make it harder for a sample to qualify as an inlier. However, we should modify the loss as well.
Reading the original discussion, it seems that we want sample_weight between 0 and 1 here, and that we should multiply the loss by 1 - sample_weight.
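The suggestion above could be sketched as follows. This is a hypothetical illustration of the proposed semantics (weights in [0, 1], loss scaled by 1 - sample_weight), not actual scikit-learn code:

```python
def weighted_residuals(y_true, y_pred, sample_weight):
    """Scale each absolute residual by (1 - sample_weight), so that
    high-weight samples incur a smaller loss and are more easily
    qualified as inliers. Hypothetical sketch of the reviewer's idea."""
    return [
        (1.0 - w) * abs(t - p)
        for t, p, w in zip(y_true, y_pred, sample_weight)
    ]
```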
Hi @glemaitre, I've added a test case that checks the residual threshold is calculated correctly with sample weight.
Meanwhile, it seems that modifying the loss with 1 - sample_weight will break the test case test_ransac_fit_sample_weight, which checks that fitting the model with sample weights n1, n2, n3 is equivalent to copying the data n1, n2, n3 times.
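The equivalence that test_ransac_fit_sample_weight relies on can be illustrated on the simplest possible statistic. This is a toy sketch of the usual scikit-learn weight semantics (an integer weight n acts like repeating the sample n times), not the actual test:

```python
def weighted_mean(values, weights):
    """Toy estimator illustrating weight semantics: an integer weight n
    should act exactly like repeating the sample n times."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

# Weighting the first sample by 2 matches duplicating it once.
weighted = weighted_mean([1.0, 4.0], [2, 1])
repeated = weighted_mean([1.0, 1.0, 4.0], [1, 1, 1])
```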
I'm closing this PR because it still does not fix the bug; we should instead open a new PR.
Maybe this approach wasn't a bad idea in retrospect (though I suspect it might not be complete either). In any case, we should check that the usual weight semantics of scikit-learn hold at least in expectation. That is: fitting with an integer sample_weight of n should behave like fitting on the corresponding sample repeated n times.
Since RANSAC uses random sampling at each iteration, we cannot expect this mathematical equivalence to hold exactly, but only in distribution, or at least in expectation. In practice, that means that to test this we need to run multiple simulations with different values of the random seed and assess the impact on the learned prediction function one way or another. Since running iterated simulations is too costly, we should conduct this study in a side notebook during the review of the PR, but we cannot hope to turn it into a regularly run pytest test case, for the sake of sparing CI resources and keeping test times manageable.
Reference Issues/PRs
Addresses #15836.
What does this implement/fix? Explain your changes.
There is another open PR, #15952, that applies the sample weight when calculating residuals_subset; in this PR, the sample weight is instead used to adjust residual_threshold. The reason for this change follows @glemaitre's comment in the original PR: #15952 (comment).
For example, if we set sample_weight to large values for samples that we want included in the model, then in the original PR this results in large residuals and those samples are considered outliers. In this PR, residual_threshold is calculated using a weighted sum, so samples with large weights are more likely to be included in the final model, which I think follows intuition.
Other comments
I'm not 100% sure that adjusting residual_threshold is the correct way to use sample_weight, so it's open for discussion.