FIX apply sample weight to RANSAC residual threshold #23371
Conversation
We also need a test to check that we get the proper results.
Since the semantics of sample_weight could be tricky, I think that we need to add some proper narrative to the API documentation regarding the parameter.
sklearn/linear_model/_ransac.py
Outdated
estimator_name = type(estimator).__name__
if sample_weight is not None and not estimator_fit_has_sample_weight:
    raise ValueError(
        "%s does not support sample_weight. Samples"
Can you use an f-string since we make a change here?
updated
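For context, the requested change might look like the minimal sketch below. The names are stand-ins (the real code validates a fitted estimator), and since the diff above truncates the full message after "Samples", only the visible part is rewritten in f-string form:

```python
class DummyEstimator:
    """Hypothetical stand-in for the estimator being validated."""
    pass

estimator = DummyEstimator()
estimator_name = type(estimator).__name__
# f-string form of the %-formatted message shown in the diff above
message = f"{estimator_name} does not support sample_weight."
```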
sklearn/linear_model/_ransac.py
Outdated
if self.residual_threshold is None:
    # MAD (median absolute deviation)
    residual_threshold = np.median(np.abs(y - np.median(y)))
    residual_threshold = np.median(
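The weighted computation is truncated in the diff, but one possible sample_weight-aware analogue of the default MAD threshold can be sketched with a weighted median. This is a hypothetical illustration in plain Python (using the stdlib `statistics` module instead of NumPy), not the scikit-learn implementation:

```python
from statistics import median

def weighted_median(values, weights):
    """Smallest value whose cumulative weight reaches half the total weight.
    Hypothetical helper, not scikit-learn code."""
    pairs = sorted(zip(values, weights))
    total = sum(weights)
    cum = 0.0
    for v, w in pairs:
        cum += w
        if cum >= total / 2:
            return v
    return pairs[-1][0]

def weighted_mad_threshold(y, sample_weight=None):
    """One possible sample_weight-aware analogue of the default MAD threshold."""
    if sample_weight is None:
        m = median(y)
        return median(abs(v - m) for v in y)
    m = weighted_median(y, sample_weight)
    return weighted_median([abs(v - m) for v in y], sample_weight)
```

With uniform weights this reduces to the unweighted MAD, which is the sanity check one would want in a test.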
I don't think that this solves the original issue.
We wanted to impact the loss such that samples do not all contribute in the same way. Here, we only change what we consider an inlier. Intuitively, if we put weights that discard outliers, then the MAD will shrink in your case and we only make it harder for a sample to qualify as an inlier. However, we should modify the loss as well.
Reading the original discussion, it seems that we want sample_weight between 0 and 1 here, and that we should multiply the loss by 1 - sample_weight.
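The suggestion above could be sketched as follows. This is a hypothetical illustration of the proposed semantics (weights in [0, 1], loss scaled by 1 - sample_weight), not actual scikit-learn code:

```python
def weighted_residuals(y_true, y_pred, sample_weight):
    """Scale each absolute residual by (1 - sample_weight), so that
    high-weight samples incur a smaller loss and are more easily
    qualified as inliers. Hypothetical sketch of the reviewer's idea."""
    return [
        (1.0 - w) * abs(t - p)
        for t, p, w in zip(y_true, y_pred, sample_weight)
    ]
```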
Hi @glemaitre, I've added a test case that checks the residual threshold is calculated correctly with sample weight.
Meanwhile, it seems that modifying the loss with 1 - sample_weight will break the test case test_ransac_fit_sample_weight, which checks that fitting the model with sample weights n1, n2, n3 is equivalent to copying the data n1, n2, n3 times.
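The equivalence that test_ransac_fit_sample_weight relies on can be illustrated on the simplest possible statistic. This is a toy sketch of the usual scikit-learn weight semantics (an integer weight n acts like repeating the sample n times), not the actual test:

```python
def weighted_mean(values, weights):
    """Toy estimator illustrating weight semantics: an integer weight n
    should act exactly like repeating the sample n times."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

# Weighting the first sample by 2 matches duplicating it once.
weighted = weighted_mean([1.0, 4.0], [2, 1])
repeated = weighted_mean([1.0, 1.0, 4.0], [1, 1, 1])
```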
I'm closing this PR because it still does not fix the bug; we should instead open a new PR.
Maybe this approach wasn't a bad idea in retrospect (though I suspect it might not be complete either). In any case, we should check that the usual weight semantics of scikit-learn hold at least in expectation. That is: fitting with an integer sample_weight of n should behave like fitting on the corresponding sample repeated n times.
Since RANSAC uses random sampling at each iteration, we cannot expect this mathematical equivalence to hold exactly, but only in distribution, or at least in expectation. In practice, that means that to test this we need to run multiple simulations with different values of the random seed and assess the impact on the learned prediction function one way or another. Since running iterated simulations is too costly, we should conduct this study in a side notebook during the review of the PR, but we cannot hope to turn it into a regularly run pytest test case, for the sake of sparing CI resources and keeping test times manageable.
Reference Issues/PRs
Addresses #15836.
What does this implement/fix? Explain your changes.
There is another open PR, #15952, that applies the sample weight when calculating residuals_subset; in this PR, the sample weight is instead used to adjust residual_threshold. The reason for this change follows @glemaitre's comment in the original PR: #15952 (comment).
For example, if we set sample_weight to large values for samples that we want included in the model, then in the original PR this results in large residuals and those samples are considered outliers. In this PR, residual_threshold is calculated using a weighted sum, so samples with large weights are more likely to be included in the final model, which I think follows intuition.
Other comments
I'm not 100% sure that adjusting residual_threshold is the correct way to use sample_weight, so it's open for discussion.