
FIX apply sample weight to RANSAC residual threshold #23371


Closed

Conversation

@MaxwellLZH (Contributor) commented May 14, 2022

Reference Issues/PRs

Addresses #15836.

What does this implement/fix? Explain your changes.

There is another open PR, #15952, that applies sample weight when calculating residuals_subset; in this PR, sample weight is instead used to adjust residual_threshold.

The reason for this change follows @glemaitre's comment in the original PR #15952 (comment).

For example, suppose we set sample_weight to large values for samples that we want included in the model. In the original PR, this would inflate their residuals, and those samples would be treated as outliers. In this PR, residual_threshold is computed as a weighted statistic, so samples with large weights are more likely to be included in the final model, which I think follows intuition.
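A minimal sketch of the idea (the helper names `weighted_median` and `weighted_mad_threshold` are invented for illustration, not the actual PR code): with a weighted median, heavily weighted samples pull both the center and the spread estimate toward themselves, so the MAD-based threshold adapts to them.

```python
import numpy as np

def weighted_median(values, weights):
    # Hypothetical helper: value at the midpoint of the weighted CDF.
    order = np.argsort(values)
    values, weights = values[order], weights[order]
    cdf = np.cumsum(weights) / np.sum(weights)
    return values[np.searchsorted(cdf, 0.5)]

def weighted_mad_threshold(y, sample_weight):
    # Weighted analogue of RANSAC's default threshold
    # np.median(np.abs(y - np.median(y))): both medians become weighted.
    center = weighted_median(y, sample_weight)
    return weighted_median(np.abs(y - center), sample_weight)

y = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
# With uniform weights this reduces to the unweighted MAD.
print(weighted_mad_threshold(y, np.ones_like(y)))  # 1.0
```

With uniform weights the sketch reproduces the existing unweighted MAD, which is the property any weighted replacement would need to preserve.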

Other comments

I'm not 100% sure that adjusting residual_threshold is the correct way to use sample_weight, so this is open for discussion.

@glemaitre (Member) left a comment:

We also need a test to check that we get the proper results.

Since the semantics of sample_weight can be tricky, I think we need to add some narrative to the API documentation for this parameter.

```python
estimator_name = type(estimator).__name__
if sample_weight is not None and not estimator_fit_has_sample_weight:
    raise ValueError(
        "%s does not support sample_weight. Samples"
```
Member:

Can you use an f-string since we make a change here?
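A minimal sketch of the reviewer's suggestion (the names `DummyEstimator` and `check_sample_weight_support` are invented for illustration; the real error message continues beyond the truncated snippet above):

```python
class DummyEstimator:
    """Stand-in estimator without sample_weight support (illustration only)."""

def check_sample_weight_support(estimator, sample_weight, supports_sw):
    # f-string form of the error message, as the reviewer asked for
    if sample_weight is not None and not supports_sw:
        estimator_name = type(estimator).__name__
        raise ValueError(
            f"{estimator_name} does not support sample_weight."
        )

try:
    check_sample_weight_support(DummyEstimator(), [1.0], supports_sw=False)
except ValueError as exc:
    print(exc)  # DummyEstimator does not support sample_weight.
```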

@MaxwellLZH (Contributor, author):

updated

```python
if self.residual_threshold is None:
    # MAD (median absolute deviation)
    residual_threshold = np.median(np.abs(y - np.median(y)))
    residual_threshold = np.median(
```
Member:

I don't think that it solves the original issue.

We wanted to impact the loss so that samples do not all contribute the same way. Here, we only change what qualifies as an inlier. Intuitively, if we put weights that discard outliers, then the MAD will shrink, and we only make it harder for a sample to qualify as an inlier. However, we should modify the loss as well.

Reading the original discussion, it seems that we want sample_weight between 0 and 1 and that we should multiply the loss by 1 - sample_weight.

@MaxwellLZH (Contributor, author):

Hi @glemaitre, I've added a test case that checks the residual threshold is calculated correctly with sample weight.

Meanwhile, it seems that modifying the loss with 1 - sample_weight breaks the test case test_ransac_fit_sample_weight, which checks that fitting the model with sample weights n1, n2, n3 is equivalent to copying the data n1, n2, n3 times.
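For a deterministic, weight-aware estimator such as LinearRegression, that repetition semantic holds exactly, which is what makes the test meaningful. A sketch (not the actual test code, and with an invented toy dataset):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(0)
X = rng.randn(20, 1)
y = 3.0 * X.ravel() + rng.randn(20) * 0.1

# Integer sample weights n_i.
w = rng.randint(1, 4, size=20)

# Fit with sample_weight ...
reg_w = LinearRegression().fit(X, y, sample_weight=w)

# ... versus fit on the dataset where sample i is repeated w[i] times.
reg_rep = LinearRegression().fit(np.repeat(X, w, axis=0), np.repeat(y, w))

# Weighted least squares with integer weights equals repetition.
assert np.allclose(reg_w.coef_, reg_rep.coef_)
assert np.allclose(reg_w.intercept_, reg_rep.intercept_)
```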

@glemaitre (Member):
I'm closing this PR because this is still not fixing the bug and we should instead make a new PR.

@ogrisel (Member) commented Mar 27, 2024

Maybe this approach wasn't a bad idea in retrospect (though I suspect it might not be complete either). In any case, we should check that the usual weight semantics of scikit-learn hold at least in expectation. That is:

  • setting null weights to arbitrary samples should be equivalent to training with those samples discarded from the training set;
  • setting weights to 2 for arbitrary samples (while keeping unit weights for the remainder) should be equivalent to unweighted training on an alternative version of the training set where the originally 2-weighted samples are duplicated.

Since RANSAC uses random sampling at each iteration, we cannot expect this mathematical equivalence to hold exactly but instead to only hold in distribution or at least in expectation.

In practice, that means that to test this, we need to run multiple simulations with different values for the random seed and assess the impact on the learned prediction function one way or another.

Since running iterated simulations is too costly, we should conduct this study in a side notebook at the time of the PR review; we cannot hope to turn it into a regularly run pytest case, for the sake of sparing CI resources and keeping test time manageable.
