
[MRG] Fix pass sample weights to final estimator #15773


Merged
merged 33 commits into scikit-learn:master on Dec 9, 2019

Conversation

J-A16
Contributor

@J-A16 J-A16 commented Dec 4, 2019

Reference Issues/PRs

Fixes #13425

What does this implement/fix? Explain your changes.

RANSACRegressor now passes sample_weight to the base estimator when training the final model.
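A minimal sketch of the new behavior (the data and names here are illustrative, using the default LinearRegression base estimator):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import RANSACRegressor

X, y = make_regression(n_samples=200, n_features=2, noise=2.0, random_state=0)
weights = np.random.RandomState(0).uniform(0.5, 2.0, size=y.shape[0])

# The weights given here now also reach the base estimator's fit()
# when the final model is refit on the detected inliers.
reg = RANSACRegressor(random_state=0)
reg.fit(X, y, sample_weight=weights)
print(reg.estimator_.coef_.shape)  # coefficients of the final, weighted fit
```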

Any other comments?

@jnothman
Member

jnothman commented Dec 4, 2019

This requires tests, and the loss functions should probably also be weighted.

@J-A16
Contributor Author

J-A16 commented Dec 4, 2019

I just fixed the line length and pushed to my repository; how do I rerun the tests?

@J-A16
Contributor Author

J-A16 commented Dec 4, 2019

I will look into the issue you sent me.

@J-A16
Contributor Author

J-A16 commented Dec 4, 2019

@jnothman, you said earlier that just passing a custom base_estimator that requires sample_weight would be enough. For this new test, as long as I pass a dummy estimator that requires sample_weight and the fit runs without error, is no specific assert statement necessary?

@jnothman
Member

jnothman commented Dec 4, 2019 via email

@J-A16
Contributor Author

J-A16 commented Dec 4, 2019

How do you propose testing correct handling of weights?



def test_ransac_base_estimator_fit_sample_weight():
    class DummyLinearRegression(LinearRegression):
Member

This estimator is exactly a LinearRegression then, so we could use LinearRegression directly.

Contributor Author

The sample_weight is optional in the original LinearRegression, the point of the dummy is to make it necessary in the call. The old _ransac.py code breaks, as it should, using this test.
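A sketch of what such a dummy could look like (illustrative, not the exact test code): subclassing LinearRegression and raising when sample_weight is missing means the pre-fix final refit, which dropped the weights, fails loudly.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, RANSACRegressor


class DummyLinearRegression(LinearRegression):
    """A LinearRegression whose fit() *requires* sample_weight."""

    def fit(self, X, y, sample_weight=None):
        if sample_weight is None:
            raise ValueError("sample_weight is required")
        return super().fit(X, y, sample_weight=sample_weight)


X, y = make_regression(n_samples=50, n_features=1, noise=1.0, random_state=0)
ransac = RANSACRegressor(DummyLinearRegression(), random_state=0)
# With the old _ransac.py the final refit dropped the weights and this
# raised; with the fix it completes without error.
ransac.fit(X, y, sample_weight=np.ones(y.shape[0]))
```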

Member

OK I see. I think the test below is better since we check the fitted model.

base_estimator = DummyLinearRegression()
ransac_estimator = RANSACRegressor(base_estimator, random_state=0)
n_samples = y.shape[0]
weights = np.ones(n_samples)
Member

I don't think that the test is actually testing anything. We should make sure that passing non-unit weights leads to a final model trained on non-unit weights. One way to do that is to pass a sample_weight with non-unit weights and make sure that some of these weights will be used. Then we can use ransac_estimator.inlier_mask_ to train a model which should give the same results as the fitted model in ransac.
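That suggestion could be sketched as follows (assuming current scikit-learn APIs; this mirrors the idea, not the final test): fit RANSAC with non-unit weights, then refit a plain LinearRegression on the inliers with the same weights and compare the coefficients.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, RANSACRegressor

X, y = make_regression(n_samples=100, n_features=1, noise=1.0, random_state=0)
rng = np.random.RandomState(0)
weights = rng.randint(1, 10, size=y.shape[0]).astype(float)  # non-unit weights

ransac = RANSACRegressor(LinearRegression(), random_state=0)
ransac.fit(X, y, sample_weight=weights)

# Refit on the inliers with the matching weights; the coefficients should
# agree with the final model fitted inside RANSAC.
mask = ransac.inlier_mask_
ref = LinearRegression().fit(X[mask], y[mask], sample_weight=weights[mask])
assert np.allclose(ransac.estimator_.coef_, ref.coef_)
```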

@J-A16
Contributor Author

J-A16 commented Dec 4, 2019

As in where is it?

@J-A16
Contributor Author

J-A16 commented Dec 4, 2019

Found it. Do I just add it to the last version's file?

@glemaitre
Member

@glemaitre When it comes to weighting the loss functions, what do you suggest?

What do you mean?

:mod:`sklearn.linear_model`
...........................

- |Fix| Fixed a bug that made :class:`linear_model.RANSACRegressor` fail when
Member

Suggested change
- |Fix| Fixed a bug that made :class:`linear_model.RANSACRegressor` fail when
- |Fix| Fixed a bug that made :class:`linear_model.RANSACRegressor` failed when

Contributor Author

If we remove all the extra info around RANSACRegressor, the original reads:
Fixed a bug that made RANSACRegressor fail when

The suggestion reads:
Fixed a bug that made RANSACRegressor failed when

The original grammar is correct. Are you saying this is just the standard? Should I change something else?

Member

Sorry my mistake

Member

The bug is not about failing if the estimator requires weights. It's that the weights should have been passed in any case.

"Fixed a bug where sample_weight were not used when fitting the final estimator ..."

Contributor Author

Ah, right. I corrected it.

J-A16 and others added 2 commits December 4, 2019 12:27
Co-Authored-By: Guillaume Lemaitre <[email protected]>
Co-Authored-By: Guillaume Lemaitre <[email protected]>
@J-A16
Contributor Author

J-A16 commented Dec 4, 2019

@glemaitre When it comes to weighting the loss functions, what do you suggest?

What do you mean?

jnothman mentions that the loss functions within RANSACRegressor.fit() should probably be weighted, do you have a suggestion for how I should go about it?

@glemaitre
Member

jnothman mentions that the loss functions within RANSACRegressor.fit() should probably be weighted, do you have a suggestion for how I should go about it?

I see. I would suggest first merging this PR and opening another one to solve the issue with the loss.
Basically, we just need to define a loss that weights the error y - y_pred. The relevant lines are from 285 to 297. Then, in the case where the loss is a callable and sample_weight is not None, we need to check that the callable takes a sample_weight argument; otherwise we should raise an error.

@glemaitre
Member

To be more specific, it would be a diff around this:

diff --git a/sklearn/linear_model/_ransac.py b/sklearn/linear_model/_ransac.py
index 40ebb3a084..50a9fecf29 100644
--- a/sklearn/linear_model/_ransac.py
+++ b/sklearn/linear_model/_ransac.py
@@ -283,20 +283,22 @@ class RANSACRegressor(MetaEstimatorMixin, RegressorMixin,
             residual_threshold = self.residual_threshold
 
         if self.loss == "absolute_loss":
-            if y.ndim == 1:
-                loss_function = lambda y_true, y_pred: np.abs(y_true - y_pred)
-            else:
-                loss_function = lambda \
-                    y_true, y_pred: np.sum(np.abs(y_true - y_pred), axis=1)
+
+            def loss_function(y_true, y_pred, sample_weight=None):
+                sample_weight = np.ones(y_true.shape)
+                error = np.abs(sample_weight * (y_true - y_pred))
+                return error if y.ndim == 1 else np.sum(error, axis=1)
 
         elif self.loss == "squared_loss":
-            if y.ndim == 1:
-                loss_function = lambda y_true, y_pred: (y_true - y_pred) ** 2
-            else:
-                loss_function = lambda \
-                    y_true, y_pred: np.sum((y_true - y_pred) ** 2, axis=1)
+
+            def loss_function(y_true, y_pred, sample_weight=None):
+                sample_weight = np.ones(y_true.shape)
+                error = sample_weight * ((y_true - y_pred) ** 2)
+                return error if y.ndim == 1 else np.sum(error, axis=1)
 
         elif callable(self.loss):
+            # FIXME: check that self.loss takes a `sample_weight`
+            # parameter when sample_weight is not None
             loss_function = self.loss
 
         else:
@@ -373,7 +375,12 @@ class RANSACRegressor(MetaEstimatorMixin, RegressorMixin,
 
             # residuals of all data for current random sample model
             y_pred = base_estimator.predict(X)
-            residuals_subset = loss_function(y, y_pred)
+            if sample_weight is None:
+                residuals_subset = loss_function(y, y_pred)
+            else:
+                residuals_subset = loss_function(
+                    y, y_pred, sample_weight=sample_weight
+                )
 
             # classify data into inliers and outliers
             inlier_mask_subset = residuals_subset < residual_threshold

I would need to think a bit more regarding the testing. But in some way, we want to check some model equivalence or differences.

@J-A16
Contributor Author

J-A16 commented Dec 5, 2019

loss_has_sample_weight = 'sample_weight' in signature(self.loss).parameters
if sample_weight is not None and loss_has_sample_weight:
    loss_function = self.loss
else:
    raise ValueError()

Do we automatically raise a ValueError here if both conditions aren't met?
I would use a function to test for the sample_weight parameter, but I couldn't find a general parameter testing function like has_fit_parameter().

Also, isn't this statement automatically wiping out whatever value sample_weight had?:

sample_weight = np.ones(y_true.shape)

Should it be this?:

if sample_weight is None:
    sample_weight = np.ones(y_true.shape)
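For the missing general helper, something along these lines could work (callable_has_parameter is a hypothetical name; has_fit_parameter only inspects an estimator's fit method):

```python
import inspect

import numpy as np


def callable_has_parameter(func, param):
    # Hypothetical generic analogue of sklearn.utils.validation's
    # has_fit_parameter(), but for arbitrary callables.
    return param in inspect.signature(func).parameters


def abs_loss(y_true, y_pred):
    return np.abs(y_true - y_pred)


def weighted_abs_loss(y_true, y_pred, sample_weight=None):
    if sample_weight is None:
        sample_weight = np.ones_like(y_true)
    return sample_weight * np.abs(y_true - y_pred)


print(callable_has_parameter(abs_loss, "sample_weight"))           # False
print(callable_has_parameter(weighted_abs_loss, "sample_weight"))  # True
```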

@J-A16
Contributor Author

J-A16 commented Dec 5, 2019

Also, is this pull request ready?

Member

@glemaitre glemaitre left a comment

LGTM

@glemaitre
Member

@jnothman do you want to have a look at it? Basically, I would be interested to know whether we should add sample_weight to the loss right now or whether it can be done in a subsequent PR.

@glemaitre
Member

Also, isn't this statement automatically wiping out whatever value sample_weight had?:

yes I made a mistake. Basically this is just to have an idea of what to do :)

Do we automatically raise a ValueError here if both conditions aren't met?

I think that this is fine if the defined loss handles sample_weight but the user does not give one. But again, this is along those lines.

What will be important is to have some proper tests.

Member

@jnothman jnothman left a comment

Otherwise this LGTM, but yes, we need an issue re: the loss functions.

@J-A16
Contributor Author

J-A16 commented Dec 8, 2019

@jnothman, I fixed the what's new entry.

@jnothman
Member

jnothman commented Dec 8, 2019

Please resolve conflicts, ensuring the change log remains in sorted order

@jnothman jnothman merged commit 1c42e79 into scikit-learn:master Dec 9, 2019
@jnothman
Member

jnothman commented Dec 9, 2019

Thanks @J-A16

@glemaitre
Member

@J-A16 Thanks for your efforts.

Do you want to make the next PR to include sample_weight in the loss function? If you lack the time, I can make the PR.

@J-A16
Contributor Author

J-A16 commented Dec 9, 2019

@glemaitre, you mentioned the tests should check for equivalences or differences, so perhaps I should implement the different loss functions in the tests, instantiate different RANSACRegressor objects, each with its own loss function, and assert that the results are the same for corresponding loss functions?

@glemaitre
Member

equivalences or differences

I was thinking about checking that sample_weight will have an impact on the loss and therefore on the final estimator found.

assert that the results are the same for corresponding loss functions?

It could be a start.
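One way to check that impact, sketched with a plain LinearRegression (illustrative data; the eventual test would go through the RANSAC loss instead): heavily upweighting a shifted subset of samples should move the fitted coefficients away from the unit-weight fit.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(0)
X = rng.uniform(size=(100, 1))
y = 3.0 * X.ravel() + rng.normal(scale=0.1, size=100)
y[:10] += 5.0  # a shifted group of samples

unit = LinearRegression().fit(X, y, sample_weight=np.ones(100))
heavy_w = np.where(np.arange(100) < 10, 50.0, 1.0)
heavy = LinearRegression().fit(X, y, sample_weight=heavy_w)

# Upweighting the shifted group changes the fitted solution.
assert not np.allclose(unit.coef_, heavy.coef_)
```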

panpiort8 pushed a commit to panpiort8/scikit-learn that referenced this pull request Mar 3, 2020
Successfully merging this pull request may close these issues.

RANSAC does not pass sample weights to final estimator