MRG: RANSAC algorithm #2025
Conversation
Thanks, Johannes! And thanks for contributing clean and well-documented code. I'm not familiar with the algorithm, and generally don't know a lot about meta-estimators such as this. So a few broad comments:
    Training data.
y : numpy array of shape [n_samples, n_targets]
    Target values
estimator_cls : object
This should be estimator
Fixed.
Thanks for the contribution. As @jnothman mentioned, this has to be wrapped as an estimator. Out of curiosity, do you use RANSAC on something other than a linear model? I know that it can work in theory, but I don't think I have ever seen such an application, and there are probably reasons for this, i.e. that the inner estimator must be fast and simple enough.
With regard to @jnothman's comment on 'abs(estimator.predict(X) - y) < t', I believe that you should be using the estimator's 'score' method, if you want to do something general enough.
You would have to call score for every sample individually, and for non-multilabel classification accuracy this would still result in binary values, not a continuous value to be thresholded. Some metrics would work, but none are currently implemented to return per-sample scores/residuals.
Thanks for your feedback. I'm writing this from mobile. I'll address the estimator implementation, and I see how the thresholding fails for multi-dimensional output variables. Calling score for each sample is far from optimal in my opinion. I'd rather suggest implementing a default sample-wise score and providing the ability to pass a score_func for full flexibility on the user side. Tell me what you think of that plan.
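A minimal sketch of that plan, assuming absolute prediction error as the default per-sample residual (the helper names are illustrative, not the PR's actual API):

```python
import numpy as np

def default_residuals(estimator, X, y):
    # Default sample-wise residual for regressors: absolute prediction error.
    return np.abs(estimator.predict(X) - y)

def inlier_mask(estimator, X, y, threshold, score_func=None):
    # Use a user-supplied per-sample residual function if given,
    # otherwise fall back to the absolute-residual default.
    residuals = (score_func or default_residuals)(estimator, X, y)
    return residuals < threshold
```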
rsample_n_inliers = np.sum(rsample_inlier_mask)

# fewer inliers -> skip current random sample
if rsample_n_inliers < best_n_inliers:
You don't handle the case where rsample_n_inliers == best_n_inliers == 0.
I do, see line 117. I can leave the == 0 case as it is, since this can only happen when there is no appropriate inlier sample found yet.
You're likely to get a ValueError before line 117 from estimator.score. By which I mean the score of 0 samples is undefined.
Ah, I assumed the scoring would also work for empty data arrays. Will fix it.
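One way to make the scoring robust to an empty inlier set (a sketch; safe_score is a hypothetical helper, not the PR's code):

```python
def safe_score(estimator, X_inliers, y_inliers):
    # The score of zero samples is undefined; return the worst possible
    # value instead of letting estimator.score raise a ValueError.
    if len(X_inliers) == 0:
        return float("-inf")
    return estimator.score(X_inliers, y_inliers)
```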
Well, this might get a bit messy, and we'll have to see what others think. Support your regressor residual by default, and allow a score_func for full flexibility.
rsample_inlier_y = y[rsample_inlier_mask]

# score of inlier data set
rsample_score = estimator.score(rsample_inlier_X, rsample_inlier_y)
Scoring just the inliers on a bootstrapped model seems a bit unintuitive. May we assume that this can be calculated as a mean of the residuals? If not, do we need to support an arbitrary score_func?
Since the final model is estimated only from the inliers, I tend to prefer scoring only the inliers here. Additionally, a few outliers can strongly distort the score, and the purpose of RANSAC is precisely to robustly estimate models from faulty data. What is your motivation to score all data?
None. I should really have just looked at the paper or reference implementations first :)
The documentation and docstring would benefit from a reference section pointing to the main papers or authoritative online discussion. The Wikipedia article looks like a good start: http://en.wikipedia.org/wiki/RANSAC . The original paper is available online here: http://www.cs.columbia.edu/~belhumeur/courses/compPhoto/ransac.pdf (otherwise paywalled at ACM). It would also be interesting to compare the performance and accuracy against this implementation: http://www.scipy.org/Cookbook/RANSAC and check that there is no performance trick to reuse from it.
At the scipy cookbook, the estimator has a different interface. That implementation uses the per-sample error mean to score the model, and doesn't validate the sample or model. Can you also give a motivating example (here or in the example) for the sample/model validation?
Parameters
----------
X : numpy array or sparse matrix of shape [n_samples, n_features]
    Training data.
Just a cosmetic comment: add blank lines between each of these to be consistent with other documents (mostly 😉)

X : numpy array or sparse matrix of shape [n_samples, n_features]
    Training data.

y : numpy array of shape [n_samples, n_targets]
    Target values
Will do.
Well, once again, before we get all worked up on how to implement all ...
@GaelVaroquaux I have personally used it only in (non-)linear least-squares optimization problems. Nevertheless, I do not see a reason why this should not be beneficial for outlier detection with other estimators. Maybe I can come up with a good example to prove whether there is any benefit for other estimators. In any case, when implemented for arbitrary "base" estimators, it could also be used by people who implement their own estimators (e.g. some least-squares estimators) with the appropriate methods.

@ogrisel I have forgotten to add a decent reference, will fix that. The RANSAC implementation on the SciPy website is far from good, to be honest; actually it is not the real RANSAC algorithm. Just have a look at the two if-statements in the while loop: this may make sense in some cases, but it is an essential modification of the original algorithm. I'm open to also implementing other variants of RANSAC such as MSAC or LO-RANSAC.

@jnothman I agree. I hope I have addressed all your concerns; let me know if I missed one of them ;-)
On the debate of general versus specific, it is not good to strive to be too generic. Yes, the RANSAC framework on paper can be applied to pretty much any estimator, but in practice it is used only in regression settings, and there are good reasons for this. How would you rank the errors in a 2-class classification problem? Importantly, we should avoid implementing a 'base' estimator with the idea that it may be useful to people wanting to do unusual things. There is the common 80/20 rule, and we don't want the complexity of the code base to explode just for corner cases. Given that, I would like:
Any other simplification that can be done is probably welcome. Also, a priority for this PR is to create an Estimator object with fit and predict, following the scikit-learn convention (as described in the contributors guide).
I am pretty busy right now, give me some days to implement this. I'll notify you here!
For non-supervised settings using a covariance model, the MCD (Minimum Covariance Determinant) ...
Which is here: http://scikit-learn.org/dev/developers/#contributing-code
To rank errors in a binary classification one could just use the decision_function output. For multiclass classification, if the classifier supports calibrated probabilistic predictions, it could be based on predict_proba. But I agree with @GaelVaroquaux about the code complexity & maintenance vs. genericity trade-offs. Let's focus on (linear) regression for this PR. We can always add support for other models later if we ever change our mind on the usefulness of less obvious use cases.
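For illustration only, such a per-sample "residual" for a binary classifier could be sketched from the decision function (binary_clf_residuals is a hypothetical helper; the PR deliberately sticks to regression):

```python
import numpy as np

def binary_clf_residuals(clf, X, y):
    # Margin of each sample: large and positive when the classifier is
    # confidently correct, negative when it is wrong (y assumed in {0, 1}).
    margins = clf.decision_function(X) * np.where(y == 1, 1.0, -1.0)
    # Negate so that larger values mean a worse fit, like a residual.
    return -margins
```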
OK, I'll stick with the linear case for this PR. You'll hear from me in the coming days!
/ping I implemented this as an estimator class, let me know if this fits your standards.
/ping
I just ran the example, it looks good. Next steps:
for n_trials in range(self.max_trials):

    # choose random sample set
    random_idxs = np.random.randint(0, n_samples, self.min_n_samples)
Never use the np.random singleton directly in the fit method of an estimator. Please add a random_state argument to the __init__ method and check how other sklearn estimator implementations use the check_random_state utility function in their fit method. To find examples of this pattern, use:

git grep "check_random_state(self.random_state)"
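A sketch of that pattern applied to the subset draw quoted above (draw_random_subset is a hypothetical helper, not the PR's code):

```python
import numpy as np
from sklearn.utils import check_random_state

def draw_random_subset(X, min_n_samples, random_state=None):
    # Turn the user-supplied seed (None, int, or RandomState instance)
    # into a RandomState object instead of touching the global np.random.
    rng = check_random_state(random_state)
    random_idxs = rng.randint(0, X.shape[0], min_n_samples)
    return X[random_idxs]
```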
Also, once this is done, please seed the tests by passing random_state=0 or some other arbitrary fixed seed in any instance of the RANSAC class in the tests, rather than seeding the global singleton rng.
@ogrisel Thanks for your feedback. I hope I made all the necessary changes. Please let me know if anything is still missing in the implementation. If not, I'll add the still-missing documentation.
self.max_trials = max_trials
self.stop_n_inliers = stop_n_inliers
self.stop_score = stop_score
self.random_state = check_random_state(random_state)
In __init__ one should store the raw parameters. Please do this in the fit method instead:

random_state = check_random_state(self.random_state)
# use random_state afterwards in the body of the fit method

so that two consecutive calls to the fit method with a fixed integer random_state param passed at init time will yield the same results.
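Sketched on a stripped-down, hypothetical estimator (ToyRANSAC is illustrative, not the PR's class), the recommended pattern looks like this:

```python
from sklearn.base import BaseEstimator
from sklearn.utils import check_random_state

class ToyRANSAC(BaseEstimator):
    def __init__(self, max_trials=100, random_state=None):
        # Store the parameters untouched so that get_params/set_params
        # round-trip and repeated fits with an integer seed are reproducible.
        self.max_trials = max_trials
        self.random_state = random_state

    def fit(self, X, y):
        # Convert the seed to a RandomState once per fit call.
        random_state = check_random_state(self.random_state)
        ...  # use random_state for all sampling in the RANSAC loop
        return self
```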
This is what caused the test failure:
https://travis-ci.org/scikit-learn/scikit-learn/builds/8829071#L2771
ransac_estimator.fit(X, y)

assert ransac_estimator.score(X[2:], y[2:]) == 1
assert ransac_estimator.score(X[:2], y[:2]) < 1
(nitpick) In sklearn.utils.testing, there are assert_equal and assert_less helpers that you can use.
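Applied to the quoted assertions (reusing ransac_estimator, X, y from the test above), that would read:

```python
from sklearn.utils.testing import assert_equal, assert_less

assert_equal(ransac_estimator.score(X[2:], y[2:]), 1)
assert_less(ransac_estimator.score(X[:2], y[:2]), 1)
```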
I am 👍 to merge. Thanks @ahojnnes for your patience.
Thanks for going through the review. The implementation benefitted quite a lot from it!
Coverage remained the same when pulling 0cf77aacd070ef575097f71be0368fbecefe7ecd on ahojnnes:ransac into 02e0267 on scikit-learn:master.
+1 for merging as well once the comment on ... is addressed.
Done.
As it looks ready for merge, @ahojnnes can you please do one last thing? Add an entry for it in the what's new Changelog.
Hm, sorry for bringing up one more thing. There is some low-hanging fruit in the coverage:
Can you have a look at it?
Indeed: here is a summary of the coverage of the ransac module:
I think it should be possible to get 100% coverage on this module.
@@ -53,6 +53,9 @@ Changelog

- Add multi-output support to :class:`gaussian_process.GaussianProcess`
  by John Novak.

- Added :class:`linear_model.RANSACRegressor` meta-estimator for the robust
  fitting of regression models. By `Johannes Schönberger`_.
You need to add your home page URL at the bottom of the file to get this link to work.
Should be fully covered with tests apart from the following line: https://coveralls.io/files/69816944#L225 . Not sure which of the regressors returns one-dimensional y, but I remember that I intentionally added that test because it once broke the ...
Looking forward to testing this nifty new little module in master!
+1 for merging on my side.
+1 for merging
Merging by rebase to resolve the what's new conflict manually.
This is my first contribution to scikit-learn. Please let me know if I meet all your coding conventions. I hope you find the implementation useful.
I am not aware of all the different estimator implementations, so I am not sure whether this function is universally applicable to all of them.