MRG: RANSAC algorithm #2025

ahojnnes · 2013-06-03T19:12:52Z

This is my first contribution to scikit-learn. Please, let me know if I meet all you coding conventions. I hope you find the implementation useful.

I am not aware of all the different estimator implementations, so I am not sure if this function is universally applicable to all the different estimators.

jnothman · 2013-06-04T02:47:21Z

Thanks, Johannes! And thanks for contributing clean and well-documented code.

I'm not familiar with the algorithm, and generally don't know a lot about meta-estimators such as this. So a few broad comments:

Add a citation to the documentation (in part so others can decide if it has sufficient impact to be appropriate for scikit-learn).
It'll need a home other than sklearn.utils._ransac (not in utils, and not with an underscore preceding it). It's a lot like some of the ensembles, but its decisions are not made by ensemble.
It'll need to be encapsulated as an estimator class, which means:
- It will delegate its predict, predict_proba, decision_function, score, transform, inverse_transform methods to the best estimator selected by its fit method. See for instance sklearn.feature_selection.RFECV. Doing so allows your meta-estimator to be used in pipelines, model selection routines, etc.
- Your function will need to return the best estimator, fitted, as well as the mask. In order to do so, you should use clone from sklearn.base before fitting each estimator. clone will copy the estimator unfitted.
- You should come up with reasonable default values for min_n_samples and residual_threshold, as we generally like users to be able to pull our estimators out of the library and run them without detailed configuration.
Using abs(estimator.predict(X) - y) < t will work well for regression, and afaik, poorly for binary (but you could use estimator.decision_function), multiclass or multilabel classification. I wonder whether absolute difference is the only option, or whether accepting an arbitrary per-sample score function is appropriate. But again, I am not familiar with the algorithm or theory.

jnothman · 2013-06-04T02:47:38Z

sklearn/utils/_ransac.py

+        Training data.
+    y : numpy array of shape [n_samples, n_targets]
+        Target values
+    estimator_cls : object


This should be estimator

GaelVaroquaux · 2013-06-04T04:57:38Z

Thanks for the contribution. As @jnothman mentionned, this has to be wrapped as an estimator.

Out of curiosity, do you use RANSAC on something different than a linear model. I know that it can in theory, but I don't think that I have ever seen such an application, and they are probably reasons for this, i.e. that the inner estimator must be fast and simple enough.

GaelVaroquaux · 2013-06-04T05:22:35Z

With regards to @jnothman 's comment on 'abs(estimator.predict(X) - y) < t', I believe that you should be using the estimator's 'score' method, if you want to do something general enough.

jnothman · 2013-06-04T05:33:26Z

I believe that you should be using the estimator's 'score' method, if you want to do something general enough.

You would have to call score for every sample individually, and for non-multilabel classification accuracy this would still result in binary values, not a continuous value to be thresholded. Some metrics would work, but none are currently implemented to return per-sample scores/residuals.

ahojnnes · 2013-06-04T06:58:58Z

Thanks for your feedback. I'm writing this from mobile.

I'll address the estimator implementation an I see how the thresholding fails for multi-dimensional output variables. Calling score for each sample is far from optimal in my opinion. I'd rather suggest to implement a default sample-wise score implementation and provide the ability to pass a score_func for full flexibility on the user side. Tell me, what you think of that plan.

jnothman · 2013-06-04T07:30:51Z

sklearn/utils/_ransac.py

+        rsample_n_inliers = np.sum(rsample_inlier_mask)
+
+        # less inliers -> skip current random sample
+        if rsample_n_inliers < best_n_inliers:


You don't handle the case where rsample_n_inliners == best_n_inliers == 0.

I do, see line 117. I can leave the == 0 since this can only happen when there is no appropriate inlier sample found yet.

You're likely to get a ValueError before 117 from estimator.score. By which I mean the score of 0 samples is undefined.

Ah, I assumed the scoring would also work for empty data arrays. Will fix it.

jnothman · 2013-06-04T07:35:38Z

Well, this might get a bit messy, and we'll have to see what others think. Support your regressor residual by default, and allow a residual_func : (est, X, y) -> float array of shape y.shape. I guess you similarly should support scoring : (est, X, y) -> float (or scoring may be a string) as sklearn.grid_search.BaseSearchCV does to replace estimator.score.

jnothman · 2013-06-04T07:39:05Z

sklearn/utils/_ransac.py

+        rsample_inlier_y = y[rsample_inlier_mask]
+
+        # score of inlier data set
+        rsample_score = estimator.score(rsample_inlier_X, rsample_inlier_y)


Scoring just the inliers on a bootstrapped model seems a bit unintuitive. May we assume that this can be calculated as a mean of the residuals? If not, do we need to support an arbitrary score_func?

Since the final model is estimated only from the inliers, I tend to prefer scoring only inliers here. Additionally, few outliers can strongly distort the score, but the purpose of RANSAC actually is to robustly estimate models from faulty data. What is your motivation to score all data?

None. I should really have just looked at the paper or reference implementations first :)

ogrisel · 2013-06-04T08:03:13Z

The documentation and docstring would benefit from a reference section pointing to the main papers or authoritative online discussion. The wikipedia article looks like a good start: http://en.wikipedia.org/wiki/RANSAC .

The original paper is available online here: http://www.cs.columbia.edu/~belhumeur/courses/compPhoto/ransac.pdf (otherwise pay walled at ACM).

I would also be interesting to compare the performance and accuracy of this implementation: http://www.scipy.org/Cookbook/RANSAC and check that there is no performance trick to reuse from it.

jnothman · 2013-06-04T09:56:45Z

At the scipy cookbook, the estimator has get_error. One option is to add some per-sample-error method to each estimator, but that won't be done lightly.

That implementation uses per sample error mean to score the model, and doesn't validate the sample or model.

Can you also give a motivating example (here or in the example) for is_data_valid (I'd prefer is_sample_valid or validate_sample) and is_model_valid?

jaquesgrobler · 2013-06-04T12:39:01Z

sklearn/utils/_ransac.py

+    Parameters
+    ----------
+    X : numpy array or sparse matrix of shape [n_samples, n_features]
+        Training data.


Just a cosmetic comment - add blank lines between each of these to be consistent with other documents (mostly 😉 )

X : numpy array or sparse matrix of shape [n_samples, n_features] Training data. y : numpy array of shape [n_samples, n_targets] Target values

GaelVaroquaux · 2013-06-04T13:42:14Z

Well, this might get a bit messy, and we'll have to see what others think.
Support your regressor residual by default, and allow a residual_func : (est,
X, y) -> float array of shape y.shape. I guess you similarly should support
scoring : (est, X, y) -> float (or scoring may be a string) as
sklearn.grid_search.BaseSearchCV does to replace estimator.score.

Well, once again, before we get all worked out on how to implement all
this in a generic way... Do people actually use RANSAC beyond
least-square regression settings? I have never seen it used in other
settings, so if it's not (appart for some corner-case academic setting),
then let's just code it for linear models. It will make everything
simpler.

ahojnnes · 2013-06-04T21:53:32Z

@GaelVaroquaux I have personally used it only in (non-)linear least squares optimization problems. Nevertheless I do not see a reason why this should not be benefitting in outlier detection for other estimators. Maybe I can come up with a good example to proof whether there is any benefit for other estimators. Nevertheless, when implemented for arbitrary "base" estimators, it could also be used by people who implement their own estimators (e.g. some least squares estimators) with the appropriate methods.

@ogrisel I have forgotten to add a decent reference, will fix that. The RANSAC implementation on the SciPy website is far from good to be honest - actually it is not the real RANSAC algorithm. Just have a look at the 2 if-statements in the while loop - this may make sense in some cases, but this is an essential modification to the original algorithm.

I'm open to also implement other variants of RANSAC such as MSAC or LO-RANSAC.

@jnothman I agree, validate_model and validate_sample are better names. The motivation behind the latter was to check whether the randomly selected samples are valid, e.g.: data points in a too close neighborhood, samples result in degenerate models etc. Of course, this could also be done with validate_model afterwards, but at the cost of performance, since the model has to be estimated first. I personally had use cases and applications for this, but I am OK with removing it?

I hope, I have addressed all your concerns, let me know if I missed one of them ;-)

GaelVaroquaux · 2013-06-06T05:55:43Z

On the debate general versus specific, it is not good to strive to be too much generic. Yes the RANSAC framework on paper can be applied to pretty much any estimator, but it practice, it is used only in regression settings, and they are good reasons for this. How would you rank the errors for a 2 class classification problem?

Importantly, we should avoid implementing 'base' estimator, with the idea that they may be useful to people wanting to do unusual things. There is the common 80/20 rule and we don't want the complexity of the code base to explode just for corner cases.

Given that, I would like:

To worry only about regression settings.
To have a standard LinearRegression object as a default argument to estimator.

Any other simplification that can be done is probably welcome.

Also, a priority on this PR is to create an Estimator object, with 'fit' and 'predict', following the scikit-learn convention (as described in the contributors guide.

ahojnnes · 2013-06-06T13:34:03Z

I am pretty busy right now, give me some days to implement this. I'll notify you here!

GaelVaroquaux · 2013-06-06T14:05:55Z

@GaelVaroquaux I have personally used it only in (non-)linear least squares
optimization problems. Nevertheless I do not see a reason why this should not
be benefitting in outlier detection for other estimators.

For non-supervised settings using a covariance model, the MCD (Minimum
Covariance Determinant) really uses the same ideas than the RANSAC, just
specialized to outlier detection.

ogrisel · 2013-06-07T18:07:44Z

as described in the contributors guide.

Which is here: http://scikit-learn.org/dev/developers/#contributing-code

ogrisel · 2013-06-07T18:15:11Z

To rank errors in a binary classification one could just use y_true * clf.decision_function(X) assuming y_true in {-1, 1}.

For multiclass classification, if the classifier supports calibrated probabilistic predictions it could be 1 - clf.predict_proba(X)[clf.classes_[y_true]]. If the classifier does not implement predict_proba but has a decision_function method one could use an IsotonicRegression model to generate probabilistic predictions and reduce to the previous case.

But I agree with @GaelVaroquaux about the code complexity & maintainance vs genericity trade-offs. Let's focus on (linear) regression for this PR. We can always add support for other models later if we ever change our mind on the usefulness of less obvious use cases.

ahojnnes · 2013-06-07T19:23:47Z

OK, I'll stick with the linear case for this PR. You hear from me in the coming days!

ahojnnes · 2013-06-21T11:39:39Z

/ping I implemented this as an estimator class, let me know if this fits your standards.

ahojnnes · 2013-07-06T06:35:49Z

/ping

ogrisel · 2013-07-07T18:07:33Z

I just ran the example, it looks good. Next steps:

min_n_samples is not documented (check that all the parameters are documented in the docstring)
please make sure that the ransac algorithm can work with the default parameters, for instance min_n_samples and make sure that no attribute with _ are set in the __init__ but only the in the fit method.
in particular the base_estimator is left to None, the fit method should automatically instantiate a LinearRegression instance if the y.dtype.kind == 'f' and Perceptron if the y.dtype.kind == 'i'.
set min_n_samples=0.5 by default, and make the fit method treat values for min_n_samples < 1 as rations of n_samples = X.shape[0].
could you please add some tests for the predict and score methods?
write narrative documentation (in the doc/ folder), probably in the linear model section and include the example plot there. Also be sure to explain what the RANSAC acronym stands for there.
could you run some benchmarks on largish datasets and make sure that it's not significantly slower than running n_trials times the fit time of the base estimator? You can do that is a gist outside of the scikit-learn source code and just remport the results in a comment here. Also include link to good online reference and the most original paper for the paper in the narrative doc.

ogrisel · 2013-07-07T18:10:07Z

sklearn/linear_model/ransac.py

+        for n_trials in range(self.max_trials):
+
+            # choose random sample set
+            random_idxs = np.random.randint(0, n_samples, self.min_n_samples)


Never use the np.random singleton directly in the fit method of an estimator. Please add a random_state argument to the __init__ method and check how other sklearn estimator implementations use the check_random_state utility function in their fit method. To find example of this pattern use:

git grep "check_random_state(self.random_state)"

Also once this is done, please seed the tests by passing random_state=0 or some other arbitrary fixed seed in any instance of the RANSAC class in the tests rather than seeding the global singleton rng.

ahojnnes · 2013-07-07T21:54:04Z

@ogrisel Thanks for your feedback. I hope I made all necessary changes. Please, let me know if anything is missing in the implementation yet. If not, I'll add the still missing documentation.

ogrisel · 2013-07-07T22:09:07Z

sklearn/linear_model/ransac.py

+        self.max_trials = max_trials
+        self.stop_n_inliers = stop_n_inliers
+        self.stop_score = stop_score
+        self.random_state = check_random_state(random_state)


In the init one should store the raw parameters. Please do in the fit method instead:

random_state = check_random_state(self.random_state) # use random_state afterwards in the body of the fit method

so that two consecutive calls to the fit method with an fixed integer random_state param passed at init time will yield the same results.

This is what caused the test failure:

https://travis-ci.org/scikit-learn/scikit-learn/builds/8829071#L2771

arjoly · 2013-10-18T11:05:53Z

sklearn/linear_model/tests/test_ransac.py

+    ransac_estimator.fit(X, y)
+
+    assert ransac_estimator.score(X[2:], y[2:]) == 1
+    assert ransac_estimator.score(X[:2], y[:2]) < 1


(nitpick) In sklearn.utils.testing, there is an assert_equal and an assert_less that you can use.

arjoly · 2013-10-18T11:08:29Z

I am 👍 to merge. Thanks @ahojnnes for you patience.

ahojnnes · 2013-10-18T11:10:48Z

Thanks for going through the review. The implementation benefitted quite a lot from it!

coveralls · 2013-10-18T11:11:14Z

Coverage remained the same when pulling 0cf77aacd070ef575097f71be0368fbecefe7ecd on ahojnnes:ransac into 02e0267 on scikit-learn:master.

ogrisel · 2013-10-18T11:15:16Z

+1 for merging as well once the comment on assert_equal and assert_less is addressed.

coveralls · 2013-10-18T11:15:59Z

Coverage remained the same when pulling 0cf77aacd070ef575097f71be0368fbecefe7ecd on ahojnnes:ransac into 02e0267 on scikit-learn:master.

ahojnnes · 2013-10-18T11:25:40Z

Done.

ogrisel · 2013-10-18T11:32:09Z

As it looks ready for merge, @ahojnnes can you please do one last thing? Add an entry for RANSACRegressor to the doc/whats_new.rst file.

coveralls · 2013-10-18T11:34:11Z

Coverage remained the same when pulling e08d34d on ahojnnes:ransac into 02e0267 on scikit-learn:master.

arjoly · 2013-10-18T11:43:36Z

Hm sorry for bringing one more thing. There are some low hanging fruit in the coverage:

Name                                                  Stmts   Miss Branch BrMiss  Cover   Missing
-------------------------------------------------------------------------------------------------
sklearn.linear_model.ransac                              93      5     40      5    92%   152, 164, 169, 176, 225

Can you have a look to it?

ogrisel · 2013-10-18T11:47:12Z

Indeed: here is a summary of the coverage of the ransac module:

https://coveralls.io/files/69816944

ogrisel · 2013-10-18T11:48:52Z

I think it should be possible to get 100% coverage on this module.

ogrisel · 2013-10-18T11:50:46Z

doc/whats_new.rst

@@ -53,6 +53,9 @@ Changelog
   - Add multi-output support to :class:`gaussian_process.GaussianProcess`
     by John Novak.

+   - Added :class:`linear_model.RANSACRegressor` meta-estimator for the robust
+     fitting of regression models. By `Johannes Schönberger`_.


You need to add your home page URL at the bottom of the file to get this link to work.

coveralls · 2013-10-18T11:56:20Z

Coverage remained the same when pulling 776ee4a on ahojnnes:ransac into 02e0267 on scikit-learn:master.

ahojnnes · 2013-10-18T12:02:00Z

Should be fully covered with tests apart from the following line: https://coveralls.io/files/69816944#L225

Not sure which of the regressors returns one-dimensional y, but I can remember that I intentionally added that test because it once broke the residual_metric.

coveralls · 2013-10-18T12:10:18Z

Coverage remained the same when pulling f4b2bc4 on ahojnnes:ransac into 02e0267 on scikit-learn:master.

mblondel · 2013-10-18T12:16:09Z

Looking forward to testing this nifty new little module in master!

ogrisel · 2013-10-18T19:23:41Z

+1 for merging on my side.

arjoly · 2013-10-19T07:04:30Z

+1 for merging

ogrisel · 2013-10-20T14:01:06Z

Merging by rebase to resolve the what's new conflict manually.

arjoly · 2013-10-20T14:40:04Z

Merged by @ogrisel

Thanks you @ahojnnes !!!

jnothman reviewed Jun 4, 2013
View reviewed changes

jaquesgrobler reviewed Jun 4, 2013
View reviewed changes

ogrisel reviewed Jul 7, 2013
View reviewed changes

arjoly reviewed Oct 18, 2013
View reviewed changes

Use assert_equal, assert_less rather than plain assert statement

e08d34d

Add RANSACRegressor to whats-new doc section

776ee4a

ahojnnes added 2 commits October 18, 2013 13:49

Add test for default value of min_samples

427c8c1

Add test for invalid value of min_samples

e8d854a

ogrisel reviewed Oct 18, 2013
View reviewed changes

ahojnnes added 3 commits October 18, 2013 13:57

Add test for custom residual_metric

f2307f7

Add test for default residual threshold

3cbdbf3

Remove reference to web page

f4b2bc4

arjoly closed this Oct 20, 2013

MechCoder mentioned this pull request Oct 21, 2015

[MRG] Deprecate residual_metric and add support for loss in RANSAC #5497

Merged

Uh oh!

MRG: RANSAC algorithm #2025

MRG: RANSAC algorithm #2025

Uh oh!

Conversation

ahojnnes commented Jun 3, 2013

Uh oh!

jnothman commented Jun 4, 2013

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

GaelVaroquaux commented Jun 4, 2013

Uh oh!

GaelVaroquaux commented Jun 4, 2013

Uh oh!

jnothman commented Jun 4, 2013

Uh oh!

ahojnnes commented Jun 4, 2013

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jnothman commented Jun 4, 2013

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

ogrisel commented Jun 4, 2013

Uh oh!

jnothman commented Jun 4, 2013

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

GaelVaroquaux commented Jun 4, 2013

Uh oh!

ahojnnes commented Jun 4, 2013

Uh oh!

GaelVaroquaux commented Jun 6, 2013

Uh oh!

ahojnnes commented Jun 6, 2013

Uh oh!

GaelVaroquaux commented Jun 6, 2013

Uh oh!

ogrisel commented Jun 7, 2013

Uh oh!

ogrisel commented Jun 7, 2013

Uh oh!

ahojnnes commented Jun 7, 2013

Uh oh!

ahojnnes commented Jun 21, 2013

Uh oh!

ahojnnes commented Jul 6, 2013

Uh oh!

ogrisel commented Jul 7, 2013

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

ahojnnes commented Jul 7, 2013

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

arjoly commented Oct 18, 2013