
MRG: RANSAC algorithm #2025


Closed
wants to merge 133 commits into from

Conversation

@ahojnnes (Contributor) commented Jun 3, 2013

This is my first contribution to scikit-learn. Please let me know if I meet all your coding conventions. I hope you find the implementation useful.

I am not familiar with all the different estimator implementations, so I am not sure whether this function is universally applicable to all of them.

@jnothman (Member) commented Jun 4, 2013

Thanks, Johannes! And thanks for contributing clean and well-documented code.

I'm not familiar with the algorithm, and generally don't know a lot about meta-estimators such as this. So a few broad comments:

  • Add a citation to the documentation (in part so others can decide if it has sufficient impact to be appropriate for scikit-learn).
  • It'll need a home other than sklearn.utils._ransac (not in utils, and not with an underscore preceding it). It's a lot like some of the ensembles, but its decisions are not made by an ensemble.
  • It'll need to be encapsulated as an estimator class (a rough sketch follows this list), which means:
    • It will delegate its predict, predict_proba, decision_function, score, transform, inverse_transform methods to the best estimator selected by its fit method. See for instance sklearn.feature_selection.RFECV. Doing so allows your meta-estimator to be used in pipelines, model selection routines, etc.
    • Your function will need to return the best estimator, fitted, as well as the mask. In order to do so, you should use clone from sklearn.base before fitting each estimator. clone will copy the estimator unfitted.
    • You should come up with reasonable default values for min_n_samples and residual_threshold, as we generally like users to be able to pull our estimators out of the library and run them without detailed configuration.
  • Using abs(estimator.predict(X) - y) < t will work well for regression, and afaik, poorly for binary (but you could use estimator.decision_function), multiclass or multilabel classification. I wonder whether absolute difference is the only option, or whether accepting an arbitrary per-sample score function is appropriate. But again, I am not familiar with the algorithm or theory.
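To make the encapsulation points above concrete, here is a rough sketch of the wrapper shape being suggested, assuming a regression setting; the class name, parameter names and defaults are illustrative assumptions, not the final API:

    import numpy as np
    from sklearn.base import BaseEstimator, clone
    from sklearn.utils import check_random_state

    class RANSACSketch(BaseEstimator):
        """Illustrative sketch of the suggested wrapper, not the final API."""

        def __init__(self, base_estimator, min_n_samples, residual_threshold,
                     max_trials=100, random_state=None):
            self.base_estimator = base_estimator
            self.min_n_samples = min_n_samples
            self.residual_threshold = residual_threshold
            self.max_trials = max_trials
            self.random_state = random_state

        def fit(self, X, y):
            rng = check_random_state(self.random_state)
            best_n_inliers = 0
            best_mask = None
            for _ in range(self.max_trials):
                # fit a fresh, unfitted copy on a random minimal subset
                subset = rng.randint(0, X.shape[0], self.min_n_samples)
                candidate = clone(self.base_estimator).fit(X[subset], y[subset])
                # absolute residual thresholding (regression case)
                residuals = np.abs(candidate.predict(X) - y)
                inlier_mask = residuals < self.residual_threshold
                if inlier_mask.sum() > best_n_inliers:
                    best_n_inliers = inlier_mask.sum()
                    best_mask = inlier_mask
            # refit the best consensus set on all of its inliers
            # (handling of the zero-inlier case omitted for brevity)
            self.inlier_mask_ = best_mask
            self.estimator_ = clone(self.base_estimator).fit(X[best_mask],
                                                             y[best_mask])
            return self

        def predict(self, X):
            # delegate to the fitted inner estimator; score, transform,
            # etc. would delegate the same way
            return self.estimator_.predict(X)

Because all delegation goes through the fitted estimator_, such a wrapper drops straight into pipelines and model selection routines.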

Training data.
y : numpy array of shape [n_samples, n_targets]
Target values
estimator_cls : object
Member:

This should be estimator

Contributor Author:

Fixed.

@GaelVaroquaux (Member)

Thanks for the contribution. As @jnothman mentioned, this has to be wrapped as an estimator.

Out of curiosity, do you use RANSAC on anything other than a linear model? I know that it can be applied more generally in theory, but I don't think I have ever seen such an application, and there are probably reasons for this, e.g. that the inner estimator must be fast and simple enough.

@GaelVaroquaux (Member)

With regards to @jnothman 's comment on 'abs(estimator.predict(X) - y) < t', I believe that you should be using the estimator's 'score' method, if you want to do something general enough.

@jnothman (Member) commented Jun 4, 2013

I believe that you should be using the estimator's 'score' method, if you want to do something general enough.

You would have to call score for every sample individually, and for non-multilabel classification accuracy this would still result in binary values, not a continuous value to be thresholded. Some metrics would work, but none are currently implemented to return per-sample scores/residuals.

@ahojnnes (Contributor Author) commented Jun 4, 2013

Thanks for your feedback. I'm writing this from mobile.

I'll address the estimator implementation, and I see how the thresholding fails for multi-dimensional output variables. Calling score for each sample is far from optimal in my opinion. I'd rather suggest implementing a default sample-wise score and providing the ability to pass a score_func for full flexibility on the user side. Tell me what you think of that plan.

rsample_n_inliers = np.sum(rsample_inlier_mask)

# less inliers -> skip current random sample
if rsample_n_inliers < best_n_inliers:
Member:

You don't handle the case where rsample_n_inliers == best_n_inliers == 0.

Contributor Author:

I do, see line 117. I can leave the == 0 since this can only happen when there is no appropriate inlier sample found yet.

Member:

You're likely to get a ValueError before line 117 from estimator.score. By which I mean the score of 0 samples is undefined.

Contributor Author:

Ah, I assumed the scoring would also work for empty data arrays. Will fix it.

@jnothman (Member) commented Jun 4, 2013

Well, this might get a bit messy, and we'll have to see what others think. Support your regressor residual by default, and allow a residual_func : (est, X, y) -> float array of shape y.shape. I guess you similarly should support scoring : (est, X, y) -> float (or scoring may be a string) as sklearn.grid_search.BaseSearchCV does to replace estimator.score.
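For concreteness, a sketch of what the two proposed callables might default to (only the signatures come from the comment above; the function names are placeholders):

    import numpy as np

    def default_residual_func(estimator, X, y):
        # per-sample regression residuals, same shape as y
        return np.abs(estimator.predict(X) - y)

    def default_scoring(estimator, X, y):
        # fall back on the estimator's own score method
        return estimator.score(X, y)

The main loop would then threshold the residual_func output per sample and use scoring to compare candidate models.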

rsample_inlier_y = y[rsample_inlier_mask]

# score of inlier data set
rsample_score = estimator.score(rsample_inlier_X, rsample_inlier_y)
Member:

Scoring just the inliers on a bootstrapped model seems a bit unintuitive. May we assume that this can be calculated as a mean of the residuals? If not, do we need to support an arbitrary score_func?

Contributor Author:

Since the final model is estimated only from the inliers, I tend to prefer scoring only inliers here. Additionally, few outliers can strongly distort the score, but the purpose of RANSAC actually is to robustly estimate models from faulty data. What is your motivation to score all data?

Member:

None. I should really have just looked at the paper or reference implementations first :)

@ogrisel (Member) commented Jun 4, 2013

The documentation and docstring would benefit from a reference section pointing to the main papers or authoritative online discussion. The wikipedia article looks like a good start: http://en.wikipedia.org/wiki/RANSAC .

The original paper is available online here: http://www.cs.columbia.edu/~belhumeur/courses/compPhoto/ransac.pdf (otherwise paywalled at ACM).

It would also be interesting to compare the performance and accuracy against this implementation: http://www.scipy.org/Cookbook/RANSAC and check whether there is a performance trick to reuse from it.

@jnothman (Member) commented Jun 4, 2013

At the scipy cookbook, the estimator has get_error. One option is to add some per-sample-error method to each estimator, but that won't be done lightly.

That implementation uses the per-sample error mean to score the model, and doesn't validate the sample or model.

Can you also give a motivating example (here or in the example) for is_data_valid (I'd prefer is_sample_valid or validate_sample) and is_model_valid?

Parameters
----------
X : numpy array or sparse matrix of shape [n_samples, n_features]
Training data.
Member:

Just a cosmetic comment - add blank lines between each of these to be consistent with other documents (mostly 😉 )

    X : numpy array or sparse matrix of shape [n_samples, n_features]
        Training data.

    y : numpy array of shape [n_samples, n_targets]
        Target values

Contributor Author:

Will do.

@GaelVaroquaux (Member)

Well, this might get a bit messy, and we'll have to see what others think.
Support your regressor residual by default, and allow a residual_func : (est,
X, y) -> float array of shape y.shape. I guess you similarly should support
scoring : (est, X, y) -> float (or scoring may be a string) as
sklearn.grid_search.BaseSearchCV does to replace estimator.score.

Well, once again, before we get all worked up about how to implement all
this in a generic way... Do people actually use RANSAC beyond
least-squares regression settings? I have never seen it used in other
settings, so if it's not (apart from some corner-case academic setting),
then let's just code it for linear models. It will make everything
simpler.

@ahojnnes (Contributor Author) commented Jun 4, 2013

@GaelVaroquaux I have personally used it only in (non-)linear least squares optimization problems. Nevertheless, I do not see a reason why this should not benefit outlier detection for other estimators. Maybe I can come up with a good example to prove whether there is any benefit for other estimators. Moreover, when implemented for arbitrary "base" estimators, it could also be used by people who implement their own estimators (e.g. some least squares estimators) with the appropriate methods.

@ogrisel I forgot to add a decent reference; will fix that. The RANSAC implementation on the SciPy website is far from good, to be honest - actually, it is not the real RANSAC algorithm. Just have a look at the two if-statements in the while loop: this may make sense in some cases, but it is an essential modification to the original algorithm.

I'm open to also implement other variants of RANSAC such as MSAC or LO-RANSAC.

@jnothman I agree, validate_model and validate_sample are better names. The motivation behind the latter was to check whether the randomly selected samples are valid, e.g. data points that are too close together, samples that result in degenerate models, etc. Of course, this could also be done with validate_model afterwards, but at the cost of performance, since the model has to be estimated first. I personally had use cases and applications for this, but I am OK with removing it.

I hope I have addressed all your concerns; let me know if I missed one of them ;-)

@GaelVaroquaux (Member)

On the debate of general versus specific: it is not good to strive to be too generic. Yes, on paper the RANSAC framework can be applied to pretty much any estimator, but in practice it is used only in regression settings, and there are good reasons for this. How would you rank the errors for a 2-class classification problem?

Importantly, we should avoid implementing 'base' estimators with the idea that they may be useful to people wanting to do unusual things. There is the common 80/20 rule, and we don't want the complexity of the code base to explode just for corner cases.

Given that, I would like:

  • To worry only about regression settings.
  • To have a standard LinearRegression object as a default argument to estimator.

Any other simplification that can be done is probably welcome.

Also, a priority on this PR is to create an Estimator object, with 'fit' and 'predict', following the scikit-learn convention (as described in the contributors guide).

@ahojnnes (Contributor Author) commented Jun 6, 2013

I am pretty busy right now, give me some days to implement this. I'll notify you here!

@GaelVaroquaux (Member)

@GaelVaroquaux I have personally used it only in (non-)linear least squares
optimization problems. Nevertheless, I do not see a reason why this should
not benefit outlier detection for other estimators.

For non-supervised settings using a covariance model, the MCD (Minimum
Covariance Determinant) really uses the same ideas as RANSAC, just
specialized to outlier detection.

@ogrisel (Member) commented Jun 7, 2013

as described in the contributors guide.

Which is here: http://scikit-learn.org/dev/developers/#contributing-code

@ogrisel (Member) commented Jun 7, 2013

To rank errors in binary classification, one could just use y_true * clf.decision_function(X), assuming y_true in {-1, 1}.

For multiclass classification, if the classifier supports calibrated probabilistic predictions, it could be 1 - clf.predict_proba(X)[clf.classes_[y_true]] (one minus the predicted probability of each sample's true class). If the classifier does not implement predict_proba but has a decision_function method, one could use an IsotonicRegression model to generate probabilistic predictions and reduce to the previous case.
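A sketch of both rankings; LogisticRegression and the toy data are stand-ins chosen only for illustration:

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.RandomState(0)
    X = rng.randn(100, 5)
    y_true = np.where(X[:, 0] > 0, 1, -1)      # binary labels in {-1, 1}
    clf = LogisticRegression().fit(X, y_true)

    # binary case: per-sample margin; small or negative values flag badly
    # fitted samples, so -margins can play the role of residuals
    margins = y_true * clf.decision_function(X)

    # probabilistic case: one minus the predicted probability of each
    # sample's true class
    col = np.searchsorted(clf.classes_, y_true)
    errors = 1 - clf.predict_proba(X)[np.arange(len(y_true)), col]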

But I agree with @GaelVaroquaux about the code complexity & maintenance vs. genericity trade-offs. Let's focus on (linear) regression for this PR. We can always add support for other models later if we ever change our mind on the usefulness of less obvious use cases.

@ahojnnes (Contributor Author) commented Jun 7, 2013

OK, I'll stick with the linear case for this PR. You'll hear from me in the coming days!

@ahojnnes (Contributor Author)

/ping I implemented this as an estimator class; let me know if this fits your standards.

@ahojnnes (Contributor Author) commented Jul 6, 2013

/ping

@ogrisel (Member) commented Jul 7, 2013

I just ran the example, it looks good. Next steps:

  • min_n_samples is not documented (check that all the parameters are documented in the docstring)
  • please make sure that the ransac algorithm can work with the default parameters, for instance min_n_samples, and make sure that no attributes ending with _ are set in __init__ but only in the fit method.
  • in particular, when base_estimator is left as None, the fit method should automatically instantiate a LinearRegression instance if y.dtype.kind == 'f' and a Perceptron if y.dtype.kind == 'i' (see the sketch after this list).
  • set min_n_samples=0.5 by default, and make the fit method treat values of min_n_samples < 1 as ratios of n_samples = X.shape[0].
  • could you please add some tests for the predict and score methods?
  • write narrative documentation (in the doc/ folder), probably in the linear model section and include the example plot there. Also be sure to explain what the RANSAC acronym stands for there.
  • could you run some benchmarks on largish datasets and make sure that it's not significantly slower than running n_trials times the fit time of the base estimator? You can do that in a gist outside of the scikit-learn source code and just report the results in a comment here. Also include links to a good online reference and the original paper in the narrative doc.
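A rough sketch of the two default-handling items above (the helper name and its exact placement are placeholders; the dtype dispatch and ratio rule are as described):

    import numpy as np
    from sklearn.linear_model import LinearRegression, Perceptron

    def resolve_fit_defaults(base_estimator, min_n_samples, X, y):
        if base_estimator is None:
            # float targets -> regression, integer targets -> classification
            if y.dtype.kind == 'f':
                base_estimator = LinearRegression()
            elif y.dtype.kind == 'i':
                base_estimator = Perceptron()
        if min_n_samples < 1:
            # values below 1 are interpreted as a ratio of n_samples
            min_n_samples = int(np.ceil(min_n_samples * X.shape[0]))
        return base_estimator, min_n_samples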

for n_trials in range(self.max_trials):

# choose random sample set
random_idxs = np.random.randint(0, n_samples, self.min_n_samples)
Member:

Never use the np.random singleton directly in the fit method of an estimator. Please add a random_state argument to the __init__ method and check how other sklearn estimator implementations use the check_random_state utility function in their fit method. To find example of this pattern use:

git grep "check_random_state(self.random_state)"
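The pattern in question looks roughly like this (a sketch, not an excerpt from any existing estimator):

    from sklearn.utils import check_random_state

    class SomeEstimator(object):
        def __init__(self, random_state=None):
            # store the raw parameter untouched
            self.random_state = random_state

        def fit(self, X, y):
            # derive the RNG here, never from the np.random singleton
            rng = check_random_state(self.random_state)
            random_idxs = rng.randint(0, X.shape[0], 10)
            return self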

Member:

Also once this is done, please seed the tests by passing random_state=0 or some other arbitrary fixed seed in any instance of the RANSAC class in the tests rather than seeding the global singleton rng.

@ahojnnes (Contributor Author) commented Jul 7, 2013

@ogrisel Thanks for your feedback. I hope I made all the necessary changes. Please let me know if anything is still missing in the implementation. If not, I'll add the remaining documentation.

self.max_trials = max_trials
self.stop_n_inliers = stop_n_inliers
self.stop_score = stop_score
self.random_state = check_random_state(random_state)
Member:

In __init__ one should store the raw parameters. Please do this in the fit method instead:

    random_state = check_random_state(self.random_state)
    # use random_state afterwards in the body of the fit method

so that two consecutive calls to the fit method with a fixed integer random_state param passed at init time will yield the same results.
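A tiny demonstration of why this matters, independent of any estimator:

    import numpy as np
    from sklearn.utils import check_random_state

    seed = 0

    # converted once (e.g. in __init__): the RNG state advances, so two
    # consecutive fits would see different random streams
    rng_once = check_random_state(seed)
    a = rng_once.randint(10, size=3)
    b = rng_once.randint(10, size=3)   # differs from a in general

    # re-derived at the start of each fit: identical results every time
    c = check_random_state(seed).randint(10, size=3)
    d = check_random_state(seed).randint(10, size=3)
    assert (c == d).all()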


ransac_estimator.fit(X, y)

assert ransac_estimator.score(X[2:], y[2:]) == 1
assert ransac_estimator.score(X[:2], y[:2]) < 1
Member:

(nitpick) In sklearn.utils.testing, there is an assert_equal and an assert_less that you can use.
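Applied to the asserts in the snippet above (reusing the variables from that test), this would read:

    from sklearn.utils.testing import assert_equal, assert_less

    assert_equal(ransac_estimator.score(X[2:], y[2:]), 1)
    assert_less(ransac_estimator.score(X[:2], y[:2]), 1)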

@arjoly (Member) commented Oct 18, 2013

I am 👍 to merge. Thanks @ahojnnes for your patience.

@ahojnnes (Contributor Author)

Thanks for going through the review. The implementation benefitted quite a lot from it!

@coveralls

Coverage remained the same when pulling 0cf77aacd070ef575097f71be0368fbecefe7ecd on ahojnnes:ransac into 02e0267 on scikit-learn:master.

@ogrisel (Member) commented Oct 18, 2013

+1 for merging as well once the comment on assert_equal and assert_less is addressed.


@ahojnnes (Contributor Author)

Done.

@ogrisel (Member) commented Oct 18, 2013

As it looks ready for merge, @ahojnnes can you please do one last thing? Add an entry for RANSACRegressor to the doc/whats_new.rst file.

@coveralls

Coverage remained the same when pulling e08d34d on ahojnnes:ransac into 02e0267 on scikit-learn:master.

@arjoly (Member) commented Oct 18, 2013

Hm, sorry for bringing up one more thing. There is some low-hanging fruit in the coverage:

Name                                                  Stmts   Miss Branch BrMiss  Cover   Missing
-------------------------------------------------------------------------------------------------
sklearn.linear_model.ransac                              93      5     40      5    92%   152, 164, 169, 176, 225

Can you have a look at it?

@ogrisel (Member) commented Oct 18, 2013

Indeed: here is a summary of the coverage of the ransac module:

https://coveralls.io/files/69816944

@ogrisel (Member) commented Oct 18, 2013

I think it should be possible to get 100% coverage on this module.

@@ -53,6 +53,9 @@ Changelog
- Add multi-output support to :class:`gaussian_process.GaussianProcess`
by John Novak.

- Added :class:`linear_model.RANSACRegressor` meta-estimator for the robust
fitting of regression models. By `Johannes Schönberger`_.
Member:

You need to add your home page URL at the bottom of the file to get this link to work.

@coveralls

Coverage remained the same when pulling 776ee4a on ahojnnes:ransac into 02e0267 on scikit-learn:master.

@ahojnnes (Contributor Author)

Should be fully covered with tests apart from the following line: https://coveralls.io/files/69816944#L225

Not sure which of the regressors returns one-dimensional y, but I remember that I intentionally added that check because it once broke the residual_metric.

@coveralls

Coverage remained the same when pulling f4b2bc4 on ahojnnes:ransac into 02e0267 on scikit-learn:master.

@mblondel (Member)

Looking forward to testing this nifty new little module in master!

@ogrisel (Member) commented Oct 18, 2013

+1 for merging on my side.

@arjoly (Member) commented Oct 19, 2013

+1 for merging

@ogrisel (Member) commented Oct 20, 2013

Merging by rebase to resolve the what's new conflict manually.

@arjoly (Member) commented Oct 20, 2013

Merged by @ogrisel

Thank you @ahojnnes!!!
