Remove OPTICS from 0.20 #12053

Merged
merged 1 commit into from
Sep 17, 2018

Conversation

jnothman
Member

Due to too many proposed backwards-incompatible changes.

We had hoped to include OPTICS in 0.20, particularly to provide a memory-efficient alternative to DBSCAN... but with several issues open about OPTICS that would result in backwards-incompatible changes, it's hard to justify either delaying the release further or releasing with these issues open.

What do others think, @espg, @adrinjalali, @GaelVaroquaux?

@espg
Contributor

espg commented Sep 12, 2018

@jnothman My understanding is that the biggest blocker was the numerical issue referenced in #11878, and finding an appropriate fix for the test involved. I think that this issue is (finally) fixed with #12054. I realize that there are a few edge cases outstanding with how split points should be dealt with (i.e., #11857)... however, I would argue that the core algorithm is solid and well tested, and that the extraction issues are relatively minor.

I'd also point out that (IMO) the OPTICS implementation is currently better tested and more complete than DBSCAN was in 2013 when the OPTICS PR was first opened. Specifically, compared to DBSCAN circa 2013, it scales better, has higher test coverage, and is better optimized.

My point in making the comparison with DBSCAN isn't meant as some sort of appeal for fairness in evaluating algorithms for inclusion into sklearn. The sklearn project was in a different place 5 years ago, and coding and algorithm inclusion standards have increased for a reason. My larger point is that none of the issues with DBSCAN in 2013 were severe enough to stop it from being usefully deployed to the community, and that with broader use of the algorithm, further releases addressed those issues. Similarly, I think that the remaining issues in OPTICS are minor, and can be addressed as bug fixes and enhancements in 0.21.

@jnothman
Member Author

jnothman commented Sep 12, 2018 via email

@GaelVaroquaux
Member

I think that removing OPTICS is something that we should consider. The line of thought is as follows: we are at strong risk of making backward-incompatible changes to it. These changes are a significant cost to developers and to users. Trying to avoid them is currently holding up the release. We cannot hold up the release too long.

I am open to giving us another few days to try to resolve the pending issues (on which I feel that I have no expertise), but I do not have resources to commit to them right now. Hence, I fear that they will slip.

Right now, not putting OPTICS in 0.20 feels like a major drawback because it feels like it will delay its availability a lot. The best solution to such a problem would be to release more often. I think that this is a worthwhile goal, and I would like to start thinking about how I can allocate more resources to this goal.

@espg : does this sound like a reasonable analysis of the situation?

@adrinjalali
Member

I think we are at a point where most issues are resolved, or pending review in PRs (#12028, #11929, #12029, #12049, #12054).

Specifically for the split points issue, I managed to come up with an idea, borrowing the xi concept from the original paper, which I've proposed in PR #12049.

Regarding the backward compatibility and the API, I'd consider it pretty mature, especially if we implement a variety of what's proposed in OPTICS extraction methods.

There are some smaller outstanding issues which don't take too much time to fix, some of them mentioned in #11677 (comment)

The options I see ahead are:

  1. retract OPTICS from 0.20, and maybe try to release 0.21 sooner than usual
  2. tag OPTICS as experimental, release, and fix the issues for 0.20.1 or so
  3. fix the remaining issues fast and release

I personally have some good time to spare this week to try to make the 3rd option happen. Otherwise option 2 is my personal favorite, but I'm also in the camp that is happy with releasing every new algorithm as experimental for the community to test, getting feedback, fixing the issues, and releasing it as stable once it has matured (but I know this is not how it's usually done in scikit-learn).

@espg
Contributor

espg commented Sep 12, 2018

I think we can combine options 2 and 3 that @adrinjalali is suggesting while still meeting @jnothman's and @GaelVaroquaux's requirements concerning output stability. The OPTICS algorithm has at present:

  • A single extraction method of the reachability graph
  • Three features that extract clusters from the graph, at various levels of maturity:
    • extract_dbscan, which is the only method mentioned in the original paper
    • auto_extract, from @amyxzhang with substantial modification from @adrinjalali and myself
    • the ELKI extraction method, which has many, many different names and is in use (or at least available) in both ELKI and R

My view is that the extraction of the reachability graph is stable. The outputs are tested, and have not changed at all since 2015. They will not change in the future. They will not break compatibility going forward.

Likewise, I view extract_dbscan as stable. The code for the method is extremely terse, easy to understand, and has been coded and reviewed in its current state by @jnothman and myself. The outputs for this function should not change, or if they do change, will change as a result of the sklearn clustering API incorporating new labeling vocabulary (i.e., label = -2 if we ever decide to include split points for users explicitly).
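To illustrate why a DBSCAN-style extraction can be this terse, here is a minimal pure-Python sketch that follows the idea from the original OPTICS paper: walk the ordering once, start a new cluster whenever a point is not reachable within `eps` but is itself a core point at `eps`, and mark it as noise otherwise. The function name `extract_dbscan_labels`, its signature, and the toy input values are assumptions made for illustration; this is not the code under review in the PR.

```python
import math


def extract_dbscan_labels(ordering, reachability, core_distances, eps):
    """Assign DBSCAN-like labels from an OPTICS ordering (illustrative sketch).

    `reachability` and `core_distances` are indexed by point; `ordering`
    lists point indices in the order OPTICS visited them. Noise is -1.
    """
    labels = [-1] * len(ordering)
    cluster_id = -1
    for point in ordering:
        if reachability[point] > eps:
            # Not reachable from the previous points at this eps.
            if core_distances[point] <= eps:
                # A core point at this eps starts a new cluster.
                cluster_id += 1
                labels[point] = cluster_id
            # Otherwise the point keeps the noise label (-1).
        else:
            # Reachable within eps: it joins the current cluster.
            labels[point] = cluster_id
    return labels


# Hypothetical toy values: two dense groups followed by one outlier.
ordering = [0, 1, 2, 3, 4, 5]
reachability = [math.inf, 0.5, 0.5, 5.0, 0.4, 9.0]
core_distances = [0.5, 0.5, 0.5, 0.4, 0.4, math.inf]
labels = extract_dbscan_labels(ordering, reachability, core_distances, eps=1.0)
print(labels)
```

Because the labels depend only on the stored reachability and core distances, a sketch like this can be re-run for any `eps` without recomputing the ordering.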

My suggestion is that we release OPTICS with extract_dbscan as the default extraction method, and that we optionally include auto_extract as an experimental feature that the user can select. This would require mostly updating the documentation, and if including auto_extract, referencing that the outputs may change in a following release for certain edge cases (i.e., split points, the last entry of the graph, etc.). This would solve the primary complaint about DBSCAN not being memory efficient, allow the current release to go forward, and not introduce any code that may change outputs for the next release. Updates to auto_extract can be released as bug fixes and changed to the default extraction function in 0.21 as an enhancement. The ELKI extraction can be released as an enhancement.

I am of course also open to us trying to fix everything in time for the 0.20 release...but if it's too rushed (and I get the sense that it may be), we can treat the modular parts of OPTICS as modular and release the part of the API that is stable and will not change.

Does this sound like a reasonable way forward?

@jnothman
Member Author

jnothman commented Sep 13, 2018 via email

@adrinjalali
Member

I'd be happy with what you suggest @jnothman! I'm also working on an implementation of extractXi, which I'd probably be more comfortable having as the default option.

@jnothman
Member Author

Sigh. I'm really inclined to delay releasing OPTICS until 0.21. There seem to be too many questions of correctness and comparison to prior art. And it also seems fair to MICE/IterativeImputer which was also a long-open PR and has been similarly delayed due to interface stability issues.

@qinhanmin2014
Member

Sigh. I'm really inclined to delay releasing OPTICS until 0.21.

Hmm, it's a sad decision but I'll vote +1 at this point. Some so-called edge cases are not trivial and I think we still need some time to compare with existing implementations (the referenced one, R dbscan, ELKI) and figure out what the appropriate solution is.
Maybe we can release 0.21 once we merge OPTICS & IterativeImputer (e.g., in 6 months/3 months).

@espg
Contributor

espg commented Sep 17, 2018

As much as I would love to see this available in 0.20, it is important that we get it right. I think that the appropriate thing to do is to start with comparison to the R implementation... i.e., these comments: #11677 (comment) , #12090 (comment)

I'd like to figure out why the R implementation is different, specifically why they don't choose the next closest point when building the reachability graph. Their implementation looks like it would produce identical reachability values if it did... if we're sure this is the right way to do it, perhaps we should open a bug with them. If we aren't, we should figure out why it isn't and change our implementation.
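For readers following the "next closest point" question, the toy sketch below shows one reading of that rule: after processing a point, always continue with the unprocessed point that currently has the smallest reachability. It is a quadratic-time, pure-Python illustration for 1-D data only; the function name `optics_ordering`, the tie-breaking by index, and the choice to count a point as its own neighbour are assumptions, and it is not the scikit-learn, R, or ELKI code.

```python
import math


def optics_ordering(points, min_samples, max_eps=math.inf):
    """Toy O(n^2) OPTICS ordering for 1-D points (illustrative only)."""
    n = len(points)

    def dist(i, j):
        return abs(points[i] - points[j])

    # Core distance: distance to the min_samples-th nearest neighbour
    # (counting the point itself), or inf if that exceeds max_eps.
    core = []
    for i in range(n):
        d = sorted(dist(i, j) for j in range(n))[min_samples - 1]
        core.append(d if d <= max_eps else math.inf)

    reach = [math.inf] * n
    processed = [False] * n
    ordering = []
    current = 0
    while True:
        processed[current] = True
        ordering.append(current)
        if core[current] != math.inf:
            # Try to lower the reachability of unprocessed neighbours.
            for j in range(n):
                if not processed[j] and dist(current, j) <= max_eps:
                    candidate = max(core[current], dist(current, j))
                    reach[j] = min(reach[j], candidate)
        remaining = [j for j in range(n) if not processed[j]]
        if not remaining:
            break
        # "Next closest point": smallest reachability, ties broken by index.
        current = min(remaining, key=lambda j: (reach[j], j))
    return ordering, reach, core


ordering, reach, core = optics_ordering([0.0, 10.0, 1.0, 11.0, 2.0], min_samples=2)
print(ordering)
print(reach)
```

An implementation that picks the next point by a different rule can visit points in a different order, which is one place where reachability values from two libraries could diverge.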

@qinhanmin2014
Member

i.e., these comments: #11677 (comment) , #12090 (comment)

Apparently I missed some comments while I was with my parents, sorry.

I think that the appropriate thing to do is to start with comparison to the R implementation

I agree. R has a test to ensure that their implementation is consistent with ELKI (need to confirm whether it's coincidental though). If R and ELKI have the same implementation, we might follow them and note down our solution.

@adrinjalali
Member

Yeah, if there's a possibility that there's a release in ~3 months, I'd definitely prefer to delay releasing OPTICS until then.

@jnothman
Member Author

jnothman commented Sep 17, 2018 via email

@jnothman
Member Author

But I think there is some consensus here that we should hit merge on this PR, and make sure an OPTICS we are confident in is available at the next release.

@adrinjalali
Member

+1.

@espg
Contributor

espg commented Sep 17, 2018

It's already been 5 and a half years, so another 6-to-15 months won't make that much of a difference I suppose. At least now it's in the development version, so people can clone it and view the online documentation if they like... I know I used the reworked Gaussian processes code for at least 6 or 8 months before it made it to a stable official release.

@jnothman
Member Author

Thanks for your patience. I've been trying really hard to get this in too, but alas...!

@jnothman jnothman merged commit c196ec4 into scikit-learn:0.20.X Sep 17, 2018
@jnothman
Member Author

jnothman commented Sep 17, 2018

I've reflected this change in what's new in master: e616ee3

@adrinjalali
Member

Sad. Time for a few sad drinks to forget the loss of our beloved OPTICS in this release.
[I wish we actually used IRC or something]

@jnothman
Member Author

jnothman commented Sep 17, 2018 via email

@jnothman
Member Author

jnothman commented Sep 17, 2018 via email

@GaelVaroquaux
Member

GaelVaroquaux commented Sep 17, 2018 via email

@amueller
Member

Thanks @jnothman. I've been unfortunately busy. Are there remaining blockers?

@kno10
Contributor

kno10 commented Sep 21, 2018

@espg wrote:

  • Three features that extract clusters from the graph, at various levels of maturity:

    • extract_dbscan, which is the only method mentioned in the original paper
    • auto_extract, from @amyxzhang with substantial modification from @adrinjalali and myself
    • the ELKI extraction method, which has many, many different names and is in use (or at least available) in both ELKI and R

The "ELKI extraction method which has many, many different names" is from the original paper (Section 4.3 Automatic Techniques, Figure 19: Algorithm ExtractClusters)! In that paper it used Definition 11: ξ-clusters (a lowercase Greek xi), which is why we used the name OPTICSXi for finding such Xi clusters in Java... what other "many, many different names" do you know?

Naming another method "auto_extract" seems like a bad idea to me, as the original OPTICS paper describes the Xi method in the section titled "automatic techniques"...
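As a rough illustration of the ξ idea behind that method, the sketch below flags ξ-steep upward and downward points in a reachability plot, which are the building blocks the ξ-cluster definition rests on. The inequalities are written from my recollection of the paper's steep-point definition and should be treated as an assumption to check against the paper; the function name `steep_points` and the sample values are hypothetical, and the full cluster extraction (steep areas, cluster boundaries, minimum size) is not attempted here.

```python
def steep_points(reachability_in_order, xi):
    """Flag xi-steep upward/downward points in a reachability plot (sketch).

    `reachability_in_order` holds reachability values in OPTICS order.
    Entry i of each result describes the step from position i to i + 1.
    """
    up, down = [], []
    pairs = zip(reachability_in_order, reachability_in_order[1:])
    for r, r_next in pairs:
        # Upward: this value is at least a factor (1 - xi) below the next one.
        up.append(r <= r_next * (1 - xi))
        # Downward: the next value is at least a factor (1 - xi) below this one.
        down.append(r * (1 - xi) >= r_next)
    return up, down


# Hypothetical plot: a sharp drop into a valley, then a sharp rise out of it.
up, down = steep_points([10.0, 2.0, 2.0, 2.05, 8.0], xi=0.5)
print(up)
print(down)
```

A valley bounded by a steep downward step on the left and a steep upward step on the right is the kind of shape a ξ-based extraction would consider as a candidate cluster.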

@adrinjalali
Member

@kno10, just to note that in PR #12087 a new name, extract_sqlnk, is proposed for what is now called auto_extract in master. And hopefully, once it's better tested, PR #12077 will add extract_xi as well, which is almost what is called OPTICSXi in ELKI.

Do these names sound better to you?

@kno10
Contributor

kno10 commented Sep 21, 2018

Obviously more specific names make sense. I cannot parse "sqlnk" though, where is the name from?

7 participants