Remove OPTICS from 0.20 #12053
Due to too many proposed backwards-incompatible changes
@jnothman My understanding is that the biggest blocker was the numerical issue referenced in #11878, and finding an appropriate fix for the test involved. I think that this issue is (finally) fixed with #12054. I realize that there are a few edge cases outstanding with how split points should be dealt with (i.e., #11857)... however, I would argue that the core algorithm is solid and well tested, and that the extraction issues are relatively minor. I'd also point out that (IMO) the OPTICS implementation is currently better tested and more complete than DBSCAN was in 2013 when the OPTICS PR was first opened. Specifically, compared to DBSCAN circa 2013, it scales better, has higher test coverage, and is better optimized. My point in making the comparison with DBSCAN isn't meant as some sort of appeal for fairness in evaluating algorithms for inclusion into sklearn. The sklearn project was in a different place 5 years ago, and coding and algorithm inclusion standards have increased for a reason. My larger point is that none of the issues with DBSCAN in 2013 were severe enough to stop it from being usefully deployed to the community, and that with broader use of the algorithm, further releases addressed those issues. Similarly, I think that the remaining issues in OPTICS are minor, and can be addressed as bug fixes and enhancements in 0.21. |
I wanted to open this to give us the easy option of considering a release
asap. I'm okay with continuing to iron out the issues there, but feel that
if it's blocking release, we need to resolve the issues this week and no
later.
I think the question of whether density includes or excludes the query
point is also one that definitely needs resolving. There just seem to be
several open questions, where by this stage in the release cycle we need
answers.
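
To make the query-point question concrete, here is a minimal sketch of how the core distance shifts depending on whether a point counts as its own neighbor. The function name `core_distances` and the `include_self` flag are hypothetical and are not part of the scikit-learn API; the toy data is made up for illustration.

```python
import numpy as np

def core_distances(X, min_samples, include_self=True):
    """Distance from each point to its min_samples-th neighbor.

    include_self=True counts the query point as its own first neighbor;
    include_self=False counts only the other points.
    (Hypothetical helper for illustration, not scikit-learn code.)
    """
    X = np.asarray(X, dtype=float)
    # Pairwise Euclidean distances; after sorting, column 0 of each row
    # is the zero distance from the point to itself.
    dists = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1))
    dists.sort(axis=1)
    k = min_samples - 1 if include_self else min_samples
    return dists[:, k]

# Toy 1-D data at positions 0, 1, 3 and 6.
X = np.array([[0.0], [1.0], [3.0], [6.0]])
with_self = core_distances(X, min_samples=2, include_self=True)
without_self = core_distances(X, min_samples=2, include_self=False)
print(with_self)     # core distances when the query point is counted
print(without_self)  # core distances when it is not
```

With the same `min_samples`, excluding the query point pushes every core distance out by one neighbor, so reachability values and any extracted clusters can differ between the two conventions.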
DBSCAN has changed in API and implementation but not in results in many
years, as far as I can remember. It's a bad comparison because except for
speed regressions, we were able to ensure backwards compatibility.
|
I think that removing OPTICS is something that we should consider. The line of thought is as follows: we are strongly risking backward-incompatible changes in it. These changes are a significant cost to developers and to users. Trying to avoid them is currently holding up the release. We cannot hold up the release too long. I am open to giving us another few days to try to resolve the pending issues (on which I feel that I have no expertise), but I do not have resources to commit to them right now. Hence, I fear that they will slip. Right now, not putting OPTICS in 0.20 feels like a major drawback because it feels like it will delay its availability a lot. The best solution to such a problem would be to release more often. I think that this is a worthwhile goal, and I would like to start thinking about how I can allocate more resources to this goal. @espg : does this sound like a reasonable analysis of the situation? |
I think we are at a point where most issues are resolved, or pending review in PRs (#12028, #11929, #12029, #12049, #12054). Specifically for the split points issue, I managed to come up with an idea, borrowing the xi concept from the original paper, which I've proposed in PR #12049. Regarding the backward compatibility and the API, I'd consider it pretty mature, especially if we implement a variety of what's proposed in OPTICS extraction methods. There are some smaller outstanding issues which don't take too much time to fix, some of them mentioned in #11677 (comment). The options I see ahead are:
I personally have some good time to spare this week, to try and make the 3rd option happen. Otherwise option 2 is my personal favorite, but I'm also in the camp which is happy with releasing every new algorithm as experimental for the community to test, get feedback, fix the issues, and release as stable once it's matured (but I know this is not how it's usually done in |
I think we can combine options 2 and 3 that @adrinjalali is suggesting while still meeting @jnothman's and @GaelVaroquaux's requirements concerning output stability. The OPTICS algorithm has at present:
My view is that the extraction of the reachability graph is stable. The outputs are tested, and have not changed at all since 2015. They will not change in the future. They will not break compatibility going forward. Likewise, I view My suggestion is that we release OPTICS with I am of course also open to us trying to fix everything in time for the 0.20 release...but if it's too rushed (and I get the sense that it may be), we can treat the modular parts of OPTICS as modular and release the part of the API that is stable and will not change. Does this sound like a reasonable way forward? |
I find the idea of releasing DBSCAN as the primary extraction unappealing
when automatic extraction can deal with clusters of different density and
hence should be a key feature. I'd rather just go with saying that the
handling of split points is experimental, if we think that is likely to be
the only remaining issue pertaining to backwards compatibility.
|
I'd be happy with what you suggest @jnothman! I'm also working on an implementation of |
Sigh. I'm really inclined to delay releasing OPTICS until 0.21. There seem to be too many questions of correctness and comparison to prior art. And it also seems fair to MICE/IterativeImputer which was also a long-open PR and has been similarly delayed due to interface stability issues. |
Hmm, it's a sad decision but I'll vote +1 at this point. Some so-called edge cases are not trivial and I think we still need some time to compare with existing implementations (the referenced one, R dbscan, ELKI) and figure out what's the appropriate solution. |
As much as I would love to see this available in 0.20, it is important that we get it right. I think that the appropriate thing to do is to start with comparison to the R implementation... i.e., these comments: #11677 (comment) , #12090 (comment) I'd like to figure out why the R implementation is different, specifically why they don't choose the next closest point when building the reachability graph. Their implementation looks like it would produce identical reachability values if it did... if we're sure this is the right way to do it, perhaps we should open a bug with them. If we aren't, we should figure out why it isn't and change our implementation. |
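
For readers following the comparison, below is a minimal sketch of what "choosing the next closest point" means when building the reachability ordering. It is not the scikit-learn, R dbscan, or ELKI code. It assumes an unbounded neighborhood radius, counts the query point as its own neighbor for the core distance, and breaks ties by lowest index; all three are assumptions made for illustration, and the name `optics_order` is hypothetical.

```python
import numpy as np

def optics_order(X, min_samples):
    """Toy OPTICS ordering with an unbounded neighborhood radius.

    Hypothetical helper for illustration only. Returns the processing
    order and the reachability distance of every point.
    """
    X = np.asarray(X, dtype=float)
    n = len(X)
    dists = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1))
    # Assumption: the query point counts as its own first neighbor.
    core = np.sort(dists, axis=1)[:, min_samples - 1]
    reach = np.full(n, np.inf)
    processed = np.zeros(n, dtype=bool)
    order = []
    current = 0  # start from the first point
    for _ in range(n):
        processed[current] = True
        order.append(current)
        candidates = np.flatnonzero(~processed)
        if candidates.size == 0:
            break
        # Reachability from the current point: max(core distance, distance).
        new_reach = np.maximum(core[current], dists[current, candidates])
        reach[candidates] = np.minimum(reach[candidates], new_reach)
        # "Next closest": the unprocessed point with the smallest
        # reachability so far (np.argmin breaks ties by lowest index).
        current = candidates[np.argmin(reach[candidates])]
    return np.array(order), reach

# Two small groups on a line: {0, 1, 2} and {10, 11}.
X = np.array([[0.0], [1.0], [10.0], [11.0], [2.0]])
order, reach = optics_order(X, min_samples=2)
print(order)
print(reach)
```

In this sketch the large reachability value marks the jump from the first group to the second. An implementation that picks the next point by a different rule would visit points in a different order and could assign different reachability values, which is the kind of discrepancy the comparison is meant to explain.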
Apparently I missed some comments while I was with my parents, sorry.
I agree. R has a test to ensure that their implementation is consistent with ELKI (need to confirm whether it's coincidental though). If R and ELKI have the same implementation, we might follow them and note down our solution. |
Yeah, if there's a possibility that there's a release in ~3 months, I'd definitely prefer to delay releasing OPTICS until then. |
A release in 3 months is VERY optimistic! I think when @GaelVaroquaux talks
of a shorter release cycle, we would be hoping to return towards 6-monthly
cycles, where this last cycle has been 15 months. But that would depend on
having reviewer availability, maybe more project management around bigger
issues so that we don't delay for them excessively.
|
But I think there is some consensus here that we should hit merge on this PR, and make sure an OPTICS we are confident in is available at the next release. |
+1. |
It's already been 5 and a half years, so another 6-to-15 months won't make that much of a difference I suppose. At least now it's in the development version, so people can clone it and view the online documentation if they like... I know I used the reworked gaussian processes code for at least 6 or 8 months before it made it to a stable official release. |
Thanks for your patience. I've been trying really hard to get this in too, but alas...! |
I've reflected this change in what's new in master: e616ee3 |
Sad. Time for a few sad drinks to forget the loss of our beloved OPTICS in this release. |
There's a little bit of action on Gitter... but not much.
|
But 0.20 is huge. We really just need to let that bomb drop, and then focus
on making the next one more manageable.
|
But 0.20 is huge. We really just need to let that bomb drop, and then focus
on making the next one more manageable.
Yes, that's the big deal. Such a huge release is very hard to manage.
I would like in the future to try to put more resources on my side into
helping with releases. No promises, but I have hopes.
|
Thanks @jnothman. I've been unfortunately busy. Are there remaining blockers? |
@espg wrote:
The "ELKI extraction method which has many, many different names" is from the original paper (Section 4.3 Automatic Techniques, Figure 19: Algorithm ExtractClusters)! In that paper it was using Definition 11: ξ-clusters (a lowercase Greek xi), which is why we used the name OPTICSXi for finding such Xi clusters in Java... what other "many, many different names" do you know? Naming another method "auto_extract" seems like a bad idea to me, as the original OPTICS paper describes the Xi method in the section titled "automatic techniques"... |
Obviously more specific names make sense. I cannot parse "sqlnk" though, where is the name from? |
Due to too many proposed backwards-incompatible changes.
We had hoped to include OPTICS in 0.20 particularly to provide a memory-efficient alternative to DBSCAN... but with several issues open about OPTICS that would result in backwards-incompatible changes, it's hard to justify further delaying the release, nor releasing with these issues open.
What do others think, @espg, @adrinjalali, @GaelVaroquaux?