Remove OPTICS from 0.20 #12053

jnothman · 2018-09-12T05:10:43Z

Due to too many proposed backwards-incompatible changes.

We had hoped to include OPTICS in 0.20 particularly to provide a memory-efficient alternative to DBSCAN... but with several issues open about OPTICS that would result in backwards-incompatible changes, it's hard to justify further delaying the release, nor releasing with these issues open.

What do others think, @espg, @adrinjalali, @GaelVaroquaux?

espg · 2018-09-12T06:32:01Z

@jnothman My understanding is that the biggest blocker was the numerical issue referenced in #11878, and finding an appropriate fix for the test involved. I think that this issue is (finally) fixed with #12054. I realize that there are a few edge cases outstanding with how split points should be dealt with (i.e., #11857)... however, I would argue that the core algorithm is solid and well tested, and that the extraction issues are relatively minor.

I'd also point out that (IMO), the OPTICS implementation is currently better tested and more complete than DBSCAN was in 2013 when the OPTICS PR was first opened. Specifically, compared to DBSCAN circa 2013, it scales better, has higher test coverage, and is better optimized.

My point in making the comparison with DBSCAN isn't meant as some sort of appeal for fairness in evaluating algorithms for inclusion into sklearn. The sklearn project was in a different place 5 years ago, and coding and algorithm inclusion standards have increased for a reason. My larger point is that none of the issues with DBSCAN in 2013 were severe enough to stop it from being usefully deployed to the community, and that with broader use of the algorithm, further releases addressed those issues. Similarly, I think that the remaining issues in OPTICS are minor, and can be addressed as bug fixes and enhancements in 0.21.

jnothman · 2018-09-12T06:50:14Z

I wanted to open this to give us the easy option of considering a release asap. I'm okay with continuing to iron out the issues there, but feel that if it's blocking release, we need to resolve the issues this week and no later. I think the question of whether density includes or excludes the query point is also one that definitely needs resolving. There just seem to be several open questions to which, by this stage in the release cycle, we need answers. DBSCAN has changed in API and implementation but not in results in many years, as far as I can remember. It's a bad comparison because, except for speed regressions, we were able to ensure backwards compatibility.

GaelVaroquaux · 2018-09-12T07:14:10Z

I think that removing OPTICS is something that we should consider. The line of thought is the following: we are strongly risking backward-incompatible changes in it. These changes are a significant cost to developers and to users. Trying to avoid them is currently holding up the release. We cannot hold up the release too long.

I am open to giving us another few days to try to resolve the pending issues (on which I feel that I have no expertise), but I do not have resources to commit to them right now. Hence, I fear that they will slip.

Right now, not putting OPTICS in 0.20 feels like a major drawback because it feels like it will delay its availability a lot. The best solution to such a problem would be to release more often. I think that this is a worthwhile goal, and I would like to start thinking about how I can allocate more resources to this goal.

@espg: does this sound like a reasonable analysis of the situation?

adrinjalali · 2018-09-12T08:57:35Z

I think we are at a point where most issues are resolved, or pending review in PRs (#12028, #11929, #12029, #12049, #12054).

Specifically for the split points issue, I managed to come up with an idea, borrowing the xi concept from the original paper, which I've proposed in PR #12049.

Regarding backward compatibility and the API, I'd consider it pretty mature, especially if we implement a variety of what's proposed in OPTICS extraction methods.

There are some smaller outstanding issues which don't take too much time to fix, some of them mentioned in #11677 (comment)

The options I see ahead are:

1. retract OPTICS from 0.20, and maybe try to release 0.21 sooner than usual
2. tag OPTICS as experimental, release, and fix the issues for 0.20.1 or so
3. fix the remaining issues fast and release

I personally have a good amount of time to spare this week to try and make the 3rd option happen. Otherwise option 2 is my personal favorite, but I'm also in the camp which is happy with releasing every new algorithm as experimental for the community to test, getting feedback, fixing the issues, and releasing as stable once it has matured (but I know this is not how it's usually done in scikit-learn).

espg · 2018-09-12T17:38:38Z

I think we can combine options 2 and 3 that @adrinjalali is suggesting while still meeting @jnothman's and @GaelVaroquaux's requirements concerning output stability. The OPTICS algorithm has at present:

- A single extraction method of the reachability graph
- Three features that extract clusters from the graph, at various levels of maturity:
  - extract_dbscan, which is the only method mentioned in the original paper
  - auto_extract, from @amyxzhang with substantial modification from @adrinjalali and myself
  - the ELKI extraction method which has many, many different names is in use (or at least available) in both ELKI and R

My view is that the extraction of the reachability graph is stable. The outputs are tested, and have not changed at all since 2015. They will not change in the future. They will not break compatibility going forward.

Likewise, I view extract_dbscan as stable. The code for the method is extremely terse, easy to understand, and has been coded and reviewed in its current state by @jnothman and myself. The outputs for this function should not change, or if they do change, will change as a result of the Sklearn clustering API incorporating new labeling vocabulary (i.e., label = -2 if we ever decide to include split points for users explicitly).

My suggestion is that we release OPTICS with extract_dbscan as the default extraction method, and that we optionally include auto_extract as an experimental feature that the user can select. This would require mostly updating the documentation, and if including auto_extract, noting that the outputs may change in a following release for certain edge cases (i.e., split points, the last entry of the graph, etc.). This would solve the primary complaint about DBSCAN not being memory efficient, allow the current release to go forward, and not introduce any code that may change outputs for the next release. Updates to auto_extract can be released as bug fixes and changed to the default extraction function in 0.21 as an enhancement. The ELKI extraction can be released as an enhancement.

I am of course also open to us trying to fix everything in time for the 0.20 release...but if it's too rushed (and I get the sense that it may be), we can treat the modular parts of OPTICS as modular and release the part of the API that is stable and will not change.

Does this sound like a reasonable way forward?
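
For readers following the thread, the DBSCAN-style extraction described above can be sketched in a few lines of plain Python. This is only an illustration based on a reading of the ExtractDBSCAN-Clustering procedure in the original OPTICS paper, not scikit-learn's code; the function name `extract_dbscan_labels`, its arguments, and the toy values are all assumptions made for the example.

```python
import math

NOISE = -1  # label used here for points that belong to no cluster


def extract_dbscan_labels(ordering, reachability, core_distances, eps):
    """Walk the OPTICS ordering once and assign DBSCAN-like labels.

    ordering       -- point indices in the order OPTICS visited them
    reachability   -- reachability distance per point (inf if undefined)
    core_distances -- core distance per point (inf if not a core point)
    eps            -- the DBSCAN radius to emulate
    """
    labels = [NOISE] * len(ordering)
    cluster_id = -1
    for point in ordering:
        if reachability[point] > eps:
            # Not density-reachable at eps from the points visited before it.
            if core_distances[point] <= eps:
                # A core point at this eps opens a new cluster.
                cluster_id += 1
                labels[point] = cluster_id
            else:
                labels[point] = NOISE
        else:
            # Reachable from an earlier point: stays in the current cluster.
            labels[point] = cluster_id
    return labels


# Toy reachability data (hypothetical values, not from a real run).
ordering = [0, 1, 2, 3, 4, 5]
reachability = [math.inf, 0.5, 0.6, 3.0, 0.4, math.inf]
core_distances = [0.5, 0.5, 0.6, 0.4, 0.4, math.inf]

labels = extract_dbscan_labels(ordering, reachability, core_distances, eps=1.0)
print(labels)
```

The point of the sketch is that the labels depend only on the stored ordering, reachability, and core distances, so different `eps` values can be tried without recomputing any neighborhoods.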

jnothman · 2018-09-13T08:29:26Z

I find the idea of releasing dbscan as the primary extraction unappealing when automatic extraction can deal with clusters of different density and hence should be a key feature. I'd rather just go with saying that the handling of split points is experimental, if we think that is likely to be the only remaining issue pertaining to backwards compatibility.

adrinjalali · 2018-09-13T08:36:51Z

I'd be happy with what you suggest @jnothman! I'm also working on an implementation of extractXi, which I'd probably be more comfortable to have as the default option.

jnothman · 2018-09-17T03:09:03Z

Sigh. I'm really inclined to delay releasing OPTICS until 0.21. There seem to be too many questions of correctness and comparison to prior art. And it also seems fair to MICE/IterativeImputer which was also a long-open PR and has been similarly delayed due to interface stability issues.

qinhanmin2014 · 2018-09-17T03:43:46Z

> Sigh. I'm really inclined to delay releasing OPTICS until 0.21.

Hmm, it's a sad decision but I'll vote +1 at this point. Some so-called edge cases are not trivial and I think we still need some time to compare with existing implementations (the referenced one, R dbscan, ELKI) and figure out what the appropriate solution is.
Maybe we can release 0.21 once we merge OPTICS and IterativeImputer (e.g., in 6 months/3 months).

espg · 2018-09-17T03:55:41Z

As much as I would love to see this available in 0.20, it is important that we get it right. I think that the appropriate thing to do is to start with comparison to the R implementation... i.e., these comments: #11677 (comment), #12090 (comment)

I'd like to figure out why the R implementation is different, specifically why they don't choose the next closest point when building the reachability graph. Their implementation looks like it would produce identical reachability values if it did... if we're sure this is the right way to do it, perhaps we should open a bug with them. If we aren't, we should figure out why it isn't and change our implementation.

qinhanmin2014 · 2018-09-17T04:05:52Z

> i.e., these comments: #11677 (comment), #12090 (comment)

Apparently I missed some comments while I was with my parents, sorry.

> I think that the appropriate thing to do is to start with comparison to the R implementation

I agree. R has a test to ensure that their implementation is consistent with ELKI (though we need to confirm whether that is just a coincidence). If R and ELKI have the same implementation, we might follow them and note down our solution.

adrinjalali · 2018-09-17T06:22:30Z

Yeah, if there's a possibility that there's a release in ~3 months, I'd definitely prefer to delay releasing OPTICS until then.

jnothman · 2018-09-17T06:35:38Z

A release in 3 months is VERY optimistic! I think when @GaelVaroquaux talks of a shorter release cycle, we would be hoping to return towards 6-monthly cycles, where this last cycle has been 15 months. But that would depend on having reviewer availability, maybe more project management around bigger issues so that we don't delay for them excessively.

jnothman · 2018-09-17T06:36:18Z

But I think there is some consensus here that we should hit merge on this PR, and make sure an OPTICS we are confident in is available at the next release.

adrinjalali · 2018-09-17T06:39:52Z

+1.

espg · 2018-09-17T06:58:27Z

It's already been 5 and a half years, so another 6-to-15 months won't make that much of a difference I suppose. At least now it's in the development version, so people can clone it and view the online documentation if they like... I know I used the reworked Gaussian processes code for at least 6 or 8 months before it made it to a stable official release.

jnothman · 2018-09-17T07:45:07Z

Thanks for your patience. I've been trying really hard to get this in too, but alas...!

jnothman · 2018-09-17T07:48:14Z

I've reflected this change in what's new in master: e616ee3

adrinjalali · 2018-09-17T07:55:10Z

Sad. Time for a few sad drinks to forget the loss of our beloved OPTICS in this release.
[I wish we actually used IRC or something]

jnothman · 2018-09-17T08:01:32Z

There's a little bit of action on Gitter... but not much.

jnothman · 2018-09-17T08:02:06Z

But 0.20 is huge. We really just need to let that bomb drop, and then focus on making the next one more manageable.

GaelVaroquaux · 2018-09-17T08:08:31Z

> But 0.20 is huge. We really just need to let that bomb drop, and then focus on making the next one more manageable.

Yes, that's the big deal. Such a huge release is very hard to manage. In the future, I would like to try to put more resources on my side into helping with releases. No promises, but I have hopes.

amueller · 2018-09-18T21:38:47Z

Thanks @jnothman. I've been unfortunately busy. Are there remaining blockers?

kno10 · 2018-09-21T09:20:29Z

@espg wrote:

> Three features that extract clusters from the graph, at various levels of maturity:
>
> - extract_dbscan, which is the only method mentioned in the original paper
> - auto_extract, from @amyxzhang with substantial modification from @adrinjalali and myself
> - the ELKI extraction method which has many, many different names is in use (or at least available) in both ELKI and R

The "ELKI extraction method which has many, many different names" is from the original paper (Section 4.3 Automatic Techniques, Figure 19: Algorithm ExtractClusters)! That paper uses Definition 11: ξ-clusters (a lowercase Greek xi), which is why we used the name OPTICSXi for finding such Xi clusters in Java... what other "many, many different names" do you know?

Naming another method "auto_extract" seems like a bad idea to me, as the original OPTICS paper describes the Xi method in the section titled "automatic techniques"...
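
As background for the ξ terminology above: the Xi method builds on "ξ-steep" points of the reachability plot. The sketch below is a plain-Python illustration of that one definition as given in the original OPTICS paper, namely that a point is ξ-steep upward if its reachability is at least a factor (1 − ξ) lower than its successor's, and ξ-steep downward in the mirrored case. It is not the ELKI or scikit-learn code, and the function name `steep_points` and the sample plot are assumptions for illustration only.

```python
def steep_points(reachability_plot, xi):
    """Return (upward, downward) index lists of xi-steep points.

    reachability_plot -- reachability values in OPTICS ordering
    xi                -- relative drop/rise threshold, 0 < xi < 1
    """
    upward, downward = [], []
    for i in range(len(reachability_plot) - 1):
        current = reachability_plot[i]
        following = reachability_plot[i + 1]
        # Upward: the next value is higher by at least the factor 1 / (1 - xi).
        if current <= following * (1 - xi):
            upward.append(i)
        # Downward: the next value is lower by at least the factor (1 - xi).
        if current * (1 - xi) >= following:
            downward.append(i)
    return upward, downward


# Hypothetical reachability plot with a single valley.
plot = [1.0, 0.5, 0.5, 0.52, 1.0]
up, down = steep_points(plot, xi=0.1)
print(up, down)
```

A full Xi extraction would additionally group steep points into steep areas and pair downward areas with upward areas to form clusters; that part is deliberately left out of this sketch.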

adrinjalali · 2018-09-21T09:34:49Z

@kno10, just to note that in PR #12087 a new name, extract_sqlnk, is proposed for what is now called auto_extract in master. And hopefully, once it's better tested, PR #12077 will add extract_xi as well, which is almost what is called OPTICSXi in ELKI.

Do these names sound better to you?

kno10 · 2018-09-21T10:08:48Z

Obviously more specific names make sense. I cannot parse "sqlnk" though, where is the name from?

Remove OPTICS from 0.20

42b14ff

Due to too many proposed backwards-incompatible changes

adrinjalali mentioned this pull request Sep 14, 2018

[MRG+2] OPTICS: add extract_xi method #12077

Merged

jnothman merged commit c196ec4 into scikit-learn:0.20.X Sep 17, 2018
