[MRG+2] faster way of computing means across each group #10020


Merged

Conversation

@sergulaydore (Contributor) commented Oct 26, 2017

Reference Issues/PRs

What does this implement/fix? Explain your changes.

It computes the mean of grouped features in a faster way.

Any other comments?

The performance improvement depends on the number of clusters and the number of samples. As the number of clusters increases, the gain can reach the order of 100x; as the number of samples increases, the gain shrinks. Performance gain (speed_of_new_method / speed_of_old_method) for:

  • n_features=10000, n_clusters=5000, n_sample=10 is 515
  • n_features=10000, n_clusters=5000, n_sample=100 is 52
  • n_features=10000, n_clusters=5000, n_sample=1000 is 6
  • n_features=1000, n_clusters=500, n_sample=100000 is 1.6
  • n_features=1000, n_clusters=5, n_sample=100000 is 2.35
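For readers who want to reproduce the comparison, here is a minimal standalone sketch (hypothetical, not the PR's actual benchmark file): the old code loops over clusters and takes a masked mean, while the new code does one weighted np.bincount per sample.

```python
import numpy as np

def grouped_means_loop(X, labels):
    # old approach: one masked mean per cluster label
    return np.array([X[:, labels == l].mean(axis=1)
                     for l in np.unique(labels)]).T

def grouped_means_bincount(X, labels):
    # new approach: weighted bincount per sample, divided by cluster sizes
    size = np.bincount(labels)
    return np.array([np.bincount(labels, weights=row) / size for row in X])

rng = np.random.RandomState(0)
X = rng.rand(10, 1000)                                  # (n_samples, n_features)
labels = rng.permutation(np.repeat(np.arange(50), 20))  # 50 clusters, 20 features each

assert np.allclose(grouped_means_loop(X, labels),
                   grouped_means_bincount(X, labels))
```

Timing the two functions with timeit on larger shapes shows the same trend as the numbers above: the bincount route avoids one boolean mask and one reduction per cluster.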

@amueller (Member)

Can you please provide end-to-end benchmarks (code + results)?

@sergulaydore (Contributor Author)

@amueller Added a benchmark file. However, I can't really figure out why codecov/patch is failing. Is that because I need to add more test cases for the lines I added?

@lesteve (Member) commented Oct 27, 2017

Look at the codecov report. There is no test covering pooling_func != np.mean.

It's not really your fault (there was no test before your PR) but if you can add a test that would be great.

@sergulaydore (Contributor Author)

@lesteve I don’t quite understand that. Do you mean adding a new test file like test_feature_agglomeration? Or should I raise an error in _feature_agglomeration when pooling_func is not np.mean? I am confused because I don’t have a test for pooling_func == np.mean either; why doesn’t codecov/patch complain about that?

@lesteve (Member) commented Oct 27, 2017

IIRC last time I looked pooling_func was not really well tested, even adding tests with pooling_func=np.mean would be great.

@lesteve (Member) commented Oct 27, 2017

Do you mean adding a new test file like test_feature_agglomeration?

Adding a test function inside the relevant test file. pooling_func == np.mean is "tested" (only a smoke test, IIRC) via sklearn.cluster.hierarchical.FeatureAgglomeration, I think. If you can test the AgglomerationTransform class in isolation, i.e. add sklearn/cluster/tests/test_feature_agglomeration.py, even better.

@sergulaydore (Contributor Author)

@lesteve Thanks for the clarification. Let me try adding sklearn/cluster/tests/test_feature_agglomeration.py.

@amueller (Member)

Sorry, I was being unclear. I was asking you to show benchmark results in this PR and link to the code (possibly in a gist). We don't usually include benchmarks for "small" improvements in the benchmark folder, but it's good to have the code posted somewhere for the record.

@amueller (Member)

(also wow the benchmarks look great but I haven't had time to look into the details)

@amueller (Member)

pooling_func is deprecated, right?

@amueller (Member) commented Oct 27, 2017

with 1000 samples:
[plot: feature_agg_transform_n_samples1000]

with 1000000 samples:
[plot: feature_agg_transform_n_samples100000]

The scaling behavior in terms of number of features seems clearly better, and not that dependent on n_clusters. (Though maybe a log-log scale would have been better for that last plot, whoops.)

See: https://gist.github.com/amueller/9853d77d9a08f4445f7ee1f7cffe4241

@amueller (Member)

I would remove the special case, given that pooling_func is ignored.

@sergulaydore (Contributor Author)

I removed the case where pooling_func != np.mean. Should I also delete the benchmark file? Anything else you suggest?

@jnothman (Member)

pooling_func is deprecated, right?

Noooo.... pooling_func is only deprecated from AgglomerativeClustering where it is unused, not from FeatureAgglomeration to which this applies.

(In fact, I've not checked, but I suspect pooling_func was only present in AgglomerativeClustering to make inherited initialisation of FeatureAgglomeration easy; it was intentionally left undocumented until we recently inserted documentation to make our docstring checker happy... and then we found the param needed deprecating.)

@sergulaydore (Contributor Author)

@jnothman So you suggest to keep the special case and write the test instead?

@jnothman (Member) commented Oct 28, 2017 via email

@sergulaydore (Contributor Author) commented Oct 29, 2017

How does it look now? Note that for the case pooling_func != np.mean, I only tested np.median.

size = np.bincount(self.labels_)
n_samples = X.shape[0]
# a fast way to compute the mean of grouped features
nX = np.array([np.bincount(self.labels_, X[i, :])/size
Member:

how does this weighted bincount compare to

X @ scipy.sparse.csr_matrix((np.ones_like(labels), labels, np.arange(len(labels) + 1)))
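For readers following along, the suggestion above builds a feature-to-cluster indicator matrix so the grouped sums become a single sparse product. A minimal sketch of the idea (with an explicit shape argument added, which the one-liner above omits; the variable names here are illustrative):

```python
import numpy as np
import scipy.sparse as sp

rng = np.random.RandomState(0)
n_clusters = 7
X = rng.rand(5, 100)                              # (n_samples, n_features)
labels = rng.randint(0, n_clusters, size=100)     # cluster label per feature

# CSR indicator matrix: row i has a single 1 in column labels[i]
M = sp.csr_matrix((np.ones_like(labels), labels,
                   np.arange(len(labels) + 1)),
                  shape=(len(labels), n_clusters))
sums = np.asarray(X @ M)                          # grouped sums, shape (5, 7)
means = sums / np.bincount(labels, minlength=n_clusters)

# same result as the weighted-bincount route
ref = np.array([np.bincount(labels, weights=row, minlength=n_clusters)
                for row in X])
assert np.allclose(sums, ref)
```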

Contributor Author:

this did not perform better. please see the plots below.

@jnothman (Member) left a comment:

I think the test should actually include a simple case (e.g. one cluster, one sample) where the median and mean differ: FeatureAgglomeration(n_clusters=1).fit_transform([[0, 0, 1]])
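A sketch of such a test, assuming the FeatureAgglomeration signature with pooling_func (the parameter present at the time of this PR):

```python
import numpy as np
from sklearn.cluster import FeatureAgglomeration

X = np.array([[0, 0, 1]])  # one sample, three features

agglo_mean = FeatureAgglomeration(n_clusters=1, pooling_func=np.mean)
agglo_median = FeatureAgglomeration(n_clusters=1, pooling_func=np.median)

Xt_mean = agglo_mean.fit_transform(X)
Xt_median = agglo_median.fit_transform(X)

# mean of [0, 0, 1] is 1/3, median is 0: the two pooling functions differ
assert np.allclose(Xt_mean, [[1 / 3.]])
assert np.allclose(Xt_median, [[0.]])
```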

assert_true(np.size(np.unique(agglo_median.labels_)) == n_clusters)

# Test transform
X_red_mean = agglo_mean.transform(X)
Member:

if by "red" you mean "reduced" we ususally use Xt for transformed X

assert_true(X_red_median.shape[1] == n_clusters)

# Check that fitting with no samples raises a ValueError
assert_raises(ValueError, agglo_mean.fit, X[:0])
Member:

This sort of thing should be covered by common tests


# Test inverse transform
X_full_mean = agglo_mean.inverse_transform(X_red_mean)
X_full_median = agglo_mean.inverse_transform(X_red_median)
Member:

should this be agglo_median?

@sergulaydore (Contributor Author)

@jnothman Thanks for the review. I modified the test code according to your suggestions.
I also created some plots for the np_mean, np_bincount and scipy.sparse.csr_matrix methods using this script. As you can see, csr_matrix does not perform better than the other two, and np_bincount seems to be the best method for the cases I've tested.

[plots: performance_vs_nclusters_linear, performance_vs_nfeatures_lin, performance_vs_nsamples_lin]

@sergulaydore (Contributor Author)

I don't see a reason why X should be sparse, but FeatureAgglomeration accepts sparse matrices. What do you mean by "previously"? Shouldn't such a test go in test_hierarchical?

@jnothman (Member) commented Oct 31, 2017 via email

@sergulaydore (Contributor Author) commented Nov 1, 2017

OK, it seems np.mean accepts sparse matrices but np.bincount does not. I would need to call np.bincount(agglo.labels_, X.data[X.indptr[i]:X.indptr[i+1]]) if X is sparse. Should I add such a line for the sparse case, raise an error, or fall back to np.mean?

@jnothman (Member) commented Nov 1, 2017

That's not quite right I think. You'd need labels_[X.indices[row_start:row_stop]].

That would be okay, or just throw sparse matrices into the else case
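A sketch of the indexing described above, on a small random CSR matrix (the names here are illustrative, not the PR's code): the column indices of each row's nonzeros are mapped through labels before bincounting.

```python
import numpy as np
import scipy.sparse as sp

rng = np.random.RandomState(0)
n_clusters = 4
labels = rng.randint(0, n_clusters, size=20)     # cluster label per feature
Xs = sp.random(3, 20, density=0.4, format='csr', random_state=0)

# per-row grouped sums: slice the CSR arrays for row i, then map the
# column indices of the nonzeros through labels before bincounting
grouped = []
for i in range(Xs.shape[0]):
    start, stop = Xs.indptr[i], Xs.indptr[i + 1]
    cols = Xs.indices[start:stop]
    grouped.append(np.bincount(labels[cols],
                               weights=Xs.data[start:stop],
                               minlength=n_clusters))
grouped = np.array(grouped)

# dense reference computed the straightforward way
ref = np.array([np.bincount(labels, weights=row, minlength=n_clusters)
                for row in Xs.toarray()])
assert np.allclose(grouped, ref)
```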

@sergulaydore (Contributor Author)

Do you mean np.bincount(agglo.labels_[Xsparse.indices[i]:Xsparse.indices[i+1]], Xsparse.data[i, :])? That also did not work. I think throwing sparse matrices into the else case would be less confusing, but I'm not sure about the performance.

codecov bot commented Nov 2, 2017

Codecov Report

Merging #10020 into master will increase coverage by 0.01%.
The diff coverage is 100%.


@@            Coverage Diff             @@
##           master   #10020      +/-   ##
==========================================
+ Coverage   96.18%   96.19%   +0.01%     
==========================================
  Files         336      337       +1     
  Lines       62706    62753      +47     
==========================================
+ Hits        60311    60364      +53     
+ Misses       2395     2389       -6
Impacted Files Coverage Δ
sklearn/cluster/_feature_agglomeration.py 100% <100%> (ø) ⬆️
...klearn/cluster/tests/test_feature_agglomeration.py 100% <100%> (ø)
sklearn/linear_model/coordinate_descent.py 96.95% <0%> (ø) ⬆️
sklearn/neural_network/tests/test_mlp.py 100% <0%> (ø) ⬆️
sklearn/linear_model/ridge.py 95.66% <0%> (ø) ⬆️
sklearn/feature_selection/rfe.py 97.6% <0%> (ø) ⬆️
sklearn/calibration.py 98.87% <0%> (ø) ⬆️
sklearn/linear_model/logistic.py 97.07% <0%> (ø) ⬆️
sklearn/linear_model/least_angle.py 96.26% <0%> (ø) ⬆️
... and 3 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update 202b532...fba7691.

for l in np.unique(self.labels_):
nX.append(pooling_func(X[:, self.labels_ == l], axis=1))
return np.array(nX).T
if (pooling_func == np.mean) & (not issparse(X)):
Member:

you need and with no parentheses, not &.
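The concern here is precedence and semantics, not just style: & binds more tightly than ==, so dropping the parentheses without switching to and would change the parse. A tiny generic illustration (not the PR's code):

```python
flag = False

# `&` binds tighter than `==`: `1 == 1 & flag` parses as `1 == (1 & flag)`
assert (1 == 1 & flag) == False      # 1 & False is 0, and 1 == 0 is False

# `and` has lower precedence and short-circuits, so no parentheses are needed
assert (1 == 1 and not flag) == True
```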

@sergulaydore (Contributor Author)

@jnothman Done. Anything else?

@jnothman (Member) left a comment:

Thanks


def test_feature_agglomeration():
n_clusters = 1
X = np.array([0, 0, 1], ndmin=2) # (n_samples, n_features)
Member:

Could you write this out in full as nested lists, or reshape it so that the orientation is clearer to the reader, please?

agglo_mean.fit(X)
agglo_median.fit(X)
assert_true(np.size(np.unique(agglo_mean.labels_)) == n_clusters)
assert_true(np.size(np.unique(agglo_median.labels_)) == n_clusters)
Member:

You should test the size of labels_ too

assert_true(np.size(np.unique(agglo_median.labels_)) == n_clusters)

# Test transform
Xt_mean = agglo_mean.transform(X)
Member:

Please assert that these match the exact values you expect.

@sergulaydore (Contributor Author) left a comment:

Done. How does it look now?

@jnothman (Member) commented Nov 3, 2017

LGTM!

@jnothman jnothman changed the title faster way of computing means across each group [MRG+1] faster way of computing means across each group Nov 3, 2017
Xt_median = agglo_median.transform(X)
assert_true(Xt_mean.shape[1] == n_clusters)
assert_true(Xt_median.shape[1] == n_clusters)
assert_true(Xt_mean == np.array([1/3.]))
Member:

why is 1/3. not seen as a pep8 violation by travis?

Contributor Author:

Because the mean of [0, 0, 1] is 1/3, which is 0.33333... It passed all the checks. Do you have another suggestion for this test, or what exactly is your concern?

@jnothman (Member), Nov 4, 2017:

@agramfort, spaces around operators is no longer a strict requirement of PEP8. I would suggest, @sergulaydore, that we'd consider it better style to have spaces around arithmetic binary operators except where leaving spaces out helps clarify order of operations (e.g. a*b + c).

Member:

spaces around operators is no longer a strict requirement of PEP8

can't we still require it? We have all this code that uses spaces, and I don't like changing the rules of the game in the middle... it's also likely to create unnecessary diffs.

Member:

I think here we should require it. I suppose we could in general, but we will need to help contributors have the same flake8 settings. See #9121.

Member:

if we put the ignore in setup.cfg and not in travis build then I think it's easy for contributors to run make flake and see the same errors as on travis. I would favor this.

Contributor Author:

So, do you both mean to change it to assert_true(Xt_mean == np.array([1 / 3.]))?

size = np.bincount(self.labels_)
n_samples = X.shape[0]
# a fast way to compute the mean of grouped features
nX = np.array([np.bincount(self.labels_, X[i, :])/size
Member:

X[i, :]) / size

Contributor Author:

Done.

@amueller (Member) commented Nov 6, 2017

Hm I somehow thought the deprecation of pooling_func would propagate to FeatureAgglomeration. Sorry I didn't check the details. This should probably be documented more explicitly?

@jnothman (Member) commented Nov 6, 2017 via email

assert_true(Xt_mean.shape[1] == n_clusters)
assert_true(Xt_median.shape[1] == n_clusters)
assert_true(Xt_mean == np.array([1 / 3.]))
assert_true(Xt_median == np.array([0.]))
Member:

did we stop using assert_equal with the switch towards pytest?

Contributor Author:

I used test_ward_agglomeration function in test_hierarchical.py as a guideline for this test. However, for the rest of test_hierarchical.py, assert_equal and assert_true have been used interchangeably.

@agramfort agramfort changed the title [MRG+1] faster way of computing means across each group [MRG+2] faster way of computing means across each group Nov 7, 2017
@agramfort agramfort merged commit 555bf6b into scikit-learn:master Nov 7, 2017
@agramfort (Member)

thx @sergulaydore

@sergulaydore (Contributor Author)

Thanks @agramfort , @jnothman, @lesteve and @amueller for your time.

@jnothman (Member) commented Nov 7, 2017 via email

maskani-moh pushed a commit to maskani-moh/scikit-learn that referenced this pull request Nov 15, 2017
jwjohnson314 pushed a commit to jwjohnson314/scikit-learn that referenced this pull request Dec 18, 2017