[MRG+2] faster way of computing means across each group #10020


Merged

Conversation

@sergulaydore (Contributor) commented Oct 26, 2017

Reference Issues/PRs

What does this implement/fix? Explain your changes.

It computes the mean of grouped features in a faster way.

Any other comments?

The performance improvement depends on the number of clusters and the number of samples. As the number of clusters increases, the gain can reach the order of 100x; as the number of samples increases, the gain shrinks. Performance gain (speed_of_new_method / speed_of_old_method) for:

  • n_features=10000, n_clusters=5000, n_sample=10 is 515
  • n_features=10000, n_clusters=5000, n_sample=100 is 52
  • n_features=10000, n_clusters=5000, n_sample=1000 is 6
  • n_features=1000, n_clusters=500, n_sample=100000 is 1.6
  • n_features=1000, n_clusters=5, n_sample=100000 is 2.35
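For readers who want to reproduce the comparison, here is a minimal standalone sketch (hypothetical, not the PR's actual benchmark file): the old code loops over clusters and takes a masked mean, while the new code does one weighted np.bincount per sample.

```python
import numpy as np

def grouped_means_loop(X, labels):
    # old approach: one masked mean per cluster label
    return np.array([X[:, labels == l].mean(axis=1)
                     for l in np.unique(labels)]).T

def grouped_means_bincount(X, labels):
    # new approach: weighted bincount per sample, divided by cluster sizes
    size = np.bincount(labels)
    return np.array([np.bincount(labels, weights=row) / size for row in X])

rng = np.random.RandomState(0)
X = rng.rand(10, 1000)                                  # (n_samples, n_features)
labels = rng.permutation(np.repeat(np.arange(50), 20))  # 50 clusters, 20 features each

assert np.allclose(grouped_means_loop(X, labels),
                   grouped_means_bincount(X, labels))
```

Timing the two functions with timeit on larger shapes shows the same trend as the numbers above: the bincount route avoids one boolean mask and one reduction per cluster.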

@amueller (Member)

Can you please provide end-to-end benchmarks (code + results)?

@sergulaydore (Contributor Author)

@amueller Added a benchmark file. However, I can't really figure out why codecov/patch is failing. Is that because I need to add more test cases for the lines I added?

@lesteve (Member) commented Oct 27, 2017

Look at the codecov report. There is no test covering pooling_func != np.mean.

It's not really your fault (there was no test before your PR) but if you can add a test that would be great.

@sergulaydore (Contributor Author)

@lesteve I don’t quite understand that. Do you mean adding a new test file like test_feature_agglomeration? Or should I raise an error in _feature_agglomeration when pooling_func is not np.mean? I am confused because I don’t have a test for pooling_func == np.mean either; why doesn’t codecov/patch complain about that?

@lesteve (Member) commented Oct 27, 2017

IIRC last time I looked pooling_func was not really well tested, even adding tests with pooling_func=np.mean would be great.

@lesteve (Member) commented Oct 27, 2017

Do you mean adding a new test file like test_feature_agglomeration?

Adding a test function inside the relevant test file. pooling_func == np.mean is "tested" (only a smoke test, IIRC) via sklearn.cluster.hierarchical.FeatureAgglomeration, I think. If you can test the AgglomerationTransform class in isolation, i.e. add sklearn/cluster/tests/test_feature_agglomeration.py, even better.

@sergulaydore (Contributor Author)

@lesteve Thanks for the clarification. Let me try adding sklearn/cluster/tests/test_feature_agglomeration.py.

@amueller (Member)

Sorry, I was being unclear. I was asking you to show benchmark results in this PR and link to the code (possibly in a gist). We don't usually include benchmarks for "small" improvements in the benchmark folder, but it's good to have the code posted somewhere for the record.

@amueller (Member)

(also wow the benchmarks look great but I haven't had time to look into the details)

@amueller (Member)

pooling_func is deprecated, right?

@amueller (Member) commented Oct 27, 2017

with 1000 samples:
[plot: feature_agg_transform_n_samples1000]

with 1000000 samples:
[plot: feature_agg_transform_n_samples100000]

The scaling behavior in terms of number of features seems clearly better, and not that dependent on n_clusters. (Though maybe a log-log scale would have been better for that last plot, whoops.)

See: https://gist.github.com/amueller/9853d77d9a08f4445f7ee1f7cffe4241

@amueller (Member)

I would remove the special case, given that pooling_func is ignored.

@sergulaydore (Contributor Author)

I removed the case where pooling_func != np.mean. Should I also delete the benchmark file? Anything else you suggest?

@jnothman (Member)

pooling_func is deprecated, right?

Noooo.... pooling_func is only deprecated from AgglomerativeClustering where it is unused, not from FeatureAgglomeration to which this applies.

(In fact, I've not checked, but I suspect pooling_func was only present in AgglomerativeClustering to make inherited initialisation of FeatureAgglomeration easy; it was intentionally left undocumented until we recently inserted documentation to make our docstring checker happy... and then we found the param needed deprecating.)

@sergulaydore (Contributor Author)

@jnothman So you suggest to keep the special case and write the test instead?

@jnothman (Member) commented Oct 28, 2017 via email

@sergulaydore (Contributor Author) commented Oct 29, 2017

How does it look now? Note that for the case pooling_func != np.mean, I only tested np.median.

size = np.bincount(self.labels_)
n_samples = X.shape[0]
# a fast way to compute the mean of grouped features
nX = np.array([np.bincount(self.labels_, X[i, :])/size
Member:

how does this weighted bincount compare to

X @ scipy.sparse.csr_matrix((np.ones_like(labels), labels, np.arange(len(labels) + 1)))
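For readers following along, the suggestion above builds a feature-to-cluster indicator matrix so the grouped sums become a single sparse product. A minimal sketch of the idea (with an explicit shape argument added, which the one-liner above omits; the variable names here are illustrative):

```python
import numpy as np
import scipy.sparse as sp

rng = np.random.RandomState(0)
n_clusters = 7
X = rng.rand(5, 100)                              # (n_samples, n_features)
labels = rng.randint(0, n_clusters, size=100)     # cluster label per feature

# CSR indicator matrix: row i has a single 1 in column labels[i]
M = sp.csr_matrix((np.ones_like(labels), labels,
                   np.arange(len(labels) + 1)),
                  shape=(len(labels), n_clusters))
sums = np.asarray(X @ M)                          # grouped sums, shape (5, 7)
means = sums / np.bincount(labels, minlength=n_clusters)

# same result as the weighted-bincount route
ref = np.array([np.bincount(labels, weights=row, minlength=n_clusters)
                for row in X])
assert np.allclose(sums, ref)
```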

Contributor Author:

this did not perform better. please see the plots below.

@jnothman (Member) left a comment:

I think the test should actually include a simple case (e.g. one cluster, one sample) where the median and mean differ: FeatureAgglomeration(n_clusters=1).fit_transform([[0, 0, 1]])
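A sketch of such a test, assuming the FeatureAgglomeration signature with pooling_func (the parameter present at the time of this PR):

```python
import numpy as np
from sklearn.cluster import FeatureAgglomeration

X = np.array([[0, 0, 1]])  # one sample, three features

agglo_mean = FeatureAgglomeration(n_clusters=1, pooling_func=np.mean)
agglo_median = FeatureAgglomeration(n_clusters=1, pooling_func=np.median)

Xt_mean = agglo_mean.fit_transform(X)
Xt_median = agglo_median.fit_transform(X)

# mean of [0, 0, 1] is 1/3, median is 0: the two pooling functions differ
assert np.allclose(Xt_mean, [[1 / 3.]])
assert np.allclose(Xt_median, [[0.]])
```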

assert_true(np.size(np.unique(agglo_median.labels_)) == n_clusters)

# Test transform
X_red_mean = agglo_mean.transform(X)
Member:

if by "red" you mean "reduced" we ususally use Xt for transformed X

assert_true(X_red_median.shape[1] == n_clusters)

# Check that fitting with no samples raises a ValueError
assert_raises(ValueError, agglo_mean.fit, X[:0])
Member:

This sort of thing should be covered by common tests


# Test inverse transform
X_full_mean = agglo_mean.inverse_transform(X_red_mean)
X_full_median = agglo_mean.inverse_transform(X_red_median)
Member:

should this be agglo_median?

@sergulaydore (Contributor Author)

@jnothman Thanks for the review. I modified the test code according to your suggestions.
I also created some plots for the np_mean, np_bincount and scipy.sparse.csr_matrix methods using this script. As you can see, csr_matrix does not perform better than the other two, and np_bincount seems to be the best method for the cases I've tested.

[plots: performance_vs_nclusters_linear, performance_vs_nfeatures_lin, performance_vs_nsamples_lin]

@sergulaydore (Contributor Author)

I don't see a reason why X should be sparse, but FeatureAgglomeration accepts sparse matrices. What do you mean by "previously"? Shouldn't such a test go in test_hierarchical?

@jnothman (Member) commented Oct 31, 2017 via email

@sergulaydore (Contributor Author) commented Nov 1, 2017

OK, it seems np.mean accepts sparse matrices but np.bincount does not. I would need to call np.bincount(agglo.labels_, X.data[X.indptr[i]:X.indptr[i+1]]) if X is sparse. Should I add such a line for the sparse case, raise an error, or fall back to np.mean?

@jnothman (Member) commented Nov 1, 2017

That's not quite right I think. You'd need labels_[X.indices[row_start:row_stop]].

That would be okay, or just throw sparse matrices into the else case
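A sketch of the indexing described above, on a small random CSR matrix (the names here are illustrative, not the PR's code): the column indices of each row's nonzeros are mapped through labels before bincounting.

```python
import numpy as np
import scipy.sparse as sp

rng = np.random.RandomState(0)
n_clusters = 4
labels = rng.randint(0, n_clusters, size=20)     # cluster label per feature
Xs = sp.random(3, 20, density=0.4, format='csr', random_state=0)

# per-row grouped sums: slice the CSR arrays for row i, then map the
# column indices of the nonzeros through labels before bincounting
grouped = []
for i in range(Xs.shape[0]):
    start, stop = Xs.indptr[i], Xs.indptr[i + 1]
    cols = Xs.indices[start:stop]
    grouped.append(np.bincount(labels[cols],
                               weights=Xs.data[start:stop],
                               minlength=n_clusters))
grouped = np.array(grouped)

# dense reference computed the straightforward way
ref = np.array([np.bincount(labels, weights=row, minlength=n_clusters)
                for row in Xs.toarray()])
assert np.allclose(grouped, ref)
```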

@sergulaydore (Contributor Author)

Do you mean np.bincount(agglo.labels_[Xsparse.indices[i]:Xsparse.indices[i+1]], Xsparse.data[i, :])? That also did not work. I think throwing sparse matrices into the else case would be less confusing, but I'm not sure about the performance.

codecov bot commented Nov 2, 2017

Codecov Report

Merging #10020 into master will increase coverage by 0.01%.
The diff coverage is 100%.


@@            Coverage Diff             @@
##           master   #10020      +/-   ##
==========================================
+ Coverage   96.18%   96.19%   +0.01%     
==========================================
  Files         336      337       +1     
  Lines       62706    62753      +47     
==========================================
+ Hits        60311    60364      +53     
+ Misses       2395     2389       -6
Impacted Files Coverage Δ
sklearn/cluster/_feature_agglomeration.py 100% <100%> (ø) ⬆️
...klearn/cluster/tests/test_feature_agglomeration.py 100% <100%> (ø)
sklearn/linear_model/coordinate_descent.py 96.95% <0%> (ø) ⬆️
sklearn/neural_network/tests/test_mlp.py 100% <0%> (ø) ⬆️
sklearn/linear_model/ridge.py 95.66% <0%> (ø) ⬆️
sklearn/feature_selection/rfe.py 97.6% <0%> (ø) ⬆️
sklearn/calibration.py 98.87% <0%> (ø) ⬆️
sklearn/linear_model/logistic.py 97.07% <0%> (ø) ⬆️
sklearn/linear_model/least_angle.py 96.26% <0%> (ø) ⬆️
... and 3 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update 202b532...fba7691.

for l in np.unique(self.labels_):
nX.append(pooling_func(X[:, self.labels_ == l], axis=1))
return np.array(nX).T
if (pooling_func == np.mean) & (not issparse(X)):
Member:

you need and with no parentheses, not &.
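The concern here is precedence and semantics, not just style: & binds more tightly than ==, so dropping the parentheses without switching to and would change the parse. A tiny generic illustration (not the PR's code):

```python
flag = False

# `&` binds tighter than `==`: `1 == 1 & flag` parses as `1 == (1 & flag)`
assert (1 == 1 & flag) == False      # 1 & False is 0, and 1 == 0 is False

# `and` has lower precedence and short-circuits, so no parentheses are needed
assert (1 == 1 and not flag) == True
```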

@sergulaydore (Contributor Author)

@jnothman Done. Anything else?

@jnothman (Member) left a comment:

Thanks


def test_feature_agglomeration():
n_clusters = 1
X = np.array([0, 0, 1], ndmin=2) # (n_samples, n_features)
Member:

Could you write this out in full as nested lists, or reshape it so that the orientation is clearer to the reader, please?

agglo_mean.fit(X)
agglo_median.fit(X)
assert_true(np.size(np.unique(agglo_mean.labels_)) == n_clusters)
assert_true(np.size(np.unique(agglo_median.labels_)) == n_clusters)
Member:

You should test the size of labels_ too

assert_true(np.size(np.unique(agglo_median.labels_)) == n_clusters)

# Test transform
Xt_mean = agglo_mean.transform(X)
Member:

Please assert that these match the exact values you expect.

@sergulaydore (Contributor Author) left a comment:

Done. How does it look now?

@jnothman (Member) commented Nov 3, 2017

LGTM!

@jnothman jnothman changed the title faster way of computing means across each group [MRG+1] faster way of computing means across each group Nov 3, 2017
Xt_median = agglo_median.transform(X)
assert_true(Xt_mean.shape[1] == n_clusters)
assert_true(Xt_median.shape[1] == n_clusters)
assert_true(Xt_mean == np.array([1/3.]))
Member:

why is 1/3. not seen as a pep8 violation by travis?

Contributor Author:

Because the mean of [0, 0, 1] is 1/3, which is 0.33333... It passed all the checks. Do you have another suggestion for this test, or what exactly is your concern?

@jnothman (Member), Nov 4, 2017:

@agramfort, spaces around operators is no longer a strict requirement of PEP8. I would suggest, @sergulaydore, that we'd consider it better style to have spaces around arithmetic binary operators except where leaving spaces out helps clarify order of operations (e.g. a*b + c).

Member:

spaces around operators is no longer a strict requirement of PEP8

can't we still require it? We have all this code that uses spaces, and I don't like changing the rules of the game in the middle... it's also likely to create unnecessary diffs.

Member:

I think here we should require it. I suppose we could in general, but we will need to help contributors have the same flake8 settings. See #9121.

Member:

if we put the ignore in setup.cfg and not in travis build then I think it's easy for contributors to run make flake and see the same errors as on travis. I would favor this.

Contributor Author:

So, do you both mean to change it to assert_true(Xt_mean == np.array([1 / 3.]))?

size = np.bincount(self.labels_)
n_samples = X.shape[0]
# a fast way to compute the mean of grouped features
nX = np.array([np.bincount(self.labels_, X[i, :])/size
Member:

X[i, :]) / size

Contributor Author:

Done.

@amueller (Member) commented Nov 6, 2017

Hm I somehow thought the deprecation of pooling_func would propagate to FeatureAgglomeration. Sorry I didn't check the details. This should probably be documented more explicitly?

@jnothman (Member) commented Nov 6, 2017 via email

assert_true(Xt_mean.shape[1] == n_clusters)
assert_true(Xt_median.shape[1] == n_clusters)
assert_true(Xt_mean == np.array([1 / 3.]))
assert_true(Xt_median == np.array([0.]))
Member:

did we stop using assert_equal with the switch towards pytest?

Contributor Author:

I used test_ward_agglomeration function in test_hierarchical.py as a guideline for this test. However, for the rest of test_hierarchical.py, assert_equal and assert_true have been used interchangeably.

@agramfort agramfort changed the title [MRG+1] faster way of computing means across each group [MRG+2] faster way of computing means across each group Nov 7, 2017
@agramfort agramfort merged commit 555bf6b into scikit-learn:master Nov 7, 2017
@agramfort (Member)

thx @sergulaydore

@sergulaydore (Contributor Author)

Thanks @agramfort , @jnothman, @lesteve and @amueller for your time.

@jnothman (Member) commented Nov 7, 2017 via email

maskani-moh pushed a commit to maskani-moh/scikit-learn that referenced this pull request Nov 15, 2017
jwjohnson314 pushed a commit to jwjohnson314/scikit-learn that referenced this pull request Dec 18, 2017