[MRG+2] faster way of computing means across each group #10020
Conversation
Can you please provide end-to-end benchmarks (code + results)?
@amueller Added a benchmark file. However, I can't really figure out why codecov/patch is failing. Is that because I need to add more test cases for the lines I added?
Look at the codecov report. There is no test covering the lines you changed. It's not really your fault (there was no test before your PR) but if you can add a test that would be great.
@lesteve I don’t quite understand that. Do you mean adding a new test file like test_feature_agglomeration? Or should I raise an error in _feature_agglomeration when pooling_func is not np.mean? I am confused because I don’t have a test for pooling_func == np.mean either. Why doesn’t codecov/patch complain about that?
IIRC last time I looked pooling_func was not really well tested; even adding tests with pooling_func=np.mean would be great.
Adding a test function inside the relevant test file would be the way to go. pooling_func == np.mean is "tested" (only a smoke test IIRC) via sklearn.cluster.hierarchical.FeatureAgglomeration I think. If you can test only the AgglomerationTransform class in isolation, i.e. add sklearn/cluster/tests/test_feature_agglomeration.py, even better.
@lesteve Thanks for the clarification. Let me try adding sklearn/cluster/tests/test_feature_agglomeration.py.
Sorry I was being unclear. I was asking for you to show benchmark results in this PR and link to code (possibly in a gist). We don't usually include benchmarks for "small" improvements into the benchmark folder, but it's good to have the code posted somewhere for the record.
(also wow the benchmarks look great but I haven't had time to look into the details)
With 1000000 samples: see https://gist.github.com/amueller/9853d77d9a08f4445f7ee1f7cffe4241
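For the record, a minimal sketch of the kind of comparison the gist makes; the sizes here are illustrative, not the ones from the gist, and `labels` stands in for the fitted `labels_` attribute:

```python
import numpy as np
from time import time

rng = np.random.RandomState(0)
n_samples, n_features, n_clusters = 1000, 10000, 100
X = rng.rand(n_samples, n_features)
labels = rng.randint(n_clusters, size=n_features)  # stand-in for labels_

# Old approach: one boolean-mask pass over X per cluster.
t0 = time()
old = np.array([X[:, labels == l].mean(axis=1)
                for l in np.unique(labels)]).T
t_old = time() - t0

# New approach: one weighted bincount per sample.
t0 = time()
size = np.bincount(labels)
new = np.array([np.bincount(labels, X[i, :]) / size
                for i in range(n_samples)])
t_new = time() - t0

assert np.allclose(old, new)
print("speed-up: %.1fx" % (t_old / t_new))
```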
I would remove the special case, given that …
I removed the case where …
Noooo.... (In fact, I've not checked, but …)
@jnothman So you suggest to keep the special case and write the test instead?
yes, please
How does it look now? Note that for the case …
size = np.bincount(self.labels_)
n_samples = X.shape[0]
# a fast way to compute the mean of grouped features
nX = np.array([np.bincount(self.labels_, X[i, :]) / size
               for i in range(n_samples)])
how does this weighted bincount compare to X @ scipy.sparse.csr_matrix((np.ones_like(labels), labels, np.arange(len(labels) + 1)))?
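For reference, a runnable sketch of that suggestion (variable names and sizes are illustrative; it needs a scipy recent enough to support @ between a dense array and a sparse matrix):

```python
import numpy as np
import scipy.sparse as sp

rng = np.random.RandomState(0)
n_samples, n_features, n_clusters = 50, 200, 10
X = rng.rand(n_samples, n_features)
labels = rng.randint(n_clusters, size=n_features)

# CSR with indptr = arange(n_features + 1): row i holds a single 1 in
# column labels[i], so X @ indicator yields per-cluster feature sums.
indicator = sp.csr_matrix((np.ones_like(labels, dtype=float), labels,
                           np.arange(len(labels) + 1)))
sums = np.asarray(X @ indicator)
means = sums / np.bincount(labels)  # divide by cluster sizes
```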
this did not perform better. please see the plots below.
I think the test should actually include a simple case (e.g. one cluster, one sample) where the median and mean differ: FeatureAgglomeration(n_clusters=1).fit_transform([[0, 0, 1]])
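A sketch of such a test, assuming the pooling_func keyword as it existed at the time:

```python
import numpy as np
from sklearn.cluster import FeatureAgglomeration

X = np.array([[0, 0, 1]])  # one sample, three features, one cluster
agglo_mean = FeatureAgglomeration(n_clusters=1, pooling_func=np.mean)
agglo_median = FeatureAgglomeration(n_clusters=1, pooling_func=np.median)

# The pooled mean (1/3) and median (0) differ, so a wrong dispatch between
# the fast mean path and the generic path would be caught here.
assert np.allclose(agglo_mean.fit_transform(X), [[1 / 3.]])
assert np.allclose(agglo_median.fit_transform(X), [[0.]])
```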
assert_true(np.size(np.unique(agglo_median.labels_)) == n_clusters)

# Test transform
X_red_mean = agglo_mean.transform(X)
if by "red" you mean "reduced" we ususally use Xt for transformed X
assert_true(X_red_median.shape[1] == n_clusters)

# Check that fitting with no samples raises a ValueError
assert_raises(ValueError, agglo_mean.fit, X[:0])
This sort of thing should be covered by common tests
# Test inverse transform
X_full_mean = agglo_mean.inverse_transform(X_red_mean)
X_full_median = agglo_mean.inverse_transform(X_red_median)
should this be agglo_median?
@jnothman Thanks for the review. I modified the test code according to your suggestions.
I don't see a reason why X should be sparse but FeatureAgglomeration accepts sparse matrices. What do you mean by "previously"? Should not such a test go to test_hierarchical?
Yes, if it currently works, then we should ensure it continues to work. If it currently fails, then we may choose to ensure that it continues to fail :)
OK, it seems …
That's not quite right, I think. You'd need … That would be okay, or just throw sparse matrices into the else case.
Do you mean …?
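A minimal sketch of the dispatch under discussion (not the exact PR code): the fast bincount path is taken only for np.mean on dense input, with sparse matrices and other pooling functions falling through to the generic loop.

```python
import numpy as np
from scipy.sparse import issparse

def _pool(X, labels, pooling_func):
    if pooling_func == np.mean and not issparse(X):
        # fast path: one weighted bincount per sample
        size = np.bincount(labels)
        return np.array([np.bincount(labels, X[i, :]) / size
                         for i in range(X.shape[0])])
    # generic path: apply pooling_func cluster by cluster
    nX = [pooling_func(X[:, labels == l], axis=1)
          for l in np.unique(labels)]
    return np.array(nX).T
```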
Codecov Report
@@ Coverage Diff @@
## master #10020 +/- ##
==========================================
+ Coverage 96.18% 96.19% +0.01%
==========================================
Files 336 337 +1
Lines 62706 62753 +47
==========================================
+ Hits 60311 60364 +53
+ Misses 2395 2389 -6
Continue to review full report at Codecov.
for l in np.unique(self.labels_):
    nX.append(pooling_func(X[:, self.labels_ == l], axis=1))
return np.array(nX).T
if (pooling_func == np.mean) & (not issparse(X)):
you need "and" and no parentheses, not "&".
@jnothman Done. Anything else?
Thanks
def test_feature_agglomeration():
    n_clusters = 1
    X = np.array([0, 0, 1], ndmin=2)  # (n_samples, n_features)
Could you write this out in full as nested lists, or reshape so that the orientation is clearer to the reader, please?
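For example (same values, written so the single-row orientation is visible):

```python
import numpy as np

# One sample (row) with three features (columns):
X = np.array([[0, 0, 1]])
```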
agglo_mean.fit(X)
agglo_median.fit(X)
assert_true(np.size(np.unique(agglo_mean.labels_)) == n_clusters)
assert_true(np.size(np.unique(agglo_median.labels_)) == n_clusters)
You should test the size of labels_ too
assert_true(np.size(np.unique(agglo_median.labels_)) == n_clusters)

# Test transform
Xt_mean = agglo_mean.transform(X)
Please assert that these match the exact values you expect.
Done. How does it look now?
LGTM!
Xt_median = agglo_median.transform(X)
assert_true(Xt_mean.shape[1] == n_clusters)
assert_true(Xt_median.shape[1] == n_clusters)
assert_true(Xt_mean == np.array([1/3.]))
why is 1/3. not seen as a pep8 violation by travis?
Because the mean of [0,0,1] is 1/3 which is 0.333333.... It seems to have passed all the checks. Do you have another suggestion for this test or what is your concern exactly?
@agramfort, spaces around operators is no longer a strict requirement of PEP8. I would suggest, @sergulaydore, that we'd consider it better style to have spaces around arithmetic binary operators except where leaving spaces out helps clarify order of operations (e.g. a*b + c).
spaces around operators is no longer a strict requirement of PEP8
can't we still require it? We have all this code that use spaces and I don't
like changing the rules of the game in the middle... it's also likely to create
unnecessary diffs.
I think here we should require it. I suppose we could in general, but we will need to help contributors have the same flake8 settings. See #9121.
if we put the ignore in setup.cfg and not in travis build then I think it's easy for contributors to run make flake and see the same errors as on travis. I would favor this.
So, do you both mean to change it to assert_true(Xt_mean == np.array([1 / 3.]))?
size = np.bincount(self.labels_)
n_samples = X.shape[0]
# a fast way to compute the mean of grouped features
nX = np.array([np.bincount(self.labels_, X[i, :])/size
               for i in range(n_samples)])
X[i, :]) / size
Done.
Hm I somehow thought the deprecation of pooling_func would propagate to FeatureAgglomeration. Sorry I didn't check the details. This should probably be documented more explicitly?
Merge, @amueller, @agramfort?
assert_true(Xt_mean.shape[1] == n_clusters)
assert_true(Xt_median.shape[1] == n_clusters)
assert_true(Xt_mean == np.array([1 / 3.]))
assert_true(Xt_median == np.array([0.]))
did we stop using assert_equal with the switch towards pytest?
I used the test_ward_agglomeration function in test_hierarchical.py as a guideline for this test. However, for the rest of test_hierarchical.py, assert_equal and assert_true have been used interchangeably.
thx @sergulaydore
Thanks @agramfort, @jnothman, @lesteve and @amueller for your time.
You're right, @agramfort, we shouldn't really be using assert_true there. Just plain assert... Yes, I think we should be moving to bare assert-based testing where possible.
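In bare-assert style, a check like the one quoted above would read, for example (a sketch, with pooling_func as it existed at the time; pytest rewrites plain asserts to show the compared values on failure):

```python
import numpy as np
from sklearn.cluster import FeatureAgglomeration

def test_feature_agglomeration_mean_pooling():
    X = np.array([[0, 0, 1]])
    Xt_mean = FeatureAgglomeration(n_clusters=1,
                                   pooling_func=np.mean).fit_transform(X)
    # plain asserts instead of assert_true/assert_equal
    assert Xt_mean.shape[1] == 1
    assert np.allclose(Xt_mean, [[1 / 3.]])
```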
Reference Issues/PRs
What does this implement/fix? Explain your changes.
It computes the mean of grouped features in a faster way.
Any other comments?
The performance improvement depends on the number of clusters and the number of samples. If the number of clusters increases, the performance gain is on the order of 100x. If the number of samples increases, the performance gain gets smaller. Performance gain (speed_of_new_method/speed_of_old_method) for …