[MRG] Added flag to disable l2-dist finite check #7383
Conversation
sklearn/cluster/k_means_.py
Outdated
    # Initialize list of closest distances and calculate current potential
    closest_dist_sq = euclidean_distances(
        centers[0, np.newaxis], X, Y_norm_squared=x_squared_norms,
        squared=True)
silly question: didn't we already check finiteness of everything in the caller?
There is a check_arrays in the fit method of MiniBatchKmeans, but there doesn't seem to be one that happens before this one in regular KMeans.
I think the best choice would be to add a check_input flag to _k_init. Then, if this flag is true, the input is checked at the start of the _k_init function. For minibatch this flag can be set to false, and for regular kmeans it can be set to true (or we can just put the check in kmeans as well). In all cases check_input would be set to false in the euclidean_distances calls inside _k_init.
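A rough sketch of that idea — a toy kmeans++ seeding, not sklearn's actual _k_init; the kmeanspp_seed name and simplified D² sampling are illustrative, and only the check_inputs flag mirrors the proposal above:

```python
import numpy as np
from sklearn.utils import check_array


def kmeanspp_seed(X, n_clusters, rng, check_inputs=True):
    # Validate dtype/finiteness exactly once. Every distance update in
    # the loop below reuses rows of the already-validated X, so
    # re-checking per iteration would be pure overhead.
    if check_inputs:
        X = check_array(X, dtype=np.float64)
    x_sq = np.einsum('ij,ij->i', X, X)          # precomputed squared norms
    centers = [rng.randint(len(X))]
    closest = x_sq - 2 * X @ X[centers[0]] + x_sq[centers[0]]
    for _ in range(1, n_clusters):
        # D^2 sampling: next center chosen proportional to squared
        # distance from the nearest existing center.
        probs = np.maximum(closest, 0)
        idx = rng.choice(len(X), p=probs / probs.sum())
        centers.append(idx)
        d = x_sq - 2 * X @ X[idx] + x_sq[idx]   # unchecked inner update
        closest = np.minimum(closest, d)
    return X[centers]


rng = np.random.RandomState(0)
print(kmeanspp_seed(rng.rand(2000, 32), 10, rng).shape)  # (10, 32)
```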
Are you going with that approach?
please address the flake8 errors.
Great work! I am in favor of @amueller's suggestion: skipping the check_array in the inner loop completely, rather than adding a flag to check_array.
Sounds good. I'll make those changes and push.
Force-pushed de67c9d to 9f05181
I finished these changes. Let me know if anything looks like it needs more work.
sklearn/cluster/k_means_.py
Outdated
     on the number of seeds (2+log(k)); this is the default.
+    check_inputs : boolean (default=True)
+        Whether to check if inputs are are finite and floats.
are are
LGTM. Can you please re-run your benchmarks with the current version and check that the improvement over master is still there? (it's easy to mess up things when refactoring ;)
I ran the benchmarks on a smaller grid because the older grid took a long time. This one still shows the same speed improvements. I also added a parameter to test giving integer inputs.
The modified benchmark script is in this gist https://gist.github.com/Erotemic/5230d93ccc9fa5329b0a02a351b02939 I also just squashed everything into a single commit to make the history nicer.
Force-pushed 5773240 to 761611e
great, thanks :)
Force-pushed 761611e to 3d67e5a
Force-pushed 3a52006 to 1ac48fe
I made a few more changes to this branch. Here is a summary of the differences.

FIX: logic error in euclidean_distances
I noticed a small issue in euclidean_distances when …

FIX: documentation error in euclidean_distances
When fixing this I also noticed that the documentation states that the input shape of …

ADD: check_flags to other pairwise metrics functions
I also went ahead and added the …

FIX: Converted one test in test_pairwise.py to non-yield version
While I was in that file I changed the behavior of a yield test which was producing a lot of warnings on my system with py.test version 3.0.1. It seems like this behavior will be deprecated in a new version, so …

I'm still waiting on the results of AppVeyor, but it seems as if all tests are now passing in this branch.
Force-pushed 3aa5985 to 52f958a
Force-pushed 52f958a to bddce27
I dropped the commit that changes the yield test behavior and made a new PR #7654 which looks at the issue independently.
sklearn/cluster/k_means_.py
Outdated
 def _init_centroids(X, k, init, random_state=None, x_squared_norms=None,
-                    init_size=None):
+                    init_size=None, check_inputs=True):
where is this check_inputs used?
It's passed to _k_init (the kmeans++ implementation). On reviewing this again, it may not be worth the extra API complexity to have the check_inputs parameter in _init_centroids, because it is not called very many times. Do you think I should remove it?
probably if it doesn't provide speed gains.
I made the change.
sklearn/cluster/k_means_.py
Outdated
-def _k_init(X, n_clusters, x_squared_norms, random_state, n_local_trials=None):
+def _k_init(X, n_clusters, x_squared_norms, random_state, n_local_trials=None,
+            check_inputs=True):
My question was more: is it called with both "True" and "False" in the current code?
Yes, it is only called in _init_centroids; in the KMeans code it is called with check_inputs=True by default, and in MiniBatchKMeans it is called with check_inputs=False because the inputs have already been checked at that point.
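To illustrate the call pattern being described — the _k_init below is a trivial stub standing in for the real function, so only the flag usage is meaningful:

```python
import numpy as np
from sklearn.utils import check_array


def _k_init(X, n_clusters, check_inputs=True):
    # Stub: the real kmeans++ seeding goes here. The point is that the
    # one-time validation is guarded by the flag.
    if check_inputs:
        X = check_array(X, dtype=np.float64)
    return X[:n_clusters]


X = np.random.rand(100, 8)

# KMeans path: nothing upstream has validated X yet, so check here.
centers = _k_init(X, 5, check_inputs=True)

# MiniBatchKMeans path: fit() already ran check_array on X, so the
# redundant re-check can be skipped.
X_checked = check_array(X, dtype=np.float64)
centers = _k_init(X_checked, 5, check_inputs=False)
```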
Force-pushed bddce27 to ef1d29f
Force-pushed ef1d29f to cf53672
@jnothman I think that a context manager could work, and I would be OK with redoing this patch to use such a context manager if the sklearn team decided they liked that direction. However, in my opinion I don't particularly like it as a solution because of the way it would interact with global variables. This could cause issues with solutions that involve any sort of threading. Perhaps this isn't a huge issue because the GIL is so omnipresent, and I don't think a race condition would exist in a multiprocessing solution, but from a stylistic design perspective I think it is a bit too obfuscated. I would prefer to have a flag explicitly passed around so it is very clear when inputs are not being validated. That being said, as long as it is constant, I do think having …

To summarize, here's a list of pros and cons of a context manager from my perspective. I'll try to weight them by importance.

Pros: …

Cons: …

That's my opinion, based mostly on the grounds of explicit > implicit and an avoidance of manipulating global variables. There are more cons than pros, and while the pros have slightly higher (subjectively chosen) weights, I do think the cons outweigh the pros. However, if other developers are in support of a context manager, I'm willing to jump on board.
I feel that we can be explicit on repeated internal calls, such as here, and leave global behavior to the user.
TBH, I wasn't well acquainted with this issue when I made that comment. I'll need to give it a more thorough look.
I'd forgotten, @Erotemic, that this was about reducing finiteness checks in a nested context. Here, I agree that a global context switch is not great. I was asking if, as a user, disabling finiteness checks for your call would be an okay solution. But perhaps it'd only be a temporary fix. For this sort of thing, it'd be nice to be able to attach a tag to the data to say that it's to be presumed finite. Is there a way to do that which does not destroy everything we believe in? I agree that threading could be a concern; really, the context manager should use a lock.
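As an aside: later scikit-learn versions did add exactly such a global switch, sklearn.config_context(assume_finite=True). A usage sketch, separate from what this PR implements:

```python
import numpy as np
from sklearn import config_context
from sklearn.cluster import MiniBatchKMeans

X = np.random.RandomState(0).rand(5000, 64)

# Inside this block scikit-learn skips its NaN/inf validation entirely;
# the caller takes responsibility for X actually being finite.
with config_context(assume_finite=True):
    MiniBatchKMeans(n_clusters=50, n_init=1).fit(X)
```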
@jnothman I think some of the confusion also stems from me being undisciplined about keeping a single feature to a single branch, which caused this PR to be put together a bit haphazardly. It started off as me noticing that I could get a performance gain with a change to a single function, so I made that change and it got +1'ed. Then I ended up changing all of the pairwise distance functions in order to achieve a consistent API, while making other changes that I believe I've since dropped from the PR. Perhaps it might be better to close this PR and then I can reorganize them a bit. First I'll create one that simply adds the …
I think the PR is reasonably scoped. If you want you can break it up, but it's not necessary.
It's less work to not break it up, so I'll default to that as long as there are no objections. I just thought I'd bring it up.
Force-pushed cf53672 to a79de72
I just noticed that elsewhere in the code a similar flag is referred to as …
I assume we already have tests in there to ensure finiteness is checked when check_input=True ...
sklearn/metrics/pairwise.py
Outdated
     Y : {array-like, sparse matrix}, shape (n_samples_2, n_features)
-    Y_norm_squared : array-like, shape (n_samples_2, ), optional
+    Y_norm_squared : array-like, shape (1, n_samples_2), optional
I'm pretty sure this is tested to work with a 1-d array. Can we leave this shape alone?
sklearn/metrics/pairwise.py
Outdated
         Return squared Euclidean distances.
-    X_norm_squared : array-like, shape = [n_samples_1], optional
+    X_norm_squared : array-like, shape = (n_samples_1, 1), optional
Ditto
sklearn/metrics/pairwise.py
Outdated
            raise ValueError(
                "Incompatible dimensions for X and X_norm_squared")
        else:
            XX = X_norm_squared
I think we should still be putting this into the right shape if 1d.
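A sketch of the reshape being asked for — _as_column_norms is a hypothetical helper, not code from this PR: accept X_norm_squared as either 1-d or a column and canonicalize it so it broadcasts against the distance matrix.

```python
import numpy as np


def _as_column_norms(X_norm_squared, n_samples):
    # Hypothetical helper: canonicalize X_norm_squared to shape
    # (n_samples, 1) so it broadcasts over (n_samples_1, n_samples_2).
    XX = np.asarray(X_norm_squared)
    if XX.ndim == 1:
        XX = XX[:, np.newaxis]
    if XX.shape != (n_samples, 1):
        raise ValueError("Incompatible dimensions for X and X_norm_squared")
    return XX


print(_as_column_norms(np.ones(5), 5).shape)        # (5, 1)
print(_as_column_norms(np.ones((5, 1)), 5).shape)   # (5, 1)
```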
sklearn/metrics/pairwise.py
Outdated
raise ValueError("additive_chi2 does not support sparse matrices.") | ||
X, Y = check_pairwise_arrays(X, Y) | ||
if check_input: | ||
if issparse(X) or issparse(Y): |
I think there's negligible harm to having this outside the if statement.
Added check_inputs flag to _k_init and _init_centroids
Fixed flake8 newline and binary operator errors
spelling error
…ts for check_inputs=False
one more pep8 fix
Force-pushed 7d5bf36 to 3d42e26
@jeremiedbb, @ogrisel, is this PR still relevant? kmeans has been largely modified and kmeans++ is under revision... Thanks!
Thanks @jeremiedbb for clarifying. This is above my understanding :)
This one is a bit more complex than the previous one. I'll probably have to look at it this weekend to refresh myself on what I was doing here.
The following change significantly speeds up the kmeans++ initialization used in MiniBatchKMeans.

The Euclidean distance computation is the bottleneck in kmeans++. However, on every call to euclidean_distances there is also a call to check_pairwise_arrays, and in kmeans++ the same Y array is being re-checked every time. One of the checks in there turns out to cause a speed issue: specifically, the finite check. This patch adds a flag to disable this check.
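For context on why the same array keeps getting re-validated: euclidean_distances is called once per candidate center, each time with the precomputed squared norms. A small numpy illustration of the identity it exploits (illustrative, not the sklearn source):

```python
import numpy as np

X = np.random.rand(500, 128)
c = X[:1]                         # one candidate center
x_sq = (X ** 2).sum(axis=1)       # computed once, reused on every call

# ||c - x||^2 = ||c||^2 - 2 c.x + ||x||^2, with the norms precomputed
d2 = x_sq[0] - 2.0 * (X @ c.ravel()) + x_sq
assert np.allclose(d2, ((X - c) ** 2).sum(axis=1))
```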
I'm not sure if this is the desired way to go about this change, but I do think something needs to be done about this function's efficiency.
Here is some data that shows the speed increase:
For these tests I'm clustering with n_clusters=1000. The feature dimension is 128 and
the number of data points is 10*n_clusters. I then profiled different versions of the code.
First, here is the slow version, where I force it to perform the finite check every time.
Digging a little deeper shows the timings in this function. As you can see, most of this function's time is spent in check_array. Looking at check_array, we see the offending function call to _assert_all_finite.
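A quick way to reproduce the relative cost yourself — a rough micro-benchmark using the real sklearn utilities; the numbers are machine-dependent:

```python
import timeit

import numpy as np
from sklearn.metrics.pairwise import euclidean_distances
from sklearn.utils import check_array

X = np.random.rand(10000, 128)
c = X[:1]

# One distance call versus one validation pass; kmeans++ pays for both
# on every iteration unless the check can be skipped.
t_dist = timeit.timeit(lambda: euclidean_distances(c, X, squared=True), number=100)
t_check = timeit.timeit(lambda: check_array(X, dtype=np.float64), number=100)
print(f"distances: {t_dist:.3f}s   validation: {t_check:.3f}s  (per 100 calls)")
```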
Disabling this check after it runs the first time gives a better profile.
We could probably squeeze out a bit more performance by checking everything at the start of kmeans++ and then disabling all subsequent checks. This should be safe because each new centroid is always a point that already exists in the data.
Disabling the profiler and using a coarser function timer we get the following timings:
Without Checks: 3.9655s
With Checks: 4.9669s
This is a 20% decrease in the amount of time taken (1 second total).
To ensure that this speedup was not just for parameters resembling my problem, I did a grid search over various parameter values and looked at the percent change. For larger datasets the change is consistently positive. There are a few negative changes for small datasets, but those are likely just random fluctuations. Among datasets with at least a 0.1 second speed increase, there is a 15% average improvement, with the improvement increasing for larger datasets.
The script I used to generate these numbers is the benchmark script in the gist linked above (https://gist.github.com/Erotemic/5230d93ccc9fa5329b0a02a351b02939).