ENH reduce memory consumption in nan_euclidean_distances #15615
Conversation
The test failure is because of this expected distance matrix:

    X = np.array([[missing_value, missing_value], [0, 1]])
    exp_dist = np.array([[np.nan, np.nan], [np.nan, 0]])

This sets the standard that if a sample is all-NaN, its Euclidean distance to itself is NaN rather than 0. My code currently sets the diagonal to 0 in all cases. What do we consider to be the right behaviour?
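The convention the test encodes can be seen directly with the public API. A minimal sketch, assuming the test's behaviour (NaN self-distance for an all-NaN sample) is the one that ships:

```python
import numpy as np
from sklearn.metrics.pairwise import nan_euclidean_distances

# One sample with every feature missing, one fully observed sample.
X = np.array([[np.nan, np.nan], [0.0, 1.0]])

dist = nan_euclidean_distances(X)

# Under the test's convention, the all-NaN sample has a NaN distance even
# to itself, whereas the observed sample's self-distance is exactly 0.
# The cross-distance is also NaN because the two samples share no
# commonly observed coordinate.
print(dist)
```

Note that the cross-distance NaN follows from the metric's definition (no overlapping observed features), while the diagonal NaN is precisely the convention under discussion here.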
Originally I had the diagonal at zero (when X is Y), but I was concerned that the following were not equal:
At least it is documented. Since the distance is a bit specific to handle, @jnothman do you have a use case, or any thoughts on what it should be?
Apart from this, LGTM.
I'm happy with @thomasjpfan's reasoning for now. In any case, better that an efficiency fix like this does not change behaviour.
So LGTM. @thomasjpfan, do you want to take a look?
Thank you @jnothman!
This goes towards fixing #15604.
Basic benchmark:
At master:
This branch:
Faster times could possibly be achieved with chunking.
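The chunking idea could be sketched with `pairwise_distances_chunked`, which is an existing scikit-learn helper that bounds peak memory by yielding the distance matrix in row blocks. This is an illustration of the approach, not the benchmark actually run in this PR; the `working_memory` value is arbitrary:

```python
import numpy as np
from sklearn.metrics import pairwise_distances_chunked
from sklearn.metrics.pairwise import nan_euclidean_distances

rng = np.random.RandomState(0)
X = rng.rand(200, 20)
# Knock out ~10% of entries to exercise the missing-value code path.
X[rng.rand(*X.shape) < 0.1] = np.nan

# Full matrix in one allocation.
full = nan_euclidean_distances(X)

# The same matrix computed one row-chunk at a time, keeping each
# intermediate block within roughly working_memory MiB.
chunks = pairwise_distances_chunked(
    X, metric="nan_euclidean", working_memory=1
)
chunked = np.vstack(list(chunks))
```

Whether chunking is actually faster (as opposed to just lighter on memory) depends on cache behaviour and chunk size, so it would need benchmarking rather than being assumed.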