ENH reduce memory consumption in nan_euclidean_distances #15615
Conversation
The test failure is because of this expected distance matrix:

    X = np.array([[missing_value, missing_value], [0, 1]])
    exp_dist = np.array([[np.nan, np.nan], [np.nan, 0]])

This sets the standard that if a sample is all-NaN, its Euclidean distance to itself is NaN rather than 0. My code currently sets the diagonal to 0 in all cases. What do we consider to be the right behaviour?
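The convention the test encodes can be seen directly with the public API. A minimal sketch, assuming the test's behaviour (NaN self-distance for an all-NaN sample) is the one that ships:

```python
import numpy as np
from sklearn.metrics.pairwise import nan_euclidean_distances

# One sample with every feature missing, one fully observed sample.
X = np.array([[np.nan, np.nan], [0.0, 1.0]])

dist = nan_euclidean_distances(X)

# Under the test's convention, the all-NaN sample has a NaN distance even
# to itself, whereas the observed sample's self-distance is exactly 0.
# The cross-distance is also NaN because the two samples share no
# commonly observed coordinate.
print(dist)
```

Note that the cross-distance NaN follows from the metric's definition (no overlapping observed features), while the diagonal NaN is precisely the convention under discussion here.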
Originally I had the diagonal at zero (when X is Y), but I was concerned that the following were not equal:
At least it is documented. Since the distance is a bit specific to handle, @jnothman do you have a use case, or any thoughts on what it should be?
Apart from this, LGTM.
I'm happy with @thomasjpfan's reasoning for now. In any case, better that an efficiency fix like this does not change behaviour.
So LGTM. @thomasjpfan, do you want to take a look?
Thank you @jnothman!
This goes towards fixing #15604.
Basic benchmark:
At master:
This branch:
Faster times could possibly be achieved with chunking.
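The chunking idea could be sketched with `pairwise_distances_chunked`, which is an existing scikit-learn helper that bounds peak memory by yielding the distance matrix in row blocks. This is an illustration of the approach, not the benchmark actually run in this PR; the `working_memory` value is arbitrary:

```python
import numpy as np
from sklearn.metrics import pairwise_distances_chunked
from sklearn.metrics.pairwise import nan_euclidean_distances

rng = np.random.RandomState(0)
X = rng.rand(200, 20)
# Knock out ~10% of entries to exercise the missing-value code path.
X[rng.rand(*X.shape) < 0.1] = np.nan

# Full matrix in one allocation.
full = nan_euclidean_distances(X)

# The same matrix computed one row-chunk at a time, keeping each
# intermediate block within roughly working_memory MiB.
chunks = pairwise_distances_chunked(
    X, metric="nan_euclidean", working_memory=1
)
chunked = np.vstack(list(chunks))
```

Whether chunking is actually faster (as opposed to just lighter on memory) depends on cache behaviour and chunk size, so it would need benchmarking rather than being assumed.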