[MRG] Modified sklearn.metrics to enable euclidean distance calculation with NaN #9348

ashimb9 · 2017-07-13T09:27:53Z

Reference Issue

Issue: 4844, 2989
Prelude to: 9212

What does this implement/fix? Explain your changes.

This is actually the first part of another PR. The referenced PR implements a KNN-based imputation strategy using its own k-Nearest Neighbor calculator. However, it was suggested here that it might be better to make the changes within sklearn.metrics so that the existing sklearn.neighbors machinery could be used. Given this context, this PR modifies the relevant modules in sklearn.metrics (and, to a very minor extent, sklearn.neighbors) so that the euclidean distance between samples with arbitrary missing coordinates can be calculated.

glemaitre · 2017-07-17T21:09:59Z

You might want to check why the CI are not happy.

jnothman · 2017-07-18T03:30:15Z

sklearn/metrics/pairwise.py

@@ -179,6 +193,19 @@ def euclidean_distances(X, Y=None, Y_norm_squared=None, squared=False,
    the distance matrix returned by this function may not be exactly
    symmetric as required by, e.g., ``scipy.spatial.distance`` functions.

+    Additionally, euclidean_distances() can also compute pairwise euclidean


I suspect we should pull it out as a separate function, e.g. masked_euclidean_distances

jnothman · 2017-07-18T03:30:35Z

sklearn/metrics/pairwise.py

+    # NOTE: force_all_finite=False allows not only NaN but also inf/-inf
+    X, Y = check_pairwise_arrays(X, Y,
+                                 force_all_finite=kill_missing, copy=copy)
+    if kill_missing is False and \


Do you mean not kill_missing rather than kill_missing is False?

Ha sure did :)
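
The distinction matters because an identity check against False matches only the False singleton, while not treats every falsy value alike. A minimal sketch, using a made-up helper named describe rather than anything from the PR:

```python
def describe(flag):
    """Return how each comparison style classifies `flag`."""
    return {"is_false": flag is False, "not_flag": not flag}


# The two checks agree for the literal booleans...
print(describe(False))
print(describe(True))

# ...but diverge for other falsy values such as 0 or None:
# `flag is False` is False for them, while `not flag` is True.
print(describe(0))
print(describe(None))
```

If a caller ever passes 0 or None for such a flag, only the not form would treat it as "off".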

jnothman · 2017-07-18T03:32:32Z

sklearn/metrics/pairwise.py

+        else:
+            YY = row_norms(Y, squared=True)[np.newaxis, :]
+
+        distances = safe_sparse_dot(X, Y.T, dense_output=True)


I don't get how this works if there are NaNs in X and Y still.

^(Please see my previous comment.)

jnothman · 2017-07-18T03:32:55Z

sklearn/metrics/pairwise.py

+                raise ValueError(
+                    "Incompatible dimensions for X and X_norm_squared")
+        else:
+            XX = row_norms(X, squared=True)[:, np.newaxis]


I don't get how this works if there are NaNs in X still

Did you mean in case kill_missing=True but there are NaN values? I'll add a check there to avoid that scenario.

Maybe I've just misunderstood what's going on here. Can you please make a separate function rather than having a kill_missing setting?

jnothman · 2017-07-18T03:34:58Z

sklearn/metrics/pairwise.py

+    kill_missing : boolean, optional
+        Allow missing values (e.g., NaN)
+
+    missing_values : String, optional


Why a string? Surely we only want this configurable in the case of integer data...?

I'd be inclined to just assume missing_values is NaN. If we really want it configurable, it will be a number.

If I do integers only, I am just thinking how the user can conveniently pass NaN as the missing_values to the function. Or should the options rather be: either "NaN" or a number? And asking to pass np.nan might not be the best option? Or is there a better way I am not aware of here?

See http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.Imputer.html

Yeah imputer has integer or “NaN” -- that seems like a good choice. Thanks!
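
A rough sketch of what accepting either "NaN" or a number could look like. The helper name get_missing_mask is hypothetical and is not claimed to match the Imputer's internal code; the point is only that NaN needs np.isnan because it never compares equal to itself:

```python
import numpy as np


def get_missing_mask(X, missing_values="NaN"):
    """Boolean mask of entries equal to `missing_values` (hypothetical helper)."""
    X = np.asarray(X)
    if isinstance(missing_values, str):
        if missing_values != "NaN":
            raise ValueError("missing_values must be 'NaN' or a number")
        return np.isnan(X)
    if np.isnan(missing_values):
        # NaN != NaN, so an equality test would never find it.
        return np.isnan(X)
    return X == missing_values


X = np.array([[1.0, np.nan], [-1.0, 3.0]])
print(get_missing_mask(X))                     # NaN marks missing entries
print(get_missing_mask(X, missing_values=-1))  # -1 marks missing entries
```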

jnothman · 2017-07-18T03:37:00Z

sklearn/neighbors/base.py

@@ -270,7 +272,8 @@ def _pairwise(self):
 class KNeighborsMixin(object):
    """Mixin for k-neighbors searches"""

-    def kneighbors(self, X=None, n_neighbors=None, return_distance=True):
+    def kneighbors(self, X=None, n_neighbors=None, return_distance=True,
+                   kill_missing=True, missing_values="NaN", copy=None):


I think we should just use a variant metric name, rather than adding all these parameters to kneighbors. Certainly, it does not belong in the method, but in the class.

jaquesgrobler · 2017-07-18T08:27:45Z

sklearn/metrics/pairwise.py

@@ -54,7 +63,8 @@ def _return_float_dtype(X, Y):
    return X, Y, dtype


-def check_pairwise_arrays(X, Y, precomputed=False, dtype=None):
+def check_pairwise_arrays(X, Y, precomputed=False, dtype=None,
+                          copy=False, force_all_finite=True):


copy and force_all_finite should be added to the docstring parameters, no?

Of course. Thanks!

ashimb9 · 2017-07-19T06:40:15Z

Hmm, I seem to have broken the code here. I had to do a force push due to some issues in my local machine, and that seems to have made CI unhappy again. Looking into it at the moment.

ashimb9 · 2017-07-19T21:16:15Z

@jnothman @jaquesgrobler Hi guys -- So things seem to be running ok now. I have tried to address all of your comments in this update. Please take a look at your convenience. Thank you.

jnothman

Just a glance at docs so far. Thanks for the quick work

jnothman · 2017-07-19T23:36:50Z

sklearn/metrics/pairwise.py

+# Pairwise distances in the presence of missing values
+def masked_euclidean_distances(X, Y=None, squared=False,
+                               missing_values="NaN", copy=True, **kwargs):
+    """


Pep257: there should be a one-line summary here

jnothman · 2017-07-19T23:38:19Z

sklearn/metrics/pairwise.py

+    in dense matrices X and Y with missing values in arbitrary
+    coordinates. The following formula is used for this:
+
+        dist(X, Y) = (X.shape[1] * 1 / ((dot(NX, NYT)))) *


Could we give the formula in terms of vectors rather than matrices? It's much more straightforward to understand relative to the well known Euclidean distance.

It is there, one or two paragraphs below the matrix notation: "Breakdown of euclidean distance calculation between a vector pair x,y...". Or did you mean something else?

Well, the user doesn't care how it's computed. They care what it means and why it behaves a certain way with their data. Better with documentation that gives a functional description and does not need to be updated whenever the implementation is.

Implementation notes can be commented inside the function if they will help maintainers and curious users.

Hmm, I am not totally sure what you mean. I have edited the docstring to change it to what I think you meant -- please let me know what you think about the following:

This formulation zero-weights feature coordinates with a missing value in either vector of the pair and up-weights the remaining coordinates. For instance, say we have two sample points (x1, y1) and (x2, NaN). To calculate the euclidean distance, the squared "distance" is first calculated based only on the first feature coordinate, as the second coordinate is missing in one of the samples, i.e., we get (x2-x1)**2. This squared distance is scaled up by the ratio of the total number of coordinates to the number of available coordinates, which in this case is 2/1 = 2. Now, we are left with 2*((x2-x1)**2). Finally, if squared=False then the square root of this is evaluated; otherwise the value is returned as is.

All I meant was that it's not especially helpful to the reader to describe the operation in terms of matrices and masks, and that should be deleted. But it can be described in terms of vectors.

I'm struggling with your description, and find "zero-weights" and "up-weights" particularly confusing.

You could state something like "In accordance with [x] we calculate Euclidean distance between vectors with some elements missing as: the sum of squared differences between elements that are not missing in either vector, scaled in inverse proportion to the number of elements not missing in either vector, and the square root taken." Alternatively "... as: the Euclidean distance between the elements that are not missing in either vector, multiplied by sqrt(vector length / number of elements not missing in either vector)." I'm not sure there about "Euclidean distance between elements". Is "the Euclidean distance between vectors consisting of the elements that are not missing in either input vector" clearer?
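
The vector-level description above can be sketched in a few lines of plain Python. This is an illustration of the definition only, not the code from this PR; the function name masked_euclidean and the choice to return NaN when no coordinates are shared are assumptions made for the example:

```python
import math


def masked_euclidean(x, y, squared=False):
    """Distance over coordinates present in both x and y, rescaled.

    Illustrative sketch only.
    """
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    present = [(a, b) for a, b in zip(x, y)
               if not (math.isnan(a) or math.isnan(b))]
    if not present:
        return float("nan")  # no common non-missing coordinates
    sq = sum((a - b) ** 2 for a, b in present)
    sq *= len(x) / len(present)  # scale up for the skipped coordinates
    return sq if squared else math.sqrt(sq)


nan = float("nan")
# One of two coordinates is usable, so the squared distance is 2 * (4 - 1)**2.
print(masked_euclidean([1.0, 5.0], [4.0, nan], squared=True))  # 18.0
```

With no missing values the scale factor is 1 and the result reduces to the ordinary Euclidean distance.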

jnothman · 2017-07-19T23:40:16Z

sklearn/metrics/pairwise.py

+    array([[ 1.        ],
+           [ 1.41421356]])
+
+    See also


I think this deserves a reference to research / textbooks / encyclopaedia where Euclidean distance with missing values is used / defined.

Sure, will add that.

jnothman · 2017-07-19T23:40:26Z

sklearn/metrics/pairwise.py

+                axis=1) == Y.shape[1])):
+        raise ValueError("One or more rows only contain missing values.")
+    #
+    # if kill_missing:


So my thinking for this was that a row with all NaN will only introduce "unnecessary" NaN values in the distance matrix. It does not technically matter for this specific situation (or for kneighbors) since all it does is return a row (or column) with all NaN values, but I was like why keep it when it can potentially introduce issues down the analysis-chain and does not contribute anything useful. So the current implementation requires users to get rid of samples that have nothing but NaN values before they pass the dataset to masked_kneighbors or masked_euclidean_distance(). However, if you prefer to remove this check for any reason please let me know and I will get rid of it.

I'm fine with all-NaN rows resulting in an error.

That still doesn't explain why you have a large block of commented-out code.

Oops, that was accidentally left there.

jnothman

This kill_missing parameter is still a bit of a mystery to me.

jnothman · 2017-07-20T03:39:09Z

sklearn/metrics/pairwise.py

+    """
+    # Check and except sparse matrices
+    if issparse(X) or (Y is not None and issparse(Y)):
+        raise ValueError(


Perhaps check_pairwise_arrays should have an accept_sparse parameter.

jnothman · 2017-07-20T03:44:17Z

sklearn/metrics/pairwise.py

+    #             (np.dot((X * X), NYT) - 2 * (np.dot(X, YT)) +
+    #              np.dot(NX, (YT * YT)))
+
+    # Above is faster but following for Python 2.x support


Do you mean Python 2 division? we only need from __future__ import division at the top of the file (and I'm surprised it's not already there)

It's the multiply that you've changed. I don't get why that's necessary...? Nor do I get how it could be substantially slower.

jnothman · 2017-07-20T03:45:54Z

sklearn/metrics/pairwise.py

+    #              np.dot(NX, (YT * YT)))
+
+    # Above is faster but following for Python 2.x support
+    distances = np.multiply(np.multiply(X.shape[1],


X.shape[1] / np.dot(NX, NYT) should suffice here

jnothman · 2017-07-20T04:15:15Z

sklearn/metrics/pairwise.py

+    # Get Y.T mask and anti-mask and set Y.T's missing to zero
+    YT = Y.T
+    mask_YT = _get_mask(YT, missing_values)
+    NYT = (~mask_YT).astype(np.int8)


does the astype help with performance? I suspect it does not.

No I did that because leaving it as bool was returning incorrect values. I think the issue is that dot product of bool matrices returns a bool which of course means that we do not get the sum of True as a sum of ones, as we would want for individual dot products.

Although this probably means that I should not use int8 either since that could be a problem when dataset has a lot of columns/features and 8 bits might not be enough ... will change that :)
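
Both effects are easy to reproduce. A minimal sketch, assuming a toy mask with 200 features and nothing missing (the array and variable names are made up for illustration):

```python
import numpy as np

n_features = 200
present = np.ones((1, n_features), dtype=bool)  # no value missing

# A boolean dot product stays boolean, so the count of shared
# coordinates collapses to a single True.
as_bool = np.dot(present, present.T)
print(as_bool.dtype, as_bool[0, 0])

# int8 cannot represent 200 (its maximum is 127), so the count wraps around.
as_int8 = np.dot(present.astype(np.int8), present.T.astype(np.int8))
print(as_int8.dtype, as_int8[0, 0])

# A wider integer type keeps the true count.
as_int32 = np.dot(present.astype(np.int32), present.T.astype(np.int32))
print(as_int32.dtype, as_int32[0, 0])
```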

jnothman · 2017-07-20T04:20:44Z

sklearn/metrics/pairwise.py

+    # Calculate distances
+
+    # distances = (X.shape[1] * 1 / ((np.dot(NX, NYT)))) * \
+    #             (np.dot((X * X), NYT) - 2 * (np.dot(X, YT)) +


A comment on np.dot((X * X), NYT) might be helpful. But I can't work out how to word it, so perhaps not :)
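
One way to read that term is to write the commented-out formula as a small standalone function. The sketch below follows the expression quoted in the diff, with NaNs zero-filled and the not-missing masks stored as integers; the function name masked_sq_dists and the use of np.errstate to silence the 0/0 case are assumptions for the example, not the PR's code:

```python
import numpy as np


def masked_sq_dists(X, Y):
    """Squared masked distances for dense float arrays (illustrative sketch)."""
    X = np.array(X, dtype=float)  # np.array copies, so inputs stay untouched
    Y = np.array(Y, dtype=float)
    NX = (~np.isnan(X)).astype(np.int32)     # 1 where X is present
    NYT = (~np.isnan(Y)).astype(np.int32).T  # 1 where Y is present
    X[np.isnan(X)] = 0.0
    Y[np.isnan(Y)] = 0.0
    YT = Y.T
    # Entry (i, j) of np.dot(X * X, NYT) sums x_ik**2 only over the
    # coordinates k that sample j has; missing x_ik already contribute 0.
    sq = np.dot(X * X, NYT) - 2 * np.dot(X, YT) + np.dot(NX, YT * YT)
    n_common = np.dot(NX, NYT)  # coordinates present in both samples
    with np.errstate(divide="ignore", invalid="ignore"):
        return X.shape[1] / n_common * sq


X = [[1.0, 5.0], [np.nan, 2.0]]
Y = [[4.0, np.nan]]
# Row 0 shares one coordinate with Y: 2 / 1 * (1 - 4)**2 = 18.
# Row 1 shares none, so its entry comes out as NaN.
print(masked_sq_dists(X, Y))
```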

jnothman · 2017-07-20T04:22:26Z

sklearn/metrics/pairwise.py

@@ -1216,11 +1391,36 @@ def pairwise_distances(X, Y=None, metric="euclidean", n_jobs=1, **kwds):
                         "Valid metrics are %s, or 'precomputed', or a "
                         "callable" % (metric, _VALID_METRICS))

+    # To handle kill_missing = False
+    kill_missing = kwds.get("kill_missing")


I don't get what this is all about. Why aren't we just adding 'masked_euclidean' to PAIRWISE_DISTANCE_FUNCTIONS?

Yeah this one was a tough call. So ideally we probably don't want the user to pass "masked_euclidean" right? Which leaves us with the option to check if kill_missing==False, and if it is then to use metric as "masked_euclidean" when the user passes "euclidean". I was afraid that allowing user to pass "euclidean" when it actually meant calling masked_euclidean might potentially trigger all sorts of unintended things down the chain if both versions shared the same function dictionary, aka I was just playing it safe. But if you think it is okay, then I will do the conversion to masked_ version by checking for kill_missing==False.

jnothman · 2017-07-23T02:17:10Z

I would rather the user specify 'masked_euclidean'. Why not? Certainly it's no problem in the case of Imputer.
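
The appeal of an explicit metric name is that dispatch stays a plain dictionary lookup, with no flag changing what "euclidean" means. A toy sketch of that pattern; DISTANCE_FUNCTIONS and distance are hypothetical stand-ins written for this example, not scikit-learn's actual registry or API:

```python
import math


def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def masked_euclidean(x, y):
    # Skip coordinates missing in either vector, then rescale.
    pairs = [(a, b) for a, b in zip(x, y)
             if not (math.isnan(a) or math.isnan(b))]
    if not pairs:
        return float("nan")
    return math.sqrt(len(x) / len(pairs) * sum((a - b) ** 2 for a, b in pairs))


# Hypothetical stand-in for a name-to-function registry.
DISTANCE_FUNCTIONS = {
    "euclidean": euclidean,
    "masked_euclidean": masked_euclidean,
}


def distance(x, y, metric="euclidean"):
    try:
        func = DISTANCE_FUNCTIONS[metric]
    except KeyError:
        raise ValueError("Unknown metric %r" % metric)
    return func(x, y)


print(distance([0.0, 0.0], [3.0, 4.0]))  # 5.0
print(distance([0.0, float("nan")], [3.0, 4.0], metric="masked_euclidean"))
```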

ashimb9 · 2017-07-23T02:26:08Z

Ohh, cool...That makes my life easier :)

ashimb9 · 2017-07-23T05:11:21Z

@jnothman Just pushed a commit addressing the issues that you raised. And thanks a lot for all your feedback so far. PS: Please don't mind the random comment blocks for now, I will get rid of them soon :)

jnothman · 2017-07-23T12:05:20Z

That makes my life easier :)

I'm all for that. :P

jnothman

Something closer to a full review.

jnothman · 2017-07-23T12:12:40Z

sklearn/neighbors/base.py

@@ -421,6 +454,174 @@ class from an array representing our data set and ask who's
                return dist, neigh_ind
            return neigh_ind

+    def masked_kneighbors(self, X=None, n_neighbors=None, return_distance=True,


Is the main point in duplicating this to pass force_all_finite=False? I don't think it's necessary.

Rather, I think we should be leaving the data validation to the pairwise_distances and ball/binary tree (which seem to validate when queried, but not when constructed), and remove it from here / make it minimal. I admit that getting this right might be tricky, but I am not happy with a solution that duplicates kneighbors (and why not radius_neighbors also?) for the sake of validating more leniently.

Do you mean the method masked_kneighbors itself? I thought you wanted me to create a separate method to handle missing datasets? Initially, I had the missing handling within kneighbors() but I created this later. Maybe I misunderstood you then?

Yes, must have been a misunderstanding. I only asked for separation in pairwise_distances and then only as much separation as there is between euclidean_distances and cosine_distances and manhattan_distances.

jnothman · 2017-07-23T12:15:14Z

sklearn/metrics/tests/test_pairwise.py

+    S = pairwise_distances(X, metric="masked_euclidean")
+    S2 = masked_euclidean_distances(X)
+    assert_array_almost_equal(S, S2)
+    # Euclidean distance, with Y != X.


I would only keep this small test, and its point would be to check that pairwise_distances did not perform any unnecessary finiteness validation.

So remove pairwise_distances(X, Y, metric="masked_euclidean") yeah?

I mean that you should test pairwise_distances(X, Y, metric="masked_euclidean") once, to make sure that it does not raise an error due to NaNs

jnothman · 2017-07-23T12:17:12Z

sklearn/metrics/tests/test_pairwise.py

+                  [8., 2., 4., np.nan, 8.],
+                  [5., np.nan, 5., np.nan, 1.],
+                  [8., np.nan, np.nan, np.nan, np.nan]])
+    D1 = masked_euclidean_distances(X, missing_values="NaN")


Don't bother with this. Rather, after checking that it works in the X, Y case, check that just m_e_d(X) gives the same result as m_e_d(X, X)

jnothman · 2017-07-23T12:22:11Z

sklearn/metrics/tests/test_pairwise.py

+                  [np.nan, np.nan, 5., 4., 7.],
+                  [np.nan, np.nan, np.nan, 4., 5.]])
+
+    D3 = np.array([[6.32455532, 6.95221787, 4.74341649],


I'd rather see the tests show working, at least in some cases. Certainly you should test the distance calculation in the squared=True case for readability, then test the invariance that squared is meant to obtain.

assert_almost_equal(masked_euclidean_distances(X[:1], Y[:1], squared=True), [[5/2 * ((7-3)**2 + (2-2)**2)]])

jnothman · 2017-07-23T12:22:38Z

sklearn/metrics/pairwise.py

-    ============     ====================================
+    ============            ====================================
+    metric                  Function
+    ============            ====================================


The first underline should be longer.

jnothman · 2017-07-23T12:22:44Z

sklearn/metrics/pairwise.py

-    'l2'             metrics.pairwise.euclidean_distances
-    'manhattan'      metrics.pairwise.manhattan_distances
-    ============     ====================================
+    ============            ====================================


The first overline should be longer.

jnothman · 2017-07-23T12:22:48Z

sklearn/metrics/pairwise.py

+    'l2'                    metrics.pairwise.euclidean_distances
+    'manhattan'             metrics.pairwise.manhattan_distances
+    'masked_euclidean'      metrics.pairwise.masked_euclidean_distances
+    ============            ====================================


The first underline should be longer.

jnothman · 2017-07-23T12:25:39Z

sklearn/metrics/pairwise.py

+    vector in the pair or if there are no common non-missing coordinates then
+    NaN is returned for that pair.
+
+    References


We'd usually put this after Returns, before See Also (see for example additive_chi2_kernel in this file).

jnothman · 2017-07-23T12:27:53Z

sklearn/metrics/pairwise.py

+    # Calculate distances
+
+    distances = (X.shape[1] / ((np.dot(NX, NYT)))) * \
+                (np.dot((X * X), NYT) - 2 * (np.dot(X, YT)) +


please drop unnecessary parentheses around X * X and YT * YT

jnothman

This is starting to get there. I'm not so happy about the special-casing in neighbours, but it's alright for now.

jnothman · 2017-07-24T07:21:08Z

sklearn/metrics/pairwise.py

    For efficiency reasons, the euclidean distance between a pair of row
    vector x and y is computed as::
-


You've, I suppose accidentally, remind a whole lot of blank lines

This is weird -- the spacing does not exist on my machine. Has to be some artifact of Github.

Well if you removed it it doesn't exist. It existed before, but not after your PR. This is not an artifact of github.

Haha good one. But I think I misunderstood the comment -- @jnothman what do you mean by "remind a whole lot of blank lines"? Did you mean "removed"?

I did indeed. Reviewing in the phone is a terrible habit.

Gotcha. I initially thought you meant I added a whole lot of blank lines, which I obviously could not find. But yeah I do see the removed blank lines, and I have no idea how it happened! Sorry about that.

jnothman · 2017-07-24T07:21:31Z

sklearn/metrics/pairwise.py

@@ -256,6 +277,150 @@ def euclidean_distances(X, Y=None, Y_norm_squared=None, squared=False,
    return distances if squared else np.sqrt(distances, out=distances)


+# Pairwise distances in the presence of missing values


Unnecessary comment

jnothman · 2017-07-24T07:28:06Z

sklearn/metrics/tests/test_pairwise.py

+
+    assert_array_almost_equal(D1, D2)
+
+    # check when squared = True


Please just do the first test with squared=True then assert_almost_equal (med(X,Y)**2, med(X,Y, squared=True)).

Good tests, in my opinion, should look like a proof by induction. First you prove a base case, then you show that invariants hold in extending from the base case. The base case should ideally be something the reader can easily reason is doing the right thing, hence rational numbers or worked examples.
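
A sketch of what that test shape might look like. The toy med function and the vectors x and y are invented for the example (they are not the X and Y from test_pairwise.py), and 5 / 2 assumes true division, i.e. Python 3 or from __future__ import division:

```python
import math


def med(x, y, squared=False):
    """Toy masked Euclidean distance used only to illustrate the test shape."""
    pairs = [(a, b) for a, b in zip(x, y)
             if not (math.isnan(a) or math.isnan(b))]
    if not pairs:
        return float("nan")
    sq = len(x) / len(pairs) * sum((a - b) ** 2 for a, b in pairs)
    return sq if squared else math.sqrt(sq)


nan = float("nan")
x = [3.0, nan, 2.0, 9.0, nan]
y = [7.0, 1.0, 2.0, nan, nan]

# Base case: a value the reader can verify by hand. Two of five
# coordinates are shared, so the squared sum is scaled by 5 / 2.
assert med(x, y, squared=True) == 5 / 2 * ((7 - 3) ** 2 + (2 - 2) ** 2)

# Invariant: squared=False is the square root of the squared=True result.
assert math.isclose(med(x, y) ** 2, med(x, y, squared=True))

# Invariant: the distance is symmetric in its arguments.
assert med(x, y) == med(y, x)
```

The worked base case carries the burden of correctness; the later assertions only need to show that other call patterns agree with it.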

jnothman · 2017-07-24T07:36:47Z

sklearn/neighbors/base.py

@@ -158,6 +159,11 @@ def _init_params(self, n_neighbors=None, radius=None,
        self._fit_method = None

    def _fit(self, X):
+        if self.metric in _MASKED_SUPPORTED_METRICS:
+            kill_missing = False


Rename to allow_nans.

jnothman · 2017-07-24T07:39:35Z

sklearn/metrics/pairwise.py

+    in dense matrices X and Y with missing values in arbitrary
+    coordinates.
+
+    The following formula is used for this:


Please cut the rest of the description down to a few sentences describing the calculation between vector pairs

I don't know why you're repeatedly ignoring this comment.

Hmm, I did cut the description down by between 8-10 lines compared to my previous commit. Sorry if it looked like I was ignoring it, that was definitely not my intention. But, anyway, I will cut it down further.

Sorry, I had not noticed. I have tried to suggest that the matrix formulation here is unhelpful. You just need enough for the intuition behind calculating the metric to be clear. A couple of sentences.

jnothman · 2017-07-24T08:39:24Z

sklearn/neighbors/base.py

@@ -355,6 +372,10 @@ class from an array representing our data set and ask who's
            if self.effective_metric_ == 'euclidean':


Use or or in to put these cases in one.

jnothman · 2017-07-24T08:40:24Z

sklearn/neighbors/base.py

+                    "Nearest neighbor algorithm does not currently support"
+                    "the use of sparse matrices."
+                )
+            else:


Use elif rather than more nesting.

Please correct me if I am mistaken, but it seems the two "if" statements following the "else" are not mutually exclusive. This would preclude the use of two "elif"s instead right? However, I think I can simply remove the "else" as it seems to be redundant there.

Sorry, no need to use elif. Can just drop the else clause, as the preceding if clause raises an error.
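
Because the guard raises, control only continues past it when the condition is false, so an else adds nothing but indentation. A toy sketch of that shape; the function, its arguments, and its return values are invented for illustration and are not scikit-learn's code:

```python
def choose_fit_method(is_sparse, allow_nans, algorithm="auto"):
    """Toy validation flow showing a guard clause without an else."""
    if is_sparse:
        if allow_nans:
            raise ValueError(
                "Sparse input with missing values is not supported.")
        # No else needed: this point is reached only if the guard did not raise.
        if algorithm not in ("auto", "brute"):
            raise ValueError("Sparse input requires algorithm='brute'.")
        return "brute"
    return "kd_tree" if algorithm == "auto" else algorithm


print(choose_fit_method(is_sparse=True, allow_nans=False))
print(choose_fit_method(is_sparse=False, allow_nans=True))
```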

ashimb9 · 2017-07-26T09:18:50Z

@jnothman I have pushed the changes you asked for. Thanks again!

ashimb9 · 2017-07-28T20:32:52Z

@jnothman @amueller @jaquesgrobler Hey guys -- just a friendly ping to request feedback so I can wrap this up :)

jnothman · 2017-07-29T12:05:07Z

It's in my pile. I'm taking a while to work through it.

jnothman

I've not given this a full review now. Could I please suggest that you open a new PR which starts where this branch leaves off and implements a n_neighbors feature in Imputer...? unless you didn't want to do that part.

jnothman · 2017-07-29T13:26:41Z

sklearn/metrics/pairwise.py

+    where NX and NYT represent the logical-not of the missing masks of
+    X and Y.T, respectively.
+    Formula in matrix form derived by:
+    Shreya Bhattarai <[email protected]>


I don't think the docstring is an appropriate place to place credits about implementation details. If you wish, note it in a comment in the code.

jnothman · 2017-07-29T13:32:49Z

sklearn/metrics/pairwise.py

+        to be any format. False means that a sparse matrix input will
+        raise an error.
+
+        .. deprecated:: 0.19


If this is being added, the deprecation note is irrelevant.

Not sure I understand this. The deprecation note is for passing accept_sparse=None, which is not directly relevant to us?

I mean that this behaviour is changed in check_array, not here. Deprecation is there only to help users taking advantage of a previously supported interface.

jnothman · 2017-07-29T13:33:43Z

sklearn/metrics/pairwise.py

+    # NOTE: force_all_finite=False allows not only NaN but also +/- inf
+    X, Y = check_pairwise_arrays(X, Y, accept_sparse=False,
+                                 force_all_finite=False, copy=copy)
+    if (np.any(np.isinf(X.data)) or


Do we ever overwrite X if copy=False??

At the moment masked_euclidean_distances() sets copy=True by default. I did that because X is altered during distance calculation whereby all NaNs are replaced with zeros. What do you think?

Ahh sorry, I forgot that. I suspect that the user benefits little from being able to not copy (roughly the same memory is occupied by the mask), but I suppose it doesn't hurt to keep it in as long as it is tested.
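
A test for the copy behaviour could be built around something like the following. The helper zero_fill_missing is hypothetical and only mimics the NaN-to-zero replacement described above:

```python
import numpy as np


def zero_fill_missing(X, copy=True):
    """Replace NaN with 0 and return (filled, mask). Hypothetical helper."""
    mask = np.isnan(X)
    if copy:
        X = X.copy()
    X[mask] = 0.0
    return X, mask


data = np.array([[1.0, np.nan], [np.nan, 4.0]])

filled, mask = zero_fill_missing(data, copy=True)
print(np.isnan(data).sum())  # 2: the caller's array still has its NaNs

zero_fill_missing(data, copy=False)
print(np.isnan(data).sum())  # 0: the caller's array was overwritten
```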

jnothman · 2017-07-29T21:59:58Z

sklearn/metrics/pairwise.py

+            "+/- Infinite values are not allowed.")
+
+    # Check if any rows have only missing value
+    if np.any(_get_mask(X, missing_values).sum(axis=1) == X.shape[1])\


these repeated calls to _get_mask and any are relatively expensive. These should not be repeated here and below. And perhaps a helper should be factored out of imputer.

By a helper I mean a separate function, perhaps in sklearn.utils, that can be reused.

Is it okay if I do that as a separate PR later? I am thinking this PR might become a little unwieldy if I modify utils on top of already having modified both pairwise and neighbors. What do you think?

I think you should at least avoid repeating element-wise operations here. If a helper refactors between here and Imputer, yes, make the change in utils, in this PR. It is only relevant because of this PR.

Which module within sklearn.utils do you think is most appropriate? Or a new one instead? I could not locate a clear candidate at a quick glance.
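
Independently of where such a helper ends up living, the repeated element-wise work can be avoided by computing the mask once and deriving each check from it. A minimal sketch, assuming NaN is the missing marker; validate_missing is a made-up name, and the error messages are the ones quoted in the diffs above:

```python
import numpy as np


def validate_missing(X):
    """Compute the NaN mask once and reuse it for later checks (sketch)."""
    X = np.asarray(X, dtype=float)
    mask = np.isnan(X)  # single pass to locate missing entries
    if np.isinf(X).any():
        raise ValueError("+/- Infinite values are not allowed.")
    if mask.all(axis=1).any():  # reuses mask instead of recomputing it
        raise ValueError("One or more rows only contain missing values.")
    return mask


mask = validate_missing([[1.0, np.nan], [2.0, 3.0]])
print(mask)
```

The returned mask can then be passed on to the distance computation rather than rebuilt there.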

ashimb9 · 2017-07-29T22:47:19Z

I've not given this a full review now. Could I please suggest that you open a new PR which starts where this branch leaves off and implements a n_neighbors feature in Imputer...? unless you didn't want to do that part.

Sure, but what do I do with the old kNN imputation PR? Would you prefer I start a new one or that I just edit that PR by referencing to this instead?

jnothman · 2017-07-30T02:18:57Z

forgot about that pr. great! let's work on it...

jnothman · 2017-07-30T04:13:15Z

I think a few of your error messages do not have test coverage.

jnothman · 2017-08-14T00:42:08Z

The new function should be listed in doc/modules/classes.rst

jnothman · 2017-09-04T03:43:53Z

sklearn/metrics/pairwise.py

    if metric == "precomputed":
        X, _ = check_pairwise_arrays(X, Y, precomputed=True)
        return X
    elif metric in PAIRWISE_DISTANCE_FUNCTIONS:
-        func = PAIRWISE_DISTANCE_FUNCTIONS[metric]
+            func = PAIRWISE_DISTANCE_FUNCTIONS[metric]


Why this changed indent?

jnothman · 2017-09-04T03:43:58Z

sklearn/metrics/pairwise.py

@@ -1148,7 +1299,9 @@ def pairwise_distances(X, Y=None, metric="euclidean", n_jobs=1, **kwds):
    Valid values for metric are:

    - From scikit-learn: ['cityblock', 'cosine', 'euclidean', 'l1', 'l2',
-      'manhattan']. These metrics support sparse matrix inputs.
+      'manhattan']. These metrics support sparse matrix
+      inputs.


Why this newline?

jnothman · 2017-09-04T03:44:19Z

sklearn/neighbors/base.py


        n_samples = X.shape[0]
        if n_samples == 0:
            raise ValueError("n_samples must be greater than 0")

        if issparse(X):
+            if allow_nans:
+                raise ValueError(


This doesn't appear to be tested.

ashimb9 · 2017-09-04T05:03:17Z

@jnothman Hey, thanks a lot for the comments! A quick question: given that this has been merged with the PR for KNNImputer, should I address your comments here or in the other one?

jnothman · 2017-09-04T05:04:15Z

If you want to focus on KNNImputer, close this and work there. I only suggested creating this as a more tractable stepping stone for reviewers to iterate over. I still think it has value in that goal.

ashimb9 · 2017-09-04T05:08:33Z

Ok cool, I will keep this since you think it might be useful.

Modified metrics to enable euclidean distance calculation with missing (NaN) values

d707dcd

ashimb9 added 2 commits July 17, 2017 21:21

Changes to ensure Python 2.x compatibility

b4b5ae9

Fixed pep8 issues

04ed4a0

jnothman reviewed Jul 18, 2017

jaquesgrobler reviewed Jul 18, 2017

ashimb9 force-pushed the naneuclid branch from 04ed4a0 to c8ccc98 Compare July 19, 2017 06:22

ashimb9 force-pushed the naneuclid branch from c8ccc98 to 04ed4a0 Compare July 19, 2017 08:57

ashimb9 added 4 commits July 19, 2017 06:22

Addressed comments from review

a6d8ef6

Docstring example issues

e4f8612

Formatting fixes on docstring

daf247f

And yet more fixes

10f5adb

jnothman reviewed Jul 19, 2017

jnothman reviewed Jul 20, 2017

Addressed review comments (Part 2)

22cf9ef

Changed nan-mask from int8 to int32

2482c8a

jnothman reviewed Jul 23, 2017

ashimb9 added 3 commits July 24, 2017 00:52

Addressed review comments (scikit-learn#3)

66527cd

Pep8 fix

a968b1e

Comment edit on test_pairwise

356c8e8

jnothman reviewed Jul 24, 2017

Addressed review comments scikit-learn#4

d6aeaf3

ashimb9 added 3 commits July 25, 2017 02:37

replaced or with in

e8ccdee

Changed allow_nans assignment

4a8309b

One more or to in

5cbc156

jnothman reviewed Jul 29, 2017

jnothman mentioned this pull request Jul 30, 2017

[MRG] Added k-Nearest Neighbor imputation for missing data #9212

Closed

7 tasks

ashimb9 added 2 commits July 31, 2017 00:54

Addressed review comments scikit-learn#5

a31c43a

Edited comments

eacb19d

jnothman reviewed Sep 4, 2017

Addressed review comments - 6

351e3b9

thomasjpfan mentioned this pull request Dec 21, 2018

[MRG] Adds KNNImputer #12852

Merged

amueller added the Superseded label (PR has been replaced by a newer PR) Aug 5, 2019

jnothman closed this in #12852 Sep 3, 2019

		For efficiency reasons, the euclidean distance between a pair of row
		vector x and y is computed as::

		@@ -256,6 +277,150 @@ def euclidean_distances(X, Y=None, Y_norm_squared=None, squared=False,
		return distances if squared else np.sqrt(distances, out=distances)


		# Pairwise distances in the presence of missing values
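
For readers following the thread, here is a minimal pure-Python sketch of a NaN-aware Euclidean distance for a single pair of rows. It assumes the weighting documented for `nan_euclidean_distances` in later scikit-learn releases: sum the squared differences over coordinates present in both rows, then rescale by (total coordinates / present coordinates). Whether this PR's implementation used exactly that weighting is an assumption, and `masked_euclidean` is a hypothetical name, not a function from the diff.

```python
import math


def masked_euclidean(x, y, squared=False):
    """Distance between two equal-length rows, skipping NaN coordinates.

    Hypothetical helper for illustration only; not the code added in this PR.
    """
    if len(x) != len(y):
        raise ValueError("x and y must have the same number of features")

    # Keep only the coordinates that are observed (non-NaN) in both rows.
    present = [(a, b) for a, b in zip(x, y)
               if not (math.isnan(a) or math.isnan(b))]
    if not present:
        # No shared coordinates: the distance is undefined.
        return float("nan")

    sq_dist = sum((a - b) ** 2 for a, b in present)
    # Rescale to compensate for the coordinates that were skipped.
    sq_dist *= len(x) / len(present)
    return sq_dist if squared else math.sqrt(sq_dist)


nan = float("nan")
# Two of four coordinates are shared: sqrt(4 / 2 * ((3 - 1)**2 + (6 - 5)**2))
d = masked_euclidean([3.0, nan, nan, 6.0], [1.0, nan, 4.0, 5.0])
```

The vectorised version discussed in the review has to reach the same result through the dot-product expansion, which is why NaNs must be masked out (for example, zero-filled with a separate presence mask) before `safe_sparse_dot` is called.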


		assert_array_almost_equal(D1, D2)

		# check when squared = True

		@@ -355,6 +372,10 @@ class from an array representing our data set and ask who's
		if self.effective_metric_ == 'euclidean':
