[MRG] Modified sklearn.metrics to enable euclidean distance calculation with NaN #9348

Closed
wants to merge 19 commits into from

Contributor

@ashimb9 ashimb9 commented Jul 13, 2017

Reference Issue

Issue: #4844, #2989
Prelude to: #9212

What does this implement/fix? Explain your changes.

This is actually the first part of another PR. The referenced PR implements a KNN-based imputation strategy using its own k-nearest-neighbor calculator. However, it was suggested here that it might be better to make changes within sklearn.metrics instead, so that the existing sklearn.neighbors machinery could be used. Given this context, this PR modifies the relevant modules in sklearn.metrics (and, to a very minor extent, sklearn.neighbors) so that the Euclidean distance between samples with arbitrary missing coordinates can be calculated.

@glemaitre
Member

You might want to check why the CI is not happy.

@@ -179,6 +193,19 @@ def euclidean_distances(X, Y=None, Y_norm_squared=None, squared=False,
the distance matrix returned by this function may not be exactly
symmetric as required by, e.g., ``scipy.spatial.distance`` functions.

Additionally, euclidean_distances() can also compute pairwise euclidean
Member

I suspect we should pull it out as a separate function, e.g. masked_euclidean_distances

Contributor Author

Done.

# NOTE: force_all_finite=False allows not only NaN but also inf/-inf
X, Y = check_pairwise_arrays(X, Y,
force_all_finite=kill_missing, copy=copy)
if kill_missing is False and \
Member

Do you mean not kill_missing rather than kill_missing is False?

Contributor Author

Ha sure did :)
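
The two spellings differ because `is False` is an identity test against the `False` singleton, while `not` tests truthiness. A minimal illustration follows; `kill_missing` is used only as a stand-in variable name here, and the candidate values are chosen for the example rather than taken from the PR:

```python
# Compare the identity test with the truthiness test for several falsy values.
candidates = [False, 0, None, ""]

for kill_missing in candidates:
    identity = kill_missing is False   # True only for the False singleton
    truthiness = not kill_missing      # True for every falsy value
    print(repr(kill_missing), identity, truthiness)

# Collect the (identity, truthiness) pairs for inspection.
results = [(value is False, not value) for value in candidates]
```

Only the first candidate passes the identity test, so a caller passing `0` or `None` would silently take the other branch under `is False`.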

else:
YY = row_norms(Y, squared=True)[np.newaxis, :]

distances = safe_sparse_dot(X, Y.T, dense_output=True)
Member

I don't get how this works if there are NaNs in X and Y still.

Contributor Author

^(Please see my previous comment.)

raise ValueError(
"Incompatible dimensions for X and X_norm_squared")
else:
XX = row_norms(X, squared=True)[:, np.newaxis]
Member

I don't get how this works if there are NaNs in X still

Contributor Author

Did you mean in case kill_missing=True but there are NaN values? I'll add a check there to avoid that scenario.

Member

Maybe I've just misunderstood what's going on here. Can you please make a separate function rather than having a kill_missing setting?

Contributor Author

Sure thing

kill_missing : boolean, optional
Allow missing values (e.g., NaN)

missing_values : String, optional
Member

Why a string? Surely we only want this configurable in the case of integer data...?

Contributor Author

Agreed

Member

I'd be inclined to just assume missing_values is NaN. If we really want it configurable, it will be a number.

Contributor Author

If I do integers only, I am just thinking how the user can conveniently pass NaN as the missing_values to the function. Or should the options rather be: either "NaN" or a number? And asking to pass np.nan might not be the best option? Or is there a better way I am not aware of here?

Member

Contributor Author

Yeah imputer has integer or “NaN” -- that seems like a good choice. Thanks!
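
One reason a "NaN" option needs special handling is that NaN never compares equal to itself, so an equality test cannot locate it. The sketch below shows one way such a mask could be built; `get_mask` is a hypothetical helper written for this illustration, not scikit-learn's implementation:

```python
import math


def get_mask(values, missing_values="NaN"):
    """Return a list of booleans marking entries equal to the missing marker.

    Hypothetical helper for illustration only.
    """
    if missing_values == "NaN" or (
        isinstance(missing_values, float) and math.isnan(missing_values)
    ):
        # NaN never compares equal to itself, so == cannot be used here.
        return [isinstance(v, float) and math.isnan(v) for v in values]
    # For a numeric marker (e.g. -1 in integer data) plain equality works.
    return [v == missing_values for v in values]


nan = float("nan")
print(get_mask([1.0, nan, 3.0]))                 # marker given as the string "NaN"
print(get_mask([1.0, nan, 3.0], nan))            # marker given as a NaN float
print(get_mask([1, -1, 3], missing_values=-1))   # integer marker
```

Accepting either the string or a number keeps the integer-data case configurable without asking users to pass `np.nan` explicitly.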

@@ -270,7 +272,8 @@ def _pairwise(self):
class KNeighborsMixin(object):
"""Mixin for k-neighbors searches"""

-    def kneighbors(self, X=None, n_neighbors=None, return_distance=True):
+    def kneighbors(self, X=None, n_neighbors=None, return_distance=True,
+                   kill_missing=True, missing_values="NaN", copy=None):
Member

I think we should just use a variant metric name, rather than adding all these parameters to kneighbors. Certainly, it does not belong in the method, but in the class.

Contributor Author

Done.

kill_missing : boolean, optional
Allow missing values (e.g., NaN)

missing_values : String, optional
Member

I'd be inclined to just assume missing_values is NaN. If we really want it configurable, it will be a number.

@@ -54,7 +63,8 @@ def _return_float_dtype(X, Y):
return X, Y, dtype


-def check_pairwise_arrays(X, Y, precomputed=False, dtype=None):
+def check_pairwise_arrays(X, Y, precomputed=False, dtype=None,
+                          copy=False, force_all_finite=True):
Member

copy and force_all_finite should be added to the docstring parameters, no?

Contributor Author

Of course. Thanks!

@ashimb9
Contributor Author

ashimb9 commented Jul 19, 2017

Hmm, I seem to have broken the code here. I had to do a force push due to some issues in my local machine, and that seems to have made CI unhappy again. Looking into it at the moment.

@ashimb9
Contributor Author

ashimb9 commented Jul 19, 2017

@jnothman @jaquesgrobler Hi guys -- So things seem to be running ok now. I have tried to address all of your comments in this update. Please take a look at your convenience. Thank you.

Member

@jnothman jnothman left a comment

Just a glance at docs so far. Thanks for the quick work

# Pairwise distances in the presence of missing values
def masked_euclidean_distances(X, Y=None, squared=False,
missing_values="NaN", copy=True, **kwargs):
"""
Member

PEP 257: there should be a one-line summary here.

Contributor Author

will do.

in dense matrices X and Y with missing values in arbitrary
coordinates. The following formula is used for this:

dist(X, Y) = (X.shape[1] * 1 / ((dot(NX, NYT)))) *
Member

Could we give the formula in terms of vectors rather than matrices? It's much more straightforward to understand relative to the well known Euclidean distance.

Contributor Author

It is there, one or two paragraphs below the matrix notation: "Breakdown of euclidean distance calculation between a vector pair x,y...". Or did you mean something else?

Member

Well, the user doesn't care how it's computed. They care what it means and why it behaves a certain way with their data. Better with documentation that gives a functional description and does not need to be updated whenever the implementation is.

Implementation notes can be commented inside the function if they will help maintainers and curious users.

Contributor Author

Hmm, I am not totally sure what you mean. I have edited the docstring to change it to what I think you meant -- please let me know what you think about the following:

This formulation zero-weights feature coordinates
with missing value in either vector in the pair and up-weights the
remaining coordinates. For instance, say we have two sample points (x1,
y1) and (x2, NaN). To calculate the euclidean distance, first the square
"distance" is calculated based only on the first feature coordinate, 
as the second coordinate is missing in one of the samples,
i.e., we get (x2-x1)**2. This squared distance is scaled-up by the ratio 
of total number of coordinates to the number of available coordinates, 
which in this case is 2/1 = 2. Now, we are left with 2*((x2-x1)**2). 
Finally, if squared=False then the square root of this is evaluated 
otherwise the value is returned as is.

Member

All I meant was that it's not especially helpful to the reader to describe the operation in terms of matrices and masks, and that should be deleted. But it can be described in terms of vectors.

I'm struggling with your description, and find "zero-weights" and "up-weights" particularly confusing.

You could state something like "In accordance with [x] we calculate Euclidean distance between vectors with some elements missing as: the sum of squared differences between elements that are not missing in either vector, scaled in inverse proportion to the number of elements not missing in either vector, and the square root taken." Alternatively "... as: the Euclidean distance between the elements that are not missing in either vector, multiplied by sqrt(vector length / number of elements not missing in either vector)." I'm not sure there about "Euclidean distance between elements". Is "the Euclidean distance between vectors consisting of the elements that are not missing in either input vector" clearer?
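
The vector-level description above can be written down directly. The following is a minimal pure-Python sketch of that definition, assuming NaN marks a missing value; `masked_euclidean` is an illustrative name and not the function added by this PR:

```python
import math


def masked_euclidean(x, y, squared=False):
    """Euclidean distance between x and y, ignoring coordinates missing in either.

    Illustrative sketch only; NaN marks a missing value.
    """
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    # Keep only coordinate pairs where both values are present.
    present = [(a, b) for a, b in zip(x, y)
               if not (math.isnan(a) or math.isnan(b))]
    if not present:
        return float("nan")  # no common coordinates to compare
    sq = sum((a - b) ** 2 for a, b in present)
    # Scale up by (total coordinates / coordinates present in both).
    sq *= len(x) / len(present)
    return sq if squared else math.sqrt(sq)


nan = float("nan")
# Two points (x1, y1) = (1, 5) and (x2, NaN) = (4, NaN):
# only the first coordinate is usable, and the weight is 2 / 1.
d2 = masked_euclidean([1.0, 5.0], [4.0, nan], squared=True)  # 2 * (4 - 1)**2
d = masked_euclidean([1.0, 5.0], [4.0, nan])
```

Written this way, the result is the ordinary Euclidean distance over the shared coordinates multiplied by sqrt(vector length / number of shared coordinates), which matches the second phrasing suggested in the comment.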

array([[ 1.        ],
       [ 1.41421356]])

See also
Member

I think this deserves a reference to research / textbooks / encyclopaedia where Euclidean distance with missing values is used / defined.

Contributor Author

Sure, will add that.

axis=1) == Y.shape[1])):
raise ValueError("One or more rows only contain missing values.")
#
# if kill_missing:
Member

?

Contributor Author

So my thinking for this was that a row with all NaN will only introduce "unnecessary" NaN values in the distance matrix. It does not technically matter for this specific situation (or for kneighbors) since all it does is return a row (or column) with all NaN values, but I was like why keep it when it can potentially introduce issues down the analysis-chain and does not contribute anything useful. So the current implementation requires users to get rid of samples that have nothing but NaN values before they pass the dataset to masked_kneighbors or masked_euclidean_distance(). However, if you prefer to remove this check for any reason please let me know and I will get rid of it.

Member

I'm fine with all-NaN rows resulting in an error.

That still doesn't explain why you have a large block of commented-out code.

Contributor Author

Oops, that was accidentally left there.
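
The behaviour agreed on in this thread (reject samples that contain nothing but missing values) could look roughly like the sketch below. It is written in plain Python for readability; `check_no_all_missing_rows` is a hypothetical name, and only the error message is taken from the diff:

```python
import math


def check_no_all_missing_rows(rows):
    """Raise ValueError if any row consists solely of NaN values."""
    for row in rows:
        if all(math.isnan(v) for v in row):
            raise ValueError("One or more rows only contain missing values.")


nan = float("nan")
check_no_all_missing_rows([[1.0, nan], [nan, 2.0]])  # passes silently

try:
    check_no_all_missing_rows([[1.0, nan], [nan, nan]])
    raised = False
except ValueError:
    raised = True
```

Failing early keeps all-NaN rows from producing whole rows or columns of NaN in the distance matrix further down the analysis chain.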

Member

@jnothman jnothman left a comment

This kill_missing parameter is still a bit of a mystery to me.

"""
# Check and except sparse matrices
if issparse(X) or (Y is not None and issparse(Y)):
raise ValueError(
Member

Perhaps check_pairwise_arrays should have an accept_sparse parameter.

# (np.dot((X * X), NYT) - 2 * (np.dot(X, YT)) +
# np.dot(NX, (YT * YT)))

# Above is faster but following for Python 2.x support
Member

Do you mean Python 2 division? We only need from __future__ import division at the top of the file (and I'm surprised it's not already there).

Member

It's the multiply that you've changed. I don't get why that's necessary...? Nor do I get how it could be substantially slower.

# np.dot(NX, (YT * YT)))

# Above is faster but following for Python 2.x support
distances = np.multiply(np.multiply(X.shape[1],
Member

X.shape[1] / np.dot(NX, NYT) should suffice here

# Get Y.T mask and anti-mask and set Y.T's missing to zero
YT = Y.T
mask_YT = _get_mask(YT, missing_values)
NYT = (~mask_YT).astype(np.int8)
Member

does the astype help with performance? I suspect it does not.

Contributor Author

No I did that because leaving it as bool was returning incorrect values. I think the issue is that dot product of bool matrices returns a bool which of course means that we do not get the sum of True as a sum of ones, as we would want for individual dot products.

Contributor Author

Although this probably means that I should not use int8 either since that could be a problem when dataset has a lot of columns/features and 8 bits might not be enough ... will change that :)
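
Both observations in this exchange can be reproduced with a few lines of NumPy. The arrays below are made up for the illustration (they are not the PR's data), and `np.int32` is just one possible wider dtype:

```python
import numpy as np

# One row with three observed coordinates, against a column with all observed.
observed_x = np.array([[True, True, True]])
observed_y = np.array([[True], [True], [True]])

bool_count = np.dot(observed_x, observed_y)
int_count = np.dot(observed_x.astype(np.int32), observed_y.astype(np.int32))
print(bool_count, bool_count.dtype)  # boolean result: "any overlap", not a count
print(int_count, int_count.dtype)    # integer result: number of shared coordinates

# int8 inputs keep the result in int8, so counts above 127 cannot be represented.
wide_x = np.ones((1, 200), dtype=np.int8)
wide_y = np.ones((200, 1), dtype=np.int8)
int8_count = np.dot(wide_x, wide_y)
print(int8_count, int8_count.dtype)
```

The dot product of the "observed" masks is what supplies the number of shared coordinates in the scaling factor, so the mask dtype has to be wide enough to hold the number of features.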

# Calculate distances

# distances = (X.shape[1] * 1 / ((np.dot(NX, NYT)))) * \
# (np.dot((X * X), NYT) - 2 * (np.dot(X, YT)) +
Member

A comment on np.dot((X * X), NYT) might be helpful. But I can't work out how to word it, so perhaps not :)
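
One possible way to word that comment is sketched below, together with a small check of the matrix formula quoted in the diff. This is a sketch under stated assumptions: NaN marks missing values, every pair of rows shares at least one observed coordinate, and `masked_sq_distances` is an illustrative name rather than the PR's function:

```python
import numpy as np


def masked_sq_distances(X, Y):
    """Squared masked Euclidean distances via the matrix formula in this PR.

    Sketch for illustration; assumes NaN marks missing values and that every
    pair of rows shares at least one observed coordinate.
    """
    X = np.array(X, dtype=float)
    YT = np.array(Y, dtype=float).T
    NX = (~np.isnan(X)).astype(np.int32)     # 1 where X is observed
    NYT = (~np.isnan(YT)).astype(np.int32)   # 1 where Y.T is observed
    X[np.isnan(X)] = 0.0                     # zero out missing entries
    YT[np.isnan(YT)] = 0.0

    # np.dot(X * X, NYT)[i, j]: sum of x_ik**2 over coordinates k observed in
    # both rows (x_ik is already 0 where x is missing, NYT drops k missing in y).
    # np.dot(NX, YT * YT)[i, j] is the mirror-image term for y_jk**2.
    sq = np.dot(X * X, NYT) - 2 * np.dot(X, YT) + np.dot(NX, YT * YT)
    n_common = np.dot(NX, NYT)               # shared observed coordinates
    return X.shape[1] / n_common * sq


nan = np.nan
X = [[1.0, 5.0], [4.0, nan]]
D = masked_sq_distances(X, X)
```

For the two rows above only the first coordinate is shared, so the off-diagonal entries are 2 / 1 * (4 - 1)**2 and the diagonal is zero.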

@@ -1216,11 +1391,36 @@ def pairwise_distances(X, Y=None, metric="euclidean", n_jobs=1, **kwds):
"Valid metrics are %s, or 'precomputed', or a "
"callable" % (metric, _VALID_METRICS))

# To handle kill_missing = False
kill_missing = kwds.get("kill_missing")
Member

I don't get what this is all about. Why aren't we just adding 'masked_euclidean' to PAIRWISE_DISTANCE_FUNCTIONS?

Contributor Author

Yeah this one was a tough call. So ideally we probably don't want the user to pass "masked_euclidean" right? Which leaves us with the option to check if kill_missing==False, and if it is then to use metric as "masked_euclidean" when the user passes "euclidean". I was afraid that allowing user to pass "euclidean" when it actually meant calling masked_euclidean might potentially trigger all sorts of unintended things down the chain if both versions shared the same function dictionary, aka I was just playing it safe. But if you think it is okay, then I will do the conversion to masked_ version by checking for kill_missing==False.

@jnothman
Member

jnothman commented Jul 23, 2017 via email

@ashimb9
Contributor Author

ashimb9 commented Jul 23, 2017

Ohh, cool...That makes my life easier :)

@ashimb9
Contributor Author

ashimb9 commented Jul 23, 2017

@jnothman Just pushed a commit addressing the issues that you raised. And thanks a lot for all your feedback so far. PS: Please don't mind the random comment blocks for now, I will get rid of them soon :)

@jnothman
Member

> That makes my life easier :)

I'm all for that. :P

Member

@jnothman jnothman left a comment

Something closer to a full review.

@@ -421,6 +454,174 @@ class from an array representing our data set and ask who's
return dist, neigh_ind
return neigh_ind

def masked_kneighbors(self, X=None, n_neighbors=None, return_distance=True,
Member

Is the main point in duplicating this to pass force_all_finite=False? I don't think it's necessary.

Rather, I think we should be leaving the data validation to the pairwise_distances and ball/binary tree (which seem to validate when queried, but not when constructed), and remove it from here / make it minimal. I admit that getting this right might be tricky, but I am not happy with a solution that duplicates kneighbors (and why not radius_neighbors also?) for the sake of validating more leniently.

Contributor Author

Do you mean the method masked_kneighbors itself? I thought you wanted me to create a separate method to handle missing datasets? Initially, I had the missing handling within kneighbors() but I created this later. Maybe I misunderstood you then?

Member

Yes, must have been a misunderstanding. I only asked for separation in pairwise_distances and then only as much separation as there is between euclidean_distances and cosine_distances and manhattan_distances.

S = pairwise_distances(X, metric="masked_euclidean")
S2 = masked_euclidean_distances(X)
assert_array_almost_equal(S, S2)
# Euclidean distance, with Y != X.
Member

I would only keep this small test, and its point would be to check that pairwise_distances did not perform any unnecessary finiteness validation.

Contributor Author

So remove pairwise_distances(X, Y, metric="masked_euclidean") yeah?

Member

I mean that you should test pairwise_distances(X, Y, metric="masked_euclidean") once, to make sure that it does not raise an error due to NaNs

[8., 2., 4., np.nan, 8.],
[5., np.nan, 5., np.nan, 1.],
[8., np.nan, np.nan, np.nan, np.nan]])
D1 = masked_euclidean_distances(X, missing_values="NaN")
Member

Don't bother with this. Rather, after checking that it works in the X, Y case, check that just m_e_d(X) gives the same result as m_e_d(X, X)

[np.nan, np.nan, 5., 4., 7.],
[np.nan, np.nan, np.nan, 4., 5.]])

D3 = np.array([[6.32455532, 6.95221787, 4.74341649],
Member

I'd rather see the tests show working, at least in some cases. Certainly you should test the distance calculation in the squared=True case for readability, then test the invariance that squared is meant to obtain.

    assert_almost_equal(masked_euclidean_distances(X[:1], Y[:1], squared=True),
                        [[5/2 * ((7-3)**2 + (2-2)**2)]])

============ ====================================
============ ====================================
metric Function
============ ====================================
Member

The first underline should be longer.

'l2' metrics.pairwise.euclidean_distances
'manhattan' metrics.pairwise.manhattan_distances
============ ====================================
============ ====================================
Member

The first overline should be longer.

'l2' metrics.pairwise.euclidean_distances
'manhattan' metrics.pairwise.manhattan_distances
'masked_euclidean' metrics.pairwise.masked_euclidean_distances
============ ====================================
Member

The first underline should be longer.

vector in the pair or if there are no common non-missing coordinates then
NaN is returned for that pair.

References
Member

We'd usually put this after Returns, before See Also (see for example additive_chi2_kernel in this file).

# Calculate distances

distances = (X.shape[1] / ((np.dot(NX, NYT)))) * \
(np.dot((X * X), NYT) - 2 * (np.dot(X, YT)) +
Member

please drop unnecessary parentheses around X * X and YT * YT

Member

@jnothman jnothman left a comment

This is starting to get there. I'm not so happy about the special-casing in neighbours, but it's alright for now.

For efficiency reasons, the euclidean distance between a pair of row
vector x and y is computed as::

Member

You've, I suppose accidentally, remind a whole lot of blank lines

Contributor Author

This is weird -- the spacing does not exist on my machine. Has to be some artifact of Github.

Member

Well if you removed it it doesn't exist. It existed before, but not after your PR. This is not an artifact of github.

Contributor Author

Haha good one. But I think I misunderstood the comment -- @jnothman what do you mean by "remind a whole lot of blank lines"? Did you mean "removed"?

Member

I did indeed. Reviewing in the phone is a terrible habit.

Contributor Author

Gotcha. I initially thought you meant I added a whole lot of blank lines, which I obviously could not find. But yeah I do see the removed blank lines, and I have no idea how it happened! Sorry about that.

@@ -256,6 +277,150 @@ def euclidean_distances(X, Y=None, Y_norm_squared=None, squared=False,
return distances if squared else np.sqrt(distances, out=distances)


# Pairwise distances in the presence of missing values
Member

Unnecessary comment


assert_array_almost_equal(D1, D2)

# check when squared = True
Member

Please just do the first test with squared=True then assert_almost_equal (med(X,Y)**2, med(X,Y, squared=True)).

Good tests, in my opinion, should look like a proof by induction. First you prove a base case, then you show that invariants hold in extending from the base case. The base case should ideally be something the reader can easily reason is doing the right thing, hence rational numbers or worked examples.
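
The "base case, then invariants" structure described here might look like the following sketch. The helper `med` is a compact stand-in for the masked distance between two vectors, and the rows are hypothetical values chosen to reproduce the hand-computable expression 5/2 * ((7-3)**2 + (2-2)**2) from the earlier review comment; the PR's actual test arrays are only partly visible in this thread:

```python
import math


def med(x, y, squared=False):
    """Compact stand-in for the masked distance between two vectors."""
    pairs = [(a, b) for a, b in zip(x, y)
             if not (math.isnan(a) or math.isnan(b))]
    sq = len(x) / len(pairs) * sum((a - b) ** 2 for a, b in pairs)
    return sq if squared else math.sqrt(sq)


nan = float("nan")
# Hypothetical rows: 5 features, only the first two observed in both.
x = [7.0, 2.0, nan, nan, 1.0]
y = [3.0, 2.0, 4.0, nan, nan]

# Base case: a squared distance the reader can verify by hand.
base = med(x, y, squared=True)
expected = 5 / 2 * ((7 - 3) ** 2 + (2 - 2) ** 2)

# Invariants that extend the base case.
root_matches = math.isclose(med(x, y) ** 2, base)   # squared=False is the root
symmetric = math.isclose(med(x, y), med(y, x))      # argument order is irrelevant
```

Note that under Python 2 without `from __future__ import division`, `5 / 2` would evaluate to 2, so the expected value is best written with floats there.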

@@ -158,6 +159,11 @@ def _init_params(self, n_neighbors=None, radius=None,
self._fit_method = None

def _fit(self, X):
if self.metric in _MASKED_SUPPORTED_METRICS:
kill_missing = False
Member

Rename to allow_nans.

in dense matrices X and Y with missing values in arbitrary
coordinates.

The following formula is used for this:
Member

Please cut the rest of the description down to a few sentences describing the calculation between vector pairs

Member

I don't know why you're repeatedly ignoring this comment.

Contributor Author

Hmm, I did cut the description down by between 8-10 lines compared to my previous commit. Sorry if it looked like I was ignoring it, that was definitely not my intention. But, anyway, I will cut it down further.

Member

Sorry, I had not noticed. I have tried to suggest that the matrix formulation here is unhelpful. You just need enough for the intuition behind calculating the metric to be clear. A couple of sentences.

@@ -355,6 +372,10 @@ class from an array representing our data set and ask who's
if self.effective_metric_ == 'euclidean':
Member

Use or or in to put these cases in one.

"Nearest neighbor algorithm does not currently support"
"the use of sparse matrices."
)
else:
Member

Use elif rather than more nesting.

Contributor Author

Please correct me if I am mistaken, but it seems the two "if" statements following the "else" are not mutually exclusive. This would preclude the use of two "elif"s instead right? However, I think I can simply remove the "else" as it seems to be redundant there.

Member

Sorry, no need to use elif. You can just drop the else clause, as the preceding if clause raises an error.

@ashimb9
Contributor Author

ashimb9 commented Jul 26, 2017

@jnothman I have pushed the changes you asked for. Thanks again!

@ashimb9
Contributor Author

ashimb9 commented Jul 28, 2017

@jnothman @amueller @jaquesgrobler Hey guys -- just a friendly ping to request feedback so I can wrap this up :)

@jnothman
Member

jnothman commented Jul 29, 2017 via email

Member

@jnothman jnothman left a comment

I've not given this a full review now. Could I please suggest that you open a new PR which starts where this branch leaves off and implements an n_neighbors feature in Imputer...? Unless you didn't want to do that part.

in dense matrices X and Y with missing values in arbitrary
coordinates.

The following formula is used for this:
Member

I don't know why you're repeatedly ignoring this comment.

where NX and NYT represent the logical-not of the missing masks of
X and Y.T, respectively.
Formula in matrix form derived by:
Shreya Bhattarai <[email protected]>
Member

I don't think the docstring is an appropriate place for credits about implementation details. If you wish, note it in a comment in the code.
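The docstring excerpt above refers to NX and NYT, the logical-not of the missing masks of X and Y.T, but the formula itself is not visible in this excerpt. The following is a sketch of one common matrix formulation of a masked squared Euclidean distance, offered as an assumption rather than as the PR's exact code: NaNs are zero-filled, the squared difference is summed over coordinates present in both rows, and the result is rescaled by the ratio of total to jointly present coordinates. The function name `masked_sq_euclidean` is hypothetical.

```python
import numpy as np


def masked_sq_euclidean(X, Y):
    """Squared Euclidean distances over jointly observed coordinates.

    Sketch only: NaN marks a missing value. Pairs that share no observed
    coordinate get NaN in the output.
    """
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)

    # Logical-not of the missing masks, as floats so they can be used in dot products.
    NX = (~np.isnan(X)).astype(float)
    NYT = (~np.isnan(Y)).astype(float).T

    # Zero-filled copies: a zero contributes nothing to the sums below.
    X0 = np.where(np.isnan(X), 0.0, X)
    YT = np.where(np.isnan(Y), 0.0, Y).T

    # sum_k NX_ik * NY_jk * (x_ik - y_jk)**2, expanded into three products.
    partial = (X0 * X0) @ NYT - 2.0 * (X0 @ YT) + NX @ (YT * YT)
    np.maximum(partial, 0.0, out=partial)  # guard against tiny negative round-off

    # Rescale by (number of features) / (number of jointly present coordinates).
    n_common = NX @ NYT
    with np.errstate(divide="ignore", invalid="ignore"):
        scaled = partial * (X.shape[1] / n_common)
    scaled[n_common == 0] = np.nan
    return scaled
```

Unlike the code under review, this sketch returns NaN instead of raising when two rows share no observed coordinate.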

to be any format. False means that a sparse matrix input will
raise an error.

.. deprecated:: 0.19
Member

If this is being added, the deprecation note is irrelevant.

Contributor Author

Not sure I understand this. The deprecation note is for passing accept_sparse=None, which is not directly relevant to us?

Member

I mean that this behaviour is changed in check_array, not here. The deprecation is there only to help users who were taking advantage of a previously supported interface.

# NOTE: force_all_finite=False allows not only NaN but also +/- inf
X, Y = check_pairwise_arrays(X, Y, accept_sparse=False,
force_all_finite=False, copy=copy)
if (np.any(np.isinf(X.data)) or
Member

Do we ever overwrite X if copy=False??

Contributor Author

At the moment masked_euclidean_distances() sets copy=True by default. I did that because X is altered during distance calculation whereby all NaNs are replaced with zeros. What do you think?

Member

Ahh sorry, I forgot that. I suspect that the user benefits little from being able to not copy (roughly the same memory is occupied by the mask), but I suppose it doesn't hurt to keep it in as long as it is tested.
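The concern in this thread can be shown with a small sketch: if NaNs are replaced with zeros in place and no copy is made, the caller's array is modified. The helper `zero_fill_missing` is hypothetical and is not the function from the PR.

```python
import numpy as np


def zero_fill_missing(X, copy=True):
    """Replace NaNs with zeros and return the filled array and the mask.

    With copy=False the input array itself is overwritten, provided it is
    already a float ndarray (otherwise np.asarray makes a new array anyway).
    """
    X = np.asarray(X, dtype=float)
    if copy:
        X = X.copy()
    mask = np.isnan(X)
    X[mask] = 0.0
    return X, mask


original = np.array([[1.0, np.nan], [np.nan, 4.0]])

filled, mask = zero_fill_missing(original, copy=True)
still_has_nan = bool(np.isnan(original).any())  # input untouched

filled_inplace, _ = zero_fill_missing(original, copy=False)
input_overwritten = not np.isnan(original).any()  # NaNs are gone from the input
```

A test for the `copy` parameter, as requested above, would assert both outcomes.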

"+/- Infinite values are not allowed.")

# Check if any rows have only missing value
if np.any(_get_mask(X, missing_values).sum(axis=1) == X.shape[1])\
Member

These repeated calls to _get_mask and any are relatively expensive. They should not be repeated here and below. And perhaps a helper should be factored out of imputer.

Contributor Author

helper?

Member

By a helper I mean a separate function, perhaps in sklearn.utils, that can be reused.

Contributor Author

Is it okay if I do that as a separate PR later? I am thinking this PR might become a little unwieldy if I modify utils on top of already having modified both pairwise and neighbors. What do you think?

Member

I think you should at least avoid repeating element-wise operations here. If a helper refactors between here and Imputer, yes, make the change in utils, in this PR. It is only relevant because of this PR.
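One way to avoid the repeated element-wise work is to compute each mask once and reuse it for every later check. This is a sketch under assumptions: `get_mask` is a simplified stand-in for the private `_get_mask` helper, `check_no_empty_rows` is a hypothetical name, and the error message is invented for the example.

```python
import numpy as np


def get_mask(X, missing_values=np.nan):
    """Simplified stand-in: boolean mask of entries equal to missing_values."""
    if isinstance(missing_values, float) and np.isnan(missing_values):
        return np.isnan(X)
    return X == missing_values


def check_no_empty_rows(X, Y, missing_values=np.nan):
    """Compute each mask once, validate, and hand the masks back for reuse."""
    mask_X = get_mask(X, missing_values)
    mask_Y = get_mask(Y, missing_values)

    # all(axis=1) flags rows where every coordinate is missing.
    if mask_X.all(axis=1).any() or mask_Y.all(axis=1).any():
        # Hypothetical message, not the one used in the PR.
        raise ValueError("One or more rows contain only missing values.")
    return mask_X, mask_Y
```

Returning the masks lets the distance computation reuse them instead of calling the mask helper again.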

Contributor Author

Which module within sklearn.utils do you think is most appropriate? Or a new one instead? I could not locate a clear candidate at a quick glance.

@ashimb9
Contributor Author

ashimb9 commented Jul 29, 2017

> I've not given this a full review now. Could I please suggest that you open a new PR which starts where this branch leaves off and implements an n_neighbors feature in Imputer...? Unless you didn't want to do that part.

Sure, but what do I do with the old kNN imputation PR? Would you prefer that I start a new one, or that I just edit that PR to reference this one instead?

@jnothman
Member

jnothman commented Jul 30, 2017 via email

@jnothman
Member

I think a few of your error messages do not have test coverage.
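As an illustration of what covering an error message could look like, here is a dependency-light sketch. The validator `reject_infinite` is hypothetical; only the message text is taken from the diff. In a real test suite, a helper such as `pytest.raises(ValueError, match=...)` would usually replace the try/except.

```python
import numpy as np


def reject_infinite(X):
    """Hypothetical validator: allow NaN but refuse +/- infinity."""
    X = np.asarray(X, dtype=float)
    if np.any(np.isinf(X)):
        raise ValueError("+/- Infinite values are not allowed.")
    return X


def test_infinite_input_raises():
    # The test fails if no error is raised or if the message changes.
    try:
        reject_infinite([[1.0, np.inf]])
    except ValueError as exc:
        assert "Infinite values are not allowed" in str(exc)
    else:
        raise AssertionError("expected ValueError for infinite input")
```

Checking the message text, not just the exception type, is what ensures each distinct error path is actually exercised.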

@jnothman
Member

The new function should be listed in doc/modules/classes.rst

if metric == "precomputed":
X, _ = check_pairwise_arrays(X, Y, precomputed=True)
return X
elif metric in PAIRWISE_DISTANCE_FUNCTIONS:
func = PAIRWISE_DISTANCE_FUNCTIONS[metric]
func = PAIRWISE_DISTANCE_FUNCTIONS[metric]
Member

Why did the indentation change here?

@@ -1148,7 +1299,9 @@ def pairwise_distances(X, Y=None, metric="euclidean", n_jobs=1, **kwds):
Valid values for metric are:

- From scikit-learn: ['cityblock', 'cosine', 'euclidean', 'l1', 'l2',
'manhattan']. These metrics support sparse matrix inputs.
'manhattan']. These metrics support sparse matrix
inputs.
Member

Why was this newline added?


n_samples = X.shape[0]
if n_samples == 0:
raise ValueError("n_samples must be greater than 0")

if issparse(X):
if allow_nans:
raise ValueError(
Member

This doesn't appear to be tested.

@ashimb9
Contributor Author

ashimb9 commented Sep 4, 2017

@jnothman Hey, thanks a lot for the comments! A quick question: given that this has been merged with the PR for KNNImputer, should I address your comments here or in the other one?

@jnothman
Member

jnothman commented Sep 4, 2017 via email

@ashimb9
Contributor Author

ashimb9 commented Sep 4, 2017

Ok cool, I will keep this since you think it might be useful.

@amueller added the "Superseded" label (PR has been replaced by a newer PR) on Aug 5, 2019