[MRG] Modified sklearn.metrics to enable euclidean distance calculation with NaN #9348
Conversation
You might want to check why the CI are not happy.
sklearn/metrics/pairwise.py
Outdated
@@ -179,6 +193,19 @@ def euclidean_distances(X, Y=None, Y_norm_squared=None, squared=False,
    the distance matrix returned by this function may not be exactly
    symmetric as required by, e.g., ``scipy.spatial.distance`` functions.

    Additionally, euclidean_distances() can also compute pairwise euclidean
I suspect we should pull it out as a separate function, e.g. masked_euclidean_distances
Done.
sklearn/metrics/pairwise.py
Outdated
# NOTE: force_all_finite=False allows not only NaN but also inf/-inf
X, Y = check_pairwise_arrays(X, Y,
                             force_all_finite=kill_missing, copy=copy)
if kill_missing is False and \
Do you mean `not kill_missing` rather than `kill_missing is False`?
Ha sure did :)
sklearn/metrics/pairwise.py
Outdated
else:
    YY = row_norms(Y, squared=True)[np.newaxis, :]

distances = safe_sparse_dot(X, Y.T, dense_output=True)
I don't get how this works if there are NaNs in X and Y still.
Please see my previous comment.
sklearn/metrics/pairwise.py
Outdated
raise ValueError(
    "Incompatible dimensions for X and X_norm_squared")
else:
    XX = row_norms(X, squared=True)[:, np.newaxis]
I don't get how this works if there are NaNs in X still
Did you mean in case kill_missing=True but there are NaN values? I'll add a check there to avoid that scenario.
Maybe I've just misunderstood what's going on here. Can you please make a separate function rather than having a `kill_missing` setting?
Sure thing
sklearn/metrics/pairwise.py
Outdated
kill_missing : boolean, optional
    Allow missing values (e.g., NaN)

missing_values : String, optional
Why a string? Surely we only want this configurable in the case of integer data...?
Agreed
I'd be inclined to just assume `missing_values` is NaN. If we really want it configurable, it will be a number.
If I do integers only, I am just thinking how the user can conveniently pass NaN as the missing_values to the function. Or should the options rather be: either "NaN" or a number? And asking to pass np.nan might not be the best option? Or is there a better way I am not aware of here?
Yeah imputer has integer or “NaN” -- that seems like a good choice. Thanks!
sklearn/neighbors/base.py
Outdated
@@ -270,7 +272,8 @@ def _pairwise(self):
class KNeighborsMixin(object):
    """Mixin for k-neighbors searches"""

    def kneighbors(self, X=None, n_neighbors=None, return_distance=True):
    def kneighbors(self, X=None, n_neighbors=None, return_distance=True,
                   kill_missing=True, missing_values="NaN", copy=None):
I think we should just use a variant metric name, rather than adding all these parameters to kneighbors. Certainly, it does not belong in the method, but in the class.
Done.
sklearn/metrics/pairwise.py
Outdated
@@ -54,7 +63,8 @@ def _return_float_dtype(X, Y):
    return X, Y, dtype


def check_pairwise_arrays(X, Y, precomputed=False, dtype=None):
def check_pairwise_arrays(X, Y, precomputed=False, dtype=None,
                          copy=False, force_all_finite=True):
`copy` and `force_all_finite` should be added to the docstring parameters, no?
Of course. Thanks!
Hmm, I seem to have broken the code here. I had to do a force push due to some issues in my local machine, and that seems to have made CI unhappy again. Looking into it at the moment.
@jnothman @jaquesgrobler Hi guys -- So things seem to be running ok now. I have tried to address all of your comments in this update. Please take a look at your convenience. Thank you.
Just a glance at docs so far. Thanks for the quick work
sklearn/metrics/pairwise.py
Outdated
# Pairwise distances in the presence of missing values
def masked_euclidean_distances(X, Y=None, squared=False,
                               missing_values="NaN", copy=True, **kwargs):
    """
PEP 257: there should be a one-line summary here
will do.
sklearn/metrics/pairwise.py
Outdated
in dense matrices X and Y with missing values in arbitrary
coordinates. The following formula is used for this:

    dist(X, Y) = (X.shape[1] * 1 / ((dot(NX, NYT)))) *
Could we give the formula in terms of vectors rather than matrices? It's much more straightforward to understand relative to the well known Euclidean distance.
It is there, one or two paragraphs below the matrix notation: "Breakdown of euclidean distance calculation between a vector pair x,y...". Or did you mean something else?
Well, the user doesn't care how it's computed. They care what it means and why it behaves a certain way with their data. Better with documentation that gives a functional description and does not need to be updated whenever the implementation is.
Implementation notes can be commented inside the function if they will help maintainers and curious users.
Hmm, I am not totally sure what you mean. I have edited the docstring to change it to what I think you meant -- please let me know what you think about the following:
This formulation zero-weights feature coordinates
with missing value in either vector in the pair and up-weights the
remaining coordinates. For instance, say we have two sample points (x1,
y1) and (x2, NaN). To calculate the euclidean distance, first the square
"distance" is calculated based only on the first feature coordinate,
as the second coordinate is missing in one of the samples,
i.e., we get (x2-x1)**2. This squared distance is scaled-up by the ratio
of total number of coordinates to the number of available coordinates,
which in this case is 2/1 = 2. Now, we are left with 2*((x2-x1)**2).
Finally, if squared=False then the square root of this is evaluated
otherwise the value is returned as is.
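The worked example above can be sketched directly in NumPy. The helper name below is illustrative (the PR's function operates on matrices and lives in sklearn/metrics/pairwise.py); this is just the per-vector-pair calculation the docstring describes:

```python
import numpy as np

def masked_euclidean(x, y, squared=False):
    # Use only coordinates present in both vectors, then rescale the
    # squared distance by (total coordinates / available coordinates)
    # to compensate for the dropped terms.
    present = ~(np.isnan(x) | np.isnan(y))
    d2 = ((x[present] - y[present]) ** 2).sum()
    d2 *= x.shape[0] / present.sum()
    return d2 if squared else np.sqrt(d2)

x = np.array([1.0, 2.0])      # (x1, y1)
y = np.array([3.0, np.nan])   # (x2, NaN)
masked_euclidean(x, y, squared=True)  # 2/1 * (3 - 1)**2 = 8.0
```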
All I meant was that it's not especially helpful to the reader to describe the operation in terms of matrices and masks, and that should be deleted. But it can be described in terms of vectors.
I'm struggling with your description, and find "zero-weights" and "up-weights" particularly confusing.
You could state something like "In accordance with [x] we calculate Euclidean distance between vectors with some elements missing as: the sum of squared differences between elements that are not missing in either vector, scaled in inverse proportion to the number of elements not missing in either vector, and the square root taken." Alternatively "... as: the Euclidean distance between the elements that are not missing in either vector, multiplied by sqrt(vector length / number of elements not missing in either vector)." I'm not sure there about "Euclidean distance between elements". Is "the Euclidean distance between vectors consisting of the elements that are not missing in either input vector" clearer?
sklearn/metrics/pairwise.py
Outdated
array([[ 1.        ],
       [ 1.41421356]])

See also
I think this deserves a reference to research / textbooks / encyclopaedia where Euclidean distance with missing values is used / defined.
Sure, will add that.
sklearn/metrics/pairwise.py
Outdated
                axis=1) == Y.shape[1])):
    raise ValueError("One or more rows only contain missing values.")
#
# if kill_missing:
?
So my thinking for this was that a row with all NaN will only introduce "unnecessary" NaN values in the distance matrix. It does not technically matter for this specific situation (or for kneighbors) since all it does is return a row (or column) with all NaN values, but I was like why keep it when it can potentially introduce issues down the analysis-chain and does not contribute anything useful. So the current implementation requires users to get rid of samples that have nothing but NaN values before they pass the dataset to masked_kneighbors or masked_euclidean_distance(). However, if you prefer to remove this check for any reason please let me know and I will get rid of it.
I'm fine with all-NaN rows resulting in an error.
That still doesn't explain why you have a large block of commented-out code.
Oops, that was accidentally left there.
This `kill_missing` parameter is still a bit of a mystery to me.
sklearn/metrics/pairwise.py
Outdated
""" | ||
# Check and except sparse matrices | ||
if issparse(X) or (Y is not None and issparse(Y)): | ||
raise ValueError( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps `check_pairwise_arrays` should have an `accept_sparse` parameter.
sklearn/metrics/pairwise.py
Outdated
# (np.dot((X * X), NYT) - 2 * (np.dot(X, YT)) +
#  np.dot(NX, (YT * YT)))

# Above is faster but following for Python 2.x support
Do you mean Python 2 division? We only need `from __future__ import division` at the top of the file (and I'm surprised it's not already there).
It's the multiply that you've changed. I don't get why that's necessary...? Nor do I get how it could be substantially slower.
sklearn/metrics/pairwise.py
Outdated
#  np.dot(NX, (YT * YT)))

# Above is faster but following for Python 2.x support
distances = np.multiply(np.multiply(X.shape[1],
`X.shape[1] / np.dot(NX, NYT)` should suffice here.
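The simplification relies on true division. A minimal sketch (the arrays here are hypothetical; `NX`/`NYT` stand for the presence masks discussed above, with `from __future__ import division` covering Python 2 as suggested):

```python
from __future__ import division  # true division on Python 2

import numpy as np

X = np.array([[1.0, 2.0, np.nan]])
Y = np.array([[2.0, np.nan, 4.0]])
NX = (~np.isnan(X)).astype(np.int64)     # 1 where X is observed
NYT = (~np.isnan(Y)).astype(np.int64).T  # 1 where Y is observed
# Number of coordinates observed in both vectors, per pair:
np.dot(NX, NYT)               # [[1]] here (only column 0 is shared)
X.shape[1] / np.dot(NX, NYT)  # rescaling factor 3/1 = 3.0
```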
sklearn/metrics/pairwise.py
Outdated
# Get Y.T mask and anti-mask and set Y.T's missing to zero
YT = Y.T
mask_YT = _get_mask(YT, missing_values)
NYT = (~mask_YT).astype(np.int8)
does the astype help with performance? I suspect it does not.
No I did that because leaving it as bool was returning incorrect values. I think the issue is that dot product of bool matrices returns a bool which of course means that we do not get the sum of True as a sum of ones, as we would want for individual dot products.
Although this probably means that I should not use int8 either since that could be a problem when dataset has a lot of columns/features and 8 bits might not be enough ... will change that :)
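The boolean-dot pitfall described above is easy to demonstrate: NumPy computes the dot product of two boolean arrays in boolean algebra (OR of ANDs), so it can never count matches. Casting to a wide integer type, as the author concludes, gives the intended count:

```python
import numpy as np

a = np.array([True, True, False])
b = np.array([True, True, True])

np.dot(a, b)   # boolean result: True, not a count of co-present entries

# Cast to a wide integer type first (int64 rather than int8, since a
# count over many features could overflow 8 bits):
np.dot(a.astype(np.int64), b.astype(np.int64))  # 2
```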
sklearn/metrics/pairwise.py
Outdated
# Calculate distances

# distances = (X.shape[1] * 1 / ((np.dot(NX, NYT)))) * \
#     (np.dot((X * X), NYT) - 2 * (np.dot(X, YT)) +
A comment on `np.dot((X * X), NYT)` might be helpful. But I can't work out how to word it, so perhaps not :)
sklearn/metrics/pairwise.py
Outdated
@@ -1216,11 +1391,36 @@ def pairwise_distances(X, Y=None, metric="euclidean", n_jobs=1, **kwds):
        "Valid metrics are %s, or 'precomputed', or a "
        "callable" % (metric, _VALID_METRICS))

# To handle kill_missing = False
kill_missing = kwds.get("kill_missing")
I don't get what this is all about. Why aren't we just adding 'masked_euclidean' to `PAIRWISE_DISTANCE_FUNCTIONS`?
Yeah this one was a tough call. So ideally we probably don't want the user to pass "masked_euclidean" right? Which leaves us with the option to check if kill_missing==False, and if it is then to use metric as "masked_euclidean" when the user passes "euclidean". I was afraid that allowing user to pass "euclidean" when it actually meant calling masked_euclidean might potentially trigger all sorts of unintended things down the chain if both versions shared the same function dictionary, aka I was just playing it safe. But if you think it is okay, then I will do the conversion to masked_ version by checking for kill_missing==False.
I would rather the user specify 'masked_euclidean'. Why not? Certainly it's no problem in the case of Imputer.
Ohh, cool...That makes my life easier :)
@jnothman Just pushed a commit addressing the issues that you raised. And thanks a lot for all your feedback so far. PS: Please don't mind the random comment blocks for now, I will get rid of them soon :)
I'm all for that. :P
Something closer to a full review.
sklearn/neighbors/base.py
Outdated
@@ -421,6 +454,174 @@ class from an array representing our data set and ask who's
        return dist, neigh_ind
    return neigh_ind

def masked_kneighbors(self, X=None, n_neighbors=None, return_distance=True,
Is the main point in duplicating this to pass `force_all_finite=False`? I don't think it's necessary.
Rather, I think we should be leaving the data validation to `pairwise_distances` and the ball/binary tree (which seem to validate when queried, but not when constructed), and remove it from here / make it minimal. I admit that getting this right might be tricky, but I am not happy with a solution that duplicates `kneighbors` (and why not `radius_neighbors` also?) for the sake of validating more leniently.
Do you mean the method masked_kneighbors itself? I thought you wanted me to create a separate method to handle missing datasets? Initially, I had the missing handling within kneighbors() but I created this later. Maybe I misunderstood you then?
Yes, must have been a misunderstanding. I only asked for separation in `pairwise_distances`, and then only as much separation as there is between `euclidean_distances`, `cosine_distances`, and `manhattan_distances`.
S = pairwise_distances(X, metric="masked_euclidean")
S2 = masked_euclidean_distances(X)
assert_array_almost_equal(S, S2)
# Euclidean distance, with Y != X.
I would only keep this small test, and its point would be to check that `pairwise_distances` did not perform any unnecessary finiteness validation.
So remove pairwise_distances(X, Y, metric="masked_euclidean") yeah?
I mean that you should test `pairwise_distances(X, Y, metric="masked_euclidean")` once, to make sure that it does not raise an error due to NaNs.
[8., 2., 4., np.nan, 8.],
[5., np.nan, 5., np.nan, 1.],
[8., np.nan, np.nan, np.nan, np.nan]])
D1 = masked_euclidean_distances(X, missing_values="NaN")
Don't bother with this. Rather, after checking that it works in the X, Y case, check that just m_e_d(X) gives the same result as m_e_d(X, X)
[np.nan, np.nan, 5., 4., 7.],
[np.nan, np.nan, np.nan, 4., 5.]])

D3 = np.array([[6.32455532, 6.95221787, 4.74341649],
I'd rather see the tests show working, at least in some cases. Certainly you should test the distance calculation in the `squared=True` case for readability, then test the invariance that `squared` is meant to obtain.
assert_almost_equal(masked_euclidean_distances(X[:1], Y[:1], squared=True),
[[5/2 * ((7-3)**2 + (2-2)**2)]])
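The worked assertion above can be reproduced with a self-contained sketch of the matrix formulation under review. The function name mirrors the PR's, but the body and the row values are hypothetical stand-ins (`X[:1]` and `Y[:1]` are elided in the diff), chosen so that exactly 2 of 5 coordinates are present in both rows:

```python
import numpy as np

def masked_euclidean_distances(X, Y, squared=False):
    # Zero out missing entries, count co-present coordinates with a dot
    # product of the presence masks, expand the squared distance, rescale.
    NX = (~np.isnan(X)).astype(np.int64)
    NY = (~np.isnan(Y)).astype(np.int64)
    Xz = np.where(np.isnan(X), 0.0, X)
    Yz = np.where(np.isnan(Y), 0.0, Y)
    d2 = (np.dot(Xz * Xz, NY.T) - 2 * np.dot(Xz, Yz.T)
          + np.dot(NX, (Yz * Yz).T))
    d2 *= X.shape[1] / np.dot(NX, NY.T)
    return d2 if squared else np.sqrt(d2)

X = np.array([[7., 2., np.nan, np.nan, np.nan]])
Y = np.array([[3., 2., np.nan, 4., 5.]])
masked_euclidean_distances(X, Y, squared=True)
# [[5/2 * ((7-3)**2 + (2-2)**2)]] = [[40.]]
```

This also makes the squared/unsquared invariance easy to check: squaring the default output recovers the `squared=True` result.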
sklearn/metrics/pairwise.py
Outdated
============ ====================================
============ ====================================
metric Function
============ ====================================
The first underline should be longer.
sklearn/metrics/pairwise.py
Outdated
'l2' metrics.pairwise.euclidean_distances
'manhattan' metrics.pairwise.manhattan_distances
============ ====================================
============ ====================================
The first overline should be longer.
sklearn/metrics/pairwise.py
Outdated
'l2' metrics.pairwise.euclidean_distances
'manhattan' metrics.pairwise.manhattan_distances
'masked_euclidean' metrics.pairwise.masked_euclidean_distances
============ ====================================
The first underline should be longer.
sklearn/metrics/pairwise.py
Outdated
vector in the pair or if there are no common non-missing coordinates then
NaN is returned for that pair.

References
We'd usually put this after Returns, before See Also (see for example additive_chi2_kernel in this file).
sklearn/metrics/pairwise.py
Outdated
# Calculate distances

distances = (X.shape[1] / ((np.dot(NX, NYT)))) * \
    (np.dot((X * X), NYT) - 2 * (np.dot(X, YT)) +
please drop unnecessary parentheses around X * X and YT * YT
This is starting to get there. I'm not so happy about the special-casing in neighbours, but it's alright for now.
sklearn/metrics/pairwise.py
Outdated
For efficiency reasons, the euclidean distance between a pair of row
vector x and y is computed as::
You've, I suppose accidentally, remind a whole lot of blank lines
This is weird -- the spacing does not exist on my machine. Has to be some artifact of Github.
Well if you removed it it doesn't exist. It existed before, but not after your PR. This is not an artifact of github.
Haha good one. But I think I misunderstood the comment -- @jnothman what do you mean by "remind a whole lot of blank lines"? Did you mean "removed"?
I did indeed. Reviewing in the phone is a terrible habit.
Gotcha. I initially thought you meant I added a whole lot of blank lines, which I obviously could not find. But yeah I do see the removed blank lines, and I have no idea how it happened! Sorry about that.
sklearn/metrics/pairwise.py
Outdated
@@ -256,6 +277,150 @@ def euclidean_distances(X, Y=None, Y_norm_squared=None, squared=False,
    return distances if squared else np.sqrt(distances, out=distances)


# Pairwise distances in the presence of missing values
Unnecessary comment
assert_array_almost_equal(D1, D2)

# check when squared = True
Please just do the first test with `squared=True` then `assert_almost_equal(med(X, Y)**2, med(X, Y, squared=True))`.
Good tests, in my opinion, should look like a proof by induction. First you prove a base case, then you show that invariants hold in extending from the base case. The base case should ideally be something the reader can easily reason is doing the right thing, hence rational numbers or worked examples.
sklearn/neighbors/base.py
Outdated
@@ -158,6 +159,11 @@ def _init_params(self, n_neighbors=None, radius=None,
    self._fit_method = None

def _fit(self, X):
    if self.metric in _MASKED_SUPPORTED_METRICS:
        kill_missing = False
Rename to `allow_nans`.
sklearn/metrics/pairwise.py
Outdated
in dense matrices X and Y with missing values in arbitrary
coordinates.

The following formula is used for this:
Please cut the rest of the description down to a few sentences describing the calculation between vector pairs
I don't know why you're repeatedly ignoring this comment.
Hmm, I did cut the description down by 8-10 lines compared to my previous commit. Sorry if it looked like I was ignoring it; that was definitely not my intention. But, anyway, I will cut it down further.
Sorry, I hadn't noticed. I have tried to suggest that the matrix formulation here is unhelpful. You just need enough for the intuition behind calculating the metric to be clear. A couple of sentences.
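For what it's worth, the intuition fits in a few lines: zero-fill the NaNs, then expand sum((x - y)**2) over the jointly observed coordinates into three dot products against the observation masks. The sketch below is an illustrative reconstruction, not the PR's code, and it assumes an n_features / n_present rescaling:

```python
import numpy as np

def masked_euclidean_matrix(X, Y):
    NX = ~np.isnan(X)          # observation mask of X
    NY = ~np.isnan(Y)          # observation mask of Y
    ZX = np.where(NX, X, 0.0)  # X with NaNs replaced by zeros
    ZY = np.where(NY, Y, 0.0)
    # sum((x - y)**2) over shared coords = sum x**2 + sum y**2 - 2 * sum x*y;
    # the zero-filled entries drop out of every term.
    D2 = (ZX ** 2) @ NY.T - 2.0 * (ZX @ ZY.T) + NX @ (ZY ** 2).T
    # Rescale by the fraction of usable coordinates (pairs sharing no
    # observed coordinate would divide by zero and need separate handling).
    n_present = NX.astype(float) @ NY.T.astype(float)
    return np.sqrt(X.shape[1] / n_present * D2)

X = np.array([[1.0, np.nan, 3.0]])
Y = np.array([[1.0, 2.0, 7.0]])
d = masked_euclidean_matrix(X, Y)  # hand check: sqrt(3/2 * 16)
```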
@@ -355,6 +372,10 @@ class from an array representing our data set and ask who's
        if self.effective_metric_ == 'euclidean':
Use "or", or "in", to put these cases in one.
sklearn/neighbors/base.py
Outdated
"Nearest neighbor algorithm does not currently support" | ||
"the use of sparse matrices." | ||
) | ||
else: |
Use elif rather than more nesting.
Please correct me if I am mistaken, but it seems the two "if" statements following the "else" are not mutually exclusive. This would preclude the use of two "elif"s instead right? However, I think I can simply remove the "else" as it seems to be redundant there.
Sorry, no need to use elif. Can just drop the else clause, as the preceding if clause raises an error.
@jnothman I have pushed the changes you asked for. Thanks again!
@jnothman @amueller @jaquesgrobler Hey guys -- just a friendly ping to request feedback so I can wrap this up :)
It's in my pile. I'm taking a while to work through it.
I've not given this a full review now. Could I please suggest that you open a new PR which starts where this branch leaves off and implements a n_neighbors feature in Imputer...? unless you didn't want to do that part.
sklearn/metrics/pairwise.py
Outdated
in dense matrices X and Y with missing values in arbitrary
coordinates.

The following formula is used for this:
I don't know why you're repeatedly ignoring this comment.
sklearn/metrics/pairwise.py
Outdated
where NX and NYT represent the logical-not of the missing masks of
X and Y.T, respectively.
Formula in matrix form derived by:
Shreya Bhattarai <[email protected]>
I don't think the docstring is an appropriate place to place credits about implementation details. If you wish, note it in a comment in the code.
sklearn/metrics/pairwise.py
Outdated
        to be any format. False means that a sparse matrix input will
        raise an error.

        .. deprecated:: 0.19
If this is being added, the deprecation note is irrelevant.
Not sure I understand this. The deprecation note is for passing accept_sparse=None, which is not directly relevant to us?
I mean that this behaviour is changed in check_array, not here. Deprecation is there only to help users taking advantage of a previously supported interface.
sklearn/metrics/pairwise.py
Outdated
    # NOTE: force_all_finite=False allows not only NaN but also +/- inf
    X, Y = check_pairwise_arrays(X, Y, accept_sparse=False,
                                 force_all_finite=False, copy=copy)
    if (np.any(np.isinf(X.data)) or
Do we ever overwrite X if copy=False??
At the moment masked_euclidean_distances() sets copy=True by default. I did that because X is altered during distance calculation whereby all NaNs are replaced with zeros. What do you think?
Ahh sorry, I forgot that. I suspect that the user benefits little from being able to not copy (roughly the same memory is occupied by the mask), but I suppose it doesn't hurt to keep it in as long as it is tested.
sklearn/metrics/pairwise.py
Outdated
"+/- Infinite values are not allowed.") | ||
|
||
# Check if any rows have only missing value | ||
if np.any(_get_mask(X, missing_values).sum(axis=1) == X.shape[1])\ |
These repeated calls to _get_mask and any are relatively expensive. They should not be repeated here and below. And perhaps a helper should be factored out of imputer.
helper?
By a helper I mean a separate function, perhaps in sklearn.utils, that can be reused.
Is it okay if I do that as a separate PR later? I am thinking this PR might become a little unwieldy if I modify utils on top of already having modified both pairwise and neighbors. What do you think?
I think you should at least avoid repeating element-wise operations here. If a helper refactors between here and Imputer, yes, make the change in utils, in this PR. It is only relevant because of this PR.
Which module within sklearn.utils do you think is most appropriate? Or a new one instead? I could not locate a clear candidate at a quick glance.
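Whichever module it ends up in, the kind of reusable helper being discussed might look like the sketch below (get_mask is a hypothetical name, not the project's actual helper; the point is to compute the mask once and reuse it for every check):

```python
import numpy as np

def get_mask(X, value_to_mask):
    # NaN cannot be located with ==, so it is special-cased; the string
    # "NaN" mirrors the Imputer's missing_values convention.
    if value_to_mask == "NaN" or (
            isinstance(value_to_mask, float) and np.isnan(value_to_mask)):
        return np.isnan(X)
    return X == value_to_mask

X = np.array([[1.0, np.nan],
              [np.nan, 4.0]])
mask = get_mask(X, "NaN")                    # computed once...
full_rows = mask.sum(axis=1) == X.shape[1]   # ...reused for each check
any_missing = mask.any()
```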
Sure, but what do I do with the old kNN imputation PR? Would you prefer I start a new one, or that I just edit that PR by referencing this instead?
forgot about that pr. great! let's work on it...
I think a few of your error messages do not have test coverage.
The new function should be listed in
sklearn/metrics/pairwise.py
Outdated
if metric == "precomputed":
    X, _ = check_pairwise_arrays(X, Y, precomputed=True)
    return X
elif metric in PAIRWISE_DISTANCE_FUNCTIONS:
        func = PAIRWISE_DISTANCE_FUNCTIONS[metric]
    func = PAIRWISE_DISTANCE_FUNCTIONS[metric]
Why this changed indent?
sklearn/metrics/pairwise.py
Outdated
@@ -1148,7 +1299,9 @@ def pairwise_distances(X, Y=None, metric="euclidean", n_jobs=1, **kwds):
    Valid values for metric are:

    - From scikit-learn: ['cityblock', 'cosine', 'euclidean', 'l1', 'l2',
      'manhattan']. These metrics support sparse matrix inputs.
      'manhattan']. These metrics support sparse matrix
      inputs.
Why this newline?
n_samples = X.shape[0]
if n_samples == 0:
    raise ValueError("n_samples must be greater than 0")

if issparse(X):
    if allow_nans:
        raise ValueError(
This doesn't appear to be tested.
@jnothman Hey, thanks a lot for the comments! A quick question: given that this has been merged with the PR for KNNImputer, should I address your comments here or in the other one?
If you want to focus on KNNImputer, close this and work there. I only
suggested creating this as a more tractable stepping stone for reviewers to
iterate over. I still think it has value in that goal.
Ok cool, I will keep this open since you think it might be useful.
Reference Issue
Issue: 4844, 2989
Prelude to: 9212
What does this implement/fix? Explain your changes.
This is actually the first part of another PR. The referenced PR implements a KNN-based imputation strategy using its own k-nearest-neighbor calculator, but it was suggested here that it might be a better option to make changes within sklearn.metrics instead, so that the existing sklearn.neighbors machinery could be used. Given this context, this PR modifies the relevant modules in sklearn.metrics (and sklearn.neighbors, to a very minor extent) so that the euclidean distance between samples with arbitrary missing coordinates can now be calculated.
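As a rough illustration of what this enables (a sketch under assumed semantics, not the PR's API): once the distance tolerates NaNs, nearest-neighbor queries over incomplete data reduce to an ordinary argmin over pairwise distances.

```python
import numpy as np

def masked_euclidean(x, y):
    # Distance over coordinates observed in both vectors, rescaled so rows
    # with many NaNs remain comparable to fully observed ones (assumption).
    present = ~(np.isnan(x) | np.isnan(y))
    d2 = ((x[present] - y[present]) ** 2).sum()
    return np.sqrt(x.shape[0] / present.sum() * d2)

X = np.array([[1.0, 2.0, np.nan],
              [1.0, 2.0, 3.0],
              [9.0, 9.0, 9.0]])
query = np.array([1.0, 2.0, 3.5])

# Brute-force 1-NN under the masked metric
dists = np.array([masked_euclidean(query, row) for row in X])
nearest = dists.argmin()  # row 0: its observed coords match the query exactly
```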