Thanks to visit codestin.com
Credit goes to github.com

Skip to content

Conversation

scottgigante
Copy link
Contributor

As discussed in #25

Copy link
Owner

@wmayner wmayner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR!

Seems like there's pretty bad precision in certain cases, like array_emd([1], [2]). This should be 1, but it's 0.9666... on my machine. This has to do with NumPy's heuristics for bin sizes, which result in many bins even in small cases like that. I'm not sure how important this is, but it may be worth thinking about.

pyemd/emd.pyx Outdated
histogram. Defaults to the range of the union of `first_array`
and `second_array`. Note: if the given range is not a superset
of the default range, no warning will be given.
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general, trailing whitespace should be stripped. There's probably a plugin for your text editor that can do this automatically for you.

if (first_histogram.shape[0] != second_histogram.shape[0]):
raise ValueError('Histogram lengths must be equal')

def euclidean_pairwise_distance(x):
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There should be two blank lines before top-level function definitions. There are automated tools for checking these sorts of things, called “linters”, such as pylint.

pyemd/emd.pyx Outdated
return emd(first_histogram,
second_histogram,
distance_matrix,
extra_mass_penalty)
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This call should explicitly pass extra_mass_penalty as a keyword argument:

emd(..., extra_mass_penalty=extra_mass_penalty)

>>> first_array = [1,2,3,4]
>>> second_array = [2,3,4,5]
>>> array_emd(first_array, second_array, bins=2)
0.5
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There should be a few unit tests added to test/test_pyemd.py, in addition to this.

pyemd/emd.pyx Outdated
distance_matrix,
extra_mass_penalty)

def array_emd(first_array,
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should be called emd_samples (if you can think of something better, I'm all ears), since the other emd functions also take arrays, and they all begin with emd.

Copy link
Owner

@wmayner wmayner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good—just a couple of style nitpicks.

Last thing is to write some tests, and then I think this is good to go.

pyemd/emd.pyx Outdated
range = (min(np.min(first_array), np.min(second_array)),
max(np.max(first_array), np.max(second_array)))

if type(bins) == str:
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

isinstance(bins, str) should be preferred over explicitly checking type equality.

pyemd/emd.pyx Outdated
if type(bins) == str:
hist, _ = np.histogram(np.concatenate([first_array,
second_array]),
range=range, bins=bins)
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nitpick: arguments should be aligned.

>>> emd_with_flow(first_signature, second_signature, distance_matrix)
(3.5, [[0.0, 0.0], [0.0, 1.0]])
You can also calculate the EMD directly from two arrays:
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could elaborate and say “arrays of observations” or “samples” instead of “arrays” and mention that the histogram is computed automatically.

if distance == 'euclidean':
distance = euclidean_pairwise_distance

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nitpick: comments should be capitalized, to be consistent with the others.

@scottgigante
Copy link
Contributor Author

scottgigante commented Jan 19, 2018 via email

@wmayner
Copy link
Owner

wmayner commented Jan 19, 2018

Gotcha, no worries. I'll wait to review until you comment.

@scottgigante
Copy link
Contributor Author

Okay, I'm pretty happy with that now. Thanks for your patience!

@wmayner
Copy link
Owner

wmayner commented Jan 22, 2018

Great, looks good. Thank you!

@wmayner wmayner merged commit 56ea285 into wmayner:develop Jan 22, 2018
wmayner added a commit that referenced this pull request Jan 22, 2018
Release v0.5.0

- Add the `emd_samples()` function (PR #26).
- Clarify docstrings.
- Update documentation in README.
- Refactor tests.
- Explicitly support Python 3.4 and 3.5.
@wmayner
Copy link
Owner

wmayner commented Jan 22, 2018

Released in v0.5.0.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants