MAINT/ENH: Support for weights and range when estimating optimal number of bins #6288


Closed
nayyarv wants to merge 822 commits

Conversation

@nayyarv
Contributor

nayyarv commented Sep 7, 2015

Following on from PR #6029, the estimation methods provided there ignore range and weights, so this PR aims to address that.

The estimators are still defined as separate functions in case there is ever a future need to expose them to users. Each estimator now takes x and weights as arguments. These subfunctions expect weights to be count-like; if weights is probability-like, the subfunctions won't choose the data size properly.

_hist_optim_numbins_estimator handles the range keyword by masking the appropriate parts of a and weights. It also checks whether weights is of an appropriate dtype (int or float). If weights is probability-like (i.e. weights.sum() < a.size), I scale weights such that weights.sum() == a.size. This means that I effectively choose n = max(a.size, weights.sum()) for my data size, allowing for decent behaviour whether weights is count-like or probability-like, while keeping things simple.
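
A minimal sketch of that data-size heuristic (the helper name `_effective_data_size` is illustrative, not a function in this PR):

```python
import numpy as np

def _effective_data_size(a, weights):
    # Sketch of the heuristic above: count-like weights sum to >= a.size,
    # probability-like weights sum to < a.size and are rescaled up, so the
    # effective n is simply the larger of the two quantities.
    if weights is None:
        return a.size
    return max(a.size, float(weights.sum()))
```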

Freedman-Diaconis requires a weighted IQR to be calculated, and since neither np.percentile nor np.partition provides a weighted option, this method is only used for unweighted data. Trying to use the 'FD' estimator with weights results in a TypeError. 'auto' now chooses between 'sturges' and 'scott' for weighted data, while retaining the original behaviour for unweighted data.
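
For reference, a rough sketch of that weighted fallback using the textbook forms of the rules (the function name and the exact selection logic are illustrative; the PR's code may differ):

```python
import numpy as np

def _auto_nbins(x, n, weighted=False):
    # Sturges' formula: bin count grows with log2 of the sample size.
    sturges = int(np.ceil(np.log2(n))) + 1
    # Scott's rule: bin width h = 3.5 * sigma * n**(-1/3).
    h_scott = 3.5 * x.std() * n ** (-1.0 / 3.0)
    scott = int(np.ceil(np.ptp(x) / h_scott))
    if weighted:
        # FD needs a weighted IQR, which numpy does not provide, so the
        # weighted path has to choose between Sturges and Scott.
        return max(sturges, scott)
    # Freedman-Diaconis: bin width h = 2 * IQR * n**(-1/3) (assumes IQR > 0).
    iqr = np.subtract(*np.percentile(x, [75, 25]))
    h_fd = 2.0 * iqr * n ** (-1.0 / 3.0)
    return max(sturges, int(np.ceil(np.ptp(x) / h_fd)))
```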

I've put together some basic tests and minor documentation changes, and would love to hear others' thoughts and feedback on the design choices/implementation/style etc.

# build a mask of the values that fall inside the requested range
mn, mx = data_range
keep = (a >= mn)
keep &= (a <= mx)
if not np.logical_and.reduce(keep):
Member

use np.all instead

Contributor Author

I copied the filter step from https://github.com/numpy/numpy/blob/master/numpy/lib/function_base.py#L400 which came up in #6100.

It's actually faster than np.all, though for style reasons I'm happy to use np.all too.
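
For reference, the two spellings agree on a boolean mask; the speed gap is plausibly just np.all's extra dispatch overhead (a quick sanity check, sketch only):

```python
import numpy as np

a = np.random.rand(1000000)
keep = (a >= 0.25) & (a <= 0.75)

# Both reduce the mask with logical AND; np.all goes through an extra
# wrapper layer before reaching the same reduction.
assert bool(np.all(keep)) == bool(np.logical_and.reduce(keep))
```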

@shoyer
Member

shoyer commented Sep 7, 2015

The design choices here look good to me.

@seberg
Member

seberg commented Sep 7, 2015

Frankly, I am a bit sceptical about automatically guessing the weight type. Maybe it is clear enough; I just wanted to say that it leaves a bit of a bad taste for me, for starters.

That said, I think the functionality is not bad at all.

@nayyarv
Contributor Author

nayyarv commented Sep 8, 2015

I agree that choosing data_size from weights is far from perfect, but without further input from the user (**kwargs or another argument in histogram)/exposing the estimators, it's the best I can do.
After our discussion at the end of the last PR, we decided the two main use cases for weights + automatic bin estimation were either count-like or probability-like weights (this matched friends' and colleagues' thoughts).
If the weights were count-like, their sum would be greater than x.size, and if they were probability-like, their sum would be less, which is why I use the max(a.size, weights.sum()) logic.
If the user was using the weights option as a bin summation method, it's unlikely they would use automatic estimators, but rather fixed bin edges - so situations in which the weights are not count/probability-like can be ignored.

Any better solutions, I'm all ears. If nothing else, we could simply patch in the range support and deal with weights at a later date.

@seberg
Member

seberg commented Sep 8, 2015

I am not currently convinced that you can always correctly infer the type. So I am still wondering if we cannot find a way to force the user to make a conscious choice.

@nayyarv
Contributor Author

nayyarv commented Sep 9, 2015

Fair enough - the way I see it, we need to know 2 things

  1. Whether or not to consider the weights when estimating the number of bins
  2. If we do consider the weights, what should the data size be - weights.sum() or a.size.

Basically, we need to know how to deal with datasize - if it's None, we can ignore the weights when estimating, if it's 'sum' or 'size', we know that we have to deal with weights. I.e. 1 argument.

Possible solutions,

  1. Expose the estimators for manual control and let np.histogram provide simple default behaviour where it tries to guess (Maybe move the estimators to scipy.stats?)
  2. Add another argument to np.histogram for instructions on how to deal with weights, weights_sum=True or something.
  3. Add options to the bins string, like 'auto sum' or 'scott size', which can then be split to retrieve the estimator and how to deal with weights. If the second term isn't included, ignore weights in the estimation. This also means the bins checking logic remains simple, as opposed to making it a tuple or list which is also iterable.
  4. Something similar to 3, except maybe combined into the weights argument as a tuple or iterable?

I'm in favour of 3 (or maybe 1), as it requires the least amount of API changes, and it's an incremental change.
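
A rough sketch of what the option-3 parsing could look like (the 'sum'/'size' tokens and the helper name are hypothetical, not part of this PR):

```python
def _parse_bins_string(bins, a, weights):
    # Hypothetical parser for strings like 'auto', 'scott size', 'auto sum'.
    parts = bins.split()
    estimator = parts[0]
    if len(parts) == 1 or weights is None:
        n = a.size                 # no instruction: ignore weights
    elif parts[1] == 'sum':
        n = weights.sum()          # count-like weights
    elif parts[1] == 'size':
        n = a.size                 # probability-like weights
    else:
        raise ValueError("unknown weights mode: %r" % parts[1])
    return estimator, n
```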

@njsmith
Member

njsmith commented Sep 9, 2015

Would it make sense to add a new kwarg data_size whose value defaults to weights.sum()?

@nayyarv
Contributor Author

nayyarv commented Sep 9, 2015

I'm not overly keen on adding a new kwarg that only applies when a bins estimator is being used and weights is not None. Seems like overkill, and I would like to avoid R style functions with their mostly unused arguments.

@njsmith
Member

njsmith commented Sep 9, 2015

I'm not keen on having lots of non-orthogonal kwargs either, but I like it a lot better than the idea of defining a little ad hoc string language inside a single string kwarg. The other general option that comes to mind would be using something that has more structure than a str, like a dict or an object, but I'm not seeing a terribly natural way to apply that trick here myself...

@tacaswell
Contributor

If there are only two options for how to deal with the weights, maybe just do a lookup on the whole string. You effectively double/triple the number of estimators, but that does not seem so bad.

From the mpl side, having histogram gain any kwargs would greatly complicate the argument handling in hist.

@shoyer
Member

shoyer commented Sep 10, 2015

I wonder if it would make sense to pick data_size based on the dtype of the array:

  • int: use data.sum()
  • float: use data.size

I guess this would interact poorly with downstream libraries like pandas that use floats to represent nullable-int.

@njsmith
Member

njsmith commented Sep 10, 2015

Is using data.size even theoretically justified? I can't quite think of any reason why it's a more natural choice than any other arbitrary number (though I admit my intuition about these algorithms is very limited).


@nayyarv
Contributor Author

nayyarv commented Sep 10, 2015

Haha, inferring data_size from the dtype is what we're unhappy with, so we're looking at a way to get the user to give us input without changing function signatures too much.

The estimator is primarily concerned with the size of the data, to minimise the Mean Integrated Squared Error. In general the optimal bin width scales as n^(-1/3), where n is the data size.

Possible situations

  1. Now if weights represents counts, this means that np.repeat(a, weights) would give us the full data set and hence n should be weights.sum(). This makes sense for integer weights and some float weights (if there's reasoning behind it). See the sketch after this comment.
  2. In a situation where the weights represent a probability/proportion, the calculation of the standard deviation or the IQR (for Scott's and FD's rules respectively) is altered, while the data size should be the number of points, not weights.sum(), which would evaluate to 1.
  3. Finally, if there is no meaning to the weights and they are simply for the summation, the weights should be ignored for both the standard deviation and the data size. This covers something exotic like complex weights, or matched pairs, e.g. (Age, Wealth), that you want to sum up per group.

We could potentially turn bins into a dict (or a list or tuple) for automatic methods, e.g.
{method='auto', weights=['sum', 'size', 'ignore']}, and for bins='auto' we either ignore weights by default or try and guess (max(weights.sum(), a.size) for data_size) for best average-case behaviour.
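
To make case 1 above concrete, a small sketch showing that integer count weights are equivalent to repeating the data:

```python
import numpy as np

a = np.array([1.0, 2.0, 5.0])
counts = np.array([3, 1, 2])

expanded = np.repeat(a, counts)   # the full data set the weights stand for
n = counts.sum()                  # effective sample size, not a.size

assert expanded.size == n == 6
# so a size-based rule such as binwidth ~ n**(-1/3) should be fed
# counts.sum(), not a.size, when the weights are count-like
```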

@seberg
Member

seberg commented Sep 10, 2015 via email

@seberg
Member

seberg commented Sep 10, 2015

I would like to throw in one further option, which may or may not be temporary:

Ignore the problem ;), maybe by exposing the bin number functions. Then just throw an error and tell the user to use those functions manually. If weights are not very common for plotting (I have no clue), we could do that reasonably.

@tacaswell
Contributor

I like @seberg's last suggestion (ignore/raise) the best.


@shoyer
Member

shoyer commented Sep 10, 2015

So just raising an error if weights are set? That seems reasonable to me, better than designing an awkward API for niche uses.


@njsmith
Member

njsmith commented Sep 10, 2015

It would be nice to support count weights + auto binning at some point, but we can always do that as an enhancement later once we've figured out how to actually make it work right :-). And until then, raising an error is a safe way to keep our options open, so +1.


@nayyarv
Contributor Author

nayyarv commented Sep 11, 2015

Ok, in that case, I'll edit the PR to support the range keyword only and throw an error when weights is not None. R doesn't provide an option for weights with its hist method (which is primarily visual), and I'm not sure how many situations with 'auto' bin estimation will have weights too. I suppose we'll find out after 1.11 is released.

With regards to exposing the estimators, would it find a better home in scipy.stats instead?

@seberg
Member

seberg commented Sep 11, 2015

It is true that exposing them would fit better into scipy, and I am not a fan of exposing them in top-level numpy. But scipy is not a dependency, so that would mean code duplication.
So, I know I suggested that we may expose it, but frankly I have no clue what it would look like :(.

@nayyarv
Contributor Author

nayyarv commented Sep 22, 2015

Removed support for weights; a TypeError is now raised when weights is not None. Support for range still remains, and the test is included too.

@@ -91,6 +101,17 @@ def _hist_optim_numbins_estimator(a, estimator):
    if a.size == 0:
        return 1

    if data_weights is not None:
        raise TypeError("Automated estimation of the number of "
Contributor Author

Maybe a ValueError? What's the best error to raise here?

Contributor

How about a RuntimeWarning? Something along the lines of "the FD estimator currently ignores weighted data". Might be better than crashing.
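
A sketch of the warn-and-ignore alternative suggested here, as opposed to raising (the helper name is hypothetical):

```python
import warnings

def _check_estimator_weights(weights):
    # Hypothetical helper: warn and fall back to unweighted estimation
    # instead of raising, so the histogram call still succeeds.
    if weights is not None:
        warnings.warn("automated bin estimation ignores weighted data",
                      RuntimeWarning)
```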

rgommers and others added 2 commits February 13, 2016 15:22
DOC: typo: change 'pubic' to 'public'.
ENH: Added 'doane' and 'sqrt' estimators to np.histogram in numpy.function_base
@homu
Contributor

homu commented Feb 13, 2016

☔ The latest upstream changes (presumably #7090) made this pull request unmergeable. Please resolve the merge conflicts.

@seberg
Member

seberg commented Feb 13, 2016

Wait, did we never put in some kind of warning/error about using this with weights? Sounds like this should still be done!? @madphysicist, maybe you want to have a bit of a look at this, or @nayyarv has time to finish this off?

@madphysicist
Contributor

I skimmed this and the related (closed) PR. I will look into it in more detail as soon as I get a chance. The first thing that comes to mind is to add a weightedfunctions module similar to nanfunctions. However, even at my most optimistic, I have to admit this is probably not a reasonable solution.

MAINT: update doc/sphinxext to numpydoc 0.6.0, and fix up some docstrings
@seberg
Member

seberg commented Feb 13, 2016

Actually, I meant mostly whether we are doing something reasonable with the automatic bins right now. Error/warning/ignore, I don't know, but just to make sure it makes sense and is documented. We could try to make it correct, but I expect it is not easy.

@njsmith
Member

njsmith commented Feb 13, 2016

@seberg: Yeah, I think you're right -- it looks like in master and 1.11b3, the new bins="auto" feature is kinda broken: np.histogram will happily accept bins="auto", weights=..., and then return nonsense results. Similarly for bins="auto", range=.... This PR looks like a good solution to me, but has conflicts due to other cleanups to this code...

@nayyarv @madphysicist: Any chance one of you could clean this up quickly so we can get it into 1.11?

njsmith added this to the 1.11.0 release milestone Feb 13, 2016
seberg and others added 2 commits February 13, 2016 22:26
@madphysicist
Contributor

Cancel my previous comment. I took a look at the code more carefully. I can clean this up right now actually. Can I just rebase it onto #7199?

# Conflicts:
#	numpy/lib/function_base.py
#	numpy/lib/tests/test_function_base.py
@madphysicist
Contributor

I have rebased the autobins branch onto the latest master with all that that entails. I did add a couple of grammatical and line-wrapping fixes. The result is PR #7243. I was unable to make a PR back to @nayyarv's fork of numpy.

@nayyarv, FWIW, I like the version in the autobinsWeighted branch better, even if this version is much simpler. Short-term, this is the best solution (although I would prefer to warn rather than raise an error). Medium-term, the autobinsWeighted solution is probably better. Long-term, I think adding a weightedfunctions module to do percentile, mean, median, etc. on weighted data would be optimal.
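
For context, one common way such a module could implement a weighted percentile (one of several possible definitions; a sketch only, not code from any branch here):

```python
import numpy as np

def weighted_percentile(x, weights, q):
    # Percentile of the distribution in which x[i] carries mass weights[i];
    # q is in [0, 100]. Interpolates along the weighted empirical CDF.
    order = np.argsort(x)
    xs = np.asarray(x, dtype=float)[order]
    w = np.asarray(weights, dtype=float)[order]
    cdf = np.cumsum(w) / w.sum()
    return np.interp(q / 100.0, cdf, xs)
```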

@madphysicist
Contributor

Looks like the other PR kicked off a rebuild on this one?

@nayyarv
Contributor Author

nayyarv commented Feb 14, 2016

Nup, this was me trying to fix it, but looks like the rebase went a bit crazy.

@madphysicist
Contributor

Looks like you may have rebased master onto your branch instead of the other way around?

@madphysicist
Contributor

If you are OK with my PR, we can just run with that. I did not change any of the functionality, just some wording and whitespace stuff.

@nayyarv
Contributor Author

nayyarv commented Feb 14, 2016

yep, that's exactly what happened. Let's jump over to your PR instead, I'm shutting this down.

@nayyarv
Contributor Author

nayyarv commented Feb 14, 2016

Jump over to #7243, this pull request has gotten out of hand with my failure to git properly.

nayyarv closed this Feb 14, 2016
@charris
Member

charris commented Feb 14, 2016

You can recover from a failed rebase; git reflog is your friend. See http://stackoverflow.com/questions/134882/undoing-a-git-rebase.

@nayyarv
Contributor Author

nayyarv commented Feb 14, 2016

Thanks @charris, I did manage to fix the rebase, but since I closed it, GitHub won't let me reopen this pull request (since it's been recreated). As such, since madphysicist achieved the exact same thing in his PR, I'm happy to let his go through instead.

charris removed this from the 1.11.0 release milestone Feb 19, 2016