ENH: Uniform interface for accessing minimum or maximum value of a dtype #27785

Open
carlosgmartin opened this issue Nov 17, 2024 · 7 comments
Labels
62 - Python API Changes or additions to the Python API. Mailing list should usually be notified.

Comments

@carlosgmartin
Contributor

carlosgmartin commented Nov 17, 2024

Proposed new feature or change:

Mailing list post

Feature request: Add a uniform interface for accessing the minimum or maximum value of a given dtype.

Previously discussed here, here, and here. Currently, doing this requires branching on the type of dtype (boolean, integer, or floating point) and then (for the latter two) calling either iinfo or finfo, respectively. It would be more ergonomic to have a single, uniform interface for accessing this information that is dtype-independent.

Here are the possible interfaces suggested so far:

import numpy as np
dt = np.int32 # example

# As an attribute of the dtype itself:
dt.min
dt.min_value
dt.info().min
dt.info.min

# As a function in numpy.dtypes:
np.dtypes.info(dt).min
np.dtypes.min(dt)
np.dtypes.min_value(dt)

# As a top-level function:
np.min_dtype_value(dt)
np.dtype_info(dt).min
np.dtype_info(dt).min_value

Personally, my current favorite is simply dt.min.

Relevant comment by @mhvk on the mailing list thread:

It would also seem to make sense to consider how the dtype itself can hold/calculate this type of information, since that will be the only way a generic info() function can get information for a user-defined dtype. Indeed, taking that further, might a method or property on the dtype itself be the cleaner interface?

@mhvk
Contributor

mhvk commented Nov 18, 2024

Thanks for posting! My personal favourite would be dt.info.min (or dt.info().min) to avoid adding many different methods to the dtype that are not always relevant (e.g., .min is not relevant for a string dtype, and ambiguous for a structured one). An .info attribute or method would also make it more obvious that it would be optional.

It also might be simpler for an initial implementation, since it will be easy to wrap the existing code.

P.S. While I think something on the dtype might be cleanest, it is of course possible to think of something like a registration system that would allow np.dtypes.info(...) to work for user-defined dtypes as well.

@seberg
Member

seberg commented Nov 20, 2024

Yes, it would be nice to have this. Have to carefully think about a 1:1 mapping with iinfo/finfo, though (i.e. return an iinfo/finfo instance).

The reason is that min and max, for example, are not clearly defined for floats: they could refer to inf or to the maximum finite value. For finfo I think that is OK, but for a uniform info I feel it may be confusing, because dtype.max would not be a safe value for np.max(arr, initial=arr.dtype.max).
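To make the concern concrete, here is a minimal sketch. It assumes the seed for a max reduction is meant to be the most negative value, and the names `unsafe` and `safe` are illustrative only. It shows that the finite extreme reported by finfo is not an identity for np.max once the data contains infinities.

```python
import numpy as np

fmin = np.finfo(np.float64).min  # most negative *finite* float64
fmax = np.finfo(np.float64).max  # largest *finite* float64

# finfo reports finite extremes; the infinities lie strictly beyond them.
assert np.isfinite(fmin) and np.isfinite(fmax)
assert -np.inf < fmin and fmax < np.inf

arr = np.array([-np.inf, -np.inf])

# Seeding with the finite minimum changes the answer, because fmin > -inf.
unsafe = np.max(arr, initial=fmin)

# Seeding with -inf leaves the true maximum of the data untouched.
safe = np.max(arr, initial=-np.inf)
```

With these inputs, `unsafe` is the finite minimum rather than the true maximum of the data, while `safe` stays at negative infinity.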

@carlosgmartin
Copy link
Contributor Author

@seberg Did you mean initial=arr.dtype.min in your example, which would be the identity for max?

You raise a good point that finfo.max is the largest representable number, rather than the largest number (inf). Thus we should try to avoid confusion.

For precisely the use case you described, I'm interested in extracting the actual min/max value, that is, the identity element for max/min, respectively.
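A minimal sketch of what such an identity lookup could look like with today's API. The helper name `max_identity` is hypothetical and is not part of NumPy; the sketch also assumes that only boolean, integer, and real floating-point dtypes need to be covered.

```python
import numpy as np


def max_identity(dtype):
    """Hypothetical helper: the identity element for a max reduction."""
    dt = np.dtype(dtype)
    if dt.kind == "b":
        # max over booleans is logical OR, whose identity is False.
        return False
    if np.issubdtype(dt, np.integer):
        # Integers have no infinity, so the smallest value is the identity.
        return np.iinfo(dt).min
    if np.issubdtype(dt, np.floating):
        # For floats the identity is -inf, not the finite finfo(dt).min.
        return dt.type(-np.inf)
    raise TypeError(f"no max identity defined for dtype {dt}")


empty = np.array([], dtype=np.int16)
result = np.max(empty, initial=max_identity(empty.dtype))
```

The identity for min would mirror this with True, np.iinfo(dt).max, and +inf.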

rgommers added the "62 - Python API" label on Dec 6, 2024
@rgommers
Member

rgommers commented Dec 6, 2024

Improving user-defined dtypes seems like an important need, as pointed out on the mailing list. That should apply to the current finfo/iinfo API as well as to isdtype & co. This seems like the most important gap we currently have. The existing feature request for that is gh-27231. There is also relevant discussion on registering user-defined dtypes in a way that makes it possible to query their kind (e.g., the bfloat16 from ml_dtypes in gh-24699).

The commonality of the user-defined dtypes topic with this feature request is that it requires stashing metadata on the dtype somewhere (either publicly or privately).

Previously discussed here, here, and here.

I'll note that these are all links to you asking for this same feature. What I am missing is a more compelling use case; it would be great to add one (an initial=arr.dtype.min one-liner isn't a use case). I quickly searched the SciPy code base, which makes heavy use of finfo in particular, to try to find places where adding the new info API would make the code nicer, but failed to find anything (disclaimer: I only spent a few minutes, so I could easily have missed something). For motivating this change, it'd be great to point at some instances of existing code that would be made nicer, or at issues where the lack of this uniform API is problematic somehow. I feel that there is potentially something valuable in this feature request, but it's not clear cut, and we should have a stronger motivation to add new API, especially if it duplicates what we already have.

The other point is that semantically the min/max of floating-point and integer dtypes aren't actually the same (there's wraparound vs. clipping, floating-point warnings, different casting rules, etc.). A few examples:

>>> np.finfo(np.float64).max * 2  # returns a NumPy scalar
<ipython-input-12-b98731d4261a>:1: RuntimeWarning: overflow encountered in scalar multiply
  np.finfo(np.float64).max * 2
np.float64(inf)
>>> np.iinfo(np.int64).max * 2  # returns a Python int
18446744073709551614

>>> fmax = np.finfo(np.float64).max
>>> imax = np.iinfo(np.int64).max
>>> fmax + 1 == fmax
np.True_
>>> imax + 1 == imax
False

Which is probably why it's hard to find code that actually needs this.

Re standalone function vs dtype method/attribute: if you go for .info/.info(), please immediately try to add static typing. My feeling is that it will show that an attribute won't work well, because these things aren't actually uniform across dtype kinds.

The discussion on the mailing list asked for a NEP - that seems justified indeed.

@carlosgmartin
Contributor Author

@rgommers

an initial=arr.dtype.min one-liner isn't a use case

Why not? It seems like a perfectly good use case: One wants the sum/product/max/min of an array that happens to be empty (either by size or masking) to be the identity element for sum/product/max/min, respectively.
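As an illustration of the behaviour being described, the following sketch shows how NumPy currently treats empty and fully masked reductions. The variable names are illustrative only. sum and prod already fall back to their identities, whereas max needs an explicit initial value.

```python
import numpy as np

empty = np.array([], dtype=np.float64)

# sum and prod have built-in identities, so empty input is fine.
empty_sum = np.sum(empty)
empty_prod = np.prod(empty)

# max has no default identity, so reducing an empty array raises.
try:
    np.max(empty)
    max_raised = False
except ValueError:
    max_raised = True

# A fully masked reduction is "empty by masking"; the identity is supplied by hand.
values = np.array([3.0, 7.0, 5.0])
mask = np.zeros(values.shape, dtype=bool)
masked_max = np.max(values, where=mask, initial=-np.inf)
```

Here -np.inf works because the array is floating point; for an integer array the seed would have to come from np.iinfo instead, which is the dtype-dependent branching the request is about.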

@seberg
Member

seberg commented Dec 6, 2024

FWIW, finfo is not very generic and not very convenient. It is overloaded with IEEE information that is usually unneeded and doesn't generalize cleanly.

The important part is that any such information must be well defined, and I think if you do that the typing will be a nice check, but will just fall into place.
That is, if dtype.info.smallest_subnormal is defined, it is equivalent to finfo(dtype).smallest_subnormal. In one case the IntegerDType doesn't define finfo(dtype) -> FloatInfo; in the other case you have dtype.info -> IntegerInfo.
In both cases you need a FloatInfo and an IntegerInfo protocol/ABC hierarchy if you want to type it.
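One way such a hierarchy might be sketched with typing.Protocol is shown below. The names `IntegerInfo`, `FloatInfo`, and `dtype_info` are hypothetical and not part of NumPy, and the attribute sets are an assumption chosen for illustration. The function simply wraps the existing iinfo/finfo objects.

```python
from typing import Protocol, Union

import numpy as np


class IntegerInfo(Protocol):
    """Hypothetical protocol for what an integer info object exposes."""

    @property
    def bits(self) -> int: ...
    @property
    def min(self) -> int: ...
    @property
    def max(self) -> int: ...


class FloatInfo(Protocol):
    """Hypothetical protocol for what a floating-point info object exposes."""

    @property
    def bits(self) -> int: ...
    @property
    def min(self) -> np.floating: ...
    @property
    def max(self) -> np.floating: ...
    @property
    def eps(self) -> np.floating: ...
    @property
    def smallest_subnormal(self) -> np.floating: ...


def dtype_info(dtype) -> Union[IntegerInfo, FloatInfo]:
    """Hypothetical single entry point wrapping np.iinfo and np.finfo."""
    dt = np.dtype(dtype)
    if np.issubdtype(dt, np.integer):
        return np.iinfo(dt)
    if np.issubdtype(dt, np.floating):
        return np.finfo(dt)
    raise TypeError(f"no numeric info defined for dtype {dt}")
```

The union return type is where the non-uniformity shows up: a caller that reads dtype_info(dt).eps only type-checks after narrowing to the floating-point case.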

An interesting thing to keep in mind is also complex numbers, finfo actually has an advantage there, since I am not 100% sure that there are no values where one would want to return something different for complex_dtype.info vs. complex_dtype.as_real().info.

Having a single name entry-point seems more convenient, but of course it means that all the attributes must be very well defined, e.g.:

  • maximum_value as the np.maximum.get_identity() (doesn't exist but could)
  • maximum_finite == np.nextafter(np.inf, 0, dtype=dtype) (already exists)
  • ...
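The nextafter expression in the list above can be checked against finfo directly. In this sketch the helper name `maximum_finite` is borrowed from the list purely for illustration and is not an existing NumPy function.

```python
import numpy as np


def maximum_finite(dtype):
    """Illustrative helper: largest finite value, one step below +inf."""
    return np.nextafter(np.inf, 0.0, dtype=dtype)


# The result agrees with finfo(...).max for each real floating-point dtype.
matches = {
    np.dtype(dt).name: bool(maximum_finite(dt) == np.finfo(dt).max)
    for dt in (np.float16, np.float32, np.float64)
}
```

The mirrored call np.nextafter(-np.inf, 0.0, dtype=dtype) gives the most negative finite value in the same way.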

Besides being very inconvenient and slow, for floats you can infer almost all interesting values (an interesting problem is complex numbers, but I suspect one can deal with it).

The minimum value problem may be one of the few more commonly used things you can't infer right now for non-floats (for floats, inf and nextafter are all you need).

Why not? It seems like a perfectly good use case

It's a good use, but it is helpful to mention the actual code/use-case where you need to do this for a mix of float/non-float for example.

@rgommers
Member

rgommers commented Dec 6, 2024

it is helpful to mention the actual code/use-case where you need to do this for a mix of float/non-float for example.

Yes this☝🏼. What's the context, what are you trying to achieve and why? And why do you need it often enough that the one-liner you've already received is too annoying? And do you find this code pattern, or similar such patterns, in other code:

>>> def min_val(dtype):
...   return (False if dtype == bool else np.iinfo(dtype).min if np.issubdtype(dtype, np.integer)
... else np.finfo(dtype).min)
... 
>>> arr = np.array([])
>>> np.max(arr, initial=min_val(arr.dtype))
np.float64(-1.7976931348623157e+308)

Basically, this doesn't allow anything new yet, it just saves a bit of typing for what looks like a fairly niche use case (niche because normally you'd want either an exception or an inf/nan type value for data analysis purposes, and mask it out for plotting or summary statistics). The bit of typing saved also typically doesn't clear the bar for adding new API into NumPy. So the request here is to help us out by making a better case with more detail/context. That is anyway needed for the motivation section of a NEP.


4 participants