Merged
5 changes: 5 additions & 0 deletions doc/release/1.16.0-notes.rst
@@ -19,6 +19,11 @@ Highlights
New functions
=============

* New functions in the `numpy.lib.recfunctions` module to ease the structured
assignment changes: `assign_fields_by_name`, `structured_to_unstructured`,
`unstructured_to_structured`, `apply_along_fields`, and `require_fields`.
  See the user guide at https://docs.scipy.org/doc/numpy/user/basics.rec.html
  for more info.

Deprecations
============
9 changes: 9 additions & 0 deletions numpy/doc/structured_arrays.py
@@ -443,6 +443,15 @@
>>> repack_fields(a[['a','c']]).view('i8') # supported 1.15 and 1.16
array([0, 0, 0])

The :mod:`numpy.lib.recfunctions` module has other new functions
introduced in numpy 1.16 to help users account for this change. These are
:func:`numpy.lib.recfunctions.structured_to_unstructured`,
:func:`numpy.lib.recfunctions.unstructured_to_structured`,
:func:`numpy.lib.recfunctions.apply_along_fields`,
:func:`numpy.lib.recfunctions.assign_fields_by_name`, and
:func:`numpy.lib.recfunctions.require_fields`.


Assigning to an array with a multi-field index will behave the same in Numpy
1.15 and Numpy 1.16. In both versions the assignment will modify the original
array::
326 changes: 325 additions & 1 deletion numpy/lib/recfunctions.py
@@ -102,7 +102,7 @@ def get_fieldspec(dtype):
    fields = ((name, dtype.fields[name]) for name in dtype.names)
    # keep any titles, if present
    return [
        (name if len(f) == 2 else (f[2], name), f[0])
        for name, f in fields
    ]

@@ -870,6 +870,330 @@ def repack_fields(a, align=False, recurse=False):
    dt = np.dtype(fieldinfo, align=align)
    return np.dtype((a.type, dt))

def _get_fields_and_offsets(dt, offset=0):
    """
    Returns a flat list of (name, dtype, count, offset) tuples of all the
    scalar fields in the dtype "dt", including nested fields, in left
    to right order.
    """
    fields = []
    for name in dt.names:
        field = dt.fields[name]
        if field[0].names is None:
            count = 1
            for size in field[0].shape:
                count *= size
            fields.append((name, field[0], count, field[1] + offset))
        else:
            fields.extend(_get_fields_and_offsets(field[0], field[1] + offset))
    return fields
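To make the traversal concrete, here is a standalone sketch of the same flattening logic applied to a nested dtype (re-implemented locally for illustration, so it does not depend on the private helper; the dtype is a made-up example):

```python
import numpy as np

def fields_and_offsets(dt, offset=0):
    # local re-implementation of the helper above, for illustration only
    fields = []
    for name in dt.names:
        field = dt.fields[name]
        if field[0].names is None:
            count = 1
            for size in field[0].shape:
                count *= size
            fields.append((name, field[0], count, field[1] + offset))
        else:
            fields.extend(fields_and_offsets(field[0], field[1] + offset))
    return fields

dt = np.dtype([('a', 'i4'), ('b', [('f0', 'f4'), ('f1', 'u2')]), ('c', 'f4', (2,))])
flat = fields_and_offsets(dt)
# nested 'b' is flattened into its scalar fields 'f0' and 'f1';
# the subarray field 'c' appears once with count 2
```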


def _structured_to_unstructured_dispatcher(arr, dtype=None, copy=None,
                                           casting=None):
    return (arr,)


@array_function_dispatch(_structured_to_unstructured_dispatcher)
def structured_to_unstructured(arr, dtype=None, copy=False, casting='unsafe'):
Member: I think there are places we learnt that unsafe was a bad default, but ended up stuck with it, leaving users surprised by the conversion. Should we apply that learning here, and pick a more conservative default?

Member Author: I may have missed that. Maybe in #8733? But that was about assignment using unsafe casting, with no option to specify otherwise, unlike here where there is a keyword the user can specify.

Member: Yeah, that issue was one of the ones I was thinking of; thanks for linking it, I was looking for that for unrelated reasons too!

Unsafe just feels like an... unsafe default to me. In my opinion, unsafe behavior should be something you ask for, not something you get by default. You're picking between:

  • as is: f(...) vs f(..., casting='safe')
  • proposed: f(..., casting='unsafe') vs f(...)

I'd much rather see the word 'unsafe' to tell me I need to think more carefully about that line of code, rather than having to look for the absence of it. I don't have a good memory of how the other casting modes behave. I'd be inclined to pick same_kind to match the default value of the casting argument for ufuncs.

Member Author (@ahaldane, Nov 26, 2018): Here's an argument for unsafe:

First, it matches the default for the same keyword in astype, so it's easier for the user to remember if they are used to using astype.

Second, it seems like most of the time the user wants unsafe, because there are many common casts that are ruled out otherwise. For instance casts from f8 to i8 are disallowed with same_kind, but I expect this is a very common cast.

Actually, for reasons I don't understand, ufuncs seem to allow casts from f8 to i8 even though they supposedly use same_kind:

>>> np.arange(3, dtype='f8').astype('i8', casting='same_kind')
TypeError: Cannot cast array from dtype('float64') to dtype('int64') according to the rule 'same_kind'
>>> np.add(np.arange(3, dtype='f8'), np.arange(3, dtype='i8'))
array([0., 2., 4.])
>>> np.can_cast('f8', 'i8', casting='same_kind')
False

So ufuncs appear to use unsafe casting despite the keyword default??
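The difference the reviewers are debating can be seen directly once the function exists. A minimal sketch (assuming numpy >= 1.16, where these functions live in `numpy.lib.recfunctions`; the array `b` is a made-up example):

```python
import numpy as np
from numpy.lib.recfunctions import structured_to_unstructured

b = np.zeros(2, dtype=[('x', 'f8'), ('y', 'i4')])

# default casting='unsafe': the lossy f8 -> i4 cast is silently allowed
out_unsafe = structured_to_unstructured(b, dtype='i4')

# casting='same_kind' refuses the same cast with a TypeError
try:
    structured_to_unstructured(b, dtype='i4', casting='same_kind')
    refused = False
except TypeError:
    refused = True
```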

"""
Converts and n-D structured array into an (n+1)-D unstructured array.

The new array will have a new last dimension equal in size to the
number of field-elements of the input array. If not supplied, the output
datatype is determined from the numpy type promotion rules applied to all
the field datatypes.

Nested fields, as well as each element of any subarray fields, all count
as a single field-elements.
Member (@eric-wieser, Nov 25, 2018): What happens if dtype is itself a structured array? Eg, consider:

point = np.dtype([('x', int), ('y', int)])
triangle = np.dtype([('p_a', point), ('p_b', point), ('p_c', point)])

I'd expect to be able to do

arr = np.zeros(10, triangle)
structured_to_unstructured(arr, dtype=point)

Member Author: This is accounted for, however your particular example shows there is a bug in this code because it can't account for repeated field names in the nested structures. Will fix.

Member Author (@ahaldane, Nov 25, 2018): On second examination, I also missed that your output was structured. structured_to_unstructured doesn't account for that the way you expected, and I'm not sure there is a good "rule" for how it should work. Your example makes sense because each field can be unambiguously cast to the new structured type, so you expect the output shape to be (10, 3) with point dtype. My implementation currently casts all nested fields to the new dtype, resulting in a (10, 6) array of points: it casts the x and y of each point to a point individually.

Member: Generating "field_{}".format(i) as the name for each field is probably the safest bet - if you control all the names, you don't need to worry about escape sequences.

Member Author: Yup, that's already fixed/implemented in #12446.

I'd rather not attempt to account for structured dtypes in the output though: that's not a pre-existing use-case we're trying to fix, and the best behavior to implement is unclear to me at the moment. Any users who previously did something like that can still do it using repack_fields instead of structured_to_unstructured, though without the added safety the latter has added.

Member:

> That's not a pre-existing use-case we're trying to fix,

As I understand it, the purpose of structured_to_unstructured is to replace the arr[fields].view(dt) idiom. But it sounds like it doesn't work as a universal replacement:

  • arr[['f_a', 'f_b']].view(float) → structured_to_unstructured(arr[['f_a', 'f_c']], float)
  • arr[['p_a', 'p_b']].view(point) → ???

Is there a way to spell the second case with these functions?

Member:

> though without the added safety the latter has added.

Can you give an example of that safety, maybe even in the docs?

Member Author: Here's my understanding:

All code of the form arr[['field1', 'field2', ...]].view(dt) in 1.15 can be replaced by repack_fields(arr[['field1', 'field2', ...]]).view(dt) in 1.16, without exceptions. It should be identical performance, since in both cases a copy is made.

Additionally, we have implemented a new function structured_to_unstructured. Although this can't be used as a replacement in all cases, as you point out, in the many cases where it can be used it is better because it avoids a copy for multifield indexes, it is safer since it is memory-layout-agnostic, and it better documents the user's intent. It is safer because it saves the user from bugs when doing the view: if the user tries to do the view themselves it is quite easy to forget to account for padding bytes, endianness, dtype, or misunderstand the memory layout, but structured_to_unstructured is written in a way to guarantee the view is "safe" for any memory layout and dtype (the fields are always viewed with the right offsets). structured_to_unstructured also casts the fields if needed, which the view above does not do.

Member Author: I added a description based on my last comment in #12447

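The two idioms compared in this thread can be put side by side in a short sketch (the array `arr` is hypothetical, chosen only to have padding between the selected fields):

```python
import numpy as np
from numpy.lib.recfunctions import repack_fields, structured_to_unstructured

arr = np.zeros(3, dtype=[('f_a', 'f8'), ('f_b', 'i4'), ('f_c', 'f8')])

# the 1.16 replacement for the 1.15 idiom arr[['f_a', 'f_c']].view('f8'):
# repack_fields makes a copy with the padding between fields removed
packed = repack_fields(arr[['f_a', 'f_c']])

# preferred where it applies: layout-agnostic, casts fields as needed
u = structured_to_unstructured(arr[['f_a', 'f_c']])
```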

    Parameters
    ----------
    arr : ndarray
        Structured array or dtype to convert. Cannot contain object datatype.
    dtype : dtype, optional
        The dtype of the output unstructured array.
    copy : bool, optional
        See copy argument to `ndarray.astype`. If true, always return a copy.
        If false, and `dtype` requirements are satisfied, a view is returned.
    casting : {'no', 'equiv', 'safe', 'same_kind', 'unsafe'}, optional
        See casting argument of `ndarray.astype`. Controls what kind of data
        casting may occur.

    Returns
    -------
    unstructured : ndarray
        Unstructured array with one more dimension.

    Examples
    --------

    >>> a = np.zeros(4, dtype=[('a', 'i4'), ('b', 'f4,u2'), ('c', 'f4', 2)])
    >>> a
    array([(0, (0., 0), [0., 0.]), (0, (0., 0), [0., 0.]),
           (0, (0., 0), [0., 0.]), (0, (0., 0), [0., 0.])],
          dtype=[('a', '<i4'), ('b', [('f0', '<f4'), ('f1', '<u2')]), ('c', '<f4', (2,))])
    >>> structured_to_unstructured(a)
    array([[0., 0., 0., 0., 0.],
           [0., 0., 0., 0., 0.],
           [0., 0., 0., 0., 0.],
           [0., 0., 0., 0., 0.]])

    >>> b = np.array([(1, 2, 5), (4, 5, 7), (7, 8, 11), (10, 11, 12)],
    ...              dtype=[('x', 'i4'), ('y', 'f4'), ('z', 'f8')])
    >>> np.mean(structured_to_unstructured(b[['x', 'z']]), axis=-1)
    array([ 3. ,  5.5,  9. , 11. ])

    """
    if arr.dtype.names is None:
        raise ValueError('arr must be a structured array')

    fields = _get_fields_and_offsets(arr.dtype)
    names, dts, counts, offsets = zip(*fields)
    n_fields = len(names)

    if dtype is None:
        out_dtype = np.result_type(*[dt.base for dt in dts])
    else:
        out_dtype = dtype

    # Use a series of views and casts to convert to an unstructured array:

    # first view using flattened fields (doesn't work for object arrays)
    # Note: dts may include a shape for subarrays
    flattened_fields = np.dtype({'names': names,
                                 'formats': dts,
                                 'offsets': offsets,
                                 'itemsize': arr.dtype.itemsize})
    arr = arr.view(flattened_fields)

    # next cast to a packed format with all fields converted to new dtype
    packed_fields = np.dtype({'names': names,
                              'formats': [(out_dtype, c) for c in counts]})
    arr = arr.astype(packed_fields, copy=copy, casting=casting)

    # finally it is safe to view the packed fields as the unstructured type
    return arr.view((out_dtype, sum(counts)))
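A quick check of the field counting and type promotion this implementation performs, on a made-up nested dtype: each scalar element of a nested or subarray field becomes one output column, and the output dtype follows numpy's promotion rules (i4 and f4 promote to f8):

```python
import numpy as np
from numpy.lib.recfunctions import structured_to_unstructured

# nested and subarray fields each contribute their scalar elements
a = np.zeros(2, dtype=[('a', 'i4'),
                       ('b', [('f0', 'f4'), ('f1', 'u2')]),
                       ('c', 'f4', 2)])
u = structured_to_unstructured(a)
# 1 (a) + 2 (b.f0, b.f1) + 2 (c) = 5 columns
```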

def _unstructured_to_structured_dispatcher(arr, dtype=None, names=None,
                                           align=None, copy=None,
                                           casting=None):
    return (arr,)


@array_function_dispatch(_unstructured_to_structured_dispatcher)
def unstructured_to_structured(arr, dtype=None, names=None, align=False,
                               copy=False, casting='unsafe'):
"""
Converts and n-D unstructured array into an (n-1)-D structured array.

The last dimension of the input array is converted into a structure, with
number of field-elements equal to the size of the last dimension of the
input array. By default all output fields have the input array's dtype, but
an output structured dtype with an equal number of fields-elements can be
supplied instead.

Nested fields, as well as each element of any subarray fields, all count
towards the number of field-elements.

Parameters
----------
arr : ndarray
Unstructured array or dtype to convert.
dtype : dtype, optional
The structured dtype of the output array
names : list of strings, optional
If dtype is not supplied, this specifies the field names for the output
dtype, in order. The field dtypes will be the same as the input array.
align : boolean, optional
Whether to create an aligned memory layout.
copy : bool, optional
See copy argument to `ndarray.astype`. If true, always return a copy.
If false, and `dtype` requirements are satisfied, a view is returned.
casting : {'no', 'equiv', 'safe', 'same_kind', 'unsafe'}, optional
See casting argument of `ndarray.astype`. Controls what kind of data
casting may occur.

Returns
-------
structured : ndarray
Structured array with fewer dimensions.

Examples
--------

>>> dt = np.dtype([('a', 'i4'), ('b', 'f4,u2'), ('c', 'f4', 2)])
>>> a = np.arange(20).reshape((4,5))
>>> a
array([[ 0, 1, 2, 3, 4],
[ 5, 6, 7, 8, 9],
[10, 11, 12, 13, 14],
[15, 16, 17, 18, 19]])
>>> unstructured_to_structured(a, dt)
array([( 0, ( 1., 2), [ 3., 4.]), ( 5, ( 6., 7), [ 8., 9.]),
(10, (11., 12), [13., 14.]), (15, (16., 17), [18., 19.])],
dtype=[('a', '<i4'), ('b', [('f0', '<f4'), ('f1', '<u2')]), ('c', '<f4', (2,))])

"""
    if arr.shape == ():
        raise ValueError('arr must have at least one dimension')
    n_elem = arr.shape[-1]

    if dtype is None:
        if names is None:
            names = ['f{}'.format(n) for n in range(n_elem)]
        out_dtype = np.dtype([(n, arr.dtype) for n in names], align=align)
        fields = _get_fields_and_offsets(out_dtype)
        names, dts, counts, offsets = zip(*fields)
    else:
        if names is not None:
            raise ValueError("don't supply both dtype and names")
        # sanity check of the input dtype
        fields = _get_fields_and_offsets(dtype)
        names, dts, counts, offsets = zip(*fields)
        if n_elem != sum(counts):
            raise ValueError('The length of the last dimension of arr must '
                             'be equal to the number of fields in dtype')
        out_dtype = dtype
        if align and not out_dtype.isalignedstruct:
            raise ValueError("align was True but dtype is not aligned")

    # Use a series of views and casts to convert to a structured array:

    # first view as a packed structured array of one dtype
    packed_fields = np.dtype({'names': names,
                              'formats': [(arr.dtype, c) for c in counts]})
    arr = np.ascontiguousarray(arr).view(packed_fields)

    # next cast to an unpacked but flattened format with varied dtypes
    flattened_fields = np.dtype({'names': names,
                                 'formats': dts,
                                 'offsets': offsets,
                                 'itemsize': out_dtype.itemsize})
    arr = arr.astype(flattened_fields, copy=copy, casting=casting)

    # finally view as the final nested dtype and remove the last axis
    return arr.view(out_dtype)[..., 0]
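The two conversions are inverses of each other in the simple homogeneous case, which makes for a useful sanity check (a sketch with a made-up array, using the `names` shortcut instead of a full dtype):

```python
import numpy as np
from numpy.lib.recfunctions import (structured_to_unstructured,
                                    unstructured_to_structured)

u = np.arange(12.0).reshape(4, 3)

# each row of u becomes one struct; field dtypes default to u.dtype (f8)
s = unstructured_to_structured(u, names=['x', 'y', 'z'])

back = structured_to_unstructured(s)   # round trip back to a plain 2-D array
```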

def _apply_along_fields_dispatcher(func, arr):
    return (arr,)


@array_function_dispatch(_apply_along_fields_dispatcher)
def apply_along_fields(func, arr):
    """
    Apply function 'func' as a reduction across fields of a structured array.

    This is similar to `apply_along_axis`, but treats the fields of a
    structured array as an extra axis.
Member: I think that this needs a warning that fields are all cast to the same type.

    Parameters
    ----------
    func : function
        Function to apply on the "field" dimension. This function must
        support an `axis` argument, like np.mean, np.sum, etc.
    arr : ndarray
        Structured array for which to apply func.

    Returns
    -------
    out : ndarray
        Result of the reduction operation

    Examples
    --------

    >>> b = np.array([(1, 2, 5), (4, 5, 7), (7, 8, 11), (10, 11, 12)],
    ...              dtype=[('x', 'i4'), ('y', 'f4'), ('z', 'f8')])
    >>> apply_along_fields(np.mean, b)
    array([ 2.66666667,  5.33333333,  8.66666667, 11.        ])
    >>> apply_along_fields(np.mean, b[['x', 'z']])
    array([ 3. ,  5.5,  9. , 11. ])

"""
if arr.dtype.names is None:
raise ValueError('arr must be a structured array')

uarr = structured_to_unstructured(arr)
return func(uarr, axis=-1)
# works and avoids axis requirement, but very, very slow:
#return np.apply_along_axis(func, -1, uarr)
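Any axis-aware reduction works here, and, as the review comment above points out, the fields are first cast to a common type before reducing. A sketch with a made-up array (i4, f4 and f8 fields promote to f8):

```python
import numpy as np
from numpy.lib.recfunctions import apply_along_fields

b = np.array([(1, 2, 5), (4, 5, 7)],
             dtype=[('x', 'i4'), ('y', 'f4'), ('z', 'f8')])

# fields are cast to their common type before the reduction runs
mins = apply_along_fields(np.min, b)
```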

def _assign_fields_by_name_dispatcher(dst, src, zero_unassigned=None):
    return dst, src


@array_function_dispatch(_assign_fields_by_name_dispatcher)
def assign_fields_by_name(dst, src, zero_unassigned=True):
    """
    Assigns values from one structured array to another by field name.

    Normally in numpy >= 1.14, assignment of one structured array to another
    copies fields "by position", meaning that the first field from the src is
    copied to the first field of the dst, and so on, regardless of field name.

    This function instead copies "by field name", such that fields in the dst
    are assigned from the identically named field in the src. This applies
    recursively for nested structures. This is how structure assignment worked
    in numpy >= 1.6 to <= 1.13.

    Parameters
    ----------
    dst : ndarray
    src : ndarray
        The source and destination arrays during assignment.
    zero_unassigned : bool, optional
        If True, fields in the dst for which there was no matching
        field in the src are filled with the value 0 (zero). This
        was the behavior of numpy <= 1.13. If False, those fields
        are not modified.
    """

    if dst.dtype.names is None:
        dst[:] = src
Member: To work on 0d arrays, this needs to be ...

Member: Good point.
        return

    for name in dst.dtype.names:
        if name not in src.dtype.names:
            if zero_unassigned:
                dst[name] = 0
        else:
            assign_fields_by_name(dst[name], src[name],
                                  zero_unassigned)
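A brief usage sketch, with made-up arrays: fields are matched up by name, and a destination field with no match in the source is zeroed (the default `zero_unassigned=True` behavior):

```python
import numpy as np
from numpy.lib.recfunctions import assign_fields_by_name

dst = np.ones(2, dtype=[('a', 'i4'), ('b', 'f8')])
src = np.array([(10.0,), (20.0,)], dtype=[('b', 'f8')])

assign_fields_by_name(dst, src)
# 'b' is copied by name; 'a' has no match in src and is zeroed
```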

def _require_fields_dispatcher(array, required_dtype):
    return (array,)


@array_function_dispatch(_require_fields_dispatcher)
def require_fields(array, required_dtype):
Member: This name strikes me as a little odd, but I also can't think of a better one. It might be handy to use the word "require" in the description somewhere, to make the name easier to remember.

"""
Casts a structured array to a new dtype using assignment by field-name.

This function assigns to from the old to the new array by name, so the
value of a field in the output array is the value of the field with the
same name in the source array.

If a field name in the required_dtype does not exist in the
input array, that field is set to 0 in the output array.

Parameters
----------
a : ndarray
array to cast
required_dtype : dtype
datatype for output array

Returns
-------
out : ndarray
array with the new dtype, with field values copied from the fields in
the input array with the same name

Examples
--------

>>> a = np.ones(4, dtype=[('a', 'i4'), ('b', 'f8'), ('c', 'u1')])
>>> require_fields(a, [('b', 'f4'), ('c', 'u1')])
"""
out = np.empty(array.shape, dtype=required_dtype)
assign_fields_by_name(out, array)
return out
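Continuing the docstring example as a runnable sketch: selecting a subset of fields by name, with a cast for one of them ('b' goes from f8 to f4):

```python
import numpy as np
from numpy.lib.recfunctions import require_fields

a = np.ones(4, dtype=[('a', 'i4'), ('b', 'f8'), ('c', 'u1')])

# keep 'b' (cast to f4) and 'c'; a required field absent from `a`
# would be zero-filled instead
r = require_fields(a, [('b', 'f4'), ('c', 'u1')])
```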


def _stack_arrays_dispatcher(arrays, defaults=None, usemask=None,
                             asrecarray=None, autoconvert=None):