
Interpolation always returns floats #4770


Open · Illviljan opened this issue Jan 6, 2021 · 3 comments

@Illviljan (Contributor)

What happened:
When interpolating datasets, integer arrays are forced to floats.

What you expected to happen:
To retain the same dtype after interpolation.

Minimal Complete Verifiable Example:

import numpy as np
import dask.array as da
import xarray as xr
a = np.arange(0, 2)
b = np.core.defchararray.add("long_variable_name", a.astype(str))
coords = dict(time=da.array([0, 1]))
data_vars = dict()
for v in b:
    data_vars[v] = xr.DataArray(
        name=v,
        data=da.array([0, 1], dtype=int),
        dims=["time"],
        coords=coords,
    )
ds1 = xr.Dataset(data_vars)

print(ds1)
Out[35]: 
<xarray.Dataset>
Dimensions:              (time: 2)
Coordinates:
  * time                 (time) int32 0 1
Data variables:
    long_variable_name0  (time) int32 dask.array<chunksize=(2,), meta=np.ndarray>
    long_variable_name1  (time) int32 dask.array<chunksize=(2,), meta=np.ndarray>

# Interpolate:
ds1 = ds1.interp(
    time=da.array([0, 0.5, 1, 2]),
    assume_sorted=True,
    method="linear",
    kwargs=dict(fill_value="extrapolate"),
)

# dask array thinks it's an integer array:
print(ds1.long_variable_name0)
Out[55]: 
<xarray.DataArray 'long_variable_name0' (time: 4)>
dask.array<dask_aware_interpnd, shape=(4,), dtype=int32, chunksize=(4,), chunktype=numpy.ndarray>
Coordinates:
  * time     (time) float64 0.0 0.5 1.0 2.0

# But once computed it turns out to be a float:
print(ds1.long_variable_name0.compute())
Out[38]: 
<xarray.DataArray 'long_variable_name0' (time: 4)>
array([0. , 0.5, 1. , 2. ])
Coordinates:
  * time     (time) float64 0.0 0.5 1.0 2.0
 

Anything else we need to know?:
An easy first step is to also force np.float_ in the da.blockwise call in missing.interp_func.

The more difficult path is to somehow cast the DataArrays back to their old dtype without hurting performance. In a quick test, simply adding .astype() to the value returned by missing.interp doubled the calculation time; a user-side sketch of that idea is below.
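
A minimal user-side sketch of that workaround (interp_keep_dtype is a hypothetical helper, not xarray API; the trailing astype is the extra pass over the data that doubled the runtime):

import numpy as np
import xarray as xr

def interp_keep_dtype(ds, **interp_kwargs):
    # Remember each data variable's dtype before interpolating.
    original_dtypes = {name: var.dtype for name, var in ds.data_vars.items()}
    out = ds.interp(**interp_kwargs)
    # Cast integer variables back afterwards. Note the cast truncates
    # non-integral results, i.e. this is the "unsafe" interpolation the
    # user would have to opt into.
    for name, dtype in original_dtypes.items():
        if np.issubdtype(dtype, np.integer):
            out[name] = out[name].astype(dtype)
    return out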

I was thinking the conversion to floats in scipy could be avoided altogether by adding a (non-)public option to ignore any dtype checks and just let the user handle the "unsafe" interpolations.
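
For reference, the upcast happens inside scipy itself; a quick demonstration with interp1d:

import numpy as np
from scipy.interpolate import interp1d

f = interp1d(np.arange(4), np.arange(4, dtype=np.int32))
# interp1d casts integer y to float64 internally, so the result is float
# even at the exact sample points:
print(f(np.array([1.0, 1.5])).dtype)  # float64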

Related:
scipy/scipy#11093

Environment:

Output of xr.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.8.5 (default, Sep 3 2020, 21:29:08) [MSC v.1916 64 bit (AMD64)]
python-bits: 64
OS: Windows
libhdf5: 1.10.4
libnetcdf: None

xarray: 0.16.2
pandas: 1.1.5
numpy: 1.17.5
scipy: 1.4.1
netCDF4: None
pydap: None
h5netcdf: None
h5py: 2.10.0
Nio: None
zarr: None
cftime: None
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: 1.3.2
dask: 2020.12.0
distributed: 2020.12.0
matplotlib: 3.3.2
cartopy: None
seaborn: 0.11.1
numbagg: None
pint: None
setuptools: 51.0.0.post20201207
pip: 20.3.3
conda: 4.9.2
pytest: 6.2.1
IPython: 7.19.0
sphinx: 3.4.0

@mathause (Collaborator)

#4771 forces the dtype to np.float_ for consistency. Leaving this open for the bigger issue: keeping the dtype (if possible).

@dcherian (Contributor)

IMO if someone wants this they should just use "nearest". It's nice to have consistency and predictability in return types (including dtypes). Shall we close?
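
A minimal sketch of that suggestion, with the cast back to int left to the user (values from "nearest" are exact originals, so the cast is lossless):

import numpy as np
import xarray as xr

ds = xr.Dataset(
    {"v": ("time", np.array([0, 1], dtype=np.int32))},
    coords={"time": [0, 1]},
)
out = ds.interp(time=[0, 0.5, 1], method="nearest")
# Per #4771 the result still comes back as float, but every value is one
# of the original integers, so casting back loses no information:
out["v"] = out["v"].astype(np.int32)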

@Illviljan (Contributor, Author)

Using nearest works if all variables are integers.
But once a dataset mixes integers/indexes (nearest), floats (linear), and strings (nearest), it becomes tricky to decide how to interpolate the dataset as a whole.

import numpy as np
import xarray as xr

# 100x more variables in actual dataset:
ds = xr.Dataset(
    {
        "numeric": ("time", 1.0 + np.arange(0, 4, 1)),
        "non_numeric_integer": ("time", np.array([1, 2, 3, 4], dtype=np.int8)),
        "non_numeric": ("time", np.array(["a", "b", "c", "d"])),
    },
    coords={"time": (np.arange(0, 4, 1))},
)
actual = ds.interp(time=np.linspace(0, 3, 7))

expected = xr.Dataset(
    {
        "numeric": ("time", 1 + np.linspace(0, 3, 7)),
        "non_numeric_integer": ("time", np.array([1, 2, 2, 3, 3, 4, 4], dtype=np.int8)),
        "non_numeric": ("time", np.array(["a", "b", "b", "c", "c", "d", "d"])),
    },
    coords={"time": np.linspace(0, 3, 7)},
)
xr.testing.assert_identical(actual, expected)

We could make it optional which data types are considered non_numeric; we already try to do something similar internally, so it wouldn't be a large change (see the sketch after the excerpt below):

xarray/xarray/core/dataset.py

Lines 4198 to 4205 in 755581c

dtype_kind = var.dtype.kind
if dtype_kind in "uifc":
    # For normal number types do the interpolation:
    var_indexers = {k: v for k, v in use_indexers.items() if k in var.dims}
    variables[name] = missing.interp(var, var_indexers, method, **kwargs)
elif dtype_kind in "ObU" and (use_indexers.keys() & var.dims):
    # For types that we do not understand do stepwise
    # interpolation to avoid modifying the elements.
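
A hypothetical sketch of what that option could look like, reusing the dtype-kind letters from the excerpt above (non_numeric_dtype_kinds is an invented keyword for illustration, not existing xarray API):

actual = ds.interp(
    time=np.linspace(0, 3, 7),
    method="linear",
    # Invented keyword: dtype kinds listed here would take the stepwise
    # branch above instead of missing.interp, so the int8 and string
    # variables would keep their dtypes.
    non_numeric_dtype_kinds="iubOU",
)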
