
Interpolation always returns floats #4770


Open · Illviljan opened this issue Jan 6, 2021 · 3 comments

@Illviljan (Contributor)

What happened:
When interpolating datasets, integer arrays are forced to floats.

What you expected to happen:
To retain the same dtype after interpolation.

Minimal Complete Verifiable Example:

import numpy as np
import dask.array as da
import xarray as xr
a = np.arange(0, 2)
b = np.core.defchararray.add("long_variable_name", a.astype(str))
coords = dict(time=da.array([0, 1]))
data_vars = dict()
for v in b:
    data_vars[v] = xr.DataArray(
        name=v,
        data=da.array([0, 1], dtype=int),
        dims=["time"],
        coords=coords,
    )
ds1 = xr.Dataset(data_vars)

print(ds1)
Out[35]: 
<xarray.Dataset>
Dimensions:              (time: 2)
Coordinates:
  * time                 (time) int32 0 1
Data variables:
    long_variable_name0  (time) int32 dask.array<chunksize=(2,), meta=np.ndarray>
    long_variable_name1  (time) int32 dask.array<chunksize=(2,), meta=np.ndarray>

# Interpolate:
ds1 = ds1.interp(
    time=da.array([0, 0.5, 1, 2]),
    assume_sorted=True,
    method="linear",
    kwargs=dict(fill_value="extrapolate"),
)

# dask array thinks it's an integer array:
print(ds1.long_variable_name0)
Out[55]: 
<xarray.DataArray 'long_variable_name0' (time: 4)>
dask.array<dask_aware_interpnd, shape=(4,), dtype=int32, chunksize=(4,), chunktype=numpy.ndarray>
Coordinates:
  * time     (time) float64 0.0 0.5 1.0 2.0

# But once computed it turns out to be a float:
print(ds1.long_variable_name0.compute())
Out[38]: 
<xarray.DataArray 'long_variable_name0' (time: 4)>
array([0. , 0.5, 1. , 2. ])
Coordinates:
  * time     (time) float64 0.0 0.5 1.0 2.0
 

Anything else we need to know?:
An easy first step is to also force np.float_ in the da.blockwise call in missing.interp_func.

The more difficult path is to somehow cast the DataArrays back to their old dtype without hurting performance. In a quick test, simply adding .astype() to the value returned by missing.interp doubled the calculation time; a user-side sketch of that idea is below.
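
A minimal user-side sketch of that workaround (interp_keep_dtype is a hypothetical helper, not xarray API; the trailing astype is the extra pass over the data that doubled the runtime):

import numpy as np
import xarray as xr

def interp_keep_dtype(ds, **interp_kwargs):
    # Remember each data variable's dtype before interpolating.
    original_dtypes = {name: var.dtype for name, var in ds.data_vars.items()}
    out = ds.interp(**interp_kwargs)
    # Cast integer variables back afterwards. Note the cast truncates
    # non-integral results, i.e. this is the "unsafe" interpolation the
    # user would have to opt into.
    for name, dtype in original_dtypes.items():
        if np.issubdtype(dtype, np.integer):
            out[name] = out[name].astype(dtype)
    return out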

I was thinking the conversion to floats in scipy could be avoided altogether by adding a (non-)public option to ignore any dtype checks and just let the user handle the "unsafe" interpolations.
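
For reference, the upcast happens inside scipy itself; a quick demonstration with interp1d:

import numpy as np
from scipy.interpolate import interp1d

f = interp1d(np.arange(4), np.arange(4, dtype=np.int32))
# interp1d casts integer y to float64 internally, so the result is float
# even at the exact sample points:
print(f(np.array([1.0, 1.5])).dtype)  # float64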

Related:
scipy/scipy#11093

Environment:

Output of xr.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.8.5 (default, Sep 3 2020, 21:29:08) [MSC v.1916 64 bit (AMD64)]
python-bits: 64
OS: Windows
libhdf5: 1.10.4
libnetcdf: None

xarray: 0.16.2
pandas: 1.1.5
numpy: 1.17.5
scipy: 1.4.1
netCDF4: None
pydap: None
h5netcdf: None
h5py: 2.10.0
Nio: None
zarr: None
cftime: None
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: 1.3.2
dask: 2020.12.0
distributed: 2020.12.0
matplotlib: 3.3.2
cartopy: None
seaborn: 0.11.1
numbagg: None
pint: None
setuptools: 51.0.0.post20201207
pip: 20.3.3
conda: 4.9.2
pytest: 6.2.1
IPython: 7.19.0
sphinx: 3.4.0

@mathause (Collaborator)

#4771 forces the dtype to np.float_ for consistency. Leaving this open for the bigger issue: keeping the dtype (if possible).

@dcherian (Contributor)

IMO if someone wants this they should just use "nearest". It's nice to have consistency and predictability in return types (including dtypes). Shall we close?
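
A minimal sketch of that suggestion, with the cast back to int left to the user (values from "nearest" are exact originals, so the cast is lossless):

import numpy as np
import xarray as xr

ds = xr.Dataset(
    {"v": ("time", np.array([0, 1], dtype=np.int32))},
    coords={"time": [0, 1]},
)
out = ds.interp(time=[0, 0.5, 1], method="nearest")
# Per #4771 the result still comes back as float, but every value is one
# of the original integers, so casting back loses no information:
out["v"] = out["v"].astype(np.int32)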

@Illviljan (Contributor, Author)

Using nearest works if all variables are integers.
But once a dataset mixes integers/indexes (nearest), floats (linear), and strings (nearest), it becomes tricky to decide how to interpolate the dataset as a whole.

import numpy as np
import xarray as xr

# 100x more variables in actual dataset:
ds = xr.Dataset(
    {
        "numeric": ("time", 1.0 + np.arange(0, 4, 1)),
        "non_numeric_integer": ("time", np.array([1, 2, 3, 4], dtype=np.int8)),
        "non_numeric": ("time", np.array(["a", "b", "c", "d"])),
    },
    coords={"time": (np.arange(0, 4, 1))},
)
actual = ds.interp(time=np.linspace(0, 3, 7))

expected = xr.Dataset(
    {
        "numeric": ("time", 1 + np.linspace(0, 3, 7)),
        "non_numeric_integer": ("time", np.array([1, 2, 2, 3, 3, 4, 4], dtype=np.int8)),
        "non_numeric": ("time", np.array(["a", "b", "b", "c", "c", "d", "d"])),
    },
    coords={"time": np.linspace(0, 3, 7)},
)
xr.testing.assert_identical(actual, expected)

We could make it optional which data types are considered non_numeric; we already try to do something similar internally, so it wouldn't be a large change (see the sketch after the excerpt below):

xarray/xarray/core/dataset.py

Lines 4198 to 4205 in 755581c

dtype_kind = var.dtype.kind
if dtype_kind in "uifc":
    # For normal number types do the interpolation:
    var_indexers = {k: v for k, v in use_indexers.items() if k in var.dims}
    variables[name] = missing.interp(var, var_indexers, method, **kwargs)
elif dtype_kind in "ObU" and (use_indexers.keys() & var.dims):
    # For types that we do not understand do stepwise
    # interpolation to avoid modifying the elements.
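
A hypothetical sketch of what that option could look like, reusing the dtype-kind letters from the excerpt above (non_numeric_dtype_kinds is an invented keyword for illustration, not existing xarray API):

actual = ds.interp(
    time=np.linspace(0, 3, 7),
    method="linear",
    # Invented keyword: dtype kinds listed here would take the stepwise
    # branch above instead of missing.interp, so the int8 and string
    # variables would keep their dtypes.
    non_numeric_dtype_kinds="iubOU",
)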
