String handling #111

@znichollscr

Up front: I'm not sure whether this is user error, a bug in ncdata, a bug upstream in xarray/iris, or a bug downstream in netCDF4. I'm raising it here because the errors surface when using ncdata, but feel free to point me elsewhere if that makes more sense.

I was playing around with writing strings into a netCDF file. There seem to be multiple ways to do this; some work fine, while others raise errors.

For all of these demos, I used a Python 3.11 virtual environment on macOS with the following requirements.txt file.

Requirements

ncdata==0.1.1
netCDF4==1.7.2
scitools-iris==3.11.1
xarray==2025.1.2

Passing example

If you write the data as a character array, converting it explicitly with netCDF4.stringtochar, everything works.

import iris
import netCDF4
import numpy as np
from ncdata.iris import from_iris
from ncdata.iris_xarray import cubes_to_xarray
from ncdata.netcdf4 import from_nc4

iris.FUTURE.save_split_attrs = True

with netCDF4.Dataset("demo.nc", "w") as ds:
    regions_l = ["Australia", "New Zealand", "England"]
    regions_max_length = max(len(v) for v in regions_l)

    regions = np.array(regions_l, dtype=f"S{regions_max_length}")
    ds.createDimension("lbl", len(regions))
    ds.createDimension("strlen", regions_max_length)
    ds.createVariable("region", "S1", ("lbl", "strlen"))
    ds["region"][:] = netCDF4.stringtochar(regions)


# None of these raise any errors
from_nc4("demo.nc")
cube = iris.load("demo.nc")
from_iris(cube)
cubes_to_xarray(cube)

The output netCDF file also looks sensible

ncdump demo.nc
netcdf demo {
dimensions:
	lbl = 3 ;
	strlen = 11 ;
variables:
	char region(lbl, strlen) ;
data:

 region =
  "Australia",
  "New Zealand",
  "England" ;
}
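
To make the (lbl, strlen) layout above concrete, here is a minimal numpy-only sketch of the kind of reshaping that netCDF4.stringtochar performs in the passing example. It is an illustration of the memory layout, not netCDF4's actual implementation, and the names `width`, `chars` and `restored` are made up for this sketch.

```python
import numpy as np

regions_l = ["Australia", "New Zealand", "England"]
width = max(len(v) for v in regions_l)

# Fixed-width byte strings: one NUL-padded item of `width` bytes per region.
regions = np.array(regions_l, dtype=f"S{width}")

# Reinterpret each fixed-width item as `width` one-byte items,
# giving the (lbl, strlen) shape that the "S1" variable expects.
chars = regions.view("S1").reshape(regions.shape + (width,))

# Reverse step: glue the last axis back into fixed-width strings.
restored = np.ascontiguousarray(chars).view(f"S{width}").reshape(regions.shape)

print(chars.shape)
print(restored.tolist())
```

Shorter names such as "England" are padded with empty bytes on the right, which is why the ncdump output still shows the original strings.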

Failing example 1 - something to do with encoding

If you create a character variable but let netCDF4 do the encoding (by setting _Encoding and assigning the fixed-width strings directly), then loading the file with iris and converting the result with ncdata fails. from_nc4 on the file itself works, which suggests the bug may be in iris?

import iris
import netCDF4
import numpy as np
from ncdata.iris import from_iris
from ncdata.iris_xarray import cubes_to_xarray
from ncdata.netcdf4 import from_nc4

iris.FUTURE.save_split_attrs = True

with netCDF4.Dataset("demo.nc", "w") as ds:
    regions_l = ["Australia", "New Zealand", "England"]
    regions_max_length = max(len(v) for v in regions_l)

    regions = np.array(regions_l, dtype=f"S{regions_max_length}")
    ds.createDimension("lbl", len(regions))
    ds.createDimension("strlen", regions_max_length)
    ds.createVariable("region", "S1", ("lbl", "strlen"))
    ds["region"]._Encoding = "ascii"
    ds["region"][:] = regions


from_nc4("demo.nc")

cube = iris.load("demo.nc")
from_iris(cube)
"""
The line above gives the following error

...

  File ".../venv/lib/python3.11/site-packages/ncdata/dataset_like.py", line 284, in _get_fillvalue
    fv = netCDF4.default_fillvals[dtype_code]
         ~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^
KeyError: 'U11'
"""
cubes_to_xarray(cube)

The underlying netCDF file looks sensible though.

ncdump demo.nc
netcdf demo {
dimensions:
	lbl = 3 ;
	strlen = 11 ;
variables:
	char region(lbl, strlen) ;
		region:_Encoding = "ascii" ;
data:

 region =
  "Australia",
  "New Zealand",
  "England" ;
}
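
The KeyError in the traceback above comes from looking up 'U11' in netCDF4.default_fillvals, which only has entries for numeric codes and 'S1'. The sketch below reproduces that failure with a small stand-in dictionary and shows one possible guarded lookup. The helper names `dtype_code` and `get_fillvalue`, and the choice to return None for string dtypes, are assumptions made for illustration; they are not ncdata's actual code or a confirmed fix.

```python
# Stand-in for a subset of netCDF4.default_fillvals (assumption: the real
# table has no entry for fixed-width unicode codes such as "U11").
DEFAULT_FILLVALS = {"S1": "\x00", "i4": -2147483647, "f8": 9.969209968386869e36}


def dtype_code(dtype_str):
    """Drop a leading byte-order character: '<U11' -> 'U11', '|S1' -> 'S1'."""
    return dtype_str.lstrip("<>|=")


def get_fillvalue(dtype_str):
    """Hypothetical guarded lookup: multi-character string dtypes get None."""
    code = dtype_code(dtype_str)
    if code[0] in "US" and code != "S1":
        return None
    return DEFAULT_FILLVALS[code]


# The unguarded lookup fails in the same way as the traceback.
try:
    DEFAULT_FILLVALS[dtype_code("<U11")]
except KeyError as err:
    print("unguarded lookup fails:", err)

print(get_fillvalue("<U11"))
print(get_fillvalue("<f8"))
```

Whether "no fill value" is the right answer for a unicode array, or whether the data should be converted back to characters first, is a design question for the maintainers.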

Failing example 2 - variable length strings

If you write a variable-length string variable, the error comes from ncdata's from_nc4 (raised by dask while it builds the lazy array). However, iris can't load the file either, so maybe this just isn't a supported use case.

import iris
import netCDF4
import numpy as np
from ncdata.iris import from_iris
from ncdata.iris_xarray import cubes_to_xarray
from ncdata.netcdf4 import from_nc4

iris.FUTURE.save_split_attrs = True

with netCDF4.Dataset("demo.nc", "w") as ds:
    regions_l = ["Australia", "New Zealand", "England"]
    regions_max_length = max(len(v) for v in regions_l)

    regions = np.array(regions_l, dtype="O")
    ds.createDimension("lbl", len(regions))
    ds.createVariable("region", str, ("lbl",))
    ds["region"][:] = regions


from_nc4("demo.nc")
"""
The line above gives the following error

Traceback (most recent call last):
  File ".../demo-variable-str-failing.py", line 20, in <module>
    from_nc4("demo.nc")
  File ".../venv/lib/python3.11/site-packages/ncdata/netcdf4.py", line 308, in from_nc4
    ncdata = _from_nc4_group(nc4ds)
             ^^^^^^^^^^^^^^^^^^^^^^
  File ".../venv/lib/python3.11/site-packages/ncdata/netcdf4.py", line 264, in _from_nc4_group
    var.data = da.from_array(
               ^^^^^^^^^^^^^^
  File ".../venv/lib/python3.11/site-packages/dask/array/core.py", line 3523, in from_array
    chunks = normalize_chunks(
             ^^^^^^^^^^^^^^^^^
  File ".../venv/lib/python3.11/site-packages/dask/array/core.py", line 3130, in normalize_chunks
    chunks = auto_chunks(chunks, shape, limit, dtype, previous_chunks)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File ".../venv/lib/python3.11/site-packages/dask/array/core.py", line 3304, in auto_chunks
    raise ValueError(
ValueError: auto-chunking with dtype.itemsize == 0 is not supported, please pass in `chunks` explicitly
"""

cube = iris.load("demo.nc")
from_iris(cube)
cubes_to_xarray(cube)

The underlying netCDF file seems to be valid, but maybe I'm missing something.

ncdump demo.nc
netcdf demo {
dimensions:
	lbl = 3 ;
variables:
	string region(lbl) ;
data:

 region = "Australia", "New Zealand", "England" ;
}
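
The "dtype.itemsize == 0" in the ValueError above can be reproduced with numpy alone. This is a minimal sketch, assuming that netCDF4 reports `str` as the dtype of a variable-length string variable, so that dask ends up with a zero-width unicode dtype; that assumption has not been checked against these exact versions. The helper `can_auto_chunk` is hypothetical.

```python
import numpy as np

# Assumption: a variable-length string variable's dtype is the Python type
# `str`, which numpy maps to a zero-width unicode dtype.
vlen_dtype = np.dtype(str)
fixed_dtype = np.dtype("S11")

print(vlen_dtype.itemsize)
print(fixed_dtype.itemsize)


def can_auto_chunk(dtype):
    """Hypothetical check: auto-chunking needs a non-zero item size."""
    return np.dtype(dtype).itemsize > 0


print(can_auto_chunk(str))
print(can_auto_chunk("S11"))
```

As the error message itself suggests, passing `chunks` explicitly to dask would sidestep the auto-chunking step; the fixed-width character layout from the passing example avoids the problem entirely.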

@pp-mo, do you have any thoughts?
