String handling

Up front: I'm not sure whether this is a user error, a bug in ncdata, a bug upstream in xarray/iris, or a bug downstream in netCDF4. I'm raising it here because it shows up when using ncdata, but feel free to send me looking elsewhere if that makes more sense.

I was playing around with writing strings into a netCDF file. There seem to be multiple ways to do this: some work fine, others raise errors.

To run all these demos, I used a Python 3.11 virtual environment with the following `requirements.txt` file. I'm working on a Mac.

<details><summary>Requirements</summary>


```
ncdata==0.1.1
netCDF4==1.7.2
scitools-iris==3.11.1
xarray==2025.1.2
```


</details> 

<details><summary>Passing example</summary>


If you create the variable as a fixed-width character array (converting with `netCDF4.stringtochar`), everything seems to work.

```python
import iris
import netCDF4
import numpy as np
from ncdata.iris import from_iris
from ncdata.iris_xarray import cubes_to_xarray
from ncdata.netcdf4 import from_nc4

iris.FUTURE.save_split_attrs = True

with netCDF4.Dataset("demo.nc", "w") as ds:
    regions_l = ["Australia", "New Zealand", "England"]
    regions_max_length = max(len(v) for v in regions_l)

    regions = np.array(regions_l, dtype=f"S{regions_max_length}")
    ds.createDimension("lbl", len(regions))
    ds.createDimension("strlen", regions_max_length)
    ds.createVariable("region", "S1", ("lbl", "strlen"))
    ds["region"][:] = netCDF4.stringtochar(regions)


# None of these raise any errors
from_nc4("demo.nc")
cube = iris.load("demo.nc")
from_iris(cube)
cubes_to_xarray(cube)
```

The output netCDF file also looks sensible

```
ncdump demo.nc
netcdf demo {
dimensions:
	lbl = 3 ;
	strlen = 11 ;
variables:
	char region(lbl, strlen) ;
data:

 region =
 "Australia",
 "New Zealand",
 "England" ;
}
```
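
For reference, here is a minimal NumPy-only sketch of the layout that, as far as I understand, `netCDF4.stringtochar` produces for this data: each fixed-width byte string is split into single `S1` characters along a trailing `strlen` axis. The `to_char_array` helper is a hypothetical name of mine, not part of netCDF4 or ncdata.

```python
import numpy as np


def to_char_array(strings: np.ndarray) -> np.ndarray:
    """Split fixed-width byte strings into single characters (hypothetical helper)."""
    width = strings.dtype.itemsize
    # Reinterpret each S<width> element as <width> separate S1 elements
    chars = np.ascontiguousarray(strings).view("S1")
    # Restore the original shape, with the characters on a new trailing axis
    return chars.reshape(strings.shape + (width,))


regions_l = ["Australia", "New Zealand", "England"]
regions = np.array(regions_l, dtype=f"S{max(len(v) for v in regions_l)}")
chars = to_char_array(regions)

print(chars.shape)  # (3, 11), matching the (lbl, strlen) dimensions
```

Shorter names are padded with empty bytes up to the longest one, which is why `strlen` is 11 in the dump above.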


</details> 

<details><summary>Failing example 1 - something to do with encoding</summary>


If you create the variable as a character array but let netCDF4 do the encoding (by setting `_Encoding`), `from_nc4` still works, but loading the file with iris and then converting with ncdata (`from_iris`) raises a `KeyError` (which suggests the bug is in iris?).

```python
import iris
import netCDF4
import numpy as np
from ncdata.iris import from_iris
from ncdata.iris_xarray import cubes_to_xarray
from ncdata.netcdf4 import from_nc4

iris.FUTURE.save_split_attrs = True

with netCDF4.Dataset("demo.nc", "w") as ds:
    regions_l = ["Australia", "New Zealand", "England"]
    regions_max_length = max(len(v) for v in regions_l)

    regions = np.array(regions_l, dtype=f"S{regions_max_length}")
    ds.createDimension("lbl", len(regions))
    ds.createDimension("strlen", regions_max_length)
    ds.createVariable("region", "S1", ("lbl", "strlen"))
    ds["region"]._Encoding = "ascii"
    ds["region"][:] = regions


from_nc4("demo.nc")

cube = iris.load("demo.nc")
from_iris(cube)
"""
The line above gives the following error

...

 File ".../venv/lib/python3.11/site-packages/ncdata/dataset_like.py", line 284, in _get_fillvalue
 fv = netCDF4.default_fillvals[dtype_code]
 ~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^
KeyError: 'U11'
"""
cubes_to_xarray(cube)
```

The underlying netCDF file looks sensible though.

```
netcdf demo {
dimensions:
	lbl = 3 ;
	strlen = 11 ;
variables:
	char region(lbl, strlen) ;
		region:_Encoding = "ascii" ;
data:

 region =
 "Australia",
 "New Zealand",
 "England" ;
}
```
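
One possible reading of the `KeyError: 'U11'` is that, with `_Encoding` set, the data reaches ncdata already decoded as a NumPy unicode array (dtype `U11`), while `netCDF4.default_fillvals` only has entries for `S1` and the numeric type codes. That is a guess on my part, not something I have traced through iris or ncdata. The sketch below uses NumPy only, with a stand-in array, to show that dtype and one way the data could be turned back into the character layout from the passing example. I have not tested it as a workaround.

```python
import numpy as np

# Stand-in for the decoded data that appears to reach ncdata (an assumption)
decoded = np.array(["Australia", "New Zealand", "England"])
print(decoded.dtype.kind)  # U

# Encode back to fixed-width ASCII bytes, then split into single characters
width = max(len(v) for v in decoded)
encoded = np.char.encode(decoded, "ascii").astype(f"S{width}")
chars = encoded.view("S1").reshape(encoded.shape + (width,))

print(chars.shape)  # (3, 11)
```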


</details> 

<details><summary>Failing example 2 - variable length strings</summary>


If you write using a variable-length string, the error appears to come from ncdata (via dask). However, iris also can't load the file, so maybe this just isn't a supported use case.

```python
import iris
import netCDF4
import numpy as np
from ncdata.iris import from_iris
from ncdata.iris_xarray import cubes_to_xarray
from ncdata.netcdf4 import from_nc4

iris.FUTURE.save_split_attrs = True

with netCDF4.Dataset("demo.nc", "w") as ds:
    regions_l = ["Australia", "New Zealand", "England"]
    regions_max_length = max(len(v) for v in regions_l)

    regions = np.array(regions_l, dtype="O")
    ds.createDimension("lbl", len(regions))
    ds.createVariable("region", str, ("lbl",))
    ds["region"][:] = regions


from_nc4("demo.nc")
"""
The line above gives the following error

Traceback (most recent call last):
 File ".../demo-variable-str-failing.py", line 20, in <module>
 from_nc4("demo.nc")
 File ".../venv/lib/python3.11/site-packages/ncdata/netcdf4.py", line 308, in from_nc4
 ncdata = _from_nc4_group(nc4ds)
 ^^^^^^^^^^^^^^^^^^^^^^
 File ".../venv/lib/python3.11/site-packages/ncdata/netcdf4.py", line 264, in _from_nc4_group
 var.data = da.from_array(
 ^^^^^^^^^^^^^^
 File ".../venv/lib/python3.11/site-packages/dask/array/core.py", line 3523, in from_array
 chunks = normalize_chunks(
 ^^^^^^^^^^^^^^^^^
 File ".../venv/lib/python3.11/site-packages/dask/array/core.py", line 3130, in normalize_chunks
 chunks = auto_chunks(chunks, shape, limit, dtype, previous_chunks)
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 File ".../venv/lib/python3.11/site-packages/dask/array/core.py", line 3304, in auto_chunks
 raise ValueError(
ValueError: auto-chunking with dtype.itemsize == 0 is not supported, please pass in `chunks` explicitly
"""

cube = iris.load("demo.nc")
from_iris(cube)
cubes_to_xarray(cube)
```

The underlying netCDF seems to be valid, but maybe I'm missing something.

```
ncdump demo.nc
netcdf demo {
dimensions:
	lbl = 3 ;
variables:
	string region(lbl) ;
data:

 region = "Australia", "New Zealand", "England" ;
}
```
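
The `dtype.itemsize == 0` in the dask message may come from the variable's dtype being reported as Python `str`. That is my understanding of how netCDF4 reports variable-length string variables, and I have not checked it against ncdata's code. NumPy maps `str` to a zero-width unicode dtype, as this small NumPy-only sketch shows:

```python
import numpy as np

# What a dtype of plain `str` maps to in NumPy (assumed to be what dask sees)
vlen_dtype = np.dtype(str)
print(vlen_dtype.itemsize)  # 0

# The object array written in the example has a non-zero itemsize (pointer size)
regions = np.array(["Australia", "New Zealand", "England"], dtype="O")
print(regions.dtype.itemsize)
```

If that is what is happening, dask cannot work out a chunk size in bytes, which would match its request to pass `chunks` explicitly.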


</details> 

@pp-mo not sure if you have any thoughts?
