
Conversation

@pp-mo (Member) commented Oct 25, 2025

Closes #6309

So far, just some ideas brewing

@pp-mo (Member, Author) commented Oct 25, 2025

Older notes

Issues for iris char data

  • read + write, with + without encodings
  • ? choose to view cube/coord data as strings or (underlying) byte array
  • ?? char coord writing works, but char cube data does not

=========================
testing dimensions (FOR READS)

  • encoding can be None, "ascii" or "utf-8"
    • we should also test alternative spellings of utf-8 / ascii
    • but not fuss too much ?

EXISTING behaviour

  • is ok for ascii
  • but results depend on the presence of the "_Encoding" attribute
    • since that is the default behaviour of netCDF4-python

ASIDE: Python "standard encodings" : https://docs.python.org/3/library/codecs.html#standard-encodings
This is a table of codec names and their aliases. Names can be normalised like this...

    >>> codecs.lookup("u8").name
    'utf-8'
  • this produces the canonical "name" from any of the "alternatives", as in the table
  • it also fails (with LookupError) when given junk
    • it does not accept "" or None
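The points above could be wrapped in one small helper; a minimal sketch only (the name `normalise_encoding` and the restriction to the two netCDF-recognised values are illustrative assumptions, not existing Iris API):

```python
import codecs

# The two values the netCDF libraries recognise for "_Encoding"
# (case-insensitive) -- an assumption based on the NUG wording quoted below.
_VALID_ENCODINGS = {"utf-8", "ascii"}


def normalise_encoding(name):
    """Return the canonical codec name for any accepted spelling.

    None is passed through to mean "no decoding requested", since
    codecs.lookup() accepts neither None nor "".
    """
    if name is None:
        return None
    # codecs.lookup resolves aliases (e.g. "u8" -> "utf-8") and
    # raises LookupError for junk input.
    canonical = codecs.lookup(name).name
    if canonical not in _VALID_ENCODINGS:
        raise ValueError(f"Unsupported _Encoding value: {name!r}")
    return canonical
```

For example, `normalise_encoding("U8")` gives `"utf-8"`, while a latin-1 spelling would be rejected even though `codecs.lookup` accepts it.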

Old discussion in netcdf4-python, referenced by the xarray docs:
Unidata/netcdf4-python#654 (comment)
From that specific comment by jswhit (quoting an old version of the NCUG?):

Applications writing string data using the char data type are encouraged to add
the special variable attribute "_Encoding" with a value that the netCDF libraries
recognize.
Currently those valid values are "UTF-8" or "ASCII", case insensitive.
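To make the quoted advice concrete: a writer stores each string as a fixed-width row of single bytes, padded out to a common width, in the encoding named by "_Encoding". A pure-Python sketch of that layout (the function name and the NUL-padding choice are illustrative assumptions, not netCDF4-python's actual API):

```python
def strings_to_char_rows(strings, encoding="utf-8"):
    """Lay strings out as fixed-width rows of single bytes, NUL-padded,
    mimicking the storage layout of a netCDF "char" variable."""
    encoded = [s.encode(encoding) for s in strings]
    width = max(len(b) for b in encoded)
    return [
        # split each padded bytestring into single-byte "chars"
        [b[i : i + 1] for i in range(width)]
        for b in (e.ljust(width, b"\x00") for e in encoded)
    ]
```

Note that the row width is a *byte* count, not a character count, which matters once non-ASCII encodings are involved.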

In the Unidata docs, the reference is hard to find:
there is STILL NOTHING in the Attributes Appendix (A).
In https://docs.unidata.ucar.edu/netcdf-c/current/file_format_specifications.html :

Note on char data: Although the characters used in netCDF names must be encoded
as UTF-8, character data may use other encodings.
The variable attribute “_Encoding” is reserved for this purpose in future implementations.

Outstanding issues

  • assumption that string dim of coords cannot be a data dim
  • how to manage a backwards-compatible approach to coords + cubes
    • == expecting data cubes to contain strings ??
    • == OR converting (automatically, with turn-off FUTURE control??) ??

@pp-mo (Member, Author) commented Oct 27, 2025

There seems to be a problem with netCDF4-python byte encodings: Unidata/netcdf4-python#1440

For now, here, I have just turned off decoding, so everything now reads as character arrays (?).
Future intention: decode here, to reproduce the originally intended behaviour.

I now don't think that people need or want to see cubes or coords with string dimensions: we will convert all to Uxx arrays internally.
This means we will lose names and identity of string dimensions. But that is probably ok.
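That internal conversion amounts to collapsing the string dimension; a sketch only, in pure Python rather than the NumPy byte/`Uxx` arrays Iris would actually use, and `char_rows_to_strings` is a hypothetical name:

```python
def char_rows_to_strings(rows, encoding="utf-8"):
    """Collapse rows of single bytes (one netCDF "char" each) into
    Python strings, stripping trailing NUL padding before decoding."""
    return [b"".join(row).rstrip(b"\x00").decode(encoding) for row in rows]
```

After this step the trailing (string) dimension no longer exists, which is why its name and identity are lost.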

Note: the existing code names string dims according to their (byte) lengths. This seems a neat idea, since it means variables automatically share dims where convenient.
But there could be inefficiencies from using worst-case byte lengths for a given Unicode length?
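The possible inefficiency can be quantified: UTF-8 encodes a single code point in up to 4 bytes, so sizing a char dim for the worst case can cost 4x over what ASCII-only data needs. A small illustration (not Iris code; the helper name is made up):

```python
def worst_case_utf8_bytes(n_chars):
    # Any single Unicode code point needs at most 4 bytes in UTF-8.
    return 4 * n_chars


# Actual byte lengths vary with content for the same character count:
assert len("abcde".encode("utf-8")) == 5    # ASCII: 1 byte per char
assert len("ééééé".encode("utf-8")) == 10   # accented latin: 2 bytes per char
assert len("😀😀😀😀😀".encode("utf-8")) == 20  # emoji: 4 bytes per char
assert worst_case_utf8_bytes(5) == 20
```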

pp-mo added 3 commits October 28, 2025 18:19
Get 'create_cf_data_variable' to call 'create_generic_cf_array_var': Mostly working?
common_dims = [
dim for dim in cf_coord_var.dimensions if dim in engine.cf_var.dimensions
]
coord_dims = cf_coord_var.dimensions
pp-mo (Member, Author) commented:

NOTE: this possibly needs to be implemented for ancillary-variables too

  • which might also be strings
  • which is awkward because of a DRY failure in the rules code

Comment on lines +854 to +857
# if encoding == "ascii":
# print("\n\n*** FIX !!")
# string = bytes.decode("utf-8")
# else:
pp-mo (Member, Author) commented:

TODO: remove

@pp-mo (Member, Author) commented Nov 11, 2025

Status update 2025-11-11

  • intended behaviour I think is now complete + working
  • much more proper testing needed
    • the added PoC tests exercise it, but lack desired-result asserts -- to be rewritten entirely, probably
  • a number of existing mock-ist tests are broken by the changes (--> failures in this PR), so need fixing
  • after consideration, I now really want to refactor the encode/decode support
    • to replace the various places I've added/changed this, with a separate dataset wrapper
    • .. like (and subclassing) the _threadsafe_nc ones
    • .. which should reduce a lot of the "mess" and DRY failure in this PoC
    • possibly this can even be removed again, if a future fix to the netcdf bug delivers all that we would want
      • they have already put in a fix, but it is so far unreleased, so it is easier to wait for a release to test against.
      • it's not yet clear (to me) whether that fix intends to support the _Encoding attribute entirely as we'd like it to?

Successfully merging this pull request may close these issues:

Fix iris handling of netcdf character array variables