
Conversation

@pp-mo (Member) commented Oct 25, 2025

Closes #6309

So far, just some ideas brewing

@pp-mo (Member, Author) commented Oct 25, 2025

Older notes

Issues for iris char data

  • read + write, with + without encodings
  • ? choose to view cube/coord data as strings or (underlying) byte array
  • ?? char coord writing works, but char cube data does not

=========================
testing dimensions (FOR READS)

  • encoding can be None, "ascii" or "utf-8"
    • we should also test alternative spellings of utf-8 / ascii
    • but not fuss too much ?

EXISTING behaviour

  • is ok for ascii
  • but results depend on the presence of the "_Encoding" attribute
    • since that is the default behaviour of netCDF4-python

ASIDE: Python "standard encodings" : https://docs.python.org/3/library/codecs.html#standard-encodings
This is a table of codec names and their aliases. Names can be normalised like this...

    >>> codecs.lookup("u8").name
    'utf-8'
  • this produces the canonical "name" from any of the "alternatives", as in the table
  • it also fails (with LookupError) when given junk
    • it does not accept "" or None
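The points above could be wrapped in one small helper; a minimal sketch only (the name `normalise_encoding` and the restriction to the two netCDF-recognised values are illustrative assumptions, not existing Iris API):

```python
import codecs

# The two values the netCDF libraries recognise for "_Encoding"
# (case-insensitive) -- an assumption based on the NUG wording quoted below.
_VALID_ENCODINGS = {"utf-8", "ascii"}


def normalise_encoding(name):
    """Return the canonical codec name for any accepted spelling.

    None is passed through to mean "no decoding requested", since
    codecs.lookup() accepts neither None nor "".
    """
    if name is None:
        return None
    # codecs.lookup resolves aliases (e.g. "u8" -> "utf-8") and
    # raises LookupError for junk input.
    canonical = codecs.lookup(name).name
    if canonical not in _VALID_ENCODINGS:
        raise ValueError(f"Unsupported _Encoding value: {name!r}")
    return canonical
```

For example, `normalise_encoding("U8")` gives `"utf-8"`, while a latin-1 spelling would be rejected even though `codecs.lookup` accepts it.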

Old discussion in netcdf4-python, referenced by the xarray docs:
Unidata/netcdf4-python#654 (comment)
From that specific comment by jswhit (quoting an old version of the NCUG?):

Applications writing string data using the char data type are encouraged to add
the special variable attribute "_Encoding" with a value that the netCDF libraries
recognize.
Currently those valid values are "UTF-8" or "ASCII", case insensitive.
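To make the quoted advice concrete: a writer stores each string as a fixed-width row of single bytes, padded out to a common width, in the encoding named by "_Encoding". A pure-Python sketch of that layout (the function name and the NUL-padding choice are illustrative assumptions, not netCDF4-python's actual API):

```python
def strings_to_char_rows(strings, encoding="utf-8"):
    """Lay strings out as fixed-width rows of single bytes, NUL-padded,
    mimicking the storage layout of a netCDF "char" variable."""
    encoded = [s.encode(encoding) for s in strings]
    width = max(len(b) for b in encoded)
    return [
        # split each padded bytestring into single-byte "chars"
        [b[i : i + 1] for i in range(width)]
        for b in (e.ljust(width, b"\x00") for e in encoded)
    ]
```

Note that the row width is a *byte* count, not a character count, which matters once non-ASCII encodings are involved.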

In the Unidata docs, the reference is hard to find:
there is STILL NOTHING in the Attributes Appendix (A).
In https://docs.unidata.ucar.edu/netcdf-c/current/file_format_specifications.html :

Note on char data: Although the characters used in netCDF names must be encoded
as UTF-8, character data may use other encodings.
The variable attribute “_Encoding” is reserved for this purpose in future implementations.

Outstanding issues

  • assumption that string dim of coords cannot be a data dim
  • how to manage a backwards-compatible approach to coords + cubes
    • == expecting data cubes to contain strings ??
    • == OR converting (automatically, with turn-off FUTURE control??) ??

@pp-mo (Member, Author) commented Oct 27, 2025

There seems to be a problem with netCDF4-python byte encodings: Unidata/netcdf4-python#1440

For now, here, I have just turned off decoding, so everything now reads as character arrays (?).
Future intention: decode here, to reproduce the originally intended behaviour.

I now don't think that people need or want to see cubes or coords with string dimensions: we will convert all to Uxx arrays internally.
This means we will lose names and identity of string dimensions. But that is probably ok.
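That internal conversion amounts to collapsing the string dimension; a sketch only, in pure Python rather than the NumPy byte/`Uxx` arrays Iris would actually use, and `char_rows_to_strings` is a hypothetical name:

```python
def char_rows_to_strings(rows, encoding="utf-8"):
    """Collapse rows of single bytes (one netCDF "char" each) into
    Python strings, stripping trailing NUL padding before decoding."""
    return [b"".join(row).rstrip(b"\x00").decode(encoding) for row in rows]
```

After this step the trailing (string) dimension no longer exists, which is why its name and identity are lost.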

Note: the existing code names string dims according to their (byte) lengths. This seems a neat idea, since it means variables automatically share dims where convenient.
But there could be inefficiencies from using worst-case byte lengths for a given Unicode length?
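The possible inefficiency can be quantified: UTF-8 encodes a single code point in up to 4 bytes, so sizing a char dim for the worst case can cost 4x over what ASCII-only data needs. A small illustration (not Iris code; the helper name is made up):

```python
def worst_case_utf8_bytes(n_chars):
    # Any single Unicode code point needs at most 4 bytes in UTF-8.
    return 4 * n_chars


# Actual byte lengths vary with content for the same character count:
assert len("abcde".encode("utf-8")) == 5    # ASCII: 1 byte per char
assert len("ééééé".encode("utf-8")) == 10   # accented latin: 2 bytes per char
assert len("😀😀😀😀😀".encode("utf-8")) == 20  # emoji: 4 bytes per char
assert worst_case_utf8_bytes(5) == 20
```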

pp-mo added 3 commits October 28, 2025 18:19
Get 'create_cf_data_variable' to call 'create_generic_cf_array_var': Mostly working?
common_dims = [
dim for dim in cf_coord_var.dimensions if dim in engine.cf_var.dimensions
]
coord_dims = cf_coord_var.dimensions
pp-mo (Member, Author) commented:

NOTE: this possibly needs to be implemented for ancillary-variables too

  • which might also be strings
  • which is awkward because of a DRY failure in the rules code

Comment on lines +854 to +857
# if encoding == "ascii":
# print("\n\n*** FIX !!")
# string = bytes.decode("utf-8")
# else:
pp-mo (Member, Author) commented:

TODO: remove

@pp-mo (Member, Author) commented Nov 11, 2025

Status update 2025-11-11

  • intended behaviour I think is now complete + working
  • much more proper testing needed
    • the added PoC tests exercise it, but lack desired-result asserts -- to be rewritten entirely, probably
  • a number of existing mock-ist tests are broken by the changes (--> failures in this PR), so need fixing
  • after consideration, I now really want to refactor the encode/decode support
    • to replace the various places I've added/changed this, with a separate dataset wrapper
    • .. like (and subclassing) the _threadsafe_nc ones
    • .. which should reduce a lot of the "mess" and DRY failure in this PoC
    • possibly this can even be removed again, if a future fix to the netcdf bug delivers all that we would want
      • they have already put in a fix, but it is so far unreleased, so it is easier to wait for a release to test against.
      • it's not yet clear (to me) whether that fix intends to support the _Encoding attribute entirely as we'd like it to?

Successfully merging this pull request may close these issues:

Fix iris handling of netcdf character array variables