Adds read/write access to raster attribute tables #3252

Open: 15 commits to be merged into base: main

@ebkurtz ebkurtz commented Nov 20, 2024

Adds the ability to read and write raster attribute tables (RAT) per #3185. The RAT is represented in Python as a list of dictionaries, where each dictionary describes one column and contains that column's values. The dictionary keys are:

    column_name Name of the column
    column_type Data type of the column values.
                ``RATFieldType.<enum>``
                 see https://gdal.org/en/latest/doxygen/gdal_8h.html#a810154ac91149d1a63c42717258fe16e
    column_usage Indicates the specific usage of the column if defined. 
                ``RATFieldUsage.<enum>``
                 see https://gdal.org/en/latest/doxygen/gdal_8h.html#a27bf786b965d5227da1acc2a4cab69a1
    Values list of column values
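For illustration, a table in the layout described above might look like the following. This is a minimal sketch, not code from the PR: the two enum classes are stand-ins that mirror a subset of GDAL's `GDALRATFieldType` and `GDALRATFieldUsage` values, their member names are assumptions, and the land-cover data is made up. The `iter_rows` helper is also hypothetical and only shows how the column-wise lists line up into rows.

```python
from enum import IntEnum


class RATFieldType(IntEnum):
    """Stand-in mirroring GDAL's GDALRATFieldType (GFT_*) values."""
    Integer = 0
    Real = 1
    String = 2


class RATFieldUsage(IntEnum):
    """Stand-in mirroring a subset of GDAL's GDALRATFieldUsage (GFU_*) values."""
    Generic = 0
    PixelCount = 1
    Name = 2
    MinMax = 5


# One dictionary per column; each "Values" list holds one entry per table row.
rat = [
    {
        "column_name": "value",
        "column_type": RATFieldType.Integer,
        "column_usage": RATFieldUsage.MinMax,
        "Values": [1, 2, 3],
    },
    {
        "column_name": "count",
        "column_type": RATFieldType.Integer,
        "column_usage": RATFieldUsage.PixelCount,
        "Values": [120, 45, 300],
    },
    {
        "column_name": "class_name",
        "column_type": RATFieldType.String,
        "column_usage": RATFieldUsage.Name,
        "Values": ["water", "forest", "urban"],
    },
]


def iter_rows(columns):
    """Yield one {column_name: value} dict per table row (hypothetical helper)."""
    names = [col["column_name"] for col in columns]
    for row in zip(*(col["Values"] for col in columns)):
        yield dict(zip(names, row))


rows = list(iter_rows(rat))
```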

A RAT is read with:

with rasterio.open(...) as src:
    rat, rat_type = src.rat(1)

and written with:

with rasterio.open(..., 'w') as dst:
    dst.write_rat(1, rat, rat_type)

Potential changes to discuss:

  • Create a python class to hold the raster attribute table rather than a list of dictionaries
  • Have an option to read the RAT as a pandas DataFrame or numpy array

Todo:

  • Add unit tests
  • Add validation before a RAT is written
  • Usage documentation
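As a rough idea of what the validation item could cover, the sketch below checks a list-of-dictionaries table before it is written. The function name `validate_rat` and the specific checks (required keys present, all columns the same length) are assumptions for illustration; the PR does not say which checks are planned.

```python
REQUIRED_KEYS = {"column_name", "column_type", "column_usage", "Values"}


def validate_rat(columns):
    """Raise ValueError if the list-of-dicts table is structurally inconsistent."""
    n_rows = None
    for i, col in enumerate(columns):
        # Every column dictionary must carry all four documented keys.
        missing = REQUIRED_KEYS - set(col)
        if missing:
            raise ValueError(f"column {i} is missing keys: {sorted(missing)}")
        # All columns must describe the same number of rows.
        n_values = len(col["Values"])
        if n_rows is None:
            n_rows = n_values
        elif n_values != n_rows:
            raise ValueError(
                f"column {col['column_name']!r} has {n_values} values, expected {n_rows}"
            )


# Made-up tables; the type and usage strings are placeholders for the enum values.
good = [
    {"column_name": "value", "column_type": "Integer", "column_usage": "MinMax", "Values": [1, 2]},
    {"column_name": "name", "column_type": "String", "column_usage": "Name", "Values": ["a", "b"]},
]
bad = [
    {"column_name": "value", "column_type": "Integer", "column_usage": "MinMax", "Values": [1, 2]},
    {"column_name": "name", "column_type": "String", "column_usage": "Name", "Values": ["a"]},
]
validate_rat(good)  # passes silently
```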

As far as I can tell, Esri has its own standard for writing raster attribute tables as .dbf sidecar files, which GDAL does not read, so this won't work for rasters written by ArcGIS products.

@groutr
Contributor

groutr commented Nov 30, 2024

I'm not super familiar with wrapping GDAL calls, but is there any sort of cleanup needed for the GDAL objects allocated to avoid memory leaks?

@sgillies
Member

sgillies commented Dec 1, 2024

@ebkurtz I'll dig into reviewing this after I've released 1.4.3 (Monday, with luck).

@groutr
Contributor

groutr commented Dec 3, 2024

Is there a situation where you wouldn't want the RAT represented as a numpy array? If the RAT is returned as two dictionaries:

table = {'column1': values1, 'column2':values2, ...}
usage = {'column1': usage1, 'column2': usage2, ...}

it becomes really easy to convert to a dataframe: pd.DataFrame(table) works if values1, values2, ... are already numpy arrays of the proper type.
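To make the two-dictionary layout concrete, here is a minimal sketch that converts the list-of-dictionaries layout from the PR description into a `table` dict and a `usage` dict. The helper name `split_rat` is hypothetical, the data is made up, plain lists stand in for numpy arrays, and the usage strings are placeholders for the enum values.

```python
def split_rat(columns):
    """Split a list of column dictionaries into (table, usage) dictionaries."""
    table = {col["column_name"]: col["Values"] for col in columns}
    usage = {col["column_name"]: col["column_usage"] for col in columns}
    return table, usage


# Made-up data; the usage strings are placeholders for the enum values.
columns = [
    {"column_name": "value", "column_usage": "MinMax", "Values": [1, 2, 3]},
    {"column_name": "class_name", "column_usage": "Name", "Values": ["water", "forest", "urban"]},
]
table, usage = split_rat(columns)
```

Because `table` maps column names to equal-length sequences, it can be passed straight to `pd.DataFrame(table)`, while `usage` travels alongside as separate metadata.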

@ebkurtz
Author

ebkurtz commented Dec 4, 2024

> Is there a situation where you wouldn't want the RAT represented as a numpy array? If the RAT is returned as two dictionaries:
>
> table = {'column1': values1, 'column2':values2, ...}
> usage = {'column1': usage1, 'column2': usage2, ...}
>
> it becomes really easy to convert to a dataframe: pd.DataFrame(table) works if values1, values2, ... are already numpy arrays of the proper type.

I like this approach but a previous discussion mentioned avoiding numpy for performance. I think I would need to move the reading methods from _base.pyx to _io.pyx as well.

@groutr
Contributor

groutr commented Dec 4, 2024

@ebkurtz A lot has changed since that discussion. Numpy is now integrated much more tightly into rasterio and is a required dependency. For example, _io.pyx already imports numpy.

@ebkurtz
Author

ebkurtz commented Dec 4, 2024

I have updated the code to use a dict of numpy arrays and a separate dict of field usage.

@sgillies
Member

@ebkurtz you arrived at a column-oriented model for this? I don't use raster attribute tables at all, so I don't know if that's best. It looks like GDAL's RAT API supports column-oriented access with functions like GDALRATValuesIOAsDouble(), yes?

If we do columns, I would suggest that we create a column class (with attrs, maybe) instead of using dicts.

@sgillies
Member

sgillies commented Dec 16, 2024

I did some background reading and am more qualified to discuss this now. https://cran.r-project.org/web/packages/gdalraster/vignettes/raster-attribute-tables.html is a nice intro. From watching some QGIS videos it seems that map-makers appreciate raster attribute tables that help them make nice map legends.

The GDAL raster attribute table implementation includes a minimal data frame implementation, does it not? At least that's how it looks to me. What would you think about using it, with a more Pythonic API?

# Proposed API, not implemented; `rat` is the hypothetical module that would provide it.
table = rat.Table()  # A Cython extension class, calls GDALCreateRasterAttributeTable() and stores the pointer.
table["attr1"] = rat.Column("attribute 1", dtype="str", usage=rat.Usage.generic)
table["attr1"][:] = ["a", "b", "c", "d"]  # accept a sequence or numpy array
print(table["attr1"][:])  # a numpy array equal to array(['a', 'b', 'c', 'd'], dtype='<U1')
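The indexing behavior proposed above can be mocked up in pure Python to see how it would feel in use. This is only a sketch under stated assumptions: the `Table` and `Column` classes here are stand-ins that store values in a Python list, they do not call GDAL, and they return lists rather than numpy arrays.

```python
class Column:
    """Pure-Python stand-in for the proposed column class (no GDAL involved)."""

    def __init__(self, description, dtype="str", usage="generic"):
        self.description = description
        self.dtype = dtype
        self.usage = usage
        self._values = []

    def __len__(self):
        return len(self._values)

    def __getitem__(self, index):
        # Supports integers and slices, like a list.
        return self._values[index]

    def __setitem__(self, index, values):
        # Slice assignment accepts any iterable, e.g. a list or a numpy array.
        self._values[index] = values


class Table:
    """Pure-Python stand-in for the proposed table class: a mapping of columns."""

    def __init__(self):
        self._columns = {}

    def __setitem__(self, name, column):
        self._columns[name] = column

    def __getitem__(self, name):
        return self._columns[name]

    def __iter__(self):
        return iter(self._columns)

    def __len__(self):
        return len(self._columns)


table = Table()
table["attr1"] = Column("attribute 1", dtype="str", usage="generic")
table["attr1"][:] = ["a", "b", "c", "d"]  # fill the whole column
table["attr1"][1] = "z"                   # partial update of a single row
```

In a real implementation the two `Column` dunder methods would presumably translate the index into a start row and length and delegate to the GDAL RAT value I/O functions instead of a list.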

@groutr
Contributor

groutr commented Dec 16, 2024

I think using a numpy structured array might be a better alternative to new custom classes in rasterio.
https://numpy.org/doc/2.2/user/basics.rec.html

@sgillies
Member

@groutr aren't structured arrays primarily for record or row-oriented data? A raster attribute table looks fundamentally column-oriented, and that's reflected in the GDAL model.

@groutr
Contributor

groutr commented Dec 16, 2024

My experiments with structured arrays seem to indicate easy field access. The array is 1D and stored row oriented, but it's easy to slice fields/columns.

For example:

import numpy as np

rat_dtype = np.dtype([('name', '<U10'), ('value1', np.float64), ('value2', np.int8)])
RAT = np.empty((7,), dtype=rat_dtype)
RAT['name']   # A read/write view of the name field
RAT['name'] = [f"name{i}" for i in range(7)]
RAT[5]['name'] = 3.141  # Numpy automatically casts this to a string
RAT['value1'][:3] = [3.56, 7.5, 8]   # Note that numpy automatically casts the integer to float64
RAT[:3]['name'] == RAT['name'][:3]   # You can select rows or columns first, though I believe the first form is more efficient.
RAT[['value1', 'value2']]  # this also works to select multiple columns

# To iterate over the fields
for field in RAT.dtype.names:
    RAT[field]

# And finally, super easy to convert to dataframe
import pandas as pd
df = pd.DataFrame.from_records(RAT)
df.to_records(index=False)

Record arrays are a subclass of structured arrays that allow field access by attribute, i.e. RAT.name instead of RAT['name'].
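A short sketch of the record-array access mentioned above, using `ndarray.view(np.recarray)`. The field names follow the earlier example and the values are made up; the view shares memory with the structured array, so writes through either name are visible in both.

```python
import numpy as np

rat_dtype = np.dtype([("name", "<U10"), ("value1", np.float64), ("value2", np.int8)])
table = np.zeros((3,), dtype=rat_dtype)
table["name"] = ["water", "forest", "urban"]
table["value1"] = [0.5, 1.5, 2.5]

# View the same memory as a record array to get attribute-style field access.
rec = table.view(np.recarray)
rec.value2[:] = [1, 2, 3]  # writes through to `table`

names = rec.name.tolist()
value2_from_table = table["value2"].tolist()
```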

@ebkurtz
Author

ebkurtz commented Dec 26, 2024

One advantage I see with the custom class is that the usage and table type accompany the data in a single object. With the dict and structured array approaches, that information is stored in a separate dict, which might be cumbersome. However, the structured array would be easier to implement, and users are likely already comfortable with numpy arrays (or pandas). Perhaps another option is a custom class with convenience methods to convert to/from numpy arrays?

@groutr
Contributor

groutr commented Dec 27, 2024

One advantage of the custom class is that the RAT can be lazily read/written. Perhaps that is a desirable characteristic. It slightly increases the complexity, but with the benefit of not having to read and process the entire RAT or store it in memory. The custom class could support table slicing/indexing similar to numpy/pandas, with convenience methods to easily output to those data structures.
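A minimal sketch of the lazy-read idea: a column object that fetches only the requested row range when it is indexed. Everything here is hypothetical. `LazyColumn` is not a rasterio class, and `reader(start, length)` stands in for a GDAL values I/O call, which as far as I understand takes a start row and a length. The demo reader records each request so the access pattern is visible.

```python
class LazyColumn:
    """Hypothetical lazily read column; values are fetched only when indexed."""

    def __init__(self, length, reader):
        self._length = length
        self._reader = reader  # callable(start, length) -> list of values

    def __len__(self):
        return self._length

    def __getitem__(self, index):
        if isinstance(index, slice):
            start, stop, step = index.indices(self._length)
            if step != 1:
                raise ValueError("only contiguous slices are supported in this sketch")
            return self._reader(start, max(stop - start, 0))
        if index < 0:
            index += self._length
        if not 0 <= index < self._length:
            raise IndexError("row index out of range")
        return self._reader(index, 1)[0]


# Demo backing store that records which row ranges were requested.
stored = [10 * i for i in range(1000)]
requests = []


def fake_reader(start, length):
    requests.append((start, length))
    return stored[start:start + length]


column = LazyColumn(len(stored), fake_reader)
first_rows = column[2:5]  # reads 3 rows, not the whole column
last_row = column[-1]     # reads a single row
```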

@sgillies
Member

sgillies commented Jan 2, 2025

Thanks for the comments @ebkurtz, @groutr. We'll have to implement much of the GDAL RAT API just to read and write them, yes? I don't think it would be much more work to add a custom class based on the GDAL API, with some additional features allowing for partial updates and smooth interop with Numpy and Pandas (like slicing). Do either of you have reservations or objections about going in this direction?

@ebkurtz
Author

ebkurtz commented Jan 3, 2025

> Thanks for the comments @ebkurtz, @groutr. We'll have to implement much of the GDAL RAT API just to read and write them, yes? I don't think it would be much more work to add a custom class based on the GDAL API, with some additional features allowing for partial updates and smooth interop with Numpy and Pandas (like slicing). Do either of you have reservations or objections about going in this direction?

I think that's a good approach


3 participants