Adds read/write access to raster attribute tables #3252
I'm not super familiar with wrapping GDAL calls, but is there any sort of cleanup needed for the GDAL objects allocated to avoid memory leaks? |
@ebkurtz I'll dig into reviewing this after I've released 1.4.3 (Monday, with luck). |
Is there a situation where you wouldn't want the RAT represented as a numpy array? If the RAT is returned as two dictionaries:
it becomes really easy to convert to a dataframe. |
I like this approach but a previous discussion mentioned avoiding numpy for performance. I think I would need to move the reading methods from |
@ebkurtz A lot has changed since that discussion. Numpy is now integrated much more tightly into rasterio and is a required dependency. For example, |
I have updated the code to use a dict of NumPy arrays and a separate dict of field usage. |
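For readers following along, here is a rough sketch of what a column-oriented model like this might look like, and how it could be packed into a single structured array. The field names, the usage labels, and the helper `columns_to_structured` are made up for illustration and are not taken from the PR's code.

```python
import numpy as np

# Hypothetical RAT contents: one NumPy array per column.
columns = {
    "value": np.array([1, 2, 3], dtype=np.int64),
    "count": np.array([120, 45, 300], dtype=np.int64),
    "class_name": np.array(["water", "forest", "urban"]),
}

# Separate dict describing how each field is used (labels are illustrative).
usage = {"value": "MinMax", "count": "PixelCount", "class_name": "Name"}


def columns_to_structured(columns):
    """Pack a dict of equal-length 1D arrays into one structured array."""
    names = list(columns)
    dtype = np.dtype([(name, columns[name].dtype) for name in names])
    n_rows = len(columns[names[0]])
    table = np.empty(n_rows, dtype=dtype)
    for name in names:
        table[name] = columns[name]  # copy each column into its field
    return table


table = columns_to_structured(columns)
```

The usage dict still travels separately from the data here, which is the trade-off discussed later in the thread.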
@ebkurtz you arrived at a column-oriented model for this? I don't use raster attribute tables at all, so I don't know if that's best. It looks like GDAL's RAT API supports column-oriented access with functions like If we do columns, I would suggest that we create a column class (with attrs, maybe) instead of using dicts. |
I did some background reading and am more qualified to discuss this now. https://cran.r-project.org/web/packages/gdalraster/vignettes/raster-attribute-tables.html is a nice intro. From watching some QGIS videos it seems that map-makers appreciate raster attribute tables that help them make nice map legends. The GDAL raster attribute table implementation includes a minimal data frame implementation, does it not? At least that's how it looks to me. What would you think about using it, with a more Pythonic API?
table = rat.Table()  # A Cython extension class, calls GDALCreateRasterAttributeTable() and stores the pointer.
table["attr1"] = rat.Column("attribute 1", dtype="str", usage=rat.Usage.generic)
table["attr1"][:] = ["a", "b", "c", "d"]  # accept a sequence or numpy array
print(table["attr1"][:])  # prints: array(['a', 'b', 'c', 'd'], dtype='<U1') |
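To make the shape of the interface proposed above easier to picture, here is a pure-Python stand-in. It makes no GDAL calls and stores values in plain lists rather than NumPy arrays; the class names `Table`, `Column`, and `Usage` mirror the proposal, but everything inside them is assumed, not rasterio's actual implementation.

```python
from enum import Enum


class Usage(Enum):
    """Illustrative subset of field usages."""
    generic = "generic"
    pixel_count = "pixel_count"
    name = "name"


class Column:
    """Holds one column's values together with its metadata."""

    def __init__(self, name, dtype="str", usage=Usage.generic):
        self.name = name
        self.dtype = dtype
        self.usage = usage
        self._values = []

    def __getitem__(self, key):
        return self._values[key]

    def __setitem__(self, key, values):
        if isinstance(key, slice):
            self._values[key] = list(values)  # accept any sequence
        else:
            self._values[key] = values

    def __len__(self):
        return len(self._values)


class Table:
    """Mapping of field key to Column."""

    def __init__(self):
        self._columns = {}

    def __setitem__(self, key, column):
        if not isinstance(column, Column):
            raise TypeError("expected a Column")
        self._columns[key] = column

    def __getitem__(self, key):
        return self._columns[key]

    def __iter__(self):
        return iter(self._columns)


table = Table()
table["attr1"] = Column("attribute 1", dtype="str", usage=Usage.generic)
table["attr1"][:] = ["a", "b", "c", "d"]
```

Because each `Column` carries its own `usage`, the metadata stays attached to the data instead of living in a second dict.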
I think using a numpy structured array might be a better alternative to new custom classes in rasterio. |
@groutr aren't structured arrays primarily for record or row-oriented data? A raster attribute table looks fundamentally column-oriented, and that's reflected in the GDAL model. |
My experiments with structured arrays seem to indicate easy field access. The array is 1D and stored row-oriented, but it's easy to slice fields/columns. For example:
import numpy as np
rat_dtype = np.dtype([('name', '<U10'), ('value1', np.float64), ('value2', np.int8)])
RAT = np.empty((7,), dtype=rat_dtype)
RAT['name'] # A read/write view of the name field
RAT['name'] = [f"name{i}" for i in range(7)]
RAT[5]['name'] = 3.141 # Numpy automatically casts this to a string
RAT['value1'][:3] = [3.56, 7.5, 8] # Note that numpy automatically casts the integer to float64
RAT[:3]['name'] == RAT['name'][:3] # You can select rows or columns first, though I believe the first form is more efficient.
RAT[['value1', 'value2']] # this also works to select multiple columns
# To iterate over the fields
for field in RAT.dtype.names:
    RAT[field]
# And finally, super easy to convert to dataframe
import pandas as pd
df = pd.DataFrame.from_records(RAT)
df.to_records(index=False)
Record arrays are a subclass of structured arrays that allow field access by attribute, ie |
One advantage I see with the custom class is that the usage and table type accompanies the data in a single object. With the dict and structured array approaches, that information is stored in a separate dict which might be cumbersome. However, the structured array would be easier to implement and users are likely already comfortable with numpy arrays (or pandas). Perhaps another option is a custom class with convenience methods to convert to/from np arrays? |
One advantage of the custom class is that the RAT can be lazily read/written. Perhaps that is a desirable characteristic. It slightly increases the complexity, but it avoids having to read and process the entire RAT or store it all in memory. The custom class could support table slicing/indexing similar to numpy/pandas, with convenience methods to easily output to those data structures. |
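A minimal sketch of the lazy-read idea, assuming a hypothetical `LazyTable` class that is handed a column-reading callable. In a real implementation that callable would wrap the GDAL RAT API; here `fake_read_column` stands in for it so the example runs on its own.

```python
class LazyTable:
    """Reads a column only the first time it is requested, then caches it."""

    def __init__(self, column_names, read_column):
        self._names = list(column_names)
        self._read_column = read_column  # callable: name -> sequence of values
        self._cache = {}

    def __getitem__(self, name):
        if name not in self._names:
            raise KeyError(name)
        if name not in self._cache:
            self._cache[name] = self._read_column(name)
        return self._cache[name]

    def loaded(self):
        """Names of the columns that have actually been read so far."""
        return sorted(self._cache)


reads = []  # records every call made to the (fake) reader


def fake_read_column(name):
    # Stand-in for a reader backed by the dataset; the data is made up.
    reads.append(name)
    data = {"value": [1, 2, 3], "class_name": ["water", "forest", "urban"]}
    return data[name]


lazy = LazyTable(["value", "class_name"], fake_read_column)
first_two = lazy["class_name"][:2]  # triggers a single read of one column
again = lazy["class_name"]          # served from the cache, no second read
```

Only the requested column is read, and the `value` column is never touched. Partial writes could follow the same pattern, though that is not shown here.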
Thanks for the comments @ebkurtz, @groutr. We'll have to implement much of the GDAL RAT API just to read and write them, yes? I don't think it would be much more work to add a custom class based on the GDAL API, with some additional features allowing for partial updates and smooth interop with Numpy and Pandas (like slicing). Do either of you have reservations or objections about going in this direction? |
I think that's a good approach |
Adds the ability to read and write raster attribute tables (RAT) per #3185. The RAT is represented in Python as a list of dictionaries, where each dictionary describes a column and contains the column values. The dictionary keys are:
RAT is read with:
and written with:
Potential changes to discuss:
Todo:
As far as I can tell, Esri has its own standard for writing raster attribute tables as .dbf sidecar files, which GDAL does not read, so this won't work for rasters written by ArcGIS products.