
@vschaffn
Contributor

@vschaffn vschaffn commented Mar 25, 2025

Resolves #670.

Context

To enable efficient raster processing on large datasets without exceeding memory limits, this PR introduces generic multiprocessing functions within geoutils. These functions allow users to apply any processing function to raster data using a tiling approach with overlap handling. The approach minimizes memory usage by processing tiles separately and writing results directly to disk.

Features

This PR introduces the following key functions:

  • map_overlap_multiproc_save: This function divides the input raster into overlapping tiles, processes them in parallel using multiprocessing, and writes the processed results to an output raster file.
  • map_multiproc_collect: This function splits an input raster into overlapping tiles, processes them in parallel, and returns the results as a list. It is intended for cases where func does not return a Raster, but instead returns arbitrary values (e.g., numerical statistics, feature extractions, etc.).
  • apply_func_block: A helper function that loads a specific raster tile, applies the provided function, and removes padding to avoid edge effects.
  • load_raster_tile: Loads a specific tile from a raster based on given bounding box coordinates.
  • remove_tile_padding: Removes extra padding from tiles after processing to mitigate edge artifacts.
  • MultiprocConfig: Configuration class for handling multiprocessing parameters.

These functions are designed to be generic and reusable for various raster-processing tasks.
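
To illustrate the tiling-with-overlap idea, here is a minimal pure-Python sketch. It is not the geoutils implementation: `tile_windows`, `pad_window`, `mean_filter` and `map_overlap` are hypothetical names operating on nested lists, and the tiles are processed sequentially. It only shows why padding each tile by `depth` cells and then stripping the padding reproduces the result of processing the whole grid at once.

```python
from typing import Callable, List, Tuple

Grid = List[List[float]]
# (row_start, row_end, col_start, col_end), end-exclusive
Window = Tuple[int, int, int, int]


def tile_windows(n_rows: int, n_cols: int, chunk_size: int) -> List[Window]:
    """Split a grid into non-overlapping tiles of at most chunk_size x chunk_size."""
    return [
        (r, min(r + chunk_size, n_rows), c, min(c + chunk_size, n_cols))
        for r in range(0, n_rows, chunk_size)
        for c in range(0, n_cols, chunk_size)
    ]


def pad_window(win: Window, depth: int, n_rows: int, n_cols: int) -> Window:
    """Grow a tile by `depth` cells on each side, clipped to the grid extent."""
    r0, r1, c0, c1 = win
    return (max(r0 - depth, 0), min(r1 + depth, n_rows),
            max(c0 - depth, 0), min(c1 + depth, n_cols))


def mean_filter(grid: Grid) -> Grid:
    """3x3 mean filter; cells outside the grid are ignored."""
    n_rows, n_cols = len(grid), len(grid[0])
    out = []
    for i in range(n_rows):
        row = []
        for j in range(n_cols):
            vals = [
                grid[a][b]
                for a in range(max(i - 1, 0), min(i + 2, n_rows))
                for b in range(max(j - 1, 0), min(j + 2, n_cols))
            ]
            row.append(sum(vals) / len(vals))
        out.append(row)
    return out


def map_overlap(func: Callable[[Grid], Grid], grid: Grid, chunk_size: int, depth: int) -> Grid:
    """Apply func tile by tile with `depth` cells of overlap, then strip the padding."""
    n_rows, n_cols = len(grid), len(grid[0])
    out = [[0.0] * n_cols for _ in range(n_rows)]
    for win in tile_windows(n_rows, n_cols, chunk_size):
        r0, r1, c0, c1 = win
        pr0, pr1, pc0, pc1 = pad_window(win, depth, n_rows, n_cols)
        padded = [row[pc0:pc1] for row in grid[pr0:pr1]]  # only the padded tile is "loaded"
        result = func(padded)
        # Remove the padding: keep only the cells that belong to the original tile
        for i in range(r0, r1):
            out[i][c0:c1] = result[i - pr0][c0 - pc0:c1 - pc0]
    return out


grid = [[float(7 * i + (3 * j) % 5) for j in range(10)] for i in range(10)]
full = mean_filter(grid)
tiled_with_overlap = map_overlap(mean_filter, grid, chunk_size=4, depth=1)
tiled_without_overlap = map_overlap(mean_filter, grid, chunk_size=4, depth=0)
```

With `depth=1` the tiled result equals `full`, because every 3x3 neighbourhood of a tile cell lies inside the padded tile. With `depth=0` the cells along tile borders differ, which is the edge effect the padding is meant to avoid.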

Tests

To ensure the correctness of these functions, tests have been implemented:

  • Running map_multiproc with a simple function (Raster.copy()) to verify that tiles are processed correctly and that the raster is not loaded during processing.
  • Verifying that load_raster_tile loads the correct tile.
  • Verifying that padding removal correctly restores tile boundaries without artifacts.
  • Ensuring that multiprocessing runs correctly across different tile sizes and clusters.

Documentation

A documentation page has been written to explain how to use these generic multiprocessing functions, and they have been added to the API.

Example Usage

A basic example demonstrating the usage of map_overlap_multiproc_save with a simple function (Raster.copy()):

# Input raster path
input_raster_path = "path/to/large_raster.tif"

# Define function to apply to the raster
cast_nodata = True
def copy_func(r: RasterType, cast_nodata: bool) -> RasterType:
    return r.copy(cast_nodata=cast_nodata)

# Define config for multiprocessing:
config = MultiprocConfig(
    chunk_size=200,
    outfile="path/to/output_raster.tif",
    cluster=ClusterGenerator("multi", nb_workers=4)
)

# Run multiprocessing with the copy function
# (cast_nodata is forwarded to copy_func; depth=0 means no overlap between tiles)
map_overlap_multiproc_save(
    copy_func,
    input_raster_path,
    config,
    cast_nodata,
    depth=0
)

@rhugonnet
Member

Great! All good for me on the generic implementation, following our discussion in GlacioHack/xdem#704 😉.

I'll just add some line-by-line comments on little things we can adjust to make the naming more interchangeable with Dask, which will be especially useful once we start documenting things!

@vschaffn
Contributor Author

@rhugonnet thanks for your feedback, I have modified the little things you have mentioned in your review 😃

@vschaffn vschaffn force-pushed the multiproc_generic branch from e3aa433 to ea16637 Compare March 26, 2025 13:56
@vschaffn
Contributor Author

@rhugonnet I have made a few changes to adapt map_overlap_multiproc to functions that don't necessarily return a raster.
I just have a problem with mypy when overloading map_overlap_multiproc: the Callable[..., Any] signature also includes Callable[..., RasterType], so I would need to be able to define an "Any but not RasterType" type, which doesn't seem possible in Python. For the moment I've ignored the error. Do you have any ideas?

I have also created a MultiprocConfig class, which we can pass as a parameter to functions where we want a multiprocessing equivalent. If this parameter is not None, the multiprocessing version is activated (like an equivalent of chunk_size in Dask, but with additional parameters).

@rhugonnet
Member

rhugonnet commented Mar 27, 2025

I see... I think you won't be able to overload here, you might have to # type: ignore.
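
For reference, the overlap can be sketched as follows. This is a hypothetical reduction of the problem, not the code in this PR: `Raster` and `map_tiles` are stand-in names, and the exact message and error code depend on the type checker and its version. Since every `Callable[..., Raster]` is also a `Callable[..., Any]`, the two overloads overlap while their return types differ, which is what a checker such as mypy is expected to flag on the first overload.

```python
from typing import Any, Callable, List, Optional, overload


class Raster:
    """Hypothetical stand-in for a raster type."""


@overload
def map_tiles(func: Callable[..., Raster]) -> None: ...  # type: ignore
@overload
def map_tiles(func: Callable[..., Any]) -> List[Any]: ...
def map_tiles(func: Callable[..., Any]) -> Optional[List[Any]]:
    """Runtime dispatch on the result type; the overloads only matter to the type checker."""
    result = func()
    if isinstance(result, Raster):
        return None  # a real implementation would write the raster to disk here
    return [result]


collected = map_tiles(lambda: 42)
```

Splitting the behaviour into two separately named functions, one that saves and one that collects, avoids the overload altogether, at the cost of a slightly larger API surface.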

In Xarray, they use map_overlap/block only for functions working on smaller Xarray objects (see the "Notes" below the function description: https://docs.xarray.dev/en/stable/generated/xarray.map_blocks.html).
For functions working on the array inside of the Xarray objects (not requiring any Xarray metadata), they use apply_ufunc instead: https://docs.xarray.dev/en/stable/generated/xarray.apply_ufunc.html#xarray.apply_ufunc.

Potentially, we could also mirror that structure.

@rhugonnet
Member

And great idea for the MultiprocConfig, it didn't cross my mind! It will simplify the API a lot for multi-processing calls, and make the behaviour easy to explain 🙂

I'm not sure we need to have depth in it?
For instance, if we have DEM.slope(mp_config=), it will be passed to a slope_multiproc() function wrapping map_overlap_multiproc(). This slope_multiproc() should convert the window argument of slope() into the depth needed for the operation.
So I think the user would never have to specify depth through a defined function (only if they call map_overlap directly, but in that case it is a keyword argument)?
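
A rough sketch of that idea, under stated assumptions: `MpConfig` is a hypothetical stand-in for MultiprocConfig without a depth field, `depth_from_window` and `plan_call` are illustrative helpers, and the rule depth = window // 2 assumes a centred kernel with an odd window size.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass
class MpConfig:
    """Hypothetical stand-in for MultiprocConfig (no depth field)."""
    chunk_size: int
    outfile: Optional[str] = None


def depth_from_window(window: int) -> int:
    """Overlap needed so a centred window x window kernel never reads outside a padded tile."""
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd integer")
    return window // 2


def plan_call(window: int = 3, mp_config: Optional[MpConfig] = None) -> Dict[str, Any]:
    """Describe how a call like slope(window=..., mp_config=...) could dispatch."""
    if mp_config is None:
        # No config given: run the regular in-memory version
        return {"mode": "in-memory"}
    # Config given: run tiled, with the depth derived from the operation itself
    return {
        "mode": "tiled",
        "chunk_size": mp_config.chunk_size,
        "depth": depth_from_window(window),
    }


plan = plan_call(window=5, mp_config=MpConfig(chunk_size=200))
```

The user only chooses the window and the config; the depth follows from the operation, so it does not need to live in the config.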

@vschaffn vschaffn force-pushed the multiproc_generic branch 3 times, most recently from 7b4c3c5 to 2a4fad9 Compare March 28, 2025 10:58
@vschaffn vschaffn force-pushed the multiproc_generic branch from 2a4fad9 to 2e2a0a5 Compare March 31, 2025 10:03
@vschaffn vschaffn force-pushed the multiproc_generic branch from ca1af6a to 3083112 Compare April 1, 2025 10:09
@vschaffn
Contributor Author

vschaffn commented Apr 1, 2025

@rhugonnet @adehecq @adebardo I have changed the implementation of the multiprocessing as discussed on 28/03/2025:

  • map_overlap_multiproc has been split into two functions:
    • map_overlap_multiproc_save writes a new raster to disk.
    • map_multiproc_collect returns a list of the requested information for each tile.
  • depth has been removed from MultiprocConfig.
  • A documentation page has been written, and the functions have been added to the API.
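
The "collect" side of this split can be sketched generically as below. This is an illustration rather than the geoutils code: `collect_by_tile`, `tile_max` and the `(window, value)` return format are assumptions, the grid is a nested list held in memory (a real implementation would have each worker load its tile from disk), and the `mapper` argument defaults to the built-in `map` so the sketch runs without spawning processes.

```python
from typing import Any, Callable, List, Tuple

Grid = List[List[float]]
# (row_start, row_end, col_start, col_end), end-exclusive
Window = Tuple[int, int, int, int]


def tile_windows(n_rows: int, n_cols: int, chunk_size: int) -> List[Window]:
    """Split a grid into tiles of at most chunk_size x chunk_size."""
    return [
        (r, min(r + chunk_size, n_rows), c, min(c + chunk_size, n_cols))
        for r in range(0, n_rows, chunk_size)
        for c in range(0, n_cols, chunk_size)
    ]


def _apply_to_tile(task: Tuple[Callable[[Grid], Any], Grid]) -> Any:
    """Worker entry point: unpack one task and apply the function to its tile."""
    func, tile = task
    return func(tile)


def collect_by_tile(
    func: Callable[[Grid], Any],
    grid: Grid,
    chunk_size: int,
    mapper: Callable[..., Any] = map,
) -> List[Tuple[Window, Any]]:
    """Apply func to every tile and return one (window, value) pair per tile."""
    windows = tile_windows(len(grid), len(grid[0]), chunk_size)
    tasks = [(func, [row[c0:c1] for row in grid[r0:r1]]) for r0, r1, c0, c1 in windows]
    results = list(mapper(_apply_to_tile, tasks))
    return list(zip(windows, results))


def tile_max(tile: Grid) -> float:
    """Example statistic: maximum value of a tile."""
    return max(max(row) for row in tile)


grid = [[float(6 * i + j) for j in range(6)] for i in range(6)]
stats = collect_by_tile(tile_max, grid, chunk_size=3)
```

To run the tiles in worker processes, one could pass `mapper=pool.map` from a `multiprocessing.Pool`, provided `func` is defined at module level so it can be pickled.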

@rhugonnet
Member

rhugonnet commented Apr 1, 2025

Perfect, thanks!

For the documentation page:
As this is an early version of the API for these functions that is likely to evolve quickly, I'm not sure we're fully ready to document those yet. We've seen it with the evolving needs in xDEM, and there are still many cases we'll need and haven't covered yet that will trigger changes (loading multiple rasters at once on the same tile, or returning multiple rasters for one operation).
So I propose to keep multiprocessing.md as a draft without rendering it yet, and to hold off on showing the changes in api.md and index.md a bit longer, until we've finalized the various multiprocessing features across GeoUtils-xDEM and we know the API won't change.

Other small remarks:

  • All internal functions not planned to be listed in the API should be non-public and preceded by an underscore, like _apply_func_block.
  • (I mentioned this in Slack) Our documentation is written in British lower-case capitalization style everywhere. We should keep it consistent, so "Example: Applying a Raster Filter" should become "Example: applying a raster filter".

On this last point, we could think of defining a STYLE.md or equivalent, to clearly list the rules there and agree on it all together 😉. And we could apply the same style for the few things that show on the GitHub page of the package (Releases, PR titles).

@vschaffn vschaffn force-pushed the multiproc_generic branch 3 times, most recently from fcb9648 to 37dc50e Compare April 1, 2025 13:04
@vschaffn vschaffn force-pushed the multiproc_generic branch from 37dc50e to f4c2f27 Compare April 1, 2025 13:13
@adebardo adebardo merged commit 1bbfddd into GlacioHack:main Apr 2, 2025
16 checks passed
@adebardo
Contributor

adebardo commented Apr 2, 2025

Thanks for the work to both of you :)


Development

Successfully merging this pull request may close these issues.

Add support for generic multiprocessing functions for raster operations in geoutils
