Issue #1445 fix performance big model #1446

JoerivanEngelen · 2025-02-21T10:54:54Z

It turned out this performance issue stemmed from xr.DataArray().isin(pd.Series()) being very slow for large datasets, whereas pd.Series().isin(pd.Series()) was a lot faster.
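A minimal sketch of how the two calls could be timed side by side. The array size and the string ids are assumptions made for illustration, not the data from the issue, and the timings will differ per machine and per library version, so no numbers are claimed here.

```python
import time

import numpy as np
import pandas as pd
import xarray as xr

# Synthetic well ids; size and string dtype are illustrative assumptions.
n = 5_000
ids = np.array([f"well_{i}" for i in range(n)])
da = xr.DataArray(ids, dims="index")
candidates = pd.Series(ids[::2])

# Variant 1: xarray's isin, given a pandas Series as test elements.
start = time.perf_counter()
mask_xr = da.isin(candidates)
elapsed_xr = time.perf_counter() - start

# Variant 2: convert to a pandas Series first, then use pandas' isin.
start = time.perf_counter()
mask_pd = da.to_series().isin(candidates)
elapsed_pd = time.perf_counter() - start

print(f"xarray isin: {elapsed_xr:.4f} s, pandas isin: {elapsed_pd:.4f} s")
```

Both variants should produce the same boolean mask; only the time taken is being compared.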

Description

Fix performance issue when writing large MODFLOW6 WEL packages
Add a test that checks this performance; I had to add a small pytest plugin for this.

Checklist

Links to correct issue
Update changelog, if changes affect users
PR title starts with Issue #nr, e.g. Issue #737
Unit tests were added
If feature added: Added/extended example

Huite · 2025-02-21T11:15:28Z

        filtered_well_ids = [
            id
            for id in well_data["id"].unique()
            if id not in well_data_filtered["id"].values
        ]

https://www.tumblr.com/accidentallyquadratic

?
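For context, a small sketch of why the quoted list comprehension looks quadratic: `id not in array` scans the whole array for every id. The function names and the sample ids are hypothetical, and whether this pattern was the real bottleneck in this PR is a separate question.

```python
import numpy as np


def filtered_ids_quadratic(all_ids, kept_ids):
    # `i not in kept_ids` compares i against every element of the array,
    # so the total cost grows with len(unique ids) * len(kept_ids).
    return [i for i in np.unique(all_ids) if i not in kept_ids]


def filtered_ids_set(all_ids, kept_ids):
    # Build a set once; each membership test is then O(1) on average.
    kept = set(kept_ids.tolist())
    return [i for i in np.unique(all_ids).tolist() if i not in kept]


all_ids = np.array(["a", "b", "c", "d", "b"])
kept_ids = np.array(["a", "c"])
print(filtered_ids_set(all_ids, kept_ids))  # ids dropped by the filter
```

Both functions return the ids that are in `all_ids` but not in `kept_ids`; they differ only in how membership is tested.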

Manangka · 2025-02-21T11:14:25Z

imod/mf6/wel.py

-        ]
-        return filtered_well_ids
+        # Work around performance issue with xarray isin for large datasets.
+        if isinstance(well_data_filtered, xr.Dataset):


Do you need to convert it to a dataframe? Can't you directly access the values with well_data_filtered["id"].values?

Or if it needs to be a dataframe, wouldn't it be better to construct it like
pd.DataFrame(well_data_filtered["id"].values)?

The function now accepts both xr.Dataset and pd.Series. It is called twice in the _to_mf6_pkg method, once with a pd.Series and once with an xr.Dataset.

The problem with using .values is that there is a difference between xarray DataArrays and pandas Series: xarray has a .values method, whereas pandas has a values property. I therefore thought calling to_dataframe made it more explicit that we are converting an xarray DataArray to a pandas DataFrame.
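A rough sketch of the approach described here, not the actual code in imod/mf6/wel.py. The helper name `ids_removed_by_filter` is hypothetical, and a pandas DataFrame is assumed as the pandas-side input so that the `["id"]` lookup works the same way for both types.

```python
import pandas as pd
import xarray as xr


def ids_removed_by_filter(well_data, well_data_filtered):
    """Hypothetical helper: ids in well_data that are absent from well_data_filtered."""
    # Convert xarray input explicitly, so the isin below is pandas' isin.
    if isinstance(well_data, xr.Dataset):
        well_data = well_data.to_dataframe()
    if isinstance(well_data_filtered, xr.Dataset):
        well_data_filtered = well_data_filtered.to_dataframe()
    unique_ids = pd.Series(well_data["id"].unique())
    is_kept = unique_ids.isin(well_data_filtered["id"])
    return unique_ids[~is_kept].tolist()


well_data = pd.DataFrame({"id": ["w1", "w2", "w3", "w2"]})
filtered_ds = xr.Dataset({"id": ("index", ["w1", "w3"])})
print(ids_removed_by_filter(well_data, filtered_ds))
```

The same call works when the filtered data is already a DataFrame, since the conversion branch is then skipped.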

Manangka · 2025-02-21T11:16:46Z

pixi.toml

@@ -128,6 +128,7 @@ vtk = { version = ">=9.0", build = "*qt*" }
 xarray = ">=2023.08.0"
 xugrid = ">=0.11.0"
 zarr = "*"
+pytest-timeout = ">=2.3.1,<3"


What does this do?

Sorry, I should have elaborated more in the description. This pytest plugin allows you to set a timeout on tests. If changes result in this test becoming slow again, it will show. This avoids having unit tests that might run for an hour and then crash TeamCity, after which the problem is less clear.
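As an illustration, a timeout can be attached to a single test with the `timeout` marker that pytest-timeout provides. The test name, the 10 second limit and the test body below are made up for this sketch and are not the test added in this PR.

```python
import numpy as np
import pandas as pd
import pytest


@pytest.mark.timeout(10)  # limit in seconds; the value is an arbitrary assumption
def test_isin_on_many_ids_is_fast():
    # Stand-in workload: membership test on a large pandas Series.
    ids = pd.Series(np.arange(100_000))
    mask = ids.isin(ids[::2])
    assert int(mask.sum()) == 50_000
```

When run under pytest with pytest-timeout installed, a test that exceeds its limit is interrupted and reported instead of running indefinitely.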

JoerivanEngelen · 2025-02-21T12:01:53Z

        filtered_well_ids = [
            id
            for id in well_data["id"].unique()
            if id not in well_data_filtered["id"].values
        ]

https://www.tumblr.com/accidentallyquadratic

?

Yeah I first thought this was the issue, but it turned out the conversion to a pandas series was essential to get proper performance. Calling xarray.DataArray.isin(pd.Series) was just as slow as the original solution.

sonarqubecloud · 2025-02-21T12:51:38Z

Quality Gate passed

Issues
0 New issues
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarQube Cloud

JoerivanEngelen added 6 commits February 21, 2025 10:02

Add test to reproduce performance problem

f29e574

Add pytest mark timeout

eee598d

Fix performance issue with xarray isin

647c3b6

Bump array size for extra stress test

5a7043f

format

71b954b

Update changelog

9a04647

JoerivanEngelen requested a review from Manangka February 21, 2025 10:54

Manangka approved these changes Feb 21, 2025

Fix broken unittests

b1b2f21

JoerivanEngelen enabled auto-merge February 21, 2025 12:50

JoerivanEngelen added this pull request to the merge queue Feb 21, 2025

Merged via the queue into master with commit dd96909 Feb 21, 2025
7 checks passed

JoerivanEngelen deleted the issue_#1445_fix_performance_big_model branch February 21, 2025 14:20
