Issue #1445 fix performance big model #1446

Merged 7 commits on Feb 21, 2025

Conversation

JoerivanEngelen
Contributor

@JoerivanEngelen JoerivanEngelen commented Feb 21, 2025

Fixes #1445

It turned out this performance issue stemmed from xr.DataArray().isin(pd.Series()) being very slow for large datasets, whereas pd.Series().isin(pd.Series()) was a lot faster.

Description

  • Fix performance issue when writing large MODFLOW6 WEL packages
  • Add a test that checks this performance; I had to add a small pytest plugin (pytest-timeout) for this.

Checklist

  • Links to correct issue
  • Update changelog, if changes affect users
  • PR title starts with Issue #nr, e.g. Issue #737
  • Unit tests were added
  • If feature added: Added/extended example

@Huite
Contributor

Huite commented Feb 21, 2025

        filtered_well_ids = [
            id
            for id in well_data["id"].unique()
            if id not in well_data_filtered["id"].values
        ]

https://www.tumblr.com/accidentallyquadratic

?
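To illustrate the concern behind that link: a membership test against an array inside a comprehension scans the array once per id, so the cost grows with the product of both lengths. A minimal sketch of the usual remedy, building a set once, is below. The function name `ids_missing_from` and the plain-list inputs are hypothetical and only serve the illustration; this is not the code from the PR.

```python
def ids_missing_from(all_ids, kept_ids):
    """Return the unique ids in all_ids that do not occur in kept_ids.

    Hypothetical helper: the name and the plain-list inputs are assumptions.
    """
    kept = set(kept_ids)  # built once; membership tests are O(1) on average
    seen = set()          # tracks ids already reported, to keep results unique
    missing = []
    for well_id in all_ids:
        if well_id not in kept and well_id not in seen:
            seen.add(well_id)
            missing.append(well_id)
    return missing


if __name__ == "__main__":
    print(ids_missing_from(["a", "b", "a", "c", "d"], ["a", "c"]))
```

The result keeps the order of first appearance, similar to what iterating over `unique()` would give.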

]
return filtered_well_ids
# Work around performance issue with xarray isin for large datasets.
if isinstance(well_data_filtered, xr.Dataset):
Collaborator

Do you need to convert it to a dataframe? Can't you access the values directly via well_data_filtered["id"].values?

Or, if it needs to be a dataframe, wouldn't it be better to construct it like
pd.DataFrame(well_data_filtered["id"].values)?

Contributor Author

The function now accepts both an xr.Dataset and a pd.Series. It is called twice in the _to_mf6_pkg method, once with a pd.Series and once with an xr.Dataset.

The problem with using .values is that there is a difference between xarray DataArrays and pandas Series: xarray has a .values method, whereas pandas has a values property. I therefore thought calling to_dataframe made it more explicit that we are converting an xarray DataArray to a pandas DataFrame.
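A rough sketch of the workaround discussed here, for readers who want to see its shape: convert the xarray data to pandas first, then let pandas do the `isin` comparison. The function name `derive_filtered_well_ids`, the use of a DataFrame with an "id" column on the pandas side, and the toy data are assumptions for illustration; the actual implementation in the PR may differ.

```python
import pandas as pd
import xarray as xr


def derive_filtered_well_ids(well_data, well_data_filtered):
    """Return ids present in well_data but absent from well_data_filtered.

    Hypothetical helper. Assumes well_data is a DataFrame with an "id"
    column and well_data_filtered is either such a DataFrame or an
    xr.Dataset with an "id" variable.
    """
    if isinstance(well_data_filtered, xr.Dataset):
        # Work around slow isin on xarray objects: go through pandas.
        filtered_ids = well_data_filtered["id"].to_dataframe()["id"]
    else:
        filtered_ids = well_data_filtered["id"]
    ids = pd.Series(well_data["id"].unique())
    # pandas-to-pandas isin, the combination reported as fast in this PR
    return ids[~ids.isin(filtered_ids)].tolist()


if __name__ == "__main__":
    wells = pd.DataFrame({"id": [1, 1, 2, 3]})
    kept = xr.Dataset({"id": ("index", [1, 3])})
    print(derive_filtered_well_ids(wells, kept))
```

The explicit `to_dataframe()` call keeps the conversion visible at the call site, which is the readability argument made above.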

@@ -128,6 +128,7 @@ vtk = { version = ">=9.0", build = "*qt*" }
xarray = ">=2023.08.0"
xugrid = ">=0.11.0"
zarr = "*"
pytest-timeout = ">=2.3.1,<3"
Collaborator

What does this do?

Contributor Author

Sorry, I should have elaborated more in the description. This pytest plugin allows you to set a timeout on tests. If a change makes this test slow again, it will show up as a failing test. This avoids having unit tests that might run for an hour and then crash TeamCity, after which the problem is harder to diagnose.
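For reference, a minimal sketch of how a timeout is typically attached to a single test with pytest-timeout. The test name, the 5 second limit, and the placeholder workload are assumptions for illustration, not the test added in this PR.

```python
import time

import pytest


@pytest.mark.timeout(5)  # with pytest-timeout installed, fail after 5 seconds
def test_write_large_well_package_is_fast():
    # Placeholder workload standing in for writing a large WEL package.
    start = time.perf_counter()
    total = sum(range(100_000))
    elapsed = time.perf_counter() - start
    assert total == 4_999_950_000
    assert elapsed < 5
```

When run through pytest with the plugin installed, a test that exceeds the limit is reported as failed instead of blocking the whole run.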

@JoerivanEngelen
Contributor Author

        filtered_well_ids = [
            id
            for id in well_data["id"].unique()
            if id not in well_data_filtered["id"].values
        ]

https://www.tumblr.com/accidentallyquadratic

?

Yeah, I first thought this was the issue, but it turned out the conversion to a pandas Series was essential to get proper performance. Calling xarray.DataArray.isin(pd.Series) was just as slow as the original solution.

@JoerivanEngelen JoerivanEngelen added this pull request to the merge queue Feb 21, 2025
Merged via the queue into master with commit dd96909 Feb 21, 2025
7 checks passed
@JoerivanEngelen JoerivanEngelen deleted the issue_#1445_fix_performance_big_model branch February 21, 2025 14:20
Successfully merging this pull request may close these issues.

[Bug] - Poor performance deriving dummy MODFLOW6 package for MetaSWAP coupling for big model
3 participants