Issue #1445 fix performance big model #1446
filtered_well_ids = [
    id
    for id in well_data["id"].unique()
    if id not in well_data_filtered["id"].values
]
https://www.tumblr.com/accidentallyquadratic ?
]
return filtered_well_ids
# Work around performance issue with xarray isin for large datasets.
if isinstance(well_data_filtered, xr.Dataset):
Do you need to convert it to a dataframe? Can't you directly access the values via well_data_filtered["id"].values?
Or, if it needs to be a dataframe, wouldn't it be better to construct it like pd.DataFrame(well_data_filtered["id"].values)?
The function now accepts both xr.Dataset and pd.Series. It is called twice in the _to_mf6_pkg method, once with a pd.Series and once with an xr.Dataset.
The problem with using .values is that there is a difference between xarray DataArrays and pandas Series: xarray has a .values method, whereas pandas has a values property. I therefore thought calling to_dataframe made it more explicit that we are converting an xarray DataArray to a pandas DataFrame.
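As an illustration of the explicit conversion discussed above, here is a minimal sketch of normalizing either input type to a pandas Series before any membership test. The helper name ids_as_series and the sample data are hypothetical and are not taken from the PR; the sketch assumes the "id" variable is one-dimensional.

```python
import pandas as pd
import xarray as xr


def ids_as_series(well_data_filtered):
    """Return the "id" values as a pandas Series (hypothetical helper)."""
    if isinstance(well_data_filtered, xr.Dataset):
        # Explicit xarray -> pandas conversion, so later calls are pandas calls.
        return well_data_filtered["id"].to_dataframe()["id"]
    # Assumed here: pandas input already supports ["id"] lookup (e.g. a DataFrame).
    return pd.Series(well_data_filtered["id"])


# Hypothetical sample inputs of both kinds.
ds = xr.Dataset({"id": ("index", [3, 1, 2])})
df = pd.DataFrame({"id": [3, 1]})

ids_from_dataset = ids_as_series(ds)
ids_from_frame = ids_as_series(df)
```

Either way the caller ends up with a pandas Series, so the rest of the function does not need to know which type it received.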
@@ -128,6 +128,7 @@ vtk = { version = ">=9.0", build = "*qt*" }
xarray = ">=2023.08.0"
xugrid = ">=0.11.0"
zarr = "*"
pytest-timeout = ">=2.3.1,<3"
What does this do?
Sorry, I should have elaborated more in the description. This pytest plugin allows you to set a timeout on tests. If future changes make this test slow again, the timeout will flag it. This avoids having unit tests that run for an hour and then crash TeamCity, after which the cause of the problem is much less clear.
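For readers unfamiliar with the plugin, a minimal sketch of a per-test timeout follows. The test name, the 300-second limit, and the placeholder workload are hypothetical and not taken from the PR; the marker only takes effect when pytest-timeout is installed.

```python
import pytest


@pytest.mark.timeout(300)  # with pytest-timeout installed, fail the test after 300 s
def test_large_model_is_fast():
    # Placeholder workload standing in for the real large-model conversion.
    total = sum(range(100_000))
    assert total == 4_999_950_000
```

A project-wide default can also be set with the plugin's --timeout command-line option or the timeout setting in the pytest configuration.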
Yeah, I first thought this was the issue, but it turned out the conversion to a pandas Series was essential to get proper performance. Calling xarray.DataArray.isin(pd.Series) was just as slow as the original solution.
Fixes #1445
Description
It turned out this performance issue stemmed from calling xr.DataArray().isin(pd.Series()), which was very slow for large datasets, whereas pd.Series().isin(pd.Series()) was a lot faster.
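The pandas-only pattern described above could look roughly like the sketch below. The function name filtered_out_ids and the sample data are hypothetical, and this is not the PR's actual implementation; it only shows one vectorized membership test replacing a per-element "not in" scan over an array.

```python
import pandas as pd


def filtered_out_ids(well_data: pd.DataFrame, well_data_filtered: pd.DataFrame) -> list:
    """Return ids present in well_data but missing from well_data_filtered."""
    ids = pd.Series(well_data["id"].unique())
    # One vectorized membership test instead of scanning the filtered ids
    # once per candidate id, which is what made the original loop quadratic.
    is_kept = ids.isin(well_data_filtered["id"])
    return ids[~is_kept].tolist()


# Hypothetical sample data: well "w2" was filtered out.
well_data = pd.DataFrame({"id": ["w1", "w1", "w2", "w3"]})
well_data_filtered = pd.DataFrame({"id": ["w1", "w3"]})
removed_ids = filtered_out_ids(well_data, well_data_filtered)
```

With xarray input, the same approach would apply after first converting the "id" variable to pandas, as the description notes.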
Checklist
Issue #nr, e.g. Issue #737