[MRG] BUG avoid memmaping large dataframe in permutation_importance #15898
Conversation
Please add a |Fix| entry to the change log at doc/whats_new/v0.22.rst. Like the other entries there, please reference this pull request with :pr: and credit yourself (and other contributors if applicable) with :user:.
Thanks! I think it could be an acceptable workaround; however, the initial issue is that we share data between processes and then perform in-place modification of that data. Triggering inter-process serialization is one way around it; another would be to keep using mmap and trigger a copy manually. Ideally there may be a way to make a copy of the dataframe where only one column (the changed one) is a copy and the other ones are still views. Also, I'm not fully sure how the improvements to inter-process serialization in Python 3.8 would impact the choice of solution. Any other thoughts @jeremiedbb @thomasjpfan?
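A minimal sketch of the second option (keep the memmap and trigger a copy manually inside the worker), with a hypothetical helper name; this is an illustration, not the change made in this PR:

```python
import numpy as np
import pandas as pd

def _shuffle_column_on_private_copy(X, col_idx, rng):
    """Shuffle one column without ever writing to the (possibly read-only) input X."""
    X_work = X.copy()                         # process-local, writable copy
    shuffled = X_work.iloc[:, col_idx].to_numpy()
    rng.shuffle(shuffled)                     # permute the copied values
    X_work.iloc[:, col_idx] = shuffled
    return X_work

df = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [4.0, 5.0, 6.0]})
permuted = _shuffle_column_on_private_copy(df, 0, np.random.RandomState(0))
print(df["a"].tolist())        # original is untouched
print(permuted["a"].tolist())  # only column "a" differs in the copy
```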
Thanks @thomasjpfan
Since it only seems to fail on DataFrames, would it be an idea to conditionally turn off memmapping based on whether or not the input is a dataframe?
@andersbogsnes I have added the check for DataFrame instance now. Please take a look; I don't think a check for the label column is necessary.
@andersbogsnes aren't we checking for the same thing by checking the type of input X in line 120?
That doesn't allow for duck typing, so if I were to subclass a DataFrame (or use another DataFrame-like library), the check would miss it.
Updated the check to be based on an attribute of the DataFrame class.
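For illustration, a tiny sketch of what an attribute-based (duck-typed) check can look like; the attribute chosen here (`iloc`) is an assumption for this example, not necessarily the one used in the diff:

```python
import numpy as np
import pandas as pd

def _looks_like_dataframe(X):
    """True for pandas DataFrames and other objects exposing a DataFrame-like API."""
    return hasattr(X, "iloc")

print(_looks_like_dataframe(pd.DataFrame({"a": [1, 2]})))  # True
print(_looks_like_dataframe(np.zeros((2, 2))))             # False
```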
Thanks, @andersbogsnes, for the help! Should we have a way to include such costly test cases, but ordinarily skip them?
I would always recommend having a test written for a known bug - that's how I caught this error when we swapped our implementation for scikit-learn's - but there is a cost, of course. The failing example above is fairly cheap though, as the cost is linear per example. Does the test suite have a mechanism for that? Alternatively, the root cause is that the implementation modifies in-place; wondering out loud if there is a way to test for that instead, which might be cheaper.
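One way to phrase such a check, as a rough sketch only (estimator, sizes, and the assertion style are arbitrary here, and this is not the test that ended up in scikit-learn), is to assert that the caller's DataFrame comes back unchanged:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.inspection import permutation_importance

X, y = make_classification(n_samples=200, n_features=5, random_state=0)
df = pd.DataFrame(X)
df_original = df.copy()

clf = DummyClassifier(strategy="prior").fit(df, y)
permutation_importance(clf, df, y, n_repeats=2, random_state=0)

# The caller's input must not have been modified in place.
pd.testing.assert_frame_equal(df, df_original)
```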
Yes, I think it should be fine to add that test. It's not so slow, and we can make the RF fitting faster (by using iris without repetition, for example).
Well, you need the repetition to hit the 1M limit where joblib turns on the memmap - I didn't do any calculations on where that was, just that 100 wasn't enough but 1000 worked :-)
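For reference, a quick back-of-the-envelope check of where that threshold lies (joblib's default `max_nbytes` is `'1M'`, i.e. roughly one million bytes):

```python
import numpy as np
from sklearn.datasets import load_iris

X = load_iris().data                  # 150 x 4 float64 -> 4800 bytes
print(X.repeat(100, axis=0).nbytes)   # 480_000 bytes: under the 1M limit, no memmap
print(X.repeat(1000, axis=0).nbytes)  # 4_800_000 bytes: over the limit, memmapped
```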
For this specific case, the following (smallish) example also fails on master:

```python
import pandas as pd

from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.inspection import permutation_importance

X, y = make_classification(n_samples=7000)
df = pd.DataFrame(X)

clf = DummyClassifier(strategy='prior')
clf.fit(df, y)

r = permutation_importance(clf, df, y, n_jobs=2)
```

We could make this into a test.
We should also check the output:

```python
def test_permutation_importance_memmapping_dataframe():
    pd = pytest.importorskip("pandas")
    X, y = make_classification(n_samples=7000)
    df = pd.DataFrame(X)
    clf = HistGradientBoostingClassifier(max_iter=5, random_state=42)
    clf.fit(df, y)
    importances_parallel = permutation_importance(
        clf, df, y, n_jobs=2, random_state=0, n_repeats=2
    )
    importances_sequential = permutation_importance(
        clf, df, y, n_jobs=1, random_state=0, n_repeats=2
    )
    assert_allclose(
        importances_parallel['importances'],
        importances_sequential['importances']
    )
```

(we have something weird happening with the …)
To make the test suggested in #15898 (comment) faster yet not too trivial, you can use … Or alternatively, use a regression problem with … and a …
The issue with different results with …
I agree
I'm +1 for closing. @shivamgargsya This is a bit unfortunate, but we discovered the underlying bug because of your work in this PR. Don't feel like it was a wasted effort.
Thanks for your contribution @shivamgargsya! Closing.
Reference Issues/PRs
Fixes #15810
What does this implement/fix? Explain your changes.
When using permutation_importance with a large enough pandas DataFrame and n_jobs > 0, joblib switches to read-only memmap mode, which then raises, because permutation_importance tries to assign to the DataFrame.
The bug was fixed by setting max_nbytes to None.
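As an illustration of what that setting changes (a self-contained sketch, not this PR's diff): joblib's `Parallel` auto-memmaps arguments larger than `max_nbytes` and opens the memmap read-only, while `max_nbytes=None` disables this so workers receive ordinary, writable copies.

```python
import numpy as np
from joblib import Parallel, delayed

def write_first_cell(arr):
    arr[0, 0] = 1.0          # raises if arr is a read-only memmap
    return type(arr).__name__

big = np.zeros((2000, 100))  # ~1.6 MB, above joblib's default 1 MB threshold

# With the default max_nbytes ('1M'), workers may receive a read-only
# numpy.memmap, and the assignment above can fail with
# "assignment destination is read-only":
# Parallel(n_jobs=2)(delayed(write_first_cell)(big) for _ in range(2))

# With max_nbytes=None, auto-memmapping is disabled and the call succeeds:
print(Parallel(n_jobs=2, max_nbytes=None)(
    delayed(write_first_cell)(big) for _ in range(2)
))
```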
Any other comments?
Verified using the snippet below:

```python
import pandas as pd

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

data = load_iris()
df = pd.DataFrame(data=data.data.repeat(1000, axis=0))
y = data.target.repeat(1000)

clf = RandomForestClassifier()
clf.fit(df, y)

r = permutation_importance(clf, df, y, n_jobs=-1)
```