FIX fix pickling for empty object with Python 3.11+ #25188
As discussed with @adrinjalali
Since Python 3.11, objects have a __getstate__ method by default: python/cpython#70766
Therefore, the AttributeError that BaseEstimator.__getstate__ relied on is no longer raised, so the code never falls back to the object's __dict__: https://github.com/scikit-learn/scikit-learn/blob/dc580a8ef5ee2a8aea80498388690e2213118efd/sklearn/base.py#L274-L280
If the instance dict of the object is empty, however, the return value is None, so the following line, which calls state.items(), raises an error.
This bugfix checks whether the state is None and, if it is, uses the object's __dict__ instead (which should always be empty in that case).
Not addressed in this PR is how to deal with slots (see also the discussion in scikit-learn#10079). When there are __slots__, __getstate__ actually returns a tuple, as documented here: https://docs.python.org/3/library/pickle.html#object.__getstate__ The user would thus still get a non-descriptive error message.
It's interesting that now if a class has __slots__, it also has to implement __getstate__.
I think it makes sense to handle __slots__, now that it's influencing what's returned by __getstate__. But I'm not sure if it needs to be in this PR or a different one.
I reckon this is a major enough change to warrant a @scikit-learn/core-devs ping.
Also needs a whatsnew entry.
sklearn/base.py
Outdated
# TODO: Remove once Python < 3.11 is dropped, as there will never be
# an AttributeError
This would still be raised for C extension types which don't inherit from PyObject though, right? One can have a C extension type w/o __reduce__, for instance. It doesn't make much sense, but one can do that.
Yes, right, I'll remove the comment.
For an end to end test, could you please add a test where |
Co-authored-by: Adrin Jalali <[email protected]>
Yeah, I had that and removed it again :) Added it back. |
I'm happy with this PR, especially since we're not changing behavior from what we already have in main; but I think there are cases where our behavior differs from Python's default __getstate__, and it would be nice for the two to diverge less.
Note to reviewers, the circleci job doesn't work for @BenjaminBossan 's PRs on this repo somehow.
@@ -273,6 +273,8 @@ def __repr__(self, N_CHAR_MAX=700):
     def __getstate__(self):
         try:
             state = super().__getstate__()
+            if state is None:
+                state = self.__dict__.copy()
Python seems to be making a distinction between a __dict__ attached to the instance, and a __dict__ attached to the class. I'm not sure if we should do the same:
>>> class A:
... __dict__ = {"a": 10}
...
>>> a = A()
>>> a.__dict__
{'a': 10}
>>> a.__getstate__()
>>> a.__dict__.update({"b": 3})
>>> a.__dict__
{'a': 10, 'b': 3}
>>> a.__getstate__()
>>> class B:
... ...
...
>>> b = B()
>>> b.b = 10
>>> b.__dict__
{'b': 10}
>>> b.__getstate__()
{'b': 10}
Having a __dict__ as class attribute results in some (for me) quite surprising behavior. Not sure if that should be supported and if yes, how to differentiate that here.
class F:
    __dict__ = {"x": 0}
    def __init__(self):
        self.y = 1
f = F()
print(f"{f.__getstate__()=}")  # {'y': 1}
print(f"{f.__dict__=}")  # {'x': 0}
print(f"{'x' in dir(f)=}")  # True
print(f"{'y' in dir(f)=}")  # False
f.y  # works
f.x  # AttributeError: 'F' object has no attribute 'x'
python seems to be making a distinction between a __dict__ attached to the instance, and a __dict__ attached to the class.
Class attributes and instance attributes are different, which is why you see what you are seeing. While this is maybe not a super useful answer, as it is just stating a fact about how Python works, I don't understand what you are trying to do that leads you down the path of poking around at this level. Can you give an example or use-case? Would help with giving a more helpful answer.
Reading more about it here (https://wiki.python.org/moin/UsingSlots) it seems that
class C:
    __slots__ = ('a',)
    def __init__(self):
        setattr(self, 'b', 10)
C()
AttributeError: 'C' object has no attribute 'b'
EDIT: |
I was trying to find a case where you run into this problem. But I couldn't quickly find an estimator that is broken without this change. Do you know one or maybe elaborate when you hit this bug? |
I stumbled upon this first with
>>> from sklearn.preprocessing import KernelCenterer
>>> import pickle
>>> pickle.dumps(KernelCenterer())
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File ".../sklearn/base.py", line 280, in __getstate__
    return dict(state.items(), _sklearn_version=__version__)
                ^^^^^^^^^^^
AttributeError: 'NoneType' object has no attribute 'items'
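The failure above can be reproduced without scikit-learn. The following is a minimal sketch, assuming a hypothetical stand-in class `PreFixBase` that mirrors the pre-fix `__getstate__` logic and a made-up `FAKE_VERSION` constant in place of `sklearn.__version__`; it is not the library's actual code. The helper `try_getstate` is also hypothetical and only exists so the snippet runs on any Python version.

```python
import sys

FAKE_VERSION = "0.0"  # hypothetical stand-in for sklearn.__version__


class PreFixBase:
    """Hypothetical stand-in mirroring the pre-fix __getstate__ logic."""

    def __getstate__(self):
        try:
            # Python 3.11+: object.__getstate__ exists and returns None
            # when the instance dict is empty.
            state = super().__getstate__()
        except AttributeError:
            # Python < 3.11: object has no __getstate__.
            state = self.__dict__.copy()
        # Raises AttributeError if state is None.
        return dict(state.items(), _version=FAKE_VERSION)


def try_getstate(obj):
    """Return the state, or the error message if __getstate__ fails."""
    try:
        return obj.__getstate__()
    except AttributeError as exc:
        return str(exc)


outcome = try_getstate(PreFixBase())
print(sys.version_info[:2], outcome)
```

On Python 3.11 or later the outcome should be the `'NoneType' object has no attribute 'items'` message; on older versions the fallback to `__dict__` still applies and a state dict comes back.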
@@ -273,6 +273,8 @@ def __repr__(self, N_CHAR_MAX=700):
     def __getstate__(self):
         try:
             state = super().__getstate__()
+            if state is None:
+                state = self.__dict__.copy()
One thing that puzzles me here is that the docs say that __getstate__() will return None if there is no instance __dict__. So how can we end up going down this if branch and then try to access self.__dict__? I've thought about this for a few minutes now but I can't work out when this would work, I'd expect that if we go down this path, then there is no self.__dict__, and so we can't access it here. So puzzled.
Yes, I have been thoroughly confused by the documentation as well. What I think it doesn't mention, which is relevant here, is that if the object does not define __getstate__ and if its instance dict is empty (but exists), it will also return None. This is the case we cover here.
>>> sys.version
3.11.0
>>> class A:
... pass
>>> print(A().__getstate__())
None
>>> A().__dict__
{}
Perhaps that's what they mean by:
For a class that has no instance __dict__ and no __slots__, the default state is None.
but I'm not sure.
The other cases are indeed not covered, so this code can still break in very specific circumstances.
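To make the "empty but existing" case concrete, here is a small sketch. The helper `inspect_state` is hypothetical and written only for this illustration; it reports whether an instance has a `__dict__`, whether that dict is empty, and what `object.__getstate__` returns where it is available (Python 3.11+).

```python
import sys


class A:
    pass


def inspect_state(obj):
    """Return (has_instance_dict, dict_is_empty, default_state).

    default_state is the string "unavailable" on Python < 3.11, where
    object does not define __getstate__.
    """
    has_dict = hasattr(obj, "__dict__")
    is_empty = has_dict and not obj.__dict__
    getstate = getattr(object, "__getstate__", None)
    state = getstate(obj) if getstate is not None else "unavailable"
    return has_dict, is_empty, state


empty = A()
filled = A()
filled.x = 1

print(inspect_state(empty))
print(inspect_state(filled))
```

The point is that the instance dict exists in both cases; only its emptiness decides whether the default state is `None` or the dict.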
I'm glad I'm not the only one puzzled by this. For me "has no instance __dict__" sounds like "it doesn't exist" not "it exists but is empty". So yeah, confused.com
Could state = self.__dict__.copy() ever be anything other than an empty dict? And if not, we could change it to state = {}, no?
I'm wondering about C extension types which don't necessarily have to comply with python object conventions.
This part of our code seems to be quite buggy if we consider all sorts of things people could do with __dict__ and __slots__ both on the instance and on the class (I know they shouldn't lol). But I'm okay with not supporting all those odd usecases.
Could state = self.__dict__.copy() ever be anything other than an empty dict? And if not, we could change it to state = {}, no?
I wondered about this too. Probably, it would always be an empty dict. If there is ever a situation (maybe Python 3.12?) where it's not an empty dict, would it be better to return that non-empty dict or {}? I thought the former, so I chose this solution. Not sure what other considerations to make here, runtime performance shouldn't matter much.
But I'm okay with not supporting all those odd usecases.
At least as a first step, we could focus on making the existing sklearn estimators run again. I used the skops persistence tests, which should cover __getstate__ of all estimators, and in addition to KernelCenterer, I also found LabelEncoder, which is a big one. Both of these don't have any C extensions, right?
There could be more estimators that I'm missing because they're masked by other errors in the test suite that occur due to Python 3.11.
@betatim this is in the context of skops's persistence model, which makes use of |
@BenjaminBossan might make sense to run our |
Yes, in theory that would be a good test. However, we have a few other errors in the same tests when using Python 3.11, which could mask errors stemming from this issue. So it would take a little bit of work on the skops side to clean up the tests (but we will have to do it eventually anyway). Maybe it would also be good to check that pickle works for all estimators in sklearn |
There is already such a test (I would have been surprised otherwise); see scikit-learn/sklearn/utils/estimator_checks.py, line 1853 in 205f3b7.
Maybe some estimators are not tested for some reason?
Ah, I was not looking there
I think the reason why this passes is because the estimators are always fitted there. If fitted, there is always an instance dict for all tested estimators, therefore the edge case is not triggered and the tests pass. The question is: Does sklearn want unfitted estimators to also be pickleable? And should we rely on the fact that all estimators, after fitting, have an instance dict? Theoretically, there could be estimators that don't need instance variables. |
Actually, I checked and
The following snippet works fine:
from sklearn.preprocessing import KernelCenterer
import pickle
est = KernelCenterer().fit([[1.]])
pickle.dumps(est)
I edited the test to pickle before fit, and
We certainly want unfitted estimators to be pickle-able, which I think might be used in some distributed settings. I don't see any reason why we'd intentionally not support pickling unfitted estimators. |
I think that we also need an entry in 1.2.1 since this is a bug fix that solves an issue if people try to pickle in 1.2.0.
Otherwise, I only have some formatting changes. LGTM.
sklearn/tests/test_base.py
Outdated
def test_parent_object_empty_instance_dict():
    # Since Python 3.11, Python objects have a __getstate__ method by default
    # that returns None if the instance dict is empty. See #25188.
    class Empty:
        pass

    class Estimator(Empty, BaseEstimator):
        pass

    state = Estimator().__getstate__()
    expected = {"_sklearn_version": sklearn.__version__}
    assert state == expected


def test_base_estimator_empty_instance_dict():
    # Since Python 3.11, Python objects have a __getstate__ method by default
    # that returns None if the instance dict is empty. See #25188.

    # this should not raise
    state = BaseEstimator().__getstate__()
    expected = {"_sklearn_version": sklearn.__version__}
    assert state == expected
Since we are testing the same behaviour here, I would prefer to parametrize the test and provide the object on which to call __getstate__.
Also could you put the comment inside a docstring of the test
def test_base_estimator_empty_instance():
    """Check that `__getstate__` returns an empty `dict` with an empty
    instance.

    Python 3.11+ changed behaviour by returning `None` instead of raising
    an `AttributeError`.

    Non-regression test for gh-25188.
    """
    ...
I am wondering if we could make the round-trip pickling-unpickling directly in this test instead of having a new test.
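The suggestion to check `__getstate__` and the pickle round trip in one test over several objects can be sketched as below. This is an illustration only: it uses a plain loop instead of `pytest.mark.parametrize`, and `MiniBase`, `MiniEstimator`, `FAKE_VERSION` and `check_empty_instance` are hypothetical stand-ins so that the snippet runs without scikit-learn. The `__setstate__` here is a simplification that just drops the version entry.

```python
import pickle

FAKE_VERSION = "0.0"  # hypothetical stand-in for sklearn.__version__


class MiniBase:
    """Hypothetical stand-in for BaseEstimator with the None fallback."""

    def __getstate__(self):
        try:
            state = super().__getstate__()
            if state is None:
                # Python 3.11+: an empty instance dict yields None.
                state = self.__dict__.copy()
        except AttributeError:
            # Python < 3.11: object has no __getstate__.
            state = self.__dict__.copy()
        return dict(state.items(), _version=FAKE_VERSION)

    def __setstate__(self, state):
        state = dict(state)
        state.pop("_version", None)  # drop the bookkeeping entry
        self.__dict__.update(state)


class Empty:
    pass


class MiniEstimator(Empty, MiniBase):
    pass


def check_empty_instance(estimator):
    """Check __getstate__ and a pickle round trip on an empty instance."""
    assert estimator.__getstate__() == {"_version": FAKE_VERSION}
    restored = pickle.loads(pickle.dumps(estimator))
    assert type(restored) is type(estimator)
    assert restored.__dict__ == {}
    return True


results = [check_empty_instance(est) for est in (MiniBase(), MiniEstimator())]
print(results)
```

Note that the estimators are never fitted here, which is exactly the situation the existing fitted-only pickle check does not exercise.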
This is a summary of what I've understood: For Python before v3.11, For Python v3.11, Next
class A:
    __slots__ = ["a"]
    def __init__(self, a):
        self.a = a
does not have a In Python 3.11, the instances will have a For completeness: Instances of this class
class A:
    __slots__ = ["a"]
    def __init__(self, a):
        pass
will not have a
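The two slotted classes from the summary can be inspected directly. The sketch below renames them `Slotted` and `SlottedUnset` so both fit in one snippet, and uses a hypothetical helper `default_state` that calls `object.__getstate__` where it exists (Python 3.11+) and otherwise returns a placeholder string.

```python
import sys


class Slotted:
    __slots__ = ["a"]

    def __init__(self, a):
        self.a = a


class SlottedUnset:
    __slots__ = ["a"]

    def __init__(self, a):
        pass  # the slot exists but never receives a value


def default_state(obj):
    """Return object.__getstate__(obj), or "unavailable" before Python 3.11."""
    getstate = getattr(object, "__getstate__", None)
    return getstate(obj) if getstate is not None else "unavailable"


has_dict = hasattr(Slotted(1), "__dict__")
set_state = default_state(Slotted(1))
unset_state = default_state(SlottedUnset(1))
print(has_dict, set_state, unset_state)
```

On Python 3.11+, the state of the instance with a filled slot should be a tuple rather than a mapping, which is why calling `.items()` on it fails; the instance with no filled slot should fall into the same `None` case as an empty instance dict.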
Just a guess... what about stateless models that does not store anything during |
We agree that regardless of the Python version, the |
Just to be super explicit, in Python 3.11 the default This means, we should handle the case where |
I don't mind us handling I'm not sure if we should put code for something which we're really not supporting and there's no way we'll go down the road of supporting it. |
Probably something like:
AttributeError: 'tuple' object has no attribute 'items'
If we want to handle it, my first guess would be something like:
err_msg_slots = (
    "Pickling an estimator is not supported for classes using __slots__ "
    "instead of __dict__."
)
try:
    try:
        state = super().__getstate__()
        if state is None:
            state = self.__dict__.copy()
    except AttributeError:
        state = self.__dict__.copy()
except AttributeError as exc:
    # Fails if `__slots__` are used in an empty instance
    raise ValueError(err_msg_slots) from exc
if not isinstance(state, Mapping):
    # Fails if `__slots__` are used with a non-empty instance
    raise ValueError(err_msg_slots)
and I hate nested
try:
    state = getattr(super, "__getstate__", self.__dict__.copy())
except AttributeError as exc:
    # Fails if `__slots__` are used in an empty instance
    raise ValueError(err_msg_slots) from exc
if state is None and hasattr(self, "__dict__"):
    state = self.__dict__.copy()
else:
    raise ValueError(err_msg_slots)
Python's pickle code checks if the output of |
Co-authored-by: Guillaume Lemaitre <[email protected]>
I can add a check for slots if we agree on that. What would be an appropriate error to raise here? |
We could simply do:
if getattr(self, "__slots__", None):
    raise TypeError("You cannot use `__slots__` in objects inheriting from `sklearn.base.BaseEstimator`")
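Putting the slots guard together with the `None` fallback gives roughly the following shape. This is a sketch under stated assumptions, not the merged scikit-learn code: `GuardedBase`, `Plain`, `WithSlots`, `FAKE_VERSION`, `state_or_error` and the error text are all hypothetical names chosen for the illustration.

```python
FAKE_VERSION = "0.0"  # hypothetical stand-in for sklearn.__version__


class GuardedBase:
    """Hypothetical stand-in combining the slots guard and the None fallback."""

    def __getstate__(self):
        # Reject classes that declare non-empty __slots__ up front.
        if getattr(self, "__slots__", None):
            raise TypeError("__slots__ is not supported in GuardedBase subclasses")
        try:
            state = super().__getstate__()
            if state is None:
                # Python 3.11+: an empty instance dict yields None.
                state = self.__dict__.copy()
        except AttributeError:
            # Python < 3.11: object has no __getstate__.
            state = self.__dict__.copy()
        return dict(state.items(), _version=FAKE_VERSION)


class Plain(GuardedBase):
    def __init__(self):
        self.alpha = 1.0


class WithSlots(GuardedBase):
    __slots__ = ("alpha",)

    def __init__(self):
        self.alpha = 1.0


def state_or_error(obj):
    """Return the state, or the exception class name if the guard fires."""
    try:
        return obj.__getstate__()
    except TypeError as exc:
        return type(exc).__name__


plain_result = state_or_error(Plain())
slots_result = state_or_error(WithSlots())
print(plain_result, slots_result)
```

Raising early keeps the error message specific, instead of letting a tuple state reach `state.items()` and fail with an unrelated-looking AttributeError.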
- Add entry to changelog
- Check if slots are used, raise TypeError
- Refactor tests to use parametrize, test __getstate__ and pickle in same test
I think I made all changes as requested by reviewers:
|
sklearn/base.py
Outdated
@@ -271,9 +271,20 @@ def __repr__(self, N_CHAR_MAX=700):
         return repr_

     def __getstate__(self):
+        if hasattr(self, "__slots__"):
There's a difference between getattr and hasattr here. We should do getattr since if the attribute exists but it's None, we shouldn't fail.
if the attribute exists but it's none, we shouldn't fail.
Isn't that what's happening with the existing code?
In [1]: class A:
...: def __init__(self):
...: self.x = None
...:
In [2]: hasattr(A(), "x")
Out[2]: True
Also, it seems that you can't do __slots__ = None in the first place.
@adrinjalali I think that __slots__ cannot be None:
Cell In[33], line 1
----> 1 class A:
2 __slots__ = None
TypeError: 'NoneType' object is not iterable
Edit: arfff, GitHub latency :)
Sure, but if the code in cpython does getattr, I'd rather stay on that safer side. I don't know why they did that, and I'm not sure if I wanna know why lol
I think there can be a C extension type which exposes a __slots__=None, and that's why getattr(obj, '__slots__', None) is more correct than getattr(obj, '__slots__').
But what about __slots__ = [], that should also trigger the error. We would have to additionally check for empty list and empty tuple.
I wouldn't try to be more correct than cpython, and I'm not sure what __slots__=[] would mean. I can understand that __slots__=None would be equivalent to it not existing.
On having to understand the subtle differences between getattr and hasattr: I'm with Adrin on the "there must be a reason the CPython devs used that, even if we don't fully understand why".
Okay, I changed to if getattr(self, "__slots__", None)
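The practical difference between the two candidate guards shows up with empty slots. A minimal sketch, with hypothetical class names and a hypothetical helper `slot_checks`, comparing `hasattr(obj, "__slots__")` against the truthiness of `getattr(obj, "__slots__", None)`:

```python
class NoSlots:
    pass


class EmptySlots:
    __slots__ = ()


class SomeSlots:
    __slots__ = ("a",)


def slot_checks(obj):
    """Return (hasattr_check, getattr_check) for the two candidate guards."""
    return hasattr(obj, "__slots__"), bool(getattr(obj, "__slots__", None))


checks = {
    cls.__name__: slot_checks(cls())
    for cls in (NoSlots, EmptySlots, SomeSlots)
}
print(checks)
```

With the getattr form, a class declaring empty `__slots__` is treated like one without slots, whereas the hasattr form would reject it.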
LGTM
Co-authored-by: Guillaume Lemaitre <[email protected]>
Co-authored-by: Adrin Jalali <[email protected]> Co-authored-by: Guillaume Lemaitre <[email protected]> Python 3.11 introduces `__getstate__` on the `object` level, which breaks our existing `__getstate__` code for objects w/o any attributes. This fixes the issue.