Bug report
Bug description:
There is a logical error in pickle.Pickler.save_str for protocol 0, such that it repeats pickling of a string object each time it is presented. The design clearly intends to re-use the first pickled representation, and the C-implementation _pickle does that.
In an implementation that does not provide a compiled _pickle (PyPy may be one) this is inefficient, but not actually wrong. The intended behaviour occurs with a simple string:
>>> s = "hello"
>>> pickle._dumps((s,s), 0)
b'(Vhello\np0\ng0\ntp1\n.'
When read by loads() this string says:
- stack "hello",
- save a copy in memory 0,
- stack the contents of memory 0,
- make a tuple from the stack,
- save a copy in memory 1.
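The steps above can be checked by walking the opcodes with the standard `pickletools` module. This is a minimal sketch: the bytes are copied from the session above, and `opcode_names` is a helper name introduced here for illustration, not part of `pickle`.

```python
import pickle
import pickletools

# Pickle bytes copied from the interpreter session above.
data = b'(Vhello\np0\ng0\ntp1\n.'

def opcode_names(pickled):
    """Return the opcode names found in a pickle, in stream order."""
    return [opcode.name for opcode, arg, pos in pickletools.genops(pickled)]

names = opcode_names(data)
print(names)

# GET pushes the memoized object itself, so both tuple items are one object.
first, second = pickle.loads(data)
print(first is second)
```

The single `UNICODE` followed by `PUT` and `GET` is the re-use the design intends: the string is written once and referenced thereafter.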
The bug emerges when the pickled string needs pre-encoding:
>>> s = "hello\n"
>>> pickle._dumps((s,s), 0)
b'(Vhello\\u000a\np0\nVhello\\u000a\np1\ntp2\n.'
Here we see identical data stacked and saved (but not used). The problem is here (cpython/Lib/pickle.py, lines 860 to 866 in 42a86df):

    obj = obj.replace("\\", "\\u005c")
    obj = obj.replace("\0", "\\u0000")
    obj = obj.replace("\n", "\\u000a")
    obj = obj.replace("\r", "\\u000d")
    obj = obj.replace("\x1a", "\\u001a")  # EOF on DOS
    self.write(UNICODE + obj.encode('raw-unicode-escape') +
               b'\n')

where the return from obj.replace may be a different object from obj. In CPython, that is only if a replacement takes place, which is why the problem only appears in the second case above.
save_str is only called when the object has not already been memoized, but in the cases at issue, the string memoized is not the original object, and so when the original string object is presented again, save_str is called again.
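The effect can be illustrated with a toy model of an id-keyed memo. This is only a sketch of the mechanism, not the code in pickle.py: `toy_save`, `toy_save_str` and `run` are hypothetical names, and only the newline replacement is modelled. The first case relies on CPython returning the same object from `str.replace` when nothing is replaced.

```python
def toy_save_str(obj, memo, out):
    """Toy stand-in for save_str: obj is rebound before being memoized."""
    obj = obj.replace("\n", "\\u000a")   # may return a different object
    out.append(("UNICODE", obj))
    memo[id(obj)] = (len(memo), obj)     # keyed on the rebound obj

def toy_save(obj, memo, out):
    """Emit a GET if obj is already in the memo, otherwise save it."""
    if id(obj) in memo:
        out.append(("GET", memo[id(obj)][0]))
    else:
        toy_save_str(obj, memo, out)

def run(s):
    """Save the same string object twice and return the emitted operations."""
    memo, out = {}, []
    toy_save(s, memo, out)
    toy_save(s, memo, out)
    return out

plain = run("hello")       # no replacement: second save is a memo hit
escaped = run("hello\n")   # replacement: memo is keyed on the wrong object
print(plain)
print(escaped)
```

In the second run the memo never contains the id of the original string, so every presentation of it is a miss.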
Depending upon the detailed behaviour of str.replace (in particular, if an implementation returns an interned value when the result is, say, a single Latin-1 character) an assertion may fail in memoize() (cpython/Lib/pickle.py, lines 504 to 507 in 42a86df):

    assert id(obj) not in self.memo
    idx = len(self.memo)
    self.write(self.put(idx))
    self.memo[id(obj)] = idx, obj

I have not managed to trigger an AssertionError in CPython.
This has probably gone unnoticed so long only because pickle.py is not tested. (At least, I think it isn't. #105250 and #53350 note this coverage problem.)
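A check along the following lines could expose the difference between the two implementations. It is a sketch only: `count_ops` and `report` are names introduced here, and it assumes `pickle.dumps` is backed by the compiled `_pickle` while `pickle._dumps` is the pure-Python pickler, as in the session above. The result on a given interpreter depends on whether the bug is present, so no expected output is shown.

```python
import pickle
import pickletools

def count_ops(pickled, name):
    """Count occurrences of the opcode called name in a pickle."""
    return sum(1 for opcode, arg, pos in pickletools.genops(pickled)
               if opcode.name == name)

def report(obj, protocol=0):
    """Return UNICODE opcode counts for the default and pure-Python picklers."""
    default = pickle.dumps(obj, protocol)   # C _pickle when available
    pure = pickle._dumps(obj, protocol)     # pure-Python implementation
    return count_ops(default, "UNICODE"), count_ops(pure, "UNICODE")

s = "hello\n"
print(report((s, s)))

# The two pickles quoted in this report, for comparison.
good = b'(Vhello\np0\ng0\ntp1\n.'
bad = b'(Vhello\\u000a\np0\nVhello\\u000a\np1\ntp2\n.'
print(count_ops(good, "UNICODE"), count_ops(bad, "UNICODE"))
```

A count above one for a tuple holding the same string object twice indicates that the string was pickled repeatedly rather than re-used.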
CPython versions tested on:
3.11
Operating systems tested on:
Windows