Add .xz and increase performance of compression module #875
def _maybe_wrap_buffered(file_obj, mode):
    # https://github.com/piskvorky/smart_open/issues/760#issuecomment-1553971657
    result = file_obj
    if "b" in mode and "w" in mode:
        result = io.BufferedWriter(result)
    elif "b" in mode and "r" in mode:
        result = io.BufferedReader(result)
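The wrapping in the hunk above can be tried in isolation with only the standard library. The following is a minimal sketch, not the code merged in this PR: `maybe_wrap_buffered` and `roundtrip` are hypothetical names, and `lzma.LZMAFile` stands in for whatever file object smart_open passes to the helper.

```python
import io
import lzma
import os
import tempfile


def maybe_wrap_buffered(file_obj, mode):
    # Hypothetical standalone variant of the helper in the hunk above.
    # Binary writers and readers get an extra buffering layer so that many
    # small write()/readline() calls are batched before reaching the
    # (de)compressor; text modes are returned unchanged.
    if "b" in mode and "w" in mode:
        return io.BufferedWriter(file_obj)
    if "b" in mode and "r" in mode:
        return io.BufferedReader(file_obj)
    return file_obj


def roundtrip(n_lines=1000):
    # Write n_lines short lines to a temporary .xz file through the buffered
    # wrapper, then read them back and return how many lines were seen.
    with tempfile.TemporaryDirectory() as tmp_dir:
        path = os.path.join(tmp_dir, "sample.xz")
        with maybe_wrap_buffered(lzma.LZMAFile(path, "wb"), "wb") as fout:
            for _ in range(n_lines):
                fout.write(b"hello world\n")
        with maybe_wrap_buffered(lzma.LZMAFile(path, "rb"), "rb") as fin:
            return sum(1 for _ in fin)


if __name__ == "__main__":
    print(roundtrip())
```

Closing the `io.BufferedWriter` flushes its buffer and then closes the wrapped `LZMAFile`, so the compressed stream is finalized when the `with` block exits.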
Hi @rhpvorderman 👋
taking the liberty to tag you here for review since you originally suggested this fix in 2023 👍
Make sure to benchmark the performance too as in CPython this buffering issue is fixed in the gzip module.
I'm seeing an improvement across the board for many short writes/reads 🎉

Without this change:

In [1]: import smart_open
   ...: from tempfile import NamedTemporaryFile as named_temporary_file
   ...:
   ...: def test_compression_extension(extension):
   ...:     with named_temporary_file(suffix=extension) as tmp:
   ...:         with smart_open.open(tmp.name, "w") as fout:
   ...:             for i in range(100000):
   ...:                 fout.write("hello world\n")
   ...:
   ...:         with smart_open.open(tmp.name, "r") as fin:
   ...:             list(fin)
   ...:
   ...:     with named_temporary_file(suffix=extension) as tmp:
   ...:         with smart_open.open(tmp.name, "wb") as fout:
   ...:             for i in range(100000):
   ...:                 fout.write(b"hello world\n")
   ...:
   ...:         with smart_open.open(tmp.name, "rb") as fin:
   ...:             list(fin)
   ...:
   ...: for extension in smart_open.compression.get_supported_extensions():
   ...:     print(extension)
   ...:     %timeit -n 20 test_compression_extension(extension)
   ...:
.bz2
329 ms ± 1.51 ms per loop (mean ± std. dev. of 7 runs, 20 loops each)
.gz
76.7 ms ± 619 μs per loop (mean ± std. dev. of 7 runs, 20 loops each)
.xz
140 ms ± 1.26 ms per loop (mean ± std. dev. of 7 runs, 20 loops each)
.zst
# UnsupportedOperation ref https://github.com/piskvorky/smart_open/pull/815

With this change:

In [1]: import smart_open
   ...: from tempfile import NamedTemporaryFile as named_temporary_file
   ...:
   ...: def test_compression_extension(extension):
   ...:     with named_temporary_file(suffix=extension) as tmp:
   ...:         with smart_open.open(tmp.name, "w") as fout:
   ...:             for i in range(100000):
   ...:                 fout.write("hello world\n")
   ...:
   ...:         with smart_open.open(tmp.name, "r") as fin:
   ...:             list(fin)
   ...:
   ...:     with named_temporary_file(suffix=extension) as tmp:
   ...:         with smart_open.open(tmp.name, "wb") as fout:
   ...:             for i in range(100000):
   ...:                 fout.write(b"hello world\n")
   ...:
   ...:         with smart_open.open(tmp.name, "rb") as fin:
   ...:             list(fin)
   ...:
   ...: for extension in smart_open.compression.get_supported_extensions():
   ...:     print(extension)
   ...:     %timeit -n 20 test_compression_extension(extension)
   ...:
.bz2
254 ms ± 1.25 ms per loop (mean ± std. dev. of 7 runs, 20 loops each)
.gz
49.8 ms ± 1.5 ms per loop (mean ± std. dev. of 7 runs, 20 loops each)
.xz
61.9 ms ± 225 μs per loop (mean ± std. dev. of 7 runs, 20 loops each)
.zst
28.5 ms ± 239 μs per loop (mean ± std. dev. of 7 runs, 20 loops each)

Released v7.3.1
Motivation
Fixes #760 and registers the .xz compression extension by default

Tests
Work in progress

Checklist
python update_helptext.py in case there are API changes

Workflow