ddelange
Collaborator

Please pick a concise, informative and complete title for your PR.

The title is important because it will appear in our change log.

Motivation

Please explain the motivation behind this PR.

If you're fixing a bug, link to the issue using a supported keyword like "Fixes #{issue_number}".

If you're adding a new feature, then consider opening a ticket and discussing it with the maintainers before you actually do the hard work.

Fixes #760 and registers the .xz compression extension by default
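To illustrate what registering `.xz` by default means in practice, here is a minimal standard-library sketch of extension-based dispatch. The `_OPENERS` table and the `open_compressed` and `roundtrip` helpers are hypothetical names made up for this example; they are not smart_open's internals, which may be organised differently.

```python
import bz2
import gzip
import lzma
import os
import tempfile

# Hypothetical registry for illustration only; smart_open's real
# compression module may look different.
_OPENERS = {
    ".bz2": bz2.open,
    ".gz": gzip.open,
    ".xz": lzma.open,  # the extension this PR registers by default
}


def open_compressed(path, mode="rb"):
    """Pick a stdlib opener based on the file extension (binary modes only)."""
    _, ext = os.path.splitext(path)
    # Fall back to the builtin open() for unknown extensions.
    opener = _OPENERS.get(ext.lower(), open)
    return opener(path, mode)


def roundtrip(ext, payload=b"hello world\n" * 1000):
    """Write payload to a temporary file with the given extension, then read it back."""
    with tempfile.TemporaryDirectory() as tmpdir:
        path = os.path.join(tmpdir, "sample" + ext)
        with open_compressed(path, "wb") as fout:
            fout.write(payload)
        with open_compressed(path, "rb") as fin:
            return fin.read()
```

Calling `roundtrip(".xz")` should return the original payload, which is the behaviour a user would expect once `.xz` is handled transparently.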

Tests

If you're fixing a bug, consider test-driven development:

  1. Create a unit test that demonstrates the bug. The test should fail.
  2. Implement your bug fix.
  3. The test you created should now pass.

If you're implementing a new feature, include unit tests for it.

Make sure all existing unit tests pass.
You can run them locally using:

pytest tests

If there are any failures, please fix them before creating the PR (or mark it as WIP, see below).

Work in progress

If you're still working on your PR, mark it as a draft PR.

We'll skip reviewing it for the time being.

Once it's ready, mark the PR as "ready for review", and ping one of the maintainers (e.g. mpenkov).

Checklist

Before you mark the PR as "ready for review", please make sure you have:

  • Picked a concise, informative and complete title
  • Clearly explained the motivation behind the PR
  • Linked to any existing issues that your PR will be solving
  • Included tests for any new functionality
  • Run python update_helptext.py in case there are API changes
  • Checked that all unit tests pass

Workflow

Please avoid rebasing and force-pushing to the branch of the PR once a review is in progress.

Rebasing can make your commits look a bit cleaner, but it also makes life more difficult for the reviewer, because they can no longer distinguish between code that has already been reviewed and unreviewed code.

ddelange force-pushed the buffered-compression branch 3 times, most recently from d5bb01e to 5db3a1c, on August 18, 2025 17:18
ddelange force-pushed the buffered-compression branch from 5db3a1c to 1832ef6 on August 18, 2025 17:20
Comment on lines +99 to +105
def _maybe_wrap_buffered(file_obj, mode):
    # https://github.com/piskvorky/smart_open/issues/760#issuecomment-1553971657
    result = file_obj
    if "b" in mode and "w" in mode:
        result = io.BufferedWriter(result)
    elif "b" in mode and "r" in mode:
        result = io.BufferedReader(result)
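A rough way to see why the wrapper helps with many short writes is to count how often the compression layer's `write()` is reached with and without an `io.BufferedWriter` in front of it. The sketch below uses only the standard library; `CountingGzipFile` and `write_lines` are hypothetical helpers written for this illustration and are not part of smart_open.

```python
import gzip
import io


class CountingGzipFile(gzip.GzipFile):
    """GzipFile that records how many times write() is invoked."""

    def __init__(self, *args, **kwargs):
        self.write_calls = 0
        super().__init__(*args, **kwargs)

    def write(self, data):
        self.write_calls += 1
        return super().write(data)


def write_lines(buffered, n_lines=10_000, line=b"hello world\n"):
    """Write n_lines short lines; return (GzipFile.write call count, decompressed data)."""
    sink = io.BytesIO()
    gz = CountingGzipFile(fileobj=sink, mode="wb")
    # With buffering, small writes are coalesced before they reach GzipFile.write.
    writer = io.BufferedWriter(gz) if buffered else gz
    with writer:
        for _ in range(n_lines):
            writer.write(line)
    # Closing the GzipFile does not close the BytesIO it was given, so it is still readable.
    return gz.write_calls, gzip.decompress(sink.getvalue())


unbuffered_calls, unbuffered_data = write_lines(buffered=False)
buffered_calls, buffered_data = write_lines(buffered=True)
print(f"unbuffered: {unbuffered_calls} calls, buffered: {buffered_calls} calls")
```

Both variants should produce identical decompressed data. The exact call count in the buffered case depends on the interpreter's default buffer size, so only the relative difference is meaningful here, and call counts are a proxy rather than a timing measurement.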

Hi @rhpvorderman 👋

taking the liberty to tag you here for review since you originally suggested this fix in 2023 👍


Make sure to benchmark the performance too as in CPython this buffering issue is fixed in the gzip module.


ddelange commented Sep 8, 2025

I'm seeing an improvement across the board for many short writes/reads 🎉

Without _maybe_wrap_buffered:

In [1]: import smart_open
   ...: from tempfile import NamedTemporaryFile as named_temporary_file
   ...:
   ...: def test_compression_extension(extension):
   ...:     with named_temporary_file(suffix=extension) as tmp:
   ...:         with smart_open.open(tmp.name, "w") as fout:
   ...:             for i in range(100000):
   ...:                 fout.write("hello world\n")
   ...:
   ...:         with smart_open.open(tmp.name, "r") as fin:
   ...:             list(fin)
   ...:
   ...:     with named_temporary_file(suffix=extension) as tmp:
   ...:         with smart_open.open(tmp.name, "wb") as fout:
   ...:             for i in range(100000):
   ...:                 fout.write(b"hello world\n")
   ...:
   ...:         with smart_open.open(tmp.name, "rb") as fin:
   ...:             list(fin)
   ...:
   ...: for extension in smart_open.compression.get_supported_extensions():
   ...:     print(extension)
   ...:     %timeit -n 20 test_compression_extension(extension)
   ...:
.bz2
329 ms ± 1.51 ms per loop (mean ± std. dev. of 7 runs, 20 loops each)
.gz
76.7 ms ± 619 μs per loop (mean ± std. dev. of 7 runs, 20 loops each)
.xz
140 ms ± 1.26 ms per loop (mean ± std. dev. of 7 runs, 20 loops each)
.zst
# UnsupportedOperation ref https://github.com/piskvorky/smart_open/pull/815

With _maybe_wrap_buffered:

In [1]: import smart_open
   ...: from tempfile import NamedTemporaryFile as named_temporary_file
   ...:
   ...: def test_compression_extension(extension):
   ...:     with named_temporary_file(suffix=extension) as tmp:
   ...:         with smart_open.open(tmp.name, "w") as fout:
   ...:             for i in range(100000):
   ...:                 fout.write("hello world\n")
   ...:
   ...:         with smart_open.open(tmp.name, "r") as fin:
   ...:             list(fin)
   ...:
   ...:     with named_temporary_file(suffix=extension) as tmp:
   ...:         with smart_open.open(tmp.name, "wb") as fout:
   ...:             for i in range(100000):
   ...:                 fout.write(b"hello world\n")
   ...:
   ...:         with smart_open.open(tmp.name, "rb") as fin:
   ...:             list(fin)
   ...:
   ...: for extension in smart_open.compression.get_supported_extensions():
   ...:     print(extension)
   ...:     %timeit -n 20 test_compression_extension(extension)
   ...:
.bz2
254 ms ± 1.25 ms per loop (mean ± std. dev. of 7 runs, 20 loops each)
.gz
49.8 ms ± 1.5 ms per loop (mean ± std. dev. of 7 runs, 20 loops each)
.xz
61.9 ms ± 225 μs per loop (mean ± std. dev. of 7 runs, 20 loops each)
.zst
28.5 ms ± 239 μs per loop (mean ± std. dev. of 7 runs, 20 loops each)
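For a quick read of the two runs, the mean timings can be turned into speedup ratios. The sketch below copies the means from the `%timeit` output above; the dictionary and function names are hypothetical, and `.zst` is left out of the comparison because the run without the wrapper raised `UnsupportedOperation` and therefore has no baseline.

```python
# Mean timings in milliseconds, copied from the %timeit output above.
without_buffering = {".bz2": 329.0, ".gz": 76.7, ".xz": 140.0}
with_buffering = {".bz2": 254.0, ".gz": 49.8, ".xz": 61.9, ".zst": 28.5}


def speedups(before, after):
    """Return before/after ratios for extensions measured in both runs."""
    return {ext: before[ext] / after[ext] for ext in before if ext in after}


for ext, ratio in speedups(without_buffering, with_buffering).items():
    print(f"{ext}: {ratio:.2f}x faster")
```

On these numbers the ratios come out at roughly 1.30x for `.bz2`, 1.54x for `.gz`, and 2.26x for `.xz`. They describe this particular workload of many short writes and reads and may not carry over to other access patterns.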

ddelange merged commit 676099e into develop on Sep 8, 2025
31 checks passed
ddelange added a commit that referenced this pull request Sep 8, 2025
* develop:
  Update CHANGELOG.md
  Add .xz and increase performance of compression module (#875)
  Bump pypa/gh-action-pypi-publish in /.github/workflows (#878)
  Bump actions/checkout from 4 to 5 in the github-actions group (#877)
  Fix release.sh for the final merge back into develop (#872)

github-actions bot commented Sep 8, 2025

Released v7.3.1

piskvorky deleted the buffered-compression branch on September 8, 2025 11:28
Development

Successfully merging this pull request may close these issues.

Slow performance due to lack of buffering for GzipFile.write