Slow performance due to lack of buffering for GzipFile.write #760

@svisser

Problem description

When writing a .csv.gz file to S3, performance was slow because GzipFile.write() is called once for every single line in the CSV file. The frequent calls are not the problem by themselves; the problem is that GzipFile.write() does no buffering of its own (see: python/cpython#89550). I was able to improve performance with a solution similar to python/cpython#101251, i.e., registering a custom compression handler for the '.gz' file extension in which GzipFile.write does have buffering.
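
For illustration, here is a rough sketch of that kind of handler. It assumes smart_open's register_compressor extension point; the _ClosingBufferedWriter helper, the 128 KiB buffer size, and the close handling are choices made for this sketch, not something smart_open prescribes:

import gzip
import io

import smart_open


class _ClosingBufferedWriter(io.BufferedWriter):
    """io.BufferedWriter that also closes an attached inner file object."""

    inner_file_obj = None

    def close(self):
        if not self.closed:
            super().close()  # flush the buffer and let GzipFile write its trailer
            if self.inner_file_obj is not None:
                # GzipFile does not close a fileobj it was handed, so close the
                # underlying stream (e.g. the S3 upload) explicitly.
                self.inner_file_obj.close()


def _buffered_gzip(file_obj, mode):
    # Sketch of a '.gz' handler that coalesces many small writes into larger
    # chunks before they reach GzipFile.write().
    if "w" in mode or "a" in mode:
        gz = gzip.GzipFile(fileobj=file_obj, mode=mode)
        writer = _ClosingBufferedWriter(gz, buffer_size=128 * 1024)
        writer.inner_file_obj = file_obj
        return writer
    # Reads are left unchanged here; a complete handler would also make sure
    # file_obj is closed on the read path, like smart_open's built-in one.
    return gzip.GzipFile(fileobj=file_obj, mode=mode)


# Replace the default handler for the '.gz' extension with the buffered one.
smart_open.register_compressor(".gz", _buffered_gzip)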

I'm opening this issue for the smart_open library to discuss:

  • Should the smart_open library make it easier to enable and/or configure buffering for GzipFile.write() for Python versions that don't yet have the above fix?
  • If not, should we document this or otherwise make it clear to users of the smart_open library that performance can be improved by adding buffering themselves?

Steps/code to reproduce the problem

Use smart_open.open in write mode to write a .csv.gz file to S3:

import csv

import smart_open

column_names = ("a", "b")
my_data = [.......]  # list of dicts keyed by column_names

# Text mode so that csv can write str rows; gzip compression is inferred
# from the .gz extension.
with smart_open.open("s3://.../example.csv.gz", "w", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=column_names)
    writer.writerows(my_data)

(writerows, implemented in C, calls writerow for each row, which in turn calls GzipFile.write() for each line)
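
Buffering can also be added at the call site, without touching the compressor registry. A rough sketch, under the assumption that io.BufferedWriter can wrap the binary file object smart_open returns here (the buffer size and the sample rows are illustrative):

import csv
import io

import smart_open

column_names = ("a", "b")
my_data = [{"a": 1, "b": 2}]  # illustrative rows

# Open the compressed binary stream, then add an explicit buffer layer so that
# csv's many small writes are coalesced before they reach GzipFile.write().
with smart_open.open("s3://.../example.csv.gz", "wb") as raw:
    buffered = io.BufferedWriter(raw, buffer_size=128 * 1024)
    # Closing the text wrapper flushes the buffer and closes the chain
    # down to the underlying S3 writer.
    with io.TextIOWrapper(buffered, encoding="utf-8", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=column_names)
        writer.writerows(my_data)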

Versions

  • Python 3.8
  • smart_open 6.2.0
