Problem description
When writing a .csv.gz file to S3, performance was slow because GzipFile.write() gets called for every single line of the CSV file. The per-line calls are not the problem in themselves; the real cost is the lack of buffering in GzipFile.write() (see python/cpython#89550). I was able to improve performance by implementing a solution similar to python/cpython#101251, i.e. registering a custom compression handler for the '.gz' file extension in which writes to the GzipFile are buffered.
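As a rough sketch of that workaround (this is not smart_open's built-in behavior; the 128 KiB buffer size and the handler name are my own choices): register a replacement handler for '.gz' that puts an io.BufferedWriter in front of GzipFile, so many small row writes are coalesced into fewer, larger GzipFile.write() calls. Closing the gzip layer also needs to close the underlying S3 writer so the upload is finalized, similar to what the built-in handler does.

import gzip
import io

from smart_open import register_compressor


def _handle_gzip_buffered(file_obj, mode):
    gzip_file = gzip.GzipFile(fileobj=file_obj, mode=mode)

    # Make sure closing the gzip layer also closes the underlying writer
    # (e.g. the S3 multipart upload), mirroring smart_open's built-in handler.
    gzip_close = gzip_file.close

    def close_both():
        try:
            gzip_close()
        finally:
            file_obj.close()

    gzip_file.close = close_both

    if "w" in mode or "a" in mode:
        # Coalesce many small writes before they reach GzipFile.write().
        # Closing the BufferedWriter flushes it and then closes gzip_file.
        return io.BufferedWriter(gzip_file, buffer_size=128 * 1024)
    return gzip_file


# Replaces the default '.gz' handler for subsequent smart_open.open() calls.
register_compressor(".gz", _handle_gzip_buffered)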
I'm opening this issue for the smart_open library to discuss:
- Should the smart_open library make it easier to enable and/or configure buffering for GzipFile.write() on Python versions that don't yet have the above fix?
- If not, should we document this, or otherwise make it clear to users of the smart_open library, that they can improve performance by adding buffering themselves?
Steps/code to reproduce the problem
Use smart_open.open in write mode for a .csv.gz file on S3:
import csv
import smart_open

column_names = ("a", "b")
my_data = [.......]  # list of dicts with keys "a" and "b"

with smart_open.open("s3://.../example.csv.gz", "wb", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=column_names)
    writer.writerows(my_data)
(writerows, implemented in C, calls writerow for each row, which in turn calls GzipFile.write() for each row.)
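If smart_open itself does not add buffering, a possible workaround at the call site (a sketch only; the extra io wrappers and the buffer size are my own choices, not part of the smart_open API) is to open the object in binary mode and put your own io.BufferedWriter and io.TextIOWrapper between the csv writer and the compressed stream:

import csv
import io

import smart_open

column_names = ("a", "b")
my_data = [{"a": 1, "b": 2}, {"a": 3, "b": 4}]  # example rows

with smart_open.open("s3://.../example.csv.gz", "wb") as binary_sink:
    # Buffer the encoded CSV bytes so GzipFile.write() sees large chunks
    # instead of one call per row.
    buffered = io.BufferedWriter(binary_sink, buffer_size=128 * 1024)
    with io.TextIOWrapper(buffered, encoding="utf-8", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=column_names)
        writer.writerows(my_data)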
Versions
- Python 3.8
- smart_open 6.2.0