Bug report
When passed a bytestring that is over a hundred mebibytes (MiB), the urllib.parse.quote_from_bytes function uses much more memory and CPU than one would expect.
repro.py:
#!/usr/bin/env python3
import base64
from time import perf_counter
from urllib.parse import quote_from_bytes

MIB = 1024 ** 2

def main():
    bytes_ = base64.b64encode(100 * MIB * b'\x00')  # note 1
    start = perf_counter()
    quoted = quote_from_bytes(bytes_)
    stop = perf_counter()
    print(f"Quoting {len(bytes_)/1024**2:.3f} MiB took {stop-start} seconds")

if __name__ == '__main__':
    main()
I use /usr/bin/time to track how much CPU and memory is used.
$ /usr/bin/time -v ./repro.py
Quoting 133.333 MiB took 7.290915511985077 seconds
Command being timed: "./repro.py"
User time (seconds): 7.12
System time (seconds): 0.68
Percent of CPU this job got: 99%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:07.82
...
Maximum resident set size (kbytes): 1374872
...
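For a measurement that also works where /usr/bin/time is unavailable (e.g. the Windows reproduction mentioned below), tracemalloc can confirm the Python-level allocation peak. A sketch, scaled down from the 100 MiB repro so it runs quickly; the 4 MiB size is an arbitrary choice:

```python
import base64
import tracemalloc
from urllib.parse import quote_from_bytes

MIB = 1024 ** 2
data = base64.b64encode(4 * MIB * b'\x00')  # scaled down from the 100 MiB repro

tracemalloc.start()
quote_from_bytes(data)
_, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()
print(f"peak: {peak / MIB:.1f} MiB for a {len(data) / MIB:.1f} MiB input")
```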
At one point the function needs roughly ten times the size of the bytestring to quote it (i.e. 1.31 GiB). It also takes several seconds to return; I expect it to return in under a second. Fortunately, there's no memory leak: the interpreter returns the memory after the function returns.
Interestingly, if I reduce 100 to 90 in the line marked "note 1", the function returns in half a second and uses only 250 MiB, which is much more in line with my pre-bug expectations.
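As a workaround on affected versions, quoting the input in chunks keeps the input below that cliff. Since percent-encoding operates on individual bytes, splitting at arbitrary byte boundaries cannot change the result. A sketch (the helper name and the 8 MiB chunk size are my own choices, not from the report):

```python
from urllib.parse import quote_from_bytes

def quote_from_bytes_chunked(bs: bytes, safe: str = '/',
                             chunk: int = 8 * 1024 * 1024) -> str:
    # Percent-encoding is byte-by-byte, so quoting each slice
    # independently and concatenating yields the same string.
    return ''.join(
        quote_from_bytes(bs[i:i + chunk], safe)
        for i in range(0, len(bs), chunk)
    )

print(quote_from_bytes_chunked(b'spam & eggs/2'))  # -> spam%20%26%20eggs/2
```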
This function's memory consumption affects the AWS SDK for Python, boto3, since many AWS APIs are called with URL-encoded parameters. boto3/botocore calls urllib.parse.urlencode to do that encoding, which in turn calls the problematic quote_from_bytes. Sample stack trace:
  File "/usr/local/lib/python3.8/dist-packages/botocore/client.py", line 508, in _api_call
    return self._make_api_call(operation_name, kwargs)
  File "/usr/local/lib/python3.8/dist-packages/botocore/client.py", line 898, in _make_api_call
    http, parsed_response = self._make_request(
  File "/usr/local/lib/python3.8/dist-packages/botocore/client.py", line 921, in _make_request
    return self._endpoint.make_request(operation_model, request_dict)
  File "/usr/local/lib/python3.8/dist-packages/botocore/endpoint.py", line 119, in make_request
    return self._send_request(request_dict, operation_model)
  File "/usr/local/lib/python3.8/dist-packages/botocore/endpoint.py", line 198, in _send_request
    request = self.create_request(request_dict, operation_model)
  File "/usr/local/lib/python3.8/dist-packages/botocore/endpoint.py", line 139, in create_request
    prepared_request = self.prepare_request(request)
  File "/usr/local/lib/python3.8/dist-packages/botocore/endpoint.py", line 150, in prepare_request
    return request.prepare()
  File "/usr/local/lib/python3.8/dist-packages/botocore/awsrequest.py", line 473, in prepare
    return self._request_preparer.prepare(self)
  File "/usr/local/lib/python3.8/dist-packages/botocore/awsrequest.py", line 360, in prepare
    body = self._prepare_body(original)
  File "/usr/local/lib/python3.8/dist-packages/botocore/awsrequest.py", line 416, in _prepare_body
    body = urlencode(params, doseq=True)
  File "/usr/lib/python3.8/urllib/parse.py", line 962, in urlencode
    v = quote_via(v, safe)
  File "/usr/lib/python3.8/urllib/parse.py", line 870, in quote_plus
    return quote(string, safe, encoding, errors)
  File "/usr/lib/python3.8/urllib/parse.py", line 859, in quote
    return quote_from_bytes(string, safe)
  File "/usr/lib/python3.8/urllib/parse.py", line 898, in quote_from_bytes
    return ''.join([quoter(char) for char in bs])
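The last frame hints at where much of the memory goes: the list comprehension materializes one list element per input byte before join runs. This is not the upstream diagnosis, just back-of-the-envelope arithmetic under 64-bit CPython assumptions:

```python
import sys

# One pointer per element on 64-bit CPython: ~8 MB for a million entries,
# before counting the quoted-string objects the elements refer to.
print(sys.getsizeof([''] * 1_000_000))

# Scaled to the report: a 100 MiB payload grows to ~140 million bytes after
# base64, so the intermediate list alone is on the order of a gibibyte.
n = 100 * 1024 ** 2 * 4 // 3
print(f"~{n * 8 / 1024 ** 3:.2f} GiB of pointers for {n} elements")
```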
Your environment
Python 3.8.10 on Ubuntu 20.04 running on a t3.large EC2 instance. I have also been able to reproduce it with Python 3.10.6 and 3.11.0rc1+. I also reproduced it on Windows 10 running Python 3.9.13.