Commit 52dbbb9

- Issue #3300: make urllib.parse.[un]quote() default to UTF-8.
Code contributed by Matt Giuca. quote() now encodes the input before quoting, unquote() decodes after unquoting. There are new arguments to change the encoding and errors settings. There are also new APIs to skip the encode/decode steps. [un]quote_plus() are also affected.
1 parent 4171da5 commit 52dbbb9
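
As a rough illustration of the behaviour the commit message describes, the sketch below round-trips a non-ASCII string through `quote()` and `unquote()`. It assumes a current Python 3 interpreter, where these defaults still apply; the variable names are illustrative only and do not come from the commit.

```python
from urllib.parse import quote, unquote

text = "/El Niño/"

# quote() encodes to UTF-8 by default, then percent-escapes the bytes.
quoted_utf8 = quote(text)
print(quoted_utf8)  # /El%20Ni%C3%B1o/

# unquote() reverses both steps, decoding as UTF-8 by default.
print(unquote(quoted_utf8))  # /El Niño/

# A different encoding can be requested explicitly on both sides.
quoted_latin1 = quote(text, encoding="latin-1")
print(quoted_latin1)  # /El%20Ni%F1o/
print(unquote(quoted_latin1, encoding="latin-1"))  # /El Niño/

# A mismatch does not raise on decode: unquote() defaults to
# errors="replace", so the invalid byte becomes U+FFFD.
mismatched = unquote(quoted_latin1)
print(ascii(mismatched))  # '/El Ni\ufffdo/'
```

The asymmetry is visible here: `quote()` is strict by default, while `unquote()` replaces undecodable sequences instead of failing.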

8 files changed

Lines changed: 439 additions & 80 deletions

Doc/library/urllib.parse.rst

Lines changed: 56 additions & 8 deletions
@@ -182,36 +182,84 @@ The :mod:`urllib.parse` module defines the following functions:
    string. If there is no fragment identifier in *url*, return *url* unmodified
    and an empty string.

-.. function:: quote(string[, safe])
+.. function:: quote(string[, safe[, encoding[, errors]]])

    Replace special characters in *string* using the ``%xx`` escape. Letters,
    digits, and the characters ``'_.-'`` are never quoted. The optional *safe*
-   parameter specifies additional characters that should not be quoted --- its
-   default value is ``'/'``.
+   parameter specifies additional ASCII characters that should not be quoted
+   --- its default value is ``'/'``.

-   Example: ``quote('/~connolly/')`` yields ``'/%7econnolly/'``.
+   *string* may be either a :class:`str` or a :class:`bytes`.

+   The optional *encoding* and *errors* parameters specify how to deal with
+   non-ASCII characters, as accepted by the :meth:`str.encode` method.
+   *encoding* defaults to ``'utf-8'``.
+   *errors* defaults to ``'strict'``, meaning unsupported characters raise a
+   :class:`UnicodeEncodeError`.
+   *encoding* and *errors* must not be supplied if *string* is a
+   :class:`bytes`, or a :class:`TypeError` is raised.

-.. function:: quote_plus(string[, safe])
+   Note that ``quote(string, safe, encoding, errors)`` is equivalent to
+   ``quote_from_bytes(string.encode(encoding, errors), safe)``.
+
+   Example: ``quote('/El Niño/')`` yields ``'/El%20Ni%C3%B1o/'``.
+
+
+.. function:: quote_plus(string[, safe[, encoding[, errors]]])

    Like :func:`quote`, but also replace spaces by plus signs, as required for
    quoting HTML form values. Plus signs in the original string are escaped
    unless they are included in *safe*. It also does not have *safe* default to
    ``'/'``.

+   Example: ``quote_plus('/El Niño/')`` yields ``'%2FEl+Ni%C3%B1o%2F'``.
+
+.. function:: quote_from_bytes(bytes[, safe])

-.. function:: unquote(string)
+   Like :func:`quote`, but accepts a :class:`bytes` object rather than a
+   :class:`str`, and does not perform string-to-bytes encoding.
+
+   Example: ``quote_from_bytes(b'a&\xef')`` yields
+   ``'a%26%EF'``.
+
+.. function:: unquote(string[, encoding[, errors]])

    Replace ``%xx`` escapes by their single-character equivalent.
+   The optional *encoding* and *errors* parameters specify how to decode
+   percent-encoded sequences into Unicode characters, as accepted by the
+   :meth:`bytes.decode` method.
+
+   *string* must be a :class:`str`.
+
+   *encoding* defaults to ``'utf-8'``.
+   *errors* defaults to ``'replace'``, meaning invalid sequences are replaced
+   by a placeholder character.

-   Example: ``unquote('/%7Econnolly/')`` yields ``'/~connolly/'``.
+   Example: ``unquote('/El%20Ni%C3%B1o/')`` yields ``'/El Niño/'``.


-.. function:: unquote_plus(string)
+.. function:: unquote_plus(string[, encoding[, errors]])

    Like :func:`unquote`, but also replace plus signs by spaces, as required for
    unquoting HTML form values.

+   *string* must be a :class:`str`.
+
+   Example: ``unquote_plus('/El+Ni%C3%B1o/')`` yields ``'/El Niño/'``.
+
+.. function:: unquote_to_bytes(string)
+
+   Replace ``%xx`` escapes by their single-octet equivalent, and return a
+   :class:`bytes` object.
+
+   *string* may be either a :class:`str` or a :class:`bytes`.
+
+   If it is a :class:`str`, unescaped non-ASCII characters in *string*
+   are encoded into UTF-8 bytes.
+
+   Example: ``unquote_to_bytes('a%26%EF')`` yields
+   ``b'a&\xef'``.
+

 .. function:: urlencode(query[, doseq])
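
The documented examples for the new bytes-level functions and the error cases can be checked with a short script. This is a minimal sketch, assuming a current Python 3 `urllib.parse`; the helper `raises` is hypothetical and exists only to keep the example compact.

```python
from urllib.parse import (
    quote, quote_plus, quote_from_bytes, unquote_plus, unquote_to_bytes,
)

def raises(exc_type, func, *args, **kwargs):
    """Hypothetical helper: report whether the call raises exc_type."""
    try:
        func(*args, **kwargs)
    except exc_type:
        return True
    return False

s = "/El Niño/"

# quote_plus() turns spaces into '+' and does not treat '/' as safe.
plus_quoted = quote_plus(s)
print(plus_quoted)                # %2FEl+Ni%C3%B1o%2F
print(unquote_plus("/El+Ni%C3%B1o/"))  # /El Niño/

# The bytes-level functions skip the encode/decode step entirely.
raw = b"a&\xef"
print(quote_from_bytes(raw))        # a%26%EF
print(unquote_to_bytes("a%26%EF"))  # b'a&\xef'

# The documented equivalence between quote() and quote_from_bytes().
same = quote(s, "/", "utf-8", "strict") == quote_from_bytes(
    s.encode("utf-8", "strict"), "/")

# encoding/errors are rejected for bytes input; strict encoding
# raises when the target codec cannot represent a character.
bytes_with_encoding_fails = raises(TypeError, quote, b"abc", encoding="utf-8")
ascii_fails = raises(UnicodeEncodeError, quote, "ü", encoding="ascii")
```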

Lib/email/utils.py

Lines changed: 5 additions & 2 deletions
@@ -219,7 +219,7 @@ def encode_rfc2231(s, charset=None, language=None):
     charset is given but not language, the string is encoded using the empty
     string for language.
     """
-    s = urllib.parse.quote(s, safe='')
+    s = urllib.parse.quote(s, safe='', encoding=charset or 'ascii')
     if charset is None and language is None:
         return s
     if language is None:
@@ -271,7 +271,10 @@ def decode_params(params):
             # language specifiers at the beginning of the string.
             for num, s, encoded in continuations:
                 if encoded:
-                    s = urllib.parse.unquote(s)
+                    # Decode as "latin-1", so the characters in s directly
+                    # represent the percent-encoded octet values.
+                    # collapse_rfc2231_value treats this as an octet sequence.
+                    s = urllib.parse.unquote(s, encoding="latin-1")
                     extended = True
                 value.append(s)
             value = quote(EMPTYSTRING.join(value))

Lib/test/test_cgi.py

Lines changed: 2 additions & 0 deletions
@@ -68,6 +68,8 @@ def do_test(buf, method):
     ("&a=b", [('a', 'b')]),
     ("a=a+b&b=b+c", [('a', 'a b'), ('b', 'b c')]),
     ("a=1&a=2", [('a', '1'), ('a', '2')]),
+    ("a=%26&b=%3D", [('a', '&'), ('b', '=')]),
+    ("a=%C3%BC&b=%CA%83", [('a', '\xfc'), ('b', '\u0283')]),
     ]

 parse_strict_test_cases = [
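
The two added cases can be reproduced outside the test harness. The sketch below uses `urllib.parse.parse_qsl` rather than the `cgi` module the test file targets, on the assumption that it applies the same UTF-8 unquoting in a current Python 3; the `cases` list simply copies the inputs and expected values from the hunk.

```python
from urllib.parse import parse_qsl

# Inputs and expected results copied from the added test cases.
cases = [
    ("a=%26&b=%3D", [("a", "&"), ("b", "=")]),
    ("a=%C3%BC&b=%CA%83", [("a", "\xfc"), ("b", "\u0283")]),
]

results = [parse_qsl(query) for query, _expected in cases]
for (query, expected), actual in zip(cases, results):
    # Escaped delimiters survive as data; UTF-8 escapes become one character.
    print(query, "->", actual, "(matches)" if actual == expected else "(differs)")
```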

Lib/test/test_http_cookiejar.py

Lines changed: 4 additions & 1 deletion
@@ -539,6 +539,8 @@ def test_escape_path(self):
             # unquoted unsafe
             ("/foo\031/bar", "/foo%19/bar"),
             ("/\175foo/bar", "/%7Dfoo/bar"),
+            # unicode, latin-1 range
+            ("/foo/bar\u00fc", "/foo/bar%C3%BC"),  # UTF-8 encoded
             # unicode
             ("/foo/bar\uabcd", "/foo/bar%EA%AF%8D"),  # UTF-8 encoded
             ]
@@ -1444,7 +1446,8 @@ def test_url_encoding(self):
         # Try some URL encodings of the PATHs.
         # (the behaviour here has changed from libwww-perl)
         c = CookieJar(DefaultCookiePolicy(rfc2965=True))
-        interact_2965(c, "http://www.acme.com/foo%2f%25/%3c%3c%0Anew%E5/%E5",
+        interact_2965(c, "http://www.acme.com/foo%2f%25/"
+                         "%3c%3c%0Anew%C3%A5/%C3%A5",
                       "foo = bar; version = 1")

         cookie = interact_2965(
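
The cookie jar tests change because characters in the latin-1 range are now escaped as UTF-8 as well, which is why `%E5` in the test URL becomes `%C3%A5`. The sketch below shows the underlying encoding difference with `urllib.parse.quote` called directly; it does not exercise `http.cookiejar` itself, so treat it as an illustration of the escapes rather than of the cookie path logic.

```python
from urllib.parse import quote, unquote

# Characters from the added and existing test cases.
latin1_range = quote("/foo/bar\u00fc")   # U+00FC, fits in latin-1
beyond_latin1 = quote("/foo/bar\uabcd")  # U+ABCD, does not
print(latin1_range)   # /foo/bar%C3%BC
print(beyond_latin1)  # /foo/bar%EA%AF%8D

# The same character under the two encodings seen in the URL change.
utf8_escape = quote("\u00e5")                       # %C3%A5
latin1_escape = quote("\u00e5", encoding="latin-1")  # %E5
print(utf8_escape, latin1_escape)

# With the UTF-8 default, only the UTF-8 form decodes back cleanly.
print(unquote(utf8_escape) == "\u00e5")    # True
print(unquote(latin1_escape) == "\u00e5")  # False
```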
