Commit 428de65

- Issue #719888: Updated tokenize to use a bytes API. generate_tokens has been
renamed tokenize and now works with bytes rather than strings. A new detect_encoding function has been added for determining source file encoding according to PEP-0263. Token sequences returned by tokenize always start with an ENCODING token which specifies the encoding used to decode the file. This token is used to encode the output of untokenize back to bytes. Credit goes to Michael "I'm-going-to-name-my-first-child-unittest" Foord from Resolver Systems for this work.
1 parent 112367a commit 428de65

16 files changed

Lines changed: 610 additions & 183 deletions
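
Before the per-file diffs, a minimal sketch (an editor's illustration, not part of this commit; the sample source and variable names are made up) of the new bytes-oriented API the commit message describes: *readline* must now return bytes, and the first token yielded is an ENCODING token naming the codec used to decode the rest of the source.

    from io import BytesIO
    from tokenize import tokenize, ENCODING

    source = b"x = 21.3e-5\n"                        # bytes, not str
    tokens = list(tokenize(BytesIO(source).readline))

    # The first token is always ENCODING; with no BOM or coding cookie
    # present, it falls back to the 'utf-8' default.
    assert tokens[0][0] == ENCODING
    print(tokens[0][1])

    # The remaining tokens are the familiar 5-tuples:
    # (type, string, (srow, scol), (erow, ecol), line).
    for toknum, tokval, start, end, line in tokens[1:]:
        print(toknum, repr(tokval), start, end)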

Doc/ACKS.txt

Lines changed: 2 additions & 0 deletions

@@ -209,3 +209,5 @@ [email protected]), and we'll be glad to correct the problem.
 * Moshe Zadka
 * Milan Zamazal
 * Cheng Zhang
+* Trent Nelson
+* Michael Foord

Doc/library/tokenize.rst

Lines changed: 98 additions & 74 deletions

@@ -9,50 +9,34 @@
 
 
 The :mod:`tokenize` module provides a lexical scanner for Python source code,
-implemented in Python. The scanner in this module returns comments as tokens as
-well, making it useful for implementing "pretty-printers," including colorizers
-for on-screen displays.
+implemented in Python. The scanner in this module returns comments as tokens
+as well, making it useful for implementing "pretty-printers," including
+colorizers for on-screen displays.
 
 The primary entry point is a :term:`generator`:
 
 
-.. function:: generate_tokens(readline)
+.. function:: tokenize(readline)
 
-   The :func:`generate_tokens` generator requires one argument, *readline*, which
+   The :func:`tokenize` generator requires one argument, *readline*, which
    must be a callable object which provides the same interface as the
    :meth:`readline` method of built-in file objects (see section
-   :ref:`bltin-file-objects`). Each call to the function should return one line of
-   input as a string.
+   :ref:`bltin-file-objects`). Each call to the function should return one
+   line of input as bytes.
 
-   The generator produces 5-tuples with these members: the token type; the token
-   string; a 2-tuple ``(srow, scol)`` of ints specifying the row and column where
-   the token begins in the source; a 2-tuple ``(erow, ecol)`` of ints specifying
-   the row and column where the token ends in the source; and the line on which the
-   token was found. The line passed is the *logical* line; continuation lines are
-   included.
-
-
-An older entry point is retained for backward compatibility:
-
-.. function:: tokenize(readline[, tokeneater])
-
-   The :func:`tokenize` function accepts two parameters: one representing the input
-   stream, and one providing an output mechanism for :func:`tokenize`.
-
-   The first parameter, *readline*, must be a callable object which provides the
-   same interface as the :meth:`readline` method of built-in file objects (see
-   section :ref:`bltin-file-objects`). Each call to the function should return one
-   line of input as a string. Alternately, *readline* may be a callable object that
-   signals completion by raising :exc:`StopIteration`.
-
-   The second parameter, *tokeneater*, must also be a callable object. It is
-   called once for each token, with five arguments, corresponding to the tuples
-   generated by :func:`generate_tokens`.
+   The generator produces 5-tuples with these members: the token type; the
+   token string; a 2-tuple ``(srow, scol)`` of ints specifying the row and
+   column where the token begins in the source; a 2-tuple ``(erow, ecol)`` of
+   ints specifying the row and column where the token ends in the source; and
+   the line on which the token was found. The line passed is the *logical*
+   line; continuation lines are included.
+
+   tokenize determines the source encoding of the file by looking for a utf-8
+   bom or encoding cookie, according to :pep:`263`.
 
 
 All constants from the :mod:`token` module are also exported from
-:mod:`tokenize`, as are two additional token type values that might be passed to
-the *tokeneater* function by :func:`tokenize`:
+:mod:`tokenize`, as are three additional token type values:
 
 .. data:: COMMENT
 
@@ -62,55 +46,95 @@ the *tokeneater* function by :func:`tokenize`:
 .. data:: NL
 
    Token value used to indicate a non-terminating newline. The NEWLINE token
-   indicates the end of a logical line of Python code; NL tokens are generated when
-   a logical line of code is continued over multiple physical lines.
+   indicates the end of a logical line of Python code; NL tokens are generated
+   when a logical line of code is continued over multiple physical lines.
 
-Another function is provided to reverse the tokenization process. This is useful
-for creating tools that tokenize a script, modify the token stream, and write
-back the modified script.
 
+.. data:: ENCODING
 
-.. function:: untokenize(iterable)
+   Token value that indicates the encoding used to decode the source bytes
+   into text. The first token returned by :func:`tokenize` will always be an
+   ENCODING token.
 
-   Converts tokens back into Python source code. The *iterable* must return
-   sequences with at least two elements, the token type and the token string. Any
-   additional sequence elements are ignored.
 
-   The reconstructed script is returned as a single string. The result is
-   guaranteed to tokenize back to match the input so that the conversion is
-   lossless and round-trips are assured. The guarantee applies only to the token
-   type and token string as the spacing between tokens (column positions) may
-   change.
+Another function is provided to reverse the tokenization process. This is
+useful for creating tools that tokenize a script, modify the token stream, and
+write back the modified script.
 
 
+.. function:: untokenize(iterable)
+
+   Converts tokens back into Python source code. The *iterable* must return
+   sequences with at least two elements, the token type and the token string.
+   Any additional sequence elements are ignored.
+
+   The reconstructed script is returned as a single string. The result is
+   guaranteed to tokenize back to match the input so that the conversion is
+   lossless and round-trips are assured. The guarantee applies only to the
+   token type and token string as the spacing between tokens (column
+   positions) may change.
+
+   It returns bytes, encoded using the ENCODING token, which is the first
+   token sequence output by :func:`tokenize`.
+
+
+:func:`tokenize` needs to detect the encoding of source files it tokenizes. The
+function it uses to do this is available:
+
+.. function:: detect_encoding(readline)
+
+   The :func:`detect_encoding` function is used to detect the encoding that
+   should be used to decode a Python source file. It requires one argment,
+   readline, in the same way as the :func:`tokenize` generator.
+
+   It will call readline a maximum of twice, and return the encoding used
+   (as a string) and a list of any lines (not decoded from bytes) it has read
+   in.
+
+   It detects the encoding from the presence of a utf-8 bom or an encoding
+   cookie as specified in pep-0263. If both a bom and a cookie are present,
+   but disagree, a SyntaxError will be raised.
+
+   If no encoding is specified, then the default of 'utf-8' will be returned.
+
+
 Example of a script re-writer that transforms float literals into Decimal
 objects::
 
-    def decistmt(s):
-        """Substitute Decimals for floats in a string of statements.
-
-        >>> from decimal import Decimal
-        >>> s = 'print(+21.3e-5*-.1234/81.7)'
-        >>> decistmt(s)
-        "print(+Decimal ('21.3e-5')*-Decimal ('.1234')/Decimal ('81.7'))"
-
-        >>> exec(s)
-        -3.21716034272e-007
-        >>> exec(decistmt(s))
-        -3.217160342717258261933904529E-7
-
-        """
-        result = []
-        g = generate_tokens(StringIO(s).readline) # tokenize the string
-        for toknum, tokval, _, _, _ in g:
-            if toknum == NUMBER and '.' in tokval: # replace NUMBER tokens
-                result.extend([
-                    (NAME, 'Decimal'),
-                    (OP, '('),
-                    (STRING, repr(tokval)),
-                    (OP, ')')
-                ])
-            else:
-                result.append((toknum, tokval))
-        return untokenize(result)
+    def decistmt(s):
+        """Substitute Decimals for floats in a string of statements.
+
+        >>> from decimal import Decimal
+        >>> s = 'print(+21.3e-5*-.1234/81.7)'
+        >>> decistmt(s)
+        "print (+Decimal ('21.3e-5')*-Decimal ('.1234')/Decimal ('81.7'))"
+
+        The format of the exponent is inherited from the platform C library.
+        Known cases are "e-007" (Windows) and "e-07" (not Windows). Since
+        we're only showing 12 digits, and the 13th isn't close to 5, the
+        rest of the output should be platform-independent.
+
+        >>> exec(s) #doctest: +ELLIPSIS
+        -3.21716034272e-0...7
+
+        Output from calculations with Decimal should be identical across all
+        platforms.
+
+        >>> exec(decistmt(s))
+        -3.217160342717258261933904529E-7
+        """
+        result = []
+        g = tokenize(BytesIO(s.encode('utf-8')).readline) # tokenize the string
+        for toknum, tokval, _, _, _ in g:
+            if toknum == NUMBER and '.' in tokval: # replace NUMBER tokens
+                result.extend([
+                    (NAME, 'Decimal'),
+                    (OP, '('),
+                    (STRING, repr(tokval)),
+                    (OP, ')')
+                ])
+            else:
+                result.append((toknum, tokval))
+        return untokenize(result).decode('utf-8')
 
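To complement the documentation added above, a short sketch (editor's illustration, not part of the diff; the sample source bytes and the exact contents of the consumed-lines list are assumptions) of detect_encoding and of the bytes round-trip through untokenize:

    from io import BytesIO
    from tokenize import detect_encoding, tokenize, untokenize

    source = b"# -*- coding: utf-8 -*-\nvalue = 81.7\n"

    # detect_encoding calls readline at most twice and returns the encoding
    # name plus the raw, still-undecoded lines it consumed.
    encoding, consumed = detect_encoding(BytesIO(source).readline)
    print(encoding)   # 'utf-8'
    print(consumed)   # the line(s) read so far, e.g. [b'# -*- coding: utf-8 -*-\n']

    # Because the token stream starts with an ENCODING token, untokenize can
    # encode its output back to bytes in that same encoding.
    round_trip = untokenize(tokenize(BytesIO(source).readline))
    print(type(round_trip))   # <class 'bytes'>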

Doc/whatsnew/3.0.rst

Lines changed: 3 additions & 0 deletions

@@ -392,6 +392,9 @@ details.
 * The functions :func:`os.tmpnam`, :func:`os.tempnam` and :func:`os.tmpfile`
   have been removed in favor of the :mod:`tempfile` module.
 
+* The :mod:`tokenize` module has been changed to work with bytes. The main
+  entry point is now :func:`tokenize.tokenize`, instead of generate_tokens.
+
 .. ======================================================================
 .. whole new modules get described in subsections here

Lib/idlelib/EditorWindow.py

Lines changed: 3 additions & 1 deletion

@@ -1437,7 +1437,9 @@ def run(self):
         _tokenize.tabsize = self.tabwidth
         try:
             try:
-                _tokenize.tokenize(self.readline, self.tokeneater)
+                tokens = _tokenize.generate_tokens(self.readline)
+                for token in tokens:
+                    self.tokeneater(*token)
             except _tokenize.TokenError:
                 # since we cut off the tokenizer early, we can trigger
                 # spurious errors
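
The same mechanical change appears in Lib/inspect.py below. As an editor's sketch (walk_tokens and handle_token are illustrative names, not from the commit), porting a caller of the removed two-argument tokenize.tokenize(readline, tokeneater) form means iterating the generator and applying the callback to each 5-tuple:

    import io
    import token
    import tokenize

    def walk_tokens(readline, handle_token):
        # Old style, removed by this commit:
        #     tokenize.tokenize(readline, handle_token)
        # New style: generate_tokens still accepts a str-producing readline,
        # and the callback is applied to each 5-tuple it yields.
        for tok in tokenize.generate_tokens(readline):
            handle_token(*tok)

    # Example: print the token names in a small snippet.
    walk_tokens(io.StringIO("spam = ['eggs']\n").readline,
                lambda type_, string, start, end, line:
                    print(token.tok_name[type_], repr(string)))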

Lib/inspect.py

Lines changed: 3 additions & 1 deletion

@@ -657,7 +657,9 @@ def getblock(lines):
     """Extract the block of code at the top of the given list of lines."""
     blockfinder = BlockFinder()
     try:
-        tokenize.tokenize(iter(lines).__next__, blockfinder.tokeneater)
+        tokens = tokenize.generate_tokens(iter(lines).__next__)
+        for _token in tokens:
+            blockfinder.tokeneater(*_token)
     except (EndOfBlock, IndentationError):
         pass
     return lines[:blockfinder.last]
