The :mod:`tokenize` module provides a lexical scanner for Python source code,
implemented in Python. The scanner in this module returns comments as tokens
as well, making it useful for implementing "pretty-printers," including
colorizers for on-screen displays.

The primary entry point is a :term:`generator`:


.. function:: tokenize(readline)

   The :func:`tokenize` generator requires one argument, *readline*, which
   must be a callable object which provides the same interface as the
   :meth:`readline` method of built-in file objects (see section
   :ref:`bltin-file-objects`). Each call to the function should return one
   line of input as bytes.

   The generator produces 5-tuples with these members: the token type; the
   token string; a 2-tuple ``(srow, scol)`` of ints specifying the row and
   column where the token begins in the source; a 2-tuple ``(erow, ecol)`` of
   ints specifying the row and column where the token ends in the source; and
   the line on which the token was found. The line passed is the *logical*
   line; continuation lines are included.

   :func:`tokenize` determines the source encoding of the file by looking for
   a UTF-8 BOM or an encoding cookie, according to :pep:`263`.
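
   For example, here is a minimal sketch (the buffer contents and names are
   illustrative) that drives the generator from an in-memory bytes buffer,
   using :class:`io.BytesIO` to supply the required *readline* callable::

       from io import BytesIO
       from token import tok_name
       from tokenize import tokenize

       source = b'x = 1\n'  # tokenize requires bytes, not str
       for toknum, tokval, start, end, line in tokenize(BytesIO(source).readline):
           # the first 5-tuple produced is always the ENCODING token
           print(tok_name[toknum], repr(tokval), start, end)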


All constants from the :mod:`token` module are also exported from
:mod:`tokenize`, as are three additional token type values:

.. data:: COMMENT

   Token value used to indicate a comment.

.. data:: NL

   Token value used to indicate a non-terminating newline. The NEWLINE token
   indicates the end of a logical line of Python code; NL tokens are generated
   when a logical line of code is continued over multiple physical lines.
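
   As a sketch of the distinction (the example source is illustrative): in the
   snippet below, the line break inside the parentheses produces an NL token,
   while the end of the complete statement produces a NEWLINE token::

       from io import BytesIO
       from token import tok_name
       from tokenize import tokenize

       source = b'total = (1 +\n         2)\n'
       for toknum, tokval, _, _, _ in tokenize(BytesIO(source).readline):
           print(tok_name[toknum], repr(tokval))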


.. data:: ENCODING

   Token value that indicates the encoding used to decode the source bytes
   into text. The first token returned by :func:`tokenize` will always be an
   ENCODING token.
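
   A quick sketch: for a plain ASCII source with no BOM or cookie, the leading
   token carries the default encoding::

       from io import BytesIO
       from tokenize import tokenize

       first = next(tokenize(BytesIO(b'pass\n').readline))
       print(first[1])  # 'utf-8', the detected source encoding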


Another function is provided to reverse the tokenization process. This is
useful for creating tools that tokenize a script, modify the token stream, and
write back the modified script.


.. function:: untokenize(iterable)

   Converts tokens back into Python source code. The *iterable* must return
   sequences with at least two elements, the token type and the token string.
   Any additional sequence elements are ignored.

   The reconstructed script is returned as a single bytes object, encoded
   using the ENCODING token, which is the first token sequence output by
   :func:`tokenize`. The result is guaranteed to tokenize back to match the
   input so that the conversion is lossless and round-trips are assured. The
   guarantee applies only to the token type and token string, as the spacing
   between tokens (column positions) may change.
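
   A sketch of the round-trip guarantee (variable names are illustrative);
   note that only the token types and strings, not the exact spacing, are
   promised to survive::

       from io import BytesIO
       from tokenize import tokenize, untokenize

       source = b'x = 3.0 + 4.0\n'
       tokens = list(tokenize(BytesIO(source).readline))
       restored = untokenize(tokens)  # bytes, encoded per the ENCODING token
       # the restored bytes tokenize to the same (type, string) pairs
       assert ([t[:2] for t in tokenize(BytesIO(restored).readline)]
               == [t[:2] for t in tokens])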


:func:`tokenize` needs to detect the encoding of source files it tokenizes. The
function it uses to do this is available:

.. function:: detect_encoding(readline)

   The :func:`detect_encoding` function is used to detect the encoding that
   should be used to decode a Python source file. It requires one argument,
   *readline*, in the same way as the :func:`tokenize` generator.

   It will call *readline* a maximum of twice, and return the encoding used
   (as a string) and a list of any lines (not decoded from bytes) it has
   read in.

   It detects the encoding from the presence of a UTF-8 BOM or an encoding
   cookie as specified in :pep:`263`. If both a BOM and a cookie are present,
   but disagree, a :exc:`SyntaxError` will be raised.

   If no encoding is specified, then the default of ``'utf-8'`` will be
   returned.
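
   A short sketch, feeding the function a buffer that begins with an encoding
   cookie (the buffer contents are illustrative)::

       from io import BytesIO
       from tokenize import detect_encoding

       source = b'# -*- coding: utf-8 -*-\nspam = 1\n'
       encoding, lines = detect_encoding(BytesIO(source).readline)
       print(encoding)  # 'utf-8'
       print(lines)     # the raw, still-undecoded lines read so far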


Example of a script re-writer that transforms float literals into Decimal
objects::

    from io import BytesIO
    from tokenize import tokenize, untokenize, NUMBER, STRING, NAME, OP

    def decistmt(s):
        """Substitute Decimals for floats in a string of statements.

        >>> from decimal import Decimal
        >>> s = 'print(+21.3e-5*-.1234/81.7)'
        >>> decistmt(s)
        "print(+Decimal('21.3e-5')*-Decimal('.1234')/Decimal('81.7'))"

        The format of the exponent is inherited from the platform C library.
        Known cases are "e-007" (Windows) and "e-07" (not Windows). Since
        we're only showing 12 digits, and the 13th isn't close to 5, the
        rest of the output should be platform-independent.

        >>> exec(s) #doctest: +ELLIPSIS
        -3.21716034272e-0...7

        Output from calculations with Decimal should be identical across all
        platforms.

        >>> exec(decistmt(s))
        -3.217160342717258261933904529E-7
        """
        result = []
        g = tokenize(BytesIO(s.encode('utf-8')).readline)  # tokenize the string
        for toknum, tokval, _, _, _ in g:
            if toknum == NUMBER and '.' in tokval:  # replace NUMBER tokens
                result.extend([
                    (NAME, 'Decimal'),
                    (OP, '('),
                    (STRING, repr(tokval)),
                    (OP, ')')
                ])
            else:
                result.append((toknum, tokval))
        return untokenize(result).decode('utf-8')