From e319adbe6931ff8759c3778e64525b064b255c17 Mon Sep 17 00:00:00 2001 From: Petr Viktorin Date: Wed, 5 Mar 2025 17:35:57 +0100 Subject: [PATCH 1/3] Link token names in Lexical analysis --- Doc/reference/lexical_analysis.rst | 66 +++++++++++++++++++----------- 1 file changed, 41 insertions(+), 25 deletions(-) diff --git a/Doc/reference/lexical_analysis.rst b/Doc/reference/lexical_analysis.rst index f7167032ad7df9..10f46e45cdef73 100644 --- a/Doc/reference/lexical_analysis.rst +++ b/Doc/reference/lexical_analysis.rst @@ -8,8 +8,8 @@ Lexical analysis .. index:: lexical analysis, parser, token A Python program is read by a *parser*. Input to the parser is a stream of -*tokens*, generated by the *lexical analyzer*. This chapter describes how the -lexical analyzer breaks a file into tokens. +:term:`tokens `, generated by the *lexical analyzer*. +This chapter describes how the lexical analyzer breaks a file into tokens. Python reads program text as Unicode code points; the encoding of a source file can be given by an encoding declaration and defaults to UTF-8, see :pep:`3120` @@ -34,11 +34,11 @@ Logical lines .. index:: logical line, physical line, line joining, NEWLINE token -The end of a logical line is represented by the token NEWLINE. Statements -cannot cross logical line boundaries except where NEWLINE is allowed by the -syntax (e.g., between statements in compound statements). A logical line is -constructed from one or more *physical lines* by following the explicit or -implicit *line joining* rules. +The end of a logical line is represented by the token :data:`~token.NEWLINE`. +Statements cannot cross logical line boundaries except where :data:`!NEWLINE` +is allowed by the syntax (e.g., between statements in compound statements). +A logical line is constructed from one or more *physical lines* by following +the explicit or implicit *line joining* rules. .. _physical-lines: @@ -159,11 +159,12 @@ Blank lines .. index:: single: blank line A logical line that contains only spaces, tabs, formfeeds and possibly a -comment, is ignored (i.e., no NEWLINE token is generated). During interactive -input of statements, handling of a blank line may differ depending on the -implementation of the read-eval-print loop. In the standard interactive -interpreter, an entirely blank logical line (i.e. one containing not even -whitespace or a comment) terminates a multi-line statement. +comment, is ignored (i.e., no :data:`~token.NEWLINE` token is generated). +During interactive input of statements, handling of a blank line may differ +depending on the implementation of the read-eval-print loop. +In the standard interactive interpreter, an entirely blank logical line (that +is, one containing not even whitespace or a comment) terminates a multi-line +statement. .. _indentation: @@ -201,19 +202,20 @@ the space count to zero). .. index:: INDENT token, DEDENT token -The indentation levels of consecutive lines are used to generate INDENT and -DEDENT tokens, using a stack, as follows. +The indentation levels of consecutive lines are used to generate +:data:`~token.INDENT` and :data:`~token.DEDENT` tokens, using a stack, +as follows. Before the first line of the file is read, a single zero is pushed on the stack; this will never be popped off again. The numbers pushed on the stack will always be strictly increasing from bottom to top. At the beginning of each logical line, the line's indentation level is compared to the top of the stack. If it is equal, nothing happens. 
If it is larger, it is pushed on the stack, and -one INDENT token is generated. If it is smaller, it *must* be one of the +one :data:`!INDENT` token is generated. If it is smaller, it *must* be one of the numbers occurring on the stack; all numbers on the stack that are larger are -popped off, and for each number popped off a DEDENT token is generated. At the -end of the file, a DEDENT token is generated for each number remaining on the -stack that is larger than zero. +popped off, and for each number popped off a :data:`!DEDENT` token is generated. +At the end of the file, a :data:`!DEDENT` token is generated for each number +remaining on the stack that is larger than zero. Here is an example of a correctly (though confusingly) indented piece of Python code:: @@ -253,8 +255,18 @@ Whitespace between tokens Except at the beginning of a logical line or in string literals, the whitespace characters space, tab and formfeed can be used interchangeably to separate tokens. Whitespace is needed between two tokens only if their concatenation -could otherwise be interpreted as a different token (e.g., ab is one token, but -a b is two tokens). +could otherwise be interpreted as a different token. For example, ``ab`` is one +token, but ``a b`` is two tokens. For another example, ``+a`` is two tokens, +and is equivalent to ``+ a``. + + +.. _endmarker-token: + +End marker +---------- + +At the end of non-interactive input, the lexical analyzer generates an +:data:`~token.ENDMARKER` token. .. _other-tokens: @@ -262,11 +274,15 @@ a b is two tokens). Other tokens ============ -Besides NEWLINE, INDENT and DEDENT, the following categories of tokens exist: -*identifiers*, *keywords*, *literals*, *operators*, and *delimiters*. Whitespace -characters (other than line terminators, discussed earlier) are not tokens, but -serve to delimit tokens. Where ambiguity exists, a token comprises the longest -possible string that forms a legal token, when read from left to right. +Besides :data:`~token.NEWLINE`, :data:`~token.INDENT` and :data:`~token.DEDENT`, +the following categories of tokens exist: +*identifiers* and *keywords* (:data:`~token.NAME`), *literals* (such as +:data:`~token.NUMBER` and :data:`~token.STRING`), and other symbols +(*operators* and *delimiters*, :data:`~token.OP`). +Whitespace characters (other than logical line terminators, discussed earlier) +are not tokens, but serve to delimit tokens. +Where ambiguity exists, a token comprises the longest possible string that +forms a legal token, when read from left to right. .. _identifiers: From 8ea8576663eeb79377f04e9c45231e3e2ebe495d Mon Sep 17 00:00:00 2001 From: Petr Viktorin Date: Thu, 8 May 2025 11:21:54 +0200 Subject: [PATCH 2/3] Update Doc/reference/lexical_analysis.rst --- Doc/reference/lexical_analysis.rst | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/Doc/reference/lexical_analysis.rst b/Doc/reference/lexical_analysis.rst index 356eecf7c12524..2cef1d8fead8a5 100644 --- a/Doc/reference/lexical_analysis.rst +++ b/Doc/reference/lexical_analysis.rst @@ -257,8 +257,8 @@ Except at the beginning of a logical line or in string literals, the whitespace characters space, tab and formfeed can be used interchangeably to separate tokens. Whitespace is needed between two tokens only if their concatenation could otherwise be interpreted as a different token. For example, ``ab`` is one -token, but ``a b`` is two tokens. For another example, ``+a`` is two tokens, -and is equivalent to ``+ a``. +token, but ``a b`` is two tokens. 
However, ``+a`` and ``+ a`` both produce +two tokens, ``a`` and ``b``, as ``+a`` is not a valid token. .. _endmarker-token: From f8044f78cc2177a10864f9091817ecc16d7892e0 Mon Sep 17 00:00:00 2001 From: Petr Viktorin Date: Thu, 8 May 2025 11:32:58 +0200 Subject: [PATCH 3/3] Fix Doc/reference/lexical_analysis.rst Co-authored-by: Adam Turner <9087854+AA-Turner@users.noreply.github.com> --- Doc/reference/lexical_analysis.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/Doc/reference/lexical_analysis.rst b/Doc/reference/lexical_analysis.rst index 2cef1d8fead8a5..c465cb16870d56 100644 --- a/Doc/reference/lexical_analysis.rst +++ b/Doc/reference/lexical_analysis.rst @@ -258,7 +258,7 @@ characters space, tab and formfeed can be used interchangeably to separate tokens. Whitespace is needed between two tokens only if their concatenation could otherwise be interpreted as a different token. For example, ``ab`` is one token, but ``a b`` is two tokens. However, ``+a`` and ``+ a`` both produce -two tokens, ``a`` and ``b``, as ``+a`` is not a valid token. +two tokens, ``+`` and ``a``, as ``+a`` is not a valid token. .. _endmarker-token:
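
A small sketch may help when reviewing the indentation wording in the first
patch. The following is a minimal, hypothetical model of the stack algorithm
described there; the function name ``indent_tokens`` and the idea of feeding
it a pre-computed list of per-line indentation levels are illustrative
assumptions, not how the real tokenizer is structured::

    def indent_tokens(indent_levels):
        """Yield INDENT/DEDENT token names for per-line indentation levels.

        A hypothetical sketch of the algorithm described in the docs,
        not the actual CPython tokenizer.
        """
        stack = [0]                       # a single zero; never popped off
        for level in indent_levels:
            if level > stack[-1]:         # larger: push, emit one INDENT
                stack.append(level)
                yield 'INDENT'
            else:
                while level < stack[-1]:  # smaller: pop, emit one DEDENT each
                    stack.pop()
                    yield 'DEDENT'
                if level != stack[-1]:    # must match a number on the stack
                    raise IndentationError('unindent does not match any '
                                           'outer indentation level')
        while stack[-1] > 0:              # end of file: close open blocks
            stack.pop()
            yield 'DEDENT'

For example, ``list(indent_tokens([0, 4, 8, 4]))`` yields
``['INDENT', 'INDENT', 'DEDENT', 'DEDENT']``: one ``INDENT`` per push, one
``DEDENT`` per pop, with the final ``DEDENT`` generated at end of input.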
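
The token stream discussed in the whitespace hunk can also be inspected
directly with the standard-library :mod:`tokenize` module; the sample source
string below is made up for the demonstration::

    import io
    import token
    import tokenize

    source = "if x:\n    y = +a\n"
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        # tok_name maps the numeric token type to its name, e.g. NAME, OP
        print(token.tok_name[tok.type], repr(tok.string))

This prints one line per token, including ``NEWLINE``, ``INDENT``, ``DEDENT``
and a final ``ENDMARKER``, and shows ``+a`` arriving as the two tokens
``OP '+'`` and ``NAME 'a'``, which is the behaviour the last two patches
converge on.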