Lua lexer modifications#3143

Open
MtScience wants to merge 3 commits into pygments:master from MtScience:lua-improvements

Conversation

@MtScience

Contributor

Motivation

When experimenting with Pygments, I noticed that the Lua lexer behaved somewhat differently from other lexers. Namely:

  1. it emitted consecutive punctuation as a single token (that is, something like ({}) was rendered as one Punctuation token), unlike e.g. the C, Haskell, and many other lexers;
  2. it accepted any combination of the characters =<>|~&+\-*/%#^ as an operator (and output it as a single token), despite the fact that most such combinations are not valid Lua operators, which might interfere with using filters. Meanwhile, the Ruby lexer, for example, only accepts valid Ruby operators;
  3. it didn't support variable attributes (which were added in Lua 5.4).

Changes made

  1. Rewrote the operator regex, so that only valid Lua operators are emitted as tokens. E.g., something like <>= will now result in a < token and a >= token, not a single <>= token.
  2. Removed the + from the punctuation regex, so Punctuation tokens are now emitted one by one, as in numerous other lexers (making the behavior of the Lua lexer more consistent with the rest of the library).
  3. Added support for attributes (taking some inspiration from the C lexer). Attributes are only parsed as such if they appear after a variable name. I.e., this code
    local a <const> = 10
    will produce a Name.Attribute token, however this code
    local function a <const> ()
    end
    will not.
  4. Fixed a couple of typos in docstrings.
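
The attribute rule in item 3 can be illustrated with a rough sketch. This is not the pattern used in the PR: the regex `LOCAL_ATTR` and the helper `find_attribute` are hypothetical, written with Python's standard `re` module only, and they look at just the first name of a `local` declaration.

```python
import re

# Hypothetical pattern (NOT taken from the PR): a `local` declaration whose
# variable name is directly followed by an attribute such as <const>.
# The negative lookahead skips `local function ...` declarations.
LOCAL_ATTR = re.compile(
    r'\blocal\s+(?!function\b)([A-Za-z_]\w*)\s*<\s*([A-Za-z_]\w*)\s*>'
)

def find_attribute(line):
    """Return the attribute name after the first local variable, or None."""
    match = LOCAL_ATTR.search(line)
    return match.group(2) if match else None

with_attribute = find_attribute('local a <const> = 10')
without_attribute = find_attribute('local function a <const> ()')
print(with_attribute)     # const
print(without_attribute)  # None
```

The real lexer presumably handles this with lexer states rather than a single regex, so treat the above only as a description of the intended behavior.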

@birkenfeld

Member

Thanks for the PR. For clarification, Pygments is a highlighter, not a language parser or interpreter, so its tokenizing doesn't have to (but often does) correspond to the tokens a parser would see. In that sense, having {} in a single token is perfectly fine and even preferable, since it is more compact.

Accepting only existing operators is of course a valid choice, especially when that is easy to do. In some instances (e.g. where it's up to the context which operator is valid) it would be the pragmatic thing to just accept all, since we do not need to reject invalid code and highlighting something that the compiler/interpreter would complain about is acceptable.

@MtScience

Contributor Author

Thank you for the quick review. I know that Pygments is not supposed to be a 100% robust language parser and it was not my intention to make it into one. I merely thought that it is always nice to have consistent behavior, and most Pygments lexers (or, at least, those I've played with) tend to emit separate tokens for separate punctuation characters, so I changed the Lua lexer accordingly.

Regarding operators: my changes do not result in the lexer rejecting invalid code; it still accepts everything that looks like an operator, just separates the input into several tokens.
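
To illustrate the point about operators, here is a minimal sketch that uses only Python's `re` module and the old and new operator regexes from this PR's diff. The helper `split_operators` is hypothetical; it imitates how a lexer rule is matched repeatedly at the current position and is not how Pygments is actually invoked.

```python
import re

# The two regexes are copied from the diff in this PR.
OLD_OPERATOR = re.compile(r'[=<>|~&+\-*/%#^]+|\.\.')
NEW_OPERATOR = re.compile(r'[+\-*%^&|#]|//?|>>|<<|\.\.|[=~<>]=?')

def split_operators(pattern, text):
    """Match `pattern` repeatedly from left to right and collect the pieces."""
    position, tokens = 0, []
    while position < len(text):
        match = pattern.match(text, position)
        if match is None:
            raise ValueError(f'no operator at position {position}')
        tokens.append(match.group())
        position = match.end()
    return tokens

old_tokens = split_operators(OLD_OPERATOR, '<>=')
new_tokens = split_operators(NEW_OPERATOR, '<>=')
print(old_tokens)  # ['<>=']
print(new_tokens)  # ['<', '>=']
```

Both versions consume the whole input; the new regex only changes where the token boundaries fall.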

I can, of course, roll back these changes if they are somehow problematic. Is the last change (the addition of attribute support) all right, though?

@birkenfeld

Member

The comment wasn't meant as a negative review, just a pointer on which directions make more or less sense to go in when updating lexers 😁

@MtScience

MtScience commented May 20, 2026

Contributor Author

Oh, I see. I guess it means I've had too much exposure to my local culture and now I'm assuming the worst in any given situation 😅. Thank you, and sorry for straying off topic.

So, are any changes in order?

@birkenfeld birkenfeld left a comment

Member

Just one thing, otherwise LGTM.

Comment thread pygments/lexers/scripting.py Outdated
- (r'[=<>|~&+\-*/%#^]+|\.\.', Operator),
- (r'[\[\]{}().,:;]+', Punctuation),
+ (r'[+\-*%^&|#]|//?|>>|<<|\.\.|[=~<>]=?', Operator),
+ (r'[\[\]{}().,:;]', Punctuation),

Member

This change can be reverted:

Suggested change
- (r'[\[\]{}().,:;]', Punctuation),
+ (r'[\[\]{}().,:;]+', Punctuation),
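
For reference, the effect of the trailing + can be checked with a small sketch using only Python's `re` module. The two patterns are the punctuation regexes from this thread; the helper `punctuation_tokens` is hypothetical and only mimics left-to-right rule matching.

```python
import re

# Punctuation regexes from this comment thread, without and with the `+`.
SINGLE = re.compile(r'[\[\]{}().,:;]')
GROUPED = re.compile(r'[\[\]{}().,:;]+')

def punctuation_tokens(pattern, text):
    """Match `pattern` repeatedly from left to right over punctuation-only text."""
    position, tokens = 0, []
    while position < len(text):
        match = pattern.match(text, position)
        if match is None:
            raise ValueError(f'no punctuation at position {position}')
        tokens.append(match.group())
        position = match.end()
    return tokens

single_tokens = punctuation_tokens(SINGLE, '({})')
grouped_tokens = punctuation_tokens(GROUPED, '({})')
print(single_tokens)   # ['(', '{', '}', ')']
print(grouped_tokens)  # ['({})']
```

With the + restored, a run such as ({}) becomes one Punctuation token, which is the more compact output preferred in the review.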

@MtScience

Contributor Author

Done.
