Description
Currently most syntax errors raised in the compiler (except those raised in the parser) use PyErr_ProgramTextObject() to get the line of code. It does not know the encoding of the source file and interprets the line as UTF-8 (failing if the line contains non-UTF-8 sequences). The parser uses _PyErr_ProgramDecodedTextObject() instead.
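The failure mode can be reproduced from Python without touching the C API. The following is a minimal sketch, assuming a source file that declares a Latin-1 encoding; the sample bytes and variable names are illustrative only.

```python
# Source bytes as they would sit on disk: a Latin-1 coding cookie
# followed by a line containing the byte 0xE9 ("é" in Latin-1).
source = b"# -*- coding: latin-1 -*-\nname = '\xe9'\n"

# Take the second line as raw bytes, without decoding it.
raw_line = source.splitlines()[1]

# Interpreting the line as UTF-8 (what happens when the encoding is
# unknown) fails: a lone 0xE9 byte is not a valid UTF-8 sequence.
try:
    raw_line.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Decoding with the declared encoding works.
decoded = raw_line.decode("latin-1")
```

Here `utf8_ok` ends up `False`, while `decoded` holds the expected line text.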
There are two ways to solve this issue:
- Pass the source file encoding from the parser to the code generator. This may require changing some data structures, but it is more efficient.
- Detect the encoding in PyErr_ProgramTextObject(). Since this function is part of the public C API, the change can also affect third-party code.
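The second option can be sketched in Python with tokenize.detect_encoding(), which checks for a BOM and for a coding cookie in the first two lines. This only illustrates the idea and is not a proposal for how the C function should be implemented; read_source_line is a hypothetical helper.

```python
import io
import tokenize


def read_source_line(data: bytes, lineno: int) -> str:
    """Return line `lineno` (1-based) of `data`, decoded with the detected encoding.

    Hypothetical helper for illustration; returns "" if the line does not exist.
    """
    # detect_encoding() inspects a UTF-8 BOM and a coding cookie.
    encoding, _ = tokenize.detect_encoding(io.BytesIO(data).readline)
    # With a BOM the detected encoding is "utf-8-sig", so decoding strips the BOM.
    with io.TextIOWrapper(io.BytesIO(data), encoding=encoding) as stream:
        for current, line in enumerate(stream, start=1):
            if current == lineno:
                return line.rstrip("\n")
    return ""


sample = b"# -*- coding: latin-1 -*-\nname = '\xe9'\n"
line = read_source_line(sample, 2)  # line == "name = 'é'"
```

Detecting the encoding on every call means re-reading the start of the file, which is the efficiency cost mentioned for this option.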
There are other issues with PyErr_ProgramTextObject():
- It leaves the BOM in the first line if the line contains one. This is not consistent with the reported offsets.
- For very long lines, it returns the tail of the line, the part that exceeds 1000 bytes. The tail can be short, it can start with an invalid character, and it is not consistent with the offsets. If an incomplete line has to be returned, it is better to return the head.
This all applies to PyErr_ProgramText() as well.
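The BOM and long-line problems listed above can be modelled in a few lines of Python. This is a simplified model of the described behaviour, not the C implementation: the 1000-byte limit is taken from the description, the input is assumed to be UTF-8, and the function names tail_of_line and head_of_line are hypothetical.

```python
import codecs

LIMIT = 1000  # byte limit mentioned in the description (assumed for this model)


def tail_of_line(line: bytes, limit: int = LIMIT) -> bytes:
    """Model of the reported behaviour: keep only the last chunk of a long line."""
    chunk = b""
    for start in range(0, len(line), limit):
        chunk = line[start:start + limit]  # each chunk overwrites the previous one
    return chunk


def head_of_line(line: bytes, limit: int = LIMIT) -> bytes:
    """Suggested alternative: drop a leading BOM and keep the head of the line."""
    if line.startswith(codecs.BOM_UTF8):
        line = line[len(codecs.BOM_UTF8):]
    head = line[:limit]
    # Back off so the cut does not split a multi-byte UTF-8 character:
    # a byte of the form 0b10xxxxxx right after the cut is a continuation byte.
    while head and len(head) < len(line) and (line[len(head)] & 0xC0) == 0x80:
        head = head[:-1]
    return head


# 999 ASCII bytes followed by a two-byte UTF-8 character (b"\xc3\xa9").
long_line = b"a" * 999 + "\u00e9".encode("utf-8")
tail = tail_of_line(long_line)  # a single continuation byte, invalid on its own
head = head_of_line(long_line)  # the first 999 bytes, cut on a character boundary
```

In this model the tail is one byte long and cannot be decoded, while the head still lines up with offsets counted from the start of the line.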