Commit 474458d

Add constants BOM_UTF8, BOM_UTF16, BOM_UTF16_LE, BOM_UTF16_BE,
BOM_UTF32, BOM_UTF32_LE and BOM_UTF32_BE that represent the Byte Order Mark in UTF-8, UTF-16 and UTF-32 encodings for little and big endian systems. The old names BOM32_* and BOM64_* were off by a factor of 2. This closes SF bug http://www.python.org/sf/555360
1 parent bc48826 commit 474458d

3 files changed

Lines changed: 53 additions & 27 deletions

Doc/lib/libcodecs.tex

Lines changed: 15 additions & 10 deletions
@@ -142,16 +142,21 @@ \section{\module{codecs} ---
 \begin{datadesc}{BOM}
 \dataline{BOM_BE}
 \dataline{BOM_LE}
-\dataline{BOM32_BE}
-\dataline{BOM32_LE}
-\dataline{BOM64_BE}
-\dataline{BOM64_LE}
-These constants define the byte order marks (BOM) used in data
-streams to indicate the byte order used in the stream or file.
-\constant{BOM} is either \constant{BOM_BE} or \constant{BOM_LE}
-depending on the platform's native byte order, while the others
-represent big endian (\samp{_BE} suffix) and little endian
-(\samp{_LE} suffix) byte order using 32-bit and 64-bit encodings.
+\dataline{BOM_UTF8}
+\dataline{BOM_UTF16}
+\dataline{BOM_UTF16_BE}
+\dataline{BOM_UTF16_LE}
+\dataline{BOM_UTF32}
+\dataline{BOM_UTF32_BE}
+\dataline{BOM_UTF32_LE}
+These constants define various encodings of the Unicode byte order mark
+(BOM) used in UTF-16 and UTF-32 data streams to indicate the byte order
+used in the stream or file and in UTF-8 as a Unicode signature.
+\constant{BOM_UTF16} is either \constant{BOM_UTF16_BE} or
+\constant{BOM_UTF16_LE} depending on the platform's native byte order,
+\constant{BOM} is an alias for \constant{BOM_UTF16}, \constant{BOM_LE}
+for \constant{BOM_UTF16_LE} and \constant{BOM_BE} for \constant{BOM_UTF16_BE}.
+The others represent the BOM in UTF-8 and UTF-32 encodings.
 \end{datadesc}
 
 
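
The aliasing described in the documentation hunk can be checked against a current interpreter. The following is a minimal sketch, assuming Python 3 (where these constants are `bytes` rather than the `str` values of the Python 2 era in which this commit was made); the helper names `native_utf16_bom` and `native_utf32_bom` are hypothetical and not part of the `codecs` module.

```python
import codecs
import sys


def native_utf16_bom():
    """Pick the UTF-16 BOM matching this platform's byte order (hypothetical helper)."""
    if sys.byteorder == "little":
        return codecs.BOM_UTF16_LE
    return codecs.BOM_UTF16_BE


def native_utf32_bom():
    """Pick the UTF-32 BOM matching this platform's byte order (hypothetical helper)."""
    if sys.byteorder == "little":
        return codecs.BOM_UTF32_LE
    return codecs.BOM_UTF32_BE


# The native-order constants should agree with the explicit choice above,
# and the short names should be plain aliases of the UTF-16 constants.
print(codecs.BOM_UTF16 == native_utf16_bom())
print(codecs.BOM_UTF32 == native_utf32_bom())
print(codecs.BOM == codecs.BOM_UTF16)
print(codecs.BOM_LE == codecs.BOM_UTF16_LE, codecs.BOM_BE == codecs.BOM_UTF16_BE)
```

On either byte order, each comparison is expected to hold, since the documented relationship is between names rather than specific byte values.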

Lib/codecs.py

Lines changed: 32 additions & 17 deletions
@@ -18,29 +18,44 @@
           'Failed to load the builtin codecs: %s' % why
 
 __all__ = ["register", "lookup", "open", "EncodedFile", "BOM", "BOM_BE",
-           "BOM_LE", "BOM32_BE", "BOM32_LE", "BOM64_BE", "BOM64_LE"]
+           "BOM_LE", "BOM32_BE", "BOM32_LE", "BOM64_BE", "BOM64_LE",
+           "BOM_UTF8", "BOM_UTF16", "BOM_UTF16_LE", "BOM_UTF16_BE",
+           "BOM_UTF32", "BOM_UTF32_LE", "BOM_UTF32_BE"]
 
 ### Constants
 
 #
-# Byte Order Mark (BOM) and its possible values (BOM_BE, BOM_LE)
+# Byte Order Mark (BOM = ZERO WIDTH NO-BREAK SPACE = U+FEFF)
+# and its possible byte string values
+# for UTF8/UTF16/UTF32 output and little/big endian machines
 #
-BOM = struct.pack('=H', 0xFEFF)
-#
-BOM_BE = BOM32_BE = '\376\377'
-# corresponds to Unicode U+FEFF in UTF-16 on big endian
-# platforms == ZERO WIDTH NO-BREAK SPACE
-BOM_LE = BOM32_LE = '\377\376'
-# corresponds to Unicode U+FFFE in UTF-16 on little endian
-# platforms == defined as being an illegal Unicode character
 
-#
-# 64-bit Byte Order Marks
-#
-BOM64_BE = '\000\000\376\377'
-# corresponds to Unicode U+0000FEFF in UCS-4
-BOM64_LE = '\377\376\000\000'
-# corresponds to Unicode U+0000FFFE in UCS-4
+# UTF-8
+BOM_UTF8 = '\xef\xbb\xbf'
+
+# UTF-16, little endian
+BOM_LE = BOM_UTF16_LE = '\xff\xfe'
+
+# UTF-16, big endian
+BOM_BE = BOM_UTF16_BE = '\xfe\xff'
+
+# UTF-32, little endian
+BOM_UTF32_LE = '\xff\xfe\x00\x00'
+
+# UTF-32, big endian
+BOM_UTF32_BE = '\x00\x00\xfe\xff'
+
+# UTF-16, native endianness
+BOM = BOM_UTF16 = struct.pack('=H', 0xFEFF)
+
+# UTF-32, native endianness
+BOM_UTF32 = struct.pack('=L', 0x0000FEFF)
+
+# Old broken names (don't use in new code)
+BOM32_LE = BOM_UTF16_LE
+BOM32_BE = BOM_UTF16_BE
+BOM64_LE = BOM_UTF32_LE
+BOM64_BE = BOM_UTF32_BE
 
 
 ### Codec base classes (defining the API)
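
One common use of the constants defined in this hunk is guessing an encoding from the first bytes of a stream. The sketch below is illustrative only and assumes Python 3 `bytes` input; `sniff_bom` and the `_BOMS` table are hypothetical names, not part of `codecs`. The UTF-32 marks are tested first because `BOM_UTF32_LE` begins with the same two bytes as `BOM_UTF16_LE`.

```python
import codecs

# Hypothetical lookup table. Order matters: the 4-byte UTF-32 marks must be
# tried before the 2-byte UTF-16 marks, since '\xff\xfe\x00\x00' also starts
# with the UTF-16 little-endian mark '\xff\xfe'.
_BOMS = (
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
)


def sniff_bom(data):
    """Return (encoding, bom_length) for a leading BOM, or (None, 0) if none is found."""
    for bom, encoding in _BOMS:
        if data.startswith(bom):
            return encoding, len(bom)
    return None, 0


raw = codecs.BOM_UTF16_BE + "hi".encode("utf-16-be")
encoding, skip = sniff_bom(raw)
# Decode the payload after the mark, falling back to UTF-8 when no BOM is present.
text = raw[skip:].decode(encoding or "utf-8")
print(encoding, text)
```

A UTF-16 little-endian stream whose first character is U+0000 would be misread as UTF-32 by this ordering; that ambiguity is inherent to BOM sniffing and is not resolved here.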

Misc/NEWS

Lines changed: 6 additions & 0 deletions
@@ -124,6 +124,12 @@ Extension modules
 
 Library
 
+- Constants BOM_UTF8, BOM_UTF16, BOM_UTF16_LE, BOM_UTF16_BE,
+  BOM_UTF32, BOM_UTF32_LE and BOM_UTF32_BE that represent the Byte
+  Order Mark in UTF-8, UTF-16 and UTF-32 encodings for little and
+  big endian systems were added to the codecs module. The old names
+  BOM32_* and BOM64_* were off by a factor of 2.
+
 - added degree/radian conversion functions to the math module.
 
 - ftplib.retrlines() now tests for callback is None rather than testing