|
3 | 3 | % $Id$ |
4 | 4 |
|
5 | 5 | \title{What's New in Python 2.2} |
6 | | -\release{0.03} |
| 6 | +\release{0.04} |
7 | 7 | \author{A.M. Kuchling} |
8 | 8 | \authoraddress{ \email{ [email protected]}} |
9 | 9 | \begin{document} |
@@ -339,32 +339,46 @@ \section{PEP 255: Simple Generators} |
339 | 339 | \section{Unicode Changes} |
340 | 340 |
|
341 | 341 | Python's Unicode support has been enhanced a bit in 2.2. Unicode |
342 | | -strings are usually stored as UCS-2, as 16-bit unsigned integers. |
| 342 | +strings are usually stored as UTF-16, as 16-bit unsigned integers. |
343 | 343 | Python 2.2 can also be compiled to use UCS-4, 32-bit unsigned |
344 | 344 | integers, as its internal encoding by supplying |
345 | 345 | \longprogramopt{enable-unicode=ucs4} to the configure script. When |
346 | | -built to use UCS-4, in theory Python could handle Unicode characters |
347 | | -from U-00000000 to U-7FFFFFFF. Being able to use UCS-4 internally is |
348 | | -a necessary step to do that, but it's not the only step, and in Python |
349 | | -2.2alpha1 the work isn't complete yet. For example, the |
350 | | -\function{unichr()} function still only accepts values from 0 to |
351 | | -65535, and there's no \code{\e U} notation for embedding characters |
352 | | -greater than 65535 in a Unicode string literal. All this is the |
353 | | -province of the still-unimplemented PEP 261, ``Support for `wide' |
354 | | -Unicode characters''; consult it for further details, and please offer |
355 | | -comments and suggestions on the proposal it describes. |
356 | | - |
357 | | -Another change is much simpler to explain. |
358 | | -Since their introduction, Unicode strings have supported an |
359 | | -\method{encode()} method to convert the string to a selected encoding |
360 | | -such as UTF-8 or Latin-1. A symmetric |
361 | | -\method{decode(\optional{\var{encoding}})} method has been added to |
362 | | -both 8-bit and Unicode strings in 2.2, which assumes that the string |
363 | | -is in the specified encoding and decodes it. This means that |
364 | | -\method{encode()} and \method{decode()} can be called on both types of |
365 | | -strings, and can be used for tasks not directly related to Unicode. |
366 | | -For example, codecs have been added for UUencoding, MIME's base-64 |
367 | | -encoding, and compression with the \module{zlib} module. |
| 346 | +built to use UCS-4 (a ``wide Python''), the interpreter can natively |
| 347 | +handle Unicode characters from U+000000 to U+10FFFF. The range of |
| 348 | +legal values for the \function{unichr()} function has been expanded; |
| 349 | +it previously accepted only values up to 65535, but in 2.2 it accepts |
| 350 | +values from 0 to 0x10FFFF. Using a ``narrow Python'', an interpreter |
| 351 | +compiled to use UTF-16, values greater than 65535 will result in |
| 352 | +\function{unichr()} returning a string of length 2: |
| 353 | + |
| 354 | +\begin{verbatim} |
| 355 | +>>> s = unichr(65536) |
| 356 | +>>> s |
| 357 | +u'\ud800\udc00' |
| 358 | +>>> len(s) |
| 359 | +2 |
| 360 | +\end{verbatim} |
| 361 | + |
| 362 | +This possibly-confusing behaviour, breaking the intuitive invariant |
| 363 | +that \function{chr()} and \function{unichr()} always return strings of |
| 364 | +length 1, may be changed later in 2.2 depending on public reaction. |
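The length-2 result on a narrow build follows from the way UTF-16 represents code points above 65535 as a pair of 16-bit surrogates. As a rough illustration only, written in present-day Python 3 rather than Python 2.2, the helper to_surrogate_pair() below is hypothetical (it is not part of any Python API) and simply reproduces the arithmetic behind the unichr(65536) example:

```python
def to_surrogate_pair(code_point):
    """Split a code point above 0xFFFF into UTF-16 (high, low) surrogates.

    Hypothetical helper for illustration; not a Python built-in.
    """
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point must be in range 0x10000..0x10FFFF")
    offset = code_point - 0x10000      # 20-bit value to split in two
    high = 0xD800 + (offset >> 10)     # top 10 bits -> high surrogate
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits -> low surrogate
    return high, low


high, low = to_surrogate_pair(65536)
print(hex(high), hex(low))  # 0xd800 0xdc00
```

The two values match the u'\ud800\udc00' string shown for the narrow build, which is why len() reports 2 there.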
| 365 | + |
| 366 | +All this is the province of the still-unimplemented PEP 261, ``Support |
| 367 | +for `wide' Unicode characters''; consult it for further details, and |
| 368 | +please offer comments and suggestions on the proposal it describes. |
| 369 | + |
| 370 | +Another change is much simpler to explain. Since their introduction, |
| 371 | +Unicode strings have supported an \method{encode()} method to convert |
| 372 | +the string to a selected encoding such as UTF-8 or Latin-1. A |
| 373 | +symmetric \method{decode(\optional{\var{encoding}})} method has been |
| 374 | +added to 8-bit strings (though not to Unicode strings) in 2.2. |
| 375 | +\method{decode()} assumes that the string is in the specified encoding |
| 376 | +and decodes it, returning whatever is returned by the codec. |
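The symmetry between encode() and decode() can be sketched as a round trip. This is an illustrative example in present-day Python 3, where the byte-oriented type is bytes rather than the 8-bit string of Python 2.2; the sample text and variable names are assumptions chosen for the example:

```python
# Sample text containing one non-ASCII character (e with acute accent).
text = "caf\u00e9"

# encode() turns the Unicode string into bytes in the chosen encoding.
utf8_bytes = text.encode("utf-8")
latin1_bytes = text.encode("latin-1")

# decode() assumes the bytes are in the named encoding and reverses the step.
round_trip_utf8 = utf8_bytes.decode("utf-8")
round_trip_latin1 = latin1_bytes.decode("latin-1")

print(len(utf8_bytes), len(latin1_bytes))  # 5 4
print(round_trip_utf8 == text, round_trip_latin1 == text)  # True True
```

The same text occupies a different number of bytes in each encoding, so decode() must be given the encoding that was actually used to produce the bytes.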
| 377 | + |
| 378 | +Using this new feature, codecs have been added for tasks not directly |
| 379 | +related to Unicode. For example, codecs have been added for |
| 380 | +uu-encoding, MIME's base64 encoding, and compression with the |
| 381 | +\module{zlib} module: |
368 | 382 |
|
369 | 383 | \begin{verbatim} |
370 | 384 | >>> s = """Here is a lengthy piece of redundant, overly verbose, |
@@ -610,6 +624,15 @@ \section{Other Changes and Fixes} |
610 | 624 | been changed to use the new C-level interface. (Contributed by Fred |
611 | 625 | L. Drake, Jr.) |
612 | 626 |
|
| 627 | + \item Another low-level API, primarily of interest to implementors |
| 628 | + of Python debuggers and development tools, was added. |
| 629 | + \cfunction{PyInterpreterState_Head()} and |
| 630 | + \cfunction{PyInterpreterState_Next()} let a caller walk through all |
| 631 | + the existing interpreter objects; |
| 632 | + \cfunction{PyInterpreterState_ThreadHead()} and |
| 633 | + \cfunction{PyThreadState_Next()} allow looping over all the thread |
| 634 | + states for a given interpreter. (Contributed by David Beazley.) |
| 635 | + |
613 | 636 | % XXX is this explanation correct? |
614 | 637 | \item When presented with a Unicode filename on Windows, Python will |
615 | 638 | now correctly convert it to a string using the MBCS encoding. |
@@ -668,6 +691,7 @@ \section{Acknowledgements} |
668 | 691 |
|
669 | 692 | The author would like to thank the following people for offering |
670 | 693 | suggestions and corrections to various drafts of this article: Fred |
671 | | -Bremmer, Fred L. Drake, Jr., Tim Peters, Neil Schemenauer. |
| 694 | +Bremmer, Fred L. Drake, Jr., Marc-Andr\'e Lemburg, |
| 695 | +Tim Peters, Neil Schemenauer, Guido van Rossum. |
672 | 696 |
|
673 | 697 | \end{document} |