|
| 1 | +\section{\module{codecs} --- |
| 2 | + Python codec registry and base classes} |
| 3 | + |
| 4 | +\declaremodule{standard}{codecs} |
| 5 | +\modulesynopsis{Encode and decode data and streams.} |
| 6 | +\moduleauthor{Marc-Andre Lemburg}{ [email protected]} |
| 7 | +\sectionauthor{Marc-Andre Lemburg}{ [email protected]} |
| 8 | + |
| 9 | + |
| 10 | +\index{Unicode} |
| 11 | +\index{Codecs} |
| 12 | +\indexii{Codecs}{encode} |
| 13 | +\indexii{Codecs}{decode} |
| 14 | +\index{streams} |
| 15 | +\indexii{stackable}{streams} |
| 16 | + |
| 17 | + |
| 18 | +This module defines base classes for standard Python codecs (encoders |
| 19 | +and decoders) and provides access to the internal Python codec |
| 20 | +registry which manages the codec lookup process. |
| 21 | + |
| 22 | +It defines the following functions: |
| 23 | + |
| 24 | +\begin{funcdesc}{register}{search_function} |
| 25 | +Register a codec search function. Search functions are expected to |
| 26 | +take one argument, the encoding name in all lower case letters, and |
| 27 | +return a tuple of functions \code{(\var{encoder}, \var{decoder}, \var{stream_reader}, |
| 28 | +\var{stream_writer})} taking the following arguments: |
| 29 | + |
| 30 | + \var{encoder} and \var{decoder}: These must be functions or methods |
| 31 | + which have the same interface as the .encode/.decode methods of |
| 32 | + Codec instances (see Codec Interface). The functions/methods are |
| 33 | + expected to work in a stateless mode. |
| 34 | + |
| 35 | + \var{stream_reader} and \var{stream_writer}: These have to be |
| 36 | + factory functions providing the following interface: |
| 37 | + |
| 38 | + \code{factory(\var{stream},\var{errors}='strict')} |
| 39 | + |
| 40 | + The factory functions must return objects providing the interfaces |
| 41 | + defined by the base classes |
| 42 | + \class{StreamReader} and \class{StreamWriter}, respectively. Stream codecs can |
| 43 | + maintain state. |
| 44 | + |
| 45 | + Possible values for errors are 'strict' (raise an exception in case |
| 46 | + of an encoding error), 'replace' (replace malformed data with a |
| 47 | + suitable replacement marker, e.g. '?') and 'ignore' (ignore |
| 48 | + malformed data and continue without further notice). |
| 49 | + |
| 50 | +In case a search function cannot find a given encoding, it should |
| 51 | +return None. |
| 52 | +\end{funcdesc} |
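The registration protocol described above can be sketched as follows. This is a minimal illustration, assuming a current Python 3 interpreter, where the returned tuple is a `codecs.CodecInfo` object; the alias name `demo_alias` and the function `demo_search` are hypothetical and not part of the module.

```python
import codecs


def demo_search(name):
    """Hypothetical search function: receives the lower-cased encoding name."""
    if name == "demo_alias":
        # Reuse the built-in Latin-1 codec tuple for the made-up alias.
        return codecs.lookup("latin-1")
    # Unknown encoding: let the registry try the next search function.
    return None


codecs.register(demo_search)

encoded = "caf\u00e9".encode("demo_alias")
decoded = encoded.decode("demo_alias")
```

Returning `None` for every other name keeps the search function from shadowing codecs it does not know about.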
| 53 | + |
| 54 | +\begin{funcdesc}{lookup}{encoding} |
| 55 | +Looks up a codec tuple in the Python codec registry and returns the |
| 56 | +function tuple as defined above. |
| 57 | + |
| 58 | +Encodings are first looked up in the registry's cache. If not found, |
| 59 | +the list of registered search functions is scanned. If no codecs tuple |
| 60 | +is found, a \exception{LookupError} is raised. Otherwise, the codecs tuple is |
| 61 | +stored in the cache and returned to the caller. |
| 62 | +\end{funcdesc} |
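A short sketch of a lookup, assuming a current Python 3 interpreter, where the result still unpacks as the four-element tuple described above. The encoding name `no_such_codec_demo` is a made-up name chosen so that the lookup fails.

```python
import codecs

# The result unpacks into the four entries of the codec tuple.
encoder, decoder, stream_reader, stream_writer = codecs.lookup("utf-8")

# Stateless encoder/decoder functions return (output, length consumed).
data, consumed = encoder("abc")
text, used = decoder(b"abc")

# An unknown encoding raises LookupError.
try:
    codecs.lookup("no_such_codec_demo")
    found = True
except LookupError:
    found = False
```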
| 63 | + |
| 64 | +To simplify working with encoded files or streams, the module |
| 65 | +also defines these utility functions: |
| 66 | + |
| 67 | +\begin{funcdesc}{open}{filename, mode\optional{, encoding=None, errors='strict', buffering=1}} |
| 68 | +Open an encoded file using the given \var{mode} and return |
| 69 | +a wrapped version providing transparent encoding/decoding. |
| 70 | + |
| 71 | +Note: The wrapped version will only accept the object format defined |
| 72 | +by the codecs, i.e. Unicode objects for most builtin codecs. Output is |
| 73 | +also codec dependent and will usually be Unicode as well. |
| 74 | + |
| 75 | +\var{encoding} specifies the encoding which is to be used for |
| 76 | +the file. |
| 77 | + |
| 78 | +\var{errors} may be given to define the error handling. It defaults |
| 79 | +to 'strict' which causes a \exception{ValueError} to be raised in case |
| 80 | +an encoding error occurs. |
| 81 | + |
| 82 | +\var{buffering} has the same meaning as for the builtin open() API. |
| 83 | +It defaults to line buffered. |
| 84 | +\end{funcdesc} |
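As an illustration of the wrapped file object, the sketch below writes and re-reads a small UTF-8 file in a temporary directory. It assumes a current Python 3 interpreter; the file name `demo.txt` is arbitrary, and the warning filter is only a precaution for interpreters that flag this function as deprecated.

```python
import codecs
import os
import tempfile
import warnings

path = os.path.join(tempfile.mkdtemp(), "demo.txt")

with warnings.catch_warnings():
    # Precaution: some newer interpreters warn when codecs.open() is used.
    warnings.simplefilter("ignore", DeprecationWarning)

    # Unicode text written to the wrapper is encoded on the way out.
    with codecs.open(path, "w", encoding="utf-8") as f:
        f.write("Gr\u00fc\u00dfe\n")

    # Reading through the wrapper decodes the bytes back to Unicode.
    with codecs.open(path, "r", encoding="utf-8") as f:
        text = f.read()

# The underlying file holds the encoded bytes.
with open(path, "rb") as raw:
    raw_bytes = raw.read()
```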
| 85 | + |
| 86 | +\begin{funcdesc}{EncodedFile}{file, input\optional{, output=None, errors='strict'}} |
| 87 | + |
| 88 | +Return a wrapped version of \var{file} which provides transparent |
| 89 | +encoding translation. |
| 90 | + |
| 91 | +Strings written to the wrapped file are interpreted according to the |
| 92 | +given \var{input} encoding and then written to the original file as |
| 93 | +string using the \var{output} encoding. The intermediate encoding will |
| 94 | +usually be Unicode but depends on the specified codecs. |
| 95 | + |
| 96 | +If \var{output} is not given, it defaults to \var{input}. |
| 97 | + |
| 98 | +\var{errors} may be given to define the error handling. It defaults to |
| 99 | +'strict' which causes \exception{ValueError} to be raised in case |
| 100 | +an encoding error occurs. |
| 101 | +\end{funcdesc} |
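The translation performed by the wrapper can be sketched with an in-memory file. This is a minimal example, assuming a current Python 3 interpreter, where the data passed in and out are byte strings and the two encodings are given positionally in the order \var{input}, \var{output}.

```python
import codecs
import io

# Writing: UTF-8 bytes go in, Latin-1 bytes land in the underlying file.
raw = io.BytesIO()
wrapped = codecs.EncodedFile(raw, "utf-8", "latin-1")
wrapped.write("\u00e9".encode("utf-8"))
stored = raw.getvalue()

# Reading: Latin-1 bytes in the file come back as UTF-8 bytes.
reader = codecs.EncodedFile(io.BytesIO(b"\xe9"), "utf-8", "latin-1")
read_back = reader.read()
```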
| 102 | + |
| 103 | + |
| 104 | + |
| 105 | +...XXX document codec base classes... |
| 106 | + |
| 107 | + |
| 108 | + |
| 109 | +The module also provides the following constants, which are useful |
| 110 | +for reading and writing platform-dependent files: |
| 111 | + |
| 112 | +\begin{datadesc}{BOM} |
| 113 | +\dataline{BOM_BE} |
| 114 | +\dataline{BOM_LE} |
| 115 | +\dataline{BOM32_BE} |
| 116 | +\dataline{BOM32_LE} |
| 117 | +\dataline{BOM64_BE} |
| 118 | +\dataline{BOM64_LE} |
| 119 | +These constants define the byte order marks (BOM) used in data |
| 120 | +streams to indicate the byte order used in the stream or file. |
| 121 | +\constant{BOM} is either \constant{BOM_BE} or \constant{BOM_LE} |
| 122 | +depending on the platform's native byte order, while the others |
| 123 | +represent big endian (\samp{_BE} suffix) and little endian |
| 124 | +(\samp{_LE} suffix) byte order using 32-bit and 64-bit encodings. |
| 125 | +\end{datadesc} |
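One way these constants might be used is to check the start of a data stream for a byte order mark. The helper `detect_byte_order` below is hypothetical and only handles the two-byte marks `BOM_BE` and `BOM_LE`; it assumes a current Python 3 interpreter, where the constants are byte strings.

```python
import codecs
import sys

# BOM matches the native byte order of the platform.
native = codecs.BOM_LE if sys.byteorder == "little" else codecs.BOM_BE


def detect_byte_order(data):
    """Hypothetical helper: report the byte order announced by a leading BOM."""
    if data.startswith(codecs.BOM_BE):
        return "big"
    if data.startswith(codecs.BOM_LE):
        return "little"
    # No two-byte mark present.
    return None


sample = codecs.BOM_BE + "hi".encode("utf-16-be")
order = detect_byte_order(sample)
```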
| 126 | + |