Commit a1ce93f

- Expose NullTranslations and GNUTranslations to __all__

- Set the default charset to iso-8859-1. It used to be None, which
  would cause problems with .ugettext() if the file had no charset
  parameter. Arguably, the po/mo file would be broken, but I still
  think iso-8859-1 is a reasonable default.

- Add a "coerce" default argument to GNUTranslations's constructor.
  The reason for this is that in Zope, we want all msgids and msgstrs
  to be Unicode. For the latter, we could use .ugettext() but there
  isn't currently a mechanism for Unicode-ifying msgids.

  The plan then is that the charset parameter specifies the encoding
  for both the msgids and msgstrs, and both are decoded to Unicode
  when read. For example, we might encode po files with utf-8. I
  think the GNU gettext tools don't care.

  Since this could potentially break code [*] that wants to use the
  encoded interface .gettext(), the constructor flag is added,
  defaulting to False. Most code I suspect will want to set this to
  True and use .ugettext().

- A few other minor changes from the Zope project, including asserting
  that a zero-length msgid must have a Project-ID-Version header for it
  to be counted as the metadata record.
1 parent de354b7 commit a1ce93f
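
The effect of the "coerce" flag described in the commit message can be illustrated with a small standalone sketch. It is not the code from this commit: it is written for Python 3 (where `bytes.decode()` plays the role of Python 2's `unicode()`), and the helper name `load_catalog` and its `raw_pairs` argument are hypothetical names introduced only for illustration.

```python
def load_catalog(raw_pairs, charset="iso-8859-1", coerce=False):
    """Build a msgid -> msgstr dict from (bytes, bytes) pairs.

    Hypothetical helper, not part of the gettext module. With
    coerce=True both the msgid and the msgstr are decoded using
    `charset`; otherwise both stay in their encoded form.
    """
    catalog = {}
    for msgid, msgstr in raw_pairs:
        if coerce:
            msgid = msgid.decode(charset)
            msgstr = msgstr.decode(charset)
        catalog[msgid] = msgstr
    return catalog


# A utf-8 encoded entry, as it might be read from a catalog file.
pairs = [("café".encode("utf-8"), "coffee shop".encode("utf-8"))]

encoded = load_catalog(pairs, charset="utf-8")               # keys and values stay bytes
decoded = load_catalog(pairs, charset="utf-8", coerce=True)  # keys and values become text
```

With `coerce=True` a lookup has to use a text key, which is why the commit message suggests pairing the flag with `.ugettext()` rather than `.gettext()`.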

3 files changed

Lines changed: 238 additions & 146 deletions

Doc/lib/libgettext.tex

Lines changed: 28 additions & 8 deletions
@@ -285,13 +285,17 @@ \subsubsection{The \class{GNUTranslations} class}
 \class{NullTranslations}: \class{GNUTranslations}. This class
 overrides \method{_parse()} to enable reading GNU \program{gettext}
 format \file{.mo} files in both big-endian and little-endian format.
-
-It also parses optional meta-data out of the translation catalog. It
-is convention with GNU \program{gettext} to include meta-data as the
-translation for the empty string. This meta-data is in \rfc{822}-style
-\code{key: value} pairs. If the key \code{Content-Type} is found,
-then the \code{charset} property is used to initialize the
-``protected'' \member{_charset} instance variable. The entire set of
+It also adds the ability to coerce both message ids and message
+strings to Unicode.
+
+\class{GNUTranslations} parses optional meta-data out of the
+translation catalog. It is convention with GNU \program{gettext} to
+include meta-data as the translation for the empty string. This
+meta-data is in \rfc{822}-style \code{key: value} pairs, and must
+contain the \code{Project-Id-Version}. If the key
+\code{Content-Type} is found, then the \code{charset} property is used
+to initialize the ``protected'' \member{_charset} instance variable,
+defaulting to \code{iso-8859-1} if not found. The entire set of
 key/value pairs are placed into a dictionary and set as the
 ``protected'' \member{_info} instance variable.

@@ -302,11 +306,27 @@ \subsubsection{The \class{GNUTranslations} class}
 The other usefully overridden method is \method{ugettext()}, which
 returns a Unicode string by passing both the translated message string
 and the value of the ``protected'' \member{_charset} variable to the
-builtin \function{unicode()} function.
+builtin \function{unicode()} function. Note that if you use
+\method{ugettext()} you probably also want your message ids to be
+Unicode. To do this, set the variable \var{coerce} to \code{True} in
+the \class{GNUTranslations} constructor. This ensures that both the
+message ids and message strings are decoded to Unicode when the file
+is read, using the file's \code{charset} value. If you do this, you
+will not want to use the \method{gettext()} method -- always use
+\method{ugettext()} instead.
 
 To facilitate plural forms, the methods \method{ngettext} and
 \method{ungettext} are overridden as well.
 
+\begin{methoddesc}[GNUTranslations]{__init__}{
+\optional{fp\optional{, coerce}}
+Constructs and parses a translation catalog in GNU gettext format.
+\var{fp} is passed to the base class (\class{NullTranslations})
+constructor. \var{coerce} is a flag specifying whether message ids
+and message strings should be converted to Unicode when the file is
+parsed. It defaults to \code{False} for backward compatibility.
+\end{methoddesc}
+
 \subsubsection{Solaris message catalog support}
 
 The Solaris operating system defines its own binary
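
The metadata rules in the documentation change above (the translation of the empty msgid counts as metadata only if it starts with Project-Id-Version, and the charset falls back to iso-8859-1) can be sketched as a standalone function. This is an illustration in Python 3, not the module's code: the name `parse_metadata` is hypothetical, and the key/value splitting and `charset=` extraction are assumptions, since the hunks shown here only include the first lines of the parsing loop.

```python
DEFAULT_CHARSET = "iso-8859-1"


def parse_metadata(header):
    """Return (info, charset) for the translation of the empty msgid.

    Hypothetical helper for illustration only. `header` is the decoded
    text of that translation.
    """
    info = {}
    charset = DEFAULT_CHARSET
    # Only treat the entry as metadata if it leads with Project-Id-Version.
    if not header.lower().startswith("project-id-version:"):
        return info, charset
    for item in header.splitlines():
        item = item.strip()
        if not item:
            continue
        # Assumed RFC 822 style "key: value" split; keys are lowercased.
        key, _, value = item.partition(":")
        key = key.strip().lower()
        value = value.strip()
        info[key] = value
        if key == "content-type" and "charset=" in value:
            charset = value.split("charset=", 1)[1].strip()
    return info, charset


header = (
    "Project-Id-Version: demo 1.0\n"
    "Content-Type: text/plain; charset=utf-8\n"
)
info, charset = parse_metadata(header)

# No Content-Type line: the default charset is kept.
no_charset = parse_metadata("Project-Id-Version: demo 1.0\n")[1]

# An empty msgid whose translation is not a header yields no metadata.
not_metadata = parse_metadata("Some ordinary translation")
```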

Lib/gettext.py

Lines changed: 30 additions & 13 deletions
@@ -50,8 +50,10 @@
 from errno import ENOENT
 
 
-__all__ = ["bindtextdomain","textdomain","gettext","dgettext",
-           "find","translation","install","Catalog"]
+__all__ = ['NullTranslations', 'GNUTranslations', 'Catalog',
+           'find', 'translation', 'install', 'textdomain', 'bindtextdomain',
+           'dgettext', 'dngettext', 'gettext', 'ngettext',
+           ]
 
 _default_localedir = os.path.join(sys.prefix, 'share', 'locale')

@@ -170,7 +172,7 @@ def _expand_lang(locale):
 class NullTranslations:
     def __init__(self, fp=None):
         self._info = {}
-        self._charset = None
+        self._charset = 'iso-8859-1'
         self._fallback = None
         if fp is not None:
             self._parse(fp)
@@ -226,6 +228,12 @@ class GNUTranslations(NullTranslations):
     LE_MAGIC = 0x950412deL
     BE_MAGIC = 0xde120495L
 
+    def __init__(self, fp=None, coerce=False):
+        # Set this attribute before calling the base class constructor, since
+        # the latter calls _parse() which depends on self._coerce.
+        self._coerce = coerce
+        NullTranslations.__init__(self, fp)
+
     def _parse(self, fp):
         """Override this method to support alternative .mo formats."""
         unpack = struct.unpack
@@ -260,16 +268,22 @@ def _parse(self, fp):
                     # Plural forms
                     msgid1, msgid2 = msg.split('\x00')
                     tmsg = tmsg.split('\x00')
+                    if self._coerce:
+                        msgid1 = unicode(msgid1, self._charset)
+                        tmsg = [unicode(x, self._charset) for x in tmsg]
                     for i in range(len(tmsg)):
                         catalog[(msgid1, i)] = tmsg[i]
                 else:
+                    if self._coerce:
+                        msg = unicode(msg, self._charset)
+                        tmsg = unicode(tmsg, self._charset)
                     catalog[msg] = tmsg
             else:
                 raise IOError(0, 'File is corrupt', filename)
             # See if we're looking at GNU .mo conventions for metadata
-            if mlen == 0:
+            if mlen == 0 and tmsg.lower().startswith('project-id-version:'):
                 # Catalog description
-                for item in tmsg.split('\n'):
+                for item in tmsg.splitlines():
                     item = item.strip()
                     if not item:
                         continue
@@ -297,7 +311,6 @@ def gettext(self, message):
                 return self._fallback.gettext(message)
             return message
 
-
     def ngettext(self, msgid1, msgid2, n):
         try:
             return self._catalog[(msgid1, self.plural(n))]
@@ -309,16 +322,17 @@ def ngettext(self, msgid1, msgid2, n):
             else:
                 return msgid2
 
-
     def ugettext(self, message):
-        try:
-            tmsg = self._catalog[message]
-        except KeyError:
+        missing = object()
+        tmsg = self._catalog.get(message, missing)
+        if tmsg is missing:
             if self._fallback:
                 return self._fallback.ugettext(message)
             tmsg = message
-        return unicode(tmsg, self._charset)
-
+        if not self._coerce:
+            return unicode(tmsg, self._charset)
+        # The msgstr is already coerced to Unicode
+        return tmsg
 
     def ungettext(self, msgid1, msgid2, n):
         try:
@@ -330,7 +344,10 @@ def ungettext(self, msgid1, msgid2, n):
                 tmsg = msgid1
             else:
                 tmsg = msgid2
-        return unicode(tmsg, self._charset)
+        if not self._coerce:
+            return unicode(tmsg, self._charset)
+        # The msgstr is already coerced to Unicode
+        return tmsg
 
 
 # Locate a .mo file using the gettext strategy
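
The reworked ugettext() in the hunk above replaces try/except KeyError with a lookup against a unique sentinel object, then decodes only when the catalog was not coerced. A minimal Python 3 sketch of that pattern follows. The function name `lookup`, the module-level `_MISSING` and the callable `fallback` parameter are hypothetical stand-ins for the method, its local `missing` and `self._fallback`. The sketch also assumes the caller passes a bytes message when `coerce` is false.

```python
_MISSING = object()  # unique sentinel: can never equal a real catalog value


def lookup(catalog, message, charset="iso-8859-1", coerce=False, fallback=None):
    """Sketch of the sentinel-based lookup; hypothetical, not the module's API."""
    tmsg = catalog.get(message, _MISSING)
    if tmsg is _MISSING:
        if fallback is not None:
            return fallback(message)
        # No translation: fall back to the message id itself.
        tmsg = message
    if not coerce:
        # Catalog holds encoded bytes, so decode on the way out.
        return tmsg.decode(charset)
    # The msgstr was already decoded when the catalog was built.
    return tmsg
```

Using an identity check against a fresh object avoids confusing a missing key with a stored value such as an empty string, which a `None` default could not distinguish as safely.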
