#18562: various revisions to the regex howto for 3.x

akuchling · akuchling · commit 3f4f3ba1a86c · 2013-08-18T18:57:22.000-04:00
* describe how \w is different when used in bytes and Unicode patterns.

* describe the re.ASCII flag, which changes that behaviour.

* remove personal references ('I generally prefer...')

* add some more links to the re module in the library reference

* various small edits and re-wording.
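
As a quick illustration of the first two items, here is a minimal sketch of how ``\w`` behaves in a string pattern, in a string pattern compiled with ``re.ASCII``, and in a bytes pattern. It is not part of the commit; the sample text and variable names are made up for the example.

```python
import re

# Hypothetical sample text containing two non-ASCII letters.
text = "na\u00efve caf\u00e9"

# str pattern: \w follows the Unicode database, so accented letters match.
unicode_words = re.findall(r"\w+", text)

# str pattern with re.ASCII: \w is restricted to [a-zA-Z0-9_].
ascii_words = re.findall(r"\w+", text, re.ASCII)

# bytes pattern: \w is always [a-zA-Z0-9_], so the UTF-8 bytes of the
# accented letters split the words.
bytes_words = re.findall(rb"\w+", text.encode("utf-8"))

print(unicode_words)
print(ascii_words)
print(bytes_words)
```

With ``re.ASCII`` the string pattern splits the words at the same places as the bytes pattern, which is the behaviour the revised howto text describes.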
diff --git a/Doc/howto/regex.rst b/Doc/howto/regex.rst
@@ -104,13 +104,25 @@ you can still match them in patterns; for example, if you need to match a ``[``
 or  ``\``, you can precede them with a backslash to remove their special
 meaning: ``\[`` or ``\\``.
 
-Some of the special sequences beginning with ``'\'`` represent predefined sets
-of characters that are often useful, such as the set of digits, the set of
-letters, or the set of anything that isn't whitespace.  The following predefined
-special sequences are a subset of those available. The equivalent classes are
-for bytes patterns. For a complete list of sequences and expanded class
-definitions for Unicode string patterns, see the last part of
-:ref:`Regular Expression Syntax <re-syntax>`.
+Some of the special sequences beginning with ``'\'`` represent
+predefined sets of characters that are often useful, such as the set
+of digits, the set of letters, or the set of anything that isn't
+whitespace.
+
+Let's take an example: ``\w`` matches any alphanumeric character.  If
+the regex pattern is expressed in bytes, this is equivalent to the
+class ``[a-zA-Z0-9_]``.  If the regex pattern is a string, ``\w`` will
+match the underscore and all the characters marked as letters or digits
+in the Unicode database provided by the :mod:`unicodedata` module.  You
+can use the more restricted definition of ``\w`` in a string pattern by
+supplying the :const:`re.ASCII` flag when compiling the regular expression.
+
+The following list of special sequences isn't complete. For a complete
+list of sequences and expanded class definitions for Unicode string
+patterns, see the last part of :ref:`Regular Expression Syntax
+<re-syntax>` in the Standard Library reference.  In general, the
+Unicode versions match any character that's in the appropriate
+category in the Unicode database.
 
 ``\d``
    Matches any decimal digit; this is equivalent to the class ``[0-9]``.
@@ -160,9 +172,8 @@ previous character can be matched zero or more times, instead of exactly once.
 For example, ``ca*t`` will match ``ct`` (0 ``a`` characters), ``cat`` (1 ``a``),
 ``caaat`` (3 ``a`` characters), and so forth.  The RE engine has various
 internal limitations stemming from the size of C's ``int`` type that will
-prevent it from matching over 2 billion ``a`` characters; you probably don't
-have enough memory to construct a string that large, so you shouldn't run into
-that limit.
+prevent it from matching over 2 billion ``a`` characters; patterns
+are usually not written to match that much data.
 
 Repetitions such as ``*`` are :dfn:`greedy`; when repeating a RE, the matching
 engine will try to repeat it as many times as possible. If later portions of the
@@ -353,7 +364,7 @@ for a complete listing.
 |                  | returns them as an :term:`iterator`.          |
 +------------------+-----------------------------------------------+
 
-:meth:`match` and :meth:`search` return ``None`` if no match can be found.  If
+:meth:`~re.regex.match` and :meth:`~re.regex.search` return ``None`` if no match can be found.  If
 they're successful, a :ref:`match object <match-objects>` instance is returned,
 containing information about the match: where it starts and ends, the substring
 it matched, and more.
@@ -419,8 +430,8 @@ Trying these methods will soon clarify their meaning::
    >>> m.span()
    (0, 5)
 
-:meth:`group` returns the substring that was matched by the RE.  :meth:`start`
-and :meth:`end` return the starting and ending index of the match. :meth:`span`
+:meth:`~re.match.group` returns the substring that was matched by the RE.  :meth:`~re.match.start`
+and :meth:`~re.match.end` return the starting and ending index of the match. :meth:`~re.match.span`
 returns both start and end indexes in a single tuple.  Since the :meth:`match`
 method only checks if the RE matches at the start of a string, :meth:`start`
 will always be zero.  However, the :meth:`search` method of patterns
@@ -448,14 +459,14 @@ In actual programs, the most common style is to store the
        print('No match')
 
 Two pattern methods return all of the matches for a pattern.
-:meth:`findall` returns a list of matching strings::
+:meth:`~re.regex.findall` returns a list of matching strings::
 
    >>> p = re.compile('\d+')
    >>> p.findall('12 drummers drumming, 11 pipers piping, 10 lords a-leaping')
    ['12', '11', '10']
 
 :meth:`findall` has to create the entire list before it can be returned as the
-result.  The :meth:`finditer` method returns a sequence of
+result.  The :meth:`~re.regex.finditer` method returns a sequence of
 :ref:`match object <match-objects>` instances as an :term:`iterator`::
 
    >>> iterator = p.finditer('12 drummers drumming, 11 ... 10 ...')
@@ -473,9 +484,9 @@ Module-Level Functions
 ----------------------
 
 You don't have to create a pattern object and call its methods; the
-:mod:`re` module also provides top-level functions called :func:`match`,
-:func:`search`, :func:`findall`, :func:`sub`, and so forth.  These functions
-take the same arguments as the corresponding pattern method, with
+:mod:`re` module also provides top-level functions called :func:`~re.match`,
+:func:`~re.search`, :func:`~re.findall`, :func:`~re.sub`, and so forth.  These functions
+take the same arguments as the corresponding pattern method with
 the RE string added as the first argument, and still return either ``None`` or a
 :ref:`match object <match-objects>` instance. ::
 
@@ -485,26 +496,15 @@ the RE string added as the first argument, and still return either ``None`` or a
    <_sre.SRE_Match object at 0x...>
 
 Under the hood, these functions simply create a pattern object for you
-and call the appropriate method on it.  They also store the compiled object in a
-cache, so future calls using the same RE are faster.
+and call the appropriate method on it.  They also store the compiled
+object in a cache, so future calls using the same RE won't need to
+parse the pattern again and again.
 
 Should you use these module-level functions, or should you get the
-pattern and call its methods yourself?  That choice depends on how
-frequently the RE will be used, and on your personal coding style.  If the RE is
-being used at only one point in the code, then the module functions are probably
-more convenient.  If a program contains a lot of regular expressions, or re-uses
-the same ones in several locations, then it might be worthwhile to collect all
-the definitions in one place, in a section of code that compiles all the REs
-ahead of time.  To take an example from the standard library, here's an extract
-from the now-defunct Python 2 standard :mod:`xmllib` module::
-
-   ref = re.compile( ... )
-   entityref = re.compile( ... )
-   charref = re.compile( ... )
-   starttagopen = re.compile( ... )
-
-I generally prefer to work with the compiled object, even for one-time uses, but
-few people will be as much of a purist about this as I am.
+pattern and call its methods yourself?  If you're accessing a regex
+within a loop, pre-compiling it will save a few function calls.
+Outside of loops, there's not much difference thanks to the internal
+cache.
 
 
 Compilation Flags
@@ -524,6 +524,10 @@ of each one.
 +---------------------------------+--------------------------------------------+
 | Flag                            | Meaning                                    |
 +=================================+============================================+
+| :const:`ASCII`, :const:`A`      | Makes several escapes like ``\w``, ``\b``, |
+|                                 | ``\s`` and ``\d`` match only on ASCII      |
+|                                 | characters with the respective property.   |
++---------------------------------+--------------------------------------------+
 | :const:`DOTALL`, :const:`S`     | Make ``.`` match any character, including  |
 |                                 | newlines                                   |
 +---------------------------------+--------------------------------------------+
@@ -535,11 +539,7 @@ of each one.
 |                                 | ``$``                                      |
 +---------------------------------+--------------------------------------------+
 | :const:`VERBOSE`, :const:`X`    | Enable verbose REs, which can be organized |
-|                                 | more cleanly and understandably.           |
-+---------------------------------+--------------------------------------------+
-| :const:`ASCII`, :const:`A`      | Makes several escapes like ``\w``, ``\b``, |
-|                                 | ``\s`` and ``\d`` match only on ASCII      |
-|                                 | characters with the respective property.   |
+| (for 'extended')                | more cleanly and understandably.           |
 +---------------------------------+--------------------------------------------+
 
 
@@ -558,7 +558,8 @@ of each one.
           LOCALE
    :noindex:
 
-   Make ``\w``, ``\W``, ``\b``, and ``\B``, dependent on the current locale.
+   Make ``\w``, ``\W``, ``\b``, and ``\B`` dependent on the current locale
+   instead of the Unicode database.
 
    Locales are a feature of the C library intended to help in writing programs that
    take account of language differences.  For example, if you're processing French
@@ -851,11 +852,10 @@ keep track of the group numbers.  There are two features which help with this
 problem.  Both of them use a common syntax for regular expression extensions, so
 we'll look at that first.
 
-Perl 5 added several additional features to standard regular expressions, and
-the Python :mod:`re` module supports most of them.   It would have been
-difficult to choose new single-keystroke metacharacters or new special sequences
-beginning with ``\`` to represent the new features without making Perl's regular
-expressions confusingly different from standard REs.  If you chose ``&`` as a
+Perl 5 is well known for its powerful additions to standard regular expressions.
+For these new features, the Perl developers couldn't choose new single-keystroke metacharacters
+or new special sequences beginning with ``\`` without making Perl's regular
+expressions confusingly different from standard REs.  If they chose ``&`` as a
 new metacharacter, for example, old expressions would be assuming that ``&`` was
 a regular character and wouldn't have escaped it by writing ``\&`` or ``[&]``.
 
@@ -867,22 +867,15 @@ what extension is being used, so ``(?=foo)`` is one thing (a positive lookahead
 assertion) and ``(?:foo)`` is something else (a non-capturing group containing
 the subexpression ``foo``).
 
-Python adds an extension syntax to Perl's extension syntax.  If the first
-character after the question mark is a ``P``, you know that it's an extension
-that's specific to Python.  Currently there are two such extensions:
-``(?P<name>...)`` defines a named group, and ``(?P=name)`` is a backreference to
-a named group.  If future versions of Perl 5 add similar features using a
-different syntax, the :mod:`re` module will be changed to support the new
-syntax, while preserving the Python-specific syntax for compatibility's sake.
-
-Now that we've looked at the general extension syntax, we can return to the
-features that simplify working with groups in complex REs. Since groups are
-numbered from left to right and a complex expression may use many groups, it can
-become difficult to keep track of the correct numbering.  Modifying such a
-complex RE is annoying, too: insert a new group near the beginning and you
-change the numbers of everything that follows it.
-
-Sometimes you'll want to use a group to collect a part of a regular expression,
+Python supports several of Perl's extensions and adds an extension
+syntax to Perl's extension syntax.  If the first character after the
+question mark is a ``P``, you know that it's an extension that's
+specific to Python.
+
+Now that we've looked at the general extension syntax, we can return
+to the features that simplify working with groups in complex REs.
+
+Sometimes you'll want to use a group to denote a part of a regular expression,
 but aren't interested in retrieving the group's contents. You can make this fact
 explicit by using a non-capturing group: ``(?:...)``, where you can replace the
 ``...`` with any other regular expression. ::
@@ -908,7 +901,7 @@ numbers, groups can be referenced by a name.
 
 The syntax for a named group is one of the Python-specific extensions:
 ``(?P<name>...)``.  *name* is, obviously, the name of the group.  Named groups
-also behave exactly like capturing groups, and additionally associate a name
+behave exactly like capturing groups, and additionally associate a name
 with a group.  The :ref:`match object <match-objects>` methods that deal with
 capturing groups all accept either integers that refer to the group by number
 or strings that contain the desired group's name.  Named groups are still
@@ -975,9 +968,10 @@ The pattern to match this is quite simple:
 ``.*[.].*$``
 
 Notice that the ``.`` needs to be treated specially because it's a
-metacharacter; I've put it inside a character class.  Also notice the trailing
-``$``; this is added to ensure that all the rest of the string must be included
-in the extension.  This regular expression matches ``foo.bar`` and
+metacharacter, so it's inside a character class to only match that
+specific character.  Also notice the trailing ``$``; this is added to
+ensure that all the rest of the string must be included in the
+extension.  This regular expression matches ``foo.bar`` and
 ``autoexec.bat`` and ``sendmail.cf`` and ``printers.conf``.
 
 Now, consider complicating the problem a bit; what if you want to match
@@ -1051,7 +1045,7 @@ Splitting Strings
 The :meth:`split` method of a pattern splits a string apart
 wherever the RE matches, returning a list of the pieces. It's similar to the
 :meth:`split` method of strings but provides much more generality in the
-delimiters that you can split by; :meth:`split` only supports splitting by
+delimiters that you can split by; string :meth:`split` only supports splitting by
 whitespace or by a fixed string.  As you'd expect, there's a module-level
 :func:`re.split` function, too.
 
@@ -1106,7 +1100,6 @@ Another common task is to find all the matches for a pattern, and replace them
 with a different string.  The :meth:`sub` method takes a replacement value,
 which can be either a string or a function, and the string to be processed.
 
-
 .. method:: .sub(replacement, string[, count=0])
    :noindex:
 
@@ -1362,4 +1355,3 @@ and doesn't contain any Python material at all, so it won't be useful as a
 reference for programming in Python.  (The first edition covered Python's
 now-removed :mod:`regex` module, which won't help you much.)  Consider checking
 it out from your library.
-