@@ -104,13 +104,25 @@ you can still match them in patterns; for example, if you need to match a ``[``
104104or ``\ ``, you can precede them with a backslash to remove their special
105105meaning: ``\[ `` or ``\\ ``.
106106
107- Some of the special sequences beginning with ``'\' `` represent predefined sets
108- of characters that are often useful, such as the set of digits, the set of
109- letters, or the set of anything that isn't whitespace. The following predefined
110- special sequences are a subset of those available. The equivalent classes are
111- for bytes patterns. For a complete list of sequences and expanded class
112- definitions for Unicode string patterns, see the last part of
113- :ref: `Regular Expression Syntax <re-syntax >`.
107+ Some of the special sequences beginning with ``'\' `` represent
108+ predefined sets of characters that are often useful, such as the set
109+ of digits, the set of letters, or the set of anything that isn't
110+ whitespace.
111+
112+ Let's take an example: ``\w`` matches any alphanumeric character. If
113+ the regex pattern is expressed in bytes, this is equivalent to the
114+ class ``[a-zA-Z0-9_]``. If the regex pattern is a string, ``\w`` will
115+ match all the characters marked as letters or digits in the Unicode
116+ database provided by the :mod:`unicodedata` module, plus the underscore.
117+ You can use the more restricted definition of ``\w`` in a string pattern
118+ by supplying the :const:`re.ASCII` flag when compiling the regular expression.
119+
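As a rough illustration of the distinction described above, the following sketch runs the same ``\w+`` pattern three ways: as a string pattern, as a string pattern with :const:`re.ASCII`, and as a bytes pattern. The sample text is an assumption chosen only because it mixes ASCII and non-ASCII letters.

```python
import re

# Hypothetical sample text; \u00ef is "i with diaeresis", \u00e9 is "e with acute".
text = "na\u00efve caf\u00e9_1"

# String pattern: \w follows the Unicode definition of a word character.
unicode_words = re.findall(r"\w+", text)

# Same string pattern with re.ASCII: only [a-zA-Z0-9_] counts as \w.
ascii_words = re.findall(r"\w+", text, re.ASCII)

# Bytes pattern with no flags: \w matches only ASCII word characters.
bytes_words = re.findall(rb"\w+", text.encode("utf-8"))

print(unicode_words)
print(ascii_words)
print(bytes_words)
```

With :const:`re.ASCII` or a bytes pattern, the accented letters act as separators, so each word is split around them.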
120+ The following list of special sequences isn't complete. For a complete
121+ list of sequences and expanded class definitions for Unicode string
122+ patterns, see the last part of :ref: `Regular Expression Syntax
123+ <re-syntax>` in the Standard Library reference. In general, the
124+ Unicode versions match any character that's in the appropriate
125+ category in the Unicode database.
114126
115127``\d ``
116128 Matches any decimal digit; this is equivalent to the class ``[0-9] ``.
@@ -160,9 +172,8 @@ previous character can be matched zero or more times, instead of exactly once.
160172For example, ``ca*t `` will match ``ct `` (0 ``a `` characters), ``cat `` (1 ``a ``),
161173``caaat `` (3 ``a `` characters), and so forth. The RE engine has various
162174internal limitations stemming from the size of C's ``int `` type that will
163- prevent it from matching over 2 billion ``a `` characters; you probably don't
164- have enough memory to construct a string that large, so you shouldn't run into
165- that limit.
175+ prevent it from matching over 2 billion ``a `` characters; patterns
176+ are usually not written to match that much data.
166177
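The behaviour of ``ca*t`` described above can be checked with a short sketch. The list of candidate strings, including the non-matching ``cbt``, is an assumption added for illustration.

```python
import re

pattern = re.compile(r"ca*t")

# Hypothetical candidates: zero, one, and three 'a' characters, plus a non-match.
candidates = ["ct", "cat", "caaat", "cbt"]

# fullmatch() succeeds only if the whole string matches the pattern.
results = {s: pattern.fullmatch(s) is not None for s in candidates}

print(results)
```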
167178Repetitions such as ``* `` are :dfn: `greedy `; when repeating a RE, the matching
168179engine will try to repeat it as many times as possible. If later portions of the
@@ -353,7 +364,7 @@ for a complete listing.
353364| | returns them as an :term: `iterator `. |
354365+------------------+-----------------------------------------------+
355366
356- :meth: `match ` and :meth: `search ` return ``None `` if no match can be found. If
367+ :meth:`~re.regex.match` and :meth:`~re.regex.search` return ``None`` if no match can be found. If
357368they're successful, a :ref: `match object <match-objects >` instance is returned,
358369containing information about the match: where it starts and ends, the substring
359370it matched, and more.
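A minimal sketch of the two outcomes, assuming an arbitrary pattern and input string chosen for illustration: ``match()`` only tries at the start of the string, so it returns ``None`` here, while ``search()`` scans forward and returns a match object.

```python
import re

pattern = re.compile(r"[a-z]+")

# Hypothetical input that starts with digits, so match() cannot succeed.
text = "123abc"

no_match = pattern.match(text)   # None: the string does not start with a letter
found = pattern.search(text)     # a match object for 'abc'

if found is not None:
    print(found.group(), found.span())
else:
    print("No match")
```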
@@ -419,8 +430,8 @@ Trying these methods will soon clarify their meaning::
419430 >>> m.span()
420431 (0, 5)
421432
422- :meth: `group ` returns the substring that was matched by the RE. :meth: `start `
423- and :meth: `end ` return the starting and ending index of the match. :meth: `span `
433+ :meth:`~re.match.group` returns the substring that was matched by the RE. :meth:`~re.match.start`
434+ and :meth:`~re.match.end` return the starting and ending index of the match. :meth:`~re.match.span`
424435returns both start and end indexes in a single tuple. Since the :meth: `match `
425436method only checks if the RE matches at the start of a string, :meth: `start `
426437will always be zero. However, the :meth: `search ` method of patterns
@@ -448,14 +459,14 @@ In actual programs, the most common style is to store the
448459 print('No match')
449460
450461Two pattern methods return all of the matches for a pattern.
451- :meth: `findall ` returns a list of matching strings::
462+ :meth:`~re.regex.findall` returns a list of matching strings::
452463
453464 >>> p = re.compile(r'\d+')
454465 >>> p.findall('12 drummers drumming, 11 pipers piping, 10 lords a-leaping')
455466 ['12', '11', '10']
456467
457468:meth: `findall ` has to create the entire list before it can be returned as the
458- result. The :meth: `finditer ` method returns a sequence of
469+ result. The :meth:`~re.regex.finditer` method returns a sequence of
459470:ref: `match object <match-objects >` instances as an :term: `iterator `::
460471
461472 >>> iterator = p.finditer('12 drummers drumming, 11 ... 10 ...')
@@ -473,9 +484,9 @@ Module-Level Functions
473484----------------------
474485
475486You don't have to create a pattern object and call its methods; the
476- :mod: `re ` module also provides top-level functions called :func: `match `,
477- :func: `search `, :func: `findall `, :func: `sub `, and so forth. These functions
478- take the same arguments as the corresponding pattern method, with
487+ :mod:`re` module also provides top-level functions called :func:`~re.match`,
488+ :func:`~re.search`, :func:`~re.findall`, :func:`~re.sub`, and so forth. These functions
489+ take the same arguments as the corresponding pattern method with
479490the RE string added as the first argument, and still return either ``None `` or a
480491:ref: `match object <match-objects >` instance. ::
481492
@@ -485,26 +496,15 @@ the RE string added as the first argument, and still return either ``None`` or a
485496 <_sre.SRE_Match object at 0x...>
486497
487498Under the hood, these functions simply create a pattern object for you
488- and call the appropriate method on it. They also store the compiled object in a
489- cache, so future calls using the same RE are faster.
499+ and call the appropriate method on it. They also store the compiled
500+ object in a cache, so future calls using the same RE won't need to
501+ parse the pattern again and again.
490502
491503Should you use these module-level functions, or should you get the
492- pattern and call its methods yourself? That choice depends on how
493- frequently the RE will be used, and on your personal coding style. If the RE is
494- being used at only one point in the code, then the module functions are probably
495- more convenient. If a program contains a lot of regular expressions, or re-uses
496- the same ones in several locations, then it might be worthwhile to collect all
497- the definitions in one place, in a section of code that compiles all the REs
498- ahead of time. To take an example from the standard library, here's an extract
499- from the now-defunct Python 2 standard :mod: `xmllib ` module::
500-
501- ref = re.compile( ... )
502- entityref = re.compile( ... )
503- charref = re.compile( ... )
504- starttagopen = re.compile( ... )
505-
506- I generally prefer to work with the compiled object, even for one-time uses, but
507- few people will be as much of a purist about this as I am.
504+ pattern and call its methods yourself? If you're accessing a regex
505+ within a loop, pre-compiling it will save a few function calls.
506+ Outside of loops, there's not much difference thanks to the internal
507+ cache.
508508
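To make the comparison concrete, here is a small sketch that applies the same RE inside a loop, once through a pre-compiled pattern object and once through the module-level function. The input lines are an assumption loosely based on the ``findall`` example earlier; both styles should return identical results.

```python
import re

# Hypothetical input lines for illustration.
lines = ["12 drummers", "11 pipers", "no numbers here"]

# Style 1: compile once, then call the pattern's method inside the loop.
digits = re.compile(r"\d+")
compiled_results = [digits.findall(line) for line in lines]

# Style 2: call the module-level function; re looks the pattern up in its cache.
module_results = [re.findall(r"\d+", line) for line in lines]

print(compiled_results == module_results)
```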
509509
510510Compilation Flags
@@ -524,6 +524,10 @@ of each one.
524524+---------------------------------+--------------------------------------------+
525525| Flag | Meaning |
526526+=================================+============================================+
527+ | :const: `ASCII `, :const: `A ` | Makes several escapes like ``\w ``, ``\b ``, |
528+ | | ``\s `` and ``\d `` match only on ASCII |
529+ | | characters with the respective property. |
530+ +---------------------------------+--------------------------------------------+
527531| :const: `DOTALL `, :const: `S ` | Make ``. `` match any character, including |
528532| | newlines |
529533+---------------------------------+--------------------------------------------+
@@ -535,11 +539,7 @@ of each one.
535539| | ``$ `` |
536540+---------------------------------+--------------------------------------------+
537541| :const: `VERBOSE `, :const: `X ` | Enable verbose REs, which can be organized |
538- | | more cleanly and understandably. |
539- +---------------------------------+--------------------------------------------+
540- | :const: `ASCII `, :const: `A ` | Makes several escapes like ``\w ``, ``\b ``, |
541- | | ``\s `` and ``\d `` match only on ASCII |
542- | | characters with the respective property. |
542+ | (for 'extended') | more cleanly and understandably. |
543543+---------------------------------+--------------------------------------------+
544544
545545
@@ -558,7 +558,8 @@ of each one.
558558 LOCALE
559559 :noindex:
560560
561- Make ``\w ``, ``\W ``, ``\b ``, and ``\B ``, dependent on the current locale.
561+ Make ``\w``, ``\W``, ``\b``, and ``\B`` dependent on the current locale
562+ instead of the Unicode database.
562563
563564 Locales are a feature of the C library intended to help in writing programs that
564565 take account of language differences. For example, if you're processing French
@@ -851,11 +852,10 @@ keep track of the group numbers. There are two features which help with this
851852problem. Both of them use a common syntax for regular expression extensions, so
852853we'll look at that first.
853854
854- Perl 5 added several additional features to standard regular expressions, and
855- the Python :mod: `re ` module supports most of them. It would have been
856- difficult to choose new single-keystroke metacharacters or new special sequences
857- beginning with ``\ `` to represent the new features without making Perl's regular
858- expressions confusingly different from standard REs. If you chose ``& `` as a
855+ Perl 5 is well-known for its powerful additions to standard regular expressions.
856+ For these new features the Perl developers couldn't choose new single-keystroke metacharacters
857+ or new special sequences beginning with ``\ `` without making Perl's regular
858+ expressions confusingly different from standard REs. If they had chosen ``&`` as a
859859new metacharacter, for example, old expressions would have assumed that ``&`` was
860860a regular character and wouldn't have escaped it by writing ``\&`` or ``[&]``.
861861
@@ -867,22 +867,15 @@ what extension is being used, so ``(?=foo)`` is one thing (a positive lookahead
867867assertion) and ``(?:foo) `` is something else (a non-capturing group containing
868868the subexpression ``foo ``).
869869
870- Python adds an extension syntax to Perl's extension syntax. If the first
871- character after the question mark is a ``P ``, you know that it's an extension
872- that's specific to Python. Currently there are two such extensions:
873- ``(?P<name>...) `` defines a named group, and ``(?P=name) `` is a backreference to
874- a named group. If future versions of Perl 5 add similar features using a
875- different syntax, the :mod: `re ` module will be changed to support the new
876- syntax, while preserving the Python-specific syntax for compatibility's sake.
877-
878- Now that we've looked at the general extension syntax, we can return to the
879- features that simplify working with groups in complex REs. Since groups are
880- numbered from left to right and a complex expression may use many groups, it can
881- become difficult to keep track of the correct numbering. Modifying such a
882- complex RE is annoying, too: insert a new group near the beginning and you
883- change the numbers of everything that follows it.
884-
885- Sometimes you'll want to use a group to collect a part of a regular expression,
870+ Python supports several of Perl's extensions and adds an extension
871+ syntax to Perl's extension syntax. If the first character after the
872+ question mark is a ``P ``, you know that it's an extension that's
873+ specific to Python.
874+
875+ Now that we've looked at the general extension syntax, we can return
876+ to the features that simplify working with groups in complex REs.
877+
878+ Sometimes you'll want to use a group to denote a part of a regular expression,
886879but aren't interested in retrieving the group's contents. You can make this fact
887880explicit by using a non-capturing group: ``(?:...) ``, where you can replace the
888881``... `` with any other regular expression. ::
@@ -908,7 +901,7 @@ numbers, groups can be referenced by a name.
908901
909902The syntax for a named group is one of the Python-specific extensions:
910903``(?P<name>...) ``. *name * is, obviously, the name of the group. Named groups
911- also behave exactly like capturing groups, and additionally associate a name
904+ behave exactly like capturing groups, and additionally associate a name
912905with a group. The :ref: `match object <match-objects >` methods that deal with
913906capturing groups all accept either integers that refer to the group by number
914907or strings that contain the desired group's name. Named groups are still
@@ -975,9 +968,10 @@ The pattern to match this is quite simple:
975968``.*[.].*$ ``
976969
977970Notice that the ``. `` needs to be treated specially because it's a
978- metacharacter; I've put it inside a character class. Also notice the trailing
979- ``$ ``; this is added to ensure that all the rest of the string must be included
980- in the extension. This regular expression matches ``foo.bar `` and
971+ metacharacter, so it's inside a character class to only match that
972+ specific character. Also notice the trailing ``$ ``; this is added to
973+ ensure that all the rest of the string must be included in the
974+ extension. This regular expression matches ``foo.bar `` and
981975``autoexec.bat `` and ``sendmail.cf `` and ``printers.conf ``.
982976
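A quick way to check the claim above is to run the pattern against the four filenames mentioned. The extra name ``README``, which has no extension, is an assumption added to show a non-match.

```python
import re

pattern = re.compile(r".*[.].*$")

# The four filenames from the text, plus a hypothetical name without a dot.
names = ["foo.bar", "autoexec.bat", "sendmail.cf", "printers.conf", "README"]

# match() anchors at the start; the trailing $ anchors at the end.
matched = [name for name in names if pattern.match(name)]

print(matched)
```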
983977Now, consider complicating the problem a bit; what if you want to match
@@ -1051,7 +1045,7 @@ Splitting Strings
10511045The :meth: `split ` method of a pattern splits a string apart
10521046wherever the RE matches, returning a list of the pieces. It's similar to the
10531047:meth: `split ` method of strings but provides much more generality in the
1054- delimiters that you can split by; :meth: `split ` only supports splitting by
1048+ delimiters that you can split by; string :meth: `split ` only supports splitting by
10551049whitespace or by a fixed string. As you'd expect, there's a module-level
10561050:func: `re.split ` function, too.
10571051
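The difference in generality can be sketched as follows. The record string and the choice of delimiters (a comma or semicolon with optional surrounding whitespace) are assumptions made for illustration.

```python
import re

# Hypothetical record using mixed delimiters and uneven spacing.
record = "alpha, beta;gamma ,delta"

# The string method can only split on one fixed delimiter (or on whitespace).
by_comma = record.split(",")

# The regex version accepts either delimiter and absorbs surrounding whitespace.
by_pattern = re.split(r"\s*[,;]\s*", record)

print(by_comma)
print(by_pattern)
```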
@@ -1106,7 +1100,6 @@ Another common task is to find all the matches for a pattern, and replace them
11061100with a different string. The :meth: `sub ` method takes a replacement value,
11071101which can be either a string or a function, and the string to be processed.
11081102
1109-
11101103.. method :: .sub(replacement, string[, count=0])
11111104 :noindex:
11121105
@@ -1362,4 +1355,3 @@ and doesn't contain any Python material at all, so it won't be useful as a
13621355reference for programming in Python. (The first edition covered Python's
13631356now-removed :mod: `regex ` module, which won't help you much.) Consider checking
13641357it out from your library.
1365-