Commit 3f4f3ba

#18562: various revisions to the regex howto for 3.x
* describe how \w is different when used in bytes and Unicode patterns.
* describe re.ASCII flag to change that behaviour.
* remove personal references ('I generally prefer...')
* add some more links to the re module in the library reference
* various small edits and re-wording.
1 parent ba5d8f3 commit 3f4f3ba

1 file changed

Lines changed: 62 additions & 70 deletions

Doc/howto/regex.rst

@@ -104,13 +104,25 @@ you can still match them in patterns; for example, if you need to match a ``[``
104104
or ``\``, you can precede them with a backslash to remove their special
105105
meaning: ``\[`` or ``\\``.
106106

107-
Some of the special sequences beginning with ``'\'`` represent predefined sets
108-
of characters that are often useful, such as the set of digits, the set of
109-
letters, or the set of anything that isn't whitespace. The following predefined
110-
special sequences are a subset of those available. The equivalent classes are
111-
for bytes patterns. For a complete list of sequences and expanded class
112-
definitions for Unicode string patterns, see the last part of
113-
:ref:`Regular Expression Syntax <re-syntax>`.
107+
Some of the special sequences beginning with ``'\'`` represent
108+
predefined sets of characters that are often useful, such as the set
109+
of digits, the set of letters, or the set of anything that isn't
110+
whitespace.
111+
112+
Let's take an example: ``\w`` matches any alphanumeric character. If
113+
the regex pattern is expressed in bytes, this is equivalent to the
114+
class ``[a-zA-Z0-9_]``. If the regex pattern is a string, ``\w`` will
115+
match all the characters marked as letters in the Unicode database
116+
provided by the :mod:`unicodedata` module. You can use the more
117+
restricted definition of ``\w`` in a string pattern by supplying the
118+
:const:`re.ASCII` flag when compiling the regular expression.
119+
120+
The following list of special sequences isn't complete. For a complete
121+
list of sequences and expanded class definitions for Unicode string
122+
patterns, see the last part of :ref:`Regular Expression Syntax
123+
<re-syntax>` in the Standard Library reference. In general, the
124+
Unicode versions match any character that's in the appropriate
125+
category in the Unicode database.
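
The difference between the bytes and string definitions of ``\w`` described above can be checked with a short sketch. The sample text ``café_42`` and the variable names are illustrative assumptions, not part of the commit:

```python
import re

text = "café_42"  # hypothetical sample containing one non-ASCII letter

# String pattern: \w follows the Unicode database, so 'é' counts as a word character.
unicode_words = re.findall(r"\w+", text)

# Same string pattern with re.ASCII: \w is restricted to [a-zA-Z0-9_].
ascii_words = re.findall(r"\w+", text, re.ASCII)

# Bytes pattern: \w is always [a-zA-Z0-9_]; the UTF-8 bytes of 'é' do not match.
byte_words = re.findall(rb"\w+", text.encode("utf-8"))

print(unicode_words)  # ['café_42']
print(ascii_words)    # ['caf', '_42']
print(byte_words)     # [b'caf', b'_42']
```

With ``re.ASCII`` the string pattern splits the text at the same place the bytes pattern does, which is the "more restricted definition" the added paragraph refers to.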
114126

115127
``\d``
116128
Matches any decimal digit; this is equivalent to the class ``[0-9]``.
@@ -160,9 +172,8 @@ previous character can be matched zero or more times, instead of exactly once.
160172
For example, ``ca*t`` will match ``ct`` (0 ``a`` characters), ``cat`` (1 ``a``),
161173
``caaat`` (3 ``a`` characters), and so forth. The RE engine has various
162174
internal limitations stemming from the size of C's ``int`` type that will
163-
prevent it from matching over 2 billion ``a`` characters; you probably don't
164-
have enough memory to construct a string that large, so you shouldn't run into
165-
that limit.
175+
prevent it from matching over 2 billion ``a`` characters; patterns
176+
are usually not written to match that much data.
166177

167178
Repetitions such as ``*`` are :dfn:`greedy`; when repeating a RE, the matching
168179
engine will try to repeat it as many times as possible. If later portions of the
@@ -353,7 +364,7 @@ for a complete listing.
353364
| | returns them as an :term:`iterator`. |
354365
+------------------+-----------------------------------------------+
355366

356-
:meth:`match` and :meth:`search` return ``None`` if no match can be found. If
367+
:meth:`~re.regex.match` and :meth:`~re.regex.search` return ``None`` if no match can be found. If
357368
they're successful, a :ref:`match object <match-objects>` instance is returned,
358369
containing information about the match: where it starts and ends, the substring
359370
it matched, and more.
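
As a minimal illustration of the return values described above, the following sketch uses a hypothetical pattern and input strings chosen only for this example:

```python
import re

p = re.compile(r"[a-z]+")  # illustrative pattern: one or more lowercase letters

# match() only succeeds if the pattern matches at the start of the string.
m = p.match("tempo")
no_match = p.match("123")

# search() scans forward, so the match may begin after position 0.
found = p.search("::: message")

print(m.group(), m.span())          # tempo (0, 5)
print(no_match)                     # None
print(found.group(), found.span())  # message (4, 11)
```

Because a failed match returns ``None``, code that uses the result usually tests it first (``if m: ...``) before calling methods such as ``group()``.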
@@ -419,8 +430,8 @@ Trying these methods will soon clarify their meaning::
419430
>>> m.span()
420431
(0, 5)
421432

422-
:meth:`group` returns the substring that was matched by the RE. :meth:`start`
423-
and :meth:`end` return the starting and ending index of the match. :meth:`span`
433+
:meth:`~re.match.group` returns the substring that was matched by the RE. :meth:`~re.match.start`
434+
and :meth:`~re.match.end` return the starting and ending index of the match. :meth:`~re.match.span`
424435
returns both start and end indexes in a single tuple. Since the :meth:`match`
425436
method only checks if the RE matches at the start of a string, :meth:`start`
426437
will always be zero. However, the :meth:`search` method of patterns
@@ -448,14 +459,14 @@ In actual programs, the most common style is to store the
448459
print('No match')
449460

450461
Two pattern methods return all of the matches for a pattern.
451-
:meth:`findall` returns a list of matching strings::
462+
:meth:`~re.regex.findall` returns a list of matching strings::
452463

453464
>>> p = re.compile('\d+')
454465
>>> p.findall('12 drummers drumming, 11 pipers piping, 10 lords a-leaping')
455466
['12', '11', '10']
456467

457468
:meth:`findall` has to create the entire list before it can be returned as the
458-
result. The :meth:`finditer` method returns a sequence of
469+
result. The :meth:`~re.regex.finditer` method returns a sequence of
459470
:ref:`match object <match-objects>` instances as an :term:`iterator`::
460471

461472
>>> iterator = p.finditer('12 drummers drumming, 11 ... 10 ...')
@@ -473,9 +484,9 @@ Module-Level Functions
473484
----------------------
474485

475486
You don't have to create a pattern object and call its methods; the
476-
:mod:`re` module also provides top-level functions called :func:`match`,
477-
:func:`search`, :func:`findall`, :func:`sub`, and so forth. These functions
478-
take the same arguments as the corresponding pattern method, with
487+
:mod:`re` module also provides top-level functions called :func:`~re.match`,
488+
:func:`~re.search`, :func:`~re.findall`, :func:`~re.sub`, and so forth. These functions
489+
take the same arguments as the corresponding pattern method with
479490
the RE string added as the first argument, and still return either ``None`` or a
480491
:ref:`match object <match-objects>` instance. ::
481492

@@ -485,26 +496,15 @@ the RE string added as the first argument, and still return either ``None`` or a
485496
<_sre.SRE_Match object at 0x...>
486497

487498
Under the hood, these functions simply create a pattern object for you
488-
and call the appropriate method on it. They also store the compiled object in a
489-
cache, so future calls using the same RE are faster.
499+
and call the appropriate method on it. They also store the compiled
500+
object in a cache, so future calls using the same RE won't need to
501+
parse the pattern again and again.
490502

491503
Should you use these module-level functions, or should you get the
492-
pattern and call its methods yourself? That choice depends on how
493-
frequently the RE will be used, and on your personal coding style. If the RE is
494-
being used at only one point in the code, then the module functions are probably
495-
more convenient. If a program contains a lot of regular expressions, or re-uses
496-
the same ones in several locations, then it might be worthwhile to collect all
497-
the definitions in one place, in a section of code that compiles all the REs
498-
ahead of time. To take an example from the standard library, here's an extract
499-
from the now-defunct Python 2 standard :mod:`xmllib` module::
500-
501-
ref = re.compile( ... )
502-
entityref = re.compile( ... )
503-
charref = re.compile( ... )
504-
starttagopen = re.compile( ... )
505-
506-
I generally prefer to work with the compiled object, even for one-time uses, but
507-
few people will be as much of a purist about this as I am.
504+
pattern and call its methods yourself? If you're accessing a regex
505+
within a loop, pre-compiling it will save a few function calls.
506+
Outside of loops, there's not much difference thanks to the internal
507+
cache.
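
The two styles compared above can be sketched side by side. The list of strings and the variable names are assumptions made for illustration; both approaches are expected to give the same answers:

```python
import re

lines = ["From 12", "no digits", "age: 7"]  # hypothetical input

# Style 1: compile once, then call the pattern's method inside the loop.
digits = re.compile(r"\d+")
compiled_hits = [bool(digits.search(line)) for line in lines]

# Style 2: call the module-level function each time; re looks the
# compiled pattern up in its internal cache on every call.
module_hits = [bool(re.search(r"\d+", line)) for line in lines]

print(compiled_hits)  # [True, False, True]
print(module_hits)    # [True, False, True]
```

The results are identical; the pre-compiled form only avoids the repeated cache lookup and extra function call on each iteration.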
508508

509509

510510
Compilation Flags
@@ -524,6 +524,10 @@ of each one.
524524
+---------------------------------+--------------------------------------------+
525525
| Flag | Meaning |
526526
+=================================+============================================+
527+
| :const:`ASCII`, :const:`A` | Makes several escapes like ``\w``, ``\b``, |
528+
| | ``\s`` and ``\d`` match only on ASCII |
529+
| | characters with the respective property. |
530+
+---------------------------------+--------------------------------------------+
527531
| :const:`DOTALL`, :const:`S` | Make ``.`` match any character, including |
528532
| | newlines |
529533
+---------------------------------+--------------------------------------------+
@@ -535,11 +539,7 @@ of each one.
535539
| | ``$`` |
536540
+---------------------------------+--------------------------------------------+
537541
| :const:`VERBOSE`, :const:`X` | Enable verbose REs, which can be organized |
538-
| | more cleanly and understandably. |
539-
+---------------------------------+--------------------------------------------+
540-
| :const:`ASCII`, :const:`A` | Makes several escapes like ``\w``, ``\b``, |
541-
| | ``\s`` and ``\d`` match only on ASCII |
542-
| | characters with the respective property. |
542+
| (for 'extended') | more cleanly and understandably. |
543543
+---------------------------------+--------------------------------------------+
544544

545545

@@ -558,7 +558,8 @@ of each one.
558558
LOCALE
559559
:noindex:
560560

561-
Make ``\w``, ``\W``, ``\b``, and ``\B``, dependent on the current locale.
561+
Make ``\w``, ``\W``, ``\b``, and ``\B``, dependent on the current locale
562+
instead of the Unicode database.
562563

563564
Locales are a feature of the C library intended to help in writing programs that
564565
take account of language differences. For example, if you're processing French
@@ -851,11 +852,10 @@ keep track of the group numbers. There are two features which help with this
851852
problem. Both of them use a common syntax for regular expression extensions, so
852853
we'll look at that first.
853854

854-
Perl 5 added several additional features to standard regular expressions, and
855-
the Python :mod:`re` module supports most of them. It would have been
856-
difficult to choose new single-keystroke metacharacters or new special sequences
857-
beginning with ``\`` to represent the new features without making Perl's regular
858-
expressions confusingly different from standard REs. If you chose ``&`` as a
855+
Perl 5 is well-known for its powerful additions to standard regular expressions.
856+
For these new features the Perl developers couldn't choose new single-keystroke metacharacters
857+
or new special sequences beginning with ``\`` without making Perl's regular
858+
expressions confusingly different from standard REs. If they chose ``&`` as a
859859
new metacharacter, for example, old expressions would be assuming that ``&`` was
860860
a regular character and wouldn't have escaped it by writing ``\&`` or ``[&]``.
861861

@@ -867,22 +867,15 @@ what extension is being used, so ``(?=foo)`` is one thing (a positive lookahead
867867
assertion) and ``(?:foo)`` is something else (a non-capturing group containing
868868
the subexpression ``foo``).
869869

870-
Python adds an extension syntax to Perl's extension syntax. If the first
871-
character after the question mark is a ``P``, you know that it's an extension
872-
that's specific to Python. Currently there are two such extensions:
873-
``(?P<name>...)`` defines a named group, and ``(?P=name)`` is a backreference to
874-
a named group. If future versions of Perl 5 add similar features using a
875-
different syntax, the :mod:`re` module will be changed to support the new
876-
syntax, while preserving the Python-specific syntax for compatibility's sake.
877-
878-
Now that we've looked at the general extension syntax, we can return to the
879-
features that simplify working with groups in complex REs. Since groups are
880-
numbered from left to right and a complex expression may use many groups, it can
881-
become difficult to keep track of the correct numbering. Modifying such a
882-
complex RE is annoying, too: insert a new group near the beginning and you
883-
change the numbers of everything that follows it.
884-
885-
Sometimes you'll want to use a group to collect a part of a regular expression,
870+
Python supports several of Perl's extensions and adds an extension
871+
syntax to Perl's extension syntax. If the first character after the
872+
question mark is a ``P``, you know that it's an extension that's
873+
specific to Python.
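
A short sketch of the two ``P`` extensions named in the removed text, ``(?P<name>...)`` and ``(?P=name)``. The group name ``word`` and the sample sentence are illustrative assumptions:

```python
import re

# (?P<word>...) defines a named group; (?P=word) is a backreference to it.
# Together they detect a word that is immediately repeated.
doubled = re.compile(r"\b(?P<word>\w+)\s+(?P=word)\b")

m = doubled.search("Paris in the the spring")

print(m.group("word"))  # the
print(m.group())        # the the
print(m.group(1))       # the  (named groups are still numbered as well)
```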
874+
875+
Now that we've looked at the general extension syntax, we can return
876+
to the features that simplify working with groups in complex REs.
877+
878+
Sometimes you'll want to use a group to denote a part of a regular expression,
886879
but aren't interested in retrieving the group's contents. You can make this fact
887880
explicit by using a non-capturing group: ``(?:...)``, where you can replace the
888881
``...`` with any other regular expression. ::
@@ -908,7 +901,7 @@ numbers, groups can be referenced by a name.
908901

909902
The syntax for a named group is one of the Python-specific extensions:
910903
``(?P<name>...)``. *name* is, obviously, the name of the group. Named groups
911-
also behave exactly like capturing groups, and additionally associate a name
904+
behave exactly like capturing groups, and additionally associate a name
912905
with a group. The :ref:`match object <match-objects>` methods that deal with
913906
capturing groups all accept either integers that refer to the group by number
914907
or strings that contain the desired group's name. Named groups are still
@@ -975,9 +968,10 @@ The pattern to match this is quite simple:
975968
``.*[.].*$``
976969

977970
Notice that the ``.`` needs to be treated specially because it's a
978-
metacharacter; I've put it inside a character class. Also notice the trailing
979-
``$``; this is added to ensure that all the rest of the string must be included
980-
in the extension. This regular expression matches ``foo.bar`` and
971+
metacharacter, so it's inside a character class to only match that
972+
specific character. Also notice the trailing ``$``; this is added to
973+
ensure that all the rest of the string must be included in the
974+
extension. This regular expression matches ``foo.bar`` and
981975
``autoexec.bat`` and ``sendmail.cf`` and ``printers.conf``.
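
The pattern can be tried against the filenames listed above. The extra name ``README`` is an assumption added here to show a non-matching case:

```python
import re

# [.] puts the dot inside a character class, so it matches only a literal '.'.
p = re.compile(r".*[.].*$")

names = ["foo.bar", "autoexec.bat", "sendmail.cf", "printers.conf", "README"]
matched = [name for name in names if p.match(name)]

print(matched)  # ['foo.bar', 'autoexec.bat', 'sendmail.cf', 'printers.conf']
```

``README`` is rejected because it contains no dot, so the character class has nothing to match.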
982976

983977
Now, consider complicating the problem a bit; what if you want to match
@@ -1051,7 +1045,7 @@ Splitting Strings
10511045
The :meth:`split` method of a pattern splits a string apart
10521046
wherever the RE matches, returning a list of the pieces. It's similar to the
10531047
:meth:`split` method of strings but provides much more generality in the
1054-
delimiters that you can split by; :meth:`split` only supports splitting by
1048+
delimiters that you can split by; string :meth:`split` only supports splitting by
10551049
whitespace or by a fixed string. As you'd expect, there's a module-level
10561050
:func:`re.split` function, too.
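
To make the contrast with string ``split`` concrete, here is a small sketch; the input string and the delimiter pattern are hypothetical choices for this example:

```python
import re

data = "a, b;c,  d"  # hypothetical input with mixed delimiters

# String split: only one fixed delimiter at a time.
fixed = data.split(", ")

# Pattern split: a comma or semicolon followed by any amount of whitespace.
flexible = re.split(r"[,;]\s*", data)

print(fixed)     # ['a', 'b;c', ' d']
print(flexible)  # ['a', 'b', 'c', 'd']
```

The pattern version handles both delimiters and the irregular spacing in one pass, which the fixed-string version cannot do.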
10571051

@@ -1106,7 +1100,6 @@ Another common task is to find all the matches for a pattern, and replace them
11061100
with a different string. The :meth:`sub` method takes a replacement value,
11071101
which can be either a string or a function, and the string to be processed.
11081102

1109-
11101103
.. method:: .sub(replacement, string[, count=0])
11111104
:noindex:
11121105

@@ -1362,4 +1355,3 @@ and doesn't contain any Python material at all, so it won't be useful as a
13621355
reference for programming in Python. (The first edition covered Python's
13631356
now-removed :mod:`regex` module, which won't help you much.) Consider checking
13641357
it out from your library.
1365-
