Commit 70d56fb

bpo-25054, bpo-1647489: Added support of splitting on zerowidth patterns. (#4471)
Also fixed searching patterns that could match an empty string.
1 parent e69fbb6 commit 70d56fb

9 files changed

Lines changed: 128 additions & 117 deletions

Doc/library/re.rst

Lines changed: 16 additions & 30 deletions
@@ -708,55 +708,41 @@ form.
    That way, separator components are always found at the same relative
    indices within the result list.

-   .. note::
-
-      :func:`split` doesn't currently split a string on an empty pattern match.
-      For example::
-
-         >>> re.split('x*', 'axbc')
-         ['a', 'bc']
+   The pattern can match empty strings. ::

-      Even though ``'x*'`` also matches 0 'x' before 'a', between 'b' and 'c',
-      and after 'c', currently these matches are ignored. The correct behavior
-      (i.e. splitting on empty matches too and returning ``['', 'a', 'b', 'c',
-      '']``) will be implemented in future versions of Python, but since this
-      is a backward incompatible change, a :exc:`FutureWarning` will be raised
-      in the meanwhile.
-
-      Patterns that can only match empty strings currently never split the
-      string. Since this doesn't match the expected behavior, a
-      :exc:`ValueError` will be raised starting from Python 3.5::
-
-         >>> re.split("^$", "foo\n\nbar\n", flags=re.M)
-         Traceback (most recent call last):
-           File "<stdin>", line 1, in <module>
-           ...
-         ValueError: split() requires a non-empty pattern match.
+      >>> re.split(r'\b', 'Words, words, words.')
+      ['', 'Words', ', ', 'words', ', ', 'words', '.']
+      >>> re.split(r'(\W*)', '...words...')
+      ['', '...', 'w', '', 'o', '', 'r', '', 'd', '', 's', '...', '']

    .. versionchanged:: 3.1
       Added the optional flags argument.

-   .. versionchanged:: 3.5
-      Splitting on a pattern that could match an empty string now raises
-      a warning. Patterns that can only match empty strings are now rejected.
+   .. versionchanged:: 3.7
+      Added support of splitting on a pattern that could match an empty string.
+

 .. function:: findall(pattern, string, flags=0)

    Return all non-overlapping matches of *pattern* in *string*, as a list of
    strings. The *string* is scanned left-to-right, and matches are returned in
    the order found. If one or more groups are present in the pattern, return a
    list of groups; this will be a list of tuples if the pattern has more than
-   one group. Empty matches are included in the result unless they touch the
-   beginning of another match.
+   one group. Empty matches are included in the result.
+
+   .. versionchanged:: 3.7
+      Non-empty matches can now start just after a previous empty match.


 .. function:: finditer(pattern, string, flags=0)

    Return an :term:`iterator` yielding :ref:`match objects <match-objects>` over
    all non-overlapping matches for the RE *pattern* in *string*. The *string*
    is scanned left-to-right, and matches are returned in the order found. Empty
-   matches are included in the result unless they touch the beginning of another
-   match.
+   matches are included in the result.
+
+   .. versionchanged:: 3.7
+      Non-empty matches can now start just after a previous empty match.


 .. function:: sub(pattern, repl, string, count=0, flags=0)
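
The documentation change above can be tried out with a short script. This is a minimal sketch, assuming Python 3.7 or later; the first call is the documented example, and the lookahead and lookbehind calls reuse the `':a:b::c'` test string from Lib/test/test_re.py. The variable names are illustrative only.

```python
import re

# \b matches the empty string at every word boundary, so each separator
# is zero characters wide and no text is consumed by the split.
words = re.split(r'\b', 'Words, words, words.')
print(words)   # ['', 'Words', ', ', 'words', ', ', 'words', '.']

# A lookahead splits just before each ':'; the ':' stays with what follows.
before = re.split(r'(?=:)', ':a:b::c')
print(before)  # ['', ':a', ':b', ':', ':c']

# A lookbehind splits just after each ':'; the ':' stays with what precedes.
after = re.split(r'(?<=:)', ':a:b::c')
print(after)   # [':', 'a:', 'b:', ':', 'c']
```

On Python 3.5 and 3.6 the last two calls raise ValueError instead, as described in the removed note.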

Doc/whatsnew/3.7.rst

Lines changed: 21 additions & 0 deletions
@@ -364,6 +364,10 @@ The flags :const:`re.ASCII`, :const:`re.LOCALE` and :const:`re.UNICODE`
 can be set within the scope of a group.
 (Contributed by Serhiy Storchaka in :issue:`31690`.)

+:func:`re.split` now supports splitting on a pattern like ``r'\b'``,
+``'^$'`` or ``(?=-)`` that matches an empty string.
+(Contributed by Serhiy Storchaka in :issue:`25054`.)
+
 string
 ------

@@ -768,6 +772,23 @@ Changes in the Python API
   avoid a warning escape them with a backslash.
   (Contributed by Serhiy Storchaka in :issue:`30349`.)

+* The result of splitting a string on a :mod:`regular expression <re>`
+  that could match an empty string has been changed. For example
+  splitting on ``r'\s*'`` will now split not only on whitespaces as it
+  did previously, but also between any pair of non-whitespace
+  characters. The previous behavior can be restored by changing the pattern
+  to ``r'\s+'``. A :exc:`FutureWarning` was emitted for such patterns since
+  Python 3.5.
+
+  For patterns that match both empty and non-empty strings, the result of
+  searching for all matches may also be changed in other cases. For example
+  in the string ``'a\n\n'``, the pattern ``r'(?m)^\s*?$'`` will not only
+  match empty strings at positions 2 and 3, but also the string ``'\n'`` at
+  positions 2--3. To match only blank lines, the pattern should be rewritten
+  as ``r'(?m)^[^\S\n]*$'``.
+
+  (Contributed by Serhiy Storchaka in :issue:`25054`.)
+
 * :class:`tracemalloc.Traceback` frames are now sorted from oldest to most
   recent to be more consistent with :mod:`traceback`.
   (Contributed by Jesse Bakker in :issue:`32121`.)
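
The migration advice in the porting note can be illustrated as follows. This is a minimal sketch, assuming Python 3.7 or later; the sample string `'ab c'` is made up for illustration. The exact placement of empty strings in the `r'\s*'` result is deliberately not shown, so the sketch filters them out and only compares the non-empty pieces.

```python
import re

text = 'ab c'

# r'\s+' needs at least one whitespace character, so it splits only on
# real whitespace. This is the pattern the porting note recommends.
plus_pieces = re.split(r'\s+', text)
print(plus_pieces)  # ['ab', 'c']

# r'\s*' can also match the empty string. On Python 3.7+ it therefore
# splits between 'a' and 'b' as well. Empty strings in the result are
# dropped here so that only the non-empty pieces are compared.
star_pieces = [piece for piece in re.split(r'\s*', text) if piece]
print(star_pieces)  # ['a', 'b', 'c']
```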

Lib/doctest.py

Lines changed: 1 addition & 1 deletion
@@ -1611,7 +1611,7 @@ def check_output(self, want, got, optionflags):
                           '', want)
             # If a line in got contains only spaces, then remove the
             # spaces.
-            got = re.sub(r'(?m)^\s*?$', '', got)
+            got = re.sub(r'(?m)^[^\S\n]+$', '', got)
             if got == want:
                 return True
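
The replacement pattern in doctest.py can be exercised on its own. This is a minimal sketch; the constant name `SPACES_ONLY_LINE` and the sample text are assumptions made for illustration and are not part of doctest.

```python
import re

# One or more whitespace characters other than '\n', filling a whole line.
# Unlike \s, the class [^\S\n] can never consume the newline itself.
SPACES_ONLY_LINE = re.compile(r'(?m)^[^\S\n]+$')

got = 'first\n \t \nsecond\n\nthird\n'
cleaned = SPACES_ONLY_LINE.sub('', got)
print(repr(cleaned))  # 'first\n\nsecond\n\nthird\n'

# Lines that are already empty are left alone, so no newline is lost.
unchanged = SPACES_ONLY_LINE.sub('', 'a\n\n')
print(repr(unchanged))  # 'a\n\n'
```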

Lib/test/test_re.py

Lines changed: 31 additions & 13 deletions
@@ -331,21 +331,21 @@ def test_re_split(self):
                          ['', 'a', '', '', 'c'])

         for sep, expected in [
-            (':*', ['', 'a', 'b', 'c']),
-            ('(?::*)', ['', 'a', 'b', 'c']),
-            ('(:*)', ['', ':', 'a', ':', 'b', '::', 'c']),
-            ('(:)*', ['', ':', 'a', ':', 'b', ':', 'c']),
+            (':*', ['', 'a', 'b', 'c', '']),
+            ('(?::*)', ['', 'a', 'b', 'c', '']),
+            ('(:*)', ['', ':', 'a', ':', 'b', '::', 'c', '', '']),
+            ('(:)*', ['', ':', 'a', ':', 'b', ':', 'c', None, '']),
         ]:
-            with self.subTest(sep=sep), self.assertWarns(FutureWarning):
+            with self.subTest(sep=sep):
                 self.assertTypedEqual(re.split(sep, ':a:b::c'), expected)

         for sep, expected in [
-            ('', [':a:b::c']),
-            (r'\b', [':a:b::c']),
-            (r'(?=:)', [':a:b::c']),
-            (r'(?<=:)', [':a:b::c']),
+            ('', ['', ':', 'a', ':', 'b', ':', ':', 'c', '']),
+            (r'\b', [':', 'a', ':', 'b', '::', 'c', '']),
+            (r'(?=:)', ['', ':a', ':b', ':', ':c']),
+            (r'(?<=:)', [':', 'a:', 'b:', ':', 'c']),
         ]:
-            with self.subTest(sep=sep), self.assertRaises(ValueError):
+            with self.subTest(sep=sep):
                 self.assertTypedEqual(re.split(sep, ':a:b::c'), expected)

     def test_qualified_re_split(self):
@@ -356,9 +356,8 @@ def test_qualified_re_split(self):
                          ['', ':', 'a', ':', 'b::c'])
         self.assertEqual(re.split("(:+)", ":a:b::c", maxsplit=2),
                          ['', ':', 'a', ':', 'b::c'])
-        with self.assertWarns(FutureWarning):
-            self.assertEqual(re.split("(:*)", ":a:b::c", maxsplit=2),
-                             ['', ':', 'a', ':', 'b::c'])
+        self.assertEqual(re.split("(:*)", ":a:b::c", maxsplit=2),
+                         ['', ':', 'a', ':', 'b::c'])

     def test_re_findall(self):
         self.assertEqual(re.findall(":+", "abc"), [])
@@ -1751,6 +1750,25 @@ def test_match_repr(self):
             "span=(3, 5), match='bb'>" %
             (type(second).__module__, type(second).__qualname__))

+    def test_zerowidth(self):
+        # Issues 852532, 1647489, 3262, 25054.
+        self.assertEqual(re.split(r"\b", "a::bc"), ['', 'a', '::', 'bc', ''])
+        self.assertEqual(re.split(r"\b|:+", "a::bc"), ['', 'a', '', 'bc', ''])
+        self.assertEqual(re.split(r"(?<!\w)(?=\w)|:+", "a::bc"), ['', 'a', 'bc'])
+        self.assertEqual(re.split(r"(?<=\w)(?!\w)|:+", "a::bc"), ['a', '', 'bc', ''])
+
+        self.assertEqual(re.sub(r"\b", "-", "a::bc"), '-a-::-bc-')
+        self.assertEqual(re.sub(r"\b|:+", "-", "a::bc"), '-a--bc-')
+        self.assertEqual(re.sub(r"(\b|:+)", r"[\1]", "a::bc"), '[]a[][::]bc[]')
+
+        self.assertEqual(re.findall(r"\b|:+", "a::bc"), ['', '', '::', '', ''])
+        self.assertEqual(re.findall(r"\b|\w+", "a::bc"),
+                         ['', 'a', '', '', 'bc', ''])
+
+        self.assertEqual([m.span() for m in re.finditer(r"\b|:+", "a::bc")],
+                         [(0, 0), (1, 1), (1, 3), (3, 3), (5, 5)])
+        self.assertEqual([m.span() for m in re.finditer(r"\b|\w+", "a::bc")],
+                         [(0, 0), (0, 1), (1, 1), (3, 3), (3, 5), (5, 5)])

     def test_bug_2537(self):
         # issue 2537: empty submatches
Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
+Added support of splitting on a pattern that could match an empty string.
Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
+Fixed searching regular expression patterns that could match an empty
+string. Non-empty string can now be correctly found after matching an empty
+string.
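
The search fix described in this entry is visible through `re.findall`, `re.finditer` and `re.sub`. This is a minimal sketch, assuming Python 3.7 or later; the inputs and expected values are taken from the `test_zerowidth` assertions in this commit, and the variable names are illustrative.

```python
import re

text = 'a::bc'

# \b matches the empty string at positions 0, 1, 3 and 5. After the empty
# match at 0, the non-empty match 'a' is still found starting at 0.
found = re.findall(r'\b|\w+', text)
print(found)     # ['', 'a', '', '', 'bc', '']

spans = [m.span() for m in re.finditer(r'\b|\w+', text)]
print(spans)     # [(0, 0), (0, 1), (1, 1), (3, 3), (3, 5), (5, 5)]

# Substituting on a zero-width match inserts the replacement in place.
replaced = re.sub(r'\b', '-', text)
print(replaced)  # -a-::-bc-
```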

Modules/_sre.c

Lines changed: 22 additions & 55 deletions
@@ -446,6 +446,8 @@ state_init(SRE_STATE* state, PatternObject* pattern, PyObject* string,

     state->isbytes = isbytes;
     state->charsize = charsize;
+    state->match_all = 0;
+    state->must_advance = 0;

     state->beginning = ptr;

@@ -559,14 +561,14 @@ pattern_dealloc(PatternObject* self)
 }

 LOCAL(Py_ssize_t)
-sre_match(SRE_STATE* state, SRE_CODE* pattern, int match_all)
+sre_match(SRE_STATE* state, SRE_CODE* pattern)
 {
     if (state->charsize == 1)
-        return sre_ucs1_match(state, pattern, match_all);
+        return sre_ucs1_match(state, pattern, 1);
     if (state->charsize == 2)
-        return sre_ucs2_match(state, pattern, match_all);
+        return sre_ucs2_match(state, pattern, 1);
     assert(state->charsize == 4);
-    return sre_ucs4_match(state, pattern, match_all);
+    return sre_ucs4_match(state, pattern, 1);
 }

 LOCAL(Py_ssize_t)
@@ -606,7 +608,7 @@ _sre_SRE_Pattern_match_impl(PatternObject *self, PyObject *string,

     TRACE(("|%p|%p|MATCH\n", PatternObject_GetCode(self), state.ptr));

-    status = sre_match(&state, PatternObject_GetCode(self), 0);
+    status = sre_match(&state, PatternObject_GetCode(self));

     TRACE(("|%p|%p|END\n", PatternObject_GetCode(self), state.ptr));
     if (PyErr_Occurred()) {
@@ -645,7 +647,8 @@ _sre_SRE_Pattern_fullmatch_impl(PatternObject *self, PyObject *string,

     TRACE(("|%p|%p|FULLMATCH\n", PatternObject_GetCode(self), state.ptr));

-    status = sre_match(&state, PatternObject_GetCode(self), 1);
+    state.match_all = 1;
+    status = sre_match(&state, PatternObject_GetCode(self));

     TRACE(("|%p|%p|END\n", PatternObject_GetCode(self), state.ptr));
     if (PyErr_Occurred()) {
@@ -808,11 +811,8 @@ _sre_SRE_Pattern_findall_impl(PatternObject *self, PyObject *string,
         if (status < 0)
             goto error;

-        if (state.ptr == state.start)
-            state.start = (void*) ((char*) state.ptr + state.charsize);
-        else
-            state.start = state.ptr;
-
+        state.must_advance = (state.ptr == state.start);
+        state.start = state.ptr;
     }

     state_fini(&state);
@@ -901,17 +901,6 @@ _sre_SRE_Pattern_split_impl(PatternObject *self, PyObject *string,
     void* last;

     assert(self->codesize != 0);
-    if (self->code[0] != SRE_OP_INFO || self->code[3] == 0) {
-        if (self->code[0] == SRE_OP_INFO && self->code[4] == 0) {
-            PyErr_SetString(PyExc_ValueError,
-                            "split() requires a non-empty pattern match.");
-            return NULL;
-        }
-        if (PyErr_WarnEx(PyExc_FutureWarning,
-                         "split() requires a non-empty pattern match.",
-                         1) < 0)
-            return NULL;
-    }

     if (!state_init(&state, self, string, 0, PY_SSIZE_T_MAX))
         return NULL;
@@ -942,14 +931,6 @@ _sre_SRE_Pattern_split_impl(PatternObject *self, PyObject *string,
             goto error;
         }

-        if (state.start == state.ptr) {
-            if (last == state.end || state.ptr == state.end)
-                break;
-            /* skip one character */
-            state.start = (void*) ((char*) state.ptr + state.charsize);
-            continue;
-        }
-
         /* get segment before this match */
         item = getslice(state.isbytes, state.beginning,
             string, STATE_OFFSET(&state, last),
@@ -974,7 +955,7 @@ _sre_SRE_Pattern_split_impl(PatternObject *self, PyObject *string,
         }

         n = n + 1;
-
+        state.must_advance = 1;
         last = state.start = state.ptr;

     }
@@ -1101,9 +1082,7 @@ pattern_subx(PatternObject* self, PyObject* ptemplate, PyObject* string,
             if (status < 0)
                 goto error;

-        } else if (i == b && i == e && n > 0)
-            /* ignore empty match on latest position */
-            goto next;
+        }

         if (filter_is_callable) {
             /* pass match object through filter */
@@ -1130,16 +1109,8 @@ pattern_subx(PatternObject* self, PyObject* ptemplate, PyObject* string,

         i = e;
         n = n + 1;
-
-next:
-        /* move on */
-        if (state.ptr == state.end)
-            break;
-        if (state.ptr == state.start)
-            state.start = (void*) ((char*) state.ptr + state.charsize);
-        else
-            state.start = state.ptr;
-
+        state.must_advance = 1;
+        state.start = state.ptr;
     }

     /* get segment following last match */
@@ -2450,7 +2421,7 @@ _sre_SRE_Scanner_match_impl(ScannerObject *self)

     state->ptr = state->start;

-    status = sre_match(state, PatternObject_GetCode(self->pattern), 0);
+    status = sre_match(state, PatternObject_GetCode(self->pattern));
     if (PyErr_Occurred())
         return NULL;

@@ -2459,12 +2430,10 @@ _sre_SRE_Scanner_match_impl(ScannerObject *self)

     if (status == 0)
         state->start = NULL;
-    else if (state->ptr != state->start)
+    else {
+        state->must_advance = (state->ptr == state->start);
         state->start = state->ptr;
-    else if (state->ptr != state->end)
-        state->start = (void*) ((char*) state->ptr + state->charsize);
-    else
-        state->start = NULL;
+    }

     return match;
 }
@@ -2499,12 +2468,10 @@ _sre_SRE_Scanner_search_impl(ScannerObject *self)

     if (status == 0)
         state->start = NULL;
-    else if (state->ptr != state->start)
+    else {
+        state->must_advance = (state->ptr == state->start);
         state->start = state->ptr;
-    else if (state->ptr != state->end)
-        state->start = (void*) ((char*) state->ptr + state->charsize);
-    else
-        state->start = NULL;
+    }

     return match;
 }
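
The effect of replacing the "skip one character" logic with the `must_advance` flag can be approximated from Python. This is a rough sketch, not the C implementation: `old_style_spans` is a hypothetical helper that mimics the removed loop by using `Pattern.search` with a start position, and `new_style_spans` simply defers to `re.finditer`, assuming Python 3.7 or later.

```python
import re

def old_style_spans(pattern, string):
    """Hypothetical emulation of the removed loop (skip one char after an empty match)."""
    compiled = re.compile(pattern)
    spans = []
    pos = 0
    while pos <= len(string):
        match = compiled.search(string, pos)
        if match is None:
            break
        spans.append(match.span())
        if match.end() == match.start():
            # Removed rule: restart one character later, which loses any
            # non-empty match that begins where the empty match was found.
            pos = match.end() + 1
        else:
            pos = match.end()
    return spans

def new_style_spans(pattern, string):
    """Behavior after this change, as exposed by re.finditer on Python 3.7+."""
    return [m.span() for m in re.finditer(pattern, string)]

old_spans = old_style_spans(r'\b|\w+', 'a::bc')
new_spans = new_style_spans(r'\b|\w+', 'a::bc')
print(old_spans)
print(new_spans)
```

With the removed rule, the match `'a'` at span (0, 1) is never reported and `'bc'` is cut down to `'c'`; with `must_advance`, the engine stays at the same position and only rejects another empty match there.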

Modules/sre.h

Lines changed: 3 additions & 1 deletion
@@ -67,18 +67,20 @@ typedef struct {
     void* end; /* end of original string */
     /* attributes for the match object */
     PyObject* string;
+    Py_buffer buffer;
     Py_ssize_t pos, endpos;
     int isbytes;
     int charsize; /* character size */
     /* registers */
     Py_ssize_t lastindex;
     Py_ssize_t lastmark;
     void** mark;
+    int match_all;
+    int must_advance;
     /* dynamically allocated stuff */
     char* data_stack;
     size_t data_stack_size;
     size_t data_stack_base;
-    Py_buffer buffer;
     /* current repeat context */
     SRE_REPEAT *repeat;
 } SRE_STATE;