gh-124130: Fix a bug in matching regular expression \B in empty string #127007

serhiy-storchaka · 2024-11-19T08:48:51Z

Issue: Regex \B doesn't match empty string #124130

📚 Documentation preview 📚: https://cpython-previews--127007.org.readthedocs.build/

wjssz · 2024-11-19T09:03:45Z

Doc/library/re.rst

-      RE implementations in other programming languages such as Perl.
-      This behavior is kept for compatibility reasons.
+   .. versionchanged:: 3.14
+      ``\B`` now matches the whole empty string.



The fix LGTM.

Some users may not be sure what "the whole empty string" means.
I suggest explaining it in more detail; there is no need to use "whole" with "empty string".

\B used to be unable to match the empty string. Now it can, which is consistent with mainstream languages.

I know why you use "whole". The \B description reads:

\B Matches the empty string, but only when it ...

You want to distinguish your wording from the "empty string" mentioned there.
Let me think about how to describe it more clearly.

It is still difficult to avoid contradicting "\B Matches the empty string, but only when it ...".

Yeah, "the empty string" means two different things there. The docs you're removing don't do a very good job either. Should be disambiguated somehow, for example

\B now matches if the input string is empty.

How about this:

\B used to be unable to match the 0-length strings "" and b"". Now it can, which is consistent with mainstream languages.

\B now matches if the input string is empty.

@Alcaro uses the word "input", which looks fine. Candidate:

\B now matches the empty input strings "" and b""; this behavior is consistent with RE implementations in mainstream programming languages.
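A minimal Python sketch of the behavior the candidate wording describes. The helper name `count_matches` is hypothetical. The result for an empty input depends on the interpreter: per the `versionchanged` note in this PR, `\B` matches an empty input string from 3.14 on, while earlier versions find no match there.

```python
import re

def count_matches(pattern, text):
    """Return how many (possibly empty) matches `pattern` has in `text`."""
    return len(re.findall(pattern, text))

# Non-empty inputs behave the same before and after the fix.
print(count_matches(r"\B", "a"))   # 0: both ends of "a" are word boundaries
print(count_matches(r"\B", "="))   # 2: neither end of "=" is a word boundary
print(count_matches(r"\b", ""))    # 0: an empty input has no word boundary

# The case this PR changes: empty str and bytes inputs.
# Expected 1 on Python 3.14+, 0 on earlier versions.
print(count_matches(r"\B", ""))
print(count_matches(rb"\B", b""))
```

The last two calls are the ones the phrase "empty input string" refers to; the "empty string" in the general `\B` description is the zero-width match itself.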

The previous test: #124130 (comment)
After that, I tested ASCII mode with "a" and Unicode mode with "ю".
The results are the same ~~except in Perl v5.40.0~~.

Tested with: regex module, openjdk, .net, rust, ruby, php, pcre2.
javascript/golang only treat ASCII characters as word characters (for \w\W\b\B), even in Unicode mode.

~~It seems Perl v5.40.0 has a bug in (\B + Unicode_mode).~~
Given this, it's not recommended to mention Perl in the doc.

Save the script below as perl.pl, and run: perl perl.pl

Perl script:

    print "Perl version: $]\n\n";

    # test \b in ASCII mode
    my $n = () = ( "" =~ /\b/g );
    print "\\b \"\" matches: $n\n";
    my $n = () = ( "a" =~ /\b/g );
    print "\\b \"a\" matches: $n\n";
    my $n = () = ( "=" =~ /\b/g );
    print "\\b \"=\" matches: $n\n";

    print "~~~~~~~~~~~~\n";

    # test \B in ASCII mode
    my $n = () = ( "" =~ /\B/g );
    print "\\B \"\" matches: $n\n";
    my $n = () = ( "a" =~ /\B/g );
    print "\\B \"a\" matches: $n\n";
    my $n = () = ( "=" =~ /\B/g );
    print "\\B \"=\" matches: $n\n";

    print "\nxxxx ASCII mode above / Unicode mode below xxxx\n\n";

    # test \b in Unicode mode
    my $n = () = ( "" =~ /\b/gu );
    print "\\b \"\" matches: $n\n";
    my $n = () = ( "ю" =~ /\b/gu );
    print "\\b \"ю\" matches: $n\n";
    my $n = () = ( "=" =~ /\b/gu );
    print "\\b \"=\" matches: $n\n";

    print "~~~~~~~~~~~~\n";

    # test \B in Unicode mode
    my $n = () = ( "" =~ /\B/gu );
    print "\\B \"\" matches: $n\n";
    my $n = () = ( "ю" =~ /\B/gu );
    print "\\B \"ю\" matches: $n <- other REs get 0, it seems a bug.\n";
    my $n = () = ( "=" =~ /\B/gu );
    print "\\B \"=\" matches: $n\n";

    # explanation
    print "\n(\\B + Unicode_mode) behaves inconsistently at the begin/end of a string:\n";
    print "\n/^\\B/gu NOT IN \"ю\":\n";
    my $n = () = ( "ю" =~ /^\B/gu );
    print "matches: $n\n";
    print "\n/\\B\$/gu IN \"ю\":\n";
    my $n = () = ( "ю" =~ /\B$/gu );
    print "matches: $n\n";

Script output:

    Perl version: 5.040000

    \b "" matches: 0
    \b "a" matches: 2
    \b "=" matches: 0
    ~~~~~~~~~~~~
    \B "" matches: 1
    \B "a" matches: 0
    \B "=" matches: 2

    xxxx ASCII mode above / Unicode mode below xxxx

    \b "" matches: 0
    \b "ю" matches: 2
    \b "=" matches: 0
    ~~~~~~~~~~~~
    \B "" matches: 1
    \B "ю" matches: 1 <- other REs get 0, it seems a bug.
    \B "=" matches: 2

    (\B + Unicode_mode) behaves inconsistently at the begin/end of a string:

    /^\B/gu NOT IN "ю":
    matches: 0

    /\B$/gu IN "ю":
    matches: 1
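For comparison, the same test matrix can be sketched with Python's `re` module. This is an illustrative equivalent, not part of the PR's test suite; the helper `count` is a hypothetical name. The counts for the empty input depend on the Python version (the `\B` case is what this PR changes), so no expected output is listed.

```python
import re

def count(pattern, text, flags=0):
    """Count all (possibly empty) matches of `pattern` in `text`."""
    return len(re.findall(pattern, text, flags))

# ASCII mode uses re.ASCII; str patterns are Unicode-aware by default.
for label, flags, word in (("ASCII", re.ASCII, "a"), ("Unicode", 0, "ю")):
    print(f"--- {label} mode ---")
    for pat in (r"\b", r"\B"):
        for text in ("", word, "="):
            print(f'{pat} "{text}" matches: {count(pat, text, flags)}')
```

In Unicode mode Python reports 0 matches of `\B` in "ю", in line with the other engines listed above.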

I'm sorry, Perl doesn't have a (\B + Unicode_mode) bug.
If the line use utf8; is added at the beginning of the Perl script, it works as expected.
ref: https://perldoc.perl.org/utf8
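One plausible reading of why the script without `use utf8;` reported one `\B` match for "ю" (this is an assumption, not something stated in the thread): Perl then sees the literal as its two UTF-8 bytes, and under `/u` those bytes are treated as the code points U+00D1 and U+008E. The Python sketch below imitates that situation by decoding the UTF-8 bytes as Latin-1; the variable names are illustrative.

```python
import re

real = "ю"  # one code point, a word character
# Assumed model of the source without `use utf8;`: two separate code points.
mangled = real.encode("utf-8").decode("latin-1")  # word char + control char

print([hex(ord(c)) for c in mangled])  # ['0xd1', '0x8e']
print(len(re.findall(r"\b", real)), len(re.findall(r"\B", real)))        # 2 0
print(len(re.findall(r"\b", mangled)), len(re.findall(r"\B", mangled)))  # 2 1
print(bool(re.search(r"^\B", mangled)), bool(re.search(r"\B$", mangled)))  # False True
```

The counts for the mangled string (2 for `\b`, 1 for `\B`, with the `\B` match only at the end) line up with the script output quoted above.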

vstinner

The overall change LGTM, but I didn't review Modules/_sre/sre_lib.h (I don't know this code).

Doc/library/re.rst

Misc/NEWS.d/next/Library/2024-11-19-10-46-57.gh-issue-124130.OZ_vR5.rst

picnixz

This one has a conflict (but I don't want to fix it for you since I don't know if you have local commits) but once it's solved, LGTM.

… string (pythonGH-127007)

pythongh-124130: Fix a bug in matching regular expression \B in empty…

592a769

… string

bedevere-app bot mentioned this pull request Nov 19, 2024

Regex \B doesn't match empty string #124130

Closed

bedevere-app bot added the awaiting core review label Nov 19, 2024

wjssz reviewed Nov 19, 2024

vstinner approved these changes Nov 19, 2024

bedevere-app bot added awaiting merge and removed awaiting core review labels Nov 19, 2024

picnixz reviewed Nov 21, 2024

Doc/library/re.rst (outdated, resolved)

Misc/NEWS.d/next/Library/2024-11-19-10-46-57.gh-issue-124130.OZ_vR5.rst (outdated, resolved)

serhiy-storchaka added 2 commits November 22, 2024 21:12

Update docs.

bbace33

Merge branch 'main' into re-non-boundary

4c79573

picnixz approved these changes Jan 2, 2025

Merge branch 'main' into re-non-boundary

ef0a6fc

serhiy-storchaka enabled auto-merge (squash) January 2, 2025 11:44

serhiy-storchaka merged commit a3711d1 into python:main Jan 2, 2025
41 checks passed

bedevere-app bot removed the awaiting merge label Jan 2, 2025

serhiy-storchaka deleted the re-non-boundary branch January 2, 2025 12:23

srinivasreddy pushed a commit to srinivasreddy/cpython that referenced this pull request Jan 8, 2025

pythongh-124130: Fix a bug in matching regular expression \B in empty…

3ac998c

… string (pythonGH-127007)
