-
Notifications
You must be signed in to change notification settings - Fork 2.5k
regexp: implement a new regular expression API #5226
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Merged
Merged
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
/cc @carlosmn |
This looks like a good improvement to me. Do you need help with the macOS and Windows failures? |
On Fri, Sep 20, 2019 at 03:03:56AM -0700, Edward Thomson wrote:
This looks like a good improvement to me. Do you need help
with the macOS and Windows failures?
I didn't yet have time to dig deeper. I'll have a look tomorrow
and let you know if I need any help, thanks :)
|
ddcec7e
to
44740a0
Compare
We currently support a set of different regular expression backends with PCRE, PCRE2, regcomp(3P) and regcomp_l(3). The current implementation of this is done via a simple POSIX wrapper that either directly uses supplied functions or that is a very small wrapper. To support PCRE and PCRE2, we use their provided <pcreposix.h> and <pcre2posix.h> wrappers. These wrappers are implemented in such a way that the accompanying libraries pcre-posix and pcre2-posix provide the same symbols as the libc ones, namely regcomp(3P) et al. This works out on some systems just fine, most importantly on glibc-based ones, where the regular expression functions are implemented as weak aliases and thus get overridden by linking in the pcre{,2}-posix library. On other systems we depend on the linking order of libc and pcre library, and as libc always comes first we will end up with the functions of the libc implementation. As a result, we may use the structures `regex_t` and `regmatch_t` declared by <pcre{,2}posix.h>, but use functions defined by the libc, leading to segfaults. The issue is not easily solvable. Somed distributions like Debian have resolved this by patching PCRE and PCRE2 to carry custom prefixes to all the POSIX function wrappers. But this is not supported by upstream and thus inherently unportable between distributions. We could instead try to modify linking order, but this starts becoming fragile and will not work e.g. when libgit2 is loaded via dlopen(3P) or similar ways. In the end, this means that we simply cannot use the POSIX wrappers provided by the PCRE libraries at all. Thus, this commit introduces a new regular expression API. The new API is on a tad higher level than the previous POSIX abstraction layer, as it tries to abstract away any non-portable flags like e.g. REG_EXTENDED, which has no equivalents in all of our supported backends. As there are no users of POSIX regular expressions that do _not_ reguest REG_EXTENDED this is fine to be abstracted away, though. Due to the API being higher-level than before, it should generally be a tad easier to use than the previous one. Note: ideally, the new API would've been called `git_regex_foobar` with a file "regex.h" and "regex.c". Unfortunately, this is currently impossible to implement due to naming clashes between the then-existing "regex.h" and <regex.h> provided by the libc. As we add the source directory of libgit2 to the header search path, an include of <regex.h> would always find our own "regex.h". Thus, we have to take the bitter pill of adding one more character to all the functions to disambiguate the includes. To improve guarantees around cross-backend compatibility, this commit also brings along an improved regular expression test suite core::regexp.
The old POSIX regex API has been superseded by our new regexp API. Convert all users to make use of the new one.
The old POSIX regex wrappers have been superseded by our own regexp API that provides a higher-level abstraction. Remove the POSIX wrappers in favor of the new one.
44740a0
to
f585b12
Compare
And another green PR :) |
carlosmn
approved these changes
Sep 28, 2019
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This all makes sense; thanks.
Thanks for doing this work @pks-t, what a nice improvement. |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
We currently support a set of different regular expression backends with
PCRE, PCRE2, regcomp(3P) and regcomp_l(3). The current implementation of
this is done via a simple POSIX wrapper that either directly uses
supplied functions or that is a very small wrapper.
To support PCRE and PCRE2, we use their provided <pcreposix.h> and
<pcre2posix.h> wrappers. These wrappers are implemented in such a way
that the accompanying libraries pcre-posix and pcre2-posix provide the
same symbols as the libc ones, namely regcomp(3P) et al. This works out
on some systems just fine, most importantly on glibc-based ones, where
the regular expression functions are implemented as weak aliases and
thus get overridden by linking in the pcre{,2}-posix library. On other
systems we depend on the linking order of libc and pcre library, and as
libc always comes first we will end up with the functions of the libc
implementation. As a result, we may use the structures
regex_t
andregmatch_t
declared by <pcre{,2}posix.h>, but use functions defined bythe libc, leading to segfaults.
The issue is not easily solvable. Somed distributions like Debian have
resolved this by patching PCRE and PCRE2 to carry custom prefixes to all
the POSIX function wrappers. But this is not supported by upstream and
thus inherently unportable between distributions. We could instead try
to modify linking order, but this starts becoming fragile and will not
work e.g. when libgit2 is loaded via dlopen(3P) or similar ways. In the
end, this means that we simply cannot use the POSIX wrappers provided by
the PCRE libraries at all.
Thus, this commit introduces a new regular expression API. The new API
is on a tad higher level than the previous POSIX abstraction layer, as
it tries to abstract away any non-portable flags like e.g. REG_EXTENDED,
which has no equivalents in all of our supported backends. As there are
no users of POSIX regular expressions that do not reguest REG_EXTENDED
this is fine to be abstracted away, though. Due to the API being
higher-level than before, it should generally be a tad easier to use
than the previous one.
Note: ideally, the new API would've been called
git_regex_foobar
witha file "regex.h" and "regex.c". Unfortunately, this is currently
impossible to implement due to naming clashes between the then-existing
"regex.h" and <regex.h> provided by the libc. As we add the source
directory of libgit2 to the header search path, an include of <regex.h>
would always find our own "regex.h". Thus, we have to take the bitter
pill of adding one more character to all the functions to disambiguate
the includes.
To improve guarantees around cross-backend compatibility, this commit
also brings along an improved regular expression test suite
core::regexp.
Supersedes #5218
Fixes #5217