Thanks to visit codestin.com
Credit goes to github.com

Skip to content

Conversation

pks-t
Copy link
Member

@pks-t pks-t commented Sep 13, 2019

We currently support a set of different regular expression backends with
PCRE, PCRE2, regcomp(3P) and regcomp_l(3). The current implementation of
this is done via a simple POSIX wrapper that either directly uses
supplied functions or that is a very small wrapper.

To support PCRE and PCRE2, we use their provided <pcreposix.h> and
<pcre2posix.h> wrappers. These wrappers are implemented in such a way
that the accompanying libraries pcre-posix and pcre2-posix provide the
same symbols as the libc ones, namely regcomp(3P) et al. This works out
on some systems just fine, most importantly on glibc-based ones, where
the regular expression functions are implemented as weak aliases and
thus get overridden by linking in the pcre{,2}-posix library. On other
systems we depend on the linking order of libc and pcre library, and as
libc always comes first we will end up with the functions of the libc
implementation. As a result, we may use the structures regex_t and
regmatch_t declared by <pcre{,2}posix.h>, but use functions defined by
the libc, leading to segfaults.

The issue is not easily solvable. Somed distributions like Debian have
resolved this by patching PCRE and PCRE2 to carry custom prefixes to all
the POSIX function wrappers. But this is not supported by upstream and
thus inherently unportable between distributions. We could instead try
to modify linking order, but this starts becoming fragile and will not
work e.g. when libgit2 is loaded via dlopen(3P) or similar ways. In the
end, this means that we simply cannot use the POSIX wrappers provided by
the PCRE libraries at all.

Thus, this commit introduces a new regular expression API. The new API
is on a tad higher level than the previous POSIX abstraction layer, as
it tries to abstract away any non-portable flags like e.g. REG_EXTENDED,
which has no equivalents in all of our supported backends. As there are
no users of POSIX regular expressions that do not reguest REG_EXTENDED
this is fine to be abstracted away, though. Due to the API being
higher-level than before, it should generally be a tad easier to use
than the previous one.

Note: ideally, the new API would've been called git_regex_foobar with
a file "regex.h" and "regex.c". Unfortunately, this is currently
impossible to implement due to naming clashes between the then-existing
"regex.h" and <regex.h> provided by the libc. As we add the source
directory of libgit2 to the header search path, an include of <regex.h>
would always find our own "regex.h". Thus, we have to take the bitter
pill of adding one more character to all the functions to disambiguate
the includes.

To improve guarantees around cross-backend compatibility, this commit
also brings along an improved regular expression test suite
core::regexp.


Supersedes #5218
Fixes #5217

@pks-t
Copy link
Member Author

pks-t commented Sep 13, 2019

/cc @carlosmn

@ethomson
Copy link
Member

This looks like a good improvement to me. Do you need help with the macOS and Windows failures?

@pks-t
Copy link
Member Author

pks-t commented Sep 20, 2019 via email

@pks-t pks-t force-pushed the pks/regexp-api branch 2 times, most recently from ddcec7e to 44740a0 Compare September 21, 2019 12:52
We currently support a set of different regular expression backends with
PCRE, PCRE2, regcomp(3P) and regcomp_l(3). The current implementation of
this is done via a simple POSIX wrapper that either directly uses
supplied functions or that is a very small wrapper.

To support PCRE and PCRE2, we use their provided <pcreposix.h> and
<pcre2posix.h> wrappers. These wrappers are implemented in such a way
that the accompanying libraries pcre-posix and pcre2-posix provide the
same symbols as the libc ones, namely regcomp(3P) et al. This works out
on some systems just fine, most importantly on glibc-based ones, where
the regular expression functions are implemented as weak aliases and
thus get overridden by linking in the pcre{,2}-posix library. On other
systems we depend on the linking order of libc and pcre library, and as
libc always comes first we will end up with the functions of the libc
implementation. As a result, we may use the structures `regex_t` and
`regmatch_t` declared by <pcre{,2}posix.h>, but use functions defined by
the libc, leading to segfaults.

The issue is not easily solvable. Somed distributions like Debian have
resolved this by patching PCRE and PCRE2 to carry custom prefixes to all
the POSIX function wrappers. But this is not supported by upstream and
thus inherently unportable between distributions. We could instead try
to modify linking order, but this starts becoming fragile and will not
work e.g. when libgit2 is loaded via dlopen(3P) or similar ways. In the
end, this means that we simply cannot use the POSIX wrappers provided by
the PCRE libraries at all.

Thus, this commit introduces a new regular expression API. The new API
is on a tad higher level than the previous POSIX abstraction layer, as
it tries to abstract away any non-portable flags like e.g. REG_EXTENDED,
which has no equivalents in all of our supported backends. As there are
no users of POSIX regular expressions that do _not_ reguest REG_EXTENDED
this is fine to be abstracted away, though. Due to the API being
higher-level than before, it should generally be a tad easier to use
than the previous one.

Note: ideally, the new API would've been called `git_regex_foobar` with
a file "regex.h" and "regex.c". Unfortunately, this is currently
impossible to implement due to naming clashes between the then-existing
"regex.h" and <regex.h> provided by the libc. As we add the source
directory of libgit2 to the header search path, an include of <regex.h>
would always find our own "regex.h". Thus, we have to take the bitter
pill of adding one more character to all the functions to disambiguate
the includes.

To improve guarantees around cross-backend compatibility, this commit
also brings along an improved regular expression test suite
core::regexp.
The old POSIX regex API has been superseded by our new regexp API.
Convert all users to make use of the new one.
The old POSIX regex wrappers have been superseded by our own regexp API
that provides a higher-level abstraction. Remove the POSIX wrappers in
favor of the new one.
@pks-t
Copy link
Member Author

pks-t commented Sep 21, 2019

And another green PR :)

Copy link
Member

@carlosmn carlosmn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This all makes sense; thanks.

@ethomson
Copy link
Member

Thanks for doing this work @pks-t, what a nice improvement.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

With PCRE2 selected we link against libc's functions causing crashes
3 participants