lang: es
regexp3 (C-lang, Go-lang) and regexp4 (C-lang, Go-lang)
raptor-book (draft (spanish)) : here
benchmarks ==> here
- Easy to use.
- No error checking.
- only regexp
- The most compact and clear code in a human regexp library.
- Zero dependencies. Not even the standard C library is used: pure C.
- No explicit dynamic memory management: no malloc, calloc, free, …
- Count matches
- Captures
- Replacement of captures
- Placement of specific captures within a character array
- Backreferences
- UTF-8 support
Recursive Regexp Raptor is a regular expression library for search, capture and replacement, written in C from scratch, with the following goals:
- Having most of the features present in any other regexp library.
- Elegant code: simple, clear and endowed with grace.
- Avoid explicit requests for dynamic memory.
- Avoid using any external libraries, including the standard library.
- Be a useful learning material.
There are two parallel developments of this library: the first (this one) focuses on simplicity and clear code; the second (regexp4), still in beta, seeks the maximum possible speed by implementing a "table of instructions". In both cases the algorithm is written from scratch and uses only C. Enjoy!
C does not have a standard regular expression library, although there are several implementations, such as PCRE, the regexp.h library of the GNU project, the regexp library of the Plan 9 operating system, and a few others. The author of this work (who is a little slow) found the code in all of them far-fetched and mystical, divided into several files full of macros, underscores and cryptic variables. Unable to understand anything, and after a retreat to the island of onanistic meditation, the author decided to make his own library, with casinos and Japanese schoolgirls.
Developed with GNU Emacs (the only true operating system), gcc 6.3.1 and clang (LLVM) 3.8.1, konsole and fish, running on Freidora 25.
There are two test batteries for the library. The first, an ASCII test battery, is in the file ascii_test.c.
To test the ASCII version:
gcc ascii_test.c regexp3_ascii.c
To test the UTF-8 version:
gcc ascii_test.c regexp3_utf8.c
The second test battery is exclusive to regexp3_utf8.c:
gcc utf8_test.c regexp3_utf8.c
In either case, run with
./a.out
To include Recursive Regexp Raptor in your code, place the files regexp3.h, charUtils.h and either regexp3_ascii.c or regexp3_utf8.c inside your project folder. You must include the header
#include "regexp3.h"
and finally compile with
gcc myProyect.c regexp3_ascii.c
or
gcc myProyect.c regexp3_utf8.c
Compiling with optimization noticeably reduces run time; try -O3.
This is the only search function; its prototype is:
int regexp3( const char *txt, const char *re );
- txt
- pointer to the string to search; it must end with the terminator '\0'.
- re
- pointer to the string containing the regular expression; it must end with the terminator '\0'.
The function returns the number of matches: 0 (none) or n matches.
The standard regular expression syntax uses the character '\'. Unfortunately this character conflicts with the syntax of C. For this reason, and to keep the code simple, an alternative syntax was chosen, detailed below:
- Text search in any location:
regexp3( "Raptor Test", "Raptor" );
- Alternation between several options "exp1|exp2"
regexp3( "Raptor Test", "Dinosaur|T Rex|Raptor|Triceratops" );
- Match any character "."
regexp3( "Raptor Test", "R.ptor" );
- Zero or one occurrence "?"
regexp3( "Raptor Test", "Ra?ptor" );
- One or more occurrences "+"
regexp3( "Raaaptor Test", "Ra+ptor" );
- Zero or more occurrences "*"
regexp3( "Raaaptor Test", "Ra*ptor" );
- Range of occurrences "{n1,n2}"
regexp3( "Raaaptor Test", "Ra{0,100}ptor" );
- Exact number of occurrences "{n1}"
regexp3( "Raptor Test", "Ra{1}ptor" );
- Minimum number of occurrences "{n1,}"
regexp3( "Raaaptor Test", "Ra{2,}ptor" );
- Sets.
- Character set "[abc]"
regexp3( "Raptor Test", "R[uoiea]ptor" );
- Range within a character set "[a-b]"
regexp3( "Raptor Test", "R[a-z]ptor" );
- Metacharacter within a character set "[:meta]"
regexp3( "Raptor Test", "R[:w]ptor" );
- Negated character set "[^abc]"
regexp3( "Raptor Test", "R[^uoie]ptor" );
- UTF-8 characters
regexp3( "R△ptor Test", "R△ptor" );
also
regexp3( "R△ptor Test", "R[△]ptor" );
- Match a character that is a letter ":a"
regexp3( "RAptor Test", "R:aptor" );
- Match a character that is not a letter ":A"
regexp3( "R△ptor Test", "R:Aptor" );
- Match a character that is a digit ":d"
regexp3( "R4ptor Test", "R:dptor" );
- Match a character that is not a digit ":D"
regexp3( "Raptor Test", "R:Dptor" );
- Match an alphanumeric character ":w"
regexp3( "Raptor Test", "R:wptor" );
- Match a non-alphanumeric character ":W"
regexp3( "R△ptor Test", "R:Wptor" );
- Match a character that is a space ":s"
regexp3( "R ptor Test", "R:sptor" );
- Match a character that is not a space ":S"
regexp3( "Raptor Test", "R:Sptor" );
- Match a UTF-8 (non-ASCII) character ":&"
regexp3( "R△ptor Test", "R:&ptor" );
- Escape a character with special meaning ":character"
The characters "|", "(", ")", "<", ">", "[", "]", "?", "+", "*", "{", "}", "-", "#" and "@" are special characters. Placing one of them unescaped, without regard for correct syntax within the expression, can cause infinite loops and other errors.
regexp3( ":#()|<>", ":::#:(:):|:<:>" );
The special characters (except the metacharacters) lose their meaning within a set
regexp3( "()<>[]|{}*#@?+", "[()<>:[:]|{}*?+#@]" );
- Grouping "(exp)"
regexp3( "Raptor Test", "(Raptor)" );
- Grouping with capture "<exp>"
regexp3( "Raptor Test", "<Raptor>" );
- Backreferences "@id"
A backreference needs a previously captured expression "<exp>"; write the capture number preceded by "@"
regexp3( "ae_ea", "<a><e>_@2@1" );
- Behavior modifiers
There are two types of modifiers. The first affects the behavior of the whole expression; the second affects specific sections. In either case the syntax is the same: the sign "#" followed by the modifiers.
Global modifiers are placed at the beginning of the expression and are as follows:
- Search only at the beginning "#^exp"
regexp3( "Raptor Test", "#^Raptor" );
- Search only at the end "#$exp"
regexp3( "Raptor Test", "#$Test" );
- Search at the beginning and the end "#^$exp"
regexp3( "Raptor Test", "#^$Raptor Test" );
- Stop at the first match "#?exp"
regexp3( "Raptor Test", "#?Raptor Test" );
- Search the string character by character "#~"
By default, when an expression matches a region of the text, the search continues from the end of that match. This modifier overrides that behavior, so that the search always advances character by character.
regexp3( "aaaaa", "#~a*" );
In this example, without the modifier the result would be one match; with the modifier the search resumes at the very next character each time, returning five matches.
- Case-insensitive matching "#*exp"
regexp3( "Raptor Test", "#*RaPtOr TeSt" );
All of the above modifiers are compatible with each other, i.e. you could search
regexp3( "Raptor Test", "#^$*?~RaPtOr TeSt" );
however, the modifiers "~" and "?" lose their meaning in the presence of "^" and/or "$".
An expression such as:
regexp3( "Raptor Test", "#$RaPtOr|#$TeSt" );
is erroneous: the modifier after "|" would apply to the section between "|" and "#", i.e. nothing, and the result would be wrong.
Local modifiers are placed after the repetition indicator (if there is one) and affect the same region the repetition indicator affects, i.e. characters, sets or groups.
- Case-insensitive matching "exp#*"
regexp3( "Raptor Test", "(RaPtOr)#* TeS#*t" );
- Case-sensitive matching "exp#/"
regexp3( "RaPtOr TeSt", "#*(RaPtOr)#/ TES#/T" );
Captures are indexed according to their order of appearance in the expression, for example:
<   <   >   |   <   <   >   >   >
= 1 =============================
    = 2==       = 2 =========
                    = 3 =
If the expression matches more than once in the search text, the index keeps increasing in order of appearance, that is:
<   <   >  |  <   >  >  <   <   >  |  <   >  >  <   <   >  |  <   >  >
= 1 ==================  = 3 ==================  = 5 ==================
    = 2==     = 2==         = 4==     = 4==         = 6==     = 6==
match one               match two               match three
The cpyCatch function copies a capture into a character array; here is its prototype:
char * cpyCatch( char * str, const int index );
- str
- pointer to an array large enough to hold the largest capture.
- index
- index of the capture (1 to n).
The function returns a pointer to the capture, terminated with '\0'. An incorrect index returns a pointer to a string that begins with '\0'.
To get the number of captures in a search, use totCatch:
int totCatch();
which returns a value from 0 to n.
You could use this and the previous function to print all captures with a function like this:
void printCatch(){
  char str[128];   /* must be large enough for the largest capture */
  int i = 0, max = totCatch();
  while( ++i <= max )
    printf( "[%d] >%s<\n", i, cpyCatch( str, i ) );
}
The functions gpsCatch() and lenCatch() do the same job as cpyCatch without using an array: the first returns a pointer to the start of the capture within the search text, and the second returns the length of the capture.
int lenCatch( const int index );
const char * gpsCatch( const int index );
The above example, using these functions, would be:
void printCatch(){
  int i = 0, max = totCatch();
  while( ++i <= max )
    printf( "[%d] >%.*s<\n", i, lenCatch( i ), gpsCatch( i ) );
}
char * putCatch( char * newStr, const char * putStr );
The putStr argument contains the text with which to form the new string, as well as markers indicating where to place captures. To insert a capture, write the sign "#" followed by the capture index. For example, putStr could be
char *putStr = "catch 1 >>#1<< catch 2 >>#2<< catch 747 >>#747<<";
newStr is a character array large enough to contain the string plus the captures. The function returns a pointer to the start of this array, which ends with the terminator '\0'.
To place the character "#" within the string, escape "#" with another "#", i.e.:
"## Comment" -> "# Comment"
Replacement operates on a character array into which the search text is placed, with a specified capture replaced by a text string. The function in charge of this work is rplCatch; its prototype is:
char * rplCatch( char * newStr, const char * rplStr, const int id );
- newStr
- character array large enough to hold the original text on which the replacement is carried out plus the replacement text for the captures.
- rplStr
- replacement text for the capture.
- id
- capture identifier, following the order of appearance within the regular expression. Passing a wrong index places an unaltered copy of the search string in the array newStr.
In this case the id argument, unlike in the function getCatch, does not refer to one specific capture: no matter how many times an expression has captured, the identifier indicates the position within the expression itself, i.e.:
   <   <   >   |   <   <   >   >   >
id = 1 =============================
id     = 2==       = 2 =========
id                     = 3 =
capture position within the expression
The replacement affects the captures as follows:
<   <   >  |  <   >  >  <   <   >  |  <   >  >  <   <   >  |  <   >  >
= 1 ==================  = 1 ==================  = 1 ==================
    = 2==     = 2==         = 2==     = 2==         = 2==     = 2==
capture one             capture two             capture three
:d
- digit from 0 to 9.
:D
- any character other than a digit from 0 to 9.
:a
- any character that is a letter (a-z, A-Z)
:A
- any character other than a letter
:w
- any alphanumeric character.
:W
- any non-alphanumeric character.
:s
- any whitespace character.
:S
- any character other than whitespace.
:&
- Non-ASCII character (in UTF8 version only).
:|
- Vertical bar
:^
- Caret
:$
- Dollar sign
:(
- Left parenthesis
:)
- Right parenthesis
:<
- Less than
:>
- Greater than
:[
- Left bracket
:]
- Right bracket
:.
- Period
:?
- Question mark
:+
- Plus sign
:-
- Minus sign
:*
- Asterisk
:{
- Left brace
:}
- Right brace
:#
- Hash sign (modifier)
::
- Colon
Additionally, you can use the usual C syntax to write characters such as newline, tab, etc. Similarly, you can use the C syntax for writing characters in octal, hexadecimal or Unicode.
The ascii_test.c file contains a wide variety of tests that are useful as usage examples, including the following:
regexp3( "07-07-1777", "<0?[1-9]|[12][0-9]|3[01]><[/:-\\]><0?[1-9]|1[012]>@2<[12][0-9]{3}>" );
captures a date-format string, separately capturing day, separator, month and year. The separator must be the same on both occasions it appears.
regexp3( "https://en.wikipedia.org/wiki/Regular_expression", "(https?|ftp):://<[^:s/:<:>]+></[^:s:.:<:>,/]+>*<.>*" );
captures something like a web link.
regexp3( "<mail>[email protected]</mail>", "<[_A-Za-z0-9:-]+(:.[_A-Za-z0-9:-]+)*>:@<[A-Za-z0-9]+>:.<[A-Za-z0-9]+><:.[A-Za-z0-9]{2}>*" );
captures the sections (user, site, domain) of something like an email address.
Flowchart, main search: init → loop in string → end of string? If yes: report no match → end. If no: search regexp. On a match: report match → end. On no match: back to loop in string.
Flowchart, search regexp (version one). Steps shown: get builder; we have builder? (no → finish: the path matches; yes → dispatch on the builder type: alternation, set, point, metacharacter, character, grouping); match builder (match → advance in string → back to get builder; no match → finish: the path does not match). For alternation: save position → loop in paths → we have path? (no → finish: without matches; yes → search regexp; on a match → finish: the path matches; on no match → restore position and try the next path).
Flowchart, search regexp (version two). Steps shown: save position → loop in paths → we have path? (no → finish: without matches; yes → get builder) → we have builder? (no → finish: the path matches; yes → dispatch on the builder type: set, point, metacharacter, character, grouping). Set, point, metacharacter and character go to match builder; grouping goes to search regexp. On a match → advance in string → back to get builder. On no match → finish: the path does not match → restore position → back to loop in paths.
This project is not "open source"; it is free software and, accordingly, it uses the GNU GPL Version 3. Any work that includes, uses or derives from the code of this library must comply with the terms of this license.