
Regular Expression Search Documentation

A powerful and flexible way to search in grayhatwarfare.com using regular expressions.

Introduction

Regular expression search is a new, more powerful and flexible way to search grayhatwarfare.com. Although simple keyword search is efficient and sufficient for most uses, it has certain limitations that regular expression search solves. This flexibility comes at a cost: each search takes about a minute to complete and uses considerable CPU resources. That is why we charge extra for the package that includes regular expressions: more resources are needed.

Limitations of simple search

Let's say we have a filename like: student_resumes/resume1-1_thebackup_1519380372187.docx

To index it, we strip all special characters, replace them with spaces, and then index the resulting words independently.

So: student_resumes/resume1-1_thebackup_1519380372187.docx
Becomes: student resumes resume1 1 thebackup 1519380372187 docx

That means if you search for "thebackup", "1519380372187", "docx", or "student", you will get this file in the results.

But if you search only for "backup" or "15193", you will not find this entry, because there is no partial matching on keywords. Also, if you need to include special characters in the search, this will not work.
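The indexing behaviour described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the actual indexer: the function names are hypothetical, and "special characters" is assumed here to mean anything other than ASCII letters and digits.

```python
import re

def index_tokens(filename: str) -> list[str]:
    """Replace every run of non-alphanumeric characters with a space, then split."""
    # Assumption: "special characters" means anything outside A-Z, a-z, 0-9.
    return re.sub(r"[^A-Za-z0-9]+", " ", filename).split()

def simple_search(filename: str, keyword: str) -> bool:
    """A keyword matches only when it equals a whole indexed token."""
    return keyword in index_tokens(filename)

name = "student_resumes/resume1-1_thebackup_1519380372187.docx"
print(index_tokens(name))
print(simple_search(name, "thebackup"))  # whole token
print(simple_search(name, "backup"))     # only part of a token
```

Because matching is done token by token, "backup" and "15193" are never found inside "thebackup" or "1519380372187".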

Regular Expressions

Regular expression search takes a different approach: special characters are not removed, so the expression is matched against the filename itself rather than against separate keywords.

Examples

This gives you more control over searching. Some examples:

.*backup.*
- Finds everything that contains backup, such as thebackup, backup2, _backup_, etc.
.*2018[\-_\. ]11.*
- Find everything related to November 2018
.*dump.*(gz|tar|zip)
- Find all files that contain the keyword "dump" and end with gz, tar, or zip
backup.*
- Find all files that BEGIN with backup
.*backup
- Find all files that END with backup.
.*\.php
- Find all files with php extension in the site
19[0-9]{2}
- Text that contains 19 and then exactly 2 digits from 0-9
.*"test.txt"
- Everything inside double quotes is treated as literal text and is not interpreted by the regex engine.
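To get a feel for how these patterns behave, here is a rough local approximation in Python. It assumes that `re.fullmatch` is a fair stand-in for the "whole filename must match" rule and that filenames are lowercased first. Python's `re` module is not Lucene, so the double-quote literal syntax from the last example is left out, and the sample filenames are made up for illustration.

```python
import re

def whole_name_match(pattern: str, filename: str) -> bool:
    """Approximate the service: lowercase the name, then require a full match."""
    return re.fullmatch(pattern, filename.lower()) is not None

# (pattern, hypothetical filename) pairs based on the examples above
samples = [
    (r".*backup.*", "TheBackup_2.sql"),
    (r".*2018[\-_\. ]11.*", "report_2018-11-03.csv"),
    (r".*dump.*(gz|tar|zip)", "db_dump_final.tar"),
    (r"backup.*", "thebackup.zip"),   # does not BEGIN with backup
    (r".*\.php", "index.php"),
]

for pattern, filename in samples:
    print(pattern, filename, whole_name_match(pattern, filename))
```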

You can use regex101.com to test your regular expressions. While doing that, please keep in mind the minor differences and implementation details below:

Implementation notes

We convert all text to lower case, to make searching easier. No need for [AaBbCc] etc.
Notice that most of the examples above start with .* - That is because, for an entry to be returned, the whole filename must match the regular expression.
- Our system automatically adds .* at the start and end when it detects that the input does not contain ^, $ or .*
- If you do not want us to autocorrect the input regular expression, there is a "Do not autocorrect regex" option
Some characters are reserved as operators and must be escaped to be used literally (see Reserved characters in the manual below).
If the Full Path option is selected in the search, the regular expression is run against the complete path (including the directory, if any). Otherwise it is run only on the filename part.
- Let's assume 2 files:
  - files/Metallica - Outlaw Torn.mp3
  - files/Metallica/Bleeding Me.mp3
- Let's assume that we search for: .*Metallica.*\.mp3
  - With Full Path disabled, only Outlaw Torn will be returned.
  - With Full Path enabled, both files, Outlaw Torn AND Bleeding Me, will be returned.
  - And yes, Load is a masterpiece. So what if it is not metal enough? Grow up, good music is good music.
Sorting and other filters (Extensions, Exclude extensions) work together with regular expression search.
The project runs on Elasticsearch and is compatible with Apache Lucene regular expressions. That means the supported syntax is only a subset of Java regular expressions. For example, lookaheads and lookbehinds are not supported.
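The autocorrect and Full Path notes above can be modelled roughly as follows. This is a sketch under assumptions, not the service's real code: the function names are hypothetical, the autocorrect rule is read literally (any ^, $ or .* anywhere in the input disables it), and case-insensitive matching stands in for the lowercase conversion.

```python
import re

def autocorrect(pattern: str) -> str:
    """Wrap the pattern in .* unless it already contains ^, $ or .*"""
    if "^" in pattern or "$" in pattern or ".*" in pattern:
        return pattern
    return ".*" + pattern + ".*"

def matches(path: str, pattern: str, full_path: bool) -> bool:
    """Match against the whole path, or only the part after the last slash."""
    target = path if full_path else path.rsplit("/", 1)[-1]
    # The whole target must match; IGNORECASE approximates the lowercasing.
    return re.fullmatch(pattern, target.lower(), re.IGNORECASE) is not None

files = ["files/Metallica - Outlaw Torn.mp3", "files/Metallica/Bleeding Me.mp3"]
pattern = r".*Metallica.*\.mp3"

print(autocorrect("backup"))
for full_path in (False, True):
    print(full_path, [f for f in files if matches(f, pattern, full_path)])
```

With `full_path=False` only the Outlaw Torn file is selected, because "Metallica" appears in the directory name, not in the filename, of the second file.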

Tips

Searching for .*keyword1.*keyword2.* will return different results than searching for .*keyword2.*keyword1.*
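The reason is that the keywords must appear in the order written. A small Python approximation (using `re.fullmatch` as a stand-in for the engine and a made-up filename) shows the difference. It also shows one possible way to accept either order with the union operator.

```python
import re

name = "keyword1_then_keyword2.txt"  # hypothetical filename

forward = re.fullmatch(r".*keyword1.*keyword2.*", name) is not None
reverse = re.fullmatch(r".*keyword2.*keyword1.*", name) is not None
print(forward, reverse)

# One way to match both orders: a union of the two orderings.
either = r".*(keyword1.*keyword2|keyword2.*keyword1).*"
print(re.fullmatch(either, name) is not None)
print(re.fullmatch(either, "keyword2_then_keyword1.txt") is not None)
```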

Regular Expressions API

V2 API

With version 2, we no longer need to base64 encode the regular expression. You can use the regular expression directly in the API. The API is the same as the files filter API, but instead of keywords and stopwords, you input the regular expression. An example is:

Example

curl "https://buckets.grayhatwarfare.com/api/v2/files?regexp=1&keywords=.*([0-9]{7})&access_token={apiKey}"

The above search finds all files that end with 7 digits in a row. Please note that, depending on how you perform the request, you might need to URL-encode the regex value. For example, with curl, running:

Example

curl "https://buckets.grayhatwarfare.com/api/v2/files?keywords=.*([0-9]{7})&regexp=1&access_token={apiKey}"

will not be performed correctly. But running:

Example

curl "https://buckets.grayhatwarfare.com/api/v2/files?regexp=1&keywords=.%2A%28%5B0-9%5D%7B7%7D%29&access_token={apiKey}"

will work ok. You can use https://www.urlencoder.org/ for your tests and all programming languages have urlencode functions. It is good to perform tests with more common queries to make sure that some results are returned before crafting the final regex.
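As an illustration of the URL-encoding step, the following Python sketch builds the same encoded URL as the working curl example above, using the standard library's `urllib.parse.quote`. The helper name and the placeholder API key are assumptions. The request itself is not sent here.

```python
from urllib.parse import quote

BASE_URL = "https://buckets.grayhatwarfare.com/api/v2/files"

def build_search_url(regex: str, api_key: str) -> str:
    """Percent-encode the regex and place it in the keywords parameter."""
    encoded = quote(regex, safe="")  # safe="" also encodes characters such as '/'
    return f"{BASE_URL}?regexp=1&keywords={encoded}&access_token={api_key}"

# "YOUR_API_KEY" is a placeholder, not a real key.
url = build_search_url(".*([0-9]{7})", "YOUR_API_KEY")
print(url)
```

Most HTTP client libraries encode query parameters for you when they are passed as a dictionary, in which case manual encoding like this is unnecessary.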

V1 API

To use regular expressions with the API:

Base64-encode your regular expression.
Use the files filter API, but instead of keywords and stopwords, input the Base64 text of the regular expression and add the parameter &regexp=1
An example is:
- .*dump.*(gz|tar|zip)
- Base64: LipkdW1wLiooZ3p8dGFyfHppcCk=
- More info about the query in the Api Documentation page
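The Base64 step for V1 can be done with Python's standard library, for example. The helper name below is hypothetical. The printed value can be compared with the Base64 string given above.

```python
import base64

def encode_regex_v1(regex: str) -> str:
    """Return the Base64 text of a regular expression for the V1 API."""
    return base64.b64encode(regex.encode("utf-8")).decode("ascii")

print(encode_regex_v1(".*dump.*(gz|tar|zip)"))
```

Standard Base64 output can contain +, / and =, so it may itself need URL-encoding when placed in a query string.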

Regular Expressions Manual

Reserved characters

The regular expression engine supports all Unicode characters. However, the following characters are reserved as operators:

. ? + * | { } [ ] ( ) " \

To use one of these characters literally, escape it with a preceding backslash or surround it with double quotes. For example:

\.                  # renders as a literal '.'
\\                  # renders as a literal '\'
"[email protected]"    # renders as '[email protected]'

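When a search term comes from user input or from a known filename, escaping the reserved characters by hand is error-prone. A minimal helper, assuming the reserved set is exactly the list shown above, could look like this (the function name is hypothetical):

```python
# The reserved operator characters listed above: . ? + * | { } [ ] ( ) " \
RESERVED = set('.?+*|{}[]()"\\')

def escape_literal(text: str) -> str:
    """Prefix every reserved character with a backslash."""
    return "".join("\\" + ch if ch in RESERVED else ch for ch in text)

print(escape_literal("test.txt"))
print(escape_literal("backup(1).zip"))
```

The escaped text can then be embedded in a larger expression, for example between a leading and a trailing .*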
Lucene differences from PCRE

Our regular expression engine is based on Apache Lucene regular expressions, not Perl Compatible Regular Expressions (PCRE). Patterns copied from tools like regex101 may need adjustments.

Lookahead and lookbehind are not supported, including (?=...), (?!...), (?<=...), and (?<!...).
Backreferences are not supported, for example (foo)\1.
Prefer explicit character classes like [0-9] instead of shorthand classes like \d, \w, or \s.
Anchors like ^ and $ are not line anchors like in PCRE. The whole filename must match the expression, which is why searches often use .*backup.*.
Lazy quantifiers such as .*? are not supported like in PCRE.
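Before submitting a pattern copied from a PCRE-oriented tool, it can help to scan it for the constructs listed above. The following is a heuristic sketch only: the function name and messages are made up, and the checks are simple text searches that can misfire (for example on an escaped backslash followed by a letter).

```python
import re

# (detector, message) pairs for the unsupported constructs listed above
CHECKS = [
    (re.compile(r"\(\?<?[=!]"), "lookahead or lookbehind"),
    (re.compile(r"\\[1-9]"), "backreference"),
    (re.compile(r"\\[dwsDWS]"), "shorthand class, prefer e.g. [0-9]"),
    (re.compile(r"[*+?}]\?"), "lazy quantifier"),
]

def lucene_warnings(pattern: str) -> list[str]:
    """Return a message for every PCRE-only construct that appears to be used."""
    return [message for detector, message in CHECKS if detector.search(pattern)]

for candidate in [r".*backup.*", r"(?<=user_)\d+", r"(foo)\1", r".*?\.zip"]:
    print(candidate, lucene_warnings(candidate))
```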

Operators

Our regular expression engine does not use the Perl Compatible Regular Expressions (PCRE) library, but it does support the following standard operators.

Operator	Description	Example	Matches
`.`	Matches any character	ab.	'aba', 'abb', 'abz', etc.
`?`	Repeat the preceding character zero or one times. Often used to make the preceding character optional.	abc?	'ab' and 'abc'
`+`	Repeat the preceding character one or more times.	ab+	'ab', 'abb', 'abbb', etc.
`*`	Repeat the preceding character zero or more times.	ab*	'a', 'ab', 'abb', 'abbb', etc.
`{}`	Minimum and maximum number of times the preceding character can repeat.	a{2}	'aa'
		a{2,4}	'aa', 'aaa', and 'aaaa'
		a{2,}	'a' repeated two or more times
`|`	Union operator. The match will succeed if the longest pattern on either the left side or the right side matches.	abc|xyz	'abc' and 'xyz'
`( ... )`	Forms a group. You can use a group to treat part of the expression as a single unit.	abc(def)?	'abc' and 'abcdef' but not 'abcd'
`[ ... ]`	Match one of the characters in the brackets. Inside the brackets, `-` indicates a range unless `-` is the first character or escaped. A `^` before a character in the brackets negates the character or range.	[abc]	'a', 'b', 'c'
		[a-c]	'a', 'b', or 'c'
		[-abc]	'-' is first character. Matches '-', 'a', 'b', or 'c'
		[abc\-]	Escapes '-'. Matches 'a', 'b', 'c', or '-'
		[^abc]	Any character except 'a', 'b', or 'c'
		[^a-c]	Any character except 'a', 'b', or 'c'
		[^-abc]	Any character except '-', 'a', 'b', or 'c'
		[^abc\-]	Any character except 'a', 'b', 'c', or '-'
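These basic operators behave the same way in most regex engines, so the table can be spot-checked locally. The sketch below uses Python's `re.fullmatch` as an approximation of whole-string matching. It is not the Lucene engine, and agreement on these simple cases does not extend to the unsupported features listed earlier.

```python
import re

def full(pattern: str, text: str) -> bool:
    """True when the whole text matches the pattern."""
    return re.fullmatch(pattern, text) is not None

# pattern -> strings expected to match, following the operator table
should_match = {
    "ab.": ["aba", "abz"],
    "abc?": ["ab", "abc"],
    "ab+": ["ab", "abbb"],
    "ab*": ["a", "ab", "abb"],
    "a{2,4}": ["aa", "aaa", "aaaa"],
    "abc|xyz": ["abc", "xyz"],
    "abc(def)?": ["abc", "abcdef"],
    "[^a-c]": ["d", "-"],
}

for pattern, texts in should_match.items():
    print(pattern, all(full(pattern, t) for t in texts))

print(full("abc(def)?", "abcd"))  # the group is all-or-nothing
```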