@@ -45,9 +45,7 @@ import javascript
4545 *
4646 * This is what the query does. It makes no attempt to construct a prefix
4747 * leading into `q`, and only a weak one to construct a suffix that ensures
48- * rejection; this causes some false positives. Also, the query does not fully
49- * handle character classes and does not handle various other features at all;
50- * this causes false negatives.
48+ * rejection; this causes some false positives.
5149 *
5250 * Finally, sometimes it depends on the translation whether the NFA generated
5351 * for a regular expression has a pumpable fork or not. We implement one
@@ -63,20 +61,23 @@ import javascript
6361 * * Transitions between states may be labelled with epsilon, or an abstract
6462 * input symbol.
6563 * * Each abstract input symbol represents a set of concrete input characters:
66- * either a single character, a set of characters represented by a (positive)
64+ * either a single character, a set of characters represented by a
6765 * character class, or the set of all characters.
6866 * * The product automaton is constructed lazily, starting with pair states
6967 * `(q, q)` where `q` is a fork, and proceeding along an over-approximate
7068 * step relation.
7169 * * The over-approximate step relation allows transitions along pairs of
72- * abstract input symbols as long as the symbols are not trivially incompatible.
70+ * abstract input symbols where the symbols have overlap in the characters they accept.
7371 * * Once a trace of pairs of abstract input symbols that leads from a fork
7472 * back to itself has been identified, we attempt to construct a concrete
7573 * string corresponding to it, which may fail.
7674 * * Instead of trying to construct a suffix that makes the automaton fail,
77- * we ensure that it isn't possible to reach the accepting state from the
78- * fork along epsilon transitions. In this case, it is very likely (though
79- * not guaranteed) that a rejecting suffix exists.
75+ * we ensure that repeating `n` copies of `w` does not reach a state that is
76+ * an epsilon transition away from the accepting state.
77+ * This assumes that the accepting state accepts any suffix.
78+ * Regular expressions that use the end anchor `$` have an accepting state
79+ * that does not accept all suffixes. Such regular expressions are not accurately
80+ * modelled by this assumption, which can cause false negatives.
8081 */
8182
8283/**
@@ -862,6 +863,19 @@ predicate isPumpable(State fork, string w) {
862863/**
863864 * Gets a state that can be reached from pumpable `fork` consuming all
864865 * chars in `w` any number of times followed by the first `i+1` characters of `w`.
866+ *
867+ * This predicate is used to ensure that the accepting state is not reached from the fork by repeating `w`.
868+ * This works under the assumption that any accepting state accepts all suffixes.
869+ * For example, a regexp like `/^(a+)+/` will accept any string as long as the prefix is some number of `"a"`s,
870+ * and it is therefore not possible to construct a rejected suffix.
871+ * This assumption breaks on regular expressions that use the anchor `$`, e.g. `/^(a+)+$/`, and such regular
872+ * expressions are not accurately modeled by this query.
873+ *
874+ * The string `w` is repeated any number of times because `w` needs to be
875+ * infinitely repeatable for the attack to work.
876+ * For a regular expression `/((ab)+)*abab/` the accepting state is not reachable from the fork
877+ * using epsilon transitions. But any attempt at repeating `w` will end in the accepting state.
878+ * This is also built on the assumption that any accepting state will accept all suffixes.
865879 */
866880State process(State fork, string w, int i) {
867881 isPumpable(fork, w) and
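The three regular expressions named in the new doc comments can be tried directly against a JavaScript engine. The following is a minimal sketch, not part of the query: `acceptsPumped` is a hypothetical helper, and the pump counts (16 and 8) and the `"!"` suffix are arbitrary choices made for illustration.

```typescript
// Hypothetical helper (not from the query): repeats the pump string `w`
// `n` times, appends `suffix`, and reports whether `re` accepts the result.
function acceptsPumped(re: RegExp, w: string, n: number, suffix: string): boolean {
  return re.test(w.repeat(n) + suffix);
}

// Without `$`, the accepting state accepts any suffix, so the pumped
// string is accepted and no rejecting suffix can be constructed.
const unanchoredAccepts = acceptsPumped(/^(a+)+/, "a", 16, "!");

// With `$`, the suffix "!" forces rejection, but only after the engine
// has backtracked through the ways of splitting the run of "a"s.
// The pump count is kept small so this finishes quickly.
const anchoredAccepts = acceptsPumped(/^(a+)+$/, "a", 16, "!");

// Repeating "ab" always ends up matching the trailing `abab`, so the
// pumped string is accepted regardless of the suffix.
const ababAccepts = acceptsPumped(/((ab)+)*abab/, "ab", 8, "!");

console.log({ unanchoredAccepts, anchoredAccepts, ababAccepts });
```

The second case is the one the comments describe as not accurately modeled: the pumped string is rejected only because of the `$` anchor, which the "accepting state accepts all suffixes" assumption does not account for.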