Allow macros in char classes #654

lsf37 · 2019-12-04T23:31:50Z

fixes #216

refactor parsing of character classes to produce an AST
extend macro expansion to that AST

This PR refactors and redesigns how char classes are parsed, and thereby removes some duplication in the .cup file. It also moves the definitions of char classes/regular expressions such as empty, anyChar, and newLine out of the .cup file into RegExp and IntCharSet respectively, to ensure consistency.

Compound character classes (= classes that are not a primitive predefined class or Unicode property) are now first fully parsed, then, in a separate pass after macro expansion, normalised into one IntCharSet that fully describes the class. Only after that step do we partition the input character set into these classes. Apart from being much cleaner, this should in theory lead to a slightly better (= coarser) partition of the character set, because we no longer make a separate partition for each operand of a compound expression. In practice, the effect is probably minimal.
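For readers who want a concrete picture of the two passes described above, here is a minimal sketch in Java. It is not the code from this PR: the class `CharClassSketch`, the node types `Range`, `MacroRef`, and `Union`, and the use of `TreeSet<Character>` in place of IntCharSet are all assumptions made for illustration. It only shows the general shape: expand macros on the AST first, then fold the macro-free tree into a single set.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

/** Illustrative toy AST for char classes; hypothetical, not JFlex's real classes. */
public class CharClassSketch {

  /** A node of the toy char-class AST. */
  interface Node {}

  /** An inclusive character range such as a-f. */
  static final class Range implements Node {
    final char from;
    final char to;

    Range(char from, char to) {
      this.from = from;
      this.to = to;
    }
  }

  /** A reference to a named macro inside a char class. */
  static final class MacroRef implements Node {
    final String name;

    MacroRef(String name) {
      this.name = name;
    }
  }

  /** Union of all operands of a compound class. */
  static final class Union implements Node {
    final List<Node> parts;

    Union(Node... parts) {
      this.parts = Arrays.asList(parts);
    }
  }

  /** Pass 1: replace every macro reference by its (expanded) definition. */
  static Node expand(Node node, Map<String, Node> macros) {
    if (node instanceof MacroRef) {
      String name = ((MacroRef) node).name;
      Node definition = macros.get(name);
      if (definition == null) {
        throw new IllegalArgumentException("undefined macro: " + name);
      }
      // Assumes macro definitions are not cyclic.
      return expand(definition, macros);
    }
    if (node instanceof Union) {
      List<Node> expanded = new ArrayList<>();
      for (Node part : ((Union) node).parts) {
        expanded.add(expand(part, macros));
      }
      return new Union(expanded.toArray(new Node[0]));
    }
    return node; // a Range contains no macros
  }

  /** Pass 2: fold a macro-free AST into one set describing the whole class. */
  static TreeSet<Character> normalise(Node node) {
    TreeSet<Character> set = new TreeSet<>();
    if (node instanceof Range) {
      Range range = (Range) node;
      // Iterate over int so that a range ending at '\uffff' terminates.
      for (int c = range.from; c <= range.to; c++) {
        set.add((char) c);
      }
    } else if (node instanceof Union) {
      for (Node part : ((Union) node).parts) {
        set.addAll(normalise(part));
      }
    } else {
      throw new IllegalArgumentException("unexpanded macro or unknown node");
    }
    return set;
  }

  public static void main(String[] args) {
    Map<String, Node> macros = new HashMap<>();
    macros.put("Digit", new Range('0', '9'));
    // A class built from a macro plus a literal range.
    Node hex = new Union(new MacroRef("Digit"), new Range('a', 'f'));
    System.out.println(normalise(expand(hex, macros)));
  }
}
```

Because the whole compound class becomes one set before any partitioning happens, a partitioner in this toy setting would see a single 16-element class rather than one class per operand.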

lsf37 · 2019-12-04T23:33:13Z

This is a fairly big PR, but in total it removes code, which I'm happy about, and it didn't require any changes to the test suite (at least not after the recent commit that ensured a consistent class order).

lsf37 · 2019-12-04T23:36:56Z

This is still somewhat WIP. It should work now, but I intend to add at least one test case for macro expansion, and to make Codacy a bit happier.

lgtm-com · 2019-12-04T23:57:17Z

This pull request introduces 3 alerts when merging a9afe35 into 7ff3188 - view on LGTM.com

new alerts:

3 for Spurious Javadoc @param tags

lgtm-com · 2019-12-05T09:12:01Z

This pull request fixes 1 alert when merging 2036c74 into e343125 - view on LGTM.com

fixed alerts:

1 for Spurious Javadoc @param tags

lgtm-com · 2019-12-05T09:27:37Z

This pull request fixes 1 alert when merging 2e20f67 into e343125 - view on LGTM.com

fixed alerts:

1 for Spurious Javadoc @param tags

lgtm-com · 2019-12-05T09:55:13Z

This pull request fixes 1 alert when merging 9147781 into e343125 - view on LGTM.com

fixed alerts:

1 for Spurious Javadoc @param tags

lsf37 · 2019-12-05T10:46:38Z

Alright, rebased and cleaned up. @regisd, @sarowe the PR is now ready for review.

lgtm-com · 2019-12-05T10:59:17Z

This pull request fixes 1 alert when merging 31fbf24 into e343125 - view on LGTM.com

fixed alerts:

1 for Spurious Javadoc @param tags

lsf37 · 2019-12-05T22:59:57Z

Thanks for the additional cleanup!

How strongly do you feel about using map? I had it in one place originally, but removed it again, because the imperative version is easy to read and seeing how awkward map is in Java every time compared to functional languages made me somewhat sad. It's not really critical either way from my side, though.
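To make the trade-off in question concrete, here is a small generic comparison, not taken from this PR's diff. The class `MapVsLoop` and both method names are hypothetical; the sketch only contrasts an imperative new-list-plus-add loop with the equivalent Java 8 stream `map`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

/** Hypothetical example contrasting a loop with a stream-based map. */
public class MapVsLoop {

  /** Imperative version: create a list and add each mapped element. */
  static List<Integer> lengthsLoop(List<String> words) {
    List<Integer> result = new ArrayList<>();
    for (String word : words) {
      result.add(word.length());
    }
    return result;
  }

  /** Stream version: the same mapping expressed with map and collect. */
  static List<Integer> lengthsStream(List<String> words) {
    return words.stream().map(String::length).collect(Collectors.toList());
  }

  public static void main(String[] args) {
    List<String> words = Arrays.asList("empty", "anyChar", "newLine");
    System.out.println(lengthsLoop(words));
    System.out.println(lengthsStream(words));
  }
}
```

Both methods return equal lists; the difference is purely one of style and verbosity.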


regisd · 2019-12-06T13:17:01Z

> How strongly do you feel about using map? I had it in one place originally, but removed it again, because the imperative version is easy to read and seeing how awkward map is in Java every time compared to functional languages made me somewhat sad.

If you're talking about Java 8 streams, I don't feel strongly at all.

lsf37 · 2019-12-07T05:21:45Z

> > How strongly do you feel about using map? I had it in one place originally, but removed it again, because the imperative version is easy to read and seeing how awkward map is in Java every time compared to functional languages made me somewhat sad.
>
> If you're talking about Java 8 streams, I don't feel strongly at all.

I think I'll leave it in after all, it's not so bad.

lsf37 · 2019-12-07T05:58:33Z

Will rebase manually to clean up the commits a bit (this is a likely point for future bisects), and merge afterwards.


lsf37 self-assigned this Dec 4, 2019

lsf37 added the bug (Not working as intended) and code quality (Code health and clean-up) labels Dec 4, 2019

lsf37 added this to the 1.8.0 milestone Dec 4, 2019

lsf37 requested review from regisd and sarowe December 4, 2019 23:37

regisd self-assigned this Dec 5, 2019

lsf37 force-pushed the ccl-macros branch from 71b00df to 31fbf24 Compare December 5, 2019 10:45

lsf37 changed the title ~~[WIP] allow Macros in char classes~~ Allow macros in char classes Dec 5, 2019

regisd force-pushed the ccl-macros branch from bc16b53 to 72312c4 Compare December 5, 2019 17:28

regisd reviewed Dec 6, 2019


jflex/src/main/cup/LexParse.cup (outdated)

jflex/src/main/java/jflex/core/IntCharSet.java (outdated)

lsf37 force-pushed the ccl-macros branch from 8c5760e to 82e3bdb Compare December 7, 2019 05:58

lsf37 and others added 6 commits December 7, 2019 16:42

Refactor char class parsing and enable macros in char classes

922e311

Addresses issue #216.

unify unexpected RegExp exceptions

527d4a4

add test for issue #216 (macros in char classes)

0e6737e

Minor: The fact that RegExpException is in jflex.core is not javadoc

d0ccf28

use truth in IntCharSetTest

9ee598e

Reduce visibility of fields

14e62e9

regisd and others added 4 commits December 7, 2019 16:42

Replace new Array/.add() by stream()

db6e777

Cleanup in NFA

d3b8093

refactor to reduce complexity of normalise() method

f488689

more traditional switch/case syntax

a47275d

(removed braces where possible)

lsf37 force-pushed the ccl-macros branch from 82e3bdb to a47275d Compare December 7, 2019 06:12

lsf37 merged commit d51e25c into master Dec 7, 2019

lsf37 deleted the ccl-macros branch December 7, 2019 06:39

regisd added a commit to regisd/jflex that referenced this pull request Dec 8, 2019

Bazelify testcase/ccl-macros

354449e

jflex-de#654

regisd added a commit that referenced this pull request Dec 8, 2019

Bazelify testcase/ccl-macros

705610c

#654

lsf37 mentioned this pull request Dec 11, 2019

find right end-to-end testing level #679

Open

lsf37 mentioned this pull request Feb 3, 2020

Interesting stress case: UAX29URLEmailTokenizerImpl.jflex in Lucene #715

Closed

rmuir mentioned this pull request Dec 4, 2021

simplify jflex grammars by using difference rather than negation apache/lucene#515

Merged


3 participants