Introduce ambig mode #485

katef · 2024-07-26T16:34:11Z

This PR introduces (struct fsm_options).ambig:

    AMBIG_NONE     = 0, /* default */
    AMBIG_ERROR    = 1 << 0,
    AMBIG_EARLIEST = 1 << 1,
    AMBIG_MULTIPLE = 1 << 2,

and provides codegen for single, none, and multiple endids on accepting states.
For most languages this means returning a boolean to indicate success/failure,
independently of whether any endids are present.
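
As a rough illustration of what the ambig modes could mean for an accepting state that carries several endids, here is a minimal sketch. The enum values are the ones listed above; `resolve_endids()` and its return convention are hypothetical and are not part of libfsm's API. The reading of AMBIG_NONE as "report no ids" and of AMBIG_EARLIEST as "lowest-numbered id wins" are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Values as listed in the PR description. */
enum ambig {
	AMBIG_NONE     = 0, /* default */
	AMBIG_ERROR    = 1 << 0,
	AMBIG_EARLIEST = 1 << 1,
	AMBIG_MULTIPLE = 1 << 2
};

/*
 * Hypothetical helper: decide which endids on one accepting state to
 * report. ids[] is assumed to be sorted ascending. Writes the chosen
 * ids to out[] and returns how many were written, or -1 if the mode
 * treats this set of ids as a conflict.
 */
static int
resolve_endids(enum ambig ambig, const unsigned *ids, size_t n, unsigned *out)
{
	size_t i;

	assert(n == 0 || (ids != NULL && out != NULL));

	switch (ambig) {
	case AMBIG_NONE:
		/* assumption: report success/failure only, no ids */
		return 0;

	case AMBIG_ERROR:
		/* assumption: more than one id on a state is a conflict */
		if (n > 1) {
			return -1;
		}
		if (n == 1) {
			out[0] = ids[0];
		}
		return (int) n;

	case AMBIG_EARLIEST:
		/* assumption: "earliest" means the lowest-numbered id */
		if (n == 0) {
			return 0;
		}
		out[0] = ids[0];
		return 1;

	case AMBIG_MULTIPLE:
		/* report every id attached to the state */
		for (i = 0; i < n; i++) {
			out[i] = ids[i];
		}
		return (int) n;
	}

	return -1;
}
```

Under these assumptions, a state carrying ids 2, 5 and 7 would yield nothing for AMBIG_NONE, a conflict for AMBIG_ERROR, just 2 for AMBIG_EARLIEST, and all three for AMBIG_MULTIPLE. In every case the match itself still succeeds or fails independently of the ids.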

There's an API change as a side effect: fsm_print() is now a single interface for printing (previously there was one function per language). I've also separated struct fsm_options and the alloc hooks from various parts; in particular, struct fsm_options is only passed to the print routines, and is no longer attached to struct fsm.

Other than the API changes, and perhaps quietly fixing some bugs along the way, this PR attempts to keep existing functionality unchanged.

Now you call fsm_print(f, fsm, FSM_PRINT_*) rather than fsm_print_*(). This might look a bit cumbersome to the caller, but I'm doing it so we have a single entry point to put shared stuff. More on that later.
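
To illustrate the "single entry point" idea in general terms, here is a small self-contained sketch of a dispatcher keyed on a language enum. Every name in it (`print_machine`, `enum print_lang`, `struct machine`) is a hypothetical stand-in; libfsm's real types and the exact fsm_print() signature are not reproduced here.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical stand-ins; libfsm's real enum and types differ. */
enum print_lang { PRINT_C, PRINT_DOT, PRINT_COUNT };

struct machine { unsigned nstates; };

typedef int print_fn(FILE *f, const struct machine *m);

static int
print_c(FILE *f, const struct machine *m)
{
	return fprintf(f, "/* %u states */\n", m->nstates) < 0 ? -1 : 0;
}

static int
print_dot(FILE *f, const struct machine *m)
{
	return fprintf(f, "digraph G { /* %u states */ }\n", m->nstates) < 0 ? -1 : 0;
}

/*
 * Single entry point: argument checks and any other shared work live
 * here once, rather than being repeated in each per-language printer.
 * Returns 0 on success, -1 on an unknown language or a write error.
 */
static int
print_machine(FILE *f, const struct machine *m, enum print_lang lang)
{
	static print_fn *const table[PRINT_COUNT] = { print_c, print_dot };

	assert(f != NULL);
	assert(m != NULL);

	if ((int) lang < 0 || (int) lang >= PRINT_COUNT) {
		return -1;
	}

	return table[lang](f, m);
}
```

A caller would write something like `print_machine(stdout, &m, PRINT_DOT)`. The extra argument is the cost; the benefit is one place to hang shared setup.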

This cuts down a lot of repetition for the various print functions, and also helps make clear what they actually use.

This drops information about which endids match. My intention is to output that later, but I wanted to think about one thing at a time, because this gets confusing. Here I'm making things worse before they get better. This applies for the codegen that doesn't use the endleaf callbacks. I'm going to call these "default" output. One observation here is that not all FSMs will have endids attached, and we still need to indicate success/failure. So that's why I'm not using the populated endids to indicate a match.

I've carried over only the fields we actually use to the vm ir struct. This is a bit brutal, I'm trying to avoid sacrificing functionality, but when it comes to a choice, at the moment I care more about keeping things separate than not cutting features. However the only thing I think I dropped is the state numbering for -l vmdot output.

…accept hooks. The accept callback gets a set of end ids. These hooks default to outputting those ids verbatim, you needn't override them per-program. I think that's clearer for users of the API. I ported the ambig enum over from (unmerged) work on rx(1). This is for how to handle endid ambiguities, which does something appropriate to each language's generated API. So that's now built in to libfsm rather than being implemented per-language per-program. This gets us clearer handling for multiple id output per language, and I think it's also clearer about what's responsible for what, overall.

This is a step towards moving fsm_options to the print interfaces only.

The important part of the diff here is just:

```diff
--- a/src/libfsm/internal.h
+++ b/src/libfsm/internal.h
@@ -11,7 +11,6 @@
 #include <stdlib.h>
 #include <fsm/fsm.h>
-#include <fsm/options.h>
 #include <adt/common.h>
@@ -76,7 +75,6 @@ struct fsm {
 	struct fsm_capture_info *capture_info;
 	struct endid_info *endid_info;
-	const struct fsm_options *opt;
 };
```

Everything else is fallout from not needing to pass around the options struct. This leaves the options struct only used for the print routines.

…ugging. Now that fsm_print() has a bunch of arguments, it's going to be annoying for debugging during development. fsm_dump() returns the same interface, but with some options set for compact output. The output looks like this:

```
# src/re/main.c:1081 new
0;
1;
0 -> 1 "a" .. "c", "x"; # e.g. "a"
start: 0;
end: 1;
```

There are a couple of awkward spots here (in particular calling fsm_endid_get() seems cumbersome for a user), but overall I think this came out really nicely. It does simplify a lot of the caller-side bookkeeping around tracking conflicts. And that's reflected in the diff removing program-specific datastructures.

silentbicycle · 2024-08-02T21:02:10Z

include/fsm/print.h

 struct fsm_options;

+/* a convenience for debugging */
+void fsm_dump(FILE *f, const struct fsm *fsm,


silentbicycle

There are a lot of line changes here due to pervasive interface changes, and I didn't look at all of the language-specific formatting code that closely (in particular, I'm not that familiar with Go syntax), but the overall changes make sense to me.

Changing the leaf and endleaf callbacks to accept and reject in #485 broke lx. This commit passes through the necessary information to restore the old behavior. It works in the happy path, but needs further testing.

libfsm's normal execution mode evaluates a DFA, character by character, terminating either when the next character isn't a valid edge or the end of input is reached (in which case it checks end state metadata).

lx's execution mode is a little different, because it's tokenizing -- instead of reading to the end of input, it should consume as much consecutive input as matches a particular token, then yield the token type and suspend.

lx used to work by breaking abstraction and calling directly into `fsm_print_cfrag` (overriding the leaf behavior to yield token types), but when the callback interfaces shifted, its internals no longer fit what lx expected. Now the reject hook is passed the same state metadata as the accept state, and the reject hook in lx checks whether the end id is associated with a particular AST mapping and token type.

Further things to check:

- There's a special case for the end of input, because it can't unget the next character. Ensure a token at EOI is tagged correctly.
- It currently doesn't check if the state is an end state, just whether it has at least one endid. This shouldn't matter, but it should check.
- It also doesn't handle multiple end IDs. lx seems to report errors for inputs that can match multiple tokens, so this may be unreachable.
- It's only implemented for the "c" output format, but would probably be usable by others without a lot more work. In particular, kate mentioned it'd be good to be able to use vmc output for lx.
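
The tokenizing mode described above (consume the longest run that matches one token, then stop at the first character that doesn't fit) can be sketched by hand. This is not lx output, and `next_token()` and the token names are hypothetical. Because the input here is a string with a cursor, "pushing back" the last character amounts to simply not advancing past it.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

enum token { TOK_EOF, TOK_IDENT, TOK_NUMBER, TOK_ERROR };

/*
 * Hand-written stand-in for a generated tokenizer (not lx output).
 * Consumes the longest run of input matching one token, leaves *pp
 * pointing at the first character that did not fit, stores the lexeme
 * length in *len, and returns the token type.
 */
static enum token
next_token(const char **pp, size_t *len)
{
	const char *s;
	enum token t;

	assert(pp != NULL && *pp != NULL && len != NULL);

	s = *pp;
	*len = 0;

	if (*s == '\0') {
		return TOK_EOF; /* end of input: nothing to push back */
	}

	if (islower((unsigned char) *s)) {
		t = TOK_IDENT;
		while (islower((unsigned char) *s)) {
			s++;
		}
	} else if (isdigit((unsigned char) *s)) {
		t = TOK_NUMBER;
		while (isdigit((unsigned char) *s)) {
			s++;
		}
	} else {
		/* no token starts here; skip one character and report it */
		*pp = s + 1;
		*len = 1;
		return TOK_ERROR;
	}

	/* "suspend": the next call resumes from the first unconsumed character */
	*len = (size_t) (s - *pp);
	*pp = s;
	return t;
}
```

Calling this repeatedly on `"abc123"` yields an identifier of length 3, then a number of length 3, then end of input. A generated DFA does the same job with explicit states rather than hand-written loops.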

Changing the leaf and endleaf callbacks to accept and reject in #485 broke lx, but it went unnoticed for a while. This fixes it.

libfsm's normal execution mode evaluates a DFA, character by character, terminating either when the next character isn't a valid edge or the end of input is reached (in which case it checks end state metadata).

lx's execution mode is a little different, because it's tokenizing -- instead of reading to the end of input, it should consume as much consecutive input as matches a particular token, then push back the last character read (so it can resume with it as context for the next token), yield the token type, and suspend.

lx used to work by breaking abstraction and calling directly into `fsm_print_cfrag` (overriding the leaf behavior to yield token types, and adding an extra 'NONE' state to the generated state machine code), but when the callback interfaces shifted, its internals no longer fit what lx expected. Now the reject hook is passed the same state metadata as the accept state, and the reject hook in lx checks whether the end id is associated with a particular AST mapping and token type.

This is only implemented for the "c" output format, but similar changes could possibly make others usable without a lot more work. In particular, kate mentioned it'd be good to be able to use "vmc" output instead of "c" moving forward.

Most of the code changes happen inside of lx's code generation, but there are a few elsewhere:

- The reject hook now has a state_metadata pointer, so update the callers for all the output formats.
- libfsm's 'c' output now includes a macro `FSM_ADVANCE_HOOK(C)`, which is called with the next character read in the FSM_IO_STR and FSM_IO_PAIR io modes immediately after advancing. This is used to inform lx's internal bookkeeping about token positions and buffering token names. FSM_IO_GETC doesn't need it, because its getc callback manages the character stream. The macro defaults to a no-op when undefined.
- libfsm's 'c' output also includes a flag, `has_consumed_input`, so the code expanded in place from the reject/accept hooks can determine when the state machine input handler loop has consumed any input. This was previously encoded by the extra NONE state. lx's code generation using this flag is a bit cluttered, because the reject hook doesn't know whether it's expanding for the end states, but it's probably not worth changing the reject hook type signature to add another flag. This results in checks for has_consumed_input in code paths where trivial static analysis would show it to be dead code, and some extra unreachable code at the end of the function.
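
The two additions to the 'c' output listed above can be pictured with a hand-written stand-in for generated code. The names `FSM_ADVANCE_HOOK` and `has_consumed_input` come from the commit message; everything else here (the `match_a_plus()` function, the `advanced` counter, and the exact placement of the hook call) is an assumption made for illustration and is not what libfsm actually emits.

```c
#include <assert.h>
#include <stddef.h>

/* Caller-side bookkeeping; in this sketch just a character counter. */
static size_t advanced;

/* A caller such as lx would define the hook before the generated code... */
#define FSM_ADVANCE_HOOK(C) do { (void) (C); advanced++; } while (0)

/* ...and the generated code falls back to a no-op when it is undefined. */
#ifndef FSM_ADVANCE_HOOK
#define FSM_ADVANCE_HOOK(C) /* no-op */
#endif

/*
 * Hand-written stand-in for a string-mode matcher for the regex a+.
 * Returns 1 on match and 0 on reject. *consumed reports whether any
 * input had been read by the time the decision was made.
 */
static int
match_a_plus(const char *s, int *consumed)
{
	int has_consumed_input = 0;
	int state = 0; /* 0: start, 1: seen at least one 'a' (accepting) */
	const char *p;

	assert(s != NULL && consumed != NULL);

	for (p = s; *p != '\0'; p++) {
		const char c = *p;

		has_consumed_input = 1;
		FSM_ADVANCE_HOOK(c); /* tell the caller a character was read */

		if (c != 'a') {
			/* reject path: code expanded here can consult the flag */
			*consumed = has_consumed_input;
			return 0;
		}

		state = 1;
	}

	*consumed = has_consumed_input;
	return state == 1;
}
```

With the flag available, a reject on empty input (nothing consumed) can be told apart from a reject after reading some characters, which is the distinction the extra NONE state used to carry.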

katef added 14 commits July 10, 2024 11:00

Centralise fsm_print() and friends.

0b84d75

Now you call fsm_print(f, fsm, FSM_PRINT_*) rather than fsm_print_*(). This might look a bit cumbersome to the caller, but I'm doing it so we have a single entry point to put shared stuff. More on that later.

Missing targets.

481af9f

Hoist up ir and vm opcode compilation to fsm_print().

0adad69

This cuts down a lot of repetition for the various print functions, and also helps make clear what they actually use.

Only qsort a non-empty retlist.

1545d1c

Cruft.

b4538cf

Move alloc hooks to struct fsm.

61fb10e

This is a step towards moving fsm_options to the print interfaces only.

Uninitialized value.

a7d6847

Clarification.

e03de6b

katef requested a review from silentbicycle July 26, 2024 16:34

Typo.

436d6e0

silentbicycle reviewed Aug 2, 2024

silentbicycle approved these changes Aug 2, 2024

katef merged commit 62d3b3b into main Aug 4, 2024

katef deleted the kate/ambig-mode branch August 4, 2024 15:23

katef mentioned this pull request Aug 20, 2024

rx, a program for compiling sets of regular expressions #488

Merged

silentbicycle mentioned this pull request Jul 3, 2025

Fix lx token identification #509

Merged
