Introduce ambig mode #485
Now you call fsm_print(f, fsm, FSM_PRINT_*) rather than fsm_print_*(). This might look a bit cumbersome to the caller, but I'm doing it so we have a single entry point to put shared stuff. More on that later.
This cuts down a lot of repetition for the various print functions, and also helps make clear what they actually use.
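To illustrate the single-entry-point idea, here is a minimal sketch of dispatching on a language enum. Everything in it (`toy_fsm`, `toy_print`, the `TOY_PRINT_*` members) is a hypothetical stand-in rather than libfsm's actual types or signature; it only shows how shared checks can live in one place.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical stand-ins; libfsm's real types and enum members differ. */
struct toy_fsm { unsigned state_count; };

enum toy_print_lang { TOY_PRINT_DOT, TOY_PRINT_C, TOY_PRINT_LANG_COUNT };

typedef int toy_print_fn(FILE *f, const struct toy_fsm *fsm);

static int print_dot(FILE *f, const struct toy_fsm *fsm)
{
	return fprintf(f, "digraph G { /* %u states */ }\n", fsm->state_count) < 0 ? -1 : 0;
}

static int print_c(FILE *f, const struct toy_fsm *fsm)
{
	return fprintf(f, "/* %u states */\n", fsm->state_count) < 0 ? -1 : 0;
}

/* Single entry point: argument checks and other shared work happen once here,
 * then the per-language printer is looked up in a table. */
int toy_print(FILE *f, const struct toy_fsm *fsm, enum toy_print_lang lang)
{
	static toy_print_fn *const table[TOY_PRINT_LANG_COUNT] = {
		[TOY_PRINT_DOT] = print_dot,
		[TOY_PRINT_C]   = print_c,
	};

	assert(f != NULL);
	assert(fsm != NULL);

	if ((unsigned) lang >= TOY_PRINT_LANG_COUNT) {
		return -1; /* unknown language */
	}

	return table[lang](f, fsm);
}
```

A caller would write `toy_print(stdout, &fsm, TOY_PRINT_DOT)` where it previously called a per-language function.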
This drops information about which endids match. My intention is to output that later, but I wanted to think about one thing at a time, because this gets confusing. Here I'm making things worse before they get better. This applies for the codegen that doesn't use the endleaf callbacks. I'm going to call these "default" output. One observation here is that not all FSMs will have endids attached, and we still need to indicate success/failure. So that's why I'm not using the populated endids to indicate a match.
I've carried over only the fields we actually use to the vm ir struct. This is a bit brutal, I'm trying to avoid sacrificing functionality, but when it comes to a choice, at the moment I care more about keeping things separate than not cutting features. However the only thing I think I dropped is the state numbering for -l vmdot output.
…accept hooks. The accept callback gets a set of end ids. These hooks default to outputting those ids verbatim, you needn't override them per-program. I think that's clearer for users of the API. I ported the ambig enum over from (unmerged) work on rx(1). This is for how to handle endid ambiguities, which does something appropriate to each language's generated API. So that's now built in to libfsm rather than being implemented per-language per-program. This gets us clearer handling for multiple id output per language, and I think it's also clearer about what's responsible for what, overall.
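As a rough illustration of what a built-in ambiguity policy could look like, the sketch below resolves a set of end ids under a mode. The enum members, their semantics, and `toy_resolve_endids` are all assumptions made for this example; the real ambig enum and its per-language behaviour may differ.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical policy names, not libfsm's actual enum. */
enum toy_ambig {
	TOY_AMBIG_NONE,     /* report match/no-match only, emit no ids */
	TOY_AMBIG_ERROR,    /* more than one id is a conflict */
	TOY_AMBIG_EARLIEST, /* emit only the lowest id */
	TOY_AMBIG_MULTIPLE  /* emit every id verbatim */
};

/* Returns the number of ids written to out, or -1 on conflict or
 * insufficient space. */
int toy_resolve_endids(enum toy_ambig mode,
	const unsigned *ids, size_t count,
	unsigned *out, size_t out_cap)
{
	size_t i;
	unsigned min;

	assert(ids != NULL || count == 0);
	assert(out != NULL || out_cap == 0);

	if (mode == TOY_AMBIG_NONE || count == 0) {
		return 0;
	}

	switch (mode) {
	case TOY_AMBIG_ERROR:
		if (count > 1 || out_cap < 1) {
			return -1;
		}
		out[0] = ids[0];
		return 1;

	case TOY_AMBIG_EARLIEST:
		if (out_cap < 1) {
			return -1;
		}
		min = ids[0];
		for (i = 1; i < count; i++) {
			if (ids[i] < min) {
				min = ids[i];
			}
		}
		out[0] = min;
		return 1;

	case TOY_AMBIG_MULTIPLE:
		if (count > out_cap) {
			return -1;
		}
		for (i = 0; i < count; i++) {
			out[i] = ids[i];
		}
		return (int) count;

	default:
		return -1;
	}
}
```

Putting this decision in the library means each program only picks a mode instead of reimplementing conflict tracking.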
This is a step towards moving fsm_options to the print interfaces only.
The important part of the diff here is just:
```diff
--- a/src/libfsm/internal.h
+++ b/src/libfsm/internal.h
@@ -11,7 +11,6 @@
 #include <stdlib.h>
 #include <fsm/fsm.h>
-#include <fsm/options.h>
 #include <adt/common.h>
@@ -76,7 +75,6 @@ struct fsm {
 struct fsm_capture_info *capture_info;
 struct endid_info *endid_info;
- const struct fsm_options *opt;
 };
```
Everything else is fallout from not needing to pass around the options struct. This leaves the options struct only used for the print routines.
…ugging. Now that fsm_print() has a bunch of arguments, it's going to be annoying to call for debugging during development. fsm_dump() returns the same interface, but with some options set for compact output. The output looks like this:
```
# src/re/main.c:1081 new
0;
1;
0 -> 1 "a" .. "c", "x"; # e.g. "a"
start: 0;
end: 1;
```
There are a couple of awkward spots here (in particular calling fsm_endid_get() seems cumbersome for a user), but overall I think this came out really nicely. It does simplify a lot of the caller-side bookkeeping around tracking conflicts. And that's reflected in the diff removing program-specific datastructures.
```c
struct fsm_options;

/* a convenience for debugging */
void fsm_dump(FILE *f, const struct fsm *fsm,
```
👍
There are a lot of line changes here due to pervasive interface changes, and I didn't look at all of the language-specific formatting code that closely (in particular, I'm not that familiar with Go syntax), but the overall changes make sense to me.
Changing the leaf and endleaf callbacks to accept and reject in #485 broke lx. This commit passes through the necessary information to restore the old behavior. It works in the happy path, but needs further testing.

libfsm's normal execution mode evaluates a DFA, character by character, terminating either when the next character isn't a valid edge or the end of input is reached (in which case it checks end state metadata). lx's execution mode is a little different, because it's tokenizing: instead of reading to the end of input, it should consume as much consecutive input as matches a particular token, then yield the token type and suspend.

lx used to work by breaking abstraction and calling directly into `fsm_print_cfrag` (overriding the leaf behavior to yield token types), but when the callback interfaces shifted, its internals no longer fit what lx expected. Now the reject hook is passed the same state metadata as the accept hook, and the reject hook in lx checks whether the end id is associated with a particular AST mapping and token type.

Further things to check:
- There's a special case for the end of input, because it can't unget the next character. Ensure a token at EOI is tagged correctly.
- It currently doesn't check if the state is an end state, just whether it has at least one endid. This shouldn't matter, but it should check.
- It also doesn't handle multiple end IDs. lx seems to report errors for inputs that can match multiple tokens, so this may be unreachable.
- It's only implemented for the "c" output format, but would probably be usable by others without a lot more work. In particular, kate mentioned it'd be good to be able to use vmc output for lx.
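The tokenizing behaviour described above (consume the longest run that matches, push back the character that ended it, yield a token type) can be sketched with a tiny hand-written lexer. This is an illustrative assumption, not lx's generated code: the token set, `toy_lexer`, and the helper names are all made up for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical token types for a two-token language. */
enum toy_token { TOY_TOK_EOF, TOY_TOK_IDENT, TOY_TOK_NUMBER, TOY_TOK_ERROR };

struct toy_lexer { const char *p; };

/* Returns the next byte, or -1 at end of input (without advancing). */
static int toy_getc(struct toy_lexer *lx)
{
	return *lx->p == '\0' ? -1 : (unsigned char) *lx->p++;
}

/* Push back the last byte read so the next token starts with it. */
static void toy_ungetc(struct toy_lexer *lx)
{
	lx->p--;
}

enum toy_token toy_next(struct toy_lexer *lx)
{
	enum { S_START, S_IDENT, S_NUMBER } state = S_START;
	int c;

	assert(lx != NULL && lx->p != NULL);

	for (;;) {
		c = toy_getc(lx);

		switch (state) {
		case S_START:
			if (c == -1) return TOY_TOK_EOF;
			if (c >= 'a' && c <= 'z') { state = S_IDENT;  continue; }
			if (c >= '0' && c <= '9') { state = S_NUMBER; continue; }
			return TOY_TOK_ERROR;

		case S_IDENT:
			if (c >= 'a' && c <= 'z') continue;
			/* End of input is special: there is nothing to unget. */
			if (c != -1) toy_ungetc(lx);
			return TOY_TOK_IDENT;

		case S_NUMBER:
			if (c >= '0' && c <= '9') continue;
			if (c != -1) toy_ungetc(lx);
			return TOY_TOK_NUMBER;
		}
	}
}
```

Calling `toy_next` repeatedly on the input `ab12` yields an identifier, then a number, then end of input; the second token ends at EOI, which is the case the first bullet above asks to verify.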
Changing the leaf and endleaf callbacks to accept and reject in #485 broke lx, but it went unnoticed for a while. This fixes it.

libfsm's normal execution mode evaluates a DFA, character by character, terminating either when the next character isn't a valid edge or the end of input is reached (in which case it checks end state metadata). lx's execution mode is a little different, because it's tokenizing: instead of reading to the end of input, it should consume as much consecutive input as matches a particular token, then push back the last character read (so it can resume with it as context for the next token), yield the token type, and suspend.

lx used to work by breaking abstraction and calling directly into `fsm_print_cfrag` (overriding the leaf behavior to yield token types, and adding an extra 'NONE' state to the generated state machine code), but when the callback interfaces shifted, its internals no longer fit what lx expected. Now the reject hook is passed the same state metadata as the accept hook, and the reject hook in lx checks whether the end id is associated with a particular AST mapping and token type.

This is only implemented for the "c" output format, but similar changes could possibly make others usable without a lot more work. In particular, kate mentioned it'd be good to be able to use "vmc" output instead of "c" moving forward.

Most of the code changes happen inside of lx's code generation, but there are a few elsewhere:
- The reject hook now has a state_metadata pointer, so update the callers for all the output formats.
- libfsm's 'c' output now includes a macro `FSM_ADVANCE_HOOK(C)`, which is called with the next character read in the FSM_IO_STR and FSM_IO_PAIR io modes immediately after advancing. This is used to inform lx's internal bookkeeping about token positions and buffering token names. FSM_IO_GETC doesn't need it, because its getc callback manages the character stream. The macro defaults to a no-op when undefined.
- libfsm's 'c' output also includes a flag, `has_consumed_input`, so the code expanded in place from the reject/accept hooks can determine when the state machine input handler loop has consumed any input. This was previously encoded by the extra NONE state. lx's code generation using this flag is a bit cluttered, because the reject hook doesn't know whether it's expanding for the end states, but it's probably not worth changing the reject hook type signature to add another flag. This results in checks for has_consumed_input in code paths where trivial static analysis would show it to be dead code, and some extra unreachable code at the end of the function.
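The two additions to the 'c' output can be pictured with a small hand-written matcher for `ab*`. Only the names `FSM_ADVANCE_HOOK(C)` and `has_consumed_input` come from the description above; the shape of the loop, the `toy_*` names, and the return codes are assumptions for illustration and not libfsm's generated code.

```c
#include <assert.h>
#include <stddef.h>

/* A caller (such as lx) may define the hook before the generated code
 * to track positions; here a hypothetical counter stands in for that. */
static size_t toy_bytes_seen;
#define FSM_ADVANCE_HOOK(C) do { (void) (C); toy_bytes_seen++; } while (0)

/* The generated code supplies a no-op default when nothing is defined. */
#ifndef FSM_ADVANCE_HOOK
#define FSM_ADVANCE_HOOK(C) /* no-op */
#endif

enum toy_result { TOY_EMPTY = -1, TOY_REJECT = 0, TOY_MATCH = 1 };

/* Matches the language ab* over a NUL-terminated string. */
enum toy_result toy_match(const char *s)
{
	enum { S0, S1 } state = S0;
	int has_consumed_input = 0;
	const char *p;

	assert(s != NULL);

	for (p = s; *p != '\0'; p++) {
		const unsigned char c = (unsigned char) *p;

		/* called immediately after advancing to the next character */
		FSM_ADVANCE_HOOK(c);
		has_consumed_input = 1;

		switch (state) {
		case S0:
			if (c == 'a') { state = S1; break; }
			return TOY_REJECT;
		case S1:
			if (c == 'b') { break; }
			return TOY_REJECT;
		}
	}

	/* The flag replaces an extra "NONE" state: code at the end can tell
	 * "no input at all" apart from "input that failed to match". */
	if (!has_consumed_input) {
		return TOY_EMPTY;
	}

	return state == S1 ? TOY_MATCH : TOY_REJECT;
}
```

With the hook defined as above, matching `abb` advances the counter by three; leaving the hook undefined falls back to the no-op and the matcher behaves the same otherwise.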
This PR introduces `(struct fsm_options).ambig` and provides codegen for single, none, and multiple endids on accepting states.
For most languages this means returning a boolean to indicate success/failure, independently of whether any endids are present.

There's an API change as a side effect: `fsm_print()` is now a single interface to print (previously this was one function per language). I've also separated the `struct fsm_options` and alloc hooks from various parts; in particular `struct fsm_options` is only passed to the print routines, and is no longer attached to `struct fsm`.

Other than the API changes, and perhaps quietly fixing some bugs along the way, this PR attempts to keep existing functionality unchanged.