

@katef katef commented Jul 26, 2024

This PR introduces (struct fsm_options).ambig:

    AMBIG_NONE     = 0, /* default */
    AMBIG_ERROR    = 1 << 0,
    AMBIG_EARLIEST = 1 << 1,
    AMBIG_MULTIPLE = 1 << 2,

and provides codegen for single, none, and multiple endids on accepting states.
For most languages this means returning a boolean to indicate success/failure,
independently of whether any endids are present.
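
As a rough illustration of what the four modes mean for a caller, the sketch below resolves the set of endids on a single accepting state under each mode. The enum values are the ones listed in this description; `resolve_endids()` and its return convention are hypothetical and not part of libfsm, and treating "earliest" as the lowest-numbered endid is an assumption.

```c
#include <stdbool.h>
#include <stddef.h>

enum ambig {
	AMBIG_NONE     = 0, /* default */
	AMBIG_ERROR    = 1 << 0,
	AMBIG_EARLIEST = 1 << 1,
	AMBIG_MULTIPLE = 1 << 2
};

/*
 * Hypothetical helper (not libfsm API): decide which endids an accepting
 * state reports. ids[] holds the state's endids in ascending order.
 * Writes the reported ids to out[] (room for n entries) and their count
 * to *out_n. Returns false only when the mode forbids the ambiguity.
 */
static bool
resolve_endids(enum ambig mode, const unsigned *ids, size_t n,
	unsigned *out, size_t *out_n)
{
	size_t i;

	*out_n = 0;

	switch (mode) {
	case AMBIG_NONE:
		/* match/no-match only; no ids are reported */
		return true;

	case AMBIG_ERROR:
		if (n > 1) {
			return false; /* more than one id is a conflict */
		}
		break;

	case AMBIG_EARLIEST:
		if (n > 1) {
			n = 1; /* assumption: "earliest" means the lowest id */
		}
		break;

	case AMBIG_MULTIPLE:
		break;
	}

	for (i = 0; i < n; i++) {
		out[i] = ids[i];
	}
	*out_n = n;

	return true;
}
```

Note that success is reported separately from the id list, which is the point made above: a match can succeed with zero ids.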

There's an API change as a side effect: fsm_print() is now a single interface for printing (previously there was one function per language). I've also separated the struct fsm_options and alloc hooks from various parts; in particular, struct fsm_options is only passed to the print routines, and is no longer attached to struct fsm.

Other than the API changes, and perhaps quietly fixing some bugs along the way, this PR attempts to keep existing functionality unchanged.

katef added 14 commits July 10, 2024 11:00
Now you call fsm_print(f, fsm, FSM_PRINT_*) rather than fsm_print_*().

This might look a bit cumbersome to the caller, but I'm doing it so we have a single entry point to put shared stuff. More on that later.
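
A minimal sketch of the single-entry-point shape described here, assuming a toy FSM type. The names `toy_fsm`, `toy_print()` and `TOY_PRINT_*` are hypothetical stand-ins for `fsm_print()` and `FSM_PRINT_*`; libfsm's real signature and constants may differ.

```c
#include <stdio.h>

/* Hypothetical stand-ins; these are not libfsm's types or constants. */
struct toy_fsm { unsigned state_count; };

enum toy_print_lang { TOY_PRINT_C, TOY_PRINT_DOT };

static int
print_c(FILE *f, const struct toy_fsm *fsm)
{
	return fprintf(f, "/* %u states */\n", fsm->state_count) < 0 ? -1 : 0;
}

static int
print_dot(FILE *f, const struct toy_fsm *fsm)
{
	return fprintf(f, "digraph G { /* %u states */ }\n", fsm->state_count) < 0 ? -1 : 0;
}

/* One entry point: shared checks live here, then dispatch by language. */
static int
toy_print(FILE *f, const struct toy_fsm *fsm, enum toy_print_lang lang)
{
	if (f == NULL || fsm == NULL) {
		return -1; /* shared validation, written once */
	}

	switch (lang) {
	case TOY_PRINT_C:   return print_c(f, fsm);
	case TOY_PRINT_DOT: return print_dot(f, fsm);
	}

	return -1; /* unknown language */
}
```

The per-language functions stay private, so anything common to all of them (validation here) has exactly one place to live.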
This cuts down a lot of repetition for the various print functions, and also helps make clear what they actually use.
This drops information about which endids match. My intention is to output that later, but I wanted to think about one thing at a time, because this gets confusing. Here I'm making things worse before they get better.

This applies for the codegen that doesn't use the endleaf callbacks. I'm going to call these "default" output.

One observation here is that not all FSMs will have endids attached, and we still need to indicate success/failure. So that's why I'm not using the populated endids to indicate a match.
I've carried over only the fields we actually use to the vm ir struct. This is a bit brutal; I'm trying to avoid sacrificing functionality, but when it comes to a choice, at the moment I care more about keeping things separate than about not cutting features. However, the only thing I think I dropped is the state numbering for -l vmdot output.
…accept hooks.

The accept callback gets a set of end ids. These hooks default to outputting those ids verbatim, so you needn't override them per-program. I think that's clearer for users of the API.
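
To make the "default hook" idea concrete, here is a small sketch of an accept hook that receives a set of end ids and falls back to printing them verbatim when the caller supplies nothing. The hook type, `struct toy_hooks`, and `emit_accept()` are all hypothetical; libfsm's actual hook signatures may differ.

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical hook type; libfsm's real hook signatures may differ. */
typedef int (accept_hook)(FILE *f, const unsigned *ids, size_t count, void *opaque);

struct toy_hooks {
	accept_hook *accept; /* NULL means use the default */
	void *opaque;
};

/* Default: emit the ids verbatim, comma-separated. */
static int
default_accept(FILE *f, const unsigned *ids, size_t count, void *opaque)
{
	size_t i;

	(void) opaque;

	for (i = 0; i < count; i++) {
		if (fprintf(f, "%s%u", i > 0 ? ", " : "", ids[i]) < 0) {
			return -1;
		}
	}

	return 0;
}

/* Called by the (toy) code generator for each accepting state. */
static int
emit_accept(FILE *f, const struct toy_hooks *hooks, const unsigned *ids, size_t count)
{
	accept_hook *fn = (hooks != NULL && hooks->accept != NULL)
		? hooks->accept : default_accept;

	return fn(f, ids, count, hooks != NULL ? hooks->opaque : NULL);
}
```

A program that only wants the ids in its generated output can pass no hooks at all; only programs with special needs override `accept`.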

I ported the ambig enum over from (unmerged) work on rx(1). It controls how endid ambiguities are handled, and the codegen does something appropriate to each language's generated API. So that's now built in to libfsm rather than being implemented per-language per-program.

This gets us clearer handling for multiple id output per language, and I think it's also clearer about what's responsible for what, overall.
This is a step towards moving fsm_options to the print interfaces only.
The important part of the diff here is just:
```diff
--- a/src/libfsm/internal.h
+++ b/src/libfsm/internal.h
@@ -11,7 +11,6 @@
 #include <stdlib.h>

 #include <fsm/fsm.h>
-#include <fsm/options.h>

 #include <adt/common.h>

@@ -76,7 +75,6 @@ struct fsm {

        struct fsm_capture_info *capture_info;
        struct endid_info *endid_info;
-       const struct fsm_options *opt;
 };
```

Everything else is fallout from not needing to pass around the options struct. This leaves the options struct only used for the print routines.
…ugging.

Now that fsm_print() has a bunch of arguments, it's going to be annoying to call for debugging during development. fsm_dump() returns the same interface, but with some options set for compact output.

The output looks like this:

```
# src/re/main.c:1081 new
0; 1;

0  ->  1 "a" .. "c", "x"; # e.g. "a"

start: 0;
end:   1;
```
There are a couple of awkward spots here (in particular, calling fsm_endid_get() seems cumbersome for a user), but overall I think this came out really nicely. It does simplify a lot of the caller-side bookkeeping around tracking conflicts, and that's reflected in the diff removing program-specific data structures.
@katef katef requested a review from silentbicycle July 26, 2024 16:34
struct fsm_options;

/* a convenience for debugging */
void fsm_dump(FILE *f, const struct fsm *fsm,

👍


@silentbicycle silentbicycle left a comment


There are a lot of line changes here due to pervasive interface changes, and I didn't look at all of the language-specific formatting code that closely (in particular, I'm not that familiar with Go syntax), but the overall changes make sense to me.

@katef katef merged commit 62d3b3b into main Aug 4, 2024
@katef katef deleted the kate/ambig-mode branch August 4, 2024 15:23
silentbicycle added a commit that referenced this pull request Jul 3, 2025
Changing the leaf and endleaf callbacks to accept and reject in #485
broke lx. This commit passes through the necessary information to
restore the old behavior. It works in the happy path, but needs further
testing.

libfsm's normal execution mode evaluates a DFA, character by character,
terminating either when the next character isn't a valid edge or the end
of input is reached (in which case it checks end state metadata). lx's
execution mode is a little different, because it's tokenizing -- instead
of reading to the end of input, it should consume as much consecutive
input as matches a particular token, then yield the token type and
suspend.
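
The "normal" whole-input mode described in this paragraph can be sketched as follows. This is a hand-written toy DFA for the regex `ab*c`, not code generated by libfsm, and the names `dfa_step()`, `dfa_is_end()` and `dfa_match()` are hypothetical.

```c
#include <stdbool.h>

/*
 * Toy DFA for the regex "ab*c" (hypothetical; hand-written rather than
 * generated). Returns the next state, or -1 when there is no edge.
 */
static int
dfa_step(int state, char c)
{
	switch (state) {
	case 0: return c == 'a' ? 1 : -1;
	case 1: return c == 'b' ? 1 : c == 'c' ? 2 : -1;
	default: return -1; /* state 2 has no outgoing edges */
	}
}

static bool
dfa_is_end(int state)
{
	return state == 2;
}

/* Whole-input match: stop on a missing edge, else check the final state. */
static bool
dfa_match(const char *s)
{
	int state = 0;

	for (; *s != '\0'; s++) {
		state = dfa_step(state, *s);
		if (state == -1) {
			return false; /* next character isn't a valid edge */
		}
	}

	return dfa_is_end(state); /* end of input: consult end-state metadata */
}
```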

lx used to work by breaking abstraction and calling directly into
`fsm_print_cfrag` (overriding the leaf behavior to yield token types),
but when the callback interfaces shifted, its internals no longer fit
what lx expected. Now the reject hook is passed the same state
metadata as the accept hook, and the reject hook in lx checks whether
the end id is associated with a particular AST mapping and token type.

Further things to check:

- There's a special case for the end of input, because it can't unget
  the next character. Ensure a token at EOI is tagged correctly.

- It currently doesn't check if the state is an end state, just whether
  it has at least one endid. This shouldn't matter, but it should check.

- It also doesn't handle multiple end IDs. lx seems to report errors for
  inputs that can match multiple tokens, so this may be unreachable.

- It's only implemented for the "c" output format, but would probably
  be usable by others without a lot more work. In particular, kate
  mentioned it'd be good to be able to use vmc output for lx.
silentbicycle added a commit that referenced this pull request Jul 23, 2025
Changing the leaf and endleaf callbacks to accept and reject in #485
broke lx, but it went unnoticed for a while. This fixes it.

libfsm's normal execution mode evaluates a DFA, character by character,
terminating either when the next character isn't a valid edge or the end
of input is reached (in which case it checks end state metadata). lx's
execution mode is a little different, because it's tokenizing -- instead
of reading to the end of input, it should consume as much consecutive
input as matches a particular token, then push back the last character
read (so it can resume with it as context for the next token), yield the
token type, and suspend.
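
The tokenizing mode described in this paragraph can be sketched as a longest-match loop with push-back. This is a hand-written toy lexer for two token types, not lx's generated code; `struct lexer`, `lex_step()` and `next_token()` are hypothetical names, and the range checks assume an ASCII-compatible character set.

```c
#include <stddef.h>

enum tok { TOK_ERROR, TOK_EOF, TOK_IDENT, TOK_NUMBER };

struct lexer {
	const char *s; /* input */
	size_t pos;    /* next unread character */
};

/* Toy DFA: state 0 is the start, 1 accepts [a-z]+, 2 accepts [0-9]+. */
static int
lex_step(int state, char c)
{
	int lower = c >= 'a' && c <= 'z';
	int digit = c >= '0' && c <= '9';

	switch (state) {
	case 0: return lower ? 1 : digit ? 2 : -1;
	case 1: return lower ? 1 : -1;
	case 2: return digit ? 2 : -1;
	default: return -1;
	}
}

/* Consume the longest run matching one token, then yield and suspend. */
static enum tok
next_token(struct lexer *lx)
{
	int state = 0;

	if (lx->s[lx->pos] == '\0') {
		return TOK_EOF;
	}

	for (;;) {
		char c = lx->s[lx->pos++]; /* read one character */
		int next = (c == '\0') ? -1 : lex_step(state, c);

		if (next == -1) {
			lx->pos--; /* push it back for the next token */
			break;
		}

		state = next;
	}

	switch (state) {
	case 1: return TOK_IDENT;
	case 2: return TOK_NUMBER;
	default: return TOK_ERROR; /* nothing was consumed */
	}
}
```

Calling `next_token()` repeatedly on `"ab12"` yields an identifier, then a number, then end of input. On an error the position is left unchanged, so a real caller would need to skip or abort rather than retry.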

lx used to work by breaking abstraction and calling directly into
`fsm_print_cfrag` (overriding the leaf behavior to yield token types,
and adding an extra 'NONE' state to the generated state machine code),
but when the callback interfaces shifted, its internals no longer fit
what lx expected. Now the reject hook is passed the same state
metadata as the accept hook, and the reject hook in lx checks whether
the end id is associated with a particular AST mapping and token type.

This is only implemented for the "c" output format, but similar changes
could possibly make others usable without a lot more work. In
particular, kate mentioned it'd be good to be able to use "vmc" output
instead of "c" moving forward.

Most of the code changes happen inside of lx's code generation, but
there are a few elsewhere:

- The reject hook now has a state_metadata pointer, so update the
  callers for all the output formats.

- libfsm's 'c' output now includes a macro `FSM_ADVANCE_HOOK(C)`,
  which is called with the next character read in the FSM_IO_STR and
  FSM_IO_PAIR io modes immediately after advancing. This is used to inform
  lx's internal bookkeeping about token positions and buffering token
  names. FSM_IO_GETC doesn't need it, because its getc callback manages
  the character stream. The macro defaults to a no-op when undefined.

- libfsm's 'c' output also includes a flag, `has_consumed_input`, so the
  code expanded in place from the reject/accept hooks can determine when
  the state machine input handler loop has consumed any input. This was
  previously encoded by the extra NONE state.
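
A rough sketch of how the last two items could fit together. The macro name `FSM_ADVANCE_HOOK(C)` and the flag name `has_consumed_input` come from this commit message; everything else (the `advanced` counter, `toy_match()`, the result enum, and exactly where the flag is set) is a hypothetical illustration rather than libfsm's generated code.

```c
#include <stdbool.h>
#include <stddef.h>

/* A user such as lx would define the hook before the generated code. */
static size_t advanced; /* hypothetical bookkeeping: characters seen */
#define FSM_ADVANCE_HOOK(C) do { (void) (C); advanced++; } while (0)

/* What generated code could do: fall back to a no-op if undefined. */
#ifndef FSM_ADVANCE_HOOK
#define FSM_ADVANCE_HOOK(C) do { (void) (C); } while (0)
#endif

enum toy_result { TOY_EMPTY, TOY_REJECT, TOY_ACCEPT };

/* Hypothetical matcher for "a+" over a string, in the style of FSM_IO_STR. */
static enum toy_result
toy_match(const char *s)
{
	bool has_consumed_input = false;

	while (*s != '\0') {
		char c = *s++;

		FSM_ADVANCE_HOOK(c);       /* immediately after advancing */
		has_consumed_input = true; /* assumption: set once a char is read */

		if (c != 'a') {
			return TOY_REJECT; /* a reject hook would expand here */
		}
	}

	/* end of input: the flag separates "nothing read" from a real match */
	return has_consumed_input ? TOY_ACCEPT : TOY_EMPTY;
}
```

The flag plays the role the extra NONE state used to play: code at the accept and reject sites can tell an empty read apart from one that consumed input without the state machine carrying a dedicated state for it.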

lx's code generation using this flag is a bit cluttered, because the
reject hook doesn't know whether it's expanding for the end states, but
it's probably not worth changing the reject hook type signature to add
another flag. This results in checks for has_consumed_input in code
paths where trivial static analysis would show it to be dead code, and
some extra unreachable code at the end of the function.