Conversation

@silentbicycle
Collaborator

@silentbicycle silentbicycle commented Jul 3, 2025

Changing the leaf and endleaf callbacks to accept and reject in #485 broke lx, but it went unnoticed for a while. This fixes it.

libfsm's normal execution mode evaluates a DFA character by character, terminating either when the next character isn't a valid edge or when the end of input is reached (in which case it checks end state metadata). lx's execution mode is a little different because it's tokenizing: instead of reading to the end of input, it should consume as much consecutive input as matches a particular token, then push back the last character read (so it can resume with it as context for the next token), yield the token type, and suspend.
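As a rough illustration of the tokenizing mode described above (consume the longest run for one token, push back the character that ended it, yield, suspend), here is a minimal hand-written sketch. It is not lx's generated code; every name in it (`sk_next`, `sk_getc`, `sk_ungetc`, the token enum) is hypothetical.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical token types, for illustration only. */
enum sk_tok { SK_EOF, SK_IDENT, SK_NUMBER, SK_UNKNOWN };

/* Hypothetical input cursor over a NUL-terminated string. */
struct sk_input {
	const char *s;
	size_t pos;
};

/* Read one character, or -1 at end of input. */
static int sk_getc(struct sk_input *in) {
	unsigned char c = (unsigned char) in->s[in->pos];
	if (c == '\0') { return -1; }
	in->pos++;
	return c;
}

/* Push back the most recently read character. */
static void sk_ungetc(struct sk_input *in) {
	assert(in->pos > 0);
	in->pos--;
}

/* Consume the longest run matching one token, then suspend. */
static enum sk_tok sk_next(struct sk_input *in) {
	int c = sk_getc(in);
	if (c == -1) { return SK_EOF; }

	if (isalpha(c)) {
		while ((c = sk_getc(in)) != -1 && isalnum(c)) { }
		if (c != -1) { sk_ungetc(in); } /* keep it for the next token */
		return SK_IDENT;
	}
	if (isdigit(c)) {
		while ((c = sk_getc(in)) != -1 && isdigit(c)) { }
		if (c != -1) { sk_ungetc(in); }
		return SK_NUMBER;
	}
	return SK_UNKNOWN;
}
```

Each call to `sk_next` resumes where the previous one stopped, which is why the character that terminated a token has to be pushed back rather than dropped.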

lx used to work by breaking abstraction and calling directly into fsm_print_cfrag (overriding the leaf behavior to yield token types, and adding an extra 'NONE' state to the generated state machine code), but when the callback interfaces shifted, its internals no longer fit what lx expected. Now the reject hook is passed the same state metadata as the accept hook, and the reject hook in lx checks whether the end id is associated with a particular AST mapping and token type.
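The lookup the reject hook performs could look roughly like the sketch below. The `end_ids` and `end_id_count` field names follow initializers quoted later in this thread; the struct name, the mapping table, and the token enum are assumptions made up for the example, and the real hook emits code rather than returning a value.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the metadata handed to both hooks. */
struct rj_state_metadata {
	const unsigned *end_ids;
	size_t end_id_count;
};

/* Hypothetical token types. */
enum rj_tok { RJ_TOK_UNKNOWN, RJ_TOK_IDENT, RJ_TOK_NUMBER };

/* Hypothetical table associating an end id with a token type. */
struct rj_mapping {
	unsigned end_id;
	enum rj_tok tok;
};

/* Return the token for the first end id that has a mapping,
 * or RJ_TOK_UNKNOWN if none of the state's end ids are mapped. */
static enum rj_tok
rj_reject(const struct rj_state_metadata *md,
	const struct rj_mapping *maps, size_t map_count)
{
	size_t i, m;
	assert(md != NULL);
	for (i = 0; i < md->end_id_count; i++) {
		for (m = 0; m < map_count; m++) {
			if (maps[m].end_id == md->end_ids[i]) {
				return maps[m].tok;
			}
		}
	}
	return RJ_TOK_UNKNOWN;
}
```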

This is only implemented for the "c" output format, but similar changes could possibly make others usable without a lot more work. In particular, kate mentioned it'd be good to be able to use "vmc" output instead of "c" moving forward.

Most of the code changes happen inside of lx's code generation, but there are a few elsewhere:

  • The reject hook now has a state_metadata pointer, so update the callers for all the output formats.

  • libfsm's 'c' output now includes a macro FSM_ADVANCE_HOOK(C), which is called with the next character read in the FSM_IO_STR and FSM_IO_PAIR io modes immediately after advancing. This is used to inform lx's internal bookkeeping about token positions and buffering token names. FSM_IO_GETC doesn't need it, because its getc callback manages the character stream. The macro defaults to a no-op when undefined.

  • libfsm's 'c' output also includes a flag, has_consumed_input, so the code expanded in place from the reject/accept hooks can determine when the state machine input handler loop has consumed any input. This was previously encoded by the extra NONE state.
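To make the last two items concrete, here is a small sketch of the pattern as described in this comment: a hook macro that falls back to a no-op when the includer doesn't define it, and a local flag recording whether the input loop ran at all. Only the names `FSM_ADVANCE_HOOK` and `has_consumed_input` come from the description; the matcher and the counter are hypothetical. The review further down this thread replaces the macro with a hook, so treat this as illustrating the initial approach only.

```c
#include <assert.h>
#include <stddef.h>

/* Caller-side bookkeeping (hypothetical): count characters advanced. */
static size_t adv_count;
#define FSM_ADVANCE_HOOK(C) do { (void) (C); adv_count++; } while (0)

/* Generated code would fall back to a no-op if the macro is undefined. */
#ifndef FSM_ADVANCE_HOOK
#define FSM_ADVANCE_HOOK(C) do { (void) (C); } while (0)
#endif

/* Hypothetical matcher for [a-z]+ over a string. Returns 1 on match and
 * 0 on reject; *consumed reports whether any input was consumed. */
static int adv_match_lower(const char *s, int *consumed) {
	int has_consumed_input = 0;
	const char *p;
	assert(s != NULL && consumed != NULL);
	for (p = s; *p != '\0'; p++) {
		int c = (unsigned char) *p;
		FSM_ADVANCE_HOOK(c); /* called right after advancing */
		has_consumed_input = 1;
		if (c < 'a' || c > 'z') {
			*consumed = has_consumed_input;
			return 0;
		}
	}
	*consumed = has_consumed_input;
	return has_consumed_input; /* empty input never matches [a-z]+ */
}
```

With empty input the loop body never runs, so the flag stays 0; that is the case the extra NONE state used to encode.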

lx's code generation using this flag is a bit cluttered, because the reject hook doesn't know whether it's expanding for the end states, but it's probably not worth changing the reject hook type signature to add another flag. This results in checks for has_consumed_input in code paths where trivial static analysis would show it to be dead code, and some extra unreachable code at the end of the function.

Also:

  • Re-enable the lxpos tests (using getc, str, and pair), and fix a couple things in their build config.
  • Make lx's -l dump output call lx->free when necessary.
  • Make sure the makefile test targets check *res* not res*, because some result files are prefixed.
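The difference between the two globs can be checked outside the makefile with POSIX `fnmatch`, which follows the same shell-style pattern rules. This is only a sketch; `glob_matches` is a hypothetical helper, and the file name is the prefixed result file mentioned in this PR.

```c
#include <assert.h>
#include <fnmatch.h>
#include <stddef.h>

/* Returns 1 if the shell-style pattern matches the file name. */
static int glob_matches(const char *pattern, const char *name) {
	assert(pattern != NULL && name != NULL);
	return fnmatch(pattern, name, 0) == 0;
}
```

`glob_matches("res*", "dyn-fdgetc-getc-res0")` is false because `res*` only matches names that start with `res`, while `*res*` matches both prefixed and unprefixed result files.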

There may still be combinations of lx output modes and other flags that get warnings for unused functions or otherwise have a broken build, but I tested getc, str, and pair, with dyn or fixed buffering, and with or without -x buf -x pos. All the combinations tested by the lxpos build work.

print/ir.h only depended on internal.h for FSM_SIGMA_COUNT in `struct
ir_state_table`, and IR_TABLE isn't actually implemented, so
the dependency can be removed for now.

lx/print/c.c only needed internal.h because of print/ir.h, and
because of several direct accesses to fsm->statecount, which are
easily replaced by calls to fsm_countstates.
@silentbicycle silentbicycle requested a review from katef July 3, 2025 18:47
Checking only res* will miss failures in prefixed res files, such as

    build/tests/lxpos/dyn-fdgetc-getc-res0
Instead of having the EOF token occupy the same byte, line, and column
position as the last token, it should immediately follow.

The new lx codegen behaves this way, and katef and I decided that it
made sense to keep it like that, as long as it's consistent.
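A sketch of what "immediately follow" means for the three position fields, assuming a 0-based byte offset and 1-based line and column (those conventions, and all names here, are assumptions for the example rather than lx's actual representation):

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical position record: byte offset, line, and column. */
struct ps_pos { unsigned byte, line, col; };

/* Advance a position past one character. */
static void ps_advance(struct ps_pos *p, int c) {
	p->byte++;
	if (c == '\n') {
		p->line++;
		p->col = 1;
	} else {
		p->col++;
	}
}

/* Start position of the EOF token for a whole input: one past the
 * last character read, rather than on top of the last token. */
static struct ps_pos ps_eof_pos(const char *s) {
	struct ps_pos p = { 0, 1, 1 };
	assert(s != NULL);
	for (; *s != '\0'; s++) {
		ps_advance(&p, (unsigned char) *s);
	}
	return p;
}
```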
Add `${LX}` as a dependency for the targets using it.

Remove the `getcio=${io}` and `io=${io}` arguments to cat. Those may be
a merge error? They just produce a warning.
@silentbicycle silentbicycle force-pushed the sv/fix-lx-token-identification branch from 7cb369e to 862a68c Compare July 23, 2025 14:57
Some of the CI test matrix builds set LX to 'true; echo lx', but that
obviously won't work for tests that actually need to run lx in order to
exercise its output.
@silentbicycle silentbicycle force-pushed the sv/fix-lx-token-identification branch from 012cce4 to affca78 Compare July 23, 2025 15:45
@silentbicycle silentbicycle marked this pull request as ready for review July 23, 2025 17:04
.if make(test)
.END::
- grep FAIL ${BUILD}/tests/*/res*; [ $$? -ne 0 ]
+ grep FAIL ${BUILD}/tests/*/*res*; [ $$? -ne 0 ]
Collaborator Author

This may have been causing test failures to go unnoticed before.

TEST_OUTDIR.tests/lxpos = ${BUILD}/tests/lxpos

- LX?=${BUILD}/bin/lx
+ LX_BIN?=${BUILD}/bin/lx
Collaborator Author

The CI test build sets LX='true; echo lx' and that breaks these tests.

Owner

makes sense to rename the variable just for lx's own tests, okay


- ${BUILD}/tests/lxpos/${buf}-${getc}-${io}-lexer.${ext}: tests/lxpos/lexer.lx
- ${LX} -l ${ext} ${LX_CFLAGS} ${LX_CFLAGS.tests/lxpos/${buf}-${getc}-${io}-lexer.lx} < ${.ALLSRC:M*.lx} > $@ \
+ ${BUILD}/tests/lxpos/${buf}-${getc}-${io}-lexer.${ext}: tests/lxpos/lexer.lx ${LX_BIN}
Collaborator Author

There was potentially a build race condition here -- it wasn't ensuring lx was actually built before trying to use it.

* input loop was skipped it would still be NONE. */
fprintf(f, "\tint has_consumed_input = 0;\n");

/* For FSM_IO_STR and FSM_IO_PAIR, define a macro that will be
Collaborator Author

@silentbicycle Jul 23, 2025

This macro is a bit ugly, but much simpler than changing normal code generation to instantiate some kind of character iterator, just so lx and libfsm are on the same page about advancing and pushing back the character stream.

Do you think we should put "LX" in the macro name?

Owner

This macro rubs me the wrong way too. There's gotta be a nicer way. Maybe we could have lx emit a call to %sadvance_end() somewhere (in a new hook, e.g. at the end of the generated fsm loop) instead.

Collaborator Author

I'll change it to an advance hook in struct fsm_hooks.

struct ir_state_table {
unsigned to[FSM_SIGMA_COUNT];
} *table;
int not_yet_implemented;
Collaborator Author

Nothing uses this yet, but referencing FSM_SIGMA_COUNT here made the ir.h header depend on internal.h.

fprintf(stderr, " fgetc");
}

if (opt->comments) {
Collaborator Author

I added these comments because I found the multiple layers of generated *getc functions with similar names very confusing.

src/lx/print/c.c Outdated
fprintf(f, "\tassert(lx->p != NULL);\n");
/* FIXME: This should distinguish between alloc failure
* and EOF, but will require layers of interface changes. */
fprintf(f, "\tif (!lx_advance_end(lx, c)) { return EOF; }\n");
Collaborator Author

@silentbicycle Jul 23, 2025

lx_advance_end can fail if lx->push has an allocation failure, but lx_getc either returns a character or EOF. It should probably signal the error more obviously, but what do you think the interface should look like? This only applies in dynamic buffering mode, of course.

Previously the calls to lx->push were injected into the body of the zone function, so it failing could directly return TOK_ERROR. We could add an error flag on the lx handle and check that after the zone's character+state switching, or we could leave that for now and make sure it's handled properly in the vmc codegen. Thoughts?
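One of the options floated above, an error flag on the handle, might look something like this. It is purely a sketch of the idea under discussion, not what was merged: the struct, the `alloc_failed` field, the `max_cap` knob used to simulate an allocation failure, and the function names are all hypothetical.

```c
#include <assert.h>
#include <stdio.h>  /* EOF */
#include <stdlib.h> /* realloc, size_t */

/* Hypothetical handle: growable token buffer plus a sticky error flag. */
struct eh_lx {
	const char *in;
	char *buf;
	size_t len, cap;
	size_t max_cap;   /* simulate allocation failure beyond this size */
	int alloc_failed; /* set when a push fails */
};

/* Append one character; returns 0 on allocation failure. */
static int eh_push(struct eh_lx *lx, char c) {
	if (lx->len == lx->cap) {
		size_t ncap = lx->cap == 0 ? 4 : lx->cap * 2;
		char *nbuf;
		if (ncap > lx->max_cap) { return 0; }
		nbuf = realloc(lx->buf, ncap);
		if (nbuf == NULL) { return 0; }
		lx->buf = nbuf;
		lx->cap = ncap;
	}
	lx->buf[lx->len++] = c;
	return 1;
}

/* Still returns a character or EOF, but records why it stopped. */
static int eh_getc(struct eh_lx *lx) {
	int c;
	assert(lx != NULL);
	if (*lx->in == '\0') { return EOF; }
	c = (unsigned char) *lx->in++;
	if (!eh_push(lx, (char) c)) {
		lx->alloc_failed = 1;
		return EOF;
	}
	return c;
}
```

The caller checks `alloc_failed` after seeing EOF, so the getc signature stays unchanged while the two outcomes remain distinguishable.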


- /* TODO: prerequisite that the FSM is a DFA */
+ /* prerequisite that the FSM is a DFA */
assert(fsm_all(z->fsm, fsm_isdfa));
Collaborator Author

implemented this

@silentbicycle
Collaborator Author

I added some negative tests for truncated/corrupt input getting error messages and found that the block comment in in3.txt isn't reported as an error if the */ is missing. I will add the tests to the PR once I have that fixed.

Add a couple test cases (in8-10.txt) with an unexpected end of input,
either in the middle of a pattern, or after matching the first pattern
in a .. pair, but without matching the second.

Supporting this changes the expected result for in6.txt: Previously it
resulted in TOK_EOF, now it leads to TOK_UNKNOWN and produces a
"lexically uncategorised" error message for the unexpected end of input.
This change is necessary for fixing #386 / #508, and more generally
to detect things like unterminated string literals.
@silentbicycle
Collaborator Author

I called it out in the commit log for ae53e94, but this changes the result for tests/lxpos/in6.txt. Now it explicitly fails tokenization when there's an EOF before matching the second pattern in e.g. '// .. /\n/ -> $nl, rather than just failing to yield TOK_NL but still yielding TOK_EOF.

This was previously ending up with a useless call to the current zone
after returning the token ("case S1: return TOK_UNKNOWN; lx->z(lx);"),
which led to a warning in CI [-Werror=implicit-fallthrough=].
@silentbicycle
Collaborator Author

I just pushed a couple more commits, which fix issues in the generated code when lx -e is used to override the default API prefix.

These may or may not be called, depending on the input.
In some cases this was hardcoding "lx_" in the generated code, which
could lead to build failures if 'lx -e' was used to override the default
prefix.
@silentbicycle silentbicycle force-pushed the sv/fix-lx-token-identification branch from 930c7bd to 1e55db8 Compare July 31, 2025 18:21
fprintf(f, "%sungetc(lx, %s); ", prefix.api, cur_char_var);
if (pop && (~api_exclude & API_POS)) {
fprintf(f, "%s%spop(lx->buf_opaque); ",
prefix.api, buf_op_prefix());
Owner

sensible, okay

Owner

What to do here when api_tokbuf is 0? calling buf_op_prefix() is wrong. but does emitting ungetc() make sense at all then? probably not. I think we shouldn't emit the call

Collaborator Author

Should be fixed in 08fd72c. That should have checked API_BUF rather than API_POS, and rechecking every combination of -k and -x buf/pos options found a few other warnings for unused functions/arguments that are now fixed.
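For readers unfamiliar with the `~api_exclude & FLAG` idiom in the quoted code: a feature is enabled when its bit is not set in the exclusion mask. The sketch below shows the corrected check against the buffer bit. The bit values and helper names are assumptions; only the names API_BUF and API_POS come from the thread.

```c
#include <assert.h>

/* Hypothetical bit values for the exclusion mask. */
enum { SK_API_BUF = 1 << 0, SK_API_POS = 1 << 1 };

/* A feature is enabled when its bit is NOT set in the exclusion mask. */
static int sk_enabled(unsigned api_exclude, unsigned feature) {
	return (~api_exclude & feature) != 0;
}

/* Emit the buffer pop only when buffering itself is enabled;
 * whether position tracking is enabled is irrelevant here. */
static int sk_should_emit_pop(unsigned api_exclude, int pop) {
	return pop && sk_enabled(api_exclude, SK_API_BUF);
}
```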

fprintf(f, "lx->z = z%u, ", zindexof(ast, m->to));
fprintf(f, "%s", prefix.tok);
esctok(f, m->token->s);
}
Owner

okay, cool

unget_character(f, true, env->cur_char_var);
} else if (m->token != NULL && m->to != NULL) {
unget_character(f, true, env->cur_char_var);
}
Owner

whew. you're thinking of match in rust here huh

Collaborator Author

@silentbicycle Aug 4, 2025

I was making the cases explicit while I figured out the correct behavior. They're almost all handled the same way now though, so I could collapse that logic a bit. Does that seem worth doing? This module should get replaced with the vmc codegen soon.

Owner

Seems fine

Owner

uh i should translate "fine" from british to american: i mean i like it

* additional 'NONE' state. Inside the input loop, the default
* state of NONE would be updated to the start state, but if the
* input loop was skipped it would still be NONE. */
fprintf(f, "\tint has_consumed_input = 0;\n");
Owner

Not a fan of this being in the general purpose codegen. Do I have better ideas? I don't think so. Maybe it could live in the input buffer code, but I don't like that much either.

Collaborator Author

I'll see if I can handle this via the advance hook I just added in 4a5ca84, rather than the libfsm print layer.

Collaborator Author

In 051aaf0 I moved this to the advance hook and lx's print interface, so it's not in the general purpose codegen anymore.

Owner

this makes me so happy, thank you!

.end_ids = ir->states[state_id].endids.ids,
.end_id_count = ir->states[state_id].endids.count,
};

Owner

@katef Aug 2, 2025

why not pass whatever type &ir->states[state_id].endids is?

Collaborator Author

struct ir_state_endids is defined in an internal ir.h header, and an upcoming PR is going to add another pair of fields to the state metadata struct, for eager output IDs and their count.

This avoids cluttering libfsm's print output with `has_consumed_input`,
which is specific to lx.
Tested with every combination of (dyn+fgetc, fixed+fgetc, pair, str),
with and without '-x buf', '-x pos', or both.
Owner

@katef katef left a comment

<3

@katef katef merged commit c897e9d into main Aug 20, 2025
346 checks passed
@silentbicycle silentbicycle deleted the sv/fix-lx-token-identification branch August 20, 2025 15:37
silentbicycle added a commit that referenced this pull request Aug 26, 2025
PR #509 introduced a bug: It didn't distinguish between an unexpected
end of input and an end of input in a zone that matches but ignores its
input. This caused several lxpos tests to fail due to getting a
TOK_UNKNOWN rather than a TOK_EOF when the input has trailing
whitespace, but I didn't notice until after merging because the normal
build doesn't regenerate the code for src/lx/lexer.lx or
src/libfsm/lexer.lx. (I had ensured all the libre dialect lexers and
parsers were regenerated, but missed those.)

Instead of always printing TOK_UNKNOWN, this inspects the zone
mappings to determine whether the current end ID represents a dead end
for the zone. If not, it should instead print TOK_EOF.
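
The decision described here can be sketched as a small lookup. In lx this choice is made while generating code, and how a mapping is judged to be a dead end is not spelled out in this message, so the `dead_end` field below is an assumed stand-in for that analysis; all names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical token types. */
enum ez_tok { EZ_TOK_EOF, EZ_TOK_UNKNOWN };

/* Hypothetical zone mapping: the end id it carries, and whether end of
 * input at that point leaves a pattern unfinished (a dead end). */
struct ez_mapping {
	unsigned end_id;
	int dead_end;
};

/* Pick the token to yield when input ends at the given end id. */
static enum ez_tok
ez_tok_at_end_of_input(const struct ez_mapping *maps, size_t n,
	unsigned end_id)
{
	size_t i;
	assert(maps != NULL || n == 0);
	for (i = 0; i < n; i++) {
		if (maps[i].end_id == end_id) {
			return maps[i].dead_end ? EZ_TOK_UNKNOWN : EZ_TOK_EOF;
		}
	}
	return EZ_TOK_UNKNOWN; /* no mapping: treat as unexpected */
}
```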
katef added a commit that referenced this pull request Aug 29, 2025
…by-509

lx: Distinguish between unexpected EOF and EOF in ignored zones (broken by #509)