Fix lx token identification #509

silentbicycle · 2025-07-03T18:47:16Z

Changing the leaf and endleaf callbacks to accept and reject in #485 broke lx, but it went unnoticed for a while. This fixes it.

libfsm's normal execution mode evaluates a DFA, character by character, terminating either when the next character isn't a valid edge or the end of input is reached (in which case it checks end state metadata). lx's execution mode is a little different, because it's tokenizing -- instead of reading to the end of input, it should consume as much consecutive input as matches a particular token, then push back the last character read (so it can resume with it as context for the next token), yield the token type, and suspend.
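The tokenizing loop described above can be sketched roughly as follows. This is a minimal hand-written illustration, not lx's generated code: the `sketch_*` names, the token enum, and the string cursor are all assumptions made for the example.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical token types for this sketch; not lx's generated names. */
enum sketch_tok { SK_EOF, SK_IDENT, SK_NUMBER, SK_UNKNOWN };

/* Hypothetical string cursor that supports pushing back one character. */
struct sketch_input {
	const char *s;
	size_t pos;
};

/* Read the next character, or -1 at the end of the string. */
static int sketch_getc(struct sketch_input *in)
{
	unsigned char c = (unsigned char) in->s[in->pos];
	if (c == '\0') { return -1; }
	in->pos++;
	return c;
}

/* Push back the character that was just read. */
static void sketch_ungetc(struct sketch_input *in)
{
	assert(in->pos > 0);
	in->pos--;
}

/* Yield one token and suspend: consume the longest run that matches,
 * then push back the first character that does not belong to it. */
static enum sketch_tok sketch_next(struct sketch_input *in)
{
	int c;

	/* skip spaces between tokens */
	do { c = sketch_getc(in); } while (c == ' ');
	if (c == -1) { return SK_EOF; }

	if (isalpha(c)) {
		while ((c = sketch_getc(in)) != -1 && isalnum(c)) { }
		if (c != -1) { sketch_ungetc(in); }
		return SK_IDENT;
	}
	if (isdigit(c)) {
		while ((c = sketch_getc(in)) != -1 && isdigit(c)) { }
		if (c != -1) { sketch_ungetc(in); }
		return SK_NUMBER;
	}
	return SK_UNKNOWN;
}
```

Each call to `sketch_next` resumes where the previous one stopped, so the pushed-back character becomes the first character considered for the next token.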

lx used to work by breaking abstraction and calling directly into fsm_print_cfrag (overriding the leaf behavior to yield token types, and adding an extra 'NONE' state to the generated state machine code), but when the callback interfaces shifted, its internals no longer fit what lx expected. Now the reject hook is passed the same state metadata as the accept hook, and the reject hook in lx checks whether the end id is associated with a particular AST mapping and token type.

This is only implemented for the "c" output format, but similar changes could possibly make others usable without a lot more work. In particular, kate mentioned it'd be good to be able to use "vmc" output instead of "c" moving forward.

Most of the code changes happen inside of lx's code generation, but there are a few elsewhere:

- The reject hook now has a state_metadata pointer, so update the callers for all the output formats.
- libfsm's 'c' output now includes a macro `FSM_ADVANCE_HOOK(C)`, which is called with the next character read in the FSM_IO_STR and FSM_IO_PAIR io modes immediately after advancing. This is used to inform lx's internal bookkeeping about token positions and buffering token names. FSM_IO_GETC doesn't need it, because its getc callback manages the character stream. The macro defaults to a no-op when undefined.
- libfsm's 'c' output also includes a flag, `has_consumed_input`, so the code expanded in place from the reject/accept hooks can determine when the state machine input handler loop has consumed any input. This was previously encoded by the extra NONE state.
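As a rough illustration of the last two items, the pattern of an overridable macro that defaults to a no-op, combined with a flag recording whether the input loop ran at all, might look like this. Only `FSM_ADVANCE_HOOK(C)` and `has_consumed_input` come from the description; the toy matcher, the counter, and the result enum are assumptions for the sketch, not libfsm's actual generated code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical bookkeeping, standing in for lx's position tracking. */
static size_t sketch_chars_seen;

/* The including code defines the hook before the matcher.
 * Delete this definition to fall back to the no-op default below. */
#define FSM_ADVANCE_HOOK(C) ((void) (C), sketch_chars_seen++)

/* Default used when nothing has defined the hook. */
#ifndef FSM_ADVANCE_HOOK
#define FSM_ADVANCE_HOOK(C) ((void) (C))
#endif

/* Hypothetical result codes for this sketch. */
enum sketch_result { SK_NO_INPUT, SK_REJECT, SK_ACCEPT };

/* Toy matcher for /a+/ over a NUL-terminated string. */
static enum sketch_result sketch_match(const char *s)
{
	int has_consumed_input = 0;
	const char *p;

	assert(s != NULL);
	for (p = s; *p != '\0'; p++) {
		FSM_ADVANCE_HOOK(*p); /* called for each character read */
		has_consumed_input = 1;
		if (*p != 'a') { return SK_REJECT; }
	}

	/* The flag distinguishes "loop never ran" from a real accept,
	 * which an extra NONE state would otherwise have to encode. */
	return has_consumed_input ? SK_ACCEPT : SK_NO_INPUT;
}
```

The flag costs one local variable, whereas the NONE state it replaces had to be threaded through the generated state machine.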

lx's code generation using this flag is a bit cluttered, because the reject hook doesn't know whether it's expanding for the end states, but it's probably not worth changing the reject hook type signature to add another flag. This results in checks for has_consumed_input in code paths where trivial static analysis would show it to be dead code, and some extra unreachable code at the end of the function.

Also:

- Re-enable the lxpos tests (using getc, str, and pair), and fix a couple things in their build config.
- Make lx's -l dump output call lx->free when necessary.
- Make sure the makefile test targets check *res* not res*, because some result files are prefixed.

There may still be combinations of lx output modes and other flags that get warnings for unused functions or otherwise have a broken build, but I tested getc, str, and pair, with dyn or fixed buffering, and with or without -x buf -x pos. All the combinations tested by the lxpos build work.

print/ir.h only depended on internal.h for FSM_SIGMA_COUNT in `struct ir_state_table`, and IR_TABLE isn't actually implemented, so this can be removed for now. lx/print/c.c only needed internal.h because of print/ir.h, and because of several direct accesses to fsm->statecount, which are easily replaced by calls to fsm_countstates.

Instead of having the EOF token occupy the same byte, line, and column position as the last token, it should immediately follow. The new lx codegen behaves this way, and katef and I decided that it made sense to keep it like that, as long as it's consistent.

silentbicycle · 2025-07-23T17:23:00Z

Makefile

 .if make(test)
 .END::
-	grep FAIL ${BUILD}/tests/*/res*; [ $$? -ne 0 ]
+	grep FAIL ${BUILD}/tests/*/*res*; [ $$? -ne 0 ]


This may have been causing test failures to go unnoticed before.
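The difference between the two globs can be checked with POSIX `fnmatch`, which applies shell-style pattern matching to a single name. This is only a sketch of the matching behaviour, assuming the file name from the commit message; the `sketch_glob_matches` helper is hypothetical and the Makefile itself relies on the shell's own expansion.

```c
#include <assert.h>
#include <fnmatch.h>
#include <stddef.h>

/* Returns 1 if the file name matches the shell-style glob, else 0. */
static int sketch_glob_matches(const char *pattern, const char *name)
{
	assert(pattern != NULL && name != NULL);
	return fnmatch(pattern, name, 0) == 0;
}
```

With this helper, `sketch_glob_matches("res*", "dyn-fdgetc-getc-res0")` is false while `sketch_glob_matches("*res*", "dyn-fdgetc-getc-res0")` is true, which is why a FAIL line in a prefixed result file was never seen by the old `grep`.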

silentbicycle · 2025-07-23T17:24:40Z

tests/lxpos/Makefile

 TEST_OUTDIR.tests/lxpos = ${BUILD}/tests/lxpos

-LX?=${BUILD}/bin/lx
+LX_BIN?=${BUILD}/bin/lx


The CI test build sets LX='true; echo lx' and that breaks these tests.

makes sense to rename the variable just for lx's own tests, okay

silentbicycle · 2025-07-23T17:25:16Z

tests/lxpos/Makefile


-${BUILD}/tests/lxpos/${buf}-${getc}-${io}-lexer.${ext}: tests/lxpos/lexer.lx
-	${LX} -l ${ext} ${LX_CFLAGS} ${LX_CFLAGS.tests/lxpos/${buf}-${getc}-${io}-lexer.lx} < ${.ALLSRC:M*.lx} > $@ \
+${BUILD}/tests/lxpos/${buf}-${getc}-${io}-lexer.${ext}: tests/lxpos/lexer.lx ${LX_BIN}


There was potentially a build race condition here -- it wasn't ensuring lx was actually built before trying to use it.

silentbicycle · 2025-07-23T17:26:53Z

src/libfsm/print/c.c

+	 * input loop was skipped it would still be NONE. */
+	fprintf(f, "\tint has_consumed_input = 0;\n");
+
+	/* For FSM_IO_STR and FSM_IO_PAIR, define a macro that will be


This macro is a bit ugly, but much simpler than changing normal code generation to instantiate some kind of character iterator, just so lx and libfsm are on the same page about advancing and pushing back the character stream.

Do you think we should put "LX" in the macro name?

This macro rubs me the wrong way too. There's gotta be a nicer way. Maybe we could have lx emit a call to %sadvance_end() somewhere (in a new hook, e.g. at the end of the generated fsm loop) instead.

I'll change it to an advance hook in struct fsm_hooks.

silentbicycle · 2025-07-23T17:28:09Z

src/libfsm/print/ir.h

-			struct ir_state_table {
-				unsigned to[FSM_SIGMA_COUNT];
-			} *table;
+			int not_yet_implemented;


Nothing uses this yet, but referencing FSM_SIGMA_COUNT here made the ir.h header depend on internal.h.

silentbicycle · 2025-07-23T17:30:14Z

src/lx/print/c.c

 			fprintf(stderr, " fgetc");
 		}

+		if (opt->comments) {


I added these comments because I found the multiple layers of generated *getc functions with similar names very confusing.

silentbicycle · 2025-07-23T17:32:39Z

src/lx/print/c.c

-		fprintf(f, "\tassert(lx->p != NULL);\n");
+		/* FIXME: This should distinguish between alloc failure
+		 * and EOF, but will require layers of interface changes. */
+		fprintf(f, "\tif (!lx_advance_end(lx, c)) { return EOF; }\n");


lx_advance_end can fail if lx->push has an allocation failure, but lx_getc either returns a character or EOF. It should probably signal the error more obviously, but what do you think the interface should look like? This only applies in dynamic buffering mode, of course.

Previously the calls to lx->push were injected into the body of the zone function, so it failing could directly return TOK_ERROR. We could add an error flag on the lx handle and check that after the zone's character+state switching, or we could leave that for now and make sure it's handled properly in the vmc codegen. Thoughts?
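One way the "error flag on the lx handle" option could look is sketched below. Everything here is hypothetical: the `sketch_lx` struct, its fields, and the `cap_limit` knob (used only to simulate an allocation failure) are assumptions for illustration and do not reflect lx's real handle or its generated functions.

```c
#include <assert.h>
#include <stdio.h>  /* EOF */
#include <stdlib.h>

/* Hypothetical handle: a string source plus a growable token buffer. */
struct sketch_lx {
	const char *src;
	size_t pos;
	char *buf;
	size_t len, cap;
	size_t cap_limit; /* growth past this simulates allocation failure */
	int error;        /* set when a push fails; EOF alone leaves it 0 */
};

/* Append one character to the token buffer; returns 0 on failure. */
static int sketch_push(struct sketch_lx *lx, char c)
{
	if (lx->len == lx->cap) {
		size_t ncap = lx->cap == 0 ? 4 : 2 * lx->cap;
		char *nbuf;
		if (ncap > lx->cap_limit) { return 0; }
		nbuf = realloc(lx->buf, ncap);
		if (nbuf == NULL) { return 0; }
		lx->buf = nbuf;
		lx->cap = ncap;
	}
	lx->buf[lx->len++] = c;
	return 1;
}

/* Returns a character or EOF, as before, but records why it stopped. */
static int sketch_getc(struct sketch_lx *lx)
{
	char c;

	assert(lx != NULL);
	c = lx->src[lx->pos];
	if (c == '\0') { return EOF; }
	lx->pos++;
	if (!sketch_push(lx, c)) {
		lx->error = 1;
		return EOF;
	}
	return (unsigned char) c;
}
```

The caller keeps the existing "character or EOF" interface and checks `error` after the zone's character and state switching, at the cost of one extra field on the handle.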

silentbicycle · 2025-07-23T17:35:57Z

src/lx/print/c.c


-	/* TODO: prerequisite that the FSM is a DFA */
+	/* prerequisite that the FSM is a DFA */
+	assert(fsm_all(z->fsm, fsm_isdfa));


implemented this

silentbicycle · 2025-07-29T14:01:01Z

I added some negative tests for truncated/corrupt input getting error messages and found that the block comment in in3.txt isn't reported as an error if the */ is missing. I will add the tests to the PR once I have that fixed.

Add a couple test cases (in8-10.txt) with an unexpected end of input, either in the middle of a pattern, or after matching the first pattern in a .. pair, but without matching the second. Supporting this changes the expected result for in6.txt: Previously it resulted in TOK_EOF, now it leads to TOK_UNKNOWN and produces a "lexically uncategorised" error message for the unexpected end of input. This change is necessary for fixing #386 / #508, and more generally to detect things like unterminated string literals.

silentbicycle · 2025-07-30T14:26:55Z

I called it out in the commit logs for ae53e94, but this changes the result for tests/lxpos/in6.txt. Now it explicitly fails tokenization when there's an EOF before matching the second pattern in e.g. `'// .. /\n/ -> $nl`, rather than just failing to yield `TOK_NL` but still yielding `TOK_EOF`.

silentbicycle · 2025-07-31T17:33:40Z

I just pushed a couple more commits, which fix issues in the generated code when lx -e is used to override the default API prefix.

katef · 2025-08-02T11:15:35Z

src/lx/print/c.c

+	fprintf(f, "%sungetc(lx, %s); ", prefix.api, cur_char_var);
+	if (pop && (~api_exclude & API_POS)) {
+		fprintf(f, "%s%spop(lx->buf_opaque); ",
+		    prefix.api, buf_op_prefix());


sensible, okay

What to do here when api_tokbuf is 0? calling buf_op_prefix() is wrong. but does emitting ungetc() make sense at all then? probably not. I think we shouldn't emit the call

Should be fixed in 08fd72c. That should have checked API_BUF rather than API_POS, and rechecking every combination of -k and -x buf/pos options found a few other warnings for unused functions/arguments that are now fixed.

katef · 2025-08-02T11:20:31Z

src/lx/print/c.c

+			fprintf(f, "lx->z = z%u, ", zindexof(ast, m->to));
+			fprintf(f, "%s", prefix.tok);
+			esctok(f, m->token->s);
+		}


katef · 2025-08-02T11:29:06Z

src/lx/print/c.c

+		  	unget_character(f, true, env->cur_char_var);
+		} else if (m->token != NULL && m->to != NULL) {
+		  	unget_character(f, true, env->cur_char_var);
+		}


whew. you're thinking of match in rust here huh

I was making the cases explicit while I figured out the correct behavior. They're almost all handled the same way now though, so I could collapse that logic a bit. Does that seem worth doing? This module should get replaced with the vmc codegen soon.

uh i should translate "fine" from british to american: i mean i like it

katef · 2025-08-02T14:58:00Z

src/libfsm/print/c.c

+	 * additional 'NONE' state. Inside the input loop, the default
+	 * state of NONE would be updated to the start state, but if the
+	 * input loop was skipped it would still be NONE. */
+	fprintf(f, "\tint has_consumed_input = 0;\n");


Not a fan of this being in the general purpose codegen. Do I have better ideas? I don't think so. Maybe it could live in the input buffer code, but I don't like that much either.

I'll see if I can handle this via the advance hook I just added in 4a5ca84, rather than the libfsm print layer.

In 051aaf0 I moved this to the advance hook and lx's print interface, it's not in the general purpose codegen anymore.

this makes me so happy, thank you!

katef · 2025-08-02T19:22:14Z

src/libfsm/print/c.c

+		.end_ids = ir->states[state_id].endids.ids,
+		.end_id_count = ir->states[state_id].endids.count,
+	};
+


why not pass whatever type &ir->states[state_id].endids is?

struct ir_state_endids is defined in an internal ir.h header, and an upcoming PR is going to add another pair of fields to the state metadata struct, for eager output IDs and their count.
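The flattening described in this reply can be sketched as below. The field names `end_ids` and `end_id_count` follow the diff above; the struct names, the `sketch_end_id` type, and both helper functions are assumptions for illustration rather than libfsm's actual definitions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical end id type for this sketch. */
typedef unsigned sketch_end_id;

/* Stand-in for the type kept in the internal ir.h header. */
struct sketch_ir_endids {
	const sketch_end_id *ids;
	size_t count;
};

/* Stand-in for the state metadata passed to the accept/reject hooks.
 * Later fields can be added here without exposing the internal type. */
struct sketch_state_metadata {
	const sketch_end_id *end_ids;
	size_t end_id_count;
};

/* Copy the pointer and count out of the internal representation. */
static struct sketch_state_metadata
sketch_metadata_from_ir(const struct sketch_ir_endids *endids)
{
	struct sketch_state_metadata md = {
		.end_ids = endids->ids,
		.end_id_count = endids->count,
	};
	return md;
}

/* What a hook might do: check whether a given end id is present. */
static int
sketch_has_end_id(const struct sketch_state_metadata *md, sketch_end_id id)
{
	size_t i;

	assert(md != NULL);
	for (i = 0; i < md->end_id_count; i++) {
		if (md->end_ids[i] == id) { return 1; }
	}
	return 0;
}
```

Hooks written against the flat struct never need to include the internal header, which is the decoupling the reply is arguing for.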

katef

<3

PR #509 introduced a bug: It didn't distinguish between an unexpected end of input and an end of input in a zone that matches but ignores its input. This caused several lxpos tests to fail due to getting a TOK_UNKNOWN rather than a TOK_EOF when the input has trailing whitespace, but I didn't notice until after merging because the normal build doesn't regenerate the code for src/lx/lexer.lx or src/libfsm/lexer.lx. (I had ensured all the libre dialect lexers and parsers were regenerated, but missed those.) Instead of always printing TOK_UNKNOWN, this inspects the zone mappings to determine whether the current end ID represents a dead end for the zone. If not, it should instead print TOK_EOF.

silentbicycle requested a review from katef July 3, 2025 18:47

silentbicycle added 4 commits July 23, 2025 09:48

Makefile: Check '*res*' not 'res*' for tests.

cb42d58

This will miss failures in prefixed res files, such as build/tests/lxpos/dyn-fdgetc-getc-res0

Re-enable lxpos tests.

862a68c

Add `${LX}` as a dependency for the targets using it. Remove the `getcio=${io}` and `io=${io}` arguments to cat. Those may be a merge error? They just produce a warning.

silentbicycle force-pushed the sv/fix-lx-token-identification branch from 7cb369e to 862a68c Compare July 23, 2025 14:57

Use $LX_BIN instead of $LX in lxpos makefile.

affca78

Some of the CI test matrix builds set LX to 'true; echo lx', but that obviously won't work for tests that actually need to run lx in order to exercise its output.

silentbicycle force-pushed the sv/fix-lx-token-identification branch from 012cce4 to affca78 Compare July 23, 2025 15:45

lx: Make -l dump's output call lx.free() when using dynamic buffer.

a8f0c59

silentbicycle marked this pull request as ready for review July 23, 2025 17:04

silentbicycle commented Jul 23, 2025

silentbicycle added 3 commits July 29, 2025 10:31

lx: Use prefix.tok, not "TOK_".

552aa01

lx: return TOK_ERROR if reaching the end of a zone function.

beebd1b

lx: Rewrite logic to make the four cases explicit, fix dead code.

ea9c90b

This was previously ending up with a useless call to the current zone after returning the token ("case S1: return TOK_UNKNOWN; lx->z(lx);"), which led to a warning in CI [-Werror=implicit-fallthrough=].

silentbicycle added 3 commits July 31, 2025 14:19

lx: Only gen fixedpop / dynpop & calls to them when buffer mode is set.

17d415d

lx: Suppress warning for possibly unused function.

7ed18b9

These may or may not be called, depending on the input.

lx: Ensure prefix.api & prefix.lx are used in the generated code.

1e55db8

In some cases this was hardcoding "lx_" in the generated code, which could lead to build failures if 'lx -e' was used to override the default prefix.
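The kind of fix this commit describes can be illustrated with a small code generator that always routes names through the configured prefixes. This is a sketch only: the `sketch_prefix` struct mirrors the `prefix.api` and `prefix.tok` fields seen in the diffs, but the emitted statement, the helper names, and the example prefix values are assumptions.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical subset of the prefix settings. */
struct sketch_prefix {
	const char *api; /* prefix for generated API functions */
	const char *tok; /* prefix for generated token names */
};

/* Emit a statement using the configured prefixes rather than string
 * literals. Returns 1 if the text fit in the output buffer. */
static int
sketch_emit_return(char *out, size_t size, const struct sketch_prefix *prefix,
    const char *cur_char_var, const char *token_name)
{
	int n;

	assert(out != NULL && prefix != NULL);
	n = snprintf(out, size, "%sungetc(lx, %s); return %s%s;",
	    prefix->api, cur_char_var, prefix->tok, token_name);
	return n >= 0 && (size_t) n < size;
}

/* True if the generated text still contains the given literal. */
static int
sketch_mentions(const char *generated, const char *literal)
{
	return strstr(generated, literal) != NULL;
}
```

A check like `sketch_mentions(generated, "lx_")` on output produced with an overridden prefix is a cheap way to catch a leftover hardcoded name before it turns into a build failure.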

silentbicycle force-pushed the sv/fix-lx-token-identification branch from 930c7bd to 1e55db8 Compare July 31, 2025 18:21

katef reviewed Aug 2, 2025

src/lx/print/c.c

fprintf(f, "lx->z = z%u, ", zindexof(ast, m->to));

fprintf(f, "%s", prefix.tok);

esctok(f, m->token->s);

}

okay, cool

silentbicycle added 4 commits August 4, 2025 10:52

Replace FSM_ADVANCE_HOOK macro with optional hooks->advance callback.

4a5ca84

The advance hook should also be called for FSM_IO_STR.

f25e8b7

Move setting has_consumed_input flag into lx's advance hook.

051aaf0

This avoids cluttering libfsm's print output with `has_consumed_input`, which is specific to lx.

lx: Avoid useless call to pop and some other 'unused' warnings.

08fd72c

Tested with every combination of (dyn+fgetc, fixed+fgetc, pair, str), with and without '-x buf', '-x pos', or both.

katef approved these changes Aug 20, 2025

katef merged commit c897e9d into main Aug 20, 2025
346 checks passed

silentbicycle deleted the sv/fix-lx-token-identification branch August 20, 2025 15:37

silentbicycle mentioned this pull request Aug 26, 2025

lx: Distinguish between unexpected EOF and EOF in ignored zones (broken by #509) #510

Merged

katef added a commit that referenced this pull request Aug 29, 2025

Merge pull request #510 from katef/sv/fix-lx-handling-for-EOF-broken-…

6c66234

…by-509 lx: Distinguish between unexpected EOF and EOF in ignored zones (broken by #509)
