Fix lx token identification #509
print/ir.h only depended on it for FSM_SIGMA_COUNT in `struct ir_state_table`, and IR_TABLE isn't actually implemented, so this can be removed for now. lx/print/c.c only needed internal.h because of print/ir.h, and because of several direct accesses to fsm->statecount, which are easily replaced by calls to fsm_countstates.
This will miss failures in prefixed res files, such as
build/tests/lxpos/dyn-fdgetc-getc-res0
Changing the leaf and endleaf callbacks to accept and reject in #485 broke lx, but it went unnoticed for a while. This fixes it.

libfsm's normal execution mode evaluates a DFA, character by character, terminating either when the next character isn't a valid edge or when the end of input is reached (in which case it checks end state metadata). lx's execution mode is a little different, because it's tokenizing: instead of reading to the end of input, it should consume as much consecutive input as matches a particular token, then push back the last character read (so it can resume with it as context for the next token), yield the token type, and suspend.

lx used to work by breaking abstraction and calling directly into `fsm_print_cfrag` (overriding the leaf behavior to yield token types, and adding an extra 'NONE' state to the generated state machine code), but when the callback interfaces shifted, its internals no longer fit what lx expected. Now the reject hook is passed the same state metadata as the accept hook, and the reject hook in lx checks whether the end id is associated with a particular AST mapping and token type.

This is only implemented for the "c" output format, but similar changes could possibly make others usable without a lot more work. In particular, kate mentioned it'd be good to be able to use "vmc" output instead of "c" moving forward.

Most of the code changes happen inside of lx's code generation, but there are a few elsewhere:

- The reject hook now has a state_metadata pointer, so update the callers for all the output formats.
- libfsm's 'c' output now includes a macro `FSM_ADVANCE_HOOK(C)`, which is called with the next character read in the FSM_IO_STR and FSM_IO_PAIR io modes immediately after advancing. This is used to inform lx's internal bookkeeping about token positions and buffering token names. FSM_IO_GETC doesn't need it, because its getc callback manages the character stream. The macro defaults to a no-op when undefined.
- libfsm's 'c' output also includes a flag, `has_consumed_input`, so the code expanded in place from the reject/accept hooks can determine whether the state machine input handler loop has consumed any input. This was previously encoded by the extra NONE state.

lx's code generation using this flag is a bit cluttered, because the reject hook doesn't know whether it's expanding for the end states, but it's probably not worth changing the reject hook type signature to add another flag. This results in checks for has_consumed_input in code paths where trivial static analysis would show it to be dead code, and some extra unreachable code at the end of the function.
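To make the tokenizing execution mode described above concrete, here is a minimal hand-written sketch. It is not lx's generated code: the `toy_*` names, the two-token grammar (lowercase words and decimal numbers, separated by spaces) and the string cursor are all assumptions made for illustration. It consumes the longest run that matches one token, pushes back the character that ended the run, and uses a `has_consumed_input` flag to tell "nothing left" apart from "a token just ended".

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical token types, for illustration only. */
enum toy_token { TOY_EOF, TOY_UNKNOWN, TOY_WORD, TOY_NUMBER };

/* Hypothetical lexer handle: a string cursor standing in for lx's input. */
struct toy_lexer {
	const char *p; /* next unread character */
};

static int is_lower(int c) { return c >= 'a' && c <= 'z'; }
static int is_digit(int c) { return c >= '0' && c <= '9'; }

/* Consume the longest run matching one token, push back the character
 * that ended it, yield the token type, and suspend until called again. */
static enum toy_token
toy_next(struct toy_lexer *lx)
{
	enum { S_START, S_WORD, S_NUMBER } state = S_START;
	int has_consumed_input = 0;

	assert(lx != NULL && lx->p != NULL);

	while (*lx->p == ' ') { lx->p++; } /* skip separators between tokens */

	for (;;) {
		int c = (unsigned char) *lx->p;
		if (c == '\0') { break; } /* end of input: check the end state below */
		lx->p++; /* advance */

		if (state == S_START && is_lower(c)) { state = S_WORD; }
		else if (state == S_START && is_digit(c)) { state = S_NUMBER; }
		else if (state == S_WORD && is_lower(c)) { /* stay in S_WORD */ }
		else if (state == S_NUMBER && is_digit(c)) { /* stay in S_NUMBER */ }
		else {
			/* No valid edge for c: the "reject" path. */
			if (state == S_START) { return TOY_UNKNOWN; } /* c stays consumed */
			lx->p--; /* push back c so it starts the next token */
			break;
		}
		has_consumed_input = 1;
	}

	if (!has_consumed_input) { return TOY_EOF; }
	return state == S_WORD ? TOY_WORD : TOY_NUMBER;
}
```

Calling `toy_next` repeatedly on `"ab12 c"` yields a word, a number, a word, and then end of input; the `1` that terminates `ab` is pushed back and becomes the first character of the number.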
Instead of having the EOF token occupy the same byte, line, and column position as the last token, it should immediately follow. The new lx codegen behaves this way, and katef and I decided that it made sense to keep it like that, as long as it's consistent.
Add `${LX}` as a dependency for the targets using it.
Remove the `getcio=${io}` and `io=${io}` arguments to cat. Those may be
a merge error? They just produce a warning.
7cb369e to
862a68c
Some of the CI test matrix builds set LX to 'true; echo lx', but that obviously won't work for tests that actually need to run lx in order to exercise its output.
012cce4 to
affca78
     .if make(test)
     .END::
    -	grep FAIL ${BUILD}/tests/*/res*; [ $$? -ne 0 ]
    +	grep FAIL ${BUILD}/tests/*/*res*; [ $$? -ne 0 ]
This may have been causing test failures to go unnoticed before.
     TEST_OUTDIR.tests/lxpos = ${BUILD}/tests/lxpos

    -LX?=${BUILD}/bin/lx
    +LX_BIN?=${BUILD}/bin/lx
The CI test build sets LX='true; echo lx' and that breaks these tests.
makes sense to rename the variable just for lx's own tests, okay
    -${BUILD}/tests/lxpos/${buf}-${getc}-${io}-lexer.${ext}: tests/lxpos/lexer.lx
    -	${LX} -l ${ext} ${LX_CFLAGS} ${LX_CFLAGS.tests/lxpos/${buf}-${getc}-${io}-lexer.lx} < ${.ALLSRC:M*.lx} > $@ \
    +${BUILD}/tests/lxpos/${buf}-${getc}-${io}-lexer.${ext}: tests/lxpos/lexer.lx ${LX_BIN}
There was potentially a build race condition here -- it wasn't ensuring lx was actually built before trying to use it.
src/libfsm/print/c.c
Outdated
     * input loop was skipped it would still be NONE. */
    fprintf(f, "\tint has_consumed_input = 0;\n");

    /* For FSM_IO_STR and FSM_IO_PAIR, define a macro that will be
This macro is a bit ugly, but much simpler than changing normal code generation to instantiate some kind of character iterator, just so lx and libfsm are on the same page about advancing and pushing back the character stream.
Do you think we should put "LX" in the macro name?
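For readers unfamiliar with the pattern being discussed, the following is a rough sketch of an overridable macro that defaults to a no-op. Only the name `FSM_ADVANCE_HOOK` comes from this PR; the counters and the `toy_fsm_main` loop are assumptions standing in for lx's bookkeeping and for libfsm's generated FSM_IO_STR code. Later comments in this thread replace the macro with an advance hook in `struct fsm_hooks`, so treat this purely as an illustration of the interim approach.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-character bookkeeping a client such as lx might keep. */
static size_t toy_byte_count;
static size_t toy_line_count;

/* The client defines the hook before the generated code is compiled... */
#define FSM_ADVANCE_HOOK(C) \
	do { toy_byte_count++; if ((C) == '\n') { toy_line_count++; } } while (0)

/* ...and the generated code falls back to a no-op when nobody defined it. */
#ifndef FSM_ADVANCE_HOOK
#define FSM_ADVANCE_HOOK(C) do { (void) (C); } while (0)
#endif

/* Stand-in for a generated string-input matcher: it walks the input and
 * invokes the hook immediately after advancing past each character. */
static int
toy_fsm_main(const char *s)
{
	const char *p = s;

	assert(s != NULL);

	while (*p != '\0') {
		int c = (unsigned char) *p++; /* read, then advance */
		FSM_ADVANCE_HOOK(c);
		/* state transitions on c would go here */
	}

	return 1;
}
```

With the first `#define` removed, the same `toy_fsm_main` compiles unchanged and the hook call expands to nothing, which is the "defaults to a no-op" behaviour the commit message describes.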
This macro rubs me the wrong way too. There's gotta be a nicer way. Maybe we could have lx emit a call to %sadvance_end() somewhere (in a new hook, e.g. at the end of the generated fsm loop) instead.
I'll change it to an advance hook in struct fsm_hooks.
    struct ir_state_table {
    unsigned to[FSM_SIGMA_COUNT];
    } *table;
    int not_yet_implemented;
Nothing uses this yet, but referencing FSM_SIGMA_COUNT here made the ir.h header depend on internal.h.
    fprintf(stderr, " fgetc");
    }

    if (opt->comments) {
I added these comments because I found the multiple layers of generated *getc functions with similar names very confusing.
src/lx/print/c.c
Outdated
    fprintf(f, "\tassert(lx->p != NULL);\n");
    /* FIXME: This should distinguish between alloc failure
     * and EOF, but will require layers of interface changes. */
    fprintf(f, "\tif (!lx_advance_end(lx, c)) { return EOF; }\n");
lx_advance_end can fail if lx->push has an allocation failure, but lx_getc either returns a character or EOF. It should probably signal the error more obviously, but what do you think the interface should look like? This only applies in dynamic buffering mode, of course.
Previously the calls to lx->push were injected into the body of the zone function, so it failing could directly return TOK_ERROR. We could add an error flag on the lx handle and check that after the zone's character+state switching, or we could leave that for now and make sure it's handled properly in the vmc codegen. Thoughts?
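One of the options floated above, an error flag on the lexer handle that is checked after the character read, might look roughly like this. Everything here is hypothetical: `struct toy_lx`, its fields, and the `toy_*` functions are invented for the sketch and do not reflect lx's actual handle or generated API. The point is only that a getc-style function limited to "character or EOF" can still report an allocation failure through a sticky side channel.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical handle; field names are illustrative, not lx's real struct. */
struct toy_lx {
	const char *p;        /* next unread character */
	int push_should_fail; /* simulates an allocation failure in push */
	int error;            /* sticky flag: set when buffering fails */
};

enum toy_token { TOY_EOF, TOY_ERROR, TOY_CHAR };

/* Stand-in for a dynamic buffer push; returns 0 on (simulated) failure. */
static int
toy_push(struct toy_lx *lx, int c)
{
	(void) c;
	return !lx->push_should_fail;
}

/* Returns a character or EOF, like getc; failures are recorded in lx->error. */
static int
toy_getc(struct toy_lx *lx)
{
	int c;

	assert(lx != NULL && lx->p != NULL);

	if (*lx->p == '\0') { return EOF; }
	c = (unsigned char) *lx->p++;
	if (!toy_push(lx, c)) {
		lx->error = 1;
		return EOF;
	}
	return c;
}

/* The caller inspects the flag to tell a real EOF from a failed push. */
static enum toy_token
toy_next(struct toy_lx *lx)
{
	int c = toy_getc(lx);

	if (c == EOF) { return lx->error ? TOY_ERROR : TOY_EOF; }
	return TOY_CHAR;
}
```

The trade-off is that every caller of the getc layer has to remember to check the flag, which is presumably part of why the comment above asks whether to defer this to the vmc codegen.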
    -/* TODO: prerequisite that the FSM is a DFA */
    +/* prerequisite that the FSM is a DFA */
    +assert(fsm_all(z->fsm, fsm_isdfa));
implemented this
I added some negative tests for truncated/corrupt input getting error messages and found that the block comment in |
Add a couple test cases (in8-10.txt) with an unexpected end of input, either in the middle of a pattern, or after matching the first pattern in a .. pair, but without matching the second. Supporting this changes the expected result for in6.txt: Previously it resulted in TOK_EOF, now it leads to TOK_UNKNOWN and produces a "lexically uncategorised" error message for the unexpected end of input. This change is necessary for fixing #386 / #508, and more generally to detect things like unterminated string literals.
I called it out in the commit logs for ae53e94, but this changes the result for |
This was previously ending up with a useless call to the current zone
after returning the token ("case S1: return TOK_UNKNOWN; lx->z(lx);"),
which led to a warning in CI [-Werror=implicit-fallthrough=].
I just pushed a couple more commits, which fix issues in the generated code when |
These may or may not be called, depending on the input.
In some cases this was hardcoding "lx_" in the generated code, which could lead to build failures if 'lx -e' was used to override the default prefix.
930c7bd to
1e55db8
    fprintf(f, "%sungetc(lx, %s); ", prefix.api, cur_char_var);
    if (pop && (~api_exclude & API_POS)) {
    fprintf(f, "%s%spop(lx->buf_opaque); ",
    prefix.api, buf_op_prefix());
sensible, okay
What to do here when api_tokbuf is 0? calling buf_op_prefix() is wrong. but does emitting ungetc() make sense at all then? probably not. I think we shouldn't emit the call
Should be fixed in 08fd72c. That should have checked API_BUF rather than API_POS, and rechecking every combination of -k and -x buf/pos options found a few other warnings for unused functions/arguments that are now fixed.
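The `~api_exclude & API_...` idiom in the snippet under review tests whether a feature has not been excluded. A small sketch of the corrected check, under the assumption that the flags are single bits (the numeric values below are made up; lx defines its own `API_*` constants, and the helper names are hypothetical):

```c
#include <assert.h>

/* Hypothetical flag values, chosen only so each feature has its own bit. */
enum { API_BUF = 1 << 0, API_POS = 1 << 1 };

/* Nonzero when `feature` has NOT been excluded from the generated API. */
static int
api_enabled(unsigned api_exclude, unsigned feature)
{
	assert(feature != 0);
	return (~api_exclude & feature) != 0;
}

/* Emit the buffer pop only when the buffer API exists at all. Checking
 * API_POS here instead would emit a call to a function that was never
 * generated whenever the buf API is excluded but the pos API is kept. */
static int
should_emit_pop(unsigned api_exclude, int pop)
{
	return pop && api_enabled(api_exclude, API_BUF);
}
```

Excluding only the position API leaves the pop in place, while excluding the buffer API suppresses it, which matches the fix described in the comment above.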
    fprintf(f, "lx->z = z%u, ", zindexof(ast, m->to));
    fprintf(f, "%s", prefix.tok);
    esctok(f, m->token->s);
    }
okay, cool
    unget_character(f, true, env->cur_char_var);
    } else if (m->token != NULL && m->to != NULL) {
    unget_character(f, true, env->cur_char_var);
    }
whew. you're thinking of match in rust here huh
I was making the cases explicit while I figured out the correct behavior. They're almost all handled the same way now though, so I could collapse that logic a bit. Does that seem worth doing? This module should get replaced with the vmc codegen soon.
Seems fine
uh i should translate "fine" from british to american: i mean i like it
src/libfsm/print/c.c
Outdated
     * additional 'NONE' state. Inside the input loop, the default
     * state of NONE would be updated to the start state, but if the
     * input loop was skipped it would still be NONE. */
    fprintf(f, "\tint has_consumed_input = 0;\n");
Not a fan of this being in the general purpose codegen. Do I have better ideas? I don't think so. Maybe it could live in the input buffer code, but I don't like that much either.
I'll see if I can handle this via the advance hook I just added in 4a5ca84, rather than the libfsm print layer.
In 051aaf0 I moved this to the advance hook and lx's print interface, it's not in the general purpose codegen anymore.
this makes me so happy, thank you!
    .end_ids = ir->states[state_id].endids.ids,
    .end_id_count = ir->states[state_id].endids.count,
    };
why not pass whatever type &ir->states[state_id].endids is?
struct ir_state_endids is defined in an internal ir.h header, and an upcoming PR is going to add another pair of fields to the state metadata struct, for eager output IDs and their count.
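The design choice in this exchange, copying fields into a separate hook-facing struct rather than passing a pointer to the internal type, can be sketched as follows. All `toy_*` names and the struct layouts are assumptions for illustration; only the field names `end_ids` and `end_id_count` appear in the diff above.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for libfsm's end id type; the real typedef lives in libfsm. */
typedef unsigned toy_end_id;

/* Internal IR representation (hypothetical layout) that hooks should not see. */
struct toy_ir_endids {
	toy_end_id *ids;
	size_t count;
};

/* Hook-facing view. More fields can be appended later without exposing
 * or reshaping the internal struct. */
struct toy_state_metadata {
	const toy_end_id *end_ids;
	size_t end_id_count;
};

/* Build the public view from the internal one; the ids are borrowed, not copied. */
static struct toy_state_metadata
toy_metadata_for(const struct toy_ir_endids *e)
{
	struct toy_state_metadata md;

	assert(e != NULL);
	md.end_ids = e->ids;
	md.end_id_count = e->count;
	return md;
}

/* Example of what a hook might do with the metadata: look for one end id. */
static int
toy_has_end_id(const struct toy_state_metadata *md, toy_end_id id)
{
	size_t i;

	assert(md != NULL);
	for (i = 0; i < md->end_id_count; i++) {
		if (md->end_ids[i] == id) { return 1; }
	}
	return 0;
}
```

Because hooks only see `struct toy_state_metadata`, adding another pointer and count pair later (as the reply above anticipates) would not require hook authors to include an internal header.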
This avoids cluttering libfsm's print output with `has_consumed_input`, which is specific to lx.
Tested with every combination of (dyn+fgetc, fixed+fgetc, pair, str), with and without '-x buf', '-x pos', or both.
<3
PR #509 introduced a bug: It didn't distinguish between an unexpected end of input and an end of input in a zone that matches but ignores its input. This caused several lxpos tests to fail due to getting a TOK_UNKNOWN rather than a TOK_EOF when the input has trailing whitespace, but I didn't notice until after merging because the normal build doesn't regenerate the code for src/lx/lexer.lx or src/libfsm/lexer.lx. (I had ensured all the libre dialect lexers and parsers were regenerated, but missed those.) Instead of always printing TOK_UNKNOWN, this inspects the zone mappings to determine whether the current end ID represents a dead end for the zone. If not, it should instead print TOK_EOF.
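A simplified way to picture the distinction is below. This is a sketch under assumptions, not the logic lx actually generates: `struct toy_mapping`, its `has_token` field, and the idea that a missing mapping stands for "stopped mid-pattern" are all invented to illustrate the three outcomes at end of input.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical token types. */
enum toy_token { TOY_EOF, TOY_UNKNOWN, TOY_WORD };

/* Hypothetical zone mapping: a pattern that either yields a token or is
 * matched and ignored (for example, whitespace). */
struct toy_mapping {
	int has_token;
	enum toy_token token;
};

/* Decide what to yield when the input ends. `m` is the mapping for the
 * state the lexer stopped in, or NULL if it stopped partway through a
 * pattern. */
static enum toy_token
toy_token_at_end_of_input(const struct toy_mapping *m, int has_consumed_input)
{
	if (!has_consumed_input) { return TOY_EOF; } /* nothing was pending */
	if (m == NULL) { return TOY_UNKNOWN; }       /* unexpected end of input */
	if (!m->has_token) { return TOY_EOF; }       /* matched, but ignored */
	return m->token;                             /* a complete final token */
}
```

Under this model, trailing whitespace at the end of a file lands in the "matched, but ignored" case and reports end of input, whereas an unterminated pattern reports an unknown token.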
…by-509 lx: Distinguish between unexpected EOF and EOF in ignored zones (broken by #509)
Also in this PR:

- [...] `getc`, `str`, and `pair`), and fix a couple things in their build config.
- [...] `-l dump` output call `lx->free` when necessary.
- [...] `*res*` not `res*`, because some result files are prefixed.

There may still be combinations of lx output modes and other flags that get warnings for unused functions or otherwise have a broken build, but I tested `getc`, `str`, and `pair`, with `dyn` or `fixed` buffering, and with or without `-x buf -x pos`. All the combinations tested by the lxpos build work.