NOTE: If you're looking to score every sentence in a message rather than the message as a whole, but your use case doesn't actually require sentence-level results, checking the entire message with `is_toki_pona` is guaranteed to be more accurate!
I don't have any scorers that cover all the sentences in a message, only scorers for individual sentences. How to combine the per-sentence results is largely left up to the user.
I do provide the following example scorers:
```py
from numbers import Number

# ILO is assumed to be an already-configured scorer instance.

def all_must_pass(message: str) -> bool:
    # Pass only if every sentence passes.
    return all(ILO.are_toki_pona(message))


def portion_must_pass(message: str, score: Number = 0.8) -> bool:
    # Pass if at least `score` of the sentences pass.
    results = ILO.are_toki_pona(message)
    sent_count = len(results)
    passing = results.count(True)
    return (passing / sent_count) >= score
```

But these implementations are pretty weak. The first one is a no-go, because it would fail an entire message for ending with a string like ":D". The sentence tokenizer doesn't consider emoticons, so this would become at least two sentences, where the second is a single token "D" that would fail on its own.
The second implementation is less severe, but it is still sensitive to the output of the sentence tokenizer, and especially so for short messages. It would fail for, say, "ni li musi :D": for the same reason as above, this input splits into two sentences, one passing and one failing, so it scores 50%, below the default 80% threshold.
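To make the failure concrete, here is a small sketch of the arithmetic. It does not call the real library: the list `[True, False]` is a hard-coded assumption standing in for the per-sentence results you would plausibly get for "ni li musi :D" once the tokenizer splits off "D".

```python
# Hypothetical per-sentence results for "ni li musi :D":
# the sentence "ni li musi" passes, the stray sentence "D" fails.
results = [True, False]

# Same logic as all_must_pass: one failing sentence fails the message.
all_pass = all(results)

# Same logic as portion_must_pass with the default threshold of 0.8.
portion = results.count(True) / len(results)
portion_pass = portion >= 0.8

print(all_pass)      # False
print(portion)       # 0.5
print(portion_pass)  # False
```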
It may become necessary to build a particularly lenient meta-scorer, or to be selective about which sentences are considered in the first place. For example, it may be reasonable to ignore sentences of length 1 in the sentence meta-scorer (except where there is only one sentence), since these would almost certainly be artifacts of the sentence tokenizer.
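One possible shape for that filter is sketched below. Everything here is an assumption for illustration: `lenient_portion_must_pass` is a hypothetical name, "length" is taken to mean token count, and the function expects the caller to supply each sentence's tokens alongside its pass/fail result rather than fetching them from the library. The handling of empty input and of messages where every sentence has one token is also a guess at reasonable behavior.

```python
from typing import Sequence


def lenient_portion_must_pass(
    sentences: Sequence[Sequence[str]],
    results: Sequence[bool],
    threshold: float = 0.8,
) -> bool:
    """Score a message from per-sentence results, skipping 1-token sentences.

    `sentences` holds the tokens of each sentence and `results` the matching
    per-sentence verdicts; both are assumed to be supplied by the caller.
    """
    if len(sentences) != len(results):
        raise ValueError("sentences and results must have the same length")
    if not results:
        return False  # assumption: an empty message does not pass

    if len(results) == 1:
        # A lone sentence is always scored, even if it is a single token.
        kept = list(results)
    else:
        # Drop single-token sentences as likely tokenizer artifacts.
        kept = [ok for toks, ok in zip(sentences, results) if len(toks) > 1]

    # Assumption: if every sentence was filtered out, score them all instead.
    if not kept:
        kept = list(results)

    return kept.count(True) / len(kept) >= threshold


# "ni li musi :D" with the stray "D" sentence ignored.
sentences = [["ni", "li", "musi"], ["D"]]
results = [True, False]
print(lenient_portion_must_pass(sentences, results))  # True
```

With this filter the stray "D" no longer counts against the message, while a message consisting only of "D" is still scored on its own and fails.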