Multi-sentence scorer implementation for are_toki_pona #2

@gregdan3

NOTE: If you're reading this looking to score every sentence in a message rather than an entire message, but your use case doesn't actually require sentences, checking the entire message with is_toki_pona is guaranteed to be more accurate!

I don't have any scorers that cover all the sentences in a message, only scorers for individual sentences. How to combine the per-sentence results is largely left up to the user.

I do provide the following example scorers:

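        from numbers import Number

        def all_must_pass(message: str) -> bool:
            # Every sentence in the message must pass on its own
            return all(ILO.are_toki_pona(message))

        def portion_must_pass(message: str, score: Number = 0.8) -> bool:
            results = ILO.are_toki_pona(message)
            sent_count = len(results)
            if sent_count == 0:
                return False  # nothing to score; also avoids dividing by zero
            passing = results.count(True)
            return (passing / sent_count) >= score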

But these implementations are pretty weak. The first one is a no-go, because it would fail an entire message for ending with a string like ":D". The sentence tokenizer doesn't consider emoticons, so the message would become at least two sentences, where the last is a single token "D" that would fail on its own.
The second implementation is less severe, but still sensitive to the output of the sentence tokenizer, and especially sensitive to short messages. It would fail for, say, "ni li musi :D": for the same reason as above, this input splits into one passing sentence and one failing sentence, so it scores 50%, which is below the default threshold of 0.8.
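
The arithmetic behind that failure can be sketched without the tokenizer. The per-sentence results below are hard-coded to match the split described above (one passing sentence, then a failing "D"); they are an assumption for illustration, not actual output from `ILO.are_toki_pona`, and `portion` is a hypothetical helper.

```python
from typing import List


def portion(results: List[bool]) -> float:
    """Fraction of sentences that passed (hypothetical helper)."""
    return results.count(True) / len(results)


# Assumed per-sentence results for "ni li musi :D":
# "ni li musi" passes, the stray "D" fails.
results = [True, False]

print(all(results))             # the strict scorer rejects the message
print(portion(results) >= 0.8)  # 0.5 is below the default 0.8 threshold
```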

It may become necessary to build a particularly lenient meta-scorer, or to be selective about which sentences are considered in the first place. For example, it may be reasonable to ignore sentences of length 1 in the sentence meta-scorer (except where there is only one sentence), since these would almost certainly be artifacts of the sentence tokenizer.
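
One possible shape for that selective meta-scorer, as a rough sketch only. It assumes sentences arrive already tokenized as lists of strings and that a per-sentence scorer is passed in as a callable; `lenient_portion_must_pass`, `toy_scorer`, and `TOY_WORDS` are hypothetical names used for illustration and are not part of the library.

```python
from typing import Callable, Sequence

Sentence = Sequence[str]

# Toy stand-in for a real per-sentence scorer (assumption, not the library's)
TOY_WORDS = {"ni", "li", "musi", "pona"}


def toy_scorer(sentence: Sentence) -> bool:
    return all(token in TOY_WORDS for token in sentence)


def lenient_portion_must_pass(
    sentences: Sequence[Sentence],
    is_sentence_ok: Callable[[Sentence], bool],
    score: float = 0.8,
) -> bool:
    """Score a message, skipping one-token sentences when possible."""
    if not sentences:
        return False  # nothing to score
    considered = list(sentences)
    if len(considered) > 1:
        # One-token sentences are likely tokenizer artifacts, so drop them
        longer = [s for s in considered if len(s) > 1]
        # If every sentence is one token long, keep them all instead
        if longer:
            considered = longer
    passing = sum(1 for s in considered if is_sentence_ok(s))
    return passing / len(considered) >= score


# "ni li musi :D" as two tokenized sentences; the stray "D" is ignored
print(lenient_portion_must_pass([["ni", "li", "musi"], ["D"]], toy_scorer))
# A lone one-token sentence is still scored
print(lenient_portion_must_pass([["D"]], toy_scorer))
```

Skipping short sentences before computing the ratio keeps the threshold meaningful for short messages, at the cost of never letting a genuine one-word sentence count for or against a longer message.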

Labels: enhancement (New feature or request)
