The dataset mixes two scripts, Latin and Cyrillic. Only four lemmata are written in Cyrillic. Perhaps these should be dropped? An exhaustive list of the "offending" lemmata follows:

- молдовенеск
- собэ
- хомосексуалитате
- юби
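
If it helps, here is a minimal sketch of how the Cyrillic lemmata could be found and separated automatically. It assumes the lemmata are available as a plain list of strings; the function names `is_cyrillic` and `split_by_script` and the two Latin sample entries are hypothetical and not taken from the dataset.

```python
import unicodedata


def is_cyrillic(lemma: str) -> bool:
    """Return True if the lemma has letters and all of them are Cyrillic."""
    letters = [ch for ch in lemma if ch.isalpha()]
    return bool(letters) and all(
        unicodedata.name(ch, "").startswith("CYRILLIC") for ch in letters
    )


def split_by_script(lemmata):
    """Split lemmata into (Cyrillic, everything else), preserving order."""
    cyrillic = [lemma for lemma in lemmata if is_cyrillic(lemma)]
    other = [lemma for lemma in lemmata if not is_cyrillic(lemma)]
    return cyrillic, other


# The four Cyrillic lemmata from the report, plus two hypothetical
# Latin-script entries added only for illustration.
sample = ["молдовенеск", "собэ", "хомосексуалитате", "юби", "casă", "iubi"]
cyrillic, latin = split_by_script(sample)
print(cyrillic)
print(latin)
```

A lemma that mixes both scripts would land in the second list with this check, so it may be worth inspecting that list separately before dropping anything.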