Cyrillic

The dataset mixes two scripts, Latin and Cyrillic. Only a handful of lemmata are written in Cyrillic. Perhaps these should be dropped?

An exhaustive list of the "offending" lemmata follows:
- молдовенеск
- собэ
- хомосексуалитате
- юби
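
If it helps, here is a minimal sketch of how such lemmata could be detected and separated out automatically. It assumes the lemmata are available as plain Python strings. The function names `contains_cyrillic` and `split_by_script` are made up for illustration, and the two Latin entries in the sample are placeholders, not taken from the dataset.

```python
import unicodedata


def contains_cyrillic(lemma: str) -> bool:
    """Return True if any character's Unicode name starts with CYRILLIC."""
    return any(unicodedata.name(ch, "").startswith("CYRILLIC") for ch in lemma)


def split_by_script(lemmata):
    """Partition lemmata into (kept, cyrillic) lists, preserving input order."""
    kept, cyrillic = [], []
    for lemma in lemmata:
        (cyrillic if contains_cyrillic(lemma) else kept).append(lemma)
    return kept, cyrillic


# Hypothetical sample: the four Cyrillic lemmata listed above,
# plus two Latin-script placeholders ("casă", "iubi") for contrast.
sample = ["casă", "молдовенеск", "собэ", "iubi", "хомосексуалитате", "юби"]
kept, cyrillic = split_by_script(sample)
```

Flagging a lemma as soon as it contains any Cyrillic character also catches mixed-script entries, which may or may not be what is wanted here. Whether the flagged entries are dropped or kept aside is a separate decision.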