This repository contains scripts for building word frequency dictionaries.
Building a word frequency dictionary requires a corpus of text. Various corpora are available from OPUS (https://opus.nlpl.eu/).
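
As a rough illustration of the idea, the sketch below counts words in a few lines of text and sorts them by frequency. It is not the code used by the scripts in this repository: the tokenizer regex and the names `count_words` and `to_frequency_list` are assumptions made for this example, and a real corpus would be read from a file rather than from an in-memory list.

```python
import re
from collections import Counter

# Assumed tokenizer: runs of letters, optionally joined by apostrophes.
TOKEN_RE = re.compile(r"[^\W\d_]+(?:'[^\W\d_]+)*")


def count_words(lines):
    """Count lowercase word occurrences across an iterable of text lines."""
    counts = Counter()
    for line in lines:
        counts.update(TOKEN_RE.findall(line.lower()))
    return counts


def to_frequency_list(counts):
    """Return (word, count) pairs, most frequent first, ties in alphabetical order."""
    return sorted(counts.items(), key=lambda pair: (-pair[1], pair[0]))


# Tiny stand-in for a corpus; real input would be much larger.
sample = ["The cat sat on the mat.", "The dog didn't sit."]
freq = to_frequency_list(count_words(sample))
print(freq[:3])
```

Lowercasing before counting merges "The" and "the" into one entry, which is usually what a frequency dictionary wants, though it also folds proper nouns into common words.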
We prefer:
- Contemporary, simple, everyday language.
- General language that is not focused on a specific topic, such as politics or technology.
- Language that is not vulgar, obscene or otherwise triggering.
The word frequency dictionaries are often built in collaboration with native speakers, who carefully review the lists by hand and remove any inappropriate words.
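
One way the result of such a review could be applied is sketched below, assuming the reviewers hand back a plain list of words to exclude. The function name `remove_blocked` and the list format are hypothetical and only serve this example; the actual review workflow may differ.

```python
def remove_blocked(freq_list, blocked_words):
    """Drop (word, count) pairs whose word appears in the reviewers' list.

    Both inputs are assumptions for this sketch: freq_list is a list of
    (word, count) pairs, blocked_words is an iterable of strings, such as
    the lines of a text file with one word per line.
    """
    # Normalize the reviewed list: trim whitespace, lowercase, skip blanks.
    blocked = {w.strip().lower() for w in blocked_words if w.strip()}
    return [(word, count) for word, count in freq_list
            if word.lower() not in blocked]


cleaned = remove_blocked([("the", 3), ("damn", 2), ("cat", 1)], ["Damn\n", ""])
print(cleaned)
```

Matching is case-insensitive here, so a reviewer does not need to list every capitalization of a word.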