keybr.com-corpus

This repository contains scripts for building word frequency dictionaries.

To build a word frequency dictionary, one needs a corpus of text. Various corpora can be obtained from https://opus.nlpl.eu/
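As a rough illustration of what building a word frequency dictionary involves, here is a minimal sketch that counts words in a string and sorts them by frequency. It is not the code used by this repository; the function name `countWords` and the tokenization rule (runs of Unicode letters, lowercased) are assumptions made for the example.

```javascript
// Hypothetical helper, not part of this repository: count how often each
// word occurs in a piece of text and return [word, count] pairs.
function countWords(text) {
  const counts = new Map();
  // \p{L} matches any Unicode letter and \p{M} any combining mark, so this
  // works for non-Latin scripts too. Apostrophes and hyphens split words,
  // which is a simplification.
  const words = text.toLowerCase().match(/[\p{L}\p{M}]+/gu) ?? [];
  for (const word of words) {
    counts.set(word, (counts.get(word) ?? 0) + 1);
  }
  // Most frequent first; ties are broken by code unit order so the
  // result is stable across runs.
  return [...counts.entries()].sort(
    (a, b) => b[1] - a[1] || (a[0] < b[0] ? -1 : a[0] > b[0] ? 1 : 0),
  );
}

const dictionary = countWords("The cat saw the dog. The dog ran.");
console.log(dictionary);
```

A real corpus is far too large to hold in a single string, so a practical version would likely read the input line by line and update the same map incrementally.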

We prefer:

Contemporary, simple, everyday language.
Unbiased language that is not focused on any particular topic, such as politics or technology.
Language that is not vulgar, obscene or otherwise triggering.

The word frequency dictionaries are often built in collaboration with native speakers, who manually and carefully review the lists to remove any bad words.
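The manual review can be complemented by an automated pass that drops known unwanted words. The sketch below shows one way such a filter might look. It is not taken from this repository: the function name `removeBlacklisted` is hypothetical, and representing the blacklist as a plain array of strings is an assumption, since the format of the repository's blacklist files is not described here.

```javascript
// Hypothetical helper, not part of this repository: remove blacklisted
// words from a list of [word, count] pairs.
function removeBlacklisted(entries, blacklist) {
  // A Set gives constant-time lookups; matching is case-insensitive.
  const blocked = new Set(blacklist.map((word) => word.toLowerCase()));
  return entries.filter(([word]) => !blocked.has(word.toLowerCase()));
}

// Assumed sample data for illustration only.
const entries = [
  ["the", 3],
  ["darn", 2],
  ["dog", 2],
];
const cleaned = removeBlacklisted(entries, ["Darn"]);
console.log(cleaned);
```

A filter like this only catches words someone has already listed, which is why review by native speakers remains the main safeguard.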

blacklist
lang-ar
lang-be
lang-cs
lang-de
lang-el
lang-en
lang-es
lang-fa
lang-fr
lang-ga
lang-hr
lang-hu
lang-it
lang-lt
lang-nb
lang-nl
lang-pl
lang-pt-BR
lang-pt-PT
lang-pt
lang-ro
lang-ru
lang-sl
lang-sv
lang-tr
lang-uk
lib
raw
scan-words
.editorconfig
.gitattributes
.gitignore
.prettierrc.js
README.md
generate.js
main.js
normalize.js
package-lock.json
package.json

aradzie/keybr.com-corpus

Folders and files

Latest commit

History

Repository files navigation

keybr.com-corpus

About

Resources

Uh oh!

Stars

Watchers

Forks

Uh oh!

Contributors 3

Languages