New features:
- Streamlined process and pipeline integration
- Wiktionary grounding with LLM-generated explanations
- Extensive word data across multiple languages
- Extremely detailed definitions
- New distribution formats: JSONL and SQLite, with more to be determined
- Options to select specific categories of words
Behold and stay tuned!
- Install project dependencies: `uv sync`
- Configure a `.env` file with `DATABASE_URL`
- Ensure a PostgreSQL database is reachable via that URL
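For reference, a minimal `.env` might look like the following — the user, password, host, port, and database name are placeholders for your own setup, not values the project prescribes:

```shell
# .env — placeholder connection details; substitute your own
DATABASE_URL=postgresql://user:password@localhost:5432/open_dictionary
```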
Download the compressed dump:

```shell
uv run open-dictionary download --output data/raw-wiktextract-data.jsonl.gz
```

Extract the JSONL file:
```shell
uv run open-dictionary extract \
  --input data/raw-wiktextract-data.jsonl.gz \
  --output data/raw-wiktextract-data.jsonl
```

Stream the JSONL into PostgreSQL (`dictionary_all.data` is JSONB):
```shell
uv run open-dictionary load data/raw-wiktextract-data.jsonl \
  --table dictionary_all \
  --column data \
  --truncate
```

Run everything end-to-end with optional partitioning:
```shell
uv run open-dictionary pipeline \
  --workdir data \
  --table dictionary_all \
  --column data \
  --truncate
```

Split rows by language code into per-language tables when needed:
```shell
uv run open-dictionary partition \
  --table dictionary_all \
  --column data \
  --lang-field lang_code
```

Materialize a smaller set of languages into dedicated tables with a custom prefix:
```shell
uv run open-dictionary filter en zh \
  --table dictionary_all \
  --column data \
  --table-prefix dictionary_filtered
```

Pass `all` to emit every language into its own table:
```shell
uv run open-dictionary filter all --table dictionary_all --column data
```

Populate the `common_score` column with word-frequency data (re-run with `--recompute-existing` to refresh scores):
```shell
uv run open-dictionary db-commonness --table dictionary_filtered_en
```

Remove low-quality rows (zero common score, numeric tokens, legacy tags) directly in PostgreSQL:
```shell
uv run open-dictionary db-clean --table dictionary_filtered_en
```

Generate structured, Chinese-learner-friendly entries with the LLM define workflow (writes JSONB into `new_speak` by default). This streams rows in batches, dispatches up to 50 concurrent LLM calls with exponential-backoff retries, and resumes automatically on restart:
```shell
uv run open-dictionary llm-define \
  --table dictionary_filtered_en \
  --source-column data \
  --target-column new_speak
```

Provide `LLM_MODEL`, `LLM_KEY`, and `LLM_API` in your environment (e.g., `.env`) before running LLM commands.
Each command streams data in chunks to handle the 10M+ line dataset efficiently.
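The chunked-streaming idea can be sketched as a small generator — illustrative only, under the assumption that the dump is line-delimited JSON; the CLI's internals may differ:

```python
import io
import json

def iter_batches(lines, batch_size=1000):
    """Yield lists of parsed JSON objects, batch_size at a time,
    so the full dump never has to fit in memory."""
    batch = []
    for line in lines:
        if not line.strip():
            continue
        batch.append(json.loads(line))
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # flush the final partial batch
        yield batch

# In-memory stand-in for the multi-gigabyte JSONL dump:
fake_dump = io.StringIO("\n".join('{"word": "w%d"}' % i for i in range(2500)))
sizes = [len(b) for b in iter_batches(fake_dump, batch_size=1000)]
print(sizes)  # → [1000, 1000, 500]
```

Because the input is consumed lazily, the same pattern works whether the lines come from a file handle, a gzip stream, or a database cursor.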