New features are:

- Streamlined process + pipeline integration
- Wiktionary grounding + LLM explanations
- Extensive word data across multiple languages
- Extremely detailed definitions
- New distribution formats: JSONL and SQLite, with more to be determined
- Options to select specific categories of words

Behold and stay tuned!

To get started:

- Install project dependencies: `uv sync`
- Configure a `.env` file with `DATABASE_URL`
- Ensure a PostgreSQL database is reachable via that URL
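
For reference, a minimal `.env` could look like the following; the credentials, host, and database name are placeholders for your own setup:

```bash
# .env — placeholder values; point DATABASE_URL at your own PostgreSQL instance
DATABASE_URL=postgresql://postgres:postgres@localhost:5432/open_dictionary
```
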
Download the compressed dump:

```bash
uv run open-dictionary download --output data/raw-wiktextract-data.jsonl.gz
```
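
If you want a quick look at the data before loading it, each line of the dump is a single JSON object describing one word entry; a small sanity check (assuming standard `gunzip` and `head` are available):

```bash
# Print the first 400 characters of the first record without decompressing the whole file
gunzip -c data/raw-wiktextract-data.jsonl.gz | head -n 1 | head -c 400; echo
```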

Extract the JSONL file:

```bash
uv run open-dictionary extract \
  --input data/raw-wiktextract-data.jsonl.gz \
  --output data/raw-wiktextract-data.jsonl
```

Stream the JSONL into PostgreSQL (`dictionary_all.data` is JSONB):

```bash
uv run open-dictionary load data/raw-wiktextract-data.jsonl \
  --table dictionary_all \
  --column data \
  --truncate
```
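
To confirm the load worked, a quick row count through `psql` (assuming it is installed and `DATABASE_URL` is exported in your shell) should report the full dataset:

```bash
psql "$DATABASE_URL" -c "SELECT count(*) FROM dictionary_all;"
```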

Run everything end-to-end with optional partitioning:

```bash
uv run open-dictionary pipeline \
  --workdir data \
  --table dictionary_all \
  --column data \
  --truncate
```

Split rows by language code into per-language tables when needed:

```bash
uv run open-dictionary partition \
  --table dictionary_all \
  --column data \
  --lang-field lang_code
```
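
Before (or after) partitioning, it can help to see how rows are distributed across languages; since `data` is JSONB and carries a `lang_code` field, a plain aggregate does the job:

```bash
psql "$DATABASE_URL" -c \
  "SELECT data->>'lang_code' AS lang, count(*) FROM dictionary_all GROUP BY 1 ORDER BY 2 DESC LIMIT 10;"
```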

Materialize a smaller set of languages into dedicated tables with a custom prefix:

```bash
uv run open-dictionary filter en zh \
  --table dictionary_all \
  --column data \
  --table-prefix dictionary_filtered
```
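
The resulting tables combine the prefix with the language code (the commands below reference `dictionary_filtered_en`); psql's table listing confirms what was created:

```bash
psql "$DATABASE_URL" -c "\dt dictionary_filtered_*"
```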

Pass `all` to emit every language into its own table:

```bash
uv run open-dictionary filter all --table dictionary_all --column data
```

Populate the `common_score` column with word frequency data (re-run with `--recompute-existing` to refresh scores):

```bash
uv run open-dictionary db-commonness --table dictionary_filtered_en
```
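
Once scores are in place you can rank entries by frequency; this sketch assumes the wiktextract payload keeps the headword in a `word` field:

```bash
psql "$DATABASE_URL" -c \
  "SELECT data->>'word' AS word, common_score FROM dictionary_filtered_en ORDER BY common_score DESC NULLS LAST LIMIT 10;"
```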

Normalize raw Wiktionary payloads into a slimmer JSONB column without invoking LLMs (writes to `process` by default):

```bash
uv run open-dictionary pre-process \
  --table dictionary_filtered_en \
  --source-column data \
  --target-column processed
```
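
To gauge how much slimmer the normalized payloads are, PostgreSQL's `pg_column_size` can compare the raw and processed columns side by side:

```bash
psql "$DATABASE_URL" -c \
  "SELECT avg(pg_column_size(data)) AS raw_bytes, avg(pg_column_size(processed)) AS processed_bytes FROM dictionary_filtered_en;"
```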

Remove low-quality rows (zero common score, numeric tokens, legacy tags) directly in PostgreSQL:

```bash
uv run open-dictionary db-clean --table dictionary_filtered_en
```

Generate structured, Chinese-learner-friendly entries with the LLM define workflow (writes JSONB into `new_speak` by default). This streams rows in batches, dispatches up to 50 concurrent LLM calls with exponential-backoff retries, and resumes automatically on restart:

```bash
uv run open-dictionary llm-define \
  --table dictionary_filtered_en \
  --source-column data \
  --target-column new_speak
```

Provide `LLM_MODEL`, `LLM_KEY`, and `LLM_API` in your environment (e.g., `.env`) before running LLM commands.
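
For example, a `.env` for the LLM commands might contain entries like these; the model name and endpoint are purely illustrative, so substitute whatever your provider expects:

```bash
# Placeholder values — not project defaults
LLM_MODEL=gpt-4o-mini
LLM_KEY=sk-your-key-here
LLM_API=https://api.openai.com/v1
```
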
Each command streams data in chunks to handle the 10M+ line dataset efficiently.