

open-english-pronouncing-dictionary

OpenEPD: open, fused English IPA pronunciation dictionary (~280k US English words). Misaki + CMUdict + WikiPron, canonical IPA with provenance and frequency-derived rarity. CC-BY-SA 4.0.

1 unstable release

0.1.0 May 13, 2026



OpenEPD — an open, fused English IPA pronunciation dictionary. ~280,000 US English words with canonical IPA, optional pronunciation variants, frequency-derived rarity rank, and per-entry source provenance. CC-BY-SA 4.0.

{
  "stupid": {
    "rarity": 1907.0,
    "ipa": {
      "misaki_gold": "stˈupəd",
      "cmu":         "stˈupəd",
      "cmu2":        "stˈupɪd",
      "wikipron":    "stjupɪd",
      "wikipron2":   "stupɪd"
    },
    "alt_display": "STUPID"
  }
}
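One way a consumer might collapse such an entry to a single transcription is sketched below in Rust. It assumes a source priority of misaki_gold, cmu, misaki_silver, wikipron (inferred from the labels, not confirmed as the crate's actual order) and treats numbered labels such as cmu2 as non-preferred variants. `Entry`, `pick_preferred`, and `distinct_variants` are hypothetical names for illustration, not part of the crate's API.

```rust
/// One dictionary entry, mirroring the JSON shape shown above.
/// Hypothetical type for illustration; not the crate's own API.
pub struct Entry {
    pub rarity: f64,
    /// (source label, IPA) pairs in file order, e.g. ("cmu2", "stˈupɪd").
    pub ipa: Vec<(String, String)>,
}

/// Assumed priority of source labels, best first.
const PRIORITY: [&str; 4] = ["misaki_gold", "cmu", "misaki_silver", "wikipron"];

/// Returns the transcription under the highest-priority label present.
/// Only exact labels match, so variants such as "cmu2" are never preferred.
pub fn pick_preferred(entry: &Entry) -> Option<&str> {
    PRIORITY.iter().find_map(|label| {
        entry
            .ipa
            .iter()
            .find(|pair| pair.0 == *label)
            .map(|pair| pair.1.as_str())
    })
}

/// Returns every distinct transcription, keeping first-seen order.
pub fn distinct_variants(entry: &Entry) -> Vec<&str> {
    let mut seen: Vec<&str> = Vec::new();
    for (_, ipa) in &entry.ipa {
        if !seen.contains(&ipa.as_str()) {
            seen.push(ipa.as_str());
        }
    }
    seen
}
```

For the "stupid" entry above, `pick_preferred` returns the misaki_gold form, and `distinct_variants` reduces the five labelled transcriptions to four distinct strings.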

Why it exists

The widely-used open English pronunciation dictionaries each have a major hole:

  • CMUdict uses ARPABET, which collapses /i/–/ɪ/ and /u/–/ʊ/. Best long-tail coverage (~135k entries), but vowel quality is lossy.
  • Misaki ships ~90k vetted entries with proper narrow IPA — what Kokoro TTS uses — but it's small.
  • WikiPron scrapes Wiktionary and captures pronunciation variation, but the data is raw and CC-BY-SA.
  • Neural grapheme-to-phoneme (G2P) models (LatPhon, ByT5) currently sit at a phoneme error rate (PER) of ~13% on English, which is worse than dictionary lookup.

Fusing them gives narrow vowel quality, long-tail coverage, and variant capture, at the cost of CC-BY-SA on the result. This repo is the fusion pipeline plus the artifact it produces.

Sources

Source | Entries | License | Role
Misaki us_gold | 90,201 | Apache 2.0 | Vetted narrow IPA. Highest-quality core.
CMUdict 0.7b | 135,166 | BSD-style | Broad ARPABET, converted to IPA at build time.
Misaki us_silver | 93,361 | Apache 2.0 | Less-vetted near-IPA; gap fill.
WikiPron eng_latn_us_broad | 101,371 | CC-BY-SA / Apache 2.0 | Wiktionary scrape, pronunciation variants.
wordfreq | n/a | MIT | Per-word rarity rank from Zipf frequency.
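The wordfreq row describes rarity only as a rank derived from Zipf frequency. A minimal Rust sketch of one plausible reading follows: sort words by descending Zipf value and number them from 1. The alphabetical tie-break and the `rarity_ranks` name are assumptions made for illustration; the real rule lives in the build pipeline and may differ.

```rust
/// Assigns 1-based rarity ranks: the most frequent word gets rank 1.0.
/// Ties in Zipf frequency are broken alphabetically (an assumption).
/// Ranks are f64 to match the "rarity" field in the JSON entries.
pub fn rarity_ranks(zipf: &[(&str, f64)]) -> Vec<(String, f64)> {
    let mut sorted: Vec<(&str, f64)> = zipf.to_vec();
    // Higher Zipf value means more frequent, so sort descending on it.
    sorted.sort_by(|a, b| b.1.total_cmp(&a.1).then_with(|| a.0.cmp(b.0)));
    sorted
        .into_iter()
        .enumerate()
        .map(|(i, (word, _))| (word.to_string(), (i + 1) as f64))
        .collect()
}
```

Under this reading, a larger rarity value means a less common word.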

The pipeline in corpus/ is reproducible — fetch the four source files, run corpus/build.py, get the exact data/openepd.json we ship. Details (ARPABET → IPA mapping, Misaki near-IPA expansion, WikiPron narrow-form stripping) live in corpus/README.md.
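To make the ARPABET → IPA step concrete, here is a small Rust sketch of the conversion. It is not the mapping used by corpus/build.py (that one is documented in corpus/README.md). It covers only a handful of phones, uses the common ARPABET-to-IPA correspondences, and assumes the stress mark goes directly before the stressed vowel, as in the "stˈupəd" example. `base_ipa` and `arpabet_to_ipa` are hypothetical names.

```rust
/// Maps one ARPABET phone (stress digit already removed) to IPA.
/// Covers only a few phones; a full table needs the whole ARPABET set.
fn base_ipa(phone: &str, stress: Option<char>) -> Option<&'static str> {
    Some(match phone {
        // AH is schwa when unstressed and /ʌ/ otherwise.
        "AH" => if stress == Some('0') { "ə" } else { "ʌ" },
        "AE" => "æ",
        "AO" => "ɔ",
        "IH" => "ɪ",
        "IY" => "i",
        "UH" => "ʊ",
        "UW" => "u",
        "D" => "d",
        "G" => "ɡ",
        "K" => "k",
        "P" => "p",
        "S" => "s",
        "T" => "t",
        _ => return None,
    })
}

/// Converts a space-separated ARPABET string such as "S T UW1 P AH0 D" to IPA.
/// Returns None if any phone is outside the small table above.
pub fn arpabet_to_ipa(arpabet: &str) -> Option<String> {
    let mut out = String::new();
    for token in arpabet.split_whitespace() {
        // Vowels carry a trailing stress digit: 0 none, 1 primary, 2 secondary.
        let (phone, stress) = match token.chars().last() {
            Some(d) if d.is_ascii_digit() => (&token[..token.len() - 1], Some(d)),
            _ => (token, None),
        };
        // Assumed convention: the stress mark sits directly before the vowel.
        match stress {
            Some('1') => out.push('ˈ'),
            Some('2') => out.push('ˌ'),
            _ => {}
        }
        out.push_str(base_ipa(phone, stress)?);
    }
    Some(out)
}
```

With this table, `arpabet_to_ipa("S T UW1 P AH0 D")` yields `Some("stˈupəd")`, the same string as the cmu value in the example entry.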

Using it

From Rust

# Cargo.toml
[dependencies]
open-english-pronouncing-dictionary = "0.1"
phonetics = { package = "phonetics-rs", version = "0.3", features = ["transcriptions"] }

use open_english_pronouncing_dictionary as openepd;

let corpus = openepd::load()?;
corpus.preferred_ipa("stupid");        // → Some("stˈupəd")
corpus.transcribe("cat dog");          // → Some("kˈætdˈɔɡ")

load() returns a phonetics::transcriptions::Corpus, which carries both a forward word→IPA map and a reverse IPA-prefix trie. The raw JSON is also exposed as openepd::CORPUS_JSON (a &'static str) for callers that want to parse it themselves.
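For readers unfamiliar with the two indexes, the following std-only Rust sketch shows the general idea of a forward word→IPA map paired with a reverse trie keyed by IPA characters. It is a simplified stand-in, not the phonetics-rs implementation; `MiniCorpus`, `TrieNode`, and their methods are hypothetical names.

```rust
use std::collections::{BTreeMap, HashMap};

/// Trie keyed by IPA characters; each node lists the words whose
/// transcription ends exactly at that node.
#[derive(Default)]
struct TrieNode {
    children: BTreeMap<char, TrieNode>,
    words: Vec<String>,
}

/// Hypothetical stand-in for the two indexes described above.
#[derive(Default)]
pub struct MiniCorpus {
    forward: HashMap<String, String>, // word -> IPA
    reverse: TrieNode,                // IPA prefix -> words
}

impl MiniCorpus {
    /// Adds a word to both indexes. Re-inserting a word with a new
    /// transcription does not remove the old trie entry in this sketch.
    pub fn insert(&mut self, word: &str, ipa: &str) {
        self.forward.insert(word.to_string(), ipa.to_string());
        let mut node = &mut self.reverse;
        for ch in ipa.chars() {
            node = node.children.entry(ch).or_default();
        }
        node.words.push(word.to_string());
    }

    /// Forward lookup: word -> IPA.
    pub fn ipa(&self, word: &str) -> Option<&str> {
        self.forward.get(word).map(String::as_str)
    }

    /// Reverse lookup: every word whose IPA starts with `prefix`, sorted.
    pub fn words_with_prefix(&self, prefix: &str) -> Vec<&str> {
        let mut node = &self.reverse;
        for ch in prefix.chars() {
            match node.children.get(&ch) {
                Some(next) => node = next,
                None => return Vec::new(),
            }
        }
        // Depth-first walk of the subtree below the prefix node.
        let mut found = Vec::new();
        let mut stack = vec![node];
        while let Some(n) = stack.pop() {
            found.extend(n.words.iter().map(String::as_str));
            stack.extend(n.children.values());
        }
        found.sort_unstable();
        found
    }
}
```

The trie walks the transcription one `char` at a time, so multi-byte IPA symbols such as ˈ and æ each occupy a single edge.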

From any other language

data/openepd.json is plain JSON. Schema is documented in corpus/README.md.

Why CC-BY-SA 4.0

WikiPron inherits Wiktionary's CC-BY-SA. Including its entries — which is what gives us the variant-capture advantage — binds the merged corpus to CC-BY-SA too. To produce a permissive (Apache 2.0) build, omit WikiPron in corpus/build.py; you'll lose ~25k variant transcriptions and the long-tail Wiktionary entries.

The Rust loader code in src/ is also CC-BY-SA, for simplicity — it's ten lines of include_str!, and the data is the only thing worth licensing here.

Compared to

  • CMUdict: this is CMUdict-as-IPA plus three other layers. ~2× the unique words and proper narrow vowels.
  • Misaki gold: this is a strict superset (Misaki gold is included verbatim as the preferred layer for the 90k words it covers).
  • WikiPron US broad scrape: same source data, but cleaned to broad IPA, merged with higher-quality dictionaries, and indexed by frequency.

Status

v0.1 — first published release. The schema is stable; refresh cadence will be driven by upstream source updates (Misaki, CMUdict, WikiPron all see periodic revisions).

License

CC-BY-SA 4.0. See LICENSE.

Upstream attributions for the merged data are listed in corpus/README.md and preserved per-entry as the ipa source labels (misaki_gold, cmu, misaki_silver, wikipron).
