1 unstable release

0.1.0	May 13, 2026

CC-BY-SA-4.0

6.5MB

open-english-pronouncing-dictionary

OpenEPD — an open, fused English IPA pronunciation dictionary. ~280,000 US English words with canonical IPA, optional pronunciation variants, frequency-derived rarity rank, and per-entry source provenance. CC-BY-SA 4.0.

{
  "stupid": {
    "rarity": 1907.0,
    "ipa": {
      "misaki_gold": "stˈupəd",
      "cmu":         "stˈupəd",
      "cmu2":        "stˈupɪd",
      "wikipron":    "stjupɪd",
      "wikipron2":   "stupɪd"
    },
    "alt_display": "STUPID"
  }
}
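As a rough illustration of how a consumer might pick one transcription out of an entry like the one above, the sketch below walks the source labels in a fixed priority order and returns the first one present. The order used here (misaki_gold, cmu, misaki_silver, wikipron) is an assumption based on Misaki gold being described as the preferred layer; the names `SOURCE_PRIORITY`, `pick_preferred`, and `stupid_entry` are hypothetical and not part of the crate.

```rust
use std::collections::HashMap;

/// Assumed priority order of source labels (hypothetical, not read from the crate).
const SOURCE_PRIORITY: [&str; 4] = ["misaki_gold", "cmu", "misaki_silver", "wikipron"];

/// Returns the transcription from the highest-priority source present in `ipa`.
/// Numbered variant labels such as "cmu2" are ignored by this sketch.
fn pick_preferred<'a>(ipa: &'a HashMap<String, String>) -> Option<&'a str> {
    SOURCE_PRIORITY
        .iter()
        .find_map(|label| ipa.get(*label).map(String::as_str))
}

/// Builds the `ipa` object of the "stupid" entry shown above.
fn stupid_entry() -> HashMap<String, String> {
    [
        ("misaki_gold", "stˈupəd"),
        ("cmu", "stˈupəd"),
        ("cmu2", "stˈupɪd"),
        ("wikipron", "stjupɪd"),
        ("wikipron2", "stupɪd"),
    ]
    .iter()
    .map(|(label, ipa)| (label.to_string(), ipa.to_string()))
    .collect()
}
```

With the entry above, `pick_preferred(&stupid_entry())` yields the misaki_gold form; an entry that only carries a wikipron label would fall through to that one.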

Why it exists

The widely-used open English pronunciation dictionaries each have a major hole:

CMUdict uses ARPABET, a broad phonemic notation that does not record narrow vowel detail. Best long-tail coverage (~135k entries), but vowel quality is lossy.
Misaki ships ~90k vetted entries with proper narrow IPA — what Kokoro TTS uses — but it's small.
WikiPron scrapes Wiktionary and captures pronunciation variation, but the data is raw and CC-BY-SA.
Neural G2P (LatPhon, ByT5) currently sits at ~13% phoneme error rate (PER) on English, which is worse than dictionary lookup.

Fusing them gives narrow vowel quality, long-tail coverage, and variant capture, at the cost of a CC-BY-SA license on the result. This repo is the fusion pipeline plus the artifact it produces.

Sources

Source	Entries	License	Role
Misaki `us_gold`	90,201	Apache 2.0	Vetted narrow IPA. Highest-quality core.
CMUdict 0.7b	135,166	BSD-style	Broad ARPABET, converted to IPA at build time.
Misaki `us_silver`	93,361	Apache 2.0	Less-vetted near-IPA; gap fill.
WikiPron `eng_latn_us_broad`	101,371	CC-BY-SA / Apache 2.0	Wiktionary scrape, pronunciation variants.
`wordfreq`	n/a	MIT	Per-word rarity rank from Zipf frequency.

The pipeline in corpus/ is reproducible: fetch the four source files, run corpus/build.py, and you get the exact data/openepd.json we ship. Details (ARPABET → IPA mapping, Misaki near-IPA expansion, WikiPron narrow-form stripping) live in corpus/README.md.
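To give a feel for the ARPABET → IPA step, here is a minimal sketch in Rust. It is not the mapping from corpus/README.md, which remains the authoritative description; it uses the conventional ARPABET correspondences, treats unstressed AH and ER as ə and ɚ, and places the stress mark directly before the vowel as in the "stupid" example. The function names `arpabet_base` and `arpabet_to_ipa` are hypothetical.

```rust
/// Maps one ARPABET phoneme (stress digit already removed) to IPA.
/// Conventional correspondences; the project's actual table may differ.
fn arpabet_base(sym: &str) -> Option<&'static str> {
    Some(match sym {
        "AA" => "ɑ", "AE" => "æ", "AH" => "ʌ", "AO" => "ɔ", "AW" => "aʊ",
        "AY" => "aɪ", "EH" => "ɛ", "ER" => "ɝ", "EY" => "eɪ", "IH" => "ɪ",
        "IY" => "i", "OW" => "oʊ", "OY" => "ɔɪ", "UH" => "ʊ", "UW" => "u",
        "B" => "b", "CH" => "tʃ", "D" => "d", "DH" => "ð", "F" => "f",
        "G" => "ɡ", "HH" => "h", "JH" => "dʒ", "K" => "k", "L" => "l",
        "M" => "m", "N" => "n", "NG" => "ŋ", "P" => "p", "R" => "ɹ",
        "S" => "s", "SH" => "ʃ", "T" => "t", "TH" => "θ", "V" => "v",
        "W" => "w", "Y" => "j", "Z" => "z", "ZH" => "ʒ",
        _ => return None,
    })
}

/// Converts a space-separated CMUdict pronunciation to an IPA string.
/// Returns None if any token is not a known ARPABET symbol.
fn arpabet_to_ipa(pron: &str) -> Option<String> {
    let mut out = String::new();
    for token in pron.split_whitespace() {
        // Vowels carry a trailing stress digit: 1 primary, 2 secondary, 0 unstressed.
        let (sym, stress) = match token.chars().last() {
            Some(d @ '0'..='2') => (&token[..token.len() - 1], Some(d)),
            _ => (token, None),
        };
        match stress {
            Some('1') => out.push('ˈ'),
            Some('2') => out.push('ˌ'),
            _ => {}
        }
        // Unstressed AH and ER are written as the reduced vowels.
        let ipa = match (sym, stress) {
            ("AH", Some('0')) => "ə",
            ("ER", Some('0')) => "ɚ",
            _ => arpabet_base(sym)?,
        };
        out.push_str(ipa);
    }
    Some(out)
}
```

Under these assumptions, the CMUdict pronunciation `S T UW1 P AH0 D` converts to the same string as the `cmu` value in the example entry.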

Using it

From Rust

[dependencies]
open-english-pronouncing-dictionary = "0.1"
phonetics = { package = "phonetics-rs", version = "0.3", features = ["transcriptions"] }

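use open_english_pronouncing_dictionary as openepd;

// Assumes the error returned by load() implements std::error::Error.
fn main() -> Result<(), Box<dyn std::error::Error>> {
    let corpus = openepd::load()?;
    corpus.preferred_ipa("stupid");        // → Some("stˈupəd")
    corpus.transcribe("cat dog");          // → Some("kˈætdˈɔɡ")
    Ok(())
}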

load() returns a phonetics::transcriptions::Corpus, which carries both a forward word→IPA map and a reverse IPA-prefix trie. The raw JSON is also exposed as openepd::CORPUS_JSON (a &'static str) for callers that want to parse it themselves.
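The forward half of that structure can be pictured with a small stand-in. The sketch below is not the phonetics-rs implementation: `ForwardMap` and `from_pairs` are hypothetical, the reverse trie is left out, and the behavior of `transcribe` (look up each whitespace-separated word, concatenate the results, return None if any word is missing) is an assumption inferred from the example output above.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for the forward word→IPA map.
struct ForwardMap {
    words: HashMap<String, String>,
}

impl ForwardMap {
    /// Builds a map from (word, IPA) pairs.
    fn from_pairs(pairs: &[(&str, &str)]) -> Self {
        let words = pairs
            .iter()
            .map(|(word, ipa)| (word.to_string(), ipa.to_string()))
            .collect();
        ForwardMap { words }
    }

    /// Looks up the stored transcription for a single word (exact match).
    fn preferred_ipa(&self, word: &str) -> Option<&str> {
        self.words.get(word).map(String::as_str)
    }

    /// Transcribes each whitespace-separated word and concatenates the results.
    /// Assumed behavior: a single unknown word makes the whole call return None.
    fn transcribe(&self, text: &str) -> Option<String> {
        text.split_whitespace()
            .map(|word| self.preferred_ipa(word))
            .collect()
    }
}
```

Collecting an iterator of `Option<&str>` into `Option<String>` short-circuits on the first missing word, which keeps the all-or-nothing behavior in one expression.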

From any other language

data/openepd.json is plain JSON. Schema is documented in corpus/README.md.

Why CC-BY-SA 4.0

WikiPron inherits Wiktionary's CC-BY-SA. Including its entries — which is what gives us the variant-capture advantage — binds the merged corpus to CC-BY-SA too. To produce a permissive (Apache 2.0) build, omit WikiPron in corpus/build.py; you'll lose ~25k variant transcriptions and the long-tail Wiktionary entries.

The Rust loader code in src/ is also CC-BY-SA, for simplicity — it's ten lines of include_str!, and the data is the only thing worth licensing here.

Compared to

CMUdict: this is CMUdict-as-IPA plus three other layers. ~2× the unique words and proper narrow vowels.
Misaki gold: this is a strict superset (Misaki gold is included verbatim as the preferred layer for the 90k words it covers).
WikiPron US broad scrape: same source data, but cleaned to broad IPA, merged with higher-quality dictionaries, and indexed by frequency.

Status

v0.1 — first published release. The schema is stable; refresh cadence will be driven by upstream source updates (Misaki, CMUdict, WikiPron all see periodic revisions).

License

CC-BY-SA 4.0. See LICENSE.

Upstream attributions for the merged data are listed in corpus/README.md and preserved per-entry as the ipa source labels (misaki_gold, cmu, misaki_silver, wikipron).