UNIT 4
Phonetics
Introduction
1. Debate Context
• The “whole language” vs. “phonics” debate in teaching reading seems modern but
mirrors an older historical debate in writing systems.
2. Historical Writing Systems
• Earliest independently invented writing systems (Sumerian, Chinese, Mayan) were
mainly logographic: one symbol = a whole word.
• Even early logographic systems included syllabic or phonemic elements where
symbols represented sounds.
• Example: The Sumerian symbol for “ration” (pronounced ba) could also function
purely as the sound /ba/.
• Modern Chinese, though mostly logographic, uses sound-based characters for foreign
words.
3. Development of Purely Sound-Based Writing Systems
• Sound-based writing can be:
o Syllabic (e.g., Japanese hiragana, katakana)
o Alphabetic (e.g., Roman alphabet)
o Consonantal (e.g., Semitic scripts)
• These systems often evolved from early logo-syllabic systems, usually when cultures
interacted.
• The Arabic, Aramaic, Hebrew, Greek, and Roman alphabets all came from a West
Semitic script, likely adapted by Western Semitic mercenaries from a cursive form of
Egyptian hieroglyphs.
• Japanese syllabaries came from cursive Chinese characters used to represent sounds.
• Those Chinese characters had been used to phonetically represent Sanskrit in
Buddhist scriptures during China’s Tang dynasty.
4. Conceptual Foundation (Ur-theory)
• Sound-based writing implies the spoken word is made up of smaller units of
speech.
• This idea underlies modern phonology.
• The decomposition of speech into smaller units is also the basis for:
o Speech recognition (turning acoustic waveforms into text)
o Speech synthesis / Text-to-speech (turning text into acoustic waveforms)
5. Focus of Chapter 7: Phonetics
• Phonetics studies linguistic sounds:
o Production by the articulators of the human vocal tract
o Acoustic realization of those sounds
o Digitization & processing of acoustic signals
6. Phones and Pronunciation in Technology
• Phones = individual speech units.
• Speech recognition systems need pronunciations for every word they can recognize.
• Text-to-speech systems need pronunciations for every word they can say.
• Phonetic alphabets are used to describe these pronunciations.
7. Two Main Areas of Phonetics
• Articulatory phonetics → how speech sounds are produced in the mouth.
• Acoustic phonetics → acoustic analysis of speech sounds.
8. Link to Phonology
• Phonology studies:
o How sounds vary systematically in different environments.
o How the sound system connects to the rest of grammar.
• Variation in pronunciation depending on context is crucial in speech modeling.
Speech Sounds and Phonetic Transcription:
1. Phonetics
o Study of speech sounds used in the languages of the world.
o Pronunciation of a word is modeled as a string of symbols representing
phones or segments.
2. Phones
o A phone is a speech sound.
o Represented with phonetic symbols that may resemble letters in an alphabetic
language (like English).
3. Purpose of this section
o Surveys different phones in English, especially American English.
o Explains how they are produced and how they are represented
symbolically.
4. Phonetic Alphabets
o International Phonetic Alphabet (IPA)
▪ Developed in 1888 by the International Phonetic Association.
▪ Goal: transcribe the sounds of all human languages.
▪ Includes both an alphabet and principles for transcription.
▪ Same utterance can be transcribed in different ways according to IPA
principles.
o ARPAbet (Shoup, 1980)
▪ Designed specifically for American English.
▪ Uses ASCII symbols.
▪ Can be seen as an ASCII form of an American-English subset of IPA.
▪ Common in online pronunciation dictionaries and computational
applications where non-ASCII fonts are inconvenient.
5. Choice in this book
o Will use ARPAbet instead of IPA for computational purposes.
o Figures 7.1 (consonants) and 7.2 (vowels) show ARPAbet symbols with IPA
equivalents.
6. Rare Phones Example
o [ux]: Rare in General American English.
▪ Represents a fronted [uw] found in Western and Northern Cities dialects since the late 1970s.
▪ Popularized by imitations of “Valley Girls” (Moon Zappa, 1982).
▪ For most speakers, [uw] is still more common (e.g., dude [d uw d]).
7. ARPAbet Consonants – Examples
(With IPA equivalents)
o [p] → parsley [p aa r s l iy]
o [t] → tea [t iy]
o [k] → cook [k uh k]
o [b] → bay [b ey]
o [d] → dill [d ih l]
o [g] → garlic [g aa r l ix k]
o [m] → mint [m ih n t]
o [n] → nutmeg [n ah t m eh g]
o [ng] → baking [b ey k ix ng]
o [f] → flour [f l aw axr]
o [v] → clove [k l ow v]
o [th] → thick [th ih k]
o [dh] → those [dh ow z]
o [s] → soup [s uw p]
o [z] → eggs [eh g z]
o [sh] → squash [s k w aa sh]
o [zh] → ambrosia [ae m b r ow zh ax]
o [ch] → cherry [ch eh r iy]
o [jh] → jar [jh aa r]
o [l] → licorice [l ih k axr ix sh]
o [w] → kiwi [k iy w iy]
o [r] → rice [r ay s]
o [y] → yellow [y eh l ow]
o [h] → honey [h ah n iy]
o Rare consonants:
▪ [q] → uh-oh [q ah q ow] (glottal stop)
▪ [dx] → butter [b ah dx axr] (flap)
▪ [nx] → winner [w ih nx axr] (nasal flap)
▪ [el] → table [t ey b el] (syllabic consonant)
8. ARPAbet Vowels – Examples
(With IPA equivalents)
o [iy] → lily [l ih l iy]
o [ih] → lily [l ih l iy]
o [ey] → daisy [d ey z iy]
o [eh] → pen [p eh n]
o [ae] → aster [ae s t axr]
o [aa] → poppy [p aa p iy]
o [ao] → orchid [ao r k ix d]
o [uh] → wood [w uh d]
o [ow] → lotus [l ow dx ax s]
o [uw] → tulip [t uw l ix p]
o [ah] → buttercup [b ah dx axr k ah p]
o [er] → bird [b er d]
o [ay] → iris [ay r ix s]
o [aw] → sunflower [s ah n f l aw axr]
o [oy] → soil [s oy l]
o Reduced/Uncommon vowels:
▪ [ax] → lotus [l ow dx ax s] (schwa)
▪ [axr] → heather [h eh dh axr]
▪ [ix] → tulip [t uw l ix p] (reduced [ih])
▪ [ux] → dude [d ux d]
9. Orthography vs. Phonetic Symbols
o Many ARPAbet/IPA symbols match Roman letters (e.g., [p] in platypus).
o English spelling is opaque:
▪ Same letter can represent different sounds.
▪ Example: letter c → [k] in cougar [k uw g axr], but [s] in cell [s eh l].
o [k] can appear as:
▪ c (cougar)
▪ k (kangaroo)
▪ x (fox [f aa k s])
▪ ck (jackal [jh ae k el])
▪ cc (raccoon [r ae k uw n])
o Languages like Spanish have more transparent spelling-sound mapping than
English.
ARPAbet Consonants Table
ARPAbet IPA Example Word Example Transcription
p [p] parsley [p aa r s l iy]
t [t] tea [t iy]
k [k] cook [k uh k]
b [b] bay [b ey]
d [d] dill [d ih l]
g [g] garlic [g aa r l ix k]
m [m] mint [m ih n t]
n [n] nutmeg [n ah t m eh g]
ng [ŋ] baking [b ey k ix ng]
f [f] flour [f l aw axr]
v [v] clove [k l ow v]
th [θ] thick [th ih k]
dh [ð] those [dh ow z]
s [s] soup [s uw p]
z [z] eggs [eh g z]
sh [ʃ] squash [s k w aa sh]
zh [ʒ] ambrosia [ae m b r ow zh ax]
ch [tʃ] cherry [ch eh r iy]
jh [dʒ] jar [jh aa r]
l [l] licorice [l ih k axr ix sh]
w [w] kiwi [k iy w iy]
r [r] rice [r ay s]
y [j] yellow [y eh l ow]
h [h] honey [h ah n iy]
q [ʔ] uh-oh [q ah q ow]
dx [ɾ] butter [b ah dx axr]
nx [ɾ̃] winner [w ih nx axr]
el [l̩] table [t ey b el]
ARPAbet Vowels Table
ARPAbet IPA Example Word Example Transcription
iy [i] lily [l ih l iy]
ih [ɪ] lily [l ih l iy]
ey [eɪ] daisy [d ey z iy]
eh [ɛ] pen [p eh n]
ae [æ] aster [ae s t axr]
aa [ɑ] poppy [p aa p iy]
ao [ɔ] orchid [ao r k ix d]
uh [ʊ] wood [w uh d]
ow [oʊ] lotus [l ow dx ax s]
uw [u] tulip [t uw l ix p]
ah [ʌ] buttercup [b ah dx axr k ah p]
er [ɝ] bird [b er d]
ay [aɪ] iris [ay r ix s]
aw [aʊ] sunflower [s ah n f l aw axr]
oy [ɔɪ] soil [s oy l]
ax [ə] lotus [l ow dx ax s]
axr [ɚ] heather [h eh dh axr]
ix [ɨ] tulip [t uw l ix p]
ux [ʉ] dude [d ux d]
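The ARPAbet transcriptions above are typically stored in a pronunciation dictionary keyed by word. A minimal sketch (the entries are toy examples taken from the lists above, not a real lexicon):

```python
# Minimal sketch: an ARPAbet pronunciation lookup of the kind used by
# TTS and ASR systems. The tiny dictionary is illustrative only.
ARPABET_DICT = {
    "tea":  ["t", "iy"],
    "cook": ["k", "uh", "k"],
    "dude": ["d", "uw", "d"],
}

def pronounce(word):
    """Return the ARPAbet phone string for a word, or None if unknown."""
    phones = ARPABET_DICT.get(word.lower())
    return " ".join(phones) if phones else None
```

A real system falls back to grapheme-to-phoneme rules when the lookup returns None, as discussed later in these notes.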
ARTICULATORY PHONETICS
• Definition: Study of how phones (speech sounds) are produced as the mouth,
throat, and nose modify the airflow from the lungs.
• The ARPAbet phone list is meaningless without knowing how each phone is
produced.
7.2.1 The Vocal Organs
Airflow in Speech
• Speech is produced by rapid movement of air.
• Most sounds: Air from lungs → trachea (windpipe) → mouth or nose.
Larynx
• Known as Adam’s apple or voice box.
• Contains vocal folds (vocal cords) → 2 small muscle folds.
• Glottis: Space between vocal folds.
Voicing
• Vocal folds close (not tightly) → vibrate as air passes → Voiced sounds.
• Vocal folds far apart → no vibration → Unvoiced sounds.
Voiced: [b], [d], [g], [v], [z], all vowels
Unvoiced: [p], [t], [k], [f], [s], etc.
Vocal Tract
• Area above trachea.
• Two parts:
1. Oral tract (mouth)
2. Nasal tract (nose)
Nasal Sounds
• Air passes through nose (also resonates in mouth).
• Examples: m, n, ng.
Phones Classification
1. Consonants
o Restrict/block airflow.
o May be voiced or unvoiced.
o Examples: [p], [b], [t], [d], [k], [g], [f], [v], [s], [z], [r], [l].
2. Vowels
o Minimal obstruction.
o Usually voiced, louder & longer-lasting.
o Examples: [aa], [ae], [ao], [ih], [aw], [ow], [uw].
3. Semivowels
o Properties of both vowels & consonants.
o Voiced like vowels, short like consonants.
o Examples: [y], [w].
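The voiced/unvoiced grouping above can be sketched as a simple classifier over ARPAbet phones (the sets below are partial and illustrative, not the full inventory):

```python
# Sketch: grouping ARPAbet phones by voicing, following the notes above.
VOICED_CONSONANTS = {"b", "d", "g", "v", "z", "zh", "jh", "dh",
                     "m", "n", "ng", "l", "r", "w", "y"}
UNVOICED_CONSONANTS = {"p", "t", "k", "f", "s", "sh", "ch", "th", "h", "q"}
VOWELS = {"iy", "ih", "ey", "eh", "ae", "aa", "ao", "uh",
          "ow", "uw", "ah", "er", "ay", "aw", "oy"}

def is_voiced(phone):
    # All vowels are voiced; consonants depend on the sets above.
    return phone in VOWELS or phone in VOICED_CONSONANTS
```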
Diagram – Vocal Organs (Side View)
(Shows airflow path and main parts)
Nasal Cavity
↑
┌─────────┐
│ │
Lips Teeth Alveolar Ridge
↓ ↓ ↓
Tongue ─── Palate ─── Velum (soft palate)
↓
Pharynx
Larynx (vocal folds, glottis)
Trachea
↓
Lungs
7.2.2 Consonants – Place of Articulation
• Definition: The point of maximum restriction of airflow when producing a
consonant.
• Used in automatic speech recognition to group phones.
Major English Places of Articulation
1. Labial
o Bilabial: Both lips come together.
Examples: [p] (possum), [b] (bear), [m] (marmot).
o Labiodental: Bottom lip touches upper teeth.
Examples: [v], [f].
2. Dental
o Tongue against teeth (tip slightly between).
Examples: [th] (thing), [dh] (though).
3. Alveolar
o Tip of tongue against alveolar ridge (behind upper teeth).
Examples: [s], [z], [t], [d].
o Coronal: Term for both dental + alveolar sounds.
4. Palatal
o Palato-alveolar: Tongue blade against rising back of alveolar ridge.
Examples: [sh] (shrimp), [ch] (china), [zh] (Asian), [jh] (jar).
o Palatal proper: Front of tongue close to palate.
Example: [y] (yak).
5. Velar
o Back of tongue against velum (soft palate).
Examples: [k] (cuckoo), [g] (goose), [ng] (kingfisher).
6. Glottal
o Constriction at glottis (vocal folds close).
Example: Glottal stop [q] (IPA [ʔ]).
Diagram – Places of Articulation
[ Lips ] → Labial (Bilabial, Labiodental)
[ Teeth ] → Dental
[ Alveolar ridge ]→ Alveolar
[ Palate ] → Palatal
[ Velum ] → Velar
[ Glottis ] → Glottal
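The place-of-articulation grouping above is the kind of table speech recognition systems use to cluster phones. A sketch (assignments follow the list above; the table is not exhaustive):

```python
# Sketch: place of articulation for some English consonants (ARPAbet).
PLACE = {
    "p": "bilabial", "b": "bilabial", "m": "bilabial",
    "f": "labiodental", "v": "labiodental",
    "th": "dental", "dh": "dental",
    "t": "alveolar", "d": "alveolar", "s": "alveolar", "z": "alveolar",
    "sh": "palato-alveolar", "ch": "palato-alveolar",
    "k": "velar", "g": "velar", "ng": "velar",
    "q": "glottal",
}

def is_coronal(phone):
    """'Coronal' covers both dental and alveolar sounds."""
    return PLACE.get(phone) in ("dental", "alveolar")
```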
Consonants: Manner of Articulation:
Definition:
• Manner of articulation → How the restriction in airflow is made during consonant
production.
• Along with place of articulation, it usually identifies a consonant uniquely.
1. Stop (Plosive)
• Airflow completely blocked for a short time → followed by an explosive release.
• Two phases:
o Closure → period of complete blockage.
o Release → explosion of air.
• Examples:
o Voiced stops: [b], [d], [g]
o Unvoiced stops: [p], [t], [k]
• Other notes:
o Also called plosives.
o Some systems distinguish closure and release separately (e.g., ARPAbet [pcl],
[tcl], [kcl]).
o Unreleased stops → no explosive release (e.g., end of words)
▪ ARPAbet: [pd], [td], [kd], [bd], [dd], [gd]
▪ IPA: [p̚], [t̚], [k̚]
o In this chapter → [p], [t], [k] = full stop with both closure & release.
2. Nasal
• Produced by lowering the velum → allows air to pass into the nasal cavity.
• Examples: [n], [m], [ŋ] (ng).
3. Fricatives
• Airflow constricted but not fully blocked → causes turbulent airflow.
• Produces a “hissing” sound.
• Examples by place:
o Labiodental: [f], [v] → lower lip against upper teeth.
o Dental: [θ] (“th” in thing), [ð] (“th” in though) → air flows around tongue
between teeth.
o Alveolar: [s], [z] → tongue against alveolar ridge, forcing air over teeth.
o Palato-alveolar: [ʃ] (“sh”), [ʒ] (“zh” as in Asian) → tongue at back of alveolar
ridge, air through tongue groove.
• Sibilants → higher-pitched fricatives: [s], [z], [ʃ], [ʒ].
• Affricates → stop immediately followed by fricative: [tʃ] (“ch”), [dʒ] (“jh” as in
giraffe).
4. Approximants
• Articulators close together but not close enough for turbulence.
• Examples:
o [j] (“y” in yellow) → tongue near roof of mouth, no turbulence.
o [w] (“w” in wood) → back of tongue near velum.
o American [r] →
▪ Tongue tip near palate, OR
▪ Whole tongue bunched near palate.
o [l] → tip of tongue against alveolar ridge/teeth, sides lowered so air flows
over them.
▪ Called lateral sound (air passes along tongue sides).
5. Tap / Flap
• Quick motion of tongue against alveolar ridge.
• Example: [ɾ] → middle of lotus ([l ow dx ax s]) in American English.
• In many UK dialects → realized as [t] rather than a flap.
Diagram Needed
You should draw or insert:
1. Airflow blockage & release (for stops).
2. Velum lowered for nasals.
3. Constriction points for fricatives.
4. Positions for approximants & laterals.
5. Tap/flap quick contact motion.
Vowels:
Definition:
• Like consonants, vowels are described by articulator positions during production.
• Three main parameters:
1. Vowel height → height of highest part of tongue.
2. Vowel frontness/backness → position of highest tongue point (toward front
or back).
3. Lip shape → rounded or unrounded.
1. Vowel Height
• High vowels → tongue raised high.
• Mid vowels → tongue in mid position.
• Low vowels → tongue lowered.
• Examples:
o High front: [iy] (heed)
o Low front: [ae] (had)
o High back: [uw] (who’d)
o [ih] higher than [eh] (both front vowels).
2. Vowel Frontness / Backness
• Front vowels → tongue highest point toward front of mouth.
o Examples: [iy], [ih], [eh], [ae]
• Back vowels → tongue highest point toward back of mouth.
o Examples: [uw], [uh], [ao], [aa]
3. Lip Shape
• Rounded vowels → lips rounded (like in whistling).
o Examples: [uw], [ao], [ow]
• Unrounded vowels → lips relaxed/spread.
4. Diphthongs
• Definition: vowel where tongue position changes markedly during production.
• Represented in vowel charts as vectors instead of points.
• English is rich in diphthongs.
• Examples: [ay] (eye), [aw] (cow), [oy] (boy).
5. Important Notes
• Vowel height is schematic → correlates with acoustic patterns more than exact
tongue position.
• Figures to understand:
o Fig. 7.5 → tongue position examples for [iy], [ae], [uw].
o Fig. 7.6 → schematic vowel chart (high, mid, low; front–back positions;
monophthongs vs diphthongs).
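The three vowel parameters above (height, frontness, rounding) can be represented as feature tuples; the values below are illustrative readings of the chapter's examples, not a complete vowel chart:

```python
# Sketch: a few ARPAbet vowels described by (height, frontness, rounded).
VOWEL_FEATURES = {
    "iy": ("high", "front", False),  # heed
    "ae": ("low",  "front", False),  # had
    "uw": ("high", "back",  True),   # who'd
    "ao": ("mid",  "back",  True),
}

def rounded_vowels(features=VOWEL_FEATURES):
    """List the vowels marked as lip-rounded."""
    return sorted(v for v, (_, _, r) in features.items() if r)
```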
UNIT 5
Speech synthesis
Speech Synthesis – Overview
Historical Background
• 1769, Vienna – Wolfgang von Kempelen built The Mechanical Turk for Empress
Maria Theresa.
o Appearance: Wooden box with gears + robot mannequin moving chess pieces
via mechanical arm.
o Toured Europe & Americas → defeated Napoleon Bonaparte, played
Charles Babbage.
o Hoax → secretly operated by a hidden human chess player.
• 1769–1790 – Von Kempelen also built first full-sentence speech synthesizer (not a
hoax).
o Components:
▪ Bellows → simulated lungs.
▪ Rubber mouthpiece + nose aperture.
▪ Reed → simulated vocal folds.
▪ Whistles → fricatives.
▪ Auxiliary bellows → puff of air for plosives.
▪ Flexible leather “vocal tract” → adjusted to produce different
consonants & vowels.
o Operation: Controlled by moving levers with both hands, opening/closing
passages.
Modern Speech Synthesis (Text-to-Speech, TTS)
• Definition: Generating speech (acoustic waveforms) from text input.
• Applications:
1. Conversational agents – telephone-based dialogue systems (with speech
recognition).
2. Non-conversational systems – reading aloud for the blind, video games, toys.
3. Assistive communication – e.g., Stephen Hawking (ALS) typed words →
synthesizer speech output.
• Limitations:
o Even best systems may sound wooden.
o Limited voice variety.
• State of the art:
o Can produce remarkably natural speech for wide range of inputs.
Basic TTS Process
• Text Analysis – Convert text into phonemic internal representation.
o Includes: Expanding acronyms (e.g., PG&E → “P G AND E”), converting
numbers (20 → “twentieth”), assigning phone sequences, prosody, and
phrasing.
• Waveform Synthesis – Convert internal representation into speech waveform.
Example (Fig. 8.1)
Sentence: PG&E will file schedules on April 20.
• Converted to:
o Expanded words: P G AND E WILL FILE SCHEDULES ON APRIL
TWENTIETH
o Phones: p iy jh iy ae n d iy w ih l f ay l s k eh jh ax l z aa n ey p r ih l t w eh n
t iy ax th
o Prosodic markers: * * * L-L%
Main Waveform Synthesis Paradigms
1. Concatenative synthesis (focus of chapter) –
o Store recorded speech samples in database.
o Chop and recombine to form new sentences.
o Unit selection synthesis → chooses best matching units from large database.
2. Formant synthesis –
o Model resonance characteristics of vocal tract.
o Fully artificial, not sample-based.
3. Articulatory synthesis –
o Simulate physical processes of speech production.
Architecture Example
• Hourglass metaphor (Taylor, 2008) → Two-step narrowing from text to phones, then
expanding from phones to waveform.
• Modern commercial TTS → Mostly based on concatenative unit selection synthesis.
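The two-step architecture above can be sketched as a pipeline. Both stages below are stubs with toy tables; a real system would perform full text normalization, grapheme-to-phoneme conversion, prosody prediction, and unit selection:

```python
# Architecture sketch of the two-step TTS pipeline: text analysis
# produces an internal representation, waveform synthesis renders it.

def text_analysis(text):
    """Text -> internal representation (stub with a toy expansion)."""
    expanded = text.replace("PG&E", "P G AND E")  # toy normalization
    return {"words": expanded.split(), "prosody": None}

def waveform_synthesis(internal):
    """Internal representation -> waveform samples (stub)."""
    # A concatenative unit-selection synthesizer would pick and join
    # recorded units here; we return placeholder silence instead.
    return [0.0] * (100 * len(internal["words"]))

def tts(text):
    return waveform_synthesis(text_analysis(text))
```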
Text Normalization:
Purpose
• Goal: Convert raw text into a phonemic internal representation for speech
synthesis.
• Why needed: Raw text contains abbreviations, numbers, punctuation quirks, and
irregular formats that must be standardized for correct pronunciation.
Main Steps in Text Normalization
1. Sentence Tokenization
• Definition: Segmenting text into separate sentences/utterances for synthesis.
• Challenges:
o Abbreviations with periods – Avoid splitting sentences incorrectly.
▪ Example: B.C. Hydro → do not treat the period after B.C. as an end of
sentence.
o Non-standard sentence boundaries – Detect sentences ending with
punctuation other than a period.
▪ Example: Sentence ending after “collected” even though it uses a
colon instead of a period.
2. Handling Non-Standard Words (NSWs)
• Definition: Words/symbols that require expansion into standard spoken form.
• Types & Examples:
o Dates:
▪ March 31 → “March thirty-first” (not “March three one”).
o Numbers:
▪ $1 billion → “one billion dollars” (insert word dollars after “billion”).
o Acronyms:
▪ Expand into letter-by-letter or full-word form as appropriate (e.g.,
PG&E → “P G and E”).
o Abbreviations:
▪ Must identify correct expansions (e.g., Dr. → “Doctor”, St. → “Street”
or “Saint” depending on context).
Example from Enron Corpus (Klimt & Yang, 2004)
• Raw text contains:
o Abbreviation with period (B.C. Hydro).
o Date (March 31).
o Currency & number ($1 billion).
o Colon as sentence end (“collected:” → signals end of an utterance).
Key Output of Text Normalization
• Clean, sentence-segmented text.
• All non-standard words expanded into their full spoken forms.
• Ready for phonetic analysis and prosodic analysis in the TTS pipeline.
8.1.1 Sentence Tokenization
Definition
• Sentence Tokenization = Process of detecting sentence boundaries in text.
• Goal: Correctly identify where sentences begin and end for speech synthesis.
Challenges in Sentence Tokenization
1. Punctuation Ambiguity
• Periods are not always sentence boundaries.
• Examples:
o Abbreviations – B.C. Hydro (period is part of abbreviation, not sentence end).
o Dual Role – When abbreviation ends a sentence (Dr. J. M. Freeman).
o Non-period boundaries – Colons can signal sentence end:
▪ collected: “We continue…”
o Example references: (8.2), (8.3), (8.4).
2. Period Disambiguation
• Definition: Deciding whether a period marks End-of-Sentence (EOS) or not.
• Approach:
o Early method: Simple Perl scripts (Ch. 3).
o Modern method: Machine Learning–based EOS classifier.
Machine Learning Approach for Sentence Tokenization
Training Process
1. Hand-label a training set with correct sentence boundaries.
2. Tokenize text into tokens separated by whitespace.
3. Select candidate tokens containing potential boundary punctuation:
o ., !, ? (possibly also :).
4. Train a classifier to predict EOS vs not-EOS for each candidate.
Features for EOS Classification
A. Basic Feature Templates
• Prefix: Text before punctuation in the token.
• Suffix: Text after punctuation in the token.
• Abbreviation Check: Whether prefix/suffix is in abbreviation list.
• Previous Word and Next Word.
• Whether previous word is abbreviation.
• Whether next word is abbreviation.
Example (8.5): ANLP Corp. chairman Dr. Smith resigned.
• Candidate = period in "Corp."
o PreviousWord = ANLP
o NextWord = chairman
o Prefix = Corp
o Suffix = NULL
o PreviousWordAbbreviation = 1
o NextWordAbbreviation = 0
B. Lexical Probability Features
• Probability that the candidate token occurs at end of sentence.
• Probability that the word after candidate occurs at beginning of sentence.
C. Language-Specific Features
• Capitalization patterns:
o Case of candidate word: Upper, Lower, AllCap, Numbers.
o Case of following word: Upper, Lower, AllCap, Numbers.
• Special abbreviation classes:
o Honorifics/Titles – Dr., Mr., Gen.
o Corporate designators – Corp., Inc.
o Month abbreviations – Jan., Feb.
Classification Methods
• Common algorithms: Logistic Regression, Decision Trees.
• Logistic Regression → Often higher accuracy.
• Decision Trees → More interpretable (example shown in Fig. 8.3).
• Features used:
o Log likelihood of current word being sentence start (bprob).
o Log likelihood of previous word being sentence end (eprob).
o Capitalization of next word.
o Abbreviation subclass (company, state, unit of measurement).
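The feature templates above can be sketched as an extraction function for one candidate token; the abbreviation list is a toy example, and a real classifier (logistic regression or a decision tree) would be trained on these features:

```python
# Sketch: EOS feature extraction for a candidate token with a period.
ABBREVIATIONS = {"corp", "dr", "inc", "st", "gen", "mr"}

def eos_features(tokens, i):
    tok = tokens[i]
    prefix, _, suffix = tok.partition(".")
    prev_word = tokens[i - 1] if i > 0 else "<s>"
    next_word = tokens[i + 1] if i + 1 < len(tokens) else "</s>"
    return {
        "prefix": prefix,
        "suffix": suffix or "NULL",
        "prev_word": prev_word,
        "next_word": next_word,
        "prefix_is_abbrev": int(prefix.lower() in ABBREVIATIONS),
        "next_is_capitalized": int(next_word[:1].isupper()),
    }
```

Running it on example (8.5), the candidate period in "Corp." yields Prefix=Corp, Suffix=NULL, PreviousWord=ANLP, NextWord=chairman, matching the values listed above.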
Non-Standard Words in Text Normalization:
Definition
Non-standard words (NSWs) are tokens like numbers or abbreviations that must be expanded
into full English words before pronunciation.
Ambiguity in NSWs
• Numbers can have multiple pronunciations depending on context:
• 1750 → seventeen fifty (year)
• 1750 → one seven five zero (password)
• 1750 dollars → seventeen hundred and fifty OR one thousand seven hundred and fifty
• Roman numerals (e.g., IV) can mean four, fourth, or letters "I V" (intravenous).
• Fractions like 2/3 → two thirds, February third, or two slash three.
Types of NSWs
• Numbers
• NUM: number (cardinal) – 12, 45, 1/2, 0.6
• NORD: ordinal – 3rd, May 7
• NDIG: number as digits – Room 101
• NIDE: identifier – 747, I5
• NADDR: street address – 386 Main St.
• NZIP: zip code – 91020
• NTIME: time – 3.20, 11:45
• NDATE: date – 2/28/05
• NYER: years – 1998, 80s, 2008
• MONEY: $3.45, ¥20,200
• BMONEY: billions/trillions – $3.2 billion
• PRCT: percentage – 75%
• Alphabetic
• EXPN: abbreviations – N.Y., mph, gov’t
• LSEQ: letter sequences – DVD, IBM
• ASWD: acronyms as words – NASA, IKEA
• Realization Patterns
• Paired method (e.g., years) → seventeen fifty for 1750.
• Serial method (e.g., zip codes) → nine four one one zero.
• BMONEY → always read with currency word at the end.
Processing NSWs
Step 1: Tokenization
• Tokenize by whitespace.
• Identify tokens not in the pronunciation dictionary as NSWs.
Handle:
• Known abbreviations in dictionaries (e.g., st, mr, mrs, mon, tues, nov).
• Single-character tokens.
• Hyphenated words (2-car).
• CamelCase words (RVing).
Step 2: Classification
• Assign NSW type (NUM, LSEQ, EXPN, etc.) using:
• Regular expressions (e.g., NYER: /1[89][0-9][0-9]|20[0-9][0-9]/).
• Machine learning classifiers with features:
o Letter patterns (all caps, two vowels, contains slash, token length).
o Context words (Chapter, on, king).
o Neighboring word identity.
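A regex-first classification step can be sketched as follows. The NYER pattern is the one quoted above; the other patterns and the all-caps LSEQ fallback are illustrative guesses that a real system would back up with a trained classifier:

```python
import re

# Sketch: regex-based NSW type tagging (partial, illustrative).
NSW_PATTERNS = [
    ("NYER",  re.compile(r"^(1[89][0-9][0-9]|20[0-9][0-9])$")),
    ("PRCT",  re.compile(r"^[0-9]+%$")),
    ("MONEY", re.compile(r"^\$[0-9][0-9,.]*$")),
    ("NUM",   re.compile(r"^[0-9]+$")),
]

def classify_nsw(token):
    for label, pattern in NSW_PATTERNS:
        if pattern.match(token):
            return label
    if token.isalpha() and token.isupper():
        return "LSEQ"  # crude: ASWD vs LSEQ (NASA vs IBM) really needs ML
    return "EXPN"
```

Note that pattern order matters: "1998" matches both NYER and NUM, so NYER is tried first.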
Step 3: Expansion
o EXPN: needs abbreviation dictionary + homonym disambiguation.
• LSEQ: expand letter by letter.
• ASWD: keep as is.
• NUM/NORD: expand to cardinal/ordinal words.
• NDIG/NZIP: digit-by-digit expansion.
• NYER: two-digit pairs (nineteen eighty), except:
➢ Ends in "00" → read as cardinal (two thousand).
➢ Hundreds method → eighteen hundred.
• NTEL: digit sequence, or paired digits for last four, or trailing unit method (five
thousand for last digits).
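The NYER expansion rules above (two-digit pairs, with the "ends in 00" exceptions) can be sketched directly:

```python
# Sketch: expanding a four-digit year by the paired two-digit method,
# with the cardinal and hundreds special cases described above.
ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine"]
TEENS = ["ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
         "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty",
        "seventy", "eighty", "ninety"]

def two_digits(n):
    if n < 10:
        return ONES[n]
    if n < 20:
        return TEENS[n - 10]
    word = TENS[n // 10]
    return word if n % 10 == 0 else word + " " + ONES[n % 10]

def expand_year(year):
    if year % 100 == 0:
        if year % 1000 == 0:          # e.g. 2000 -> cardinal
            return ONES[year // 1000] + " thousand"
        return two_digits(year // 100) + " hundred"  # e.g. 1800
    return two_digits(year // 100) + " " + two_digits(year % 100)
```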
Dialect & Language-Specific Variations
• Australian English → double three for "33".
• French: depends on gender of noun (un garçon, une fille).
• German: morphological case affects pronunciation (Heinrich IV changes form).
Homograph Disambiguation:
Goal:
• To determine correct pronunciation for homographs — words with the same spelling
but different pronunciations.
Definition & Examples
Homographs: Same spelling, different pronunciations.
• English examples:
1. use
• noun: /y uw s/ → It’s no use
• verb: /y uw z/ → to use the telephone
2. live
• verb: /l ih v/ → Do you live near a zoo
• adjective: /l ay v/ → live animals
3. bass
• fish: /b ae s/ → bass fishing
• instrument: /b ey s/ → bass guitar
French examples:
• fils: [fis] ‘son’ vs [fil] ‘thread’
• fier: ‘proud’ vs ‘to trust’
• est: ‘is’ vs ‘East’
Part-of-Speech (POS) as a Disambiguation Tool
In English (and similar languages like French, German), different homograph forms often
differ in part-of-speech.
Example:
• use: noun vs verb
• live: verb vs adjective
Fig. 8.5 Patterns:
o Final voicing: noun: /s/ vs verb: /z/ → use, close, house
o Stress shift: noun: initial stress vs verb: final stress → record, insult, object
o -ate final vowel weakening
• noun/adjective: final /ax/ vs verb: final /ey/ → estimate, separate, moderate
Statistical Evidence
Liberman & Church (1992):
• Many frequent homographs in AP newswire corpus can be disambiguated via POS.
• Top 15 most frequent homographs: use, increase, close, record, house, contract, lead,
live, lives, protest, survey, project, separate, present, read
Standard Disambiguation Approach
Store distinct pronunciations for homographs labeled by POS.
• Run a POS tagger to select the correct pronunciation in context.
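The standard approach can be sketched as a POS-keyed pronunciation table; the POS tags below are generic placeholders, and a real system would take them from a tagger run over the sentence:

```python
# Sketch: homograph pronunciations keyed by (word, POS), following the
# examples above. ARPAbet strings; the table is a toy sample.
HOMOGRAPHS = {
    ("use",  "NOUN"): "y uw s",
    ("use",  "VERB"): "y uw z",
    ("live", "VERB"): "l ih v",
    ("live", "ADJ"):  "l ay v",
}

def pronounce_homograph(word, pos):
    """Return the POS-appropriate pronunciation, or None if unlisted."""
    return HOMOGRAPHS.get((word.lower(), pos))
```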
When POS Alone Is Not Enough
o Same POS, different pronunciations:
• bass: fish (/b ae s/) vs instrument (/b ey s/)
• lead: noun (/l iy d/) leash vs noun (/l eh d/) metal
Abbreviation ambiguity:
• Dr. → doctor vs drive
• St. → Saint vs street
Capitalization differences:
• polish vs Polish (homographs in sentence-initial or all-caps text)
Practical TTS Handling
o Many of these harder homographs are ignored in TTS systems.
• Alternatively, use Word Sense Disambiguation (WSD) methods:
• Example: Decision-list algorithm (Yarowsky, 1997)
Phonetic Analysis:
Purpose:
• Take normalized word strings from text analysis and generate pronunciations for each
word.
Main Component
• Large pronunciation dictionary (core tool for phonetic analysis).
Why Dictionaries Alone Are Not Enough
• Running text often contains words not in the dictionary.
• Example (Black et al., 1998):
o British English dictionary: OALD lexicon tested on first section of Penn Wall
Street Journal Treebank.
o Total words (tokens): 39,923
o Not in dictionary: 1,775 tokens (4.6%)
▪ Unique types: 943
o Distribution of unseen word tokens:
Category Count %
Names 1,360 76.6%
Unknown words 351 19.8%
Typos & others 64 3.6%
• Two main areas needing augmentation:
1. Handling names
2. Handling other unknown words
Process Order
1. Dictionaries
2. Names
3. Grapheme-to-phoneme rules for other unknown words
8.2.1 Dictionary Lookup
Phonetic Dictionaries
• Introduced earlier in Ch. 8.
• Commonly used in TTS:
o CMU Pronouncing Dictionary (CMUdict, 1993)
▪ ~120,000 words
▪ Pronunciations: roughly phonemic from 39-phone ARPAbet-derived
set
▪ Phonemic transcription:
▪ No surface reductions (like [ax], [ix]) directly written
▪ Each vowel has stress tags:
▪ 0 → unstressed
▪ 1 → stressed
▪ 2 → secondary stress
▪ Non-diphthong vowels with stress 0 usually → [ax] or [ix]
▪ Most words have one pronunciation, ~8,000 words have two or three
▪ Some phonetic reductions are reflected
▪ Not syllabified (nucleus implicitly marked by numbered vowel)
▪ Limitations:
▪ Designed for speech recognition, not synthesis
▪ Does not specify which pronunciation to choose for synthesis
▪ No syllable boundaries
▪ Headwords are all upper-cased → can’t distinguish US from us
▪ US: [AH1 S] or [Y UW1 EH1 S]
Sample CMUdict Entries (Fig. 8.6)
ANTECEDENTS AE2 N T IH0 S IY1 D AH0 N T S
PAKISTANI P AE2 K IH0 S T AE1 N IY0
CHANG CH AE1 NG
TABLE T EY1 B AH0 L
DICTIONARY D IH1 K SH AH0 N EH2 R IY0
TROTSKY T R AA1 T S K IY2
DINNER D IH1 N ER0
WALTER W AO1 L T ER0
LUNCH L AH1 N CH
WALTZING W AO1 L T S IH0 NG
MCFARLAND M AH0 K F AA1 R L AH0 N D
WALTZING(2) W AO1 L S IH0 NG
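Entries in this format are easy to load programmatically: each line is a headword followed by space-separated ARPAbet phones, with a “(2)” suffix marking an alternate pronunciation. A minimal parser over the Fig. 8.6 entries:

```python
# Parse CMUdict-style entries: headword + space-separated ARPAbet phones,
# with "(n)" marking alternate pronunciations.
import re
from collections import defaultdict

LINES = [
    "TABLE T EY1 B AH0 L",
    "WALTZING W AO1 L T S IH0 NG",
    "WALTZING(2) W AO1 L S IH0 NG",
]

lexicon = defaultdict(list)
for line in LINES:
    head, *phones = line.split()
    word = re.sub(r"\(\d+\)$", "", head)   # strip the variant index
    lexicon[word].append(phones)

print(lexicon["WALTZING"])
# [['W', 'AO1', 'L', 'T', 'S', 'IH0', 'NG'], ['W', 'AO1', 'L', 'S', 'IH0', 'NG']]
```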
UNISYN Dictionary
• ~110,000 words
• Freely available for research
• Designed for speech synthesis (Fitt, 2002)
• Features:
o Syllabifications
o Stress markings
o Some morphological boundaries
o Can produce pronunciations in multiple English dialects:
▪ General American
▪ RP British
▪ Australian English
▪ …and dozens more
o Slightly different phone set than CMUdict
UNISYN Examples:
going: { g * ou }.> i ng >
antecedents: { * a n . tˆ i . s ˜ ii . d n! t }> s >
dictionary: { d * i k . sh @ . n ˜ e . r ii }
8.2.2 Names
Importance of Names in Speech Synthesis
• Names are a major source of pronunciation errors in TTS.
• Categories of names:
1. Personal names – first names and surnames
2. Geographical names – cities, streets, and other place names
3. Commercial names – company and product names
Scale of the Problem
• Personal names in the U.S. (Spiegel, 2003, based on Donnelly and household lists):
o ~2 million different surnames
o ~100,000 first names
• Comparison:
o 2 million surnames is an order of magnitude larger than the entire CMU
Pronouncing Dictionary (~120,000 words).
Large-Scale TTS Solutions
• Most large-scale TTS systems include a large name pronunciation dictionary.
• CMU Pronouncing Dictionary:
o Contains many names
o Includes:
▪ The most frequent 50,000 surnames (from Bell Labs estimate of U.S.
personal name frequency)
▪ ~6,000 first names
Coverage Studies
• Liberman & Church (1992):
o A dictionary of 50,000 names covered 70% of name tokens in a 44-million-
word AP newswire corpus.
o Many remaining names could be derived from these 50,000:
▪ Up to 97.43% coverage possible by applying simple modifications to
the known names.
Name Formation Techniques
1. Adding stress-neutral suffixes to known names
o Examples:
▪ walters = walter + s
▪ lucasville = lucas + ville
▪ abelson = abel + son
2. Rhyme analogy
o Example:
▪ Known: Trotsky → /tr/ replaced with /pl/ → Plotsky
3. Morphological decomposition
4. Analogical formation
5. Mapping unseen names to spelling variants already in the dictionary
o (Fackrell & Skut, 2004)
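Technique 1 (stress-neutral suffixes) can be sketched as suffix stripping plus stem lookup. The suffix-to-phone mapping below is an illustrative assumption, not Liberman & Church's actual rule set.

```python
# Sketch of deriving pronunciations for unseen names by stripping a
# stress-neutral suffix and looking up the remaining stem.

KNOWN = {
    "walter": "W AO1 L T ER0",
    "abel":   "EY1 B AH0 L",
}
SUFFIXES = {             # spelling suffix -> phones to append (assumed)
    "s":     "Z",
    "son":   "S AH0 N",
    "ville": "V IH0 L",
}

def derive(name):
    name = name.lower()
    for suf, phones in SUFFIXES.items():
        stem = name[: -len(suf)]
        if name.endswith(suf) and stem in KNOWN:
            return KNOWN[stem] + " " + phones
    return None  # fall through to other techniques / G2P

print(derive("walters"))  # W AO1 L T ER0 Z
print(derive("abelson"))  # EY1 B AH0 L S AH0 N
```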
Current Challenges
• Name pronunciation remains difficult despite these techniques.
• Many modern systems handle unknown names via grapheme-to-phoneme (G2P)
methods:
o Often build two separate predictive systems:
▪ One for names
▪ One for non-names
Additional Resources
• Spiegel (2003, 2002) provides further details on issues in proper name
pronunciation.
8.2.3 Grapheme-to-Phoneme (G2P)
Definition
• Grapheme-to-Phoneme Conversion (G2P):
Converting a sequence of letters (graphemes) into a sequence of phones (phonemes).
o Example: "cake" → [K EY K]
• Also called Letter-to-Sound (LTS) conversion.
Historical Approach – Hand-Written Rules
• Based on Chomsky–Halle phonological rewrite rules (Chapter 7 format).
• Rules are applied in order:
o Earlier rules = context-specific
o Later rules = default (apply only if earlier ones don’t)
• Example simple rules for letter c:
1. c → [k] / _ {a, o} V (context-dependent: before a or o)
2. c → [s] (default, context-independent)
• Actual rules are more complex:
o c → [ch] in cello, concerto
• Stress rules for English are especially complex:
o Example rule from Allen et al. (1987):
Assign primary stress in specific contexts involving weak syllables and
morpheme-final positions:
1. Before a weak syllable + morpheme-final short vowel + consonants
(e.g., difficult)
2. Before a weak syllable + morpheme-final vowel (e.g., oregano)
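The ordered-rule idea for the letter c can be sketched directly: context-specific rules are tried first and the context-independent default applies last. This is a simplified version of the two rules above; exceptions like cello are ignored.

```python
# Ordered letter-to-sound rules for the letter c (simplified sketch):
# context-specific rule first, context-independent default last.

def phone_for_c(word, i):
    """Return a phone for the letter c at position i in word."""
    nxt = word[i + 1] if i + 1 < len(word) else ""
    if nxt in "ao":      # rule 1: c -> [k] before a or o
        return "K"
    return "S"           # rule 2: default c -> [s]

print(phone_for_c("cat", 0))   # K
print(phone_for_c("cell", 0))  # S
```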
Modern Approach – Probabilistic Methods
• First formalized by Lucassen & Mercer (1984):
o Goal: Find the most probable phone sequence P for a letter sequence L:
P̂ = argmax_P P(P | L)
• Requires:
o Training set: Words with spelling + pronunciation
o Test set: New words to predict
Step 1 – Letter-to-Phone Alignment (Training Set)
• Needed: Which phones align with which letters.
• Some letters align to:
o Multiple phones: x → [k s]
o No phones: final e in cake → ε (null)
• Example alignment:
L: c a k e
| | | |
P: K EY K ε
Semi-Automatic Alignment Method (Black et al., 1998)
• Relies on allowable phone lists for each letter.
• Example:
o c: k, ch, s, sh, t-s, ε
o e: ih, iy, er, ax, ah, eh, ey, uw, ay, ow, y-uw, oy, aa, ε
• Process:
1. For each word, find all alignments matching allowable lists.
2. Count (letter, phone) pairs across all alignments.
3. Compute probability:
P(p_i | l_j) = count(p_i, l_j) / count(l_j)
4. Use Viterbi algorithm to find best alignment A for each (P, L) pair.
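Step 3 of this process can be sketched as simple relative-frequency estimation over pooled (letter, phone) pairs. The pairs below are a tiny toy sample, not counts from a real aligned lexicon; "eps" stands for the null phone.

```python
# Estimate P(phone | letter) from (letter, phone) pair counts pooled
# over candidate alignments (toy data).
from collections import Counter

pairs = [("c", "K"), ("a", "EY"), ("k", "K"), ("e", "eps"),
         ("c", "S"), ("a", "AE"), ("c", "K")]

pair_counts = Counter(pairs)
letter_counts = Counter(l for l, _ in pairs)

def p_phone_given_letter(phone, letter):
    return pair_counts[(letter, phone)] / letter_counts[letter]

print(p_phone_given_letter("K", "c"))  # 2/3
```

These probabilities would then score candidate alignments, with the Viterbi algorithm picking the best one per word.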
Step 2 – Predicting Phones for New Words (Test Set)
• Train machine learning classifier (e.g., decision tree) on aligned training set.
• Predicts most probable phone for each letter.
Features for Decision Tree
1. Window of surrounding letters:
o k previous + k following letters
o Example:
▪ "cat": a → [AE]
▪ "cake": a → [EY] (influenced by final e)
2. Previous predicted phone:
o Gives phonotactic context
o Must process left-to-right to use previous output.
3. Stress prediction:
o Augment vowels with stress info:
▪ 2 levels: AE / AE1
▪ 3 levels: AE0, AE1, AE2 (as in CMU lexicon)
4. Part-of-speech tag of the word:
o Even for unknown words (estimated by POS tagger)
5. Previous vowel stress status
6. Letter classes:
o Consonants, vowels, liquids, etc.
7. Following word features (language-specific phenomena):
o Example: French liaison:
▪ six → [sis] (j’en veux six)
▪ [siz] (six enfants)
▪ [si] (six filles)
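Feature 1 and feature 2 from the list above can be sketched as a small extraction function: a window of k letters on each side plus the previously predicted phone. The "#" padding symbol is an assumption.

```python
# Sketch of feature extraction for the letter-window classifier.

def features(word, i, prev_phone, k=2):
    padded = "#" * k + word + "#" * k
    window = padded[i : i + 2 * k + 1]   # letter i with k neighbors each side
    return {
        "window": window,
        "letter": word[i],
        "prev_phone": prev_phone,
    }

# The final e of "cake" falls inside the window for a, letting a
# classifier learn [EY] here vs [AE] in "cat".
print(features("cake", 1, "K"))
# {'window': '#cake', 'letter': 'a', 'prev_phone': 'K'}
```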
Special Handling for Names
• Many systems build two separate G2P models:
1. For unknown personal names
2. For other unknown words
• Additional features for names:
o Foreign language origin (predicted via letter sequence n-gram models)
Model Type
• Decision tree = conditional classifier:
o Finds phone sequence with highest conditional probability given grapheme
sequence.
• More recent models:
o Joint classifier:
▪ Hidden state = graphone (grapheme + phoneme pair)
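The graphone idea is just a change of representation: the hidden units are joint (grapheme, phoneme) pairs, and a joint model places an n-gram distribution over sequences of them. A hand-written segmentation of "cake" illustrates the data structure (the segmentation itself would normally be learned):

```python
# "cake" as a graphone sequence; the silent e pairs with the null phone.
cake = [("c", "K"), ("a", "EY"), ("k", "K"), ("e", "")]

graphemes = "".join(g for g, _ in cake)
phones = [p for _, p in cake if p]

print(graphemes)  # cake
print(phones)     # ['K', 'EY', 'K']
```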