NLSP 4

The document discusses the historical context and evolution of writing systems, particularly focusing on the debate between 'whole language' and 'phonics' in reading instruction. It outlines the development of sound-based writing systems from logographic origins and introduces the study of phonetics, including articulatory and acoustic phonetics. Additionally, it emphasizes the importance of phonetic transcription systems like ARPAbet and IPA in understanding and representing speech sounds.

Uploaded by Safrin Fathima

UNIT 4

Phonetics
Introduction
1. Debate Context

• The “whole language” vs. “phonics” debate in teaching reading seems modern but
mirrors an older historical debate in writing systems.

2. Historical Writing Systems

• Earliest independently invented writing systems (Sumerian, Chinese, Mayan) were
mainly logographic: one symbol = a whole word.

• Even early logographic systems included syllabic or phonemic elements where
symbols represented sounds.

• Example: The Sumerian symbol for “ration” (pronounced ba) could also function
purely as the sound /ba/.

• Modern Chinese, though mostly logographic, uses sound-based characters for foreign
words.

3. Development of Purely Sound-Based Writing Systems

• Sound-based writing can be:

o Syllabic (e.g., Japanese hiragana, katakana)

o Alphabetic (e.g., Roman alphabet)

o Consonantal (e.g., Semitic scripts)

• These systems often evolved from early logo-syllabic systems, usually when cultures
interacted.

• The Arabic, Aramaic, Hebrew, Greek, and Roman alphabets all came from a West
Semitic script, likely adapted by Western Semitic mercenaries from a cursive form of
Egyptian hieroglyphs.

• Japanese syllabaries came from cursive Chinese characters used to represent sounds.

• Those Chinese characters had been used to phonetically represent Sanskrit in
Buddhist scriptures during China’s Tang dynasty.

4. Conceptual Foundation (Ur-theory)

• Sound-based writing implies the spoken word is made up of smaller units of
speech.
• This idea underlies modern phonology.
• The decomposition of speech into smaller units is also the basis for:

o Speech recognition (turning acoustic waveforms into text)

o Speech synthesis / Text-to-speech (turning text into acoustic waveforms)

5. Focus of Chapter 7: Phonetics


• Phonetics studies linguistic sounds:

o Production by the articulators of the human vocal tract

o Acoustic realization of those sounds

o Digitization & processing of acoustic signals

6. Phones and Pronunciation in Technology

• Phones = individual speech units.

• Speech recognition systems need pronunciations for every word they can recognize.
• Text-to-speech systems need pronunciations for every word they can say.
• Phonetic alphabets are used to describe these pronunciations.

7. Two Main Areas of Phonetics

• Articulatory phonetics → how speech sounds are produced in the mouth.

• Acoustic phonetics → acoustic analysis of speech sounds.

8. Link to Phonology

• Phonology studies:

o How sounds vary systematically in different environments.


o How the sound system connects to the rest of grammar.

• Variation in pronunciation depending on context is crucial in speech modeling.


Speech Sounds and Phonetic Transcription:

1. Phonetics

o Study of speech sounds used in the languages of the world.

o Pronunciation of a word is modeled as a string of symbols representing
phones or segments.

2. Phones

o A phone is a speech sound.


o Represented with phonetic symbols that may resemble letters in an alphabetic
language (like English).

3. Purpose of this section


o Surveys different phones in English, especially American English.

o Explains how they are produced and how they are represented
symbolically.

4. Phonetic Alphabets

o International Phonetic Alphabet (IPA)


▪ Developed in 1888 by the International Phonetic Association.
▪ Goal: transcribe the sounds of all human languages.

▪ Includes both an alphabet and principles for transcription.

▪ Same utterance can be transcribed in different ways according to IPA
principles.

o ARPAbet (Shoup, 1980)

▪ Designed specifically for American English.

▪ Uses ASCII symbols.


▪ Can be seen as an ASCII form of an American-English subset of IPA.

▪ Common in online pronunciation dictionaries and computational
applications where non-ASCII fonts are inconvenient.

5. Choice in this book

o Will use ARPAbet instead of IPA for computational purposes.

o Figures 7.1 (consonants) and 7.2 (vowels) show ARPAbet symbols with IPA
equivalents.

6. Rare Phones Example

o [ux]: Rare in General American English.

▪ Represents a fronted [uw] found in Western/Northern Cities dialects
from the late 1970s.

▪ Popularized by imitations of “Valley Girls” (Moon Zappa, 1982).

▪ For most speakers, [uw] is still more common (e.g., dude [d uw d]).

7. ARPAbet Consonants – Examples


(With IPA equivalents)

o [p] → parsley [p aa r s l iy]

o [t] → tea [t iy]


o [k] → cook [k uh k]

o [b] → bay [b ey]

o [d] → dill [d ih l]

o [g] → garlic [g aa r l ix k]

o [m] → mint [m ih n t]

o [n] → nutmeg [n ah t m eh g]
o [ng] → baking [b ey k ix ng]
o [f] → flour [f l aw axr]

o [v] → clove [k l ow v]

o [th] → thick [th ih k]

o [dh] → those [dh ow z]


o [s] → soup [s uw p]

o [z] → eggs [eh g z]

o [sh] → squash [s k w aa sh]

o [zh] → ambrosia [ae m b r ow zh ax]

o [ch] → cherry [ch eh r iy]

o [jh] → jar [jh aa r]

o [l] → licorice [l ih k axr ix sh]


o [w] → kiwi [k iy w iy]
o [r] → rice [r ay s]

o [y] → yellow [y eh l ow]

o [h] → honey [h ah n iy]

o Rare consonants:

▪ [q] → uh-oh [q ah q ow] (glottal stop)

▪ [dx] → butter [b ah dx axr] (flap)

▪ [nx] → winner [w ih nx axr] (nasal flap)


▪ [el] → table [t ey b el] (syllabic consonant)

8. ARPAbet Vowels – Examples


(With IPA equivalents)

o [iy] → lily [l ih l iy]

o [ih] → lily [l ih l iy]

o [ey] → daisy [d ey z iy]

o [eh] → pen [p eh n]

o [ae] → aster [ae s t axr]

o [aa] → poppy [p aa p iy]


o [ao] → orchid [ao r k ix d]
o [uh] → wood [w uh d]

o [ow] → lotus [l ow dx ax s]

o [uw] → tulip [t uw l ix p]

o [ah] → buttercup [b ah dx axr k ah p]


o [er] → bird [b er d]

o [ay] → iris [ay r ix s]

o [aw] → sunflower [s ah n f l aw axr]

o [oy] → soil [s oy l]

o Reduced/Uncommon vowels:

▪ [ax] → lotus [l ow dx ax s] (schwa)

▪ [axr] → heather [h eh dh axr]


▪ [ix] → tulip [t uw l ix p] (reduced [ih])
▪ [ux] → dude [d ux d]

9. Orthography vs. Phonetic Symbols

o Many ARPAbet/IPA symbols match Roman letters (e.g., [p] in platypus).

o English spelling is opaque:

▪ Same letter can represent different sounds.

▪ Example: letter c → [k] in cougar [k uw g axr], but [s] in cell [s eh l].

o [k] can appear as:


▪ c (cougar)

▪ k (kangaroo)

▪ x (fox [f aa k s])

▪ ck (jackal [jh ae k el])

▪ cc (raccoon [r ae k uw n])

o Languages like Spanish have more transparent spelling-sound mapping than
English.

ARPAbet Consonants Table

ARPAbet IPA Example Word Example Transcription

p [p] parsley [p aa r s l iy]



t [t] tea [t iy]

k [k] cook [k uh k]

b [b] bay [b ey]

d [d] dill [d ih l]

g [g] garlic [g aa r l ix k]

m [m] mint [m ih n t]

n [n] nutmeg [n ah t m eh g]

ng [ŋ] baking [b ey k ix ng]

f [f] flour [f l aw axr]

v [v] clove [k l ow v]

th [θ] thick [th ih k]

dh [ð] those [dh ow z]

s [s] soup [s uw p]

z [z] eggs [eh g z]

sh [ʃ] squash [s k w aa sh]

zh [ʒ] ambrosia [ae m b r ow zh ax]

ch [tʃ] cherry [ch eh r iy]

jh [dʒ] jar [jh aa r]

l [l] licorice [l ih k axr ix sh]

w [w] kiwi [k iy w iy]

r [r] rice [r ay s]

y [j] yellow [y eh l ow]

h [h] honey [h ah n iy]

q [ʔ] uh-oh [q ah q ow]



dx [ɾ] butter [b ah dx axr]

nx [ɾ̃] winner [w ih nx axr]

el [l̩] table [t ey b el]

ARPAbet Vowels Table

ARPAbet IPA Example Word Example Transcription

iy [i] lily [l ih l iy]

ih [ɪ] lily [l ih l iy]

ey [eɪ] daisy [d ey z iy]

eh [ɛ] pen [p eh n]

ae [æ] aster [ae s t axr]

aa [ɑ] poppy [p aa p iy]

ao [ɔ] orchid [ao r k ix d]

uh [ʊ] wood [w uh d]

ow [oʊ] lotus [l ow dx ax s]

uw [u] tulip [t uw l ix p]

ah [ʌ] buttercup [b ah dx axr k ah p]

er [ɝ] bird [b er d]

ay [aɪ] iris [ay r ix s]

aw [aʊ] sunflower [s ah n f l aw axr]

oy [ɔɪ] soil [s oy l]

ax [ə] lotus [l ow dx ax s]

axr [ɚ] heather [h eh dh axr]

ix [ɨ] tulip [t uw l ix p]

ux [ʉ] dude [d ux d]
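The transcriptions in the tables above map directly onto a machine-readable lexicon, which is how TTS and speech-recognition systems use them. A minimal sketch, assuming a toy `LEXICON` dict built from a few of the example words (real systems use large dictionaries such as CMUdict):

```python
# Toy ARPAbet pronunciation lexicon built from example transcriptions
# in the tables above; purely illustrative, not a standard resource.
LEXICON = {
    "tea":    ["t", "iy"],
    "cook":   ["k", "uh", "k"],
    "soup":   ["s", "uw", "p"],
    "cherry": ["ch", "eh", "r", "iy"],
    "lotus":  ["l", "ow", "dx", "ax", "s"],
}

def transcribe(word):
    """Return the ARPAbet phone string for a word, or None if unknown."""
    phones = LEXICON.get(word.lower())
    return " ".join(phones) if phones else None
```

For example, `transcribe("lotus")` yields the phone string "l ow dx ax s", matching the table entry.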


ARTICULATORY PHONETICS

• Definition: Study of how phones (speech sounds) are produced as the mouth,
throat, and nose modify the airflow from the lungs.

• The ARPAbet phone list is meaningless without knowing how each phone is
produced.

7.2.1 The Vocal Organs

Airflow in Speech

• Speech is produced by rapid movement of air.


• Most sounds: Air from lungs → trachea (windpipe) → larynx → mouth or nose.

Larynx

• Known as Adam’s apple or voice box.

• Contains vocal folds (vocal cords) → 2 small muscle folds.

• Glottis: Space between vocal folds.

Voicing

• Vocal folds close (not tightly) → vibrate as air passes → Voiced sounds.

• Vocal folds far apart → no vibration → Unvoiced sounds.

Voiced Unvoiced

[b], [d], [g], [v], [z], all vowels [p], [t], [k], [f], [s], etc.

Vocal Tract
• Area above trachea.
• Two parts:

1. Oral tract (mouth)

2. Nasal tract (nose)

Nasal Sounds

• Air passes through nose (also resonates in mouth).

• Examples: m, n, ng.

Phones Classification

1. Consonants

o Restrict/block airflow.
o May be voiced or unvoiced.
o Examples: [p], [b], [t], [d], [k], [g], [f], [v], [s], [z], [r], [l].

2. Vowels

o Minimal obstruction.

o Usually voiced, louder & longer-lasting.

o Examples: [aa], [ae], [ao], [ih], [aw], [ow], [uw].

3. Semivowels

o Properties of both vowels & consonants.


o Voiced like vowels, short like consonants.

o Examples: [y], [w].
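The three-way classification above can be expressed as a simple set lookup over ARPAbet symbols; the sets below are the partial example lists from this section, and `phone_class` is a hypothetical helper:

```python
# Coarse phone-class lookup using the (partial) example sets above.
VOWELS = {"aa", "ae", "ao", "ih", "aw", "ow", "uw"}
SEMIVOWELS = {"y", "w"}
CONSONANTS = {"p", "b", "t", "d", "k", "g", "f", "v", "s", "z", "r", "l"}

def phone_class(phone):
    """Classify an ARPAbet phone as vowel, semivowel, or consonant."""
    if phone in VOWELS:
        return "vowel"
    if phone in SEMIVOWELS:
        return "semivowel"
    if phone in CONSONANTS:
        return "consonant"
    return "unknown"
```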

Diagram – Vocal Organs (Side View)


(Shows airflow path and main parts)


Nasal Cavity


┌─────────┐
│ │

Lips Teeth Alveolar Ridge

↓ ↓ ↓

Tongue ─── Palate ─── Velum (soft palate)


Pharynx

Larynx (vocal folds, glottis)

Trachea


Lungs

7.2.2 Consonants – Place of Articulation

• Definition: The point of maximum restriction of airflow when producing a
consonant.

• Used in automatic speech recognition to group phones.

Major English Places of Articulation

1. Labial

o Bilabial: Both lips come together.


Examples: [p] (possum), [b] (bear), [m] (marmot).

o Labiodental: Bottom lip touches upper teeth.


Examples: [v], [f].

2. Dental

o Tongue against teeth (tip slightly between).


Examples: [th] (thing), [dh] (though).

3. Alveolar

o Tip of tongue against alveolar ridge (behind upper teeth).


Examples: [s], [z], [t], [d].
o Coronal: Term for both dental + alveolar sounds.
4. Palatal

o Palato-alveolar: Tongue blade against rising back of alveolar ridge.


Examples: [sh] (shrimp), [ch] (china), [zh] (Asian), [jh] (jar).

o Palatal proper: Front of tongue close to palate.


Example: [y] (yak).

5. Velar

o Back of tongue against velum (soft palate).


Examples: [k] (cuckoo), [g] (goose), [ng] (kingfisher).

6. Glottal

o Constriction at glottis (vocal folds close).


Example: Glottal stop [q] (IPA [ʔ]).

Diagram – Places of Articulation


[ Lips ] → Labial (Bilabial, Labiodental)

[ Teeth ] → Dental

[ Alveolar ridge ]→ Alveolar

[ Palate ] → Palatal

[ Velum ] → Velar

[ Glottis ] → Glottal

Consonants: Manner of Articulation:

Definition:

• Manner of articulation → How the restriction in airflow is made during consonant
production.

• Along with place of articulation, it usually identifies a consonant uniquely.

1. Stop (Plosive)
• Airflow completely blocked for a short time → followed by an explosive release.

• Two phases:

o Closure → period of complete blockage.

o Release → explosion of air.


• Examples:

o Voiced stops: [b], [d], [g]

o Unvoiced stops: [p], [t], [k]

• Other notes:

o Also called plosives.

o Some systems distinguish closure and release separately (e.g., ARPAbet [pcl],
[tcl], [kcl]).

o Unreleased stops → no explosive release (e.g., end of words)

▪ ARPAbet: [pd], [td], [kd], [bd], [dd], [gd]

▪ IPA: [p̚], [t̚], [k̚]


o In this chapter → [p], [t], [k] = full stop with both closure & release.

2. Nasal

• Produced by lowering the velum → allows air to pass into the nasal cavity.

• Examples: [n], [m], [ŋ] (ng).

3. Fricatives
• Airflow constricted but not fully blocked → causes turbulent airflow.

• Produces a “hissing” sound.

• Examples by place:

o Labiodental: [f], [v] → lower lip against upper teeth.

o Dental: [θ] (“th” in thing), [ð] (“th” in though) → air flows around tongue
between teeth.

o Alveolar: [s], [z] → tongue against alveolar ridge, forcing air over teeth.

o Palato-alveolar: [ʃ] (“sh”), [ʒ] (“zh” as in Asian) → tongue at back of alveolar
ridge, air through tongue groove.

• Sibilants → higher-pitched fricatives: [s], [z], [ʃ], [ʒ].

• Affricates → stop immediately followed by fricative: [tʃ] (“ch”), [dʒ] (“jh” as in
giraffe).

4. Approximants

• Articulators close together but not close enough for turbulence.

• Examples:
o [j] (“y” in yellow) → tongue near roof of mouth, no turbulence.

o [w] (“w” in wood) → back of tongue near velum.

o American [r] →

▪ Tongue tip near palate, OR

▪ Whole tongue bunched near palate.

o [l] → tip of tongue against alveolar ridge/teeth, sides lowered so air flows
over them.

▪ Called lateral sound (air passes along tongue sides).

5. Tap / Flap

• Quick motion of tongue against alveolar ridge.

• Example: [ɾ] → middle of lotus ([l ow dx ax s]) in American English.

• In many UK dialects → replaced by [t].


Vowels:
Definition:
• Like consonants, vowels are described by articulator positions during production.

• Three main parameters:

1. Vowel height → height of highest part of tongue.

2. Vowel frontness/backness → position of highest tongue point (toward front
or back).

3. Lip shape → rounded or unrounded.

1. Vowel Height

• High vowels → tongue raised high.

• Mid vowels → tongue in mid position.

• Low vowels → tongue lowered.

• Examples:

o High front: [iy] (heed)

o Low front: [ae] (had)


o High back: [uw] (who’d)

o [ih] higher than [eh] (both front vowels).

2. Vowel Frontness / Backness

• Front vowels → tongue highest point toward front of mouth.

o Examples: [iy], [ih], [eh], [ae]

• Back vowels → tongue highest point toward back of mouth.


o Examples: [uw], [uh], [ao], [aa]

3. Lip Shape

• Rounded vowels → lips rounded (like in whistling).

o Examples: [uw], [ao], [ow]

• Unrounded vowels → lips relaxed/spread.

4. Diphthongs
• Definition: vowel where tongue position changes markedly during production.

• Represented in vowel charts as vectors instead of points.

• English is rich in diphthongs.

• Examples: [ay] (eye), [aw] (cow), [oy] (boy).

5. Important Notes

• Vowel height is schematic → correlates with acoustic patterns more than exact
tongue position.

• Figures to understand:

o Fig. 7.5 → tongue position examples for [iy], [ae], [uw].

o Fig. 7.6 → schematic vowel chart (high, mid, low; front–back positions;
monophthongs vs diphthongs).


UNIT 5
Speech synthesis
Speech Synthesis – Overview

Historical Background

• 1769, Vienna – Wolfgang von Kempelen built The Mechanical Turk for Empress
Maria Theresa.

o Appearance: Wooden box with gears + robot mannequin moving chess pieces
via mechanical arm.

o Toured Europe & Americas → defeated Napoleon Bonaparte, played
Charles Babbage.
o Hoax → secretly operated by a hidden human chess player.
• 1769–1790 – Von Kempelen also built first full-sentence speech synthesizer (not a
hoax).

o Components:

▪ Bellows → simulated lungs.

▪ Rubber mouthpiece + nose aperture.

▪ Reed → simulated vocal folds.

▪ Whistles → fricatives.
▪ Auxiliary bellows → puff of air for plosives.

▪ Flexible leather “vocal tract” → adjusted to produce different
consonants & vowels.

o Operation: Controlled by moving levers with both hands, opening/closing
passages.

Modern Speech Synthesis (Text-to-Speech, TTS)

• Definition: Generating speech (acoustic waveforms) from text input.

• Applications:

1. Conversational agents – telephone-based dialogue systems (with speech
recognition).

2. Non-conversational systems – reading aloud for the blind, video games, toys.

3. Assistive communication – e.g., Stephen Hawking (ALS) typed words →
synthesizer speech output.

• Limitations:

o Even best systems may sound wooden.

o Limited voice variety.


• State of the art:

o Can produce remarkably natural speech for wide range of inputs.

Basic TTS Process

• Text Analysis – Convert text into phonemic internal representation.


o Includes: Expanding acronyms (e.g., PG&E → “P G AND E”), converting
numbers (20 → “twentieth”), assigning phone sequences, prosody, and
phrasing.

• Waveform Synthesis – Convert internal representation into speech waveform.

Example (Fig. 8.1)

Sentence: PG&E will file schedules on April 20.

• Converted to:

o Expanded words: P G AND E WILL FILE SCHEDULES ON APRIL
TWENTIETH

o Phones: p iy jh iy ae n d iy w ih l f ay l s k eh jh ax l z aa n ey p r ih l t w eh n
t iy ax th

o Prosodic markers: * * * L-L%
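The two-step process in the example above can be sketched as a pair of functions; the `EXPANSIONS` table and the stubbed-out `waveform_synthesis` are illustrative assumptions, not the book's implementation:

```python
# Minimal sketch of the two-stage TTS architecture: text analysis
# produces an internal spoken-word representation, and a separate
# waveform-synthesis stage (stubbed here) consumes it.
# The expansion table covers only the Fig. 8.1 example.
EXPANSIONS = {"PG&E": "P G AND E", "April 20": "APRIL TWENTIETH"}

def text_analysis(sentence):
    """Expand non-standard words; real systems also assign phones and prosody."""
    for raw, spoken in EXPANSIONS.items():
        sentence = sentence.replace(raw, spoken)
    return sentence.upper()

def waveform_synthesis(internal):
    """Stub: a real system would emit an acoustic waveform here."""
    return f"<waveform for: {internal}>"

def tts(sentence):
    return waveform_synthesis(text_analysis(sentence))
```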

Main Waveform Synthesis Paradigms


1. Concatenative synthesis (focus of chapter) –

o Store recorded speech samples in database.

o Chop and recombine to form new sentences.

o Unit selection synthesis → chooses best matching units from large database.

2. Formant synthesis –

o Model resonance characteristics of vocal tract.

o Fully artificial, not sample-based.

3. Articulatory synthesis –
o Simulate physical processes of speech production.

Architecture Example

• Hourglass metaphor (Taylor, 2008) → Two-step narrowing from text to phones, then
expanding from phones to waveform.

• Modern commercial TTS → Mostly based on concatenative unit selection synthesis.



Text Normalization:

Purpose

• Goal: Convert raw text into a phonemic internal representation for speech
synthesis.

• Why needed: Raw text contains abbreviations, numbers, punctuation quirks, and
irregular formats that must be standardized for correct pronunciation.

Main Steps in Text Normalization

1. Sentence Tokenization

• Definition: Segmenting text into separate sentences/utterances for synthesis.

• Challenges:
o Abbreviations with periods – Avoid splitting sentences incorrectly.

▪ Example: B.C. Hydro → do not treat the period after B.C. as an end of
sentence.

o Non-standard sentence boundaries – Detect sentences ending with
punctuation other than a period.

▪ Example: Sentence ending after “collected” even though it uses a
colon instead of a period.

2. Handling Non-Standard Words (NSWs)

• Definition: Words/symbols that require expansion into standard spoken form.

• Types & Examples:

o Dates:

▪ March 31 → “March thirty-first” (not “March three one”).

o Numbers:

▪ $1 billion → “one billion dollars” (insert word dollars after “billion”).


o Acronyms:
▪ Expand into letter-by-letter or full-word form as appropriate (e.g.,
PG&E → “P G and E”).
o Abbreviations:

▪ Must identify correct expansions (e.g., Dr. → “Doctor”, St. → “Street”
or “Saint” depending on context).

Example from Enron Corpus (Klimt & Yang, 2004)

• Raw text contains:

o Abbreviation with period (B.C. Hydro).


o Date (March 31).

o Currency & number ($1 billion).

o Colon as sentence end (“collected:” → signals end of an utterance).

Key Output of Text Normalization

• Clean, sentence-segmented text.

• All non-standard words expanded into their full spoken forms.


• Ready for phonetic analysis and prosodic analysis in the TTS pipeline.

8.1.1 Sentence Tokenization

Definition

• Sentence Tokenization = Process of detecting sentence boundaries in text.

• Goal: Correctly identify where sentences begin and end for speech synthesis.

Challenges in Sentence Tokenization


1. Punctuation Ambiguity

• Periods are not always sentence boundaries.

• Examples:

o Abbreviations – B.C. Hydro (period is part of abbreviation, not sentence end).

o Dual Role – When abbreviation ends a sentence (Dr. J. M. Freeman).

o Non-period boundaries – Colons can signal sentence end:

▪ collected: “We continue…”


o Example references: (8.2), (8.3), (8.4).
2. Period Disambiguation

• Definition: Deciding whether a period marks End-of-Sentence (EOS) or not.

• Approach:
o Early method: Simple Perl scripts (Ch. 3).

o Modern method: Machine Learning–based EOS classifier.

Machine Learning Approach for Sentence Tokenization

Training Process

1. Hand-label a training set with correct sentence boundaries.

2. Tokenize text into tokens separated by whitespace.


3. Select candidate tokens containing potential boundary punctuation:
o ., !, ? (possibly also :).

4. Train a classifier to predict EOS vs not-EOS for each candidate.

Features for EOS Classification

A. Basic Feature Templates

• Prefix: Text before punctuation in the token.

• Suffix: Text after punctuation in the token.


• Abbreviation Check: Whether prefix/suffix is in abbreviation list.

• Previous Word and Next Word.

• Whether previous word is abbreviation.

• Whether next word is abbreviation.

Example (8.5): ANLP Corp. chairman Dr. Smith resigned.

• Candidate = period in "Corp."

o PreviousWord = ANLP
o NextWord = chairman

o Prefix = Corp
o Suffix = NULL
o PreviousWordAbbreviation = 1

o NextWordAbbreviation = 0
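The basic feature templates above can be sketched as an extraction function; the tiny abbreviation list and the feature names are illustrative assumptions, not the exact feature set of any published tokenizer:

```python
# Sketch of the basic EOS feature templates for one candidate token
# containing a period. The abbreviation list is a small illustrative sample.
ABBREVIATIONS = {"corp", "dr", "inc", "mr", "st"}

def eos_features(tokens, i):
    """Features for candidate token tokens[i] (which contains '.')."""
    prefix, _, suffix = tokens[i].partition(".")
    prev_word = tokens[i - 1] if i > 0 else ""
    next_word = tokens[i + 1] if i + 1 < len(tokens) else ""
    return {
        "Prefix": prefix,
        "Suffix": suffix or "NULL",
        "PrefixIsAbbrev": int(prefix.lower() in ABBREVIATIONS),
        "PreviousWord": prev_word,
        "NextWord": next_word,
        "PreviousWordAbbrev": int(prev_word.lower().rstrip(".") in ABBREVIATIONS),
        "NextWordAbbrev": int(next_word.lower().rstrip(".") in ABBREVIATIONS),
    }
```

A classifier (logistic regression or a decision tree, as noted below) would then predict EOS vs not-EOS from such feature dictionaries.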

B. Lexical Probability Features


• Probability that the candidate token occurs at end of sentence.

• Probability that the word after candidate occurs at beginning of sentence.

C. Language-Specific Features

• Capitalization patterns:

o Case of candidate word: Upper, Lower, AllCap, Numbers.

o Case of following word: Upper, Lower, AllCap, Numbers.


• Special abbreviation classes:
o Honorifics/Titles – Dr., Mr., Gen.

o Corporate designators – Corp., Inc.

o Month abbreviations – Jan., Feb.

Classification Methods

• Common algorithms: Logistic Regression, Decision Trees.

• Logistic Regression → Often higher accuracy.


• Decision Trees → More interpretable (example shown in Fig. 8.3).

• Features used:

o Log likelihood of current word being sentence start (bprob).

o Log likelihood of previous word being sentence end (eprob).

o Capitalization of next word.

o Abbreviation subclass (company, state, unit of measurement).

Non-Standard Words in Text Normalization :

Definition

Non-standard words (NSWs) are tokens like numbers or abbreviations that must be expanded
into full English words before pronunciation.
Ambiguity in NSWs

• Numbers can have multiple pronunciations depending on context:

• 1750 → seventeen fifty (year)


• 1750 → one seven five zero (password)
• 1750 dollars → seventeen hundred and fifty OR one thousand seven hundred and fifty
• Roman numerals (e.g., IV) can mean four, fourth, or letters "I V" (intravenous).
• Fractions like 2/3 → two thirds, February third, or two slash three.

Types of NSWs

• Numbers

• NUM: number (cardinal) – 12, 45, 1/2, 0.6


• NORD: ordinal – 3rd, May 7
• NDIG: number as digits – Room 101
• NIDE: identifier – 747, I5
• NADDR: street address – 386 Main St.
• NZIP: zip code – 91020
• NTIME: time – 3.20, 11:45
• NDATE: date – 2/28/05
• NYER: years – 1998, 80s, 2008
• MONEY: $3.45, ¥20,200
• BMONEY: billions/trillions – $3.2 billion
• PRCT: percentage – 75%

• Alphabetic

• EXPN: abbreviations – N.Y., mph, gov’t


• LSEQ: letter sequences – DVD, IBM
• ASWD: acronyms as words – NASA, IKEA

• Realization Patterns

• Paired method (e.g., years) → seventeen fifty for 1750.


• Serial method (e.g., zip codes) → nine four one one zero.
• BMONEY → always read with currency word at the end.

• Processing NSWs

Step 1: Tokenization

• Tokenize by whitespace.

• Identify tokens not in the pronunciation dictionary as NSWs.

Handle:
• Known abbreviations in dictionaries (e.g., st, mr, mrs, mon, tues, nov).
• Single-character tokens.
• Hyphenated words (2-car).
• CamelCase words (RVing).

Step 2: Classification

• Assign NSW type (NUM, LSEQ, EXPN, etc.) using:


• Regular expressions (e.g., NYER: /1[89][0-9][0-9]|20[0-9][0-9]/).
• Machine learning classifiers with features:

i. Letter patterns (all caps, two vowels, contains slash, token length).
ii. Context words (Chapter, on, king).
iii. Neighboring word identity.
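The classification step can be sketched with regular expressions; the NYER pattern is the one given above, while the other patterns and the `classify_nsw` helper are illustrative assumptions (real systems back such rules with machine-learned classifiers):

```python
import re

# Regex-based sketch of NSW type assignment (Step 2). Only NYER comes
# from the text; other patterns are crude illustrative approximations.
NSW_PATTERNS = [
    ("NYER", re.compile(r"^(1[89][0-9][0-9]|20[0-9][0-9])$")),
    ("PRCT", re.compile(r"^[0-9]+(\.[0-9]+)?%$")),
    ("MONEY", re.compile(r"^\$[0-9][0-9,]*(\.[0-9]+)?$")),
    ("NUM", re.compile(r"^[0-9]+(\.[0-9]+)?$")),
    ("LSEQ", re.compile(r"^[A-Z]{2,}$")),  # crude: all-caps letter sequence
]

def classify_nsw(token):
    """Return the first matching NSW type, or None for ordinary words."""
    for label, pattern in NSW_PATTERNS:
        if pattern.match(token):
            return label
    return None
```

Note the ordering matters: "1998" matches both NYER and NUM, so the more specific pattern is tried first. Distinguishing LSEQ (IBM) from ASWD (NASA) cannot be done by shape alone, which is why classifiers with context features are needed.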

Step 3: Expansion

o EXPN: needs abbreviation dictionary + homonym disambiguation.


• LSEQ: expand letter by letter.
• ASWD: keep as is.
• NUM/NORD: expand to cardinal/ordinal words.
• NDIG/NZIP: digit-by-digit expansion.
• NYER: two-digit pairs (nineteen eighty), except:

➢ Ends in "00" → read as cardinal (two thousand).


➢ Hundreds method → eighteen hundred.

• NTEL: digit sequence, or paired digits for last four, or trailing unit method (five
thousand for last digits).
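The NYER expansion rules above can be sketched as follows; `expand_year` and its helpers are illustrative, with number-to-words coverage limited to the cases discussed:

```python
# Sketch of NYER expansion: years read as two two-digit pairs,
# except "00"-final years (cardinal) and the hundreds method.
ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty",
        "seventy", "eighty", "ninety"]

def two_digits(n):
    """Spell out 0-99."""
    if n < 20:
        return ONES[n]
    word = TENS[n // 10]
    return word + ("-" + ONES[n % 10] if n % 10 else "")

def expand_year(year):
    """Expand a four-digit year using the paired-digit reading."""
    if year % 1000 == 0:                  # e.g. 2000 -> cardinal
        return two_digits(year // 1000) + " thousand"
    if year % 100 == 0:                   # e.g. 1800 -> hundreds method
        return two_digits(year // 100) + " hundred"
    rest = year % 100
    if rest < 10:                         # e.g. 1905 -> "nineteen oh five"
        return two_digits(year // 100) + " oh " + ONES[rest]
    return two_digits(year // 100) + " " + two_digits(rest)
```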

Dialect & Language-Specific Variations

• Australian English → double three for "33".


• French: depends on gender of noun (un garçon, une fille).
• German: morphological case affects pronunciation (Heinrich IV changes form).

Homograph Disambiguation :

Goal:

• To determine correct pronunciation for homographs — words with the same spelling
but different pronunciations.

Definition & Examples

Homographs: Same spelling, different pronunciations.

• English examples:
1. use

• noun: /y uw s/ → It’s no use


• verb: /y uw z/ → to use the telephone

2. live
• verb: /l ih v/ → Do you live near a zoo
• adjective: /l ay v/ → live animals

• bass

• fish: /b ae s/ → bass fishing


• instrument: /b ey s/ → bass guitar

French examples:

• fils: [fis] ‘son’ vs [fil] ‘thread’


• fier: ‘proud’ vs ‘to trust’
• est: ‘is’ vs ‘East’

Part-of-Speech (POS) as a Disambiguation Tool

In English (and similar languages like French, German), different homograph forms often
differ in part-of-speech.

Example:

• use: noun vs verb


• live: verb vs adjective

Fig. 8.5 Patterns:

o Final voicing: noun: /s/ vs verb: /z/ → use, close, house

o Stress shift: noun: initial stress vs verb: final stress → record, insult, object

o -ate final vowel weakening

• noun/adjective: final /ax/ vs verb: final /ey/ → estimate, separate, moderate

Statistical Evidence

Liberman & Church (1992):

• Many frequent homographs in AP newswire corpus can be disambiguated via POS.


• Top 15 most frequent homographs: use, increase, close, record, house, contract, lead,
live, lives, protest, survey, project, separate, present, read

Standard Disambiguation Approach


Store distinct pronunciations for homographs labeled by POS.

• Run a POS tagger to select the correct pronunciation in context.
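The standard approach can be sketched as a (word, POS) lexicon lookup. This is a hypothetical illustration: the POS tags stand in for a real tagger's output, and stress marks were added to the ARPAbet strings for concreteness.

```python
# POS-conditioned homograph lookup: the lexicon maps (word, POS) to an
# ARPAbet pronunciation, following the noun/verb patterns of Fig. 8.5.

HOMOGRAPHS = {
    ("use", "NN"):  "Y UW1 S",   # noun: "it's no use"
    ("use", "VB"):  "Y UW1 Z",   # verb: "to use the telephone"
    ("live", "VB"): "L IH1 V",   # verb: "do you live near a zoo"
    ("live", "JJ"): "L AY1 V",   # adjective: "live animals"
}

def pronounce(word: str, pos: str) -> str:
    """Pick the pronunciation for a homograph given its POS tag."""
    return HOMOGRAPHS.get((word.lower(), pos), "<lookup in main dictionary>")
```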

When POS Alone Is Not Enough

o Same POS, different pronunciations:


• bass: fish (/b ae s/) vs instrument (/b ey s/)
• lead: noun /l iy d/ (a dog's leash) vs noun /l eh d/ (the metal)

Abbreviation ambiguity:

• Dr. → doctor vs drive


• St. → Saint vs street

Capitalization differences:

• polish vs Polish (homographs in sentence-initial or all-caps text)

Practical TTS Handling

o Many of these harder homographs are ignored in TTS systems.


• Alternatively, use Word Sense Disambiguation (WSD) methods:
• Example: Decision-list algorithm (Yarowsky, 1997)
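A toy sketch in the spirit of the decision-list algorithm: each collocational feature gets a log-likelihood-ratio score for one sense, the list is sorted by score magnitude, and the first matching feature decides. The features and counts here are invented for illustration.

```python
import math

def build_decision_list(counts):
    """counts: {feature: (count_sense1, count_sense2)}, add-one smoothed.
    Returns rules sorted by |log-likelihood ratio|, strongest first."""
    rules = []
    for feat, (c1, c2) in counts.items():
        score = math.log((c1 + 1) / (c2 + 1))
        rules.append((abs(score), feat, "sense1" if score > 0 else "sense2"))
    return sorted(rules, reverse=True)

def classify(context_words, rules, default="sense1"):
    """Apply the first rule whose feature appears in the context."""
    for _, feat, sense in rules:
        if feat in context_words:
            return sense
    return default

# bass: sense1 = fish /b ae s/, sense2 = instrument /b ey s/
rules = build_decision_list({"fishing": (40, 1), "guitar": (0, 30), "play": (2, 12)})
```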

Phonetic Analysis:
Purpose:
• Take normalized word strings from text analysis and generate pronunciations for each
word.

Main Component

• Large pronunciation dictionary (core tool for phonetic analysis).

Why Dictionaries Alone Are Not Enough

• Running text often contains words not in the dictionary.

• Example (Black et al., 1998):


o British English dictionary: the OALD lexicon, tested on the first section of the Penn Wall Street Journal Treebank.
o Total words (tokens): 39,923
o Not in dictionary: 1,775 tokens (4.6%)

▪ Unique types: 943

o Distribution of unseen word tokens:

Category         Count    %
Names            1,360    76.6%
Unknown words      351    19.8%
Typos & others      64     3.6%

• Two main areas needing augmentation:


1. Handling names

2. Handling other unknown words

Process Order

1. Dictionaries

2. Names

3. Grapheme-to-phoneme rules for other unknown words
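The process order above can be sketched as a simple dispatcher. This is a hypothetical illustration: lexicon, name_lexicon, and g2p are placeholders for the real components.

```python
# Lookup order for phonetic analysis: main dictionary first, then the
# name dictionary, then grapheme-to-phoneme prediction as a last resort.

def pronounce(word, lexicon, name_lexicon, g2p):
    w = word.lower()
    if w in lexicon:
        return lexicon[w]          # 1. ordinary dictionary entry
    if w in name_lexicon:
        return name_lexicon[w]     # 2. name dictionary entry
    return g2p(w)                  # 3. predict from the letters
```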

8.2.1 Dictionary Lookup

Phonetic Dictionaries

• Introduced earlier in Ch. 8.

• Commonly used in TTS:

o CMU Pronouncing Dictionary (CMUdict, 1993)


▪ ~120,000 words

▪ Pronunciations: roughly phonemic, from a 39-phone ARPAbet-derived set

▪ Phonemic transcription:
▪ No surface reductions (like [ax], [ix]) directly written

▪ Each vowel has stress tags:

▪ 0 → unstressed
▪ 1 → stressed
▪ 2 → secondary stress

▪ Non-diphthong vowels with stress 0 usually → [ax] or [ix]

▪ Most words have one pronunciation, ~8,000 words have two or three

▪ Some phonetic reductions are reflected


▪ Not syllabified (nucleus implicitly marked by numbered vowel)

▪ Limitations:

▪ Designed for speech recognition, not synthesis

▪ Does not specify which pronunciation to choose for synthesis

▪ No syllable boundaries

▪ Capitalization of headwords → can’t distinguish US vs us

▪ US: [AH1 S] or [Y UW1 EH1 S]


Sample CMUdict Entries (Fig. 8.6)
ANTECEDENTS AE2 N T IH0 S IY1 D AH0 N T S

PAKISTANI P AE2 K IH0 S T AE1 N IY0

CHANG CH AE1 NG

TABLE T EY1 B AH0 L

DICTIONARY D IH1 K SH AH0 N EH2 R IY0

TROTSKY T R AA1 T S K IY2

DINNER D IH1 N ER0


WALTER W AO1 L T ER0

LUNCH L AH1 N CH

WALTZING W AO1 L T S IH0 NG

MCFARLAND M AH0 K F AA1 R L AH0 N D

WALTZING(2) W AO1 L S IH0 NG
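Entries in this format are easy to parse; a sketch, using the variant-marker convention ("WORD(2)") and the sample lines from Fig. 8.6:

```python
import re

def parse_cmudict(lines):
    """Parse CMUdict-format lines into {headword: [phone lists]}.
    Variant pronunciations are marked with a trailing "(n)" on the headword."""
    lexicon = {}
    for line in lines:
        head, *phones = line.split()
        word = re.sub(r"\(\d+\)$", "", head)   # strip variant marker
        lexicon.setdefault(word, []).append(phones)
    return lexicon

entries = [
    "DINNER D IH1 N ER0",
    "WALTZING W AO1 L T S IH0 NG",
    "WALTZING(2) W AO1 L S IH0 NG",   # variant reflecting t-deletion
]
lex = parse_cmudict(entries)
```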

UNISYN Dictionary

• ~110,000 words
• Freely available for research

• Designed for speech synthesis (Fitt, 2002)


• Features:
o Syllabifications

o Stress markings

o Some morphological boundaries

o Can produce pronunciations in multiple English dialects:


▪ General American

▪ RP British

▪ Australian English

▪ …and dozens more

o Slightly different phone set than CMUdict

UNISYN Examples:
going: { g * ou }.> i ng >
antecedents: { * a n . tˆ i . s ˜ ii . d n! t }> s >

dictionary: { d * i k . sh @ . n ˜ e . r ii }

8.2.2 Names

Importance of Names in Speech Synthesis

• Names are a major source of pronunciation errors in TTS.

• Categories of names:

1. Personal names – first names and surnames


2. Geographical names – cities, streets, and other place names

3. Commercial names – company and product names

Scale of the Problem

• Personal names in the U.S. (Spiegel, 2003, based on Donnelly and household lists):

o ~2 million different surnames

o ~100,000 first names


• Comparison:

o 2 million surnames is an order of magnitude larger than the entire CMU Pronouncing Dictionary (~120,000 words).
Large-Scale TTS Solutions

• Most large-scale TTS systems include a large name pronunciation dictionary.

• CMU Pronouncing Dictionary:


o Contains many names

o Includes:

▪ The most frequent 50,000 surnames (from Bell Labs estimate of U.S.
personal name frequency)

▪ ~6,000 first names

Coverage Studies

• Liberman & Church (1992):

o A dictionary of 50,000 names covered 70% of name tokens in a 44-million-word AP newswire corpus.

o Many remaining names could be derived from these 50,000:

▪ Up to 97.43% coverage is possible by applying simple modifications to the known names.

Name Formation Techniques


1. Adding stress-neutral suffixes to known names

o Examples:

▪ walters = walter + s

▪ lucasville = lucas + ville

▪ abelson = abel + son

2. Rhyme analogy

o Example:
▪ Known: Trotsky → /tr/ replaced with /pl/ → Plotsky

3. Morphological decomposition

4. Analogical formation
5. Mapping unseen names to spelling variants already in the dictionary
o (Fackrell & Skut, 2004)
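Technique 1 (stress-neutral suffixes) can be sketched as follows; the suffix pronunciations are illustrative assumptions, not from the text.

```python
# Derive a pronunciation for an unseen name by stripping a stress-neutral
# suffix and looking up the base name (walters = walter + s, etc.).

SUFFIXES = {"s": "Z", "ville": "V IH0 L", "son": "S AH0 N"}

def name_by_suffix(name, name_lexicon):
    """Return base pronunciation + suffix phones, or None if no rule applies."""
    for suf, suf_phones in SUFFIXES.items():
        if name.endswith(suf):
            base = name[: -len(suf)]
            if base in name_lexicon:
                return name_lexicon[base] + " " + suf_phones
    return None

lex = {"walter": "W AO1 L T ER0", "lucas": "L UW1 K AH0 S"}
```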

Current Challenges

• Name pronunciation remains difficult despite these techniques.


• Many modern systems handle unknown names via grapheme-to-phoneme (G2P)
methods:

o Often build two separate predictive systems:


▪ One for names

▪ One for non-names

Additional Resources

• Spiegel (2003, 2002) provides further details on issues in proper name pronunciation.

8.2.3 Grapheme-to-Phoneme (G2P):

Definition

• Grapheme-to-Phoneme Conversion (G2P):


Converting a sequence of letters (graphemes) into a sequence of phones (phonemes).

o Example: "cake" → [K EY K]

• Also called Letter-to-Sound (LTS) conversion.

Historical Approach – Hand-Written Rules

• Based on Chomsky–Halle phonological rewrite rules (Chapter 7 format).

• Rules are applied in order:

o Earlier rules = context-specific

o Later rules = default (apply only if earlier ones don’t)

• Example simple rules for letter c:


1. c → [k] / __ {a, o} (context-dependent: [k] before a or o, as in cat, cot)

2. c → [s] (default, context-independent)

• Actual rules are more complex:


o c → [ch] in cello, concerto
• Stress rules for English are especially complex:

o Example rule from Allen et al. (1987):


Assign primary stress in specific contexts involving weak syllables and
morpheme-final positions:

1. Before a weak syllable + morpheme-final short vowel + consonants (e.g., difficult)

2. Before a weak syllable + morpheme-final vowel (e.g., oregano)

Modern Approach – Probabilistic Methods

• First formalized by Lucassen & Mercer (1984):

o Goal: Find the most probable phone sequence P̂ for a letter sequence L:

P̂ = argmax_P P(P | L)

• Requires:

o Training set: Words with spelling + pronunciation

o Test set: New words to predict

Step 1 – Letter-to-Phone Alignment (Training Set)

• Needed: Which phones align with which letters.

• Some letters align to:

o Multiple phones: x → [k s]

o No phones: final e in cake → ε (null)

• Example alignment:

L: c   a   k   e
   |   |   |   |
P: K   EY  K   ε

Semi-Automatic Alignment Method (Black et al., 1998)

• Relies on allowable phone lists for each letter.


• Example:
o c: k, ch, s, sh, t-s, ε

o e: ih, iy, er, ax, ah, eh, ey, uw, ay, ow, y-uw, oy, aa, ε

• Process:

1. For each word, find all alignments matching allowable lists.


2. Count (letter, phone) pairs across all alignments.

3. Compute probability:

P(p_i | l_j) = count(p_i, l_j) / count(l_j)

4. Use Viterbi algorithm to find best alignment A for each (P, L) pair.
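Step 3 of the alignment procedure can be sketched as below. The (letter, phone) pairs are invented for illustration, and "_" stands for the null phone ε.

```python
from collections import Counter

def letter_phone_probs(aligned_pairs):
    """Estimate P(phone | letter) from (letter, phone) pairs collected
    across the candidate alignments."""
    pair_counts = Counter(aligned_pairs)
    letter_counts = Counter(l for l, _ in aligned_pairs)
    return {(l, p): c / letter_counts[l] for (l, p), c in pair_counts.items()}

pairs = [("c", "K"), ("a", "EY"), ("k", "K"), ("e", "_"),   # cake
         ("c", "S"), ("e", "EH"), ("l", "L"), ("l", "_")]   # cell
probs = letter_phone_probs(pairs)
```

These probabilities would then feed the Viterbi search for the best alignment of each (P, L) pair.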

Step 2 – Predicting Phones for New Words (Test Set)

• Train machine learning classifier (e.g., decision tree) on aligned training set.

• Predicts most probable phone for each letter.

Features for Decision Tree

1. Window of surrounding letters:


o k previous + k following letters

o Example:

▪ "cat": a → [AE]

▪ "cake": a → [EY] (influenced by final e)

2. Previous predicted phone:

o Gives phonotactic context

o Must process left-to-right to use previous output.

3. Stress prediction:
o Augment vowels with stress info:

▪ 2 levels: AE / AE1

▪ 3 levels: AE0, AE1, AE2 (as in CMU lexicon)

4. Part-of-speech tag of the word:

o Even for unknown words (estimated by POS tagger)


5. Previous vowel stress status
6. Letter classes:

o Consonants, vowels, liquids, etc.

7. Following word features (language-specific phenomena):

o Example: French liaison:


▪ six → [sis] (j’en veux six)

▪ [siz] (six enfants)

▪ [si] (six filles)
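Feature 1 (a window of surrounding letters) can be illustrated with a simple context-table predictor standing in for the decision tree; the training pairs are invented, and "#" pads the word boundaries.

```python
from collections import Counter, defaultdict

def train(examples):
    """examples: list of (word, phones), one phone (or '_') per letter.
    Builds counts keyed on (prev, letter, next), with letter-only backoff."""
    ctx, unigram = defaultdict(Counter), defaultdict(Counter)
    for word, phones in examples:
        padded = "#" + word + "#"
        for i, letter in enumerate(word):
            key = (padded[i], letter, padded[i + 2])
            ctx[key][phones[i]] += 1
            unigram[letter][phones[i]] += 1
    return ctx, unigram

def predict(word, ctx, unigram):
    """Pick the most frequent phone for each letter in its window."""
    padded = "#" + word + "#"
    out = []
    for i, letter in enumerate(word):
        key = (padded[i], letter, padded[i + 2])
        table = ctx.get(key) or unigram.get(letter, Counter({"?": 1}))
        out.append(table.most_common(1)[0][0])
    return [p for p in out if p != "_"]   # drop null phones

# a in "cat" -> [AE], but a in "cake" -> [EY], driven by the window
examples = [("cat", ["K", "AE", "T"]), ("cake", ["K", "EY", "K", "_"])]
ctx, uni = train(examples)
```

A real system would replace the context table with a trained decision tree and add the stress, POS, and letter-class features listed above.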

Special Handling for Names

• Many systems build two separate G2P models:

1. For unknown personal names


2. For other unknown words
• Additional features for names:

o Foreign language origin (predicted via letter sequence n-gram models)

Model Type

• Decision tree = conditional classifier:

o Finds the phone sequence with the highest conditional probability given the grapheme sequence.

• More recent models:

o Joint classifier:
▪ Hidden state = graphone (grapheme + phoneme pair)

