
EE432 Speech Processing
Lecture # 1

Course webpage: MS Teams (No Enrollment Key!)

Instructor: M. Zübeyir Ünlü, PhD ([email protected])

17.02.2025 Lecture # 1 1

Lecture 01:
Introduction

Course Description

This course covers the basic principles of digital speech processing:

• Review of digital signal processing

• MATLAB functionality for speech processing

• Fundamentals of speech production and perception

• Basic techniques for digital speech processing:


• short-time energy, magnitude, autocorrelation

• short-time Fourier analysis

• homomorphic (convolutional) methods

• linear predictive methods
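
As a taste of the first of these techniques, short-time energy can be computed in a few lines. The course work uses MATLAB, but here is a Python/NumPy sketch for illustration; the frame and hop sizes are arbitrary choices, not course-specified values.

```python
import numpy as np

# Illustrative sketch: short-time energy over overlapping frames.
# Frame length and hop (in samples) are arbitrary example choices.
def short_time_energy(x, frame_len=400, hop=160):
    """E[m] = sum of x[n]^2 over frame m."""
    n_frames = 1 + (len(x) - frame_len) // hop
    return np.array([np.sum(x[m * hop : m * hop + frame_len] ** 2)
                     for m in range(n_frames)])

fs = 16000
t = np.arange(fs) / fs
loud = np.sin(2 * np.pi * 200 * t)         # full-amplitude tone
quiet = 0.1 * np.sin(2 * np.pi * 200 * t)  # same tone, 20 dB lower
print(short_time_energy(loud).mean() > short_time_energy(quiet).mean())  # True
```

Short-time magnitude is the same computation with |x[n]| in place of x[n]², and trades off dynamic range against sensitivity to large samples.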



• Speech estimation methods


• speech/non-speech detection

• voiced/unvoiced/non-speech segmentation/classification

• pitch detection

• formant estimation

• Speech Processing Applications


• Speech coding

• Speech synthesis

• Speech recognition/natural language processing
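
Pitch detection, one of the estimation problems listed above, admits a compact autocorrelation sketch. Again this is a Python/NumPy illustration rather than the MATLAB the course uses, and the search range and frame length are assumed values.

```python
import numpy as np

# Illustrative sketch of autocorrelation-based pitch detection:
# F0 is estimated as fs / lag, where lag maximizes the autocorrelation
# within a plausible pitch range.
def autocorr_pitch(frame, fs, fmin=60.0, fmax=400.0):
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo, hi = int(fs / fmax), int(fs / fmin)
    lag = lo + np.argmax(r[lo:hi])
    return fs / lag

fs = 16000
t = np.arange(int(0.03 * fs)) / fs   # one 30 ms analysis frame
frame = np.sin(2 * np.pi * 120 * t)  # synthetic "voiced" tone at 120 Hz
f0 = autocorr_pitch(frame, fs)
print(abs(f0 - 120.0) < 5.0)         # True: estimate is near 120 Hz
```

Real speech needs more care (windowing, voicing decisions, octave-error checks), which is exactly what the lectures on short-time analysis cover.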

• A MATLAB-based term project will be required for all students taking this course for credit.

Course Information
• Textbook: L. R. Rabiner and R. W. Schafer, Theory and Applications of Digital Speech Processing, Prentice-Hall Inc., 2011

• Prerequisites: Basic Digital Signal Processing, good knowledge of MATLAB

• Time and Location: Monday 13:30 to 16:15, Room: Z43

• Course Website: MS Teams

• Enrollment Key: No Enrollment Key

• Grading:

• Midterm Exams 35%
• Homeworks 15%
• Term Project 15%
• Final Exam 35%
Course Textbook & Other Sources

Course Textbook:
• L. R. Rabiner and R. W. Schafer, Theory and Applications of Digital Speech Processing, Prentice-
Hall Inc., 2011
Recommended Supplementary Textbook:
• T. F. Quatieri, Principles of Discrete-Time Speech Processing, Prentice Hall Inc, 2002
Matlab Exercises:
• C. S. Burrus et al, Computer-Based Exercises for Signal Processing using Matlab, Prentice Hall Inc,
1994
• J. R. Buck, M. M. Daniel, and A. C. Singer, Computer Explorations in Signals and Systems using
Matlab, Prentice Hall Inc, 2002

Other Sources:

J. L. Flanagan, Speech Analysis, Synthesis, and Perception, Springer-Verlag, 2nd Edition, Berlin, 1972

J. D. Markel and A. H. Gray, Jr., Linear Prediction of Speech, Springer-Verlag, Berlin, 1976

B. Gold and N. Morgan, Speech and Audio Signal Processing, J. Wiley and Sons, 2000

J. Deller, Jr., J. G. Proakis, and J. Hansen, Discrete-Time Processing of Speech Signals, Macmillan Publishing, 1993

D. O’Shaughnessy, Speech Communication, Human and Machine, Addison-Wesley, 1987

S. Furui and M. Sondhi, Advances in Speech Signal Processing, Marcel Dekker Inc, NY, 1991

R. W. Schafer and J. D. Markel, Editors, Speech Analysis, IEEE Press Selected Reprint Series, 1979

D. G. Childers, Speech Processing and Synthesis Toolboxes, John Wiley and Sons, 1999

K. Stevens, Acoustic Phonetics, MIT Press, 1998

J. Benesty, M. M. Sondhi and Y. Huang, Editors, Springer Handbook of Speech Processing and Speech Communication,
Springer, 2008.

References in Selected Areas of Speech Processing

Speech Coding:

• A. M. Kondoz, Digital Speech: Coding for Low Bit Rate Communication Systems, 2nd Edition, John Wiley and Sons, 2004

• W. B. Kleijn and K. K. Paliwal, Editors, Speech Coding and Synthesis, Elsevier, 1995

• P. E. Papamichalis, Practical Approaches to Speech Coding, Prentice Hall Inc, 1987

• N. S. Jayant and P. Noll, Digital Coding of Waveforms, Prentice Hall Inc, 1984


Speech Synthesis:

• T. Dutoit, An Introduction to Text-to-Speech Synthesis, Kluwer Academic Publishers, 1997

• P. Taylor, Text-to-Speech Synthesis, Cambridge University Press, 2008

• J. Allen, S. Hunnicutt, and D. Klatt, From Text to Speech, Cambridge University Press, 1987

• Y. Sagisaka, N. Campbell, and N. Higuchi, Computing Prosody, Springer Verlag, 1996

• J. VanSanten, R. W. Sproat, J. P. Olive and J. Hirschberg, Editors, Progress in Speech Synthesis, Springer Verlag, 1996

• J. P. Olive, A. Greenwood, and J. Coleman, Acoustics of American English, Springer Verlag, 1993


Speech Recognition:

• L. R. Rabiner and B. H. Juang, Fundamentals of Speech Recognition, Prentice Hall Inc, 1993

• X. Huang, A. Acero and H-W Hon, Spoken Language Processing, Prentice Hall Inc, 2000

• F. Jelinek, Statistical Methods for Speech Recognition, MIT Press, 1998

• H. A. Bourlard and N. Morgan, Connectionist Speech Recognition-A Hybrid Approach, Kluwer Academic Publishers, 1994

• C. H. Lee, F. K. Soong, and K. K. Paliwal, Editors, Automatic Speech and Speaker Recognition, Kluwer Academic Publisher, 1996

References in Digital Signal Processing

• A. V. Oppenheim and R. W. Schafer, Discrete-Time Signal Processing, 3rd Ed., Prentice-Hall Inc, 2010

• L. R. Rabiner and B. Gold, Theory and Application of Digital Signal Processing, Prentice Hall Inc, 1975

• S. K. Mitra, Digital Signal Processing-A Computer-Based Approach, Third Edition, McGraw Hill, 2006

• S. K. Mitra, Digital Signal Processing Laboratory Using Matlab, McGraw Hill, 1999

Term Project
• All registered students are required to do a term project.
This term project, implemented using Matlab®, must be a speech or audio processing system
that accomplishes a simple or even a complex task—e.g., pitch detection, voiced-unvoiced
detection, speech/silence classification, speech synthesis, speech recognition, speaker
recognition, helium speech restoration, speech coding, MP3 audio coding, etc.

• Every student is also required to make a 10-minute PowerPoint presentation of their term project to the entire class.
• The presentation must include:
• A short description of the project and its objectives
• An explanation of the implemented algorithm and relevant theory
• A demonstration of the working program – i.e., results obtained when running the program
Suggestions for Term Project
1. Pitch detector – time domain, autocorrelation, cepstrum, LPC, etc.
2. Voiced/Unvoiced/Silence detector
3. Formant analyzer/tracker
4. Speech coders including ADPCM, LDM, CELP, Multipulse, etc.
5. N-channel spectral analyzer and synthesizer – phase vocoder, channel vocoder, homomorphic vocoder
6. Speech endpoint detector
7. Simple speech recognizer – e.g. isolated digits, speaker trained
8. Speech synthesizer – serial, parallel, direct, lattice
9. Helium speech restoration system
10. Audio/music coder
11. System to speed up and slow down speech by arbitrary factors
12. Speaker verification system
13. Sinusoidal speech coder
14. Speaker recognition system
15. Speech understanding system
16. Speech enhancement system (noise reduction, post filtering, spectral flattening)
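
To give a flavor of project idea 6, a minimal energy-threshold endpoint detector can be sketched as below. The project itself must be in MATLAB; this Python/NumPy version is only illustrative, and the frame sizes and threshold ratio are assumed values.

```python
import numpy as np

# Minimal sketch of an energy-threshold speech endpoint detector:
# a frame is labeled "speech" if its energy exceeds a fraction of the
# maximum frame energy. All parameter choices here are illustrative.
def detect_speech_frames(x, fs, frame_ms=25, hop_ms=10, thresh_ratio=0.1):
    flen, hop = int(fs * frame_ms / 1000), int(fs * hop_ms / 1000)
    n = 1 + (len(x) - flen) // hop
    energy = np.array([np.sum(x[i * hop : i * hop + flen] ** 2)
                       for i in range(n)])
    return energy > thresh_ratio * energy.max()

fs = 8000
silence = np.zeros(fs)                               # 1 s of silence
tone = np.sin(2 * np.pi * 200 * np.arange(fs) / fs)  # 1 s standing in for speech
x = np.concatenate([silence, tone, silence])
speech = detect_speech_frames(x, fs)
print(not speech[0], bool(speech[len(speech) // 2]))  # True True
```

A real project would add an adaptive noise-floor estimate and hangover smoothing so that brief dips inside words are not labeled silence.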

Phonetics &
Representations of Speech
Ref: https://speech.zone/courses/speech-processing/
Thanks to Dr. Catherine Lai and Prof. Simon King


• Phonetics is the study of speech sounds. There are three main branches of study:

• Articulatory phonetics: how people physically produce speech

• Auditory phonetics: how people hear and perceive speech

• Acoustic phonetics: the physical properties of speech sounds

• To build speech technologies we want to know what speech actually is and how
people use it to communicate!

• Today we’ll focus on Articulatory Phonetics to start to build our mapping between
speech and text.

Speech Articulation
• We’ll start by looking at how people produce speech by actually doing it
ourselves!

• Let’s make a simple speech sound: [s] as in the beginning of the word “sag”.
What’s involved here?

• You take a breath

• You blow out that breath from your lungs out your mouth

• Your mouth is configured in a specific way - what is it?

Speech Articulation: [s]
• What are you doing when you make an [s] sound?

• Tongue tip is placed behind the ridge near your teeth so that there’s a small
gap between the tongue and the top of your mouth

• Air is forced through the gap

Speech Production Mechanisms
Four main processes:

• Airstream: how you move air around (usually out!) to create sounds

• Phonation: what your vocal folds are doing

• Oro-nasal: where is the air going? Your mouth? Nose? Both?

• Articulatory: what you're doing with your lips and tongue relative to the roof of your mouth and pharynx
Articulation: Speech Organs
Places of Articulation

We describe place of articulation in terms of the point of constriction between the tongue and the roof of the mouth.
Active and Passive Articulators
We can characterize articulators as
active (move) or passive (don’t move
much)

• Passive: upper lip, teeth, alveolar ridge, hard palate, soft palate, uvula

• Active: lower lip, tongue

Speech Articulation: [s]
What are you doing when you make an [s] sound?

• Place: tongue (active) constriction at the alveolar ridge (passive)

Compare to

• [f] where the constriction is at the lips (labial)

• or “sh” [ʃ] which is a bit behind the ridge (post-alveolar)

Voicing: [s] vs [z]
Now make a [z] sound (like “buzz”). How does this differ from [s]?

• Put your hand on your throat

• You should feel a buzz with [z] but not [s]

• This is a difference in voicing

Voicing: the larynx and vocal folds

• Voiced sounds are produced with the vocal folds vibrating.


• Voiceless sounds happen when the vocal folds are held apart (like when we
breathe).
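
This voiced/voiceless contrast is visible directly in the waveform: vocal-fold vibration gives a quasi-periodic signal, while voiceless frication is noise-like. One classic, cheap cue is the zero-crossing rate; the Python/NumPy sketch below uses synthetic stand-in signals (a tone and white noise) rather than real [z] and [s] recordings.

```python
import numpy as np

# Illustrative sketch: zero-crossing rate as a crude voicing cue.
# Noise-like (voiceless) signals cross zero far more often than
# periodic (voiced) signals of typical pitch.
def zero_crossing_rate(x):
    """Fraction of adjacent sample pairs whose signs differ."""
    return float(np.mean(np.sign(x[:-1]) != np.sign(x[1:])))

rng = np.random.default_rng(0)
fs = 16000
t = np.arange(fs) / fs
voiced_like = np.sin(2 * np.pi * 120 * t)  # periodic, like voiced [z]
unvoiced_like = rng.standard_normal(fs)    # noise-like, like voiceless [s]
print(zero_crossing_rate(unvoiced_like) > zero_crossing_rate(voiced_like))  # True
```

Combined with short-time energy, this cue underlies the voiced/unvoiced/silence classifiers listed among the course's estimation methods.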
Speech Articulation: [s]
What are you doing when you make an [s] sound?

• Place: tongue constriction at the alveolar ridge

• Voicing: unvoiced

Compare to [z] which is voiced!

Manner of Articulation: [s] vs [t] vs [n]

Now make [t], [s] and [n] sounds, e.g. say “tag”, “sag”, “nag”

• [s]: steady stream of air (turbulent flow) through small gap between the
tongue and roof of mouth → fricative

• [t]: touch the tongue tip to the ridge (closure), and then push air out in a burst
(release), opening the closure → oral stop/plosive

• [n]: touch the tongue tip to the ridge (closure), push air through the nasal
cavity (lower velum) → nasal stop

Speech Articulation: [s]
What are you doing when you make an [s] sound?

• Place: tongue constriction at the alveolar ridge

• Voicing: unvoiced

• Manner: fricative

Manner, Place, Voicing


The International Phonetic Alphabet (IPA) gives a systematic way of characterizing consonants
in terms of manner, place, voicing.
Manners of Articulation
• Stops: complete closure of the airstream in oral cavity (air can’t pass)

• Oral stops/plosives, e.g. [t] “tag”, [d] “dag”

• Nasal stops, e.g. [n] “nag”

• Fricatives: close approximation of articulators, airstream partially obstructed, turbulent flow, e.g. [s] “sag”, [z] “zag”

• Approximants: articulators are close together, but not enough to produce turbulent flow, e.g. [ɹ] “rag”

• Lateral approximants: obstruction along the center of the vocal tract, air flowing over the side of the tongue, e.g. [l] “lag”
• Affricates: oral stop + fricative, e.g. [tʃ] “church”

• Trills: articulator vibrating in place, e.g. [r] “perro” (Spanish)

• Taps/flaps: e.g., [ɾ] “pity” (US English)

• Lateral fricatives: e.g., [ɬ] “cyllell” (Welsh)

Places of Articulation (again)
A lot of distinctions potentially happening around the alveolar ridge!

From Ladefoged & Johnson (Fig 1.8): Sagittal section of vocal tract (extra detail of
the “coronal” area on the right)
Non-pulmonic Consonants
Consonants made with a glottalic airstream:

• Close the glottis (make a glottal stop)

• Ejectives: move air out of the mouth by raising the glottis

• Implosives: move air in through the mouth by lowering the glottis

Clicks, implosives and ejectives are common in a wide variety of languages!

What about Vowels?

Vowels are characterized by tongue position (front-back, high-low) and lip rounding
Transcription of Consonants
Depending on our goals we may or may
not want to capture all the phonetic
detail possible in transcription

e.g. We don’t necessarily need to distinguish [k] and [k’], i.e., the ejective, in English if our end goal is automatic transcription. But we do need to distinguish [k] from [t].

From Ladefoged & Johnson, Chapter 2.

Transcription: Contrasts
We can use minimal pairs of words to
figure out which sounds are
contrastive.

When two sounds distinguish different words, we call the difference phonemic.

Thus we distinguish phonemes (speech sounds that distinguish words) from phones (distinctive speech sounds in general).

From Ladefoged & Johnson, Chapter 2.
Transcription of Vowels
Vowel transcription can be even trickier: (English) accents vary more in vowels than in consonants.

We can still look for minimal pairs to determine if differences in height, backness, rounding are contrastive.

From Ladefoged & Johnson, Chapter 2.

Pronunciation differences between
accents are often systematic.

If we can learn rules that govern these differences, we can reuse our knowledge of one accent to infer pronunciation for another.

We’ll see the idea of a compact pronunciation lexicon is very important for speech technology later in the course!

From Ladefoged & Johnson, Chapter 2.

• We’re looking at human speech production (and later perception) so we can understand if
our computational models are doing what they need to!

• We can describe human speech in terms of speech production mechanisms (airstream, phonation, oro-nasal, articulation)

• We use voicing, manner, and place of articulation to describe differences in consonants

• We use height, backness, rounding to describe vowels

• We use the International Phonetic Alphabet to represent differences in speech sounds

• These articulatory differences are systematized in the IPA chart

