
Speech Processing Course Guide

Uploaded by

fmlomat


Speech Processing

Course Code : CS300

Course Overview

Course Specification

Course Plan
Chapter 1
Introduction
Simple Periodic Waves (sine waves)
• Characterized by:
– period: T
– amplitude: A
– phase: φ
• Fundamental frequency in cycles per second, or Hz
– F0 = 1/T
[Figure: a sine wave over 0 to 0.02 s (x-axis: Time (s); y-axis from –0.99 to 0.99), with one cycle marked]
Simple periodic waves

• Computing the frequency of a wave:
– 5 cycles in 0.5 seconds = 10 cycles/second = 10 Hz
• Amplitude:
– 1
• Equation:
– Y = A sin(2πft)
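The frequency computation and the sine equation above can be sketched in a few lines of Python. This is only an illustration: the function names and the sample rate of 1000 samples per second are assumptions chosen for the example, not part of the course material.

```python
import math

def frequency_from_cycles(cycles, seconds):
    """Frequency in Hz = number of cycles / duration in seconds."""
    return cycles / seconds

def sine_wave(amplitude, freq_hz, duration_s, sample_rate):
    """Return samples of Y = A sin(2*pi*f*t), taken at t = i / sample_rate."""
    n = int(round(duration_s * sample_rate))
    return [amplitude * math.sin(2 * math.pi * freq_hz * i / sample_rate)
            for i in range(n)]

# 5 cycles in 0.5 seconds, as in the slide
f = frequency_from_cycles(5, 0.5)      # 10.0 Hz
period = 1 / f                         # T = 1/F0 = 0.1 s

# Amplitude 1; sample_rate=1000 is an assumed value for illustration
wave = sine_wave(1.0, f, 0.5, 1000)
```

Here `wave` holds 500 samples covering the five cycles, and its largest value is the amplitude A = 1.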
Speech sound waves

A little piece from the waveform of the vowel [iy]

Y axis:
• Amplitude = amount of air pressure at that time point
• Positive is compression
• Zero is normal air pressure
• Negative is rarefaction
Digitizing Speech

Analog-to-digital conversion, or A/D conversion.

Three steps
• Sampling
• Quantization
• Coding

[Figure: Mic → Sampler → Quantizer → Encoder (the analog-to-digital converter)]
Sampling
• Measuring the amplitude of the signal at time t
• The sampling rate needs to provide at least two samples for each cycle
– Roughly speaking, one for the positive and one for the negative half of each cycle.
– More than two samples per cycle is OK
– Fewer than two samples per cycle will cause frequencies to be missed
– So the maximum frequency that can be measured is one that is half the sampling rate.
– The maximum frequency for a given sampling rate is called the Nyquist frequency
Sampling
[Figure: original signal in red, sample points as green dots]
If we measure only at the green dots, we will see a lower-frequency wave and miss the correct higher-frequency one!
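The effect described above (aliasing) can be checked numerically. In this sketch, the 7 Hz and 3 Hz tones and the 10 Hz sampling rate are assumed values picked for illustration: a 7 Hz sine sampled 10 times per second yields exactly the same sample points as a 3 Hz sine with inverted sign, so the two cannot be told apart after sampling.

```python
import math

def sample_sine(freq_hz, sample_rate, n_samples):
    """Sample sin(2*pi*f*t) at t = n / sample_rate."""
    return [math.sin(2 * math.pi * freq_hz * n / sample_rate)
            for n in range(n_samples)]

def apparent_frequency(freq_hz, sample_rate):
    """Frequency between 0 and sample_rate/2 that the sampled sine looks like."""
    folded = freq_hz % sample_rate
    return min(folded, sample_rate - folded)

sample_rate = 10                       # assumed: 10 samples per second
high = sample_sine(7, sample_rate, 20) # above the Nyquist frequency (5 Hz)
low = sample_sine(3, sample_rate, 20)  # below the Nyquist frequency

# Every sample of the 7 Hz wave equals minus the matching 3 Hz sample
same_points = all(abs(h + l) < 1e-9 for h, l in zip(high, low))
```

`apparent_frequency(7, 10)` gives 3, while a 4 Hz tone, which is below half the sampling rate, keeps its own frequency.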
Sampling
In practice, then, we use the following sample rates.
• 16,000 Hz (samples/sec): Microphone (“Wideband”)
• 8,000 Hz (samples/sec): Telephone
Why?
• Need at least 2 samples per cycle
• Max measurable frequency is half the sampling rate
• Human speech < 10,000 Hz, so need at most 20K
• Telephone is filtered at 4K, so 8K is enough
Sampling Theorem:
The sampling frequency must be at least 2 × the maximum frequency of the signal:

fs ≥ 2fm
where fs is the sampling frequency
and fm is the maximum frequency of the signal to be sampled.
Quantization
Definition:
“Representing the real value of each amplitude as an integer”

8-bit (-128 to 127) or 16-bit (-32768 to 32767)

Formats:
• 16-bit PCM (Pulse Code Modulation)
• 8-bit log compression
Headers:
• Raw (no header)
• Microsoft: filename.wav
• Sun: filename.au
[Figure: WAV format, showing a 40 byte header]
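As a rough illustration of the quantization step, the sketch below maps a real-valued amplitude to a signed integer. It assumes the input is already scaled to the range -1.0 to 1.0 and uses plain linear rounding with clipping; the function name `quantize` is hypothetical, and the 8-bit log compression mentioned above is not covered.

```python
def quantize(sample, bits=16):
    """Map a real amplitude in [-1.0, 1.0] to a signed integer of the given width."""
    max_int = 2 ** (bits - 1) - 1      # 32767 for 16 bits, 127 for 8 bits
    min_int = -(2 ** (bits - 1))       # -32768 for 16 bits, -128 for 8 bits
    value = int(round(sample * max_int))
    # Clip anything outside the representable range
    return max(min_int, min(max_int, value))

silence = quantize(0.0)        # 0
loudest_16 = quantize(1.0)     # 32767
loudest_8 = quantize(1.0, 8)   # 127
```

Amplitudes outside the assumed range are clipped to the nearest representable integer rather than wrapping around.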
Fundamental frequency

[Figure: waveform of the vowel [iy] (10 reps in 0.03875 secs)]

Frequency: repetitions/second of a wave

• The vowel above has 10 repetitions in 0.03875 secs
• So the frequency is 10/0.03875 = 258 Hz
• This is the rate at which the vocal folds vibrate, hence voicing
• Each peak corresponds to an opening of the vocal folds
• The frequency of the complex wave is called the fundamental frequency of the wave, or F0
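Counting repetitions per second can be automated in a crude way by counting upward zero crossings. The sketch below does this on a synthetic 258 Hz sine; the 16,000 Hz sampling rate and one-second duration are assumed for illustration. It only works for clean, simple waves; real speech needs a more robust pitch tracker.

```python
import math

def count_upward_zero_crossings(samples):
    """Count places where the signal goes from negative to non-negative."""
    return sum(1 for a, b in zip(samples, samples[1:]) if a < 0 <= b)

def estimate_f0(samples, sample_rate):
    """Repetitions per second, taking one upward crossing per repetition."""
    duration_s = len(samples) / sample_rate
    return count_upward_zero_crossings(samples) / duration_s

sample_rate = 16000                    # assumed wideband rate
true_f0 = 258                          # Hz, as computed in the slide
wave = [math.sin(2 * math.pi * true_f0 * i / sample_rate)
        for i in range(sample_rate)]   # one second of signal

estimated_f0 = estimate_f0(wave, sample_rate)
```

The estimate lands within a cycle or so of 258 Hz; the small error comes from partial cycles at the edges of the analysed stretch.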
Amplitude
• We need a way to talk about the amplitude of a region of a signal (frame) over time.
• We can’t just average all the values. Why not? Because positive and negative values cancel, so the average ≈ zero
• So we often talk about the Root Mean Square (RMS) amplitude

\[ A_{RMS} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} x[i]^2} \]

“The square Root of the Mean of the Squares of the samples”
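A small sketch makes the point about averaging concrete. The frame here is one full cycle of a unit-amplitude sine, an assumed test signal: its plain mean is essentially zero, while its RMS amplitude is about 0.707.

```python
import math

def mean(frame):
    """Plain average of the samples."""
    return sum(frame) / len(frame)

def rms(frame):
    """Square root of the mean of the squares of the samples."""
    return math.sqrt(sum(x * x for x in frame) / len(frame))

# One full cycle of a sine with amplitude 1 (100 samples, assumed frame size)
frame = [math.sin(2 * math.pi * i / 100) for i in range(100)]

frame_mean = mean(frame)   # close to 0: positive and negative halves cancel
frame_rms = rms(frame)     # close to 1/sqrt(2)
```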
Power and Intensity
Power: related to square of amplitude

\[ \mathrm{Power} = \frac{1}{N} \sum_{i=1}^{N} x[i]^2 \]

Intensity in air: power normalized to auditory threshold, given in dB.

P0 is the auditory threshold pressure = 2 × 10^-5 Pa

\[ \mathrm{Intensity} = 10 \log_{10}\left(\frac{\mathrm{Power}}{P_0}\right) = 10 \log_{10}\left(\frac{1}{N P_0} \sum_{i=1}^{N} x[i]^2\right) \]
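The power and intensity formulas translate directly into code. The sketch below follows them exactly as given above, with P0 = 2 × 10^-5; the sample values in the example frame are made up for illustration and are assumed to be in units compatible with P0.

```python
import math

P0 = 2e-5  # auditory threshold pressure from the slide, in Pa

def power(frame):
    """Mean of the squared samples."""
    return sum(x * x for x in frame) / len(frame)

def intensity_db(frame, p0=P0):
    """10 * log10(power / P0), as in the intensity formula above."""
    return 10 * math.log10(power(frame) / p0)

frame = [0.1, -0.1, 0.1, -0.1]   # made-up samples
frame_power = power(frame)       # about 0.01
frame_db = intensity_db(frame)   # 10 * log10(500), about 27 dB
```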
Plot of Intensity
Pitch and Loudness
• Pitch is the mental sensation, or perceptual correlate, of F0.
• The relationship between pitch and F0 is not linear; human pitch perception is most accurate between 100 Hz and 1000 Hz.
– Linear in this range
– Logarithmic above 1000 Hz
• The Mel scale is one model of this F0-pitch mapping.
– A mel is a unit of pitch defined so that pairs of sounds which are perceptually equidistant in pitch are separated by an equal number of mels

Frequency in mels = 1127 ln (1 + f/700)
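The mel formula is easy to try out. In this sketch, `hz_to_mel` implements the formula above; `mel_to_hz` is its algebraic inverse, added here only as a convenience and not taken from the slides.

```python
import math

def hz_to_mel(f_hz):
    """Frequency in mels = 1127 ln(1 + f/700)."""
    return 1127 * math.log(1 + f_hz / 700)

def mel_to_hz(mels):
    """Inverse of the formula above: f = 700 (exp(m/1127) - 1)."""
    return 700 * (math.exp(mels / 1127) - 1)

mel_1000 = hz_to_mel(1000)   # very close to 1000 mels
mel_4000 = hz_to_mel(4000)   # far less than 4000: the scale compresses high frequencies
```

With these constants, 1000 Hz maps to almost exactly 1000 mels, while higher frequencies are squeezed together.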

Pitch track
[Figure: pitch track]

Pitch

RETONE: manipulate pitch contour.
Record some speech and listen to what happens when you adjust its pitch contour.
She just had a baby

• Note that vowels all have regular amplitude peaks
• Stop consonant
– Closure followed by release
– Notice the silence followed by slight bursts of emphasis: very clear for [b] of “baby”
• Fricative: noisy. [sh] of “she” at the beginning
Waves have different frequencies
[Figure: two sine waves over 0 to 0.02 s (x-axis: Time (s); y-axis from –0.99 to 0.99), one at 100 Hz and one at 1000 Hz]
Complex waves: Adding a 100 Hz and 1000 Hz wave together
[Figure: the summed wave over 0 to 0.05 s (x-axis: Time (s); y-axis from –0.9654 to 0.99)]

The Discrete Fourier Transform (DFT)

\[ x(n) \leftrightarrow X(e^{j\omega}) \]

\[ X(e^{j\omega}) = \sum_{n=-\infty}^{\infty} x(n)\, e^{-j\omega n} \]

Notes:
• X(e^{jω}) is a complex-valued continuous function
• ω = 2πf [rad/sec]
• f is the digital frequency measured in [C/S]

The Discrete Fourier Transform (DFT)
Spectrum Analysis (Cont.)

\[
\begin{aligned}
X(e^{j\omega}) &= \sum_{n=-\infty}^{\infty} x(n)\, e^{-j\omega n} \\
&= \sum_{n=-\infty}^{\infty} x(n)\left[\cos(\omega n) - j \sin(\omega n)\right] \\
&= \sum_{n=-\infty}^{\infty} x(n) \cos(\omega n) - j \sum_{n=-\infty}^{\infty} x(n) \sin(\omega n)
\end{aligned}
\]
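The expansion into cosine and sine sums can be verified numerically for a finite signal. This is a sketch under the assumption that x(n) is zero outside the listed samples, so the infinite sum reduces to a finite one; the function names and the sample values are made up for the example.

```python
import cmath
import math

def transform_exp(x, omega):
    """Sum of x(n) * e^(-j*omega*n) over the available samples."""
    return sum(xn * cmath.exp(-1j * omega * n) for n, xn in enumerate(x))

def transform_cos_sin(x, omega):
    """Same quantity written as a cosine sum minus j times a sine sum."""
    real_part = sum(xn * math.cos(omega * n) for n, xn in enumerate(x))
    imag_part = -sum(xn * math.sin(omega * n) for n, xn in enumerate(x))
    return complex(real_part, imag_part)

x = [1.0, 2.0, 3.0, 4.0]   # made-up finite signal
omega = 0.3                # an arbitrary frequency value

via_exp = transform_exp(x, omega)
via_cos_sin = transform_cos_sin(x, omega)
difference = abs(via_exp - via_cos_sin)   # should be negligible
```

At ω = 0 every exponential equals 1, so the transform is simply the sum of the samples.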

ESynth - Mark Huckvale - University College London (speechandhearing.net)
Spectrum
[Figure: spectrum of the summed wave; x-axis: frequency in Hz, showing the frequency components at 100 and 1000 Hz; y-axis: amplitude]

Fourier analysis:
any wave can be represented as the
(infinite) sum of sine waves of different
frequencies (amplitude, phase)
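To see Fourier analysis recover the components of a complex wave, the sketch below adds a 100 Hz and a 1000 Hz sine and evaluates a direct (slow) discrete Fourier transform. The sampling rate of 8000 Hz and the 800-sample window are assumptions chosen so that each frequency bin is exactly 10 Hz wide; practical code would use a fast Fourier transform library instead.

```python
import cmath
import math

def dft_magnitudes(samples):
    """Magnitude of each DFT bin from 0 up to half the sampling rate."""
    n = len(samples)
    mags = []
    for k in range(n // 2 + 1):
        total = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                    for i, x in enumerate(samples))
        mags.append(abs(total))
    return mags

sample_rate = 8000   # assumed
n_samples = 800      # 0.1 s of signal, so bins are 10 Hz apart
wave = [math.sin(2 * math.pi * 100 * i / sample_rate)
        + math.sin(2 * math.pi * 1000 * i / sample_rate)
        for i in range(n_samples)]

mags = dft_magnitudes(wave)
# Pick the two strongest bins and convert bin index to Hz
strongest = sorted(range(len(mags)), key=lambda k: mags[k], reverse=True)[:2]
peaks_hz = sorted(k * sample_rate / n_samples for k in strongest)
```

The two strongest bins sit at 100 Hz and 1000 Hz, the two sine waves that were added together.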

Spectrum of one instant in an actual sound wave: many components across the frequency range
[Figure: spectrum; x-axis: Frequency (Hz) from 0 to 5000]
Part of [ae] waveform from “had”

• Note the complex wave repeating nine times in the figure
• Plus a smaller wave which repeats 4 times for every large pattern
• The large wave has a frequency of 250 Hz (9 times in 0.036 seconds)
• The small wave is roughly 4 times this, or roughly 1000 Hz
• Two tiny waves on top of each peak of the 1000 Hz wave
Back to spectrum
The spectrum represents these frequency components, computed by the Fourier transform, an algorithm which separates out each frequency component of a wave.

The x-axis shows frequency; the y-axis shows magnitude (in decibels, a log measure of amplitude).
Peaks at 930 Hz, 1860 Hz, and 3020 Hz.
Spectrogram: spectrum + time dimension

Note that the grey level represents the amplitude or energy.

Seeing formants: the spectrogram
[Figure: spectrogram with the first formant (F1), second formant (F2), and third formant (F3) labeled]

Formants
Vowels are largely distinguished by 2 characteristic pitches (F1 and F2).
One of them (the higher of the two) goes downward throughout the series iy ih eh ae aa ao ou u.
The other goes up for the first four vowels and then down for the next four.
These are called the “formants” of the vowels; the lower is the 1st formant, the higher is the 2nd formant.
Different vowels have different formants
• The vocal tract acts as an "amplifier"; it amplifies different frequencies
• Formants are the result of different shapes of the vocal tract.
• Any body of air will vibrate in a way that depends on its size and shape.
• Air in the vocal tract is set in vibration by the action of the vocal cords.
• Every time the vocal cords open and close, a pulse of air from the lungs acts like a sharp tap on the air in the vocal tract,
• setting the resonating cavities into vibration so that they produce a number of different frequencies.

Again: why is a speech sound wave composed of these peaks?

Articulatory facts:
1. The vocal cord vibrations create harmonics
2. The mouth is an amplifier
3. Depending on shape of mouth, some harmonics are
amplified more than others
How Formants are produced
• Q: Why do vowels have different pitches if the vocal cords vibrate at the same rate?
• A: This is a confusion of the frequencies of the SOURCE and the frequencies of the FILTER!

[Figure: Source (Vocal Cords) → Filter (Vocal Tract) → Speech. The source sets the fundamental frequency F0; the filter sets the formants F1, F2, F3]
Source-filter model of speech production
[Figure: Input: glottal spectrum (source) → Filter: vocal tract frequency response function → Output]

Glottis: the vocal cords and the opening between them

Source and filter are independent, so:

• Different vowels can have the same pitch:
– when the vocal cords vibrate at the same rate (the sources are identical) while the vocal tract shapes (filter responses) differ.
• The same vowel can have different pitch:
– e.g., different speakers.
Deriving schwa: how shape of mouth (filter function) creates peaks!

Basic facts about sound waves:

• f = c/λ
• c = speed of sound (approx. 35,000 cm/sec)
• A sound with λ = 10 meters has low frequency f = 35 Hz (35,000/1000)
• A sound with λ = 2 centimeters has high frequency f = 17,500 Hz (35,000/2)
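The two worked values above follow from f = c/λ once the wavelength is expressed in centimeters. A minimal sketch, with a hypothetical helper name:

```python
SPEED_OF_SOUND_CM_PER_S = 35_000   # approximate value used in the slide

def frequency_hz(wavelength_cm, c=SPEED_OF_SOUND_CM_PER_S):
    """f = c / wavelength, with both in centimeter units."""
    return c / wavelength_cm

low = frequency_hz(10 * 100)   # 10 meters = 1000 cm -> 35 Hz
high = frequency_hz(2)         # 2 cm -> 17,500 Hz
```

Note the unit conversion: the 10 meter wavelength must be written as 1000 cm before dividing.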
Resonances of the vocal tract
• The human vocal tract as an open tube
[Figure: tube with a closed end and an open end; length 17.5 cm]
• Air in a tube of a given length will tend to vibrate at the resonance frequency of the tube.
• If the vocal tract is a cylindrical tube open at one end:
• Standing waves form in tubes
• Waves will resonate if their wavelength corresponds to the dimensions of the tube
• Constraint: the pressure differential should be maximal at the (closed) glottal end and minimal at the (open) lip end.
• The next slide shows what wavelengths can fit into a tube with this constraint
[Figure: standing waves in the tube, with max energy at closed ends and min energy at open ends]
Computing the 3 formants of schwa
Let the length of the tube be L

F1 = c/λ1 = c/(4L) = 35,000/(4 × 17.5) = 500 Hz
F2 = c/λ2 = c/((4/3)L) = 3c/(4L) = 3 × 35,000/(4 × 17.5) = 1500 Hz
F3 = c/λ3 = c/((4/5)L) = 5c/(4L) = 5 × 35,000/(4 × 17.5) = 2500 Hz

So we expect a neutral vowel to have 3 resonances at 500, 1500, and 2500 Hz

These vowel resonances are called Formants
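The three computations above share one pattern: the n-th resonance of a tube closed at one end is (2n - 1) × c/(4L). A short sketch, with the function name and default arguments chosen for illustration:

```python
def tube_formants(length_cm=17.5, c_cm_per_s=35_000, count=3):
    """Resonances of a tube closed at one end: F_n = (2n - 1) * c / (4L)."""
    return [(2 * n - 1) * c_cm_per_s / (4 * length_cm)
            for n in range(1, count + 1)]

schwa = tube_formants()                    # [500.0, 1500.0, 2500.0]
shorter = tube_formants(length_cm=14.0)    # a shorter tube gives higher resonances
```

Changing `length_cm` shows how tube length shifts all the resonances together, which this simple uniform-tube model predicts.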

Vowel [i] sung at successively higher pitch.
[Figure: seven panels, numbered 1 to 7]
Vocal Tract Simulation
Parameter   Meaning
Time        Total time
ms          Segment Duration
JW          Jaw Position
TP          Tongue Position
TS          Tongue Shape
TA          Tongue Expansion
LA          Lip Aperture
LP          Lip Protrusion
LH          Larynx Height
GA          Glottal Aperture
FX          Fundamental Frequency
NS          Velo-pharyngeal port opening
Vocal Tract Simulation

VTDEMO: vocal tract synthesizer

How to read spectrograms

• bab: closure of the lips lowers all formants, so there is a rapid increase in all formants at the beginning of “bab”
• dad: the first formant increases, but F2 and F3 fall slightly
• gag: F2 and F3 come together; this is a characteristic of velars. Formant transitions take longer in velars than in alveolars or labials
She came back and started again

1. lots of high-freq energy

3. closure for k
4. burst of aspiration for k
5. ey vowel; the faint 1100 Hz formant is nasalization
6. bilabial nasal
7. short b closure, voicing barely visible.
8. ae; note upward transitions after bilabial stop at beginning
9. note F2 and F3 coming together for "k"
Phonetic Resources
• Phonetic dictionaries
– CMU dict
– CELEX
• Phonetically transcribed corpora
– TIMIT
– Switchboard
TIMIT
• Read speech corpus, time aligned
Switchboard
• Spontaneous speech corpus
• Telephone conversations between strangers
[Figure: time alignments for “They’re kind of in between right now”]
Summary
Acoustic Phonetics
Waves, sound waves, and spectra
Speech waveforms
F0, pitch, intensity
Spectra
Spectrograms
Formants
Reading spectrograms
Deriving schwa: why are formants where they are
PRAAT
Resources: dictionaries and phonetically-labeled corpora.
Examples

pad

bad

spat
Useful Textbooks
Useful Textbooks (Cont.)
Software Resources
• Snack Speech Toolkit
– http://speech.kth.se/snack/
• OGI Speech Toolkit
• University of Colorado SONIC recognizer
– http://cslr.colorado.edu
• Cambridge Hidden Markov Model Toolkit (HTK)
• CMU Sphinx-II Speech Recognizer
• NIST Speech Recognition Scoring Utilities
• SRI Language Model Toolkit
• CMU / Cambridge Language Model Toolkit
Literature Resources
Conference Proceedings
• International Conference on Acoustics, Speech, and Signal Processing (ICASSP)
• International Conference on Spoken Language Processing (ICSLP)
• Eurospeech
Journal Publications
• Speech Communication
• IEEE Transactions on Speech and Audio Processing
Useful Website

Internet Institute for Speech and Hearing
