Speech Processing
Course Code : CS300
Course Overview
Course Specification
Course Plan
Chapter 1
Introduction
Simple Periodic Waves (sine waves)
• Characterized by:
• period: T
• amplitude: A
• phase: φ
• Fundamental frequency in cycles per second, or Hz
• F0 = 1/T
[Figure: one cycle of a sine wave; amplitude from -0.99 to 0.99, time from 0 to 0.02 s]
Simple periodic waves
Computing the frequency of a wave:
• 5 cycles in .5 seconds = 10 cycles/second = 10 Hz
Amplitude:
• 1
Equation:
• y = A sin(2πft)
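The sine-wave equation can be tried out directly. The following is a minimal sketch in Python, not part of the original slides; the function name `sine_wave`, the sample rate of 1000 Hz, and the optional `phase` argument are assumptions chosen for illustration.

```python
import math

def sine_wave(amplitude, freq_hz, duration_s, sample_rate_hz, phase=0.0):
    """Return samples of y = A sin(2*pi*f*t + phase), one per sampling instant."""
    n_samples = int(round(duration_s * sample_rate_hz))
    return [
        amplitude * math.sin(2 * math.pi * freq_hz * n / sample_rate_hz + phase)
        for n in range(n_samples)
    ]

# The slide's example: a 10 Hz wave with amplitude 1, observed for 0.5 s
# (5 cycles). The 1000 Hz sample rate is an arbitrary illustrative choice.
wave = sine_wave(amplitude=1.0, freq_hz=10.0, duration_s=0.5, sample_rate_hz=1000)
print(len(wave), max(wave), min(wave))
```

The largest and smallest sample values should sit at about +A and -A, which is what "amplitude" means on the slide.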
Speech sound waves
A little piece from the waveform of the vowel [iy]
Y axis:
• Amplitude = amount of air pressure at that time point
• Positive is compression
• Zero is normal air pressure
• Negative is rarefaction
Digitizing Speech
Analog-to-digital conversion, or A/D conversion.
Three steps:
• Sampling
• Quantization
• Coding
[Figure: analog-to-digital converter: Mic → Sampler → Quantizer → Encoder]
Sampling
Measuring the amplitude of the signal at time t
The sampling rate must provide at least two samples for each cycle
• Roughly speaking, one for the positive and one for the negative half of each cycle.
• More than two samples per cycle is OK
• Fewer than two samples per cycle will cause frequencies to be missed
• So the maximum frequency that can be measured is half the sampling rate.
• The maximum frequency for a given sampling rate is called the Nyquist frequency
Sampling
[Figure: original signal in red, sample points as green dots]
If we measure only at the green dots, we will see a lower-frequency wave and miss the correct higher-frequency one (aliasing)!
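The effect in the figure can be reproduced numerically. This is a small sketch under assumed values (a 7 Hz sine sampled at only 10 Hz); the helper names `sample_sine` and `alias_frequency` are hypothetical and not from the slides.

```python
import math

def sample_sine(freq_hz, sample_rate_hz, n_samples):
    """Sample sin(2*pi*f*t) at t = n / sample_rate for n = 0 .. n_samples-1."""
    return [math.sin(2 * math.pi * freq_hz * n / sample_rate_hz)
            for n in range(n_samples)]

def alias_frequency(freq_hz, sample_rate_hz):
    """Apparent frequency after sampling, folded into the range 0 .. fs/2."""
    folded = freq_hz % sample_rate_hz
    return min(folded, sample_rate_hz - folded)

fs = 10.0                          # too slow for a 7 Hz wave (Nyquist is 5 Hz)
high = sample_sine(7.0, fs, 20)    # the "correct" higher-frequency wave
low = sample_sine(3.0, fs, 20)     # the lower-frequency wave we end up seeing

# At these sample points the 7 Hz wave equals an inverted 3 Hz wave,
# so the two cannot be told apart from the samples alone.
indistinguishable = all(abs(h + l) < 1e-9 for h, l in zip(high, low))
print(alias_frequency(7.0, fs), indistinguishable)
```

Any sample rate above 14 Hz would keep the 7 Hz wave below the Nyquist frequency and avoid the confusion.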
Sampling
In practice, then, we use the following sample rates:
• Microphone (“Wideband”): 16,000 Hz (samples/sec)
• Telephone: 8,000 Hz (samples/sec)
Why?
• Need at least 2 samples per cycle
• Max measurable frequency is half the sampling rate
• Human speech < 10,000 Hz, so need at most 20 kHz
• Telephone speech is filtered at 4 kHz, so 8 kHz is enough
Sampling Theorem:
The sampling frequency must be at least 2 × the maximum frequency of the signal:
fs ≥ 2fm
where fs is the sampling frequency
and fm is the maximum frequency of the signal to be sampled.
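The arithmetic behind these sample rates is short enough to check in code. A minimal sketch; the function names are illustrative, not from the slides.

```python
def nyquist_frequency(sample_rate_hz):
    """Highest frequency that a given sampling rate can represent."""
    return sample_rate_hz / 2

def min_sample_rate(max_freq_hz):
    """Lowest sampling rate satisfying fs >= 2 * fm."""
    return 2 * max_freq_hz

print(nyquist_frequency(16000))  # wideband microphone speech
print(min_sample_rate(4000))     # telephone speech filtered at 4 kHz
print(min_sample_rate(10000))    # speech assumed to stay below 10 kHz
```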
Quantization
Definition:
“Representing the real value of each amplitude as an integer”
8-bit (-128 to 127) or 16-bit (-32768 to 32767)
Formats:
• 16-bit PCM (Pulse Code Modulation)
• 8-bit log compression
Headers:
• Raw (no header)
• Microsoft: filename.wav
• Sun: filename.au
[Figure: WAV format, with a 40 byte header]
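The quantization and coding steps can be sketched as follows. This is one simple linear scheme under stated assumptions: input amplitudes are taken to lie in [-1.0, 1.0], out-of-range values are clipped, and the output is raw (headerless) little-endian 16-bit PCM. The function names are hypothetical.

```python
import struct

def quantize(sample, n_bits=16):
    """Map a real amplitude in [-1.0, 1.0] to a signed n_bits integer (linear)."""
    max_int = 2 ** (n_bits - 1) - 1    # 32767 for 16 bits, 127 for 8 bits
    min_int = -(2 ** (n_bits - 1))     # -32768 for 16 bits, -128 for 8 bits
    value = int(round(sample * max_int))
    return max(min_int, min(max_int, value))  # clip anything out of range

def encode_pcm16(samples):
    """Coding step: pack quantized samples as raw little-endian 16-bit PCM."""
    ints = [quantize(s, 16) for s in samples]
    return struct.pack("<%dh" % len(ints), *ints)

raw = encode_pcm16([0.0, 1.0, -1.0])
print(quantize(1.0), quantize(1.0, n_bits=8), len(raw))
```

Each 16-bit sample occupies two bytes, so the raw data for three samples is six bytes long.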
Fundamental frequency
[Figure: waveform of the vowel [iy] (10 repetitions in 0.03875 s)]
Frequency: repetitions/second of a wave
• The vowel above has 10 repetitions in 0.03875 s
• So the frequency is 10/0.03875 ≈ 258 Hz
• This is the rate at which the vocal folds vibrate, hence voicing
• Each peak corresponds to an opening of the vocal folds
• The frequency of the complex wave is called the fundamental frequency of the wave, or F0
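The F0 calculation above is a single division, shown here as a small sketch (the function name is illustrative):

```python
def fundamental_frequency(repetitions, duration_s):
    """F0 in Hz: number of repetitions of the waveform divided by elapsed time."""
    return repetitions / duration_s

f0 = fundamental_frequency(10, 0.03875)   # the [iy] example above
period_s = 1 / f0                         # T = 1 / F0
print(round(f0), period_s)
```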
Amplitude
• We need a way to talk about the amplitude of a region of a signal (a frame) over time.
• We can’t just average all the values. Why not? Because the average ≈ zero (positive and negative values cancel)
• So we often talk about the Root Mean Square (RMS) amplitude:
$$A_{RMS} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} x[i]^2}$$
“The square Root of the Mean of the Squares of the samples”
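A short sketch of why the plain average fails and the RMS does not. The four-sample frame is made up for illustration.

```python
import math

def mean(frame):
    """Plain average of the samples in a frame."""
    return sum(frame) / len(frame)

def rms_amplitude(frame):
    """Square root of the mean of the squares of the samples."""
    return math.sqrt(sum(x * x for x in frame) / len(frame))

frame = [1.0, -1.0, 1.0, -1.0]   # a toy frame that swings between +1 and -1
print(mean(frame), rms_amplitude(frame))
```

The mean is zero even though the signal is clearly not silent, while the RMS reports an amplitude of 1.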
Power and Intensity
Power: related to the square of the amplitude
$$\text{Power} = \frac{1}{N} \sum_{i=1}^{N} x[i]^2$$
Intensity in air: power normalized to the auditory threshold, given in dB.
P0 is the auditory threshold pressure = 2 × 10^-5 Pa
$$\text{Intensity} = 10 \log_{10}\left(\frac{\text{Power}}{P_0}\right) = 10 \log_{10}\left(\frac{1}{N P_0} \sum_{i=1}^{N} x[i]^2\right)$$
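The two formulas can be sketched in code as written on the slide, dividing the power by P0. The sample values below are invented so that the power is exactly 100 times P0; real pressure samples would come from a calibrated recording.

```python
import math

P0 = 2e-5  # auditory threshold value used on the slide

def power(frame):
    """Mean of the squared samples."""
    return sum(x * x for x in frame) / len(frame)

def intensity_db(frame, p0=P0):
    """10 * log10(power / p0), following the slide's formula literally."""
    return 10 * math.log10(power(frame) / p0)

# Toy frame whose power is 100 * P0, so the intensity should be about 20 dB.
a = math.sqrt(100 * P0)
frame = [a, -a, a, -a]
print(intensity_db(frame))
```

Note that a frame of all zeros has zero power, for which the logarithm is undefined; a practical version would need to guard against that.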
Plot of Intensity
Pitch and Loudness
• Pitch is the mental sensation, or perceptual correlate, of F0.
• The relationship between pitch and F0 is not linear; human pitch perception is most accurate between 100 Hz and 1000 Hz.
  Linear in this range
  Logarithmic above 1000 Hz
• The Mel scale is one model of this F0-pitch mapping.
  A mel is a unit of pitch defined so that pairs of sounds which are perceptually equidistant in pitch are separated by an equal number of mels
  Frequency in mels = 1127 ln(1 + f/700)
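The mel formula is easy to evaluate. In this sketch `hz_to_mel` is the slide's formula; `mel_to_hz` is its algebraic inverse, added here for convenience and not taken from the slides.

```python
import math

def hz_to_mel(freq_hz):
    """Frequency in mels = 1127 ln(1 + f/700)."""
    return 1127 * math.log(1 + freq_hz / 700)

def mel_to_hz(mels):
    """Inverse of hz_to_mel, obtained by solving the formula for f."""
    return 700 * (math.exp(mels / 1127) - 1)

for f in (100, 1000, 4000, 8000):
    print(f, round(hz_to_mel(f), 1))
```

With these constants 1000 Hz maps to roughly 1000 mels, and equal steps in Hz above that point correspond to ever smaller steps in mels.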
Pitch track
Pitch
RETONE: manipulate pitch contour.
Record some speech and listen to what happens when you
adjust its pitch contour.
She just had a baby
• Note that the vowels all have regular amplitude peaks
• Stop consonant: closure followed by release
  Notice the silence followed by slight bursts of emphasis: very clear for the [b] of “baby”
• Fricative: noisy, e.g. the [sh] of “she” at the beginning
Waves have different frequencies
[Figure: two sine waves, 100 Hz and 1000 Hz; amplitude from -0.99 to 0.99, time from 0 to 0.02 s]
Complex waves: Adding a 100 Hz and 1000 Hz wave together
[Figure: sum of the 100 Hz and 1000 Hz waves; amplitude from -0.9654 to 0.99, time from 0 to 0.05 s]
The Discrete Fourier Transform (DFT)
$$x[n] \;\longleftrightarrow\; X(e^{j\omega}) = \sum_{n=-\infty}^{\infty} x[n]\, e^{-j\omega n}$$
Notes:
• $X(e^{j\omega})$ is a complex-valued continuous function
• ω = 2πf [rad/sample]
• f is the digital frequency measured in [cycles/sample]
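The Discrete Fourier Transform (DFT)
Spectrum Analysis (Cont.)
$$X(e^{j\omega}) = \sum_{n=-\infty}^{\infty} x[n]\, e^{-j\omega n}$$
$$= \sum_{n=-\infty}^{\infty} x[n] \left[\cos(\omega n) - j \sin(\omega n)\right]$$
$$= \sum_{n=-\infty}^{\infty} x[n] \cos(\omega n) \;-\; j \sum_{n=-\infty}^{\infty} x[n] \sin(\omega n)$$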
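The cosine and sine sums can be evaluated directly for a finite signal. The sketch below is a naive, slow illustration rather than a practical FFT. It assumes a finite frame of N samples and evaluates the sums only at ω = 2πk/N. The test signal is a 100 Hz plus 1000 Hz complex wave; the sample rate (4000 Hz), frame length (400), and peak threshold are arbitrary choices.

```python
import math

def dft_magnitudes(x):
    """|X| at omega_k = 2*pi*k/N for k = 0 .. N/2, using the cos/sin sums."""
    n_total = len(x)
    magnitudes = []
    for k in range(n_total // 2 + 1):
        omega = 2 * math.pi * k / n_total
        real = sum(x[n] * math.cos(omega * n) for n in range(n_total))
        imag = -sum(x[n] * math.sin(omega * n) for n in range(n_total))
        magnitudes.append(math.hypot(real, imag))
    return magnitudes

sample_rate_hz = 4000
n_total = 400   # 0.1 s of signal, so bins are 10 Hz apart
signal = [
    math.sin(2 * math.pi * 100 * n / sample_rate_hz)
    + math.sin(2 * math.pi * 1000 * n / sample_rate_hz)
    for n in range(n_total)
]

magnitudes = dft_magnitudes(signal)
# Keep only bins whose magnitude stands well above the numerical noise.
peaks_hz = [k * sample_rate_hz / n_total
            for k, m in enumerate(magnitudes) if m > n_total / 4]
print(peaks_hz)
```

The only bins with substantial magnitude should be the ones at the two component frequencies, which is the two-line spectrum a 100 Hz plus 1000 Hz wave is expected to have.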
ESynth - Mark Huckvale - University College London (speechandhearing.net)
Spectrum
[Figure: spectrum of the complex wave; amplitude on the y-axis, frequency components (100 and 1000 Hz) on the x-axis, frequency in Hz]
Fourier analysis:
any wave can be represented as the
(infinite) sum of sine waves of different
frequencies (amplitude, phase)
[Figure: spectrum of one instant in an actual sound wave, with many components across the frequency range; x-axis: frequency from 0 to 5000 Hz, y-axis from 0 to 40]
Part of [ae] waveform from “had”
• Note the complex wave repeating nine times in the figure
• Plus a smaller wave which repeats 4 times for every large pattern
• The large wave has a frequency of 250 Hz (9 times in 0.036 seconds)
• The small wave has roughly 4 times this frequency, or roughly 1000 Hz
• Two tiny waves sit on top of each peak of the 1000 Hz wave
Back to spectrum
The spectrum represents these frequency components, computed by the Fourier transform, an algorithm which separates out each frequency component of a wave.
The x-axis shows frequency; the y-axis shows magnitude (in decibels, a log measure of amplitude).
Peaks at 930 Hz, 1860 Hz, and 3020 Hz.
Spectrogram: spectrum + time dimension
[Figure: spectrogram, with frequency f on the vertical axis]
Note that the grey level represents the amplitude or energy
Seeing formants: the spectrogram
[Figure: spectrogram with the first (F1), second (F2), and third (F3) formants labeled]
Formants
Vowels are largely distinguished by 2 characteristic pitches (F1 and F2).
One of them (the higher of the two) goes downward throughout the series iy ih eh ae aa ao ou u
The other goes up for the first four vowels and then down for the next four.
These are called the “formants” of the vowels; the lower is the 1st formant, the higher is the 2nd formant.
Different vowels have different formants
• The vocal tract acts as an "amplifier"; it amplifies different frequencies
• Formants are the result of different shapes of the vocal tract.
• Any body of air will vibrate in a way that depends on its size and shape.
• Air in the vocal tract is set in vibration by the action of the vocal cords.
• Every time the vocal cords open and close, a pulse of air from the lungs acts like a sharp tap on the air in the vocal tract,
• setting the resonating cavities into vibration so that they produce a number of different frequencies.
Again: why is a speech sound wave composed of these peaks?
Articulatory facts:
1. The vocal cord vibrations create harmonics
2. The mouth is an amplifier
3. Depending on shape of mouth, some harmonics are
amplified more than others
How Formants are produced
• Q: Why do vowels have different pitches if the vocal cords vibrate at the same rate?
• A: This is a confusion of the frequencies of the SOURCE and the frequencies of the FILTER!
[Diagram: Source (vocal cords; fundamental frequency F0) → Filter (vocal tract; formants F1, F2, F3) → Speech]
Source-filter model of speech production
[Diagram: Input (glottal spectrum, the source) → Filter (vocal tract frequency response function) → Output]
Glottis: the vocal cords and the opening between them
Source and filter are independent, so:
• Different vowels can have the same pitch:
  when the vocal cords vibrate at the same rate (the sources are identical) while the cavity shapes (filter responses) differ.
• The same vowel can have different pitch:
  e.g., different speakers.
Deriving schwa: how shape of mouth (filter function) creates peaks!
Basic facts about sound waves:
f = c/λ
c = speed of sound (approx. 35,000 cm/sec)
A sound with λ = 10 meters has low frequency f = 35 Hz (35,000/1000)
A sound with λ = 2 centimeters has high frequency f = 17,500 Hz (35,000/2)
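The two examples above follow from one division once the wavelength is expressed in centimeters. A minimal sketch, using the slide's approximate speed of sound; the function name is illustrative.

```python
SPEED_OF_SOUND_CM_PER_S = 35_000  # approximate value used on the slide

def frequency_from_wavelength(wavelength_cm, c=SPEED_OF_SOUND_CM_PER_S):
    """f = c / wavelength, with both lengths in centimeters."""
    return c / wavelength_cm

print(frequency_from_wavelength(1000))  # wavelength of 10 meters = 1000 cm
print(frequency_from_wavelength(2))     # wavelength of 2 centimeters
```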
Resonances of the vocal tract
• The human vocal tract as an open tube
[Figure: tube with a closed end and an open end, length 17.5 cm]
• Air in a tube of a given length will tend to vibrate at the resonance frequency of the tube.
• Model the vocal tract as a cylindrical tube open at one end
• Standing waves form in tubes
• Waves will resonate if their wavelength corresponds to the dimensions of the tube
• Constraint: the pressure differential should be maximal at the (closed) glottal end and minimal at the (open) lip end.
• The next slide shows which wavelengths can fit into a tube with this constraint
[Figure: standing waves in the tube; max energy at the closed end, min energy at the open end]
Computing the 3 formants of schwa
Let the length of the tube be L
F1 = c/λ1 = c/(4L) = 35,000/(4 × 17.5) = 500 Hz
F2 = c/λ2 = c/((4/3)L) = 3c/(4L) = 3 × 35,000/(4 × 17.5) = 1500 Hz
F3 = c/λ3 = c/((4/5)L) = 5c/(4L) = 5 × 35,000/(4 × 17.5) = 2500 Hz
So we expect a neutral vowel to have 3 resonances, at 500, 1500, and 2500 Hz
These vowel resonances are called formants
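The three values follow one pattern: the k-th resonance of a tube closed at one end and open at the other is (2k - 1) c / (4L). A minimal sketch, assuming the slide's values of c = 35,000 cm/sec and L = 17.5 cm; the function name is illustrative.

```python
def tube_resonances(length_cm, n_resonances=3, c=35_000):
    """Resonances (Hz) of a tube closed at one end: F_k = (2k - 1) * c / (4L)."""
    return [(2 * k - 1) * c / (4 * length_cm)
            for k in range(1, n_resonances + 1)]

print(tube_resonances(17.5))
# A shorter tube (12 cm is an arbitrary illustrative length) resonates higher.
print(tube_resonances(12.0))
```

Because the resonances scale with 1/L, a shorter vocal tract shifts all of these idealized formants upward by the same factor.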
Vowel [i] sung at successively higher pitch.
[Figure: seven panels, numbered 1 to 7]
Vocal Tract Simulation
Time  Total time
ms    Segment Duration
JW    Jaw Position
TP    Tongue Position
TS    Tongue Shape
TA    Tongue Expansion
LA    Lip Aperture
LP    Lip Protrusion
LH    Larynx Height
GA    Glottal Aperture
FX    Fundamental Frequency
NS    Velo-pharyngeal port opening
VTDEMO: vocal tract synthesizer
How to read spectrograms
bab: closure of the lips lowers all formants, so there is a rapid increase in all formants at the beginning of “bab”
dad: the first formant increases, but F2 and F3 fall slightly
gag: F2 and F3 come together; this is a characteristic of velars.
Formant transitions take longer in velars than in alveolars or labials
She came back and started again
1. lots of high-freq energy
3. closure for k
4. burst of aspiration for k
5. ey vowel; the faint 1100 Hz formant is nasalization
6. bilabial nasal
7. short b closure, voicing barely visible.
8. ae; note upward transitions after bilabial stop at beginning
9. note F2 and F3 coming together for "k"
Phonetic Resources
Phonetic dictionaries
CMU dict
CELEX
Phonetically transcribed corpora
TIMIT
Switchboard
TIMIT
Read speech corpus, time aligned
Switchboard
Spontaneous speech corpus
Telephone conversations between strangers
“They’re kind of in between right now” Time alignments
Summary
Acoustic Phonetics
Waves, sound waves, and spectra
Speech waveforms
F0, pitch, intensity
Spectra
Spectrograms
Formants
Reading spectrograms
Deriving schwa: why are formants where they are
PRAAT
Resources: dictionaries and phonetically-labeled corpora.
Examples
pad
bad
spat
Useful Textbooks
Software Resources
• Snack Speech Toolkit
– http://speech.kth.se/snack/
• OGI Speech Toolkit
• University of Colorado SONIC recognizer
– http://cslr.colorado.edu
• Cambridge Hidden Markov Model Toolkit (HTK)
• CMU Sphinx-II Speech Recognizer
• NIST Speech Recognition Scoring Utilities
• SRI Language Model Toolkit
• CMU / Cambridge Language Model Toolkit
Literature Resources
Conference Proceedings
• International Conference on Acoustics, Speech, and Signal Processing (ICASSP)
• International Conference on Spoken Language Processing (ICSLP)
• Eurospeech
Journal Publications
• Speech Communication
• IEEE Transactions on Speech and Audio Processing
Useful Website
Internet Institute for Speech and Hearing