Automatic Speech Recognition: Introduction
Peter Bell
Automatic Speech Recognition — ASR Lecture 1
11 January 2020
ASR Lecture 1 Automatic Speech Recognition: Introduction 1
Automatic Speech Recognition — ASR
Course details
Lectures: About 18 lectures, delivered live on Teams for now
Labs: Weekly lab sessions – using Python, OpenFst
(openfst.org) and later Kaldi (kaldi-asr.org)
Lab sessions will start in Week 3 – exact format TBA.
Assessment:
First five lab sessions worth 10%
Coursework, building on the lab sessions, worth 40%
Open book exam in April or May worth 50%
People:
Course organiser: Peter Bell
Guest lecturers: Hiroshi Shimodaira and Yumnah Mohammied
TA: Andrea Carmantini
Demonstrators: Chau Luu and Electra Wallington
http://www.inf.ed.ac.uk/teaching/courses/asr/
Your background
If you have taken:
Speech Processing and either of (MLPR or MLP): perfect!
Either of (MLPR or MLP) but not Speech Processing (probably you are from Informatics): you’ll require some speech background:
A couple of the lectures will cover material that was in Speech Processing
Some additional background study (including material from Speech Processing)
Speech Processing but neither of (MLPR or MLP) (probably you are from SLP): you’ll require some machine learning background (especially neural networks):
A couple of introductory lectures on neural networks provided for SLP students
Some additional background study
Labs
Series of weekly labs using Python, OpenFst and Kaldi
They count towards 10% of the course credit
Labs start week 3 – exact arrangements TBA
You will need to work in pairs
Labs 1-5 will give you hands-on experience of using HMM
algorithms to build your own ASR system
These labs are an important prerequisite for the coursework –
take advantage of the demonstrator support!
Later optional labs will introduce you to Kaldi recipes for
training acoustic models – useful if you will be doing an
ASR-related research project
What is speech recognition?
Speech-to-text transcription
Transform recorded audio into a sequence of words
Just the words, no meaning... but we do need to deal with acoustic
ambiguity: “Recognise speech?” or “Wreck a nice beach?”
Speaker diarization: Who spoke when?
Speech recognition: what did they say?
Paralinguistic aspects: how did they say it? (timing,
intonation, voice quality)
Speech understanding: what does it mean?
Why is
speech recognition
difficult?
From a linguistic perspective
Many sources of variation:
Speaker: tuned for a particular speaker, or speaker-independent? Adaptation to speaker characteristics
Environment: noise, competing speakers, channel conditions (microphone, phone line, room acoustics)
Style: continuously spoken or isolated? Planned monologue or spontaneous conversation?
Vocabulary: machine-directed commands, scientific language, colloquial expressions
Accent/dialect: recognise the speech of all speakers who speak a particular language
Other paralinguistics: emotional state, social class, ...
Language spoken: an estimated 7,000 languages, most with limited training resources; code-switching; language change
From a machine learning perspective
As a classification problem: very high-dimensional output space
As a sequence-to-sequence problem: very long input sequence (although limited re-ordering between acoustic and word sequences)
Data is often noisy, with many “nuisance” factors of variation
Very limited quantities of training data available (in terms of words) compared to text-based NLP
Manual speech transcription is very expensive (10x real time)
Hierarchical and compositional nature of speech production and comprehension makes it difficult to handle with a single model
The speech recognition problem
We generally represent recorded speech as a sequence of acoustic feature vectors (observations), X, and the output word sequence as W
At recognition time, our aim is to find the most likely W given X
To achieve this, statistical models are trained using a corpus of labelled training utterances (Xn, Wn)
Representing recorded speech (X)
Represent a recorded utterance as a sequence of feature vectors
Reading: Jurafsky & Martin section 9.3
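The slides don’t fix a particular feature type here, but the framing step behind any such feature-vector representation can be sketched in plain NumPy. The function name is mine, and the frame/hop sizes are typical 25 ms / 10 ms values at a 16 kHz sampling rate:

```python
import numpy as np

def frame_signal(signal, frame_len=400, hop=160):
    """Split a waveform into overlapping, windowed frames."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop : i * hop + frame_len]
                       for i in range(n_frames)])
    # Apply a Hamming window to each frame before any spectral analysis
    return frames * np.hamming(frame_len)

x = np.random.randn(16000)   # one second of audio at 16 kHz
X = frame_signal(x)
print(X.shape)               # (98, 400): 98 frames of 400 samples each
```

Each row of X would then be mapped to a feature vector (e.g. MFCCs, covered in the reading), giving the observation sequence X used throughout the course.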
Labelling speech (W)
Labels may be at different levels: words, phones, etc.
Labels may be time-aligned – i.e. the start and end times of an
acoustic segment corresponding to a label are known
Reading: Jurafsky & Martin chapter 7 (especially sections 7.4, 7.5)
Two key challenges
In training the model:
Aligning the sequences Xn and Wn for each training utterance – e.g. aligning the acoustic frames x1 x2 x3 x4 ... to word units (w1 w2 : NO RIGHT), phone units (p1 ... p5 : n oh r ai t), or grapheme units (g1 ... g7 : n o r i g h t)
In performing recognition:
Searching over all possible output sequences W to find the most likely one
The hidden Markov model (HMM) provides a good solution to both problems
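To see why alignment is the hard part, here is a back-of-the-envelope count (mine, not from the slides): if each label must cover a contiguous, non-empty run of frames, the number of monotonic alignments of T frames to K labels is a binomial coefficient, since we choose K−1 boundaries among the T−1 gaps between frames:

```python
from math import comb

def num_alignments(T, K):
    """Monotonic alignments of T frames to K labels, each label
    covering a contiguous non-empty run of frames."""
    return comb(T - 1, K - 1)

print(num_alignments(4, 2))     # 3 ways to split x1..x4 between two words
print(num_alignments(1000, 30)) # astronomically many for a real utterance
```

Enumerating alignments explicitly is hopeless; the HMM’s dynamic-programming algorithms sum or maximise over them efficiently instead.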
The Hidden Markov Model
[Diagram: observation sequence x1 x2 x3 x4 ...]
A simple but powerful model for mapping a sequence of continuous observations to a sequence of discrete outputs
It is a generative model for the observation sequence
There are efficient algorithms for training (forward–backward) and recognition-time decoding (Viterbi)
Later in the course we will also look at newer all-neural, fully-differentiable “end-to-end” models
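As a sketch of the decoding side, here is a minimal Viterbi implementation for a discrete-observation HMM. The toy probabilities are invented for illustration; real ASR HMMs use continuous observation densities, but the recursion is the same:

```python
import numpy as np

def viterbi(obs, pi, A, B):
    """Most likely state sequence for discrete observations obs.
    pi: initial probs (S,); A: transition probs (S,S); B: emission probs (S,V)."""
    S, T = len(pi), len(obs)
    delta = np.zeros((T, S))           # best log-prob of any path ending in each state
    psi = np.zeros((T, S), dtype=int)  # back-pointers
    delta[0] = np.log(pi) + np.log(B[:, obs[0]])
    for t in range(1, T):
        scores = delta[t - 1][:, None] + np.log(A)   # scores[i, j]: from i to j
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + np.log(B[:, obs[t]])
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):      # follow back-pointers
        path.append(int(psi[t][path[-1]]))
    return path[::-1]

# Toy 2-state HMM with 2 observation symbols
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
print(viterbi([0, 0, 1], pi, A, B))    # [0, 0, 1]
```

The same dynamic-programming trellis, with max replaced by sum, underlies the forward–backward training algorithm.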
Hierarchical modelling of speech
[Diagram: a generative model maps the utterance W (“No right”) to words (NO RIGHT), words to subword units (n oh r ai t), and subword units to the acoustics X via HMMs]
“Fundamental Equation of Statistical Speech Recognition”
If X is the sequence of acoustic feature vectors (observations) and W denotes a word sequence, the most likely word sequence W∗ is given by

    W∗ = arg max_W P(W | X)

Applying Bayes’ Theorem:

    P(W | X) = p(X | W) P(W) / p(X)
             ∝ p(X | W) P(W)

    W∗ = arg max_W p(X | W) P(W)

where p(X | W) is the acoustic model and P(W) is the language model
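A toy illustration of the argmax (all numbers below are invented for the example): even when the acoustic model slightly prefers an acoustically plausible but nonsensical hypothesis, the language-model prior can rescue the sensible one:

```python
# Tiny hypothesis set with made-up acoustic likelihoods p(X|W) and LM priors P(W)
candidates = {
    "recognise speech":   {"acoustic": 1e-42, "lm": 1e-5},
    "wreck a nice beach": {"acoustic": 3e-42, "lm": 1e-8},
}

def decode(candidates):
    """W* = argmax_W p(X|W) P(W) over an explicit candidate set."""
    return max(candidates,
               key=lambda w: candidates[w]["acoustic"] * candidates[w]["lm"])

print(decode(candidates))   # "recognise speech": the LM outweighs the acoustics
```

In a real recogniser the candidate set is far too large to enumerate, which is exactly the search problem discussed earlier.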
Speech Recognition Components

    W∗ = arg max_W p(X | W) P(W)

Use an acoustic model, language model, and lexicon to obtain the most probable word sequence W∗ given the observed acoustics X
[Diagram: recorded speech X → signal analysis → search over the space defined by the acoustic model p(X | W) and the language model P(W), both trained on data → decoded text W∗ (transcription)]
Phones and Phonemes
Phonemes
abstract unit defined by linguists based on contrastive role in
word meanings (eg “cat” vs “bat”)
40–50 phonemes in English
Phones
speech sounds defined by the acoustics
many allophones of the same phoneme (eg /p/ in “pit” and
“spit”)
limitless in number
Phones are usually used in speech recognition – but there is no
conclusive evidence that they are the basic units of speech
perception
Possible alternatives: syllables, automatically derived units, ...
(Slide taken from Martin Cooke from long ago)
Evaluation
How accurate is a speech recognizer?
String edit distance
Use dynamic programming to align the ASR output with a
reference transcription
Three types of error: insertions, deletions, substitutions
Word error rate (WER) sums the three types of error. If there
are N words in the reference transcript, and the ASR output
has S substitutions, D deletions and I insertions, then:

    WER = 100 · (S + D + I) / N %        Accuracy = (100 − WER)%
Speech recognition evaluations: common training and
development data, release of new test sets on which different
systems may be evaluated using word error rate
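The WER computation can be sketched directly from the definition, using dynamic-programming edit distance over word sequences (the function name is mine):

```python
def wer(ref, hyp):
    """Word error rate: Levenshtein alignment of hypothesis against reference."""
    ref, hyp = ref.split(), hyp.split()
    # d[i][j]: edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                      # all deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                      # all insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return 100.0 * d[-1][-1] / len(ref)

print(wer("recognise speech", "wreck a nice beach"))  # 200.0
```

Note that because insertions are counted, WER can exceed 100%, as in this example.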
Next Lecture
[Diagram: the recognition pipeline again – recorded speech → signal analysis → acoustic model and language model, trained on data → search → decoded text (transcription)]
Example: recognising TV broadcasts
Reading
Jurafsky and Martin (2008). Speech and Language Processing
(2nd ed.): Chapter 7 (esp 7.4, 7.5) and Section 9.3.
General interest:
The Economist Technology Quarterly, “Language: Finding a
Voice”, Jan 2017.
http://www.economist.com/technology-quarterly/2017-05-01/language
The State of Automatic Speech Recognition: Q&A with
Kaldi’s Dan Povey, Jul 2018.
https://medium.com/descript/the-state-of-automatic-speech-recognition-q-a-with-kaldis-dan-povey-c860aada9b85