Speech Processing

Uploaded by

dasaditi2312

ML in Speech Processing Applications

Process of speech production


• Speech production is the process of uttering articulated sounds or words, i.e., how
humans generate meaningful speech.

• It is a complex feedback process in which hearing, perception, and information
processing in the nervous system and the brain are also involved.
• When a speaker produces a speech signal, it travels in the form of pressure
waves from the speaker’s head to the listener’s ears.

• This signal consists of variations in pressure as a function of time and is
usually measured directly in front of the mouth, the primary sound source.

• The amplitude variations correspond to deviations from atmospheric
pressure caused by traveling waves.
Figure: Block diagram of the human speech production system
Modeling the Human Speech Production System
• Human speech production can be illustrated by a simple source-filter model.
• Here, the lungs are replaced by a DC source, the vocal cords by an impulse
generator, and the articulation tract by a linear filter system. A noise
generator produces the unvoiced excitation.
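The source-filter idea above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the model from the slides' figures: the "linear filter system" is reduced to a single two-pole resonator, and the function names, the 500 Hz resonance, the 80 Hz bandwidth, and the 8 kHz sampling rate are all hypothetical choices made for the example.

```python
import math
import random


def impulse_train(n_samples, period):
    """Voiced excitation: a unit impulse every `period` samples (impulse generator)."""
    return [1.0 if n % period == 0 else 0.0 for n in range(n_samples)]


def white_noise(n_samples, seed=0):
    """Unvoiced excitation: Gaussian noise from a seeded generator (noise generator)."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(n_samples)]


def resonator_coeffs(freq_hz, bandwidth_hz, fs):
    """Coefficients of a two-pole resonator y[n] = x[n] + a1*y[n-1] + a2*y[n-2]."""
    r = math.exp(-math.pi * bandwidth_hz / fs)  # pole radius; below 1, so the filter is stable
    theta = 2.0 * math.pi * freq_hz / fs        # pole angle sets the resonance frequency
    return 2.0 * r * math.cos(theta), -r * r


def all_pole_filter(x, a1, a2):
    """Run the excitation through the linear filter (a stand-in for the tract)."""
    y, y1, y2 = [], 0.0, 0.0
    for xn in x:
        yn = xn + a1 * y1 + a2 * y2
        y.append(yn)
        y2, y1 = y1, yn
    return y


def synthesize(voiced, n_samples=800, fs=8000, pitch_hz=100, gain=1.0):
    """Pick the excitation, filter it, and scale by a constant gain (the DC source)."""
    if voiced:
        excitation = impulse_train(n_samples, fs // pitch_hz)
    else:
        excitation = white_noise(n_samples)
    a1, a2 = resonator_coeffs(500.0, 80.0, fs)  # assumed single resonance
    return [gain * s for s in all_pole_filter(excitation, a1, a2)]
```

Calling `synthesize(True)` gives a periodic, vowel-like waveform, and `synthesize(False)` gives a noise-like one shaped by the same filter. A more realistic tract model would cascade several such resonators.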
Figure: A general model for speech production
Figure: A simplified discrete-time model for speech production
Applications of the speech production model
• speech synthesis
• speech analysis
• speech and speaker recognition
• speech coding, etc.
Speech recognition
• Speech recognition is a capability which enables a program to process human
speech into a written format.

• Speech recognition is also known as “automatic speech recognition” (ASR),
“computer speech recognition” or “speech to text” (STT).

• Speech recognition focuses on translating speech from a verbal format to a
text one, whereas voice recognition just seeks to identify an individual user’s
voice.
• Speech recognition software must adapt to the highly variable and
context-specific nature of human speech.

• To meet this requirement, speech recognition systems use two types of
models:

• Acoustic models. These represent the relationship between linguistic units
of speech and audio signals.

• Language models. Here, sounds are matched with word sequences to
distinguish between words that sound similar.
• Speech carries three essential kinds of features:

• Lexical features (the vocabulary used): these require a transcript, i.e., text
extracted from the speech

• Visual features (the expressions the speaker makes): these require access to
the video of the conversation

• Acoustic features (sound properties like pitch, tone, jitter etc.)


How does speech recognition work?

1. analyze the audio;

2. break it into parts;

3. digitize it into a computer-readable format; and

4. use an algorithm to match it to the most suitable text representation.
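
The four steps above can be sketched as a toy pipeline. This is only an illustration under strong assumptions: the "analysis" is a single time-domain feature (the zero-crossing rate per frame), the "matching" is nearest-template lookup, and every function name and parameter here is hypothetical. Real recognizers use far richer features and models.

```python
import math


def digitize(samples, bits=16):
    """Quantize floats in [-1, 1] to signed integers (computer-readable format)."""
    peak = 2 ** (bits - 1) - 1
    return [int(round(max(-1.0, min(1.0, s)) * peak)) for s in samples]


def split_frames(samples, frame_len):
    """Break the signal into consecutive, non-overlapping frames."""
    last_start = len(samples) - frame_len
    return [samples[i:i + frame_len] for i in range(0, last_start + 1, frame_len)]


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose sign differs."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)


def features(samples, frame_len=80):
    """Analyze the audio: one zero-crossing rate per frame."""
    return [zero_crossing_rate(f) for f in split_frames(digitize(samples), frame_len)]


def match(feature_seq, templates):
    """Return the label whose stored feature sequence is closest to the input."""
    def distance(a, b):
        n = min(len(a), len(b))
        return sum(abs(x - y) for x, y in zip(a, b)) / n
    return min(templates, key=lambda label: distance(feature_seq, templates[label]))


def tone(freq_hz, n_samples=800, fs=8000):
    """Helper for the demo: a pure sine tone standing in for recorded audio."""
    return [math.sin(2.0 * math.pi * freq_hz * n / fs) for n in range(n_samples)]


templates = {"low": features(tone(100)), "high": features(tone(1000))}
label = match(features(tone(120)), templates)  # a 120 Hz tone is closer to "low"
```

The labels here stand in for text output. In a real system, the matching step is where the acoustic and language models come in.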


Speech recognition algorithms

• Natural language processing (NLP):

• While NLP isn’t necessarily a specific algorithm used in speech recognition, it is
the area of artificial intelligence which focuses on the interaction between humans
and machines through language, in both speech and text.
• Many mobile devices incorporate speech recognition into their systems to conduct
voice search—e.g. Siri—or provide more accessibility around texting.
• Hidden Markov models (HMM):
• Hidden Markov models build on the Markov chain model, which stipulates that
the probability of the next state depends only on the current state, not on the
states before it.
• While a Markov chain model is useful for observable events, such as text inputs,
hidden Markov models allow us to incorporate hidden events, such as
part-of-speech tags, into a probabilistic model.
• They are utilized as sequence models within speech recognition, assigning labels
to each unit—i.e. words, syllables, sentences, etc.—in the sequence. These labels
create a mapping with the provided input, allowing it to determine the most
appropriate label sequence.
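One common way to find the most appropriate label sequence in an HMM is the Viterbi algorithm, which the slides do not name. The sketch below is a minimal version, assuming a toy two-tag model; the tags, words, and all probabilities are made up for illustration.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most probable hidden-state path and its probability."""
    # best[t][s] = (probability of the best path ending in state s at time t,
    #               the state that path came from)
    best = [{s: (start_p[s] * emit_p[s][observations[0]], None) for s in states}]
    for obs in observations[1:]:
        prev = best[-1]
        column = {}
        for s in states:
            # Markov property: only the previous state matters for the transition.
            p, source = max((prev[r][0] * trans_p[r][s], r) for r in states)
            column[s] = (p * emit_p[s][obs], source)
        best.append(column)
    # Backtrack from the most probable final state.
    last = max(states, key=lambda s: best[-1][s][0])
    path = [last]
    for column in reversed(best[1:]):
        path.append(column[path[-1]][1])
    path.reverse()
    return path, best[-1][last][0]


# Hypothetical toy model: hidden part-of-speech tags, observed words.
states = ("NOUN", "VERB")
start_p = {"NOUN": 0.8, "VERB": 0.2}
trans_p = {"NOUN": {"NOUN": 0.3, "VERB": 0.7}, "VERB": {"NOUN": 0.6, "VERB": 0.4}}
emit_p = {"NOUN": {"dogs": 0.9, "bark": 0.1}, "VERB": {"dogs": 0.1, "bark": 0.9}}

path, prob = viterbi(["dogs", "bark"], states, start_p, trans_p, emit_p)
```

With these made-up numbers, the decoded tags for "dogs bark" are NOUN then VERB. In speech recognition, the observations would be acoustic feature vectors rather than words.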
• Artificial intelligence.

• AI and machine learning methods like deep learning and neural networks are
common in advanced speech recognition software.
• These systems use grammar, structure, syntax and composition of audio and voice
signals to process speech. Machine learning systems gain knowledge with each
use, making them well suited for nuances like accents.
• Neural networks:
• Primarily leveraged for deep learning algorithms, neural networks process training
data by mimicking the interconnectivity of the human brain through layers of
nodes.
• Each node is made up of inputs, weights, a bias (or threshold) and an output. If
that output value exceeds a given threshold, it “fires” or activates the node,
passing data to the next layer in the network.
• Neural networks learn the mapping from inputs to outputs through supervised
learning, adjusting their weights to reduce a loss function through the process of
gradient descent. While neural networks tend to be more accurate and can accept
more data, this comes at a performance efficiency cost, as they tend to be slower to
train than traditional language models.
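The node described above (inputs, weights, a bias, an output compared to a threshold) and its training by gradient descent can be sketched with a single neuron. This is an illustrative example, not a speech model: it assumes a sigmoid activation, a cross-entropy loss, and the logical AND function as training data, none of which come from the slides.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def neuron_output(weights, bias, inputs):
    """Weighted sum of the inputs plus the bias, squashed to (0, 1)."""
    return sigmoid(sum(w * x for w, x in zip(weights, inputs)) + bias)


def fires(weights, bias, inputs, threshold=0.5):
    """The node 'fires' when its output exceeds the threshold."""
    return neuron_output(weights, bias, inputs) > threshold


def train(samples, labels, lr=0.5, epochs=5000):
    """Full-batch gradient descent on the cross-entropy loss (supervised learning)."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for x, y in zip(samples, labels):
            # For sigmoid + cross-entropy, d(loss)/d(pre-activation) = prediction - label.
            err = neuron_output(weights, bias, x) - y
            for i, xi in enumerate(x):
                grad_w[i] += err * xi
            grad_b += err
        weights = [w - lr * g for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b
    return weights, bias


# Assumed toy task: learn logical AND from four labeled examples.
samples = [(0, 0), (0, 1), (1, 0), (1, 1)]
labels = [0, 0, 0, 1]
weights, bias = train(samples, labels)
decisions = [fires(weights, bias, x) for x in samples]
```

After training, the neuron fires only for the input `(1, 1)`. Networks used for speech stack many such nodes in layers and train on far larger datasets, which is where the training cost mentioned above comes from.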
• Speaker Diarization (SD): Speaker diarization algorithms identify and segment
speech by speaker identity. This helps programs better distinguish individuals in a
conversation and is frequently applied at call centers to distinguish customers from
sales agents.

• N-grams: This is the simplest type of language model (LM), which assigns
probabilities to sentences or phrases. An N-gram is a sequence of N words.
• For example, “order the pizza” is a trigram or 3-gram and “please order the pizza”
is a 4-gram. Grammar and the probability of certain word sequences are used to
improve recognition and accuracy.
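The N-gram idea can be sketched as follows: extract N-grams from a token list, then estimate bigram probabilities by counting. The three-sentence corpus and the function names are hypothetical, and the estimate is a plain maximum-likelihood ratio with no smoothing.

```python
from collections import Counter


def ngrams(tokens, n):
    """All runs of n consecutive tokens, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def train_bigram_model(sentences):
    """Count contexts and bigrams; return a function estimating P(word | prev)."""
    context_counts = Counter()
    bigram_counts = Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        context_counts.update(tokens[:-1])  # every token that has a successor
        bigram_counts.update(ngrams(tokens, 2))

    def prob(prev, word):
        if context_counts[prev] == 0:
            return 0.0  # unseen context; a real model would smooth instead
        return bigram_counts[(prev, word)] / context_counts[prev]

    return prob


# Made-up corpus for illustration only.
prob = train_bigram_model(["order the pizza", "order the pasta", "eat the pizza"])
trigrams = ngrams("please order the pizza".split(), 3)
```

Here `prob("the", "pizza")` is 2/3 because "the" is followed by "pizza" in two of its three occurrences. A recognizer can use such scores to prefer likely word sequences over similar-sounding but unlikely ones.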
Time Domain Methods in Speech Processing
Key features of effective speech recognition
• Language weighting: Improve precision by weighting specific
words that are spoken frequently (such as product names or
industry jargon), beyond terms already in the base vocabulary.
• Speaker labeling: Output a transcription that cites or tags each
speaker’s contributions to a multi-participant conversation.
• Acoustics training: Attend to the acoustical side of the
business. Train the system to adapt to an acoustic environment
(like the ambient noise in a call center) and speaker styles (like
voice pitch, volume and pace).
• Profanity filtering: Use filters to identify certain words or
phrases and sanitize speech output.
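As one concrete example from the list above, profanity filtering can be sketched as a post-processing step on the transcript. This is a minimal sketch assuming a caller-supplied blocklist and whole-word, case-insensitive matching; the function name and masking style are hypothetical.

```python
import re


def sanitize(transcript, blocked_words, mask_char="*"):
    """Mask every whole-word occurrence of a blocked word, ignoring case."""
    if not blocked_words:
        return transcript
    # Longest words first so longer entries win over their own prefixes.
    ordered = sorted(blocked_words, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(w) for w in ordered) + r")\b",
        re.IGNORECASE,
    )
    return pattern.sub(lambda m: mask_char * len(m.group(0)), transcript)


cleaned = sanitize("That Darn printer", {"darn"})
```

The word-boundary anchors keep the filter from masking harmless words that merely contain a blocked word, such as "darning".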
Application fields
What applications is speech recognition used for?
• Mobile devices. Smartphones use voice commands for call routing,
speech-to-text processing, voice dialing and voice search.
• Education. Speech recognition software is used in language instruction.
The software hears the user's speech and offers help with pronunciation.
• Customer service. Automated voice assistants listen to customer queries
and provide helpful resources.
• Healthcare applications. Doctors can use speech recognition software to
transcribe notes in real time into healthcare records.
• Disability assistance. Speech recognition software can translate spoken
words into text using closed captions to enable a person with hearing loss
to understand what others are saying.
• Court reporting. Software can be used to transcribe courtroom
proceedings, precluding the need for human transcribers.
• Emotion recognition. This technology can analyze certain vocal
characteristics to determine what emotion the speaker is feeling. Paired
with sentiment analysis, this can reveal how someone feels about a product
or service.
• Hands-free communication. Drivers use voice control for hands-free
communication, controlling phones, radios and global positioning systems,
for instance.
What are the advantages of speech recognition?
• Machine-to-human communication. The technology enables electronic
devices to communicate with humans in natural language or conversational
speech.
• Readily accessible. This software is frequently installed in computers and
mobile devices, making it accessible.
• Easy to use. Well-designed software is straightforward to operate and
often runs in the background.
• Continuous, automatic improvement. Speech recognition systems that
incorporate AI become more effective and easier to use over time. As
systems complete speech recognition tasks, they generate more data about
human speech and get better at what they do.
What are the disadvantages of speech recognition?
• Inconsistent performance. The systems may be unable to capture words
accurately because of variations in pronunciation, lack of support for some
languages and inability to sort through background noise. Ambient noise
can be especially challenging. Acoustic training can help filter it out, but
these programs aren't perfect. Sometimes it's impossible to isolate the
human voice.
• Speed. Some speech recognition programs take time to deploy and master.
The speech processing may feel relatively slow.
• Source file issues. Speech recognition success depends on the recording
equipment used, not just the software.
Thank you
