Interactive Telecommunication
Systems and Services
Ing. Stanislav Ondáš, PhD.
Sound, speech and their computer processing
“Life depends on the ability to communicate, to share information”
(Uhlíř J. a kol.: Technológie hlasových komunikací)
Sound
• Sound is a mechanical vibration that
– arises in an elastic medium when particles of the medium are displaced from
their equilibrium positions
– spreads through the medium in the form of sound waves (energy is
passed on by oscillation between neighboring particles)
– is perceptible to humans, animals and other living creatures, or detectable by
special instruments
• It is often referred to as sound (acoustic) waves.
Speech
• One of the most natural communication media
• Speech is the spoken form of language
– Language relates to thinking; it enables us to share ideas, opinions,
emotions, desires, intentions, …
• Speech can substitute for other senses in the information
transmission process.
• Speech carries verbal content and also non-verbal
signals (about emotions, mental state and so on)
• Speech makes it possible to identify the speaker (speaker verification)
• Computer models are often inspired by human speech
production and hearing.
Models of speech production and perception
Digital speech processing steps
Digital speech processing applications
Speech technologies
• Enable the use of speech in human-computer interaction
• Human-human communication relies on the following human skills:
– Hearing ability
– Language understanding ability
– Ability to take part in dialogue interaction
– Ability to generate an appropriate reaction, i.e. an answer (it requires thinking
…)
– Speech production ability
– Ability to use other senses in communication: sight, smell, taste, touch
• Scientists want to give these abilities to machines as well, to make communication with
machines more intuitive and natural for humans.
Human – machine speech communication chain
Systems classification
• Spoken Dialogue Systems (SDS)
– Automatic interactive systems that enable human-computer
interaction through spoken dialogue.
– They can be seen as a Voice User Interface (VUI)
– Task-oriented systems.
• Multimodal dialogue systems (MDS)
– A multimodal interactive system enables the use of at least
two different modalities in human-computer interaction, e.g. a
combination of speech, gestures and touches on a touchscreen
• Embodied Conversational Agents
– Animated virtual avatars with “human-like” behaviour as
an interface to computer systems
• Robots and humanoid robots
Text-to-speech synthesis (TTS)
Automatic Speech Recognition (ASR)
Natural language processing (NLP)
Dialogue/interaction management
Speech technologies for HMI
Information obtaining, processing and representation
Modalities interpretation, fusion and fission
Automatic speech recognition
Introduction
Automatic Speech Recognition
• Transforms a speech signal into a sequence of words
• Makes computers able to hear
• The main principle: acoustic matching of incoming speech with
patterns stored in memory.
• ASR technology is language-dependent. For each language a
unique set of resources (databases, models) is required.
• Applications: automatic telephony services, voice search, smart
home apps, robotics, speaker identification, security apps,
healthcare, …
ASR classification
• According to complexity
– Isolated word recognition
– Connected word recognition
– Dictated speech recognition
– Spontaneous natural speech recognition (Large Vocabulary Continuous
Speech Recognition systems – LVCSR)
• According to speaker dependency
– Speaker-dependent – understand only one person or a narrow group of people
– Speaker-independent – trained to understand anyone
• According to vocabulary size
– ASR with small vocabulary: from one to tens of words
– ASR with medium vocabulary: hundreds to thousands of words
– ASR with large vocabulary: 10,000 words and more
• According to principle
– Based on the Dynamic Time Warping method (DTW)
– Based on Hidden Markov Models (HMM)
– Based on DNN (Deep Neural Networks)
– Hybrid systems
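To make the DTW-based principle (acoustic matching of incoming speech with stored patterns) more concrete, here is a minimal sketch. It assumes every utterance has already been reduced to a short sequence of single numbers; real systems compare sequences of feature vectors. The function names `dtw_distance` and `recognize` and the template values are hypothetical, chosen only for illustration.

```python
def dtw_distance(a, b):
    """Dynamic Time Warping distance between two 1-D feature sequences."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = minimal accumulated cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            # a step may repeat a frame of either sequence or advance both
            cost[i][j] = local + min(cost[i - 1][j],
                                     cost[i][j - 1],
                                     cost[i - 1][j - 1])
    return cost[n][m]


def recognize(features, templates):
    """Return the label of the stored template closest to the input."""
    return min(templates, key=lambda word: dtw_distance(features, templates[word]))


# Hypothetical toy templates (one stored pattern per word)
TEMPLATES = {
    "yes": [1.0, 3.0, 4.0, 3.0, 1.0],
    "no": [4.0, 2.0, 1.0, 1.0, 2.0],
}

# A slower pronunciation of the "yes" pattern: same shape, stretched in time
utterance = [1.0, 1.0, 3.0, 3.0, 4.0, 4.0, 3.0, 1.0]
print(recognize(utterance, TEMPLATES))  # prints: yes
```

The warping lets the stretched utterance align with the shorter template at zero cost, which is why this approach suits isolated word recognition with a small vocabulary.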
Principal scheme of ASR system
Natural language
UNDERSTANDING
Intro
Natural Language Understanding
• Transforms the incoming word sequence into a meaning representation.
• Spoken sentences carry a surface representation of meaning, but the meaning itself is
more structured (hierarchical)
• The simpler approach: keyword spotting
– Keywords that relate to key information are identified in the sentences
• The more complex approach: linguistic analysis
– Preprocessing: tokenization, lemmatization
– Morphological analysis
– Syntactic analysis
– Semantic analysis
– Pragmatic analysis
• NLU requires a database with information about entities and relations in the real
world.
• Task-oriented systems allow simplifications in the form of “domain models”.
• In the case of a robotic interface, the task is to find a mapping between spoken
commands and the robot’s functions
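The keyword spotting approach for a robotic interface can be sketched as follows. This is only an illustration: the keyword table stands in for a very small "domain model", and the robot function names (`move_forward`, `turn_left`, ...) are hypothetical.

```python
import re

# Hypothetical domain model: keyword -> name of a robot function
KEYWORDS = {
    "forward": "move_forward",
    "ahead": "move_forward",
    "back": "move_backward",
    "left": "turn_left",
    "right": "turn_right",
    "stop": "stop",
}


def tokenize(sentence):
    """Lower-case the sentence and split it into word tokens."""
    return re.findall(r"[a-z']+", sentence.lower())


def spot_keywords(sentence):
    """Return the robot functions for all keywords, in order of occurrence."""
    return [KEYWORDS[token] for token in tokenize(sentence) if token in KEYWORDS]


print(spot_keywords("Please go forward and then turn left"))
print(spot_keywords("Stop right now"))
```

The second call returns both `stop` and `turn_right`, because keyword spotting ignores context. This is the kind of ambiguity that the fuller linguistic analysis is meant to resolve.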
Dialogue management
Spoken dialogue
• Dialogue – a natural medium for information exchange between two or more
participants
• Enables participants to cooperate on solving a task through spoken interaction
• In SDSs we assume a “task-oriented” dialogue. Compared with open
“conversation”, such dialogues are simpler to manage.
• In an SDS, the dialogue management block, i.e. the dialogue manager, is
responsible for the dialogue functions.
• Terms:
– Dialogue turn
– Participant
– System prompt/utterance
– User’s utterance/answer
Dialogue manager (DM)
• The main task of the DM is to find an appropriate reaction to the user’s input,
one that reflects the current state of the system, the user’s inputs and the
interaction history.
Dialogue management
• Two important problems:
– Dialogue management methods. How to control the flow of the dialogue,
turns and logic.
– Dialogue models. They represent and model:
• Information that the DM uses to interpret the user’s input and to manage
the flow
• Model of the user
• Model of the system
• Interaction history
DM systems classification
• Finite state machines
– Model the dialogue as a network of states and transitions between them. Each state
represents a particular step in the dialogue. Transitions represent possible (conditional)
moves to another step.
• Frame-based systems
– The main idea is that the dialogue is like a form-filling task.
• Agent systems
– The dialogue is managed by a set of specialized agents (domain agent, dialogue flow
agent, history agent, …)
• DDL (Dialogue Description Language)-based systems
– The dialogue and the control algorithm are encoded in a scripting
language. The most popular is VoiceXML.
• Systems based on statistical modeling
– Control mechanisms are trained on large corpora of annotated dialogues
– The result is a statistical model of the dialogue
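The finite state machine approach from the classification above can be sketched in a few lines. This is a simplified illustration, not a reference design: the state names, prompts and the `run_dialogue` helper are hypothetical, and the user's answers are passed in as plain strings instead of coming from a recognizer.

```python
# Hypothetical dialogue network: each state has a prompt and a transition
# function that picks the next state from the user's answer.
DIALOGUE = {
    "ask_origin": {
        "prompt": "Where do you want to travel from?",
        "next": lambda answer: "ask_time",
    },
    "ask_time": {
        "prompt": "What time do you want to leave?",
        "next": lambda answer: "confirm",
    },
    "confirm": {
        "prompt": "Is that correct?",
        # conditional transition: finish on "yes", otherwise start again
        "next": lambda answer: "done" if answer.strip().lower() == "yes" else "ask_origin",
    },
}


def run_dialogue(answers, start="ask_origin", final="done"):
    """Walk the state network, consuming one user answer per turn.

    Returns the list of (state, prompt, answer) turns and the state reached.
    """
    state = start
    turns = []
    for answer in answers:
        if state == final:
            break
        node = DIALOGUE[state]
        turns.append((state, node["prompt"], answer))
        state = node["next"](answer)
    return turns, state


turns, last_state = run_dialogue(["Žilina", "At five", "yes"])
for _, prompt, answer in turns:
    print(f"S: {prompt}")
    print(f"U: {answer}")
print("final state:", last_state)
```

The whole dialogue logic lives in the transition table, which keeps such systems easy to inspect but rigid: the user can only say what the current state expects.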
Metacommunication in dialog
Clarification
• The part of the dialogue in which the obtained information is clarified by
the dialogue participants
• Example:
System: What time do you want to leave from Prague?
User: At five.
S: In the afternoon? (clarification prompt)
Confirmation
• Implicit – the confirmation is included in a prompt that also has another
communication function.
S: Where do you want to travel from?
U: From Žilina.
S: What time do you want to leave from Žilina?
U: scenario a) At five. (implicit confirmation)
U: scenario b) Not from Žilina, from Žipov. (misunderstanding repair)
• Explicit – the confirmation is performed by a special prompt that has only one
communication function: to confirm previously obtained information.
S: Where do you want to travel from?
U: From Žilina.
S: What time do you want to leave?
U: At five.
S: You selected Žilina station and a departure time of about 5:00 AM. Is that correct? (explicit
confirmation)
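The difference between the two confirmation strategies can be sketched as prompt generation from the values collected so far. The slot names `origin` and `time`, the prompt templates and the very rough reply heuristic in `interpret_reply` are assumptions made for this illustration only.

```python
import re


def implicit_prompt(template, slots):
    """Embed already obtained values in the next question."""
    return template.format(**slots)


def explicit_prompt(slots):
    """Build a prompt whose only function is to confirm the obtained values."""
    return ("You selected {origin} station and a departure time of about {time}. "
            "Is that correct?").format(**slots)


def interpret_reply(reply):
    """Rough heuristic: a reply that starts with 'no' or 'not' is a repair."""
    words = re.findall(r"\w+", reply.lower())
    if not words:
        return "noinput"
    return "repair" if words[0] in {"no", "not"} else "accepted"


slots = {"origin": "Žilina"}
print(implicit_prompt("What time do you want to leave from {origin}?", slots))
print(interpret_reply("At five."))                      # scenario a)
print(interpret_reply("Not from Žilina, from Žipov."))  # scenario b)

slots["time"] = "5:00 AM"
print(explicit_prompt(slots))
```

Implicit confirmation saves a turn but relies on the user noticing and correcting a wrong value; explicit confirmation costs an extra turn but leaves no doubt.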
Errors and misunderstanding repairs
• Types of errors
– A nomatch event occurs when the input
provided by the user does not match any acceptable
value.
– A noinput event occurs when the user does
not answer the system prompt within the specified interval.
– A misunderstanding event occurs when the system was not able to
recognize the user’s input correctly.
– Errors on the side of the user (hesitations, ...)
– Errors on the side of the system (system crash, ...)
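The nomatch and noinput events can be illustrated with a small sketch. It assumes the recognizer result arrives as a string, or as `None` when the timeout elapsed in silence; the names `classify_input` and `collect_value`, the station list and the limit of three attempts are hypothetical.

```python
def classify_input(recognized, acceptable):
    """Map one recognizer result to an event name.

    recognized: recognized text, or None when the timeout elapsed in silence.
    acceptable: set of lower-case values the current prompt accepts.
    """
    if recognized is None or not recognized.strip():
        return "noinput"
    if recognized.strip().lower() not in acceptable:
        return "nomatch"
    return "match"


def collect_value(results, acceptable, max_attempts=3):
    """Re-prompt until a value matches or the attempts run out.

    Returns the accepted value (or None) and the list of events seen.
    """
    events = []
    for _, recognized in zip(range(max_attempts), results):
        event = classify_input(recognized, acceptable)
        events.append(event)
        if event == "match":
            return recognized.strip().lower(), events
    return None, events


STATIONS = {"žilina", "košice"}
# Simulated turns: silence, an unknown city, then a valid station
print(collect_value([None, "Paris", "Žilina"], STATIONS))
```

A misunderstanding cannot be detected this way, because a wrongly recognized value may still be an acceptable one; that case is what the confirmation strategies are for.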
Text-to-Speech synthesis
Text-to-Speech synthesis (TTS)
• TTS converts text to speech.
• Applications: apps for blind people, telephony apps (call centers,
voice banking, ...), robotics, in-car navigation, smart home apps,
intelligent glasses, ...
• TTS is an interdisciplinary problem: signal processing, natural
language processing, phonetics, databases
• TTS has to be able to model prosody (melody, tempo, rhythm, emphasis),
perform lexical analysis, ...
• Classification:
– Concatenative approaches – diphone-based or corpus-based
– Statistical approaches based on vocoders and HMMs
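As a rough illustration of the diphone-based concatenative approach, the sketch below splits a phoneme sequence into diphone units and joins pre-recorded units from an inventory. Everything here is a toy assumption: the unit naming scheme, the `_` silence symbol and the inventory, whose "samples" are just short lists of numbers. Real systems also smooth the joins and modify prosody.

```python
def to_diphones(phonemes, silence="_"):
    """Split a phoneme sequence into diphone unit names, padded with silence."""
    padded = [silence] + list(phonemes) + [silence]
    return [f"{left}-{right}" for left, right in zip(padded, padded[1:])]


def synthesize(phonemes, inventory):
    """Concatenate the stored samples of every diphone in the utterance."""
    samples = []
    for unit in to_diphones(phonemes):
        # raises KeyError if the unit was never recorded
        samples.extend(inventory[unit])
    return samples


# Hypothetical inventory: unit name -> recorded samples (toy values)
INVENTORY = {
    "_-m": [0.0, 0.1],
    "m-a": [0.2, 0.5],
    "a-m": [0.4, 0.2],
    "a-_": [0.1, 0.0],
}

print(to_diphones(["m", "a", "m", "a"]))
print(synthesize(["m", "a", "m", "a"], INVENTORY))
```

Note that the unit `m-a` is reused twice in the word, which is the point of a diphone inventory: a limited set of recorded transitions covers arbitrary text.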
TTS system structure
Multimodality in HMI
• Human-human communication involves all senses
• We want to give this capability to machines as well (computers, robots, ...)
• Required capabilities:
– Recognition of inputs delivered through various modalities (vision, speech, touch, ...)
– Interpretation of input signals and meaning extraction
– Multimodal fusion into a single complex representation of meaning
– Multimodal fission, which enables machines to present information through several
output modalities
– The ability to model user behavior (desires, intentions, goals, emotions)
– The ability to model their own behavior and internal state (desires, intentions, goals,
emotions)
– The ability to work with a database of real-world data
• Typical input modalities
– Speech
– Gestures (hand and head movements, face gestures)
– Touches on touchscreen
– Writing
– Using keys on keyboard
– Joystick
• Typical output modalities
– Speech
– Graphic (text, maps)
– Gestures (hand and head movements, face gestures, articulation)
– System actions
• Virtual avatars
– Human-like behavior
• Applications:
– Information kiosks
– Receptions
– Education
– Applications for elderly and disabled people that support their
“independent living”
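The multimodal fusion step listed among the required capabilities can be sketched as follows. This is one possible, very simplified form of fusion at the level of meaning: a spoken command with a missing target ("show me this") is completed by the touch event nearest in time. The class names, fields and the two-second window are assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SpeechInput:
    time: float            # seconds since the start of the session
    intent: str            # e.g. "show_info"
    target: Optional[str]  # None when the user only said "this" or "that"


@dataclass
class TouchInput:
    time: float
    object_id: str         # object under the finger on the touchscreen


def fuse(speech, touches, window=2.0):
    """Fill a missing spoken target from the touch nearest in time."""
    meaning = {"intent": speech.intent, "target": speech.target,
               "modalities": ["speech"]}
    if speech.target is None:
        candidates = [t for t in touches if abs(t.time - speech.time) <= window]
        if candidates:
            nearest = min(candidates, key=lambda t: abs(t.time - speech.time))
            meaning["target"] = nearest.object_id
            meaning["modalities"].append("touch")
    return meaning


touches = [TouchInput(3.0, "map_a"), TouchInput(10.5, "map_b")]
print(fuse(SpeechInput(10.0, "show_info", None), touches))
```

The result is a single meaning representation that records which modalities contributed, so later stages such as the dialogue manager do not need to know where each piece of information came from.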
Virtual conversational agents and
humanoid robots
http://www.youtube.com/watch?v=zruOPSSWVXw
http://www.youtube.com/watch?v=munqOlj3mNw
http://www.youtube.com/watch?v=xRR33WDFi_k
http://www.youtube.com/watch?v=cy7xGwYdRk0
http://www.youtube.com/watch?v=nFZ9sUbbfe8
http://www.youtube.com/watch?v=wOzw71j4b78
Applications and demos
designed in our Lab
Automatic transcription of video material
Dictation system for Slovak courts
Speech interface for the SCORPIO service robot
Intelligent speech communication interface
Virtual agent SIMONA
HMM speech synthesis in the Slovak language