Affective Computing Book
19 Week 6 Lecture 2 352
20 Week 6 Tutorial 375
Week 7
21 Lecture 1 Emotions in Physiological Signals 384
22 Week 7 Tutorial 406
Week 8
23 Lecture 2 Emotions via Skin Conductance 419
24 Lecture 3 Emotions Via EEG 434
25 Multimodal Affect Recognition 454
26 Multimodal Analysis 471
27 Week 8 MM Tutorial 501
Week 9
28 Tutorial 526
29 Week 9 Lecture 1 537
30 Week 9 Lecture 2 549
31 Week 9 Lecture 3 580
Week 10
32 Emotionally Intelligent Machines Part 1 613
33 Emotionally Intelligent Machines Part 2 638
34 Case Study 654
Week 11
35 Week 11 Lecture 1 671
36 Week 11 Lecture 2 701
Week 12
37 Ethics in Affective Computing 1 733
38 Ethics in Affective Computing 2 748
39 Course Finale 766
Affective Computing
Prof. Jainendra Shukla
Prof. Abhinav Dhall
Department of Computer Science and Engineering
Indraprastha Institute of Information Technology, Delhi
Week - 01
Lecture - 01
Fundamentals of Affective Computing
Hello and welcome to the first lecture in the Affective Computing course. Friends, I am Abhinav Dhall from the Indian Institute of Technology, Ropar, and today we will be discussing the Fundamentals of Affective Computing. So, the agenda for today is as follows.
We will first discuss the topics which are going to be covered in this course, we will talk about some resources which you can refer to, and then we will introduce what affective computing is. We will then delve a bit deeper into one of the major components of affective computing, which is referred to as affect sensing. From affect sensing we will come to the various components through which we can understand affect, that is, the different modalities, each based on a different sensor.
(Refer Slide Time: 01:29)
So, we will start with the fundamentals and then move on to emotion theory and emotional design. Some of these topics will be covered by myself and others will be covered by my colleague Dr. Jainendra Shukla. Dr. Shukla will discuss how the different emotions are represented computationally, what the theories of emotion are, and what the concepts of emotion-enabled design are.
Then the discussion will move on to how we elicit emotions in a person. And once we have elicited emotions, what are the different research and development tools available in the community, which you as a practitioner of affective computing can use in your respective projects? From this we will delve into how we can understand emotions from facial expressions.
This will primarily focus on camera-based image and video analysis of affect. From this we move on to the other modality, which is voice: you can hear me right now and you can tell what the emotion in my speech is; you can hear it in my voice. So, we will discuss the different methodologies for voice-based emotion recognition. From this we move on to text, and then to physiological signals: your heart rate sensor, EEG and so forth.
Later on, friends, we will combine this into multimodal emotion recognition, that is, we have data coming in from, let us say, voice and text, or from a camera and a physiological sensor. Now, once we have discussed the sensing part, we will also come to how emotion is reflected by a virtual agent. How can it be communicated by a machine?
And if you have a robot, when you are talking about human-robot interaction, how does the robot, after sensing the emotional state of the user, react to it? Then there will be discussions on the challenges and opportunities in this area. This is a very new area; it was introduced only about three decades ago.
So, if you compare it with other disciplines in computer science and signal processing, this is a fairly young field, which means there are a large number of opportunities and also a fair number of challenges involved. Then we will move on to a case study of a real-world deployment of an affective computing system, and then, last but by no means least, the very important aspect of the ethical, legal and social implications of affective computing.
Now, if we want to move these affective computing systems from, let us say, labs into real-world settings, what are the aspects we need to be aware of from the perspective of ethics, law and so forth? This course is fairly self-contained: the material which we will be sharing with you through our lectures should give you a healthy amount of information about affective computing.
With this, I would also like to mention some textbooks for those of you who would like to go deeper into the concepts. The first book in this area is from Professor Rosalind Picard and is aptly named Affective Computing. A more recent book, The Oxford Handbook of Affective Computing, has also been released and is a very rich set of resources for concepts in affective computing.
And then there is a very new book by Leimin Tian and others called Applied Affective Computing, which is also a very good resource. Now, since affective computing is a rapidly growing area, much of the newer work being done in universities and in industrial labs is first disseminated in journals and conferences.
The IEEE Transactions on Affective Computing is the standard venue where high-quality affective computing work is shared by researchers. The community also has a conference called Affective Computing and Intelligent Interaction, and this conference is also a go-to resource for the most recent works.
Now, affective computing is a fusion of concepts from computer science, signal processing and machine learning with concepts on how emotions are represented in the psychology and social sciences literature. Therefore, this is an ensemble, a fusion of different areas. What that means is that there are some other venues and resources as well where affective computing work is presented in part.
For example, there is a conference called ACM Multimedia, where you will also find recent affective computing work published, and then there is the International Conference on Multimodal Interaction, which covers the different sensors, the multimodal aspect and the interaction part. We also see affect-related work in computer vision conferences such as Computer Vision and Pattern Recognition and the International Conference on Computer Vision.
Now, these are fairly advanced resources, so I would suggest that anyone interested in exploring the recent works could start towards the later part of the course. It is not exactly within the scope of this course, because our course is essentially an introduction to affective computing. But these are extremely high-quality resources for going into advanced affective computing.
Now, throughout the discussion in this course, Dr. Jainendra Shukla and I will assume that you have learned, perhaps credited, concepts in machine learning and have been introduced to at least one type of signal processing: it could be image processing or computer vision for the camera, or it could be speech processing or NLP, Natural Language Processing.
Because, as I have already mentioned, in affective computing you get some data from a type of sensor, you do signal processing on it, you do machine learning, and then there is the human-computer interaction part.
So, now, friends, let us formally introduce affective computing. The field of affective computing encompasses both the creation of, and interaction with, machine systems which sense, recognize, respond to, and influence emotions. The term affective computing was first coined by Professor Rosalind Picard at the MIT Media Lab, and it is essentially about two components.
First, we want the machine to understand the emotional state of the user; and once the machine has this information, how should it react to it? In other words, how should the information now be presented to the user, through the interface or through voice or images, such that it is an appropriate response based on the user's emotional state? So, you recognize the state and you react accordingly.
(Refer Slide Time: 10:53)
Now, if you look at affective computing systems, there are a large number of components to them. On one side is the machine, which could be your mobile device with some AI capability in the camera or speech modules, and on the other side are the users. Your affective computing system can have different components: it can recognize emotions, it can have behavior analytics, there would be a dialogue system for interacting with the user, and there would be an interaction system.
For example, an interaction system is simply, let us say, you browsing through an app on your mobile phone using the touch sensor; that is one of the ways for the user to give feedback. So, this affective computing system will have observations of a user's behavior and mental state.
It will understand and compute these, and then it will synthesize and adapt to the user, that is, take actions to influence the user's behavior and mental state. An example: let us say an operator is using complex machinery and the machine senses that the operator is fatigued.
The machine would then, let us say, share a recommendation to the operator: you can now please take a break, maybe relax a bit. That is one appropriate response from the machine to the user. Now, if you notice, in this pipeline we have signal processing and machine learning on the sensing side, the HCI and AI parts, and the recent work in synthesis, for example using deep learning to generate images, voices, videos and so forth.
(Refer Slide Time: 13:34)
We already see that in the artificial intelligence domain there are several well-established directions, and in these affective computing finds high relevance. To start with, there are two parts to speech from a computer's perspective. One is understanding the speech of the user, which could mean, for example, automatic speaker recognition or automatic speech recognition, wherein the system tries to understand the spoken words.
What is being spoken? Now, in this pursuit of understanding what is spoken by the user, we can also compute the emotion which could be sensed while the user was speaking. That will give you the emotional state. Once you know the emotional state from the speech perspective, you can generate speech as well.
It could be, let us say, a text-to-speech system which generates speech based on what was sensed from the user. Say the user sounded sad; how about the text-to-speech system generating a happy, or at least a neutral-toned, voice? That would perhaps help the user.
Then we have the face analysis domain. In the case of face analysis, there are a large number of directions. For example, we are already doing things such as face detection, re-identification and verification in artificial intelligence, in the sub-domain of computer vision and machine learning.
And within this we can adapt the same pipelines to understand facial expressions: whether a person smiles, as you may notice me doing right now, or whether the person is sad. So, we find a large number of applications in the face analysis domain. That is again detection, but you can synthesize as well. Let us say a user is interacting with a virtual agent; the virtual agent senses that the user is happy, and the facial expressions of the virtual agent, along with its tone, are also made cheerful.
Then there is lip reading as well. Lip reading, friends, is when we have an input video and want to analyze the lip movement, so that we can learn a system which can predict what the user is speaking.
Now, in this pursuit, lip movement also tells us about the facial structure corresponding to a particular facial expression; you might smile while speaking, or show a surprise expression as you speak. Then there is monitoring as well. For example, there is a product, you are showing it to a group of users, and you are analyzing the emotional state and the behavior elicited in them in response to the product being shown. In the later part of this week's lectures I will give you an example of how affective computing is useful here.
Then in gameplay as well: along with the keyboard and mouse, we can use facial expressions and head pose information about where the user is looking. How is the user reacting to the gameplay? And for more serious applications, perhaps for training someone who has difficulty with facial expressions, one could have a game where you are supposed to show facial expressions and you reach the next round only if your facial expression matches the one shown on the screen. Then there is social skill analysis in AI.
Let us say you have a group of people and there is a conversation happening: how do we use things such as back-channeling and an understanding of cohesion, the unity in a group? There as well we can use affective computing, essentially the perceived emotion of a group, and add it to the social skills understanding. A thing to note here is that throughout this course we will be using the terms emotion and perceived emotion interchangeably.
As you can see, this is based on perception. When you meet someone, what do you think their emotional state is? Well, that is based on what they are saying and what you see on their face and in their body gestures; in some circumstances these could be highly correlated with the felt emotion, but in others they could have a low correlation. So, we will use perceived emotion and emotion interchangeably.
Now, coming to the first component of affective computing, which is recognition: we refer to it as affect sensing. You want to sense the affect of the user, that is, what is the affective state? Formally, affect sensing refers to a system that can recognize emotion by receiving data through signals and patterns. This is the formal definition from Professor Rosalind Picard.
For achieving this task, a computer would need certain hardware and certain software. The hardware would essentially be the sensors which are going to capture the information; and since we are also going to do pattern recognition, identifying certain patterns in the data we have captured from the sensors, we would need fast, high-quality compute, perhaps in some cases a graphics processing unit if a machine learning or deep learning kind of system is being built. From the software perspective, we need a platform for interfacing with the hardware sensors, and software libraries which can process the data feed coming in from the sensors and then learn a pattern recognition or machine learning model.
From this perspective, we can categorize affective computing systems on the basis of the modalities which are used to get the data. Affect sensing can be identified based on what type of data you are analyzing, that is, the modality of the data, which then translates into the sensor you are using to fetch that data about the user.
A very simple example: let us say you are sitting in front of your computer and interacting with a piece of software using your mouse and keyboard. The modality then is the data input coming from the mouse and keyboard. Now, let us look at the primary modalities for affect sensing in affective computing systems.
Friends, the first is using a camera sensor to analyze facial activity. The facial expressions which appear on your face are a window to the emotional state, so cameras are very commonly used as one of the popular modalities for affect sensing. Essentially, you have a camera and on the other side you have a subject; you capture the image.
And you would like to understand, for example, that the subject in this image is happy. So, capture is done, and after pattern recognition this is the affect which has been sensed: the person in the image looks happy.
(Refer Slide Time: 24:54)
In the past few decades we have observed that cameras are getting better in terms of image capture quality, resolution and so forth, and in parallel, due to progress in hardware, the cost of cameras is also going down. Typically, now you will see that the webcams in your laptops and the cameras in your mobile phone are capable of capturing high-quality, clear images.
This means that if you are creating an affective computing system, you can easily find a camera, and cameras are already well integrated and have an easy-to-use interface with the machine. For example, on your mobile phone or on your laptop you can readily fetch data from the camera.
(Refer Slide Time: 25:53)
The issue which comes up, friends, is that yes, you can capture an image of a person from the camera on your laptop or mobile phone. But why were we doing this? The aim was affect sensing; we wanted to understand the emotional state of the user. But in this pursuit we are capturing the face of the person, and the face also gives us things such as identity, age and gender, which may or may not be required by your affect sensing system, but which constitute very important private information about the user.
That is where some privacy issues come in when we are using a camera. One observation is that most of the cameras we see around us in our day-to-day lives are RGB, color-based cameras; you get a color image, as you can see on the right side of the visualization on the slide.
But there are other cameras as well, for example thermal cameras. Here you see, on the left side, the feed of the same person's face from a thermal camera, and it is telling you the temperature distribution. How does this tell us about affect? An example: let us say a person is feeling fear.
That is the emotion, and as they are feeling fear they may be breathing rapidly. When someone breathes rapidly through the nostrils, due to the movement and action of the muscles, the region around the nostrils starts warming up. You can actually sense the temperature difference in that part of the face.
One can then use this difference observed in the feed coming from a thermal camera to predict the emotion. The advantage, of course, is that it is easier to preserve identity as compared to the case of a color camera; the disadvantage is that thermal cameras are expensive and have relatively lower resolution than color cameras. Now, this was for when we are looking at a face.
Let me show you an example, friends. Let us say there is a camera, here is a person, and you get the image or the video; the slide is just showing you some frames. You do feature analysis, compute some statistics (we will be studying this in far more detail in the forthcoming lectures), and then you do some machine learning to predict the emotion class.
So, this is an example which I am going to play for you. This is, well, me, and what we have here is the face being detected using a computer vision library; these dots mark the facial points, the locations of the facial parts. Using the facial point information you can describe the structure of the face, which can tell you, let us say, whether the subject is feeling happy or not. We are taking a very trivial example here.
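To make this pipeline concrete, here is a minimal sketch, assuming you have a labelled set of face images; the lecture itself does not prescribe any particular library. The OpenCV Haar-cascade detector and the scikit-learn SVC used here are real APIs, while the crude raw-pixel features merely stand in for the facial-point statistics mentioned above.
```python
# Hedged sketch of the camera-based affect pipeline: detect face -> features -> classifier.
import cv2
import numpy as np
from sklearn.svm import SVC

# OpenCV ships with pre-trained Haar cascades for frontal faces.
detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def face_features(image_bgr):
    """Return a fixed-length feature vector for the largest detected face, or None."""
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    faces = detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return None
    x, y, w, h = max(faces, key=lambda f: f[2] * f[3])   # keep the largest face
    crop = cv2.resize(gray[y:y + h, x:x + w], (48, 48))
    return crop.astype(np.float32).ravel() / 255.0       # crude pixel features

# Training data (hypothetical): X = stacked feature vectors, y = labels such as "happy"/"not happy".
# clf = SVC(kernel="rbf").fit(X, y)
# print(clf.predict([face_features(cv2.imread("frame.jpg"))]))
```
In practice one would replace the raw-pixel features with landmark-based geometry or a learned representation, but the capture, feature and prediction stages stay the same.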
(Refer Slide Time: 30:34)
So, this is how it looks. You can see, as I play the video, that this is a happy expression and also that the eyes point in a particular direction. This, friends, is called eye gaze. Gaze also gives extremely important information about affect: it tells you where a person is looking. Imagine a scenario where someone is watching a horror movie and anticipating a horror scene coming up.
In that anticipation, the person may not look directly at the screen and may look to the side; that is a very important cue. You knew that the task was that the person is watching a movie, but the person is not watching the movie and their gaze is moving left or right. It could be that the person is not really immersed, not engaged.
Or it is simply an indication that the user, since they are watching a horror movie, is anticipating a scene with a very high horror quotient. So, you can use the gaze. Again, eye gaze is captured through the camera, and we have seen in various works that camera-based eye gaze gives us extremely important cues, for example about whether a person is making eye contact or not during a conversation.
Making eye contact can be used to assess a large number of attributes, such as confidence, or whether one is, let us say, speaking entirely truthfully or not. So, the camera enables gaze analysis along with face analysis.
(Refer Slide Time: 33:10)
Through the camera modality, we can also obtain the pose of the person. What does that mean? If you notice my video right now, let me try to act: "What!! Did you see a lion at the zoo?" What happened here in this small act is that I had a surprise facial expression, and even my hand gestures moved up: "What did you see? Wow."
Typically you will observe this when people are watching a game, a football or a cricket match; you can tell from their gestures what the affect is. So, through the camera we can easily obtain the pose and the gestures, and these can give us important information about the emotional state. You can tell this very easily from the image illustration here.
Here you have two people. You can see that person 1 is keeping his hand on the shoulder of person 2, person 2's left hand is on his head, and it seems as if they are in a situation where person 1 is consoling person 2, and person 2 visibly looks upset; you can tell that from the gestures. So, this gives us very important information about the affective state of the user.
(Refer Slide Time: 35:32)
Now, let us change gears and move on from the camera modality. What we have here is the microphone: voice-based affect sensing. In this case you will have a microphone, you will be capturing what the user is speaking, and then analyzing this information to predict the affect.
In certain scenarios, along with predicting the emotional state of the user through his or her speech, one can also make sense of the overall scene. Through the microphone you will be capturing, let us say, the background noise; maybe there is some music playing in the background, which can tell your affect sensing system that the user is in a place where music is playing.
So, maybe the user is not in his or her office, and if so, the chances are high that the user is in a casual setting. Based on the perceived emotion of this music playing in the background, I can have auxiliary, extra information about where the user is, and that can give me a more accurate understanding of the emotional state of the user.
In this case you would capture the data about the user from the microphone; let us say this is the waveform which is captured. We will extract some information from this waveform, some features, some statistics, and then there would be a machine learning model, similar to what we do for faces.
You have a camera, you capture the face, you compute some statistics from the face, and then you have a machine learning model which is going to predict the emotional state. So, friends, now we have seen the camera-based modality and the microphone-based modality. These are the two types of data feeds which, if you were interacting with someone, you would be analyzing in real time: two people are talking, and one is able to understand the emotional state based on the facial expressions and pose, that is, what the person sees, and what the person hears from the voice tone and so forth.
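As a hedged sketch of this voice pipeline, assuming short labelled audio clips are available (the clip paths and labels below are placeholders, not part of the lecture), one could summarise each waveform with MFCC statistics and feed them to a standard classifier:
```python
# Sketch of microphone-based affect sensing: waveform -> feature statistics -> classifier.
import numpy as np
import librosa
from sklearn.ensemble import RandomForestClassifier

def voice_features(wav_path, sr=16000):
    """Load a clip and summarise it as the mean and std of 13 MFCC coefficients."""
    y, _ = librosa.load(wav_path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)   # shape: (13, n_frames)
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])

# Hypothetical labelled clips: clip_paths and labels come from your own dataset.
# X = np.stack([voice_features(p) for p in clip_paths])
# clf = RandomForestClassifier(n_estimators=200).fit(X, labels)
# print(clf.predict([voice_features("new_clip.wav")]))
```
The exact features (pitch, energy, spectral statistics) vary across the methods we will study later; the wrap-around structure of capture, summarise and classify does not.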
Another very prevalent modality in affect sensing is text. We have documents, email conversations, messages which we send over platforms, and social media posts. All of this gives us a huge amount of data about a user and about a scene, but in the form of text.
This text could be an article which someone is writing, or it could be a chat happening between a user and an agent, which could be a virtual agent; this is like a conversational-AI use case. In this case the human is communicating with the machine through text chat, and the response to the human is also in the form of chat text. The user would be sending a message, and we can try to understand the affect, the tone, from that text.
That brings us to natural language processing: we would like to understand the affective state based on what is written by the user. Of course, this spans different languages, and in different languages users will convey the same emotion in different forms, different ways of phrasing, different words. Therefore, for this human-machine interaction using text, we will have different models through which we can analyze words, sentences and documents.
And we can also analyze human-human interaction. Let us say there is a job interview scenario with an interviewer and an interviewee; later on we take the speech data, transcribe it into text, and now we have the conversation between the interviewer and the interviewee in the form of text.
We can analyze this conversation and extract statistics from the text, which can help us in predicting things such as how the affective states of the interviewer and interviewee varied as the conversation proceeded. As an example: here you have a document, you could analyze the words, the sentences, the paragraphs, the links between sentences, how the neighboring words relate to a particular word, and extract these statistics about the document. You then have your PRML, pattern recognition and machine learning, model, and that can tell you, let us say, the emotion.
There are many applications of text-based sensing. I told you about human-machine and human-human interaction, chat, email and so forth. But it could also be that someone was writing a poem, or we have a document coming from a book hundreds of years old which we are digitizing. We want to understand what emotion was conveyed in the document or the poem by the writer. We can actually have that objectively measured, and that is affect sensing through text.
In the case of affect sensing through text, you will have noticed by now, friends, that the identity information can be hidden: you can actually parse the text, hide private data and then try to make affect-related sense of the data.
Compare that with the color camera or with speech: in the case of speech as well, one can analyze the signal and extract things such as the probable age range, gender and even identity of the person. So, one can better preserve privacy in the case of text as compared to speech and camera-based facial analysis or body pose.
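A minimal sketch of such a text pipeline, assuming a set of labelled sentences or documents (the example texts and labels below are illustrative, not from the lecture): bag-of-words statistics feeding a linear classifier, which is exactly the statistics-plus-PRML-model pattern described above.
```python
# Sketch of text-based affect sensing: documents -> TF-IDF statistics -> classifier.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical training data; in practice this comes from a labelled emotion corpus.
texts = ["I am thrilled with how the interview went",
         "This whole situation makes me so angry",
         "I feel completely fine about the result"]
labels = ["happy", "angry", "neutral"]

model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                      LogisticRegression(max_iter=1000))
model.fit(texts, labels)

print(model.predict(["the interviewer's answer really upset me"]))
```
Note that nothing in this pipeline needs the author's identity, which is why text is comparatively privacy-friendly.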
(Refer Slide Time: 44:34)
Moving forward in this direction, the other type of modality commonly used in many affective computing applications for sensing affect is physiological signals. You could think of it this way: when you hear someone and see someone expressing something, hearing and viewing are based on the facial expressions and voice, which are external to the person.
You see what their facial expression is and you hear what they are saying. But what about the different parameters and attributes inside the person, the implicit information? What could that implicit information be? A very simple example is heart rate. If a person's heart rate is relaxed relative to their normal, that is very strong information for a classifier to judge whether a person is, let us say, neutral, relaxed or aggravated.
A commonly used physiological sensor, along with the heart rate sensor, is the electrodermal activity sensor, called EDA. What it does is sense the change in conductivity of the skin due to the increase or decrease of activity in the sweat glands on our skin; mind you, these are very small sweat glands, observable only under a microscope. What does that mean?
It has been established that as your affective state changes (you could feel more stressed or less stressed, angry, fearful or happy), we see changes happening on the skin as well. You can measure that; for example, here you see one of the sensors from BIOPAC with which you are actually measuring this. There are two electrodes here and you are measuring the change in EDA, electrodermal activity.
Similar to what we did for text, you have this modality, the physiological EDA sensor: continuous data is coming in, we are again extracting statistics from the EDA data, there is an ML model, and then you predict. In the case of EDA data, what is the positive side? It is privacy preserving; we are not recording any information about the identity of the subject. What is the disadvantage? The user has to wear it, so it is intrusive. This is not a natural setting; we do not normally wear these sensors.
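As a hedged sketch of the EDA branch of this pipeline (the sampling rate and thresholds below are assumptions for illustration, not values from the lecture), one might summarise a conductance trace with a few simple statistics, including a rough count of skin-conductance responses found as peaks:
```python
# Sketch of EDA-based affect sensing: conductance trace -> summary statistics -> classifier input.
import numpy as np
from scipy.signal import find_peaks

def eda_features(eda, fs=4.0):
    """Summarise an EDA trace (microsiemens) sampled at fs Hz as a small feature vector."""
    eda = np.asarray(eda, dtype=float)
    # Rough skin-conductance responses: peaks at least ~1 s apart, rising ~0.05 uS above baseline.
    peaks, _ = find_peaks(eda, distance=int(fs), prominence=0.05)
    return np.array([eda.mean(),           # tonic level
                     eda.std(),            # overall variability
                     np.diff(eda).mean(),  # average drift (rising vs falling)
                     len(peaks)])          # approximate number of responses

# These features would then be fed to any classifier, as in the earlier sketches.
```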
Of course, friends, the smartwatches which are becoming quite ubiquitous will have these sensors. As the form factor and the user experience are such that these sensors sit well on the human body in the user's natural environment, users will hardly notice them. However, for high-quality, high-fidelity sensors of the kind shown here, the user does notice that there is something on the finger or on the skin.
So, they are intrusive, and in a few cases this may add its own bias to the data, because the user is aware of something. But that is also possible in the case of a camera: I am looking into a camera right now, and if I were asked to smile or show a fear expression, I might become camera conscious. That may not be the very natural expression which I would otherwise show in the world around me, if I were not aware that a camera was watching me.
The same goes for this physiological sensor: if it is on my fingers, I may be aware that if I move my hand too much I may break the contact or add some damage or noise to the data. However, a strong advantage is that you are preserving privacy.
Typically we will use these kinds of sensors as an ensemble; physiological-signal-based affect sensing will involve multiple sensors. Another very commonly used sensor, along with EDA, is EEG, the electroencephalography sensor. What is an EEG sensor? Friends, you can see here that this is an EEG cap. What you have here are electrodes, and what these electrodes are doing is recording the electrical activity of the neurons in the brain.
Why are we using this for affect sensing? When you see a stimulus, a video, or you are talking to a person, you are sensing what they are saying and observing the world, and that whole perception is happening inside the brain, which triggers different neural pathways, and that is what is measured by EEG.
(Refer Slide Time: 51:22)
You can see here that in the case of EEG the person has to wear the cap, though EEG caps are slowly getting very user-friendly and easy to wear. Similar to EDA, a strong positive point is that you are getting data right from the brain. What does that mean, and how does it make it different from a camera?
Let us say I am a very good actor. Maybe I was feeling very happy before I came into an environment where, in the social setting, I am supposed to show a somewhat sad reflection. What I am trying to convey here is that what you are feeling and what you show on your face sometimes may not be directly correlated. So, what you capture from the camera, or hear from the microphone in terms of a person's voice, is more towards a perceived emotion, and maybe not the real emotion of what the person is feeling.
The classic example is actors: some actors will get so well into a role that they will show different emotions, but are they truly feeling those emotions? There could be a gap here. But when you come to these physiological sensors, for example EEG, there you are measuring the different signals from the electrodes. The user can try to control their thoughts, but even that pursuit would be captured. So, it may now sound like EEG is the perfect type of sensing modality for affective computing.
But it has its own challenges. One is that it is not natural to wear these EEG caps: the user is aware that there is something on their head, and then you have a bunch of electrodes connected to wires. If the user moves a bit, that noise is also added to the captured signal, as is any misplacement of the cap.
Also, if a particular sensor, one of the electrodes, is not functioning correctly, you may not realize that until you analyze the data; the whole process of data capture may have been done, but one may not notice. The other challenge with EEG is that you cannot trivially take it outside the laboratory setting.
Compare that with the camera or microphone: you can use the microphone or camera sensor in a mobile phone, take it anywhere with the user, and you are able to sense the affect. With EEG that is a challenge, but what we get is data which is far closer to the ground truth, far closer to the user.
So, what happens in the pipeline is that you have the EEG sensor, you sense the data, and again, similar to what we have been doing, you extract statistics and features, then you have your machine learning model and it gives you the affect. There are a large number of such sensors available to us for predicting affect.
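For the EEG branch, a common choice of "statistics" is the power in the standard frequency bands. Here is a minimal sketch using Welch's method; the sampling rate and band edges are conventional values assumed for illustration, not numbers given in the lecture.
```python
# Sketch of EEG feature extraction: one channel -> band powers -> classifier input.
import numpy as np
from scipy.signal import welch

BANDS = {"theta": (4, 8), "alpha": (8, 13), "beta": (13, 30), "gamma": (30, 45)}

def band_powers(eeg_channel, fs=128.0):
    """Return the power in each standard band for a single EEG channel sampled at fs Hz."""
    freqs, psd = welch(eeg_channel, fs=fs, nperseg=int(2 * fs))
    return np.array([psd[(freqs >= lo) & (freqs < hi)].sum() for lo, hi in BANDS.values()])

# Concatenating band_powers() over all electrodes gives the feature vector for the ML model.
```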
In a nutshell, based on our discussion, we can divide the modalities as follows. There are camera-driven modalities, which give you the perception of sight: the faces, the body pose and gestures, even simple things such as hand tension. If, in a real-world setting, I were strongly clutching and rotating my hands as I am doing right now, that could give you a very vital cue, a lot of information about how I feel.
Then, friends, we have the microphone, which gives you the voice and the background sound, and you can use one microphone when you have multiple people, say two or three people in close proxemics, close to each other in space, with one microphone in between. The challenge with the camera would be that you may require multiple cameras to focus on the different people in a group; so there are setting-based challenges which will come up. Then we have natural language processing applied to text data, documents, emails and so forth; that is another modality.
And then we took examples of two physiological signals: EDA, the skin-based response, and EEG, the brain- and neuron-based response. Typically we will have combinations: you could have camera plus one of these, or maybe a text which is being read aloud by a user, so you have the microphone and you have the text.
So, we can use these different modalities in tandem to get complementary information. All of this depends on the type of project we are working on: what type of data is available, and what is the scenario in which the user will be using the system?
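One simple way to use modalities in tandem, as a hedged sketch, is late fusion: average the class probabilities of two modality-specific classifiers. The two classifiers and their feature vectors below are placeholders for whichever modality-specific models you have trained; both must expose predict_proba over the same set of classes.
```python
# Sketch of late fusion: weighted average of the class probabilities of two classifiers.
import numpy as np

def fuse_predictions(face_clf, voice_clf, face_feats, voice_feats, weights=(0.5, 0.5)):
    """Return the fused class label; both classifiers must share the same class ordering."""
    p_face = face_clf.predict_proba([face_feats])[0]
    p_voice = voice_clf.predict_proba([voice_feats])[0]
    fused = weights[0] * p_face + weights[1] * p_voice
    return face_clf.classes_[int(np.argmax(fused))]

# Example (hypothetical models from the earlier sketches):
# fuse_predictions(face_model, voice_model, face_features(frame), voice_features("clip.wav"))
```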
The type of project will decide whether a camera is possible, whether privacy is a major concern, and whether the setting is static, say a person can sit still, so that you can use an EEG kind of sensor or not.
So, with this, friends, we reach the end of the first part of the Fundamentals of Affective Computing. In the second part I will be introducing you to some of the applications of affective computing in different domains.
Thank you.
Affective Computing
Prof. Jainendra Shukla
Prof. Abhinav Dhall
Department of Computer Science and Engineering
Indraprastha Institute of Information Technology, Delhi
Week - 01
Lecture - 02
Fundamentals of Affective Computing Applications
Hello and welcome back. I am Dr. Abhinav Dhall from the Indian Institute of Technology, Ropar, and today I will be discussing the Applications of Affective Computing. Friends, this is part of the Affective Computing course which we are conducting on NPTEL.
In the last lecture, we defined affective computing and then looked at the various modalities through which the content can be captured: from a camera, from a microphone, or from a text perspective. We also saw how we can have a computational model which predicts the affect, telling us about the emotional state of a user or a group of users.
The agenda for today is to discuss the applications, and later on we will talk about the different areas in which emotion recognition and affective computing are applied.
(Refer Slide Time: 01:27)
With respect to applications, what we can do in principle is have a system which can detect the emotion of the user. And then, in this pursuit, the system would also express an appropriate emotion in response to what has been detected. The first step, as we discussed in the last lecture, is affect sensing: the machine understands the emotional state of the user. The second is to express appropriately, which is a reply based on the understanding the machine has of the user.
When we talk about expressing, we mean, let us say, an avatar or a robot or an animated conversational agent which would try to show facial expressions, or would try to vary its speech so as to give the perception of an emotion in the speech.
An example: if I were a digital avatar greeting a user, option 1 could be to say "hello and welcome" in a flat tone, which is very neutral. The other option, if I wanted to give a cheerful greeting, would be to say "hello and welcome!" with a lift in my voice. This now carries an emotion which is more towards the positive, happy side, indicated by my speech. So, this is the generation part, the synthesis part of the emotion.
We would also like to create systems which can perceive the emotion; this again is the affect sensing step in affective computing.
(Refer Slide Time: 03:25)
Now, let us look first at the perception, the detection aspect. If you consider the healthcare and well-being domain, there are a large number of applications where affective computing is highly relevant. An example is a project being done at IIIT Delhi by Dr. Jainendra Shukla's lab, called Visharanti.
The aim of the project is that, using the sensors in a smartwatch, we would like to predict whether the user is stressed or not. Essentially, with the sensors in the watch we capture the user data and then use a classifier. This classifier tells us whether the user is stressed or not stressed.
If the user is stressed, then there is a recommender system which recommends an activity, and there is a forward-and-backward loop which runs between the user, the recommendation which was made, and the classifier, into which this feedback can also be fed.
What this means is that we are learning the user's behavior online and trying to personalize the model, so that it can not only predict the stressed or non-stressed state of the user, but also give appropriate recommendations when the classifier predicts that the user is stressed.
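A hedged sketch of such a detect-and-recommend loop is given below. The activity list, label names and the feedback stub are illustrative assumptions; they are not details of the Visharanti project.
```python
# Sketch of a stress-detection loop: watch features -> classifier -> recommendation -> feedback.
import random

RELAX_ACTIVITIES = ["take a short walk", "do a 2-minute breathing exercise", "listen to calm music"]

def ask_user_if_it_helped(activity):
    """Stub for the app's feedback prompt; a real system would ask the user."""
    return True

def monitoring_step(classifier, watch_features, feedback_log):
    """One cycle: predict stress, recommend an activity if stressed, log the outcome for retraining."""
    stressed = classifier.predict([watch_features])[0] == "stressed"
    if stressed:
        activity = random.choice(RELAX_ACTIVITIES)
        helped = ask_user_if_it_helped(activity)
        feedback_log.append((watch_features, activity, helped))
    return stressed

# Periodically, feedback_log can be folded back into the training data to personalise the classifier.
```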
(Refer Slide Time: 05:09)
In the same direction, there is another very important application. Consider patients who have Asperger syndrome or high-functioning autism, that is, they are on the ASD spectrum. There would be record keeping done, let us say by a parent for their child who is on the ASD spectrum, possibly prescribed by the doctor.
In that record there are several questions: for example, how was the child behaving at a particular instant in time, was the child agitated, was the child relaxed? The answers to all these questions are based on the subjective interpretation of the child's state from the perspective of the parent; the parent would input into the app whether, let us say, the child seems to be relaxed or not.
This subjectivity and the continuous record keeping which is required are time consuming and can be noisy, because there is the possibility of confirmation bias coming into the picture. To this end, we can use affective computing and behavior analysis to automatically detect the state of the patient, for example whether the patient looks relaxed or stressed. This can be based on visual cues, the expressions and so on, or on the basis of speech.
We can then use apps which automatically do the record keeping for us. Over time this longitudinal data becomes very rich analytics which can be presented to an expert, a doctor, and that can help in, let us say, better diagnosis or treatment.
(Refer Slide Time: 07:30)
Another important role of affective computing in healthcare is in the case of PTSD, post-traumatic stress disorder. A typical example is soldiers who have been to war or through similarly difficult circumstances; when they come back, the effect of what happened in that war or those circumstances may in a few cases lead to PTSD.
Using affective computing, we can detect the changes which have happened in a person as reflected in the way they communicate, based on body gestures, speech and facial expressions. This information can then be used by an expert to give suggestions and treatment to a particular patient. For example, here we see an app called StartleMart.
It is a platform which has virtual environments. These kinds of treatments can be used where the stimuli both elicit an affect in the person and allow remedies to be given. Another example, from the tester HCI group, is of interviewers interviewing a refugee; if you notice, there is an EEG headset, a physiological sensor.
As the interviewers ask questions, the responses given by the interviewee, their perception of the questions and the ongoing discussion affect how they are feeling, and that is recorded as part of the EEG data captured by the EEG sensor.
Later, post-analysis can be done and we can find different attributes of the affective state which tell us how that person is feeling, which is extremely helpful in looking into their mental and physical health. For a refugee, of course, this would be important information for, let us say, their rehabilitation and how they are being absorbed into the new society.
Now, changing gears, friends, there are of course other health and well-being applications of affective computing, but let us move to the education applications. In 2014, Darnell and others proposed EngageMe. EngageMe is a system which uses a physiological sensor measuring skin conductance, and along with a video feed it captures how the students in a classroom are doing, so that we can collate the overall engagement and affect for the teacher to later reflect on.
When you look at the skin conductance data, there is a sensor on, let us say, the arm of the student, and there is a camera in the classroom as well; together the data from these can be combined to do rich analytics. Of course, when you are talking about video feeds, you have to be extremely careful about the implications for the identity of the person, and to ensure that the data we are capturing about the user is not in any way releasing or leaking their identity information.
On the same lines as EngageMe, there is another work called Subtle Stone, proposed before EngageMe by Balaam and others in 2009. This is fairly interesting.
What Balaam and others proposed was a wireless, handheld, squeezable ball which allows students to communicate their affective and motivational experiences to the teacher in real time. You could call this a handheld tangible object, and the feedback given through this squeezable ball can be used as an indicator, a way of communication, and as rich analytics based on the affective state.
In a related setup, the student reports the complexity of some of the questions, and while they solve them their facial cues are analyzed for different expression categories, which are then mapped to the self-reported complexity the student felt for a question. Now, let us look at the video.
This is a research project at the Machine Perception Laboratory of UC San Diego on using automatic facial expression recognition to improve interactions between students and teachers. In one particular project, the facial muscle movements of the student are measured in real time using a face detector and expression recognition.
Here are the rich analytics, essentially given by a smile detector and by the movement of the head, which in turn shows you engagement and so forth.
When this information is collated, you see difficulty versus time, and you see that the self-reported difficulty and the predicted difficulty are actually correlated. What this means is that you can use facial expressions and head pose cues and apply them to build better learning environments, to assess whether effective learning is happening in a classroom environment and, in today's context, in an online environment as well.
If it were an online learning scenario, one could predict engagement from the facial expressions and eye gaze. Friends, eye gaze is the location your pupils are fixating on, that is, where you are looking at this point. Using head pose and gaze you can learn a classifier which can predict the engagement intensity.
As you can see in this example, here is the timeline of an educational video being studied by the students; this is online material. What we see is that, for example, at these points the engagement is fairly high; here you see that there is a dip, and in this case the person is actually looking downwards and not at the screen. So, we can do this fine-grained analytics and see what the engagement level of a person is when they are consuming, viewing this material.
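As a hedged sketch of such fine-grained analytics (the angle threshold and window length are assumptions for illustration), one could flag each frame as "looking at the screen" from head-pose angles and average that over a sliding window to get an engagement score:
```python
# Sketch of engagement analytics: per-frame head pose -> on-screen flag -> windowed engagement score.
import numpy as np

def on_screen(yaw_deg, pitch_deg, max_angle=20.0):
    """Treat the viewer as facing the screen if head yaw and pitch are within +/- max_angle degrees."""
    return abs(yaw_deg) < max_angle and abs(pitch_deg) < max_angle

def engagement_timeline(head_poses, fps=25, window_s=5):
    """head_poses: list of (yaw, pitch) per frame. Returns an engagement score in [0, 1] per window."""
    flags = np.array([on_screen(y, p) for y, p in head_poses], dtype=float)
    win = int(fps * window_s)
    return [flags[i:i + win].mean() for i in range(0, len(flags), win)]

# A dip in the returned timeline corresponds to spans where the viewer looks away, as in the slide.
```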
(Refer Slide Time: 15:27)
One can go a step further: you can quantify engagement. For example, you could have a low level, where the person is not paying attention, which you can tell from their head pose, their gaze and, let us say, their facial expression as well, and a high level, where the person is really fixated on the task. You can imagine this also from the perspective of driver monitoring in a car: is the driver paying attention?
Here is an example I am going to show you from a human-robot interaction perspective. There are two parts, the first of course being the sensing part. What is going to happen in the video is that there is a robot, and the robot is speaking about a life experience borrowed from a popular TED talk. In the field of view of the robot there are users. Based on the head pose, that is, the location and angle of the frontal part of the face with respect to the view of the robot, the robot judges the engagement of the group.
If the engagement is high, the robot maintains its pitch and volume; if it goes down, then the robot adds variation. This is the sensing part, and the response of modulating the speech is the reaction on the basis of what was sensed. What also happens is that the gestures of the robot are based on the text which it converts to speech, so the gestures are generated according to what the robot is speaking. So, let us play the video here.
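A hedged sketch of the robot's sense-and-respond loop described here: the engagement threshold and the pitch/volume deltas are made-up illustration values, and the speak() call mentioned in the final comment is a stand-in for whatever text-to-speech interface the robot actually uses.
```python
# Sketch of the robot's loop: estimate group engagement from head poses, vary prosody if it drops.
import numpy as np

def group_engagement(face_yaws, max_angle=25.0):
    """Fraction of detected faces roughly frontal to the robot (yaw within +/- max_angle degrees)."""
    if len(face_yaws) == 0:
        return 0.0
    return float(np.mean([abs(y) < max_angle for y in face_yaws]))

def choose_prosody(engagement, base_pitch=1.0, base_volume=1.0, threshold=0.6):
    """Keep baseline prosody while the group is engaged; add variation when engagement drops."""
    if engagement >= threshold:
        return base_pitch, base_volume
    jitter = np.random.uniform(-0.2, 0.2, size=2)   # illustrative variation
    return base_pitch + jitter[0], base_volume + jitter[1]

# Per sentence: pitch, volume = choose_prosody(group_engagement(current_yaws)); speak(text, pitch, volume)
```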
(Refer Slide Time: 17:44)
[From the video] "She refused to sign the final adoption papers; she relented a few months later when my parents promised that I would go to college. This was the start of my life."
Now, there are some interesting points which come into the picture. One is, let us say, that there is communication happening between an agent or a robot and a user; it could be either an agent or a robot talking to a user. How does the agent or the robot realize that the conversation is ending? The other question is: what is the appropriate behavior of the robot or the agent once they have predicted the affective state of the user?
A simple example: let us say there is a social robot in a house, and the user walks into the living room visibly distressed. The robot, using the camera, locates the face of the user and analyzes the facial expression, and if the robot is able to sense that the person is distressed, then it can, for example, crack a joke, or ask questions, or give relevant information which can help the user feel better.
On similar lines, one could also use affective computing and the analysis of face and speech in training.
(Refer Slide Time: 19:38)
Here is an example: My Automated Conversation Coach, the MACH system, which was developed by Ehsan Hoque and his team at the MIT Media Lab. In this case there is a virtual agent and there is the user, and they have a conversation. During this, there is a camera which captures the face of the user and looks at things such as the presence of a smile, the head pose, gaze and so forth.
At the end of the conversation the user gets very rich analytics about how they were perceived based on their expressions and head pose, and they can use this to train themselves, let us say, to speak better. An example could be an interview process: you can prepare yourself to speak better in an interview. So, let us see the video, which is available on YouTube.
(Refer Slide Time: 21:50)
Here is another example: the robot named Kismet, which was developed in the late 1990s at the MIT Media Lab. What the robot does is sense the person's facial expression and then try to mimic it; it will try to give a big smile if you show a big smile in front of it. The aim is quite simple: you sense the affect and then you react accordingly.
(Refer Slide Time: 22:18)
In 1997, Professor Brooks and his team built Kismet, a small robot with eyes, ears and a mouth, so it could see, hear and experience the world around it.
(Refer Slide Time: 22:46)
Now, moving to a smart interface: this is an example from Monash University wherein they proposed the Mirror Ritual system. It looks at the emotional state of the user through a camera and then generates text, which could be a poem, mapped to the emotional state of the person.
They propose this as an emotional self-reflection tool. There is a mirror with a camera; you come in front of it, it senses your emotion, and then it generates a poem based on that emotional content, a reflection of how the user is feeling based on their facial expressions and so forth.
Again, in this we see the sensing part and the response part; the response here is the text generated by the mirror. So, here is the video.
(Refer Slide Time: 23:40)
Mirror Ritual is an interactive artwork that challenges the existing paradigms in our
understanding of human emotion and machine perception.
(Refer Slide Time: 23:55)
Each poem is unique and tailored to the machine-perceived emotional state. Mirror Ritual augments the human experience of emotion, expanding it beyond the internal, unreflected experience into the realm of the tangible and expressible.
Now, friends, we have seen all these applications: in healthcare, in the education domain, and in driving as well. In fact, there are now games coming up in which one of the expectations from the user is to show, let us say, a particular facial expression.
Let us say we have a game designed to help a patient learn to express better, to show facial expressions better. In that case the task would be to show facial expressions on the user's face, and there is a system which tries to predict whether, let us say, the smile was a high-intensity smile; only then does the person move to the next level. So, there is gamification as well, and in this too the facial expressions and affective cues are being used.
Such interfaces, where you go to the next level in a game when you show a certain facial expression, are emotion-aware interfaces. The pursuit again is along the lines of creating smart interfaces which are aware of the emotional state of the user, so that the content shared with the user is in sync with how the user is feeling, and so that we do not, let us say, overwhelm the user with information which could have an adverse effect.
In the pursuit of creating these interfaces, there is the whole domain of UX, user experience: how the user experiences interacting with the interface. There is a whole line of study in user experience: how we gather the requirements and how we then test them.
I will briefly touch upon this and then discuss how we can use, for example, an affective-computing-based technique to expedite and better understand the user requirements and the user response when we are talking about user experience.
So, let us formally introduce user experience. So, UX now it is a very broad area it
encompasses many aspects of a user’s interaction with let us say a product. You could be
designing a website and the organization, ok. So, it is typically saying you create a website
and now you would like to understand before releasing it to the masses how do the users in a
control group feel about that website.
Now, in this context usability is more the practical side of the things. So, the question is you
know how much easy it is let us say for a user to achieve a certain goal through this website
or this product which you are creating and in that pursue of achieving that goal how easy or
how easy interpretable the information which was presented to the user was.
Now, the combination of these two would lead to a better understanding of the emotional experience and interaction with a brand, and hopefully add to a memorable positive experience. An example of that: let us say we are creating a website for elderly users, ok. Now, when an elderly user goes online to this website and, let us say, they are supposed to find certain information, how easy is it for them to fetch that information, and how can the system recognize if a person is struggling to find that information?
Once it recognizes this, let us say, confusion state, then how can the system react, right, how can the information be better presented? For example, if the user does not find the information, maybe try increasing the font size of a certain part of the page on your website.

Now, in this whole process, before you make your website live, how are you going to test it, right? How would you actually know that the experience is good, it is easy to use and the purpose is also achieved, the purpose for which you created the website in the first place?
45
(Refer Slide Time: 29:00)
So, a typical approach here would be to do participatory design and acceptance testing, where you have these control group stakeholders who would be involved in the design, right. For the website for the elderly example, you could actually have a few elderly users and you could fetch information from them about how they expect the website to be, right.

Now, this will give you an idea about the aspects of the user centered design, this is what the user wants, right. And then once you know what the user wants you can actually have the website designed accordingly. Now, there are different approaches, for example the PICTIVE approach from Muller in 1991. And then once you have done this you have the acceptance testing as well, right.
Now, you would like to test the product and you want to understand the functionality, usability, reliability and compatibility. From our example of the website for the elderly: how correctly is the website functioning, how is the user reacting, and what is the feedback from the user?
Now, this feedback from the user is typically collected in the form of self reporting: you could, let us say, give a form to the user and then have a series of questions which would be validating your design hypothesis and your design philosophy, and whether the goals of the product, let us say the website example which I shared with you, are served or not.
Now, there are some issues when you are talking about these self reports or surveys which are performed. The first is that it is sometimes non-trivial for the users to articulate very clearly how they are feeling about, or what their experience is with, the product. And typically, what is observed is that these are affected by things such as participation bias: n number of people are already saying this is a good product; I am not so sure, but because n people are saying it, I will also say yes, this is a good product, right. And then, based on our prior experiences, our memory is going to affect our judgment of the current product during this testing phase.
Now, let us say we take the example of this website and this is about marketing, ok, and there are videos which you want to show on this website which is for elderly users. So, what we can do is use affective computing here. Now, you see how affective computing is going to come into the picture: we understood the user requirement, we created the website and now we want to understand what the user feedback is, ok.
Are our objectives met or not? If they are not met, then we would like to go back, rehash and improve the site, improve the product, before releasing it to a wider audience. So, what we can do is, when the people in our control group, the participants, are exploring the product, the website in this case, we can have them analyzed using a system which is going to look at different aspects such as expressions and head pose and so forth, if you are using a camera. And once you have this information, you can then make a decision about what the changes should be or how the product is actually turning out.
So, let us see what we can do in this case. We have taken this example of a web based interface, the website, and some main considerations when you are going to do this kind of usability testing with affective computing are: how are you going to have this control group? What types of sensors are you going to use? Are you going to have a large number of users who are going to be evaluated? And how comfortable are these users, let us say, with being analyzed through a camera? Or how comfortable are these users when you put a physiological sensor on them? Right.
48
(Refer Slide Time: 33:59)
So, here is an example. This user is participating in a study; let us say there is a website which the user is browsing. Here we have the EEG headset and we also have a camera. So, you can now analyze, let us say, the action units, the facial expressions, and along with the EEG signals you can use it to understand what the user behavior was when they were doing a certain task on the website, ok.
So, this can give you very valuable information. And when you collate it over a large number of subjects, you can then find out, similar to that engagement example, that when the user is, let us say, trying to search for information, that is where they lose interest, because it is non-trivial to find information on the website. This is an example.
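As a rough illustration of this kind of collation, here is a small sketch in Python. The column names, the engagement score and the task intervals are hypothetical placeholders for whatever the camera or EEG pipeline in the study actually produces.

```python
# Sketch: align per-frame affect estimates with task steps in a usability
# session and summarize where engagement drops. All values are illustrative.

import pandas as pd

# One row per frame: timestamp (s), predicted expression, engagement in [0, 1]
affect = pd.DataFrame({
    "t":          [0.0, 1.0, 2.0, 3.0, 4.0, 5.0],
    "expression": ["neutral", "neutral", "confusion", "confusion", "happy", "happy"],
    "engagement": [0.8, 0.7, 0.3, 0.2, 0.9, 0.9],
})

# One row per task step on the website, with start/end times in seconds
tasks = pd.DataFrame({
    "step":  ["open homepage", "search info", "read result"],
    "start": [0.0, 2.0, 4.0],
    "end":   [2.0, 4.0, 6.0],
})

# Attach each frame to the task step it falls inside, then aggregate
affect["step"] = affect["t"].apply(
    lambda t: tasks.loc[(tasks.start <= t) & (t < tasks.end), "step"].iloc[0]
)
summary = affect.groupby("step")["engagement"].mean()
print(summary)   # a low mean engagement flags the step where users struggle
```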
49
(Refer Slide Time: 35:00)
Now, while we are discussing all these applications it is important to explore the ethical considerations, right. We talked about health related applications and education related applications, right. Particularly for health and well being we might be dealing with a patient, so it is extremely important to be observant of the following.
Consider the case when the user is prompted that ok, you may take a break, because the system sensed that the user is a bit stressed. Now, there are ethical boundaries here, ok. Is the user aware that the system is analyzing them? Is it, let us say, recording their physiological signals, or analyzing their speech or vision?
So, has the user approved of this? And once the user has approved and the system analyzes for stress, or let us say the facial expressions which are going to be linked to the emotion, and gives the recommendation, then is the system trying to manipulate the user? What is that fine line, what is that boundary within which we can allow the system to recommend something to the user? Ok.
So, what this means is that it is not just about computer scientists who are going to come and create the system; we also need experts from the social sciences on what the user behavior is going to be, how the user ideally should react and how the system should present the information.
50
The other super important aspect is privacy. Quoting from Professor Rosalind Picard: emotions, perhaps more so than thoughts, are ultimately personal and private. How you feel is a very private thing, and to analyze the emotion of a person or user through facial expressions or speech or through other modalities could be invading their privacy.
So, what it means is that we have to be mindful of keeping the user's identity and personal information hidden. An example: let us say a system was being created for analyzing the effect of adding a certain new feature in an app which is used by children, and for analyzing the effect of adding that feature the developers use facial expressions. Now, when you analyze the facial expression of a subject, you have to make sure that you only share or record the facial expression, but not the identity or the personal information. That facial expression record can simply say, for example, at time stamp t1, subject 1 was showing a neutral expression.
What could be invading privacy is if you say, at time stamp t1, subject 1's name showed a neutral expression, right. You do not want to divulge the identity. Also, think about the scenarios where you are going to use this information. So, let us say we have a system
which is used to assess an interview candidate online, the candidate is visibly stressed, and the system detects that the person is stressed. Now, maybe that person is stressed due to something which is outside the scope of that interview, ok. Should that be communicated to the interviewer or to anyone who is going to analyze the video later on? So, we have to be very mindful in affective computing about the privacy aspects of the user.
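One simple way to operationalize this kind of anonymized record keeping is sketched below: store only a pseudonymous subject ID, the time stamp and the predicted expression, and never the name or the raw video. The salted-hash pseudonym scheme is an illustrative assumption, not the only possible choice.

```python
# Sketch of privacy-preserving logging: only non-identifying fields are kept.

import hashlib

SALT = "study-secret-salt"   # kept private, never shipped with the shared dataset

def pseudonym(real_identity: str) -> str:
    """Derive a stable, non-reversible subject ID from the real identity."""
    return hashlib.sha256((SALT + real_identity).encode()).hexdigest()[:10]

def log_observation(log: list, real_identity: str, timestamp: float, expression: str):
    # Store the pseudonym, time stamp and expression label only
    log.append({"subject": pseudonym(real_identity),
                "t": timestamp,
                "expression": expression})

records = []
log_observation(records, "participant-real-name", 1.0, "neutral")
print(records)   # e.g. [{'subject': '9c1d...', 't': 1.0, 'expression': 'neutral'}]
```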
Now, the third ethical consideration is emotional dependency. Could having users routinely use these agents create codependence? Checking your expressions, or getting your speech patterns analyzed, and then getting this feedback from the system, that ok, please take a break, or you are doing very well: the user gets too used to this, becomes too dependent on this kind of information.
So, this means that when designing such applications and systems which use affective computing, we need to bring in experts from the interaction perspective as well, to ask what is the point after which the user could get too dependent, right, could want feedback from the system even for simpler tasks. Now, the system could have started with a noble aim: we will measure, let us say, the expression or the emotion and then we will recommend something, or we will show some analytics. But on the other side we see that the user gets so dependent on it. So, what is that fine line?
And of course, that means this actually depends on how the system was designed, what the intention of the system was, and how well it was tested before that affective computing enabled system is released to a mass audience.
So, friends, with this we reach the end of the application part of the fundamentals of affective computing lecture. We discussed the different applications of affective computing, focusing on healthcare and well-being, moving on to the applications in the education sector and online tutoring, and then we saw examples of human-robot interaction and interfaces as well.
Later on, we moved on to how we can have objective feedback in the form of emotions, which can be used in an example like neuromarketing or the creation of a product, where you would like to understand the user feedback from the perspective of user experience and so forth.
Thank you.
52
Affective Computing
Prof. Jainendra Shukla
Prof. Abhinav Dhall
Department of Computer Science and Engineering
Indraprastha Institute of Information Technology, Delhi
Lecture - 01
Emotion Psychology
Hi friends, welcome to this week's class. In this week we are going to talk about the theory of emotion, and here is our agenda.

We will first cover the psychology of emotion and emotion models, and we will try to understand the limitations of the traditional affective computing approach; then we will move on to understand the specificity of the emotions, followed by emotion and brain asymmetry.
53
(Refer Slide Time: 00:53)
So, let us start with the psychology of emotion. When it comes to the psychology of emotions, there are a number of different questions that we need to answer, but we will start with the two most prominent questions with respect to affective computing.

The first question that we are very much concerned about is what emotions themselves are. Before even trying to understand emotional intelligence and all that, we first want to understand what emotions exactly are; the second thing that we want to understand is how emotions are generated or elicited. This will enable us to understand the emotions in a better way, so that the monitoring can be done in a better way and accordingly the agent adaptation can be done in a meaningful manner. Perfect.
54
(Refer Slide Time: 01:45)
So, let us start with trying to understand the emotions. What exactly are the emotions? Emotion as such has been a very difficult topic for psychologists to define, and they have been struggling with it for many decades. While there are different variations and versions of the understanding of emotions, one common and popular definition has the following components.

Emotion is understood as a complex set of interactions among two important things, one being the subjective and the other the objective factors, and it is mediated by the neural and the hormonal systems, which can, number one, and this is very important to understand, give rise to affective experiences such as feelings of arousal, pleasure or displeasure.
What it means is that the very first component of emotion that we want to understand is that it arises from a mixed set of interactions between subjective and objective factors, which can be many, and the very first thing that the individual experiences is an affective experience, such as the feeling of the emotion itself.

The second thing is that the emotion also generates cognitive processes such as emotionally relevant perceptual effects, that is, how exactly we are perceiving those emotions, and appraisals, basically how exactly we are evaluating those experiences that have arisen and then trying to label them in a particular fashion.
So, there are these two important things: the very first is that there is an affective experience involved, and along with the affective experience we have a cognitive process. The third thing, which is very important again, is that it activates widespread physiological adjustments to the arousing conditions.

What does it mean? That once we have felt a particular emotion, there are certain physiological adjustments that happen in the body with respect to the emotion that we are feeling, and this kind of activation goes together with the feeling of the emotion. So, the third thing is the activation of widespread physiological adjustments.
And last, but not least, once we have felt an emotion, once we have evaluated what the emotion is all about, and once our body has done some physiological adjustments, such as a change in the heart rate and so on, then the final and very important step is that it leads to an action, a behavior that is in accordance with the emotion that we have felt. This behavior or action can be expressive, though it may not always be expressive, but nevertheless it could very much be goal oriented and adaptive.
So, what it means is that once we have felt a particular emotion, we may want to act in a way that is going to help us in countering or responding to that particular emotion. Now, let us take a very simple example. Suppose that we encounter an animal in the forest, right. The moment we encounter an animal in the forest, there are lots of different things happening, right.
So, there are lots of subjective and objective factors here. Of course, there is the objective fact that you are in the forest and there is an animal, and the subjective factors could be that you may already know what type of animal it is; imagine it is a lion and it is a dangerous animal. So, you already know these kinds of factors, and thus there are subjective and objective factors present.
Now, the moment you see a lion, there is a particular experience that you will have, and that particular experience could be, let us say, feeling scared, right. So, you may feel scared on seeing the lion. That is the first check that we have: we have the affective experience, which in this case is getting scared on seeing the lion.
Thing number two would be the generation of the cognitive processes. What could be the generation of the cognitive processes in this case? Once you have seen the lion, what will happen? Your brain is going to evaluate the situation, right. The perceptual effect is that you have seen the lion through your eyes, and now the brain is going to evaluate the situation: at what distance the lion is, whether the lion is in a cage, well, if it is in a forest it is probably not in a cage, it could be free, whether you are in a safe position, whether you have an option to run, and so on and so forth. So, you will start evaluating your options.
So, these are basically the cognitive processes, the cognitive calculations that are going to happen. Number two is also checked. Now, the third thing: once you have encountered this feeling of being scared on seeing the lion and you have already evaluated your options, imagine that you find yourself in a position where you are very near to the lion and you cannot easily run, because if you decide to run then maybe the lion can see you or even catch you, right.
So, then there are certain physiological adjustments that are going to happen in this scenario. For example, maybe your body will suddenly realize that you really need to run fast in this situation, and in order to run fast you may need a lot more supply of oxygen or blood towards your limbs, because you need to take some action now. So, your heart rate may increase, because your heart will start pumping at a faster rate so that it can produce the supply that your body now demands. At the same time you may start experiencing some kind of sweating and so on and so forth. So, basically this is the third component, the physiological adjustments that happen with respect to the emotion that you have just felt, right. Perfect.
Now, the fourth thing, of course, is that it should lead to behavior, a particular behavior that in this case may be expressive or may not be expressive; if, for example, you decided to do nothing it may not be expressive. But usually, if you have seen the lion, if you have evaluated your options and you have already understood that the only option for you is to run as fast as you can, then what you do is definitely going to be expressive, because you will run. You will have a goal of running towards a safer zone, putting yourself in a safer place, and your action will be directed towards that goal. And of course, it is going to be adaptive, because you are taking this action of running only because you felt scared.
You have evaluated your options and accordingly your brain and body together have decided that this is the only option, we need to run, and we are going to make these adjustments, right. So, in this case the fourth box is also checked.
So, I hope that now you understand that while emotion is a very difficult topic to define and there are lots of different components here, in general, as per this popular definition given by Kleinginna and Kleinginna in 1981, we have four very important components. One, there is an affective experience; you can say this is number one. Two, we have some cognitive processes involved with that particular emotion. Three, we have certain physiological adjustments in the body. And then, four, there is a certain action that is motivated by, or adopted in adaptation to, the particular emotion, right. Perfect. So, this is how, in brief, we tried to understand what exactly an emotion is, right.
58
Now, let us try to understand the second question that we raised in the beginning, which is how exactly the emotions are generated. This is a very interesting topic, and before even being able to answer this question, let us take another question which is derived from a similar example to the one we just saw.
For example, imagine we encountered an animal, which could be a lion, a bear or anything. Imagine that we encountered a bear: do we run from the bear because we are afraid, or are we afraid and hence we run? It is a very interesting question that was raised by psychologists in the 1800s, and there are different opinions, different ways in which psychologists and emotion theorists have tried to answer it.
For example, James came up with a very simple proposition, which runs contrary to the common sense view: he says that we are afraid because we run. What James said in his very famous proposition in 1894 is that we are afraid because we run, which means that our emotions are generated because of the bodily responses.

So, for example, our heart started racing, we started feeling a tight stomach, we had sweaty palms, our muscles got very tense and so on, and because we started experiencing these bodily responses, we started experiencing the feeling of this scaredness or fear, right. This is the theory that was popularized by James in 1894.
Nevertheless, there is a lot of criticism of this particular theory. One common criticism, given around the same time by Worcester, was that the fear in this case is not directly caused by the sense perceptions, which was the argument of James, but by certain thoughts, please pay attention to this, by certain thoughts to which these perceptions may give rise.
So, basically what Worcester is trying to say is that in the same situation we are feeling scared, we are feeling fear, not because our body is experiencing some responses, but because there are certain thought processes, certain cognitive processes, that happen first, and because of those cognitive processes certain evaluations or perceptions arise, and hence this feeling of fear or scaredness arises.

And to a certain extent this is very true, but at the same time we need to understand this: the physiological responses that our body is experiencing are of course sent back to the brain, in order to make the brain understand what exactly the body is demanding or needing, and this unique pattern of sensory feedback, the feedback that the body just gave to your brain, gives each emotion its unique quality, right.
So, in this case it is really hard to say whether we first felt the bodily reactions and only according to that the emotions were generated, or we first had certain perceptions, because there was a certain thought process in our brain, and that gave rise to the emotions that we are experiencing.
So, for example, let us take the same situation where we encountered a bear and someone is feeling scared, right. But in the same situation imagine that there is a well-armed hunter, and if the well-armed hunter encounters the bear, rather than feeling scared, he is going to feel happy. Why is he going to feel happy? Because of course he had a thought process first, a thought process that ok, today I am going to go to the forest and maybe I am going to make a kill of a big animal, maybe a bear, right. So, rather than saying that you saw the bear first and then certain bodily reactions happened and those gave rise to the emotions, what is happening here?
What is happening here is that, for example, you had a certain thought process, and once you encountered a stimulus, a situation, combined with that thought process certain perceptions arise, and because of those perceptions you now feel a particular emotion, right. Similarly, we can take another example where an ordinary person, who may not have seen a bear before, may feel curiosity on seeing a bear or a lion in the forest, and so he or she may not feel scared and may not feel very happy either; they may just feel curious. So, now you have to understand. Ultimately, let us discuss our question again: the question was whether we feel our emotions because of the bodily reactions, or the bodily reactions happen because of the emotions.
Of course, we already saw that there are two very popular views, one given by James and one given by Worcester, and they both have their own set of arguments. But primarily, it may not be entirely true that we feel emotions because of the bodily reactions, certainly we saw some examples here, and at the same time it may also not be entirely true that there are certain thought processes, certain cognitive processes, that happen first and only then the bodily reactions follow.

So, basically it has to be a mix of both, and certain things are happening simultaneously, but the answer to this particular question is really not that simple. Nevertheless, this is what psychology says so far: it could be either of these, but most of the time psychologists tend to lean towards the second approach, where they see that there are certain thoughts to which the perceptions may give rise, right. Perfect.
So, I hope that now we have understood what exactly the emotions are, what the four components of the emotions are, and we also tried to understand how exactly the emotions are generated, right. Basically, first, because of certain thought processes, certain perceptions arise, and they are followed by the bodily reactions. Those bodily reactions in turn send some signal back to the brain, to which the brain again responds or adapts, right. So, basically this is how the emotions are generated in general. Perfect.
61
(Refer Slide Time: 16:42)
So, having understood how the emotions are generated and what exactly the emotions are, there is another very interesting thing that we must look at while discussing the generation of emotion, and that is the bidirectional projections. We have already understood that our brain impacts the body: when the brain feels something, when it encounters a particular type of thought process, then certain bodily reactions follow.
In the same way, the body also impacts the brain, right, and there are different pathways, neurons actually, for this. There are, for example, afferent neurons and efferent neurons, and what they do is create different types of paths through which these signals travel. With the help of the afferent neurons, all the bodily reactions are communicated back to the central nervous system, which includes your brain and your spinal cord. And similarly, with the help of the efferent neurons, all the signals from the brain, from the central nervous system and the spinal cord, are communicated to the body, right.
So, there are these two different types of communication that happen with the help of different pathways, different types of neurons in the body. Now, one very beautiful example of how the body impacts the brain is laughter yoga, right; you may already know about laughter yoga. In the case of laughter yoga, a group of people come together and try to laugh together, maybe without even having a context, without having a joke, right. So, basically they try to laugh consciously without even having, let us say, a very joyful situation there, right.
When you start doing this practice of laughter yoga, in the beginning you may not be able to feel joy, you may not even be able to laugh with the group, right, and it could be embarrassing as well. But then, as time progresses and as you start getting accustomed to these yogic meditation methods and ways, slowly you can also start laughing, and by starting to laugh your body starts feeling joy, and in turn that laughter becomes very natural, right.
So, basically that is what happens in the case of laughter yoga: you experience laughter, you experience joy, without the humor or the cognitive thought of the same, and that is one beautiful example of how the body impacts the brain. In this context, there have been several experiments that researchers have done in the past, trying to understand what exactly the implications of this are and how exactly the body impacts the brain.
So, for example, one of the very first and very interesting experiments was done by Strack and his team in 1988, in which they tried to understand how the voluntary contraction of facial muscles contributes to the emotional experience. What they did was a very interesting experiment: they asked the participants to hold a pencil either with their lips or with their teeth, right.

When you hold the pencil with your lips, you are inhibiting the smile, you are not allowing the participant to smile; and in the same way, when you hold the pencil with your teeth, you are sort of mimicking a kind of smiling behaviour. A similar experiment was also carried out by Levenson in 1990.
So, what did they find? The participants who held the pencil with their lips rated the cartoons as less amusing. Please note here that they had the pencil in their lips, which means their smiles were being prevented or inhibited, and while doing so they were asked to rate cartoons. They rated the cartoons as less amusing in comparison to the participants who held the pencil in their teeth.
So, the participants who were holding the pencil in their teeth were mimicking a smile, and at the end it was found that they rated the cartoons as more amusing. What exactly does this interesting experiment tell us? It tells us that by inducing certain bodily reactions, these bodily reactions also impact the brain, right.
For example, when you inhibit the smile, you are preventing yourself from smiling, and maybe the signal that your body is giving your brain is that the situation is not very amusing, not very humorous, and for the same reason maybe the cartoons got less amusing ratings.

On the other hand, you have the participants who held the pencil in their teeth, and by doing so maybe the signal that the body is giving to the brain is that it is a very amusing scenario, a very amusing situation, even before looking at the cartoons, even without encountering any humor or the cognitive thought of it, right. So, basically by putting yourself in this particular situation you are giving a signal to the brain that the situation is quite amusing, and it could be for the same reason that the participants who held the pencil in their teeth rated the cartoons as more amusing, right.
So, this is how the bidirectional projections work. It is very important to understand how the body impacts the brain and how the brain impacts the body; while of course it is very common to understand that the brain can easily impact the body, at the same time there are any number of cases where we can see how the body can also impact the brain when it comes to the experience of an emotion. Right, perfect.
64
(Refer Slide Time: 23:09)
So, now, having understood what emotions are, how emotions are generated and what the bidirectional projections of the emotions are, let us try to understand what the different models of emotion are, or how the emotions are perceived, right.

When we talk about the perception of emotion, Gabrielsson proposed a distinction between the perceived and the felt emotions, right. What Gabrielsson proposed is that perceived emotion is the emotion that is recognized in the stimulus.
What does it mean? It simply means that, for example, you are listening to a particular piece of music, and the music may have its own emotion, right, that it wants you to perceive, that can be perceived. For example, there could be happy music, sad music, dancing music, funny music, all different types of music. That is the perceived emotion.
So, this is very important: perceived emotion is the emotion that is recognized or embedded in the stimulus, right. On the other hand, we also have the induced emotion. What is induced emotion? Induced emotion, contrary to the perceived emotion, is the emotion actually experienced, actually felt, by the listener.

So, please pay attention to these two things. Now, the induced emotion could be similar to the perceived emotion as well. For example, maybe you have music which is happy music in general, right, and the person who listened to the music also felt happiness, right. So, in this case what may happen?
The perceived emotion and the induced emotion both become similar, and that is what, as a designer, your aim, your objective should be, right: whatever emotion you want the user to perceive, the same emotion is induced among your listeners or users.

But what happens, for example, is that even though there is happy music, a particular kind of incident may have been associated with that particular happy music, and while listening to that music, rather than experiencing happiness, rather than experiencing joy, you may experience sadness, right. In this case these two things can be very different.
That is why Gabrielsson made this nice distinction between the perceived emotion and the induced emotion, right. As an affective computing researcher, or as a student practitioner of affective computing theory, you also need to take this into account: when you are talking about emotions, what exactly is the type of emotion that you are talking about, the perceived emotion or the induced emotion?

Nevertheless, most of the time the induced emotion is very dependent on the individual's own thought processes, desires, beliefs and intentions, right. So, there is a lot of individual variability when it comes to the induced emotion, and hence most of the time, when we discuss the emotions, we discuss the perceived emotion of a stimulus or the perceived emotion of a system, right.
And then there are different ways in which this perceived emotion, or along the same lines the induced emotion as well, can be modelled or represented. Among the different models that exist, two models, the discrete or categorical model and the dimensional model, are very popular and most commonly used by the affective computing community. Right.
66
(Refer Slide Time: 27:18)
So, we will now discuss what exactly the categorical model is and what exactly the dimensional model is. Ok, we have now understood how we can represent the emotions; the different ways, in detail, we will see in the next session. So, see you.
67
Affective Computing
Prof. Jainendra Shukla
Prof. Abhinav Dhall
Department of Computer Science and Engineering
Indraprastha Institute of Information Technology, Delhi
Week - 02
Lecture - 02
Emotion Theory and Emotional Design
Hi friends. Welcome back, and now let us start with session 2 of this week. In session 1 we already understood that there is a distinction between the perceived and the induced emotions, and that both types of emotion can be represented or modelled in different ways. The two most popular models are the categorical models and the dimensional models, right. So, let us start with the categorical models.
Now, in the categorical model, discrete emotional states are used, and this categorical model is the most popular or most commonly used model in the affective computing community as of now, for reasons that will become very obvious. The first thing is that the categorical models of emotion are very easy to understand, and hence they are very easy to represent as well.
What this categorical model of emotion does is describe the emotion in exactly the same way in which we use it in our everyday lives, that is, using only a single word. It simply uses a single word to describe the emotions that are there.
That describes our affective states, and this particular view of describing the emotions using a single word, or using the words that are commonly used in our everyday lives, has been in use right from Darwin's age, when he described his evolutionary view of emotions. There he said that emotions are basically what can be described by a single word, or by what people use in their day to day lives. Now, within this categorical model of emotions there are also different ways in which categorical models of emotion are implemented.
And the one that is very popular among these categorical models of emotion is the one given by Paul Ekman and Friesen in 1971, where they recognized that there are six basic emotions, and these six basic or discrete emotions together can be treated as one categorical model, right.

Of course, Paul Ekman popularized it with respect to facial expressions, but the idea was that these are the six basic emotions using which any more complex set of emotions can be represented. The six basic emotions that Paul Ekman popularized were happiness, anger, disgust, sadness, fear and surprise, right.

In his later papers he also recognized contempt as a seventh basic emotion. Altogether, the idea is that these six or seven basic emotions represent most of the emotions that we experience in our day to day lives, but more importantly, even complex emotions can be constituted with the help of these basic emotions, right.
69
(Refer Slide Time: 04:13)
Now, the biggest advantage of these models, of this representation of emotions, is that from a computational perspective they are very easy to implement. The emotions become discrete classes from a machine learning perspective, and if we are talking about six basic emotions, then we only have to deal with a six-class classification. Hence it becomes very easy for a classifier or a machine learning model to deal with this type of modelling.

And similarly, not only for the machine learning models: even when you want to collect the ground truth, it is very easy for the end users to provide the ground truth in terms of these basic labels, because it is very easy for them to describe whether they are feeling happy, sad, scared and so on, right. So, it becomes computationally very easy. But one problem with the model is that even though it talks about some basic emotions, and even though Paul Ekman and other researchers hypothesized that complex emotions such as shame, pride, guilt and so on can be constituted with the help of these basic emotions, it remains unclear how the relations between these discrete states can be modelled, right.
For example, it is not very clear, in the case of pride, what exactly the elements of the basic emotions should be; maybe there is a bit of happiness, maybe there is a bit of surprise, anxiety or whatever element you can think of, right.

But the problem is that, most importantly, whatever you can think of will first be a bit hard to justify, as you rightly saw, and the other question is in what quantities they will have to be mixed so that you come up with the emotion of pride, right. So, it becomes really hard to model the relations between the discrete states.
The other problem with the categorical models is that there is a lot of inconsistency, and when I say inconsistency, what it means is that the basic emotions themselves are not agreed upon among the researchers. For example, while Paul Ekman and colleagues hypothesized the six and later the seven basic emotions, in other work Paul Ekman and his team also hypothesized about twelve and fourteen basic emotions, and then there are different researchers who talk about eight emotions, ten emotions and so on.

So, this is a very big source of criticism: the basic set of emotions is not agreed upon. The first problem is that the researchers do not agree on what exactly the basic set of emotions is, and the other problem is how these basic emotions are related to each other and can be used to model more complex emotions. So, I hope that the categorical model of emotion is a bit clear.
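To connect this back to the computational advantage mentioned earlier, here is a minimal sketch of how the categorical view maps onto an ordinary six-class classification problem. The random features and the choice of classifier are illustrative assumptions only, standing in for whatever modality and model you actually use.

```python
# Sketch: six basic emotion labels treated as a standard multi-class problem.
import numpy as np
from sklearn.linear_model import LogisticRegression

EMOTIONS = ["happiness", "anger", "disgust", "sadness", "fear", "surprise"]

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 20))                 # 300 samples, 20-dim placeholder features
y = rng.integers(0, len(EMOTIONS), size=300)   # ground-truth labels 0..5

clf = LogisticRegression(max_iter=1000)
clf.fit(X, y)

probs = clf.predict_proba(X[:1])[0]            # one probability per basic emotion
print(dict(zip(EMOTIONS, probs.round(3))))
```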
Now, coming to the problem with the categorical model: psychologists have a lot of problems with the categorical model of emotions. What psychologists mostly believe is that using a single word to describe an affective experience is not correct, because the emotions themselves are very complex, and hence they have to be represented in terms of some underlying dimensions that enable those particular emotions.

For the same reason, based on the same hypothesis, the dimensional models were proposed. In this dimensional model, as you can see in the figure on the right, there is a 3D numerical vector which denotes the location of an emotion in the space; you can see that there is a dimension of valence, there is a dimension of arousal, and then there is another dimension which is known as dominance.
The whole idea of this dimensional model, as proposed by Russell, was that a particular emotion cannot be described by a single word, but is described by the underlying dimensions which enable that particular emotion. So, we have three dimensions in this three-dimensional model. The first dimension is the pleasure-displeasure scale: on the right side you have pleasure and on the left-hand side you have displeasure. So, what exactly does this pleasure-displeasure scale measure?
It measures the pleasantness or the positivity of an emotion: rather than saying what exactly the emotion is, you want to say what positivity is associated with the particular emotion. For example, take the emotion of happiness and the emotion of anger. Of course, it is very easy to understand that more positivity is associated with happiness, and less positivity, or maybe no positivity, is associated with anger. Hence, when we talk about the pleasure-displeasure scale, happiness would lie on the right hand side, while anger would lie on the left hand side of this pleasure or valence scale, right. So much for the first dimension of this model.
The second dimension is the arousal scale, which you can see on the y axis. In the arousal scale you have to be careful and pay attention here: the arousal dimension commonly represents the intensity or the energy that is associated with a particular emotion, right. So, the more intensity or energy there is in a particular emotion, the higher the arousal of that particular emotion, right.
For example, let us take two emotions, say anger and happiness itself, right. If we talk about anger, you can have anger with low intensity as well as anger with high intensity, right. So, let us say on the lower, bottom side of this scale you will have low anger, and on the higher side you will have high anger, because the arousal is going to be higher. Similarly, for happiness you can say that lower happiness can be on the lower side of this arousal dimension, and higher happiness can be on the higher side of the arousal dimension, right. So, this is the arousal scale.
There is an often used third dimension, which is known as dominance. This particular dimension that we are seeing here is known as the dominance scale, also known as the dominance-submissiveness scale. What does this dominance scale represent? It represents the nature of the emotion: whether the nature of the emotion is controlling, dominant, or whether it is submissive, the two being contrary to each other. So, again, let us take the example of two emotions; let me just clean this a bit.
For example, take fear, and as another example take anger. Now, what do you think: do you feel more dominant when you are scared, or do you feel more dominant when you are angry? The answer is, of course, that you feel more dominant when you are angry, because when you are angry you have more intention to take an action; you do not experience submissiveness in that particular emotion.

For the same reason, when it comes to the dominance dimension, fear will lie on the left side of the scale, while anger will lie on the right side of the scale. So, the more dominance there is in the emotion, the further the emotion can be placed towards the right side of the scale, right.
So, I hope that these three dimensions are clear. This becomes your P or V, this becomes your A, and this becomes your D; that is how it becomes pleasure-arousal-dominance or valence-arousal-dominance. PAD and VAD are equally commonly used in the literature, so do not get confused between the PAD and the VAD model.

Now, having understood the PAD or VAD model, one particular reason that it has gained a lot of attention is that it can very easily be used with regression methods, and the dimensions can also be discretized into a few classes, right. So, when I say regression methods, what does it mean?
Imagine that rather than saying whether an individual is feeling happy or sad, we may want to say, for example, that the emotion intensity was, let us say, 0.7 and the positivity in the emotion was 0.4, right. You already know that when there are continuous values associated with a particular scale, that problem can easily be modelled as a regression problem: you have a number associated with the emotion rather than a category, right. So, that is one reason it has become very popular.

The other interesting thing is that you can always discretize the dimensions as well. For example, if you are talking about a scale of 0 to 9, you can say that when the positivity is greater than, let us say, 6 or 7, then that is a highly positive emotion; similarly, you can talk about low positive and medium positive, right. So, you can always discretize this continuous scale, and once you have discretized the continuous scales, this also becomes your classification problem.
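As a small illustration of this discretization, using the 0-to-9 scale from the example above (the cut points are illustrative only):

```python
# Sketch: map a continuous valence rating (0..9) to a discrete class.
def valence_class(valence: float) -> str:
    if valence > 6:
        return "highly positive"
    elif valence > 3:
        return "medium positive"
    else:
        return "low positive"

print(valence_class(7.5))   # highly positive
print(valence_class(4.0))   # medium positive
print(valence_class(1.2))   # low positive
```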
So, you can use the same scale, the same representation of the emotions, to deal with regression methods as well as classification methods. That is one big advantage. The other advantage of this particular model is that it really allows you to have interpretability of the relations between the different emotional states. For example, let us take this particular example here.
us take this particular example here.
So, for example, you can always say that there is one particular emotion that is lying here and
then there is one particular emotion that is lying here. So, you can always calculate a distance
between these two emotions. Let us say emotion A and emotion B and that can give you the
76
interpretability that is there between these two emotions and the relations that is there
between these two emotions right.
So, you can always understand that on the scale of the positivity how they are related, on the
scale of the intensity how are they related, on the scale of the dominance how are they
related. So, basically this becomes computationally really really interpretable and really easy
for us to do the modelling of the different emotions ok.
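Here is a minimal sketch of that interpretability argument: each emotion is a point in PAD space, and the relation between two emotions can be read off as a distance. The coordinates below are illustrative guesses, not values taken from any standard normative table.

```python
# Sketch: emotions as (valence, arousal, dominance) vectors and their distances.
import math

PAD = {                      # each dimension illustratively scaled to [-1, 1]
    "happiness": ( 0.8,  0.5,  0.4),
    "anger":     (-0.6,  0.7,  0.6),
    "fear":      (-0.6,  0.7, -0.6),
    "calm":      ( 0.6, -0.6,  0.2),
}

def distance(a: str, b: str) -> float:
    return math.dist(PAD[a], PAD[b])

print(round(distance("anger", "fear"), 2))      # differ mainly on dominance
print(round(distance("happiness", "calm"), 2))  # similar valence, different arousal
```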
So, these are the two advantages of this model. Of course, one disadvantage that can be seen is that since there are three dimensions, where earlier you had only one value to represent a particular emotion, now you need three values to represent one particular emotional state. Maybe for this reason, often the third dimension is omitted, and when the third dimension is omitted it results in the VA space only, which is also known as the circumplex model.
77
(Refer Slide Time: 17:47)
So, if you look at the circumplex model, it is basically a 2-dimensional model where you just have the valence and the arousal scales, and the dominance scale is completely omitted. You can see in this beautiful diagram that when you talk about the two dimensions, the entire plane can be divided into four quadrants.

You have the 1st quadrant, which is represented by high arousal and positive valence; in the 2nd quadrant you have high arousal and negative valence; similarly, you have low arousal and negative valence in the 3rd quadrant; and in the 4th quadrant you have low arousal and positive valence. And then, of course, you can see some of the representative emotions that are there.
78
(Refer Slide Time: 18:52)
So, for this reason, when we omit one particular dimension it results in the 2-dimensional model, which is known as the circumplex model. The circumplex model has only two dimensions, and these two dimensions are the arousal and valence dimensions; we are omitting the dominance dimension here, right. As you can see in this diagram, we have different emotions represented in the four quadrants of these two dimensions.
For example, in the very first quadrant we have the happy, delighted and excited emotions. You can very easily understand that the positivity of these emotions is high, so they cannot be termed negative valence emotions, right; they are positive valence emotions, and for the same reason they lie on the right side of the valence scale. Now, if you look at the arousal: usually, when you talk about happiness, delight or excitement, in all these emotions there is a lot of energy, a lot of intensity, right. For the same reason they are usually termed high arousal, and hence they can occupy this space here, right.
So, this is how we arrive at high arousal and positive valence in the first quadrant, which is represented by the happy, delighted and excited emotions. Similarly, for the 2nd quadrant, let us take a representative emotion such as anger. If you talk about the angry emotion, what would its positivity be? Of course, the positivity would be very low, or in other words this particular emotion will have a negative valence, and hence it lies in the second quadrant, that is, on the left side of the valence scale. Similarly, if you talk about the arousal of anger, the arousal reflects the energy or intensity that is there; usually when you have anger, the energy can be high, but it can be low as well.
Nevertheless, when you put it somewhere here, what you are representing is that there is high arousal associated with anger; when we represent anger here, we are saying that anger has high arousal. A similar thing goes for tense: when tense is represented here, for example, maybe what we are trying to say is that the positivity of the tense emotion is a bit higher than the positivity of the anger emotion, right. Similarly, when you talk about frustration, maybe tense is more positive than frustration, right. But nevertheless, you could put it on the other side as well, and accordingly the values may change.
These are just representative emotions. So, we have already covered the 1st quadrant and the 2nd quadrant. Now, similarly, let us look at the 3rd quadrant and take one representative emotion. For example, if we talk about the depressed emotion, it is of course not a very positive emotion. So, it is definitely going to have a negative valence, right.
So, we are right in saying that it has a negative valence, which is represented by putting it in the 3rd quadrant. Similarly, if you talk about the arousal, the intensity or energy associated with depression can sometimes be high. But usually, what happens when you are depressed is that there is low energy, low intensity associated with it, right.
So, for the same reason, the arousal is low, and hence it is represented in the 3rd quadrant, right. That is how we covered the third. Now, let us look at the 4th quadrant. In the 4th quadrant, let us just take the example of calm. Calm is a very beautiful emotion. Now, in the case of calmness, if you look at the arousal, the intensity is going to be low, right. There is not going to be much energy in calmness.
So, basically, you will be in a very relaxed state of mind, and there is not going to be much energy. But if you look at the valence, the positivity associated with this particular emotion, it is definitely going to lie on the positive side, and for the same reason it can very well occupy a space in the 4th quadrant, right.
So, that is how we have calm, relaxed and content, these kinds of emotions, in the 4th quadrant. And that is how this circumplex model, which is a two-dimensional model, is used. It is a fairly popular and commonly used model where we have omitted the dominance dimension.
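To make the quadrant logic concrete, here is a minimal sketch (my own illustration, not from the lecture) that maps a valence-arousal pair to its quadrant and to some representative emotions; the example emotion lists are assumptions based on the diagram described above.

```python
# Minimal sketch: map a (valence, arousal) pair to a circumplex quadrant.
# Values are assumed to lie in [-1, 1]; the example emotions per quadrant
# roughly follow the diagram discussed in the lecture.

REPRESENTATIVE_EMOTIONS = {
    1: ["happy", "delighted", "excited"],   # +valence, +arousal
    2: ["angry", "tense", "frustrated"],    # -valence, +arousal
    3: ["depressed", "sad", "bored"],       # -valence, -arousal
    4: ["calm", "relaxed", "content"],      # +valence, -arousal
}

def circumplex_quadrant(valence: float, arousal: float) -> int:
    """Return the quadrant (1-4) for a valence-arousal pair."""
    if valence >= 0:
        return 1 if arousal >= 0 else 4
    return 2 if arousal >= 0 else 3

if __name__ == "__main__":
    q = circumplex_quadrant(valence=0.7, arousal=0.6)
    print(q, REPRESENTATIVE_EMOTIONS[q])   # -> 1 ['happy', 'delighted', 'excited']
```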
Now, having seen this circumplex model, the question that may arise is: ok, this is a fairly good model, and for the same reason it is commonly used. But is there any advantage of using the D dimension, that is, the Dominance dimension, in the PAD or the VAD model? To answer that, let us look at two different emotions.
Let us see how fear is characterized, for example. Fear can definitely be characterized by a negative valence. And the arousal, depending upon the amount of energy associated with the fear, can be low or high, right. So, basically, this is how fear can be easily characterized. Now, let us look at the anger emotion.
If you look at the characterization of the anger emotion, the valence of anger is of course also negative, right; it cannot be positive. And the energy or the intensity associated with anger can also be low or high. You can be very angry or you can be less angry, and for the same reason the arousal can be low or high.
So, what happens now? If you look at the characterization of fear and anger on this two-dimensional scale, there is not much difference, and that is where it can be confusing. We are not able to differentiate emotions that overlap in the two-dimensional space. So, for the same reason, let us now see what happens if we include the D, that is, the dominance dimension. Please remember here that the D stands for dominance, right.
Now, in the case of fear, what will happen? Of course, our valence will remain negative, and our arousal can be low or high. But when we look at the dominance of the fear emotion, the individual who is experiencing fear is usually submissive, right. So, on the dominance scale, fear can be described as submissive.
Now, if you look at anger: of course, the valence is negative and the arousal is low or high. But if you look at the dominance, the individual who is experiencing anger usually is more dominant, wants to take control of the situation, is more action oriented. And for the same reason, on the dominance scale it can be said that anger is controlling, right.
So, anger sits on the high side of the dominance scale. Now, you can see that earlier the fear and the anger emotions were not getting discriminated on the basis of the two-dimensional model. But the moment we include the third dimension, we are able to classify or characterize them in a bit better way, right.
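As a toy illustration (my own sketch, not from the lecture), the snippet below encodes fear and anger as rough VAD triples; the numeric values are assumptions chosen only to show that the two emotions collapse onto almost the same point in valence-arousal space but separate once dominance is included.

```python
import math

# Rough, assumed VAD coordinates in [-1, 1]; only the signs matter here.
# Fear and anger share negative valence and high arousal, but differ in dominance
# (fear -> submissive, anger -> controlling), as discussed in the lecture.
FEAR  = {"valence": -0.6, "arousal": 0.7, "dominance": -0.7}
ANGER = {"valence": -0.6, "arousal": 0.7, "dominance": 0.7}

def distance(a, b, dims):
    """Euclidean distance between two emotions over the chosen dimensions."""
    return math.sqrt(sum((a[d] - b[d]) ** 2 for d in dims))

print(distance(FEAR, ANGER, ["valence", "arousal"]))              # 0.0  -> indistinguishable in VA space
print(distance(FEAR, ANGER, ["valence", "arousal", "dominance"])) # ~1.4 -> separable once D is added
```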
So, that is the advantage of including the D in the PAD model. But for our common purposes, most of the time we simply omit the D dimension and operate in the two dimensions of valence and arousal to represent the emotions, right. And when we say the emotions, we mean both types: the perceived emotions as well as the induced emotions, right. Perfect.
So, now we have understood how the emotions are represented and what the categorical and the dimensional models of emotion are, right. With that, let us try to look at some of the problems associated with the way we represent and understand emotions in traditional affective computing research.
So, the first thing that we have to understand is that most of the time, when the affective computing community talks about human emotions, they talk about emotions that are sort of emergency emotions, stimulated by some intense or, more importantly, instantaneous stimuli, right. But the problem is that that is not exactly how human emotions are stimulated.
Of course, you may see a lion or a bear, for example, or you may see a very beautiful interface and feel happy or good about it. That is all fine, but many times this emotion process is stimulated by an accumulation of continued weak stimuli over a period of time. What does it mean? It means, for example, that if you are feeling tense, you are not just feeling tense because you saw something at this moment.
It may have happened that, over a period of time, you have been encountering several tense situations. And because of that, your general baseline state has become more tense than normal. And whenever you encounter a particular stimulus that acts as a trigger, you feel very tense about it.
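A toy way to picture this (my own illustration, not something covered in the lecture) is a leaky accumulator: each weak stimulus nudges a baseline tension level upward, the baseline slowly decays between stimuli, and a trigger only produces a strong response once the accumulated baseline is already elevated.

```python
# Toy model of emotion accumulation: weak stimuli slowly raise a baseline that
# decays over time; the same trigger feels much stronger once the baseline is high.
# All constants are illustrative assumptions, not values from the lecture.

def update_baseline(baseline: float, stimulus: float, decay: float = 0.95) -> float:
    """One time step: decay the old baseline, then add the incoming stimulus."""
    return baseline * decay + stimulus

baseline = 0.0
weak_stimuli = [0.1] * 30          # many weak, continued stressors
for s in weak_stimuli:
    baseline = update_baseline(baseline, s)

trigger = 0.3                      # an otherwise mild stimulus
response = baseline + trigger
print(f"accumulated baseline = {baseline:.2f}, response to trigger = {response:.2f}")
# With a fresh baseline of 0.0 the same trigger would only give a response of 0.3.
```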
So, that is one particular problem, that in the affective computing community and in the
research that we have seen in the affective computing so far, most of the time when we talk
about the human emotions, we are talking about the emotions which are emergency emotions
and which are stimulated by the instantaneous stimuli. But, we fail to take into account the
effect of weak stimuli over a longer period of time or how the emotions are accumulated over
a longer period of time.
Of course, there are several reasons why we have not been able to do so. Can you think of one simple reason? For example, one simple reason could be the resources. Definitely, in order to understand an individual's emotion over a longer period of time, you need to be able to monitor the emotional state over that longer period, right. And this is where it all becomes very, very resource intensive.
So, it is rather easy to make you listen to a piece of music, for example, and see what your response to that music is. But if you want to understand how you have been responding to the different pieces of music you have been listening to over a period of 1 week, then you will definitely need lots of resources, lots of different ways in which you can monitor this response. And you need to be able to see in what way, at what points, at what instances the individual is listening to the music, and so on and so forth.
So, that is one particular problem: it becomes very, very resource intensive. That is what we have to remember, and hence we have not been able to do this monitoring over a longer period of time. Now, emergency emotions, as we have already talked about, are quick and they are low precision responses.
What does this mean? First, you experience a particular emotion and then it is gone, maybe in seconds, maybe in a few minutes, right. Definitely, most of the instantaneous emotions, the emergency emotions that you feel in response to a particular stimulus, do not last longer than a few minutes; definitely not hours, and days and weeks definitely not.
So, since they are quick, in that sense their precision is also very low, right. They may be confusing at times, or there may be a mix of different emotions that an individual is experiencing. So, for the precision oriented affective computing that we have been using, the computational complexity becomes too high to handle these kinds of emergency emotions.
What it means is that if you want to understand accurately what emotion an individual felt in response to a particular stimulus, then we will have to be very, very precise. We need to be able to see at what particular time, for example, the stimulus was presented, at what particular time the music was presented or the listener started listening to it.
And within how many seconds, for example, the emotion started getting manifested in the user's response, and more importantly, what exactly that particular emotion was. So, you need to create a system that is computationally very powerful, and hence the computational complexity of the system has to be very high as well in order to handle these kinds of emergency emotions, right. So, this is the other problem: the computational complexity demand is very, very high.
Now, the third, and definitely one of the most important problems as well, is that human emotions are very, very personalized, right. What this means is that they are not only influenced by the stimuli that the individual is experiencing, but each individual also has their own set of beliefs, desires, intentions, background, context, and so on and so forth.
And, for the same reason the emotions of different individuals excited by the same stimulus
can be different. Or in other words, as we talked about before as well if there is any happy
music, one may feel or experience the happiness by listening to the happy music.
But, the other individual may or may not exactly feel the happiness associated with that
particular music or even if there are two individuals who are experiencing the happiness by
listening to a particular type of music, the amount with which they are feeling or experiencing
the happiness may not be the same.
And on top of that, there could be a mix of different emotions playing a role while listening to the music, when we look across different individuals, right. And so, this is where we say that there is always a lot of individual variability that we need to take into account, right.
And individual variability not only in terms of the baseline response, but also in terms of the context that one particular individual may or may not have. Right, perfect. So, we have now briefly talked about some of the problems that are there in the traditional affective computing approach. And all these problems are not easy to tackle; they are not easy to address.
But in order to tackle them, in order to understand them, we definitely need a better understanding of the emotions and a better understanding of the individual variability. And of course, we need not only computationally more powerful hardware, but also more powerful methods, software and algorithms to help us achieve this in an efficient manner, right.
Now, having talked about the limitations, let us move our focus to something a bit different, where we want to understand how the different emotions are expressed, right. We already talked about how different individuals can experience different types of emotions when presented with the same type of stimulus. Here we want to talk about something that is related, but at the same time a bit different.
So, there are different basic emotions, and Ekman showed that different basic emotions are characterized by different specific sets of facial expressions. But nevertheless, a single set of facial actions, a single set of behaviours of the facial muscles, can also become different emotional expressions in different contexts, right.
So, for example, without getting confused too much, look at this beautiful diagram that Paul Ekman's group has created. We can see the facial expressions of individuals in the case of enjoyment, sadness, fear, anger, disgust and contempt.
In general, what you can see from the diagram is that, depending upon the type of emotion, the response is definitely different, right. The facial expressions that an individual is making are quite different. But at the same time, there are certain emotions which have a lot of things in common as well.
For example, if you talk about the anger and the fear emotions, then while experiencing either of them, in both cases we have very wide open eyes, right.
In the same way, if we look at disgust and sadness, maybe the eyes are quite closed, right. Of course, the amount by which they are closed depends on many different things, including the individual's behaviour. But nevertheless, in general, in anger and fear we have open eyes, and in disgust and sadness we have closed eyes.
So, this is what Ekman first hypothesized and showed: that different emotions are characterized by different facial expressions. But then came Barrett, and what Barrett said is that, at the same time, different emotional expressions in different contexts can also have a lot of similarity between them.
And there could be a single set of facial actions that can represent both emotions in different contexts, right. So, for example, in this case the set of facial actions that we are talking about is the opening of the eyes or the closing of the eyes, right.
So, here we need to understand a bit more about what we mean by the specificity of the emotions. Let us start with the fear emotion first. We have already looked at this diagram of the fear emotion. You can see in the diagram that this fear emotion is characterized by quite raised eyebrows which are drawn together, and it has wide open eyes. You can clearly see that the eyes are very widely open, and it has tense lowered eyelids.
So, basically, the eyelids are quite lowered and tense, and if you look at the lips, the lips are stretched. In general, Paul Ekman made a lot of effort in characterizing the different facial expressions in terms of the different facial behaviours that an individual shows. So, this is the behaviour of fear, and we saw that it has a lot of similarity with anger as well.
Now, if you look only at the facial expressions, there are certain ways in which this emotion is being expressed. But the same emotion of fear can also be expressed in other physiological responses as well. So, for example, it has been found that there is an activation within the frontoparietal brain regions, right.
So, what do we mean by frontoparietal? Basically, this is the frontal region and this is the parietal region, and an activation has been found in the frontoparietal brain regions. At the same time, a broad pattern of sympathetic activation has also been found. What do we mean by sympathetic activation? It relates to the cardiovascular activities, right. So, a lot of cardiovascular activity has been found in the case of the fear emotion, which includes, for example, reduced heart rate variability.
Similarly, in the case of the fear emotion, more numerous or increased skin conductance responses have also been found. And we have also seen larger EMG activity in the case of the fear emotion in comparison to the anger emotion. So, with that let us take a break and we will continue in the next session.
Thanks.
Affective Computing
Prof. Jainendra Shukla
Prof. Abhinav Dhall
Department of Computer Science and Engineering
Indraprastha Institute of Information Technology, Delhi
Week - 02
Lecture - 03
Brain and Asymmetry
So, having understood the specificity of the emotions, let us now try to see how different emotions are experienced by a user across different modalities, right.
So, let us take the example of the fear emotion. In the case of the fear emotion, as you can see in the image itself, it is characterized by raised eyebrows that are drawn together, widely open eyes, tense lowered eyelids, and stretched lips.
So, for example, Paul Ekman and his team worked a lot on identifying the different sets of facial action units which characterize the different emotions. But these different emotions are not only experienced or expressed in one particular modality, that is, the facial expression modality; they are also experienced and expressed in other modalities, and here we will talk a bit about how different emotions are experienced or discriminated among the different modalities as well.
So, for example, if we talk about fear: fear is also associated with activation in the frontoparietal brain region. What do we mean by the frontoparietal brain region? Basically, this is your frontal region and this is your parietal region, and where they meet is the place known as the frontoparietal brain region.
So, fear is associated with higher activity in the frontoparietal brain region, and it is also reflected as a higher sympathetic activation. Sympathetic activation basically consists of higher cardiovascular activity, which may also include changes in the heart rate variability, for example a reduced heart rate variability.
Now, what heart rate variability means is something we will discuss in detail when we talk about emotions and physiological signals, but basically heart rate variability is the variation that we have among the intervals between successive heartbeats, right. So, in the case of the fear emotion, it is also expressed as a reduced heart rate variability.
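As a quick preview (my own sketch, not part of the lecture), two commonly used time-domain HRV metrics are SDNN, the standard deviation of the intervals between successive beats, and RMSSD, the root mean square of successive differences; the RR intervals below are made-up example values in milliseconds.

```python
import math

# Made-up RR intervals (time between successive heartbeats) in milliseconds.
rr_intervals = [812, 798, 830, 845, 790, 805, 820, 815]

def sdnn(rr):
    """Standard deviation of the RR (NN) intervals."""
    mean = sum(rr) / len(rr)
    return math.sqrt(sum((x - mean) ** 2 for x in rr) / (len(rr) - 1))

def rmssd(rr):
    """Root mean square of successive RR differences."""
    diffs = [rr[i + 1] - rr[i] for i in range(len(rr) - 1)]
    return math.sqrt(sum(d ** 2 for d in diffs) / len(diffs))

print(f"SDNN  = {sdnn(rr_intervals):.1f} ms")
print(f"RMSSD = {rmssd(rr_intervals):.1f} ms")
# Lower values on such metrics correspond to reduced heart rate variability.
```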
Similarly, when you talk about fear, we can also look at the skin conductance response, and what has been found is that in the case of fear a higher skin conductance response is observed among the users. And similarly, in the facial muscles, fear is also indicated by larger EMG, that is, electromyographic corrugator activity, for example somewhere here, than in anger.
So, these are the different ways in which the same emotion, that is, the fear emotion, can be expressed: in the facial expressions, in the brain activity, in the cardiovascular activity, in the skin conductance responses, and also in the EMG activity, right. So, you can see that since different emotions are different, they are also experienced or expressed in different modalities in different ways.
Similarly, having understood the fear emotion, let us try to look at the anger emotion. In the case of the anger emotion, just like in the fear emotion, we also have widely open eyes, right, and in this case also the eyebrows are drawn together, the eyelids are lowered, and then you have a pressed-lips kind of situation.
So, this is how the facial expressions characterize the anger emotion. Of course, this is not to say that these are the only things that are there; a bit of individual variability can be there, right. Similarly, if you look at the brain activity corresponding to the anger emotion, what we see is more activity in the left prefrontal cortex.
So, basically, this is where you have the prefrontal cortex. In the prefrontal cortex you have more activity in the case of the anger emotion, but here the literature is a bit contradictory as well. So, while Harmon-Jones and other researchers have been saying that the left prefrontal cortex is activated, the literature is somewhat contradictory, and some say that maybe the right prefrontal cortex is more activated. Right.
Interestingly, when we analyze the heart rate variability, the measure we saw in the case of the fear emotion, for anger we observe no change in the heart rate variability, while, if you recall, in the case of the fear emotion we observed a reduced heart rate variability. So, it clearly shows that among the different modalities, different types of discrimination are possible.
So, for example, if you are only looking at whether the eyes are widely open or not, you may not be able to differentiate between the anger and the fear emotions. But if you are looking at other modalities, such as, in this case, the heart rate variability, then in the case of the anger emotion we are not going to observe any change in the heart rate variability, whereas in the case of the fear emotion we will observe a reduced heart rate variability, right.
So, that is how different modalities help you to characterize a particular emotion, right. Similarly, in the case of anger, what happens? There can be different types of anger as well, right. It can be an anger where someone is angry in front of you and you are responding with anger, or it can be a response very similar to the one that you get in fear.
For example, we just saw that in the case of the facial expressions. But nevertheless, we have already seen that the psychophysiological responses will be different in the two cases, and that is what is going to help you discriminate between these two different emotions. Perfect.
So, with that, let us now look at a similar type of analysis for disgust and the other emotions. This is a very simple picture of the disgust emotion. If you look at the disgust emotion, what do we have? It is probably characterized by a wrinkled nose bridge here, raised cheeks here, and then we have a raised upper lip.
So, the upper lip is a bit raised. Of course, again, for all these behaviours that have been characterized by Paul Ekman and his team, there is a good amount of individual variability, and from one individual to another this particular behaviour may change a bit here and there.
But largely, this is how the disgust emotion could be perceived in the facial expressions, right. Now, if you look at the skin conductance responses, it depends on what type of stimulus is causing the disgust emotion, which is very interesting. So, for example, if you have a core-disgust inducing stimulus, you may have a different type of skin conductance response in comparison to, for example, a body-boundary-violating stimulus, where you may be looking at some mutilation scene. So, in the case of a core-disgust inducing stimulus, such as pictures of dirty toilets,
you have either an unchanged or a decreased kind of skin conductance response. On the other hand, if you look at the mutilation-scene kind of stimulus, then you will have an increased skin conductance response. So, this is also a very interesting and important thing to note: depending upon the stimulus, your response may also change even if it belongs to the same emotional category, right.
So, having looked at disgust, let us now look at the sadness emotion. In the case of the sadness emotion, as you can see here, it is broadly characterized by lowered lip corners and raised inner eyebrows, right. I hope that you can see that broadly in the image here. And if you look at the cardiovascular activity, sadness is associated with increased blood flow, right.
At the same time, there are two different types of sadness for which we would like to differentiate the physiological responses: one is crying-related sadness. In crying-related sadness, where you are literally crying when you are sad, an increased heart rate is observed, while we do not observe any change in the heart rate variability. Please pay attention that heart rate and heart rate variability are two different things, right.
So, there is an increased heart rate, but maybe no change in the heart rate variability, and then you also have increased skin conductance. Now, in the case of non-crying sadness, when you are sad but you are not crying, please pay attention: where earlier you were observing an increased heart rate, now you observe a reduction in the heart rate. Where earlier you were not observing any change in the heart rate variability, now you also observe a reduced heart rate variability.
Similarly, where earlier you were observing an increased skin conductance, now you observe a reduced skin conductance, and similarly your respiration also increases, right. So, there is a difference within sadness too, depending on whether the individual is crying or not crying.
Now, why are these different types of changes important? They are important because, if you want to understand a particular emotional response, you will have to understand these different types of behaviours in order to know, number 1, what exactly the modality to look for is, and number 2, what exactly the type of change is that you can expect in that particular modality, right.
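One compact way to keep track of this (my own summary sketch, not from the lecture slides) is a small lookup table of emotion to the expected direction of change per modality; the entries below simply mirror the simplified examples discussed above and are not an exhaustive reference.

```python
# Simplified lookup of "which modality to look at, and what change to expect",
# mirroring the examples discussed in the lecture.
EXPECTED_CHANGES = {
    "fear": {
        "heart_rate_variability": "reduced",
        "skin_conductance": "increased",
        "corrugator_emg": "larger than in anger",
    },
    "anger": {
        "heart_rate_variability": "no change",
        "brain_activity": "left prefrontal (literature partly contradictory)",
    },
    "sadness_crying": {
        "heart_rate": "increased",
        "heart_rate_variability": "no change",
        "skin_conductance": "increased",
    },
    "sadness_non_crying": {
        "heart_rate": "decreased",
        "heart_rate_variability": "reduced",
        "skin_conductance": "reduced",
    },
}

def expected_change(emotion: str, modality: str) -> str:
    """Return the expected direction of change, or 'unknown' if not listed."""
    return EXPECTED_CHANGES.get(emotion, {}).get(modality, "unknown")

print(expected_change("fear", "heart_rate_variability"))   # reduced
print(expected_change("anger", "heart_rate_variability"))  # no change
```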
(Refer Slide Time: 11:08)
So, that is why it is important for us to understand how different emotions are characterized or expressed in different modalities. Let us take the last example, which is happiness. In the case of happiness, as hypothesized by Paul Ekman and his team, you again have raised cheeks, raised lip corners and tense lower eyelids, right.
So, this is the broad characterization in terms of the facial expressions. Another thing to look at is that, in a very interesting study, Abel and Kruger found that the intensity of the smiling of an individual in photographs has also been found to predict longevity, right.
For example, what they showed in this interesting study is that for individuals with no smile, the longevity was found to be around 72.9 years. Similarly, individuals who showed partial smiles in the photographs had a longevity of around 75 years, and those who had Duchenne smiles, like very broad smiles, had a longevity of around 80 years, right.
So, this was another very interesting study that Abel and Kruger performed. Now, happiness, of course, like the other emotions, can also be associated with activation in other physiological signals. For example, it is associated with more brain activity in the medial prefrontal and temporoparietal cortices, right. And at the same time, there is a lot of literature which also highlights the role of the left prefrontal cortex in positive affect.
So, what it says is that in the case of positive emotions such as happiness, there is increased activity in the left prefrontal cortex, that is, on this side of the brain, right. And if you look at the other physiological signals, then happiness is associated with a decreased heart rate variability, while amusement and joy are associated with an increased heart rate variability, right.
So, basically, the differentiation between happiness on the one hand and amusement and joy on the other is that there are varying degrees of positivity as well as intensity there, and in this experiment by Kreibig, what they showed is that while there is a decreased heart rate variability in happiness, there is an increased heart rate variability in the case of amusement and joy among the users, right. Perfect.
So, we just talked about one very interesting point here: in the case of positive affect, more activity is found in the left prefrontal cortex.
(Refer Slide Time: 14:11)
And building upon that, we will be talking about one very interesting concept in the psychology of emotion, which is emotion and brain asymmetry. So, basically, what has been found by researchers is that higher activation in the left region of the brain, relative to the right region of the brain, is found when the individual is experiencing a positive feeling or experiencing higher engagement, and this has been reported by Coan and several other researchers.
So, what does it mean? If you are experiencing a positive feeling, if you are experiencing approach motivation, a more dominant motivation, then you will have more brain activity in the left hemisphere. Versus, when you are experiencing some negative feeling, let us say sadness or fear and those kinds of things, or you are feeling less dominant, when you are having a withdrawal motivation, then more increased right prefrontal activity can be observed in the brain.
So, that is the relation between the emotions and the brain asymmetry. But there is one exception to this, which is the case of anger. As we have seen before, even though anger is broadly characterized as a negative feeling, more activity is found in the left frontal cortex of the brain.
So, this is the single exception; otherwise the general picture is quite clear: if you are experiencing a positive emotion, you will have more brain activity in the left hemisphere, and if you are experiencing a negative emotion, you will have more activity in the right hemisphere. And this simple test can be done to understand which side of the brain is activated, and accordingly what type of emotion an individual may be feeling. Right.
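One common way this idea is operationalized in EEG work (my own sketch; the specific formula is not given in the lecture) is a frontal asymmetry score computed from alpha band power at a left and a right frontal electrode, for example F3 and F4. Since alpha power is inversely related to cortical activation, a score of ln(right alpha) minus ln(left alpha) becomes more positive when the left hemisphere is relatively more active, which this view associates with positive or approach-related states.

```python
import math

def frontal_asymmetry(left_alpha_power: float, right_alpha_power: float) -> float:
    """ln(right) - ln(left) alpha power; higher alpha ~ lower activation,
    so a positive score means relatively greater LEFT activation."""
    return math.log(right_alpha_power) - math.log(left_alpha_power)

# Illustrative, made-up alpha band powers (arbitrary units) at F3 (left) and F4 (right).
score = frontal_asymmetry(left_alpha_power=4.2, right_alpha_power=6.1)

if score > 0:
    print(f"score={score:.2f}: relatively greater left activation -> approach/positive leaning")
else:
    print(f"score={score:.2f}: relatively greater right activation -> withdrawal/negative leaning")
```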
So, with that, today we finish with emotion and brain asymmetry, and in the next session we will be talking about emotional design, right. So, let us look forward to the next session.
Affective Computing
Prof. Jainendra Shukla
Prof. Abhinav Dhall
Department of Computer Science and Engineering
Indraprastha Institute of Information Technology, Delhi
Week - 02
Lecture - 04
Emotional Design
Hi friends. So, welcome back. Today we are going to talk about Emotional Design, that is, we would like to understand how we can put emotions into the design of the services or the products that we are creating for the end users, ok.
So, the very first thing that we have to understand is a term known as emotional design, which has become very popular among designers and UX researchers. So, emotional design, what does it mean? It simply refers to the creation of designs that evoke emotions which result in positive user experiences.
Now, why would we like to have a design which evokes a positive user experience? The idea is very simple, right. Unless and until your user, be it for a product or a service, experiences a positive emotion, the user will never come back for the services or the products that you or your organization is providing, right. So, this is where emotional design comes into the picture, where we want to insert emotion into the design of a product or a service. Quite simple.
Now, as of now, what happens? The common practice among designers is that, while creating a product or a service, they tend to focus more on the user's needs. That is, they mostly tend to focus on the usability and the functionalities of the product, which is the basic thing that any user wants.
But if you remember the UX pyramid, then at the very bottom level of the UX pyramid you have the functional element. Let me just quickly try to draw the UX pyramid for you, in case you do not recall; it is very simple.
So, if you look at the UX pyramid, at the bottom level you have what we call the functional element. Then, at the next level we have the reliable element, then at the next level we have usability, above that we have convenience, right. And then at the very top, you have something known as enjoyability, the enjoyable component.
Now, what happens? When designers design a product or a service, they primarily tend to focus on these 3 bottom aspects of the product or the service, which mostly refer to the functionality and usability of the particular product, which is good to have.
But nevertheless, if you go to the top of this UX pyramid, the user experience pyramid, you can see that there is something which refers to the enjoyability or the convenience that the user is getting by using the product or the particular services that you or your organization is offering.
(Refer Slide Time: 03:35)
And for the same reason, let me just quickly erase this. What we would like to do is put proper focus on the upper aspects of the pyramid as well, which are the enjoyability and the convenience. So, basically, what do they do?
They put the focus on the responses of the user when the user is using a product or after the
user has already used the product. And of course, when we talk about such experiences of the
user when the user is experiencing the product, these are naturally emotional. And hence, the
term emotional design comes into the picture, right. So, what happens when the user is using
the product?
So, basically, many times the user is not even aware of it, but there are lots of sophisticated thought processes going on at the cognitive level through which the user is trying to process the functionality, the experience, the usability and the enjoyability of the product. So, we really need to pay attention to those particular cognitive experiences and cognitive processes as well.
And while doing so, what is our aim? Our aim is very simple. We want to elicit strong emotions in the users, so that these emotions can create a type of bond with the product or with the services that your organization is offering. And in turn, what happens?
It creates a sense of brand value, it creates a sense of loyalty, and hence it enables your customers to take a particular action, which may be as simple as coming again and again to you for the use of your product or your services, right.
And all these aspects are known as emotional design. And please mind one particular thing: when we talk about the emotional design of a product, inducing a positive user experience does not always mean inducing a positive emotion.
Many times, a positive user experience comes through the inducement of a positive emotion, but many times it can come through the inducement of a negative emotion as well. A very simple example: you want to create a movie, and it is a horror movie. Now, when you make your user watch a horror movie, of course, the type of emotions that are going to be elicited among the users are going to be scary, are going to be horror, right.
I mean, the user cannot feel joyful or happy about it, because the intended use is something quite different: you wanted to create a horror movie. But if you have been able to successfully elicit that particular emotion among your users, it means your product, that is, your movie or your service, has been successful, right.
So, while creating an emotional design, you have to keep this aspect in mind: you want to induce a positive user experience, which could be through a positive emotion or, many times, through a negative emotion as well. And the end goal, of course, is that it should be enjoyable. The user should be able to enjoy the product or the services that you are offering, right.
Now, how to do that? We will talk about the 3 levels of design which were proposed by Don Norman, a very famous researcher. What he proposed is that designers can aim to reach the users on 3 different cognitive levels. One cognitive level is known as the visceral cognitive level.
We will be talking more about it. The second level is known as the behavioral cognitive level. And the third level is known as the reflective cognitive level. We will talk more about what exactly they are and how they are different. But the whole idea is that we want to reach the users on these 3 different cognitive levels, so that users can develop positive associations with the product.
And, as I said, sometimes that includes negative emotions, so that the users can feel associated with the product or the services, which allows them to use it again, to enjoy it again, and hence helps ensure the success of the product or the services that you are offering. Perfect.
Now, we want to understand the 3 levels of design as proposed by Don Norman. You can see this very beautiful diagram taken from the Emotional Design book, a very famous book by Don Norman himself. He talks about the cognitive levels or cognitive processes of the users happening at 3 different levels, right. So, let us take them one by one.
The very first level is the visceral level. Now, what is the visceral level? The visceral level is all about the initial impression. It is all about the attractiveness that the users feel towards the product or the services. It is all about pre-consciousness. And it is all about the initial feelings that the users have while encountering a particular product or a particular service.
A very simple example: think of when you bought your first iPhone, maybe, right. The iPhone could be a very good example of a very good product. So, think about the instance when you bought your first iPhone, and think about the experience that you had when you looked at the iPhone for the very first time, right.
So, maybe you are going through it, looking at it, and saying, ok, my goodness, it is very slim; well, maybe not yet that the touch screen is very responsive, but maybe you are just looking at the appearance. So, please pay attention that the visceral cognitive level takes care of the initial impression only. It has nothing to do with the usability.
So, maybe you are not yet exploring the touch screen, but maybe you are just looking at the design and saying, ok, this is very slim, it is very lightweight, it looks so good, very symmetric. And the box itself in which it arrives, usually a very nice white box, with very nice packing and everything.
So far, you start feeling good, you start feeling positive emotions about the product itself, even without having used the product or its services, which in this case is the iPhone, right. So, what you are getting is the initial impression of the product. And the way you get the initial impression is simply through the very first reactions you have in response to looking at the product or using the services.
So, all of this happens at a very immediate, pre-conscious level, right. And it is mostly very subjective in this case, because one person's experience can be very different from someone else's experience. Someone may like a very symmetric design pattern, but someone else may not like that very symmetric design pattern.
So, this is all very very subjective. So, this is again another aspect that you may want to keep
in mind, that this particular initial impression that the user is having while looking at the
product or the service at the very first time is the very very subjective experience.
Nevertheless, it is a very important type of experience that you may want to provide to the
user.
Regarding the initial impression, there is a saying that the first impression is the last impression, in many cases. So, you want to create a very good first impression for your product or your services, through whatever sensory modalities the users are perceiving it with.
So, maybe there is something that the users are looking at; then you want to create a very nice design, so that it is appealing to the user. Maybe there is something that the users are touching; then you want to create a very nice tactile perception, so that the users like it and the initial impression itself is very good.
Maybe you are creating a food product, and the sensory perception that the users are going to use is smell; then you want to create a very good smell, right. So, these are the things that create the initial impression about the product or the services.
Now, please pay attention to this particular thing: the experiences that the users are having at the visceral level are completely free from any bias or any prejudice, right. There are not many factors adding to them, and they happen in a fairly controlled fashion; that is one thing.
The other thing is that all these initial impressions, the processes happening at the visceral level while observing a particular product, are very reactive and very momentary. They are very quick responses. You are not waiting for long: you saw an iPhone and immediately you felt something, that is it, right. Within a few seconds, or even a few milliseconds, you felt something and that response is already there, right.
So, you can see that response. This is what is happening mostly at the visceral level. Now, let us go to the higher level that Don Norman proposes, which is the behavioral level. The behavioral level mostly has to do with the usability of the particular product. It takes care of the usability, the performance, or the effectiveness of the product or the services that you are offering, right.
So, as the name itself suggests, it is a behavioral cognitive process. What we mean by behavioral design is that once the user takes a particular product, let us again take the example of the iPhone, the user first opens the box, forms the very first impression of the iPhone, likes the iPhone, but the next step of course is that the user will have to use the product, right; it is all about the usability in the end.
So, the user would like to use the product. Now, the user starts using the product. Maybe the user liked the touch screen response, for example, as simple as that. Maybe the user liked typing on the iPhone screen. Maybe the user liked the tick-tick-tick sound which was coming when the user was typing on the iPhone, right. So, what are all these?
They constitute the experiences at the behavioral level, when the user has already started using the product or experiencing the services that you or your organization is offering, right. And most of the time, what happens is that once the user has learned how to use the product, the next time the user wants to use the product, it will mostly be motivated by the fact that the user wants to achieve certain objectives, certain goals.
It may not only be motivated by the initial impression that the user had, right. So, imagine you have the iPhone: when you saw the iPhone for the very first time, you were impressed by the design, the slimness, the weight, and so on and so forth, right. But once you have learnt how to operate the phone, you may want to come back to using the iPhone because you want to achieve certain objectives: you want to make a phone call, you want to use the internet, you want to use social media, and so on and so forth, right.
So, basically, the functionalities are the ones which are driving you, which are motivating you to come back to the usage of the product. It is not the initial impression anymore. Of course, the initial impression created a bond for you, it created a positive emotion for you, but now that is gone, right.
Next time, it is the usability that is going to attract the user to come back to the usage of the product or the services. So, this is what is being captured at the behavioral level. Now, one thing that you may want to pay attention to here is that most of the things happening at the behavioral level, once the user has started using the products and the services, may start happening very subconsciously.
So, if there is a good product, if there is a good service, then at the behavioral level what usually happens is that the user will not have to put in a lot of conscious effort, and things will be happening most of the time subconsciously and in a very pleasant manner. For example, you know how to drive a car; in the beginning, when you want to drive a car, you have to put a lot of effort into driving the car itself, you have to pay a lot of attention to this and that, and so on and so forth.
But once you have learned how to drive a car, you do not have to put a lot of focus on driving the car, right. I mean, you simply get inside the car and your hands are already there on the steering wheel, your feet are already there on the clutch and the accelerator, and everything is already taken care of.
So, while you are the one who is doing these actions, your motor system is doing the actions, but most of these things are happening subconsciously, right. You may not even be aware at the conscious level that, ok, maybe I am putting in a gear, or I am pressing the clutch, I am pressing the brake.
Most of the things are happening at the subconscious level because you are already very fluent in the usage of the product and you have become very comfortable with the usage of the product or the services.
In this case, the usage of the iPhone, the usage of the car. And unless and until this starts happening, you would not like to do it again and again. So, for example, if there is a car and you are not able to learn how to drive it, or it has become very cumbersome because there are multiple gears, the gear lever is somewhere you do not expect, the steering operates in a certain odd way,
and it is not very usable, it is not very functional, and you have to consciously put a lot of effort into using that product, using that car, then maybe you would not like to do that again and again. You would like to avoid that particular experience, right. And hence, most of the things that happen here at the behavioral level happen subconsciously.
The other thing that you want to pay attention to is that most of the experiences happening at the behavioral level can be monitored and measured in an objective way. This is in contrast to the visceral design or the visceral cognitive process, where most of the things were happening in a very subjective manner.
So, there you did not have many ways to evaluate those experiences, right. It was very subjective: I like this design, I did not like this design, I like the symmetric design, I did not like the symmetric design, as simple as that. But when you are talking about the behavioral design or the experiences at the behavioral cognitive level, you are looking at experiences that can be objectively measured, as simple as that.
For example, you are trying to complete a particular task on the iPhone; maybe on your phone you want to make a call. Now, you can simply measure the amount of time that it took for you to open the contact list, select the name of the person that you want to call, and make the call, right. It is as simple as that. These are very objective.
Similarly, for example, you want to take a photo with the camera of the iPhone: the amount of time that you took to locate the camera app, open the camera app, focus on the subject and take the picture. So, there are so many behavioral metrics, so many behavioral indicators that you can use to objectively evaluate the experience of the user at this particular level, right. That is the kind of cognitive process that is happening at the behavioral level.
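As a minimal sketch of such an objective behavioral metric (my own illustration; the timed function here is a made-up stand-in for whatever task the user actually performs), one can simply timestamp the start and end of a task and log the elapsed time:

```python
import time

def time_task(task_name, task_fn, *args, **kwargs):
    """Run a user task and return (result, elapsed seconds) as an objective usability metric."""
    start = time.perf_counter()
    result = task_fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    print(f"{task_name}: completed in {elapsed:.2f} s")
    return result, elapsed

def simulated_make_call(contact: str) -> str:
    """Stand-in for the real task: open contacts, find the person, place the call."""
    time.sleep(0.5)                 # pretend the user spends some time on the task
    return f"calling {contact}"

result, seconds = time_task("make_call", simulated_make_call, "Alice")
```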
At both the visceral level and the behavioral level, there are lots of sensory inputs involved. But at the behavioral level, as I said, since you are taking some actions, it is all about the usability, so your motor actions are very much involved here, right. It is motivated by some sensory inputs and then it is reflected in some motor actions, such as in the usability here.
Now, the third level is known as the reflective level. The reflective level of design mostly refers to the rationalization or the intellectual part of the experience, right. So, what is this? At the visceral level, as we already understood, we are talking about the initial impression of the product. At the behavioral level, we are talking about the usability of the product or the services.
Now, what the reflective level refers to is this: ok, we have made a very good first impression, the usability is also good, the user keeps coming back for the usability. But at this third level, the user may want to evaluate the experience that they had with the usage of the product or the services. This is where the rationalization happens, and this is where the intellectual part of the cognitive process comes into the picture.
So, now the user starts analyzing their experience and starts analyzing the pros and the cons of things. It is like you have already gone through an immersive experience, and the moment you come out of that immersive experience, you may want to recall the experience that you had and analyze it, ok.
Was that experience good or bad? That is the number one thing. Was it worth it? Was there a good return on investment? Maybe it was not worth it: maybe I am putting in a lot of money, and the experience was good, but it was not worth it.
Or maybe: would I like to repeat this experience again and again? For example, would I share this experience with someone else, would I tell them, ok, I used this particular product, I used this particular service, you should also use it because it is very enjoyable?
So, this is where all these kinds of things happen: the user thinks about whether they want to do it again and again, and at the same time whether they want to suggest the usage of that particular product or service to other users. And hence, the user may even do some kind of publicity for you. So, this is what happens at the reflective level.
So, basically, of course, a lot of your thoughts and a lot of your actions of using the product in the long term are motivated by this reflective cognitive processing. But at the same time, it does not have a direct control or connection, let us say, to the visceral level, because the visceral level is very momentary.
You saw the product and it made a very first impression; the reflective or rationalization process did not come into the picture there. Similarly, for the usability: once you have started using the product, you talk about the usability, the functionality, and the reliability of the product, but that is about it.
Then, on top of that, at the reflective level, you want to go higher in the user experience pyramid and think about the enjoyability: did you enjoy the experience, what pleasure did you have, right. And then, more importantly, you want to evaluate whether there was a good return on investment on this particular product or not, right.
So, as I said, at the reflective level there is a lot of rationalization. This is where the intellectual part of you or of your customer comes into the picture, and that is where you want to hit properly. Otherwise, the user may well conclude that the experience he or she is getting from the product is not worth it.
Now, enough about emotional design and the different types of design. Why are we talking about this in a course on affective computing? In the beginning we already said that affective computing is all about emotional intelligence: it helps you model emotions, it helps you evaluate the user experience, and more importantly it helps you make an experience that is adaptive.
Now, let us try to see how affective computing can help you make an efficient design at all three levels. For example, take the visceral design. As we already discussed, the visceral level is all about the initial impression.
So what can affective computing do for you here? It can help you understand the user's feelings, the user's first response, the user's first impression while using the product or the service. And why would you want to do that?
You may want to understand, for example: the first time I bought an iPhone and took it in my hand, how did I feel? What were the emotional expressions on my face? Did I look joyful, happy, curious? What emotions were being expressed by my body language, and what emotions were expressed by my physiological signals, if you have a way to measure them? It is as simple as that.
If you have a smartwatch, maybe you want to see what exactly is happening with my heart rate variability, and so on and so forth. The good thing about affective computing is that, on one hand, you can of course use subjective self-report questionnaires: you give an iPhone to a user and then ask, did you like it? How did you feel about the product? How did you feel about the first impression?
And the user may answer whatever he or she wants, and a lot of biases can creep into the user's response. That is where affective computing can really help you. Without having to ask the user anything, you can simply look at the facial expressions and at the other modalities through which the emotions are being expressed.
You can monitor those modalities and track and capture the momentary reactions the user is having while using the phone or the product. This is even more true when you have to evaluate those first impressions across n different users. Imagine there are 100 different users, you give them a hundred iPhones, and of course you want to evaluate their first impressions.
This is where affective computing comes into the picture. We already saw in the earlier lectures how we can model the different types of emotions and how those emotions are expressed in different modalities. So you can already decide what type of emotional experience you are envisioning, what you are hoping for.
For example, maybe you are hoping for a smile on the user's face. Then how are you going to capture that smile? Maybe you put up a camera which, using machine learning algorithms, captures all 100 users' faces, analyses all of them at the same time and produces an objective kind of report for you.
So you do not have to go and ask the users; everything happens in an automated fashion, and hence the chance of introducing a bias is minimal. You can then use that analysis and say, ok, maybe when I looked at the responses of the 100 users the experience was not so positive; I do not know yet what happened.
Then, of course, you can correlate it with other modalities. Maybe you look at the gaze, at the eye-tracking sensors, at the pattern of the users' gaze behaviour, along with the facial emotions. Combining the two, you may see that, ok, when the user was looking at the upper part of the phone he did not feel as happy as when he was looking at the box, the packaging of the phone itself, and so on and so forth.
So, by combining different modalities you can get very interesting observations about the first impression that your product or service is making on the user, and hence you can improve that first impression. That, in brief, is the very first level: how affective computing can help you create a better design at the visceral level. A minimal sketch of what such automated smile monitoring could look like is given below. Perfect.
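As an illustration only, here is a minimal sketch of flagging smiles in webcam frames during the "first impression" window and aggregating them across users. It assumes OpenCV with its bundled Haar cascades; the detection thresholds and the aggregation are illustrative choices, not part of the lecture.

# Minimal sketch: flag smiles in webcam frames during a product's "first impression" window.
# Assumes OpenCV (cv2) and its bundled Haar cascades; thresholds are illustrative only.
import cv2

face_cascade = cv2.CascadeClassifier(cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
smile_cascade = cv2.CascadeClassifier(cv2.data.haarcascades + "haarcascade_smile.xml")

def smile_detected(frame_bgr) -> bool:
    """Return True if at least one detected face contains a smile."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    for (x, y, w, h) in face_cascade.detectMultiScale(gray, scaleFactor=1.3, minNeighbors=5):
        roi = gray[y:y + h, x:x + w]
        # Smile detection is noisy; a high minNeighbors keeps only confident hits.
        if len(smile_cascade.detectMultiScale(roi, scaleFactor=1.7, minNeighbors=22)) > 0:
            return True
    return False

def smile_ratio(frames) -> float:
    """Fraction of a user's first-impression frames in which a smile was detected."""
    hits = sum(smile_detected(f) for f in frames)
    return hits / max(len(frames), 1)

Running smile_ratio over each user's recording gives one objective number per user that can be compared across the 100 participants, without ever asking them a question.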
Now, having understood the visceral design, let us look at the behavioral design. Please recall that behavioral design is all about the usability and the functionality of the product. How can affective computing help here? It can help you understand the emotional experience of the user while the user is using the product or the service, in real time.
For example, I am typing on a MacBook, and while I am typing you are monitoring my typing speed and maybe the different modalities in which you expect responses to be elicited: facial expressions again are a very good example, and maybe some physiological signals. And of course, depending on the resources you have, you may want to introduce certain sensors while the user is using the product or the service.
By observing the data captured through those modalities and sensors, you may be able to say, ok, in general the user liked the MacBook, but when the user started typing, maybe the typing experience was not as good as what the user had while typing on some other xyz computer or laptop.
So you can compare these experiences in a very objective fashion, combining the data as it is observed through the different modalities. More importantly, it allows you to capture the experience not only during tasks that result in success, but also during tasks that result in failure, and then you can analyse where exactly users are failing and how exactly they are failing.
For example, take a UI you created; let us take the iPhone again. On the home screen there are n different types of apps, but unfortunately, due to the design or whatever, all the apps are represented in more or less the same colour. Then you will have a very hard time locating the app you want to open.
Maybe I just want to open Gmail, but all the other apps look just like Gmail in colour, logo and everything. So it will be difficult for you to locate that particular app and launch it.
One way to measure that this resulted in a kind of failure is the amount of time the user took to locate the particular app; that is one behavioural response. And while the user was doing this, of course, the user will start showing some frustration.
That frustration, that emotional experience the user is having while not being able to do certain things, you can capture through facial expressions and other modalities. Using that data you can analyse what went wrong, when it went wrong, and how you can create a better emotional flow through a better design of the whole thing. A small sketch of how such a trial could be logged is shown below.
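Here is a minimal sketch of pairing the behavioural measure (time to locate and launch an app) with an affective measure. The helpers wait_for_launch and frustration_score are hypothetical placeholders for your UI hook and your facial-expression or physiological pipeline; they are not real APIs.

# Sketch: log how long a user takes to locate/launch an app, paired with an affective signal.
# `wait_for_launch` and `frustration_score` are hypothetical callbacks, not real APIs.
import time

def run_find_app_trial(user_id: str, target_app: str, wait_for_launch, frustration_score):
    t0 = time.monotonic()
    launched = wait_for_launch(target_app, timeout_s=30)   # hypothetical UI hook
    elapsed = time.monotonic() - t0
    return {
        "user": user_id,
        "app": target_app,
        "success": launched,
        "time_to_locate_s": round(elapsed, 2),
        # Mean frustration (0..1) over the trial window, as estimated by your sensor pipeline.
        "mean_frustration": frustration_score(t0, time.monotonic()),
    }

Comparing these records between a well-designed and a poorly-designed home screen makes the failure points, and the emotional cost of those failures, explicit.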
So now, last but not least, the reflective design: how can affective computing help you design the process or the product in a better way at the reflective level? As you understood, reflective design is all about the post-usage experience. It is about: I used this particular product, and after having used it, how did I feel about it?
It is as simple as that. So what can you do? Once the user has already used the product, you can put the user in front of your affective computing sensors and modalities and ask the user to recall the experience of that particular product or service.
For example, after having gone through a roller coaster ride, you want to see whether, while the user is recalling that experience, a similar type of emotion is elicited as when the user was actually using the product, or maybe as when the user was having the first impression of it. If not, that may be a good indication that the user will not form a very strong loyalty to the product or the service you are offering, and may not come back to use it again.
So this is how you try to understand the post-usage emotional experience of your service or product, and then you can see how affective computing can help you create a better emotional bond. A minimal sketch of such a usage-versus-recall comparison is given below.
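As a rough illustration only, one could compare the average emotion profile observed during actual usage with the profile observed during recall. The emotion categories, the use of cosine similarity and the 0.7 threshold below are all illustrative assumptions, not prescribed by the lecture.

# Sketch: compare the emotion profile during actual usage with the profile during recall.
# Categories, similarity measure and threshold are illustrative assumptions.
import numpy as np

EMOTIONS = ["joy", "surprise", "neutral", "frustration"]  # assumed column order of the scores

def mean_profile(frame_scores: np.ndarray) -> np.ndarray:
    """frame_scores: (n_frames, n_emotions) per-frame probabilities from any recognizer."""
    return frame_scores.mean(axis=0)

def recall_matches_usage(usage: np.ndarray, recall: np.ndarray, threshold: float = 0.7) -> bool:
    u, r = mean_profile(usage), mean_profile(recall)
    cosine = float(np.dot(u, r) / (np.linalg.norm(u) * np.linalg.norm(r) + 1e-9))
    return cosine >= threshold  # low similarity may hint at a weaker post-usage bond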
So, that is all about affective computing and emotional design. I hope you are now able to see what emotional design refers to and how affective computing can help you create better loyalty and a stronger bond with the product. See you in the next lecture.
Thanks.
Affective Computing
Prof. Jainendra Shukla
Prof. Abhinav Dhall
Department of Computer Science and Engineering
Indraprastha Institute of Information Technology, Delhi
Week - 03
Lecture - 01
Part - 1
Affect Elicitation
Hi friends, welcome to this week's class. This week we are going to talk about some very interesting things; most importantly we will be talking about Affect Elicitation, that is, the process through which emotions are induced in human subjects. Of course, we will restrict ourselves to human subjects for the purposes of this course.
More importantly, to ensure that all of this is done in an ethical fashion, we will also review the experimental methodology, including the role of the institutional review board. Finally, we will end the lecture with a discussion of research and development tools in various categories, including data annotation, machine learning and so on.
This is basically to enable learners from non-machine-learning backgrounds as well to work in affective computing with the same efficiency. As part of this discussion we will also do a tutorial demonstrating the PsychoPy tool.
So let us quickly dive into Affect Elicitation. For affect elicitation we have to understand that we need to be able to reliably and ethically induce emotions in human subjects in order to understand and develop emotionally intelligent machines.
There are different ways in which emotions can be induced, and similarly there are different ways in which the induced or elicited emotions can be collected from the human subjects. So in this particular module we will discuss the different ways in which affect elicitation can be done, the different stimuli that are available to us, how those stimuli vary in terms of their advantages and disadvantages, and why for certain types of experiments you may want to use one stimulus over another.
(Refer Slide Time: 02:44)
So let us dive into affect elicitation. When it comes to affect elicitation, let us first look at the different categories of datasets that are available for emotional expressions. All the datasets currently available for emotional expressions can be grouped into three categories. The first category is known as Acted or Posed Expressions.
As the name suggests, in this category the actors, who could be ordinary individuals or often professionally trained actors, are asked to portray a particular type of emotion, and that emotion is then captured, let us say using camera sensors or audio, whatever modalities you are interested in. This type of data is actually quite easy to collect, because you can simply ask for it: for example, you can ask an individual to pose a smile and the individual starts smiling.
For example, right now I am smiling a bit, and you are collecting the data while I am smiling; that is an acted emotion, and it is quite easy to collect. Of course, if you have professionally trained artists it makes life much easier, because they are better able to mimic the expressions you want: they can become happy, sad, angry and so on. So this is quite easy to collect and, I think, very easy to understand.
But, as the second point says, the ecological validity of emotions collected in this fashion is a concern. Why? Because we all know that if you ask an individual, or a professionally trained artist, to smile, to laugh or to express happiness, the individual is just expressing or mimicking happiness; he or she may or may not be happy within.
So the actual emotional state of the individual may or may not be reflected in the emotions being portrayed on the surface, and for that reason the ecological validity of the data is a big concern here.
So while the facial expressions you capture may look fine, if you decide to also capture and analyse the brain signals or other physiological signals, for example the EEG signals, at the same time, that may not give you very good results, because the actual emotional state of the individual may not be the same as what the individual is portraying.
I hope this is clear: datasets of acted or posed expressions are very easy to collect, but their ecological validity is always a bit of a concern. So they lie on one side of the spectrum.
Now we have a different category of datasets as well, which consists of the Naturalistic display of emotions. As the name suggests, this is completely on the other side of the spectrum.
Here you are not asking an individual or an actor to mimic a particular emotion; rather, you are collecting an emotional expression while it is being displayed in a natural setting, in a natural fashion. For the same reason the ecological validity of the data is very high, because if I am smiling in a natural setting it may very well mean that I am happy within as well.
So all the modalities of my body, including my facial expressions, my smile, my voice, my physiological signals and my brain signals, are all going to represent the emotion that I am actually in, and for the same reason the ecological validity is quite high.
But of course, as you can imagine, the problem with this type of dataset is that it is quite difficult to collect. Why? Imagine you want a dataset of happy, sad and angry emotions. You would first have to identify individuals who are happy, sad or angry in their natural settings.
And then, imagine, you would have to take your bag of sensors, run behind them and chase them, watch for when they are getting happy, sad or angry, and only then quickly put on the sensors and start collecting the data. Of course, that is not an easy task, and in some cases it may be practically impossible.
So the data collection itself can be very difficult. Nevertheless, in many cases you may have seen people simply take a camera and start shooting pictures when others are happy in their natural settings.
But there are ethical concerns related to that as well: you cannot just violate the privacy of an individual or a group of individuals, and it is only polite and ethical to ask before collecting their data, even if they are showing some emotions in a natural setting.
So these are some of the restrictions, disadvantages and problems associated with the naturalistic display of emotions, and for the same reason there are very few, or I would say a very limited number of, datasets of this type. Perfect.
So, two categories so far: first the acted category of completely portrayed emotions, and second the naturalistic category, where the expression of emotions is very natural. The third category, which really lies in the middle of these two, is known as induced expressions.
Basically, in induced expressions the emotional responses are elicited using some stimulus, and, as I said, this sits in the middle of the other two types of datasets. When it comes to the data collection effort, the effort for induced expressions is going to be a bit higher than for the acted or posed expressions.
But it is going to be definitely lower than for the naturalistic display of emotions. So, in terms of data collection effort, induced expressions sit above acted or posed expressions but below naturalistic displays.
Similarly, if you look at the ecological validity of the data, the ecological validity of induced expressions is going to be a bit higher than that of the acted or posed expressions, but at the same time lower than that of the naturalistic display of emotions.
I hope it is now clear that there are three categories of emotional expression datasets: acted, naturalistic and induced. Acted lies on one side of the spectrum, naturalistic on the other side, and induced sits in the middle. Perfect.
Having understood the different categories of datasets, and knowing that the third category, induced expressions, lies in the middle, we would like to focus our attention on this induced expression category. As we just saw, in this category the emotional responses are elicited via some stimulus.
Let us look at the different ways in which Emotion Elicitation can be done. Primarily, emotion elicitation can be done in two ways. The first is known as Passive, or perception-based, emotion elicitation. In passive or perception-based elicitation, the individuals look at a particular stimulus, which could be an image, a piece of music or a film clip.
These stimuli evoke a particular type of feeling, a particular type of emotion, in the individuals. And while the individuals are experiencing these emotions, their data is being collected in certain pre-identified modalities such as audio, visual, physiological signals, brain signals and so on.
That is what is known as passive or perception-based elicitation. It is called passive because, other than watching and experiencing, the individual is not really doing anything; the participant is not playing a very active role here.
In contrast to passive or perception-based elicitation, we have Active, or expression-based, elicitation. As you can now rightly understand, this is just the opposite of passive emotion elicitation.
Here, the individuals are asked to perform particular behaviours. Please pay attention: rather than simply observing a stimulus, the individuals are asked to perform a particular behaviour that might naturally evoke different emotions.
For example, it can be as simple as posing the facial muscles. If I were to put this pen in my mouth, which I am not going to do, but if I were to hold it like this, it means I am hindering my facial expressions, I am inhibiting the smile, I am preventing my face from making a smiling face.
Similarly, if I were to hold the pen between my teeth like this, I am making an expression that forces me to smile. Similarly, you can adopt different types of body postures, or you can interact with people in a particular fashion.
In this way a particular emotion is elicited, and while these emotions are being elicited we can again use a predefined set of sensors and modalities in which we want to collect the data. So this is quite simple: the first category is Passive or perception-based, the second is Active or expression-based.
Now, if I were to ask which of these two categories is easier to collect data for and which is a bit harder, what do you think? Of course, the answer is that the passive or perception-based approach is easier in comparison to the active or expression-based one.
Why? For more or less the same reasons we saw with the datasets earlier: you are not really asking the individuals to do a lot of things. The individuals play a passive role, hence it is relatively easy to capture their data, and it gives you more control over the environment, and hence a more standardized dataset. Perfect.
So, we talked about emotion elicitation: one is passive, the other is active.
(Refer Slide Time: 14:51)
Let us now look in more detail at the different ways in which we can do passive emotion elicitation. The very first stimulus that comes to mind is, of course, images. With images the idea is very simple: you present a particular type of image to a set of individuals in order to evoke a particular type of feeling.
For example, if you want the individuals to experience a happy emotion, you would present some happy images, images depicting happy emotions: for instance, images of a cute baby, of a beautiful flower, and so on.
So you get the idea: the basic approach is to use an image as a stimulus to provoke a particular type of emotion. The participants simply have to look at the screen and at the image being presented, and while the image is being presented you hope that the participant experiences the same type of emotion that you wanted to elicit, and accordingly you collect the data in the different modalities.
One thing you have to pay attention to here, and not only for images but for all stimuli, is that the presentation method has to be standardized, so that all the individuals who participate have the same viewing experience.
Different researchers have given recommendations on how to standardize this kind of viewing experience. For example, in this paper from Monkaresi and his group, the suggestion is that images could be presented for 10 seconds each on a computer screen, and the screen should be kept at a fixed distance from the participant.
So you cannot just change the distance of the participant from the screen: there has to be a fixed chair, a fixed monitor on which the image is presented, and the image duration should be the same, 10 seconds. Needless to say, the screen resolution has to be constant, and the screen brightness as well; you cannot change the brightness from image to image, and the same goes for the image size, which also has to be constant.
Now, which particular image size, screen resolution and screen brightness to use depends on many different things.
What is more important is that your participant should be able to clearly see the image, number one, and should be able to experience the emotions presented in the image without the influence of any other variables: without the influence of the screen brightness, the resolution, the image size or the viewing distance. So this is a nice paper to look at, but more or less these are the standard conditions you want to maintain while presenting images as stimuli and collecting the data. A minimal PsychoPy sketch of such a standardized presentation is given below.
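Since this week also includes a PsychoPy tutorial, here is a minimal sketch, assuming PsychoPy is installed, of a standardized image presentation: full-screen window, fixed on-screen image size, 10-second exposure. The image file names are placeholders; viewing distance and screen brightness must of course be fixed physically.

# Minimal PsychoPy sketch of a standardized image presentation: full-screen window,
# fixed image size, 10 s per stimulus, short grey gap between images.
# Image file names are placeholders; distance/brightness must be fixed physically.
from psychopy import visual, core

win = visual.Window(fullscr=True, color="grey", units="pix")
images = ["stim_happy_01.jpg", "stim_sad_01.jpg"]   # placeholder IAPS-style files

for path in images:
    stim = visual.ImageStim(win, image=path, size=(800, 600))  # constant on-screen size
    stim.draw()
    win.flip()
    core.wait(10.0)   # 10-second exposure, as in the standardization discussed above
    win.flip()        # clear the screen
    core.wait(2.0)    # brief neutral gap; here you would log markers / collect data

win.close()
core.quit()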
Of course, the next question is: which images should I present? As I said, you want to use images according to the emotions you want to evoke. But are there any standard sets of images? It seems quite acceptable that an image of a cute baby or of a flower will probably make the individuals happy, but can we rely on something more systematic?
The answer is yes: there are well-validated image datasets such as the International Affective Picture System (IAPS) and the Geneva Affective Picture Database (GAPED).
What do we mean by well validated? It means that a particular set of images has already been screened for you: they have been shown to participants, and the participants have been shown to experience the emotions those images were meant to provoke, were meant to elicit.
Let us take a quick look at some sample pictures from the IAPS dataset. If you look at the screen, there are different categories of emotions that can be elicited: for IAPS, for example, awe, excitement, anger, sadness, contentment, amusement and so on.
Now, take a particular kind of feeling, for example sadness. This particular image looks like a very sad image, and when you look at sad images you may experience the sad emotion as well. That is the basic idea.
Similarly, if you look at this image here, which depicts the fear emotion, the individual in it appears scared or afraid, and the participant who views this image may experience the same type of emotion; and the same goes for awe, for contentment and so on.
So these are the IAPS dataset images, which you can very well go and look up; the dataset is publicly available for you to download and experience. Perfect. So we have seen that images are one particular type of passive stimulus that can be used.
Now let us look at some of the advantages of this particular type of stimulus. One advantage is that it is non-invasive; of course, all the passive methods are going to be non-invasive.
It is not intrusive and the individual does not have to put in any effort, so it is very comfortable for the participant to look at these image stimuli. These images are also easily accessible, in the sense that widely validated datasets such as IAPS and GAPED are already publicly available.
You can simply use them, and presenting these images as stimuli requires a very simple setup: you just need a computer screen at a particular resolution and distance, a fixed chair for the participant, and the participant simply looks at the screen while you collect the data. So it is really very easy to create a dataset of emotional expressions using images as stimuli.
Now, there are certain disadvantages as well, and one that you may have already observed by looking at the images is that the strength of the emotion is lower: the intensity with which the participant experiences the emotions while looking at an image may not be very high.
In fact, it can be quite low in many cases, which is why the strength of the emotion is said to be lower. It is also well understood that the emotional reactions are short and transient. What does that mean? It simply means that even if I look at an image and feel something, say happiness or sadness, that feeling is going to be very short. I am not going to stay sad for 10 minutes just because I looked at one image; it will last a very short duration, maybe a few seconds, and then it fades, so it is very transient.
In that sense you have only a very short time window in which you can capture the expression of that emotion through the different modalities, and that is another disadvantage. The third disadvantage is a lack of personalization. What do we mean by that? Different individuals may experience different emotions while looking at the same image.
For example, maybe you present a picture of a cute puppy, but someone's puppy recently died; while looking at that image one individual may feel happy, another may start feeling sad.
So there is a clear lack of personalization if you assume that the same image will induce the same emotions in different participants. Apart from this, different studies have shown that such individual differences show up at various levels.
For example, a very interesting study from Asensio and his group pointed out that individuals with cocaine addiction react differently to pleasant, unpleasant and neutral IAPS images in comparison with a control population. So here you see another way in which an individual's preconditions, including their addictions and perhaps their medical history, can matter.
Their biases and prejudices can also play a role in how they experience the emotion. This is just one example, and there is a reason I put the word etcetera here: there are several other disadvantages that can be associated with images as stimuli, and I would like you to think of some of them. Perfect. So we talked about the advantages and the disadvantages of images as stimuli.
There is another thing we have to understand: when you use images for emotion elicitation, they can be very useful for the reactive modalities, where there is a reaction from the body, such as your facial expressions reacting to an image, or your physiological signals.
The same type of reaction shows up in your physiological modalities as well. But images may not be very useful when you want to collect a productive modality. For example, suppose you want to understand emotions in text. If you show an image as a stimulus, it is going to be very hard, next to impossible, to collect textual data in which the individual expresses his or her emotional state. It is not going to happen that I suddenly see an image and start writing, ok, I am feeling very happy looking at this image.
Similarly for gestures: while the amount of gesturing varies from one individual to another, it is very unlikely that just by looking at one particular image an individual will suddenly start making gestures.
So now you see why this passive method of using an image as a stimulus may not be very helpful for collecting the productive modalities. Perfect.
That was all about images. The next question we can ask is: if we can use images as stimuli, can we use film clips, video clips, as stimuli? The answer is yes. The basic idea is the same, we have to follow some standardization, and the protocol is more or less the same as with images, except that first we present a neutral baseline film.
Why a neutral baseline film? Because everyone's baseline emotional state can be different. Hence, even before you start collecting the actual emotional expressions, you may want to first show a neutral film that does not evoke any particular emotion; maybe it is a documentary talking about general stuff, and the purpose is to put the participant into a neutral emotional state.
After that you present one emotion clip, say a clip meant to make the individual happy, and after making the individual happy, or sad, or whatever, you do some sort of self-assessment, asking, for example: what exactly was the emotion that you felt while watching this? And so on.
You can simply keep alternating: one emotional clip, one self-assessment, another emotional clip, another self-assessment, and so on. Please note that self-assessment is just one type of ground truth you can collect from the participants. Of course, while the individual is watching the clip you keep collecting data in the audio-visual modalities using a webcam or a camera.
Similarly, if you are interested in physiological signals, you will already have attached sensors that keep collecting those signals while the individual watches the video clip, and so on. Once the presentation of an emotional clip is done, the self-assessment phase follows, where you ask the participant to provide the basic ground truth. The questions can be very simple.
For example, I show you a video and then ask: how did you feel after seeing it? As simple as that. All the standardization conditions are more or less the same as we saw in the case of images. A sketch of this neutral-baseline, clip, self-assessment cycle is given below.
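Here is a minimal sketch of that cycle as plain Python. The helpers play_clip, collect_self_assessment and log are hypothetical placeholders for your actual presentation, rating and logging code (for example, written in PsychoPy); continuous audio-visual and physiological recording is assumed to run alongside.

# Sketch of the clip-presentation protocol: neutral baseline, then alternating
# emotional clip + self-assessment. `play_clip`, `collect_self_assessment` and `log`
# are hypothetical placeholders, not real APIs; sensor recording runs in parallel.
def run_film_session(play_clip, collect_self_assessment, log):
    log("baseline_start")
    play_clip("neutral_documentary.mp4")        # put the participant in a neutral state
    log("baseline_end")

    emotional_clips = ["happy_clip.mp4", "sad_clip.mp4"]   # placeholder 1-2 minute clips
    for clip in emotional_clips:
        log(f"clip_start:{clip}")
        play_clip(clip)
        log(f"clip_end:{clip}")
        rating = collect_self_assessment()      # e.g., valence/arousal self-report
        log(f"self_assessment:{clip}:{rating}")

The logged markers later let you align the self-reports and the sensor streams with each clip.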
Now, there are a couple of questions we may want to answer. First, should we use short clips or long clips? There is a trade-off here. If you use a very short clip, the video stimulus starts behaving like an image: the emotions will be short and transient, and their intensity may not be as high as you want.
But if you use a very long clip, say a one-hour video, and then ask how the participant felt, that becomes very hard to answer.
Over the course of one hour I may have felt n different emotions, and you cannot really say at what particular point of time I was experiencing which particular emotion. So there has to be a trade-off, and in the literature clips of 1 to 2 minutes have been used fairly widely; that is the trade-off between short and long clips, and 1 to 2 minutes works quite well.
As I said, the physical setup has to be standardized for the video clips just as it was for the images. For example, Rottenberg and his team show in a paper that it could be a 20-inch monitor at approximately 5 feet from the participant, with a constant screen resolution and a more or less constant duration.
Of course, the clips can differ slightly in length; if you have two video clips, one could be, say, 60 seconds and the other 63 seconds, and that is perfectly fine.
Now let me quickly erase this so that it is clear. For each target emotion, the basic strategy is to present not just one but maybe 1 to 2 short clips, and these clips should be as homogeneous as possible. What do we mean by as homogeneous as possible? The duration should be more or less the same, the intensity should be the same, the context should be more or less the same, and the quality of the videos should be more or less the same, otherwise it can influence the results. Now, just as we have standard, pre-validated image datasets, do we have datasets of videos that are already pre-validated?
Of course, no one is stopping you from curating your own dataset, but it is always good to check whether something already exists; if it does, you can use it, otherwise feel free to go ahead. And yes, we do have such datasets: FilmStim, for example, is one popular, pre-validated dataset of video clips.
You can use one of the videos from it as a stimulus. For example, let us see, I am not sure if I can play it here, but you can go ahead and look at this sample video clip from FilmStim.
If you recall, this is the very famous clip from the movie, I believe, Schindler's List. What happens in this clip is that this particular individual is looking at some of the workers and, I believe, he shoots one of the children who were there. So it is a very sad kind of clip.
You look at this clip and you feel a very sad emotion, you feel a bit angry as well, you feel disappointed as well; there are lots of emotions that you experience while watching this particular video clip. And as you can see, its duration is between 1 and 2 minutes, 1 minute 38 seconds to be precise.
Having looked at the sample, let us quickly see what some of the advantages of video clips over images are when we use them as stimuli. Of course, they capture attention very well; it is needless to say that you will pay more attention to a video clip than to an image.
If you look at the intensity of the emotions, a good deal of research shows that the intensity of the emotions elicited by video clips is higher. At the same time, you can even induce complex emotional states such as shame and guilt, which would otherwise be very hard to induce using images as stimuli.
And, as I said, for images there was another problem: since the elicited emotions were short and transient, you could not really study, let us say, the emotional latency. What do we mean by latency? Emotional latency means: after the stimulus has been presented, how much time does it take for the individual to experience or express the emotion? For example, you present an image or a video clip, it takes the individual 200 milliseconds to process it, and within 200 milliseconds the expression or the experience of that emotion starts appearing in the facial expressions, in the brain signals and so on. That is the emotion latency.
Since the emotions elicited by video clips last longer, you can now study the latency of a particular emotion in a particular modality. You can study the rise time, that is, how much time it takes for the emotion to reach its peak, because you now have longer-lasting emotions. So you can really analyse how the emotions build up, when they peak and when they go down, as well as the total duration and the offset time, which, much like the latency, is measured relative to the presentation of the stimulus. A minimal sketch of how these quantities could be computed from a sampled intensity signal is given below.
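As an illustration, here is a small numpy sketch that computes latency, rise time, peak and duration from a sampled emotion-intensity signal. The 20%-of-peak onset threshold is an illustrative convention chosen here, not a standard from the lecture.

# Sketch: latency, rise time and duration from a sampled emotion-intensity signal.
# `intensity` is a 1-D array sampled at `fs` Hz, starting at stimulus onset (t = 0);
# the onset threshold (20% of peak) is an illustrative convention, not a standard.
import numpy as np

def emotion_timing(intensity: np.ndarray, fs: float) -> dict:
    peak_idx = int(np.argmax(intensity))
    peak = float(intensity[peak_idx])
    onset_level = 0.2 * peak
    above = np.flatnonzero(intensity >= onset_level)
    onset_idx = int(above[0]) if above.size else peak_idx
    offset_idx = int(above[-1]) if above.size else peak_idx
    return {
        "latency_s": onset_idx / fs,                  # stimulus onset -> emotion onset
        "rise_time_s": (peak_idx - onset_idx) / fs,   # emotion onset -> peak
        "peak": peak,
        "duration_s": (offset_idx - onset_idx) / fs,  # time spent above the onset level
    }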
Along with the advantages there are also some disadvantages. For example, the ecological validity can be in question: maybe you want to induce a positive emotion by presenting a positive clip, but the individual is feeling sad for various reasons, maybe because of a lack of personalization, because of prejudices, because of biases, and so on.
So the ecological validity may be in question; to mitigate this we try to use datasets that are already pre-validated, such as FilmStim. You also have to take into account the fact that the individuals may have seen the video clips before.
For example, if someone has already seen Schindler's List, that individual may not feel or experience the emotion in the same way while watching this clip as someone who has never seen the movie or the clip, because they already know what is about to happen. So you have to take all these factors into account.
And again, since this is a passive modality, it is not ideal for collecting data from the productive modalities such as text: I am not going to suddenly start writing about how I am feeling while watching a video clip, and I am not going to start making gestures in accordance with the emotions that I am feeling. So that is the other thing about video clips. With that, we have talked about images and film clips.
Now let us move to Behavioural Manipulation. Having seen images and film clips as passive emotion elicitation methods, let us now look at the active methods. Among the active methods, behavioural manipulation is a very popular one, in which the participants are asked to put themselves into a particular behavioural pattern or configuration in order to evoke a particular type of emotion.
Paul Ekman was one of the first researchers who worked extensively on the facial muscles and on understanding how putting yourself into a particular facial configuration can itself evoke a particular emotion, the desired emotion that you want the participant to experience. In relation to this, he proposed something known as the directed facial action task.
In the directed facial action task, Ekman proposed that you can ask the participants to create or mimic a particular action unit. What do we mean by action units? That is what the directed facial action task is about.
(Refer Slide Time: 38:01)
For example, you can ask the participant to put himself or herself into a particular action unit. Action unit 12 is the lip corner puller, so you can ask the participant to mimic facial action unit 12, and while holding that facial configuration the participant will experience the emotion associated with that facial action unit.
There is a lot of research, from Paul Ekman's group itself, on which facial action units are associated with which emotion. For example, if I ask: this facial action unit, the lip corner puller, what type of emotion do you think it is associated with?
Of course, the answer is that it would be a happy or joyful kind of configuration, because you are forcing your face to mimic a smile. So this is the directed facial action task, in which you ask the participant to manipulate their facial expressions so that the desired emotion is evoked; that is one type of behavioural manipulation. A small sketch of mapping detected action units to prototypical emotions is given below.
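As a small illustration, here is a lookup from a set of detected action units to a prototypical emotion label. The AU combinations used (happiness roughly AU6+AU12, sadness roughly AU1+AU4+AU15, surprise roughly AU1+AU2+AU5+AU26) are commonly cited FACS prototypes, but exact tables vary across sources, so treat them as illustrative.

# Sketch: map a set of detected facial action units to a prototypical emotion label.
# The AU combinations are commonly cited FACS prototypes; exact tables vary by source.
PROTOTYPES = {
    "happiness": {6, 12},        # cheek raiser + lip corner puller
    "sadness":   {1, 4, 15},     # inner brow raiser + brow lowerer + lip corner depressor
    "surprise":  {1, 2, 5, 26},  # brow raisers + upper lid raiser + jaw drop
}

def prototype_emotion(active_aus: set) -> str:
    """Return the first prototype fully contained in the detected AUs, else 'unknown'."""
    for emotion, required in PROTOTYPES.items():
        if required <= active_aus:
            return emotion
    return "unknown"

print(prototype_emotion({6, 12, 25}))   # -> "happiness"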
Another type of behavioural manipulation, which is also not uncommon in affective computing research, is the Recall method. You may have experienced that when you, or a participant, try to recall experiences from the past, you tend to feel that emotion again.
For example, if in the past I felt happy while visiting a particular place, then when I start recollecting the memories of that place I will again start experiencing the happy emotion associated with it. That is the recall type of behavioural manipulation.
We already looked at one example of the facial action task. This was work done by our own group, in which we showed how the directed facial action task can be used to recognize different facial expressions with the help of an earable device that the individual wears in the ear while making a particular facial expression.
Interested readers are invited to go ahead and read the paper for more details on how an earable device can be used to recognize the facial emotions associated with the facial action task. Perfect.
That is one of the advantages, but then there are certain disadvantages associated with behavioural manipulation. One disadvantage is that the ecological validity of the data itself can be in question here: while I am trying to mimic a smile, does it really mean that I am smiling, does it really mean that I am happy? So the question that can be asked is whether my emotions are pure when I am merely mimicking a particular emotion.
This is the same question we asked at the beginning, so the ecological validity is, of course, in question. Another issue is that, in order to use behavioural manipulation to evoke a particular emotion, we need to know the physical behaviour associated with every target emotion. For example, Paul Ekman's work itself did not include several complex emotions such as frustration, confusion and engagement.
His work did not clearly specify which facial action units are involved when an individual is experiencing frustration, confusion or engagement, or how exactly these states are evoked and expressed in the facial expressions. This limits the range of target emotions that you can really induce in a participant.
If I do not know which gesture or which facial expression is associated with frustration, I cannot ask the participant to make that gesture or that expression, hence I cannot evoke that particular emotion, and in the absence of such elicitation I cannot really study and analyse that type of emotion. So it really limits the range of target emotions that can be elicited.
Now, the third disadvantage; any guess? The third one is that the poses or actions, even when they are known for particular emotions, can be really hard for a participant to perform. For example, in the directed facial action task itself, some facial action units are a bit easier to produce than others.
For example, the brow lowerer: even I think I can do that, so the brow lowerer is a bit easy. But if I try, say, the lip tightener, I do not know whether I am able to make it very clearly. So you see the problem: whether I am able to do the lip tightening the way it is supposed to be done.
Similarly, there are all the other facial action units, and not only the facial action units: you also have to look at the gestures you want your participants to perform, and maybe not all the participants will be comfortable performing them. Hence these poses or actions can be very difficult or complicated.
In comparison, the recall approach may be easier: you ask the participants to recall some experiences they had in the past, and hopefully that evokes the desired emotion in them. We already talked about the advantage that behavioural manipulation is really useful for collecting the reactive expressions, such as the emotions expressed in the facial expressions or in the gestures.
Another thing to understand is that the intensity of the emotion can be really strong, provided we can ensure its ecological validity. How can we ensure that? For that, the physical behaviour associated with the target emotions has to be known precisely.
For example, we know that when you are happy, when you are experiencing joy, you are smiling, and you cannot really smile without pulling your lip corners. So for happiness this precise information is available: the corresponding facial action units and the corresponding facial muscle configuration.
So you can expect that if the individual mimics this particular type of smile, then through the body-brain effect that we studied earlier the individual will start feeling or experiencing this happy emotion. In this case the intensity of the emotion can be a bit stronger than, for example, with images as a passive stimulation method. Right, perfect.
(Refer Slide Time: 46:01)
Behavioural manipulation is not the only method for active emotion elicitation; another method which is, again, not uncommon is Social Interaction. What do we mean by social interaction? It is basically when you are interacting with society in general: with individuals, with groups and with the community. The idea is very simple: while you are interacting with society, there cannot be an absence of emotions.
Whenever you are interacting with other individuals, with groups, with the community, with your relatives, with your friends, certain emotions are bound to occur. So this is a very naturalistic setting in which emotions are evoked and expressed.
Hence one of the advantages is that the elicited emotions are very realistic and natural, provided you can actually capture them. Please recall that we talked at the beginning about how capturing emotions in such a setting can be really difficult.
141
You cannot just take out your camera and start clicking pictures of people who are, for
example, happily interacting in a park. Of course, there are privacy concerns, and on top of
that, imagine if you want to understand their brain signals. Are you really going to put a
sensor on their head the moment you see them laughing in the park? You cannot do that,
right.
So, there is a disadvantage associated with it: the data collection setup can be really difficult,
and it is not easy to collect the data. But of course, the emotions can be very realistic and
natural. Another very interesting thing about this particular setting, the social interaction
setting, is that there are certain emotions that are really hard to evoke when you are using the
passive methods,
such as images or video clips, or even, for that matter, behavioural manipulation; along the
same lines we could also have talked about audio, that is, music. But nevertheless, I think you
now get the idea that audio, images and video clips all fall in the same category of passive
methods of emotion elicitation.
Now, there are certain emotions which are a bit complex and may be a bit harder to induce,
such as guilt, shame, and to a certain extent anger. These are emotions that are a bit complex,
and you cannot really just present a particular image or video clip, or make someone listen to
some music, and expect the individual to start experiencing guilt or shame.
But in social interaction this is fairly easy. I think it is a fairly common experience that your
relatives or your friends have said something and you have immediately started feeling guilty
about it. For example, you may not have been able to give them enough time, or maybe you
started feeling ashamed about certain things that you may or may not have done, but which
are now being talked about in the social gathering, in the social interaction.
So, in short, we talked about social interaction as an active method of emotion elicitation, we
talked about behavioural manipulation as another active method, and we talked about video
clips and images as passive methods of emotion elicitation.
142
And, as I said, along these lines you can very well include music, the audio modality; it is not
hard to understand that it falls into the passive category of emotion elicitation. As for
advantages and disadvantages, I invite the interested audience to explore what the advantages
of using music or audio as a stimulus could be, and what some of the disadvantages
associated with it might be. Perfect.
So, with that we finish this particular module on emotion elicitation, and in the next module
we will be talking about the Experimental Methodology. So, with that, let us take a break.
Thanks.
143
Affective Computing
Prof. Jainendra Shukla
Prof. Abhinav Dhall
Department of Computer Science and Engineering
Indraprastha Institute of Information Technology, Delhi
Week - 03
Lecture - 02
Part - 2
Experimental Methodology
Hello friends. So, let us now look at the Experimental Methodology, in which you will learn
how to design the experiments with which we can collect the data and annotate the data. We
can then use the affective computing methods to understand what the emotions are, and make
use of that to develop emotionally intelligent machines and services. Perfect.
So, let us just dive in. First things first: whenever we are conducting research involving
humans, we have to take care of the rights and the welfare of the human research subjects.
To do so, we have a committee in place which is known as the Institutional Review Board
(IRB). Now, depending upon the institution's administrative hierarchy and logistics, you may
or may not have a committee with the same name.
144
But please take my advice on this: whenever you are planning to do some experiments
involving humans, please approach the head of your organization or the head of the
department in your company or institution, and check with them whether there is an
institutional review board already in place. If not, then please ask them to help you out on
how to proceed further, so that you can get this approval for the experiment that you are
trying to do.
There is a reason the IRB exists in all institutions: as I said, it takes care of the rights and the
welfare of the human research subjects. We do not want to just make use of humans in
whatever way we want; all the data that we are collecting has to be valid, and it has to be
collected in an ethical fashion, right.
Covering the entire spectrum of IRB requirements is out of the scope of this course, but
nevertheless I will touch upon what documents you will have to prepare and what exactly
needs to go into those documents. These are things that you must have prepared and
submitted to the IRB.
First, you of course need a draft or an abstract of the proposal. In the draft or the abstract of
the proposal you should have a clear description of the research methodology: what exactly
the study is about, for example, what type of data you are interested in, and what type of
stimulus you are going to use; also, how you are going to recruit the participants, the design
of your experiments, the experimental lab conditions, and so on. So, basically this is the draft
or the abstract.
Next, a very important thing: you also need to prepare an informed consent form. What is the
informed consent form? It is a form that you must present to, and get approved by, all the
participants in your experiments, which says something like, 'I hereby provide my consent to
participate in the study under these conditions and circumstances.'
So, this is a fairly detailed form, maybe a one-page or two-page form, that you need to
create. There are lots of templates that you can find online and there are lots of guidelines
145
from the APA and similar agencies, and you can use those guidelines to create the informed
consent form, which must be part of your IRB application packet.
Third, when you are preparing the draft or abstract of the proposal, it needs to have a
dedicated paragraph on how exactly you are going to protect the confidentiality and the
anonymity of the participants, as well as the privacy of the data.
Basically, anonymity has to be protected for all the participants. To do so, you may want to
adhere to certain guidelines, IT-related guidelines for example, from the Government of India
or from international bodies. This is something you must spend some time on and decide
exactly how it will be done, and you must convince the committee that your design is going
to take care of the privacy of the data and the anonymity of the users. One small technical
sketch of what this can look like in practice is given below.
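A minimal sketch, assuming Python, of one common practice (not an official guideline): replace identifying details with a salted, one-way pseudonymous ID before the data leaves the lab. The salt value and the participant name are placeholders.

import hashlib

SECRET_SALT = 'replace-with-a-long-random-secret'   # placeholder; store securely, separately from the data

def pseudonymize(participant_name: str) -> str:
    # Map a participant's identity to a stable, non-reversible pseudonymous ID.
    digest = hashlib.sha256((SECRET_SALT + participant_name).encode('utf-8'))
    return 'SUBJ-' + digest.hexdigest()[:10]

print(pseudonymize('Asha Sharma'))   # hypothetical participant name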
Another thing that you need to describe is whether there is any risk associated with the
experiment that you are going to do. For example, the type of images or videos that you are
going to show to the participants, or the type of posture that you are going to ask the
participants to adopt, may not be very comfortable, may induce certain negative emotions,
and may have some psychological effects afterwards.
You need to clearly understand all these things and accordingly provide a detailed description
of the risks involved. Along with that, describe what some of the benefits to the subjects are;
for example, if you are inducing some positive emotions, what is going to be the benefit of
that, what is going to be the benefit of the data itself that you are collecting, or of the
experiment that you are doing in general, and so on.
Another very important thing is how exactly you are going to recruit the participants. For
participant recruitment you may prepare some advertisement flyers, invitation letters or
emails, or maybe you are going to make a post on social media. Whatever method you decide
to use to contact the participants, you need to provide a copy of that material to the IRB
committee as well. Perfect.
146
(Refer Slide Time: 06:17)
Now, another thing you may want to look into is what data collection instruments you are
going to use to collect the data from the participants. When I say data collection instruments,
this covers many different things. It includes the surveys and the questionnaires that you may
want to use to assess the ground truth.
It also includes the sensors, that is, the hardware devices that you are going to use to collect
the data, such as audio-visual devices like a webcam, a physiological sensor, a sensor to
collect brain signals, and so on.
You also need a description of the experimental research method: what exactly is the research
method that you are going to use, and what is the analysis method; maybe you are going to
do certain quantitative analysis or certain qualitative analysis. A detailed discussion of these
things is a bit out of the scope of this course, but I am just mentioning the keywords
so that you can go and look them up. Some of you may already be familiar with all those
things, but you may want to look into them so that you can prepare a nice proposal even
before starting the collection of the data. Similarly, if you are going to use a questionnaire or
conduct some interviews, then what are the protocols for
147
the interview, and what are the questions that you are going to use? Similarly, if you are
going to conduct some focus group discussions, what is going to be the format of those, and
so on. Basically, you have to understand that these documents are not only there for you to
submit to the IRB; if you look at all of them together, collectively they will help you in
understanding and developing your research proposal and your research methodology in a
better way.
In the end, it will help you to collect the data in a better way, and then you will be able to
use that data effectively afterwards. Perfect. Now, on the right side of the slide there is an
illustration of a very common scenario, especially when you do research around human
subjects.
Many times, you prepare the IRB approval documents and submit them, but you do not wait
for the IRB approval to arrive before you start collecting the data. Please do not do that; it is
not ethical and it is not acceptable. You have to wait for the IRB approval in order to begin
the data collection. You may also want to have a discussion with the IRB committee about
how you should proceed,
and whether there is a way you can expedite the process, and so on. But please do not start
collecting the data or conducting the experiments without the IRB approval. Now, quickly
speaking, there are five or six different things, certainly not an exhaustive list, that the IRB
committee would like to look into and would like to ensure in order to approve the research
method that you have designed.
First, they would like to see that the risks to the subjects that you are describing are
minimized; and if they cannot be eliminated, then they must at least be reasonable in relation
to the anticipated benefits of the study itself. They would also like to look at the importance
of the knowledge that you are planning to generate as a result.
If the risks are high in relation to, let us say, the knowledge that you are planning to
generate, then the IRB committee would not be very happy about it and
148
may not provide the approval. Of course, the selection of the subjects also has to be
equitable: the IRB committee would like to see that there is no bias and no prejudice in the
selection of the subjects, and this is something that you have to ensure while doing the
recruitment of the participants. As I said before, informed consent has to be sought from each
and every participant. In case the participant is a minor, you may want to consult the
guardian or the parent of the participant, and you may also want to discuss with the IRB
committee what you should do and how you should proceed in that case. The IRB committee
would also like to look into your plans for monitoring the data collection, and how you are
going to ensure the privacy and, as I said, the confidentiality of the data and the anonymity of
the subjects.
These are some very important points. On top of that, if you are conducting experiments with
a vulnerable population, say children and minors, or, as is often the case, individuals with
intellectual or certain types of behavioural disorders, then you may want to put some
additional safeguards in place to ensure that there are no additional harms for this particular
type of population. Perfect.
149
So, enough about the IRB approval. Now that we have obtained our IRB approval, we are
ready to dive in and start doing the experiments. For this, there is a very simple chart that
helps you understand the different things that you need to do, in order.
The very first thing is that you need to develop the study concept. You need to understand
what type of emotions you are trying to study and what type of stimulus you are going to
use. So, roughly, you may first want to decide on the different types of emotions that you
want to study. Once you have decided on the emotions, you may want to look into the
different types of stimuli that you may want to use. Once you have decided on the stimuli,
you choose the different types of sensors or modalities using which you are going to collect
the data.
And then, of course, you work out the setup of the experiment and so on. This is what comes
under the development of the study concept. Once you have done this, the next step is what
is known as study design. Some of you may already be familiar with study design; for those
who are not, quickly speaking, there are three different types of study design that you can do.
One is the within-subject design, another is the between-subject design, and then there is the
mixed-model design. Without going into too much detail, as I said, all of these are a bit
outside the scope of the course. Nevertheless, I want to touch upon them so that you do not
feel completely unfamiliar with them, and more importantly, I would invite the interested
readers and audience to explore more about them.
The within-subject design, as the name itself suggests, is the design in which you use the
same participants to study the different conditions of the experiment. For example, you may
want to understand how participants respond to emotions in VR versus how they respond to
emotions in a computer setting.
So, you have created different conditions. Now, you put the same set of users in the VR
condition and then you put the same set of users in the computer system
150
as well, and see how they respond to the emotions; that is what is known as the
within-subject design.
The between-subject design is just the opposite of this. In the between-subject design, you
put one set of users in the VR experiment and another set of users in the PC experiment. So,
there are different groups with different sets of users, and that is how you understand and
analyze the effects; that is the between-subject design.
And when you do a mix of within-subject and between-subject, that is what is known as the
mixed-model design. I would invite you to explore more about these and see their pros and
cons. Now, which is the best design for a study? That depends on many different things, so
there is no short answer; you will have to look into it.
Long story short, as you can easily imagine, for the within-subject design you will need
fewer participants, while for the between-subject design you will need more participants. On
the other hand, the session duration for the between-subject design is going to be shorter in
comparison to the session for the within-subject design, and so on. A small sketch contrasting
the two kinds of assignment is shown below.
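A minimal sketch, assuming Python, of how the two assignments differ for two hypothetical conditions, 'VR' and 'PC'; the participant IDs are illustrative.

import random

participants = [f'P{i:02d}' for i in range(1, 13)]   # 12 hypothetical participants
conditions = ['VR', 'PC']

# Between-subject: each participant experiences exactly one condition.
shuffled = random.sample(participants, len(participants))
between = {p: conditions[i % 2] for i, p in enumerate(shuffled)}

# Within-subject: each participant experiences both conditions,
# with the order counterbalanced across participants.
within = {p: (conditions if i % 2 == 0 else conditions[::-1])
          for i, p in enumerate(participants)}

print('Between-subject assignment:', between)
print('Within-subject condition order:', within)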
There are many aspects here that you may want to look into further. Accordingly, the next
thing you want to do is understand how many different groups there are going to be. Usually
speaking, you can have, for example, two groups: one that you call a control group and one
that you call a treatment group.
Basically, in one group you elicit the emotion and study its effects in a particular setting, and
in the other group you study what happens when the emotions are not elicited in that setting;
so, there is one control group and one treatment group. This also depends on the different
variables: how many independent variables you want to analyze in that particular setting.
And once you have decided the type of the study design and the number of groups, it is very
important to determine the sample size.
151
When we say sample size, it simply means the number of participants that are going to be
there in the group that you are envisioning. Again, there are many different rules and ways to
determine what the ideal sample size should be. Long story short, especially from the
perspective of statistics and machine learning, you never get enough samples, right? The
larger the sample size, that is, the larger the number of participants, the better it is. But then,
depending upon the resources, the time that you have at hand and many other things, you
will want a feasible sample size, and that may depend on the study. Usually there is a rule of
thumb which says that for the study of one particular variable, you may want a sample size
of around n = 30.
So, 30 participants for one particular group is, frankly speaking, a good number and a rule of
thumb that we usually use. Nevertheless, there are more principled ways to calculate the
sample size for a particular study based on its statistical power; a small sketch of such a
power calculation is shown below. Perfect. So, we have developed the study concept, we
know the different types of design, we already know the number of groups, and we already
know the sample size.
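A minimal sketch, assuming Python with statsmodels, of a power-based sample size calculation as an alternative to the n = 30 rule of thumb; the effect size, alpha and power values are conventional choices used only for illustration.

from statsmodels.stats.power import TTestIndPower

# For a two-group comparison with a medium effect size (Cohen's d = 0.5),
# alpha = 0.05 and 80% power, this comes out to roughly 64 participants per group.
analysis = TTestIndPower()
n_per_group = analysis.solve_power(effect_size=0.5, alpha=0.05, power=0.8)
print(f'Required sample size per group: {n_per_group:.1f}')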
Next, quickly speaking, and this is very important, you want to determine the methods of
evaluation. When I say determine the methods of evaluation, I mean you want to understand
how exactly you are going to collect the expression of the emotions. There are many ways to
do this, and roughly they can be divided into five different categories.
The first category is self-assessment. It is as simple as this: you show an image to the
participant and, after showing the image, you simply ask, 'How did you feel after looking at
the image?' Or you present a video clip to the participant and at the end of the clip you ask,
'How did you feel about this particular video clip? What was the emotion that you felt?'
There are different ways in which you can ask the participant about the emotions they are
feeling. This is known as self-assessment: you want the participants themselves to answer
what emotions they felt and what
152
the intensity was, maybe, as well. Second, rather than asking them directly what they felt,
maybe you want to conduct certain interviews. So, maybe you showed them a certain number
of clips and then at the end you conduct an interview. It is a more informal setting where you
want to understand in general how they felt, and from that you want to deduce what exactly
the emotions were that were prevalent throughout the experiment.
Third, which is really interesting, is psycho-physiology. Here you may want to understand
how, while the participant was going through one particular condition, the emotions were
being expressed in, for example, the brain signals. That is where you may want to look at the
EEG signals, and that is where you look at the psycho-physiology of the participants.
Of course, to do that you will have to have some sensors in place, and using those sensors
you will collect a particular modality, which could be brain signals, heart rate, skin
conductance or many other things. There are lots of physiological signals that you may want
to look into; to collect them, many times you will need certain sensors or some external
hardware.
The fourth category, observation, is as simple as this: you are neither asking the participant,
nor making use of any sensors to collect physiological signals; rather, you are simply
observing, maybe, the entire recording. For example, you presented certain images to the
participants and recorded the entire session; at the end of it, maybe you are just looking at
the entire recording yourself, or maybe you appointed some experts, say a psychologist or a
psychiatrist, who look at the entire video. While looking at the video, the expert, or you
yourself, makes annotations saying, this is how the participant felt at this particular point of
time.
Accordingly, you are getting what we call the ground truth. Perfect; so, this is the fourth
method. The fifth method is known as task performance; it is also more commonly known as
behavioural data. So, basically, what are you
153
doing? You will be looking at the behavioural performance of the user while the user is
doing something.
Imagine that the user is performing a particular task; maybe you asked the user to look at
certain images, watch certain video clips or listen to some music. And while the user is
listening to the music, you want to understand, for example, whether the performance of the
user in solving arithmetic problems improves when the user is exposed to positive music; a
very interesting question, right?
Here you need not ask the user any question. You simply play some positive music and ask
the user to solve some arithmetic problems, and at another time you play some negative
music and ask the user to solve arithmetic problems. Depending upon the number of correct
answers that the user got at the end of this experiment, you can say what the performance of
the user was, and that is what you are going to use as the ground truth to analyze what effect
the positive music had and what effect the negative music had. So, that is one particular way
to get the ground truth from the users; a small sketch of such a comparison is shown below.
Perfect.
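A minimal sketch, assuming Python with SciPy, of comparing task performance under positive versus negative music with a paired t-test; the scores below are made-up numbers standing in for correct answers per participant per condition.

from scipy import stats

correct_positive = [14, 12, 15, 11, 13, 16, 12, 14]   # same eight participants...
correct_negative = [11, 10, 13, 10, 12, 13, 11, 12]   # ...in both music conditions

t_stat, p_value = stats.ttest_rel(correct_positive, correct_negative)
print(f't = {t_stat:.2f}, p = {p_value:.3f}')
# A small p-value would suggest the music condition affected task performance.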
Having determined the method of evaluation, note that in any particular experiment you may
want to use a mix of all these five: you may use only one, only two, only three, or any mix
of these five categories of evaluation for obtaining the ground truth. You may then want to
quickly look at the location of the study as well.
Many times you may want to conduct the experiment in a more lab-like setting where the
conditions are more under your control: you can precisely position the user, you can precisely
present the stimulus on a particular screen, and you can control the lighting, the noise, the
temperature of the room, and many other things. Or maybe you decide that you want to
collect the data in a more natural setting: you may want to put the user in a park or some
outside space and observe everything in a more naturalistic setting.
154
(Refer Slide Time: 22:29)
Of course, you will have to take a call depending upon the type of question that you are
trying to answer here. Perfect. Accordingly, once you have decided the type of location, you
may want to use other types of equipment apart from the sensor hardware. Other equipment
would include, for example, a camera, a PC or a computer system to present the stimulus, if
you are talking about an image or a video clip; or headphones, if you are trying to present
music; or, as simple as that, the screen could be a mobile phone rather than a computer
screen. There are lots of studies where researchers have tried to understand how the emotions
vary when the user is looking at something on a mobile phone versus on a computer screen.
So, there are different types of equipment that can be there, and you may want to look at all
those things. Accordingly, quickly speaking, when there is a lot of sensor hardware and
software in place, there are going to be lots of different types of failures that may happen.
You may want to create a protocol to understand what different types of failures can happen
while doing the data collection, while doing the
155
annotations, and while getting the ground truth; and accordingly, what your contingency is
going to be to deal with these types of failures.
As an example, maybe you have decided to use three different sensors to collect three
different types of physiological signals: heart rate, brain signals and skin conductance. But
then, during the experiment itself, you figure out that one of the sensors is not working, while
the other two sensors are working. Now, your contingency could be: maybe I do not have any
spare hardware, but I can continue the experiment with only the remaining two sensors,
because having two modalities is at least better than not having any modality at all. These
are the different types of contingencies that you can plan in accordance with the failures that
you can envision. Perfect.
So, you have developed the study protocol and the scripts, software and algorithms that are
required to put all these things together. Now you also need to determine the methods for the
recruitment of the participants, as I said. Here there are two types of criteria that are very
commonly used: one is known as the inclusion criteria and the other is known as the
exclusion criteria.
You may want to decide on the inclusion and the exclusion criteria: on the basis of what
criteria you are going to include participants in a study, and on the basis of what criteria you
are going to exclude participants from a study. As an example, for a particular type of study
you may be interested in including participants who are between 18 and 24 years of age, as
simple as that. Maybe you want to study only college-going students, and 18 to 24 could be a
good range; that could be your inclusion criterion. One simple exclusion criterion could be,
for example, that you are doing the study in the post-pandemic period and you want to
exclude participants who have just had COVID, as simple as that, because COVID may
impact certain things. So, you may want to list the criteria on the basis of which you are
going to include the
156
participants, and the criteria on the basis of which you are going to exclude the participants.
Perfect. We already talked in detail about the fact that you need to prepare the IRB
documents, and I already talked about the different things they involve.
Once the IRB approval is done, fingers crossed, you will start the recruitment of the
participants, and if everything goes well, this is where you will be conducting the study. At
the end of the study, hopefully you will have your data and your ground truth, using which
you can do the analysis that you wanted to do. Perfect. So, this is, in brief, how to design the
experiment.
Having understood how to design the experiments, let us now quickly look at the different
tools that the community, that we researchers, use in the affective computing field. These
tools are very helpful not only for individuals who are beginners in the field, but many times
also for individuals who are experts in the domain. They not only make your life easier; they
save time, resources and cost, and they improve the performance and the efficiency of the
systems or the services that we are trying to build.
157
(Refer Slide Time: 27:28)
Roughly speaking, all the tools that we use in the affective computing field can be
categorized into five different domains. One is the data collection category; then there are
tools for data annotation; we also use a different set of tools for signal analysis, and tools for
affect classification, which, as you may have guessed, are mostly based on machine learning
methods; and then we also use tools for the expression of affect. So, these are the five
different sets of tools that we use. We will quickly look at the different software packages
and tools that are available in each of these categories.
158
(Refer Slide Time: 28:13)
We will look at one example from each very quickly, and at the end of it we will also have a
tutorial on one particular tool. So, first, data collection: what are the data collection tools?
Data collection tools are software packages that simply help you conduct a wide range of
experiments. Here is a brief list, by no means exhaustive, of the software that is available.
Some of these are open source and some are paid software that you can look up on the
internet, and they have their own sets of advantages and disadvantages. Not to mention that,
apart from these tools, if you have good hands-on experience with programming, you can
always create your own customized scripts developed in-house by you and your group.
159
(Refer Slide Time: 29:08)
Quickly speaking, let us look at one very popular tool known as PsychoPy. At the end of this
particular module, we will also have a tutorial on the PsychoPy tool which is going to give
you a bit more information. PsychoPy is a very interesting tool: first of all it is a free
package, it is cross-platform, and it allows you, as I said, to create and run a wide range of
experiments.
It is fairly commonly used by neuroscientists, psychologists, linguists and so on. You can see
that it has a very flexible and intuitive Builder interface, as you can see on the screen; this is
kind of its home screen. The good thing about this particular package is that it provides you
with a Builder interface in which you can just drag and drop components. For example, you
can select the type of stimulus that you want to use and the type of response that you want to
record, say a keyboard response; for the stimulus you can select image, video, music and so
on. Similarly, if you want to integrate some sensors, it already gives you a list of options: you
can integrate it with, for example, some EEG devices or some eye-tracking devices.
Apart from that, it also allows you to integrate certain input/output devices, which can be
used, for example, for integrating additional sensors which may not already be in the
160
list. So, this is a Builder interface in which you can simply drag and drop components and
create experiments. More importantly, for the code lovers, PsychoPy is completely based on
Python.
So, apart from using the Builder interface, you can definitely use Python scripts to generate
the experiments. And when I say that you can use Python, it means that practically all the
libraries that are available for Python can be used here as well, and that expands the scope
enormously. Basically, this makes it very powerful if you can do a bit of programming in
Python, and for those who cannot program, there is the Builder interface that you can use to
design your own trials and run the experiments. So, this is a very nice tool, and for the same
reason, at the end of this module we will have some hands-on experience, a tutorial on
PsychoPy, which is going to make you familiar with the tool.
You will immediately be able to jump onto the software and start using it, and trust me, you
will immediately fall in love with this particular tool. Perfect. So, now you understand that
these data collection tools are used to create experiments in which you present a particular
stimulus, decide what type of data you are going to collect, and do all of this in a very
controlled fashion. A minimal code-based sketch of such a trial is shown below.
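A minimal sketch, assuming PsychoPy is installed, of a Coder-style trial: an instruction screen, a key press, then an image stimulus. The file name 'stimulus.png' is a placeholder, and exact API details may differ slightly across PsychoPy versions.

from psychopy import visual, core, event

win = visual.Window(size=(800, 600), color='black')

# Instruction screen: wait until the participant presses Y.
instruction = visual.TextStim(win, text='Press Y to see the image stimulus.')
instruction.draw()
win.flip()
event.waitKeys(keyList=['y'])

# Present the image stimulus for 5 seconds.
image = visual.ImageStim(win, image='stimulus.png')   # placeholder file name
image.draw()
win.flip()
core.wait(5.0)

win.close()
core.quit()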
161
Now, once you have collected the data, the next part is data annotation. What does data
annotation mean? Basically, data annotation is what gives you the ground truth. And why do
we need the ground truth? Because unless and until we have the ground truth, we will not be
able to make the machines understand what exactly an emotion stands for; for machines,
emotion itself is a very abstract concept.
Depending upon the type of modality that you are using, there are different options available.
For example, there is the SAM tool, the Self-Assessment Manikin, which is a very commonly
used tool; mostly it is used for self-assessment. Similarly, if you are using the audio modality
and want to do annotations on the audio, for example to note what the emotion was when the
user was saying a particular thing, you may want to use one of these tools to do that type of
annotation. Similarly, for the video modality, you again have a very nice list of available
tools. For text, only one tool is listed here, but certainly there are more options available.
And, at the end, if you are a code lover, you again have the option of creating your own
customized scripts with which you can do, or facilitate, the annotation of the data.
162
Now, let us quickly look at one particular tool which is very popular in this set: the SAM, the
Self-Assessment Manikin. As I said, the SAM is a non-verbal pictorial assessment technique
that directly measures three things: pleasure, arousal and dominance. And, if you recall, what
is PAD? As you can imagine, this is based on the PAD model, also called the VAD model of
emotion representation, that we saw in the earlier lectures.
The idea is very simple: you present a particular stimulus to a participant, and at the end of
the presentation of the stimulus you give this manikin scale to the participant. The one shown
here is a 9-point manikin scale; you can also have a 7-point or a 5-point manikin scale. For
example, you may ask the user, on a scale of 1 to 9, what was your level of pleasure, and the
user marks one of the points. Similarly, on a scale of 1 to 9, what was the level of arousal,
and the user marks a point. Similarly for the level of dominance: the user marks one of these
points, and that is how you can collect the data using the SAM.
Of course, this is only valid for the self-assessment mode of annotation. As I said, the SAM
can only be used for the self-assessment mode of data collection; for the other modes, there
are other tools available. Perfect. A minimal sketch of collecting such ratings in code is
shown below.
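A minimal sketch, assuming plain Python, of collecting 9-point SAM-style ratings after a stimulus and appending them to a CSV file; the file name and stimulus ID are illustrative.

import csv

def ask_rating(dimension: str) -> int:
    # Keep prompting until the participant enters an integer from 1 to 9.
    while True:
        answer = input(f'{dimension} (1 = low ... 9 = high): ').strip()
        if answer.isdigit() and 1 <= int(answer) <= 9:
            return int(answer)
        print('Please enter a whole number between 1 and 9.')

ratings = {dim: ask_rating(dim) for dim in ('Pleasure', 'Arousal', 'Dominance')}

with open('sam_ratings.csv', 'a', newline='') as f:
    csv.writer(f).writerow(['stimulus_01', ratings['Pleasure'],
                            ratings['Arousal'], ratings['Dominance']])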
163
(Refer Slide Time: 35:11)
So, that was the second category of tools. Now, quickly speaking, the third category of tools
is the signal processing and analysis category. This is a fairly broad category that we use to
analyze the different types of modalities that we have collected and are interested in
analyzing. Roughly speaking, the modalities could be of audio type, physiological signals, or
image or video type. These are the three or four categories, and you always have the option
of building something in-house.
For the audio category you have Praat, OpenEAR, openSMILE and WaveSurfer, which are
very popularly used tools. What do they do? They allow you to analyze and preprocess the
signals, for example to remove the noise from them, whether it is audio, physiological
signals, or image or video. At the same time, after doing the preprocessing, they also allow
you to extract certain types of features that are there in the particular signal, and then to draw
some inference from that particular modality. So, there are tools for audio, for physiological
signals, for image and video, and there is always the in-house option.
164
(Refer Slide Time: 36:23)
Let us quickly look at one particular tool that has recently come up and become very popular
for the analysis of the image and video modalities: MediaPipe by Google. MediaPipe is a
very popularly used tool as of now for the analysis of image and video modalities, and it is
the kind of framework that runs behind the scenes when you interact with the Google
Assistant at home and say 'Ok Google': what goes on behind the scenes to understand that is
facilitated by MediaPipe.
MediaPipe is a framework which is used to build machine learning pipelines for video and
images, as I said. It can very well be used for the analysis of the audio modality as well, and
it is a cross-platform framework that can run on desktop, on servers, in Android applications
and so on. That is what increases the applicability and the popularity of this particular tool.
If you look at some of the solutions that are readily available in MediaPipe, they already
include face detection. Let us say you are interested in the analysis of facial expressions; face
detection is the very first thing that you will want to do.
For that, you do not need to create any algorithm or write the detection logic yourself. There
are APIs already available in MediaPipe; you can very
165
well use those APIs, connect them with your code and with the videos or images that you
have collected, and get the face detection done; a minimal sketch is shown below.
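A minimal sketch, assuming Python with the mediapipe and opencv-python packages, of face detection on a single image; 'frame.jpg' is a placeholder file name, and the exact API may vary across MediaPipe versions.

import cv2
import mediapipe as mp

mp_face_detection = mp.solutions.face_detection

image = cv2.imread('frame.jpg')                        # BGR image loaded with OpenCV
with mp_face_detection.FaceDetection(min_detection_confidence=0.5) as detector:
    results = detector.process(cv2.cvtColor(image, cv2.COLOR_BGR2RGB))  # MediaPipe expects RGB

if results.detections:
    h, w, _ = image.shape
    for det in results.detections:
        box = det.location_data.relative_bounding_box   # normalized coordinates
        x, y = int(box.xmin * w), int(box.ymin * h)
        bw, bh = int(box.width * w), int(box.height * h)
        cv2.rectangle(image, (x, y), (x + bw, y + bh), (0, 255, 0), 2)
    cv2.imwrite('frame_with_faces.jpg', image)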
Once the face detection is done, as a next step you may want to do the analysis of the facial
expressions. Similarly, it also allows you to estimate the pose, the type of pose that the
individual has, and once you understand the type of pose, you can very well correlate it with
the ground truth of the emotions that you have collected while doing the annotations.
There are many other features as well, some of which may or may not be of much use for
affective computing research. Nevertheless, it is a very integrated and holistic framework that
you may want to look into further. So, I definitely encourage you to go ahead and look into
MediaPipe. Perfect.
Then we have the fourth category, known as the data mining tools. What are these data
mining tools? By now you have understood that we have already created an experiment,
collected the data and annotated the data. After the annotation, we preprocessed the data, and
after the preprocessing we were able to do certain types of analysis as well, maybe using
MediaPipe and similar tools.
166
As the next step, you want to understand the emotions that are actually there in the data that
you have collected. In order to do so, you now have to involve machine learning and deep
learning tools, and there are lots of sophisticated, state-of-the-art machine learning and deep
learning tools available.
While the development of machine learning and deep learning algorithms may be a very
research-focused topic in the affective computing community, there are lots of tools already
available which implement machine learning algorithms for you. You can simply use them
on the preprocessed data that you obtained from the previous set of tools, to understand the
type of emotion that you are looking at. This is a fairly broad list of tools, and those of you
who are already working in the machine learning or deep learning domain may already be
familiar with some of them.
But, just for the sake of completeness, we will have a look at the WEKA tool. WEKA is
again a fairly popular and common tool. The good thing is that it is open-source software,
and it makes available
167
implementations of various machine learning algorithms, as you can also see on this slide.
For example, it allows you to use algorithms for preprocessing, classification, clustering,
association mining, feature extraction and feature selection, and of course for visualization of
the data. So, there are different categories under which different algorithms are available and
implemented by the WEKA tool.
Again, it has a very nice GUI, a builder-style interface that you can simply use. For example,
whatever data you have collected, you can simply open that file from here and choose a
particular type of technique that you want to run on the data. Under that, you can select the
specific algorithm that you want to use. For example, if you have selected classification, you
may want to do the classification using, say, decision trees or a support vector machine. What
exactly the algorithm will be depends, of course, on many different things.
Here, of course, if you have a bit of understanding of machine learning and deep learning,
that will be really helpful. If not, I definitely encourage you to go ahead and take some
courses or tutorials on machine learning and deep learning in order to understand at least
what these different terms mean. And if you already have a background in machine learning
and deep learning, you are good to go. An analogous classification workflow in Python is
sketched below.
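Since WEKA itself is driven through its GUI, here is a minimal sketch of an analogous classification step in Python with scikit-learn, in line with the course's other Python-based tooling. The feature matrix X and labels y are random placeholders standing in for the preprocessed, annotated affect data.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Placeholder data: 100 trials, 20 features, 3 emotion classes.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
y = rng.integers(0, 3, size=100)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

for name, clf in [('decision tree', DecisionTreeClassifier(random_state=0)),
                  ('SVM', SVC(kernel='rbf'))]:
    clf.fit(X_train, y_train)
    acc = accuracy_score(y_test, clf.predict(X_test))
    print(f'{name}: test accuracy = {acc:.2f}')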
168
(Refer Slide Time: 42:30)
So, WEKA is a very simple but very powerful and commonly used tool to analyze the data
that you already have. That is about the WEKA tool. Now, the fifth category of tools lies
under affect expression. What have we seen until now? We have collected the data, annotated
the data, and preprocessed the data using some signal processing tools. On the preprocessed
data, we applied certain machine learning algorithms using WEKA-like tools. Perfect. So far
it looks good, and at the end of it you should be able to understand the emotions that are
there in the data. But what next? Using these detected emotions, or even without going
through all this exercise, you may want your machines or your services to be expressive: they
should be able to express emotions.
This expression of emotions could be in accordance with the emotions that you have
detected, for example using the previous set of tools, or maybe you want them to express
certain emotions because that is how you have designed your service or machine. Here again
there are lots of readily available tools, and frankly speaking, the expertise that you need to
make use of these tools varies hugely.
For example, on one hand you have the ICT Virtual Human Toolkit, which is fairly popular
and relatively easy to use. But then you also have frameworks like ROS,
169
the Robot Operating System, which is a more sophisticated tool, and you may need a bit
more experience in order to use it. So, you may want to explore all these different tools and
see which one fits your needs better.
Quickly, for the sake of completeness, let us look at the ICT Virtual Human Toolkit. As you
can see on the right-hand side, it is basically a collection of modules, tools and libraries
which are designed to aid researchers and developers in creating virtual human conversational
agents. For example, there are different options: under the category of resources, you can
pick from the different types of agents that are already available, and once you have picked
one of those agents, you can customize it using the different options that are available in the
panel on the right-hand side.
It also has different modules that allow you to work on natural language interaction with the
agent, to analyze the non-verbal behaviour of the agent, and to look at the perception the
agent has of its surrounding environment.
170
So, I definitely invite you to go ahead and explore the toolkit to gain a bit more
understanding of it.
With that we come to the conclusion. Quickly speaking, we looked at the different categories
of tools that are available for you to explore, under the categories of data collection,
annotation, analysis, classification and expression. All the references that we have used are
given in the references list, and they will definitely be made available to you separately as
well. So, with that, I finish here and wish you happy learning.
Thanks.
171
Affective Computing
Prof. Jainendra Shukla
Prof. Abhinav Dhall
Department of Computer Science and Engineering
Indraprastha Institute of Information Technology, Delhi
Week - 03
Lecture - 09
Experimental Design: Affect Elicitation, Research and Development Tools
Hi friends, welcome to today's module. In the last lecture we learnt about the experimental
methodology, in which we learnt about different sensors. In today's class we will be exploring
PsychoPy and how to use these sensors. PsychoPy is an open-source Python tool which is
widely used to create experiments in neuroscience and experimental psychology research.
The user can control which stimuli they want to present, in what order and for how long.
Similarly, response components allow us to record different types of responses like key
presses, mouse clicks, vocal responses, facial data, etcetera. Stimulus and response
components are organized within routines, which are sequences of events within one
experimental trial. PsychoPy also provides an option to include custom Python code, which
can be embedded at the beginning or end of the experiment.
Stimuli can be added to different routines by adding text, picture, keyboard, mouse and other
sensor components; we will be looking at how to integrate them later in this video. Also,
since PsychoPy uses the Python programming language in the background, we can use
custom code components to add features which are not available in PsychoPy out of the box.
Loops can be added for repeating one or several routines, including the stimuli, the user
responses, the recording of facial data, etcetera.
172
(Refer Slide Time: 02:43)
For installing PsychoPy, please visit the official website: www.psychopy.org/download.html.
After downloading PsychoPy, we open the PsychoPy Builder. On opening, we are shown this
timeline, where we have an initial 'trial' routine which is created by default. We will now be
creating a 'pretrial' routine.
173
(Refer Slide Time: 03:10)
For that, we click on the 'Insert Routine' button and add a pretrial. On clicking, we get an
option to insert the pretrial before or after the trial. Here we want to insert it before the trial,
so that the pretrial comes before the actual experiment. Now we can see that we have two
routines: trial and pretrial.
174
(Refer Slide Time: 03:25)
For the pretrial, we will now add a text component. We name it 'text'; this is its variable
name. Next, we set the starting time and the stopping time: here we want it to continue until
3 seconds. Then we can add the custom text that we want; let us say we add 'Experiment is
about to begin. We will be recording the facial data while you watch a video stimulus. Press
Y to continue.'
175
For recording the experiment, we should always take the consent of the participant who is
actually giving their data. So, here we provide the consent option as well: by pressing Y, the
participant is providing consent to record the data.
176
(Refer Slide Time: 05:25)
Here we add a keyboard response component, and we want it to start at time t = 1 second
and last for a duration of 2 seconds, and we want it to accept the Y key. Now we can see that
we have created a 3-second pretrial where a text is displayed and a key response is taken
from 1 second to 3 seconds.
Now, moving on to the trial routine, here we want to present a video stimulus to the
participant.
177
(Refer Slide Time: 05:47)
Here we click on the video option and name this variable 'movie', and we want it to start
after the pretrial is completed, so we start it from 3 seconds. We also want to show the whole
video; the whole video is around 2 minutes 44 seconds, which is about 164 seconds, and it
will start from time t = 3. Now we need to provide the path of the video, so here we add the
path of the video, and we want to end this routine after the completion of the video.
178
(Refer Slide Time: 06:22)
Now we can see that the movie stimulus has been added from time t equal to 3. We also want to add a camera option. For the camera option, we will use the custom code component and write the code ourselves.
(Refer Slide Time: 06:38)
We want this code to run at the beginning of the experiment, so we add the demo code there. In the code we import cv2 and create a video capture object for the external camera that we have attached. After that, we specify a FourCC codec and a video writer so that the recording is saved in the data folder, and we provide a custom frame size of 640 x 480.
(Refer Slide Time: 07:15)
Also, on each frame of the experiment we want to grab the current camera frame and save it to the output file. For that we read the frame (ret, frame) from the capture object and then write that frame to the output file.
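To make this concrete, here is a minimal sketch of what such a custom code component might look like; the camera index 0 and the output path data/recording.avi are assumptions for illustration, and in PsychoPy the three parts would normally go into the Begin Experiment, Each Frame and End Experiment tabs of the code component.

import cv2

# --- Begin Experiment: open the camera and the output video writer ---
cap = cv2.VideoCapture(0)                          # external camera; index 0 is an assumption
fourcc = cv2.VideoWriter_fourcc(*"XVID")           # FourCC codec for the output file
out = cv2.VideoWriter("data/recording.avi",        # hypothetical output path inside the data folder
                      fourcc, 30.0, (640, 480))

# --- Each Frame: grab one camera frame and append it to the output file ---
ret, frame = cap.read()
if ret:
    frame = cv2.resize(frame, (640, 480))          # enforce the 640 x 480 frame size
    out.write(frame)

# --- End Experiment: release the camera and finalize the video file ---
cap.release()
out.release()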
(Refer Slide Time: 07:46)
Now, clicking on File and then Save As, we save the PsychoPy experiment that we have created.
(Refer Slide Time: 07:58)
Now we start the experiment. While it plays, it will show me the stimulus and the facial camera will record my expressions as I watch the video. I am starting the experiment: on clicking the play option, the experiment starts and a dialogue box opens asking for the participant ID and the session. On clicking OK, the experiment starts and PsychoPy begins running. Initially we are shown the pretrial screen and then the trial routine.
Here the pretrial routine is shown, and by pressing Y we start the actual experiment.
Now the experiment has completed successfully and we can see that the data has been saved. Let us go to the data directory.
(Refer Slide Time: 11:28)
(Refer Slide Time: 11:31)
An output file of the camera recording has been generated. If we open it, we can see the camera recording that was produced, and if we look at the video we can see the video stimulus along with the corresponding facial expressions. So, while watching the video we were actually recording the facial camera data, and it has been saved for the same time period as the stimulus.
In this lecture we have seen how to create an experiment using PsychoPy. We built a demo experiment, recorded facial data while a participant watched a video stimulus, and saved that data. This data can be used by researchers for their future research.
Thank you.
Affective Computing
Prof. Jainendra Shukla
Prof. Abhinav Dhall
Department of Computer Science and Engineering
Indraprastha Institute of Information Technology, Delhi
Week - 04
Lecture - 01
Automatic Facial Expression Recognition
Welcome to the lecture on Automatic Facial Expression Recognition as part of the affective
computing course. Now, imagine you are talking to a friend. The friend tells you about how
his or her day went. Now, just by looking at their facial expression, their body gesture and
listening to them through the speech, you can tell what is their mood.
Did they have a good day? Are they comfortable in the conversation? And based on that, you
are going to give an apt reply. Now, we want to have the same in a machine. For that, we
actually look at the methods of automatic facial expression recognition. Now, what will that
mean?
We want to understand the emotion of a user or a group of users. For that, we will be looking
at the cues which the user is giving to the machine. Now, these cues again will be in the form
of their facial expressions, their head gestures as you can see me now, I am right now
nodding and also their body gestures.
So, how can we track this information and then tell a user interface how the user is reacting to the system, and in this process have a more efficient interaction with the system? Now, emotions are deeply embedded in our minds. Whenever we are trying to interact with someone, the cues which we get from them give us some idea about their emotion.
However, without the context, it is difficult to understand someone's emotion. But the closest proxy to that is the facial expression. One can say that during a conversation, looking at the facial expression of a person can give you a glimpse of the emotion of that person. Therefore, we are going to do two tasks in automatic facial expression recognition.
(Refer Slide Time: 02:23)
The first is your emotion recognition. We want to understand the deep emotion of the person.
The second is your expression recognition. Assuming that the facial expression of a person is
the window to their emotion, we want to understand and then categorize the different facial
expressions which can tell a machine how is the user feeling.
Now, friends, an example of that is on the slide. You see a person here. So, the face is
detected. And based on the lower lip, the upper lip. And if you notice the movement here
around the chin and the eyes, one can tell that this person is happy. Now, if this person is
happy, the system can react accordingly.
One can also go a further level down and then look at what are those muscles which are
activated. Now, these are action units which correspond to the different facial parts. And in
the later part of my lectures, I will be discussing with you about facial action units as well.
Now, typically, to understand emotions there is a wide variety of sensors available. Some of them are obtrusive, that is, you put a sensor on the person; some of them are non-obtrusive. In today's class, we will be focusing mainly on the non-obtrusive sensors.
(Refer Slide Time: 04:00)
Now, the first one is the EMG, your electromyograph. Then you have your electrocardiogram, which is the ECG. And then you have the electroencephalogram, which is the EEG. So, EMG, ECG and EEG, as you can see here, are actually placed on the body of the user.
What that means is the user may be aware that there is a sensor attached to their body. Therefore, if he or she is aware that the system is trying to understand their emotion, they may give biased signals. Of course, these sensors have improved over time.
You already have wearables these days, for example smart watches. They have become so ubiquitous that the user does not always actively realize that they are trying to measure the physiological signals of the body, for example the heart rate and the blood pressure.
What we are more interested in today are the camera-based signals. Now, you attach a camera to, let us say, the wall in a room and then you would like to understand the expressions of the person. The usefulness is that, as facial expressions are the gateway to the emotions of a person, you can get a large amount of data at a very low cost.
Camera sensors are not very expensive these days; you get cameras in different forms, from CCTV cameras to the front and back cameras in smartphones. Further, the biggest advantage is the non-intrusive nature.
You can have a person being analyzed by placing a camera which is further away from the person. Of course, that does not mean the camera does not intrude into their personal space at all, but for the sake of discussion let us say it is fairly non-obtrusive.
Now, going further down into facial expression recognition, there are two important categorizations which I would like to make. The first is based on the type of data which you are receiving, using which you would like to categorize a particular face into a subset of expressions.
The first one is your static facial expression recognition. Now, as the name suggests, you
capture one frame containing a person and you analyze the information in that frame to
predict the expression and later the emotion of the person. And whenever we are saying
emotion, please notice most of the time we are referring to the perceived emotion.
Perceived emotion is you are talking to someone, now what is your understanding of the
emotional state of that person? Of course, the other side of that is that a person tells you
yourself that this is how I am feeling, so, that would be the self-labeled emotion. Now,
typically what you will do in static facial expression recognition is, you have a frame and
then you would extract some statistics, some features from that.
Now, obviously, the features would correspond to the different shape which you see in your
face. For example, if you notice me right now, I am smiling. Now, as I smile, the lips, the
muscles around them elongate, right. So, we would need a feature, a representation, which can tell the difference between a neutral face and a smile.
Now, typically these frames for static facial expression recognition are selected from a series of frames which are captured by a camera. One may also say, I will use selected peak frames; there are different methods with which you can choose a frame.
For example, one could say, well, I would like to select one frame every second and then analyze the expression. Another way could be that we only choose a frame for analysis when there is a substantial movement or a substantial change in the frame.
Now, this idea of course, is borrowed from the world of video compression, where we are
looking for the change in terms of the frame and we will use one frame as a reference frame.
So, the peak expression frame is actually a reference frame. Now, let us say you have a series
of frames coming in, why not actually analyze the series of frames together and do what we
refer to as spatiotemporal facial expressions.
Since expressions are dynamic in nature, what does that mean? Again, if you notice me, friends, I am going to smile. Right, so when I started to smile there was an onset of the smile, then the expression reached a peak, and then there was an offset. So, there is a temporal movement which happens through the expression.
In the context of frames, you have movement happening in the lip region and the other parts of the face both in space, that is within a frame, and in time, across frames. That is what we mean by spatiotemporal.
(Refer Slide Time: 10:36)
Now, further it has been observed in a lot of studies that this spatiotemporal facial expression
which is also referred to as your dynamic facial expression recognition, it achieves a better
recognition performance as compared to a single frame based static facial expression
recognition.
The reason is quite obvious: expressions are dynamic, and hence when you are analyzing a series of frames you get extra information about the onset, apex and offset of the expression. These states of the expression then give you information not just about what the expression is, but also about when the person started, let us say, to smile and when that smile episode ended.
It is extremely important to be able to localize the expression in time as well, because one thing is to understand the expression and the perceived emotion of a person, and the other is, once you understand the expression and the emotion, how do you want your machine to react? Now, I have already told you the advantage of dynamic facial expression recognition, the spatiotemporal analysis, as compared to static facial expressions. There is a drawback.
When we are extracting dynamic features from a series of frames, they can have different transition durations and different feature characteristics depending on the particular face. Different people will have a different reaction to the same stimulus; same stimulus means, let us say, a joke is cracked in front of two people and one person laughs more than the other.
Now, what that means from an affective computing perspective is that the laugh event will have different durations for the two persons. If they have different durations, then your system, which is extracting some type of features, some statistics around that event, needs to be agnostic to the different durations: small laugh, long laugh.
Further, of course, when you are analyzing the information over a series of frames, you should expect a much larger compute requirement. When you combine these frames there is a lot more data, which simply means we will need a more powerful machine.
Now, in the context of both static and dynamic facial expression recognition, here is a typical pipeline. In the beginning, here you see we have a series of frames, so this is t, and what you are doing is capturing these frames; later, you want to detect the location of the face.
So, where is the location of the face? This location can be detected using various object detection techniques. If you look at open source systems, there are methods such as YOLO, which stands for You Only Look Once. Then there are face-specific detection systems, for example OpenFace, which is an open source face detector and face tracker as well.
So, what does that mean? You detect the location of the face in the first frame and then, in the consecutive frames, the mth frame and the nth frame, you track where this object has moved across the corresponding frames. That is a tracker. Now, once you have the face localized, that is, the location of the face in an image, you would be interested in understanding the locations of the different facial parts, the eyes, the lips and so forth. For that, we do landmark detection.
Again, for landmark detection there is a large set of open source libraries. For example, OpenFace does that, there is one called IntraFace, and there is another one called ZFace. So, you can detect a face and then find the landmarks, the locations of the different facial points in the image.
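As a rough illustration of the detect-then-landmark step, here is a minimal sketch using OpenCV's Haar cascade face detector together with dlib's 68-point landmark model; this is not the OpenFace, IntraFace or ZFace pipeline itself, and the shape_predictor_68_face_landmarks.dat model file is an external download whose path here is an assumption.

import cv2
import dlib

# Haar cascade face detector shipped with OpenCV and dlib's 68-point landmark model.
face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
landmark_model = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def detect_face_and_landmarks(frame_bgr):
    # Returns the first detected face box and its 68 (x, y) landmark points.
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return None, []
    x, y, w, h = [int(v) for v in faces[0]]
    shape = landmark_model(gray, dlib.rectangle(x, y, x + w, y + h))
    points = [(shape.part(i).x, shape.part(i).y) for i in range(shape.num_parts)]
    return (x, y, w, h), points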
Now, once we know where the facial parts are, we would be interested in extracting features. These features are some statistics about the face which essentially create a summary. This summary tells me, for example, whether the mouth is open in this frame or closed in another frame.
Once you have extracted this summary, essentially you have the feature, and you are going to use a machine learning algorithm to do classification. Here, what you are seeing essentially is that each data point represents one face: this is one face, this is another, this is another, and let us say all the other points represent faces where the person is not smiling.
So, you can say well, I have a binary task which is smile versus non-smile. So, I would be
learning this boundary which can segregate the smiling faces from the non-smiling faces.
Now, of course, there is a large number of factors which go into finding the optimal boundary.
From the context of affective computing, what that means is you need the right facial expression feature, and we will be studying some of these features in the coming slides. The other requirement is high quality localization of the face and the facial points. So, if you were to extract some statistics around the lip region, you would need good localization, that is, precise locations of the facial points with respect to the lip region.
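A minimal sketch of the feature-plus-classifier step, assuming we already have one feature vector per face; the random feature matrix, labels and the linear SVM below are placeholders for illustration, not the course's specific setup.

import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

# Placeholder data: each row is a feature vector extracted from one face,
# each label is 1 for "smile" and 0 for "non-smile" (purely illustrative).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 128))            # 200 faces, 128-dimensional features
y = rng.integers(0, 2, size=200)           # binary smile / non-smile labels

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# A linear SVM learns the boundary separating smiling from non-smiling faces.
clf = SVC(kernel="linear")
clf.fit(X_train, y_train)
print("held-out accuracy:", clf.score(X_test, y_test))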
(Refer Slide Time: 18:12)
Now, this was conventional facial expression recognition. Since 2012, we have seen tremendous progress in machine learning and related areas due to deep learning based approaches. So, of course, you can also do automatic facial expression recognition with neural networks. What will that mean? Again, here you have a series of frames.
And what you are doing is, you are taking these frames and you are giving it as an input to a
neural network. Now, this neural network, it is going to learn these feature representations
which are going to correspond to the final goal here. Now, let us say again, similar to our
earlier example, we are doing smile versus non-smile. So, for this binary task, you have a set
of these samples and then you learn a convolutional neural network. Ok.
Now, it is going to learn these features based on the filter weights, and these filter weights are learned from the training data. What that means, as compared to the earlier hand-engineered feature approaches, is that the system will learn not only the larger shape changes but the subtle expressions as well, which helps it do better classification: smile versus non-smile.
Now, this is also sometimes referred to as an end-to-end learning approach. What that means is you input the image directly; in some cases you no longer even do face detection, because you know that most of the image is already the face.
You can then expect the network to learn, for example, to find the relevant location in the image: the early layers will give more importance to the relevant region, which is the face, and the later layers will extract finer features.
These finer features are, let us say, the location of the lips and the location of the eyes. That then goes into your fully connected layers, and later you have your classification. Now, this deep learning based facial expression recognition is very prevalent; it has been shown to perform much better than the traditional hand-engineered features.
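To illustrate the end-to-end idea, here is a minimal PyTorch sketch of a tiny convolutional network for binary smile versus non-smile classification on 48 x 48 grayscale face crops; the architecture and input size are illustrative assumptions, not the specific networks used in the literature.

import torch
import torch.nn as nn

class SmileNet(nn.Module):
    # Tiny CNN for binary smile vs. non-smile classification (illustrative only).
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32 * 12 * 12, 64), nn.ReLU(),
            nn.Linear(64, 2),                       # logits for smile / non-smile
        )

    def forward(self, x):                           # x: (batch, 1, 48, 48) grayscale faces
        return self.classifier(self.features(x))

model = SmileNet()
dummy_faces = torch.randn(8, 1, 48, 48)             # a fake batch of 48 x 48 face crops
print(model(dummy_faces).shape)                     # torch.Size([8, 2])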
Of course, there is a drawback here. The drawback, as with most supervised neural network approaches, is that you need a large amount of training data. Collecting training data is a non-trivial task when it comes to facial expression recognition, and we will discuss some aspects of that later.
The other drawback, of course, is the high energy requirement, not just for training the deep neural network but also at inference time. In the end you will use more energy as compared to traditional hand-engineered features.
So, what that means is there is a trade-off between hand-engineered facial expression recognition systems and deep learning based ones, and the trade-off is simply between the accuracy which I need and the energy which I am ready to spend to achieve that accuracy.
(Refer Slide Time: 22:55)
Now that we have classified facial expression recognition systems into static versus dynamic and, later, hand-engineered versus deep learning based, let us talk about how we are going to represent the expressions.
The first representation is your macro expressions. What are macro expressions? These are the obvious, easily understood expressions. You talk to a person, you see them smiling and you say, well, the person is happy because he or she is smiling. Look at these images: these are the obvious expressions. This person is happy, this person is sad, and here you actually have more of a neutral expression, with a bit of contempt as well. These macro expressions are visually observed through the major facial locations, also referred to as the salient points of a face. What are the salient points?
These are the points which give us the movement of the different facial parts, sometimes referred to as the non-rigid part of the face. Macro expressions typically last from about half a second to 4 seconds and, given that they are macro expressions, they will match the content and the tone of what is being said.
An example is as follows: I am very happy to speak to all of you about facial expression recognition. Now, when you look at my expression and you hear my tone, you can correlate
that, right. Now, further, there has been a lot of study about what these macro expressions can be.
The first, and somewhat simpler, representation of macro expressions is the universal expressions. These typically are anger, disgust, fear, happiness, neutral, sadness and surprise. For a couple of decades, it was assumed that these are universal expressions: irrespective of the nationality, irrespective of the culture from which a person comes, they are always observed against the same type of stimuli.
Only recently it has been found that not all of these, the 6 plus neutral universal expressions, are found across all cultures. Anger, happiness and surprise are commonly found across cultures, but you would have noticed that people coming from different regions of a country or from different cultures express their emotions in different ways.
So, these are no longer strictly universal, but they are still an easy way to represent facial expressions. And, as with any affective computing problem, the choice is based on the ultimate understanding which you want to have of the affect, the emotion of the person. So, depending on how serious the use case is, we can choose a subset or all of these universal expressions.
And, as the name suggests, for these universal macro expressions it is easier to collect data because we are creating only one label per image; for example, this image has the subject showing a happy expression.
(Refer Slide Time: 27:28)
Now, if you want to dwell a bit deeper, there is a concept of micro expressions as well. Now,
what are micro expressions? Micro expressions are spontaneous and subtle facial movements
that occur involuntarily. Please note, these are the ones which occur involuntarily. And they
tend to reveal a person's genuine and underlying emotion within a very short period of time.
So, as compared to macro expressions, your micro expressions, they are closer to the
emotion. Now, what do micro expressions look like? For example, notice this image here: there is a subtle twitch here, a movement of the eye there. This involuntary, short-duration muscle movement is your micro expression. You would have seen this in popular science TV shows as well, where investigators are able to tell whether a person was telling the truth or trying to hide something; for that, they analyze the micro expressions.
Micro expressions typically last for a very short time, of the order of a few hundred milliseconds. What that means is that to the naked eye, to an untrained observer, they are difficult to catch. Even an expert who has been trained to recognize micro expressions will really have to focus because of the short duration.
What that also means is that when you create a dataset for micro expressions, it is more difficult to capture and then label these expressions. So, typically the cameras which are used to create these large datasets, where micro expressions are labeled, have very high frame rates.
For example, you will see some datasets coming from the University of Oulu in Finland recorded at 100 to 200 hertz. Because the event is short, you record more frames, and that way you have a sufficient amount of information which you can later use to extract your features.
Now, the third representation, friends, is the facial action coding system, also popularly referred to as FACS. After Ekman proposed the universal expressions, he and Friesen also proposed the facial action coding system. What is it? Well, it describes the facial muscles which get activated when a person shows a particular expression. So, typically, your smile will be a combination of different facial muscle movements. Here you see a set of frames; these different expressions are
essentially combinations of different facial muscles moving, which means different facial actions have occurred.
Here you see a subset of facial action units: AU1, AU2, AU5, up to AU27, and their correspondences. Notice how localized they are; they do not describe the whole face, only a local region. For example, for this lower lip movement you have AU15, which corresponds to the lip corner depressor.
When you have AU23, that means the user has tightened their lips. Now, what we will typically do is combine certain AUs, maybe AU1 plus AU27, and then you could say this combination corresponds to a certain expression.
Here are some of the mappings, friends. Let us say the facial expression is happy: then action unit 12 and action unit 25 are activated. Let us look at the other side, negative expressions: when the category is 'sadly disgusted', action units 4 and 10 are activated.
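As a small illustration, such expression-to-action-unit mappings can be kept as a simple lookup table; the two entries below are only the examples mentioned above, and the helper function is a hypothetical sketch rather than part of FACS itself.

# Illustrative mapping from expression categories to activated action units,
# using only the two examples mentioned above.
EXPRESSION_TO_AUS = {
    "happy": [12, 25],              # AU12 (lip corner puller) + AU25 (lips part)
    "sadly disgusted": [4, 10],     # AU4 (brow lowerer) + AU10 (upper lip raiser)
}

def expressions_from_aus(detected_aus):
    # Return the expressions whose action units are all present in the detected set.
    detected = set(detected_aus)
    return [name for name, aus in EXPRESSION_TO_AUS.items() if set(aus) <= detected]

print(expressions_from_aus([4, 10, 12]))    # ['sadly disgusted']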
So, ultimately, from a facial expression recognition system perspective, that means you first need to detect the action units. Then you can use this information to go deeper into things such as perceived emotion, or you can use action units to, let us say, predict things such
as observed pain in a patient; so, you can train a system for that. Now, we will stop here, and in the next lecture we will cover further aspects of facial expression recognition.
Affective Computing
Prof. Jainendra Shukla
Prof. Abhinav Dhall
Department of Computer Science and Engineering
Indraprastha Institute of Information Technology, Delhi
Week - 04
Lecture - 02
Automatic Facial Expression Recognition
Welcome to the 2nd lecture in Automatic Facial Expression Recognition as part of the
Affective Computing Lecture Series. So, friends, in the last lecture we discussed about the
components of typical automatic facial expression recognition pipeline. We also talked about
the different variations with respect to the type of data.
You can have static facial expression recognition, that is, only one frame; or you can have dynamic facial expression recognition, where a series of frames comes in. Further, we also talked about the types of expression labels. You can have your macro labels, which are the universal expressions: anger, disgust, happiness, sadness, surprise and so forth.
And you can have micro expressions, the twitch of the eye, the subtle movement of the lip corner. These are involuntary movements, but they are extremely important for understanding the affective state, the emotional state of a person. Then we talked about the facial action coding system, where you have the action units, different facial muscles coming together to create a smile or an expression of surprise, and so on.
(Refer Slide Time: 01:44)
Now, what you see here on the slide is your typical pipeline of a facial expression recognition
system. You have the object detection where frames come in, you get the face you get the
facial landmarks. And then as we discussed the last time, we are interested in extracting
useful meaningful features, statistics around the facial region, the facial parts.
So, we are going to now deep dive into the facial expression features and see how these
features are extracted, how they have been proposed over the past 20-odd years, and what their advantages and disadvantages can be.
So, we are going to start with the simplest of features. Let us say here I have a face; in this frame I draw a smiley, alright, so assume that this is your face. What I can do is again detect the face using an object detector and then find the locations of the facial parts, the landmark locations.
Once I have these landmark locations, so I am just pinpointing them here around the facial parts, I can extract what are referred to as geometric features. As the name suggests, we are interested in the facial geometry, and this facial geometry is a function of the facial points which you extract using the landmark detector.
Now, clearly, at least for macro expressions, there is a strong relationship between the movement of the facial components and the feature constructed this way. So, when you use these geometric features, you can understand whether a person is smiling or sad, because the facial points move accordingly.
In one of the very interesting works, Ghimire and Lee in 2013 proposed that you can have two types of geometric features. One is based on the positions of landmark points, for example the tip of my nose, and the other on angles; so, when I smile, notice my face, the angle between the outer corner of the lip and the lower centre of the lip changes.
You use the angles and the positions, with some number of facial landmark points, and you can analyze the expression based on the angles and the Euclidean distances between points. This becomes a feature. For example, suppose in a face the distance between a point on the upper part of the lip and one on the lower part of the lip is large,
let us call it D1, and it is greater than another value D2, which for the same face you obtain from a neutral expression. Then you can say, well, the mouth is open, so the chances are the person is showing, for example, a surprised or a happy expression.
Notice that in this work the authors used 52 facial landmark points; you can have a larger number of points as well, and the points depend on the type of landmark detector which you are going to use.
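Here is a minimal sketch of the two kinds of geometric features described above, distances and angles computed from an array of (x, y) landmark points; the landmark indices for the lips are placeholders, since the actual indices depend on the landmark scheme of the detector being used.

import numpy as np

def euclidean_distance(p, q):
    # Distance between two (x, y) landmark points.
    return float(np.linalg.norm(np.asarray(p) - np.asarray(q)))

def angle_at(vertex, a, b):
    # Angle in degrees at `vertex`, formed by the segments vertex->a and vertex->b.
    v1 = np.asarray(a) - np.asarray(vertex)
    v2 = np.asarray(b) - np.asarray(vertex)
    cos = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2) + 1e-8)
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))

# landmarks: an (N, 2) array from a landmark detector; random values stand in here,
# and the indices below are hypothetical placeholders.
landmarks = np.random.rand(52, 2) * 100
UPPER_LIP, LOWER_LIP, LIP_CORNER = 30, 40, 35

mouth_opening = euclidean_distance(landmarks[UPPER_LIP], landmarks[LOWER_LIP])   # the D1 above
lip_angle = angle_at(landmarks[LIP_CORNER], landmarks[UPPER_LIP], landmarks[LOWER_LIP])
print(mouth_opening, lip_angle)   # compare D1 against D2 from a neutral frame of the same face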
(Refer Slide Time: 05:06)
It has been noticed that shape or geometric features by themselves are not always sufficient. The reason is as follows: you are trying to analyze the expression of a person, and the person, let us say, gives a smirk; notice my face, right, I give a smirk. This smirk, if you have a smaller number of points, will not be appropriately captured by the landmark points.
Another example is when action unit 6 and action unit 7 come together and you see the narrowing of the eye aperture. If you have a smaller number of facial points, the geometric feature will have insufficient information. Therefore, in a large number of applications we use a combination of not just geometric features but also features extracted from appearance or texture.
Adding appearance or texture helps in discriminating scenarios, for example, where action unit 6 is present but action unit 7 is not; that gives you wrinkles at the eye corners, which is very vital information. Now, friends, let us talk a bit about appearance features.
(Refer Slide Time: 06:38)
Now, appearance essentially means the skin component of your face. So, we would like a texture feature descriptor that can analyze the different parts of a face such that you have information about the change in the skin when a facial muscle moves. A large number of approaches have been proposed in the literature for appearance-based expression features.
The simplest is: given a face, you can use the pixel intensity values directly. What that means is, let us say here you have a frame with a face in it; you analyze each and every pixel of the face and then create a representation based on the pixel information.
For example, you can create a histogram: over the intensity range 0 to 255 you create n bins, say 0 to 15, 16 to 31, and so on up to 255; you scan all the pixels and, for each pixel, increase the frequency count of its bin.
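A minimal numpy sketch of the binned intensity histogram just described, using 16 bins of width 16 over the 0 to 255 range; the random array stands in for a cropped grayscale face region.

import numpy as np

# Stand-in for a cropped grayscale face region (values 0..255).
face = np.random.randint(0, 256, size=(64, 64), dtype=np.uint8)

# 16 bins of width 16: 0-15, 16-31, ..., 240-255.
bin_edges = np.arange(0, 257, 16)
hist, _ = np.histogram(face, bins=bin_edges)

# Normalize so the feature does not depend on the size of the face crop.
feature = hist / hist.sum()
print(feature)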
Now, that is another feature descriptor based on pixels. A big limitation of pixel-based appearance analysis is the sensitivity of pixel intensities to changes in illumination. When the lighting condition changes, the pixel intensities change. That means, let us say in image 1 a certain expression was shown, and then image 2 was captured after a while with the subject showing the same expression
but in different lighting; then the histograms which you create based on the intensities will differ. Notice, this histogram here and that histogram there come from the same facial expression of the same person, yet they are different because the lighting condition has changed.
Another issue is the effect of translation. What that means is, in frame 1 you had your face at one position, and in frame 2 the same person has moved a bit, let us say by a few pixels. The expression remained the same, but since you are using intensity-based features, some new information is added and some information goes out; you will not always get the same feature representation.
If you were to divide the face into patches, local regions, so that you can compare it region by region, a feature from here against a feature from there, the translation will again have an effect.
Therefore, in the literature it was proposed to use orientation, that is, gradient-based information. It has been shown that our eyes are also sensitive to changes in shape. Similarly, for analyzing the change in shape which accompanies a facial expression, you can use the gradient information.
Researchers have used the standard Gabor wavelet filters to look at the different orientations, the different changes in the facial expression when you apply different
filters, and then compute, let us say, a histogram-based or similar representation. More popular descriptors are the Histogram of Oriented Gradients (HOG) and the Scale Invariant Feature Transform (SIFT). Now, let us discuss these two.
Your histogram of oriented gradients for facial expression analysis works as follows. You have a face which is detected using an object detector. You divide the face into non-overlapping blocks, for example as I am doing here. Then you compute the edge map,
so that you get the gradient: when you compute the change in the x direction and the y direction, you get the gradient information. Then, for each block, you create a histogram of gradient orientations; for block 1 here, let us say you get H1. You append that with the histogram from the second block, and you continue for, let us say, all N blocks.
There are some post-processing operations which are done to increase the robustness of this feature to illumination, for example normalizing it based on the variance and the mean. This is an extremely powerful feature which, when given to your machine learning algorithm, should be able to differentiate macro expressions and micro expressions as well.
The other descriptor, friends, which is extremely popular, is the Scale Invariant Feature Transform. This was proposed by Professor Lowe in the early 2000s. SIFT-based facial expression recognition works as follows: you have a face and you run the SIFT interest point detector, which gives you a set of points.
These points are the locations where change is happening in the texture in all directions. Once you get these SIFT points, very similarly to the histogram of gradients, you take the region around each point; let us say I got this interest point, and this is the region around it.
You create a histogram for that region, and this histogram encodes local information: small histograms coming together, again based on orientations. So, you can use either HOG or SIFT or variants of these, and then again learn a machine learning system.
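A minimal OpenCV sketch of the SIFT step: detecting interest points on a face crop and computing one 128-dimensional descriptor around each; how these per-point descriptors are pooled into a single face representation is exactly the bag-of-words question discussed a little later.

import cv2
import numpy as np

# Stand-in for a cropped 8-bit grayscale face image.
face = np.random.randint(0, 256, size=(128, 128), dtype=np.uint8)

sift = cv2.SIFT_create()
keypoints, descriptors = sift.detectAndCompute(face, None)

# descriptors has shape (N, 128): one local orientation histogram per interest point.
if descriptors is not None:
    print(len(keypoints), descriptors.shape)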
Now, it has been shown in a large number of studies that these appearance features perform much better in terms of classification accuracy as compared to geometric features.
Of course, one can do fusion as well: you have an appearance feature and a geometric feature and you combine them before learning a machine learning classifier. For geometric features, the advantage is that they are extremely fast to compute, so low complexity.
They have low time complexity and also low storage requirements, because all you are doing is taking some facial points and, on the basis of those, performing some comparisons, for example computing distances between facial points or computing angles.
Compared to this, the appearance descriptors have higher computational complexity because you are analyzing the face pixel by pixel; they are generally slower to compute, with higher time complexity, and they have higher storage requirements. Therefore, depending on the type of application in which facial expressions are to be extracted, you can choose geometric features, appearance features, or a combination of both.
Now, a very simple example which I would like to share, friends: let us say you want to create a simple app on your phone which analyzes the images taken from the camera
and then, based on the facial expressions, groups the images together; you can say, well, I have a feature in my app which can show the happiest moments of a particular event. In that case, if the faces are clear and close to the camera, meaning they have a large size in the images, what one can do is run an object detector, extract the facial points and, for things such as happy, geometric features could be good enough. To increase the performance a bit more, for those expressions which are not captured by the geometric features, one can add a histogram of gradients with a small number of blocks,
so as to keep the time complexity low. What this also means, friends, is that when you were extracting an appearance feature based on HOG, you were doing global feature extraction for the face. When you were using SIFT, you were first getting the important interest points on the face and extracting a histogram around each interest point, which means you have what are referred to as local features.
Now, an obvious question arises: given a face, I have N interest points, and around each interest point I compute an appearance feature descriptor based on SIFT, so for each image I now have N SIFT descriptors. How do I learn a classifier? Typically each data point is represented by a single histogram, but when you are using SIFT you have a large number of histograms.
Therefore, in this kind of situation we typically use pooling methods for expression recognition, and I will give you a quick example of one.
(Refer Slide Time: 19:19)
The simplest one is called bag of words. This concept comes from the language domain. What you are saying is: well, I have N histograms which are computed around N locations in my face. Let each histogram be one word and each face be a bag. I will then create a pooling mechanism
so that these N histograms can be compressed into one histogram, which I can further use for training. Now, quickly, how do we do that? Let us first erase the ink on the slide.
What you are saying is: I will create a bag, a dictionary. I take all the training samples, let us say I have M training samples, basically M faces, and each may have N interest points. So, I have N times M data points; I do clustering and get what are referred to as representative clusters.
What that means is: let us say your faces include a set of smiling faces, sad faces and neutral faces; you extract the interest points and do clustering on them. You create these clusters: maybe cluster 1 corresponds to a smile, cluster 2 corresponds to a frown. Now that you have these representative clusters, you take one sample at a time, that is, one data point which contains N interest points,
and you create a histogram by looking at the frequency of occurrence of the representative clusters in that face. You compute the distance between the SIFT histogram of each point and all the cluster centres, assign it to the nearest one, and that gives you the histogram.
This solves a lot of problems. Not only, friends, do I now have a method for handling N different facial points of a face, but if instead of a single frame I were given a video, then as well I can do pooling using bag of words, right.
So, you have frame one, you have frame two; you ran your object detector, you got the face, you got the landmark points, and let us say you extracted global or local features. What you will do is create a dictionary using, let us say, clustering. You could say, well, one representation is that each frame is a word, represented by, let us say, a HOG-based feature.
You do the dictionary-based clustering and then the histogram creation, which is essentially referred to as vector quantization, and now you have a single histogram for one video. This could be the face across all the frames, and this way you can do facial expression recognition for the dynamic case, where a series of frames comes in.
Now, we also have other ways of extracting motion features. See, when I say you can create a bag of words based representation for a video, that treats each frame as a word; I am not learning the relationship between sequential frames. When you start smiling, there is the onset, apex and offset, so every frame has a relation with the next frame.
Therefore, an ideal feature representation should be one which not only looks at the spatial information, for example when you apply HOG to a frame you extract spatial information, but also looks at the time series, the changes which are happening across time, because facial expressions are dynamic in nature.
Therefore, motion based features are extensively used. In a work by Ambadar and others, they showed that when you add motion based information into your features, you can do better facial expression recognition. There are a large number of methods; let us look at the most prominent and commonly used ones. The first one is optical flow.
Optical flow is a traditional computer vision technique where you say: given a series of frames, I want to know where a pixel from frame one has moved to in frame two. In the context of faces, that means you had a smile starting in one frame, and in the next frame the same face with the smile increasing a bit.
Now, this pixel at the lip corner has moved a bit in the next frame, right. So, what is the flow, that is, the velocity and the direction in which a pixel has moved? Once I compute this optical flow, I can apply different types of pooling methods, again similar to bag of words.
So, friends, what happens is: you have N frames as input, and you get N minus 1 optical flow frames which tell you how the pixels move from one frame to the next. Now you can extract any feature on top of that; you can say, well, I want to compute a histogram of gradients on each optical flow frame, then a bag of words, and then classify.
So, what is happening: we took the face as input, we extracted the flow, which told me how the points are moving across time, and then we looked at the change in the spatial domain as well. Each optical flow frame now has a histogram representing the orientations of the gradients. You can then do some pooling because you have a video; bag of words is one option. One could say, I want to go even simpler because I do not want to spend more computation cycles:
just compute the max, min and, let us say, the average across all the HOG frames; you get three maps, you can flatten them, and that is your feature vector.
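A minimal OpenCV sketch of dense (Farneback) optical flow between consecutive frames, followed by the simple max, min and mean pooling mentioned above; the random frames stand in for aligned grayscale face crops from a video.

import cv2
import numpy as np

# Stand-in for N aligned grayscale face frames of a video.
frames = [np.random.randint(0, 256, (128, 128), dtype=np.uint8) for _ in range(10)]

flow_magnitudes = []
for prev, nxt in zip(frames[:-1], frames[1:]):
    # Dense optical flow: an (H, W, 2) field of per-pixel (dx, dy) displacements.
    # Positional args: flow=None, pyr_scale, levels, winsize, iterations, poly_n, poly_sigma, flags.
    flow = cv2.calcOpticalFlowFarneback(prev, nxt, None, 0.5, 3, 15, 3, 5, 1.2, 0)
    flow_magnitudes.append(np.linalg.norm(flow, axis=2))

stack = np.stack(flow_magnitudes)                       # (N - 1, H, W)

# Simple pooling across time: max, min and mean maps, flattened into one feature vector.
feature = np.concatenate([stack.max(0).ravel(),
                          stack.min(0).ravel(),
                          stack.mean(0).ravel()])
print(feature.shape)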
Friends, another method is the motion history image. This is a somewhat older technique; it has been used more for human action recognition and has been shown to be useful for facial expressions as well.
The idea is to create one single image from a series of frames: the frames are collapsed into one image in which, at each location, you record the change that has happened at that location. So, essentially, this single image
contains the history of motion of the pixels at each location. The other descriptor, which is extremely popular and very effective for dynamic facial expression recognition, is the Local Binary Pattern.
Zhao and Pietikäinen proposed this method in 2007; let us see what it is. So, friends, you have a series of frames, again of a single person; we are always assuming a single person. You take one block of the first frame, and the same-sized block at the same location in the second frame.
So, from this volume you extract a sub-volume, covering all these non-overlapping blocks. Next, you compute the local binary pattern. What the local binary pattern says, in its simplest form, is: let us say I am analyzing the location of the outer corner of the eye.
You take one pixel, this pixel at the corner of the eye, and you compare it with the neighbouring intensities. If its intensity is larger than a neighbour's, you record a 1; if it is smaller, you record a 0. Doing this for all the neighbours gives you an 8-bit code. You then convert this binary code to its decimal value, let us say 55, just as an example, friends.
Then you create a histogram over all the points: you have computed these binary codes for all the pixels, and that gives you one histogram. Now, this takes care of the spatial part. Zhao and Pietikäinen said, well, that is the spatial part; now let us do something for the temporal part as well.
So, let us say I again take this non-overlapping first block of my first frame, and the corresponding blocks of the second and third frames, and I have the sub-volume; let me draw it: this is x, this is y, this is t. They said, let us divide the volume into three orthogonal planes. What that means is: I have a plane here which is xt, a plane in the centre which is xy, and a plane here which is yt.
So, we have three planes, and from each you compute one LBP histogram; you then combine them, and this is your spatiotemporal analysis of the video using a local binary pattern representation, often called LBP from Three Orthogonal Planes. The local binary pattern is also used for texture analysis in computer vision. What that means in the context of faces is that, for applications such as micro expressions, these features have been observed to be fairly discriminative,
because you are analyzing the pattern in the texture pixel by pixel and encoding the difference of each pixel with its local neighbourhood in both space and time. Once you have any of these features, friends, you again learn a classifier, the same drill we have been discussing.
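A minimal sketch of the spatial LBP histogram using scikit-image's local_binary_pattern, extended naively to the three orthogonal planes of one space-time block; this is a simplification of Zhao and Pietikäinen's method (it uses only the central slice of each plane orientation), intended purely for illustration.

import numpy as np
from skimage.feature import local_binary_pattern

def lbp_histogram(plane, n_points=8, radius=1):
    # 8-neighbour LBP codes for one 2-D plane, pooled into a 256-bin histogram.
    codes = local_binary_pattern(plane, n_points, radius, method="default")
    hist, _ = np.histogram(codes, bins=np.arange(257))
    return hist / max(hist.sum(), 1)

# Stand-in for one space-time block of a face video: (T, H, W).
block = np.random.randint(0, 256, size=(16, 32, 32)).astype(np.uint8)
t, h, w = block.shape

# Three orthogonal planes through the centre of the block: XY, XT and YT.
xy_plane = block[t // 2, :, :]
xt_plane = block[:, h // 2, :]
yt_plane = block[:, :, w // 2]

# Concatenate the three LBP histograms into one spatiotemporal descriptor.
feature = np.concatenate([lbp_histogram(p) for p in (xy_plane, xt_plane, yt_plane)])
print(feature.shape)   # (768,)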
Now, these are some of the methods for the features which we extract. Another important aspect is the data; affective computing is a data-driven area. So, let us look at some of the facial expression recognition databases which are extremely popular in the community and which serve very different purposes, helping us solve different problems.
First, we have the Cohn-Kanade dataset from the Robotics Institute at CMU: the CK dataset and, later, the CK+ dataset, which was collected in 2010. This dataset contains video sequences where a person was asked to come and sit in front of the camera, looking right into it, as I am doing now, and was then asked to smile, then asked to show a sad expression, and so on.
Now, if you notice, what has happened is that my head did not move but the expression changed. The researchers wanted pure expression movement in these samples, and that is how you get these data points, videos where only the expression is changing. Further, there are over 100 subjects with a reasonably diverse age range, although the dataset does not contain many older subjects; it is mainly graduate students and faculty from around the campus.
Now, CK and CK+ have been the main workhorses among the datasets used for learning. Clearly, they have contributed to much progress in the area, but, as I explained, the way in which the data was recorded means that the expressions are not spontaneous.
That brings us to the discussion of what is a spontaneous and what is a posed expression. Posed is when you look into the camera and then you smile. Spontaneous is, let us say, when you were having a conversation or watching a movie, there was a joke which you really appreciated, you liked the joke, and then you smiled.
In the real world we have more spontaneous expressions, but certainly there are posed ones as well. Similarly, you can have a group of friends posing in front of the camera and saying 'cheese'; when you say cheese, it is a posed expression.
In the case of the Cohn-Kanade dataset, these are mainly posed expressions. Another aspect to note: we discussed that in static facial expression recognition we mainly look at the peak expression. The researchers here also labelled the peak expression, in other words, the frame of highest expression intensity, which is the apex.
(Refer Slide Time: 36:16)
Researchers then moved on and started looking at different aspects: the recording environment, the age range, the cultural variability and also the labels. In the case of the CK dataset you had the universal emotions and the facial action coding system (FACS) action unit labelling. Another work, from Ohio State University in 2014, proposed a dataset for compound emotions.
They said, well, the universal emotions are too few, and they are not truly universal either; we have finer states in between. For example, you have surprise and you have disgust, two different expressions in the universal-emotion scheme. You could have a scenario where a person is disgusted and surprised at the same time, so the emotion, in terms of the expression, is compound.
So, Martinez and others proposed this dataset containing 5060 images from 230 subjects, with 22 categories of basic and compound emotions, which means that from the perspective of a facial expression recognition system you now have a 22-class problem, and you can have more fine-grained information about the state of the user.
Then, friends, we also have the DISFA dataset from the University of Denver; it stands for the Denver Intensity of Spontaneous Facial Action dataset. As the name suggests, this is spontaneous: the users were shown, let us say, a series of videos, which could have
positive or negative emotional content, with the expectation that some emotion will be induced in the viewer
and will show in terms of facial expressions. So, spontaneous expressions, with human labelling of the facial actions. You also have 3D data at a high resolution and different ethnicities as well, so there is another type of variability coming in with this dataset.
Now, all the examples I have talked about until now have been 2D data, but we also have 3-dimensional data, friends. 3D data can either be captured from two cameras or be synthetic. Lijun Yin and others proposed a dataset called the Binghamton University 3D Facial Expression dataset, samples of which are shown here.
What you have here is the texture and the shape, and this again was created in a somewhat different manner: they used a 3D face scanner and captured the faces of the participants. In the next version they actually captured 3D videos. When you capture 3D videos, again, one option is posed; the next step, which they took, was to show video stimuli to participants, record 3D videos, and thus obtain spontaneous 3D facial expressions.
(Refer Slide Time: 39:51)
Now, this is one of the works I was involved in very early on in my PhD, about 10 years ago. The datasets which you have seen so far focus on facial expressions, posed or spontaneous, but the assumption is that the subject is always inside a laboratory, with good illumination, and mostly facing the camera, looking into it as I am doing right now.
But this is not always how the data will be when you capture the world around you. Many a time, the pose of the person, that is, the head with respect to the camera, will be non-frontal. For example, if I look towards my left, the camera will only capture the right-hand side of my face.
So, we have to introduce this kind of data so that methods can be tested under these real-world conditions. Along with pose, there are other aspects of real-world conditions as well: there is occlusion, there is the presence of multiple subjects, and there is missing data. To drive progress in this area, we proposed a dataset called Acted Facial Expressions in the Wild.
'In the wild' here, friends, means representative of real-world conditions. What we did was choose popular movies featuring method actors. Method actors are high-quality actors who generally show expressions closer to spontaneous expressions; and since we are getting data from movies, the environments in which
the people appear are very varied. So, we decided to collect this movie data.
But we needed a method to extract those short video segments where a person or a group of persons is showing an expression, be it happy, sad, angry and so forth. So, what did we do? We extracted closed caption subtitles. Friends, closed caption subtitles are the subtitles you would generally have seen when watching foreign material dubbed into another language; they carry descriptions in square brackets.
These words, for example 'laugh' here, generally convey the emotion or something about what is in the scene, and they are meant for people with hearing impairment. So, we did a simple analysis of the text and chose those segments which had keywords related to affect and emotion, and that gave us this kind of data, which was then labelled accordingly.
Because, you know, you can always say that a word is related more to happy than to sad,
and then use that as the label. Along with this we also had the context. Now, the context is as
follows: who is the person, what is their age and what is their gender. We can use this
information during facial expression recognition training to have more accurate predictions:
let us say a better model which predicts facial expressions for the older population, or a more
accurate model for facial expression prediction in children.
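To make this concrete, here is a minimal, hypothetical Python sketch of the keyword-based subtitle filtering described above. The subtitle file name, the keyword list and the assumption that captions follow the standard .srt layout are illustrative choices, not the exact AFEW pipeline.

import re

AFFECT_KEYWORDS = {"laugh": "Happy", "cries": "Sad", "shouts": "Angry", "screams": "Fear"}

def find_affect_segments(srt_path):
    # Return (start, end, label) for subtitle blocks whose bracketed
    # closed-caption text contains an affect-related keyword.
    segments = []
    with open(srt_path, encoding="utf-8") as f:
        blocks = f.read().split("\n\n")
    for block in blocks:
        lines = block.strip().split("\n")
        if len(lines) < 3:
            continue
        timing, text = lines[1], " ".join(lines[2:]).lower()
        for caption in re.findall(r"\[(.*?)\]", text):       # e.g. "[laughs]"
            for word, label in AFFECT_KEYWORDS.items():
                if word in caption:
                    start, end = timing.split(" --> ")
                    segments.append((start.strip(), end.strip(), label))
                    break
    return segments

# Hypothetical usage: segments = find_affect_segments("movie_subtitles.srt")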
So, this brings us to the end of this lecture. We have discussed the process of feature extraction:
we started with features for frames, the geometric features, then we moved on to appearance
based features and then spatiotemporal features. Later we discussed the different facial
expression datasets and their attributes, for example, recorded in the lab or recorded outside,
posed expression or spontaneous expression, and then the different aspects such as whether a
2D camera or a 3D scanner was used.
Thank you.
Affective Computing
Prof. Jainendra Shukla
Prof. Abhinav Dhall
Department of Computer Science and Engineering
Indraprastha Institute of Information Technology, Delhi
Week - 04
Lecture - 03
Automatic Facial Expression Recognition Group Emotions
Welcome back, I am Abhinav Dhall and we are going to discuss the third part of Automatic
Facial Expression Recognition, which is in the series of the Affective Computing course. So,
friends, we have been talking about the different techniques in facial expression recognition
from the perspective of static versus dynamic facial expressions, posed versus spontaneous
facial expressions.
The different features which we extract and the different datasets which have been proposed
in the affective computing community. Now, we are going to discuss a newer
dimension in automatic facial expression recognition. So, if you notice on the slide here,
what we have is a frame from the extended Cohn-Kanade (CK+) dataset. As you notice,
there is one subject, this subject is looking directly into the camera and there is a
posed expression, ok. So, this is a posed expression. Now, we discussed earlier that we would
have spontaneous expressions in the world as well, right that is how we communicate through
nonverbal communication.
What we have also seen is given that there is so much data which is now available on social
media and also that there are these circumstances where we are not alone, we have a group of
people around us. So, in that case, there can be scenarios like this, where you have a group of
people and a camera is recording either an image or a series of images or video.
And what we want to understand is that, what is the overall perceived emotion of this group,
right. Why do we want to do this? Well, of course, I told you one reason already, there is
an abundant amount of data available, and then we also see that it helps in tasks such as understanding the
cohesion of a group, that is, an indicator of, let us say, the unity within a group, or the task where
we want to understand how a group of people is performing.
So, in that particular case, the facial expressions along with other cues give us a vital signal.
Now, as compared to traditional facial expression recognition where we have one subject, in
this case there are a large number of factors which will affect the perception of a person when
they look at, let us say, a photograph of a group of people.
Now, we would like to map these factors onto a computational model so that we can do
overall facial expression recognition for a group. Now, the question of course, is how can we
represent the emotion of a group, right? So, similar to how we did for single subject
based facial expression recognition, we can use the universal categories, that is, your angry,
sad, disgust, happy, neutral and so forth. You can also use here the valence and arousal, the
continuous emotions. So, as we have seen in the earlier lectures, we can use valence and
arousal to indicate the emotion and its strength, positive or negative.
Now, if you have a video of a group of people, you can then indicate what is the primary
emotion and what is its intensity. Now, we will go through this part trying to
understand first what the contributing factors are going to be when someone looks at a
photograph: why will they say, for example, that a group of people looks happy, or that they are in a
professional setting and so maybe they are showing a neutral expression.
(Refer Slide Time: 04:30)
So, we did a survey, ok. Now, this is one of our earlier works where we asked users a
question: well, you have 2 images, can you compare these 2 images and tell us
which group of people is happier. Now, once you have made the choice, there is a set
of questions which are derived from works in psychology that have already studied how
humans perceive groups and what their choice is based on these factors.
For example, you could say you chose image A as having higher happiness compared
to B: is that because the faces are less occluded, or the faces in that group have a
larger size, or maybe some people have particular attributes which are affecting your
judgment? Based on that, we did some analysis and we found out, for example, that there is a set
of contributing factors.
For example, larger faces and more smiling faces get more weightage. So, in a
way, when you look at a group of people, and let us say this is a large group of people, you do
not need to look at each and every face in order to tell me what the overall group emotion is.
You can look at a few people and that can give you a correct understanding of
the perceived emotion.
Now, we also asked our participants what the reasons were, other than the ones
asked about face size and so on, which made them choose a particular image. So, if you
notice, they say there is more togetherness in this group, so I am rating the happiness of this
group higher than the other one. Or, there is a more colorful scene in the image, so that
perhaps means the event in which these people are is more positive.
Now, moving forward, we group these factors which affect the perception of a
group of people into top-down affect and bottom-up affect. Now, friends, what is top-down
affect? Top-down affect is: here you have a group of people and they are at a particular
place, let us say they are at a park.
Then what is the effect of the neighborhood of each person on another person? For example,
for this subject, what is the effect on his or her expression based on with whom they are
standing and where they are standing in the group? So, typically the structure of a group also
tells you about the social setting of that group.
For example, if you are in a formal setting, in office, then it is possible that the senior person
could be in the center of the group. If you are with friends, maybe this condition will not
hold. The other is the scene: what are the things which are in the background? Maybe
if there are balloons, it can indicate that some celebration is going on, or if there are,
you know, some chairs and people are wearing robes, then maybe a convocation is going on.
The same goes for the clothes of the people and their body pose.
Now, the second category is the bottom up affect. This is essentially saying what is the
individual expression and what is that contribution of this individual's expression towards the
overall group emotion. And there are things such as attractiveness, the spontaneous
expression, age, gender, large faces, center faces, clearer faces and so forth. So, here we are
looking at each person at a time from a computer's perspective.
Now, this is also somewhat in sync with Bar’s scene context model which proposed that
when we look at a particular scene, we would do a very low resolution quick scan of that
scene. That gives us the holistic picture of what you have in front of you. Once you have done
that, we will pinpoint a region of interest and we will fixate on that. What that means is: you
are looking, let us say, at a scenery.
Now, you will do a quick scan, look at the low resolution version of the scenery and then
maybe fixate on the house which is there just beside the mountain because that is what is of
interest to you. So, in that way, first you are looking at the overall group, then you are looking
at the individual people ok.
Now, there are several ways in which one could model these attributes into a facial
expression recognition system. We are going to discuss one such model. So, let us say we
have an input image. In this image, we have six subjects. Now, what we can say is we are
interested in the top down representation, those affects and the bottom up right. For me to let
us say first look at top down, I need to have a method of modeling the group.
So, in the typical pipeline which we have discussed till now for facial expression recognition,
friends, we use an object detector. In this case, when you use the object detector for
detecting the faces, it will give you six faces in this image. Once you have them, what you
could say is, well, let my group of people in the image be a fully connected graph, ok.
So, you have a fully connected graph here. In this fully connected graph, your vertices V are
the faces and the edges are representing the link between two people. For example, in this
image F 1 and F 2 are linked, so, this is an edge. The weight of an edge is essentially let us
say the distance between two faces.
Now, the distance between these two faces can be calculated based on the face locations. So,
you can take the centers of the two faces and, let us say, calculate the Euclidean distance. So,
the Euclidean distance between these two faces will be the weight of the edge
between these two vertices, the two faces.
Now, top-down was based on where people are standing, with whom they are standing, and
then what the effect of their relative location is on the neighbor and on themselves, which
affects their expression and which of course, in the end, gives the impression of the
overall group's perceived emotion. So, what we do is, after we have a fully connected graph,
we say, well, let us compute a minimum spanning tree, ok.
So, for example, you can use the Prim’s min span computation algorithm and that is going to
tell you who is the neighbor of whom, right? So, you have a fully connected graph, you
computed a min-spantree and now it actually gives you the shortest path and in the classic
sense of this group as a fully connected graph it is telling me now what is the shortest path.
So, in a way, it could give you, for example, if F 1 was your first face, that F 1 is
linked to F 2, then F 3, then F 4; from F 4 you can reach maybe F 5 and then
F 6, just as an example. Now, what that means is that with this minimum spanning tree, F 2's two neighbors
are now clear to me. So, F 2 has two immediate neighbors, right?
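As a concrete illustration, here is a minimal Python sketch of this graph construction: faces are vertices, pairwise Euclidean distances between face centres are edge weights, and a minimum spanning tree gives each face its immediate neighbours. The face centre coordinates below are hypothetical values for illustration only.

import numpy as np
from scipy.spatial.distance import cdist
from scipy.sparse.csgraph import minimum_spanning_tree

# Hypothetical centres of the six detected faces F 1 ... F 6 (x, y in pixels).
face_centres = np.array([[120, 80], [200, 85], [290, 90],
                         [370, 95], [450, 100], [530, 105]], dtype=float)

# Fully connected graph: edge weight = Euclidean distance between face centres.
weights = cdist(face_centres, face_centres)

# Minimum spanning tree over the fully connected graph.
mst = minimum_spanning_tree(weights).toarray()

# Neighbours of face i are the faces connected to it in the (undirected) MST.
adj = (mst + mst.T) > 0
neighbours = {i: np.flatnonzero(adj[i]).tolist() for i in range(len(face_centres))}
print(neighbours)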
So, I can now calculate more information about the relative location. Just as an
example, I can say, well, the relative face size of a person with respect to his or
her neighbors is a rough indication of where that person could be standing in a group. If
that is the case, I can do something like this: if theta of i is the relative face
size of the i-th subject in my group, then it can be calculated from the face size which I get
from the face detection which I have done.
So, let us say that is S of i divided by the average of the face sizes in the neighborhood. So, I
can get this from F 1, F 2 and F 3, and this simply is a very rough indication of where the
person is relative to the others. Similarly, I can compute the distance of a particular person, let
us say F 1, from the centroid of the group. Earlier I gave you an example, right? When you are in a
formal setting, generally seniors are in the middle of the image.
The same goes for family photographs as well; you may have noticed that the grandfather and
grandmother would be sitting in the center, the elder ones would be on the
sides, children could be in the lap, and so forth. So, one's distance from the centroid can
indicate how much weightage they have in that social setting.
In computer vision this is also referred to as finding the very important person.
So, you have a group of people; who is the important person in this image? Once you know
that, it gives you a fair bit of idea about the social setting of that group, where they are, what
they could be doing and what the perceived emotion could be.
So, if that is the case, I can compute it this way: the distance d of i is based on the
centroid of the group, from which you subtract the location which you get for the face.
So, let us say the location of face i is l of i and the group centroid is c; then d of i is the
distance between c and l of i, and you can compute this, ok. So, the larger the distance, the more
you can penalize that face's expression.
Now, similarly, friends, you can look at the bottom-up affect. I will give you an example.
You can do AFER, that is, Automatic Facial Expression Recognition, and compute, for a
particular face, let us say happiness. So, H of i will tell you how happy a particular face is.
Now that you have this information, you can use d of i and theta of i
as weights and apply them to the expression of one person.
So, you can down-weight or increase their value based on their social significance in that
group. Based on that, a trivial model of group expression, let us call it the Group Expression Model
(GEM), could simply be averaging the expression intensities weighted by
theta of i and d of i over the total number of subjects n in that group.
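Here is a minimal Python sketch, under assumed definitions, of this weighted formulation: each face's happiness intensity H of i is weighted by its relative size theta of i and penalised by its distance d of i from the group centroid, and the weighted intensities are averaged over the n faces. The exact weighting used in the published group expression models may differ; this is only meant to make the idea concrete.

import numpy as np

def group_expression_model(face_centres, face_sizes, happiness, neighbours):
    centroid = face_centres.mean(axis=0)
    n = len(face_sizes)
    total = 0.0
    for i in range(n):
        nbr = neighbours[i]
        theta_i = face_sizes[i] / np.mean([face_sizes[j] for j in nbr])  # relative face size
        d_i = np.linalg.norm(centroid - face_centres[i])                  # distance from centroid
        w_i = theta_i / (1.0 + d_i)   # assumption: large faces up-weighted, distant faces penalised
        total += w_i * happiness[i]
    return total / n                   # average weighted intensity for the group

# Hypothetical usage, reusing face_centres and neighbours from the earlier sketch:
# gem = group_expression_model(face_centres, sizes, H, neighbours)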
Of course, this is the simplest representation which you could have. You can add more
factors into it, for example, by using machine learning techniques such as
Latent Dirichlet Allocation (LDA). So, you can model the relationship between the
faces, the group members, wherein a word would be a face, a similar representation. One
could also use a bag of words framework, as we discussed in the last lecture for dynamic
facial expressions.
Now, these were techniques whereby you are using these handcrafted features. You could
also use a deep learning based technique for group emotion, right? Here is an example. So,
here you have an input image and again what you want is top down features to be taken care
of and your bottom up features, right? So, these are the ones which we understood from
surveys.
So, this pipeline takes the whole image as an input into a network. In this case it is an
Inception V3 network and it is actually taking care of the top-down part: you are
looking at the whole scene. Further, what you do is detect the locations of the faces, and for
each face you now use a network. In this case it is a capsule network; you can use a
convolutional neural network as well.
Now, this is your bottom up analysis, you are looking at the facial information here, ok? To
add more context, for example, where people are, right? You will see for example, people
could be wearing birthday hats only in a birthday social scenario, but not in offices, right?
So, you can have these facial attributes computed for each face and all this can be fused
together. And this will be the prediction, right? So, you can do a prediction here, maybe let us
say 3 class prediction which simply tells you if the perceived emotion is positive, neutral or
negative and this is just a rough step on your valence axis, ok.
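Here is a hedged Keras sketch of such a two-stream pipeline: an Inception V3 branch looks at the whole image (top-down) while a small CNN looks at a fixed number of face crops (bottom-up), and the two are fused for a 3-class prediction (negative / neutral / positive). The face branch here is an ordinary CNN rather than the capsule network mentioned above, and all sizes, layer choices and the maximum number of faces are assumptions for illustration.

import tensorflow as tf
from tensorflow.keras import layers, Model
from tensorflow.keras.applications import InceptionV3

MAX_FACES = 6  # assumed upper bound on detected faces per image

# Top-down branch: the whole scene through Inception V3.
scene_in = layers.Input(shape=(299, 299, 3))
scene_feat = InceptionV3(include_top=False, weights=None, pooling="avg")(scene_in)

# Bottom-up branch: a small CNN applied to each face crop, then averaged over faces.
face_in = layers.Input(shape=(MAX_FACES, 64, 64, 3))
face_cnn = tf.keras.Sequential([
    layers.Input(shape=(64, 64, 3)),
    layers.Conv2D(32, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Conv2D(64, 3, activation="relu"),
    layers.GlobalAveragePooling2D(),
])
face_feat = layers.TimeDistributed(face_cnn)(face_in)     # (batch, MAX_FACES, 64)
face_feat = layers.GlobalAveragePooling1D()(face_feat)    # pool over the faces

# Fusion and a rough 3-class prediction on the valence axis.
fused = layers.Concatenate()([scene_feat, face_feat])
fused = layers.Dense(128, activation="relu")(fused)
out = layers.Dense(3, activation="softmax")(fused)

model = Model([scene_in, face_in], out)
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])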
Now, I will play a video and this is an example of looking at the faces and the whole of the
scene as well. Now, in this case you will notice that the facial expressions, the body gestures
tell that the overall group emotion is not so positive, right? And here you have, you know, let
us go back to the video, let us play it again.
And here on the top, right, you have the overall emotion which is negative and the bars are
indicating the strength. So, higher score would mean that people are looking more negative as
you can clearly tell from the expressions of the group of people here. Now, in this very
example, if I move on a bit, you will notice that there is a positive sample coming in
as well. Let us wait for it to come.
(Refer Slide Time: 20:07)
Now, obviously, one could say that, for example, in this group,
one can look at not just the expression, but the body gestures as well.
(Refer Slide Time: 20:16)
The body gesture of the group of people can tell you, for example, they are, you know,
having a positive emotion right now. So, these are very powerful cues which we can extract
and later embed into a large number of systems.
(Refer Slide Time: 20:26)
Now, I will share some future directions. So, group level emotion is actually a fairly new area,
and more and more datasets are being collected. For example, here is one from Sharma
and others, and then there are other repositories, such as a happy people dataset and
the Group Affect dataset, GAF. These are available publicly; you can
download them, train systems and test out group emotion recognition.
Now, we discussed image level group emotion, but of course, as we have discussed, when
you have time series information, that gives you a lot more information for understanding
the emotion of a person or a group of people. So, in this case as well, if you have a video of a
group of people, you can analyze the rich dynamics with respect to the group and then add
that information to the expression analysis.
The other direction, friends, is audio-video fusion. Generally, in group settings you will
have audio information available as well. Now, this audio information would be in the form
of either voices, since someone could be speaking in the group of people, or some
background music. That information can actually help us with more robust facial expression
recognition; it will be complementary information.
The third is that larger datasets are required. Typically, what happens is, when you are
collecting a dataset where you have a group of people and you are trying to label the
perceived emotion of those people, there is a lot of variability which can come in. Now, this
variability would be in the form of the different labels which labelers could be giving
to the same video.
So, person 1 could perceive a video as having, let us say, very high positive emotion, while
person 2 could perceive it as medium positive. Then how do you get
the consensus? How do you get the consensus among the labelers? That is important.
The other question is how do you decide that one labeler is more consistent compared to another
labeler.
So, if you know that one labeler is more consistent in creating the dataset, that is, gives more
consistent labels, then you could perhaps give more weightage to that labeler.
And the third issue is, when you are collecting these datasets, what about the copyright?
You just cannot download videos and use them without a proper copyright check.
So, typically in the academic research community we will look for data with a Creative
Commons license. You can delve deeper into it and make sure that, for the large dataset
which you are collecting, you actually have the rights to it. So, friends, this brings us to the end
of the group emotion part of the automatic facial expression recognition lecture.
Thank you.
Affective Computing
Prof. Jainendra Shukla
Prof. Abhinav Dhall
Department of Computer Science and Engineering
Indraprastha Institute of Information Technology, Delhi
Week - 04
Lecture - 04
Automatic Facial Expression Recognition Applications
Welcome back, I am Abhinav Dhall and we are going to discuss about the Applications of
Automatic Facial Expression Recognition as part of the Affective Computing lecture. So,
friends, we have been talking about different aspects of automatic facial expression
recognition, how we create the different blocks, the aspects of the data and the recent things
such as group level emotion recognition.
So, now I will be discussing with you some of the applications. So, it is like asking the
question that you create a system and that system is able to identify the expression of a user.
Now, once you know the expression, what do you do with it? What type of meta information
can be extracted so that we can solve a real-world problem, and also how can that
information be used to drive a user interface? So, we will start with different areas, and I would like
to draw your attention to the problem of digital health. Since, when we are talking about
facial expressions, we are looking at the different movements in the face, we can use that
information for different indicators about the health and wellness of a person.
So, here is the first example: how can facial expressions be used to predict if a person is
feeling pain or not? Now, the assumption is that the understanding of a painful event is going
to be on the basis of the expressions of the person. Of course, we know that there is a lot of
variability: some people express pain strongly, some would express a similar event mildly, and this
concerns self-reported pain labelling.
If there is a third person who is observing the patient, he or she will have their own different
observation because there is a bias. So, there is a subjective bias which comes in when pain is
described. Therefore, from an analytics perspective, understanding the pain objectively is
useful for better care of a patient.
Now, imagine a scenario: there is an intensive care unit, there is a patient who is unable to
communicate well, and we want to understand if the patient is feeling pain, how many
times they have felt pain and what the intensities of those painful events were. We are
going to use the facial expressions as the cues, and then we can have a summary across time
which will give very rich information to the physician.
Now, what we have here on the slide is a video from the UNBC-McMaster (University of
Northern British Columbia) shoulder pain dataset. This subject has gone through a shoulder
reconstruction surgery, ok. And what they have is a camera in front of the person and
a research assistant who is asking this person to move the shoulder, the place where
they had the reconstruction done a few weeks ago.
Now, what we want to do is detect the face and analyse the expression, and this
analysis could use your universal categories or, let us say, your facial actions, the Facial Action
Coding System. Based on this analysis, we want to plot, let us say, a graph like this, where on
the timeline we have overlaid the score which we are getting from our classifier based
on the expression analysis, where a red score means the classifier said there is a high
probability that this person is feeling pain.
And, blue means there is a very low probability that the person is feeling pain so, this is like
no pain event ok. Now, let us play the video. So, the person is moving their arm and you can
clearly tell from the facial expression that they are feeling pain right. Again, they are moving
there is a painful event. They will now rest down a bit so, expression is relaxed.
Now, again there is a painful event and you can tell that the person is feeling pain. So, we can
now have this information over a longer time duration and this can be very vital for the
treatment. From a machine learning perspective, friends, you could also ask this question:
well, when I am going to train such systems, where will I get such richly labelled data from, that is, at
every time stamp, was there a painful event and what was its intensity?
So, in that case, what we typically do is model these kinds of digital health problems
as weakly supervised learning problems, ok. Such methods would use the rich
information from the features, and then we will use these classifiers to tell if the person is
feeling pain or not.
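Here is a hedged Keras sketch of one common weakly supervised framing, in the spirit of what is described above: only a video-level label ('pain occurred somewhere in this video') is available, so per-frame pain scores are aggregated by max-pooling over time and the loss is applied to that aggregate, a simple multiple-instance-learning style setup. The shapes and the choice of per-frame features are assumptions for illustration, not the exact method used on the UNBC-McMaster data.

import tensorflow as tf
from tensorflow.keras import layers, Model

MAX_FRAMES, FEAT_DIM = 200, 136   # assumed: e.g. flattened facial landmarks per frame

frames_in = layers.Input(shape=(MAX_FRAMES, FEAT_DIM))
# Per-frame pain score in [0, 1], with weights shared across time.
frame_score = layers.TimeDistributed(layers.Dense(1, activation="sigmoid"))(frames_in)
# Video-level prediction: the most "painful" frame decides the weak label.
video_score = layers.GlobalMaxPooling1D()(frame_score)

mil_model = Model(frames_in, video_score)
mil_model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
# Hypothetical usage: mil_model.fit(video_features, video_level_pain_labels, epochs=10)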
The other digital health application which I would like to discuss with you is depression
and psychological distress. Now, depression is one of the biggest disablers worldwide and
this problem is very prevalent, but there is a lot of social stigma around it.
Typically, when you go to a psychologist, they will look at certain cues and get
some forms filled in by the patient in order to understand if the person is feeling depressed, if
the person is going through a depressive episode.
Now, from the perspective of automatic facial expression recognition, what is being done is
we can try to map the facial expressions of a patient onto a model and then tell if there is a
probability of this person being depressed. And, also in the same exercise predict the intensity
of depression.
So, an example of that is the work which is being done at the University of Canberra in
Australia. So, there is Professor Roland Goecke’s lab. What they are doing is they are saying
well if we are considering unipolar depression, we can make an assumption here. The
assumption is that it is commonly observed that in the case of unipolar depression, a person
may go through a psychomotor retardation.
Now, in the context of facial expressions, friends, what that will mean is that the frequency of
occurrence of facial expressions and the intensity of facial expressions will be mellowed
down, that is, slower and less intense, in a patient suffering from unipolar depression. What
that means is that, if we can get clinically validated data, then we can try to verify this
observation by analysing the facial expressions.
So, the pipeline here is of course also similar. You are going to detect the object, which is the
face, in your video. So, you now have a series of faces, and you are then going to extract features
which take care of, let us say, the intensity and the frequency of expressions.
So, yes, you have guessed it right: you can actually use the Facial Action Coding
System here. You can say, well, take the frequency of occurrence of action units and the
intensity of action units; if I can have such a feature representation, perhaps I can train a system
which can predict unipolar depression.
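Here is a minimal, hypothetical Python sketch of that feature idea: given per-frame action unit (AU) intensities for a video, summarise each AU by how often it is active (frequency) and how strongly it occurs (mean intensity), and feed those summary statistics to a classifier. The AU matrix, the activation threshold and the SVM choice are illustrative assumptions, not the exact method used in the work mentioned above.

import numpy as np
from sklearn.svm import SVC

def au_summary_features(au_intensities, active_threshold=1.0):
    # au_intensities: array of shape (num_frames, num_aus), e.g. from an AU detector.
    frequency = (au_intensities > active_threshold).mean(axis=0)   # fraction of frames active
    mean_intensity = au_intensities.mean(axis=0)                   # average intensity per AU
    return np.concatenate([frequency, mean_intensity])

# Hypothetical usage: one feature vector per video, with clinically validated labels.
# X = np.stack([au_summary_features(v) for v in per_video_au_matrices])
# clf = SVC(kernel="rbf").fit(X, depression_labels)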
Now, similar to depression, you can use facial expressions for understanding states such as
confusion. There are a lot of neurological diseases where confusion is one of the attributes.
So, how can we assess if a person is confused? Within a conversation, when someone is
confused, you would be looking at their facial expressions and speech cues. So, a
multimodal system can be created for this.
Now, moving on from digital health examples, let us look at some forensics. There is
deception detection; you would also have heard of lie detection. So, there is, let us say, an
interrogator who is asking questions to a person, and along with the interrogator asking
questions there is a camera which is facing the person who is being
interrogated.
So, one can look for micro expressions in that case, and these micro expressions could
indicate if a person is trying to deceive. Recall, micro expressions are your involuntary,
short-duration movements. Since they are involuntary, only extremely
well-trained actors would be able to control part of them. So, one could analyse micro
expressions and train a system for deception detection.
Then, from the smart cars perspective, we have drowsy driver detection. Typically,
people driving for longer durations would be feeling drowsy and tired. So, how can a
camera-based system tell that the person is drowsy and, of course, then issue an alarm, so
that the driver either takes a rest or gets more careful?
So, in this case, what you are typically doing is saying, well, here is the frame and
I have a face here; this is, let us say, the face of your driver. Along with the facial expression
you would also be looking at the eye region. So, you can use a deep learning based
method or a traditional handcrafted feature based method, and that could look for signs
such as yawning.
Again, when you yawn you have a distinct facial movement. We can try to detect yawning
and that can raise an alarm; so, that is a safety feature. Next to this is
understanding the engagement and attention of a person. An example of that is, let us say,
we are doing a training exercise, there is a group of people, and we want to understand if the
attendees are engaged or not.
So, in this case we would again be looking at cues such as, let us say, yawning, and a bit of the
pose and the expression as well, and this has very varied applications. For example, if you
are talking of human robo interaction, where a robo is trying to interact with a human, it
needs to understand subtle cues. An example of that is: when does the conversation end?
Is the user paying attention to the robo or not and if the user is busy in a task should a robo
make some noise, some gesture and try to start the interaction? So, now based on the facial
expression and other body cues of the user, one can try to understand the attention. That
can be used for things such as human robo interaction, or for in-class attention as well, that
is, when you are doing a training exercise, whether people are paying attention or not.
Now, this is not just from the perspective of judging anyone; it is also that sometimes the
content needs to be modified. So, before, let us say, some content is shared with a wider
audience, one can create a facial expression analysis system and do testing with a smaller
number of people. When you analyse the time series, you look at the expressions and
then you can modify the content before it is released to a larger audience.
Now, one could also observe that for these problems, such as attention or the health and
well-being problems, the face in itself might not be enough. For example, in depression,
voice is also a very strong indicator of the psychomotor retardation observation. So, in
multimodal systems you can use multiple modalities to make a more robust system; this will give
you complementary information.
Now, these are some of the serious applications, friends. There can be some mobile phone
based applications as well. For example, one could use a facial expression analysis technique
to analyse the faces in the gallery app of a smartphone, and the images which are presented to
the user could, for example, be sorted by expression, just another way of visualizing
information.
Now, from the applications I would like to take you to the last part, which is essentially
the limitations and the scope for improvement in automatic facial expression
recognition. One of the challenges which comes with automatic facial expression
recognition systems, typically when you are using low-power devices, is the computational
complexity.
Here you are analysing the face. When you are analysing a face in an image that is typically a
matrix which is very rich in data. So, the facial expression recognition system needs to be
computationally light so that we can do real-time analysis. The other challenge which comes
now for automatic facial expression recognition is illumination.
Now, what that would mean is let us say you were doing a recording outside right and all of a
sudden a very sunny day becomes cloudy. Now, what is going to happen? You are going to
see the effect of change in illumination on the faces which are being recorded right.
Now, from computer vision and affective computing perspective what that could mean is if
the automatic facial expression recognition system is not trained to be agnostic or to be robust
to illumination changes, one could see noisy outputs. The third one is, guys, occlusion.
So, consider facial expression recognition under occlusion, when, let us say, there is self-occlusion;
self-occlusion could be, for example, a very heavy beard, ok. So, you are not able to see parts
of the face very clearly. In that case as well, the facial expression
recognition output could be a bit noisy.
So, there is an opportunity here to improve facial expression recognition when there is
a heavy beard or there are dark glasses. The other is external occlusion. This we have clearly
seen in the case of group emotions: you have a group of people, and it is possible, let us say,
that the face of one person is partially occluded by the body part of another person.
In that case, we have incomplete information right. So, if you have incomplete information,
that incomplete information could lead to a noisy output right. So, there is an opportunity to
do this partial data based facial expression recognition as well. The next one is subtle facial
expressions right. So, these subtle facial expressions again are your micro expressions.
Now, in this case there is a considerable line of work where researchers are looking at micro
expressions, but there are a lot of challenges in micro expressions. Of course, the first is that
you need very sophisticated equipment, because we are looking at recording, let us say, 200
frames per second; so, the camera and the hardware need to be quite sophisticated.
The other is the labelling problem. Labelling can be a bit noisy in the case of micro
expressions. Now, if your labelling is noisy, if you get noisy labels during your training, then
it is going to affect the overall performance of the facial expression recognition system.
Then, friends, there is individual variability. What that means is: here you have two friends; let
us say one is very open, the other is a bit introverted.
It is possible that, for the same joke, both of them may show different intensities of happiness or
smile. The laughter could be very high in one and low in the other. Now, from the
perspective of training an automatic facial expression recognition system, that would mean
that there is a lot of intra-class variation. This is again a challenge; you
would like to model this intra-class variation.
Now, from the data perspective, you would like to have large intra-class variation so as to
cover as many types of expressions and as much variability in expressions as possible. From the machine
learning model perspective, you would like to learn a generic representation which is agnostic
to the intra-class variation.
So, if you have very high intra-class variation in your data but you do not have a
sophisticated facial expression recognition system, that will lead to noisy outputs. Now, here
is another interesting question: for an ideal facial expression recognition system, which is
more important, higher quality features, or from a deep learning perspective a more
complex deep network, or more data?
Now, more data could mean both labelled or unlabelled data. Well, typically we will
need both. So, as you are increasing the amount of your data, that would also mean that, if
enough care is taken during the recording of the data, you would capture this individual
variability, which would give more intra-class variation.
So, we would like to have intra-class variation so that we can capture the different facial
expressions and your facial expression detector is not person specific but generic.
With more data you would need better features and more complex networks; both go
hand in hand. The next point is extremely important, friends.
So, there are the ethical issues with automatic facial expression recognition. Recently, it has been
observed that facial expression recognition systems have in part been used for things such as
automatic job interviews. Now, as I have been discussing from the beginning, facial
expression is just one indicator of the emotion.
And, it is possible that without context the facial expression which you see for a person does
not really tell their real emotional state. Therefore, there are these ethical issues which
raise very serious questions about which areas we should
exercise caution in when using automatic facial expression recognition.
Then, along with this, is fairness. What is fairness? You have an automatic facial expression
recognition system which is trained, let us say, on Caucasian data and Asian data. Now, at
test time, if you get data from a subject of, let us say, African ethnicity, how will the system
behave? Will it be fair? Will it do correct recognition?
Because, in the past, some studies have found that some of the commercial systems
have been biased against women of color. So, if you are going to use the facial
expression recognition system output for serious applications such as health and well-being, then
you have to be very careful about whether the system is fair.
That is, it should not give biased outputs based on things such as age, gender and color. Then, friends,
there is accountability: how did the system reach a particular output? What type of
data was used? What were the design elements, let us say, of the network? We need
all this information because ultimately facial expressions are used for serious applications.
And, then there is transparency as well. Now, transparency not only is from the perspective of
the model or the system, but transparency in terms of let us say how the data has been
collected. What were the rules and regulations, was appropriate approval taken from let us
say an ethics committee?
Were the participants who were part of the data recording aware of where the
data would be used? All these are very serious concerns which are now being addressed
and discussed openly, because facial expression recognition has reached
a healthy state.
And, this healthy state is now helping in areas such as health and well-being and
so forth. So, friends, with this we reach the end of our discussion on automatic facial
expression recognition.
Thank you.
Affective Computing
Prof. Jainendra Shukla
Prof. Abhinav Dhall
Department of Computer Science and Engineering
Indraprastha Institute of Information Technology, Delhi
Lecture - 14
Tutorial: Emotion Recognition using Images
Hi, everyone. My name is Gulshan Sharma. I am the TA for this NPTEL Affective
Computing Course. In this Tutorial, we will be working on Emotion Recognition using
Images.
We will be using the RAVDESS dataset, which consists of 7,356 files divided into 8 emotion
classes, which are neutral, calm, happiness, sadness, anger, fear, surprise and disgust. For this
tutorial, we will be using the video-only data.
(Refer Slide Time: 01:01)
So, this is the file name identifier. It consists of seven sub-identifiers, in which the 3rd
sub-identifier tells us about the emotion class.
So, let me give you the overview of this tutorial. We will start with extracting frames from the
video files in Python, then we will extract HoG, that is, Histogram of Oriented
Gradients, features from these images. And we will try to classify these features using
machine learning classifiers like Gaussian Naïve Bayes, Linear Discriminant Analysis and
Support Vector Machines.
After that, we will be using the VGG-16 architecture, pre-trained on the VGG Face
dataset, and we will try to classify these emotional images. Let me tell you about this VGG Face
dataset: VGG Face is a dataset of 2.6 million face images of 2622 people
that is used to train this VGG-16 architecture. So, let us start with the coding part now.
So, in our coding part, we will start with importing all the essential libraries that will be
required during the programming. After importing these libraries, I will start with the data fetching
part, and for this part I am assuming that everyone has already downloaded the data and
saved it in their respective Google Drives. So, here I will create a couple of variables which
essentially signify the dataset path.
After creating these variables, I will be writing a code to extract frames from the video and
saving the frames into another directory. So, the code will look something like this. In this
code, we will simply iterate through all the video files available in our folder and we will read
each file using this cv2 library.
So, there is a function in our OpenCV library called VideoCapture which essentially
opens the video at the given input path. Later, we simply extract the
frame out of this video and save it to another directory. This code might take a couple of
minutes to execute completely, so please be patient. After completion of this code block,
frames are extracted from the video files and saved into another folder at this location.
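For reference, here is a minimal sketch of this frame extraction step. The directory paths are assumptions (pointing to a Google Drive copy of RAVDESS), and the sketch keeps just the first frame of each video for simplicity; the tutorial's actual script may save frames differently.

import os
import cv2

video_dir = "/content/drive/MyDrive/RAVDESS/videos"   # assumed path to the video files
frame_dir = "/content/drive/MyDrive/RAVDESS/frames"   # assumed output folder for frames
os.makedirs(frame_dir, exist_ok=True)

for fname in os.listdir(video_dir):
    if not fname.endswith(".mp4"):
        continue
    cap = cv2.VideoCapture(os.path.join(video_dir, fname))   # open the video
    ret, frame = cap.read()                                   # read one frame
    if ret:
        cv2.imwrite(os.path.join(frame_dir, fname.replace(".mp4", ".png")), frame)
    cap.release()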
Now, our next task will be to read all the images from this particular folder and save them
into separate NumPy arrays. For that, we will write another piece of code. The code will look
something like this: we have defined two lists, data_all and label_all, we iterate through all of
our files, read those file names, extract the label out of each file name and append it to the
label_all list.
Later on, we simply read all these image files using our skimage library and save them
into the data_all list. Let me run this code, ok. As we can see, we are getting 873 files, 873
images whose dimensions are 720 x 1280, and three is simply the number of channels,
representing RGB images. Let me also convert label_all into a NumPy array.
So, for that, I will simply reuse my existing code, ok, and we are getting 873 labels
corresponding to these files. Let me also join these two cells for simplicity.
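For reference, here is a minimal sketch of this loading step: the emotion label is the third dash-separated field of the RAVDESS file name, and images are read with skimage. The frame folder path is an assumption.

import os
import numpy as np
from skimage import io

frame_dir = "/content/drive/MyDrive/RAVDESS/frames"   # assumed path to the saved frames
data_all, label_all = [], []

for fname in sorted(os.listdir(frame_dir)):
    if not fname.endswith(".png"):
        continue
    label_all.append(int(fname.split("-")[2]))         # 3rd identifier = emotion class
    data_all.append(io.imread(os.path.join(frame_dir, fname)))

data_all = np.array(data_all)       # expected shape: (873, 720, 1280, 3)
label_all = np.array(label_all)     # expected shape: (873,)
print(data_all.shape, label_all.shape)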
(Refer Slide Time: 05:43)
After that, we now have all the data and their corresponding labels. As we have already
seen, there are 8 classes in this problem. So, just to save some computational time and make
this problem a little easier, we will turn it into a sort of binary classification where
we choose any two classes. To do so, the code will look something like this.
We will declare two lists, data_binary and label_binary, and we will iterate through all our
original labels. If a label is equal to 1 or equal to 5, we will simply append that label to our
label_binary list, and all the data corresponding to these labels will be appended to the
data_binary list. Later, we will simply transform these lists into NumPy arrays.
And the shapes of these arrays will look something like this. Now, as you can see, we have a
smaller number of images, which will give us some computational efficiency
over here. And remember, I am talking about computational efficiency, not about model
efficiency or some sort of training efficiency; this is just for an easier
demonstration.
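Here is a minimal sketch of this binary subset step; the choice of classes 1 and 5 simply follows what is done above.

import numpy as np

data_binary, label_binary = [], []
for x, y in zip(data_all, label_all):
    if y in (1, 5):                  # keep only the two chosen classes
        data_binary.append(x)
        label_binary.append(y)

data_binary = np.array(data_binary)
label_binary = np.array(label_binary)
print(data_binary.shape, label_binary.shape)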
Now, we will try another sort of computational optimization here: instead of using the 720
x 1280 image, which is a huge image, we will try to resize it to a smaller
dimension, let us say a 224 x 224 image. For that, I will write another piece of code, and the
code will look something like this.
Here we are declaring a list called resized_data, which will store all the
resized data. We will be using the skimage inbuilt function transform.resize, to which
we pass the original data and the target shape to which we have to resize.
After execution of this code, as you can see, our new image
size is 224 x 224. To visualize this data, I will be using another function from the scikit-image
library, and this function is called io.imshow. Here I will simply pass some
image that I have resized, and the image will look something like this; maybe I can use another
image, say image 89.
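Here is a minimal sketch of the resizing and visualisation steps just described.

import numpy as np
from skimage import io
from skimage.transform import resize

# Resize every image in the binary subset to 224 x 224 (channels are kept as-is).
resized_data = np.array([resize(img, (224, 224)) for img in data_binary])
print(resized_data.shape)           # expected: (num_images, 224, 224, 3)

io.imshow(resized_data[89])         # visualise one resized image
io.show()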
So, by now we have extracted all these images and stored them in their respective
folders, we have made this a binary classification problem, and we have resized our images.
From now on, I will try to extract a histogram of oriented gradients from these images, and
we will train a couple of classifiers over that and learn how to classify these images into the
emotion classes. To extract the HoG features, I will again be using an inbuilt
function from the skimage library.
(Refer Slide Time: 09:45)
And this will look something like this, where I pass the resized image data along with a few
hyperparameters into this block. The first hyperparameter is orientations,
which essentially represents the number of histogram bins; in the original HoG paper they
set this parameter to 9, and we will be using the same value over here.
Another hyperparameter of the HoG function is pixels per cell, which essentially represents
the cell size for the histogram, and we will be using 8 x 8 pixels per cell. Our third
hyperparameter is cells per block, which represents the block size for the histogram normalization.
After running this code, our image will look something like this.
Maybe I can zoom in a bit, and as you can see, this HoG feature can represent the
boundaries of this face. So, basically, this is the HoG image generated from this
function and fd is the feature descriptor. Now, using fd we will try to classify the emotion
classes. But before that, I need to calculate these HoG images and features for the whole dataset.
So, the HoG code will look something like this. Here in this code, I have declared two lists
named hog_features and hog_images. What we are basically doing over here is
iterating through all the resized images, using this inbuilt HoG extractor,
saving the extracted images and features in these two lists, and later converting these
lists into NumPy arrays.
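Here is a minimal sketch of this HoG extraction loop, using the hyperparameters mentioned above (9 orientations, 8 x 8 pixels per cell); the 2 x 2 cells per block value is an assumption, and older skimage versions use multichannel=True instead of channel_axis.

import numpy as np
from skimage.feature import hog

hog_features, hog_images = [], []
for img in resized_data:
    fd, hog_img = hog(img, orientations=9, pixels_per_cell=(8, 8),
                      cells_per_block=(2, 2), visualize=True, channel_axis=-1)
    hog_features.append(fd)
    hog_images.append(hog_img)

hog_features = np.array(hog_features)
hog_images = np.array(hog_images)
print(hog_features.shape, hog_images.shape)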
So, let me run this code block; maybe I can show you the shapes of the HoG features
and HoG images. For that, I will simply print the shape of the images and the shape of the
HoG features, ok. These are the respective shapes of the features and of the HoG
images. Now that we have extracted the HoG features, we are ready to play
with our machine learning classifiers.
For that, I will start with dividing the data into their respective training and testing sets,
using the in-built train_test_split function from sklearn, and the code will
look something like this. Here we are taking test_size equal to 0.10, which
essentially means 10 percent of the data will be taken as test data. So, we execute this
code.
After dividing the data into their respective training and test sets, we will use our first
classifier, which will be Gaussian Naive Bayes, and the code will look something like this.
We have already imported this Gaussian Naive Bayes classifier and we are fitting it on
our training data, and later we simply show our training data score and test data
score, ok.
As we can see, our training and test accuracies are 41 percent and 48 percent
respectively, which is not so good. So, we will use another classifier called Linear
Discriminant Analysis over here, and the code will be similar: I will simply drag my code over
here and, apart from Gaussian Naive Bayes, I will be using Linear Discriminant Analysis;
the rest of the code is the same.
And in this case, we got a slightly higher train score and test accuracy over here. Finally, I
will be using my linear support vector machine, whose code will look something like this.
In this case, as I can see, the linear SVC is giving a similar score to our Linear Discriminant
Analysis. Let me try the same code with an RBF kernel, ok.
Surprisingly, here we are also getting similar results using this HoG feature. One more thing I
would like to tell you: since here I am using the standard hyperparameters in our HoG
feature extractor, one can experiment with changing the values of these
hyperparameters and see whether there is any effect on the classification accuracy.
That could be one exercise for you to do on your own; a compact sketch of this classification stage is given below.
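Here is a minimal sketch of this classical pipeline: split the HoG features, then try Gaussian Naive Bayes, LDA and SVMs. test_size=0.10 gives the 10 percent test split mentioned above; the random_state is an arbitrary choice.

from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.svm import LinearSVC, SVC

X_train, X_test, y_train, y_test = train_test_split(
    hog_features, label_binary, test_size=0.10, random_state=42)

for name, clf in [("GaussianNB", GaussianNB()),
                  ("LDA", LinearDiscriminantAnalysis()),
                  ("LinearSVC", LinearSVC()),
                  ("SVC (RBF)", SVC(kernel="rbf"))]:
    clf.fit(X_train, y_train)
    print(name, "train:", clf.score(X_train, y_train),
          "test:", clf.score(X_test, y_test))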
And now I have done my basic classification part using HoG features. From now on, I will be
using a standard VGG 16 neural network architecture to classify these emotions.
In this VGG 16 architecture, we will be using pre-trained weights from the VGG Face
dataset, which consists of 2.6 million face images.
So, the weights will already be pre-trained, and using that pre-trained model and
pre-trained weights, I will try to classify these emotions. For this, I first need to
download the exact VGG 16 weight file.
(Refer Slide Time: 16:08)
So, let me first separate this section over here. So yeah, I will download my VGG 16-weight
file and I will be using this particular link over here where these VGG face weights are
released and let me download it. It might take some time according to your internet speed.
Once it is downloaded, you can see this file over here.
So, it is in your local drive over here. Now my weights are downloaded, I can simply use my
VGG 16 model and the code will look something like this, where I will be using the pre-defined
VGG 16 model and I will be including these weights. This is basically
the exact path of the weights, and our input images are 224 x 224 x 3. After
running this code, I can see my model is there, so I can simply try to see the structure of this
model.
So, to view the structure of this model, I will simply write a loop where my agenda is to see
the exact layer of the model. So, I will simply write, ok.
(Refer Slide Time: 17:48)
We can see that these are the layers of this particular model, and I can also show you whether
these layers are trainable or not. Let me check, ok. So, each of these layers is trainable. Since we will be
using a pre-trained model, the pre-trained weights in this VGG 16 model, I need
to turn off the trainability of these layers.
To do so, I will write a simple loop whose agenda is to turn off the trainability of
these layers, and here I will simply write layer.trainable = False, ok. Let me try to
visualize whether these layers are turned off or not, yeah. Now, you can see all of these layers are not
trainable.
What I exactly mean by this is that when I fit my model on my data, we will
not be training these layers. We will simply train a multi-layer perceptron, which I will
append after these layers. To code that part, I will simply write another piece of
code.
I will be using the Sequential model method and I will define my multi-layer perceptron
model, which essentially consists of two dense layers with the ReLU activation function and
another layer with the Softmax activation function, which is essentially used as the classifier.
The remaining code will look something like this: we used Sequential, then we added the VGG
model over here, later we added a Flatten layer, and then we added our MLP
part, which consists of 1024 dense neurons, then 512 neurons, and finally 2 neurons with
Softmax for classification. Let me see what our model looks like; I will try to print a
summary over here, ok.
So, now our model still consists of these VGG 16 layers, the pre-trained network over
there, and a multi-layer perceptron over here, ok. Now, my agenda will be to compile this
model. I will compile this model using the categorical cross entropy loss; for the optimizer
I will be using the Adam optimizer with the learning rate set at 0.001, and the validation
metric will be accuracy.
So, ok, my model is compiled by now. Now, I just need to define my training and test sets
again over here. After successful compilation of the model, our
agenda will be to train, that is, to fit, this model. But before fitting,
I just need to divide my data into the respective train and test sets,
and also do a sort of pre-processing over here where I will convert label 5 to label 0.
This is just a preprocessing step, and I will also convert these labels into
categorical labels as we are using the categorical cross entropy function. So, let me run this code,
ok. We have successfully split our data into the respective train and test sets.
Now, our agenda will be to simply train, that is, fit, this model. For this, I will be
using model_top.fit, passing my training data and corresponding
labels; for the batch size I am taking 16, and the number of epochs is equal to 10.
This training part will also take a couple of minutes, so please have some patience. So,
after training this model for 10 epochs, we can see that our training accuracy is somewhere
around 57 percent. Now, let us check what our validation accuracy, that is, the test
accuracy, will be, which comes out to about 58 percent over here.
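For reference, here is a hedged sketch of this transfer-learning setup: a frozen Keras VGG16 backbone with an MLP head (1024, 512, then 2 units with Softmax) trained on top. The VGG Face weight file path and the assumption that it loads with load_weights(by_name=True) are illustrative; the tutorial's exact weight file and loading code may differ, so the sketch simply skips loading if the file is absent.

import os
import numpy as np
from tensorflow.keras.applications import VGG16
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Flatten, Dense
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.utils import to_categorical
from sklearn.model_selection import train_test_split

weights_path = "/content/vgg_face_weights.h5"            # assumed download location
base = VGG16(include_top=False, weights=None, input_shape=(224, 224, 3))
if os.path.exists(weights_path):
    base.load_weights(weights_path, by_name=True)        # assumes compatible layer names
for layer in base.layers:
    layer.trainable = False                               # freeze the pre-trained backbone

model_top = Sequential([
    base,
    Flatten(),
    Dense(1024, activation="relu"),
    Dense(512, activation="relu"),
    Dense(2, activation="softmax"),
])
model_top.compile(loss="categorical_crossentropy",
                  optimizer=Adam(learning_rate=0.001), metrics=["accuracy"])

# Map labels 1 / 5 to one-hot vectors (5 becomes 0), as in the tutorial.
y = to_categorical(np.where(label_binary == 5, 0, label_binary), num_classes=2)
X_train, X_test, y_train, y_test = train_test_split(resized_data, y, test_size=0.10)
model_top.fit(X_train, y_train, batch_size=16, epochs=10,
              validation_data=(X_test, y_test))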
What I can conclude from here is that, although our neural network is working
comparatively better than our basic classifiers, these networks are still not trained
to their potential. We might need to do rigorous hyperparameter tuning in terms of the
learning rate, the loss function, the optimizer and the size of this particular MLP, and maybe
we can get a better result than this. But the agenda of this tutorial is just to give you hands-on
experience of how to code these classifiers.
Thank you.
Affective Computing
Prof. Jainendra Shukla
Prof. Abhinav Dhall
Department of Computer Science and Engineering
Indraprastha Institute of Information Technology, Delhi
Lecture - 15
Tutorial: Emotion Recognition by Speech Signal
Hello everyone. My name is Gulshan Sharma and I am the Teaching Assistant for this
NPTEL Affective Computing course. First and foremost I would like to welcome everyone
on this very first tutorial of this course. We will attempt to learn emotion recognition through
speech in this tutorial.
With recent advances in the field of machine learning, interest in emotion recognition via speech
signals has dramatically increased. Various theoretical and experimental studies have been
conducted in order to identify a person’s emotional state by examining their speech signals. The speech
emotion recognition pipeline includes preparation of an appropriate dataset, selection of
promising features and design of an appropriate classification method.
(Refer Slide Time: 01:13)
So, in this tutorial, we will be utilizing a publicly available dataset known as RAVDESS. The
RAVDESS dataset consists of 7356 files. The database includes speech and song from 24
actors, 12 male and 12 female. Emotion classes include neutral, calm, happiness, sadness, anger, fear,
surprise and disgust. This dataset is available in 3 formats: audio only, video only, and
audio-video. For our task, we will use only the speech part, that is, our audio-only files.
Now, moving to the file name convention in this dataset, each file name
consists of 7 identifiers, where the first identifier tells us about the modality: either it is a full
audio-video file, a video-only file or an audio-only file. The second identifier tells about the
vocal channel: it is either a speech file or a song file. The third and most important
identifier is our emotion identifier, which tells the class of the emotion: neutral,
calm, happy, sad, angry, fearful, disgust or surprised.
The fourth one is the emotional intensity: either the emotion is of normal intensity or of
strong intensity. The fifth identifier tells about the statement, and the sixth about the
repetition of that statement. The seventh identifier tells about the actor; odd-numbered
actors are male and even-numbered actors are female.
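Here is a minimal sketch of parsing this file name convention; the example file name is hypothetical.

EMOTIONS = {1: "neutral", 2: "calm", 3: "happy", 4: "sad",
            5: "angry", 6: "fearful", 7: "disgust", 8: "surprised"}

def parse_ravdess_name(fname):
    # 7 dash-separated identifiers: modality, channel, emotion, intensity,
    # statement, repetition, actor.
    parts = [int(p) for p in fname.replace(".wav", "").split("-")]
    modality, channel, emotion, intensity, statement, repetition, actor = parts
    return {"emotion": EMOTIONS[emotion],
            "intensity": "strong" if intensity == 2 else "normal",
            "actor": actor,
            "gender": "male" if actor % 2 == 1 else "female"}

print(parse_ravdess_name("03-01-06-01-02-01-12.wav"))
# -> {'emotion': 'fearful', 'intensity': 'normal', 'actor': 12, 'gender': 'female'}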
So, before starting the coding part, let me first give you the complete overview of this
tutorial. We will start with downloading the dataset. After downloading the dataset we will
import that dataset into a Google Drive. The reason behind importing the dataset into Google
Drive is that we will be using Google Colab for our experimentations.
Our experimentation will start with reading audio files in Python. Then, we will extract
the fundamental frequency, the zero crossing rate and the Mel Frequency Cepstral Coefficients
(MFCCs) as our features from the audio files. After that we will employ some
classification algorithms like Gaussian Naive Bayes, Linear Discriminant Analysis and Support Vector
Machines.
And in the end, we will also try to create a 1-dimensional convolution neural network over
the raw audio for the emotion classification.
So, starting with our very first exercise which is dataset download, we can download this
dataset by simply searching the RAVDESS on Google.
After putting this RAVDESS keyword on Google, you will find this zenodo link over here.
So, basically, zenodo is a general purpose open access repository. Here we can store our data
up to 50 GBs.
So, we can simply click on this link, and find the dataset.
So, this dataset is basically released under a Creative Commons Attribution license, so one can
openly use it for publications.
(Refer Slide Time: 04:42)
And to download the dataset, that is, to download our exact part, which is the audio speech
actor dataset, we can simply click on this link.
It will take some time to download, but in my case I have already downloaded this dataset.
And I can show you.
(Refer Slide Time: 05:05)
After unzipping the downloaded file, this dataset will look something like this. So, there will
be 24 folders each belonging to one actor.
And after going through one folder, we will have a couple of files over here. Or maybe I can
just play a couple of files just for your reference.
Dogs are sitting by the door. Dogs are sitting by the door. Dogs are sitting by the door. Dogs
are sitting by the door. Dogs are sitting by the door.
So, as you can see, the same line, dogs are sitting by the door, is spoken with multiple
emotions. And if I move to some other folder, let us say the actor 2 folder, and play a couple of files.
Kids are talking by the door. Kids are talking by the door. Kids are talking by the door. Kids
are talking by the door.
So, as we can see, there are a couple of variations in the speaking style representing
different types of emotions. So, as we have downloaded this dataset, now our next
task is to upload it on Google Drive, so that we can easily access it through a Google Colab. I
believe most of us can easily upload a folder on Google Drive. So, I will be skipping that
part.
But some of the participants may have a low-bandwidth internet connection; these
participants can take any one of the folders and upload it on Google Drive. So, let us suppose
you are taking folder number 1.
So, folder number 1, I believe is of 25.9 MB. So, it will not be a very big file to upload on a
Google Drive.
(Refer Slide Time: 07:14)
So, now, I will shift on the Google Colab and we will start writing a program for emotion
recognition.
So, before starting our programming exercise, I assume that everyone has some sort of
experience with the Python programming language and everyone is aware of the Google
Colab interface. We will start this exercise by importing a couple of libraries and helper
functions. To save some time I have already copied the required import code, so everyone
who is programming along with me can pause this video and write this code in their own
environment.
So, after importing the required libraries and helping functions, we will start with reading the
audio files using Python, but before that we need to mount Google Drive with our Colab
Interface. To do so, we will first click on this files icon over here and then select mount drive
option.
After pressing this button, you will find a dialog box over here asking for permission to
access Google Drive. So, we will simply click on connect to Google Drive. So, after
mounting Google Drive with the Colab environment, we will now import the data. I am also
assuming that some of the participants may not have a powerful enough machine or a
high-speed internet connection.
So, to simplify our job, I will be using data from a single subfolder. Since we are importing
audio data, to read the audio data into our Python environment I will be using the librosa dot
load function.
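For readers coding along, a minimal sketch of the setup just described (imports plus the Drive mount) is given below; the Drive path is only an assumption and should point to wherever you uploaded the Actor_01 folder.

# Imports and Google Drive mount (run inside Google Colab)
import os
import numpy as np
import matplotlib.pyplot as plt
import librosa

from google.colab import drive
drive.mount('/content/drive')

# Assumed location of one actor's folder on Drive -- adjust to your own path
data_path = '/content/drive/MyDrive/RAVDESS/Actor_01/'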
(Refer Slide Time: 09:27)
I will create two variables called data and sample rate; then I will simply write librosa dot load,
and inside the brackets I will pass the file name. So, as you can see, this function has run
successfully, and now I will show you the shape of the data. It consists of 72838 sample values
and our sample rate is 22,050.
So, for further simplification I am planning to use just the first 3 seconds of our audio data. To
calculate the first 3 seconds of data, I can simply multiply fs, our sample rate, by 3 and we will
get the exact number of samples, which will be equivalent to 66150 samples.
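As a sketch, loading one file and keeping only its first 3 seconds could look like this (the file picked is just the first one in the folder):

# librosa.load resamples to 22,050 Hz by default and returns (samples, rate)
fname = sorted(os.listdir(data_path))[0]
data, fs = librosa.load(os.path.join(data_path, fname))
print(data.shape, fs)        # e.g. (72838,) 22050

time = 3 * fs                # first 3 seconds = 66150 samples
data = data[:time]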
Now, we have imported just one data file. So, we need to import all the data inside this folder.
To do so, I need to write a piece of code where I will be sequentially reading all the files and
saving them into a Python list. So, let us start with our code. I will name my variable data_all,
and I will also be extracting all the labels, since we have already seen that the data file names
contain their respective labels. So, we need to extract the relevant label also.
So, I will start with a loop where I will be reading all the file names, in sorted order, from os
dot listdir, and to this listdir I will be passing the path of the data folder. So, maybe I can just
create another variable, data_path, which will be treated as a string, and I need to write the
exact path for this folder. So, I will simply copy it from this place and paste it over here, ok.
Now, I can simply use this variable wherever I want this data path. So, first I will try to extract
all the labels. For that I will be appending all the labels to this label_all list: I will simply read
the file name and extract the substring, that is, the emotion identifier, from that particular file
name. I will also need to cast this class to an integer so that it can be treated as a label.
Now, I can simply read my data and my sample rate with librosa dot load. This line of code
will simply read each file name, append it to our data path, and then the librosa function will
read that corresponding file. Now, I need to store all those files in our data_all list. To do so, I
will simply write data_all dot append, and I will also be using only the first 3 seconds, so I will
put a colon and type the time over here.
So, let me just run this code. It might take a couple of seconds to run completely, ok. So, the
code has been executed now. So, I will simply convert these lists into numpy arrays. To do so,
I will simply write data all equal to np dot array, and label all will become.. sorry, there is a
mistake over here, I need to write an underscore, not a hyphen.
So, after converting these lists into numpy arrays I can simply check their exact shapes. So, I
will simply write data all dot shape, ok. So, now, we have read 60 files, each consisting of
66150 samples, which corresponds to the first 3 seconds of data. I can also see the label shape,
which is equal to 60 files. So, I guess we are good to go. Maybe I can also show you the exact
labels.
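The loop just described, reading every file in the folder and taking the third field of the file name as the emotion label, might be sketched as follows (it assumes every clip is at least 3 seconds long):

data_all, label_all = [], []
time = 3 * 22050                                   # samples in the first 3 seconds

for fname in sorted(os.listdir(data_path)):
    # the third identifier in the file name is the emotion class (01 to 08)
    label_all.append(int(fname.split('-')[2]))
    data, fs = librosa.load(os.path.join(data_path, fname))
    data_all.append(data[:time])

data_all = np.array(data_all)                      # shape (60, 66150)
label_all = np.array(label_all)                    # shape (60,)
print(data_all.shape, label_all.shape)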
(Refer Slide Time: 16:01)
So, you can simply print it, ok. So, these are the labels corresponding to our 60 files. Maybe I
will do a simple pre-processing over here, as I want my labels to start from 0. So, I will simply
write a line of code, ok. So, let me print these labels again, perfect. Now, I will also show you a
utility in IPython. From IPython dot display, we have imported Audio. This function can
simply play the exact audio file in our Python environment. So, let us try to play an audio file,
ok.
In my case, suppose I will be using the 0th indexed file. And our rate will be equal to fs, which
was 22,050 (Refer Time: 17:24), so I will simply type fs over here. So, as you can see, using
this utility I can simply play the exact audio file in my Colab. Maybe I can play one more file
over here.
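The label shift and the in-notebook playback described above can be written as:

# Shift labels so the classes run from 0 to 7 instead of 1 to 8
label_all = label_all - 1
print(label_all)

# Play a clip directly inside Colab
from IPython.display import Audio
Audio(data_all[0], rate=fs)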
Sounds good. So, after importing our data, I will now move towards the feature extraction
phase. And in our feature extraction phase, we will be simply using the fundamental
frequency, zero-crossing rate and Mel-frequency cepstral coefficients as our basic features.
So, let me show you how to extract the fundamental frequency from these audio files. To
extract the fundamental frequency, I will simply make a variable f0, and I will be using a
library function called librosa dot yin. In this function, I just need to pass a data instance along
with a range of frequency values, that is, the minimum frequency value and the highest
frequency value.
So, let me just extract fundamental frequency for a single instance, then I will show you how
to do it for a whole folder, ok.
So, it is working. I will simply print the exact value of this fundamental frequency or maybe I
can also try to show you a plot where these values are plotted against this time. To do so, I
will simply type plt dot plot and I will simply pass the array.
(Refer Slide Time: 19:36)
So, yeah, as you can see that initial values are somewhat around 882, then there was some
sort of a variation in this part and then again it is going to 882 over here. Or, maybe I just
need to show you another file, let us say 5 and then I will print another plot over here. So, as
you can see over here that there is some sort of a difference between the fundamental
frequencies in two different emotions.
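A sketch of the single-file fundamental frequency extraction with librosa.yin is shown below; the exact frequency range passed in is not shown on screen, so the note-based limits here are assumptions:

# YIN-based fundamental frequency for one clip
f0 = librosa.yin(data_all[0],
                 fmin=librosa.note_to_hz('C2'),    # assumed lower bound (~65 Hz)
                 fmax=librosa.note_to_hz('C7'))    # assumed upper bound (~2093 Hz)
print(f0.shape)                                    # roughly 130 frames for 3 s of audio

plt.plot(f0)
plt.xlabel('frame')
plt.ylabel('f0 (Hz)')
plt.show()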
Now, let me simply extract the fundamental frequency for all the data. To do so, I will be
simply using a for loop, where I will iterate over all the data, extract the fundamental frequency
and append it to another list. Later I will convert this exact list into a numpy array. This takes
some time to run.
Now, let me print the shape of this variable, ok. So, we have extracted the fundamental
frequency for each and every file in folder 1. Now, I will move on to another feature known as
the zero-crossing rate. So, I will be extracting the zero-crossing rate over here. To do so, I will
again be using the librosa library; there is a function in it called zero crossing rate which will
give us the exact zero-crossing rate corresponding to these audio files.
So, let me use the variable name zcr, and let me show you with a single file first. (Refer Time:
23:49) It is working. Let me print the zcr, ok. So, you can see there are some differences over
here. Maybe I can give you a better visualization by simply plotting this over time.
So, for that I will be writing plot, ok. Maybe I will create the zcr for another emotion, say for
emotion number 5, and then plot, ok. There is some issue over here, ok: it is not scr, it is zcr.
Yeah, here we can also observe that there is a significant amount of difference between these
two features. So, I will simply write another piece of code to extract the zcr value for all the
files.
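The per-file zero-crossing rate loop could be sketched as below; the reshape at the end keeps one row per file, matching the other features:

# Zero-crossing rate for every clip, one row per file
zcr_all = []
for clip in data_all:
    zcr_all.append(librosa.feature.zero_crossing_rate(clip))   # shape (1, ~130)
zcr_all = np.array(zcr_all).reshape(len(data_all), -1)
print(zcr_all.shape)                                            # (60, 130)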
(Refer Slide Time: 25:19)
Now, maybe just to keep it consistent with my previous feature shape, which was 60 by 130,
we can also reshape this zcr_all array. And now, if I run my print function again, yeah, we get
a similar shape over here. This is just to keep the consistency among all the features. Now, I
will show you how to extract another feature called Mel-frequency cepstral coefficients.
So, for that, let us write the MFCC part separately over here. And yeah, I can simply extract
MFCC from librosa. There is a parameter in MFCC called the number of MFCC coefficients;
in my case, let us say we will be extracting the first 13 coefficients, yeah.
(Refer Slide Time: 26:27)
So, these are the MFCC coefficients. I would like to print their shape also. So, yeah, it is 13 by
130 for a single file (Refer Time: 26:42). So, in this case we are getting 13 rows and 130
columns: for each row there are 130 values, and each of the 13 rows corresponds to one
MFCC coefficient.
Now, to visualize this I can simply write MFCC, ok, or maybe, just to avoid printing all these
values, I can simply put a semicolon.
(Refer Slide Time: 27:23)
Now, let me plot it for another file, let us say the 5th file, and our MFCC coefficients will look
something like this. So, are there some significant differences over here?
I believe yes, I can see some differences over here, and there are some differences over here
also. Since these are very complex and very tightly packed lines it is harder to see, but yeah,
there are some significant differences between these two files. So, now, again we will simply
extract the MFCCs for all the files.
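For the MFCCs, one simple way (an assumption, since the exact reshape is not shown on screen) is to flatten the 13-by-130 coefficient matrix of each clip into a single row:

# 13 MFCCs per frame for every clip, flattened to one row per file
mfcc_all = []
for clip in data_all:
    mfcc = librosa.feature.mfcc(y=clip, sr=fs, n_mfcc=13)   # shape (13, ~130)
    mfcc_all.append(mfcc.reshape(-1))
mfcc_all = np.array(mfcc_all)
print(mfcc_all.shape)                                        # (60, 1690)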
(Refer Slide Time: 28:17)
So, now, yeah, it looks consistent with my prior representations. And now, we have extracted
all 3 basic features: MFCC, fundamental frequency and zero-crossing rate. After this, I will be
running my basic machine learning classifiers. As my machine learning algorithms, I will be
using Gaussian Naive Bayes, linear discriminant analysis and support vector machines.
So, now, moving towards the machine learning part: we will take one feature, divide it into its
respective train and test parts, and then run our classifiers over it. Starting with our very first
feature, which is the fundamental frequency, let me first divide it into its train and test parts.
For this division, I will be using the train test split function from the sklearn library. So, the
code will look something like this, ok.
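A sketch of that split is given below. The 80/20 split ratio and the random seed are assumptions, and f0_all here stands for the fundamental-frequency array, built with the same loop pattern used for the other features:

from sklearn.model_selection import train_test_split

# fundamental frequency per clip, one row per file
f0_all = np.array([librosa.yin(clip,
                               fmin=librosa.note_to_hz('C2'),
                               fmax=librosa.note_to_hz('C7'))
                   for clip in data_all])                    # (60, ~130)

X_train, X_test, y_train, y_test = train_test_split(
    f0_all, label_all, test_size=0.2, random_state=42)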
Now, I can simply create my classifier as clf; my first classifier, let me show you, is Gaussian
Naive Bayes. I have already imported it: from sklearn dot naive bayes import Gaussian Naive
Bayes.
(Refer Slide Time: 29:37)
So, I will simply copy the function over here and fit it on my training set, ok. Now, my
classifier is fit on our training set. So, let me check the training accuracy over here, ok. So, we
are getting 89 percent training accuracy. And let me also check the testing score. Next, with
linear discriminant analysis, I am getting 83 percent training accuracy over here, which is
lower than our Gaussian Naive Bayes.
And let me also check my test score, ok. So, yeah, the test score also goes down to 0.25, that is
25 percent. So, I believe Gaussian Naive Bayes is performing better on our fundamental
frequency. So, guys, let me try my third classifier now, which is the support vector machine.
For that I will again reuse my code, and instead of Gaussian Naive Bayes, I will be using our
SVC function over here.
(Refer Slide Time: 30:53)
So, I will replace Gaussian Naive Bayes with SVC, and inside SVC I need to define my kernel,
that is, which kernel I will be using. So, kernel equal to, let us say we start with a linear kernel,
and let me check, ok, yeah. So, the classifier is fit on our training data. Let us see our train
score, ok. So, with the linear kernel we are getting 100 percent training accuracy. Let me check
the test score over here. We are getting 58 percent accuracy, which is higher than all of the
other classifiers.
So, maybe I can also check it with another kernel called the RBF kernel, ok. The RBF kernel is
not getting that good an accuracy, and yeah, of course, our testing score also decreases to 16
percent, which I believe is around chance level. So, yeah, in our case, for the fundamental
frequency we can easily see that the support vector machine with a linear kernel gives the
best results, ok.
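Putting the four configurations side by side on the fundamental-frequency features, a compact version of the comparison above might look like this; re-running it with zcr_all or mfcc_all in place of f0_all reproduces the later experiments, and exact scores will vary from run to run:

from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.svm import SVC

classifiers = {
    'Gaussian Naive Bayes': GaussianNB(),
    'LDA': LinearDiscriminantAnalysis(),
    'SVM (linear kernel)': SVC(kernel='linear'),
    'SVM (RBF kernel)': SVC(kernel='rbf'),
}

for name, clf in classifiers.items():
    clf.fit(X_train, y_train)
    print(name,
          'train:', round(clf.score(X_train, y_train), 2),
          'test:', round(clf.score(X_test, y_test), 2))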
Now, let us try the same classifiers using another feature. After the fundamental frequency, our
next feature was the zero-crossing rate. Let me code similar stuff for zcr. Again, I will simply
be reusing my code over here. So, instead of the fundamental frequency array, I will be using
my zcr_all, and the rest of the part will be the same, ok. Now, my train and test variables are
set over the zcr features.
So, again I will simply reuse my code. I will be using Gaussian Naive Bayes over here, and
again I have to show my train and test accuracies, so I will simply use this code, ok. So, for
Gaussian Naive Bayes, in the case of our zero-crossing rate, the train accuracy is somewhat
around 85 percent, but the test accuracy is around chance level only.
So, we will try with another classifier, which is our linear discriminant analysis. Again, I will
simply reuse my code. For linear discriminant analysis we are getting a train accuracy of 100
percent, and the test accuracy is somewhat around 8 percent, which is even lower than chance
level. This is a clear example of overfitting; in fact, the previous case is also an example of
overfitting.
Let me try with the support vector machine now. So, we simply change the function to SVC
and then we use kernel equal to linear, ok. Some problem over here, ok, I forgot to put the
equal to. In this case, the results look a little bit better than Gaussian Naive Bayes and linear
discriminant analysis, but still there is a huge gap between the train score and the test score, so
it is another example of overfitting. Let me try with the RBF kernel: same case, overfitting.
So, now, moving towards our final manually extracted feature, MFCC, let us try to run similar
code using MFCC, again reusing the code.
(Refer Slide Time: 35:24)
Now, training the Gaussian Naive Bayes classifier on the MFCC features, ok. These are
comparatively better results: we are getting a train accuracy of 95 percent and a test accuracy
of 50 percent. Now, let us try with linear discriminant analysis, ok: 70 and 41. And in the case
of our support vector machine, 1 and 75, which is a good result for the linear kernel.
And in the case of the RBF kernel, ok (Refer Time: 36:34), it is simply running over here. So,
yeah, as we can see, the support vector machine with the linear kernel is giving the best
results. Now, as we all can see, we have used our basic features with our basic classifiers.
After this, maybe I can do one more exercise where we will be using the raw audio data with a
one-dimensional convolutional neural network.
That one-dimensional network will automatically extract relevant features out of the audio
data, and using a softmax classification layer we will simply classify the emotion classes. To
do so, I will be using the help of a library called Keras. And before that, I need to do some
minor settings in my environment in terms of the data shape.
(Refer Slide Time: 37:30)
I need to just reshape my data so that I can fit it into a convolutional neural network. So, what
I am basically doing here is, let me show you the exact shape of our data before running this
code: my original data shape was 60 by 66150, ok, and after reshaping, the data shape will
become 60 by 66150 by 1. So, my next part will be how to code a 1D CNN. We will be using a
library called Keras; I have already imported these libraries.
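The reshape itself only adds a channel axis so the raw audio can be fed to a 1D convolution; one way to write it:

# (60, 66150) -> (60, 66150, 1): add a channel dimension for Conv1D
data_cnn = np.expand_dims(data_all, axis=-1)
print(data_all.shape, '->', data_cnn.shape)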
So, to start with, I will simply write model equal to models dot Sequential; the S may be capital
over here. I will first define the input shape, which essentially goes into the first layer of my
network; the input shape will be a tuple consisting of 66150 by 1. So, I will simply copy it.
And after inputting data of this shape, I will add a convolution layer over it.
So, the code will look something like model dot add; for the activation function we will be
using ReLU and the padding will be same. After the convolution layer, maybe I will try another
layer called max pooling or average pooling; let us say I will use max pooling, ok. Maybe I will
just use a single convolution layer and try to see how my results change with this network.
Maybe I can put a batch normalization layer, then a dense layer (Refer Time: 39:53): model dot
add with the number of neurons equal to 128 and activation equal to relu, ok. Now, maybe we
can also add a dropout layer, just to avoid any overfitting. And as a final layer, we will simply
add a dense layer with the number of neurons equal to 8, which is equal to the number of
classes, and the activation will be softmax, for the classification purpose, yeah.
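A sketch of the network described above follows. The filter count, kernel size, pool size and dropout rate were not stated on screen, so those numbers are assumptions, and a Flatten layer (not mentioned in the walkthrough) is added so that the final softmax produces one prediction per clip rather than one per time step:

from tensorflow.keras import layers, models

model = models.Sequential()
model.add(layers.Conv1D(16, kernel_size=9, activation='relu',
                        padding='same', input_shape=(66150, 1)))
model.add(layers.MaxPooling1D(pool_size=16))
model.add(layers.BatchNormalization())
model.add(layers.Flatten())                        # assumption: makes the output 2D for Dense
model.add(layers.Dense(128, activation='relu'))
model.add(layers.Dropout(0.3))                     # assumed dropout rate
model.add(layers.Dense(8, activation='softmax'))   # 8 emotion classes

model.summary()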
Now, maybe I can just present the summary of this model, ok. There is some error, a syntax
error, over here, ok; I forgot to put an equal to over here, ok. Another error: for activation I
again forgot to put the equal to over here. No attribute max pooling 1D; let me check, ok.
Another error, attribute max pooling, ok: the P in MaxPooling has to be capital, ok. One more
error, ok: the epsilon spelling is wrong.
(Refer Slide Time: 41:45)
Yeah, now it is working. This is the summary of our model, and this is the total number of
parameters that the model will be tuning. Out of these parameters, this many will be trainable
parameters and the others will be non-trainable parameters. Now, we have defined the
structure of our network, and we can simply compile this model using model dot compile.
Now, in the compilation of the model we have to define a loss function, that is, which exact
loss function we will be using, and an optimizer, that is, which sort of optimizer we will be
using to optimize that loss function, ok. There is an error over here, sequential model has no
attribute compile, ok; I have written the wrong spelling. So, after the compilation of my model,
I just need to fit my model over the training data.
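The compile step just described could be written as below; the loss matches the one-hot labels created next, while the optimizer choice is an assumption since it is not named on screen:

model.compile(loss='categorical_crossentropy',   # matches one-hot encoded labels
              optimizer='adam',                  # assumed optimizer
              metrics=['accuracy'])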
But before dividing our data into a train and test split, I just need to convert my labels into
categorical classes, in terms of one-hot encoded vectors, since we are using the categorical
cross entropy loss. For that I will be using an inbuilt function in TensorFlow Keras, ok. This
has changed my labels into one-hot encoded vectors. Let me show you, yeah.
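The one-hot conversion uses Keras' built-in utility:

from tensorflow.keras.utils import to_categorical

# label_all already runs from 0 to 7, so there are 8 classes
label_onehot = to_categorical(label_all, num_classes=8)
print(label_onehot.shape)        # (60, 8)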
(Refer Slide Time: 43:23)
Now, I can use my train and test splitting code to divide this data into a train and test split. I
will simply copy my previous code and paste it over here, and this time my training data will
be our raw audio data, ok. So, after dividing into the train and test split, we simply have to fit
the model, which we have already defined and compiled, through model dot fit. It might take
some time, as right now we are just using the CPU on Google Colab. So, it might take some
time to learn, ok.
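Splitting, fitting and evaluating could then be sketched as below; the epoch count, batch size and split ratio are assumptions:

X_train, X_test, y_train, y_test = train_test_split(
    data_cnn, label_onehot, test_size=0.2, random_state=42)

history = model.fit(X_train, y_train,
                    epochs=10, batch_size=8,
                    validation_data=(X_test, y_test))

test_loss, test_acc = model.evaluate(X_test, y_test)
print('test accuracy:', test_acc)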
We can see that our accuracy is improving over here, ok. So, my model has now completed all
its epochs. So, let us try to evaluate what the test accuracy is over here. We are getting a
training accuracy of somewhere around 70 percent and the test accuracy is 0 percent, ok. So,
this network architecture is not learning anything as of now.
Maybe I can start with some sort of hyper-parameter tuning: adding a couple of layers to it, or
maybe using different activation functions, or decreasing or increasing the number of neurons
in the dense layers. All that sort of hyper-parameter tuning I can do to make this network a
better classifier.
So, guys, this was all for this tutorial. I hope I was able to give you a basic idea about
programming these sorts of networks. And if you have any sort of doubt, feel free to post a
question in the discussion forum.
Thank you.
Affective Computing
Prof. Jainendra Shukla
Prof. Abhinav Dhall
Department of Computer Science and Engineering
Indraprastha Institute of Information Technology, Delhi
Week - 05
Lecture - 01
Speech Based Emotion Recognition
Hello and welcome. I am Abhinav Dhall from the Indian Institute of Technology, Ropar.
Friends, this is the lecture in the series of Affective Computing. Today, we will be discussing
about how we can recognize emotions of a user by analyzing the voice. So, we will be
fixating on the voice modality. So, the content which we will be covering in this lecture is as
follows.
First, I will introduce to you, give you some examples about why speech is an extremely
important cue for us to understand the emotion of a user. Then we will discuss several
applications where speech and voice based affective computing are already being used. And
then we will switch gears and we will talk about the first component which is required to
create a voice based affective computing system, which is labeled datasets.
And in this pursuit, we will also discuss the different attributes of the data and the conditions
in which it has been recorded. Now, if you look at me, let us say I have to say a statement
about how I am feeling today and I say, well today is a nice day and I am feeling contented.
Now, let me look down a bit and I say, today is a wonderful day and I am feeling great. Now,
in the first case, you could hear me and you could see my face very clearly. And in the second
case, my face was partially visible, but you could hear me clearly. And, I am sure you can
make out that in the first case, I was showing neutral expression. And in the second case,
even though my face was facing downwards, not directly looking into the camera, I was
sounding to be more positive right.
So, there was a more happy emotion which could be heard from my speech. So, this is one of
the reasons why we are using voice as one of the primary modalities in affective computing.
You talk to a friend, you understand their facial expressions, you look at the person's face, but
in the parallel, you are also listening to what that person is speaking. So, you can actually tell
how that person is feeling from their speech. And that is why we would be looking at
different aspects of how voice can be used in this environment.
So, here is an example. So, this is a video which I am going to play from audio visual group
affect dataset. So, let me play the video.
Protects.
Protects. (Refer Time: 03:47) So, you protect yourself. No.
So, if you notice in this case, the body language of the subjects here, that is trying to be a bit
aggressive. So, this looks like a training video. But if you hear the voice over, the explanation
voice in this video, you can tell that there is no fight going on, there is no aggressive
behavior, it is simply a training going on. And how are we able to find that? By simply
looking at the tonality of the voice.
If it was, let us say, actually a fight or some aggressive behavior shown by the subjects in the
video and the voice was also from one of the subjects, we would also hear a similar pattern
which would tell us that let us say the subjects could be angry. But in this case, even though
the body language facial expression says that they are in an aggressive pose, but from the
voice, we can tell that this is actually a training video. So, it is the environment is actually
neutral. Now, let us look at and hear another video.
Now, in this case, the video has been blacked out. You can hear the audio and you can tell
that there are several subjects in the audio video sample and the subjects are happy right.
How are we able to tell that? We can hear the laughter.
(Refer Slide Time: 05:34)
Now, if I was to play the video, now this is the video which we had earlier blacked out. So,
you can look at of course, the facial expressions, but even without looking at the facial
expressions and the gestures, just by hearing you can tell that the subjects are happy. So, this
gives us enough motivation to actually pursue voice as a primary modality in affective
computing.
Now, as I mentioned earlier, there are a large number of applications where the speech of the
user is being analyzed to understand the affect. And friends, this is similar to what we
discussed in the last lecture about facial expression analysis, where several applications were
there in health and in education.
We find similar use cases for voice based affect, but ones which are applicable in different
circumstances, in scenarios where it could be non-trivial to have a camera
look at the user. Of course, there is a privacy concern which comes with the camera as well.
So, instead we can use microphones and we can analyze the spoken speech and the
information which is there in the background.
(Refer Slide Time: 07:03)
Now, the first and quite obvious application of voice in affective computing is understanding
the man machine interaction on a natural basis. What does that mean? Let us say there is a
social robo. Now, the robo is greeting the user and the user has just entered into the room.
The robo greets the user and the user replies back.
Now, based on the voice of the user and the expression which is being conveyed by the user,
the machine which is the robo in this case is able to understand the emotion of the user and
then the feedback to the user can be based on the emotional state. Let us say if the user is not
so cheerful.
So, the robo reacts accordingly and then tries to follow up with a question which could either
make the user more comfortable and relaxed, or through which the robo tries to investigate a bit.
So that it can have a conversation which is appropriate with respect to the emotion of the
user.
The second application we see is in entertainment, particularly looking at movies. So, friends,
in this case, we are talking about the aspect of indexing. Let us say you have a large repository
of movies ok. So, these are let us say a large repository. Now, the user wants to search, let us
say the user wants to search all those videos which are belonging to a happy event ok.
You can think of it as a set of videos in your phone's gallery and you want to fetch those
videos which let us say are from events such as birthdays which are generally cheerful right.
So, from this audio visual sample we can analyze the audio which would mean the spoken
content by the subject and the background voices could be music. So, we can analyze and get
the emotion.
Now, this emotion information it can be stored as a metadata into this gallery. So, let us say
the user searches for all the happy videos. We look through the metadata which tells us that
when we analyze the audio these are the particular audio video samples which based on their
spoken content and the background voice or music sound cheerful. So, the same is then
shown to the user.
Now, moving on to another very important application here, let me first clear the screen a bit.
So, looking at the aspects of operator safety let us say there is a driver and the driver is
operating a complex heavy machinery. You can think of an environment for example, in
mining where a driver is handling a big machinery which has several controls. What does that
apply? Well, harsh working environment, a large number of controls of the machine and the
high cost of error.
So, the driver would be required to be attentive right. Now, you can clearly understand the
state of the driver by listening to what they are speaking and how they are speaking. From the
voice pattern one could easily figure out things such as if the person is sounding tired, is not
attentive, has negative emotion. So, if these attributes can be figured out the machine let us
say the car or the mining machine it can give a feedback to the user.
An example feedback can be please take a break right before any accident happens please
take a break. Because when I analyzed your voice I could figure out that you sounded tired,
distracted or an indication of some negative emotion, which can hamper the productivity and
affect the safety of the user and of the people who are in this environment around the user.
Now, friends the other extremely important aspect of where we use this voice based affect is
in the case of health and well being. Now, an example of that which is right now being
experimented in a large number of academic and industrial labs is looking at the mental
health through the voice patterns. So, an example of that is let us say we want to analyze data
of patients and healthy controls who are in a study where the patients have been clinically
diagnosed with unipolar depression.
So, we would observe psychomotor retardation, which I briefly mentioned in the facial
expression recognition lecture as well. From the changes in the speech, that is, the frequency
of the words which are spoken, the intensity, the pitch, you could learn a machine learning
model which can predict the intensity of depression. Similarly, from the same
perspective of objective diagnostic tools, which can assist clinicians let us say there is a
patient with ADHD.
So, when a clinician or an expert is interacting with the patient we can record the speech of
the interaction, we can record the voices and then we can analyze how the patient was
responding to the expert to the clinician and what was the emotion which was elicited when a
particular question was asked. That can give very vital useful information to the clinician.
(Refer Slide Time: 15:36)
Now, another aspect where voice based affective computing is being used is for automatic
translation systems. Now, in this case a speaker would be, let us say, communicating between
two parties, or translating, right. So, let me give you an example to understand this. Let us say we
have speakers of language one, you know group of people who are in a negotiation deal
trying to negotiate a deal with group of people who speak language two and both parties do
not really understand each other's language.
Now, here comes a translator could be a machine, could be a real person who is listening to
group 1 translating to group 2 and vice versa. Now, along with the task of translation from
language 1 to language 2 and vice versa there is a very subtle yet extremely important
information which the translator needs to convey.
Since the scenario is about negotiation, let us say a deal is being cracked, the emotional aspect,
that is, the emotion which is conveyed when the speakers of language 1 are trying to make a
point to the other team, also needs to be conveyed. And based on this, simply by understanding
the emotional and the behavioral part, one could understand whether, let us say, the
communication is clear.
And if the two parties are going in the direction as intended you can think of it as an
interrogation scenario as well. Let us say interrogator speaks another language and the person
who is being interrogated speaks another language, right. So, how do we understand what the
direction of the communication is, whether they are actually able to understand each other,
and when the context of the communication has changed, when all of a sudden a person who,
let us say, was cooperating is no longer cooperating, but speaks another language.
So, that is where we analyze this voice when you analyze the voice you can understand the
emotion and that is a very extremely useful cue in this kind of a dyadic conversation or
multiparty interaction. And of course, in this case the same is applicable to the human
machine interaction as well across different languages. Friends, also another use case is
mobile communication. So, let us say you are talking over a device, could be on a using a
mobile phone.
Now, from a strictly privacy-aware health and well-being perspective, can the device compute
the emotional state of the user and then, let us say after the call or communication is over,
maybe in a subtle way suggest some feedback to the user, to perhaps calm down, or simply
indicate that you have been using the device for n number of hours, this is actually quite long,
you may like to take a break, right.
Now, of course, you know in all these kind of passive analysis of the emotion of the user the
privacy aspect is extremely important. So, either that information is analyzed, used as is on
the device and the user is also aware that there is a feature like this on the device or it could
be something which is prescribed suggested to the user by an expert. So, the confidentiality
and privacy that need to be taken care of.
Now, this is a very interesting aspect friends. On one end we were saying well when you use
a camera to understand the facial expression of a person there is a major concern with the
privacy. Therefore, microphone could be a better sensor. So, analysis or voice could be a
better medium.
However, the same applies to your voice based analysis as well, because we can analyze the
identity of the subject through the voice, and also, when you speak, there could be personal
information. So, where does the processing have to be done to understand the affect through
the voice, is it on the device of the user, where is it stored? These are all extremely important
considerations which come into the picture when we are talking about these applications.
(Refer Slide Time: 21:04)
Now, let us discuss about some difficulties in understanding of the emotional state through
voice. So, according to Borden and others there are three major factors which are the
challenges in understanding of the emotion through voice. The first is what is said. Now, this
is about the information of the linguistic origin and depends on the way of pronunciation of
words as representatives of the language. What did the person actually say right? The content
for example, I am feeling happy today right.
So, the content what is being spoken the interpretation of this based on the pronunciation of
the speaker that could vary if that varies if there is any noise in the understanding of this
content which is being spoken then that can lead to noisy interpretation of the emotion as
well. The second part, the second challenge is how it is said you know how is a particular
statement said.
Now, this carries again paralinguistic information which is related to the speaker's emotional
state. And example is let us say you were in discussion with a friend and you asked ok do you
agree to what I am saying? The person replies in scenario 1. Yes. Yes, I agree to what you are
saying. In scenario 2, the person says hm, yes. Hm, I agree. Now, in these two examples,
there is a difference right. The difference in which how the same words were said the
difference was the emotion.
Let us say, the confidence in this particular example of how the person agreed with the other,
whether the person agreed or not, or was a bit hesitant. So, we have to understand how the
content is being spoken which would indicate the emotion of the speaker. Now, looking at the
third challenge, third difficulty in understanding emotion from voice which is who says it ok.
So, this means you know the cumulative information regarding the speaker's basic attributes
and features for example, the age, gender and even body size.
So, in this case, let us say a young individual saying, I am not feeling any pain, as an example,
versus an adult saying, I am not feeling any pain; a young individual versus an adult speaking
the same content, I am not feeling any pain. Maybe the young individual is a bit hesitant,
maybe the adult who is speaking this is being too cautious. So, what that means is that the
attributes, the characteristics of the speaker, are based not only on their age, gender and body
type, but also on their cultural context.
So, in some cultures it could be a bit frowned upon to express certain type of emotion in a
particular context right. So, that means, if we want to understand the emotion through voice
of a user from a particular culture or a particular age range we need to equip our system, our
affective computing system with this Meta information. So, that the machine learning model
then could be made aware during the training itself that there could be differences in the
emotional state of the user based on their background, their cultures.
So, this means to understand emotion we need to be able to understand what is spoken ok. So,
you can think of it as speech to text conversion then how it is said, a very trivial way to
explain will be you got the text what was the duration in which the same was said were there
any breaks were there any umms and repetition of the same words you know.
So, that would indicate how it is being said and then the attributes of the speaker. So, we
would require all this information when we would be designing this voice based affect
computing system.
(Refer Slide Time: 27:11)
Now, as we have discussed earlier when we were talking about facial expression analysis
through cameras the extremely important requirement for creating a system is access to data
which has these examples which you could use to train a system right. Now, when we are
talking about voice based affect then there are three kind of databases you know the three
broad categories of databases which are existing, which have been proposed in the
community.
Now, the attributes of these databases is essentially based on how the emotion has been
elicited. So, we will see what does that mean and what is the context in which the participant
of the database have been recorded ok. So, let us look at the category. So, the first is natural
data ok. Now, this you can very easily link to facial expressions again. We are talking about
spontaneous speech in this case.
Spontaneous speech is what you are, let us say creating a group discussion you give a topic to
the participants and then they start discussing on that. Let us say they are not provided with
much constraints. It is supposed to be a free form discussion and during that discussion
within the group participants you record the data ok. So, that would be spontaneous replies,
spontaneous questions and within that we will have the emotion which is represented by a
particular speaker.
Now, other environments scenarios where you could get this kind of spontaneous speech data
which is reminiscent of, representative of natural environment is for example, also in call
center conversations ok. So, in this case you know let us say customer calls in there is a call
center representative, conversation goes on and if it is a real one then that could give you
know spontaneous speech.
Similarly, you could have you know cockpit recordings during abnormal conditions. Now, in
this case what happens right let us say there is an adverse condition there is an abnormal
condition the pilot or the user they would be communicating based on you know how they
would generally communicate when they are under stress. And in that whole exercise we
would get the emotional speech right.
Then there are also conversations between a patient and a doctor. I already gave you an
example when we were talking about how voice could be used for affective computing in the
case of health and well-being: a patient replying to the questions of the doctor, and in that case
we would have these conversations carrying emotions. The same goes for the communication
which could be happening in public places as well.
Now, the other category for the voice based data set friends is simulated or acted. Now, in this
case the speech utterances the voice patterns they are collected from experienced trained
professional artist. So, in this case you know you would have let us say actors who would be
coming to a recording studio and then they could be given a script or a topic to speak about
and you know that data would be recorded.
Now, there are several advantages when it comes to simulated data. Well, it is relatively easier
to collect as compared to natural data. Since the speakers are already informed about the
content or the theme which they are supposed to speak about, and they have also given an
agreement, as compared to natural data the privacy could be better handled in this case, or I
should say, is easier to handle.
Now, the issue of course, is when you are talking about simulated data, acted data, then not
all examples which you are capturing in your data set could be the best examples of how the
user behavior will be in the real world. Now, the third category friends, is elicited emotion
which is induced.
So, in this case an example of course, is you know let us say you show a stimuli, a video
which contains positive or negative affect and after the user has seen the video, you could ask
them to answer certain questions about that video, and the assumption is that the stimuli
would have elicited some emotion in the user, right. And that affect would be represented
when the speaker, the user in the study, is answering the questions.
Now, let us look at the databases which are very actively used in the community. The first is
the AIBO Database by Batliner and others. Now, this contains the interaction between
children and the AIBO robo, contains 110 dialogues and the emotion categories, the labels
are anger, boredom, empathetic, helpless, ironic and so forth.
So, the children are interacting with this robo, the robo is a cute you know Sony AIBO dog
robo. So, the assumption here is that the participant would get a bit comfortable with the robo
and then emotion would be elicited within the participant. And we can have these labels,
these emotion categories you know labeled afterwards into the data which has been recorded
in during the interaction between the robo and the children.
The other dataset which is very commonly used is the Berlin Database of Emotional speech
which was proposed by Burkhardt and others in 2005. Now, this contains 10 subjects and you
know the data consist of 10 German sentences which are now recorded in different emotions.
Notice this one is the acted one ok. So, this is the acted type of dataset where in the content
was already provided to the participants, actors and you know they try to speak it in different
emotions.
So, what does this mean is now the quality of emotions which are reflected are based on the
content and the quality of acting by the participant. Now, friends, the third dataset is the
Ryerson Audio-Visual Database of Emotional Speech and Song. Again, this is an acted
dataset; it contains professional actors, and these actors were given the task of vocalizing two
statements in a North American accent.
Now, of course, you know again the cultural context is coming into the picture as well you
know this is also acted and if you compare that with the first dataset of the AIBO dataset then
this was more spontaneous ok. Of course, you know you would understand that getting this
type of interaction is non-trivial.
So, extremely important to be careful about the privacy and all the ethics approvals which are
required to be taken. Now, in these kind of databases where you have actors it is relatively
easier to scale the database because you know you could hire actors and you can have
multiple recording sessions and you can give different content as well.
Now, moving on to other databases. Friends, the next is the IITKGP SESC dataset, which was
proposed in 2009 by Koolagudi and others. Again, an acted dataset, with 10 professional
artists. This is a non-English dataset; it is in an Indian language, Telugu. Each artist participant
here spoke 15 sentences, trying to represent 8 basic emotions in 1 session. Another dataset is
again from the same lab, called the IITKGP SEHSC, again by Rao and Koolagudi and others,
2011.
Now, in this case you again have 10 professional actors, but the recording is coming from
radio jockeys ok so from all India radio. So, these are extremely good speakers and the
emotions are 8 categories, again acted. But if you have high quality actors then the
assumption is that we would be able to get emotional speech as directed during the creation
of the dataset.
Now, moving forward, friends, looking at the number of utterances: this is a fairly good sized
dataset, 15 sentences, 8 emotions and 10 artists, recorded in 10 sessions. So, we have 12000
samples which are available for learning an emotional speech analysis system.
Now, in the community there are several projects going on. Now, they are looking at different
aspects of affect and behavior. So, one such is the empathetic grant in the EU. So, there as
well you know there are these dataset resources which are used for analysis of emotions and
speech. Another extremely useful, very commonly used platform is the computational
paralinguistic challenge platform by Schuller and Batliner.
(Refer Slide Time: 39:42)
So, this is actually hosted as part of a conference called Interspeech. Now, this is a very
reputed conference on speech analysis. So, in the ComParE benchmarking challenge the
organizers have been proposing every year different sub challenges which are related to
speech analysis and a large number are related to emotion analysis and different task and
different settings in which we would like to understand the emotion of a user or a group of
users.
Now, there would be some acted and some spontaneous datasets which are available on this
benchmarking platform.
(Refer Slide Time: 40:40)
Now, moving on from the speech databases: let us say the databases have been collected, they
could be acted or they could be spontaneous. The next task is to generate the annotations, the labels.
Of course, before the recording is done the design of the experiment would already consider
the type of emotions which are expected to be annotated generated from the data. So, one
popularly used tool for annotation of speech is the Audino tool.
Now, here is a quick go through of how the labeling is done. Let us say friends here is the
waveform representation. The labeler would listen to the particular chunk, let us say this
chunk, can re-listen, can move forward backwards. And they also could have access to the
transcript what is being spoken during this time.
So, what they can do is you know they can then add the emotion which they interpret from
this audio data point. And they can label you know different things such as the topic of the
spoken text. And also things such as you know who is the speaker and the metadata. So, once
they listen to the content they generate the labels, then they can save and then they can move
forward.
Now, extremely important to have the right annotation tool because you may be planning to
create a large dataset representing different settings. So, in that case you would also have
multiple labelers right. So, if you have multiple labelers the tool needs to be scalable.
And as friends we have already discussed in the facial expression recognition lectures as
well. If you would have multiple labelers you will have to look at things such as consistency
for each labeler how consistent they are in the labeling process with respect to the sanity of
the labels.
So, you would like the labels to be as less affected by thing such as confirmation bias. So,
after you have the database collected, annotated by multiple labelers you may like to do the
statistical analysis of the labels for the same samples where the labels are generated from
multiple labelers. So, that at the end we have one or multiple coherent labels for that data
point.
Now, let us look at some of the limitations and this also is linked to the challenges in voice
based affect analysis. What we have seen till now is that there is limited work on non
English language based affect analysis. You already saw the IIT KGP datasets which were
around the Telugu language and then the Hindi language. There are a few German speaking
datasets as well, Mandarin as well, but they are lesser in number, smaller in size as compared
to English only based datasets.
So, what that typically would mean is let us say you have a system which is analyzing the
emotion of a speaker speaking in English. You use that dataset and then you train a system on
that dataset. Now, you would like to test it on other users who are speaking some other
language.
Now, this cross dataset performance across different languages, that is a big challenge right
now in the community. Why is it a challenge? Because you have already seen the challenges
which are there are the three challenges you know what is being said, who said it and how it
was said. So, these will vary across different languages.
The other is limited number of speakers. So, if you want to create emotion detection in a
system based on voice which is supposed to work on a large number of users on a large scale,
you would ideally like a dataset where you can get a large number of users in the dataset. So,
that we learn the variability which is there when we speak right, different people will speak
differently, will have different styles of speaking and expressing emotions.
Now, with respect to the datasets of course, there is a limitation based on the number of
speakers which you can have there is a practicality limit. Let us say you wanted to create a
spontaneous dataset. So, if you try to increase the number of participants in the dataset, there
could be challenges such as getting the approvals, getting the approval from the participant
themselves and so forth.
Now, on the same lines, friends, the issue is that there are limited natural databases, and I have
already explained to you that creating a spontaneous dataset is a challenge, because if the user
is aware that they are being recorded, that could add a small bias. The other point is that the
privacy concern needs to be taken into the picture. So, for spontaneous conversations, whether
the proper ethics approvals and permissions have been taken or not, whether the ethics based
considerations are there or not, all of that affects the number and size of the natural databases.
Now, this is fairly new, but extremely relevant. As of today there is not a large amount of
work on emotional speech synthesis. So, friends till now I have been talking about you have
speech pattern, someone spoke, machine analyzed, we understood the emotion. But
remember we have been saying right affect sensing is the first part of affective computing and
then the feedback has to be there.
So, in the case of speech we can have emotional synthesis done. The user speaks to, interacts
with, the system; the system understands the emotion of the user; and then the reply back, let
us say that is also through speech, can have emotion in it as well, right. Now, with respect to
the progress in synthetic data generation, of course, we have seen large strides in the visual
domain: face generation, facial movement generation.
But comparatively there is a bit less progress in the case of emotional speech and that is due
to of course, you know the challenges which I have just mentioned above. So, this of course,
is being you know worked upon in several labs across the globe, but that is currently a
challenge how to add emotion to the speech. Some examples you can check out for example,
from this link from developer.amazon.com you know there are a few styles, few emotions
which are added. But essentially the issue is as follows.
Let us say I want to create a text to speech system which is emotion aware. So, I could input
into, let us say, this TTS, text to speech, system the text, that is, the text from which I want to
generate the speech, and, let us say, as a one hot vector, the emotion as well. Now, this will
give me the emotional speech. But how do you scale across a large number of speakers?
Typically, high quality TTS systems are subject specific; you will have one subject's text to
speech model.
Of course, there are newer systems which are based on machine learning techniques such as
zero shot learning or one shot you know where you would require lesser amount of data for
training; or, in zero shot, what you are saying is, well, I have the same text to speech system
which has been trained for a large number of speakers along with the text and emotion, and I
would also add the speech from the speaker for whom I want the new speech to be generated
based on this text input, right.
So, that is the challenge, how do you scale your text to speech system across different
speakers and have the emotion synthesized. The other, friends, extremely important aspect
which is the limitation currently is cross lingual emotion recognition. I have already given
you an example when we are talking about the limited number of non English language based
emotion recognition works.
So, you train on language 1 a system for detecting emotion, test it on language 2 generally a
large performance drop is observed. But one thing to understand is that for some languages it
is far more difficult to collect data and create databases than for some other languages, right;
some languages are spoken more and have a larger number of speakers, while other languages
could be older languages spoken by a smaller number of people. So, obviously, creating
datasets there would be a challenge.
Therefore, in the pursuit of cross-lingual emotion recognition we would also like to have systems where, let us say, you train a system on language 1, which is very widely spoken and for which the assumption is that you can actually create a large dataset. Then we can learn systems on that dataset and later do things such as domain adaptation: borrow information from that learned system and fine-tune on another language where we have smaller datasets, so that we can then do emotion recognition on data from the other, smaller dataset.
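As a rough illustration of the fine-tuning idea just described, here is a minimal PyTorch sketch, under the assumption that we already have an emotion classifier pretrained on a large language-1 corpus and that it exposes separate encoder and classifier sub-modules (both assumptions, not a prescribed design). Only the pattern matters: freeze the shared feature encoder and update just the classification head on the small language-2 data.

import torch
import torch.nn as nn

def adapt_to_new_language(pretrained_model: nn.Module,
                          lang2_loader,          # small language-2 DataLoader (placeholder)
                          num_epochs: int = 5,
                          lr: float = 1e-4):
    """Fine-tune only the classifier head of a pretrained emotion model."""
    # Assumption: the model has `encoder` (feature extractor) and
    # `classifier` (emotion prediction head) sub-modules.
    for p in pretrained_model.encoder.parameters():
        p.requires_grad = False                    # keep language-1 features fixed

    optimizer = torch.optim.Adam(pretrained_model.classifier.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()

    pretrained_model.train()
    for _ in range(num_epochs):
        for features, labels in lang2_loader:      # acoustic features + emotion labels
            optimizer.zero_grad()
            logits = pretrained_model(features)
            loss = criterion(logits, labels)
            loss.backward()
            optimizer.step()
    return pretrained_model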
Now, another challenge or limitation, and this is applicable not just to voice or speech but to other modalities as well, is the explanation part: why, given a speech sample, did the system say the person is feeling happy?
If you use traditional machine learning systems, for example decision trees or support vector machines, it is a bit easier to understand why the system arrived at a particular emotion, why it predicted a certain emotion. However, speech-based emotion recognition is now mostly done through deep learning. Even though deep learning based methods have state-of-the-art performance, explaining why the system reached a certain conclusion about the perceived emotion in the speech of a user is still a very active area of research.
So, we would like to understand why the system reached a certain decision with respect to the emotion of the user. Because if you are using this information about the emotional state of the user in, let us say, a serious application such as health and well-being, we would like to understand how the system reached that conclusion.
So, friends, with this we reach the end of lecture one on voice-based emotion recognition. We have seen why speech analysis is important, why it is useful for emotion recognition, and what the challenges are in understanding emotion from speech. From there, we moved on to the different characteristics of the databases, the data which is available for learning voice-based emotion recognition systems.
And then we concluded with the limitations which are currently there in voice-based emotion recognition systems.
Thank you.
Affective Computing
Prof. Jainendra Shukla
Prof. Abhinav Dhall
Department of Computer Science and Engineering
Indraprastha Institute of Information Technology, Delhi
Week - 05
Lecture - 02
Automatic Speech Analysis based Affect Recognition
Hello and welcome. I am Abhinav Dhall from the Indian Institute of Technology Ropar.
Friends, today we are going to discuss about the aspects of Automatic Speech Analysis based
Affect Recognition. This is part of the Affective Computing course.
So, in the last lecture, we discussed about why speech and voice-based emotion recognition is
a useful modality for understanding the affective state of the user. We discussed about the
application and challenges. Then we looked at some of the commonly used speech-based
affect recognition databases.
Today, we are going to discuss the feature analysis aspect. So, I will mention some of the commonly used hand-engineered features which are used in speech analysis. Then we will look at a system for normalization of the speech features, and later on we will see an example of affect-induced speech synthesis.
So, we are not only interested in understanding the emotion of the user from the speech modality; we are also interested in the feedback. If the feedback is in the form of speech which is generated by a machine, then how can appropriate emotions be added to the generated speech itself for a more engaging and productive interaction with the user, alright.
So, let us dive in. First we will talk about automatic feature extraction for speech. So, friends, the very commonly used features are referred to as prosody features. These relate to the rhythm, stress and intonation of the speech. They are generally computed in the form of the fundamental frequency, the short-term energy of the input signal, and simple statistics such as the speech rate, the syllable rate and the phoneme rate.
Now, along with this, we also have features for speech analysis based on the measurement of the spectral characteristics of the input speech. These are related to the harmonic or resonant structures, and they are typically based on the Mel frequency cepstral coefficients, the MFCCs, and the Mel filter bank energy coefficients.
In fact, MFCCs are one of the most commonly used speech analysis features, not only for emotion recognition but also for other speech related applications such as automatic speech recognition.
(Refer Slide Time: 03:43)
Now, the MFCCs are, as I mentioned, the most commonly used feature. From a practitioner's perspective, you can extract MFCCs with libraries such as librosa or openSMILE. The steps are as follows: you have continuous audio coming in and, since it is a continuous signal, we divide it into short chunks, that is, we apply windowing here. We then apply the discrete Fourier transform, which moves the incoming audio signal into the Fourier domain.
Then we apply the Mel filter banks to the power spectra which we get from spectral analysis of the DFT outputs, and we sum the energy in each filter. Later on, we compute the logarithm of these filter bank energies and take the discrete cosine transform. This gives us the MFCCs, where we keep certain DCT coefficients and discard the others.
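Since librosa was mentioned above, here is a small sketch of how a practitioner would usually obtain these coefficients; librosa.feature.mfcc internally performs the framing, DFT, Mel filtering, log and DCT steps just described. The file name and parameter values are only illustrative.

import librosa

# Load an utterance (path is only a placeholder).
y, sr = librosa.load("utterance.wav", sr=16000)

# 13 MFCCs per frame: framing + DFT + Mel filter bank + log + DCT
# are all handled inside this one call.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                            n_fft=400, hop_length=160)  # 25 ms windows, 10 ms hop

print(mfcc.shape)  # (13, number_of_frames)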
(Refer Slide Time: 05:08)
Now, if you look at the prosody-based features, we are interested in analysing the intensity and the amplitude of the input signal, wherein you can use loudness, which is a measure of the energy in your input acoustic signal. We also very commonly use the fundamental frequency. The fundamental frequency is based on the approximation of the frequency of the quasi-periodic structure of the voice.
And based on the fundamental frequency, from the perception perspective we have the feature attribute called pitch, which is essentially the lowest periodic cycle of your input acoustic signal; it is commonly referred to as the perception of the fundamental frequency. Strictly, this means you would require listeners to quantify the pitch of an input signal, but from a compute perspective, where we have an automatic system, we compute the fundamental frequency instead. Then, friends, moving on, we have the formant frequencies F1 and F2.
You can estimate the quality of a voice through these, wherein we look at the concentration of the acoustic energy around the first and second formants. Another commonly used prosody feature is the speech rate; as the name suggests, we are interested in computing the velocity of the speech, which is basically the number of complete utterances or elements produced per time unit. The other one which is very commonly used is the spectral energy.
Through the spectral energy you can compute the timbre, which is essentially the relative energy in the different frequency bands of your input signal. Now, a point to note here is that these are very fundamental, basic features which have been extensively used in speech analysis, both for emotion understanding of a user and in other applications. But again, features such as the MFCCs are hand-engineered features.
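As a rough illustration of how a practitioner might compute a couple of the prosody descriptors mentioned above, here is a short librosa-based sketch extracting a fundamental frequency track (with the pYIN estimator) and a frame-level loudness proxy (RMS energy). The file path, frequency bounds and summary statistics are illustrative choices, not prescribed values.

import numpy as np
import librosa

y, sr = librosa.load("utterance.wav", sr=16000)   # placeholder path

# Fundamental frequency (F0) track via the pYIN estimator.
f0, voiced_flag, voiced_prob = librosa.pyin(
    y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C7"), sr=sr)

# Simple loudness proxy: root-mean-square energy per frame.
rms = librosa.feature.rms(y=y)[0]

# Summary statistics often used as utterance-level prosody features.
features = {
    "f0_mean": np.nanmean(f0),   # NaN frames are unvoiced
    "f0_std": np.nanstd(f0),
    "rms_mean": rms.mean(),
    "rms_std": rms.std(),
}
print(features)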
So, friends, this is similar to what we discussed for automatic facial expression recognition, where early on histogram of oriented gradients and scale-invariant feature transform (SIFT) features were used, but the community in academia and industry moved to representation learning through deep learning, and now we have pre-trained networks through which we extract the features.
Later on I will show you an example of how the community in speech analysis has moved to a representation called the spectrogram, which is ideal for using convolutional neural networks. Now, based on the features which we have discussed, what is observed is that positive voices are generally loud, with considerable variability in the loudness attribute.
They have a high and variable pitch, and they are high in the first two formant frequencies F1 and F2. Further, it is observed that variations in pitch tell us about the differences between high arousal emotions, so again we are talking about the valence and arousal dimensions, where a high arousal emotion could be, for example, joy, and low arousal emotions could be, for example, tenderness and lust, when compared with neutral vocalizations. So, pitch is an important feature for separating high and low arousal emotions.
Now, what typically happens is that after we have extracted a feature, we will have these features extracted from n different data points, speech data points which come from different sources and different speakers.
So, there is a very important step in the pipeline for speech-based emotion recognition, which is the normalization of the features. An example which sets the motivation is as follows. When you have angry speech, it is observed to have higher fundamental frequency values compared with the fundamental frequency values for neutral speech.
Now, this difference between the two emotions is actually blurred by the inter-speaker differences. What that means is: you have n speakers who, let us say, each speak both an angry and a neutral utterance. For a single subject we do observe differences between the fundamental frequency of the angry and the neutral speech.
But if you have multiple subjects, then because of the differences in the speaking styles of these subjects, because of this intra-class variability, the gap between the fundamental frequency of the angry speech and the neutral speech will vary. So, you will observe variation across different speakers: the observation that angry speech has a higher fundamental frequency than neutral speech would be blurred in some speakers' cases but very evident in others'.
Therefore, speaker normalization is used to accommodate these kinds of differences which are introduced into the dataset due to the differences between speakers. Now, here is an example based on gender, friends. For the fundamental frequency, men would typically have a signal between 50 and 250 hertz, and if you look at women subjects it would be between 120 and 500 hertz.
So, there is a difference in the range here. Now, for feature normalization there is a very simple and commonly used approach, the Z-score normalization. What do we do in Z-score normalization? We say: well, I am going to take each feature and transform it so that it has zero mean and unit variance across all the data.
Please notice this is actually a very common technique which is not limited to speech; it is applied to vision and text features as well. Another approach is the min-max approach. You take the minimum and the maximum value across the whole dataset as the bounds, and then you map the features from all the speech data points based on this minimum and maximum. So, you are either stretching or squeezing the feature values for a particular data point. Then, friends, other approaches are based on normalization of the distribution, where you would like the feature which you extract from speech to follow a normal distribution.
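As a quick illustration of the two normalization schemes just described, here is a small sketch using scikit-learn; the feature matrix is a random placeholder standing in for, say, utterance-level MFCC statistics (rows are data points, columns are features).

import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Placeholder feature matrix: 100 utterances x 13 features.
X = np.random.randn(100, 13)

# Z-score normalization: each feature gets zero mean and unit variance.
X_z = StandardScaler().fit_transform(X)

# Min-max normalization: each feature is mapped into [0, 1]
# using the minimum and maximum observed across the dataset.
X_mm = MinMaxScaler().fit_transform(X)

print(X_z.mean(axis=0).round(3), X_z.std(axis=0).round(3))
print(X_mm.min(axis=0), X_mm.max(axis=0))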
Now, here is a problem with these standard yet very commonly used techniques. The problem is that these feature normalization techniques can adversely affect the emotional discrimination of the features. Our observation was that, due to the differences between speakers, the observable difference between the feature values for different emotions varies.
So, we apply some normalization technique. However, since this normalization is generally applied across the whole dataset, it can sometimes reduce the discriminative power of the feature for certain subjects, as we are comparing all the subjects together.
So, to mitigate this, I will now mention one simple yet very effective technique for feature normalization, which is called Iterative Feature Normalization (IFN). The motivation for IFN is as follows. As we have seen, applying a single normalization across the entire corpus can adversely affect the emotional discrimination of the features.
Therefore, let us try to work on this aspect, that of applying the normalization over the entire dataset. What Carlos Busso and his team proposed in 2011 was: let us estimate the normalization parameters using only the neutral, non-emotional samples. So, let me have a baseline, a reference, and what can be a good reference? Well, the neutral emotion utterances identified in the audio samples.
Once I know what the neutral utterances or neutral samples in my data are for a subject, I can treat them as a baseline to normalize the other samples. Now, friends, similar methods are also used in facial expression analysis with videos, wherein one could normalize the geometric features of a face based on the neutral expression. Typically, in a neutral expression you will observe that the lips are closed.
So, the distance between the lip points is very small, almost negligible, and that is used as a baseline for comparing with, let us say, when the mouth is wide open. Now let us come back to speech and talk about iterative feature normalization. You get the input audio signal and we are interested in feature normalization. So, what do we do?
We use automatic emotional speech detection, which gives us two labels indicating whether a segment is neutral speech or emotional speech. Once we know which segments are neutral, we use those to estimate the normalization parameters. We do this iteratively and then we converge to the final normalization.
So, here are the steps one by one. First, we take the acoustic features, which could be any features, friends, your MFCCs, your F0s and so forth, without any normalization; we do not apply any Z-score normalization or min-max approach. We use these features to detect expressive speech, essentially deciding which part of the speech is neutral and which is showing some emotion, so it is a kind of binary classification problem.
Now, the observations which are labeled as neutral, that is, those parts of the speech or those speech samples which are labeled as neutral, are used to re-estimate the normalization parameters. As the approximation of the normalization parameters improves, the performance of the detection algorithm is expected to improve, and this in turn gives you better normalization parameters.
So, this is an iterative process, and it is repeated until only a certain percentage of files in the emotional database change their label between successive iterations. In the original work this threshold is set to 5 percent, but you can vary that empirically as well. So, what you are saying is: detect the neutral samples, update the parameters, then again detect neutral and expressive speech, get the newer parameters, and keep running this iterative process until it converges.
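The loop just described can be summarised in a short sketch. This is only an illustration of the iterative idea, assuming we already have per-utterance feature arrays and a placeholder neutral-versus-emotional detector; it is not the exact algorithm or code of Busso et al.

import numpy as np

def iterative_feature_normalization(utterance_feats, detect_neutral,
                                    stop_fraction=0.05, max_iters=20):
    """Sketch of iterative feature normalization (IFN).

    utterance_feats : list of (frames x dims) feature arrays, one per utterance.
    detect_neutral  : placeholder detector; returns True if an utterance's
                      (normalized) features look neutral, False if emotional.
    stop_fraction   : stop when fewer than this fraction of labels change.
    """
    # Initial neutral/emotional detection on the raw, unnormalized features.
    labels = np.array([detect_neutral(f) for f in utterance_feats], dtype=bool)
    mean = std = None
    for _ in range(max_iters):
        if not labels.any():
            labels[:] = True                         # fallback: use all utterances
        # 1. Re-estimate normalization parameters from the neutral utterances only.
        neutral = np.vstack([f for f, n in zip(utterance_feats, labels) if n])
        mean, std = neutral.mean(axis=0), neutral.std(axis=0) + 1e-8
        # 2. Normalize every utterance with these parameters.
        normalized = [(f - mean) / std for f in utterance_feats]
        # 3. Re-detect neutral vs. emotional speech on the normalized features.
        new_labels = np.array([detect_neutral(f) for f in normalized], dtype=bool)
        # 4. Stop once few labels change between successive iterations.
        changed = np.mean(new_labels != labels)
        labels = new_labels
        if changed < stop_fraction:
            break
    return mean, std, labels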
Now, friends, what have we done? We said we have a bunch of features which can be extracted for analysis of the acoustic signal, and we then discussed the feature normalization part. After that we can apply different machine learning techniques. Here I am mentioning the commonly used machine learning techniques, and the reason to discuss them is as follows.
It depends on how we are extracting the audio features. Let us say this is your timeline and we have some audio feature which we are extracting. There is the duration of the window and the frequency of occurrence of the input, that is, at what rate am I getting the data. And let us say in this signal I consider one part and call it P1, and a later part and call it P of N. How important is it, for correctly predicting the emotion of P of N, to have prior information from, let us say, P1?
So, how much prior information is required? In other words, how much temporal variation is required across the windows, how much history do you need? That is based on the use case, and it would be one of the primary factors, along with the frequency, the time duration and, let me add here, the computational complexity, in deciding which machine learning technique you are going to use.
So, commonly in speech analysis, state-based models have been used: hidden Markov models and conditional random fields. Researchers have also used support vector machines and random forests, wherein they first compute the features from the whole sample; mostly you will take the whole speech sample, extract the features and then run either a support vector machine or a random forest.
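As a tiny, hedged illustration of this classical pipeline (utterance-level features followed by a standard classifier), here is a scikit-learn sketch; the feature matrix and labels are random placeholders standing in for, say, normalized MFCC statistics and their emotion classes.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Placeholder data: 200 utterances, 40-dim utterance-level features, 4 emotions.
X = np.random.randn(200, 40)
y = np.random.randint(0, 4, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Z-score normalization + support vector classifier in one pipeline.
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
clf.fit(X_train, y_train)
print("accuracy:", clf.score(X_test, y_test))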
Recently we have also seen researchers using deep learning based techniques, so either convolutional neural networks or recurrent neural networks. Now, the obvious question comes: well, if you want to use a convolutional neural network you would need, let us say, an image-like feature, right. We will come to this in the coming slides.
For RNN based learning the motivation is very similar to this. You have your speech signal, you divide it into chunks, and let us say you want to learn the feature from each chunk and also want to use the information from the prior chunks, from the history. So, let us say I am here at a particular cell: I would not only analyse the feature for this chunk, but I would also want some learning from the history, right. So, that is how you would commonly use RNNs as well.
Now, coming to the feature which I have been talking about, which is commonly used with convolutional neural networks and has been shown to give highly accurate speech-based emotion recognition: the representation of the audio signal in the form of spectrograms.
A spectrogram is essentially a visualization of frequency against time, and it also gives you the amplitude. What you see here, friends, are two spectrograms from the same subject: spectrogram one is when the speech was neutral and spectrogram two is when the speech was angry. You can very well see the differences between these two visualizations.
And since we are able to visualize this, I can treat a spectrogram as an image; I can assume that this is actually an image. Now, if this is an image, then I can use a convolutional neural network, train it with the spectrogram as input, and the output would be the emotion classes.
So, here is an example of a work by Satt and others, where they proposed an emotion recognition system in which you have the spectrogram as input. Then, similar to a traditional deep convolutional neural network, you have a series of convolutions, max pooling and so forth, and to capture the time dimension as well you essentially have a bi-directional LSTM on top. So, what are you doing?
You are saying: well, this was my audio signal over time; I have my S1, spectrogram one, S2, spectrogram two, and of course you could have overlapping windows as well, but this is just for visualization, S3 and so forth till S of N. These are all spectrograms; you input them here, you get the feature representation for each spectrogram, and then you use a recurrent neural network to finally predict the emotion categories, these emotion classes.
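To make the shape of such a pipeline concrete, here is a minimal PyTorch sketch of a CNN encoder applied to each spectrogram chunk, followed by a bidirectional LSTM over the sequence of chunk embeddings. It is a generic illustration of the CNN-plus-BiLSTM idea described above, not the authors' actual architecture; all layer sizes are arbitrary.

import torch
import torch.nn as nn

class SpectrogramEmotionNet(nn.Module):
    """CNN per spectrogram chunk + bidirectional LSTM over the chunk sequence."""

    def __init__(self, num_emotions: int = 4):
        super().__init__()
        self.cnn = nn.Sequential(                      # encodes one chunk (1 x freq x time)
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.AdaptiveAvgPool2d((4, 4)),              # fixed-size output regardless of chunk size
        )
        self.lstm = nn.LSTM(input_size=32 * 4 * 4, hidden_size=64,
                            batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * 64, num_emotions)

    def forward(self, chunks: torch.Tensor) -> torch.Tensor:
        # chunks: (batch, num_chunks, 1, freq_bins, time_frames)
        b, n = chunks.shape[:2]
        feats = self.cnn(chunks.flatten(0, 1))         # (b * n, 32, 4, 4)
        feats = feats.flatten(1).view(b, n, -1)        # (b, n, 512)
        seq_out, _ = self.lstm(feats)                  # (b, n, 128)
        return self.classifier(seq_out[:, -1])         # logits from the last time step

# Example: a batch of 2 utterances, each split into 5 spectrogram chunks of 64 x 100.
model = SpectrogramEmotionNet()
logits = model(torch.randn(2, 5, 1, 64, 100))
print(logits.shape)  # torch.Size([2, 4])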
Now, with this we have seen how emotion is predicted using the speech signal. And, friends, in the tutorial which follows this lecture, for this very week, you will also see a detailed example of how to create a simple speech-based emotion recognition system, starting from getting a dataset, then extracting different features and trying out different classifiers.
So, up to this point we have covered the recognition part, which falls under the affect sensing step of affective computing. Now, let me give you an example of speech synthesis where emotion is also added. Here we have two samples which are generated using Amazon's Alexa. The first one is speech synthesis where the speaker sounds disappointed; let us play that. I am playing a single hand and what looks like a losing game.
The second one is when the speaker sounds excited; let us hear this one. I am playing a single hand and what looks like a losing game. By listening to these two samples one can easily tell what emotion is reflected, right. So, what this means is that when we want to generate speech, you have a text-to-speech system which takes the text as input and generates the required speech; for emotional speech synthesis, along with the text input we also need to give as input the emotion class, or it could be, let us say, the valence and arousal intensities. Now, with these two inputs to your TTS, one would get emotion-enhanced speech synthesis. This is a very nascent area and we see that there is a lot of work going on to generate emotion-enhanced speech.
Now, here is an example by Sivaprasad and others, who generate speech with emotional prosody control. Let us look at the framework, the system which is proposed. What we have here is, first, the text input; this is the text for which they are going to generate the speech.
Then we have the second input, which is the speaker style; this is the speaker reference waveform. And the third, friends, is your target emotion, the values for arousal and valence. Now, the text goes into a phoneme encoder, so you extract a representation of the required phonemes for the input text statement.
Then the speaker waveform is input into a speaker encoder, which extracts the characteristics of the speaker, essentially the style with which one speaks. These are concatenated and input into an encoder. In parallel we have the arousal and valence (AV) values, which are concatenated together and later concatenated with the output of the phoneme and speaker encoders.
There is then a duration predictor to predict the duration of the samples of the words which are going to be generated. Further, this is input into a length regulator, where the energy of the required word is predicted and the pitch is predicted, and the outputs of the length regulator, the energy predictor and the pitch predictor are combined and fed into a decoder.
So, this is your decoder. Now, friends, this gives you a spectrogram, the visualization of your frequencies, and from this we can generate speech which is controlled by the emotional state, the target emotion. So, what this means is that, from a bird's eye view, the system requires the emotion, a representation for the emotion, and a series of encoders and decoders.
Now, let us look at some of the open challenges in speech-based affect analysis. The first is inter- and intra-speaker variability. We have different speaking styles, and added to that, let us say, people coming from different cultures speaking the same language. Let us say you have a person of Indian descent, a person of Asian descent and a Caucasian person, and all three are speaking English.
So, these are, let us say, the ethnicities of the subjects in your dataset. Now, this is going to lead to a lot of variation, even though everyone is speaking the same language, because of differences in pronunciation and differences in the style in which we speak, right. So, for generalization of a speech-based affect analysis system we would require generic representations.
Now, we will observe that when we want to have these generic representations, they will also have to be agnostic to, or at least to have analysed and abstracted away, the different displays of emotion and the differences in individuals' vocal structures. Even if, let us say, you have only Indian speakers, they will have different vocal structures, which will lead to variability in the captured data.
Further, a speaker can express an emotion in any number of ways, and the expression is influenced by the context. The context here can again be where the person is and with whom the person is interacting, so the emotion would be reflected in different ways. Let us say you say the same statement to a friend, and compare that with speaking the same statement to, let us say, an elder, right.
The same statement would then differ either in the intensity of the emotion or could carry different emotions altogether. This adds to the variability: the linguistic content remains the same, but the emotion changes. The second open challenge is: what aspects of the emotion production-perception mechanism are actually captured by the acoustic features?
We have seen that different features are proposed for analysis. Some features extract a particular attribute of your signal, and when you are trying to perceive the emotion, one feature could be better, let us say, for arousal, while another could be better for valence, right. So how do you actually choose the right balance? Maybe fusion, that could be one answer.
The third open challenge, friends, is that speech-based affect recognition can be exhaustive and computationally expensive, and hence it can have limited real-time applicability. We have seen in lecture 1 for speech the trade-off between the duration of the sample window and how close to real time the system needs to be.
If you have a longer-duration sample, it can be more computationally demanding, but it may contain far more detailed information, which could be required for an accurate prediction. However, a shorter-duration sample would have less information but could be processed closer to real time. So, this is an open challenge: if you want the rich information, how can we obtain it closer to real time?
Now, along with this, one more aspect comes into the picture. Even if you have a window of audio which gives you features good enough for predicting the emotion of the person at that very time stamp, there could be things such as background noise in that sample, let us say background music playing where the subject is. So, we would require a noise removal step as well, either before feature extraction or, it could be, after feature extraction.
This also affects the computational cost and the ability to run in real time.
Now, let us look at some of the research challenges, friends. A very commonly used benchmarking platform is the Interspeech 2009 Emotion Challenge. Another one which is very commonly used in the community is the Audio/Visual Emotion Challenge, the AVEC challenge, which has several tasks for emotion; it is also used for multimodal, audiovisual emotion recognition, but the audio here is also very rich.
Then we have the Interspeech Computational Paralinguistics Challenge. This one is actually a very commonly used benchmark in the speech-based affect recognition community and has been running for years with different tasks related to affect. Then we also have the Emotion Recognition in the Wild Challenge, that is, emotion recognition in the wild.
Here you have audio and video, but the audio itself is a combination of background music, background noise and the speaker's voice, so it is also used for understanding affect from audio. These are openly available resources which anyone can access, subject to the licensing agreements of these resources, and then you can use them for creating and evaluating speech-based emotion recognition works.
Now, along with this I would also like to mention the Multimodal Emotion Challenge, the MEC 2016 and 2017 challenges, which are in the Chinese language. So, friends, with this we come to the end of the second lecture on speech-based emotion recognition. In it we briefly touched upon the hand-engineered features which have been commonly used in speech analysis, the prosody-based features and the MFCCs.
Later on, we talked about an important step in speech analysis for emotion prediction, which is the normalization of the features; to this end we looked at the iterative feature normalization technique. Then we looked at the different machine learning techniques which are used for affect prediction, and later on we touched upon the concept of speech synthesis where emotion is also induced into the generated speech.
Thank you.
Affective Computing
Prof. Jainendra Shukla
Prof. Abhinav Dhall
Department of Computer Science and Engineering
Indraprastha Institute of Information Technology, Delhi
Week - 06
Lecture - 01
Emotion Analysis with Text
Hello and welcome. I am Abhinav Dhall from the Indian Institute of Technology Ropar.
Friends, today we are going to discuss about analysis of emotion of a user through the text
modality and this is part of the ongoing lecture series in Affective Computing. So, till now we
have discussed how we can use the voice signal of a user, the facial expressions, the head
pose of a user to understand the emotion which either the user is feeling or the one which is
perceived.
Now, there is another modality which is very commonly used in the affective computing community, and that is text; the reason is quite obvious. We see so many documents around us, and we see conversations happening on chat platforms. You would have seen billions of users on platforms such as WhatsApp, Telegram and so forth, and they are conversing through text.
So, how can we understand the affective state of a user when the communication medium is text? And in contrast to the analysis of faces and voice, what is really interesting to note with text is that it is not just about what is being communicated by a user in real time; it is also about, let us say, looking at the emotion conveyed in a chapter of a book, or the emotion conveyed by a particular comment which was posted a few months ago on a website.
So, both online and offline text is being analyzed for understanding emotion in different use cases.
(Refer Slide Time: 02:24)
So, moving forward in this lecture, first we are going to talk about why text is important as a modality for understanding the affect of a user. Then we are going to look at some of the applications. From there we will go a bit back in time and explore how emotion has been conveyed through typography, how different fonts are used to convey different emotions to a user when he or she is reading certain text.
And then we will move on to the databases. So, we will look at some of the resources which are available in the affective computing community where text has been labeled for its perceived affect.
(Refer Slide Time: 03:15)
Now, let us say we are typically looking at a conversation, and this conversation is between a user and a virtual avatar. We can have the camera modality, we can have the microphone, right. Now, when you are recording what the person is saying, you can also use speech-to-text in this pursuit and understand the lexical content.
What is being said conveys the emotion; so does how it is being said, whether the communication is more metaphorical, or whether there are implicit meanings being conveyed, right. So, there are several applications which are based on this construct of how emotion is conveyed through text.
Now, typically when we talk about emotions in text, friends, you will see that in the literature the keywords emotion and sentiment are sometimes used interchangeably. However, there is a simple yet extremely important difference. Emotion is what a user is feeling: either the user can tell us his or her emotional state, or it is what is perceived by another person who is communicating with the user, the perceived emotion.
Now, with respect to text, let us say you are reading a comment about a certain product on a website. In this case, we are interpreting, as a third person, the affect conveyed by the written text in the comment. This is the sentiment. So, what is the sentiment which particular lines of text communicate to the user who is reading them?
Now, this type of sentiment analysis in text is based on categorizing the text according to its affective relevance. So, you read a comment on a website regarding a product; after you finish reading, you can actually tell whether the user who wrote that feedback felt positive or negative about the product, right.
Further, this is not limited to comments on online platforms; it is also very commonly used in understanding opinion for markets. Let us say a news article comes in about a certain company or a certain product: what is the sentiment conveyed in that news article? With that, we can try to understand, let us say, the general public opinion about the particular context regarding which that news article is written. Emotion analysis is also commonly used in computer assisted creativity. An example of that is a personalized advertisement which can be created by understanding the affective state of the user, and the same can then be conveyed in the form of text to the user.
So that the appropriate emotion is conveyed in the text. This can lead to a more persuasive communication with the user, right. Let us say the text which you read in an image-based advertisement uses words which convey, let us say, the excitement which a product could bring to a user; that can be conveyed within the text, through how the framing of the text is done. Further, for adding expressivity in human computer interaction, we use the emotion which is reflected through text.
What does that mean? Let us say a user is interacting with a machine and the machine is going to give verbal feedback. On the user's side there would, of course, be a speech-to-text conversion, which gives us the lexical content of what the user was saying. And when the machine has to give its verbal feedback, it will do text-to-speech synthesis.
What that means is that the appropriate text which needs to be sent back to the user can have emotional words added to it so as to convey the appropriate emotion, right. So that means word selection for conveying the correct affect. And then, of course, the understanding perspective, the speech-to-text of what the user says, is extremely important.
And in both these cases text-based affect analysis comes into the picture. You need to pick the right words to convey to the user, and you need to understand the text which the user spoke to you. The same, friends, is also applicable in a use case such as a question answering system: how to convey the emotion behind, let us say, the answer which a machine is giving to a user, right.
That would be based on the word selection and so forth. That means your system needs to pick the right words which will convey the right emotion.
Now, from this let us move to how emotion has been conveyed over the past decades in the field of typography. Typographers would typically use these minute details to shape how the content is interpreted in terms of the emotion which the text is supposed to convey.
Now, friends, if you notice the letters, the uppercase A and lowercase a, in two different fonts, Georgia and Verdana, you can actually see that the way these characters are rendered in the two fonts will, in a very subtle way, convey the affect which is supposed to be understood by the user. This is not limited to the shape of the characters; it extends to their placement as well.
Now, observe this example here. In this case you notice that there are differences in the gaps between the letters. If one wanted to read this stanza, it would not only be a bit difficult to read out, it would also be non-trivial, at times, to understand the affect which is conveyed. If we improve this in the second example, where at least the gaps are more standardized, we see that the reading, and further the interpretation, improves.
However, when we introduce symmetry with respect to these gaps, as in the third case, notice how easy it is to read: "and poetry must be collected and published to lay the base for a modern culture" and so forth, right. So, the emotion is now conveyed more clearly and the content is also understood by the user.
Now, there is a very interesting work from 1924 by Poffenberger and Barrows. What they are saying is that the way in which a line is drawn, the style of the line, would represent the emotion which is supposed to be conveyed by, let us say, a word written using that style of line.
Notice this, friends. Here you have different styles in which lines can be drawn, for example A to R are the different styles. With the curvature and the frequency of change of the line, you could have, let us say, smooth lines or you could have these zigzag, seesaw-pattern lines.
Using these, you can convey different types of emotions. For example, when we want to convey an angry, furious kind of emotion, we can use these seesaw, angular, forward-sloping lines, and when we use these shapes to create words, for example here you see the word angry, you can actually sense that the intensity of the emotion which the word is supposed to convey varies between the two iterations.
Of course, this is very subjective. I would say that the second iteration of angry, to me, is far more intense compared with the first iteration, where smoother lines are used, right. So, if you use these angular patterns, you are able to convey a far more intense angry emotion.
Now, let us pick the sad emotion. In this case, if you have a gentler curve which slopes down, and you use the suggested typeface equivalent, you can observe that it slants backwards and slopes down. Text in this style would be perceived a bit more in keeping with the gravity of the content it is conveying.
If you compare that with happy or friendly, it is observed that gentle, balanced, curved lines which do not slope down like the sad ones convey this more easily, right. So, just by changing the typeface, the orientation of the basic components which come together to form the word, we can have the user interpret the emotion from the text easily, and this is how you can convey the emotional content clearly to the user. Now, let us look at a relatively recent work.
In their work from 2008, Juni and Gross got a set of labelers (Refer Time: 15:40) who added ratings. These ratings were given to articles as being, let us say, funnier and angrier, or in other words more satirical, and they wanted to compare the Times New Roman font with the Arial font.
This kind of questionnaire was given, and what you can see here is, let us say, the question "which looks the happiest" with these three options, right. Of course, this is subjective, but the orientation in which the text is written, which is of course the font style, was shown to have an effect on the user.
So that means the same word would be conveyed a bit differently in its emotional content. They found, for example, in this case, when you look at calm, that the word calm itself is better conveyed in the second version because of the stability in the letters, right.
Now, let us look at another very relevant and interesting work. In this work, Larson and Picard wanted to understand aesthetics and its effect on reading. What do we have here? We have two versions of the same text, and if you observe closely, they are in different font styles.
What they found was that good typography affects the perceived affect and the real affect of the person, right. It could elevate the mood when you are reading, let us say, this article as compared to the same article in a different style.
(Refer Slide Time: 17:52)
To this end they looked at two tasks. The first, friends, is the relative subjective duration, RSD. In this, the participants' perception of how long they had been performing a task was evaluated. What was seen is that under poor typographic conditions, the participants underestimated the duration of the task by 24 seconds on average.
So, let us say they had been performing a task for n seconds, reading text presented under poor typographic conditions; they would say the duration for which they had been working on the task is roughly 24 seconds less than the actual duration, because under these conditions they had more cognitive load due to the poor typography.
However, when text in the good typography condition was presented, they underestimated the duration by an average of 3 minutes and 18 seconds, which means the reading was far easier, and that is why the perception of time passing was faster. What we learn is that good quality typography is responsible for greater engagement during the reading task: the user is more engaged and it is a more immersive experience.
Now, they also used the candle task. This is a very old cognitive task from 1945, proposed by Duncker. The task is: you have a candle, you have matchsticks, you have some pins. What you want to do is take the candle and attach it to the wall in such a way that when you burn the candle, the wax does not fall on the table, right.
In this particular study, 4 of the 10 participants in the good typography condition solved the task after reading the instructions, while 0 of the 9 participants in the poor typographic condition did. So, of course, the participants who were given the same content in good typographic conditions performed better in this case.
Now, with respect to how emotion is represented, I have already used, in the beginning, terms such as the positive or negative emotion which is perceived, let us say, from a comment. But in a lot of cases, when you are looking at text, this might not be enough. We see that in works looking at the text modality for emotion, positive versus negative is very commonly used.
Fine-grained emotion annotation, even though it is more effective, is used less often in these works, and of course there are very obvious reasons for this: data annotation effort and so forth. Now, let us take an example, friends, with two emotions, fear and anger. Both of these express the negative opinion of a person, let us say, towards something.
But the latter is more relevant in marketing or socio-political monitoring of public sentiment, right. So, anger is more relevant there. Yet both fear and anger are negative; that means we need a more fine-grained representation of emotion when we are looking at text.
It is also shown that when people are fearful, they tend to have a more pessimistic view of the future, while angry people tend to have a more optimistic view, right. Even though both fear and anger are negative in nature, the resulting intention of people is different. So, when we are trying to understand users' affective states through the text modality, fine-grained annotation is clearer and more useful.
Further, fear is generally a passive emotion, while anger is more likely to lead to an action, right. So, there is a very fundamental difference in what a user would do after experiencing these two emotions. Therefore, you would like to have a more fine-grained representation, not simply positive or negative.
Now, the question of course comes, friends: when you are looking at text and you want a fine-grained emotion representation, do we go for categorical classes or do we look at continuous emotion representations along dimensions such as valence, arousal and dominance? The dimensional model has been used far less in the emotion detection literature, but in recent works it has been shown to be quite promising in the case of text data.
Further, if you look at an example, it is essential to identify the difference between fear and anger, right. Because fear and anger lead to different subsequent objectives, we would, let us say, want to look at categorical labels; but because categories are difficult to label, we can also have the same emotions represented on the continuous dimensions.
Now, when you look at fear: on the valence axis fear is negative, on the arousal axis the intensity can be low or high, and on the dominance axis fear is submissive. Compare that with the anger emotion: on the valence axis it is negative, the same as fear, and on the arousal axis it can again be low or high intensity, but on the dominance axis anger is dominant rather than submissive. Beyond that, the difference lies in where exactly fear and anger are placed on the valence and arousal dimensions, and that has an effect on better understanding the emotional state of the user.
Now, in the same direction, let us look at some examples. These examples convey how complex emotions can be when we are analyzing the emotion conveyed by a user in the text modality. Let me read a statement, friends: “The cook was frightened when he heard the order, and said to Cat-skin, You must have let a hair fall into the soup; if it be so, you will have a good beating”.
Here you have a complex statement where the sub-parts also reflect different emotions; however, it is mainly expressing fear. Now, let us look at the second example; the statement is “When therefore she came to the castle gate she saw him, and cried aloud for joy.”
Now, if you notice the content, you have the word cried and you have the word joy. Is it sad? Is it happy? We can only understand the meaning, in terms of emotion, when we look at the whole statement and its semantic meaning, right. This statement is actually an expression of joy even though the word cried is there, and this reflects how complex the mapping of emotion to text can be.
Now, let us look at another example, friends. The statement is “Gretel was not idle; she ran screaming to her master, and cried: You have invited a fine guest”. If you look at the emotion, this is actually an expression of anger and disgust. However, notice the end of the statement, “you have invited a fine guest”: there is a bit of anger and a bit of sarcasm as well, but it is only understood when you analyze the whole statement, right.
So, you can see the relation between some words towards the end and words at the beginning, right. The emotion is conveyed when you start linking the words, and this represents the complexity of emotions when you are looking at text.
Now, emotions can be implicit, right. Emotion expression is very context sensitive and complex; we have seen that in these three examples. It is noted that a considerable portion of emotion expressions are not explicit, so you would be trying to understand the conveyed emotion implicitly through the text. For example, ‘be laid off’, or another phrase, ‘go on a first date’.
These contain emotional information without using any emotional lexicon, so there is an implicit emotion representation here. In 2009, Erik Cambria and others proposed an approach to overcome this issue by building a knowledge base which merges common sense and affective knowledge.
For example: spending time with friends causes happiness; getting into a car wreck makes one angry, right. So, by adding common sense and affect knowledge, we know that if this event happens, this is the generic affect which would be there.
Emotions can also be represented metaphorically, right. If you look at expressions of many emotions such as anger, they are metaphorical; therefore, they cannot be assessed by the literal meaning of the expression. Listen to these examples, friends: ‘he lost his cool’ or ‘you make my blood boil’. These are metaphorical in nature: the blood is not actually boiling, it is the emotion of anger which is being conveyed, right.
Now, it is difficult to create a lexical or machine learning method which can identify emotions in such text without first solving the problem of understanding the metaphorical expression, right. In these kinds of statements, we need to understand the metaphor so as to be able to solve the riddle of what emotion is being conveyed by the text.
Now, emotions have a very complex conceptual structure, and this structure can be studied by systematic investigation of expressions that are understood metaphorically. Given that they are such a complex construct, we need to understand the expressions which are understood metaphorically.
That means, of course, that if you wanted to create a system which could understand implicitly presented emotions and metaphorical ones, you would need appropriate datasets, and in those you would need samples where emotion is presented metaphorically.
Now, in the same direction, friends, the problem of detecting emotions from text is more of a multi-class classification problem. This is simply because there is so much complexity in human emotion: how people present emotion, what they say and how they say it. Of course, this is all about the inter- and intra-subject variability.
Then, some speakers would utter a statement in which the emotion is represented only implicitly, as we saw in some examples, and some subjects very commonly use metaphors, which would mean that first we need to understand the metaphor itself.
And then there is the importance of context in identifying emotions: you need to understand in what context something was said, right. Is the context a comment about a product? Is it someone orating, or someone telling us about how their day was? In that particular use case scenario you would then be understanding the emotion, right. So, the system needs to understand the context as well.
Further, as we have seen with voice-based and face-based emotion analysis, cross-cultural and intra-cultural variation of emotion is there. Even if you are looking at the same language, different people will express the same content in different manners, which means there is a lot of variation even when, let us say, the subjects are trying to convey the same emotion, just in different styles, right. So, this actually makes the task complex, and that means we require a multi-class approach here.
And then, of course, there are many more challenges: when you are talking about text, the moment you go to different languages, emotion will be represented differently, metaphors will be different, phrases will be different, and the systems will need to adapt to these different language and cultural scenarios. So, there are a large number of challenges when we are looking at text-based emotion analysis.
Now, friends, let us look at some of the standard resources which are available for text-based emotion understanding. There is a dataset called ISEAR, which is the International Survey on Emotion Antecedents and Reactions, by Scherer and Wallbott from 1994.
In this, 3000 people were asked to report situations in which they experienced each of the seven major emotions, so it is a categorical approach, and they were also asked how they reacted to them, right. Essentially, the user recounts an event and writes about it, and while writing about and recalling the event, the emotion is elicited.
The other one, friends, is the EmotiNet Knowledge Base. In this, the authors started from around 1000 samples from the ISEAR dataset and clustered examples within each emotion category based on language similarity.
Another dataset is Alm's Annotated Fairy Tale dataset, proposed in 2005. In this dataset there are 1580 sentences from children's fairy tales, annotated with the categories from Ekman's classical categorical emotion representation. Then there is the SemEval-2007 data by Strapparava and Mihalcea, in which we have 1250 news headlines extracted from news websites, again annotated for the categorical emotions.
(Refer Slide Time: 35:55)
Now, let us look at another resource. Friends, this is also a very commonly used resource, called the Affective Norms for English Words, ANEW. ANEW contains 2000 words which are annotated based on the dimensional model, that is, with the dimensions of valence, arousal and dominance.
Here, for example, you see the words along with the mean valence, arousal and dominance presented for these particular words, and in brackets you see the standard deviation. These are again collected from a large number of labelers, and that is how you have the mean and standard deviation.
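As a rough sketch of how such a dimensional lexicon is typically used, the snippet below averages the valence, arousal and dominance values of the words found in a sentence. The three-word dictionary is a made-up placeholder standing in for real ANEW entries, and real systems would also handle negation, weighting and out-of-vocabulary words.

import numpy as np

# Placeholder lexicon: word -> (valence, arousal, dominance), made-up values.
vad_lexicon = {
    "joy":    (8.2, 5.9, 6.1),
    "losing": (2.7, 5.1, 3.4),
    "calm":   (7.0, 2.5, 6.0),
}

def sentence_vad(sentence: str):
    """Average the VAD values of the lexicon words present in the sentence."""
    tokens = sentence.lower().split()
    hits = [vad_lexicon[t] for t in tokens if t in vad_lexicon]
    if not hits:
        return None                      # no lexicon coverage
    return np.mean(hits, axis=0)         # (valence, arousal, dominance)

print(sentence_vad("she cried aloud for joy"))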
(Refer Slide Time: 36:45)
Now, if we observe the plot of pleasure versus arousal, where the data points are for men and women, you notice that these are quite inter-related, they overlay each other. However, you will see that the data coming from men labelers is spread a bit more towards the outside than the female labelers' data, so for women the arousal and pleasure ratings are more concentrated. This is just to show you that there is a bit of a difference in the labels based on gender as well. Of course, when you have n labelers who are labeling the perceived affect of certain statements, text in this case, then we are going to observe some differences.
And that is why we would like to compute basic statistics over the labels assigned by different labelers, so that we can have a final label for the statement.
(Refer Slide Time: 37:51)
Another resource, friends, is the SentiWordNet. Now, in this case the lexical resources that
focuses on the polarity of subjective term. So, this is a bit different approach. What we are
saying is for mining of the opinion you can actually have an objective score which is 1 minus
positive plus negative, ok. Now, the rationale is as follows: the given text has factual nature.
If there is no presence of a positive or a negative opinion on it, ok.
Now, let us look at an example. Here you see the positive score, the negative score and the opinion-related score. If you pick up a phrase, for example 'a short life', there is a word with a bit of a negative connotation here, right. So, nothing positive, P equals 0, the negative score is 0.125, and then of course we subtract both from 1, giving an objectivity of 0.875.
Let us take another example, friends. Let us say the usage is 'he was short and stocky', 'short in nature', 'a short smokestack', right. Here you have a stronger negative connotation and the objectivity score is 0.25, as compared to the 0.875 you had earlier. So, this is the opinion-related score for the statement.
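To make this concrete, here is a minimal sketch of the objectivity computation using NLTK's SentiWordNet interface; the use of NLTK and the choice of part-of-speech tag are assumptions for illustration, since the lecture does not prescribe a specific library.

```python
# A minimal sketch (assumption: NLTK with the 'wordnet' and 'sentiwordnet'
# corpora downloaded) of the objectivity score 1 - (positive + negative).
from nltk.corpus import sentiwordnet as swn

def objectivity(word, pos='a'):
    """Average 1 - (pos + neg) over the synsets of `word` (pos='a' means adjective)."""
    synsets = list(swn.senti_synsets(word, pos))
    if not synsets:
        return None
    return sum(1.0 - (s.pos_score() + s.neg_score()) for s in synsets) / len(synsets)

print(objectivity('short'))  # lower objectivity means a stronger subjective (opinion) load
```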
347
(Refer Slide Time: 39:35)
Now, let us look at another standard resource, the NRC Word-Emotion Association Lexicon. Collected in 2013, this is based on prior work which largely focused on positive and negative sentiments, ok. So, the authors said, well, let us crowdsource and annotate 14,000 English words. When you are crowdsourcing, you can have a large number of labelers annotating the data.
There are also versions of the lexicon in over 100 languages, translated using Google Translate, and this, friends, is the heat map, ok. If you look at the sentiment, it just shows the overall label distribution: this one has the highest frequency and this one has the lowest frequency.
348
(Refer Slide Time: 40:30)
Now, some other lexicon resources. This is the Linguistic Inquiry and Word Count, LIWC, another very commonly used resource, proposed in 2001. It contains 6400 words which were annotated for emotions, and each word or word stem belongs to one or more word categories; there are more than 70 classes. An example is the word 'cried', which is part of four word categories: sadness, negative emotion, overall affect and past-tense verbs.
Now, another one, friends, is WordNet-Affect, proposed in 2004. It was developed from WordNet through the selection and tagging of a subset of synsets representing affective meaning, so it is a subset of the WordNet dataset. Then there is DepecheMood, which was proposed in 2014. This one is again based on crowdsourcing, used to annotate 35,000 words, so it is a somewhat larger resource as compared to the earlier ones.
349
(Refer Slide Time: 41:45)
Now, typically, for categorical models the NRC data is used as the lexicon, and when we are talking about the continuous dimensions, ANEW is used as the lexicon. Further, friends, the emotion of the text can be assigned based on the closeness, the cosine similarity, of its vector to the vectors for each category. So, this is one way.
You are saying, well, I have a representation for the text. Now, I am going to use the cosine similarity as my distance metric and I am going to assign a category or a dimension. So, you would have some samples, some data points, which already have some categorical or dimensional emotion intensity assigned to them, and a new sample comes in.
You compute the cosine similarity between this sample and the samples which already have labels, and you assign the label of the one which is closest, right. So, the closest sample in the data lends its labels to the new sample.
Now, this simply means that you define a similarity between a given input text I and an emotion class E_j, and the categorical classification is formally: assign the class E_j that maximizes sim(I, E_j), provided this maximum similarity is greater than a threshold; in that case you say it is a non-neutral emotion.
If it is neutral, then essentially you are saying that the similarity between the input text and every emotion class is less than the threshold, right. So, you can define the class in
350
such a manner. But it simply means two things: you need a representation for the text, and you need the cosine similarity metric; you can use some other metric for computing the distance as well.
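A minimal sketch of this unsupervised assignment is given below; the variable names, the vector inputs and the threshold value are illustrative assumptions, and the cosine similarity could be swapped for another metric.

```python
import numpy as np

def cosine(a, b):
    # cosine similarity between two vectors
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def classify(text_vec, emotion_vecs, threshold=0.3):
    # similarity of the input text I to every emotion class E_j
    sims = {label: cosine(text_vec, vec) for label, vec in emotion_vecs.items()}
    best = max(sims, key=sims.get)
    # below the threshold the statement is treated as neutral
    return best if sims[best] > threshold else "neutral"
```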
And you keep a threshold, where you say, well, if the similarity is larger than this particular threshold, there is an emotion conveyed; if the similarity is less than the threshold, then this is a neutral statement. So, this way you can have an unsupervised approach for understanding emotion from text. So, friends, with this we reach the end of Lecture 1 of Text Based Emotion Analysis.
What we discussed was: why is text based analysis important, what are the applications, how the subtle details in typography, how the fonts are rendered, and how the basic components in the form of the lines
(Refer Time: 44:45), when they come together, convey different emotions to the user and have an effect in inducing emotions in the user. And later on, we discussed some of the most commonly used resources for text based affect understanding.
Thank you.
351
Affective Computing
Prof. Jainendra Shukla
Prof. Abhinav Dhall
Department of Computer Science and Engineering
Indraprastha Institute of Information Technology, Delhi
Week - 06
Lecture - 02
Emotion Analysis with Text
Hello and welcome. I am Abhinav Dhall from the Indian Institute of Technology, Ropar and
friends today we are going to discuss about Analysis of Emotion from Text. So, this is the
2nd lecture in the text analysis for emotion recognition in the Affective Computing series.
So, in the last lecture we discussed the importance of analysing emotion from text. The text could be in the form of a tweet or a document, and we discussed how the simple organisation of text in terms of the font, the curvature of the lines which constitute the font, essentially typography, affects the perception of emotion.
And today, first I am going to discuss with you a very important building block of a text based emotion recognition system, which is the feature representation: how are we going to capture the context and the linguistics, how are we going to represent the text in a tweet or a document.
352
Later on, I am going to discuss with you few methods, which are proposed in the academic
community which are predicting emotion from text.
Later on we are going to switch the gears a bit and we are going to discuss about how
emotion recognition can be performed when there is a conversation happening. When two
people are let us say conversing through text, the perception of the emotion and the
understanding of the emotional state of the user that can be very dynamic. So, how the
emotion changes, how the perception changes, over the conversation flow. So, we are going
to look into that.
So, let us start with one of the most commonly used feature representation for text and to link
it to how we were using audio and video for emotion recognition, Recall we discussed that
we can have a bag of words based representation, right. So, we can represent parts of an
image or parts of a speech sample as a word and then we can create a representation.
Similarly, when we are talking about your bag of words, typically what we are saying is let us
say you have a document. Now, for the sake of the example, imagine the document contains
information about the different cities in India ok. So, this one let us say talks about Delhi,
then the second document talks about Kolkata and then the third one talks about Chennai and
so forth, right. So, we have a long list of documents.
353
Now, the content in each document would be different that would mean the number of words
in each document will be different. Now, what we want to do, in the plain vanilla bag of
words representation friends, is we want to take all the documents which are present during
the training and we want to create a large repository, which contains all the words, so all the
words in my training documents, ok.
From this we can apply a simple clustering technique like k means, we have discussed this in
facial expression based emotion recognition and that is going to give me a dictionary, which
will contain the frequently occurring words. So, you can say these frequently occurring words
are the building block of my documents.
And once I have identified these frequently occurring words, I can then have a vector representation for each document separately, which is essentially the frequency of occurrence of these important words in my document, right. So, that is the vector representation.
So, for example, if a document is talking about Delhi, then words such as parliament, government and history might be more common within the Delhi based document as compared to, let us say, a document discussing another city which is not the state capital and is, let us say, a newer city, right.
So, you will have a histogram, a vectorised representation; then, with a standard method, you can train a machine learning model and predict the emotion which, let us say, the text is conveying about the city, right. Now, as you would have figured out, friends, what we are simply doing here is:
We are counting the number of words, the unique words in a document and then trying to see
the frequently used words across all the documents, how many times they are occurring in my
current document. Now, imagine when you were talking about the input stage you know the
first step, wherein I said well you create a repository of all the words in all the documents.
354
(Refer Slide Time: 06:48)
So, we were considering the words here individually. What does that mean? Let us say a line in my document says 'Delhi is the capital of India'. Now, you treat each word here separately: 'Delhi' is one word, 'is' is one word, 'the' is one word, and so forth, right. Now, very early on a concept was introduced, especially for natural language processing, where we said, well, let us also have combinations of words which occur together. That is referred to as n-grams.
So, when n equals 1, you observe the scenario where 'Delhi' is a separate word, 'is' is a separate word, 'the' is a separate word, all separate entities. When n equals 2, notice what is going to happen: in the same statement, 'Delhi is the capital of India', I am going to take two consecutive words together.
So, 'Delhi is' is now one unit, 'the capital' is another unit and 'of India' is another unit, ok. One could move a step further and take three neighbours together as one unit, as one word. What will that be in our example statement? When n equals 3, a 3-gram, a trigram, you would say 'Delhi is the' is one word, one unit.
And once you have established the different combinations of words and the appropriate representation for the n-grams, n equals 2 or 3 depending on the value, you can then repeat all the steps which we just discussed for bag of words. Wherein now the words before k-means
355
could be combinations of sequential words, right; each word here could be, for example, 'Delhi is the'.
Now, that is one unit, one word in the bag of words sense, and one could then train the system. Why would that be useful? Simply because when you encounter an individual unit like 'Delhi is the', from the meaning perspective what we understand is that after this some information about Delhi is going to be presented, right.
So, here we say 'Delhi is the', and then for a trigram where n equals 3, you would have 'capital of India' as the second unit, right. So, we are also learning the relationship between sequentially occurring words; that is another way of extracting information for our final goal, which is understanding emotion through text analysis, right.
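As a small illustration, scikit-learn's CountVectorizer can generate such n-gram vocabularies directly; note that it produces all overlapping n-grams rather than the non-overlapping chunks used in the spoken example, which is the more common convention.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["Delhi is the capital of India"]
for n in (1, 2, 3):
    vec = CountVectorizer(ngram_range=(n, n)).fit(docs)
    print(n, sorted(vec.get_feature_names_out()))
# n=1 -> single words; n=2 -> pairs such as 'delhi is'; n=3 -> triples such as 'delhi is the'
```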
Now, similar to this, you can have another concept. We said, well, I am looking at the words individually or at sequential words as one unit; how about I also take into consideration the importance of a word, ok. Now, what could the importance of a word be? There are different ways to describe this. One could say, well, if the word 'Delhi' is repeated in a document several times, there is a high probability that the document is discussing Delhi, something in Delhi, something around Delhi, right.
On the other hand, let us say you have the word "the". Now, "the" is a common word and it is repeated across different documents. So, if you look at the vector representation of a document, you know, if I just draw the histogram here, the word "the" could be repeated many times in the document while other words have similar or smaller frequencies, and you would find that in the histograms of other documents "the" is again very high in value.
So, because of this large frequency, "the" does not help me much in discriminating between the types of documents, and, from the context of emotion recognition, it does not help me much in discriminating whether the perceived emotion of the content of one document is class x while in another document the perceived emotion is class y. Because "the" occurs so many times, it is not really helping me much, right.
So, how do we take care of the situation, how simply we can encode this information within
our representation of the text? So, for that friends, we have the very popularly and commonly
used TF-IDF representation.
356
(Refer Slide Time: 12:22)
Now, what is that? The first part of this is the term frequency, TF. That simply means how many times a word appears in a document, for example how many times the word 'Delhi' occurs in a document. If a word occurs multiple times, it could be important, ok. But you have seen that words like "the" and "a" also occur multiple times in all the documents, right.
So, that means I cannot depend only on the term frequency; what would I need? To balance it, I need the inverse document frequency, called IDF. What it does is tell us the importance, the relevance, of a word. How does it do that? Well, for a given word, I compute the relevance by asking how many data points I have in terms of documents; let us say I have N documents.
Simply speaking, I divide N by the number of documents in which this particular term occurs, let us say M. "The" occurs in very many of them. Now, I would like to use this as a weight, ok, and so that I do not give a very harsh treatment to words which occur in many of the documents, I simply take a logarithm of it: IDF = log(N / M).
So, this term gives a weight, essentially the importance or saliency, if you want to think of it that way. For a given term, its TF-IDF value is its term frequency multiplied by its inverse document frequency. Now, Delhi may be a common keyword in some of the documents and repeated multiple times.
357
So, you will see a high value of TF, but its IDF will not be low, and that means salience. However, the word "the" would have a high TF but a low IDF, because it occurs many times across multiple documents.
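The following is a minimal sketch of this weighting with a tiny, made-up document collection; the documents themselves are illustrative.

```python
import math
from collections import Counter

docs = [
    "delhi is the capital of india",
    "kolkata is the city of joy",
    "chennai is the capital of tamil nadu",
]
tokenized = [d.split() for d in docs]
N = len(tokenized)

def idf(term):
    M = sum(1 for doc in tokenized if term in doc)   # documents containing the term
    return math.log(N / M)                           # IDF = log(N / M)

def tf_idf(term, doc_index):
    tf = Counter(tokenized[doc_index])[term]         # term frequency in this document
    return tf * idf(term)

print(tf_idf("the", 0), tf_idf("capital", 0))        # 'the' gets weight 0, 'capital' a positive weight
```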
Now, friends, let us look at a system proposed for text-based emotion recognition, a classification task. This task, proposed by Mishne and others, is for classifying the perceived mood in blog posts. The approach of the authors was as follows: they curated a large dataset of about 815,000 posts from a website called LiveJournal, where the bloggers also indicated their current mood while they were writing a particular post.
There were 132 categories of mood, different keywords which the writers attached to their particular blogs describing how they were feeling. Now, let us look at the features which the authors used. The first is one we have already discussed, the bag of words based representation. What the authors also did is add what is referred to as part of speech.
Now, part-of-speech tagging essentially labels the words in your data as being a noun, an adjective or a verb, because the number of nouns, adjectives and verbs in a particular document also gives us vital meta information about the task. Second, the authors used what is referred to as pointwise mutual information.
358
So, this is a feature which was proposed back in 1999, and pointwise mutual information simply gives us the degree of association between two terms, PMI(t1, t2) = log( p(t1, t2) / (p(t1) p(t2)) ), that is, how much two terms are related. Building on the pointwise mutual information, the other feature which the authors used is pointwise mutual information - information retrieval, PMI-IR.
This is a work from 2001, and it estimates the PMI probabilities using search engine hits. So, you use the pointwise mutual information idea, but the probability of retrieving a certain result given certain keywords is estimated from search hits; the PMI feature essentially tells how close two terms are, the degree of association. The authors then combined these features and performed classification.
Now, we saw a paradigm shift, right. As with any machine learning system, earlier we saw support vector machines and Naive Bayes kind of algorithms being used, and then deep neural networks came back around 2011-2012, which affected vision analysis and speech analysis, and the same effect was felt in text analysis as well.
Now, to this end, we are going to talk about representation learning based features, and how they can be used to map an input text to its emotion category. To mention some of the most popular ones used in the community, there is Word2Vec, then an extension of Word2Vec called FastText, and later the GloVe feature representation. So, let us look at what Word2Vec is.
359
(Refer Slide Time: 18:42)
Now, friends, Word2Vec was proposed by Mikolov and others in 2013. In this representation, the authors leveraged the ability of a network to understand the presence of certain words together. If you look at a statement, words which are in a sequence are related to each other. So, can we use this very simple observation and learn a vectorized representation?
Now, what does that mean? Think of an autoencoder; recall what an autoencoder was. You have a neural network where the first part is your encoder and the second part is a decoder. In between, we have what is referred to as a latent representation, which is essentially nothing but a compressed representation of the input, and using this compressed representation, which we feed into the decoder, we get a reconstruction of the original input; let me call it I dash.
Now, since we are talking about text and Word2Vec specifically, if we are inputting a series of words into my encoder, then I would like to learn the relationship between them, and I am going to use the latent representation, which is essentially going to give me the vectorized representation of the input words. So, what we do is input into our network a one-hot encoding.
So, let us say the sentence is 'Delhi is the capital of India'. When you have the word 'Delhi' in focus, you keep the corresponding index 1 and every other word gets a 0. Now, let us
360
say in this case of Word2Vec the sentence is 'Delhi is the capital of India', right. So, you could use 'Delhi is the ___ of India'.
Now, the word 'capital' is linked to the country which comes afterwards, and it is linked to 'Delhi' as well, because Delhi is the capital of India. Therefore, the authors said, well, let us have two types of representations, two ways of computing the representation. The first one is referred to as continuous bag of words. Let us see what that is.
So, here you have your words as the input. You are saying 'Delhi is the ...'; I am not going to input 'capital' here, ok. So, this word is not input, these are the inputs here, and later it is 'of India', ok. These are one-hot encoded vector representations of the words. What we are saying is:
the network will now learn the representation for this word from its context, the content which comes before and the content which comes afterwards, and that is going to give me the output representation for 'capital'.
Another representation which was proposed in the same work is called the skip-gram. In that case, you are saying, well, you should be able to predict the neighbouring words, right. So, if you had 'capital': 'capital city', 'capital of a country', the
361
words before and afterwards have a relationship with the word which is being input; so, how about trying to predict the neighbouring words?
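As a small, hedged sketch, both variants can be trained with the gensim library (assuming gensim 4.x); the toy sentences and the hyper-parameters are illustrative only.

```python
from gensim.models import Word2Vec

sentences = [
    ["delhi", "is", "the", "capital", "of", "india"],
    ["kolkata", "is", "a", "city", "in", "west", "bengal"],
]
# sg=0 -> continuous bag of words (predict the word from its context)
# sg=1 -> skip-gram (predict the context from the word)
cbow = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0)
skipgram = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)

print(cbow.wv["capital"].shape)           # 50-dimensional vector for 'capital'
print(skipgram.wv.most_similar("delhi"))  # nearest words in the learned space
```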
And in both these independent pursuits, continuous bag of words and skip-gram, we are learning the representation, ok. So, this is a very powerful representation, friends, which has been extensively used for obtaining vector representations of input words. There have also been some famous, widely used extensions of Word2Vec; one example is Doc2Vec, right.
So, what is the vector representation for a document? Again, that is based on the concept of Word2Vec. Now, recall we wanted emotions, right. So, what will that mean? You take the vector representation for each word using a pre-trained Word2Vec network, then you do pooling, as we have discussed earlier, and then you can train a machine learning system to predict.
Now, another representation which I would like to mention to you, friends, is the Global Vectors for word representation, proposed in 2014, popularly referred to as GloVe. This is an unsupervised technique to learn the representation, and it is based on creating and then learning from the word-to-word co-occurrence matrix.
362
So, the authors created a matrix which tells us the probability of words occurring together; based on that, we learn a network and we get a representation which we will be able to use later.
Now, let us look at one example where this representation learning is used. So, friends, we have seen bag of words, ok, we know this term now. From bag of words there is an extension called bag of concepts. So, I do not want to input individual words or bigrams or trigrams, that is 2 or 3 words together as one unit, but I want to input into my system a vector which is created from a bag of concepts, right.
Now, let us look at one such work proposed by Kim and others. Here you have the raw text data coming in and you extract the representation using Word2Vec. That gives you, for a document, the representation for each word. We then compute a clustering on the representations which were extracted from Word2Vec, and later we apply a weighting scheme similar to the TF-IDF which we have seen, right.
So, we are actually looking at the importance of each word. Notice that in the plain vanilla bag of words, when you are vectorizing an input sample, you would increment a bin in your histogram by, let us say, 1 whenever a word belonging to that bin arrives.
363
In this case we want the histogram to be a weighted accumulation of the content in a certain bin, right. So, you apply your TF-IDF style weighting and you get a representation which you can further use for predicting the emotion.
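A rough sketch of this bag-of-concepts pipeline is given below; it assumes that word_vectors maps each word to a pre-trained vector (for example from Word2Vec), and the number of concepts and the weighting details are illustrative, not the exact scheme of the paper.

```python
import numpy as np
from sklearn.cluster import KMeans

def bag_of_concepts(doc_tokens_list, word_vectors, n_concepts=10):
    # cluster the word vectors into "concepts"
    vocab = sorted({w for doc in doc_tokens_list for w in doc if w in word_vectors})
    X = np.array([word_vectors[w] for w in vocab])
    concept_of = dict(zip(vocab, KMeans(n_clusters=n_concepts, n_init=10).fit_predict(X)))

    # build one concept histogram per document
    hists = np.zeros((len(doc_tokens_list), n_concepts))
    for i, doc in enumerate(doc_tokens_list):
        for w in doc:
            if w in concept_of:
                hists[i, concept_of[w]] += 1.0
    # TF-IDF style weighting: down-weight concepts that occur in many documents
    df = (hists > 0).sum(axis=0) + 1e-12
    return hists * np.log(len(doc_tokens_list) / df)
```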
Here is another system, friends, on similar lines. This work by Kratzwald and others is called deep learning for affective computing: text based emotion recognition in decision support. It asks how we can use affective states for prediction, that is, for the decisions which a machine could take, and also for assisting the user.
So, what do we have here? We have a set of documents as training data and a pre-trained network trained on a sentiment labelled corpus. Now, let us look at the features: the authors first extract bag of words features and word embeddings from the pre-trained network.
Then you do feature fusion for these two and predict the emotion from the fused features. In parallel, you use the word embeddings, which we have seen how to extract in the earlier slides, you extract features from the transfer learning based model, concatenate them, and then you have a recurrent neural network.
You combine the outputs together, so you do decision fusion, and that gives you the affective states. Now, one thing to note here, friends, where I would like to draw your attention, is
364
the second part here, which is when you have the word embeddings and the features from transfer learning input into a recurrent neural network. What is typically happening in your RNN? Take again the example statement we have been using: 'Delhi is the capital of India'.
Now, in an RNN, you get a feature representation for the first word and input it to the cell, and along with that, the feature representation of the second word is made available to the cell. So, there is a sequence which is being followed, right. You could say that the sequence in which the words arrive in my statement, I am learning that pattern in a recurrent fashion, in the recurrent neural network way, right.
So, you have the cells, and you feed in the prior state together with the current representation. This has been very commonly used in natural language processing, you know, the recurrent neural networks, your LSTMs, Bi-LSTMs, and they process the information sequentially.
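To make the recurrent idea concrete, here is a minimal PyTorch sketch in which token ids are embedded, passed through an LSTM one position at a time, and the final hidden state is mapped to emotion logits; all layer sizes and the number of classes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TextEmotionRNN(nn.Module):
    def __init__(self, vocab_size=5000, embed_dim=100, hidden_dim=128, n_classes=6):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.classifier = nn.Linear(hidden_dim, n_classes)

    def forward(self, token_ids):              # token_ids: (batch, seq_len)
        x = self.embed(token_ids)              # (batch, seq_len, embed_dim)
        _, (h_n, _) = self.lstm(x)             # final hidden state: (1, batch, hidden_dim)
        return self.classifier(h_n[-1])        # emotion logits per sequence

logits = TextEmotionRNN()(torch.randint(0, 5000, (2, 7)))  # e.g. two 7-word statements as ids
print(logits.shape)                                        # torch.Size([2, 6])
```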
365
(Refer Slide Time: 29:23)
Now, the community has slowly started moving to other approaches, where we say, well, we do not need a sequential approach; how about if I were able to parallelize? These are also referred to as non-recurrent models, and after this I will show you a few examples of them as well, ok.
Now, let us look again at another system, proposed by Shelke and others, which uses social media data as input to predict the emotion from a social media post. Notice that in a social media post you are not only going to have text, you could also have emoticons.
So, in this very interesting work, the authors first pre-process the input text from the social media posts; these are the different platforms from which they fetch the information. You see, they do tokenization, that is, creating word level individual units, they remove the punctuation, they remove the stop words during stop word removal, they remove URLs and they also do lemmatization.
Now, for the emoticons, they also add labels. For example, for the joy emoticon, let us say the representation for that is a one. And for the text they actually use another mood analysis system, the DepecheMood emotion analysis.
So, they combine these, extract features from the emoticons and from the text, and then they do a machine learning based ranking. They rank the presence of the emoticons along with the
366
text and then input that into a deep neural network to predict the emotion which is perceived from, let us say, a post on Facebook or on Twitter and so forth.
Now, let us look at another work friends, in this work the authors Batbaatar and others, they
proposed a work called semantic emotion neural network for emotion recognition from text.
What do we have in this work? We have the input words, and we extract two types of embeddings from them.
First, we have a pre-trained semantic encoder; it takes one word as input at a time and gives a representation relating to the semantic information. In parallel, the same word is input into an encoder which gives a representation carrying the emotion information. For the first one, there is a recurrent neural network followed by a fully connected layer.
In the emotion channel we use a CNN based representation; we fuse it, get the hidden representation and then we concatenate the two. So, what is happening here? We are doing feature fusion and then we are predicting the emotion class. So, what did we achieve here? We had a semantic pre-trained representation, we had an emotion representation, and then we fused them together.
367
(Refer Slide Time: 32:28)
Now, as I was referring to earlier, about these non-recurrent techniques where we can input the data in parallel: after Word2Vec type representations, the community has moved to attention based systems. Attention simply asks, when an input statement comes in, what is important and what is not so important; how do you learn that? Well, there is a whole mechanism, and I am writing the reference here for you to check out separately.
So, there is a paper called 'Attention is all you need'. This is a seminal work proposed in 2017, which discusses how attention can be applied to a network, particularly for text tasks. And this has given birth to the neural networks which are referred to as transformers. These are non-recurrent methods, that is, the tokens, the words which are input into them, are processed in parallel.
And similar to Word2Vec, the transformer based representation which is extremely common now in natural language processing, and hence for natural language processing based emotion prediction, is called BERT, the Bidirectional Encoder Representations from Transformers.
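As a small, hedged illustration, a generic pre-trained BERT representation can be extracted with the Hugging Face transformers library and then fed to any emotion classifier; the model name and the use of the [CLS] vector are assumptions for illustration, not the exact setup of the papers discussed next.

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

text = "I am so happy to see you after all these years!"
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
cls_vector = outputs.last_hidden_state[:, 0, :]  # representation of the [CLS] token
print(cls_vector.shape)                          # torch.Size([1, 768])
```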
So, here is a work by Huang and others. What they do here is model the utterances: they perform utterance pre-processing and look at the personality tokenization. So, this is the pre-processing part; they input it into a pre-trained BERT model, which is called FriendsBERT. And then the second one, from the emotion
368
corpora, is input into a ChatBERT. These, of course, as the names suggest, have been trained on different types of data.
The representations which they get are used for pre-training on the input data, with a masked language model and next sentence prediction, that is, predicting what would come afterwards; and the output coming from the ChatBERT is used for emotion pre-training on the Twitter data. Then they do fine-tuning and the evaluation, ok. So, you can use these BERT based representations nowadays as well.
Now, here is another work, again by Kumar and others; it is called a BERT based dual channel explainable text emotion recognition system, ok. Notice, it is explainable, ok. You have the input representation for each word; these are the tokens which you input into your BERT module.
You get the feature embeddings from the BERT module, and after that we have an RNN-CNN model and a CNN-RNN model, just to exploit the temporal relationships; we concatenate the feature outputs and then we have a classifier. This classifier tells us one of these four states: anger, happy, hate or sad.
Further the authors also propose an explainability module, which is looking at the intra
cluster distance, inter cluster distance of the outputs to explain, why we have anger or happy
369
as a particular label which is predicted. So, notice how the transition has happened for text
based emotion recognition systems.
In the beginning we were using bag of words, bag of concepts based representation. And now
we are in 2023 we are using systems such as BERT which are based on transformers.
Now, let us change the conversation a bit friends, we have been talking about emotion
recognition in text when the text is coming from, let us say a blog, a document, a tweet or a
post on social media. There is another dimension, rather very important dimension to emotion
analysis, that is during conversations. Conversation could be between two humans. So, a
human conversation, it could be a human machine conversation, it could be human machines
conversation or humans and machines conversation.
So, as the conversation happens, the emotion changes, the intensity changes. What we have here on the screen is an example from Poria and others, ok. What you see is a dialogue between two characters from a very popular sitcom, and, as the dialogue proceeds along the time axis, you notice that the subjects show different emotions, right.
Here this fellow, Chandler, is showing joy, then there is a neutral state, then there is surprise; then there is input coming from the other person, so he says something, then this guy Joey says
370
something, now an emotion is elicited, and here Chandler is actually showing surprise, and so forth.
So, you see how, for both the subjects, the emotion conveyed by the text varies over time, right. That means, for emotion recognition when a conversation is happening, we require a more dynamic approach, an approach which utilizes what the other person said and what the user under focus replied, right. And you could use the time series information as well to get the context, the long term context.
What we are doing is extracting feature representations which give us the situation level context and also the speaker level context. Then the authors propose an utterance module; this extracts the situation in which the conversation is happening, again using neural networks.
In parallel, we analyse the speaker level cues as well, that is, individually what a person said. This information is fused, and then there is a classifier to predict the emotion during
371
a conversation, right. So, two takeaways from this: analyse the situation, and analyse the content which the current person is speaking. The situation is going to vary based on how the conversation unfolds, and the conversation is also going to be affected by the situation and the context: where a person is, what they are speaking about, and so forth.
Now, here is another work by Yeh and others; this work is an interaction-aware network for speech emotion recognition in spoken dialogues, alright. In this, what you see is some conversation text over time. You extract the utterances, and here you have a GRU, you know, your recurrent networks.
And you also add attention; again, this comes from the work I was referring to, 'Attention is all you need'. Then you look at the utterances of the speakers in parallel, so you have M and F, let us say a male speaker and a female speaker, and then again a bi-directional GRU extracts the representation, you concatenate and then predict the emotion.
Notice how as compared to text based emotion recognition the architectures are changing
when there is a conversation. Because one person speaks, then the other person speaks, right.
So, the system needs to not only analyse the content spoken by one subject at a given time,
but in parallel also look at the conversation happening together as well.
372
(Refer Slide Time: 40:34)
Here is another work, friends, Lian and others, 2019, which is called domain adversarial learning for emotion recognition, ok. Here you have the utterances as input, now as both text and audio. You may wonder why text and audio; well, it is possible that the conversation is happening in the voice modality, you do speech to text, and you get the text which was being spoken, ok.
Now, notice how, for the different utterances, the dynamics of the conversation are mapped through a GRU, then you again add an attention layer, and at different time steps we predict the emotion as well as speaker level information, that is, which speaker is speaking, and so forth.
373
Now, friends this was a brief introduction to the very wide variety of works, which have been
proposed in the literature for emotion recognition during conversations. I also invite you to
look at two of the survey works. Now these are very detailed survey works which are looking
at the different aspects of affect prediction using text, if you are interested in going deeper
into this area.
So, friends, with this we come to the end of today's lecture. We discussed the different features which have been proposed in the literature for analysing text, which essentially means creating a vector representation. We started with the bag of words representation, we talked about what an n-gram is, then we looked at the concept of TF-IDF, and later on how we can have a bag of concepts.
From this we moved on to how, with the progress in deep neural networks, we use representation learning in the form of pre-trained networks such as Word2Vec. From there, the community has moved to attention based systems, wherein we now use transformer-like architectures for predicting the emotion. And in the same context, we moved a bit further to the case when, let us say, a conversation is happening.
When you see a machine and a person interacting, or two human beings interacting, there is a very dynamic play of emotion happening. Therefore, the system needs to not only understand the individual person's utterances, what a person said, but also analyse the relationship between the utterances.
Thank you.
374
Affective Computing
Prof. Jainendra Shukla
Prof. Abhinav Dhall
Department of Computer Science and Engineering
Indraprastha Institute of Information Technology, Delhi
Week - 06
Lecture - 20
Tutorial on Emotion Recognition using Text
Hi everyone, I welcome you all to this Affective Computing tutorial on Emotion Recognition using Text. In this tutorial, we will work with text data and try to see how to extract emotions from the text information.
So, directly starting with the dataset: for this tutorial we will be using the DailyDialog dataset, which is essentially a manually labelled multi-turn dialogue dataset. This dataset contains 13,118 multi-turn dialogues, and the dialogues reflect daily life communication. One can easily download this dataset from this link, and the data is published under a Creative Commons license.
375
(Refer Slide Time: 01:09)
So, talking about the dataset files: the dataset contains a couple of files. Among these, we will basically be interested in the dialogue text file, which contains the transcribed dialogues, and the second file of interest will be the dialogue emotion file, which contains the emotion annotations for the original dialogue text file.
A dialogue annotation is a number from 0 to 6, where 0 represents no emotion, 1 anger, 2 disgust, 3 fear, 4 happiness, 5 sadness and 6 surprise.
376
(Refer Slide Time: 01:55)
Now, let me give you a very brief overview of this tutorial. We will perform the following experiments on Google Colab. First, we will start with data preparation, where we will read the text files from Google Drive and clean them in terms of removing HTML tags, non-alphabetic characters, extra white space and stop words.
Later, we will use the common feature extraction methods for text data, namely bag of words, TF-IDF and Word2Vec (Refer Time: 02:34). And after extracting these features, we will perform emotion classification using machine learning classifiers.
377
(Refer Slide Time: 02:45)
For the coding part, we will start with importing all the essential libraries, and the code will look something like this. After importing the libraries, we will define our data and label paths; the code will look something like this. Then, after defining the path variables, we will write the code to read the text files from Google Drive and save them into Python lists.
For that, I will write a function which will look something like the sketch below. You can simply pause the video and try to iterate through each line of this code.
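Since the exact helper is not reproduced in the transcript, here is a hedged sketch of what such a reading function could look like; it assumes the usual DailyDialog layout, where each line of the text file holds one dialogue with utterances separated by the '__eou__' marker and the emotion file holds the matching space-separated labels, and the variable names data_path and label_path are the path variables defined above.

```python
# Hedged sketch of the reading step (the exact code in the video is not shown).
def read_dailydialog(text_path, emotion_path):
    utterances, labels = [], []
    with open(text_path, encoding="utf-8") as ft, open(emotion_path, encoding="utf-8") as fe:
        for text_line, label_line in zip(ft, fe):
            utts = [u.strip() for u in text_line.split("__eou__") if u.strip()]
            labs = [int(x) for x in label_line.split()]
            utterances.extend(utts)
            labels.extend(labs[: len(utts)])
    return utterances, labels

data, labels = read_dailydialog(data_path, label_path)
```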
378
And now we will simply call this function, and our data will look something like this. We get the input data and the corresponding labels here. Maybe I can also show you a couple of data instances. The very first line in our dataset is 'the kitchen stinks'. Let me show some other line, maybe the second line; the second line says 'I will throw out the garbage'. Or maybe I can show some random line as well, say the line at position 67: 'May I sit here?' ok. So, a question is being asked here.
So, these lines belong to conversations, and they have been annotated by the annotators into emotion classes. I can show you the emotion classes as well; here you can see that each line belongs to some emotion class, for example 0 belongs to neutral. So, after reading these text files, our first task will be to clean them.
Most of this text might contain some noise in terms of punctuation marks and maybe some hyperlinks. So, before our analysis, we will remove all these potential sources of noise from our text data. For that, I will write a function, and my function will look something like this.
So, my function is clean_text, into which I pass the text; it will try to remove any HTML tags, then it will remove any non-alphabetic characters, and it will remove any extra white spaces in the text. In the final step, it will simply convert my text into lowercase and return it.
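A sketch of such a clean_text function, under the cleaning steps just described, could look as follows; the exact regular expressions are illustrative.

```python
import re

def clean_text(text):
    text = re.sub(r"<[^>]+>", " ", text)      # remove HTML tags
    text = re.sub(r"[^a-zA-Z]", " ", text)    # remove non-alphabetic characters
    text = re.sub(r"\s+", " ", text).strip()  # remove extra white spaces
    return text.lower()                       # final step: lowercase

data = [clean_text(line) for line in data]
```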
379
So, let me run this cleaning function on my whole data. My data is now clean, so maybe I can show you a couple of text instances and compare the original data with the cleaned data. As we can see here, we have no extra spaces like before, the full stop sign is removed and all the text is in lower case.
Maybe I will show you (Refer Time: 07:05) also. Here again, you can see that all the punctuation marks are removed and all the text is in lowercase only.
will be bag of word which we will be implementing using count vectorizer.
So, count vectorizer is basically a tool used in natural language processing to convert words
in a document into a numerical representation. And this numerical representation can be
easily understand by a machine learning algorithm. It basically counts the number of
occurrences of each word and create a table with count for each word in the document. And
this table can be used for various tasks are like as text classification, maybe sentiment
analysis, topic modelling.
So, to perform our bag of words based feature representation here, we will use the built-in CountVectorizer from the sklearn library. The code will look something like this. We use this CountVectorizer and we also pass the argument stop_words equal to 'english'. This argument basically enables CountVectorizer to remove all the possible stop words of the English language from our text, and then, after removing the stop words, it builds the bag of words (BoW) representation.
If you want to remove some particular set of stop words, you can also pass a list of words in this argument. As we can see, CountVectorizer has converted the text into a vector representation, and I can see here the dimension of that vector representation.
Now, I can use this representation to train our machine learning classifier. In this case, we will use the multinomial Naive Bayes algorithm to classify the particular emotion classes. Before that, we just need to divide our data into respective train and test sets. For that, I will use the basic train_test_split function from sklearn, and let us say the test size is 33 percent.
380
So, we have divided our data into respective train and test sets. Now, I will use this multinomial Naive Bayes classifier and see our classification results, ok. Here I can see that we are getting a train score of 84 percent and a test score of 81 percent; given that we have a 6 class classification problem and the chance level comes to around 16 percent, an 84 percent train accuracy and 81 percent test accuracy is a good score here.
So, after using this CountVectorizer, we can try another feature representation technique known as TF-IDF. TF-IDF is basically an improved version of the count vectorizer: it takes into account the importance of a word in a document by multiplying the word count by the inverse document frequency.
So, what is this inverse document frequency? IDF is basically calculated as the logarithm of the ratio of the total number of documents to the number of documents containing the word. This way, words that are frequent in a particular document but rare in the corpus as a whole are given more weightage, and this approach can help to better capture the meaning and importance of a word in a document.
In general, TF-IDF is considered to be a more advanced and more effective technique than the count vectorizer. To use this technique, I will again use a built-in class from the sklearn library, and our code will look something like this, ok. Now we have got the TF-IDF representations here.
So, we can train our classifier on these features and see whether we get any improvement in terms of our emotion classification accuracy. For this, I will again divide my data into train and test splits, with the TF-IDF features as the input; a sketch of this step, including the classifier comparison that comes next, is given below.
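Here is that sketch; only the vectorizer changes, and both the Naive Bayes re-run and the LinearSVC comparison mentioned next are included. As before, the random_state is an assumption.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

tfidf = TfidfVectorizer(stop_words="english")
X = tfidf.fit_transform(data)

X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.33, random_state=42)
for clf in (MultinomialNB(), LinearSVC()):
    clf.fit(X_train, y_train)
    print(type(clf).__name__, clf.score(X_train, y_train), clf.score(X_test, y_test))
```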
381
(Refer Slide Time: 12:02)
And I will reuse my multinomial Naive Bayes code and see whether I get any better performance here. In this case, we can see that we get slightly better test accuracy. I would also like to see whether a change of classifier has any effect on our performance.
So, for this, I will use LinearSVC, which is basically a support vector machine with a linear kernel. Here, we can see that we get a slightly better train score, but the test score is quite similar. So, after the count vectorizer and TF-IDF representations, we will move to our third type of representation, which is called Word2Vec.
So, Word2Vec is a type of neural network model used for natural language processing. It was developed by researchers at Google back in 2013, and the purpose of Word2Vec is to create word embeddings, which are numerical representations of words that can be used in machine learning models.
Word2Vec takes a large corpus of text as input and produces a vector for each word in the corpus. These vectors are designed to capture the semantic relationships between the words. In our case, we will use a pre-trained Word2Vec model, which we will download using the gensim API; the code will look something like this. Downloading this model may take a good amount of time, because it is a fairly large model, so please have some patience. After downloading our model, we will write a function named prepare_word2vec.
382
So, this function will basically prepare the training data, consisting of vectors from the Word2Vec model, and the code will look something like this. After that, I will simply create my Word2Vec representations by calling this function.
This code also might take a little bit of time, ok. Now my Word2Vec representations are created; I will simply use my train test split over these Word2Vec representations and later train my machine learning classifier. The code will look something like the sketch below.
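A hedged sketch of this Word2Vec step is given below: a pre-trained model is loaded through gensim's downloader, each cleaned utterance is represented by the mean of its word vectors (the prepare_word2vec idea), and a linear SVM is trained on top. The model name is an assumption; any pre-trained Word2Vec model available in the gensim downloader would work similarly.

```python
import numpy as np
import gensim.downloader as api
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

w2v = api.load("word2vec-google-news-300")    # large download, please have some patience

def prepare_word2vec(texts, model, dim=300):
    feats = []
    for text in texts:
        vecs = [model[w] for w in text.split() if w in model]
        feats.append(np.mean(vecs, axis=0) if vecs else np.zeros(dim))
    return np.vstack(feats)

X = prepare_word2vec(data, w2v)
X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.33, random_state=42)
clf = LinearSVC().fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
```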
Let us see how the linear SVM classifies our emotion classes using these Word2Vec representations. As we can see, the Word2Vec representations give us a very nice score using LinearSVC. So, concluding this tutorial: in this tutorial we explored a public dataset for classifying different sorts of emotions.
We started with cleaning the text data by removing HTML tags, non-alphabetic characters and extra white spaces. Then we used three different feature extraction methods, bag of words, TF-IDF and Word2Vec, and we used machine learning based classifiers for classifying the emotion classes.
383
Affective Computing
Prof. Jainendra Shukla
Prof. Abhinav Dhall
Department of Computer Science and Engineering
Indraprastha Institute of Information Technology, Delhi
Week - 07
Lecture - 01
Emotions in Physiological Signals
Welcome friends, so in today's class we are going to look at the Emotions in Physiological
Signals, a very interesting and in fact very dear to my heart kind of topic. So, here is the
Agenda for today's class.
In this class, we are going to first look at why we are interested in looking at the emotions in physiological signals and what some of the relevant physiological signals are. We will be looking at the emotions in the heart, not only metaphorically but also physiologically, and we will also try to understand the emotions in another very popular physiological signal, which is known as skin conductance.
And then we will look at the emotions in the EEG signal, which is commonly known as the brain activity signal. Finally, we will conclude with a discussion of some additional physiological signals that can also be used for recognising, understanding and analysing the emotions; perfect.
384
(Refer Slide Time: 01:27)
So, let us start first by trying to understand the physiological signals themselves.
The first thing that we have to understand is that the physiological signals originate from the activity of the autonomic nervous system, and hence they cannot be triggered by any conscious or intentional control. Now, without using so many jargon words, what it essentially means is that they cannot be controlled or faked or consciously produced, for example in the way we can do with facial expressions. And since they cannot be consciously or intentionally controlled,
385
what it means is that you cannot fake it; at least, we cannot easily fake the emotions that are reflected in the physiological signals. And in fact, that is what has given rise to the popularity of the analysis of emotions in physiological signals. The other important characteristic of physiological signals with respect to emotions is that they do not really require the user to pay a lot of attention in order to provide the emotion-related data.
So, for example, if you want to get the emotions from facial expressions, you may have to ask the participant to look in a particular direction, and there should be a certain standard setup. All of this requires the user to pay a bit of attention to how they are sitting, how they are looking at the camera, and so on and so forth.
But in this case, nothing of this sort is required, and for the same reason it is very helpful for getting the emotions even of individuals who may have some sort of attention deficiency as well. So, that was the second advantage. And another important reason why it has become very popular to analyze the emotions in physiological signals is the advancements in wearable technologies.
So, the advancements that we have seen over the last one or two decades have enabled the acquisition of physiological signals, such as the ones that we are going to discuss now, in an unintrusive manner and also, I would say, in a relatively low cost fashion. This possibility of capturing the physiological signals using low cost devices and in an unintrusive manner is what has also given rise to the popularity of the analysis of emotions in physiological signals.
386
(Refer Slide Time: 04:13)
So, these are some of the physiological signals that are commonly used when we try to understand the emotions. The very first one is blood pressure, which is something that we all know. While emotions may not be very expressive in the blood pressure, there have been certain studies which have looked at the emotions in blood pressure.
Then, with the heart rate, we can get the emotions that are expressed in the heart activity of humans. Another very popular signal is the electroencephalogram, also known as the EEG signal, as you can see on the right side of the screen.
So basically, for EEG, as you can understand, to analyze it we attach a cap-like device on the head of the person, and then, with the help of certain electrodes, we capture the brain activity of the individual, through which we try to analyze the emotions.
387
So, those are the EEG signals. Another one is the electromyogram, which, as you can see here, refers to the activity of the muscles, right.
So basically, it again involves the attachment of some electrodes to the body, to the muscles, through which we get a certain expression of the body posture, and through which in turn we can get an understanding of the emotions. It is not such a popular method when it comes to emotion analysis, but nevertheless it is another option. Another very popular signal is the galvanic skin response.
So, the galvanic skin response, as you can see here on the right side of the image, refers to the skin conductance; it is also known as skin conductance or electrodermal activity. It refers to the conductance of the skin of the human body, as the name implies, through which we can understand certain components of the emotion, such as arousal, and so on and so forth.
We will look at it in a bit more detail later; nevertheless, another signal is respiration, which is of course the respiratory activity of humans. We have all seen that respiration itself may change, to a good extent, depending upon the type of emotion that we are experiencing, and hence it has also been used in the research literature to analyze and understand the emotions.
Last, but not the least, is the temperature of the human body, which has also been found to be correlated with the expression of certain types of emotions. Nevertheless, this is by no means an exhaustive list; apart from these, for example, you can also look at the EOG signal, which captures the activity of the human eyes.
Similarly, you can also look at gaze patterns by making use of eye tracking devices, and so on and so forth. But for the sake of this course, mostly we will stick to the ECG signals, we will look at the EEG signals, we will look at the GSR, and of course, by virtue of the ECG, we will also look at the heart rate, right. So, these are the few signals that we are going to analyze, and more or less many of the findings can be extended to the other signals as well; perfect.
388
(Refer Slide Time: 08:02)
So, with that, let us try to look at how the emotions are expressed in the heart rate and how that can help us in the analysis of the emotions; perfect.
About the heart rate: of course, the heart has been shown to be metaphorically correlated with the emotions in so many ways, in poems, in literature, in arts, in movies.
389
But the physiology says that it does not relate to the emotions only in a metaphorical way; it definitely also relates in a physiological manner, and one of the reasons is that the activity of the heart interacts with the brain and ultimately impacts how we experience an emotion.
One of the best examples of this is the following: imagine that you are giving a presentation, or, for example, you are delivering a lecture, such as myself right now, and you are trying to recall something that you have to present in front of an audience, or even in an online setting.
Now, if you are not able to recall it, imagine what happens. Imagine it is a big audience and a very important presentation. What happens is that suddenly you start feeling anxious, you start feeling aroused, and then, as the saying goes, you actually start feeling sweaty, lots of sweatiness. So, if you try to analyse this particular situation: while your brain is trying to recall some stats, some facts,
at the same time, your heart is getting a signal from the brain that perhaps it is not able to recall, and since the brain also communicates the importance of the presentation to the rest of the body, the body reacts to the experience we are having in front of the audience: we start feeling sweaty and our heart rate starts increasing, right.
So, that is one very good example of how the activity of the heart correlates with the activity of the brain and how, overall, it impacts the emotions that we experience. Specifically, the heart rate has usually been shown to have a very good correlation with arousal.
If you recall, when we were looking at the valence, arousal and dominance model, the VAD or PAD model, which is the dimensional model of emotions, arousal was the component that captured the energy present in an emotion.
So, the heart rate has been shown to have a good correlation with arousal, and hence it can help us in many ways in trying to understand the emotional state in good detail. Overall, it is not hard to realize that the heart rate is a good indicator of overall physical activation and effort.
So imagine, if you are simply walking, your heart rate is steady and smooth, but the moment you start running, you can immediately notice an increase in heart rate activity, right. For the same reason, it is not very hard to understand how it can be an indicator of overall physical activation and effort.
When it comes to emotional activity, the heart rate has also been found to be correlated with fear, panic, anger, appreciation, etcetera. So, these are some of the feelings, and mostly, while analyzing them, we try to look at the emotional arousal using the heart rate.
Now, having understood how the heart rate can help us in analysing different emotional states, particularly emotional arousal, let us try to understand how the heart rate is measured. Many of us have seen the use of the ECG, the electrocardiogram, for measurement of the heart rate. The way the ECG measures the heart rate is that it monitors the electrical changes on the surface of the skin.
In many cases, there are multiple electrodes, usually 4 electrodes, which are placed on four different parts of the human body. In one particular configuration, such as the one you can see in the image, one electrode is placed on the right arm, one on the left arm, one on the right leg and another on the left leg.
So, this is what a commonly used four-electrode configuration looks like; similarly, other configurations can be used, with the electrodes placed at other points on the body. These configurations are used to monitor the electrical changes on the surface of the skin, through which we try to observe the heart rate.
That is how we get the heart rate. Now, of course, the next question that you would like to ask is whether the amount of electricity that we observe on the human skin is large, and of course it is not; we are not walking electrical transformers, right. The electrical signal is very, very small, of the order of microvolts.
But the electrodes that we use in electrocardiography are designed in such a way that they can pick up even these small electrical changes on the skin in a reliable fashion, and that is how we measure the heart rate of the human body. Another very popular method is the use of PPG signals, that is, photoplethysmography.
In photoplethysmography, or the PPG signal, what do we do? We measure the pulse signal at various locations on the body. It is very common, for example, to attach a PPG clip on the fingertip, on the ear lobe, or on other capillary tissue, such as on the legs as well.
If you look at the diagram on the right hand side, showing the way a PPG clip is attached, what does it do? It makes use of a set of sensors, which usually consists of some IR LEDs and some photodiodes. So basically, the IR LEDs emit light.
Then, depending upon whether the blood is actually flowing through the vessels at that moment or not, some amount of the light gets absorbed, and whatever remaining light there is gets collected by the photodiode that is beneath the PPG clip.
Now, depending upon how much light was emitted and how much light was collected, the device can work out whether the blood was flowing at that instant or not. And what does that mean? When your heart pumps, the blood has to go somewhere: there is a rise in blood flow, then the flow eases off for a very short time, and then the heart pumps again and the flow rises once more, right.
So basically, these are the configurations in which the IR LEDs and the photodiodes are placed in the PPG clip; that is how it measures the blood flow and accordingly tries to estimate the heart rate of the human body. And this is the same kind of PPG clip you may have seen being used very popularly, for example, in oximeters as well.
With an oximeter, you simply place the clip on one of your fingertips and you get the heart rate from there. As you can see, the attachment is much quicker in comparison to the ECG setups. In an ECG setup, you need to use the four-electrode system, which is quite intrusive.
In comparison, PPG clips are relatively easy to use and less bothersome or cumbersome for the participants, and hence they are a bit more popular for the analysis of heart rate signals than the ECG. But, as we will discuss down the line, this comes at the cost of how precisely the heart rate is estimated. Nevertheless, PPG is a popular choice for the analysis of heart rate.
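To make the measurement side concrete, here is a minimal sketch, under assumed parameters, of how R peaks could be detected in an ECG-like trace and how a heart rate estimate follows from the inter-beat intervals. The synthetic signal, sampling rate and thresholds are illustrative assumptions, not the course's recording setup.

import numpy as np
from scipy.signal import find_peaks

fs = 250                                 # assumed sampling rate in Hz
t = np.arange(0, 10, 1 / fs)             # 10 seconds of data

# Fake ECG: small baseline wander plus sharp "R peaks" every 0.8 s (~75 BPM).
ecg = 0.05 * np.sin(2 * np.pi * 1.0 * t)
for rt in np.arange(0.4, 10, 0.8):
    ecg += np.exp(-((t - rt) ** 2) / (2 * 0.01 ** 2))   # narrow spikes as R peaks

# Detect R peaks: require a minimum height and a refractory distance of ~0.4 s.
peaks, _ = find_peaks(ecg, height=0.5, distance=int(0.4 * fs))

rr = np.diff(peaks) / fs                 # inter-beat (R-R) intervals in seconds
print(f"Mean RR interval: {rr.mean() * 1000:.0f} ms")
print(f"Estimated heart rate: {60.0 / rr.mean():.1f} BPM")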
(Refer Slide Time: 16:36)
So, these are the two popular methods through which we observe the heart rate in physiological experiments. Once we have obtained the heart rate, there are certain parameters that we can use for the analysis, and one simple parameter is the heart rate itself.
The heart rate is typically expressed in beats per minute, and by heart rate we simply mean the frequency of complete heartbeats, from the generation of one beat to the beginning of the next, within a particular time window.
We have already seen that an increased heart rate typically reflects increased arousal. As in the earlier example, if you are giving a presentation in front of an audience and you are not able to recall some very important fact or statistic that you want to present, you will immediately feel aroused and, accordingly, you will notice an increase in your heart rate. This, of course, is the raw data that you have. Another commonly used measure is the inter-beat interval.
The inter-beat interval, as the name itself suggests, rather than looking at how frequently the heartbeats occur, looks at the time interval between individual beats: one beat, then another beat, and the time interval between these two beats. Usually, this time interval is measured in milliseconds.
So, it is measured in units of milliseconds, as opposed to the frequency that we use for the heart rate. But these two are essentially the raw data, and the raw data by itself often does not give you a lot of information. Hence, there is another very popularly used parameter, which is the heart rate variability.
The heart rate variability, as the name suggests, expresses the natural variation of the inter-beat values from beat to beat. You can see what this means in the diagram given below: this particular diagram is showing you about two and a half seconds of heartbeat data, right.
If you look at one particular heartbeat, this is what a typical ECG signal looks like. In a typical ECG signal, apart from the P and the T waves, you have the QRS wave, which is the central and most visually obvious part of the ECG tracing.
The QRS wave represents the depolarization of the right and the left ventricles of the heart and, at the same time, the contraction of the large ventricular muscles. It is the most important part of the ECG signal, which we use to analyse ECG abnormalities and so on.
Together, the Q, R and S occur as a single event, and hence they are referred to as a single QRS waveform; basically, one heartbeat can be represented as one QRS waveform. The inter-beat interval then essentially represents the distance from the R peak of one waveform to the R peak of the next waveform, right.
So, the inter-beat interval here is, let us say, 859 milliseconds for the first pair of beats, and 793 milliseconds for the next. Of course, it can be translated into beats per minute: if the inter-beat interval is 793 milliseconds, how many beats per minute is that? It is simply a matter of time units: 1 second is equal to 1000 milliseconds and there are 60 seconds per minute. So, if you compute 60 multiplied by 1000 and divide it by 793, you will get something around 76 beats per minute, and that is how you calculate the beats per minute, right.
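As a quick check of this arithmetic, here is a tiny sketch of the millisecond-to-BPM conversion; the helper name is just for illustration.

# Converting an inter-beat interval (in milliseconds) to beats per minute,
# as in the lecture: 60 seconds * 1000 ms / IBI.
def ibi_to_bpm(ibi_ms: float) -> float:
    return 60.0 * 1000.0 / ibi_ms

print(ibi_to_bpm(793))   # ~75.7, i.e. roughly 76 BPM
print(ibi_to_bpm(859))   # ~69.9 BPM for the first interval in the example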
So, the heart rate variability essentially gives you the variation in the inter-beat values from beat to beat, that is, from one R peak to the next R peak. For example, you can see that while the first inter-beat interval was 859 milliseconds, for the second it was 793 milliseconds and for the third it was 720 milliseconds.
So, there is variation in the times at which these beats occur, and it turns out that the heart rate variability is a very important parameter when it comes to the analysis of emotional arousal as well as emotional regulation.
For example, heart rate variability has been found to decrease under conditions of acute time pressure and emotional stress. What this means is that the heartbeat is more consistent, more uniform, when we are under a lot of stress.
Now, that may sound a bit counter-intuitive, but what it means is that when you have low heart rate variability, you are not coping well with the stress. In order to cope with stress, or with some physically arousing event that is happening around you, you need a greater supply of blood from your heart, and accordingly the heart should pump at a faster rate; and when you are relaxed, it should pump at a lower rate. So, there should be a good amount of variation in the rate at which the beats arrive, whereas low heart rate variability means there is very little variation in the heartbeat.
Accordingly, it means that you remain stressed, you remain aroused, and you are not able to cope with the conditions happening around you. In contrast, consider the case where you have higher heart rate variability; not a higher heart rate, but higher heart rate variability.
Higher heart rate variability has been shown to be correlated with a relaxed state, which simply means that your body has a strong ability to tolerate stress, or is already recovering from prior accumulated stress. You will come across any number of studies which have consistently shown that higher heart rate variability is, of course, desirable.
I think you have understood the idea: the healthier you are, or at least the healthier your heart is and the healthier you are at the emotional level, the higher the heart rate variability you will have; not a higher heart rate, but higher heart rate variability.
It is such a powerful source of information when it comes to the analysis of emotions, not only emotional arousal but also emotional regulation. Let us look at a very nice example here. Imagine that there are two conditions: one we will call condition A and the other condition B.
The first waveform that we are seeing here gives you, let us say, the ECG data for condition A over a certain period of time, and for more or less the same period of time, the second represents the ECG data for condition B. You can see the QRS wave in each beat: this is the Q, this is the R, this is the S.
For analysis purposes, as we said, we simply look at the R-to-R interval, the distance between successive R peaks, and from that we can get the beats per minute; perfect. Now, for these two conditions A and B, just looking at the heart rate we are not able to make much sense out of it. But since we have the inter-beat intervals, we can calculate the average R-R distance for condition A as well as for condition B.
I hope it is not hard to see how this is calculated. For example, to calculate the average for condition A, you simply take the average R-R for A as 744 plus 427 plus 498 plus 931, and if you divide this sum by 4, you arrive at 650 milliseconds; similarly, you can calculate it for condition B as well.
Now, one interesting thing to observe is that when we calculate the average R-R for condition A and for condition B, we get the same value for both: 650 milliseconds for condition A and 650 milliseconds for condition B. So, in this case, we are not able to extract much information out of it.
(Refer Slide Time: 27:04)
Now, let us look at another interesting characteristic based on the heart rate variability, which is known as the RMSSD. The RMSSD, as the name itself says, is the root mean square of the successive differences, and it is calculated as the square root of the mean of the squared differences between successive RR intervals: you compute (RR1 minus RR2) squared, (RR2 minus RR3) squared and so on, take the mean of these, and then take the square root, right. That is how it becomes the root mean square of the successive differences. And it turns out that the RMSSD is more informative than, let us say, the average RR interval.
Let us try to calculate the RMSSD for condition A. For condition A, we have to compute (744 minus 427) squared, plus (427 minus 498) squared, plus (498 minus 931) squared; since we are squaring, it does not matter whether we subtract the smaller value from the bigger one or the other way around. Then we take the mean of these squared differences and finally the square root of that mean.
So, writing out the complete calculation: 744 minus 427 is 317, and its square gets added to the square of 71 (from 427 minus 498) and the square of 433 (from 498 minus 931). Taking the average, that is, dividing the sum by 3, we get 97673, and the square root of 97673 is around 312.5, which we can roughly say is 313 milliseconds; perfect.
So, this is how you calculate the RMSSD for condition A, and similarly you can calculate the RMSSD for condition B. Let me just clear the board so that you can see it clearly; I hope you now understand how to calculate it.
For condition B, you proceed in the same way: it becomes (630 minus 675) squared, plus (675 minus 655) squared, plus the remaining squared difference, all divided by 3, and then you take the square root; please note that this is the calculation for B. If you work this out, it comes to about 31 milliseconds.
some interesting results here for the RMSSD1 and 2.
So, again to highlight we have condition A condition B, when we calculated the average RR
values for both the conditions, we got exactly the same values even though the conditions are
different. But when we calculated the RMSSD, so the RMSSD for A it came out to be 313
and similarly RMSSD for B it came out to be 31.
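For readers who want to verify the numbers, here is a small sketch of the mean RR and RMSSD computation. Only condition A's four intervals are fully listed in the lecture, so only those are reproduced; the function names are illustrative.

import math

def mean_rr(rr):
    return sum(rr) / len(rr)

def rmssd(rr):
    # root mean square of the successive differences between RR intervals
    diffs = [rr[i + 1] - rr[i] for i in range(len(rr) - 1)]
    return math.sqrt(sum(d * d for d in diffs) / len(diffs))

rr_a = [744, 427, 498, 931]   # condition A intervals from the slide, in ms
print(mean_rr(rr_a))          # 650.0 ms, as on the slide
print(rmssd(rr_a))            # ~312.5 ms, i.e. roughly 313 ms as in the lecture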
So, now we can definitely observe a clear difference between condition A and condition B. What can we say from this difference? Condition B is the one representing the lower heart rate variability, please pay attention to that, and we already know that lower heart rate variability is associated with a more arousing condition. For the same reason, we can conclude that condition B in this case is physiologically more arousing than condition A, assuming, of course, that both were observed under standard conditions and from the same participant, right.
So, this is a very nice example of how, where the simple heart rate cannot distinguish or discriminate between two conditions, the heart rate variability can easily do so.
Now, let us look at some of the factors that we left aside earlier. We already know that there are two ways in which we can capture the heart rate: either using an ECG or using a PPG, and we already talked about how the ECG is a bit more accurate a measurement of the heart rate than the PPG.
But we also saw that the ECG is a bit intrusive and not very comfortable for the participant. For the same reason, PPG based measurements have been used very commonly, and it has been shown that if we take longer time windows for the analysis of the heart rate as derived from the PPG signals, then we can get fairly accurate estimates of the heart rate.
One common rule of thumb here is that if you take at least 5 minutes of data, roughly corresponding to 300 beats if we assume 60 beats per minute, that is, if you do the analysis of the heart rate over a period of 5 minutes of data collected from the PPG signals, then it gives you roughly the same accuracy as the ECG signals.
So, that is one thing you want to keep in mind when you are using PPG based heart rate measurement over ECG based heart rate measurement, ok. That was about the measurement; one other important thing we have to keep in mind is that emotion is not the only factor that affects the heart rate, and hence the heart rate variability.
There are several factors, including, for example, age, posture, level of physical conditioning, breathing frequency, etcetera. Many of these factors we often refer to collectively as individual variability; you may have come across this term while we were talking about the emotions in week 2, I believe.
For example, with age: as age increases, the heart rate variability has been shown to decrease. This is also one of the reasons why you will see that elderly people become anxious more easily in comparison to younger adults, right.
That is one example. Similarly, with posture: if you are not sitting in a comfortable posture, your body will have to put in more effort in order to stabilize you, and for the same reason this will also have an impact on the heart rate, and so on.
(Refer Slide Time: 35:25)
So, you have to keep this in mind while doing the analysis of emotions in the heart rate. Nevertheless, once you have accounted for these factors, the heart rate can give you a good estimate of the emotional state, particularly with respect to arousal.
But now there is a catch. While the heart rate alone can tell us whether there is arousal or not, it will not be able to tell you whether the arousal was because of positive or negative stimulus content; alternatively put, while it can tell you about the arousal, it cannot tell you about the valence, that is, the direction of the emotion.
It does not have a lot to say about the direction, the valence. Why? Simply because both positive and negative stimuli have been shown to result in an increase in arousal, and hence both trigger changes in the heart rate.
Both impact the arousal, but you do not have a way to discriminate whether the arousal was because of a negative or a positive stimulus, or in other words, what the valence associated with the observed arousal was. Hence, the heart rate has been shown to be closely related to arousal, and it is used essentially for the analysis of arousal only.
So, as I said, you may not want to use the heart rate, whether from ECG or PPG measurements, for the analysis of valence when it comes to the analysis of emotions. Now, the question is: what can you do then?
In this case, as we usually do when we make use of physiological signals, or in general any other sensor, we try to make use of multimodal data. For example, the heart rate can simply be combined with other analyses, such as the facial expression analysis that you have already seen.
It can also be combined with other physiological signals, such as the EEG, or, for example, with eye tracking modalities, in order to understand a bit more about the direction of the emotion rather than just the arousal content that is there in the emotion.
Perfect. So, with that we finish the part with the heart rate. Now, next we are going to talk
about the skin conductance and the emotions.
Affective Computing
Prof. Jainendra Shukla
Prof. Abhinav Dhall
Department of Computer Science and Engineering
Indraprastha Institute of Information Technology, Delhi
Week - 07
Lecture - 22
Tutorial Emotions in Physiological Signals
Hello, I am Shrivatsa Mishra. In the last lecture, we saw how we could use PsychoPy to
present stimuli as well as collect sensor data. PsychoPy allows us to integrate several devices
to collect multiple different types of data including physiological signals. Since we have also
seen how emotions can be extrapolated from physiological data, in this lecture, I shall explain
how we can analyze Electrodermal Activity or EDA using Python.
EDA has been closely linked to autonomic emotional and cognitive processing, and EDA is widely used as a sensitive index of emotional processing and sympathetic activity. Investigations of EDA have also been used to illuminate wider areas of inquiry such as physiology, personality disorders, conditioning and neuropsychology.
We shall use sample EDA data collected through an Empatica E4 device. This is a wearable device that can stream data through Bluetooth or store the data locally. We can extract the data using the Empatica E4 app, and for analysis we will use the flirt module in Python.
(Refer Slide Time: 01:41)
Hello, I am Ritik Vatsal and I will be introducing you to the Empatica E4 wristband data collection device. The Empatica E4 is a wristband that collects your physiological data. It has a single button and an LED light on the surface, and two sensor points on the back with electrodes that need to touch your skin when you wear the device. To wear the device, just place it on your wrist with this side on top, and make sure it is comfortable enough that you can do daily tasks.
But it is tight enough that both the electrodes at the bottom touch your skin at all times. After
that, we are ready to start the device. To start the device, just press and hold the top button for
three seconds. A blue light would come on indicating that the device has started.
The blue light will start blinking for about 15 to 20 seconds while the device initializes and checks all the sensors. After the watch has finished blinking, which can take up to a minute, the light turns red; when the light is red, you know that the recording has started.
After some time, the red light fades off and the light goes dark. That means the data is still being recorded, but the watch has turned the light off to preserve battery. While recording, you can single-press the button to mark events in real time on the watch, like this.
The watch light will again turn on for a short time. Finally, to turn off the watch and stop the data recording, you can just press and hold the button for three seconds, and that is it; the watch has now stopped recording.
This is the splash screen of the E4 manager application that greets you when you first log in. On this screen, you now need to connect your device to a USB port on your laptop or PC. I will do that now, ok.
Now, as you can see, my watch is ready and I see the one session that I just recorded. To
move forward with this, I just simply click the Sync Sessions button and all the sessions
would be synced and would be available for me to view, ok.
Now, we can click the view session button and we can see all the sessions that have been
recorded. This is the session for today’s date which has just been processed, ok. By clicking
this button, we can view more about the session, yeah.
(Refer Slide Time: 04:06)
So, this is the E4 website where all the session details are stored. You can just click on the sessions icon, and there they are. The sessions are sorted date wise, which you can see in the top left menu. We just click on today's date and we can see today's session, which was 2 minutes and 1 second long. We can either download this or view it in more detail.
The red lines denote the markers that we placed on the watch by pressing the button. By
going back to the Sessions menu, we can download this whole session in a zip file format.
(Refer Slide Time: 04:35)
So, after we download the zip file, we can simply extract the file, extract here and we will see
all the different modalities that the device has recorded in a simple and convenient CSV
format. Shrivatsa will explain how to extract the data further from this. Thank you.
(Refer Slide Time: 04:50)
The Empatica E4 device provides us with multiple types of data, such as heart rate variability, inter-beat intervals, as well as electrodermal activity. For this assignment, we shall just be using the electrodermal activity. Now, this data is stored in a CSV file. The first value in the CSV file is the start time of the entire recording. This is stored locally on the Empatica E4, or on the computer if you are streaming it on (Refer Time: 05:13)
You can reset the clock to whatever time base you want by just connecting the device and using the Empatica app. The second value is the frequency at which the data is collected, in hertz. Since the value for EDA in this example is 4, it means 4 data points are collected every second. Now, let us move on to the code; I shall be using Google Colab as it is easy to use and readily available.
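Before handing the file to flirt, here is a minimal sketch of what reading this CSV layout by hand could look like, assuming a file named EDA.csv in the layout just described; the variable names are illustrative.

# Row 1 = start time (Unix timestamp), row 2 = sampling frequency in Hz,
# remaining rows = EDA samples. The file path is an assumption.
import pandas as pd

raw = pd.read_csv("EDA.csv", header=None)

start_time = float(raw.iloc[0, 0])        # Unix timestamp of the first sample
sampling_rate = float(raw.iloc[1, 0])     # e.g. 4.0 -> 4 samples per second
eda_values = raw.iloc[2:, 0].astype(float).reset_index(drop=True)

# Attach a datetime index so each sample is 1/sampling_rate seconds apart.
index = pd.to_datetime(start_time, unit="s") + pd.to_timedelta(
    eda_values.index / sampling_rate, unit="s"
)
eda = pd.Series(eda_values.values, index=index, name="eda")
print(eda.head())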
(Refer Slide Time: 05:45)
The data used is a 5-minute sample collected during another study using the Empatica E4 itself. We will start by installing the flirt module in Python using the pip command. This will take some time, but flirt is a library that allows you to load the data very easily as well as extract features from it.
(Refer Slide Time: 06:07)
Now that this has been installed, we will import the library. We will be importing the flirt library as well as its reader. We will also be importing NumPy, Seaborn and Matplotlib to graph the data itself.
Importing these libraries usually takes some time. Now that they have been imported, let us import the data itself. The data is stored in an EDA dot csv file and we can simply load it through a reader function in the flirt module itself. Upon running it and printing it, we find that it is stored in a data frame.
A data frame is a pandas data type, and in this one there are two columns: the date and time of when each sample was recorded, as well as the EDA value for it. Since we know the frequency is 4, every successive sample is a quarter of a second (0.25 s) apart.
Now, let us graph this data. There are two ways we can graph it: we could use Matplotlib or Seaborn, and I can show you both. For Matplotlib, we just use the plt dot plot function to plot the EDA values, and we can do the same using the Seaborn functions.
As we can see, at the start there is a large variation. This can be ignored, as we usually take a smaller chunk of the recording.
Next, let us move on to extracting the features. We can simply use the get_eda_features function in the flirt module to get the features. This will take some time. Now we have the features.
(Refer Slide Time: 07:52)
We can look at all the features it has extracted. There are two main components to the overall complex referred to as EDA. The first is the tonic level of EDA. This relates to the signal's slower acting components as well as its background characteristics. The most common measure of this component is the Skin Conductance Level, or SCL.
Changes in the SCL are thought to reflect general changes in the autonomic response. The other component is the phasic component, which refers to the signal's faster changing elements; the Skin Conductance Response, or SCR, is the major component here.
Recent evidence suggests that both components are important and rely on different neural mechanisms. Crucially, it is important to be aware that the phasic SCR, which often receives the most attention, only makes up a small proportion of the overall EDA complex.
(Refer Slide Time: 08:56)
Now, let us graph the two different values: the tonic mean as well as the phasic mean. We graph this using Seaborn's lineplot function; here we put the x value as the datetime and the y value as the tonic mean. Upon graphing it, in our case, we obtain this graph. As we can see, the tonic value is always greater than the phasic value. This is because, as stated, the phasic component makes up a smaller percentage of the total value as compared to the tonic component.
In summary, what we have done is: imported the flirt library, imported the data stored in a CSV file, read it, graphed the base EDA data, extracted the phasic and tonic levels from the EDA data, and graphed them as well. Using just these two basic measures, we are able to extract a lot of different information about the data itself, and we would need to look into these in greater depth to understand more. Even now, much research is being done in this field and new advances are being made.
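As a single reference point, here is a consolidated sketch of the steps walked through above, assuming a file named EDA.csv. The exact reader function, feature function and column names are assumptions reconstructed from the spoken description, so please check the flirt documentation for the version you install.

# pip install flirt
# Consolidated sketch of the tutorial's EDA pipeline; the reader function name
# and the feature column names below are assumed from the narration.
import flirt
import flirt.reader.empatica
import matplotlib.pyplot as plt
import seaborn as sns

# 1. Read the Empatica EDA.csv into a pandas object (assumed reader name).
eda = flirt.reader.empatica.read_eda_file_into_df("EDA.csv")

# 2. Plot the raw EDA signal.
plt.plot(eda["eda"])
plt.title("Raw EDA")
plt.show()

# 3. Extract windowed EDA features, including tonic and phasic statistics.
features = flirt.get_eda_features(eda["eda"])

# 4. Plot the tonic and phasic means over time (column names assumed).
sns.lineplot(x=features.index, y=features["tonic_mean"], label="tonic mean")
sns.lineplot(x=features.index, y=features["phasic_mean"], label="phasic mean")
plt.show()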
Affective Computing
Prof. Jainendra Shukla
Prof. Abhinav Dhall
Department of Computer Science and Engineering
Indraprastha Institute of Information Technology, Delhi
Week - 07
Lecture - 02
Skin Conductance and Emotion
Hi friends. In the last lecture we discussed the emotions in the heart rate, and following the same thread, today we will discuss the emotions in another very interesting physiological signal, that is, the Skin Conductance.
Skin conductance is also referred to as emotional sweating, for reasons that we will discuss now. Whenever we are emotionally aroused, you may have observed that the electrical conductivity of the skin subtly changes, in response to positive as well as negative stimuli.
One example that you may have experienced is trying to recall a fact in front of an audience when you are giving a very important presentation. When you are not able to recall it, you feel very aroused, very stressed, very anxious, and then suddenly you may find that you are sweating, right.
But even if the amount of sweating is not so high, we can still observe subtle changes in the electrical conductivity of the skin, and that is what is known as the skin conductance, alternatively referred to as Electrodermal Activity or Galvanic Skin Response. So, GSR, EDA or SC is a robust, non-invasive way to measure this kind of activation, which is caused by the amount of sweat in a person's sweat glands, for the reasons that we just discussed.
It could be because of a positive or a negative stimulus which is making you anxious, making you aroused, right. And as I said before, the most important thing is that the sweating need not be clearly visible; it is not as if you will be getting entirely wet. It can be very, very subtle, it may not be visible on the skin at all, but it still causes subtle changes in the electrical conductivity which can be captured by the GSR sensors.
GSR was famously used in the early 1900s by the researcher Carl Jung to identify negative complexes in the word association test. Some of you may know about the word association test: it is a very popular psychological test in which you are shown a series of words, positive or negative, each for a duration of about 15 seconds, and you are asked to respond with the very first word that comes spontaneously to your mind, right.
For example, maybe I give you the word baby; the first word that comes out may be cute, and so on. You get the idea, right. Or, say, I give you the word knife; the very first word that comes to your mind may be kitchen, I do not know. Different individuals may come up with different words, and that is what is known as the word association test.
What Carl Jung showed, with the help of the galvanic skin response signals, was that when individuals were responding to the negative words in this word association test, that reaction was getting captured, getting highlighted, in the galvanic skin response of those individuals at those instances.
This was one of the very first uses of the galvanic skin response, and ever since, GSR has been regarded as one of the most popular methods for investigating human psychophysiological phenomena. You may also know that GSR is a key component in the so-called, and very popular, lie detector test.
You may have seen in movies or in series that law enforcement agencies often use this lie detector test, and one of the primary components of those lie detector tests is the galvanic skin response sensor. We will not be talking about that here, but for the interested audience, I would definitely encourage you to go ahead and investigate more about it.
Now, let us discuss and understand how we can measure the GSR signals. It turns out that the measurement of GSR signals is quite easy, as it can be done anywhere on the body. The reason it can be measured anywhere on the body is that we are capturing the electrical conductance, the electrical activity, of the skin, and of course the skin constitutes the largest organ of our body.
It is all over our body, right. Nevertheless, there are certain areas of our body which have emotionally more reactive sweat glands, and for the same reason we try to concentrate on those areas when we are capturing the galvanic skin response signals.
These areas are primarily the palms of the hands and the soles of the feet, as you can see in this schematic diagram; you may also have observed that these are the areas where you usually sweat a lot, right.
This particular figure on the screen shows you a schematic of the GSR sensor, and it is a very simple schematic. A constant, very small voltage is passed through the GSR electrodes, these are the two electrodes that you can see, and then the difference in voltage across the GSR electrodes is measured; that is what constitutes the GSR signal of the human body. Now, there are different types of sensors available in the market. For example, in this particular picture you see the Shimmer GSR+ sensor, and you can see that it is based on the same schematic: it also has two electrodes.
So, this is one electrode, and this is the other, and then there is the device itself, which is responsible for producing the constant supply of voltage; at the same time it has a Bluetooth chip inside, which transfers the signal wirelessly from the participant's hand to whatever device you are capturing the data on.
That is one popular device. Another popular device is the E4 wristband by the start-up Empatica. Here again the two electrodes are clearly visible, and the idea is more or less the same: you wear it as a wrist watch, the two electrodes sit against the skin, and from there you get the GSR signal.
So, with one sensor you are getting the signal from the fingers, with the other from beneath the palm, and of course there are other sensors available with which you can attach the electrodes at the soles of the feet and get the GSR signal from there. As we said before, it is quite easy to measure, as it can be measured anywhere on the body, right.
(Refer Slide Time: 07:58)
Now let us look at what the nature of the GSR signal is like. You can see a very nice GSR response time diagram below. It turns out that from the onset of the stimulus there is always a latency, a delay, before the response appears in the GSR signal, and usually this latency is 2 to 4 seconds. What this means is that it may take a couple of seconds, even up to 4 seconds, for the GSR signal to start reflecting the response after the onset of the stimulus.
Imagine that you were presented with an image or a video or any stimulus for that matter; it may take up to, let us say, 2, 3 or even 4 seconds for the response to appear in the GSR signal after you see it. That is the typical delay. Now, as for the amplitude of the GSR response, it typically varies between 2 to 20 microsiemens.
Microsiemens here is a unit of conductance, the reciprocal of resistance, and the measurement can be converted into a voltage as well. On top of this typical range of 2 to 20 microsiemens, the response may vary by another 1 to 3 microsiemens across different individuals, depending upon individual variability. So, this is the typical range of the GSR signal.
Now, if you look at the GSR response curve, you will notice that there is a particular rise time that the signal takes to reach its peak value, and then a recovery time, which is the time that the signal takes to come back from its peak to the offset, let us say to within 10 percent of the GSR amplitude.
The typical recovery time is longer than the rise time: it takes around 1 to 3 seconds for the GSR signal to rise from the baseline to its peak value, and then anywhere between 2 to 10 seconds to come back down towards the baseline from the peak, right. So, that is how it typically responds.
It turns out that the GSR signal is usually sampled at a very low frequency, between 1 to 10 hertz; 10 hertz is pretty common. We do not sample it at a higher frequency because it is not a very fast signal, as it turns out. So, that is what the response time of the GSR signal looks like, and it is usually regarded as a slowly responding signal; perfect.
Having understood the GSR signal's response time, let us look at the different components of the GSR signal. The GSR signal typically consists of two components: one is known as the Skin Conductance Response, SCR, and the other as the Skin Conductance Level, SCL. The skin conductance response is the phasic, or rapid, component of the GSR signal.
What this means is that in response to an external event, which could be any stimulus, the skin conductance signal, the GSR signal, has a phasic or rapid component whose amplitude increases temporarily. For example, in the figure, these are the GSR peaks and this is the overall GSR signal.
If you look at these peaks, they are amplitudes which rise temporarily and quite frequently. In comparison to the overall shape of the GSR signal, there are these temporary and sudden rises in amplitude, and this is what is known as the phasic skin conductance response.
Usually, this phasic skin conductance response is a reaction to an external event. It may, of course, also occur in the absence of an external event, but what we are interested in is when it occurs in response to some event, and many times, when we are doing the analysis of the GSR signal, we are mostly interested in the analysis of these SCRs.
Just as we have the phasic or rapid component of the GSR signal, in the same way we have a tonic, or slow, component of the GSR signal, which is known as the Skin Conductance Level, or SCL. The SCL component of the GSR signal changes with the general arousal level or general stress, right.
As you can see, this is what the SCL component looks like. If you look at the overall pattern of the SCL signal, the changes are not many. In the same period in which the amplitude of the SCR component rose, say, 6 or 7 times, there was more or less no change in the amplitude of the SCL signal.
But over time, the amplitude of the SCL signal changed significantly and then became more or less steady. So, this component represents the general arousal level or general stress level of an individual, rather than the response of the individual to an external event, right.
This is very important to understand: when we want to observe the response of an individual to an external event, to a stimulus, to an image, a video or a piece of music that we are showing, we have to look at the SCR component of the signal. But when we want to understand the general arousal level or the general stress level of the individual, then we want to look at the tonic component, which is the SCL component, right. Not so hard to understand.
Of course, we will also have to understand how to differentiate between these two; we will come back to that in a minute. As I said, when SCRs occur in response to an external event they are known as event-related SCRs, and these are what we want to study.
To analyze these event-related SCRs, it is really important for us to display the stimulus for a long enough duration. Even though SCRs can occur frequently, if you show the stimulus only for a very small time, say 1 or 2 seconds, remember that the GSR signal itself is a slowly responding signal.
There is a latency from the stimulus onset to the appearance of the GSR response, then it takes a certain time to reach the peak and a certain time to come back to the baseline. For the same reason, you also want to have similarly long inter-stimulus intervals.
What this means is that you may want to present, let us say, an image for 10 seconds, analyze the effect of that image on the GSR signal, and then take a gap of about 10 seconds before presenting the next image for 10 seconds again, right.
Why do you want to do this? Because if you are not going to maintain a proper gap between the presentation of two stimuli, there could be an overlap in the responses to the stimuli, right. So, to make it concrete: for the first 10 seconds you present one image, then you keep a gap of, let us say, another 10 seconds, which brings us to the 20 second mark, and only then do you present the next image. That is how you analyze it. Otherwise, what will happen is that the response to this stimulus is going to overlap with the response to the next image or stimulus that you present.
You do not want that, because it will give you multiple, overlapping responses; perfect. Now, if you look at how we observe the GSR response while someone is watching a particular stimulus, here is a very nice example: what you are looking at is the GSR signal response of an individual who is watching some particular images, or some particularly arousing scenes in a video.
For example, if you look at this particular scene, what is happening is that the individual is walking at a great height, maybe on a skyscraper with a glass floor through which they can see all the way down. So, this is a very arousing scene, and you can see the response. Similarly, there is a cute baby, so again there is an arousing response, and you can also see the corresponding facial expressions, the reaction on the individual's face.
Accordingly, you can see that there is a GSR peak present for the most arousing parts of these scenes or videos. Of course, we have a timestamp for the scenes that are being presented, and we use this timestamp to keep the signal responses in sync with the images or videos being shown.
Then we look at those responses with respect to the arousing scenes: for instance, when the individual looks aroused, what was the corresponding GSR signal, and if the GSR signal looks aroused, what exactly was the individual looking at at that point of time. This is how we set up the whole analysis and how we analyze the emotionally arousing scenes in a video, or in any particular stimulus; perfect.
(Refer Slide Time: 19:20)
Having understood how the GSR signal responds and what its different components are, let us look at features; there are any number of different features of the GSR signal. Since we are talking about a time series signal, the GSR signal is essentially a time series, it can be analyzed in the time domain, in the frequency domain, and also in the time-frequency domain.
Without going too much into the signal processing part, some of the most important features related to the skin conductance, or GSR, signal are the event-related features. The event-related features describe the response to an external event, and you may have rightly guessed that what we look at here are mostly the SCRs, right.
Whenever you have SCRs, and there can be multiple SCRs, one here, another there, and so on, there are multiple peaks. So, you may want to look at the peak amplitude, the mean peak amplitude, and similarly the sum of all the peak amplitudes.
Similarly, you may want to count the number of peaks, look at the mean rise time of the peaks, and so on. It takes a bit of signal processing to extract these features, but they are not very hard to compute, and it turns out that these are very popular features when we are analyzing the GSR responses.
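As an illustration of the kind of event-related features just listed, here is a small sketch that counts SCR peaks and computes their amplitudes and a rough rise time from an already separated phasic signal; the threshold and the toy signal are illustrative assumptions, not the course's exact pipeline.

import numpy as np
from scipy.signal import find_peaks

def scr_peak_features(phasic: np.ndarray, fs: int = 4) -> dict:
    # 0.01 microsiemens is an assumed minimum peak height for the demo.
    peaks, props = find_peaks(phasic, height=0.01)
    amplitudes = props["peak_heights"]
    # Rough rise time: walk back from each peak to the preceding local minimum.
    rise_times = []
    for p in peaks:
        start = p
        while start > 0 and phasic[start - 1] < phasic[start]:
            start -= 1
        rise_times.append((p - start) / fs)
    return {
        "n_peaks": len(peaks),
        "mean_peak_amplitude": float(np.mean(amplitudes)) if len(peaks) else 0.0,
        "sum_peak_amplitudes": float(np.sum(amplitudes)),
        "mean_rise_time_s": float(np.mean(rise_times)) if rise_times else 0.0,
    }

# Toy phasic signal (4 Hz) with two bumps, just to exercise the function.
fs = 4
t = np.arange(0, 60, 1 / fs)
phasic = 0.05 * np.exp(-((t - 20) ** 2) / 8) + 0.08 * np.exp(-((t - 45) ** 2) / 8)
print(scr_peak_features(phasic, fs))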
Another very popular family of features for the GSR signal lies in the frequency domain: very simply, the band power of the GSR signal. When we analyze the band power, we try to compute the spectral power of the signal in different frequency bands; for example, we look at the band power at 0.5 hertz, then at 1 hertz, similarly 1.5, 2.5 hertz and so on.
So, we split the GSR signal into different bands in the frequency domain, and once we have the spectral power in each band, we look at which is the minimum spectral power, which is the maximum, and what the variance of the spectral power is. These are some statistical features that we obtain from the band powers calculated in the frequency domain. Again, quite simple features to compute, and it turns out they are very important when it comes to the analysis.
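A minimal sketch of such band-power statistics, using Welch's method from scipy, is shown below; the band edges and the toy signal are illustrative choices, not the exact bands used in the lecture's feature set.

import numpy as np
from scipy.signal import welch

def gsr_band_powers(signal: np.ndarray, fs: float = 4.0) -> dict:
    freqs, psd = welch(signal, fs=fs, nperseg=min(len(signal), 256))
    bands = [(0.0, 0.5), (0.5, 1.0), (1.0, 1.5), (1.5, 2.0)]   # Hz, assumed edges
    powers = {}
    for lo, hi in bands:
        mask = (freqs >= lo) & (freqs < hi)
        # Integrate the PSD over the band to get its power.
        powers[f"{lo}-{hi} Hz"] = float(np.trapz(psd[mask], freqs[mask]))
    vals = np.array(list(powers.values()))
    powers.update({"min": float(vals.min()),
                   "max": float(vals.max()),
                   "var": float(vals.var())})
    return powers

# Example with a toy signal sampled at 4 Hz: 60 s of slowly drifting "GSR".
rng = np.random.default_rng(0)
toy = np.cumsum(rng.normal(0, 0.01, 240))
print(gsr_band_powers(toy, fs=4.0))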
In a paper in 2019, our group showed that, apart from these commonly used SCR features, statistical MFCC features are also very useful. MFCC stands for Mel-Frequency Cepstral Coefficients, which you may have heard of; these features turn out to be very important when we analyze GSR signals, and they even outperform the commonly used SCR features.
Without going into too much detail on this particular diagram: it represents the weighted occurrence, let us say the weightage, of each feature for a particular classifier while doing the classification of arousal. Here, what we wanted to do was classify arousal, and while doing so we analyzed the SCR features, the time domain features, the band power and other frequency domain features, and so on. These are the different features that we analyzed.
430
And then if you look at this particular bar here, what it shows that the weightage of the
MFCC statistical features were much higher than in comparison to let us say you know all
these SCR band power frequency domain features. And that is how we concluded that
statistical MFCC features they outperform the commonly used SCR features. And for more
details, I will invite you to please look at this particular interesting paper that our group
published in this domain in 2019, right, perfect.
These are some of the features we use when analyzing the GSR signal. Extracting them takes a bit of signal processing, but you need not worry about that: there are different tools which can do this job for you, and at the end of this class we will give you a demo.
Now, it turns out that there are certain limitations associated with the GSR signal as well when we try to analyze emotions. One very common limitation across physiological signals is the question of how to label the emotions associated with the physiological signal, in this case the GSR signal.
This is challenging because, when we are monitoring GSR or any physiological signal in the wild in real time, the onset and offset of the emotion are not known precisely. For example, suppose you are trying to recall a particular fact in front of an audience: you do not know exactly when you started recalling it, and there is very little possibility of annotating or marking the timestamp of that moment, because you are in the wild.
This becomes tricky because synchronization then becomes a problem, and accordingly the classification and everything downstream becomes harder. Another question that may have come to your mind when we were talking about SCR and SCL is how to separate the SCR and the SCL components.
This is a central problem, but there are lots of tools available online nowadays, and we will provide you some links, through which you can decompose the GSR signal into its SCR and SCL components without much trouble.
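For instance, one such openly available tool is the NeuroKit2 library, which can decompose an EDA/GSR recording into its tonic (SCL) and phasic (SCR) parts. A minimal sketch follows, assuming a recording sampled at 32 hertz; check the library's documentation for the exact options of your version.

import neurokit2 as nk

# Simulate (or load) a GSR/EDA recording sampled at 32 Hz.
fs = 32
eda = nk.eda_simulate(duration=60, sampling_rate=fs, scr_number=5)

# Process the signal: cleaning, tonic/phasic decomposition, SCR peak detection.
signals, info = nk.eda_process(eda, sampling_rate=fs)

scl = signals["EDA_Tonic"]                 # slow skin conductance level (SCL)
scr = signals["EDA_Phasic"]                # fast event-related component (SCRs)
n_scrs = int(signals["SCR_Peaks"].sum())   # number of detected SCR peaks
print(n_scrs)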
These tools, published by earlier researchers in the domain, make your life a lot easier, so you need not worry about that. Now, another very important problem with the GSR signal is that, while it tells you that the individual is aroused or stressed, it does not tell you whether that arousal or stress was due to a positive stimulus or a negative stimulus.
You may feel aroused because of a positive experience or because of a negative experience, and both kinds of stimuli result in an arousal that triggers GSR peaks; you may well see the same number of peaks whether the arousal was in response to a positive stimulus or a negative one.
So, it does not help you differentiate the direction of the emotion: it tells you that there is arousal, but not whether it is positive or negative, and that is a severe limitation. In other words, if you recall the emotion theory, GSR helps you track emotional arousal but tells you nothing about emotional valence.
If you recall the VAD and PAD models, we have information about the arousal dimension, but not about the valence dimension, at least when we use the GSR signal alone.
It is for this reason that, when we make use of GSR signals, we usually combine them with other data sources: facial expression analysis, as I showed you before, or other physiological signals such as EEG, the electroencephalography signal. We may also use eye tracking to observe the gaze data of the individual. In short, this is how we use the GSR signal to understand arousal, and whenever required we combine it with other data sources to obtain more information.
Now, let us try to look at how the emotions are represented in the EEG signal.
Affective Computing
Prof. Jainendra Shukla
Prof. Abhinav Dhall
Department of Computer Science and Engineering
Indraprastha Institute of Information Technology, Delhi
Week - 07
Lecture - 03
EEG and Emotions
Hi friends. In the last module, we looked at emotions in the GSR signal, a very popular physiological signal for observing emotions. In this module, we will again look at a very common physiological signal, one which has become very popular in recent years: the EEG, or electroencephalography, signal.
You already know that our brain consists of billions of neurons, and in order to communicate with each other these neurons fire in sync. Whenever thousands of neurons fire together, they generate an electrical field which is strong enough to spread across tissue, bone and, more importantly, the skull.
That is how we can eventually measure the electrical activity of the brain on the head surface, and this is essentially what EEG is. EEG records the electrical activity generated by the brain, and it records it with the help of electrodes that we place on the scalp surface.
We will see some examples. As I was saying, neurons fire in the brain because they want to communicate with each other via electrical impulses, and it turns out that this communication is not random. These patterns of activity have been shown to be associated with certain cognitive processes, such as drowsiness, alertness, and engagement, that is, how engaged you are.
They also give a sense of wakeful relaxation, whether you are in a relaxed state even though you are awake. And, more importantly, by looking at your brain activity they can tell you whether there is an approach or avoidance attitude in response to something.
Because of this, we are able to understand a lot of emotional information about the individual, such as engagement and the different cognitive states the individual is in, through the brain activity, and that is what makes EEG very exciting. Another important property of the EEG signal is that it provides excellent temporal resolution.
What this means is that the response time of the EEG signal is very fast, somewhere around 200 milliseconds: from the onset of a stimulus, it takes around 200 milliseconds for the response to appear in the EEG signal. Hence it allows the detection of activity at sub-second time scales.
As I said, within about 200 milliseconds you may have the response, so within a second you can already analyze the activity, that is, the cognitive activity happening inside the brain.
(Refer Slide Time: 03:54)
So now, having understood that, let us look at how we can measure the EEG signal. It turns out that, due to the advancements in wearable technologies, the sensors and tools used to analyze brain activity are not very invasive anymore.
Basically, we place EEG electrodes, such as the ones you are seeing on the left of the screen. This diagram shows how EEG measurement is done. The EEG electrodes are placed on the surface of the scalp, usually in the form of an EEG cap; putting them in a cap just makes the placement a bit easier.
Whatever signal these electrodes capture is, of course, a very small, minuscule voltage; you are not a transformer, and thankfully the human brain does not generate a lot of current or voltage. So the electrical activity that the electrodes capture is usually passed through an amplifier, which amplifies it.
The amplified signal is then sent for storage and further processing, one channel per electrode. For example, here you can see a number of waveforms; in this case there are 1, 2, 3, up to 10 of them, each representing the electrical activity captured by a different electrode.
So, if there are 10 waveforms, they represent the activity of 10 different electrodes: however many electrodes you have on the EEG cap, that is how many waveforms you will get for further processing, storage and recording.
Nowadays, as I said, because of the advancements in wearable technology, there are even more non-invasive and portable headsets, such as the Emotiv EPOC headset that you are seeing on the right side.
The Emotiv EPOC headset has 14 electrodes, and you can see that it is not a cap structure; it is a Bluetooth headset. It simply captures the data through the electrodes and then uses Bluetooth to transfer the signals wirelessly to a nearby computer, or wherever you are collecting the data.
All 14 electrodes are depicted in this particular diagram, which shows the electrode locations of the Emotiv EPOC as per the 10-20 international labelling system. So, to understand this, let us first look at the 10-20 system itself.
(Refer Slide Time: 07:12)
Let us try to understand the 10-20 electrode labelling system. The 10-20 international system is a standard for labelling the placement of electrodes in a uniform fashion. As you can see, this diagram shows the electrode locations of the international 10-20 system for EEG data.
The electrode placements on the surface of the scalp are grouped into different areas. For example, if you look at these two electrodes, this part is known as the prefrontal region.
Similarly, you have the frontal region, and it is not hard to see why it is called that: after the prefrontal region comes the frontal region of the scalp. Then you have the temporal region, here, and the parietal region, here. And then you have the occipital region at the back, and of course the central region. These are the different regions into which the placement sites on the scalp are divided as per the 10-20 international system.
Apart from these placement regions, there is a letter z that is attached to some electrodes, and the z refers to the midline sagittal plane of the skull. The electrodes placed on this midline sagittal plane carry the z label; accordingly, an electrode placed on the frontal midline becomes Fz.
Similarly, on the central midline it becomes Cz and on the parietal midline it becomes Pz. The Fz, Cz and Pz electrodes are mostly used as reference or measurement points: their data is usually not used directly in the analysis, but rather to establish the baseline or reference for the EEG data.
As per this labelling system, the even-numbered electrodes, such as 2, 4, 6, 8 and so on, refer to electrodes placed on the right side of the head, while the odd-numbered electrodes refer to electrodes placed on the left side. That is how the placement in the 10-20 system looks.
Of course, this is not the only number of electrodes you can have in an EEG system; systems with 128 or even 256 electrodes are not uncommon. You can imagine that the more electrodes you have, the higher the density and hence the higher what we call the spatial resolution.
We will talk about spatial resolution in a bit. You can imagine that with more electrodes the precision of the data increases, but it also adds to the cost, the intrusiveness and the overall complexity. Now, having understood the 10-20 system, let us look at the electrode locations of the Emotiv EPOC.
If you look at this, you have AF3 and AF4; these are the prefrontal electrodes. Similarly, you have the frontal electrodes, some of which sit closer to the central and parietal areas, the temporal electrodes, the reference electrodes, the electrodes in the parietal region, and these electrodes in the occipital region. Altogether you have 14 electrodes in the Emotiv EPOC system, which is a nice, portable, wireless system; on the other hand, you can always use the kind of EEG cap you are seeing on the left hand side.
Nevertheless, no matter what type of device you have, all of them usually make use of this international 10-20 system for the placement of electrodes on the scalp.
It turns out that these different regions of the human brain carry information about different functions. Without going too much into the neuroscience, we will now look at the features of the EEG signal that are related, or correlated, with emotions, rather than trying to understand the EEG data as such.
One of the most popular features of the EEG signal is known as frontal alpha asymmetry (FAA). As the name suggests, it is the difference between the right and the left alpha activity over the frontal region.
Let us break it down. It is a difference between the right and the left: you have a right hemisphere and a left hemisphere, and we already know there is a prefrontal and a frontal region. What alpha activity is, we will understand in a bit.
It is the difference between the frequency power over the left and the right frontal regions. For the frontal pair you have several options: you can take F3 and F4, with F4 representing the right hemisphere and F3 the left hemisphere, or you can take F7 and F8.
Whichever pair you take, the difference between the spectral power of these two electrodes is what is known as frontal alpha asymmetry. Now, where does the alpha come from? You already understand the frontal part, and you understand the asymmetry, the difference in band power, so let us see where the alpha comes from.
It turns out that the EEG signal can be categorized into different bands according to frequency range. Roughly speaking, we have the delta band, which covers about 1 to 4 hertz and represents the electrical activity when the individual is in deep sleep.
Then you have 4 to 8 hertz, which is known as the theta band; the activity in this band mostly represents drowsiness, and also states such as meditation.
Then you have the range around 8 to 10 hertz and up to 12 hertz; together, 8 to 12 hertz is known as the alpha band. The alpha band is the electrical activity of the brain when you are mostly in a relaxed state, often with closed eyes.
Then you have the beta band, the electrical activity between 12 and 30 hertz, which mostly corresponds to states of concentration and active thinking. And when you look at 30 to 60 hertz, that is the gamma band, which is associated with different cognitive functions including the cognitive workload you may be experiencing. So, these are the different frequency bands, and they correspond to different kinds of cognitive states and activities of the individual.
Having understood that, you now know what alpha activity is: it is the frequency power in the alpha band, which is roughly 8 to 12 hertz. So, for frontal alpha asymmetry you look at the difference between F3 and F4, or between F7 and F8, in terms of this 8 to 12 hertz power of the EEG signal.
It is not hard to put together: frontal refers to the frontal electrodes, asymmetry is the difference, and alpha means you are looking at the 8 to 12 hertz activity. For the frontal pair there are usually two possibilities: either you take F3 and F4, or you take F7 and F8.
Frontal alpha asymmetry is used as a proxy for feelings of approach or avoidance. What does that mean? When you are looking at a stimulus, an activity or an external event towards which you feel engaged or interested, you are approaching it; conversely, when you feel boredom and withdraw your interest from a particular activity or stimulus, you are avoiding it. Both of these experiences are reflected in the FAA signal.
As I said, mostly we use the alpha band here, but many times the gamma band (above 30 hertz) and the beta band are also used; still, FAA based on alpha is the most common. Now, there is an interesting phenomenon here: there is an inverse relationship between alpha band power and cortical activity.
This simply means that if there is more brain activity, there is less alpha power, and vice versa: more alpha power means less brain activity. I hope this is clear; let me repeat it. When the brain, or a particular region of the brain, is actively involved in something, you will find that the alpha power for that region is lower, and vice versa.
If you combine these two facts, it means that increased left frontal activity serves as an index of approach motivation and the related emotions.
Increased left frontal activity means less alpha power in the left frontal region, and that indicates an approach motivation, usually a positive emotion, with anger being the notable exception: anger is an approach-related emotion but not a positive one.
Similarly, relatively increased right frontal activity, that is, less alpha power in the right frontal region, indicates a withdrawal motivation: you are experiencing a negative emotion and trying to withdraw from the stimulus or event, which corresponds to feeling disgust, fear, sadness and so on.
So, you can see how nicely frontal alpha asymmetry works; let me summarize it again. Frontal alpha asymmetry is used as a proxy, an indicator, of approach or avoidance. To calculate it, you look at the alpha band power, that is, the power in the 8 to 12 hertz range, at F3 and F4, or at F7 and F8.
You take either (Refer Time: 20:33) of the two pairs and compute the band power at each electrode. If there is increased left frontal activity, that is, less alpha power on the left, it indicates an approach motivation, usually a positive emotion, with anger as the exception.
Increased right frontal activity, that is, less alpha power at F4 or F8, means you are experiencing a negative emotion, an indicator of withdrawal motivation such as disgust, fear or sadness. So, this is a very important feature that you may want to look at.
It has been very commonly used to understand whether you like or dislike something, and it is quite easy to calculate, as shown in the sketch below.
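As a sketch of that computation: frontal alpha asymmetry is commonly taken as the difference of log alpha powers between the right and left frontal electrodes (here F4 and F3, though F7 and F8 work the same way). This assumes SciPy and channels sampled at 128 hertz; the log-difference form is one common convention rather than the only one.

import numpy as np
from scipy.signal import welch

def alpha_power(channel, fs=128.0, band=(8.0, 12.0)):
    # Alpha-band (8-12 Hz) power of one EEG channel via Welch's method.
    freqs, psd = welch(channel, fs=fs, nperseg=min(len(channel), 2 * int(fs)))
    mask = (freqs >= band[0]) & (freqs < band[1])
    return float(np.trapz(psd[mask], freqs[mask]))

def frontal_alpha_asymmetry(f3, f4, fs=128.0):
    # FAA as log alpha power of the right electrode minus the left one.
    # Positive values: relatively greater left frontal activity (approach);
    # negative values: relatively greater right frontal activity (withdrawal).
    return np.log(alpha_power(f4, fs)) - np.log(alpha_power(f3, fs))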
Apart from frontal alpha asymmetry, there are other features that are very popular. Again, the EEG is a time series signal, so you can do the analysis in the time domain, in the frequency domain, or in the time-frequency domain.
Just as we had event-related skin conductance responses for the GSR signal, we have event-related potentials (ERPs) in the EEG signal, and there are different types of ERPs. One very popular example is the P300: the number 300 indicates a response occurring around 300 milliseconds after the onset of the stimulus.
That is an example of an event-related potential, and the community computes many statistical features on the P300. Similarly, as I was suggesting, you can definitely use the power features from the different frequency bands: you already know the EEG signal has different bands, and within each band you can calculate the power and use it to understand what is happening in that particular region.
Those are the power features from the different frequency bands. There is also a very interesting paper presented by researchers in 2014 which nicely shows the different EEG features relevant to emotion recognition.
They have shown that advanced features such as higher order crossings (HOC) are very useful. What is a higher order crossing? It is simply the number of times the waveform crosses the zero axis: you count the zero crossings of the signal, then take successive differences of the signal and count the crossings again, and so on. What the researchers show here is the weightage of HOC among the different features for emotion recognition; see the sketch after this paragraph.
You can see that these advanced features, like HOC, outperform the commonly used features such as the band power features, shown here, and many other features including the DWT features, that is, discrete wavelet transform features, and various statistical features.
Without going into too much detail, in this particular paper the researchers analyzed the different features of the EEG signal in the time domain, the frequency domain and the time-frequency domain, and this diagram gives a very nice representation of the weightage, or importance, of the different EEG features for emotion recognition.
I definitely invite those of you who are going to work on emotion recognition from EEG signals to have a look at this very interesting article.
Now we have understood the EEG signal, and I hope we are a bit excited about its use. With that, we also have to understand that there are certain limitations associated with EEG signals.
One limitation that is very common with the EEG signal is that it has a very poor spatial resolution. What does that mean? While it has an excellent temporal resolution, being a very fast signal that responds within about 200 milliseconds to an external event or stimulus, its spatial resolution is poor.
You really do not know whether the signal being measured is produced near the surface, that is, in the cortex, or at a much deeper level. This is what we call the source localization problem in the EEG domain. One way to address it, as you may have guessed, is to use more electrodes, but while that gives you a hint, there is still no perfect way to resolve the source localization, or the spatial resolution, of the EEG signal.
It is also easy to see that, while portable, non-intrusive, wireless devices such as the Emotiv EPOC and others are available in the market, EEG is still a bit intrusive in comparison to the GSR and heart rate sensors we saw earlier. Nevertheless, it provides an excellent temporal resolution and is a very effective sensor. To handle it, though, you need some training in how to place the electrodes on the scalp, and at the same time some understanding of signal processing in order to analyze the signal you are capturing.
The bottom line is that nowadays there are lots of tools available online, and at the end, as I said, we will have a demo of certain tools and methodologies which will definitely help you a lot in preprocessing and analyzing these physiological signals. So, you need not worry about it, though it is always good to have a background in signal processing.
One very important problem associated with EEG data is that it is almost always contaminated by artifacts, that is, electrical signals detected along the scalp that do not originate from the brain.
For example, as simple as it sounds, when you are capturing EEG signals, every blink or eye movement is reflected as an artifact in the EEG signal. You have to understand when these artifacts occur and how to remove them accordingly. Again, there are tools which can do this job for you. As I said, while EEG is a very interesting signal that gives you a lot of information, it is often combined with facial expression analysis and eye tracking to get a multimodal view of the emotion we are trying to analyze and understand.
Now that we have understood emotions in EEG signals, let us conclude, and after that we will go for a demo on how to process and analyze physiological signals. First, we understood that physiological sensing is an important tool for the affective computing community, mainly because it gives you an unaltered form of the emotion: it cannot easily be faked, and hence it represents fairly pure emotions occurring in humans.
Another important point is that physiological sensing, in contrast to other forms of sensing, allows both continuous and discrete monitoring of the individual's emotional state. When you use self-reporting to understand the emotional state, for example, you cannot really do continuous monitoring.
You cannot really ask someone every second or every millisecond how they are feeling and what their emotional state is. But if you have sensors which are wearable, portable and wireless, then with their help you can really do continuous monitoring of the emotional state.
Among the different physiological signals, we also saw that heart rate, GSR and EEG are the ones most commonly used for emotion recognition and analysis. With these physiological signals come lots of challenges, such as the labelling or annotation problem we talked about: in the wild, the onset of an emotion is often unclear.
In laboratory settings you have a lot of control and can mark exactly when the stimulus is presented and when its onset is, but there are usually far fewer resources available when you collect data in the wild. For the same reason there is something known as windowing.
there is this there is something known as windowing here.
So, window based approach. So, basically in the window based approach, what you do?
Basically, rather than trying to analyze the signal every second, let us say or every minute you
create a window of let us say 5 seconds, 5 minutes or something like that. And you try to
analyze the emotion within that particular window and then you sort of you know moving
window you make use of a moving window and then you let us say if there is a 5 minute
window you move it with 50 percent interval.
So, then the first signal that will appear from the 1 to 5, similarly the second signal if there is
a 50 percent overlap, may appear from 2.5 to let us say 7.5 and so on so forth. So, you got the
idea right. So, this is known as a 50 percent overlap of moving window. So, you try to
analyze these emotions in a window segments and then you try to do this moving window to
capture to help you analyze the emotions without even precisely knowing the onset of the
emotions.
Another problem is the subjective self-perception of emotion. Whenever you want to analyze emotions, you need to get the ground truth from humans, and emotions are very subjective.
The same thing that makes one individual happy may make another individual sad, and emotions themselves are very complex. There are, of course, well-defined emotion models, but you really need to make the participants understand those models in advance, to whatever extent you can, so that they can report their perception of the emotion in terms of the model you are trying to analyze or use.
This issue is not specific to physiological signals; it is common to all the other modalities as well. Another interesting point is that there is a many-to-one mapping of non-emotional and emotional influences onto the physiology. This simply means that, while the physiological signals may respond to an emotional stimulus, the same type of response may also happen due to some physical activity.
For example, as simple as it sounds, if I sneeze right now, the sneeze itself will increase my heart rate to some extent and make some changes in my physiology which are not related to the emotional state I am in.
So there is a many-to-one mapping here: non-emotional actions and emotional influences can produce the same sort of physiological response, and that is a tricky problem when you want to analyze emotions in the wild.
To differentiate between the two, people have used different types of sensors, and one very popular one is the IMU, the inertial measurement unit. What the IMU sensors do is observe the motion of the human body while you are observing the emotions using the physiological signals. When motion is detected by the IMU, you may not want to consider the physiological responses in that particular segment, or at least you may want to be a bit careful while analyzing it.
That is one very important and interesting challenge. Another challenge that is common across all the modalities, and perhaps a bit more pronounced in the case of physiological signals, is individual variability. So, what is individual variability?
Individual variability simply means that there are variations across individuals in how physiological signals are expressed. For example, your baseline heart rate may not be the same as someone else's. Even for healthy individuals there is a range; if I am not wrong, a healthy resting heart rate is somewhere around 60 to 72 beats per minute.
There is a reason it is a range: because of individual variability, 60 is fine, 65 is fine, 70 is fine. So you need to know this variability, the baseline for each individual, in order to understand what a change in physiological activity means for that particular individual in that particular signal, as illustrated in the sketch below.
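One simple way to account for this individual variability, as a sketch: express each person's signal relative to their own resting baseline, for example by a per-participant z-score. The baseline segments and numbers here are purely illustrative.

import numpy as np

def baseline_normalize(signal, baseline):
    """Express a physiological signal relative to a per-person baseline.

    signal   : recording taken during the task or stimulus
    baseline : resting-state recording of the same person and sensor
    """
    mu, sigma = np.mean(baseline), np.std(baseline)
    return (signal - mu) / (sigma + 1e-8)   # z-score w.r.t. the person's own baseline

# Two people with different resting heart rates end up on a comparable scale.
task_a = np.array([72.0, 75.0, 80.0])
task_b = np.array([62.0, 65.0, 70.0])
print(baseline_normalize(task_a, baseline=np.array([68.0, 70.0, 72.0])))
print(baseline_normalize(task_b, baseline=np.array([58.0, 60.0, 62.0])))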
This is another interesting limitation. Nevertheless, the bottom line is that a lot of research and development is being done on the analysis of emotions in physiological signals, and with the proper effort, with the right sensor, the right setting, and the right analysis, machine learning, deep learning and signal processing techniques, these physiological signals can enable affective computing in a very robust fashion, and one which is private.
In many senses the data of the individual is private: it is kind of anonymized, since you are not really looking at the identity of the individual, so in that sense it is quite private. It is also quite personal, and for the same reason you are not trying to identify individuals; you are simply trying to identify the emotions the individual is having.
Another very important thing that is not offered by many modalities is that you can analyze emotions in a continuous fashion, which is not possible with several of the other modalities, and it is a luxury when you combine the private, personal and continuous attributes together.
(Refer Slide Time: 37:22)
With that, there are lots of references that we used while preparing this lecture, and we will provide them to you. Now we will look at the demo that we talked about: how to process and analyze a particular physiological signal, covering acquisition, preprocessing, maybe feature engineering and then classification. So, with that, see you at the demo.
Thanks.
Affective Computing
Prof. Jainendra Shukla
Prof. Abhinav Dhall
Department of Computer Science and Engineering
Indian Institute of Technology, Ropar
Week - 08
Lecture - 01
Multimodal Affect Recognition
Hello, I am Abhinav Dhall from the Indian Institute of Technology, Ropar. Friends, today we will be discussing Multimodal Affect Recognition. In the past few weeks, we have been discussing single-modality emotion recognition systems. Today, we will see how we can combine them and get a fairer, more accurate output with respect to the user's emotion.
In this lecture, I will first discuss why the presence of multiple modalities is important and useful in a large number of circumstances. Then I will discuss the basics of fusion, that is, how we combine the information, and its different methods and types.
(Refer Slide Time: 01:21)
Now, the motivation is as follows. Let me try to enact. So, I am going to say I really like ice
cream on a hot day. Let me say this again, but this time I am not going to face the camera, I
really like ice cream when it is a hot day. And now, let me repeat this for the third time, I
really like ice cream on a hot day.
Now, what has happened in this exercise? In the first iteration, I was looking into the camera, so you could analyze my facial expressions. In the second iteration, I was not looking into the camera, but you could hear me, so the speech modality was present.
In the third iteration, as I was speaking, I was moving my head, looking down and looking sideways. So what happened in this iteration? At some points you had access to both voice and face, and at other points only the voice. That means there was extra information, complementary information, and sometimes vital information when one modality was present and the other was not.
So, this gives us the motivation to use different modalities. If you recall, friends, the different modalities in this case are the camera-based images and videos, the microphone-based voice along with background sounds, the text-based information in the form of messages and documents, and the physiological sensor based information coming from sensors such as the EEG or the ECG.
This means we are going to combine multiple streams of data so that our emotion recognition system can have richer information. Richer information means extra, useful and complementary information which can help the machine learning model better predict the emotional state of the user.
Accuracy is the concern here: both when one modality may or may not be present, as I showed you, and when both are present, we could perhaps have a better performing system.
The expectation from a multimodal emotion recognition system is that each modality provides some unique information. When you have unique information from different modalities, combining them makes a lot more sense.
Let us say that in a scenario both modalities, face and voice, present exactly the same information about the emotion. In that case, combining them would mean more computation but not necessarily an increase in the performance of the system.
Therefore, when we create multimodal systems, we keep in mind that we should add those modalities which provide complementary, unique information, so that the ensemble of information coming from the different streams is useful for predicting the emotion.
We also do not want to end up with redundant data when similar information is coming in: along with compute there is a storage cost as well, so you end up requiring more storage without it really helping the system.
According to Sidney D'Mello and Kory, the challenges for multimodal systems are as follows. First, we need machine learning classifiers and fusion methods, the techniques with which we combine the information coming from different sensors, that better capture the relationship between the different modalities. What does that mean?
Let us say you have face information coming from a camera and the speech of the person, one user, coming from a microphone. How do you combine them such that we better capture the information for emotion prediction, and what is the relationship between the data?
Only once you understand the relationship between the data coming from different sources can you combine it well with a machine learning model or some statistical analysis method, which you will see during the course of the lecture.
The second challenge is the affective corpora: datasets which contain adequate samples of synchronized expressions are extremely important. We need a dataset for learning the patterns for emotion recognition, and it is a non-trivial and time consuming process to set up an experiment for data recording where, let us say, the stream of face information coming from the camera is synchronized with the voice information coming from a microphone.
They would have to share a common timeline. So it is actually a challenge to have a large number of samples where the multiple modalities are synchronized.
The other challenge is that, in a large number of experiments, the improvement in accuracy when you combine multiple modalities turns out to be very modest, a very minimal increase. That could depend on the choice of fusion techniques, how much data you have, how complex or simple a machine learning method you are using, and how many multimodal samples you actually have.
So, the challenge is that just because multiple modalities, face, text and so forth, are available, that does not mean that combining them will push the performance up; the increase in accuracy depends on how the information from the different modalities is combined.
And of course, as we already discussed, there is the question of whether there is any unique information in the different modalities, because if there is not and you combine them anyway, you may end up with the curse of dimensionality from a machine learning perspective.
(Refer Slide Time: 09:55)
Now, let us look at multimodal affect recognition. What we want to understand here are the underlying relationships and correlations between the feature sets from different modalities and the affect dimensions.
Remember, the features here could be, for example, MFCCs for voice and histograms of oriented gradients for the face. You would like to combine them, understand the correlation between them and study their relationship, and that helps you create a multimodal affect recognition system.
The first task here is to choose the right features, that is, feature selection: which feature should I use to represent the text? Which feature should I use to represent the physiological signals?
The next question is how different affective expressions influence each other. The presence of, say, face data, voice data or gesture data may help one emotion category more than another.
So, when you are designing a multimodal system that combines the power of different representations from different streams, you have to see which emotion category could gain more. These are design choices which we make when creating a system.
So, we are going to look for the uniqueness of the information with respect to emotion categories, and further at how much information each modality provides about the expressed affect. Let us say we want to create a system which tells us what emotion is elicited in the user during the consumption of online content, some video content available on the internet. You can take two types of inputs.
One, you can capture the face and the body from the vision modality; this is passive information, since you are simply recording it. You could also gather more active information, where you get feedback from the user: after they watch the video there are some questions and they answer them. So, that is feedback from the user.
These two kinds of data are of very different nature: one is a continuous stream of images, and the feedback could simply be some yes or no binary answers. It needs to be seen how much information the images of the user can provide during the watching of the video. Later, when you ask questions of the user, for example, did you enjoy watching this video, was this video informative, you get yes or no answers, and these could also be indicative of the affect which was induced in the user while they watched the video.
So, during the consumption of the material the images give us better information, and after the user has seen the video the feedback gives us more information. The multimodal system can combine these and get a more accurate estimate of the emotion which was elicited in the user in this particular example. Now, you have selected the modalities.
(Refer Slide Time: 14:21)
After you have selected the modalities, the next question, similar to what we discussed for voice-based and face-based emotion prediction, is the task of collecting the data. We will require a labeled or semi-labeled dataset where the data comes from different modalities.
Of course, similar to the unimodal case, where only a single modality is present, when we create multimodal datasets the expectation is that they will contain spontaneous, more natural data, and it could be subtle as well.
Why is this important? Typically, in the real world, when we express ourselves, most of the time the expressions are subtle, whether in the face or in the voice; only in rare circumstances will you see extreme expressions or an extreme show of emotion in the voice or in the text.
So, the dataset you capture from different modalities should contain spontaneous data and should be representative of real-world circumstances. Now, what is different in multimodal datasets compared to unimodal ones? Your database should contain the contextual description which helps in synchronization with the other modalities.
What could that mean? Let us say at a timestamp T1 you have a stream of information coming in; this is the image of the whole body of the user, and we are looking at this particular timestamp T1. In parallel, at T1, I also have the voice information coming in. How do I know the timestamp that allows me to map these two together, so that even though one is coming from a camera and the other from a microphone, I make sure that I get the information at exactly the same time for the different modalities?
Further, during the data collection, I also need to take care of the stimuli: they should be labeled simultaneously for all the modalities available to the coder. What that means is, let us say I am a labeler and I am going to assign emotion categories, or emotion intensities if it is a valence-arousal scale, to data which contains the camera modality and the voice modality; so audio is there and video is there.
For assigning the emotion category, as a labeler I have to be careful to consider both modalities, the face and the voice. If I label them individually, it will be non-trivial to fuse the labels together, because it is possible that listening only to the audio conveys a slightly different emotion compared to just looking at the video.
So, while collecting multimodal databases, we need to give very clear instructions to the coders, the labelers, about how they should interpret the data and what the expectations on that interpretation are.
(Refer Slide Time: 18:38)
Once we have created the dataset, the next part is extracting different features from the different modalities. Let us look at the important points which need to be taken into consideration while extracting the features, because here features are extracted from data streams coming from different types of sensors, which means the data in the different streams will be of different nature, different format and different size. So, let us delve into this.
First, there will be variable sampling frequencies. Let us say we have three sensors: the first is the camera, the second is the galvanic skin response sensor, attached, say, to the arm of the participant, and then we have a headband, which is your EEG, placed on the head.
Typically, the information coming from your camera would be at a frame rate of, say, 25 or 30 frames per second, while the information coming from your GSR and EEG could be at different frequencies; this one could be at 16 hertz and this one at 128 hertz.
Different strategies are required to extract features from these modalities individually. But since a different amount of data arrives in a fixed unit of time across these modalities, we have to be careful about how we are going to join the information together.
Therefore, we are also going to require methods for synchronization. Let us say we are going to analyze 1 second of data together, where by analysis we mean we are going to extract the features.
I should know, from T1 to T2, a duration of 1 second, how much video data has come in, how much GSR data has come in and how much EEG data has come in, because ultimately this is the duration for which I will extract the individual features, let us say F1, F2 and F3.
So, I require a way to synchronize, so that for the camera data, the GSR sensor data and the EEG data I know what T1 is, the timestamp where the starting point lies in each stream, and then the ending point, which in this example is one second later, as in the sketch below.
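As a sketch of this kind of synchronization, assuming each stream is stored with its own sampling rate and shares a clock starting at time zero, one can cut out the samples of every stream that fall inside the same window.

import numpy as np

def synchronized_segment(streams, t1, t2):
    """Return the samples of every modality that fall inside [t1, t2) seconds.

    streams : dict mapping modality name -> (samples array, sampling rate in Hz),
              all assumed to start at the same time t = 0 on a shared clock.
    """
    segment = {}
    for name, (samples, fs) in streams.items():
        start, stop = int(round(t1 * fs)), int(round(t2 * fs))
        segment[name] = samples[start:stop]
    return segment

# Illustrative rates: 30 fps video features, 16 Hz GSR, 128 Hz EEG.
streams = {
    "video": (np.random.randn(30 * 10), 30),
    "gsr":   (np.random.randn(16 * 10), 16),
    "eeg":   (np.random.randn(128 * 10), 128),
}
one_second = synchronized_segment(streams, t1=2.0, t2=3.0)
print({k: len(v) for k, v in one_second.items()})  # {'video': 30, 'gsr': 16, 'eeg': 128}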
Now, after we have synchronized, we will be able to combine these features, right. Let us say this is how we are combining. The other option is that we can also decide, at a certain moment in time, when we want to make a decision, that is, when we want a chunk of data coming from the different modalities to be analyzed, and that would then, let us say, tell us the emotion category, right.
Further, it is extremely important to do the right feature selection, right. It is about optimizing the feature space individually, followed by a combined feature selection; we have seen this already. So, when you have, let us say, the video coming in and the EEG data coming in, you want the optimum feature F 1 for the video and the optimum feature F 2 for the EEG, and then an optimum strategy for combining them, so that F 1 and F 2 together give you the most discriminative information about the current state of the user. The combined feature should carry the unique information from each of the modalities.
And even if they are unique, we also need to normalize them. Because the feature from the video modality, F 1, and the feature from the EEG modality, F 2, would, let us say, be of different ranges, right, different ranges, different sampling rates. So, we have to normalize them and construct the combined feature in the most optimum manner.
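As a rough illustration of this normalization step, here is a small Python sketch with made-up feature dimensions that z-scores each modality's features before concatenating them; this is one common choice, not the lecture's prescribed method.

import numpy as np

def zscore(x, eps=1e-8):
    # Standardize each feature dimension so the ranges become comparable.
    return (x - x.mean(axis=0)) / (x.std(axis=0) + eps)

# Hypothetical per-window features for N windows.
N = 100
f1_video = np.random.rand(N, 768) * 255.0    # e.g. video features, large value range
f2_eeg   = np.random.rand(N, 64) * 1e-4      # e.g. EEG band-power features, tiny value range

fused = np.concatenate([zscore(f1_video), zscore(f2_eeg)], axis=1)
print(fused.shape)   # (100, 832)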
(Refer Slide Time: 23:42)
It is often noted that highly correlated information should be reduced individually per modality, right. So, since we are looking for the unique information and not the correlated part, we would be highly dependent on the individual modality as well: the quality of the individual modality and the quality of the feature which is representing that modality's data.
Now, further, once you have combined the features, right, you combine these two features, you could do certain optimizations at this level. One optimization, of course, is normalization; it could also be, let us say, some dimensionality reduction, or projecting the data onto another space, such that we reduce the redundant cross-modal information.
So, we do not want the information which is the same in face and in EEG, or in EEG and in speech; what we want is the complementary information, so that together it is more information, not just more data which carries redundant information.
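The lecture only says "some dimensionality reduction"; as one illustrative, assumed choice, PCA from scikit-learn can compress the fused feature and drop much of the redundant variance.

import numpy as np
from sklearn.decomposition import PCA

fused = np.random.rand(100, 832)          # placeholder fused features from the previous step
reducer = PCA(n_components=64)            # keep 64 components (an arbitrary choice here)
fused_compact = reducer.fit_transform(fused)
print(fused_compact.shape)                # (100, 64)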
(Refer Slide Time: 24:55)
Now, once you have selected the type of feature representation, the next is how do we
combine them together? And combining is here called the fusion process. Now, broadly
friends, we can divide the fusion process into late fusion, early fusion and slow fusion. Now,
we will study each one separately.
Let us start with the early fusion, the simplest fusion method. So, what we are going to do is,
we are going to concatenate the features from multiple modalities into a single feature vector.
Now, if you see here let us say what is happening is, let us say we have face feature, then we
have some text feature and some speech feature, ok. So, there are three modalities.
What I am going to do is I am going to take these three features and combine them, fuse
them, ok. How do I fuse them? The simplest approach is: you take, let us say, feature 1, then behind feature 1 you put feature 2, and then you append feature 3, right. And then you have some machine learning processing; it could be any machine learning algorithm, here we are showing a neural network in the illustration, but you could have any other machine learning method.
Now, in this case it becomes challenging as the number of features increases, right, and also when the features are of very different nature, which is quite obvious. What we are doing is combining the features, right. When we combine the features, if, let us say, we had multiple features coming in from different modalities, we will have a large feature; and if the features are also very different, now what does that mean?
Let us say the face feature F 1 is for one frame and is of dimension 1 cross 768; this is just a hypothetical example. F 2 is a text feature which is actually a matrix, let us say of size 5 cross 12, and then the speech feature is actually a spectrogram, let us say of dimension 128 cross 128. So, one of these is one dimensional and the others are actually matrices, right.
So, when the features are very different, how are we going to fuse? One could say, well, I am going to fuse it this way: I take F 1, then I take a flattened F 2 and then a flattened F 3, and I simply append them. But this might not be the most appropriate, the most useful type of fusion, because not only would we end up with a very long feature, but it is also possible that the values coming from the different modalities lie in quite different ranges, right. So, you would have to normalize as well.
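Here is a small sketch of that naive flatten-and-append fusion, using the hypothetical dimensions from above (1 cross 768 face feature, 5 cross 12 text feature, 128 cross 128 spectrogram); it also shows how long the resulting vector becomes.

import numpy as np

f1_face = np.random.rand(768)         # 1 x 768 face feature
f2_text = np.random.rand(5, 12)       # 5 x 12 text feature (a matrix)
f3_spec = np.random.rand(128, 128)    # 128 x 128 spectrogram

def flatten_and_norm(x):
    x = x.ravel()                                  # flatten to 1-D
    return (x - x.mean()) / (x.std() + 1e-8)       # bring the value ranges in line

early_fused = np.concatenate([flatten_and_norm(f1_face),
                              flatten_and_norm(f2_text),
                              flatten_and_norm(f3_spec)])
print(early_fused.shape)              # (17212,) -- a very long feature vector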
Now, in the case of early fusion, friends, synchronization would be extremely important,
right. Because we are combining the information at the feature level itself. So, as I discussed
earlier, we will have to know the starting point and the ending point in time and we will have
to take the information exactly from those starting and ending points across the information
coming from different modalities.
Now, one could say, well, one way to combine the features coming from different modalities is simply concatenating them. There are other interesting yet simple ways: one could say, well, if for example the dimensions of the features coming from the different modalities are similar post normalization, one could extract simple statistics. Let us say I convert the different features coming from different modalities into features of the same dimension; so, I do some pre-processing.
Then, since the different features coming from the different modalities are now of the same dimension, I can compute things such as the mean and the standard deviation, right. So, this could become my feature which goes into the machine learning model for the further processing to get the emotion label.
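A minimal sketch of this statistics-based variant, assuming the per-modality features have already been brought to a common, made-up dimension of 256:

import numpy as np

d = 256
f_video = np.random.rand(d)
f_audio = np.random.rand(d)
f_text  = np.random.rand(d)

stack = np.stack([f_video, f_audio, f_text])       # shape (3, 256)
fused = np.concatenate([stack.mean(axis=0),        # per-dimension mean across modalities
                        stack.std(axis=0)])        # per-dimension standard deviation
print(fused.shape)                                 # (512,) -- the feature fed to the classifier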
Now, another challenge and a limitation which comes in for your early feature fusion for
multimodal emotion recognition is that since we are combining the features together, we are
actually not analyzing the features for their discriminativeness or unique information first and
are simply doing a combination, right.
(Refer Slide Time: 31:04)
So, we may lose a lot of temporal information in this pursuit, right. The individual transitions which are happening within the data coming from the individual modalities might be lost.
So, now let us look at late fusion, friends. In the case of late fusion, as the name suggests, what we are doing is the following: let us say we have two features coming in, ok. One
feature is your text data and the other is your speech data, ok. So, you have feature F 1 and
you have feature F 2.
What you are going to do is you are going to take F 1 and you are going to input that into
your ML pipeline, ok. You separately take F 2 and you input that into another pipeline. Now
these can be similar or these can be different depending on the nature of the feature and the
complexity of the feature and what type of information you want to extract.
Now, what you are going to do is you are going to get some decisions, ok. Let us say from your first ML pipeline you get some decision about the emotion, and from the second pipeline, call it ML', you also get some emotion decision.
Now, notice this is for one data point: you have the text and the speech, you input them individually, separately, into the machine learning pipelines and you get some emotion decisions. Now you are going to combine these decisions, ok. So, the combination, the fusion, happens late in the processing pipeline, and that is why we call these types of systems late fusion based multimodal emotion recognition. So, each classifier processes its own data stream, and the multiple sets of outputs are combined, as we did here, at a later stage.
Now, in this case there are two further sub-categories. The first is late fusion at the soft level, which essentially means that we use a measure of confidence associated with the decisions. The second is late fusion at the hard level, which means that the combining mechanism operates on a single hypothesis decision.
So, in one case you are taking the probabilities, let us say, which are coming from the two different channels, as here in the example. And in the other case you are simply combining the decisions. A very simple example: let us say you take the decisions about the emotions and you apply an AND or an OR across these two, and that is going to give you the final emotion, ok.
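The following short sketch, with invented probabilities for a hypothetical 4-class problem, illustrates the difference between the soft variant (combining confidences) and the hard variant (combining discrete decisions):

import numpy as np

p_text   = np.array([0.10, 0.70, 0.15, 0.05])   # class probabilities from ML pipeline 1
p_speech = np.array([0.20, 0.55, 0.20, 0.05])   # class probabilities from ML pipeline 2

# Soft-level late fusion: combine the confidences, then decide.
soft = (p_text + p_speech) / 2
print("soft decision:", int(soft.argmax()))      # class with the highest averaged probability

# Hard-level late fusion: each pipeline commits to a single decision first.
d_text, d_speech = int(p_text.argmax()), int(p_speech.argmax())
# e.g. an AND-style rule on a binary hypothesis such as "is the user showing class 2?"
agrees_on_class_2 = (d_text == 2) and (d_speech == 2)
print("hard decision (AND):", agrees_on_class_2)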
Now, related to both early and late fusion in a way you know a middle ground between early
and late fusion is friends your slow fusion process. What you are saying here is let us say
again you know you have face and you have your EEG information or maybe you know some
speech information coming in.
You are going to combine the information at multiple levels, ok. Perhaps, let us say, you combine face and EEG together here, speech goes in individually, and you have a common ML system, ok. Further, you take some part of the system and analyze it individually; let us say you get the decisions, you again learn a small classifier on the decisions, and that gives you the final emotion.
Because when you were doing late fusion you got the decisions for all the modalities separately and then you combined them; you did not analyze the features together, you analyzed the decisions together, right. But here we have the best of both worlds: we are not only preserving the independence, the uniqueness, of the features, we are also exploiting the correlation between the features, and then we are combining the decisions as well.
So, slow fusion gives us the correlation between the modalities while relaxing the requirement of synchronization. As compared to your early fusion, where synchronization is extremely important, in this case we can relax a bit, because we are combining at different stages.
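As a rough sketch of this slow, hybrid idea, the snippet below fuses face and EEG features early, keeps speech separate, and then learns a small classifier on top of the two sets of decisions; all data, dimensions and model choices are placeholders, not the method of any particular paper.

import numpy as np
from sklearn.linear_model import LogisticRegression

N, n_classes = 200, 4
y = np.random.randint(0, n_classes, N)
face, eeg, speech = np.random.rand(N, 128), np.random.rand(N, 64), np.random.rand(N, 40)

# Stage 1: early fusion of face + EEG, plus an individual speech pipeline.
clf_fused  = LogisticRegression(max_iter=1000).fit(np.hstack([face, eeg]), y)
clf_speech = LogisticRegression(max_iter=1000).fit(speech, y)

# Stage 2: a small classifier learned on the concatenated probability outputs (the decisions).
meta_in = np.hstack([clf_fused.predict_proba(np.hstack([face, eeg])),
                     clf_speech.predict_proba(speech)])
meta = LogisticRegression(max_iter=1000).fit(meta_in, y)
print(meta.predict(meta_in[:5]))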
So, friends with this we reach towards the end of lecture 1 of the multimodal affect
recognition systems. We looked at why multimodal systems are useful, what are the
challenges in creating multimodal systems and later on we discussed about the different
strategies in creating the multimodal fusion system with respect to the early, late or slow
fusion.
Thank you.
Affective Computing
Prof. Jainendra Shukla
Prof. Abhinav Dhall
Department of Computer Science and Engineering
Indian Institute of Technology Ropar
Week - 08
Lecture - 02
Multimodal Affect Recognition
Hello and welcome. I am Abhinav Dhall from the Indian Institute of Technology Ropar.
Friends, we are going to talk about Multimodal Emotion Recognition. This is the second
lecture in the series for discussion on the topic of multimodal emotion recognition as part of
the Affective Computing course.
So, here are the contents which we will be discussing. First, I would be bringing in front
some of the databases, which have been proposed in the community and have been used for
analysis of emotion through multiple modalities. Then I would be talking about a very
important aspect of the progress which is brought in through the benchmarks which are
available in the community and later on some examples of the proposed multimodal systems.
So, the intent is to actually see how the information coming in from different modalities,
different sensors that is fused. Now, just a recall so, in lecture 1 for multimodal emotion
recognition, we discussed about why multimodal information is useful. To this end, we talked
about how unique information can be brought into a system for analysis, when we use
multiple sensors.
For example, we can combine a microphone with a camera. So, you have the voice and the
face or you could have the text information along with physiological signals through EEG or
ECG, or text, let us say, with the face data. So, there is complementary information which can
be extracted from these modalities and we would like to learn from these different modalities
so as to have a more accurate prediction of emotion recognition. Alright, let us dive in.
So, friends, the first database which I would like to discuss with you is the SEMAINE
dataset. Now, the SEMAINE dataset is a Sensitive Artificial Listener based dataset. What we
have here is a multimodal dialogue system with social interaction skills, which are needed for
a sustained conversation with a human user.
Yes, you would have guessed it right, as you can see here on the slide. Here you have these
virtual avatars, virtual agents with which a user would be interacting and the intent is to drive
a conversation so that emotion is elicited into the user. The aim of the SEMAINE dataset is to
engage the user in a dialogue when conversing with these virtual agents and create an
emotional workout by paying attention to the user's non-verbal expressions and reacting
accordingly.
Now, recall non-verbal expressions could include the facial expressions, your eye gaze and
the body pose movement. And this is again as we have been talking about the affect sensing
part. Once the system has sensed the affective state of the user then through these virtual
avatars we are going to react, right. So, that is how the feedback to the user is going to go.
Now, an example of this could be nodding and smiling. You know these are non-verbal
gestures.
Now, further the aim is to have real-time multimodal data, ok. So, let us say the study is
going on wherein you have this virtual avatar which is going to interact with the user. What
we would have in front of the user is a set of cameras so that we can record the user. Now,
again in this, there are multiple aspects.
Let us say here is a person who is sitting on a chair ok, and here you have a computer screen.
Now, you can have a camera right here let us say at the top of the monitor. This will give you
the frontal view and you can also have other cameras to capture the user from different views.
You can have a microphone, let us say the lapel microphone which will get the clear sound
from the user. And you can also have another microphone, let us say the hanging microphone
to record the ambient noise in the room.
So, this way you can have multiple sensors and get information about the conversation and
the affect, which is elicited into the user. Further we want to do data segmentation and
analysis. Now, imagine here you have the virtual character which is interacting with the user.
So, we are continuously recording data, ok. So, this data is coming continuously. So, we need
to segment, divide the data into chunks so that those could be further used for assigning the
labels.
And from the perspective of learning of a multimodal affect recognition system from the
values which are going to be analyzed within a specific time duration. So, you can say well I
am going to have a chunk which is segmented from a long video and this chunk would be of
a certain specific duration and I am going to use this as one data point while I am going to
learn a multimodal system.
Now, this segmentation is going to be at frame level and then further at you know the
millisecond, second and minute level. There are certain trade-offs here. One could say well I
have been recording this data let us say this is the timeline ok. So, this is the timeline, this is
T.
So, these are the different video frames which are being captured. I could take a window for
segmentation. Let us say this much, ok. So, let us say this is S 1 which is segment 1. Another
option is well, I could have taken from this frame to this frame only and I could have called
this as segment S 2.
Now, obviously, in S 1 you have more data and in S 2 you have lesser data. But from a
computation perspective this S 1 which contains more spatiotemporal data would require
larger computation resources as compared to S 2 which has lesser data, right. So, the trade-off
here is essentially what is the frequency at which we want to do the prediction of affect?
That is going to decide the duration of the chunk which you are going to segment from the continuous data which you are recording.
If you have a smaller window yes, it could have lesser temporal data, but we could do a more
close to real-time analysis of the affect. However, if it is a longer window, you would require
more computing resources and the output of the system might not be as close to real-time, but
this is essentially dependent on the use case where you are deploying the system. So, the use
case will decide the frequency at which we need to measure the affect of the user.
Now, another point to note in this trade-off of sample duration is that reliable prediction accuracy may require longer-term monitoring. Now, what does that mean? If you consider S 1, it contains more frames, more data, so we can get extra, more detailed information about the changes which are happening in, let us say, the facial expression of the user.
So, these changes which are happening over time will give me important information and a more accurate prediction, because I have more temporal data. However, if it is a shorter window, it is possible that, with the very few frames captured in this small window, I may not have enough information to actually predict the correct affect. So, what that
could mean is let us say a person starts to smile.
Now, they are starting to smile, so there is an onset of the expression; typically it has to go to the apex of the expression, the highest intensity, and then it might go down to the offset. Let us say the person is going to smile, but what happened is we were considering a smaller window.
So, we only had the data from the onset to, let us say, midway to the apex. Now, this might not be enough for the system to robustly predict that the person is smiling; it could very well be confused with the person just speaking. So, lips close, lips open.
Now, if we have a longer window, let us say I am now going to focus on apex to offset, all of the frames here. This could give me more reliable information, with the extra frames which I now have to analyze. So, I can see the transition between the facial expressions, which gives me more robust information.
Further, it is extremely important to have the appropriate unit of analysis. And what we
understand is that the appropriate unit of analysis you know how long you want to have, how
many frames, what is the duration that is all context dependent, that is the use case
dependent, right.
It is about where are you going to use the system and who is going to use the system and an
extremely important point here, friends: what is the cost of an incorrect prediction? So, an
example is let us say we are going to use a multimodal affect recognition system in a health
and well-being based use case. So, the cost of doing an incorrect prediction is far higher as
compared to let us say another scenario where we are going to use a multimodal affect
recognition system to suggest movies to a user.
(Refer Slide Time: 12:32)
Now, from the SEMAINE perspective, this is the dataset. The video was recorded at 49.97 frames per second at a spatial resolution of 780 cross 580, and the researchers kept the colour depth of the video at 8 bits per sample, while the audio was recorded at 48 kilohertz with 24 bits per sample. Now, what that means, friends, is that you have the video and you have the audio, which have different frequencies and sizes of data, right.
Therefore, synchronisation is required as we have discussed in the earlier lecture. Now, in the
case of SEMAINE the researchers, they synchronised the audio and video with an accuracy
of 25 milliseconds. Now, let us look at other attributes of the SEMAINE dataset. We are
capturing spontaneous data.
(Refer Slide Time: 13:40)
Now, this is based on the audio-visual interaction between the human and the operator which
is undertaking the role of an avatar with four personalities. So, the virtual avatar can have
four different type of personalities. Why do we want this? Well, if the avatar is going to have
different personality, then this can affect the course of conversation between the user and the
avatar, right.
In simpler words, if let us say two people are talking, the direction of the conversation would
be affected by the personality of the individuals and that is what the researchers also wanted
to capture. So, that you know we have spontaneous data and along with a spontaneous data
we can also reflect on the nonverbal behaviour which would be induced into the user due to
the personality of the virtual avatar.
So, the four personalities are as follows. The first is Prudence. Now, this virtual avatar is even
tempered and sensible. The second is a Poppy virtual avatar who is happy and outgoing. The
third is Spike. Essentially, this is angry and confrontational and the fourth is Obadiah who is
sad and depressive. Now, you can very well imagine in these four cases, the response of the
avatar would be reflective of the affective state of the avatar and the personality, right.
So, if you have a happy outgoing virtual avatar. So, it is possible that the affect which is
induced into the user who is interacting with the Poppy, happy avatar could also be a slightly
more on the positive affect. As compared to let us say when the same user is interacting with
the fourth one, Obadiah where the virtual avatar's personality is a bit sad, right.
So, the response from the user against the conversation which is going to happen with this
fourth avatar could be a bit neutral or could be a bit negative, right. So, the personality is
going to affect the conversation. Further guys in SEMAINE, all interactions they were
annotated by 2 to 8 raters in four dimensions in continuous time and continuous value. Now,
these four dimensions are arousal, expectation, power and valence.
Typically, when we were discussing emotion analysis from faces or from voice earlier, we were talking about either the categorical emotions or only the two dimensions of valence and arousal. In this case, the authors also got expectation and power labelled as well. Now, the rationale is as follows: if you have four dimensions, they account for most of the distinctions between everyday emotion categories. So, you can have fine-grained emotion labels when you use four dimensions to represent the emotion. Here, V_a(i), V_e(i), V_p(i) and V_v(i) indicate the ratings given to a particular sample by rater i for the arousal, expectation, power and valence dimensions respectively.
Now, guys let us move to another benchmark which is very commonly used in the affective
computing community. Now, this benchmark is again based on a series of challenges which
are hosted by researchers. So, this one is called the emotion recognition in the wild. And
recall in the wild here simply means you have challenging conditions which are
representative of real world scenarios.
Now, let us discuss EmotiW. So, EmotiW has several tasks for affect analysis. We are going to talk about the audio-visual emotion recognition task. Now, this is based on the Acted Facial Expressions in the Wild database, which we collected from high-quality Hollywood movies based on parsing of captions.
So, essentially captions which are available for viewers with hearing disability. We parse those captions, we get the sample videos, which could contain a subject or a group of subjects showing some emotion, and then the labellers assigned the final label for each video.
Now, the EmotiW challenge based on the audio-visual emotion recognition task has seen a
healthy representation and participation from both academic and industrial labs. Now, the
task is you get one data point, data point is audio visual and you predict the category of
emotion, ok.
What is also available along with this audio-visual data is metadata, which tells us things such as the age of the subjects in the video or their gender. Now, the metric of evaluation is classification accuracy. So, here you see the classification accuracy as a comparison of the different methods which were proposed in 2018. So, this is from 2018.
Now, the point to note here, friends, is that the highest performance is not really high. This was a categorical emotion task and still we are at around 61 percent classification accuracy. Now, there are several reasons. These are multimodal emotion recognition systems: all these teams proposed different multimodal emotion recognition systems analyzing audio and video, and some teams also added text analysis by doing speech-to-text on the voice part of the video.
But the reason essentially for this performance is that data is limited, around 4000 videos.
Now, given the very varied environments you know which are reflected in the videos because
the videos in the dataset are collected from Hollywood movies, the limited data and the intra-class variability introduced by the different environments make the task non-trivial.
Now, similar to the emotion recognition in the wild challenge is a very successful and
extremely well used benchmarking effort called the audio visual emotion challenge. Now, in
the audio visual emotion challenge there have been several tasks, similar to EmotiW. Now,
here is one such example where friends you can see the four different dimensions of
continuous emotion and the low and high intensities for the same.
You can actually make out a visible difference between the facial expressions when, let us say, the arousal is low versus when the arousal is high, right. Now, the AVEC challenge, the audio-visual emotion challenge, not only had the multimodal audio-visual task, but unimodal tasks as well, which were based on the acoustic, linguistic and visual cues coming from the data.
(Refer Slide Time: 23:42)
Now, in order to obtain the binary labels for the AVEC data, the average value of each dimension over all the data was computed, ok. This is again coming from the SEMAINE data, which results in continuous-time real-valued variables V_a, V_e, V_p and V_v, and again, guys, these represent the emotion dimensions.
Now, further, to get the binary labels, the mean of the average ratings over all interactions in the dataset was computed, which resulted in these scalar values. Now, if you recall from the
earlier slide, you had low and high binary labels, right. These binary labels were essentially
based on thresholding the values which were based on the mean of all the labellers who
labelled a particular video sample.
Further, for AVEC the video streams were segmented in two ways. The first is frame level: you analyze per frame for the video-only tasks. The second is word-level segmentation: per word for the audio and audio-visual tasks.
(Refer Slide Time: 25:09)
Now, here is an example of the feature level fusion based task for AVEC. So, friends, what do
we see here? We have an audio-visual signal coming in, the video part and the audio part. Now,
the audio part that is input into a library to compute the prosodic features and in parallel the
video part has object detection to detect the location of the face and then from that the
features are extracted.
Further the features which are coming let us say F 1 and F 2 for eyes and different parts they
are all fused together. So, here you have feature fusion and then there is a classifier to predict
the binary label.
(Refer Slide Time: 26:01)
Now, what we are going to do from this point is we are going to look at different works
which are proposed in the literature and we are going to see how researchers proposed these
multi-modal affect recognition systems which are combining information at different levels.
Now, the first one for feature level fusion friends I would like to discuss with you is the work
called video based emotion recognition using deeply supervised neural networks.
So, this is actually a work from Fan and others in 2018. Now, notice there is only one
modality in this work that is video. However, there is fusion happening at multiple levels how
is that? So, here you have an input face. Now, the architecture is going to have the face
represented at different resolutions or different scales.
So, essentially, we are going to mimic a Gaussian Pyramid where you say well, I have the
smallest resolution of the image 1, a larger resolution version of that image then further the
larger image and then the largest. The aim is this was the original image which I had got,
right. So, I was down sampling as I was going up. Now, at the bottom of this pyramid I would
be able to analyze micro level information from the perspective of face.
Think of it as the twitch, the subtle twitch in the corner of the eye. As we go up this pyramid, we have the macro level information, which will give you, let us say, just the overall face and, if it was smiling, the stretch of the region in the lower half of the face. Now, the authors introduce the concept of scale within the network itself, wherein when the input face comes in you have a series of convolution layers and then you have down-sampled versions of the feature maps.
Now, what we are doing here is, we are actually learning to predict the emotion from the feature maps of each scale individually. So, you can see the loss here, and in parallel we are also combining the feature maps. All of these go together and are fused; this is the concatenation which is happening, ok, and this is the final fusion loss. So, essentially, first you fine-tune so that you have discriminatively represented features at each level, and then you do the fusion.
Now, friends, let us move to another work. This is a non-deep-learning based work by Sikka and others, called multiple kernel learning for emotion recognition. Now, let us look at how information is fused in this work. Here you have the input audio-visual sample:
we do object detection for getting the face, we track the facial points. So, here you can see the
facial points.
So, as you have already seen in the tutorial for automatic facial expression recognition, there are several open source libraries for face detection and facial part detection. So, you can use those; the authors used one such library and then they have the aligned faces as input. What they are doing is computing the scale invariant feature transform (SIFT), a hand-engineered feature, in a bag-of-words based framework. We have discussed this in automatic facial expression recognition.
Then they want to analyze the whole scene. So, this is another hand-engineered feature called GIST. Along with that they have the audio features, ok. So, they are extracting audio features, and then they also have spatiotemporal features which are inspired by your local binary patterns. Now, what do we have here? We have these sets of features, audio-visual features, which are extracting different types of information.
So, to combine them, the authors apply a different kernel to each of the individual features and then jointly learn a multiple kernel learning based support vector machine to predict the emotion classes. In this case the fusion essentially happens at the kernel level. Now, these were some examples of early fusion.
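As a simplified stand-in for kernel-level fusion (true multiple kernel learning also learns the kernel weights, which is not shown here), one can build one kernel per feature type, combine them with fixed weights and pass the result to an SVM with a precomputed kernel:

import numpy as np
from sklearn.metrics.pairwise import rbf_kernel, linear_kernel
from sklearn.svm import SVC

N = 120
y = np.random.randint(0, 2, N)                           # placeholder labels
f_sift, f_gist, f_audio = np.random.rand(N, 500), np.random.rand(N, 512), np.random.rand(N, 384)

# One kernel per feature type, combined with fixed (here equal) weights.
K = (rbf_kernel(f_sift) + rbf_kernel(f_gist) + linear_kernel(f_audio)) / 3.0

clf = SVC(kernel="precomputed").fit(K, y)
print(clf.predict(K[:5]))                                 # K[:5] = kernel between 5 samples and the training set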
Let us look at some examples of late fusion. You have again the audio-visual signal coming
in, you extract some audio features, you have a classifier here, which could be a support vector machine or your hidden Markov model, which gives you the emotion classes in the form of
probabilities. In parallel you have the visual channel.
Similarly, you extract the features, then you have these classifiers, and again these give you the emotion classes. Now, you can do two things: you can have decision rules, for example applying AND or OR kind of gates to the different classifier outputs, or you can have a classifier; that is, you combine the outputs of all these classifiers and learn a classifier on top of them to get
the emotion class.
(Refer Slide Time: 31:57)
Now, the first work in this which I would like to bring to your attention is from Liu and
others from 2018. Now, let us see how the fusion is done, ok. So, you have an audio-visual sample again. Let us look at the video first: you do object detection, and what the authors do is they first look at the facial landmark points, which tell you the locations, similar to what we saw in the earlier works.
And then they compute the mean, max and standard deviation over all the frames, making a 102-dimensional feature for each video. Then they use a support vector machine and here is
their emotion classification. So, these are the classification results for emotion when the
video is analyzed based on the facial points.
The second is we take the face we then have different pre-trained networks. So, these are
pre-trained convolutional neural networks, we extract the features and then we are learning a
support vector machine individually for each of them.
Again, what do you see, friends? A series of emotion predictions from these classifiers. Now, the third is you take the faces and input them into a VGG pre-trained network, so again a face based convolutional neural network, and then you have a recurrent neural network for each of them separately, which gives you the predictions.
Now, let us look at the audio: you have the audio channel, the pre-trained SoundNet is used, you have a series of fully connected layers and you get the predictions. So, look at the levels at which we have got the emotion classes. Now, what are we doing? We are collecting them together into a grid and then we are combining them with weights. So, this is the decision fusion part.
So, we optimized the prediction for each feature, which we are extracting here and once we
got the predictions, we are then fusing them with different weights. So, with empirical
evaluation, with experimentation researchers realized for example, this performs better as
compared to this. So, let me give more weightage to this pipeline as compared to this pipeline
and then you can do the combination with weights.
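A tiny sketch of such weighted decision fusion, with invented pipeline names, probabilities and weights standing in for the values a real validation run would give:

import numpy as np

n_classes = 7
probs = {                                                   # per-pipeline softmax outputs (placeholders)
    "landmarks_svm": np.random.dirichlet(np.ones(n_classes)),
    "cnn_svm":       np.random.dirichlet(np.ones(n_classes)),
    "vgg_rnn":       np.random.dirichlet(np.ones(n_classes)),
    "soundnet_fc":   np.random.dirichlet(np.ones(n_classes)),
}
# Weights chosen by hand here; in practice they would reflect each pipeline's validation performance.
weights = {"landmarks_svm": 0.15, "cnn_svm": 0.30, "vgg_rnn": 0.35, "soundnet_fc": 0.20}

fused = sum(weights[name] * p for name, p in probs.items())
print("predicted emotion class:", int(np.argmax(fused)))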
Now, here is another work for fusion. You notice here this is group level emotion task. So,
now we are going to talk about when in a given sample you have multiple people. Now, this
work is from Guo and others. And the authors, they propose the use of face level information,
scene level information, the body pose in the form of skeleton and the visual attention. So, let
us say here you have a sample frame, this is the input frame.
What you see here is the facial structure. So, this is the facial structure and the lines here are
telling us about the body pose, right. So, along with this, the image is also analyzed by using an object detector. So, you can see here, for example, there is a tall building in the back, and this person is wearing black pants or blue jeans and so forth. So, these are the visual attention
based attributes about the scene and the subjects.
So, this gives you information about the context, right: where the people are and who these people could be, not from the identity perspective, but from the perspective of attributes, such as young people or aged people, who could be school-going students or could be colleagues in an office, right.
So, face detection is used and you get the faces, the scene is the whole image, and then you have the body pose, the face structure and the visual attention. So, these are the inputs: you have the VGG
pre-trained face model which is used to analyze the faces, then you have the scene based
model which is looking at the whole image to get the different attributes.
Further the skeleton body pose, arm up, arm down all this is analyzed and then the visual
attention again this gives you the attributes. Later the predictions are fused here to get the
final result.
(Refer Slide Time: 36:51)
Now, let us look at another work for fusion. Friends, this is multiple spatiotemporal feature
learning for video based emotion recognition in the wild. In this case, here you have the
faces, we do some pre-processing and you see three different types of networks. Now, notice this is again a recurrent neural network, where the input to each cell is a feature which is extracted from a pre-trained VGG network; then you have a ResNet based network, again with a recurrent neural network, this is the bidirectional LSTM; and here the authors have a 3D convolutional neural network, ok.
Now, you get the scores from each of them and we are going to fuse. In parallel, we have the audio information: we get the spectrograms, we input the spectrogram into the pre-trained face VGG network, we have a bidirectional LSTM, we get the scores and we do the final fusion, ok. And here W1, W2, W3, W4 signify the weights for these individual feature based scores which we get from the classifiers of these features.
(Refer Slide Time: 38:05)
Now, here is another work from 2016 friends called multi-clue fusion for emotion recognition
in the wild. Now, let us see what is proposed in this case. Here you have the input video and
the input audio, you again detect the landmarks, you get the facial points, that is, where the different facial parts are, and you create trajectories of these landmarks, ok. So, you can capture how the facial points are changing over time, right. You input that into a convolutional neural network.
Separately, you also again extract VGG face. Now, you would have noticed VGG face and
VGG based pre-trained networks are very commonly used in the community for face analysis, simply because these networks, especially VGG face, have been trained on millions of face images. So, they have a very rich representation with respect to the different attributes of faces.
Now, coming back we have the VGG features, we are extracting these features separately for
each frame, then we have a recurrent neural network, ok. Now, this gives you the scores here.
Similarly, we already had the scores for the trajectories. Now, coming to the audio, you have openSMILE features. openSMILE, again, is a very commonly used library to extract audio features; you input those into an audio convolutional neural network and you get the scores. These scores, along with another score from the trajectories obtained using a support vector machine, are then fused to get the final emotion class.
(Refer Slide Time: 39:55)
Now, here is another work, friends, by Nemati and others. This is called a hybrid latent space data fusion method for multimodal emotion recognition. Let us go through this: you have the
input sample, you extract the visual features, you extract the audio features. What you also do
is you have some textual data. Now, if you notice these are some comments about the video.
Now, you extract visual features, audio features and you also extract textual features. Now,
notice the different level of fusion which is happening. What the authors propose is first we
do a fusion of the audio-visual data and then we have a common classifier for predicting the
emotion from the audio-visual fusion only. So, this is like a feature level fusion which
happens first, ok.
So, notice this is feature level fusion for audio and video. Separately, for the text data, we do text classification and get the emotion. Now, we have emotion information from this part and emotion information from that part, and what we are doing is decision level fusion, you know, a late fusion, to predict the final emotion label. Now, an interesting thing to note here is that the audio-visual data is more similar in terms of frequency as compared to audio and text or video and text.
Even the audio and visual data are quite different, but relatively less so, right. So, when the authors experimented with the different combinations, they actually came up with a hybrid solution, right: for some of the features they have feature level fusion and they get the prediction from that, and then they fuse at the decision level after they get the scores from all the different features.
Now, let us look at examples from the audio-visual emotion challenge friends. So, in this
case, if you look at Wollmer's work in 2013, very similar to the pipelines which we have discussed till now, you have speech signals, you have video signals, you extract your standard MFCC features and you also do automatic speech recognition.
Now, this will give you some linguistic features, the content, what is being spoken and from
the audio features, you will get the rich features for example, again you know MFCC, your
fundamental frequency, intensity and so forth. For the video, you extract video features and
then you are combining them together into a bidirectional LSTM to predict the final emotion.
(Refer Slide Time: 42:48)
Now, the authors did a very comprehensive analysis within this framework to see how useful
feature selection is going to be when we have these different modalities coming in. Now, here
friends you see the different classifiers, here we have the four dimensions for continuous
emotion and here is the mean result, ok.
Now, for example, notice here the weighted accuracy for LSTM based classifier when the
feature is only audio is highest for arousal. However, let me bring the discussion to the
bottom part of the table here. When you have, again, your LSTMs, here without doing any feature selection, you see that when the combinations are used, audio plus video or audio plus linguistic plus video, the performance is the highest.
So, what do we understand? Feature fusion is helping. And even though we are not selecting
discriminative features, even then we see an increase in performance. However, recall that
one of the challenges in creating a multimodal emotion recognition system is that when we
have data coming in from different modalities, we would like to extract the unique
information from these modalities, which can be complementary to each other so that we can
do a more robust prediction.
If we have overlapping information, then we can end up with a very high-dimensional feature, let us say in the case of feature fusion, which could have a lot of overlap, and we may unnecessarily run into the curse of dimensionality.
(Refer Slide Time: 44:39)
Now, in this interesting work from AVEC, when no feature selection was used, we observed
weighted accuracy was highest for acoustic features only. This also shows audio is the most
important modality for arousal. Further, the classification of the expectation dimension seems to benefit from including the visual information, that is, when we do the fusion. The power emotion dimension is best classified by speech based features.
Further, the authors noticed that, with unidirectional modeling, the weighted accuracy significantly increases when the linguistic features are also used along with the audio. So, the
fusion here is not only about the audio features, but also what is being said, the contextual
information which you are getting from linguistics. For the valence emotion dimension, when
we include video, that increases the weighted accuracy.
(Refer Slide Time: 45:36)
Now, when feature selection is used, notice how the performances have gone up. When you
have a bi-directional LSTM, this is audio plus linguistics, notice how now we are into the 70s
with respect to the weighted accuracies and the unweighted accuracies, right.
The same is observed when we have the fusion of audio, linguistics and video: the performance goes up considerably. So, that means feature selection is an important
step when we are going to do fusion of data coming in from different modalities.
Now, some other points with respect to the feature selection, it was noted that for most
settings, it did not improve the average weighted accuracy, ok. However, for recognition on
video only, it leads to a remarkable performance gain, increasing the mean weighted accuracy
by 5.4 percent.
Now, some other interesting discussion takeaway points from the audio-visual emotion
challenge benchmarking friends. Audio features lead to the best results for arousal
classification. We have seen that across different works. For classification of expectation,
facial movement features were very useful. For the power emotion dimension, audio-visual classification with latent dynamic conditional random fields, which was proposed in 2011 by Ramirez and others, was the state of the art in 2011.
For the valence, the audio features lead to the highest accuracy. Now, if you notice, these are
the findings from the different works, because AVEC is a benchmarking effort, which has
different tasks. So, all these are different papers, different methods, which were proposed
through the analysis on the rich data, which is part of AVEC.
(Refer Slide Time: 47:44)
Now, I would like to change the discussion a bit and talk about methods where we are doing fusion with physiological sensors, right. Till now, we have been talking about audio and visual data primarily. Now, here is a work from Subramanian and others; essentially, the researchers were interested in emotion and personality, ok.
So, they collected a very rich dataset, where they had a series of physiological sensors attached to the users, who were watching videos. So, here you have 58 users who looked at 36 videos each while physiological data was recorded. They gave affective ratings to the videos and their personality traits were also evaluated.
Now, what the researchers did, they computed the correlation between the personality traits
and the emotion, along with two other attributes, which are liking and familiarity. Now, they
noticed that you could actually predict the personality and the affect by looking at
physiological sensors, by doing the fusion of the data coming in from these sensors.
(Refer Slide Time: 49:18)
Now, this work is called ASCERTAIN, and this is a very commonly used affect recognition and personality recognition dataset when it comes to physiological data. Now, friends, let us
look at another work, where multiple sensors are used for affect. This is from Park and
others. Here you see two subjects who are doing a conversation.
Now, the subjects have a head mounted camera, they have a MindWave headset, which is the EEG, and then they have a heart rate sensor and the Empatica wristband. So, we have data coming in from different sensors. Now, this work is called K-EmoCon. Essentially, this is a continuous emotion based dataset, where the conversation between the two people happens in a spontaneous, naturalistic fashion.
(Refer Slide Time: 49:59)
Now, let us look at one of the works for fusion with sensors, another work: here, let us say, you have the facial video and here you have the EEG signal. On the video there is object detection, and we then have a pre-trained network. In parallel, what we are doing is we extract the features which are coming from the EEG data, and then we have an EEG network. What the authors propose is an attention network, into which the input features from video and EEG are fed. And further, the outputs, as you can see here, are predicted for EEG and video separately, and then fusion happens.
Now, friends, this brings us to the end of the second lecture for multi-modal emotion
recognition. What we have seen in this lecture is different methods, which have been
proposed for affect prediction, when the data streams are coming from different sensors for
both early and late fusion strategies.
Thank you.
Affective Computing
Prof. Jainendra Shukla
Prof. Abhinav Dhall
Department of Computer Science and Engineering
Indraprastha Institute of Information Technology, Delhi
Week - 08
Lecture - 27
Tutorial: Multimodal Emotion Recognition
By integrating information from different modalities such as audio and video, one can gain a more complete and accurate understanding of a person's emotional state, and there are multiple techniques for integrating information from multiple modalities, including feature level fusion, decision level fusion and hybrid approaches that combine both.
Each approach has its own strengths and weaknesses, and the choice of technique will depend upon the specific application and data availability.
(Refer Slide Time: 01:09)
So, in this tutorial we will be using the RAVDESS dataset. The RAVDESS dataset is a popular dataset used for emotion recognition research. It contains recordings of actors speaking and performing various emotional states, including anger, disgust, fear, happiness, sadness and surprise.
The dataset includes both audio and video recordings, which makes it well suited for multimodal emotion recognition research. The audio recordings consist of 24 actors speaking scripted sentences in each of the six emotional states. The video recordings consist of the same actors performing facial expressions and body movements to convey the same six emotional states.
(Refer Slide Time: 01:58)
Now, coming to the filename convention in this dataset, each filename consists of seven numerical identifiers, in which the third identifier indicates the emotion class.
So, to get started with multimodal emotion recognition using the RAVDESS dataset, we will first need to import the audio and video data and extract relevant features from both modalities. After that we can use a variety of machine learning models to classify the emotional states based on the extracted features. It is worth noting that the performance of your multimodal emotion recognition system will depend on the quality of the features extracted and the fusion techniques used,
as well as on the choice of machine learning model, so it is important to carefully evaluate the performance of the system and fine-tune the network. We will start with downloading and importing the dataset into Google Drive. After that, we will use some predefined Python libraries to read the audio and video files in the Python environment. Later we will do some feature extraction, in particular MFCC for the audio data, and we will extract video features using the VGG16 face model.
After that, we will compare the performance of the individual features versus the fused features, and we will be using Gaussian Naive Bayes, linear discriminant analysis, a support vector machine and our CNN network for this analysis. Later on we will do our early and late fusion, and we will be using a CNN for this. So, now let us jump directly to the coding part.
We will start with installing a couple of libraries that might not be pre-installed in the Google Colab environment. Once we have installed all these necessary libraries, we can begin importing them in the code, start pre-processing the data and extract the required features for our multimodal emotion recognition.
And I am also assuming that you have already downloaded the dataset and stored it into your
respective Google drives. So, our downloading library code will look something like this.
And it might take a couple of seconds to download all these files.
So, after downloading the relevant packages, I will import all my relevant libraries for this tutorial, and the code will look something like this. After downloading and importing the libraries, I will start with defining some path variables. In this tutorial I will basically define three path variables. The first will be our video path, which contains the exact dataset path where the video files are stored.
Then I will also define two other paths, where I will store the video-only information and the audio-only information extracted from the original dataset. One more thing: I will be using just Actor 1's data for this analysis, and the only purpose of using Actor 1's data is to reduce the computational demand.
So, as of now, my path variables are all set. I will write the first piece of code, which will take our video files from the original data, extract the video and audio information from each video file and save them to two separate folders. The code will look something like this.
So, let me give you an overall overview of this piece of code. What we are essentially doing here is iterating over all the files in the dataset folder, and we will only be using emotion number 2 and emotion number 5, which correspond to the calm and angry emotions.
The only reason I am using just two emotion classes is to reduce the computational load, since Google Colab in the general case provides only about 12 GB of RAM. So, purely for computational reasons, I will be using just two of the emotion classes. Then, after reading the videos from their location, I will resize them to a smaller size, and again the reason is to reduce the computational requirement.
And after that, I will simply consider the first four seconds of each video, to reduce the computational requirement and also to keep consistency among all the videos; there could be instances where a particular video is 5 seconds long while another is 4 seconds long.
So, I am using the smallest threshold, 4 seconds, for every video, so that all my data is consistent along the time dimension. Now, I will extract the audio information using the command video_clip.audio, and later I will store the separate audio and video content in these two folders. So, now I run this code, and it might take a good amount of time as we are reading and writing the data.
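For reference, here is a sketch of what this extraction step could look like with the moviepy 1.x API; the paths, the resize target and other details are assumptions, not the tutorial's exact code.

import os
from moviepy.editor import VideoFileClip   # moviepy 1.x API

video_path      = "/content/drive/MyDrive/RAVDESS/Actor_01"    # hypothetical paths
video_only_path = "/content/drive/MyDrive/RAVDESS/video_only"
audio_only_path = "/content/drive/MyDrive/RAVDESS/audio_only"
os.makedirs(video_only_path, exist_ok=True)
os.makedirs(audio_only_path, exist_ok=True)

for fname in os.listdir(video_path):
    emotion = fname.split("-")[2]               # third identifier in the RAVDESS filename
    if emotion not in ("02", "05"):             # keep only calm (02) and angry (05)
        continue
    clip = VideoFileClip(os.path.join(video_path, fname))
    clip = clip.resize(height=128).subclip(0, 4)               # smaller frames, first 4 seconds
    clip.write_videofile(os.path.join(video_only_path, fname), audio=False)
    clip.audio.write_audiofile(os.path.join(audio_only_path, fname.replace(".mp4", ".wav")))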
Audio and video files are created from the original dataset. Now, I will write another piece of
code to bring those audio only and video only files in my Python environment.
(Refer Slide Time: 07:50)
So, I will declare two lists, video_label_all and video_only_all; these two lists will store the labels and the actual video-only data. I will iterate through all the files in my video-only path and append to these two lists. Later I will convert these two lists into NumPy arrays for ease of use.
And I will write another small piece of code which will convert our labels into 0/1 binary labels. After executing this code, I will use a similar sort of code to bring the audio-only data into my Python environment, and my code will look something like this.
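A possible sketch of this loading and label-binarization step, using OpenCV to read the frames; the paths and details are again assumptions rather than the tutorial's exact code.

import os
import cv2
import numpy as np

video_only_path = "/content/drive/MyDrive/RAVDESS/video_only"   # hypothetical path, as above

video_only_all, video_label_all = [], []
for fname in sorted(os.listdir(video_only_path)):
    cap, frames = cv2.VideoCapture(os.path.join(video_only_path, fname)), []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        frames.append(cv2.resize(frame, (227, 128)))    # (width, height) -> each frame is 128 x 227 x 3
    cap.release()
    video_only_all.append(np.array(frames[:120]))        # first 4 seconds at 30 fps, assuming >= 120 frames per clip
    video_label_all.append(fname.split("-")[2])           # emotion code from the filename

video_only_all = np.array(video_only_all)                 # expected shape: (num_files, 120, 128, 227, 3)
video_label_all = np.array([0 if code == "02" else 1 for code in video_label_all])   # calm -> 0, angry -> 1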
So, now, as you can see, I have brought my video-only and audio-only data into my Python environment, and the shape of the video-only data looks like 32 cross 120 cross 128 cross 227 cross 3.
What these numbers mean is that there are 32 video files and each video file contains 120 frames, since the video data was sampled at 30 hertz and we are taking just four seconds of data. So, there will be 120 frames, each frame will be of dimension 128 cross 227, and there will be three channels: red, green and blue.
Similarly, for our audio data we will consider the first four seconds of audio, and all the audio will be sampled at 8000 hertz, which means that for one second of data there will be 8000 samples. Using this strategy we will collect all those 32 files, and each file will consist of 32000 samples, which is 4 seconds multiplied by the 8000 hertz sample rate. Now that I have the audio-only and video-only files in my Python environment, I will start with the feature extraction part.
In our feature extraction part, the first feature will be the MFCC feature, which stands for Mel frequency cepstral coefficients, a commonly used feature extraction technique for audio analysis, including in the context of multimodal emotion recognition. MFCCs represent a compact yet efficient way of capturing the spectral characteristics of an audio signal, and they have been shown to be effective at capturing emotion information in speech signals.
There are some basic steps involved in computing the MFCC from an audio signal. The very first step consists of applying a high pass (pre-emphasis) filter to the signal to amplify the higher frequency components and reduce the impact of the lower frequency components. The second step of the MFCC process is frame segmentation, where the signal is divided into short frames, usually 20 to 30 milliseconds, with some overlap.
The third step is windowing: a window function such as the Hamming window is applied to each frame to reduce spectral leakage and smooth the signal. Then we perform a Fourier transform, where the windowed signal is transformed into the frequency domain using the discrete Fourier transform. After that there is Mel frequency warping: the DFT spectrum is converted onto a Mel frequency scale, which approximates the way humans perceive sound.
After that, there is the usual cepstral analysis over these warped Mel frequencies. The Mel-
frequency spectrum is transformed into the cepstral domain using the inverse discrete cosine
transformation, and then we select a subset of the resulting cepstral coefficients as the MFCC
coefficients.
These MFCC coefficients can be used as features for emotion recognition classification, either
on their own or in combination with other features. For our purpose we will be using a standard
library called librosa, which provides a function to easily compute MFCCs from audio signals,
and to do so our code will look something like this.
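Here is a minimal sketch of this step, assuming the audio clips are already loaded as NumPy arrays sampled at 8000 Hz; the array names are placeholders.

    import numpy as np
    import librosa

    mfcc_list = []
    for signal in audio_only_all:                          # each signal: 4 s at 8000 Hz
        mfcc = librosa.feature.mfcc(y=signal, sr=8000, n_mfcc=40)
        mfcc_list.append(mfcc)

    mfcc_features = np.array(mfcc_list)
    mfcc_features = mfcc_features.reshape(len(mfcc_features), -1)   # (n_samples, n_coefficients)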
In this code we have declared a list where we will be storing our MFCC coefficients. We will
iterate through all the audio-only files and use the predefined function librosa dot feature dot
mfcc, passing each file to it. There is a parameter in this function called number of MFCC
coefficients, and for that we have chosen a value equal to 40.
And after computation of these MFCC coefficients we will convert the list into an array, and
we will reshape this MFCC list as number of samples followed by number of coefficients. So,
509
after extracting the MFCC features from the audio files we will move towards extracting video
features from the video-only files. To do so, we will take advantage of the VGG16 face
architecture.
So, VGG16 is basically a deep convolutional neural network architecture that was originally
designed for image classification tasks. This architecture for image classification consists of 16
layers, including 13 convolutional layers and 3 fully connected layers. In the standard
architecture the input to the network is an RGB image of size 224 cross 224 pixels, and this
input layer is followed by the 13 convolutional layers, each with a 3 cross 3 filter size and a
stride of 1 pixel.
In this architecture the number of filters increases as we go deeper into the network, ranging
from 64 in the first layer to 512 in the last layer. Then, after every two or three convolutional
layers, a max pooling operation is applied to reduce the spatial resolution of the feature maps
coming out of the convolutional layers. After that, fully connected layers of 4096 neurons each
are applied, and in the standard architecture there is then a three-layered multi-layer perceptron
for the final prediction, which is performed using a softmax layer.
So, in our architecture we will be using this standard VGG16 face model; it is a pre-trained
model, and to do so our code will look something like this. In this code, instead of using a 224
cross 224 image we will be using a 128 cross 227 dimensional image, and I have also
downloaded the pre-trained weight file to my local drive; I have already given you the exact
code for downloading this file.
So, after running this code our model is initiated. One essential step that we have to do with
this pre-trained model is to freeze all the convolutional layers so that they extract relevant
features according to their pre-trained settings. To do so, my code will look something like
this, where I will iterate through all the layers in my VGG model and simply set the argument
layer dot trainable equal to false.
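A minimal sketch of initialising the backbone and freezing it is shown below, assuming a tf.keras VGG16 body with a locally downloaded weight file; the weight file name is a placeholder for whatever file the lecture's download code fetches.

    from tensorflow.keras.applications import VGG16

    # VGG16 body without the classification head, with the 128 x 227 input used in the lecture
    vgg_model = VGG16(include_top=False, weights=None, input_shape=(128, 227, 3))
    # Placeholder file name for the pre-downloaded face weights
    vgg_model.load_weights("vgg_face_weights.h5", by_name=True, skip_mismatch=True)

    # Freeze the convolutional feature-learning layers so they keep their pre-trained filters
    for layer in vgg_model.layers:
        layer.trainable = False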
After making our feature learning layers non-trainable, we can put another multi-layer
perceptron on top of it, and the code will look something like this. So, what is this on-top
model doing? It takes our VGG model architecture and then we simply put a flatten layer,
which will flatten the extracted features into a single
510
vector; then we perform batch normalization, and then we put a 21-neuron layer with ReLU
activation, and we will name this layer as the feature layer.
So, this will be our main layer for extracting the features, and then we will simply perform a
classification operation on our dataset and extract the features learned at this layer.
So, you can see that our VGG model has given a 4 cross 7 cross 512 dimensional feature
shape; we have flattened that out, and after performing the batch normalization we will be
taking this 21-dimensional feature. Now, I can simply compile my model using the categorical
cross entropy loss and the Adam optimizer with a learning rate equal to 0.0001.
After this I can simply fit my model with the video data for the two classes, with a batch size
of 128 and the number of epochs, let us say, 20. Fitting this model over the 20 epochs might
take some time, so please have some patience.
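A rough sketch of the classification head, its compilation and the training described above follows; the assumption that every frame is treated as one training sample, and the variable names, are mine and may differ from the lecture's notebook.

    import numpy as np
    from tensorflow.keras import layers, models, optimizers
    from tensorflow.keras.utils import to_categorical

    top_model = models.Sequential([
        vgg_model,                                           # frozen VGG16 backbone
        layers.Flatten(),
        layers.BatchNormalization(),
        layers.Dense(21, activation="relu", name="feature_layer"),
        layers.Dense(2, activation="softmax"),
    ])
    top_model.compile(loss="categorical_crossentropy",
                      optimizer=optimizers.Adam(learning_rate=0.0001),
                      metrics=["accuracy"])

    frames = video_only_all.reshape(-1, 128, 227, 3)         # every frame becomes one sample
    frame_labels = to_categorical(np.repeat(video_label_all, 120), num_classes=2)
    top_model.fit(frames, frame_labels, batch_size=128, epochs=20)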
511
(Refer Slide Time: 20:17)
So, as you can see, we have already trained our model, and we are getting an accuracy of
somewhat around 86.41 percent, which is a good discriminating accuracy, saying that we are
able to discriminate between the two emotion classes using the VGG16 features. Now, we will
simply write a piece of code to extract this feature layer. For that I will create a feature
extractor which takes the model input as its input, and whose output is the intermediate layer,
which is our feature layer.
Now, I can pass all my data points to this feature extractor and we will get the corresponding
features. For that I can write a simple piece of code, and the code will look something like this.
Here I have declared a list called video only feature list; I will iterate through all the files, pass
each file into this feature extractor, extract the features, convert them into a NumPy array and
append them to this video only feature list. I will then convert that list into a NumPy array for
our ease of use. One other thing that you have to note here is that after extracting the NumPy
array I am reshaping it to 120 cross 21.
So, what this basically means is that since our original data consists of 32 files, each file
containing 120 images of dimension 128 cross 227 cross 3, for each file I am passing all these
120 images, and for one particular image I am getting a 21-dimensional
512
Now, I am simply appending these feature sequentially so, that I can get a long feature for
these 120 images and these 120 images basically correspond to our 4 seconds of video. So, in
this way I am creating feature for all these 32 images.
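A minimal sketch of this per-video feature collection, under the same naming assumptions as the earlier sketches:

    import numpy as np
    from tensorflow.keras.models import Model

    # Intermediate model that outputs the 21-dimensional "feature_layer" activations
    feature_extractor = Model(inputs=top_model.input,
                              outputs=top_model.get_layer("feature_layer").output)

    video_only_feature_list = []
    for video in video_only_all:                             # video: (120, 128, 227, 3)
        feats = feature_extractor.predict(video)             # (120, 21), one vector per frame
        video_only_feature_list.append(feats.reshape(-1))    # concatenate frames -> 2520 values
    video_only_features = np.array(video_only_feature_list)  # (32, 2520)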
So, running this code we got 32 files, each having dimension 2520. By now we have created
our audio features as well as our video features. Now, we will first try to see how these audio
features and video features work with our basic classifiers, which are Gaussian Naive Bayes,
linear discriminant analysis and the support vector machine. To do so, I will import all these
classifiers from the standard sklearn library, and my classification code will look something
like this.
Here, first of all, I will be using my audio features, which are the MFCC audio-only features,
and I will divide these features into my train and test splits, where my test split size will be 20
percent. I will first call my Gaussian Naive Bayes classifier and fit it on my training set, and
then I will call my linear discriminant analysis classifier, then the support vector machine with
a linear kernel, and after that the support vector machine with an RBF kernel.
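A minimal sketch of this comparison with scikit-learn; the variable names are assumptions, and since the audio and video labels are identical index-wise, either label array can be passed.

    from sklearn.model_selection import train_test_split
    from sklearn.naive_bayes import GaussianNB
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
    from sklearn.svm import SVC

    X_train, X_test, y_train, y_test = train_test_split(
        mfcc_features, video_label_all, test_size=0.2, random_state=0)

    for name, clf in [("GaussianNB", GaussianNB()),
                      ("LDA", LinearDiscriminantAnalysis()),
                      ("SVM-linear", SVC(kernel="linear")),
                      ("SVM-RBF", SVC(kernel="rbf"))]:
        clf.fit(X_train, y_train)
        print(name, "train:", clf.score(X_train, y_train), "test:", clf.score(X_test, y_test))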
Let us see how these features actually perform with these basic classifiers. As you can see, we
are getting a training score of somewhat around 76 percent using Gaussian Naive Bayes, linear
discriminant analysis and our support vector classifiers, and the support vector machine with
the RBF kernel is giving a lower score, which can also signify that the discriminating boundary
between these two classes is more of a linear boundary.
513
After this we will perform the same thing with our video features, and the code will look
something like this.
This is similar code to what we did with the audio features. Instead of the audio features I will
simply use my video-only features and their corresponding labels; the test size will again be 20
percent, we will split into our train and test splits, and then we will call our Gaussian Naive
Bayes, linear discriminant analysis, support vector machine with linear kernel and support
vector machine with RBF kernel.
So, let us run this code and see how our video features perform. As we can see, our video
features are really good and they are giving an accuracy of somewhat around 100 percent for
each instance, including our test scores. But since we are already getting this much accuracy,
let us try to see how the accuracy changes if we simply combine these two features together.
So, what I am basically trying to say is that I will simply concatenate my audio-only features
and my video-only features together, and then I will pass these fused features to these basic
machine learning classifiers and see how my accuracy changes. For my fusion part, my code
will look something like this. I have simply concatenated the audio features and video features;
let us see how the dimensions look.
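A one-line sketch of this early, feature-level fusion, under the same naming assumptions as before:

    import numpy as np

    fused_features = np.concatenate([mfcc_features, video_only_features], axis=1)
    print(fused_features.shape)     # (32, 5040): 2520 MFCC values + 2520 video feature values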
514
So, we are having these 32 files, each having a 5040-dimensional feature, and I have also
printed my audio labels and video labels just to show you that these are the same with respect
to the indexing, so we can use them interchangeably. Now, let us use our classification code
and see how my fused features work.
So, for this I will simply reuse my code, and instead of my video-only features here I will use
the fused features. Since our audio labels and video labels are the same with respect to the
index, I can use either of the labels; again, the test set size is 20 percent. Let us see how it
works.
As we can see, these are working slightly worse compared to our video-only features in our
first case, which is the GaussianNB case, and for the LDA case the accuracy is also dropping.
But for the SVM case we are getting similar accuracies to what the video features were giving.
So, this basically says that feature fusion is giving us better, or at least the same, performance
with our SVM classifier. As for why the classification accuracy decreases, there could be
multiple factors; one potential factor could be the curse of dimensionality, since my feature
dimension is quite high right now.
GaussianNB and linear discriminant analysis might be suffering from this curse of
dimensionality, which is why they are giving a poorer performance, but in the case of the linear
SVM I am getting a similar accuracy over here. So, till now we have tried traditional machine
515
learning classifiers, and from now on we will be using a 1D CNN architecture to classify the
image and audio data encodings into the emotion classes.
First, we will separately test the CNN architecture on the audio and video data. Later on we
will be using the fused modality as the CNN input. In this case we are hypothesizing that the
audio and video embeddings that we generated using MFCC and VGG face respectively are
two representations of the original data, and since we have created these embeddings by
concatenating each time frame, we can treat these embeddings as time-domain data and use a
1D CNN classifier on top of them.
So, our CNN architecture will look something like this: we will sequentially pass the input
shape, and there will be two convolutional layers, each followed by a max pooling operation.
Then we will flatten the output from the second max pool layer and pass the flattened output
through a batch normalization layer.
Later we will be using a 128-dimensional dense layer followed by a dropout layer to decrease
any chance of overfitting, and with the last dense layer, which consists of two neurons with
softmax activation, we will classify the given data into the emotion classes.
516
Also, we will be using categorical cross entropy as the loss function, the optimizer will be the
Adam optimizer with a learning rate of 0.0001, and the metric will be accuracy.
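A minimal sketch of such a 1D CNN is given below; the lecture does not state the filter counts or kernel sizes, so the values used here are assumptions.

    from tensorflow.keras import layers, models, optimizers

    def build_1d_cnn(input_shape):
        # Two Conv1D + max-pool blocks, then flatten, batch norm, dense, dropout, softmax
        model = models.Sequential([
            layers.Conv1D(32, 3, activation="relu", input_shape=input_shape),
            layers.MaxPooling1D(2),
            layers.Conv1D(64, 3, activation="relu"),
            layers.MaxPooling1D(2),
            layers.Flatten(),
            layers.BatchNormalization(),
            layers.Dense(128, activation="relu"),
            layers.Dropout(0.5),
            layers.Dense(2, activation="softmax"),
        ])
        model.compile(loss="categorical_crossentropy",
                      optimizer=optimizers.Adam(learning_rate=0.0001),
                      metrics=["accuracy"])
        return model

    # The flattened embeddings are treated as 1-D sequences with a single channel
    cnn_video = build_1d_cnn((2520, 1))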
So, let me run this function. As our classifier is ready, we just need to reshape our video and
MFCC audio features so that they can be input into this CNN architecture. For that I will be
using simple reshaping operations, and our corresponding data will look something like this for
both the video and the MFCC features.
Later I will simply train our model. For that I will first convert our labels into categorical
labels, as we are using the categorical cross entropy function as our loss function, and we will
divide our data into the respective train and test sets with a 20 percent test size. We will simply
fit our model with a batch size equal to 8 and epochs equal to 10. Let us see how it performs.
So, here you can see the summary of our model, and our model is already trained.
517
(Refer Slide Time: 32:50)
So, we can see here that our CNN architecture is giving a good training accuracy, but the test
accuracy is somewhat around 42 percent, which is essentially below chance level, and this is
happening with the video features only. So, let me try the same thing with the audio features
and see how my CNN architecture works with them.
So, in the case of the audio features we will again divide our audio data into the respective
train and test splits with a test size of 20 percent, and we will call a fresh instance of our
model. Then we will fit it on our training data with a batch size of 8 and a number of epochs
equal to, say, 10, and then we will evaluate it on our test data. Let us see how it performs.
518
(Refer Slide Time: 33:57)
So, as we can see, using this 1D CNN our audio features are giving a better test accuracy
compared to our video features. Now, we are interested to see how the fusion of these two
features will perform in the 1D CNN architecture. For that I will simply use the fused version
of the features, and the code will look something like this.
519
(Refer Slide Time: 34:34)
Here I am using the fused features with a test set size of 20 percent and dividing them into the
respective train and test sets, calling a fresh instance of our CNN architecture, and fitting it on
our training data with, again, the same batch size of 8 and epochs equal to 10.
In this case we can see that our model is able to reach at least 71 percent, which is roughly
equivalent to the audio-only features. Here we can see that by fusing the feature embeddings
we can get at least the accuracy of the single best-performing feature. And again, we have not
done any sort of hyper-parameter tuning and the data is also very limited
520
over here, so there is still a chance that we can get a better accuracy, and this is a classic
example of early feature fusion.
So, what is typically happening here is that we took two features, concatenated them together
and passed them to our network. After performing early fusion we will now perform late
feature fusion, where late feature fusion refers to a method in which features extracted from
multiple sources, such as different sensors or modalities, are combined at a later stage in the
processing pipeline, after initial processing of the individual sources.
This allows for a more flexible and efficient feature representation, as it enables the
combination of diverse and complementary information. In the late fusion technique, the
features extracted from the different sources are first processed independently, often using
different techniques or architectures, and then combined at a later stage of the pipeline, such as
a fully connected layer.
It can also be done using various techniques, such as concatenation of the two features coming
from different modalities, their element-wise addition or element-wise multiplication, or maybe
some sort of weighted averaging of the two features. In our case we will simply take our video
features and audio features and pass them to different CNN architectures, and from the feature
representation layer of their respective CNN architectures we will collect the late features.
Then we will manually concatenate them into a late feature embedding, and using that
embedding we will train a multi-layer perceptron and see the efficacy of our late fusion
technique.
To do so, we will first create a model, which will be called the late-CNN model, and the
agenda of this model is to extract relevant features from its feature representation layer, which
consists of 128 neurons with ReLU activation. So, running this code. After that we can simply
call our late-CNN model as an audio model and fit it on our MFCC audio-only features with a
batch size of 8 and 10 epochs.
After we fit the model, we will again make a feature extractor and extract the features from
this feature representation layer, and the code will look something like this.
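A minimal sketch of this per-modality late CNN and its feature extractor follows; the layer names, filter sizes and the one-hot label variable are assumptions, and the video model is built the same way on the video features.

    from tensorflow.keras import layers, models, optimizers
    from tensorflow.keras.models import Model
    from tensorflow.keras.utils import to_categorical

    labels_cat = to_categorical(video_label_all, num_classes=2)   # same labels index-wise

    def build_late_cnn(input_shape):
        model = models.Sequential([
            layers.Conv1D(32, 3, activation="relu", input_shape=input_shape),
            layers.MaxPooling1D(2),
            layers.Flatten(),
            layers.Dense(128, activation="relu", name="feature_rep"),   # latent representation
            layers.Dense(2, activation="softmax"),
        ])
        model.compile(loss="categorical_crossentropy",
                      optimizer=optimizers.Adam(learning_rate=0.0001),
                      metrics=["accuracy"])
        return model

    audio_model = build_late_cnn((2520, 1))
    audio_model.fit(mfcc_features.reshape(-1, 2520, 1), labels_cat, batch_size=8, epochs=10)

    audio_extractor = Model(inputs=audio_model.input,
                            outputs=audio_model.get_layer("feature_rep").output)
    audio_latent = audio_extractor.predict(mfcc_features.reshape(-1, 2520, 1))   # (32, 128)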
521
(Refer Slide Time: 39:03)
Now, the same thing we will perform with our video features. Since we have our audio and
video models trained, we will simply extract their feature layers using this code and save them
into separate variables. If I show the shape of this variable, it will be a tensor of 32 cross 128:
basically, 32 is the number of our samples and 128 is the size of the latent representation.
Similarly, for the video we have a similar shape.
522
Now, we can simply concatenate these two late representations and make a late-fused latent
representation; for that, our code will look something like this. Once we have this late
representation, we will create another multi-layer perceptron, train it on the training data
coming from these late-fused features, and test it on the respective test data, and we will see
how this late fusion technique works.
So, our model will look something like this. It is a very simple model where we simply flatten
down these features, then perform a batch normalization, and there are two dense layers
consisting of 16 neurons each with ReLU activation. I will put a dropout layer with 10 percent
probability to regularize the network, and later we will classify the emotion classes using a
softmax classifier.
I will again train this model with the categorical cross entropy loss function and the Adam
optimizer at a learning rate of 0.0001.
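A minimal sketch of this late-fusion MLP, assuming audio_latent and video_latent are the 32 cross 128 latent matrices extracted earlier and labels_cat is the one-hot label array:

    import numpy as np
    from sklearn.model_selection import train_test_split
    from tensorflow.keras import layers, models, optimizers

    late_fused = np.concatenate([audio_latent, video_latent], axis=1)   # (32, 256)

    mlp = models.Sequential([
        layers.Flatten(input_shape=(late_fused.shape[1],)),
        layers.BatchNormalization(),
        layers.Dense(16, activation="relu"),
        layers.Dense(16, activation="relu"),
        layers.Dropout(0.1),                        # 10 percent dropout to regularize the network
        layers.Dense(2, activation="softmax"),
    ])
    mlp.compile(loss="categorical_crossentropy",
                optimizer=optimizers.Adam(learning_rate=0.0001),
                metrics=["accuracy"])

    X_train, X_test, y_train, y_test = train_test_split(late_fused, labels_cat, test_size=0.2)
    mlp.fit(X_train, y_train, batch_size=8, epochs=10)
    print(mlp.evaluate(X_test, y_test))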
Now that our model is compiled, I can simply divide my late-fused data into train and test
splits, and the code will look something like this; our test size is again 20 percent. Now I will
simply run my model and evaluate it on the test set. Let us see how it performs.
523
(Refer Slide Time: 42:19)
So, as we can see here, we are getting a better accuracy in this late fusion case compared to
our early fusion: in late fusion we are getting a test accuracy of 85 percent, whereas in the early
fusion case our test accuracy was 71 percent. So, in line with the literature, we had two
different modalities, we extracted relevant features from these two modalities using a machine
learning algorithm, in our case a convolutional neural network, and later we extracted the
latent representations from the CNNs, combined those latent representations and performed a
late fusion, and it is giving a better accuracy than our earlier techniques. So, finally, concluding
this tutorial: we started with working with audio and video information.
We extracted MFCC features from the audio modality, and we used a pre-trained VGG16 face
architecture to get features from our video data. Later, using these features, we first separately
trained our machine learning models and saw how they performed; then we tried to fuse the
given embeddings and saw how these classifiers performed; and later we did a CNN based
approach where we first tested the early feature fusion method and then the late feature fusion
method.
And we saw that in the case of data where multiple modalities are coming in, these late fusion
techniques work better.
524
Thank you.
525
Affective Computing
Prof. Jainendra Shukla
Prof. Abhinav Dhall
Department of Computer Science and Engineering
Indraprastha Institute of Information Technology, Delhi
Lecture - 28
The RAVDESS dataset is a popular dataset used for emotion recognition research. It
contains recordings of actors speaking and performing various emotional states, including
anger, disgust, fear, happiness, sadness, and surprise. The dataset includes both audio and
video recordings, which makes it well suited for multimodal emotion recognition
research. The audio recordings consist of 24 actors speaking scripted sentences in each of
six emotional states.
The video recordings consist of the same actors performing facial expressions and body
movements to convey the same six emotional states. Now, coming to the file name
convention in this dataset, each file name consists of seven numerical identifiers, in which
the third identifier indicates the emotion class. So, to get started with multimodal emotion
recognition using the RAVDESS dataset, we will first need to import the audio and video data
and extract relevant features from both modalities. After that, we can use a variety of
machine learning models to classify the emotional states based on the extracted features.
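As a small illustration, reading the emotion code out of such a file name might look like this; the example file name is hypothetical and only meant to show the dash-separated identifiers.

    fname = "02-01-05-01-02-01-01.mp4"                   # seven numeric identifiers, dash-separated
    emotion_code = fname.split(".")[0].split("-")[2]     # third identifier is the emotion class
    print(emotion_code)                                  # "05" in this example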
It is worth noting that the performance of your multimodal emotion recognition system
will depend on the quality of the features extracted and the fusion techniques used, as well as
the choice of machine learning model.
So it is important to carefully evaluate the performance of the system and fine-tune the
network. So we will start with downloading and importing the dataset into the Google
Drive. After that, we will be using some predefined Python libraries to read audio and
video files in Python environment. Later, we will do some feature extraction, particularly
this MFCC for audio data, and we will extract video feature using VGG16 face model.
526
After that, we will try to see the performance of individual features versus fused features, and
we will be using Gaussian Naive Bayes, linear discriminant analysis, the support vector
machine and a CNN network for this analysis.
And later on, we will be doing our early and late fusion, and we will be using a CNN for
this. So now let's jump directly to the coding part. We will start with installing a
couple of libraries that might not be pre-installed in the Google Colab environment. Once
we have installed all these necessary libraries, we can begin importing them in the
code, and we will start pre-processing the data and extracting the given features for our
multimodal emotion recognition. I'm also assuming that you have already
downloaded the dataset and stored it in your respective Google Drive.
So our downloading library code will look something like this. And it might take a
couple of seconds to download all these files. So after downloading my relevant
packages, I will import all my relevant libraries for this tutorial. And the code will look
something like this. After downloading and importing the libraries, I will start with
defining some path variables.
In this tutorial, I will basically be defining three path variables. The first will be our
video path, which essentially contains the exact dataset path where our video files are stored.
Then I will also define two other paths where I will be storing the video-only information
and the audio-only information extracted from the original dataset. One more thing: I will be
using just the actor 1 data for this analysis, and the only purpose of using the actor 1 data is to
reduce the computational demand. So as of now my path variables are all set.
So I will write the first piece of code, which will essentially take our video files from the
original data, extract the video and audio information from each video file and save it to two
separate folders. And the code will look something like this. Let me give you an overall
overview of this piece of code. What we are essentially doing here is iterating
over all the files in the dataset folder, and we will be using only emotion number two and
emotion number five, which correspond to the calm and angry emotions. The only
reason I'm using only two emotion classes is to reduce the computational load, as in the
general case Google Colab only provides about 12 GB of RAM.
For computational purposes only, I will be using just two of the emotion classes. Then,
after reading the videos from their location, I will resize them to a smaller size; again,
the reason is to reduce the computational requirement. After that, I will simply consider
the first four seconds of each video, to reduce the computational
requirement and also to ensure consistency among all the videos, as there could be
some instances where a particular video consists of five seconds and another might be
527
consisting of four seconds.
So I'm using the smallest threshold, 4 seconds, for every video so that all my data is
consistent along the time dimension. Now I will extract the audio information only, using
this command videoclip.audio, and later I will be storing these separate audio and video
contents in these two folders. So now, running this code. And this code might take a good
amount of time as we are reading and writing the data.
Audio and video files are created from the original dataset. Now, I will write another
piece of code to bring those audio-only and video-only files in my Python environment.
And my code will look something like this. So I will be just declaring two lists, video
label all and video only all. These two lists will be storing the labels and the exact video-
only data.
And I will be iterating through all the files in my video-only path and appending into
these two lists. Later I will convert these two lists into NumPy arrays for our ease of use, and
I will write another piece of small code which will essentially convert our labels into 0/1
binary labels. After execution of this code, I will be using a similar sort of code for bringing
the audio-only data into my Python environment, and my code will look something like this.
So now, as you can see, I have brought my video-only and audio-only data into my Python
environment, and the shape of the video-only data will look like 32 cross 120 cross 128 cross
227 cross three. So what do these numbers actually mean? There are 32 video files, and each
video file contains 120 frames. Since the video data was sampled at 30 Hertz and we are
taking just 4 seconds of data, there will be 120 frames, and each frame will be of dimension
128 x 227 with 3 channels: red, green and blue. Similarly, for our audio data we will be
considering the first 4 seconds of audio, and all the audio will be sampled at 8,000 Hertz,
which essentially means that for one second of data, there will be 8,000 samples.
And using this strategy, we will be collecting all those 32 files, and each file will consist
of 32,000 samples, which essentially means four seconds multiplied by the 8,000 sample rate.
Now that we have the audio-only and video-only files in my Python environment, I will start
with the feature extraction part. In our feature extraction part, my first feature will be the
MFCC feature, which is essentially called the Mel-frequency cepstral coefficients. And it is a
commonly used feature extraction technique for audio analysis, including in the context of
multimodal emotion recognition. MFCCs represent a compact yet efficient way of capturing
the spectral characteristics of an audio signal.
And they have been shown to be effective at capturing emotion information in speech
signals. So there are some basic steps that are involved in computing the MFCC from an
528
audio signal. The very first step consists of applying a high-pass filter to the signal to
amplify the higher frequency components and reduce the impact of the lower frequency
components. The second step of the MFCC process is frame segmentation, where the signal is
divided into short frames, usually 20 to 30 milliseconds, with some overlap.
After windowing each frame, taking its Fourier transform and warping the spectrum onto the
Mel scale, there is the usual cepstral analysis over these warped Mel frequencies. The
Mel-frequency spectrum is transformed into the cepstral domain using the inverse of the
discrete cosine transformation, and then we select a subset of the resulting cepstral coefficients
as the MFCC coefficients. These MFCC coefficients can be used as features for emotion
recognition classification, either on their own or maybe in combination with other
features. And for our purpose, we will be using a standard library called librosa, which
provides the function to easily compute MFCCs from audio signals.
And to do so, our code will look something like this. In this code, we have declared a list
where we will be storing our MFCC coefficients; we will iterate through all the
audio-only files, use the predefined function librosa.feature.mfcc, and pass each file to this
predefined function. There is a parameter in this function called number of MFCC
coefficients, and for that we have chosen a value equal to 40. And after computation of these
MFCC coefficients, we will convert that list into an array and reshape this MFCC list as
number of samples followed by number of coefficients.
So after extracting the MFCC features from the audio files, we will move towards extracting
video features from the video-only files. To do so, we will take advantage of the VGG16
face architecture. So VGG16 is basically a deep convolutional neural network
architecture that was originally designed for image classification tasks. This architecture
for image classification consists of 16 layers, including 13 convolutional layers and three
fully connected layers. In the standard architecture, the input to the network is an RGB image
of size 224 cross 224 pixels.
And this input layer is followed by the 13 convolutional layers, each with a three cross three
filter size and a stride of one pixel. In this architecture, the number of filters increases as we
go deeper into the network, basically ranging from 64 in the first layer to 512 in the last
layer. Then, after every two or three convolutional layers, a max
529
pooling operation is applied to reduce the spatial resolution of the feature maps coming out
of the convolutional layers. After that, fully connected layers of 4096 neurons each are
applied, and in the standard architecture there is then a three-layered multi-layer perceptron
for the final prediction, and the final prediction is performed using a softmax layer. So in our
architecture, we will be using the standard VGG16 face model.
It's a pre-trained model. And to do so, our code will look something like this. In this
code, instead of using a 224 cross 224 image, we will be using a 128 cross 227 dimensional
image. Also, I have downloaded this pre-trained weight file to my local drive,
and I have already given you the exact code for downloading this file. So after running
this code, our model is initiated.
So one essential step that we have to do with this pre-trained model is to freeze all
the convolutional layers so that they can extract relevant features according to their
pre-trained settings. To do so, my code will look something like this, where I will iterate
through all the layers in my VGG model, and I will simply pass an argument
layer.trainable equal to false. After making our feature learning layers non-trainable,
we can put another multi-layer perceptron over it, and the code will look
something like this.
So what is this over-the-top model doing? It is taking our VGG model
architecture, and then we simply put a flatten layer which will flatten the
extracted features into a single vector; then we perform a batch normalization, and
then we put a 21-neuron layer with ReLU activation. And we will name this layer
as the feature layer. So this will be our main layer for extracting the features. And then we
will simply perform a classification operation on our dataset and extract the features
learned at this layer. So you can see that our VGG model has given a 4 cross 7 cross 512
dimensional feature shape.
Then we have flattened that out. And after performing the batch normalization, we will be
taking this 21-dimensional feature. Now I can simply compile my model using
categorical cross entropy loss and with my Adam optimizer with a learning rate equal to
0.0001. After this, I can simply fit my model with the video data for the two classes with a
batch size of 128 and a number of epochs, let's say 20.
So fitting this model over the 20 epochs might also take some time. So please
have some patience. So as you can see, we have already trained our model now, and we
are getting an accuracy of somewhat around 86.41%, which is a good discriminating accuracy,
saying that we are able to discriminate between the two emotion classes using the VGG16
530
features. Now, we will simply write a code to extract that feature layer.
For that I will write a simple code where I will create a feature extractor which
essentially takes the model input as our input, and the output will be the intermediate layer,
which is our feature layer.
Now I can pass all my data points to this feature extractor and we will get the
corresponding features. For that I can write a simple code, and the code will look
something like this. Here I have declared a list called video only feature list, and I will
iterate through all the files and pass those files into this feature extractor function
over here. And then I will simply extract the features, convert them into a NumPy array
and append these features to this video only feature list, and I will convert that list
into a NumPy array for our ease of use. One other thing that you have to note here is
that after extracting the NumPy array, I'm reshaping it to 120 cross 21. So what this
basically means is that since our original data consists of 32 files, each file containing 120
images of dimension 128 cross 227 cross three, for each file I'm passing all
these 120 images, and for one particular image I'm getting a 21-dimensional feature from
my VGG16 architecture.
Now I'm simply appending these features sequentially so that I can get a long feature vector
for these 120 images. And these 120 images basically correspond to our four seconds of
video. So in this way, I'm creating features for all these 32 files. So running this code,
we got 32 files, each having dimension 2,520. By now, we have created our audio
features as well as our video features.
Now, we will simply first try to see how these audio features and video features are
working with respect to our basic classifiers, which are Gaussian Naive Bayes, linear
discriminant analysis, and the support vector machine. To do so, I will import all these
classifiers from the standard sklearn library, and my classification code will look
something like this. Here, first of all, I will be using my audio features, which are the
MFCC audio-only features, and I will divide these features into my train and test split, where
my test split size will be 20 percent. And I will first call my Gaussian Naive Bayes classifier
and fit that data on my training set. And then again, I will call my linear discriminant analysis
classifier, then the support vector machine with linear kernel, and after that, the support vector
machine with my RBF kernel. Let's see how these features are actually performing with
these basic classifiers. Okay, as you can see, we are getting a training score of somewhat
around 76% using Gaussian Naive Bayes, linear discriminant analysis and our support vector
classifiers.
And the support vector machine with RBF kernel is giving a lower score, which can also
signify that the discriminating boundary between these two classes is more of a linear
531
boundary. After this, we will perform the same thing with our video features, and the code
will look something like this. This is similar code to what we did with the audio features.
Instead of the audio features, I will simply use my video-only features and their corresponding
labels, and the test size will again be 20%, and we will split it into our train and test splits.
And then we will call our Gaussian Naive Bayes, linear discriminant analysis, support vector
machine with linear kernel and support vector machine with RBF kernel.
So let's run this code and see how our video features are performing. Okay, so as we can
see, our video features are really good and they are giving an accuracy of somewhat around
100% for each instance, including our test scores. But since we are already getting this
much accuracy, let's try to see how our accuracy changes if we simply combine
these two features together. So what I'm basically trying to say is I will simply concatenate
my audio-only features and my video-only features together. And then I
will pass these fused features into these basic machine learning classifiers and I will
try to see how my accuracy changes with that. So for my fusion part, my code will look
something like this. So I have simply concatenated these two, the audio features and video
features, and let's see how the dimensions go. Okay, so we are having these 32 files, each
having 5040-dimensional features, and I have also printed my audio labels and video labels
just to show you that these are the same with respect to the indexing, so we can use them
interchangeably. Now let's try to use our classification code and see how my fused features
are working.
So for this, I will simply reuse my code, and instead of my video-only features here, I will
use the fused features. And since our audio labels and video labels are the same with respect
to the index, I can use either of the labels. Again, the test set size is 20%. And let's see how it
works. Okay, as we can see, these are working slightly worse compared to our
video-only features in our first case, which is our Gaussian Naive Bayes case.
And also for our LDA case, the accuracy is dropping. But for the SVM case, we are
getting a similar accuracy to what our video features are giving. So this
basically says that feature fusion is giving us a better, or at least the same amount of,
performance with our SVM classifier. And the reason why the classification accuracy
decreases, there could be multiple factors; one potential factor could be the curse of
dimensionality, since my feature dimensions are quite high right now. So Gaussian Naive
Bayes and linear discriminant analysis might be suffering from this curse of
dimensionality. That's why they are giving a poor performance, but in the case of the linear
SVM, I'm getting a similar accuracy over here.
So till now we have tried traditional machine learning classifiers, and from now on we
will be using a 1D CNN architecture to classify the image and audio data encodings into
532
emotion classes. First we will separately test the CNN architecture on audio and video
data. Later on, we will be using the fused modality as the CNN input. In this case, we are
hypothesizing that the audio and video embeddings that we generated using MFCC
and VGG face respectively are two representations of the original data, and since we
have created these embeddings by concatenating each time frame, we can treat
these embeddings as time-domain data and use a 1D CNN classifier on top of it. So our
CNN architecture will look something like this, where we will sequentially
pass the input shape, and there will be two convolutional layers, each convolutional
layer followed by a max pooling operation.
Then we will flatten the output from the second max pool layer, and we will pass
the flattened output through a batch normalization layer. Later we will be using a
128-dimensional dense layer followed by a dropout layer for decreasing
any chance of overfitting, and later, with the last dense layer, which will consist of
two neurons using softmax activation, we will classify our given data into the emotion
classes. Also, we will be using categorical cross entropy as the loss function, and the optimizer
we will be using is the Adam optimizer with a learning rate of 0.0001, and the metric will be
accuracy. So let me run this function. So as our classifier is ready, we just need to reshape
our video and MFCC audio features in a way that can be input into this CNN
architecture. So for that, I will be using simple reshaping operations, and our
corresponding data will look something like this for both the video and MFCC features.
Later, I will simply train our model. And for that, I will be first converting our labels into
categorical labels as we are using categorical cross entropy function as our loss function.
And we will divide our data into respective train and test set with 20% test size. And we
will simply fit our model with batch size equal to 8 and epoch equal to 10. Let's see how
it performs.
So here you can see the summary of our given model, and our model is already trained.
So we can see over here that our CNN architecture is giving a good training accuracy, but
the test accuracy is somewhat around 42%, which is essentially below chance level. And
this thing is happening with the video features only.
So let me try the same thing with audio features, and let's see how my CNN
architecture is working with audio features. So in the case of audio features, we will again
divide our audio data into their respective train and test splits with a test size of 20%. And
we will call a fresh instance of our model, and then we will fit it
on our training data with a batch size of 8 and a number of epochs equal to, say, 10.
And then we'll try to evaluate it on our test data. Let's see how it performs. Okay, so as
533
we can see that using this 1D CNN, our audio features are giving a better test accuracy
as compared to our video features. Now, we are interested to see how the fusion of these two
features will perform in the 1D CNN architecture. So for that, I will simply use the fused
version of the features, and the code will look something like this. Here, I'm using this fused
feature with a test set size of 20% and dividing it into my respective train and test sets,
calling a fresh instance of our CNN architecture and fitting it on our training data with,
again, the same batch size of eight and epochs equal to 10. Okay. And in this case, we can see
that our model is able to reach at least 71%, which is somewhat equivalent to the audio-only
features. Here we can see that by fusing the features, we can get at least the accuracy of the
single best performing feature. And again, here though, we haven't done any sort of
hyperparameter tuning and the data is also very limited over here.
I mean, there is still a chance that we can get a better accuracy over here. And this is a
classic example of our early feature fusion. So what is typically happening here is we
took two features, concatenated them together, and passed them to our network. So after
performing early fusion, we will now perform late feature fusion, where late feature fusion
refers to a method where features extracted from multiple sources, such as different sensors or
modalities, are combined at a later stage in the processing pipeline, after initial
processing of these individual sources. This allows for a more flexible and efficient feature
representation as it enables the combination of diverse and complementary information.
In the late fusion technique, the features extracted from different sources are first
processed independently, often using different techniques or architectures, and then
combined at a later stage of the pipeline, such as a fully connected layer. It can also be
done using various techniques such as concatenation of two features coming from
different modalities, maybe their element-wise addition or element-wise
multiplication, or maybe some sort of a weighted averaging of two features. So in our
case, we will simply take our video features and audio features and pass them to different
CNN architectures. And from the feature representation layer of their respective CNN
architectures, we will collect the late features, and then we will manually concatenate them
into a late feature embedding. And using that embedding, we will train a multi-layer
perceptron and see the efficacy of our late fusion technique.
So to do so, we will first create a model, and this will be called the late-CNN model. And
the agenda of this model will be to extract relevant features from this feature
representation layer, which essentially consists of 128 neurons, and the activation
function is ReLU over here. So running this code.
534
So after that, we can simply call our late CNN model as an audio model and fit it on
our MFCC audio-only features with a batch size of 8, and the number of epochs will be
10. After we fit the model, we will again make our feature extractor and extract the
features from this feature representation layer. And the code will look something like this.
Now the same thing we will perform with our video features. And now, since we have our
audio and video models trained, we will simply extract their feature layers using this code
and save them into separate variables.
So if I show the shape of this variable, it will be a tensor of 32 cross 128. So basically,
32 is the number of our samples, and 128 is the size of that latent representation. Similarly,
for the video, we have a similar shape. Now, we can simply concatenate these two late
representations and make a late fused latent representation. For that, our code will look
something like this.
Now we have this late representation. We will create another multi-layer perceptron, and
we will train that perceptron on the training data, which essentially will be coming from
these late fused features, and test on the respective test data. And we will see how this late
fusion technique will work. So our model will look something like this. It's a very simple
model where we will simply flatten down these features.
And then we perform a batch normalization. And there will be two dense layers consisting of
16 neurons each and a ReLU activation function. And I will be putting a dropout layer with
10% probability to regularize this network, and later we will classify the emotion classes
using a softmax classifier. And I will again train this model with the categorical cross entropy
loss function with the Adam optimizer at a learning rate of 0.0001. Now that the model is
compiled, I can simply divide my late fused data into train and test splits and the code will
look something like this, and our test size is again 20%. And now I will simply run my
model and evaluate it on the test set.
Let's see how it performs now. OK. So yeah, as we can see here, we are getting
better accuracy in this late-fusion case as compared to our early fusion. In late fusion, we
are getting a test accuracy of 85%, whereas in our early-fusion case, our test accuracy was
71%. So in line with the literature, we had two different sorts of modalities; we
extracted relevant features of these two modalities using some sort of a machine learning
algorithm, in our case a convolutional neural network. And later, we extracted the latent
representations from these machine learning algorithms, from the CNNs, and combined
those latent representations and did a late fusion.
And it is giving a better accuracy than our earlier techniques. So finally, concluding this
tutorial: in this tutorial we started with working with audio and video information. We
535
extracted MFCC features from the audio modality, and we used a pre-trained VGG16 face
architecture to get features from our video data. Later, using these features, we firstly
separately trained our machine learning models and saw how they were performing. Then
we tried to fuse the given embeddings and saw how these classifiers again performed.
Later we did a CNN based approach where we first tested the early feature fusion method and
then we tested the late feature fusion method.
And we saw that in the case of data where there are multiple modalities coming in, these late
fusion techniques work better. Thank you.
536
Affective Computing
Prof. Jainendra Shukla
Prof. Abhinav Dhall
Department of Computer Science and Engineering
Indraprastha Institute of Information Technology, Delhi
Week - 09
Lecture - 01
Empathy and Empathic Agent
So, in this week we are going to talk about how we can create machines which are not only
emotionally intelligent, but also empathetic; so, they have some sense of empathy.
It sounds like sci-fi, but there are ways in which we can at least approximate it. So, here is an
outline for today's class. In today's class we are going to look at empathy, try to understand
what empathy itself is and what we mean by empathetic agents.
We will next talk about how we can develop artificial empathy. Of course, in order to develop
artificial empathy, we need to understand how empathy is developed naturally among humans.
Next, we will talk about how we can evoke empathy using the techniques that we have learned
from how empathy naturally develops in humans.
537
Then of course, we will be looking at how this empathy can be used, or has been used, in the
virtual and robotic agents. We will talk about how empathy is not just an emotional state, it
goes beyond an emotional state, and how it can be evoked. And we will finish the module with
the evaluation of some performance metrics that can be used to understand the naturalness or
the empathetic quality of the interactions that are happening between the humans and the
agents, perfect.
So, with that, let us dive in with the first topic, empathy and the empathetic agents.
538
So, the very first thing that we want to ask ourselves is why empathetic agents, and in order to
answer that, let us first revise what empathy is. Empathy, as we all know, is the capacity of
humans to understand what other humans are experiencing. And you may understand that this
is different from sympathy.
So, this is basically an experience in which you try to feel what others are going through, and
in turn this helps you to comfort them in a better way, right. That is what empathy is all about.
And as humans also, you may have seen that we prefer interaction with humans, we want to
be around those friends and colleagues and family members who are more empathetic towards
us rather than those who are not so empathetic.
The same idea is there behind empathetic agents as well. The underlying hypothesis is that if
virtual agents, let it be robots, machines, services, any emotionally intelligent component, can
have empathetic responses, then that will lead to better, more positive and more appropriate
interactions with humans. And this is what we want to enable.
And this is what already happens when two humans who are empathetic to each other
interact, right. So, there is a better conversation and interaction, there is a more positive
interaction, and it is also appropriate in its timing, when to laugh and when not to laugh for
example, and things like that. So, this is the same thing that we want the agents to have, so
that they can behave in a similar fashion to us and the entire experience can be more positive
for the humans.
But it turns out that the reason why giving these empathetic responses is also going to make
virtual agents more empathetic and more interactive is the underlying assumption that we
humans always prefer to interact with machines in the same way that we prefer to interact
with other people.
So, we want other people to be empathetic; similarly, we would like the machines to be
empathetic. And this is not a very new phenomenon, it is something that has been studied for a
long time. So, the underlying idea is about whatever entities we are interacting with, if we can
attribute human-like qualities to them.
539
If we can attribute human-like qualities to those entities, then that entity becomes more
familiar to us even if it is a non-human-like entity. And in turn, if it becomes more familiar to
us, then the entire experience around this entity becomes more explainable or predictable to
us, and hence we are very comfortable with it. So, to give you an idea.
We can look at something that is known as anthropomorphic design. Some of you may have
heard about it, but for those who have not, we are going to look at this. So, the underlying
assumption that we are deriving, that humans would like to interact with machines that
have human-like properties, is because we always have this tendency to feel more comfortable
around things that have human-like features. And hence we develop a tendency to attribute
human-like characteristics to non-life-like artifacts. And this entire design approach is known
as anthropomorphism.
Anthropomorphism, of course, comes from a Greek word which is made of two words,
anthropos and morphe; anthropos means human, morphe means shape or form. So, basically
anthropomorphism is all about having a human-like shape or form. But we will see that it is
not only about the shape and the form, it is also about the behaviour.
But let us take an example here. So, the word anthropomorphism itself says that, ok, it is a
human-like form or shape. But as I said, it is not only about the form or the shape
540
anymore, it is also about the way they interact with you. So, we would like them to interact the
way humans interact with us, right.
For example, looking with 2 eyes rather than 4 eyes; who is stopping you from creating 4 eyes
in the robots or in the agents or in the systems? No one, right. But it turns out that humans will
always feel more familiar and comfortable around robots which have 2 eyes, because of
course they are human-like, right, because of our anthropomorphic nature.
So, this is about behaviour, and this is also about interaction, and this is where we are talking
about empathic interaction. We would like the non-human-like artifacts to interact with us just
the way humans interact with us, right. For example, imagine that you could have a robot
which can communicate with you using your brain signals; that sounds fascinating, but at the
same time this is not how humans communicate with other humans. Well, most humans will
not do that, some may claim, right.
So, the way humans communicate or interact with other humans is, for example, using voice,
of course, and using gestures, the way I am using them and the way we use them with each
other, and things like that. Similarly, we would like to have machines which can communicate
with us using voice, you see, and communicate with us using gestures and so on and so forth;
of course, facial expressions and all that, right. And this is where we are talking about
empathetic interaction as well.
So, you can see this image on the right hand side; I hope this image is clearly visible to you. If
you look at this image, for example, it looks cute to us, it looks appealing to us, or at least to
children and to many of us also. And this is the underlying idea of why we like cartoons, why
we like caricatures and all that. Because, for example, here you can see this is the picture of a
mouse, but the mouse is depicting human-like characteristics.
What characteristics? Many different characteristics. For example, of course, it is having
glasses; it is funny for a mouse to have glasses, but ok, humans have glasses, so it makes it a
bit more funny and more human-like. Of course, it is reading; of course, it has a dress like
humans; of course, it is sleeping like humans, you know, on the pillow and things like that;
and it has socks like humans and all that, right.
So, basically, all of these are human characteristics. In terms of form, it has glasses and a roughly human-like shape; in terms of behavior, it is reading. There is no interaction shown here, but it could also interact, maybe using voice and all that.
This is what anthropomorphism is all about, and because of anthropomorphism we have a tendency to give human-like characteristics to non-human-like artifacts. Why do we do that? We do that because it makes the entire interaction more comfortable for us and hence gives a more positive experience to the humans. That is the underlying idea.
But we have to look at this anthropomorphism with a bit of a catch as well. It turns out there are mainly two things we are talking about here. One is the appearance, which is the shape or the form, and the other is the function, which is the behavior; of course, the function can include both the behavior and the interaction.
It turns out that this impacts how we perceive the artifact, how we interact with it and in what sense we build long-term relationships with it, right. And this is going to be very important for people in industry, because one big problem there is the retention of customers for your services, right.
Customers may start using your product, your services, your agents, your robots and all that, but with time they are going to lose interest unless and until they are able to build a long-term relationship with them, and long-term relationships can be built much more easily if your services can provide human-like experiences to them, ok.
But there is a catch here. The catch that I wanted to talk about concerns the appearance and the capabilities, or the appearance and the function, of an agent. We can talk about an agent because it is easier to understand, but as I have said multiple times, the same idea can be extended to robots, services and things like that.
So, the appearance of a robot should also match its capabilities, in the sense that the form and the behavior should be in sync, and in turn this has to match the user's expectations. Just to make this a bit clearer, please look at the picture on your right side. I am not sure how many of you know it, but this is the Sophia humanoid robot, ok.
Sophia is supposed to be, as of now, the most advanced humanoid robot. If you look at the image of the Sophia robot, it is an anthropomorphic design. Why is it an anthropomorphic design? It looks like a human, right. And it behaves somewhat like a human: it smiles like humans do, to a certain extent it has eyes like a human, it has a human-like form, ears and all that.
But then again, as I said, we want the appearance and the capabilities to match each other. And when they do not match each other, what happens? Then the creepiness comes into the picture, and the entire experience becomes uncomfortable rather than comfortable, right.
For example, for some it may seem that while it looks like a human, at the same time the smile has some components which do not look exactly human. So, there is a mismatch between the two, and maybe it is not going to give the strong sense of comfort that you wanted it to have, right.
(Refer Slide Time: 12:11)
It turns out this is a very well studied and established phenomenon, which is known as the uncanny valley effect. What is the uncanny valley effect? The term itself was coined by the roboticist Masahiro Mori in 1970, and it refers to the dip in the emotional response of humans as a near-human appearance is approached by the robots or by the agents, right.
For example, just look at this graph and it is going to become very clear. On the x-axis what you have is the human likeness, or human-like characteristics; it is already written here. So, human likeness increases as you go from left to right.
What you are seeing here is that you have certain robotic characters which are less human-like, such as, of course, R2-D2, the very famous character from the Star Wars movies, and Wall-E, which you may recall. The more you go to the right-hand side, you come to, for example, the synth characters in the Humans TV series.
And then the T-800 Terminator and, of course, Dolores in Westworld; they look very much like humans. So, on the x-axis you have robotic agents looking less human-like on the left-hand side and more human-like as you go to the right. Similarly, on the y-axis, what do you have? You have the familiarity, the sense of comfort with these agents as rated by humans, right.
What it is roughly trying to show you is that, initially, the more the human likeness of the robotic characters increases, the more the familiarity of these agents with humans also increases, and it makes sense, right. The more they look like humans, the more familiar you feel with them and around them.
For example, if you recall R2-D2 or the Wall-E character that is there in front of you, it was so popular because, while it was not human, it had some human-like behavior and shape as well. It had two cute eyes, it was able to navigate around and things like that. So, not exactly like a human, but it had some human-like form, and people loved it, liked it a lot.
Similar was the case with R2-D2 and so on and so forth. And similarly, if you recall another very popular character from the Star Wars movies, C-3PO: C-3PO had even more human-like characteristics, and hence the familiarity or the popularity of this character was also quite high.
But then look at what starts happening: this is the dip that we are talking about, this entire region is the dip that we are talking about. Suddenly, in this area, even though the human likeness of the characters keeps increasing, there is a sharp dip in the familiarity.
For example, if you recall the movie I, Robot, this is the Sonny character from I, Robot. Similarly, the skeletal T-800 form of the Terminator, the synths as I said, and so on and so forth.
So, basically, what is happening in the case of all these characters, the Gunslinger in Westworld and all the others, is that they have started to look more like humans. But at the same time there is a mismatch; I will note it down again so that you can recall it. There is a mismatch between their form and their behavior and, more importantly, the user's expectations. What do we mean by that?
If you look at the form, the form looks quite human, ok. For example, if you look at the Sonny robot itself, it looks quite like a human, but at the same time its behavior was not exactly like a human's; it has some superpowers as well, and there are certain things that look creepy, for example, all these things coming out of the head and things like that, right.
Similarly, if you look at the skeletal form of the Terminator, this particular character is looking very much like a human, but at the same time it looks like a distorted human body. It does not look very good to us, it does not feel comfortable to us, right; it looks as if it is missing some limbs, some parts and so on. So, it is almost horrific.
So, then, of course, we do not feel comfortable around this kind of design. One thing is that the form is looking like a human but at the same time is not quite human, which is creepy in itself. And then the form and the behavior may also not match: if the robot is looking like a human, it should also behave like a human, and it is really hard to behave like a human.
Because, of course, the naturalness that is there in human interaction, the voice and all that, is not easy to plug in, and there is a lot of research going on on these things. And more importantly, there is a mismatch with the user's expectations.
When you show a character which looks like a human, the user's expectation is that it is going to behave like a human. But when it looks like a human and does not behave like one, and behaving like a human, as I said, is always challenging, then there is a mismatch with the user's expectations. That is where the familiarity or the popularity starts dipping, and that is essentially what is happening in the uncanny valley region.
But then what happens as we cross a particular threshold, say 90 percent or so, something that is very, very close to human? For example, this character, played by a (Refer Time: 17:53) in this case, and then, of course, Dolores in Westworld: what is happening? With these, the human likeness has increased a lot, but at the same time the familiarity or the popularity has also increased a lot.
One simple reason: they are looking very much like humans and they are behaving very much like humans. It is as simple as that; there is a very good match with the expectations of the users. If they are going to look like humans, they are expected to behave like humans, and that is how they are behaving in this particular case.
This is also the point where you can say, if you recall the Turing test, that to a certain extent they may have passed the Turing test for the humans.
Because for a normal human it may be hard to differentiate whether this character is a human or a robotic agent, since they are looking very much like humans and behaving very much like humans.
So, maybe they are passing the Turing test as well beyond this region. Nevertheless, it is really hard to reach this particular region, but what you definitely do not want is to get your design trapped in the uncanny valley, ok. So, that is what the uncanny valley effect is about.
So, coming back to the summary: what did we want to understand? We wanted to understand why we want to have empathetic agents. We want to have empathetic agents because it makes the entire interaction more positive, as simple as that. And why does it make the entire interaction more positive? Because it gives us a sense of familiarity when the agents interact with us just like humans do, and this is what is known as anthropomorphism or anthropomorphic design.
That is the underlying hypothesis of empathetic interaction. But while we are doing it we should be cautious about the uncanny valley, because it may happen that we make their form human-like but they are not able to achieve a human-like interaction, and that is where there is going to be a dip in the emotional response, in their familiarity or in their likeability.
So, basically, what we would like to have, and this can be called our holy grail, which can be difficult to achieve, is human-like form, human-like behaviour and human-like empathetic reaction. But maybe we can reach up to this point on the curve, and that can also be quite good, or even here, or anywhere in this region, right. So, with that, this was about why empathy, what we mean by empathetic agents and why we like them so much.
Affective Computing
Prof. Jainendra Shukla
Prof. Abhinav Dhall
Department of Computer Science and Engineering
Indraprastha Institute of Information Technology, Delhi
Week - 09
Lecture - 02
Development of Artificial Empathy
So, now, having understood the basics of empathy, let us try to understand how artificial empathy can be generated.
For that, let us first try to understand how empathy has been understood in the classical sense with respect to humans. Empathy can be summarized as three major subprocesses: one is emotional simulation, another is perspective taking and the third is emotion regulation.
Now, what is emotional simulation? Basically, emotional simulation is all about trying to analyze and understand the emotional state of an individual. This is essentially what we have been doing so far, trying to understand the emotions of an individual via different physiological and behavioral cues.
Now, what is perspective taking? Perspective taking is all about understanding, once you are going to give a response to the user, what the user is going to feel about it and how the user is going to feel about it; that is perspective taking.
In other words, how would I feel, for example, if I were in the user's situation and received this particular type of response. We will see an example and it will become very clear. And then, of course, emotion regulation is about the aim of the empathetic responses or interactions: we want to provide an empathetic response, but at the same time we do not want to transfer any negative emotion towards the individual.
That is what emotion regulation is. So, traditionally, this is how we as humans approach empathy or empathetic interaction with each other. And having understood this will also help us simplify what it means for an intelligent machine to be empathetic, ok.
Let us see this with an example. Imagine that you want to create an AI-empowered chatbot for mental health issues, where individuals who have any type of mental health issue can go and discuss, interact and have some sort of counselling with the chatbot. Now, if we have to create this and we want to make this AI-empowered chatbot a bit more empathetic, what will it look like?
Ok, the very first thing, of course, is that it will have to look into emotional simulation. When we say it will have to look into emotional simulation, it means it needs the ability to identify the emotional state of an individual, and in this particular case it needs to identify the common triggers of anxiety while it is interacting with humans, right.
And how will it be able to do that, to identify the common triggers of anxiety or when the user is feeling anxious? Of course, it needs to have access to a large amount of data, maybe both physiological and behavioral types of data.
Then it needs to be trained on that type of data, and using that trained model it should be able to identify when there is a trigger for anxiety, or when the individual who is interacting with the chatbot is getting anxious. That is what emotional simulation is; a small sketch of such a detector is given below.
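To make the idea concrete, here is a minimal, hypothetical sketch in Python of such an emotional-simulation step: a classifier trained on a handful of labelled windows of physiological and lexical features and then used to score a new window. The feature set, the toy numbers and the model choice are illustrative assumptions of this sketch, not the design of any particular chatbot.
```python
# A minimal sketch (not a real system): flag likely anxiety from a few
# hand-picked physiological/behavioral features. All values are made up.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Each row: [mean heart rate (bpm), skin conductance responses per minute,
#            fraction of negative words in the user's latest message]
X_train = np.array([
    [72, 1.0, 0.05],    # calm examples
    [75, 1.5, 0.10],
    [98, 6.0, 0.40],    # anxious examples
    [105, 7.5, 0.55],
])
y_train = np.array([0, 0, 1, 1])  # 0 = no anxiety trigger, 1 = anxiety trigger

model = LogisticRegression().fit(X_train, y_train)

# At interaction time, score the current window of sensor/text features.
current_window = np.array([[99, 6.5, 0.45]])
print(model.predict_proba(current_window)[0, 1])  # estimated probability of anxiety
```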
Now, of course, the next thing is that it has to understand how the user wants to feel; as a chatbot, it needs to understand that it must use language which will help its users feel that they are being heard and understood. And how can that be done?
The chatbot can be programmed in advance to respond to the users in a very calm and reassuring manner; that is how it is going to interact with the users, and in turn the users are going to feel good about it.
Because, of course, if the chatbot, having understood the emotional state of the individual, responds in an aggressive manner or, for example, with a higher pitch or higher tone, then the users are not going to feel comfortable about it.
So, this is what it has to understand: depending upon the state of the user, how should I respond so that they feel comfortable around me. That is perspective taking: how will they feel if I respond in a particular way. And then emotion regulation is this: while all this is being done, the chatbot has to ensure that it does not trigger any negative emotions or feelings in the users, so that there are no negative impacts, right.
For example, if the user has been talking about a particular scenario which is causing distress, the chatbot may not want to poke the user to talk more about that scenario. Of course, keeping aside how therapy actually works, it has to take all these things into account: in what different ways am I going to respond so that I do not invoke negative emotions in the user, which I definitely do not want to do.
So, this is how empathy has traditionally been understood, and this is how it can be transferred to artificially intelligent systems.
Now, let us try to understand two major components related to artificial empathy and empathy generation. One is empathy analysis, and then we will talk about empathy simulation. So, what is empathy analysis? Empathy analysis is basically the first part that we just saw: we want to understand the emotional state of the user and whether the interaction that is happening is empathetic or not.
Now, before we talk about agents or systems, let us understand how this has traditionally been done, or can be done, without automation. In behavioral studies it is very common to study empathy during interactions, for example, the interaction of a doctor with a patient and so on. Based on this, different types of training are also provided.
Another example could be an interaction between a call centre employee and a user who recently had some issues with a credit card or something like that, right. Traditionally, these kinds of interactions have been analyzed, and people are then trained on them so that they can become a bit more empathetic while doing such interactions.
Now, in the absence of AI, or without making use of intelligent systems, how was it being done? Typically, human raters, or what we call external annotators, who were not part of the interaction, were involved in this case.
They would make use of the behavioral cues of the target, which in this case could be the patient or the client, the customer who has called customer care; they would look at the behavioral cues and try to infer the overall state of the interaction.
Of course, if the user is continuously feeling frustrated, maybe not much empathetic interaction is happening, right. So, basically, these external human annotators, number one, observe the behavioral cues of the target, which could be the client, the patient and so on, and based on that they infer whether an empathetic interaction is happening, and at the same time they annotate it.
For example, at what point in time the empathetic interaction was happening and what an empathetic interaction looks like, right. So, this is how it has traditionally been done in behavioral studies.
And when I say behavioral cues, there could be both physiological and behavioral cues. For example, when someone is making a call to customer care, the only modality you have access to is, you guessed it right, the audio modality.
So, the annotators can just monitor the audio modality of the client who has made a call to customer care, and using this audio modality they can try to understand what the emotional state of the individual is and, at the same time, whether the interaction overall is empathetic or not.
Of course, in order to understand whether it is empathetic or not, they also need to look at the entire interaction and the other party involved in it, which could be, for example, the doctor or the customer care employee who is taking the client's call, right.
And, as you rightly guessed, physiological cues could be a bit cumbersome to collect, for example, while analyzing the customer care client, but they could be very much feasible if the target is a patient who is interacting with a doctor, for example, in a mental health counselling setting. Of course, we are not talking about very intrusive physiological cues, but the user can always wear a wearable sensor through which some physiological cues such as heart rate, even blood oxygen for that matter, and things like that can be monitored and understood.
This can be analyzed to understand how the user is feeling and, overall, whether an empathetic process has taken place; a tiny illustration of such an analysis is sketched below. Perfect. So, this is how it has traditionally been done in behavioral studies.
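As a tiny illustration of how such wearable data could be summarized, the sketch below flags a window in which the heart rate is clearly elevated above a personal resting baseline. The baseline value and the threshold are arbitrary assumptions for the example; a real study would calibrate them per participant and use far richer features.
```python
import numpy as np

def elevated_arousal(heart_rate_bpm, baseline_mean, baseline_std, k=2.0):
    """True if the window's mean heart rate exceeds the baseline by k standard deviations."""
    return float(np.mean(heart_rate_bpm)) > baseline_mean + k * baseline_std

# Example: resting baseline of 70 +/- 4 bpm, one short window of samples.
window = np.array([88, 90, 92, 91, 89, 93])
print(elevated_arousal(window, baseline_mean=70.0, baseline_std=4.0))  # True
```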
Now, what is the idea of artificial empathy? The computational empathy analysis studies that are focused on developing artificial empathy want to similarly capture and model the multimodal behavioral cues, the way an external annotator was doing in the case of a behavioral study, right. This is how it has traditionally been done, and, as you guessed, there are multiple physiological and behavioral cues.
Some of these cues we have already studied for understanding emotions; the same type of physiological and behavioral cues can also be used to understand empathy. Rather than just the emotional state, they can be used to understand the empathy that is occurring in the entire interaction and how the user is feeling about it. Let us try to look at some of these cues.
One particular type of cue is as simple as the lexical cues. Now, what are lexical cues? Lexical cues are basically text-based data. You may want to ask: where will we have access to text data, and how can text data be used to analyze the empathetic response? For example, for the entire interaction happening between a patient and a doctor, we can have access to the transcripts of the interaction.
Using those transcripts, we can understand whether there has been any empathetic interaction between the patient and the doctor. Of course, you guessed it right, we will have to use some sort of NLP, Natural Language Processing, here in order to understand what sort of features there are in these transcripts that we can analyze.
On top of those features, we may have to use some machine learning models to understand whether there has been an empathetic reaction. Let us talk about one particular study which is of quite some interest here. Xiao et al., in 2013, made use of N-gram language models to analyze the language transcriptions of therapists and clients in motivational-interviewing-type counselling.
Let us break this down. What is motivational-interviewing-type counselling? It is the type of counselling where the whole job of the therapist or the doctor is to motivate the clients, the patients, towards certain goals or objectives; that is what motivational interviewing counselling is.
Now, what is N-gram modelling? You may have come across it while we were talking about emotions in text, but basically, what is an N-gram? An N-gram is a sequence of N words. For example, let us take a simple phrase, "it looks like". How many words are there? There are three different words.
So, what does it mean? How can we model it using N-grams? If we are using a unigram model, then we are going to look at only one single word at a time. So, as simple as that, the probability of occurrence of "it", P(it), is what a unigram model gives us. Similarly, if we were to look at the probability of "looks" given that "it" has already occurred, that is, P(looks | it), then that is what we call a bigram model.
So, when we are looking at one word at a time, that is a unigram model, and when we are looking at two words, that is a bigram model. Similarly, we can also have a trigram model, you guessed it right, where, for example, we want to know the probability of "like" occurring when the model has already seen "it looks", that is, P(like | it looks), right.
That is what a trigram model is. There are different characteristics of these models, and depending upon the type of task you are looking at, you may want to use a bigram, trigram or even higher-order model. A small sketch of estimating such N-gram probabilities from a toy transcript is given below.
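Here is a small sketch, in Python, of estimating unigram and bigram probabilities by simple counting (maximum likelihood estimation) over a toy transcript; the sentence is a placeholder, and a real study would of course use the full counselling transcripts.
```python
from collections import Counter

transcript = "it looks like it works it looks fine".split()

unigrams = Counter(transcript)
bigrams = Counter(zip(transcript, transcript[1:]))
total = len(transcript)

def p_unigram(w):
    # P(w) = count(w) / total number of tokens
    return unigrams[w] / total

def p_bigram(w2, w1):
    # P(w2 | w1) = count(w1, w2) / count(w1)
    return bigrams[(w1, w2)] / unigrams[w1]

print(p_unigram("it"))          # P(it)       = 3/8
print(p_bigram("looks", "it"))  # P(looks|it) = 2/3
```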
In this particular study, Xiao and his group made use of an N-gram model, in this case a trigram model, to analyze the language transcriptions between the therapist and the client. They were able to show that this works just by using a maximum likelihood classifier, which is a very common type of classifier that assigns the class with the maximum likelihood. In this case, the classifier assigned each utterance of the transcription to either an empathetic category or an other category. So, there are two classes, right: class 1 is empathetic, and class 2 is other, which could be non-empathetic or even neutral.
Basically, what they showed in this study was that by making use of a maximum likelihood classifier, which just picks the class with the highest probability under these N-gram language models that we just saw, they were able to automatically identify the empathetic utterances. And that is fascinating, actually.
They were just looking at the language transcriptions, making use of the N-gram language model, applying a maximum likelihood classifier on top of it, and they were able to identify when an empathetic interaction was happening and when it was not, right. A small sketch of this kind of classifier is given below.
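The sketch below shows the general flavour of such a maximum likelihood decision: one bigram language model (with add-one smoothing) is estimated per class from labelled utterances, and a new utterance is assigned to the class under which it is most probable. The toy training phrases and the use of bigrams instead of trigrams are simplifications of this sketch, not the exact setup of the Xiao et al. study.
```python
import math
from collections import Counter

def train_bigram_lm(utterances):
    """Count unigrams and bigrams over a list of tokenized utterances."""
    uni, bi = Counter(), Counter()
    for toks in utterances:
        toks = ["<s>"] + toks
        uni.update(toks)
        bi.update(zip(toks, toks[1:]))
    return uni, bi

def log_likelihood(toks, uni, bi, vocab_size):
    """Add-one smoothed bigram log-likelihood of one utterance."""
    toks = ["<s>"] + toks
    return sum(math.log((bi[(w1, w2)] + 1) / (uni[w1] + vocab_size))
               for w1, w2 in zip(toks, toks[1:]))

# Toy labelled utterances (placeholders, not from the real corpus).
empathetic = [u.split() for u in ["that sounds really hard", "i hear you", "i understand how you feel"]]
other      = [u.split() for u in ["please fill the form", "what is your id", "next question please"]]

lm_emp, lm_oth = train_bigram_lm(empathetic), train_bigram_lm(other)
vocab_size = len({w for u in empathetic + other for w in u} | {"<s>"})

def classify(utterance):
    toks = utterance.split()
    return "empathetic" if (log_likelihood(toks, *lm_emp, vocab_size) >
                            log_likelihood(toks, *lm_oth, vocab_size)) else "other"

print(classify("i understand that sounds hard"))
```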
(Refer Slide Time: 16:40)
So, this is how you can make use of the lexical cues, or the text-based data, to analyze the empathetic process: to understand not only the emotional state, but also whether an empathetic interaction has happened. Of course, you got it right, you need annotated data in order to make supervised classification work in this case as well. Text is limiting to a certain extent, but nevertheless one of the most common modalities through which empathy has been studied in the literature is the vocal cues.
Basically, the voice: you already understand that the vocal cues, the voice modality of humans, are highly dependent on the internal state, and hence they are a very, very good indicator of the empathetic state of the individual, right.
There have been many different studies done on this. For example, one study, again by Xiao et al., done in 2014, looked at the prosodic patterns. Prosodic patterns, if you recall the lectures on emotions in speech, are features of the speech. They analyzed these prosodic patterns in relation to empathy assessments, again in the same type of setting, where a motivational-interviewing-based therapy was happening.
You can look at this diagram and understand: there is a therapist, there is a client, and they are talking to each other. The whole idea was that the entire audio was getting recorded; the audio recording is how you obtain the vocal cues.
Of course, you will have to do some sort of denoising on top of it, to remove background noise or other kinds of noise. So, first you record the entire audio, then you do the denoising to remove the low-frequency or high-frequency noise, and then you may want to segment the utterances.
For example, the way we were identifying unigrams, bigrams and N-grams, in the same way you want to segment the entire speech into different utterances: speech corresponding to one word, to two words, and so on.
Then, after the segmentation of the utterances, you may want to extract the prosodic features within those segments. So, you may have an entire stretch of audio over time t, but you chunk it into different segments, where, for example, this part belongs to word 1, this to word 2, word 3, word 4, word 5, word 6 and so on.
Of course, this is just an example; you can segment the entire audio sequence in many ways, into one word, into two words, or, as simple as that, into uniform segments of 5 seconds or 10 seconds each and so on.
Once you have the segments, then for each segment, what did they do? They extracted the prosodic features of the speech, did some feature quantization on top of it, and looked at the distribution of the prosodic patterns over the different segments.
And, as always, you need to have an external annotator, an external observer, who can look at the entire interaction and annotate for you: ok, this particular segment was empathetic, this particular segment was not.
So, there is an external annotator or observer who provided the empathy ratings of the therapist, that is, whether the therapist was being empathetic in a particular segment or not. And once you have the empathy ratings, you simply train some model and do the automatic inference of the whole thing.
That was the pipeline that Xiao et al. proposed for the vocal cues. For the prosodic features of each speech segment, they looked at both the therapist's and the client's data. Now, you may want to ask: if they wanted to understand the therapist's empathy ratings, then why did they collect the client's prosodic features, why did they look at the client's audio signal? Any guesses?
This is quite easy to understand. The idea is to understand the empathetic effect on the target, and the target here is the client. You want your client to have the feeling of being empathized with, and unless you analyze the vocal modality of the client, you can never know whether what the therapist was intending was actually having an impact or not, right.
So, you will have to look at both the therapist and the client. The prosodic features included many different features such as vocal pitch, energy of the signal, jitter, shimmer and the speech segment duration itself, one simple feature in this case, ok. A small sketch of extracting a few of these features is given below.
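As a rough sketch of what extracting a couple of these prosodic features for one pre-segmented utterance might look like, here is a Python snippet using librosa; the file path is a placeholder, the parameter choices are arbitrary, and jitter/shimmer, which usually need a dedicated tool such as Praat/parselmouth, are omitted.
```python
import numpy as np
import librosa

# Load one pre-segmented utterance (path is a placeholder).
y, sr = librosa.load("segment_001.wav", sr=None)

# Fundamental frequency (pitch) track via probabilistic YIN; NaN for unvoiced frames.
f0, voiced_flag, _ = librosa.pyin(y, fmin=librosa.note_to_hz("C2"),
                                  fmax=librosa.note_to_hz("C7"), sr=sr)
pitch_mean = np.nanmean(f0)             # mean pitch over voiced frames (Hz)

# Short-time energy via RMS, averaged over the segment.
energy_mean = librosa.feature.rms(y=y).mean()

# Segment duration in seconds.
duration = len(y) / sr

print(pitch_mean, energy_mean, duration)
```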
They looked at all these features, and what they could show with the help of the results is that they found a group of significant empathy indicators which could predict low versus high empathy in this case.
And I have an example here. They found that an increased proportion of medium-length segments with high energy and high pitch was associated with a low-empathy situation. It should not be surprising, because a segment with high energy and high pitch may suggest that the therapist is using a louder voice and has a raised intonation.
And because of this, the therapist's response within that particular segment may not be considered highly empathic. These are some of the ways in which they were able to identify how the different prosodic features relate to low and high empathy in the responses.
This is how they showed that, using vocal cues as well, you can analyze the empathetic nature of a particular interaction, perfect.
I will not talk about how the empathetic responses can be evaluated in all the different modalities, but I guess you got the idea. You already saw that empathetic responses can be understood with the help of lexical analysis; you just look at the text data of the transcription of the entire interaction.
You can also look at, for example, the vocal cues and understand whether an empathetic interaction has happened. Similarly, it is quite easy to see that facial expressions can be used too: if you can analyze the facial expressions of the respondents, you can understand whether they have been experiencing an empathetic interaction or not.
Motivated by this idea, one of the first studies, by Kumano and their group in 2011, wanted to understand whether the co-occurrence of facial expression patterns between different individuals who were part of the same group could be related to the empathy labels, to the empathetic responses, right.
You can see this in the picture: on the left-hand side you have a high-level view of the entire interaction as it happened. This is an image from the same paper, where there were four participants sitting facing each other and having a discussion, and this is the individual participants' camera view highlighting each individual's facial expressions.
Now, what did they try to do in this study? To take a step back, they wanted to understand if the co-occurrence of facial expression patterns, meaning the facial expression of participant 1 together with those of participants 2, 3 and 4, can be used to tell whether there has been an empathetic response or an empathetic interaction within a particular segment.
So, what did they do? They categorized the facial expressions into six types: neutral, smile, laughter, wry smile, thinking and, of course, others. Others could be anything which is not among these.
For example, expressions of disgust, shame and those kinds of things, right; basically, other feelings. So, these were the six types into which they classified the facial expressions of an individual. How to do the classification of facial expressions? I think you already know the answer; we have already talked about it in the lectures on emotions in facial expressions.
Next, they not only looked at the facial expressions, they also looked at the gaze patterns. And you can understand that when we look at the facial expressions along with the gaze patterns, it simply means we are creating a multimodal system.
The reason why we make use of a multimodal system is the hypothesis that its performance will be better in comparison to a unimodal system, such as one making use of facial expressions only, ok.
In this case, they also looked at the gaze patterns of all the participants and classified them into three categories. One is mutual gaze: mutual gaze is the label where the participants were looking at each other.
For example, participants 1 and 2 may be looking at each other, so there is mutual gaze between these participants. Similarly, what is one-way gaze? One-way gaze is when one of the participants is looking at the other participant, but the other participant is not looking back at the first one.
And what is mutually averted gaze? You guessed it right: mutually averted gaze is when neither of the participants is looking at the other. Nevertheless: facial expressions, six states; gaze patterns, three states. Their objective was to understand whether, by looking at facial expressions and gaze patterns, we can predict the empathy label of the interaction in a particular segment.
In this case, rather than using a binary classification, they made use of a three-class classification problem, where the classes were empathy, unconcern and antipathy; so they looked at antipathy as well, as the third class. There were three classes; you got the idea, right, ok.
Facial expressions and gaze patterns were the input data; the empathy state is the label. Now, let us see what they found. The results showed that the facial expressions were effective predictors of the empathy labels. This is not very hard to understand, but nevertheless it was good to see that, using automated analysis, they were able to show how empathy can be analyzed by making use of facial expressions. A small sketch of such a classification setup is given below.
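A hypothetical sketch of such a setup is given below: each segment is summarized by counts of the six expression categories and the three gaze categories, and a simple multi-class classifier maps these counts to empathy, unconcern or antipathy. The feature encoding and the random toy data are simplifications of this sketch, not the actual method of the Kumano et al. paper.
```python
import numpy as np
from sklearn.linear_model import LogisticRegression

EXPRESSIONS = ["neutral", "smile", "laughter", "wry_smile", "thinking", "other"]
GAZE = ["mutual", "one_way", "averted"]
LABELS = ["empathy", "unconcern", "antipathy"]

def segment_features(expr_counts, gaze_counts):
    """Concatenate per-segment counts of expression and gaze categories."""
    return np.array([expr_counts[e] for e in EXPRESSIONS] +
                    [gaze_counts[g] for g in GAZE], dtype=float)

# Toy training data: random count vectors standing in for annotated segments.
rng = np.random.default_rng(0)
X = rng.integers(0, 5, size=(30, len(EXPRESSIONS) + len(GAZE)))
y = rng.integers(0, 3, size=30)   # 0 = empathy, 1 = unconcern, 2 = antipathy

clf = LogisticRegression(max_iter=1000).fit(X, y)

new_segment = segment_features(
    {"neutral": 1, "smile": 3, "laughter": 1, "wry_smile": 0, "thinking": 0, "other": 0},
    {"mutual": 2, "one_way": 1, "averted": 0})
print(LABELS[clf.predict([new_segment])[0]])
```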
If you go ahead and read the paper, you will find that they had some interesting observations about the gaze patterns as well, and also about what was happening when they combined the facial expressions and the gaze patterns. Nevertheless, overall, you got the idea that in order to analyze the empathy during an interaction, you can look at the physiological data and you can look at the behavioral cues.
There are any number of examples; we just talked about three, which looked at how the lexical cues, the vocal cues and the facial expressions can be used to understand the level of empathy during an interaction, ok.
Similarly, you can take the idea further and make use of other modalities, such as the ones we talked about in our previous classes for understanding emotional states; you can also use them to understand the relation of those modalities with the empathy during a particular interaction.
So, this is how you can do the analysis of empathy, perfect. Now, what is the next step? The next step is empathy simulation, which is closely related to the development of artificial empathy. Once you have understood how empathy can be analyzed, what is the next step?
To understand how we can have an artificial embodiment and display of these empathetic behaviors in virtual or robotic agents, which can display artificial empathy as perceived by the human users. We already know how it can be analyzed.
Now, the next step would be to put these behaviors into the virtual and the robotic agents. Of course, a bit of a warning here: whatever empathetic responses these agents are going to display, they are not going to be truly empathetic, right. It is as simple as that.
That is because they are not sentient beings; no virtual agent or robotic agent is a sentient being. And unless and until it is a sentient being, it can never feel the emotion, and since it can never feel the emotion, it cannot display a truly empathic response.
I hope we understand this, but nevertheless there is hope. The hope is that it has been shown by previous research, such as that by Tapus and her group, that we do not really need the agents to show truly empathetic behavior.
What we really need is a simulation of human-like behavior by the virtual agents that can invoke a perception of empathy in the user; this is feasible and is useful for most experimentation and application purposes, right. So, the idea is not to make the agents feel empathy, but just to simulate human-like empathetic behavior in the agents, and that is going to serve us well in the long run.
That is what gives us a lot of hope, and that is why we are talking about what artificial empathy is and how we can generate it, in the hope that this artificial empathy, even though it is not true empathy, can really help us in making the responses more empathetic and hence the interaction better.
Now, it turns out there have been mainly two different directions in which previous research, the literature, has tried to do this simulation of human-like behavior. We understand that a truly empathetic response cannot be generated, so we can only simulate human-like behavior, and previous research has mainly focused on two different methodologies to generate this simulation.
One particular direction has been driven by a computational model of the emotional space. Basically, it is driven by the theory behind the generation of empathy in humans, and the broader idea is: if we can understand the theory behind the generation of empathy in humans, we can create a cognitive or computational model of that and use it to generate empathy in the virtual or robotic agents. Of course, this is not so easy to do.
The other approach, which many studies have been following, is mostly driven by data: it is driven by the user and the context in which we want to apply empathy in a particular application. If we have data related to that, previous research has tried to make use of that data to generate the empathetic behavior. We will look at both of these to understand how it can be done, perfect.
So, as I said, there are two ways: one is the computational model and the second one we can call the data-driven model. Let us try to understand both of them, how we can make use of these two models and how previous research has used these two different methods to generate empathy. The computational model, as I said before, is basically about understanding how empathy is generated in humans and how we can replicate the same in machines. Many different models have been proposed in this sense to understand how empathy is generated among humans.
For example, one particular model, proposed by Boukricha and their group, says that generating the empathetic reaction has three steps. The first step is the empathy mechanism, which is the understanding of the emotional state of the user, the target in this case. The second is empathy modulation.
What do you want to do there? You want to understand, as simple as that, what the level or severity of the emotional state is and what the relationship is between the target and the observer, the other individual who is interacting.
And, for example, what the demographic information associated with the user is. Using all of these, the empathy is modulated; it is weighted by many different factors. Of course, if there is a sense of familiarity between the patient and the doctor, there is going to be more empathy; if there is a liking between the patient and the doctor, there can be a bit more empathy.
All these aspects come under empathy modulation. So, first, try to understand the emotional state of the user; second, modulate the empathy; and third, express the empathy, which basically means making use of the different physiological and behavioral cues to express behavior that looks empathetic to the users. This is how one three-step model has traditionally been proposed; a small structural sketch of it is given below.
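A small structural sketch of this three-step idea, written as plain Python functions, is given below; it is only meant to show the shape of the pipeline, and every value and rule in it is a placeholder rather than part of Boukricha et al.'s actual model.
```python
def empathy_mechanism(observations):
    """Step 1: estimate the target's emotional state from sensed cues."""
    # In practice this would be a trained recognizer over voice, text, face, etc.
    return {"emotion": "anxious", "intensity": 0.8}

def empathy_modulation(state, context):
    """Step 2: weight the empathic response by relationship and other factors."""
    degree = state["intensity"]
    degree *= context.get("familiarity", 1.0)  # more familiarity -> more empathy
    degree *= context.get("liking", 1.0)       # more liking      -> more empathy
    return min(degree, 1.0)

def empathy_expression(state, degree):
    """Step 3: map the modulated degree onto behavioral output channels."""
    return {"wording": "supportive" if degree > 0.5 else "neutral",
            "voice_pitch": "low", "speech_rate": "slow"}

state = empathy_mechanism(observations={"voice": "...", "text": "..."})
degree = empathy_modulation(state, context={"familiarity": 0.9, "liking": 1.1})
print(empathy_expression(state, degree))   # degree = 0.8 * 0.9 * 1.1, roughly 0.79
```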
(Refer Slide Time: 35:55)
Let us try to see how this can be applied: if we were to use this particular model to generate empathy, how could it be applied to our earlier example, the AI-empowered chatbot for mental health purposes? So, what can you do? Of course, the first step is the empathy mechanism.
Basically, you want to understand and analyze the user's emotional state. In the case of the chatbot, which is interacting with the human, maybe it has access to the user's voice data, as we just saw, and maybe to the transcription of the entire interaction that is happening.
It can look at those cues: it can look for specific words or phrases, it can look at the prosodic features, it can look at the N-grams, remember, we just talked about those, and it can figure out the emotional state of the user in this interaction.
Having understood the emotional state of the individual, it can then also adopt a particular emotional state of its own; this is very, very important, please pay attention to this point. The chatbot can adopt a particular emotional state in response to the emotional state of the target.
For example, if the target is feeling stressed, anxious and so on, then of course you do not want your chatbot to respond in a very jolly manner. It needs to look like it is concerned about the user's, the target's, current state, right. So, it needs to have a mood, the state of the chatbot, and that mood should show a bit of concern, a bit of empathy, rather than, for example, showing very jolly behavior, always smiling and laughing, which is not going to be very sensitive in this interaction. That is the empathy mechanism.
Next, once you have understood the user's emotional state and what the chatbot's mood could be according to that, the chatbot would like to modulate its empathic response based on many different things. For example, if it has access to the user's history with the chatbot, it may know some extra context, some more information, and it can modulate with respect to that.
It can also take into account the severity of the anxiety: maybe the individual is feeling only slightly anxious, maybe the individual is feeling highly anxious, and according to that the state of the chatbot, or whatever responses the chatbot is going to give, can be modulated.
And, as I said, it also needs to look at the user's demographic information, such as age, gender and so on; they may all play a role in this case. Using all this information, the response of the chatbot can be modulated, right. For example, the chatbot may make its response more or less empathic depending upon the situation: if the user is very distressed, maybe more empathetic; if the user is less concerned, maybe less empathetic. A tiny numerical sketch of such a modulation is given below.
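As a toy illustration of this modulation step for the chatbot, the function below combines an estimated anxiety severity with two simple context flags into a single empathy level; the weights and factor names are invented for the example and would need to be designed and validated properly in a real system.
```python
def modulate_empathy(severity, has_history=False, vulnerable_group=False):
    """Return an empathy level in [0, 1].

    severity: estimated anxiety severity in [0, 1] from the emotional-simulation step.
    has_history: True if the chatbot has prior sessions with this user.
    vulnerable_group: True if demographics suggest extra care is warranted.
    """
    level = 0.4 + 0.5 * severity        # baseline plus a severity-driven component
    if has_history:
        level += 0.05                   # familiarity slightly raises empathy
    if vulnerable_group:
        level += 0.05
    return min(level, 1.0)

# A highly anxious returning user: 0.4 + 0.5 * 0.9 + 0.05 = 0.90
print(modulate_empathy(severity=0.9, has_history=True))
```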
Now, the next thing, having understood the emotional state of the user and the degree to which the response has to be modulated, is that the chatbot has to express the empathy. How can the chatbot express empathy?
Basically, the chatbot can express empathy through its messages, for example, if it is interacting only through chat, by using words and phrases that convey support, understanding and validation. The same thing can be done by making use of the vocal cues, the voice modality, if the chatbot is, for example, a voice assistant.
This is how the expression can be done, and you rightly guessed that, for example, you may not want a very high pitch or high tone in the chatbot's voice when it is expressing empathy, because we just saw a few moments back that in one of the studies Xiao et al. showed that high pitch and high energy are associated with lower empathy.
So, this is how you modulate the degree to which the empathetic response is shown to the target user. I hope it is making sense so far how empathy can be simulated with the help of a computational model, for example, in an AI-empowered chatbot.
That was the first way in which empathy can be generated: with the help of a computational model, trying to understand how empathy evolves in humans and how it is expressed in humans, and, having understood that, trying to replicate the same in an artificial agent, which may not be so easy to do.
This is a bit difficult actually, and we will talk later about the challenges associated with it. The method that is more common in the community, and that has been exploited more, is basically the data-driven approach. In the data-driven approach, the model is very specific to a particular application; it is oriented towards the user and based on a particular context only. We will see more about it in a moment.
So, what happens in this case? The system tries to learn the context of the human empathetic behavior and tries to understand, in a particular scenario, when humans display emotions and how they display emotions, ok. It is not concerned with the cognitive model of the generation of empathy in humans.
It only asks: can I understand when they are displaying emotion, and how they are displaying it? That is it. It does not try to model the cognitive mechanism that guides the when and the how. That is what the data-driven approach has been.
For example, in one particular work, by McQuiggan and Lester in 2007, they designed a framework which they call the CARE framework. In this CARE framework, they collected the behaviors of a virtual agent that was being controlled by a human who was acting in an empathetic manner, right.
For example, expressing frustration when the user was losing a particular game. The entire data of this interaction was recorded, and it was then further used to train a virtual agent to display empathetic behavior in response to when the user displayed a particular behavior, and according to how the user displayed it.
Let us try to see this with the help of this example. Imagine that this is your virtual agent, which you want to train to show some empathetic behaviour. And then there is a human who is the empathizer in this case; you want to see what this human is doing and learn from this particular human.
Basically, what happens is that there is a trainer interface, and through this trainer interface the virtual agent and the empathizer, the human, interact; through this interface, the human controls how the virtual agent responds.
Of course, it is completely controlled by the humans. This virtual agent's behavior for
example, what is the type of the message they are going to send or for example, what is the
they are going to say, everything is being controlled by this human.
Now, there is a virtual environment which has access to all the data of this interaction. When I say the virtual environment has access to the entire data, that includes all the temporal attributes (what was happening at which point in time), the locational or spatial attributes, and the intentional attributes.
So, basically, all the physiological and behavioural cues that were observed, when they were observed, and how they were synchronised with each other: all of this data the virtual environment was able to record.
Once the virtual environment had access to all this data, what did it do? Using this data, it trained an empathy learner module, which is an intelligent module. In this case they simply made use of a Naive Bayesian decision tree, which you may recall from your ML classes, a very simple model, trained on this observational data: the physiological and behavioural cues, when they occurred, and at the same time the type of response the human generated and in response to what.
For example, we just saw that when the user was playing a game and losing it, the human controlling the agent would feel frustrated on their behalf and would display that frustration, maybe by saying something like "I am losing the game". When they said it, what they felt before and after, and how they expressed it, all of this data is fed to the machine learning model, the empathy learner. They simply used a Naive Bayesian decision tree, and with its help they were ultimately able to create a model that could be deployed in real time.
So, what was this model used for? Making use of it, the system can do two things. The model feeds an empathetic behaviour manager, which generates the empathetic behaviour: first, it can interpret the interaction, that is, understand whether the ongoing interaction is empathetic or not, and second, it can decide when empathy should be displayed and how it should be displayed.
For example, it may increase or reduce the pitch and tone of the voice, or say something like "oh, I am not feeling happy about this", and so on. These are the behaviours that the empathetic behaviour manager now has knowledge of, thanks to the empathy learner module trained on the observational data provided by the humans.
Then all of this is fed to a user interface through which a new user is going to interact. Imagine this is your AI chatbot with which another user is now interacting. Since the chatbot has been trained on data that was generated by humans, it will show empathy with similar timing and similar expressions, and to the end user it may look like human-like empathetic behaviour, which was the end goal of the entire model. So, it is a pretty interesting framework that these researchers created; for more information, I would invite you to go through the paper.
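To give a flavour of this data-driven idea in code, here is a small, hypothetical sketch: a simple classifier is trained on logged observational features to predict which empathetic behaviour the human trainer chose. The feature names, labels and data below are invented for illustration and are not the actual CARE pipeline described by McQuiggan and Lester.

# Hypothetical sketch of a data-driven empathy learner (not the real CARE code).
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Each row: [time_in_game, distance_to_goal, user_losing(0/1), user_arousal]
X = rng.random((200, 4))
# Label: which empathetic behaviour the human trainer displayed
# 0 = no reaction, 1 = encouraging message, 2 = frustrated/parallel empathy
y = rng.integers(0, 3, size=200)

empathy_learner = DecisionTreeClassifier(max_depth=3).fit(X, y)

# At run time, the empathetic behaviour manager queries the learned model
# with the current observation and displays the predicted behaviour.
current_observation = np.array([[0.7, 0.2, 1.0, 0.8]])
print(empathy_learner.predict(current_observation))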
(Refer Slide Time: 48:24)
The CARE framework is not the only work that has generated empathetic behaviour following the data-driven approach. As I said, most of the current work, and some of the earlier work, has used the data-driven approach to generate affective, empathetic behaviour.
For example, D'Mello and their group in 2013 built a very nice system known as Affective AutoTutor, in which they created an artificial virtual agent, which I hope you can see here, programmed to act in an empathetic and motivational manner towards students, for example those who are feeling frustrated about the material being taught to them, in real time.
The way they did it, as I said, was entirely data driven. The system prepared in advance a set of facial, prosodic and verbal responses. What sort of facial, prosodic and verbal responses?
They looked at the responses of teachers: what type of facial expressions teachers make when they see their students struggling, what kind of speech modulation they use, and what kind of verbal responses they give in that situation.
For example, a teacher may say, "I know this material can be a bit difficult, but do not worry, we can do it together; let me help you understand it." All of these responses were programmed in advance for this affective agent.
Then, in order to do this in real time (I hope this diagram is visible to you), they were looking at many different cues from the users: their facial expressions, their gaze patterns, and also their body postures. For example, they could monitor the movements within the chair the user was sitting in, and they also had access to conversational cues in the form of voice data.
By observing all of these cues, they created a rule-based scheme to select the proper response. As simple as that: if the pitch, tone or energy in the user's speech looks like this, maybe the user is frustrated, so generate this particular type of response.
So, this can be an if-then-else kind of rule. They actually evaluated this model, and the results showed that students who had low prior knowledge in the subject gained more knowledge with the Affective AutoTutor than with a neutral version of it. Basically, since the tutor was affective and empathetic, the students were perhaps able to gain a better understanding of the material. So, this is another interesting study that was entirely data driven: by looking at previously collected data, they were able to build an affective tutor that was empathetic in its responses. Perfect.
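As a toy illustration of such an if-then rule scheme, and not the actual AutoTutor implementation, consider the following sketch; the thresholds, detected states and canned responses are all assumptions made up for this example.

# A toy if-then rule scheme in the spirit of the lecture's description.
def select_tutor_response(state: str) -> str:
    responses = {
        "frustrated": "I know this material can be a bit difficult, "
                      "but do not worry, we can work through it together.",
        "bored": "Let us try a quicker, more challenging example.",
        "confused": "Let me rephrase that last explanation in a simpler way.",
    }
    return responses.get(state, "Great, let us continue.")

def detect_state(pitch: float, energy: float, frown: bool) -> str:
    # Hypothetical thresholds standing in for the real multimodal detectors.
    if frown and energy > 0.7:
        return "frustrated"
    if energy < 0.2:
        return "bored"
    if pitch > 0.8:
        return "confused"
    return "neutral"

print(select_tutor_response(detect_state(pitch=0.5, energy=0.9, frown=True)))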
So, I hope you now understand that there are two different ways in which previous research has been able to generate artificial empathy: one is by making use of a computational model of empathy, the other is the data-driven approach.
Now, let us try to understand some of the limitations of both these approaches. For the computational model of empathy, the very first problem is that empathy itself is a very complex construct. It is expressed through multiple multimodal behavioural cues, and it involves at least two individuals, remember, a doctor and a patient, or a customer care agent and a client, for example, and in many cases even more, such as when a group conversation is happening.
So, it is a very complex scenario, and that makes the job very challenging: understanding the group dynamics, what is happening when a (Refer Time: 53:03) conversation is taking place, and what the cognitive modelling behind it is.
It has not been very easy to understand, and in order to understand it there are many different things to work out. We have to understand what stimuli trigger the empathetic reaction, and then how we can perceive the behaviour between two agents, or within a group.
We have already seen physiological and behavioural cues and so on, but what is the most feasible and optimal way to use them? Once we have understood the perception of the behaviour, we then need to understand how to modulate the empathy by looking at different factors such as demographic information or the severity of the emotional state, and finally how to express the empathy. These are the different things that need to be addressed in order to create an empathetic response or behaviour, and all of this together is quite complex. On top of that, we have to understand that the definition of empathy itself is not commonly agreed upon.
There are some definitions that researchers have been using in different contexts, but neither has a definition of empathy been agreed upon, nor is there an established understanding of the cognitive model of empathy. Hence, making computational use of that model, or putting it into virtual agents, has been a challenging task so far.
(Refer Slide Time: 54:48)
So, that is about the computational modelling of empathy. Now, what are the limitations of the other approach, the data-driven approach? You may have guessed it: with the data-driven approach the problem is the data itself. Researchers have been facing two types of limitation, one in terms of quantity and one in terms of variety or quality.
What do we mean by quantity? As we discussed before, existing works have mostly made use of audio recordings only, that is, a unimodal signal, to understand the empathetic reaction. And even those audio recordings have been obtained from only a few large-scale psychotherapy studies, totalling maybe a few thousand sessions, not more than that.
You also have to understand that there are lots of ethical issues around collecting this kind of data, so its availability has not been very good. On top of that, we do not just want raw data: if you want to create a machine learning model, you want the data to be annotated as well, and that is the other problem. First the data itself is limited, and then whatever data is available is often not annotated properly.
When I say it is not annotated properly, I mean that the first thing we want is annotation of the emotional state, or psychological assessments of the mental and behavioural state, of the target as well as of, for example, the doctor or the customer care employee. At the same time, all of this has to be time-synced: we need time-synced transcripts and time-synced speech segments in order to train and validate a proper machine learning model. So, this is where the data has been fairly limited in terms of quantity. And, as I said, in terms of quality it has also been fairly limited, with respect to modalities and with respect to scenarios.
What do we mean by limited with respect to modalities? Most of the time only audio data has been made available to the community for analysis, and often this is due to confidentiality and privacy issues: for example, in psychological counselling, where most of the work has been done so far, video and physiological data are simply not collected and hence not available.
The other problem is with respect to scenarios. Empathetic interactions matter in other settings as well, such as education, customer service and medical care, and in these scenarios we could generate more data that is time-synced and has multiple modalities, and where much of the work can be done to build more effective data-driven models for generating empathy.
Affective Computing
Prof. Jainendra Shukla
Prof. Abhinav Dhall
Department of Computer Science and Engineering
Indraprastha Institute of Information Technology, Delhi
Week - 09
Lecture - 03
Evoking Empathy
So, when we talk about evoking empathy in these types of agents, there are certain elements that we need to consider, and most of the time these are the elements that we manipulate to obtain a particular type of empathetic response. Number one, we need to look at the type of behaviours these agents are going to exhibit; we also need to look at their appearance and their features. These three things, the type of behaviour, the appearance and the features, are again motivated by the anthropomorphic design. So, you want to decide what behaviours your robot, for example, is going to have: is it going to speak, is it going to walk, is it going to gesture, and so on.
Appearance: is it going to look like a human, like an animal, like a toy, or caricatured, and so on. Similarly, what type of features is it going to have: is it going to be black in colour or white in colour, and what will its height be? For example, if it is going to look like a human, what will be the height of this humanoid machine, robot or agent, and so on.
Mostly, these three things are already well understood and are motivated by the anthropomorphic design that we just saw. Many times, we also want to look into the context and the situation characterising the occurrence of the event that leads to a particular emotion.
For example, suppose you are creating a website for a user and you place an embodied agent, a chatbot, on that website. Then you may ask: why exactly would the user show a particular type of emotion towards the embodied agent you have placed on the site?
Maybe the context is that the agent was doing very well and helped the user in a particular situation, and accordingly a particular emotion is evoked in the human interacting with this embodied agent. Then there are different types of mediating factors that can also be considered and manipulated. For example, we can take into account the relationship between the observer and the target, here, between the agent and the human in whom you are trying to evoke a particular emotion, as simple as that.
For example, suppose your user is a female who is coming to the site with some query. Accordingly, you may want to create an embodied agent that feels more comfortable for this user, which could be male or female, depending on whatever characteristics you think will make it more appealing to that particular user.
Then there are other mediating factors. For example, you may want to look at the mood of the observer: how the user is behaving, or what their mood is in general when they are trying to access the particular service.
So, these are certain elements that you want to consider and manipulate when you want to evoke an empathetic response among humans, or among other agents, while they are observing a particular agent. Next, there is a very important question: how are we going to measure the empathetic responses that occur in these or any other situations? We are going to talk about this in more detail in the upcoming slides. But in general, these are the five things you may want to look into while trying to evoke an empathetic response.
We will see some examples and it will become a bit clearer. The first example to look at is how virtual agents can evoke an empathetic response. On the right-hand side, what you are looking at is a figure from a paper by Slater and his group on bystander responses to a violent incident in an immersive virtual environment.
The idea is quite simple. We all see violent incidents as we go about our daily lives, but it does not always happen that whenever there is a violent incident, we or the public around us intervene and try to stop it.
It turns out that, among many factors, social identity is very critical when we want to analyse the bystander's response in these situations. It has been shown that if a conflict involves a particular individual or group that you associate yourself with, then the chances are higher that you are going to get involved.
For example, if you as a student are passing along a street and you see some other students being harassed or bullied by a group of adults, then the chances are high that you will intervene, because you associate yourself as a student: there is a social identity of being a student that you both share.
Similarly, it can be on the basis of race, religion and so on. Of course, the thing is that it is very hard to analyse and understand this type of response in real settings; you cannot just stage such incidents in order to study them, as that could be quite unethical.
So, what they did in this paper was to create two different virtual agents. One agent, as you may be able to see, has a particular logo signifying that this individual is a supporter of a particular football club, while the other individual is a supporter of a different football club. That is the social identity they carry, and it is clearly visible from the T-shirts they are wearing: think of one as a supporter of football club X and the other as a supporter of football club Y.
This was the scenario presented to the participants, and the question is very simple: how often are the humans going to respond or intervene when this kind of conflict escalates slowly from verbal aggression to physical aggression, maybe with pushing? They wanted to understand when exactly the participants would intervene. It turns out that the more the participants perceived that the victim was looking to them for help, the greater the number of interventions, and this occurred for the in-group rather than the out-group.
For example, if one individual was a supporter of football club X, the other of football club Y, and the supporter of X was being harassed, then when the participant also identified themselves as a supporter of club X, they showed more interest in intervening than when the victim belonged to the other group.
So, it clearly validated the hypothesis that social identity is critical as an explanatory variable in understanding the bystander's response. This is a very nice example of how virtual agents can evoke empathy among humans through their design, their appearance and their behaviour.
You can see that in this case the agent had an appearance that made it look like a fan of a particular football club, and if you also identify yourself as a fan of that club, then you are willing to intervene and help this agent. That is the kind of emotional response, the empathetic behaviour, you show towards the virtual agent. And the only reason you do so is that the virtual agent evoked that response in you through its particular design, appearance and behaviour. So, this is a very good example of how empathy can be evoked by virtual agents.
(Refer Slide Time: 10:23)
Here we have another very good example, this time of how empathy can be evoked by robots. For those who do not know this particular robot, it is the Kismet robot. Kismet has a very expressive mechanical face with lots of anthropomorphic features.
What we are going to do is watch a video of the Kismet robot, in which Professor Cynthia Breazeal talks about how this robot was built and what different capabilities it has. That will give you an idea of how, through its mechanical expressions, Kismet can evoke emotional responses in the observers or in the people it is interacting with. So, let us look at the video now.
(Refer Slide Time: 11:34)
So, here we have the video. Kismet is an anthropomorphic robotic head that is specialised for face-to-face interaction between humans and the robot. Kismet can express itself in three modalities. One is through tone of voice: we can actually have the robot sound angry when it is angry, sound sad when it is sad, and so forth. Another is through facial expression, which we have talked about: smiling when it is happy, frowning when it is sad, and so forth. And body posture is also critical: approaching and leaning forward when it likes something, withdrawing when it does not like something.
(Refer Slide Time: 12:25)
So, another important skill for the robot, to be able to learn from people, is being able to recognise communicative attempts. The way we have done that with Kismet right now is to have the robot recognise, by tone of voice, whether you are praising it or scolding it. So, we have to give the person expressive feedback: in the case of praise the robot smiles. Look at my smile. In the case of prohibition the robot frowns. Yeah, I do. Where did you put your body? For an attentional bid the robot perks up. Hey, Kismet, ok (Refer Time: 14:15) Kismet. Do you like the toy?
So, again, to close the loop, it's critical not only that the robot elicit this kind of prosody that
people will naturally give. But then the people can actually see from the robot's expression
and face that the robot understood. One very critical point of Kismet is that its responses have
to be well matched to what people expect and to what is familiar to people. By doing so, we
make the robot's behavior understandable intuitively to people. So, they know how to react to
it, shape the responses to it.
By following ideas and theories from psychology, from developmental psychology, from
evolution, from all of the study of natural systems and putting these theories into the robot
has the advantage of making the robot's behavior familiar, because it is essentially life like.
Do you laugh at all? I laugh a lot (Refer Time: 15:23) I laugh a lot.
Yeah, I do.
This is a watch that my girlfriend gave me.
Yeah, look, it has got a little blue light in it too. I almost lost it this week.
I do not know how you do that. You know what it is like to lose something?
We do not, too.
No.
Stop, you gotta let me talk. Shh, shh, shh. Kismet, I think we got something going on here.
So, now you can see how the Kismet robot interacts with humans, how it is able to show some empathetic responses to the humans, and how in turn it is able to elicit empathetic responses from the humans. It is quite mechanical, but very anthropomorphic in this sense. So, this is another example of how robots can generate empathy.
So far, we have looked at how empathy can be evoked by agents among humans, and how agents can steer the emotional response of the humans in their own favour or as the situation demands. Now, let us try to understand how we can generate empathetic behaviour in virtual and robotic agents in response to our emotional state, that is, in response to the humans.
This is the second type of agent that we are going to look at: agents that respond emotionally in a way that is congruent with the emotional situation of the user or of other agents.
(Refer Slide Time: 17:45)
Here, we are going to look at a very famous example, the Empathic Companion, a paper by Helmut and Mitsuru, where they described a character-based interface that can measure affective information from the user and even address the user's affect by employing embodied characters.
On the right, I hope you can see the diagram: there is a character-based interface set in a virtual interview scenario. In this virtual interview, certain questions are presented to the candidate, which is you, the human user, and you can answer them. While these questions are being presented, the physiological data of the candidate, which in this case includes skin conductance and electromyography (EMG), is analysed in real time.
The signals are captured, processed and classified, and with this the system tries to understand the emotional state of the user; based on that state, the character tries to give a certain response, an empathetic response suited to that particular situation.
For example, maybe a question is posed to you such as how long you have been working, whether you are a fresher or have some professional experience, and you have certain options to answer. We can all agree that this is not a very comfortable question, especially when you do not have a lot of relevant experience. In that situation, your skin conductance and EMG may indicate that you are experiencing stress or a negative emotion. Identifying this, the character is able to give an empathetic response, saying for example "it seems you did not like this question so much", or "maybe you are under some stress", or "let me ask a different question", as the situation demands.
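As a rough, purely illustrative sketch of this kind of real-time loop, and not the paper's actual classifier, consider the following; the thresholds and the feature extraction are assumptions.

# Illustrative only: map skin conductance (SC) and EMG windows to a response.
from statistics import mean

def estimate_affect(sc_window, emg_window):
    # Assumption: higher skin conductance is read as higher arousal; higher
    # EMG (e.g. brow muscle activity) as more negative valence.
    arousal = mean(sc_window)
    valence = -mean(emg_window)
    return arousal, valence

def companion_response(sc_window, emg_window) -> str:
    arousal, valence = estimate_affect(sc_window, emg_window)
    if arousal > 0.6 and valence < 0.0:
        return ("It seems this question is a bit stressful. "
                "Would you like me to ask a different one?")
    return "Thank you, let us move on to the next question."

# Example with a window of normalised sensor samples.
print(companion_response(sc_window=[0.7, 0.8, 0.75], emg_window=[0.5, 0.6, 0.4]))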
So, this was the type of interaction that happened in this character-based interface, and it turns out that the empathic feedback the character provided had a positive effect on the interviewee's stress level. You will have to go to the paper to read more about it, but they showed that when they compared empathic feedback with non-empathic or no feedback, the overall interview stress level was lower when empathic feedback was provided.
So again, a very good example of an agent observing your emotional state and adapting to it in order to make you feel a bit less stressed. That is the Empathic Companion; I would definitely invite you to look at the paper to get more details.
(Refer Slide Time: 21:25)
Again, we have another very good example, this time a robot. This robot is known as the Shimi robot and it is from Georgia Tech. We are going to watch a video of this robot in order to understand what it does.
(Refer Slide Time: 21:49)
So, Shimi is a personal robot that can communicate with humans, but its communication is driven by music. Everything we do here at the Center for Music Technology at Georgia Tech is driven by music, and the way he communicates, verbally, with audio and with gestures, is based on deep learning analysis of music datasets and motion datasets, which allows him to analyse the emotions in our speech and actually respond to us in an emotional way.
Hey, Shimi, I had a great day at work today, I got a promotion and I am feeling so good.
Hey, Shimi, I am feeling really down in the dumps today, I am pretty sad.
Shimi will understand your emotion based on how you speak and respond with this kind of emotional response, both in gestures and in voice, allowing you to have a companion, a companion that is driven by music. What we did in order to let him understand emotions and project emotion is analyse datasets of musicians playing angry music, sad music and happy music, and we put it into a deep learning system powered by NVIDIA.
It tries to capture features from these kinds of musical phrases, and this is what drives Shimi's contour, prosody and rhythm, and the way he actually moves and speaks, because we feel that music is a great medium for projecting emotions. And if Shimi's communication is abstract like music, but also emotional like music, we feel that this can avoid the uncanny valley and allow for great interaction.
Ok. So, I hope that you enjoyed the video of the robot.
So, what does this Shimi robot do? It tries to understand the emotional state of the human who is interacting with it, and then it tries to respond accordingly to that emotion. One interesting thing about Shimi is that it does not use the verbal language that we use; rather, it uses a musical language, which is based on some native, indigenous languages from Australia.
You could see the type of musical response the robot gave. As of now, Shimi's capabilities are a bit limited, in the sense that it only looks at valence and arousal, and from these it is able to classify or understand only four different emotional states on the valence-arousal scale.
You can quickly work out how this is done: it estimates the valence by semantically analysing the spoken language, looking for words that represent positive and negative feelings. For example, "I had a bad day": there is the word "bad" in it, and so on.
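Here is a toy version of this valence-arousal idea, not Shimi's actual system: valence from a small positive/negative word list in the utterance, arousal from a hypothetical speech-energy estimate, and then one of four quadrant emotions.

# Toy valence-arousal quadrant classifier (word lists and threshold are assumed).
POSITIVE = {"great", "good", "happy", "promotion"}
NEGATIVE = {"bad", "sad", "down", "terrible"}

def quadrant_emotion(utterance: str, speech_energy: float) -> str:
    words = [w.strip(".,!") for w in utterance.lower().split()]
    valence = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    high_arousal = speech_energy > 0.5      # assumed threshold
    if valence >= 0:
        return "excited" if high_arousal else "calm"
    return "angry" if high_arousal else "sad"

print(quadrant_emotion("I had a bad day", speech_energy=0.2))            # -> sad
print(quadrant_emotion("I got a promotion today", speech_energy=0.8))    # -> excited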
So, we have now talked both about how emotions can be evoked by virtual agents and about how virtual agents can respond to the emotions evoked in humans. Next, let us look at how we can produce an empathetic response that goes beyond just the analysis of emotional states.
(Refer Slide Time: 26:16)
First, we have to understand that emotions go beyond just the basic emotions, and in this sense the agent's ability to perceive things beyond emotions, such as beliefs, desires and intentions, can be quite limited, even though these play a very important role in the emotional responses that are evoked or generated.
As of now, most of the focus of the affective computing community is on representing basic emotional states, which are quickly activated, short-lived and quite focused; an example could be the basic emotions, or what we saw in the case of the Shimi robot. Then there are other affective states such as mood, personality, emotional intelligence and so on, which also play a very important role in evoking emotions among humans or in evoking an empathic response.
What we definitely want to understand is whether we can take any of these factors into account as well while trying to evoke an emotional response or provide empathic feedback.
(Refer Slide Time: 27:45)
In order to do so, we would like to look at a very interesting term: theory of mind. Theory of mind is basically the capacity to understand other people by ascribing mental states to them. The idea is that we do not only know about ourselves; we also acknowledge that others' mental states could be different from our own, and hence their desires, beliefs, intentions, thoughts and emotions can be different from ours, and their knowledge in general is different from our knowledge. This acknowledgement itself is known as theory of mind.
Why is it helpful? Theory of mind is helpful because it allows you to infer the intentions of others and to understand what exactly is going on in someone else's mind, including what they are hopeful about, what their beliefs are, what their expectations are and perhaps what they fear.
It is a very interesting and important paradigm in psychology, and to test theory of mind there is a very influential experiment known as the false belief test, which is usually done with kids to understand to what extent they possess a theory of mind.
(Refer Slide Time: 29:28)
So, I will let this video play so that you can go through the false belief test and understand what it is all about. I hope that you enjoyed the video. In the false belief test, essentially, you saw two characters: while one kid is able to understand what the other will be thinking, the other kid is not able to apprehend what the other child will do or think in that situation, for example about where the trolley is.
Motivated by this, what can we do? We can create an affective agent that also possesses this theory of mind, so that it has a sort of mind-reading skill. Why do we want to give such a skill to the agent? Because then, apart from its own beliefs, desires and intentions, it can understand what other humans' beliefs and misconceptions may be, and what their intentions are, whether valid or invalid. Accordingly, it can work out how best to help the humans obtain the object or goal of their desire: if I understand what their beliefs and intentions are, then the robot can help the humans obtain that particular object or goal.
For this, the two of them need to share mental attention, and there is again a very nice paper by Cynthia Breazeal and her group which talks about Leonardo, a humanoid robot. I would like to play a video so that you can understand what I am talking about.
(Refer Slide Time: 31:44)
In this video the robot Leonardo demonstrates his ability to recognize the intentions of his human partners, even when their actions are based on incorrect information. Leo keeps track of objects in his environment based on data from his sensors.
(Refer Slide Time: 31:59)
At the same time Leo also models the individual perspectives of his human partners. Here
everyone watches as Jesse places cookies in the box on the right and chips in the box on the
left. Since both people are present, everyone's beliefs are the same.
Leo's cognitive architecture based on ideas from psychology known as simulation theory,
reuses its own core mechanisms of behavior generation to understand and predict the
behavior of others.
(Refer Slide Time: 32:26)
In this demonstration Leo tracks sensory data from an optical motion capture system. This same data is presented to duplicate systems, which represent the unique visual perspectives of his human partners. Now, as Matt leaves the room, Jesse decides to play a trick on him and switches the locations of the two snacks.
Since Matt is absent, Leo only updates his model of Jesse's beliefs. Now Jesse seals the boxes with combination locks, preventing easy access to the snacks. When Matt returns hungry for a bag of chips, he tries to guess the combination to the box where he remembers seeing the chips. As Leo watches Matt reaching for the lock, he tries to infer Matt's intention by searching for an activity model that matches the observed motion and task context. Once a matching activity is found, Leo uses his model of Matt's beliefs to predict what Matt's goal might be.
Then Leo uses his own model of the true state of the world to search for a way to help Matt
achieve his goal. Having correctly inferred Matt's intention, Leo assists him by opening a box
connected to his control panel providing Matt with the chips he desires. Thanks Leo.
Now, Jesse returns and tries to open the same box. Leo correctly infers that Jesse wants the
cookies, since Jesse is aware of the actual contents of the boxes. Matt and Jesse both perform
the same physical action, but Leo's ability to model their individual beliefs, allows him to
correctly assist them in achieving their different goals.
So, you saw in the video how Leonardo was able to understand what the desire of the human was. This particular individual was looking for a packet of snacks, and since Leonardo knew where that packet actually was, it was able to help the human obtain it.
How is the robot able to do so? The robot reuses its own belief construction system from the visual perspective of the human, and it predicts what the human believes at that particular point in time and whether that belief is true or not. It does this by looking at all the visual sensors it has and applying theory of mind, and in doing so it enables the robot to recognise and reason about what exactly the individual wants at that point in time, and accordingly help the user obtain that goal, in this case getting the packet of snacks.
Hence, it is really important: if we can enable the robot with this kind of capability, then the robot can not only generate an empathetic response, but a response that also takes into account the beliefs, desires and intentions that the humans have. Perfect.
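As a highly simplified sketch of this perspective-taking idea, a toy belief model and not Leonardo's actual architecture, consider the following: the robot keeps its own world model plus one belief model per human, updates a human's model only when that human can observe the change, and infers the goal from the human's possibly false belief.

# Toy false-belief reasoning; box names and items follow the demo above.
world = {"chips": "left_box", "cookies": "right_box"}
beliefs = {"Matt": dict(world), "Jesse": dict(world)}

def move_item(item, new_location, observers):
    """Move an item and update only the observers' belief models."""
    world[item] = new_location
    for person in observers:
        beliefs[person][item] = new_location

def infer_goal(person, reached_box):
    """The person reaches for the box where *they believe* the item is."""
    for item, believed_loc in beliefs[person].items():
        if believed_loc == reached_box:
            return item, world[item]   # desired item and its true location
    return None, None

# Jesse swaps the snacks while Matt is out of the room.
move_item("chips", "right_box", observers=["Jesse"])
move_item("cookies", "left_box", observers=["Jesse"])

# Matt reaches for the left box (where he falsely believes the chips are).
print(infer_goal("Matt", "left_box"))    # ('chips', 'right_box') -> open the right box
print(infer_goal("Jesse", "left_box"))   # ('cookies', 'left_box')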
So, that was about how we can generate or evoke empathetic responses that go beyond the emotional states. Now, we will talk about how we can assess these empathetic responses.
It is not an easy question. By now we know how to create an empathetic agent, what the different types of empathetic agents are, why we want to create them, and so on.
But unless and until there is a performance metric, we do not know how to measure our success in building these affective agents. In general, it is a really hard problem and there is no consensus yet in the community on how to assess empathic responses. Even more so in interactive, online, real-time settings, it can be really difficult to judge the implications of the empathy or of its evocation.
One thing that can be done, and I briefly mentioned the term Turing test in the beginning, is to look at how well a system imitates human behaviour while trying to generate empathy. The idea is very simple: through anthropomorphic design, we want the agents to imitate human behaviour. If we are able to create a system that imitates human behaviour as well as a human would, then maybe it has passed the Turing test, and that is the best an agent or a machine can do.
So, that can be one criterion: is the agent able to evoke an empathetic response to the extent that a human could have done? If it is, then it has effectively passed the Turing test, and that is a very good measure of its effectiveness.
But that can be really hard, and before we even get there we may get stuck in the uncanny valley itself, and so on. So, there are different psychological benchmarks that we can look into instead. For example, we can look at how autonomous the agents are. No matter what type of agent we are talking about, before even looking at its empathetic responses, the first thing we may want to check is whether the agent is autonomous, or whether the empathetic responses it generates are non-autonomous, being controlled by other humans, because if they are controlled by other humans then maybe the agent itself is not very empathetic; the response is quite artificial.
But while doing so, you may also have to answer the question of whether humans themselves are autonomous. Without going too much into the psychology of this, sociobiologists have one view and moral researchers have another. The sociobiologists say that everything we do is controlled by our genes and is the result of evolution, and hence humans may not be autonomous at all. Then come the moral researchers and philosophers, in the tradition of Aristotle and Socrates, who point out that if everything is the result of genes and evolution, then humans cannot have autonomy, and if humans cannot have autonomy, then they cannot be held morally accountable. Nevertheless, without going deeper into that discussion, the point is that whether the responses the agents generate are autonomous or non-autonomous can be one criterion.
Another thing we can do is look at imitation. Imitation is as simple as this: if we like a particular character, if we like the behaviour of a particular character even from a movie or a series, we start imitating them, maybe subconsciously. So this can be a very good criterion: are humans imitating the humanoid robots, the machines, or the services that you are creating? If so, how does that compare with human-to-human imitation, and is it of the same extent or less? This can become a very good criterion for judging how effective the empathetic response or the empathetic interaction of the agent was, or is.
All of this can be evaluated in terms of, for example, a Likert-type scale, where you give a score of 1 to 9: 9 may represent completely autonomous and 1 zero autonomy, or 9 may represent that people are imitating the agent 100 percent and 1 that they are imitating it 0 percent, and so on.
Similarly, we can look into moral values. The question is: when humans are interacting with these agents, are they willing to ascribe intrinsic moral values to these humanoid robots or agents? And if so, to what extent? Again, we can use a scale of 1 to 9.
Because if people are willing to ascribe moral values to these robots, then maybe they are thinking that the robot is very human-like, very empathic, and maybe that is where it has been very successful in making the interaction human-like.
Similarly, you can look into moral accountability: when we are ascribing moral values, is the robot just, is it fair in its adaptive interactions, and can it be held accountable as well? For example, when something goes wrong, can we say that the agent should be held accountable for it, because the agent made an emotional adaptation that was not supposed to be made and it is causing the human harm?
So, the idea is: to what extent can people hold this agent responsible? On the basis of this moral accountability, it can again be evaluated on a scale of 1 to 9.
Again, when we are talking about emotional adaptation or empathetic interaction, we can also look into privacy: to what extent is it invading the privacy of the humans? For example, in order to understand the emotions, is it looking at facial expressions, at the identity of the human, at the race of the human, and so on? In that sense, is it getting information that it should perhaps not get, and to what extent are people comfortable sharing that kind of information with the robot or the machine?
So, privacy could be one aspect on the basis of which the agent, or its empathic responses, can be evaluated. Another very important criterion could be reciprocity. Reciprocity is as simple as this: usually, when someone is being empathetic with you, you would like to be empathetic with that individual in return; you behave with an individual to the same extent that they behave with you. So, are people willing to reciprocate this behaviour with the humanoids, with the robots, as well? If so, then maybe the robot is quite successful in generating an empathetic response, because the humans are treating it like a human, whether it is an agent, a machine or a robot, as I said before.
It turns out, of course, that it can be a bit tricky to make use of all these psychological benchmarks. In that case you can simply use some self-reported questions: for example, you may list the properties of the social robots, agents or machines and have humans rate them, which could cover all the psychological benchmarks, or you can ask questions as simple as "did you like the empathetic interaction that you had with the agent?".
Besides such questionnaires, you can also do content analysis. For example, if there is a conversational agent chatting with humans, you may want to look at the transcript, see what type of content is being generated, and judge to what extent it was empathetic and to what extent it was successful.
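As a small illustration of how such self-reported ratings could be aggregated, here is a sketch that averages 1-to-9 Likert scores over the benchmarks discussed above; the benchmark names and the sample ratings are hypothetical.

# Illustrative aggregation of 1-9 Likert ratings per psychological benchmark.
from statistics import mean

BENCHMARKS = ["autonomy", "imitation", "moral_values",
              "moral_accountability", "privacy_comfort", "reciprocity"]

# ratings[participant][benchmark] = score in 1..9 (made-up sample data)
ratings = [
    {"autonomy": 6, "imitation": 7, "moral_values": 5,
     "moral_accountability": 4, "privacy_comfort": 3, "reciprocity": 8},
    {"autonomy": 5, "imitation": 6, "moral_values": 6,
     "moral_accountability": 5, "privacy_comfort": 4, "reciprocity": 7},
]

summary = {b: mean(r[b] for r in ratings) for b in BENCHMARKS}
for benchmark, score in summary.items():
    print(f"{benchmark}: {score:.1f} / 9")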
So, these are a few ways in which you can evaluate, in general, the empathetic responses of the agents. It is a very fascinating area; there has been limited work in it so far, and accordingly the assessment can be a bit tricky, but this is what we have for now, and hopefully it will improve in the future.
Perfect. So now we come to the conclusion of the class. To conclude, in this module we talked about empathy and empathetic agents: how empathy can be evoked by virtual agents among humans, and how empathy can be generated in virtual and robotic agents towards humans. We also talked about how empathy can be generated beyond the expression of emotional states, and we briefly looked at how we can assess empathetic responses.
When we talk about empathy and empathetic agents, we understood that we want empathetic agents because the presence of empathetic responses from agents leads to better, more positive and appropriate interactions. That is also where we learned about anthropomorphic design, the uncanny valley and so on.
While talking about how empathy can be evoked by virtual agents in humans, we saw that the appearance and the function of a particular agent play a very crucial role in how people perceive it and, accordingly, how they interact with it. This again comes from anthropomorphic design: you really want to look into the appearance, the functions and the interaction of the machine, the agent or the service with the humans.
While talking about empathy beyond emotional states, we understood that just by analysing the basic emotional states we may not be able to generate very empathic reactions and interactions, and hence we want the agents to have the ability to perceive the beliefs, desires and intentions of the humans, which can really help them align their empathetic responses with the goals and desires of the humans.
We also looked at how the evaluation of these empathetic responses can be done. While there is no general agreement about it, we saw how some psychological benchmarks can be rated on a Likert scale, for example from 1 to 9 or 0 to 5, to assess the empathetic responses and understand how helpful they were and how successful they were in being empathetic.
So, with that we finish this module and we will see you in the next module. Great learning.
Affective Computing
Prof. Jainendra Shukla
Prof. Abhinav Dhall
Department of Computer Science and Engineering
Indraprastha Institute of Information Technology, Delhi
Week - 10
Lecture - 01
Part - 01
Emotionally Intelligent Machines: Challenges and Opportunities
Hi friends, welcome to this week's module. This week, we are going to talk about Emotionally Intelligent Machines: Challenges and Opportunities. So far, we have already discussed a lot about how to process emotions, how to elicit emotions and some of the applications. Today, we are going to take a step further and look at certain domains in which these techniques can be used and at some open issues.
First, we will discuss affect-aware learning. Second, we will discuss the use of affective computing in games. And third, we will discuss the open issues that exist and some indications on how to solve them. In particular, our focus will be on online recognition of emotions and adaptation to them while discussing these fields. With that, let us dive in.
(Refer Slide Time: 01:27)
First, affect-aware learning, that is, the use of affective computing in educational settings. Affective computing has been widely used in developing learning technologies; tools in which emotion-aware learning happens are known as affective learning technologies.
It has been observed that the learners' experience during interaction with these technologies varies a lot. When we say the learners' experience, it means not only their emotions but also their overall learning outcomes. Please pay attention: overall, our aim when interacting with learning technologies is to improve learning.
So, then when we talk about the experience of the learners, it is not only the type of the
emotions that the learner’s experience, but also what is the improvement or the gain in the
learning while interacting with these type of technologies. Otherwise, these affective learning
technologies they will not be very fruitful.
So, first, we would like to understand the different types of emotions that can occur during interaction with affective learning technologies, and accordingly we may need to develop the technologies. For this, a meta-analysis of 24 studies was done by researchers, which involved 1740 students from different countries including the US, Canada and the UK.
In this meta-analysis, each student roughly spent around 45 minutes, and hence we have altogether around 76,000 minutes of data, which is a significant amount. These studies collected data not only in classroom settings but also in typical research-lab settings and in online settings. So, we have a wide range of application settings in which these learning technologies were employed and the studies were conducted.
Now, it was found that a total of 17 different types of affective states were tracked during these studies. Of course, 17 is a big number, and we would like to understand which were the major affective states. If you look at the major affective states, there were 6 affective states that were exhibited by the learners and hence were found to be most useful during interaction with affective learning technologies. Of course, it is no surprise: engagement and boredom.
So, basically, engagement and boredom: of course, you would like to understand and monitor whether the student or the individual interacting with the learning technology is engaged or bored. Then confusion: whether a complex topic is being taught and hence is making the student confused. Then curiosity: whether the individual is curious to learn more. And of course, the happiness and the frustration of the individual while learning with these technologies.
615
So, now, having understood that these are some of the affective states most commonly exhibited by the learners, how would you like to use this? Whenever we are developing an affective learning technology, we can focus more on monitoring these states, and accordingly we can select the type of sensors, the type of ground truth and the type of machine learning or deep learning algorithms in order to recognize, and also to adapt to, these different states.
So, these are the most relevant affective states that we can use during interaction with affective learning technologies.
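As a small sketch, and only as one possible way of organizing a design, the six states reported above could be encoded as the target label set of an affect-aware learning technology. The state names come from the discussion above; the sensing plan entries are invented placeholders.

# Sketch: target affective states for an affect-aware learning technology.
from enum import Enum

class LearnerState(Enum):
    ENGAGEMENT = "engagement"
    BOREDOM = "boredom"
    CONFUSION = "confusion"
    CURIOSITY = "curiosity"
    HAPPINESS = "happiness"
    FRUSTRATION = "frustration"

# Hypothetical sensor / ground-truth choices keyed by state (illustrative only).
sensing_plan = {
    LearnerState.ENGAGEMENT: {"sensors": ["camera"], "ground_truth": "observer annotation"},
    LearnerState.BOREDOM:    {"sensors": ["camera", "interaction logs"], "ground_truth": "self-report"},
}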
Now, let us look at some affective learning technologies and how they have used affective computing. First, we will talk about the affective AutoTutor. The affective AutoTutor was proposed in December 2012, but it was the outcome of several earlier years of work.
It is said to be one of the first reactive conversational intelligent tutoring systems, in the sense that it was an intelligent tutoring system that was able to react in a conversational way to the emotions of the individuals or the students.
616
You can see here, and I hope this particular figure is visible to you, that there is a screen interface in front of a user, who could be a student or any individual. This interface is tracking 3 different modalities of the user.
It makes use of a camera, and with the help of the camera it tracks the facial features. It also looks at contextual cues, and it looks at the body language of the individual. So, these are the different cues and modalities that the system is tracking.
By tracking all these 3, what is it trying to do? It is trying to detect the boredom, confusion, frustration or neutral state of the individual, because on the basis of these states the tutoring system would like to do some adaptation. So, apart from the monitoring part, there is now the part of how to take a particular decision.
This is where a decision-level fusion algorithm was employed by the tutoring system: it clubbed the data from the different modalities, such as the camera or the body language, applied a fusion scheme, and after applying that fusion it obtained a diagnosis of the type of emotional state the individual is experiencing.
Hence, that is how the different modalities are used to arrive at a common decision about which of these 4 states the individual is experiencing.
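To make the idea of decision-level (late) fusion concrete, here is a minimal Python sketch in the spirit of what is described above: each modality produces its own probability estimate over the four states, and only these per-modality decisions are combined. The probabilities and weights are made up for illustration; this is not the AutoTutor's actual fusion scheme.

# Sketch of decision-level (late) fusion over four learner states.
STATES = ["boredom", "confusion", "frustration", "neutral"]

def fuse(per_modality_probs, weights):
    """Weighted average of per-modality probability vectors over STATES."""
    fused = {s: 0.0 for s in STATES}
    for modality, probs in per_modality_probs.items():
        w = weights[modality]
        for s in STATES:
            fused[s] += w * probs[s]
    return max(fused, key=fused.get), fused

per_modality = {
    "face":    {"boredom": 0.5, "confusion": 0.2, "frustration": 0.1, "neutral": 0.2},
    "context": {"boredom": 0.3, "confusion": 0.4, "frustration": 0.1, "neutral": 0.2},
    "posture": {"boredom": 0.6, "confusion": 0.1, "frustration": 0.1, "neutral": 0.2},
}
weights = {"face": 0.4, "context": 0.3, "posture": 0.3}

decision, scores = fuse(per_modality, weights)
print(decision, scores)  # e.g. 'boredom'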
617
(Refer Slide Time: 08:25)
Now, once we understand what state the individual is in, next we would like the system to do some adaptation. The type of adaptation that happened in the case of the AutoTutor was that it provided empathetic, encouraging and motivational dialogues and emotional displays in response to the cues that it tracked or monitored.
So, for example, when it tracked that there is mild boredom, that the individual or user interacting with the tutoring system is getting bored, it may give certain cues with certain emotional displays, as you can see in this image. I hope this image is clear.
So, this particular animated character provides certain emotional displays, and certain emotional dialogues are generated here. For example, and I am not sure whether the image is very clear or not, a particular topic was being taught, let us say about the CPU and RAM of a machine.
The system then tracked or monitored that the user is feeling bored about it. Once the system was able to track that, it presented a motivational dialogue saying, “This stuff can be dull sometimes, so I am going to try and help you get through it.”
618
So, this is a simple motivational dialogue, combined of course with certain emotional cues, as you can see in the animated character, that the system presented in order to help and motivate the user and improve the learning.
Now, the next question is whether the learning improved. As you can see in the diagram presented at the top, we are looking at the learning gains from different sessions for the regular and the affective system. Regular is the condition where no affective feedback was provided, and affective is the condition where not only monitoring was being done but also a certain type of feedback was provided.
As you can see in the diagram on the top right corner, the learning gains increased from session 1 to session 2. The learning gains in session 1 were not very high, but in session 2 they were higher.
So, overall, we can conclude that the affective tutor was able to improve the learning of the user with the help of this type of adaptive feedback. That was one very simple example of how the affective AutoTutor was able to track the emotional state of a user, of a student, how it was able to adapt, and how a gain in learning was achieved through that adaptation.
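As a side note, one common way to quantify a learning gain from pre- and post-test scores is the normalized gain; the original study may define its gains differently, and the scores below are made up. A minimal sketch:

# Sketch: normalized learning gain g = (post - pre) / (max - pre).
def normalized_gain(pre: float, post: float, max_score: float = 100.0) -> float:
    """Fraction of the possible improvement actually achieved."""
    return (post - pre) / (max_score - pre)

regular_gain   = normalized_gain(pre=55, post=62)   # hypothetical scores
affective_gain = normalized_gain(pre=55, post=74)
print(f"regular: {regular_gain:.2f}, affective: {affective_gain:.2f}")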
619
Next, we will see another example, the GazeTutor. In the case of the GazeTutor, the basic idea was that if we can monitor periods of waning attention and attempt to encourage the learner during those particular periods, then these types of interventions can be very helpful.
Building on this hypothesis, the GazeTutor is basically a multimedia interface consisting of an animated conversational agent (CA).
As you can see on the left-hand side, this is the GazeTutor interface, where there is an image being used to present or teach a particular topic, and there is the animated conversational agent.
On the right-hand side, the idea is that the system was able to track your attention and adapt to it, intervening at the right point of time. The curves represent the gaze of the individuals before and after the intervention: the left part of the plot shows where the user was looking before the intervention, and the rest shows what happened once the intervention was provided. The x-axis is time, and the y-axis is the probability of where the individual is looking.
For example, if you look at the off-screen curve, the off-screen time gradually increases as time progresses. This means the individual is neither looking at the screen, nor at the tutor, nor at the image.
So, this is the tutor, this is the screen interface and this is the particular image being presented, but the user is not looking at either of them, and the off-screen time is gradually increasing. The individual is not paying attention at all. And then a certain intervention was provided.
620
Now, once the intervention was provided, what happened? Very quickly the off-screen time started to reduce, and then it got sustained at a level where the individual’s probability of looking outside the screen was around 0.5, that is, 50 percent.
The same type of analysis can be done for the tutor and the image. You can see the decline with respect to the tutor and the image: before the intervention, as time progressed, the individual increasingly stopped looking at the tutor, and likewise at the image.
So, basically, the user was neither looking at the tutor nor at the image, and this is reflected in the fact that the probability of looking at them kept reducing. But once the intervention was provided, the probability with which the user was looking at the tutor, or for that matter at the image, started increasing.
Of course, it could have increased a bit more, but nevertheless it was much better than without the intervention. So, that is how the GazeTutor was taking care of the waning attention periods and was able to provide an intervention at the right point of time.
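As a rough sketch of this kind of logic, and not the GazeTutor's actual implementation, one could monitor the proportion of recent gaze samples that fall off-screen and trigger an encouraging dialogue once attention wanes. Window size and threshold are arbitrary illustrative choices.

# Sketch: trigger an intervention when recent gaze is mostly off-screen.
from collections import deque

class AttentionMonitor:
    def __init__(self, window: int = 150, off_screen_threshold: float = 0.5):
        self.samples = deque(maxlen=window)   # True = gaze off screen
        self.threshold = off_screen_threshold

    def add_gaze_sample(self, on_tutor: bool, on_image: bool) -> bool:
        """Record one gaze sample; return True if an intervention is due."""
        self.samples.append(not (on_tutor or on_image))
        if len(self.samples) < self.samples.maxlen:
            return False                      # wait until the window is full
        off_screen_ratio = sum(self.samples) / len(self.samples)
        return off_screen_ratio > self.threshold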
621
(Refer Slide Time: 15:48)
So, now we will look at affect-aware games. We saw that if we go for a fully sensor-based approach, a dedicated-sensor-based approach, then there is a problem of scalability. Let me just write it down for you.
If you go ahead with dedicated sensors, you will face a problem of scalability. And if you go sensor-less, then you will have a problem of accuracy; you are somehow making a trade-off with the accuracy.
So, there could be a possible solution, which we are calling a sensor-lite solution. A sensor-lite solution means we are going to use scalable sensors whenever it is feasible. For example, cameras and microphones are very widely available sensors which allow you to capture the audio-visual modalities, or only the audio modality. That is one approach.
And then, at the same time, the non-scalable sensors can be replaced with scalable proxies. For example, the camera can be a very good option here, because it is already available in most laptops, and it can also be purchased at very low cost.
622
Previous research has already shown, and we have talked about it, that you can use the camera, for example, to record and monitor the heart rate, the heart rate variability and related features. So, this seems like an excellent solution. Another thing you can do is simply look at the webcam data and apply motion tracking techniques on it.
For example, you can do posture analysis and gesture analysis, all with the help of the 2D visual data. Of course, you will need to work a bit more on the software side, but working on the software side is easier and more accessible than working on the hardware and trying to make the hardware scale.
So, that is one type of possible solution. We also already talked about the fact that the camera can not only be used to monitor the heart rate and the heart rate variability, it can also be used to monitor the gaze patterns of the eyes. Hence, it can also, to a certain extent, replace the eye tracking devices and sensors that we have.
So, that is the conclusion of this part: if you have the possibility of making use of dedicated sensors, go ahead. Many times, your application domain will not allow it. Then you may want to use the existing sensors that are available in the system to track the behavioral data, or you may want to replace the dedicated sensors with sensor proxies, and hopefully that may work wonders for you. So, that was about the sensors and what to do with the sensors that are available to us.
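To give a feel for the camera-as-proxy idea, here is a very simplified Python sketch of remote photoplethysmography: the mean green-channel intensity of the face region pulses with blood volume, and a spectral peak in the plausible heart-rate band gives an estimate. Real systems need face tracking, detrending, band-pass filtering and artifact handling; this only shows the core idea, and the function and parameters are assumptions for illustration.

# Very simplified sketch of webcam-based heart-rate estimation (rPPG).
import numpy as np

def estimate_heart_rate(green_means: np.ndarray, fps: float) -> float:
    """green_means: mean green value of the face ROI for each video frame."""
    signal = green_means - green_means.mean()           # remove DC component
    spectrum = np.abs(np.fft.rfft(signal))
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fps)   # frequencies in Hz
    band = (freqs >= 0.7) & (freqs <= 4.0)              # roughly 42-240 bpm
    peak_freq = freqs[band][np.argmax(spectrum[band])]
    return peak_freq * 60.0                             # beats per minute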
623
Now, let us talk about accuracy as well. In general, we know that naturalistic affect detection has seen a lot of research and has improved a lot. But of course, there are still lots of problems associated with the affective computing domain and with the research and development in it.
One thing we already talked about: most of the time the sensors we require are intrusive. Many times they can be expensive, many times they can be noisy, and more importantly they are not scalable. Another thing, if we talk about the technical challenges, is that the detection itself can suffer from weak signals that are embedded in noisy channels.
Many times, for example, you are trying to look at particular data, but that data itself is surrounded by lots of noise. For example, maybe you are trying to capture the emotions in the voice, as simple as that. You are trying to capture the emotions in the audio modality of a user, but we are no longer in a lab setting; we are in a naturalistic setting where there is a lot of noise around.
Hence, your target user’s voice data is getting embedded in, getting surrounded by, lots of noise, and it becomes very challenging to segregate the voice of the user and do some analysis on top of it.
One thing we have also seen, even though we did not talk about it in much detail, concerns the machine learning and deep learning algorithms themselves. Most of the time, whatever we want to do in affective computing is going to rely heavily on machine learning and deep learning algorithms.
And whenever we talk about machine learning and deep learning algorithms, it is no surprise that they need a lot of adequate and realistic training data in order to make sense. Many times, data associated with emotions is very difficult to even capture, let alone annotate, and hence our machine learning and deep learning models, and their accuracies, may suffer.
Another thing that may happen is that, in affective computing, most of the time we are just looking at transient emotions. But if you want to incorporate the context and the appraisals as well, this becomes a bit troublesome, because if you want to incorporate the context and the appraisals,
624
and let us say the user’s beliefs, desires and intentions around them, then it can become very tricky: you will have to be able to track lots of different things, which may be very difficult.
In general, whenever we are talking about affective computing, there is also a common mistake that researchers and developers in the community make: many times they use one term in place of another and vice versa. For example, when talking about the model of emotions, they do not discriminate properly between categorical models and continuous models.
Similarly, many times mood and emotion are not differentiated. You may refer to mood, which is a long-term affective state, while actually meaning transient emotions, or the other way around; you are just using one for the other, and hence there is a lack of clarity.
Nevertheless, even after this, if you are able to do some recognition and make use of some monitoring, then generalizability itself becomes a big issue.
Generalizability is a problem for machine learning and deep learning algorithms in general, but it is a bigger problem for emotions and affective computing, because you may want to look into individual variability, which we talked about, and you may want to look at cultural variability and cultural differences. For example, at the time of the death of an individual, the way lamentation or sorrow is expressed in Indian communities is very different from the way it is expressed in communities outside.
So, of course, there are lots of cultural differences surrounding it. Then context, time, etcetera, also play a big role. So, generalizability becomes a big issue, especially when we talk about affective computing models.
625
(Refer Slide Time: 24:15)
So, the question we want to ask now is this: what type of accuracy, or to what extent of accuracy, is good enough for us to go ahead, deploy the system and make it work? In general, we already saw that affect detection itself is a very tricky problem, and we can say quite confidently that it is very unlikely that we are going to have perfect affect detectors any time soon.
That is, detectors that are not only able to do a 100 percent classification of the emotions of a user in real time, but are also able to generalize well to new individuals, to their interactional context, and to the very noisy situations around us.
So, it is not going to happen anytime soon. It will require lots of advancements, not only on the hardware side, but also on the software and the machine learning and deep learning side. So, the question is: when should we start taking the information we are getting from emotion recognition systems and build systems that can adapt to it, or, in general, when do we want to close the loop?
There could be two possibilities now: either we wait until there is a perfect affect detection system in order to build the adaptive system on top of it, which is going to work perfectly fine, or we just go for a not-so-perfect affect detection system and try to make it work.
626
The thing is, since it is a very tricky problem, it may take a lot of time and resources to arrive at that situation. What counts as a moderate degree of recognition accuracy for emotions depends on the situation, the domain and the use case, and we will talk a bit more about it.
But the moment we are able to get a moderate degree of recognition accuracy, we believe that should be sufficient to create affect-aware interventions. Affect-aware interventions means adaptive interventions; we just saw some examples of them.
But of course, we have to take into account the fact that they should be fail-soft. Fail-soft in the sense that they should not do any harm, even if they are doing some adaptation which is based on an incorrect classification. So, that is a very tricky consideration that you want to look into.
And of course, when we say a moderate degree of recognition, then accordingly comes the severity of the interventions. You want to make an adaptive system, but since the adaptive system is based on a not-so-perfect affect detection system, you may not want to put a lot of hard confidence in the severity of the interventions.
So, you may want to take it with a pinch of salt: I got a particular classification, a particular emotional state of a user, based on whatever sensors, models or software I have deployed, and there is a possibility that it may not be correct. Hence, the adaptation that I am going to do is accordingly going to be not so severe; it could be moderate.
And then it can be calibrated to the extent that I am able to put confidence in my detection accuracy. For example, imagine that in my particular use case I am able to make use of dedicated sensors, dedicated hardware, and state-of-the-art algorithms, and I know the detection accuracy is good or almost perfect in my case.
Then maybe you can put more confidence in the system, and accordingly in the adaptations that you are making, and so on.
627
So, this is a very important thing to understand: you need to take into account the system that you are building, and the use case that you are building for, and accordingly you need to define what a moderate degree of recognition is for you and what the severity of the interventions should be. But the bottom line is that you may never want to wait for a perfect affect detection system in order to start building an adaptive system. So, I hope that is a bit clear.
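To make the fail-soft idea concrete, here is a minimal Python sketch: the severity of the adaptive intervention is scaled with the confidence of the (imperfect) affect detector, and below a confidence floor the system does nothing, so a wrong classification causes no harm. The thresholds and intervention names are placeholders, not from any specific system.

# Sketch: confidence-calibrated, fail-soft choice of intervention.
def choose_intervention(predicted_state: str, confidence: float) -> str:
    if confidence < 0.5:
        return "none"                        # fail-soft: better to do nothing
    if predicted_state == "boredom":
        return "mild_encouragement" if confidence < 0.8 else "switch_activity"
    if predicted_state == "frustration":
        return "offer_hint" if confidence < 0.8 else "simplify_task"
    return "none"

print(choose_intervention("boredom", 0.65))  # -> mild_encouragement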
Now, having talked about the sensors and about what accuracy is good enough, let us talk a bit about adaptiveness. The question we want to answer is: what should be the adaptiveness of the system?
This is a very tricky question again. The severity of the adaptive interventions, as we already said, is determined on the basis of the confidence you are able to put into the system. But let us see what different levels of adaptation we can have in a system.
To begin with, no surprise, we can have level 0 adaptation, which we also call no adaptation at all. No adaptation at all means that we do not expect the system to alter its behavior in response to the emotional state, as simple as that.
628
Whatever the emotional state is going to be, we are going to monitor it, but we are not going to do anything about it. We are simply going to keep that information for some other analysis purposes, but we are not going to let the system's behavior be altered in response to the emotional state. And that is what is happening for most of the systems that we have.
For this, a predefined interaction script is mostly used by the machines and the services that we use; we just talked about the gaming examples. When you are sad, the non-playing characters around also show sadness, but all of this follows a very predefined script. First, the system does not really know what your emotion is, and even if it knows, it decides not to do anything about it.
So, this is the type of system where no adaptation at all is happening, and most of the machines and services we interact with today, including, for example, voice agents such as Alexa and Siri, are like that. Their adaptation is not based on our emotional state, and many times our emotional state is not even being monitored. So, that is level 0.
Then comes level 1. A level 1 system tries to monitor your emotional state, and it also tries to recognize that there is a need for adaptation at a particular time. So, it tries to identify the time of the intervention, or the need for the intervention. But it just does that: it simply identifies that there is a need for adaptation, and it does not perform any adaptation at all.
How will we identify that there is now a need for adaptation? It could be on the basis of many different metrics, and some metrics or indicators could be like this. For example, the system or service you are interacting with is able to understand that you are experiencing a negative emotional state, through your voice or through any other modality.
Now it knows that, since you are experiencing a negative emotional state, you are feeling low, sad or bored and so on, there is a need for adaptation here. Similarly, whenever there is a change in the emotional state, maybe the system can just look at that metric.
629
The system can look at this metric: whenever there is a change in the emotional state of a user, maybe the system also needs to adapt to that. Of course, no adaptation is being performed at this stage, as we already talked about, but the system is able to identify the need for adaptation.
Negative emotional state, we already talked about it. Changes in the emotional state: let us say you were feeling happy and suddenly you started feeling sad. Then maybe the system will get an alert: the individual was feeling happy and is now feeling sad; there is a change in the emotional state and maybe this is the right moment to do some adaptation. But at level 1, of course, no adaptation is going to be done.
And then, you can think about lots of other metrics. I will let you explore and think about what other metrics there could be on the basis of which you may want to do some interventions or adaptations. So, that is the level 1 system. Level 0: no adaptation. Level 1: it just recognizes the need for adaptation, but there is still no adaptation. As of now, we are not doing any adaptation yet.
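As a small illustration of a level 1 system, the sketch below monitors the emotional state and flags when an adaptation would be needed (a negative state, or a change of state), but never actually adapts. The state labels and the notion of which states count as negative are assumptions for the example.

# Sketch of a level-1 monitor: recognizes the need for adaptation, acts on nothing.
NEGATIVE_STATES = {"sad", "bored", "frustrated", "angry"}

class Level1Monitor:
    def __init__(self):
        self.previous_state = None

    def update(self, current_state: str) -> bool:
        """Return True if the system recognizes a need for adaptation."""
        need = (current_state in NEGATIVE_STATES or
                (self.previous_state is not None and current_state != self.previous_state))
        self.previous_state = current_state
        return need   # level 1: the flag is only logged; no adaptation is performed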
So, now we go to level 2, which is a bit more fascinating: single task adaptation. What does single task adaptation mean? A single task that is performed during the interaction cycle is adapted over time to optimize a particular metric.
630
So, we already have a way of tracking the emotional state of the user, and we have a metric or indicator telling us that we recognize the need for adaptation. At level 2, on top of that, we also do the adaptation, in a single task.
And for this adaptation, we may want to look into which performance metric we really want to improve, and how it can be improved by a particular type of adaptation. Let me make this clear with an example.
For example, let us say you are building a conversational AI agent and your aim is to make the user happy: at the end of the interaction the user should be feeling happy, should remain happy, should feel happy about the interaction once it is over.
So, you are monitoring the emotional state of the user, and whenever you see that the user is feeling a bit low, you may want to do a certain type of adaptation that will make the user happy, because your aim was to make the user happy.
Similarly, if you talk about affective learning systems, where your aim is to improve the learning, then you are tracking the emotional state of the user and your continuous aim is to keep the user engaged, or to keep the boredom of the user as low as possible, and so on.
So, the type of adaptation you would like to do will address that particular goal: maybe you want to make the user happy, maybe you want to keep the user engaged, maybe you want to keep the user's boredom low, and so on. I hope this is clear: you adapt with respect to the performance metric that you have kept as a goal for yourself.
And what can the adaptations themselves be? An adaptation can be the result of some predefined behaviour. You are saying, my user is feeling sad, I am going to make the user happy. But what will I do to make my user happy? In a level 2 system of this kind, that particular behaviour is the result of a predefined behaviour.
For example, we already looked at the example of the Shimi robot. In the case of the Shimi robot, the robot was able to play some music as per the emotional state of the user, but the particular choice of music, or the behaviour or gesture of the robot, was predefined by the script that the developers had already built in.
Similarly, adaptation can also be the result of accumulated experience, or learning that happens over a period of time. For example, while it is not yet happening to the extent that we would want, in intelligent tutoring systems the entire experience of the user's interaction with the system over a period of time can be recorded, and the system can learn from that experience and adapt on the basis of it, for example learning what type of adaptation will work for this particular person.
A very simple example: when a teacher is taking a class for certain students, in the beginning the teacher may not have a very good idea of how to teach a particular topic to a particular student in order to make him or her understand.
But over a period of time the teacher, at least a good teacher, knows: I need to address this particular problem of this particular student in such a way that it helps that student in a specific way.
So, the type of adaptation the teacher does for a specific student is different from what it does for another student, and this is the result of the learning that the teacher has done over a period of time. That is what we are envisioning here: if the system can learn from the accumulated experience of interaction with the users, then it can do the adaptation not on the basis of some predefined behaviour, but on the basis of the learning that it is doing. And it is going to be very adaptive and personalized.
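As a tiny sketch of level 2 (single task) adaptation, the function below adapts one task parameter, the difficulty of the next exercise, to optimize a single metric (keeping boredom low and confusion manageable). The update rule is a hypothetical example, not taken from any specific tutoring system.

# Sketch: level-2 single-task adaptation of exercise difficulty.
def adapt_difficulty(current_difficulty: float, boredom: float, confusion: float) -> float:
    """Nudge difficulty up when the learner is bored, down when confused."""
    if boredom > 0.6:
        current_difficulty += 0.1
    elif confusion > 0.6:
        current_difficulty -= 0.1
    return min(1.0, max(0.0, current_difficulty))   # keep within [0, 1]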
632
(Refer Slide Time: 38:48)
So, that is the level 2 adaptation, a single task adaptation, of course keeping a particular performance metric in mind. Now, level 3 is the next level of adaptation. At level 3, rather than doing the adaptation in a single task, where for example the Shimi robot may only have been able to adapt its voice, or only its gesture,
we are talking about a set of different tasks that happen during the interaction cycle being adapted over time to optimize a particular performance metric. This adaptation can be of many different types. For example, when the agents, the machines or the services are performing multiple tasks, you can simply reorder the tasks, or you can adapt the individual tasks that are happening in parallel.
A good example of reordering the tasks based on adaptation: suppose you are creating an intelligent tutoring system, and the moment the user logs in to learn a particular concept, the system feels that the individual looks a bit low on energy today.
If the individual looks a bit low on energy, the system may decide to teach topics that are easy to follow first, rather than starting with the hard topics, and
633
maybe teach at a very basic level. There are multiple ways in which this type of adaptation can be done.
So, reordering is happening and multiple tasks are being taken into account: you are not only adapting your teaching style, you are also changing the content that you want to teach, for example.
Or, the other way, you are simply adapting multiple tasks without doing any reordering. For example, imagine you are interacting with a chat bot, a conversational agent, which is not reordering anything; it follows the order it is supposed to follow, but it is not only adapting its gestures in response to your emotional state, it is also adapting its voice in response to your emotional state.
So, there are at least two tasks. And maybe, on top of it, the task it is supposed to perform, say resolving a problem you have with your bank, is also being done in a way that really pleases you. So, there are lots of different processes happening, and the adaptation of all these multiple processes happens at the same time in this level 3 type of adaptation.
And nevertheless, no matter whether you are adapting a single task or multiple tasks, or reordering the tasks, whatever the type of adaptation, it all has to keep in mind the performance of the system. Basically, you have a particular goal in mind: you want to make the user happy, you want to make the user feel fulfilled, you want to improve the learning of the user, whatever it is. You have these goals predefined, and all the adaptations happen as per the broader goal that you have set for your system.
And in this case, rather than having a predefined script, the adaptation is the result of accumulated experience. So, it is very personalized and very customized for each different user: the adaptations are going to be different for each user depending upon their likes and dislikes. That is quite interesting.
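As a sketch of level 3 adaptation, the example below adapts several tasks at once, including their order: a hypothetical tutoring agent reorders its lesson queue and adjusts two output channels (voice and gesture) in response to the detected state. All names and rules are invented for illustration.

# Sketch: level-3 multi-task adaptation with reordering.
def adapt_session(lessons, state):
    """lessons: list of (topic, difficulty); state: detected learner state."""
    if state == "low_energy":
        lessons = sorted(lessons, key=lambda lesson: lesson[1])   # easy topics first
    style = {
        "voice":   "calm" if state == "frustrated" else "energetic",
        "gesture": "minimal" if state == "frustrated" else "expressive",
    }
    return lessons, style

plan, style = adapt_session([("recursion", 0.8), ("loops", 0.3)], "low_energy")
print(plan, style)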
634
Now, level 4 adaptation is very much like level 3, but with a critical difference: the process of adaptation is carried out between multiple independent agents. Up to level 3, we are assuming that we have a system, a service or a machine in which there is only one agent we are interacting with, and that agent is adapting a single task, or multiple tasks, that the agent itself is performing.
Here, we are saying that we are interacting in a multi-agent setting, or with multiple services at the same time. And all these multiple services and agents are talking to each other and saying: you adapt this, you adapt that, and I will adapt this, and let us collectively make the user feel good about the entire thing.
For example, maybe you are playing a game in which there are multiple characters. Rather than one character adapting to you, all the different characters adapt to you, but they do it in sync by communicating with each other. This is really fascinating, because now your user experience is being looked at holistically and comprehensively, and maybe it can provide a better experience overall.
And while doing so, the agents can communicate the different adaptations, and the adaptations can be applied individually within each agent. As I said, agent A can say to agent B, or service A can say to service B: we have to do this and this, service A will do this type of adaptation and service B will do that type of adaptation. Similarly for agent A and agent B, machine A and machine B, or whatever the different entities are.
For example, in a game there are two different characters around you, and they are supposed to be helping you in, say, finding some treasure. Agent A may notice that the user is not able to find it and say: let us help the user find it. So, one agent will say, I am going to clear the path for the user, while you take care of the enemies on the path. Different things, of course, depending upon the capabilities of the agents or the services.
635
And as I said, creativity is the only limit here on the type of adaptations that you can do for your machine. Again, when we talk about multiple agents, these agents can be both real and simulated, and of different types as well. We talked about agents in games, but we can also have one embodied or animated agent and one robotic agent, and so on.
So, basically, all the different types of agents or services that we can envision can work together. Imagine you have a robot at your home, you have an Alexa and you have Siri, and they are all talking to each other in order to make you feel happy; that would be really nice, for example. So, that is what is known as communicated task adaptation.
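As a final sketch, here is a minimal illustration of level 4 (communicated) adaptation: multiple agents share the detected user state, and each applies its own adaptation, coordinated through a simple broadcast. The agent roles and adaptations are invented for illustration.

# Sketch: level-4 communicated task adaptation across multiple agents.
class Agent:
    def __init__(self, name, adaptations):
        self.name = name
        self.adaptations = adaptations        # state -> action for this agent

    def adapt(self, user_state):
        return self.adaptations.get(user_state, "no_change")

def broadcast(agents, user_state):
    """Every agent receives the same shared state and adapts individually."""
    return {agent.name: agent.adapt(user_state) for agent in agents}

companions = [
    Agent("pathfinder", {"stuck": "clear_path"}),
    Agent("guard",      {"stuck": "hold_off_enemies"}),
]
print(broadcast(companions, "stuck"))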
Now, let us look at the conclusion. Looking at the open issues, we saw that we can use scalable sensors whenever it is feasible, as in the case of cameras and microphones. We can also replace the non-scalable sensors with scalable proxies, which can look at latent behaviour, such as behavioural data, for example the way we capture it from keyboard typing and so on.
The camera is an ideal choice, because it is already available in almost all laptop systems, and if not, it can be purchased at very low cost. Motion tracking techniques can be applied to the video data, and hence it can also enable this kind of
636
tracking. So, these are some of the things that you want to keep in mind as the conclusion of this part.
637
Affective Computing
Prof. Jainendra Shukla
Prof. Abhinav Dhall
Department of Computer Science and Engineering
Indraprastha Institute of Information Technology, Delhi
Week - 10
Lecture - 01
Part - 02
Emotionally Intelligent Machines: Challenges and Opportunities
So, we saw that if we go for a fully sensor-based approach, a dedicated-sensor-based approach, then there is a problem of scalability. Let me just write it down for you. If you go ahead with dedicated sensors, you will face a problem of scalability. And if you go sensor-less, then you will have a problem of accuracy; you are somehow making a trade-off with the accuracy.
So, there could be a possible solution, which we are calling a sensor-lite solution. A sensor-lite solution means we are going to use scalable sensors whenever it is feasible, such as, for example, cameras and microphones.
They are very widely available sensors which allow you to capture the audio-visual modalities, or only the audio modality, and that is one approach. And then, at the same time,
638
the non-scalable sensors can be replaced with scalable proxies. For example, the camera can be a very good option here, because it is already available in most laptops and can also be purchased at very low cost.
Previous research has already shown, and we have talked about it, that you can use the camera to record and monitor the heart rate, the heart rate variability and related features. So, this seems like an excellent solution. Another thing you can do is simply look at the webcam data and apply motion tracking techniques on it.
For example, you can do posture analysis and gesture analysis, all with the help of the 2D visual data. Of course, you will need to work a bit more on the software side, but working on the software side is easier and more accessible than working on the hardware and trying to make the hardware scale. So, that is one type of possible solution.
We also already talked about the fact that the camera can not only be used to monitor the heart rate and the heart rate variability, it can also be used to monitor the gaze patterns of the eyes. Hence, it can also, to a certain extent, replace the eye tracking devices and sensors that we have.
So, that is the conclusion of this part: if you have the possibility of making use of dedicated sensors, go ahead. Many times, your application domain will not allow it. Then you may want to use the existing sensors available in your system to track the behavioural data, or you may want to replace the dedicated sensors with sensor proxies, and hopefully that may work wonders for you.
639
(Refer Slide Time: 03:39)
So, that was about the sensors and what to do with the sensors that are available to us. Now, let us talk about accuracy as well. In general, we know that naturalistic affect detection has seen a lot of research and has improved a lot. But of course, there are still lots of problems associated with the affective computing domain and with the research and development in it.
One thing we already talked about: most of the time the sensors we require are intrusive. Many times they can be expensive, many times they can be noisy, and more importantly they are not scalable. Another thing, if we talk about the technical challenges, is that the detection itself can suffer from weak signals that are embedded in noisy channels.
Many times, for example, you are trying to look at particular data, but that data itself is surrounded by lots of noise. For example, maybe you are trying to capture the emotions in the voice, as simple as that: you are trying to capture the emotions in the audio modality of a user.
But we are no longer in a lab setting; we are in a naturalistic setting where there is a lot of noise around, and hence your target user’s voice data is getting embedded in, getting surrounded by, lots of noise. It becomes very challenging to segregate the voice of the user and do some analysis on top of it.
640
One thing we have also seen, even though we did not talk about it in much detail, concerns the machine learning and deep learning algorithms themselves. Most of the time, whatever we want to do in affective computing is going to rely heavily on machine learning and deep learning algorithms. And it is no surprise that they need a lot of adequate and realistic training data in order to make sense.
Many times, data associated with emotions is very difficult to even capture, let alone annotate, and hence our machine learning and deep learning models, and their accuracies, may suffer. Another thing that may happen is that, in affective computing, most of the time we are just looking at transient emotions.
But if you want to incorporate the context and the appraisals as well, this becomes a bit troublesome, because if you want to incorporate the context, the appraisals and, let us say, the user’s beliefs, desires and intentions, then it can become very tricky: you will have to be able to track lots of different things, which may be very difficult.
In general, whenever we are talking about affective computing, there is also a common mistake that researchers and developers in the community make: many times they use one term in place of another and vice versa. For example, when talking about the model of emotions, they do not discriminate properly between categorical models and continuous models.
Similarly, many times mood and emotion are not differentiated. You may refer to mood, which is a long-term affective state, while actually meaning transient emotions, or the other way around; you are just using one for the other, and hence there is a lack of clarity.
Nevertheless, even after this, if you are able to do some recognition and make use of some monitoring, then generalizability itself becomes a big issue. Generalizability is a problem for machine learning and deep learning algorithms in general, but it is a bigger problem for emotions and affective computing.
641
Because you may want to look into individual variability, which we talked about, and you may want to look at cultural variability and cultural differences. For example, at the time of the death of an individual, the way lamentation or sorrow is expressed in Indian communities is very different from the way it is expressed in communities outside.
So, of course, there are lots of cultural differences surrounding it. Then context, time, etcetera, also play a big role, and so generalizability becomes a big issue, especially when we talk about affective computing models.
So, the question we want to ask now is this: what type of accuracy, or to what extent of accuracy, is good enough for us to go ahead, deploy the system and make it work? In general, we already saw that affect detection itself is a very tricky problem.
And we can say quite confidently that it is very unlikely that we are going to have perfect affect detectors any time soon, detectors that are not only able to do a 100 percent classification of the emotions of a user in real time, but are also able to generalize well to new individuals, to the interactional context and to the very noisy situations around us.
642
So, it is not going to happen anytime soon. It will require lots of advancements, not only on the hardware side, but also on the software and the machine learning and deep learning side. So, the question is: when should we start taking the information we are getting from the emotion recognition systems and build systems that can adapt to it, or, in general, when do we want to close the loop?
There could be two possibilities now: either we wait until there is a perfect affect detection system in order to build an adaptive system on top of it, which is going to work perfectly fine, or we just go for a not-so-perfect affect detection system and try to make it work.
The thing is, since it is a very tricky problem, it may take a lot of time and resources to arrive at that situation. What counts as a moderate degree of recognition accuracy for emotions depends on the situation, the domain and the use case, and we will talk a bit more about it.
But the moment we are able to get a moderate degree of recognition accuracy, we believe that should be sufficient to create affect-aware interventions. Affect-aware interventions means adaptive interventions; we just saw some examples of them.
But of course, we have to take into account the fact that they should be fail-soft. Fail-soft in the sense that they should not do any harm, even if they are doing some adaptation which is based on an incorrect classification. So, that is a very tricky consideration that you want to look into. And when we say a moderate degree of recognition, then accordingly comes the severity of the interventions.
So, you want to make an adaptive system, but since the adaptive system is based on a not-so-perfect affect detection system, you may not want to put a lot of hard confidence in the severity of the interventions. You may want to take it with a pinch of salt: I got a particular classification, a particular emotional state of a user, based on whatever sensors, models or software I have deployed, and there is a possibility that it may not be correct. Hence, the adaptation that I am going to do is accordingly going to be not so severe; it could be
643
moderate. And then it can be calibrated to the extent that I am able to put confidence in my detection accuracy.
For example, imagine that in my particular use case I am able to make use of dedicated sensors, dedicated hardware, and state-of-the-art algorithms, and I know the detection accuracy is good or almost perfect in my case.
Then maybe you can put more confidence in the system, and accordingly in the adaptations that you are making, and so on. So, this is a very important thing to understand: you need to take into account the system that you are building, and the use case that you are building for, and accordingly you need to define what a moderate degree of recognition is for you and what the severity of the interventions should be. But the bottom line is that you may never want to wait for a perfect affect detection system in order to start building an adaptive system. So, I hope that is a bit clear.
Now, having talked about the sensors and about what accuracy is good enough, let us talk a bit about adaptiveness. The question we want to answer is: what should be the adaptiveness of the system? This is a very tricky question again. And the severity of the adaptive interventions,
644
as we already said, is determined on the basis of the confidence that you are able to put into the system.
But let us see what different levels of adaptation we can have in a system. To begin with, no surprise, we can have level 0 adaptation, which we also call no adaptation at all. No adaptation at all means that we do not expect the system to alter its behaviour in response to the emotional state, as simple as that.
Whatever the emotional state is going to be, we are going to monitor it, but we are not going to do anything about it; we are simply going to keep that information for some other analysis purposes. We are not going to let the system's behaviour be altered in response to the emotional state, and that is what is happening for most of the systems that we have.
For this, a predefined interaction script is mostly used by the machines and the services that we use; we just talked about the gaming examples. When you are sad, the non-playing characters around also show sadness.
But all of this follows a very predefined script. First, the system does not really know what your emotion is, and even if it knows, it decides not to do anything about it. So, this is the type of system where no adaptation at all is happening.
And most of the machines that we are happening interacting with the today or the services
including for example, you know the voice agents. Such as Alexa, Siri, most of the machines
are like that. Their adaptation is not based on our emotional state and many times our
emotional state is also not being monitored; so, that is the level 1, sorry level 0.
Then comes level 1. Level 1 basically does two things: first, it tries to monitor your emotional state, and second, it tries to recognize that there is a need for adaptation at a particular time. So, it tries to identify the time of the intervention or the need for the intervention. But of course, it just does that: it simply tries to identify that there is a need for adaptation, but it does not perform any adaptation at all.
And how will we identify that now there is a need for adaptation? It could be on the basis of many different metrics, and some metrics or indicators could be like this. For example, the system or the service that you are interacting with is able to understand that you are experiencing a negative emotional state, through your voice or through any other modality.
And now it knows that since you are experiencing a negative emotional state, you are feeling low, sad, bored and so on, which means there is a need for adaptation here. Similarly, change in the emotional state is another metric the system can look at: whenever there is a change in the emotional state of a user, maybe the system also needs to adapt to that. Of course, you are not doing the adaptation, as we already talked about; no adaptation is being performed at this stage, but you are able to identify the need for it.
Negative emotional state, we already talked about it. Changes in the emotional state: whenever you change your emotional state from one to another, say you are feeling happy and suddenly you start feeling sad, then maybe the system will get an alert that the individual was feeling happy and is now feeling sad. There is a change in the emotional state, and maybe this is the right moment to do some adaptation. But in level 1, of course, no adaptation is going to be done.
And then you can think about lots of other metrics. I will let you explore what other metrics you may want to base interventions or adaptations on. So, that is the level 1 system. Level 0: no adaptation. Level 1: it just recognizes the need for adaptation, but still no adaptation is performed.
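As a small illustrative sketch (not from the lecture), a level 1 check could look like the following; the emotion labels and the set of "negative" states are assumptions.

# Illustrative sketch of a level 1 system: it only flags *when* adaptation is needed
# (negative state, or a change of state); it never performs any adaptation itself.

NEGATIVE_STATES = {"sad", "angry", "bored", "frustrated"}

def needs_adaptation(previous_state: str, current_state: str) -> bool:
    """Return True if one of the level 1 indicators fires."""
    if current_state in NEGATIVE_STATES:          # indicator 1: negative emotional state
        return True
    if previous_state != current_state:           # indicator 2: change in emotional state
        return True
    return False

if __name__ == "__main__":
    print(needs_adaptation("happy", "happy"))      # False: nothing to flag
    print(needs_adaptation("happy", "sad"))        # True: negative state and a change
    print(needs_adaptation("neutral", "happy"))    # True: change of state only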
(Refer Slide Time: 18:09)
So, now we go to level 2, which is a bit more fascinating: single task adaptation. What does single task adaptation mean? A single task that is being performed during the entire cycle is adapted over time to optimize a particular metric. So, as before, we are first tracking the emotional state of the user, and we have a metric or indicator which tells us that we are recognizing the need for adaptation. Then, in level 2, on top of that we are also doing the adaptation in a single task. For this adaptation, we may want to look into which performance metric we really want to improve and how it can be improved by doing a particular type of adaptation. Let me make this clear with an example.
For example, let us say you are building a conversational AI agent and your aim is to make the user happy. At the end of the interaction, the user should be feeling happy, should remain happy, should feel happy about the interaction once it is over. So, you are monitoring the emotional state of the user, and whenever you see that the user is feeling a bit low, you may want to do a certain type of adaptation which will make the user happy, because your aim was to make the user happy.
Similarly, if you talk about affective learning systems, where your aim is to improve learning, you are tracking the emotional state of the user and your continuous aim is to keep the user engaged or to keep the user's boredom as low as possible, and so on. Accordingly, the type of adaptation that you would like to do will be the one that addresses that particular goal: maybe you want to make the user happy, maybe you want to make the user engaged, maybe you want to keep the user's boredom low. So, I hope this is clear: you are able to adapt with respect to the performance metric that you have kept as a goal for yourself.
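To make the single-task idea concrete, here is a tiny illustrative sketch (not part of the lecture); the effect estimates and response styles are made-up stand-ins for whatever model or heuristic your system actually uses.

# Illustrative level 2 sketch: one task (choosing the agent's response style) is adapted
# to optimise a single performance metric, here "user happiness".

ESTIMATED_HAPPINESS_GAIN = {
    # (detected_state, response_style) -> expected improvement in the goal metric
    ("sad", "encouraging"): 0.6,
    ("sad", "humorous"): 0.4,
    ("sad", "neutral"): 0.1,
    ("bored", "humorous"): 0.5,
    ("bored", "encouraging"): 0.3,
    ("bored", "neutral"): 0.0,
}

def adapt_response_style(detected_state: str) -> str:
    """Pick the response style expected to improve the goal metric the most."""
    candidates = {style: gain for (state, style), gain in ESTIMATED_HAPPINESS_GAIN.items()
                  if state == detected_state}
    if not candidates:                       # unknown state: fall back to a default style
        return "neutral"
    return max(candidates, key=candidates.get)

if __name__ == "__main__":
    print(adapt_response_style("sad"))       # -> "encouraging"
    print(adapt_response_style("bored"))     # -> "humorous"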
And what can these adaptations themselves be? They could be the result of some predefined behaviour. So, you are saying that my user is feeling sad, I am going to make the user happy. But what exactly will I do to make my user happy? That is the question. And in a level 2 system, this particular behaviour is often the result of a predefined behaviour.
For example, we already looked at the example of the Shimi robot. In the case of the Shimi robot, the robot was able to play some music as per the emotional state of the user, but the particular type of music and the gestures of the robot were predefined by the script that the developers had already built in.
Similarly, adaptation can also be the result of accumulated experience, or the learning that happens over a period of time. For example, while it is not yet happening to the extent that we want, in intelligent tutoring systems the system can record the entire experience of the user's interaction over a period of time, learn from that experience, and adapt on the basis of it.
It can learn, for example, what type of adaptation will work for this particular user. A very simple example is that of a teacher. When a teacher is taking a class for certain students, in the beginning the teacher may not have a very good idea of how to teach a particular topic to a particular student in order to make him or her understand. But over a period of time, the teacher, at least a good teacher, knows that a particular problem of a particular student needs to be addressed in such a way that it helps that student in a specific way.
So, the type of adaptation that the teacher does for a specific student is different from what the teacher does for another student, and this is the result of the learning the teacher has done over a period of time. This is what we are envisioning here: if the system can learn from the accumulated experience of interaction with the users, then it can do the adaptation not on the basis of some predefined behaviour, but on the basis of the learning that it is doing. And it is going to be very adaptive and personalized.
So, that is level 2 adaptation, which is single task adaptation, of course keeping a particular performance metric in mind. Now, level 3 is the next level of adaptation. In level 3, rather than doing the adaptation in a single task, where for example in the case of the Shimi robot the robot was perhaps only able to adapt its voice or only its gestures, we are talking about a set of different tasks happening during the processing cycle that can be adapted over time to optimize a particular performance metric. And this adaptation can be of many different types. For example, when there are multiple tasks that the agents, machines or services are doing, you can simply do a reordering of the tasks, or you can adapt the individual tasks that are happening in parallel.
A good example of reordering tasks based on adaptation would be this: maybe you are trying to create an intelligent tutoring system, and the moment the user logs in to learn a particular concept, the system feels that the individual looks a bit low on energy today. So, if the individual looks a bit low on energy, the system decides to teach topics that are easy to follow first, rather than starting with the hard topics, and maybe to teach at a very basic level. There are multiple ways in which this type of adaptation can be done.
So, both the reordering and the multiple tasks are being taken into account: you are not only adapting your teaching style, you are also changing the content that you want to teach, for example.
Or, the other way, you are simply adapting multiple tasks but not doing any reordering. For example, imagine that you are interacting with a chatbot, a conversational agent, and the agent is not reordering anything, it is following the order it is supposed to follow, but it is not only adapting its gestures in response to your emotional state, it is adapting its voice as well. So, there are at least two tasks. And maybe on top of it, the task it is supposed to perform, say you have a problem with your bank or something like that, it is also able to handle in a way that pleases you. So, there are lots of different tasks and processes happening, and the adaptation of all these multiple processes happens at the same time in this level 3 type of adaptation.
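A minimal sketch of the tutoring example above, assuming a hypothetical topic list and a hypothetical "low energy" label, might look like this (again, illustrative only, not the lecture's own code).

# Illustrative level 3 sketch: several tasks (topic ordering and presentation difficulty)
# are adapted together in response to the detected state.

def plan_session(topics, detected_state):
    """Adapt multiple tasks at once: reorder topics and pick a presentation level."""
    # Task 1: reorder the lesson plan -- easy topics first if the learner seems low on energy.
    if detected_state in ("low_energy", "bored"):
        ordered = sorted(topics, key=lambda t: t["difficulty"])
        level = "basic"                         # Task 2: also adapt how the content is presented
    else:
        ordered = topics                        # keep the planned order
        level = "standard"
    return [t["name"] for t in ordered], level

if __name__ == "__main__":
    topics = [
        {"name": "recursion", "difficulty": 3},
        {"name": "variables", "difficulty": 1},
        {"name": "loops", "difficulty": 2},
    ]
    print(plan_session(topics, "low_energy"))   # easy-first order, basic level
    print(plan_session(topics, "engaged"))      # original order, standard level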
Nevertheless, no matter whether you are adapting a single task or multiple tasks, whether you are reordering the tasks, whatever the type of adaptation, it all has to keep the performance of the system in mind. Basically, you have a particular goal: you want to make the user happy, you want to make the user feel fulfilled, you want to improve the learning of the user, whatever. You have these goals predefined, and all these adaptations are happening as per the broader goal that you have set for your system.
And in this case, rather than having a predefined script, you simply have adaptation that is the result of accumulated experience. So, it is very personalized and customized for each user: the adaptations are going to be different for each user depending upon their likes and dislikes. That is quite interesting.
Now, level 4 adaptation is very much like level 3, but there is a critical difference: the process of adaptation is carried out between multiple independent agents. Until level 3, we are assuming that we have a system, service or machine where there is only one agent with which we are interacting, and that agent is adapting a single task, or not adapting at all, or adapting multiple tasks that the agent itself is performing.
Here, we are saying that we are interacting in a multi-agent setting, or with multiple services at the same time. All these multiple services and agents are talking to each other and saying, you adapt this, you adapt that, and I will adapt this, and let us collectively make the user feel good about the entire thing.
For example, maybe you are playing a game and there are multiple characters in the game. When there are multiple characters, rather than one character adapting to you, all the different characters adapt to you, but they do it in sync by communicating with each other. This is really fascinating because now your user experience is being looked at holistically and comprehensively, and maybe it can provide a better experience overall.
And while doing so, the agents can communicate the different adaptations, which can then be applied individually within each agent. As I said, agent A can say to agent B, or service A to service B, that we have to do this, this and this; service A will do this type of adaptation and service B will do that type of adaptation.
Similarly, it could be agent A and agent B, machine A and machine B, or whatever different types of entities. For example, in gaming itself, say there are two different characters around you that are supposed to be helping you find some treasure. Agent A may say, the user is not able to find it, let us help the user find it: I am going to clear the path for the user, while you take care of the enemies that are on the path. So, different things can be split between them, of course depending upon the capabilities of the agents or the services.
And as I said again, creativity is the only limit on the type of adaptations that you can do for your machine. Also, when we are talking about these multiple agents, they can be both real and simulated, and of different types as well. We are talking about agents in games, but we can also have one embodied or animated agent, one robotic agent, and so on. Basically, all the different types of agents or services that we can envision can work together. Imagine you have a robot at your home, you have an Alexa, you have a Siri, and they are all talking to each other in order to make you feel happy; that would be really nice, for example.
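As an illustrative sketch of this communicated adaptation (not from the lecture), two hypothetical agents could split complementary adaptations by exchanging a simple message, for instance like this.

# Illustrative level 4 sketch: two independent agents agree on complementary adaptations
# by exchanging a message, instead of each adapting in isolation. Names, roles and the
# message format are all hypothetical.

class Agent:
    def __init__(self, name, capability):
        self.name = name
        self.capability = capability   # e.g. "clear_path" or "fight_enemies"
        self.current_plan = None

    def receive(self, message):
        # Apply the adaptation assigned to this agent by the coordinating message.
        self.current_plan = message[self.name]

def coordinate(agents, user_state):
    """One agent notices the user struggling and proposes a split of adaptations."""
    if user_state == "frustrated":
        plan = {a.name: a.capability for a in agents}   # each agent takes the task it is best at
        for a in agents:
            a.receive(plan)

if __name__ == "__main__":
    a = Agent("helper_A", "clear_path")
    b = Agent("helper_B", "fight_enemies")
    coordinate([a, b], "frustrated")
    print(a.current_plan, b.current_plan)   # clear_path fight_enemies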
(Refer Slide Time: 30:14)
So, that is what is known as communicated task adaptation. Now, let us look at the conclusion. When we looked at the open issues here, we already saw that we can use scalable sensors whenever it is feasible, such as cameras and microphones. We can also replace non-scalable sensors with scalable proxies, which can look at latent behaviour, such as the behavioural data we capture from keyboard typing and so on.
The camera is an ideal choice because it is already available in most laptop systems, and if not, it can also be purchased at very low cost. Motion tracking techniques can be applied to the video data, which also enables this kind of tracking. So, these are some of the things that you want to keep in mind as the conclusion of this discussion.
Affective Computing
Prof. Jainendra Shukla
Prof. Abhinav Dhall
Department of Computer Science and Engineering
Indraprastha Institute of Information Technology, Delhi
Week - 10
Lecture - 34
Tutorial Research Paper Discussion
Good morning everyone. I am Ashwini. I am a PhD scholar under Dr. Jainendra Shukla at
Human Machine Interaction Lab, IIIT Delhi.
(Refer Slide Time: 00:45)
There are several studies that have worked in this direction. Before going into the details of this paper, let us understand what empathy is, what factors affect empathetic responses, and what the different kinds and levels of empathy are. The literature has defined empathy in different ways.
For this study, the authors have adopted a definition in which empathy is defined as the responses of the empathizer towards the emotions of the target, which aligns with the definition given by Hoffman et al. There are two kinds of empathy: one is cognitive empathy and the second is affective empathy.
In cognitive empathy, the empathizer perceives the emotions of the target in a rational or logical manner, while in affective empathy the empathizer perceives the emotions of the target more emotionally, in a natural way. This could be considered an intrinsic empathy.
There are different factors that affect the emotional response, which includes the empathetic response. The first is the intrinsic features of the shared emotions. This in some way represents the nature of the target's emotions: what emotion is expressed by the target or the user, whether it is positive or negative, whether it is a strong or a subtle emotion, and what the salience of the expressed emotion is.
The second is the characteristics of the empathizer. This basically represents the personality of the empathizer: whether the empathizer is extroverted or introverted, and the gender, age, mood, etcetera of the empathizer all affect the emotional response.
The third is the relationship between the empathizer and the target. How well you understand the target depends upon the relationship you have with the target or the user; you may not respond in the same way to a stranger as to a friend. The fourth is the situational context: it depends upon when, where and how you respond to the user's emotion.
Empathetic behaviours have been categorized into two levels: parallel empathy and reactive empathy. In parallel empathy, you mimic the emotions of the target or the user; that means if the user is sad, you respond in a way that aligns with the user's emotions. In reactive empathy, you empathize with the user in such a way that the user's distress is reduced: you uplift the positive emotions in the user while reducing the negative emotional energy.
The existing literature has studied different kinds of empathy, parallel as well as reactive. Most works have focused on parallel empathy, where the emotions of the user are identified and the empathizer aligns with them. Even though reactive empathy has been studied, it has been limited to generating responses in the form of verbal comments.
Machine learning techniques have been extensively used for developing empathetic models. One of these is the companion-assisted reactive empathizer.
In this, the empathizer is developed in a virtual environment with human trainers. The human trainers interact with each other on a virtual interaction platform. During these interactions they exhibit emotions; these emotions are perceived by each of the partners, who react appropriately to the emotions of the other.
This has been used for training, which also involves physiological signals like heart rate, PPG etcetera from the interacting partners; these are used to understand the emotions of the partners and to generate a response in return. One of the drawbacks of this study is that we have to understand the different contexts, the different set of possibilities for these interactions, in order to predict the appropriate responses for each interaction session.
By looking into the literature, we can see that there are many gaps: most of these studies are specific to a particular context, they have considered only a specific kind of empathy model, for example either parallel empathy or reactive empathy, or they lack autonomous decisions by the empathizer.
(Refer Slide Time: 07:52)
In this study, the authors have tried to build an autonomous empathizer which perceives the emotions of the target and generates appropriate reactive or parallel responses to the user's emotions. To perceive the emotions of the target, the authors have relied upon facial expressions, which are one of the major channels through which emotions are expressed.
Coming on to the methodology, there are 3 stages in this process. The first is to detect the target's affective state. Once that is detected, we have to understand the target's perspective of the emotion. Once that is understood, the empathizer has to generate an empathetic response to that emotion. And this process goes on until the interaction ends.
This work has 3 different modules. One is the emotion detection module, where the emotions felt by the target are perceived using the target's facial expressions. The second module takes the detected emotion as the target's perceived emotional state. And finally, based on the detected emotions, the empathizer generates responses, empathetic responses to be specific, based on the emotions and their intensity as expressed by the user.
The first module, or the first step in this process, is to understand the emotions of the user. How will you understand the emotions? As I said, one of the major channels through which we express our emotions is the face, or facial expressions. To define emotions using facial expressions, Ekman proposed an emotional model with 6 basic emotions: happy, sad, fear, anger, surprise and disgust.
His emotional model is a categorical one, and for the purpose of this study the authors have used Ekman's emotional model. Ekman also showed that these emotions can be expressed using different facial muscles, through what are called facial action units. There are 55 facial action units, which can express different emotions through the activation, relaxation or contraction of these facial muscles.
From the video stream, these facial action units are extracted using the OpenFace toolkit and sent to a stacked auto-encoder model, and finally classified into the different emotions using a softmax classifier. Again, once the emotions are identified, we have to estimate the intensity of the emotion to generate appropriate empathetic responses, which is again done using a machine learning model consisting of a stacked auto-encoder and a softmax classifier. I am not going into the details of the architecture; you can find it in the paper.
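To give a rough feel for the data flow (AU features in, emotion label out), here is a small stand-in sketch; it is not the paper's architecture. A scikit-learn MLP replaces the stacked auto-encoder, and random numbers replace the OpenFace action-unit intensities; the feature dimension is an assumption.

# Illustrative pipeline sketch: AU feature vector -> neural classifier -> emotion label.

import numpy as np
from sklearn.neural_network import MLPClassifier

EMOTIONS = ["happy", "sad", "fear", "anger", "surprise", "disgust"]
N_ACTION_UNITS = 17   # hypothetical feature dimension standing in for the extracted AU set

rng = np.random.default_rng(0)
X_train = rng.random((300, N_ACTION_UNITS))             # fake AU intensity vectors
y_train = rng.integers(0, len(EMOTIONS), size=300)      # fake emotion labels

# Hidden layers loosely mimic an encoder; the softmax output comes from the classifier itself.
clf = MLPClassifier(hidden_layer_sizes=(32, 16), max_iter=500, random_state=0)
clf.fit(X_train, y_train)

frame_aus = rng.random((1, N_ACTION_UNITS))             # AUs extracted from one video frame
print("Predicted emotion:", EMOTIONS[clf.predict(frame_aus)[0]])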
Once we detect the emotions using the facial action units, the next step is to understand the emotional perception of the target. For representing the emotional perception of the target, we assign the detected emotion to the target, and this is considered the emotion expressed by the user with whom the empathizer is interacting.
The last step in this procedure is to generate the empathetic behaviour of the empathizer. The authors have defined both reactive and parallel empathy depending upon the intensity of the emotions expressed by the user. For subtle or positive emotions, parallel empathy is used, and for strong or negative emotions, reactive empathy is performed.
You can see the different emotional responses and the combinations used in these tables. For example, if the emotion is happiness and the intensity is high, then the empathetic response is parallel. Similarly, if the emotion category is positive and the emotion type is happiness or surprise, then based on the emotional intensity, whether it is weak or normal, the responses also vary.
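As a small illustration of this kind of mapping (the exact table in the paper may differ), the selection rule just described could be sketched as follows.

# Illustrative mapping: parallel empathy for subtle/positive emotions,
# reactive empathy for strong negative ones.

def choose_empathy(emotion: str, category: str, intensity: str) -> str:
    """category: 'positive' or 'negative'; intensity: 'weak', 'normal' or 'high'."""
    if category == "negative" and intensity in ("normal", "high"):
        return "reactive"      # try to reduce the user's distress
    return "parallel"          # mirror the user's emotion

if __name__ == "__main__":
    print(choose_empathy("happiness", "positive", "high"))   # parallel
    print(choose_empathy("sadness", "negative", "high"))     # reactive
    print(choose_empathy("sadness", "negative", "weak"))     # parallel (subtle emotion)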
Now, how should we generate the empathetic response? What changes have to be made in the response behaviour of the empathizer? Humans express their emotions in different ways: there will be some changes in their facial expressions.
The gestures they make also reflect the emotional intensity of the person, as do the pitch and tone of their voice and the intonation of their speech. Considering that, in this study the empathetic response of the empathizer is defined using various parameters.
One is the stiffness of the body movements or the joints of the interactive agent, the empathizer. The second is the pitch and intensity of the voice of the robot, and also the eye colour of the empathizer. These are used to respond with parallel empathy. If the empathizer is adopting reactive empathy, then the eye colour as well as verbal comments are used.
The empathetic behaviour provider module also defines the personality of the empathizer, whether the empathizer should exhibit introverted or extroverted behaviour. This depends upon the similarity attraction principle from psychology: usually, people respond positively to, or find it more interesting to interact with, people whose traits are similar to their own.
So, depending upon the personality of the target, the empathizer’s behavior is also changed,
and accordingly the speech, the eye color, the behavior, gestures, etcetera are also changed.
(Refer Slide Time: 16:07)
Coming on to the results. The facial emotion recognition module is trained on the RAVDESS dataset, which is a popular dataset used for emotion detection.
After this, a user study is conducted to decide whether the autonomous cognitive empathetic model is better than the existing models or not. In this, an interactive scenario is defined in which the participants, the targets, are shown videos of different emotional categories aligned with Ekman's basic emotion model.
These videos were selected from the America's Got Talent show, and during these interactions different emotions are elicited in the users, and a social robot, the Pepper robot in this case, is used to respond appropriately to the user's emotions. The study was conducted in two parts.
In one, the empathizer or social robot responds according to the autonomous cognitive empathetic model, and in the second part the robot responds according to a baseline, which is the basic empathetic model. The difference between the basic empathetic model, the baseline, and the autonomous cognitive empathetic model is that in the basic empathetic model only the eye colour and a verbal comment are produced by the empathizer.
In ACEM, the autonomous cognitive empathetic model, the empathizer changes the speech intonation, the stiffness of the body and the eye colour, as well as generating verbal comments, depending upon the emotions of the user and their intensity.
After this, the responses of the empathizer are evaluated on different parameters. One is intimacy: intimacy shows how sensitive the empathizer is towards the target or user. The second is emotional security: how confident and comfortable the participant feels in interacting with the empathizer, the robotic agent here.
The third is social presence: the users evaluated the empathizer based on its sociability, how well they relate to the robot as a social entity.
Next is perceived enjoyment, which shows whether the users enjoyed the interaction and what their feelings about it were, positive or negative. Next is perceived sociability; again, this shows how well the empathizer or robotic agent could be used in a social interaction. Trust represents how reliably the empathizer could respond to the user and what the user perceives about the integrity of the interactions.
Next is engagement. Engagement represents how well the interaction went and how engaged the users were in the interaction, such that the interaction could extend for a prolonged time.
(Refer Slide Time: 20:11)
So, the empathizer-target interactions were evaluated based on these parameters, and the results are shown in these tables. It is evident from the figures that the autonomous cognitive empathetic model performed far better than the basic empathetic model on most of these parameters.
Coming on to the contributions of this paper: this model has been seen to provide more effective interactions in terms of the intimacy and emotional security of the target. Participants found the interactive agent more social and considered it a social entity good for social interactions. They found that their emotions were understood better by the empathizer, and the study also showed that their emotional responses depended upon how well the empathizer responded to their emotions.
According to the empathizer's responses, their moods or emotions also varied. They enjoyed these interactions, they were more engaged, and they had trust in these interactions and in these interactive agents.
Coming on to some of the limitations of this study, there have been mixed responses on the behaviour generated by the empathetic agent. Some participants found that having more tactile responses, like hugging or a touch on the shoulder, might have made the interactions better.
And some thought that more expressions on the face of the empathizer, the Pepper robot, might have improved the interactions. Further, in this study only facial expressions were considered as an indication of the emotions or intent of the user; sometimes there may be other factors that represent emotions better. Considering a holistic approach to emotional perception would help in understanding the emotions of the target and reacting appropriately. And this method, in general, is restricted by the bottleneck of the performance of the facial expression detection algorithms.
(Refer Slide Time: 23:28)
In short, this paper explored an autonomous cognitive empathy model which can understand the emotional intents or moods of the user and generate an empathetic response appropriate to the user's emotions. The study also conducted experiments to validate these claims, and the proposed method was found to be effective in making robot interactions more engaging and affective.
For more details, you can refer to Bagheri et al’s paper.
Thank you.
Affective Computing
Prof. Jainendra Shukla
Prof. Abhinav Dhall
Department of Computer Science and Engineering
Indraprastha Institute of Information Technology, Delhi
Week - 11
Lecture - 01
Case Study: Emotional Virtual Personnel Assistance
Hi friends, welcome to this week's module, which is a Case Study. In this week, we are going to talk about how to develop an Emotional Virtual Personal Assistant as part of this case study.
So, basically, so far we have learned a lot about the theoretical aspects of affective computing. The idea of this week's lecture is to bridge the gap between theory and practice for you. Here we will try to understand how we can apply what we have already learned in the previous weeks to develop an application, a device, which is emotionally intelligent and can be used in real life.
I will try to provide as much code as possible as well, and hopefully you will get to play with it in your own time. So, here is the outline for this week's module. First, I will talk about the motivation behind the development of such an emotionally intelligent virtual assistant and what some of the challenges could be. I will talk about the methods and techniques using which such an assistant can be developed.
And I will talk about the ethical concerns around such a system when it is deployed in real life. I am glad to inform you that we have also managed to arrange an interaction with Dr. Aniket, who is a faculty member at Purdue University, and most of whose work relates to affective computing and emotionally intelligent machines. We will talk with him at the end of this lecture to understand how he has been investigating this area and developing some real-world applications from the research in his lab at Purdue. So, with that, let us get started.
(Refer Slide Time: 02:26)
So, the first thing that we want to understand is what exactly a virtual assistant is. A virtual assistant is something we have all seen, just like Alexa, Siri, Cortana, etcetera. The idea of a virtual assistant is that it interacts with humans and takes their voice as input, which could be triggered by some wake word such as Hey Siri or Hey Google. Since speech is the modality that we are taking as input, please pay attention that speech becomes the mode of communication. The next thing that happens is automatic speech recognition: the assistant tries to understand what the user wants to communicate. And this is where, in addition to automatic speech recognition, we will also try to understand the emotional context of the entire interaction between the agent and the human.
Once we have understood the intent and the emotions present in the speech, we will apply NLP and try to work out how to respond to the particular request that the user has made. Then there is, of course, a dialogue management system which helps create this interaction between the user and the agent. Once the agent is ready with the response, it converts whatever response it has in text format to speech and sends it back to the user. As simple as that: you may ask, for example, Hey Siri, how is the weather today?
And Siri would respond that the weather is sunny and warm, for example. This is where we want to insert the emotional intelligence component. So, what we are talking about here is trying to develop an Alexa, Siri or Cortana kind of assistant, but one which is also emotionally intelligent. I hope that is exciting and fascinating enough for you.
But if not, then let us look at what some of the use cases of such an emotionally intelligent virtual assistant could be. For example, this kind of assistant can be used in mental health and therapy, where it can help people with mental health disorders or emotional struggles.
This kind of agent can recognize the user's emotional state and provide guidance, coping strategies and emotional support in real time. On top of that, it can track the user's emotional state over a period of time and provide insights to therapists and other healthcare providers, or to nearby stakeholders such as relatives and friends.
So, that is one interesting application. Another application could be customer service and support. In this case, such an agent can be used to provide a more personalized and empathetic experience to the customers. The idea here is that, while interacting with the customer, such an agent will be able to recognize the user's emotions.
And it will be able to tailor its responses accordingly to provide more positive and satisfactory responses to the user. We have all seen how customer service agents interact with us, and it is definitely a very tiring job, hats off to the customer care agents. But of course, we can always do better in this area.
Another domain where it can be really interesting to have this kind of agent is education. In the education domain, such an agent can be used to help students learn more effectively. The idea is very simple: such an agent can try to identify when a student is struggling and needs additional support, and hence it can help provide targeted feedback and support to overcome whatever obstacles they are facing in learning a particular topic or concept. Another very important category could be entertainment. This kind of agent can be used in entertainment to create more engaging and interactive experiences, which can be in a lot of demand.
For example, imagine that you have a video game which uses this kind of virtual assistant. It can adapt to the player's emotional state and maybe provide a more immersive and adaptive environment for the user to enjoy. So, this can be a very interesting application.
Another interesting application that, frankly speaking, I and all my colleagues in academia would definitely appreciate a lot is a virtual assistant that can be used in the workplace to help employees manage their stress, improve communication and increase productivity.
It is very common for individuals working in a particular workplace to get overwhelmed from time to time with anxiety, stress and the many tasks at hand. So, the idea is: can this kind of system understand when the employee is feeling, for example, overwhelmed or frustrated, and maybe provide support and additional resources to help them become more effective and overall have a positive experience while doing their job? These are only a few pointers that I am mentioning; of course, there could be many more applications.
As for what else, I will let you think about what other domains you would like to use such an emotionally intelligent virtual assistant in. Imagine that you have an Alexa, a Siri or a Cortana which is also emotionally intelligent; what could you do with it? Creativity is the only limit here.
So, I hope you are feeling motivated enough to have such a system, to develop it and to go through this case study. Now, of course, such a system is not going to be easy to develop, so let me not raise your hopes too high; we will see what some of the challenges could be. By no means is this an exhaustive list; I am just trying to provide some pointers here.
The very first challenge is accuracy. We want such a system to be as accurate as possible, but this can be really difficult because there is a lot of variability in emotional expressions across individuals, cultures and contexts. And I hope you can see that the modality we are talking about here is the speech modality.
So, you may want to recall the concepts that you learned in the emotions in speech lecture by my colleague Dr. Abhinav. The idea is that there is a lot of variability here, and this variability can be due to many different factors which were discussed in that particular lecture.
For example, it could be due to speech patterns, accents and different dialects. And there is an even bigger challenge apart from dialects and accents: how can such a system detect sarcasm or irony in speech when it is communicating with humans?
We humans are very good at making sarcastic comments, and of course we use them very efficiently in our day-to-day communication. So, if we were to use such communication with an emotionally intelligent agent, how could the agent detect that kind of thing? The agent has to look into the tone, has to understand the context, and so many other things.
Nevertheless, accuracy remains a big challenge. Another big challenge of such a system is the operational requirement of real time: it has to be able to perform in real time, because if it is not real time then we will not be able to provide a seamless user experience. What does it require? It requires the processing and response times to be very efficient, and it has to be supported with high-performance hardware and software. So, there is a trade-off between what we want to invest and what we want to gain out of it. But the idea is that we should not compromise the user experience; otherwise the users will not interact with it and it will lose its purpose.
The third point is multi-language support. I think this is even more applicable to a country like India, where we already have 20-plus official languages. Unless and until we are able to cater to the needs of all the languages, we will not be able to reach a big portion of the individuals in the domain.
Another interesting challenge is, of course, bias and discrimination. It turns out that, like any other technology, the virtual assistant that we are going to develop may also reflect the biases of the individuals who create and train the models, and of the datasets, and this can potentially lead to unfair treatment of certain groups.
These biases, and the resulting discrimination, can be due to several factors. They can be due to data bias, for example: imagine we are recording data that only makes use of participants from the northern part of India.
Then of course, we may not be able to cater to the exact requirements of users from the southern part of India, for example. Another factor could be gender bias. What is gender bias? Imagine that we have data where all or a majority of the participants belong to one particular gender, say male.
And there are very few females in the data. Then of course, the model may have a very hard time trying to understand emotions from the other gender, or other genders, in this case. Similarly, we can have culture bias as well, which we already touched upon.
Culture bias is basically this: it turns out that there are certain regions where people may have a higher tone in comparison to other regions, and similarly there could be other cultural differences. So, unless and until we are able to understand the entire demography of the participants who are our target users, we will not be able to completely get rid of the bias that may be present in the system we are trying to develop.
And last but not least, privacy can be a big concern when it comes to the development of such a virtual assistant. The idea is that we should not collect data without the user's knowledge or consent. And even when we are collecting the data, we need to have very robust data encryption, secure storage of the data and secure transmission protocols.
We also need compliance with privacy regulations, not only at the national but also at the international level, such as the GDPR and HIPAA regulations that we have now. So, this is a very important requirement. These are some of the interesting challenges that can arise while developing such a system.
And of course, as I said, this is by no means an exhaustive list; there could be many other challenges. I invite you to brainstorm about it: what do you think could be further challenges when it comes to the development, usage and deployment of such an emotionally intelligent virtual assistant?
So, now we understand what exactly we are talking about: the development of an emotionally intelligent virtual assistant. In short, it is Alexa or Siri plus emotional intelligence. We already saw that it can have lots of use cases in education, entertainment, mental health, workplace productivity and so on, and we also looked at lots of challenges.
Having looked at all those things, now we can try to formulate some of the requirements that such a system may have. Please note that the exact requirements may vary from one case to another and will depend on the business use case that you are targeting, which may come from the managers or from the business stakeholders.
The very first requirement is that whatever system we are developing should be able to recognize emotions in speech, as simple as that. Let us say, for the sake of simplicity, that it should be able to distinguish at least the basic emotions of happy, sad, angry and neutral. So, we are just talking about the basic emotions rather than the entire spectrum of emotions, to keep our life easier. Of course, we have already gone through emotions in speech.
We are going to rely on that understanding a lot. If you have not gone through it, or if you do not recall some of the concepts, then I invite you to please go through and revise emotions in speech and maybe come back to this part later. That is the first requirement. The second requirement is that it should be able to take data as input in voice format: we do not expect the user to type as with a chatbot, but to say something using the speech modality.
So, it should be able to recognize the emotions in their speech, with the underlying assumption that it is able to use speech as an input. While doing so, the idea is that it should not only understand the emotions but also understand the user's intents, and it should be able to generate appropriate responses.
For example, maybe the user is asking something: the system should be able to understand what the user is asking, and there may be certain emotions attached to it, so both intent plus emotions. As of now, the virtual assistants that we see, Alexa, Siri, Cortana and so on, are mostly able to understand the intents, but they do not have much understanding of the emotions, or emotional intelligence.
And of course, it should be able to generate appropriate responses taking into account both the intent and the emotions. That is the difference between the agent we are trying to develop and the agents that have been developed in the past. We already established that unless it works in a real-time environment, or at least with near-real-time efficiency, it will not be able to serve the needs of the user.
Having raised the concern of privacy, we already understand that we should be able to protect the user data. Primarily we are talking about the speech data, and of course we may also have access to certain demographic information such as the user's identification information and location.
So, we should be able to protect the user data, speech and personal information. As I said, this is by no means an exhaustive list; we can keep adding requirements to it, and one nice requirement to have would be multilingual support. I would say that the top 5 are what we can term essential requirements, because without them we may not be able to launch even in a particular city or region.
And multilingual support is something we can call a good-to-have kind of thing. But of course, if you do not have it, we may not be able to cater to multiple users at a larger scale. Nevertheless, having the top five is a good start for us.
And of course, as I said, the exact specifications of such an agent may depend on the business use cases. I invite you to brainstorm about what business use case you are targeting and what additional requirements there could be for it.
So, we have now understood what exactly we want to develop: an emotionally intelligent virtual personal assistant, just like Alexa, Cortana or Siri, but with emotional intelligence. I will keep repeating it again and again so that you understand it very thoroughly. Now, let us dig into the more technical aspects of how exactly we can develop such an agent.
(Refer Slide Time: 20:21)
I think we already know the answer to it, but here we will try to put the different pieces together and hopefully it will make sense. So, here is a very basic skeleton code that I have tried to prepare, which gives a rough idea of the different steps that we are going to follow to develop such an agent.
This skeleton agent code has been developed in Python; I will provide you the code files as well, just so that you can play with it. For example, you are going to have to import some libraries, such as a simple speech recognition library. Then you are going to initialize the speech recognition and language identification objects, create some initial objects and do some initialization. You are also going to initialize the virtual personal assistant, depending upon the particular set of hardware and software you may be using.
Then, you are going to keep track of the user's personal preferences and the user's history over a period of time, and things like that; we are just going to call this the context. Basically, you would like to keep track of the user's context, hence you may have a context object as well, which you keep updating. Then there is something that is going to run in an infinite, continuous loop, which is why we have a while True loop. Inside it, we keep listening to the user's input, and once we listen to the user's input we get an audio recording of it.
Once we have the audio, then depending upon the functionalities that we have, maybe one of the first things we want to do is identify the language in which the user is speaking. Having identified the language, the next step would be to understand what emotion is present in this particular speech data.
We may want to convert the speech into text format to be able to do some more processing over it. Having identified the emotions, we also want to identify the intent, for example whether the user is asking something, saying something or commenting on something; so, you want to understand the intent of the user as well.
And then, please pay attention to this point, we take into account the intent, the emotion and the context. The context could be the user's preferences, the user's history over time, how the user has been interacting with the system; maybe the system is able to understand what sort of preferences the user is developing while interacting with the agent. So, that is the context.
So, basically, the appropriate response is now generated with the help of the intent, the emotion and the context. These are the three things that are really important for such an emotionally intelligent virtual agent. After that, once we have a response text, we can simply convert the text to speech, because we want to communicate back to the user in the form of speech. Once we have this response audio, we can simply play it on a speaker, and that is how the user is going to hear it. And then, if we have certain updates to the user's context, we do that update, and this can go on in the while loop.
Roughly speaking, this is the skeleton of our emotionally intelligent virtual assistant. As you can see, it is really simple, but it captures the gist of the entire system. I invite you to please go through the skeleton code that will be provided to you with this week's lecture module.
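Since the actual course files are not reproduced here, the following is a minimal sketch consistent with the steps just described. The speech_recognition import is a real library; every function marked as a placeholder is a hypothetical stand-in for the model you would plug in (language identification, emotion recognition, intent detection, dialogue management and text-to-speech).

# Minimal sketch of the emotionally intelligent virtual assistant loop (illustrative only).

import speech_recognition as sr

def identify_language(audio):        # placeholder: language identification model goes here
    return "en"

def detect_emotion(audio):           # placeholder: speech emotion recognition model
    return "neutral"

def speech_to_text(recognizer, audio):
    try:
        return recognizer.recognize_google(audio)   # one of several recognizers available
    except sr.UnknownValueError:
        return ""

def detect_intent(text):             # placeholder: NLU / intent classifier
    return "small_talk"

def generate_response(intent, emotion, context):    # placeholder: dialogue manager
    return "I hear you. How can I help?"

def speak(response_text):            # placeholder: text-to-speech and audio playback
    print("ASSISTANT:", response_text)

def main():
    recognizer = sr.Recognizer()
    context = {"history": []}                        # running user context
    while True:                                      # continuous interaction loop
        with sr.Microphone() as source:
            audio = recognizer.listen(source)        # 1. listen to the user
        language = identify_language(audio)          # 2. identify the language
        emotion = detect_emotion(audio)              # 3. recognise the emotion
        text = speech_to_text(recognizer, audio)     # 4. transcribe the speech
        intent = detect_intent(text)                 # 5. understand the intent
        response = generate_response(intent, emotion, context)   # 6. intent + emotion + context
        speak(response)                              # 7. reply in speech
        context["history"].append((text, language, emotion, intent))   # 8. update the context

if __name__ == "__main__":
    main()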
So, we have understood the top-level idea. Now, let us see what the end-to-end life cycle of this system will look like. It turns out that the end-to-end life cycle of such a system is not going to be very different from that of a typical machine learning or deep learning system, where we have two major components.
One is the machine learning aspect, and the other is the operational aspect, where we have the deployment and so on. We will talk about these aspects in a bit of detail, but basically, when we talk about the machine learning aspect, what does it mean? Of course, we need to have some data. We are going to work with the data, do some pre-processing, feature selection and extraction, all those kinds of things. Then we are going to create a machine learning model, which in this case is going to help us understand the intent, the emotions, and maybe the context as well to a certain extent.
Then we are going to register such a model and deploy it in a real system. Once we have done the deployment, we will have to keep monitoring the usage, the working and the functioning of the system. As and when required, we may want to re-train the system, and having re-trained it, we may create another version of the model. That is how the operational deployment goes on.
So, this is roughly what the end-to-end life cycle of our virtual assistant will look like. One aspect that we are not focusing on a lot here is the hardware aspect, because it is understood that, along with this software, we may want to give it the shape of a hardware, a tangible interface, which could be in the form of, for example, a Google speaker or any device sitting on your table.
Nevertheless, there is going to be a hardware plus software component. Here we are focusing more on the software, but you can easily integrate the hardware development part into it as well. So, now we have understood the end-to-end life cycle. The very first step is the preparation of the data; then come the second, third, fourth and fifth steps, and of course the sixth and seventh steps as required. So, let us talk about the very first step, which is the data, or the preparation of the data.
(Refer Slide Time: 26:52)
So, ok, before I discuss it, what do you think the type of data that we will need could be? Our idea is that we want an agent which is able to recognize the emotions in speech. If you recall your concepts and understanding from the emotions in speech module, then what do we need? We need a dataset of audio recordings and, hopefully, corresponding emotion labels.
And there are some existing datasets. For example, we have talked about the EmoDB and RAVDESS datasets, so maybe we can make use of either of these. Apart from this, there is the IITKGP dataset also, which was discussed in the class, and you may want to use that dataset as well. Because EmoDB and RAVDESS, for example, do not cater to the Indian population, since they have been recorded mostly using western participants.
Nevertheless, for the sake of simplicity, let us assume that we take one dataset, the EmoDB dataset. What exactly will the EmoDB dataset, or whichever dataset we are going to use, have or need to have? Of course, it needs to have the text data and the speech data, labeled with the emotion categories.
For example, imagine that we have an audio signal which is saying, I am so happy to be here. There is an audio, there is a corresponding text, and then of course there is an emotion label associated with it, which is happy. Similarly,
we have another audio signal; I am feeling really sad today. There is an audio, there is a
corresponding text, and there is an emotional label to it, which is sad.
Similarly, for example, we have another audio signal, I am so angry with you right now: audio, corresponding text, and then the emotion label angry. And then, just for the sake of completeness, there is another type of audio signal, which could be, for example, I do not really care one way or the other. It is something like a neutral tone. So, you have speech data which says, I do not really care one way or the other.
Then you have the text data corresponding to it, and then you have an emotion label attached to it, which is the neutral label. So, this is the sort of dataset that we are going to have. Of course, the exact choice of the data will depend on many different things.
Maybe you may want to curate your own dataset as well, but I hope you understand that curating your own dataset can be a significantly time consuming process and will require significant resources; depending upon your needs and requirements, you may want to use the existing datasets. Just for the sake of this case study and example, we are saying, ok, let us just go ahead with using the EmoDB dataset.
And hence, what would we like to do? We would like to load this data and maybe process it to a certain extent. So, this is the simple skeleton code; again, as I said, I will be providing you the code. Essentially, what we are doing in this particular piece is that we are importing certain libraries. You may already be familiar with some of them.
numpy and pandas are basic data processing libraries, librosa is for audio, and os lets us interact with the operating system, and so on. The idea is, of course, that we are going to have a path to the dataset; maybe it is going to be in your online drive or maybe it is going to be on your own file system.
Then of course, there are the labels; we already agreed that we are going to have only four emotions for now: happy, sad, angry and neutral. Since we are talking about speech data, there is going to be some sampling frequency to it; we are going to limit the sampling frequency, let us say, to 16 kilohertz, that is, 16,000 Hertz. Then we are going to have a function to pre-process the audio files that we are going to capture.
So, what type of pre-processing can we do? We can simply re-sample the data from whatever frequency it had to the target frequency, which could be, for example, 16,000 Hertz. We can normalize the file. If you remember normalization, the idea is that if the data is coming from different users, we may want to normalize it so that we are able to make a fair comparison across different users. So, that is how you are going to have a normalized, pre-processed version of the audio file; that is the function to pre-process the audio files.
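As a minimal sketch of that pre-processing step, assuming librosa is available, something like the following could do the resampling and a simple peak normalization; the function name and target rate are illustrative choices rather than the lecture's exact skeleton.

```python
# Resample to a target rate and peak-normalize an audio file (sketch).
import numpy as np
import librosa

TARGET_SR = 16000  # 16 kHz target sampling rate

def preprocess_audio(path, target_sr=TARGET_SR):
    # librosa resamples to target_sr while loading
    y, sr = librosa.load(path, sr=target_sr)
    # peak normalization so recordings from different users are comparable
    peak = np.max(np.abs(y))
    if peak > 0:
        y = y / peak
    return y, sr
```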
Next, we are going to have another function which is going to load the EmoDB dataset. We assume that we are using the EmoDB dataset; you can replace it with any other dataset of your choice. Of course, we are going to have the folder of all the audio files and the emotion labels. And for each audio file in that folder, we simply get y, that is, its samples, and then the corresponding label. So, the audio file and the corresponding label.
So, basically y and the label. And then of course, we can simply append to the data that we have and create a matrix-like structure for it. And this is the final data that we have loaded from the dataset that was available in a particular folder, on a drive or on our own system, ok.
So, now, when we call the main function, what is happening? The very first thing is that we are going to load the EmoDB dataset. Once we have done the loading, we are going to shuffle the dataset. If you recall, shuffling the dataset is basically done to avoid any sort of ordering bias that can be there.
And then we simply shuffle it and have a good mix of the data that is available to us. Next, of course, we are going to split the dataset into the training and the validation sets, or training and testing sets, for the sake of simplicity. For example, in this case we decided to use 80 percent of the data for training, and the remaining 20 percent will go to the validation or testing set, ok.
Just to make our life easier, we are going to convert the entire dataset into a pandas data frame, so that we have better manipulation control over the data. We simply converted the entire thing into data frames: the training part we call train_df and the testing part we call validation_df.
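A minimal sketch of this loading, shuffling and 80/20 splitting step might look like the code below. The folder path is a placeholder, and the label mapping assumes EmoDB's file-naming convention of encoding the emotion in the sixth character of the file name (for example 'F' for happy, 'T' for sad, 'W' for angry, 'N' for neutral); please verify that against the dataset documentation before relying on it.

```python
# Load a labeled speech-emotion dataset, shuffle it and split 80/20 (sketch).
import os
import pandas as pd
import librosa

DATASET_PATH = "data/emodb/wav"          # placeholder path
EMOTION_CODES = {"F": "happy", "T": "sad", "W": "angry", "N": "neutral"}
TARGET_SR = 16000

def load_emodb(dataset_path=DATASET_PATH):
    rows = []
    for fname in os.listdir(dataset_path):
        if not fname.endswith(".wav"):
            continue
        label = EMOTION_CODES.get(fname[5])      # 6th character = emotion code (assumed)
        if label is None:
            continue                             # skip emotions we are not using
        y, _ = librosa.load(os.path.join(dataset_path, fname), sr=TARGET_SR)
        rows.append({"samples": y, "label": label})
    return pd.DataFrame(rows)

if __name__ == "__main__":
    data = load_emodb()
    data = data.sample(frac=1.0, random_state=42).reset_index(drop=True)  # shuffle
    split = int(0.8 * len(data))
    train_df, validation_df = data.iloc[:split], data.iloc[split:]
    print(len(train_df), "training clips,", len(validation_df), "validation clips")
```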
So, this is just a skeleton code; of course, you may want to tweak it here and there in order to suit your needs, right, perfect. Now this caters to the first part of the life cycle, which is the preparation of the data. Next, we want to explore the data.
When we say that we want to explore the data, what it means is that we may want to understand what sort of feature selection and feature extraction can be done. For now, for the sake of simplicity, we are restricting ourselves to a very simple machine learning pipeline rather than going into a deep learning pipeline.
More or less the steps are going to be the same. It turns out, if you recall your emotions in speech lecture, that there are lots of relevant features for emotion recognition. We may want to use some of them, such as the MFCC coefficients. We can look at prosody features such as pitch, loudness and speech rate; we can look at some of the spectral features, such as spectral centroid, spectral flux, etcetera.
We can also look at some of the time domain features. Now, what exactly do these features represent? You may want to refer to the emotions in speech class to have a better understanding of it. The idea is that we may want to extract certain features which are going to be relevant for the recognition of emotion in the speech data, ok.
So, let us say that we just identified a list of relevant features. Then of course, once we have certain features, which could be huge in number, what do we want to do? We may want to perform feature selection to reduce the dimensionality of the feature set and to optimize the performance. And there are lots of ways in which we can reduce the dimensionality of the feature space.
For example, there are PCA, LDA, mutual information based feature selection methods, recursive feature elimination; there are many. We may want to use one or the other depending upon our specific requirements. A detailed discussion of all these techniques is beyond the scope of this class.
But of course, I would request the interested learners to look at these techniques and understand the pros and cons of one technique over the other when we want to do feature selection. For the sake of simplicity, we are just going to use, let us say, PCA for the feature selection and the reduction of the dimensionality.
Of course, it is important to have a balanced number of features in accordance with the model that we are going to choose. Because it turns out that if we have a very small number of features and we have chosen a very complex model, the entire thing is going to result in overfitting.
We want to avoid that. Of course, the basics of this can be understood in a common machine learning or deep learning course. So, to summarize, we have already identified the data. Now, we are trying to identify some of the features that we can use and the feature selection technique that can be used to reduce the dimensionality.
So, as I said, we are going to use these features, and for the sake of simplicity we are going to use PCA. So, let us try to see a skeleton code which is going to help us do what we want to do for the feature exploration. Of course, these are some initial libraries that we are going to use.
And then these are some of the parameters that we have to define in order to initialize certain components related to the different features and the feature selection technique, that is, the PCA that we have chosen. The exact understanding can be gained by going into the details of a particular feature or a particular technique.
So, once we have done that, let us try to define different functions that can extract the different features for us. One simple feature could be the MFCC feature, as we agreed, so we defined an extract MFCC function for that. Another function is to extract the prosody features from an audio signal, which is going to give, for example, the pitch magnitude.
Again, there are some other features, and then we are going to have another function to extract spectral features from an audio signal. These are again some spectral features that we have already defined, which we can calculate from the same audio input. And then we can have some time domain features from the given audio signal.
So, please pay attention that these are skeleton codes; you may want to tweak them or add more details to them in order to make them executable. I will be providing you all these codes for your understanding, ok. So, once we have defined how to do the feature extraction and the feature selection, next we can simply do the feature extraction and selection in a main file, let us say. So, we can simply load a particular audio file.
We extract the MFCCs and the prosody features using the earlier defined functions, and similarly the spectral features and the time domain features. We simply concatenate all the features into a feature matrix and we can apply PCA on top of it.
And then this features_pca is going to represent the reduced-dimensionality version of the entire feature set that we have selected. And how many components? Since we have defined 20 components, this is going to reduce the dimensionality to the 20 principal components of the PCA module.
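Along the same lines, here is a minimal sketch of per-clip feature extraction (MFCCs, a pitch track as a prosody proxy, a few spectral descriptors and a time-domain feature) followed by PCA. The particular feature set, the summary statistics and the 20-component setting are illustrative choices and may differ from the lecture's own skeleton.

```python
# Per-clip feature extraction followed by PCA dimensionality reduction (sketch).
import numpy as np
import librosa
from sklearn.decomposition import PCA

SR = 16000
N_MFCC = 13
N_COMPONENTS = 20

def extract_features(y, sr=SR):
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=N_MFCC)       # MFCCs
    f0 = librosa.yin(y, fmin=65, fmax=400, sr=sr)                # pitch track (prosody)
    rms = librosa.feature.rms(y=y)                               # loudness proxy
    centroid = librosa.feature.spectral_centroid(y=y, sr=sr)     # spectral centroid
    rolloff = librosa.feature.spectral_rolloff(y=y, sr=sr)       # spectral roll-off
    zcr = librosa.feature.zero_crossing_rate(y)                  # time-domain feature
    # summarize each time-varying feature by its mean and std over the clip
    parts = [mfcc, f0[np.newaxis, :], rms, centroid, rolloff, zcr]
    return np.concatenate([np.hstack([p.mean(axis=1), p.std(axis=1)]) for p in parts])

def reduce_dimensionality(feature_matrix, n_components=N_COMPONENTS):
    # feature_matrix has one row per audio clip
    pca = PCA(n_components=min(n_components, *feature_matrix.shape))
    return pca.fit_transform(feature_matrix)

if __name__ == "__main__":
    clips = [np.random.randn(SR * 2) for _ in range(50)]   # stand-in for real audio
    X = np.vstack([extract_features(c) for c in clips])
    X_pca = reduce_dimensionality(X)
    print(X.shape, "->", X_pca.shape)
```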
Perfect, so, I hope that now the second step is also a bit clear. To summarize, we have prepared the data and we have already explored the features that are there. Next, we may want to dive a bit into the model.
(Refer Slide Time: 39:09)
So, let us try to look into the model now. Basically, for the model, we are trying to club two steps together, the model and the registry; we will talk a bit more about it. When it comes to the model, of course, we are talking about a machine learning or a deep learning model. The very first thing we have to do is define the machine learning model architecture.
It could be as simple as that. For example, you may want to define the type of the model first; you may want to make use of neural networks. Then of course, you will have to define how many layers, how many neurons, and what the input layer and the output layer would look like, and so on and so forth. So, that is how you first define the machine learning model architecture.
You are going to look at what the loss function, the optimizer and the evaluation metrics could be for a particular problem. Accuracy can be one metric, which is simple to compute; F1 score could be another metric. So, you can make use of one or the other depending upon the specific requirements.
Then of course, you are going to do the training of the model on the training set that we split before. We are going to evaluate the model's performance using the evaluate function on the validation set, so that it does not introduce any bias into the estimate of the model's performance, ok. And then, depending upon the response we are getting, we can tweak the model architecture.
For example, reduce the number of layers, increase the number of layers, reduce the number of neurons, increase the number of neurons, and so on. Of course, we may also want to tweak the hyperparameters: for example, what should the learning rate be? Can we reduce the learning rate? Can we increase it? And of course, the dataset itself: maybe you want to use the entire dataset, you may want to use another dataset, or you may want to use multiple datasets.
So, all these are questions that you will have to answer while tweaking the performance of the model. And then you will have to repeat steps 4 and 5 again and again until you arrive at a satisfactory performance. Of course, what can we call a satisfactory performance?
So, for example, if we are talking about a four-class classification problem here for our emotionally intelligent virtual agent, the random chance of correctly identifying the emotion that is there in the audio is 25 percent, right; four classes, 25 percent chance each.
So, we would like to have at least, say, 75 percent to begin with. And then, ideally, we want to have as high a performance as we can without a lot of variance in the performance. So, you will have to look into the model architecture to understand how this can be achieved.
Once we have done the training and the validation of the entire thing, we may have a separate testing set where we can get an unbiased estimate, and this is important, of the model's performance. This is what we are going to report to the external stakeholders. Once the model training and testing are done, we may want to save the trained model.
And when we say we want to save the trained model, we mean the model and its parameters, using some save function; we will see that in a bit. And it turns out that whatever models we save, we have to register them with some model registry. Without going into too much detail, what is model registration? We want to register the model and its metadata, along with its performance, into some repository.
So that we can have better version control over it. And this is essentially a step before doing the deployment of the model. The model registry allows you to track the different versions of the model and, if required, replace one version with another depending upon how it is performing in the real deployment. So, let us see a skeleton code which is going to do these eight steps for us in short, ok.
So, just for the sake of simplicity, let us say that we defined a neural network model architecture with some layers, with ReLU as the activation function and a softmax output; for the exact details, of course, you can look into the architecture of the neural network.
Once we have defined it, we are going to compile the model by defining what the loss function is going to be, what the optimizer is going to be and what the metric for the accuracy will be. Of course, the next step is going to be training the model on the training dataset.
These are certain parameters that are required for doing the training of the model, and then you are also passing the validation data on which you want to tweak the hyperparameters of the model. You are going to look at the evaluation of the model on the validation set.
You may obtain a certain accuracy. Once you have obtained it, there is, as we said before, the repetition of steps 4 and 5 until we achieve a satisfactory performance. Once it is achieved, the next thing is that finally you are going to report the test accuracy on the test set.
So, ok, just to make it clear, you may want to split your data into three different sets: training, testing and validation. Usually, for example, the simple rule of thumb is 70, 20, 10; basically you have 70 percent of the data for training, 20 percent for testing and 10 percent for validation, or you may want to interchange one split with the other, right.
So, that is how you create three different sets from one given dataset, ok. All of this is basically done to get an unbiased estimate, and the details of it can be understood by taking any machine learning course. So, once we have created a particular model, this is what we are saying: our model is now ready here.
So, you may want to save the model to a certain path. This is where you are going to dump the model; we are actually saving the model in pickle format. And once you have dumped the model, maybe you want to register it with a model registry.
You may use any model registry. For example, this is your model registry URI; there is going to be some metadata associated with it, such as the model version and the accuracy of the model that you just obtained. And then these are certain other parameters that you require to register the model into some registry, right, ok.
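As a rough sketch of these steps, assuming TensorFlow/Keras, the code below defines a small network, trains and validates it, reports test accuracy, saves the model and records an entry in a very simple JSON-file "registry". A real deployment would typically use a dedicated model registry tool, and the lecture's skeleton saves with pickle; the Keras native save format, the JSON file, the layer sizes and the random stand-in data here are purely illustrative.

```python
# Define, train, evaluate, save and "register" a small emotion classifier (sketch).
import json
import numpy as np
import tensorflow as tf

NUM_CLASSES = 4          # happy, sad, angry, neutral
NUM_FEATURES = 20        # e.g. the 20 PCA components from the previous step

def build_model():
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(NUM_FEATURES,)),
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(32, activation="relu"),
        tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

if __name__ == "__main__":
    # random stand-in data; replace with the real PCA features and labels
    X = np.random.randn(500, NUM_FEATURES).astype("float32")
    y = np.random.randint(0, NUM_CLASSES, size=500)
    X_train, y_train = X[:350], y[:350]       # 70 percent train
    X_test, y_test = X[350:450], y[350:450]   # 20 percent test
    X_val, y_val = X[450:], y[450:]           # 10 percent validation

    model = build_model()
    model.fit(X_train, y_train, epochs=10, batch_size=32,
              validation_data=(X_val, y_val), verbose=0)
    test_loss, test_acc = model.evaluate(X_test, y_test, verbose=0)

    model.save("emotion_model.keras")         # Keras native format (recent TF versions)
    registry_entry = {"model_path": "emotion_model.keras",
                      "version": "1.0.0",
                      "test_accuracy": float(test_acc)}
    with open("model_registry.json", "w") as f:   # toy stand-in for a model registry
        json.dump(registry_entry, f, indent=2)
```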
So, roughly, this is what the skeleton code will look like for our emotionally intelligent virtual assistant. Now, let us quickly go back. Where are we? We already completed the data preparation, we already explored the features, and we sort of clubbed the model building and the registry together. Now, we want to look a bit into the deployment and, of course, the monitoring that goes further, ok. So, the model and its registration are done for us; ok, deployment.
(Refer Slide Time: 46:34)
So, of course, when we are talking about the final deployment, it can be done on a particular piece of hardware. You can keep it on a laptop itself, you may want to keep it on a mobile phone, or you may want to create a dedicated tangible device, for example, which you can place on your table or on the user's table.
So, assuming that you decide on a particular piece of hardware, which could be, for example, as simple as a Raspberry Pi, you have to take into account that it should have sufficient processing power; we are talking about running some machine learning models on it. And it should of course have sufficient memory and storage, that is, RAM and storage, to run the model that we are trying to create.
So, having chosen the TensorFlow Lite framework, what do we want to do next? We want to optimize the entire model by converting it into the TensorFlow Lite format, and maybe we may have to quantize it to reduce its size and computational requirements, depending upon whether, for example, it fits on the hardware, whether it gives real-time performance, and so on. These are really tricky steps and you will have to fine-tune them depending upon certain things.
Then of course, once you have converted it into the TensorFlow Lite format, you want to load that optimized model onto a Raspberry Pi, and then you want to interface it with the Raspberry Pi's microphone and, of course, speaker. A Raspberry Pi can be connected to both a microphone and a speaker, so it can work roughly as good hardware for you.
So, having loaded it onto the Raspberry Pi, next we may want to deploy the model on the Raspberry Pi, and then we want to test it by running different audio inputs through it and checking whether the output it gives is the output that we wanted to have, comparing it with the ground truth in the real environment. So, this is just a rough idea of how the deployment will look. Of course, please note that we have been focusing solely on the emotions now.
More or less the same kind of software development cycle can be followed for the intent, which I am not talking about because this is something that Alexa, Siri and Cortana have already been doing. And the other thing that we have not specifically talked about is the context.
Context is something that has also been integrated, with a certain proficiency, into these existing virtual assistants. So, we have not been talking about the intent and the context, but more or less the idea is that in the same way that we are going to understand the emotions, we are going to understand the intent and the context, and we will be referring to the bigger block there. While generating an appropriate response we will be looking at the emotions, the intent and the context all together. Roughly, I wanted to give you a brief idea of how an end-to-end deployment will look.
(Refer Slide Time: 50:09)
So, here is a skeleton code for this real-life, real-time deployment; again, I am saying that this is just a skeleton code. Basically, you may want to check the Raspberry Pi, or any hardware that you are using, for its power, memory and storage. And once you have this, you may want to optimize the model by using TensorFlow Lite.
So, simply load the model. After loading the model, you convert it and save the optimized model with some other name. And then, of course, you are going to load this model onto the Raspberry Pi with some interpreter. The next thing is that you may want to run and test it on the new data that is coming in, the model having already been trained on existing data.
So, you may want to define an audio pre-processing function; you can reuse the pre-process audio function from the skeleton code we already looked at, which does normalization, re-sampling and all those things. We have a function for that. We can define another function to make predictions using the loaded model for new data. Simply, we can collect the audio.
We can do some pre-processing as required. And basically, this is where we are going to load the optimized model and run the inference on it. We are going to get the output in a tensor format and then convert it into some readable or interpretable format, and this is what the output of the predict emotion function will look like. Once we have the emotion output from this predict emotion function,
in a similar way we can have another function such as identify_intent, for example, which we already had, and similarly something like an identify_context function. All together, these three functions will come together. And then what can we do? We can have a prediction of the emotion, and similarly a prediction of the intent and of the context. And then, rather than just returning the predicted emotion, maybe we want to generate the appropriate response, as we saw in the very first skeleton code. That is what we are going to return, and that is how we are finally going to test the model in real time, right.
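For completeness, here is a minimal sketch of that conversion-and-inference step: converting the saved Keras model to TensorFlow Lite, optionally enabling the default optimizations, and running a single prediction with the TFLite interpreter. On a Raspberry Pi one would typically install the lighter tflite-runtime package instead of full TensorFlow; the file names, the emotion list and the feature size are illustrative.

```python
# Convert a saved Keras model to TFLite and run one inference (sketch).
import numpy as np
import tensorflow as tf

EMOTIONS = ["happy", "sad", "angry", "neutral"]
NUM_FEATURES = 20

# 1. Convert the trained Keras model to TFLite and save it
keras_model = tf.keras.models.load_model("emotion_model.keras")
converter = tf.lite.TFLiteConverter.from_keras_model(keras_model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]   # optional size/speed optimization
tflite_model = converter.convert()
with open("emotion_model.tflite", "wb") as f:
    f.write(tflite_model)

# 2. Load the optimized model with the TFLite interpreter and run inference
interpreter = tf.lite.Interpreter(model_path="emotion_model.tflite")
interpreter.allocate_tensors()
input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

def predict_emotion(feature_vector):
    x = np.asarray(feature_vector, dtype=np.float32).reshape(1, NUM_FEATURES)
    interpreter.set_tensor(input_details[0]["index"], x)
    interpreter.invoke()
    probs = interpreter.get_tensor(output_details[0]["index"])[0]
    return EMOTIONS[int(np.argmax(probs))]

print(predict_emotion(np.random.randn(NUM_FEATURES)))   # stand-in for real features
```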
So, that is, roughly and in a very simplistic way, the deployment of a model on a particular piece of hardware such as a Raspberry Pi. I will be providing you these codes; hopefully you will be able to go through them. These are just two examples, just skeleton codes, so please feel free to tweak them as you wish and feel free to play with them.
And I will be very happy to see if you are able to integrate all these pieces together and come up with a real, nicely working emotionally intelligent virtual assistant for yourself, maybe, ok. So, with that, let us move to the next module.
Affective Computing
Prof. Jainendra Shukla
Prof. Abhinav Dhall
Department of Computer Science and Engineering
Indraprastha Institute of Information Technology, Delhi
Week - 11
Lecture - 02
Ethical Concerns
So far, we tried to understand in a very simplistic fashion what methods and techniques can be used to develop such an emotionally intelligent virtual agent. And I believe the possibilities are endless, of course, in their development as well as in their deployment in the real world.
Nevertheless, when it comes to the usage of such a system, we always have to be very cautious about it. In the next week we will be focusing on the ethical aspects of affective computing in general. Nevertheless, I wanted to devote some time to what could be some of the ethical concerns related to this emotionally intelligent virtual agent, which you may develop and may decide to use in some real-life settings as well, right.
So, let us go ahead with that. Of course, the first ethical concern, which we also highlighted in the beginning of the module, is that of privacy. It turns out that any such emotionally intelligent virtual agent that you are going to develop, be it used in a mental health session by a doctor, or in an education domain with some students, or, let us say, in a workplace environment with some employees,
will be collecting lots of personal data, because unless and until it has access to the personal data it will not be able to personalize the interaction, right. You may remember that we were talking about the context; this context comes with an understanding of the user's behavior, the user's personal preferences and all that.
So, it will be collecting lots of personal data. But of course, the problem is that if we do not have proper regulation over it, then the virtual agent that we are developing can be used to collect this kind of data, which can then be sold for commercial purposes, and, more importantly, without even the user's consent, right.
So, that is the big challenge, and of course we do not want to invade the privacy of a user; we would like to respect that. So, maybe this is one of the important things that you would like to keep in mind. Another ethical concern is that of emotional manipulation. And this is common not only to the emotionally intelligent virtual agent, but it is going to be common across all affective computing applications.
So, the idea is that such an agent will be able to understand your emotions and respond to your reactions in an emotionally intelligent fashion, and in turn it may affect your emotions. For example, you may be feeling sad and it would like to make you happy, as simple as that. And of course, it may want to nudge you towards some behavior, and down the line maybe some beliefs can also be changed.
And it all may happen without your knowledge or consent, and we have to understand to what extent that is acceptable. Because there is a manipulation of emotional aspects going on; it was maybe not a concern with AI before, because there was not a lot of talk about emotional manipulation, but now, since the manipulation of emotions is happening, we need to take a look at this as well.
Bias and discrimination: as with any AI system, these affective computing systems, or in this case this particular emotionally intelligent virtual agent, can also have lots of bias and discrimination built in. We did talk about certain things in the beginning; I would like to emphasize them again. Imagine if you were to develop such a system: it may not work for, for example, one particular group or one particular ethnicity.
Imagine that you are using data from a western population; it may not work for the Indian population. Or if you are using data only from one part of India, it may not work for users from another part of India; data collected for one language may not work for a second language, and so on, right. So, there is going to be a lot of discrimination, and access can be hampered in that way.
Of course, there are certain other consequences that we may not have imagined which can also come up, such as, for example, addiction to or dependence on the virtual agent. So, what is the idea of this agent? You want to create an agent which can, for example, help in the emotional management of an employee or a student in a particular setting.
Now, it may turn out that it is like candy. Candy is sweet and you develop some addiction to it; whenever you have the urge, you just take a candy. So, maybe now you will try to find the same type of solace in this kind of virtual agent as well.
So, whenever you are in need of some emotional relaxation, you may want to just talk to this virtual agent only, which may not be a very good idea to begin with. Because, first of all, it is going to cut you off from social contact, and you will develop some sort of addiction to it, which is not your natural state. Apart from this, within this ethical realm there could be a legal concern as well.
For example, a very interesting concern is that of liability. While doing this emotional manipulation, or while the system fosters this kind of addiction and all that, there could be some legal liability for you as a developer or as a manufacturer.
So, you may want to take that into account. For example, if you were to place such a system in a hospital, where such an agent is interacting with the patients, and something were to go wrong, then who is going to take the blame and who will be blamed? So, liability is something which is really important.
Another ethical concern is that of deception. This virtual agent that you are developing is not really emotional, in the sense that it does not understand emotions and it cannot genuinely communicate emotions; we are just trying to approximate the feeling of emotions in the system, and that can be deceptive.
So, we need to make the users understand that this is artificial, not natural, and hence the agent is not genuinely experiencing emotions. There are lots of things that can happen because of this deception, and if you recall, we talked about anthropomorphism also in the previous module.
I would like to tell you one short story, and I think that will help you understand what this deception can do in turn. There was a simple robot that was developed for the US Army to detect and defuse bombs in war field settings. There was one such robot, I am not able to recall the name, that was deployed with the US Army in Afghanistan, I believe.
And then, during one of the operations, when that robot was trying to detect a bomb, the bomb exploded and the robot blew up; in a sense, the robot died. It turned out that when the robot died and its remains came back to the United States,
there was a lot of sorrow, disappointment and grief, not only among the army personnel but also among the common public, and it was widely covered in the media as well. Ultimately, if you look at all this, the robot itself is not a living being; it is an artificial entity, it does not have a life.
So, if the users were to develop some addiction to this emotionally intelligent virtual agent, it may result in some sort of isolation from social gatherings, as the users may tend to rely more on this kind of agent for emotional support rather than engaging in human interactions.
So, this can again be a very common scenario, especially in healthcare settings, where the participants or users interacting with these virtual agents already have some sort of mental health issues, such as depression, and may not want to talk about them with a human. They may feel more comfortable talking with such an agent that you are developing, and may ultimately develop an addiction to it, which may not be a very good idea to begin with, perfect.
(Refer Slide Time: 09:32)
So, these are some of the ethical concerns that can be there; by no means is this an exhaustive list. You can take these as pointers, but for an exact understanding of the ethical concerns you will have to understand the scenarios in which you are deploying your emotionally intelligent virtual agent, and also what sorts of capabilities and functionalities you are planning to incorporate in this agent, right.
Nevertheless, let us take some example scenarios and see exactly how these kinds of concerns can arise and how easily things can be manipulated. For example, imagine that there is a company, maybe your company, which is creating an emotionally intelligent virtual agent that collects voice patterns to provide tailored emotional support to the users.
So, what could be some of the ethical concerns related to it? I would like you to take some time and think about what some of those concerns could be. One thing that we already saw is privacy: if the company is collecting the voice patterns and maybe has not even obtained the users' consent to collect this personal data, then it is going to raise privacy issues.
So, you rightly understood that in order to address the privacy concerns, the very first thing is that you need proper user consent and you should not be doing anything beyond what the user has consented to, right. For example, if you have obtained consent to provide the user certain services, you should not just start selling the data without the user's consent.
So, that could be, for example, one issue with such a company, ok. Let us take another, very interesting example. Imagine you develop a virtual reality therapy platform which again makes use of this EVA, the Emotionally Intelligent Virtual Agent, to help patients with PTSD relive traumatic experiences in a controlled environment.
These kinds of practices are not very uncommon with war veterans, to help them release the trauma that they may have faced during the war. Now, what could be the consequences of this? I want you to think about it and come up with certain things. Of course, privacy is one thing that could be there; what could be the other concerns?
It turns out there are lots of things associated with it. But one of the things that can really happen is that it can worsen the symptoms of the patients who are going through this therapy. Because, of course, you want them to relive their traumatic experiences, but it may happen that your virtual agent is not able to control the intensity and manage the emotions, and then it may aggravate their emotional responses.
So, you will have to be a little bit cautious about that. Addiction is another thing that can really happen, and there are lots of other ethical concerns that can arise in this scenario. Let us take another scenario, where there is a company which creates another virtual agent for customer service and uses it to provide personalized support.
Sounds like a very good plan; most companies would love to run this kind of system, and I would encourage you to think about it and maybe turn it into a business plan as well. But now, what could be some of the concerns? Privacy could be one. We already talked about emotional manipulation, which could also be there in this case, though the chance could be a bit limited because the users are not exposed to this for long.
So, what could be some of the concerns that come with it? Ok, this is a really interesting observation. What may happen is that the idea for this particular agent you are creating was to provide personalized support and, of course, to recommend certain products or services.
But it may happen that it is not fair at all while doing so. It may be promoting one product over the others and hence violating the trust of the user, which, as you may know, is already being done by lots of social media and technology giants around us. This is also one of the biggest concerns that can be there in developing such an agent.
So, these are just some sample scenarios where we tried to look into what some of the privacy and ethical concerns related to them could be. And of course, as I said, these are just pointers. I would encourage you to think more about it, think about the users, think about the business use cases, and accordingly you may come up with a more detailed list and also with certain solutions to these scenarios, ok.
So, now we have understood the ethical concerns, and next we have a very exciting session where we are going to have an interaction with Dr. Aniket. As I said before, Dr. Aniket is a faculty member at Purdue University and has a lot of experience in affective computing and related areas.
In this particular interaction, we will be talking about some of his research and trying to understand how he approaches a research problem and converts it from an idea into a product or a service. So, guys, we have Dr. Aniket Bera with us, who is an associate professor in the Department of Computer Science at Purdue University.
He directs the interdisciplinary research IDEAS lab, that is, Intelligent Design for Empathetic and Augmented Systems, at Purdue, working on modeling human and social aspects using AI, robotics, graphics and vision. His core research interests are in affective computing, computer graphics, AR/VR, augmented intelligence, multi-agent simulation, autonomous agents, cognitive modeling and planning for intelligent characters. He is currently serving as a senior editor for IEEE Robotics and Automation Letters.
In the area of planning and simulation, his work has won multiple awards at top graphics/VR conferences. He has previously worked in many research labs, including Disney Research and Intel. Aniket's research has been featured on CBS, Wired, Forbes, Fast Company, the Times of India, etcetera.
And you can learn more about Aniket's work at https cs dot purdue dot edu slash hyphen ab. So, Aniket, we are really excited to have you; thanks a lot for taking the time. For the sake of the discussion, we will start with trying to understand what really inspired you to pursue your career in affective computing.
Well, thank you so much for the introduction. Affective computing, as the name suggests, is something that many people before me have worked on, from a psychology point of view, from computational psychiatry, or even from computer science. It is about modeling human behaviors and human emotions.
And that is one aspect which I always felt was kind of missing in modern computing technologies. We have always optimized things based on certain principles, time, money, but we leave the human out of our computation; we leave human emotions out of all these calculations. So, my whole research on bringing human emotions and human behaviors back into computing design is one of the reasons why I work on affective computing.
Problems ranging from: can a robot serve you better if it knew your emotions? Can a virtual agent in this metaverse realm understand you better, know you better, if it understood your emotions? A good example I give, and this is one of the research projects we have been working on with a hospital, with therapists, is: can I design a virtual therapist who knows and understands your internal states?
And I give this example to my students: let us say your friend were to ask you a
question, how are you feeling today? You can answer in many different ways. You can say,
you know, I am feeling ok. You can say, I am feeling ok. Or you can say, I am feeling ok.
These are all the same text. These are all the same content. But the way we represent this
content is so different. My body language is so different. My vocal patterns are so different.
My emotions are so different.
If I was able to understand this better, I could serve you better, I could help you better. In this case of "I am feeling ok", clearly this person is happy if they say it that way. But if somebody, your friend, were to say, "I am feeling ok, I guess", that is not ok. You want to ask more questions to this person, maybe get to know a little bit more about whether there is something bothering this person, right?
So, from psychiatry, computational psychology, social robotics, to this metaverse realm, all of these need to bring the human to the center of the discussion. That is where I feel all our research on affective computing really is. Perfect. Sounds really interesting. So, Aniket, if I may ask, which was one of the first works that you decided to do in the affective computing domain? And what exactly inspired you to move in that direction?
So, many years ago, when I was doing my PhD and publishing papers in crowd simulation, we were modeling how people move in crowds. Of course, being from Delhi, I know what it feels like; I remember times at, let us say, the Rajiv Gandhi Metro station.
Or something, right? Or Nehru Place, or these kinds of places. I remember that when there is a lot of crowd, a lot of people, it is always a challenge to go from place A to place B. But within that challenge, it is always, ok, how do I optimize my path? I see an aggressive person walking next to me, cutting lines; maybe I want to choose a different path. If somebody is a shy or more conservative person, I can cut across that path, right?
So, this is where all of this research initially started: modeling crowds, modeling people when they are moving in crowds. Initially, we used to model them just as circles or cylinders, a very simplistic representation of human beings, right? Unfortunately, human beings are neither circles nor cylinders. We emote, we have body language, we have expressions; we have so many different dimensions of perception to everybody else. So, one thing led to the other.
We moved from circles to full body gaits, from gaits we moved to full body emotion representation and motion. We went to facial representation, and now we are kind of proud that we have modeled all of these: not just visual affect, but affect from vocal patterns, from facial expressions, from body language.
So, yeah, what started maybe 10, 12 years ago from crowd simulation is now at a more microscopic human-to-human interaction level. Of course, a lot of things have changed in these 10 years, but I would like to credit my initial research for where I am right now.
Ok, perfect, sounds interesting, thanks. So, if I understand well, you started working in affective computing maybe 10 years back.
Yes.
But as of now also, I believe you have a number of active projects going on.
Yes.
In the affective computing domain. So, can you please elaborate on some of these projects and what exactly you are doing there?
exactly you are doing there?
Yeah. So, as I said, one of these projects is with Baltimore hospitals. When I was at (Refer Time: 22:14) Maryland before, we were working with psychiatrists, child psychiatrists and therapists to sort of bridge the gap between therapy demand and supply.
So, what happened is, when COVID happened, especially for single mothers or people who could not go outside, the children were stuck at home. They could not go outside to work because of COVID, the children were stuck at home, they had to work remotely. With a lot of these problems, even in, I would say, the quote unquote developed world, the mental health problems increased a lot.
Unfortunately, the more rural we went in Maryland, the state of Maryland, we saw these problems escalate because of the lack of availability of resources. The nearest Walmart would be like 30, 40 minutes away; people could not drive there, leaving their kids at home. So, the mental health problems increased. So, I reached out to some of my colleagues in psychiatry.
Like, is this a problem we can solve? Can we bridge the gap in therapy availability? Because there is a limited number of therapists, but a lot more people actually need mental health care. So, we wanted to ask, can we help them? These therapists are overburdened; they are seeing like 20 patients a day, 30 patients a day.
Can we give them some supplemental information? Can we find causal relationships, or what happened in the previous session that the therapist might have forgotten? Can we leverage, ok, this person was sad in the last session because certain things happened, and you said this thing, which really helped him or her
and made them happy; maybe we can bridge the gap again in this session. So, we are sort of helping therapists connect between different sessions, having a virtual therapist between two different real therapy sessions. So, we are working with hospitals on those kinds of problems. We are also working on another set of problems in the metaverse realm, right? Everybody is talking about the metaverse and how we will all now live in the metaverse. I do not know if that will ever happen. But if it does happen, we will interact with these intelligent and functional virtual agents. And when I say intelligent, sure, with the current methods they will be smart enough to figure out what to do. But would they be emotionally intelligent enough to figure out what we should do?
Looking at people, deciding what to do, what not to do. So, that is another variety of metaverse problems we are working on. We are working on social robots: can we deploy these robots in hospital settings, in evacuation settings, where they interact with people?
We also have some projects with the military, where robots are jointly solving tasks with the army. To do anything jointly, we need to understand human dynamics, human politics; and I do not mean politics in a broader sense, I mean politics in decision making, right. So, that is where we think that these kinds of intelligent, emotionally intelligent agents, where an agent could be a robot, a virtual agent, whatever it might be.
If they are emotionally intelligent, they will be more useful, and a variety of problems stem from that. Perfect, that sounds good. So, I think you used a very interesting word, Aniket, saying that we are talking about emotionally intelligent agents, where this agent could be a social robot, or this agent
Yeah.
Yes.
Could be something else as well, a machine for that matter. So, I believe that many of the learnings that you may be developing in one project, on one type of agent, can be easily transferred
Yeah.
To the other type of agent. So, would you like to give an example where you were able to do so, or where you thought it is really possible or may be helpful?
So, yeah, it is a very good question; we leverage ideas from one project to the other. And one of the key things, which we were discussing earlier, is that there are humanoid-like robots, so many robots out there, like the Pepper robot, which have human-like hands and a human-like face, and there are these virtual avatars, virtual characters. So, if I have a motion model, a representation of human motion, like how our hands move, how my body moves, I can put that motion model into either a virtual agent or a robot. However I have generated it, using artificial intelligence techniques or other methods, I can use it in robotics problems as well as graphics problems. So, that is where we leverage the generation or the motion model aspect, because human motion is similar whether you have a real human doing it, a virtual character or a robotic character.
The motion can be transferred to each of these agents, whatever the problem is. Of course, it is a non-trivial problem because robots do not emote in the same way. So, let us say I move my hands or my expression a certain way; if I just copy-paste those things onto a robot, will the emotion also translate?
That is an unanswered, unsolved question, which I think needs a lot more insight into human psychology to really answer. Maybe what looks happy to me, this body language, might look angry if I transfer it to some robot.
Yeah.
So, there is a big, so to speak, domain gap between different things, and that has to be dealt with separately.
Yeah, perfect, I think this is fascinating. And also, as you said, the emotionally intelligent part is definitely one thing; the other thing, for example, is that if you are trying to transfer things from one agent to another agent, you also need to look at the physical constraints, I believe.
Yes.
So, for example, maybe you may want to look at the degrees of freedom that the robot has
Yes.
Yes.
Versus the degrees of freedom that, for example, the virtual character has. So, I believe it can be really interesting if you can look at all the constraints, the physical and the emotional, and then adapt the model from one thing to another using the learnings that we have from some other project.
Actually, the constraints you mentioned are the biggest problems. I mean, they could be worked to our advantage as well. But a robot has significantly fewer degrees of freedom than humans in most cases.
Yeah.
And when we transfer some things, if one or two degrees of freedom are fewer or they work differently, the whole emotion can change.
Certainly, exactly.
Yeah, we do not even know if there is a cultural mapping between two humans, right.
Yeah.
Like I am from India, I am from Delhi, but sometimes when I am meeting people from different parts of the world, the way they express their emotions is so different from how I express my emotions.
Yeah.
Right. So, if I put my emotions, my Indian emotions, on top of, say, a Russian man, maybe people might get shocked by the virtual Russian agent; I am just giving an example, it could be anyone.
Yeah.
Yeah.
Even an American person, or somebody from South America; we all grow up with different ways of expressing the same thing.
There is a good mapping that needs to be learned, and I think there is a long journey ahead when it comes to mapping between different agents, be it humans, agents across human cultures, or even robots and other agents.
Yeah. So, there is something I think you may definitely have been working on already. We already talked about individual variability.
Yes.
But the moment we talk about individual variability, it could be not only at the individual level, but also at the group level and beyond.
Yes.
Yes.
Yes, yeah.
So.
Definitely.
Exactly.
Exactly.
It has happened to me unknowingly a number of times. Because I feel maybe that is the comfort space with A and B, but when I am with somebody else, I would react in a different way. So, I guess eventually the whole problem boils down to: can I model these things differently? Right now, most people in the world are modeling these things in a very homogeneous way.
Yeah.
Like, every human is the same. And I think that is a non-scalable thing. I mean, we will never be able to make agents which people can trust that way. If you want trust, you have to relate to that person.
Yeah, and also, I think for researchers like you, who are working at the intersection of human-agent interaction and human-machine interaction, I believe you need to look not only at the individual variability of the humans, but definitely, as we were talking about before,
Yes.
That you are also looking at the individual variability of the agents as well.
Yeah.
Because they may have their own set of variability with respect to the physical constraints that they are under, with respect to the capabilities, with respect to the performance that they have. So, I think that is really interesting. So, now, Aniket, we will dig down into one of your very interesting works, which is also one of your most cited works, which is "Emotions Don't Lie: An Audio-Visual Deepfake Detection Method using Affective Cues".
So, I believe in this you presented an approach that simultaneously exploits audio and video modalities and perceives emotions from the two modalities for deep fake detection. So, I think that is very interesting, given that we are already seeing a lot of deep fakes.
Yeah.
And so the whole idea, I believe, is how can we leverage emotions that can really help you
detect that in a better way.
Yeah.
So, please enlighten us more about what exactly motivated you to address this problem?
So, this was happening, you know this was a few years ago, we started discussing with our
students and deepfake was a big problem.
Yeah.
It is still a very big problem right now. And a lot of computer vision researchers are approaching the problem in very different ways. And I kept on thinking; around that time, I think there were some elections going on in India as well.
Ok.
And we saw some, some deep faked version of I think Lalu Prasad Yadav.
Ok.
Or somebody, I do not remember much of that era. But we were like, this looks so real; you can make a politician say anything. And especially in a country like India, where things spread quickly with this technology penetration now, which we did not have many years ago: everybody has smartphones, everybody has WhatsApp, everybody has all of these devices.
If you upload something, even if it is later figured out to be fake, at some point it spreads like wildfire. So, it is very.
Yeah.
Yeah.
Right. So, we said, ok, we need to do something. And we were looking at a lot of videos when we found one very interesting thing about the deepfake videos. In a lot of these deep fake videos, the face is being masked by somebody else; there was a very popular deep fake of Trump saying something, and then Obama saying something, and all of these funny voice actors saying all these things. And we found that all of these things are ok, but the problem is the emotions do not tell the same story as what they are trying to say. When the original actor is doing it, of course, the emotion is there because that is the original video.
But when you mask the deep faked face on top of the video, we saw that something is off about it. Not everything looks perfect; the emotions are wrong. And we thought, ok, maybe what is happening is that these models are not able to generate the emotions, the facial emotions, correctly.
So, maybe we can build on top of already existing vision models. We can also build this "emotions do not lie" idea, where we are saying that if the emotion of the voice matches the emotion of the visual, the deep faked or real person, it is likely from a real person. But if it is not matching, there is something wrong about this video. I am not saying it is a deep fake, but it is likely to be deepfaked.
There are already other existing methods which can supplement what we are doing. We are not competing with other methods; we are actually building on top of them. And we saw that in many cases we got better results, in fact, I think 30 to 40 percent better results than the previous state of the art, just because we are looking at emotions. So, it was very clear that emotions are a very key part of how we can differentiate between what is real and what is not.
That said, I am sure the next generation of people who make deep fakes can build emotion into their models, and then our method will become useless. But we are always there to fight with whatever new comes up.
Yeah, and you know I think that that sounds interesting. So, I think, so you know like I would
like to know more about this particular work, Aniket.
Yeah.
So, for example, can you tell us ok, I understand that ok, this was you wanted to look at the
audio visual modalities of this entire.
Yeah.
Deep fake video, and you wanted to understand the emotions. So, can you tell me, let us start from the beginning: where did you get the data? What exactly did the data look like in this case?
So, there is a big deep fake dataset, I think Facebook themselves released it.
Ok.
Ok.
Ok.
So, they are there so that people can build classifiers to test and train their algorithms on.
Ok.
Ok.
Ok.
And we can generate our own deep fake like there is.
Yeah.
Yeah.
To generate deep fakes, right. So, obviously, that was the starting point of the data. But we
can capture like right now in this fancy room we are sitting in, we can capture very high
quality deep fake data.
On which, I do not know if my method can perform very well, right. You have a very good green screen behind me; I am sure you guys can make me look like a Hollywood star rather than a poor professor, right. So, things are getting advanced from a deep fake point of view.
But you know, it is that good old problem of security, like a hacker versus a security person, right. They are both constantly fighting with each other to see who is better. Deep fakes are getting scary, I am not going to lie. I mean, this paper won a lot of awards; from the state government of Maryland, I think we won the runner-up in the innovation of the year award for this.
But is it a losing race? I do not know; we will constantly fight deep fakes. And it is almost personal to me, because, especially coming from India, I feel the information spread, the virality of all this information across social media, can be great to reach the masses, but it could also be dangerous.
We need to make sure wrong information does not spread. You know, you have your favorite politician saying some things, some really terrible things; like, I used to like this guy, what happened? Or the person you do not like says something really good; maybe I will vote for him now. Before the thing is actually flagged as deep fake or fake, it has already spread across thousands of media platforms.
And we need to nip it in the bud if we really want a democracy like India to work. I mean, when I saw this, I felt this is really dangerous; if things like this happen, we need to stop them. And of course, there is also a concern for women's safety, when people post fake pictures of women's faces on top of other videos online, deep-faking them.
And it is not only socially very unethical, it really puts them in harm's way in many ways. So, we want to make sure that AI is used well. We do a lot of generative modelling; you have seen some of our work, and all the work in this world with really good computer graphics and vision research coming out of it. But we need to make sure that we are using it in the right way.
Deep fake detection is just one of the many problems. I really hope to work more on this problem in the future, to fight against all the other deep fake methods out there. But yeah, we need experts like you to work with us.
No, perfect. So, thanks a lot, thanks a lot Aniket. I can completely agree that the deep fake itself has become a big problem, especially in a country where you have a population of billions, you know.
Yes.
Like, we are not talking about millions anymore.
Yeah.
Yeah.
Or states, but about billions and billions of people. So, it can be really, really fatal in many senses, ok. So, let us look at this paper again. We understand that you had this data set, maybe the one released by Facebook, and similar data sets where you have a video and then you have corresponding labels as well. So, it is a.
Right.
Supervised learning.
Right.
The problem in that sense was already modeled as supervised learning, perfect. And since the video was already given to you, you could extract the audio.
Yeah.
And so you are working on both the audio and video modalities, perfect. So, the data makes a lot of sense now. Now coming back to the algorithm part, ok. I believe you may have applied some kind of a fusion architecture; how exactly did you bifurcate the audio and the vision modalities, and how did you make them work together?
So, definitely, we used a fusion architecture for having a joint emotion across the entire spectrum, right? So, facial data, text data. You know, we have another paper called M3ER (Refer Time: 39:32).
Ok.
Where we are combining all these modalities to have a more efficient extraction of emotion. But these individual modalities have their own emotions. Of course, there are many videos where there is contradictory emotion, like sarcasm, right.
Hm.
Or irony and where what you are saying and what you actually mean, they are very different.
Yeah.
Right? Those are, for us, outlier cases; we do not know how to handle them, but I am sure in the future we can figure out a way. Also, sarcasm and irony and all these things are very culture specific.
Yeah.
Yeah.
You know, when I went to the US the first time, I often did not understand a lot of the sarcastic comments. Should I laugh? What is happening? Why is everybody laughing and not me, right? So, these are very hard, culture-specific problems. But apart from that, the hope is that these modalities will have some connection, right. If you are happy or, how do I put this, if you have some anger expressed in your speech.
That anger should have some expression in your facial expression as well, right. So, if you have a joint representation, where you look at all cues together and conclude, ok, this person is angry, there is some anger in there; but the voice on its own is sort of telling you, I am happy, then there is no such agreement.
If you listened to just the voice, you could figure out that this is not anger but something else. But if you looked at the face, clearly this is anger; something is wrong, because now there is a big difference between the anger and the happiness.
If you see those in the emotion spectrum, then, you are aware of the VAD, the valence-arousal-dominance spectrum.
If they are close enough, you realize that, sure, there are differences, but they do not have to be exactly the same. But if one points towards fear and the other points towards aggressiveness, there is something fundamentally wrong in terms of what you are trying to present.
Yeah.
So, the wider the difference between the modalities, the more chance there is that it may have been faked. That is all we are trying to predict.
Yeah.
So, in terms of a fusion architecture: if we looked at everything, then what would be the emotion? Because we humans, we do not look at individual cues, we look at.
Yeah.
Yeah, exactly.
I am not looking at just your face or just your audio. So, when you capture that and then you
look at the individual modalities, if there is a discrepancy, it is very easy to find out.
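To make the mismatch idea concrete, here is a minimal Python sketch of that per-modality comparison. It is not the paper's implementation: the two VAD predictors are placeholder stand-ins for whatever speech and face emotion models one might plug in, and the threshold value is an arbitrary assumption.

```python
import numpy as np

def audio_vad(audio_clip):
    # Placeholder: a real system would run a speech emotion model here and
    # return (valence, arousal, dominance), each roughly in [-1, 1].
    return np.array([0.6, 0.4, 0.1])

def face_vad(video_frames):
    # Placeholder: a real system would run a facial expression model here.
    return np.array([-0.5, 0.7, 0.2])

def mismatch_score(audio_clip, video_frames):
    """Euclidean distance between the audio and visual VAD estimates."""
    return float(np.linalg.norm(audio_vad(audio_clip) - face_vad(video_frames)))

def flag_possible_fake(audio_clip, video_frames, threshold=0.8):
    """Large disagreement between modalities suggests the clip *may* be manipulated."""
    return mismatch_score(audio_clip, video_frames) > threshold

print(flag_possible_fake(audio_clip=None, video_frames=None))  # True for these dummy values
```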
Yeah. So, in that case, (Refer Time: 42:05) as you rightly said, looking at a fusion architecture, and I think the audience will appreciate this also, the fusion architecture could be, for example, an early fusion or a late fusion.
Yes.
Yeah.
But really, rather than looking at each individual modality as a separate entity.
Yeah.
Combination.
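For readers who have not met the terms, a rough sketch of the early versus late fusion distinction is given below. The toy classifier and random features are stand-ins purely for illustration; the actual architecture used in the paper is not reproduced here.

```python
import numpy as np

def fuse_early(audio_feat, visual_feat, classifier):
    """Early fusion: concatenate the modality features, classify the joint vector once."""
    return classifier(np.concatenate([audio_feat, visual_feat]))

def fuse_late(audio_score, visual_score, w=0.5):
    """Late fusion: each modality is scored by its own model, scores are then combined."""
    return w * audio_score + (1 - w) * visual_score

# Toy demo with made-up features and a trivial stand-in classifier.
audio_feat, visual_feat = np.random.rand(40), np.random.rand(128)
toy_classifier = lambda x: float(x.mean())
print(fuse_early(audio_feat, visual_feat, toy_classifier))
print(fuse_late(audio_score=0.7, visual_score=0.2))   # 0.45 with equal weights
```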
Perfect. So, I think that is the key point. And now, coming back to the last aspect of it: while doing this, I believe you are looking at the emotions and everything, so what were some of the performance metrics that you were looking at? Of course, you are looking at whether something is deep fake or not, so essentially it is a classification problem.
Yeah.
And of course, you are looking at the accuracy and all that.
Yes.
But how exactly were you looking at it? (Refer Time: 42:46).
So, in many ways, of course, the classification accuracy is there: if it is classified as a deep fake, it is a deep fake, which is very binary. So, I think the performance evaluation, in many ways, and this is something we did not do completely, we also wanted to have a perceptual evaluation, right. You know, if humans cannot figure out if it is.
Exactly.
Deep fake, versus whether a computer can figure out if it is a deep fake, right. And this is where we are kind of struggling at this point. If humans can figure out that it is a deep fake, then our methods are good enough, you know, we can figure it out. But if humans themselves cannot figure out that it is a deep fake, then it is harder, because that means the emotions are in sync.
So, some of our metrics are things like: what is the distance, the VAD distance, between them? We can threshold that, whatever the distance is.
Yeah.
But at the same time, if the distance is not too much, that does not mean it has not been deep faked. There could just be a good deep fake behind it, right.
Yeah.
Could be, could be. And that has been the biggest challenge: we do not have any good metrics to evaluate that. The deep fake community, I believe, is still, I would not say in its infancy, there is a lot of work in that community, but it is still a very Boolean operation. Is it a deep fake? Is it not a deep fake?
You know, we do not know if there are better ways to evaluate how close it is to reality or how far it is from reality. There are some metrics, but none of them are quite there. I am not from the deep fake community, but I think as a community, over the next few years, we should figure out better metrics, better benchmarks to figure out whether something is deep fake or not. Because this is literally getting out of hand.
Yeah, yeah. And actually, one thing, as you said: usually, when we try to develop algorithms for machines, we try to keep the humans' performance as a benchmark.
Yeah.
So, if the humans are good at it, machines should also be good at it.
Yeah.
And are the machines performing up to the level of the humans? But frankly speaking, we have seen the advancements in AI in the last decade, more importantly, where machines are even performing better than humans.
Yeah.
In many cases. So, for example, the use case of ImageNet and.
Object recognition.
Much better than the humans. And of course, also the playing of Go and other.
Yeah.
Other things. And so, for example, one thing that I really struggle with a lot, and I believe you may also have encountered the same situation, is doing the captcha recognition. It has become insanely hard. You know, when you log into a website and you want to prove that you are not a robot, sometimes it is hard for me.
Yeah.
Yeah.
I am.
Yeah.
So, difficult.
Yeah.
Yeah.
That.
Yeah.
So, I think, somehow, I completely agree with you that, while keeping the traditional performance metrics in mind, we need to really look at what is the benchmark that we can.
Yeah.
Keep, which usually is the human benchmark. But while doing so also, I think.
Yeah.
We need to also really understand that keeping humans as the benchmark may or may not be the most viable option, again.
Yeah.
Exactly.
Exactly.
Right. And where humans cannot figure it out, that is where we need to do deep fake detection.
Exactly.
Yeah.
But if an AI tells me that, no, this is clearly deep fake. Then we need to figure out, ok, maybe
stop it. Yeah.
Because if people can already figure it out, if it is a deep fake, then maybe it is not that
harmful.
Yeah.
Again, you and I, you know, we are researchers in this area, we may look at things a little bit
more deeply.
Yeah.
But, you know, if this reaches like some rural place in India, then they may not even realize
this deep fake.
Yeah.
So, yeah, you are absolutely right that in this case, I think we need to go beyond humans.
Exactly.
Exactly.
Yeah.
Exactly. So, I would really like the audience to please make a note of it. Usually, of course, while we are talking about emotionally intelligent machines, we like to take these patterns from the humans; but I think there are some places where you really need to, you know, look beyond the common intelligence of the.
Yeah.
Yeah.
Yeah.
Absolutely.
Perfect, Aniket, I think we are going really well. So, one last question that I have, and I think this is a really important question, especially for the audience and the young learners in the domain: we already know that affective computing is a very, very interdisciplinary field. And I think you started your discussion with that already, that this AI with emotional intelligence essentially lies at the intersection of computer science, design.
Yeah.
Psychology, and so on, and that becomes really challenging, you know, to even start working in it.
Right.
And how to address the problems that are there. And this is where I would like to understand from you: since you have already done a good amount of work, and it has done well for you, what exactly are some of the advices that you would like to offer to a younger colleague who wants to start in this particular domain of affective computing?
Yeah, I think this is a very interesting question. I would say, when I had started doing my research, a lot of the problems I looked into were of the form: well, this is an interesting problem, let us solve it. We are trying to figure out human emotion, so let us try to figure out human emotion. Working on this for some years, we got excited, we did a lot of things, but we did not really think about the end problem in mind. Like, what is it that this will help solve? We found out that this person looks angry, ok. So, what?
Yeah, exactly.
Agreed. I think thinking about the problems first, and then trying to figure out how figuring out human intelligence or human emotions or human behavior helps solve that problem. If you were getting 10 percent accuracy, emotions can maybe give another 5 percent accuracy. So, bringing the human back into the problem, but still keeping the problem as the first point of entry.
I have made this mistake. Initially, I did not keep humans as the point of entry. I thought, just by looking at somebody, I can figure out what their emotion is computationally. That idea looked exciting, but what is the end goal of that, right? So, that is why in all of my current projects I am looking at, as I am saying, we are working with doctors.
In fact, one of the problems we are working on is actually with the police force. And I am sure a lot of the audience must have heard about this, and papers have been written about it, that in the US especially, police training is a big problem when it comes to understanding different races and different cultures.
You know, there are always frequent problems with the police. And it is not that every police officer is necessarily bad or racist, but they are biased based on their training; they only know as much as they have been trained on, right. So, we are actually working with some police departments on how we can train police officers with these virtual characters, which have different cultural components, different emotional components, so that they do not have to do the training on just some videos they have captured.
They can actually interact with virtual people, talk to people from different races, different cultural backgrounds, and understand what it feels like to be in their shoes, right. So, from a training point of view, the same goes for working with the military. The end goal is always the key point in mind; and once that is established, you know this is a good problem, let us try to solve it, then try to bring in the human component.
And I am sure, as time progresses, this being a very, very interdisciplinary field; you know, we were just talking about this an hour ago: natural language processing is also a very important player in affective computing, because language conveys a lot of emotions and a lot of behaviors.
So, NLP, robotics, psychology, psychiatry, computer science in general, AI obviously, all of these fields have heavily influenced each other in this affective computing realm. So, even if you are working in robotics or some other field, if you are in a department of psychology or coming from a medical point of view, think about affective computing, think about how it would help the problems you already have.
As you know, our work was always in these robotic simulations or graphics. Initially I did not do any emotion work, but we started from crowds. Looking at crowds is looking at things in a macroscopic way; but then, when we start to look at it, crowds are only crowds because of the individual people in them, and the individual people all have emotions, all have their own behaviors.
Once you start looking inwards, you realize how important emotions are, how important human behaviors are, how important cultures are, and you start to model them for different applications.
Yeah. So, as you were asking initially, how can new researchers in this field do great work, or solve very interesting problems in it: I think, just look at problems in your own domain and then try to bring the human into the loop (Refer Time: 52:13), and then you realize that there are hundreds of problems which we can solve.
Ok, perfect, perfect. So, thanks a lot, Aniket. I believe, of course, we could keep on taking knowledge from you in this domain for a much longer time; but given the time and the scope that we have, I would like to thank you for your time.
Thank you.
And I believe the audience will definitely find it very, very useful. And I will encourage the audience: please go through his lab's work and some of his most fascinating research works. And if you have more doubts about it, I am pretty sure Aniket will be more than happy to take the doubts from you over e-mail, so feel free to contact him.
Thank you.
Affective Computing
Prof. Jainendra Shukla
Prof. Abhinav Dhall
Department of Computer Science and Engineering
Indian Institute of Technology, Ropar
Week - 12
Lecture - 01
Ethics in Affective Computing
Hello, [FL]. I am Dr. Abhinav Dhall from the Indian Institute of Technology Ropar and
friends; this is week 12 in the Affective Computing course series. So, till now in this course
we have discussed several aspects about affect, how affect is perceived from a machine's
perspective, how we can create a machine using different sensors which can sense the affect
of a user or a group of users.
We have also seen how the different data sets can be created and what are the different
applications, where affect can be used to have a more engaging, safe and productive
experience for the user when the user is interacting with the machine. Now, what this also
means is that affective computing slowly is becoming mainstream, so, for a more useful and
productive experience of a user with artificial intelligence systems.
And that includes the currently (in 2023) very popular large language model kind of systems. It is important that after sensing the affective state of the user, the machine also replies back in an appropriate affective way. Now, what this also means is that, as with any artificial intelligence system, a system which is trying to understand the user and then perform accordingly, there are some issues, some challenges, which need to be taken into consideration before such systems can be deployed in the real world.
And these challenges and issues are essentially around the very important aspect of ethics. So, how are the ethical boundaries kept in mind when affective computing systems are being developed and deployed, and what are the limits beyond which the use of such AI enabled systems can be counterproductive as well.
So, first we are going to discuss the motivation: why ethics is important. And we will see how, let us say, the concept of privacy is being challenged, is sometimes affected, when we are trying to understand the affect of a user.
(Refer Slide Time: 02:55)
Later on, I am also going to share different perspectives with respect to sensors, with respect
to the levels at which data is being processed, which is critical to the understanding of the
ethical issues in affective computing.
Now, this quote from Professor Picard and I will read out for you friends, “The fictional
message has been repeated in many forms and is serious: a computer that can express itself
emotionally will someday act emotionally, and if it is capable of certain behaviors, then the
consequences can be tragic”.
What we mean here is: after we have added the capability of understanding the affective state of the user into a machine, and have also geared up the machine, taught the machine, to react accordingly and to have emotion.
So, once the machine has an emotional persona, and the actions of the machine are also affected by its own emotional state, which is driven by external entities, including the user and the environment where the machine is, then what effect can all of this have on the user and on the task which the machine is performing?
Therefore, the emotional identity of the machine and the emotional response of the machine need to be very carefully curated, such that the task for which the machine has been created is not hampered, and we also take into consideration the aspects about the user: the interest of the user, the safety of the user and so forth.
Now, first we are going to talk about affective sensing right understanding of the emotional
state through different sensors. And friends you recall we talked about how we can use a
camera sensor for understanding of the facial expressions, body language of a user. Then we
can use the microphone to understand the speech signal and that gives us very vital cues
about the emotional state and then we have text and physiological sensors.
Now, from the perspective of an affect sensing system, during its design the designers' ethical and moral decisions are also embedded into that system. Now, a very fundamental question is: you are designing a machine which is detecting the emotional state of the user; how would you decide which emotions the machine should be trained to recognize? Now, let us say the example is that you want to create an app which is analyzing the speech patterns of the user when the user is talking on the phone.
So, it is not actually looking at the semantic content of what the user is speaking, but essentially at the pitch, the fundamental frequency and so forth, so as to look at things such as stress, ok. So, the example is: you want to create an app which is going to detect stress based on the speech captured from the microphone of a user's mobile.
Now, if you want to understand stress, how are you going to objectively define the criteria, and what all emotions would you like to measure: would those be the categorical emotions, or would you like to go to the valence-arousal-dominance continuous emotion scale? So that you can map the speech pattern onto the state and then detect if the user is stressed or not. And further, if this app is created and installed on a user's phone, how much is the user's understanding of what all the app is capable of? Is it just through the contract which the user has agreed upon?
Or, let us say, there is a training phase as well, wherein the user is made aware of the capabilities of this app and where we say: well, the app is going to measure the emotions so as to tell, at a certain point during the day when the user is using the mobile phone, if he or she is stressed or not, ok.
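To illustrate that design choice, here is a small sketch of the two label schemes the designer would be choosing between. The category list, the value ranges and the stress rule are assumptions made up for the example, not a prescribed standard.

```python
from dataclasses import dataclass
from enum import Enum

class CategoricalEmotion(Enum):
    # One possible categorical scheme; the exact list is a design decision.
    NEUTRAL = 0
    HAPPY = 1
    SAD = 2
    ANGRY = 3
    STRESSED = 4

@dataclass
class VADLabel:
    # Continuous valence-arousal-dominance representation, each assumed in [-1, 1].
    valence: float
    arousal: float
    dominance: float

def is_stressed(label) -> bool:
    """A stress decision can be derived from either scheme."""
    if isinstance(label, CategoricalEmotion):
        return label is CategoricalEmotion.STRESSED
    return label.arousal > 0.5 and label.valence < 0.0  # assumed rule of thumb

print(is_stressed(CategoricalEmotion.STRESSED), is_stressed(VADLabel(0.3, 0.2, 0.1)))
```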
Now, once this machine is able to assess the stress versus non stress state, which all entities
within the mobile phone and outside the mobile phone are able to access that information. So,
are there other apps which are installed on the phone, are they being given access to the
predictions, which the stress app is making, are there any outside servers, any cloud based
installation where the user's stress versus non stress state is being communicated to?
Because if you are going to store the emotional state of the user on the device, then a method to safely encrypt that particular data, so that any malicious user cannot read the private information of the user, would be required. Now, if you are sending the emotional state data to a cloud based system, then, one, what is the safety mechanism there?
How is the user's privacy going to be maintained? Then other important questions come into the picture. How long is this data going to be stored on the machine? And who all at the cloud end, at the server's end, are able to access this machine and then access this data; do they have the rights to access this kind of emotional state data of the user?
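As one illustration of the storage question, the sketch below encrypts a prediction before writing it to local storage, using the Fernet recipe from the cryptography package. It only shows the mechanics; key management, access control and retention policy, which are the hard parts raised above, are merely hinted at in the comments.

```python
import json
from cryptography.fernet import Fernet  # pip install cryptography

# In practice the key should live in the platform keystore, never next to the data.
key = Fernet.generate_key()
box = Fernet(key)

record = {"timestamp": "2023-05-01T09:30:00", "state": "stressed", "confidence": 0.71}
token = box.encrypt(json.dumps(record).encode("utf-8"))

with open("affect_log.bin", "wb") as f:   # only ciphertext touches the disk
    f.write(token)

# Only a holder of the key can read the stored affective state back.
restored = json.loads(box.decrypt(token).decode("utf-8"))
print(restored["state"])
```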
Further once this is solved you know where the data is stored, and what are the security
mechanisms around the data? The next question which arises friends is the machine has
identified that let us say the user is feeling stressed ok, this is what the speech patterns are
telling.
Now, what will the machine do with that; what would be the patterns which would be analyzed? That is first. And once those patterns tell the app that the user is stressed, what is the use case, what is the application of that? And who all are going to use that information: is it just for the mobile phone's user, that they would, let us say, be given feedback, be given an alarm, so that they can look at the longitudinal data throughout one day?
And see how your stress was varying based on the speech analysis or this is going to be
shared outside with let us say a clinician as well. Now, since this data is critical this is very
much personal to the user. So, after let us say the clinician accesses it, is the data being still
stored at the clinician’s (Refer Time: 11:03) end? is it at the cloud's end? So, these are the
kind of very important questions which are coming to the picture.
Now, let us say this is also solved. So, from a designer's perspective, we now know that the app is going to sense the affective state of the user, and we have discussed and decided where that information would be stored and what, let us say, the security parameters around that information would be. Then we discuss what the use is, how we are going to use this information which we have gathered from the app.
The next question would be: in what way will the feedback be given? Ok. So, one example I have already shared with you: the machine can show, let us say, a graph, ok. It could show, for example, how the stress varied from 6 am to 9 pm, ok. Are there other ways as well? This is quite an explicit way; are there implicit ways of conveying this to the user?
What would be the right time at which this information should be conveyed to the user? And, if at all, if the user is very stressed, should this information be conveyed to the user, that you are very stressed, or should there be feedback which could explicitly help the user, let us say, in calming down? An example of this could be some suggestions regarding soothing music and so forth.
Now, from the user perspective the user would like to have control over these feedbacks as
well. So, as now an app designer you will have to give these options to the user. So, that the
user is aware about the whole gamut of how this data is going to be used and what all effect it
could have on the user and let us say if there are some pitfalls as well.
So, these ethical choices, these start at the designer's end. So, the team of researchers who let
us say are gathering some requirement and then would be designing a system, which would
be sensing the affect of a user.
Now, when we are talking about affect sensing, these are the important ethical considerations, friends, which we need to take into perspective. The first is privacy of the user. Now, in the context of emotions: emotions, according to Professor Picard, are perhaps even more private than thoughts; since these are our internal thoughts, our internal states, they are personal and private, ok.
Now, should a particular machine be allowed to access this very private information, and what would be the particular use case within which it needs to be allowed, right? Not every app should be allowed to sense the user's affect, because not every app could be designed to be capable of handling the repercussions of the steps which follow after the affect has been predicted, because it is very personal information. So, personal information needs to be very carefully taken into consideration.
Now, another interesting aspect: you designed this app which is understanding the user's emotion and then, let us say, it is reacting by changing the color scheme or the font of the user interface of the app. So, now you could say that the interface of your app is also emotion aware, right; the app has sensed the emotion of the user and the response is also in the form of the visual changes in the interface.
Now, since the app has sensed and is now reacting accordingly, is this also an invasion of the privacy of the user? Because the user has very personal thoughts, and emotion is very personal to a particular person. So, from the designer's perspective it has to be extremely carefully taken into consideration that emotion, unlike simple attributes which are typically looked at on a phone, for example how many steps one has walked, is far more complex, far more personal.
The other important aspect, which needs to be considered from the ethics perspective for
affect-sensing friends, is the emotional dependency. Now, the question is as follows with
respect to emotional dependency. If a user is using an app which is sensing the affect of the
user and then giving feedback to the user very frequently now the user might become
dependent on to the app as well.
Now, this one way dependence, right, that could actually be unhealthy in some cases. Let me
put this in simple perspective here. Let us say the user gets too dependent on an app for the
feedback about their emotional state and they are believing the emotional state result which
the app is telling them. So; that means, their future action, their immediate future action
might be influenced by the emotional state which has been communicated to them by this app
right.
So, they will get dependent on the inputs from this particular app and that is a very fine line
friends after which this dependency can become unhealthy. So, again from the designer's
perspective the designer who is creating an app, this is an important ethical consideration that
the user should not become excessively dependent on to the app, ok.
You could also consider this particular point from the perspective of a virtual agent. Let us
say the user is interacting with a virtual agent, a 2D or a 3D character and the character is also
able to sense the emotional state. And then the emotional response of the character and the
personality of the character is tuned with respect to the emotional state of the user.
So, if with a longer use the user develops a kind of relationship with the agent the virtual
agent then the user will become quite dependent on the virtual agent. So, you know these
kind of remote cases also need to be considered. Now, friends, the third point with respect to
the ethical consideration in creating these affect sensing systems is emotional manipulation.
So, after the machine has detected the emotional state. And let us say you know the example
which we were discussing about the user being detected as feeling stressed and then the app
suggesting some music to soothe down you know some soothing music some relaxing music
and so forth.
Then is this not essentially an attempt to modify certain behavior of the user? So, in this pursuit of giving suggestions to the user, the app could eventually lead to certain behavioral changes in the user as well, with longer, continuous use of the app. So, is that actually healthy for the user; could this create longer term problems for the user? This needs to be taken into consideration while creating systems for affect sensing.
Now, another aspect is that you could say the system is motivated by goodwill, because the user was stressed and the machine wanted to help the user. So, there is a good intent, but with longer and very frequent use this may become counterproductive for the user as well.
Now, the fourth aspect in the ethics consideration is building relationships. In the very near future, consider a person who is using these affect sensing apps very frequently: at what point will that particular user begin to value affective computing technology and its well-being over another human being?
So, consider the hypothetical scenario: a person is using an affect sensing app very frequently, and the affect sensing app is telling them, you know, this is your emotional state, based on, let us say, the valence-arousal kind of continuous emotion representation.
And the user gets so used to this that he or she is now dependent on getting the feedback about his or her own emotional state from the app, rather than talking to, discussing any problems with, or getting feedback from the human beings around them.
So, this particular user's relationships with the human beings around them, their friends and relatives, can, in a very remote scenario, be affected if the user gets too dependent on getting emotional feedback from a machine rather than consulting with fellow human beings. So, that is a very important aspect which needs to be taken into consideration.
Now, from ethics, guys, we are going to move on to discuss privacy, ok. The approach which we are going to take is as follows. We will say: well, your mobile phone is essentially a combination of a large number of sensors, right. So, you have a large number of sensors and these are collecting information in different formats at different frequencies, which can be used to understand the affective state of the user.
Now, we are going to talk about how these different sensors are capable, if not handled very carefully by an app designer or creator, of invading the privacy of the user. So, these sensors, friends, are divided into three very broad categories, ok, based on their privacy invasiveness intensity: low, medium and high.
On the lower end of the spectrum, where the chances of invasiveness or loss of privacy of the user are relatively lower, are sensors such as your accelerometer, battery, gyroscope, charger and so forth, right. Now, let us take an example, ok. You could use the accelerometer to understand the physical movement of the user, the aspects about the movement of the user when they are using the phone; this could be linked to how active they are, and that can be used to understand the affective state.
Now, this could be a very loose correlation, but what we are actually getting is just the affective state; we do not get the information about who the user is, or other attributes such as age, gender and so forth about the user. So, the personal information is still intact, let us say, when using an accelerometer.
The same goes for things such as your screen touch sensor and your environment sensors. The pattern in which the user is interacting with the screen, that tactile feedback pattern, can be used to map to a certain emotional state of the user, but it still does not actually reveal (Refer Time: 24:41) who the user is and so forth.
Now, moving forward, the set of sensors which have a bit higher, medium, intensity of invasiveness into the privacy of the user from an app perspective are the apps themselves, some apps which are looking at the browsing pattern of the user, ok, and then things such as Wi-Fi and Bluetooth sensors.
So, where is the user, right? Based on the Wi-Fi router network to which a user is connected, that gives wider information about the user. The same is true for the Bluetooth sensor as well: being able to identify if there are other devices around the user also gives you information about the user, right; is the user alone, is the user in a group where, let us say, there are other mobile phones with Bluetooth sensors around. So, one could create metadata out of it.
Now, for affect sensing, the most critical set of sensors with respect to the privacy concern are these ones, ok. From the microphone you can understand the speech. If you can understand the speech, not only can you understand the emotion, but using speech one could identify the user as well.
Essentially, who is the user: is this user a male or a female, and then things such as, based on the speech signal, what could be the rough age group, right. So, now, along with the emotional state, you are making the user's very personal information accessible, interpretable.
So, that means if you are going to use the microphone in your app, in your software, you have to make sure that the user identity information is essentially removed at the beginning of the analysis of the data which is being captured. Now, the same goes for the camera as well, friends: if you are able to record the face, you know who the user is; you again have these attributes like male, female, age and so forth, right.
So, along with the emotion, if you are going to use the camera, and since we are talking about a mobile phone as an example, you could have multiple cameras, the front camera and the back camera. Again, from the privacy perspective that is tricky, right: the front camera tells you about the face; the back camera could tell you where the person is and who the person is with.
So, even though the intent of the app would only be to look at the emotions through the camera, if it is not carefully created with respect to hiding the identity of the user, and if the security of the information is not very well maintained, this app could leak information, could be hacked and so forth.
Now, again guys, there are similar sensors: GPS (where the person is), call records (again very private information), SMS and emails. These tell you a lot about, of course, the affective state of the user, how the user is drafting emails, which tells us about the emotional state and so forth; but in the same pursuit, of course, the email or the SMS could contain very personal, very private information, which could identify the user as well.
So, what this essentially means is that, from an affective computing enabled app or software developer's or designer's perspective, you have to be very careful about which sensors you are going to use to sense the affect of the user, and in that very pursuit be mindful of the vulnerability the user may have due to the app using these particular sensors.
And of course, from the app designer's perspective, this would mean that you may have to do a trade-off between things such as how accurate your affect sensing is versus how privacy aware your app is. So, this is a very important trade-off. Sometimes you may make the app a little less accurate just because that may enable you to preserve the privacy of the user, which is extremely important.
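That trade-off can be made explicit in code. The small sketch below tags each sensor with the invasiveness level discussed in this lecture and lets the app enable only sensors at or below a chosen privacy budget; the particular sensor list and levels simply mirror the categorisation above and are not exhaustive.

```python
LOW, MEDIUM, HIGH = 1, 2, 3

# Illustrative invasiveness levels, following the low / medium / high split above.
SENSOR_PRIVACY = {
    "accelerometer": LOW, "gyroscope": LOW, "battery": LOW, "screen_touch": LOW,
    "wifi": MEDIUM, "bluetooth": MEDIUM, "app_usage": MEDIUM,
    "microphone": HIGH, "camera": HIGH, "gps": HIGH, "sms_email": HIGH,
}

def allowed_sensors(privacy_budget):
    """Sensors an affect-sensing app may use under the given privacy budget."""
    return sorted(s for s, level in SENSOR_PRIVACY.items() if level <= privacy_budget)

print(allowed_sensors(MEDIUM))
# A stricter budget usually costs accuracy: fewer and weaker signals reach the model.
```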
Now, moving on in a similar direction: we discussed the different sensors, right. So, with the same example, which is analyzing the speech pattern of the user, let us look at privacy from an affective computing app's angle, at the different stages at which the information is analyzed, and at the challenges which are possible at these very stages, ok.
So, again, you have your mobile phone; there is an app which is analyzing the speech, which uses the microphone, ok, and the software is understanding the affective state of the user. We are going to use the same example. So, the four stages here, friends, are: sensing; inferring the context; once you have understood the context, predicting the context, let us say the future context or future action; and then intelligent actioning, the feedback, what the machine could do once it has understood the affect, ok.
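Read as software, those four stages form a simple processing loop. The skeleton below only shows where each stage sits; every function body is a placeholder, not an implementation.

```python
def sense(microphone):
    """Stage 1, sensing: collect raw audio samples from the sensor (placeholder)."""
    return microphone.read()

def infer_context(raw_audio):
    """Stage 2, inferring context: features plus a model give the current state (placeholder)."""
    return {"state": "stressed", "confidence": 0.7}

def predict_context(history):
    """Stage 3, predicting context: use past inferences to anticipate a future state (placeholder)."""
    return {"expected_state": "stressed", "horizon_hours": 2}

def act(current, expected):
    """Stage 4, intelligent actioning: decide what feedback, if any, to give the user."""
    if current["state"] == "stressed" and current["confidence"] > 0.6:
        return "suggest_soothing_music"
    return "no_action"
```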
Now, sensing would mean, friends, as we have talked about earlier, that you are collecting the data from the sensor, in that example from the microphone. What are the challenges? Well, the quality of data which is being collected from the microphone could depend on the sensor itself, on the software which is attached to the sensor, and also on where the machine is being used.
Now, the machine, which is the mobile phone in this case, could be, let us say, recording the speech samples of the user at a certain frequency, so as to look at how the emotional state of the user is varying throughout the day, ok. Once this sensing has been done and we have collected the raw data, comes the second stage, right: inferring context.
So, now this is your machine learning part. We are extracting features. So, in the speech it
could be looking at your fundamental frequency and then extracting the MFCC or you know
the representation learning pre-trained based features. So, once the feature has been extracted,
we are going to then use machine learning to predict, ok.
For example, you could use a Gaussian mixture model, or you could use a support vector machine, a neural network and so forth, which is measuring the stress of the user. So, we have now inferred the state of the user: we have been recording the user and then we have predicted stress versus non-stress.
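A minimal, illustrative version of that inference step is sketched below, using librosa for MFCC features and a scikit-learn support vector machine. The "labelled clips" are random noise standing in for real recordings, so the output is meaningless; only the shape of the pipeline is the point.

```python
import numpy as np
import librosa                      # pip install librosa scikit-learn
from sklearn.svm import SVC

def mfcc_features(signal, sr=16000, n_mfcc=13):
    """Mean MFCC vector over the clip, a common compact speech representation."""
    return librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=n_mfcc).mean(axis=1)

# Stand-in 'labelled clips': random one-second signals with placeholder labels.
rng = np.random.default_rng(0)
clips = [rng.standard_normal(16000).astype(np.float32) for _ in range(20)]
labels = [i % 2 for i in range(20)]          # 1 = stressed, 0 = not stressed

X = np.stack([mfcc_features(c) for c in clips])
clf = SVC(probability=True).fit(X, labels)

new_clip = rng.standard_normal(16000).astype(np.float32)
print(clf.predict_proba([mfcc_features(new_clip)]))   # [non-stress, stress] scores
```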
If we have sufficient data it allows us to do predicting context, that is we can build models for
future events we can predict how the user could behave at this particular day time based on
the data collected from the n earlier days, right. Now, in this there are some very important
questions with respect to privacy.
If the user state is going to be predicted for a future event, how much the prior information is
being used; is that a short term information or longer term information, what is the context in
which that information is being used? And then again, the points which I was discussing with
you with respect to who is going to get access to this future prediction?
So, you could use this prediction of stress level to give some type of feedback, some warning, to the user as well, right. Now, that would be our intelligent actioning. Perhaps when the user looks at the calendar and sees n events which are planned, the higher precedence events, or the events which are going to be happening in the very next few hours, could be shown first, so that the user does not get the sense that they have, let us say, a very, very busy day ahead, right.
But in this very aspect, the machine might anticipate that an event x is more important than an event y, and hence the visualization shows it more prominently; but perhaps it is not that way. Maybe one of the events is just a one-off event and that is what is important for the user, right. The machine could then actually end up trying to regulate the behavior.
Further, it is also a question of curiosity versus accuracy, right, as we have discussed from the sensor's perspective as well. What is the value of a decision? What is the cost if the prediction is wrong? And that again links back to the privacy concerns of the user, with respect to how the app data is going to be used, who accesses it and so forth.
Now, there is a problem of insufficiency with respect to the current privacy approaches. So,
current approaches, they use things such as cryptography and then there is privacy preserving
data mining approaches then there are approaches for anonymisation of the data, that is,
removing identity information.
It has been observed that these are often insufficient, by themselves or in combination, when they are deployed, right. So, even if the app predicted that the user is stressed and that longitudinal data is being stored in an encrypted way on the mobile phone, on the user's phone, there are still possibilities that the data could be hacked, could be decrypted.
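For completeness, here is a tiny example of the kind of anonymisation step being talked about: a salted hash replacing the raw user identifier before a record leaves the phone. As the lecture stresses, this alone is not sufficient, since the remaining behavioural fields can still re-identify a user when combined with other data.

```python
import hashlib
import secrets

SALT = secrets.token_bytes(16)   # would be provisioned and stored securely per deployment

def pseudonymise(user_id: str) -> str:
    """Replace the raw identifier with a salted SHA-256 hash before upload."""
    return hashlib.sha256(SALT + user_id.encode("utf-8")).hexdigest()

record = {"user": "alice@example.com", "state": "stressed", "hour": 14}
safe_record = {**record, "user": pseudonymise(record["user"])}
print(safe_record)   # identity field removed, but other fields may still leak information
```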
What this also means is then after let us say the user has been given feedback about how they
were feeling, how the machine sensed their affective state is it really important to store
longitudinal data? Is it important to store very fine grain data? Because you know there are
these privacy concerns.
So, again, this goes back to the designer: if you are storing the information which has been predicted by a machine learning model, do you actually store the exact predictions, or some interpretations which could be used by the user in the longer term but do not really affect or invade the privacy of the user as much as if the predictions themselves were stored and, unfortunately, there was unauthorized access to that data.
Now, with respect to anonymisation, right, which is removing the identity information, it is it
is fairly complex to anonymize data because it is non-trivial to have a fool-proof
anonymisation method. It has been shown in recent research works in signal processing and
machine learning that how for example, the pattern of key press on a keyboard or on a mobile
phone can be used to roughly identify who the user is.
Further things such as using the Wi-Fi signal using the radio you know the radio signal in
turn can be used to look at how people are let us say walking or are doing certain activity in a
room, right. So, one can deconstruct the identity information or some vital information about
the user from the sensor data even though a designer may think that the sensor data is not
recording the identity information, the personal information of the user.
But with these newer mappings which are being enabled by the deep learning based systems
one could make sense of the private information, which let us say is captured from the Wi-Fi
signals or from the accelerometer plus gyroscope kind of data, right.
So, what this means, again, is that the whole discussion goes back to the designers themselves: they need to be very careful and observant of the fact that, even though they are trying to anonymize the data which, let us say, is going to be stored or analyzed at a cloud server end to predict the user's affective state, there are possibilities that the identity, or a pseudo identity, or a smaller set of attributes is still extractable from that data, which could identify the user and hence affect the privacy of the user. So, friends, in this first part of ethics in affective computing, we discussed why it is important to consider the different gamut of ethics.
We saw how the sensors can be used to understand personal data, which, from a designer's perspective, is extremely important to know, so that they do not harm the user and they design the machine such that the privacy of the user and the ethical concerns which we have with respect to affect sensing and affective feedback are taken into consideration.
Thank you.
Affective Computing
Prof. Jainendra Shukla
Prof. Abhinav Dhall
Department of Computer Science and Engineering
Indian Institute of Technology, Ropar
Week - 12
Lecture - 02
Ethics in Affective Computing
Hello and [FL]. I am Dr. Abhinav Dhall from the Indian Institute of Technology Ropar.
Friends, this is the second lecture for Ethics in affective Computing. This is week 12 of the
affective Computing course. So, in the first lecture, we touched upon the concept of why it's
important for us to consider the ethical considerations, aspects when designing an emotion
aware app. And then we discussed about the privacy concerns from the sensor and the
software perspective.
In today's lecture, we are going to move forward. First, we are going to discuss the ethical concerns across an AI enabled system pipeline. We are going to take the example of an emotionally aware AI machine, and we will see how the different stakeholders can add biases, which can later lead to ethical issues in the system when it is deployed for the user.
Then we are also going to talk about facial expressions per se: how the ethical considerations come into the picture and what the concerns are when facial expressions are being analyzed. And then we will conclude by discussing the open issues.
Now, we are going to talk from an AI system's perspective, ok. Let us say we are part of a large team which is creating a system for interacting with the user. What the system does is, it analyzes the speech and the face of the user.
So, you have the microphone and the camera sensors. Now, this is the sensing part. The feedback part, after the affect has been sensed, is in the form of a virtual agent, which, of course, is showing facial expressions in response to the sensed affect and is also using text to speech and varying it accordingly, ok.
So, let us say this is the system which we are going to create to understand the emotional state and then have a conversation with the user. After the requirements have been gathered and understood by the team, the creators, let us look at what could be the issues with respect to privacy and the ethical concerns here.
First of all, the designers themselves may have certain biases based on their training right.
You could consider an example, let us say all the developers are males and are from a
particular ethnicity.
Now, they may have their own set of beliefs, which of course, is then going to affect the
design of the system. Now, from the example which we have taken. So, let us say they may
choose all the virtual agent avatars to be of a particular gender. Ok they may say, well, you
know we only will have the avatars of a particular gender.
Now, friends, you can very well imagine the problem with this, right. They are designing the system and they are influencing the system based on their own beliefs, which is due to the biases which they may have. There is a problem: in this hypothetical example team which we have taken, there is a lack of diversity on that team, which means there could be a lack of different perspectives going into the design of this particular machine.
There is a lack of gender diversity as well. And maybe it is just one or two ethnicities; people from all spectrums are not part of the team which is creating it. So, there is a limitation in the perspectives which are being taken into consideration when this affect sensing and reacting machine is being created.
So, there is a large possibility that there is a bias which comes from the creators themselves,
because they have a certain notion of what could be the emotions and how a machine should
react when the emotion of the user is sensed. The other is, these creators would choose
particular type of algorithms, right.
To map the microphone and camera sensor data to an affective state, as we have studied in the earlier lectures, we will have a certain machine learning algorithm there. Now, the nature of the algorithm itself may lead to some unfair behavior, some unexpected results with respect to the feedback.
Because, when it is sensing, maybe it is not unbiased towards certain types of data because of how it has been designed by the creators. Further, it needs to be fair: you would not expect your machine to give incorrect emotional state results for a particular set of users belonging to a certain corner of society. And there needs to be accountability as well.
So, why does the machine think, let us say, that the user is happy or sad? The feedback is also based on this. Further, transparency is required: why is your machine reaching a certain conclusion about the affective state of the user? Now, if this is not taken into consideration, the machine may sense a certain behavior, a certain affective state in the user, and the virtual agent will start to react accordingly, which in reality could be the total opposite of what the user might expect.
Then, from the creators' perspective, when they are designing the system, they would like to understand how a particular machine learning algorithm is reaching a particular state. So, the transparency aspect is extremely important, so that you can interpret why a particular affective state has been predicted.
The same goes, from the algorithm perspective, for the virtual agent's feedback as well, since you are expecting your virtual agent to show certain facial expressions and to vary the speech. So, how is this mapping from the user's affective state to the synthesis of emotion going to be done? The algorithm needs to have these particular attributes of fairness, accountability and transparency.
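As a rough illustration of what one such fairness check could look like in practice, here is a minimal sketch that compares emotion recognition accuracy across demographic groups of test users. The group names, labels, predictions and the 0.10 gap threshold are hypothetical assumptions for illustration, not values from the lecture.

# Minimal sketch (illustrative only): compare emotion-recognition accuracy
# across demographic groups to flag potentially unfair behavior before deployment.
from collections import defaultdict

def accuracy_by_group(y_true, y_pred, groups):
    # Per-group accuracy of predicted emotion labels.
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, prediction, group in zip(y_true, y_pred, groups):
        total[group] += 1
        correct[group] += int(truth == prediction)
    return {group: correct[group] / total[group] for group in total}

# Hypothetical test data: true labels, model predictions and the demographic
# group of each test subject.
y_true = ["happy", "sad", "happy", "fear", "sad", "happy"]
y_pred = ["happy", "sad", "sad", "fear", "happy", "happy"]
groups = ["group_A", "group_A", "group_B", "group_B", "group_B", "group_A"]

scores = accuracy_by_group(y_true, y_pred, groups)
print(scores)

# A large accuracy gap between groups (the 0.10 threshold is an assumption)
# would signal that the model or the training data needs to be revisited.
if max(scores.values()) - min(scores.values()) > 0.10:
    print("Warning: accuracy gap across demographic groups exceeds threshold")

Accuracy is only one possible lens; depending on the application, one could compare other per-group error statistics in the same way.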
Now, another aspect: you have chosen your algorithm and you would now be collecting data. And friends, this is one particular stage at which a large number of biases can be added to a system. For example, to create the sensing part of the system you want to train your machine learning algorithm, and let us say you are sourcing this data from the internet. Then, is this data actually covering the spectrum of users who could be using the system? Does it have enough representation across different ages, genders and ethnicities?
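One lightweight way to start answering that question, assuming some demographic metadata is available for each collected sample, is to tabulate how the data is distributed across age bands, genders and ethnicities. The attribute names and values below are purely illustrative assumptions.

# Minimal sketch (illustrative only): audit how well a scraped training set
# covers the intended user population before committing to it.
from collections import Counter

# Hypothetical metadata for each collected sample: (age_band, gender, ethnicity).
samples = [
    ("18-30", "female", "ethnicity_1"),
    ("18-30", "male", "ethnicity_1"),
    ("31-50", "male", "ethnicity_2"),
    ("18-30", "male", "ethnicity_1"),
]

for name, index in [("age band", 0), ("gender", 1), ("ethnicity", 2)]:
    counts = Counter(sample[index] for sample in samples)
    total = sum(counts.values())
    print(name, {value: round(count / total, 2) for value, count in counts.items()})

# Severely under-represented categories are a signal to collect more data
# (with proper consent) rather than train on a skewed sample.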
And another aspect, again from the ethics perspective: what is the source of the data? You are, let us say, taking it from the internet to train your machine learning algorithm, but have the permissions been sought from the data creators? That is another ethical concern which comes into the picture when you are designing this emotion aware AI system.
Who owns the data which you are using to train, and what rights has the curator of the data given you to use that data in a particular sense? And then, based on the discussion which we had in the last lecture, let us say these are face images and speech patterns which we are downloading from the internet; these belong to certain individuals.
So, while training our machine learning algorithm, are we implicitly encoding the identity information of the subjects whose data is being used to train it? This is an extremely important step, friends: where is your data coming from, who is the owner of the data, what type of processing are you doing on the data, and then the aspect of who is labelling it? Let us say we downloaded hundreds of thousands of images from the internet and then we have a team of labellers.
Now, the same issues which we have discussed for the creators apply to the labellers as well. When a labeller looks at an image of a person and decides whether the person is showing a fear-related or a happiness-related expression, that judgment is based on how they conceive a fear expression could look, which means their judgment could be affected by biases based on their earlier experiences and their training.
So, once the data is sourced and before the labelling starts, it is extremely important to train the labellers: to tell them what is expected, how the system will behave, what type of data we have collected, and any meta information about the data which has been collected.
And since we have a team of labellers, in order to avoid the bias which can come from the labelling, we would expect to cover the different genders and ethnicities within the labeller team as well.
We should also perform statistical analysis on the labels which are created by the labellers, because ultimately these two things, your data and your labels, are going to decide how well your machine learning algorithm performs, which in turn affects the quality of affect sensing in your machine, which will directly affect the quality of feedback by the virtual agent.
So, from the designer's perspective, these are extremely important points which need to be taken care of, so that your data is not biased, your data is clean, and your data's usability parameters and ownership are clear. The same goes for labelling as well: the labellers are aware of what they are supposed to label, there is a consensus among the labellers and bias is avoided as much as possible.
And then a battery of statistical tests is performed on the labels to check, for example, the consistency across the labels which are generated (a small sketch of one such consistency check follows below). Now, let us say the data is carefully curated, the algorithm is decided, and the developers train the algorithm; they see that the training error is within what they expected and they are able to validate on a validation set from the data itself.
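To make the consistency check mentioned above concrete, here is a minimal sketch using Cohen's kappa between two hypothetical labellers. The labels, the use of scikit-learn and the 0.6 rule of thumb are illustrative assumptions, not something prescribed in the lecture.

# Minimal sketch (illustrative only): check label consistency between two
# hypothetical labellers using Cohen's kappa before their labels are used
# to supervise the affect-sensing model.
from sklearn.metrics import cohen_kappa_score

# Hypothetical emotion labels assigned by two labellers to the same 8 images.
labeller_1 = ["happy", "fear", "sad", "happy", "fear", "neutral", "sad", "happy"]
labeller_2 = ["happy", "sad", "sad", "happy", "fear", "neutral", "sad", "fear"]

kappa = cohen_kappa_score(labeller_1, labeller_2)
print(f"Cohen's kappa between labellers: {kappa:.2f}")

# Low agreement (say, kappa below roughly 0.6 -- an assumed rule of thumb)
# suggests the annotation guidelines are ambiguous or the labellers need
# further training before their labels are trusted.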
Now, before the system is actually sent out to customers, to users, there is a testing phase as well. There is a group of people, the testers, who are going to evaluate how this particular system performs in different environments. What that means is that they would be writing test cases, and these test cases are based on the expected behavior when the system is given a certain input. Friends, here as well there is a possibility of biases coming into the picture.
Again, testers are also human beings: they design the test cases, they interact with the machine, they test it, and their responses are affected by their own prior training and experiences. So, there also, biases can creep into the system, because ultimately your testers are supposed to give feedback to the developers and designers so that the machine can be improved.
Now, what this means is that at every step of the development of a system, there are variables which can affect the ethical aspects of the machine and also the privacy of the user. Therefore, we have to be careful about the biases which are added when we are designing such a machine.
(Refer Slide Time: 16:12)
Now, friends, we are going to talk about one modality here as well. Since we were talking about emotion sensing and then response, let us look at the facial part of it, the facial expression. I am sharing this from the seminal work by Barrett and others, and I quote the authors here.
They say there are three major issues with facial expressions when they are used as the window to emotion in the affect sensing part. This is going to affect the performance of the system and, as we have seen in earlier discussions, once the prediction is complete, how that affective state information is used is going to have a direct impact on the user.
Therefore, the information which we use at the fundamental step of sensing affect also has limitations, and these need to be discussed when you are creating an emotion aware AI system. Now, in the context of facial expressions, Barrett et al. argued that the first issue is limited reliability, which means there are instances of the same emotion category that are neither reliably expressed through nor perceived from a common set of facial movements.
A typical example of that is fear. Fear is a very complex emotion, and the way in which users express that particular emotion varies quite a lot across different ethnicities and cultures. What that means is that when we use facial expressions and build a generic system based on an image or a set of frames which we are capturing, there is a high possibility that for a certain section of users from a certain ethnicity the results might actually be incorrect. So, when you are deciding the particular type of label which you are going to use to reflect the emotional state, you have to be careful that that emotional category is reliable in itself with respect to the users and to the background of the users who are going to use this particular affect sensing application which you are creating.

The second point from Barrett et al., friends, is the lack of specificity, which simply means there is no unique mapping between a single configuration of facial movements and instances of the same emotion category. A very simple way to understand this: let us say you have two people from two different cultures who are smiling; they might be using the region around the lips quite similarly.
But how they are using the eyebrows could differ, and the intensity of the expression could differ. What that means is that the muscles which come together to show that particular expression on the face could be different for different users coming from different ethnicities.
So, again this goes back to the discussion which we were having from the creator and designer perspective: one has to be observant of this fact as well. Then, let us say you are looking at the facial muscles in the form of facial action units and you have predicted (Refer Time: 19:59) those facial action units.
How you are going to use this low-level information for higher-level understanding of, let us say, stress or mental health, these kinds of very serious applications, may be very user specific and very context specific, because different facial muscles would be involved in the expression.
The third issue with facial expressions, as mentioned by Barrett and others, is limited generalizability. What it means is that the effects of context and culture have still not been sufficiently documented and accounted for. The responses of users in a certain environment, again coming from different cultures, might be a bit different; there may be variability in how someone would smile or frown or show a disgust expression.
So, it is non-trivial to generalize, and friends, we discussed this when we were talking about how affect is sensed from facial expressions; this is actually a challenge when it comes to the generalizability of machine learning systems for understanding the emotional state of the user. Now, the same goes back to the ethical concerns as well, if the creator of the system assumes that the system will generalize based on the data which they are using to train the machine learning system to sense affect.
It is possible that if users of a particular ethnicity or culture, or from a certain part of the age range, have been missed from the training data, they may get biased results in the test phase because of the lack of generalizability of facial expressions.
And friends, you can link this to speech patterns as well. There are similar issues: limited reliability in how people express themselves when they are speaking, the same lack of specificity with respect to, let us say, the words which people use to express the same emotion, and of course limited generalizability as well when it comes to speech patterns.
Now, let us change gears a bit and talk about the policy aspect. If we are going to use these emotion sensing machines, certainly there needs to be a policy, there need to be guidelines with respect to where the machine can and cannot be used, on whom it can be used, for whom it can be used, who can create it and who can validate it.
And the same applies not just to emotion-aware AI systems, but to AI systems in general. Since this area is so new, affective computing is not a very old area, there is still a gap, there is still a lack of policy which needs to be formed. Certainly, such a policy would have to take into consideration the implications, the pitfalls and the positive use cases when particular emotion recognition and emotional feedback systems are created.
And again, who is able to use it? I quote (Refer Time: 23:25) Reynolds and Picard here: the ethical consequences of sensing user emotion are unstudied, and methods for dealing with them in a manner users and designers see as ethically acceptable are absent. So, there needs to be very careful deliberation with policy makers.
This is so as to see whether the way in which the creators envision the use of emotion aware machines is actually how the users would use them, and whether it is safe for the users to be interacting with these kinds of emotion aware machines. So, certainly the policy aspect needs to be taken care of.
Now, again from the work of Reynolds and Picard, there is another interesting aspect, which is about user agreements and contracts. What that simply means is, let us say you have an affect sensing app which is either implicitly getting inputs from the user based on sensor data, or explicitly making clear to the user that, well, high stress was detected and this is what could be done; a feedback mechanism is there as well.
What they found is that users trust and are more comfortable with the app when explicit agreements and contracts are made. The users are then aware of what the app is capable of doing, how the data is going to be used and where it is going to be stored. They are more comfortable in letting the app sense their affect simply because a contract has been made between the app and the user, and the user is very much aware of what the app is doing and will be doing with the data based on the sensed affect.
And what this means from the designer's perspective, friends, is that we need clear documentation and clear contracts, so that the user is clear about the capability of the app, what is going to be recorded, how it is going to be recorded, what is going to be analyzed and who has access to that data. In that way, this is a mechanism to build trust between the app and the user.
Now, we have reached the final part of this lecture, and we are going to talk about the open issues, issues which still exist in the form of challenges and opportunities for researchers and designers in affective computing. First, we still need to agree upon what sort of machines should be given affective capabilities.
An example: should the ATM which you use to withdraw money also be aware of emotion? If there is a camera, should it sense how the user is feeling when they come to withdraw money?
Or a machine in the form of a computer which is interacting online with the user in, let us say, an interview scenario: should that be allowed to understand the affective state of the user and then use that information for assessment, for giving feedback to the interviewer? This actually goes back to the policy aspect: which particular use cases actually require the understanding of affect, and which use cases, if enabled with affect sensing, might be harmful to the user.
The other issue is, let us say we agree upon a particular type of machine, a particular use case, to have affect sensing capability; then we also need to know the consequences of giving the machine this capability. I have given you the example of emotional co-dependency, a user getting too dependent on the machine to get feedback about their emotional state.
And then, of course, there are the non-trivial aspects of a user's affective state information being shared with other users or, let us say, with clinicians in some use cases. What are the consequences of that?
So, we need to be very clear about that before a system is deployed. Further, from a more philosophical angle: what should be the purpose of developing an emotionally intelligent machine? Friends, this goes back to the first lecture in this affective computing course series.
We quoted Professor Marvin Minsky, who said that the question is not whether intelligent machines can have emotions, but whether machines can be intelligent without emotions. So, the purpose essentially is that we would like to enable these machines with affect sensing capability in different ways, whether in a limited capacity or in a full-blown capacity.
The ultimate aim is to serve the user, enabling more engagement and more productivity for the user. Further, there needs to be discussion on the moral constraints which need to be considered while developing a machine. Take the example of a machine which is sensing emotion during an interview; there are moral constraints there.
Perhaps the interviewee is just stressed during that interview, but otherwise they are a person who does not get stressed in a work environment. So, will it be morally correct for the machine to sense that the person was stressed during certain phases of the interview and share that with the interviewer, when this could actually not work in favor of the interviewee?
Further, should the machine simulate behavior in a socially acceptable way, or should it simulate behavior based on a critical moral standpoint? Once the machine has sensed that the user is in a certain emotional state and, let us say, a virtual agent is giving feedback, is this machine supposed to give feedback as if it is a machine, or as if it is a companion?
There is a moral dilemma there: is it supposed to give the factually correct information, or is it supposed to modify or improvise on that information a bit so that the emotional feedback is also appropriate with respect to the user? And this actually brings a memory to mind. Some of you may have seen the popular Hollywood movie Interstellar, in which the user adjusts the humor level of the robo.
He gives feedback to the robo: I want your responses to be funny at a level of 0.6 or 0.8. Because, in a critical environment where you are expecting factually supportive feedback from the machine, emotional improvisation by the machine can dilute what was originally expected from that particular machine. So, this is a very serious point which needs to be taken into consideration with respect to how the feedback is designed for affect aware machines.
(Refer Slide Time: 31:48)
Now, moving further along the open issues, the question is: is there deception inherent in affective computing? Say the machine understands the affective state of the user and wants to help the user, but in this pursuit of helping, it does not show the user all the calendar bookings which the user has made.
On one hand the intent is to help the user, let us say so that the user does not get too stressed, but on the other hand this is deceptive: you are withholding some information and highlighting other information. That is a moral dilemma. How ethical would it be for the machine?
And of course, this also means that the design of affect aware machines should not be limited to computer scientists; it should also take into consideration feedback from different stakeholders, not only from anthropology, but also from the legal, medical and psychological perspectives and so forth.
Further, friends, does a machine enable human beings to function in the manner in which they choose to represent themselves, or does it look through the person? In other, easier words, does it have the capacity to expose the concealment that is inherent to how humans represent themselves in front of others?
Let us say you are in a formal environment, you come across a person, a colleague, and you have a very formal discussion with them. During this formal discussion, some facts are implicitly concealed; it is not that the intent is wrong, it is just that it is a very formal discussion, on a need-to-know basis, for example.
Now, the same is applicable to machines as well. Perhaps a user is trying to conceal some facts, trying to hide their emotion for some reason which he or she feels is right. But what if the machine is able to sense that even though the user appears relaxed and composed, internally they are not, because you are doing a multimodal analysis of the data, maybe using physiological sensors as well?
So, is that right or wrong? Maybe, because of the context and the use case where that particular system is being used, the machine is not required to have that access; it is a formal communication, so maybe something needs to be concealed. Again, this means that if you are trying to understand the emotional state of the user, and we expect the machine to be a companion in achieving a certain task, then it also has to understand that concealment, for example, does not necessarily mean that someone is trying to hide something for nefarious purposes; it may simply be based on the use case and the environment.
Further, should destructive machines be given emotional capabilities? Let us say there is a weapon, a machine which is capable of doing harm to the user or to the environment; should it have emotional capabilities? For example, should anti-missile systems be affected by the emotional state of the user, or by the sentiment about a certain adversary in the news or in general?
That is a very complex question, a very difficult question to answer. But until our understanding grows of how an AI machine, and then an emotion aware AI machine, should behave and react, we need a clearer understanding of why you would need emotional capability in a destructive machine; maybe this particular use case does not require it, or maybe it does.
Let us say a machine which is capable of doing certain harm is able to sense that the user is too stressed, and since the user is too stressed, the user may make an incorrect decision which they may regret later. What does the machine do? Does it refuse to take the order, refuse to do the task which it is being asked to perform?
There is no right or wrong answer at this point, which simply means that we, as designers of these emotion enabled AI machines, need to consult the world outside of computer science to understand the different challenges and aspects which could affect society, and also how studies in psychology have been done to understand human behavior.
And what could be the response of a machine which is trying to mimic or assist a human? Further, friends, what are the ethical aspects that a designer should include while developing emotionally intelligent machines? What is the checklist? We have already discussed, for this emotion aware AI machine, the different aspects with respect to biases and data which come into the picture.
So, can we create a checklist which the designer could follow, and then be observant of the ethical aspects which need to be taken into consideration? The sooner there is a guideline, I think, the better, because we are seeing phenomenal progress in AI enabled machines, and certainly, with these AI enabled systems, machines and apps being made more deployable and more accessible to users, it is important that we understand the human nature and the emotional quotient of the user as well.
Now, friends, let us take an example of these open issues with respect to targeted advertising. Let us say you are able to understand the emotional state of the user; should the advertisements which are shown to the user on a social media platform be based on the emotional state of the user?
A trivial example: the user is feeling sad; should the user be shown advertisements for chocolates? That is actually a moral dilemma right there. It may solve the short-term problem, but it could lead to a longer-term problem with respect to the user's dependency on the products which are being shown to them in these targeted advertisements.
Now, should the software which the user is using to browse, let us say the browser, be allowed to look at the emotional state of the user at all? And then, what about vulnerable populations? Let us say there is a teenager in some part of the world who is not feeling so confident about themselves, and there are these targeted advertisements which, let us say, talk about a certain type of wardrobe or a certain type of food which could supposedly enhance that particular teenager. There are very vulnerable populations: someone, let us say, who is sick and is searching for certain answers, and then through cookies the system analyzes their emotional state
and shows them advertisements which might be something the user is not directly looking for, but which can strongly influence them. Now, the other question is: let us say these questions are answered, the limits of targeted advertising with respect to emotions are settled.
What would be the change in the advertising world itself once the system becomes emotionally intelligent? Currently, the advertisements which are shown on social media platforms are based on different attributes such as age, gender, browsing patterns and so forth.
If you add emotion to it as well, then it is going to have a massive impact on targeted advertising. So, before such a deployment is done, it is extremely important that the effects are studied not only from the user's perspective, but from the advertiser's perspective as well, along with how we would deal with these kinds of issues. Certainly, this again goes back to the point which I was making earlier, friends, about policy: currently, in affective computing, the policy part is missing.
We see that, for AI systems in general, different nations and different consortiums are coming up with policies, with blueprints of where AI systems can be used and how AI systems are going to affect jobs, user reactions, user knowledge and so forth. The same needs to be done for emotion, and when we talk about this use case of targeted advertising, the same discussion will come into the picture.
So, friends, with this we have reached the end of the second lecture on ethics in affective computing. We discussed the effect of biases on the privacy of the user across the different stages of the development of an emotion aware AI system.
We discussed how facial expressions and speech patterns have limitations and how that affects the privacy of the user. And later we discussed the current open issues with respect to the creation, usability and deployment of emotion aware AI systems, which will have a massive effect on the user.
Thank you.
Affective Computing
Prof. Jainendra Shukla
Prof. Abhinav Dhall
Department of Computer Science and Engineering
Indian Institute of Technology, Ropar
Indraprastha Institute of Information Technology, Delhi
Week - 12
Lecture - 39
Finale
Hi friends.
So, it is fascinating to see that this wonderful journey we have taken together is now coming to an end. It was a wonderful learning experience for both of us, and I believe for you as well, to go through these emotionally intelligent machines and the understanding of emotions. In short, we tried to cover just the tip of the iceberg, and the topics we covered included the foundations of emotions.
First, we tried to understand how emotions can be expressed and modelled, how they are expressed in humans, and, building on that understanding, how machines can further express them, recognize them and also make use of them to make interactions more empathetic through different modalities.
Once we understood how these emotions can be represented from a machine's perspective, we started looking at the different modalities, the different sensors which can be used to understand the user's affect. First we started with the camera, looking at facial expressions, hand gestures and body movement.
We realized that camera-based data can give us very powerful cues to understand the emotional state of the user. From the camera we moved on to the microphone, where we discussed how speech-based signals can be used to map the user's emotion onto the computer's understanding, and also how a computer can react back with emotion-based variation in its speech, typically from the perspective of text to speech systems.
And then, friends, we also discussed natural language processing and text-based emotion analysis. Typically we see e-mails, conversations, chats, social blogs and so forth; how can we understand the affective state, and particularly the perceived affective state, of the user based on text analysis?
Further, we tried to understand that these are not the only modalities through which emotions are expressed by humans, and hence machines can not only use these modalities, but also take a step further and look into physiological signals.
We looked at the way humans express emotions through physiological signals, examining a range of them including galvanic skin response, heart rate, brain signals and so on. Then we tried to see how all these modalities can come together and complement each other through multimodal understanding of emotions, and also the expression of these multimodal emotions. That was how we tried to understand how we can recognize emotions.
Next, we tried to understand how machines can also express emotions; for example, how there can be a more empathetic interaction with humans, building upon these emotions. And then we looked into a simple case study where we saw how we can make use of a virtual agent
which can be emotionally intelligent; we tried to convert Siri- and Alexa-like virtual agents into emotionally intelligent virtual agents, and we looked at some simple code and code bases as well. Further, we tried to see what some of the ethical issues around it could be.
Typically, when we are designing any AI enabled system, and in the context of this course an emotionally aware AI enabled system, ethics is an extremely important topic. So, friends, we discussed how the privacy of the user can be taken into consideration and how different sensors, if not carefully used, can lead to invasion of privacy.
Hm hm.
And we also looked at the lacks and gaps currently in the community. For example, there is currently a policy vacuum; affective computing is a very new area, so certainly there is a need for a policy to come in to decide what
Hm hm.
emotion enabled systems can do: where emotions should be measured, who should be given access to emotion enabled systems, and once an emotion has been detected, has been predicted by a machine, what the system should be allowed to do with it.
For example, the meta-analysis, where it is stored, and to whom the machine will give access to the emotion. All of these are burning, important questions which need to be taken into consideration before we design an emotionally aware machine, so that the user's interest is paramount and always taken care of.
Yeah, and I guess, Abhinav, it is worth commenting that during the very offering of this course we have seen a phenomenal rise in the advancement of large language models.
Yes.
Such as ChatGPT, and the learners may be aware of the many issues that are arising with the advancement of these technologies. There is a lot of talk among researchers and policymakers about how such a technology can be regulated.
And I would say, Abhinav, while we are nowhere near an understanding of the regulations around these large language models, once it comes to that stage, imagine there is a large language model which is also emotionally intelligent. What about the regulations around that kind of technology?
We are definitely not there yet, but of course, building on what we have in terms of regulations with respect to existing technologies, AI technologies and LLMs, we can definitely look forward to creating regulations and
Yes.
policies around how ethical usage of emotionally intelligent machines can be ensured.
Yes, and also, friends, we have seen very healthy participation from you on the blogs and on the forums. We would really like you to continue that: move forward, write blogs, and try to integrate emotion aware technology, where it is ethically allowed, within your AI based systems.
Hm hm.
And keep in touch. Certainly, we would love to see this new knowledge applied, where you add not only emotion sensing, affect sensing, but also emotional response to your next AI enabled projects.
Yeah, and certainly I would like to reiterate what you just said. This is certainly the end of the course, but it is certainly not the end of the learning, for you and neither for us. So, definitely we would like and hope to see you keep up this lifelong learning around affective computing.
And definitely we would love to hear back from you about some of the projects that you are doing or are interested in doing, and some of the ideas that you have around the usage of affective computing.
You could be a researcher, a student, or a practitioner of technology in your particular business; we would love to see how you are trying to make use of these technologies, or of the learning that you have acquired in this course, in making better products, better interactions and a more empathetic overall experience for your users.
And certainly, along with the learning, I think this has been a wonderful collaboration, and I am really hopeful that this resource which we have created together will be useful in the longer term. I would also like to take this opportunity to thank the team at IIIT Delhi, Ravi, Anoop and Aman, and of course the brilliant TA, Gulshan Sharma.
Yeah, I completely agree. I think this entire beautiful effort and collaboration would not have been possible without the support that we got from IIITD, and also the wonderful TAs and students who assisted us during the offering of the course.
A very warm thanks to all of them. It was really wonderful to have you here and to be able to collaborate with you over these last few months to create this beautiful resource, and we hope that this kind of collaboration continues beyond this course as well. So, with that, learners, we would like to wish you all a
Happy learning.
THIS BOOK
IS NOT FOR
SALE
NOR COMMERCIAL USE