entropy Article

Sentiment Classification Method Based on Blending of Emoticons and Short Texts

Haochen Zou 1,* and Kun Xiang 2

1 Department of Computer Science and Software Engineering, Concordia University, Montreal, QC H3G 1M8, Canada
2 Department of Science and Engineering, Hosei University, Koganei 184-8584, Tokyo, Japan; [email protected]
* Correspondence: [email protected]

Abstract: With the development of Internet technology, short texts have gradually become the main
medium for people to obtain information and communicate. Short text reduces the threshold of
information production and reading by virtue of its short length, which is in line with the trend of
fragmented reading in the context of the current fast-paced life. In addition, short texts contain emojis
to make the communication immersive. However, short-text content means it contains relatively
little information, which is not conducive to the analysis of sentiment characteristics. Therefore, this
paper proposes a sentiment classification method based on the blending of emoticons and short-text
content. Emoticons and short-text content are transformed into vectors, and the corresponding word
vector and emoticon vector are connected into a sentence matrix in turn. The sentence matrix is
input into a convolutional neural network classification model for classification. The results indicate
that, compared with existing methods, the proposed method improves the accuracy of analysis.

Keywords: sentiment analysis; convolutional neural network; emoticon vectorization algorithm

Citation: Zou, H.; Xiang, K. Sentiment Classification Method Based on Blending of Emoticons and Short Texts. Entropy 2022, 24, 398. https://doi.org/10.3390/e24030398
Academic Editor: Adam Lipowski
Received: 5 January 2022
Accepted: 11 March 2022
Published: 12 March 2022
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Copyright: © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

1. Introduction

As an important media platform for spreading social events, the Internet plays a significant
role in social events [1]. With the rapid development and maturity of Internet technology, many
online social platforms have gradually become the main media for people to obtain information
and communicate with each other. Twitter as a social network platform is popular because of its
real-time, convenient, and interactive characteristics [2]. The burgeoning increase of Twitter
and other social platforms depends on the following two points. First, the short length of tweet
text reduces the threshold of information production and reading, catering to the trend of
fragmented reading in the current fast-paced life [3]. Second, social network content such as
tweets can contain texts, emojis, pictures, videos, and other forms, which makes up for the lack
of pure text communication compared with face-to-face communication and makes text
communication more immersive and accurate [4]. Users can log in to Twitter and publish
information by computers, smartphones, and other terminal devices. There are two striking
features of short texts such as tweets. First, they are short, together with a word limitation
on tweets [5]. The short content of the tweet means it contains relatively narrow information,
which is not conducive to further analysis. Second, short texts such as tweets and comment
content contain a wealth of emojis [6]. Emojis have been used frequently on social media, and
they have been endowed with rich connotations in the process of use. In addition to basic
functions such as expressing actions (e.g., “Go skiing today! 🎿”), objects (e.g., “Sushi 🍣 for
lunch.”), weather (e.g., “It’s sunny ☀ in Montreal this morning.”) or emotions (e.g., “Tonight
is a great night! 😁”), emojis can also enhance the emotion and even disambiguate short texts
such as tweets with sarcastic words and phrases (e.g., “I love to work overtime! 🙄”).
The additional content, such as pictures, videos, and links, has little influence on the
emotional inclination of the tweet itself; it is a kind of noise and can be excluded from the
study. Tweet short text, as one of the most important elements of Twitter, determines the
emotional orientation of tweets in most cases [7]. However, due to the limited number of
words, short text sometimes cannot fully express users’ emotions and attitudes. Therefore,
users add emoticons such as emojis to enrich their emotional leanings. Internet emoticons
were born in the 1980s, and the original emoticons were made up of characters [8]. With the
advancement of the Internet, emoticons have undergone great changes in form, content, and
function. Emojis have gradually become the most popular emoticons on social networks
and have become an indispensable chatting tool in today’s network communication [9].
Symbolic communication can convey feelings more accurately and change people’s commu-
nication mode and expression habits. At the same time, emoticons have different extended
meanings in different situations [10], which can convey rich semantic information beyond
the reach of text expression. Therefore, the importance of emoticons is self-evident. To
objectively judge the emotional polarity of short texts such as tweets, it is necessary to study
emoticons such as emojis in addition to analyzing short text. Integrating emoticons into
the process of short-text sentiment analysis makes it possible to judge more accurately the
emotional tendency of short texts in social media such as Twitter and TikTok.
Social networks are not only a medium for people to record their lives and commu-
nicate with each other but also a way to express personal feelings and maintain relation-
ships [11]. Therefore, social media such as Twitter is an important carrier for people to
express their happiness and sorrow. Social networks are also extremely influential news and
public opinion platforms, and their short texts carry a huge amount of emotional information
from a great number of users, which seems chaotic but contains considerable value. These
emotional traits reflect users’ interests and preferences and, at the same time, may also have
a huge impact on the spread of online public opinion [12]. Therefore, sentiment analysis of
short texts can reveal users’ preferences and their views on current hot topics in society
and support trend prediction, providing a scientific basis for government decision
making. At the same time, tweets and other short-text data from social media contain a large
number of users’ comments and suggestions on products, services, environment, etc. [13].
Enterprises and institutions can further mine and analyze short-text information to obtain
a scientific basis for further research or for the improvement of products [14]. Through
the analysis of short-text information, we can not only predict people’s
personality characteristics and living conditions but also forecast the development trend of
new events, which has practical significance for social development.
Previous work has studied the presence of sentiment value in different short texts
and attempted to analyze the relevant sentiment characteristics in these cases. Zhao et al.
proposed an unsupervised word-embedding method based on large corpora, which utilizes
latent contextual semantic relationships and co-occurrence statistical features of words in
tweets to form effective feature sets. The feature set is integrated into a deep convolutional
neural network for training and predicting emotion classification labels [15]. Alharbi et al.
proposed a neural network model to integrate user behavior information into a given docu-
ment and evaluate the data set using a convolutional neural network to analyze sentiment
value in short text such as tweets. The proposed model is superior to existing baseline
models, including naïve Bayes and support vector machines for sentiment analysis [16].
Sailunaz et al. incorporated tweet responses into datasets and measurements and created a
dataset with text, user, emotion, sentiment information, etc. The dataset was used to detect
sentiment and emotion from tweets and their replies and measured the influence scores
of users based on various user-based and tweet-based parameters. They used the latter
information to generate generalized and personalized recommendations for users based on
their Twitter activity [13]. Naseem et al. proposed a transformer-based sentiment analysis
method, which encodes representations from the transformer and applies deep intelligent
contextual embedding to improve the quality of tweets by removing noise and considering
word sentiment, polysemy, syntax, and semantic knowledge. They also designed a
bidirectional long short-term memory (BiLSTM) network to determine the sentiment value of
tweets [17]. However, these studies focus on methods to improve the accuracy of analyzing
the emotional characteristics of short texts, ignoring the effect of emoticons such as emojis
on the emotional tendency of the whole text.
Several studies have been conducted to analyze the emotional features and semantic in-
formation contained in emoticons and their emotional impact on text content. Barbieri et al.
collected more than 100 million tweets to form a large corpus, and distributed representa-
tions of emoticons were obtained using the skip-gram model. Qualitative analysis showed
that the model can capture the semantic information of emoticons [18]. Kimura et al. pro-
posed a method of automatically constructing an emoticon dictionary with any emotion
category. The emotion words are extracted, and the co-occurrence frequency of emotion
words and emoticons is calculated. According to the proportion of the occurrence times
of each emoticon in the emotion category, a multi-dimensional vector is assigned to each
emoticon, and the elements of the vector represent the intensity of the corresponding
emotion [19]. However, these studies focused on the emotional characteristics and semantic
information contained in emoticons themselves, rather than combining emoticons with
textual data. Arifiyanti et al. utilized emoticons to build a model, classified the emotion cat-
egories of tweets containing emoticons, and evaluated the performance of the classification
model [20]. Helen et al. proposed a method to understand emotions based on emoticons
and designed a classification model based on an attention-based LSTM [21]. However, the
above studies focus on understanding the emotional characteristics of the whole text by
extracting and analyzing emoticons in the text and establishing relevant sentiment labels,
without integrating emoticons’ emotional information with the emotional features of the
text content for further in-depth analysis.
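
The co-occurrence idea attributed to Kimura et al. above can be illustrated with a minimal sketch. It is not their implementation: the emotion lexicon, the three categories, and the function name `emoticon_vectors` are hypothetical and chosen only for illustration. Each emoticon receives a vector whose elements are the share of its co-occurrences with emotion words of each category.

```python
from collections import Counter, defaultdict

# Hypothetical emotion lexicon (assumption): emotion word -> emotion category.
EMOTION_WORDS = {
    "happy": "joy", "great": "joy",
    "sad": "sadness", "cry": "sadness",
    "angry": "anger", "hate": "anger",
}
CATEGORIES = ["joy", "sadness", "anger"]


def emoticon_vectors(tweets, emoticons):
    """Map each emoticon to its share of co-occurrences per emotion category."""
    counts = defaultdict(Counter)  # emoticon -> Counter of emotion categories
    for text in tweets:
        # Emotion categories of the emotion words found in this tweet.
        words = [token.strip(".,!?") for token in text.lower().split()]
        cats = [EMOTION_WORDS[w] for w in words if w in EMOTION_WORDS]
        for emo in emoticons:
            if emo in text:
                counts[emo].update(cats)
    vectors = {}
    for emo, counter in counts.items():
        total = sum(counter.values())
        if total:  # skip emoticons that never co-occur with an emotion word
            vectors[emo] = [counter[c] / total for c in CATEGORIES]
    return vectors


tweets = [
    "Tonight is a great night! 😁",
    "So happy today 😁",
    "I hate Mondays 😁",
    "I could cry 😢",
]
vectors = emoticon_vectors(tweets, ["😁", "😢"])
```

In this toy corpus, 😁 co-occurs twice with joy words and once with an anger word, so its vector weights joy at two thirds and anger at one third, while 😢 is assigned entirely to sadness.
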
In view of the increasing frequency of people, especially young people, using emoti-
cons such as emojis in text, and the increasingly close relationship between emojis and the
emotional tendency of short-text content such as tweets, this paper aims to improve the
accuracy of sentiment analysis on text data, especially short-text data, and objectively judge
the emotional tendency of short-text content such as tweets. Combined with the features of
emojis in short texts, this paper designs and implements a sentiment classification method
with the emoji vectorization algorithm based on the blending of emojis and short texts.
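
As a rough illustration of the blending described in the abstract, where word vectors and emoticon vectors are connected in turn into a sentence matrix, the following sketch stacks one vector per token in reading order. Everything in it is an assumption made for illustration: the embedding size, the fixed number of rows, the zero padding, the simplified emoji code-point ranges, and the random lookup tables that stand in for trained embeddings.

```python
import random
import re

EMBED_DIM = 4  # hypothetical embedding size
MAX_LEN = 8    # hypothetical fixed number of rows for the classifier input
# Simplified emoji ranges (assumption); real emoji coverage is broader.
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")

rng = random.Random(0)


def random_vector():
    return [rng.uniform(-1.0, 1.0) for _ in range(EMBED_DIM)]


# Toy lookup tables standing in for trained word and emoji embeddings.
word_vectors = {w: random_vector() for w in ["i", "love", "to", "work", "overtime"]}
emoji_vectors = {e: random_vector() for e in ["🙄", "😁"]}


def tokenize(text):
    """Split text into lowercase words and single emoji, preserving order."""
    return re.findall(r"[a-z']+|" + EMOJI_RE.pattern, text.lower())


def sentence_matrix(text):
    """Stack word and emoji vectors in token order, zero-padded to MAX_LEN rows."""
    rows = []
    for token in tokenize(text)[:MAX_LEN]:
        table = emoji_vectors if EMOJI_RE.fullmatch(token) else word_vectors
        rows.append(table.get(token, [0.0] * EMBED_DIM))  # unknown token -> zeros
    rows += [[0.0] * EMBED_DIM for _ in range(MAX_LEN - len(rows))]
    return rows


matrix = sentence_matrix("I love to work overtime! 🙄")
```

The resulting `matrix` has `MAX_LEN` rows of `EMBED_DIM` values each, with the emoji row placed where the emoji occurs in the tweet, which is the kind of input a convolutional classifier could consume.
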

2. Materials and Methods


2.1. Data Source and Corpus Construction
The corpus is one of the basic resources of natural language processing [22]. Compared
with traditional texts, short texts such as tweets are characterized by short-text content, rich
emoticons, more noise, and unstructured language [23]. Taking tweets as an example, short-text
data has four distinct characteristics. First, concise language. Although each tweet is limited
to 280 characters, most users use only one or two sentences to express their views and opinions,
and the length of a tweet is usually far below this limit [24]. Another important reason for
concise language is that users often omit sentence components [25]. The resulting lack of
contextual information, together with evaluation objects that are difficult to extract because
sentence elements are omitted, poses challenges to sentiment analysis. Second, various forms of expression. Emoticons
are widely used in short texts such as tweets. According to the collected tweet data, it is
found that the number of tweets containing at least one emoticon accounts for 37.5% of the
total, which shows how popular emoticons are among users. The reason is not only that
emoticons increase the readability and sense of immersion of short text, but also that
emoticons can directly and vividly convey users’ attitudes and emotions [26]. Third, more
noise. Short text, such as tweets, uses specific symbols to indicate a specific role. Links
often appear in tweets [27]. These symbols and links do not affect the emotional orientation
of tweets and are the noise of sentiment analysis. If not removed, the accuracy of sentiment
classification will be affected. Fourth, new words appear frequently on the Internet [28].
With the increasing number of netizens, netizens have created many new words which
are different from traditional language forms in the process of online communication. For
example, “TBH”, to be honest, and “amirite”, am I right, etc. Generally, network neologisms
also have an emotional tendency. Therefore, in the field of tweet emotion analysis, we also
need to analyze network neologisms.
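
The noise described above (links and role-marking symbols) and the share of tweets containing at least one emoticon can be sketched as follows. This is only an illustrative sketch, not the preprocessing used in this paper: treating `@` mentions and `#` markers as the "specific symbols", keeping the hashtag word itself, and the simplified emoji code-point ranges are all assumptions.

```python
import re

URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
# Simplified emoji ranges (assumption); real emoji coverage is broader.
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")


def strip_noise(text):
    """Remove links and @mentions, drop '#' markers, and collapse whitespace."""
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    text = text.replace("#", " ")
    return " ".join(text.split())


def emoji_share(tweets):
    """Fraction of tweets that contain at least one emoji."""
    if not tweets:
        return 0.0
    return sum(1 for t in tweets if EMOJI_RE.search(t)) / len(tweets)


raw = [
    "Sushi 🍣 for lunch. https://example.com/menu",
    "@friend TBH the #weather is fine",
    "It's sunny ☀ in Montreal this morning.",
    "Go skiing today!",
]
cleaned = [strip_noise(t) for t in raw]
share = emoji_share(cleaned)
```

On this toy list, two of the four cleaned tweets contain an emoji, so `share` is 0.5; the same function applied to a collected corpus would yield a figure comparable to the 37.5% reported above.
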
This paper constructs a corpus of short texts containing rich emoticons. Data acquisition
tools were used to collect short-text data using Twitter as the source of short texts, and
100,000 non-news tweets published from 15 October 2021 to 31 October 2021 were selected as the
backup corpus. The preliminarily collected tweet data contained a lot of noise and redundant
data, so it was necessary to preprocess the data. First, we deleted tweets without emoticons.
The research goal of this paper was to analyze the emotional features of the text by
integrating emoticons with the short text. Emoticons can not only directly convey the feelings
and opinions of the information publisher but also have an emotional tendency [29]. Therefore,
emoticons should be considered as one of the important factors in the emotion analysis of
short texts. In this paper, tweets in which short text and emoticons co-occur are used as the
candidate corpus. Second, we deleted tweets with fewer than three words in the corpus. In general, we
general,we believe
deleted believe
we that
believe
tweets that
with tweets
tweets
that tweets
less of
of less
than less
of than
than
less
three than in the
words
three
three characters
characters
three characters are
corpus. Inare not
not
are emotionally
emotionally
not emotionally
general, inclined
inclined
we believeinclined or
or
that tweetsdo
do not
not
or do fully
fully
ofnot express
express
lessfully
thanexpress the
the opinion
opinion
the opinion
three characters holder’s
holder’s
are holder’s
not emotionally
attitude
attitude and
attitude should
andinclined
should
and shouldbe
bedo
or removed
removed
benotremoved
fullyfrom
from the
fromthecorpus
express corpus
the
thecorpus [30].
opinion Third,
[30].[30].
Third, we
Third,
holder’s deleted
weattitude
deleted
we deletedlinks,
links,
and usernames,
usernames,
links,
should usernames,
be removed from
topic
topic names,
names,
topic theand
names, and
corpus retweets
retweets
and retweets
[30]. from
from
Third, from
wetweets
tweetstweets
deleted in the
inlinks,
the
in thecorpus.
corpus.
corpus.
usernames, Fourth,
topicwe
Fourth,
Fourth, we performed
performed
we
names, performed
and retweets word
word word
from tweets
segmentation
segmentation
segmentation in of ofTwitter
the Twitter
corpus.
of Twitter text.
text.
Fourth,
text. we performed word segmentation of Twitter text.
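The four preprocessing steps can be sketched roughly as follows. This is a minimal illustration, not the authors' tooling: the regular expressions, the approximate emoji ranges, the choice to strip only the "RT" marker of a retweet, and the choice to count words after cleaning are all assumptions made for the example.

```python
import re

# Approximate emoji matcher (assumption): covers the main emoji blocks only.
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")
URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+:?")
HASHTAG_RE = re.compile(r"#\w+")
RETWEET_RE = re.compile(r"^RT\b:?\s*")


def clean_tweet(text):
    """Step 3: strip the retweet marker, links, usernames, and topic names."""
    text = RETWEET_RE.sub("", text)
    for pattern in (URL_RE, MENTION_RE, HASHTAG_RE):
        text = pattern.sub(" ", text)
    return " ".join(text.split())


def tokenize(text):
    """Step 4: whitespace segmentation, with each emoji split into its own token."""
    return EMOJI_RE.sub(lambda m: f" {m.group(0)} ", text).split()


def preprocess(tweets, min_words=3):
    corpus = []
    for raw in tweets:
        if not EMOJI_RE.search(raw):       # step 1: drop tweets without emoticons
            continue
        tokens = tokenize(clean_tweet(raw))
        words = [t for t in tokens if not EMOJI_RE.fullmatch(t)]
        if len(words) < min_words:         # step 2: drop tweets with fewer than three words
            continue
        corpus.append(tokens)
    return corpus
```

Calling `preprocess` on a list of raw tweet strings returns one token list per retained tweet, with emojis kept as separate tokens so that they can later be mapped to emoji vectors.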

2.2. Annotation of Emotion in Short-Text Corpus
Short texts sent by users on social media can contain colorful emoticons. Among them, emojis are the most popular and most used emoticons. At present, emojis are widely used in various social networks, and some of them have clear emotional tendencies. For example, 😁 has a positive emotional tendency, while 😫 has a negative emotional tendency. The site "Emojitracker" monitors emoji usage in tweets in real time, as shown in Figure 1. The site ranks each emoji from highest to lowest in terms of the number of times it has been used up to the current time. As can be seen from the figure, many of the emojis that are used frequently have no sentiment value, such as 📷. These emojis have little impact on the sentiment orientation of tweets, so this paper includes them in the neutral category.

Figure 1. The site “Emojitracker” monitors emoji usage in tweets in real time.

In addition, Cappallo et al. performed statistical analyses of a large amount of data with emoticons and found that the occurrence of emoticons follows a long-tail distribution [31]. Most emoji usage can therefore be covered by studying only the most frequent emojis. Accordingly, this paper manually selects the top 300 emojis with clear emotional tendencies from the 819 emoticons on the "Emojitracker" website for research. After screening, 80 emojis were selected for study in this paper, including 40 positive emojis and 40 negative emojis. Emojis and their emotional tendencies are shown in Table 1.

Table 1. Emojis and their emotional tendencies.
Emotional Tendencies    Emojis
Positive    😂 😍 😊 😘 😁 😉 👍 ☺ 😄 😝 😎 😋 💯 💗 😜 👏 😆 💕 😀 😃 🎉 😬 💓 😝 😇 👌 😻 😚 😛 😹 👻 😙 😗 🤣 🙃 😸
Negative    😭 😔 😒 😫 💔 😢 😑 😐 😕 😱 😫 😤 😓 😥 😷 😠 🥀 💩 😰 😟 😮 😨 😵 😿 🙄 😧 😖 😣 😞 😡 ☹ 🤧 🤢 👿 🙀 😮💨 🤒

In this paper, two annotators labeled the selected short-text corpus data containing emojis, and content meeting the requirements was retained through manual annotation. To ensure the accuracy of the annotation results, only tweets that received the same annotation from both annotators were kept as candidate data. A third person then re-checked the candidate data. Finally, an emotion test corpus of tweets was obtained by manual annotation, containing 2000 positive and 2000 negative tweets. All tweets in the corpus contain at least one emoji for analysis.
2.3. Emoji Vectorization Algorithm
2.3.1. Word Vector Training
Emojis are strongly correlated with the sentiment orientation of short texts such as tweets. In the sentiment analysis of tweets, taking emojis as one of the research objects makes it possible to judge the sentiment value of tweets more objectively. However, one of the questions that needs to be addressed is how emojis, which take the form of pictures, should be combined with the text to improve the accuracy of emotion classification. Eisner et al. proposed a vectorization algorithm called emoji2vec [32]. By transforming emojis into vector form, emojis can be used in all areas of language processing in the same way as words.
Word2vec is a tool that Google open-sourced in 2013 to turn words in text into a data format that computers can understand [33]. It learns hidden information between words in an unsupervised manner from unlabeled training sets and obtains word vectors that preserve syntactic and semantic relationships between words. The emoji2vec emoji vector algorithm uses the word2vec tool to train the word vectors. The word2vec tool was trained on the processed corpus, and 765,285 five-dimensional word vectors were obtained after the training.

2.3.2. Construct Sample Set


Emoticons and languages interact semantically. This paper constructs a sample set of emoticons according to the actual requirements of the vectorization algorithm. Each sample in the set consists of three parts: emoji, emoji name, and CLDR short name, such as {😊, Cute, Smiling Face with Smiling Eyes}.
The first component of the sample is the emoji picture. The emoji name is the second component of the sample. In the naming process, the emoji name must meet two conditions. First, the name can reflect the basic meaning of the emoji. Second, names are relatively unique [34]. The third component of the sample is the CLDR short name. The Common Locale Data Repository (CLDR) project is a project of the Unicode Consortium to provide locale data in XML format for use in computer applications. CLDR contains locale-specific information that an operating system will typically provide to applications [35]. The Full Emoji List is a list of emojis that details the code and CLDR short name of each emoji. Sample information on emoticons can be found in this list. Through the above methods, we construct the positive samples in the sample set.

2.3.3. Algorithm Flow


The emoji vectorization algorithm maps an emoji to a point in a high-dimensional vector space such that an emoji can be transferred into an N-dimensional vector with the format of $(dim_1, dim_2, dim_3, \ldots, dim_N)$. The algorithm steps are as follows.
Initialize the emoji vector $x_i$. Each sample contains the name of the emoji, and we take the word vector corresponding to the emoji name, $w_{name}$, as the initial vector of the emoji vector, that is, $x_i = w_{name}$. If the emoji name is an unregistered word, the emoji vector will be randomly initialized. It can be seen from the sample set information that the name of an emoji is a simple description of the meaning of the emoji, so the initial emoji vector already contains the basic part of the semantic information of the emoji, which will be conducive to the formation of the emoji vector.
Construct the description vector $v_j$. $w_1, w_2, \ldots, w_N$ is a sequence of word vectors that correspond, respectively, to the words in the descriptive sentence of the sample. In this paper, these word vectors are added together as the description vector of the emoticon. The formula for the description vector is:
$$v_j = \sum_{k=1}^{N} w_k \tag{1}$$

The description vector is the sum of the word vectors of each word in a descriptive statement, which synthesizes the syntactic and semantic information of all words in the statement [36].
Establish the mathematical model. The dot product of emoji vector $x_i$ and description vector $v_j$ can indicate the similarity between the two vectors. The sigmoid function is used to model the similarity probability of emoji vector $x_i$ and description vector $v_j$, and the formula is:
$$P(y) = h\left(x_i^T v_j\right)^{y} \left(1 - h\left(x_i^T v_j\right)\right)^{1-y}, \qquad h(x) = \frac{1}{1 + e^{-x}} \tag{2}$$
Calculate the emoji vector $x_i$. The sample dataset $D = \{(v_j, y_{ij}) \mid v_j \in \mathbb{R}^n, y_{ij} \in \{0, 1\}\}$ consists of every description vector $v_j$. When the description sentence $j$ matches the emoji $i$, $y_{ij} = 1$; otherwise, $y_{ij} = 0$.
For all description vectors $v_j$ in the sample dataset $D$, the logarithmic loss function of Equation (2) is calculated, which is:
$$-\sum_{i,j} y_{ij} \log h\left(x_i^T v_j\right) - \sum_{i,j} \left(1 - y_{ij}\right) \log\left(1 - h\left(x_i^T v_j\right)\right) \tag{3}$$
The batch gradient descent algorithm is used to find the best emoji vector $x_i$. The emoji vector obtained in this paper is a five-dimensional vector, and each emoji in the sample set has a corresponding emoji vector. Table 2 shows the vectors of four emojis.
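The steps of Equations (1) to (3) can be sketched in a few lines of plain Python. This is an illustrative sketch under stated assumptions rather than the authors' implementation: the two-dimensional toy word vectors, the learning rate, the number of epochs, and the use of one mismatched description as a negative sample ($y_{ij} = 0$) are all invented for the example.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def description_vector(word_vectors):
    """Equation (1): sum the word vectors of the description."""
    return [sum(component) for component in zip(*word_vectors)]


def log_loss(x, samples):
    """Equation (3), restricted to a single emoji vector x."""
    total = 0.0
    for v, y in samples:
        p = sigmoid(dot(x, v))
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total


def train_emoji_vector(x_init, samples, lr=0.1, epochs=200):
    """Batch gradient descent: each step uses the gradient over all samples."""
    x = list(x_init)
    for _ in range(epochs):
        grad = [0.0] * len(x)
        for v, y in samples:
            err = sigmoid(dot(x, v)) - y   # derivative of the log loss w.r.t. x^T v
            for d, v_d in enumerate(v):
                grad[d] += err * v_d
        x = [x_d - lr * g_d for x_d, g_d in zip(x, grad)]
    return x


# Toy word vectors (hypothetical values, two dimensions for readability).
words = {
    "smiling": [1.0, 0.5], "face": [0.2, 0.1],
    "broken": [-1.0, -0.4], "heart": [-0.1, 0.3],
}
matching = description_vector([words["smiling"], words["face"]])
mismatched = description_vector([words["broken"], words["heart"]])
samples = [(matching, 1), (mismatched, 0)]
x0 = words["smiling"]                      # initialization x_i = w_name
x = train_emoji_vector(x0, samples)
```

After training, `x` scores the matching description higher than the mismatched one, and the loss of Equation (3) is lower than at initialization.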
Table 2. Vectors of four emojis.
Emoji    Emoji Vector
😍    (1.253765409217565, −0.926587569876967, 1.698378103032509, −0.527834892346427, 0.847561023874692)
😸    (1.157409314569509, −0.375025790846135, 0.746948348694133, −1.047891534675927, 0.287905347982658)
💔    (−2.219247091347005, 0.345388574098972, −1.613782945782691, 0.547893128730935, −0.789132897543139)
🤮    (−0.441708935387097, 1.134897523487950, −1.824560823954166, 0.078318930571228, −0.927943898316451)
Visualization of five-dimensional emoji sentiment vectors in a two-dimensional space is displayed in Figure 2. Emojis include positive emojis and negative emojis.

Figure 2. Visualization of five-dimensional emoji sentiment vectors in a two-dimensional space.

2.4. Naïve Bayes
Naïve Bayes is a classification method based on Bayes' theorem, which assumes conditional independence among features. When the naïve Bayes algorithm is applied to text classification, it assumes that the words in a text are independent of each other given the text category. The training set is counted and the prior probability of text category $C_i$ is calculated:
$$P(C_i) = \frac{N_i}{N} \tag{4}$$
where $N_i$ represents the total number of documents whose document category is $C_i$, and $N$ represents the total number of all documents in the training set. Then, the conditional probability of the characteristic attributes of document $d$ given its category is calculated:
$$P(d \mid C_i) = P((t_1, t_2, t_3, \ldots, t_n) \mid C_i) = \prod_{j=1}^{n} P(t_j \mid C_i) \tag{5}$$
where $t_j$ represents the $j$th feature of document $d$, and $P(t_j \mid C_i)$ represents the probability that feature $t_j$ appears in text category $C_i$. Finally, the probability of each category for the document to be classified is calculated as follows:
$$P(C_i \mid d) = \frac{P(d \mid C_i) \cdot P(C_i)}{P(d)} \tag{6}$$
Therefore, document $d$ is assigned to the category with the highest probability. Naïve Bayes is a common text classification method with stable classification efficiency; it can handle multi-class classification tasks and performs well on small-scale data.
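A compact sketch of Equations (4) to (6) is given below. It works in log space, drops the constant denominator $P(d)$ because it does not affect the arg max, and uses add-one (Laplace) smoothing for unseen features. The smoothing, the function names, and the toy documents are assumptions made for illustration and are not taken from the paper.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(documents, labels):
    """Estimate the priors of Equation (4) and per-class feature counts."""
    class_counts = Counter(labels)
    feature_counts = defaultdict(Counter)
    vocabulary = set()
    for tokens, label in zip(documents, labels):
        feature_counts[label].update(tokens)
        vocabulary.update(tokens)
    priors = {c: n / len(labels) for c, n in class_counts.items()}
    return priors, feature_counts, vocabulary


def classify(tokens, priors, feature_counts, vocabulary):
    """Return the class maximizing log P(C_i) + sum_j log P(t_j | C_i)."""
    best_class, best_score = None, -math.inf
    for c, prior in priors.items():
        total = sum(feature_counts[c].values())
        score = math.log(prior)
        for t in tokens:
            # Add-one smoothing (an assumption) avoids log(0) for unseen features.
            score += math.log((feature_counts[c][t] + 1) / (total + len(vocabulary)))
        if score > best_score:
            best_class, best_score = c, score
    return best_class


docs = [["love", "😍"], ["great", "day", "😁"], ["hate", "😫"], ["sad", "day", "💔"]]
labels = ["pos", "pos", "neg", "neg"]
model = train_naive_bayes(docs, labels)
```

With this toy model, `classify(["love", "day", "😍"], *model)` selects the positive class because both "love" and 😍 were only seen in positive documents.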

2.5. Support Vector Machine


Support vector machine (SVM) is a kind of classifier whose core idea is to determine an optimal hyperplane that correctly divides samples into two classes by maximizing the interval between the nearest samples of different classes in the training set [37].
Consider the $i$th training sample $(x^{(i)}, y^{(i)})$ in a sample set, where $x$ represents the feature vector and $y \in \{-1, 1\}$ represents the class tag. When the samples are linearly separable, the hyperplane can be expressed as:
$$w^T x + b = 0 \tag{7}$$
Hence, for any sample $(x^{(i)}, y^{(i)})$, when $y^{(i)} = 1$, $w^T x^{(i)} + b > 0$, and when $y^{(i)} = -1$, $w^T x^{(i)} + b < 0$. With $w$ and $b$ scaled so that the samples nearest to the hyperplane (the support vectors) satisfy the constraints with equality, we define:
$$\begin{cases} w^T x^{(i)} + b \ge 1, & y^{(i)} = 1 \\ w^T x^{(i)} + b \le -1, & y^{(i)} = -1 \end{cases} \tag{8}$$
The sum of the distances between the two support vectors belonging to different categories and the hyperplane is:
$$\gamma = \frac{2}{\|w\|} \tag{9}$$
In order to determine the optimal hyperplane, it is necessary to find the parameters $w$ and $b$ that satisfy Equation (8) such that the interval $\gamma$ is the largest, namely:
$$\min_{w, b} \frac{1}{2}\|w\|^2 \quad \text{s.t.} \quad y^{(i)}\left(w^T x^{(i)} + b\right) \ge 1, \quad i = 1, 2, 3, \ldots, n \tag{10}$$
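As a small worked check of Equations (9) and (10), the sketch below evaluates the constraint $y^{(i)}(w^T x^{(i)} + b) \ge 1$ and the interval $\gamma = 2/\|w\|$ for a hand-picked hyperplane. The two sample points and the values of $w$ and $b$ are hypothetical and chosen so that both points are support vectors.

```python
import math


def functional_margins(w, b, samples):
    """y * (w^T x + b) for every sample; Equation (10) requires each value >= 1."""
    return [y * (sum(w_d * x_d for w_d, x_d in zip(w, x)) + b) for x, y in samples]


def interval(w):
    """Equation (9): gamma = 2 / ||w||."""
    return 2.0 / math.sqrt(sum(w_d * w_d for w_d in w))


# Two toy samples, one per class (hypothetical values).
samples = [((2.0, 2.0), 1), ((0.0, 0.0), -1)]
w, b = (0.5, 0.5), -1.0

margins = functional_margins(w, b, samples)   # both constraints hold with equality
gamma = interval(w)                           # equals the distance between the two points
```

Here the interval equals the Euclidean distance between the two samples, which is the largest value any separating hyperplane can achieve for this pair.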

In this paper, the support vector machine (SVM) algorithm, which is widely used
in the classification field, is selected as the classification algorithm to construct an SVM
emotion classifier. It is crucial to select suitable features to get a better SVM emotion
classifier. According to the characteristics of short texts, this paper selects the following
text features.
First, the frequency of emotional words. Counting the frequency of emotion words
requires an emotion dictionary. The emotion dictionary used in the experiment is Word-
Emotion Association Lexicon [38]. Each short text in the experimental data is traversed, and
the number of positive emotion words and negative emotion words is counted according
to the emotion dictionary. Due to the different lengths of each short text, normalization
is needed. The number of emotional words obtained by statistics is divided by the total
number of words in the short text to obtain the frequency of emotional words.
Second, negative words and adverbs of degree. When people communicate, they ha-
bitually use negative words and adverbs of degree. Although negative words and adverbs
of degree do not have emotional polarity, when they are combined with emotional phrases,
they will affect the original emotional tendency of emotional words [39]. Specifically, the
combination of degree adverb plus emotional words will enhance or weaken the origi-
nal emotional tendency of emotional words, such as “really fancy”. The combination of
negative words plus emotion words will reverse the polarity of emotion words, such as
“not into”. The combination of degree adverb plus negative word plus emotion word will
make the emotion degree and polarity of emotion word change, such as “really not into”.
Therefore, special attention should be paid to negative words and adverbs of degree when
analyzing the emotional tendency of short texts.
Third, the number of exclamation marks and question marks. In short-text content,
there are often multiple question marks or multiple exclamation marks used together. The
combination of punctuation marks indicates the strengthening of the original emotional
tendency. For example, the combination of multiple exclamation marks indicates the
strengthening of surprise, anger, and other emotions. The combination of multiple question
marks indicates the strengthening of doubts and puzzles. Therefore, counting the number
of exclamation marks and question marks in short-text content is helpful to analyze the
emotional tendency of short texts.
Fourth, the number of emoticons such as emojis. Emoticons have their own emotional tendency, which affects the emotional intensity, and even the emotional tendency, of the whole short text to a certain extent. Therefore, the number of emoticons is one of the features used in this paper.
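The four feature groups can be gathered into a single feature dictionary, as in the following sketch. The tiny word lists are hypothetical stand-ins for an emotion lexicon such as the one cited above, the emoji ranges are approximate, and the feature names are invented for the example.

```python
import re

# Tiny illustrative word lists (assumptions); a real system would load a full lexicon.
POSITIVE = {"love", "fancy", "great"}
NEGATIVE = {"hate", "awful", "sad"}
NEGATIONS = {"not", "never", "no"}
DEGREE_ADVERBS = {"really", "very", "so"}
# Approximate emoji matcher (assumption): covers the main emoji blocks only.
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")


def extract_features(text):
    tokens = re.findall(r"[A-Za-z']+", text.lower())
    n = max(len(tokens), 1)   # normalize by length so short texts are comparable
    return {
        "pos_freq": sum(t in POSITIVE for t in tokens) / n,
        "neg_freq": sum(t in NEGATIVE for t in tokens) / n,
        "negations": sum(t in NEGATIONS for t in tokens),
        "degree_adverbs": sum(t in DEGREE_ADVERBS for t in tokens),
        "exclamations": text.count("!"),
        "questions": text.count("?"),
        "emojis": len(EMOJI_RE.findall(text)),
    }
```

The resulting values can be ordered into a fixed-length vector and passed to any SVM implementation.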

2.6. Convolutional Neural Network


Deep learning has been widely used in image recognition, speech recognition, com-
puter vision, and other fields since it was proposed and has made remarkable achieve-
ments [40]. Compared with traditional machine learning algorithms, deep learning has
advantages in feature expression and model building [41]. Therefore, we use the convolu-
tional neural network (CNN) to analyze short-text emotion. To give full play to the role of
emoticons in promoting short-text emotional tendency, this paper adds emoticons vector to
the short-text emotional analysis based on the CNN classification model.
The convolution layer and pooling layer play an important role in the convolution
neural network. The convolution layer can extract local features and semantic combinations
from input data. The pooling layer selects local features and semantic combinations based
on the convolution layer and then filters out unimportant local features and semantic com-
binations with low confidence [42]. The alternating superposition of multiple convolution
layers and pooling layers can extract highly abstract features from text data and improve
the accuracy of emotion classification. Figure 3 is the structure diagram of the classification
model based on the convolutional neural network adopted in this paper.
The model has the following four layers:
Input layer. The input of the classification model is a matrix. The matrix is formed by connecting the word vectors corresponding to all words in the sentence after word segmentation. If the word vector corresponding to the $i$th word in the input sentence with length $n$ is $X_i \in \mathbb{R}^5$, then the matrix is $X = X_1 \oplus X_2 \oplus \ldots \oplus X_n$, where $\oplus$ is the concatenation operator.

Figure 3. Structure diagram of the classification model.

Convolutional layer. The classification model based on the convolution neural network uses convolution filters with different window lengths $h$ to extract the local features of the input layer. In the research, we implement the parallel convolution layer with multiple convolution kernels of different sizes to learn short-text features. Multiple convolution kernels are defined to acquire features in the short-text content and reduce the degree of fortuity in the feature extraction process. We define the filter size of convolution kernels as $h_1 \times k$, $h_2 \times k$, $h_3 \times k$, where $k$ is an integer and the dimension of word embeddings, and $h_1$ is the stride value. The feature obtained by applying the convolution filter $w$ to the input layer is:
$$c_i = f\left(w \times X_{i:i+h-1} + b\right) \tag{11}$$

where $f$ is the non-linear activation function. The rectified linear unit (ReLU) is used in
the research:
$$c_i = \max(0, w \times X_{i:i+h-1} + b) \tag{12}$$
In the equation, $w \in \mathbb{R}^{hk}$ is the shared weight, $X_{i:i+h-1}$ represents the
connection of the word embeddings from the $i$-th word of short text $X$ to the $(i+h-1)$-th
word, ordered from top to bottom, and $b \in \mathbb{R}$ is the bias term. A characteristic graph
$C = [c_1, c_2, \ldots, c_{n-h+1}]$, $C \in \mathbb{R}^{n-h+1}$, can be obtained by applying the
convolution filter to all adjacent word vectors with length $h$ in the input matrix. Therefore,
each convolution kernel yields $n-h+1$ features, which form a feature vector $t$ of
dimension $n-h+1$.
If the number of convolution kernels is $p$, then $p$ feature vectors can be obtained
through feature mapping, and $T = [t_1, t_2, \ldots, t_p]$. If $q$ parallel convolution kernels of
different types are used, for example, $h_1 \times k, h_2 \times k, \ldots, h_q \times k$, and the
number of each type of convolution kernel is $p$, then $p \times q$ feature vectors can be obtained
after feature mapping, and $S = [t_1, t_2, \ldots, t_{p \times q}]$. Therefore, $S$ is the output from
the convolution layer. The output will be sent to the pooling layer of the CNN.
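As a rough illustration of Equations (11) and (12), the sketch below slides a single filter over a toy sentence matrix and applies ReLU, which yields n − h + 1 features. It is not the implementation used in this paper: the function name `conv_feature_map`, the toy matrix, and the filter values are assumptions made for the example, and h is treated here as the window length of the filter.

```python
def relu(v):
    # Rectified linear unit, as in Equation (12).
    return max(0.0, v)


def conv_feature_map(X, w, b):
    """Apply one h x k filter w to an n x k sentence matrix X.

    Returns the list [c_1, ..., c_{n-h+1}], where each c_i is the ReLU of
    the element-wise product of w with rows i..i+h-1 of X, plus the bias b.
    """
    n, h = len(X), len(w)
    features = []
    for i in range(n - h + 1):
        s = b
        for r in range(h):
            for col in range(len(w[r])):
                s += w[r][col] * X[i + r][col]
        features.append(relu(s))
    return features


# Hypothetical toy input: n = 4 words, embedding dimension k = 2.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.5]]
# Hypothetical filter with window length h = 2.
w = [[1.0, -1.0], [0.5, 0.5]]
C = conv_feature_map(X, w, b=0.0)
print(C)  # n - h + 1 = 3 features
```

With p filters for each of q window lengths, repeating this call once per filter would give the p × q feature vectors described above.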
Pooling layer. The role of the pooling layer is to screen out the optimal local features.
The pooling layer performs a max-pooling operation on all the characteristic graphs
obtained by the convolution layer, which is $\hat{C} = \max\{C\}$.
Fully connected layer. The output of the pooling layer is connected to the output
node of the last layer by full connection, and the tweet emotion is classified by the SoftMax
classifier. In the final implementation, the dropout technique is used on the fully connected
layer to prevent the hidden layer neurons from co-adapting and to reduce overfitting. The
output layer is a fully connected SoftMax layer with the dropout technique. The output
layer outputs the classification accuracy and loss of the method. The dropout rate
starts from 0.5 and is gradually decreased until the model performs best, which is at 0.2. The
number of training epochs is 8. The optimizer used for training is AdamOptimizer.
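The pooling and output stages can be sketched in the same spirit. The example below takes the maximum of each feature map and passes the pooled vector through a fully connected SoftMax layer. The feature maps, the weight matrix `W`, and the two-class setup are hypothetical values chosen for illustration, and dropout is omitted because it only applies during training.

```python
import math


def max_over_time(feature_map):
    # Max-pooling: keep the strongest local feature of one feature map.
    return max(feature_map)


def softmax(z):
    # Numerically stable SoftMax over a list of logits.
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical feature maps from three convolution filters.
feature_maps = [[1.5, 0.0, 0.0], [0.2, 0.9, 0.4], [0.0, 0.0, 0.3]]
pooled = [max_over_time(c) for c in feature_maps]

# Hypothetical fully connected weights: one row per class (positive, negative).
W = [[1.0, -0.5, 0.0], [-1.0, 0.5, 0.2]]
bias = [0.0, 0.0]
logits = [sum(wi * pi for wi, pi in zip(row, pooled)) + bj
          for row, bj in zip(W, bias)]
probs = softmax(logits)
print(pooled, probs)
```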

2.7. Recurrent Neural Network


The recurrent neural network (RNN) is a type of artificial neural network that can
model sequence data and process sequences of any length. It differs from the
convolutional neural network (CNN) in that the recurrent neural network can consider
the sequence characteristics in the text [43]. In traditional neural networks, nodes in the
same layer are not connected, which assumes that elements do not affect each other.
However, this assumption does not hold for sequential data such as text, so this network
structure is ill-suited to many problems. In the recurrent neural network, the nodes between
the hidden layers are connected. The input at the current time step and the output of the
hidden layer at the previous time step jointly determine the output at the current time step.
In other words, the recurrent neural network can remember previous information: when the
RNN processes the current input, it computes on the current text together with the
previously memorized information.
The RNN includes the input layer, the hidden layer, and the output layer, as displayed
in Figure 4 [44]. It can be clearly seen from the figure that the nodes of the hidden layer can
not only be self-connected but also interconnected.

Figure 4. Structure of RNN.

In the RNN, the hyperbolic tangent activation function tanh is used to determine the
output of the current network unit and transfer the current network unit state to the next
network unit:
$$\tanh(x) = 2\sigma(2x) - 1 \tag{13}$$
$$f(x) = \frac{1 - e^{-2x}}{1 + e^{-2x}} \tag{14}$$
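Equations (13) and (14) are two ways of writing the same function, which the short check below confirms numerically. It also sketches one recurrent step in which the current input and the previous hidden state jointly determine the new hidden state. The scalar weights `w_x`, `w_h`, and `b` and the input sequence are hypothetical, and the recurrence is the standard simple RNN form rather than a detail taken from this paper.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def tanh_via_sigmoid(x):
    # Equation (13): tanh(x) = 2 * sigma(2x) - 1.
    return 2.0 * sigmoid(2.0 * x) - 1.0


def tanh_via_exp(x):
    # Equation (14): (1 - e^{-2x}) / (1 + e^{-2x}).
    return (1.0 - math.exp(-2.0 * x)) / (1.0 + math.exp(-2.0 * x))


def rnn_step(x_t, h_prev, w_x, w_h, b):
    # The new hidden state depends on the current input and the previous state.
    return math.tanh(w_x * x_t + w_h * h_prev + b)


h = 0.0
states = []
for x_t in [0.5, -1.0, 2.0]:  # hypothetical scalar input sequence
    h = rnn_step(x_t, h, w_x=0.8, w_h=0.5, b=0.0)
    states.append(h)
print(states)
```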
2.8. Long Short-Term Memory
The backpropagation through time (BPTT) algorithm is used in the training of the
RNN. In the training process of this algorithm, the problems of gradient explosion and
gradient vanishing arise, which make the RNN unable to deal with long sequences. The
long short-term memory (LSTM) network is an improvement of recurrent neural networks
and solves the above problems [45]. LSTM solves the problem of gradient explosion
encountered in the RNN, and it accurately extracts long-distance dependent semantic
features when processing text information [46]. LSTM solves the problem of long-distance
dependence in text through the gating system in the network unit. LSTM contains three
such gating systems, namely the input gate, the output gate, and the forget gate. These
gating systems are realized by the sigmoid function:
$$\sigma(x): f(x) = \frac{1}{1 + e^{-x}} \tag{15}$$
The output of the sigmoid function is a value between 0 and 1. The closer the value is
to 1, the more the gate opens and the more information it retains. Conversely, the closer it
is to 0, the more information is forgotten. The single-cell unit of LSTM is composed of three
sigmoid functions, two tanh activation functions, and a series of operations.
The structure diagram of the LSTM cell is displayed in Figure 5 [47] below.

Figure 5. Structure diagram of LSTM cell.



In the LSTM neural network, the first step is to process the current input information
and the information transmitted from the previous state through the forget gate to
determine which information will be lost from the cell state. The second step is to use the
input gate to control the input of useful information into the current state and obtain the
latest state of the current state through the tanh activation function. The output gate
determines the state of the output and passes it to the next unit.
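The steps above can be sketched with a scalar LSTM cell that uses three sigmoid gates and two tanh activations. This follows the widely used LSTM formulation and is only an assumption about the cell in Figure 5. The parameter dictionary and its key names (`wf`, `uf`, `bf`, and so on) are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x_t, h_prev, c_prev, p):
    """One step of a scalar LSTM cell; p holds hypothetical weights."""
    # Forget gate: how much of the previous cell state to keep.
    f = sigmoid(p["wf"] * x_t + p["uf"] * h_prev + p["bf"])
    # Input gate and candidate state: how much new information to write.
    i = sigmoid(p["wi"] * x_t + p["ui"] * h_prev + p["bi"])
    g = math.tanh(p["wg"] * x_t + p["ug"] * h_prev + p["bg"])
    c = f * c_prev + i * g
    # Output gate: how much of the cell state to expose as the output.
    o = sigmoid(p["wo"] * x_t + p["uo"] * h_prev + p["bo"])
    h = o * math.tanh(c)
    return h, c


base = {"wf": 0.0, "uf": 0.0, "bf": 0.0,
        "wi": 0.0, "ui": 0.0, "bi": 0.0,
        "wg": 1.0, "ug": 0.0, "bg": 0.0,
        "wo": 0.0, "uo": 0.0, "bo": 0.0}

# Gates pushed towards 0 forget the old state and block new input.
closed = dict(base, bf=-50.0, bi=-50.0)
_, c_closed = lstm_step(1.0, 0.0, 5.0, closed)

# Forget gate pushed towards 1 retains the old cell state.
keep = dict(base, bf=50.0, bi=-50.0)
_, c_keep = lstm_step(1.0, 0.0, 5.0, keep)
print(c_closed, c_keep)
```

The two calls show the gate behavior described for the sigmoid function: a gate value near 0 discards the stored state, and a value near 1 retains it.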
3. Results
We introduced the convolutional neural network in deep learning to extract hidden
sentiment features from short-text data and find the best analysis method for the fusion of
short-text content and emoticons by comparing different classification methods.
Based on the convolutional neural network classification model, this paper analyzes
three classification models. The first classification model only considers short texts in the
tweet corpus, removes emoticons, divides each tweet in the experimental data into words,
and takes the sentence matrix connected by the corresponding word vector of all words
as the input of the convolutional neural network classification model to classify short
texts. The second classification model first converts emoticons in the short-text corpus into
named texts corresponding to emoticons. For example, it would convert the tweet:
“Getting everybody together for the start of the Christmas tour! 😎” into “Getting
everybody together for the start of the Christmas tour! Smiling face with sunglasses”. Then,
the transformed tweets are segmented into sentence matrices, which are trained and tested
by the classification model of the convolutional neural network. The third classification
model is to transform emoticons into emoticons vectors using the emoji vectorization
algorithm, and then connect the corresponding word vector and emoticons vector into the
sentence matrix according to the lexical order of the tweet corpus. Finally, the sentence
matrix is input into the convolutional neural network classification model for classification.
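The difference between the three inputs can be made concrete with a small preprocessing sketch. The one-entry lookup table `EMOJI_NAMES` and the three helper functions are hypothetical and stand in for whatever emoji inventory and tokenizer the experiments actually used.

```python
# Hypothetical emoji-to-name table with a single entry for the example tweet.
EMOJI_NAMES = {"😎": "Smiling face with sunglasses"}


def strip_emojis(tweet):
    # Model 1: keep only the text and drop every known emoji.
    for emoji in EMOJI_NAMES:
        tweet = tweet.replace(emoji, " ")
    return " ".join(tweet.split())


def emojis_to_names(tweet):
    # Model 2 (Emoji2Word): replace each emoji with its textual name.
    for emoji, name in EMOJI_NAMES.items():
        tweet = tweet.replace(emoji, " " + name + " ")
    return " ".join(tweet.split())


def tokens_with_emojis(tweet):
    # Model 3 (Emoji2Vec): keep each emoji as its own token, in tweet order,
    # so that it can later be mapped to an emoji vector.
    for emoji in EMOJI_NAMES:
        tweet = tweet.replace(emoji, " " + emoji + " ")
    return tweet.split()


tweet = "Getting everybody together for the start of the Christmas tour! 😎"
text_only = strip_emojis(tweet)
named = emojis_to_names(tweet)
tokens = tokens_with_emojis(tweet)
print(text_only)
print(named)
print(tokens)
```

In the third case, each word token would be looked up in the word embedding table and each emoji token in the emoji embedding table before the rows are stacked into the sentence matrix.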
We conducted comparative experiments on the previously established corpus of
tweet data and used naïve Bayes, LSTM, RNN, and SVM as the baseline methods. In
the experiments, the Python programming language and the TensorFlow platform were
utilized for implementation. The experimental results of the corpus with positive sentiment
value and negative sentiment value are shown in Tables 3 and 4.

Table 3. Experimental results with positive-sentiment-value corpus.

Analysis Method                 Identified Quantity    Accuracy
Naïve Bayes                     1447                   72.35%
CNN (Word2Vec)                  1632                   81.60%
CNN (Emoji2Word, Word2Vec)      1649                   82.45%
CNN (Emoji2Vec, Word2Vec)       1704                   85.15%
LSTM                            1660                   83.00%
RNN                             1651                   82.55%
SVM                             1558                   77.90%

Table 4. Experimental results with negative-sentiment-value corpus.

Analysis Method                 Identified Quantity    Accuracy
Naïve Bayes                     1351                   67.55%
CNN (Word2Vec)                  1473                   73.65%
CNN (Emoji2Word, Word2Vec)      1385                   69.25%
CNN (Emoji2Vec, Word2Vec)       1596                   79.80%
LSTM                            1479                   73.95%
RNN                             1470                   73.50%
SVM                             1402                   70.10%

According to the comparative analysis of experimental results, it can be concluded
that the models based on deep learning generally perform better in accuracy, because the
deep learning method can extract deep-seated data features in short texts such as tweets. In
the short-text data set containing rich emoticons, the best experimental model is the third
one, which converts emoticons and text into vectors and analyzes them. By comparing
the experimental outcomes of the analysis of positive emotions and negative emotions, we
can find that the accuracy of all models on the analysis of negative short-text emotions is
reduced, especially for the second model. When emoticons are converted into text
and then combined with short-text content analysis, the reduction is the largest, and
the instability is the highest. This is because many short texts with negative emoticons may
not express negative emotions but instead express other feelings such as surprise. In this
case, converting emoticons into words to analyze the emotional tendencies of the
text has the opposite effect. In general, converting emoticons and texts into vectors
achieves the highest accuracy in analyzing emotional value, which reflects that the emoticon
vectorization algorithm can play a significant role in the emotional analysis of short texts.

Novak et al. provide a mapping to positive, negative, and neutral occurrence informa-
tion for 751 emojis, also available on Kaggle [48]. Based on the dataset with 70,000 tweets
and 969 different emojis, we designed a contrast experiment. We first extract the tweets
from the dataset that contained the emojis in Table 1. A total of 37,810 tweets were selected
from the dataset. Then, we proceeded with the accuracy analysis experiment with different
methods based on the corpus with extracted data.
The evaluation indices of the experiment included precision (P), recall (R), and the F1
values of the positive and negative classes. For the overall performance, the overall accuracy
is used, and the calculation formula is:
$$\text{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN} \tag{16}$$
In this formula, TP, FP, TN, and FN represent the correctly classified positive tweets,
the misclassified positive tweets, the correctly classified negative tweets, and the
misclassified negative tweets, respectively.
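As a worked example of Equation (16), the sketch below computes accuracy together with per-class precision, recall, and F1 from the four counts. The precision, recall, and F1 formulas are the standard definitions and are not spelled out in this paper, and the counts are invented for illustration.

```python
def classification_metrics(tp, fp, tn, fn):
    """Accuracy (Equation (16)) plus standard per-class precision, recall, F1."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)

    pos_p = tp / (tp + fp)  # precision of the positive class
    pos_r = tp / (tp + fn)  # recall of the positive class
    neg_p = tn / (tn + fn)  # precision of the negative class
    neg_r = tn / (tn + fp)  # recall of the negative class

    def f1(p, r):
        return 2 * p * r / (p + r)

    return {
        "accuracy": accuracy,
        "positive": (pos_p, pos_r, f1(pos_p, pos_r)),
        "negative": (neg_p, neg_r, f1(neg_p, neg_r)),
    }


# Hypothetical counts, not taken from the experiments.
m = classification_metrics(tp=90, fp=10, tn=80, fn=20)
print(m)
```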
The experiment compared the proposed method CNN (Emoji2Vec, Word2Vec) with
traditional classification methods, including naïve Bayes, CNN (Word2Vec), CNN (Emoji2Word,
Word2Vec), LSTM, RNN, and SVM. The parameter setup is as follows. The parameters
c and g of the SVM are obtained by grid search. Naïve Bayes uses the default parameter
setup of sklearn. The RNN is the MV-RNN of the reference, with the same parameter setup.
The LSTM is the Tree-LSTM of the reference, with the same parameter setup. The
emotional classification results are displayed in Table 5.

Table 5. Comparison of emotional classification results.

Analysis Method               Positive P  Negative P  Positive R  Negative R  Positive F1  Negative F1  Accuracy
Naïve Bayes                   79.15%      82.58%      82.37%      78.66%      80.27%       79.36%       80.75%
CNN (Word2Vec)                86.38%      87.75%      87.20%      85.52%      87.05%       86.97%       86.91%
CNN (Emoji2Word, Word2Vec)    85.60%      89.95%      87.42%      85.10%      87.83%       86.65%       87.60%
CNN (Emoji2Vec, Word2Vec)     89.23%      91.60%      92.16%      88.33%      90.59%       89.70%       90.35%
LSTM                          86.55%      87.10%      87.05%      88.42%      86.98%       88.15%       87.20%
RNN                           84.92%      86.03%      87.13%      84.88%      86.04%       84.79%       85.33%
SVM                           83.78%      85.35%      85.41%      83.88%      84.79%       84.32%       84.66%

The experimental results indicate that CNN (Emoji2Vec, Word2Vec) has a better per-
formance with higher overall accuracy and two types of F1 values than the traditional
classification methods. The naïve Bayes algorithm has the lowest overall performance
indices. The experimental results show that the CNN (Emoji2Vec, Word2Vec) method is
effective for the emotion classification of short texts with emoticons such as emojis.

4. Discussion
Entropy refers to the degree of the chaos of a system. A system with a low degree of
chaos has low entropy, while a system with a high degree of chaos has high entropy [49].
In the absence of external interference, entropy increases automatically [50]. In information
theory, entropy is the average amount of information contained in each piece of information
received, which is also called information entropy [51]. In the information world, the higher
the entropy, the more information can be transmitted, and the lower the entropy, the less
information can be transmitted [52]. The booming development of social media such as
Twitter and TikTok with the increasingly wide range of short-text communication has
reduced the difficulty of information dissemination. At the same time, the extensive use
of emoticons such as emojis in short texts increases the amount of information covered in
short-text content and boosts the entropy value of short-text information, which makes the
prediction of short-text information content represented by sentiment characters difficult.
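To make the notion of information entropy concrete, the sketch below computes the Shannon entropy, $H = -\sum_i p_i \log_2 p_i$, of a sequence of symbols. This is the standard definition and is offered only as an illustration. The function name and the toy sequences are assumptions, not part of the method.

```python
import math
from collections import Counter


def shannon_entropy(symbols):
    """Average information per symbol, in bits, of an observed sequence."""
    counts = Counter(symbols)
    total = len(symbols)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# A sequence with one repeated symbol is fully predictable.
h_constant = shannon_entropy(["a", "a", "a", "a"])
# Four equally likely symbols need two bits each on average.
h_uniform = shannon_entropy(["a", "b", "c", "d"])
print(h_constant, h_uniform)
```

A larger symbol inventory with a more even distribution raises the entropy, which matches the remark that adding emojis to short texts makes their content harder to predict.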
The topic of this paper is to improve the accuracy of identifying sentiment features of
short texts with emoticons, such as emojis. Based on this research goal, we first established
a corpus containing rich emoticons and short texts and labeled their sentiment tendencies
manually, which is used as a data source for subsequent analysis. Second, we
screened the 819 emojis commonly used in social media and selected 40 emojis with positive
emotions and 40 emojis with negative emotions, respectively. Third, we built an algorithm
to convert emoticons into vector information and analyzed the emojis we selected. Fourth,
we combined the vector information transformed by emojis with the vector information
transformed by characters in the short-text content and analyzed the sentiment tendency of
the short-text content by using the CNN model. The results were compared with simple
text analysis and emoticon conversion, and the proposed method improved the accuracy
of identifying positive and negative emotional tendencies.
Existing analysis methods of short-text emotion tendency adopt analysis methods
such as combining with context. For example, Wan et al. proposed an ensemble sentiment
classification system of Twitter data [53]. Although they accurately analyzed the sentimental
characteristics of the content of the short text, they removed the punctuation, symbols,
emoticons, and all other non-alphabet characters from the short text and hence ignored
the important factor of emoticons in the emotional tendency of the short text. Some
studies have analyzed the emotional value of emoticons. For example, Mohammad et al.
designed an algorithm and method for sentiment analysis using texts and emoticons [54].
Matsumoto et al. developed an emotion estimation method based on emoticon image
features and distributed representations of sentences [55]. However, nowadays, the majority
of people use emojis to express their sentiment in short texts, and emojis dominate the use
of emoticons. The above-mentioned methods only focus on emoticon symbolic expression
tokens and text-based emoticons, which face difficulties in analyzing short texts with emojis.
Therefore, based on the existing text vectorization algorithm and emoticon vectorization
algorithm, this paper designs a sentiment classification method that integrates emoticons
and characters. This method has been proved to be effective. Compared with analyzing the
text content of short texts or emoticons of short texts, the method proposed in this paper
has higher sentiment character recognition accuracy.
Sentiment feature analysis can help organizations and enterprises collect and an-
alyze users’ attitudes towards their products or services in public opinion and help
improve them, which is an effective means to improve the efficiency of data analy-
sis. An increasing number of systems and methods are designed and applied for this.
Our method, which blends short texts with emoticons, still has room for improvement
in its accuracy in identifying negative sentiment tendencies. The analysis shows that
emoticons with negative emotions contain more complex emotional information than
emoticons with positive emotions. They can express not only negative emotions but
also feelings such as being moved, surprise, and even pleasure. Therefore,
our future research direction is to improve the efficiency and accuracy
of methods to identify texts with negative emotions.

5. Conclusions
Analyzing and understanding the sentimental characteristics of short texts is con-
ducive to obtaining the data information of the deep value of public opinion and helping
organizations and companies optimize their services and products. Airline services have
been improved and enhanced after implementing the ensemble sentiment classification
method [53]. Fast-moving consumer goods (FMCG) brands such as P&G and hotels such
as Marriott Corporation mine and analyze consumers’ sentiment opinions from short
texts in social media [56,57]. These implementations effectively improve the production
efficiency of enterprises and boost users’ satisfaction, which achieves a win–win situation.
We advocate further study in this research field.

In this paper, we propose a short-text sentiment analysis method that combines
emoticons such as emojis with short-text content. A great number of tweets containing
emojis are processed and analyzed to obtain the sentimental characteristics of short texts
such as tweets. This paper first classifies popular emoticons, converts emoticons together
with characters into vector representations, and analyzes them using the convolutional
neural network method. Experimental results show that the proposed method is more
accurate than the existing method.

Author Contributions: Conceptualization, H.Z. and K.X.; methodology, H.Z.; software, H.Z.; valida-
tion, K.X.; formal analysis, K.X.; investigation, K.X.; resources, H.Z.; data curation, H.Z.; writing—
original draft preparation, H.Z.; writing—review and editing, H.Z.; visualization, H.Z.; supervision,
K.X.; project administration, K.X.; funding acquisition, K.X. All authors have read and agreed to the
published version of the manuscript.
Funding: This research received no external funding.
Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Not applicable.
Data Availability Statement: Not applicable.
Conflicts of Interest: The authors declare no conflict of interest.

References
1. Tsou, M. Research challenges and opportunities in mapping social media and Big Data. Cartogr. Geogr. Inf. Sci. 2015, 42 (Suppl. 1),
70–74. [CrossRef]
2. Gupta, H.; Jamal, M.S.; Madisetty, S. A framework for real-time spam detection in Twitter. In Proceedings of the 2018 10th
International Conference on Communication Systems & Networks (COMSNETS), Bangalore, India, 3–7 January 2018; pp. 380–383.
3. Chatzakou, D.; Kourtellis, N.; Blackburn, J.; De Cristofaro, E.; Stringhini, G.; Vakali, A. Mean birds: Detecting aggression and
bullying on twitter. In Proceedings of the 2017 ACM on Web Science Conference, Troy, NY, USA, 25–28 June 2017; pp. 13–22.
4. Baym, N.K. Personal Connections in the Digital Age; John Wiley & Sons: Hoboken, NJ, USA, 2015.
5. Wang, X.; Liu, Y.; Sun, C.J.; Wang, B.; Wang, X. Predicting polarities of tweets by composing word embeddings with long
short-term memory. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th
International Joint Conference on Natural Language Processing, Beijing, China, 26–31 July 2015; Volume 1, pp. 1343–1353.
6. Na’aman, N.; Provenza, H.; Montoya, O. Varying linguistic purposes of emoji in (Twitter) context. In Proceedings of the ACL
2017, Student Research Workshop, Vancouver, BC, Canada, 30 July–4 August 2017; pp. 136–141.
7. Ji, X.; Chun, S.A.; Wei, Z.; Geller, J. Twitter sentiment classification for measuring public health concerns. Soc. Netw. Anal. Min.
2015, 5, 13. [CrossRef] [PubMed]
8. Venter, E. Bridging the communication gap between Generation Y and the Baby Boomer generation. Int. J. Adolesc. Youth 2017, 22,
497–507. [CrossRef]
9. Kejriwal, M.; Wang, Q.; Li, H.; Wang, L. An empirical study of emoji usage on Twitter in linguistic and national contexts. Online
Soc. Netw. Media 2021, 24, 100149. [CrossRef]
10. Highfield, T.; Leaver, T. Instagrammatics and digital methods: Studying visual social media, from selfies and GIFs to memes and
emoji. Commun. Res. Pract. 2016, 2, 47–62. [CrossRef]
11. Velten, J.C.; Arif, R. The influence of snapchat on interpersonal relationship development and human communication. J. Soc.
Media Soc. 2016, 5, 5–43.
12. Barberá, P.; Jost, J.T.; Nagler, J.; Tucker, J.A.; Bonneau, R. Tweeting from left to right: Is online political communication more than
an echo chamber? Psychol. Sci. 2015, 26, 1531–1542. [CrossRef] [PubMed]
13. Sailunaz, K.; Alhajj, R. Emotion and sentiment analysis from Twitter text. J. Comput. Sci. 2019, 36, 101003. [CrossRef]
14. Cai, L.; Zhu, Y. The challenges of data quality and data quality assessment in the big data era. Data Sci. J. 2015, 14. [CrossRef]
15. Zhao, J.; Gui, X.; Zhang, X. Deep convolution neural networks for twitter sentiment analysis. IEEE Access 2018, 6, 23253–23260.
16. Alharbi, A.S.M.; de Doncker, E. Twitter sentiment analysis with a deep neural network: An enhanced approach using user
behavioral information. Cogn. Syst. Res. 2019, 54, 50–61. [CrossRef]
17. Naseem, U.; Razzak, I.; Musial, K.; Imran, M. Transformer based deep intelligent contextual embedding for twitter sentiment
analysis. Future Gener. Comput. Syst. 2020, 113, 58–69. [CrossRef]
18. Barbieri, F.; Ronzano, F.; Saggion, H. What does this emoji mean? A vector space skip-gram model for twitter emojis. In
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016), Portorož, Slovenia, 23–28
May 2016; Calzolari, N., Choukri, K., Declerck, T., Eds.; European Language Resources Association (ELRA): Paris, France, 2016;
pp. 3967–3972.
19. Kimura, M.; Katsurai, M. Automatic construction of an emoji sentiment lexicon. In Proceedings of the 2017 IEEE/ACM
International Conference on Advances in Social Networks Analysis and Mining, Sydney, Australia, 31 July–3 August 2017;
pp. 1033–1036.
20. Arifiyanti, A.A.; Wahyuni, E.D. Emoji and emoticon in tweet sentiment classification. In Proceedings of the 2020 6th Information
Technology International Seminar (ITIS), Surabaya, Indonesia, 14–16 October 2020; pp. 145–150.
21. Helen, A.; Suryani, M.; Fakhri, H. Emotional context detection on conversation text with deep learning method using long
short-term memory and attention networks. In Proceedings of the 2021 9th International Conference on Information and
Communication Technology (ICoICT), Yogyakarta, Indonesia, 4–5 August 2021; pp. 674–678.
22. Zeroual, I.; Lakhouaja, A. Data science in light of natural language processing: An overview. Procedia Comput. Sci. 2018, 127, 82–91.
[CrossRef]
23. Afyouni, I.; Al Aghbari, Z.; Razack, R.A. Multi-feature, multi-modal, and multi-source social event detection: A comprehensive
survey. Inf. Fusion 2022, 79, 279–308. [CrossRef]
24. Piao, Z.; Park, S.M.; On, B.W.; Choi, G.S.; Park, M.S. Product reputation mining: Bring informative review summaries to producers
and consumers. Comput. Sci. Inf. Syst. 2019, 16, 359–380. [CrossRef]
25. Liang, J.; Tsou, C.H.; Poddar, A. A novel system for extractive clinical note summarization using EHR data. In Proceedings of the
2nd Clinical Natural Language Processing Workshop, Minneapolis, MN, USA, 7 June 2019; pp. 46–54.
26. Zhang, Y.; Shao, B.J. Influence of service-entry waiting on customer’s first impression and satisfaction: The moderating role of
opening remark and perceived in-service waiting. J. Serv. Theory Pract. 2019, 29, 565–591. [CrossRef]
27. Samuel, J.; Ali, G.G.; Rahman, M.; Esawi, E.; Samuel, Y. COVID-19 public sentiment insights and machine learning for tweets
classification. Information 2020, 11, 314. [CrossRef]
28. Zhang, Z.; Robinson, D.; Tepper, J. Detecting hate speech on twitter using a convolution-gru based deep neural network. In
European Semantic Web Conference; Springer: Cham, Switzerland, 2018; pp. 745–760.
29. Kim, Y.; Jun, J.W. Factors affecting sustainable purchase intentions of SNS emojis: Modeling the impact of self-presentation.
Sustainability 2020, 12, 8361. [CrossRef]
30. Shah, P.V.; Swaminarayan, P. Sentiment analysis—An evaluation of the sentiment of the people: A survey. In Data Science and
Intelligent Applications; Springer: Singapore, 2021; pp. 53–61.
31. Cappallo, S.; Svetlichnaya, S.; Garrigues, P.; Mensink, T.; Snoek, C.G. New modality: Emoji challenges in prediction, anticipation,
and retrieval. IEEE Trans. Multimed. 2018, 21, 402–415. [CrossRef]
32. Eisner, B.; Rocktäschel, T.; Augenstein, I.; Bošnjak, M.; Riedel, S. emoji2vec: Learning emoji representations from their description.
arXiv 2016, arXiv:1609.08359.
33. Dehghani, M.; Johnson, K.M.; Garten, J.; Boghrati, R.; Hoover, J.; Balasubramanian, V.; Parmar, N.J. TACIT: An open-source text
analysis, crawling, and interpretation tool. Behav. Res. Methods 2017, 49, 538–547. [CrossRef] [PubMed]
34. Jaeger, S.R.; Roigard, C.M.; Jin, D.; Vidal, L.; Ares, G. Valence, arousal and sentiment meanings of 33 facial emoji: Insights for the
use of emoji in consumer research. Food Res. Int. 2019, 119, 895–907. [CrossRef] [PubMed]
35. Wright, S.E. The creation and application of language industry standards. Perspect. Localization 2006, 241–278. [CrossRef]
36. Anderson, A.J.; Kiela, D.; Binder, J.R.; Fernandino, L.; Humphries, C.J.; Conant, L.L.; Lalor, E.C. Deep artificial neural networks
reveal a distributed cortical network encoding propositional sentence-level meaning. J. Neurosci. 2021, 41, 4100–4119. [CrossRef]
[PubMed]
37. Zheng, Q.; Tian, X.; Yang, M. The email author identification system based on support vector machine (SVM) and analytic
hierarchy process (AHP). IAENG Int. J. Comput. Sci. 2019, 46, 178–191.
38. Kwok, S.; Wai, H.; Sai, K.V.; Guanjin, W. Tweet topics and sentiments relating to COVID-19 vaccination among Australian Twitter
users: Machine learning analysis. J. Med. Internet Res. 2021, 23, e26953. [CrossRef] [PubMed]
39. Chen, S.; Lv, X.; Gou, J. Personalized recommendation model: An online comment sentiment-based analysis. Int. J. Comput.
Commun. Control 2020, 15. [CrossRef]
40. Liu, W.; Wang, Z.; Liu, X.; Zeng, N.; Liu, Y.; Alsaadi, F.E. A survey of deep neural network architectures and their applications.
Neurocomputing 2017, 234, 11–26. [CrossRef]
41. Li, C.; Ding, Z.; Zhao, D.; Yi, J.; Zhang, G. Building energy consumption prediction: An extreme deep learning approach. Energies
2017, 10, 1525. [CrossRef]
42. Hou, Z.; Ma, K.; Wang, Y.; Yu, J.; Ji, K.; Chen, Z.; Abraham, A. Attention-based learning of self-media data for marketing intention
detection. Eng. Appl. Artif. Intell. 2021, 98, 104118. [CrossRef]
43. Liang, G.; Hong, H.; Xie, W. Combining convolutional neural network with recursive neural network for blood cell image
classification. IEEE Access 2018, 6, 36188–36197. [CrossRef]
44. Gao, M.; Shi, G.; Li, S. Online prediction of ship behavior with automatic identification system sensor data using bidirectional
long short-term memory recurrent neural network. Sensors 2018, 18, 4211. [CrossRef] [PubMed]
45. Ribeiro, A.H.; Tiels, K.; Aguirre, L.A.; Schön, T. Beyond exploding and vanishing gradients: Analysing RNN training using
attractors and smoothness. Int. Conf. Artif. Intell. Stat. 2020, 108, 2370–2380.
46. Zhang, Y.; Zheng, J.; Jiang, Y.; Huang, G.; Chen, R. A text sentiment classification modeling method based on coordinated
CNN-LSTM-attention model. Chin. J. Electron. 2019, 28, 120–126. [CrossRef]
47. Hrnjica, B.; Bonacci, O. Lake level prediction using feed forward and recurrent neural networks. Water Resour. Manag. 2019, 33,
2471–2484. [CrossRef]
48. Kralj Novak, P.; Smailović, J.; Sluban, B.; Mozetič, I. Sentiment of emojis. PLoS ONE 2015, 10, e0144296. [CrossRef] [PubMed]
49. Neill, C.; Roushan, P.; Fang, M.; Chen, Y.; Kolodrubetz, M.; Chen, Z.; Martinis, J.M. Ergodic dynamics and thermalization in an
isolated quantum system. Nat. Phys. 2016, 12, 1037–1041. [CrossRef]
50. Eskov, V.M.; Eskov, V.V.; Vochmina, Y.V.; Gorbunov, D.V.; Ilyashenko, L.K. Shannon entropy in the research on stationary regimes
and the evolution of complexity. Mosc. Univ. Phys. Bull. 2017, 72, 309–317. [CrossRef]
51. Delgado-Bonal, A.; Marshak, A. Approximate entropy and sample entropy: A comprehensive tutorial. Entropy 2019, 21, 541.
[CrossRef] [PubMed]
52. Osamy, W.; Salim, A.; Khedr, A.M. An information entropy based-clustering algorithm for heterogeneous wireless sensor
networks. Wirel. Netw. 2020, 26, 1869–1886. [CrossRef]
53. Wan, Y.; Gao, Q. An ensemble sentiment classification system of Twitter data for airline services analysis. In Proceedings
of the 2015 IEEE International Conference on Data Mining Workshop (ICDMW), Atlantic City, NJ, USA, 14–17 November 2015;
pp. 1318–1325.
54. Ullah, M.A.; Marium, S.M.; Begum, S.A.; Dipa, N.S. An algorithm and method for sentiment analysis using the text and emoticon.
ICT Express 2020, 6, 357–360. [CrossRef]
55. Fujisawa, A.; Matsumoto, K.; Yoshida, M.; Kita, K. Emotion Estimation Method Based on Emoticon Image Features and
Distributed Representations of Sentences. Appl. Sci. 2022, 12, 1256. [CrossRef]
56. Yadav, M.L.; Dugar, A.; Baishya, K. Decoding Customer Opinion for Products or Brands Using Social Media Analytics: A Case
Study on Indian Brand Patanjali. Int. J. Intell. Inf. Technol. (IJIIT) 2022, 18, 1–20. [CrossRef]
57. Aydin, G.; Uray, N.; Silahtaroglu, G. How to Engage Consumers through Effective Social Media Use—Guidelines for Consumer
Goods Companies from an Emerging Market. J. Theor. Appl. Electron. Commer. Res. 2021, 16, 768–790. [CrossRef]
