Understanding Language Preference for Expression of Opinion and Sentiment: What do Hindi-English Speakers do on Twitter?

Koustav Rudra, IIT Kharagpur, India ([email protected])
Shruti Rijhwani*, Carnegie Mellon University, Pittsburgh, Pennsylvania ([email protected])
Rafiya Begum, Microsoft Research Labs, Bangalore, India ([email protected])
Kalika Bali, Microsoft Research Labs, Bangalore, India ([email protected])
Monojit Choudhury, Microsoft Research Labs, Bangalore, India ([email protected])
Niloy Ganguly, IIT Kharagpur, India ([email protected])

Abstract

Linguistic research on multilingual societies has indicated that there is usually a preferred language for the expression of emotion and sentiment (Dewaele, 2010). Paucity of data has limited such studies to participant interviews and speech transcriptions from small groups of speakers. In this paper, we report a study of 430,000 unique tweets from Indian users, specifically Hindi-English bilinguals, to understand the language of preference, if any, for expressing opinion and sentiment. To this end, we develop classifiers for opinion detection in these languages, further classifying opinionated tweets into positive, negative and neutral sentiments. Our study indicates that Hindi (i.e., the native language) is preferred over English for the expression of negative opinion and swearing. As an aside, we explore some common pragmatic functions of code-switching through sentiment detection.

1 Introduction

The pattern of language use in a multilingual society is a complex interplay of socio-linguistic, discursive and pragmatic factors. Sometimes speakers have a preference for a particular language in certain conversational and discourse settings; on other occasions, there is fluid alteration between two or more languages in a single conversation, also known as code-switching (CS) or code-mixing.1 Understanding and characterizing language preference in multilingual societies has been the subject matter of linguistic inquiry for over half a century (see Milroy and Muysken (1995) for an overview).

* This work was done when the author was a Research Fellow at Microsoft Research Lab India.
1 Although some linguists differentiate between code-switching and code-mixing, this paper uses the two terms interchangeably.

Conversational phenomena such as CS were observed only in speech and therefore, all previous studies are based on data collected from a small set of speakers or from interviews. With the growing popularity of social media, we now have an abundance of conversation-like data that exhibits CS and other speech phenomena hitherto unseen in text (Bali et al., 2014). Leveraging such data from Twitter, we conduct a large-scale study of language preference, if any, for the expression of opinion and sentiment by Hindi-English (Hi-En) bilinguals.
We first build a corpus of 430,000 unique India-specific tweets across four domains (sports, entertainment, politics and current events) and automatically classify the tweets by their language: English, Hindi and Hi-En CS. We then develop an opinion detector for each language class to categorize the tweets as opinionated or non-opinionated. Sentiment detectors further classify the opinionated tweets as positive, negative or neutral. Our study shows that there is a strong preference towards Hindi (i.e., the native language or L1) over English (L2) for the expression of negative opinion. The effect is clearly visible in CS tweets, where a switch from English to Hindi is often correlated with a switch from a positive to a negative sentiment. This is referred to as the polarity-switch function of CS (Sanchez, 1983). Using the same experimental technique, we also explore other pragmatic functions of CS, such as reinforcement and narrative-evaluative.
Apart from being the first large-scale quantitative study of language preference in multilingual
societies, this work also has several other contributions: (a) We develop one of the first opinion and sentiment classifiers for Romanized Hindi
and CS Hi-En tweets with higher accuracy than
the only known previous attempt (Sharma et al.,
2015b). (b) We present a novel methodology for automatically detecting pragmatic functions of code-switching through opinion and sentiment detection.
The rest of the paper is organized as follows:
Sec. 2 introduces language preference, functions of
CS and Hindi-English bilingualism on the web. Sec.
3 formulates the problem and presents the fundamental questions that this paper seeks to answer.
Sec. 4 and 5 discuss dataset creation and opinion and
sentiment detection techniques respectively. Sec. 6
evaluates the hypotheses in light of the observations
on the tweet corpus. We conclude in Sec. 7, and
raise some interesting sociolinguistic questions for
future studies.
2 Background and Related Work

In order to situate the questions addressed in our work within the existing literature, we present a brief overview of past research on the pragmatic and discursive analysis of code-switching and, specifically, on language preference for emotional expression. A primer on Hi-En bilingualism and its presence on social media follows.

2.1 CS Functions and Language Preference

In multilingual communities, where there is more than one linguistic channel for information exchange, the choice of channel depends on a variety of factors and is usually unpredictable (Auer, 1995). Nevertheless, linguistic studies point out certain frequently-observed patterns. For instance, certain speech activities might be exclusively or more commonly related to a certain language choice (e.g., Fishman (1971) reports the use of English for professional purposes and Spanish for informal chat among English-Spanish bilinguals from Puerto Rico). Apart from the association between such conversational contexts and language preference, language alteration is often used as a signaling device to imply certain pragmatic functions (Barredo, 1997; Sanchez, 1983; Nishimura, 1995; Maschler, 1991; Maschler, 1994), such as: (a) reported speech, (b) narrative to evaluative switch, (c) reiteration or emphasis, (d) topic shift, (e) puns and language play, and (f) topic/comment structuring. Attempts at predicting the preferred language, or even at exhaustively listing such functions, have failed. However, linguists agree that language alteration in multilingual communities is not a random process.

Of specific interest to us are studies on language preference for the expression of emotions. Through large-scale interviews and two decades of research, Dewaele (2004; 2010) argued that for most multilinguals, L1 (the dominant language, which is often, but not always, the native or mother tongue) is the language of preference for emotions, including emotional inner speech, swearing and even emotional conversations. Dewaele argues that emotionally charged words in L1 elicit stronger emotions than those in other languages, and hence L1 is preferred for the expression of emotion.

2.2 Hindi-English Bilingualism

Around 125 million people in India speak English, half of whom have Hindi as their mother tongue. A large proportion of the remaining half, especially those residing in the metropolitan cities, also know at least some Hindi. This makes Hi-En code-switching, commonly called Hinglish, extremely widespread in India. There is historical attestation, as well as recent studies on the growing use of Hinglish in general conversation and in entertainment and media (see Parshad et al. (2016) and references therein). Several recent studies (Bali et al., 2014; Barman et al., 2014; Solorio et al., 2014; Sequiera et al., 2015) also provide evidence of Hinglish and other instances of CS on online social media such as Twitter and Facebook. In a Facebook dataset analyzed by Bali et al. (2014), almost all sufficiently long conversation threads were found to be multilingual, and as much as 17% of the comments had CS. This study also indicates that on online social media, Hindi is seldom written in the Devanagari script. Instead, loose Roman transliteration, or Romanized Hindi, is common, especially when users code-switch between Hindi and English.

While there has been some effort towards computational processing of CS text (Solorio and Liu, 2008; Solorio and Liu, 2010; Vyas et al., 2014; Peng et al., 2014), to the best of our knowledge, there has been no study on automatic identification of the functional aspects of CS, nor any large-scale, data-driven study of language preference. The current study adds to the growing repertoire of work on the quantitative analysis of social media data for understanding socio-linguistic and pragmatic issues, such as the detection of depression (De Choudhury et al., 2013), politeness (Danescu-Niculescu-Mizil et al., 2013), speech acts (Vosoughi and Roy, 2016), and social status (Tchokni et al., 2014).
3 Problem Formulation

Along the lines of Dewaele (2010), we ask the following question: is there a preferred language for the expression of opinion and sentiment by Hi-En bilinguals on Twitter?

3.1 Definitions

More formally, let L = {h, e, m} be the set of languages: Hindi (h), English (e) and Mixed (m), i.e., code-switched. Let S = {d, r} be the set of scripts:2 Devanagari (d) and Roman (r). Let us further introduce a set of sentiments, Σ = {+, −, 0, ⊥}, where +, − and 0 respectively denote utterances with positive, negative and neutral opinions, and ⊥ denotes non-opinionated (e.g., factual) texts.

Let T = {t1, t2, ..., t|T|} be a set of tweets (or any text) generated by Hi-En bilinguals. We define λ(T), s(T) and σ(T) as the subsets of T that respectively contain all tweets in language λ, script s and sentiment σ. Further, λsσ(T) = λ(T) ∩ s(T) ∩ σ(T). Likewise, we also define λs(T) = λ(T) ∩ s(T), λσ(T) = λ(T) ∩ σ(T) and sσ(T) = s(T) ∩ σ(T).

The preference towards a language-script pair λs for expressing a type of sentiment σ is given by the probability

    pr(σ | λs; T) = pr(λs | σ; T) pr(σ | T) / pr(λs | T)    (1)

However, pr(λs), the prior probability of choosing λs for a tweet, depends on a large number of socio-linguistic parameters beyond sentiment. For instance, on social media, English is overwhelmingly more common than any Indic language (Bali et al., 2014). This is because (a) English tweets come from a large number of users apart from Hi-En bilinguals, and (b) English is the preferred language for tweeting even for Hi-En bilinguals, since it expands the target audience of the tweet manifold. The preference of λs for expressing σ, therefore, can be quantified as:

    pr(σ | λs; T) = |λsσ(T)| / |λs(T)|    (2)

We say λs is the preferred language-script choice over λ′s′ for expressing sentiment σ if and only if

    pr(σ | λs; T) > pr(σ | λ′s′; T)    (3)

The strength of the preference is directly proportional to the ratio of the probabilities: pr(σ | λs; T) / pr(σ | λ′s′; T). An alternative but related way of characterizing the preference is through comparing the odds of choosing a sentiment type σ over its polar opposite σ′. We say λs is the preferred language-script pair for expressing σ if

    pr(σ | λs; T) / pr(σ′ | λs; T) > pr(σ | λ′s′; T) / pr(σ′ | λ′s′; T)    (4)

2 Tweets in mixed script are rare and hence we do not include a symbol for them, though the framework does not preclude such possibilities.
3.2 Hypotheses

Now we can formally define the two hypotheses we intend to test here.

Hypothesis I: For Hi-En bilinguals, Hindi is the preferred language for the expression of opinion on Twitter. Therefore, we expect

    pr({+, −, 0} | hd; T) > pr({+, −, 0} | er; T)    (5)

    i.e., pr(⊥ | hd; T) < pr(⊥ | er; T)    (6)

And similarly,

    pr(⊥ | hr; T) < pr(⊥ | er; T)    (7)

Hypothesis II: For Hi-En bilinguals, Hindi is the preferred language for the expression of negative sentiment. Therefore,

    pr(− | hd; T) ≈ pr(− | hr; T) > pr(− | er; T)    (8)

In particular, we hypothesize that the odds of choosing Hindi for negative over positive sentiment are very high compared to the corresponding odds for English, i.e.,

    pr(− | hd; T) / pr(+ | hd; T) ≈ pr(− | hr; T) / pr(+ | hr; T) > pr(− | er; T) / pr(+ | er; T)    (9)
A special case of the above hypotheses arises in the context of code-mixing, i.e., for the set mr(T). Since the mixed tweets certainly come from proficient bilinguals and have both Hi and En fragments, we can reformulate our hypotheses at the tweet level. Let m^h r(T) and m^e r(T) respectively denote the sets of Hi and En fragments in mr(T).

Hypothesis Ia: Hindi is the preferred language for the expression of opinion in Hi-En code-mixed tweets. Therefore, we expect

    pr(⊥ | m^h r; T) < pr(⊥ | m^e r; T)    (10)

Hypothesis IIa: Hindi is the preferred language for the expression of negative sentiment in Hi-En code-switched tweets. Therefore,

    pr(− | m^h r; T) > pr(− | m^e r; T)    (11)

    pr(− | m^h r; T) / pr(+ | m^h r; T) > pr(− | m^e r; T) / pr(+ | m^e r; T)    (12)

Likewise, the above hypotheses also apply to the Devanagari script, though for technical reasons we do not test them here.
Besides comparing aggregate statistics on mr(T), it is also interesting to look at the sentiments of m^h r(ti) and m^e r(ti) for each tweet ti. In particular, for every pair σ ≠ σ′, we want to study the fraction of tweets in mr(T) where m^h r(ti) has sentiment σ and m^e r(ti) has σ′. Let this fraction be pr(hσ → eσ′; mr(T)). Under the no-preference-for-language (i.e., the null) hypothesis, we would expect pr(hσ → eσ′; mr(T)) ≈ pr(hσ′ → eσ; mr(T)). However, if pr(hσ → eσ′; mr(T)) is significantly higher than pr(hσ′ → eσ; mr(T)), it means that speakers prefer to switch from English to Hindi when they want to express σ, and vice versa.
Pragmatic Functions of Code-Switching: When native speakers tend to switch from Hindi to English when they switch from an expression with sentiment σ to one with σ′, or in other words hσ → eσ′, we say this is an observed pragmatic function of code-switching between Hindi and English (note that the order of the languages is important), if and only if

    pr(hσ → eσ′; mr(T)) / pr(hσ′ → eσ; mr(T)) > 1    (13)
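Eq. 13 can likewise be estimated by counting ordered sentiment pairs over the Hindi and English fragments of each code-switched tweet. Below is a sketch, assuming each tweet in mr(T) has been reduced to a (Hindi-fragment sentiment, English-fragment sentiment) pair; the helper and the toy labels are illustrative, not our exact implementation.

    # Sketch of Eq. 13: pr(h-sigma -> e-sigma') / pr(h-sigma' -> e-sigma).
    # `pairs` holds (hindi_sentiment, english_sentiment) per tweet in mr(T).

    def switching_ratio(pairs, sigma, sigma_prime):
        fwd = sum(1 for h, e in pairs if h == sigma and e == sigma_prime)
        bwd = sum(1 for h, e in pairs if h == sigma_prime and e == sigma)
        return fwd / bwd if bwd else float("inf")

    pairs = [("-", "+"), ("-", "+"), ("+", "-"), ("-", "0"), ("0", "0")]
    # A ratio > 1 for sigma = "-" and sigma' = "+" indicates a preference for
    # expressing the negative part in Hindi and the positive part in English.
    print(switching_ratio(pairs, "-", "+"))  # 2.0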
3.3 A Note on Statistical Significance
All the statistics defined here are likelihoods; Equations 9, 12 and 13, in particular, state our hypotheses in the form of the likelihood ratio test. However, the true classes λ and σ are unknown; we predict the class labels using automatic language and sentiment detection techniques that have non-negligible errors. Under such circumstances, the likelihoods cannot be considered true test statistics, and consequently, hypothesis testing cannot be done per se. Nevertheless, we can use these as descriptive statistics and investigate the status of the aforementioned hypotheses.
4 Datasets

We collected tweets with certain India-specific hashtags (Table 1) using the Twitter Search API (Twi, 2015b) over three months (December 2014 to February 2015). In this paper, we use tweets in Devanagari-script Hindi (hd), and Roman-script English (er), Hindi (hr) and Hi-En Mixed (mr). English and mixed tweets written in Devanagari are extremely rare (Bali et al., 2014) and we do not study them here. We filter out tweets labeled by the Twitter API (Twi, 2015a) as German, Spanish, French, Portuguese, Turkish, or any non-Roman-script language (except Hindi).

Topic (# tweets): Hashtags
Sports (188K): #IndvsPak, #IndvsUae, #IndvsSa
Movies (82K): #MSG3successfulweeks, #MSGincinemas, #BlockbusterMSG, #Shamitabh, #PK
Politics (92K): #DelhiDecides, #RahulOnLeave, #AAPStorm, #AAPSweep
Current Events (68K): #RailBudget2015, #Beefban, #LandAcquisitionBill, #UnionBudget2015

Table 1: Hashtags used and number of tweets collected.

We experiment on the following corpora:

TAll: All tweets after filtering. This corpus contains 430,000 unique tweets posted by 125,396 unique users.
TBL: Tweets from users who are certainly Hi-En bilinguals; these are approximately 55% (240,000) of the tweets in TAll. We define a user to be a Hi-En bilingual if there is at least one mr tweet from the user, or if the user has tweeted at least once in Hindi (hd or hr) and once in English (er) (see the sketch following these definitions).

Tspo, Tmov, Tpol, Teve: Topic-wise corpora for sports, movies, politics and events (Table 1).

TCS: Tweets with inter-sentential CS. We define these as tweets containing at least one sequence of 5 contiguous Hindi words and one sequence of 5 contiguous English words. The corpus has 3,357 tweets.
SAC: 1,000 monolingual tweets (er, hr, hd) and 260 mixed (mr) tweets manually annotated with sentiment and opinion labels. These were annotated by two linguists, both fluent Hi-En speakers. The annotators first checked whether a tweet is opinionated or not (⊥), and then identified the polarity of the opinionated tweets (+, − or 0). Thus, the tweets are classified into the four classes in the set Σ. If a tweet contains both opinion and ⊥, each fragment was individually annotated. The inter-annotator agreement is 77.5% (κ = 0.59) for opinion annotation and 68.4% (κ = 0.64) over all four classes. A third linguist independently corrected the disagreements.

LLCTest: 141 er, 137 hr, and 241 mr tweets annotated by a Hi-En bilingual; these form the test set for the Language Labeling system (Sec. 5.1).

SAC and LLCTest can be downloaded and used for research purposes.3

3 http://www.cnergres.iitkgp.ac.in/codemixing
Note that apart from SAC and LLCTest, all corpora are subsets of TAll. For the generalizability of our observations, it is important to ensure that the tweets in TAll come from a large number of users and that the datasets do not over-represent a small set of users. In Figure 1, we plot the minimum fraction of users required (x-axis) to cover a certain percentage of the tweets in TAll (y-axis). Tweets from at least 10% of the users, i.e., 12.5K users, are needed to cover 50% of the corpus. As expected, we observe a power-law-like distribution, where a few users contribute a large number of tweets, and a large number of users contribute a few tweets each. We believe that 12.5K users are sufficient to ensure an unbiased study.
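The curve in Figure 1 is a simple cumulative-coverage computation over per-user tweet counts; a sketch (the "user" field is an illustrative placeholder):

    from collections import Counter

    def min_user_fraction(tweets, coverage=0.5):
        """Smallest fraction of the most prolific users whose tweets cover
        the requested share of the corpus, as plotted in Figure 1."""
        counts = sorted(Counter(t["user"] for t in tweets).values(), reverse=True)
        target, covered = coverage * len(tweets), 0
        for i, c in enumerate(counts, start=1):
            covered += c
            if covered >= target:
                return i / len(counts)
        return 1.0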
Figure 1: Distribution of cumulative % of tweets and # of users (sorted in descending order by number of tweets).

Further, we classify the users into three specific groups: (i) news channels, (ii) general users (having at most 10,000 followers), and (iii) popular users or celebrities (having > 10,000 followers). Interestingly, for both the TAll and TBL corpora, we observe that around 98% of all users are general users, and 96% of all tweets come from such users. Hence, most observations from these corpora are expected to be representative of the average online linguistic behavior of a Hi-En bilingual.
5 Method
Fig. 2 diagrammatically summarizes our experimental method. We identify the language used in each
tweet before detecting opinion and sentiment.
5.1 Language Labeling

Tweets in Devanagari script are accurately detected by the Twitter API as Hindi tweets; we label these as hd, though a small fraction of them could also be md. To classify Roman-script tweets as er, hr or mr, we use the system that performed best in the FIRE 2013 shared task for word-level language detection of Hi-En text (Gella et al., 2013). This system uses character n-gram features with a Maximum Entropy model to label each input word with a language label (either English or Hindi). We make minor modifications to the system to improve its performance on Twitter data, which are omitted here due to paucity of space.
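As a rough illustration of this design (not the shared-task system itself), character n-grams plus a Maximum Entropy, i.e., logistic regression, classifier can be assembled in a few lines of scikit-learn; the training words below are toy placeholders for the FIRE 2013 data:

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    # Toy word lists standing in for the shared-task training data.
    words  = ["nahi", "acha", "yaar", "kyun", "match", "great", "today", "win"]
    labels = ["hi",   "hi",   "hi",   "hi",   "en",    "en",    "en",    "en"]

    lid = make_pipeline(
        CountVectorizer(analyzer="char_wb", ngram_range=(1, 4)),  # char n-grams
        LogisticRegression(max_iter=1000),                        # MaxEnt model
    )
    lid.fit(words, labels)
    print(lid.predict(["raha", "playing"]))  # per-word hi/en labels

A tweet-level er, hr or mr label can then be derived from the word tags (e.g., all English, all Hindi, or mixed).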
5.2 Opinion and Sentiment Detection

Most of the existing research in opinion detection (Qadir, 2009; Brun, 2012; Rajkumar et al., 2014) and sentiment analysis (Mohammad, 2012; Mohammad et al., 2013; Mittal et al., 2013; Rosenthal et al., 2015) focuses on monolingual tweets and sentences. Recently, there have been a couple of studies on sentiment detection in code-switched tweets (Vilares et al., 2015; Sharma et al., 2015b). Sharma et al. (2015b) use Hindi SentiWordNet and normalization techniques to detect sentiment in Hi-En CS tweets.

We propose a two-step classification model. We first identify whether a tweet is opinionated or non-opinionated (⊥). If the tweet is opinionated, we further classify it according to its sentiment (+, − or 0). Fig. 2 shows the architecture of the proposed model. Two-step classification was empirically found to be better than a single four-class classifier.

We develop individual classifiers for each language class (er, hr, hd, mr) using an SVM with RBF kernel from Scikit-learn (Pedregosa et al., 2011). We use the SAC dataset (Sec. 4) as training data and the features described in Sec. 5.3.

Figure 2: Overview of the experimental method.
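A minimal sketch of this two-step design follows, with feature extraction abstracted away; the array inputs and the "non-op" label are our placeholders, not the exact implementation:

    import numpy as np
    from sklearn.svm import SVC

    def train_two_step(X, is_opinion, sentiment):
        """X: feature matrix; is_opinion: 1/0 array per tweet; sentiment:
        '+'/'-'/'0' array for opinionated tweets (ignored for the rest)."""
        opinion_clf = SVC(kernel="rbf").fit(X, is_opinion)
        mask = is_opinion == 1
        sentiment_clf = SVC(kernel="rbf").fit(X[mask], sentiment[mask])
        return opinion_clf, sentiment_clf

    def predict_two_step(opinion_clf, sentiment_clf, X):
        labels = np.full(len(X), "non-op", dtype=object)  # default: not opinionated
        mask = opinion_clf.predict(X) == 1
        if mask.any():
            labels[mask] = sentiment_clf.predict(X[mask])
        return labels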
5.3 Classifier Features

For opinion classification (opinion or ⊥), we propose a set of event-independent lexical features and Twitter-specific features. (i) Subjective words: expected to be present in opinionated tweets; we use lexicons from Volkova et al. (2013) for er and Bakliwal et al. (2012) for hd, and Romanize the hd lexicon for the hr classifier. (ii) Elongated words: words with one character repeated more than two times, e.g., sooo, naaahhhhi. (iii) Exclamations: presence of contiguous exclamation marks. (iv) Emoticons.4 (v) Question marks: queries are generally non-opinionated. (vi) Wh-words: these are used to form questions. (vii) Modal verbs: e.g., should, could, would, cud, shud. (viii) Excess hashtags: presence of more than two hashtags. (ix) Intensifiers: generally used to emphasize sentiment, e.g., "we shouldn't get too comfortable". (x) Swear words:5 prevalent in opinionated tweets, e.g., "that was a f***ing no ball!!!! #indvssa". (xi) Hashtags: hashtags might convey user sentiment (Barbosa et al., 2012); we manually identify hashtags in our corpus that represent explicit opinion. (xii) Domain lexicon: for hr and hd tweets, we construct sentiment lexicons from 1,000 manually annotated tweets; each word or phrase in this lexicon represents +, − or 0 sentiment. (xiii) Twitter user mentions. (xiv) Pronouns: opinion is often expressed in the first person using pronouns like I and we.

For sentiment classification, we use emoticons, swear words, exclamation marks and elongated words as described above. We also use subjective words from various lexicons (Mohammad and Turney, 2013; Volkova et al., 2013; Bakliwal et al., 2012; Sharma et al., 2015a). Additionally, we use (i) Sentiment words: from the Hashtag Sentiment and Sentiment140 lexicons (Mohammad et al., 2013); we also manually annotate hashtags from our dataset that represent sentiment. (ii) Negation: a negated context is a tweet segment that begins with a negation word and ends with a punctuation mark (Pang et al., 2002). The list of negation words is taken from Christopher Potts' sentiment tutorial.6

4 The list of emoticons was extracted from Wikipedia.
5 Swear word lexicons from noswearing.com and youswear.com.
6 http://sentiment.christopherpotts.net/lingstruc.html
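Several of these surface features reduce to simple pattern checks. The regexes below are our illustrative approximations of the feature descriptions, not the exact rules used:

    import re

    def elongated(text):        # (ii) a character repeated more than two times
        return bool(re.search(r"(\w)\1{2,}", text))

    def exclamations(text):     # (iii) contiguous exclamation marks
        return bool(re.search(r"!{2,}", text))

    def excess_hashtags(text):  # (viii) more than two hashtags
        return len(re.findall(r"#\w+", text)) > 2

    def negated_contexts(text, negations=("not", "no", "never", "nahi")):
        """(Negation) segments running from a negation word to the next
        punctuation mark, following Pang et al. (2002)."""
        pattern = r"\b(?:%s)\b[^.,!?;]*" % "|".join(negations)
        return re.findall(pattern, text.lower())

    print(elongated("sooo good"), excess_hashtags("#a #b #c great win"))  # True True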
Classifier   Opinion   Sentiment
er           72.6      64.4
hd           72.0      61.5
hr           79.9      62.7
mr           73.5      63.4

Table 2: Accuracy of the opinion and sentiment classifiers. All values are in %.

Ablated Feature(s)   er     hr     hd
NONE                 72.6   79.9   72.0
mention              70.1   79.3   70.8
lexicon              68.1   75.9   66.6
subjective           69.7   79.8   70.3
wh-words             71.0   79.3   70.1
modal verb           71.1   79.3   71.3
intensifier          71.3   76.6   69.6
slang                70.0   79.2   70.6
pronoun              71.6   79.7   70.3
domain lex.          N.A.   77.0   67.7
non-word             67.7   75.6   68.9

Table 3: Feature ablation experiments for the opinion classifiers. NONE represents the case when all features were used. The two smallest values (pertaining to the two most effective features) are shown in bold.
The mr opinion classifier uses the output from the
er and hr classifiers as features (Fig. 2), along with
an additional feature that represents whether the majority of the words in the tweet are Hindi or not. A
similar strategy is used for mr sentiment detection.
5.4 Evaluation

We evaluated the language labeling system on the LLCTest corpus, on which the precision (recall) values were 0.93 (0.91), 0.90 (0.85) and 0.88 (0.92) for the er, hr and mr classes respectively. The tweet-level classification accuracy was 89.8%.

The opinion and sentiment classifiers were evaluated using 10-fold cross-validation on the SAC dataset. Table 2 details the class-wise accuracy. For comparison, we also reimplemented the dictionary- and dependency-based method of Qadir (2009). The accuracy of this opinion classifier on the er tweets was found to be 65.7%, 7% lower than our system. We also compared our mr sentiment classifier with that of Sharma et al. (2015b). As their method performs two-class sentiment detection (+ and −), we select such tweets from SAC. Their system achieves an accuracy of 68.2%, which is 4% lower than the accuracy of our system.

An analysis of the errors showed more false negatives (i.e., opinions labeled ⊥) than false positives in opinion classification. Sentiment misclassifications are uniformly distributed.

Table 3 reports the accuracy of the opinion classifier in feature ablation experiments. For all three language-script pairs, the lexicon and non-word (emoticons, elongated words, hashtags, exclamations) features are the most effective, though all features make some positive contribution to the final accuracy of opinion detection. For hr and hd tweets, domain knowledge is significant, as shown by the roughly 4% accuracy drop when the domain lexicon is removed.
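The ablation protocol itself is straightforward: drop one feature group at a time and re-measure 10-fold cross-validated accuracy. A sketch, where feature_groups maps a (hypothetical) group name to its column indices in the feature matrix:

    from sklearn.model_selection import cross_val_score
    from sklearn.svm import SVC

    def ablation_table(X, y, feature_groups):
        """Accuracy with all features (NONE row) and with each group removed."""
        scores = {"NONE": cross_val_score(SVC(kernel="rbf"), X, y, cv=10).mean()}
        for name, cols in feature_groups.items():
            keep = [j for j in range(X.shape[1]) if j not in set(cols)]
            scores[name] = cross_val_score(SVC(kernel="rbf"), X[:, keep], y, cv=10).mean()
        return scores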
Corpus   |er(T)|/|T|   |hd(T)|/|T|   |hr(T)|/|T|   |mr(T)|/|T|
TBL      0.65          0.12          0.08          0.15
TAll     0.79          0.08          0.05          0.08
Tpol     0.76          0.13          0.05          0.06
Tmov     0.70          0.04          0.09          0.17

Table 4: Distribution of tweets across language-script (λs) classes in the different corpora.
6 Experiments and Observations

In this section, we report our experiments on the 430,000 unique tweets (TAll) and their various subsets as defined in Sec. 4. First, we run the language
detection system on the corpora. Table 4 shows the
language-wise distribution. We see that language
preference varies by topic, which is not surprising.
Due to paucity of space, the correlation between language usage and topic will not be discussed at length
here, but we will highlight cases where the differences are striking.
We apply the language-specific opinion and sentiment classifiers to tweets detected as the corresponding language class. In the following subsections, we
empirically investigate the hypotheses.
6.1 Status of Hypotheses I and II
Table 5 shows pr(⊥|λs; T), pr(−|λs; T) and pr(−|λs; T)/pr(+|λs; T) for TAll, TBL and two randomly selected topics, movies and politics. The statistics are fairly consistent across the corpora, with slight differences but similar trends in Tmov.
Statistic                      TBL    TAll   Tpol   Tmov

pr(⊥|λs; T)
  er                           0.34   0.35   0.37   0.29
  hd                           0.45   0.47   0.48   0.49
  hr                           0.38   0.39   0.37   0.49

pr(−|λs; T)
  er                           0.16   0.17   0.22   0.07
  hd                           0.18   0.17   0.19   0.16
  hr                           0.24   0.25   0.27   0.13

pr(−|λs; T) / pr(+|λs; T)
  er                           0.35   0.38   0.59   0.11
  hd                           3.00   3.27   5.67   1.90
  hr                           1.46   1.60   1.96   0.55

Table 5: Sentiment across languages: statistics concerning Hypotheses I and II.

Statistic                      m^h r   m^e r
pr(⊥|λ; TCS)                   0.39    0.45
pr(−|λ; TCS)                   0.22    0.14
pr(−|λ; TCS) / pr(+|λ; TCS)    2.2     0.34

Table 6: TCS statistics for testing Hypotheses Ia and IIa.
We need the first statistic in order to investigate Hypothesis I (Eqs. 6 and 7), and the latter two to verify Hypothesis II (Eqs. 8 and 9).

Contrary to Eqs. 6 and 7, for all corpora except Tmov, we observe the following trend:

    pr(⊥|hd; T) > pr(⊥|hr; T) ≈ pr(⊥|er; T)

In other words, hd is more commonly preferred for expressing non-opinions than hr and er. Hypothesis I is clearly untrue for these corpora, though due to the small differences between hr and er, we cannot claim that English is the preferred language for expressing opinions. A closer scrutiny of the corpora revealed that hd tweets mostly come from official sources (news channels, political parties, production houses) and celebrities, and are mostly factual. hr tweets are from general users and show trends similar to English. Thus, in general, there seems to be no preferred language for expressing opinion among Hi-En bilinguals on Twitter.

In the context of Hypothesis II, we see the general pattern (with some topic-specific variations):

    pr(−|hr; T) > pr(−|hd; T) ≈ pr(−|er; T)

The pattern emerges even more strongly when we look at pr(−|λs; T)/pr(+|λs; T). The odds of expressing a negative over a positive opinion in Hindi are between 1.5 and 6 (Tmov exhibits a slightly different pattern but a similar preference; Tpol shows a stronger preference towards Hindi for negative sentiment), whereas the same odds for English are between 0.1 and 0.6. In other words, English is more preferred
for expressing positive opinion, and Hindi for negative opinion. These observations provide very strong
evidence in favor of Hypothesis II.
6.2 Status of Hypotheses Ia and IIa
Recall that Hypotheses Ia and IIa are essentially the same as Hypotheses I and II, but applied to the m^h r and m^e r fragments of the TCS corpus. Table 6 reports the three statistics necessary for testing these hypotheses. pr(⊥|m^e r; TCS) is slightly greater than pr(⊥|m^h r; TCS), which is what we would expect if Hypothesis Ia were true. However, since the difference is small, we view it as a trend rather than a proof of Hypothesis Ia.

The statistics clearly show that Hypothesis IIa holds true for TCS. The fraction of negative sentiment in m^h r is over 1.5 times higher than that in m^e r. Further, the odds of expressing a negative over a positive sentiment in the Hindi fragments of a code-switched tweet are 6.5 times higher than the same odds for the English fragments.
6.3 Switching Functions

Recall that using Eq. 13 (Sec. 3), we can estimate the preference, if any, for switching to a particular language while changing the sentiment. In particular, research in socio-linguistics has shown that users often switch between languages when they switch from non-opinion (⊥) to opinion ({+, −, 0}). This is called the Narrative-Evaluative function of CS (Sanchez, 1983). This function appears in 46.1% of the tweets in TCS. We find that

    pr(h{+, −, 0} → e⊥; TCS) / pr(h⊥ → e{+, −, 0}; TCS) = 0.86

which indicates that there is no preference for switching to Hindi (or English) while switching between opinion and non-opinion. This also confirms what we observed above in the context of Hypotheses I and Ia. While switching between opinion and non-opinion in a tweet, users do switch language; however, we observe no particular preference for the language chosen for each part.
We also report two other pragmatic functions:

    pr(h− → e{+, 0, ⊥}; TCS) / pr(h{+, 0, ⊥} → e−; TCS) = 1.98

    pr(h− → e+; TCS) / pr(h+ → e−; TCS) = 10.27

The latter function is called polarity switch. The high values of these ratios, especially the latter, are evidence of a strong preference towards switching language from English to Hindi while switching to negative sentiment (and switching to English when the sentiment changes from negative to positive).

We also observe cases where there is a language switch but no sentiment switch; here, we cannot evaluate language preference using Eq. 13 (because σ = σ′). In TCS, 15.3% of the tweets show Positive Reinforcement, where both fragments have positive sentiment. Negative Reinforcement is defined similarly and is seen in 8.7% of the tweets. Other tweets in TCS likely have pragmatic functions that cannot be identified based on sentiment.
6.4 Language Preference for Swearing

Since there is evidence that the native language (Hindi, in this case) is preferred for swearing (Dewaele, 2004), we computed the fraction of tweets that contain swear words in each language class. Fig. 3a shows the distribution across topics. The hr and mr classes have a much higher fraction of abusive tweets than er and hd. Fig. 3b shows the distribution of abusive m^h r and m^e r fragments for tweets in TCS. Interestingly, over 90% of the swear words occur in m^h r. Both distributions strongly suggest a preference for swearing in Hindi.

Figure 3: Distribution of swear words by language: (a) percentage of abusive tweets per language class (er, hd, hr, mr) across the Sports, Movies, Politics and Events topics; (b) swearing preference in TCS.

7 Conclusion
In this paper, through a large-scale empirical study of nearly half a million tweets, we tried to answer a fundamental question regarding multilingualism, namely, whether there is a preferred language for the expression of sentiment. We also looked at some of the pragmatic functions of code-switching. Our results indicate a strong preference for using Hindi (L1 for the users from whom these tweets come) for expressing negative sentiment, including swearing. However,
we do not observe any particular preference towards
Hindi for expressing opinions.
Previous linguistic studies (Dewaele, 2004; Dewaele, 2010) have already shown a preference for L1 for expressing emotion and swearing. However, we observe that for expressing positive emotion, English (which would be L2) is the language of preference. This raises some intriguing socio-linguistic questions. Is it that English, being the language of aspiration in India, is preferred for positive expression? Or is it that Hindi is specifically preferred for swearing, and is therefore the language of preference for negative emotion? How do such preferences vary across topics, users and other multilingual communities? How representative of the society is this kind of social media study? We plan to explore some of these questions in the future.
Our study also indicates that inferences drawn about multilingual societies by analyzing data in just one language (usually English), which has been the norm so far, are likely to be incorrect.
Acknowledgement
Koustav Rudra was supported by a fellowship from
Tata Consultancy Services.
References
Peter Auer. 1995. The pragmatics of code-switching:
a sequential approach. In Lesley Milroy and Pieter
Muysken, editors, One speaker, two languages, pages
115–135. Cambridge University Press.
Akshat Bakliwal, Piyush Arora, and Vasudeva Varma.
2012. Hindi subjective lexicon : A lexical resource for
hindi polarity classification. In Proc. LREC, Austin,
Texas, USA, May.
Kalika Bali, Yogarshi Vyas, Jatin Sharma, and Monojit Choudhury. 2014. "I am borrowing ya mixing?" An analysis of English-Hindi code mixing in Facebook. In Proc. First Workshop on Computational Approaches to Code Switching, EMNLP.
Glivia A. R. Barbosa, Wagner Meira Jr, Ismael S. Silva,
Raquel O. Prates, Mohammed J. Zaki, and Adriano
Veloso. 2012. Characterizing the effectiveness of
twitter hashtags to detect and track online population
sentiment. In Proc. ACM CHI, Austin, Texas, USA,
May.
Utsab Barman, Amitava Das, Joachim Wagner, and Jennifer Foster. 2014. Code mixing: A challenge for language identification in the language of social media.
In The 1st Workshop on Computational Approaches to
Code Switching, EMNLP 2014.
Inma Munoa Barredo. 1997. Pragmatic functions
of code-switching among Basque-Spanish bilinguals.
Retrieved on October, 26:528–541.
Caroline Brun. 2012. Learning opinionated patterns for
contextual opinion detection. In COLING (Posters),
pages 165–174. Citeseer.
Cristian Danescu-Niculescu-Mizil, Moritz Sudhof, Dan
Jurafsky, Jure Leskovec, and Christopher Potts. 2013.
A computational approach to politeness with application to social factors. Proceedings of ACL.
Munmun De Choudhury, Michael Gamon, Scott Counts,
and Eric Horvitz. 2013. Predicting depression via social media. In ICWSM.
Jean-Marc Dewaele. 2004. Blistering barnacles! What
language do multilinguals swear in?! Estudios de Sociolinguistica, 5:83–105.
Jean-Marc Dewaele. 2010. Emotions in multiple languages. Palgrave Macmillan, Basingstoke, UK.
J. A. Fishman. 1971. Sociolinguistics. Newbury House, Rowley, MA.
Spandana Gella, Jatin Sharma, and Kalika Bali. 2013.
Query word labeling and back transliteration for Indian
languages: Shared task system description.
Yael Maschler. 1991. The language games bilinguals
play: language alternation at language boundaries.
Language and communication, 11(2):263–289.
Yael Maschler. 1994. Appreciation haaraxa o haarasta? [valuing or admiration]. Negotiating contrast in bilingual disagreement talk, 14(2):207–238.
Lesley Milroy and Pieter Muysken, editors. 1995. One
speaker, two languages: Cross-disciplinary perspectives on code-switching. Cambridge University Press.
Namita Mittal, Basant Agarwal, Garvit Chouhan, Nitin
Bania, and Prateek Pareek. 2013. Sentiment analysis
of hindi review based on negation and discourse relation. In proceedings of International Joint Conference
on Natural Language Processing, pages 45–50.
Saif M. Mohammad and Peter D. Turney. 2013. Crowdsourcing a word-emotion association lexicon. Computational Intelligence, 29(3):436–465.
Saif Mohammad, Svetlana Kiritchenko, and Xiaodan
Zhu. 2013. Nrc-canada: Building the state-of-theart in sentiment analysis of tweets. In Proceedings of
the seventh international workshop on Semantic Evaluation Exercises (SemEval-2013), Atlanta, Georgia,
USA, June.
Saif M. Mohammad. 2012. #Emotional tweets. In Proceedings of the First Joint Conference on Lexical and
Computational Semantics-Volume 1: Proceedings of
the main conference and the shared task, and Volume
2: Proceedings of the Sixth International Workshop on
Semantic Evaluation, pages 246255. Association for
Computational Linguistics.
Miwa Nishimura. 1995. A functional analysis of
Japanese/English code-switching. Journal of Pragmatics, 23(2):157–181.
Bo Pang, Lillian Lee, and Shivakumar Vaithyanathan.
2002. Thumbs up?: Sentiment classification using
machine learning techniques. In Proc. EMNLP, pages
79–86.
Rana D. Parshad, Suman Bhowmick, Vineeta Chand,
Nitu Kumari, and Neha Sinha. 2016. What is India
speaking? Exploring the Hinglish invasion. Physica
A, 449:375–389.
F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel,
B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer,
R. Weiss, V. Dubourg, J. Vanderplas, A. Passos,
D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine learning in Python.
Journal of Machine Learning Research, 12:2825–2830.
Nanyun Peng, Yiming Wang, and Mark Dredze.
2014. Learning polylingual topic models from code-switched social media documents. In ACL (2), pages 674–679.
Ashequl Qadir. 2009. Detecting opinion sentences specific to product features in customer reviews using
typed dependency relations. In Proceedings of the
Workshop on Events in Emerging Text Types, pages
38–43. Association for Computational Linguistics.
Pujari Rajkumar, Swara Desai, Niloy Ganguly, and
Pawan Goyal. 2014. A novel two-stage framework
for extracting opinionated sentences from news articles. TextGraphs-9, page 25.
Sara Rosenthal, Preslav Nakov, Svetlana Kiritchenko,
Saif M Mohammad, Alan Ritter, and Veselin Stoyanov. 2015. Semeval-2015 task 10: Sentiment analysis in twitter. Proceedings of SemEval-2015.
Rosaura Sanchez. 1983. Chicano discourse. Newbury House, Rowley, MA.
Royal Sequiera, Monojit Choudhury, Parth Gupta,
Paolo Rosso, Shubham Kumar, Somnath Banerjee,
Sudip Kumar Naskar, Sivaji Bandyopadhyay, Gokul
Chittaranjan, Amitava Das, and Kunal Chakma. 2015.
Overview of fire-2015 shared task on mixed script information retrieval. In Working Notes of FIRE, pages
21–27.
Raksha Sharma and Pushpak Bhattacharyya. 2015a. A sentiment analyzer for Hindi using Hindi senti lexicon.
Shashank Sharma, Pykl Srinivas, and Rakesh Chandra
Balabantaray. 2015b. Text normalization of code mix
and sentiment analysis. In Advances in Computing,
Communications and Informatics (ICACCI), 2015 International Conference on, pages 1468–1473. IEEE.
Thamar Solorio and Yang Liu. 2008. Part-of-speech tagging for english-spanish code-switched text. In Proceedings of the Conference on Empirical Methods in
Natural Language Processing, pages 1051–1060. Association for Computational Linguistics.
Thamar Solorio and Yang Liu. 2010. Learning to Predict
Code-Switching Points. In Proc. EMNLP.
Thamar Solorio, Elizabeth Blair, Suraj Maharjan, Steven
Bethard, Mona Diab, Mahmoud Gohneim, Abdelati
Hawwari, Fahad AlGhamdi, Julia Hirschberg, Alison
Chang, et al. 2014. Overview for the first shared task
on language identification in code-switched data. Proceedings of The First Workshop on Computational Approaches to Code Switching, EMNLP, pages 6272.
Simo Tchokni, D.O. Seaghdha, and Daniele Quercia.
2014. Emoticons and phrases: Status symbols in social media. In Eighth International AAAI Conference
on Weblogs and Social Media.
Twitter. 2015a. GET help/languages. Twitter Developers.
Twitter. 2015b. GET search/tweets. Twitter Developers.
David Vilares, Miguel A. Alonso, and Carlos Gómez-Rodríguez. 2015. Sentiment analysis on monolingual,
multilingual and code-switching twitter corpora. In
6th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis.
Svitlana Volkova, Theresa Wilson, and David Yarowsky.
2013. Exploring Sentiment in Social Media: Bootstrapping Subjectivity Clues from Multilingual Twitter
Streams. In Proc. ACL (Vol2: Short Papers).
Soroush Vosoughi and Deb Roy. 2016. Tweet acts: A
speech act classifier for twitter. In Tenth International
AAAI Conference on Web and Social Media.
Yogarshi Vyas, Spandana Gella, Jatin Sharma, Kalika
Bali, and Monojit Choudhury. 2014. POS Tagging
of English-Hindi Code-Mixed Social Media Content.
In Proc. EMNLP, pages 974979.