
Cities 92 (2019) 187–196

Contents lists available at ScienceDirect

Cities
journal homepage: www.elsevier.com/locate/cities

Characterization of citizens using word2vec and latent topic analysis in a large set of tweets
Vladimir Vargas-Calderón a, Jorge E. Camargo b,*
a Physics Department, Universidad Nacional de Colombia, Bogotá, Colombia
b Systems Engineering Department, Fundación Universitaria Konrad Lorenz, Bogotá, Colombia

ARTICLE INFO

Keywords: Natural language processing; Word embedding; t-SNE; Social network analysis

ABSTRACT

With the increasing use of the Internet and mobile devices, social networks are becoming the most used media to communicate citizens' ideas and thoughts. This information is very useful to identify communities with common ideas based on what they publish in the network. This paper presents a method to automatically detect city communities based on machine learning techniques applied to a set of tweets from Bogotá's citizens. An analysis was performed in a collection of 2,634,176 tweets gathered from Twitter in a period of six months. Results show that the proposed method is an interesting tool to characterize a city population based on machine learning methods and text analytics.

1. Introduction

Internet usage in Colombia, and in particular in Bogotá, has been increasing in the last years, not only because it constitutes the largest source of information of every kind, but also because of governmental efforts to make people from different social backgrounds recurrent Internet users (Chaparro Gaitán, 2013), with the idea of reducing the digital gap in Colombia. For instance, from 2011 to 2014, several portions of the city have experienced an increase of up to 148.3% in homes with Internet connection (Durango Padilla, 2014).

By 2011, both Facebook and Twitter were the most popular social networks in Bogotá (Chaparro Gaitán, 2013). Moreover, 65.5% of Bogotá's Internet users used social networks by 2014 (Durango Padilla, 2014). This study concentrates on Twitter because it allows its users to write posts with a length ranging from 1 to 140 characters¹, making Twitter a tool for microblogging, a form of communication in which users express their opinion about several topics in short posts (tweets) (Java, Song, Finin, & Tseng, 2007).

By only taking the users' posts and not the explicit relations between the users, our large data set is an interesting basis for testing unsupervised community detection methodologies. In fact, the only common factor that relates the users from our data set is that they are connected by geographical location. Such unsupervised community detection methodologies help to understand social phenomena that take place in that geographical region and in a particular period of time, using data from any social network with no a priori knowledge of people's relations.

With the objective of identifying the main topics treated by Bogotá's population on Twitter, as well as detecting possible communities, we collected tweets emitted from Bogotá. Then, a Word2Vec model (Řehůřek & Sojka, 2010) was built in order to represent the set of tweets corresponding to each user as a vector in a vector space. The gap statistic (Tibshirani, Walther, & Hastie, 2001) was used to estimate the number of clusters that could be formed using these vectors. Finally, a frequency distribution of words was built for each cluster, so that each cluster could be identified by its most frequent words, which ultimately characterizes a topic.

Our work acknowledges that online social networks constitute one of the main scenarios where people express their opinions, which makes them an outstanding source of information that allows the characterization of important topics for the citizenship. Therefore, we provide a robust method for studying cultural and political aspects of a society, which unveils groups of citizens that are actively concerned about particular topics that are relevant for the dynamics of a city. The main contribution of our work is the fully unsupervised and city-dependent method that we propose for discovering such groups. In turn, the identification of the main topics discussed by the citizens simplifies the task of targeting groups of people for promoting cultural, political or educational campaigns about very specific issues.

This paper is organized as follows. In Section 2 we present related

* Corresponding author at: Systems Engineering Department, Fundación Universitaria Konrad Lorenz, Carrera 9 Bis No. 62 - 43, Bogotá, Colombia.
E-mail addresses: [email protected] (V. Vargas-Calderón), [email protected] (J.E. Camargo).
¹ Data was collected before the change in the maximum tweet length to 280 characters.

https://doi.org/10.1016/j.cities.2019.03.019
Received 10 January 2019; Received in revised form 12 March 2019; Accepted 25 March 2019
0264-2751/ © 2019 Published by Elsevier Ltd.

work regarding methods for topic detection, models for text representation, as well as a short review of community detection algorithms. Section 3 details material and methods. In Section 4 we present the theoretical aspects of the proposed method. In Section 5 the principal results are shown. We discuss these results in Section 6 with the purpose of providing some enlightenment on the way our method detected topics treated by Bogotá's population. Finally, Section 7 presents the main conclusions of the paper and future work.

2. Related work

In this section we review important milestones in the areas of topic detection, opinion-mining, text representation and community detection models, all of which are pertinent to our work.

Numerous methods have been created for topic detection in texts. The most successful and acknowledged ones are Latent Dirichlet Allocation (LDA) (Blei et al., 2003), Latent Semantic Analysis (LSA) (or Latent Semantic Indexing (LSI)) (Deerwester, Dumais, & Harshman, 1990), and Correlated Topic Models (CTM) (Blei & Lafferty, 2006). However, people always tune and mix these methods in larger pipelines that allow them to study a singular application in which they are interested. In particular, analyzing posts from Twitter is a challenging task because these are short texts that lack context. Moreover, many of these short texts are not words of an official language, both because of misspelling and because of the usage of Internet slang. However, several studies have tried to adapt to these inconveniences. A model (Zhou, De, & Moessner, 2016) based on TwitterLDA (Zhao et al., 2011) (which is an adaptation of LDA to short texts such as tweets) was constructed to take advantage of Twitter as a source of real world event information. This model aimed to detect emerging events that could affect a geographical region in a city, and also quantitatively estimated the impact of such an event on the population nearby the event location. Another study (Benny & Philip, 2015) proposed a model to detect related topics based on clustering performed over a TF-IDF representation of tweets. The clustering was done by including weights between words called Associative Gravity Force (Klahold, Uhr, Ansari, & Fathi, 2014) (along with other measures of similarity between words, as well as ranking of words), that accounted for the frequency of pairs of words occurring together. Also, graph-based approaches have been recently proposed (Hachaj & Ogiela, 2017), in which hash-tags are used to identify communities and topics by finding frequent co-occurring hash-tags.

Other less common methods have also been used to study texts from Twitter, like Cigarran, Castellanos, and Garcia-Serrano (2016), in which Formal Concept Analysis (FCA) (Wille, 1992; Bernhard & Rudolf, 1999), a mathematical application of lattices and ordered sets to the process of concept formation, was used as an alternative approach to topic detection. The authors use FCA because it deals with several problems from which the traditional methods suffer, such as the unknown number of topics and the difficulty of these methods to adapt to new topics, among others.

It is worth noting that the majority of work has been done using English text corpora with valuable results (Dashtipour et al., 2016). However, Spanish is a significantly more inflected language than English, and this difference could pose problems. For example, supervised machine learning methods for the topic classification of annotated Spanish tweets modeled with n-grams have been shown to be insufficient (Fernández Anta, Núñez Chiroque, Morere, & Santos, 2013). Better attempts to deal with the several research branches of topic detection and opinion-mining in Spanish have taken place. For instance, in the field of opinion mining, studies like Dolores Molina-González, Martínez-Cámara, Teresa Martín-Valdivia, and Alfonso Ureña-López (2015) proposed a lexicon-based model that adapts to specific domains in Spanish for polarity classification of film reviews. Also, polarity classification in Spanish tweets has been treated in Vilares, Alonso, and Gómez-Rodríguez (2013), where hybrid systems that bring together knowledge from lexical, syntactic and semantic structures in the Spanish language, as well as machine learning techniques used with the bag-of-words representation, have shown improvements over the sole bag-of-words approach. Besides, the clever creation of a corpus called MeSiento (Spanish for "I feel") (Montejo-Ráez, Díaz-Galiano, Perea-Ortega, & Ureña López, 2013) allowed a robust unsupervised method for polarity classification of Spanish tweets that reached accuracy levels close to the ones obtained with supervised algorithms.

Nonetheless, these methods applied to sentiment analysis tasks depend a lot on annotated dictionaries which do not contain Bogotá's jargon. As a matter of fact, few attempts have been made to make small topic-specific dictionaries, such as Alvarado Valencia, Carrillo, Forero, Caicedo, and Ureña (2016), where the political sentiment towards Bogotá mayoral candidates for 2015 was analyzed using Twitter and a political sentiment dictionary defined in the Colombian political context. A second study briefly examined the sentiment of tweets from Bogotá with words related to health symptoms (Salcedo & León, 2015). A third study examined the results of the 2015 Colombian regional elections and compared them with the political ideology and Twitter activity of the candidates (Correa & Camargo, 2017).

In the end, it is clear that an unsupervised model for text representation is needed to give robust topic-independent text representations. One of the most successful and widely used text representation models is Word2Vec, which has proven to give good results regardless of the language in opinion mining and topic detection duties. For instance, Enríquez, Troyano, and López-Solaz (2016) combined Word2Vec and a bag-of-words document classifier, and showed that Word2Vec provided word embeddings that produced more stable results when doing cross-domain classification experiments. Also, since Word2Vec was first introduced, there has been some research trying to improve and fine-tune word embeddings. Such is the case of J. Li, Li, Fu, Masud, and Huang (2016), which proposes a hybrid model between the skip-gram model and continuous bag of words (CBOW) called mixed word embedding (MWE). All in all, we choose word embedding models such as Word2Vec for being able to embed semantic similarities between words in a similarity metric defined over a Euclidean vector space.

Advances in sentiment analysis have been reported in recent works such as Dashtipour et al. (2016), where state of the art methods were surveyed and compared. Deep learning methods are being used in works such as Dashtipour et al. (2018) to analyze sentiments in Persian texts. Deep convolutional neural networks have also been investigated to analyze sentiments in Twitter (Jianqiang, Xiaolin, & Xuejun, 2018). Deep learning based methods have been used to detect malicious accounts in location-based social networks (Gong et al., 2018). One recent work used a Bayesian network and fuzzy recurrent neural networks for detecting subjectivity (Chaturvedi, Ragusa, Gastaldo, Zunino, & Cambria, 2018).

With regard to one of the particular objectives of our work, detecting communities, several methods have been developed in the last couple of decades to solve the so-called planted ℓ-partition model, where the structure of graphs is studied to find densely connected groups of nodes (see Lancichinetti & Fortunato, 2009; Yang, Algesheimer, & Tessone, 2016 for excellent reviews). More modern methods based on embedding communities in low-dimensional vector spaces try to solve problems such as node clustering, node classification, low-dimensional visualizations, edge prediction, among others, with great success (Cavallari, Zheng, Cai, Chang, & Cambria, 2017; Y. Li, Sha, Huang, & Zhang, 2018). However, we shall point out that this is a very active area of research with many facets, and as argued in Rosvall, Delvenne, Schaub, and Lambiotte (2017), community detection should not be considered as a well-defined problem, but instead should be motivated by particular reasons. In this sense, our motivation for detecting communities is to find groups of people with a clear topic of interest, regardless of whether such groups of people follow each other on Twitter. This means that we do not know from the beginning


Fig. 1. Overview of the proposed method: (1) A component crawls a set of tweets, which are stored in a document database; (2) The Word2Vec model is applied to this data set to map all the tweets into an embedding space; (3) A clustering analysis is performed in the embedded space to find latent topics; (4) Each user is projected in a 2D visualization in which the obtained latent topics are colored. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

any connection between the nodes (users), and we aim to detect communities solely based on the data that characterizes each node, i.e. the text representation of each user's tweets.

The contribution of this paper is twofold: first, we propose a method to automatically identify digital communities of a city grouped by topic of interest, and second, we collected a set of tweets of Bogotá's citizens to illustrate the proposed method.

3. Material and methods

An overview of the proposed method is depicted in Fig. 1, which is inspired by the ideas found in Silva, Santana, Lobato, and Pinheiro (2017). We first crawl a set of tweets, which are stored in a document database. Then, we generate vector representations of texts using Word2Vec. We selected this model because it has been the seed for all word embedding models, and it is the most widely used model, despite the existence of newer and very successful word embedding models such as fastText (Joulin et al., 2016; Joulin, Grave, Bojanowski, & Mikolov, 2017; Bojanowski, Grave, Joulin, & Mikolov, 2017), BERT (Devlin, Chang, Lee, & Toutanova, 2018), Swivel (Shazeer, Doherty, Evans, & Waterson, 2016) and ELMo (Peters et al., 2018). Afterwards, clustering analysis is performed in the embedded space to find latent topics. Each user is projected in a 2D visualization in which the obtained latent topics are colored. It is of the utmost importance to notice that we do not create a graph with explicit edges between the users, but rather let the latent topics found in each user's tweets create implicit edges. The following sections describe in detail each of these components.

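As a minimal illustration of components (2) and (3) of this pipeline, the following sketch averages word vectors into per-user document vectors and clusters them. It is a stdlib-only toy: the word list, random "embeddings" and documents are hypothetical stand-ins for a trained Word2Vec model, and the hand-rolled k-means stands in for scikit-learn's implementation used in the paper.

```python
import random

random.seed(7)

# Hypothetical stand-in for a trained Word2Vec model: word -> 4-dimensional vector.
DIM = 4
vocab = ["gol", "partido", "paz", "santos", "hoy", "bogota"]
embedding = {w: [random.uniform(-1, 1) for _ in range(DIM)] for w in vocab}

def doc_vector(tokens):
    """A user's document vector is the average of its in-vocabulary word vectors."""
    vecs = [embedding[t] for t in tokens if t in embedding]
    return [sum(c) / len(vecs) for c in zip(*vecs)]

def dist2(p, q):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, iters=25):
    """Minimal Lloyd's k-means; the paper used scikit-learn's KMeans instead."""
    centroids = [list(p) for p in random.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda r: dist2(p, centroids[r])) for p in points]
        for r in range(k):
            members = [p for p, l in zip(points, labels) if l == r]
            if members:
                centroids[r] = [sum(c) / len(members) for c in zip(*members)]
    return labels, centroids

# Four toy "users", each represented by the tokens of all their tweets.
docs = [["gol", "partido"], ["gol", "gol", "partido"],
        ["paz", "santos"], ["santos", "paz", "hoy"]]
labels, _ = kmeans([doc_vector(d) for d in docs], k=2)
```

Each user ends up with one cluster label; in the paper, these labels are then projected to 2D for visualization.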

Fig. 2. Distribution of tweet and document length in the corpus.
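Related to these length distributions, the document and vocabulary construction rules used in this work (discard tweets under 20 characters, keep words appearing at least 10 times, keep users with at least 40 in-vocabulary tokens) can be sketched as follows. This is a hedged illustration: the whitespace tokenizer stands in for NLTK's TweetTokenizer, and two thresholds are lowered (with the paper's values noted in comments) so the toy corpus is non-empty.

```python
from collections import Counter

MIN_TWEET_CHARS = 20   # paper's value: tweets shorter than 20 characters are discarded
MIN_WORD_COUNT = 2     # paper's value is 10; lowered here for the toy corpus
MIN_DOC_TOKENS = 3     # paper's value is 40 in-vocabulary occurrences per user

def build_documents(tweets_by_user):
    """Concatenate each user's sufficiently long tweets into one token list."""
    docs = {}
    for user, tweets in tweets_by_user.items():
        tokens = []
        for t in tweets:
            if len(t) >= MIN_TWEET_CHARS:
                tokens.extend(t.lower().split())  # stand-in for NLTK's TweetTokenizer
        docs[user] = tokens
    return docs

def model_vocabulary(docs):
    """The model's vocabulary: words frequent enough in the corpus."""
    counts = Counter(tok for toks in docs.values() for tok in toks)
    return {w for w, c in counts.items() if c >= MIN_WORD_COUNT}

def active_documents(docs, vocab):
    """Keep only users with enough in-vocabulary occurrences (active users)."""
    kept = {}
    for user, toks in docs.items():
        in_vocab = [t for t in toks if t in vocab]
        if len(in_vocab) >= MIN_DOC_TOKENS:
            kept[user] = in_vocab
    return kept

# Hypothetical two-user corpus.
tweets_by_user = {
    "u1": ["el partido de hoy estuvo muy bueno",
           "gol del equipo en el partido de hoy"],
    "u2": ["ok"],  # too short: discarded by the length filter
}
docs = build_documents(tweets_by_user)
vocab = model_vocabulary(docs)
kept = active_documents(docs, vocab)
```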

3.1. Crawler

A crawler component was implemented in Java using the Twitter Streaming API, which allowed us to collect a set of 2,476,426 tweets of Bogotá in a period of 111 days (from August 2015 to December 2015). The query only searched for tweets that matched the string "Bogotá" in the field "place" of the tweet meta-data.

3.2. Tweets data set

Tweets corresponding to the same user form a document. All the documents form the corpus. The set of words composing the corpus is the vocabulary. The distribution of tweet and document length of the corpus is shown in Fig. 2. It is worth noting that most of the tweets have 20–60 characters. The average length in characters of the tweets is 55 and the average length in tokens of the documents is 639.

3.3. Pre-processing

We decided to discard tweets with less than 20 characters because very short texts usually lack significant information. From the tweets with 20 or more characters, a total of 58,644 documents were created. The documents were tokenized using NLTK's Tweet Tokenizer (Bird, Klein, & Loper, 2009), which allowed us to preserve emoticon-like characters (and smileys) in tweets. The tokens were not stemmed because Word2Vec deals with different conjugations of words.

3.4. Word2Vec embedding

With these documents, a Word2Vec model with a context distance of 6 words was trained. Words that appeared at least 10 times in the corpus were selected from the vocabulary to train the model. This subset of the corpus vocabulary (or model's vocabulary) was composed of 55,168 words. Word2Vec represents each document (the average of the embedded vectors of the words composing the document) as a vector in a vector space whose dimension was set to 150.

3.5. Documents database

Having the vector representation in the 150-dimensional space, we discarded the documents that contained less than 40 word occurrences from the model's vocabulary. This was done with the purpose of examining documents that represented active Twitter users. A total of 30,746 documents satisfied this condition.

3.6. Clustering of documents (users)

In order to identify the main topics treated by Bogotá's Twitter users, we used the k-means algorithm (with Python's scikit-learn module (Pedregosa et al., 2011)) on the vector representation of the documents. To determine the number of clusters k in the cluster analysis, a gap statistic study was performed.

3.7. Frequency distribution of words within clusters

Once k was estimated, we took the 15 most representative documents of each cluster to build a frequency distribution of words that allowed us to easily identify the topics represented in each cluster. Moreover, a tweet length distribution of the 15 documents of each cluster was also built in order to recognize which topics demanded users to write longer or shorter tweets.

4. Theoretical framework

This section presents the theoretical framework used in each component of the proposed method.

4.1. Word2Vec embedding

Word embeddings constitute a solution to the task of numerically representing words as semantics and syntax carriers within sentences. Let 𝒲 = {w_i : i = 1, 2, …, V} be the ordered set of words, or vocabulary, contained in the corpus, where w_i is the i-th word, and V is the size of the vocabulary. Generally, words are represented as hot-vectors, which are the V-dimensional canonical basis vectors,

    w_1 = (1, 0, …, 0), …, w_V = (0, 0, …, 1),   (1)

where w_i is the hot-vector representation of the word w_i. Clearly ⟨w_i, w_j⟩ := Σ_k w_ik w_jk = δ_ij, where δ_ij is the Kronecker delta and w_ik is the k-th component of the i-th word hot-vector, which shows that in this


Fig. 3. Three-layered neural network explaining the CBOW method from Word2Vec.

canonical basis of ℝ^V there cannot be a relation between words, since they are pairwise orthonormal. Therefore, Mikolov, Sutskever, Chen, Corrado, and Dean (2013) proposed a model to build a projector 𝒫 : ℝ^V → ℝ^N, where N ≪ V, that maps hot-vectors into embedded N-dimensional vectors. That is, 𝒫(w_i) ↦ ω_i, where ω_i is the embedded N-dimensional representation of the word w_i. The way of constructing 𝒫 is via a multi-layered neural network that can be arranged in two different ways. The first way results in the Continuous Bag of Words (CBOW) model, and the second way results in the skip-gram model. This work centers on the CBOW model, in which the neural network has the job of predicting a target word given a set of words called context words. The context words of a word are defined as the set of words that are at a distance less than or equal to c from each occurrence of the word in the corpus, where c is some integer that one defines. For instance, if the corpus contains food reviews, one would expect the neural network to predict "food" when the context words are {"delicious", "yummy", "exquisite"}, and not to predict "cat". The neural network can be depicted as in Fig. 3, where the input layer consists of C × (V × 1) neurons (C is the number of words in the set of context words), the hidden layer of N neurons, and the output layer of V neurons.

To see how 𝒫 is constructed, consider the simplified problem of a one-word context. In this case, the input layer receives a word hot-vector w_i, and acts linearly on it with an N × V matrix W, whose components W_ji are the weights between the i-th neuron of the input layer and the j-th neuron of the hidden layer. This layer can be represented by an N-dimensional vector h whose components are the sum of the weights received by each neuron, i.e. h_k = (W w_i)_k = W_ki. Here, W can be identified as the matrix form of the projector 𝒫, where the i-th column of W is ω_i. In a similar fashion, W′ is a V × N weight matrix between the hidden and the output layer. Note that both the i-th column of W and the i-th row of W′ are N-dimensional vector representations of the word w_i. Moreover, since h_k = W_ki, then h = ω_i. This allows ⟨ω′_j, h⟩ = ⟨ω′_j, ω_i⟩ to be a score of similarity between the different vector representations of the words w_i and w_j. These scores allow the definition of a softmax multinomial distribution,

    y_j := P(w_j | w_i) = exp(⟨ω′_j, ω_i⟩) / Σ_{k=1}^{V} exp(⟨ω′_k, ω_i⟩),   (2)

where y_j is the output of the j-th neuron in the output layer. Eq. (2) is the estimated probability of w_i being a context word of w_j. To learn the correct probabilities, a loss function E is defined as

    E := −log P(w_j | w_i) = log Σ_{k=1}^{V} exp(⟨ω′_k, ω_i⟩) − ⟨ω′_j, ω_i⟩.   (3)

Since the goal is to minimize E, then ⟨ω′_j, ω_i⟩ must become larger in the learning process, meaning that words that share a context must have similar vector representations, and words that do not share contexts must have dissimilar vector representations. Notice that computing E is time-consuming, since normally V is a big number (∼10^4).

Furthermore, when C > 1, h is defined as the average vector of the context words' vector representations. Here the loss function is E := −log P(w_j | 𝒞), where 𝒞 is a set of C words chosen from 𝒲. This probability can be expressed as in Eq. (2) by changing ω_i by h. Notice that there are (V choose C) ways of picking contexts, and the number of calculations increases dramatically. To reduce this number, negative sampling (Mikolov et al., 2013) is used. Let P̃(w) be the term frequency distribution of words. Let Q̃(w) be a noise distribution built by taking the probability P̃(w) of each word, raising it to the power of 3/4 (this allows less frequent words to have more probability of being drawn from Q̃ (Goldberg & Levy, 2014)), and then renormalizing. The problem of minimizing E is converted into a binary classification problem as follows. Given a context 𝒞, we pick a word w from 𝒲 such that 𝒞 are context words for w. This pair (w, 𝒞) is called a true sample and is labeled by d = 1. From the noise distribution Q̃, K words {w_1^n, …, w_K^n} are drawn. The pairs (w_k^n, 𝒞) are negative samples and are labeled by d = 0. Therefore, the binary classification problem can be stated as maximizing the joint probability of (d, w), where w can refer to a word both from the true sample or from the negative samples (Dyer, 2014):

    P(d, w | 𝒞) = { (K/(1+K)) Q̃(w)      if d = 0,
                  { (1/(1+K)) P̃(w | 𝒞)  if d = 1.   (4)

Since the actual empirical distribution P̃(w | 𝒞) is not known, the one defined in Eq. (2) is used. Now, to reduce the calculations, under some approximations (Dyer, 2014), it is possible to write Eq. (4) as two simple equations:

    P(d = 0 | w, 𝒞) = 1 / (exp(⟨ω, h⟩) + 1),   (5)

    P(d = 1 | w, 𝒞) = exp(⟨ω, h⟩) / (exp(⟨ω, h⟩) + 1).   (6)

Note that these are sigmoid functions σ of the same argument, except for a sign. From Eq. (5) it can be seen that if ω and h are dissimilar (similar) then the mentioned probability will be close to 1 (0). Similarly, in Eq. (6), if ω and h are similar (dissimilar), the probability will also be close to 1 (0). In this case, it can be shown that the error function takes the form

    E = Σ_{k=1}^{K} log σ(⟨ω_k^n, h⟩) − log σ(⟨ω, h⟩),   (7)

for a target word w. To see negative sampling in the context of the skip-gram model (also used instead of CBOW), see Goldberg and Levy (2014).

4.2. Gap statistic

The gap statistic allows the estimation of the number of clusters in a data set (Tibshirani et al., 2001). Consider a data set {x_1, …, x_N} of N samples and M features. Let C_r be the r-th cluster found with K-Means out of k clusters (other clustering algorithms can be implemented), containing n_r = |C_r| samples. The within-cluster dispersion is defined as

    D_r := (1/(2 n_r)) Σ_{x_i, x_j ∈ C_r} ||x_i − x_j||²,   (8)

which is similar to the variance except for a factor of 1/n_r. The total within-cluster dispersion for k clusters is

    W_k := Σ_{r=1}^{k} D_r.   (9)

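Returning to the negative-sampling objective of Eqs. (5)–(7), the probabilities and the 3/4-power noise distribution Q̃ can be sketched with a few lines of stdlib Python. This is an illustrative sketch only: the function names are ours, and `score` stands for the inner product ⟨ω, h⟩ between a word vector and the averaged context vector.

```python
import math

def noise_distribution(freq):
    """Q~: raise each word's frequency to the 3/4 power and renormalize."""
    powered = {w: f ** 0.75 for w, f in freq.items()}
    z = sum(powered.values())
    return {w: p / z for w, p in powered.items()}

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def p_negative(score):
    """Eq. (5): probability that (w, context) is a noise pair."""
    return 1.0 / (math.exp(score) + 1.0)

def p_positive(score):
    """Eq. (6): probability that (w, context) is a true pair."""
    return math.exp(score) / (math.exp(score) + 1.0)

def loss(pos_score, neg_scores):
    """Eq. (7): error for one target word and its K negative samples."""
    return sum(math.log(sigmoid(s)) for s in neg_scores) - math.log(sigmoid(pos_score))
```

Note that `p_positive` is exactly the sigmoid of the score, and the two probabilities sum to one, as Eqs. (5) and (6) require.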

Wk is normally used to estimate the “correct” number of clusters K each document of the cluster and its respective cluster centroid. With
via the elbow method. On the other hand, the gap statistic uses null each set of 15 documents, a word frequency distribution was built in
reference distributions. These are each constructed by finding the range order to get the most frequent words of each cluster. A tweet length
in the M-dimensional space of the samples and generating N data points distribution was also built with each of these sets. The most frequent
with a uniform distribution. Next, k clusters are computed for each words in each cluster tell us if the topics can be easily defined. From
reference distribution. If there are B such distributions, then the gap both visualizations it is seen that there are two types of clusters: the
statistic is defined as ones corresponding to a mega-cluster, and the ones that are certainly
B
different from the rest of the corpus, called one-topic clusters because
1 their topic can be easily defined. The one-topic clusters are tagged in
GapB (k ) = ∑ log Wk(b) − log Wk ,
B b=1 (10) the figures with their respective topic (except from “Love” and
“Religion”, which can be found within the mega-cluster). It can be seen
where Wk(b) is the total within-cluster dispersion of the b-th reference
that PCA makes a sparse visualization of the documents corresponding
distribution for k clusters. It can be argued (Tibshirani et al., 2001) that
to one-topic clusters, while t-SNE groups them and plots them apart
GapB(k) reaches a maximum for k = K when the cluster centroids are
from the mega-cluster. It should be noted that the one-topic clusters can
aligned in an equally spaced fashion. Also, the uncertainty of GapB(k)
be used to treat documents as nodes in a graph, whose edges connect
can be estimated to be
nodes from the same one-topic clusters. This graph could then be used
1 by a semi-supervised community detection method such as Mirabelli
sk = 1+ sd(k ), and Kushnir (2018) in order to label the remaining nodes from the
B (11)
mega-cluster.
where sd(k) is the standard deviation of the logarithm of the B reference The most frequent words for each cluster are presented in Table 1. It
distributions' total within-cluster dispersion. Thus, with a 1-sigma cer- is noteworthy that even though clusters 25 and 30 are related to the
tainty, K is the value of k for which GapB(k) ± sk reaches its maximum. same topic, documents in cluster 30 are more news-like written, while
25 contained documents full of personal opinions about Country Poli-
5. Results tics. The same relation occurs between clusters 34 and 39. Cluster 34 is
full of documents with well-written tweets that try to inform the si-
Once the Word2Vec model was trained, the documents were re- tuation of a particular football team, or match, whereas cluster 39
presented by the average of its Word2Vec representations. The gap contains documents that express feelings about specific teams or foot-
statistic was computed for several quantity of clusters k = 5,10,…,140, ball events. It is important to point out that cluster 2 was filled with
resulting in Fig. 4. The curve shown in this figure resembles a log curve. dialogue-like written tweets, i.e. tweets that have conversations in
This happens because the Word2Vec model uses all the dimensions to them. In the case of clusters 22 and 24, corresponding to news, it can be
represent the similarity relations between word vectors. Since texts are identified that the combination of words used in these types of re-
normally rich in words, it is expected that no clear clusters will form, and therefore that the number of clusters will be hard to distinguish. This is also supported by the fact that Twitter users might tweet about different topics, causing their document vectors to be assigned to clusters that group two or more topics. Despite these inconveniences, the gap statistic allows us to estimate the number of clusters: from the figure, it can be seen that the curve begins to flatten, forming an elbow, around k = 40.

The result of K-means clustering with 40 clusters is shown in Fig. 5, where the 15 most representative documents of each cluster are plotted with PCA (Fig. 5a) and t-SNE (Fig. 5b), using a different color for each cluster. We selected the 15 most representative documents of each cluster by sorting, in ascending order, the Euclidean distance between each document vector and its cluster centroid.

The style of news-reporting tweets is very peculiar to those clusters: even though the news topic covers plenty of subjects, the way these tweets are written is captured by Word2Vec.

6. Discussion

Remarkably, many clusters contained documents that constantly make reference to love. The similarity between these clusters is apparent because they share many common words. Cluster 9 is also very particular, because it encloses documents containing tweets written both in French and in Portuguese.

From the average tweet length of each cluster, it can easily be seen that longer tweets are generally part of a document contained in a cluster that represents a specific topic. On the other hand, clusters with a short average tweet length consist of documents whose tweets express personal experiences. From the table, it is seen that there are plenty of clusters with this sort of document, indicating that people tend to express their personal experiences with the same set of words and in very similar semantic expressions.

To make Table 1 easier to read, a PCA visualization of the cluster centroids is shown in Fig. 6. In this figure, the clusters with the longest average tweet length appear away from the mega-cluster comprised of all the clusters with a short average tweet length. This phenomenon allows us to propose that people tend to share their personal experiences in shorter tweets, while they give their opinions on topics of community importance in longer texts.

The overlap of document vectors across two topics means that those users belong to both communities, which is normal because a user is typically interested in more than one topic. This is known in machine learning as soft clustering, in which an object can belong to more than one cluster. We did not conduct this analysis in our approach, but it would be interesting to address this aspect in future work.
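The soft-clustering extension mentioned here can be sketched with a Gaussian mixture model from scikit-learn; the random vectors below are only a stand-in for the actual Word2Vec document embeddings, so this is an illustrative sketch rather than the paper's method.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Toy stand-in for Word2Vec document vectors: two overlapping groups.
rng = np.random.default_rng(0)
docs = np.vstack([rng.normal(0.0, 1.0, (50, 8)),
                  rng.normal(3.0, 1.0, (50, 8))])

# A Gaussian mixture assigns every document a probability of belonging
# to each component, so one user can sit in more than one community.
gm = GaussianMixture(n_components=2, random_state=0).fit(docs)
memberships = gm.predict_proba(docs)      # shape: (n_docs, n_components)

# Documents whose top membership is below 0.9 straddle two communities.
ambiguous = np.flatnonzero(memberships.max(axis=1) < 0.9)
```

Each row of `memberships` sums to one; thresholding the maximum membership is one simple way to flag users who belong to several topic communities at once.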

Fig. 4. Gap statistic for several numbers of clusters.
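The gap-statistic curve of Fig. 4 follows the standard recipe of comparing the within-cluster dispersion of the data against that of uniform reference samples. A minimal sketch, where the synthetic matrix X stands in for the real document vectors:

```python
import numpy as np
from sklearn.cluster import KMeans

def gap_statistic(X, k_values, n_refs=5, seed=0):
    """Gap statistic: expected log within-cluster dispersion under a
    uniform reference distribution minus that of the real data."""
    rng = np.random.default_rng(seed)
    lo, hi = X.min(axis=0), X.max(axis=0)
    gaps = []
    for k in k_values:
        km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X)
        log_wk = np.log(km.inertia_)
        ref_logs = [
            np.log(KMeans(n_clusters=k, n_init=10, random_state=seed)
                   .fit(rng.uniform(lo, hi, size=X.shape)).inertia_)
            for _ in range(n_refs)
        ]
        gaps.append(np.mean(ref_logs) - log_wk)  # larger gap -> better k
    return gaps

# Synthetic stand-in data with three well-separated groups.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.5, (40, 2)) for c in (0.0, 5.0, 10.0)])
gaps = gap_statistic(X, k_values=[1, 2, 3, 4])
```

On the real tweet embeddings, one would scan k up to several dozen and look for the elbow where the curve flattens, as in Fig. 4.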

V. Vargas-Calderón and J.E. Camargo Cities 92 (2019) 187–196

Fig. 5. Annotated two-dimensional visualizations of the 15 most representative documents for each of the 40 clusters computed with K-Means. The axes of both
visualizations have arbitrary units.
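The representative-document selection behind Fig. 5 (the 15 documents closest to each centroid) can be sketched as follows; the random matrix stands in for the Word2Vec document embeddings, and PCA is shown here (t-SNE would be applied analogously for Fig. 5b).

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
doc_vectors = rng.normal(size=(400, 16))   # stand-in for document embeddings

km = KMeans(n_clusters=40, n_init=10, random_state=1).fit(doc_vectors)

def representative_docs(vectors, kmeans, top=15):
    """Indices of the `top` documents closest (Euclidean) to each centroid."""
    reps = {}
    for c, centroid in enumerate(kmeans.cluster_centers_):
        members = np.flatnonzero(kmeans.labels_ == c)
        order = np.argsort(np.linalg.norm(vectors[members] - centroid, axis=1))
        reps[c] = members[order[:top]]     # sorted by ascending distance
    return reps

reps = representative_docs(doc_vectors, km)
chosen = np.concatenate([reps[c] for c in sorted(reps)])
coords_2d = PCA(n_components=2).fit_transform(doc_vectors[chosen])
```

The 2-D coordinates can then be scattered with one color per cluster to reproduce a plot in the spirit of Fig. 5a.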


Table 1
The most frequent words of each cluster. Each cluster is represented by three cells: the first gives the cluster number (#) and, in parentheses, its topic when one is easily identified; the second gives the average number of characters per tweet in that cluster; and the third lists the most common words of the cluster (with their English translations in parentheses): the 10 most common words for the 15 clusters with the largest average tweet length, and the 6 most common words for the rest.
#22 (News) 104 | #30 (Country politics) 87 | #34 (Sports) 87 | #24 (News) 86 | #25 (Country politics) 85
hoy (today) | Santos (Colombia's president) | gol (goal) | via (way) | Santos (Colombia's president)
Tunja (a city) | FARC (Colombian guerrilla) | partido (match) | departamento (state) | FARC (Colombian guerrilla)
Bogotá | Colombia | city (Manchester City) | sector | paz (peace)
Boyacá (Colombian state) | paz (peace) | madrid (Real Madrid) | tránsito (transit) | Maduro (Venezuela's president)
Santa Marta (a city) | Maduro (Venezuela's president) | Santa Fé (football team) | accidente (accident) | país (country)
Colombia | Colombia | junior (football team) | Bogotá | país (country)
gobierno (government) | gobierno (government) | gran (great) | Cundinamarca (Colombian state) | terroristas (terrorists)
nacional (national) | justicia (justice) | mejor (better) | cierre (closure) | colombianos (Colombians)
vehículos (vehicles) | colombianos (Colombians) | américa (football team) | total | pueblo (people)
personas (people) | Uribe (Colombia's ex-president) | nacional (football team) | carril (lane) | justicia (justice)

#10 (Local politics) 84 | #8 76 | #11 (Religion) 73 | #2 (Dialogues) 65 | #16 (Love) 60
Bogotá | más (more) | Dios (God) | más (more) | vida (life)
Peñalosa (Bogotá's mayor) | vida (life) | Señor (Lord) | ser (to be) | día (day)
Colombia | siempre (always) | amor (love) | vida (life) | amor (love)
Santos (Colombia's president) | solo (only) | padre (father) | hijueputa (sonofabitch) | quiero (I want)
Petro (Bogotá's ex-mayor) | mejor (better) | vida (life) | amor (love) | Dios (God)
FARC (Colombian guerrilla) | día (day) | gracias (thanks) | bien (good) | siempre (always)
paz (peace) | nunca (never) | corazón (heart) | solo (only) | corazón (heart)
Uribe (Colombia's ex-president) | cosas (things) | Jesús (Jesus) | alguien (someone) | mejor (better)
gobierno (government) | gente (people) | misericordia (mercy) | mamá (mom) | mierda (shit)
alcalde (mayor) | tiempo (time) | fuerza (strength) | hoy (today) | gracias (thanks)

#7 (Love) 57 | #26 51 | #23 (Love) 50 | #15 (Love) 49 | #9 (Portuguese and French) 49
más (more) | vida (life) | Dios (God) | quiero (I want) | não (no)
vida (life) | mejor (better) | vida (life) | día (day) | est (is)
quiero (I want) | solo (only) | amor (love) | jajaja (laughter) | pas (not)
mierda (shit) | día (day) | siempre (always) | vida (life) | mais (more)
mejor (better) | años (years) | feliz (happy) | mejor (better) | quero (I want)
gente (people) | Colombia | corazón (heart) | hoy (today) | elle (she)
bien (good) | nunca (never) | gracias (thanks) | alguien (someone) | vou (I am going to)
siempre (always) | siempre (always) | tiempo (time) | novio (boyfriend) | minha (my)
amor (love) | mundo (world) | nunca (never) | amo (I love) | hoje (today)
necesito (I need) | Bogotá | personas (people) | bien (good) | melhor (better)

#39 (Sports: football) 48 | #5 (Love) 44 | #31 43 | #1 42 | #29 42
partido (match) | amor (love) | días (days) | quiero (I want) | usted (formal you)
Santa fé (football team) | quiero (I want) | hoy (today) | jajaja (laughter) | quiero (I want)
gol (goal) | solo (only) | bueno (good) | usted (formal you) | vida (life)
hoy (today) | vida (life) | gente (people) | solo (only) | solo (only)
vamos (come on) | amo (I love) | vida (life) | mejor (better) | cosas (things)
bien (good) | dia (day) | ahora (now) | vida (life) | mejor (better)

#13 42 | #38 41 | #28 41 | #27 40 | #20 40
más (more) | más (more) | más (more) | jajaja (laughter) | amo (I love)
tan (so) | mejor (better) | vida (life) | más (more) | más (more)
vida (life) | quiero (I want) | tan (so) | fav | voy (I will)
hoy (today) | tan (so) | alguien (someone) | tan (so) | tan (so)
quiero (I want) | hoy (today) | quiero (I want) | quiero (I want) | quiero (I want)
solo (only) | mamá (mom) | solo (only) | vida (life) | vida (life)

#4 40 | #35 40 | #37 39 | #14 39 | #32 39
más (more) | más (more) | más (more) | más (more) | más (more)
vida (life) | tan (so) | usted (formal you) | quiero (I want) | ser (to be)
alguien (someone) | vida (life) | vida (life) | vida (life) | tan (so)
solo (only) | mierda (shit) | mejor (better) | tan (so) | voy (I will)
persona (person) | solo (only) | amor (love) | mejor (better) | solo (only)
ser (to be) | asi (like this/that) | Dios (God) | Dios (God) | vida (life)

#6 38 | #0 37 | #17 36 | #18 36 | #33 36
más (more) | más (more) | más (more) | más (more) | más (more)
vida (life) | quiero (I want) | quiero (I want) | tan (so) | amor (love)
quiero (I want) | tan (so) | tan (so) | mejor (better) | vida (life)
amor (love) | vida (life) | vida (life) | vida (life) | quiero (I want)
solo (only) | hoy (today) | mejor (better) | quiero (I want) | solo (only)
siempre (always) | día (day) | hoy (today) | jajaja (laughter) | tan (so)

#19 35 | #36 (English) 34 | #12 34 | #3 33 | #21 33
más (more) | like | más (more) | más (more) | más (more)
mejor (better) | love | hoy (today) | quiero (I want) | amor (love)
quiero (I want) | people | día (day) | siempre (always) | vida (life)
amor (love) | want | Dios (God) | vida (life) | solo (only)
vida (life) | get | vida (life) | voy (I will) | quiero (I want)
tan (so) | don't | solo (only) | amor (love) | ser (to be)

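The per-cluster summaries reported in Table 1 (average characters per tweet and most frequent words) can be computed with a sketch like the following; the tiny corpus and cluster assignments below are invented for illustration.

```python
from collections import Counter

# Invented mini-corpus: (cluster id, tweet text) pairs.
tweets = [
    (0, "hoy hay partido y vamos a ganar"),
    (0, "gran gol en el partido de hoy"),
    (1, "quiero más amor en la vida"),
    (1, "la vida es mejor con amor"),
]

# Group tweets by cluster.
by_cluster = {}
for cid, text in tweets:
    by_cluster.setdefault(cid, []).append(text)

# For each cluster: average characters per tweet and most common words.
summary = {}
for cid, texts in by_cluster.items():
    word_counts = Counter(w for t in texts for w in t.split())
    avg_chars = sum(len(t) for t in texts) / len(texts)
    summary[cid] = {"avg_chars": avg_chars,
                    "top_words": word_counts.most_common(3)}
```

On the real corpus one would also strip stop words, mentions and URLs before counting, which Table 1 does not detail here.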
Fig. 6. PCA visualization of the clusters' centroids. The axes have arbitrary units. Each cluster is represented by its numeric identification, as in Table 1. Green numbers show the clusters whose topic is defined; purple numbers show the clusters whose topic was love; the rest are the clusters whose topic is difficult to define. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

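A visualization in the spirit of Fig. 6 reduces to projecting the 40 cluster centroids with PCA; the document vectors here are again a random stand-in for the real embeddings.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)
doc_vectors = rng.normal(size=(400, 16))          # stand-in embeddings
km = KMeans(n_clusters=40, n_init=10, random_state=2).fit(doc_vectors)

# Project the 40 cluster centroids to 2-D; each point can then be drawn
# with its numeric cluster id, as in Fig. 6.
centroids_2d = PCA(n_components=2).fit_transform(km.cluster_centers_)
```

Annotating each projected point with its cluster number (e.g. via matplotlib's `annotate`) yields the centroid map described above.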
6.1. Threats to validity

It is worth noting that our approach does not use other information available on Twitter, such as user account metadata, retweets, likes, URLs, images, location or geo-referenced data (when available): we concentrate our analysis only on the text content. However, this complementary information could add value in representing a user better and, at the same time, could modify the obtained communities. Another issue in our approach is the number of tweets used in the experimentation. We gathered data from Bogotá users over a period of six months, but it is not clear how much data is enough. However, gathering more data is a time-consuming task and demands considerable disk space. We consider that this issue can be addressed in future work by analyzing the impact of the detected communities over a longer period of time.

6.2. Policy implications

It is important to remark that our work can be used by government and private entities to develop cultural, political or educational campaigns, which are the most valuable intangible fields in a society. We believe that the method proposed in this paper is especially useful in developing countries because of the gap, or disconnection, that exists between politicians in the government and the citizenship, since our method easily identifies the main concerns of a society. As an example, during the time we gathered the tweets, every city in Colombia was going through one of the deepest changes in the country because of the peace treaty between the Colombian government and the FARC guerrilla, which our method discovered as one of the main topics discussed in Bogotá. This early information could have been used by the government to explain the treaty specifically to those who were confused or uninformed.

7. Conclusion and future work

This paper presented a method to automatically identify communities using citizens' data from the social network Twitter. To our knowledge, this is the first study that analyzes Twitter data from Bogotá to automatically detect communities. The use of machine learning methods, such as neural networks and dimensionality reduction algorithms, to detect communities is the basis of the proposed method. Results show that the method can find groups of citizens that share common topics such as politics, news, religion, sports and languages, among others. This is an interesting tool that could be used by the local government to support decision-making processes in which what communities express can provide valuable information.

As future work, we want to compare our approach with the communities that can be obtained when the data is modeled as a graph, which will be challenging because the obtained communities will not necessarily be comparable. We also want to explore the use of other data available on Twitter, such as user account metadata, retweets, likes and URLs; it is possible that this complementary user information adds value to the concept of community. In addition, we want to gather more tweets to increase the sample size and analyze how the communities change over time. It would be interesting to analyze how communities are influenced by social phenomena, events and the period of the year.
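The graph-based comparison proposed as future work could start from a sketch like the one below, where the user-interaction edges are hypothetical and NetworkX's greedy modularity maximization is just one of many community detectors.

```python
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

# Hypothetical user-interaction edges (e.g. mentions or retweets):
# two tightly knit triangles joined by a single bridge.
edges = [("ana", "bob"), ("bob", "cam"), ("ana", "cam"),
         ("dan", "eva"), ("eva", "fin"), ("dan", "fin"),
         ("cam", "dan")]
G = nx.Graph(edges)

# Partition the graph into communities by greedily maximizing modularity.
communities = list(greedy_modularity_communities(G))
```

The resulting graph communities could then be compared against the text-based clusters, keeping in mind that the two notions of community need not coincide.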
