Real-time Event Detection on Twitter
A Project Report Submitted in partial fulfilment of the degree of
Master of Computer Applications
By
Shubham Singh (18MCMC52)
Nitesh Rawal (18MCMC12)
School of Computer and Information Sciences,
University of Hyderabad,
Gachibowli, Hyderabad- 500046, India
July 2021
CERTIFICATE
This is to certify that the Project Report entitled “Real-time Event Detection on
Twitter” submitted by Shubham Singh bearing Reg. No. 18MCMC52 and Nitesh
Rawal bearing Reg. No. 18MCMC12 in partial fulfilment of the requirements for the
award of Master of Computer Applications, is a bonafide work carried out by them
under my supervision and guidance.
The Project Report has not been submitted previously in part or in full to
this or any other University or Institution for the award of any degree or diploma.
Dr. Rajendra Prasad Lal
Assistant Professor
School of Computer and Information Sciences,
University of Hyderabad
Dean,
Dr. Chakravarthy Bhagvati
School of Computer and Information Sciences, University of
Hyderabad
DECLARATION
We, Shubham Singh and Nitesh Rawal, hereby declare that this dissertation entitled
“Real-time Event Detection on Twitter”, submitted by us under the guidance and
supervision of Dr. Rajendra Prasad Lal, is a bonafide work. We also declare that it has not
been submitted previously in part or in full to this or any other University or
Institution for the award of any degree or diploma.
Date:
Signature of Student: Signature of Student:
Shubham Singh Nitesh Rawal
Reg. No.: 18MCMC52 Reg. No.: 18MCMC12
ACKNOWLEDGEMENTS
We would like to take this opportunity to express our gratitude to the people who
have been instrumental in the successful completion of this project.
> We feel great pleasure to express our deep sense of gratitude to Dr.
Rajendra Prasad Lal, our supervisor, who generously showed his continuous
support and guidance throughout our project.
> We would like to thank all the Faculty of SCIS, AI Lab staff, and non-teaching
staff of the University of Hyderabad for their cooperation.
> We are grateful to Dr. Chakravarthy Bhagvati, Dean of the SCIS (UoH), and the
University of Hyderabad for providing the facilities and extending their
cooperation during the course.
> Last but not least, we express our deep sense of gratitude to our family members
and our friends, who have been a constant source of inspiration during the
preparation of this project work.
Shubham Singh
Nitesh Rawal
Abstract
In the last few years, Twitter has become a popular platform for sharing opinions,
experiences, news, and views in real time. Twitter presents an interesting opportunity for
detecting events happening in the real world. The content (tweets) published on
Twitter is short and poses diverse challenges for detecting and interpreting event-related
information. Twitter can produce rich data streams for immediate insights into ongoing
matters and the conversations around them. Various techniques are available in previous
work for detecting an event from a Twitter data stream, such as feature pivot, document
pivot, topic modeling, and unspecified and specified event detection. In our work, we use
entity extraction, filtering, computing similarity between entities, and entity clustering to
get the final result as the events detected from the Twitter data stream.
Contents
Certificate
Declaration
Acknowledgement
Abstract
List of Tables
List of Figures
Chapter 1 INTRODUCTION
1.1 What is Twitter and why should we use it
1.2 Social network analysis
1.3 Event and application of event
1.4 Novelty of proposed method
1.5 Thesis Structure
Chapter 2 RELATED WORK
Event detection technique
Challenges
Chapter 3 METHODOLOGY
3.1 Framework
3.2 Methodology
3.3 Algorithms
Chapter 4 EXPERIMENT & RESULTS
4.1 Evaluation data set
4.2 Entity extraction and filtering
4.3 Computing similarity
4.4 Graph generation
4.5 Entity clustering
Chapter 5 Conclusion and Future Work
5.1 Conclusion
5.2 Future work
Bibliography
List of Tables
Table No. | Table Title
1 | Pre-processed data
2 | NER Types and Examples
3 | Example tweet text
4 | Encoding of entities
List of Figures
Figure No. | Figure Title
1 | Example of events categories
2 | Feature-pivot paradigm for event detection
3 | Document-pivot event detection
4 | Example of NER
5 | Cosine similarity formula
6 | Modularity computation formula
7 | Clustering service design
8 | Raw data
9 | Tweet text with named entities and hashtag
10 | List of keys
11 | Tweet dictionary
12 | Total number of entities
13 | Inverted index between tweet and set of entities
14 | Cosine similarity between entities
15 | Cluster of entities
16 | Number of events in graph
17 | Total number of events
Chapter 1
INTRODUCTION
1.1 What is Twitter and why should we use it?
Twitter is a ‘microblogging’ system that allows you to send and receive short posts called
‘tweets’. Tweets can be up to 280 characters long and can include links to relevant websites
and resources. Twitter users follow other users. If you follow someone you can see their
tweets in your twitter ‘timeline’. You can choose to follow people and organizations with
similar academic and personal interests to you.
You can create your own tweets or you can retweet information that has been tweeted by
others. Retweeting means that information can be shared quickly and efficiently with a large
number of people. Twitter has become increasingly popular with academics as well as
students, policymakers, politicians and the general public. Many users struggled to
understand what Twitter is and how they could use it, but it has now become the social media
platform of choice for many. The snappy nature of tweets means that Twitter is widely used
by smartphone users who don’t want to read long content items on-screen.
Twitter is perhaps the most popular microblogging platform and social networking service on
which users can post and read short text messages known as tweets, which are written in a
unique conversational style particular to the brevity of the Twitter medium. There are
approximately 500 million tweets a day, or about 6K tweets per second on average, on the
complete Twitter Firehose stream of all tweets. Millions of users share information
on different aspects of everyday life through these social networking services. Information
shared on these platforms ranges from personal status (i.e., opinions, emotions, pointless
babble) to newsworthy events, as well as updates and discussions on these events. Due to
the informal nature of tweets, and the ease with which they can be posted, Twitter users
can be faster in covering an event than the traditional news media, so an event
sometimes appears on Twitter before it reaches the mainstream media.
Social networks are being increasingly used for news by both journalists and consumers alike.
For journalists, they are a key way to distribute news and engage with audiences: more than
half of journalists stated that social media was their preferred mode of communication
with the public [1]. Journalists also use social media frequently for sourcing news
stories because they “promise faster access to elites, to the voice of the people, and to regions
of the world which are difficult to access” [2]. For consumers, according to a recent survey,
social networks have surpassed print media as their primary source of news.
Factoring in journalists’ predilection for breaking stories on social media, audiences often turn
to social networks to discover what is happening in the world.
1.2 Social Network Analysis
Social network analysis (SNA) is the process of investigating social structures through the use
of networks and graph theory [3]. It characterizes networked structures in terms of nodes
(individual actors, people, or things within the network) and the ties, edges, or links
(relationships or interactions) that connect them.
Examples of social structure commonly visualized through social-network analysis include
social media networks, memes spread, information circulation, friendship and acquaintance
networks, business networks, knowledge networks, difficult working relationships, social
networks, collaboration graphs, and disease transmission. It is a visual representation of social
links in which nodes are represented as points and ties are represented as lines. These
visualizations provide a means of qualitatively assessing networks by varying the visual
representation of their nodes and edges to reflect attributes of interest.
SNA is the practice of representing networks of people as graphs and then exploring these
graphs. A typical social network representation has nodes for people, and edges connecting
two nodes to represent one or more relationships between them. The resulting graph can
reveal patterns of connection among people. Small networks can be represented visually, and
these visualizations are intuitive and may make apparent patterns of connections, and reveal
nodes that are highly connected or which play a critical role in connecting groups together.
As the network representation of a community grows, it becomes necessary to apply graph
analytic techniques to compute the characteristics of nodes and the graph as a whole [4].
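As a small illustration of computing node characteristics on such a graph, the sketch below stores an undirected social graph as an adjacency dictionary and ranks people by degree (number of connections). The names and edges are made up for the example and are not data from this project.

```python
from collections import defaultdict


def build_graph(edges):
    """Build an undirected graph as {node: set of neighbours}."""
    graph = defaultdict(set)
    for a, b in edges:
        graph[a].add(b)
        graph[b].add(a)
    return graph


def degree_ranking(graph):
    """Return nodes sorted by degree (highest first), ties broken by name."""
    return sorted(graph, key=lambda node: (-len(graph[node]), node))


# Hypothetical friendship edges, for illustration only.
edges = [("asha", "ravi"), ("asha", "meena"), ("asha", "john"), ("john", "kiran")]
graph = build_graph(edges)
ranking = degree_ranking(graph)
print(ranking[0], len(graph[ranking[0]]))
```

Degree is only the simplest such measure; other graph analytic techniques follow the same pattern of computing a score per node from the adjacency structure.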
1.3 Event
An event is something that happens at a specific time and place, along with its unavoidable
consequences. Specific elections, accidents, crimes and natural disasters are examples of
events [5]. An activity is defined as a connected set of actions that have a common focus or
purpose. Specific campaigns, investigations, and disaster relief efforts are examples of
activities.
A news event is any event (something happening at a specific time and place) of
interest to the (news) media. Any such event can also be considered a single episode in a
larger story.
Example: A speech at a rally might be an event, but it is an episode in a larger context: a
presidential election. The term episode is used to mean any such event, and saga to refer to
the collection of events related within a broader context.
Fig-1 Example of event categories [6]
Application of Event Detection
In general, in order to detect an event, one must be actively looking for it. In other words,
the correct data must be collected, examined, and analyzed in order for an event to be
detected. Given that it is still difficult to detect an event even while actively seeking it, the
chances are slim that an event will be detected by happenstance alone. Therefore, it makes
sense that the primary purposes of event detection are monitoring, surveillance, and
management of systems and processes [6]
This section examines example event detection applications and classifies and organizes these
applications according to their problem domains.
1. Identifying particularly important plays during sporting events such as cricket, football,
volleyball, and hockey, using tweets posted during the games.
2. Identifying earthquakes or other natural disasters from session data, social media, or
online news texts. Event detection is also used in other areas, such as identifying
hardware issues or changes in hardware allocation in cloud-based systems.
3. Detecting financial market crises from tweets together with financial data such as stock
prices.
4. Detecting cyber security incidents from graphs of bandwidth utilization for a
particular server, where the graph line shows the bandwidth used by that server.
Health Monitoring and Management
In healthcare applications, the detection and prediction of conditions or events is also of extreme
importance. The Centers for Disease Control and Prevention (CDC), for instance,
continuously monitors medical and public health information from physicians and hospitals
across the country. This practice is necessary for the earliest possible detection of viruses and
diseases like Covid-19. Any detected widespread illness must be contained by quarantining
and treating the afflicted individuals. The objective is to prevent further spreading of the
illness so that it does not result in an epidemic or, worse, a pandemic. The early detection of
disease within individual patients presents another health management issue, addressed by
screening and monitoring programs for the early detection of diseases such as Covid-19.
Environmental Monitoring and Prediction
Environmental monitoring and prediction is another common area for the application of
event detection methods. The earth's environment can be extremely violent, and early
warnings of impending natural disasters such as tornadoes, hurricanes, tsunamis,
earthquakes, floods, and volcanic eruptions are critical for the safety and security of
populations within the affected regions. For example, Hurricane Ike devastated the
city of Galveston, Texas, in 2008. Due to the influence of early detection and warning systems, the
majority of the populace was safely evacuated prior to hurricane landfall [MSNBC News
Service, 2008]. Additionally, the contamination of natural resources, whether it be by natural
or man-made (e.g., terrorist) causes, is another area of concern. Potable water, for instance,
is continuously monitored by water utilities for purity and potential contaminants.
Safety and Security
Safety and security applications are other areas that utilize event detection methods. For
instance, physical intrusion detection and fire safety are of critical importance to businesses
and homeowners. Automobile, home, and corporate security alarm systems deter potential
thefts and mischievous acts. Furthermore, fire, smoke, and carbon monoxide alarm systems
increase survivability in the event of a fire or buildup of toxic gas. Developing a prediction
method for 9-1-1 call volumes can aid emergency service providers in service planning and
recognition of anomalous calls.
Business Process Optimization
Manufacturers rely heavily upon event detection methods to reduce overall maintenance
costs and ensure compliance with requirements. Manufacturing and condition-based
maintenance is one example. In industrial plants, engineers are concerned with identifying
machines or processes that are in need of repair or adjustment. Business process compliance
is another issue. In food and drug manufacturing, strict regulatory requirements obligate
companies to certify that their products do not exceed specific environmental parameters
during processing. Furthermore, today's fast-paced and constantly evolving high-tech
business environment is extremely demanding on organizations and requires detailed
visibility into real-time business activities. Thus, the ability to efficiently detect correlated
business process events that represent opportunities or problems to the organization and
require quick action from the decision-maker is paramount.
This work has resulted in a production application that solves various product
use cases and improves metrics related to user exploration of ongoing events.
Novel contributions:
1. Event evolution over time - Online social post streams such as Twitter
timelines and forum discussions have emerged as important channels for information
dissemination. They are noisy, informal, and surge quickly. Real-life events, which
happen and evolve every minute, are perceived and circulated by
social users. Representing events as a chain of clusters over time is a powerful
abstraction. Moreover, we are able to track these cluster chains. The
analysis of clusters yields insights about sub-events and audience interest shifts. We
highlight a case study of a high-profile event on Twitter.
2. Differentiated focus on quality of clustering - The authors of [7] introduce new metrics for
clustering quality that we believe can help ground subsequent efforts in this
space. First, we quantitatively evaluate the quality of detected events. Our model produces
the clusters of tweets that represent events. We are interested in those
that are the most popular events on Twitter. For each method, we rank the detected
events based on the number of tweets assigned to them and then pick the top events
for each method.
3. Novel real-time system design - The authors' contribution [7] is based on the
realization that we can decompose burst detection and clustering into separate
components that can be scaled independently. Through system profiling, they
demonstrate the scalability and resilience of the [...]
combined detection and clustering of bursts
in real time in a sequential pipeline.
Our system design is based on the realization that we can decompose each of the separate
components of the cluster service design. Our system performs similarity computation
and entity clustering by relying on the output of similar types of entities, and shows the final
result as the number of events detected, where the graph shows clusters of similar entities.
1.5 Thesis Structure
Chapter 2 describes related work on the problem of event detection, where the most
significant event detection techniques are categorized as feature-pivot,
document-pivot, and topic modeling.
Chapter 3 describes the methodology used in the clustering
service design, whose stages are entity extraction, similarity
computation, similarity filtering, and entity clustering.
Chapter 4 describes the experiments and results on our evaluation data set.
Chapter 5 presents the conclusion and future work of Real-time Event
Detection on Twitter. Finally, the bibliography lists the references used in our
project.
Chapter 2
RELATED WORK
There are different survey papers in the literature that summarize different types of
event detection techniques. These techniques can be categorized as document-pivot
methods, where the primary features are tweets, and feature-pivot methods, where the
primary features are the essential keywords, hashtags and other user-level features such as
language-specific or user-specific information. The survey in [19] summarizes the event
detection models for Twitter in terms of the different techniques used: term interestingness-
based approaches, topic modelling-based approaches, incremental clustering-based
approaches, and other miscellaneous approaches. Furthermore, [19] categorizes the
different event detection techniques in terms of event detection types, i.e., specified and
unspecified event detection, which use supervised, unsupervised, and semi-supervised
approaches.
According to survey [11] the most significant event detection techniques can be broadly
categorized as either feature-pivot (FP) or document-pivot (DP) methods.
Feature-pivot: These are based on detecting abnormal patterns in the appearance of
features, such as words. Once an event happens, the observed frequency of a feature will be
abnormal compared to its historical behavior, indicating a potential new event.
Document-pivot: This category comprises methods that represent documents as items to be
clustered using some similarity measure.
Topic modeling: This includes methods that utilize statistical models to identify events as
latent variables in the documents.
2.1 Feature-pivot methods
Feature-pivot methods rely on the detection of text features
that are likely to refer to events. These feature-pivot techniques, proposed originally for the
analysis of time-stamped document streams, consider an event as a bursty activity that makes
some of the text features more prominent. The intuition behind these methods is that once
an event emerges, certain features related to it will exhibit a similar abnormal rise in their
frequency. The type of feature can range from single keywords, named entities and phrases to
social interactions. Traditionally, the bursty feature distributions are detected, and finally, the
discovery of events is conducted by grouping features that exhibit a similar behavior. This process is
depicted in Figure 2. According to this, an event is represented as a number of features
showing an abnormality in appearance counts. The initial documents are assigned to the
detected events, based mainly on the appearance of the event-related features in them.
These techniques cannot work in a pure real-time fashion, as they need knowledge of
the behavior of feature frequencies over a certain period of time. To overcome that limitation,
an incremental approach is usually followed, detecting events on predefined timeslots.
Another drawback of feature-pivot approaches is that, as they depend on the detection of
bursty activity, they typically capture trends. Thus events that are not trendy and do not
attract a lot of attention are likely to be missed [9].
Fig. 2 Feature-pivot paradigm for event detection [9]
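The burst-detection idea behind feature-pivot methods can be sketched as follows. This is a minimal illustration, not the method of [9]: it assumes feature counts are already grouped into timeslots, and flags a feature as bursty when its count in the current slot is far above its historical mean (a z-score test). The function name, the `threshold` and `min_count` parameters, and the sample counts are all assumptions made for the example.

```python
from collections import Counter
from statistics import mean, pstdev


def bursty_features(history, current, threshold=2.0, min_count=3):
    """Return features whose current count is abnormally high.

    history: list of Counter objects, one per past timeslot (must be non-empty)
    current: Counter for the latest timeslot
    """
    bursty = []
    for feature, count in current.items():
        past = [slot.get(feature, 0) for slot in history]
        mu = mean(past)
        sigma = pstdev(past)
        # Assumption: a perfectly flat history is treated as sigma = 1
        # to avoid dividing by zero.
        z = (count - mu) / (sigma if sigma > 0 else 1.0)
        if count >= min_count and z >= threshold:
            bursty.append(feature)
    return sorted(bursty)


# Hypothetical keyword counts per timeslot.
history = [
    Counter({"match": 5, "wicket": 1}),
    Counter({"match": 6}),
    Counter({"match": 5}),
]
current = Counter({"match": 6, "wicket": 9, "roy": 2})
print(bursty_features(history, current))
```

Because the test needs a history of past timeslots, the sketch also shows why these methods work incrementally on predefined timeslots rather than in a pure real-time fashion.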
2.2 Document-pivot methods:
This section discusses methods that detect events by clustering documents on the basis of
their semantic similarity. Document-pivot approaches, originating from the Topic
Detection and Tracking (TDT) task [17], can be seen as a clustering problem. Both retrospective
event detection (RED) and first story detection (FSD) approaches
have been proposed. In both cases, the underlying principle is quite similar: documents are
modeled in a way that captures their semantic content, and then a clustering algorithm is
applied to group them into events. What differs between approaches is mainly the
characteristics of the clustering approach, the way that documents are mapped to feature
vectors, and the similarity metric used to identify whether two documents are from the same
event [9].
Fig. 3. Document-pivot event detection [9]
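A document-pivot detector can be sketched as single-pass incremental clustering. The example below is an illustration under simple assumptions, not the algorithm of [9]: documents are mapped to bag-of-words count vectors, similarity is cosine similarity, and a document joins the most similar existing cluster if the similarity reaches a `threshold` (a hypothetical parameter), otherwise it starts a new cluster, which plays the role of a new event.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(a[term] * b[term] for term in a if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cluster_documents(docs, threshold=0.5):
    """Assign each document to the closest cluster or open a new one."""
    clusters = []  # each cluster: {"centroid": Counter, "docs": [indices]}
    for index, text in enumerate(docs):
        vector = Counter(text.lower().split())
        best, best_sim = None, 0.0
        for cluster in clusters:
            sim = cosine(vector, cluster["centroid"])
            if sim > best_sim:
                best, best_sim = cluster, sim
        if best is not None and best_sim >= threshold:
            best["docs"].append(index)
            best["centroid"].update(vector)  # summed term counts act as centroid
        else:
            clusters.append({"centroid": vector, "docs": [index]})
    return [cluster["docs"] for cluster in clusters]


docs = [
    "earthquake hits city centre",
    "strong earthquake hits city",
    "india wins cricket match",
    "cricket match india wins again",
]
print(cluster_documents(docs))
```

The three design choices named in the text appear here as the vector mapping (`Counter` of words), the similarity metric (`cosine`), and the clustering rule (threshold on the best match).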
2.3 Topic modeling method:
This section describes approaches based on probabilistic models that detect events in social
media documents in a similar way to how topic models identify latent topics in text documents.
Originally, topic models relied on word occurrence in text corpora to model latent topics, with
each document treated as a mixture of the identified set of topics. Latent Dirichlet Allocation (LDA) [19],
which is the most widely known probabilistic topic modeling technique, is a hierarchical Bayesian
model where the topic distribution is assumed to have a sparse Dirichlet prior.
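The generative process that LDA assumes can be written compactly. The notation below (α, β, θ, φ, z, w) is the conventional one from the topic-modeling literature and is an assumption here, since this report does not define symbols for it.

```latex
\begin{aligned}
\theta_d &\sim \mathrm{Dirichlet}(\alpha) && \text{topic mixture of document } d \\
\varphi_k &\sim \mathrm{Dirichlet}(\beta) && \text{word distribution of topic } k \\
z_{d,n} &\sim \mathrm{Categorical}(\theta_d) && \text{topic of the } n\text{-th word of } d \\
w_{d,n} &\sim \mathrm{Categorical}(\varphi_{z_{d,n}}) && \text{observed word}
\end{aligned}
```

A small α makes the Dirichlet prior sparse, so each document concentrates on a few topics; in event detection the inferred topics are read as candidate events.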
2.4 A Visual Backchannel for Large-Scale Events on Twitter:
The Visual Backchannel [12] is a novel way of following and exploring online
conversations about large-scale events. Microblogging services such as Twitter are increasingly used as digital
backchannels for timely exchange of brief comments and impressions during political
speeches, sport competitions, natural disasters, and other large events. Currently, shared
updates are typically displayed in the form of a simple list, making it difficult to get an
overview of the fast-paced discussions as they happen in the moment and how they evolve over
time. In contrast, the Visual Backchannel design provides an evolving, interactive, and multi-
faceted visual overview of large-scale ongoing conversations on Twitter. To visualize a
continuously updating information stream, it includes visual saliency for what is
happening now and what has just happened, set in the context of the evolving conversation.
As part of a fully web-based coordinated-view system, it introduces Topic Streams,
a temporally adjustable stacked graph visualizing topics over time, a People Spiral
representing participants and their activity, and an Image Cloud encoding the popularity of
event photos by size. Together with a post listing, these mutually linked views support cross-
filtering along topics, participants, and time ranges. Design considerations are also
discussed in [12], in particular with respect to evolving visualizations of dynamically changing data.
2.5 Unspecified versus specified event detection
Depending on the available information on the event of interest, event detection can be
classified into specified and unspecified techniques. Because no prior information is available
about the event, unspecified event detection techniques rely on the temporal signal of
Twitter streams to detect the occurrence of a real-world event. These techniques typically
require monitoring for bursts or trends in Twitter streams, grouping the features with
identical trends into events, and ultimately classifying the events into different categories.
On the other hand, specified event detection relies on specific information and features
that are known about the event, such as place, time, type, and description, which are
provided by the user or from the event context. These features can be exploited by adapting
traditional information retrieval and extraction techniques (such as filtering, query generation
and expansion, clustering, and information aggregation) to the unique characteristics of
tweets [13].
2.6 Challenges
There are various challenges to the problem of event detection from the Twitter data stream.
The first is scale. The second is the brevity of the Twitter medium (280-character
limit per tweet). The third is noise: many of the tweets on the platform
are unrelated to events, and even those that are related can include irrelevant terms. The
fourth is the dynamic nature of what is discussed on the platform.
Chapter 3
Methodology
3.1 Framework
The clustering service design describes the end-to-end framework that outputs entity clusters
from the Twitter stream of data. Each component in our system handles the operation
of a specific stage, and we have separated the components so that each can be scaled independently at
its part of the pipeline in order to improve overall output.
The key advantage of this end-to-end framework is its modular composition. To describe this
clustering service design, we first start with important terminology and then go through each
of the components.
3.1.1 Terminology
Entity -
Entities provide metadata and additional contextual information about content posted on
Twitter. The entities section provides arrays of common things included in Tweets: hashtags,
user mentions, links, stock tickers (symbols), Twitter polls, and attached media.
Examples used in this work include named entities and hashtags, but we can extend to other
entity types such as user IDs or URLs.
Cluster -
A set of entities and their associated metadata (e.g., entity frequency count).
Clustering is the task of dividing the population or data points into a number of groups such
that data points in the same group are more similar to each other
than to those in other groups.
Cluster Chain -
A list of clusters over time that are related to the same ongoing event.
The term is also used for a communication pattern in which an individual acts as the source of a
message and transmits it to a pre-selected group of individuals, a few of whom pass the same
message on to other selected groups.
Event -
An event is something that happens at some specific time and place, along with its unavoidable
consequences. Specific elections, accidents, crimes and natural disasters are examples of
events. In this framework, an event is a cluster chain along with any metadata (e.g. detected start time, end time).
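The terminology above maps naturally onto a few small data structures. The sketch below is one possible representation, not the project's actual code: the class and field names (`Cluster`, `Event`, `window_start`, `entity_counts`) are hypothetical, and the start and end times are simply taken from the first and last cluster in the chain.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Dict, List, Optional


@dataclass
class Cluster:
    """Entities detected together in one time window, with frequency counts."""
    window_start: datetime
    entity_counts: Dict[str, int]


@dataclass
class Event:
    """A cluster chain (clusters over time) plus metadata derived from it."""
    chain: List[Cluster] = field(default_factory=list)

    def add(self, cluster: Cluster) -> None:
        # Keep the chain ordered by time so start and end are easy to read off.
        self.chain.append(cluster)
        self.chain.sort(key=lambda c: c.window_start)

    @property
    def start_time(self) -> Optional[datetime]:
        return self.chain[0].window_start if self.chain else None

    @property
    def end_time(self) -> Optional[datetime]:
        return self.chain[-1].window_start if self.chain else None


# Illustrative values only.
event = Event()
event.add(Cluster(datetime(2021, 5, 9, 16, 30), {"jason roy": 12, "#indvseng": 30}))
event.add(Cluster(datetime(2021, 5, 9, 16, 20), {"#indvseng": 18}))
print(event.start_time, event.end_time)
```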
3.1.2 Entity Extraction and Filtering
For entity extraction, we discuss extracting tweets, data
preprocessing, and performing Named Entity Recognition (NER) within the raw text data.
The categorization scheme that these entities are sorted into can differ from one system to another.
Entities can be categorized into groups as broad as People, Organizations, and Places.
Here are some example entity types that are extracted from each tweet:
* Named entities - e.g. “Jason roy”
* Hashtags - e.g. “#indiavseng”
Before it is possible to extract entities from a text, there are several preprocessing tasks that
need to be completed.
Extraction of tweet data using Tweepy:
Twitter allows us to mine the data of any user using the Twitter API, for example through Tweepy. The data will be
tweets extracted from the user. The first thing to do is get the consumer key, consumer secret,
access key and access secret from the Twitter developer section, available to each user. These keys
are used by the API for authentication.
Steps to obtain keys:
* Log in to the Twitter developer section.
* Go to “Create an App”.
* Fill in the details of the application.
* Click on “Create your Twitter Application”.
* Details of your new app will be shown along with the consumer key and consumer secret.
* For the access token, click “Create my access token”. The page will refresh and generate the
access token.
Tweepy is a Python library that can be installed using pip. In order to authorize our
app to access Twitter on our behalf, we need to use the OAuth (authorization) interface.
Tweepy provides the convenient Cursor interface to iterate through different types of objects.
Twitter allows a maximum of 3200 tweets to be extracted from a user's timeline.
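The extraction step can be sketched as below. This is a hedged illustration rather than the project's script: it assumes Tweepy is installed, that valid credentials are passed in, and that the account still has access to the timeline endpoint. The helper names `tweet_to_record` and `fetch_user_tweets` are hypothetical. Newer Tweepy releases name the authentication class `OAuth1UserHandler`; `OAuthHandler` is the name used in the versions contemporary with this report.

```python
from types import SimpleNamespace


def tweet_to_record(status):
    """Keep only the four fields used later: ids, creation time and text."""
    # `full_text` is present when tweets are requested with tweet_mode="extended".
    text = getattr(status, "full_text", None) or status.text
    return {
        "tweet_id": status.id,
        "user_id": status.user.id,
        "created_at": status.created_at,
        "tweet_text": text,
    }


def fetch_user_tweets(screen_name, consumer_key, consumer_secret,
                      access_token, access_token_secret, limit=3200):
    """Download up to `limit` tweets from one user's timeline."""
    import tweepy  # third-party: pip install tweepy

    auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_token, access_token_secret)
    api = tweepy.API(auth, wait_on_rate_limit=True)

    # Cursor handles pagination over the user timeline.
    cursor = tweepy.Cursor(api.user_timeline, screen_name=screen_name,
                           count=200, tweet_mode="extended")
    return [tweet_to_record(status) for status in cursor.items(limit)]


# Offline check with a stand-in object shaped like a Tweepy Status (no network).
sample = SimpleNamespace(id=1, user=SimpleNamespace(id=2),
                         created_at="Sun May 09 16:26:46 +0000 2021",
                         full_text="Jason Roy departs early")
record = tweet_to_record(sample)
print(record["tweet_text"])
```

Importing `tweepy` inside `fetch_user_tweets` keeps the record-shaping helper usable without the library or credentials.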
Data Pre-Processing:
Once the tweet data has been gathered from Twitter, it needs to be prepared. Social media data
is unstructured and needs to be cleaned before using it.
Preprocessing a Twitter dataset involves a series of tasks such as removing all types of irrelevant
information like RT, user mentions, links, special characters, digits and extra blank spaces. It
can also involve making format improvements and deleting duplicate tweets or tweets that are
shorter than three characters.
Remove ‘RT’, user mentions and links:
In the tweet text, we can usually see that many tweets contain a retweet marker
(‘RT’), a user mention or a URL. Because these are repeated across a lot of tweets and do not
give us any useful information about events, we can remove them.
Convert tweets to lower case:
This brings all tweets to a consistent form. By performing this, we can ensure that further
transformations and classification tasks will not suffer from inconsistency or case-sensitivity
issues in our data.
Remove numbers:
Likewise, numbers rarely carry useful information about an event, so it is also common practice to remove
them from the tweet text.
Remove punctuation marks and special characters:
Because these generate tokens with a high frequency that will cloud our analysis, it is
important to remove them.
Removing stop words:
Stop words are function words that occur very frequently across all tweets. There is no
need to analyze them because they do not provide useful information. We can obtain a list
of these words from the NLTK stop words corpus.
After preprocessing, all types of irrelevant information have been removed from the tweet data, and
we keep only the tweet id, user id, created time and tweet text, which we store in an input text
file.
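The cleaning steps above can be sketched with the standard library alone. This is an illustrative version, not the project's script. Two assumptions are made: the `#` character is kept so that hashtags survive as entity candidates, and a tiny hard-coded stop-word set stands in for the NLTK English stop-word list so the example runs without downloads.

```python
import re
import string

# Stand-in for NLTK's English stop-word list (assumption, for a self-contained example).
STOP_WORDS = {"a", "an", "the", "is", "are", "to", "of", "in", "on", "and", "for", "at"}

# All punctuation except '#', which is kept so hashtags remain recognizable.
PUNCTUATION = string.punctuation.replace("#", "")


def clean_tweet(text, stop_words=STOP_WORDS):
    """Apply the preprocessing steps in order and return the cleaned text."""
    text = re.sub(r"\bRT\b", " ", text)                 # retweet marker
    text = re.sub(r"@\w+", " ", text)                   # user mentions
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # links
    text = text.lower()                                 # consistent case
    text = re.sub(r"\d+", " ", text)                    # numbers
    text = text.translate(str.maketrans(PUNCTUATION, " " * len(PUNCTUATION)))
    tokens = [tok for tok in text.split() if tok not in stop_words]
    return " ".join(tokens)                             # also collapses extra spaces


raw = "RT @BCCI: Jason Roy departs! Early wicket for India https://t.co/abc #INDvsENG"
print(clean_tweet(raw))
```

Links are removed before punctuation so that a URL is dropped as one unit rather than broken into fragments.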
To learn about tweet objects, refer to the Twitter developer documentation.
Tweet_id | User_id | Created_at | Tweet_text
1391429449309212 | 1391429450034724 | Sun May 09 16:26:46 +0000 2021 | jason roy depart bowl bhuvi bhuvneshwar kumar england early wicket india #indvseng #indiavsengland #bhuvi
Table-1 Pre-processed data
Named Entity Recognition (NER):
NER, short for Named Entity Recognition, is a standard Natural Language Processing problem
which deals with information extraction. The primary objective is to locate and classify named
entities in text into predefined categories such as the names of persons, organizations,
locations, events, expressions of time, quantities, monetary values, percentages, etc.
Example:
Fig-4 Example of NER
Recognizing named entities in a large corpus can be a challenging task, but NLTK has a built-in
method ‘nltk.ne_chunk()’ that can recognize the entity types shown in the table below:
NE Type | Examples
ORGANIZATION | Georgia-Pacific Corp., WHO
PERSON | Eddy Bonte, President Obama
LOCATION | Murray River, Mount Everest
DATE | June, 2020-06-29
TIME | two fifty a.m, 1:30 p.m.
GPE | South-East Asia, Midlothian
Table-2 NER Types and examples
To recognize named entities using NLTK, we first import the nltk library. Next, we tokenize the
sentence using the method word_tokenize() and tag each word with its
part-of-speech tag using pos_tag(). The final step is to use ne_chunk() to recognize each
named entity in the sentence.
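The three NLTK calls named above can be combined as in the following sketch. The helper names collect_entities and extract_named_entities are hypothetical and chosen only for this illustration. The sketch also assumes that the NLTK tokenizer, tagger and chunker resources have already been fetched with nltk.download(); the exact resource names depend on the NLTK version.

```python
def collect_entities(chunks):
    """Return (entity text, entity type) pairs from the output of nltk.ne_chunk()."""
    entities = []
    for chunk in chunks:
        # Named-entity subtrees expose label(); ordinary tokens are plain (word, tag) tuples.
        if hasattr(chunk, "label"):
            words = [word for word, _tag in chunk.leaves()]
            entities.append((" ".join(words), chunk.label()))
    return entities


def extract_named_entities(sentence):
    """Tokenize, POS-tag and chunk a sentence, then collect its named entities."""
    import nltk  # imported lazily so collect_entities() works without NLTK installed

    tokens = nltk.word_tokenize(sentence)   # step 1: tokenization
    tagged = nltk.pos_tag(tokens)           # step 2: part-of-speech tagging
    tree = nltk.ne_chunk(tagged)            # step 3: named-entity chunking
    return collect_entities(tree)
```

Calling extract_named_entities() on a tweet text returns a list of pairs such as an entity string together with its NE type from Table-2; which entities are found depends on the NLTK models in use.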
3.1.5 Compute Similarities
A commonly used approach to match similar documents is based on counting the
number of words the documents have in common. But this approach has an inherent flaw:
as the size of the documents increases, the number of common words tends to increase
even if the documents talk about different topics. Cosine similarity helps overcome this
fundamental flaw in the ‘count-the-common-words’ approach, and we apply it to the remaining filtered entities.
To do so, we build an inverted index matrix in which the entities are the rows and the tweets are the
columns. If an entity occurs in a tweet's text, we add 1 to the entity's vector; otherwise we add 0.
Cosine Similarity
Cosine similarity is a measurement that quantifies the similarity between two vectors.
It is the cosine of the angle between the vectors. The vectors are typically
non-zero and lie within an inner product space.
The cosine similarity is described mathematically as the dot product of the
vectors divided by the product of their Euclidean norms (magnitudes).
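$$\text{similarity} = \cos(\theta) = \frac{A \cdot B}{\|A\| \, \|B\|} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2} \, \sqrt{\sum_{i=1}^{n} B_i^2}}$$
Fig-5 Cosine similarity formula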
Mathematically, it measures the cosine of the angle between two vectors projected in a
multi-dimensional space. In this context, the two vectors are arrays containing
the word counts of two documents.
> For such non-negative count vectors, cosine similarity is bounded between 0 and 1.
> The similarity measurement is the cosine of the angle between the two
non-zero vectors A and B.
> Suppose the angle between the two vectors is 90 degrees. In that case, the cosine
similarity has a value of 0; this means that the two vectors are orthogonal, or
perpendicular, to each other.
> As the cosine similarity gets closer to 1, the angle between the
two vectors A and B gets smaller.
The vector representations of the documents can then be used within the cosine similarity
formula to obtain a quantification of similarity.
In the scenario described above, a cosine similarity of 1 implies that the two documents are
exactly alike, and a cosine similarity of 0 points to the conclusion that there are no
similarities between the two documents.
Cosine Similarity Example
Step 1: First we obtain a vectorized representation of the texts
Let us take the following example tweets to illustrate further
Tweet ID | Text
1 | jason roy has depart bowled by Bhuvneshwar england early wicket are Falling now
2 | #indiavsengland Jason roy shocked by inswing ball of bhuvneshwar
3 | Sanju samson awesome catch upfront Team india in a high pressure game #indiavsengland
4 | bhuvneshwar man of the match results India win the game by 20 runs #indiavsengland
5 | #indiavsengland looking forward to one more thrilling game as well from india
Table-3 Example Tweet text
We can represent the co-occurrences for entities as seen below:
Entity | Tweet 1 | Tweet 2 | Tweet 3 | Tweet 4 | Tweet 5
#indiavsengland | 0 | 1 | 1 | 1 | 1
bhuvneshwar | 1 | 1 | 0 | 1 | 0
Table-4 Encoding of entities
The vectors for #indiavsengland and bhuvneshwar are the corresponding rows:
#indiavsengland = [0,1,1,1,1]
bhuvneshwar = [1,1,0,1,0]
Step 2: Find the cosine similarity
> Cosine similarity for two entities X and Y:
$$\text{similarity} = \cos(X, Y) = \frac{X \cdot Y}{\|X\| \, \|Y\|}$$
For example, with X = #indiavsengland and Y = bhuvneshwar:
$$X \cdot Y = 0 \cdot 1 + 1 \cdot 1 + 1 \cdot 0 + 1 \cdot 1 + 1 \cdot 0 = 2$$
$$\|X\| = \sqrt{0^2 + 1^2 + 1^2 + 1^2 + 1^2} = \sqrt{4} = 2$$
$$\|Y\| = \sqrt{1^2 + 1^2 + 0^2 + 1^2 + 0^2} = \sqrt{3}$$
$$\cos(X, Y) = \frac{2}{2\sqrt{3}} \approx 0.5773$$
(approximately 57.7% similarity between the two entities)
The potential disadvantage of this type of encoding is that it gets extremely sparse as we
process more tweets. We avoid this by densifying the representation needed to update entity
co-occurrences and frequencies. We observe that this type of cosine similarity works well in
practice with respect to the final clustering output.
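The worked example can be checked with a short script. The sketch below stores, for each entity, only the set of tweets in which it occurs instead of a full 0/1 vector. For binary vectors the dot product equals the number of shared tweets and each norm is the square root of the set size, so the result is the same cosine value. This compact representation is an assumption made for illustration and is not necessarily the densified representation used in our implementation; the function name cosine_from_tweet_sets is hypothetical.

```python
import math


def cosine_from_tweet_sets(tweets_x, tweets_y):
    """Cosine similarity of two binary occurrence vectors given as sets of tweet ids."""
    if not tweets_x or not tweets_y:
        return 0.0                                   # an entity with no tweets has a zero vector
    shared = len(tweets_x & tweets_y)                # dot product of the 0/1 vectors
    return shared / math.sqrt(len(tweets_x) * len(tweets_y))


# Tweets (by number) in which each entity from Table-4 occurs.
occurrences = {
    "#indiavsengland": {2, 3, 4, 5},
    "bhuvneshwar": {1, 2, 4},
}
similarity = cosine_from_tweet_sets(occurrences["#indiavsengland"], occurrences["bhuvneshwar"])
print(similarity)
```

The printed value agrees with the hand computation of 2 / (2 * sqrt(3)) up to floating-point precision.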
3.1.6 Similarity Filtering
After computing the entity similarities, we filter them using a minimum
threshold value (in the range 0-1) to remove noisy connections between the entities.
Example: Let the threshold value be 0.2. If the similarity between two entities is greater than 0.2,
we add an edge between them; otherwise, we discard the connection as
noise.
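A minimal sketch of this filtering step is shown below. The similarity values and the function name filter_similarities are hypothetical and serve only to show the effect of the 0.2 threshold.

```python
def filter_similarities(similarities, threshold=0.2):
    """Keep only the entity pairs whose similarity is greater than the threshold."""
    return {pair: sim for pair, sim in similarities.items() if sim > threshold}


# Hypothetical similarity values, for illustration only.
similarities = {
    ("#indiavsengland", "bhuvneshwar"): 0.58,
    ("#indiavsengland", "samson"): 0.5,
    ("bhuvneshwar", "samson"): 0.0,
    ("roy", "samson"): 0.15,
}
edges = filter_similarities(similarities)
print(sorted(edges))
```

Raising the threshold removes more weak connections and yields a sparser graph, at the risk of splitting entities that belong to the same event.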
3.1.7 Entity Clustering
Once we have a set of event-representative keywords, the next step is to cluster the keywords
that are related to the same event. Two keywords can be said to belong to the same event
if they are associated with similar content; it is also important
that they follow similar appearance patterns. Our similarity metric considers both of
these aspects [16].
At this stage, we are able to naturally construct a graph in which each entity is represented
by a node and the similarities between entities are the edge weights. Once the
similarities have been computed using cosine similarity, a wide variety of
clustering algorithms can be leveraged.
For example, community detection algorithms have been used in similar settings [14].
One of the most popular algorithms of this type is the Louvain method [15], which relies on
modularity-based graph partitioning. Key benefits of Louvain are that it is efficient
even on large-scale networks and has a single parameter, the resolution R, to tune.
Community detection Algorithms
A community, with respect to graphs, can be defined as a subset of nodes that are densely
connected to each other and loosely connected to the nodes in the other communities in the
same graph.
To illustrate with an example, think about social media platforms such as
Twitter, where we try to connect with other people. After a while, we end up being
connected with people belonging to different social circles. These social circles can be groups
of relatives, schoolmates, colleagues, etc.
Louvain’s method
Louvain’s method for community detection is a method to extract communities from large
networks, created by Blondel et al. [18]. It is a greedy optimization method that
appears to run in time $O(n \log^2 n)$ in the number of nodes n in the network.
Louvain’s algorithm is based on optimizing modularity.
The quality of the communities (referred to as partitions hereafter) is measured by the modularity of
the partition. Modularity Q is defined as:
$$Q = \frac{1}{2m} \sum_{i,j} \left[ A_{ij} - \frac{k_i k_j}{2m} \right] \delta(c_i, c_j)$$
where
> $A_{ij}$ is the weight of the edge between i and j
> $k_i$ is the sum of the weights of the edges attached to vertex i, also called the degree
of the node
> $c_i$ is the community to which vertex i is assigned
> $\delta(x, y)$ is 1 if x = y and 0 otherwise
> $m = \frac{1}{2} \sum_{i,j} A_{ij}$, i.e. the number of links
Fig-6 Modularity Computation formula
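To make the modularity definition concrete, the following sketch evaluates Q directly from its terms for a small hand-made graph. The function name modularity, the edge dictionary and the node labels are hypothetical. The sketch assumes an undirected graph without self-loops, and it evaluates Q for a given partition rather than performing the Louvain optimization itself.

```python
def modularity(weights, community):
    """Compute Q for an undirected weighted graph without self-loops.

    weights:   dict mapping (i, j) node pairs to an edge weight, one entry per undirected edge
    community: dict mapping every node to its community label
    """
    # Symmetric adjacency A_ij stored as a nested dict.
    adjacency = {node: {} for node in community}
    for (i, j), w in weights.items():
        adjacency[i][j] = adjacency[i].get(j, 0.0) + w
        adjacency[j][i] = adjacency[j].get(i, 0.0) + w

    degree = {node: sum(neigh.values()) for node, neigh in adjacency.items()}  # k_i
    two_m = sum(degree.values())                                               # 2m
    if two_m == 0:
        return 0.0

    q = 0.0
    for i in community:
        for j in community:
            if community[i] == community[j]:                 # delta(c_i, c_j)
                q += adjacency[i].get(j, 0.0) - degree[i] * degree[j] / two_m
    return q / two_m


# Two triangles joined by a single edge, all weights 1.
edges = {("a", "b"): 1, ("b", "c"): 1, ("a", "c"): 1,
         ("d", "e"): 1, ("e", "f"): 1, ("d", "f"): 1,
         ("c", "d"): 1}
split = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
merged = {node: 0 for node in split}
print(modularity(edges, split), modularity(edges, merged))
```

Placing each triangle in its own community gives a higher Q than putting all nodes into one community, which is the kind of improvement Louvain's greedy moves look for.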
Fig-7 Clustering Service design
3.2 Algorithms
We describe the pseudocode for each component of the framework. We start with
entity extraction, then move on to computing and filtering similarities, and
finally give the algorithm for entity clustering.
Algorithm-1 Data Processing
Procedure ProcessData
    Consumer_key <- consumer key to twitter
    Consumer_secret <- consumer secret to twitter
    Access_token <- access token to twitter
    Access_secret <- access secret to twitter
    Auth <- tweepy.OAuthHandler(Consumer_key, Consumer_secret)
    Auth.set_access_token(Access_token, Access_secret)
    Api <- tweepy.API(Auth)
    Class MyListener <- used to save the data in JSON format
    Twitter_stream <- tweepy.Stream(Auth, MyListener())
    Twitter_stream.filter(track = ['trending tweet'])
Algorithm-2 Inverted Index Making
Input: total entity list and tweet ids
Output: inverted index that maps each entity to a vector of 1s and 0s
    inverted_index <- {}
    entity_list <- list(set(total_entity_list))
    for entity in entity_list
        vector <- []
        for tweet in keys_list
            if entity in tweet_dictionary[tweet]["entities"]
                vector.append(1)
            else
                vector.append(0)
        inverted_index[entity] <- vector
    return inverted_index
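Algorithm 2 can be sketched in Python as follows. The sample tweet_dictionary is a reduced, hypothetical version of the example tweets in Table-3, restricted to the two entities of Table-4, and the function name build_inverted_index is chosen for this illustration.

```python
def build_inverted_index(tweet_dictionary):
    """Map each entity to a 0/1 vector with one position per tweet."""
    tweet_ids = list(tweet_dictionary)   # fixes the column order of every vector
    entities = sorted({e for tweet in tweet_dictionary.values() for e in tweet["entities"]})
    inverted_index = {}
    for entity in entities:
        inverted_index[entity] = [
            1 if entity in tweet_dictionary[tweet_id]["entities"] else 0
            for tweet_id in tweet_ids
        ]
    return inverted_index


# Reduced example: only the entities from Table-4 are listed per tweet.
tweet_dictionary = {
    1: {"entities": ["bhuvneshwar"]},
    2: {"entities": ["#indiavsengland", "bhuvneshwar"]},
    3: {"entities": ["#indiavsengland"]},
    4: {"entities": ["#indiavsengland", "bhuvneshwar"]},
    5: {"entities": ["#indiavsengland"]},
}
inverted_index = build_inverted_index(tweet_dictionary)
print(inverted_index)
```

The resulting vectors match the rows of Table-4, so they can be passed directly to the cosine similarity computation.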
Algorithm-3 Cosine Similarity
Input: list of entities and inverted index
Output: cosine value between 0 and 1
Procedure cos_sim(a, b)
    sim <- round(dot(a, b) / (norm(a) * norm(b)), 2)
    return sim
for node in node_list
    for nodex in entity_list
        sim <- cos_sim(inv_index[node], inv_index[nodex])
End
Algorithm-4 Graph Generation between entities
Input: list of entities and inverted index
Output: graph with entities as nodes and their similarity as edge weight
    G <- nx.Graph()
    g_node_list <- node_list
    list_to_remove <- []
    for each node in node_list do
        list_to_remove.append(node)
        check_list <- list(set(g_node_list) - set(list_to_remove))
        for each nodex in check_list do
            sim <- cos_sim(inv_index[node], inv_index[nodex])
            if (sim > 0.2) then
                G.add_edge(node, nodex, weight = sim)
            else
                continue
End
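Algorithms 3 and 4 can be sketched together in Python. To keep the example free of dependencies, the graph is stored as a plain nested dictionary instead of an nx.Graph; with NetworkX the corresponding step would be G.add_edge(node, nodex, weight=sim). The function name build_entity_graph and the third entity "samson" (with a vector derived from Table-3) are assumptions made for this illustration.

```python
import math
from itertools import combinations


def cos_sim(a, b):
    """Cosine similarity of two equal-length numeric vectors, rounded to 2 decimals."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return round(dot / (norm_a * norm_b), 2)


def build_entity_graph(inv_index, threshold=0.2):
    """Return {node: {neighbour: weight}} with an edge for every pair above the threshold."""
    graph = {entity: {} for entity in inv_index}
    # combinations() visits every unordered pair once, like check_list in Algorithm 4.
    for node, nodex in combinations(inv_index, 2):
        sim = cos_sim(inv_index[node], inv_index[nodex])
        if sim > threshold:
            graph[node][nodex] = sim
            graph[nodex][node] = sim
    return graph


inv_index = {
    "#indiavsengland": [0, 1, 1, 1, 1],
    "bhuvneshwar": [1, 1, 0, 1, 0],
    "samson": [0, 0, 1, 0, 0],
}
graph = build_entity_graph(inv_index)
print(graph)
```

In this small example "bhuvneshwar" and "samson" never occur in the same tweet, so their similarity is 0 and no edge is created between them.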
Chapter 4
Experiment and Results:
In addition to the procedures described above in the methodology, this chapter presents
the experiments and the steps performed to obtain the results.
4.1 Evaluation Data Set
To create an evaluation dataset, we first extracted tweets with the hashtag #indiavsengland
from the Twitter API; these tweets relate to an ODI match played between India and England.
The extracted data is in JSON format. [link]
The extracted JSON data is shown in the image below:
Fig-8 raw data
4.2 Pre-processing Tweet data
The data extracted from the Twitter API for #indiavsengland is in JSON format and is
unstructured, so we need to clean it. After data pre-processing and
entity extraction, we obtain the data in the form of ‘tweet_id’, ‘User_id’, ‘tweet_text’ and
‘entities’, where the entities are named entities and hashtags.
Fig-9 Tweet text with named entities and Hashtag
4.3 Total number of tweets
After storing the pre-processed data in a text file, we calculate the total number of
tweets using the tweet-id: we create a list of tweet-id keys and take the
length of that list, which gives the total number of tweets.
Fig-10 list of key
4.4 Tweet Dictionary
We load the pre-processed data from the file into a dictionary, the tweet dictionary,
which has the tweet-id as key and the tweet text, user-id and entities as the value for each
tweet-id.
Fig-11 tweet dictionary
4.5 Total number of entities
After creating the dictionary, we calculate the total number of entities by collecting them
from the dictionary into a list. We then remove the duplicate entities
by storing the list in a set, which gives the unique entity set.
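The bookkeeping steps of sections 4.3 to 4.5 can be sketched together as follows. The record layout (tab-separated fields with comma-separated entities), the sample ids and the function name build_tweet_dictionary are assumptions made for this illustration; the actual layout of the input text file may differ.

```python
def build_tweet_dictionary(lines):
    """Parse tab-separated records into {tweet_id: {user_id, tweet_text, entities}}."""
    tweet_dictionary = {}
    for line in lines:
        tweet_id, user_id, tweet_text, entities = line.rstrip("\n").split("\t")
        tweet_dictionary[tweet_id] = {
            "user_id": user_id,
            "tweet_text": tweet_text,
            "entities": [e for e in entities.split(",") if e],
        }
    return tweet_dictionary


# Hypothetical records; the ids and entities are made up for illustration.
lines = [
    "101\t9001\tjason roy depart bowl bhuvneshwar\tjason roy,bhuvneshwar\n",
    "102\t9002\tbhuvneshwar man of the match #indiavsengland\tbhuvneshwar,#indiavsengland\n",
]
tweet_dictionary = build_tweet_dictionary(lines)

keys_list = list(tweet_dictionary)            # section 4.3: list of tweet-id keys
total_tweets = len(keys_list)

total_entity_list = [e for tweet in tweet_dictionary.values() for e in tweet["entities"]]
unique_entities = set(total_entity_list)      # section 4.5: duplicates removed
print(total_tweets, len(total_entity_list), len(unique_entities))
```

Keeping keys_list as an explicit list fixes the order of the tweets, which later determines the position of each tweet in the inverted index vectors.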
Fig-12 total number of entities
4.6 Inverted Index Making
With the data in the above format, we match each tweet against the list of entities
and build the inverted index: if an entity occurs in the tweet text, we append 1 to the
entity's vector; otherwise we append 0.
Fig-13 Inverted index between tweet and set of entities
4.7 Cosine Similarity
Once the inverted index between entities and tweet ids has been generated, we compute the cosine
similarity between entities and generate the graph. In the graph, each entity is a node, and
edges are added between nodes based on the cosine similarity and a threshold value
S. If the cosine similarity is greater than the threshold, an edge connects the two nodes
and the similarity value is its edge weight; if the cosine similarity is less than the threshold,
no edge is added between the nodes.
Fig-14 Cosine similarity between entities
4.8 Graph Generation
After computing the cosine similarity between entities, we generate the graph between
entities using the networkx library in Python.
Fig-15 cluster of entities
4.9 Community detection
After generating the graph, we apply a community detection algorithm, the Louvain
method, to group similar entities into communities in the graph, where each cluster
of entities represents a different event.
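A minimal sketch of this step is given below. It assumes a recent NetworkX release that provides louvain_communities in networkx.algorithms.community; this is not necessarily the Louvain implementation used in our experiments. The entity names and edge weights are hypothetical.

```python
import networkx as nx
from networkx.algorithms.community import louvain_communities

# Hypothetical entity graph: two tightly connected groups joined by one weak edge.
G = nx.Graph()
G.add_weighted_edges_from([
    ("bhuvneshwar", "jason roy", 0.8),
    ("bhuvneshwar", "wicket", 0.7),
    ("jason roy", "wicket", 0.9),
    ("samson", "catch", 0.8),
    ("samson", "pressure", 0.6),
    ("catch", "pressure", 0.7),
    ("wicket", "catch", 0.25),
])

# resolution is the single tuning parameter of the method; seed fixes the node order.
communities = louvain_communities(G, weight="weight", resolution=1, seed=42)
events = sorted(sorted(community) for community in communities)
print(len(events), events)
```

Each returned set of entities is treated as one event, so the number of events is simply the number of communities.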
Fig-16 number of events in graph
4.10 Total Number of Events
After generating the graph and detecting communities, we obtain our final result: the total
number of events, where each event is a group of entities related to each other.
Fig-17 Total Number of event
Conclusion
In our work, we have followed a methodology for detecting events using the Twitter data
stream, and we have evaluated our final output, the number of events, in an offline mode.
We started with data extraction for #indiavsengland using the Twitter API. The
extracted data was unstructured, so we applied data pre-processing and retained only
specified fields: tweet-id, user-id, tweet-text and entities. We then filtered the
entities to keep the named entities, and filtered the tweet text to keep only hashtags and text.
For this entity filtering we used NER (named entity
recognition) and POS (part-of-speech) tagging. After filtering the entities, we built the
inverted index and computed the similarities among the entities using cosine similarity.
Based on the similarity between entities, we built the graph and then detected the number
of events in that graph. In the end, each cluster represents entities of a similar type,
which shows that these events are related to each other; our final output is the total
number of such events detected from the Twitter data stream. We have realized an
implementation of this methodology and tested it in offline mode.
Future work will include a wider evaluation, additional labelling of the event entities, and an
optimization of the thresholds in order to improve performance in terms of event
detection speed and to reduce the clustering complexity. Additionally, we will test our solution
with different similarity metrics and with word embedding techniques to evaluate any
performance improvement. Finally, we will try to detect the location of the recognized events
based on the tweets and to produce a panic map in near real time. The work can also be
further extended to make it region-centric and generate location-wise events. In addition,
different applications of the proposed event detection models, such as health care,
emergency detection and disaster prediction, can be explored.
Bibliography
2017. 2017 Global Social Journalism Study.
Gerret Von Nordheim, Karin Boczek, and Lars Koppers. 2018. Sourcing the Sources: An analysis of the use of Twitter and Facebook as a journalistic source over 10 years in The New York Times, The Guardian, and Süddeutsche Zeitung. Digital Journalism 6, 7 (2018), 807-828.
International Command and Control Research and Technology Symposium (CCRTS), June 2008, Washington.
L. M. Aiello, G. Petkos, C. Martin, D. Corney, S. Papadopoulos, R. Skraba, A. Goker, I. Kompatsiaris, and A. Jaimes, "Sensing trending topics in Twitter," IEEE Trans. Multimedia, vol. 15, no. 6, pp. 1268-1282, Oct. 2013.
Adrien Guille and Cécile Favre. 2015. Event detection, tracking, and visualization in Twitter: a mention-anomaly-based approach. Social Network Analysis and Mining 5, 1 (2015), 18.
Pei Lee, Laks V. S. Lakshmanan, and Evangelos E. Milios. 2014. Incremental cluster evolution tracking from highly dynamic network data. In 2014 IEEE 30th International Conference on Data Engineering (ICDE). IEEE, 3-14.
Farzindar Atefeh and Wael Khreich. 2015. A survey of techniques for event detection in Twitter. Computational Intelligence 31, 1 (2015), 132-164.
Marian Dörk, Daniel Gruen, Carey Williamson, and Sheelagh Carpendale. 2010. A visual backchannel for large-scale events. IEEE Transactions on Visualization and Computer Graphics 16, 6 (2010), 1129-1138.
Amosse Edouard, Elena Cabrio, Sara Tonelli, and Nhan Le Thanh. 2017. Graph-based event extraction from Twitter. In RANLP17.
Vincent D. Blondel, Jean-Loup Guillaume, Renaud Lambiotte, and Etienne Lefebvre. 2008. Fast unfolding of communities in large networks. Journal of Statistical Mechanics: Theory and Experiment 2008, 10 (2008), P10008.
James Allan. Introduction to topic detection and tracking. Topic Detection and Tracking, pages 1-16, 2002.
Fast unfolding of communities in large networks. Vincent D. Blondel, Jean-Loup Guillaume, Renaud Lambiotte, and Etienne Lefebvre.
Z. Saeed, R. A. Abbasi, O. Maqbool, A. Sadaf, I. Razzak, A. Daud, N. R. Aljohani, and G. Xu, "What's happening around the world? A survey and framework on event detection techniques on Twitter," Journal of Grid Computing, vol. 17, no. 2, pp. 279-312, 2019.