A News Analysis and Tracking System

Sk. Mirajul Haque, Lipika Dey, and Anuj Mahajan

TCS Innovation Labs, Delhi

{skm.haque,lipika.dey,anuj.mahajan}@tcs.com

Abstract. Continuous monitoring of web-based news sources has emerged as a
key intelligence task, particularly for Homeland Security. We propose a system
for web-based news tracking and alerting. Unlike subscription-based alerts,
alerting is implemented as a personalized service where the system is trained to
recognize potentially important news based on user preferences. Preferences are
expressed as combinations of topics and can change dynamically. The system
employs Latent Dirichlet Allocation (LDA) for topic discovery and Latent
Semantic Indexing (LSI) for alerting.

1 Introduction
The amount of news content available online is increasing at a steady rate. News
belongs to a very specific genre of textual data. News contents from multiple sources
are usually similar in that they largely look at the same set of events, yet they are at
the same time highly unstructured and open ended. Timely assimilation and
interpretation of information available from these sources is extremely important for political
and economic analysts. It is imperative to analyze the whole collection to eliminate
the possibility of missing anything.
Automated acquisition, aggregation and analysis of news content is a challenging
and exciting line of research, lying at the crossroads of information extraction,
machine learning, machine translation, pattern discovery, etc. The complexity of the
problem increases due to the unpredictability of incoming content that needs to be
analyzed. User preferences cannot be expressed as pre-defined categories. A news
item of any category, be it political, sports, or entertainment, can become interesting
to an analyst, depending on its content. An analyst may wish to track it for some time.
Further, the interest changes dynamically. The key challenge in recognizing relevant
news lies in recognizing the concepts of interest to a user.
In this paper, we present a system that is distinct in functionality from other
news analysis systems. It is designed as a watch-guard which continuously scans
online sources for new stories and alerts a registered user whenever there is any new
development that could be of potential interest to the user. The news repository is
organized topically. Users can register their interests with the system in terms of
topics. The system generates an alert if a new story is judged to be conceptually relevant
to a user. Alerts are also sent to the user's mobile phone. The novelty of the proposed
system lies in its capability to track conceptually relevant news items as opposed to
news in pre-specified categories.

S. Chaudhury et al. (Eds.): PReMI 2009, LNCS 5909, pp. 231–236, 2009.
© Springer-Verlag Berlin Heidelberg 2009

2 News Analysis Systems – A Review

News categorization and topic tracking through news collections has been an active
area of research for a few years now. In [1] five categorization techniques were
compared for news categorization. It was found that Support Vector Machines (SVM),
k-Nearest Neighbor (kNN) and Linear Least Squares Fit (LLSF) significantly outperform
neural networks and Naive Bayes (NB) when the number of training instances per
category is small. [4] presents a system for automatic categorization of news stories
written in the Croatian language into a pre-defined set of categories using a kNN-based
classifier. [10] described a news analysis system to classify and index news stories in
real time. The system extracted stories from a newswire, parsed the sentences of the
story, and then mapped the syntactic structures into a concept base.
Since news contents change in unpredictable ways, supervised techniques that
require large training sets are not ideally suited for news repository organization. Topic
detection and tracking [2] is another area of research where the emphasis is on monitoring
an online feed of news articles and determining the underlying topics and their
distribution in the news items in an unsupervised way. In [7] a topic tracking system
based on unigram models was presented. Given a sparse unigram model built from topic
training data and a large number of background models built from background material,
the system finds the best mixture of background models that matches the unigram
model. [8] proposed a clustering-based approach to summarize and track events from
a collection of news web pages. In [9] the problem of skewed data in topic tracking
was tackled using semi-supervised techniques over a bilingual corpus. The Lydia
project [3] describes a method for constructing a relational model of people, places
and other entities through a combination of natural language processing and statistical
analysis techniques.

3 Overview of the Proposed News Analysis System

Fig. 1 presents the software architecture of the proposed news analysis system. The
key components of the system are:
• News acquisition module - The news acquisition module collects news from a
host of pre-defined websites through their RSS¹ feeds. The RSS feeds contain news
headlines, date and time of publishing, a brief description, and the URL of the
full news story. The feeds are periodically updated. The URLs extracted from the
RSS feeds are fed to customized web crawlers, which extract the news stories.
• Topic Discovery – This module builds a topic map for the entire news collection
using Latent Dirichlet Allocation (LDA). It also indexes the collection topically.
Each time a news repository is updated, the existing topic maps are backed up
and a new topic map is built.
• Topic Tracker and Trend Analyzer – Topic maps are time-stamped. These can be
compared to analyze how the strength of a topic has changed over time. Correlation
analysis of topic maps identifies the entry of new topics into the system and the
exit of old ones. Topic tracking and trend analysis can provide early warnings and
actionable intelligence to avoid disasters.

¹ http://en.wikipedia.org/wiki/RSS_(file_format).

Fig. 1. Architecture of proposed News Analysis System

• News Alert Generator – This is a trainable module that implements Latent Semantic
Indexing (LSI) to categorize an incoming news story as relevant or irrelevant to
any of the multiple interest areas expressed by the user. The alerting mechanism is
designed as a trainable multi-classifier system. User interests can be further
classified as long-term and short-term. Concepts in the long-term interest category
change at a slower rate than those in the short-term interest category. While long-term
interests encode generic needs of a news monitoring agency, short-term interests help in
tracking evolving stories in any category.
• Topic Explorer – This is the interface through which the user interacts with the
system. A topic snapshot shows all topics and concepts present in a collection.
Users can drill down to detailed reports on any event along any topic. This allows
a single story to be read from multiple perspectives based on its contents.
Topic-based aggregation allows the user to get a bird's-eye view of the key events. Users
can also view topic strengths and localizations.
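
As an illustration of the acquisition step, the following minimal sketch extracts the four RSS fields named above (headline, publishing time, description, story URL) from a feed. The sample feed, its URLs and the function name `parse_rss_items` are hypothetical and are not taken from the system described here; only the Python standard library is used, and fetching and crawling are left out.

```python
import xml.etree.ElementTree as ET

# Hypothetical RSS 2.0 feed used only for illustration.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <item>
      <title>Sample headline</title>
      <link>http://news.example.com/story1</link>
      <description>A brief description.</description>
      <pubDate>Tue, 09 Jun 2009 10:00:00 GMT</pubDate>
    </item>
    <item>
      <title>Second headline</title>
      <link>http://news.example.com/story2</link>
      <description>Another description.</description>
      <pubDate>Tue, 09 Jun 2009 11:30:00 GMT</pubDate>
    </item>
  </channel>
</rss>"""


def parse_rss_items(xml_text):
    """Return one dict per <item> with headline, URL, description and date."""
    root = ET.fromstring(xml_text)
    items = []
    for item in root.iter("item"):
        items.append({
            "headline": item.findtext("title", default=""),
            "url": item.findtext("link", default=""),
            "description": item.findtext("description", default=""),
            "published": item.findtext("pubDate", default=""),
        })
    return items


items = parse_rss_items(SAMPLE_FEED)
# The extracted URLs are what a crawler would then fetch for the full story.
story_urls = [it["url"] for it in items]
```

In a deployment along these lines, the list in `story_urls` would be handed to the crawlers, and the remaining fields stored as metadata alongside each story.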

3.1 Topic Extraction Using LDA

An important step to comprehend the content of any text repository is to identify the
key topics that exist in the collection, around which the text can be organized. Latent
Dirichlet Allocation (LDA) [5] is a statistical model, specifically a topic model, in
which a document is considered as a mixture of a limited number of topics and each
meaningful word in the document can be associated with one or more of these topics.
A single word may be associated with multiple topics with differing weights depending
on the context in which it occurs. Given a collection of documents containing a set of
words, a good estimation of unique topics present in the collection is obtained with
the assumption that each document d can be viewed as a multinomial distribution over
k topics. Each topic $z_j$, $j = 1, \ldots, k$, in turn, is assumed to be a multinomial
distribution Φ(j) over the set of words W. The problem of estimating the various
distributions is in general intractable. Hence a wide variety of approximate algorithms
that attempt to maximize the likelihood of a corpus given the model have been proposed
for LDA. We have used the Gibbs sampling based technique proposed in [5], which is
computationally efficient. The topic extractor creates a topic dictionary where each
topic is represented as a collection of words along with their probabilities of
belonging to the topic. Each document d is represented as a topic distribution vector
$\langle p(t_1, d), \ldots, p(t_k, d) \rangle$, where $p(t_j, d)$, the strength of topic $t_j$ in document d, is
computed as follows:
$$p(t_j, d) = \sum_{i=1}^{n} [\ldots] \qquad (1)$$
where n is the total number of unique words in the collection.
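
To make the topic-extraction step concrete, here is a minimal sketch of a collapsed Gibbs sampler for LDA on a toy corpus. It is an illustration under stated assumptions, not the implementation used in this system: the function name `lda_gibbs`, the prior values `doc_prior` and `word_prior`, the iteration count and the toy documents are all hypothetical. The document-topic vector it returns is the usual smoothed, normalized topic count per document, which is not necessarily the same quantity as equation (1).

```python
import numpy as np


def lda_gibbs(docs, vocab_size, k, iters=200, doc_prior=0.1, word_prior=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA.

    docs is a list of documents, each a list of integer word ids.
    Returns (doc_topic, topic_word), both with rows summing to 1.
    """
    rng = np.random.default_rng(seed)
    n_dk = np.zeros((len(docs), k))    # tokens of document d assigned to topic t
    n_kw = np.zeros((k, vocab_size))   # occurrences of word w assigned to topic t
    n_k = np.zeros(k)                  # total tokens assigned to topic t
    z = []                             # current topic of every token

    # Random initial assignment of a topic to each token.
    for d, doc in enumerate(docs):
        zd = rng.integers(k, size=len(doc))
        z.append(zd)
        for w, t in zip(doc, zd):
            n_dk[d, t] += 1
            n_kw[t, w] += 1
            n_k[t] += 1

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                t = z[d][i]
                n_dk[d, t] -= 1
                n_kw[t, w] -= 1
                n_k[t] -= 1
                # Conditional distribution over topics for this token.
                p = (n_dk[d] + doc_prior) * (n_kw[:, w] + word_prior) \
                    / (n_k + vocab_size * word_prior)
                t = rng.choice(k, p=p / p.sum())
                # Record the newly sampled assignment.
                z[d][i] = t
                n_dk[d, t] += 1
                n_kw[t, w] += 1
                n_k[t] += 1

    doc_topic = (n_dk + doc_prior) / (n_dk + doc_prior).sum(axis=1, keepdims=True)
    topic_word = (n_kw + word_prior) / (n_kw + word_prior).sum(axis=1, keepdims=True)
    return doc_topic, topic_word


# Toy vocabulary: ids 0-2 are politics words, ids 3-5 are sports words.
vocab = ["election", "vote", "minister", "match", "goal", "team"]
docs = [[0, 1, 2, 0, 1], [3, 4, 5, 3, 4], [0, 2, 1, 1], [4, 5, 3, 5]]
doc_topic, topic_word = lda_gibbs(docs, vocab_size=len(vocab), k=2)
```

Each row of `topic_word` plays the role of one entry in the topic dictionary (words with their probabilities of belonging to the topic), and each row of `doc_topic` is a topic distribution vector for one document.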

3.2 News Alert Generation Using Latent Semantic Indexing

Each incoming news story is scored for relevance based on its topic distribution. The
system employs Latent Semantic Indexing (LSI) to assign relevance scores. LSI uses
Singular Value Decomposition (SVD) to identify semantic patterns between words
and concepts present in a text repository. It can correlate semantically related terms
within a collection by establishing associations between terms that occur in similar
contexts.
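
The way SVD associates terms that occur in similar contexts can be seen on a small example. The sketch below is only an illustration: the five-term vocabulary, the four toy documents and the choice of two concept dimensions are assumptions made here. The terms "car" and "automobile" never appear in the same document, yet both co-occur with "engine", so they end up close together in the reduced concept space.

```python
import numpy as np

terms = ["car", "automobile", "engine", "flower", "petal"]
# Term-document matrix; columns are documents:
# d0 = {car, engine}, d1 = {automobile, engine}, d2 = d3 = {flower, petal}
T = np.array([
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
], dtype=float)


def cosine(a, b):
    """Cosine similarity of two non-zero vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


# Thin SVD: T = U @ diag(s) @ Vt, singular values in descending order.
U, s, Vt = np.linalg.svd(T, full_matrices=False)

k = 2                              # number of concept dimensions kept
term_coords = U[:, :k] * s[:k]     # coordinates of each term in concept space

raw_sim = cosine(T[0], T[1])                           # "car" vs "automobile", original space
concept_sim = cosine(term_coords[0], term_coords[1])   # same pair, concept space
```

In the original space the two terms share no document, so `raw_sim` is zero, while `concept_sim` is close to one because both terms load on the same concept dimension.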
User interests are conveyed in terms of topics. Given that the underlying repository
has stories containing different topics, of which some are relevant and some not, the
aim is to find new stories which are conceptually relevant to the user, even though
they may not be content-wise similar to any old story as such. The training documents
for the system are created as follows. For each relevant topic, all stories containing
the topic with strength greater than a threshold are merged to create a single training
document $d_R$. All remaining stories are merged to create another training document
$d_{IR}$. Let $T_{m \times 2}$ denote the term-document matrix for this collection, where m represents
the number of unique words. The weight of a word in T is a function of its probability
of belonging to the topics in each category. Applying SVD to T yields
$$T = U W V^{T} \qquad (2)$$
where U and V are orthonormal matrices and W is a diagonal matrix. A modified
term-document matrix $\hat{T}$ is now constructed using $\hat{U}$, which consists of the first two
columns of U, such that $\hat{T}$ represents a transformed 2-dimensional concept space,
where each dimension is expressed as a linear combination of the original terms and
the relevant and irrelevant documents are well separated. For each new news story, its
cosine similarity with the relevant and irrelevant documents is computed in the new
concept space. Let Q represent the term vector for the new story. Q is then projected
into the transformed concept space to get a new vector $\hat{Q}$. Let $S_R$ and $S_{IR}$ denote the
cosine similarity of $\hat{Q}$ with the relevant and irrelevant documents respectively. The
new story is judged as relevant if
$$S_R > \alpha \cdot S_{IR} \qquad (3)$$
where α is a scaling parameter in the range (0, 1). Since the number of news stories in
the irrelevant category is usually much higher than the number of relevant stories, there
is an automatic bias towards the irrelevant category. The scaling parameter α adjusts this
bias by reducing the similarity to the irrelevant category.
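
The alerting procedure can be sketched in a few lines. This is a minimal illustration under explicit assumptions, not the system's implementation: the six-word vocabulary and the word weights in the two training columns are hypothetical, the projection is taken to be multiplication by the transpose of the first two left singular vectors, and the decision rule is assumed to have the form "similarity to the relevant document exceeds α times similarity to the irrelevant document".

```python
import numpy as np


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0


def is_relevant(T, q, alpha=0.5):
    """Score a new story against the relevant/irrelevant training documents.

    T is the m x 2 term-document matrix (column 0: merged relevant stories,
    column 1: merged irrelevant stories); q is the story's length-m term vector.
    Returns (decision, s_r, s_ir).
    """
    # Thin SVD; with two columns, U already holds the two concept dimensions.
    U, s, Vt = np.linalg.svd(T, full_matrices=False)
    docs_hat = U.T @ T        # training documents in concept space (2 x 2)
    q_hat = U.T @ q           # new story projected into concept space
    s_r = cosine(q_hat, docs_hat[:, 0])
    s_ir = cosine(q_hat, docs_hat[:, 1])
    # Assumed decision rule: alpha scales down the irrelevant similarity.
    return s_r > alpha * s_ir, s_r, s_ir


# Hypothetical word weights for a 6-word vocabulary.
d_rel = np.array([3.0, 2.0, 1.0, 0.0, 0.0, 0.0])
d_irr = np.array([0.0, 0.0, 1.0, 2.0, 3.0, 1.0])
T = np.column_stack([d_rel, d_irr])

story_a = np.array([1.0, 1.0, 0.0, 0.0, 0.0, 0.0])   # uses "relevant" words
story_b = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 0.0])   # uses "irrelevant" words

alert_a, s_r_a, s_ir_a = is_relevant(T, story_a, alpha=0.5)
alert_b, s_r_b, s_ir_b = is_relevant(T, story_b, alpha=0.5)
```

With these toy weights the first story triggers an alert and the second does not. Raising `alpha` makes the rule stricter, since the irrelevant similarity is discounted less.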

4 Experimental Results
Fig. 2 presents some snapshots. On the left is a portion of the topic dictionary created
for a two-week period ending on 9th June, 2009. Users select topics to indicate
preferences for receiving alerts. On the right, an alert generated for the user is shown.

Fig. 2. Snapshots from the system

Fig. 3. (a) Precision and Recall with varying topic cut-off and fixed alpha = 0.5 (b) Precision and
Recall with varying alpha and topic cut-off = 0.6

Since user preferences are given as topics, it is observed that the accuracy of
recognizing new stories as relevant depends on (a) the size of the training set and (b) the
scaling factor α. The size of the training set is controlled using the topic cut-off
threshold. Fig. 3 shows how system performance varies with these two parameters. With a
fixed alpha, it is observed that recall falls slightly with higher values of topic cut-off,
though precision does not suffer. A higher topic cut-off reduces the training set size,
causing the recall to fall. Sharper trends are observed with a fixed topic cut-off and
changing values of alpha. Recall falls drastically with higher values of alpha, since
very few stories are now picked as relevant, though the precision is very high. Best
results are observed for alpha and topic cut-off set to 0.5 and 0.6 respectively.
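
For reference, precision and recall of an alerting run can be computed as in the sketch below, which uses the standard definitions. The text does not state how the reported values were obtained, and the story identifiers in the example are hypothetical.

```python
def precision_recall(alerted, relevant):
    """Precision and recall of a set of alerted stories.

    alerted: ids of stories for which an alert was generated.
    relevant: ids of stories that are actually relevant to the user.
    """
    alerted, relevant = set(alerted), set(relevant)
    true_positives = len(alerted & relevant)
    precision = true_positives / len(alerted) if alerted else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall


# Four alerts were raised; two of the three relevant stories are among them.
p, r = precision_recall(alerted={1, 2, 3, 4}, relevant={2, 3, 5})
```

Here `p` is 0.5 and `r` is 2/3. A stricter alpha shrinks the alerted set, which tends to raise precision and lower recall, matching the trend described above.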

5 Conclusion
In this work, we have presented a news analysis system which generates
user-preference based alerts. Preferences are for topics and concepts rather than
pre-defined categories. The news repository is analyzed and indexed topically. We are
currently working on integrating a mining and analytical platform with this system to
enhance its prediction and tracking capabilities.

References
1. Yang, Y., Liu, X.: A re-examination of text categorization methods. In: Proceedings of 22nd ACM SIGIR, California (1999)
2. Allan, J., Papka, R., Lavrenko, V.: On-line New Event Detection and Tracking. In: Proceedings of 21st ACM SIGIR, Melbourne (1998)
3. Lloyd, L., Kechagias, D., Skiena, S.: Lydia: A System for Large-Scale News Analysis. In: Consens, M.P., Navarro, G. (eds.) SPIRE 2005. LNCS, vol. 3772, pp. 161–166. Springer, Heidelberg (2005)
4. Bacan, H., Pandzic, I.S., Gulija, D.: Automated News Item Categorization. In: JSAI (2005)
5. Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent Dirichlet Allocation. Journal of Machine Learning Research 3, 993–1022 (2003)
6. Landauer, T.K., Foltz, P.W., Laham, D.: An Introduction to Latent Semantic Analysis. Discourse Processes 25, 259–284 (1998)
7. Yamron, J.P., Carp, I., Gillick, L., Lowe, S., Van Mulbregt, P.: Topic Tracking in a News Stream. In: Proceedings of DARPA Broadcast News Workshop (1999)
8. Mori, M., Miura, T., Shioya, I.: Topic Detection and Tracking for News Web Pages. In: IEEE/WIC/ACM International Conference on Web Intelligence, pp. 338–342 (2006)
9. Fukumoto, F., Suzuki, Y.: Topic tracking based on bilingual comparable corpora and semi-supervised clustering. ACM Transactions on Asian Language Information Processing 6(3) (2007)
10. Kuhns, R.J.: A News Analysis System. In: Proc. of 12th International Conference on Computational Linguistics, COLING 1988, vol. 1, pp. 351–355 (1988)
