A News Analysis and Tracking System
Sk. Mirajul Haque, Lipika Dey, and Anuj Mahajan
TCS Innovation Labs, Delhi
{skm.haque,lipika.dey,anuj.mahajan}@tcs.com
Abstract. Continuous monitoring of web-based news sources has emerged as a
key intelligence task, particularly for Homeland Security. We propose a system
for web-based news tracking and alerting. Unlike subscription-based alerts,
alerting is implemented as a personalized service where the system is trained to
recognize potentially important news based on user preferences. Preferences are
expressed as combinations of topics and can change dynamically. The system
employs Latent Dirichlet Allocation (LDA) for topic discovery and Latent
Semantic Indexing (LSI) for alerting.
1 Introduction
The amount of news content available online is increasing at a steady rate. News
belongs to a very specific genre of textual data. News contents from multiple sources
are usually similar in that they largely look at the same set of events, yet they are at
the same time highly unstructured and open-ended. Timely assimilation and interpretation
of information available from these sources is extremely important for political
and economic analysts. It is imperative to analyze the whole collection to eliminate
the possibility of missing anything.
Automated acquisition, aggregation and analysis of news content is a challenging
and exciting line of research, lying at the crossroads of information extraction,
machine learning, machine translation, pattern discovery, etc. The complexity of the
problem increases due to the unpredictability of incoming content that needs to be
analyzed. User preferences cannot be expressed as pre-defined categories. A news
item of any category, be it political, sports, or entertainment, can become interesting
to an analyst, depending on its content. An analyst may wish to track it for some time.
Further, this interest changes dynamically. The key challenge in recognizing relevant
news lies in recognizing the concepts of interest to a user.
In this paper, we present a system that is distinct in functionality from other
news analysis systems. It is designed as a watch-guard which continuously scans
online sources for new stories and alerts a registered user whenever there is any new
development that could be of potential interest to the user. The news repository is
organized topically. Users can register their interests with the system in terms of
topics. The system generates an alert if a new story is judged to be conceptually relevant
to a user. Alerts are also sent to the user’s mobile phone. The novelty of the proposed system
lies in its capability to track conceptually relevant news items as opposed to news in
pre-specified categories.
S. Chaudhury et al. (Eds.): PReMI 2009, LNCS 5909, pp. 231–236, 2009.
© Springer-Verlag Berlin Heidelberg 2009
2 News Analysis Systems – A Review
News categorization and topic tracking through news collections have been active
areas of research for a few years now. In [1], five categorization techniques were
compared for news categorization. It was found that Support Vector Machines (SVM),
k-Nearest Neighbor (kNN) and Linear Least Squares Fit (LLSF) significantly outperform
neural networks and Naive Bayes (NB) when the number of training instances per
category is small. [4] presents a system for automatic categorization of news stories
written in Croatian into a pre-defined set of categories using a kNN-based
classifier. [10] describes a news analysis system that classifies and indexes news stories in
real time. The system extracts stories from a newswire, parses the sentences of each
story, and then maps the syntactic structures into a concept base.
Since news contents change in unpredictable ways, supervised techniques that
require large training sets are not ideally suited for news repository organization. Topic
tracking and detection [2] is another area of research, where the emphasis is on monitoring
an online feed of news articles and determining the underlying topics and their distribution
in the news items in an unsupervised way. In [7], a topic tracking system based on
unigram models was presented. Given a sparse unigram model built from topic training
data and a large number of background models built from background material,
the system finds the best mixture of background models that matches the unigram
model. [8] proposed a clustering-based approach to summarize and track events from
a collection of news web pages. In [9], the problem of skewed data in topic tracking
was tackled using semi-supervised techniques over a bilingual corpus. The Lydia
project [3] describes a method for constructing a relational model of people, places
and other entities through a combination of natural language processing and statistical
analysis techniques.
3 Overview of the Proposed News Analysis System
Fig. 1 presents the software architecture of the proposed news analysis system. The
key components of the system are:
• News Acquisition Module – This module collects news from the RSS¹ feeds of a
host of pre-defined websites. The RSS feeds contain news headlines, the date and
time of publishing, a brief description, and the URL of the full news story. The
feeds are updated periodically. The URLs extracted from the RSS feeds are fed to
customized web crawlers, which extract the news stories.
• Topic Discovery – This module builds a topic map for the entire news collection
using Latent Dirichlet Allocation (LDA). It also indexes the collection topically.
Each time the news repository is updated, the existing topic maps are backed up
and a new topic map is built.
• Topic Tracker and Trend Analyzer – Topic maps are time-stamped. These can be
compared to analyze how the strength of a topic has changed over time. Correlation
analysis of topic maps identifies the entry of new topics into the system and the
exit of old ones. Topic tracking and trend analysis can provide early warnings and
actionable intelligence to help avoid disasters.
¹ http://en.wikipedia.org/wiki/RSS_(file_format)
Fig. 1. Architecture of proposed News Analysis System
• News Alert Generator – This is a trainable module that implements Latent Semantic
Indexing (LSI) to categorize an incoming news story as relevant or irrelevant to
any of the multiple interest areas expressed by the user. The alerting mechanism is
designed as a trainable multi-classifier system. User interests are further classified
as long-term or short-term. Concepts in the long-term interest category change at
a slower rate than those in the short-term interest category. While long-term interests
encode the generic needs of a news monitoring agency, short-term interests help in
tracking evolving stories in any category.
• Topic Explorer – This is the interface through which the user interacts with the
system. A topic snapshot shows all topics and concepts present in a collection.
Users can drill down to detailed reports on any event along any topic. This allows
a single story to be read from multiple perspectives based on its contents. Topic-based
aggregation gives the user a bird’s-eye view of the key events. Users
can also view topic strengths and localizations.
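The comparison of time-stamped topic maps described for the Topic Tracker can be illustrated with a minimal sketch. It assumes, purely for illustration, that a topic map is a dictionary from a topic id to a word-probability dictionary, and that two topics are considered the same when their cosine similarity reaches a threshold. The function names, the data layout, and the threshold value are hypothetical and are not taken from the system itself.

```python
import math

def cosine(a, b):
    """Cosine similarity between two sparse word->probability dicts."""
    dot = sum(p * b.get(word, 0.0) for word, p in a.items())
    norm_a = math.sqrt(sum(p * p for p in a.values()))
    norm_b = math.sqrt(sum(p * p for p in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def compare_topic_maps(old_map, new_map, threshold=0.5):
    """Return (entered, exited) topic ids between two snapshots.

    A new topic that matches no old topic with similarity >= threshold
    is reported as entered; an old topic matched by no new topic is
    reported as exited. The threshold is an assumed tuning parameter.
    """
    matched_old = set()
    entered = []
    for new_id, new_topic in new_map.items():
        best_id, best_sim = None, 0.0
        for old_id, old_topic in old_map.items():
            sim = cosine(new_topic, old_topic)
            if sim > best_sim:
                best_id, best_sim = old_id, sim
        if best_sim >= threshold:
            matched_old.add(best_id)
        else:
            entered.append(new_id)
    exited = [old_id for old_id in old_map if old_id not in matched_old]
    return entered, exited

# Toy snapshots (hypothetical data).
old_map = {"t0": {"election": 0.5, "vote": 0.5},
           "t1": {"cricket": 0.6, "match": 0.4}}
new_map = {"n0": {"election": 0.6, "vote": 0.3, "poll": 0.1},
           "n1": {"flu": 0.7, "virus": 0.3}}
print(compare_topic_maps(old_map, new_map))
```

In this toy run the election topic persists across the two snapshots, while the flu topic has no counterpart in the older map and the cricket topic has none in the newer one.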
3.1 Topic Extraction Using LDA
An important step in comprehending the content of any text repository is to identify the
key topics that exist in the collection, around which the text can be organized. Latent
Dirichlet Allocation (LDA) [5] is a statistical model, specifically a topic model, in
which a document is considered as a mixture of a limited number of topics and each
meaningful word in the document can be associated with one or more of these topics.
A single word may be associated with multiple topics with differing weights depending
on the context in which it occurs. Given a collection of documents containing a set of
words, a good estimate of the unique topics present in the collection is obtained under
the assumption that each document d can be viewed as a multinomial distribution over
k topics. Each topic z_j, j = 1, …, k, in turn, is assumed to be a multinomial distribution
Φ^(j) over the set of words W. The problem of estimating the various distributions is in
general intractable. Hence a wide variety of approximate algorithms that attempt to
maximize the likelihood of a corpus given the model have been proposed for LDA. We
have used the Gibbs sampling based technique proposed in [5], which is
computationally efficient. The topic extractor creates a topic dictionary where each
topic is represented as a collection of words along with their probabilities of
belonging to the topic. Each document d is represented as a topic distribution vector
<p(t_1, d), …, p(t_k, d)>, where p(t_j, d) denotes the strength of topic t_j in document d and is
computed as follows:
p(t_j, d) = \sum_{i=1}^{n} … . (1)
where n is the total number of unique words in the collection.
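As a rough illustration of how a topic distribution vector could be derived from the topic dictionary, the sketch below scores each topic by summing the topic-word probabilities of the words occurring in a document, weighted by their counts, and then normalizes the scores. This scoring rule is an assumption made for illustration and is not claimed to be the exact form of Eq. (1). The names topic_distribution and topic_dictionary are hypothetical.

```python
from collections import Counter

def topic_distribution(tokens, topic_dictionary):
    """Return a normalized topic strength vector for one document.

    tokens: list of words in the document.
    topic_dictionary: list of k dicts, one per topic, mapping a word
        to its probability of belonging to that topic.
    Assumed scoring rule: count-weighted sum of topic-word
    probabilities, normalized so that the strengths sum to 1.
    """
    counts = Counter(tokens)
    raw = [
        sum(word_probs.get(word, 0.0) * count for word, count in counts.items())
        for word_probs in topic_dictionary
    ]
    total = sum(raw)
    if total == 0:
        return [0.0] * len(raw)  # no dictionary word occurs in the document
    return [score / total for score in raw]

# Toy topic dictionary with two topics (hypothetical data).
topic_dictionary = [
    {"attack": 0.3, "bomb": 0.2},
    {"match": 0.4, "cricket": 0.3},
]
print(topic_distribution(["bomb", "attack", "attack", "match"], topic_dictionary))
```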
3.2 News Alert Generation Using Latent Semantic Indexing
Each incoming news story is scored for relevance based on its topic distribution. The
system employs Latent Semantic Indexing (LSI) [6] to assign relevance scores. LSI uses
Singular Value Decomposition (SVD) to identify semantic patterns between words
and concepts present in a text repository. It can correlate semantically related terms
within a collection by establishing associations between terms that occur in similar
contexts.
User interests are conveyed in terms of topics. Given that the underlying repository
has stories containing different topics, of which some are relevant and some not, the
aim is to find new stories which are conceptually relevant to the user, even though
they may not be similar in content to any old story. The training documents
for the system are created as follows. For each relevant topic, all stories containing
the topic with strength greater than a threshold are merged to create a single training
document d_R. All remaining stories are merged to create another training document
d_IR. Let T_{m×2} denote the term-document matrix for this collection, where m represents
the number of unique words. The weight of a word in T is a function of its probability
of belonging to the topics in each category. Applying SVD to T yields
T = U W V^T. (2)
where U and V are orthonormal matrices and W is a diagonal matrix. A modified
term-document matrix is now constructed using U_2, which consists of the first two
columns of U, such that U_2^T T represents a transformed 2-dimensional concept space,
where each dimension is expressed as a linear combination of the original terms and
the relevant and irrelevant documents are well-separated. For each new news story, its
cosine similarity with the relevant and irrelevant documents is computed in the new
concept space. Let Q represent the term vector for the new story. Q is then projected
into the transformed concept space to get a new vector Q' = U_2^T Q. Let S_R and S_IR denote the
cosine similarity of Q' with the relevant and irrelevant documents respectively. The
new story is judged as relevant if
S_R > α · S_IR. (3)
where α is a scaling parameter in the range (0, 1). Since the number of news stories in
the irrelevant category is usually much higher than the number of relevant stories, there is an
automatic bias towards the irrelevant category. The scaling parameter α adjusts for this
bias by reducing the similarity to the irrelevant category.
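The alerting step can be sketched with NumPy as follows. The sketch assumes the m × 2 term-document matrix has the relevant training document as its first column and the irrelevant one as its second, that a story is projected with the transpose of the first two left singular vectors, and that the decision rule compares the similarity to the relevant document against α times the similarity to the irrelevant one. The function names, the toy vocabulary, and the term weights are hypothetical. This is an illustration under those assumptions rather than the system's actual implementation.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

def judge_relevance(T, q, alpha=0.5):
    """Return (is_relevant, s_r, s_ir) for the term vector q of a new story.

    T: m x 2 term-document matrix; column 0 is the relevant training
       document, column 1 the irrelevant one (assumed layout).
    alpha: scaling parameter in (0, 1) applied to the irrelevant score.
    """
    U, w, Vt = np.linalg.svd(T, full_matrices=False)  # T = U diag(w) Vt
    U2 = U[:, :2]            # first two columns of U
    docs = U2.T @ T          # 2 x 2: training documents in concept space
    q_hat = U2.T @ q         # new story projected into concept space
    s_r = cosine(q_hat, docs[:, 0])
    s_ir = cosine(q_hat, docs[:, 1])
    return bool(s_r > alpha * s_ir), s_r, s_ir

# Toy vocabulary (hypothetical): ["bomb", "attack", "cricket", "film"]
T = np.array([[0.6, 0.0],
              [0.4, 0.1],
              [0.0, 0.5],
              [0.0, 0.4]])
story_a = np.array([1.0, 1.0, 0.0, 0.0])  # uses terms of the relevant document
story_b = np.array([0.0, 0.0, 1.0, 1.0])  # uses terms of the irrelevant document
print(judge_relevance(T, story_a))
print(judge_relevance(T, story_b))
```

Raising alpha makes the condition harder to satisfy, so fewer stories are flagged as relevant.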
4 Experimental Results
Fig. 2 presents some snapshots. On the left is a portion of the topic dictionary created
for a two-week time period ending on 9th June, 2009. Users select topics to indicate
preferences for receiving alerts. On the right, an alert generated for the user is shown.
Fig. 2. Snapshots from the system
Fig. 3. (a) Precision and Recall with varying topic cutoff and fixed alpha=0.5 (b) Precision and
Recall with varying alpha and topic cutoff = 0.6
Since user preferences are given as topics, it is observed that the accuracy of
recognizing new stories as relevant depends on (a) the size of the training set and (b) the
scaling factor α. The size of the training set is controlled using the topic cut-off threshold.
Fig. 3 shows how system performance varies with these two parameters. With a fixed
alpha, it is observed that recall falls slightly with higher values of topic cut-off,
though precision does not suffer. A higher topic cut-off reduces the training set size,
causing the recall to fall. Sharper trends are observed with a fixed topic cut-off and
changing values of alpha. Recall falls drastically with higher values of alpha, since
very few stories are now picked as relevant, though the precision is very high. The best
results are observed for alpha and topic cut-off set to 0.5 and 0.6 respectively.
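The two quantities discussed here can be made concrete with a small sketch. It assumes each story carries a topic strength vector and that a story enters the relevant training set when the strength of the selected topic exceeds the cut-off. Precision and recall are computed in the standard way from the set of alerted stories and the set of truly relevant ones. The data layout and function names are hypothetical.

```python
def split_by_cutoff(stories, topic_index, cutoff):
    """Partition stories into (relevant, irrelevant) training sets.

    A story is placed in the relevant set when the strength of the
    chosen topic exceeds the cut-off (assumed selection rule).
    """
    relevant = [s for s in stories if s["topics"][topic_index] > cutoff]
    irrelevant = [s for s in stories if s["topics"][topic_index] <= cutoff]
    return relevant, irrelevant

def precision_recall(alerted_ids, truly_relevant_ids):
    """Standard precision and recall of the alerts against ground truth."""
    alerted, truth = set(alerted_ids), set(truly_relevant_ids)
    hits = len(alerted & truth)
    precision = hits / len(alerted) if alerted else 0.0
    recall = hits / len(truth) if truth else 0.0
    return precision, recall

# Toy stories with two-topic strength vectors (hypothetical data).
stories = [
    {"id": 1, "topics": [0.9, 0.1]},
    {"id": 2, "topics": [0.7, 0.3]},
    {"id": 3, "topics": [0.5, 0.5]},
    {"id": 4, "topics": [0.2, 0.8]},
]
for cutoff in (0.4, 0.6):
    relevant, irrelevant = split_by_cutoff(stories, topic_index=0, cutoff=cutoff)
    print(cutoff, len(relevant), len(irrelevant))

print(precision_recall(alerted_ids=[1, 2, 3], truly_relevant_ids=[2, 3, 4, 5]))
```

In this toy data, raising the cut-off from 0.4 to 0.6 moves one story out of the relevant training set, which mirrors how a higher cut-off shrinks the training data.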
5 Conclusion
In this work, we have presented a news analysis system which generates
user-preference-based alerts. Preferences are for topics and concepts rather than
pre-defined categories. The news repository is analyzed and indexed topically. We are
currently working on integrating a mining and analytical platform with this system to
enhance its prediction and tracking capabilities.
References
1. Yang, Y., Liu, X.: A re-examination of text categorization methods. In: Proceedings of
22nd ACM SIGIR, California (1999)
2. Allan, J., Papka, R., Lavrenko, V.: On-line New Event Detection and Tracking. In:
Proceedings of 21st ACM SIGIR, Melbourne (1998)
3. Lloyd, L., Kechagias, D., Skiena, S.: Lydia: A System for Large-Scale News Analysis. In:
Consens, M.P., Navarro, G. (eds.) SPIRE 2005. LNCS, vol. 3772, pp. 161–166. Springer,
Heidelberg (2005)
4. Bacan, H., Pandzic, I.S., Gulija, D.: Automated News Item Categorization. In: JSAI (2005)
5. Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent Dirichlet Allocation. Journal of Machine
Learning Research 3, 993–1022 (2003)
6. Landauer, T.K., Foltz, P.W., Laham, D.: An Introduction to Latent Semantic Analysis.
Discourse Processes 25, 259–284 (1998)
7. Yamron, J.P., Carp, I., Gillick, L., Lowe, S., Van Mulbregt, P.: Topic Tracking in a News
Stream. In: Proceedings of DARPA Broadcast News Workshop (1999)
8. Mori, M., Miura, T., Shioya, I.: Topic Detection and Tracking for News Web Pages. In:
IEEE/WIC/ACM International Conference on Web Intelligence, pp. 338–342 (2006)
9. Fukumoto, F., Suzuki, Y.: Topic tracking based on bilingual comparable corpora and
semi-supervised clustering. ACM Transactions on Asian Language Information Processing 6(3)
(2007)
10. Kuhns, R.J.: A News Analysis System. In: Proc. of 12th International Conference on
Computational Linguistics, COLING 1988, vol. 1, pp. 351–355 (1988)