Generalized Model Perform

This document presents a model for trending topic analysis of Bangla language newspapers. It collects data from two popular Bangladeshi newspapers and performs natural language processing techniques like removing stop words and applying n-gram analysis. Unigram, bigram, and trigram frequencies are calculated and the most frequent topics are identified as trending topics. Topic trends are visualized over time through generated graphs. The goal is to analyze changing topic trends in Bangladeshi newspapers and understand the situations and viewpoints reflected in the newspapers.

A Generalized Model to Perform Trending Topic Analysis on Bangla Newspaper

Syed Mehedi Hasan Nirob (1), Md. Kazi Nayeem (2) and Md. Saiful Islam (3)

1 Syed Mehedi Hasan Nirob is with the Department of Computer Science and Engineering, Shahjalal University of Science and Technology, Sylhet-3114, Bangladesh [email protected]
2 Md. Kazi Nayeem is with the Department of Computer Science and Engineering, Shahjalal University of Science and Technology, Sylhet-3114, Bangladesh [email protected]
3 Md. Saiful Islam is with the Department of Computer Science and Engineering, Shahjalal University of Science and Technology, Sylhet-3114, Bangladesh [email protected]

Abstract: The trending topics of a newspaper are an indicator for understanding the situation of a country and also a way to evaluate the particular newspaper. This thesis is about the news trends of Bangla newspapers. Topics that are discussed more frequently than others in Bangla newspapers are selected, and we demonstrate how a very popular topic loses its importance over time while another topic takes its place. Data from two popular Bangla newspapers were collected with date and time. The data can contain many related words, all of which are equally likely to be a news trend. Words related to each other and having similar meaning were combined using word vector similarity. A trend graph can then be plotted for a specific keyword. N-gram frequency can also give more reliable results. Additional statistical analysis was performed on these data. The data were categorized into subjects such as technology, politics and sports. Finding category-wise news trends, or a list of news trends on a daily or weekly basis, is also possible with enough data. Trending topic analysis can answer many questions, such as whether a newspaper is politically biased or how much attention a specific topic gets in that newspaper. Patterns can also be found in news trends. A comparison of past news trends of Bangla newspapers gives a visualization of the situation of Bangladesh, and this visualization will be helpful for predicting future trending topics of Bangla newspapers.

Index Terms: News trend, Newspaper, Frequency, N-gram, Categorization

I. INTRODUCTION

Trending topic analysis is essentially about spotting a pattern or trend in collected data. Trending topics vary from time to time and from place to place. It is the people who decide which topic will dominate for a certain time or be most discussed. The current trend for a particular area can be detected with satisfactory precision if we survey data that represent the current situation of that area. With the expansion of social media such as Facebook and Twitter, this type of data is becoming more and more available. Newspapers are another option that can help in this case.

Online newspaper content is growing rapidly over time along with its audience. Daily newspapers update news on a daily basis and people can find their news online. Statistical analysis of these newspaper data can reveal the situation of a particular country. The result of this type of analysis on a newspaper cannot guarantee us [...]

In Bangladesh, newspapers have always been a popular and important source of information, because in the past a large number of people in our country had no internet access. At present, however, almost all Bangladeshi newspapers have an online version. In this thesis we propose a generalized model to find the most discussed topics of Bangla newspapers. We build this model using data from Prothom Alo [1] and Kaler Kantho [2], but the model is applicable to any Bangla newspaper. Statistical analysis of Bangla words is the main concern here.

Newspaper trending topics change with time, driven by political change, economic and social change, people's behavior and so on. We try to find a pattern in these changes and develop a model that measures the popularity of different subjects in Bangla newspapers.

II. RELATED WORK

Trending topic analysis is specifically about finding a way to extract, from a stream of data, the most used and meaningful keywords that are popular among people. News trend analysis for newspapers written in major languages such as English, Spanish and French is not a new concept. However, research has mostly been done with data from online and social media such as Twitter, Google and Wikipedia. There is a comprehensive study of these three major online and social media streams [3].

L. M. Aiello et al. compared six topic detection methods for sensing Twitter trends on Twitter datasets related to major events [4]. In 2012 Rong Lu and Qing Yang worked on Twitter trend analysis, including reason analysis for news topics [5]. In social media a trending topic is not constant; it rises and decays after some time [6]. At any particular time we can easily look at the most popular topics on Twitter. Real-time trending topic detection, or streaming trend detection using previous statistics, has been done with Twitter data [7]. Twitter hashtag trend analysis is another effective approach that works only with hashtagged topics and their popularity [8]. Trending topics on Twitter do not all belong to the same category. Real-time classification of Twitter trending topics is another important research work that was done in 2013 [9]. These trending topics can also be categorized into subjects such as science, sports, politics, technology and music. In 2011 Lee, K., Palsetia, D., Narayanan, R., Patwary, M.M.A., Agrawal, A. and Choudhary, A. classified Twitter trending topics into 18 general categories [10]. They
used two approaches for topic classification: a Bag-of-Words approach for text classification and network-based classification.

For the Bangla language, however, there is no such work on trending topic analysis. There has been some analysis and observation of a Bangla news corpus that is not directly related to trending topic analysis [11], but these are important concepts for working with a Bangla language corpus. There has also been some statistical analysis of Bangla newspaper data.

III. METHODOLOGY

We collected data from two popular Bangladeshi newspapers, Prothom Alo and Kaler Kantho. We separated each word and excluded stop words. Then we computed unigram statistics on these data using the chi-squared test [12]. In this process we ranked keywords with a higher chi-squared value at the top, and for equal chi-squared values we considered their frequency. We marked them as trending topic candidates for a week. Later we followed the same strategy for bigrams and trigrams, this time for two and three consecutive words. Keywords in the topmost places were more relevant as trending topics. We then picked some keywords that have a very high probability of becoming trending topics and plotted comparison graphs using them. We manually clustered keywords having similar meaning in our news data to visualize some comparisons.

IV. EXPERIMENT AND PERFORMANCE

A. Data Collection and Preprocessing

We collected data from the online versions of the two most popular Bangla newspapers, Prothom Alo and Kaler Kantho. News data was collected using a web crawler. It systematically browsed the websites of these two newspapers and stored the data relevant to our thesis work.

The raw data collected from Prothom Alo are not structured and not appropriate for statistical analysis, so we need to build a corpus for our work. We made a structured set of texts containing news from Prothom Alo separated by date.

We then separate each word so that we can work on individual words. For each date we now have a list of words. In every language there are some common words that are used more frequently and are not very meaningful compared with keywords, so they cannot be trending topic candidates. We filter these stop words from our corpus.

We collected four large lists of Bangla stop words and combined them. The final stop word list contains approximately 500 words. After excluding these words from our main corpus, the corpus becomes more reliable and gives better performance. We then use a Bangla stemmer [13] to find the root of each word. The performance of this stemmer is not very satisfactory, so we work with the unstemmed words but also make a graph using root words for comparison.

B. Unigram

First we count the frequency of each separated word. This simple unigram frequency count gives some interesting results, but most of the high-frequency words are irrelevant and do not make sense as trending topics. For each keyword we compute

$$\chi^2 = \frac{(\text{observed} - \text{expected})^2}{\text{expected}}$$

Here,

observed = frequency of a keyword
expected = average of the frequencies

Different keywords can have the same chi-squared value. In this case we consider frequency and plot the unigram trends.

Fig. 1: Trending topics of Prothom Alo for every week from January 1, 2010 to April 30, 2010 using unigrams

From this graph we can see the weekly trending topics of Prothom Alo for four months. We can also visualize the relative popularity of each topic from this graph.

C. Bigram

After applying unigrams to our news data we observed that more than one unigram can have the same value in the chi-squared test, and some trending topics do not make sense. To avoid these we use bigrams.

We count the frequency of two adjacent keywords. In the chi-squared test we now find a distinct value for each bigram, and the result is better than for unigrams.
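
The counting and scoring steps described in this section can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`tokenize`, `ngrams`, `rank_ngrams`), the simple whitespace-and-punctuation tokenizer and the four-word stop list are hypothetical stand-ins for the crawled corpus and the merged list of roughly 500 stop words. The sketch takes the expected value to be the average n-gram frequency, as in the unigram formula, and breaks chi-squared ties by frequency.

```python
from collections import Counter

# Hypothetical mini stop list; the paper merges four lists (about 500 words).
STOP_WORDS = {"এবং", "ও", "এই", "করে"}


def tokenize(text):
    """Split text into words (assumed tokenizer: strip punctuation, split on spaces)."""
    for ch in "।,;:!?()\"'":
        text = text.replace(ch, " ")
    return text.split()


def ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def rank_ngrams(texts, n=1, stop_words=STOP_WORDS):
    """Rank n-grams by chi-squared value, then by frequency, highest first.

    Returns a list of (ngram, frequency, chi_squared) tuples.
    """
    counts = Counter()
    for text in texts:
        # Stop words are removed before n-grams are formed, as in the paper.
        tokens = [t for t in tokenize(text) if t not in stop_words]
        # N-grams are built per article (an assumption of this sketch).
        counts.update(ngrams(tokens, n))
    if not counts:
        return []

    # expected = average frequency over all distinct n-grams
    expected = sum(counts.values()) / len(counts)
    scored = [
        (gram, freq, (freq - expected) ** 2 / expected)
        for gram, freq in counts.items()
    ]
    # Higher chi-squared first; equal values are ordered by frequency.
    scored.sort(key=lambda item: (item[2], item[1]), reverse=True)
    return scored


if __name__ == "__main__":
    week = ["নির্বাচন কমিশন এবং নির্বাচন কমিশন", "নির্বাচন কমিশন ও ক্রিকেট"]
    for gram, freq, chi2 in rank_ngrams(week, n=2)[:3]:
        print(" ".join(gram), freq, round(chi2, 3))
```

Calling `rank_ngrams` once per week of articles with `n=1`, `n=2` or `n=3` would give that week's unigram, bigram or trigram candidates. Note that because the score depends only on the distance from the average, a rare n-gram can receive the same value as a frequent one, which is why the frequency tie-break matters.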
This is a table showing the chi-squared values of some selected bigrams with their frequencies. From this table we can see that the chi-squared value of each bigram is quite different and clearly distinguishable. We use this information to plot the graph.

Fig. 2: Trending topics of Prothom Alo for every week from January 1, 2010 to April 30, 2010 using bigrams

If we observe the bigram trending topics of Bangla newspapers we can see that they are more pertinent than the unigram trending topics. We did not cluster our data. Besides, some unigrams have a higher frequency but are not actually trending topic candidates. Some of them can be considered stop words of the Bangla language. For bigrams this problem diminished dramatically.

D. Trigram

Frequency measurement of bigrams showed better performance in finding the trending topics of Bangla newspapers. Next we count the frequency of three adjoining keywords, or trigrams. We follow a technique similar to the one we used to find bigram trending topics.

Fig. 3: Trending topics of Prothom Alo for every week from January 1, 2010 to April 30, 2010 for trigrams

From the chi-squared values and frequencies of our 5 selected trigrams, we can see that the topics with the top values and frequencies are almost the same as the bigram keywords. For some weeks, however, trigrams produce better results than bigrams. We also find the peak frequency of the same trigram in different places. That is why, for our non-clustered data, trigrams are more suitable than unigrams or even bigrams.

E. Analysis and Comparison for Specific Keyword

Newspaper trends can be classified into different categories, such as Entertainment, Economics, Sports and Education. The trending topics for each of these categories may vary. We can also select a universal trend for a specific time. Now we can plot graphs by providing manually clustered keywords. These keywords can be related to gender, politics or the cultural world. This analysis will reveal the inner philosophy of Bangla newspapers. Do they provide politically biased opinions? Do they represent common people? These questions can be answered by this specific keyword analysis of Bangla newspaper data.

Fig. 4: Trending topics of Prothom Alo for every week from January 1, 2010 to April 30, 2010 for bigrams

This graph compares male and female in different categories. It shows that in the crime and accident categories males are discussed most in Prothom Alo. In other categories females are discussed more frequently than males. In sports males and females get almost equal priority. From this graph we can say that females get more attention than males in Bangla newspapers.

This table shows closely related words used for the male-female comparison. Some of those words are synonyms of each other and some have the same contextual meaning. We also made a similar table for male and clustered them.

F. Performance

Our model for analyzing trending topics of Bangla newspapers is the very first model of this kind. So, the result after applying
our method was average. Bangla word processing is not an easy task and there are not enough resources for processing the data so that statistical analysis can be performed on it. After analyzing our output, we observed that for unigrams our model shows relevant trending topics in 4 weeks out of 10 on average. This performance becomes much better if we consider bigrams: we get relevant and meaningful trending topics in * weeks out of ** weeks. Performance for trigrams is slightly better than for bigrams: we get relevant trending topics in * weeks out of ** weeks in this case.

V. CONCLUSIONS

In this paper we designed a generalized model that can be used to perform trending topic analysis for Bangla newspapers. Trending topics of Bangla newspapers follow some specific patterns. If we analyze these patterns we will also be able to predict future trending topics or the future situation of our country. That is why trending topic analysis of Bangla newspaper data matters. Improvements to our model can make its performance much better, for example clustering keywords with similar contextual meaning and then performing statistical analysis on each cluster.

REFERENCES

[1] Prothom Alo website: http://www.prothom-alo.com/
[2] Kaler Kantho website: http://www.kalerkantho.com/
[3] Tim Althoff, Damian Borth, Jörn Hees and Andreas Dengel, "Analysis and forecasting of trending topics in online media streams," MM '13: Proceedings of the 21st ACM International Conference on Multimedia, pp. 907-916, ACM, New York, NY, USA, 2013.
[4] L. M. Aiello et al., "Sensing Trending Topics in Twitter," IEEE Transactions on Multimedia, vol. 15, no. 6, pp. 1268-1282, Oct. 2013.
[5] Rong Lu and Qing Yang, "Trend Analysis of News Topics on Twitter," International Journal of Machine Learning and Computing, vol. 2, no. 3, pp. 327-332, 2012.
[6] Sitaram Asur, Bernardo A. Huberman, Gabor Szabo and Chunyan Wang, "Trends in social media: Persistence and decay," available at SSRN 1755748, 2011.
[7] J. Benhardus and J. Kalita, "Streaming trend detection in Twitter," International Journal of Web Based Communities, vol. 9, no. 1, pp. 122-139, 2013.
[8] Z. Ma, A. Sun and G. Cong, "On predicting the popularity of newly emerging hashtags in Twitter," Journal of the American Society for Information Science and Technology, vol. 64, no. 7, pp. 1399-1410, 2013.
[9] A. Zubiaga, D. Spina, R. Martinez and V. Fresno, "Real-time classification of Twitter trends," Journal of the Association for Information Science and Technology, vol. 66, no. 3, pp. 462-473, 2015.
[10] K. Lee, D. Palsetia, R. Narayanan, M.M.A. Patwary, A. Agrawal and A. Choudhary, "Twitter trending topic classification," in 2011 IEEE 11th International Conference on Data Mining Workshops, pp. 251-258, IEEE, December 2011.
[11] Khair Md Majumder and Yasir Arafat, "Analysis of and observations from a Bangla News Corpus," 2006.
[12] Priscilla E. Greenwood and Michael S. Nikulin, "A guide to chi-squared testing," vol. 280, John Wiley and Sons, 1996.
[13] Bangla Stemmer GitHub link: https://github.com/rafi-kamal/Bangla-Stemmer
