Topic Models from Twitter Hashtags
Authors December 2, 2013
Introduction
Topic modelling consists of extracting a mathematical model from a set of semantically related documents. Ideally, the model must characterize the semantic knowledge a set of training examples has in common, that is, the information that makes the set relevant to the topic. Identifying the information that relates documents to topics is hard due to the complexity of natural language, and the difficulty tends to grow when documents are generated in colloquial, informal environments such as social networks or microtexts like SMS messages or tweets. The main difficulties stem from the informal style of documents from such domains and the ephemeral nature of the topics, events and opinions discussed. While traditional topic modelling deals with more formal documents in almost static domains (such as news or scientific articles, where the semantics of a given topic does not change considerably over short periods), a set of brief user-generated texts is noisy and its semantic components change rapidly because multiple points of view refer to the same facts or events. Most successful topic modelling techniques rely on statistical information about the presence of words in certain contexts, and their applicability rests on assumptions that may not be satisfied in other domains.

In this work, we address the problem of capturing topic models from Twitter messages. We follow an approach based on the assumption that implicit relations exist between topic-related words even when those words do not appear in the same contexts with high frequency. These latent associations are supported by the idea that the relations connecting words can be broader than those considered by other techniques. By broader relations we mean that a pair of words may be related contextually through others in a transitive manner: if w_i relates to w_j and w_j to w_k, then w_i relates to w_k.
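The transitive intuition above can be illustrated with a minimal sketch. This is not the paper's actual model: it simply treats each directly co-occurring word pair as an edge in a graph and takes the transitive closure, so that words never seen together in the same context still become associated through intermediaries. The word pairs below are invented examples.

```python
from collections import defaultdict

def transitive_associations(pairs):
    """For each word, return all words reachable through chains of co-occurrence."""
    graph = defaultdict(set)
    for a, b in pairs:          # co-occurrence is symmetric
        graph[a].add(b)
        graph[b].add(a)
    closure = {}
    for start in graph:
        seen, stack = set(), [start]
        while stack:            # depth-first search from each word
            w = stack.pop()
            for nxt in graph[w]:
                if nxt != start and nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        closure[start] = seen
    return closure

# "goal" and "stadium" never co-occur directly, yet both co-occur with "match",
# so the transitive closure relates them.
direct = [("goal", "match"), ("match", "stadium")]
assoc = transitive_associations(direct)
print("stadium" in assoc["goal"])  # True: related only transitively
```

In a realistic setting the edges would of course be weighted by co-occurrence statistics rather than treated as binary, and chains would be damped with length; this sketch only captures the transitivity idea itself.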
The transitive property of latent associations also allows us to attack the problem of the temporal decay of models. After some time, the set of words related to an event tends to change: words that keep weak (infrequent) relations with the topic are discarded, while words whose relation with it has recently grown stronger are included as new elements. This temporal variation in the vocabulary associated with a given event is a consequence of the temporal evolution of events and opinions; it is hard to capture with traditional static modelling, whereas the proposed method allows us to better identify relations over larger time spans.

To test our ideas, we developed an experimental setup consisting of a stream filtering environment in which socially-labelled messages produced before a given time t become positive training examples. One-class classifiers are trained with these examples, representing them with different types of features; these classifiers are our models. In turn, the classifiers are used to filter (or classify) an incoming stream of user-generated messages created after t and represented in the same feature space, labelling them as positive if relevant to the topic and negative otherwise. The experiments are also intended to show the temporal decay of models that use traditional features: after some time, such classifiers become unable to identify relevant messages with the accuracy they achieve when filtering documents generated shortly after training, due to the evolution of topics. We show the advantage of the proposed method through plots in which this behaviour is observable and through a decay index. In the last part of our work, we relate the observed results to a proposed measure that quantifies the dispersion or broadness of topics over time, the Stream Broadness Index.
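The filtering protocol described above can be sketched with a toy one-class filter. This is not the paper's classifier: it merely learns a bag-of-words centroid from positive pre-t messages and accepts a later message when its cosine similarity to that centroid exceeds a threshold. The training messages, test messages and the 0.2 threshold are all invented for illustration.

```python
import math
from collections import Counter

def centroid(docs):
    """Sum bag-of-words counts over all positive training documents."""
    c = Counter()
    for d in docs:
        c.update(d.lower().split())
    return c

def cosine(a, b):
    common = set(a) & set(b)
    num = sum(a[w] * b[w] for w in common)
    den = (math.sqrt(sum(v * v for v in a.values()))
           * math.sqrt(sum(v * v for v in b.values())))
    return num / den if den else 0.0

def make_filter(train_docs, threshold=0.2):
    """One-class decision rule: accept if similarity to the centroid >= threshold."""
    model = centroid(train_docs)
    return lambda doc: cosine(model, Counter(doc.lower().split())) >= threshold

# Hypothetical positive messages collected before time t for one hashtag.
train = ["great match tonight",
         "what a goal in that match",
         "match goes to penalties"]
is_relevant = make_filter(train)
print(is_relevant("another goal in the match"))    # True: on-topic
print(is_relevant("buy cheap phones online now"))  # False: off-topic
```

Under the protocol above, the messages after t would arrive as a stream and each predicted label would be compared against the social (hashtag) label to measure how accuracy decays as the stream moves further from t.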
The Stream Broadness Index is intended to quantify the semantic variation over time of a given topic corpus, and it shows that variations in the performance of a classifier are clearly affected by time and by intrinsic characteristics of the corpus. The results of our experiments show that our proposed representation outperforms other classic approaches in the proposed evaluation framework, exhibiting more persistent behaviour over time, which reduces decay and improves precision over longer periods. Moreover, the intrinsic analysis confirms that our method is applicable in scenarios where the semantic component of a topic-labelled data stream changes over time.

The remainder of this article is organized as follows. Section 2 reviews related work, emphasizing approaches that attempt to tackle filtering, recommendation and classification problems in highly dynamic domains. Section 3 contains a formalization of the problem, while Section 4 presents our proposed approach, including its theoretical foundations in stochastic neural networks, and reviews the training algorithms; the proposed evaluation framework is also presented there. In Section 5 the experimental setup and results are discussed in two stages: the filtering test related to the temporal decay, and the intrinsic corpus analysis. Finally, in Section 6 we present conclusions and some future directions for this work.