1 Text Classification and Clustering
2 Recall Text Analytics Techniques
Pipeline: Text source (social media or other content, documents in any format) → Preprocessing (tokens: unigram, bigram, etc.) → Word representation (vector model, word embedding or feature extraction) → Learning methods (classifier, clustering methods) → Result/Prediction (target document, summary, sentiment, topic)
3 Text preprocessing
Recall from last lesson, there is a need to do preprocessing before any
analysis.
Source* → 1. Tokenization → 2. Normalization → 3. Stop words removal → 4. Removal → 5. Stemming and/or Lemmatization → 6. Replacing → Tokens
*Source can be a sentence, a paragraph or a document
Some of these processes (shown in grey on the original slide) may or may not be needed
4 Word Representation - VSM
In the vector space model, the term vector for an object of interest (a paragraph, a document or a document collection) is a vector in which each dimension represents the weight of a given word in that object.
An illustration of term vectors for many documents, showing their TF-IDF values.
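As a rough illustration (not part of the original slides), such term vectors can be built with scikit-learn's TfidfVectorizer; the toy documents below are made up for the example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus, made up for illustration
docs = [
    "this party was fun",
    "this was a fun game",
    "the game was close",
]

# Each row of the matrix is the term vector of one document;
# each column holds the TF-IDF weight of one vocabulary word.
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())  # the vocabulary (the dimensions)
print(tfidf.toarray())                     # TF-IDF weights per document
```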
5 Learning methods
In order to extract patterns or information from the word representation, a set of algorithms is needed.
These algorithms can be categorised into:
• Supervised learning (focus of this module)
• Unsupervised learning
• Semi-supervised learning
• Ensemble learning
• Hybrid learning
6 Supervised learning
Dataset : labelled data – 1) training data, 2) validation data, 3) testing data
It is a machine learning technique that infers a function or uses a classifier to
learn from the training data in order to predict unseen data
The following is the basic process (a code sketch follows the list):
• Gather data
• Annotate data – the most tedious yet important step
• Process data
• Feature generation and engineering
• Choose a classifier
• Train on the training data (and use the validation data to tune the parameters of the classifier)
• Test on the testing data
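A minimal sketch of this workflow, assuming scikit-learn and a tiny hypothetical labelled dataset (the texts and labels below are placeholders, not real data):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Hypothetical annotated data (gather + annotate)
texts = ["great fun party", "boring and slow", "really enjoyable game",
         "terrible waste of time", "loved every minute", "awful dull plot"]
labels = [1, 0, 1, 0, 1, 0]  # 1 = positive, 0 = negative

# Process data and generate features (TF-IDF here)
X = TfidfVectorizer().fit_transform(texts)

# Hold out part of the data for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.5, random_state=0, stratify=labels)

# Choose and train a classifier (validation data would be used to tune it)
clf = LogisticRegression().fit(X_train, y_train)

# Evaluate on the held-out test data
print(classification_report(y_test, clf.predict(X_test)))
```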
7 Result of Classifiers
In general, a classifier handles a two-class dataset. For example, to
classify a tweet as having either positive or negative sentiment
The classification is called hard if a label is explicitly assigned (positive
sentiment)
It is called a soft classification when a probability value is assigned (e.g., 78%
positive, 22% negative)
There are also multi-class classifiers that handle data of more than two
classes. For example, strong positive, positive, neutral, negative, strong
negative
The common evaluation metrics for text classification are precision, recall
and F1 scores.
8 Evaluation Metrics
Recall vs Precision – inversely related
Recall (sensitivity) = |{relevant} ∩ {retrieved}| / |{relevant}|
Ratio of the number of relevant records retrieved to the total number of relevant records in the database
Precision (positive predictive value) = |{relevant} ∩ {retrieved}| / |{retrieved}|
Ratio of the number of relevant records retrieved to the total number of relevant and irrelevant records retrieved
Recall and precision are usually expressed as percentages.
https://en.wikipedia.org/wiki/Precision_and_recall
9 Evaluation Metrics
Balanced F-score or F-measure or F1
Harmonic mean of precision and recall
Defined as
F1 = 2 × (Precision × Recall) / (Precision + Recall)
F-score provides a measure of retrieval accuracy without bias toward
precision or recall
The higher the F-score, the better the retrieval method or algorithm
10 Confusion Matrix
Labelled Positive Labelled Negative
Predicted Positive TRUE POSITIVE (TP) FALSE POSITIVE (FP)
Predicted Negative FALSE NEGATIVE (FN) TRUE NEGATIVE (TN)
Assume a dataset with 10 labelled positive and 1,000 labelled negative examples
TP: 3, FN: 7, TN: 900, FP: 100
Accuracy: (TP + TN) / (TP + FN + TN + FP) => 903/1010 = 0.894
Precision: TP / (TP + FP) => 3/103 = 0.029
Recall: TP / (TP + FN) => 3/10 = 0.3
F1 score: 2 * (precision * recall)/(precision + recall) = 0.0174/0.329 = 0.053
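These figures can be checked with a few lines of plain Python (the counts are copied from the worked example above):

```python
# Counts from the worked example above
tp, fn, tn, fp = 3, 7, 900, 100

accuracy = (tp + tn) / (tp + fn + tn + fp)           # 903 / 1010 ≈ 0.894
precision = tp / (tp + fp)                           # 3 / 103 ≈ 0.029
recall = tp / (tp + fn)                              # 3 / 10 = 0.3
f1 = 2 * precision * recall / (precision + recall)   # ≈ 0.053

print(accuracy, precision, recall, f1)
```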
11 Text Classification
In text classification, the observations are documents and the classes are
document categories.
For example, determine whether an email is spam or not (two categories/classes)
or assign a topic to a news article (multiple categories)
How to represent a document?
doc = (w1, w2, …., wn)
For simplicity, we ignore the order of words, POS, possible concepts but just
focus on the words.
The representation is called the Bag-of-Words model
We may assume that some words occur more frequently in a specific
category than others.
Examples of text classifiers are the Naïve Bayes classifier, Logistic Regression and
Support Vector Machines
12 Naïve Bayes Classifier
One of the simplest and most widely used classifiers
Models the distribution of documents in each class using a probabilistic
model, assuming that the distributions of different terms are independent
of each other
The naïve assumption is clearly false in real world application, but Naïve
Bayes works surprisingly well
Under the bag-of-words view, “this was a fun party”,
“this party was fun” and
“party fun was this” are all treated the same.
http://www.itshared.org/2015/03/naive-bayes-on-apache-flink.html
13 Being Naïve
Since the Naïve Bayes assumption is that all words in a doc are independent,
P(a very close game|Sports) = P(a|Sports) x P(very|Sports) x P(close|Sports) x P(game|Sports)
Assume the task is to classify whether the doc “a very close game” belongs to
the “Sports” or the “Not Sports” category
Based on the labelled data, we learn the probability P(game|Sports) ->
count how many times the word “game” appears in the Sports data and divide
by the total number of words in the Sports category.
Do the same for the Not Sports category; the bigger probability gives the
assigned category
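A minimal sketch of that counting, using two made-up training sentences (one per category) and add-one (Laplace) smoothing so that unseen words do not zero out the product (the smoothing is an addition, not part of the slide):

```python
from collections import Counter

# Hypothetical labelled data: word counts per category
sports_words = "a great game the game was very close".split()
not_sports_words = "the election was over clean and fair".split()
sports_counts, not_sports_counts = Counter(sports_words), Counter(not_sports_words)

def class_likelihood(doc_words, counts, vocab_size):
    """P(doc | class) under the naive independence assumption,
    with add-one smoothing so unseen words do not give zero."""
    total = sum(counts.values())
    p = 1.0
    for w in doc_words:
        p *= (counts[w] + 1) / (total + vocab_size)
    return p

doc = "a very close game".split()
vocab_size = len(set(sports_words) | set(not_sports_words))

p_sports = class_likelihood(doc, sports_counts, vocab_size)
p_not_sports = class_likelihood(doc, not_sports_counts, vocab_size)

# The category with the bigger probability is assigned (equal priors assumed here)
print("Sports" if p_sports > p_not_sports else "Not Sports")
```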
14 Logistic Regression
Logistic regression is an algorithm for binary or two-class classification.
Predict a score for each doc
Threshold the score (e.g., cut off at 0.5) -> classification
φ(z) = 1 / (1 + e^(-z)) is known as the logistic sigmoid function and it outputs values between 0 and 1, which we can use to model and predict a class/category.
If φ(z) is larger than 0.5 (or equivalently if z is larger than 0), the doc will be classified as class 1 (and class 0 otherwise)
15 Logistic Regression
x is essentially the doc vector (either TFIDF or TF weighting)
The activation function is the logistic sigmoid function
https://www.quora.com/Why-is-logistic-regression-considered-a-linear-model
16 Logistic Regression (LR) – an example
Assume a doc has 4 words – word1, word2, word3, word4
and its corresponding TF doc vector is x = [1, 2, 3, 4]
Assuming the LR weight vector is w = [0.5, 0.5, 0.5, 0.5]
Let's compute z:
z = w^T x = 1*0.5 + 2*0.5 + 3*0.5 + 4*0.5 = 5
The result: φ(z=5) = 1 / (1 + e^(-5)) ≈ 0.993
99.3% chance that this doc belongs to class 1!
Logistic regression tends to have higher accuracy with more training data.
Naïve Bayes can have an advantage when the training data is small.
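The arithmetic above can be reproduced in a few lines of Python (the weights and term frequencies are the made-up values from the example):

```python
import math

x = [1, 2, 3, 4]          # doc vector (term frequencies) from the example
w = [0.5, 0.5, 0.5, 0.5]  # LR weight vector from the example

z = sum(wi * xi for wi, xi in zip(w, x))  # z = w^T x = 5
phi = 1 / (1 + math.exp(-z))              # logistic sigmoid, ≈ 0.993

print(z, phi)  # classify as class 1 if phi > 0.5, else class 0
```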
17 Support Vector Machine (SVM)
SVM is a supervised learning method for two-class classification. (It can be
adapted to handle multi-class problems, e.g., by iterating a one-vs-all approach.)
It separates a labelled {+1, -1} or annotated training data via a hyperplane
that is maximally distant from the positive and negative samples
respectively.
This optimally separating hyperplane in the feature space corresponds to a
nonlinear decision boundary in the input space.
18 SVM Kernel trick
The idea is that the data may not be linearly separable in our ‘n’-dimensional
space but may be linearly separable in a higher-dimensional space.
Due to the use of kernels, SVM is quite robust to high dimensionality, i.e.,
learning is almost independent of the dimensionality of the feature space.
The vector space model used for text data is often sparse and high-dimensional,
which makes text an ideal fit for SVM.
https://towardsdatascience.com/understanding-the-kernel-trick-e0bc6112ef78
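A minimal sketch of a linear SVM on sparse TF-IDF features, assuming scikit-learn; the documents and labels are placeholders:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Hypothetical labelled documents ({+1, -1} labels, as described above)
docs = ["great match and a close game", "parliament passed the new bill",
        "the team won the championship", "the election results were announced"]
labels = [1, -1, 1, -1]

vec = TfidfVectorizer()
X = vec.fit_transform(docs)        # sparse, high-dimensional feature space

clf = LinearSVC().fit(X, labels)   # linear kernel: a common choice for text

print(clf.predict(vec.transform(["a very close game"])))
```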
19 Review Questions
What is the difference between a binary classifier and a multi-class
classifier?
Can you convert an SVM to a multi-class classifier?
20 Unsupervised learning
Unsupervised learning methods are techniques to find hidden structure in
unlabelled data.
There is no “training phase” as in supervised learning
Clustering and topic modelling are the two commonly used unsupervised
learning algorithms in the context of text data.
Clustering is the task of segmenting a collection of documents into
partitions where documents in the same group (cluster) are more similar to
each other than those in other clusters.
In topic modelling, a probabilistic model is used to determine a soft
clustering. A topic is like a cluster and the membership of a document to a
topic is probabilistic.
21 Text Clustering
Clustering is the task of finding groups of similar documents in a
collection of documents.
Text clustering can be done at different levels of granularity, where clusters
can be documents, paragraphs, sentences or terms.
It is the main technique used to organise documents to enhance retrieval.
The types of algorithms are:
Distanced-based clustering algorithm
Partitioning algorithm
Probabilistic clustering algorithm
22 Unique challenges for text clustering
Text representation has a very large dimensionality, but the
underlying data is sparse. For example, the size of the vocabulary
can be of the order of 10^5 but a document may have only a few
hundred words. Imagine if the document is a tweet!
How to capture the concepts within collection of documents?
Words of the vocabulary are commonly correlated with each other.
Algorithms used should take the word correlation into consideration.
Since words are the “deciding factors” in differentiating documents,
normalising document representations is important
23 Distanced-based Clustering algorithms
It is based on a similarity function to measure the closeness between text
documents.
One example is hierarchical clustering (a code sketch follows the list):
Top-down (divisive)
Start with all documents in one cluster and split it into sub-clusters
Bottom-up (agglomerative)
Each document starts as an individual cluster; similar clusters are then merged
Single Linkage Clustering – highest similarity between any pair of documents from the two groups
Group-Average Linkage Clustering – average similarity between pairs of documents
Complete Linkage Clustering – worst-case similarity between any pair of documents
https://www.youtube.com/watch?v=EUQY3hL38cw
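A minimal sketch of bottom-up (agglomerative) clustering over TF-IDF vectors, assuming scikit-learn; the linkage argument switches between the single, average and complete criteria listed above, and the documents are placeholders:

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical documents to cluster
docs = ["the game was close", "a great game to watch",
        "the election results", "parliament passed the bill"]

X = TfidfVectorizer().fit_transform(docs).toarray()

# linkage can be "single", "average" or "complete"
# (note: 'metric' is called 'affinity' in older scikit-learn versions)
model = AgglomerativeClustering(n_clusters=2, metric="cosine", linkage="average")
print(model.fit_predict(X))  # cluster label for each document
```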
27 Partitioning Clustering
K-means clustering algorithm is widely used.
Partition n documents into k clusters. k is defined by user.
Main disadvantage of k-means clustering is the initial choice of k.
It is an iterative process to find the “optimal centres” and groupings
(Video: K Means Clustering – Georgia Tech – Machine Learning)
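A minimal k-means sketch over TF-IDF vectors, assuming scikit-learn; k = 2 and the documents are chosen purely for illustration:

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the game was close", "a great game to watch",
        "the election results", "parliament passed the bill"]

X = TfidfVectorizer().fit_transform(docs)

# k (n_clusters) must be chosen by the user -- the main drawback noted above
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

print(kmeans.labels_)           # cluster assigned to each document
print(kmeans.cluster_centers_)  # the "optimal centres" found iteratively
```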
28 Probabilistic Clustering
Topic modelling is one of the most popular probabilistic clustering
algorithms.
The idea is to create a probabilistic generative model for the corpus of text
documents.
The documents are mixtures of topics, and a topic is a probability distribution over words.
29 Topic modelling
An unsupervised learning technique that takes in documents (in vector
format) and a parameter – the number of topics (k)
Documents (observable) → Topics (latent) → Words/Tokens (observable)
Documents are composed of many topics
Topics are composed of many words (tokens)
Two main topic modelling methods:
Probabilistic Latent Semantic Analysis (pLSA)
Latent Dirichlet Allocation (LDA)
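A minimal LDA sketch, assuming scikit-learn; the number of topics k and the toy corpus are placeholders:

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the game was a close match", "the team won the game",
        "the election results were announced", "parliament debated the new bill"]

vec = CountVectorizer(stop_words="english")
X = vec.fit_transform(docs)  # documents as token counts (observable)

# k = 2 topics (latent)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

print(lda.transform(X))   # per-document topic mixture (soft clustering)
print(lda.components_)    # per-topic word weights (unnormalised distribution)
```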
30 Semi-supervised learning
It is a type of supervised learning that also makes use of unlabelled data for
training.
Since it makes use of supervised learning, it uses a small amount of labelled
data with a large amount of unlabelled data.
Prerequisite – the distribution of examples, which the unlabelled data will
help to elucidate, must be relevant for the classification problem.
It works with some assumptions:
The semi-supervised smoothness assumption – if two points are linked by a path
of high density (e.g., belong to the same cluster), then their outputs are likely to
be close.
The cluster assumption – if points are in the same cluster, they are likely to be of
the same class (use the labelled points to assign a class to each cluster)
31 Semi-supervised learning in practice
In speech recognition, it costs little to record huge amounts of speech, but
labelling it requires a human to listen to it and type a transcript.
Webpage classification – billions of webpages are available, but classifying
them reliably requires humans to read and verify them
Protein function prediction through 3D structure – protein sequences can
be acquired at industrial speed (by genome sequencing etc.), but to
resolve a 3D structure and to determine the function may require years of
scientific work.
Since unlabelled data carry less information than labelled data, they are
required in large amounts to increase prediction accuracy.
32 Ensemble learning
Ensemble learning helps improve machine learning (supervised learning)
results by combining several models or a diverse set of learners.
This approach produces better predictive performance than a single
model. It is commonly used in competitions, like Kaggle
Common errors in learning models:
Bias error – quantifies how much, on
average, the predicted values
differ from the actual values. High bias
– the model keeps missing important trends
Variance – quantifies how predictions
made on the same observation
differ from each other. High variance will over-fit and
perform badly on unseen data
https://www.youtube.com/watch?v=XW2YPKXt3n4
33 Ensemble learning
A champion model should maintain a balance between these two types of
errors. This is known as the bias-variance trade-off.
Ensemble learning is one way to manage this trade-off.
Common techniques
Bagging
Implements similar learners
on small sample populations
(bootstrap samples) and takes
the average of all the
predictions.
34 Ensemble learning
Boosting:
An iterative technique which adjusts the weight of an observation based on the
previous classification. If it was classified incorrectly, the weight of the observation
is increased, and vice versa.
Decreases bias error
Stacking: uses a learner to combine the output from different learners
Choosing the right ensemble can be an art.
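A minimal bagging sketch, assuming scikit-learn; each base learner (a decision tree by default) is trained on a bootstrap sample and the predictions are combined, and the data is the same made-up set used earlier:

```python
from sklearn.ensemble import BaggingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer

texts = ["great fun party", "boring and slow", "really enjoyable game",
         "terrible waste of time", "loved every minute", "awful dull plot"]
labels = [1, 0, 1, 0, 1, 0]

X = TfidfVectorizer().fit_transform(texts)

# 10 similar learners, each fitted on a bootstrap sample; predictions are voted/averaged
bag = BaggingClassifier(n_estimators=10, random_state=0).fit(X, labels)
print(bag.predict(X))
```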
35 Hybrid learning
Combining knowledge-based approach with machine learning to achieve
better performance.
What are knowledge-based approaches?
Use of a knowledge base (e.g., databases, lexicons)
Use of complex structured and semi-structured information (e.g., relationships,
concepts)
Mainly crafted from human knowledge and derived rules (can be more tedious
and is considered “less cool” in this AI (or machine learning) age)
They still have their value, especially in text analytics.
36 Hybrid learning
The simplest sentiment analysis combines polarity lexicons with a trained
classifier (using labelled training data).
Applied in scarce-resource languages like Singlish.
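A rough sketch of the idea, combining a made-up polarity lexicon (the knowledge-based part) with a trained classifier (the machine-learning part); the lexicon, texts and labels are all hypothetical:

```python
import numpy as np
from scipy.sparse import hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Made-up polarity lexicon (knowledge-based component)
lexicon = {"shiok": 1, "nice": 1, "sian": -1, "terrible": -1}

def lexicon_score(text):
    return sum(lexicon.get(w, 0) for w in text.split())

texts = ["the food was shiok", "service so sian", "nice place to chill", "terrible queue"]
labels = [1, 0, 1, 0]

# Machine-learning component: TF-IDF features plus the lexicon score as an extra column
vec = TfidfVectorizer()
X = hstack([vec.fit_transform(texts),
            np.array([[lexicon_score(t)] for t in texts])]).tocsr()

clf = LogisticRegression().fit(X, labels)
print(clf.predict(X))
```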
37 Hybrid learning
English Sentic Pattern:
Polarity-reversing rule based on English negation terms such as “not”, “couldn’t”,
“shldnt”
Handling of adversative terms such as “but”: only the polarity of the second part
is considered. For example, “this bag is nice but expensive”
A Multilingual Semi-supervised approach in Deriving Singlish Sentic Patterns for Polarity Detection. S. L. Lo, E. Cambria, R. Chiong and D. Cornforth.
Knowledge Based Systems 105 (2016), 236-247
38 Which method(s) to use?
Predict if the new email is a spam
Decide which category to assign a document
Recommend a list of social media followers as target audience
39 Any Questions?
We have covered:
• Supervised and Unsupervised learning
• Supervised learning
• Naïve Bayes
• Logistic Regression
• Support Vector Machine
• Unsupervised learning
• Distance-based Clustering
• Partitioning Clustering
• Probabilistic Clustering
• Ensemble learning
• Hybrid learning