1 Text Classification and Clustering
2 Recall Text Analytics Techniques
Pipeline: Text source (social media or other content, documents in any format) → Preprocessing (tokens: unigram, bigram, etc.) → Word representation (vector model, word embedding or feature extraction) → Learning methods (classifier, clustering methods) → Result/Prediction (target document, summary, sentiment, topic)
3 Text preprocessing
Recall from last lesson, there is a need to do preprocessing before any
analysis.
Source* → 1. Tokenization → 2. Normalization → 3. Stop words removal → 4. Removal → 5. Stemming and/or Lemmatization → 6. Replacing → Tokens
*Source can be a sentence, a paragraph or a document
Some of these processes (shown in grey on the original slide) may or may not be needed
4 Word Representation - VSM
In the vector space model, the term vector for an object of interest (a paragraph, a document or a document collection) is a vector in which each dimension represents the weight of a given word in that object.
An illustration of term vectors for many documents, showing their TF-IDF values.
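As a rough illustration (not part of the original slides), such term vectors can be built with scikit-learn's TfidfVectorizer; the toy documents below are made up for the example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus, made up for illustration
docs = [
    "this party was fun",
    "this was a fun game",
    "the game was close",
]

# Each row of the matrix is the term vector of one document;
# each column holds the TF-IDF weight of one vocabulary word.
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())  # the vocabulary (the dimensions)
print(tfidf.toarray())                     # TF-IDF weights per document
```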
5 Learning methods
In order to extract patterns or information from the word representation, a set of algorithms is needed.
These algorithms can be categorised into:
• Supervised learning (focus of this module)
• Unsupervised learning
• Semi-supervised learning
• Ensemble learning
• Hybrid learning
6 Supervised learning
Dataset : labelled data – 1) training data, 2) validation data, 3) testing data
It is a machine learning technique that infers a function or uses a classifier to
learn from the training data in order to predict unseen data
The following is the basic process (a code sketch follows the list):
• Gather data
• Annotate data – the most tedious yet important step
• Process data
• Feature generation and engineering
• Choose a classifier
• Train on the training data (and use the validation data to tune the parameters of the classifier)
• Test on the testing data
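A minimal sketch of this workflow, assuming scikit-learn and a tiny hypothetical labelled dataset (the texts and labels below are placeholders, not real data):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Hypothetical annotated data (gather + annotate)
texts = ["great fun party", "boring and slow", "really enjoyable game",
         "terrible waste of time", "loved every minute", "awful dull plot"]
labels = [1, 0, 1, 0, 1, 0]  # 1 = positive, 0 = negative

# Process data and generate features (TF-IDF here)
X = TfidfVectorizer().fit_transform(texts)

# Hold out part of the data for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.5, random_state=0, stratify=labels)

# Choose and train a classifier (validation data would be used to tune it)
clf = LogisticRegression().fit(X_train, y_train)

# Evaluate on the held-out test data
print(classification_report(y_test, clf.predict(X_test)))
```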
7 Result of Classifiers
In general, a classifier handles a two-class dataset. For example, to
classify a tweet as having either positive or negative sentiment
The classification is called hard if a label is explicitly assigned (positive
sentiment)
It is called a soft classification when a probability value is assigned (e.g., 78%
positive, 22% negative)
There are also multi-class classifiers that handle data of more than two
classes. For example, strong positive, positive, neutral, negative, strong
negative
The common evaluation metrics for text classification are precision, recall
and F1 scores.
8 Evaluation Metrics
Recall vs Precision – inversely related
Recall (sensitivity) = |{relevant} ∩ {retrieved}| / |{relevant}|
Ratio of the number of relevant records retrieved to the total number of relevant records in the database
Precision (positive predictive value) = |{relevant} ∩ {retrieved}| / |{retrieved}|
Ratio of the number of relevant records retrieved to the total number of relevant and irrelevant records retrieved
Recall and precision are usually expressed as percentages.
https://en.wikipedia.org/wiki/Precision_and_recall
9 Evaluation Metrics
Balanced F-score or F-measure or F1
Harmonic mean of precision and recall
Defined as
F1 = 2 × (Precision × Recall) / (Precision + Recall)
F-score provides a measure of retrieval accuracy without bias toward
precision or recall
The higher the F-score, the better the retrieval method or algorithm
10 Confusion Matrix
Labelled Positive Labelled Negative
Predicted Positive TRUE POSITIVE (TP) FALSE POSITIVE (FP)
Predicted Negative FALSE NEGATIVE (FN) TRUE NEGATIVE (TN)
Assume a dataset with 10 labelled positive and 1,000 labelled negative examples
TP: 3, FN: 7, TN: 900, FP: 100
Accuracy: (TP + TN) / (TP + FN + TN + FP) => 903/1010 = 0.894
Precision: TP / (TP + FP) => 3/103 = 0.029
Recall: TP / (TP + FN) => 3/10 = 0.3
F1 score: 2 * (precision * recall)/(precision + recall) = 0.0174/0.329 = 0.053
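These figures can be checked with a few lines of plain Python (the counts are copied from the worked example above):

```python
# Counts from the worked example above
tp, fn, tn, fp = 3, 7, 900, 100

accuracy = (tp + tn) / (tp + fn + tn + fp)           # 903 / 1010 ≈ 0.894
precision = tp / (tp + fp)                           # 3 / 103 ≈ 0.029
recall = tp / (tp + fn)                              # 3 / 10 = 0.3
f1 = 2 * precision * recall / (precision + recall)   # ≈ 0.053

print(accuracy, precision, recall, f1)
```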
11 Text Classification
In text classification, the observations are documents and the classes are
document categories.
For example, determine whether an email is spam or not (two categories/classes)
or assign a topic to a news article (multiple categories)
How to represent a document?
doc = (w1, w2, …., wn)
For simplicity, we ignore the order of words, POS, possible concepts but just
focus on the words.
The representation is called the Bag-of-Words model
We may assume that some words occur more frequently in a specific
category than others.
Examples of text classifiers are the Naïve Bayes classifier, Logistic Regression and
Support Vector Machines
12 Naïve Bayes Classifier
One of the simplest and most widely used classifiers
Models the distribution of documents in each class using a probabilistic
model, assuming that the distributions of different terms are independent
of each other
The naïve assumption is clearly false in real world application, but Naïve
Bayes works surprisingly well
Under the bag-of-words view, “this was a fun party”,
“this party was fun” and
“party fun was this” are all treated the same.
http://www.itshared.org/2015/03/naive-bayes-on-apache-flink.html
13 Being Naïve
Since the Naïve Bayes assumption is that all words in a doc are independent,
P(a very close game|Sports) = P(a|Sports) x P(very|Sports) x P(close|Sports) x P(game|Sports)
Assume the task is to classify whether the doc “a very close game” belongs to
the “Sports” or the “Not Sports” category
Based on the labelled data, we learn the probability P(game|Sports) ->
count how many times the word “game” appears in the Sports data and divide
by the total number of words in the Sports category.
Do the same for the Not Sports category; the bigger probability gives the
assigned category
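A minimal sketch of that counting, using two made-up training sentences (one per category) and add-one (Laplace) smoothing so that unseen words do not zero out the product (the smoothing is an addition, not part of the slide):

```python
from collections import Counter

# Hypothetical labelled data: word counts per category
sports_words = "a great game the game was very close".split()
not_sports_words = "the election was over clean and fair".split()
sports_counts, not_sports_counts = Counter(sports_words), Counter(not_sports_words)

def class_likelihood(doc_words, counts, vocab_size):
    """P(doc | class) under the naive independence assumption,
    with add-one smoothing so unseen words do not give zero."""
    total = sum(counts.values())
    p = 1.0
    for w in doc_words:
        p *= (counts[w] + 1) / (total + vocab_size)
    return p

doc = "a very close game".split()
vocab_size = len(set(sports_words) | set(not_sports_words))

p_sports = class_likelihood(doc, sports_counts, vocab_size)
p_not_sports = class_likelihood(doc, not_sports_counts, vocab_size)

# The category with the bigger probability is assigned (equal priors assumed here)
print("Sports" if p_sports > p_not_sports else "Not Sports")
```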
14 Logistic Regression
Logistic regression is an algorithm for binary or two-class classification.
Predict a score for each doc
Threshold the score (e.g., cut off at 0.5) -> classification
φ(z) = 1 / (1 + e^(-z)) is known as the logistic sigmoid function and it outputs values between 0 and 1, which we can use to model and predict a class/category.
If φ(z) is larger than 0.5 (or equivalently if z is larger than 0), the doc will be classified as class 1 (and class 0 otherwise)
15 Logistic Regression
x is essentially the doc vector (either TFIDF or TF weighting)
The activation function is the logistic sigmoid function
https://www.quora.com/Why-is-logistic-regression-considered-a-linear-model
16 Logistic Regression (LR) – an example
Assume a doc has 4 words – word1, word2, word3, word4
and its corresponding TF doc vector is x = [1, 2, 3, 4]
Assuming the LR weight vector is w = [0.5, 0.5, 0.5, 0.5]
Let's compute z:
z = w^T x = 1*0.5 + 2*0.5 + 3*0.5 + 4*0.5 = 5
The result: φ(z=5) = 1 / (1 + e^(-5)) ≈ 0.993
99.3% chance that this doc belongs to class 1!
Logistic regression tends to have higher accuracy with more training data.
Naïve Bayes can have an advantage when the training data is small.
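The arithmetic above can be reproduced in a few lines of Python (the weights and term frequencies are the made-up values from the example):

```python
import math

x = [1, 2, 3, 4]          # doc vector (term frequencies) from the example
w = [0.5, 0.5, 0.5, 0.5]  # LR weight vector from the example

z = sum(wi * xi for wi, xi in zip(w, x))  # z = w^T x = 5
phi = 1 / (1 + math.exp(-z))              # logistic sigmoid, ≈ 0.993

print(z, phi)  # classify as class 1 if phi > 0.5, else class 0
```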
17 Support Vector Machine (SVM)
SVM is a supervised learning method for two-class classification. (It can be
adapted to handle multi-class problems, e.g., by iterating a one-vs-all approach.)
It separates a labelled {+1, -1} or annotated training data via a hyperplane
that is maximally distant from the positive and negative samples
respectively.
This optimally separating hyperplane in the feature space corresponds to a
nonlinear decision boundary in the input space.
18 SVM Kernel trick
The idea is that the data may not be linearly separable in our ‘n’-dimensional
space but may be linearly separable in a higher-dimensional space.
Due to the use of kernels, SVM is quite robust to high dimensionality, i.e.,
learning is almost independent of the dimensionality of the feature space.
The vector space model used for text data is often sparse and high-dimensional,
which makes text an ideal fit for SVM.
https://towardsdatascience.com/understanding-the-kernel-trick-e0bc6112ef78
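A minimal sketch of a linear SVM on sparse TF-IDF features, assuming scikit-learn; the documents and labels are placeholders:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Hypothetical labelled documents ({+1, -1} labels, as described above)
docs = ["great match and a close game", "parliament passed the new bill",
        "the team won the championship", "the election results were announced"]
labels = [1, -1, 1, -1]

vec = TfidfVectorizer()
X = vec.fit_transform(docs)        # sparse, high-dimensional feature space

clf = LinearSVC().fit(X, labels)   # linear kernel: a common choice for text

print(clf.predict(vec.transform(["a very close game"])))
```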
19 Review Questions
What is the difference between a binary classifier and a multi-class
classifier?
Can you convert an SVM to a multi-class classifier?
20 Unsupervised learning
Unsupervised learning methods are techniques to find hidden structure in
unlabelled data.
There is no “training phase” as in supervised learning
Clustering and topic modelling are the two commonly used unsupervised
learning algorithms in the context of text data.
Clustering is the task of segmenting a collection of documents into
partitions where documents in the same group (cluster) are more similar to
each other than those in other clusters.
In topic modelling, a probabilistic model is used to determine a soft
clustering. A topic is like a cluster and the membership of a document to a
topic is probabilistic.
21 Text Clustering
Clustering is the task of finding groups of similar documents in a
collection of documents.
Text clustering can be done at different levels of granularity, where clusters
can be documents, paragraphs, sentences or terms.
It is the main technique used to organise documents to enhance retrieval.
The types of algorithms are:
Distanced-based clustering algorithm
Partitioning algorithm
Probabilistic clustering algorithm
22 Unique challenges for text clustering
Text representation has a very large dimensionality, but the
underlying data is sparse. For example, the size of the vocabulary
can be of the order of 10^5 but a document may have only a few
hundred words. Imagine if the document is a tweet!
How to capture the concepts within collection of documents?
Words of the vocabulary are commonly correlated with each other.
Algorithms used should take the word correlation into consideration.
Since words are the “deciding factors” in differentiating documents,
normalising document representations is important
23 Distanced-based Clustering algorithms
It is based on a similarity function to measure the closeness between text
documents.
One example is hierarchical clustering (a code sketch follows the list):
Top-down (divisive)
Start with all documents in one cluster and split it into sub-clusters
Bottom-up (agglomerative)
Each document starts as an individual cluster; similar clusters are then merged
Single Linkage Clustering – highest similarity between any pair of documents from the two groups
Group-Average Linkage Clustering – average similarity between pairs of documents
Complete Linkage Clustering – worst-case similarity between any pair of documents
https://www.youtube.com/watch?v=EUQY3hL38cw
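A minimal sketch of bottom-up (agglomerative) clustering over TF-IDF vectors, assuming scikit-learn; the linkage argument switches between the single, average and complete criteria listed above, and the documents are placeholders:

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical documents to cluster
docs = ["the game was close", "a great game to watch",
        "the election results", "parliament passed the bill"]

X = TfidfVectorizer().fit_transform(docs).toarray()

# linkage can be "single", "average" or "complete"
# (note: 'metric' is called 'affinity' in older scikit-learn versions)
model = AgglomerativeClustering(n_clusters=2, metric="cosine", linkage="average")
print(model.fit_predict(X))  # cluster label for each document
```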
27 Partitioning Clustering
K-means clustering algorithm is widely used.
Partition n documents into k clusters. k is defined by user.
Main disadvantage of k-means clustering is the initial choice of k.
It is an iterative process to find the “optimal centres” and groupings
(Video: K Means Clustering – Georgia Tech – Machine Learning)
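A minimal k-means sketch over TF-IDF vectors, assuming scikit-learn; k = 2 and the documents are chosen purely for illustration:

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the game was close", "a great game to watch",
        "the election results", "parliament passed the bill"]

X = TfidfVectorizer().fit_transform(docs)

# k (n_clusters) must be chosen by the user -- the main drawback noted above
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

print(kmeans.labels_)           # cluster assigned to each document
print(kmeans.cluster_centers_)  # the "optimal centres" found iteratively
```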
28 Probabilistic Clustering
Topic modelling is one of the most popular probabilistic clustering
algorithms.
The idea is to create a probabilistic generative model for the corpus of text
documents.
The documents are mixtures of topics, and a topic is a probability distribution over words.
29 Topic modelling
An unsupervised learning technique that takes in documents (in vector
format) and a parameter – the number of topics (k)
Documents (observable) → Topics (latent) → Words/Tokens (observable)
Documents are composed of many topics
Topics are composed of many words (tokens)
Two main topic modelling methods:
Probabilistic Latent Semantic Analysis (pLSA)
Latent Dirichlet Allocation (LDA)
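A minimal LDA sketch, assuming scikit-learn; the number of topics k and the toy corpus are placeholders:

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the game was a close match", "the team won the game",
        "the election results were announced", "parliament debated the new bill"]

vec = CountVectorizer(stop_words="english")
X = vec.fit_transform(docs)  # documents as token counts (observable)

# k = 2 topics (latent)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

print(lda.transform(X))   # per-document topic mixture (soft clustering)
print(lda.components_)    # per-topic word weights (unnormalised distribution)
```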
30 Semi-supervised learning
It is a type of supervised learning that also makes use of unlabelled data for
training.
Since it makes use of supervised learning, it uses a small amount of labelled
data with a large amount of unlabelled data.
Prerequisite – the distribution of examples, which the unlabelled data will
help to elucidate, must be relevant for the classification problem.
It works with some assumptions:
The semi-supervised smoothness assumption – if two points are linked by a path
of high density (e.g., belong to the same cluster), then their outputs are likely to
be close.
The cluster assumption – if points are in the same cluster, they are likely to be of
the same class (use the labelled points to assign a class to each cluster)
31 Semi-supervised learning in practice
In speech recognition, it costs little to record huge amounts of speech, but
labelling it requires a human to listen to it and type a transcript.
Webpage classification – billions of webpages are available, but classifying
them reliably requires humans to read and verify them
Protein function prediction through 3D structure – protein sequences can
be acquired at industrial speed (by genome sequencing etc.), but to
resolve a 3D structure and to determine the function may require years of
scientific work.
Since unlabelled data carry less information than labelled data, they are
required in large amounts to increase prediction accuracy.
32 Ensemble learning
Ensemble learning helps improve machine learning (supervised learning)
results by combining several models or a diverse set of learners.
This approach produces better predictive performance than a single
model. It is commonly used in competitions, like Kaggle
Common errors in learning models:
Bias error – quantifies how much, on
average, the predicted values
differ from the actual values. High bias
– the model keeps missing important trends
Variance – quantifies how predictions
made on the same observation
differ from each other. High variance will over-fit and
perform badly on unseen data
https://www.youtube.com/watch?v=XW2YPKXt3n4
33 Ensemble learning
A champion model should maintain a balance between these two types of
errors. This is known as the bias-variance trade-off.
Ensemble learning is one way to manage this trade-off.
Common techniques
Bagging
Implements similar learners
on small sample populations
(bootstrap samples) and takes
the average of all the
predictions.
34 Ensemble learning
Boosting:
An iterative technique which adjusts the weight of an observation based on the
previous classification. If it was classified incorrectly, the weight of the observation
is increased, and vice versa.
Decreases bias error
Stacking: uses a learner to combine the output from different learners
Choosing the right ensemble can be an art.
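A minimal bagging sketch, assuming scikit-learn; each base learner (a decision tree by default) is trained on a bootstrap sample and the predictions are combined, and the data is the same made-up set used earlier:

```python
from sklearn.ensemble import BaggingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer

texts = ["great fun party", "boring and slow", "really enjoyable game",
         "terrible waste of time", "loved every minute", "awful dull plot"]
labels = [1, 0, 1, 0, 1, 0]

X = TfidfVectorizer().fit_transform(texts)

# 10 similar learners, each fitted on a bootstrap sample; predictions are voted/averaged
bag = BaggingClassifier(n_estimators=10, random_state=0).fit(X, labels)
print(bag.predict(X))
```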
35 Hybrid learning
Combining knowledge-based approach with machine learning to achieve
better performance.
What are knowledge-based approaches?
Use of a knowledge base (e.g., databases, lexicons)
Use of complex structured and semi-structured information (e.g., relationships,
concepts)
Mainly crafted from human knowledge and derived rules (can be more tedious
and is considered “less cool” in this AI (or machine learning) age)
They still have their value, especially in text analytics.
36 Hybrid learning
The simplest sentiment analysis combines polarity lexicons with a trained
classifier (using labelled training data).
Applied in scarce-resource languages like Singlish.
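A rough sketch of the idea, combining a made-up polarity lexicon (the knowledge-based part) with a trained classifier (the machine-learning part); the lexicon, texts and labels are all hypothetical:

```python
import numpy as np
from scipy.sparse import hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Made-up polarity lexicon (knowledge-based component)
lexicon = {"shiok": 1, "nice": 1, "sian": -1, "terrible": -1}

def lexicon_score(text):
    return sum(lexicon.get(w, 0) for w in text.split())

texts = ["the food was shiok", "service so sian", "nice place to chill", "terrible queue"]
labels = [1, 0, 1, 0]

# Machine-learning component: TF-IDF features plus the lexicon score as an extra column
vec = TfidfVectorizer()
X = hstack([vec.fit_transform(texts),
            np.array([[lexicon_score(t)] for t in texts])]).tocsr()

clf = LogisticRegression().fit(X, labels)
print(clf.predict(X))
```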
37 Hybrid learning
English Sentic Pattern:
Polarity-reversing rule based on English negation terms such as “not”, “couldn’t”,
“shldnt”
Handling of adversative terms such as “but”: only the polarity of the second part
is considered. For example, “this bag is nice but expensive”
A Multilingual Semi-supervised approach in Deriving Singlish Sentic Patterns for Polarity Detection. S. L. Lo, E. Cambria, R. Chiong and D. Cornforth.
Knowledge Based Systems 105 (2016), 236-247
38 Which method(s) to use?
Predict if the new email is a spam
Decide which category to assign a document
Recommend a list of social media followers as target audience
39 Any Questions?
We have covered:
• Supervised and Unsupervised learning
• Supervised learning
• Naïve Bayes
• Logistic Regression
• Support Vector Machine
• Unsupervised learning
• Distance-based Clustering
• Partitioning Clustering
• Probabilistic Clustering
• Ensemble learning
• Hybrid learning