Learning Objectives
Learners will be able to…
Describe how text similarity metrics are calculated
Calculate Euclidean distance between texts
Calculate Jaccard Similarity between texts
Calculate Cosine Similarity between texts
Make Sure to Know
Intermediate Python.
Limitations
In this assignment, we only work with the scikit-learn library.
Jaccard Similarity Coefficient Score
Various text similarity metrics exist, such as Cosine similarity, Euclidean distance, and Jaccard similarity.
If we consider two documents, document A and document B, the Jaccard score (sometimes called the similarity coefficient) is calculated as follows:

$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$$
The numerator, or top part of the fraction, is the intersection between the documents. In Jaccard, it is literally the count of words that are in both documents:

$$|A \cap B|$$
The denominator, or bottom part of the fraction, is the union between the documents. In Jaccard, it is literally the count of all unique words that are in either document:

$$|A \cup B|$$
Example calculation of Jaccard Score
Let’s consider the following documents:
1. Bunnies like to eat lettuce more than carrots.
2. Fish like to play with bubbles while swimming.
Let’s start by removing stop words, punctuation and capitalization:
1. bunnies like eat lettuce carrots
2. fish like play bubbles swimming
The intersection between these two documents is the single word *like*, so the intersection count is 1.

The union between these two documents, or the count of unique words across both, is 9.
The Jaccard score is then calculated as follows:

$$J = \frac{1}{9} \approx 0.111$$
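Before turning to scikit-learn, here is a minimal sketch of the same calculation with plain Python sets (the variable names `doc_a` and `doc_b` are ours, used only for illustration):

```python
# The cleaned documents as sets of words.
doc_a = {"bunnies", "like", "eat", "lettuce", "carrots"}
doc_b = {"fish", "like", "play", "bubbles", "swimming"}

intersection = doc_a & doc_b  # {'like'} -> 1 word in both documents
union = doc_a | doc_b         # 9 unique words across both documents

print(len(intersection) / len(union))  # 0.1111111111111111
```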
Programming Jaccard Score with sklearn.metrics.jaccard_score
To get started, we need to vectorize the text. Because jaccard_score expects binary (0/1) vectors, we will use CountVectorizer with binary=True instead of TfidfVectorizer:
```python
from sklearn.feature_extraction.text import CountVectorizer

documents = ["Bunnies like to eat lettuce more than carrots.",
             "Fish like to play with bubbles while swimming."]

vectorizer = CountVectorizer(stop_words='english', binary=True)
X = vectorizer.fit_transform(documents)
```
Then, we simply call jaccard_score on the vector representation of each document:

```python
from sklearn.metrics import jaccard_score

print(jaccard_score(X.toarray()[0], X.toarray()[1]))
```
You should get 0.1111111111111111 - the same value we calculated above.
Try out different documents to see what their similarity score is!
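You can also inspect the vocabulary the vectorizer built – each position of the vectors corresponds to one of these words. This is a minimal sketch, assuming scikit-learn 1.0+ where the method is called get_feature_names_out:

```python
# Show the learned vocabulary in alphabetical order.
print(vectorizer.get_feature_names_out())
# ['bubbles' 'bunnies' 'carrots' 'eat' 'fish' 'lettuce' 'like' 'play' 'swimming']
```

This is the same vocabulary used in the hand calculations below.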
Euclidean Distance
Euclidean distance is something you have probably seen in a math class – the basic idea behind it is the Pythagorean theorem ($a^2 + b^2 = c^2$ for the sides $a$ and $b$ of a right triangle, where $c$ is the hypotenuse).
On a two-dimensional grid, this equation looks slightly different, as each side of the right triangle is the difference along either the x or y dimension:

$$c^2 = (x_2 - x_1)^2 + (y_2 - y_1)^2$$

The resulting equation for distance in two dimensions is then:

$$d = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2}$$
You can generalize this to three dimensions by using the resulting hypotenuse of the two-dimensional figure as one side and the third dimension as the second side. The resulting equation for distance in three dimensions is then:

$$d = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2 + (z_2 - z_1)^2}$$
After 3 dimensions, distance gets hard to visualize, but you can generalize the equation to $n$ dimensions:

$$d(p, q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2}$$
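As a quick illustration of the general formula, here is a minimal sketch of $n$-dimensional Euclidean distance in plain Python (the function name euclidean is ours, not a library function):

```python
import math

def euclidean(p, q):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

print(euclidean((0, 0), (3, 4)))  # 5.0 -- the classic 3-4-5 right triangle
```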
Example calculation of Euclidean Distance
Let’s consider the following documents again:
1. Bunnies like to eat lettuce more than carrots.
2. Fish like to play with bubbles while swimming.
Let’s start by removing stop words, punctuation and capitalization:
1. bunnies like eat lettuce carrots
2. fish like play bubbles swimming
Let’s assume the following vocabulary:
['bubbles' 'bunnies' 'carrots' 'eat' 'fish' 'lettuce' 'like'
'play' 'swimming']
We then have the following two 9-dimensional vectors to represent the documents:

[0, 1, 1, 1, 0, 1, 1, 0, 0]
[1, 0, 0, 0, 1, 0, 1, 1, 1]

The vectors differ in 8 of the 9 positions (they agree only on *like*), and each difference is 1, so the Euclidean distance between them is:

$$d = \sqrt{8} \approx 2.83$$
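To double-check that by hand, here is a minimal NumPy sketch using the two vectors above (NumPy is our choice here; scikit-learn already depends on it):

```python
import numpy as np

a = np.array([0, 1, 1, 1, 0, 1, 1, 0, 0])
b = np.array([1, 0, 0, 0, 1, 0, 1, 1, 1])

# Square the differences, sum them, and take the square root.
print(np.sqrt(np.sum((a - b) ** 2)))  # 2.8284271247461903
```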
Programming Euclidean Distance with sklearn.metrics.pairwise.euclidean_distances
To get started, we need to vectorize the text just like before:
```python
vectorizer = CountVectorizer(stop_words='english', binary=True)
X = vectorizer.fit_transform(documents)
```
Then, we simply call euclidean_distances on the vector representations:

```python
from sklearn.metrics.pairwise import euclidean_distances

print(euclidean_distances(X))
```
You should see the following output:
```
[[0.         2.82842712]
 [2.82842712 0.        ]]
```
The way to read this: the top-left entry is the distance between the first vector and the first vector – so it makes sense that the distance to itself is 0.

The top-right and bottom-left entries both represent the distance between the first vector and the second vector. We see that the code got the same 2.83 value we calculated above.
The bottom-right number is the distance between the second vector and
the second vector – so it makes sense that the distance to itself is 0.
This representation makes more sense when you are comparing more than
two documents – try adding a third document!
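For example, here is a sketch with a hypothetical third document added (the third sentence is ours, invented purely for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import euclidean_distances

documents = [
    "Bunnies like to eat lettuce more than carrots.",
    "Fish like to play with bubbles while swimming.",
    "Bunnies like to eat carrots while swimming.",  # hypothetical third document
]

X = CountVectorizer(stop_words='english', binary=True).fit_transform(documents)
print(euclidean_distances(X))  # a symmetric 3x3 matrix with zeros on the diagonal
```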
Cosine Similarity
If Euclidean distance is the straight-line distance between vectors, Cosine similarity is the angular distance between vectors:

![Cosine similarity as the angle between two vectors](.guides/img/cosine)
The formula is:

$$\cos(\theta) = \frac{A \cdot B}{\|A\| \, \|B\|}$$
Example calculation of Cosine Similarity
Let’s consider the following documents again:
1. Bunnies like to eat lettuce more than carrots.
2. Fish like to play with bubbles while swimming.
Let’s start by removing stop words, punctuation and capitalization:
1. bunnies like eat lettuce carrots
2. fish like play bubbles swimming
Let’s assume the following vocabulary:
['bubbles' 'bunnies' 'carrots' 'eat' 'fish' 'lettuce' 'like'
'play' 'swimming']
We then have the following two 9-dimensional vectors to represent the documents:

[0, 1, 1, 1, 0, 1, 1, 0, 0]
[1, 0, 0, 0, 1, 0, 1, 1, 1]

The dot product $A \cdot B$ is 1 (the vectors overlap only on *like*), and each vector has length $\sqrt{5}$, so:

$$\cos(\theta) = \frac{1}{\sqrt{5} \cdot \sqrt{5}} = \frac{1}{5} = 0.2$$
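Here is a minimal NumPy sketch that verifies this by hand (again, NumPy is our choice of tool for the check):

```python
import numpy as np

a = np.array([0, 1, 1, 1, 0, 1, 1, 0, 0])
b = np.array([1, 0, 0, 0, 1, 0, 1, 1, 1])

# Dot product divided by the product of the vector lengths (norms).
print(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))  # 0.2 (up to rounding)
```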
Programming Cosine Similarity with sklearn.metrics.pairwise.cosine_similarity
To get started, we need to vectorize the text just like before:
```python
vectorizer = CountVectorizer(stop_words='english', binary=True)
X = vectorizer.fit_transform(documents)
```
Then, we simply call cosine_similarity on the vector representations:

```python
from sklearn.metrics.pairwise import cosine_similarity

print(cosine_similarity(X))
```
You should see the following output:
```
[[1.  0.2]
 [0.2 1. ]]
```
The way to read this: the top-left entry is the cosine between the first vector and the first vector. Cosine ranges from 1 to -1, where $\cos(0°) = 1$ and $\cos(180°) = -1$ – so it makes sense that the cosine of a vector with itself is 1.
The top-right and bottom-left entries both represent the cosine between the first vector and the second vector. We see that the code got the same 0.2 value we calculated above.
The bottom-right number is the cosine between the second vector and the
second vector – so it makes sense that the cosine to itself is 1.
Formative Assessment 1
Formative Assessment 2