Jaccard Method 🧩
Imagine you and your friend each have a basket of fruits:
Your basket: Apple, Banana, Grape
Friend’s basket: Apple, Orange, Grape
The Jaccard method helps answer this question: "How similar are these two baskets?"
Here’s how it works:
1. Find what’s common in both baskets: Apple, Grape
2. List all unique fruits from both baskets (without repeating): Apple, Banana, Grape, Orange
3. Formula:
Jaccard Similarity = (Number of common fruits) ÷ (Total unique fruits)
For these baskets:
Jaccard Similarity = 2 ÷ 4 = 0.5
So, the similarity is 50%. The bigger the fraction, the more similar the baskets!
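The basket comparison above can be sketched in Python using plain sets (a minimal sketch; the fruit names are just the ones from the example):

```python
# Jaccard similarity of two fruit baskets, using Python sets.
mine = {"Apple", "Banana", "Grape"}
friend = {"Apple", "Orange", "Grape"}

common = mine & friend      # intersection: fruits in BOTH baskets
all_unique = mine | friend  # union: every distinct fruit across both baskets

jaccard = len(common) / len(all_unique)
print(jaccard)  # 0.5
```

The `&` and `|` set operators map directly onto steps 1 and 2 above: intersection for the common fruits, union for the total unique fruits.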
Cosine Method 📐
Now, imagine you and your friend make lists of how many of each fruit you have in your baskets:
Your list:
Apple = 1, Banana = 1, Grape = 1, Orange = 0
Friend’s list:
Apple = 1, Banana = 0, Grape = 1, Orange = 1
The Cosine method helps measure how similar these lists are, like comparing their "angles."
Here’s how it works:
1. Multiply the matching fruit counts from both lists:
(Apple × Apple) + (Banana × Banana) + (Grape × Grape) + (Orange × Orange)
= (1 × 1) + (1 × 0) + (1 × 1) + (0 × 1) = 2
2. Calculate the "strength" of each list:
o Your strength = √(1² + 1² + 1² + 0²) = √3
o Friend’s strength = √(1² + 0² + 1² + 1²) = √3
3. Formula:
Cosine Similarity = (Matching fruit counts) ÷ (Your strength × Friend’s strength)
For these lists:
Cosine Similarity = 2 ÷ (√3 × √3) = 2 ÷ 3 ≈ 0.67
So, the similarity is 67%.
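The three steps above can be sketched in Python, treating each list as a small count dictionary (a sketch for this example, not a general-purpose implementation):

```python
import math

# Fruit counts from each list.
yours = {"Apple": 1, "Banana": 1, "Grape": 1, "Orange": 0}
friends = {"Apple": 1, "Banana": 0, "Grape": 1, "Orange": 1}

# Step 1: multiply matching fruit counts and add them up (the dot product).
dot = sum(yours[f] * friends[f] for f in yours)

# Step 2: the "strength" (magnitude) of each list.
your_strength = math.sqrt(sum(v ** 2 for v in yours.values()))
friend_strength = math.sqrt(sum(v ** 2 for v in friends.values()))

# Step 3: cosine similarity.
cosine = dot / (your_strength * friend_strength)
print(round(cosine, 2))  # 0.67
```

Note that both "strengths" come out as √3 here, matching the hand calculation, so the result is 2 ÷ 3 ≈ 0.67.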
Key Difference:
Jaccard looks at shared fruits as a percentage of all unique fruits.
Cosine compares the counts of each fruit, taking both the overlap and each list's overall "strength" (magnitude) into account.
That’s it—now you can compare fruit baskets in two different ways!
Question: Comparing Documents Using Jaccard and Cosine Similarity
You are given two documents represented as sets of words (for Jaccard similarity) and word
frequencies (for Cosine similarity):
Document 1 (D1):
Words: {information, retrieval, system, data, search}
Word Frequencies: {information: 2, retrieval: 1, system: 1, data: 1, search: 1}
Document 2 (D2):
Words: {information, retrieval, search, engine, web}
Word Frequencies: {information: 1, retrieval: 1, search: 2, engine: 1, web: 1}
Tasks:
1. Jaccard Similarity:
Compute the Jaccard similarity between the two documents based on their word sets.
2. Cosine Similarity:
Compute the Cosine similarity between the two documents based on their word
frequencies.
Hints:
1. For Jaccard similarity:
o Formula:
Jaccard Similarity = (Number of common words) ÷ (Total unique words)
2. For Cosine similarity:
o Formula:
Cosine Similarity = (Sum of products of matching word frequencies) ÷ (√(Sum of squares of D1 frequencies) × √(Sum of squares of D2 frequencies))
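The two hint formulas can be turned into small helper functions (a Python sketch; the function names are just illustrative). Plug in the word sets and frequency dictionaries for D1 and D2 above to check your answers:

```python
import math

def jaccard_similarity(words1, words2):
    """Number of common words divided by total unique words."""
    return len(words1 & words2) / len(words1 | words2)

def cosine_similarity(freq1, freq2):
    """Sum of products of matching frequencies, divided by the
    product of the two vectors' magnitudes."""
    vocab = freq1.keys() | freq2.keys()
    dot = sum(freq1.get(w, 0) * freq2.get(w, 0) for w in vocab)
    mag1 = math.sqrt(sum(v * v for v in freq1.values()))
    mag2 = math.sqrt(sum(v * v for v in freq2.values()))
    return dot / (mag1 * mag2)
```

Using `.get(w, 0)` treats any word missing from a document as having frequency 0, which is exactly how the padded vectors are built in the worked example below.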
Example: Calculating Jaccard and Cosine Similarity
Consider two documents represented as sets of words and word frequencies:
Document 1 (D1):
Words: {apple, banana, grape, orange}
Word Frequencies: {apple: 2, banana: 1, grape: 1, orange: 1}
Document 2 (D2):
Words: {apple, banana, orange, mango}
Word Frequencies: {apple: 1, banana: 2, orange: 1, mango: 1}
Step 1: Jaccard Similarity
1. Find common words:
Common words between D1 and D2 are: {apple, banana, orange}
2. Find total unique words:
Total unique words in both sets are: {apple, banana, grape, orange, mango}
3. Jaccard Formula:
Jaccard Similarity = (Number of common words) ÷ (Total unique words)
Solution:
Common words = 3
Total unique words = 5
Jaccard Similarity = 3 ÷ 5 = 0.6 (or 60%)
Step 2: Cosine Similarity
1. Word frequency vectors:
Represent the documents as vectors:
o D1 = [2, 1, 1, 1, 0] (apple, banana, grape, orange, mango)
o D2 = [1, 2, 0, 1, 1] (apple, banana, grape, orange, mango)
2. Dot product of vectors:
Multiply matching word frequencies and sum them up:
(2 × 1) + (1 × 2) + (1 × 0) + (1 × 1) + (0 × 1) = 2 + 2 + 0 + 1 + 0 = 5
3. Magnitude of each vector:
o Magnitude of D1 = √(2² + 1² + 1² + 1² + 0²) = √(4 + 1 + 1 + 1 + 0) = √7
o Magnitude of D2 = √(1² + 2² + 0² + 1² + 1²) = √(1 + 4 + 0 + 1 + 1) = √7
4. Cosine Formula:
Cosine Similarity = (Dot product of vectors) ÷ (Magnitude of D1 × Magnitude of D2)
Solution:
Cosine Similarity = 5 ÷ (√7 × √7) = 5 ÷ 7 ≈ 0.714 (or 71.4%)
Final Answer:
Jaccard Similarity = 0.6 (60%)
Cosine Similarity = 0.714 (71.4%)
This shows that while both measures find the documents somewhat similar, Cosine similarity
considers the frequency of words, making it slightly higher in this case.
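The worked example can be verified end to end with a short Python sketch:

```python
import math

# Word frequencies from the worked example.
d1 = {"apple": 2, "banana": 1, "grape": 1, "orange": 1}
d2 = {"apple": 1, "banana": 2, "orange": 1, "mango": 1}

# Jaccard: based only on WHICH words appear, ignoring counts.
jaccard = len(d1.keys() & d2.keys()) / len(d1.keys() | d2.keys())

# Cosine: based on the frequency vectors over the shared vocabulary.
vocab = d1.keys() | d2.keys()
dot = sum(d1.get(w, 0) * d2.get(w, 0) for w in vocab)
mag1 = math.sqrt(sum(v * v for v in d1.values()))
mag2 = math.sqrt(sum(v * v for v in d2.values()))
cosine = dot / (mag1 * mag2)

print(round(jaccard, 3), round(cosine, 3))  # 0.6 0.714
```

Both magnitudes come out as √7, so the cosine result is 5 ÷ 7 ≈ 0.714, agreeing with the hand calculation.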
==================================================================================
Applications of Jaccard Similarity
Jaccard Similarity is most useful when comparing sets of items (presence or absence of elements).
1. Document Comparison:
o Measuring similarity between documents based on common words (ignoring
frequencies).
o Example: Finding similar research papers based on keywords.
2. Recommendation Systems:
o Comparing users' preferences or behaviors in terms of shared interests (e.g., movies,
products).
o Example: Recommending products by comparing shopping carts.
3. Plagiarism Detection:
o Identifying copied content by comparing the overlap of unique words or phrases.
4. Clustering and Classification:
o Grouping similar datasets or categorizing based on shared attributes.
o Example: Grouping customers by shared interests.
5. Biological Data Analysis:
o Comparing DNA, protein sequences, or gene sets based on common genetic
patterns.
6. Search Engine Optimization:
o Finding overlap between web pages in terms of keywords or topics.
Applications of Cosine Similarity
Cosine Similarity is ideal for comparing numerical vectors or high-dimensional data.
1. Information Retrieval:
o Measuring similarity between queries and documents in search engines.
o Example: Ranking documents based on the similarity to a search query.
2. Text Mining and NLP (Natural Language Processing):
o Comparing sentences, paragraphs, or documents using word frequency or
embeddings.
o Example: Detecting sentiment or paraphrase similarity.
3. Recommendation Systems:
o Suggesting items based on users' preferences (ratings or interaction counts).
o Example: Suggesting movies based on user ratings.
4. Image and Video Similarity:
o Comparing visual features in image recognition systems.
o Example: Matching faces or detecting duplicate images.
5. Machine Learning and Data Clustering:
o Measuring similarity between data points in clustering or classification algorithms.
o Example: Grouping similar customers or detecting anomalies.
6. Fraud Detection:
o Comparing transactional patterns to identify unusual behavior.
o Example: Identifying fraudulent credit card activity by comparing vectors of
transaction history.
7. Social Network Analysis:
o Comparing users’ behavior or connections.
o Example: Finding similar users based on their interaction frequency.
8. Recommender Systems with Sparse Data:
o Working with sparse datasets like user-item interactions.
o Example: Collaborative filtering in e-commerce platforms.
Summary:
Jaccard Similarity is more suitable for set-based comparisons (presence/absence).
Cosine Similarity works best for numerical or frequency-based comparisons in high-dimensional data.
===============================================================================
Jaccard Matching Score