
2 (C) - Jaccard and Cosine Method

The document explains the Jaccard and Cosine methods for measuring similarity between sets and vectors, respectively. Jaccard Similarity is calculated based on common and unique items, while Cosine Similarity considers the frequency of items. Applications for both methods include document comparison, recommendation systems, and data analysis in various fields.


Jaccard Method 🧩

Imagine you and your friend each have a basket of fruits:

- Your basket: Apple, Banana, Grape

- Friend's basket: Apple, Orange, Grape

The Jaccard method helps answer this question: "How similar are these two baskets?"

Here’s how it works:

1. Find what’s common in both baskets: Apple, Grape

2. List all unique fruits from both baskets (without repeating): Apple, Banana, Grape, Orange

3. Formula:
Jaccard Similarity = (Number of common fruits) ÷ (Total unique fruits)

For these baskets:

Jaccard Similarity = 2 ÷ 4 = 0.5

So, the similarity is 50%. The bigger the fraction, the more similar the baskets!
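The basket comparison above can be sketched in a few lines of Python using built-in sets (the variable names here are just for illustration):

```python
# Jaccard similarity of the two fruit baskets, using Python's set operations
mine = {"Apple", "Banana", "Grape"}
friends = {"Apple", "Orange", "Grape"}

common = mine & friends        # intersection: {"Apple", "Grape"}
all_unique = mine | friends    # union: 4 unique fruits in total

jaccard = len(common) / len(all_unique)
print(jaccard)  # 0.5
```

Note that the result depends only on which fruits appear, not on how many of each you have.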

Cosine Method 📐

Now, imagine you and your friend make lists of how many of each fruit you have in your baskets:

- Your list:
Apple = 1, Banana = 1, Grape = 1, Orange = 0

- Friend's list:
Apple = 1, Banana = 0, Grape = 1, Orange = 1

The Cosine method helps measure how similar these lists are, like comparing their "angles."

Here’s how it works:

1. Multiply the matching fruit counts from both lists:

(Apple × Apple) + (Banana × Banana) + (Grape × Grape) + (Orange × Orange)
= (1 × 1) + (1 × 0) + (1 × 1) + (0 × 1) = 2

2. Calculate the "strength" of each list:

  - Your strength = √(1² + 1² + 1² + 0²) = √3

  - Friend’s strength = √(1² + 0² + 1² + 1²) = √3

3. Formula:
Cosine Similarity = (Matching fruit counts) ÷ (Your strength × Friend’s strength)

For these lists:

Cosine Similarity = 2 ÷ (√3 × √3) = 2 ÷ 3 ≈ 0.67

So, the similarity is 67%.
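The three steps above can be checked with a short Python sketch (the lists encode the counts in the fixed order Apple, Banana, Grape, Orange):

```python
import math

# Fruit counts in the order: Apple, Banana, Grape, Orange
mine = [1, 1, 1, 0]
friends = [1, 0, 1, 1]

# Step 1: multiply matching counts and sum them (the dot product)
dot = sum(a * b for a, b in zip(mine, friends))            # 2

# Step 2: the "strength" (magnitude) of each list
my_strength = math.sqrt(sum(a * a for a in mine))          # √3
friend_strength = math.sqrt(sum(b * b for b in friends))   # √3

# Step 3: divide the dot product by the product of the strengths
cosine = dot / (my_strength * friend_strength)
print(round(cosine, 2))  # 0.67
```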


Key Difference:

- Jaccard looks at shared fruits as a fraction of all unique fruits.

- Cosine compares the counts of matching fruits, scaled by each list’s "strength" (magnitude).

That’s it—now you can compare fruit baskets in two different ways!

Question: Comparing Documents Using Jaccard and Cosine Similarity

You are given two documents represented as sets of words (for Jaccard similarity) and word frequencies (for Cosine similarity):

- Document 1 (D1):
Words: {information, retrieval, system, data, search}
Word Frequencies: {information: 2, retrieval: 1, system: 1, data: 1, search: 1}

- Document 2 (D2):
Words: {information, retrieval, search, engine, web}
Word Frequencies: {information: 1, retrieval: 1, search: 2, engine: 1, web: 1}

Tasks:

1. Jaccard Similarity:
Compute the Jaccard similarity between the two documents based on their word sets.

2. Cosine Similarity:
Compute the Cosine similarity between the two documents based on their word frequencies.

Hints:

1. For Jaccard similarity:

  - Formula:
Jaccard Similarity = (Number of common words) ÷ (Total unique words)

2. For Cosine similarity:

  - Formula:
Cosine Similarity = (Sum of products of matching word frequencies) ÷ (√(Sum of squares of D1 frequencies) × √(Sum of squares of D2 frequencies))
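If you want to verify your answers afterwards, both formulas can be sketched directly from the word-frequency dictionaries (a word missing from a document is treated as frequency 0):

```python
import math

d1 = {"information": 2, "retrieval": 1, "system": 1, "data": 1, "search": 1}
d2 = {"information": 1, "retrieval": 1, "search": 2, "engine": 1, "web": 1}

# Jaccard: compare only the word sets (frequencies are ignored)
union = d1.keys() | d2.keys()
jaccard = len(d1.keys() & d2.keys()) / len(union)

# Cosine: compare the frequency vectors (a missing word counts as 0)
dot = sum(d1.get(w, 0) * d2.get(w, 0) for w in union)
mag1 = math.sqrt(sum(v * v for v in d1.values()))
mag2 = math.sqrt(sum(v * v for v in d2.values()))
cosine = dot / (mag1 * mag2)
```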

Example: Calculating Jaccard and Cosine Similarity

Consider two documents represented as sets of words and word frequencies:

- Document 1 (D1):
Words: {apple, banana, grape, orange}
Word Frequencies: {apple: 2, banana: 1, grape: 1, orange: 1}

- Document 2 (D2):
Words: {apple, banana, orange, mango}
Word Frequencies: {apple: 1, banana: 2, orange: 1, mango: 1}

Step 1: Jaccard Similarity

1. Find common words:

Common words between D1 and D2 are: {apple, banana, orange}

2. Find total unique words:

Total unique words in both sets are: {apple, banana, grape, orange, mango}

3. Jaccard Formula:
Jaccard Similarity = (Number of common words) ÷ (Total unique words)

Solution:
Common words = 3
Total unique words = 5
Jaccard Similarity = 3 ÷ 5 = 0.6 (or 60%)

Step 2: Cosine Similarity

1. Word frequency vectors:

Represent the documents as vectors over the word order (apple, banana, grape, orange, mango):

  - D1 = [2, 1, 1, 1, 0]

  - D2 = [1, 2, 0, 1, 1]

2. Dot product of vectors:

Multiply matching word frequencies and sum them up:
(2 × 1) + (1 × 2) + (1 × 0) + (1 × 1) + (0 × 1) = 2 + 2 + 0 + 1 + 0 = 5

3. Magnitude of each vector:

  - Magnitude of D1 = √(2² + 1² + 1² + 1² + 0²) = √(4 + 1 + 1 + 1 + 0) = √7

  - Magnitude of D2 = √(1² + 2² + 0² + 1² + 1²) = √(1 + 4 + 0 + 1 + 1) = √7

4. Cosine Formula:
Cosine Similarity = (Dot product of vectors) ÷ (Magnitude of D1 × Magnitude of D2)

Solution:
Cosine Similarity = 5 ÷ (√7 × √7) = 5 ÷ 7 ≈ 0.714 (or 71.4%)

Final Answer:

- Jaccard Similarity = 0.6 (60%)

- Cosine Similarity = 0.714 (71.4%)

This shows that while both measures find the documents somewhat similar, Cosine similarity also accounts for word frequency, which makes it slightly higher in this case.
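The worked example can be reproduced in a few lines of Python; the numbers match the hand calculation above:

```python
import math

# Jaccard: word sets of D1 and D2
words1 = {"apple", "banana", "grape", "orange"}
words2 = {"apple", "banana", "orange", "mango"}
jaccard = len(words1 & words2) / len(words1 | words2)  # 3 / 5 = 0.6

# Cosine: frequency vectors over (apple, banana, grape, orange, mango)
d1 = [2, 1, 1, 1, 0]
d2 = [1, 2, 0, 1, 1]
dot = sum(a * b for a, b in zip(d1, d2))   # 5
mag1 = math.sqrt(sum(a * a for a in d1))   # √7
mag2 = math.sqrt(sum(b * b for b in d2))   # √7
cosine = dot / (mag1 * mag2)               # 5 / 7 ≈ 0.714

print(jaccard, round(cosine, 3))  # 0.6 0.714
```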

==================================================================================

Applications of Jaccard Similarity

Jaccard Similarity is most useful when comparing sets of items (presence or absence of elements).

1. Document Comparison:

  - Measuring similarity between documents based on common words (ignoring frequencies).

  - Example: Finding similar research papers based on keywords.

2. Recommendation Systems:

  - Comparing users' preferences or behaviors in terms of shared interests (e.g., movies, products).

  - Example: Recommending products by comparing shopping carts.

3. Plagiarism Detection:

  - Identifying copied content by comparing the overlap of unique words or phrases.

4. Clustering and Classification:

  - Grouping similar datasets or categorizing based on shared attributes.

  - Example: Grouping customers by shared interests.

5. Biological Data Analysis:

  - Comparing DNA, protein sequences, or gene sets based on common genetic patterns.

6. Search Engine Optimization:

  - Finding overlap between web pages in terms of keywords or topics.


Applications of Cosine Similarity

Cosine Similarity is ideal for comparing numerical vectors or high-dimensional data.

1. Information Retrieval:

  - Measuring similarity between queries and documents in search engines.

  - Example: Ranking documents based on their similarity to a search query.

2. Text Mining and NLP (Natural Language Processing):

  - Comparing sentences, paragraphs, or documents using word frequencies or embeddings.

  - Example: Detecting sentiment or paraphrase similarity.

3. Recommendation Systems:

  - Suggesting items based on users' preferences (ratings or interaction counts).

  - Example: Suggesting movies based on user ratings.

4. Image and Video Similarity:

  - Comparing visual features in image recognition systems.

  - Example: Matching faces or detecting duplicate images.

5. Machine Learning and Data Clustering:

  - Measuring similarity between data points in clustering or classification algorithms.

  - Example: Grouping similar customers or detecting anomalies.

6. Fraud Detection:

  - Comparing transactional patterns to identify unusual behavior.

  - Example: Identifying fraudulent credit card activity by comparing vectors of transaction history.

7. Social Network Analysis:

  - Comparing users’ behavior or connections.

  - Example: Finding similar users based on their interaction frequency.

8. Recommender Systems with Sparse Data:

  - Working with sparse datasets like user-item interactions.

  - Example: Collaborative filtering in e-commerce platforms.

Summary:

- Jaccard Similarity is more suitable for set-based comparisons (presence/absence).

- Cosine Similarity works best for numerical or frequency-based comparisons in high-dimensional data.

===============================================================================

Jaccard Matching Score
