CS 3308 Learning Journal 4

The document discusses the importance of document similarity in information retrieval, specifically using cosine similarity to recommend similar documents. It outlines the process of document vectorization through text preprocessing and TF-IDF representation, followed by the calculation of cosine similarity. The methodology is applied to example documents to illustrate how to recommend the most relevant content based on similarity scores.

Uploaded by djromodeste

CS 3308-01 - AY2025-T3 Learning Journal Unit 4

Introduction

In the field of information retrieval, understanding document similarity is crucial for tasks such as
ranking search results and recommending content. One effective way to measure similarity between
documents is cosine similarity, a metric based on the cosine of the angle between document
vectors in a high-dimensional space: the smaller the angle, the more similar the documents. This
technique will be employed to recommend documents that are similar to a user's preferred
document within a given corpus.

Document Vectorization

The first step in this process involves converting the provided documents into numerical
representations. This can be done through the following stages:

1. Text Preprocessing:

• Tokenize each document into individual words.

• Remove stop words, such as "is," to focus on meaningful terms that contribute to the document's
content.

2. Vector Representation:

• Convert each document into a vector representation using the term frequency-inverse document
frequency (TF-IDF) method. This approach assigns weights to terms based on their frequency in
a document and their rarity across the entire corpus.
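The preprocessing stage described above can be sketched in a few lines of Python. This is only an illustration: the function name `preprocess`, the regular expression used for tokenizing, and the very short `STOP_WORDS` set are assumptions made for this example, not part of the original assignment.

```python
import re

# Hypothetical, deliberately tiny stop-word list (an assumption for this sketch);
# practical systems use much longer lists.
STOP_WORDS = {"is", "a", "an", "the", "of", "and"}


def preprocess(text):
    """Lowercase the text, split it into word tokens, and drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [token for token in tokens if token not in STOP_WORDS]


print(preprocess("Earth is round."))  # ['earth', 'round']
```

Lowercasing before tokenizing is a design choice of this sketch so that "Earth" and "earth" count as the same term.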

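The TF-IDF vector (v_d) for a document (d) is computed as follows:

$$v_d = \left(w_{t_1,d},\; w_{t_2,d},\; \dots,\; w_{t_n,d}\right)$$

where

$$w_{t,d} = \mathrm{tf}(t, d) \times \mathrm{idf}(t)$$

Here, t_1, ..., t_n are the distinct terms of the corpus, tf(t, d) represents the term frequency of
term t in document d, and idf(t) is the inverse document frequency of term t across the corpus.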
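A minimal sketch of this weighting follows. The text does not say which TF-IDF variant is intended, so the code assumes the raw count for tf(t, d) and idf(t) = ln(N / df(t)), where N is the number of documents and df(t) is the number of documents containing t. The function name `tfidf_vectors` and the alphabetical vocabulary order are choices made for this example only.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return (vocabulary, vectors); each vector entry is tf(t, d) * idf(t).

    `docs` is a list of token lists, already preprocessed.
    Assumed variant: tf is the raw count, idf(t) = ln(N / df(t)).
    """
    n_docs = len(docs)
    vocab = sorted({term for doc in docs for term in doc})
    # df(t): number of documents that contain term t at least once
    df = Counter(term for doc in docs for term in set(doc))
    idf = {term: math.log(n_docs / df[term]) for term in vocab}

    vectors = []
    for doc in docs:
        tf = Counter(doc)  # missing terms count as 0
        vectors.append([tf[term] * idf[term] for term in vocab])
    return vocab, vectors


docs = [["earth", "round"], ["moon", "round"], ["day", "nice"]]
vocab, vectors = tfidf_vectors(docs)
print(vocab)                               # ['day', 'earth', 'moon', 'nice', 'round']
print([round(w, 4) for w in vectors[0]])   # [0.0, 1.0986, 0.0, 0.0, 0.4055]
```

A term that appears in every document would get idf = ln(1) = 0 under this variant, which is why some libraries add smoothing; that refinement is left out here.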

Cosine Similarity Calculation

Once we have the vectorized representation of each document, we can calculate the cosine
similarity between the document vectors. The cosine similarity between documents d1 and d2 is
given by the formula:

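$$\cos(d_1, d_2) = \frac{v_{d_1} \cdot v_{d_2}}{\lVert v_{d_1} \rVert \, \lVert v_{d_2} \rVert}$$

Where:

• $v_{d_1} \cdot v_{d_2}$ is the dot product of the vectors $v_{d_1}$ and $v_{d_2}$, and

• $\lVert v_{d_1} \rVert$ and $\lVert v_{d_2} \rVert$ are the Euclidean norms (lengths) of the vectors $v_{d_1}$ and $v_{d_2}$,
respectively.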
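The cosine similarity of two equal-length vectors can be sketched as below. The function name `cosine_similarity` is chosen for this example, and returning 0.0 when either vector has zero length is an assumption made here to avoid dividing by zero; the definition itself leaves that case undefined.

```python
import math


def cosine_similarity(u, v):
    """Return the dot product of u and v divided by the product of their norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # assumption of this sketch: a zero vector is similar to nothing
    return dot / (norm_u * norm_v)


print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))            # 0.0 (orthogonal vectors)
print(round(cosine_similarity([1.0, 0.0], [1.0, 1.0]), 4))  # 0.7071 (45-degree angle)
```

Because TF-IDF weights are never negative under the usual definitions, the similarity of two document vectors falls between 0 (no shared terms) and 1 (same direction).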

Recommendation Process

Now, let's apply this methodology to the provided documents:

1. Document Representation:

• Document 1: "Earth is round."

• Document 2: "Moon is round."

• Document 3: "Day is nice."

2. Vectorization:

• After removing stop words ("is"), we represent each document as a TF-IDF vector.

3. Cosine Similarity Calculation:

• Compute the cosine similarity between Document 1 and Documents 2 and 3.

4. Recommendation:

• Recommend the document with the highest cosine similarity to Document 1.
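The four steps can be worked through by hand for these three documents. The calculation below is a sketch under stated assumptions: tf is the raw term count, idf(t) = ln(N / df(t)) with N = 3, and the vocabulary is ordered (earth, round, moon, day, nice). The source does not fix a particular TF-IDF variant, so the exact scores depend on these choices. Here "round" appears in two documents and every other term in one.

```latex
\begin{align*}
\mathrm{idf}(\text{round}) &= \ln\tfrac{3}{2} \approx 0.405, \qquad
\mathrm{idf}(\text{earth}) = \mathrm{idf}(\text{moon}) = \mathrm{idf}(\text{day})
  = \mathrm{idf}(\text{nice}) = \ln 3 \approx 1.099 \\[4pt]
v_{d_1} &\approx (1.099,\ 0.405,\ 0,\ 0,\ 0) \\
v_{d_2} &\approx (0,\ 0.405,\ 1.099,\ 0,\ 0) \\
v_{d_3} &\approx (0,\ 0,\ 0,\ 1.099,\ 1.099) \\[4pt]
\cos(d_1, d_2) &= \frac{v_{d_1} \cdot v_{d_2}}{\lVert v_{d_1} \rVert \, \lVert v_{d_2} \rVert}
  \approx \frac{0.405^2}{1.099^2 + 0.405^2}
  \approx \frac{0.164}{1.371} \approx 0.12 \\[4pt]
\cos(d_1, d_3) &= \frac{0}{\lVert v_{d_1} \rVert \, \lVert v_{d_3} \rVert} = 0
\end{align*}
```

Under these assumptions Document 2 is the one to recommend, because it shares the term "round" with Document 1, while Document 3 shares no terms with it. Changing the base of the logarithm scales every weight by the same factor and leaves both cosine values unchanged.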

Conclusion

In conclusion, recommending similar documents involves transforming text into numerical vectors,
calculating cosine similarity between these vectors, and using the results to identify the most
relevant documents. This approach, grounded in information retrieval principles, enhances
document recommendation systems and improves user experience in search engines.

References

Manning, C.D., Raghavan, P., & Schütze, H. (2009). An Introduction to Information Retrieval
(Online ed.). Cambridge, MA: Cambridge University Press. Available at
http://nlp.stanford.edu/IR-book/information-retrieval-book.html
