CS 3308-01 - AY2025-T3 Learning Journal Unit 4
Introduction
In the field of information retrieval, understanding document similarity is crucial for tasks such as
ranking search results and recommending content. One effective way to measure similarity between
documents is by using cosine similarity, a metric that calculates the cosine of the angle between
document vectors in a high-dimensional space. This technique will be employed to recommend documents
that are similar to a user's preferred document within a given corpus.
Document Vectorization
The first step in this process involves converting the provided documents into numerical
representations. This can be done through the following stages:
1. Text Preprocessing:
Tokenize each document into individual words.
Remove stop words, such as "is," to focus on meaningful terms that contribute to the document's
content.
2. Vector Representation:
Convert each document into a vector representation using the term frequency-inverse document
frequency (TF-IDF) method. This approach assigns weights to terms based on their frequency in
a document and their rarity across the entire corpus.
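The TF-IDF vector $v_d$ for a document $d$ is computed as follows:
$$v_d = \left(w_{t_1,d},\; w_{t_2,d},\; \dots,\; w_{t_n,d}\right)$$
where
$$w_{t,d} = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t)$$
Here, $t_1, \dots, t_n$ are the distinct terms in the corpus, tf(t, d) represents the term frequency
of term t in document d, and idf(t) is the inverse document frequency of term t across the corpus.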
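The two stages above can be sketched in a few lines of Python. This is a minimal illustration rather than a definitive implementation: the function names `tokenize` and `tfidf_vectors`, the one-word stop list, the use of raw counts for tf, and the choice idf(t) = ln(N / df(t)) are all assumptions made for the example, and other TF-IDF variants exist.

```python
import math
import re
from collections import Counter

STOP_WORDS = {"is"}  # assumption: a tiny stop-word list, just for illustration


def tokenize(text):
    """Lowercase the text, keep alphabetic tokens, and drop stop words."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]


def tfidf_vectors(docs):
    """Return (vocabulary, vectors) with one TF-IDF vector per document."""
    tokenized = [tokenize(d) for d in docs]
    vocab = sorted({t for tokens in tokenized for t in tokens})
    n_docs = len(docs)
    # Document frequency: the number of documents that contain each term.
    df = Counter(t for tokens in tokenized for t in set(tokens))
    # Assumption: idf(t) = ln(N / df(t)); smoothed variants are also common.
    idf = {t: math.log(n_docs / df[t]) for t in vocab}
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)  # raw term counts within this document
        vectors.append([tf[t] * idf[t] for t in vocab])
    return vocab, vectors


# Three short sample sentences used as the corpus.
corpus = ["Earth is round.", "Moon is round.", "Day is nice."]
vocab, vectors = tfidf_vectors(corpus)
print(vocab)
for vec in vectors:
    print([round(w, 3) for w in vec])
```

Terms that appear in fewer documents receive a larger idf, so a word shared by two of the three sentences is weighted less than a word unique to one.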
Cosine Similarity Calculation
Once we have the vectorized representation of each document, we can calculate the cosine
similarity between the document vectors. The cosine similarity between documents d1 and d2 is
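given by the formula:
$$\cos(d_1, d_2) = \frac{v_{d_1} \cdot v_{d_2}}{\lVert v_{d_1} \rVert \, \lVert v_{d_2} \rVert}$$
Where:
$v_{d_1} \cdot v_{d_2}$ is the dot product of the vectors $v_{d_1}$ and $v_{d_2}$
$\lVert v_{d_1} \rVert$ and $\lVert v_{d_2} \rVert$ are the Euclidean norms (lengths) of the vectors
$v_{d_1}$ and $v_{d_2}$, respectively.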
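As a rough sketch, the cosine similarity of two vectors stored as plain Python lists could be computed as below. The function name `cosine_similarity` and the decision to return 0.0 when either vector is all zeros are assumptions made for this example.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length numeric vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        # Assumption: an all-zero vector is treated as similar to nothing.
        return 0.0
    return dot / (norm_u * norm_v)


# Orthogonal vectors share no direction, so the similarity is 0.
print(cosine_similarity([1, 0], [0, 1]))
# Parallel vectors give a similarity of 1, up to floating-point rounding.
print(cosine_similarity([1, 2], [2, 4]))
```

Because the result depends only on direction, scaling a vector by a positive constant leaves its similarity to any other vector unchanged.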
Recommendation Process
Now, let's apply this methodology to the provided documents:
1. Document Representation:
Document 1: "Earth is round."
Document 2: "Moon is round."
Document 3: "Day is nice."
2. Vectorization:
After removing stop words ("is"), we represent each document as a TF-IDF vector.
3. Cosine Similarity Calculation:
Compute the cosine similarity between Document 1 and Documents 2 and 3.
4. Recommendation:
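Recommend the document with the highest cosine similarity to Document 1. Document 2 shares the
term "round" with Document 1, whereas Document 3 shares no terms with it and therefore has a
cosine similarity of 0, so Document 2 is the document to recommend.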
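The four steps above can be worked through numerically. The figures below are a sketch under stated assumptions, not values given in the original exercise: tf is taken as the raw term count, idf(t) = ln(N / df(t)) with N = 3, and the vector components are ordered as (earth, moon, round, day, nice).

```latex
% Assumptions: raw-count tf, idf(t) = ln(N / df(t)), N = 3 documents.
% Component order: (earth, moon, round, day, nice).
\begin{align*}
\mathrm{idf}(\text{round}) &= \ln\tfrac{3}{2} \approx 0.405,
  \qquad \mathrm{idf}(\text{any other term}) = \ln 3 \approx 1.099 \\
v_{d_1} &\approx (1.099,\; 0,\; 0.405,\; 0,\; 0) \\
v_{d_2} &\approx (0,\; 1.099,\; 0.405,\; 0,\; 0) \\
v_{d_3} &\approx (0,\; 0,\; 0,\; 1.099,\; 1.099) \\
\cos(d_1, d_2) &= \frac{(\ln\tfrac{3}{2})^2}{(\ln 3)^2 + (\ln\tfrac{3}{2})^2}
  \approx \frac{0.164}{1.371} \approx 0.12 \\
\cos(d_1, d_3) &= \frac{0}{\lVert v_{d_1} \rVert \, \lVert v_{d_3} \rVert} = 0
\end{align*}
```

Under these assumptions Document 2 scores about 0.12 against Document 1 and Document 3 scores 0. The ranking does not depend on the base of the logarithm, since changing the base rescales every vector by the same constant.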
Conclusion
In conclusion, recommending similar documents involves transforming text into numerical vectors,
calculating cosine similarity between these vectors, and using the results to identify the most
relevant documents. Grounded in information retrieval principles, this approach gives
recommendation systems and search engines a simple, quantitative way to rank documents by their
similarity to a document the user already prefers.
References
Manning, C.D., Raghavan, P., & Schütze, H. (2009). An Introduction to Information Retrieval
(Online ed.). Cambridge, UK: Cambridge University Press. Available at
http://nlp.stanford.edu/IR-book/information-retrieval-book.html