AI VIETNAM
All-in-One Course (TA Session)
Text Retrieval Project
Dinh-Thang Duong – TA
Year 2023
Outline
➢ Introduction
➢ Create Corpus
➢ Text Representation
➢ Text Normalization
➢ Ranking
➢ Optional: Semantic Search with BERT
➢ Question
Introduction
❖ Getting Started
(Figure: the most famous search engines)
❖ Text Retrieval
A user with an information need issues a query (e.g., “Today news”) to the TR System, which returns the relevant documents (today's news articles).

Text Retrieval (TR), also called Document Retrieval1: a branch of Information Retrieval (IR) in which the system matches a stated user search query against a set of texts.

Ad-hoc Retrieval2: a system that aims to provide documents from within the collection that are relevant to an arbitrary user information need, communicated to the system by means of a one-off, user-initiated query.

1: https://en.wikipedia.org/wiki/Document_retrieval
2: https://nlp.stanford.edu/IR-book/html/htmledition/an-example-information-retrieval-problem-1.html
❖ Applications
Search Engines: find desired documents within a very large corpus.
❖ Challenges
Text retrieval shares some of the same challenges as general IR:
• Document Indexing
• Text Representation
• How to satisfy the user's information need?
❖ Basic Text Retrieval Pipeline
• Query: a text that describes the user's information need.
• Corpus: a set of documents (texts).
• Relevance: satisfaction of the user's information need.
• Information need: the topic about which the user desires to know more.
• Terms: indexed units (usually words).
• Index: a data structure for storing documents.

Input: a search query (a text) and the corpus (a collection of documents).
Output: the relevant documents (a collection of documents).
Pipeline: Corpus → Indexing; Query → Query Processing → Searching (with user Feedback) → Relevant Documents.
❖ Project Statement
With the MS MARCO dataset, create a simple text retrieval program using the Vector Space Model.
❖ Project Statement
Example: the query ‘what is the official language in Fiji’ and the corpus (the MS MARCO entity corpus) go into the Text Retrieval System, which returns the relevant documents.
❖ Vector Space Model
Given a query and a collection of documents:
1. Bring raw text into the vector space (raw text → vector representation).
2. Indexing.
3. Calculate the similarity between vectors in the vector space.
4. Ranking.
Our Text Retrieval Pipeline
Input: Corpus → Preprocessing → Bag-of-Words Vectorizer (with Vocabulary) → Indexing.
Input: Query → Query Processing → Vectorizer → Cosine Similarity → Ranking → Output: Relevant Documents.
Create Corpus
❖ Download
Download MS MARCO via the Hugging Face Datasets library.
❖ Step 1: Install datasets
Install the Hugging Face datasets library (pip install datasets).
❖ Step 2: Load MS_MARCO
We use MS_MARCO version 1.1 and its test set.
❖ Step 4: Extract text
1. Only use samples with type == entity.
2. Load the text (you can load only passage_text) and append it to the corpus.
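The extraction step might look like the sketch below. The real `load_dataset` call is shown in a comment; the two records in the list are hypothetical stand-ins shaped like MS MARCO v1.1 samples (a `query_type` field and passage texts under `passages.passage_text` are assumptions about the schema):

```python
# With the `datasets` library installed (pip install datasets), the test
# split could be loaded as:
#   from datasets import load_dataset
#   dataset = load_dataset("ms_marco", "v1.1", split="test")
# For illustration, two hypothetical MS MARCO-shaped records:
dataset = [
    {
        "query_type": "entity",
        "passages": {"passage_text": ["Fiji has three official languages."]},
    },
    {
        "query_type": "description",
        "passages": {"passage_text": ["Deep learning is a subfield of ML."]},
    },
]

corpus = []
for sample in dataset:
    # 1. Only use samples whose type is "entity"
    if sample["query_type"] != "entity":
        continue
    # 2. Load only passage_text and append each passage to the corpus
    for passage in sample["passages"]["passage_text"]:
        corpus.append(passage)

print(corpus)  # only the entity passage is kept
```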
Text Representation
❖ Introduction
S1 = “this is a book”
S2 = “machine learning book”
How similar are S1 and S2? To compare them, represent each text as a vector, e.g.:
S1 = [15, 30, 14, 50]
S2 = [12, 35, 10, 49]
❖ Challenges
Word relations:
• Synonymy: <water, H2O>…
• Antonymy: <up, down>…
• Polysemy: sentence, mouse…
• Similarity: <car, truck>…
• Relatedness: <coffee, cup>…
• Connotation: great (positive), terrible (negative)…
❖ Representation Taxonomy
Word Embeddings:
• Without machine learning: Bag-of-Words, TF-IDF
• With machine learning:
  • Context-Independent: Word2Vec (Skip-Gram, CBOW), GloVe, FastText
  • Context-Dependent:
    • RNN-based: CoVe, ELMo
    • Transformer-based: GPT, BERT Family
https://medium0.com/nlplanet/two-minutes-nlp-11-word-embeddings-models-you-should-know-a0581763b9a9
❖ Introduction to Bag-of-Words
Bag-of-Words (BoW): a text representation method that represents text as a bag of its words, disregarding grammar and even word order but keeping multiplicity. The method involves:
1. The vocabulary
2. Weighting terms
https://en.wikipedia.org/wiki/Bag-of-words_model
❖ Bag-of-Words Pipeline
1. Corpus (list of paragraphs) → Text Normalization → Create Dictionary
2. A string (text) → Text Normalization → Vectorize → new text representation, e.g., [14, 2, 9, 36, 89]
❖ Dictionary
Given a set of documents, e.g., doc_i = [‘book’, ‘deep’, ‘learning’], the dictionary is the set of unique words in the corpus:
Dictionary = […, ‘book’, ‘good’, ‘algorithm’, ‘vietnam’, …]
❖ Weighting: Term-frequency
Corpus (after tokenization):
doc1 = “deep learning book” → [‘deep’, ‘learning’, ‘book’]
doc2 = “machine learning algorithm” → [‘machine’, ‘learning’, ‘algorithm’]
doc3 = “learning ai from scratch” → [‘learning’, ‘ai’, ‘from’, ‘scratch’]
doc4 = “ai vietnam” → [‘ai’, ‘vietnam’]
Vocabulary = [deep, learning, book, machine, algorithm, ai, from, scratch, vietnam]
👉 Given the string “vietnam machine learning deep learning book”:
            deep  learning  book  machine  algorithm  ai  from  scratch  vietnam
BoW         1     2         1     1        0          0   0     0        1
Binary BoW  1     1         1     1        0          0   0     0        1
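The counts in the table above can be reproduced with a short sketch (the vocabulary order follows the slide):

```python
vocabulary = ["deep", "learning", "book", "machine",
              "algorithm", "ai", "from", "scratch", "vietnam"]

def bow_vector(text, vocab, binary=False):
    """Count vocabulary terms in `text` (term frequency); with
    `binary=True`, mark only presence/absence of each term."""
    tokens = text.split()
    counts = [tokens.count(word) for word in vocab]
    if binary:
        counts = [1 if c > 0 else 0 for c in counts]
    return counts

s = "vietnam machine learning deep learning book"
print(bow_vector(s, vocabulary))               # [1, 2, 1, 1, 0, 0, 0, 0, 1]
print(bow_vector(s, vocabulary, binary=True))  # [1, 1, 1, 1, 0, 0, 0, 0, 1]
```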
❖ Vectorizer
Input: s = “Hello AI VIETNAM” → Text Normalization → Bag-of-Words (with an n-word Vocabulary) → Output: [0, 0, 1, …, 0] (a vector of n elements).
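The vectorizer code on the slide is not reproduced here; a minimal stand-in that lowercases, strips punctuation, and emits an n-element bag-of-words vector might look like:

```python
import string

def vectorize(text, vocab):
    """Normalize `text` (lowercase, strip punctuation) and return an
    n-element bag-of-words vector over `vocab`."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    tokens = text.split()
    return [tokens.count(word) for word in vocab]

vocab = ["hello", "ai", "vietnam", "book"]
print(vectorize("Hello AI VIETNAM", vocab))  # [1, 1, 1, 0]
```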
❖ Indexing
Indexing: the process of organizing and structuring a collection of documents or data to facilitate efficient retrieval of information. It involves creating an index that enables quick access to relevant documents based on search queries or specific attributes. E.g., Inverted Indexing.
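As a sketch of the inverted-index idea mentioned above, we can map each term to the list of documents that contain it (the toy documents below are already tokenized):

```python
docs = {
    "doc1": ["deep", "learning", "book"],
    "doc2": ["machine", "learning", "algorithm"],
    "doc3": ["learning", "ai", "from", "scratch"],
}

# Build the inverted index: term -> sorted list of document ids
inverted_index = {}
for doc_id, tokens in docs.items():
    for term in set(tokens):
        inverted_index.setdefault(term, []).append(doc_id)
for term in inverted_index:
    inverted_index[term].sort()

print(inverted_index["learning"])  # ['doc1', 'doc2', 'doc3']
print(inverted_index["ai"])        # ['doc3']
```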
❖ Indexing
Document-term Matrix: a mathematical matrix that describes the frequency of terms that occur in a collection of documents. Rows are documents (doc_1 … doc_n) and columns are terms (word_1 … word_n).
We can use the document-term matrix as a database to index documents.
❖ Document-term Matrix
Vocabulary = [deep, learning, book, machine, algorithm, ai, from, scratch, vietnam]
doc1 = “deep learning book”
doc2 = “machine learning algorithm”
doc3 = “learning ai from scratch”
doc4 = “ai vietnam”
Represent raw texts in the form doci = (w1, w2, w3, …, wn), where wi ∈ vocab and n is the vocab size:
      deep  learning  book  machine  algorithm  ai  from  scratch  vietnam
doc1  1     1         1     0        0          0   0     0        0
doc2  0     1         0     1        1          0   0     0        0
doc3  0     1         0     0        0          1   1     1        0
doc4  0     0         0     0        0          1   0     0        1
❖ Doc-Term Matrix as Index
doc1 = “Học sách học AI.” → normalize & tokenize → [‘học’, ‘sách’, ‘học’, ‘ai’]
doc2 = “Sách Học Máy” → [‘sách’, ‘học’, ‘máy’]
doc3 = “Người ấy là ai?” → [‘người’, ‘ấy’, ‘là’, ‘ai’]
Vocab = [‘học’, ‘sách’, ‘ai’, ‘máy’, ‘người’, ‘ấy’, ‘là’]
      học  sách  ai  máy  người  ấy  là
doc1  2    1     1   0    0      0   0
doc2  1    1     0   1    0      0   0
doc3  0    0     1   0    1      1   1
Each row is also the document's index.
❖ Indexing Code
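The indexing code on the slide is not reproduced here; a sketch that builds the document-term matrix for the Vietnamese example above (with a toy normalizer that only lowercases and strips ‘.’ and ‘?’):

```python
def tokenize(text):
    """Lowercase, strip '.' and '?', and split into tokens (toy normalizer)."""
    return text.lower().replace(".", "").replace("?", "").split()

corpus = ["Học sách học AI.", "Sách Học Máy", "Người ấy là ai?"]
tokenized = [tokenize(doc) for doc in corpus]

# Vocabulary: unique tokens in order of first appearance
vocab = []
for tokens in tokenized:
    for token in tokens:
        if token not in vocab:
            vocab.append(token)

# Document-term matrix: one row of term counts per document
doc_term_matrix = [[tokens.count(w) for w in vocab] for tokens in tokenized]
print(vocab)            # ['học', 'sách', 'ai', 'máy', 'người', 'ấy', 'là']
print(doc_term_matrix)  # rows match the table above
```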
Text Normalization
❖ Motivation
doc1 = “Học sách học AI.” → tokenize → [‘Học’, ‘sách’, ‘học’, ‘AI.’]
doc2 = “Sách Học Máy” → [‘Sách’, ‘Học’, ‘Máy’]
doc3 = “Người ấy là ai?” → [‘Người’, ‘ấy’, ‘là’, ‘ai?’]
Vocabulary = [Học, sách, học, AI., Sách, Máy, Người, ấy, là, ai?]
vocab_size = 10
But ‘Học’ and ‘học’ both refer to the meaning of “học”, and ‘Sách’ and ‘sách’ both refer to the meaning of “sách”.
❖ Motivation
doc1 = “Học sách học AI.” → preprocess & tokenize → [‘học’, ‘sách’, ‘học’, ‘ai’]
doc2 = “Sách Học Máy” → [‘sách’, ‘học’, ‘máy’]
doc3 = “Người ấy là ai?” → [‘người’, ‘ấy’, ‘là’, ‘ai’]
Vocabulary = [sách, học, ai, máy, người, ấy, là]
vocab_size = 7
Normalization reduces unnecessary complexity in the text representation.
❖ Input/Output
Input: “Hello, this is AI VIETNAM!”
Steps: Lowercasing → Punctuation Removal → Stopwords Removal → Stemming
(Optional: URL Removal, HTML Tag Removal, …)
Output: “hello ai vietnam”
❖ Lowercasing
“Hello, this is AI VIETNAM!” → “hello, this is ai vietnam!”
Punctuation removal then gives:
“hello, this is ai vietnam!” → “hello this is ai vietnam”
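These two steps can be sketched with the standard library alone:

```python
import string

text = "Hello, this is AI VIETNAM!"
lowered = text.lower()  # "hello, this is ai vietnam!"
# Strip all ASCII punctuation characters
no_punct = lowered.translate(str.maketrans("", "", string.punctuation))
print(no_punct)  # "hello this is ai vietnam"
```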
❖ Stopwords Removal
Less crucial words (e.g., “this”, “is”) carry little meaning, so there is no need to represent them.
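A minimal sketch of stopword removal; the stopword set below is a tiny hypothetical list, whereas a real pipeline would use a full list such as `nltk.corpus.stopwords.words("english")`:

```python
# Hypothetical mini stopword list for illustration only
STOPWORDS = {"this", "is", "a", "the", "of", "and"}

tokens = "hello this is ai vietnam".split()
filtered = [t for t in tokens if t not in STOPWORDS]
print(filtered)  # ['hello', 'ai', 'vietnam']
```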
❖ Stemming
If we do not consider the semantics of words, we shouldn't include all forms of a word in the dictionary, only the root form. For example, “changing”, “changes”, and “changer” all share the meaning of “change”.
This reduces the size of the vector representation and helps avoid sparse vectors.
❖ Stemming
Input: change, changing, changes, changer → Stemming (via stemming rules) → Output: chang
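A toy suffix-stripping stemmer illustrating the rule idea; the suffix list is an assumption for this sketch, and a real Porter stemmer (e.g., `nltk.stem.PorterStemmer`) uses a much larger rule set:

```python
SUFFIXES = ["ing", "er", "es", "e", "s"]  # toy rules, longest variants first

def stem(word):
    """Strip the first matching suffix (toy stemming rules, not real Porter)."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

words = ["change", "changing", "changes", "changer"]
print([stem(w) for w in words])  # ['chang', 'chang', 'chang', 'chang']
```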
❖ Stemming
Stemming method: Porter Stemmer.
❖ Final Text Normalization Function
Lowercasing → Punctuation Removal → Stopwords Removal → Stemming
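Combining the four steps, the final normalization function might be sketched as follows (the mini stopword list and suffix rules are toy assumptions; a real pipeline would use a full stopword list and a proper Porter stemmer):

```python
import string

# Hypothetical mini stopword list and toy suffix rules for illustration
STOPWORDS = {"this", "is", "a", "the", "of", "and"}
SUFFIXES = ["ing", "er", "es", "e", "s"]

def stem(word):
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def normalize(text):
    """Lowercase, remove punctuation, drop stopwords, then stem."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    tokens = [t for t in text.split() if t not in STOPWORDS]
    return [stem(t) for t in tokens]

print(normalize("Hello, this is AI VIETNAM!"))  # ['hello', 'ai', 'vietnam']
```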
Ranking
❖ Motivation
       read  book  ai  machine  learn  how
doc1   1     1     1   0        0      0
doc2   0     1     0   1        1      0
doc3   0     0     1   0        2      1
query  0     0     0   1        1      0
Scoring each document against the query with distance(q, d) gives a ranked list:
DocID  Similarity
d2     0.8165
d3     0.5774
d1     0.0000
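The similarity values in the table above can be reproduced with a plain-Python cosine function:

```python
import math

def cosine(a, b):
    """cosine(a, b) = (a . b) / (|a| |b|); returns 0.0 for zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

query = [0, 0, 0, 1, 1, 0]
doc1 = [1, 1, 1, 0, 0, 0]
doc2 = [0, 1, 0, 1, 1, 0]
doc3 = [0, 0, 1, 0, 2, 1]

print(round(cosine(query, doc2), 4))  # 0.8165
print(round(cosine(query, doc3), 4))  # 0.5774
print(round(cosine(query, doc1), 4))  # 0.0
```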
❖ Cosine Similarity
cosine(a, b) = (a · b) / (‖a‖ ‖b‖)
The raw dot product favours long vectors (higher values in each dimension), so cosine similarity normalizes by the vector lengths.
❖ Cosine Similarity
(Figure: similarity between the term vectors for “learning” and “information”)
❖ Ranking based on similarity value
       học  sách  ai  máy  người  ấy  là
doc1   2    1     1   0    0      0   0
doc2   1    1     0   1    0      0   0
doc3   0    0     1   0    1      1   1
query  0    0     2   1    1      0   1
r1 = cosine(d1, q) = 0.308
r2 = cosine(d2, q) = 0.218
r3 = cosine(d3, q) = 0.756
Sorting in descending order:
DocID  Similarity
d3     0.756
d1     0.308
d2     0.218
❖ Ranking code & results
1. Calculate the query vector.
2. Calculate the similarity between the query vector and each document vector.
3. Sort the similarities in descending order.
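The three steps above can be sketched as follows, reusing the cosine formula on the worked example from the previous slide (the slide truncates 0.3086 to 0.308):

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def rank(query_vec, doc_term_matrix):
    """Score each document against the query vector with cosine
    similarity, then sort in descending order of similarity."""
    scores = [
        (doc_id, cosine(query_vec, doc_vec))
        for doc_id, doc_vec in doc_term_matrix.items()
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

# Vocab order: [học, sách, ai, máy, người, ấy, là]
doc_term_matrix = {
    "d1": [2, 1, 1, 0, 0, 0, 0],
    "d2": [1, 1, 0, 1, 0, 0, 0],
    "d3": [0, 0, 1, 0, 1, 1, 1],
}
query_vec = [0, 0, 2, 1, 1, 0, 1]

ranked = rank(query_vec, doc_term_matrix)
for doc_id, score in ranked:
    print(doc_id, round(score, 3))  # d3 first, then d1, then d2
```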
Optional: Semantic Search with BERT
❖ Introduction
Transformer: a type of deep neural network architecture used to solve the problem of transduction, i.e., the transformation of input sequences into output sequences.
❖ BERT
BERT (Bidirectional Encoder Representations from Transformers): a pretrained model that uses the Transformer encoder architecture and is trained on a large amount of text data to understand and generate contextual representations of words.
❖ BERT
Input: ‘what is the official language in Fiji’ → Output: [0.123, 0.456, 0.789, …, 0.567, 0.890], a vector representation of dim=384.
❖ Semantic Search
Idea: use BERT (a Sentence Transformer) to generate a vector representation for the query and for each document, then score their similarity using Cosine Similarity.
❖ Why Semantic Search?
Semantic Search: a search technique that aims to understand the meaning, or semantics, of a query and the content being searched. The BERT output vector (e.g., [0.123, 0.456, 0.789, 0.567, 0.890]) can capture this semantic meaning.
Semantic Search Pipeline
Input: Corpus → BERT Encode → Indexing.
Input: Query → Query Processing → Cosine Similarity → Ranking → Output: Relevant Documents.
❖ Step 1: Import BERT and encode corpus
❖ Step 2: Define Cosine Similarity function
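For dense embeddings, the cosine function is conveniently written with NumPy; a minimal sketch (the 2-D vectors are stand-ins for 384-dim BERT outputs):

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([3.0, 4.0])
b = np.array([3.0, 4.0])
c = np.array([-4.0, 3.0])
print(cosine_similarity(a, b))  # 1.0 (identical direction)
print(cosine_similarity(a, c))  # 0.0 (orthogonal)
```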
❖ Step 3: Define Ranking function
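A ranking sketch over precomputed embeddings: score every document against the query embedding and sort descending (again, toy 2-D vectors stand in for real BERT embeddings):

```python
import numpy as np

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def semantic_rank(query_emb, corpus_embs):
    """Return (document index, score) pairs sorted by descending
    cosine similarity to the query embedding."""
    scores = [cosine_similarity(query_emb, emb) for emb in corpus_embs]
    order = np.argsort(scores)[::-1]
    return [(int(i), scores[i]) for i in order]

# Toy embeddings standing in for 384-dim BERT outputs
corpus_embs = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
query_emb = np.array([1.0, 0.1])

ranked = semantic_rank(query_emb, corpus_embs)
print([i for i, _ in ranked])  # [1, 2, 0]
```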
❖ Step 4: Search
Question
?