Information Storage and
Retrieval (CS418)
Search Engine Architecture [2]
Lecture 3
Dr. Ebtsam AbdelHakam
Computer Science Dept.
Minia University
Indexing Process
[Figure: the indexing process (figure © Addison Wesley, 2008). Text acquisition – documents arrive via web crawling, provider feeds, RSS feeds, or desktop/email (what can you store? disk space? rights? compression?) and each gets a unique ID in a document data store. Text transformation – format conversion, international text, word units, stopping, stemming (which part contains the “meaning”?). Index creation – builds a lookup table for quickly finding all docs containing a word.]
Walid Magdy, TTDS 2017/2018
Document Acquisition (collecting data)
Before creating an index, the system must collect the data to be indexed. This data
can come from various sources, such as:
- Web pages (for web search engines).
- Documents (for enterprise search systems).
- Databases (for structured data search).
Example: A web crawler collects HTML pages from the
internet.
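A minimal sketch of this acquisition step in Python (assuming the third-party `requests` and `beautifulsoup4` packages; the seed URLs and page limit are illustrative, not part of any real system):

```python
import requests
from collections import deque
from urllib.parse import urljoin
from bs4 import BeautifulSoup

def crawl(seed_urls, max_pages=100):
    """Breadth-first crawl: fetch pages, store raw HTML, follow links."""
    frontier = deque(seed_urls)          # URLs waiting to be fetched
    seen, store = set(seed_urls), {}
    while frontier and len(store) < max_pages:
        url = frontier.popleft()
        try:
            html = requests.get(url, timeout=5).text
        except requests.RequestException:
            continue                     # skip unreachable pages
        store[url] = html                # the "document data store" (in memory here)
        for link in BeautifulSoup(html, "html.parser").find_all("a", href=True):
            absolute = urljoin(url, link["href"])
            if absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)
    return store
```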
Text Transformation (Pre-processing)
• It includes tasks to extract meaningful text and metadata from the
collected data.
This involves:
- Parsing HTML, PDFs, or other file formats to extract text.
- Removing unnecessary content like ads, navigation menus, or boilerplate
text.
- Extracting metadata such as titles, headings, and author information.
Example: From an HTML page, extract the `<title>`, `<h1>`, and `<p>`
tags.
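A small sketch of this extraction (assuming Python with `beautifulsoup4`; the HTML snippet is made up for illustration):

```python
from bs4 import BeautifulSoup

html = ("<html><head><title>IR Lecture</title></head>"
        "<body><h1>Indexing</h1><p>Search engines build indexes.</p></body></html>")

soup = BeautifulSoup(html, "html.parser")
title = soup.title.get_text()                             # "IR Lecture"
headings = [h.get_text() for h in soup.find_all("h1")]    # ["Indexing"]
paragraphs = [p.get_text() for p in soup.find_all("p")]   # ["Search engines build indexes."]
```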
Document Structure and Markup
Not all words are of equal value in a search; some parts of documents are
more important than others.
The document parser recognizes structure using markup, such as HTML tags –
headers, anchor text, and bolded text are all likely to be important.
Metadata can also be important – links are used for link analysis.
Text transformation
• Text transformation steps are applied to both documents
(before indexing) and queries (before processing):
• Objective: identify the optimal form of the term to be
indexed to achieve the best retrieval performance.
• Standard text pre-processing steps:
1. Tokenization
2. Stopping
3. Normalization
4. Stemming
5. POS tagging
Tokenization
• Tokenizer: converts a document into a stream of tokens, e.g. individual
words (i.e., breaking the extracted text down into individual words or terms).
• Sentence → tokenization (splitting) → tokens
• A token is an instance of a sequence of characters
• Typical technique: split at non-letter characters (space)
• Each such token is now a candidate for an index entry (term), after
further processing.
• Handling special cases like hyphenated words, contractions, and
abbreviations.
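A minimal Python sketch of the “split at non-letter characters” technique (special cases such as hyphens, contractions, and abbreviations would need extra rules):

```python
import re

def tokenize(text):
    """Split at any run of non-letter characters and drop empty strings."""
    return [tok for tok in re.split(r"[^A-Za-z]+", text) if tok]

print(tokenize("This is a very exciting lecture on the technologies of text!"))
# ['This', 'is', 'a', 'very', 'exciting', 'lecture', 'on', 'the', 'technologies', 'of', 'text']
```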
Stopping: stop words
• Example sentence: “This is a very exciting lecture on the technologies of text” – many of its words are stop words
• Stop words: the most common words in a collection
the, a, is, he, she, I, him, for, on, to, very, …
• There are a lot of them ≈ 30-40% of text
• New stop words appear in specific domains
• Tweets: RT “RT @realDonalTrump Mexico will …”
• Patents: said, claim “a said method that extracts ….”
• Stop words
• influence on sentence structure
• less influence on topic (aboutness)
Stopping: stop words
• Common practice in many applications: remove them
• You need them for:
• Phrase queries:
“King of Denmark”, “Let it be”, “To be or not to be”
• “Relational” queries: “flights to London”
Normalisation
• Objective: make words with different surface forms look the same
• Document: “this is my CAR!!”
Query: “car”
should “car” match “CAR”?
• Sentence → tokenisation → tokens → normalisation →
terms to be indexed
• Same tokenisation/normalisation steps should be
applied to documents & queries
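A minimal sketch of normalisation as simple case folding, applied identically to the document and the query (Python, illustrative only):

```python
import re

def normalise(text):
    """Tokenise, then case-fold so surface variants such as 'CAR' and 'car' match."""
    return [tok.lower() for tok in re.split(r"[^A-Za-z]+", text) if tok]

print(normalise("this is my CAR!!"))   # ['this', 'is', 'my', 'car']
print(normalise("car"))                # ['car']
```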
Stemming
• Search for: “play”
should it match: “played”, “playing”, “player”?
• Many morphological variations of words:
• inflectional (plurals, tenses)
• derivational (e.g., making verbs into nouns)
• In most cases, aboutness does not change
• Stemmers attempt to reduce morphological variations
of words to a root form/ common stem.
• usually involves removing suffixes (in English)
• Can be done at indexing time or as part of query
processing (like stopwords)
• Two basic types
• Dictionary-based: uses lists of related words
• Algorithmic: uses program to determine related words
• Algorithmic stemmers
• suffix-s: remove ‘s’ endings assuming plural
• e.g., cats → cat, lakes → lake, windows → window
• Many false negatives: supplies → supplie
• Some false positives: James → Jame
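A sketch of this naive suffix-s stemmer, showing both kinds of errors listed above:

```python
def suffix_s_stem(word):
    """Remove a trailing 's', assuming it marks a plural."""
    return word[:-1] if word.endswith("s") else word

print([suffix_s_stem(w) for w in ["cats", "lakes", "windows"]])  # ['cat', 'lake', 'window']
print(suffix_s_stem("supplies"))  # 'supplie' (false negative: never reaches 'supply')
print(suffix_s_stem("James"))     # 'Jame'    (false positive: not a plural at all)
```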
Porter Stemmer
• Most common algorithm for stemming English
• Conventions + 5 phases of reductions
• phases applied sequentially
• each phase consists of a set of commands
• sample convention:
of the rules in a compound command, select the one that
applies to the longest suffix.
• Example rules in Porter stemmer
• sses → ss (processes → process)
• y → i (reply → repli)
• ies → i (replies → repli)
• ement → null (replacement → replac)
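These reductions can be reproduced with an off-the-shelf implementation, e.g. NLTK's PorterStemmer (a sketch assuming the `nltk` package is installed):

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["processes", "reply", "replies", "replacement"]:
    print(word, "->", stemmer.stem(word))
# Expected reductions, matching the rules above:
# processes -> process, reply -> repli, replies -> repli, replacement -> replac
```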
What stemming cannot handle:
• Irregular verbs:
• saw → see
• went → go
• Different spellings
• colour vs. color
• tokenisation vs. tokenization
• Television vs. TV
• Synonyms
• car vs. vehicle
• UK vs. Britain
• Solution: query expansion …
• Text pre-processing before IR:
• Tokenisation → Stopping → Stemming
POS Tagging
POS tagging assigns grammatical labels (e.g., noun, verb,
adjective) to each word in the text. This step is useful for
tasks like syntactic analysis and information extraction.
POS taggers use statistical models of text to predict
syntactic tags of words.
Example tags:
NN (singular noun), NNS (plural noun), VB (verb), VBD
(verb, past tense), VBN (verb, past participle), IN
(preposition), JJ (adjective), CC (conjunction, e.g., “and”,
“or”), PRP (pronoun), and MD (modal auxiliary, e.g., “can”,
“will”).
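A quick illustration with NLTK's default tagger (a sketch assuming `nltk` and its tokenizer/tagger models have been downloaded; exact tags depend on the model):

```python
import nltk
# One-time model downloads (resource names may vary by NLTK version):
# nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")

tokens = nltk.word_tokenize("The noble Brutus was ambitious")
print(nltk.pos_tag(tokens))
# e.g. [('The', 'DT'), ('noble', 'JJ'), ('Brutus', 'NNP'), ('was', 'VBD'), ('ambitious', 'JJ')]
```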
N-Grams
Frequent n-grams are more likely to be
meaningful phrases
N-grams form a Zipf distribution – an even better fit than words alone
Could index all n-grams up to specified length
Much faster than POS tagging
Uses a lot of storage
e.g., a document containing 1,000 words would
contain 3,990 instances of word n-grams of
length 2 ≤ n ≤ 5
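A sketch that generates word n-grams and reproduces the count above (999 + 998 + 997 + 996 = 3,990 for a 1,000-word document):

```python
def word_ngrams(tokens, min_n=2, max_n=5):
    """All contiguous word n-grams with min_n <= n <= max_n."""
    return [tuple(tokens[i:i + n])
            for n in range(min_n, max_n + 1)
            for i in range(len(tokens) - n + 1)]

tokens = ["w%d" % i for i in range(1000)]   # stand-in for a 1,000-word document
print(len(word_ngrams(tokens)))             # 3990
```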
Vectorization step
Vectorization converts text into numerical representations that can be used
by machine learning models. Common techniques include:
- Bag of Words (BoW): Represents text as a vector of word frequencies.
- TF-IDF (Term Frequency-Inverse Document Frequency): Weighs words
based on their importance in the document and corpus.
- Word Embeddings: Represents words as dense vectors (e.g., Word2Vec,
GloVe).
- Sentence Embeddings: Represents entire sentences as vectors (e.g.,
BERT, Sentence-BERT).
Example:
- Input: `The quick brown fox.`
- Output (BoW): `{"the": 1, "quick": 1, "brown": 1, "fox": 1}`
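A sketch of the BoW and TF-IDF variants using scikit-learn (assuming a recent `scikit-learn`; the two toy documents are illustrative):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["The quick brown fox.", "A quick brown dog."]

bow = CountVectorizer()
counts = bow.fit_transform(docs)        # sparse document-term matrix of word frequencies
print(bow.get_feature_names_out())      # ['brown' 'dog' 'fox' 'quick' 'the']
print(counts.toarray())                 # one row per document: [1 0 1 1 1] and [1 1 0 1 0]

tfidf = TfidfVectorizer()
weights = tfidf.fit_transform(docs)     # same shape, TF-IDF weights instead of raw counts
```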
Text Transformation Pipeline:
1. Text Cleaning
- Input: `<p>Hello, world! This is a <b>test</b>.</p>`
- Output: `Hello world This is a test`
2. Tokenization
- Output: `["Hello", "world", "This", "is", "a", "test"]`
3. Normalization
- Output: `["hello", "world", "this", "be", "a", "test"]`
4. Vectorization (BoW)
- Output: `{"hello": 1, "world": 1, "this": 1, "be": 1, "a": 1, "test": 1}`
Tools and Libraries for Text Transformation:
1. NLTK (Natural Language Toolkit): A Python library for text processing.
2. spaCy: An industrial-strength NLP library for text transformation.
3. Scikit-learn: A machine learning library with tools for text vectorization.
4. Gensim: A library for topic modeling and word embeddings.
5. Transformers (Hugging Face): A library for advanced NLP tasks using
pre-trained models like BERT.
Example: We have a collection of three documents:
Document 1: "The quick brown fox jumps over the lazy dog."
Document 2: "A quick brown dog jumps over the lazy fox."
Document 3: "The lazy fox sleeps all day."
Step 1: Text Cleaning: Remove unnecessary characters, punctuation, and stop words (common words like "the," "a," "is," etc.).
Output:
- Document 1: "quick brown fox jumps lazy dog"
- Document 2: "quick brown dog jumps lazy fox"
- Document 3: "lazy fox sleeps day"
Step 2: Tokenization: Break the text into individual words (tokens).
Output:
- Document 1: ["quick", "brown", "fox", "jumps", "lazy", "dog"]
- Document 2: ["quick", "brown", "dog", "jumps", "lazy", "fox"]
- Document 3: ["lazy", "fox", "sleeps", "day"]
Step 3: Normalization: Convert all tokens to lowercase and apply stemming (reducing words to their root form).
Output:
- Document 1: ["quick", "brown", "fox", "jump", "lazi", "dog"]
- Document 2: ["quick", "brown", "dog", "jump", "lazi", "fox"]
- Document 3: ["lazi", "fox", "sleep", "day"]
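These three steps can be reproduced with NLTK (a sketch assuming the `nltk` package and its English stop-word list; note that Porter stems such as "lazi" need not be real words):

```python
import re
from nltk.corpus import stopwords        # requires nltk.download("stopwords") once
from nltk.stem import PorterStemmer

stop = set(stopwords.words("english"))
stemmer = PorterStemmer()

def preprocess(doc):
    tokens = re.findall(r"[a-z]+", doc.lower())                  # clean, tokenize, lowercase
    return [stemmer.stem(t) for t in tokens if t not in stop]    # stop, then stem

print(preprocess("The quick brown fox jumps over the lazy dog."))
# ['quick', 'brown', 'fox', 'jump', 'lazi', 'dog']
```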
Indexing Process
It is a critical step in building a search engine or any information retrieval
system. It involves organizing and structuring data (e.g., text, documents,
or web pages) to enable fast and efficient searching.
Users can’t find you unless you’re in the index.
How do you add your website to the Google index?
(Assignment)
The purpose of storing an index is to optimize speed and performance in
finding relevant documents for a search query.
For example, while an index of 10,000 documents can be queried within
milliseconds, a sequential scan of every word in 10,000 large documents
could take hours.
Index data structures
Search engine architectures vary in the way
indexing is performed and in methods of index
storage to meet the various design factors.
1. Inverted index
Stores a list of occurrences of each atomic search criterion, typically in the form of a hash
table or binary tree.
2. Citation index
Stores citations or hyperlinks between documents to support citation analysis, a subject
of bibliometrics.
3. n-gram index
Stores sequences of length of data to support other types of retrieval or text mining.
4. Document-term matrix
Used in latent semantic analysis, stores the occurrences of words in documents in a two-
dimensional sparse matrix.
Index Creation
Storing Document Statistics
‣ The counts and positions of document terms are stored.
‣ Common index types:
1. Forward Index: Key is the document, value is a list of terms
and term positions. Easiest for the crawler to build.
2. Inverted Index: Key is a term, value is a list of documents
and term positions. Provides faster processing at query time.
Forward index
- The rationale behind developing a forward index is that as documents
are parsed, it is better to intermediately store the words per document.
- The forward index is sorted to transform it to an inverted index.
- The forward index is essentially a list of pairs consisting of a document
and a word, collated by the document.
Inverted index
• This index can only determine whether a word exists within a particular
document, since it stores no information regarding the frequency and
position of the word; it is therefore considered to be a boolean index.
• Such an index determines which documents match a query but does
not rank matched documents.
Index Creation
Storing Document Statistics
‣ The counts and positions of document terms are stored
‣ Term weights are calculated and stored with the terms.
‣ The weight estimates the term’s importance to the document.
‣ The weights are used by ranking algorithms
• e.g. TF-IDF ranks documents by the Term Frequency of the query
term within the document times the Inverse Document Frequency
of the term across all documents.
• Higher scores mean the document contains more of the query terms that are
rare across the collection (i.e., not found in many documents).
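A minimal sketch of this scoring scheme (one common formulation, using a logarithmic IDF; the toy documents are illustrative only):

```python
import math

def tf_idf_score(query_terms, doc, collection):
    """score(q, d) = sum over query terms t of tf(t, d) * idf(t)."""
    N = len(collection)
    score = 0.0
    for t in query_terms:
        tf = doc.count(t)                               # term frequency in this document
        df = sum(1 for d in collection if t in d)       # number of documents containing t
        score += tf * (math.log(N / df) if df else 0)   # rarer terms get higher weight
    return score

docs = [["quick", "brown", "fox"], ["lazy", "fox"], ["lazy", "dog"]]
print(tf_idf_score(["fox", "dog"], docs[2], docs))      # only 'dog' contributes (no 'fox' in docs[2])
```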
The Shakespeare collection as
Term-Document Matrix
Matrix element (t,d) is:
1 if term t occurs in document d,
0 otherwise
These examples/figures are from: Manning, Raghavan, Schütze, Intro to Information Retrieval, CUP, 2008
The Shakespeare collection as
Term-Document Matrix
QUERY:
Brutus AND Caesar AND NOT Calpurnia
Answer: “Antony and Cleopatra”(d=1), “Hamlet”(d=4)
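A sketch of evaluating that boolean query over toy postings lists stored as Python sets (real systems intersect sorted lists instead, but the logic is the same):

```python
# Toy postings: term -> set of docIDs containing the term
index = {
    "brutus":    {1, 2, 4, 11, 31, 45, 173, 174},
    "caesar":    {1, 2, 4, 5, 6, 16, 57, 132},
    "calpurnia": {2, 31, 54, 101},
}
all_docs = set().union(*index.values())

# Brutus AND Caesar AND NOT Calpurnia
answer = index["brutus"] & index["caesar"] & (all_docs - index["calpurnia"])
print(sorted(answer))   # [1, 4]
```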
Inverted Index Data Structure
term (t) → document id (d), e.g. “Brutus” occurs in d = 1, 2, 4, ...
Importantly, it’s a sorted list
Inverted Index
• Each index term is associated with an
inverted list
– Contains lists of documents, or lists of word
occurrences in documents, and other
information
– Each entry is called a posting
– The part of the posting that refers to a specific
document or location is called a pointer
– Each document in the collection is given a
unique number
– Lists are usually document-ordered (sorted by
document number)
Inverted index
For each term t, we must store a list of all documents that contain t.
Identify each doc by a docID, a document serial number
Can we use fixed-size arrays for this?
Brutus → 1 2 4 11 31 45 173 174
Caesar → 1 2 4 5 6 16 57 132
Calpurnia → 2 31 54 101
What happens if the word Caesar is added to document 14?
Can we use fixed-size arrays for this?
The term–document matrix underlying the inverted index is sparse, since not all
words are present in each document.
To reduce storage requirements, the index is stored differently from a
two-dimensional array.
How can we reduce index size?
(index compression) – Assignment
Inverted index
We need variable-size postings lists
On disk, a continuous run of postings is normal and best
In memory, can use linked lists or variable length arrays
Some tradeoffs in size/ease of insertion
Dictionary → Postings (each docID entry in a postings list is a posting):
Brutus → 1 2 4 11 31 45 173 174
Caesar → 1 2 4 5 6 16 57 132
Calpurnia → 2 31 54 101
Sorted by docID (more later on why).
Indexer steps: Token sequence
Sequence of (Modified token, Document ID) pairs.
Doc 1: I did enact Julius Caesar I was killed i’ the Capitol; Brutus killed me.
Doc 2: So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious
Indexer steps: Sort
Sort by terms
At least conceptually
And then docID
Core indexing step
Indexer steps: Dictionary & Postings
Multiple term entries in a
single document are merged.
Split into Dictionary and
Postings
Doc. frequency information is
added.
Why frequency?
Will discuss later.
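A sketch of these indexer steps on the two example documents: emit (term, docID) pairs, sort them, then merge duplicates into a dictionary with document frequencies and per-term postings lists:

```python
from collections import defaultdict

docs = {
    1: "I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.",
    2: "So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious",
}

# 1. Token sequence: (modified token, docID) pairs
pairs = [(tok.strip(".;'").lower(), doc_id)
         for doc_id, text in docs.items()
         for tok in text.split()]

# 2. Sort by term, then by docID (the core indexing step)
pairs.sort()

# 3. Merge: dictionary stores term -> document frequency, postings store term -> docIDs
postings = defaultdict(list)
for term, doc_id in pairs:
    if not postings[term] or postings[term][-1] != doc_id:   # merge repeats within a document
        postings[term].append(doc_id)
doc_freq = {term: len(ids) for term, ids in postings.items()}

print(postings["caesar"], doc_freq["caesar"])   # [1, 2] 2
print(postings["brutus"], doc_freq["brutus"])   # [1, 2] 2
```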
Final inverted index
[Figure: the final inverted index – the dictionary stores terms and counts, with pointers to the lists of docIDs (the postings).]
IR system implementation
• How do we index efficiently?
• How much storage do we need?
Index Creation for a Web Search Engine
1. Data Collection: A crawler fetches 1,000 web pages.
2. Text Extraction: Extract text and metadata (e.g., titles, headings) from the
HTML pages.
3. Tokenization: Break the text into individual words.
4. Normalization: Lowercase the words, remove stop words, and apply
stemming.
5. Inverted Index: Build an index mapping each term to the pages where it
appears.
6. Index Compression: Compress the index to save storage space (Assignment).
7. Index Storage: Store the index in a distributed database.
8. Index Optimization: Shard the index across multiple servers.
9. Index Maintenance: Update the index as new pages are crawled.