
Information Storage and Retrieval (CS418)
Lecture 3: Search Engine Architecture [2]
Dr. Ebtsam AbdelHakam
Computer Science Dept., Minia University
Indexing Process

[Figure: overview of the indexing process (© Addison Wesley, 2008). Documents flow from acquisition into a document data store, then through text transformation into index creation:
- Document acquisition: gather documents (web crawling, provider feeds, RSS "feeds", desktop/email), assign each document a unique ID, and decide what can be stored (disk space? rights? compression?).
- Text transformation: format conversion, international text, deciding which part contains "meaning", choosing word units, stopping, stemming.
- Index creation: build a lookup table for quickly finding all docs containing a word.]


Document Acquisition (collecting data)

 Document Acquisition: Before creating an index, the system must collect the data to be indexed. This data can come from various sources, such as:
- Web pages (for web search engines).
- Documents (for enterprise search systems).
- Databases (for structured data search).
 Example: A web crawler collects HTML pages from the internet.
Text Transformation (Pre-processing)

• It includes tasks to extract meaningful text and metadata from the collected data. This involves:
- Parsing HTML, PDFs, or other file formats to extract text.
- Removing unnecessary content like ads, navigation menus, or boilerplate text.
- Extracting metadata such as titles, headings, and author information.

Example: From an HTML page, extract the `<title>`, `<h1>`, and `<p>` tags.
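A minimal sketch of extracting the title and visible text from an HTML page using only Python's standard-library `html.parser` (real systems typically use dedicated parsers and boilerplate-removal tools; the class and sample markup here are illustrative):

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect the <title> contents separately from the rest of the visible text."""
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""
        self.text = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data
        else:
            self.text.append(data.strip())

extractor = TextExtractor()
extractor.feed("<html><head><title>IR Lecture</title></head>"
               "<body><h1>Indexing</h1><p>Documents are parsed.</p></body></html>")
print(extractor.title)                             # IR Lecture
print(" ".join(t for t in extractor.text if t))    # Indexing Documents are parsed.
```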



Document Structure and Markup

 Not all words are of equal value in a search; some parts of documents are more important than others.
 The document parser recognizes structure using markup, such as HTML tags – headers, anchor text, and bolded text are all likely to be important.
 Metadata can also be important – links are used for link analysis.
Text transformation
• Text transformation steps are applied to both documents (before indexing) and queries (before processing).
• Objective  identify the optimal form of each term to be indexed to achieve the best retrieval performance.
• Standard text pre-processing steps:
1. Tokenization
2. Stopping
3. Normalization
4. Stemming
5. POS tagging


Tokenization
• Tokenizer: a document is converted to a stream of tokens, e.g. individual words (breaking the extracted text down into individual words or terms, i.e. tokens).
• Sentence  tokenization (splitting)  tokens
• A token is an instance of a sequence of characters.
• Typical technique: split at non-letter characters (e.g. spaces).
• Each such token is a candidate for an index entry (term) after further processing.
• Special cases like hyphenated words, contractions, and abbreviations need careful handling.
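A minimal sketch of the "split at non-letter characters" rule in Python (the function name and regular expression are illustrative, not from the lecture):

```python
import re

def tokenize(text):
    """Split text into tokens at runs of non-word characters (a crude whitespace/punctuation split)."""
    return [tok for tok in re.split(r"\W+", text) if tok]

print(tokenize("This is a very exciting lecture on the technologies of text!"))
# ['This', 'is', 'a', 'very', 'exciting', 'lecture', 'on', 'the', 'technologies', 'of', 'text']
```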
Stopping: stop words
• Example: "This is a very exciting lecture on the technologies of text"
• Stop words: the most common words in a collection
 the, a, is, he, she, I, him, for, on, to, very, …
• There are a lot of them ≈ 30–40% of text
• New stop words appear in specific domains
• Tweets: RT  “RT @realDonalTrump Mexico will …”
• Patents: said, claim  “a said method that extracts ….”
• Stop words
• influence sentence structure
• have less influence on topic (aboutness)
Stopping: stop words
• Common practice in many applications: remove them (see the sketch below)
• You still need them for:
• Phrase queries: “King of Denmark”, “Let it be”, “To be or not to be”
• “Relational” queries: “flights to London”
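A small sketch of removing stop words from a token stream (the stop list below is illustrative, not a standard one):

```python
# Illustrative stop list; real systems use much longer, collection-specific lists.
STOP_WORDS = {"the", "a", "is", "he", "she", "i", "him", "for", "on", "to", "very", "of", "this"}

def remove_stopwords(tokens):
    """Drop tokens that appear in the stop list (case-insensitive)."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]

tokens = ["This", "is", "a", "very", "exciting", "lecture", "on", "the", "technologies", "of", "text"]
print(remove_stopwords(tokens))
# ['exciting', 'lecture', 'technologies', 'text']
```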
Normalization
• Objective  make words with different surface forms look the same
• Document: “this is my CAR!!”
  Query: “car”
  should “car” match “CAR”?
• Sentence  tokenisation  tokens  normalisation  terms to be indexed
• The same tokenisation/normalisation steps should be applied to documents & queries
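A minimal sketch of case-folding and punctuation stripping so that “CAR!!” in the document matches the query term “car” (the exact rules are an assumption; real normalisers vary widely):

```python
def normalize(tokens):
    """Case-fold and strip surrounding punctuation so 'CAR!!' and 'car' map to the same term."""
    return [t.lower().strip("!?.,;:\"'") for t in tokens]

print(normalize(["this", "is", "my", "CAR!!"]))   # ['this', 'is', 'my', 'car']
print(normalize(["car"]))                          # ['car']  -> the query term now matches
```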



Stemming
• Search for: “play”
  should it match “played”, “playing”, “player”?
• Words have many morphological variations
• inflectional (plurals, tenses)
• derivational (making verbs into nouns, etc.)
• In most cases, aboutness does not change
• Stemmers attempt to reduce morphological variations of words to a root form / common stem
• usually involves removing suffixes (in English)
• Can be done at indexing time or as part of query processing (like stopping)
Types of stemmers
• Two basic types
• Dictionary-based: uses lists of related words
• Algorithmic: uses a program to determine related words
• Algorithmic stemmers
• suffix-s: remove ‘s’ endings assuming plural
• e.g., cats → cat, lakes → lake, windows → window
• Many false negatives: supplies → supplie
• Some false positives: James → Jame
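A sketch of the naive suffix-s stemmer, reproducing the false negative and false positive noted above (the function is illustrative, not a standard implementation):

```python
def suffix_s_stem(word):
    """Naive suffix-s stemmer: strip a trailing 's', assuming it marks a plural."""
    return word[:-1] if word.endswith("s") else word

for w in ["cats", "lakes", "windows", "supplies", "James"]:
    print(w, "->", suffix_s_stem(w))
# cats -> cat, lakes -> lake, windows -> window,
# supplies -> supplie  (false negative: fails to match 'supply')
# James -> Jame        (false positive)
```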



Porter stemmer
• Most common algorithm for stemming English
• Conventions + 5 phases of reductions
• phases applied sequentially
• each phase consists of a set of commands
• sample convention: of the rules in a compound command, select the one that applies to the longest suffix
• Example rules in the Porter stemmer:
• sses  ss (processes  process)
• y  i (reply  repli)
• ies  i (replies  repli)
• ement → null (replacement  replac)
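A quick check of these rules using NLTK's implementation of the Porter stemmer (assumes NLTK is installed, e.g. via `pip install nltk`):

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["processes", "reply", "replies", "replacement"]:
    print(word, "->", stemmer.stem(word))
# processes -> process, reply -> repli, replies -> repli, replacement -> replac
```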



Other term variations (not handled by stemming)
• Irregular verbs:
• saw  see
• went  go
• Different spellings:
• colour vs. color
• tokenisation vs. tokenization
• Television vs. TV
• Synonyms:
• car vs. vehicle
• UK vs. Britain

• Solution  query expansion …


• Text pre-processing before IR: Tokenisation  Stopping  Stemming



POS Tagging
 POS tagging assigns grammatical labels (e.g., noun, verb, adjective) to each word in the text. This step is useful for tasks like syntactic analysis and information extraction.
 POS taggers use statistical models of text to predict the syntactic tags of words.
 Example tags:
 NN (singular noun), NNS (plural noun), VB (verb), VBD (verb, past tense), VBN (verb, past participle), IN (preposition), JJ (adjective), CC (conjunction, e.g., “and”, “or”), PRP (pronoun), and MD (modal auxiliary, e.g., “can”, “will”).
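A brief illustration with NLTK's pre-trained tagger (this assumes NLTK is installed and its 'punkt' and tagger resources have been downloaded via `nltk.download()`; the exact tags may differ slightly between versions):

```python
import nltk

tokens = nltk.word_tokenize("The quick brown fox jumps over the lazy dog")
print(nltk.pos_tag(tokens))
# roughly: [('The', 'DT'), ('quick', 'JJ'), ('brown', 'JJ'), ('fox', 'NN'),
#           ('jumps', 'VBZ'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'JJ'), ('dog', 'NN')]
```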
N-Grams
 Frequent n-grams are more likely to be meaningful phrases
 Indexing n-grams fits phrases better than indexing words alone
 Could index all n-grams up to a specified length
 Much faster than POS tagging
 Uses a lot of storage
 e.g., a document containing 1,000 words would contain 3,990 instances of word n-grams of length 2 ≤ n ≤ 5
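A short sketch that generates word n-grams and confirms the 3,990 figure for a 1,000-word document:

```python
def word_ngrams(tokens, n_min=2, n_max=5):
    """All contiguous word n-grams with n_min <= n <= n_max."""
    return [tuple(tokens[i:i + n])
            for n in range(n_min, n_max + 1)
            for i in range(len(tokens) - n + 1)]

tokens = ["w%d" % i for i in range(1000)]   # a stand-in 1,000-word document
print(len(word_ngrams(tokens)))             # 999 + 998 + 997 + 996 = 3990
```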
Vectorization step
 Vectorization converts text into numerical representations that can be used by machine learning models. Common techniques include:
- Bag of Words (BoW): represents text as a vector of word frequencies.
- TF-IDF (Term Frequency–Inverse Document Frequency): weighs words based on their importance in the document and corpus.
- Word Embeddings: represents words as dense vectors (e.g., Word2Vec, GloVe).
- Sentence Embeddings: represents entire sentences as vectors (e.g., BERT, Sentence-BERT).

 Example:
 - Input: `The quick brown fox.`
 - Output (BoW): `{"the": 1, "quick": 1, "brown": 1, "fox": 1}`
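A small sketch of BoW and TF-IDF vectorization with scikit-learn (the document strings are illustrative; scikit-learn's default tokenizer drops one-letter tokens such as "A"):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["The quick brown fox.", "A quick brown dog."]

bow = CountVectorizer()
X = bow.fit_transform(docs)
print(bow.get_feature_names_out())   # ['brown' 'dog' 'fox' 'quick' 'the']
print(X.toarray())                   # word counts per document

tfidf = TfidfVectorizer()
print(tfidf.fit_transform(docs).toarray())   # same shape, but TF-IDF weighted
```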
 Text Transformation Pipeline:
1. Text Cleaning
- Input: `<p>Hello, world! This is a <b>test</b>.</p>`
- Output: `Hello world This is a test`
2. Tokenization: - Output: `["Hello", "world", "This", "is", "a", "test"]`
3. Normalization: - Output: `["hello", "world", "this", "be", "a", "test"]`
4. Vectorization (BoW): - Output: `{"hello": 1, "world": 1, "this": 1, "be": 1, "a": 1, "test": 1}`

 Tools and Libraries for Text Transformation:
1. NLTK (Natural Language Toolkit): a Python library for text processing.
2. spaCy: an industrial-strength NLP library for text transformation.
3. Scikit-learn: a machine learning library with tools for text vectorization.
4. Gensim: a library for topic modeling and word embeddings.
5. Transformers (Hugging Face): a library for advanced NLP tasks using pre-trained models like BERT.
Example: We have a collection of three documents:
Document 1: "The quick brown fox jumps over the lazy dog."
Document 2: "A quick brown dog jumps over the lazy fox."
Document 3: "The lazy fox sleeps all day."

Step 1: Text Cleaning: Remove unnecessary characters, punctuation, and stop words (common words like "the," "a," "is," etc.).
 Output:
- Document 1: "quick brown fox jumps lazy dog"
- Document 2: "quick brown dog jumps lazy fox"
- Document 3: "lazy fox sleeps day"

Step 2: Tokenization: Break the text into individual words (tokens).
 Output:
- Document 1: ["quick", "brown", "fox", "jumps", "lazy", "dog"]
- Document 2: ["quick", "brown", "dog", "jumps", "lazy", "fox"]
- Document 3: ["lazy", "fox", "sleeps", "day"]

Step 3: Normalization: Convert all tokens to lowercase and apply stemming (reducing words to their root form).
 Output:
- Document 1: ["quick", "brown", "fox", "jump", "lazi", "dog"]
- Document 2: ["quick", "brown", "dog", "jump", "lazi", "fox"]
- Document 3: ["lazi", "fox", "sleep", "day"]
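The three steps above can be reproduced with a short sketch using NLTK's stop word list and Porter stemmer (this assumes NLTK is installed and the 'stopwords' corpus has been downloaded):

```python
import re
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

docs = [
    "The quick brown fox jumps over the lazy dog.",
    "A quick brown dog jumps over the lazy fox.",
    "The lazy fox sleeps all day.",
]

stop = set(stopwords.words("english"))
stemmer = PorterStemmer()

for doc in docs:
    tokens = re.split(r"\W+", doc.lower())                               # cleaning + tokenization
    terms = [stemmer.stem(t) for t in tokens if t and t not in stop]     # stopping + stemming
    print(terms)
# ['quick', 'brown', 'fox', 'jump', 'lazi', 'dog']
# ['quick', 'brown', 'dog', 'jump', 'lazi', 'fox']
# ['lazi', 'fox', 'sleep', 'day']
```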


Indexing Process

 Indexing is a critical step in building a search engine or any information retrieval system. It involves organizing and structuring data (e.g., text, documents, or web pages) to enable fast and efficient searching.
 Users can’t find you unless you’re in the index.
 How do you add your website to the Google index? (Assignment)
 The purpose of storing an index is to optimize speed and performance in finding relevant documents for a search query.
 For example, while an index of 10,000 documents can be queried within milliseconds, a sequential scan of every word in 10,000 large documents could take hours.
Index data structures

Search engine architectures vary in the way indexing is performed and in methods of index storage, to meet the various design factors.
1. Inverted index
Stores a list of occurrences of each atomic search criterion, typically in the form of a hash table or binary tree.
2. Citation index
Stores citations or hyperlinks between documents to support citation analysis, a subject of bibliometrics.
3. n-gram index
Stores sequences of data of length n to support other types of retrieval or text mining.
4. Document-term matrix
Used in latent semantic analysis; stores the occurrences of words in documents in a two-dimensional sparse matrix.
Index Creation: Storing Document Statistics

‣ The counts and positions of document terms are stored.
‣ Common index types:
1. Forward index: the key is the document; the value is a list of terms and term positions. Easiest for the crawler to build.
2. Inverted index: the key is a term; the value is a list of documents and term positions. Provides faster processing at query time.
Forward index

- The rationale behind developing a forward index is that, as documents are parsed, it is better to store the words per document as an intermediate step.
- The forward index is then sorted to transform it into an inverted index.
- The forward index is essentially a list of pairs consisting of a document and a word, collated by the document.
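A minimal sketch of a forward index and its inversion into an inverted index (the document IDs and terms are illustrative):

```python
from collections import defaultdict

# Forward index: document -> list of terms (built as documents are parsed).
forward_index = {
    "d1": ["quick", "brown", "fox"],
    "d2": ["quick", "brown", "dog"],
    "d3": ["lazy", "fox"],
}

# Invert it: term -> set of documents containing the term.
inverted_index = defaultdict(set)
for doc_id, terms in forward_index.items():
    for term in terms:
        inverted_index[term].add(doc_id)

print(sorted(inverted_index["fox"]))   # ['d1', 'd3']
```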
Inverted index

• In its simplest form, this index can only determine whether a word exists within a particular document, since it stores no information regarding the frequency and position of the word; it is therefore considered to be a boolean index.
• Such an index determines which documents match a query but does not rank matched documents.
Index Creation: Storing Document Statistics

‣ The counts and positions of document terms are stored.
‣ Term weights are calculated and stored with the terms.
‣ The weight estimates the term’s importance to the document.
‣ The weights are used by ranking algorithms.
• e.g. TF-IDF ranks documents by the Term Frequency of the query term within the document times the Inverse Document Frequency of the term across all documents.
• Higher scores mean the document contains more of the query terms, and that those terms are not found in many other documents.
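A small sketch of this weighting, assuming the common tf × log(N/df) form (actual systems use many variants and normalisations):

```python
import math

def tf_idf(term, doc_terms, collection):
    tf = doc_terms.count(term)                            # term frequency in the document
    df = sum(1 for d in collection if term in d)          # documents containing the term
    idf = math.log(len(collection) / df) if df else 0.0   # inverse document frequency
    return tf * idf

collection = [["quick", "brown", "fox"], ["quick", "dog"], ["lazy", "fox", "fox"]]
print(tf_idf("fox", collection[2], collection))   # 2 * log(3/2) ≈ 0.81
```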
The Shakespeare collection as a Term-Document Matrix

Matrix element (t, d) is:
1 if term t occurs in document d,
0 otherwise.

These examples/figures are from: Manning, Raghavan, Schütze, Introduction to Information Retrieval, CUP, 2008.
The Shakespeare collection as a Term-Document Matrix

QUERY: Brutus AND Caesar AND NOT Calpurnia
Answer: “Antony and Cleopatra” (d=1), “Hamlet” (d=4)
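A tiny sketch of evaluating this Boolean query over posting sets; the docIDs follow the Shakespeare matrix (1 = "Antony and Cleopatra", 4 = "Hamlet"), and the set operations are the point rather than the exact data:

```python
postings = {
    "Brutus":    {1, 2, 4},
    "Caesar":    {1, 2, 4, 5, 6},
    "Calpurnia": {2},
}
all_docs = {1, 2, 3, 4, 5, 6}

# Brutus AND Caesar AND NOT Calpurnia
result = postings["Brutus"] & postings["Caesar"] & (all_docs - postings["Calpurnia"])
print(sorted(result))   # [1, 4]
```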
Inverted Index Data Structure

term (t)  list of document ids (d), e.g. “Brutus” occurs in d = 1, 2, 4, ...
Importantly, it is a sorted list.

Inverted Index

• Each index term is associated with an inverted list
– Contains lists of documents, or lists of word occurrences in documents, and other information
– Each entry is called a posting
– The part of the posting that refers to a specific document or location is called a pointer
– Each document in the collection is given a unique number
– Lists are usually document-ordered (sorted by document number)

Inverted index

 For each term t, we must store a list of all documents that contain t.
 Identify each doc by a docID, a document serial number.
 Can we use fixed-size arrays for this?

Brutus     1 2 4 11 31 45 173 174
Caesar     1 2 4 5 6 16 57 132
Calpurnia  2 31 54 101

What happens if the word Caesar is added to document 14?
Can we use fixed-size arrays for this?

 The inverted index is a sparse matrix, since not all words are present in each document.
 To reduce computer storage memory requirements, it is stored differently from a two-dimensional array.
 How can we reduce the index size? (index compression) Assignment

Inverted index

 We need variable-size postings lists
 On disk, a continuous run of postings is normal and best
 In memory, can use linked lists or variable-length arrays
 Some tradeoffs in size / ease of insertion

Dictionary   Postings (each entry in a list is a posting)
Brutus     1 2 4 11 31 45 173 174
Caesar     1 2 4 5 6 16 57 132
Calpurnia  2 31 54 101

Sorted by docID (more later on why).

Indexer steps: Token sequence

 Sequence of (Modified token, Document ID) pairs.

Doc 1: “I did enact Julius Caesar I was killed i’ the Capitol; Brutus killed me.”
Doc 2: “So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious.”

Indexer steps: Sort

 Sort by terms
 At least conceptually
 And then docID

Core indexing step



Indexer steps: Dictionary & Postings

 Multiple term entries in a single document are merged.
 Split into Dictionary and Postings.
 Document frequency information is added.

Why frequency? Will discuss later.
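A compact sketch of these indexer steps on Doc 1 and Doc 2: build the (token, docID) pairs, sort them, then merge into a dictionary with document frequencies plus postings lists (the tokenisation here is deliberately crude):

```python
from collections import defaultdict

docs = {
    1: "I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.",
    2: "So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious.",
}

# Step 1: token sequence of (modified token, docID) pairs
pairs = [(tok.strip(".;:'").lower(), doc_id)
         for doc_id, text in docs.items()
         for tok in text.split()]

# Step 2: sort by term, then docID (the core indexing step)
pairs.sort()

# Step 3: merge duplicates; split into dictionary (term -> doc freq) and postings (term -> docIDs)
postings = defaultdict(list)
for term, doc_id in pairs:
    if doc_id not in postings[term]:
        postings[term].append(doc_id)
dictionary = {term: len(doc_ids) for term, doc_ids in postings.items()}

print(dictionary["caesar"], postings["caesar"])   # 2 [1, 2]
print(dictionary["brutus"], postings["brutus"])   # 2 [1, 2]
```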
Final inverted index

[Figure: the final inverted index. The dictionary holds terms and counts; pointers link each term to its list of docIDs (the postings).]

IR system implementation questions:
• How do we index efficiently?
• How much storage do we need?
Index Creation for a Web Search Engine
1. Data Collection: A crawler fetches 1,000 web pages.
2. Text Extraction: Extract text and metadata (e.g., titles, headings) from the HTML pages.
3. Tokenization: Break the text into individual words.
4. Normalization: Lowercase the words, remove stop words, and apply stemming.
5. Inverted Index: Build an index mapping each term to the pages where it appears.
6. Index Compression: Compress the index to save storage space (Assignment).
7. Index Storage: Store the index in a distributed database.
8. Index Optimization: Shard the index across multiple servers.
9. Index Maintenance: Update the index as new pages are crawled.
