Information Storage and
Retrieval (CS418)
Search Engine Architecture [2]
Lecture 3
Dr. Ebtsam AbdelHakam
Computer Science Dept.
Minia University
Indexing Process
[Figure: the indexing process (figure © Addison Wesley, 2008). Text acquisition – documents arrive via web crawling, provider feeds, RSS feeds, or desktop/email (what can you store? disk space? rights? compression?) and each gets a unique ID in a document data store. Text transformation – format conversion, international text, word units, stopping, stemming (which part contains the “meaning”?). Index creation – builds a lookup table for quickly finding all docs containing a word.]
Walid Magdy, TTDS 2017/2018
Document Acquisition (collecting data)
Before creating an index, the system must collect the data to be indexed. This data
can come from various sources, such as:
- Web pages (for web search engines).
- Documents (for enterprise search systems).
- Databases (for structured data search).
Example: A web crawler collects HTML pages from the
internet.
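A minimal sketch of this acquisition step in Python (assuming the third-party `requests` and `beautifulsoup4` packages; the seed URLs and page limit are illustrative, not part of any real system):

```python
import requests
from collections import deque
from urllib.parse import urljoin
from bs4 import BeautifulSoup

def crawl(seed_urls, max_pages=100):
    """Breadth-first crawl: fetch pages, store raw HTML, follow links."""
    frontier = deque(seed_urls)          # URLs waiting to be fetched
    seen, store = set(seed_urls), {}
    while frontier and len(store) < max_pages:
        url = frontier.popleft()
        try:
            html = requests.get(url, timeout=5).text
        except requests.RequestException:
            continue                     # skip unreachable pages
        store[url] = html                # the "document data store" (in memory here)
        for link in BeautifulSoup(html, "html.parser").find_all("a", href=True):
            absolute = urljoin(url, link["href"])
            if absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)
    return store
```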
Text Transformation (Pre-processing)
• It includes tasks to extract meaningful text and metadata from the
collected data.
This involves:
- Parsing HTML, PDFs, or other file formats to extract text.
- Removing unnecessary content like ads, navigation menus, or boilerplate
text.
- Extracting metadata such as titles, headings, and author information.
Example: From an HTML page, extract the `<title>`, `<h1>`, and `<p>`
tags.
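A small sketch of this extraction (assuming Python with `beautifulsoup4`; the HTML snippet is made up for illustration):

```python
from bs4 import BeautifulSoup

html = ("<html><head><title>IR Lecture</title></head>"
        "<body><h1>Indexing</h1><p>Search engines build indexes.</p></body></html>")

soup = BeautifulSoup(html, "html.parser")
title = soup.title.get_text()                             # "IR Lecture"
headings = [h.get_text() for h in soup.find_all("h1")]    # ["Indexing"]
paragraphs = [p.get_text() for p in soup.find_all("p")]   # ["Search engines build indexes."]
```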
Document Structure and Markup
Not all words are of equal value in a search; some parts of documents are
more important than others.
The document parser recognizes structure using markup, such as HTML tags –
headers, anchor text, and bolded text are all likely to be important.
Metadata can also be important – links are used for link analysis.
Text transformation
• Text transformation steps are applied to both documents
(before indexing) and queries (before processing):
• Objective: identify the optimal form of the term to be
indexed to achieve the best retrieval performance.
• Standard text pre-processing steps:
1. Tokenization
2. Stopping
3. Normalization
4. Stemming
5. POS tagging
Tokenization
• Tokenizer: converts a document into a stream of tokens, e.g. individual
words (i.e., breaking the extracted text down into individual words or terms).
• Sentence → tokenization (splitting) → tokens
• A token is an instance of a sequence of characters
• Typical technique: split at non-letter characters (space)
• Each such token is now a candidate for an index entry (term), after
further processing.
• Handling special cases like hyphenated words, contractions, and
abbreviations.
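A minimal Python sketch of the “split at non-letter characters” technique (special cases such as hyphens, contractions, and abbreviations would need extra rules):

```python
import re

def tokenize(text):
    """Split at any run of non-letter characters and drop empty strings."""
    return [tok for tok in re.split(r"[^A-Za-z]+", text) if tok]

print(tokenize("This is a very exciting lecture on the technologies of text!"))
# ['This', 'is', 'a', 'very', 'exciting', 'lecture', 'on', 'the', 'technologies', 'of', 'text']
```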
Stopping: stop words
• Example sentence: “This is a very exciting lecture on the technologies of text” – many of its words are stop words
• Stop words: the most common words in a collection
the, a, is, he, she, I, him, for, on, to, very, …
• There are a lot of them ≈ 30-40% of text
• New stop words appear in specific domains
• Tweets: RT “RT @realDonalTrump Mexico will …”
• Patents: said, claim “a said method that extracts ….”
• Stop words
• influence on sentence structure
• less influence on topic (aboutness)
Stopping: stop words
• Common practice in many applications: remove them
• You need them for:
• Phrase queries:
“King of Denmark”, “Let it be”, “To be or not to be”
• “Relational” queries: “flights to London”
Normalisation
• Objective: make words with different surface forms look the same
• Document: “this is my CAR!!”
Query: “car”
should “car” match “CAR”?
• Sentence → tokenisation → tokens → normalisation →
terms to be indexed
• Same tokenisation/normalisation steps should be
applied to documents & queries
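A minimal sketch of normalisation as simple case folding, applied identically to the document and the query (Python, illustrative only):

```python
import re

def normalise(text):
    """Tokenise, then case-fold so surface variants such as 'CAR' and 'car' match."""
    return [tok.lower() for tok in re.split(r"[^A-Za-z]+", text) if tok]

print(normalise("this is my CAR!!"))   # ['this', 'is', 'my', 'car']
print(normalise("car"))                # ['car']
```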
Stemming
• Search for: “play”
should it match: “played”, “playing”, “player”?
• Many morphological variations of words:
• inflectional (plurals, tenses)
• derivational (e.g., making verbs into nouns)
• In most cases, aboutness does not change
• Stemmers attempt to reduce morphological variations
of words to a root form/ common stem.
• usually involves removing suffixes (in English)
• Can be done at indexing time or as part of query
processing (like stopwords)
• Two basic types
• Dictionary-based: uses lists of related words
• Algorithmic: uses program to determine related words
• Algorithmic stemmers
• suffix-s: remove ‘s’ endings assuming plural
• e.g., cats → cat, lakes → lake, windows → window
• Many false negatives: supplies → supplie
• Some false positives: James → Jame
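A sketch of this naive suffix-s stemmer, showing both kinds of errors listed above:

```python
def suffix_s_stem(word):
    """Remove a trailing 's', assuming it marks a plural."""
    return word[:-1] if word.endswith("s") else word

print([suffix_s_stem(w) for w in ["cats", "lakes", "windows"]])  # ['cat', 'lake', 'window']
print(suffix_s_stem("supplies"))  # 'supplie' (false negative: never reaches 'supply')
print(suffix_s_stem("James"))     # 'Jame'    (false positive: not a plural at all)
```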
Porter Stemmer
• Most common algorithm for stemming English
• Conventions + 5 phases of reductions
• phases applied sequentially
• each phase consists of a set of commands
• sample convention:
of the rules in a compound command, select the one that
applies to the longest suffix.
• Example rules in Porter stemmer
• sses → ss (processes → process)
• y → i (reply → repli)
• ies → i (replies → repli)
• ement → null (replacement → replac)
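These reductions can be reproduced with an off-the-shelf implementation, e.g. NLTK's PorterStemmer (a sketch assuming the `nltk` package is installed):

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["processes", "reply", "replies", "replacement"]:
    print(word, "->", stemmer.stem(word))
# Expected reductions, matching the rules above:
# processes -> process, reply -> repli, replies -> repli, replacement -> replac
```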
What stemming cannot handle:
• Irregular verbs:
• saw → see
• went → go
• Different spellings
• colour vs. color
• tokenisation vs. tokenization
• Television vs. TV
• Synonyms
• car vs. vehicle
• UK vs. Britain
• Solution: query expansion …
• Text pre-processing before IR:
• Tokenisation → Stopping → Stemming
POS Tagging
POS tagging assigns grammatical labels (e.g., noun, verb,
adjective) to each word in the text. This step is useful for
tasks like syntactic analysis and information extraction.
POS taggers use statistical models of text to predict
syntactic tags of words.
Example tags:
NN (singular noun), NNS (plural noun), VB (verb), VBD
(verb, past tense), VBN (verb, past participle), IN
(preposition), JJ (adjective), CC (conjunction, e.g., “and”,
“or”), PRP (pronoun), and MD (modal auxiliary, e.g., “can”,
“will”).
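A quick illustration with NLTK's default tagger (a sketch assuming `nltk` and its tokenizer/tagger models have been downloaded; exact tags depend on the model):

```python
import nltk
# One-time model downloads (resource names may vary by NLTK version):
# nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")

tokens = nltk.word_tokenize("The noble Brutus was ambitious")
print(nltk.pos_tag(tokens))
# e.g. [('The', 'DT'), ('noble', 'JJ'), ('Brutus', 'NNP'), ('was', 'VBD'), ('ambitious', 'JJ')]
```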
N-Grams
Frequent n-grams are more likely to be
meaningful phrases
N-grams form a Zipf distribution – an even better fit than words alone
Could index all n-grams up to specified length
Much faster than POS tagging
Uses a lot of storage
e.g., a document containing 1,000 words would
contain 3,990 instances of word n-grams of
length 2 ≤ n ≤ 5
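A sketch that generates word n-grams and reproduces the count above (999 + 998 + 997 + 996 = 3,990 for a 1,000-word document):

```python
def word_ngrams(tokens, min_n=2, max_n=5):
    """All contiguous word n-grams with min_n <= n <= max_n."""
    return [tuple(tokens[i:i + n])
            for n in range(min_n, max_n + 1)
            for i in range(len(tokens) - n + 1)]

tokens = ["w%d" % i for i in range(1000)]   # stand-in for a 1,000-word document
print(len(word_ngrams(tokens)))             # 3990
```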
Vectorization step
Vectorization converts text into numerical representations that can be used
by machine learning models. Common techniques include:
- Bag of Words (BoW): Represents text as a vector of word frequencies.
- TF-IDF (Term Frequency-Inverse Document Frequency): Weighs words
based on their importance in the document and corpus.
- Word Embeddings: Represents words as dense vectors (e.g., Word2Vec,
GloVe).
- Sentence Embeddings: Represents entire sentences as vectors (e.g.,
BERT, Sentence-BERT).
Example:
- Input: `The quick brown fox.`
- Output (BoW): `{"the": 1, "quick": 1, "brown": 1, "fox": 1}`
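A sketch of the BoW and TF-IDF variants using scikit-learn (assuming a recent `scikit-learn`; the two toy documents are illustrative):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["The quick brown fox.", "A quick brown dog."]

bow = CountVectorizer()
counts = bow.fit_transform(docs)        # sparse document-term matrix of word frequencies
print(bow.get_feature_names_out())      # ['brown' 'dog' 'fox' 'quick' 'the']
print(counts.toarray())                 # one row per document: [1 0 1 1 1] and [1 1 0 1 0]

tfidf = TfidfVectorizer()
weights = tfidf.fit_transform(docs)     # same shape, TF-IDF weights instead of raw counts
```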
Text Transformation Pipeline:
1. Text Cleaning
- Input: `<p>Hello, world! This is a <b>test</b>.</p>`
- Output: `Hello world This is a test`
2. Tokenization
- Output: `["Hello", "world", "This", "is", "a", "test"]`
3. Normalization
- Output: `["hello", "world", "this", "be", "a", "test"]`
4. Vectorization (BoW)
- Output: `{"hello": 1, "world": 1, "this": 1, "be": 1, "a": 1, "test": 1}`
Tools and Libraries for Text Transformation:
1. NLTK (Natural Language Toolkit): A Python library for text processing.
2. spaCy: An industrial-strength NLP library for text transformation.
3. Scikit-learn: A machine learning library with tools for text vectorization.
4. Gensim: A library for topic modeling and word embeddings.
5. Transformers (Hugging Face): A library for advanced NLP tasks using
pre-trained models like BERT.
Example: We have a collection of three documents:
Document 1: "The quick brown fox jumps over the lazy dog."
Document 2: "A quick brown dog jumps over the lazy fox."
Document 3: "The lazy fox sleeps all day."
Step 1: Text Cleaning: Remove unnecessary characters, punctuation, and stop words (common words like "the," "a," "is," etc.).
Output:
- Document 1: "quick brown fox jumps lazy dog"
- Document 2: "quick brown dog jumps lazy fox"
- Document 3: "lazy fox sleeps day"
Step 2: Tokenization: Break the text into individual words (tokens).
Output:
- Document 1: ["quick", "brown", "fox", "jumps", "lazy", "dog"]
- Document 2: ["quick", "brown", "dog", "jumps", "lazy", "fox"]
- Document 3: ["lazy", "fox", "sleeps", "day"]
Step 3: Normalization: Convert all tokens to lowercase and apply stemming (reducing words to their root form).
Output:
- Document 1: ["quick", "brown", "fox", "jump", "lazi", "dog"]
- Document 2: ["quick", "brown", "dog", "jump", "lazi", "fox"]
- Document 3: ["lazi", "fox", "sleep", "day"]
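These three steps can be reproduced with NLTK (a sketch assuming the `nltk` package and its English stop-word list; note that Porter stems such as "lazi" need not be real words):

```python
import re
from nltk.corpus import stopwords        # requires nltk.download("stopwords") once
from nltk.stem import PorterStemmer

stop = set(stopwords.words("english"))
stemmer = PorterStemmer()

def preprocess(doc):
    tokens = re.findall(r"[a-z]+", doc.lower())                  # clean, tokenize, lowercase
    return [stemmer.stem(t) for t in tokens if t not in stop]    # stop, then stem

print(preprocess("The quick brown fox jumps over the lazy dog."))
# ['quick', 'brown', 'fox', 'jump', 'lazi', 'dog']
```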
Indexing Process
It is a critical step in building a search engine or any information retrieval
system. It involves organizing and structuring data (e.g., text, documents,
or web pages) to enable fast and efficient searching.
Users can’t find you unless you’re in the index.
How do you add your website to the Google index?
(Assignment)
The purpose of storing an index is to optimize speed and performance in
finding relevant documents for a search query.
For example, while an index of 10,000 documents can be queried within
milliseconds, a sequential scan of every word in 10,000 large documents
could take hours.
Index data structures
Search engine architectures vary in the way
indexing is performed and in methods of index
storage to meet the various design factors.
1. Inverted index
Stores a list of occurrences of each atomic search criterion, typically in the form of a hash
table or binary tree.
2. Citation index
Stores citations or hyperlinks between documents to support citation analysis, a subject
of bibliometrics.
3. n-gram index
Stores sequences of length of data to support other types of retrieval or text mining.
4. Document-term matrix
Used in latent semantic analysis, stores the occurrences of words in documents in a two-
dimensional sparse matrix.
Index Creation
Storing Document Statistics
‣ The counts and positions of document terms are stored.
‣ Common index types:
1. Forward Index: Key is the document, value is a list of terms
and term positions. Easiest for the crawler to build.
2. Inverted Index: Key is a term, value is a list of documents
and term positions. Provides faster processing at query time.
Forward index
- The rationale behind developing a forward index is that as documents
are parsed, it is better to intermediately store the words per document.
- The forward index is sorted to transform it to an inverted index.
- The forward index is essentially a list of pairs consisting of a document
and a word, collated by the document.
Inverted index
• This index can only determine whether a word exists within a particular
document, since it stores no information regarding the frequency and
position of the word; it is therefore considered to be a boolean index.
• Such an index determines which documents match a query but does
not rank matched documents.
Index Creation
Storing Document Statistics
‣ The counts and positions of document terms are stored
‣ Term weights are calculated and stored with the terms.
‣ The weight estimates the term’s importance to the document.
‣ The weights are used by ranking algorithms
• e.g. TF-IDF ranks documents by the Term Frequency of the query
term within the document times the Inverse Document Frequency
of the term across all documents.
• Higher scores mean the document contains more of the query terms that are
rare across the collection (i.e., not found in many documents).
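A minimal sketch of this scoring scheme (one common formulation, using a logarithmic IDF; the toy documents are illustrative only):

```python
import math

def tf_idf_score(query_terms, doc, collection):
    """score(q, d) = sum over query terms t of tf(t, d) * idf(t)."""
    N = len(collection)
    score = 0.0
    for t in query_terms:
        tf = doc.count(t)                               # term frequency in this document
        df = sum(1 for d in collection if t in d)       # number of documents containing t
        score += tf * (math.log(N / df) if df else 0)   # rarer terms get higher weight
    return score

docs = [["quick", "brown", "fox"], ["lazy", "fox"], ["lazy", "dog"]]
print(tf_idf_score(["fox", "dog"], docs[2], docs))      # only 'dog' contributes (no 'fox' in docs[2])
```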
The Shakespeare collection as
Term-Document Matrix
Matrix element (t,d) is:
1 if term t occurs in document d,
0 otherwise
These examples/figures are from: Manning, Raghavan, Schütze, Intro to Information Retrieval, CUP, 2008
The Shakespeare collection as
Term-Document Matrix
QUERY:
Brutus AND Caesar AND NOT Calpurnia
Answer: “Antony and Cleopatra”(d=1), “Hamlet”(d=4)
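A sketch of evaluating that boolean query over toy postings lists stored as Python sets (real systems intersect sorted lists instead, but the logic is the same):

```python
# Toy postings: term -> set of docIDs containing the term
index = {
    "brutus":    {1, 2, 4, 11, 31, 45, 173, 174},
    "caesar":    {1, 2, 4, 5, 6, 16, 57, 132},
    "calpurnia": {2, 31, 54, 101},
}
all_docs = set().union(*index.values())

# Brutus AND Caesar AND NOT Calpurnia
answer = index["brutus"] & index["caesar"] & (all_docs - index["calpurnia"])
print(sorted(answer))   # [1, 4]
```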
Inverted Index Data Structure
term (t) → document id (d), e.g. “Brutus” occurs in d = 1, 2, 4, ...
Importantly, it’s a sorted list
Inverted Index
• Each index term is associated with an
inverted list
– Contains lists of documents, or lists of word
occurrences in documents, and other
information
– Each entry is called a posting
– The part of the posting that refers to a specific
document or location is called a pointer
– Each document in the collection is given a
unique number
– Lists are usually document-ordered (sorted by
document number)
Inverted index
For each term t, we must store a list of all documents that contain t.
Identify each doc by a docID, a document serial number
Can we use fixed-size arrays for this?
Brutus → 1 2 4 11 31 45 173 174
Caesar → 1 2 4 5 6 16 57 132
Calpurnia → 2 31 54 101
What happens if the word Caesar is added to document 14?
Can we use fixed-size arrays for this?
The term–document matrix underlying the inverted index is sparse, since not all
words are present in each document.
To reduce storage requirements, the index is stored differently from a
two-dimensional array.
How can we reduce index size?
(index compression) – Assignment
Inverted index
We need variable-size postings lists
On disk, a continuous run of postings is normal and best
In memory, can use linked lists or variable length arrays
Some tradeoffs in size/ease of insertion
Dictionary → Postings (each docID entry in a postings list is a posting):
Brutus → 1 2 4 11 31 45 173 174
Caesar → 1 2 4 5 6 16 57 132
Calpurnia → 2 31 54 101
Sorted by docID (more later on why).
Indexer steps: Token sequence
Sequence of (Modified token, Document ID) pairs.
Doc 1: I did enact Julius Caesar I was killed i’ the Capitol; Brutus killed me.
Doc 2: So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious
Indexer steps: Sort
Sort by terms
At least conceptually
And then docID
Core indexing step
Indexer steps: Dictionary & Postings
Multiple term entries in a
single document are merged.
Split into Dictionary and
Postings
Doc. frequency information is
added.
Why frequency?
Will discuss later.
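A sketch of these indexer steps on the two example documents: emit (term, docID) pairs, sort them, then merge duplicates into a dictionary with document frequencies and per-term postings lists:

```python
from collections import defaultdict

docs = {
    1: "I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.",
    2: "So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious",
}

# 1. Token sequence: (modified token, docID) pairs
pairs = [(tok.strip(".;'").lower(), doc_id)
         for doc_id, text in docs.items()
         for tok in text.split()]

# 2. Sort by term, then by docID (the core indexing step)
pairs.sort()

# 3. Merge: dictionary stores term -> document frequency, postings store term -> docIDs
postings = defaultdict(list)
for term, doc_id in pairs:
    if not postings[term] or postings[term][-1] != doc_id:   # merge repeats within a document
        postings[term].append(doc_id)
doc_freq = {term: len(ids) for term, ids in postings.items()}

print(postings["caesar"], doc_freq["caesar"])   # [1, 2] 2
print(postings["brutus"], doc_freq["brutus"])   # [1, 2] 2
```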
Final inverted index
[Figure: the final inverted index – the dictionary stores terms and counts, with pointers to the lists of docIDs (the postings).]
IR system implementation
• How do we index efficiently?
• How much storage do we need?
Index Creation for a Web Search Engine
1. Data Collection: A crawler fetches 1,000 web pages.
2. Text Extraction: Extract text and metadata (e.g., titles, headings) from the
HTML pages.
3. Tokenization: Break the text into individual words.
4. Normalization: Lowercase the words, remove stop words, and apply
stemming.
5. Inverted Index: Build an index mapping each term to the pages where it
appears.
6. Index Compression: Compress the index to save storage space (Assignment).
7. Index Storage: Store the index in a distributed database.
8. Index Optimization: Shard the index across multiple servers.
9. Index Maintenance: Update the index as new pages are crawled.