
IR

Module 04: Indexing and Scoring in Information Systems

Inverted files, also known as inverted indexes, are a fundamental data structure
used in information retrieval systems, particularly in search engines, to facilitate
fast and efficient searching of large text collections.

What is an Inverted File?


An inverted file is an index structure that maps content, typically words or
terms, to their locations within a set of documents. Instead of storing
documents sequentially, an inverted file allows you to quickly find all the
documents that contain a particular term. It essentially "inverts" the relationship
between documents and terms, hence the name.

Structure of an Inverted File


An inverted file consists of two main components:

1. Vocabulary (Dictionary): This is a list of all unique terms (words) that appear in the document collection.

2. Posting Lists: For each term in the vocabulary, there is a corresponding posting list, which contains the list of documents where the term appears. Each entry in the posting list typically includes:

   Document ID: Identifies the document where the term appears.

   Term Frequency: The number of times the term appears in that document (optional but commonly included).

   Positions: The exact positions (offsets) within the document where the term occurs (optional).

Example of an Inverted File


Let's consider a small collection of three documents:

Document 1: "information retrieval is important"

Document 2: "retrieval systems help find information"

Document 3: "efficient retrieval of information is key"

The inverted file for these documents might look like this:
Vocabulary:

information

retrieval

is

important

systems

help

find

efficient

of

key

Posting Lists:

information: {D1, D2, D3}

retrieval: {D1, D2, D3}

is: {D1, D3}

important: {D1}

systems: {D2}

help: {D2}

find: {D2}

efficient: {D3}

of: {D3}

key: {D3}

Here’s how it works:

Vocabulary Entry ("retrieval") has a posting list {D1, D2, D3} , indicating
that the term "retrieval" appears in Documents 1, 2, and 3.

Vocabulary Entry ("important") has a posting list {D1} , indicating that
"important" appears only in Document 1.

Advantages of Inverted Files


1. Efficient Search: By directly accessing the posting list of a term, the system can quickly identify all documents containing that term, making searches extremely fast.

2. Space Efficiency: While an inverted file requires additional storage compared to a simple list of documents, the space is managed efficiently, especially with techniques like compression applied to posting lists.

3. Facilitates Boolean Queries: Inverted files make it easy to handle Boolean queries (e.g., AND, OR, NOT operations) by intersecting or merging posting lists, as sketched below.
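For example, an AND query is answered by merge-intersecting two sorted posting lists. A minimal sketch, reusing the posting lists from the example above:

```python
def intersect(p1, p2):
    """Merge-intersect two sorted posting lists (Boolean AND)."""
    result, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            result.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return result

# Query: "information AND is"
print(intersect(["D1", "D2", "D3"], ["D1", "D3"]))  # ['D1', 'D3']
```

OR corresponds to merging (union) of the lists, and NOT to skipping documents found in the excluded term's list.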

Applications of Inverted Files


Search Engines: Inverted files are the backbone of search engines like Google, allowing them to quickly retrieve documents that match a user's query.

Text Analytics: Used in various text analytics applications, such as sentiment analysis or document classification.

Database Systems: Inverted files are also used in some database systems for indexing text fields.

Challenges and Considerations


Dynamic Updates: Handling updates (insertion or deletion of documents) in
an inverted file can be complex, requiring re-indexing or incremental
updates.

Storage: While inverted files are space-efficient, the need to store large
vocabularies and posting lists for very large datasets can become a
challenge.

Suffix Trees and Suffix Arrays

Both suffix trees and suffix arrays are data structures used in string processing
and pattern matching. They help in efficiently handling various string-related
queries. Here’s a brief explanation of each:

Suffix Tree
Concept:
A suffix tree is a compressed trie of all suffixes of a given string. It provides a
way to store and search suffixes efficiently. Each edge in a suffix tree
represents a substring of the original string, and the tree's leaves represent all
the suffixes of the string.
Example:
For the string "banana$" (a terminal marker "$" is appended so that every suffix ends at a leaf), the suffix tree looks like:

              [root]
       /      |      |      \
      $       a   banana$    na
             / \            /  \
            $   na         $    na$
               /  \
              $    na$

Suffixes: "banana$", "anana$", "nana$", "ana$", "na$", "a$", "$"

Leaves: Each leaf represents exactly one suffix of "banana$".

Usage:

Quick search for substring presence.

Longest repeated substring.

Pattern matching.

Suffix Array

Concept:
A suffix array is a sorted array of all suffixes of a string. It is a more space-
efficient data structure compared to suffix trees and is used in conjunction with
additional information, like the longest common prefix (LCP) array, to efficiently
solve string problems.

Example:
For the string "banana", the suffix array is:

1. Suffixes:

"banana"

"anana"

"nana"

"ana"

"na"

"a"

2. Sorted Suffixes:

"a" (index 5)

"ana" (index 3)

"anana" (index 1)

"banana" (index 0)

"na" (index 4)

"nana" (index 2)

3. Suffix Array: [5, 3, 1, 0, 4, 2]
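A naive construction sketch in Python (illustrative only; production systems use O(n log n) or linear-time algorithms such as SA-IS):

```python
def build_suffix_array(s):
    """Sort the suffix start positions by the suffix each one begins.
    Naive O(n^2 log n); fine for short strings like this example."""
    return sorted(range(len(s)), key=lambda i: s[i:])

print(build_suffix_array("banana"))  # [5, 3, 1, 0, 4, 2]
```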

Usage:

Pattern matching.

Longest common prefix (LCP) computation.

String sorting.
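Pattern matching uses binary search over the sorted suffixes, as the comparison table below notes. A sketch, reusing build_suffix_array from above:

```python
def contains(s, sa, pattern):
    """Binary-search the suffix array for the first suffix >= pattern,
    then check whether that suffix starts with the pattern."""
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if s[sa[mid]:sa[mid] + len(pattern)] < pattern:
            lo = mid + 1
        else:
            hi = mid
    return lo < len(sa) and s[sa[lo]:].startswith(pattern)

s = "banana"
sa = build_suffix_array(s)
print(contains(s, sa, "nan"))   # True
print(contains(s, sa, "band"))  # False
```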

Differences Between Suffix Trees and Suffix Arrays

| Feature | Suffix Tree | Suffix Array |
| --- | --- | --- |
| Definition | A compressed trie of all suffixes of the string. | A sorted array of all suffixes of the string. |
| Space Complexity | Typically O(n) to O(n^2), depending on implementation. | O(n) to O(n log n), depending on sorting method. |
| Time Complexity (Construction) | O(n) (with Ukkonen's algorithm). | O(n log n) for sorting suffixes. |
| Space Usage | Can be space-intensive due to additional pointers and nodes. | More space-efficient, usually requires O(n) space. |
| Search Efficiency | Allows for fast searches and substring queries. | Searching can be done in O(log n) with binary search. |
| Longest Common Prefix (LCP) | Not directly available, but can be computed with additional algorithms. | Can be efficiently computed using the suffix array (LCP array). |
| Applications | Used for complex string operations, like finding repeated substrings and pattern matching. | Used for efficient string sorting, substring searching, and other pattern-related problems. |
| Complexity of Queries | More complex, but can handle a variety of queries efficiently. | Simpler queries, but requires auxiliary data structures like the LCP array. |
| Construction Cost | Construction is costly. | Construction is cheap. |
| Binary Search | Binary search is not possible. | Binary search is possible. |
| Memory Optimization | Uses a Patricia tree to save memory. | Uses a supra-index for fast retrieval. |
| Underlying Structure | Uses a tree, a non-linear data structure. | Uses an array, a linear data structure. |
| Supporting Structure | Uses a trie as the supporting data structure. | Uses arrays as supporting data structures. |
| Large Texts | Not practical for large texts. | Can be used for large texts. |

Diagrams
Suffix Tree:

[root]
/ | \\
a n $
| | \\

IR 6
n a |
| | |
a n$ na$
|
ana$

Suffix Array:

String: banana
Suffixes: "banana", "anana", "nana", "ana", "na", "a"
Sorted Suffixes: "a" (index 5), "ana" (index 3), "anana" (index 1), "banana" (index 0), "na" (index 4), "nana" (index 2)
Suffix Array: [5, 3, 1, 0, 4, 2]

Structure of Signature Files


1. A signature file uses a hash function that maps words to bit masks of B bits.

2. It divides the text into blocks of b words each.

3. It then assigns a bit mask of size B to each text block.

4. The mask is obtained by performing a bitwise OR of the signatures of all the words in the text block.

5. So the signature file is the sequence of bit masks of all the blocks, together with a pointer to each block.

6. If a word is present in a text block, then all the bits set in its signature are also set in the bit mask of that text block.

7. Whenever a bit is set in the mask of the query word but not in the mask of the text block, the word is not present in that text block.

8. Fig. 4.3.1 shows an example where the sample text is cut into blocks:

Block 1: "This is a text"    -> 000101
Block 2: "A text has many"   -> 110101
Block 3: "words. Words are"  -> 100100
Block 4: "made from letters" -> 101101

Signature function h:
h(text)    = 000101
h(many)    = 110000
h(words)   = 100100
h(made)    = 001100
h(letters) = 100001

Even if a word is not present in a block, it is possible that all the bits of its signature happen to be set in the block mask anyway; this is called a false drop.
The hash function is forced to deliver bit masks which have at least ℓ bits set, and a good model assumes that these ℓ bits are randomly set in the mask.
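A minimal sketch of this construction (the hash function and the values of B and ℓ here are illustrative assumptions, not values from the text):

```python
import hashlib

B = 16   # bits per mask (assumed)
ELL = 3  # bits set per word signature (assumed)

def word_signature(word):
    """Set up to ELL bits of a B-bit mask, chosen by hashing the word.
    (Hash collisions may set fewer bits; real schemes force at least ELL.)"""
    sig = 0
    digest = hashlib.md5(word.encode()).digest()
    for i in range(ELL):
        sig |= 1 << (digest[i] % B)
    return sig

def block_mask(words):
    """Bitwise OR of the signatures of all words in a text block."""
    mask = 0
    for w in words:
        mask |= word_signature(w)
    return mask
```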

Searching in Signature Files


1. Searching for a single word is done by hashing it to a bit mask W and then comparing W against the bit masks B of all the text blocks.

2. Whenever W AND B = W (where AND is the bitwise AND), all the bits set in W are also set in B, and therefore the text block may contain the word.

3. For all candidate text blocks, an online traversal must be performed to verify whether the word is actually there.

4. This scheme is efficient for searching phrases and reasonable proximity queries, because all the words must be present in a block in order for that block to hold the phrase or the proximity query.

5. Hence, the bitwise OR of all the query masks is searched, so that all their bits must be present. This reduces the probability of false drops.

6. Some care must be taken at block boundaries to avoid missing a phrase which crosses a block limit. To allow searching phrases of j words or proximities of up to j words, consecutive blocks must overlap in j words.

7. If the blocks correspond to retrieval units, simple Boolean conjunctions involving words or phrases can also be improved by forcing all the relevant words to be in the block.
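Continuing the sketch above (same assumed word_signature and block_mask helpers), the block filter is a single bitwise test:

```python
def may_contain(mask, word):
    """True means the block MAY contain the word (false drops possible);
    False means the block definitely does not contain it."""
    w = word_signature(word)
    return mask & w == w

blocks = [["this", "is", "a", "text"], ["made", "from", "letters"]]
masks = [block_mask(b) for b in blocks]

# Candidate blocks must still be scanned online to rule out false drops.
candidates = [b for b, m in zip(blocks, masks) if may_contain(m, "text")]
```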

Hash Addressing
Hash Addressing (HA) is a technique to assign a storage location to a file by applying a hash function to the KEY of the file.

The KEY of a file can be any unique property (like a record number or the file's name).

The KEY can also be a combination of multiple properties of a file.

Here it is assumed that each location holds only a single file.

In HA, the most important component is the hashing function (f).

The hashing function (f) should distribute the addresses of the files uniformly over the available storage.

The hashing function (f) is the bottleneck for the performance of scatter storage.

Scatter storage fails in the case of a poor hashing function (f).

Fig. 4.4.1 shows a diagram illustrating the relationship between the KEY, the hashing function, and the file storage.
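A minimal sketch of the idea (the key-to-address function and table size here are assumptions for illustration; the scheme only requires that f scatter keys uniformly):

```python
TABLE_SIZE = 101  # number of storage locations (assumed, prime)

def f(key):
    """Hashing function: map a file KEY to a storage address."""
    return sum(ord(c) for c in key) % TABLE_SIZE

storage = [None] * TABLE_SIZE  # one file per location, as assumed above

def store(key, file_ref):
    addr = f(key)
    if storage[addr] is not None:
        # A poor f causes frequent collisions, and scatter storage fails.
        raise ValueError(f"collision at address {addr}")
    storage[addr] = (key, file_ref)

store("record_17", "<file reference>")
```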

Scoring
Scoring in Information Retrieval (IR) is the process of evaluating and ranking
documents based on their relevance to a user's query. The goal is to determine
how well each document matches the search criteria and to present the most
relevant documents at the top of the search results. Here's a breakdown of how
scoring works in IR:

1. Basic Concepts of Scoring

1.1 Relevance
Relevance refers to how well a document satisfies a user's query. It is typically
measured by the document's ability to match the terms or concepts in the
query.

1.2 Score
A score is a numerical value assigned to a document indicating its relevance to
a query. Higher scores suggest greater relevance.

2. Scoring Models
Several scoring models are used to compute relevance scores. Here are some
of the most common ones:

2.1 Term Frequency-Inverse Document Frequency (TF-IDF)


The TF-IDF scoring model is one of the most widely used methods in IR. It
combines two components:

Term Frequency (TF): Measures how often a term appears in a document. More frequent terms are considered more important.

Inverse Document Frequency (IDF): Measures how rare a term is across all
documents. Terms that are rare in the corpus are given higher weights.
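In a common formulation (several variants exist):

tf-idf(t, d) = tf(t, d) × log(N / df(t))

where tf(t, d) is the frequency of term t in document d, N is the total number of documents, and df(t) is the number of documents containing t.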

2.2 Vector Space Model


In the vector space model, documents and queries are represented as vectors
in a multi-dimensional space. The relevance score is computed using similarity
measures like cosine similarity:
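For a query vector q and a document vector d, the standard cosine similarity is:

cos(q, d) = (q · d) / (‖q‖ ‖d‖)

i.e., the dot product of the two vectors normalized by their lengths, so that long documents are not unfairly favored.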

2.3 Probabilistic Model
Probabilistic models estimate the probability that a document is relevant to a
query. One common example is the Binary Independence Model (BIM), which
uses probabilistic estimates to score documents.

3. Scoring Process Overview


1. Document Representation: Convert documents and queries into a suitable representation (e.g., TF-IDF vectors, term frequency counts).

2. Score Computation: Apply a scoring model to compute relevance scores based on the representation.

3. Ranking: Rank documents based on their scores. Higher scores indicate higher relevance.

4. Result Presentation: Present the ranked list of documents to the user.

4. Practical Example
Assume we have a query "machine learning" and two documents:

Doc 1: "Introduction to machine learning"

Doc 2: "The basics of programming"

For TF-IDF scoring:

1. Calculate TF-IDF for each term in each document.

2. Compute the relevance score for each document based on the query
terms.

If the query terms "machine" and "learning" have high TF-IDF values in Doc 1
compared to Doc 2, Doc 1 will have a higher score and be ranked higher.
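A minimal end-to-end sketch of this example (one simple TF-IDF variant; real systems differ in normalization and smoothing):

```python
import math

docs = {
    "Doc 1": "introduction to machine learning",
    "Doc 2": "the basics of programming",
}
N = len(docs)

def tf(term, text):
    """Relative frequency of a term in a document."""
    words = text.split()
    return words.count(term) / len(words)

def idf(term):
    """log(N / df): rarer terms get higher weight."""
    df = sum(term in text.split() for text in docs.values())
    return math.log(N / df) if df else 0.0

def score(query, text):
    """Relevance = sum of TF-IDF weights of the query terms."""
    return sum(tf(t, text) * idf(t) for t in query.split())

for name, text in docs.items():
    print(name, round(score("machine learning", text), 3))
# Doc 1 scores higher: it contains both query terms; Doc 2 contains neither.
```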

Term Weighting
Term weighting is a technique used in Information Retrieval (IR) to assign
importance to terms in documents. The goal is to determine how much
influence each term should have on the relevance of a document with respect
to a query.

1. Term Frequency (TF)


Concept:

Term Frequency (TF) measures how often a term appears in a document.

The more frequently a term appears, the higher its term frequency for that
document.

2. Inverse Document Frequency (IDF)


Concept:

Inverse Document Frequency (IDF) measures how rare a term is across all
documents in the corpus.

Terms that appear in many documents are considered less important, as they do not help in distinguishing between documents.

3. Term Frequency-Inverse Document Frequency (TF-IDF)
Concept:

TF-IDF combines TF and IDF to give a measure of a term's importance in a document relative to the entire corpus.

It balances the term's frequency within a document with its rarity across the
corpus.

Term Weighting Process


1. Compute TF for each term in each document.

2. Compute IDF for each term in the entire corpus.

3. Calculate TF-IDF for each term in each document.

4. Use TF-IDF scores to rank documents by their relevance to a query.

Summary Table

| Aspect | Term Frequency (TF) | Inverse Document Frequency (IDF) | TF-IDF |
| --- | --- | --- | --- |
| Definition | Measures term occurrence in a document. | Measures term rarity across the corpus. | Combines TF and IDF to measure term importance. |
| Formula | tf(t, d) = frequency of t in d | idf(t) = log(N / df(t)) | tf-idf(t, d) = tf(t, d) × idf(t) |
| Purpose | To highlight the importance of a term within a document. | To reduce the weight of terms that are too common. | To balance term importance and rarity. |
| Usage | Calculates term influence in a document. | Calculates how much a term can distinguish documents. | Determines how important a term is in a document relative to the corpus. |

(Here N is the total number of documents and df(t) the number of documents containing term t; the formulas shown are the common textbook variants.)
