
Module 1

Introduction
Information retrieval (IR) is finding material (usually documents) of an
unstructured nature (usually text) that satisfies an information need from within
large collections (usually stored on computers).
IR includes:

1. Web search

2. Email search

3. Searching a personal laptop

4. Corporate knowledge bases

5. Legal information retrieval

Data Extraction is the process of retrieving data from various sources to migrate it to a data repository, process it, and analyze it.
Information Extraction (IE) is the automated retrieval of specific information related to a selected topic from a body (or bodies) of text.

Data Mining is the process of analyzing large volumes of data to find patterns, discover trends, and gain insight into how that data can be used. Data miners can use the results of that process to make decisions or predict outcomes.

Web Mining is the process of using data mining techniques and algorithms to extract information from web documents and services, web content, hyperlinks, and server logs.
Web Crawler (Spider) is a standalone bot that systematically scans the Internet to index and search for content, following the links on web pages.
Web Scraping is the process of extracting and searching for specific information on specific websites or pages.
Collection is a set of documents.
Information need is the topic about which the user desires to know more.
Note that information need ≠ query; the query is what the user conveys to the computer in an attempt to communicate the information need.

Relevance:

A document is relevant if it contains useful information based on the user's query.

A document is irrelevant if it does not help answer the user's need.

Effectiveness: To assess an IR system, precision and recall are used.

Precision = Relevant documents retrieved / Total retrieved documents

Recall = Relevant documents retrieved / Total relevant documents in the system
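As a hypothetical worked example: suppose a system retrieves 10 documents, 6 of which are relevant, while the collection contains 20 relevant documents in total. Then precision = 6/10 = 0.6 and recall = 6/20 = 0.3.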

The goal of IR is to retrieve documents with information that is relevant to the user's need and helps the user complete a task.

Classic Search Model

The classic search model can go wrong at two steps: the user may misconceive the information need, or misformulate the query.

Misformulation causes a drop in precision.

Ad hoc retrieval is a standard task in IR. It refers to the process where a user submits a one-time (ad hoc) query, and the system retrieves relevant documents from a collection, without applying any refinements to the query.
Grepping is searching text in documents linearly. It can't be used on a huge collection: it is not efficient, and the naive alternative, a term-document matrix (next section), is too large to store.

Term-Document Incidence Matrix

In a term-document incidence matrix, entry (t, d) is 1 if term t occurs in document d and 0 otherwise. With the Shakespeare collection as an example, the query Brutus AND Caesar AND NOT Calpurnia matches both "Antony and Cleopatra" and "Hamlet".
Suppose we have N = 1 million documents, each about 1,000 words long (2–3 book pages), with an average of 6 bytes per word including spaces and punctuation. This is a document collection about 6 GB in size. Typically, there might be about M = 500,000 distinct terms in these documents, so the matrix would be 500K × 1M. The matrix would have at most 1 billion 1's (one per word occurrence), and the rest are zeros.
So we can't use the matrix; the better solution is the Inverted Index.
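A minimal Python sketch of answering such a Boolean query with incidence (bit) vectors; the 0/1 values below are illustrative assumptions, not the exact matrix from the lecture:

```python
# Term-document incidence vectors (illustrative values): entry d is 1
# iff the term appears in document d.
docs = ["Antony and Cleopatra", "Julius Caesar", "The Tempest",
        "Hamlet", "Othello", "Macbeth"]

incidence = {
    "Brutus":    [1, 1, 0, 1, 0, 0],
    "Caesar":    [1, 1, 0, 1, 1, 1],
    "Calpurnia": [0, 1, 0, 0, 0, 0],
}

def vec_and(a, b):
    return [x & y for x, y in zip(a, b)]

def vec_not(a):
    return [1 - x for x in a]

# Brutus AND Caesar AND NOT Calpurnia
result = vec_and(vec_and(incidence["Brutus"], incidence["Caesar"]),
                 vec_not(incidence["Calpurnia"]))
print([docs[d] for d, bit in enumerate(result) if bit])
# -> ['Antony and Cleopatra', 'Hamlet']
```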

Inverted Index
For each term t, we must store all documents that contain t

Each term has a corresponding postings list.

A postings list is sorted and variable-sized, stored as a linked list (or variable-length array), containing the IDs of the documents that contain term t.

A Forward Index is a list of documents, each mapped to the words that appear in it.

An Inverted Index is a list of words, each mapped to the documents that contain it.

Process of generating Inverted Index:

The modified tokens are the token stream after applying normalization, stemming, and stop-word removal.

To build an inverted index (a minimal code sketch follows the steps):

1. Read the documents, tokenize them, then preprocess.

2. Add the terms one by one with the corresponding doc ID.

3. Sort the terms alphabetically.

4. Merge duplicate terms and merge their postings lists, making sure the lists stay sorted.
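A minimal sketch of these steps in Python; the tokenizer is a bare lowercase split, and the two example documents are illustrative:

```python
from collections import defaultdict

def build_inverted_index(documents):
    """Build an inverted index from {doc_id: text}.
    Real systems would also normalize, stem, and remove stop words."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for token in text.lower().split():   # step 1: tokenize
            index[token].add(doc_id)         # step 2: term -> doc ID
    # steps 3-4: sort terms and keep postings lists sorted
    return {term: sorted(ids) for term, ids in sorted(index.items())}

index = build_inverted_index({1: "new home sales", 2: "home sales rise"})
print(index["home"])   # -> [1, 2]
```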

Query Processing with the Inverted Index
Ex: Given term x and term y, search for the query x AND y

Sol:

1. Get the postings lists of terms x and y

2. Merge (intersect) the two postings lists

3. The result contains the doc IDs that contain both terms x and y

Time complexity for the query is O(length of x's postings list + length of y's postings list), which is O(N) in the worst case, where N is the number of docs in the collection.
→ The intersection is done using two pointers, following this algorithm:
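A minimal Python sketch of this two-pointer merge (the postings values are illustrative):

```python
def intersect(p1, p2):
    """Two-pointer merge of two sorted postings lists.
    Runs in O(len(p1) + len(p2))."""
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:          # doc contains both terms
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:         # advance the pointer at the smaller ID
            i += 1
        else:
            j += 1
    return answer

print(intersect([1, 2, 4, 11, 31, 45], [2, 31, 54]))  # -> [2, 31]
```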

Boolean Retrieval Model

These models answer Boolean queries built with AND, OR, and NOT.

It was the primary commercial retrieval tool for 3 decades.

It works by viewing every document as a set of words and checking whether it matches the query condition or not.

Extended Boolean Retrieval Model

Its goal is to overcome the drawbacks of the Boolean model: the Boolean model doesn't consider term weights in queries, and the result set of a Boolean query is often either too small or too big.

Ex: What is the statute of limitations in cases involving the federal tort claims act?

LIMIT! /3 STATUTE ACTION /S FEDERAL /2 TORT /3 CLAIM /P ACCESS

! is a trailing wildcard: LIMIT! matches limit / limitation / limits

A space between terms (STATUTE ACTION) means OR.

/3 → "LIMIT!" must be within 3 words of "STATUTE" or "ACTION". (Ex: "A limit exists, but legal action can still be taken.")

/S → "FEDERAL" must be in the same sentence as the matched words.

/2 → "TORT" must be within 2 words of "FEDERAL". (Ex: "The federal liability and tort system is complex.")

/3 → "CLAIM" must be within 3 words of "TORT".

/P → "ACCESS" must be in the same paragraph as the matched words.

Query Optimization

WAY 1 (Optimize order of processing)

When processing a Boolean query, we want to minimize computation time by reducing the number of documents we process. The key optimization technique is processing terms in increasing order of postings-list size. This reduces the number of comparisons needed in AND operations, which are computationally expensive (because merging is needed).

For a query such as Brutus AND Caesar AND Calpurnia, process the terms in increasing order of postings-list size (here Calpurnia is the rarest and Caesar the most frequent):

Solution:

1. Calpurnia AND Brutus = RESULT

2. RESULT AND Caesar = answer of the query

For the query (tangerine OR trees) AND (marmalade OR skies) AND (kaleidoscope OR eyes), estimate the size of each OR by the sum of its terms' frequencies and process the smallest first:

1. Compute (tangerine OR trees) → estimated size 46,653 + 316,812 = 363,465

2. Compute (marmalade OR skies) → estimated size 107,913 + 271,658 = 379,571

3. Compute (kaleidoscope OR eyes) → estimated size 87,009 + 213,312 = 300,321

4. Merge (kaleidoscope OR eyes) AND (tangerine OR trees) → new set (iv)

5. Merge (iv) AND (marmalade OR skies) → final result

For a query such as (friends AND romans) AND NOT countrymen:

Solution:
→ If countrymen's frequency is small, then NOT countrymen excludes very little, so applying it before the AND doesn't help much.
→ If countrymen's frequency is high, then applying it before the AND will help reduce the documents.

1. Start with the rarest term, friends or romans, whichever has the smaller postings list.

2. Use AND to combine the first two terms (friends AND romans).

3. Apply NOT countrymen at the end, removing any documents that contain "countrymen".

A sketch of processing an AND query in increasing order of postings size follows.

WAY 2 (Use Skip Pointers)

More skips = shorter skip spans:

If you put a skip pointer after every few IDs, each skip covers a small number of IDs.

This lets you find your match faster.

Pros: more opportunities to skip.

Cons: you have to perform more comparisons against these skip pointers.

Fewer skips = longer skip spans:

If you insert skip pointers less frequently, each skip covers a large range.

Pros: you do fewer comparisons against skip pointers.

Cons: you may miss many opportunities to skip forward, reducing the potential performance boost.

Usually skip pointers are placed every √L postings, where L is the length of the postings list.
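A sketch of postings intersection with simulated skip pointers; the skip length is fixed at √L here, and a skip is taken only when it does not overshoot the other list's current doc ID:

```python
import math

def intersect_with_skips(p1, p2):
    """Intersect two sorted postings lists, skipping sqrt(L) entries
    at a time while the skip target is still <= the other list's ID."""
    skip1 = max(1, int(math.sqrt(len(p1))))
    skip2 = max(1, int(math.sqrt(len(p2))))
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            if i + skip1 < len(p1) and p1[i + skip1] <= p2[j]:
                i += skip1          # follow the skip pointer
            else:
                i += 1              # ordinary step
        else:
            if j + skip2 < len(p2) and p2[j + skip2] <= p1[i]:
                j += skip2
            else:
                j += 1
    return answer

print(intersect_with_skips([2, 4, 8, 41, 48, 64, 128],
                           [1, 2, 3, 8, 11, 17, 21, 31]))  # -> [2, 8]
```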

Phrase Queries
If we have the query "Stanford University" as a phrase, we don't want "I went to university at Stanford" to match; "university" must appear right after "Stanford". The normal inverted index can't handle this, because it stores only one term per entry.
Biword Indexes
To solve this we can try a Biword Index, which stores every two consecutive words as a term per entry. Ex: for "Friends Romans Countrymen", the inverted index will be:

Friends Romans → {doc IDs that have "Friends" followed by "Romans"}

Romans Countrymen → {doc IDs that have "Romans" followed by "Countrymen"}

Ex: For the query phrase "Stanford University Palo Alto", the query could be: Stanford University AND University Palo AND Palo Alto
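A minimal sketch of building a biword index (plain lowercase-split tokenization, illustrative document):

```python
from collections import defaultdict

def build_biword_index(documents):
    """Index every pair of consecutive tokens as one dictionary term."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        tokens = text.lower().split()
        for w1, w2 in zip(tokens, tokens[1:]):
            index[w1 + " " + w2].add(doc_id)
    return index

idx = build_biword_index({1: "friends romans countrymen"})
print(sorted(idx["friends romans"]))   # -> [1]
```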
Issues with Biword Indexes:

False positives: in the example above, a document may contain all the biwords without them appearing consecutively as the full phrase.

The dictionary (inverted index) becomes much larger.

Positional Indexes

Another way to handle phrase queries is the Positional Index. A positional index stores each term as a key and maps it to a list of document IDs; for each document, it stores a sorted list of positions where the term appears.
Example: the phrase query "to be or not to be"

1. Break it into word pairs:

"to be"

"be or"

"or not"

"not to"

"to be"

2. Retrieve the postings list for each term:

to → (doc1: positions 5, 9, 20), (doc2: positions 2, 70, 90)

be → (doc1: positions 6, 10, 30), (doc2: positions 3, 31, 91)

3. Merge the postings lists:

Find a common document ID

Check whether to appears at some position x and be appears at x+1 (ensuring they are adjacent)

Repeat for the next word pairs (be or, or not, etc.)

4. If all pairs match in order, the document contains the phrase.
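A sketch of the adjacency check for one word pair within a single document, given each term's sorted position list:

```python
def phrase_positions(pos1, pos2):
    """Positions where term2 occurs immediately after term1
    (both lists come from the SAME document)."""
    next_positions = set(pos2)
    return [p for p in pos1 if p + 1 in next_positions]

# doc1 from the example: "to" at 5, 9, 20 and "be" at 6, 10, 30
print(phrase_positions([5, 9, 20], [6, 10, 30]))   # -> [5, 9]
```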

Positional indexes can also process proximity queries such as LIMIT /3 STATUTE /3 FEDERAL /2 TORT, whereas biword indexes cannot, since they only store fixed word pairs. The algorithm for processing proximity queries is sketched below.
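A simple sketch (not necessarily the exact algorithm from the slides): this naive version compares all position pairs per document, where a more careful linear merge is also possible; postings here are assumed to be dicts of doc ID → sorted positions:

```python
def positional_intersect(post1, post2, k):
    """Report (doc_id, pos1, pos2) where the two terms occur
    within k words of each other."""
    matches = []
    for doc_id in post1.keys() & post2.keys():   # common documents
        for pos1 in post1[doc_id]:
            for pos2 in post2[doc_id]:
                if abs(pos1 - pos2) <= k:
                    matches.append((doc_id, pos1, pos2))
    return matches

post_limit   = {7: [3, 40]}
post_statute = {7: [5, 90]}
print(positional_intersect(post_limit, post_statute, 3))  # -> [(7, 3, 5)]
```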

→ Biword and positional indexes can be combined to form Combination Schemes, where biwords are used for common two-term phrases such as "Michael Jackson" and "Britney Spears".
→ Williams et al. evaluated a more sophisticated mixed indexing scheme, where a typical web query mixture executed in 1/4 of the time of using just a positional index, but it required 26% more space than a positional index alone.
→ A positional index is 2–4 times as large as a non-positional index.
→ A positional index's size is 35–50% of the volume of the original text.

Difference between IR (Unstructured Data) and DB (Structured Data)
→ IR deals with unstructured data (free text) with no constraints on the text.
→ DB deals with structured data, which is represented in tables and typically has constraints.
→ Semi-structured data sits in between, like slides (title, bullet points).

Challenges While Indexing
1. Parsing documents: detecting the file format, the language used, and the character set used.

2. Multi-language/multi-format documents: a single document may contain text in many languages and formats.

→ Usually commercial and open-source libraries can handle a lot of this.

Processing Documents
1. Tokenization: the process of converting a stream of text into a list of tokens, where a token is an instance of a sequence of characters (a word).
Challenges:

a. Finland's: should it be 2 tokens or a single token?

b. Hewlett-Packard: should it be 2 tokens or a single token?

c. Numbers/dates

d. Some languages, such as German, Chinese, and Japanese, write some words with no spaces between them.

e. A document may mix languages with different writing directions (left-to-right / right-to-left).

2. Removing stop words: using a stop list, you exclude from the dictionary entirely the most common words, as they have little semantic content.

There are other techniques that apply this step but don't remove all of the stop words, instead storing them in very little space (as in IIR ch. 5).

There are also techniques that don't apply this step at all, at the cost of longer query times (as in IIR ch. 7).

3. Normalization: the process of converting all variations of a word to a single form. This step is applied to both the collection's documents and the queries. (U.S.A. → USA, anti-discriminatory → antidiscriminatory, résumé → resume)

4. Case folding: the process of converting all letters into lowercase. This step is applied to both the collection's documents and the queries.

→ Equivalence classing is done via stemming and lemmatization, where different forms are reduced to a single representation.
→ Stemming chops affixes (automatic → automat); Porter's algorithm is commonly used.
→ Lemmatization returns a word to its base form (automatic → automate).
→ Asymmetric expansion instead adds related or variant forms at search time (query: window, search: window + windows).
→ To get synonyms and antonyms, a Thesaurus is used.
→ To handle misspellings phonetically, Soundex is used.
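A minimal preprocessing sketch tying these steps together: tokenize, case fold, remove stop words, and stem with Porter's algorithm (via NLTK, assumed installed; the stop list here is a tiny illustrative one):

```python
from nltk.stem import PorterStemmer

STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "is"}
stemmer = PorterStemmer()

def preprocess(text):
    tokens = text.lower().split()                        # tokenize + case fold
    tokens = [t for t in tokens if t not in STOP_WORDS]  # remove stop words
    return [stemmer.stem(t) for t in tokens]             # stem (Porter)

print(preprocess("The automatic analysis of documents"))
# -> ['automat', 'analysi', 'document']
```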
