
Module 1

Introduction
Information retrieval (IR) is finding material (usually documents) of an
unstructured nature (usually text) that satisfies an information need from within
large collections (usually stored on computers).
IR includes:

1. Web search

2. Email search

3. Searching a personal laptop

4. Corporate knowledge bases

5. Legal information retrieval

Data Extraction is the process of retrieving data from various sources to migrate it to a data repository, process it, and analyze it.
Information Extraction (IE) is the automated retrieval of specific information related to a selected topic from a body (or bodies) of text.

Data Mining is the process of analyzing large volumes of data to find patterns, discover trends, and gain insight into how that data can be used. Data miners can use the results of that process to make decisions or predict outcomes.

Web Mining is the process of using data mining techniques and algorithms to extract information from web documents and services, web content, hyperlinks, and server logs.
Web Crawler (Spider) is a standalone bot that systematically scans the Internet to index and search for content, following the links on web pages.
Web Scraping is the process of extracting and searching for specific information on specific websites or pages.
Collection is a set of documents.
Information need is the topic about which the user desires to know more.
Note that information need ≠ query; the query is what the user conveys to the computer in an attempt to communicate the information need.

Relevance:

A document is relevant if it contains useful information based on the user's query.

A document is irrelevant if it does not help answer the user's need.

Effectiveness: To assess an IR system, precision and recall are used.

Precision = Relevant documents retrieved / Total retrieved documents

Recall = Relevant documents retrieved / Total relevant documents in the system
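As a hypothetical worked example: suppose a system retrieves 10 documents, 6 of which are relevant, while the collection contains 20 relevant documents in total. Then precision = 6/10 = 0.6 and recall = 6/20 = 0.3.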

The goal of IR is to retrieve documents with information that is relevant to the user's need and helps the user complete a task.

Classic Search Model

The classic search model can go wrong at two steps: the user may misconceive the information need, or misformulate the query.

Misformulation causes a drop in precision.

Ad hoc retrieval is a standard task in IR. It refers to the process where a user submits a one-time (ad hoc) query, and the system retrieves relevant documents from a collection, without applying any refinements to the query.
Grepping is searching text in documents linearly. It can't be used on a huge collection: it is not efficient, and the naive alternative, a term-document matrix (next section), is too large to store.

Term-Document Incidence Matrix

In a term-document incidence matrix, entry (t, d) is 1 if term t occurs in document d and 0 otherwise. With the Shakespeare collection as an example, the query Brutus AND Caesar AND NOT Calpurnia matches both "Antony and Cleopatra" and "Hamlet".
Suppose we have N = 1 million documents, each about 1,000 words long (2–3 book pages), with an average of 6 bytes per word including spaces and punctuation. This is a document collection about 6 GB in size. Typically, there might be about M = 500,000 distinct terms in these documents, so the matrix would be 500K × 1M. The matrix would have at most 1 billion 1's (one per word occurrence), and the rest are zeros.
So we can't use the matrix; the better solution is the Inverted Index.
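A minimal Python sketch of answering such a Boolean query with incidence (bit) vectors; the 0/1 values below are illustrative assumptions, not the exact matrix from the lecture:

```python
# Term-document incidence vectors (illustrative values): entry d is 1
# iff the term appears in document d.
docs = ["Antony and Cleopatra", "Julius Caesar", "The Tempest",
        "Hamlet", "Othello", "Macbeth"]

incidence = {
    "Brutus":    [1, 1, 0, 1, 0, 0],
    "Caesar":    [1, 1, 0, 1, 1, 1],
    "Calpurnia": [0, 1, 0, 0, 0, 0],
}

def vec_and(a, b):
    return [x & y for x, y in zip(a, b)]

def vec_not(a):
    return [1 - x for x in a]

# Brutus AND Caesar AND NOT Calpurnia
result = vec_and(vec_and(incidence["Brutus"], incidence["Caesar"]),
                 vec_not(incidence["Calpurnia"]))
print([docs[d] for d, bit in enumerate(result) if bit])
# -> ['Antony and Cleopatra', 'Hamlet']
```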

Inverted Index
For each term t, we must store all documents that contain t

Each term has a corresponding postings list.

A postings list is sorted and variable-sized, stored as a linked list (or variable-length array), containing the IDs of the documents that contain term t.

A Forward Index is a list of documents, each mapped to the words that appear in it.

An Inverted Index is a list of words, each mapped to the documents that contain it.

Process of generating Inverted Index:

The modified tokens are the token stream after applying normalization, stemming, and stop-word removal.

To build an inverted index (a minimal code sketch follows the steps):

1. Read the documents, tokenize them, then preprocess.

2. Add the terms one by one with the corresponding doc ID.

3. Sort the terms alphabetically.

4. Merge duplicate terms and merge their postings lists, making sure the lists stay sorted.
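A minimal sketch of these steps in Python; the tokenizer is a bare lowercase split, and the two example documents are illustrative:

```python
from collections import defaultdict

def build_inverted_index(documents):
    """Build an inverted index from {doc_id: text}.
    Real systems would also normalize, stem, and remove stop words."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for token in text.lower().split():   # step 1: tokenize
            index[token].add(doc_id)         # step 2: term -> doc ID
    # steps 3-4: sort terms and keep postings lists sorted
    return {term: sorted(ids) for term, ids in sorted(index.items())}

index = build_inverted_index({1: "new home sales", 2: "home sales rise"})
print(index["home"])   # -> [1, 2]
```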

Query Processing with the Inverted Index
Ex: Given term x and term y, search for the query x AND y

Sol:

1. Get the postings lists of terms x and y

2. Merge (intersect) the two postings lists

3. The result contains the doc IDs that contain both terms x and y

Time complexity for the query is O(length of x's postings list + length of y's postings list), which is O(N) in the worst case, where N is the number of docs in the collection.
→ The intersection is done using two pointers, following this algorithm:
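A minimal Python sketch of this two-pointer merge (the postings values are illustrative):

```python
def intersect(p1, p2):
    """Two-pointer merge of two sorted postings lists.
    Runs in O(len(p1) + len(p2))."""
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:          # doc contains both terms
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:         # advance the pointer at the smaller ID
            i += 1
        else:
            j += 1
    return answer

print(intersect([1, 2, 4, 11, 31, 45], [2, 31, 54]))  # -> [2, 31]
```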

Boolean Retrieval Model

These models answer Boolean queries built with AND, OR, and NOT.

It was the primary commercial retrieval tool for 3 decades.

It works by viewing every document as a set of words and checking whether it matches the query condition or not.

Extended Boolean Retrieval Model

Its goal is to overcome the drawbacks of the Boolean model: the Boolean model doesn't consider term weights in queries, and the result set of a Boolean query is often either too small or too big.

Ex: What is the statute of limitations in cases involving the federal tort claims act?

LIMIT! /3 STATUTE ACTION /S FEDERAL /2 TORT /3 CLAIM /P ACCESS

! is a trailing wildcard: LIMIT! matches limit / limitation / limits

A space between terms (STATUTE ACTION) means OR.

/3 → "LIMIT!" must be within 3 words of "STATUTE" or "ACTION". (Ex: "A limit exists, but legal action can still be taken.")

/S → "FEDERAL" must be in the same sentence as the matched words.

/2 → "TORT" must be within 2 words of "FEDERAL". (Ex: "The federal liability and tort system is complex.")

/3 → "CLAIM" must be within 3 words of "TORT".

/P → "ACCESS" must be in the same paragraph as the matched words.

Query Optimization

WAY 1 (Optimize order of processing)

When processing a Boolean query, we want to minimize computation time by reducing the number of documents we process. The key optimization technique is processing terms in increasing order of postings-list size. This reduces the number of comparisons needed in AND operations, which are computationally expensive (because merging is needed).

For a query such as Brutus AND Caesar AND Calpurnia, process the terms in increasing order of postings-list size (here Calpurnia is the rarest and Caesar the most frequent):

Solution:

1. Calpurnia AND Brutus = RESULT

2. RESULT AND Caesar = answer of the query

For the query (tangerine OR trees) AND (marmalade OR skies) AND (kaleidoscope OR eyes), estimate the size of each OR by the sum of its terms' frequencies and process the smallest first:

1. Compute (tangerine OR trees) → estimated size 46,653 + 316,812 = 363,465

2. Compute (marmalade OR skies) → estimated size 107,913 + 271,658 = 379,571

3. Compute (kaleidoscope OR eyes) → estimated size 87,009 + 213,312 = 300,321

4. Merge (kaleidoscope OR eyes) AND (tangerine OR trees) → new set (iv)

5. Merge (iv) AND (marmalade OR skies) → final result

For a query such as (friends AND romans) AND NOT countrymen:

Solution:
→ If countrymen's frequency is small, then NOT countrymen excludes very little, so applying it before the AND doesn't help much.
→ If countrymen's frequency is high, then applying it before the AND will help reduce the documents.

1. Start with the rarest term, friends or romans, whichever has the smaller postings list.

2. Use AND to combine the first two terms (friends AND romans).

3. Apply NOT countrymen at the end, removing any documents that contain "countrymen".

A sketch of processing an AND query in increasing order of postings size follows.

WAY 2 (Use Skip Pointers)

More skips = shorter skip spans:

If you put a skip pointer after every few IDs, each skip covers a small number of IDs.

This lets you find your match faster.

Pros: more opportunities to skip.

Cons: you have to perform more comparisons against these skip pointers.

Fewer skips = longer skip spans:

If you insert skip pointers less frequently, each skip covers a large range.

Pros: you do fewer comparisons against skip pointers.

Cons: you may miss many opportunities to skip forward, reducing the potential performance boost.

Usually skip pointers are placed every √L postings, where L is the length of the postings list.
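A sketch of postings intersection with simulated skip pointers; the skip length is fixed at √L here, and a skip is taken only when it does not overshoot the other list's current doc ID:

```python
import math

def intersect_with_skips(p1, p2):
    """Intersect two sorted postings lists, skipping sqrt(L) entries
    at a time while the skip target is still <= the other list's ID."""
    skip1 = max(1, int(math.sqrt(len(p1))))
    skip2 = max(1, int(math.sqrt(len(p2))))
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            if i + skip1 < len(p1) and p1[i + skip1] <= p2[j]:
                i += skip1          # follow the skip pointer
            else:
                i += 1              # ordinary step
        else:
            if j + skip2 < len(p2) and p2[j + skip2] <= p1[i]:
                j += skip2
            else:
                j += 1
    return answer

print(intersect_with_skips([2, 4, 8, 41, 48, 64, 128],
                           [1, 2, 3, 8, 11, 17, 21, 31]))  # -> [2, 8]
```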

Phrase Queries
If we have the query "Stanford University" as a phrase, we don't want "I went to university at Stanford" to match; "university" must appear right after "Stanford". The normal inverted index can't handle this, because it stores only one term per entry.
Biword Indexes
To solve this we can try a Biword Index, which stores every two consecutive words as a term per entry. Ex: for "Friends Romans Countrymen", the inverted index will be:

Friends Romans → {doc IDs that have "Friends" followed by "Romans"}

Romans Countrymen → {doc IDs that have "Romans" followed by "Countrymen"}

Ex: For the query phrase "Stanford University Palo Alto", the query could be: Stanford University AND University Palo AND Palo Alto
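A minimal sketch of building a biword index (plain lowercase-split tokenization, illustrative document):

```python
from collections import defaultdict

def build_biword_index(documents):
    """Index every pair of consecutive tokens as one dictionary term."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        tokens = text.lower().split()
        for w1, w2 in zip(tokens, tokens[1:]):
            index[w1 + " " + w2].add(doc_id)
    return index

idx = build_biword_index({1: "friends romans countrymen"})
print(sorted(idx["friends romans"]))   # -> [1]
```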
Issues with Biword Indexes:

False positives: in the example above, a document may contain all the biwords without them appearing consecutively as the full phrase.

The dictionary (inverted index) becomes much larger.

Positional Indexes

Another way to handle phrase queries is the Positional Index. A positional index stores each term as a key and maps it to a list of document IDs; for each document, it stores a sorted list of positions where the term appears.
Example: the phrase query "to be or not to be"

1. Break it into word pairs:

"to be"

"be or"

"or not"

"not to"

"to be"

2. Retrieve the postings list for each term:

to → (doc1: positions 5, 9, 20), (doc2: positions 2, 70, 90)

be → (doc1: positions 6, 10, 30), (doc2: positions 3, 31, 91)

3. Merge the postings lists:

Find a common document ID

Check whether to appears at some position x and be appears at x+1 (ensuring they are adjacent)

Repeat for the next word pairs (be or, or not, etc.)

4. If all pairs match in order, the document contains the phrase.
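A sketch of the adjacency check for one word pair within a single document, given each term's sorted position list:

```python
def phrase_positions(pos1, pos2):
    """Positions where term2 occurs immediately after term1
    (both lists come from the SAME document)."""
    next_positions = set(pos2)
    return [p for p in pos1 if p + 1 in next_positions]

# doc1 from the example: "to" at 5, 9, 20 and "be" at 6, 10, 30
print(phrase_positions([5, 9, 20], [6, 10, 30]))   # -> [5, 9]
```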

Positional indexes can also process proximity queries such as LIMIT /3 STATUTE /3 FEDERAL /2 TORT, whereas biword indexes cannot, since they only store fixed word pairs. The algorithm for processing proximity queries is sketched below.
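A simple sketch (not necessarily the exact algorithm from the slides): this naive version compares all position pairs per document, where a more careful linear merge is also possible; postings here are assumed to be dicts of doc ID → sorted positions:

```python
def positional_intersect(post1, post2, k):
    """Report (doc_id, pos1, pos2) where the two terms occur
    within k words of each other."""
    matches = []
    for doc_id in post1.keys() & post2.keys():   # common documents
        for pos1 in post1[doc_id]:
            for pos2 in post2[doc_id]:
                if abs(pos1 - pos2) <= k:
                    matches.append((doc_id, pos1, pos2))
    return matches

post_limit   = {7: [3, 40]}
post_statute = {7: [5, 90]}
print(positional_intersect(post_limit, post_statute, 3))  # -> [(7, 3, 5)]
```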

→ Biword and positional indexes can be combined to form Combination Schemes, where biwords are used for common two-term phrases such as "Michael Jackson" and "Britney Spears".
→ Williams et al. evaluated a more sophisticated mixed indexing scheme, where a typical web query mixture executed in 1/4 of the time of using just a positional index, but it required 26% more space than a positional index alone.
→ A positional index is 2–4 times as large as a non-positional index.
→ A positional index's size is 35–50% of the volume of the original text.

Difference between IR (Unstructured Data) and DB (Structured Data)
→ IR deals with unstructured data (free text) with no constraints on the text.
→ DB deals with structured data, which is represented in tables and typically has constraints.
→ Semi-structured data sits in between, like slides (title, bullet points).

Challenges While Indexing
1. Parsing documents: detecting the file format, the language used, and the character set used.

2. Multi-language/multi-format documents: a single document may contain text in many languages and formats.

→ Usually commercial and open-source libraries can handle a lot of this.

Processing Documents
1. Tokenization: the process of converting a stream of text into a list of tokens, where a token is an instance of a sequence of characters (a word).
Challenges:

a. Finland's: should it be 2 tokens or a single token?

b. Hewlett-Packard: should it be 2 tokens or a single token?

c. Numbers/dates

d. Some languages, such as German, Chinese, and Japanese, write some words with no spaces between them.

e. A document may mix languages with different writing directions (left-to-right / right-to-left).

2. Removing stop words: using a stop list, you exclude from the dictionary entirely the most common words, as they have little semantic content.

There are other techniques that apply this step but don't remove all of the stop words, instead storing them in very little space (as in IIR ch. 5).

There are also techniques that don't apply this step at all, at the cost of longer query times (as in IIR ch. 7).

3. Normalization: the process of converting all variations of a word to a single form. This step is applied to both the collection's documents and the queries. (U.S.A. → USA, anti-discriminatory → antidiscriminatory, résumé → resume)

4. Case folding: the process of converting all letters into lowercase. This step is applied to both the collection's documents and the queries.

→ Equivalence classing is done via stemming and lemmatization, where different forms are reduced to a single representation.
→ Stemming chops affixes (automatic → automat); Porter's algorithm is commonly used.
→ Lemmatization returns a word to its base form (automatic → automate).
→ Asymmetric expansion instead adds related or variant forms at search time (query: window, search: window + windows).
→ To get synonyms and antonyms, a Thesaurus is used.
→ To handle misspellings phonetically, Soundex is used.
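A minimal preprocessing sketch tying these steps together: tokenize, case fold, remove stop words, and stem with Porter's algorithm (via NLTK, assumed installed; the stop list here is a tiny illustrative one):

```python
from nltk.stem import PorterStemmer

STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "is"}
stemmer = PorterStemmer()

def preprocess(text):
    tokens = text.lower().split()                        # tokenize + case fold
    tokens = [t for t in tokens if t not in STOP_WORDS]  # remove stop words
    return [stemmer.stem(t) for t in tokens]             # stem (Porter)

print(preprocess("The automatic analysis of documents"))
# -> ['automat', 'analysi', 'document']
```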
