Inverted Index

An inverted index is a data structure that maps terms to the documents containing them, enabling efficient retrieval in information retrieval systems. It is created by indexing words in documents and is advantageous for its storage efficiency and fast search capabilities, making it ideal for search engines. However, it has drawbacks such as high storage overhead and maintenance costs during updates.

Uploaded by

ramirtt

An inverted index is a data structure used in information retrieval systems to
efficiently retrieve documents or web pages containing a specific term or set of
terms. The index is organized by terms (words), and each term points to a list
of the documents or web pages that contain it.

What is an Inverted Index?

An inverted index stores a mapping from words to the documents that contain
them, so that documents or records containing specific keywords can be located
quickly. It is created by indexing the words in the documents and storing the
resulting word-to-document mapping. At query time, that mapping is looked up
instead of the documents themselves being scanned.

How Does an Inverted Index Work?

An inverted index works in two phases. At indexing time, the words in the
documents are indexed and the mapping between words and documents is stored in
a data structure. At search time, that data structure is used to quickly locate
the documents that contain the keywords being searched for.

Advantages of an Inverted Index

An inverted index has several advantages over other data structures. First,
lookups are fast: the documents containing a specific keyword can be located
without scanning the whole collection. Second, although the index takes space
in addition to the documents themselves, it can be stored compactly, especially
when it is compressed. This makes it well suited to search engines and
databases.

How to Implement an Inverted Index

Implementing an inverted index is relatively straightforward. First, the words
in the documents are indexed by a text indexer, a program that splits each
document into words and records which document each word came from. The
resulting mapping between words and documents is then stored in a data
structure, which is used to quickly locate the documents that contain the
keywords being searched for.

How to Optimize an Inverted Index

An inverted index can be optimized in several ways. First, the indexer can be
optimized to index the words more efficiently. Additionally, the data structure
used to store the mapping between the words and the documents can be optimized
to reduce the amount of space needed to store the data. Finally, the search
algorithm used to locate the documents can be optimized to reduce the amount of
time needed to locate the documents.

For example, consider the following documents:

Document 1: The quick brown fox jumped over the lazy dog.
Document 2: The lazy dog slept in the sun.

To create an inverted index for these documents, we first tokenize the
documents into terms, as follows:

Document 1: The, quick, brown, fox, jumped, over, the, lazy, dog
Document 2: The, lazy, dog, slept, in, the, sun

Next, we create an index of the terms, where each term points to a list of
documents that contain that term. Terms are compared case-insensitively, so
“The” and “the” share one entry:

the -> Document 1, Document 2
quick -> Document 1
brown -> Document 1
fox -> Document 1
jumped -> Document 1
over -> Document 1
lazy -> Document 1, Document 2
dog -> Document 1, Document 2
slept -> Document 2
in -> Document 2
sun -> Document 2

To search for documents containing a particular term or set of terms, the search
engine queries the inverted index for those terms and retrieves the list of
documents associated with each term. The search engine can then use this
information to rank the documents based on relevance to the query and present
them to the user in order of importance.
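
The lookup step described above can be sketched in a few lines of Python. This is a minimal illustration, assuming the toy index from the example, lowercased terms, and a query that must match every term (a Boolean AND); the function name `search_all` is made up for this sketch, and ranking is left out entirely.

```python
# Toy index copied from the example above (terms lowercased, documents as integer IDs).
index = {
    "the": [1, 2], "quick": [1], "brown": [1], "fox": [1], "jumped": [1],
    "over": [1], "lazy": [1, 2], "dog": [1, 2], "slept": [2], "in": [2], "sun": [2],
}

def search_all(index, query):
    """Return the sorted IDs of documents that contain every term in the query."""
    postings = [set(index.get(term, [])) for term in query.lower().split()]
    if not postings:
        return []
    # Intersect the document lists of all query terms.
    return sorted(set.intersection(*postings))

print(search_all(index, "lazy dog"))   # [1, 2]
print(search_all(index, "quick dog"))  # [1]
print(search_all(index, "dog cat"))    # []
```

Only the document lists of the query terms are touched; the documents themselves are never scanned.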

Inverted indexes are widely used in search engines, database systems, and
other applications where efficient text search is required. They are especially
useful for large collections of documents, where searching through all the
documents would be prohibitively slow.

An inverted index is an index data structure storing a mapping from content,
such as words or numbers, to its locations in a document or a set of documents.
In simple words, it is a hashmap-like data structure that directs you from a word
to a document or a web page.

There are two types of inverted indexes: A record-level inverted index contains
a list of references to documents for each word. A word-level inverted index
additionally contains the positions of each word within a document. The latter
form offers more functionality, but needs more processing power and space to
be created.

Suppose we want to search the texts “hello everyone”, “this article is based on
inverted index”, and “which is hashmap like data structure”. If we index by (text
number, word position within the text), the index with locations is:
hello (1, 1)
everyone (1, 2)
this (2, 1)
article (2, 2)
is (2, 3); (3, 2)
based (2, 4)
on (2, 5)
inverted (2, 6)
index (2, 7)
which (3, 1)
hashmap (3, 3)
like (3, 4)
data (3, 5)
structure (3, 6)

The word “hello” is in text 1 (“hello everyone”) at word position 1, so it has
the entry (1, 1). The word “is” appears in texts 2 and 3, at the 3rd and 2nd
positions respectively (positions are counted in words, not characters).
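
A word-level index like the one listed above can be built with a short loop. The following is a sketch only, assuming whitespace tokenization and 1-based numbering for both texts and word positions; the function name `build_positional_index` is chosen for illustration.

```python
from collections import defaultdict

texts = [
    "hello everyone",
    "this article is based on inverted index",
    "which is hashmap like data structure",
]

def build_positional_index(texts):
    """Map each word to a list of (text number, word position) pairs, both 1-based."""
    index = defaultdict(list)
    for text_no, text in enumerate(texts, start=1):
        for pos, word in enumerate(text.split(), start=1):
            index[word].append((text_no, pos))
    return dict(index)

positional_index = build_positional_index(texts)
print(positional_index["hello"])  # [(1, 1)]
print(positional_index["is"])     # [(2, 3), (3, 2)]
```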

The index may have weights, frequencies, or other indicators.

Steps to build an inverted index:

● Fetch the Document

● Remove Stop Words

Stop words are very frequent words that carry little meaning for search,
such as “I”, “the”, “we”, “is”, “an”.

● Stem to the Root Word

When I search for “cat”, I want to see documents that have information
about it, but the word in a document may appear as “cats” or “catty”
instead of “cat”. To relate these forms, each word is reduced to its
“root word” by chopping off part of it. There are standard tools for
this, such as the Porter stemmer.

● Record Document IDs

If the word is already present in the index, add a reference to the
document to its entry; otherwise create a new entry. Additional
information such as the frequency and location of the word can be
stored as well.

Example:
Words Document
ant doc1
demo doc2
world doc1, doc2
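
The steps above can be sketched end to end as follows. This is an illustration under stated assumptions, not a reference implementation: the stop-word set is just the list of examples from the text, the sample documents are hypothetical, and `naive_stem` is a crude suffix stripper standing in for a real stemmer such as Porter's.

```python
from collections import defaultdict

# Assumed stop-word list, taken from the examples in the text.
STOP_WORDS = {"i", "the", "we", "is", "an"}

def naive_stem(word):
    """Crude suffix stripping for illustration only; NOT the Porter stemmer."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def build_index(docs):
    """docs maps a document ID to its text; returns term -> {document ID: frequency}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            if word in STOP_WORDS:
                continue  # stop words are not indexed
            term = naive_stem(word)
            # Create the entry if the term is new, otherwise update its count.
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return dict(index)

# Hypothetical sample documents.
docs = {"doc1": "the cats chase an ant", "doc2": "we see the cat and cats"}
index = build_index(docs)
print(index["cat"])  # {'doc1': 1, 'doc2': 2}
print(index["ant"])  # {'doc1': 1}
```

Because “cats” and “cat” are stemmed to the same term, a search for “cat” finds both documents.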

import string

# Define the documents
document1 = "The quick brown fox jumped over the lazy dog."
document2 = "The lazy dog slept in the sun."

# Step 1: Tokenize the documents
# Lowercase each document, strip punctuation, and split it into words
strip_punctuation = str.maketrans("", "", string.punctuation)
tokens1 = document1.lower().translate(strip_punctuation).split()
tokens2 = document2.lower().translate(strip_punctuation).split()
# Combine the tokens into a sorted list of unique terms
terms = sorted(set(tokens1 + tokens2))

# Step 2: Build the inverted index
# Create an empty dictionary to store the inverted index
inverted_index = {}
# For each term, find the documents that contain it
for term in terms:
    documents = []
    if term in tokens1:
        documents.append("Document 1")
    if term in tokens2:
        documents.append("Document 2")
    inverted_index[term] = documents

# Step 3: Print the inverted index
for term, documents in inverted_index.items():
    print(term, "->", ", ".join(documents))

Explanation of the above code:

The first two assignments define two sample documents used as input to the
algorithm.

Step 1: tokenize the input documents by converting them to lowercase, removing
punctuation, and splitting them into individual words. Without the punctuation
step, “dog.” and “dog” would be indexed as two different terms. Then combine the
resulting tokens from both documents into a single sorted list of unique terms.

Step 2: create an empty dictionary to store the inverted index, and then iterate
through each term in the list of unique terms. For each term, create an empty
list of documents, and then check whether the term appears in each input
document.

If the term appears in a document, add the document to the list for that term.
Finally, add an entry to the inverted index dictionary for the current term, with
the list of documents that contain that term as its value.

Step 3: iterate through the entries in the inverted index dictionary and print out
each term along with the list of documents that contain it.

Output
brown -> Document 1
dog -> Document 1, Document 2
fox -> Document 1
in -> Document 2
jumped -> Document 1
lazy -> Document 1, Document 2
over -> Document 1
quick -> Document 1
slept -> Document 2
sun -> Document 2
the -> Document 1, Document 2

Advantages of an inverted index:

● It allows fast full-text searches, at the cost of increased processing
when a document is added to the database.

● It is easy to develop.

● It is the most popular data structure used in document retrieval
systems, used on a large scale, for example, in search engines.

Disadvantages of an inverted index:

● Large storage overhead and high maintenance costs on update, delete,
and insert.

● Records are retrieved in the order in which they occur in the inverted
lists rather than in decreasing order of expected usefulness, so ranking
requires an additional step.

Features of inverted indexes include:

Efficient search: Inverted indexes allow for efficient searching of large volumes
of text-based data. By indexing every term in every document, the index can
quickly identify all documents that contain a given search term or phrase,
significantly reducing search time.

Fast updates: Inverted indexes can be updated incrementally as new content is
added to the system, which allows near-real-time indexing and searching of new
content, although every update adds some processing and maintenance cost.

Flexibility: Inverted indexes can be customized to suit the needs of different
types of information retrieval systems. For example, they can be configured to
handle different types of queries, such as Boolean queries or proximity queries.

Compression: Inverted indexes can be compressed to reduce storage
requirements. Various techniques such as delta encoding, gamma encoding, and
variable byte encoding can be used to compress the posting lists efficiently.
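
As a rough sketch of how two of these techniques combine, the code below delta-encodes a sorted posting list into gaps and then packs each gap with a variable byte scheme. The byte layout here (7 data bits per byte, low-order group first, high bit set on every byte except the last of a number) is one common convention chosen for this example; real systems differ in the details, and the document IDs are made up.

```python
def delta_encode(doc_ids):
    """Replace sorted document IDs with the gaps between consecutive IDs."""
    gaps, previous = [], 0
    for doc_id in doc_ids:
        gaps.append(doc_id - previous)
        previous = doc_id
    return gaps

def delta_decode(gaps):
    """Rebuild the document IDs by summing the gaps."""
    doc_ids, total = [], 0
    for gap in gaps:
        total += gap
        doc_ids.append(total)
    return doc_ids

def vbyte_encode(numbers):
    """Encode non-negative integers using 7 data bits per byte."""
    out = bytearray()
    for n in numbers:
        while n >= 128:
            out.append((n & 0x7F) | 0x80)  # high bit set: more bytes follow
            n >>= 7
        out.append(n)                      # high bit clear: last byte of this number
    return bytes(out)

def vbyte_decode(data):
    """Invert vbyte_encode."""
    numbers, n, shift = [], 0, 0
    for byte in data:
        n |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            numbers.append(n)
            n, shift = 0, 0
    return numbers

postings = [824, 829, 215406]     # made-up sorted document IDs
gaps = delta_encode(postings)
encoded = vbyte_encode(gaps)
print(gaps, len(encoded))         # [824, 5, 214577] 6
assert delta_decode(vbyte_decode(encoded)) == postings
```

Small gaps need fewer bytes than the original IDs, which is why the list is delta-encoded first.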

Support for stemming and synonym expansion: Inverted indexes can be
configured to support stemming and synonym expansion, which can improve the
accuracy and relevance of search results. Stemming is the process of reducing
words to their base or root form, while synonym expansion involves mapping
different words that have similar meanings to a common term.

Support for multiple languages: Inverted indexes can support multiple
languages, allowing users to search for content in different languages using the
same system.

Improving inverted index

While a basic inverted index can answer queries that have an exact match in
the database, it may not work in all scenarios. For example:

● Users may search for a term that is not present exactly in the inverted
index but is still related to an indexed term, for example snow or
snowing in place of snowfall. We can address this through stemming,
a technique that extracts the root form of a word by removing affixes.
For example, the root form of the words eating, eats, and eaten is eat.
● Users may search for a synonym. To handle this, the synonyms of the
searched term are also looked up in the inverted index.
● Users generally search for phrases rather than single words. To support
phrase searching, word-level inverted indexes also record the position
of each word in the document.
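
To make the phrase-search point concrete, here is a minimal sketch of how recorded positions can be used: a document matches only if the query words occur at consecutive positions. The function names and the two sample documents (reused from the earlier fox example, without punctuation) are assumptions of this sketch.

```python
from collections import defaultdict

def build_positional_index(docs):
    """docs maps document ID -> text; returns term -> {document ID: [positions]}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, word in enumerate(text.lower().split()):
            index[word][doc_id].append(pos)
    return index

def phrase_search(index, phrase):
    """Return sorted IDs of documents in which the phrase's words occur consecutively."""
    words = phrase.lower().split()
    if not words or any(w not in index for w in words):
        return []
    matches = []
    for doc_id, starts in index[words[0]].items():
        for start in starts:
            # Each following word must sit exactly one position further on.
            if all(start + i in index[w].get(doc_id, []) for i, w in enumerate(words)):
                matches.append(doc_id)
                break
    return sorted(matches)

docs = {
    1: "the quick brown fox jumped over the lazy dog",
    2: "the lazy dog slept in the sun",
}
index = build_positional_index(docs)
print(phrase_search(index, "lazy dog"))   # [1, 2]
print(phrase_search(index, "dog slept"))  # [2]
print(phrase_search(index, "dog lazy"))   # []
```

A record-level index could only say that both documents contain “dog” and “lazy”; the positions are what rule out “dog lazy”.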

Understanding the Inverted Index in Elasticsearch

An inverted index consists of all of the unique terms that appear in any
document covered by the index. For each term, the list of documents in which
the term appears is stored. So essentially an inverted index is a mapping
between terms and the documents that contain those terms. An inverted index
works at the document field level and stores the terms for a single field, so
it doesn’t need to deal with different fields. What you will see in the
following example is therefore scoped to one specific field.

Alright, so let’s see an example. Suppose that we have two recipes with the
following titles: “The Best Pasta Recipe with Pesto” and “Delicious Pasta
Carbonara Recipe.” The inverted index for the title field would look like
this:

best -> Document 1
carbonara -> Document 2
delicious -> Document 2
pasta -> Document 1, Document 2
pesto -> Document 1
recipe -> Document 1, Document 2
the -> Document 1
with -> Document 1

So the terms from both of the titles have been added to the index. For each
term, we can see which document contains the term, which enables
Elasticsearch to efficiently match documents containing specific terms. Part
of what makes this possible is that the terms are sorted. Also notice that
the terms within the index are the results of the analysis process: most
symbols have been removed at this point, and characters have been lowercased.
This of course depends on the analyzer that was used, but that will often be
the standard analyzer.

Performing a search involves a lot of things such as relevance, but let’s forget
about that for now. The first step of a search query is to find the documents
that match the query in the first place. So if we were to search for “pasta
recipe,” we would see that both documents contain both terms.

If we searched for “delicious recipe,” only the second document would contain
both terms; the first document would match “recipe” alone.
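
The matching step for the two recipe titles can be simulated in plain Python. This sketch does not call Elasticsearch at all: `analyze` is only a rough stand-in for text analysis (lowercasing and keeping runs of letters and digits), and `match` is a made-up helper that reports which query terms each document contains.

```python
import re

titles = {
    1: "The Best Pasta Recipe with Pesto",
    2: "Delicious Pasta Carbonara Recipe",
}

def analyze(text):
    """Lowercase and keep runs of letters/digits (a simplification of real analysis)."""
    return re.findall(r"[a-z0-9]+", text.lower())

# Build the inverted index for the title field: term -> set of document IDs.
index = {}
for doc_id, title in titles.items():
    for term in analyze(title):
        index.setdefault(term, set()).add(doc_id)

def match(query):
    """For each matching document, list the query terms it contains."""
    hits = {}
    for term in analyze(query):
        for doc_id in sorted(index.get(term, ())):
            hits.setdefault(doc_id, []).append(term)
    return hits

for term in sorted(index):
    print(term, "->", sorted(index[term]))
print(match("pasta recipe"))      # {1: ['pasta', 'recipe'], 2: ['pasta', 'recipe']}
print(match("delicious recipe"))  # {2: ['delicious', 'recipe'], 1: ['recipe']}
```

How the two partial and full matches are then scored is the relevance step that is being set aside here.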

Like I mentioned before, this is of course an oversimplification of how
searching works, but I just wanted to show you the general idea of how the
inverted index is used when performing search queries. It’s great to know
how it works, but this is all transparent to you as a user of Elasticsearch, and
you won’t have to actively deal with the inverted index; it’s just something
that Elasticsearch uses internally. That being said, it is very beneficial to
know the basics of how it works for a number of reasons.

The inverted index also holds information that is used internally, such as for
computing relevance. Some examples of this could be the number of
documents containing each term, the number of times a term appears in a
given document, the average length of a field, etc.
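
The statistics named above are simple to compute from the indexed text. The snippet below is an illustrative sketch in plain Python, not how Elasticsearch stores them; it reuses the two recipe titles (already lowercased) as hypothetical field values and measures field length in terms.

```python
from collections import Counter

# Hypothetical field values: the two recipe titles, already lowercased.
docs = {
    1: "the best pasta recipe with pesto",
    2: "delicious pasta carbonara recipe",
}

# Term frequency: how often each term appears in a given document.
term_freq = {doc_id: Counter(text.split()) for doc_id, text in docs.items()}

# Document frequency: the number of documents containing each term.
doc_freq = Counter()
for counts in term_freq.values():
    doc_freq.update(counts.keys())

# Average field length, measured in terms.
avg_field_length = sum(sum(c.values()) for c in term_freq.values()) / len(docs)

print(doc_freq["pasta"])      # 2
print(doc_freq["pesto"])      # 1
print(term_freq[1]["pasta"])  # 1
print(avg_field_length)       # 5.0
```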
