Inverted Index

An inverted index is a data structure that maps terms to the documents containing them, enabling efficient retrieval in information retrieval systems. It is created by indexing the words in documents and offers fast search, which makes it well suited to search engines. Its drawbacks are storage overhead and maintenance costs during updates.

Uploaded by

ramirtt

An inverted index is a data structure used in information retrieval systems to efficiently retrieve documents or web pages containing a specific term or set of terms. In an inverted index, the index is organized by terms (words), and each term points to a list of documents or web pages that contain that term.

What is an Inverted Index?

An inverted index is a data structure that stores a mapping between words and the documents that contain them. It is used to quickly locate documents or records that contain specific keywords. The index is created by indexing the words in the documents and storing the word-to-document mapping, which is then consulted at search time to locate the documents that contain the searched keywords.

How Does an Inverted Index Work?

An inverted index works in two phases. At indexing time, the words in each document are extracted, and the mapping from each word to the documents containing it is stored in a data structure. At query time, that data structure is used to quickly locate the documents that contain the searched keywords, without scanning the documents themselves.


Advantages of an Inverted Index

An inverted index has several advantages over other data structures. First, it is efficient for retrieval: documents that contain specific keywords can be located very quickly. Second, although the index takes space in addition to the documents themselves, its document lists can be compressed, so a large collection can be indexed in a relatively small amount of space. This makes it ideal for use in search engines and databases.

How to Implement an Inverted Index

Implementing an inverted index is relatively straightforward. First, the words in the documents must be indexed. This can be done by using a text indexer, which is a program that indexes the words in the documents. Once the words have been indexed, the mapping between the words and the documents can be stored in a data structure. This data structure can then be used to quickly locate the documents that contain the keywords that are being searched for.

How to Optimize an Inverted Index

An inverted index can be optimized in several ways. First, the indexer can be optimized to index the words more efficiently. Additionally, the data structure used to store the mapping between the words and the documents can be optimized to reduce the amount of space needed to store the data. Finally, the search algorithm used to locate the documents can be optimized to reduce the amount of time needed to locate the documents.
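
As one hedged illustration of the search-side optimization: if each term's document list is kept sorted by document ID, two lists can be intersected in a single linear pass rather than by repeated lookups. The function name `intersect_sorted` and the sample lists below are assumptions made for this sketch, not part of the original text.

```python
def intersect_sorted(a, b):
    """Intersect two document-ID lists that are already sorted ascending."""
    i, j, result = 0, 0, []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            # Document appears in both lists
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return result


# Hypothetical document lists for two terms
lazy_docs = [1, 2, 5, 9]
dog_docs = [2, 3, 5, 8, 9]
print(intersect_sorted(lazy_docs, dog_docs))  # [2, 5, 9]
```

Each list is walked at most once, so the cost grows with the combined length of the two lists.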

For example, consider the following documents:


Document 1: The quick brown fox jumped over the lazy dog.
Document 2: The lazy dog slept in the sun.

To create an inverted index for these documents, we first tokenize the documents into terms, as follows:

Document 1: The, quick, brown, fox, jumped, over, the, lazy, dog
Document 2: The, lazy, dog, slept, in, the, sun

Next, we create an index of the terms, where each term points to a list of documents that contain that term. Treating “The” and “the” as the same term, the index is as follows:

the -> Document 1, Document 2
quick -> Document 1
brown -> Document 1
fox -> Document 1
jumped -> Document 1
over -> Document 1
lazy -> Document 1, Document 2
dog -> Document 1, Document 2
slept -> Document 2
in -> Document 2
sun -> Document 2

To search for documents containing a particular term or set of terms, the search
engine queries the inverted index for those terms and retrieves the list of
documents associated with each term. The search engine can then use this
information to rank the documents based on relevance to the query and present
them to the user in order of importance.
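
A minimal sketch of the lookup step, assuming the example index above is held as a Python dictionary of sets and that a query matches only documents containing all of its terms. The name `search_all` is hypothetical, and ranking is left out.

```python
# Index from the example above (document IDs 1 and 2), with lowercased terms
inverted_index = {
    "the": {1, 2}, "quick": {1}, "brown": {1}, "fox": {1},
    "jumped": {1}, "over": {1}, "lazy": {1, 2}, "dog": {1, 2},
    "slept": {2}, "in": {2}, "sun": {2},
}


def search_all(index, query):
    """Return the IDs of documents that contain every term in the query."""
    terms = query.lower().split()
    if not terms:
        return set()
    # Copy the first list so the index itself is never modified
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


print(sorted(search_all(inverted_index, "lazy dog")))  # [1, 2]
print(sorted(search_all(inverted_index, "lazy fox")))  # [1]
print(sorted(search_all(inverted_index, "lazy cat")))  # []
```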

Inverted indexes are widely used in search engines, database systems, and other applications where efficient text search is required. They are especially useful for large collections of documents, where searching through all the documents would be prohibitively slow.

An inverted index is an index data structure storing a mapping from content, such as words or numbers, to its locations in a document or a set of documents. In simple words, it is a hashmap-like data structure that directs you from a word to a document or a web page.

There are two types of inverted indexes: A record-level inverted index contains
a list of references to documents for each word. A word-level inverted index
additionally contains the positions of each word within a document. The latter
form offers more functionality, but needs more processing power and space to
be created.

Suppose we want to search the texts “hello everyone,” “this article is based on inverted index,” and “which is hashmap like data structure”. If we index by (text number, word position within the text), the word-level index is:
hello (1, 1)
everyone (1, 2)
this (2, 1)
article (2, 2)
is (2, 3); (3, 2)
based (2, 4)
on (2, 5)
inverted (2, 6)
index (2, 7)
which (3, 1)
hashmap (3, 3)
like (3, 4)
data (3, 5)
structure (3, 6)

The word “hello” is in document 1 (“hello everyone”) starting at word 1, so it has the entry (1, 1). The word “is” is in documents 2 and 3 at the 3rd and 2nd positions respectively (here position is counted in words).
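
The word-level index above can be reproduced with a short sketch. It assumes whitespace tokenization and 1-based numbering for both texts and word positions; the variable names are illustrative only.

```python
from collections import defaultdict

texts = [
    "hello everyone",
    "this article is based on inverted index",
    "which is hashmap like data structure",
]

# Word-level index: word -> list of (text number, word position)
positional_index = defaultdict(list)
for doc_id, text in enumerate(texts, start=1):
    for position, word in enumerate(text.split(), start=1):
        positional_index[word].append((doc_id, position))

# Record-level index: drop the positions and keep only the text numbers
record_index = {
    word: sorted({doc_id for doc_id, _ in entries})
    for word, entries in positional_index.items()
}

print(positional_index["hello"])  # [(1, 1)]
print(positional_index["is"])     # [(2, 3), (3, 2)]
print(record_index["is"])         # [2, 3]
```

The record-level form is derived here from the word-level one, which shows that the word-level index carries strictly more information.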

The index may have weights, frequencies, or other indicators.

Steps to build an inverted index:

● Fetch the Document

● Remove Stop Words

Stop words are very frequent words that carry little meaning for search, such as “I”, “the”, “we”, “is”, “an”.

● Stem to the Root Word

Whenever I search for “cat”, I want to see documents that have information about it. But the word present in a document may be “cats” or “catty” instead of “cat”. To relate these words, I chop off part of each word I read so that I am left with the “root word”. There are standard tools for performing this, such as “Porter’s Stemmer”.

● Record Document IDs

If the word is already present in the index, add a reference to the document to its entry; otherwise create a new entry. Additional information, such as the frequency and location of the word, can be stored as well.

Example:

Words    Document
ant      doc1
demo     doc2
world    doc1, doc2
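
The build steps can be sketched as follows. This is only an illustration under stated assumptions: the stop-word list is the one quoted above, `toy_stem` is a crude suffix stripper standing in for a real stemmer (it is not Porter's algorithm), and the two sample documents are invented so that the result matches the small table above.

```python
from collections import defaultdict

# Assumed stop-word list, taken from the examples in the text
STOP_WORDS = {"i", "the", "we", "is", "an"}


def toy_stem(word):
    """Crude suffix stripping; a stand-in for a real stemmer, not Porter's."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def build_index(docs):
    """Map each stemmed term to {document ID: frequency in that document}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            if word in STOP_WORDS:
                continue  # step 2: remove stop words
            term = toy_stem(word)  # step 3: reduce to a root form
            # step 4: record the document ID, counting occurrences
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return index


# Hypothetical documents chosen to reproduce the example table
docs = {
    "doc1": "the world is an ant world",
    "doc2": "we demo the world",
}
index = build_index(docs)
for term in sorted(index):
    print(term, "->", index[term])
```

Storing a frequency per document is one way to keep the "additional information" mentioned in the last step; positions could be stored in the same place.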

# Define the documents
document1 = "The quick brown fox jumped over the lazy dog."
document2 = "The lazy dog slept in the sun."

# Step 1: Tokenize the documents
# Convert each document to lowercase and split it into words.
# Note: split() only splits on whitespace, so punctuation stays attached
# ("dog." and "dog" become different terms).
tokens1 = document1.lower().split()
tokens2 = document2.lower().split()
# Combine the tokens into a list of unique terms
terms = list(set(tokens1 + tokens2))

# Step 2: Build the inverted index
# Create an empty dictionary to store the inverted index
inverted_index = {}

# For each term, find the documents that contain it
for term in terms:
    documents = []
    if term in tokens1:
        documents.append("Document 1")
    if term in tokens2:
        documents.append("Document 2")
    inverted_index[term] = documents

# Step 3: Print the inverted index
for term, documents in inverted_index.items():
    print(term, "->", ", ".join(documents))

Explanation of the above code:

The first two lines define two sample documents to be used as input to the algorithm.

Step 1: Tokenize the input documents by converting them to lowercase and splitting them into individual words. Then combine the resulting tokens from both documents into a single list of unique terms.

Step 2: Create an empty dictionary to store the inverted index, and then iterate through each term in the list of unique terms. For each term, create an empty list of documents, and then check whether the term appears in each input document. If the term appears in a document, add the document to the list for that term. Finally, add an entry to the inverted index dictionary for the current term, with the list of documents that contain that term as its value.

Step 3: Iterate through the entries in the inverted index dictionary and print each term along with the list of documents that contain it.
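Output (the order of the lines may differ between runs, because the terms come from a set):

jumped -> Document 1
fox -> Document 1
lazy -> Document 1, Document 2
the -> Document 1, Document 2
in -> Document 2
dog. -> Document 1
quick -> Document 1
dog -> Document 2
slept -> Document 2
sun. -> Document 2
brown -> Document 1
over -> Document 1

Because the words are split on whitespace only, the punctuation stays attached: “dog.” (Document 1) and “dog” (Document 2) are indexed as different terms, and “sun.” keeps its period. This differs from the hand-built index shown earlier, where “dog” points to both documents. Stripping punctuation during tokenization would merge these entries.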
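
A hedged variant of the same program, assuming it is acceptable to treat every run of letters, digits, and underscores as a term. The `tokenize` helper is introduced here for illustration; it uses the standard `re` module to drop punctuation, so that “dog” is indexed once for both documents.

```python
import re
from collections import defaultdict

documents = {
    "Document 1": "The quick brown fox jumped over the lazy dog.",
    "Document 2": "The lazy dog slept in the sun.",
}


def tokenize(text):
    """Lowercase the text and keep only runs of word characters."""
    return re.findall(r"\w+", text.lower())


inverted_index = defaultdict(list)
for name, text in documents.items():
    # set() ensures each document is listed at most once per term
    for term in set(tokenize(text)):
        inverted_index[term].append(name)

# Sort the terms so the printed order is the same on every run
for term in sorted(inverted_index):
    print(term, "->", ", ".join(inverted_index[term]))
```

With this tokenizer the index has 11 terms, matching the hand-built index from the first example.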

Advantages of an inverted index:

● It allows fast full-text searches, at the cost of increased processing when a document is added to the database.

● It is easy to develop.

● It is the most popular data structure used in document retrieval systems, and is used on a large scale, for example in search engines.

An inverted index also has disadvantages:

● Large storage overhead and high maintenance costs on update, delete, and insert.

● Instead of retrieving the data in decreasing order of expected usefulness, the records are retrieved in the order in which they occur in the inverted lists.

Features of inverted indexes include:

Efficient search: Inverted indexes allow for efficient searching of large volumes of text-based data. By indexing every term in every document, the index can quickly identify all documents that contain a given search term or phrase, significantly reducing search time.

Fast updates: Inverted indexes can be updated quickly and efficiently as new content is added to the system. This allows for near-real-time indexing and searching of new content.

Flexibility: Inverted indexes can be customized to suit the needs of different types of information retrieval systems. For example, they can be configured to handle different types of queries, such as Boolean queries or proximity queries.

Compression: Inverted indexes can be compressed to reduce storage requirements. Various techniques such as delta encoding, gamma encoding, and variable byte encoding can be used to compress the posting list efficiently.
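
As a sketch of two of these techniques combined: delta encoding stores the gaps between consecutive sorted document IDs, and variable byte encoding then stores each gap in as few bytes as it needs. The byte layout below (7 data bits per byte, with the high bit set on the last byte of each number) is one common convention and is an assumption of this example, as are the function names and the sample posting list.

```python
def delta_encode(doc_ids):
    """Replace sorted document IDs with the gaps between consecutive IDs."""
    gaps, previous = [], 0
    for doc_id in doc_ids:
        gaps.append(doc_id - previous)
        previous = doc_id
    return gaps


def delta_decode(gaps):
    """Rebuild the document IDs by summing the gaps."""
    doc_ids, total = [], 0
    for gap in gaps:
        total += gap
        doc_ids.append(total)
    return doc_ids


def vbyte_encode(numbers):
    """Encode non-negative integers using 7 data bits per byte."""
    out = bytearray()
    for n in numbers:
        chunk = [n & 0x7F]
        n >>= 7
        while n:
            chunk.append(n & 0x7F)
            n >>= 7
        chunk.reverse()    # most significant group first
        chunk[-1] |= 0x80  # high bit marks the final byte of this number
        out.extend(chunk)
    return bytes(out)


def vbyte_decode(data):
    """Inverse of vbyte_encode."""
    numbers, n = [], 0
    for byte in data:
        if byte & 0x80:
            numbers.append((n << 7) | (byte & 0x7F))
            n = 0
        else:
            n = (n << 7) | byte
    return numbers


postings = [824, 829, 215406]  # hypothetical sorted document IDs
gaps = delta_encode(postings)  # [824, 5, 214577]
encoded = vbyte_encode(gaps)
print(len(encoded))            # 6
print(delta_decode(vbyte_decode(encoded)) == postings)  # True
```

Here three document IDs fit in 6 bytes, compared with 12 bytes if each were stored as a fixed 4-byte integer.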

Support for stemming and synonym expansion: Inverted indexes can be configured to support stemming and synonym expansion, which can improve the accuracy and relevance of search results. Stemming is the process of reducing words to their base or root form, while synonym expansion involves mapping different words that have similar meanings to a common term.

Support for multiple languages: Inverted indexes can support multiple languages, allowing users to search for content in different languages using the same system.

Improving inverted index

While a basic inverted index can answer queries that have an exact match in the database, it may not work in all scenarios. For example:

● Users may search for a term that is not present exactly in the inverted index but is still related to an indexed term, for example searching for snow or snowing in place of snowfall. We can address this issue through stemming, which is a technique that extracts the root form of words by removing affixes. For example, the root form of the words eating, eats, and eaten is eat.
● Users may search for a synonym. To solve this, the synonyms of the searched term are also looked up in the inverted index.
● Users generally search for phrases rather than single words. To support phrase searching, word-level inverted indexes also record the position of each word in the document.
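
Phrase searching over a word-level index can be sketched as below. The layout (term -> document ID -> list of positions), the function names, and the two sample documents are assumptions for illustration; a phrase matches when its terms occur at consecutive positions in the same document.

```python
from collections import defaultdict


def build_positional_index(docs):
    """Map each term to {document ID: [word positions]}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for position, term in enumerate(text.lower().split()):
            index[term][doc_id].append(position)
    return index


def phrase_search(index, phrase):
    """Return IDs of documents in which the phrase terms appear consecutively."""
    terms = phrase.lower().split()
    if not terms or any(term not in index for term in terms):
        return []
    matches = []
    for doc_id, starts in index[terms[0]].items():
        for start in starts:
            # The k-th following term must sit k positions after the first term
            if all(start + k in index[term].get(doc_id, ())
                   for k, term in enumerate(terms[1:], start=1)):
                matches.append(doc_id)
                break
    return sorted(matches)


# Hypothetical documents
docs = {
    1: "heavy snow fall expected tonight",
    2: "snow will fall heavy tonight",
}
index = build_positional_index(docs)
print(phrase_search(index, "snow fall"))      # [1]
print(phrase_search(index, "heavy tonight"))  # [2]
print(phrase_search(index, "snow storm"))     # []
```

Both documents contain "snow" and "fall", so a record-level index could not tell them apart for this phrase; the positions make the difference.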

Understanding the Inverted Index in Elasticsearch

An inverted index consists of all of the unique terms that appear in any document covered by the index. For each term, the list of documents in which the term appears is stored. So essentially an inverted index is a mapping between terms and the documents that contain those terms. Since an inverted index works at the document field level and stores the terms for a given field, it doesn’t need to deal with different fields. So what you will see in the following example is at the scope of a specific field.

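Alright, so let’s see an example. Suppose that we have two recipes with the following titles: “The Best Pasta Recipe with Pesto” and “Delicious Pasta Carbonara Recipe.” In the inverted index for the title field, “pasta” and “recipe” each point to both documents, “best” and “pesto” point only to the first, and “carbonara” and “delicious” point only to the second.
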
So the terms from both of the titles have been added to the index. For each term, we can see which document contains the term, which enables Elasticsearch to efficiently match documents containing specific terms. Part of what makes this possible is that the terms are sorted. Also notice that the terms within the index are the results of the analysis process that is applied to the text before it is indexed. So most symbols have been removed at this point, and characters have been lowercased. This of course depends on the analyzer that was used, but that will often be the standard analyzer.
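
The index for the two titles can be approximated outside Elasticsearch with a few lines of Python. This is a sketch only: the `analyze` function is a rough stand-in for an analyzer (it lowercases and splits on non-word characters, and it keeps stop words such as “the” and “with”), and it does not call or reproduce any Elasticsearch API.

```python
import re

titles = {
    1: "The Best Pasta Recipe with Pesto",
    2: "Delicious Pasta Carbonara Recipe",
}


def analyze(text):
    """Rough stand-in for an analyzer: lowercase, split on non-word characters."""
    return re.findall(r"\w+", text.lower())


index = {}
for doc_id, title in titles.items():
    for term in analyze(title):
        index.setdefault(term, set()).add(doc_id)

# Print the terms in sorted order with the documents that contain them
for term in sorted(index):
    print(f"{term:<10} {sorted(index[term])}")
```

Under these assumptions the sorted terms are best, carbonara, delicious, pasta, pesto, recipe, the, and with.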

Performing a search involves a lot of things such as relevance, but let’s forget about that for now. The first step of a search query is to find the documents that match the query in the first place. So if we were to search for “pasta recipe,” we would see that both documents contain both terms.

If we searched for “delicious recipe,” both documents would match on “recipe,” but only the second document would also contain “delicious.”
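
The matching step for these two queries can be sketched as follows, assuming the title index is available as a plain dictionary of lowercased terms (stop words kept). The function `matched_terms` is hypothetical and ignores relevance scoring entirely.

```python
# Term -> document IDs for the two recipe titles (assumed analysis result)
index = {
    "best": {1}, "carbonara": {2}, "delicious": {2}, "pasta": {1, 2},
    "pesto": {1}, "recipe": {1, 2}, "the": {1}, "with": {1},
}


def matched_terms(index, query):
    """For each matching document, list the query terms it contains."""
    hits = {}
    for term in query.lower().split():
        for doc_id in index.get(term, ()):
            hits.setdefault(doc_id, []).append(term)
    return hits


pasta_hits = matched_terms(index, "pasta recipe")
delicious_hits = matched_terms(index, "delicious recipe")
print(pasta_hits[1], pasta_hits[2])          # ['pasta', 'recipe'] ['pasta', 'recipe']
print(delicious_hits[1], delicious_hits[2])  # ['recipe'] ['delicious', 'recipe']
```

A document that matches more of the query terms would usually be treated as a better match, which is where relevance scoring takes over.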


Like I mentioned before, this is of course an oversimplification of how searching works, but I just wanted to show you the general idea of how the inverted index is used when performing search queries. It’s great to know how it works, but this is all transparent to you as a user of Elasticsearch, and you won’t have to actively deal with the inverted index; it’s just something that Elasticsearch uses internally. That being said, it is very beneficial to know the basics of how it works for a number of reasons.


The inverted index also holds information that is used internally, such as for computing relevance. Some examples of this could be the number of documents containing each term, the number of times a term appears in a given document, the average length of a field, etc.
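
The three statistics mentioned above can be computed with a short sketch. It assumes the field values have already been analyzed into lowercase, space-separated terms, and it measures field length in terms; the variable names are illustrative and do not correspond to Elasticsearch internals.

```python
from collections import Counter

# Assumed analyzed values of one field, keyed by document ID
field_values = {
    1: "the best pasta recipe with pesto",
    2: "delicious pasta carbonara recipe",
}

# Number of times each term appears in a given document
term_freq = {doc_id: Counter(text.split()) for doc_id, text in field_values.items()}

# Number of documents containing each term
doc_freq = Counter(term for counts in term_freq.values() for term in counts)

# Average length of the field, measured in terms
avg_field_length = sum(sum(c.values()) for c in term_freq.values()) / len(term_freq)

print(doc_freq["pasta"], doc_freq["pesto"])  # 2 1
print(term_freq[1]["pasta"])                 # 1
print(avg_field_length)                      # 5.0
```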
