An inverted index is a data structure used in information retrieval systems to
efficiently retrieve documents or web pages containing a specific term or set of
terms. In an inverted index, the index is organized by terms (words), and each
term points to a list of documents or web pages that contain that term.
What is an Inverted Index?
An inverted index is a data structure that stores a mapping between words
and the documents that contain them. It is used to quickly locate documents
or records that contain specific keywords. The inverted index is created by
indexing the words in the documents and then storing the mapping between
the words and the documents in a data structure. This data structure is then
used to quickly locate the documents that contain the keywords that are being
searched for.
How Does an Inverted Index Work?
An inverted index is built by splitting each document into words and
recording, for every word, the documents in which it occurs. At search time,
the keywords of the query are looked up in this mapping and the document
lists stored for them are returned, so the documents themselves never have
to be scanned.
Advantages of an Inverted Index
An inverted index has several advantages over scanning the documents directly.
First, it is very fast at locating documents that contain specific keywords,
because the cost of a lookup depends on the query terms and the length of their
document lists rather than on the size of the whole collection. Additionally,
the document lists can be compressed, which helps keep the index compact. This
makes it well suited for use in search engines and databases.
How to Implement an Inverted Index
Implementing an inverted index is relatively straightforward. First, the words in
the documents must be extracted. This is done by an indexer, a program that
splits each document into words and normalizes them, for example by
lowercasing. Once the words have been extracted, the mapping from each word
to the documents that contain it is stored in a data structure such as a hash
map. This data structure can then be used to quickly locate the documents
that contain the keywords that are being searched for.
How to Optimize an Inverted Index
An inverted index can be optimized in several ways. First, the indexer can be
optimized to index the words more efficiently. Additionally, the data structure
used to store the mapping between the words and the documents can be
optimized to reduce the amount of space needed to store the data. Finally, the
search algorithm used to locate the documents can be optimized to reduce
the amount of time needed to locate the documents.
For example, consider the following documents:
Document 1: The quick brown fox jumped over the lazy dog.
Document 2: The lazy dog slept in the sun.
To create an inverted index for these documents, we first tokenize the
documents into terms, as follows:
Document 1: The, quick, brown, fox, jumped, over, the, lazy, dog.
Document 2: The, lazy, dog, slept, in, the, sun.
Next, we create an index of the terms, where each term points to a list of
documents that contain that term. Case is ignored, so “The” and “the” count as
the same term:
the -> Document 1, Document 2
quick -> Document 1
brown -> Document 1
fox -> Document 1
jumped -> Document 1
over -> Document 1
lazy -> Document 1, Document 2
dog -> Document 1, Document 2
slept -> Document 2
in -> Document 2
sun -> Document 2
To search for documents containing a particular term or set of terms, the search
engine queries the inverted index for those terms and retrieves the list of
documents associated with each term. The search engine can then use this
information to rank the documents based on relevance to the query and present
them to the user in order of importance.
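As a rough sketch of the lookup step, assuming the index above is held in a plain Python dictionary of sets, a query for several terms can be answered by intersecting their document lists. The function name search_all_terms and the integer document IDs are illustrative choices, and ranking by relevance is left out.

```python
# Inverted index for the two example documents (1 and 2 are document IDs).
inverted_index = {
    "the": {1, 2}, "quick": {1}, "brown": {1}, "fox": {1},
    "jumped": {1}, "over": {1}, "lazy": {1, 2}, "dog": {1, 2},
    "slept": {2}, "in": {2}, "sun": {2},
}


def search_all_terms(index, query):
    """Return the sorted IDs of documents containing every term in the query."""
    terms = query.lower().split()
    if not terms:
        return []
    # Look up each term's document set; an unknown term matches nothing.
    postings = [index.get(term, set()) for term in terms]
    # Documents matching all terms are the intersection of the sets.
    return sorted(set.intersection(*postings))


print(search_all_terms(inverted_index, "lazy dog"))  # [1, 2]
print(search_all_terms(inverted_index, "lazy fox"))  # [1]
print(search_all_terms(inverted_index, "lazy cat"))  # []
```

Only the document lists of the query terms are touched; the document text is never read during the search.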
Inverted indexes are widely used in search engines, database systems, and
other applications where efficient text search is required. They are especially
useful for large collections of documents, where searching through all the
documents would be prohibitively slow.
An inverted index is an index data structure storing a mapping from content,
such as words or numbers, to its locations in a document or a set of documents.
In simple words, it is a hashmap-like data structure that directs you from a word
to a document or a web page.
There are two types of inverted indexes: A record-level inverted index contains
a list of references to documents for each word. A word-level inverted index
additionally contains the positions of each word within a document. The latter
form offers more functionality, but needs more processing power and space to
be created.
Suppose we want to search the texts “hello everyone,” “this article is based on
inverted index,” and “which is hashmap like data structure”. If we index by (text,
word within the text), the index with location in text is:
hello (1, 1)
everyone (1, 2)
this (2, 1)
article (2, 2)
is (2, 3); (3, 2)
based (2, 4)
on (2, 5)
inverted (2, 6)
index (2, 7)
which (3, 1)
hashmap (3, 3)
like (3, 4)
data (3, 5)
structure (3, 6)
The word “hello” is in document 1 (“hello everyone”) at word position 1, so it has
the entry (1, 1). The word “is” appears in documents 2 and 3, at the 3rd and 2nd
positions respectively (positions are counted in words).
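A word-level index like this one can be sketched in a few lines of Python. This is only a minimal illustration: the function name build_positional_index is made up for the example, punctuation is left out of the sample texts, and both document numbers and positions start at 1 to match the listing above.

```python
from collections import defaultdict

texts = [
    "hello everyone",
    "this article is based on inverted index",
    "which is hashmap like data structure",
]


def build_positional_index(docs):
    """Map each word to a list of (document number, word position) pairs, both 1-based."""
    index = defaultdict(list)
    for doc_no, text in enumerate(docs, start=1):
        for position, word in enumerate(text.lower().split(), start=1):
            index[word].append((doc_no, position))
    return dict(index)


positional_index = build_positional_index(texts)
print(positional_index["hello"])  # [(1, 1)]
print(positional_index["is"])     # [(2, 3), (3, 2)]
```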
The index may have weights, frequencies, or other indicators.
Steps to build an inverted index:
● Fetch the Document
● Removing Stop Words
Stop words are the most frequently occurring words in a document
and carry little meaning on their own, such as “I”, “the”, “we”, “is”,
and “an”.
● Stemming of Root Word
Whenever I search for “cat”, I want to see documents that have
information about it. But the word present in a document may be
“cats” or “catty” instead of “cat”. To relate these words, I chop off
part of each word I read so that I am left with the “root word”.
There are standard tools for this, such as “Porter’s Stemmer”.
● Record Document IDs
If the word is already present in the index, add a reference to the
document to its entry; otherwise create a new entry. Add additional
information such as the frequency of the word, the location of the
word, etc.
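The steps above can be sketched as follows. This is a simplified illustration, not a production indexer: the stop-word list contains only the examples from the text, naive_stem is a toy suffix stripper standing in for a real stemmer such as Porter's, and the names build_index and naive_stem are hypothetical.

```python
from collections import defaultdict

# Illustrative stop-word list taken from the examples in the text; real lists are longer.
STOP_WORDS = {"i", "the", "we", "is", "an"}


def naive_stem(word):
    """Toy suffix stripper; a stand-in for a real stemmer such as Porter's."""
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def build_index(docs):
    """docs maps a document ID to its text; returns term -> {document ID: frequency}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            if token in STOP_WORDS:
                continue  # stop words are not indexed
            term = naive_stem(token)
            # Create the entry if the term is new, otherwise bump its count.
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return dict(index)


docs = {
    "doc1": "the cats sleep near more cats",
    "doc2": "we saw an ant and a cat sleeping",
}
index = build_index(docs)
print(index["cat"])    # {'doc1': 2, 'doc2': 1}
print(index["sleep"])  # {'doc1': 1, 'doc2': 1}
print("the" in index)  # False
```

Because both “cats” and “cat” are reduced to the same root, a search for “cat” finds both documents, and the stored counts give the frequency of the word in each one.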
Example:
Words    Document
ant      doc1
demo     doc2
world    doc1, doc2
import re

# Define the documents
document1 = "The quick brown fox jumped over the lazy dog."
document2 = "The lazy dog slept in the sun."
# Step 1: Tokenize the documents
# Convert each document to lowercase and extract its words, dropping punctuation
tokens1 = re.findall(r"\w+", document1.lower())
tokens2 = re.findall(r"\w+", document2.lower())
# Combine the tokens into a sorted list of unique terms
terms = sorted(set(tokens1 + tokens2))
# Step 2: Build the inverted index
# Create an empty dictionary to store the inverted index
inverted_index = {}
# For each term, find the documents that contain it
for term in terms:
    documents = []
    if term in tokens1:
        documents.append("Document 1")
    if term in tokens2:
        documents.append("Document 2")
    inverted_index[term] = documents
# Step 3: Print the inverted index
for term, documents in inverted_index.items():
    print(term, "->", ", ".join(documents))
Explanation of the above code:
The first lines import the re module and define two sample documents to be
used as input to the algorithm.
Step 1: tokenize the input documents by converting them to lowercase and
extracting the runs of word characters. This drops punctuation, so that “dog.”
and “dog” become the same term. Then combine the resulting tokens from
both documents into a single sorted list of unique terms.
Step 2: create an empty dictionary to store the inverted index, and then iterate
through each term in the list of unique terms. For each term, create an empty list
of documents, and then check if the term appears in each input document.
If the term appears in a document, add the document to the list for that term.
Finally, add an entry to the inverted index dictionary for the current term, with
the list of documents that contain that term as its value.
Step 3: iterate through the entries in the inverted index dictionary and print out
each term along with the list of documents that contain it.
Output
brown -> Document 1
dog -> Document 1, Document 2
fox -> Document 1
in -> Document 2
jumped -> Document 1
lazy -> Document 1, Document 2
over -> Document 1
quick -> Document 1
slept -> Document 2
sun -> Document 2
the -> Document 1, Document 2
Advantages of an inverted index:
● It allows fast full-text searches, at the cost of increased
processing when a document is added to the database.
● It is easy to develop.
● It is the most popular data structure used in document retrieval
systems, used on a large scale for example in search engines.
An inverted index also has disadvantages:
● Large storage overhead and high maintenance costs on update, delete
and insert.
● Instead of retrieving the data in a decreasing order of expected
usefulness, the records are retrieved in the order in which they occur in
the inverted lists.
Features of inverted indexes include:
Efficient search: Inverted indexes allow for efficient searching of large volumes
of text-based data. By indexing every term in every document, the index can
quickly identify all documents that contain a given search term or phrase,
significantly reducing search time.
Fast updates: Inverted indexes can be updated quickly and efficiently as new
content is added to the system. This allows for near-real-time indexing and
searching of new content.
Flexibility: Inverted indexes can be customized to suit the needs of different
types of information retrieval systems. For example, they can be configured to
handle different types of queries, such as Boolean queries or proximity queries.
Compression: Inverted indexes can be compressed to reduce storage
requirements. Various techniques such as delta encoding, gamma encoding,
variable byte encoding, etc. can be used to compress the posting list efficiently.
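To make the idea concrete, here is a small sketch that combines delta encoding with one common form of variable byte encoding (7 data bits per byte, with the high bit marking the last byte of each number). The function names and the sample posting list are illustrative, and real systems use more elaborate, block-based schemes.

```python
def delta_encode(doc_ids):
    """Replace sorted document IDs with the gaps between consecutive IDs."""
    gaps, previous = [], 0
    for doc_id in doc_ids:
        gaps.append(doc_id - previous)
        previous = doc_id
    return gaps


def delta_decode(gaps):
    """Rebuild the document IDs by summing the gaps."""
    doc_ids, total = [], 0
    for gap in gaps:
        total += gap
        doc_ids.append(total)
    return doc_ids


def vbyte_encode(numbers):
    """Encode non-negative integers 7 bits per byte; the high bit flags a number's last byte."""
    out = bytearray()
    for n in numbers:
        chunk = [n & 0x7F]
        n >>= 7
        while n:
            chunk.append(n & 0x7F)
            n >>= 7
        chunk.reverse()    # most significant 7-bit group first
        chunk[-1] |= 0x80  # flag the final byte of this number
        out.extend(chunk)
    return bytes(out)


def vbyte_decode(data):
    """Inverse of vbyte_encode."""
    numbers, n = [], 0
    for byte in data:
        if byte & 0x80:
            numbers.append((n << 7) | (byte & 0x7F))
            n = 0
        else:
            n = (n << 7) | byte
    return numbers


posting_list = [824, 829, 215406]
gaps = delta_encode(posting_list)
encoded = vbyte_encode(gaps)
print(gaps)          # [824, 5, 214577]
print(len(encoded))  # 6
print(delta_decode(vbyte_decode(encoded)) == posting_list)  # True
```

Stored as fixed 32-bit integers, the three document IDs would take 12 bytes. Small gaps fit in a single byte each, which is why sorting the posting list and storing gaps pays off.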
Support for stemming and synonym expansion: Inverted indexes can be
configured to support stemming and synonym expansion, which can improve the
accuracy and relevance of search results. Stemming is the process of reducing
words to their base or root form, while synonym expansion involves mapping
different words that have similar meanings to a common term.
Support for multiple languages: Inverted indexes can support multiple
languages, allowing users to search for content in different languages using the
same system.
Improving inverted index
While a basic inverted index can answer queries that have an exact match in
the database, it may not work in all scenarios. For example:
● Users may search for a term that is not present exactly in the inverted
index but is still related to a term that is. For example, searching for
snow or snowing in place of snowfall. We can address this issue through
stemming, which is a technique that extracts the root form of
words by removing affixes. For example, the root form of the words
eating, eats, and eaten is eat.
● Users may search for a synonym of an indexed term. To handle this, the
synonyms of the searched term are also looked up in the inverted index.
● Users generally search for phrases rather than single words. To support
phrase searching, word-level inverted indexes also record the position
of each word in the document.
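As an illustration of the last point, the sketch below answers a phrase query from a word-level index by checking that the words of the phrase sit at consecutive positions. It is a minimal example under simple assumptions (whitespace tokenization, 0-based positions, hypothetical function names), not how any specific search engine implements phrase matching.

```python
from collections import defaultdict


def build_positional_index(docs):
    """docs maps a document ID to its text; returns term -> {document ID: [word positions]}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for position, word in enumerate(text.lower().split()):
            index[word][doc_id].append(position)
    return index


def phrase_search(index, phrase):
    """Return sorted IDs of documents in which the words of the phrase occur consecutively."""
    words = phrase.lower().split()
    if not words or any(word not in index for word in words):
        return []
    matches = []
    for doc_id, starts in index[words[0]].items():
        for start in starts:
            # Word k of the phrase must sit exactly k positions after the first word.
            if all(start + k in index[word].get(doc_id, [])
                   for k, word in enumerate(words)):
                matches.append(doc_id)
                break
    return sorted(matches)


docs = {
    1: "the quick brown fox jumped over the lazy dog",
    2: "the lazy dog slept in the sun",
}
index = build_positional_index(docs)
print(phrase_search(index, "lazy dog"))   # [1, 2]
print(phrase_search(index, "dog slept"))  # [2]
print(phrase_search(index, "dog lazy"))   # []
```

A record-level index could only report that both documents contain “dog” and “lazy”; the stored positions are what allow the reversed phrase to be rejected.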
Understanding the Inverted Index in Elasticsearch
An inverted index consists of all of the unique terms that appear in any
document covered by the index. For each term, the list of documents in which
the term appears is stored. So essentially an inverted index is a mapping
between terms and the documents that contain those terms. Since an inverted
index works at the document field level and stores the terms for a given field,
it doesn’t need to deal with different fields. So what you will see in the
following example is at the scope of a specific field.
Alright, so let’s see an example. Suppose that we have two recipes with the
following titles: “The Best Pasta Recipe with Pesto” and “Delicious Pasta
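Carbonara Recipe.” The following table shows what the inverted index would
look like.
Term         Document #1   Document #2
best         X
carbonara                  X
delicious                  X
pasta        X             X
pesto        X
recipe       X             X
the          X
with         X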
So the terms from both of the titles have been added to the index. For each
term, we can see which document contains the term, which enables
Elasticsearch to efficiently match documents containing specific terms. Part
of what makes this possible is that the terms are sorted. Also notice that
the terms within the index are the results of the analysis process that you saw
in the previous post in case you read that one. So most symbols have been
removed at this point, and characters have been lowercased. This of course
depends on the analyzer that was used, but that will often be the standard
analyzer.
Performing a search involves a lot of things such as relevance, but let’s forget
about that for now. The first step of a search query is to find the documents
that match the query in the first place. So if we were to search for “pasta
recipe,” we would see that both documents contain both terms.
If we searched for “delicious recipe,” the results would differ per term:
“delicious” appears only in the second title, while “recipe” appears in both.
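The matching step for these two queries can be imitated in plain Python. Note that this is only a stand-in for illustration: it does not use Elasticsearch or its API, and the analyze function is a rough approximation (lowercasing and keeping word characters) of what an analyzer does.

```python
import re

titles = {
    1: "The Best Pasta Recipe with Pesto",
    2: "Delicious Pasta Carbonara Recipe",
}


def analyze(text):
    """Rough stand-in for an analyzer: lowercase the text and keep only word characters."""
    return re.findall(r"\w+", text.lower())


# Build term -> set of document IDs for the title field.
index = {}
for doc_id, title in titles.items():
    for term in analyze(title):
        index.setdefault(term, set()).add(doc_id)


def matches_per_term(query):
    """For each query term, list the documents whose title contains it."""
    return {term: sorted(index.get(term, set())) for term in analyze(query)}


print(matches_per_term("pasta recipe"))      # {'pasta': [1, 2], 'recipe': [1, 2]}
print(matches_per_term("delicious recipe"))  # {'delicious': [2], 'recipe': [1, 2]}
```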
Like I mentioned before, this is of course an oversimplification of how
searching works, but I just wanted to show you the general idea of how the
inverted index is used when performing search queries. It’s great to know
how it works, but this is all transparent to you as a user of Elasticsearch, and
you won’t have to actively deal with the inverted index; it’s just something
that Elasticsearch uses internally. That being said, it is very beneficial to
know the basics of how it works for a number of reasons.
The inverted index also holds information that is used internally, such as for
computing relevance. Some examples of this could be the number of
documents containing each term, the number of times a term appears in a
given document, the average length of a field, etc.
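To give a feel for these statistics, here is a small sketch that computes them for the two recipe titles. It is written in plain Python for illustration only; the variable and function names are made up, and it says nothing about how Elasticsearch stores or calculates these values internally.

```python
import re
from collections import Counter

titles = {
    1: "The Best Pasta Recipe with Pesto",
    2: "Delicious Pasta Carbonara Recipe",
}


def analyze(text):
    """Rough stand-in for an analyzer: lowercase the text and keep only word characters."""
    return re.findall(r"\w+", text.lower())


# Per-document term counts: how many times each term appears in each title.
term_counts = {doc_id: Counter(analyze(title)) for doc_id, title in titles.items()}

# Document frequency: the number of documents containing each term.
document_frequency = Counter()
for counts in term_counts.values():
    document_frequency.update(counts.keys())


def term_frequency(term, doc_id):
    """Number of times a term appears in the given document (0 if absent)."""
    return term_counts[doc_id][term]


# Average length of the title field, measured in terms.
average_field_length = sum(sum(c.values()) for c in term_counts.values()) / len(term_counts)

print(document_frequency["pasta"])  # 2
print(document_frequency["pesto"])  # 1
print(term_frequency("pasta", 1))   # 1
print(average_field_length)         # 5.0
```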