
Module – 5

Chapter – 9
Information Retrieval
9.1 Introduction
Information Retrieval (IR) is a field that deals with organizing, storing, retrieving, and
evaluating information relevant to a user's query. When someone needs information,
they write a natural language query, and the retrieval system responds by finding and
presenting documents that seem relevant to that query. IR has been around since the
1960s, but it gained significant interest with the rise of the World Wide Web.

IR and Natural Language Processing (NLP) are becoming more interconnected. NLP
techniques, like probabilistic models and latent semantic indexing, are now being used
in IR systems. Traditionally, IR systems were designed to provide documents containing
the requested information, not the actual information itself.

In this context, the term "document" refers to text-based content, but IR can also
include non-textual information like images and speech. However, in this chapter, our
focus is solely on retrieving text documents.

IR systems differ from question answering systems and data retrieval systems. Question
answering systems give specific answers to specific questions, while data retrieval
systems fetch precise data organized in structured formats. On the other hand, queries
submitted to IR systems are often vague and imprecise, and the systems don't search for
specific data or provide direct answers like the other two types of systems do. Instead,
they identify and return documents related to the user's inquiry.

9.2 DESIGN FEATURES OF INFORMATION RETRIEVAL SYSTEMS


1. User's Information Need: The process starts with the user having a specific
need for information. They then formulate a query, expressing their request in
natural language.
2. Document Representation: Instead of using the actual text of the documents,
the system represents documents through sets of index terms or keywords. These
keywords can be single words or phrases, automatically or manually extracted
from the documents.
3. Indexing: The process of converting document text into its keyword
representation is called indexing. One commonly used data structure for indexing
is the "inverted index," which lists keywords with pointers to the documents
containing them.

Extra: Let's take an example. Suppose you have three documents:

• Document 1: "I love dogs and cats."


• Document 2: "Cats are cute pets."
• Document 3: "Dogs are loyal animals."

The indexing process would create an inverted index like this:

• Keyword "love": Document 1


• Keyword "dogs": Document 1, Document 3
• Keyword "cats": Document 1, Document 2
• Keyword "cute": Document 2
• Keyword "pets": Document 2
• Keyword "loyal": Document 3
• Keyword "animals": Document 3

Now, if you search for "dogs," the system will look it up in the index and find that it is
present in Document 1 and Document 3. This way, the IR system can quickly retrieve the
relevant documents that match your query without having to read through the entire
content of each document every time you search.
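As a rough illustration, here is a minimal Python sketch that builds such an inverted index for the three documents above. The lower-casing and simple tokenization stand in for the fuller text operations described next, and stop words are not removed here.

import re
from collections import defaultdict

documents = {
    1: "I love dogs and cats.",
    2: "Cats are cute pets.",
    3: "Dogs are loyal animals.",
}

# Map each keyword to the set of document IDs that contain it.
inverted_index = defaultdict(set)
for doc_id, text in documents.items():
    for token in re.findall(r"[a-z]+", text.lower()):
        inverted_index[token].add(doc_id)

print(sorted(inverted_index["dogs"]))  # -> [1, 3]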
4. Text Operations: To reduce the computational cost, certain text operations are
applied to the index terms. Stop word elimination removes unimportant words
(like "the," "and," etc.), and stemming reduces words to their root form (e.g.,
"running" becomes "run").
5. Importance Weighting: Not all terms in a document are equally important in
conveying its content. To quantify their significance, numerical values called
"weights" are assigned to index terms. These weights help in ranking the
relevance of documents to a given query.
6. Term-Weighting Schemes: Various techniques are used to determine the
weights of index terms. Many term-weighting schemes have been proposed in
the literature, and IR systems employ these to better match documents to
queries.

In summary, an IR system uses keywords to represent documents, matches them with
the keywords in the user's query, and ranks the documents based on their relevance to
the query. Text operations and term-weighting schemes help optimize the process and
improve the search results.

9.2.1 Indexing

When dealing with a small number of documents, it's easy for an IR system to directly
access each document to determine its relevance to a query. However, with a large
collection of documents, this approach becomes impractical and inefficient. So, we
transform the collection of raw documents into a more manageable and accessible
representation, and this process is called indexing.

Indexing involves selecting good document descriptors, such as keywords or terms, that
describe the content of documents effectively. These descriptors should be helpful in
distinguishing one document from another in the collection. To achieve this, we look for
words or phrases that frequently appear in the document and assume they are essential
for describing its content.

For example, if we have a sentence "Design features of information retrieval systems,"
the indexing may create the following terms: "Design," "features," "information,"
"retrieval," and "systems." We can also have multi-word phrases like "information
retrieval" and "information retrieval systems," which are obtained by identifying
frequently occurring sequences of words.

To extract meaningful phrases, we can use Natural Language Processing (NLP)
techniques, like part-of-speech tagging, which helps identify meaningful sequences of
words based on their context. For example, it can recognize proper nouns like "President
Kalam" and normalize noun phrases to represent them as a single entity.

The goal of indexing is to represent text, both queries and documents, as a set of terms
or phrases that capture the meaning and content of the original text. This representation
helps the IR system quickly identify relevant documents when you submit a query,
without having to read through the entire content of each document every time you
search.
9.2.2 Eliminating Stop Words

When we process the words or keywords used to represent documents in an
Information Retrieval (IR) system, we often remove certain words known as "stop
words." Stop words are high-frequency words in a language, like articles (e.g., "the," "a,"
"an") and prepositions (e.g., "in," "on," "to"). These words play important grammatical
roles in sentences but don't add much meaning to the content of a document when
using keyword-based representation.

Since stop words appear in many documents regardless of their topics, they don't help
in distinguishing between different documents. Eliminating them from the index terms
reduces the number of terms, making the retrieval process faster and more efficient.
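A minimal sketch of this step in Python (the stop list below is a small illustrative subset, not a standard list):

STOP_WORDS = {"the", "a", "an", "and", "of", "in", "on", "to", "or", "is", "are"}

def remove_stop_words(tokens):
    # keep only tokens that are not in the stop list
    return [t for t in tokens if t.lower() not in STOP_WORDS]

print(remove_stop_words("Design features of information retrieval systems".split()))
# -> ['Design', 'features', 'information', 'retrieval', 'systems']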

However, there is a drawback to removing stop words. Sometimes, a stop word can be
part of a meaningful phrase that we want to consider in our search. For example, in the
phrase "Vitamin A," the stop word "A" is actually significant, as it represents the
vitamin's name. Removing it as a stop word would make it difficult to find documents
related to "Vitamin A" accurately.

Similarly, some phrases like "to be or not to be" consist entirely of stop words, but they
are essential and meaningful phrases from literature. Removing stop words in such
cases can lead to incorrect searches.

So, eliminating stop words is a common practice in IR to improve efficiency, but we
need to be careful not to remove meaningful terms or phrases that are important for
specific queries.

9.2.3 Stemming

Stemming is a technique used in Information Retrieval (IR) to normalize words to their
base or root form, called the stem. It does this by removing affixes from words, like
prefixes and suffixes, to reduce them to their most basic form.

For example, if we have words like "compute," "computing," "computes," and
"computer," stemming would convert them all to the same stem, which is "comput."

The idea behind stemming is to represent similar words in a unified way, so they can be
treated as the same word during the retrieval process. This simplifies the indexing and
search process, as we don't have to deal with every variation of a word separately.
A common stemming algorithm, developed by Porter in 1980, is widely used in IR
systems.
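For instance, using the Porter stemmer implementation in NLTK (assuming the nltk package is installed), the "compute" family above collapses to a single stem:

# Stemming the "compute" family with NLTK's Porter stemmer.
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
words = ["compute", "computing", "computes", "computer"]
print([stemmer.stem(w) for w in words])
# -> ['comput', 'comput', 'comput', 'comput']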

Here's an example: The sentence "Design features of information retrieval systems"
would be stemmed as [design, feature, inform, retriev, system]. Notice that stop words
like "of" and "the" have been removed, and all the remaining terms are in lower case.

While stemming can be helpful in some cases by combining similar terms and increasing
recall (i.e., finding more relevant documents), it may also result in decreased precision
(i.e., returning some irrelevant documents). For instance, if the word "computation" is
stemmed to "comput," it may cause some unrelated documents containing the word
"computer" to be mistakenly included in the search results for a query about
"computation."

Recall and precision are important measures used to evaluate the effectiveness of an IR
system, and they are explained in more detail later in the chapter.

9.2.4 Zipf's Law

Zipf's law says that the frequency of a word multiplied by its rank in a large corpus is
more or less constant. More formally,

frequency × rank ≈ constant

Zipf's Law is an observation made by linguist George Zipf about the distribution of
words in natural languages. It says that the frequency of a word in a large collection of
text (corpus) is roughly inversely proportional to its rank. In simpler terms, if we arrange
words in the corpus from most frequent to least frequent, the product of the word's
frequency and its rank is approximately constant.
Here's an example to illustrate this law: Let's say we have a large collection of text, and
the word "the" appears 1000 times, making it the most frequent word (rank 1). The word
"cat" appears 500 times, making it the second most frequent word (rank 2). The product
of their frequency and rank is roughly the same: 1000 (frequency of "the") x 1 (rank of
"the") ≈ 500 (frequency of "cat") x 2 (rank of "cat").

This pattern means that in any language, there are a small number of very common
words that occur with high frequency, like "the," "and," "is," etc. Then there are many
words that occur with lower frequency, and finally, there are a lot of rare words that
appear very infrequently.

In Information Retrieval, this law has significance because the high-frequency common
words don't have much distinguishing power, so they are not very useful for indexing.
On the other hand, the rare words are less likely to be included in a search query and
also don't contribute much to indexing. So, we often remove both the high-frequency
common words (stop words) and the rare words from the index terms, keeping the
medium-frequency content-bearing words for effective indexing.

In summary, Zipf's Law helps us understand the distribution of words in languages, and
it guides us in selecting the most relevant words for indexing in Information Retrieval
systems.

9.3 INFORMATION RETRIEVAL MODELS

An IR model is like a blueprint that defines how an IR system works. It determines how
documents and queries are represented, how relevant documents are found, and how
they are ranked. The main goal of any IR model is to find all the documents that are
relevant to a user's query.

There are several IR models, and they can be grouped into three categories:

1. Classical Models: These are the traditional and widely used models in IR. The
three main classical models are:
• Boolean Model: This model considers documents and queries as sets of
words. It retrieves documents that contain all the words in the query using
logical operations like "AND," "OR," and "NOT."
• Vector Model: In this model, documents and queries are represented as
vectors of word weights. It retrieves documents based on their similarity to
the query.
• Probabilistic Model: This model calculates the probability of a document
being relevant to the query and retrieves documents with the highest
probabilities.
2. Non-Classical Models: These models use different principles than the classical
ones for retrieval. They include models based on special logic techniques,
situation theory, or the concept of interaction.
3. Alternative Models: These models are enhancements of the classical models,
using specific techniques from other fields. Examples include the cluster model,
fuzzy model, and latent semantic indexing (LSI) model.

Classical models are straightforward and well understood, making them simple to
implement. Many commercial IR systems are based on these models. Non-classical
models use different approaches for retrieval, and alternative models build on the
classical models with additional techniques from other areas.

In summary, IR models determine how documents and queries are represented and how
relevant documents are retrieved and ranked. They come in different types, with classical
models being the most commonly used, non-classical models employing different
principles, and alternative models enhancing the classical ones with specific techniques.

9.4 CLASSICAL INFORMATION RETRIEVAL MODELS


9.4.1 Boolean model

The Boolean model is one of the oldest classical models used in Information Retrieval
(IR). It is based on Boolean logic and classical set theory. In this model, documents are
represented as sets of keywords, and users express their queries as Boolean expressions
using keywords connected with logical operators (AND, OR, NOT).

Here's how it works in a simplified example:

Suppose we have a set of documents, like D1, D2, and D3. Each document contains
specific keywords, and we represent them as sets. For example:

• D1 = {information, retrieval, query}
• D2 = {information, query}
• D3 = {retrieval, query}
Now, let's say a user wants to search for documents that contain both "information" and
"retrieval." Their query would be "information AND retrieval." In response to this query,
the system performs two steps:

1. The system retrieves sets of documents that contain each individual keyword:
• R1 = {D1, D2}
• R2 = {D1, D3}
2. Then, the system combines these sets using the AND operator, which means it
takes the intersection of the sets:
• R = R1 AND R2 = {D1}

So, the retrieved document is D1, which contains both "information" and "retrieval."
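Expressed in Python, this Boolean retrieval is just a set intersection; here is a toy sketch mirroring the D1–D3 example above:

# Boolean retrieval over the toy collection above.
docs = {
    "D1": {"information", "retrieval", "query"},
    "D2": {"information", "query"},
    "D3": {"retrieval", "query"},
}

def containing(term):
    # documents whose keyword set contains the term
    return {d for d, terms in docs.items() if term in terms}

# Query: information AND retrieval
result = containing("information") & containing("retrieval")
print(result)  # -> {'D1'}

An OR query would use set union instead, and NOT would subtract the matching set from the whole collection.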

The Boolean model is simple and efficient and works well if the query is well formulated.
However, it has some drawbacks:

1. It can only retrieve documents that fully match the query, not those that are only
partly relevant.
2. It doesn't rank the retrieved documents based on relevance, meaning all relevant
documents have equal importance.
3. Users often don't express their queries using pure Boolean expressions, which the
model requires.

To overcome these limitations, various extensions of the Boolean model have been
proposed, like the P-norm model and the fuzzy-set model, which try to address these
weaknesses.

9.4.2 Probabilistic Model

The Probabilistic Model is an Information Retrieval (IR) model that uses probabilities to
rank and retrieve documents based on their relevance to a given query. Instead of
treating documents as exact matches to the query (like the Boolean model), the
probabilistic model calculates the probability that a document is relevant or irrelevant to
the query.

Here's how it works in simpler terms:

1. For each document d and a given query, the model estimates the probability that
the document is relevant, P(R|d), and the probability that it is irrelevant, P(I|d).
2. The documents are then ranked based on their probability of relevance.
Documents with higher probabilities of relevance are ranked higher.
3. To retrieve documents, the model sets a threshold value α. Documents are
retrieved if their probability of relevance is higher than the threshold value.

For example, if a document has a higher probability of being relevant to the query
"tennis match," it will be ranked higher and likely retrieved.

However, the model faces challenges in accurately estimating probabilities, especially
when there are only a few relevant documents for a particular query. This can lead to
difficulty in setting an appropriate threshold value for retrieving documents.

In summary, the Probabilistic Model ranks documents based on the probability of their
relevance to a query. It is more flexible than the Boolean model as it allows for partial
matching of the query terms. However, it may face challenges when estimating
probabilities with limited relevant documents for a query.

9.4.3 Vector Space Model

The Vector Space Model is a widely used Information Retrieval (IR) model. It represents
documents and queries as vectors of features, where each feature corresponds to a term
in the documents or the query. These vectors are placed in a multi-dimensional space,
and each dimension represents a specific term.

Here's a simple example:

Suppose we have three documents and three terms:

• Documents: D1, D2, D3
• Terms: T1, T2, T3

The weight of each term in the documents is based on its frequency in that document.
For example:

• D1: T1=2, T2=2, T3=1
• D2: T1=1, T2=0, T3=1
• D3: T1=0, T2=1, T3=1

We can represent each document as a vector with the term weights:

• D1: (2, 2, 1)
• D2: (1, 0, 1)
• D3: (0, 1, 1)

These vectors can be plotted as points in a multi-dimensional space.

To make the comparison between documents fair, we normalize the vectors to unit
length. This means we adjust the vector lengths to have the same scale, so their
magnitudes don't affect the similarity calculation.

In summary, the Vector Space Model represents documents and queries as vectors of
term weights in a multi-dimensional space. It then calculates the similarity between the
document vectors and the query vector to rank the documents based on relevance.
Normalization is used to ensure fair comparison between documents.
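Here is a small sketch of this similarity computation, reusing the D1–D3 weight vectors above; the query vector is an illustrative one that weights terms T1 and T2, and cosine similarity handles the normalization implicitly by dividing by the vector lengths:

import math

def cosine(a, b):
    # dot product divided by the product of the vector lengths
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

d1, d2, d3 = (2, 2, 1), (1, 0, 1), (0, 1, 1)
query = (1, 1, 0)  # hypothetical query using terms T1 and T2

for name, vec in [("D1", d1), ("D2", d2), ("D3", d3)]:
    print(name, round(cosine(query, vec), 3))
# D1 scores highest (about 0.943), so it would be ranked first.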
9.5 Alternative Models of IR
The cluster model and the fuzzy model are two alternative information retrieval (IR) models used to
enhance the efficiency and precision of information retrieval systems.

Cluster Model

The cluster model aims to reduce the number of necessary matches during retrieval by grouping closely
related documents into clusters. The cluster hypothesis posits that documents associated with the same
clusters are likely to be mutually relevant. Instead of matching the query with every document in the
collection, the query is matched with representative documents from each cluster. This substantially
diminishes search time and enhances retrieval efficiency.

For example, suppose the collection has been grouped into two clusters, C1 and C2,
whose representative vectors over five index terms are:

r1 = (1, 0.5, 1, 0, 1)
r2 = (0, 0, 1, 1, 0)

A query vector is first compared with r1 and r2, and only the documents belonging to
the closer cluster are then matched individually.
Fuzzy Model

In the fuzzy model, a document is represented as a fuzzy set of index terms: each term
belongs to the document with a degree of membership rather than being simply present
or absent. A query then retrieves documents to the degree that they contain its terms,
so documents can partially match a query, something the strict Boolean model cannot
express.

TF-IDF Term Weighting Approach


TF-IDF (Term Frequency-Inverse Document Frequency) is a popular term weighting approach
used in Information Retrieval (IR) to represent the importance of terms in documents within a
collection.

Term Frequency (TF): The term frequency (TF) of a term in a document measures how
frequently a term appears in that document. It is calculated as the number of times the term
occurs in the document divided by the total number of terms in the document. A higher TF
value indicates that the term is more important or relevant in the document.

Inverse Document Frequency (IDF): Inverse Document Frequency assesses the rarity of a term
across the entire document collection. It helps to identify terms that are unique or
discriminatory. IDF assigns a higher weight to terms that appear in fewer documents and a
lower weight to terms that are common across many documents.

TF-IDF Weight: The TF-IDF weight of a term in a document combines both the TF and
IDF values to give a weight that reflects the importance of the term in that particular
document and the collection as a whole. It is calculated by multiplying the TF and IDF
values for that term. A higher TF-IDF weight signifies that the term is both frequently
occurring in the document and rare across the collection, making it more relevant to
that document.
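A minimal sketch of this computation over a toy three-document collection, with tf as raw count divided by document length and idf as log(N / document frequency), matching the definitions above (real systems often use smoothed or scaled variants):

import math
from collections import Counter

docs = [
    "information retrieval query".split(),
    "information query".split(),
    "retrieval query".split(),
]
N = len(docs)
# Document frequency: in how many documents each term appears.
df = Counter(term for doc in docs for term in set(doc))

def tf_idf(term, doc):
    tf = doc.count(term) / len(doc)
    idf = math.log(N / df[term])
    return tf * idf

print(tf_idf("retrieval", docs[0]))  # > 0: present here, absent from one document
print(tf_idf("query", docs[0]))      # 0.0: "query" occurs in every document

Note that a term occurring in every document gets an idf of zero, reflecting that it has no power to discriminate between documents.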
Chapter – 12
Lexical Resources
Introduction
In this chapter, the authors introduce various tools and resources that researchers can use to
work with natural language processing (NLP). These resources are freely available, which means
anyone can access and download them from the Internet.

WORDNET
WordNet is a large database for the English language that contains sets of words with
similar meanings, known as synsets. It was created based on psycholinguistic theories
and is maintained by Princeton University. WordNet has separate databases for nouns,
verbs, and adjectives/adverbs. Each word can have multiple senses, and each sense
belongs to a different synset.

In WordNet, words are linked by various relations, such as synonymy (words with the
same meaning), hypernymy/hyponymy (generalization/specialization relationships),
antonymy (words with opposite meanings), meronymy/holonymy (part-whole
relationships), and troponymy (verbs indicating manner of action).

WordNet provides glosses, which are dictionary-style definitions with examples, to help
differentiate between the meanings of words. It is freely available for download from
their website.

WordNets for other languages have also been developed. EuroWordNet covers
European languages and includes multilingual relations to English meanings. Hindi
WordNet, developed by IIT Bombay, has over 26,000 synsets and 56,000 Hindi words. It
follows similar principles to English WordNet but includes specific relations for Hindi,
like causative relations. It is a valuable resource for the Hindi language.
1. Hypernymy: A hypernym is a word that represents a more general concept or
category and can be used to describe a group of related words. For example,
"animal" is a hypernym of "dog," "cat," and "elephant." It represents the broader
category that includes these specific animals.
2. Hyponymy: A hyponym is a word that represents a more specific concept or
category and is a subtype or instance of a hypernym. For example, "dog" is a
hyponym of "animal." It represents a specific type of animal that falls under the
broader category of "animal."
3. Meronymy and Holonymy (Part-Whole Relationships): Meronymy and holonymy
are lexical relations that describe the relationship between a whole and its parts
or components.
• Meronymy: A meronym is a word that represents a part or a component of a
whole. For example, "wheel" is a meronym of "car" because a car consists of
wheels.
• Holonymy: A holonym is a word that represents the whole or the complete entity
that contains its parts. Using the same example, "car" is a holonym of "wheel"
because a car contains wheels.
4. Troponymy (Verbs Indicating Manner of Action): Troponymy is a lexical relation
that describes the relationship between a verb and another verb that indicates a
specific manner or way of performing the action. For example, there are several
manners in which someone can walk, such as "stroll," "limp," or "march." These
verbs are troponyms of "walk" because they indicate specific manners of walking.
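These relations can also be explored programmatically; here is a brief sketch using NLTK's WordNet interface (assuming the nltk package is installed and the WordNet corpus has been downloaded with nltk.download("wordnet")):

from nltk.corpus import wordnet as wn

dog = wn.synsets("dog")[0]      # first (most common) sense of "dog"
print(dog.definition())         # the gloss for this synset
print(dog.hypernyms())          # more general synsets, e.g. canine
print(dog.hyponyms()[:3])       # a few more specific kinds of dog

walk = wn.synsets("walk", pos=wn.VERB)[0]
print(walk.lemma_names())       # synonyms grouped in the same synset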
Applications:
1. Concept Identification in Natural Language: WordNet helps in identifying the
concepts related to a given term, allowing a deeper understanding of its meaning
in a given context.
2. Word Sense Disambiguation: WordNet is used to resolve the ambiguity of words
with multiple meanings. It provides sense definitions, synonyms, and semantic
relations, making it a valuable resource for disambiguation tasks.
3. Automatic Query Expansion: WordNet's semantic relations can be used to expand
search queries, ensuring that the search covers synonyms and related terms,
improving the retrieval of relevant documents.
4. Document Structuring and Categorization: The knowledge extracted from
WordNet can be used for organizing and categorizing text documents, making it
easier to manage and access relevant information.
5. Document Summarization: WordNet is utilized in text summarization techniques
to compute lexical chains, which help in creating concise and meaningful
summaries of large text documents.

FRAMENET
FrameNet is a database of English sentences that are annotated with semantic information. It
follows the principles of frame semantics. In FrameNet, each frame consists of a main word
(predicate) and associated roles called frame elements. For example, in the "ARREST" frame, the
main word is "nab," and it has frame elements like "AUTHORITIES" and "SUSPECT." These frame
elements represent the participants in the situation.

FrameNet is a collection of English sentences with extra information about the meaning
of the words in those sentences. It's like a special dictionary that tells us how words are
used in different situations.

Imagine you have a sentence like "The police nabbed the suspect." In FrameNet, the
word "nabbed" is associated with a specific situation or frame, which is the "ARREST"
frame. This frame includes important roles or participants in the situation, like
"AUTHORITIES" (referring to the police) and "SUSPECT."

FrameNet helps us understand that the word "nabbed" in this context is related to the
idea of arresting someone, and it involves the authorities and the suspect. This
additional information helps with tasks like word sense disambiguation (figuring out the
correct meaning of a word in a sentence) and automatic query expansion (finding
related words or synonyms to improve search results).
Frames can have different roles. For example, the "COMMUNICATION" frame includes
roles like "ADDRESSEE," "COMMUNICATOR," "TOPIC," and "MEDIUM." Frames can also
inherit roles from other frames. For instance, a "STATEMENT" frame may inherit roles
from the "COMMUNICATION" frame.

In summary, FrameNet provides a structured way of understanding how words are used
in different situations, making it a valuable resource for various natural language
processing tasks.
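FrameNet can likewise be browsed programmatically; here is a rough sketch via NLTK's FrameNet corpus reader (assuming nltk is installed and the corpus downloaded with nltk.download("framenet_v17"); the frame and role names are those distributed by the Berkeley FrameNet project):

from nltk.corpus import framenet as fn

arrest = fn.frame("Arrest")      # the ARREST frame discussed above
print(arrest.definition)         # prose definition of the frame
print(list(arrest.FE))           # frame elements, e.g. Authorities, Suspect
print(list(arrest.lexUnit)[:5])  # lexical units that evoke the frame, e.g. "nab.v"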

Applications:
1. Automatic Semantic Parsing: FrameNet data helps in understanding the meaning
of words in sentences, which is crucial for automatic semantic parsing, where we
try to understand the relationships between words and their roles in a sentence.
2. Information Extraction: FrameNet's shallow semantic roles can be useful in
information extraction tasks. For example, it can help identify that the word
"match" has the same theme role in both "The umpire stopped the match" and
"The match stopped due to bad weather" even though the syntactic roles of
"match" are different.
3. Question-Answering Systems: By using common frames that define the semantic
roles of verbs like "send" and "receive," question-answering systems can answer
questions like "Who sent a packet to Khushbu?" based on sentences that talk
about sending and receiving packets.
4. Information Retrieval: FrameNet data can improve information retrieval by
helping the system understand the meaning and context of words in search
queries and documents.
5. Machine Translation: FrameNet can be used as a bridge between different
languages to create a common interlingua for machine translation, making it
easier to translate between languages.
6. Text Summarization: FrameNet information can assist in summarizing text by
identifying important semantic roles and concepts.
7. Word Sense Disambiguation: FrameNet data is valuable in disambiguating word
meanings, which is essential in tasks where a word can have multiple senses
depending on the context.
