Regulation – 2022 BAI601-Natural Language Processing

Module 4

Syllabus:
Information Retrieval: Design Features of Information Retrieval Systems, Information
Retrieval Models - Classical, Non-classical, Alternative Models of Information Retrieval -
Cluster model, Fuzzy model, LSTM model, Major Issues in Information Retrieval.
Lexical Resources: WordNet, FrameNet, Stemmers, Parts-of-Speech Tagger, Research
Corpora.

Information Retrieval:
Introduction
Information Retrieval (IR) deals with the organization, storage, retrieval, and evaluation of information relevant to a user's query. A user in need of information formulates a request in the form of a query written in a natural language. The retrieval system responds by retrieving documents that seem relevant to the query. Text retrieval systems were first introduced in the 1960s, and many NLP techniques are used in IR, including probabilistic models. Traditionally, IR systems are not expected to return the actual information, only the documents containing that information. The word document is used as a general term that also includes non-textual information such as images and speech.

4.1 Design Features of Information Retrieval Systems


The figure illustrates the basic process of IR. It begins with the user's information need. Based on this need, the user formulates a query. The IR system returns documents that seem relevant to the query. The basic question involved is: what constitutes the information in the documents and the queries?

This in turn is related to the problem of representation of documents and queries. Retrieval is performed by matching the query representation with the document representation. The actual text of the document is not used in the retrieval process; instead, documents in a collection are frequently represented through a set of index terms or keywords. The words term and keyword will be used interchangeably. Keywords can be single words or multi-word phrases, and they might be extracted automatically or manually. Such a representation provides a logical view of the document. The process of transforming document text into such a representation is known as indexing.

Indexing
In a small collection of documents, an IR system can access each document to decide its relevance to a query. For larger collections, the raw documents are usually transformed into an easily accessible representation. Most indexing techniques involve identifying good document descriptors, such as keywords or terms, which describe the information content of the documents. A term can be a single word or a multi-word phrase.
For example, the text "Design features of information retrieval systems" can be represented by the set of terms
design, features, information, retrieval, systems
or by the set of terms
design, features, information retrieval, information retrieval systems
These multi-word terms can be obtained by looking at frequently appearing sequences of words, n-grams, or part-of-speech tags.
Eliminating Stop Words
Stop words are commonly used words in a language that do not carry significant meaning for retrieval tasks. They are high-frequency words with little semantic weight: they play grammatical roles in the language but do not contribute to the semantic content of a document in keyword-based representations. Removing stop words reduces noise and improves the performance of retrieval tasks.
Example of stop words
1. Articles –a, an, the
2. Prepositions- in, on, at, by
3. Pronouns – he, she, it
4. Conjunctions –and, or, but
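A minimal sketch of stop-word removal in Python is shown below; the stop-word list here is only a small illustrative sample, not a standard list.

STOP_WORDS = {"a", "an", "the", "of", "to", "in", "on", "at", "by",
              "he", "she", "it", "and", "or", "but"}

def remove_stop_words(text):
    # keep only tokens that are not in the stop-word list
    tokens = text.lower().split()
    return [t for t in tokens if t not in STOP_WORDS]

print(remove_stop_words("Design features of an information retrieval system"))
# ['design', 'features', 'information', 'retrieval', 'system']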

Stemming
Stemming reduces words to their common grammatical roots. It refers to the process of
reducing words to their root or base form, often by removing affixes. The goal of stemming is
to normalize words so that variations of the same word are treated as a single entity.
 For example, words like "running", "runner", and "runs" would all be reduced to the
root word "run".
 For example, "compute", "computing", "computes", and "computer" are all reduced to the
same word stem "comput".
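For instance, NLTK's implementation of the Porter stemmer reproduces the "comput" example (a small sketch, assuming the nltk package is installed):

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["compute", "computing", "computes", "computer"]:
    print(word, "->", stemmer.stem(word))   # all four reduce to "comput"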

Zipf’s Law
Zipf's law helps identify and remove words that contribute little to retrieval, further reducing the size of the index set and improving search efficiency.

Zipf made an important observation on the distribution of words in natural language. Zipf's law says that the frequency of a word multiplied by its rank in a large corpus is more or less constant:
frequency × rank ≈ constant
This means that if we compute the frequencies of the words in a corpus and arrange them in decreasing order of frequency, the product of frequency and rank stays roughly the same down the list.
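A quick way to check this on any plain-text corpus is to print frequency × rank for the top-ranked words; the sketch below is illustrative and the file name is only a placeholder.

from collections import Counter

def zipf_table(text, top=10):
    # count words, rank them by frequency, and print frequency x rank
    counts = Counter(text.lower().split())
    for rank, (word, freq) in enumerate(counts.most_common(top), start=1):
        print(rank, word, freq, freq * rank)

# usage (corpus.txt is a placeholder file name):
# zipf_table(open("corpus.txt", encoding="utf-8").read())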

4.2 Information Retrieval Models


An IR model is a pattern that defines several aspects of the retrieval procedure, for example:
 how documents and user's queries are represented.
 how a system retrieves relevant documents according to users' queries.
 how retrieved documents are ranked.
 The IR system consists of a model for documents, a model for queries, and a matching
function which compares queries to documents.
 These models can be classified as follows:
1. Classical models of IR
2. Non-Classical models of IR
3. Alternative models for IR

4.2.1 Classical Model


The three classical IR models, Boolean, vector, and probabilistic, are based on mathematical concepts.
4.2.1.1 Boolean Model
A document is retrieved only if it exactly matches the query conditions. The model is based on Boolean logic (AND, OR, NOT) and gives binary results: a document is either relevant or not.
Example 1:
car AND red returns documents that contain both words.
Example 2:
A legal database search where a lawyer searches:
("contract breach" AND "damages") NOT "employment"
Only documents that contain both "contract breach" and "damages" but exclude
"employment" are retrieved.

4.2.1.2 Vector space Model


The Vector Space Model (VSM) is a classical model in Information Retrieval where both documents and queries are represented as vectors in a multi-dimensional space. It measures similarity using the cosine similarity between document and query vectors, and it provides a ranking of documents based on relevance scores.
Example:
A search engine like Elasticsearch or Lucene ranks results when a user types:
"best budget travel destinations in India"
Each document is a vector of terms. The engine calculates the cosine similarity between the query vector and each document vector.
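A minimal sketch of the vector space model using raw term-frequency vectors and cosine similarity is given below; real systems typically use TF-IDF weights, and the documents here are made up for illustration.

import math
from collections import Counter

def cosine(a, b):
    # cosine similarity between two sparse term-frequency vectors (dicts)
    dot = sum(w * b.get(t, 0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

docs = ["budget travel destinations in India",
        "India cricket match schedule",
        "cheap budget hotels and travel tips"]
query = "best budget travel destinations in India"

qv = Counter(query.lower().split())
scores = {d: cosine(qv, Counter(d.lower().split())) for d in docs}
for doc, score in sorted(scores.items(), key=lambda x: x[1], reverse=True):
    print(round(score, 3), doc)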
4.2.1.3 Probabilistic Model
The Probabilistic Model in Information Retrieval ranks documents based on the probability that a document is relevant to a given query. Retrieval depends on whether the probability of relevance of a document is higher than the probability of non-relevance.
Example 1
If you search "top universities for AI research" on Google:
 The system uses past data to estimate which documents are most likely relevant.
 If many users clicked and stayed on a particular result, it is considered more relevant in future searches.
Example 2
Google Search with feedback signals
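The sketch below is a much-simplified, binary-independence-style scoring (not the exact model used by any search engine): each query term is given assumed probabilities of appearing in relevant and non-relevant documents, and documents are ranked by summed log-odds. The probabilities are hand-set for illustration, not estimated from data.

import math

# hypothetical probabilities: p = P(term | relevant), q = P(term | non-relevant)
term_probs = {"universities": (0.8, 0.3), "ai": (0.7, 0.2), "research": (0.6, 0.3)}

def rsv(doc_terms):
    # retrieval status value: sum of log-odds for query terms present in the document
    score = 0.0
    for t, (p, q) in term_probs.items():
        if t in doc_terms:
            score += math.log((p * (1 - q)) / (q * (1 - p)))
    return score

print(rsv({"top", "universities", "ai", "research"}))   # higher score
print(rsv({"football", "research"}))                    # lower score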

4.2.2 Non-Classical Model


Non-classical IR models are based on principles other than similarity, probability, and Boolean operations. They include the information logic model, the situation theory model, and the interaction model.

4.2.2.1 Information logic model


The information logic model is based on a special logic technique called logical imaging. Retrieval is performed by making inferences from documents to the query. Documents and queries are treated as logical expressions: each document is represented by a logical formula, and a query is also a logical expression.

Example:
Document D1: "apple AND banana"    Query Q: "apple"
Since D1 ⇒ Q (if a document talks about both apple and banana, it implies it talks about apple), D1 is considered relevant.

4.2.2.2 Situation theory model


Situation theory is used to describe how pieces of information (infons) are true in specific situations. Documents are treated as sets of infons (facts or information units). A user's query represents an informational need in a specific situation. Retrieval involves matching documents whose infons are relevant to the user's situation.
Example:
Suppose a user in Bangalore searches for "best dosa near me".
 Traditional models may just match "dosa" and "best".
 The situation theory model also considers the situation: location (Bangalore), time (lunch hours), and the user's preference (vegetarian).

4.2.2.3 Interaction model


The interaction IR model was first introduced in Dominich (1992, 1993) and Rijsbergen (1996). In this model, the documents are not isolated; instead, they are interconnected. The query interacts with the interconnected documents, and retrieval is conceived as a result of this interaction. Artificial neural networks can be used to implement this model: each document is modelled as a neuron, and the document set as a whole forms a neural network.
Example:
 A user wants to buy a budget-friendly smartphone.
 The query interacts with interconnected documents about RAM size, camera, games, and apps.

4.2.3 Alternative Model of IR


1. Cluster Model
2. Fuzzy Model
3. Latent Semantic Indexing Model (LSIM)

4.2.3.1 Cluster Model


The cluster model is an attempt to reduce the number of matches during retrieval. Documents are grouped into clusters based on similarity, and retrieval is performed on these clusters.
Example:
News App Search
 You open a news app and search for “global warming effects”.
 The app has already grouped articles into clusters like:
Cluster A: Climate Change & Environment
Cluster B: Politics & Policies
Cluster C: Scientific Research
Cluster D: Business & Economy
 Your query best matches Cluster A, which includes articles on melting glaciers, rising sea
levels, and extreme weather events.
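A small sketch of cluster-based retrieval is shown below, using k-means over TF-IDF vectors from scikit-learn (one possible choice; the notes do not prescribe a specific clustering algorithm, and the library is assumed to be installed). The query is routed to the best-matching cluster, and retrieval is restricted to that cluster.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = ["melting glaciers and rising sea levels",
        "parliament debates new climate policy",
        "extreme weather events increase worldwide",
        "stock markets react to carbon tax"]

vec = TfidfVectorizer()
X = vec.fit_transform(docs)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# route the query to its nearest cluster, then retrieve only from that cluster
q = vec.transform(["global warming effects"])
best_cluster = km.predict(q)[0]
print([d for d, c in zip(docs, km.labels_) if c == best_cluster])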

4.2.3.2 Fuzzy Model


The fuzzy model in Information Retrieval deals with imprecise or vague matching between queries and documents. It uses fuzzy logic to allow partial matches rather than requiring exact matches. Instead of saying a document is either relevant (1) or not relevant (0), the fuzzy model allows degrees of relevance, like 0.7 or 0.3, similar to how humans think in terms of "somewhat relevant" or "mostly relevant". A document's relevance is calculated using a fuzzy similarity function, such as cosine similarity.
Example:
 Online shopping search: on an e-commerce website, the user types "blu tooth hedphones" (typos for "Bluetooth headphones").
 A traditional IR system may fail to retrieve results due to the spelling errors.
 A fuzzy model detects that "blu tooth" ≈ "Bluetooth" and "hedphones" ≈ "headphones", so relevant products are still retrieved.
4.2.3.3 Latent Semantic Indexing Model(LSIM)
The LSIM is a model in Information Retrieval that identifies hidden (latent) relationships between terms and documents by analyzing word co-occurrence patterns. Even if a document doesn't
contain the exact search term, it may still be relevant because it contains related or
semantically similar terms. LSIM helps uncover these semantic relationships using linear
algebra.
1. Build a term-document matrix (rows = words, columns = documents, values = word
frequency or TF-IDF).
2. Apply Singular Value Decomposition (SVD) to reduce dimensions. This captures the
most important concepts (topics) in the data.
3. Both documents and queries are mapped into this concept space.
4. Similarity is computed between query and documents in this reduced space — even if
exact words don’t match.
Example:
You search: "car accident lawyer"
 A document that talks about: “automobile collision” “legal assistance” “injury claims”
 might not contain the exact words “car” or “lawyer”
 but still gets retrieved because LSIM detects the semantic relationship between the
words “car” and “automobile”, “lawyer” and “legal”.
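A compact sketch of the LSI steps with numpy is given below: a toy term-document matrix is decomposed with SVD, and the query is projected into the reduced concept space before computing similarities. The matrix values are made up for illustration only.

import numpy as np

terms = ["car", "automobile", "lawyer", "legal", "banana"]
# term-document matrix: rows = terms, columns = three toy documents
A = np.array([[2, 0, 1],
              [0, 2, 0],
              [1, 0, 0],
              [0, 1, 0],
              [0, 0, 3]], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2                                   # keep only the top-k latent concepts
Uk, sk, Vtk = U[:, :k], s[:k], Vt[:k, :]

q = np.array([1, 0, 1, 0, 0], dtype=float)      # query vector for "car lawyer"
q_hat = np.linalg.inv(np.diag(sk)) @ Uk.T @ q    # project the query into concept space
doc_vecs = Vtk.T                                  # documents in concept space

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

print([round(cos(q_hat, d), 3) for d in doc_vecs])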

4.3 LSTM - Model


Long Short-Term Memory (LSTM) is an enhanced version of the Recurrent Neural Network (RNN) designed by Hochreiter and Schmidhuber. LSTMs can capture long-term dependencies in sequential data, making them ideal for tasks like language translation, speech recognition, and time series forecasting. Unlike traditional RNNs, which use a single hidden state passed through time, LSTMs introduce a memory cell that holds information over extended periods, addressing the challenge of learning long-term dependencies.
Problem with Long-Term Dependencies in RNN
Recurrent Neural Networks (RNNs) are designed to handle sequential data by maintaining a hidden state that captures information from previous time steps. However, they often face challenges in learning long-term dependencies, where information from distant time steps becomes crucial for making accurate predictions about the current state. These problems are known as the vanishing gradient and exploding gradient problems.
 Vanishing Gradient: When training a model over time, the gradients which help the
model learn can shrink as they pass through many steps. This makes it hard for the
model to learn long-term patterns since earlier information becomes almost irrelevant.

 Exploding Gradient: Sometimes gradients can grow too large causing instability. This
makes it difficult for the model to learn properly as the updates to the model become
erratic and unpredictable.
Both of these issues make it challenging for standard RNNs to effectively capture long-term
dependencies in sequential data.

LSTM Architecture
The LSTM architecture involves a memory cell which is controlled by three gates:
1. Input gate: Controls what information is added to the memory cell.
2. Forget gate: Determines what information is removed from the memory cell.
3. Output gate: Controls what information is output from the memory cell.
This allows LSTM networks to selectively retain or discard information as it flows through the network, enabling them to learn long-term dependencies. The network also has a hidden state, which acts as its short-term memory. This memory is updated using the current input, the previous hidden state, and the current state of the memory cell.
Working of LSTM
LSTM architecture has a chain structure that contains four neural networks and different
memory blocks called cells.

1. Forget Gate
The information that is no longer useful in the cell state is removed with the forget gate. Two inputs, x_t (the input at the current time step) and h_{t-1} (the previous cell output), are fed to the gate and multiplied with weight matrices, followed by the addition of a bias. The resultant is passed through an activation function which gives a binary output. If for a particular cell state
the output is 0, the piece of information is forgotten and for output 1, the information is
retained for future use.
The equation for the forget gate is:
f_t = σ(W_f · [h_{t-1}, x_t] + b_f)
Where:
 W_f represents the weight matrix associated with the forget gate.
 [h_{t-1}, x_t] denotes the concatenation of the current input and the previous hidden state.
 b_f is the bias associated with the forget gate.
 σ is the sigmoid activation function.

2. Input gate
The addition of useful information to the cell state is done by the input gate. First, the information is regulated using the sigmoid function, which filters the values to be remembered, similar to the forget gate, using the inputs h_{t-1} and x_t. Then, a vector is created using the tanh function, which gives an output from -1 to +1 and contains all the possible values from h_{t-1} and x_t. At last, the values of the vector and the regulated values are multiplied to obtain the useful information.

3. Output gate
The task of extracting useful information from the current cell state to be presented as output is done by the output gate. First, a vector is generated by applying the tanh function on the cell state. Then, the information is regulated using the sigmoid function, which filters the values to be remembered, using the inputs h_{t-1} and x_t. At last, the values of the vector and the regulated values are multiplied and sent as the output and as input to the next cell.
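A bare-bones numpy sketch of one LSTM step, following the gate descriptions above, is given below. The weights are random placeholders rather than trained values, and the formulation is one common variant of the LSTM equations.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W_f, W_i, W_c, W_o, b_f, b_i, b_c, b_o):
    z = np.concatenate([h_prev, x_t])   # [h_{t-1}, x_t]
    f_t = sigmoid(W_f @ z + b_f)         # forget gate
    i_t = sigmoid(W_i @ z + b_i)         # input gate
    c_hat = np.tanh(W_c @ z + b_c)       # candidate cell values
    c_t = f_t * c_prev + i_t * c_hat     # updated memory cell
    o_t = sigmoid(W_o @ z + b_o)         # output gate
    h_t = o_t * np.tanh(c_t)             # new hidden state
    return h_t, c_t

n_in, n_hid = 3, 4
rng = np.random.default_rng(0)
W = [rng.standard_normal((n_hid, n_hid + n_in)) for _ in range(4)]  # placeholder weights
b = [np.zeros(n_hid) for _ in range(4)]
h, c = np.zeros(n_hid), np.zeros(n_hid)
h, c = lstm_step(rng.standard_normal(n_in), h, c, *W, *b)
print(h)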

Applications of LSTM
Some of the well-known applications of LSTM include:
 Language Modeling: Used in tasks like language modeling, machine translation and text
summarization. These networks learn the dependencies between words in a sentence to
generate coherent and grammatically correct sentences.
 Speech Recognition: Used in transcribing speech to text and recognizing spoken
commands. By learning speech patterns they can match spoken words to the corresponding
text.
 Time Series Forecasting: Used for predicting stock prices, weather and energy
consumption. They learn patterns in time series data to predict future events.
 Anomaly Detection: Used for detecting fraud or network intrusions. These networks can
identify patterns in data that deviate drastically and flag them as potential anomalies.
 Recommender Systems: In recommendation tasks like suggesting movies, music and
books. They learn user behavior patterns to provide personalized suggestions.
 Video Analysis: Applied in tasks such as object detection, activity recognition and action
classification. When combined with Convolutional Neural Networks (CNNs) they help
analyze video data and extract useful information.

4.4 Major Issues in Information Retrieval


The main issues in Information Retrieval (IR) are document and query indexing, query evaluation, and system evaluation.

1. Document and Query Indexing - The main goal of document and query indexing is to identify important meanings and create an internal representation. The factors to be considered are accuracy in representing semantics, exhaustiveness, and the facility for a computer to manipulate the representation.
2. Query Evaluation - In the retrieval model, query evaluation concerns how a document can be represented with the selected keywords and how document and query representations are compared to calculate a score. Information Retrieval (IR) deals with issues like uncertainty and vagueness in information systems.
 Uncertainty: The available representation does not typically reflect true semantics of
objects such as images, videos etc.
 Vagueness: The information that the user requires lacks clarity, is only vaguely
expressed in a query, feedback or user action.

3. System Evaluation - System evaluation is about determining the impact of the retrieved information on user achievement, and about measuring the efficiency of the system in terms of time and space.

4.5 Lexical Resources


4.5.1 WORDNET
WordNet is a large lexical database for the English language. Inspired by psycholinguistic theories, it was developed and is being maintained at the Cognitive Science Laboratory, Princeton University, under the direction of George A. Miller. WordNet consists of three databases: one for nouns, one for verbs, and one for both adjectives and adverbs. Information is organized into sets of synonymous words called synsets, each representing one base concept. The synsets are linked to each other by means of lexical and semantic relations. Lexical relations occur between word-forms (i.e., senses) and semantic relations between word meanings. These relations include synonymy, hypernymy/hyponymy, antonymy, meronymy/holonymy, troponymy, etc. A word may appear in more than one synset and in more than one part-of-speech. The meaning of a word is called its sense. WordNet lists all senses of a word, each sense belonging to a different synset. WordNet's sense-entries consist of a set of synonyms and a gloss. A gloss consists of a dictionary-style definition and examples demonstrating the use of a synset in a sentence.
Figure 12.1 shows the entries for the word 'read'. 'Read' has one sense as a noun and 11 senses as a verb.
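The synsets, parts of speech, and glosses for a word can be inspected with NLTK's WordNet interface (a small sketch, assuming nltk and its wordnet data are installed, e.g. via nltk.download('wordnet')):

from nltk.corpus import wordnet as wn

for syn in wn.synsets("read"):
    print(syn.name(), syn.pos(), "-", syn.definition())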

WordNet is freely and publicly available for download from http://wordnet.princeton.edu/obtain. WordNets for other languages have also been developed, e.g., EuroWordNet and Hindi WordNet. EuroWordNet covers European languages, including English, Dutch, Spanish, Italian, German, French, Czech, and Estonian. Other than language-internal relations, it also contains multilingual relations from each WordNet to English meanings.
Hindi WordNet has been developed by CFILT (Resource Center for Indian Language Technology Solutions), IIT Bombay. Its database consists of more than 26,208 synsets and 56,928 Hindi words. It is organized using the same principles as the English WordNet but includes some Hindi-specific relations (e.g., causative relations). A total of 16 relations have been used in Hindi WordNet. Each entry consists of a synset, a gloss, and the position of the synset in the ontology.

Application of WordNet
WordNet has found numerous applications in problems related to IR and NLP. Some of these are discussed here.
1. Concept Identification in Natural Language

WordNet can be used to identify concepts pertaining to a term, to suit them to the full semantic
richness and complexity of a given information need.
2. Word Sense Disambiguation
WordNet combines features of a number of other resources commonly used in disambiguation work. It offers sense definitions of words, identifies synsets of synonyms, defines a number of semantic relations, and is freely available. This makes it the (currently) best known and most utilized resource for word sense disambiguation. One of the earliest attempts to use WordNet for word sense disambiguation in IR was by Voorhees. She used the WordNet noun hierarchy (hypernymy/hyponymy) to achieve disambiguation. A number of other researchers have also used WordNet for the same purpose.
3. Automatic Query Expansion
WordNet semantic relations can be used to expand queries so that the search for a document is
not confined to the pattern-matching of query terms, but also covers synonyms. The work
performed by Voorhees is based on the use of WordNet relations, such as synonyms,
hypernyms, and hyponyms, to expand queries.
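A hedged sketch of such WordNet-based expansion, using NLTK and expanding each query term with its synonyms and direct hypernyms (assuming the wordnet data is installed), might look like this:

from nltk.corpus import wordnet as wn

def expand(term):
    # expand a query term with its synonyms and direct hypernyms
    words = {term}
    for syn in wn.synsets(term):
        words.update(l.name().replace("_", " ") for l in syn.lemmas())
        for hyper in syn.hypernyms():
            words.update(l.name().replace("_", " ") for l in hyper.lemmas())
    return words

print(expand("car"))   # includes, e.g., 'automobile' and 'motor vehicle'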
4. Document Structuring and Categorization
The semantic information extracted from WordNet, and WordNet's conceptual representation of knowledge, have been used for text categorization (Scott and Matwin 1998).
5. Document Summarization
WordNet has found useful application in text summarization. The approach presented by
Barzilay and Elhadad (1997) utilizes information from WordNet to compute lexical chains.

4.5.2 FRAMENET
FrameNet is a large database of semantically annotated English sentences. It is based on the principles of frame semantics. It defines a tagset of semantic roles called frame elements. Sentences from the British National Corpus are tagged with these frame elements. The basic philosophy involved is that each word evokes a particular situation with particular participants. FrameNet aims at capturing these situations through a case-frame representation of words (verbs, adjectives, and nouns). The word that invokes a frame is called the target word or predicate, and the participant entities are defined using semantic roles, which are called frame elements.

The FrameNet ontology can be viewed as a semantic-level representation of predicate-argument structure. Each frame contains a main lexical item as predicate and associated frame-specific semantic roles, such as AUTHORITIES, TIME, and SUSPECT in the ARREST frame, called frame elements. A frame may inherit roles from another frame. For example, a STATEMENT frame may inherit from a COMMUNICATION frame; it contains roles such as SPEAKER, ADDRESSEE, and MESSAGE.
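Frames can also be explored programmatically through NLTK's FrameNet corpus reader; the sketch below assumes nltk is installed, the framenet_v17 data has been downloaded (nltk.download('framenet_v17')), and that the arrest frame is listed under the name 'Arrest'.

from nltk.corpus import framenet as fn

frame = fn.frame("Arrest")
print(frame.name)
print(sorted(frame.FE.keys()))        # frame elements such as Authorities, Suspect, Time
print(sorted(frame.lexUnit.keys()))   # lexical units that evoke the frame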

FrameNet Applications
Gildea and Jurafsky (2002) and Kwon et al. (2004) used FrameNet data for automatic semantic parsing. The shallow semantic roles obtained from FrameNet can play an important role in information extraction. For example, a semantic role makes it possible to identify that the theme role played by 'match' is the same in sentences (1) and (2), even though the syntactic role is different.

The umpire stopped the match. (1)


The match stopped due to bad weather. (2)
In sentence (1), the word 'match' is the object, while it is the subject in sentence (2).
Semantic roles may help in question-answering systems. For example, the verbs 'send' and 'receive' would share the semantic roles SENDER, RECIPIENT, GOODS, etc., when defined with respect to a common TRANSFER frame. Such common frames allow a question-answering system to answer a question such as "Who sent a packet to Khushbu?" using sentence (3).
Khushbu received a packet from the examination cell. (3)
Other applications include IR, interlingua for machine translation, text summarization, and
word sense disambiguation.

4.5.3 STEMMERS
Stemming, often called conflation, is the process of reducing inflected (or sometimes derived) words to their base or root form. The stem need not be identical to the morphological base of the word: it is usually sufficient that related words map to the same stem, even if this stem is not in itself a valid root. Stemming is useful in search engines for query expansion or indexing and in other NLP problems. Stemming programs are commonly referred to as stemmers. The most common algorithm for stemming English is Porter's algorithm (Porter 1980). Other existing stemmers include the Lovins stemmer (Lovins 1968) and a more recent one called the Paice/Husk stemmer (Paice 1990). Figure 12.10 shows a sample text and the output produced using these stemmers.

1. Stemmers for European Languages


There are many stemmers available for English and other languages. Snowball presents stemmers for English, Russian, and a number of other European languages, including French, Spanish, Portuguese, Hungarian, Italian, German, Dutch, Swedish, Norwegian, Danish, and Finnish. The links for the stemming algorithms for these languages can be found at http://snowball.tartarus.org/texts/stemmersoverview.html.
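The Snowball stemmers are also exposed through NLTK; a small sketch, assuming nltk is installed, is shown below.

from nltk.stem.snowball import SnowballStemmer

print(SnowballStemmer.languages)      # languages supported by the Snowball stemmers
en = SnowballStemmer("english")
fr = SnowballStemmer("french")
print(en.stem("retrieval"), fr.stem("cherchons"))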
2. Stemmers for Indian Languages
Standard stemmers are not yet available for Hindi and other Indian languages. The major research on Hindi stemming has been accomplished by Ramanathan and Rao (2003) and Majumder et al. Ramanathan and Rao based their work on the use of hand-crafted suffix lists. Majumder et al. used a cluster-based approach to find classes of root words and their morphological variants. They used a task-based evaluation of their approach and reported that stemming improves recall for Indian languages. Their observation on Indian languages was based on a Bengali data set. The Resource Centre for Indian Language Technology (CFILT), IIT Bombay, has also developed stemmers for Indian languages, which are available at http://www.cfilt.iitb.ac.in.
3. Stemming Applications
Stemmers are common elements in search and retrieval systems such as Web search engines. Stemming reduces the variants of a word to the same stem. This reduces the size of the index and also helps retrieve documents that contain variants of a query term. For example, a user issuing a query for documents on 'astronauts' would like documents on 'astronaut' as well. Stemming permits this by reducing both versions of the word to the same stem. However, the effectiveness of stemming for English query systems is not too great, and in some cases it may even reduce precision. Text summarization and text categorization also involve term
frequency analysis to find features. In this analysis, stemming is used to transform various
morphological forms of words into their stems.

4.5.4 PART-OF-SPEECH TAGGER


Part-of-speech tagging is used at an early stage of text processing in many NLP applications such as speech synthesis, machine translation, IR, and information extraction. In IR, part-of-speech tagging can be used in indexing (for identifying useful tokens like nouns), extracting phrases, and disambiguating word senses. The rest of this section presents a number of part-of-speech taggers that are already in place.
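As a quick illustration (separate from the taggers listed below), NLTK's default tagger can be used as follows, assuming nltk and its tokenizer and tagger data (e.g. 'punkt' and 'averaged_perceptron_tagger') are installed:

import nltk

tokens = nltk.word_tokenize("Information retrieval systems index large document collections.")
print(nltk.pos_tag(tokens))   # prints (token, tag) pairs such as ('retrieval', 'NN')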
1. Stanford Log-linear Part-of-Speech (POS) Tagger
This POS tagger is based on maximum entropy Markov models. The key features of the tagger are as follows:
(i) It makes explicit use of both the preceding and following tag contexts via a dependency network representation.
(ii) It uses a broad range of lexical features.
(iii) It utilizes priors in conditional log-linear models.
The reported accuracy of this tagger on the Penn Treebank WSJ is 97.24%, which amounts to an error reduction of 4.4% on the best previous single automatically learned tagging result.

2. A Part-of-Speech Tagger for English


This tagger uses a bi-directional inference algorithm for part-of-speech tagging. It is based on maximum entropy Markov models (MEMM). The algorithm can enumerate all possible decomposition structures and find the highest probability sequence together with the corresponding decomposition structure in polynomial time. Experimental results of this part-of-speech tagger show that the proposed bi-directional inference methods consistently outperform unidirectional inference methods, and bi-directional MEMMs give comparable performance to that achieved by state-of-the-art learning algorithms, including kernel support vector machines.

3. TnT tagger
Trigrams'n'Tags, or TnT, is an efficient statistical part-of-speech tagger. This tagger is based on hidden Markov models (HMM) and uses some optimization techniques for smoothing and handling unknown words. It performs at least as well as other current approaches, including
the maximum entropy framework. Table 12.1 shows tagged text of document #93 of the
CACM collection.

4. Brill Tagger
Brill described a trainable rule-based tagger that obtained performance comparable to that of stochastic taggers. It uses transformation-based learning to automatically induce rules. A number of extensions to this rule-based tagger have been proposed by Brill (1994). He describes a method for expressing lexical relations in tagging that stochastic taggers are currently unable to express, implements a rule-based approach to tagging unknown words, and demonstrates how the tagger can be extended into a k-best tagger, where multiple tags can be assigned to words in some cases of uncertainty.

5. CLAWS Part-of-Speech Tagger for English


Constituent likelihood automatic word-tagging system (CLAWS) is one of the earliest
probabilistic taggers for English. It was developed at the University of Lancaster. The latest
version of the tagger, CLAWS4, can be considered a hybrid tagger as it involves both
probabilistic and rule-based elements. It has been designed so that it can be easily adapted to
different types of text in different input formats. CLAWS has achieved 96-97% accuracy. The
precise degree of accuracy varies according to the type of text.

6. Tree-Tagger
Tree-Tagger (Schmid 1994) is a probabilistic tagging method. It avoids the problems faced by Markov model methods when estimating transition probabilities from sparse data by using a decision tree to estimate transition probabilities. The decision tree automatically determines the appropriate size of the context to be used in estimation. The reported accuracy of the tagger is above 96% on the Penn Treebank WSJ corpus.

7. ACOPOST: A Collection of POS Taggers


ACOPOST is a set of freely available POS taggers. The taggers in the set are based on different frameworks. The programs are written in C. ACOPOST currently consists of the following four taggers.

7.1 Maximum Entropy Tagger (MET)


This tagger is based on a framework suggested by Ratnaparkhi (1997). It uses an iterative
procedure to successively improve parameters for a set of features that help to
distinguish between relevant contexts.
7.2 Trigram Tagger (T3)
This tagger is based on HMM. The states in the model are tag pairs that emit words. The
technique has been suggested by Rabiner (1990) and the implementation is influenced by
Brants (2000).
7.3 Error-driven Transformation-based Tagger (TBT)
This tagger is based on the transformation-based tagging approach proposed by Brill (1993). It uses annotated corpora to learn transformation rules, which are then used to change the assigned tag using contextual information.
7.4 Example-based Tagger (ET)
The underlying assumption of example-based models (also called memory- based,
instance-based or distance-based models) is that cognitive behaviour can be achieved by
looking at past experiences that match the current problem, instead of learning and
applying abstract rules.

4.5.6 RESEARCH CORPORA


Research corpora have been developed for a number of NLP-related tasks. In the following section, we point out a few of the available standard document collections for a variety of NLP-related tasks, along with their Internet links.
1. IR Test Collection
Table 12.2 lists the sources of these and a few more IR test collections. LETOR (learning to rank) is a package of benchmark data sets released by Microsoft Research Asia. It consists of two datasets, OHSUMED and TREC (TD2003 and TD2004). LETOR is packaged with extracted features for each query-document pair in the collection, baseline results of several state-of-the-art learning-to-rank algorithms on the data, and evaluation tools. The data set is aimed at
supporting future research in the area of learning ranking function for information retrieval.

2. Summarization Data
Evaluating a text summarization system requires the existence of 'gold' summaries. DUC provides document collections with known extracts and abstracts, which are used for evaluating the performance of summarization systems submitted at TREC conferences. Figure 12.11 shows a sample document and its extract from the DUC 2002 summarization data.

3. Word Sense Disambiguation


SEMCOR is a sense-tagged corpus used in disambiguation. It is a subset of the Brown corpus, sense-tagged with WordNet synsets. Open Mind Word Expert attempts to create a very large sense-tagged corpus by collecting word sense taggings from the general public over the Web.

4. Asian Language Corpora

The multilingual EMILLE corpus is the result of the Enabling Minority Language Engineering (EMILLE) project at Lancaster University, UK. The project focuses on the generation of data, software resources, and basic language engineering tools for the NLP of South Asian languages. The Central Institute of Indian Languages (CIIL), the Indian partner in the project, extended the set of target languages to include a number of Indian languages. CIIL provides a wider range of data in these languages from a wide range of genres.

The data sources that EMILLE made available include monolingual written and spoken corpora, and parallel and annotated corpora. The monolingual corpus includes written data for 14 South Asian languages and spoken data for five languages (Hindi, Bengali, Gujarati, Punjabi, and Urdu). The spoken corpus was constructed from radio broadcasts on the BBC Asia network. The parallel corpus contains English text and its translation in five languages. The text includes UK government advice leaflets which are published in multiple languages. The corpus is aligned at the sentence level. The parallel corpus provided by EMILLE is a valuable resource for statistical machine translation research. The annotated component includes Urdu data annotated for part-of-speech tagging, and a Hindi corpus annotated to show the nature of demonstrative use.

Prepared by: Prof. M.KALIDASS Department of AIML, SSCE, Anekal, Bengaluru.
