Phases of NLP:
Natural Language Processing works on multiple levels, and most often these
levels complement one another. This article offers a brief overview of each level
and provides some examples of how they are used in information retrieval.
Morphological
The morphological level of linguistic processing deals with the study of word
structures and word formation, focusing on the analysis of the individual
components of words. The most important unit of morphology, defined as the
“minimal unit of meaning”, is referred to as the morpheme.
Take, for example, the word “unhappiness”. It can be broken down into three
morphemes (prefix, stem, and suffix), each conveying some form of meaning: the
prefix un- expresses negation (“not”), while the suffix -ness denotes “a state of
being”.
The stem happy is considered a free morpheme since it is a “word” in its own
right.
Bound morphemes (prefixes and suffixes) require a free morpheme to which they
can attach, and therefore cannot appear as “words” on their own.
In Information Retrieval, document and query terms can be stemmed to match
morphological variants between the documents and the query, so that the singular
form of a noun in a query will also match its plural form in a document, and vice
versa, thereby increasing recall.
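As a small illustration, the sketch below builds a tiny inverted index over stemmed
document terms, assuming the NLTK package and its PorterStemmer are available; the
document texts, query, and index structure are made up for the example.

# A minimal sketch of stemming-based matching in IR (assumes nltk is installed).
# Morphological variants such as "grain" and "grains" share one index term.
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

documents = {
    1: "The farmer stored the grains in the barn",
    2: "A single grain of rice",
}
query = "grain"

# Build a tiny inverted index keyed by stemmed document terms.
index = {}
for doc_id, text in documents.items():
    for term in text.lower().split():
        index.setdefault(stemmer.stem(term), set()).add(doc_id)

# Stem the query the same way; both documents now match.
print(index.get(stemmer.stem(query), set()))  # {1, 2}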
Lexical
Lexical analysis is the process of trying to understand what words mean, intuit
their context, and note the relationship of one word to others. It is often the entry
point to many NLP data pipelines.
Lexical analysis can come in many forms and varieties. It is used as the first
step of a compiler, for example, where it takes a source code file and breaks the
lines of code into a series of "tokens", removing any whitespace or comments.
In other types of analysis, lexical analysis might preserve multiple words
together as an "n-gram" (or a sequence of items).
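As a small, dependency-free illustration, the sketch below tokenizes a sentence and
builds bigrams; a production tokenizer or lexer would of course handle many more cases.

# A toy tokenizer plus n-gram builder (pure Python, illustrative only).
import re

def tokenize(text):
    # Keep alphanumeric runs; discard whitespace and punctuation.
    return re.findall(r"[A-Za-z0-9_]+", text.lower())

def ngrams(tokens, n):
    # Slide a window of size n across the token sequence.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = tokenize("The pitcher walked to the baseball field.")
print(tokens)             # ['the', 'pitcher', 'walked', 'to', 'the', 'baseball', 'field']
print(ngrams(tokens, 2))  # bigrams, including ('baseball', 'field')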
After tokenization, the computer will proceed to look up words in a dictionary
and attempt to extract their meanings.
For a compiler, this would involve finding keywords and associating operations
or variables with the tokens.
In other contexts, such as a chat bot, the lookup may involve using a database to
match intent. As noted above, there are often multiple meanings for a specific
word, which means that the computer has to decide what meaning the word has in
relation to the sentence in which it is used.
This second task is often accomplished by associating each word in the dictionary
with the context of the target word. For example, the phrase "baseball field" may be
tagged in the machine as LOCATION for syntactic analysis.
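The sketch below illustrates such a lookup with a toy, hand-written lexicon: a
multi-word entry is tagged LOCATION, and an ambiguous word is resolved by checking
which of its senses shares the most context words with the rest of the sentence. The
entries, senses, and tag names are all assumptions made for the example.

# A toy lexicon lookup; entries and tags are illustrative, not from a real resource.
lexicon = {
    "baseball field": {"tag": "LOCATION"},
    "pitcher": {
        "senses": {
            "baseball player": {"context": {"baseball", "field", "throw"}},
            "jug for liquids": {"context": {"water", "pour", "glass"}},
        }
    },
}

tokens = ["the", "pitcher", "walked", "to", "the", "baseball field"]
sentence_words = set(" ".join(tokens).split())

for token in tokens:
    entry = lexicon.get(token)
    if not entry:
        continue
    if "tag" in entry:
        print(token, "->", entry["tag"])   # baseball field -> LOCATION
    else:
        # Choose the sense whose context words overlap most with the sentence.
        sense, _ = max(entry["senses"].items(),
                       key=lambda kv: len(kv[1]["context"] & sentence_words))
        print(token, "->", sense)          # pitcher -> baseball player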
Syntactic
The syntax of the input string refers to the arrangement of words in a sentence
so they grammatically make sense. NLP uses syntactic analysis to assess whether
or not the natural language aligns with grammatical or other logical rules.
To apply these grammar rules, a collection of algorithms is utilized to describe
words and derive meaning from them. Syntax techniques that are frequently used
in NLP include the following:
Lemmatization / Stemming - reduces words to simpler forms that have less
variation. Lemmatization uses a dictionary to map each word to its root form
(lemma), while stemming uses simple matching patterns to strip away suffixes
such as 's' and 'ing'. A small sketch contrasting the two follows this list.
Parsing - This is the process of performing grammatical analysis of a given
sentence. A common method is Dependency Parsing, which assesses the
relationships between the words in a sentence.
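The sketch below contrasts lemmatization and stemming using a toy lemma dictionary
and a toy suffix list; real systems rely on resources such as WordNet and far more
careful rules.

# Toy lemmatization (dictionary lookup) versus stemming (suffix stripping).
LEMMAS = {"birds": "bird", "pecked": "peck", "geese": "goose", "better": "good"}
SUFFIXES = ["ing", "ed", "s"]

def lemmatize(word):
    # Dictionary lookup: map an inflected or irregular form to its root word.
    return LEMMAS.get(word, word)

def stem(word):
    # Pattern matching: blindly strip a known suffix if one is present.
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

for word in ["birds", "pecked", "geese", "better"]:
    print(word, "-> lemma:", lemmatize(word), "| stem:", stem(word))
# The lemmatizer handles irregular forms ("geese", "better"); the stemmer cannot.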
Nevertheless, syntax can still be under-constrained or ambiguous at times. Consider,
for example, the following simple grammar for parsing the sentence −
“The bird pecks the grains”
Articles (DET) − a | an | the
Adjectives (ADJ) − beautiful | small | chirping
Nouns (N) − bird | birds | grain | grains
Verbs (V) − pecks | pecking | pecked
Noun Phrase (NP) − Article + Noun | Article + Adjective + Noun = DET N | DET ADJ N
Verb Phrase (VP) − Verb + Noun Phrase = V NP
The parse tree breaks down the sentence into structured parts so that the computer
can easily understand and process it. For the parsing algorithm to construct this
parse tree, a set of rewrite rules, which describe what tree structures are legal,
needs to be defined.
These rules say that a certain symbol may be expanded in the tree into a sequence
of other symbols. For example, if there is a Noun Phrase (NP) followed by a Verb
Phrase (VP), then the string formed by NP followed by VP is a sentence (S). The
rewrite rules for the sentence are as follows −
S → NP VP
NP → DET N | DET ADJ N
VP → V NP
Lexicon −
DET → a | the
ADJ → beautiful | perching
N → bird | birds | grain | grains
V → peck | pecks | pecking
Using these rules, the parse tree for "The bird pecks the grains" can be constructed:
(S (NP (DET the) (N bird)) (VP (V pecks) (NP (DET the) (N grains))))
Now consider the rewrite rules above. Since V can be rewritten as either "peck" or
"pecks", a sentence such as "The bird peck the grains" is also wrongly permitted,
i.e. the subject-verb agreement error is accepted as correct.
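The rewrite rules above can be expressed directly with NLTK's context-free grammar
tools, assuming the nltk package is installed; running the sketch shows that the
grammar accepts the ungrammatical sentence as well.

# The rewrite rules encoded as an NLTK CFG; both sentences receive a parse.
import nltk

grammar = nltk.CFG.fromstring("""
S   -> NP VP
NP  -> DET N | DET ADJ N
VP  -> V NP
DET -> 'a' | 'the'
ADJ -> 'beautiful' | 'perching'
N   -> 'bird' | 'birds' | 'grain' | 'grains'
V   -> 'peck' | 'pecks' | 'pecking'
""")
parser = nltk.ChartParser(grammar)

for sentence in ["the bird pecks the grains", "the bird peck the grains"]:
    trees = list(parser.parse(sentence.split()))
    print(sentence, "->", len(trees), "parse(s)")
    for tree in trees:
        print(tree)  # e.g. (S (NP (DET the) (N bird)) (VP (V pecks) (NP (DET the) (N grains))))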
Semantic Analysis
Semantics refers to the meaning that is conveyed by the input text. This analysis is
one of the difficult tasks involved in NLP, as it requires algorithms to understand
the meaning and interpretation of words in addition to the overall structure of a
sentence. Semantic analysis techniques include:
Entity Extraction - This means identifying and extracting categorical entities
such as people, places, companies, or things. It is essential for simplifying the
contextual analysis of natural language (a short sketch follows this list).
Machine Translation - This is used to automatically translate text from one
human language to another.
Natural Language Generation - This is the process of converting the computer's
internal semantic representation into readable human language. It is used by
chatbots to respond to users effectively and realistically.
Natural Language Understanding - This involves converting pieces of text into
logically structured representations that computer programs can easily manipulate.
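As a brief illustration of entity extraction, the sketch below uses spaCy's
pretrained pipeline, assuming both the spacy package and its small English model
(en_core_web_sm) are installed; the exact labels returned depend on the model.

# Named-entity extraction with spaCy (assumes en_core_web_sm is downloaded).
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple opened a new office in Paris, and Tim Cook attended the launch.")

for ent in doc.ents:
    # ent.label_ is the predicted category, e.g. ORG, GPE (place), PERSON.
    print(ent.text, "->", ent.label_)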
Discourse Processing
The discourse level of linguistic processing deals with the analysis of structure
and meaning of text beyond a single sentence, making connections between
words and sentences. At this level, Anaphora Resolution is also achieved by
identifying the entity referenced by an anaphor (most commonly, though not
exclusively, a pronoun). An example is sketched below.
Fig: Anaphora Resolution Illustration
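A minimal heuristic sketch of anaphora resolution is given below: each pronoun is
linked to the nearest preceding entity whose number and gender agree. The entity
list, feature table, and example sentence are toy data invented for illustration;
real resolvers use far richer features.

# Toy anaphora resolution: link a pronoun to the nearest compatible antecedent.
# Example discourse: "Alice finished the report and she submitted it."
ENTITIES = [
    {"mention": "Alice", "gender": "f", "number": "sg", "position": 0},
    {"mention": "the report", "gender": "n", "number": "sg", "position": 2},
]
PRONOUNS = {"she": ("f", "sg"), "he": ("m", "sg"), "it": ("n", "sg"), "they": (None, "pl")}

def resolve(pronoun, position):
    gender, number = PRONOUNS[pronoun]
    # Scan candidates from nearest to farthest; the first compatible match wins.
    for entity in sorted(ENTITIES, key=lambda e: e["position"], reverse=True):
        if (entity["position"] < position and entity["number"] == number
                and (gender is None or entity["gender"] == gender)):
            return entity["mention"]
    return None

print("she ->", resolve("she", position=5))  # Alice
print("it  ->", resolve("it", position=7))   # the report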
With the capability to recognize and resolve anaphoric relationships, document and
query representations are improved: at the lexical level, the implicit presence of
concepts is accounted for throughout the document as well as in the query, while at
the semantic and discourse levels, an integrated content representation of the
documents and queries is generated.
Structured documents also benefit from the analysis at the discourse level since
sections can be broken down into (1) title, (2) abstract, (3) introduction, (4) body,
(5) results, (6) analysis, (7) conclusion, and (8) references. Information Retrieval
systems are significantly improved when the specific role of each piece of
information is determined, for example whether it is a conclusion, an opinion, a
prediction, or a fact.
Pragmatic Processing
The pragmatic level of linguistic processing deals with the use of real-world
knowledge and understanding of how this impacts the meaning of what is being
communicated. By analyzing the contextual dimension of the documents and
queries, a more detailed representation is derived.
In Information Retrieval, this level of Natural Language Processing primarily
supports query processing and understanding by integrating the user's history and
goals as well as the context in which the query is made. Such context may include
time and location.
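The sketch below shows one way such context might be attached to a query before
retrieval; the field names and context structure are assumptions made for
illustration rather than a real IR interface.

# Toy pragmatic query enrichment: combine the raw query with user context.
from datetime import datetime

def enrich_query(raw_query, user_context):
    return {
        "terms": raw_query.lower().split(),
        "location": user_context.get("location"),           # e.g. boost nearby results
        "time": user_context.get("time", datetime.now()).isoformat(),
        "recent_topics": user_context.get("history", [])[-3:],  # last few searches
    }

context = {"location": "Manila", "history": ["weather forecast", "flight status"]}
print(enrich_query("good restaurants nearby", context))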
This level of analysis enables major advances in Information Retrieval, as it
facilitates a dialogue between the IR system and its users: the system can elicit
the purpose for which the information being sought will be used, helping to ensure
that the retrieval system is fit for purpose.