NLP Unit 1
UNIT 1
Introduction
Words are considered the smallest linguistic units capable of conveying meaning through
utterance. However, the concept of a "word" can vary significantly across languages. The
following are the various fundamental components of words:
•Tokens: In many languages, such as English, words are delimited by whitespace and
punctuation, forming tokens. Yet, this is not a universal rule; languages like Japanese,
Chinese, and Thai utilise character strings without whitespace for word delimitation. Other
languages, like Arabic or Hebrew, concatenate certain tokens, where word forms change
depending on preceding or following elements.
•Morphemes: These are the minimal parts of words that convey meaning. Morphemes
constitute the fundamental morphological units and contribute to the overall meaning of a
word.
Morphological Typology
Morphological typology classifies languages based on the number of morphemes per word
and the degree of fusion between them. Common types include isolating, synthetic,
agglutinative, fusional, and concatenative languages; several of these are described later in
these notes.
*****
1. Irregularity:
Irregularity in language refers to instances where word forms do not follow general rules or
predictable patterns, posing a considerable challenge for morphological parsing. These
irregularities are particularly pronounced in languages with rich morphology, such as Arabic
or Korean, and can affect both derivation and inflection.
It hinders the ability of linguistic systems to generalise and abstract from observed word
forms, necessitating detailed descriptions of each irregular form. This can lead to issues
with accuracy, increased computational complexity, and difficulties in verifying associated
information.
Arabic, with its morphologically rich nature, presents complex challenges due to its deep-rooted morphological processes and
the interaction between phonology and orthography. The deep study of morphological
processes in Arabic is essential for mastering inflection and derivation, as well as for
handling irregular forms. For instance, certain irregular forms might not be derived from the
general morphological template, as seen with the word jadīd 'new'.
2. Ambiguity
Ambiguity in morphological parsing arises when a word form can be understood in multiple
ways, possessing distinct functions or meanings. While morphological parsing identifies the
components of words, it does not directly concern the disambiguation of words in their
context. Homonyms, words that look alike but have different meanings, are particularly
problematic in morphological analysis.
Korean provides clear examples of systematic homonyms, where the same word form can
represent different meanings depending on its context or the endings attached to it, as
illustrated by the example of 'mwut.ko', which can mean 'bury', 'ask', or 'bite'. Arabic, with
its morphologically rich nature, frequently presents ambiguities, especially because the
written script often omits the diacritical marks that distinguish different forms.
Even in Czech, while morphology helps to identify semantic features, it can introduce
ambiguities concerning abstract semantic forms. For example, the meaning of 'stavení' in
Czech might be unclear without additional context. This highlights that even with detailed
morphological information, the intended meaning can remain elusive, requiring further
contextual analysis.
3. Productivity
Productivity, the ability of a language to form new words continually, poses a challenge
for maintaining comprehensive lexicons.
Productivity refers to the inherent capacity of a language to generate new words or new
forms of existing words. This includes both inflectional changes (like verb conjugations or
noun declensions) and derivational processes (creating new words from existing ones).
Human languages are considered "open class" systems, meaning they can continually
create an infinite number of linguistic units, making it impossible to list every possible word
form explicitly in a dictionary or lexicon.
This ongoing creation of new forms poses a significant challenge for traditional dictionary-
based lookup approaches, which are finite and cannot account for all possible word forms.
Therefore, morphological analysis systems must be able to process and understand novel
or unseen word forms that are not explicitly pre-computed. This necessitates sophisticated
computational linguistic models, such as finite-state transducers and unification-based
approaches, which can handle productive patterns and generate/recognise new forms
based on underlying rules rather than just stored entries. The goal is to develop robust
systems that can effectively manage the dynamism of language, predicting and parsing
forms that have not been encountered before.
*****
MORPHOLOGICAL MODELS
1. Dictionary Lookup:
Dictionary lookup is a fundamental process in morphological analysis where word forms are
associated with their corresponding linguistic descriptions. This method relies on
precomputed data structures like lists, dictionaries, or databases, which are kept
synchronised with sophisticated morphological models.
Limitations: While effective, the set of associations between word forms and their
desired descriptions is finite, so the generative potential of the language is not fully
exploited. Building and maintaining such resources can be tedious, prone to errors, and
inefficient for large or unreliable linguistic data, although enumerative models are often
sufficient for general purposes. The approach is also less suitable for complex morphology,
as seen in Korean, where dictionary-based approaches would require a large dictionary of
all possible combinations of allomorphs and morphological alternations.
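As a rough illustration of dictionary lookup, the following minimal Python sketch maps word forms to precomputed analyses; the lexicon entries and tag labels are invented for illustration and not taken from any particular resource:

# Minimal sketch of dictionary-based morphological lookup.
# The lexicon entries and tag notation below are invented for illustration.
LEXICON = {
    "cats": [("cat", "NOUN", "plural")],
    "ran":  [("run", "VERB", "past")],
    "saw":  [("see", "VERB", "past"), ("saw", "NOUN", "singular")],  # homonymy
}

def lookup(word_form):
    """Return all stored analyses for a word form, or an empty list if unseen."""
    return LEXICON.get(word_form.lower(), [])

print(lookup("saw"))   # two analyses -> ambiguity
print(lookup("dogs"))  # [] -> productivity problem: unseen forms cannot be analysed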
2. Finite-State Morphology:
Functionality: Finite-state transducers (FSTs) can compute and compare regular relations,
defining the relationship between an input (surface string) and an output (lexical string,
including morphemes and features). Finite-state morphology is well-suited for analysing
morphological processes in various languages, including isolating and agglutinative types.
FSTs can be used to construct full-fledged morphological analysers (parsing words into
morphemes), morphological generators (producing word forms from morphemes), and
tokenizers.
Advantages: FSTs are flexible, efficient, and robust. They offer a general-purpose
approach for pattern matching and substitution, allowing for the building of complex
morphological analysers and generators.
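A minimal sketch of an FST-style analyser in Python is shown below; the states, transitions, and feature labels are invented toy values covering only the forms "cat" and "cats", not a real morphological grammar:

# Toy finite-state transducer mapping a few surface forms to lexical strings,
# e.g. "cats" -> "cat+N+PL". States and labels are invented for illustration.
# transitions: (state, input_symbol) -> (next_state, output_symbol)
TRANSITIONS = {
    (0, "c"): (1, "c"), (1, "a"): (2, "a"), (2, "t"): (3, "t"),
    (3, "s"): (4, "+N+PL"),   # the plural suffix maps to morphological features
    (3, ""):  (5, "+N+SG"),   # epsilon-like move for the bare singular form
}
FINAL_STATES = {4, 5}

def analyse(surface):
    state, output = 0, []
    for ch in surface:
        if (state, ch) not in TRANSITIONS:
            return None
        state, out = TRANSITIONS[(state, ch)]
        output.append(out)
    # allow one final epsilon-like move (e.g. the singular reading of "cat")
    if state not in FINAL_STATES and (state, "") in TRANSITIONS:
        state, out = TRANSITIONS[(state, "")]
        output.append(out)
    return "".join(output) if state in FINAL_STATES else None

print(analyse("cats"))  # cat+N+PL
print(analyse("cat"))   # cat+N+SG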
3. Unification-Based Morphology:
Functionality: This model can manage complex and recursively nested linguistic
information, expressed by atomic symbols or more appropriate data structures.
Unification, as the key operation, merges informative feature structures, making it highly
versatile for representing intricate linguistic details.
Advantages: These models are typically formulated as logic programs and use unification
to solve constraint systems, which offers advantages such as better abstraction possibilities.
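A small sketch of unification over feature structures, represented here as nested Python dictionaries; the feature names are invented for illustration:

# Sketch of unification over simple (possibly nested) feature structures,
# represented as Python dicts. Feature names are invented for illustration.
def unify(fs1, fs2):
    """Merge two feature structures; return None if they carry conflicting values."""
    result = dict(fs1)
    for key, value in fs2.items():
        if key not in result:
            result[key] = value
        elif isinstance(result[key], dict) and isinstance(value, dict):
            sub = unify(result[key], value)   # recurse into nested structures
            if sub is None:
                return None
            result[key] = sub
        elif result[key] != value:
            return None                       # clash: unification fails
    return result

stem   = {"pos": "verb", "agr": {"person": 3}}
suffix = {"agr": {"person": 3, "number": "sg"}, "tense": "present"}
print(unify(stem, suffix))           # merged structure
print(unify(stem, {"pos": "noun"}))  # None: conflicting category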
4. Functional Morphology:
Functional Morphology is a model that defines morphological operations using principles of
functional programming and type theory.
Advantages: This approach offers greater freedom for developers to define their own
lexical constructions, leading to domain-specific embedded languages for morphological
analysis. It supports full-featured, real-world applications and promotes reusability of
linguistic data.
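In the spirit of functional morphology, a paradigm can be sketched as an ordinary function from a stem to a table of inflected forms; the rules below are a toy treatment of regular English verbs only:

# Sketch in the spirit of functional morphology: an inflection paradigm is a
# function from a stem to a table of forms. The regular English verb rules
# below are illustrative only and do not handle irregular verbs.
def regular_verb_paradigm(stem):
    return {
        "infinitive":  stem,
        "3sg_present": stem + ("es" if stem.endswith(("s", "sh", "ch")) else "s"),
        "past":        stem + ("d" if stem.endswith("e") else "ed"),
        "present_participle": (stem[:-1] if stem.endswith("e") else stem) + "ing",
    }

print(regular_verb_paradigm("bake"))
print(regular_verb_paradigm("wash"))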
5. Morphology Induction:
Morphology Induction focuses on discovering and inferring word structure, moving beyond
pre-existing linguistic knowledge.
Motivation: This approach is especially valuable for languages where linguistic expertise is
limited or unavailable or for situations where an unsupervised or semi-supervised learning
method is preferred.
Challenges: Deducing word structure from forms and context presents several
challenges, including dealing with ambiguity and irregularity in morphology, as well
as orthographic and phonological alterations and non-linear morphological
processes.
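A very rough sketch of one induction idea, proposing suffixes by checking whether the remainder after a known stem recurs across a word list; the word list and frequency threshold are invented for illustration:

# Rough sketch of unsupervised morphology induction: propose suffixes by
# counting how often a split point leaves a stem that itself occurs in the
# word list. The corpus and threshold below are invented for illustration.
from collections import Counter

words = ["walk", "walks", "walked", "walking", "talk", "talks", "talked",
         "play", "plays", "played", "playing"]
vocab = set(words)

suffix_counts = Counter()
for w in words:
    for i in range(1, len(w)):
        stem, suffix = w[:i], w[i:]
        if stem in vocab:            # remainder after a known stem is a suffix candidate
            suffix_counts[suffix] += 1

# keep suffixes observed with at least two different stems
print([s for s, c in suffix_counts.items() if c >= 2])  # e.g. ['s', 'ed', 'ing']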
These models, while distinct, complement each other, offering various tools and
perspectives for addressing the complex task of finding and representing the structure of
words across the diverse range of human languages. The choice of model often depends
on the specific language being analysed and the desired application.
***
1. What is a morpheme?
A morpheme is the minimal part of a word that conveys meaning; morphemes (also called
morphs or segments) are the fundamental morphological units from which words are built.
2. What is Morphology?
Morphology is the study of word structure and formation. It examines how words are
constructed from smaller meaningful units called morphemes and how these units combine
to form complex words. The discovery of word structure is specifically referred to as
morphological parsing. Morphological analysis is considered an essential part of
language processing, as it helps convert diverse word forms into well-defined linguistic units
with explicit lexical and morphological properties. Understanding word structure involves
identifying distinct types of units in human languages and how their internal structure
connects with grammatical properties and lexical concepts
Words are the smallest linguistic units that can form a complete utterance by themselves.
Their internal structure can be modelled in relation to their grammatical properties and the
lexical concepts they represent. The discovery of this word structure is known as
morphological parsing.
The structure of words is built upon morphemes, which are defined as the minimal parts of
a word that convey meaning. These are also referred to as segments or morphs and are
considered the fundamental morphological units.
Human languages employ various methods to combine these morphs and morphemes into
complete word forms.
•The simplest method is concatenation, where morphemes are joined sequentially, such
as in "dis-agree-ment-s". In this example, "agree" is a free lexical morpheme, while "dis-",
"-ment-", and "-s" are bound grammatical morphemes that contribute partial meaning (a
small segmentation sketch follows this list).
•In more complex systems, morphs can interact with each other, leading to
morphophonemic changes where their forms undergo additional phonological and
orthographic modifications. Different forms of the same morpheme are called allomorphs.
•Word structure is frequently described by how stems combine with root and pattern
morphemes, along with other elements that may be attached to either side.
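As referenced above, the following is a toy affix-stripping sketch for purely concatenative forms such as "disagreements"; the affix and stem inventories are illustrative and far from complete:

# Toy affix-stripping segmentation for purely concatenative word forms,
# e.g. "disagreements" -> dis-agree-ment-s. Affix lists are illustrative only.
PREFIXES = ["dis", "un", "re"]
SUFFIXES = ["s", "ment", "ing", "ed"]
STEMS = {"agree", "do", "play"}

def segment(word):
    prefixes, suffixes = [], []
    changed = True
    while changed:
        changed = False
        for p in PREFIXES:
            if word.startswith(p) and len(word) > len(p):
                prefixes.append(p); word = word[len(p):]; changed = True
        for s in SUFFIXES:
            if word.endswith(s) and len(word) > len(s):
                suffixes.insert(0, s); word = word[:-len(s)]; changed = True
    return prefixes + [word] + suffixes if word in STEMS else None

print(segment("disagreements"))  # ['dis', 'agree', 'ment', 's']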
It is important to note that some properties or features of a word may not be explicitly visible
in its morphological structure. The structural components can be associated with, and
dependent on, multiple functions concurrently, without necessarily having a singular
grammatical interpretation within their lexical meaning.
Ultimately, the way word structure is described can depend on the specific language being
analysed and the morphological theory being applied. Deducing word structure can be
challenging due to factors such as ambiguity, irregularity, and variations in orthography and
phonology.
Allomorphs are the alternative forms of a morpheme. They represent variations of a single
morpheme that are chosen based on phonological context or other linguistic rules.
OR
What are the foundational concepts and methodologies for understanding word
structure across languages?
1.Words as Basic Linguistic Units Words are considered the smallest linguistic units
capable of forming a complete utterance by themselves. Their internal structure can be
modeled in relation to their grammatical properties and the lexical concepts they represent.
2.Morphemes The structure of words is fundamentally built upon morphemes, which are
defined as the minimal parts of a word that convey meaning. They are also referred to as
segments or morphs and are considered the elementary morphological units.
◦Concatenation The simplest method is sequential joining, as seen in words like "dis-
agree-ment-s". In this example, "agree" is a free lexical morpheme (can stand alone), while
"dis-", "-ment-", and "-s" are bound grammatical morphemes (cannot stand alone) that
contribute partial meaning.
◦Stems, Roots, and Patterns Word structure is frequently described by how stems
combine with root and pattern morphemes, along with other elements that may be
attached to either side.
◦Implicit Properties It's important to note that some properties or features of a word may
not be explicitly visible in its morphological structure. Word structure components can be
associated with and dependent on multiple functions concurrently, without necessarily
having a singular grammatical interpretation within their lexical meaning.
◦Synthetic Languages These languages combine more morphemes per word than
isolating languages.
◦Fusional Languages These languages (e.g., Arabic, Czech, Latin, Sanskrit, German)
often have a feature-per-morpheme ratio higher than one, meaning a single morpheme
can convey multiple grammatical features.
◦Concatenative Languages These languages link morphs and morphemes one after
another.
The discovery of word structure is broadly known as morphological parsing. This process
is crucial for various Natural Language Processing (NLP) tasks, including semantic and
syntactic analysis.
◦Concatenation Languages like Arabic and Hebrew often concatenate certain tokens with
preceding or following elements, leading to changes in word forms and appearing as a
single, compact string of letters. These are sometimes called clitics.
◦Speech/Cognitive Units In Korean, character strings are grouped into units called
"eojeol" ("word segment"), which are typically larger than individual words but smaller than
clauses.
◦Mechanism FSTs represent the relationship between surface word forms (how words
appear) and their underlying lexical or morphological descriptions (their internal structure
and features). They function by mapping input symbols to output symbols. An FST is based
on finite-state automata, where a finite set of nodes (states) are connected by directed
edges labeled with pairs of input and output symbols. This network translates a sequence
of input symbols into a sequence of corresponding output symbols.
◦Functionality FSTs are capable of computing and comparing regular relations. They
define the relationship between the input (surface string) and the output (lexical string,
which includes morphemes and their features). Finite-state morphology is particularly well-suited for analysing
morphological processes in both isolating and agglutinative languages. It can be used to
build full-fledged morphological analysers, which identify morphemes within a word, or
generators, which produce word forms from given morphemes. It is also valuable for
constructing tokenizers.
◦Dictionary Lookup This is a process where word forms are associated with their
corresponding linguistic descriptions.
•Ambiguity Word forms can be understood in multiple ways or have the same form but
distinct functions or meanings (homonyms). Morphological parsing identifies the possible
analyses, while disambiguating words in their context is typically left to later processing.
•Irregularity Some word forms may not follow regular patterns and may not be explicitly
listed in a lexicon.
In human language, both written and spoken, words and sentences are not arranged
randomly; instead, they inherently possess an underlying structure. This inherent structure
includes meaningful grammatical units like sentences, requests, commands, and self-
contained units of discourse that relate to a particular point or idea. The automatic
extraction of this document structure is a fundamental and often prerequisite step for a wide
range of Natural Language Processing (NLP) applications.
For instance, tasks such as parsing, machine translation, and semantic role labeling rely on
sentences as their basic processing unit. Furthermore, the ability to chunk input text or
speech into topically coherent blocks facilitates better organisation and indexing of data,
enabling more efficient information retrieval and further processing of specific topics.
"Finding the Structure of Documents," delves into methods for identifying these
structural elements, specifically focusing on two key tasks: Sentence Boundary Detection
(SBD) and Topic Boundary Detection. It explores statistical classification approaches that
infer the presence of sentence and topic boundaries by leveraging various features of the
input, such as punctuation, pauses, and lexical cues. These methods are crucial for
transforming raw text or speech into more manageable and semantically meaningful units
for subsequent NLP analysis.
--------------------------------------------------------------------------------
1. Explain Sentence Boundary Detection and Topic Boundary Detection.
In written text, sentence boundaries are typically identified by punctuation marks such as
periods (.), question marks (?), and exclamation marks (!). Ambiguity arises because the
same punctuation, especially a period, can signify an abbreviation rather than the end of a
sentence (e.g., "Dr." or "Mt. Rushmore"). To resolve such ambiguities, SBD systems
consider additional cues, including the presence of a punctuation mark, a pause in speech,
or the beginning of a new word in a document. Capitalized initials and numbers preceding
periods are also used to distinguish between sentence-ending punctuation and
abbreviations. Statistical methods often infer boundaries based on these features. For
example, in the Wall Street Journal Corpus, 47% of periods are used to mark an
abbreviation.
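A rough rule-based sketch of this kind of period disambiguation is shown below; the abbreviation list is illustrative, not exhaustive:

# Rule-based sketch of sentence boundary detection: a period is treated as a
# boundary unless it follows a known abbreviation or a single initial.
# The abbreviation list is illustrative, not exhaustive.
import re

ABBREVIATIONS = {"dr", "mr", "mrs", "mt", "prof", "etc", "e.g", "i.e"}

def split_sentences(text):
    tokens = text.split()
    sentences, current = [], []
    for i, tok in enumerate(tokens):
        current.append(tok)
        if tok.endswith((".", "?", "!")):
            word = tok.rstrip(".?!").lower()
            is_abbrev = word in ABBREVIATIONS or re.fullmatch(r"[a-z]", word)
            next_lower = i + 1 < len(tokens) and tokens[i + 1][0].islower()
            if not (tok.endswith(".") and (is_abbrev or next_lower)):
                sentences.append(" ".join(current))
                current = []
    if current:
        sentences.append(" ".join(current))
    return sentences

print(split_sentences("Dr. Smith visited Mt. Rushmore. It was impressive!"))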
For spoken language, particularly in multiparty meetings, the task becomes more complex
as traditional punctuation cues are absent. Here, SBD relies on features like pause duration
and pitch range. Dialogue acts, which represent self-contained units of discourse, can also
serve as segment boundaries in such context. Challenges include handling conversational
speech, which may lack clear boundaries, and errors introduced by OCR or ASR systems.
Unlike sentence segmentation, topic segmentation typically deals with longer segments of
text. In multiparty meetings, topic segmentation often draws inspiration from discourse
analysis, though defining boundaries can be less straightforward compared to well-
structured formats like news articles. One of the main challenges for topic segmentation is
its non-trivial nature, as topic boundaries are often fluid and lack clear, linguistically explicit
markers.
Statistical approaches are commonly used for topic segmentation, inferring boundaries
based on various features. For instance, the TextTiling method is a popular approach that
employs a lexical cohesion metric within a word vector space to assess the similarity
between consecutive text segments. A decrease in this similarity score often indicates a
topic shift. This method helps identify points where new vocabulary is introduced or where
the lexical content changes significantly, signalling a transition to a new topic. Other
methods for computing similarity scores between blocks include block comparison and the
vocabulary introduction method, which assigns a score based on the number of new words
appearing in an interval.
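A small sketch of the underlying similarity computation, comparing adjacent blocks as word-count vectors with cosine similarity; the example blocks are invented:

# Sketch of a TextTiling-style lexical cohesion score: represent adjacent
# blocks of text as word-count vectors and compute their cosine similarity;
# low similarity between neighbouring blocks suggests a topic boundary.
from collections import Counter
import math

def cosine(block_a, block_b):
    va, vb = Counter(block_a.lower().split()), Counter(block_b.lower().split())
    dot = sum(va[w] * vb[w] for w in set(va) & set(vb))
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0

blocks = [
    "the match ended with a late goal by the home team",
    "fans of the team celebrated the goal after the match",
    "the central bank raised interest rates again this quarter",
]
for i in range(len(blocks) - 1):
    sim = cosine(blocks[i], blocks[i + 1])
    print(f"similarity({i},{i+1}) = {sim:.2f}")  # a sharp drop hints at a topic shift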
****
Methods: An Overview
"Finding the Structure of Documents" discusses various methods for segmenting text into
meaningful units, primarily focusing on sentence and topic boundaries.
These methods are broadly categorised into generative sequence models (such as hidden
Markov models), discriminative approaches (both local classifiers and sequence models
such as conditional random fields), and hybrid approaches that combine the two.
1. Generative Sequence Classification Methods
The most probable boundary sequence for a given document is typically obtained using the
Viterbi algorithm. While effective, the conventional HMM approach has known weaknesses,
particularly its inability to make effective use of broad linguistic cues such as part-of-speech
(POS) tags or prosodic features for sentence segmentation. Extensions to HMMs, such as
the hidden event language model (HELM), have been proposed to address these
limitations by incorporating additional features and supporting non-lexical information.
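A minimal Viterbi sketch over two boundary labels per token is given below; the transition and emission probabilities are invented toy values, not estimates from data:

# Minimal Viterbi sketch over two hidden labels per token position:
# "B" (a sentence boundary follows the token) and "O" (no boundary).
# Transition and emission probabilities are invented for illustration.
import math

STATES = ["B", "O"]
TRANS = {("O", "O"): 0.8, ("O", "B"): 0.2, ("B", "O"): 0.9, ("B", "B"): 0.1}
START = {"O": 0.9, "B": 0.1}

def viterbi(emissions):
    """emissions: list of dicts mapping each state to P(observation | state)."""
    path = [{s: (math.log(START[s]) + math.log(emissions[0][s]), None) for s in STATES}]
    for t in range(1, len(emissions)):
        column = {}
        for s in STATES:
            best_prev = max(STATES, key=lambda p: path[t - 1][p][0] + math.log(TRANS[(p, s)]))
            score = path[t - 1][best_prev][0] + math.log(TRANS[(best_prev, s)]) + math.log(emissions[t][s])
            column[s] = (score, best_prev)
        path.append(column)
    # backtrack from the best final state
    state = max(STATES, key=lambda s: path[-1][s][0])
    labels = [state]
    for t in range(len(emissions) - 1, 0, -1):
        state = path[t][state][1]
        labels.append(state)
    return list(reversed(labels))

# e.g. tokens "went" "home" "." "She": the period strongly signals a boundary
emissions = [{"B": 0.1, "O": 0.9}, {"B": 0.1, "O": 0.9}, {"B": 0.9, "O": 0.1}, {"B": 0.1, "O": 0.9}]
print(viterbi(emissions))  # expected: ['O', 'O', 'B', 'O']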
2. Discriminative Local Classification Methods
Discriminative approaches are widely used in speech and language processing tasks
because they often outperform generative methods, especially when training data is
plentiful. They require iterative optimisation and typically incorporate local and contextual
features. Examples of discriminative classifiers include support vector machines (SVMs),
boosting, maximum entropy models, and regression trees. These methods have been
successfully applied to tasks like POS tagging, where a tag is assigned to each word,
which is analogous to labelling candidate boundary positions in text.
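A small sketch of such a local classifier using scikit-learn (assumed to be installed); the features and the tiny training set are invented for illustration:

# Sketch of discriminative local classification of candidate boundaries with
# scikit-learn. Each candidate period is described by local features; the
# tiny training set below is invented for illustration.
from sklearn.linear_model import LogisticRegression

# features per candidate: [next_word_capitalised, token_is_abbreviation, pause_duration]
X_train = [
    [1, 0, 0.6],   # "... home. She ..."          -> boundary
    [1, 1, 0.1],   # "... Dr. Smith ..."          -> not a boundary
    [0, 0, 0.0],   # "... 3.5 percent ..."        -> not a boundary
    [1, 0, 0.9],   # long pause, capitalised word -> boundary
]
y_train = [1, 0, 0, 1]

clf = LogisticRegression().fit(X_train, y_train)
print(clf.predict([[1, 0, 0.7], [1, 1, 0.05]]))  # likely [1, 0]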
A prominent example for topic segmentation is TextTiling, which employs a lexical cohesion
metric in a vector space. TextTiling operates as a local classification method by measuring
similarity between consecutive segmentation units, marking a boundary where the similarity
falls below a certain threshold.
3. Discriminative Sequence Classification Methods
Conditional random fields (CRFs) are powerful discriminative models that globally optimise
the conditional probability of a boundary sequence given all input features. They offer advantages over
HMMs by allowing for rich, overlapping features and avoiding the "label bias problem"
common in MEMMs. CRFs have been widely applied to tasks such as speech
segmentation, indicating their versatility in handling sequence labelling challenges.
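A short sketch of boundary labelling with a CRF, assuming the optional sklearn-crfsuite package is installed; the features and training data are invented for illustration:

# Sketch of sequence labelling of boundaries with a CRF (sklearn-crfsuite,
# assumed installed). Tokens are labelled 'B' (a sentence boundary follows)
# or 'O'; the tiny training data and features are invented.
import sklearn_crfsuite

def token_features(tokens, i):
    tok = tokens[i]
    return {
        "lower": tok.lower(),
        "ends_with_period": tok.endswith("."),
        "next_capitalised": i + 1 < len(tokens) and tokens[i + 1][0].isupper(),
    }

sentences = [
    (["He", "left", "early", "."], ["O", "O", "O", "B"]),
    (["Dr.", "Smith", "arrived", "."], ["O", "O", "O", "B"]),
]
X = [[token_features(toks, i) for i in range(len(toks))] for toks, _ in sentences]
y = [labels for _, labels in sentences]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
crf.fit(X, y)
print(crf.predict(X))  # predicted label sequences for the training data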
4. Hybrid Approaches
For example, a hybrid approach might estimate P(y_i | x_i) using an HMM, and then optimise
interpolation parameters such as α and β using a held-out dataset, potentially leveraging discriminative local
classification methods. Successful implementations of hybrid approaches have shown
improved performance, particularly in areas like multilingual broadcast news speech
segmentation. They can provide a more robust and accurate solution by integrating
different types of information and modelling techniques.
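A minimal sketch of such an interpolation, where α weights the two models and β acts as a decision threshold; both would be tuned on held-out data, and the probabilities below are invented:

# Sketch of a hybrid combination: interpolate a boundary posterior from a
# generative model (e.g. an HMM) with one from a discriminative classifier.
# alpha and beta would be tuned on held-out data; values here are invented.
def hybrid_boundary_scores(p_hmm, p_disc, alpha=0.4, beta=0.5):
    combined = [alpha * h + (1 - alpha) * d for h, d in zip(p_hmm, p_disc)]
    decisions = [score >= beta for score in combined]
    return combined, decisions

p_hmm  = [0.20, 0.75, 0.10, 0.60]   # P(boundary) from the generative model
p_disc = [0.30, 0.90, 0.05, 0.40]   # P(boundary) from the discriminative model
scores, boundaries = hybrid_boundary_scores(p_hmm, p_disc)
print(scores)      # interpolated posteriors
print(boundaries)  # boundary decisions after thresholding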
Sentence segmentation, compared to other tasks like topic segmentation, faces a unique
challenge due to the potentially "quadratic number of boundaries". Traditional approaches
often focus on local decisions, but global modelling aims to improve accuracy by
considering the entire sentence structure or document context when making segmentation
decisions.
These extensions typically involve integrating local scores (e.g., from a discriminative
classifier) with higher-level, sentence-level features. This can be achieved by working with a
"pruned sentence lattice," which allows for the combination of local boundary scores with
more holistic features derived from syntactic parsing or global prosodic patterns. Such
methods lead to a more efficient and accurate manner of finding the optimal sentence
boundaries by considering broader contextual information rather than just local cues.
Various studies have reported different performance levels for these methods. For instance,
rule-based systems for sentence boundary detection have shown error rates as low as
1.41%. Supervised classifiers like SVMs, often combined with POS tag features, have
achieved F1-measures as high as 97.5%. Hybrid approaches, such as those combining
HMMs with CRFs, have also demonstrated high F1-scores, ranging from 78.2% to
89.1% in different contexts. The specific features used, such as lexical, prosodic, or
grammatical features, significantly influence the performance of discriminative approaches.
IMPORTANT QUESTIONS