NLP Unit 1

The document discusses the importance of morphology in Natural Language Processing (NLP), focusing on the structure of words and their components such as tokens, morphemes, lexemes, and allomorphs. It outlines challenges in morphological parsing, including irregularity, ambiguity, and productivity, and describes various morphological models like Dictionary Lookup, Finite-State Morphology, and Unification-Based Morphology. The document emphasizes the complexity of human language and the need for sophisticated models to effectively analyze and process word structures.
SBIT – AUTONOMOUS NLP

NATURAL LANGUAGE PROCESSING

UNIT 1

1. FINDING THE STRUCTURE OF WORDS

Introduction

The study of word structure, known as morphology, is a fundamental aspect of Natural Language Processing (NLP). This discipline is essential for understanding human language, which is inherently complex, enabling us to express thoughts and infer meaning at various levels of detail. Morphology is crucial for processing human language, including tasks like semantic and syntactic analysis, and is particularly vital in multilingual settings. The discovery of word structure is termed morphological parsing.

Words and their Components

Explain Words and their Components.

Words are considered the smallest linguistic units capable of conveying meaning through
utterance. However, the concept of a "word" can vary significantly across languages. The
following are the various fundamental components of words:

•Tokens: In many languages, such as English, words are delimited by whitespace and punctuation, forming tokens. This is not a universal rule, however; languages like Japanese, Chinese, and Thai write words as character strings without whitespace delimitation. Other languages, like Arabic or Hebrew, concatenate certain tokens, where word forms change depending on preceding or following elements.

•Morphemes: These are the minimal parts of words that convey meaning. Morphemes
constitute the fundamental morphological units and contribute to the overall meaning of a
word.

•Lexemes: A lexeme is a linguistic form that expresses a concept, independent of its various inflectional categories. The citation form of a lexeme is known as its lemma. When a word form is converted into its lemma, this process is called lemmatisation.

•Allomorphs: Morphemes can exhibit variations in their sound (phonemes) or spelling (graphemes); these variant forms are termed allomorphs. The variations are due to phonological or orthographic constraints. Examples include the differing forms of morphemes in Korean and the non-concatenative morphology of Arabic, where word structure is determined by stems, roots, and patterns.
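As a rough illustration of these components, the sketch below tokenizes an English sentence and maps each token to its lemma. The exception table and the "-s" stripping rule are toy assumptions for illustration, not a real lexicon or lemmatiser.

```python
import re

# Hand-built table of irregular forms (illustrative entries only).
IRREGULAR_LEMMAS = {"mice": "mouse", "went": "go", "better": "good"}

def tokenize(text):
    """Whitespace/punctuation tokenization (works for English, not for
    whitespace-free scripts such as Chinese or Thai)."""
    return re.findall(r"\w+", text.lower())

def lemmatize(token):
    """Map a word form to its lemma (citation form)."""
    if token in IRREGULAR_LEMMAS:
        return IRREGULAR_LEMMAS[token]
    if token.endswith("s") and len(token) > 3:   # crude regular-plural rule
        return token[:-1]
    return token

tokens = tokenize("The mice went home, chasing cats.")
lemmas = [lemmatize(t) for t in tokens]
print(lemmas)  # ['the', 'mouse', 'go', 'home', 'chasing', 'cat']
```

Note how the irregular forms ('mice', 'went') need explicit listing while the regular plural follows a rule; this foreshadows the irregularity and productivity issues discussed later.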

 Morphological Typology
Morphological typology classifies languages based on the number of morphemes per word
and the degree of fusion between them. The types of languages are shown below:

III CSE (AI/ML) Page 1 Mrs. N Savitha M.Tech.,(Ph.D)



1) Isolating Languages: These languages typically have one or relatively few morphemes per word, with minimal inflectional changes. Examples include Chinese, Vietnamese, and Thai.

2) Agglutinative Languages: Characterised by a high number of morphemes per word, which are often easily separable and combine to form long words. Korean, Japanese, Finnish, and Turkish are examples.

3) Synthetic Languages: Morphemes in these languages tend to combine and fuse, so that a single word carries several morphemes and grammatical functions.

4) Fusional Languages: A subset of synthetic languages, fusional languages often express multiple grammatical features (e.g., gender, number, case) with a single morpheme. Arabic, Czech, Latin, and Sanskrit are examples.

*****

Issues and Challenges

Explain issues and challenges of Morphological Parsing. (Essay Question)

OR Explain issues and challenges in finding the structure of words.

Understanding the structure of words is a complex task in human language processing, involving various linguistic disciplines like morphology, semantics, etymology, and lexicology. Morphological parsing, which aims to identify and analyse the structure of words, faces significant issues and challenges, particularly concerning irregularity, ambiguity, and productivity. Addressing these challenges is crucial for developing effective language processing systems.

1. Irregularity:

Irregularity in language refers to instances where word forms do not follow general rules or
predictable patterns, posing a considerable challenge for morphological parsing. These
irregularities are particularly pronounced in languages with rich morphology, such as Arabic
or Korean, and can affect both derivation and inflection.

It hinders the ability of linguistic systems to generalise and abstract from observed word
forms, necessitating detailed descriptions of each irregular form. This can lead to issues
with accuracy, increased computational complexity, and difficulties in verifying associated
information.

For example, Korean exhibits exceptional constraints on the selection of grammatical morphemes, often showing irregular inflection in its agglutinative structures. The table below shows examples of major irregular verb classes in Korean. Arabic morphology, known for its richness and derivational nature, presents complex challenges due to its deep-rooted morphological processes and the interaction between phonology and orthography. A deep study of morphological processes in Arabic is essential for mastering inflection and derivation, as well as for handling irregular forms. For instance, certain irregular forms might not be derived from the general morphological template, as seen with the word jadīd 'new'.

Czech morphology also demonstrates irregularity, particularly in its extensive inflectional paradigms. The source illustrates this with the morphological paradigms of the Czech word dům 'house' and the near-synonymous nouns budova, stavba, and stavení, all meaning 'building'. These examples show the complexity arising from non-standard variations that deviate from typical morphological rules.

2. Ambiguity

Word forms can have multiple possible interpretations, leading to ambiguity.

Ambiguity in morphological parsing arises when a word form can be understood in multiple
ways, possessing distinct functions or meanings. While morphological parsing identifies the
components of words, it does not directly concern the disambiguation of words in their
context. Homonyms, words that look alike but have different meanings, are particularly
problematic in morphological analysis.

Korean provides clear examples of systematic homonyms, where the same word form can represent different meanings depending on its context or the endings attached to it, as illustrated in the table below. For instance, 'mwut.ko' can mean 'bury', 'ask', or 'bite'. Arabic, with its morphologically rich nature, frequently presents ambiguities, especially because the written script often omits the diacritical marks that distinguish different forms. The problem of morphological disambiguation in Arabic extends beyond resolving structural components to include aspects like tokenisation and normalisation.

Even in Czech, while morphology helps to identify semantic features, it can introduce ambiguities concerning abstract semantic forms. For example, the meaning of 'stavení' in Czech might be unclear without additional context. This highlights that even with detailed morphological information, the intended meaning can remain elusive, requiring further contextual analysis.

3. Productivity

Productivity, the ability of a language to continually form new words, poses a challenge for maintaining comprehensive lexicons.

Productivity refers to the inherent capacity of a language to generate new words or new
forms of existing words. This includes both inflectional changes (like verb conjugations or
noun declensions) and derivational processes (creating new words from existing ones).
Human languages are considered "open class" systems, meaning they can continually
create an infinite number of linguistic units, making it impossible to list every possible word
form explicitly in a dictionary or lexicon.

This ongoing creation of new forms poses a significant challenge for traditional dictionary-
based lookup approaches, which are finite and cannot account for all possible word forms.
Therefore, morphological analysis systems must be able to process and understand novel
or unseen word forms that are not explicitly pre-computed. This necessitates sophisticated
computational linguistic models, such as finite-state transducers and unification-based
approaches, which can handle productive patterns and generate/recognise new forms
based on underlying rules rather than just stored entries. The goal is to develop robust systems that can effectively manage the dynamism of language, predicting and parsing forms that have not been encountered before.

In conclusion, irregularity, ambiguity, and productivity are fundamental challenges in finding the structure of words. Addressing these issues requires sophisticated morphological models capable of handling exceptions, distinguishing between multiple interpretations, and processing the continuous generation of new word forms.

*****

MORPHOLOGICAL MODELS

Explain various Morphological models. (Essay Question)

Morphological models are computational linguistic approaches designed to understand and represent the complex structure of words across human languages. They are crucial for addressing various problems in natural language processing (NLP), ranging from basic word segmentation to more advanced semantic and syntactic analysis. Because human language is inherently complex, linguistic expressions are structured at multiple levels of detail, making these models essential for processing.

The following are various morphological models:

1. Dictionary Lookup:

Dictionary lookup is a fundamental process in morphological analysis where word forms are
associated with their corresponding linguistic descriptions. This method relies on
precomputed data structures like lists, dictionaries, or databases, which are kept
synchronised with sophisticated morphological models.

 Data Structure: Linguistic data is typically understood as a data structure that directly enables efficient lookup operations.

 Efficiency: Lookup operations can be optimised using data structures such as binary search trees, tries, hash tables, and so on.

 Limitations: While effective, the set of associations between word forms and their desired descriptions is finite, meaning the generative potential of the language is not fully exploited. Building such resources by hand can be tedious, error-prone, and inefficient, although enumerative models are often sufficient for general purposes. Dictionary lookup is also less suitable for complex morphology, as seen in Korean, where dictionary-based approaches depend on a large dictionary of all possible combinations of allomorphs and morphological alternations.
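A minimal sketch of dictionary lookup: every analysable form must be precomputed and stored, and lookup is a plain hash-table access. The entries and tags below are illustrative placeholders, not a real lexicon.

```python
# Precomputed form -> analyses table (illustrative entries only).
LEXICON = {
    "walk":   [("walk", "VERB", "base")],
    "walks":  [("walk", "VERB", "3sg-present"), ("walk", "NOUN", "plural")],
    "walked": [("walk", "VERB", "past")],
}

def lookup(form):
    """Return all stored analyses, or None for out-of-lexicon forms."""
    return LEXICON.get(form)

print(lookup("walks"))     # two analyses stored: ambiguity listed explicitly
print(lookup("rewalked"))  # None: a finite list cannot cover productive forms
```

The failing second lookup shows the limitation noted above: the association set is finite, so productive word formation defeats pure enumeration.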




2. Finite-State Morphology (FSM):

Finite-State Morphology is a widely adopted computational linguistic approach that employs finite-state transducers (FSTs) to model and analyse word structure.

Mechanism: FSTs are directly compiled from specifications written by human programmers. They represent the relationship between the surface form of words (how they appear) and their underlying lexical or morphological descriptions (their internal structure and features). An FST is based on finite-state automata, consisting of a finite set of nodes (states) connected by directed edges. These edges are labelled with pairs of input and output symbols, translating a sequence of input symbols into a corresponding sequence of output symbols.

Functionality: FSTs can compute and compare regular relations, defining the relationship between an input (surface string) and an output (lexical string, including morphemes and features). FSM is well-suited for analysing morphological processes in various languages, including isolating and agglutinative types. FSTs can be used to build full-fledged morphological analysers (parsing words into morphemes), morphological generators (producing word forms from morphemes), and tokenizers.

Advantages: FSTs are flexible, efficient, and robust. They offer a general-purpose
approach for pattern matching and substitution, allowing for the building of complex
morphological analysers and generators.

Limitations: A theoretical limitation of FSTs is that they primarily generate regular languages. This is challenging for natural language phenomena that exhibit non-regular patterns, such as certain types of reduplication.
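The mechanism above can be sketched as a tiny hand-built transducer that maps a surface form to a lexical string, e.g. "cats" to "cat+N+PL". The states, labels, and the '#' end-of-word marker are illustrative assumptions, not taken from any real analyser.

```python
# Transition table: (state, input symbol) -> (output string, next state).
TRANSITIONS = {
    (0, "c"): ("c", 1),
    (1, "a"): ("a", 2),
    (2, "t"): ("t", 3),
    (3, "s"): ("", 4),        # plural 's' consumed; feature emitted at '#'
    (3, "#"): ("+N+SG", 5),   # bare stem at end of word
    (4, "#"): ("+N+PL", 5),   # stem + 's' at end of word
}
FINAL_STATES = {5}

def transduce(surface):
    """Run the FST over surface + end marker; return the lexical string,
    or None if the word is not accepted by the network."""
    state, out = 0, []
    for sym in surface + "#":
        if (state, sym) not in TRANSITIONS:
            return None                  # no matching edge: reject
        emitted, state = TRANSITIONS[(state, sym)]
        out.append(emitted)
    return "".join(out) if state in FINAL_STATES else None

print(transduce("cats"))  # cat+N+PL
print(transduce("cat"))   # cat+N+SG
print(transduce("dog"))   # None (not in this toy network)
```

Real analysers compile large lexicons and alternation rules into such networks automatically; the edge labels here simply pair input symbols with output strings, as the mechanism section describes.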

3. Unification-Based Morphology:

Unification-Based Morphology is a declarative approach inspired by various formal linguistic grammars, particularly head-driven phrase structure grammar (HPSG).

Core Concept: It relies on the concept of feature structures to represent linguistic information. These feature structures are viewed as directed acyclic graphs.

Logic Programming: The methods and concepts of unification-based formalisms are closely connected to logic programming.

Functionality: This model can manage complex and recursively nested linguistic information, expressed by atomic symbols or more appropriate data structures. Unification, the key operation, merges compatible feature structures and is highly versatile for representing intricate linguistic details.

Advantages: These models are typically formulated as logic programs and use unification to solve constraint systems. This offers advantages such as better abstraction possibilities for developing morphological grammars and eliminating redundant information. Unification-based models can be implemented for various languages, including Russian, Czech, Slovenian, Persian, Hebrew, and Arabic.
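A minimal sketch of the key operation: unification over feature structures, modelled here as nested Python dicts. The feature names ('agr', 'num', 'per') are hypothetical examples, not from any particular grammar.

```python
def unify(a, b):
    """Merge two feature structures; return None on conflicting atoms."""
    if not isinstance(a, dict) or not isinstance(b, dict):
        return a if a == b else None     # atomic values must match exactly
    result = dict(a)
    for key, value in b.items():
        if key in result:
            merged = unify(result[key], value)
            if merged is None:
                return None              # conflict anywhere fails the whole merge
            result[key] = merged
        else:
            result[key] = value          # new information is simply added
    return result

noun = {"cat": "N", "agr": {"num": "sg"}}
constraint = {"agr": {"num": "sg", "per": 3}}
print(unify(noun, constraint))                       # merged structure
print(unify({"agr": {"num": "sg"}},
            {"agr": {"num": "pl"}}))                 # None: number clash
```

This shows the two properties the text mentions: compatible structures merge into one that contains all their information, and incompatible ones fail, which is how constraint systems are solved.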

4. Functional Morphology:
Functional Morphology is a model that defines morphological operations using principles of
functional programming and type theory.

Approach: It treats morphological operations as pure mathematical functions, organising linguistic elements as abstract models of distinct types and value classes.

Compatibility: Functional morphology definitions can be compiled into finite-state transducers for efficient computation, or used directly in an interpreted mode.

Advantages: This approach offers greater freedom for developers to define their own
lexical constructions, leading to domain-specific embedded languages for morphological
analysis. It supports full-featured, real-world applications and promotes reusability of
linguistic data.

Applicability: It is particularly useful for fusional languages and is influenced by functional programming frameworks like Haskell. ElixirFM, for instance, implements Arabic morphology using this framework.
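A rough functional-morphology-style sketch, written in Python rather than the Haskell used by the cited frameworks: a paradigm is a pure function from a stem to a table of forms. The toy English spelling rules below are assumptions for illustration only.

```python
def regular_verb(stem):
    """Pure function: stem -> full inflection table (no side effects,
    no stored forms; everything is computed from rules)."""
    return {
        "base":        stem,
        "3sg-present": stem + ("es" if stem.endswith(("s", "sh", "ch")) else "s"),
        "past":        stem + ("d" if stem.endswith("e") else "ed"),
        "gerund":      (stem[:-1] if stem.endswith("e") else stem) + "ing",
    }

table = regular_verb("watch")
# 3sg-present: 'watches', past: 'watched', gerund: 'watching'
print(table)
```

Because the paradigm is an ordinary function, a developer can compose or override it to build domain-specific lexical constructions, which is the flexibility the text attributes to this approach.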

5. Morphology Induction:
Morphology Induction focuses on discovering and inferring word structure, moving beyond
pre-existing linguistic knowledge.

Motivation: This approach is especially valuable for languages where linguistic expertise is limited or unavailable, or for situations where an unsupervised or semi-supervised learning method is preferred.

 Process: It aims at the automated acquisition of morphological and lexical information. Even if not perfect, this information can be used to bootstrap and enhance classical morphological models.

 Research Focus: Studies in unsupervised learning of morphology, as seen in the works of Hammarström and Goldsmith, involve categorising approaches, comparing and clustering words based on similarity, and identifying prominent features of word forms.

 Key Problem: Most published approaches frame morphology induction as the problem of word boundary and morpheme boundary detection. This also includes tasks like morphological tagging, tokenization, and normalization.




 Challenges: Deducing word structure from forms and context presents several
challenges, including dealing with ambiguity and irregularity in morphology, as well
as orthographic and phonological alterations and non-linear morphological
processes.

 Advancements: To improve statistical inference, methods like parallel learning of morphologies for multiple languages have been proposed by Snyder and Barzilay. Discriminative log-linear models, such as those by Poon, Cherry, and Toutanova, enhance generalisation by employing overlapping contextual features for segmentation decisions.
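A deliberately naive induction sketch: count candidate stem/suffix splits over a small word list and keep suffixes shared by several distinct stems. The threshold and word list are assumptions; notably, the method overgenerates (it also proposes splits like 'ked' or 'ng'), which illustrates the ambiguity challenge above and why real induction needs stronger statistics.

```python
def induce_suffixes(words, min_stems=2, max_len=3):
    """Collect candidate suffixes (up to max_len chars) that occur after
    at least min_stems distinct stems in the word list."""
    stems_per_suffix = {}
    for w in words:
        for i in range(max(1, len(w) - max_len), len(w)):
            stems_per_suffix.setdefault(w[i:], set()).add(w[:i])
    return {s for s, stems in stems_per_suffix.items()
            if len(stems) >= min_stems}

words = ["walked", "talked", "walking", "talking", "walks", "talks"]
suffixes = induce_suffixes(words)
print(sorted(suffixes))  # includes 'ed', 'ing', 's' -- but also noise
```

The genuine suffixes ('ed', 'ing', 's') are recovered purely from distributional evidence, with no prior lexicon, which is the core idea of morphology induction.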

These models, while distinct, complement each other, offering various tools and
perspectives for addressing the complex task of finding and representing the structure of
words across the diverse range of human languages. The choice of model often depends
on the specific language being analysed and the desired application.

***

Short answer questions

1. What is a morpheme?

In Natural Language Processing (NLP), a morpheme is defined as the minimal part of a word that conveys meaning. Morphemes are considered the fundamental morphological units. They contribute to various aspects of a word's meaning and are essentially the structural components of word forms.

2. What is Morphology?

Morphology is the study of word structure and formation. It examines how words are
constructed from smaller meaningful units called morphemes and how these units combine
to form complex words. The discovery of word structure is specifically referred to as
morphological parsing. Morphological analysis is considered an essential part of
language processing, as it helps convert diverse word forms into well-defined linguistic units
with explicit lexical and morphological properties. Understanding word structure involves
identifying distinct types of units in human languages and how their internal structure
connects with grammatical properties and lexical concepts.

3. Define Morphological parsing in Natural Language Processing (NLP).

Morphological parsing in Natural Language Processing (NLP) refers to the discovery of word structure. It is the process of identifying and analysing the constituent morphemes within a word to understand its meaning and grammatical function.




This process is a fundamental aspect of understanding human language, which is inherently complex and organised across multiple levels of detail. Morphology, the study of word structure, is an essential part of language processing and is particularly significant in multilingual settings. Morphological parsing is crucial for various NLP tasks, including semantic and syntactic analysis.

4. What is word segmentation?

In Natural Language Processing, word segmentation is a fundamental step in morphological analysis. It is also known as tokenization. This process is crucial and serves as a prerequisite for most language processing applications, particularly in languages where words are not explicitly delimited by whitespace or punctuation. For instance, in languages like Japanese, Chinese, and Thai, words are character strings without whitespace, and word segmentation is essential to identify the individual words.

5. How are words delimited?

The delimitation of words, often referred to as tokenization or word segmentation, varies significantly across languages. The following are the methods by which words are delimited in various linguistic contexts:

•Whitespace and Punctuation: In many languages, such as English, words are primarily delimited by whitespace and punctuation. Spaces and common punctuation marks serve as explicit boundaries between individual words.

•Absence of Whitespace Delimitation: In other languages, like Japanese, Chinese, and Thai, whitespace is not used to separate words. Instead, the writing systems of these languages present words as character strings without clear word-level delimiters. In such cases, the units that are graphically delimited are typically larger structures like sentences or clauses.

•Concatenation and Form Changes: Languages such as Arabic and Hebrew often concatenate certain tokens with preceding or following elements. This concatenation can change the word forms themselves, causing the underlying lexical or syntactic units to appear as a single, compact string of letters rather than distinct words. These concatenated units are sometimes referred to as clitics.
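For scripts without whitespace, one classical baseline is greedy longest-match segmentation against a lexicon. The sketch below uses a toy Chinese vocabulary as an assumption; real systems rely on much larger lexicons or statistical and neural models.

```python
def max_match(text, vocab):
    """Greedy longest-match: repeatedly take the longest vocabulary word
    starting at the current position, falling back to a single character."""
    segments, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):          # longest candidate first
            if text[i:j] in vocab or j == i + 1:   # single char as fallback
                segments.append(text[i:j])
                i = j
                break
    return segments

vocab = {"我", "喜欢", "自然", "语言", "处理", "自然语言"}
print(max_match("我喜欢自然语言处理", vocab))
# ['我', '喜欢', '自然语言', '处理']
```

Because the matcher prefers the longest entry, "自然语言" is kept as one unit rather than split into "自然" + "语言"; choosing between such segmentations is exactly the ambiguity problem discussed earlier.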

6. How are words structured?

Words are the smallest linguistic units that can form a complete utterance by themselves.
Their internal structure can be modelled in relation to their grammatical properties and the
lexical concepts they represent. The discovery of this word structure is known as
morphological parsing.

The structure of words is built upon morphemes, which are defined as the minimal parts of
a word that convey meaning. These are also referred to as segments or morphs and are
considered the fundamental morphological units.




Human languages employ various methods to combine these morphs and morphemes into
complete word forms.

•The simplest method is concatenation, where morphemes are joined sequentially, such
as in "dis-agree-ment-s". In this example, "agree" is a free lexical morpheme, while "dis-", "-
ment-", and "-s" are bound grammatical morphemes that contribute partial meaning.

•In more complex systems, morphs can interact with each other, leading to
morphophonemic changes where their forms undergo additional phonological and
orthographic modifications. Different forms of the same morpheme are called allomorphs.

•Word structure is frequently described by how stems combine with root and pattern
morphemes, along with other elements that may be attached to either side.

It is important to note that some properties or features of a word may not be explicitly visible
in its morphological structure. The structural components can be associated with, and
dependent on, multiple functions concurrently, without necessarily having a singular
grammatical interpretation within their lexical meaning.

Ultimately, the way word structure is described can depend on the specific language being analysed and the morphological theory being applied. Deducing word structure can be challenging due to factors such as ambiguity, irregularity, and variations in orthography and phonology.

7. What are Allomorphs?

Allomorphs are the alternative forms of a morpheme. They represent variations of a single
morpheme that are chosen based on phonological context or other linguistic rules.

(Complete topic: Finding the structure of words)

Explain Finding the structure of words

OR

What are the foundational concepts and methodologies for understanding word
structure across languages?

Understanding word structure across languages involves several foundational concepts and methodologies that aim to decipher how words are built and what meanings and functions their components convey.




Foundational Concepts of Word Structure

1.Words as Basic Linguistic Units Words are considered the smallest linguistic units
capable of forming a complete utterance by themselves. Their internal structure can be
modeled in relation to their grammatical properties and the lexical concepts they represent.

2.Morphemes The structure of words is fundamentally built upon morphemes, which are defined as the minimal parts of a word that convey meaning. They are also referred to as segments or morphs and are considered the elementary morphological units.

3.Combining Morphemes Human languages employ various methods to combine morphemes into complete word forms:

◦Concatenation The simplest method is sequential joining, as seen in words like "dis-
agree-ment-s". In this example, "agree" is a free lexical morpheme (can stand alone), while
"dis-", "-ment-", and "-s" are bound grammatical morphemes (cannot stand alone) that
contribute partial meaning.

◦Morphophonemic Changes In more complex systems, morphs can interact, leading to morphophonemic changes where their forms undergo additional phonological and orthographic modifications. Different forms of the same morpheme are called allomorphs.

◦Stems, Roots, and Patterns Word structure is frequently described by how stems
combine with root and pattern morphemes, along with other elements that may be
attached to either side.

◦Implicit Properties It's important to note that some properties or features of a word may
not be explicitly visible in its morphological structure. Word structure components can be
associated with and dependent on multiple functions concurrently, without necessarily
having a singular grammatical interpretation within their lexical meaning.

Morphological Typologies Languages can be categorized based on how they structure words:

◦Isolating Languages These languages (e.g., Chinese, Vietnamese, Thai) typically have one morpheme per word.

◦Synthetic Languages These languages combine more morphemes per word than
isolating languages.

◦Agglutinative Languages A type of synthetic language (e.g., Korean, Japanese, Finnish, Tamil), where morphemes often combine with one function at a time.

◦Fusional Languages These languages (e.g., Arabic, Czech, Latin, Sanskrit, German) often have a feature-per-morpheme ratio higher than one, meaning a single morpheme can convey multiple grammatical features.

◦Concatenative Languages These languages link morphs and morphemes one after
another.




◦Non-concatenative Languages These involve changing consonantal or vocalic templates, common in Arabic.

Methodologies for Understanding Word Structure

The discovery of word structure is broadly known as morphological parsing. This process
is crucial for various Natural Language Processing (NLP) tasks, including semantic and
syntactic analysis.

1. Word Segmentation (Tokenization) This is a fundamental and prerequisite step for most language processing applications. It involves identifying the individual words within a text.

◦Delimitation by Whitespace and Punctuation In languages like English, words are primarily delimited by whitespace and punctuation.

◦Absence of Whitespace Delimitation In languages such as Japanese, Chinese, and Thai, words are character strings without explicit whitespace delimiters. In these cases, graphically delimited units are usually larger structures like sentences or clauses.

◦Concatenation Languages like Arabic and Hebrew often concatenate certain tokens with
preceding or following elements, leading to changes in word forms and appearing as a
single, compact string of letters. These are sometimes called clitics.

◦Speech/Cognitive Units In Korean, character strings are grouped into units called
"eojeol" ("word segment"), which are typically larger than individual words but smaller than
clauses.

2.Finite-State Morphology (FSM) FSM is a prominent computational linguistic approach that employs finite-state transducers (FSTs) to model and analyse word structure.

◦Mechanism FSTs represent the relationship between surface word forms (how words
appear) and their underlying lexical or morphological descriptions (their internal structure
and features). They function by mapping input symbols to output symbols. An FST is based
on finite-state automata, where a finite set of nodes (states) are connected by directed
edges labeled with pairs of input and output symbols. This network translates a sequence
of input symbols into a sequence of corresponding output symbols.

◦Functionality FSTs are capable of computing and comparing regular relations. They
define the relationship between the input (surface string) and the output (lexical string,
which includes morphemes and their features). FSM is particularly well-suited for analysing
morphological processes in both isolating and agglutinative languages. It can be used to
build full-fledged morphological analysers, which identify morphemes within a word, or
generators, which produce word forms from given morphemes. It is also valuable for
constructing tokenizers.

◦Theoretical Basis Some morphological models, such as Functional Morphology, can be compiled into finite-state transducers.




◦Limitations A theoretical limitation of FSTs is that they primarily generate regular languages. However, some aspects of natural language, such as certain types of reduplication, exhibit non-regular patterns.

3.Other Morphological Models

◦Dictionary Lookup This is a process where word forms are associated with their
corresponding linguistic descriptions.

◦Unification-Based Morphology These models use feature structures to represent linguistic information and can be based on logic programming.

◦Functional Morphology This approach defines morphological operations using principles of functional programming and type theory, and it can be compiled into finite-state transducers.

Issues and Challenges

Deducing word structure can be challenging due to several factors:

•Ambiguity Word forms can be understood in multiple ways or have the same form but distinct functions or meanings (homonyms). Morphological parsing by itself does not disambiguate words in their context; that requires further contextual analysis.

•Irregularity Some word forms may not follow regular patterns and may not be explicitly
listed in a lexicon.

•Productivity Productivity, the ability of a language to continually form new words, poses a challenge for maintaining comprehensive lexicons.

2. Finding the Structure of Documents

Introduction to "Finding the Structure of Documents":

In human language, both written and spoken, words and sentences are not arranged
randomly; instead, they inherently possess an underlying structure. This inherent structure
includes meaningful grammatical units like sentences, requests, commands, and self-
contained units of discourse that relate to a particular point or idea. The automatic
extraction of this document structure is a fundamental and often prerequisite step for a wide
range of Natural Language Processing (NLP) applications.

For instance, tasks such as parsing, machine translation, and semantic role labeling rely on
sentences as their basic processing unit. Furthermore, the ability to chunk input text or


speech into topically coherent blocks facilitates better organisation and indexing of data,
enabling more efficient information retrieval and further processing of specific topics.

"Finding the Structure of Documents," delves into methods for identifying these
structural elements, specifically focusing on two key tasks: Sentence Boundary Detection
(SBD) and Topic Boundary Detection. It explores statistical classification approaches that
infer the presence of sentence and topic boundaries by leveraging various features of the
input, such as punctuation, pauses, and lexical cues. These methods are crucial for
transforming raw text or speech into more manageable and semantically meaningful units
for subsequent NLP analysis.

--------------------------------------------------------------------------------
1. Explain Sentence Boundary Detection and Topic Boundary Detection.

Sentence Boundary Detection (SBD)

Sentence Boundary Detection (SBD), also referred to as sentence segmentation, is a fundamental Natural Language Processing (NLP) task focused on automatically segmenting a continuous sequence of words into individual sentence units. This process is
crucial as sentences serve as a basic processing unit for many downstream NLP
applications, including parsing, machine translation, and semantic role labeling. SBD also
significantly enhances the human readability of output from automatic speech recognition
(ASR) systems.

In written text, sentence boundaries are typically identified by punctuation marks such as
periods (.), question marks (?), and exclamation marks (!). Ambiguity arises because the
same punctuation, especially a period, can signify an abbreviation rather than the end of a
sentence (e.g., "Dr." or "Mt. Rushmore"). To resolve such ambiguities, SBD systems
consider additional cues, including the presence of a punctuation mark, a pause in speech,
or the beginning of a new word in a document. Capitalized initials and numbers preceding
periods are also used to distinguish between sentence-ending punctuation and
abbreviations. Statistical methods often infer boundaries based on these features. For
example, in the Wall Street Journal Corpus, 47% of periods are used to mark an
abbreviation.
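A minimal rule-based splitter shows how these cues interact. The `ABBREVIATIONS` set and the single-initial heuristic below are illustrative assumptions, not a complete system:

```python
import re

# Rule-based sentence boundary sketch. A period ends a sentence unless the
# token is a known abbreviation ("Dr.", "Mt.") or a single capitalized
# initial ("J."). The abbreviation list is a small illustrative sample.
ABBREVIATIONS = {"dr", "mr", "mrs", "ms", "prof", "mt", "st", "etc", "e.g", "i.e"}

def split_sentences(text):
    sentences, current = [], []
    for tok in text.split():
        current.append(tok)
        if tok.endswith((".", "?", "!")):
            bare = tok.rstrip(".?!").lower()
            # A period after an abbreviation or an initial is not a boundary.
            if tok.endswith(".") and (bare in ABBREVIATIONS
                                      or re.fullmatch(r"[A-Z]", tok[:-1])):
                continue
            sentences.append(" ".join(current))
            current = []
    if current:  # trailing material without final punctuation
        sentences.append(" ".join(current))
    return sentences

print(split_sentences("Dr. Smith climbed Mt. Rushmore. It was cold!"))
# ['Dr. Smith climbed Mt. Rushmore.', 'It was cold!']
```

Rule sets like this can perform well on clean text but break down on unseen abbreviations, which motivates the statistical methods discussed below.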

For spoken language, particularly in multiparty meetings, the task becomes more complex
as traditional punctuation cues are absent. Here, SBD relies on features like pause duration
and pitch range. Dialogue acts, which represent self-contained units of discourse, can also
serve as segment boundaries in such contexts. Challenges include handling conversational
speech, which may lack clear boundaries, and errors introduced by OCR or ASR systems.


Topic Boundary Detection

Topic Boundary Detection, also known as topic segmentation, involves automatically dividing a stream of discourse or text into cohesive, topically homogeneous blocks. The
primary objective is to identify points where the subject matter or topic shifts within a
document or conversation. This task is vital for various language-understanding
applications, including information retrieval and text summarization, by enabling the
processing of content in more manageable and contextually relevant chunks.

Unlike sentence segmentation, topic segmentation typically deals with longer segments of
text. In multiparty meetings, topic segmentation often draws inspiration from discourse
analysis, though defining boundaries can be less straightforward compared to well-
structured formats like news articles. One of the main challenges for topic segmentation is
its non-trivial nature, as topic boundaries are often fluid and lack clear, linguistically explicit
markers.

Statistical approaches are commonly used for topic segmentation, inferring boundaries based on various features. For instance, the TextTiling method is a popular approach that employs a lexical cohesion metric within a word vector space to assess the similarity between consecutive text segments. A decrease in this similarity score often indicates a topic shift. This method helps identify points where new vocabulary is introduced or where the lexical content changes significantly, signalling a transition to a new topic. Other methods for computing similarity scores between blocks include block comparison and the vocabulary introduction method, which assigns a score based on the number of new words appearing in an interval.

****

2. Explanation of Methods Used in "Finding the Structure of Documents".

Methods: An Overview

The core challenge in segmenting documents into sentences or topics lies in identifying the boundaries. Sentence segmentation aims to determine where sentences begin and end, often relying on punctuation, sentence length, and other contextual cues. Topic segmentation, on the other hand, identifies boundaries between discourse or topic segments, chunking input text into topically coherent blocks.

"Finding the Structure of Documents" discusses various methods for segmenting text into
meaningful units, primarily focusing on sentence and topic boundaries.
These methods are broadly categorised into the following categories:

1. Generative Sequence Classification Methods
2. Discriminative Local Classification Methods
3. Discriminative Sequence Classification Methods
4. Hybrid Approaches


5. Extension for Global Modeling for Sentence Segmentation

1. Generative Sequence Classification Methods:

Generative models in sequence classification aim to estimate the joint probability distribution, P(X, Y), where X represents the input (e.g., words) and Y represents the desired output sequence (e.g., boundary types). The most common generative approach for this task is the Hidden Markov Model (HMM).

In the context of document segmentation, an HMM models the sequence of observed words (emitted words) and the underlying sequence of hidden states, which represent whether a word is a boundary (B) or a non-boundary (NB). The probability of the observed words P(X) and the probability of the states P(Y) are estimated, along with state transition probabilities P(Yi|Yi-1) and observation likelihoods P(Xi|Yi). For instance, a "simple two-state Markov model" can be used for sentence segmentation, where the states are 'sentence boundary' and 'non-boundary'.
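A toy two-state model makes the decoding concrete. All probabilities below are invented for illustration; a real system would estimate them from a corpus:

```python
import math

# Toy two-state Markov model for sentence segmentation.
# Hidden states: "B" (token ends a sentence) and "NB" (it does not).
STATES = ["B", "NB"]
START = {"B": 0.1, "NB": 0.9}
TRANS = {"B": {"B": 0.1, "NB": 0.9}, "NB": {"B": 0.3, "NB": 0.7}}

def emit(state, token):
    """P(token | state): only the final period matters in this sketch."""
    has_period = token.endswith(".")
    if state == "B":
        return 0.9 if has_period else 0.1
    return 0.2 if has_period else 0.8

def viterbi(tokens):
    """Most probable boundary sequence, computed in log space."""
    V = [{s: math.log(START[s]) + math.log(emit(s, tokens[0])) for s in STATES}]
    back = [{}]
    for t in range(1, len(tokens)):
        V.append({})
        back.append({})
        for s in STATES:
            prev = max(STATES, key=lambda p: V[t - 1][p] + math.log(TRANS[p][s]))
            V[t][s] = V[t - 1][prev] + math.log(TRANS[prev][s]) + math.log(emit(s, tokens[t]))
            back[t][s] = prev
    best = max(STATES, key=lambda s: V[-1][s])
    path = [best]
    for t in range(len(tokens) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return list(reversed(path))

print(viterbi(["He", "left.", "She", "stayed."]))  # ['NB', 'B', 'NB', 'B']
```

The decoder correctly labels the period-final tokens as boundaries; richer emission models would add features beyond a single punctuation check.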

The most probable boundary sequence for a given document is typically obtained using the Viterbi algorithm. While effective, the conventional HMM approach has known weaknesses, particularly its limited ability to exploit broad linguistic cues such as part-of-speech (POS) tags or prosodic features. Extensions to HMMs, such as the "hidden event language model (HELM)", have been proposed to address these limitations by incorporating additional features and supporting non-lexical information.

2. Discriminative Local Classification Methods

In contrast to generative models, discriminative classifiers directly model the conditional probability P(Yi|Xi), the probability of a label (e.g., boundary or non-boundary) given the input features. These methods are less concerned with modelling the joint distribution of observations and states and instead focus directly on the decision boundary between classes.
Discriminative approaches are widely used in many speech and language processing tasks because they often outperform generative methods, especially when training data is plentiful. They require iterative optimization and typically incorporate local and contextual features. Examples of discriminative classifiers include Naive Bayes, Support Vector Machines (SVMs), boosting, maximum entropy, and regression trees. These methods have been successfully applied to tasks like POS tagging, where a tag is assigned to a word, similar to segmenting text into boundaries.

A prominent example for topic segmentation is TextTiling, which employs a lexical cohesion
metric in a vector space. TextTiling operates as a local classification method by measuring
similarity between consecutive segmentation units, marking a boundary where the similarity
falls below a certain threshold.


3. Discriminative Sequence Classification Methods

These methods represent a more general extension of local discriminative models, specifically designed for sequence classification tasks. They infer labels by modelling dependencies across the whole sequence while still using local discriminative features. Key examples include Conditional Random Fields (CRFs) and extensions of HMMs such as Maximum Entropy Markov Models (MEMMs) and the margin infused relaxed algorithm (MIRA).

CRFs, in particular, are powerful discriminative models that globally optimise the conditional probability of a boundary sequence given all input features. They offer advantages over HMMs by allowing for rich, overlapping features, and unlike MEMMs they avoid the "label bias problem". CRFs have been widely applied to tasks such as speech segmentation, indicating their versatility in handling sequence labelling challenges.

4. Hybrid Approaches

Hybrid approaches combine the strengths of both generative and discriminative classification algorithms, aiming to leverage their respective benefits. This often involves using a generative model like an HMM to estimate probabilities, which are then refined or augmented by discriminative classifiers that can incorporate richer, more complex features such as pause duration, pitch range, or explicit rhetorical features.

For example, a hybrid approach might estimate P(yi|xi) using an HMM, and then optimise
parameters like α and β using a held-out dataset, potentially leveraging discriminative local
classification methods. Successful implementations of hybrid approaches have shown
improved performance, particularly in areas like multilingual broadcast news speech
segmentation. They can provide a more robust and accurate solution by integrating
different types of information and modelling techniques.
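The interpolation idea can be sketched in a few lines. The weight `alpha` and both posterior values below are illustrative stand-ins for scores a real HMM and discriminative classifier would produce:

```python
# Sketch of a hybrid combination: linearly interpolate a generative (HMM)
# boundary posterior with a discriminative classifier's score for the same
# candidate. The weight alpha would be tuned on held-out data; all values
# here are illustrative.
def hybrid_score(p_hmm, p_disc, alpha=0.6):
    """Interpolated posterior for a boundary candidate."""
    return alpha * p_hmm + (1 - alpha) * p_disc

# The HMM is unsure (0.55); the discriminative model, which can see pause
# duration and pitch range, is confident (0.92). The combined score is
# approximately 0.698, so the boundary is accepted at a 0.5 threshold.
score = hybrid_score(0.55, 0.92)
print(score > 0.5)  # True
```

In practice the interpolation weight is chosen to maximise accuracy on a held-out set, exactly as described for the α and β parameters above.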

5. Extensions for Global Modelling for Sentence Segmentation

Sentence segmentation, compared to other tasks like topic segmentation, faces a unique
challenge due to the potentially "quadratic number of boundaries". Traditional approaches
often focus on local decisions, but global modelling aims to improve accuracy by
considering the entire sentence structure or document context when making segmentation
decisions.

These extensions typically involve integrating local scores (e.g., from a discriminative
classifier) with higher-level, sentence-level features. This can be achieved by working with a
"pruned sentence lattice," which allows for the combination of local boundary scores with
more holistic features derived from syntactic parsing or global prosodic patterns. Such
methods lead to a more efficient and accurate manner of finding the optimal sentence
boundaries by considering broader contextual information rather than just local cues.


Performance of the Approaches

The performance of these segmentation approaches is commonly evaluated using metrics such as error rate and F1-measure, which is the harmonic mean of precision and recall. Precision measures the proportion of correctly returned sentence boundaries among all returned boundaries, while recall measures the proportion of correctly returned boundaries among all reference boundaries.
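These metrics are straightforward to compute over sets of boundary positions. The reference and hypothesis boundaries below are invented for illustration:

```python
# Precision, recall, and F1 over boundary positions.
def prf(reference, hypothesis):
    ref, hyp = set(reference), set(hypothesis)
    tp = len(ref & hyp)                      # correctly returned boundaries
    precision = tp / len(hyp) if hyp else 0.0
    recall = tp / len(ref) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Reference boundaries after tokens 4, 9, 15; the system returned 4, 9, 12:
# two of three returned boundaries are correct and two of three reference
# boundaries are found, so precision = recall = F1 = 2/3.
print(prf({4, 9, 15}, {4, 9, 12}))
```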

Various studies have reported different performance levels for these methods. For instance,
rule-based systems for sentence boundary detection have shown error rates as low as
1.41%. Supervised classifiers like SVMs, often combined with POS tag features, have
achieved F1-measures as high as 97.5%. Hybrid approaches, such as those combining
HMMs with CRFs, have also demonstrated high F1-scores, ranging from 78.2% to
89.1% in different contexts. The specific features used, such as lexical, prosodic, or
grammatical features, significantly influence the performance of discriminative approaches.

In conclusion, finding the structure of documents through sentence and topic segmentation is a complex task addressed by a range of computational methods. From foundational generative models like HMMs to advanced discriminative classifiers such as CRFs, and increasingly powerful hybrid and global modelling approaches, the field continues to evolve, leveraging sophisticated features and algorithms to enhance accuracy and efficiency.

IMPORTANT QUESTIONS

1. Explain Words and their components.

2. Explain Issues and challenges.(All 3)

3. Explain Morphological Models. (All 5)

4. Explain Sentence Boundary Detection and Topic Boundary Detection.

5. Explain methods in finding the structure of documents.

