NLP Unit 1
UNIT 1
Introduction
Words are considered the smallest linguistic units capable of conveying meaning through
utterance. However, the concept of a "word" can vary significantly across languages. The
following are the various fundamental components of words:
•Tokens: In many languages, such as English, words are delimited by whitespace and
punctuation, forming tokens. Yet, this is not a universal rule; languages like Japanese,
Chinese, and Thai utilise character strings without whitespace for word delimitation. Other
languages, like Arabic or Hebrew, concatenate certain tokens, where word forms change
depending on preceding or following elements.
•Morphemes: These are the minimal parts of words that convey meaning. Morphemes
constitute the fundamental morphological units and contribute to the overall meaning of a
word.
Morphological Typology
Morphological typology classifies languages based on the number of morphemes per word
and the degree of fusion between them. Common types include isolating, synthetic,
agglutinative, fusional, and concatenative languages; several of these are described later in
these notes.
*****
1. Irregularity:
Irregularity in language refers to instances where word forms do not follow general rules or
predictable patterns, posing a considerable challenge for morphological parsing. These
irregularities are particularly pronounced in languages with rich morphology, such as Arabic
or Korean, and can affect both derivation and inflection.
It hinders the ability of linguistic systems to generalise and abstract from observed word
forms, necessitating detailed descriptions of each irregular form. This can lead to issues
with accuracy, increased computational complexity, and difficulties in verifying associated
information.
Arabic, with its morphologically rich nature, presents complex challenges due to its deep-rooted morphological processes and
the interaction between phonology and orthography. The deep study of morphological
processes in Arabic is essential for mastering inflection and derivation, as well as for
handling irregular forms. For instance, certain irregular forms might not be derived from the
general morphological template, as seen with the word jadīd 'new'.
2. Ambiguity
Ambiguity in morphological parsing arises when a word form can be understood in multiple
ways, possessing distinct functions or meanings. While morphological parsing identifies the
components of words, it does not directly concern the disambiguation of words in their
context. Homonyms, words that look alike but have different meanings, are particularly
problematic in morphological analysis.
Korean provides clear examples of systematic homonyms, where the same word form can
represent different meanings depending on its context or the endings attached to it, as
illustrated by the example of 'mwut.ko', which can mean 'bury', 'ask', or 'bite'. Arabic, with
its morphologically rich nature, frequently presents ambiguities, especially because the
written script often omits the diacritical marks that distinguish different forms.
Even in Czech, while morphology helps to identify semantic features, it can introduce
ambiguities concerning abstract semantic forms. For example, the meaning of 'stavení' in
Czech might be unclear without additional context. This highlights that even with detailed
morphological information, the intended meaning can remain elusive, requiring further
contextual analysis.
3. Productivity
Productivity, the ability of a language to form new words continually, poses a challenge
for maintaining comprehensive lexicons.
Productivity refers to the inherent capacity of a language to generate new words or new
forms of existing words. This includes both inflectional changes (like verb conjugations or
noun declensions) and derivational processes (creating new words from existing ones).
Human languages are considered "open class" systems, meaning they can continually
create an infinite number of linguistic units, making it impossible to list every possible word
form explicitly in a dictionary or lexicon.
This ongoing creation of new forms poses a significant challenge for traditional dictionary-
based lookup approaches, which are finite and cannot account for all possible word forms.
Therefore, morphological analysis systems must be able to process and understand novel
or unseen word forms that are not explicitly pre-computed. This necessitates sophisticated
computational linguistic models, such as finite-state transducers and unification-based
approaches, which can handle productive patterns and generate/recognise new forms
based on underlying rules rather than just stored entries. The goal is to develop robust
systems that can effectively manage the dynamism of language, predicting and parsing
forms that have not been encountered before.
*****
MORPHOLOGICAL MODELS
1. Dictionary Lookup:
Dictionary lookup is a fundamental process in morphological analysis where word forms are
associated with their corresponding linguistic descriptions. This method relies on
precomputed data structures like lists, dictionaries, or databases, which are kept
synchronised with sophisticated morphological models.
Limitations: While effective, the set of associations between word forms and their
desired descriptions is finite, so the generative potential of the language is not fully
exploited. Building and maintaining such resources can be tedious, prone to errors, and
inefficient for large or unreliable linguistic data, although enumerative models are often
sufficient for general purposes. The approach is also less suitable for complex morphology,
as seen in Korean, where dictionary-based approaches would require a large dictionary of
all possible combinations of allomorphs and morphological alternations.
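As a rough illustration of dictionary lookup, the following minimal Python sketch maps word forms to precomputed analyses; the lexicon entries and tag labels are invented for illustration and not taken from any particular resource:

# Minimal sketch of dictionary-based morphological lookup.
# The lexicon entries and tag notation below are invented for illustration.
LEXICON = {
    "cats": [("cat", "NOUN", "plural")],
    "ran":  [("run", "VERB", "past")],
    "saw":  [("see", "VERB", "past"), ("saw", "NOUN", "singular")],  # homonymy
}

def lookup(word_form):
    """Return all stored analyses for a word form, or an empty list if unseen."""
    return LEXICON.get(word_form.lower(), [])

print(lookup("saw"))   # two analyses -> ambiguity
print(lookup("dogs"))  # [] -> productivity problem: unseen forms cannot be analysed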
2. Finite-State Morphology:
Functionality: Finite-state transducers (FSTs) can compute and compare regular relations,
defining the relationship between an input (surface string) and an output (lexical string,
including morphemes and features). Finite-state morphology is well-suited for analysing
morphological processes in various languages, including isolating and agglutinative types.
FSTs can be used to construct full-fledged morphological analysers (parsing words into
morphemes), morphological generators (producing word forms from morphemes), and
tokenizers.
Advantages: FSTs are flexible, efficient, and robust. They offer a general-purpose
approach for pattern matching and substitution, allowing for the building of complex
morphological analysers and generators.
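A minimal sketch of an FST-style analyser in Python is shown below; the states, transitions, and feature labels are invented toy values covering only the forms "cat" and "cats", not a real morphological grammar:

# Toy finite-state transducer mapping a few surface forms to lexical strings,
# e.g. "cats" -> "cat+N+PL". States and labels are invented for illustration.
# transitions: (state, input_symbol) -> (next_state, output_symbol)
TRANSITIONS = {
    (0, "c"): (1, "c"), (1, "a"): (2, "a"), (2, "t"): (3, "t"),
    (3, "s"): (4, "+N+PL"),   # the plural suffix maps to morphological features
    (3, ""):  (5, "+N+SG"),   # epsilon-like move for the bare singular form
}
FINAL_STATES = {4, 5}

def analyse(surface):
    state, output = 0, []
    for ch in surface:
        if (state, ch) not in TRANSITIONS:
            return None
        state, out = TRANSITIONS[(state, ch)]
        output.append(out)
    # allow one final epsilon-like move (e.g. the singular reading of "cat")
    if state not in FINAL_STATES and (state, "") in TRANSITIONS:
        state, out = TRANSITIONS[(state, "")]
        output.append(out)
    return "".join(output) if state in FINAL_STATES else None

print(analyse("cats"))  # cat+N+PL
print(analyse("cat"))   # cat+N+SG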
3. Unification-Based Morphology:
Functionality: This model can manage complex and recursively nested linguistic
information, expressed by atomic symbols or more appropriate data structures.
Unification, as the key operation, merges informative feature structures, making it highly
versatile for representing intricate linguistic details.
Advantages: These models are typically formulated as logic programs and use unification
to solve constraint systems, which offers advantages such as better abstraction possibilities.
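A small sketch of unification over feature structures, represented here as nested Python dictionaries; the feature names are invented for illustration:

# Sketch of unification over simple (possibly nested) feature structures,
# represented as Python dicts. Feature names are invented for illustration.
def unify(fs1, fs2):
    """Merge two feature structures; return None if they carry conflicting values."""
    result = dict(fs1)
    for key, value in fs2.items():
        if key not in result:
            result[key] = value
        elif isinstance(result[key], dict) and isinstance(value, dict):
            sub = unify(result[key], value)   # recurse into nested structures
            if sub is None:
                return None
            result[key] = sub
        elif result[key] != value:
            return None                       # clash: unification fails
    return result

stem   = {"pos": "verb", "agr": {"person": 3}}
suffix = {"agr": {"person": 3, "number": "sg"}, "tense": "present"}
print(unify(stem, suffix))           # merged structure
print(unify(stem, {"pos": "noun"}))  # None: conflicting category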
4. Functional Morphology:
Functional Morphology is a model that defines morphological operations using principles of
functional programming and type theory.
Advantages: This approach offers greater freedom for developers to define their own
lexical constructions, leading to domain-specific embedded languages for morphological
analysis. It supports full-featured, real-world applications and promotes reusability of
linguistic data.
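In the spirit of functional morphology, a paradigm can be sketched as an ordinary function from a stem to a table of inflected forms; the rules below are a toy treatment of regular English verbs only:

# Sketch in the spirit of functional morphology: an inflection paradigm is a
# function from a stem to a table of forms. The regular English verb rules
# below are illustrative only and do not handle irregular verbs.
def regular_verb_paradigm(stem):
    return {
        "infinitive":  stem,
        "3sg_present": stem + ("es" if stem.endswith(("s", "sh", "ch")) else "s"),
        "past":        stem + ("d" if stem.endswith("e") else "ed"),
        "present_participle": (stem[:-1] if stem.endswith("e") else stem) + "ing",
    }

print(regular_verb_paradigm("bake"))
print(regular_verb_paradigm("wash"))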
5. Morphology Induction:
Morphology Induction focuses on discovering and inferring word structure, moving beyond
pre-existing linguistic knowledge.
Motivation: This approach is especially valuable for languages where linguistic expertise is
limited or unavailable or for situations where an unsupervised or semi-supervised learning
method is preferred.
Challenges: Deducing word structure from forms and context presents several
challenges, including dealing with ambiguity and irregularity in morphology, as well
as orthographic and phonological alterations and non-linear morphological
processes.
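A very rough sketch of one induction idea, proposing suffixes by checking whether the remainder after a known stem recurs across a word list; the word list and frequency threshold are invented for illustration:

# Rough sketch of unsupervised morphology induction: propose suffixes by
# counting how often a split point leaves a stem that itself occurs in the
# word list. The corpus and threshold below are invented for illustration.
from collections import Counter

words = ["walk", "walks", "walked", "walking", "talk", "talks", "talked",
         "play", "plays", "played", "playing"]
vocab = set(words)

suffix_counts = Counter()
for w in words:
    for i in range(1, len(w)):
        stem, suffix = w[:i], w[i:]
        if stem in vocab:            # remainder after a known stem is a suffix candidate
            suffix_counts[suffix] += 1

# keep suffixes observed with at least two different stems
print([s for s, c in suffix_counts.items() if c >= 2])  # e.g. ['s', 'ed', 'ing']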
These models, while distinct, complement each other, offering various tools and
perspectives for addressing the complex task of finding and representing the structure of
words across the diverse range of human languages. The choice of model often depends
on the specific language being analysed and the desired application.
***
1. What is a morpheme?
A morpheme is the minimal part of a word that conveys meaning; morphemes (also called
morphs or segments) are the fundamental morphological units from which words are built.
2. What is Morphology?
Morphology is the study of word structure and formation. It examines how words are
constructed from smaller meaningful units called morphemes and how these units combine
to form complex words. The discovery of word structure is specifically referred to as
morphological parsing. Morphological analysis is considered an essential part of
language processing, as it helps convert diverse word forms into well-defined linguistic units
with explicit lexical and morphological properties. Understanding word structure involves
identifying distinct types of units in human languages and how their internal structure
connects with grammatical properties and lexical concepts
Words are the smallest linguistic units that can form a complete utterance by themselves.
Their internal structure can be modelled in relation to their grammatical properties and the
lexical concepts they represent. The discovery of this word structure is known as
morphological parsing.
The structure of words is built upon morphemes, which are defined as the minimal parts of
a word that convey meaning. These are also referred to as segments or morphs and are
considered the fundamental morphological units.
Human languages employ various methods to combine these morphs and morphemes into
complete word forms.
•The simplest method is concatenation, where morphemes are joined sequentially, such
as in "dis-agree-ment-s". In this example, "agree" is a free lexical morpheme, while "dis-",
"-ment-", and "-s" are bound grammatical morphemes that contribute partial meaning (a
small segmentation sketch follows this list).
•In more complex systems, morphs can interact with each other, leading to
morphophonemic changes where their forms undergo additional phonological and
orthographic modifications. Different forms of the same morpheme are called allomorphs.
•Word structure is frequently described by how stems combine with root and pattern
morphemes, along with other elements that may be attached to either side.
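As referenced above, the following is a toy affix-stripping sketch for purely concatenative forms such as "disagreements"; the affix and stem inventories are illustrative and far from complete:

# Toy affix-stripping segmentation for purely concatenative word forms,
# e.g. "disagreements" -> dis-agree-ment-s. Affix lists are illustrative only.
PREFIXES = ["dis", "un", "re"]
SUFFIXES = ["s", "ment", "ing", "ed"]
STEMS = {"agree", "do", "play"}

def segment(word):
    prefixes, suffixes = [], []
    changed = True
    while changed:
        changed = False
        for p in PREFIXES:
            if word.startswith(p) and len(word) > len(p):
                prefixes.append(p); word = word[len(p):]; changed = True
        for s in SUFFIXES:
            if word.endswith(s) and len(word) > len(s):
                suffixes.insert(0, s); word = word[:-len(s)]; changed = True
    return prefixes + [word] + suffixes if word in STEMS else None

print(segment("disagreements"))  # ['dis', 'agree', 'ment', 's']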
It is important to note that some properties or features of a word may not be explicitly visible
in its morphological structure. The structural components can be associated with, and
dependent on, multiple functions concurrently, without necessarily having a singular
grammatical interpretation within their lexical meaning.
Ultimately, the way word structure is described can depend on the specific language being
analysed and the morphological theory being applied. Deducing word structure can be
challenging due to factors such as ambiguity, irregularity, and variations in orthography and
phonology.
Allomorphs are the alternative forms of a morpheme. They represent variations of a single
morpheme that are chosen based on phonological context or other linguistic rules.
OR
What are the foundational concepts and methodologies for understanding word
structure across languages?
1.Words as Basic Linguistic Units Words are considered the smallest linguistic units
capable of forming a complete utterance by themselves. Their internal structure can be
modeled in relation to their grammatical properties and the lexical concepts they represent.
2.Morphemes The structure of words is fundamentally built upon morphemes, which are
defined as the minimal parts of a word that convey meaning. They are also referred to as
segments or morphs and are considered the elementary morphological units.
◦Concatenation The simplest method is sequential joining, as seen in words like "dis-
agree-ment-s". In this example, "agree" is a free lexical morpheme (can stand alone), while
"dis-", "-ment-", and "-s" are bound grammatical morphemes (cannot stand alone) that
contribute partial meaning.
◦Stems, Roots, and Patterns Word structure is frequently described by how stems
combine with root and pattern morphemes, along with other elements that may be
attached to either side.
◦Implicit Properties It's important to note that some properties or features of a word may
not be explicitly visible in its morphological structure. Word structure components can be
associated with and dependent on multiple functions concurrently, without necessarily
having a singular grammatical interpretation within their lexical meaning.
◦Synthetic Languages These languages combine more morphemes per word than
isolating languages.
◦Fusional Languages These languages (e.g., Arabic, Czech, Latin, Sanskrit, German)
often have a feature-per-morpheme ratio higher than one, meaning a single morpheme
can convey multiple grammatical features.
◦Concatenative Languages These languages link morphs and morphemes one after
another.
The discovery of word structure is broadly known as morphological parsing. This process
is crucial for various Natural Language Processing (NLP) tasks, including semantic and
syntactic analysis.
◦Concatenation Languages like Arabic and Hebrew often concatenate certain tokens with
preceding or following elements, leading to changes in word forms and appearing as a
single, compact string of letters. These are sometimes called clitics.
◦Speech/Cognitive Units In Korean, character strings are grouped into units called
"eojeol" ("word segment"), which are typically larger than individual words but smaller than
clauses.
◦Mechanism FSTs represent the relationship between surface word forms (how words
appear) and their underlying lexical or morphological descriptions (their internal structure
and features). They function by mapping input symbols to output symbols. An FST is based
on finite-state automata, where a finite set of nodes (states) are connected by directed
edges labeled with pairs of input and output symbols. This network translates a sequence
of input symbols into a sequence of corresponding output symbols.
◦Functionality FSTs are capable of computing and comparing regular relations. They
define the relationship between the input (surface string) and the output (lexical string,
which includes morphemes and their features). Finite-state morphology is particularly well-suited for analysing
morphological processes in both isolating and agglutinative languages. It can be used to
build full-fledged morphological analysers, which identify morphemes within a word, or
generators, which produce word forms from given morphemes. It is also valuable for
constructing tokenizers.
◦Dictionary Lookup This is a process where word forms are associated with their
corresponding linguistic descriptions.
•Ambiguity Word forms can be understood in multiple ways or have the same form but
distinct functions or meanings (homonyms). Morphological parsing identifies the possible
analyses, while disambiguating words in their context is typically left to later processing.
•Irregularity Some word forms may not follow regular patterns and may not be explicitly
listed in a lexicon.
In human language, both written and spoken, words and sentences are not arranged
randomly; instead, they inherently possess an underlying structure. This inherent structure
includes meaningful grammatical units like sentences, requests, commands, and self-
contained units of discourse that relate to a particular point or idea. The automatic
extraction of this document structure is a fundamental and often prerequisite step for a wide
range of Natural Language Processing (NLP) applications.
For instance, tasks such as parsing, machine translation, and semantic role labeling rely on
sentences as their basic processing unit. Furthermore, the ability to chunk input text or
speech into topically coherent blocks facilitates better organisation and indexing of data,
enabling more efficient information retrieval and further processing of specific topics.
"Finding the Structure of Documents," delves into methods for identifying these
structural elements, specifically focusing on two key tasks: Sentence Boundary Detection
(SBD) and Topic Boundary Detection. It explores statistical classification approaches that
infer the presence of sentence and topic boundaries by leveraging various features of the
input, such as punctuation, pauses, and lexical cues. These methods are crucial for
transforming raw text or speech into more manageable and semantically meaningful units
for subsequent NLP analysis.
--------------------------------------------------------------------------------
1. Explain Sentence Boundary Detection and Topic Boundary Detection.
In written text, sentence boundaries are typically identified by punctuation marks such as
periods (.), question marks (?), and exclamation marks (!). Ambiguity arises because the
same punctuation, especially a period, can signify an abbreviation rather than the end of a
sentence (e.g., "Dr." or "Mt. Rushmore"). To resolve such ambiguities, SBD systems
consider additional cues, including the presence of a punctuation mark, a pause in speech,
or the beginning of a new word in a document. Capitalized initials and numbers preceding
periods are also used to distinguish between sentence-ending punctuation and
abbreviations. Statistical methods often infer boundaries based on these features. For
example, in the Wall Street Journal Corpus, 47% of periods are used to mark an
abbreviation.
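A rough rule-based sketch of this kind of period disambiguation is shown below; the abbreviation list is illustrative, not exhaustive:

# Rule-based sketch of sentence boundary detection: a period is treated as a
# boundary unless it follows a known abbreviation or a single initial.
# The abbreviation list is illustrative, not exhaustive.
import re

ABBREVIATIONS = {"dr", "mr", "mrs", "mt", "prof", "etc", "e.g", "i.e"}

def split_sentences(text):
    tokens = text.split()
    sentences, current = [], []
    for i, tok in enumerate(tokens):
        current.append(tok)
        if tok.endswith((".", "?", "!")):
            word = tok.rstrip(".?!").lower()
            is_abbrev = word in ABBREVIATIONS or re.fullmatch(r"[a-z]", word)
            next_lower = i + 1 < len(tokens) and tokens[i + 1][0].islower()
            if not (tok.endswith(".") and (is_abbrev or next_lower)):
                sentences.append(" ".join(current))
                current = []
    if current:
        sentences.append(" ".join(current))
    return sentences

print(split_sentences("Dr. Smith visited Mt. Rushmore. It was impressive!"))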
For spoken language, particularly in multiparty meetings, the task becomes more complex
as traditional punctuation cues are absent. Here, SBD relies on features like pause duration
and pitch range. Dialogue acts, which represent self-contained units of discourse, can also
serve as segment boundaries in such context. Challenges include handling conversational
speech, which may lack clear boundaries, and errors introduced by OCR or ASR systems.
Unlike sentence segmentation, topic segmentation typically deals with longer segments of
text. In multiparty meetings, topic segmentation often draws inspiration from discourse
analysis, though defining boundaries can be less straightforward compared to well-
structured formats like news articles. One of the main challenges for topic segmentation is
its non-trivial nature, as topic boundaries are often fluid and lack clear, linguistically explicit
markers.
Statistical approaches are commonly used for topic segmentation, inferring boundaries
based on various features. For instance, the TextTiling method is a popular approach that
employs a lexical cohesion metric within a word vector space to assess the similarity
between consecutive text segments. A decrease in this similarity score often indicates a
topic shift. This method helps identify points where new vocabulary is introduced or where
the lexical content changes significantly, signalling a transition to a new topic. Other
methods for computing similarity scores between blocks include block comparison and the
vocabulary introduction method, which assigns a score based on the number of new words
appearing in an interval.
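A small sketch of the underlying similarity computation, comparing adjacent blocks as word-count vectors with cosine similarity; the example blocks are invented:

# Sketch of a TextTiling-style lexical cohesion score: represent adjacent
# blocks of text as word-count vectors and compute their cosine similarity;
# low similarity between neighbouring blocks suggests a topic boundary.
from collections import Counter
import math

def cosine(block_a, block_b):
    va, vb = Counter(block_a.lower().split()), Counter(block_b.lower().split())
    dot = sum(va[w] * vb[w] for w in set(va) & set(vb))
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0

blocks = [
    "the match ended with a late goal by the home team",
    "fans of the team celebrated the goal after the match",
    "the central bank raised interest rates again this quarter",
]
for i in range(len(blocks) - 1):
    sim = cosine(blocks[i], blocks[i + 1])
    print(f"similarity({i},{i+1}) = {sim:.2f}")  # a sharp drop hints at a topic shift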
****
Methods: An Overview
"Finding the Structure of Documents" discusses various methods for segmenting text into
meaningful units, primarily focusing on sentence and topic boundaries.
These methods are broadly categorised into generative sequence models (such as hidden
Markov models), discriminative approaches (both local classifiers and sequence models
such as conditional random fields), and hybrid approaches that combine the two.
1. Generative Sequence Classification Methods
The most probable boundary sequence for a given document is typically obtained using the
Viterbi algorithm. While effective, the conventional HMM approach has known weaknesses,
particularly its inability to make effective use of broad linguistic cues such as part-of-speech
(POS) tags or prosodic features for sentence segmentation. Extensions to HMMs, such as
the hidden event language model (HELM), have been proposed to address these
limitations by incorporating additional features and supporting non-lexical information.
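A minimal Viterbi sketch over two boundary labels per token is given below; the transition and emission probabilities are invented toy values, not estimates from data:

# Minimal Viterbi sketch over two hidden labels per token position:
# "B" (a sentence boundary follows the token) and "O" (no boundary).
# Transition and emission probabilities are invented for illustration.
import math

STATES = ["B", "O"]
TRANS = {("O", "O"): 0.8, ("O", "B"): 0.2, ("B", "O"): 0.9, ("B", "B"): 0.1}
START = {"O": 0.9, "B": 0.1}

def viterbi(emissions):
    """emissions: list of dicts mapping each state to P(observation | state)."""
    path = [{s: (math.log(START[s]) + math.log(emissions[0][s]), None) for s in STATES}]
    for t in range(1, len(emissions)):
        column = {}
        for s in STATES:
            best_prev = max(STATES, key=lambda p: path[t - 1][p][0] + math.log(TRANS[(p, s)]))
            score = path[t - 1][best_prev][0] + math.log(TRANS[(best_prev, s)]) + math.log(emissions[t][s])
            column[s] = (score, best_prev)
        path.append(column)
    # backtrack from the best final state
    state = max(STATES, key=lambda s: path[-1][s][0])
    labels = [state]
    for t in range(len(emissions) - 1, 0, -1):
        state = path[t][state][1]
        labels.append(state)
    return list(reversed(labels))

# e.g. tokens "went" "home" "." "She": the period strongly signals a boundary
emissions = [{"B": 0.1, "O": 0.9}, {"B": 0.1, "O": 0.9}, {"B": 0.9, "O": 0.1}, {"B": 0.1, "O": 0.9}]
print(viterbi(emissions))  # expected: ['O', 'O', 'B', 'O']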
2. Discriminative Local Classification Methods
Discriminative approaches are widely used in speech and language processing tasks
because they often outperform generative methods, especially when training data is
plentiful. They require iterative optimisation and typically incorporate local and contextual
features. Examples of discriminative classifiers include support vector machines (SVMs),
boosting, maximum entropy models, and regression trees. These methods have been
successfully applied to tasks like POS tagging, where a tag is assigned to each word,
which is analogous to labelling candidate boundary positions in text.
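A small sketch of such a local classifier using scikit-learn (assumed to be installed); the features and the tiny training set are invented for illustration:

# Sketch of discriminative local classification of candidate boundaries with
# scikit-learn. Each candidate period is described by local features; the
# tiny training set below is invented for illustration.
from sklearn.linear_model import LogisticRegression

# features per candidate: [next_word_capitalised, token_is_abbreviation, pause_duration]
X_train = [
    [1, 0, 0.6],   # "... home. She ..."          -> boundary
    [1, 1, 0.1],   # "... Dr. Smith ..."          -> not a boundary
    [0, 0, 0.0],   # "... 3.5 percent ..."        -> not a boundary
    [1, 0, 0.9],   # long pause, capitalised word -> boundary
]
y_train = [1, 0, 0, 1]

clf = LogisticRegression().fit(X_train, y_train)
print(clf.predict([[1, 0, 0.7], [1, 1, 0.05]]))  # likely [1, 0]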
A prominent example for topic segmentation is TextTiling, which employs a lexical cohesion
metric in a vector space. TextTiling operates as a local classification method by measuring
similarity between consecutive segmentation units, marking a boundary where the similarity
falls below a certain threshold.
3. Discriminative Sequence Classification Methods
Conditional random fields (CRFs) are powerful discriminative models that globally optimise
the conditional probability of a boundary sequence given all input features. They offer advantages over
HMMs by allowing for rich, overlapping features and avoiding the "label bias problem"
common in MEMMs. CRFs have been widely applied to tasks such as speech
segmentation, indicating their versatility in handling sequence labelling challenges.
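A short sketch of boundary labelling with a CRF, assuming the optional sklearn-crfsuite package is installed; the features and training data are invented for illustration:

# Sketch of sequence labelling of boundaries with a CRF (sklearn-crfsuite,
# assumed installed). Tokens are labelled 'B' (a sentence boundary follows)
# or 'O'; the tiny training data and features are invented.
import sklearn_crfsuite

def token_features(tokens, i):
    tok = tokens[i]
    return {
        "lower": tok.lower(),
        "ends_with_period": tok.endswith("."),
        "next_capitalised": i + 1 < len(tokens) and tokens[i + 1][0].isupper(),
    }

sentences = [
    (["He", "left", "early", "."], ["O", "O", "O", "B"]),
    (["Dr.", "Smith", "arrived", "."], ["O", "O", "O", "B"]),
]
X = [[token_features(toks, i) for i in range(len(toks))] for toks, _ in sentences]
y = [labels for _, labels in sentences]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
crf.fit(X, y)
print(crf.predict(X))  # predicted label sequences for the training data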
4. Hybrid Approaches
For example, a hybrid approach might estimate P(y_i | x_i) using an HMM, and then optimise
interpolation parameters such as α and β using a held-out dataset, potentially leveraging discriminative local
classification methods. Successful implementations of hybrid approaches have shown
improved performance, particularly in areas like multilingual broadcast news speech
segmentation. They can provide a more robust and accurate solution by integrating
different types of information and modelling techniques.
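A minimal sketch of such an interpolation, where α weights the two models and β acts as a decision threshold; both would be tuned on held-out data, and the probabilities below are invented:

# Sketch of a hybrid combination: interpolate a boundary posterior from a
# generative model (e.g. an HMM) with one from a discriminative classifier.
# alpha and beta would be tuned on held-out data; values here are invented.
def hybrid_boundary_scores(p_hmm, p_disc, alpha=0.4, beta=0.5):
    combined = [alpha * h + (1 - alpha) * d for h, d in zip(p_hmm, p_disc)]
    decisions = [score >= beta for score in combined]
    return combined, decisions

p_hmm  = [0.20, 0.75, 0.10, 0.60]   # P(boundary) from the generative model
p_disc = [0.30, 0.90, 0.05, 0.40]   # P(boundary) from the discriminative model
scores, boundaries = hybrid_boundary_scores(p_hmm, p_disc)
print(scores)      # interpolated posteriors
print(boundaries)  # boundary decisions after thresholding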
Sentence segmentation, compared to other tasks like topic segmentation, faces a unique
challenge due to the potentially "quadratic number of boundaries". Traditional approaches
often focus on local decisions, but global modelling aims to improve accuracy by
considering the entire sentence structure or document context when making segmentation
decisions.
These extensions typically involve integrating local scores (e.g., from a discriminative
classifier) with higher-level, sentence-level features. This can be achieved by working with a
"pruned sentence lattice," which allows for the combination of local boundary scores with
more holistic features derived from syntactic parsing or global prosodic patterns. Such
methods lead to a more efficient and accurate manner of finding the optimal sentence
boundaries by considering broader contextual information rather than just local cues.
Various studies have reported different performance levels for these methods. For instance,
rule-based systems for sentence boundary detection have shown error rates as low as
1.41%. Supervised classifiers like SVMs, often combined with POS tag features, have
achieved F1-measures as high as 97.5%. Hybrid approaches, such as those combining
HMMs with CRFs, have also demonstrated high F1-scores, ranging from 78.2% to
89.1% in different contexts. The specific features used, such as lexical, prosodic, or
grammatical features, significantly influence the performance of discriminative approaches.
IMPORTANT QUESTIONS