Multilingual Issues

The document discusses the challenges of word segmentation in languages like Chinese, which lack explicit word boundaries, highlighting the ambiguity in segmenting phrases and the importance of choosing the correct segmentation for understanding. It compares different parsing methods, including integrated parsing, pipeline approaches, and the use of word lattices, emphasizing the benefits of allowing multiple segmentation possibilities. Additionally, it covers morphological analysis, the complexity of morphemes in various languages, and the significance of semantic parsing in natural language processing.

The text introduces the concept of word segmentation, which is necessary because many written languages (Chinese being the prime example) do not use spaces or other explicit marks to separate words the way English does. A continuous string of characters must be broken down into meaningful word units.
Ambiguity Example: It provides a concrete example using the Chinese text: 北京
大学生比赛 (Běijīng dàxuéshēng bǐsài).
Segmentation 1 (Plausible): The text shows one way to segment this:
北京 (Běijīng) = Beijing
大学生 (dàxuéshēng) = university students
比赛 (bǐsài) = competition
Meaning: This segmentation translates to "competition among university
students in Beijing".
Segmentation 2 (Implied Alternative): It then introduces a crucial point of
ambiguity. If the first part, 北京大学 (Běijīng Dàxué), is interpreted as a single unit
meaning "Beijing University", then the segmentation would have to be
different. The text cuts off, but the implication is that the segmentation would
likely become:
北京大学 (Běijīng Dàxué) = Beijing University
生 (shēng) = student (or possibly part of another word, depending on context)
比赛 (bǐsài) = competition
Meaning: This segmentation would lead to a different meaning, perhaps
"competition for Beijing University students" or similar, depending on how the
remaining '生' is handled.
In essence: This section highlights that:
Word segmentation is a fundamental task for processing languages like Chinese that lack word delimiters.
The task is non-trivial because the same sequence of characters can often be segmented in multiple valid ways, leading to different interpretations (ambiguity). Choosing the correct segmentation is crucial for understanding the text.
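The ambiguity above can be made concrete by enumerating every dictionary-consistent segmentation. A minimal sketch, assuming a toy lexicon invented for illustration:

```python
# Exhaustively enumerate segmentations of a Chinese character string.
# LEXICON is a toy dictionary assumed for this example only.
LEXICON = {"北京", "北京大学", "大学", "大学生", "生", "比赛"}

def segmentations(s):
    """Return every way to split s into lexicon words."""
    if not s:
        return [[]]
    results = []
    for i in range(1, len(s) + 1):
        word = s[:i]
        if word in LEXICON:
            for rest in segmentations(s[i:]):
                results.append([word] + rest)
    return results

for seg in segmentations("北京大学生比赛"):
    print(" / ".join(seg))
# Both readings from the text appear among the results:
#   北京 / 大学生 / 比赛   ("competition among university students in Beijing")
#   北京大学 / 生 / 比赛   (the "Beijing University" reading)
```

A real segmenter would score these alternatives rather than merely list them.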
This passage discusses different strategies for Chinese parsing, focusing
specifically on the challenge of word segmentation (determining word
boundaries in a character stream, as Chinese doesn't typically use spaces).
Integrated Parsing and Segmentation:
One approach mentioned is to have the parser perform word segmentation as
part of the parsing process itself.
In this method, the structure built by the parser (the parse tree) inherently defines word boundaries: nonterminals in the tree that span a group of characters effectively mark those boundaries.
However, a study found that immediate (local) context is the most useful factor for predicting word boundaries, more so than the global sentence context captured by the entire parse tree. This suggests that while the integrated approach can capture long-distance dependencies that might resolve some segmentation ambiguities, relying on local information may be more effective for boundary detection overall.
Pipeline Approach and its Drawback:
The text contrasts the integrated approach with a pipeline method. This
usually involves:
Step 1: Use a dedicated word segmentation model to produce the single best
segmentation for the sentence.
Step 2: Feed this segmented sentence into the parser.
The major disadvantage highlighted is that if the initial segmentation model
produces only one result (even if other segmentations were plausible), the
parser is stuck with it. It has no way to consider alternative segmentations,
even if the initial one leads to a poor or impossible parse.
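The failure mode can be sketched with a hypothetical greedy longest-match segmenter over an invented toy lexicon; once it commits to its single output, the downstream parser never sees the alternatives:

```python
# 1-best greedy longest-match segmentation: the pipeline's first stage.
# The lexicon is a toy assumption for illustration.
LEXICON = {"北京", "北京大学", "大学", "大学生", "生", "比赛"}

def segment_1best(chars):
    """Commit to one segmentation by always taking the longest match."""
    out, i = [], 0
    while i < len(chars):
        for j in range(len(chars), i, -1):   # try the longest word first
            if chars[i:j] in LEXICON:
                out.append(chars[i:j])
                i = j
                break
        else:
            out.append(chars[i])             # unknown character: emit as-is
            i += 1
    return out

print(segment_1best("北京大学生比赛"))
# Greedy matching picks 北京大学 / 生 / 比赛; the parser is stuck with this
# even if 北京 / 大学生 / 比赛 would yield a better parse.
```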
Parsing Word Lattices (A More Flexible Approach):

To overcome the limitation of the pipeline, the text proposes using a word
lattice as input to the parser.

2
A word lattice is a compact representation (like a graph or finite-state
automaton) that encodes multiple possible word segmentations
simultaneously.
Drawing on established results (Bar-Hillel et al.), parsers designed for Context-
Free Grammars (CFGs) can be adapted to process these lattices directly.
Instead of indexing into a simple string, they use the states of the automaton
(lattice) as generalized indices.
The benefit: The parser can explore the different segmentation paths within
the lattice. It can potentially use information from the segmentation model
(e.g., probabilities or rankings associated with different paths/words in the
lattice) and its own grammatical knowledge to select the segmentation that
results in the most accurate overall parse.
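A sketch of the lattice idea, using character positions 0–7 as automaton states; the edges and log-probabilities are invented stand-ins for a real segmentation model's output:

```python
# Word lattice over 北京大学生比赛: each edge (start, end, word, log_prob)
# is one segmentation hypothesis. Scores here are made up for illustration.
EDGES = [
    (0, 2, "北京", -1.0),
    (0, 4, "北京大学", -1.5),
    (2, 4, "大学", -1.2),
    (2, 5, "大学生", -0.8),
    (4, 5, "生", -2.0),
    (5, 7, "比赛", -0.5),
]

def best_path(edges, start, goal):
    """Highest-scoring word sequence through the lattice (Viterbi-style)."""
    best = {start: (0.0, [])}
    for s, e, w, lp in sorted(edges):    # sorting by position is topological here
        if s in best:
            score = best[s][0] + lp
            if e not in best or score > best[e][0]:
                best[e] = (score, best[s][1] + [w])
    return best[goal]

score, words = best_path(EDGES, 0, 7)
print(words)   # ['北京', '大学生', '比赛'] wins under these invented scores
```

A lattice-aware parser would additionally weigh each path's grammaticality, not just the segmentation scores.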
In essence: The text compares methods for handling Chinese word segmentation during parsing. It critiques the inflexibility of using only the single best segmentation from a dedicated model (the pipeline) and notes the limitation found in the fully integrated approach (local context dominating). It then presents parsing word lattices as a promising alternative that lets the parser consider multiple segmentation possibilities simultaneously, potentially yielding more robust and accurate results by combining segmentation likelihood with syntactic analysis.

Morphology: Definition & Core Issue

Definition: Morphology studies how words are formed from smaller
meaningful units called morphemes (stems combined with other components).
Core Idea: The meaning of a word is derived from the combination of its
morphemes' meanings.
Problem: Simply splitting text by spaces (tokenization) is insufficient or
problematic in many languages because individual words carry complex
internal structure.
In some languages, splitting text into words using spaces does not work well because words can be made up of smaller meaningful units called morphemes. A morpheme is the smallest part of a word that has meaning, such as a prefix, suffix, or stem.

Multilingual Dimension
This is a significant issue in languages beyond those easily handled by space-splitting.
Agglutinative Languages (e.g., Turkish, Finnish): Characterized by combining many morphemes together, creating very complex words.
Inflectional Languages (e.g., Czech, Russian): Use numerous morphemes to mark grammatical properties like case, gender, number, and tense. While potentially less extreme than agglutinative languages, they still pose significant challenges.
Morphemes for different properties (e.g., gender, case) can often combine orthogonally (independently).
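The agglutinative case can be sketched with a toy analyzer for the Turkish word evlerimde, "in my houses" (ev "house" + ler plural + im "my" + de locative); the suffix list and greedy right-to-left stripping are simplifications for illustration:

```python
# Toy morpheme splitter: peel known suffixes off the right edge of a word.
# SUFFIXES is a tiny assumed inventory, not a real Turkish morphology.
SUFFIXES = ["ler", "im", "de"]

def split_morphemes(word):
    """Greedily strip suffixes; whatever remains is treated as the stem."""
    morphemes = []
    changed = True
    while changed:
        changed = False
        for suffix in SUFFIXES:
            if word.endswith(suffix) and len(word) > len(suffix):
                morphemes.insert(0, suffix)
                word = word[: -len(suffix)]
                changed = True
                break
    return [word] + morphemes

print(split_morphemes("evlerimde"))  # ['ev', 'ler', 'im', 'de']
```

A real analyzer must also handle vowel harmony and ambiguity between competing splits.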

Challenges Arising from Complex Morphology

Vast Number of Word Forms: The combination of various morphemes leads to a huge number of possible inflected forms for a single base word (stem).
Example (Czech Adjectives): A Czech adjective can potentially inflect for 4 grammatical genders (masculine animate, e.g. male humans; masculine inanimate, e.g. a table; feminine; neuter), 7 cases, 3 degrees of comparison, positive/negative polarity, and singular/plural number, resulting in up to 336 distinct word forms per adjective stem.
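A quick check of the arithmetic behind the 336 figure; note that the singular/plural factor is needed to make the dimensions multiply out (4 × 7 × 3 × 2 alone gives only 168):

```python
# Paradigm size of a Czech adjective from its inflectional dimensions.
genders = 4    # masculine animate, masculine inanimate, feminine, neuter
numbers = 2    # singular, plural
cases = 7
degrees = 3    # positive, comparative, superlative
polarity = 2   # affirmative, negated

print(genders * numbers * cases * degrees * polarity)  # 336
```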

Morphological Ambiguity: A single word form might be analyzable into different sets of morphemes, leading to different interpretations. This is a distinct challenge in addition to syntactic ambiguity (sentence-structure ambiguity).
How to handle this?
The task of finding the most likely sequence of morphemes in a word is similar
to POS tagging.
Approach to Handling Morphological Disambiguation
Goal: Determine the most likely sequence/combination of morphemes for a
given word.
Method: Reduce the problem to a Part-of-Speech (POS) tagging task.
Process:
Each word is tagged with a complex POS tag that includes multiple features
like:
* Part of speech
* Gender
* Person
* Case
* Tense
* Etc.
Words are not physically split into morphemes.
Instead, each word is assigned a complex POS tag.
This tag encodes information about the stem and the various morphemes
affecting the word across multiple dimensions (e.g., V--M--3--- indicating a
Verb stem, Masculine gender, 3rd person, with other potential morpheme
slots specified as absent in this instance).
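Decoding such a positional tag can be sketched as follows; the slot layout (position 0 = part of speech, 3 = gender, 6 = person) is an assumption chosen to match the V--M--3--- example, not the actual Prague-style tagset:

```python
# Decode a positional morphological tag; '-' marks a feature that does
# not apply. The slot positions are assumed for this illustration.
SLOTS = {0: "pos", 3: "gender", 6: "person"}

def decode_tag(tag):
    """Map each meaningful slot to its value, skipping '-' placeholders."""
    return {name: tag[i] for i, name in SLOTS.items() if tag[i] != "-"}

print(decode_tag("V--M--3---"))  # {'pos': 'V', 'gender': 'M', 'person': '3'}
```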

A POS tagger is trained using multiple sub-classifiers, each predicting one part of the tag (gender, tense, etc.).
These outputs are combined to create the final complex tag.

The word itself is not split into morphemes, but the POS tag captures all the morpheme information, and these tags can be used by a statistical parser to understand the grammar of such complex languages.

Training: Typically involves training separate classifiers for each component/dimension of the complex tag and combining their outputs to assign the final tag to a word.
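The combination step might look like the following sketch, where the predictions dict stands in for the outputs of trained sub-classifiers and the slot positions are assumed so as to match the V--M--3--- example from the text:

```python
# Assemble per-feature sub-classifier outputs into one positional tag.
# SLOT_OF is an assumed feature-to-position layout, not a real tagset.
SLOT_OF = {"pos": 0, "gender": 3, "person": 6}

def combine(predictions, width=10):
    """Write each predicted value into its slot; '-' marks absent features."""
    tag = ["-"] * width
    for feature, value in predictions.items():
        tag[SLOT_OF[feature]] = value
    return "".join(tag)

print(combine({"pos": "V", "gender": "M", "person": "3"}))  # V--M--3---
```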
Benefit: This enriched tag set (complex POS tags) provides valuable, detailed
features for statistical parsers, improving their accuracy when analyzing
morphologically rich/complex languages.

Semantic Parsing: Semantic parsing is the process of identifying meaning chunks (or "semantic units") in a piece of text and representing them in a structured format that a computer can understand and manipulate. This structured format could be a logical expression, a knowledge graph, or some other data structure.
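As a minimal illustration, here is a pattern-based parser that maps one narrow question family onto a predicate-style structure; the pattern and the capital_of predicate are invented for this sketch:

```python
import re

def parse(utterance):
    """Map 'What is the capital of X?' to a structured (predicate, arg) pair."""
    match = re.match(r"what is the capital of (\w+)\??$", utterance.lower())
    if match:
        return ("capital_of", match.group(1))
    return None   # no meaning representation for out-of-scope input

print(parse("What is the capital of France?"))  # ('capital_of', 'france')
```

Real semantic parsers learn such mappings statistically and target far richer representations (logical forms, graphs).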
Key Points:
* Granularity of Meaning: Semantic parsing can range from identifying basic
relationships between entities to more complex tasks like understanding the
roles of entities in events.
* Ambiguity: The term "semantic parsing" itself can be ambiguous due to the
varying levels of meaning representation it can encompass.
* Purpose: The ultimate goal of semantic parsing is to enable computers to
perform higher-level tasks like information retrieval, question answering, and
natural language generation.
In the context of natural language processing (NLP):
* Text as Input: The information signal in this case is human language text.

* Mapping Text to Structured Representation: The aim is to map this text into
a structured representation that captures its meaning in a way that's useful for
the computer.
Let's summarize:
Semantic parsing is a crucial task in NLP that involves extracting meaning from
text and representing it in a structured format that computers can understand
and utilize for various tasks like information retrieval and question answering.
