Multilingual Issues

The document discusses the challenges of word segmentation in languages like Chinese, which lack explicit word boundaries, highlighting the ambiguity in segmenting phrases and the importance of choosing the correct segmentation for understanding. It compares different parsing methods, including integrated parsing, pipeline approaches, and the use of word lattices, emphasizing the benefits of allowing multiple segmentation possibilities. Additionally, it covers morphological analysis, the complexity of morphemes in various languages, and the significance of semantic parsing in natural language processing.

The text introduces the concept of word segmentation, which is necessary because many written languages (Chinese being the prime example) do not use spaces or other explicit marks to separate words the way English does. A continuous string of characters must be broken down into meaningful word units.
Ambiguity Example: It provides a concrete example using the Chinese text: 北京
大学生比赛 (Běijīng dàxuéshēng bǐsài).
Segmentation 1 (Plausible): The text shows one way to segment this:
北京 (Běijīng) = Beijing
大学生 (dàxuéshēng) = university students
比赛 (bǐsài) = competition
Meaning: This segmentation translates to "competition among university
students in Beijing".
Segmentation 2 (Implied Alternative): It then introduces a crucial point of
ambiguity. If the first part, 北京大学 (Běijīng Dàxué), is interpreted as a single unit
meaning "Beijing University", then the segmentation would have to be
different. The text cuts off, but the implication is that the segmentation would
likely become:
北京大学 (Běijīng Dàxué) = Beijing University
生 (shēng) = student (or possibly part of another word, depending on context)
比赛 (bǐsài) = competition
Meaning: This segmentation would lead to a different meaning, perhaps
"competition for Beijing University students" or similar, depending on how the
remaining '生' is handled.
In essence: This section highlights that:
Word segmentation is a fundamental task for processing languages like Chinese that lack word delimiters.
The task is non-trivial because the same sequence of characters can often be segmented in multiple valid ways, leading to different interpretations (ambiguity). Choosing the correct segmentation is crucial for understanding the text.
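The ambiguity above can be made concrete by enumerating every dictionary-consistent segmentation. A minimal sketch, assuming a toy lexicon invented for illustration:

```python
# Exhaustively enumerate segmentations of a Chinese character string.
# LEXICON is a toy dictionary assumed for this example only.
LEXICON = {"北京", "北京大学", "大学", "大学生", "生", "比赛"}

def segmentations(s):
    """Return every way to split s into lexicon words."""
    if not s:
        return [[]]
    results = []
    for i in range(1, len(s) + 1):
        word = s[:i]
        if word in LEXICON:
            for rest in segmentations(s[i:]):
                results.append([word] + rest)
    return results

for seg in segmentations("北京大学生比赛"):
    print(" / ".join(seg))
# Both readings from the text appear among the results:
#   北京 / 大学生 / 比赛   ("competition among university students in Beijing")
#   北京大学 / 生 / 比赛   (the "Beijing University" reading)
```

A real segmenter would score these alternatives rather than merely list them.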
This passage discusses different strategies for Chinese parsing, focusing
specifically on the challenge of word segmentation (determining word
boundaries in a character stream, as Chinese doesn't typically use spaces).
Integrated Parsing and Segmentation:
One approach mentioned is to have the parser perform word segmentation as
part of the parsing process itself.
In this method, the structure built by the parser (the parse tree) inherently defines word boundaries: nonterminals in the tree that span a group of characters effectively mark those boundaries.
However, a study found that immediate (local) context is the most useful factor for predicting word boundaries, more so than the global sentence context captured by the entire parse tree. This suggests that while the integrated approach can capture long-distance dependencies that might resolve some segmentation ambiguities, relying on local information may be more effective for boundary detection overall.
Pipeline Approach and its Drawback:
The text contrasts the integrated approach with a pipeline method. This
usually involves:
Step 1: Use a dedicated word segmentation model to produce the single best
segmentation for the sentence.
Step 2: Feed this segmented sentence into the parser.
The major disadvantage highlighted is that if the initial segmentation model
produces only one result (even if other segmentations were plausible), the
parser is stuck with it. It has no way to consider alternative segmentations,
even if the initial one leads to a poor or impossible parse.
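The failure mode can be sketched with a hypothetical greedy longest-match segmenter over an invented toy lexicon; once it commits to its single output, the downstream parser never sees the alternatives:

```python
# 1-best greedy longest-match segmentation: the pipeline's first stage.
# The lexicon is a toy assumption for illustration.
LEXICON = {"北京", "北京大学", "大学", "大学生", "生", "比赛"}

def segment_1best(chars):
    """Commit to one segmentation by always taking the longest match."""
    out, i = [], 0
    while i < len(chars):
        for j in range(len(chars), i, -1):   # try the longest word first
            if chars[i:j] in LEXICON:
                out.append(chars[i:j])
                i = j
                break
        else:
            out.append(chars[i])             # unknown character: emit as-is
            i += 1
    return out

print(segment_1best("北京大学生比赛"))
# Greedy matching picks 北京大学 / 生 / 比赛; the parser is stuck with this
# even if 北京 / 大学生 / 比赛 would yield a better parse.
```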
Parsing Word Lattices (A More Flexible Approach):

To overcome the limitation of the pipeline, the text proposes using a word
lattice as input to the parser.

2
A word lattice is a compact representation (like a graph or finite-state
automaton) that encodes multiple possible word segmentations
simultaneously.
Drawing on established results (Bar-Hillel et al.), parsers designed for Context-
Free Grammars (CFGs) can be adapted to process these lattices directly.
Instead of indexing into a simple string, they use the states of the automaton
(lattice) as generalized indices.
The benefit: The parser can explore the different segmentation paths within
the lattice. It can potentially use information from the segmentation model
(e.g., probabilities or rankings associated with different paths/words in the
lattice) and its own grammatical knowledge to select the segmentation that
results in the most accurate overall parse.
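A sketch of the lattice idea, using character positions 0–7 as automaton states; the edges and log-probabilities are invented stand-ins for a real segmentation model's output:

```python
# Word lattice over 北京大学生比赛: each edge (start, end, word, log_prob)
# is one segmentation hypothesis. Scores here are made up for illustration.
EDGES = [
    (0, 2, "北京", -1.0),
    (0, 4, "北京大学", -1.5),
    (2, 4, "大学", -1.2),
    (2, 5, "大学生", -0.8),
    (4, 5, "生", -2.0),
    (5, 7, "比赛", -0.5),
]

def best_path(edges, start, goal):
    """Highest-scoring word sequence through the lattice (Viterbi-style)."""
    best = {start: (0.0, [])}
    for s, e, w, lp in sorted(edges):    # sorting by position is topological here
        if s in best:
            score = best[s][0] + lp
            if e not in best or score > best[e][0]:
                best[e] = (score, best[s][1] + [w])
    return best[goal]

score, words = best_path(EDGES, 0, 7)
print(words)   # ['北京', '大学生', '比赛'] wins under these invented scores
```

A lattice-aware parser would additionally weigh each path's grammaticality, not just the segmentation scores.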
In essence: The text compares methods for handling Chinese word segmentation during parsing. It critiques the inflexibility of using only the single best segmentation from a dedicated model (the pipeline) and notes the limitation found in the fully integrated approach (local context dominating). It then presents parsing word lattices as a promising alternative that lets the parser consider multiple segmentation possibilities simultaneously, potentially yielding more robust and accurate results by combining segmentation likelihood with syntactic analysis.

Morphology: Definition & Core Issue

Definition: Morphology studies how words are formed from smaller
meaningful units called morphemes (stems combined with other components).
Core Idea: The meaning of a word is derived from the combination of its
morphemes' meanings.
Problem: Simply splitting text by spaces (tokenization) is insufficient or
problematic in many languages because individual words carry complex
internal structure.
In some languages, splitting text into words using spaces does not work well because words can be made up of smaller meaningful units called morphemes. A morpheme is the smallest part of a word that has meaning, such as a prefix, suffix, or stem.

Multilingual Dimension
This is a significant issue in languages beyond those easily handled by space-splitting.
Agglutinative Languages (e.g., Turkish, Finnish): Characterized by combining many morphemes together, creating very complex words.
Inflectional Languages (e.g., Czech, Russian): Use numerous morphemes to mark grammatical properties like case, gender, number, and tense. While potentially less extreme than agglutinative languages, they still pose significant challenges.
Morphemes for different properties (e.g., gender, case) can often combine orthogonally (independently).
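The agglutinative case can be sketched with a toy analyzer for the Turkish word evlerimde, "in my houses" (ev "house" + ler plural + im "my" + de locative); the suffix list and greedy right-to-left stripping are simplifications for illustration:

```python
# Toy morpheme splitter: peel known suffixes off the right edge of a word.
# SUFFIXES is a tiny assumed inventory, not a real Turkish morphology.
SUFFIXES = ["ler", "im", "de"]

def split_morphemes(word):
    """Greedily strip suffixes; whatever remains is treated as the stem."""
    morphemes = []
    changed = True
    while changed:
        changed = False
        for suffix in SUFFIXES:
            if word.endswith(suffix) and len(word) > len(suffix):
                morphemes.insert(0, suffix)
                word = word[: -len(suffix)]
                changed = True
                break
    return [word] + morphemes

print(split_morphemes("evlerimde"))  # ['ev', 'ler', 'im', 'de']
```

A real analyzer must also handle vowel harmony and ambiguity between competing splits.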

Challenges Arising from Complex Morphology

Vast Number of Word Forms: The combination of various morphemes leads to a huge number of possible inflected forms for a single base word (stem).
Example (Czech Adjectives): A Czech adjective can potentially inflect for 4 grammatical genders (masculine animate, e.g. male humans; masculine inanimate, e.g. a table; feminine; neuter), 7 cases, 3 degrees of comparison, positive/negative polarity, and singular/plural number, resulting in up to 336 distinct word forms per adjective stem.
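A quick check of the arithmetic behind the 336 figure; note that the singular/plural factor is needed to make the dimensions multiply out (4 × 7 × 3 × 2 alone gives only 168):

```python
# Paradigm size of a Czech adjective from its inflectional dimensions.
genders = 4    # masculine animate, masculine inanimate, feminine, neuter
numbers = 2    # singular, plural
cases = 7
degrees = 3    # positive, comparative, superlative
polarity = 2   # affirmative, negated

print(genders * numbers * cases * degrees * polarity)  # 336
```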

Morphological Ambiguity: A single word form might be analyzable into different sets of morphemes, leading to different interpretations. This is a distinct challenge in addition to syntactic ambiguity (sentence-structure ambiguity).
How to handle this?
The task of finding the most likely sequence of morphemes in a word is similar
to POS tagging.
Approach to Handling Morphological Disambiguation
Goal: Determine the most likely sequence/combination of morphemes for a
given word.
Method: Reduce the problem to a Part-of-Speech (POS) tagging task.
Process:
Each word is tagged with a complex POS tag that includes multiple features
like:
* Part of speech
* Gender
* Person
* Case
* Tense
* Etc.
Words are not physically split into morphemes.
Instead, each word is assigned a complex POS tag.
This tag encodes information about the stem and the various morphemes
affecting the word across multiple dimensions (e.g., V--M--3--- indicating a
Verb stem, Masculine gender, 3rd person, with other potential morpheme
slots specified as absent in this instance).
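Decoding such a positional tag can be sketched as follows; the slot layout (position 0 = part of speech, 3 = gender, 6 = person) is an assumption chosen to match the V--M--3--- example, not the actual Prague-style tagset:

```python
# Decode a positional morphological tag; '-' marks a feature that does
# not apply. The slot positions are assumed for this illustration.
SLOTS = {0: "pos", 3: "gender", 6: "person"}

def decode_tag(tag):
    """Map each meaningful slot to its value, skipping '-' placeholders."""
    return {name: tag[i] for i, name in SLOTS.items() if tag[i] != "-"}

print(decode_tag("V--M--3---"))  # {'pos': 'V', 'gender': 'M', 'person': '3'}
```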

A POS tagger is trained using multiple sub-classifiers, each predicting one part of the tag (gender, tense, etc.).
These outputs are combined to create the final complex tag.

The word itself is not split into morphemes, but the POS tag captures all the morpheme information, and these tags can be used by a statistical parser to understand the grammar of such complex languages.

Training: Typically involves training separate classifiers for each component/dimension of the complex tag and combining their outputs to assign the final tag to a word.
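The combination step might look like the following sketch, where the predictions dict stands in for the outputs of trained sub-classifiers and the slot positions are assumed so as to match the V--M--3--- example from the text:

```python
# Assemble per-feature sub-classifier outputs into one positional tag.
# SLOT_OF is an assumed feature-to-position layout, not a real tagset.
SLOT_OF = {"pos": 0, "gender": 3, "person": 6}

def combine(predictions, width=10):
    """Write each predicted value into its slot; '-' marks absent features."""
    tag = ["-"] * width
    for feature, value in predictions.items():
        tag[SLOT_OF[feature]] = value
    return "".join(tag)

print(combine({"pos": "V", "gender": "M", "person": "3"}))  # V--M--3---
```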
Benefit: This enriched tag set (complex POS tags) provides valuable, detailed
features for statistical parsers, improving their accuracy when analyzing
morphologically rich/complex languages.

Semantic Parsing: Semantic parsing is the process of identifying meaning chunks (or "semantic units") in a piece of text and representing them in a structured format that a computer can understand and manipulate. This structured format could be a logical expression, a knowledge graph, or some other data structure.
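As a minimal illustration, here is a pattern-based parser that maps one narrow question family onto a predicate-style structure; the pattern and the capital_of predicate are invented for this sketch:

```python
import re

def parse(utterance):
    """Map 'What is the capital of X?' to a structured (predicate, arg) pair."""
    match = re.match(r"what is the capital of (\w+)\??$", utterance.lower())
    if match:
        return ("capital_of", match.group(1))
    return None   # no meaning representation for out-of-scope input

print(parse("What is the capital of France?"))  # ('capital_of', 'france')
```

Real semantic parsers learn such mappings statistically and target far richer representations (logical forms, graphs).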
Key Points:
* Granularity of Meaning: Semantic parsing can range from identifying basic
relationships between entities to more complex tasks like understanding the
roles of entities in events.
* Ambiguity: The term "semantic parsing" itself can be ambiguous due to the
varying levels of meaning representation it can encompass.
* Purpose: The ultimate goal of semantic parsing is to enable computers to
perform higher-level tasks like information retrieval, question answering, and
natural language generation.
In the context of natural language processing (NLP):
* Text as Input: The information signal in this case is human language text.

* Mapping Text to Structured Representation: The aim is to map this text into
a structured representation that captures its meaning in a way that's useful for
the computer.
Let's summarize:
Semantic parsing is a crucial task in NLP that involves extracting meaning from
text and representing it in a structured format that computers can understand
and utilize for various tasks like information retrieval and question answering.
