Adama Science and Technology University
Department of Computer Science and Engineering
Course Title: Natural Language Processing
Individual Assignment Two (Book Chapter)
Topic: Morphological Processing of Semitic Languages
Submitted to Dr. Beharu
Submitted By: Abdi Mosisa
ID: pgr/28201/15
Morphological Processing of Semitic Languages
2.1 Introduction
In NLP, morphology is the study of how words are built up from smaller meaningful units called
morphemes [1]. These morphemes are the building blocks of words, and understanding them
helps computers process language more effectively.
Here's a breakdown of key concepts in NLP morphology:
Morphemes: The smallest units of meaning in a language. They cannot be broken down
further into smaller meaningful parts. Two key notions are:
o Stems: The core meaning-carrying unit of a word. For example, "ሰበረ" in "ሰበረች" is
the stem.
o Affixes: Bound morphemes that attach to stems to modify their meaning or
grammatical function. Examples include prefixes (አል- in አልሰበረም), suffixes (-ዉ in
ሰበረዉ), and infixes (-በ- in ሰባበረ).
Morphological Analysis: The process of breaking down a word into its constituent
morphemes. This helps NLP tasks like:
o Lemmatization: Reducing words to their base form (lemma) for better understanding.
For instance, "playing" and "played" would both be reduced to "play".
o Part-of-Speech Tagging: Identifying the grammatical function of a word (noun, verb,
adjective etc.) based on its morphemes. For example, the "-ed" suffix often indicates
past tense for verbs.
Morphological Generation: The process of building words by combining stems and affixes.
This can be useful for tasks like:
o Machine translation: Understanding how morphemes are combined in one language
to create their equivalent in another.
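As a toy illustration of the analysis concepts above, the following sketch recovers an English lemma by stripping the longest known suffix and uses that suffix as a part-of-speech hint. The suffix table and tags are invented for illustration and are far too small for real text.

```python
# A toy morphological analyzer: strip the longest known suffix to
# recover the lemma, and use the suffix as a part-of-speech hint.
# The suffix table below is illustrative, not a real inventory.
SUFFIXES = {
    "ing": ("VERB", "present participle"),
    "ed": ("VERB", "past tense"),
    "s": ("NOUN", "plural"),
}

def analyze(word):
    """Return (lemma, tag, feature) for the longest matching suffix."""
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) > len(suffix):
            tag, feature = SUFFIXES[suffix]
            return word[: -len(suffix)], tag, feature
    return word, "UNKNOWN", None

print(analyze("playing"))  # ('play', 'VERB', 'present participle')
print(analyze("played"))   # ('play', 'VERB', 'past tense')
```

A real lemmatizer would of course need a lexicon and rules for irregular forms; the point here is only the mapping from morphemes to analyses.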
The process of morphological analysis involves identifying and categorizing these morphemes
within words to understand how they contribute to meaning and grammatical structure. This
analysis can reveal insights into word formation, inflectional patterns, and derivational processes
within a language.
This chapter addresses morphological processing of Semitic languages. In light of the complex
morphology and problematic orthography of many of the Semitic languages, the chapter begins
with a recapitulation of the challenges these phenomena pose on computational applications. It
then discusses the approaches that were suggested to cope with these challenges in the past. The
bulk of the chapter, then, discusses available solutions for morphological processing, including
analysis, generation, and disambiguation, in a variety of Semitic languages. The concluding
section discusses future research directions.
Semitic languages are characterized by complex, productive morphology, with a basic word-
formation mechanism, root-and-pattern, that is unique to languages of this family. Morphological
processing of Semitic languages therefore necessitates technology that can successfully cope
with these complexities. Linguistic theories, and, consequently, computational linguistic
approaches, are often developed with a narrow set of (mostly European) languages in mind. The
adequacy of such approaches to other families of languages is sometimes sub-optimal. A related
issue is the long tradition of scholarly work on some Semitic languages, notably Arabic [2] and
Amharic [1], which cannot always be easily consolidated with contemporary approaches.
Inconsistencies between modern, English-centric approaches and traditional ones are easily
observed in matters of lexicography. In order to annotate corpora or produce tree-banks, an
agreed-upon set of part-of-speech (POS) categories is required. Since early approaches to POS
tagging were limited to English, resources for other languages tend to use “tag sets”, or
inventories of categories, that are minor modifications of the standard English set. Such an
adaptation is problematic for Semitic languages.
These issues are complicated further when morphology is considered. The rich, non-
concatenative morphology of Semitic languages frequently requires innovative solutions that
standard approaches do not always provide.
2.2 Basic Notions
The word ‘word’ is one of the most loaded and ambiguous notions in linguistic theory [3]. Since
most computational applications deal with written texts (as opposed to spoken language), the
most useful notion is that of an orthographic word. This is a string of characters, from a well-
defined alphabet of letters, delimited by spaces, or other delimiters, such as punctuation. A text
typically consists of sequences of orthographic words, delimited by spaces or punctuation;
orthographic words in a text are often referred to as tokens.
Orthographic words are frequently not atomic: they can be further divided into smaller units, called
morphemes. Morphemes are the smallest meaning-bearing linguistic elements; they are
elementary pairings of form and meaning. Morphemes can be either free, meaning that they can
occur in isolation, as a single orthographic word; or bound, in which case they must combine
with other morphemes in order to yield a word. For example, the word two consists of a single
(free) morpheme, whereas dogs consists of two morphemes: the free morpheme dog, combined
with the bound morpheme -s. The preceding dash indicates that the latter must combine with
other morphemes; its function, of course, is to denote plurality. When a word consists of a
free morpheme, potentially combined with bound morphemes, the free morpheme is called a
stem, or sometimes a root.
Bound morphemes are typically affixes. Affixes come in many varieties: prefixes attach before
the stem (e.g., የ- in የአበበ), suffixes attach after the stem (-ነት in ሰዉነት), and infixes
occur inside a stem. Morphological processes define the shape of words. They are usually
classified into two types. Derivational morphology deals with word formation; such processes
can create new words from existing ones, potentially changing the category of the original
word. For example, the processes that create በታማኝነት from ታማኝነት, and ታማኝነት from
ታማኝ, are derivational. Such processes are typically not highly productive; for example, one
cannot derive አፍቃሪ from ፍቅር.
In contrast, inflectional morphology yields inflected forms, variants of some base, or citation
form, of words; these forms are constructed to adhere to some syntactic constraints, but they do
not change the basic meaning of the base form. Inflectional processes are usually highly
productive, applying to most members of a particular word class. For example, English nouns
inflect for number, so most nouns occur in two forms, the singular (which is considered the
citation form) and the plural, regularly obtained by adding the suffix -s to the base form.
Word formation in Semitic languages is based on a unique mechanism, known as root-and-
pattern. Words in this language family are often created by the combination of two bound
morphemes, a root and a pattern. The root is a sequence of consonants only, typically three; and
the pattern is a sequence of vowels and consonants with open slots in it. The root combines with
the pattern through a process called interdigitation: each letter of the root (radical) fills a slot in
the pattern.
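The interdigitation process can be sketched in a few lines. The function below fills the consonant slots of a pattern, written here as "C", with the radicals of the root, using the common textbook Arabic root k-t-b 'write' as an example; the slot notation is illustrative.

```python
# A minimal sketch of root-and-pattern interdigitation: each radical of
# the (typically triconsonantal) root fills one open slot, written "C",
# in the pattern.
def interdigitate(root, pattern):
    """Fill the 'C' slots of a pattern with the radicals of a root."""
    radicals = iter(root)
    return "".join(next(radicals) if ch == "C" else ch for ch in pattern)

# Textbook Arabic example: root k-t-b 'write'
print(interdigitate("ktb", "CaCaC"))   # katab
print(interdigitate("ktb", "maCCuC"))  # maktub
```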
In addition to the unique root-and-pattern morphology, Semitic languages are characterized by a
productive system of more standard affixation processes. These include prefixes, suffixes, infixes
and circumfixes, which are involved in both inflectional and derivational processes.
For example, Amharic is a highly inflectional Semitic language [1]. Nouns as well as
adjectives are inflected for number, gender, genitive, accusative, diminutive, dative,
instrumental, conjunctive, possessive and determiner.
Examples:
• አዲስ ቤት - new house
• አዲስ ቤትዎቸ - new houses
• ብልጥ - smart (male)
• ብልጥት - smart (female)
2.3 The Challenges of Morphological Processing
Morphological processing is a crucial component of many natural language processing (NLP)
applications. Whether the goal is information retrieval, question answering, text summarization
or machine translation, NLP systems must be aware of word structure. For some languages and
for some applications, simply stipulating a list of surface forms is a viable option; this is not the
case for languages with complex morphology, in particular Semitic languages, both because of
the huge number of potential forms and because of the difficulty such an approach has in
handling out-of-lexicon items (in particular, proper names), which may combine with prefix or
suffix particles.
An alternative solution would be a dedicated morphological analyzer, implementing the
morphological and orthographic rules of the language. Ideally, a morphological analyzer for any
language should reflect the rules underlying derivational and inflectional processes in that
language. Of course, the more complex the rules, the harder it is to construct such an analyzer.
The main challenge of morphological analysis of Semitic languages stems from the need to
faithfully implement a complex set of interacting rules, some of which are non-concatenative.
Once such a grammar is available, it typically produces more than one analysis for any given
surface form; in other words, Semitic languages exhibit a high degree of morphological
ambiguity, which has to be resolved in a typical computational application. The level of
morphological ambiguity is higher in many Semitic languages than it is in English, due to the
rich morphology and deficient orthography. This calls for sophisticated methods for
disambiguation. While in English (and other European languages) morphological disambiguation
may amount to POS tagging, Semitic languages require more effort, since determining the
correct POS of a given token is intertwined with the problem of segmenting the token into
morphemes, the set of morphological features (and their values) is larger, and consequently the
number of classes is too large for standard classification techniques. Several models were
proposed to address these issues.
Contemporary approaches to part-of-speech tagging are all based on machine learning: a large
corpus of text is manually annotated with the correct POS for each word; then, a classifier is
trained on the annotated corpus, resulting in a system that can predict POS tags for unseen texts
with high accuracy. The state of the art in POS tagging for English is extremely good, with
accuracies that are indistinguishable from human level performance. Various classifiers were
built for this task, implementing a variety of classification techniques, such as Hidden Markov
Models (HMM) [4], Average Perceptron [5], Maximum Entropy [6], Support Vector Machines
(SVM) [7], and others.
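As a sketch of how such statistical taggers work, here is a toy HMM tagger [4] that performs Viterbi decoding over hand-set transition and emission probabilities. Real systems estimate these parameters from an annotated corpus; the tiny tagset, vocabulary, and numbers below are invented for illustration.

```python
# A toy HMM part-of-speech tagger: Viterbi decoding over hand-set
# transition and emission probabilities (illustrative numbers only).
TAGS = ["DET", "NOUN", "VERB"]
START = {"DET": 0.6, "NOUN": 0.3, "VERB": 0.1}
TRANS = {
    "DET": {"DET": 0.05, "NOUN": 0.9, "VERB": 0.05},
    "NOUN": {"DET": 0.1, "NOUN": 0.2, "VERB": 0.7},
    "VERB": {"DET": 0.5, "NOUN": 0.4, "VERB": 0.1},
}
EMIT = {
    "DET": {"the": 0.9, "a": 0.1},
    "NOUN": {"dog": 0.5, "barks": 0.1, "cat": 0.4},
    "VERB": {"barks": 0.8, "dog": 0.2},
}

def viterbi(words):
    """Return the most probable tag sequence for the given words."""
    # probs[tag] = (probability, path) of the best path ending in tag
    probs = {t: (START[t] * EMIT[t].get(words[0], 0.0), [t]) for t in TAGS}
    for word in words[1:]:
        probs = {
            t: max(
                (p * TRANS[s][t] * EMIT[t].get(word, 0.0), path + [t])
                for s, (p, path) in probs.items()
            )
            for t in TAGS
        }
    return max(probs.values())[1]

print(viterbi(["the", "dog", "barks"]))  # ['DET', 'NOUN', 'VERB']
```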
For languages with complex morphology, and Semitic languages in particular, however, these
standard techniques do not perform as well, for several reasons:
1. Due to issues of orthography, a single token in several Semitic languages can actually be a sequence
of more than one lexical item, and hence be associated with a sequence of tags.
2. The rich morphology implies a much larger tagset, since tags reflect the wealth of morphological
information which words exhibit. The richer tagset immediately implies problems of data
sparseness, since most of the tags occur only rarely, if at all, in a given corpus.
3. As a result of both orthographic deficiencies and morphological wealth, word forms in Semitic
languages tend to be ambiguous.
4. Word order in Semitic languages is relatively free, and in any case freer than in English.
2.4 Computational Approaches to Morphology
No single method exists that provides an adequate solution for the challenges involved in
morphological processing of Semitic languages. The most common approach to morphological
processing of natural language is finite-state technology [8]. The adequacy of this technology for
Semitic languages has frequently been challenged, but clearly, with some sophisticated
developments, such as flag diacritics [9], multi-tape automata [10] or registered automata [11],
finite-state technology has been effectively used for describing the morphological structure of
several Semitic languages [12].
2.4.1 Two-Level Morphology
Two-level morphology was “the first general model in the history of computational linguistics
for the analysis and generation of morphologically complex languages” [13]. Developed by
Koskenniemi [14], this technology facilitates the specification of rules that relate pairs of surface
strings through systematic rules. Such rules, however, do not specify how one string is to be
derived from another; rather, they specify mutual constraints on those strings. Furthermore, rules
do not apply sequentially. Instead, a set of rules, each of which constrains a particular string pair
correspondence, is applied in parallel, such that all the constraints must hold simultaneously. In
practice, one of the strings in a pair would be a surface realization, while the other would be an
underlying form.
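The parallel-constraint idea can be illustrated with a tiny sketch, assuming the usual two-level notation in which a deleted symbol corresponds to a surface zero '0'. The single English e-deletion rule, the boundary rule, and the strings are invented for illustration; real two-level rules are compiled into finite-state transducers.

```python
# A minimal sketch of two-level morphology: a lexical string and a
# surface string are related symbol pair by symbol pair, and every rule
# is a constraint that the whole pair sequence must satisfy in parallel.
def pairs(lexical, surface):
    return list(zip(lexical, surface))

def e_deletion_ok(pair_seq):
    """'e:0' is licensed only before a boundary '+' followed by a vowel."""
    for i, (lex, sur) in enumerate(pair_seq):
        if lex == "e" and sur == "0":
            ok = (
                i + 2 < len(pair_seq)
                and pair_seq[i + 1][0] == "+"
                and pair_seq[i + 2][0] in "aeiou"
            )
            if not ok:
                return False
    return True

def boundary_ok(pair_seq):
    """The morpheme boundary '+' is always realized as surface zero."""
    return all(sur == "0" for lex, sur in pair_seq if lex == "+")

RULES = [e_deletion_ok, boundary_ok]

def accepts(lexical, surface):
    # All constraints are checked in parallel over the same pair sequence.
    seq = pairs(lexical, surface)
    return len(lexical) == len(surface) and all(rule(seq) for rule in RULES)

print(accepts("move+ing", "mov00ing"))  # True: e deleted before +ing
print(accepts("move+s", "move0s"))      # True: no deletion before a consonant
print(accepts("move+s", "mov00s"))      # False: deletion not licensed here
```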
2.4.2 Registered Automata
Finite-state registered automata [11] were developed with the goal of facilitating the expression
of various non-concatenative morphological phenomena in an efficient way. The main idea is to
augment standard finite-state automata with a (finite) amount of memory, in the form of registers
associated with the automaton transitions. This is done in a restricted way that saves space but
does not add expressivity. The number of registers is finite, usually small, and eliminates the
need to duplicate paths as it enables the automaton to ‘remember’ a finite number of symbols.
Technically, each arc in the automaton is associated (in addition to an alphabet symbol) with an
action on the registers. Cohen-Sygal and Wintner [11] define two kinds of actions, read and
write. The former allows an arc to be traversed only if a designated register contains a specific
symbol. The latter writes a specific symbol into a designated register when an arc is traversed.
Cohen-Sygal and Wintner [11] show that finite-state registered automata can efficiently model
several non-concatenative morphological phenomena, including circumfixation, root and pattern
word formation in Semitic languages, vowel harmony, limited reduplication, etc. The
representation is highly efficient: for example, to account for all the possible combinations of r
roots and p patterns, an ordinary FSA requires O(r × p) arcs whereas a registered automaton
requires only O(r + p) arcs. Unfortunately, no implementation of the model exists as part of an
available finite-state toolkit.
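Since no toolkit implementation exists, here is a small hand-rolled simulation of the idea on a toy circumfixation language, where the suffix must agree with the prefix. The alphabet, states, stem, and single register are invented for illustration; the point is that the register lets the stem arcs be shared rather than duplicated per circumfix.

```python
# A minimal simulation of a finite-state registered automaton [11]:
# each arc carries an optional register action, 'write' storing a symbol
# and 'read' allowing traversal only if the register holds that symbol.
# Toy language: circumfixes a-...-x and b-...-y around the stem "tam".
# Arc: (source, target, input_symbol, action); action is None,
# ('write', value), or ('read', value), all on a single register.
ARCS = [
    (0, 1, "a", ("write", "a")),
    (0, 1, "b", ("write", "b")),
    (1, 2, "t", None), (2, 3, "a", None), (3, 4, "m", None),  # shared stem
    (4, 5, "x", ("read", "a")),
    (4, 5, "y", ("read", "b")),
]
FINAL = {5}

def accepts(word):
    state, register = 0, None
    for symbol in word:
        for src, dst, sym, action in ARCS:
            if src == state and sym == symbol:
                if action and action[0] == "read" and register != action[1]:
                    continue  # register does not license this arc
                if action and action[0] == "write":
                    register = action[1]
                state = dst
                break
        else:
            return False
    return state in FINAL

print(accepts("atamx"))  # True: suffix x matches prefix a
print(accepts("atamy"))  # False: mismatched circumfix
```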
2.4.3 Analysis by Generation
Most of the approaches discussed above allow for a declarative specification of (morphological)
grammar rules, from which both analyzers and generators can be created automatically. A
simpler, less generic, yet highly efficient approach to the morphology of Semitic languages has
been popular with actual applications. In this framework, which we call analysis by generation
here, the morphological rules that describe word formation and/or affixation are specified in a
way that supports generation, but not necessarily analysis. Coupled with a lexicon of morphemes
(typically, base forms and concatenative affixes), such rules can be applied in one direction to
generate all the surface forms of the language. This can be done off-line, and the generated forms
can then be stored in a database; analysis, in this paradigm, amounts more or less to simple table
lookup.
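The analysis-by-generation paradigm can be sketched as follows: the rules are run off-line in the generation direction to enumerate all surface forms into a table, and analysis then reduces to lookup. The lexicon and suffixes are illustrative English stand-ins for a real morpheme lexicon.

```python
# A sketch of analysis by generation: generate every surface form
# off-line, store the results in a table, and analyze by table lookup.
LEXICON = ["walk", "talk"]
SUFFIXES = {"": {}, "s": {"person": 3, "number": "sg"}, "ed": {"tense": "past"}}

def generate_table(stems, suffixes):
    """Off-line generation: map every surface form to its analyses."""
    table = {}
    for stem in stems:
        for suffix, features in suffixes.items():
            table.setdefault(stem + suffix, []).append((stem, features))
    return table

TABLE = generate_table(LEXICON, SUFFIXES)

def analyze(surface):
    """Analysis amounts to a simple table lookup."""
    return TABLE.get(surface, [])

print(analyze("walked"))  # [('walk', {'tense': 'past'})]
print(analyze("walks"))   # [('walk', {'person': 3, 'number': 'sg'})]
```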
2.4.4 Functional Morphology
Functional morphology [16] is a computational framework for defining language resources, in
particular lexicons. It is a language-independent tool, based on a word-and-paradigm model,
which allows the grammar writer to specify the inflectional paradigms of words in a specific
language in a similar way to printed paradigm tables. A lexicon in functional morphology
consists of a list of words, each associated with its paradigm name, and an inflection engine that
can apply the inflectional rules of the language to the words of the lexicon.
This framework was used to define morphological grammars for several languages, including
modeling of non-concatenative processes such as vowel harmony, reduplication, and templatic
morphology. In particular, this paradigm has been used to implement a morphological processor of Arabic.
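The word-and-paradigm idea behind functional morphology can be sketched as follows: a paradigm is a function from a dictionary word to its inflection table, and the lexicon pairs each word with a paradigm name. The single English noun paradigm below is invented for illustration; the real framework [16] is implemented in Haskell.

```python
# A sketch of the word-and-paradigm model: paradigms are functions from
# words to inflection tables, applied by a simple inflection engine.
def noun_paradigm(word):
    """Inflection table for a regular noun, like a printed paradigm table."""
    return {"sg": word, "pl": word + "s"}

PARADIGMS = {"regular_noun": noun_paradigm}
LEXICON = [("dog", "regular_noun"), ("book", "regular_noun")]

def inflection_engine(lexicon):
    """Apply each word's paradigm to produce its full inflection table."""
    return {word: PARADIGMS[name](word) for word, name in lexicon}

print(inflection_engine(LEXICON)["dog"])  # {'sg': 'dog', 'pl': 'dogs'}
```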
2.4.5 Morphological Analysis and Generation of Semitic Languages
2.4.5.1 Amharic
Computational work on Amharic began only recently. Fissaha and Haller [17] describe a
preliminary lexicon of verbs, and discuss the difficulties involved in implementing verbal
morphology with XFST. XFST is also the framework of choice for the development of an
Amharic morphological grammar [12]; but evaluation on a small set of 1,620 words reveals that
while the coverage of the grammar on this corpus is rather high (85–94 %, depending on the part
of speech), its precision is low and many word forms (especially verbs) are associated with
wrong analyses.
Argaw and Asker [1] describe a stemmer for Amharic. Using a large dictionary, the stemmer
first tries to segment surface forms into sequences of prefixes, a stem, and suffixes. The
candidate stems are then looked up in the dictionary, and the longest stem found is chosen (ties are
resolved by the frequency of the stem in a corpus). Evaluation on a small corpus of 1,500 words
shows accuracy of close to 77 %.
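The stemming strategy just described can be sketched as follows: try every prefix/stem/suffix segmentation, keep the stems found in the dictionary, and choose the longest one, breaking ties by corpus frequency. The affix lists, dictionary, and frequencies below are illustrative English stand-ins, not the Amharic resources used in [1].

```python
# A sketch of a longest-match dictionary stemmer with frequency tie-breaking.
PREFIXES = ["", "re"]
SUFFIXES = ["", "s", "ed", "ing"]
DICTIONARY = {"play", "open", "pen"}
FREQUENCY = {"play": 120, "open": 80, "pen": 15}

def stem(word):
    """Return the best dictionary stem for a word, or None."""
    candidates = []
    for pre in PREFIXES:
        for suf in SUFFIXES:
            if word.startswith(pre) and word.endswith(suf):
                core = word[len(pre): len(word) - len(suf)]
                if core in DICTIONARY:
                    candidates.append(core)
    # longest stem wins; ties are resolved by corpus frequency
    return max(candidates, key=lambda s: (len(s), FREQUENCY[s]), default=None)

print(stem("reopened"))  # open
print(stem("playing"))   # play
```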
The state of the art in Amharic, however, is most probably HornMorpho, a system for
morphological processing of Amharic as well as Tigrinya (another Ethiopian Semitic language)
and Oromo (which is not Semitic). The system is based on finite-state technology, but the basic
transducers are augmented by feature structures, implementing ideas introduced by Amtrup.
Manual evaluation on 200 Tigrinya verbs and 400 Amharic nouns and verbs shows very accurate
results: in over 96 % of the words, the system produced all and only the correct analyses.
2.4.6 Related Applications
Also worth mentioning here are a few works that address other morphology-related tasks. These
include a shallow morphological analyzer for Arabic [10] that segments word forms into
sequences of (at most one) prefix, a stem, and (at most one) suffix; a system for identifying the
roots of (possibly inflected) Hebrew and Arabic words; various programs for vocalization, or
restoring diacritics, in Arabic and other Semitic languages; determining case endings of
Arabic words; and correction of optical character recognition (OCR) errors.
When downstream applications are considered, such as chunking, parsing, or machine
translation, the question of tokenization gains much importance. Morphological analysis
determines the lexeme and the inflectional (and, sometimes, also derivational) morphemes of
a surface form; but the way in which a surface form is broken down to its morphemes for the
purpose of further processing can have a significant impact on the accuracy of such applications.
Several works investigate various pre-processing techniques for Arabic, in the context of
developing Arabic-to-English statistical machine translation systems [17,51].
2.5 Morphological Disambiguation of Semitic Languages
Early attempts at POS tagging and morphological disambiguation of Semitic languages relied on
more “traditional” approaches, borrowed directly from the general (i.e., English) POS tagging
literature. The first such work is probably [10], which defined a set of 131 POS tags, manually
annotated a corpus of 50,000 words, and then implemented a tagger that combines statistical and
rule-based techniques and performs both segmentation and tag disambiguation. Similarly, [17]
use SVM to automatically tokenize, POS-tag, and chunk Arabic texts. To this end, they use a
reduced tag set of only 24 tags, with which the reported results are very high. The set of tags is
extended to 75 in [11].
As for Amharic, [1] uses conditional random fields for POS tagging. As the annotated corpus
used for training is extremely small (1,000 words), it is not surprising that the accuracy is rather
low: 84 % for segmentation, and only 74 % for POS tagging. Two other works use a recently-
created 210,000-word annotated corpus to train Amharic POS taggers with a tag set of size 30.
Gambäck et al. experiment with HMM, SVM, and Maximum Entropy; accuracy ranges between
88 and 95 %, depending on the test corpus. Another work investigates various classification
techniques, using the same corpus for the same task. The best accuracy, achieved with SVM, is
over 86 %, but other classification methods, including conditional random fields and memory-
based learning, also perform well.
2.6 Summary
The discussion above establishes the inherent difficulty of morphological processing of Semitic
languages, as one instance of languages with rich and complex morphology. Having said that, it
is clear that with a focused effort, contemporary computational technology is sufficient for
tackling the difficulties. Sect. 2.5 shows that morphological disambiguation of these languages
can be done with high accuracy, nearing the accuracy of disambiguation for European
languages.
However, for the less-studied languages, including Amharic, Maltese and others, much work is
still needed in order to produce tools of similar precision. As in Arabic and Hebrew, this
effort should focus on two fronts: the development of formal, computationally implementable
sets of rules that describe the morphology of the language in question; and the
collection and annotation of corpora from which morphological disambiguation modules can be
trained.
As for future technological improvements, we note that “pipeline” approaches, whereby the input
text is fed, in sequence, to a tokenizer, a morphological analyzer, a morphological
disambiguation module and then a parser, have probably reached a ceiling, and the stage is ripe
for more elaborate, unified approaches.
References
1. Argaw AA, Asker L (2007) An Amharic stemmer: reducing words to their citation forms. In:
Proceedings of the ACL-2007 workshop on computational approaches to Semitic languages,
Prague
2. Owens J (1997) The Arabic grammatical tradition. In: Hetzron R (ed) The Semitic languages.
Routledge, London/New York, chap 3, pp 46–58
3. Harley HB (2006) English words: a linguistic introduction. The language library. Wiley-
Blackwell, Malden
4. Brants T (2000) TnT: a statistical part-of-speech tagger. In: Proceedings of the sixth conference on
applied natural language processing, Seattle. Association for Computational Linguistics, pp 224–
231. doi:10.3115/974147.974178, http://www.aclweb.org/anthology/A00-1031
5. Collins M (2002) Discriminative training methods for hidden Markov models: theory and
experiments with perceptron algorithms. In: Proceedings of the ACL-02 conference on empirical
methods in natural language processing, EMNLP '02, Philadelphia, vol 10. Association for
Computational Linguistics, pp 1–8. doi:10.3115/1118693.1118694
6. Ratnaparkhi A (1996) A maximum entropy model for part-of-speech tagging. In: Brill E, Church K
(eds) Proceedings of the conference on empirical methods in natural language processing,
Copenhagen. Association for Computational Linguistics, pp 133–142
7. Giménez J, Màrquez L (2004) SVMTool: a general POS tagger generator based on support vector
machines. In: Proceedings of 4th international conference on language resources and evaluation
(LREC), Lisbon, pp 43–46
8. Beesley KR, Karttunen L (2003) Finite-state morphology: xerox tools and techniques. CSLI,
Stanford
9. Beesley KR (1998) Arabic morphology using only finite-state operations. In: Rosner M (ed)
Proceedings of the workshop on computational approaches to Semitic languages, COLING-
ACL’98, Montreal, pp 50–57
10. Kiraz GA (2000) Multitiered nonlinear morphology using multitape finite automata: a case study
on Syriac and Arabic. Comput Linguist 26(1):77–105
11. Cohen-Sygal Y, Wintner S (2006) Finite-state registered automata for non-concatenative
morphology. Comput Linguist 32(1):49–82
12. Amsalu S, Gibbon D (2005) A complete finite-state model for Amharic morphographemics. In:
Yli-Jyrä A, Karttunen L, Karhumäki J (eds) FSMNLP. Lecture notes in computer science, vol
4002. Springer, Berlin/New York, pp 283–284
13. Karttunen L, Beesley KR (2001) A short history of two-level morphology. In: Talk given at the
ESSLLI workshop on finite state methods in natural language processing.
http://www.helsinki.fi/esslli/evening/20years/twol-history.html
14. Koskenniemi K (1983) Two-level morphology: a general computational model for word-form
recognition and production. The Department of General Linguistics, University of Helsinki
15. Choueka Y (1966) Computers and grammar: mechanical analysis of Hebrew verbs. In:
Proceedings of the annual conference of the Israeli Association for Information Processing,
Rehovot, pp 49–66. (in Hebrew)
16. Forsberg M, Ranta A (2004) Functional morphology. In: Proceedings of the ninth ACM SIGPLAN
international conference on functional programming (ICFP’04), Snowbird. ACM, New York, pp
213–223
17. Fissaha S, Haller J (2003) Amharic verb lexicon in the context of machine translation. In:
Proceedings of the TALN workshop on natural language processing of minority languages, Batz-sur-
Mer