A Word Sense Induction Model
for Portuguese
Examination Committee
November 2017
Acknowledgements
I’d like to thank my supervisors, for all the help and feedback they provided, as well as to L2F and
INESC for the opportunity to work with them in this project. I’d also like to thank my friends and family
for their support throughout the development of this project.
Resumo
Com o crescimento exponencial da informação publicada, processá-la torna-se cada vez mais necessário. A Desambiguação Semântica de Palavras é um componente fundamental deste processo, mas não é possível sem corpora anotados com o sentido de cada palavra em cada contexto. A Indução Semântica de Palavras tenta resolver este problema ao aglomerar palavras que ocorram juntas frequentemente em corpora não anotados. Este projeto, SENse Through InDuctiOn (SENTIDO), tenta resolver este problema para a língua portuguesa utilizando algoritmos com base em grafos.
Abstract
With the exponential growth of published data, processing it automatically becomes ever more necessary. Word Sense Disambiguation (WSD) is a fundamental component of this processing, but it is not possible without a rich sense inventory. Word Sense Induction (WSI) tries to solve this problem by clustering words which frequently occur together in non-annotated corpora. This project, SENse Through InDuctiOn (SENTIDO), attempts to solve this problem for the Portuguese language using a graph-based algorithm.
Keywords
Word Sense Induction
Semantic Annotation
Sense Annotation
Text Analysis
Anotação Semântica
Anotação de Sentidos
Análise de Texto
Contents
Acknowledgements i
Resumo iii
Abstract v
Keywords vii
List of Tables xv
Listings xvii
Acronyms xix
1 Introduction 1
2.2.2 HERMIT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.4.2 MaxMax. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.5.1 STRING . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
3 Architecture 23
4 Implementation 27
4.1.1 CETEMPúblico . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
4.1.2 Wikipedia . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
5 Evaluation 31
6 Conclusion 39
Bibliography 41
List of Figures
3.4 The Entity–Relationship model used to store the information in the database . . . . . . . 25
5.1 Image of the induction graph for the word vingar, using the Chinese Whispers (CW) algo-
rithm and the Normalized Pointwise Mutual Information (NPMI) association measure. . . . 36
List of Tables
Listings
4.1 SQL Query to extract all co-occurrences in the same context as the target word . . . . . . . 29
Acronyms
AJAX Asynchronous JavaScript and XML
AR Anaphora Resolution
CW Chinese Whispers
EM Expectation-Maximization
ER Entity–Relationship
GS Gold Standard
IC Integrity Constraint
IR Information Retrieval
MT Machine Translation
POS Part-of-Speech
RI Random Indexing
In these two sentences, the word bow means different things. In sentence (a), it is used as a hunting tool, while in sentence (b), it is a ribbon. To be able to perform some tasks in Natural Language Processing (NLP), machines need to be able to differentiate between the different meanings of the same word in each use.
This problem affects, among other NLP tasks, Machine Translation (MT), Information Retrieval (IR) and content categorization (Navigli, 2009). Taking an example from MT, the Portuguese word laço can be translated into various English words depending on context: it can be translated as ribbon, or as the bond between two people.
A system which naïvely uses the Most Frequent Sense (MFS) of a given word, as found in a se-
mantically annotated corpus, may improperly translate or categorize this word when it is used outside of
its most common context or in a different corpus.
To deal with this problem, it is necessary to identify the specific meaning of a word based on its
context. This process is called Word Sense Disambiguation (WSD) (Navigli, 2009). It requires both
sense inventories and large amounts of sense-tagged corpora to function effectively. As a result, under-resourced languages face greater difficulties in achieving satisfactory results (Ng, 1997).
A solution to the lack of resources is to automatically identify the meaning of words in their given
context, without the requirement of manually annotated data. This is called Word Sense Induction (WSI)
(Agirre and Soroa, 2007). Most WSI models rely on word co-occurrence to determine the main senses
a word may have. Syntactic dependencies between words are seldom used.
The goal of this dissertation is to investigate the feasibility of creating a WSI model for the Por-
tuguese language, which would be capable of using syntactic dependency information to determine the
main senses a word may have. Furthermore, this dissertation looks into evaluating the quality of this
new model against the MFS baseline.
Additionally, this dissertation presents the results of this investigation in a project called SENse Through InDuctiOn (SENTIDO). SENTIDO is a WSI and WSD model which infers the possible senses of a word from sense-untagged corpora based on its syntactic relations with its context; then, given a word and its context, the system disambiguates between the previously inferred senses.
This dissertation is organized as follows: In Chapter 2, existing WSI implementations and models are described, as well as the theoretical foundations and additional tools that support and aid them. In Chapter 3, the architecture of the model is outlined, as well as the various stages that compose it. In Chapter 4, the implementation details, such as the corpora used, the existing tools on which the model is built, and the adaptations made to those tools, are described. Chapter 5 describes how the algorithm is tested, the theoretical foundations for the soundness of the methodology, the test corpus used, and the chosen parameters; finally, it analyses the results of the evaluation. Chapter 6 provides a synopsis of the findings and describes what can be improved in the future.
2 State of the Art
This chapter describes the state of the art: the overall approaches and their principles, as well as the various algorithms used to exploit those principles.
Distributional Semantic Models (DSMs) discover word senses from text. These are based on the Dis-
tributional Hypothesis, which is based on the idea expressed by the famous quotation ‘You shall know
a word by the company it keeps’ (Firth, 1957). That is, words are semantically similar if they appear in
similar documents, similar context windows or similar syntactic contexts (Van, 2010).
These methods implement DSMs based on statistically or geometrically oriented probability distribution
models (Van, 2010). Some of these methods are presented below.
\[ sim(e_i, e_j) = \frac{\sum_f mi_{e_i,f} \times mi_{e_j,f}}{\sqrt{\sum_f mi_{e_i,f}^2 \times \sum_f mi_{e_j,f}^2}} \tag{2.1} \]
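Equation 2.1 can be sketched in Python, assuming the mutual-information values mi are precomputed; the sparse dictionary representation used here is an assumption of this illustration, not part of CBC itself:

```python
from math import sqrt

def sim(mi_i, mi_j):
    """Cosine similarity of two sparse mutual-information feature vectors
    (Equation 2.1). Features absent from a dict are treated as 0."""
    shared = set(mi_i) & set(mi_j)
    num = sum(mi_i[f] * mi_j[f] for f in shared)
    den = sqrt(sum(v * v for v in mi_i.values()) *
               sum(v * v for v in mi_j.values()))
    return num / den if den else 0.0
```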
The CBC algorithm consists of three phases. In phase I, the top-k similar elements of each element
e are computed. In phase II a collection of tight clusters is constructed, where the elements of each
cluster form a committee. In each step, the algorithm finds a set of tight clusters, called committees,
and identifies residue elements not covered by any. First the top-k similar elements are clustered using
average-link clustering (Han and Kamber, 2000). A committee covers an element if its similarity to the
centroid of the committee exceeds a given threshold. The algorithm then recursively attempts to find
more committees among the residue elements. The details are presented in Algorithm 1.
Algorithm 1 Phase II of CBC (Pantel, 2003)
Input: A list of elements E to be clustered, a similarity database S from Phase I, thresholds θ1 and θ2 .
Step 1: Cluster the top elements of e from S using average-link clustering (Han and Kamber, 2000).
Step 2: For each discovered cluster c, compute the score |c| × avgsim(c), where |c| is the number of elements in c and avgsim(c) is the average pairwise similarity between elements in c.
Step 3: Compute the centroid of c by averaging the feature vectors of its elements and computing the mutual information scores in the same way as done for individual elements.
Step 4: If the similarity of c to the centroid of each committee previously added to C is below threshold θ1, add c to C.
Step 5: If the similarity of e to every committee in C is below threshold θ2, add e to the list of residues R.
Step 6: If R is empty, finish and return C; otherwise return the union of C and the output of a recursive call to Phase II using the same input, except replacing E with R.
In phase III each element e is assigned to its most similar clusters according to Algorithm 2. The
similarity between a cluster and an element is computed using the centroid of the committee members.
Once an element e is assigned to a cluster c, the intersecting features between both are removed from
e to allow CBC to discover the less frequent senses of a word and avoid duplicate senses.
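The phase III assignment step can be sketched as follows; the committee representation as sparse centroid dictionaries, the cosine measure, and the similarity threshold `sigma` are simplifying assumptions of this illustration:

```python
from math import sqrt

def assign_senses(element_features, committees, sigma=0.2):
    """Phase III of CBC (sketch): greedily assign an element to its most
    similar committee centroids, removing shared features after each
    assignment so that less frequent senses can surface.
    `element_features` and each centroid are sparse {feature: weight} dicts."""
    def cos(u, v):
        num = sum(u[f] * v.get(f, 0.0) for f in u)
        den = (sqrt(sum(x * x for x in u.values())) *
               sqrt(sum(x * x for x in v.values())))
        return num / den if den else 0.0

    features = dict(element_features)
    senses = []
    while features:
        best = max(committees, key=lambda c: cos(features, committees[c]))
        if cos(features, committees[best]) < sigma:
            break
        senses.append(best)
        # drop the features shared with the chosen committee's centroid
        features = {f: w for f, w in features.items()
                    if f not in committees[best]}
    return senses
```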
To evaluate the system, its output was compared with WordNet, with the frequency counts for the
nodes, called synsets, obtained from the SemCor corpus (Miller et al., 1993).
The corpus was obtained by processing 144 million words of newspaper text from the TREC Col-
lection (1988 AP Newswire, 1989-90 LA Times, and 1991 San Jose Mercury)1 . The test set was con-
structed by intersecting the words in WordNet with the nouns in the corpus, resulting in a test set of
13,403 words with an average number of 740.8 features per word. CBC obtained a precision of 60.8%,
a recall of 50.8% and an F-score of 55.4%. These evaluation measures are later described in Sec-
tion 5.1.1.
2.2.2 HERMIT
The model from Jurgens and Stevens (2010) performs WSI by modelling individual contexts in a high
dimensional word space. Word senses are induced by finding contexts which are similar and a hybrid
clustering method is used to group similar contexts.
Each context of a word is approximated using the Random Indexing (RI) word space model (Kan-
erva et al., 2000), in which the occurrence of a word is represented with an index vector instead of a set
of dimensions. RI can be described as a two-step operation (Sahlgren, 2005):
1. Each context in the data is assigned a unique and randomly generated representation, called an index vector, which is sparse and high-dimensional. Its dimensionality (d) is on the order of thousands, and it consists of a small number of randomly distributed values of ±1, with the rest of the elements set to 0;

1 http://trec.nist.gov/data/test_coll.html, last accessed on 28th September 2017.
2. Then, context vectors are produced by scanning through the text. Each time a word occurs in a
context, the context’s d-dimensional index vector is added to the context vector for that word. Words
are represented by d-dimensional context vectors which are the sum of the words’ contexts.
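The two RI steps can be sketched as follows, using the word-window variant in which each neighbouring word plays the role of a context; the dimensionality, window size and number of non-zero entries here are illustrative choices:

```python
import random
from collections import defaultdict

def random_index_vector(rng, dim=1000, nonzero=6):
    """Step 1: a sparse index vector with `nonzero` random entries of +1/-1."""
    vec = {}
    while len(vec) < nonzero:
        vec[rng.randrange(dim)] = rng.choice((-1, 1))
    return vec

def context_vectors(sentences, window=2, dim=1000, seed=0):
    """Step 2: scan the text; every time a word occurs, add the index vectors
    of the surrounding words (its context) to that word's context vector."""
    rng = random.Random(seed)
    index = defaultdict(lambda: random_index_vector(rng, dim))  # lazy step 1
    context = defaultdict(lambda: [0] * dim)
    for sent in sentences:
        for i, word in enumerate(sent):
            for j in range(max(0, i - window), min(len(sent), i + window + 1)):
                if i != j:
                    for pos, val in index[sent[j]].items():
                        context[word][pos] += val
    return dict(context)
```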
This allows for the creation of a co-occurrence matrix Fw×d which is an approximation of a standard co-occurrence matrix Fw×c, but in which d ≪ c (c being the number of contexts). As a result, HERMIT transforms the original co-occurrence counts into a smaller and denser representation without the
computational overhead of other dimensional reduction techniques such as Singular Value Decomposi-
tion (SVD) (Jurgens and Stevens, 2010).
The identification of related contexts is made through the use of clustering, which groups similar context vectors into clusters representing the distinct senses of a word. A hybrid of online k-means clustering (Liberty et al., 2016) and Hierarchical Agglomerative Clustering (HAC) (Zepeda-Mendoza and Resendis-Antonio, 2013) with a threshold is used. The threshold allows the number of clusters to be determined by data similarity instead of being manually specified.
The context vectors are clustered using k-means clustering, which assigns a context to the most similar cluster centroid. If the similarity to the nearest centroid is below the cluster threshold and there are not yet k clusters, the context forms a new cluster. The similarity between context vectors is defined as the cosine similarity (see above).
Clusters are then repeatedly merged using HAC with an average-link criterion, that is, cluster similarity is the mean of the pairwise cosine similarities between all data points of the two clusters. When the two most similar clusters have a similarity below the threshold, merging stops. The resulting clusters should then represent distinct word senses.
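The hybrid clustering just described can be sketched as below; the centroid handling and merge loop are simplified relative to HERMIT's actual implementation, and the function names are illustrative:

```python
from math import sqrt

def cosine(u, v):
    num = sum(a * b for a, b in zip(u, v))
    den = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return num / den if den else 0.0

def cluster_contexts(vectors, k=15, threshold=0.15):
    """Hybrid clustering sketch: online assignment to the nearest centroid
    (opening a new cluster while fewer than k exist and the similarity is
    below the threshold), then agglomerative merging with average link
    until no pair of clusters is more similar than the threshold."""
    clusters = []  # each cluster is a list of vectors
    for v in vectors:
        best, best_sim = None, -1.0
        for c in clusters:
            centroid = [sum(col) / len(c) for col in zip(*c)]
            s = cosine(v, centroid)
            if s > best_sim:
                best, best_sim = c, s
        if best is None or (best_sim < threshold and len(clusters) < k):
            clusters.append([v])
        else:
            best.append(v)

    def avg_link(a, b):  # average-link similarity between two clusters
        sims = [cosine(x, y) for x in a for y in b]
        return sum(sims) / len(sims)

    merged = True
    while merged and len(clusters) > 1:
        merged = False
        s, i, j = max((avg_link(a, b), i, j)
                      for i, a in enumerate(clusters)
                      for j, b in enumerate(clusters) if i < j)
        if s >= threshold:
            clusters[i].extend(clusters.pop(j))
            merged = True
    return clusters
```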
The model was submitted to SemEval-2010 Task 14 (Manandhar and Klapaftis, 2009) and evaluated with an unsupervised method, which compares the found clusters with the gold data for the word senses, and a supervised method, which evaluates the results of WSD using the induced clusters. The first was measured according to the F-score and the V-measure, the harmonic mean between homogeneity and completeness (later described in Subsection 5.1.1), while the second was measured using supervised recall. Using the provided test corpus, parameters were tuned to a context window size of ±1, a clustering threshold of 0.15 and a maximum of 15 clusters per word.
The final results of the SemEval-2010 Task 14 showed that HERMIT achieved a V-measure of 16.7%
and an F-score of 24.4% in nouns in the unsupervised evaluation and a supervised recall of 53.6%.
In graph-based models, the meanings of a word are represented by a weighted and undirected co-
occurrence graph. The nodes (or vertices) of the graph are the words which occur in the corpus and the
transitions (or edges) are co-occurrences, with the weight of the edge describing the number of times
that co-occurrence exists. Two words are said to co-occur if they both occur within the same context.
These models are based on the idea that co-occurrence graphs have the properties of small world networks (Véronis, 2004), that is, most nodes are not neighbours of one another, but most nodes can be reached from any other by a small number of steps, as shown in Figure 2.1. These properties make it possible to search for highly interconnected bundles of co-occurrences, that is, high density components, which correspond to the senses being sought.
(Figure 2.1: an example small world co-occurrence graph with nodes A–L.)
A graph is built from a Part-of-Speech (POS)-tagged corpus in (Widdows and Dorow, 2002), using words as nodes and the grammatical co-occurrence relationships between pairs as edges. The relationships extracted are the Noun-Verb, Verb-Noun, Adjective-Noun and Noun-Noun co-occurrences. To generate the edges, the top-n neighbours of each word are selected and turned into edges.
An incremental algorithm is then used to extract categories based on a given word using affinity scores, which give more importance to words that are linked to the existing neighbours. The algorithm is tailored to avoid infections from spurious co-occurrences, preventing spurious links from being mistaken for genuine semantic similarity. The process used to select and add the most similar node to a set of nodes is described in Algorithm 3, where A is a set of nodes and N(a) is the set of all nodes linked to a node a.
Algorithm 3 Select the most similar node
N(A) ← ∪_{a∈A} N(a)    ▷ A is a set of nodes, N(A) the neighbours of A
b ← argmax_{u ∈ N(A)\A} |N(u) ∩ N(A)| / |N(u)|
A ← A ∪ {b}
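Algorithm 3 can be sketched directly; the graph is assumed to be a dict mapping each word to its neighbour set, and the stopping size n is a parameter of this illustration:

```python
def grow_category(graph, seed, n=5):
    """Sketch of Algorithm 3: starting from a seed word, repeatedly add the
    neighbour of the current set A with the highest affinity
    |N(u) ∩ N(A)| / |N(u)|.  `graph` maps each node to its neighbour set."""
    A = {seed}
    while len(A) < n:
        NA = set().union(*(graph[a] for a in A))  # neighbourhood of A
        candidates = NA - A
        if not candidates:
            break
        best = max(candidates,
                   key=lambda u: len(graph[u] & NA) / len(graph[u]))
        A.add(best)
    return A
```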
The model was built using the British National Corpus (BNC)2 and evaluated against classes in
WordNet. A WordNet class was considered to be the collection of synsets subsumed by a parent synset,
for example, the class musical instruments was the collection of all synsets subsumed by the WordNet
musical instruments synset. For a given seed word, Algorithm 3 was used to find the n nodes most closely related to it. Ten classes were chosen beforehand, and for each class 20 words were retrieved using a single seed-word from the class in question.
The results show that of a total of 200 retrieved words, only 36 were incorrect, giving an accuracy
of 82%.
A graph of collocations is used in (Klapaftis and Manandhar, 2008) to generate a taxonomy of senses which takes into consideration word polysemy. In this model, each node corresponds to an occurrence of two words in the same context window (which here is a paragraph), and two nodes are connected if the two collocations occur in the same context window.
The base corpus, bc, consists of paragraphs containing the target word, tw. Besides bc, there is also a large reference corpus, rc. The project focuses on inducing the senses of tw given bc as the only input.
At first, occurrences of tw are removed from the paragraphs of bc, and all paragraphs from bc and rc are POS tagged. From these, only nouns are kept, and these are lemmatised.
Log-likelihood (G2 ) (Dunning, 1993) is then used to filter common nouns which are not semantically
related to tw, by checking if a given word wi has a similar distribution in bc and rc. If that is true, G2 will
have a small value and wi should be removed from bc.
The noun frequencies of bc are stored in a list lbc and the noun frequencies of rc are stored in a list lrc. For each word wi ∈ lbc, a table of observed counts OT, taken from lbc and lrc, and a table of expected values ET, under the model of independence, are created. G2 is then calculated using the equations in Subsection 2.5.3.
The lbc list is then filtered by removing words with a smaller relative frequency in lbc in relation to lrc.
The resulting lbc list is then sorted by the G2 values. Words that have a G2 smaller than a pre-specified
threshold p1 are then removed from bc. By the end of this process, each paragraph in bc is converted to
a list of nouns assumed to be topically related to tw.
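The G2 filter can be sketched as follows; this is a sketch assuming the standard 2×2 contingency-table formulation of the log-likelihood ratio (Dunning, 1993), which may differ in detail from the exact tables given in Subsection 2.5.3:

```python
from math import log

def g2(freq_bc, size_bc, freq_rc, size_rc):
    """Log-likelihood ratio for a word's counts in the base corpus bc versus
    the reference corpus rc: 2 * sum(O * ln(O / E)) over a 2x2 table of
    observed counts and their expected values under independence."""
    observed = [freq_bc, size_bc - freq_bc, freq_rc, size_rc - freq_rc]
    p = (freq_bc + freq_rc) / (size_bc + size_rc)  # pooled probability
    expected = [size_bc * p, size_bc * (1 - p), size_rc * p, size_rc * (1 - p)]
    return 2 * sum(o * log(o / e) for o, e in zip(observed, expected) if o > 0)
```

A word distributed similarly in both corpora yields a G2 near zero and is filtered out; a word much more frequent in bc yields a large G2 and is kept.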
With the base corpus now processed, collocations of two nouns are detected by generating all \binom{n}{2} combinations for each n-length paragraph. Conditional probabilities (Equation 2.2) are then used to
generate the weights of each collocation. Collocations that have a frequency and weight higher than a
pre-specified threshold are then used to generate the nodes of the graph G.
\[ p(i|j) = \frac{f_{ij}}{f_j} \tag{2.2} \]
The constructed graph G is sparse. A smoothing technique is applied to discover new edges be-
tween vertices and to assign weight to all of the graph’s edges. For each vertex i, a vertex vector V Ci is
assigned containing the vertices which share an edge with i in G. The similarity between each V Ci and
V Cj is then calculated using the Jaccard Similarity Coefficient (JC) (Equation 2.3).
\[ JC(VC_i, VC_j) = \frac{|VC_i \cap VC_j|}{|VC_i \cup VC_j|} \tag{2.3} \]
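The smoothing step can be sketched as below; representing each vertex vector VC_i as a Python set of neighbouring vertices is an assumption of this illustration:

```python
def jaccard(vc_i, vc_j):
    """Jaccard similarity coefficient between the neighbour sets of two
    vertices (Equation 2.3)."""
    union = vc_i | vc_j
    return len(vc_i & vc_j) / len(union) if union else 0.0

def smooth(graph):
    """Sketch of the smoothing step: the weight of every vertex pair in the
    final graph G' is the Jaccard similarity of their neighbour sets in G."""
    nodes = list(graph)
    return {(u, v): jaccard(graph[u], graph[v])
            for i, u in enumerate(nodes) for v in nodes[i + 1:]}
```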
The final graph, G0 , is then clustered using the Chinese Whispers (CW) algorithm, further described
in Section 2.4.1. This algorithm was used because it does not require input parameters and performs in
linear time to the number of edges, although it is not guaranteed to converge.
WSD is finally performed by assigning one of the induced clusters to each instance of the target word: for each paragraph of bc containing the target word, each induced cluster is assigned a score equal to the number of its collocations occurring in that paragraph, and the cluster with the highest score is assigned to that instance.
To evaluate the model, the framework of the SemEval 2007 WSI task (SWSI) (Agirre and Soroa,
2007) was used. The test corpus consists of texts from the Wall Street Journal corpus, hand-tagged
with OntoNotes senses (Hovy et al., 2006).
The model was evaluated under unsupervised evaluation and supervised evaluation (see above).
In the unsupervised evaluation, the model achieved 88.6% purity, 31% entropy (these measures are
described below, in Section 5.1.1) and an F-Score of 78%.
The model developed by Korkontzelos and Manandhar (2010) is a relaxed version of the model in (Klapaftis and Manandhar, 2008), described in Section 2.3.2, in which a node is generated from a single word if that word is considered unambiguous; otherwise, a node is generated from a set of two words.
The corpus is first preprocessed, with the aim of capturing words contextually related to the target. Sentences or paragraphs (snippets) which contain the target word are lemmatised and POS tagged using the GENIA tagger. Only nouns are kept, and words which occur in a stoplist are filtered out.
Nouns which are infrequent in the reference corpus are removed and the log-likelihood ratio (G2 ) is
used to compare the distribution of each noun to its distribution in the reference corpus. If a noun’s G2
is lower than a specified threshold, or if the noun has a higher relative frequency in the reference corpus
than in the target corpus, then that noun is removed. At this stage, each snippet is a list of lemmatised
nouns contextually related to the target word.
The graph is constructed by first representing all nouns in the list as graph vertices. Each noun within a snippet is combined with every other, generating \binom{n}{2} pairs. G2 is applied once again to the pairs.
To filter out pairs that refer to the same sense, a vector with the snippet IDs in which they occur is
generated for each pair and each noun. A pair is discarded if its vector is similar to both vectors of their
component nouns, using for that purpose the Dice coefficient (Dice, 1945), which is later described in
Subsection 2.5.3.
Edges are drawn based on the co-occurrence of the corresponding vertices in snippets. The weight of an edge is the average of the conditional probabilities of its vertices, calculated according to Equation (2.4), and low-weight edges are filtered out.
\[ w_{a,b} = \frac{1}{2}\left(\frac{f_{a,b}}{f_a} + \frac{f_{a,b}}{f_b}\right) \tag{2.4} \]
The graph is then clustered using CW, described in Section 2.4.1. To reduce the number of clusters,
a post-processing stage is applied, in which, for each cluster li , a set of all snippets Si containing at least
one vertex of li is generated. For any clusters la and lb , if Sa ⊆ Sb or Sa ⊇ Sb , these are merged.
The model was submitted for SemEval-2010 Task 14 (Manandhar and Klapaftis, 2009) and eval-
uated using an unsupervised and a supervised method. The first was measured according to the V-
measure and F-score, while the second was measured using recall. Parameters were tuned by choosing the maximum supervised recall, resulting in a frequency threshold of 10, a G2 threshold of 10, a collocation weight threshold of 0.4 and a similarity threshold for pairs-of-nouns vertices of 0.8.
Results showed that the model achieved results of a V-measure of 20.6%, an F-score of 38.2% on
the unsupervised evaluation with nouns, and a supervised recall of 59.4% on the supervised evaluation.
2.4 Graph Partitioning and Clustering Algorithms
The task of finding clusters which are optimal with respect to fitness measures is NP-complete (Šíma and Schaeffer, 2006). The following are two current algorithms, both linear in the number of edges, which try to find approximate solutions.
CW is parameter-free, as the number of partitions emerges naturally during the process. Formally, however, CW does not converge, as a tied node can be randomly assigned a different class at each iteration without ever stabilising; nor is it deterministic, due to the random orders and assignments in the algorithm.
In CW, a weighted graph, G = (V, E), has nodes vi ∈ V and weighted edges (vi , vj , wij ) ∈ E with
weight wij . If (vi , vj , wij ) ∈ E implies (vj , vi , wij ) ∈ E then G is undirected. If all weights are 1, G is
unweighted. The degree of a node is the number of edges it takes part in. The neighbourhood of a node
v is the set of all nodes v 0 such that (v, v 0 , w) ∈ E or (v 0 , v, w) ∈ E.
CW works as outlined in Algorithm 4. First, every node gets a different class. Then, for a small number of iterations, the nodes inherit the strongest class in their neighbourhood: the class whose sum of edge weights to the current node is maximal. If multiple classes are equally strong, one is chosen randomly. Classes are updated immediately, so a node can obtain a class introduced in that same iteration.
Regions of the same class stabilize during the iteration, and grow until they reach the border of
another class.
Apart from ties, the class of a node usually does not change more than a few times. The number of iterations needed depends on the largest distance between two nodes in the graph.
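CW itself is compact enough to sketch in full; the edge-list input format and the fixed iteration count used here are choices of this illustration, not part of the algorithm's definition:

```python
import random
from collections import defaultdict

def chinese_whispers(edges, iterations=50, seed=0):
    """Sketch of Chinese Whispers on a weighted, undirected graph given as
    (u, v, weight) triples. Each node starts in its own class and repeatedly
    adopts the class with the highest total edge weight in its
    neighbourhood; ties are broken randomly."""
    rng = random.Random(seed)
    graph = defaultdict(dict)
    for u, v, w in edges:
        graph[u][v] = w
        graph[v][u] = w
    label = {node: node for node in graph}   # every node gets its own class
    nodes = list(graph)
    for _ in range(iterations):
        rng.shuffle(nodes)                   # random processing order
        for node in nodes:
            scores = defaultdict(float)
            for neighbour, w in graph[node].items():
                scores[label[neighbour]] += w
            if scores:
                best = max(scores.values())
                label[node] = rng.choice(
                    [c for c, s in scores.items() if s == best])
    return label
```

On two triangles joined by a weak edge, the strong within-triangle weights dominate, so each triangle converges to one class and the classes do not cross the weak bridge.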
CW was evaluated on WSI tasks using an approach similar to the one in (Dorow and Widdows, 2003), replacing the Markov Clustering algorithm with CW. The evaluation method in (Bordag, 2006) was used: the neighbourhoods of two words are merged and the ability of the algorithm to separate the merged graph is evaluated. The evaluation measures used included retrieval precision (rP), the similarity of the found sense with the gold standard sense using the overlap measure, and retrieval recall (rR), the proportion of words assigned correctly to the gold standard sense.
Results for nouns showed that CW had a retrieval precision of 94.8% and a retrieval recall of 71.3%,
which suggest similar performance as specialized graph-clustering algorithms for WSI given the same
input.
2.4.2 MaxMax
MaxMax is a soft-clustering algorithm applicable to edge-weighted graphs (Hope and Keller, 2013a). It
is parameter-free, runs in linear time to the number of edges and it is deterministic. Test results show it
to return scores comparable with existing state-of-the-art systems.
In MaxMax, a notion of maximal affinity is used, in which the affinity between vertices u and v is the edge weight w(u, v). A vertex u has maximal affinity to a vertex v if w(u, v) is maximal among all edges incident on u; v is then said to be a maximal vertex of u.
MaxMax consists of two stages, as described in Algorithm 5. First, the weighted graph G is transformed into an unweighted, directed graph G′. Maximal affinity relationships between vertices are used to determine edge direction in G′: if in G a vertex u has two maximal vertices v and w, then in G′ u will have two directed edges, one from v to u and one from w to u.
Then, clusters are identified by finding the root vertices of subgraphs of G′ and marking all descendants of a vertex as ¬root. In the directed graph G′, a vertex v is a descendant of u if there is a directed path from u to v. At the end of this stage, the vertices still marked as root uniquely identify clusters.
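The two stages can be sketched as follows; the adjacency-dict input format is an assumption of this illustration, and the cluster extraction from the surviving roots follows the soft-clustering reading described above:

```python
def maxmax(graph):
    """Sketch of MaxMax on a weighted graph given as {u: {v: weight}}.
    Stage 1 directs an edge v -> u for every maximal vertex v of u;
    stage 2 keeps as cluster roots the vertices with no root ancestor."""
    # stage 1: children[v] holds the nodes u for which v is a maximal vertex
    children = {u: set() for u in graph}
    for u, nbrs in graph.items():
        if nbrs:
            best = max(nbrs.values())
            for v, w in nbrs.items():
                if w == best:
                    children[v].add(u)
    # stage 2: mark every descendant of a surviving root as not-root
    root = {u: True for u in graph}
    for u in graph:
        if root[u]:
            stack = list(children[u])
            while stack:
                v = stack.pop()
                if v != u and root[v]:
                    root[v] = False
                    stack.extend(children[v])
    # each remaining root identifies one (possibly overlapping) cluster
    clusters = []
    for u in graph:
        if root[u]:
            members, stack = {u}, list(children[u])
            while stack:
                v = stack.pop()
                if v not in members:
                    members.add(v)
                    stack.extend(children[v])
            clusters.append(members)
    return clusters
```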
MaxMax was evaluated in the context of the SemEval 2010 WSI Task (Manandhar et al., 2010),
using an adaptation of the Shared Nearest Neighbours algorithm and then using MaxMax to identify
sense clusters in the generated target word graph. MaxMax has the best scoring V-measure (of 32.8%)
12
among the systems evaluated in (Manandhar et al., 2010), and the worst F-score (of 13.2%) among the
systems evaluated. The authors claim that their system is overly penalized by the F-score due to the
way this is known to be biased towards clustering solutions returning large clusters and to punish small
clusters disproportionately.
In addition to the existing work in WSD and WSI, it is relevant to mention other works which make use
of word co-occurrences in text.
2.5.1 STRING
Statistical and Rule-Based Natural Language Processing Chain (STRING) (Mamede et al., 2012) is an NLP system for the Portuguese language, capable of performing all basic NLP tasks, including Named Entity Recognition (NER), IR and Anaphora Resolution (AR), among others. STRING is composed of several stages, as described in Figure 2.2.
(Figure 2.2: the STRING processing chain: LexMan, RuDriCo2, MARv4 and XIP, followed by Anaphora Resolution, Time Normalization, Slot Filling and Event Ordering modules.)
Preprocessing
In the first stage, the text is passed through a tokenizer, which is also responsible for recognising some special tokens, such as numbers or punctuation. The tokens are then passed through a POS tagger, LexMan (Vicente, 2013), which uses finite-state transducers to label tokens with their appropriate POS tags. Finally, the text is split into sentences, mainly based on punctuation, while taking into account abbreviations, acronyms and the use of ellipsis.
POS Disambiguation
This stage comprises two steps, which are performed by two distinct modules:

1. RuDriCo2, which performs rule-driven disambiguation and refines the initial segmentation;

2. Morphosyntactic Ambiguity Resolver (MARv), which performs statistical disambiguation (Ribeiro, 2003).
RuDriCo2 is responsible for refining the initial segmentation done by LexMan. It uses declarative pattern-matching rules to modify the original segmentation or to disambiguate some of the POS tags. RuDriCo2 also performs the expansion of contracted words.
MARv is responsible for resolving morphosyntactic ambiguity. It analyses the labels generated for each word and then chooses the most likely tag given its immediate context. This is done using Hidden Markov Models (HMMs). MARv uses second-order models, which codify contextual information concerning entities, and unigrams, which codify lexical information.
Syntactic Analysis
Syntactic analysis is done through the use of Xerox Incremental Parser (XIP) (Aït-Mokhtar et al., 2002),
which also allows for adding lexical, syntactic and semantic information to the output of the previous
modules. XIP is responsible for several tasks:
(i) It adds information to the existing tokens through the use of lexicons. XIP includes a pre-existing lexicon that can be extended or modified;

(ii) It allows the definition of local grammars, using pattern-matching rules, to group lexical units together into a single entity;

(iii) It also performs shallow parsing to group elements into chunks (noun phrase – NP, prepositional phrase – PP, among others) using linear precedence and sequence rules;

(iv) Finally, it performs deep parsing to extract the dependencies between the different chunks (subject, direct object, modifier, etc.).
Post-Syntactic Analysis
At this stage, additional tasks are executed, which extract additional information from the text. One is the AR task (Marques, 2013), which performs pronominal anaphora identification in the text and then uses the Expectation-Maximization (EM) algorithm to obtain the probability of each word being the antecedent of a given anaphora candidate. Other tasks performed include time normalisation (Maurício, 2011) and event ordering (Cabrita, 2014).
In Correia (2015), a tool was developed, making use of STRING, to allow the exploration of co-occurrence data obtained from Portuguese texts. The presented solution was composed of a tool to extract co-occurrences and calculate their association measures, as well as a web-based interface to display these co-occurrences in an intuitive fashion.
The developed solution takes advantage of the rich lexical resources available and the syntactic
and semantic analysis of STRING to provide information about the patterns of co-occurrences found in
the corpora evaluated.
For the project in question, each co-occurrence is based on a syntactic dependency extracted from XIP. Each dependency comprises two words, where the first is the modified element and the second is the modifier, information about the dependency itself, and a set of properties, if any exist. The project considers only a subset of the possible properties as relevant for its goals: those which indicate whether the modifier appears before or after the modified word, or whether the modifier is a focus adverb.
Architecture
The implemented solution splits the problem into four separate components:
Storage Format
An SQLite database was chosen as the format to store the obtained information. An Entity–Relationship
(ER) model was developed to represent the database, and a relational model was generated from the
created ER model.
To store the information in the database, the ER model in Figure 2.3 was created and used. In this
model:
• The entities Corpus, Word and Dependency are used to store the information about each corpus,
word and dependency type respectively.
• The weak entity Property associates the Dependency with a property type.
• Each word has a Belongs relation with the corpus, which indicates how often the word occurs in
the Corpus.
• Each pair of words and property also have a Co-occurrence relation, with a frequency attribute
which defines how often these words occur together in the same corpus with the given property.
Additionally, several attributes exist for each kind of association measure.
• The weak entity Sentence has a Belongs association with the Corpus, and the aggregation of the
Co-occurrence associations has an Exemplifies association with the Sentence entity.
1. The words in the Co-occurrence association must belong to the Corpus with which they are
associated.
2. The Co-occurrence association must be associated with the same Property with which the words
are associated in Belongs.
3. The sentences in the Exemplifies association must belong to the Corpus with which the given
Co-occurrence is associated.
Co-occurrence extraction
The co-occurrence extractor obtains the processed Extensible Markup Language (XML) output from
XIP and then parses it to obtain the dependency information.
The extractor reads all dependencies parsed from the XML and stores the following information
about each of them in the database:
Association Measures
After the database is populated with the co-occurrences from the corpus, the association measures are
calculated in batches of 2,000 co-occurrences each.
Information Display
The extracted information, stored in the database, is then displayed to the user through a web interface
written in PHP: Hypertext Preprocessor (PHP) and AngularJS. The front-end, executed on the client
side, makes Asynchronous JavaScript and XML (AJAX) requests to the back-end, executed on the
server side.
As can be seen in Section 2.2 and Section 2.3, many works rely on metrics to measure the cohesion
of two words (Pantel, 2003; Pantel and Lin, 2002; Jurgens and Stevens, 2010; Klapaftis and Manandhar,
2008; Korkontzelos and Manandhar, 2010; Correia, 2015). This section further examines some
association measures and their properties.
PMI is a measure of how much the actual probability of a co-occurrence p(x, y) differs from what would
be expected given the probabilities of the individual events and assuming the independence of p(x) and
p(y) (Church and Hanks, 1990). PMI is thus defined in Equation 2.5.
(2.5) i(x, y) = ln( p(x, y) / (p(x) p(y)) )
Normalized Pointwise Mutual Information (NPMI) normalizes the upper and lower bound of PMI
(Bouma, 2009). In this case, i_n(x, y) = 1 when the two words only occur together, i_n(x, y) = 0 when
x and y are independent, and i_n(x, y) approaches −1 when the two words occur separately but never
together, that is, when p(x, y) approaches 0 while p(x) and p(y) are fixed. NPMI is thus defined in Equation 2.6.
(2.6) i_n(x, y) = ln( p(x, y) / (p(x) p(y)) ) / ( − ln p(x, y) )
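The two measures can be sketched directly from Equations 2.5 and 2.6. This is a minimal illustration, assuming the probabilities have already been estimated from corpus counts:

```python
import math

def pmi(p_xy: float, p_x: float, p_y: float) -> float:
    """Pointwise Mutual Information (Equation 2.5)."""
    return math.log(p_xy / (p_x * p_y))

def npmi(p_xy: float, p_x: float, p_y: float) -> float:
    """Normalized PMI (Equation 2.6), bounded to [-1, 1]."""
    return pmi(p_xy, p_x, p_y) / -math.log(p_xy)

# Independent words: p(x, y) = p(x) * p(y), so both measures yield 0.
print(pmi(0.1, 0.2, 0.5))   # 0.0
# Perfect co-occurrence: p(x, y) = p(x) = p(y), so NPMI yields 1.
print(npmi(0.25, 0.25, 0.25))
```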
The Dice coefficient measures the degree of association between the two words of a co-occurrence
(Dice, 1945) and ranges from 1.0, which indicates that the two words are associated every time they
occur, down to 0.0, which indicates no association whatsoever between the two words. The Dice
coefficient is defined in Equation 2.7.
(2.7) D = 2 p(x, y) / (p(x) + p(y))
LogDice addresses the fact that the Dice coefficient usually generates very small numbers (Rychlỳ,
2008). For logDice, the maximum is 14. A value of 0 means there is less than 1 co-occurrence of XY
per 16,000 X or 16,000 Y. An increase of 1 means the co-occurrence occurs twice as often, and an
increase of 7 means the co-occurrence occurs about 100 times as often. logDice is defined
in Equation 2.8.
(2.8) logDice = 14 + log2 D = 14 + log2( 2 p(x, y) / (p(x) + p(y)) )
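A minimal sketch of Equations 2.7 and 2.8, again assuming pre-computed probabilities; note how doubling p(x, y) raises logDice by exactly 1, as stated above:

```python
import math

def dice(p_xy: float, p_x: float, p_y: float) -> float:
    """Dice coefficient (Equation 2.7)."""
    return 2 * p_xy / (p_x + p_y)

def log_dice(p_xy: float, p_x: float, p_y: float) -> float:
    """logDice (Equation 2.8): 14 plus the base-2 log of the Dice coefficient."""
    return 14 + math.log2(dice(p_xy, p_x, p_y))

# Perfect association: D = 1.0, logDice = 14 (its maximum).
print(dice(0.25, 0.25, 0.25), log_dice(0.25, 0.25, 0.25))
# Doubling p(x, y) increases logDice by exactly 1.
print(log_dice(0.2, 0.5, 0.5) - log_dice(0.1, 0.5, 0.5))
```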
χ2 is a test for dependence which does not assume normally distributed probabilities (Manning and
Schütze, 1999). The test compares observed frequencies with the frequencies expected for indepen-
dence. In the simplest case, this test is applied to 2-by-2 tables such as Table 2.1. The χ2 test sums
the differences between the observed and expected values in all squares of the table, scaled by the
magnitude of the expected values, as per Equation 2.9, where i is the table row, j is the table column,
Oij is the observed value of the cell (i, j) and Eij is the expected value.
(2.9) χ2 = Σ_{i,j} (Oij − Eij)2 / Eij
The expected frequencies Eij are calculated by converting the totals of the rows and columns into
proportions. For example, E11 is calculated by adding the items of the first row and dividing the total
by the sample size (N), adding the items of the first column and dividing the total by the sample size,
and then multiplying these two proportions and multiplying the result by the sample size, as exemplified
in Equation 2.10.
For 2-by-2 tables, the test can be simplified into Equation 2.11.
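The construction of the expected frequencies and the summation of Equation 2.9 can be sketched as follows, for an arbitrary contingency table given as a list of rows:

```python
def chi_square(table):
    """Chi-square statistic (Equation 2.9) for a contingency table.

    Expected values follow the row/column-proportion construction
    described above for E11."""
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Row proportion times column proportion times sample size.
            expected = (row_totals[i] / n) * (col_totals[j] / n) * n
            chi2 += (observed - expected) ** 2 / expected
    return chi2

print(chi_square([[10, 20], [30, 40]]))
```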
The Log-likelihood ratio (G2 or −2 log λ) was created to address the overestimation of the normal distribution
when dealing with very small probabilities (np(1 − p) < 5, for p being the probability that the next
word matches a prototype and n the number of words for which the match is being tested) (Dunning,
1993).
The likelihood ratio tests do not depend on assuming a normal distribution and instead use the
generalized likelihood ratio, which can be used effectively in smaller volumes of text than is necessary
for conventional tests based on assumed normal distributions (Dunning, 1993).
For events k1 , k2 , the likelihood ratio for the binomial distribution is defined in Equation 2.12.
(2.12) −2 log λ = 2 Σ_{i,j} nij log( nij / mij ), nij = kij / (ki1 + ki2), mij = (k1j + k2j) / N
Significance Measure
The significance measure is based on the statistical G-Test for Poisson distributions: given two words
A, B, each occurring a, b times, respectively, and k times together, the significance sig(A, B) of their
occurrence in a sentence is defined in Equation 2.13, with n being the number of sentences and
x = ab/n (Biemann et al., 2004).
2.6 Summary
Below is a comparison of the various algorithms mentioned in the previous sections. The algorithms are
evaluated according to their performance and to their features.
DSM-based algorithms have to deal with scalability problems as the number of contexts increases. Many
of them address the problem by using denser representations of the context vectors.
In comparison, graph-based algorithms have to deal with the significance of each co-occurrence,
either suppressing irrelevant co-occurrences as edges or using smoothing techniques to generate
edges which were not represented in the original graph.
As can be seen in Table 2.2 and Table 2.3, current implementations of DSM and graph-based
algorithms have equivalent performance when compared using either F-score, V-measure or supervised
recall. It is important to note that unless the Golden Standard (GS) and test data used are the same, the
results obtained are not comparable among implementations.
Table 2.2: Unsupervised evaluation of WSI algorithms on nouns. All measures are in percentage (%).
1c1word, MFS, and Random are baselines from each of the respective datasets. 1c1word and MFS
group all instances of a word into a single cluster.
Algorithm GS Prec. Recall Purity Entropy F-score V-measure
CBC WordNet 60.8 50.8 — — 55.4 —
Widdows and Dorow WordNet — — 82.0 — — —
Collocations-JC SWSI 2007 — — 88.6 31.0 78.0 —
Collocations-BL SWSI 2007 — — 89.6 29.0 73.1 —
UoY WSI&D 2010 — — — — 38.2 20.6
HERMIT WSI&D 2010 — — — — 30.1 16.7
1c1word SWSI 2007 — — — — 80.7 0.01
Random SWSI 2007 — — — — 38.1 4.91
MFS WSI&D 2010 — — — — 57.0 0.0
Random WSI&D 2010 — — — — 30.4 4.2
Table 2.3: Supervised evaluation of WSI algorithms. Unless otherwise specified, in the WSI&D 2010
dataset, the 80-20 split is used.
Algorithm Testing Corpus Sup. Recall (%)
Collocations-JC SWSI 2007 86.4
Collocations-BL SWSI 2007 85.6
UoY WSI&D 2010 59.4
HERMIT WSI&D 2010 53.6
MFS WSI&D 2010 53.2
Random WSI&D 2010 51.5
The work in this area introduces algorithms that are time-linear in the number of edges, with results
comparable to older existing algorithms.
1 The mentioned data points were obtained from (Manandhar and Klapaftis, 2009).
Both algorithms are shown to be suitable for WSI tasks. Both are time-linear in the number of
edges and ideal for execution on large-scale graphs with small-world features, as can be seen in
Table 2.4.
CW is already frequently used in the area of WSI with good results, but unlike MaxMax, it is not
deterministic, and it does not allow placing the same node in several clusters – soft-clustering –, a
feature especially desirable for a global graph approach, so that one word can have several senses. On
the other hand, MaxMax is known to generate many fine-grained clusters, which then have to be merged
to obtain clusters closer in size to the target senses (Hope and Keller, 2013b).
Table 2.4: Comparison of Graph Clustering Algorithms. Retrieval Precision (rPrec.), Retrieval Recall
(rRecall), F-score and V-measure are measured in percentage (%).
Alg. Determ. Soft-Clust. rPrec. rRecall F-sc. V-mes.
Chinese Whispers No No 94.8 71.3 — —
MaxMax Yes Yes — — 13.2 32.8
3 Architecture
In this chapter, the architecture of the WSI and WSD system is presented, as well as its articulation
with the STRING NLP system.
The architecture of this system consists of four components, ordered as per Figure 3.1:
1. a corpus pre-parser, which prepares the corpus being used to be processed by STRING;
2. the co-occurrence extractor from Correia (2015);
3. a graph constructor and clustering algorithm;
4. a word sense disambiguation module.
[Figure 3.1: Architecture overview – Sentences → STRING → XIP XML → Dependency Extractor →
Dependency Database → Graph Generator → Dependency Graph → Graph Clusterer → Dependency
Clusters → Disambiguator]
The data flow is represented in Figure 3.2. The processed text from STRING goes through a modified
implementation of the co-occurrence extractor from Correia (2015), which stores the co-occurring
word pairs in a database, along with their frequencies and association measure values in each corpus.
Syntactic co-occurrences are extracted using the work by Correia (2015), as per the example in
Figure 3.3. Additionally, this work uses the named entities obtained from XIP. If a word in a co-occurrence
is a named entity, the word is attributed a lemma with the appropriate category (for example, the
word Pedro is identified with the lemma INDIVIDUAL).
Figure 3.2: The progression of data as it evolves through the architecture
[Diagram: XIP XML → co-occurrence extraction → co-occurrences (words, relations) → graph
construction (nodes, edges) → co-occurrence graph → clustering algorithm → cluster of nodes]
[Figure 3.3: example of extracted syntactic dependencies (subj, cdir, quantd, mod) with pre/post
properties]
The co-occurrences are then stored in an SQLite database, using the ER model in Figure 3.4,
adapted from Correia (2015), with minor changes to keep information on all the sentences used (in the
Context entity) instead of only 20 randomly selected sentences for each co-occurrence.
Additionally, the relationship Exemplifies is replaced with Occurs, which associates each
Co-occurrence aggregation with all the respective Contexts where it occurs.
For a target word w, a query to the database is made to obtain the word pairs and association measure
values of all co-occurrences which occur in the same contexts as w. After filtering all co-occurrences
which do not reach the minimum threshold value of the association measure being used, the co-
occurrences are saved in a graph structure.
To ensure that only words directly related to w remain, a breadth-first search is made starting from
the target word. Only the nodes which were visited during this process are kept in the final graph.
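The graph construction step described above can be sketched as follows. The co-occurrence tuples, the threshold value and the word names are illustrative, not taken from the actual database:

```python
from collections import deque

def build_graph(cooccurrences, target, threshold):
    """Build an undirected co-occurrence graph, drop edges below the
    association-measure threshold, and keep only nodes reachable from
    the target word via breadth-first search."""
    graph = {}
    for w1, w2, measure in cooccurrences:
        if measure < threshold:
            continue  # edge filtered out by the threshold
        graph.setdefault(w1, set()).add(w2)
        graph.setdefault(w2, set()).add(w1)
    # Breadth-first search starting from the target word.
    visited = {target}
    queue = deque([target])
    while queue:
        node = queue.popleft()
        for neighbour in graph.get(node, ()):
            if neighbour not in visited:
                visited.add(neighbour)
                queue.append(neighbour)
    # Keep only the visited nodes and the edges among them.
    return {node: graph[node] & visited for node in graph if node in visited}

# Hypothetical co-occurrences: (word1, word2, association measure value).
cooc = [("vingar", "nemesi", 0.8), ("nemesi", "Shrek", 0.6),
        ("solto", "isolado", 0.9), ("vingar", "fraco", 0.1)]
print(sorted(build_graph(cooc, "vingar", 0.5)))
```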
Figure 3.4: The Entity–Relationship model used to store the information in the database
[Diagram: entities Word (word, class), Dependency (type), Property (type) and Context; relationships
belongs, occurs, has, and Co-occurrence (frequency)]
After the graph is generated, a graph clustering algorithm is run against it, and the resulting senses
are stored.
Disambiguation is done by measuring the separation between the words in the given context and each
of the induced sense clusters. The sense with the lowest separation score is the most likely sense of
the word.
For a target word w from a given context c, the co-occurrences from c are extracted and used to
generate the cluster for the context, Ci .
Then, for each inferred sense cluster Cj of w, the Separation between Ci and Cj is calculated
according to Equation 3.1 (Hope and Keller, 2013b). proximity is defined as the weight of the co-
occurrence in the dependency graph of the word.
(3.1) separation(Ci, Cj) = 1 − ( Σ_{x∈Ci, y∈Cj} proximity(x, y) ) / ( |Ci| × |Cj| )
The cluster Cj with the lowest separation score compared to Ci is then considered the most likely
sense of the target word w.
4 Implementation
To perform the induction step, the CETEMPúblico corpus (Rocha and Santos, 2000) was used.
This corpus has more than 1 million extracts of articles from the Portuguese newspaper Público and
over 191 million words. The resulting database has more than 50 million co-occurrences, as well as the
respective association measures.
This chapter describes the changes which were required to populate the database within a reasonable
time frame, as well as implementation details of some of the stages.
Two corpora were chosen to be used in this project, the CETEMPúblico corpus (Rocha and Santos,
2000), and a dump of the Portuguese edition of Wikipedia1 . Each corpus used its own syntax to describe
its contents. As STRING only parses plain text sentences, the additional meta-data provided was either
removed or adapted for parsing by STRING.
4.1.1 CETEMPúblico
CETEMPúblico was provided in Standard Generalized Markup Language (SGML) format with the fol-
lowing tags:
To parse it, a Python script was designed, which parses it as XML using a parser that generates the
Document Object Model (DOM) Element Tree incrementally. Tags considered irrelevant were ignored,
and tags with special meanings, such as the ones setting the boundaries of a sentence, paragraph or
excerpt were replaced with unique plain text identifiers.
Because the corpus was in SGML format and not XML, a few replacements were made before
feeding each line to the parser to make sure that the parser would only be fed valid XML. These made
sure that attributes were quoted and that all elements had an opening and closing tag.
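The incremental parsing strategy can be sketched with Python's standard `xml.etree.ElementTree.iterparse`, which builds the element tree incrementally and emits events as elements are closed. The `<s>` tag name used for sentence boundaries is illustrative:

```python
import io
import xml.etree.ElementTree as ET

def extract_sentences(xml_text):
    """Walk the element tree incrementally, collecting sentence text.
    The <s> tag name for sentence boundaries is illustrative."""
    sentences = []
    for _event, elem in ET.iterparse(io.StringIO(xml_text), events=("end",)):
        if elem.tag == "s":                  # sentence boundary reached
            sentences.append((elem.text or "").strip())
            elem.clear()                     # free memory as we go
    return sentences

print(extract_sentences(
    "<ext><p><s>Primeira frase.</s><s>Segunda frase.</s></p></ext>"))
```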
1 https://en.wikipedia.org/wiki/Wikipedia:Database_download, last accessed on 21st June 2017.
4.1.2 Wikipedia
After obtaining the Wikipedia dump, a tool called WikiExtractor2 was used to convert the obtained XML
into mostly plain text files.
Further cleaning was executed, in which all possibly invalid XML or leftover tags were found
using a regular expression, added to a list, and removed automatically.
Finally, document boundaries and paragraphs were replaced with unique plain-text identifiers which
can be recognized even after being parsed by STRING.
To obtain the co-occurrences, the extractor created by Correia (2015) was used, but it had to be adapted,
as it had been developed under requirements and conditions distinct from the ones necessary for this
work. These adaptations are described below.
To be able to provide all the required information to the graph construction algorithm, the model used to
store the information had to be modified. A new ER model was designed, as shown in Figure 3.4, and
used to generate the relational model used in the database.
All tables have their own id, used as the primary key, with the previous primary key being kept under
a UNIQUE constraint. The new id primary key is used to reference a given table's row from other tables.
This helped to reduce the space occupied by repeated references to text-based primary keys.
The XIP files were being parsed as XML using Java’s W3C-based DOM parser3 . This parser loads the
file in memory and creates the DOM-like tree structure from there.
This implementation had problems with larger files, in the range of 100 MB and above, taking
exponentially longer to perform the most basic operations.
The existing DOM parser was thus replaced with a pull-parser. This reads the file sequentially and,
as new tags are found, such as the start or the end of an element, an event is generated, with only the
contents pertaining to it.
The pull-parser has an O(1) memory usage for parsing, as only the currently parsed segment needs
to stay in memory while the document structure is built.
2 https://github.com/attardi/wikiextractor, last accessed on 21st June 2017.
3 https://docs.oracle.com/javase/8/docs/api/org/w3c/dom/package-summary.html, last accessed on 6th
August 2017
On top of the XML pull parser, a basic stack is used to keep track of the current element and depth
in the document. The name of the current tag is pushed to the stack when a start event is emitted, and
the top element is popped when an end event is emitted.
With the events and the stack, a state machine is used, which is responsible for deciding the next
action in the construction of the XIP document in memory.
After parsing the XIP files, the dependencies are extracted and added to the database as co-occurrences.
First, an INSERT is attempted; if it fails due to a UNIQUE violation, the existing entry is
updated with the new information.
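The insert-then-update pattern can be sketched as follows with Python's `sqlite3` module. The schema is heavily simplified (the real Coocorrencia table references word ids and stores association measures), but it shows the surrogate `id` primary key, the old natural key kept as a UNIQUE constraint, and the fallback to UPDATE on a violation:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE Coocorrencia (
        id INTEGER PRIMARY KEY,       -- surrogate key
        word1 TEXT, word2 TEXT, frequency INTEGER,
        UNIQUE (word1, word2)         -- old natural key, kept as a constraint
    )""")

def add_cooccurrence(connection, w1, w2):
    """Attempt an INSERT; on a UNIQUE violation, update the existing row."""
    try:
        connection.execute(
            "INSERT INTO Coocorrencia (word1, word2, frequency)"
            " VALUES (?, ?, 1)", (w1, w2))
    except sqlite3.IntegrityError:
        connection.execute(
            "UPDATE Coocorrencia SET frequency = frequency + 1"
            " WHERE word1 = ? AND word2 = ?", (w1, w2))

add_cooccurrence(conn, "casa", "grande")
add_cooccurrence(conn, "casa", "grande")  # second call takes the UPDATE path
```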
After all co-occurrences are added to the database, the values of the association measures are
updated in batches of 2,000 at a time. To prevent slowdowns while waiting on reads, a cache of values
read from the database is used to avoid reading the same value multiple times, considerably reducing
the time taken to populate the values of the association measures.
Given a target word w, a query to the database is made to obtain all co-occurrences that happen in the
same context as co-occurrences with w as either the first or second word, as in Listing 4.1.
SELECT Coocorrencia.*,
       p1.palavra AS p1lemma,
       p1.classe AS p1class,
       p2.palavra AS p2lemma,
       p2.classe AS p2class
FROM Coocorrencia
INNER JOIN CoOccurrenceContexts ON Coocorrencia.id = CoOccurrenceContexts.cooccurrence
INNER JOIN Palavra AS p1 ON Coocorrencia.idPalavra1 = p1.idPalavra
INNER JOIN Palavra AS p2 ON Coocorrencia.idPalavra2 = p2.idPalavra
WHERE CoOccurrenceContexts.context IN
    (SELECT CoOccurrenceContexts.context
     FROM CoOccurrenceContexts
     INNER JOIN Coocorrencia ON CoOccurrenceContexts.cooccurrence = Coocorrencia.id
     WHERE Coocorrencia.idPalavra1 = ?1
        OR Coocorrencia.idPalavra2 = ?1)
Listing 4.1: SQL Query to extract all co-occurrences in same context as the target word
The resulting set of rows consists of all co-occurrences happening in the same context as w. These
are then assembled into a graph, in which all the nodes represent the words from the set, and the edges
represent the co-occurrences of that same set.
All co-occurrences in which the association measure’s weight is lower than a pre-specified threshold
are then removed from the graph.
A breadth-first search is then applied, starting from w, to ensure that only nodes connected to
w through any number of steps remain in the graph. This eliminates nodes and co-occurrences not
connected to the graph at all, coming – for example – from dangling co-occurrences that were previously
connected through a path removed in one of the previous steps of the generation.
The resulting graph is finally clustered using the CW algorithm, explained in Section 2.4.
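For reference, a minimal sketch of the CW algorithm on a weighted adjacency-dict graph. This is an illustrative reimplementation, not the code used in the project; the fixed seed only makes the sketch repeatable, since plain CW is not deterministic:

```python
import random

def chinese_whispers(graph, iterations=20, seed=0):
    """Minimal Chinese Whispers sketch: each node repeatedly adopts the
    label with the highest total edge weight among its neighbours.
    `graph` maps each node to a dict of neighbour -> edge weight."""
    rng = random.Random(seed)          # fixed seed for repeatability only
    labels = {node: node for node in graph}  # every node starts as its own class
    nodes = list(graph)
    for _ in range(iterations):
        rng.shuffle(nodes)             # process nodes in random order
        for node in nodes:
            weights = {}
            for neighbour, w in graph[node].items():
                label = labels[neighbour]
                weights[label] = weights.get(label, 0.0) + w
            if weights:
                labels[node] = max(weights, key=weights.get)
    return labels

# Two disconnected triangles collapse into two clusters.
g = {"a": {"b": 1, "c": 1}, "b": {"a": 1, "c": 1}, "c": {"a": 1, "b": 1},
     "x": {"y": 1, "z": 1}, "y": {"x": 1, "z": 1}, "z": {"x": 1, "y": 1}}
print(chinese_whispers(g))
```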
To be able to perform disambiguation, additional changes had to be made to the co-occurrence extractor
from Correia (2015). The extractor had merged together the logic to extract the co-occurrences and
the code to write them to the database. To make the extractor usable for the task of disambiguation,
the database-writing code was separated from the co-occurrence extraction logic.
Having applied those changes, the disambiguation for a target word w and a context c starts by
using the modified extractor to obtain all word co-occurrences in c. These are considered the cluster of
co-occurrences of the word w in context c.
To discover which of the induced senses s is the most likely to be in use, each one of them is
compared to the cluster of co-occurrences from the context using the measure of separation defined in
Equation 3.1. The cluster with the lowest separation is considered the most likely sense of the word in
the given context.
As many of these steps can take considerable amounts of time, it is desirable to avoid repeating them
whenever possible. As a result, most of the execution pipeline was adapted to read and write from files
as often as possible, allowing the project to use these files as a cache for calculated operations.
The format chosen was a subset of Comma Separated Values (CSV) files, in which each element
was a row and each property was a column. Each time an extensive operation is concluded, such as
obtaining the co-occurrences in the context of a word, or generating the clusters for a word using a given
set of parameters, the results are saved in the CSV file. If that same set of data is required later on,
instead of re-calculating it from scratch, the information is obtained from the existing file.
5 Evaluation
The evaluation is composed of two methods, an unsupervised evaluation and a supervised evaluation.
The unsupervised evaluation is used to assess the resulting clusters’ similarity to the Golden Standard
(GS) senses. The supervised evaluation is used as an application-oriented assessment of the resulting
clusters in the task of WSD.
In this evaluation method, the set of resulting clusters is compared to a GS. This comparison is made by
evaluating the clusters’ homogeneity and completeness (Manandhar and Klapaftis, 2009). Homogeneity
refers to the degree that each cluster consists of data points primarily belonging to a single GS class.
Completeness refers to the degree that each GS class consists of data points assigned to a single
cluster. To evaluate homogeneity and completeness, the F-score and the V-measure will be used.
Given a particular GS sense gsi of size ai and a cluster cj of size aj, the F-score of gsi and cj is
the harmonic mean of its precision and its recall, as defined in Equation 5.1 (Agirre and Soroa, 2007).
Precision of a class gsi with respect to cluster cj is the number of common instances divided by total
cluster size, i.e. P(gsi, cj) = aij / aj. The recall of a class gsi with respect to cluster cj is the number of
common instances divided by total sense size, i.e. R(gsi, cj) = aij / ai.
(5.1) f(gsi, cj) = 2 P(gsi, cj) R(gsi, cj) / ( P(gsi, cj) + R(gsi, cj) )
The F-score of a class gsi is the maximum F-score obtained at any cluster, according to Equation 5.2.
The F-score of the entire clustering solution is the weighted average of the F-scores of each
GS sense, as in Equation 5.3, where q is the number of GS senses and N is the total number of target
word instances.
(5.3) FS = Σ_{i=1}^{q} ( |gsi| / N ) F(gsi)
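Equations 5.1 to 5.3 can be sketched as a single function over two partitions, each given as a mapping from sense or cluster names to sets of instance ids:

```python
def fscore_solution(gs_senses, clusters):
    """Weighted-average F-score of a clustering solution (Equations 5.1-5.3).
    Both arguments map sense/cluster names to sets of instance ids."""
    n = sum(len(instances) for instances in gs_senses.values())
    total = 0.0
    for gs in gs_senses.values():
        best = 0.0  # F(gs): best F-score of this sense over all clusters
        for cluster in clusters.values():
            common = len(gs & cluster)
            if common == 0:
                continue
            precision = common / len(cluster)   # P(gs, c) = a_ij / a_j
            recall = common / len(gs)           # R(gs, c) = a_ij / a_i
            best = max(best, 2 * precision * recall / (precision + recall))
        total += len(gs) / n * best             # weight by sense size
    return total

gs = {"s1": {1, 2}, "s2": {3, 4}}
print(fscore_solution(gs, {"c1": {1, 2}, "c2": {3, 4}}))  # perfect clustering
print(fscore_solution(gs, {"c1": {1, 2, 3, 4}}))          # one big cluster
```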
The F-score measures both homogeneity (precision) and completeness (recall). However, the
F-score suffers from the matching problem, by not evaluating the entire membership of a cluster (Rosenberg
and Hirschberg, 2007). This is due to the F-score not considering the components of the clusters
beyond the majority class. Furthermore, the F-score penalises systems for getting the number of GS
classes wrong (Manandhar and Klapaftis, 2009).
Thus, to complement the F-score, the V-measure is also used. V-measure is an entropy-based
measure that explicitly measures how successfully the criteria of homogeneity and completeness have
been satisfied (Rosenberg and Hirschberg, 2007). Just as precision and recall are combined to form
the F-score, homogeneity and completeness are combined using the harmonic mean to compute the
V-measure.
For the homogeneity criterion, a clustering must assign only the data points of a single class to a
single cluster. That is, the class distribution of each cluster should be skewed to a single class, i.e.,
have zero entropy (Rosenberg and Hirschberg, 2007). In a perfectly homogeneous case, H(GS|C) = 0, while in
an imperfect situation this value depends on the size of the dataset and the distribution of class sizes.
Therefore, the V-measure normalizes this value by the maximum reduction in entropy the clustering
could provide, H(GS), resulting in Equations 5.4, 5.5, and 5.6.
(5.4) h = 1, if H(GS, C) = 0; h = 1 − H(GS|C) / H(GS), otherwise
(5.6) H(GS|C) = − Σ_{j=1}^{|C|} Σ_{i=1}^{|GS|} (aij / N) log( aij / Σ_{k=1}^{|GS|} akj )
Symmetrically to homogeneity, for the completeness criterion, a clustering solution must assign all
the data points of a single class to a single cluster. This can be evaluated by calculating the conditional
entropy of the proposed cluster distribution given the class of the component data points, H(C|GS).
In a perfectly complete case, H(C|GS) = 0, while in the worst-case scenario, in which each class is
represented by every cluster with a distribution equal to the distribution of cluster sizes, H(C|GS) is
maximal and equals H(C). Therefore, in a way symmetric to that used for homogeneity, the V-measure
defines completeness as in Equations 5.7, 5.8, and 5.9.
(5.7) c = 1, if H(C, GS) = 0; c = 1 − H(C|GS) / H(C), otherwise
(5.9) H(C|GS) = − Σ_{i=1}^{|GS|} Σ_{j=1}^{|C|} (aij / N) log( aij / Σ_{k=1}^{|C|} aik )
Based on these calculations of homogeneity and completeness, the V-measure of a clustering solu-
tion is then computed using the weighted harmonic mean of homogeneity and completeness, according
to Equation 5.10, in which if β is greater than 1 completeness is weighted more strongly and if less than
1 homogeneity is weighted more strongly.
(5.10) Vβ = (1 + β) h c / ( (β h) + c )
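Equations 5.4 to 5.10 can be sketched as follows, with both partitions given as dicts of name → set of instance ids; a perfect clustering yields a V-measure of 1:

```python
import math

def v_measure(gs_senses, clusters, beta=1.0):
    """V-measure sketch (Equations 5.4-5.10) for two partitions of the
    same instances, given as dicts of name -> set of instance ids."""
    n = sum(len(c) for c in clusters.values())

    def entropy(partition):
        return -sum(len(p) / n * math.log(len(p) / n)
                    for p in partition.values() if p)

    def conditional_entropy(target, given):
        # H(target | given), e.g. H(GS|C) as in Equation 5.6.
        h = 0.0
        for g in given.values():
            for t in target.values():
                a = len(t & g)
                if a:
                    h -= a / n * math.log(a / len(g))
        return h

    h_gs, h_c = entropy(gs_senses), entropy(clusters)
    h = 1.0 if h_gs == 0 else 1 - conditional_entropy(gs_senses, clusters) / h_gs
    c = 1.0 if h_c == 0 else 1 - conditional_entropy(clusters, gs_senses) / h_c
    return (1 + beta) * h * c / (beta * h + c)

gs = {"s1": {1, 2}, "s2": {3, 4}}
print(v_measure(gs, {"c1": {1, 2}, "c2": {3, 4}}))  # perfect clustering
```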
Although the V-measure does not increase monotonically, it is known to tend to favour systems
producing a higher number of clusters than the number of GS senses (Manandhar et al., 2010). With
this in mind, both the F-score and the V-measure are used for this evaluation method, as the F-score
penalises systems when they produce a different number of clusters from the number of GS senses.
Additional measures for unsupervised evaluation include entropy and purity. Entropy measures
how the various classes of objects are distributed within each cluster; generally, the smaller the entropy,
the better the clustering algorithm performs. Purity, on the other hand, measures the extent to which
each cluster contains objects from primarily one class; the larger the purity, the better the clustering
algorithm performs. A formal definition of these measures is available in (Zhao et al., 2005). However,
as both of them evaluate only the homogeneity of a clustering algorithm, disregarding completeness
(Manandhar and Klapaftis, 2009), they will not be considered in this evaluation.
5.1.2 Supervised Evaluation
In the supervised evaluation method, the target corpus is split into a testing and a training part. The
training part is used to map the automatically induced clusters to GS senses (Agirre et al., 2006). After
that, the testing corpus is used to evaluate the resulting clustering in a WSD setting.
Suppose there are m clusters and n senses for the target word. M is the set of probabilities of
words belonging to clusters, M = {mij}, 1 ≤ i ≤ m, 1 ≤ j ≤ n, with each mij = P(gsj | ci), that is, mij
is the probability of a word having sense j given that it has been assigned to cluster i. This probability
can be computed by counting the times an occurrence with sense j has been assigned to cluster i in the
training corpus.
The matrix M is then used to transform any cluster score vector h returned by the algorithm into a
sense score vector s, by multiplying the score vector by the matrix, that is, s = hM. For each test
corpus instance, the cluster score vector h is converted into a sense score vector s, and the sense with
the maximum score is assigned to that instance.
As the algorithm always returns an answer, its recall is 100% in all cases: there are no false
negatives, as there are no negatives at all. Therefore, the algorithm only needs to be evaluated
according to its precision (Agirre et al., 2006).
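The mapping step can be sketched as a plain matrix-vector product, where `mapping[i][j]` holds P(gsj | ci) estimated from the training part; the values below are purely illustrative:

```python
def map_and_score(cluster_scores, mapping):
    """Convert a cluster score vector h into a sense score vector s = hM,
    then return the index of the highest-scoring sense."""
    m = len(mapping)            # number of clusters
    n_senses = len(mapping[0])  # number of senses
    s = [sum(cluster_scores[i] * mapping[i][j] for i in range(m))
         for j in range(n_senses)]
    return max(range(n_senses), key=lambda j: s[j])

# Hypothetical mapping for 2 clusters and 2 senses.
M = [[0.9, 0.1],   # cluster 0 is mostly sense 0
     [0.2, 0.8]]   # cluster 1 is mostly sense 1
print(map_and_score([1.0, 0.0], M))  # instance assigned to cluster 0
```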
The NPMI and the logDice association measures were chosen due to their normalized scores,
which allow using the same parameter across different words while keeping the same underlying
meaning. The NPMI association measure was tested with minimum thresholds of 0.0, 0.25, 0.5, and
0.75, ranging from each word being at best independent of the other up to both occurring mostly together.
The logDice association measure was tested with minimum thresholds of 0.0, 2.5, 5.0, 7.5, and 10.
up to 34.4% for the highest. Not counting the most restrictive thresholds of logDice and NPMI, all tests
had better results than MFS, which scored 0.4% in V-measure.
Another thing which is possible to see is that, when the threshold is too high, not enough points are
available to form a meaningful view of the contexts, resulting in no clusters at all and thus in poor results.
When evaluating the number of clusters, it is possible to see that most tests might have been
penalised due to the high number of clusters they had compared to the average number of GS senses.
The results on supervised WSD (seen in Table 5.2) were very poor overall. None of the tests were able
to surpass the results of MFS, with a precision of 65.7%. The highest result was using logDice with a
threshold of 7.5, which reached a precision of 10.1%.
Overall, the tests had poor results. In all examples MFS was able to achieve better results, showing the
project is not ready to be used for disambiguation.
The high number of clusters obtained (on average in the hundreds) shows that the results are
too fine-grained to be properly matched to the senses one is trying to disambiguate.
Further inspection into specific graphs of some words, such as the graph for the word vingar (Fig-
ure 5.1) can further explain the obtained results.
Figure 5.1: Image of the induction graph for the word vingar, using the CW algorithm and the NPMI
association measure.
The first noticeable thing is that the graph includes a few words with spelling errors, such as the
nodes natambém and estratrégia. The second noticeable thing is that although the work by Correia
(2015) identifies named entities and replaces their lemmas with their categories, many of the words
in the graph are named entities which were not recognized as such by STRING. This can be seen in
nodes such as Windsor and Shrek, and adds noise to the graph, increasing the number of small clusters
generated.
But the most noticeable thing is how sparse the graph is. Algorithms such as CW or MaxMax require
a small-world network with several high-density areas to be able to find clusters in the graph. In a graph
such as the one in Figure 5.1, with the exception of the node corresponding to the target word, no nodes
have more than 2 neighbours. This undermines the assumptions used in graph-clustering algorithms,
and prevents the possibility of better results.
It is possible the graph is sparse because the syntactic dependencies used impose a stricter relationship
between the two words than mere co-occurrence of words in the same sentence or paragraph
would. The stricter relationship between the words changes the behaviour of the resulting graph, which
makes the graph-clustering algorithms behave poorly.
Additionally, the stricter relationship might be preventing words that are related, but have no syntactic relationship, from being included in the graph. This might make the generated graphs unsuitable for the specific WSI algorithms used in this project.
Another possible cause of the poor results is the absence of named-entity categories, such as PERSON, PLACE or ORG, among others, in the algorithm used. Being blind to these categories, the algorithm cannot make use of them to help infer the senses of the target word.
6 Conclusion
This project proposed an algorithm for disambiguating word senses in the Portuguese language without the need to manually create a sense inventory or sense-tagged corpora. An experimental implementation was created, using a graph-based algorithm and the dependencies produced by STRING. With this implementation, a database was populated with the co-occurrences extracted from the CETEMPúblico corpus.
Finally, the use of syntactic dependencies for the task of WSI was evaluated and compared against an MFS baseline.
Future work could focus on improving the results obtained with this implementation, as well as on improving the system's performance.
One possibility to improve the results would be to evaluate the use of relaxed co-occurrences based
on context windows, sentences or even paragraphs, instead of the syntactic dependency co-occurrences
used in this project. This might influence the characteristics of the generated graphs and improve the
performance of the graph-clustering algorithms.
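Such relaxed co-occurrences could be extracted with a simple sliding window over lemmatized sentences. The sketch below is a minimal illustration of the idea, not a proposed implementation, and the tokenized sentences are hypothetical:

```python
from collections import Counter

def window_cooccurrences(sentences, window=3):
    """Count co-occurrences of lemmas within a sliding window,
    instead of requiring a syntactic dependency between them."""
    counts = Counter()
    for tokens in sentences:
        for i, w in enumerate(tokens):
            # Pair the current lemma with the next `window` lemmas.
            for v in tokens[i + 1:i + 1 + window]:
                if v != w:
                    counts[tuple(sorted((w, v)))] += 1
    return counts

# Hypothetical lemmatized sentences.
sentences = [["plano", "vingar", "morte"], ["semente", "vingar", "solo"]]
counts = window_cooccurrences(sentences, window=2)
```

Widening the window (or counting over whole sentences or paragraphs) produces denser graphs, at the cost of admitting weaker, noisier associations.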
Another direction would be to investigate vector-based algorithms. These might be able to make use of the dependency information provided by XIP, and could outperform algorithms that are blind to this information, such as the ones used in this project.
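As a minimal illustration of this idea, each occurrence of a target word could be represented as a vector of dependency-typed features and compared with cosine similarity. The relation names (SUBJ, CDIR, MOD) follow the general style of XIP dependencies, but the contexts below are invented, not actual XIP output:

```python
import math
from collections import Counter

def dep_vector(contexts):
    """Bag of dependency-typed features, e.g. 'CDIR>morte'."""
    return Counter(f"{rel}>{lemma}" for rel, lemma in contexts)

def cosine(u, v):
    """Cosine similarity between two sparse feature vectors."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# Two hypothetical senses of "vingar": "avenge" vs. "thrive".
occ1 = dep_vector([("CDIR", "morte"), ("SUBJ", "irmão")])
occ2 = dep_vector([("SUBJ", "semente"), ("MOD", "solo")])
occ3 = dep_vector([("CDIR", "morte"), ("SUBJ", "filho")])
```

Occurrences sharing dependency-typed contexts (occ1 and occ3) score higher than occurrences from different senses (occ1 and occ2), so clustering these vectors would group occurrences by sense while still exploiting the dependency labels.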
To improve performance, it might be relevant to investigate porting the data to a graph database engine. The traditional relational database used requires multiple JOINs to execute the required queries; a database with a native concept of a graph would eliminate these JOINs and improve execution speed.
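The JOIN overhead can be illustrated with a minimal sketch. The two-table schema below is hypothetical and far simpler than the one actually used in the project:

```python
import sqlite3

# Hypothetical minimal schema: lemmas in one table, co-occurrence
# counts between word ids in another.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE word (id INTEGER PRIMARY KEY, lemma TEXT);
CREATE TABLE cooc (w1 INTEGER REFERENCES word(id),
                   w2 INTEGER REFERENCES word(id),
                   freq INTEGER);
""")
conn.executemany("INSERT INTO word VALUES (?, ?)",
                 [(1, "vingar"), (2, "morte"), (3, "semente")])
conn.executemany("INSERT INTO cooc VALUES (?, ?, ?)",
                 [(1, 2, 5), (1, 3, 2)])

# Fetching the neighbourhood of one lemma already needs two JOINs;
# expanding to neighbours-of-neighbours doubles that again.
rows = conn.execute("""
    SELECT n.lemma, c.freq
    FROM word w
    JOIN cooc c ON c.w1 = w.id
    JOIN word n ON n.id = c.w2
    WHERE w.lemma = ?
    ORDER BY c.freq DESC
""", ("vingar",)).fetchall()
```

In a graph database the same neighbourhood is reachable by following edges directly from the target node, with no joins, which is why such engines are a plausible fit for this workload.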
Although the project was unable to achieve satisfactory results in the evaluation, it was possible to pinpoint its likely weaknesses. A base model was defined and documented, and it can serve as a foundation for future attempts at solving the challenges identified.
Bibliography
Agirre, E., D. Martínez, O. L. De Lacalle, and A. Soroa (2006). Evaluating and optimizing the parameters of an unsupervised graph-based WSD algorithm. In Proceedings of the 1st Workshop on Graph Based Methods for Natural Language Processing, pp. 89–96. Association for Computational Linguistics.
Agirre, E. and A. Soroa (2007). SemEval-2007 task 02: Evaluating word sense induction and discrimination systems. In Proceedings of the 4th Intl. Workshop on Semantic Evaluations, pp. 7–12. Association for Computational Linguistics.
Aït-Mokhtar, S., J.-P. Chanod, and C. Roux (2002). Robustness beyond shallowness: incremental deep
parsing. Natural Language Engineering 8(2-3), 121–144.
Baptista, J. (2013). ViPEr: uma base de dados de construções léxico-sintáticas de verbos do Português
Europeu. In XXVIII Encontro Nacional da Associação Portuguesa de Linguística, pp. 111–129. APL.
Biemann, C. (2006). Chinese Whispers: An efficient graph clustering algorithm and its application to
natural language processing problems. In Proceedings of the 1st Workshop on Graph-Based Methods
for Natural Language Processing, pp. 73–80. Association for Computational Linguistics.
Biemann, C., S. Bordag, G. Heyer, U. Quasthoff, and C. Wolff (2004). Language-independent methods
for compiling monolingual lexical data. In 5th Intl. Conf. on Computational Linguistics and Intelligent
Text Processing, pp. 217–228. Springer.
Bordag, S. (2006). Word sense induction: Triplet-based clustering and automatic evaluation. In Pro-
ceedings of the 11th Conf. of the European Chapter of the Association for Computational Linguistics,
pp. 137–144.
Cabrita, V. (2014, November). Identificar, ordenar e relacionar eventos. Master’s thesis, Instituto Supe-
rior Técnico, Universidade de Lisboa.
Church, K. W. and P. Hanks (1990, March). Word association norms, mutual information, and lexicogra-
phy. Computational Linguistics 16(1), 22–29.
Correia, J. (2015, November). Syntax deep explorer. Master’s thesis, Instituto Superior Técnico, Univer-
sidade de Lisboa.
Dice, L. R. (1945). Measures of the amount of ecologic association between species. Ecology 26(3),
297–302.
Diniz, C. (2010, October). Um conversor baseado em regras de transformação declarativas. Master’s
thesis, Instituto Superior Técnico, Universidade Técnica de Lisboa.
Dorow, B. and D. Widdows (2003). Discovering corpus-specific word senses. In Proceedings of the 10th
Conf. on European Chapter of the Association for Computational Linguistics, Volume 2, pp. 79–82.
Association for Computational Linguistics.
Dunning, T. (1993, March). Accurate methods for the statistics of surprise and coincidence. Computational Linguistics 19(1), 61–74.
Han, J. and M. Kamber (2000). Data Mining: Concepts and Techniques. San Francisco, CA, USA:
Morgan Kaufmann Publishers Inc.
Hope, D. and B. Keller (2013a). MaxMax: A graph-based soft clustering algorithm applied to Word
Sense Induction. In Proceedings of the 14th Intl. Conf. on Computational Linguistics and Intelligent
Text Processing, Volume 1, pp. 368–381. Springer.
Hope, D. and B. Keller (2013b, June). UoS: A graph-based system for graded word sense induction. In Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2: Proceedings of the 7th Intl. Workshop on Semantic Evaluation (SemEval 2013), pp. 689–694. Association for Computational Linguistics.
Hovy, E., M. Marcus, M. Palmer, L. Ramshaw, and R. Weischedel (2006). OntoNotes: The 90% solution.
In Proceedings of the Human Language Technology Conf. of the NAACL, Companion Volume: Short
Papers, pp. 57–60. Association for Computational Linguistics.
Jurgens, D. and K. Stevens (2010). HERMIT: Flexible clustering for the SemEval-2 WSI task. In Proceed-
ings of the 5th Intl. Workshop on Semantic Evaluation, pp. 359–362. Association for Computational
Linguistics.
Kanerva, P., J. Kristofersson, and A. Holst (2000). Random indexing of text samples for latent semantic
analysis. In Proceedings of the 22nd Annual Conf. of the Cognitive Science Society, Volume 1036.
Klapaftis, I. and S. Manandhar (2008). Word sense induction using graphs of collocations. In Proceed-
ings of the 2008 Conf. on ECAI 2008: 18th European Conf. on Artificial Intelligence, pp. 298–302.
Korkontzelos, I. and S. Manandhar (2010). UoY: Graphs of unambiguous vertices for word sense in-
duction and disambiguation. In Proceedings of the 5th Intl. Workshop on Semantic Evaluation, pp.
355–358. Association for Computational Linguistics.
Liberty, E., R. Sriharsha, and M. Sviridenko (2016). An algorithm for online k-means clustering. In
Proceedings of the 18th Workshop on Algorithm Engineering and Experiments (ALENEX), pp. 81–89.
Mamede, N., J. Baptista, C. Diniz, and V. Cabarrão (2012, April). STRING: A hybrid statistical and
rule-based natural language processing chain for Portuguese. In 10th Intl. Conf. on Computational
Processing of Portuguese (PROPOR). (Demo session).
Manandhar, S. and I. Klapaftis (2009). SemEval-2010 task 14: Evaluation setting for word sense induc-
tion & disambiguation systems. In Proceedings of the Workshop on Semantic Evaluations: Recent
Achievements and Future Directions, pp. 117–122. Association for Computational Linguistics.
Manandhar, S., I. Klapaftis, D. Dligach, and S. Pradhan (2010). SemEval-2010 task 14: Word sense
induction & disambiguation. In Proceedings of the 5th Intl. Workshop on Semantic Evaluation, pp.
63–68. Association for Computational Linguistics.
Manning, C. and H. Schütze (1999). Foundations of Statistical Natural Language Processing. Cam-
bridge, MA: MIT Press.
Marques, J. (2013, November). Anaphora resolution in Portuguese: A hybrid approach. Master’s thesis,
Instituto Superior Técnico, Universidade Técnica de Lisboa.
Miller, G., C. Leacock, R. Tengi, and R. T. Bunker (1993). A semantic concordance. In Proceedings
of the Workshop on Human Language Technology, HLT ’93, Stroudsburg, PA, USA, pp. 303–308.
Association for Computational Linguistics.
Nascimento, M., P. Marrafa, L. Pereira, R. Ribeiro, R. Veloso, and L. Wittmann (1998). LE-PAROLE –
Do corpus à modelização da informação lexical num sistema multifunção. Actas do XIII Encontro da
APL, 115–134.
Navigli, R. (2009). Word sense disambiguation: A survey. ACM Computing Surveys (CSUR) 41(2),
1–69.
Ng, H. (1997). Getting serious about word sense disambiguation. In Proceedings of the ACL SIGLEX
Workshop on Tagging Text with Lexical Semantics: Why, What, and How?, pp. 191–197.
Pantel, P. and D. Lin (2002). Discovering word senses from text. In Proceedings of the 8th ACM SIGKDD
Intl. Conf. on Knowledge Discovery and Data Mining, pp. 613–619. ACM.
Rosenberg, A. and J. Hirschberg (2007, June). V-measure: A conditional entropy-based external cluster
evaluation measure. In Proceedings of the 2007 Joint Conf. on Empirical Methods in Natural Language
Processing and Computational Natural Language Learning (EMNLP-CoNLL), pp. 410–420. Associa-
tion for Computational Linguistics.
Rychlý, P. (2008). A lexicographer-friendly association score. In Proceedings of Recent Advances in Slavonic Natural Language Processing (RASLAN 2008), pp. 6–9.
Salton, G. and M. McGill (1986). Introduction to Modern Information Retrieval. New York, NY, USA:
McGraw-Hill, Inc.
Šíma, J. and S. Schaeffer (2006). On the NP-completeness of some graph cluster measures. In Intl.
Conf. on Current Trends in Theory and Practice of Computer Science, pp. 530–537. Springer.
Van de Cruys, T. (2010). Mining for Meaning: The Extraction of Lexico-semantic Knowledge from Text. Ph.D. thesis, University of Groningen. Groningen Dissertations in Linguistics 82.
Véronis, J. (2004). Hyperlex: lexical cartography for information retrieval. Computer Speech & Lan-
guage 18(3), 223–252.
Vicente, A. (2013, June). Lexman: um segmentador e analisador morfológico com transdutores. Mas-
ter’s thesis, Instituto Superior Técnico, Universidade Técnica de Lisboa.
Widdows, D. and B. Dorow (2002). A graph model for unsupervised lexical acquisition. In Proceedings
of the 19th Intl. Conf. on Computational Linguistics, Volume 1, pp. 1–7. Association for Computational
Linguistics.
Zhao, Y., G. Karypis, and U. Fayyad (2005). Hierarchical clustering algorithms for document datasets.
Data Mining and Knowledge Discovery 10(2), 141–168.