
SENTIDO: A Word Sense Induction Model

for Portuguese

José Pedro de Almeida Arvela

Dissertation to obtain the Master of Science Degree in

Computer Science and Engineering

Supervisor(s): Prof. Nuno João Neves Mamede,


Prof. Jorge Manuel Evangelista Baptista

Examination Committee

Chairperson: Prof. Miguel Nuno Dias Alves Pupo Correia


Supervisor: Prof. Nuno João Neves Mamede
Members of the Committee: Prof. Ricardo Daniel Santos Faro Marques Ribeiro

November 2017
Acknowledgements
I’d like to thank my supervisors for all the help and feedback they provided, as well as L2F and
INESC for the opportunity to work with them on this project. I’d also like to thank my friends and family
for their support throughout the development of this project.

Resumo
Com o crescer exponencial de informação publicada, processar a mesma torna-se cada vez mais necessário. A Desambiguação Semântica de Palavras é um componente fundamental deste processo, mas não é possível sem corpora anotados com o sentido de cada palavra em cada contexto. A Indução Semântica de Palavras tenta resolver este problema ao aglomerar palavras que ocorram juntas frequentemente em corpora não anotados. Este projeto, SENse Through InDuctiOn (SENTIDO), tenta resolver este problema para a língua portuguesa utilizando algoritmos com base em grafos.

Abstract
With the exponential growth of published data, processing this data automatically becomes ever more
necessary. Word Sense Disambiguation (WSD) is a fundamental component of this processing, but it is not
possible without a rich sense inventory. Word Sense Induction (WSI) tries to solve this problem by clustering
words which occur together frequently in non-annotated corpora. This project, SENse Through
InDuctiOn (SENTIDO), attempts to solve this problem for the Portuguese language using a graph-based
algorithm.

Keywords
Word Sense Induction

Word Sense Disambiguation

Semantic Annotation

Sense Annotation

Natural Language Processing

Text Analysis

Indução Semântica de Palavras

Desambiguação Semântica de Palavras

Anotação Semântica

Anotação de Sentidos

Processamento de Língua Natural

Análise de Texto

Contents

Acknowledgements i

Resumo iii

Abstract v

Keywords vii

List of Figures xiii

List of Tables xv

Listings xvii

Acronyms xix

1 Introduction 1

2 State of the Art 3

2.1 Distributional Semantics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

2.2 Statistical and Vector Space Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

2.2.1 Clustering By Committee . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

2.2.2 HERMIT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5

2.3 Co-occurrence Graph Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6

2.3.1 Graph Model for Unsupervised Lexical Acquisition . . . . . . . . . . . . . . . . . . 7

2.3.2 Word Sense Induction Using Graphs of Collocations . . . . . . . . . . . . . . . . . 8

2.3.3 UoY: Graphs of Unambiguous Vertices . . . . . . . . . . . . . . . . . . . . . . . . 9

2.4 Graph Partitioning and Clustering Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . 11

2.4.1 Chinese Whispers. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11

2.4.2 MaxMax. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12

2.5 Other Tools . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13

2.5.1 STRING . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13

2.5.2 Syntax Deep Explorer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14

2.5.3 Association Measures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17

2.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20

2.6.1 WSI Algorithms. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20

2.6.2 Graph Clustering Algorithms. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20

3 Architecture 23

3.1 Co-occurrence Extraction and Storage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23

3.2 Graph Generation and Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24

3.3 Sense Disambiguation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25

4 Implementation 27

4.1 Corpora Pre-Parse . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27

4.1.1 CETEMPúblico . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27

4.1.2 Wikipedia . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28

4.2 Co-occurrence Extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28

4.2.1 Storage Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28

4.2.2 Parsing Xerox Incremental Parser (XIP) files . . . . . . . . . . . . . . . . . . . . . 28

4.2.3 Populating the Database . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29

4.3 Sense Induction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29

4.4 Sense Disambiguation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30

4.5 Avoiding Duplicated Calculations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30

5 Evaluation 31

5.1 Evaluation Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31

5.1.1 Unsupervised Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31

5.1.2 Supervised Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34


5.2 Test Corpus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34

5.3 Parameter choosing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34

5.4 Unsupervised Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34

5.5 Supervised Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35

5.6 Results interpretation and evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35

6 Conclusion 39

Bibliography 41
List of Figures

2.1 An example of a small world network. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7

2.2 STRING Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13

2.3 ER model of (Correia et al. 2015) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16

3.1 The architecture of the system . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23

3.2 The progression of data as it evolves through the architecture . . . . . . . . . . . . . . . . 24

3.3 Example of the co-occurrences extracted from a sample sentence . . . . . . . . . . . . . 24

3.4 The Entity–Relationship model used to store the information in the database . . . . . . . 25

5.1 Image of the induction graph for the word vingar, using the Chinese Whispers (CW) algorithm and the Normalized Pointwise Mutual Information (NPMI) association measure. . . . 36

List of Tables

2.1 A 2-by-2 table showing the dependence of two words . . . . . . . . . . . . . . . . . . . . 18

2.2 Unsupervised evaluation of Word Sense Induction algorithms . . . . . . . . . . . . . . . . 20

2.3 Supervised evaluation of Word Sense Induction algorithms . . . . . . . . . . . . . . . . . 20

2.4 Comparison of Graph Clustering Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . 21

5.1 Results of the unsupervised Word Sense Induction (WSI) evaluation. . . . . . . . . . . . . 35

5.2 Results of the supervised Word Sense Disambiguation (WSD) evaluation. . . . . . . . . . 35

Listings
4.1 SQL Query to extract all co-occurrences in the same context as the target word . . . . 29

Acronyms
AJAX Asynchronous JavaScript and XML

AR Anaphora Resolution

BNC British National Corpus

CBC Clustering By Committee

CSV Comma Separated Values

CW Chinese Whispers

DOM Document Object Model

DSM Distributional Semantic Model

EM Expectation-Maximization

ER Entity–Relationship

GS Golden Standard

HAC Hierarchical Agglomerative Clustering

HMM Hidden Markov Model

IC Integrity Constraint

IR Information Retrieval

JC Jaccard Similarity Coefficient

MARv Morphossyntactic Ambiguity Resolver

MFS Most Frequent Sense

MT Machine Translation

NER Named Entity Recognition

NLP Natural Language Processing

NPMI Normalized Pointwise Mutual Information

PHP PHP: Hypertext Preprocessor

PMI Pointwise Mutual Information

POS Part-of-Speech

RI Random Indexing

RuDriCo2 Rule-Driven Converter

SENTIDO SENse Through InDuctiOn

SGML Standard Generalized Markup Language

STRING Statistical and Rule-Based Natural Language Processing Chain

SVD Singular Value Decomposition

SWSI SemEval 2007 WSI task

WSD Word Sense Disambiguation

WSI Word Sense Induction

XIP Xerox Incremental Parser

XML Extensible Markup Language


1 Introduction
In human languages, it is possible that the same word has different meanings according to the
context in which it is used. For example:

(a) She shot the arrow using her bow.


(b) He tied the bow of the gift.

In these two sentences, the word bow means different things. In sentence (a), it is used as a
hunting tool, while in sentence (b), it is a ribbon. To be able to perform some tasks in Natural Language
Processing (NLP), machines need to be able to differentiate between the different meanings of the same
word in each use.

This problem affects, among other NLP tasks, Machine Translation (MT), Information Retrieval (IR)
and content categorization (Navigli, 2009). Taking an example from MT, the Portuguese word laço can
be translated into several different English words depending on context: it can mean a ribbon, or it can
mean the bond between two people.

(a) Ela atou o presente com um laço.


She tied the gift with a ribbon.
(b) Há um forte laço entre eles.
There is a strong bond between them.

A system which naïvely uses the Most Frequent Sense (MFS) of a given word, as found in a semantically annotated corpus, may improperly translate or categorize this word when it is used outside of its most common context or in a different corpus.

To deal with this problem, it is necessary to identify the specific meaning of a word based on its
context. This process is called Word Sense Disambiguation (WSD) (Navigli, 2009). It requires both
sense inventories and large amounts of sense-tagged corpora to function efficiently. As a result,
under-resourced languages face greater hardships in achieving satisfactory results (Ng, 1997).

A solution to the lack of resources is to automatically identify the meaning of words in their given
context, without the requirement of manually annotated data. This is called Word Sense Induction (WSI)
(Agirre and Soroa, 2007). Most WSI models rely on word co-occurrence to determine the main senses
a word may have. Syntactic dependencies between words are seldom used.

The goal of this dissertation is to investigate the feasibility of creating a WSI model for the Portuguese
language, capable of using syntactic dependency information to determine the main senses a word may
have. Furthermore, this dissertation evaluates the quality of this new model against the MFS baseline.

Additionally, this dissertation presents the results of this investigation in a project called SENse
Through InDuctiOn (SENTIDO). SENTIDO is a WSI and WSD model which infers the possible senses
of a word from sense-untagged corpora based on its syntactic relations with its context; then, given a word
and its context, the system disambiguates between the previously inferred senses.

This dissertation is organized as follows: In Chapter 2, existing WSI implementations and models
are described, as well as the theoretical foundations and additional tools that support and aid them.
In Chapter 3, the architecture of the model is outlined, as well as the various stages that compose it. In
Chapter 4, the implementation details, such as the corpora used, the existing tools on which the model
is built, and the adaptations made to those tools, are described. Chapter 5 describes how the
algorithm is tested, the theoretical foundations for the soundness of the methodology, the test
corpus used, and the chosen parameters; finally, it analyses the results of the evaluation.
Chapter 6 provides a synopsis of the findings and describes what can be improved in the future.

2 State of the Art
This chapter describes the state of the art: the overall approaches and their principles, as well as
the various algorithms used to exploit those principles.

2.1 Distributional Semantics

Distributional Semantic Models (DSMs) discover word senses from text. They are based on the
Distributional Hypothesis, captured by the famous quotation ‘You shall know a word by the company
it keeps’ (Firth, 1957). That is, words are semantically similar if they appear in similar documents,
similar context windows or similar syntactic contexts (Van, 2010).

2.2 Statistical and Vector Space Models

These methods implement DSMs based on statistically or geometrically oriented probability distribution
models (Van, 2010). Some of these methods are presented below.

2.2.1 Clustering By Committee

An implementation of a clustering approach is Clustering By Committee (CBC) (Pantel, 2003). The


algorithm represents each word as a feature vector, in which each feature corresponds to a context
where the word occurs. The value of the feature is the Pointwise Mutual Information (PMI) between the
feature and the word (Bouma, 2009) (PMI is later described in Subsection 2.5.3). The similarity between
two words is computed using the cosine coefficient (Salton and McGill, 1986) of their mutual information
vectors, as in Equation 2.1.

$$\mathrm{sim}(e_i, e_j) = \frac{\sum_f mi_{e_i,f} \times mi_{e_j,f}}{\sqrt{\sum_f mi_{e_i,f}^2 \times \sum_f mi_{e_j,f}^2}} \tag{2.1}$$
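The cosine coefficient of Equation 2.1 can be sketched as follows, assuming sparse feature vectors stored as dictionaries mapping features to their PMI values (the dictionary representation and function name are illustrative, not from the original CBC implementation):

```python
import math

def cosine(mi_a, mi_b):
    """Cosine coefficient between two sparse PMI feature vectors (Equation 2.1).

    mi_a, mi_b: dicts mapping a feature to its pointwise mutual information value.
    """
    # Numerator: sum over the features shared by both vectors.
    dot = sum(val * mi_b[f] for f, val in mi_a.items() if f in mi_b)
    # Denominator: product of the two vectors' Euclidean norms.
    norm_a = math.sqrt(sum(val * val for val in mi_a.values()))
    norm_b = math.sqrt(sum(val * val for val in mi_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)
```

Identical vectors yield a similarity of 1, while vectors with no shared features yield 0.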

The CBC algorithm consists of three phases. In phase I, the top-k similar elements of each element
e are computed. In phase II, a collection of tight clusters, called committees, is constructed, and the
residue elements not covered by any committee are identified. First, the top-k similar elements are
clustered using average-link clustering (Han and Kamber, 2000). A committee covers an element if the
element's similarity to the centroid of the committee exceeds a given threshold. The algorithm then
recursively attempts to find more committees among the residue elements. The details are presented
in Algorithm 1.

Algorithm 1 Phase II of CBC (Pantel, 2003)
Input: A list of elements E to be clustered, a similarity database S from Phase I, thresholds θ1 and θ2 .

Step 1: For each element e ∈ E

Cluster the top elements of e from S using average-link clustering (Han and Kamber, 2000).

For each discovered cluster c, compute the score |c| × avgsim(c), where |c| is the number of
elements in c and avgsim(c) is the average pairwise similarity between elements in c.

Store the highest-scoring cluster in a list L.

Step 2: Sort the clusters in L in descending order of their scores.

Step 3: Let C be a list of committees, initially empty.

For each cluster c ∈ L in sorted order

Compute the centroid of c by averaging the feature vectors of its elements and computing the
mutual information scores in the same way as done for individual elements.

If the similarity of c to the centroid of each committee added before to C is below threshold θ1 , add
c to C.

Step 4: If C is empty, finish and return C.

Step 5: For each element e ∈ E

If the similarity of e to every committee in C is below threshold θ2 , add e to the list of residues R.

Step 6: If R is empty, finish and return C, otherwise return the union of C and the output of a recursive
call to Phase II using the same input, except replacing E with R.

Output: A list of committees.

In phase III each element e is assigned to its most similar clusters according to Algorithm 2. The
similarity between a cluster and an element is computed using the centroid of the committee members.
Once an element e is assigned to a cluster c, the intersecting features between both are removed from
e to allow CBC to discover the less frequent senses of a word and avoid duplicate senses.

Algorithm 2 Phase III of CBC

C ← empty list of clusters
function PHASEIII(e)
    S ← the top-200 clusters most similar to e
    while S ≠ ∅ do
        c ← the cluster in S most similar to e
        if similarity(e, c) < σ then
            break
        end if
        if c is not similar to any cluster in C then
            Assign e to c
            Remove from e its features that overlap with the features of c
            C ← C ∪ {c}
        end if
        S ← S \ {c}
    end while
    return C
end function

To evaluate the system, its output was compared with WordNet, with the frequency counts for the
nodes, called synsets, obtained from the SemCor corpus (Miller et al., 1993).

The corpus was obtained by processing 144 million words of newspaper text from the TREC
Collection (1988 AP Newswire, 1989-90 LA Times, and 1991 San Jose Mercury)1. The test set was
constructed by intersecting the words in WordNet with the nouns in the corpus, resulting in a test set of
13,403 words with an average of 740.8 features per word. CBC obtained a precision of 60.8%,
a recall of 50.8% and an F-score of 55.4%. These evaluation measures are later described in
Section 5.1.1.

2.2.2 HERMIT

The model from Jurgens and Stevens (2010) performs WSI by modelling individual contexts in a
high-dimensional word space. Word senses are induced by finding similar contexts, which are grouped
using a hybrid clustering method.

Each context of a word is approximated using the Random Indexing (RI) word space model (Kanerva et al., 2000), in which the occurrence of a word is represented with an index vector instead of a set of dimensions. RI can be described as a two-step operation (Sahlgren, 2005):

1. Each context in the data is assigned a unique and randomly generated representation, called
an index vector, which is sparse and high-dimensional. Its dimensionality (d) is on the order of
thousands, and it consists of a small number of randomly distributed values of ±1, with the rest of
the elements set to 0;
2. Then, context vectors are produced by scanning through the text. Each time a word occurs in a
context, the context's d-dimensional index vector is added to the context vector for that word. Words
are represented by d-dimensional context vectors which are the sum of the words' contexts.

1 http://trec.nist.gov/data/test_coll.html, last accessed on 28th September 2017.
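The two RI steps above can be sketched as follows; the dimensionality and sparsity defaults are illustrative, not the values used by HERMIT:

```python
import random
from collections import defaultdict

def index_vector(d=2000, nonzero=8, rng=None):
    """Step 1: a sparse random index vector — a few +/-1 entries out of d
    dimensions, stored as {dimension: value}."""
    rng = rng or random.Random()
    dims = rng.sample(range(d), nonzero)
    return {dim: rng.choice((-1, 1)) for dim in dims}

def build_context_vectors(contexts, d=2000, seed=0):
    """Step 2: accumulate each context's index vector into the context
    vector of every word occurring in it.

    contexts: iterable of word lists, one list per context.
    Returns {word: sparse d-dimensional context vector (as a dict)}.
    """
    rng = random.Random(seed)
    word_vectors = defaultdict(lambda: defaultdict(int))
    for words in contexts:
        ctx_vec = index_vector(d, rng=rng)  # one index vector per context
        for w in words:
            for dim, val in ctx_vec.items():
                word_vectors[w][dim] += val
    return {w: dict(v) for w, v in word_vectors.items()}
```

A word seen in a single context ends up with that context's index vector; words seen in many contexts accumulate a dense-enough signature while the matrix stays w×d rather than w×c.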

This allows for the creation of a co-occurrence matrix Fw×d which is an approximation of a standard
co-occurrence matrix Fw×c, but in which d ≪ c (c being the number of contexts). As a result,
HERMIT transforms the original co-occurrence counts into a smaller and denser representation without the
computational overhead of other dimensionality reduction techniques such as Singular Value
Decomposition (SVD) (Jurgens and Stevens, 2010).

The identification of related contexts is made through clustering, which groups similar context
vectors into clusters representing the distinct senses of a word. A hybrid of online
k-means clustering (Liberty et al., 2016) and Hierarchical Agglomerative Clustering (HAC) (Zepeda-
Mendoza and Resendis-Antonio, 2013) with a threshold is used. The threshold allows the number of
clusters to be determined by data similarity instead of being manually specified.

The context vectors are clustered using k-means clustering, which assigns a context to the most
similar cluster centroid. If the nearest centroid's similarity is below the cluster threshold and
there are not yet k clusters, the context forms a new cluster. The similarity between context vectors is
defined as the cosine similarity (see above).

Clusters are then repeatedly merged using HAC with the average-link criterion; that is, cluster similarity
is the mean of the pairwise cosine similarities between the data points of the two clusters. When the
two most similar clusters have a similarity below the threshold, merging stops. The resulting clusters
should then represent distinct word senses.
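A compact sketch of this hybrid scheme, assuming sparse context vectors stored as dictionaries; the parameter defaults follow the values reported below for SemEval-2010, but everything else (names, data layout) is illustrative:

```python
def cos(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    num = sum(val * v.get(f, 0.0) for f, val in u.items())
    den = (sum(x * x for x in u.values()) * sum(x * x for x in v.values())) ** 0.5
    return num / den if den else 0.0

def mean_vector(vectors):
    """Centroid of a list of sparse vectors."""
    out = {}
    for vec in vectors:
        for f, val in vec.items():
            out[f] = out.get(f, 0.0) + val
    return {f: val / len(vectors) for f, val in out.items()}

def hybrid_cluster(contexts, k=15, threshold=0.15):
    """Online threshold k-means followed by average-link HAC merging."""
    clusters = []  # each cluster is a list of context vectors
    for ctx in contexts:
        sims = [(cos(ctx, mean_vector(c)), i) for i, c in enumerate(clusters)]
        best_sim, best_i = max(sims, default=(-1.0, -1))
        if best_sim >= threshold or len(clusters) >= k:
            clusters[best_i].append(ctx)   # assign to the nearest centroid
        else:
            clusters.append([ctx])         # below threshold, room for a new cluster

    def avg_link(a, b):
        # mean pairwise cosine similarity between the members of two clusters
        return sum(cos(x, y) for x in a for y in b) / (len(a) * len(b))

    while len(clusters) > 1:
        best_sim, (i, j) = max(
            (avg_link(clusters[i], clusters[j]), (i, j))
            for i in range(len(clusters)) for j in range(i + 1, len(clusters)))
        if best_sim < threshold:
            break                          # the two closest clusters are too far apart
        clusters[i].extend(clusters.pop(j))
    return clusters
```

The same threshold governs both stages: it decides when a context opens a new cluster online, and when HAC stops merging.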

The model was submitted for SemEval-2010 Task 14 (Manandhar and Klapaftis, 2009) and evaluated
with an unsupervised method, which compares the found clusters with the gold data for the word
senses, and a supervised method, which evaluates the results of WSD using the induced clusters. The
first was measured according to the F-score and the V-measure, the harmonic mean of homogeneity
and completeness (later described in Subsection 5.1.1); the second was measured using supervised
recall. Using the provided test corpus, parameters were tuned to a context window size of ±1, a
clustering threshold of 0.15 and a maximum of 15 clusters per word.

The final results of the SemEval-2010 Task 14 showed that HERMIT achieved a V-measure of 16.7%
and an F-score of 24.4% on nouns in the unsupervised evaluation, and a supervised recall of 53.6%.

2.3 Co-occurrence Graph Models

In graph-based models, the meanings of a word are represented by a weighted and undirected co-
occurrence graph. The nodes (or vertices) of the graph are the words which occur in the corpus and the

transitions (or edges) are co-occurrences, with the weight of the edge describing the number of times
that co-occurrence exists. Two words are said to co-occur if they both occur within the same context.

These models are based on the idea that co-occurrence graphs have the properties of small world
networks (Véronis, 2004); that is, most nodes are not neighbours of one another, but most nodes can be
reached from any other by a small number of steps, as shown in Figure 2.1. These properties allow
searching for highly interconnected bundles of co-occurrences, that is, high-density components, which
correspond to the senses being searched for.

Figure 2.1: An example of a small world network.


2.3.1 Graph Model for Unsupervised Lexical Acquisition

A graph is built from a Part-of-Speech (POS)-tagged corpus in (Widdows and Dorow, 2002), using
words as nodes and the grammatical co-occurrence relationships between pairs of words as edges. The
relationships extracted are the co-occurrences Noun-Verb, Verb-Noun, Adjective-Noun and Noun-Noun. To
generate the edges, the top-n neighbours of each word are selected and turned into edges.

An incremental algorithm is then used to extract categories based on a given word, using affinity
scores which give more importance to words that are linked to the existing neighbours. The algorithm
is tailored to avoid ‘infections’ from spurious co-occurrences, preventing spurious links from being
mistaken for genuine semantic similarity. The process of selecting the most similar node and adding it
to a set of nodes is described in Algorithm 3, where A is a set of nodes and N(a) is the set of nodes
linked to a node a.

Algorithm 3 Select the most similar node
N(A) ← ∪_{a∈A} N(a)    ▷ A is a set of nodes, N(A) the neighbours of A
b ← argmax_{u ∈ N(A)\A} |N(u) ∩ N(A)| / |N(u)|
A ← A ∪ {b}
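Algorithm 3 can be sketched as follows, assuming the graph is stored as an adjacency map from each node to its set of neighbours (function and variable names are illustrative):

```python
def grow_category(graph, seed, n=20):
    """Grow a semantic category from a seed word (Algorithm 3).

    graph: {node: set of neighbouring nodes}.
    At each step, add the candidate u maximising |N(u) ∩ N(A)| / |N(u)|,
    i.e. the node whose own neighbours lie mostly inside A's neighbourhood,
    which penalises nodes linked in via spurious co-occurrences.
    """
    A = {seed}
    while len(A) < n:
        N_A = set().union(*(graph[a] for a in A))  # neighbourhood of the set A
        candidates = N_A - A
        if not candidates:
            break
        best = max(candidates,
                   key=lambda u: len(graph[u] & N_A) / max(len(graph[u]), 1))
        A.add(best)
    return A
```

Starting from a single seed word, repeated application yields the n words most closely related to it.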

The model was built using the British National Corpus (BNC)2 and evaluated against classes in
WordNet. A WordNet class was considered to be the collection of synsets subsumed by a parent synset;
for example, the class musical instruments was the collection of all synsets subsumed by the WordNet
musical instruments synset. For a given seed word, Algorithm 3 was used to find the n nodes most
closely related to it. Ten classes were chosen beforehand, and for each class 20 words were retrieved
using a single seed word from the class in question.

The results show that of a total of 200 retrieved words, only 36 were incorrect, giving an accuracy
of 82%.

2.3.2 Word Sense Induction Using Graphs of Collocations

A graph of collocations is used in (Klapaftis and Manandhar, 2008) to generate a taxonomy of senses
which takes word polysemy into consideration. In this graph, each node corresponds to an occurrence of
two words in the same context window (which in this model is a paragraph), and two nodes are connected
if the two collocations occur in the same context window.

The base corpus, bc, consists of paragraphs containing the target word, tw. Besides bc, there is
also a large reference corpus, rc. The project focuses on inducing the senses of tw given bc as the only
input.

At first, the target word tw is removed from the paragraphs of bc, and all paragraphs from bc and rc
are POS tagged. From these, only nouns are kept, and these are lemmatised.

Log-likelihood (G2 ) (Dunning, 1993) is then used to filter common nouns which are not semantically
related to tw, by checking if a given word wi has a similar distribution in bc and rc. If that is true, G2 will
have a small value and wi should be removed from bc.

The noun frequencies of bc are stored in a list lbc and the noun frequencies of rc are stored in a list
lrc. For each word wi ∈ lbc, a table of observed counts OT is created from lbc and lrc, along with a table
of expected values ET under the model of independence. G2 is then calculated using the equations in
Subsection 2.5.3.

The lbc list is then filtered by removing words with a smaller relative frequency in lbc than in lrc.
The resulting lbc list is then sorted by the G2 values, and words with a G2 smaller than a pre-specified
threshold p1 are removed from bc. By the end of this process, each paragraph in bc has been converted
to a list of nouns assumed to be topically related to tw.
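The G2 filtering step can be sketched with the standard log-likelihood formula G2 = 2 Σ O·ln(O/E) over the 2-by-2 table of observed and expected counts for a word in the two corpora (the function name and argument layout are illustrative):

```python
import math

def g2(freq_bc, size_bc, freq_rc, size_rc):
    """Log-likelihood ratio G2 comparing a word's distribution in the base
    corpus (bc) and the reference corpus (rc) from a 2x2 contingency table.

    freq_*: occurrences of the word; size_*: total tokens in each corpus.
    """
    observed = [freq_bc, freq_rc,
                size_bc - freq_bc, size_rc - freq_rc]
    total = size_bc + size_rc
    word_total = freq_bc + freq_rc
    # Expected counts under independence of word and corpus.
    expected = [word_total * size_bc / total,
                word_total * size_rc / total,
                (total - word_total) * size_bc / total,
                (total - word_total) * size_rc / total]
    return 2.0 * sum(o * math.log(o / e)
                     for o, e in zip(observed, expected) if o > 0)
```

A word distributed identically in both corpora yields G2 ≈ 0 and is filtered out; a word over-represented in bc yields a large G2 and is kept as topically related to tw.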

2 http://www.natcorp.ox.ac.uk/, last accessed on 28th September 2017.

With the base corpus now processed, collocations of two nouns are detected by generating all
n(n−1)/2 combinations for each n-noun paragraph. Conditional probabilities (Equation 2.2) are then
used to generate the weights of each collocation. Collocations with a frequency and weight higher than
pre-specified thresholds are then used to generate the nodes of the graph G.

$$p(i \mid j) = \frac{f_{ij}}{f_j} \tag{2.2}$$

The constructed graph G is sparse. A smoothing technique is applied to discover new edges between
vertices and to assign a weight to all of the graph's edges. For each vertex i, a vertex vector VCi is
assigned, containing the vertices which share an edge with i in G. The similarity between each VCi and
VCj is then calculated using the Jaccard Similarity Coefficient (JC) (Equation 2.3).

$$JC(VC_i, VC_j) = \frac{|VC_i \cap VC_j|}{|VC_i \cup VC_j|} \tag{2.3}$$
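Equations 2.2 and 2.3 can be sketched as follows, with collocation frequencies as plain counts and vertex vectors as sets (names are illustrative):

```python
def collocation_weight(f_ij, f_j):
    """Conditional probability p(i|j) = f_ij / f_j (Equation 2.2),
    used as the weight of a candidate collocation."""
    return f_ij / f_j

def jaccard(vc_i, vc_j):
    """Jaccard Similarity Coefficient between two vertex vectors (Equation 2.3).

    vc_i, vc_j: sets of vertices sharing an edge with vertices i and j in G.
    """
    union = vc_i | vc_j
    return len(vc_i & vc_j) / len(union) if union else 0.0
```

The Jaccard score over neighbourhoods is what lets the smoothing step connect two vertices that never co-occur directly but share most of their neighbours.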

The final graph, G′, is then clustered using the Chinese Whispers (CW) algorithm, further described
in Section 2.4.1. This algorithm was chosen because it does not require input parameters and runs in
time linear in the number of edges, although it is not guaranteed to converge.

WSD is finally performed by assigning one of the induced clusters to each instance of the target word:
for each paragraph of bc containing the target word, each induced cluster is given a score based on the
number of its collocations occurring in that paragraph.

To evaluate the model, the framework of the SemEval 2007 WSI task (SWSI) (Agirre and Soroa,
2007) was used. The test corpus consists of texts from the Wall Street Journal corpus, hand-tagged
with OntoNotes senses (Hovy et al., 2006).

The model was evaluated under unsupervised evaluation and supervised evaluation (see above).
In the unsupervised evaluation, the model achieved 88.6% purity, 31% entropy (these measures are
described below, in Section 5.1.1) and an F-Score of 78%.

2.3.3 UoY: Graphs of Unambiguous Vertices

The model developed by Korkontzelos and Manandhar (2010) is a relaxed version of the model in
(Klapaftis and Manandhar, 2008), described in Section 2.3.2, in which a node is generated from a single
word if that word is considered unambiguous; otherwise, a node is generated from a set of two
words.

The corpus is first preprocessed with the aim of capturing words contextually related to the target.
Sentences or paragraphs (snippets) which contain the target word are lemmatised and POS tagged
using the GENIA tagger. Only nouns are kept, and words which occur in a stoplist are filtered out.
Nouns which are infrequent in the reference corpus are removed, and the log-likelihood ratio (G2) is
used to compare the distribution of each noun in the target corpus to its distribution in the reference
corpus. If a noun's G2 is lower than a specified threshold, or if the noun has a higher relative frequency
in the reference corpus than in the target corpus, then that noun is removed. At this stage, each snippet
is a list of lemmatised nouns contextually related to the target word.

The graph is constructed by first representing all nouns in the list as graph vertices. Each noun
within a snippet is combined with every other, generating n(n−1)/2 pairs. G2 is applied once again to
the pairs.

To filter out pairs that refer to the same sense, a vector with the snippet IDs in which they occur is
generated for each pair and each noun. A pair is discarded if its vector is similar to both vectors of its
component nouns, using for that purpose the Dice coefficient (Dice, 1945), which is later described in
Subsection 2.5.3.

Edges are drawn based on the co-occurrence of the corresponding vertices in snippets. The weight
of an edge is the maximum of the conditional probabilities of its vertices, calculated according to
Equation (2.4), and low-weight edges are filtered out.

 
(2.4)   w_{a,b} = \frac{1}{2} \left( \frac{f_{a,b}}{f_a} + \frac{f_{a,b}}{f_b} \right)
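As an illustration, the edge weight of Equation (2.4) can be computed directly from the pair and single-word frequencies; the function below is a sketch (the name and signature are illustrative, not from the original system).

```python
def edge_weight(f_ab, f_a, f_b):
    """Edge weight from Equation (2.4): the mean of the two
    conditional probabilities f_ab/f_a and f_ab/f_b."""
    return 0.5 * (f_ab / f_a + f_ab / f_b)

# Example: nouns a and b occur 20 and 10 times, and 5 times together.
w = edge_weight(5, 20, 10)  # 0.5 * (0.25 + 0.5) = 0.375
```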

The graph is then clustered using CW, described in Section 2.4.1. To reduce the number of clusters, a post-processing stage is applied in which, for each cluster l_i, the set S_i of all snippets containing at least one vertex of l_i is generated. Any two clusters l_a and l_b with S_a ⊆ S_b or S_a ⊇ S_b are merged.
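The snippet-inclusion merge can be sketched as follows; the data structures (a cluster map and a vertex-to-snippets map) are hypothetical stand-ins for the paper's representation.

```python
def merge_by_snippet_inclusion(clusters, snippets_of):
    """Merge clusters l_a, l_b whenever the set of snippets covered by
    one is a subset (or superset) of the other's.
    `clusters` maps cluster id -> set of vertices; `snippets_of` maps a
    vertex -> set of snippet ids (both illustrative structures)."""
    merged = True
    while merged:
        merged = False
        ids = list(clusters)
        for a in ids:
            for b in ids:
                if a == b or a not in clusters or b not in clusters:
                    continue
                # S_a and S_b: snippets containing at least one vertex.
                sa = set().union(*(snippets_of[v] for v in clusters[a]))
                sb = set().union(*(snippets_of[v] for v in clusters[b]))
                if sa <= sb or sa >= sb:
                    clusters[a] |= clusters.pop(b)
                    merged = True
    return clusters
```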

The model was submitted to SemEval-2010 Task 14 (Manandhar and Klapaftis, 2009) and evaluated using an unsupervised and a supervised method. The first was measured according to the V-measure and F-score, while the second was measured using recall. Parameters were tuned by choosing the maximum supervised recall, resulting in frequency thresholds of 10, a G² threshold of 10, collocation weights of 0.4 and a similarity threshold for pair-of-nouns vertices of 0.8.

Results showed that the model achieved a V-measure of 20.6% and an F-score of 38.2% in the unsupervised evaluation with nouns, and a recall of 59.4% in the supervised evaluation.

2.4 Graph Partitioning and Clustering Algorithms

The task of finding clusters which are optimal with respect to fitness measures is NP-complete (Šíma
and Schaeffer, 2006). The following are two current algorithms, time-linear to the number of edges,
which try to find solutions through approximation.

2.4.1 Chinese Whispers

CW is a randomised graph-clustering algorithm which is time-linear in the number of edges, developed by Biemann (2006). It is a basic, yet effective, algorithm to partition the nodes of weighted, undirected graphs, and it is said to perform well on small-world graphs.

CW is parameter-free, as the number of partitions emerges naturally during the process. Formally, however, CW does not converge, as a node can become tied and be randomly assigned a different class at each iteration, without ever stabilising; nor is it deterministic, due to the random orders and assignments in the algorithm.

In CW, a weighted graph G = (V, E) has nodes v_i ∈ V and weighted edges (v_i, v_j, w_ij) ∈ E with weight w_ij. If (v_i, v_j, w_ij) ∈ E implies (v_j, v_i, w_ij) ∈ E, then G is undirected. If all weights are 1, G is unweighted. The degree of a node is the number of edges it takes part in. The neighbourhood of a node v is the set of all nodes v′ such that (v, v′, w) ∈ E or (v′, v, w) ∈ E.

CW works as outlined in Algorithm 4. First, all nodes get different classes. For a small number of iterations, each node inherits the strongest class in its neighbourhood: the class whose sum of edge weights to the current node is maximal. If multiple classes are equally the strongest, one is chosen randomly. Classes are updated immediately, so a node can obtain classes introduced in that same iteration.

Regions of the same class stabilise during the iterations and grow until they reach the border of another class.

Algorithm 4 The Chinese Whispers algorithm

function CHINESEWHISPERS(V, E)
    for all v_i ∈ V do
        class(v_i) ← i
    end for
    while changes do
        for all v ∈ V, in random order do
            class(v) ← max_rank(class(neighbourhood(v)))
        end for
    end while
end function

Apart from ties, the class of a node usually does not change more than a few times. The number of iterations needed depends on the largest distance between two nodes in the graph.
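A minimal Python sketch of CW, assuming a simple edge-list representation (tie-breaking details and the iteration count vary across implementations):

```python
import random
from collections import defaultdict

def chinese_whispers(nodes, edges, iterations=20, seed=0):
    """Sketch of CW on an undirected weighted graph.
    `edges` is a list of (u, v, w) tuples; classes are updated in place,
    so a node can adopt a class introduced in the same iteration."""
    rng = random.Random(seed)
    neigh = defaultdict(list)
    for u, v, w in edges:
        neigh[u].append((v, w))
        neigh[v].append((u, w))
    cls = {v: i for i, v in enumerate(nodes)}  # each node starts alone
    for _ in range(iterations):
        order = list(nodes)
        rng.shuffle(order)
        for v in order:
            if not neigh[v]:
                continue
            score = defaultdict(float)
            for u, w in neigh[v]:
                score[cls[u]] += w  # strongest class = max summed edge weight
            best = max(score.values())
            cls[v] = rng.choice([c for c, s in score.items() if s == best])
    return cls

# Two triangles joined by one weak edge: the weak edge never outweighs
# the in-triangle neighbours, so the triangles keep separate classes.
nodes = ['a', 'b', 'c', 'd', 'e', 'f']
edges = [('a', 'b', 1), ('b', 'c', 1), ('a', 'c', 1),
         ('d', 'e', 1), ('e', 'f', 1), ('d', 'f', 1), ('c', 'd', 0.1)]
labels = chinese_whispers(nodes, edges)
```

Because class updates are immediate and ties are broken at random, two runs with different seeds may label the clusters differently, which is exactly why CW is neither convergent nor deterministic in the formal sense.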

CW was evaluated on WSI tasks using an approach similar to the one in (Dorow and Widdows, 2003), replacing the Markov Clustering algorithm with CW. The evaluation method in (Bordag, 2006) was used: the neighbourhoods of two words are merged, and the ability of the algorithm to separate the merged graph is evaluated. The evaluation measures used included retrieval precision (rP), the similarity of the found sense with the gold-standard sense using the overlap measure, and retrieval recall (rR), the proportion of words assigned correctly to the gold-standard sense.

Results for nouns showed that CW had a retrieval precision of 94.8% and a retrieval recall of 71.3%, which suggests performance similar to specialised graph-clustering algorithms for WSI given the same input.

2.4.2 MaxMax

MaxMax is a soft-clustering algorithm applicable to edge-weighted graphs (Hope and Keller, 2013a). It is parameter-free, runs in time linear in the number of edges, and is deterministic. Test results show it to return scores comparable with existing state-of-the-art systems.

In MaxMax, a notion of maximal affinity is used, in which the affinity between vertices u and v is the edge weight w(u, v). A vertex u has maximal affinity to a vertex v if w(u, v) is maximal among all edges incident on u; v is then said to be a maximal vertex of u.

MaxMax consists of two stages, as described in Algorithm 5. First, the weighted graph G is transformed into an unweighted, directed graph G′. Maximal affinity relationships between vertices are used to determine edge direction in G′: if, in G, a vertex u has two maximal vertices v and w, then in G′ u will have two incoming directed edges, one from v and one from w.

Then, clusters are identified by finding the root vertices of subgraphs of G′, marking all descendants of a vertex as ¬root. In the directed graph G′, a vertex v is a descendant of u if there is a directed path from u to v. At the end of this stage, the vertices still marked as root uniquely identify the clusters.

Algorithm 5 The MaxMax algorithm

function MAXMAX(G = (V, E))
    G′ = (V, E′) ← a directed graph where (v, u) ∈ E′ iff (u, v) ∈ E and v is a maximal vertex of u
    mark all vertices of G′ initially as root
    for all v ∈ V do
        if v is root then
            for all u ∈ descendant(v) \ {v} do
                mark u as ¬root
            end for
        end if
    end for
end function
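A sketch of the two stages in Python, assuming an undirected edge list; extracting the clusters from the surviving root vertices is made explicit here for illustration.

```python
from collections import defaultdict

def maxmax(nodes, edges):
    """Sketch of MaxMax (Hope and Keller, 2013a). `edges` holds
    (u, v, w) for an undirected weighted graph; returns the clusters
    rooted at vertices still marked root after stage two."""
    weight = defaultdict(dict)
    for u, v, w in edges:
        weight[u][v] = w
        weight[v][u] = w
    # Stage 1: build directed graph G' with an edge v -> u whenever v is
    # a maximal vertex of u (w(u, v) maximal among u's edges).
    succ = defaultdict(set)
    for u in nodes:
        if weight[u]:
            best = max(weight[u].values())
            for v, w in weight[u].items():
                if w == best:
                    succ[v].add(u)
    # Stage 2: mark every descendant of a surviving root as not-root.
    root = {v: True for v in nodes}
    for v in nodes:
        if root[v]:
            stack, seen = list(succ[v]), set()
            while stack:
                u = stack.pop()
                if u == v or u in seen:
                    continue
                seen.add(u)
                root[u] = False
                stack.extend(succ[u])
    # Each remaining root and its descendants form one cluster.
    clusters = []
    for v in nodes:
        if root[v]:
            members, stack = {v}, list(succ[v])
            while stack:
                u = stack.pop()
                if u not in members:
                    members.add(u)
                    stack.extend(succ[u])
            clusters.append(members)
    return clusters

# Two mutually-maximal pairs yield two clusters: {a, b} and {c, d}.
clusters = maxmax(['a', 'b', 'c', 'd'],
                  [('a', 'b', 3), ('b', 'c', 1), ('c', 'd', 3)])
```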

MaxMax was evaluated in the context of the SemEval 2010 WSI Task (Manandhar et al., 2010), using an adaptation of the Shared Nearest Neighbours algorithm and then using MaxMax to identify sense clusters in the generated target-word graph. MaxMax has the best-scoring V-measure (32.8%) among the systems evaluated in (Manandhar et al., 2010), and the worst F-score (13.2%) among those systems. The authors claim that their system is overly penalised by the F-score, which is known to be biased towards clustering solutions returning large clusters and to punish small clusters disproportionately.

2.5 Other Tools

In addition to the existing work in WSD and WSI, it is relevant to mention other works which make use
of word co-occurrences in text.

2.5.1 STRING

Statistical and Rule-Based Natural Language Processing Chain (STRING) (Mamede et al., 2012) is an NLP system for the Portuguese language, capable of performing all basic NLP tasks, including Named Entity Recognition (NER), IR and Anaphora Resolution (AR), among other tasks. STRING is composed of several stages, as described in Figure 2.2.

Figure 2.2: STRING Architecture
[Figure: the processing chain LexMan → RuDriCo2 → MARv → XIP, followed by post-syntactic modules such as anaphora resolution, time normalisation, slot filling and event ordering.]

Preprocessing

In the first stage, the text is passed through a tokenizer, which is also responsible for recognising some special tokens, such as numbers or punctuation. The tokens are then passed through a POS tagger, LexMan (Vicente, 2013), which uses finite-state transducers to label tokens with their appropriate POS tags. Finally, the text is split into sentences, mainly based on punctuation, taking into account abbreviations, acronyms, and the use of ellipsis.

POS Disambiguation

This stage comprises two steps, which are performed by two distinct modules:

1. Rule-Driven Converter (RuDriCo2), which performs rule-driven morphosyntactic disambiguation (Diniz, 2010);
2. Morphosyntactic Ambiguity Resolver (MARv), which performs statistical disambiguation (Ribeiro, 2003).

RuDriCo2 is responsible for refining the initial segmentation done by LexMan. It uses declarative pattern-matching rules to modify the original segmentation or disambiguate some of the POS tags. RuDriCo2 also performs the expansion of contracted words.

MARv is responsible for resolving morphosyntactic ambiguity. It analyses the labels generated for each word and then chooses the most likely tag given its immediate context, using Hidden Markov Models (HMMs). MARv uses second-order models, which codify contextual information concerning entities, and unigrams, which codify lexical information.

Syntactic Analysis

Syntactic analysis is done through the use of Xerox Incremental Parser (XIP) (Aït-Mokhtar et al., 2002),
which also allows for adding lexical, syntactic and semantic information to the output of the previous
modules. XIP is responsible for several tasks:

(i) It adds information to the existing tokens through the use of lexicons. XIP includes a pre-existing lexicon that can be extended or modified;
(ii) It allows the definition of local grammars, using pattern-matching rules, to group lexical units into a single entity;
(iii) It also performs shallow parsing, grouping elements into chunks (noun phrase – NP, prepositional phrase – PP, among others) using linear precedence and sequence rules;
(iv) Finally, it performs deep parsing to extract the dependencies between the different chunks (subject, direct object, modifier, etc.).

Post-Syntactic Analysis

At this stage, additional tasks are executed, which extract further information from the text. One is the AR task (Marques, 2013), which performs pronominal anaphora identification in the text and then uses the Expectation-Maximization (EM) algorithm to obtain the probability of each word being the antecedent of a given anaphora candidate. Other tasks performed include time normalisation (Maurício, 2011) and event ordering (Cabrita, 2014).

2.5.2 Syntax Deep Explorer

In Correia (2015), a tool was developed, making use of STRING, to allow the exploration of co-occurrence data obtained from Portuguese texts. The presented solution comprises a tool to extract co-occurrences and calculate their association measures, as well as a web-based interface to display these co-occurrences in an intuitive fashion.

The developed solution takes advantage of the rich lexical resources available and the syntactic
and semantic analysis of STRING to provide information about the patterns of co-occurrences found in
the corpora evaluated.

Dependencies and Properties

For the project in question, each co-occurrence is based on a syntactic dependency extracted from XIP. These dependencies comprise two words, in which the first is the modified element and the second is the modifier, information about the dependency itself, and a set of properties, if any exist. The project only considers as relevant for its goals a subset of the possible properties, which indicate whether the modifier is before or after the modified word, or whether the modifier is a focus adverb.

Architecture

The implemented solution splits the problem into four separate components:

• The storage format of the extracted and computed information;
• The co-occurrence extraction from the corpus;
• The calculation of the various association measures;
• The display of the information in a user-friendly form.

Storage Format

An SQLite database was chosen as the format to store the obtained information. An Entity–Relationship
(ER) model was developed to represent the database, and a relational model was generated from the
created ER model.

To store the information in the database, the ER model in Figure 2.3 was created and used. In this
model:

• The entities Corpus, Word and Dependency are used to store the information about each corpus, word and dependency type, respectively.
• The weak entity Property associates the Dependency with a property type.
• Each word has a Belongs relation with the corpus, which indicates how often the word occurs in the Corpus.
• Each pair of words and property also has a Co-occurrence relation, with a frequency attribute which defines how often these words occur together in the same corpus with the given property. Additionally, several attributes exist for each kind of association measure.
• The weak entity Sentence has a Belongs association with the Corpus, and the aggregation of the Co-occurrence associations has an Exemplifies association with the Sentence entity.

Additionally, the following Integrity Constraints (IC) were identified:

1. The words in the Co-occurrence association must belong to the Corpus with which they are associated.
2. The Co-occurrence association must be associated with the same Property with which the words associate in Belongs.
3. The sentences in the Exemplifies association must belong to the Corpus with which the given Co-occurrence is associated.

Figure 2.3: The ER model used in (Correia, 2015).
[Figure: the entities Corpus, Sentence, Dependency and Word, linked by the Co-Occurrence association with attributes such as frequency, significance, pmi, dice, logdice, loglikelihood and chipearson.]

Co-occurrence extraction

The co-occurrence extractor obtains the processed Extensible Markup Language (XML) from XIP and,
then, parses it to obtain the dependency information.

The extractor reads all dependencies parsed from the XML and stores the following information about each of them in the database:

• The type of the dependency in question;
• The lemma and class of each word in the dependency;
• The property of the dependency;
• The sentence in which the dependency occurred.

Association Measures

After the database is populated with the co-occurrences from the corpus, the association measures are
calculated in batches of 2,000 co-occurrences each.

The measures calculated are:

• Pointwise Mutual Information;
• Dice Coefficient;
• LogDice;
• Pearson's Chi-Square Test;
• Log-Likelihood Ratio;
• Significance Measure.

Information Display

The extracted information, stored in the database, is then displayed to the user through a web interface written in PHP: Hypertext Preprocessor (PHP) and AngularJS. The front-end, executed on the client side, makes Asynchronous JavaScript and XML (AJAX) requests to the back-end, executed on the server side.

2.5.3 Association Measures

As seen in Section 2.2 and Section 2.3, many works rely on metrics to measure the cohesion of two words (Pantel, 2003; Pantel and Lin, 2002; Jurgens and Stevens, 2010; Klapaftis and Manandhar, 2008; Korkontzelos and Manandhar, 2010; Correia, 2015). This section further examines some association measures and their properties.

Pointwise Mutual Information and Normalized Pointwise Mutual Information

PMI is a measure of how much the actual probability of a co-occurrence p(x, y) differs from what would
be expected given the probabilities of the individual events and assuming the independence of p(x) and
p(y) (Church and Hanks, 1990). PMI is thus defined in Equation 2.5.

(2.5)   i(x, y) = \ln \frac{p(x, y)}{p(x)\,p(y)}

Normalized Pointwise Mutual Information (NPMI) normalises the upper and lower bounds of PMI (Bouma, 2009). In this case, i_n(x, y) = 1 when the two words only occur together, i_n(x, y) = 0 when x and y are independent, and i_n(x, y) approaches −1 when the two words occur separately but never together, that is, when p(x, y) approaches 0 while p(x) and p(y) are fixed. NPMI is thus defined in Equation 2.6.

 
(2.6)   i_n(x, y) = \frac{\ln \frac{p(x, y)}{p(x)\,p(y)}}{-\ln p(x, y)}
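Both measures can be sketched directly from the probabilities:

```python
import math

def pmi(p_xy, p_x, p_y):
    """Pointwise mutual information, Equation (2.5)."""
    return math.log(p_xy / (p_x * p_y))

def npmi(p_xy, p_x, p_y):
    """Normalised PMI, Equation (2.6): PMI divided by -ln p(x, y)."""
    return pmi(p_xy, p_x, p_y) / -math.log(p_xy)

npmi(0.1, 0.1, 0.1)   # words always together: approx. 1.0
npmi(0.01, 0.1, 0.1)  # independence, p(x, y) = p(x) p(y): approx. 0.0
```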

Dice Coefficient and LogDice

The Dice coefficient measures the degree of association between two words (Dice, 1945) and ranges from 1.0, which indicates that the two words always occur together, to 0.0, which indicates that they never occur together. The Dice coefficient is defined in Equation 2.7.

(2.7)   D = \frac{2\,p(x, y)}{p(x) + p(y)}

LogDice addresses the fact that the Dice coefficient usually generates very small numbers (Rychlý, 2008). For logDice, the maximum is 14. A value of 0 means there is less than 1 co-occurrence of XY per 16,000 X or 16,000 Y. An increase of 1 unit means the co-occurrence occurs twice as often, and an increase of 7 means it occurs about 100 times as often. logDice is defined in Equation 2.8.

(2.8)   \mathrm{logDice} = 14 + \log_2 D = 14 + \log_2 \frac{2\,p(x, y)}{p(x) + p(y)}
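A sketch of both measures:

```python
import math

def dice(p_xy, p_x, p_y):
    """Dice coefficient, Equation (2.7)."""
    return 2 * p_xy / (p_x + p_y)

def log_dice(p_xy, p_x, p_y):
    """logDice, Equation (2.8); reaches 14 when the words always co-occur."""
    return 14 + math.log2(dice(p_xy, p_x, p_y))

log_dice(0.1, 0.1, 0.1)  # words always together: D = 1, logDice = 14.0
```

Note how doubling D adds exactly one unit to logDice, which is the scale property described above.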

Pearson’s Chi-Squared Test

χ² is a test for dependence which does not assume normally distributed probabilities (Manning and Schütze, 1999). The test compares observed frequencies with the frequencies expected under independence. In the simplest case, it is applied to 2-by-2 tables such as Table 2.1. The χ² test sums the differences between the observed and expected values over all cells of the table, scaled by the magnitude of the expected values, as per Equation 2.9, where i is the table row, j is the table column, O_ij is the observed value of cell (i, j) and E_ij is the expected value.

Table 2.1: A 2-by-2 table showing the dependence of two words

         w1 = x   w1 ≠ x
w2 = y   O11      O12
w2 ≠ y   O21      O22

(2.9)   \chi^2 = \sum_{ij} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}

The expected frequencies E_ij are calculated by converting the row and column totals into proportions. For example, E_11 is calculated by taking the total of the first row divided by the sample size (N), multiplying it by the total of the first column divided by the sample size, and multiplying the result by the sample size, as exemplified in Equation 2.10.

(2.10)   E_{11} = \frac{O_{11} + O_{12}}{N} \times \frac{O_{11} + O_{21}}{N} \times N

For 2-by-2 tables, the test can be simplified into Equation 2.11.

(2.11)   \chi^2 = \frac{N (O_{11} O_{22} - O_{12} O_{21})^2}{(O_{11} + O_{12})(O_{11} + O_{21})(O_{12} + O_{22})(O_{21} + O_{22})}
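Both forms of the test can be sketched and checked against each other:

```python
def chi2_full(o11, o12, o21, o22):
    """Chi-squared over a 2-by-2 table via expected values (Eqs. 2.9, 2.10)."""
    n = o11 + o12 + o21 + o22
    obs = [[o11, o12], [o21, o22]]
    total = 0.0
    for i in range(2):
        for j in range(2):
            row = obs[i][0] + obs[i][1]
            col = obs[0][j] + obs[1][j]
            e = row * col / n          # expected count for cell (i, j)
            total += (obs[i][j] - e) ** 2 / e
    return total

def chi2_2x2(o11, o12, o21, o22):
    """Closed form for 2-by-2 tables, Equation (2.11)."""
    n = o11 + o12 + o21 + o22
    return (n * (o11 * o22 - o12 * o21) ** 2 /
            ((o11 + o12) * (o11 + o21) * (o12 + o22) * (o21 + o22)))
```

For any 2-by-2 table the two functions agree, which is the simplification the text describes.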

Log Likelihood Ratio

The log-likelihood ratio (G² or −2 log λ) was created to address the overestimation by the normal distribution when dealing with very small probabilities (np(1 − p) < 5, for p the probability that the next word matches a prototype and n the number of words for which the match is being tested) (Dunning, 1993).

The likelihood ratio tests do not depend on assuming a normal distribution and instead use the
generalized likelihood ratio, which can be used effectively in smaller volumes of text than is necessary
for conventional tests based on assumed normal distributions (Dunning, 1993).

For events k_1, k_2, the likelihood ratio for the binomial distribution is defined in Equation 2.12.

 
(2.12)   -2 \log \lambda = 2 \sum_{i,j} n_{ij} \log \frac{n_{ij}}{m_{ij}}, \qquad n_{ij} = \frac{k_{ij}}{k_{i1} + k_{i2}}, \qquad m_{ij} = \frac{k_{1j} + k_{2j}}{N}

Significance Measure

The significance measure is based on the statistical G-Test for Poisson distributions: given two words A, B, occurring a and b times, respectively, and k times together, the significance sig(A, B) of their occurrence in a sentence is defined in Equation 2.13, with n being the number of sentences and x = ab/n (Biemann et al., 2004).

(2.13)   \mathrm{sig}(A, B) = x - k \log x + \log k!
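A sketch of the measure, using lgamma for log k! so that large k does not overflow:

```python
import math

def significance(a, b, k, n):
    """Poisson-based significance (Biemann et al., 2004):
    sig(A, B) = x - k log x + log k!, with x = a*b/n."""
    x = a * b / n
    return x - k * math.log(x) + math.lgamma(k + 1)  # lgamma(k+1) = ln k!

# With k = 0 joint occurrences, sig reduces to x itself.
significance(2, 3, 0, 6)  # x = 2*3/6 = 1.0, so sig = 1.0
```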

2.6 Summary
Below is a comparison of the various algorithms mentioned in the previous sections. The algorithms are
evaluated according to their performance and to their features.

2.6.1 WSI Algorithms

DSM-based algorithms have to deal with scalability problems as the number of contexts increases. Many
of them try to deal with the problem by using denser representations of the context vectors used.

In comparison, graph-based algorithms have to deal with the significance of a number of co-occurrences, trying to suppress irrelevant co-occurrences as edges or using smoothing techniques to generate edges which were not represented in the original graph.

As can be seen in Table 2.2 and Table 2.3, current implementations of DSM and graph-based algorithms have equivalent performance when compared using either F-score, V-measure or supervised recall. It is important to note that, unless the Gold Standard (GS) and test data used are the same, the results obtained are not comparable among implementations.

Table 2.2: Unsupervised evaluation of WSI algorithms on nouns. All measures are in percentage (%). 1c1word, MFS and Random are baselines from each of the respective datasets; 1c1word and MFS group all instances of a word into a single cluster.

Algorithm          GS          Prec.  Recall  Purity  Entropy  F-score  V-measure
CBC                WordNet     60.8   50.8    —       —        55.4     —
Widdows and Dorow  WordNet     —      —       82.0    —        —        —
Collocations-JC    SWSI 2007   —      —       88.6    31.0     78.0     —
Collocations-BL    SWSI 2007   —      —       89.6    29.0     73.1     —
UoY                WSI&D 2010  —      —       —       —        38.2     20.6
HERMIT             WSI&D 2010  —      —       —       —        30.1     16.7
1c1word            SWSI 2007   —      —       —       —        80.7     0.0¹
Random             SWSI 2007   —      —       —       —        38.1     4.9¹
MFS                WSI&D 2010  —      —       —       —        57.0     0.0
Random             WSI&D 2010  —      —       —       —        30.4     4.2

Table 2.3: Supervised evaluation of WSI algorithms. Unless otherwise specified, in the WSI&D 2010
dataset, the 80-20 split is used.
Algorithm Testing Corpus Sup. Recall (%)
Collocations-JC SWSI 2007 86.4
Collocations-BL SWSI 2007 85.6
UoY WSI&D 2010 59.4
HERMIT WSI&D 2010 53.6
MFS WSI&D 2010 53.2
Random WSI&D 2010 51.5

2.6.2 Graph Clustering Algorithms

The work in this area introduces algorithms that are time-linear to the number of edges with results
comparable to older existing algorithms.
1 The mentioned data points were obtained from (Manandhar and Klapaftis, 2009).

Both algorithms are shown to be suitable for WSI tasks. Both are time-linear in the number of edges and well suited for execution on large-scale graphs with small-world features, as can be seen in Table 2.4.

CW is already frequently used in the area of WSI with good results but, unlike MaxMax, it is not deterministic, and it does not allow placing the same node in several clusters (soft-clustering), a feature especially desirable for a global graph approach, so that one word can have several senses. On the other hand, MaxMax is known to generate many fine-grained clusters, which then have to be merged to obtain clusters closer in size to the target senses (Hope and Keller, 2013b).

Table 2.4: Comparison of Graph Clustering Algorithms. Retrieval Precision (rPrec.), Retrieval Recall
(rRecall), F-score and V-measure are measured in percentage (%).
Alg. Determ. Soft-Clust. rPrec. rRecall F-sc. V-mes.
Chinese Whispers No No 94.8 71.3 — —
MaxMax Yes Yes — — 13.2 32.8

3 Architecture
In this chapter, the architecture of the WSI and WSD system is presented, as well as its articulation
with the STRING NLP system.

The architecture of this system consists of four components, ordered as per Figure 3.1:

1. a corpus pre-parser, which prepares the corpus being used to be processed by STRING;
2. the co-occurrence extractor from Correia (2015);
3. a graph constructor and clustering algorithm;
4. a word sense disambiguation module.

Figure 3.1: The architecture of the system
[Figure: Sentences → STRING → XIP XML → Dependency Extractor → Dependency Database → Graph Generator → Dependency Graph → Graph Clusterer → Dependency Clusters → Disambiguator.]

The data flow is represented in Figure 3.2. The processed text from STRING goes through a modified implementation of the co-occurrence extractor from Correia (2015), which stores the co-occurring word pairs, along with their corpus frequencies and association measure values, in a database.

3.1 Co-occurrence Extraction and Storage

Syntactic co-occurrences are extracted using the work by Correia (2015), as per the example in Figure 3.3. Additionally, this work uses the named entities obtained from XIP: if a word in a co-occurrence is a named entity, the word is attributed a lemma with the appropriate category (for example, the word Pedro is identified with the lemma INDIVIDUAL).

Figure 3.2: The progression of data as it evolves through the architecture
[Figure: XIP XML → (co-occurrence extraction) → co-occurrences: words and relations → (graph construction) → co-occurrence graph: nodes and edges → (clustering algorithm) → clusters of nodes.]

Figure 3.3: Example of the co-occurrences extracted from a sample sentence
[Figure: the sentence "Eu comi uma laranja pequena", with its words linked by dependencies such as subj (pre), cdir (post), quantd and mod (post).]

The co-occurrences are then stored in an SQLite database, using the ER model in Figure 3.4, adapted from Correia (2015), with minor changes to keep information about all the sentences used (in the Context entity), instead of only 20 randomly selected sentences for each co-occurrence.

Additionally, the relationship Exemplifies is replaced with Occurs, which associates each Co-occurrence aggregation with all the respective Contexts where it occurs.

3.2 Graph Generation and Clustering

For a target word w, a query to the database is made to obtain the word pairs and association measure values of all co-occurrences which occur in the same contexts as w. After filtering out all co-occurrences which do not reach the minimum threshold value of the association measure being used, the co-occurrences are saved in a graph structure.

To ensure that only words directly related to w remain, a breadth-first search is made starting from
the target word. Only the nodes which were visited during this process are kept in the final graph.
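This pruning can be sketched with a standard breadth-first search over the co-occurrence graph (the Portuguese example words are illustrative):

```python
from collections import deque, defaultdict

def prune_to_component(target, edges):
    """Keep only the nodes reachable from the target word via a
    breadth-first search over the co-occurrence graph."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    kept, queue = {target}, deque([target])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in kept:
                kept.add(v)
                queue.append(v)
    return kept

# 'banco' connects to 'dinheiro'; ('ilha', 'mar') is a dangling
# co-occurrence not reachable from the target, so it is dropped.
prune_to_component('banco', [('banco', 'dinheiro'), ('ilha', 'mar')])
```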

Figure 3.4: The Entity–Relationship model used to store the information in the database
[Figure: the entities Corpus, File, Context, Dependency, Word and Property, linked by the co-occurrence association with attributes frequency, pmi, npmi, dice, logDice, logLikelihood, chiPearson and significance.]

After the graph is generated, a graph clustering algorithm is run against it, and the resulting senses
are stored.

3.3 Sense Disambiguation

Disambiguation is done by measuring the separation between the words in the given context and each
of the induced sense clusters. The sense with the lowest separation score is the most likely sense of
the word.

For a target word w from a given context c, the co-occurrences from c are extracted and used to
generate the cluster for the context, Ci .

Then, for each inferred sense cluster C_j of w, the separation between C_i and C_j is calculated according to Equation 3.1 (Hope and Keller, 2013b), where proximity is defined as the weight of the co-occurrence in the dependency graph of the word.

(3.1)   \mathrm{separation}(C_i, C_j) = 1 - \frac{\sum_{x \in C_i,\, y \in C_j} \mathrm{proximity}(x, y)}{|C_i| \times |C_j|}

The cluster Cj with the lowest separation score compared to Ci is then considered the most likely
sense of the target word w.
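The disambiguation rule can be sketched as follows, assuming proximity is given as a function returning the edge weight (0 for pairs that do not co-occur); the cluster names are illustrative.

```python
def separation(ci, cj, proximity):
    """Separation between a context cluster and a sense cluster,
    Equation (3.1)."""
    total = sum(proximity(x, y) for x in ci for y in cj)
    return 1 - total / (len(ci) * len(cj))

def pick_sense(ci, senses, proximity):
    """The induced sense with the lowest separation score wins."""
    return min(senses, key=lambda j: separation(ci, senses[j], proximity))

# Only the pair (a, b) co-occurs, so sense 's1' is closest to context {a}.
prox = lambda x, y: 1.0 if {x, y} == {'a', 'b'} else 0.0
senses = {'s1': {'b'}, 's2': {'c'}}
pick_sense({'a'}, senses, prox)  # separation 0.0 for 's1' vs 1.0 for 's2'
```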

4 Implementation
To perform the induction step, the CETEMPúblico corpus (Rocha and Santos, 2000) was used. This corpus has more than 1 million extracts of articles from the Portuguese newspaper Público and over 191 million words. The resulting database has more than 50 million co-occurrences, as well as the respective association measures.

The following chapter describes the changes which were required to populate the database within a reasonable time frame, as well as implementation details of some of the stages.

4.1 Corpora Pre-Parse

Two corpora were chosen for this project: the CETEMPúblico corpus (Rocha and Santos, 2000) and a dump of the Portuguese edition of Wikipedia1. Each corpus uses its own syntax to describe its contents. As STRING only parses plain-text sentences, the additional metadata provided was either removed or adapted for parsing by STRING.

4.1.1 CETEMPúblico

CETEMPúblico was provided in Standard Generalized Markup Language (SGML) format, with the following tags:

ext Extract (usually composed of two paragraphs);
p Paragraph;
s Sentence;
t Title;
a Author;
li List element.

To parse it, a Python script was designed, which parses the corpus as XML using a parser that generates the Document Object Model (DOM) element tree incrementally. Tags considered irrelevant were ignored, and tags with special meanings, such as the ones setting the boundaries of a sentence, paragraph or extract, were replaced with unique plain-text identifiers.

Because the corpus was in SGML format and not XML, a few replacements were made before feeding each line to the parser, to make sure that the parser would only be fed valid XML. These ensured that attributes were quoted and that all elements had an opening and a closing tag.
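Two such replacements can be sketched with regular expressions; the exact rules of the original script are not documented here, so these patterns are illustrative only:

```python
import re

def sgmlish_to_xml(line):
    """Illustrative clean-up of two SGML habits before XML parsing:
    quote bare attribute values and close an unclosed <li> element."""
    line = re.sub(r'(\w+)=(\w+)', r'\1="\2"', line)        # n=1 -> n="1"
    line = re.sub(r'<li>([^<]*)(?=<|$)', r'<li>\1</li>', line)
    return line

sgmlish_to_xml('<ext n=1 SEC=clt>')  # -> '<ext n="1" SEC="clt">'
```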
1 https://en.wikipedia.org/wiki/Wikipedia:Database_download, last accessed on 21st June 2017.

4.1.2 Wikipedia

After obtaining the Wikipedia dump, a tool called WikiExtractor2 was used to convert the obtained XML into mostly plain-text files.

Further cleaning was then performed, in which all possible invalid XML or leftover tags were found using a regular expression, added to a list, and removed automatically.

Finally, document boundaries and paragraphs were replaced with unique plain-text identifiers which can be recognised even after being parsed by STRING.

4.2 Co-occurrence Extraction

To obtain the co-occurrences, the extractor created by Correia (2015) was used, but it had to be adapted, as it had been developed under requirements and conditions distinct from those of this work. These adaptations are described below.

4.2.1 Storage Model

To be able to provide all the required information to the graph construction algorithm, the model used to store the information had to be modified. A new ER model was designed, as shown in Figure 3.4, and used to generate the relational model used in the database.

All tables have their own id, used as the primary key, with the previous primary key being kept under a UNIQUE constraint. The new id primary key is used to reference a given table's rows from other tables. This helped to reduce the space occupied by repeated references to text-based primary keys.

4.2.2 Parsing XIP files

The XIP files were being parsed as XML using Java's W3C-based DOM parser3. This parser loads the file into memory and creates the DOM-like tree structure from there.

This implementation had problems with larger files, in the range of 100 MB and larger, taking exponential amounts of time to perform the most basic operations.

The existing DOM parser was thus replaced with a pull-parser. This reads the file sequentially and, as new tags are found, such as the start or the end of an element, an event is generated with only the contents pertaining to it.

The pull-parser has O(1) memory usage for parsing, as only the currently parsed segment needs to stay in memory while the document structure is built.
2 https://github.com/attardi/wikiextractor, last accessed on 21st June 2017.
3 https://docs.oracle.com/javase/8/docs/api/org/w3c/dom/package-summary.html, last accessed on 6th
August 2017

On top of the XML pull-parser, a basic stack is used to keep track of the current element and depth in the document. The name of the current tag is pushed onto the stack when a start event is emitted, and the top element is popped when an end event is emitted.

With the events and the stack, a state machine is used, which is responsible for deciding the next action in the construction of the XIP document in memory.
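The event-plus-stack scheme can be sketched with Python's pull-style iterparse (the actual implementation is in Java, and the DEPENDENCY tag here is a hypothetical example, not the real XIP schema):

```python
import io
import xml.etree.ElementTree as ET

def parse_events(xml_file):
    """Pull-style parsing: start/end events drive a stack of open tags,
    and parsed segments are discarded so memory stays bounded."""
    stack, dependencies = [], []
    for event, elem in ET.iterparse(xml_file, events=('start', 'end')):
        if event == 'start':
            stack.append(elem.tag)        # push on start event
        else:
            if elem.tag == 'DEPENDENCY':  # hypothetical tag of interest
                dependencies.append(elem.get('name'))
            stack.pop()                   # pop on end event
            elem.clear()                  # discard the parsed segment
    return dependencies

doc = io.StringIO('<XIP><DEPENDENCY name="subj"/><DEPENDENCY name="cdir"/></XIP>')
parse_events(doc)  # -> ['subj', 'cdir']
```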

4.2.3 Populating the Database

After parsing the XIP files, the dependencies are extracted and added to the database as co-occurrences. First, an INSERT is attempted; if it fails due to a UNIQUE violation, the existing entry is updated with the new information.
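The insert-then-update step can be sketched with SQLite (the table and column names are illustrative, not the actual schema):

```python
import sqlite3

def upsert_cooccurrence(conn, w1, w2, freq):
    """Try the INSERT first; on a UNIQUE violation, fold the new
    frequency into the existing row instead."""
    try:
        conn.execute(
            "INSERT INTO cooc (word1, word2, frequency) VALUES (?, ?, ?)",
            (w1, w2, freq))
    except sqlite3.IntegrityError:
        conn.execute(
            "UPDATE cooc SET frequency = frequency + ? "
            "WHERE word1 = ? AND word2 = ?",
            (freq, w1, w2))

conn = sqlite3.connect(':memory:')
conn.execute("CREATE TABLE cooc (word1 TEXT, word2 TEXT, frequency INTEGER, "
             "UNIQUE (word1, word2))")
upsert_cooccurrence(conn, 'comer', 'laranja', 1)
upsert_cooccurrence(conn, 'comer', 'laranja', 2)
# frequency for ('comer', 'laranja') is now 3
```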

After all co-occurrences are added to the database, the values of the association measures are updated in batches of 2,000 at a time. To prevent slowdowns, a cache of values read from the database is used, avoiding reading the same value multiple times and considerably reducing the time taken to populate the association measures.

4.3 Sense Induction

Given a target word w, a query to the database is made to obtain all co-occurrences that occur in the same context as co-occurrences having w as either the first or second word, as in Listing 4.1.

SELECT Coocorrencia.*,
       p1.palavra AS p1lemma,
       p1.classe AS p1class,
       p2.palavra AS p2lemma,
       p2.classe AS p2class
FROM Coocorrencia
INNER JOIN CoOccurrenceContexts ON Coocorrencia.id = CoOccurrenceContexts.cooccurrence
INNER JOIN Palavra AS p1 ON Coocorrencia.idPalavra1 = p1.idPalavra
INNER JOIN Palavra AS p2 ON Coocorrencia.idPalavra2 = p2.idPalavra
WHERE CoOccurrenceContexts.context IN
    (SELECT CoOccurrenceContexts.context
     FROM CoOccurrenceContexts
     INNER JOIN Coocorrencia ON CoOccurrenceContexts.cooccurrence = Coocorrencia.id
     WHERE Coocorrencia.idPalavra1 = ?1
        OR Coocorrencia.idPalavra2 = ?1)

Listing 4.1: SQL query to extract all co-occurrences in the same context as the target word

The resulting set of rows consists of all co-occurrences happening in the same context as w. These
are then assembled into a graph, in which all the nodes represent the words from the set, and the edges
represent the co-occurrences of that same set.

All co-occurrences whose association-measure weight is lower than a pre-specified threshold are
then removed from the graph.

A breadth-first search is then applied, starting from w, to ensure that only nodes connected to w by
some path remain in the graph. This eliminates nodes and co-occurrences not connected to w at all,
coming, for example, from dangling co-occurrences that were previously connected through a path
removed in one of the previous steps of the generation.

The resulting graph is finally clustered using the CW algorithm, explained in Section 2.4.
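The induction steps above (threshold filtering, breadth-first reachability from w, and CW clustering) can be sketched as follows. This is a simplified illustration, not the project's implementation: in particular, the CW update schedule is reduced to a fixed number of shuffled passes, and the sample edges and weights are invented.

```python
import random
from collections import defaultdict, deque

def induce(target, edges, threshold, iterations=20, seed=0):
    """edges: {(w1, w2): weight}. Returns {node: cluster_label} for nodes
    reachable from the target after filtering weak edges."""
    graph = defaultdict(set)
    for (a, b), w in edges.items():
        if w >= threshold:                  # drop low-association edges
            graph[a].add(b)
            graph[b].add(a)
    # breadth-first search: keep only nodes reachable from the target word
    seen, queue = {target}, deque([target])
    while queue:
        node = queue.popleft()
        for nb in graph[node]:
            if nb not in seen:
                seen.add(nb)
                queue.append(nb)
    # Chinese Whispers (Biemann, 2006), simplified: each node repeatedly
    # adopts the label most frequent among its neighbours
    labels = {n: n for n in seen}
    rng = random.Random(seed)
    nodes = list(seen)
    for _ in range(iterations):
        rng.shuffle(nodes)
        for n in nodes:
            votes = defaultdict(int)
            for nb in graph[n]:
                if nb in seen:
                    votes[labels[nb]] += 1
            if votes:
                labels[n] = max(votes, key=votes.get)
    return labels

edges = {("vingar", "morte"): 0.8, ("morte", "crime"): 0.7,
         ("vingar", "planta"): 0.6, ("planta", "jardim"): 0.9,
         ("ruido", "isolado"): 0.9}   # not connected to the target word
labels = induce("vingar", edges, threshold=0.5)
```

Note how the disconnected pair never reaches the clustering stage: the breadth-first search discards it before CW runs.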

4.4 Sense Disambiguation

To be able to perform disambiguation, additional changes had to be made to the co-occurrence extractor
from Correia (2015). The extractor had merged the logic for extracting co-occurrences with the code for
writing them to the database. To make the extractor usable for the task of disambiguation, these two
concerns were separated.

Having applied those changes, the disambiguation for a target word w and a context c starts by
using the modified extractor to obtain all word co-occurrences in c. These are considered the cluster of
co-occurrences of the word w in context c.

To discover which of the induced senses is the most likely to be in use, each of them is
compared to the cluster of co-occurrences from the context, using the separation measure defined in
Equation 3.1. The sense whose cluster has the lowest separation is considered the most likely sense
of the word in the given context.

4.5 Avoiding Duplicated Calculations

As many of these steps can take a considerable amount of time, it is desirable to avoid repeating them
whenever possible. As a result, most of the execution pipeline was adapted to read from and write to
files wherever possible, allowing the project to use these files as a cache for computed results.

The format chosen was a subset of Comma-Separated Values (CSV) files, in which each element
is a row and each property is a column. Each time an expensive operation is concluded, such as
obtaining the co-occurrences in the context of a word or generating the clusters for a word with a given
set of parameters, the results are saved to a CSV file. If that same set of data is required later,
instead of being recalculated from scratch, the information is read from the existing file.
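The file-backed caching scheme can be sketched as follows; file names and row contents are illustrative.

```python
import csv
import os
import tempfile

# Sketch of the file-backed cache: results of an expensive step are written
# to a CSV file once and reloaded on later runs instead of being recomputed.
def cached_rows(path, compute):
    if os.path.exists(path):                       # cache hit: reload from disk
        with open(path, newline="") as f:
            return [tuple(row) for row in csv.reader(f)]
    rows = compute()                               # cache miss: compute once
    with open(path, "w", newline="") as f:
        csv.writer(f).writerows(rows)
    return rows

calls = []
def expensive():
    calls.append(1)                                # count real computations
    return [("vingar", "morte", "0.8")]

path = os.path.join(tempfile.mkdtemp(), "vingar.csv")
first = cached_rows(path, expensive)
second = cached_rows(path, expensive)              # served from the file
```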

5 Evaluation

5.1 Evaluation Methodology

The evaluation comprises two methods: an unsupervised and a supervised evaluation. The
unsupervised evaluation assesses the resulting clusters' similarity to the Golden Standard (GS)
senses. The supervised evaluation serves as an application-oriented assessment of the resulting
clusters in the task of WSD.

5.1.1 Unsupervised Evaluation

In this evaluation method, the set of resulting clusters is compared to a GS. This comparison is made by
evaluating the clusters’ homogeneity and completeness (Manandhar and Klapaftis, 2009). Homogeneity
refers to the degree that each cluster consists of data points primarily belonging to a single GS class.
Completeness refers to the degree that each GS class consists of data points assigned to a single
cluster. To evaluate homogeneity and completeness, the F-score and the V-measure will be used.

Given a particular GS sense gsi of size ai and a cluster cj of size aj, the F-score of gsi and cj is
the harmonic mean of its precision and its recall, as defined in Equation 5.1 (Agirre and Soroa, 2007).
The precision of a class gsi with respect to cluster cj is the number of common instances aij divided by
the total cluster size, i.e. P(gsi, cj) = aij / aj. The recall of a class gsi with respect to cluster cj is the
number of common instances divided by the total sense size, i.e. R(gsi, cj) = aij / ai.

(5.1)  f(gsi, cj) = 2 · P(gsi, cj) · R(gsi, cj) / (P(gsi, cj) + R(gsi, cj))

The F-score of a class gsi is the maximum F-score obtained at any cluster, according to Equation 5.2.
The F-score of the entire clustering solution is the weighted average of the F-scores of each GS sense,
as in Equation 5.3, where q is the number of GS senses and N is the total number of target word
instances.

(5.2)  F(gsi) = max_cj f(gsi, cj)

(5.3)  FS = Σ_{i=1}^{q} (|gsi| / N) · F(gsi)

The F-score measures both homogeneity (precision) and completeness (recall). However, it suffers
from the matching problem, as it does not evaluate the entire membership of a cluster (Rosenberg
and Hirschberg, 2007): the F-score does not consider the components of clusters beyond the majority
class. Furthermore, the F-score penalises systems for getting the number of GS classes wrong
(Manandhar and Klapaftis, 2009).
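The F-score computation of Equations 5.1 to 5.3 can be sketched directly from the definitions; the gold and predicted labelings below are toy data.

```python
from collections import Counter

def fscore(gold, pred):
    """Clustering F-score per Equations 5.1-5.3 (Agirre and Soroa, 2007):
    for each gold sense take the best-matching cluster's F-score, then
    average weighted by sense size. gold/pred map instance -> label."""
    senses = Counter(gold.values())       # a_i: size of each GS sense
    clusters = Counter(pred.values())     # a_j: size of each cluster
    n = len(gold)
    total = 0.0
    for gs, a_i in senses.items():
        best = 0.0
        for c, a_j in clusters.items():
            a_ij = sum(1 for x in gold if gold[x] == gs and pred[x] == c)
            if a_ij:
                p, r = a_ij / a_j, a_ij / a_i      # precision and recall
                best = max(best, 2 * p * r / (p + r))
        total += (a_i / n) * best                  # weighted by sense size
    return total

gold = {1: "s1", 2: "s1", 3: "s2", 4: "s2"}
perfect = fscore(gold, {1: "a", 2: "a", 3: "b", 4: "b"})
singletons = fscore(gold, {1: "a", 2: "b", 3: "c", 4: "d"})
```

Singleton clusters illustrate the penalty for over-generating clusters: each sense's best match covers only half the sense, capping the score well below 1.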

Thus, to complement the F-score, the V-measure is also used. V-measure is an entropy-based
measure that explicitly measures how successfully the criteria of homogeneity and completeness have
been satisfied (Rosenberg and Hirschberg, 2007). Just as precision and recall are combined to form
the F-score, homogeneity and completeness are combined using the harmonic mean to compute the
V-measure.

For the homogeneity criterion, a clustering must assign only the data points of a single class to a
single cluster. That is, the class distribution of each cluster should be skewed to a single class, i.e.
have zero entropy (Rosenberg and Hirschberg, 2007). In a perfectly homogeneous case, H(GS|C) = 0,
while in an imperfect situation the value depends on the size of the dataset and the distribution of class
sizes. Therefore, the V-measure normalizes this value by the maximum reduction in entropy the
clustering could provide, H(GS), resulting in Equations 5.4, 5.5, and 5.6.


1,
 if H(GS, C) = 0
(5.4) h= H(GS|C)
1 −
 , otherwise
H(GS)

|GS| P|C| P|C|


X j=1 aij j=1 aij
(5.5) H(GS) = − log
i=1
N N

|C| |GS|
X X aij aij
(5.6) H(GS|C) = − log P|GS|
j=1 i=1
N akj
k=1

Symmetrically to homogeneity, for the completeness criterion, a clustering solution must assign all
the data points of a single class to a single cluster. This can be evaluated by calculating the conditional
entropy of the proposed cluster distribution given the class of the component data points, H(C|GS).
In a perfectly complete case, H(C|GS) = 0; in the worst case, each class is represented by every
cluster with a distribution equal to the distribution of cluster sizes, and H(C|GS) is maximal and
equals H(C). Therefore, in a way symmetric to that used for homogeneity, the V-measure defines
completeness as in Equations 5.7, 5.8, and 5.9.


1,
 if H(C, GS) = 0
(5.7) c= H(C|GS)
1 −
 , otherwise
H(C)

|C| P|GS| P|GS|


X aij
i=1 i=1 aij
(5.8) H(C) = − log
j=1
N N

|GS| |C|
X X aij aij
(5.9) H(C|GS) = − log P|C|
i=1 j=1
N aik
k=1

Based on these calculations of homogeneity and completeness, the V-measure of a clustering solution
is computed as the weighted harmonic mean of homogeneity and completeness, according to
Equation 5.10: if β is greater than 1, completeness is weighted more strongly; if β is less than 1,
homogeneity is weighted more strongly.

(5.10)  Vβ = (1 + β) · h · c / (β · h + c)

Although the V-measure does not increase monotonically, it is known to favour systems producing
more clusters than the number of GS senses (Manandhar et al., 2010). With this in mind, both the
F-score and the V-measure are used in this evaluation, as the F-score penalises systems that produce
a different number of clusters from the number of GS senses.
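Equations 5.4 to 5.10 translate into a short computation over the sense-cluster contingency counts; the sketch below uses natural logarithms and guards the degenerate single-class and single-cluster cases, and the labelings are toy data.

```python
from collections import Counter
from math import log

def v_measure(gold, pred, beta=1.0):
    """Homogeneity h, completeness c, and V-measure per Equations 5.4-5.10
    (Rosenberg and Hirschberg, 2007). gold/pred map instance -> label."""
    n = len(gold)
    joint = Counter((gold[x], pred[x]) for x in gold)   # a_ij counts
    g_tot = Counter(gold.values())                      # sense sizes
    c_tot = Counter(pred.values())                      # cluster sizes
    h_gs = -sum(a / n * log(a / n) for a in g_tot.values())      # H(GS)
    h_c = -sum(a / n * log(a / n) for a in c_tot.values())       # H(C)
    h_gs_c = -sum(a / n * log(a / c_tot[cl])                     # H(GS|C)
                  for (gs, cl), a in joint.items())
    h_c_gs = -sum(a / n * log(a / g_tot[gs])                     # H(C|GS)
                  for (gs, cl), a in joint.items())
    h = 1.0 if h_gs == 0 else 1.0 - h_gs_c / h_gs      # degenerate-case guard
    c = 1.0 if h_c == 0 else 1.0 - h_c_gs / h_c
    return (1 + beta) * h * c / (beta * h + c) if h + c else 0.0

gold = {1: "s1", 2: "s1", 3: "s2", 4: "s2"}
perfect = v_measure(gold, {1: "a", 2: "a", 3: "b", 4: "b"})
lumped = v_measure(gold, {1: "a", 2: "a", 3: "a", 4: "a"})
```

A single all-covering cluster is perfectly complete but maximally inhomogeneous, which is why an MFS-style one-cluster baseline scores near zero on V-measure.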

Additional measures for unsupervised evaluation include entropy and purity. Entropy measures
how the various classes of objects are distributed within each cluster; generally, the smaller the entropy,
the better the clustering algorithm performs. Purity, on the other hand, measures the extent to which
each cluster contains objects from primarily one class; the larger the purity, the better the clustering
algorithm performs. A formal definition of these measures is available in Zhao et al. (2005). However,
as both evaluate only the homogeneity of a clustering solution, disregarding completeness (Manandhar
and Klapaftis, 2009), they are not considered in this evaluation.

5.1.2 Supervised Evaluation

In the supervised evaluation method, the target corpus is split into a testing part and a training part.
The training part is used to map the automatically induced clusters to GS senses (Agirre et al., 2006).
After that, the testing part is used to evaluate the resulting clustering in a WSD setting.

Suppose there are m clusters and n senses for the target word. M is the matrix of probabilities of
senses given clusters, M = {mij}, 1 ≤ i ≤ m, 1 ≤ j ≤ n, where each mij = P(gsj | ci), that is, mij
is the probability of word sense j given that the instance has been assigned to cluster i. This probability
can be computed by counting the number of times an occurrence with sense j has been assigned to
cluster i in the training corpus.

The matrix M is then used to transform the cluster score vector h returned by the algorithm for each
test-corpus instance into a sense score vector s, by multiplying the score vector by the matrix: s = hM.
The sense with the maximum score is then assigned to that instance.

As the algorithm always returns an answer, its recall is 100% in all cases; there are no false
negatives because there are no negatives at all. Thus, the algorithm only needs to be evaluated
according to its precision (Agirre et al., 2006).
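The mapping step described above can be sketched as below; the cluster and sense names, and the use of dictionaries in place of a dense matrix, are illustrative choices only.

```python
# Sketch of the supervised mapping step: build M = P(sense | cluster) from
# training counts, then score a test instance via s = hM and take the argmax.
def build_mapping(pairs, clusters, senses):
    """pairs: (cluster, sense) assignments observed in the training part."""
    counts = {c: {s: 0 for s in senses} for c in clusters}
    for c, s in pairs:
        counts[c][s] += 1
    m = {}
    for c in clusters:
        total = sum(counts[c].values())
        m[c] = {s: counts[c][s] / total if total else 0.0 for s in senses}
    return m

def predict(h, m, senses):
    """h: cluster score vector {cluster: score}; returns the argmax sense
    of the sense score vector s = hM."""
    scores = {s: sum(h[c] * m[c][s] for c in h) for s in senses}
    return max(scores, key=scores.get)

train = [("c1", "avenge"), ("c1", "avenge"), ("c2", "thrive")]
m = build_mapping(train, ["c1", "c2"], ["avenge", "thrive"])
best = predict({"c1": 1.0, "c2": 0.0}, m, ["avenge", "thrive"])
```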

5.2 Test Corpus


To evaluate the project, the corpus from Baptista (2013) was used: a 290K-word fragment of the
PAROLE corpus (Nascimento et al., 1998), with each verb instance manually annotated with its ViPEr
class and reviewed by linguists. The split used for evaluation was leave-one-out: the whole corpus was
used for training, except for the sentence being evaluated.

5.3 Parameter Selection


The graph partitioning algorithm chosen was CW, as the other option, MaxMax, has a tendency to
generate many fine-grained clusters (Hope and Keller, 2013b).

The NPMI and logDice association measures were chosen due to their normalized scores, which
allow the same threshold to be used across different words while keeping the same underlying meaning.
The NPMI association measure was tested with minimum thresholds of 0.0, 0.25, 0.5, and 0.75, ranging
from each word being at best independent of the other up to both occurring mostly together. The logDice
association measure was tested with minimum thresholds of 0.0, 2.5, 5.0, 7.5, and 10.0.
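The two measures can be sketched from their published definitions, NPMI following Bouma (2009) and logDice following Rychlý (2008); the frequency counts below are toy values.

```python
from math import log, log2

def npmi(f_xy, f_x, f_y, n):
    """Normalized PMI (Bouma, 2009): ranges from -1 (never together)
    through 0 (independence) to 1 (only occurring together)."""
    p_xy, p_x, p_y = f_xy / n, f_x / n, f_y / n
    return log(p_xy / (p_x * p_y)) / -log(p_xy)

def log_dice(f_xy, f_x, f_y):
    """logDice (Rychlý, 2008): 14 + log2(2 * f(x,y) / (f(x) + f(y))),
    with a maximum of 14 when the two words only occur together."""
    return 14 + log2(2 * f_xy / (f_x + f_y))

always_together = npmi(50, 50, 50, 1000)   # the two words only co-occur
independent = npmi(1, 10, 10, 100)         # p(x,y) = p(x) * p(y)
```

The bounded, frequency-insensitive ranges are what make a single threshold meaningful across words, as noted above.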

5.4 Unsupervised Evaluation


The results of the unsupervised evaluation (Table 5.1) show that all tests scored poorly in F-Score,
while MFS obtained an average of 83.4%. In V-Measure, results varied from 0.0% at the lowest up
to 34.4% at the highest. Except for the most restrictive thresholds of logDice and NPMI, all tests had
better results than MFS, which scored 0.4% in V-Measure.

It is also possible to see that, when the threshold is too high, not enough points are available to
form a meaningful view of the contexts, resulting in no clusters at all and poor results.

When evaluating the number of clusters, it is possible to see that most tests might have been
penalised due to the high number of clusters they had compared to the average number of GS senses.

Table 5.1: Results of the unsupervised WSI evaluation.


Algorithm Association Measure Threshold F-Score (%) V-Measure (%) # Clusters
CW logDice 0.0 1.95 33.62 147.3
CW logDice 2.5 1.80 33.46 252.2
CW logDice 5.0 2.46 29.60 259.7
CW logDice 7.5 2.70 18.26 48.8
CW logDice 10.0 0.83 3.33 0.1
CW NPMI 0.0 2.34 30.97 76.5
CW NPMI 0.25 1.69 34.37 380.1
CW NPMI 0.5 0.72 9.80 0.2
CW NPMI 0.75 0.00 0.00 0.0
MFS — — 83.36 0.37 1.0

5.5 Supervised Evaluation

The results of the supervised WSD evaluation (Table 5.2) were very poor overall. None of the tests
was able to surpass MFS, which reached a precision of 65.7%. The best result was obtained with
logDice at a threshold of 7.5, which reached a precision of 10.1%.

Table 5.2: Results of the supervised WSD evaluation.


Algorithm Association Measure Threshold Precision (%)
CW logDice 0.0 5.37
CW logDice 2.5 0.00
CW logDice 5.0 8.27
CW logDice 7.5 10.10
CW logDice 10.0 2.55
CW NPMI 0.0 0.22
CW NPMI 0.25 2.65
CW NPMI 0.5 0.00
CW NPMI 0.75 0.00
MFS — — 65.74

5.6 Results Interpretation and Evaluation

Overall, the tests had poor results. In all cases, MFS achieved better results, showing the project is
not ready to be used for disambiguation.

The high number of clusters obtained (on average in the hundreds) shows that the results are too
fine-grained to be properly matched to the senses one is trying to disambiguate.

Closer inspection of the graphs of specific words, such as the graph for the word vingar (Figure 5.1),
can further explain the obtained results.

Figure 5.1: Induction graph for the word vingar, using the CW algorithm and the NPMI association
measure.

The first noticeable thing is that the graph includes a few words with spelling errors, such as the
nodes natambém and estratrégia. The second noticeable thing is that although the work by Correia
(2015) identifies named entities and replaces their lemmas with their categories, many of the words
in the graph are named entities which were not recognized as such by STRING. This can be seen in
nodes such as Windsor and Shrek, and adds noise to the graph, increasing the number of small clusters
generated.

But the most noticeable thing is how sparse the graph is. Algorithms such as CW or MaxMax require
a small-world network with several high-density areas to be able to find clusters in the graph. In a graph
such as the one in Figure 5.1, with the exception of the node corresponding to the target word, no node
has more than 2 neighbours. This undermines the assumptions behind the graph-clustering algorithms
and prevents better results.

The graph may be sparse because the syntactic dependencies used impose a stricter relationship
between two words than mere co-occurrence in the same sentence or paragraph would. This stricter
relationship changes the characteristics of the resulting graph, making the graph-clustering algorithms
behave poorly.

Additionally, the stricter relationship might be preventing words that are related but do not have a
syntactic relationship from being included in the graph. This might make the generated graphs unsuitable
for the specific WSI algorithms used in this project.

Another possible cause of the poor results might be the absence of named-entity categories, such as
PERSON, PLACE or ORG, in the algorithm used. As the algorithm is blind to these categories, it cannot
use them to help infer the senses of the target word.

6 Conclusion
This project proposed an algorithm for disambiguating the sense of words in the Portuguese lan-
guage without the need to manually create a sense inventory or sense-tagged corpora. An experimental
implementation was created, which used a graph-based algorithm and the dependencies produced by
STRING. With this experimental implementation, a database was populated with the co-occurrences
of the CETEMPúblico corpus.

Finally, the use of syntactic dependencies for the task of WSI was evaluated and compared with a
baseline of the MFS.

Future work could focus on improving the results obtained by this implementation as well as improving
the system's performance.

One possibility to improve the results would be to evaluate the use of relaxed co-occurrences based
on context windows, sentences or even paragraphs, instead of the syntactic dependency co-occurrences
used in this project. This might influence the characteristics of the generated graphs and improve the
performance of the graph-clustering algorithms.

Another direction would be to investigate vector-based algorithms. These might be capable of making
use of the dependency information provided by XIP, and might perform better than algorithms that are
blind to this information, such as the ones used in this project.

To improve performance, it might be relevant to investigate porting the data to a graph database
engine. The traditional relational database used requires multiple JOINs to execute the required
queries. A database with a native concept of a graph would allow these JOINs to be removed,
improving execution speed.

Although the project was unable to achieve satisfactory results in the evaluation, it was possible to
pinpoint where possible weaknesses are located. A base model was defined and documented, and it
can be used as a base for future attempts at solving the challenges found.

Bibliography
Agirre, E., D. Martínez, O. L. De Lacalle, and A. Soroa (2006). Evaluating and optimizing the parameters
of an unsupervised graph-based WSD algorithm. In Proceedings of the 1st Workshop on Graph-Based
Methods for Natural Language Processing, pp. 89–96. Association for Computational Linguistics.

Agirre, E. and A. Soroa (2007). Semeval-2007 task 02: Evaluating word sense induction and discrimina-
tion systems. In Proceedings of the 4th Intl. Workshop on Semantic Evaluations, pp. 7–12. Association
for Computational Linguistics.

Aït-Mokhtar, S., J.-P. Chanod, and C. Roux (2002). Robustness beyond shallowness: incremental deep
parsing. Natural Language Engineering 8(2-3), 121–144.

Baptista, J. (2013). ViPEr: uma base de dados de construções léxico-sintáticas de verbos do Português
Europeu. In XXVIII Encontro Nacional da Associação Portuguesa de Linguística, pp. 111–129. APL.

Biemann, C. (2006). Chinese Whispers: An efficient graph clustering algorithm and its application to
natural language processing problems. In Proceedings of the 1st Workshop on Graph-Based Methods
for Natural Language Processing, pp. 73–80. Association for Computational Linguistics.

Biemann, C., S. Bordag, G. Heyer, U. Quasthoff, and C. Wolff (2004). Language-independent methods
for compiling monolingual lexical data. In 5th Intl. Conf. on Computational Linguistics and Intelligent
Text Processing, pp. 217–228. Springer.

Bordag, S. (2006). Word sense induction: Triplet-based clustering and automatic evaluation. In Pro-
ceedings of the 11th Conf. of the European Chapter of the Association for Computational Linguistics,
pp. 137–144.

Bouma, G. (2009). Normalized (pointwise) mutual information in collocation extraction. Proceedings of
the German Society for Computational Linguistics, 31–40.

Cabrita, V. (2014, November). Identificar, ordenar e relacionar eventos. Master’s thesis, Instituto Supe-
rior Técnico, Universidade de Lisboa.

Church, K. W. and P. Hanks (1990, March). Word association norms, mutual information, and lexicogra-
phy. Computational Linguistics 16(1), 22–29.

Correia, J. (2015, November). Syntax deep explorer. Master’s thesis, Instituto Superior Técnico, Univer-
sidade de Lisboa.

Dice, L. R. (1945). Measures of the amount of ecologic association between species. Ecology 26(3),
297–302.

Diniz, C. (2010, October). Um conversor baseado em regras de transformação declarativas. Master’s
thesis, Instituto Superior Técnico, Universidade Técnica de Lisboa.

Dorow, B. and D. Widdows (2003). Discovering corpus-specific word senses. In Proceedings of the 10th
Conf. on European Chapter of the Association for Computational Linguistics, Volume 2, pp. 79–82.
Association for Computational Linguistics.

Dunning, T. (1993, March). Accurate methods for the statistics of surprise and coincidence. Computa-
tional linguistics 19(1), 61–74.

Firth, J. R. (1957). Papers in linguistics, 1934-1951, Chapter 11.

Han, J. and M. Kamber (2000). Data Mining: Concepts and Techniques. San Francisco, CA, USA:
Morgan Kaufmann Publishers Inc.

Hope, D. and B. Keller (2013a). MaxMax: A graph-based soft clustering algorithm applied to Word
Sense Induction. In Proceedings of the 14th Intl. Conf. on Computational Linguistics and Intelligent
Text Processing, Volume 1, pp. 368–381. Springer.

Hope, D. and B. Keller (2013b, June). UoS: A graph-based system for graded word sense induction.
In Proceedings of the 7th Intl. Workshop on Semantic Evaluation (SemEval 2013), pp. 689–694.
Association for Computational Linguistics.

Hovy, E., M. Marcus, M. Palmer, L. Ramshaw, and R. Weischedel (2006). OntoNotes: The 90% solution.
In Proceedings of the Human Language Technology Conf. of the NAACL, Companion Volume: Short
Papers, pp. 57–60. Association for Computational Linguistics.

Jurgens, D. and K. Stevens (2010). HERMIT: Flexible clustering for the SemEval-2 WSI task. In Proceed-
ings of the 5th Intl. Workshop on Semantic Evaluation, pp. 359–362. Association for Computational
Linguistics.

Kanerva, P., J. Kristofersson, and A. Holst (2000). Random indexing of text samples for latent semantic
analysis. In Proceedings of the 22nd Annual Conf. of the Cognitive Science Society, Volume 1036.

Klapaftis, I. and S. Manandhar (2008). Word sense induction using graphs of collocations. In Proceed-
ings of the 2008 Conf. on ECAI 2008: 18th European Conf. on Artificial Intelligence, pp. 298–302.

Korkontzelos, I. and S. Manandhar (2010). UoY: Graphs of unambiguous vertices for word sense in-
duction and disambiguation. In Proceedings of the 5th Intl. Workshop on Semantic Evaluation, pp.
355–358. Association for Computational Linguistics.

Liberty, E., R. Sriharsha, and M. Sviridenko (2016). An algorithm for online k-means clustering. In
Proceedings of the 18th Workshop on Algorithm Engineering and Experiments (ALENEX), pp. 81–89.

Mamede, N., J. Baptista, C. Diniz, and V. Cabarrão (2012, April). STRING: A hybrid statistical and
rule-based natural language processing chain for Portuguese. In 10th Intl. Conf. on Computational
Processing of Portuguese (PROPOR). (Demo session).

Manandhar, S. and I. Klapaftis (2009). SemEval-2010 task 14: Evaluation setting for word sense induc-
tion & disambiguation systems. In Proceedings of the Workshop on Semantic Evaluations: Recent
Achievements and Future Directions, pp. 117–122. Association for Computational Linguistics.

Manandhar, S., I. Klapaftis, D. Dligach, and S. Pradhan (2010). SemEval-2010 task 14: Word sense
induction & disambiguation. In Proceedings of the 5th Intl. Workshop on Semantic Evaluation, pp.
63–68. Association for Computational Linguistics.

Manning, C. and H. Schütze (1999). Foundations of Statistical Natural Language Processing. Cam-
bridge, MA: MIT Press.

Marques, J. (2013, November). Anaphora resolution in Portuguese: A hybrid approach. Master’s thesis,
Instituto Superior Técnico, Universidade Técnica de Lisboa.

Maurício, A. (2011, November). Identificação, classificação e normalização de expressões temporais.
Master's thesis, Instituto Superior Técnico, Universidade Técnica de Lisboa.

Miller, G., C. Leacock, R. Tengi, and R. T. Bunker (1993). A semantic concordance. In Proceedings
of the Workshop on Human Language Technology, HLT ’93, Stroudsburg, PA, USA, pp. 303–308.
Association for Computational Linguistics.

Nascimento, M., P. Marrafa, L. Pereira, R. Ribeiro, R. Veloso, and L. Wittmann (1998). LE-PAROLE –
Do corpus à modelização da informação lexical num sistema multifunção. Actas do XIII Encontro da
APL, 115–134.

Navigli, R. (2009). Word sense disambiguation: A survey. ACM Computing Surveys (CSUR) 41(2),
1–69.

Ng, H. (1997). Getting serious about word sense disambiguation. In Proceedings of the ACL SIGLEX
Workshop on Tagging Text with Lexical Semantics: Why, What, and How?, pp. 191–197.

Pantel, P. (2003). Clustering by committee. Ph. D. thesis, University of Alberta.

Pantel, P. and D. Lin (2002). Discovering word senses from text. In Proceedings of the 8th ACM SIGKDD
Intl. Conf. on Knowledge Discovery and Data Mining, pp. 613–619. ACM.

Ribeiro, R. (2003, March). Anotação morfossintáctica desambiguada do português. Master's thesis,
Instituto Superior Técnico, Universidade Técnica de Lisboa.

Rocha, P. and D. Santos (2000). CETEMPúblico: Um corpus de grandes dimensões de linguagem
jornalística portuguesa. In Actas do V Encontro para o processamento computacional da língua
portuguesa escrita e falada, PROPOR, Volume 2000, pp. 131–140.

Rosenberg, A. and J. Hirschberg (2007, June). V-measure: A conditional entropy-based external cluster
evaluation measure. In Proceedings of the 2007 Joint Conf. on Empirical Methods in Natural Language
Processing and Computational Natural Language Learning (EMNLP-CoNLL), pp. 410–420. Associa-
tion for Computational Linguistics.

Rychlỳ, P. (2008). A lexicographer-friendly association score. Proceedings of Recent Advances in
Slavonic Natural Language Processing 2008, 6–9.

Sahlgren, M. (2005). An introduction to random indexing. In Methods and Applications of Semantic
Indexing at the 7th Intl. Conf. on Terminology and Knowledge Engineering, TKE, Volume 5.

Salton, G. and M. McGill (1986). Introduction to Modern Information Retrieval. New York, NY, USA:
McGraw-Hill, Inc.

Šíma, J. and S. Schaeffer (2006). On the NP-completeness of some graph cluster measures. In Intl.
Conf. on Current Trends in Theory and Practice of Computer Science, pp. 530–537. Springer.

Van, T. (2010). Mining for meaning: The extraction of lexico-semantic knowledge from text. Volume 82.

Véronis, J. (2004). Hyperlex: lexical cartography for information retrieval. Computer Speech & Lan-
guage 18(3), 223–252.

Vicente, A. (2013, June). Lexman: um segmentador e analisador morfológico com transdutores. Mas-
ter’s thesis, Instituto Superior Técnico, Universidade Técnica de Lisboa.

Widdows, D. and B. Dorow (2002). A graph model for unsupervised lexical acquisition. In Proceedings
of the 19th Intl. Conf. on Computational Linguistics, Volume 1, pp. 1–7. Association for Computational
Linguistics.

Zepeda-Mendoza, M. L. and O. Resendis-Antonio (2013). Hierarchical Agglomerative Clustering, pp.
886–887. New York, NY: Springer New York.

Zhao, Y., G. Karypis, and U. Fayyad (2005). Hierarchical clustering algorithms for document datasets.
Data Mining and Knowledge Discovery 10(2), 141–168.

