Combining Semantic and Syntactic Document Classifiers
to Improve First Story Detection
Nicola Stokes, Joe Carthy
Department of Computer Science,
University College Dublin,
Ireland
{nicola.stokes,joe.carthy}@ucd.ie
ABSTRACT
In this paper we describe a type of data fusion involving the
combination of evidence derived from multiple document
representations. Our aim is to investigate if a composite
representation can improve the online detection of novel events in
a stream of broadcast news stories. This classification process, otherwise known as first story detection (FSD), or, in the Topic Detection and Tracking pilot study, online new event detection [1], is one of three main classification tasks defined by the TDT
initiative. Our composite document representation consists of a
semantic representation (based on the lexical chains derived from
a text) and a syntactic representation (using proper nouns). Using
the TDT1 evaluation methodology, we evaluate a number of
document representation combinations using these document
classifiers.
1. A DOCUMENT REPRESENTATION
STRATEGY USING LEXICAL CHAINS
The purpose of any document representation strategy in IR is twofold. Firstly, the efficiency of the IR process (i.e. filtering, clustering etc.) improves greatly when a smaller dimensionality than full dimensionality is imposed on the word space of a document. Secondly, the effectiveness of the IR task may also improve, as more pertinent features are retained to describe document content and noisy features are removed. Some form of feature selection is therefore an essential part of almost any IR process, and the most common approach is based on the gathering of corpus statistics, i.e. analyzing document content in terms of word distributions within the corpus. However, in this paper we explore an alternative, non-statistical approach to feature selection based on the identification of lexical chains using an online lexical taxonomy called WordNet [3].
When reading any text, it is obvious that it is not merely made up of a set of unrelated sentences; rather, these sentences are connected to each other in one of two ways: cohesion and coherence. Morris and Hirst [2] describe textual cohesion as the way in which text tends to hang together. They found that the
cohesive structure of a text can be explored and represented by
creating sequences of semantically related words called lexical
chains. For example, in a document concerning airplanes, a typical
chain might consist of the following words {plane, airplane, pilot,
cockpit, airhostess, wing, engine}, where each word in the chain is
directly or indirectly related to another word by a semantic
relationship such as holonymy, hyponymy, meronymy and
hypernymy. Our feature selection criterion is based on the identification of these topics, or chains, in the text. We assume that words that fail to be chained do not take part in the overall cohesive structure of the text and hence are not essential topic descriptors. Another advantage of using chain words as document classifiers is that they address two linguistic problems associated with traditional syntactic representations: synonymy and polysemy. Firstly, WordNet allows us to represent synonymous
words like {car, automobile, motorcar} in terms of a single unique
identifier called a synset number. Secondly, when a polysemous
word such as bank is added to a lexical chain, its correct sense within the context of the document is discovered. In other words, during the chain formation process words are implicitly disambiguated as follows. Each term in a document is considered in order of occurrence. A word is added to an existing lexical chain if a semantic path (of predefined maximum length) between that word and a member of the chain exists in WordNet; otherwise the word becomes the seed of a new chain. A stronger criterion than simple semantic association is imposed on the addition of a term to a chain: terms must be added to the most recently updated (semantically related) chain. This favors the creation of lexical chains containing words that are in close proximity within the text, prompting the correct disambiguation of a word based on the context in which it was used.
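To make this procedure concrete, the following is a minimal sketch of such a greedy chainer, assuming NLTK's WordNet interface. The restriction to nouns, the semantic path length of one, and all function names are our own illustrative simplifications, not the exact configuration of the system described here.

from nltk.corpus import wordnet as wn

def neighbours(synset):
    # One-step WordNet relations: hypernymy, hyponymy, holonymy
    # and meronymy, plus the synset itself (synonymy).
    related = set(synset.hypernyms() + synset.hyponyms()
                  + synset.member_holonyms() + synset.part_holonyms()
                  + synset.member_meronyms() + synset.part_meronyms())
    related.add(synset)
    return related

def chain_words(nouns):
    # Greedy single-pass chainer. Chains are lists of (word, synset)
    # pairs; the most recently updated chain is kept at the front of
    # the list, so words attach preferentially to nearby context.
    chains = []
    for word in nouns:
        senses = wn.synsets(word, pos=wn.NOUN)
        if not senses:
            continue                        # unchainable in WordNet
        placed = False
        for chain in chains:                # recency order
            chain_senses = {s for _, s in chain}
            for sense in senses:
                if neighbours(sense) & chain_senses:
                    chain.append((word, sense))  # sense disambiguated
                    chains.remove(chain)
                    chains.insert(0, chain)      # refresh recency
                    placed = True
                    break
            if placed:
                break
        if not placed:
            chains.insert(0, [(word, senses[0])])  # seed a new chain
    return chains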
So the first element of our combined document representation is a chain word representation made up of WordNet synset numbers, which map all syntactic forms of a concept to a single unique number. The necessity of the second element of our combined document representation becomes apparent when we consider the descriptive power of proper nouns when representing events in a news story domain, and the absence of proper nouns from our chain word classifier. This deficiency is due to the failure of WordNet to make semantic associations between proper nouns and other word types in its taxonomy. Hence, to address this deficiency in our chain word representation, we identify proper nouns using a simple heuristic based on capitalization, and use these words as the syntactic element of our combined document representation.
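As an illustration, the two sub-vectors might be assembled as follows; the raw term-frequency weighting and the exact capitalization test shown here are assumptions for the sketch rather than the published system's weighting scheme.

from collections import Counter

SENTENCE_END = {'.', '!', '?'}

def composite_representation(tokens, chains):
    # Semantic sub-vector: one feature per WordNet synset number,
    # so all synonymous surface forms collapse to a single feature.
    chain_vector = Counter(sense.offset()
                           for chain in chains
                           for _, sense in chain)
    # Syntactic sub-vector: capitalised tokens that do not begin a
    # sentence are taken to be proper nouns (the simple heuristic;
    # the first token of the document is skipped by construction).
    noun_vector = Counter(tok for prev, tok in zip(tokens, tokens[1:])
                          if tok[:1].isupper()
                          and prev not in SENTENCE_END)
    return chain_vector, noun_vector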
2. DETECTION USING TWO CLASSIFIERS
Online Detection or First Story Detection (FSD) is in essence a
classification problem where documents arriving in chronological
order on the input stream are tagged with a YES flag if they discuss a previously unseen news event, or with a NO flag if they discuss an old news topic.
However, unlike detection in a retrospective environment, a story must be identified as novel before subsequent stories can be considered. The single-pass clustering algorithm bases its clustering methodology on the same assumption; its general structure is summarised as follows, with an illustrative sketch after the list. A more detailed analysis of the comparison and thresholding strategies defined in this algorithm is given in [4].
1. Convert the current document into a weighted chain word
vector and a weighted proper noun vector.
2. The first document on the input stream will become the first
cluster.
3. All subsequent incoming documents are compared with all
previously created clusters up to the current point in time. A
comparison strategy is used here to determine the extent of
the similarity between a document and a cluster. In our IR
model we use sub-vectors to describe our two distinct
document representations. This involves calculating the
closeness or similarity between the chain word vectors and
proper noun vectors for each document/cluster comparison
using the standard cosine similarity measure (used in this
variation of the vector space model to compute the cosine of
the angle between two weighted vectors). The data fusion
element of this experiment involves the combination of
similarity measures from two distinct representations of
document content in a single cluster run, i.e. k = 2 in equation (1). So the overall similarity between a document D and a cluster C is a linear combination of the similarities for each sub-vector, formally defined as:

   Sim(D, C) = \sum_{j=1}^{k} w_j \, Sim(D_j, C_j)    (1)

   where Sim(X, Y) is the cosine similarity measure for two vectors X and Y, and w_j is a coefficient that biases the weight of evidence each document representation j contributes to the overall similarity measure.
4. When the most similar cluster is found, the thresholding strategy is used to determine whether this similarity measure is high enough to warrant the addition of that document to the cluster and the classification of the current document as an
old event. If this document does not satisfy the similarity
condition set out by the thresholding methodology then the
document is declared to discuss a new event, and this
document will form the seed of a new cluster.
5. This clustering process will continue until all documents in
the input stream have been classified.
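The following is a minimal sketch of steps 1 to 5, assuming the (chain_vector, noun_vector) pairs built above. The threshold value and the rule for folding a document's vectors into its cluster are illustrative assumptions; the system's actual thresholding strategy is detailed in [4].

import math

def cosine(u, v):
    # Standard cosine of the angle between two sparse weighted
    # vectors, represented as feature -> weight mappings.
    dot = sum(w * v.get(f, 0.0) for f, w in u.items())
    norm = (math.sqrt(sum(w * w for w in u.values()))
            * math.sqrt(sum(w * w for w in v.values())))
    return dot / norm if norm else 0.0

def first_story_detection(documents, weights=(1.0, 1.0), threshold=0.2):
    # documents: iterable of (chain_vector, noun_vector) pairs in
    # chronological order. Returns one YES/NO flag per document,
    # YES meaning the document is declared a new event.
    clusters, flags = [], []
    for doc in documents:
        best, best_sim = None, 0.0
        for cluster in clusters:
            # Equation (1) with k = 2: a weighted linear combination
            # of the per-representation cosine similarities.
            sim = sum(w * cosine(d, c)
                      for w, d, c in zip(weights, doc, cluster))
            if sim > best_sim:
                best, best_sim = cluster, sim
        if best is not None and best_sim >= threshold:
            flags.append('NO')               # old event
            for d, c in zip(doc, best):      # fold doc into cluster
                for f, w in d.items():
                    c[f] = c.get(f, 0.0) + w
        else:
            flags.append('YES')              # new event: seed cluster
            clusters.append(tuple(dict(d) for d in doc))
    return flags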
3. THE DATA FUSION EXPERIMENT
Using the TDT1 corpus as input, the objective of this experiment
was to determine if the use of a combined representation (a lexical
chain and proper noun representation) would lead to improved
FSD performance compared with a singular document
representation using either proper nouns or chain words. Figure 1
is a Detection Error Tradeoff (DET) graph showing the impact of
our combined representation on detection. A DET graph
illustrates the tradeoff between misses and false alarms, where
points closer to the origin indicate better overall performance. As can be seen, the curve with the closest point to the origin is that of the LexDetect system, leading to the conclusion that a composite document representation using chain words and proper nouns marginally outperforms the systems that contain only one of these representations (CHAIN and P_NOUN). Optimal results for the LexDetect system in this experiment were achieved when both chain and proper noun representations were considered as equal evidence of similarity between two documents, i.e. w_j = 1 for both representations in equation (1).
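For reference, each point on such a DET curve can be computed from the system's YES/NO decisions as shown below; this is the standard miss/false-alarm calculation, not the official TDT1 scoring software. One run of the detector yields one operating point, and sweeping the detection threshold traces out the full curve.

def det_point(flags, truth):
    # flags: system output per story, 'YES' = flagged as a new event.
    # truth: gold labels, 'YES' = genuinely the first story of an
    # event. A miss is a first story the system failed to flag; a
    # false alarm is an old story the system flagged as new.
    firsts = sum(1 for t in truth if t == 'YES')
    olds = len(truth) - firsts
    misses = sum(1 for f, t in zip(flags, truth)
                 if t == 'YES' and f == 'NO')
    false_alarms = sum(1 for f, t in zip(flags, truth)
                       if t == 'NO' and f == 'YES')
    return (100.0 * misses / firsts if firsts else 0.0,
            100.0 * false_alarms / olds if olds else 0.0)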
4. CONCLUSIONS
A variety of techniques for data fusion have been proposed in the IR literature [5]. Results from data fusion research suggest that significant improvements in system effectiveness can be obtained by combining multiple index representations, query formulations and search strategies. In this paper we investigated whether improved FSD performance could be achieved when a composite document representation was used in this TDT task. Our results showed that a marginal increase in system effectiveness was achieved when lexical chain (semantic) representations were used in conjunction with proper noun (syntactic) representations. In
particular, we saw that the miss rate of our FSD system LexDetect
decreased with little or no impact to the false alarm rate of the
system.
5. ACKNOWLEDGMENTS
This project is funded by an Enterprise Ireland Grant, project
number [SC/1999/083].
6. REFERENCES
[1] Ron Papka, James Allan. Topic Detection and Tracking: Event Clustering as a Basis for First Story Detection. Kluwer Academic Publishers, 4:97-126, 2000.
[2] Jane Morris, Graeme Hirst. Lexical Cohesion Computed by Thesaural Relations as an Indicator of the Structure of Text. Computational Linguistics 17(1), March 1991.
[3] Christiane Fellbaum (ed.). WordNet: An Electronic Lexical Database. MIT Press, 1998.
[4] Nicola Stokes, Paula Hatch, Joe Carthy. Topic Detection, a New Application for Lexical Chaining? In Proceedings of the BCS-IRSG Colloquium 2000, pp. 94-103, 2000.
[5] W. B. Croft. Combining Approaches to Information Retrieval. In Advances in Information Retrieval, 1:1-36, Kluwer Academic Publishers, 2000.
Figure 1. The effect on performance when a combined document representation is used: DET curves plotting % Misses against % False Alarms for the P_NOUN, LexDetect and CHAIN systems.