Text Classification: Concepts and Methods

This document, "9- Text Classification_v2.pdf", provides a comprehensive introduction to text


classification, outlining its purpose, various applications, different approaches to building classifiers, and
methods for evaluating their performance [1-3].

The document begins by explaining why text categorization is important, highlighting two main
reasons: to categorize the content of the text and to categorize the author of the text [1].


Categorizing the content includes tasks like spam detection (binary classification), sentiment analysis
(binary or multiway), and topic classification (multiway) [1]. Examples of sentiment analysis include
classifying movie, restaurant, and product reviews as positive or negative, or on a scale of 1-5 stars, and
categorizing political arguments as pro, con, or neutral [1, 4]. Topic classification involves assigning
texts to predefined categories like sport, finance, or travel [1].


Categorizing the author involves tasks such as native language identification, diagnosis of disease
(psychiatric or cognitive impairments), and identification of characteristics like gender, dialect,
educational background, and political orientation, with applications in forensics, advertising, and
disinformation [4].

The document then presents a "How to Categorize?" section, illustrating the concept with an example of
a spam email [5-7]. It then discusses two main approaches to text classification [7]:


First Approach: Rules: This method involves extracting features from the text and building rules by
combining these features [7]. An example rule provided is "black-list-address OR (‘dollars’ AND ‘have
been selected’)" for spam detection [7]. While this approach can achieve high accuracy with a well-
designed rule base, it is noted that building and maintaining these rules is very time-consuming [7].


Second Approach: Supervised Learning: This approach uses a training set of hand-labeled documents
(d, c) to learn a classifier that can map a document (d) to its corresponding class (c) from a fixed set of
classes C [2]. Various supervised machine learning algorithms can be used, such as SVM, Decision Tree,
Random Forest, Naïve Bayes, and MaxEnt [2].

The document then focuses on the Naïve Bayes Classifier [2]. It explains that this classifier is based on
Bayes' Rule and aims to find the class (c) that maximizes the probability P(c|d), where d is the document
[2].


The formula for the Naïve Bayes Classifier is given as c = argmax over c ∈ C of P(x1, x2, …, xn | c) · P(c), where x1…xn represent the features of the document d [2].


P(c) is the prior probability of class c [8].


P(x1, x2, …, xn|c) is the probability of seeing the features in class c [8].

The document highlights two key independence assumptions made by the Naïve Bayes Classifier: the
Bag of Words assumption (position of words doesn't matter) and conditional independence (feature
probabilities P(xi|c) are independent given the class c) [8].


The formula for conditional independence is given as P(x1, x2, …, xn|c) = P(x1|c) ⋅ P(x2|c) ⋅ P(x3|c) ⋯
P(xn|c) [8]. Essentially, Naïve Bayes focuses on the count of each feature in each document [8].


An example of how word counts are used in spam detection is provided [9].


The document addresses the problem with Maximum Likelihood Estimation in Naïve Bayes, where
unseen words in a particular class would have a probability of zero, which can be problematic [9].
Laplace (add-1) smoothing is introduced as a solution to this, providing the formula for calculating
P(wk | cj) and P(cj) using smoothing [10].
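The slides' exact formulas are not reproduced in this summary; for reference, the standard add-one estimates take the form below, where count(w_k, c_j) is the number of occurrences of word w_k in the training documents of class c_j, |V| is the vocabulary size, and N_{c_j} of the N training documents belong to class c_j:

$$\hat{P}(w_k \mid c_j) = \frac{\mathrm{count}(w_k, c_j) + 1}{\sum_{w \in V} \mathrm{count}(w, c_j) + |V|}, \qquad \hat{P}(c_j) = \frac{N_{c_j}}{N}$$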


A detailed example of applying Multinomial Naïve Bayes to classify documents (Chinese vs. Japanese)
is provided, illustrating the calculation of conditional probabilities and priors with smoothing [10, 11].
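As a concrete illustration of this kind of calculation, the minimal Python sketch below trains a multinomial Naïve Bayes model with add-one smoothing and classifies a test document in log space. The toy corpus is invented for illustration and is not necessarily the exact example used in the slides.

```python
import math
from collections import Counter, defaultdict

# Illustrative toy corpus (not necessarily the slides' example):
# each item is (list of tokens, class label).
train = [
    ("Chinese Beijing Chinese".split(), "c"),
    ("Chinese Chinese Shanghai".split(), "c"),
    ("Chinese Macao".split(), "c"),
    ("Tokyo Japan Chinese".split(), "j"),
]
test_doc = "Chinese Chinese Chinese Tokyo Japan".split()

# Class document counts and per-class word counts.
class_docs = Counter(c for _, c in train)
word_counts = defaultdict(Counter)
vocab = set()
for tokens, c in train:
    word_counts[c].update(tokens)
    vocab.update(tokens)

V, N = len(vocab), len(train)

def log_posterior(tokens, c):
    """log P(c) + sum_k log P(w_k | c), with add-one (Laplace) smoothing."""
    log_prior = math.log(class_docs[c] / N)
    total = sum(word_counts[c].values())
    log_likelihood = sum(
        math.log((word_counts[c][w] + 1) / (total + V)) for w in tokens
    )
    return log_prior + log_likelihood

# argmax over classes of P(c) * prod_k P(w_k | c), computed in log space.
print(max(class_docs, key=lambda c: log_posterior(test_doc, c)))  # -> 'c'
```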


The advantages of Naïve Bayes are listed as ease of implementation, fast training and classification
(suitable for large datasets), good performance even with small datasets, and its utility as a baseline
method [11, 12].


The document also points out the problems with Naïve Bayes, primarily the "naive" assumption of
feature independence, which is often not true in reality. This can lead to overconfident predictions [12,
13].

The document then introduces Maximum Entropy (MaxEnt) classifiers as a "less naive approach" that
doesn't assume conditional independence and can potentially achieve better performance with enough
training data [13, 14].


MaxEnt models P(c|d) directly, unlike Naïve Bayes which uses Bayes' Rule [15].


Features in MaxEnt are often binary and class-specific. For example, if there are three classes and the
word 'ski' is a feature, there would be three binary features: contains('ski') & c=1, contains('ski') & c=2,
contains('ski') & c=3 [15, 16].


Each feature has a learned real-valued weight [16].


The training of MaxEnt models involves finding the optimal weights, which is more computationally
intensive than Naïve Bayes as it requires iterative optimization using gradient ascent [16, 17].
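A minimal sketch of this setup, using invented documents and feature names, is given below: each (feature, class) pair gets a real-valued weight, P(c|d) is a softmax over the summed weights of the active class-specific features, and the weights are learned by batch gradient ascent on the conditional log-likelihood (observed minus expected feature counts).

```python
import math
from collections import defaultdict

# Toy documents and classes (invented for illustration).
train = [
    ({"contains(ski)", "contains(snow)"}, "winter"),
    ({"contains(beach)", "contains(sun)"}, "summer"),
    ({"contains(ski)", "contains(ice)"}, "winter"),
    ({"contains(sun)", "contains(sand)"}, "summer"),
]
classes = ["winter", "summer"]

# One real-valued weight per (document feature, class) pair,
# i.e. the class-specific binary features described above.
weights = defaultdict(float)

def probs(doc_feats):
    """P(c | d): softmax over the summed weights of active features."""
    scores = {c: sum(weights[(f, c)] for f in doc_feats) for c in classes}
    z = sum(math.exp(v) for v in scores.values())
    return {c: math.exp(v) / z for c, v in scores.items()}

# Batch gradient ascent on the conditional log-likelihood:
# gradient = observed feature counts - expected feature counts.
lr = 0.5
for _ in range(100):
    grad = defaultdict(float)
    for feats, gold in train:
        p = probs(feats)
        for f in feats:
            grad[(f, gold)] += 1.0           # observed
            for c in classes:
                grad[(f, c)] -= p[c]         # expected under the model
    for key, g in grad.items():
        weights[key] += lr * g

print(probs({"contains(ski)"}))  # should strongly favour 'winter'
```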

The document mentions alternative feature values (binary) and feature sets (subset of vocabulary,
ignoring stopwords, using sentiment lexicons, incorporating n-grams, syntactic, and morphological
features) for both Naïve Bayes and MaxEnt [17, 18].

The section on Features Besides Unigrams discusses the heuristic feature selection problem and
mentions strategies like using a large set of features defined by a template, restricting to features that
seem useful in isolation, and adding features greedily while smoothing [18, 19]. It also provides a list of
features used by SpamAssassin as an example of real-world feature engineering for spam detection [19,
20].

The document then briefly touches upon Combining Cues via Naive Bayes, mentioning its use in
authorship attribution [21]. It also provides examples of Features in Sentiment Analysis, including
statistical features, syntax features (POS tags), and words from sentiment lexicons, giving examples of
positive and negative words [3, 21].

Finally, the document discusses Measuring Performance in text classification using Precision (good
messages kept / all messages kept) and Recall (good messages kept / all good messages) [3]. It explains
the trade-off between precision and recall based on the classification threshold and illustrates this with
curves [3, 22]. The concept of a point where precision equals recall is also mentioned [22].
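To make the trade-off concrete, the short sketch below (with invented message scores) computes precision and recall for a keep/discard decision at several thresholds; raising the threshold keeps fewer messages, which tends to raise precision and lower recall.

```python
def precision_recall(scored_messages, threshold):
    """scored_messages: list of (score, is_good); keep messages with score >= threshold."""
    kept = [(s, good) for s, good in scored_messages if s >= threshold]
    good_kept = sum(1 for _, good in kept if good)
    all_good = sum(1 for _, good in scored_messages if good)
    precision = good_kept / len(kept) if kept else 1.0
    recall = good_kept / all_good if all_good else 1.0
    return precision, recall

# Invented scores: (classifier score, whether the message is actually good).
msgs = [(0.9, True), (0.8, True), (0.7, False), (0.6, True), (0.4, False), (0.3, True)]
for t in (0.2, 0.5, 0.75):
    print(t, precision_recall(msgs, t))
```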

In summary, this document offers a well-structured and informative overview of text classification,
starting from its motivations and applications, delving into the details of two fundamental approaches
(rule-based and supervised learning), with a strong emphasis on the Naïve Bayes classifier and an
introduction to MaxEnt. It also covers crucial aspects like feature engineering and performance
evaluation, providing concrete examples and explanations that enhance understanding of the key
concepts in this field.

--------------------------------------------------------------------------------

Understanding Lexical Semantics: Relations, Senses, and Representation

This document, "8_lexicalsemantics_v2.pdf", provides a detailed introduction to the field of lexical


semantics [1]. It begins by defining semantics as the study of meaning in linguistic utterances and then
focuses on lexical semantics as the study of systematic meaning-related connections among lexemes
(lexical relations) and the internal meaning-related structure of individual lexemes (selectional
restrictions) [1]. The document outlines its content to cover lexical relations, selectional restrictions,
word sense disambiguation, computing related words, word representation, and applications in
information retrieval [1, 2].

The document thoroughly explains various lexical relations:


Homonymy, where words share a form but have unrelated meanings, is discussed with examples like
"bank" and "bat" [2]. It further distinguishes between homographs (same spelling, like "dove") and
homophones (same sound, different spelling, like "write/right") [2]. The document points out that
homonymy causes problems for NLP applications like information retrieval, machine translation, and
text-to-speech [2, 3].

Polysemy is presented as a phenomenon where a word has related meanings, using "bank" as an example
to illustrate the difference between the financial institution and the building housing it [3]. The document
notes that many non-rare words are polysemous and that there can be systematic polysemy, such as the
institution/building sense for words like "school," "university," and "hospital," as well as author/works
and tree/fruit relationships [3, 4]. Metonymy is also mentioned as a form of systematic polysemy [4].


Synonymy is defined as words having the same meaning in some or all contexts, with examples like
"filbert/hazelnut" and "big/large" [4, 5]. The document emphasizes that synonymy is a relation between
senses rather than just words and uses the example of "big" and "large" to show that they are not always
interchangeable [5, 6].


Antonymy is described as senses being opposites with respect to one feature of meaning, with examples
like "dark/light" and "short/long" [6]. It further elaborates that antonyms can define a binary opposition
or be at opposite ends of a scale and can also be reversives like "rise/fall" [7].


Hyponymy and hypernymy are explained as hierarchical relationships where a hyponym is a more
specific subclass of a hypernym (e.g., "car" is a hyponym of "vehicle," and "mango" is a hyponym of
"fruit") [7]. The document also distinguishes between hyponyms (classes) and instances (individuals)
within lexical databases like WordNet [8].

The document then introduces WordNet as a lexical database inspired by psycholinguistic theories,
establishing a network of lexical items and relationships for English and other languages [8, 9]. It
explains that a sense in WordNet is represented by a synset (a set of near-synonyms) along with a gloss
[10]. Examples of WordNet hierarchies and noun relations are provided [11]. The document also
mentions Python (NLTK) and Java libraries for accessing WordNet [12].
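As an illustration of the kind of access these libraries provide, a short NLTK session might look like the sketch below (assuming the WordNet data has been installed with nltk.download('wordnet')); the calls shown are standard NLTK WordNet corpus reader functions.

```python
from nltk.corpus import wordnet as wn  # requires: import nltk; nltk.download('wordnet')

# Each sense is a synset (a set of near-synonyms) plus a gloss.
for synset in wn.synsets('bank')[:3]:
    print(synset.name(), synset.lemma_names(), '-', synset.definition())

# Hypernym / hyponym relations from the noun hierarchy.
car = wn.synset('car.n.01')
print(car.hypernyms())     # more general synsets, e.g. a motor vehicle sense
print(car.hyponyms()[:3])  # more specific synsets
```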

Selectional restrictions are discussed as constraints that predicates impose on their arguments, using
examples like "read (human subject, textual object)" and "eat (animate subject)" [12]. The document
illustrates how selectional restrictions can be used to disambiguate word senses, such as the different
senses of "dish" based on the surrounding words [12-14]. It notes that phrase structure grammars can
implement selectional restrictions by creating ontologies and constraining rules [13, 15]. However, it also
points out problems with selectional restrictions, such as creating brittle grammars and sometimes being
too restrictive or not restrictive enough [16].

The document explores methods for disambiguation, including using selectional restrictions and lexical
relations, providing an example with "Vitamin_Pill" and "Publication Dietary" [14, 17]. It also discusses
measuring semantic relatedness, stating that it's important to quantify the strength of the relationship
between words [17]. Edge/node counting in WordNet is presented as one method, where shorter paths
indicate stronger relationships, but its shortcomings due to assumptions of equal edge length and
taxonomy density are noted [17-20]. Dictionary-based approaches like Lesk's method, which compare
definitions for overlap, are also mentioned [20].
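Both ideas can be tried directly with NLTK; the sketch below uses path_similarity for edge counting in the WordNet hierarchy and nltk.wsd.lesk for the gloss-overlap method (the example words are chosen for illustration, not taken from the slides).

```python
from nltk.corpus import wordnet as wn
from nltk.wsd import lesk

# Edge counting: shorter paths in the WordNet hierarchy give higher similarity.
car = wn.synset('car.n.01')
print(car.path_similarity(wn.synset('vehicle.n.01')))  # relatively high
print(car.path_similarity(wn.synset('fruit.n.01')))    # much lower

# Lesk: choose the sense whose gloss overlaps most with the context words.
context = 'I went to the bank to deposit some money'.split()
print(lesk(context, 'bank', pos='n'))  # returns a Synset; quality varies
```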

The document delves into Word Sense Disambiguation (WSD) using machine learning approaches
[21]. It explains that these methods learn classifiers based on feature vectors extracted from labeled or
unlabeled corpora [21]. Various input features for WSD are listed, including POS tags, surrounding
words, punctuation, partial parsing, and collocational information [21, 22]. Different types of classifiers,
such as Naïve Bayes, are briefly explained [23, 24].

Finally, the document discusses Machine Learning approaches to find related words, covering Latent
Semantic Analysis (LSA) and Latent Dirichlet Allocation (LDA) [24, 25]. It then provides a detailed
explanation of word embeddings, contrasting them with traditional methods like one-hot encoding and
co-occurrence matrices, highlighting their ability to capture semantic relationships in dense, low-
dimensional vectors [25-28]. The Word2Vec (CBoW and Skip-gram) and FastText models are
mentioned, noting FastText's advantage in handling word structure [28, 29]. The document concludes by
introducing Bidirectional Encoder Representations from Transformers (BERT) as a context-
sensitive word representation model that uses bidirectional Transformers and pre-training tasks like
Masked Language Modeling and Next Sentence Prediction [30-33].
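As a hedged illustration (not taken from the slides), the sketch below trains Skip-gram embeddings with the gensim library's Word2Vec implementation on a tiny invented corpus; real embeddings require far larger corpora, and the parameter values are arbitrary.

```python
from gensim.models import Word2Vec

# A tiny illustrative corpus; each sentence is a list of tokens.
sentences = [
    "the bank approved the loan".split(),
    "the river bank was muddy".split(),
    "she deposited money at the bank".split(),
    "we walked along the river".split(),
]

# sg=1 selects the Skip-gram variant; sg=0 would give CBoW.
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1, epochs=50)

print(model.wv['bank'][:5])                   # a dense, low-dimensional vector
print(model.wv.most_similar('bank', topn=3))  # nearest neighbours in vector space
```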

Overall, this document provides a comprehensive and well-structured overview of key concepts in lexical
semantics, ranging from traditional lexical relations and resources like WordNet to modern
computational approaches involving machine learning and deep learning for word representation and
sense disambiguation [1-34]. The inclusion of examples and clear explanations makes it a valuable
resource for understanding this area of natural language processing [2-4, 7, 10, 12-14, 18, 26, 29, 32, 34].

--------------------------------------------------------------------------------

Dependency Parsing: Approaches and Results

This document, "7_Dependency Parsing_v2.pdf", provides a comprehensive overview of dependency


parsing, covering its introduction, applications, properties, different approaches, and some results [1-3].

Overview of Dependency Parsing: The document starts by noting the increasing interest in
dependency-based approaches for syntactic parsing in recent years [1]. It highlights that dependency-
based methods are still less accessible than constituency-based methods for many researchers and
developers [1]. Dependency grammars are defined by lexical items linked by binary, asymmetrical
relations called dependencies [4]. Examples of dependency labels include nsubj (nominal subject), dobj
(direct object), nmod (nominal modifier), amod (adjectival modifier), nummod (numeric modifier), and
case [4], as well as ccomp (clausal complement), xcomp (open clausal complement), and aux (auxiliary)
[5].

Applications: The document mentions applications of dependency parsing such as machine translation
and building knowledge bases using relation extraction [5]. The general form of a dependency parse is
described as a graph G = (V, A), where V (vertices) usually contains one vertex per word of the sentence, and A (arcs) is a set of ordered pairs of vertices representing head-dependent relations [5, 6].

Properties: Key properties of dependency graphs are outlined, including being weakly connected,
acyclic (if i → j then not j →∗ i), and having a single head (if i → j, then not k → j, for any k ≠ i) [6].
The document also distinguishes between projective (no crossing dependencies) and non-projective
dependencies, illustrating them visually [6].

Approaches to Dependency Parsing: The document categorizes the main approaches into transition-
based, graph-based, and current approaches [1-3].

Transition-based parsing is explained as a method that makes decisions based on a sequence of
transitions (SHIFT, REDUCE, LEFT-ARC, RIGHT-ARC) when reading a sentence from left to right to
determine dependency relationships [2]. The Nivre algorithm is presented as a specific transition-based
parsing algorithm [2, 7-10]. The algorithm uses a configuration c = (Σ|s, b|B, A), where Σ|s is a stack of partially processed tokens with top element s, b|B is a buffer of unread tokens with front element b, and A is the set of dependency relations found so far [7]. The four transitions of the Nivre algorithm are detailed: SHIFT (moves the front word of the buffer onto the stack), RIGHT-ARC(lb) (adds a dependency relation, labelled lb, from the top of the stack to the front word of the buffer and pushes that word onto the stack), LEFT-ARC(lb) (adds a dependency relation, labelled lb, from the front word of the buffer to the top of the stack and pops the stack), and REDUCE (pops the stack) [8]. The document provides a step-by-step
example of how the Nivre algorithm parses the Vietnamese sentence "Tôi ăn cơm ở Bách_Khoa ." [9, 11-
18]. Various parsing algorithms like Nivre and Covington, and classifying methods like SVM and Neural
Networks, can be used within the transition-based framework [7].
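A minimal sketch of the configuration and the four transitions is shown below, with the classifier that chooses the next transition omitted; the transition sequence for the short English sentence is hypothetical, and preconditions on the transitions are not checked.

```python
class Configuration:
    """Stack, buffer and arc set for arc-eager transition-based parsing."""

    def __init__(self, words):
        self.stack = [0]                           # 0 is an artificial ROOT token
        self.buffer = list(range(1, len(words) + 1))
        self.arcs = set()                          # (head, label, dependent)

    def shift(self):
        # Move the front word of the buffer onto the stack.
        self.stack.append(self.buffer.pop(0))

    def right_arc(self, label):
        # Arc from the top of the stack to the front of the buffer,
        # then push that buffer word onto the stack.
        self.arcs.add((self.stack[-1], label, self.buffer[0]))
        self.stack.append(self.buffer.pop(0))

    def left_arc(self, label):
        # Arc from the front of the buffer to the top of the stack,
        # then pop the stack.
        self.arcs.add((self.buffer[0], label, self.stack.pop()))

    def reduce(self):
        # Pop the stack (its top already has a head).
        self.stack.pop()

# Hypothetical transition sequence for "I eat rice" (1=I, 2=eat, 3=rice).
c = Configuration("I eat rice".split())
c.shift()              # push 'I'
c.left_arc("nsubj")    # eat -> I
c.right_arc("root")    # ROOT -> eat
c.right_arc("dobj")    # eat -> rice
print(sorted(c.arcs))  # [(0, 'root', 2), (2, 'dobj', 3), (2, 'nsubj', 1)]
```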


Graph-based parsing aims to find the highest scoring dependency tree T for a given sentence S [9]. If
the sentence is unambiguous, T is the correct parse; if ambiguous, it's the highest scoring parse. Scores
are derived from weights on dependency edges learned from large dependency treebanks using machine
learning, making it a data-driven approach [9, 19]. Graph-based parsing can be mapped to finding a
maximum spanning tree (MST) in a fully connected graph where nodes are words and edges represent
potential dependencies [19]. The weight of a tree is the sum of the weights of its arcs, often using an arc-
factored model where weights depend on the end nodes and the link [19, 20]. The document describes a
variant of the Chu-Liu-Edmonds (CLE) algorithm for finding the MST, which involves greedily
selecting the maximum incoming arc for each node, handling cycles by contracting them, and recursively
applying the algorithm [20-22]. Features used for learning the weights in graph-based models include the
identity and POS tags of the head and dependent words, the dependency label, the direction of the label,
the sequence of POS tags between the words, the distance between them, and POS tags of neighboring
words [23].
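The sketch below illustrates only the greedy first phase of this algorithm: every word selects its highest-scoring incoming arc, and the chosen arcs are checked for a cycle; contraction and the recursive step are indicated by a comment, and the arc scores are invented.

```python
def greedy_heads(score):
    """score[(head, dep)] -> weight; nodes are 0 (ROOT) and 1..n (words)."""
    deps = {d for _, d in score}
    return {
        d: max((h for h, dd in score if dd == d), key=lambda h: score[(h, d)])
        for d in deps
    }

def find_cycle(heads):
    """Return a set of nodes forming a cycle among the chosen arcs, or None."""
    for start in heads:
        seen, node = [], start
        while node in heads and node not in seen:
            seen.append(node)
            node = heads[node]
        if node in seen:
            return set(seen[seen.index(node):])
    return None

# Invented arc-factored scores for a 3-word sentence.
scores = {(0, 1): 5, (2, 1): 8, (0, 2): 9, (1, 2): 4, (2, 3): 7, (1, 3): 3}
heads = greedy_heads(scores)
print(heads)              # {1: 2, 2: 0, 3: 2}
print(find_cycle(heads))  # None here; a cycle would be contracted and rescored
```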


Current approaches include end-to-end learning and joint learning [2, 9, 10, 24-26]. End-to-end
learning aims to train the feature extractor and the classifier together as a single model, avoiding the need for manual feature
engineering [24, 25]. Joint learning involves training models for multiple related tasks simultaneously,
such as POS tagging and dependency parsing, to leverage shared information and reduce overfitting [25,
26]. Recent research in joint learning utilizes BiLSTMs as input neural layers to generate word
embeddings and input representations for subsequent tasks [26].

Training Data and Evaluation: The document mentions that training data often follows the CoNLL
format, including information like word ID, word form, POS tag, head ID, and dependency label [10,
24]. Evaluation measures for dependency parsing include Unlabeled Attachment Score (UAS) and
Labeled Attachment Score (LAS) [27].
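A small sketch of how these two scores can be computed from gold and predicted (head, label) pairs is given below; the token data is invented and uses CoNLL-style 1-based head indexing with 0 for the root.

```python
def uas_las(gold, pred):
    """gold, pred: lists of (head_id, label), one entry per token."""
    assert len(gold) == len(pred)
    correct_heads = sum(g[0] == p[0] for g, p in zip(gold, pred))
    correct_both = sum(g == p for g, p in zip(gold, pred))
    return correct_heads / len(gold), correct_both / len(gold)

# Invented 5-token example: all heads correct, one label wrong.
gold = [(2, "nsubj"), (0, "root"), (2, "dobj"), (5, "case"), (2, "nmod")]
pred = [(2, "nsubj"), (0, "root"), (2, "nmod"), (5, "case"), (2, "nmod")]
print(uas_las(gold, pred))  # (1.0, 0.8)
```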

Some Results: The document presents experimental results on the BK Treebank, a Vietnamese
dependency treebank [27-29]. It shows the UAS and LAS achieved by different dependency parsing
methods, including transition-based (Malt Parser, Yara Parser, BiLSTM Transition) and graph-based
(BiLSTM Graph, jPTDP) approaches, both when the input text has and has not been pre-assigned with
POS tags [27, 29]. The jPTDP tool is noted for its ability to jointly learn POS tagging and dependency
parsing using neural networks [3, 28, 29].

In summary, the document provides a comprehensive introduction to dependency parsing, detailing its
principles, methodologies, and recent advancements, with a particular focus on transition-based and
graph-based approaches and their neural network-based extensions for end-to-end and joint learning. It
also presents results on a Vietnamese dependency treebank to illustrate the performance of different
parsing techniques.
