
3. Text Classification
Text classification is the task of assigning one or more categories to a given piece of
text from a larger set of possible categories. In the email spam–identifier example, we have
two categories—spam and non-spam— and each incoming email is assigned to one of these
categories.
Text classification is a special instance of the classification problem, where the input data
points are pieces of text and the goal is to categorize each piece of text into one or more buckets
(called classes) from a set of pre-defined buckets (classes). The “text” can be of arbitrary length: a
character, a word, a sentence, a paragraph, or a full document. Text classification is sometimes
also referred to as topic classification, text categorization, or document categorization.
Text classification can be further distinguished into three types based on the number of
categories involved: binary, multiclass, and multilabel classification.
➢ Binary classification: If the number of classes is two.
➢ Multiclass classification: If the number of classes is more than two.
➢ Multilabel classification: A document can have one or more labels/classes attached to
it.

Applications
Content classification and organization
This refers to the task of classifying/tagging large amounts of textual data. This, in turn,
is used to power use cases like content organization, search engines, and recommendation
systems, to name a few.

Customer support
Customers often use social media to express their opinions about and experiences of products
or services. Text classification is often used to identify the tweets that brands must respond to
(i.e., those that are actionable) and those that don’t require a response.

E-commerce
Customers leave reviews for a range of products on e-commerce websites like Amazon, eBay,
etc. An example use of text classification in this kind of scenario is to understand and analyze
customers’ perception of a product or service based on their comments. This is commonly
known as “sentiment analysis.”

Other applications
➢ Text classification is used in language identification, like identifying the language of
new tweets or posts. For example, Google Translate has an automatic language
identification feature.
➢ Authorship attribution, or identifying the unknown authors of texts from a pool of
authors, is another popular use case of text classification, and it’s used in a range of
fields from forensic analysis to literary studies.
➢ Text classification has been used in the recent past for triaging posts in an online support
forum for mental health services [4]. In the NLP community, annual competitions are
conducted (e.g., clpsych.org) for solving such text classification problems originating
from clinical research.
➢ In the recent past, text classification has also been used to segregate fake news from
real news.

A Pipeline for Building Text Classification Systems


One typically follows these steps when building a text classification system:
1. Collect or create a labeled dataset suitable for the task.
2. Split the dataset into two (training and test) or three parts: training, validation (i.e.,
development), and test sets, then decide on evaluation metric(s).
3. Transform raw text into feature vectors.
4. Train a classifier using the feature vectors and the corresponding labels from the
training set.
5. Using the evaluation metric(s) from Step 2, benchmark the model performance on the
test set.
6. Deploy the model to serve the real-world use case and monitor its performance.

Steps 3 through 5 are iterated on to explore different variants of features and classification
algorithms and their parameters and to tune the hyperparameters before proceeding to Step 6,
deploying the optimal model in production.
For evaluating classifiers specifically, the following are used more commonly:
➢ Classification accuracy
➢ Precision, recall
➢ F1 score
➢ Area under ROC curve.
Apart from these, when classification systems are deployed in real-world applications, key
performance indicators (KPIs) specific to a given business use case are also used to evaluate
their impact and return on investment (ROI).
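
As a quick illustration, the snippet below is a minimal sketch of computing these metrics with scikit-learn; y_true and y_pred are placeholder lists standing in for test-set labels and model predictions.

from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)

y_true = [0, 0, 1, 1, 0, 1]   # placeholder gold labels
y_pred = [0, 1, 1, 1, 0, 0]   # placeholder predictions

print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
print("F1 score :", f1_score(y_true, y_pred))
print("ROC AUC  :", roc_auc_score(y_true, y_pred))  # ideally computed on predicted probabilities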

One Pipeline, Many Classifiers


Let’s look at building text classifiers by altering Steps 3 through 5 in the pipeline and
keeping the remaining steps constant. A good dataset is a prerequisite to start using the pipeline.
When we say “good” dataset, we mean a dataset that is a true representation of the data we’re
likely to see in production. We’ll use some of the publicly available datasets for text
classification.
No single approach is known to work universally well on all kinds of data and all classification
problems. In the real world, we experiment with multiple approaches, evaluate them, and
choose one final approach to deploy in practice.
For the rest of this section, we’ll use the “Economic News Article Tone and Relevance” dataset
from Figure Eight to demonstrate text classification. It consists of 8,000 news articles annotated
with whether or not they’re relevant to the US economy (i.e., a yes/no binary classification).
The dataset is also imbalanced, with ~1,500 relevant and ~6,500 non-relevant articles, which
poses the challenge of guarding against learning a bias toward the majority category (in this
case, non-relevant articles). Clearly, learning what makes a news article relevant is more
challenging with this dataset than learning what makes one irrelevant. After all, just guessing that
everything is irrelevant already gives us 80% accuracy!
We’ll build classifiers using three well-known algorithms: Naive Bayes, logistic regression,
and support vector machines.

Naive Bayes Classifier


Naive Bayes is a probabilistic classifier that uses Bayes’ theorem to classify texts based on the
evidence seen in training data. It estimates the conditional probability of each feature of a given
text for each class based on the occurrence of that feature in that class and multiplies the
probabilities of all the features of a given text to compute the final probability of classification
for each class. Finally, it chooses the class with maximum probability. Although simple, Naive
Bayes is commonly used as a baseline algorithm in classification experiments.
Let’s walk through the key steps of an implementation of the pipeline described earlier for our
dataset. For this, we use a Naive Bayes implementation in scikit-learn. Once the dataset is
loaded, we split the data into train and test data, as shown in the code snippet below:
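
A minimal sketch of this split, assuming the dataset has been loaded into a pandas DataFrame named our_data with “text” and “relevance” columns (the names and label values are assumptions based on the dataset description):

from sklearn.model_selection import train_test_split

X = our_data["text"]                                    # the news article text
y = our_data["relevance"].map({"yes": 1, "no": 0})      # assumed label values in the dataset
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y)  # hold out 25% for testing
print(len(X_train), len(X_test))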
The code snippet below shows how to pre-process the text and convert the train and test data
into feature vectors using CountVectorizer in scikit-learn, which implements the BoW
approach:
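
This is a sketch; the clean() pre-processing function is a simple illustration (lowercasing and stripping non-letters) rather than the exact pre-processing from the original notebook.

import re
from sklearn.feature_extraction.text import CountVectorizer

def clean(doc):
    # lowercase and keep only letters and spaces
    return re.sub(r"[^a-z ]+", " ", doc.lower())

vect = CountVectorizer(preprocessor=clean, stop_words="english")
X_train_dtm = vect.fit_transform(X_train)   # learn the vocabulary on the training data only
X_test_dtm = vect.transform(X_test)         # reuse the same vocabulary for the test data
print(X_train_dtm.shape, X_test_dtm.shape)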

The code snippet below shows how to do the training and evaluation of a Naive Bayes classifier
with the features we extracted above:
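
A sketch continuing from the previous snippet (variable names are carried over):

from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, confusion_matrix

nb = MultinomialNB()
nb.fit(X_train_dtm, y_train)              # train on the BoW feature vectors
y_pred = nb.predict(X_test_dtm)           # predict labels for the test set
print("Accuracy:", accuracy_score(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))   # rows: true classes, columns: predicted classes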
Let’s try to improve the classification performance by addressing some of its possible causes.
One way is to reduce noise in the feature vectors by restricting the vocabulary to the most
frequent terms. Let’s change the CountVectorizer instantiation accordingly, as shown in the
code snippet below, and repeat all the steps:
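
A sketch of that change, keeping only the 5,000 most frequent terms (the exact CountVectorizer settings in the original notebook may differ):

vect = CountVectorizer(preprocessor=clean, max_features=5000)  # restrict the vocabulary size
X_train_dtm = vect.fit_transform(X_train)
X_test_dtm = vect.transform(X_test)

nb = MultinomialNB()
nb.fit(X_train_dtm, y_train)
y_pred = nb.predict(X_test_dtm)
print("Accuracy:", accuracy_score(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))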

Now, clearly, while the average performance seems lower than before, the correct identification
of relevant articles increased by over 20%.

Logistic Regression
Logistic regression is an example of a discriminative classifier and is commonly used in text
classification, as a baseline in research, and as an MVP in real-world industry scenarios. Unlike
Naive Bayes, which estimates probabilities based on feature occurrence in classes, logistic
regression “learns” the weights of individual features based on how important they are to the
classification decision. The goal of logistic regression is to learn a linear separator between
classes in the training data with the aim of maximizing the probability of the data. This
“learning” of feature weights and the probability distribution over all classes is done through
the logistic function (hence the name logistic regression).
Let’s take the 5,000-dimensional feature vector from the last step of the Naive Bayes example
and train a logistic regression classifier instead of Naive Bayes. The code snippet below shows
how to use logistic regression for this task:
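
A sketch reusing the 5,000-dimensional feature vectors from the previous step; max_iter is an added assumption to ensure convergence:

from sklearn.linear_model import LogisticRegression

logreg = LogisticRegression(class_weight="balanced", max_iter=1000)
logreg.fit(X_train_dtm, y_train)
y_pred = logreg.predict(X_test_dtm)
print("Accuracy:", accuracy_score(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))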

This results in a classifier with an accuracy of 73.7%. Our logistic regression classifier
instantiation has an argument class_weight, which is given a value “balanced”. This tells the
classifier to boost the weights for classes in inverse proportion to the number of samples for
that class. So, we expect to see better performance for the less-represented classes. We can
experiment with this code by removing that argument and retraining the classifier, to witness a
fall (by approximately 5%) in the bottom-right cell of the confusion matrix. However, logistic
regression clearly seems to perform worse than Naive Bayes for this dataset.
“A general rule of thumb when working with ML approaches is that there is no one algorithm
that learns well on all datasets. A common approach is to experiment with various algorithms
and compare them.”
Support Vector Machine
A support vector machine (SVM), first invented in the early 1960s, is a discriminative classifier
like logistic regression. However, unlike logistic regression, it aims to look for an optimal
hyperplane in a higher dimensional space, which can separate the classes in the data by a
maximum possible margin. Further, SVMs are capable of learning even non-linear separations
between classes, unlike logistic regression. However, they may also take longer to train.
SVMs come in various flavors in sklearn. Let’s see how one of them is used by keeping
everything else the same and altering maximum features to 1,000 instead of the previous
example’s 5,000. We restrict to 1,000 features, keeping in mind the time an SVM algorithm
takes to train. The code snippet below shows how to do this:
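
A sketch using sklearn’s LinearSVC; the variable names with the _1k suffix are introduced here to keep the 1,000-feature vectors separate from the earlier 5,000-feature ones:

from sklearn.svm import LinearSVC

vect_1k = CountVectorizer(preprocessor=clean, max_features=1000)
X_train_1k = vect_1k.fit_transform(X_train)
X_test_1k = vect_1k.transform(X_test)

svm = LinearSVC(class_weight="balanced")
svm.fit(X_train_1k, y_train)
y_pred = svm.predict(X_test_1k)
print("Accuracy:", accuracy_score(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))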
When compared to logistic regression, SVMs seem to have done better with the relevant
articles category, although, among this small set of experiments we did, Naïve Bayes, with the
smaller set of features, seems to be the best classifier for this dataset.

Using Neural Embeddings in Text Classification


The advantage of using embedding-based features is that they create a dense, low-
dimensional feature representation instead of the sparse, high-dimensional structure of
BoW/TF-IDF and other such features. There are different ways of designing and using features
based on neural embeddings.
Words and n-grams have been used primarily as features in text classification for a long time.
Different ways of vectorizing words have been proposed, and we used one such representation
in the last section, CountVectorizer. In the past few years, neural network–based architectures
have become popular for “learning” word representations, which are known as “word
embeddings.” We’ll use the sentiment-labeled sentences dataset from the UCI repository,
consisting of 1,500 positive-sentiment and 1,500 negative-sentiment sentences from Amazon,
Yelp, and IMDB. All the steps are detailed in the notebook Ch4/Word2Vec_Example.ipynb.
Loading and pre-processing the text data remains a common step. However, instead of
vectorizing the texts using BoW-based features, we’ll now rely on neural embedding models.
The following code snippet shows how to load this model into Python using gensim:
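
A sketch assuming the pre-trained Google News Word2vec vectors have already been downloaded to the path shown (a hypothetical local path):

from gensim.models import KeyedVectors

path = "GoogleNews-vectors-negative300.bin.gz"            # hypothetical local path
w2v_model = KeyedVectors.load_word2vec_format(path, binary=True)
print("Vocabulary size:", len(w2v_model.key_to_index))    # attribute name in gensim 4.x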
This is a large model that can be seen as a dictionary where the keys are words in the vocabulary
and the values are their learned embedding representations. Given a query word, if the word’s
embedding is present in the dictionary, it will return that embedding. How do we use this pre-learned
embedding to represent features? There are multiple ways of doing this. A simple approach is
just to average the embeddings for individual words in text. The code snippet below shows a
simple function to do this:
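
A simple sketch of such a function; it expects a list of tokenized texts and the loaded Word2vec model:

import numpy as np

DIMENSION = 300   # dimensionality of the pre-trained Word2vec vectors

def embedding_features(list_of_token_lists, model):
    feats = []
    for tokens in list_of_token_lists:
        vec = np.zeros(DIMENSION)
        count = 0
        for token in tokens:
            if token in model:        # ignore out-of-vocabulary words
                vec += model[token]
                count += 1
        if count:
            vec /= count              # average over the in-vocabulary words
        feats.append(vec)
    return feats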

Note that it uses embeddings only for the words that are present in the dictionary. It ignores the
words for which embeddings are absent. Also, note that the above code will give a single vector
with DIMENSION(=300) components. We treat the resulting embedding vector as the feature
vector that represents the entire text. Once this feature engineering is done, the final step is
similar to what we did in the previous section: use these features and train a classifier.
If the overlap between the vocabulary of our custom domain and that of pre-trained word
embeddings is greater than 80%, pre-trained word embeddings tend to give good results in text
classification.
An important factor to consider when deploying models with embedding-based feature
extraction approaches is that the learned or pre-trained embedding models have to be stored
and loaded into memory while using these approaches. If the model itself is bulky (e.g., the
pre-trained model we used takes 3.6 GB), we need to factor this into our deployment needs.

Subword Embeddings and fastText


Word embeddings, as the name indicates, are about word representations. Even off-the-shelf
embeddings seem to work well on classification tasks, as we saw earlier. However, if a word
in our dataset was not present in the pre-trained model’s vocabulary, how will we get a
representation for this word? This problem is popularly known as out of vocabulary (OOV). In
our previous example, we just ignored such words during feature extraction. fastText addresses
this problem by working with subword units: the embedding of each word is represented as the
sum of the embeddings of its individual character n-grams. While this may seem like a longer
process compared to estimating word-level embeddings directly, it has two advantages:
➢ This approach can handle words that did not appear in training data (OOV).
➢ The implementation facilitates extremely fast learning on even very large corpora.
While fastText is a general-purpose library to learn the embeddings, it also supports off-the-
shelf text classification by providing end-to-end classifier training and testing; i.e., we don’t
have to handle feature extraction separately. The remaining part of this subsection shows how
to use the fastText classifier for text classification.
The training and test sets are provided as CSV files in this dataset. So, the first step involves
reading these files into your Python environment and cleaning the text to remove extraneous
characters, similar to what we did in the pre-processing steps for the other classifier examples
we’ve seen so far. Once this is done, the process to use fastText is quite simple. The code
snippet below shows a simple fastText model. The step-by-step process is detailed in the
associated Jupyter notebook (Ch4/Fast‐Text_Example.ipynb):
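
A minimal sketch using the fasttext Python package; the file names and hyperparameters are assumptions, and the input files are expected to contain one example per line with the label prefixed by __label__:

import fasttext

# train/test files in the format: "__label__<class> <cleaned text>"
model = fasttext.train_supervised(input="temp_train.txt",
                                  lr=1.0, epoch=25, wordNgrams=2)
model.save_model("temp")                              # the saved model can be bulky (see below)

n, precision, recall = model.test("temp_test.txt")    # returns (N, precision@1, recall@1)
print(n, precision, recall)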

If we run this code in the notebook, we’ll notice that, despite the fact that this is a huge dataset
and we gave the classifier raw text and not the feature vector, the training takes only a few
seconds, and we get close to 98% precision and recall! As an exercise, try to build a classifier
using the same dataset but with either BoW or word embedding features and algorithms like
logistic regression. Notice how long it takes for the individual steps of feature extraction and
classification learning!
When we have a large dataset, and when learning seems infeasible with the approaches
described so far, fastText is a good option to use to set up a strong working baseline. However,
there’s one concern to keep in mind when using fastText, as was the case with Word2vec
embeddings: the model size. fastText learns character n-gram embeddings, so when we save the trained
model, it carries the entire character n-gram embeddings dictionary with it. This results in a
bulky model and can result in engineering issues. For example, the model stored with the name
“temp” in the above code snippet has a size close to 450 MB. However, fastText
implementation also comes with options to reduce the memory footprint of its classification
models with minimal reduction in classification performance. It does this by doing vocabulary
pruning and using compression algorithms. Exploring these possibilities could be a good option
in cases where large model sizes are a constraint.

Document Embeddings
In the Doc2vec embedding scheme, we learn a direct representation for the entire document
(sentence/paragraph) rather than each word. Just as we used word and character embeddings
as features for performing text classification, we can also use Doc2vec as a feature
representation mechanism. Since there are no existing pre-trained models that work with the
latest version of Doc2vec, we’ll train our own Doc2vec model for this example.
We’ll use a dataset called “Sentiment Analysis: Emotion in Text” from figure-eight.com, which
contains 40,000 tweets labeled with 13 labels signifying different emotions. Let’s take the three
most frequent labels in this dataset—neutral, worry, happiness—and build a text classifier for
classifying new tweets into one of these three classes. The notebook for this subsection
(Ch4/Doc2Vec_Example.ipynb) walks you through the steps involved in using Doc2vec for
text classification and provides the dataset.
After loading the dataset and taking a subset of the three most frequent labels, an important
step to consider here is pre-processing the data. There are a few things that are different about
tweets compared to news articles or other such text, as we briefly discussed in Chapter 2 when
we talked about text pre-processing. First, they are very short. Second, our traditional
tokenizers may not work well with tweets, splitting smileys, hashtags, Twitter handles, etc.,
into multiple tokens. Such specialized needs prompted a lot of research into NLP for Twitter
in the recent past, which resulted in several pre-processing options for tweets. One such
solution is a TweetTokenizer, implemented in the NLTK [21] library in Python. We’ll discuss
more on this topic in Chapter 8. For now, let’s see how we can use a TweetTokenizer in the
following code snippet:
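
A small sketch of tweet tokenization with NLTK; strip_handles and preserve_case are optional arguments used here for illustration:

from nltk.tokenize import TweetTokenizer

tweeter = TweetTokenizer(strip_handles=True, preserve_case=False)
tokens = tweeter.tokenize("@user Loving the new phone #happy :)")
print(tokens)   # keeps "#happy" and ":)" as single tokens and drops the handle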

The next step in this process is to train a Doc2vec model to learn tweet representations. Ideally,
any large dataset of tweets will work for this step. However, since we don’t have such a ready-
made corpus, we’ll split our dataset into train-test and use the training data for learning the
Doc2vec representations. The first part of this process involves converting the data into a
format readable by the Doc2vec implementation, which can be done using the
TaggedDocument class. It’s used to represent a document as a list of tokens, followed by a
“tag,” which in its simplest form can be just the filename or ID of the document. However,
Doc2vec by itself can also be used as a nearest neighbor classifier for both multiclass and
multilabel classification problems. Let’s now see how to train a Doc2vec classifier for
tweets through the code snippet below:
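
A sketch assuming train_tweets is a list of tokenized training tweets (token lists) produced with the TweetTokenizer above; the hyperparameter values are illustrative:

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# wrap each tokenized tweet with a unique tag, as Doc2vec expects
train_docs = [TaggedDocument(words=tokens, tags=[str(i)])
              for i, tokens in enumerate(train_tweets)]

d2v_model = Doc2Vec(vector_size=50, alpha=0.025, min_count=5, dm=1, epochs=100)
d2v_model.build_vocab(train_docs)                     # build the vocabulary from the corpus
d2v_model.train(train_docs, total_examples=d2v_model.corpus_count,
                epochs=d2v_model.epochs)
d2v_model.save("d2v.model")                           # persist the model for feature inference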
Training for Doc2vec involves making several choices regarding parameters, as seen in the
model definition in the code snippet above.
➢ vector_size refers to the dimensionality of the learned embeddings.
➢ alpha is the learning rate.
➢ min_count is the minimum frequency of words that remain in the vocabulary.
➢ dm, which stands for distributed memory, is one of the representation learners
implemented in Doc2vec (the other is dbow, or distributed bag of words).
➢ epochs is the number of training iterations.
There are a few other parameters that can be customized. While there are some guidelines on
choosing optimal parameters for training Doc2vec models, these are not exhaustively validated,
and we don’t know if the guidelines work for tweets.
The best way to address this issue is to explore a range of values for the ones that matter to us
(e.g., dm versus dbow, vector sizes, learning rate) and compare multiple models. One way to
do it is to start using these learned representations in a downstream task—in this case, text
classification. Doc2vec’s infer_vector function can be used to infer the vector representation
for a given text using a pre-trained model. Since there is some amount of randomness due to
the choice of hyperparameters, the inferred vectors differ each time we extract them. For this
reason, to get a stable representation, we run it multiple times (called steps) and aggregate the
vectors. Let’s use the learned model to infer features for our data and train a logistic regression
classifier:
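
A sketch under the same assumptions; in gensim 4.x the argument controlling these repeated inference passes is called epochs (it was called steps in older versions), and train_labels/test_labels are assumed to come from the earlier train-test split:

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

train_vectors = [d2v_model.infer_vector(tokens, epochs=50) for tokens in train_tweets]
test_vectors = [d2v_model.infer_vector(tokens, epochs=50) for tokens in test_tweets]

clf = LogisticRegression(class_weight="balanced", max_iter=1000)
clf.fit(train_vectors, train_labels)
print(classification_report(test_labels, clf.predict(test_vectors)))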

An important point to keep in mind when using Doc2vec is the same as for fastText: if we have
to use Doc2vec for feature representation, we have to store the model that learned the
representation. While it’s not typically as bulky as fastText, it’s also not as fast to train. Such
trade-offs need to be considered and compared before we make a deployment decision.
Deep Learning for Text Classification
Deep learning is a family of machine learning algorithms where the learning happens
through different kinds of multilayered neural network architectures. Over the past few years,
it has shown remarkable improvements on standard machine learning tasks, such as image
classification, speech recognition, and machine translation. This has resulted in widespread
interest in using deep learning for various tasks, including text classification.
Two of the most commonly used neural network architectures for text classification are
convolutional neural networks (CNNs) and recurrent neural networks (RNNs). Long short-
term memory (LSTM) networks are a popular form of RNNs. Recent approaches also involve
starting with large, pre-trained language models and fine-tuning them for the task at hand. The
first step toward training any ML or DL model is to define a feature representation. This step
has been relatively straightforward in the approaches we’ve seen so far, with BoW or
embedding vectors. The steps involved in converting training and test data into a format
suitable for the neural network input layers are as follows:
1. Tokenize the texts and convert them into word index vectors.
2. Pad the text sequences so that all text vectors are of the same length.
3. Map every word index to an embedding vector. We do that by multiplying word index
vectors with the embedding matrix. The embedding matrix can either be populated
using pre-trained embeddings or it can be trained for embeddings on this corpus.
4. Use the output from Step 3 as the input to a neural network architecture.
Once these are done, we can proceed with the specification of neural network architectures and
training classifiers with them. The Jupyter notebook associated with this section
(Ch4/DeepNN_Example.ipynb) will walk you through the entire process from text pre-
processing to neural network training and evaluation. We’ll use Keras, a Python-based DL
library. The code snippet below illustrates Steps 1 and 2:
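
A sketch of Steps 1 and 2 with Keras utilities; train_texts/test_texts are assumed lists of raw strings, and the two constants are illustrative choices:

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

MAX_NUM_WORDS = 20000          # size of the vocabulary to keep
MAX_SEQUENCE_LENGTH = 1000     # every text is padded/truncated to this length

tokenizer = Tokenizer(num_words=MAX_NUM_WORDS)
tokenizer.fit_on_texts(train_texts)                        # build the word index on training data
train_sequences = tokenizer.texts_to_sequences(train_texts)
test_sequences = tokenizer.texts_to_sequences(test_texts)

x_train = pad_sequences(train_sequences, maxlen=MAX_SEQUENCE_LENGTH)
x_test = pad_sequences(test_sequences, maxlen=MAX_SEQUENCE_LENGTH)
word_index = tokenizer.word_index                          # word -> integer index mapping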

Step 3: If we want to use pre-trained embeddings to convert the train and test data into an
embedding matrix like we did in the earlier examples with Word2vec and fastText, we have to
download them and use them to convert our data into the input format for the neural networks.
The following code snippet shows an example of how to do this using GloVe embeddings.
GloVe embeddings come with multiple dimensionalities, and we chose 100 as our dimension
here. The value of dimensionality is a hyperparameter, and we can experiment with other
dimensions as well:
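
A sketch assuming the glove.6B.100d.txt file has already been downloaded into the working directory:

import numpy as np

EMBEDDING_DIM = 100
embeddings_index = {}
with open("glove.6B.100d.txt", encoding="utf-8") as f:
    for line in f:
        values = line.split()
        embeddings_index[values[0]] = np.asarray(values[1:], dtype="float32")

# row i of the matrix holds the GloVe vector of the word with index i
embedding_matrix = np.zeros((len(word_index) + 1, EMBEDDING_DIM))
for word, i in word_index.items():
    vector = embeddings_index.get(word)
    if vector is not None:            # words not covered by GloVe stay all-zeros
        embedding_matrix[i] = vector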

Step 4: Now, we’re ready to train DL models for text classification! DL architectures consist
of an input layer, an output layer, and several hidden layers in between the two. Depending on
the architecture, different hidden layers are used. The input layer for textual input is typically
an embedding layer. The output layer, especially in the context of text classification, is a
softmax layer with categorical output. If we want to train the input layer instead of using pre-
trained embeddings, the easiest way is to call the Embedding layer class in Keras, specifying
the input and output dimensions. However, since we want to use pre-trained embeddings, we
should create a custom embedding layer that uses the embedding matrix we just built. The
following code snippet shows how to do that:
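
A sketch of such a layer, initialized with the embedding matrix built above and kept frozen during training (argument names follow tf.keras 2.x):

from tensorflow.keras.layers import Embedding

embedding_layer = Embedding(input_dim=len(word_index) + 1,
                            output_dim=EMBEDDING_DIM,
                            weights=[embedding_matrix],      # initialize with the GloVe vectors
                            input_length=MAX_SEQUENCE_LENGTH,
                            trainable=False)                 # keep the pre-trained weights fixed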

This will serve as the input layer for any neural network we want to use (CNN or LSTM).

CNNs for Text Classification


CNNs typically consist of a series of convolution and pooling layers as the hidden layers. In
the context of text classification, CNNs can be thought of as learning the most useful bag-of-
words/n-grams features instead of taking the entire collection of words/n-grams as features.
Since our dataset has only two classes—positive and negative—the output layer has two
outputs, with the softmax activation function. We’ll define a CNN with three convolution-
pooling layers using the Sequential model class in Keras, which allows us to specify DL models
as a sequential stack of layers—one after another. Once the layers and their activation functions
are specified, the next task is to define other important parameters, such as the optimizer, loss
function, and the evaluation metric to tune the hyperparameters of the model. Once all this is
done, the next step is to train and evaluate the model. The following code snippet shows one
way of specifying a CNN architecture for this task using the Python library Keras and prints
the results with the IMDB dataset for this model:
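
A sketch of one such architecture; the layer sizes, optimizer, and number of epochs are illustrative choices, and y_train/y_test are assumed to be integer class labels:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv1D, MaxPooling1D, GlobalMaxPooling1D, Dense

cnnmodel = Sequential()
cnnmodel.add(embedding_layer)                    # input layer: frozen pre-trained embeddings
cnnmodel.add(Conv1D(128, 5, activation="relu"))
cnnmodel.add(MaxPooling1D(5))
cnnmodel.add(Conv1D(128, 5, activation="relu"))
cnnmodel.add(MaxPooling1D(5))
cnnmodel.add(Conv1D(128, 5, activation="relu"))
cnnmodel.add(GlobalMaxPooling1D())
cnnmodel.add(Dense(128, activation="relu"))
cnnmodel.add(Dense(2, activation="softmax"))     # two output classes: positive and negative

cnnmodel.compile(loss="sparse_categorical_crossentropy",
                 optimizer="rmsprop", metrics=["acc"])
cnnmodel.fit(x_train, y_train, batch_size=128, epochs=1, validation_split=0.1)
print(cnnmodel.evaluate(x_test, y_test))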

While there are some commonly recommended options for these, there’s no consensus on one
combination that works best for all datasets and problems. A good approach while building
your models is to experiment with different settings (i.e., hyperparameters). Keep in mind that
all these decisions come with some associated cost. For example, in practice, the number of
epochs is usually set to 10 or above, but that also increases the amount of time it takes to train
the model. Another thing to note is that, if you want to train an embedding layer instead of
using pre-trained embeddings in this model, the only thing that changes is the line
cnnmodel.add(embedding_layer). Instead, we can specify a new embedding layer as, for
example, cnnmodel.add(Embedding(Param1, Param2)). The code snippet below shows the
code and model performance for the same:
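
A sketch of that variant; the 128-dimensional trainable embedding is an illustrative choice:

cnnmodel2 = Sequential()
cnnmodel2.add(Embedding(MAX_NUM_WORDS, 128))     # embeddings learned from scratch with the task
cnnmodel2.add(Conv1D(128, 5, activation="relu"))
cnnmodel2.add(MaxPooling1D(5))
cnnmodel2.add(Conv1D(128, 5, activation="relu"))
cnnmodel2.add(GlobalMaxPooling1D())
cnnmodel2.add(Dense(2, activation="softmax"))
cnnmodel2.compile(loss="sparse_categorical_crossentropy",
                  optimizer="rmsprop", metrics=["acc"])
cnnmodel2.fit(x_train, y_train, batch_size=128, epochs=1, validation_split=0.1)
print(cnnmodel2.evaluate(x_test, y_test))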

If we run this code in the notebook, we’ll notice that, in this case, training the embedding layer
on our own dataset seems to result in better classification on test data. However, if the training
data were substantially smaller, sticking to the pre-trained embeddings or using domain
adaptation techniques would be a better choice.
LSTMs for Text Classification
LSTMs and other variants of RNNs in general have become the go-to way of doing neural
language modeling in the past few years. This is primarily because language is sequential in
nature and RNNs are specialized in working with sequential data. The current word in the
sentence depends on its context—the words before and after. However, when we model text
using CNNs, this crucial fact is not taken into account. RNNs work on the principle of using
this context while learning the language representation or a model of language. Hence, they’re
known to work well for NLP tasks. There are also CNN variants that can take such context into
account, and CNNs versus RNNs is still an open area of debate. Now that we’ve already seen
one neural network in action, it’s relatively easy to train another! Just replace the convolutional
and pooling parts with an LSTM in the prior two code examples. The following code snippet
shows how to train an LSTM model using the same IMDB dataset for text classification:
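
A sketch of such a model, reusing the same embedding layer and padded data; the dropout values and optimizer are illustrative choices:

from tensorflow.keras.layers import LSTM

rnnmodel = Sequential()
rnnmodel.add(embedding_layer)                                # frozen pre-trained embeddings
rnnmodel.add(LSTM(128, dropout=0.2, recurrent_dropout=0.2))  # a single LSTM layer
rnnmodel.add(Dense(2, activation="softmax"))
rnnmodel.compile(loss="sparse_categorical_crossentropy",
                 optimizer="adam", metrics=["acc"])
rnnmodel.fit(x_train, y_train, batch_size=32, epochs=1, validation_split=0.1)
print(rnnmodel.evaluate(x_test, y_test))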

Notice that this code took much longer to run than the CNN example. While LSTMs are more
powerful in utilizing the sequential nature of text, they’re much more data hungry as compared
to CNNs. Thus, the relatively lower performance of the LSTM on a dataset need not necessarily
be interpreted as a shortcoming of the model itself. It’s possible that the amount of data we
have is not sufficient to utilize the full potential of an LSTM. As in the case of CNNs, several
parameters and hyperparameters play important roles in model performance, and it’s always a
good practice to explore multiple options and compare different models before finalizing on
one.

Interpreting Text Classification Models


In most real-world use cases of text classification, we simply consume the classifier’s output
and don’t question its decisions. Take spam classification: we
generally don’t look for explanations of why a certain email is classified as spam or regular
email. However, there may be scenarios where such explanations are necessary.
Consider a scenario where we developed a classifier that identifies abusive comments on a
discussion forum website. The classifier identifies comments that are objectionable/abusive
and performs the job of a human moderator by either deleting them or making them invisible
to users. We know that classifiers aren’t perfect and can make errors. What if the commenter
questions this moderation decision and asks for an explanation? Some method to “explain” the
classification decision by pointing to which feature’s presence prompted such a decision can
be useful in such cases. Such a method is also useful to provide some insights into the model
and how it may perform on real-world data (instead of train/test sets), which may result in
better, more reliable models in the future.
As ML models started getting deployed in real-world applications, interest in the direction of
model interpretability grew. Recent research resulted in usable tools for interpreting model
predictions (especially for classification). Lime is one such tool that attempts to interpret a
black-box classification model by approximating it with a linear model locally around a given
training instance. The advantage of this is that such a linear model is expressed as a weighted
sum of its features and is easy to interpret for humans.
For example, if there are two features, f1 and f2, for a given test instance of a binary
classifier with classes A and B, a Lime linear model around this instance could be
something like -0.3 × f1 + 0.4 × f2 with a prediction B. This indicates that the presence
of feature f1 will negatively affect this prediction (by 0.3) and skew it toward A.

Explaining Classifier Predictions with Lime


The following code snippet uses the logistic regression model we built earlier using the
“Economic News Article Tone and Relevance” dataset, which classifies a given news article as
being relevant or non-relevant and shows how we can use Lime (the full code can be accessed
in the notebook Ch4/LimeDemo.ipynb):
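
A sketch assuming vect and logreg are the 5,000-feature CountVectorizer and logistic regression classifier trained earlier; the class names and the choice of test instance are illustrative:

from lime.lime_text import LimeTextExplainer
from sklearn.pipeline import make_pipeline

pipeline = make_pipeline(vect, logreg)            # raw text in, class probabilities out
explainer = LimeTextExplainer(class_names=["No", "Yes"])

sample_text = X_test.iloc[0]                      # one test article to explain
explanation = explainer.explain_instance(sample_text,
                                          pipeline.predict_proba,
                                          num_features=6)
print(explanation.as_list())                      # six (word, weight) pairs
# explanation.show_in_notebook() renders the visualization inside Jupyter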

This code shows six features that played an important role in making this prediction. They’re
as follows:

Thus, the output of the above code can be seen as a linear sum of these six features. This would
mean that, if we remove the features “NEW” and “showing,” the prediction should move
toward the opposite class, i.e., “relevant/Yes,” by 0.35 (the sum of the weights of these two
features). Lime also has functions to visualize these predictions.
As shown in the figure, the presence of three words—York, trend, and dropped—skews the
prediction toward Yes, whereas the other three words skew the prediction toward No. Apart
from some uses we mentioned earlier, such visualizations of classifiers can also help us if we
want to do some informed feature selection.
