Python Text Classification Guide

This document provides a comprehensive guide to understanding and implementing text classification in Python. It discusses the key steps in a text classification pipeline, including dataset preparation, feature engineering, and model training. For feature engineering, it covers techniques like count vectors, TF-IDF vectors at the word, n-gram, and character level, word embeddings, and text-based features. The goal is to transform raw text into feature vectors that can be used to train machine learning models to automatically classify text documents.

https://www.analyticsvidhya.com/blog/2018/04/a-comprehensive-guide-to-understand-and-implement-text-classification-in-python/

A Comprehensive Guide to Understand and Implement Text Classification in Python
SHIVAM BANSAL, APRIL 23, 2018

Introduction
Text classification is one of the most widely used natural language processing tasks
across business problems. Its goal is to automatically assign text documents to one or
more predefined categories. Some examples of text classification are:

• Understanding audience sentiment from social media,
• Detection of spam and non-spam emails,
• Auto tagging of customer queries, and
• Categorization of news articles into defined topics.

Table of Contents
In this article, I will explain text classification and the step-by-step process to
implement it in Python.

Text classification is an example of a supervised machine learning task, since a labelled
dataset containing text documents and their labels is used to train a classifier. An
end-to-end text classification pipeline is composed of three main components:
1. Dataset Preparation: The first step includes loading a dataset and performing basic
pre-processing. The dataset is then split into train and validation sets.
2. Feature Engineering: In the next step, the raw dataset is transformed into flat
features which can be used in a machine learning model. This step also includes creating
new features from the existing data.
3. Model Training: The final step is the model building step, in which a machine learning
model is trained on a labelled dataset.

4. Improve Performance of Text Classifier: Beyond these three components, this article
also looks at different ways to improve the performance of text classifiers.

Note: This article does not cover NLP tasks in depth. If you want to revise the basics
and come back here, you can always go through this article.

Getting your machine ready

Let's implement the basic components step by step in order to create a text
classification framework in Python. To start with, import all the required libraries.

You will need the following libraries to run this code; you can install them from their
individual official links:

• Pandas
• Scikit-learn
• XGBoost
• TextBlob
• Keras

# libraries for dataset preparation, feature engineering, model training
from sklearn import model_selection, preprocessing, linear_model, naive_bayes, metrics, svm
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn import decomposition, ensemble

import pandas, xgboost, numpy, textblob, string
from keras.preprocessing import text, sequence
from keras import layers, models, optimizers

1. Dataset preparation
For the purpose of this article, I am using the dataset of Amazon reviews, which can
be downloaded at this link. The dataset consists of 3.6M text reviews and their labels;
we will use only a small fraction of the data. To prepare the dataset, load the downloaded
data into a pandas dataframe containing two columns – text and label. (Source)

# load the dataset (each line holds a label followed by the review text)
with open('data/corpus') as f:
    data = f.read()

labels, texts = [], []
for line in data.split("\n"):
    content = line.split()
    if not content:
        # skip blank lines, e.g. the one produced by a trailing newline
        continue
    labels.append(content[0])
    texts.append(" ".join(content[1:]))

# create a dataframe using texts and labels
trainDF = pandas.DataFrame()
trainDF['text'] = texts
trainDF['label'] = labels

Next, we will split the dataset into training and validation sets so that we can train and
test the classifier. We will also encode the target column so that it can be used in
machine learning models.

# split the dataset into training and validation datasets
train_x, valid_x, train_y, valid_y = model_selection.train_test_split(trainDF['text'], trainDF['label'])

# label encode the target variable
encoder = preprocessing.LabelEncoder()
train_y = encoder.fit_transform(train_y)
# reuse the mapping learned on the training labels instead of refitting it
valid_y = encoder.transform(valid_y)
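
To make the label encoding step concrete, here is a minimal sketch on made-up labels (the names "pos" and "neg" are illustrative assumptions, not the labels of the Amazon dataset). It shows the usual pattern of fitting the encoder once on the training labels and applying the same mapping to the validation labels, so that both sets share one label-to-integer assignment.

```python
from sklearn import preprocessing

# hypothetical toy labels standing in for the real label column
train_labels = ["pos", "neg", "pos", "neg"]
valid_labels = ["neg", "pos"]

encoder = preprocessing.LabelEncoder()
train_encoded = encoder.fit_transform(train_labels)  # learn the label -> integer mapping
valid_encoded = encoder.transform(valid_labels)      # apply the same mapping

print(list(encoder.classes_))   # classes are stored in sorted order
print(list(train_encoded), list(valid_encoded))
```

Because the classes are sorted, "neg" maps to 0 and "pos" maps to 1 in both arrays.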

2. Feature Engineering
The next step is feature engineering. In this step, raw text data is transformed into
feature vectors, and new features are created from the existing dataset. We will
implement the following ideas in order to obtain relevant features from our dataset.

2.1 Count Vectors as features
2.2 TF-IDF Vectors as features
• Word level
• N-Gram level
• Character level
2.3 Word Embeddings as features
2.4 Text / NLP based features
2.5 Topic Models as features

Let's look at the implementation of these ideas in detail.

2.1 Count Vectors as features

A count vector is a matrix notation of the dataset in which every row represents a
document from the corpus, every column represents a term from the corpus, and every
cell represents the frequency count of a particular term in a particular document.

# create a count vectorizer object
count_vect = CountVectorizer(analyzer='word', token_pattern=r'\w{1,}')
count_vect.fit(trainDF['text'])

# transform the training and validation data using count vectorizer object
xtrain_count = count_vect.transform(train_x)
xvalid_count = count_vect.transform(valid_x)
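
As a small illustration of the row/column/cell layout described above, the following sketch builds a count matrix for a two-document toy corpus (the documents are made up for this example) using the same vectorizer settings.

```python
from sklearn.feature_extraction.text import CountVectorizer

# hypothetical two-document corpus
docs = ["the cat sat", "the cat sat on the mat"]

vect = CountVectorizer(analyzer='word', token_pattern=r'\w{1,}')
counts = vect.fit_transform(docs)

# column order of the matrix, recovered from the fitted vocabulary
vocab = sorted(vect.vocabulary_, key=vect.vocabulary_.get)
print(vocab)             # ['cat', 'mat', 'on', 'sat', 'the']
print(counts.toarray())  # [[1 0 0 1 1]
                         #  [1 1 1 1 2]]
```

Each row is a document and each column a term; the 2 in the last cell records that "the" appears twice in the second document.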

2.2 TF-IDF Vectors as features

The TF-IDF score represents the relative importance of a term in a document and in the
entire corpus. The score is composed of two terms: the first is the normalized Term
Frequency (TF); the second is the Inverse Document Frequency (IDF), computed as the
logarithm of the number of documents in the corpus divided by the number of documents
in which the specific term appears.

TF(t) = (Number of times term t appears in a document) / (Total number of terms in the document)
IDF(t) = log_e(Total number of documents / Number of documents with term t in it)

TF-IDF vectors can be generated at different levels of input tokens (words, characters,
n-grams):
a. Word Level TF-IDF: Matrix representing TF-IDF scores of every term in different
documents
b. N-gram Level TF-IDF: N-grams are combinations of N terms together. This matrix
represents TF-IDF scores of N-grams
c. Character Level TF-IDF: Matrix representing TF-IDF scores of character level n-grams
in the corpus
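
The two formulas above can be worked through by hand. The sketch below applies them literally to a made-up, pre-tokenized corpus; the helper names tf, idf and tf_idf are illustrative. Note that scikit-learn's TfidfVectorizer uses a smoothed IDF and L2-normalizes each row by default, so its values will differ from this plain version.

```python
import math

# hypothetical three-document corpus, already split into tokens
corpus = [
    ["good", "movie", "good", "plot"],
    ["bad", "movie"],
    ["good", "acting"],
]

def tf(term, doc):
    # times the term appears in the document / total terms in the document
    return doc.count(term) / len(doc)

def idf(term, docs):
    # log_e(total documents / documents containing the term)
    # assumes the term occurs in at least one document
    n_containing = sum(1 for d in docs if term in d)
    return math.log(len(docs) / n_containing)

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

score_good = tf_idf("good", corpus[0], corpus)  # 0.5  * ln(3/2)
score_plot = tf_idf("plot", corpus[0], corpus)  # 0.25 * ln(3/1)
print(round(score_good, 4), round(score_plot, 4))
```

Although "good" occurs twice in the first document and "plot" only once, "plot" scores higher because it appears in fewer documents of the corpus.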

# word level tf-idf
tfidf_vect = TfidfVectorizer(analyzer='word', token_pattern=r'\w{1,}', max_features=5000)
tfidf_vect.fit(trainDF['text'])
xtrain_tfidf = tfidf_vect.transform(train_x)
xvalid_tfidf = tfidf_vect.transform(valid_x)

# ngram level tf-idf
tfidf_vect_ngram = TfidfVectorizer(analyzer='word', token_pattern=r'\w{1,}', ngram_range=(2,3), max_features=5000)
tfidf_vect_ngram.fit(trainDF['text'])
xtrain_tfidf_ngram = tfidf_vect_ngram.transform(train_x)
xvalid_tfidf_ngram = tfidf_vect_ngram.transform(valid_x)

# characters level tf-idf
# (token_pattern is only used when analyzer='word', so it is omitted here)
tfidf_vect_ngram_chars = TfidfVectorizer(analyzer='char', ngram_range=(2,3), max_features=5000)
tfidf_vect_ngram_chars.fit(trainDF['text'])
xtrain_tfidf_ngram_chars = tfidf_vect_ngram_chars.transform(train_x)
xvalid_tfidf_ngram_chars = tfidf_vect_ngram_chars.transform(valid_x)
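
To see what the word-level and character-level n-gram settings actually extract, the following sketch asks each vectorizer for its analyzer and runs it on short made-up strings. No fitting is needed for this; build_analyzer only returns the tokenization function.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# same n-gram settings as above, used here only to inspect the extracted tokens
word_ngrams = TfidfVectorizer(analyzer='word', token_pattern=r'\w{1,}',
                              ngram_range=(2, 3)).build_analyzer()
char_ngrams = TfidfVectorizer(analyzer='char', ngram_range=(2, 3)).build_analyzer()

print(word_ngrams("not very good"))  # ['not very', 'very good', 'not very good']
print(char_ngrams("cat"))            # ['ca', 'at', 'cat']
```

With ngram_range=(2,3) no single words or single characters are kept, only sequences of two or three.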

2.3 Word Embeddings

A word embedding is a way of representing words and documents using dense vectors.
The position of a word within the vector space is learned from text and is based on the
words that surround it when it is used. Word embeddings can be trained on the input
corpus itself, or pre-trained word embeddings such as GloVe, FastText, and Word2Vec
can be used. Any one of them can be downloaded and used for transfer learning. One can
read more about word embeddings here.

The following snippet shows how to use pre-trained word embeddings in the model. There
are four essential steps:

1. Loading the pre-trained word embeddings
2. Creating a tokenizer object
3. Transforming text documents to sequences of tokens and padding them
4. Creating a mapping of tokens and their respective embeddings

You can download the pre-trained word embeddings from here

# load the pre-trained word-embedding vectors
embeddings_index = {}
for i, line in enumerate(open('data/wiki-news-300d-1M.vec')):
    values = line.split()
    embeddings_index[values[0]] = numpy.asarray(values[1:], dtype='float32')

# create a tokenizer
token = text.Tokenizer()
token.fit_on_texts(trainDF['text'])
word_index = token.word_index

# convert text to sequence of tokens and pad them to ensure equal length vectors
train_seq_x = sequence.pad_sequences(token.texts_to_sequences(train_x), maxlen=70)
valid_seq_x = sequence.pad_sequences(token.texts_to_sequences(valid_x), maxlen=70)

# create token-embedding mapping
embedding_matrix = numpy.zeros((len(word_index) + 1, 300))
for word, i in word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector
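
The token-embedding mapping is easier to follow at toy scale. The sketch below repeats the last step with made-up 3-dimensional vectors and a hand-written word index (both are assumptions for illustration; the real vectors are 300-dimensional and the real index comes from the tokenizer).

```python
import numpy

# hypothetical "pre-trained" vectors, 3 dimensions instead of 300
embeddings_index = {
    "good": numpy.array([0.1, 0.2, 0.3], dtype="float32"),
    "bad": numpy.array([-0.1, -0.2, -0.3], dtype="float32"),
}
# tokenizer-style index: ids start at 1, id 0 is left for padding
word_index = {"good": 1, "bad": 2, "zzz": 3}

embedding_matrix = numpy.zeros((len(word_index) + 1, 3))
for word, i in word_index.items():
    vector = embeddings_index.get(word)
    if vector is not None:
        # row i holds the vector of the word with id i
        embedding_matrix[i] = vector

print(embedding_matrix.shape)  # (4, 3)
```

Row 0 (padding) and the row of "zzz", which has no pre-trained vector, stay all zeros.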
2.4 Text / NLP based features
A number of extra text-based features can also be created, which are sometimes helpful
for improving text classification models. Some examples are:

1. Word Count of the documents – total number of words in the documents
2. Character Count of the documents – total number of characters in the documents
3. Average Word Density of the documents – average length of the words used in
the documents
4. Punctuation Count of the documents – total number of punctuation marks in
the documents
5. Upper Case Count of the documents – total number of upper-case words in
the documents
6. Title Word Count of the documents – total number of proper case (title)
words in the documents
7. Frequency distribution of Part of Speech Tags:
o Noun Count
o Verb Count
o Adjective Count
o Adverb Count
o Pronoun Count

These features are highly experimental and should be used only when they suit the
problem statement.

trainDF['char_count'] = trainDF['text'].apply(len)
trainDF['word_count'] = trainDF['text'].apply(lambda x: len(x.split()))
trainDF['word_density'] = trainDF['char_count'] / (trainDF['word_count']+1)
trainDF['punctuation_count'] = trainDF['text'].apply(lambda x: len("".join(_ for _ in x if _ in string.punctuation)))
trainDF['title_word_count'] = trainDF['text'].apply(lambda x: len([wrd for wrd in x.split() if wrd.istitle()]))
trainDF['upper_case_word_count'] = trainDF['text'].apply(lambda x: len([wrd for wrd in x.split() if wrd.isupper()]))

pos_family = {
    'noun' : ['NN','NNS','NNP','NNPS'],
    'pron' : ['PRP','PRP$','WP','WP$'],
    'verb' : ['VB','VBD','VBG','VBN','VBP','VBZ'],
    'adj' : ['JJ','JJR','JJS'],
    'adv' : ['RB','RBR','RBS','WRB']
}

# function to check and get the part of speech tag count of words in a given sentence
def check_pos_tag(x, flag):
    cnt = 0
    try:
        wiki = textblob.TextBlob(x)
        for tup in wiki.tags:
            ppo = list(tup)[1]
            if ppo in pos_family[flag]:
                cnt += 1
    except Exception:
        # tagging failed for this text; return the count reached so far
        pass
    return cnt

trainDF['noun_count'] = trainDF['text'].apply(lambda x: check_pos_tag(x, 'noun'))
trainDF['verb_count'] = trainDF['text'].apply(lambda x: check_pos_tag(x, 'verb'))
trainDF['adj_count'] = trainDF['text'].apply(lambda x: check_pos_tag(x, 'adj'))
trainDF['adv_count'] = trainDF['text'].apply(lambda x: check_pos_tag(x, 'adv'))
trainDF['pron_count'] = trainDF['text'].apply(lambda x: check_pos_tag(x, 'pron'))
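
For a quick feel of what the simpler features measure, here is a minimal sketch that computes them for a single made-up review using only the standard library. The helper name basic_text_features is hypothetical, and the part-of-speech counts are left out because they need TextBlob and its tagger data.

```python
import string

def basic_text_features(text):
    # the count-based features from the list above, for one document
    words = text.split()
    char_count = len(text)
    word_count = len(words)
    return {
        "char_count": char_count,
        "word_count": word_count,
        "word_density": char_count / (word_count + 1),
        "punctuation_count": sum(1 for ch in text if ch in string.punctuation),
        "title_word_count": sum(1 for w in words if w.istitle()),
        "upper_case_word_count": sum(1 for w in words if w.isupper()),
    }

features = basic_text_features("GREAT product, Really loved it!")
print(features)
```

Here "Really" is the only title-case word and "GREAT" the only upper-case word; the comma and the exclamation mark give a punctuation count of 2.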

2.5 Topic Models as features

Topic modelling is a technique to identify the groups of words (called topics) that best
capture the information in a collection of documents. I have used Latent Dirichlet
Allocation (LDA) for generating topic modelling features. LDA is an iterative model
which starts from a fixed number of topics. Each topic is represented as a distribution
over words, and each document is then represented as a distribution over topics.
Although the tokens themselves are meaningless, the probability distributions over words
provided by the topics provide a sense of the different ideas contained in the
documents. One can read more about topic modelling here

Let's see its implementation:

# train a LDA Model
lda_model = decomposition.LatentDirichletAllocation(n_components=20, learning_method='online', max_iter=20)
X_topics = lda_model.fit_transform(xtrain_count)
topic_word = lda_model.components_
# scikit-learn 1.0 and later; older releases used count_vect.get_feature_names()
vocab = count_vect.get_feature_names_out()

# view the topic models
n_top_words = 10
topic_summaries = []
for i, topic_dist in enumerate(topic_word):
    # indices of the n_top_words highest-weighted terms, largest first
    topic_words = numpy.array(vocab)[numpy.argsort(topic_dist)][:-(n_top_words+1):-1]
    topic_summaries.append(' '.join(topic_words))
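
The "distribution over topics" idea can be checked on a tiny example. The sketch below fits LDA with two topics on four made-up documents (corpus, topic count and random_state are all illustrative assumptions) and inspects the shapes of the two matrices involved.

```python
import numpy
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# hypothetical toy corpus
docs = [
    "battery life of this phone is great",
    "phone screen and battery are bad",
    "the novel has a great plot",
    "boring plot and flat characters in this novel",
]

counts = CountVectorizer().fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)   # one row per document, one column per topic

print(doc_topics.shape)          # (4, 2)
print(doc_topics.sum(axis=1))    # each row sums to 1
print(lda.components_.shape)     # (2, vocabulary size): topic-word weights
```

Each row of doc_topics can be used directly as a small dense feature vector for the corresponding document.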

3. Model Building
The final step in the text classification framework is to train a classifier using the
features created in the previous step. There are many different machine learning models
which can be used to train a final model. We will implement the following classifiers
for this purpose:

1. Naive Bayes Classifier
2. Linear Classifier
3. Support Vector Machine
4. Bagging Models
5. Boosting Models
6. Shallow Neural Networks
7. Deep Neural Networks
o Convolutional Neural Network (CNN)
o Long Short Term Memory (LSTM)
o Gated Recurrent Unit (GRU)
o Bidirectional RNN
o Recurrent Convolutional Neural Network (RCNN)
o Other Variants of Deep Neural Networks

Let's implement these models and understand their details. The following function is a
utility function which can be used to train a model. It accepts the classifier, the
feature vectors of the training data, the labels of the training data and the feature
vectors of the validation data as inputs. Using these inputs, the model is trained and
the accuracy score is computed.

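def train_model(classifier, feature_vector_train, label, feature_vector_valid, is_neural_net=False):
    # fit the training dataset on the classifier
    classifier.fit(feature_vector_train, label)

    # predict the labels on validation dataset
    predictions = classifier.predict(feature_vector_valid)

    if is_neural_net:
        # the networks below end in a single sigmoid unit, so threshold the
        # probability at 0.5 (argmax over one column would always return 0)
        predictions = (predictions > 0.5).astype(int).ravel()

    # valid_y is the encoded validation target created in section 1
    return metrics.accuracy_score(valid_y, predictions)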
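
A note on the is_neural_net branch, as a minimal sketch with made-up probabilities: when a network ends in a single sigmoid unit, predict returns one column of probabilities. Taking argmax over the last axis of such an array yields class 0 for every row, whereas thresholding at 0.5 recovers binary labels. The numbers below are illustrative and not model output.

```python
import numpy

# hypothetical sigmoid outputs for four validation documents, shape (4, 1)
probabilities = numpy.array([[0.91], [0.12], [0.65], [0.40]])

# argmax over a single column can only ever return index 0
argmax_labels = probabilities.argmax(axis=-1)

# thresholding turns each probability into a 0/1 label
threshold_labels = (probabilities > 0.5).astype(int).ravel()

print(argmax_labels)     # [0 0 0 0]
print(threshold_labels)  # [1 0 1 0]
```

If every prediction is class 0, the accuracy equals the share of class 0 in the validation set, which is worth keeping in mind when a neural network scores close to 0.5 on a balanced dataset.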

3.1 Naive Bayes
Implementing a Naive Bayes model using the sklearn implementation with different features

Naive Bayes is a classification technique based on Bayes' Theorem with an assumption
of independence among predictors. A Naive Bayes classifier assumes that the presence
of a particular feature in a class is unrelated to the presence of any other feature.
One can read more about it here.

# Naive Bayes on Count Vectors
accuracy = train_model(naive_bayes.MultinomialNB(), xtrain_count, train_y, xvalid_count)
print("NB, Count Vectors: ", accuracy)

# Naive Bayes on Word Level TF IDF Vectors
accuracy = train_model(naive_bayes.MultinomialNB(), xtrain_tfidf, train_y, xvalid_tfidf)
print("NB, WordLevel TF-IDF: ", accuracy)

# Naive Bayes on Ngram Level TF IDF Vectors
accuracy = train_model(naive_bayes.MultinomialNB(), xtrain_tfidf_ngram, train_y, xvalid_tfidf_ngram)
print("NB, N-Gram Vectors: ", accuracy)

# Naive Bayes on Character Level TF IDF Vectors
accuracy = train_model(naive_bayes.MultinomialNB(), xtrain_tfidf_ngram_chars, train_y, xvalid_tfidf_ngram_chars)
print("NB, CharLevel Vectors: ", accuracy)

NB, Count Vectors: 0.7004

NB, WordLevel TF-IDF: 0.7024

NB, N-Gram Vectors: 0.5344

NB, CharLevel Vectors: 0.6872
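
For readers who want to try the vectorize, fit and predict flow without downloading the dataset, here is a self-contained sketch on a handful of made-up reviews (texts and labels are illustrative assumptions, with 1 for positive and 0 for negative).

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# hypothetical toy training data
train_texts = [
    "good great excellent",
    "great product good value",
    "bad terrible awful",
    "awful quality bad service",
]
train_labels = [1, 1, 0, 0]
valid_texts = ["good value great", "terrible awful service"]

vect = CountVectorizer()
clf = MultinomialNB()
clf.fit(vect.fit_transform(train_texts), train_labels)

# the validation texts must go through the vectorizer fitted on the training texts
predictions = clf.predict(vect.transform(valid_texts))
print(predictions)  # [1 0]
```

The same three steps apply to the other classifiers in this section; only the classifier object changes.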

3.2 Linear Classifier

Implementing a Linear Classifier (Logistic Regression)

Logistic regression measures the relationship between the categorical dependent
variable and one or more independent variables by estimating probabilities using a
logistic/sigmoid function. One can read more about logistic regression here

# Linear Classifier on Count Vectors
accuracy = train_model(linear_model.LogisticRegression(), xtrain_count, train_y, xvalid_count)
print("LR, Count Vectors: ", accuracy)

# Linear Classifier on Word Level TF IDF Vectors
accuracy = train_model(linear_model.LogisticRegression(), xtrain_tfidf, train_y, xvalid_tfidf)
print("LR, WordLevel TF-IDF: ", accuracy)

# Linear Classifier on Ngram Level TF IDF Vectors
accuracy = train_model(linear_model.LogisticRegression(), xtrain_tfidf_ngram, train_y, xvalid_tfidf_ngram)
print("LR, N-Gram Vectors: ", accuracy)

# Linear Classifier on Character Level TF IDF Vectors
accuracy = train_model(linear_model.LogisticRegression(), xtrain_tfidf_ngram_chars, train_y, xvalid_tfidf_ngram_chars)
print("LR, CharLevel Vectors: ", accuracy)

LR, Count Vectors: 0.7048

LR, WordLevel TF-IDF: 0.7056

LR, N-Gram Vectors: 0.4896

LR, CharLevel Vectors: 0.7012

3.3 Implementing a SVM Model

Support Vector Machine (SVM) is a supervised machine learning algorithm which can be
used for both classification and regression challenges. The model finds the best possible
hyper-plane / line that separates the two classes. One can read more about it here

# SVM on Ngram Level TF IDF Vectors
accuracy = train_model(svm.SVC(), xtrain_tfidf_ngram, train_y, xvalid_tfidf_ngram)
print("SVM, N-Gram Vectors: ", accuracy)

SVM, N-Gram Vectors: 0.5296

3.4 Bagging Model

Implementing a Random Forest Model

Random Forest models are a type of ensemble model, specifically a bagging model.
They are part of the tree-based model family. One can read more about bagging and
random forests here

# RF on Count Vectors
accuracy = train_model(ensemble.RandomForestClassifier(), xtrain_count, train_y, xvalid_count)
print("RF, Count Vectors: ", accuracy)

# RF on Word Level TF IDF Vectors
accuracy = train_model(ensemble.RandomForestClassifier(), xtrain_tfidf, train_y, xvalid_tfidf)
print("RF, WordLevel TF-IDF: ", accuracy)

RF, Count Vectors: 0.6972

RF, WordLevel TF-IDF: 0.6988

3.5 Boosting Model

Implementing an Extreme Gradient Boosting Model

Boosting models are another type of ensemble model and are part of the tree-based models.
Boosting is a machine learning ensemble meta-algorithm for primarily reducing bias, and
also variance, in supervised learning, and a family of machine learning algorithms that
convert weak learners to strong ones. A weak learner is defined to be a classifier that is
only slightly correlated with the true classification (it can label examples better than
random guessing). Read more about these models here

# Extreme Gradient Boosting on Count Vectors
accuracy = train_model(xgboost.XGBClassifier(), xtrain_count.tocsc(), train_y, xvalid_count.tocsc())
print("Xgb, Count Vectors: ", accuracy)

# Extreme Gradient Boosting on Word Level TF IDF Vectors
accuracy = train_model(xgboost.XGBClassifier(), xtrain_tfidf.tocsc(), train_y, xvalid_tfidf.tocsc())
print("Xgb, WordLevel TF-IDF: ", accuracy)

# Extreme Gradient Boosting on Character Level TF IDF Vectors
accuracy = train_model(xgboost.XGBClassifier(), xtrain_tfidf_ngram_chars.tocsc(), train_y, xvalid_tfidf_ngram_chars.tocsc())
print("Xgb, CharLevel Vectors: ", accuracy)

/usr/local/lib/python2.7/dist-packages/sklearn/preprocessing/label.py:151: DeprecationWarning: The truth value of an empty array is ambiguous. Returning False, but in future this will result in an error. Use `array.size > 0` to check that an array is not empty.
  if diff:

Xgb, Count Vectors: 0.6324

Xgb, WordLevel TF-IDF: 0.6364

Xgb, CharLevel Vectors: 0.6548

3.6 Shallow Neural Networks

A neural network is a mathematical model that is designed to behave similarly to
biological neurons and the nervous system. These models are used to recognize complex
patterns and relationships that exist within labelled data. A shallow neural network
contains mainly three types of layers – input layer, hidden layer, and output layer.
Read more about neural networks here

def create_model_architecture(input_size):
    # create input layer
    input_layer = layers.Input((input_size, ), sparse=True)

    # create hidden layer
    hidden_layer = layers.Dense(100, activation="relu")(input_layer)

    # create output layer
    output_layer = layers.Dense(1, activation="sigmoid")(hidden_layer)

    classifier = models.Model(inputs = input_layer, outputs = output_layer)
    classifier.compile(optimizer=optimizers.Adam(), loss='binary_crossentropy')
    return classifier

classifier = create_model_architecture(xtrain_tfidf_ngram.shape[1])
accuracy = train_model(classifier, xtrain_tfidf_ngram, train_y, xvalid_tfidf_ngram, is_neural_net=True)
print("NN, Ngram Level TF IDF Vectors", accuracy)

Epoch 1/1

7500/7500 [==============================] - 1s 67us/step - loss: 0.6909

NN, Ngram Level TF IDF Vectors 0.5296

3.7 Deep Neural Networks

Deep Neural Networks are more complex neural networks in which the hidden layers
perform much more complex operations than simple sigmoid or relu activations.
Different types of deep learning models can be applied in text classification problems.

3.7.1 Convolutional Neural Network

In convolutional neural networks, convolutions over the input layer are used to compute
the output. This results in local connections, where each region of the input is connected
to a neuron in the output. Each layer applies different filters and combines their results.

def create_cnn():
    # Add an Input Layer
    input_layer = layers.Input((70, ))

    # Add the word embedding Layer
    embedding_layer = layers.Embedding(len(word_index) + 1, 300, weights=[embedding_matrix], trainable=False)(input_layer)
    embedding_layer = layers.SpatialDropout1D(0.3)(embedding_layer)

    # Add the convolutional Layer
    conv_layer = layers.Convolution1D(100, 3, activation="relu")(embedding_layer)

    # Add the pooling Layer
    pooling_layer = layers.GlobalMaxPool1D()(conv_layer)

    # Add the output Layers
    output_layer1 = layers.Dense(50, activation="relu")(pooling_layer)
    output_layer1 = layers.Dropout(0.25)(output_layer1)
    output_layer2 = layers.Dense(1, activation="sigmoid")(output_layer1)

    # Compile the model
    model = models.Model(inputs=input_layer, outputs=output_layer2)
    model.compile(optimizer=optimizers.Adam(), loss='binary_crossentropy')

    return model

classifier = create_cnn()
accuracy = train_model(classifier, train_seq_x, train_y, valid_seq_x, is_neural_net=True)
print("CNN, Word Embeddings", accuracy)

Epoch 1/1

7500/7500 [==============================] - 12s 2ms/step - loss: 0.5847

CNN, Word Embeddings 0.5296

3.7.2 Recurrent Neural Network – LSTM

Unlike feed-forward neural networks, in which activation outputs are propagated only in
one direction, the activation outputs from neurons propagate in both directions (from
inputs to outputs and from outputs to inputs) in Recurrent Neural Networks. This creates
loops in the neural network architecture which act as a 'memory state' of the neurons.
This state gives the neurons the ability to remember what has been learned so far.

The memory state in RNNs gives an advantage over traditional neural networks, but a
problem called Vanishing Gradient is associated with them. In this problem, while
learning with a large number of layers, it becomes really hard for the network to learn
and tune the parameters of the earlier layers. To address this problem, a new type of
RNN called the LSTM (Long Short Term Memory) model has been developed.

def create_rnn_lstm():
    # Add an Input Layer
    input_layer = layers.Input((70, ))

    # Add the word embedding Layer
    embedding_layer = layers.Embedding(len(word_index) + 1, 300, weights=[embedding_matrix], trainable=False)(input_layer)
    embedding_layer = layers.SpatialDropout1D(0.3)(embedding_layer)

    # Add the LSTM Layer
    lstm_layer = layers.LSTM(100)(embedding_layer)

    # Add the output Layers
    output_layer1 = layers.Dense(50, activation="relu")(lstm_layer)
    output_layer1 = layers.Dropout(0.25)(output_layer1)
    output_layer2 = layers.Dense(1, activation="sigmoid")(output_layer1)

    # Compile the model
    model = models.Model(inputs=input_layer, outputs=output_layer2)
    model.compile(optimizer=optimizers.Adam(), loss='binary_crossentropy')

    return model

classifier = create_rnn_lstm()
accuracy = train_model(classifier, train_seq_x, train_y, valid_seq_x, is_neural_net=True)
print("RNN-LSTM, Word Embeddings", accuracy)

Epoch 1/1

7500/7500 [==============================] - 22s 3ms/step - loss: 0.6899

RNN-LSTM, Word Embeddings 0.5124

3.7.3 Recurrent Neural Network – GRU

Gated Recurrent Units are another form of recurrent neural networks. Let's add a GRU
layer instead of the LSTM layer in our network.

def create_rnn_gru():
    # Add an Input Layer
    input_layer = layers.Input((70, ))

    # Add the word embedding Layer
    embedding_layer = layers.Embedding(len(word_index) + 1, 300, weights=[embedding_matrix], trainable=False)(input_layer)
    embedding_layer = layers.SpatialDropout1D(0.3)(embedding_layer)

    # Add the GRU Layer
    gru_layer = layers.GRU(100)(embedding_layer)

    # Add the output Layers
    output_layer1 = layers.Dense(50, activation="relu")(gru_layer)
    output_layer1 = layers.Dropout(0.25)(output_layer1)
    output_layer2 = layers.Dense(1, activation="sigmoid")(output_layer1)

    # Compile the model
    model = models.Model(inputs=input_layer, outputs=output_layer2)
    model.compile(optimizer=optimizers.Adam(), loss='binary_crossentropy')

    return model

classifier = create_rnn_gru()
accuracy = train_model(classifier, train_seq_x, train_y, valid_seq_x, is_neural_net=True)
print("RNN-GRU, Word Embeddings", accuracy)

Epoch 1/1

7500/7500 [==============================] - 19s 3ms/step - loss: 0.6898

RNN-GRU, Word Embeddings 0.5124

3.7.4 Bidirectional RNN

RNN layers can be wrapped in Bidirectional layers as well. Let's wrap our GRU layer in
a bidirectional layer.

def create_bidirectional_rnn():
    # Add an Input Layer
    input_layer = layers.Input((70, ))

    # Add the word embedding Layer
    embedding_layer = layers.Embedding(len(word_index) + 1, 300, weights=[embedding_matrix], trainable=False)(input_layer)
    embedding_layer = layers.SpatialDropout1D(0.3)(embedding_layer)

    # Add the bidirectional GRU Layer
    gru_layer = layers.Bidirectional(layers.GRU(100))(embedding_layer)

    # Add the output Layers
    output_layer1 = layers.Dense(50, activation="relu")(gru_layer)
    output_layer1 = layers.Dropout(0.25)(output_layer1)
    output_layer2 = layers.Dense(1, activation="sigmoid")(output_layer1)

    # Compile the model
    model = models.Model(inputs=input_layer, outputs=output_layer2)
    model.compile(optimizer=optimizers.Adam(), loss='binary_crossentropy')

    return model

classifier = create_bidirectional_rnn()
accuracy = train_model(classifier, train_seq_x, train_y, valid_seq_x, is_neural_net=True)
print("RNN-Bidirectional, Word Embeddings", accuracy)

Epoch 1/1

7500/7500 [==============================] - 32s 4ms/step - loss: 0.6889

RNN-Bidirectional, Word Embeddings 0.5124

3.7.5 Recurrent Convolutional Neural Network

Once the essential architectures have been tried out, one can try different variants of
these layers, such as a recurrent convolutional neural network. Other variants include:

1. Hierarchical Attention Networks
2. Sequence to Sequence Models with Attention
3. Bidirectional Recurrent Convolutional Neural Networks
4. CNNs and RNNs with a larger number of layers

def create_rcnn():
    # Add an Input Layer
    input_layer = layers.Input((70, ))

    # Add the word embedding Layer
    embedding_layer = layers.Embedding(len(word_index) + 1, 300, weights=[embedding_matrix], trainable=False)(input_layer)
    embedding_layer = layers.SpatialDropout1D(0.3)(embedding_layer)

    # Add the recurrent layer
    rnn_layer = layers.Bidirectional(layers.GRU(50, return_sequences=True))(embedding_layer)

    # Add the convolutional Layer on top of the recurrent layer's output sequence
    # (the original listing passed embedding_layer here, leaving rnn_layer unused)
    conv_layer = layers.Convolution1D(100, 3, activation="relu")(rnn_layer)

    # Add the pooling Layer
    pooling_layer = layers.GlobalMaxPool1D()(conv_layer)

    # Add the output Layers
    output_layer1 = layers.Dense(50, activation="relu")(pooling_layer)
    output_layer1 = layers.Dropout(0.25)(output_layer1)
    output_layer2 = layers.Dense(1, activation="sigmoid")(output_layer1)

    # Compile the model
    model = models.Model(inputs=input_layer, outputs=output_layer2)
    model.compile(optimizer=optimizers.Adam(), loss='binary_crossentropy')

    return model

classifier = create_rcnn()
accuracy = train_model(classifier, train_seq_x, train_y, valid_seq_x, is_neural_net=True)
print("RCNN, Word Embeddings", accuracy)

Epoch 1/1

7500/7500 [==============================] - 11s 1ms/step - loss: 0.6902

RCNN, Word Embeddings 0.5124

Improving Text Classification Models

While the above framework can be applied to a number of text classification problems,
some improvements to the overall framework are needed to achieve good accuracy. The
following are some tips to improve the performance of text classification models and
this framework.

1. Text Cleaning: text cleaning can help to reduce the noise present in text data in the
form of stopwords, punctuation marks, suffix variations etc. This article can help to
understand how to implement text classification in detail.

2. Hstacking Text / NLP features with text feature vectors: In the feature
engineering section, we generated a number of different feature vectors; combining
them can help to improve the accuracy of the classifier.
3. Hyperparameter Tuning in modelling: Tuning the parameters is an important step; a
number of parameters such as tree length, leaves, network parameters etc. can be
fine-tuned to get a best-fit model.

4. Ensemble Models: Stacking different models and blending their outputs can help to
further improve the results. Read more about ensemble models here
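
As a minimal sketch of tip 2, the snippet below stacks a sparse TF-IDF matrix side by side with two hand-made numeric features using scipy.sparse.hstack. The toy texts and the choice of features are assumptions for illustration; in the framework above the same idea would apply to matrices such as xtrain_tfidf and the text feature columns of trainDF, taken for the same rows.

```python
import numpy
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer

# hypothetical toy documents
texts = ["GREAT product, loved it!", "bad quality", "Really good value"]

tfidf = TfidfVectorizer()
text_vectors = tfidf.fit_transform(texts)   # sparse, shape (3, n_terms)

# two extra features per document: character count and word count
extra = numpy.array([[len(t), len(t.split())] for t in texts], dtype=float)

# place the extra columns to the right of the TF-IDF columns
combined = hstack([text_vectors, csr_matrix(extra)]).tocsr()
print(text_vectors.shape, combined.shape)
```

Raw counts live on a much larger scale than TF-IDF weights, so scaling the extra columns before stacking is often worth trying for linear models.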

End Notes
In this article, we discussed how to prepare a text dataset (cleaning and creating
training and validation datasets), perform different types of feature engineering such
as Count Vector, TF-IDF, Word Embedding, Topic Modelling and basic text features, and
finally train a variety of classifiers such as Naive Bayes, Logistic Regression, SVM,
MLP, LSTM and GRU. At the end, we discussed different approaches to improve the
performance of text classifiers.

Note: There is a video course, Natural Language Processing using Python, with 3
real-life projects, two of which involve text classification.

No ratings yet
Toxic Comment Classification Using Natural Language Processing IRJET-V7I61123
4 pages
Medical Text Classifier GabrieldeOlaguibel
No ratings yet
Medical Text Classifier GabrieldeOlaguibel
12 pages
Report
No ratings yet
Report
2 pages
Text Classification Using Decision Forests and Pretrained Embeddings - 1716327972920
No ratings yet
Text Classification Using Decision Forests and Pretrained Embeddings - 1716327972920
12 pages
CNN Text Classification
No ratings yet
CNN Text Classification
12 pages
Text Mining and Dataset Creation in Python
No ratings yet
Text Mining and Dataset Creation in Python
13 pages
Combine PDF
No ratings yet
Combine PDF
124 pages
Lab5 Example Fall 23
No ratings yet
Lab5 Example Fall 23
4 pages
Spam Detection Using Tensorflow
No ratings yet
Spam Detection Using Tensorflow
13 pages
CS585 Lecture October15th
No ratings yet
CS585 Lecture October15th
162 pages
Winter Semester 2023-24 CSE3015 ETH AP2023246000714 Quiz-I-Question-Paper
No ratings yet
Winter Semester 2023-24 CSE3015 ETH AP2023246000714 Quiz-I-Question-Paper
74 pages
Report
No ratings yet
Report
89 pages
CAT King Study Material 4
No ratings yet
CAT King Study Material 4
32 pages
Next Word Prediction With NLP and Deep Learning
No ratings yet
Next Word Prediction With NLP and Deep Learning
13 pages
What Is Text Classification - Exxact
No ratings yet
What Is Text Classification - Exxact
12 pages
DSBA+Master+Codebook+ +Text+Mining+&+TSF
No ratings yet
DSBA+Master+Codebook+ +Text+Mining+&+TSF
11 pages
Assignment 4
No ratings yet
Assignment 4
5 pages
Project Proposal - Group 17-2-5
No ratings yet
Project Proposal - Group 17-2-5
4 pages
566f0619-9145-4b8f-b12b-cb8a5b0cd30d
No ratings yet
566f0619-9145-4b8f-b12b-cb8a5b0cd30d
17 pages
NLP Soc
No ratings yet
NLP Soc
15 pages
Module 2 Feature Engineering and Text Representation
No ratings yet
Module 2 Feature Engineering and Text Representation
19 pages
Using Pre-Trained Word Embeddings - 1716328022707
No ratings yet
Using Pre-Trained Word Embeddings - 1716328022707
8 pages
05 - Feature Engineering (Text)
No ratings yet
05 - Feature Engineering (Text)
28 pages
CSDM2-Text Preprocessing For NL Data - 011050
No ratings yet
CSDM2-Text Preprocessing For NL Data - 011050
6 pages
Ch4 Word Embeddings
No ratings yet
Ch4 Word Embeddings
21 pages
Text Classification MLND Project Report Prasann Pandya
No ratings yet
Text Classification MLND Project Report Prasann Pandya
17 pages
Ai Lab Final
No ratings yet
Ai Lab Final
21 pages
FND Imp Points
No ratings yet
FND Imp Points
6 pages
NLP Module 3
No ratings yet
NLP Module 3
66 pages
NLP Lab Manual for B.E. Students
No ratings yet
NLP Lab Manual for B.E. Students
21 pages
Classification CNN
No ratings yet
Classification CNN
7 pages
Intro to Text Classification for Students
No ratings yet
Intro to Text Classification for Students
3 pages
Official: Osition Escription
No ratings yet
Official: Osition Escription
4 pages
Decision Trees for CS Students
No ratings yet
Decision Trees for CS Students
6 pages
DecisionTreeClassifiers Avi PurdueUni 2017
No ratings yet
DecisionTreeClassifiers Avi PurdueUni 2017
127 pages
Kazi Zafar's 1st Janaza Held in Tongi: New Age Online
No ratings yet
Kazi Zafar's 1st Janaza Held in Tongi: New Age Online
6 pages
Effect Size in Research Analysis
No ratings yet
Effect Size in Research Analysis
10 pages
Spreading The Message of Islam Dawah Tabligh
No ratings yet
Spreading The Message of Islam Dawah Tabligh
2 pages
Analisis Efisiensi Pengelolaan Tempat Tidur Rumah Sakit Berdasarkan Grafik Barber Johnson Di Rs Pku Muhammadiyah Yogyakarta Tahun 2015
No ratings yet
Analisis Efisiensi Pengelolaan Tempat Tidur Rumah Sakit Berdasarkan Grafik Barber Johnson Di Rs Pku Muhammadiyah Yogyakarta Tahun 2015
8 pages
Conflict Management
100% (1)
Conflict Management
28 pages
Ashley Foster ELA Lesson Plan 2
No ratings yet
Ashley Foster ELA Lesson Plan 2
9 pages
Group A: English For Speaking Exercise 5
No ratings yet
Group A: English For Speaking Exercise 5
2 pages
The Verbal Ability Handbook
No ratings yet
The Verbal Ability Handbook
60 pages
Personal Construct Theory Overview
No ratings yet
Personal Construct Theory Overview
15 pages
4 Key Management Functions Explained
No ratings yet
4 Key Management Functions Explained
7 pages
Dokumen Dari Martha Patricia Purba
No ratings yet
Dokumen Dari Martha Patricia Purba
9 pages
Unit 1: Introduction To Short Vowels
No ratings yet
Unit 1: Introduction To Short Vowels
8 pages
Unit 3 Newsletter - The Move Toward Freedom
No ratings yet
Unit 3 Newsletter - The Move Toward Freedom
5 pages
Lecture#4 - Resilience1
No ratings yet
Lecture#4 - Resilience1
37 pages
English Grammar Practice
No ratings yet
English Grammar Practice
12 pages
Taslk 1-3
No ratings yet
Taslk 1-3
7 pages
English Workshop for 11th Grade
0% (1)
English Workshop for 11th Grade
3 pages
Gita Assessment 1
No ratings yet
Gita Assessment 1
1 page
Ojt LP 02
No ratings yet
Ojt LP 02
5 pages
FP Grade 3 English HL LP 28 - 30 April
No ratings yet
FP Grade 3 English HL LP 28 - 30 April
3 pages
Radcliffe Brown PDF
No ratings yet
Radcliffe Brown PDF
10 pages
Lesson Plan
No ratings yet
Lesson Plan
3 pages
Free Will Could All Be An Illusion, Scientists Suggest After Study Shows Choice May Just Be Brain Tricking Itself - Science - News - The Independent
No ratings yet
Free Will Could All Be An Illusion, Scientists Suggest After Study Shows Choice May Just Be Brain Tricking Itself - Science - News - The Independent
9 pages
Music Therapy Intake Form
No ratings yet
Music Therapy Intake Form
5 pages
Instruction Planning Models For Mother Tongue Instruction
No ratings yet
Instruction Planning Models For Mother Tongue Instruction
12 pages
TSSC PUBLISHER EDITED Final 1
No ratings yet
TSSC PUBLISHER EDITED Final 1
26 pages
Cohesive Devices Error Analysis
No ratings yet
Cohesive Devices Error Analysis
88 pages
Teacch y Tea
No ratings yet
Teacch y Tea
14 pages