A Study on a Joint Deep Learning Model for Myanmar Text Classification
Myat Sapal Phyu, Faculty of Computer Science, University of Information Technology, Yangon, Myanmar ([email protected])
Khin Thandar Nwet, Faculty of Computer Science, University of Information Technology, Yangon, Myanmar ([email protected])

Abstract
Text classification is one of the most critical research areas in natural language processing (NLP). Recently, most NLP tasks have achieved remarkable performance by using deep learning models. Generally, deep learning models require a huge amount of data. This paper uses pre-trained word vectors to handle this resource-demanding problem and studies the effectiveness of a joint Convolutional Neural Network and Long Short-Term Memory (CNN-LSTM) model for Myanmar text classification. A comparative analysis is performed against baseline Convolutional Neural Networks (CNN), Recurrent Neural Networks (RNN), and their combined model, CNN-RNN.

Keywords—text classification, CNN, RNN, CNN-RNN, CNN-LSTM, deep learning model.
I. INTRODUCTION

Text classification is one of the interesting application areas in NLP, alongside sentiment analysis, machine translation, automatic summarization, and others. Nowadays, information overload is a serious problem: people waste too much time selecting the information they are interested in because they receive a vast amount of information from various internet sources. A powerful text classifier can extract useful content from massive amounts of information and sort text into associated categories, supporting tasks such as sentiment analysis, spam detection, article and book categorization, and hate speech detection. Text classification is therefore among the most active research areas in NLP. Historically, researchers performed text classification using various machine learning algorithms and their variants. In recent years, deep learning models have attracted great interest in text classification due to their ability to capture semantic word relationships. However, deep learning models typically require a lot of data to accomplish a specific task. Although most pre-trained vectors are trained on general datasets, their use can handle this data requirement: a pre-trained model can be adapted to a specific task by transfer learning with an appropriate amount of data.

This paper focuses in particular on the classification of Myanmar articles. Words are considered the basic units in this paper. Since the Myanmar language has no standard rule for determining word boundaries, determining them is an important pre-processing step. As Myanmar is a morphologically rich language, good word representations are difficult to learn because many word forms seldom occur in the training corpus. The BPE tokenizer (https://github.com/bheinzerling/bpemb) is used to determine word boundaries. The segmented words are converted into an embedding matrix by using pre-trained vectors trained on Wikipedia with the skip-gram model [2]. The effective use of pre-trained vectors is very supportive for resource-scarce languages. In this paper, CNN and LSTM models are applied jointly to Myanmar text classification: the convolution process is performed on the embedding matrix to select features, and the LSTM model captures long-term information. The max-pooling layer is dropped from the CNN model to prevent the loss of context information.

The rest of the paper is organized as follows. Section 2 addresses work related to text classification and other application areas for both Myanmar and English. Section 3 describes background knowledge of the Myanmar language. Section 4 discusses the components of the joint CNN-LSTM model. Section 5 explains the findings of the experiment, and the paper is concluded in Section 6.
II. RELATED WORK

Wang et al. [8] proposed a regional CNN-LSTM model that captures regional information with a CNN and predicts valence-arousal (VA) with an LSTM; it outperforms regression-based, lexicon-based, and conventional NN-based methods on two sentiment analysis datasets. Kwaik et al. [5] proposed a deep learning model that combines LSTM and CNN models for dialectal Arabic sentiment analysis; this model performs better than two baseline models on three datasets. Zhang et al. [9] constructed a CNN model on top of an LSTM: the output of the LSTM is further processed by the CNN, and the combined model outperformed the baseline models in terms of accuracy. Kim [4] conducted experiments with variations of the CNN model (CNN-rand, CNN-static, and CNN-multichannel) on top of pre-trained vectors for sentence classification; these CNN models beat the state of the art in 4 out of 7 tasks.
Previous research on deep learning and machine learning models for the Myanmar language in speech recognition, named entity recognition, and text classification is also reviewed. Aye et al. [1] aimed to enhance sentiment analysis for the Myanmar language in the food and restaurant domain by considering intensifier and objective words, and improved the prediction accuracy. Hlaing et al. [3] applied an LSTM-RNN to Myanmar speech synthesis. Mo et al. [6] annotated a named-entity-tagged corpus for the Myanmar language and comparatively evaluated a neural sequence model against a baseline CRF. In our previous work [7], we performed a comparative analysis of CNN and RNN models at both the syllable and word level using three pre-trained vectors, and also collected and annotated six Myanmar article datasets. In this paper, a comparative analysis is performed on a joint CNN-LSTM model against baseline CNN, RNN, and their combined model by using the Myanmar article datasets.
III. MYANMAR LANGUAGE BACKGROUND

The Myanmar language is the official language of the Republic of the Union of Myanmar. It is a morphologically rich language, and Myanmar sentences basically follow the subject-object-verb pattern. The Myanmar script is written from left to right, and the characters are rounded in appearance. There is no regular inter-word spacing in the Myanmar language as there is in English, though spaces are used to mark phrases. Sentences are clearly delimited by a sentence boundary marker. Myanmar words are constructed from one or more syllables. There are thirty-three basic consonants, eight vowels, and eleven medials. A Myanmar syllable is constructed from one initial consonant (C), zero or more medials (M), zero or more vowels (V), and optional dependent signs. Words are considered the basic unit in this paper.
A. Pre-processing

Pre-processing basically involves two steps: 1) removing unnecessary characters, and 2) determining the word boundaries. As this study focuses on Myanmar text classification, characters outside the Myanmar Unicode range [U+1000-U+104F] (https://mcf.org.mm/myanmar-unicode/) are removed in order to ignore non-Myanmar characters. The Myanmar digits "၀-၉" [U+1040-U+1049] and the punctuation marks "၊" and "။" [U+104A-U+104B] are also removed. As the Myanmar language has no rule for determining word boundaries, the boundaries of words must be determined explicitly. In this work, word boundaries are determined by the BPE tokenizer.
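As a concrete illustration, the two pre-processing steps might look like the following; this is a minimal sketch, assuming the bpemb package for the BPE tokenizer, and the vocabulary size chosen here is illustrative rather than the setting used in the paper.

```python
import re
from bpemb import BPEmb  # pre-trained BPE models from https://github.com/bheinzerling/bpemb

# Step 1: keep only characters in the Myanmar block [U+1000-U+104F] (plus spaces),
# then drop the Myanmar digits [U+1040-U+1049] and punctuation U+104A/U+104B.
NON_MYANMAR = re.compile(r"[^\u1000-\u104F\s]")
DIGITS_AND_PUNCT = re.compile(r"[\u1040-\u104B]")

def clean(text):
    text = NON_MYANMAR.sub("", text)
    text = DIGITS_AND_PUNCT.sub("", text)
    return " ".join(text.split())

# Step 2: determine word boundaries with the Burmese BPE tokenizer
# (vs=10000 is an illustrative vocabulary size, not the paper's setting).
bpemb_my = BPEmb(lang="my", vs=10000, dim=300)
tokens = bpemb_my.encode(clean("Corona ဗိုင်းရပ်စ် ကာကွယ်ရေး"))
```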
B. Pre-trained Model

Reusing a pre-trained model by transfer learning on a new task is very efficient because deep learning models can then be trained without much data. Typically, most NLP tasks do not have sufficient labelled data to train such complex models from scratch. In this paper, we use the pre-trained vector file publicly released by the Facebook AI Research (FAIR) lab (https://fasttext.cc/docs/en/pretrained-vectors.html). It was trained on Wikipedia using the fastText skip-gram model with 300 dimensions. The pre-trained model is used as the starting point instead of learning from scratch.
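Loading the vectors might look like this; a sketch assuming the Burmese text-format file wiki.my.vec from the FAIR release.

```python
from gensim.models import KeyedVectors

# wiki.my.vec: Burmese fastText vectors (text format, 300-dimensional)
# from https://fasttext.cc/docs/en/pretrained-vectors.html
word_vectors = KeyedVectors.load_word2vec_format("wiki.my.vec", binary=False)
print(word_vectors.vector_size)  # -> 300
```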
C. Construction of Embedding Matrix

The segmented words are transformed into word vectors by matching them against the vocabulary in the pre-trained vectors file. Figure 1 shows the construction of the embedding matrix. Firstly, Myanmar text data are extracted from online news websites. Secondly, unnecessary characters are removed from the extracted text. Then, word boundaries are determined by the BPE tokenizer, although the results are sometimes meaningless since it is a frequency-based tokenizer. Finally, the tokenized words are matched against the pre-trained word vectors file in order to construct the word embedding matrix for the embedding layer.
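The matching step can be sketched as follows, assuming the word_vectors object loaded above and a Keras-style word_index that maps each token to an integer index; mapping out-of-vocabulary words to zero vectors is a common choice assumed here, not a detail stated in the paper.

```python
import numpy as np

def build_embedding_matrix(word_index, word_vectors, dim=300):
    # Row i holds the pre-trained vector of the token with index i;
    # row 0 is reserved for padding, and OOV tokens stay all-zero.
    matrix = np.zeros((len(word_index) + 1, dim), dtype="float32")
    for word, i in word_index.items():
        if word in word_vectors:
            matrix[i] = word_vectors[word]
    return matrix
```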
Table I shows a sample of Myanmar text pre-processing with the sample sentence "Corona ဗိုင်းရပ်စ် ကာကွယ်ရေး မြန်မာတိုးမြှင့်ပြင်ဆင်။". The non-Myanmar token "Corona" is removed, and the remaining text is then segmented into words as "ဗိုင်းရပ်စ်_ကာကွယ်ရေး_မြန်မာ_တိုးမြှင့်_ပြင်ဆင်", where "_" marks the word boundaries.

TABLE I. SAMPLE OF TEXT PRE-PROCESSING
Input Sentence:     Corona ဗိုင်းရပ်စ် ကာကွယ်ရေး မြန်မာတိုးမြှင့်ပြင်ဆင်။
English Meaning:    Myanmar ramps up the defense against Coronavirus
Word Segmentation:  ဗိုင်းရပ်စ်_ကာကွယ်ရေး_မြန်မာ_တိုးမြှင့်_ပြင်ဆင်

Figure 1. Construction of the embedding matrix. [Flowchart: Text Documents -> Extract Sentences -> Remove Unnecessary Characters -> Word Segmentation -> Match with Pre-trained Vectors -> Embedding Matrix]
IV. A JOINT CNN-LSTM MODEL

The proposed joint CNN-LSTM model basically consists of four components: 1) an embedding layer, 2) a convolution layer, 3) an LSTM layer, and 4) a fully connected and output layer. Figure 2 illustrates the joint CNN-LSTM model.

A. Embedding Layer

In the embedding layer, the segmented words are transformed into vector representations by matching them against the pre-trained vectors file. The pre-trained vectors file acts like a vocabulary in which each word is attached to a corresponding vector that captures context information.
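In Keras, this corresponds to an Embedding layer initialized with the matrix constructed in Section III-C; a minimal sketch (the paper does not state whether the embedding weights were frozen during training, so trainable=False is an assumption).

```python
from keras.layers import Embedding

# embedding_matrix and max_len are assumed from the construction step above.
embedding_layer = Embedding(
    input_dim=embedding_matrix.shape[0],  # vocabulary size (+1 padding row)
    output_dim=300,                       # fastText vector dimension
    weights=[embedding_matrix],           # pre-trained vectors as the starting point
    input_length=max_len,                 # padded sentence length (assumed)
    trainable=False,                      # assumption: keep pre-trained vectors fixed
)
```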
B. Convolution Layer

In the convolution layer, the convolution is performed with the ReLU activation function and a stride size of 1. The convolution layer selects features, and the result of the convolution process takes the form of a feature map. The max-pooling layer is discarded because it keeps only the most important features while ignoring the rest, which can lead to losing context information.

C. LSTM Layer

The LSTM layer is used instead of the max-pooling layer to capture context information.

D. Fully Connected Layer and Output Layer

In the final output layer, the probability of each class is predicted by using the sigmoid activation function. In the hyper-parameter setting, the Adam optimization function and the binary_crossentropy loss function are used with a dropout rate of 0.5 and a batch size of 16 over 10 epochs. Moreover, l2=0.01 is set in the kernel and bias regularizers to reduce overfitting.

Figure 2. A joint CNN-LSTM model. [Architecture: Embedding Matrix -> Convolution Layer -> Feature Map -> Dropout -> LSTM Layer -> Fully Connected Layer (sigmoid) -> Output: P(c1), P(c2), ..., P(cn), the probability of each class]
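Putting the four components together, the model might be assembled as follows; a sketch under the stated hyper-parameters, in which the number of filters (128), kernel size (5), and LSTM units (100) are illustrative assumptions, since the paper does not report them.

```python
from keras.models import Sequential
from keras.layers import Conv1D, Dropout, LSTM, Dense
from keras.regularizers import l2

n_classes = 5  # sport, art, crime, health, education

model = Sequential()
model.add(embedding_layer)  # pre-trained embedding layer from Section IV-A
model.add(Conv1D(filters=128, kernel_size=5, strides=1, activation="relu",
                 kernel_regularizer=l2(0.01), bias_regularizer=l2(0.01)))
model.add(Dropout(0.5))     # no max-pooling: the feature map feeds the LSTM directly
model.add(LSTM(100))
model.add(Dense(n_classes, activation="sigmoid",
                kernel_regularizer=l2(0.01), bias_regularizer=l2(0.01)))

model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
# X_train/y_train: padded index sequences and one-hot labels (assumed prepared earlier).
model.fit(X_train, y_train, epochs=10, batch_size=16)
```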
V. EXPERIMENT
A. Dataset

Myanmar text data are collected from various Myanmar news websites, including 7Day Daily (https://7daydaily.com/), DVB (http://burmese.dvb.no/), The Voice (http://thevoicemyanmar.com/), Thit Htoo Lwin (http://www.thithtoolwin.com/), and Myanmar Wikipedia (https://my.wikipedia.org/wiki/). The text data cover five categories: art, sport, crime, education, and health. Each category is collected in tab-separated values (.tsv) files, and each row contains a sentence annotated with the corresponding label. The text data are converted from Zawgyi to Unicode with the Rabbit converter (https://www.rabbit-converter.org/), then shuffled and split 75%/25% into train and test sets for each topic; in total there are 36,113 training sentences and 12,037 testing sentences. The details of the dataset are listed in Table II.
TABLE II. MYANMAR TEXT DATASET FOR FIVE TOPICS
Class       Train    Test     Total
Sport       9,919    3,242    13,161
Art         9,483    3,173    12,656
Crime       6,718    2,184    8,902
Health      4,743    1,632    6,375
Education   5,250    1,806    7,056
Total       36,113   12,037   48,150
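For illustration, reading the category files and reproducing the 75%/25% split might look like this; the file names and the random seed are assumptions, not the released data layout.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

CATEGORIES = ["sport", "art", "crime", "health", "education"]

# One labelled sentence per row in each .tsv file ("sport.tsv" etc. are
# illustrative names, not the released files).
frames = [pd.read_csv("%s.tsv" % c, sep="\t", names=["sentence", "label"])
          for c in CATEGORIES]
data = pd.concat(frames, ignore_index=True)

# Shuffle and split 75%/25%; stratifying on the label keeps the
# per-topic proportions, matching the per-topic split described above.
train_df, test_df = train_test_split(
    data, test_size=0.25, shuffle=True, stratify=data["label"], random_state=42)
```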
B. Comparison Models

In this work, a comparative analysis is performed between the joint CNN-LSTM model and the baseline CNN, RNN, and their combined model CNN-RNN.

1) CNN: CNN is a feed-forward neural network, and a basic CNN contains three layers: a convolution layer, a pooling layer, and a fully connected layer. The ReLU activation is most commonly used in the convolution layer, max-pooling in the pooling layer, and the softmax function in the fully connected layer. In this paper, a simple CNN with one convolution layer and a sigmoid output function is used in the experiment.
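A sketch of this baseline follows, again with illustrative filter settings; whether the baseline kept a max-pooling layer is not stated explicitly, so GlobalMaxPooling1D is an assumption here.

```python
from keras.models import Sequential
from keras.layers import Conv1D, GlobalMaxPooling1D, Dense

cnn = Sequential()
cnn.add(embedding_layer)                    # same pre-trained embedding as above
cnn.add(Conv1D(128, 5, activation="relu"))  # single convolution layer
cnn.add(GlobalMaxPooling1D())               # assumption: pooling kept in the baseline
cnn.add(Dense(5, activation="sigmoid"))     # sigmoid output over the five classes
cnn.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```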
2) RNN: RNN is an artificial neural network with an internal memory that allows information to persist. It learns from previous inputs and applies the same function to every element of the input sequence. An RNN produces the output $y_t$ from the hidden state $h_t$ as in equations (1) and (2) (see https://en.wikipedia.org/wiki/Recurrent_neural_network):

$y_t = f(W_y h_t)$  (1)
$h_t = \sigma(W_h h_{t-1} + W_x x_t)$  (2)
3) CNN-RNN: In this combined model, the CNN is used for feature extraction, and the RNN makes predictions from the preceding words. The drawback of the CNN model is its locality: it requires many convolution layers to capture long-term information. The RNN is a biased model: it predicts the semantics of words from the preceding words, and its performance degrades when predicting long texts. The LSTM model is an extension of the RNN model that better preserves long-term dependencies.
C. Experimental Setup

The experiments are implemented on Google Colaboratory (Colab, https://colab.research.google.com/), which provides a Jupyter notebook environment (https://jupyter.org/install.html) and executes code in Python 3.6.9. No setup is required; notebooks run in the cloud with a maximum session lifetime per user. A key advantage of Colab is its support for the Tesla K80 GPU accelerator. Gensim 3.6.0 (https://pypi.org/project/gensim/3.6.0/) is used to load the pre-trained word vector file. Keras 2.2.5 (https://keras.io/) is run on the TensorFlow backend; it enables fast experimentation with deep learning models and can run on both CPU and GPU. Large files such as the pre-trained word vector file can be stored in Google Drive to reduce upload time. sklearn.metrics from scikit-learn 0.22.1 (https://scikit-learn.org/stable/) is used to evaluate the classification performance of the models.

D. Experimental Result

The performance of the proposed CNN-LSTM model is compared with the comparison models described in Section V-B, as listed in Table III. The experiment is performed on Myanmar news article data covering five topics; the details of the dataset are listed in Table II. The experiment consists of three main parts: pre-processing, word embedding, and text classification. In the pre-processing phase, we remove unnecessary characters and segment words with the BPE tokenizer. The word embedding matrix is constructed from the pre-trained model via the Gensim library, and the CNN, RNN, CNN-RNN, and CNN-LSTM models are trained using Keras, a model-level library. All models are trained with 10 epochs, a batch size of 16, a dropout rate of 0.5, l2=0.01 bias and kernel regularizers, the Adam optimizer, and the binary_crossentropy loss function. The performance of each model is measured with scikit-learn's classification metrics, which report the precision, recall, and F1-score of each class. According to the results, the joint CNN-LSTM model performs best in F1-score in all classes, with the CNN model matching it on the crime and education domains.

The training time of each model is also measured. The CNN model requires the minimum training time (6 min 1 sec) because only one convolution layer is used. Although the CNN-LSTM model performs better than CNN in three topics in terms of F1-score, the CNN model trains at least 4x faster than the CNN-LSTM model (24 min 5 sec) and at least 3x faster than the remaining models, CNN-RNN (12 min 56 sec) and RNN (13 min 15 sec), in most datasets. To sum up, the joint CNN-LSTM model outperforms the CNN, RNN, and CNN-RNN models in most domains, but it requires much more training time; the simple CNN with one convolution layer is faster to train, although its accuracy is slightly lower than the CNN-LSTM model's in some topics.
TABLE III. COMPARISON OF TEXT CLASSIFICATION PERFORMANCE ON FIVE TOPICS
(within each metric, the columns are RNN / CNN / CNN-RNN / CNN-LSTM)

Class       Precision                  Recall                     F1-score
Sport       0.91 0.91 0.88 0.93        0.93 0.93 0.95 0.94        0.92 0.92 0.92 0.94
Art         0.85 0.86 0.87 0.88        0.91 0.87 0.90 0.90        0.88 0.87 0.88 0.89
Crime       0.87 0.86 0.89 0.88        0.91 0.92 0.90 0.93        0.89 0.89 0.90 0.90
Health      0.94 0.92 0.91 0.91        0.87 0.88 0.89 0.90        0.90 0.90 0.90 0.91
Education   0.91 0.90 0.93 0.91        0.85 0.84 0.84 0.86        0.88 0.87 0.88 0.88
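The per-class scores above come from scikit-learn's classification metrics; the evaluation step might look like this (variable names are assumed from the training sketch earlier).

```python
import numpy as np
from sklearn.metrics import classification_report

# Per-class probabilities -> predicted label by argmax.
y_prob = model.predict(X_test)
y_pred = np.argmax(y_prob, axis=1)
y_true = np.argmax(y_test, axis=1)  # assuming one-hot test labels

print(classification_report(
    y_true, y_pred,
    target_names=["sport", "art", "crime", "health", "education"]))
```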
VI. CONCLUSION

This paper performs comparative experiments on the joint CNN-LSTM model against CNN, RNN, and CNN-RNN models on five categories: sport, art, crime, health, and education. We initially experimented with many convolution layers in the CNN model to get higher performance and to capture long-term information, but the performance did not increase as expected. Moreover, the max-pooling layer of the CNN model led to the loss of local context information. Therefore, we use only one convolution layer to extract features, and an LSTM layer instead of the max-pooling layer, in order to capture long-term information and reduce the loss of context information. According to the experiments, the joint CNN-LSTM model performs better than the CNN, RNN, and CNN-RNN models in most domains, but it takes much more training time than the remaining models.

ACKNOWLEDGMENT

We deeply thank the anonymous reviewers who gave their precious time to review our manuscript. We greatly thank all of the researchers who shared pre-trained word vectors publicly; their work was very helpful in accomplishing our work and is very useful for resource-scarce languages. We would also like to thank a friend who assisted in collecting and annotating the Myanmar text datasets.

REFERENCES

[1] Aye YM, Aung SS. Enhanced Sentiment Classification for Informal Myanmar Text of Restaurant Reviews. In 16th International Conference on Software Engineering Research, Management, and Applications (SERA), IEEE, 2018: 31-36.
[2] Grave E, Bojanowski P, Gupta P, Joulin A, Mikolov T. Learning Word Vectors for 157 Languages. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), 2018.
[3] Hlaing AM, Pa WP, Thu YK. Enhancing Myanmar Speech Synthesis with Linguistic Information and LSTM-RNN. In Proc. 10th ISCA Speech Synthesis Workshop, 2019: 189-193.
[4] Kim Y. Convolutional Neural Networks for Sentence Classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014: 1746-1751.
[5] Kwaik KA, Saad M, Chatzikyriakidis S, Dobnik S. LSTM-CNN Deep Learning Model for Sentiment Analysis of Dialectal Arabic. In International Conference on Arabic Language Processing, Springer, Cham, 2019: 108-121.
[6] Mo HM, Soe KM. Myanmar named entity corpus and its use in syllable-based neural named entity recognition. International Journal of Electrical and Computer Engineering (IJECE), 2020.
[7] Phyu SP, Nwet KT. Article Classification in Myanmar Language. In Proceedings of the 2019 International Conference on Advanced Information Technologies (ICAIT), IEEE, 2019: 188-193.
[8] Wang J, Yu LC, Lai KR, Zhang X. Dimensional sentiment analysis using a regional CNN-LSTM model. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), 2016: 225-230.
[9] Zhang J, Li Y, Tian J, Li T. LSTM-CNN Hybrid Model for Text Classification. In 2018 IEEE 3rd Advanced Information Technology, Electronic and Automation Control Conference (IAEAC), IEEE, 2018: 1675-1680.