Natural Language Processing
Abstract- Natural language processing is a widely discussed and researched topic nowadays. As one of the oldest areas of research in machine learning, it is used in major fields such as machine translation, speech recognition and text processing. Natural language processing has brought major breakthroughs in the field of computation and AI. The various algorithms used for natural language processing depend mainly on recurrent neural networks. Different text and speech processing algorithms are discussed in this review paper and their working is explained with examples. The results of the various algorithms show the development done in this field over the past decade or so. We have tried to differentiate between the various algorithms and to outline their future scope of research. A gap analysis between the different algorithms is presented in the paper, and the applications of these algorithms are also explained. Natural language processing has not attained perfection to date, but with continuous improvement the field can surely approach that line. Different AI systems now use natural language processing algorithms to recognize and process voice commands given by the user.
I. Introduction

Andrew Ng has long predicted that as speech recognition goes from 95% accurate to 99% accurate, it will become a primary way that we interact with computers. The idea is that this 4% accuracy gap is the difference between annoyingly unreliable and incredibly useful. Thanks to deep learning, we are finally cresting that peak.

Nowadays artificial intelligence is a widely discussed buzzword and is under rapid development. Broadly speaking, artificial intelligence is a computer program that can do something smart like a human; it is a machine mimicking a human to perform tasks in his absence, sometimes in a better and more efficient way.

Machine learning is a subset of AI. The intelligence of a machine is improved through learning algorithms and the analysis of different types of data. Deep learning and neural networks are, in turn, a subset of machine learning. Deep learning algorithms analyze different data sets repeatedly and improve the machine's knowledge according to the output obtained.

Natural language processing is an integral area of computer science in which machine learning and computational linguistics are broadly used. This field is mainly concerned with making human-computer interaction easy yet efficient: the machine learns the syntax and meaning of human language, and NLP involves making computer systems perform meaningful tasks with natural, human-understandable language.

The reason why natural language processing is so important for the future is that it helps us build models and processes which take chunks of information as input, in the form of voice or text or both, and manipulate them as per the algorithm inside the computer. Thus the input can be speech, text or image, while the output of an NLP system can be processed speech as well as written text.

The algorithms developed to increase the efficiency of processing language in text form which we are going to discuss here are:
● Long short term memory
● Sequence 2 Sequence model
● Named Entity Recognition model
● User preference graph model
● Word Embedding model
● Feature based sentence extraction using fuzzy inference rules
● Template based algorithm using automatic text summarization

Similarly, language can be processed even if the input is in speech form. Various algorithms have been developed for this; the best among them are:
● Word Recognition
● Acoustic Modeling
● Connectionist temporal classification
● Phrase based machine translation
● Neural machine translation
● Google neural machine translation

In this review paper different algorithms and models are discussed, along with the various improvements made in the field of natural language processing. We provide a basic idea of all the algorithms mentioned above: on what basis they work, their efficiency, and the different applications where they can be implemented for the betterment of society. We worked under the Department of Computer Science, SVKM's NMIMS Shirpur, to complete this review paper.
II. Algorithms, models and Approach to problem:

II.I. Text processing algorithms

A. LSTM:
LSTM stands for long short term memory [1]. The recurrent neural network is the primary element of the LSTM model. A recurrent neural network is a chunk of neural network which can remember values.

LSTMs are a special kind of recurrent neural network which can remember previous inputs over arbitrary time intervals and predict the output. The model is used for training a machine through input sets and is one of the learning models broadly used in natural language processing. Stored values are not modified as learning proceeds. The LSTM model is unable to edit the input sets, but it can learn from them by computing their frequency while processing them several times. The first step in an LSTM decides what data is flushed out of the network; this is decided by the forget gate layer, where '0' represents 'completely forget' and '1' represents 'completely keep'. The next step decides what to store in the cell state, which is decided by the input gate layer. The next layer, called a tanh layer, creates a vector of new candidate values which is combined with the input values to update the state. Some LSTM models have an extra connection called a peephole connection, which lets the gates check the status of the cell state before dropping data from the network.
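The gate sequence just described (forget gate, input gate, tanh candidate layer) can be sketched in a few lines of NumPy. This is a minimal illustration under assumed random weight matrices W and biases b, not a trainable implementation.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM step: forget gate, input gate, candidate values, output gate."""
    z = np.concatenate([h_prev, x])
    f = sigmoid(W["f"] @ z + b["f"])      # forget gate: 0 = completely forget, 1 = completely keep
    i = sigmoid(W["i"] @ z + b["i"])      # input gate: what to store in the cell state
    c_hat = np.tanh(W["c"] @ z + b["c"])  # tanh layer: vector of new candidate values
    c = f * c_prev + i * c_hat            # update the cell state
    o = sigmoid(W["o"] @ z + b["o"])      # output gate
    h = o * np.tanh(c)                    # new hidden state
    return h, c

H, X = 4, 3                               # toy hidden and input sizes
rng = np.random.default_rng(0)
W = {k: rng.normal(size=(H, H + X)) for k in "fico"}
b = {k: np.zeros(H) for k in "fico"}
h, c = lstm_step(rng.normal(size=X), np.zeros(H), np.zeros(H), W, b)
print(h.shape, c.shape)                   # (4,) (4,)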
B. Seq2Seq model:
The traditional seq2seq model contains two recurrent neural networks, i.e. an encoder network and a decoder network [2].

Fig 1: Recurrent neural network structure

Each box represents an RNN cell, most commonly an LSTM cell. In this model every input is encoded into a fixed size vector which is later decoded by the decoder.

Fig 2: Encoder Decoder structure and working

Firstly, a vocabulary list is built and compiled using embeddings so that the model can identify correct grammar and syntax. The vocabulary set is processed to check for the occurrences of words and to classify frequently used, rarely used and unique words in the vocabulary. The words are then replaced with ids, and based on these ids the reply suggestion is decoded and given as output. The following tags are used in the model while compiling the input:

EOS: End of sentence.
PAD: Filler.
GO: Start decoding.
UNK: Unknown; word not in vocabulary.

The following is an example of the Seq2seq model at work:

Question: How are you?
Answer: I am fine.

This pair will be converted to:

Question: [PAD, PAD, PAD, PAD, PAD, PAD, "?", "you", "are", "How"]
Answer: [GO, "I", "am", "fine", ".", EOS, PAD, PAD, PAD, PAD].
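The id replacement and padding above can be reproduced with a short Python sketch. The tiny vocabulary and the bucket length of 10 are invented for the example; a real system builds the vocabulary from a large corpus.

# Special tags used while compiling the input.
PAD, GO, EOS, UNK = 0, 1, 2, 3
vocab = {"how": 4, "are": 5, "you": 6, "?": 7, "i": 8, "am": 9, "fine": 10, ".": 11}

def encode_question(tokens, bucket=10):
    """Map words to ids (UNK if absent), reverse them, and left-pad to the bucket size."""
    ids = [vocab.get(t.lower(), UNK) for t in tokens]
    ids = ids[::-1]  # the encoder input is fed in reverse order, as in the example above
    return [PAD] * (bucket - len(ids)) + ids

def encode_answer(tokens, bucket=10):
    """Prefix GO, suffix EOS, and right-pad the decoder target."""
    ids = [GO] + [vocab.get(t.lower(), UNK) for t in tokens] + [EOS]
    return ids + [PAD] * (bucket - len(ids))

print(encode_question(["How", "are", "you", "?"]))  # [0, 0, 0, 0, 0, 0, 7, 6, 5, 4]
print(encode_answer(["I", "am", "fine", "."]))      # [1, 8, 9, 10, 11, 2, 0, 0, 0, 0]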
C. Named entity recognition Model:
As the name suggests, named entity recognition is used to identify relevant names and classify them by the entity they belong to. The NER model finds names of people, places and other relevant entities in the input data set, which might be in text or speech form [3]. The NER model works in two phases. The first phase divides the text into segments or chunks; in the second phase these chunks are classified, in the form of tokens, into predefined categories such as name of person, organization, location etc. Formatting such as bolding and capitalization is ignored. Example: $ Mike ^ (ENAMEX, name), who is a student of $ New York University ^ (ENAMEX, org), $ New York University ^ (ENAMEX, location), scored $ 90% ^ (NUMEX, percent) in his seminar on the $ 19th of March ^ (TIMEX, date). This model is widely useful in language and speech processing when a user preference graph is to be created for smart replies or search suggestions.
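The two phases (chunking, then classification into ENAMEX/NUMEX/TIMEX style categories) can be sketched as below. The tiny gazetteer and the regular expressions are illustrative assumptions standing in for a trained model, which would learn these decisions from annotated corpora.

import re

# Toy gazetteer standing in for a learned entity classifier.
GAZETTEER = {"mike": ("ENAMEX", "name"), "new york university": ("ENAMEX", "org")}

def tag_entities(text):
    """Phase 1: find chunks; phase 2: classify chunks into predefined categories."""
    tags = []
    lowered = text.lower()
    for phrase, label in GAZETTEER.items():
        if phrase in lowered:
            tags.append((phrase, label))
    for percent in re.findall(r"\d+%", text):                     # NUMEX: percentages
        tags.append((percent, ("NUMEX", "percent")))
    for date in re.findall(r"\d+(?:st|nd|rd|th) of \w+", text):   # TIMEX: dates
        tags.append((date, ("TIMEX", "date")))
    return tags

print(tag_entities("Mike, a student of New York University, scored 90% on the 19th of March."))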
D. User preference graph:
A user preference graph is used to capture a set of user choices. When a user repeatedly chooses specific tenses, adjectives, conjunctions, prepositions etc., a preference graph is created, so that when the user is composing a similar type of sentence the model suggests the next words by calculating probability [4]. These words are mapped to each other, and hence a preference graph is created for the particular user. In a large scale implementation of this model, people with similar user preference graphs are grouped together so that the suggestions have wider scope and variety. This can be implemented in smart reply, smart typing suggestions, auto reply systems etc.

Fig 3: User preference graph example
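A minimal sketch of such a preference graph follows: each observed word transition strengthens an edge, and the next word is suggested by the highest probability edge. The simple counting scheme is an assumption made for illustration.

from collections import defaultdict

class PreferenceGraph:
    """Directed graph of word transitions observed for one user."""
    def __init__(self):
        self.edges = defaultdict(lambda: defaultdict(int))

    def observe(self, sentence):
        words = sentence.lower().split()
        for prev, nxt in zip(words, words[1:]):
            self.edges[prev][nxt] += 1  # strengthen the edge each time the user repeats a pattern

    def suggest(self, word):
        followers = self.edges[word.lower()]
        if not followers:
            return None
        total = sum(followers.values())
        best = max(followers, key=followers.get)
        return best, followers[best] / total  # suggestion and its probability

graph = PreferenceGraph()
graph.observe("i am going to the office")
graph.observe("i am going to the gym")
print(graph.suggest("going"))  # ('to', 1.0)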
E. Word Embedding:
Word embedding is derived from feature learning and language modeling in natural language processing, where words and phrases from the vocabulary are mapped onto vectors of real numbers.
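Because word embedding maps words to real-valued vectors, closeness in meaning can be measured with cosine similarity, as in the sketch below. The three-dimensional vectors are invented for illustration; real embeddings have hundreds of dimensions and are learned from corpora.

import numpy as np

# Made-up 3-dimensional embeddings; trained models use 100-1000 dimensions.
embeddings = {
    "king":  np.array([0.8, 0.6, 0.1]),
    "queen": np.array([0.7, 0.7, 0.1]),
    "apple": np.array([0.1, 0.2, 0.9]),
}

def cosine(u, v):
    """Cosine of the angle between two word vectors: closer to 1 = more similar."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine(embeddings["king"], embeddings["queen"]))  # high: related words
print(cosine(embeddings["king"], embeddings["apple"]))  # low: unrelated words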
F. Phrase Based Machine Translation:
PBMT is one mode of Statistical Machine Translation. It uses predictive modeling to translate text [5]. These models are created with the help of, or learned from, large bilingual unstructured sets of texts, with whose help the most probable output is created.

Fig 4: Phrase based translation model

The algorithm for PBMT is as follows:
a) Break the original sentence into chunks.
b) Find all possible translations for each chunk: in this step, with the help of the corpora, we check how humans translated the text in real world sentences.
c) Generate all possible sentences and find the most likely one: with the help of the different combinations of translations from step b) we can generate more than 5000 combinations of sentences.
d) Give each one a probable score by comparing it to the training set: here the training set contains a large database of text from different books, articles, newspapers etc. By comparing each combination from the above step to the training dataset we give it a probability score (likelihood score). After trying the different combinations and passing them through our training data set, we pick the one that has the most likely chunk translation while also having a high likelihood score.
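Steps b) to d) can be sketched as a tiny phrase-table lookup followed by likelihood scoring against a language model. The phrase table and the bigram counts here are toy assumptions; a real system learns both from large bilingual and monolingual corpora.

from itertools import product

# Step b) toy phrase table: each source chunk and its candidate human translations.
phrase_table = {
    "quiero": ["i want", "i love", "i try"],
    "ir a la playa": ["to go to the beach", "go to the beach"],
}

def score(sentence, bigram_counts):
    """Step d) likelihood score: product of toy bigram counts (higher = more fluent)."""
    words = sentence.split()
    s = 1.0
    for a, b in zip(words, words[1:]):
        s *= bigram_counts.get((a, b), 0.1)  # unseen bigrams get a small penalty
    return s

bigram_counts = {("i", "want"): 5.0, ("want", "to"): 6.0, ("to", "go"): 7.0,
                 ("go", "to"): 6.0, ("to", "the"): 8.0, ("the", "beach"): 4.0}

chunks = ["quiero", "ir a la playa"]
# Step c) generate all combinations, then pick the most likely one.
candidates = [" ".join(p) for p in product(*(phrase_table[c] for c in chunks))]
print(max(candidates, key=lambda s: score(s, bigram_counts)))  # "i want to go to the beach"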
The disadvantage of PBMT [6] is that it is difficult to build and maintain. If there is a need to add a new language, then a bilingual corpus of that language must be present. For less common language pairs, translation tradeoffs are made: a translation from Gujarati to Georgian does not use a dedicated pipeline; instead the system may internally translate to English and then from English to Georgian. [7]

G. Neural Machine Translation (NMT):
Neural Machine Translation is the newest method of machine translation. It creates much more accurate translations than Statistical Machine Translation. [5] Neural Machine Translation works by sending the input through different "layers" to be processed before output. NMT is able to use algorithms to learn linguistic rules on its own from statistical models. The NMT system [6] is based on an attentional encoder-decoder and operates on sub word units. To improve the efficiency further, back-translations of the monolingual News corpus are used as additional training data. It is optimal for translation in both directions.

The strengths of NMT are that it can better handle verb order and forms and avoid verb omission. It can handle English noun collection, and phrase structure and articles are also well handled by NMT. [7] The limitations of NMT are in translating ambiguous words into German, and it also has issues with forming verb continuous tenses. The dominant problem for NMT is prepositions.

II.II Voice Processing Algorithms

H. Acoustic Modeling:
An acoustic model contains references for the individual sounds that make up a word. These individual sounds are each assigned a label, known as a phoneme. [8] Using a speech corpus and a special training algorithm, a statistical representation is created for each phoneme in a language. [9] These statistical representations are called Hidden Markov Models (HMM); each phoneme has its own HMM. An advantage of acoustic modeling as used today is that users are motivated to articulate clearly, and smartphones do high quality speech capture, with the captured speech transferred to the server error-free over IP. These models take a great deal of training data to create, and for many users they work just fine. However, for many others they do not. This is because the data used to generate the model contains samples from tens of thousands of different speakers, so the models are generic. Making specific models for individuals is not economical, and neither is making models for accents with small populations. Acoustic models are thus a limitation of the technology.
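The phoneme labeling idea can be illustrated with a toy pronunciation lexicon; in a real recognizer each phoneme label would be backed by its own trained HMM. The lexicon entries below are simplified, ARPAbet-style assumptions.

# Toy pronunciation lexicon: word -> phoneme labels (simplified).
lexicon = {
    "hello": ["HH", "AH", "L", "OW"],
    "world": ["W", "ER", "L", "D"],
}

def phoneme_sequence(words):
    """Concatenate the phoneme labels for an utterance; each label maps to one HMM."""
    return [p for w in words for p in lexicon.get(w, ["UNK"])]

print(phoneme_sequence(["hello", "world"]))
# ['HH', 'AH', 'L', 'OW', 'W', 'ER', 'L', 'D']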
I. Connectionist Temporal Classification (CTC):
Connectionist Temporal Classification is majorly used for training recurrent neural networks (RNNs); one such network is the LSTM model. In speech recognition, timing is a major variable: the input is largely similar for the same phoneme, but the timing may vary, which gives LSTMs trouble recognizing phonemes in speech audio. CTC overcomes this problem by working on the output and its scoring, thus being independent of the underlying network structure. CTC was first introduced in [9].

A CTC network has continuous output, which is later fitted through training to model the probability of a label. The outputs are sequences of labels, and sequences of labels that differ only in length are considered equal. There are many combinations of equal labels, which makes the scoring a non-trivial task. Paper [10] outlines a dynamic programming algorithm used to compute the sum of probabilities over all paths corresponding to a given labelling.
So the algorithm is as follows.

a. Turning Sounds into Bits:
This is called sampling. By the Nyquist theorem, if we sample at least twice as fast as the highest frequency, we can recover the original signal. For speech recognition a sampling rate of 16,000 samples per second is optimal. After sampling we have an array of numbers, each number representing the sound wave's amplitude at 1/16000 second intervals. We could feed the sampled data directly to a neural network, but finding patterns in such a large dump of data would be difficult and require a lot of computation, increasing the time complexity of the algorithm. So preprocessing is done.

b. Preprocessing our Sampled Data:
We reduce the time complexity of the algorithm by doing some preprocessing, which includes grouping our sampled audio into 20-millisecond long chunks of 320 samples each.

Fig 5: 20ms/320 samples per window

c. Creating a Spectrogram:
Even this short recording is difficult to process, since human speech comprises complex sounds: low pitch sounds, mid-range speech sounds and even some high range speech sounds. To reduce the time complexity further we use the Fourier transform. With the help of the Fourier transform the complex sound wave is broken into simple sound waves, and we then add up how much energy is present in each one. A spectrogram is created because, for the neural network, finding patterns in the spectrogram is far easier than finding patterns in the raw sound files. Below is the representation of the above sampled data as a spectrogram.
d. Recognizing Characters from Short Sounds:
Now that the input is straightforward to process, we can feed it to a deep neural network, the input being the 20 millisecond audio chunks. For each input the neural network tries to figure out the phoneme. Since it is a recurrent neural network, the present output also influences future predictions. For easy understanding, consider that the input is a person saying the word "HELLO". [11]

Fig 6: Working of Network for speech processing

Since CTC also deals with varying audio lengths, the raw output would be something like "HHHEE_LL_LLLOOO", "HHHUU_LL_LLLOOO" or "AAAUU_LL_LLLOOO". We then clean up the output by removing repeated characters: HHHEE_LL_LLLOOO becomes HE_L_LO, HHHUU_LL_LLLOOO becomes HU_L_LO, and AAAUU_LL_LLLOOO becomes AU_L_LO. Then we remove any blanks: HE_L_LO becomes HELLO, HU_L_LO becomes HULLO, and AU_L_LO becomes AULLO. All of these sound similar to HELLO, so these pronunciation based predictions are combined with likelihood scores based on a large database of written text. The likelihood score of HELLO will be greater than that of the other two, so the correct output is shown.

[12] One disadvantage of this algorithm is that if the input audio file really says HULLO, the algorithm will not recognize it correctly, since the database of written text does not contain many occurrences of HULLO. So the algorithm malfunctions when the speaker says words which are not well represented in the database of written text.
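The clean-up rule described above (remove repeated characters, then remove blanks) is simple to state in code; this sketch uses '_' as the CTC blank symbol.

from itertools import groupby

def ctc_collapse(raw, blank="_"):
    """Collapse a raw CTC output: merge repeated characters, then drop blanks."""
    deduped = "".join(ch for ch, _ in groupby(raw))  # HHHEE_LL_LLLOOO -> HE_L_LO
    return deduped.replace(blank, "")                # HE_L_LO -> HELLO

for raw in ["HHHEE_LL_LLLOOO", "HHHUU_LL_LLLOOO", "AAAUU_LL_LLLOOO"]:
    print(raw, "->", ctc_collapse(raw))  # HELLO, HULLO, AULLO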
III. Applications of NLP:

One application of natural language processing which we are going to discuss here is automatic summarization of text with the help of software. We will also discuss two of the best algorithms designed to summarize text, and compare them so as to reach a conclusion. But before we discuss the algorithms, it is better to know what automatic text summarization actually is.

Automatic text summarization is the task of having software reduce a large amount of text into a meaningful short summary, which allows the reader to understand what information the document contains in a short descriptive form, saving the user's effort and time.

There are two general or fundamental approaches to automatic summarization of text: extraction and abstraction. Extractive methods work by choosing a subset of the words, phrases, or sentences present in the original document to produce an extracted summary. [15] In abstractive methods the algorithm builds an internal semantic representation and then uses natural language generation techniques; that is, the machine acts as if it has a human brain and the ability to generate a meaningful summary by understanding the text present in the document. This process creates a summary far closer to what a person might actually extract and present as a summary of the text, and the generated summary can include verbal innovations. Research to date has focused primarily on extractive methods, which are pertinent for image collection summarization, text summarization and video summarization.

IV. Feature based sentence extraction using fuzzy inference rules.

The stated algorithm is based on evaluating each sentence in the input data on the basis of rules which categorize the statement and assign it a value of low, medium or high. This assignment of values is done on the basis of rules, eight in total, known as fuzzy rules or fuzzy logic. These rules are in IF-THEN form. For example, the algorithm takes a statement F as input, applies all the rules to it and assigns values: IF (F1 is H) means the importance of the statement on the basis of the first logic rule is 'high'. Similarly all the rules are applied, as (F2 is H), (F3 is M), (F4 is H), (F5 is M), (F6 is H), (F7 is H), (F8 is H) [13]. If the statement, after being evaluated by the rules, satisfies the criteria and is considered important, it is added to the summary in the same sequence in which it appeared in the input data.

The algorithm consists of 4 stages which process the data and give the final output as the processed summary. These stages are: first Preprocessing, then Feature extraction, followed by Fuzzy logic scoring, and finally Sentence selection and assembly. Each of these stages consists of sub processes, and the output of each stage is given as input to the next stage.

The algorithm uses certain features to determine the importance of a sentence, on which its inclusion in the summary depends. These features are [14]:
1. Title feature
2. Term weight
3. Sentence length
4. Sentence position
5. Thematic word
6. And fuzzy logic

On the basis of these features the sentence is evaluated by the algorithm and the output is generated.
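A compact sketch of this IF-THEN evaluation follows: each feature value is mapped to a Low/Medium/High label, and a sentence is kept if enough rules fire High. The thresholds and the acceptance criterion are assumptions made for illustration; the algorithm in [13] uses eight tuned fuzzy rules.

def fuzzify(value):
    """Map a normalized feature score in [0, 1] to a fuzzy label."""
    if value >= 0.66:
        return "H"
    if value >= 0.33:
        return "M"
    return "L"

def evaluate(sentence_features, min_high=5):
    """Apply one IF-THEN rule per feature; keep the sentence if enough rules say High."""
    labels = [fuzzify(v) for v in sentence_features]   # e.g. ['H', 'H', 'M', 'H', ...]
    return labels, labels.count("H") >= min_high

features = [0.9, 0.8, 0.5, 0.7, 0.4, 0.9, 0.7, 0.8]    # eight rule inputs F1..F8
labels, important = evaluate(features)
print(labels, "-> include in summary" if important else "-> skip")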
V. Template based algorithm for automatic text summarization.

As we saw, in the feature based algorithm the sentences are evaluated on the basis of some basic criteria or features, and they are added to the output summary in the same arrangement as in the input data. In the template based extraction algorithm, by contrast, some modification is done after extracting the text, to arrange the information in a proper and grammatical way in the summary and to make it look more like human work.

The template based algorithm for automatic text summarization is implemented in two phases [16]:
A. Text pre-processing
B. And information extraction
A. Text pre-processing
This part of the implementation includes the modules shown below:

Fig 7: Text preprocessing model

1. Syntactic analysis
The job of the syntactic analysis module is to decide the start and end of each sentence in the input document. Presently the algorithm takes the full stop symbol as the end of a sentence, and any string of characters up to the full stop symbol is taken as one full sentence.

2. Tokenizer
The job of the tokenizer is to break each sentence given as output from the syntactic analysis module into tokens. The broken pieces of a sentence can be words, numbers or punctuation marks.

3. Semantic analysis
The semantic analysis sub phase understands the role of every word in a sentence. Every word is then assigned a tag such as noun, verb, adjective, adverb and so on. This process of assigning tags and dividing the words into different classes is called Part-Of-Speech tagging, or POS tagging.

4. Stop Word Removal
Some words are used very often in natural language text but carry very little importance for extracting the overall meaning of the sentence. Such words are called stop words and are removed.

5. Stemming
Stemming is the task of deriving the basic form of a certain word in the input text document. The same word may be written in different tenses while carrying the same meaning; to avoid that, stemming is done, and words having the same meaning but different tenses are converted into the basic simple tense.
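The pre-processing modules can be strung together in a short pipeline sketch. The stop word list and the suffix-stripping stemmer below are deliberately tiny assumptions (production systems use full stop word lists and stemmers such as Porter's), and POS tagging is omitted since it needs a trained tagger.

import re

STOP_WORDS = {"the", "is", "a", "an", "of", "to", "and"}  # tiny illustrative list

def split_sentences(document):
    """Syntactic analysis: a full stop ends a sentence."""
    return [s.strip() for s in document.split(".") if s.strip()]

def tokenize(sentence):
    """Tokenizer: break a sentence into words, numbers and punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", sentence.lower())

def stem(word):
    """Stemming: crude suffix stripping to reach a basic form."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(document):
    for sentence in split_sentences(document):
        tokens = tokenize(sentence)
        tokens = [t for t in tokens if t not in STOP_WORDS]  # stop word removal
        yield [stem(t) for t in tokens]

print(list(preprocess("The cats played. A dog is barking loudly.")))
# [['cat', 'play'], ['dog', 'bark', 'loudly']]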
B. Information extraction
This part of the algorithm includes the following modules:
1. Training the dialogue control
2. Knowledge based discovery
3. Dialogue management
4. Template based summarization

1. Training the dialogue control
During training, the system gains knowledge of important or index terms, and of named bodies such as names of persons, places, and temporal formats, as per the norms of the rules. The intelligence and efficiency of the algorithm increase with every training set: every time we feed data to the algorithm, it stores the results of the process in a data store and then uses them as reference or experience while analyzing the next input. The concepts learnt during training are stored in the knowledge base of the system.

2. Knowledge based discovery
Knowledge based discovery here means the process of extracting intelligent information and storing it in unstructured text form. This decreases the need to create multiple storage structures for terms of different categories, reducing the search time and hence improving the overall performance of the algorithm.

3. Dialogue management
The dialogue management module is a process which offers human and computer interaction. With the help of this module the user can request the information he needs using normal natural language text. The dialogue control module, which is built upon the training model as its experience data set, accepts the user request, understands it, refers to its knowledge base and finally produces answers which probably contain the sought information.

4. Template based summarization
This is the process of combining together all the meaningful text present in the input data or document in a compact format.

Fig 8: Information Extraction Module

The algorithm also allows the user to prepare the template, which has provisions to specify events, locations, named entities etc. The user also has the provision to specify any number and any type of POS patterns.
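The user-prepared template idea can be sketched as pattern slots filled from the text. The regular expression slots for names and dates below are simplified assumptions standing in for the trained extraction model.

import re

# A user-prepared template with slots for named entities and temporal formats.
template = "{person} presented a seminar on {date}."

patterns = {
    "person": re.compile(r"\b[A-Z][a-z]+\b"),                    # crude named-entity slot
    "date": re.compile(r"\b\d+(?:st|nd|rd|th) of [A-Z][a-z]+"),  # temporal slot
}

def fill_template(text):
    """Fill each template slot with the first matching span from the input text."""
    slots = {}
    for name, pattern in patterns.items():
        match = pattern.search(text)
        slots[name] = match.group(0) if match else "<unknown>"
    return template.format(**slots)

print(fill_template("Mike scored 90% in his seminar on the 19th of March."))
# Mike presented a seminar on 19th of March.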
VI. Gap analysis:

In the template based algorithm, all the templates created on different documents, whether by the same user or different users, are stored in the database for future reference; this acts as a training set for the software and hence increases the accuracy of the system. In the feature based algorithm there is no concept of a database and thus no training set; hence the gap is filled by the template based approach. [17]
VII. Conclusion

As the above context indicates, text processing algorithms are based on entity based classification and preference graphs. The text processing algorithms are used in smart reply and smart suggestions in various applications to reduce the user's workload and time while giving appropriate and efficient output. The speech processing problem, meanwhile, is nowhere near solved, but it has improved a lot over the past decade. Neural networks and deep learning are used in text processing and speech processing to give more efficient output. These recently improved algorithms have resulted in major breakthroughs in this area. The accuracy of the output is close to perfect due to the new, improved algorithms, and is now close to what humans would interpret. Various AI systems are developed based on text processing and speech processing algorithms to assess the user's requirements based on input classification. This improves results, and the user gets more personalized results according to his needs. Combinations of various text processing and speech processing algorithms are used for more refined output.
VIII. References

[1] Matthew Henderson, Rami Al-Rfou, Brian Strope et al. "Efficient Natural Language Response Suggestion for Smart Reply".
[2] Jan Chorowski, Navdeep Jaitly "Towards better decoding and language model integration in sequence to sequence models".
[3] Adams Wei Yu, Hongrae Lee, Quoc V. Le "Learning to Skim Text".
[4] Fangtao Li, Yang Gao et al. "Deceptive Answer Prediction with User Preference Graph".
[5] Yonghui Wu, Mike Schuster, Zhifeng Chen et al. "Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation".
[6] Luisa Bentivogli, Arianna Bisazza, and Mauro Cettolo "Neural versus Phrase-Based Machine Translation Quality: a Case Study".
[7] Maja Popović "Comparing Language Related Issues for NMT and PBMT between German and English".
[8] Ciprian Chelba "Speech and Natural Language: Where Are We Now and Where Are We Headed?".
[9] Alex Graves, Santiago Fernández et al. "Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks" Pittsburgh, Pennsylvania, USA, June 25-29, 2006.
[10] Brian Milch, Alexander Franz "Searching the Web by Voice" Taipei, Taiwan, August 24 - September 1, 2002.
[11] Grégoire Mesnil, Yann Dauphin et al. "Using Recurrent Neural Networks for Slot Filling in Spoken Language Understanding" IEEE Press, Piscataway, NJ, USA, Volume 23, Issue 3, March 2015.
[12] Ian McGraw, Rohit Prabhavalkar et al. "Personalized Speech Recognition on Mobile Devices" Shanghai, China, 19 May 2016.
[13] S. A. Babar, P. D. Patil "Fuzzy approach for document summarization".
[14] S. A. Babar, S. A. Thorat "Improving Text Summarization using Fuzzy Logic & Latent Semantic Analysis".
[15] Prashant G. Desai, Sarojadevi, Niranjan N. Chiplunkar, Mahesh Kini M. "A Study of Natural Language Processing Based Algorithms for Text Summarization".
[16] Prashant G. Desai, Sarojadevi, Niranjan N. Chiplunkar "A template based algorithm for automatic text summarization and dialogue management for text documents".
[17] Dipanjan Das, André F. T. Martins "A Survey on Automatic Text Summarization".