


(IJACSA) International Journal of Advanced Computer Science and Applications,
Vol. 11, No. 5, 2020

Image Captioning using Deep Learning: A Systematic Literature Review
Murk Chohan1, Adil Khan2, Muhammad Saleem Mahar3
Saif Hassan4, Abdul Ghafoor5, Mehmood Khan6
Department of Computer Science
Sukkur IBA University
Pakistan

Abstract—Auto image captioning is defined as the process of generating captions or textual descriptions for images based on the contents of the image. It is a machine learning task that involves both natural language processing (for text generation) and computer vision (for understanding image contents). Auto image captioning is a recent and growing research problem, and new methods are being introduced day by day to achieve satisfactory results in this field. However, considerable attention is still required to achieve results as good as a human's. This study aims to find out, in a systematic way, which recent methods and models are used for image captioning using deep learning, how those models are implemented, and which methods are most likely to give good results. For this purpose we performed a systematic literature review of studies published from 2017 to 2019 in well-known databases (Scopus, Web of Science, IEEE Xplore). We found a total of 61 primary studies relevant to the objective of this research. We found that CNN is used to understand image contents and detect objects in an image, while RNN or LSTM is used for language generation. The most commonly used datasets are MS COCO, which is by far the most widely used, together with Flickr 8k and Flickr 30k. The most commonly used evaluation metric is BLEU (1 to 4), used in almost all studies. It is also found that LSTM with CNN has outperformed RNN with CNN. We found that the two most promising methods for implementing such models are the encoder-decoder and attention mechanisms, and a combination of the two can improve results to a good scale. This research provides guidelines and recommendations to researchers who want to contribute to auto image captioning.

Keywords—Image Captioning; Deep Learning; Neural Network; Recurrent Neural Network (RNN); Convolution Neural Network (CNN); Long Short Term Memory (LSTM)

I. INTRODUCTION

Auto image captioning is the process of automatically generating human-like descriptions of images. It is a prominent task with good practical and industrial significance [62]. Auto image captioning has practical uses in industry, security, surveillance, medicine, agriculture and many more prime domains. It is not only a very crucial but also a very challenging task in computer vision [1]. Traditional object detection and image classification tasks only need to identify objects within the image, whereas auto image captioning requires not just identifying the objects but also identifying the relationships between them and understanding the total scene of the image. After understanding the scene, it is also required to generate a human-like description of that image. Since the boost of automation and Artificial Intelligence, a lot of research is going on to give machines human-like capabilities and reduce manual work. For machines, acquiring results and accuracy as good as a human's in the image captioning problem has always been very challenging.

Auto image captioning is performed through the following key tasks, in order. First, features are extracted; after proper extraction of features, the different objects in the image are detected; after that, the relationships between the objects are identified (e.g. if the objects are a cat and grass, it is to be identified that the cat is on the grass). Once objects are detected and relationships are identified, the text description is generated, i.e. a sequence of words in an orderly form so that they make a good sentence according to the relationships between the image objects.

To perform the above key tasks, different deep learning networks are used. For example, to obtain visual features and objects, a CNN with region-proposing models such as R-CNN or Faster R-CNN can be used, and to generate the text description in sequence, an RNN or LSTM can be used. Using these networks, various methods have been developed to perform auto image captioning in different domains. However, there is still room to make machines capable enough to generate descriptions like a human [61]. After training the deep learning network for image captioning, various evaluation metrics such as BLEU, CIDEr, and ROUGE-L exist to evaluate its performance.

The purpose of this Systematic Literature Review is to study the newest articles from 2017 to 2019 to find the different methods used to achieve auto image captioning in different domains, the different datasets used for the task, the practical domains in which the task is used, and which techniques outperform others, and finally to describe the technicalities behind the different networks, methods and evaluation metrics. Our study will help new researchers who want to work in this domain to attain better accuracy. We especially focused on the collection of quality articles which have been published so far. We attempt to identify the different techniques presented in articles [1-60] and the strengths and weaknesses of their methods. Finally, we attempt to summarize them to explain which technique performs better in its particular domain. Our work mostly focuses on identifying the most popular techniques and the areas which still require attention; in the results section we also attempt to explain the technical concepts behind the approaches used.
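To make the CNN-plus-RNN/LSTM pipeline described above concrete, the following PyTorch sketch shows one minimal way an encoder-decoder captioner can be wired together. It is illustrative only: the class names, layer sizes and the ResNet-50 backbone are our own assumptions, not the architecture of any particular reviewed study.

import torch
import torch.nn as nn
import torchvision.models as models

class CaptionEncoder(nn.Module):
    """CNN encoder: maps an image to a fixed-size feature vector."""
    def __init__(self, embed_dim):
        super().__init__()
        resnet = models.resnet50(weights=None)                          # any CNN backbone could be used here
        self.backbone = nn.Sequential(*list(resnet.children())[:-1])    # drop the classification head
        self.fc = nn.Linear(resnet.fc.in_features, embed_dim)

    def forward(self, images):                                          # images: (B, 3, H, W)
        feats = self.backbone(images).flatten(1)                        # (B, 2048)
        return self.fc(feats)                                           # (B, embed_dim)

class CaptionDecoder(nn.Module):
    """LSTM decoder: predicts next-word scores from the image feature and previous words."""
    def __init__(self, vocab_size, embed_dim, hidden_dim):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, img_feats, captions):                             # captions: (B, T) token ids
        tokens = self.embed(captions)                                   # (B, T, embed_dim)
        inputs = torch.cat([img_feats.unsqueeze(1), tokens], dim=1)     # image feature acts as the first input
        hidden, _ = self.lstm(inputs)                                   # (B, T+1, hidden_dim)
        return self.out(hidden)                                         # (B, T+1, vocab_size) word scores

# Usage sketch with random data (shapes only; no trained weights involved).
encoder = CaptionEncoder(embed_dim=256)
decoder = CaptionDecoder(vocab_size=10000, embed_dim=256, hidden_dim=512)
scores = decoder(encoder(torch.randn(2, 3, 224, 224)), torch.randint(0, 10000, (2, 12)))
print(scores.shape)   # torch.Size([2, 13, 10000])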


II. METHODOLOGY

The planning, conducting and reporting of this systematic literature review was done step by step. First, in the planning stage, we identified the need for conducting this research and its importance. Identifying the research questions, designing the search strategy, designing the quality assessment criteria and finally designing the data extraction strategy were also planned during this stage. After proper planning, we conducted the research. In alignment with our research problem, we came up with research questions for which we try to find answers during this research.

A. Research Questions

Before conducting this study we set the following research questions to measure the quality of our work. This study basically provides detailed knowledge related to these research questions. Table I provides the list of research questions.

TABLE I. LIST OF RESEARCH QUESTIONS

RQ 1. How does image captioning recognize the important objects, attributes of objects and their relationships in an image? (Motivation: identifying DL techniques for object detection and the relation-finding mechanism)
RQ 2. How are deep learning-based techniques capable of handling the complexities and challenges of image captioning? (Motivation: identifying DL methods to handle the challenges of image captioning)
RQ 3. What deep learning techniques are used for image captioning? (Motivation: identifying DL techniques for language generation as well as object detection)
RQ 4. Which techniques outperform other techniques? (Motivation: comparison between several techniques)
RQ 5. What datasets are used for image captioning? (Motivation: identifying different datasets used for image captioning)
RQ 6. What evaluation mechanisms are used in the literature for image captioning? (Motivation: identifying different methods to evaluate image captioning models)
B. Search Results

According to our research questions we came up with our search keywords, which we categorized into two groups, shown in Table II. Using a scientific approach for searching the results from different academic databases, we composed the query string from the keywords cited in Table II.

Query String: ("Image Captioning") AND ("Deep Learning" OR "Neural Network" OR "RNN" OR "LSTM" OR "CNN")

We applied the cited search query string on three well-known academic databases, namely IEEE Xplore, Web of Science and Scopus, to search the articles. We adopted the most recent articles published in journals during 2017-2019, and our initial search results are illustrated in Table III.

TABLE II. KEYWORDS IN TWO DIFFERENT GROUPS (GROUP 1 AND GROUP 2)

Keywords, Group 1: Image Captioning
Keywords, Group 2: Deep Learning, Neural Network, RNN, CNN, LSTM

TABLE III. INITIAL STAGE RESULTS FROM IEEE XPLORE, WEB OF SCIENCE AND SCOPUS

Database: Original search results
IEEE Xplore: 247
Web of Science: 167
Scopus: 313
Total: 727

Since an article can be indexed in many databases, we removed the duplicate articles from all but one of the databases. After duplicate removal, the total number of studies from all three databases is shown in Table IV.

Abstract screening is also important, to filter the searched studies and keep the valuable studies that are most related to our work. We performed abstract screening on the 577 articles which remained after duplicate removal to check the relevance of the studies to our work. We found many studies not relevant to our topic; for example, some were about audio captioning or video captioning. After the abstract screening, we had a total of 308 studies out of 577; Table V illustrates the total number of studies from each database after the abstract screening.

TABLE IV. NUMBER OF STUDIES AFTER DUPLICATE REMOVAL

Database: Duplicate removal results
IEEE Xplore: 162
Web of Science: 167
Scopus: 248
Total: 577

TABLE V. NUMBER OF STUDIES AFTER ABSTRACT SCREENING

Database: Abstract screening results
IEEE Xplore: 63
Web of Science: 92
Scopus: 143
Total: 308
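For reproducibility, the query string above can be composed mechanically from the two keyword groups of Table II. The small Python snippet below is an illustrative sketch only; the helper name build_query is ours.

# Illustrative only: compose the Section II-B query string from the Table II keyword groups.
group1 = ["Image Captioning"]
group2 = ["Deep Learning", "Neural Network", "RNN", "LSTM", "CNN"]

def build_query(g1, g2):
    left = " OR ".join(f'"{k}"' for k in g1)
    right = " OR ".join(f'"{k}"' for k in g2)
    return f"({left}) AND ({right})"

print(build_query(group1, group2))
# -> ("Image Captioning") AND ("Deep Learning" OR "Neural Network" OR "RNN" OR "LSTM" OR "CNN")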


C. Quality Assessment Criteria

The quality of the 308 selected studies was assessed against quality assessment criteria to ensure the quality of our study. We went through full-text screening of those studies which were ambiguous and not clear from abstract screening; the process of applying the quality assessment criteria (QAC) was done with this full-text screening. All four authors agreed on a set of quality assessment questions (QAQ) to ensure the quality of our work:

QA Q1: The article must be published in a journal.
QA Q2: The article has proposed a proper method to implement image captioning using deep learning.
QA Q3: The article must have clear and unambiguous results.
QA Q4: The article must discuss the applications and challenges of image captioning.
QA Q5: The article must discuss the evaluation strategy of the built model.

We assessed the quality of the 308 studies on the basis of the quality assessment criteria (QAC) questions and, through full-text screening, we found a total of 61 studies from all three databases. The number of studies from each of the three databases is shown in Table VI. The whole selection process described above is illustrated in the PRISMA diagram (see Fig. 1).

D. Data Extraction and Synthesis

After selection of the final 61 primary studies, we extracted data from those studies to perform the final synthesis. We defined our data extraction strategy based on our research questions. We extracted the following parameters from our primary studies for further synthesis: year the article was published, title, models used for language generation and object detection, methods used to implement the models, datasets used, evaluation metrics used for evaluation purposes, and finally the accuracy of the proposed model.

The purpose of the synthesis is to summarize the facts extracted during data extraction and give a clear picture of the work done in the past and directions to new researchers.

TABLE VI. NUMBER OF STUDIES AFTER QUALITY ASSESSMENT CRITERIA

Database: Quality assessment results
IEEE Xplore: 12
Web of Science: 16
Scopus: 33
Total: 61

Fig. 1. PRISMA Diagram.
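The extraction parameters listed in Section II-D can be represented as a simple per-study record. The following Python sketch is only an illustrative assumption of such a data-extraction form; the field names (PrimaryStudy, best_score, etc.) are ours, not the exact sheet used in the review.

# Illustrative data-extraction record for one primary study (field names are assumptions,
# mirroring the parameters listed in Section II-D).
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class PrimaryStudy:
    year: int                           # year the article was published
    title: str
    models: List[str]                   # e.g. ["CNN", "LSTM"] for object detection / language generation
    method: str                         # e.g. "encoder-decoder", "attention"
    datasets: List[str]                 # e.g. ["MS COCO", "Flickr8k"]
    metrics: List[str]                  # e.g. ["BLEU-1", "BLEU-4", "CIDEr"]
    best_score: Optional[float] = None  # reported score of the proposed model (e.g. BLEU-1)

study = PrimaryStudy(2019, "Example captioning study", ["CNN", "LSTM"],
                     "encoder-decoder with attention", ["MS COCO"], ["BLEU-1"], 0.75)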


III. RESULTS AND DISCUSSION

A. Datasets

There are many datasets available for performing image captioning. In the literature the most commonly used datasets are MS COCO, Flickr 8k and Flickr 30k. Moreover, for text description of specific tasks, such as medical or traffic movement description, dedicated datasets have been created. Fig. 2 shows the datasets along with their frequency in our selected studies.

1) MS COCO: MS COCO stands for Common Objects in Context. It is a very large dataset which contains 330k images, 1.5 million object instances and 5 captions per image. MS COCO is found to be a very widely used dataset in the literature. It is best suited for image captioning because, unlike other datasets, it contains non-iconic images. Iconic images are those which contain only one object with a background, whereas non-iconic images contain various overlapping objects. Object layout plays an important role in understanding the context of a scene, and that is very carefully taken care of while labeling the images. Fig. 3 shows some images taken from the MS COCO dataset.

2) Deep learning networks: The deep learning network used for images is the convolutional neural network. CNN has proved best at mapping image data to an output variable. There are various prebuilt models that take advantage of this feature of CNN, e.g. R-CNN, Faster R-CNN, etc.; these models are used for object detection and localization in images, which is a very necessary task in image captioning, since it is not just a classification task and understanding the image contents is necessary. Once the image data is understood, the sequence of words generating the text for that particular image must be predicted. For sequence prediction the two most famous networks are the Recurrent Neural Network (RNN) and Long Short Term Memory (LSTM). For the image caption generation task, CNN is used with either RNN or LSTM, where CNN is used for understanding image contents and RNN or LSTM for text description generation. Fig. 4 and Table VII present the number of studies that have used RNN or LSTM with CNN. Fig. 5 shows the use of CNN and RNN networks for image captioning from 2017 to 2019.

In terms of performance, we compared the BLEU-1 scores of both text prediction networks and found that LSTM outperforms RNN in terms of accuracy. Fig. 6 shows the results of the top 5 highest-scoring papers for both networks.

3) Convolutional Neural Network (CNN): A convolutional neural network is a deep learning algorithm which is normally used to process images. CNN is an evolution of the simple ANN that gives better results on images. A simple dense network is best for classification tasks where some features are used to classify the image; CNN performs best when there are more features in an image, and it is also used to process local features, because images contain repeating patterns of a particular thing. It takes images as input and understands them to perform assigned tasks. The two main functions of a CNN are convolution and pooling. Convolution is used in a CNN to detect the edges of an image, and pooling is used to reduce the size of an image. Convolution is a method in which we take a small matrix called a kernel or filter and move it over our picture, transforming it depending on the filter values. The following formula is used to calculate the feature map, where f denotes the input image and h denotes the filter; the rows and columns of the output matrix are indexed by m and n, respectively:

G[m, n] = (f * h)[m, n] = \sum_{j} \sum_{k} h[j, k] \, f[m - j, n - k]

The calculation of a convolutional layer is done in two steps. The first step calculates the intermediate value Z by convolving the weights with the previous activation and adding a bias; the second step applies a non-linear activation g to the intermediate value:

Z^{[i]} = W^{[i]} * A^{[i-1]} + b^{[i]}, \qquad A^{[i]} = g^{[i]}(Z^{[i]})

4) Recurrent Neural Network (RNN): CNNs commonly do not do well when the input data is sequential and interrelated. A CNN has no connection of any kind between the previous input and the next, so all of its outputs depend only on the current input: depending on the trained model, a CNN takes an input and gives an output. For sequential tasks the RNN is used. An RNN has its own memory, so it is able to remember what happened earlier in the data, i.e. the previous inputs. RNN performs best on textual data because text is interrelated (sequential) data. The basic formula for an RNN is written below:

h^{(t)} = f(h^{(t-1)}, x^{(t)}; \theta)

where f is the function producing the current hidden state h, h^{(t-1)} is the previous hidden state, x^{(t)} is the current input, and \theta is the parameter of the function.

5) Long Short Term Memory (LSTM): LSTM is a variant of RNN. It is better than a simple RNN because it solves the issues faced by a simple RNN. The two major issues faced by a simple RNN are (i) exploding and vanishing gradients and (ii) long-term dependency. LSTM uses gates to remember the past, and gates are the heart of LSTM. The gates available in an LSTM are (i) the input gate, (ii) the forget gate and (iii) the output gate. They all use the sigmoid activation function. Sigmoid means an output between 0 and 1, mostly 0 or 1: when the output is 0 the gate is blocking, and when the output is 1 it passes everything. Below is the equation for the input gate; the other gates have the same form:

i_t = \sigma(W_i [h_{t-1}, x_t] + b_i)
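To make the formulas above concrete, the following NumPy sketch evaluates the convolution feature map, the two-step convolutional layer, the RNN update and the LSTM input gate as written. The shapes, helper names and random inputs are illustrative assumptions, not code from any reviewed study.

import numpy as np

def conv2d_feature_map(f, h):
    """G[m, n] = sum_j sum_k h[j, k] * f[m - j, n - k], evaluated on the valid region only."""
    kh, kw = h.shape
    H, W = f.shape
    flipped = h[::-1, ::-1]           # convolution = sliding dot product with the flipped kernel
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(f[i:i + kh, j:j + kw] * flipped)
    return out

def conv_layer(A_prev, W, b, g=np.tanh):
    """Two-step convolutional layer: Z = W * A_prev + b, then A = g(Z)."""
    Z = conv2d_feature_map(A_prev, W) + b
    return g(Z)

def rnn_step(h_prev, x_t, W_h, W_x, b, f=np.tanh):
    """h(t) = f(h(t-1), x(t); theta), with theta = (W_h, W_x, b) in a simple affine form."""
    return f(W_h @ h_prev + W_x @ x_t + b)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_input_gate(h_prev, x_t, W_i, b_i):
    """i_t = sigmoid(W_i [h_{t-1}, x_t] + b_i); the forget and output gates have the same form."""
    return sigmoid(W_i @ np.concatenate([h_prev, x_t]) + b_i)

# Tiny usage example with random values (shapes are illustrative).
rng = np.random.default_rng(0)
print(conv_layer(rng.random((6, 6)), rng.random((3, 3)), b=0.1).shape)                                       # (4, 4)
print(rnn_step(rng.random(4), rng.random(3), rng.random((4, 4)), rng.random((4, 3)), rng.random(4)).shape)   # (4,)
print(lstm_input_gate(rng.random(4), rng.random(3), rng.random((4, 7)), rng.random(4)).shape)                # (4,)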


B. Evaluation Mechanism

Evaluating the trained model is quite a difficult task in image captioning, and for this purpose various evaluation metrics have been created. The most common evaluation mechanisms found in the literature are BLEU, ROUGE-L, CIDEr, METEOR and SPICE. The BLEU score is found to be the most popular method of evaluation, used by almost all of the studies, as can be verified from Fig. 7 and Table VIII.

6) BLEU: BLEU stands for bilingual evaluation understudy. It is an evaluation mechanism widely used in text generation: a mechanism for comparing machine-generated text with one or more manually written texts. Basically, it summarizes how close a generated text is to an expected text. The BLEU score is most prevalent in automated machine translation, but it can also be used in image captioning, text summarization, speech recognition, etc. In image captioning in particular, the BLEU score measures how close a generated caption is to a manual, human-generated caption of that particular image. The score lies between 0.0 and 1.0, where 1.0 is a perfect score and 0.0 is the worst score.

We found that almost all studies used BLEU as their evaluation metric and calculated BLEU-1 to BLEU-4, where BLEU-1 computes the score only on 1-grams, BLEU-2 on 2-grams, BLEU-3 on 3-grams and BLEU-4 on 4-grams. The BLEU score can be calculated from the following formula:

BLEU = BP \cdot \exp\left( \sum_{n=1}^{N} w_n \log p_n \right)

7) METEOR: METEOR stands for metric for evaluation of translation with explicit ordering. While BLEU takes account of the entire generated text, overshadowing the score of each individual generated sentence, METEOR takes care of this. To do so, METEOR enhances the precision and recall functions: instead of plain precision and recall, METEOR uses a weighted F-score for unigram mapping, and for incorrect word order it uses a penalty function.

The formula for the weighted function is:

F = \frac{P R}{\alpha P + (1 - \alpha) R}

where P and R stand for precision and recall, calculated as m/c and m/r, where c and r are the candidate and reference lengths and m is the number of mapped unigrams between the two texts.

The formula for the penalty function is:

Penalty = \gamma \left( \frac{c}{m} \right)^{\beta}, \qquad 0 \le \gamma \le 1

where c is the number of matched chunks and m is the total number of matches.

The overall METEOR score is then found by:

M = (1 - Penalty) \cdot F

8) ROUGE-L: ROUGE stands for recall-oriented understudy for gisting evaluation. As is clear from its name, ROUGE is based only on recall, but ROUGE-L is based on an F score, which is the harmonic mean of its precision and recall values. Following are the formulas for calculating the precision, recall and F values:

P = \frac{LCS(A, B)}{m} \qquad and \qquad R = \frac{LCS(A, B)}{n}

Here A and B are the candidate and reference generated texts, m and n are their lengths, and LCS stands for the longest common subsequence, on which ROUGE-L depends. For calculating F, their harmonic mean is computed:

F = \frac{(1 + b^2) R P}{R + b^2 P}
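As an illustration of the scoring formulas above, the short Python sketch below computes a BLEU score (clipped n-gram precisions with a brevity penalty and uniform weights) and a ROUGE-L F score (LCS-based precision and recall) for a single candidate-reference pair. It is a simplified sketch of the equations, not the official implementation of either metric, and the example sentences are made up.

import math
from collections import Counter

def ngram_counts(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    """BLEU = BP * exp(sum_n w_n log p_n), with uniform weights w_n = 1/max_n."""
    precisions = []
    for n in range(1, max_n + 1):
        cand, ref = ngram_counts(candidate, n), ngram_counts(reference, n)
        overlap = sum(min(c, ref[g]) for g, c in cand.items())                  # clipped n-gram matches
        precisions.append(max(overlap / max(sum(cand.values()), 1), 1e-9))      # smoothed to avoid log(0)
    bp = 1.0 if len(candidate) > len(reference) else math.exp(1 - len(reference) / len(candidate))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

def lcs_length(a, b):
    """Length of the longest common subsequence, as needed by ROUGE-L."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def rouge_l(candidate, reference, beta=1.2):
    """P = LCS/m, R = LCS/n, F = (1 + beta^2) R P / (R + beta^2 P)."""
    lcs = lcs_length(candidate, reference)
    p, r = lcs / len(candidate), lcs / len(reference)
    return 0.0 if p + r == 0 else (1 + beta ** 2) * r * p / (r + beta ** 2 * p)

cand = "a cat is sitting on the grass".split()
ref = "a cat sits on green grass".split()
print(bleu(cand, ref, max_n=2), rouge_l(cand, ref))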

Fig. 2. Datasets used for Image Captioning in Selected Studies.


Fig. 3. MS COCO Dataset Images.

Fig. 4. Deep Learning Models used for Image Captioning in the Literature (CNN+RNN, CNN+LSTM, CNN+RNN+LSTM).

TABLE VII. DEEP LEARNING MODELS USED FOR IMAGE CAPTIONING IN THE LITERATURE

1. CNN + LSTM: [1], [2], [4], [6], [8], [10], [11], [13], [14], [15], [16], [18], [19], [20], [22], [23], [26], [27], [28], [29], [31], [32], [33], [36], [37], [38], [40], [41], [42], [43], [44], [45], [46], [47], [49], [50], [52], [53], [54], [57], [58]
2. CNN + RNN: [5], [7], [9], [12], [17], [21], [30], [34], [35], [51], [56]
3. CNN + RNN + LSTM: [18], [24], [25], [39], [48], [59], [60]

Fig. 5. Use of CNN and RNN Networks through the Years (RNN vs. LSTM, 2017-2019).


Fig. 6. Comparison of Best Score Achieved by RNN and LSTM (B1 Result Comparison, LSTM+CNN vs. RNN+CNN, top 5 papers each).

Fig. 7. Evaluation Metrics used in Literature (BLEU, METEOR, CIDEr, ROUGE-L, SPICE).

TABLE VIII. EVALUATION METRICS USED IN LITERATURE
(B = BLEU, M = METEOR, C = CIDEr, R = ROUGE-L, S = SPICE)

Sr. 1 | [1], [2], [3], [4], [6], [8], [10], [11], [13], [15], [16], [18], [20], [22], [26], [29], [30], [31], [32], [33], [38], [39], [40], [42], [43], [44], [49], [50], [52], [53], [54], [55], [56], [57] | ✔ ✔ ✔ ✔
Sr. 2 | [5], [12], [19], [23], [27], [28], [34], [36], [37], [45], [51], [58], [59] | ✔ ✔ ✔
Sr. 3 | [7], [35] | ✔ ✔
Sr. 4 | [9], [41], [47], [48] | ✔
Sr. 5 | [17] | ✔ ✔ ✔ ✔
Sr. 6 | [25] | ✔ ✔ ✔
Sr. 7 | [34] | ✔
Sr. 8 | [46] | ✔ ✔ ✔ ✔
Sr. 9 | [60] | ✔ ✔ ✔ ✔ ✔


IV. CONCLUSION

This systematic literature review (SLR) presents a detailed analysis of the different deep learning models used for image captioning. To perform the study we searched articles from three academic databases and, after applying inclusion and exclusion criteria to all articles, we selected 61 primary studies for the literature review. Using the data extraction mechanism we extracted the data and analyzed it deeply. We found various models and techniques used for image captioning. For image content extraction, CNN is the best-suited model, and for language generation the two most frequently used models are RNN and LSTM; it is found that LSTM has outperformed RNN. We also found that different studies have used several different mechanisms for scene understanding, such as the encoder-decoder mechanism and the attention mechanism. The most suitable dataset for image captioning is MS COCO because, unlike other datasets, it contains non-iconic images.

Throughout our review we have observed that image captioning is mostly applied in general-purpose settings. There are various domains that could take advantage of image captioning to automate their tasks:

1) A model can be trained on medical ultrasound or MRI images or angiographic videos to generate a complete report of a person without any input from a doctor. Image captioning can be used to generate an automatic report by looking at those medical images of a person.
2) Image captioning can also be used in industry to automate various tasks. A model can be trained on images of a company's product manufacturing environment to find anomalies in the environment or the product automatically. It can also be used to detect any mishap in a company, such as fire or security issues.
3) Image captioning can also be used in agriculture to generate reports of crops for owners by looking at images of the crops.
4) Image captioning can also be used in traffic analysis report generation by using CCTV cameras installed on streets, and thus guide drivers to the best path to take and where parking is available.

REFERENCES
[1] Yang, L., & Hu, H. (2019). Adaptive syncretic attention for constrained image captioning. Neural Processing Letters, 50(1), 549-564.
[2] Fu, K., Jin, J., Cui, R., Sha, F., & Zhang, C. (2016). Aligning where to see and what to tell: Image captioning with region-based attention and scene-specific contexts. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(12), 2321-2334.
[3] Li, J., Yao, P., Guo, L., & Zhang, W. (2019). Boosted Transformer for image captioning. Applied Sciences, 9(16), 3260.
[4] Oluwasanmi, A., Aftab, M. U., Alabdulkreem, E., Kumeda, B., Baagyere, E. Y., & Qin, Z. (2019). CaptionNet: Automatic end-to-end siamese difference captioning model with attention. IEEE Access, 7, 106773-106783.
[5] Wu, J., & Hu, H. (2017). Cascade recurrent neural network for image caption generation. Electronics Letters, 53(25), 1642-1643.
[6] Tan, J. H., Chan, C. S., & Chuah, J. H. (2019). COMIC: Toward a compact image captioning model with attention. IEEE Transactions on Multimedia, 21(10), 2686-2696.
[7] Huang, G., & Hu, H. (2019). c-RNN: A fine-grained language model for image captioning. Neural Processing Letters, 49(2), 683-691.
[8] Xiao, F., Gong, X., Zhang, Y., Shen, Y., Li, J., & Gao, X. (2019). DAA: Dual LSTMs with adaptive attention for image captioning. Neurocomputing, 364, 322-329.
[9] Dhir, R., Mishra, S. K., Saha, S., & Bhattacharyya, P. (2019). A deep attention based framework for image caption generation in Hindi language. Computación y Sistemas, 23(3).
[10] Xiao, X., Wang, L., Ding, K., Xiang, S., & Pan, C. (2019). Deep hierarchical encoder-decoder network for image captioning. IEEE Transactions on Multimedia, 21(11), 2942-2956.
[11] Zeng, X., Wen, L., Liu, B., & Qi, X. (2019). Deep learning for ultrasound image caption generation based on object detection. Neurocomputing.
[12] Karpathy, A., & Fei-Fei, L. (2015). Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 3128-3137).
[13] Xiao, X., Wang, L., Ding, K., Xiang, S., & Pan, C. (2019). Dense semantic embedding network for image captioning. Pattern Recognition, 90, 285-296.
[14] Han, M., Chen, W., & Moges, A. D. (2019). Fast image captioning using LSTM. Cluster Computing, 22(3), 6143-6155.
[15] Dash, S. K., Saha, S., Pakray, P., & Gelbukh, A. (2019). Generating image captions through multimodal embedding. Journal of Intelligent & Fuzzy Systems, 36(5), 4787-4796.
[16] Li, L., Tang, S., Zhang, Y., Deng, L., & Tian, Q. (2017). GLA: Global-local attention for image description. IEEE Transactions on Multimedia, 20(3), 726-737.
[17] Kinghorn, P., Zhang, L., & Shao, L. (2019). A hierarchical and regional deep learning architecture for image description generation. Pattern Recognition Letters, 119, 77-85.
[18] Su, Y., Li, Y., Xu, N., & Liu, A. A. (2019). Hierarchical deep neural network for image captioning. Neural Processing Letters, 1-11.
[19] Zhang, Z., Wu, Q., Wang, Y., & Chen, F. (2018). High-quality image captioning with fine-grained and semantic-guided visual attention. IEEE Transactions on Multimedia, 21(7), 1681-1693.
[20] Shetty, R., Tavakoli, H. R., & Laaksonen, J. (2018). Image and video captioning with augmented neural architectures. IEEE MultiMedia, 25(2), 34-46.
[21] Dong, J., Li, X., & Snoek, C. G. (2018). Predicting visual features from text for image and video caption retrieval. IEEE Transactions on Multimedia, 20(12), 3377-3388.
[22] Ding, S., Qu, S., Xi, Y., Sangaiah, A. K., & Wan, S. (2019). Image caption generation with high-level image features. Pattern Recognition Letters, 123, 89-95.
[23] Wu, Q., Shen, C., Wang, P., Dick, A., & van den Hengel, A. (2017). Image captioning and visual question answering based on attributes and external knowledge. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(6), 1367-1381.
[24] Yang, J., Sun, Y., Liang, J., Ren, B., & Lai, S. H. (2019). Image captioning by incorporating affective concepts learned from both visual and textual components. Neurocomputing, 328, 56-68.
[25] Guo, R., Ma, S., & Han, Y. (2019). Image captioning: from structural tetrad to translated sentences. Multimedia Tools and Applications, 78(17), 24321-24346.
[26] Zhang, X., He, S., Song, X., Lau, R. W., Jiao, J., & Ye, Q. (2019). Image captioning via semantic element embedding. Neurocomputing.
[27] Cao, P., Yang, Z., Sun, L., Liang, Y., Yang, M. Q., & Guan, R. (2019). Image captioning with bidirectional semantic attention-based guiding of long short-term memory. Neural Processing Letters, 50(1), 103-119.
[28] Wang, C., Yang, H., & Meinel, C. (2018). Image captioning with deep bidirectional LSTMs and multi-task learning. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), 14(2s), 1-20.
[29] Chen, H., Ding, G., Lin, Z., Guo, Y., Shan, C., & Han, J. (2019). Image captioning with memorized knowledge. Cognitive Computation, 1-14.


[30] He, C., & Hu, H. (2019). Image captioning with text-based visual attention. Neural Processing Letters, 49(1), 177-185.
[31] Zhu, X., Li, L., Liu, J., Li, Z., Peng, H., & Niu, X. (2018). Image captioning with triple-attention and stack parallel LSTM. Neurocomputing, 319, 55-65.
[32] Guo, R., Ma, S., & Han, Y. (2019). Image captioning: from structural tetrad to translated sentences. Multimedia Tools and Applications, 78(17), 24321-24346.
[33] Fu, K., Li, J., Jin, J., & Zhang, C. (2018). Image-text surgery: Efficient concept learning in image captioning by generating pseudopairs. IEEE Transactions on Neural Networks and Learning Systems, 29(12), 5910-5921.
[34] Lan, W.-Y., Wang, X.-X., Yang, G., & Li, X.-R. (2019). Improving Chinese image captioning by tag prediction.
[35] Li, X., & Jiang, S. (2019). Know more say less: Image captioning based on scene graphs. IEEE Transactions on Multimedia, 21(8), 2117-2130.
[36] Chen, X., Zhang, M., Wang, Z., Zuo, L., Li, B., & Yang, Y. (2018). Leveraging unpaired out-of-domain data for image captioning. Pattern Recognition Letters.
[37] Fang, F., Wang, H., Chen, Y., & Tang, P. (2018). Looking deeper and transferring attention for image captioning. Multimedia Tools and Applications, 77(23), 31159-31175.
[38] Wang, E. K., Zhang, X., Wang, F., Wu, T. Y., & Chen, C. M. (2019). Multilayer dense attention model for image caption. IEEE Access, 7, 66358-66368.
[39] Xu, N., Zhang, H., Liu, A. A., Nie, W., Su, Y., Nie, J., & Zhang, Y. (2019). Multi-level policy and reward-based deep reinforcement learning framework for image captioning. IEEE Transactions on Multimedia.
[40] Yu, J., Li, J., Yu, Z., & Huang, Q. (2019). Multimodal transformer with multi-view visual representation for image captioning. IEEE Transactions on Circuits and Systems for Video Technology.
[41] Cui, W., Zhang, D., He, X., Yao, M., Wang, Z., Hao, Y., ... & Huang, J. (2019). Multi-scale remote sensing semantic analysis based on a global perspective. ISPRS International Journal of Geo-Information, 8(9), 417.
[42] Yang, M., Zhao, W., Xu, W., Feng, Y., Zhao, Z., Chen, X., & Lei, K. (2018). Multitask learning for cross-domain image captioning. IEEE Transactions on Multimedia, 21(4), 1047-1061.
[43] Ding, G., Chen, M., Zhao, S., Chen, H., Han, J., & Liu, Q. (2019). Neural image caption generation with weighted training and reference. Cognitive Computation, 11(6), 763-777.
[44] Su, J., Tang, J., Lu, Z., Han, X., & Zhang, H. (2019). A neural image captioning model with caption-to-images semantic constructor. Neurocomputing, 367, 144-151.
[45] Poleak, C., & Kwon, J. (2019). Parallel image captioning using 2D masked convolution. Applied Sciences, 9(9), 1871.
[46] Tan, Y. H., & Chan, C. S. (2019). Phrase-based image caption generator with hierarchical LSTM network. Neurocomputing, 333, 86-100.
[47] Wu, L., Xu, M., Wang, J., & Perry, S. (2019). Recall what you see continually using GridLSTM in image captioning. IEEE Transactions on Multimedia.
[48] Guan, J., & Wang, E. (2018). Repeated review based image captioning for image evidence review. Signal Processing: Image Communication, 63, 141-148.
[49] Xu, N., Liu, A. A., Liu, J., Nie, W., & Su, Y. (2019). Scene graph captioner: Image captioning based on structural visual representation. Journal of Visual Communication and Image Representation, 58, 477-485.
[50] Xian, Y., & Tian, Y. (2019). Self-guiding multimodal LSTM—when we do not have a perfect training dataset for image captioning. IEEE Transactions on Image Processing, 28(11), 5241-5252.
[51] Vinyals, O., Toshev, A., Bengio, S., & Erhan, D. (2016). Show and tell: Lessons learned from the 2015 MSCOCO image captioning challenge. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(4), 652-663.
[52] Ding, S., Qu, S., Xi, Y., & Wan, S. (2019). Stimulus-driven and concept-driven analysis for image caption generation. Neurocomputing.
[53] Lu, S., Hu, R., Liu, J., Guo, L., & Zheng, F. (2019). Structure preserving convolutional attention for image captioning. Applied Sciences, 9(14), 2888.
[54] Yu, N., Hu, X., Song, B., Yang, J., & Zhang, J. (2018). Topic-oriented image captioning based on order-embedding. IEEE Transactions on Image Processing, 28(6), 2743-2754.
[55] Park, C. C., Kim, B., & Kim, G. (2018). Towards personalized image captioning via multimodal memory networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(4), 999-1012.
[56] Yang, L., & Hu, H. (2017). TVPRNN for image caption generation. Electronics Letters, 53(22), 1471-1473.
[57] Zhang, Z., Zhang, W., Diao, W., Yan, M., Gao, X., & Sun, X. (2019). VAA: Visual aligning attention model for remote sensing image captioning. IEEE Access, 7, 137355-137364.
[58] He, X., Yang, Y., Shi, B., & Bai, X. (2019). VD-SAN: Visual-densely semantic attention network for image caption generation. Neurocomputing, 328, 48-55.
[59] Li, X., Yuan, A., & Lu, X. (2019). Vision-to-language tasks based on attributes and attention mechanism. IEEE Transactions on Cybernetics.
[60] Yang, L., & Hu, H. (2019). Visual skeleton and reparative attention for part-of-speech image captioning system. Computer Vision and Image Understanding, 189, 102819.
[61] Staniūtė, R., & Šešok, D. (2019). A systematic literature review on image captioning. Applied Sciences, 9(10), 2024.
