Image Captioning Using Deep Learning
International Journal of Advanced Computer Science and Applications, Vol. 11, No. 5, May 2020. DOI: 10.14569/IJACSA.2020.0110537
Abstract—Auto image captioning is defined as the process of generating captions or textual descriptions for images based on the contents of the image. It is a machine learning task that involves both natural language processing (for text generation) and computer vision (for understanding image contents). Auto image captioning is a recent and growing research problem, and new methods are being introduced day by day to achieve satisfactory results in this field. However, considerable attention is still required to achieve results as good as a human's. This study aims to find out, in a systematic way, which recent methods and models are used for image captioning using deep learning, how those models are implemented, and which methods are more likely to give good results. To do so, we performed a systematic literature review of recent studies from 2017 to 2019 drawn from well-known databases (Scopus, Web of Science, IEEE Xplore). We found a total of 61 primary studies relevant to the objective of this research. We found that CNN is used to understand image contents and detect objects in an image, while RNN or LSTM is used for language generation. The most commonly used datasets are MS COCO, used in almost all studies, along with Flickr8k and Flickr30k. The most commonly used evaluation metric is BLEU (1 to 4), used in almost all studies. It is also found that LSTM with CNN has outperformed RNN with CNN. We found that the two most promising methods for implementing these models are the encoder-decoder and attention mechanisms, and that a combination of the two can improve results considerably. This research provides guidelines and recommendations to researchers who want to contribute to auto image captioning.

Keywords—Image Captioning; Deep Learning; Neural Network; Recurrent Neural Network (RNN); Convolutional Neural Network (CNN); Long Short Term Memory (LSTM)
I. INTRODUCTION

Auto image captioning is the process of automatically generating human-like descriptions of images. It is a prominent task with real practical and industrial significance [62]. Auto image captioning has practical uses in industry, security, surveillance, medicine, agriculture and many other important domains. It is not only a very crucial but also a very challenging task in computer vision [1]. Traditional object detection and image classification tasks only need to identify objects within the image, whereas auto image captioning requires not just identifying the objects but also identifying the relationships between them and understanding the overall scene. After understanding the scene, a human-like description of that image must also be generated. Since the boost of automation and Artificial Intelligence, a great deal of research has aimed to give machines human-like capabilities and reduce manual work, and achieving results and accuracy as good as a human's on the image captioning problem has always been very challenging for machines.

Auto image captioning is performed through the following key tasks, in order. First, features are extracted; once features are properly extracted, the different objects in the image are detected; after that, the relationships between the objects are identified (e.g. if the objects are a cat and grass, it must be identified whether the cat is on the grass). Once objects are detected and relationships are identified, the textual description is generated, i.e. a sequence of words ordered so that they form a good sentence reflecting the relationships between the image objects.

To perform the above key tasks using deep learning, different deep learning networks are used. For example, to obtain visual features and objects, a CNN with region-proposal models such as R-CNN or Faster R-CNN can be used, and to generate the textual description in sequence, an RNN or LSTM can be used (a minimal sketch of such an encoder-decoder pipeline is given at the end of this section). Using these networks, various methods have been developed to perform auto image captioning in different domains. However, there is still room to make machines capable enough to generate descriptions like a human [61]. After training a deep learning network for image captioning, various evaluation metrics such as BLEU, CIDEr and ROUGE-L exist to evaluate its performance.

The purpose of this systematic literature review is to study the newest articles, from 2017 to 2019, to find the different methods used to achieve auto image captioning in different domains, which datasets are used for the task, in which practical domains the task is applied, and which techniques outperform others, and finally to describe the technicalities behind the different networks, methods and evaluation metrics. Our study will help new researchers who want to work in this domain to attain better accuracy. We especially focused on the collection of quality articles published to date. We attempt to identify the different techniques presented in articles [1-60] and to find their methods' strengths and weaknesses. Finally, we attempt to summarize them to explain which technique performs better in its particular domain. Our work mostly focuses on identifying the most popular techniques and the areas that still require attention, and in the results section we also attempt to explain the technical concepts behind the approaches used.
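To make the encoder-decoder pipeline described above concrete, the sketch below shows a minimal CNN encoder feeding an LSTM decoder. It is an illustrative sketch only, under simplifying assumptions (a toy CNN instead of a pretrained backbone or Faster R-CNN region features, a hypothetical vocabulary size of 10,000, and teacher forcing during training); it is not the implementation used by any of the reviewed studies.

```python
# Minimal illustrative CNN-encoder / LSTM-decoder captioning sketch (PyTorch).
import torch
import torch.nn as nn

class CNNEncoder(nn.Module):
    """Extracts a fixed-length visual feature vector from an image."""
    def __init__(self, feat_dim=256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),            # global average pooling
        )
        self.fc = nn.Linear(64, feat_dim)

    def forward(self, images):                  # images: (B, 3, H, W)
        x = self.conv(images).flatten(1)        # (B, 64)
        return self.fc(x)                       # (B, feat_dim)

class LSTMDecoder(nn.Module):
    """Generates a caption word by word, conditioned on the image feature."""
    def __init__(self, vocab_size, feat_dim=256, embed_dim=256, hidden=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.init_h = nn.Linear(feat_dim, hidden)  # image feature -> initial states
        self.init_c = nn.Linear(feat_dim, hidden)
        self.lstm = nn.LSTM(embed_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, feats, captions):          # captions: (B, T) token ids
        h0 = self.init_h(feats).unsqueeze(0)     # (1, B, hidden)
        c0 = self.init_c(feats).unsqueeze(0)
        emb = self.embed(captions)               # (B, T, embed_dim)
        hidden, _ = self.lstm(emb, (h0, c0))
        return self.out(hidden)                  # (B, T, vocab_size) word scores

# Usage sketch with random data (teacher forcing: predict the next word).
encoder, decoder = CNNEncoder(), LSTMDecoder(vocab_size=10000)
images = torch.randn(4, 3, 224, 224)
captions = torch.randint(0, 10000, (4, 12))
logits = decoder(encoder(images), captions[:, :-1])
loss = nn.CrossEntropyLoss()(logits.reshape(-1, 10000), captions[:, 1:].reshape(-1))
```

Replacing `nn.LSTM` with `nn.RNN` gives the CNN+RNN variant discussed later; attention-based models additionally let the decoder look at spatial CNN features at every step instead of a single pooled vector.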
II. METHODOLOGY

The planning, conducting and reporting of this systematic literature review was done step by step. First, in the planning stage, we identified the need for conducting this research and its importance. Identifying the research questions, designing the search strategy, designing the quality assessment criteria and, finally, designing the data extraction strategy were also planned during this stage. After proper planning we conducted the research. In alignment with our research problem we came up with research questions for which we try to find answers during this research.

A. Research Questions

Before conducting this study we set the following research questions to measure the quality of our work. This study basically provides detailed knowledge related to these research questions. Table I lists the research questions.
TABLE III. INITIAL STAGE RESULTS FROM IEEE XPLORE, WEB OF SCIENCE AND SCOPUS

Database | Original search results
IEEE Xplore | 247
Web of Science | 167
Scopus | 313
Total | 727

Database | Abstract screening results
IEEE Xplore | 63
Web of Science | 92
Scopus | 143
Total | 308
C. Quality Assessment Criteria

The quality of the 308 selected articles was assessed against quality assessment criteria to ensure the quality of our study. We went through the full-text screening of those studies which were ambiguous and not clear from abstract screening. The process of quality assessment criteria (QAC) was carried out together with full-text screening. All four authors agreed on a set of quality assessment questions (QAQ) to ensure the quality of our work:

QA Q1: The article must be published in a journal.
QA Q2: The article has proposed a proper method to implement image captioning using deep learning.
QA Q3: The article must have clear and unambiguous results.
QA Q4: The article must discuss the applications and challenges of image captioning.
QA Q5: The article must discuss the evaluation strategy of the built model.

We assessed the quality of the 308 studies on the basis of the quality assessment criteria (QAC) questions and, through full-text screening, found a total of 61 studies across the three databases. The number of studies from each database is shown in Table VI. The results found above are illustrated in the PRISMA diagram (see Fig. 1), which depicts this whole selection process.

D. Data Extraction and Synthesis

After the selection of the final 61 primary studies, we extracted data from those studies for the final synthesis. We defined our data extraction strategy based on our research questions and extracted the following parameters from the primary studies: year the article was published, title, models used for language generation and object detection, methods used to implement the models, datasets used, evaluation metrics used, and finally the accuracy of the proposed model.

The purpose of the synthesis is to summarize the facts extracted during data extraction and to give a clear picture of the work done in the past as well as directions for new researchers.

TABLE VI. NUMBER OF STUDIES AFTER QUALITY ASSESSMENT CRITERIA

Database | Quality assessment results
IEEE Xplore | 12
Web of Science | 16
Scopus | 33
Total | 61
SPICE. It is found that the BLEU score is the most popular evaluation method, used by almost all of the studies. This can be verified from Fig. 7 and Table VIII.

6) BLEU: BLEU stands for bilingual evaluation understudy. It is an evaluation mechanism widely used in text generation: it compares machine-generated text with one or more manually written reference texts, and so summarizes how close a generated text is to an expected text. The BLEU score is most prevalent in automated machine translation, but it can also be used in image captioning, text summarization, speech recognition, etc. In image captioning in particular, the BLEU score measures how close a generated caption is to a manual, human-written caption of that image. The score lies between 0.0 and 1.0, where 1.0 is a perfect score and 0.0 is the worst score.

We found that almost all studies used BLEU as their evaluation metric and calculated BLEU-1 to BLEU-4, where BLEU-1 is computed on 1-grams, BLEU-2 on 2-grams, BLEU-3 on 3-grams and BLEU-4 on 4-grams.

The BLEU score can be calculated from the following formula:

BLEU = BP \cdot \exp\left( \sum_{n=1}^{N} w_n \log p_n \right)

7) METEOR: METEOR stands for metric for evaluation of translation with explicit ordering. While BLEU takes account of the entire generated text, overshadowing the score of each individual generated sentence, METEOR takes care of that. To do so, METEOR enhances the precision and recall functions: instead of plain precision and recall, METEOR uses a weighted F-score over mapped unigrams, and a penalty function for incorrect word order.

The formula for the weighted F function is:

F = \frac{PR}{\alpha P + (1 - \alpha)R}

where P and R stand for precision and recall, calculated as m/c and m/r respectively, with c and r the candidate and reference lengths and m the number of mapped unigrams between the two texts.

The formula for the penalty function is:

Penalty = \gamma \left( \frac{c}{m} \right)^{\beta}, \quad 0 \le \gamma \le 1

where c is the number of matched chunks and m is the total number of matches.

The overall METEOR score is then found by:

M = (1 - Penalty) \cdot F

8) ROUGE-L: ROUGE stands for recall-oriented understudy for gisting evaluation. As is clear from its name, ROUGE is based only on recall, but ROUGE-L is based on an F score, which is the harmonic mean of its precision and recall values. The formulas for calculating the precision and recall values are:

P = \frac{LCS(A, B)}{m} \quad \text{and} \quad R = \frac{LCS(A, B)}{n}

where A and B are the candidate and reference texts, m and n are their lengths, and LCS stands for the longest common subsequence, on which ROUGE-L depends. F is then calculated as their weighted harmonic mean:

F = \frac{(1 + b^2)RP}{R + b^2 P}
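To make the formulas above concrete, the short sketch below computes a simplified BLEU score (clipped n-gram precision with a brevity penalty and uniform weights w_n = 1/N) and a ROUGE-L F score via longest common subsequence. It is a simplified illustration written for this review, not the reference implementation of either metric; smoothing, detailed tokenization, multiple references and METEOR's alignment step are deliberately omitted.

```python
# Simplified BLEU and ROUGE-L, matching the formulas in this section.
# Illustrative only: whitespace tokenization, single reference, no smoothing.
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    cand, ref = candidate.split(), reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        c_ngrams, r_ngrams = ngrams(cand, n), ngrams(ref, n)
        overlap = sum((c_ngrams & r_ngrams).values())       # clipped counts
        total = max(sum(c_ngrams.values()), 1)
        precisions.append(max(overlap, 1e-9) / total)        # avoid log(0)
    # Brevity penalty BP, then geometric mean of n-gram precisions (w_n = 1/N).
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / max(len(cand), 1))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

def rouge_l(candidate, reference, beta=1.2):
    a, b = candidate.split(), reference.split()
    # Longest common subsequence via dynamic programming.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    lcs = dp[len(a)][len(b)]
    p, r = lcs / max(len(a), 1), lcs / max(len(b), 1)
    if p == 0 or r == 0:
        return 0.0
    return (1 + beta ** 2) * p * r / (r + beta ** 2 * p)   # F = (1+b^2)RP / (R+b^2P)

print(bleu("a cat sits on the grass", "a cat is sitting on the grass"))
print(rouge_l("a cat sits on the grass", "a cat is sitting on the grass"))
```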
[Figure: Datasets used in the primary studies (bar chart).]
[Figure: Deep learning model combinations used in the primary studies (bar chart): CNN+RNN, CNN+LSTM, CNN+RNN+LSTM.]
TABLE VII. DEEP LEARNING MODELS USED FOR IMAGE CAPTIONING IN LITERATURE

Sr. # | Studies | RNN | LSTM
1 | [1], [2], [4], [6], [8], [10], [11], [13], [14], [15], [16], [18], [19], [20], [22], [23], [26], [27], [28], [29], [31], [32], [33], [36], [38], [37], [40], [41], [42], [43], [44], [45], [46], [47], [49], [50], [52], [53], [54], [57], [58] | | ✔
2 | [5], [7], [9], [12], [17], [30], [34], [35], [51], [56], [21] | ✔ |
3 | [18], [24], [25], [39], [48], [59], [60] | ✔ | ✔
[Figure: Number of studies using RNN vs. LSTM per year, 2017-2019 (bar chart).]
Fig. 6. Comparison of Best Score Achieved by RNN and LSTM (B1 Result Comparison).
[Figure: Evaluation metrics used in the primary studies (bar chart): BLEU, METEOR, CIDEr, ROUGE-L, SPICE.]
EVALUATION METRICS (B = BLEU, M = METEOR, C = CIDEr, R = ROUGE-L, S = SPICE)

Sr. No. | Studies | Metrics used (B M C R S)
1 | [1], [2], [3], [4], [6], [8], [10], [11], [13], [15], [16], [18], [20], [22], [26], [29], [30], [31], [32], [33], [38], [39], [40], [42], [43], [44], [49], [50], [52], [53], [54], [55], [56], [57] | ✔ ✔ ✔ ✔
2 | [5], [12], [19], [23], [34], [27], [28], [36], [37], [45], [51], [58], [59] | ✔ ✔ ✔
3 | [7], [35] | ✔ ✔
4 | [9], [41], [47], [48] | ✔
5 | [17] | ✔ ✔ ✔ ✔
6 | [25] | ✔ ✔ ✔
7 | [34] | ✔
8 | [46] | ✔ ✔ ✔ ✔
9 | [60] | ✔ ✔ ✔ ✔ ✔
IV. CONCLUSION

This systematic literature review (SLR) presents a detailed analysis of the different deep learning models used for image captioning. To perform the study we searched articles from three academic databases and, after applying inclusion and exclusion criteria on all articles, selected 61 primary studies for the literature review. Using the data extraction mechanism we extracted the data and analyzed it in depth. We found various models and techniques used for image captioning. For image content extraction, CNN is the best-suited model, and for language generation the two frequently used models are RNN and LSTM; it is found that LSTM has outperformed RNN. We also found that different studies have used several mechanisms for scene understanding, such as the encoder-decoder mechanism and the attention mechanism. The most suitable dataset for image captioning is MS COCO, because it contains non-iconic images, unlike other datasets.

Throughout our review we have observed that image captioning has mostly been studied in a general setting, yet there are various domains that could take advantage of image captioning to automate their tasks:

1) A model can be trained on medical ultrasound or MRI images or angiographic videos to generate a complete report of a person without requiring input from a doctor. Image captioning can be used to generate an automatic report by looking at those medical images of a person.

2) Image captioning can also be used in industry to automate various tasks. A model can be trained on images of a company's product-manufacturing environment to automatically find anomalies in the environment or the product. It can also be used to detect any mishap in a company, such as fire or security issues.

3) Image captioning can also be used in agriculture to generate crop reports for owners by looking at images of the crops.

4) Image captioning can also be used for traffic-analysis report generation using CCTV cameras installed on streets, and can thus guide drivers on the best path to take and where parking is available.

REFERENCES
[1] Yang, L., & Hu, H. (2019). Adaptive syncretic attention for constrained image captioning. Neural Processing Letters, 50(1), 549-564.
[2] Fu, K., Jin, J., Cui, R., Sha, F., & Zhang, C. (2016). Aligning where to see and what to tell: Image captioning with region-based attention and scene-specific contexts. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(12), 2321-2334.
[3] Li, J., Yao, P., Guo, L., & Zhang, W. (2019). Boosted Transformer for image captioning. Applied Sciences, 9(16), 3260.
[4] Oluwasanmi, A., Aftab, M. U., Alabdulkreem, E., Kumeda, B., Baagyere, E. Y., & Qin, Z. (2019). CaptionNet: Automatic end-to-end siamese difference captioning model with attention. IEEE Access, 7, 106773-106783.
[5] Wu, J., & Hu, H. (2017). Cascade recurrent neural network for image caption generation. Electronics Letters, 53(25), 1642-1643.
[6] Tan, J. H., Chan, C. S., & Chuah, J. H. (2019). COMIC: Toward a compact image captioning model with attention. IEEE Transactions on Multimedia, 21(10), 2686-2696.
[7] Huang, G., & Hu, H. (2019). c-RNN: A fine-grained language model for image captioning. Neural Processing Letters, 49(2), 683-691.
[8] Xiao, F., Gong, X., Zhang, Y., Shen, Y., Li, J., & Gao, X. (2019). DAA: Dual LSTMs with adaptive attention for image captioning. Neurocomputing, 364, 322-329.
[9] Dhir, R., Mishra, S. K., Saha, S., & Bhattacharyya, P. (2019). A deep attention based framework for image caption generation in Hindi language. Computación y Sistemas, 23(3).
[10] Xiao, X., Wang, L., Ding, K., Xiang, S., & Pan, C. (2019). Deep hierarchical encoder-decoder network for image captioning. IEEE Transactions on Multimedia, 21(11), 2942-2956.
[11] Zeng, X., Wen, L., Liu, B., & Qi, X. (2019). Deep learning for ultrasound image caption generation based on object detection. Neurocomputing.
[12] Karpathy, A., & Fei-Fei, L. (2015). Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 3128-3137).
[13] Xiao, X., Wang, L., Ding, K., Xiang, S., & Pan, C. (2019). Dense semantic embedding network for image captioning. Pattern Recognition, 90, 285-296.
[14] Han, M., Chen, W., & Moges, A. D. (2019). Fast image captioning using LSTM. Cluster Computing, 22(3), 6143-6155.
[15] Dash, S. K., Saha, S., Pakray, P., & Gelbukh, A. (2019). Generating image captions through multimodal embedding. Journal of Intelligent & Fuzzy Systems, 36(5), 4787-4796.
[16] Li, L., Tang, S., Zhang, Y., Deng, L., & Tian, Q. (2017). GLA: Global-local attention for image description. IEEE Transactions on Multimedia, 20(3), 726-737.
[17] Kinghorn, P., Zhang, L., & Shao, L. (2019). A hierarchical and regional deep learning architecture for image description generation. Pattern Recognition Letters, 119, 77-85.
[18] Su, Y., Li, Y., Xu, N., & Liu, A. A. (2019). Hierarchical deep neural network for image captioning. Neural Processing Letters, 1-11.
[19] Zhang, Z., Wu, Q., Wang, Y., & Chen, F. (2018). High-quality image captioning with fine-grained and semantic-guided visual attention. IEEE Transactions on Multimedia, 21(7), 1681-1693.
[20] Shetty, R., Tavakoli, H. R., & Laaksonen, J. (2018). Image and video captioning with augmented neural architectures. IEEE MultiMedia, 25(2), 34-46.
[21] Dong, J., Li, X., & Snoek, C. G. (2018). Predicting visual features from text for image and video caption retrieval. IEEE Transactions on Multimedia, 20(12), 3377-3388.
[22] Ding, S., Qu, S., Xi, Y., Sangaiah, A. K., & Wan, S. (2019). Image caption generation with high-level image features. Pattern Recognition Letters, 123, 89-95.
[23] Wu, Q., Shen, C., Wang, P., Dick, A., & van den Hengel, A. (2017). Image captioning and visual question answering based on attributes and external knowledge. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(6), 1367-1381.
[24] Yang, J., Sun, Y., Liang, J., Ren, B., & Lai, S. H. (2019). Image captioning by incorporating affective concepts learned from both visual and textual components. Neurocomputing, 328, 56-68.
[25] Guo, R., Ma, S., & Han, Y. (2019). Image captioning: from structural tetrad to translated sentences. Multimedia Tools and Applications, 78(17), 24321-24346.
[26] Zhang, X., He, S., Song, X., Lau, R. W., Jiao, J., & Ye, Q. (2019). Image captioning via semantic element embedding. Neurocomputing.
[27] Cao, P., Yang, Z., Sun, L., Liang, Y., Yang, M. Q., & Guan, R. (2019). Image captioning with bidirectional semantic attention-based guiding of long short-term memory. Neural Processing Letters, 50(1), 103-119.
[28] Wang, C., Yang, H., & Meinel, C. (2018). Image captioning with deep bidirectional LSTMs and multi-task learning. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), 14(2s), 1-20.
[29] Chen, H., Ding, G., Lin, Z., Guo, Y., Shan, C., & Han, J. (2019). Image captioning with memorized knowledge. Cognitive Computation, 1-14.
[30] He, C., & Hu, H. (2019). Image captioning with text-based visual attention. Neural Processing Letters, 49(1), 177-185.
[31] Zhu, X., Li, L., Liu, J., Li, Z., Peng, H., & Niu, X. (2018). Image captioning with triple-attention and stack parallel LSTM. Neurocomputing, 319, 55-65.
[32] Guo, R., Ma, S., & Han, Y. (2019). Image captioning: from structural tetrad to translated sentences. Multimedia Tools and Applications, 78(17), 24321-24346.
[33] Fu, K., Li, J., Jin, J., & Zhang, C. (2018). Image-text surgery: Efficient concept learning in image captioning by generating pseudopairs. IEEE Transactions on Neural Networks and Learning Systems, 29(12), 5910-5921.
[34] Lan, W.-Y., Wang, X.-X., Yang, G., & Li, X.-R. (2019). Improving Chinese image captioning by tag prediction.
[35] Li, X., & Jiang, S. (2019). Know more say less: Image captioning based on scene graphs. IEEE Transactions on Multimedia, 21(8), 2117-2130.
[36] Chen, X., Zhang, M., Wang, Z., Zuo, L., Li, B., & Yang, Y. (2018). Leveraging unpaired out-of-domain data for image captioning. Pattern Recognition Letters.
[37] Fang, F., Wang, H., Chen, Y., & Tang, P. (2018). Looking deeper and transferring attention for image captioning. Multimedia Tools and Applications, 77(23), 31159-31175.
[38] Wang, E. K., Zhang, X., Wang, F., Wu, T. Y., & Chen, C. M. (2019). Multilayer dense attention model for image caption. IEEE Access, 7, 66358-66368.
[39] Xu, N., Zhang, H., Liu, A. A., Nie, W., Su, Y., Nie, J., & Zhang, Y. (2019). Multi-level policy and reward-based deep reinforcement learning framework for image captioning. IEEE Transactions on Multimedia.
[40] Yu, J., Li, J., Yu, Z., & Huang, Q. (2019). Multimodal transformer with multi-view visual representation for image captioning. IEEE Transactions on Circuits and Systems for Video Technology.
[41] Cui, W., Zhang, D., He, X., Yao, M., Wang, Z., Hao, Y., ... & Huang, J. (2019). Multi-scale remote sensing semantic analysis based on a global perspective. ISPRS International Journal of Geo-Information, 8(9), 417.
[42] Yang, M., Zhao, W., Xu, W., Feng, Y., Zhao, Z., Chen, X., & Lei, K. (2018). Multitask learning for cross-domain image captioning. IEEE Transactions on Multimedia, 21(4), 1047-1061.
[43] Ding, G., Chen, M., Zhao, S., Chen, H., Han, J., & Liu, Q. (2019). Neural image caption generation with weighted training and reference. Cognitive Computation, 11(6), 763-777.
[44] Su, J., Tang, J., Lu, Z., Han, X., & Zhang, H. (2019). A neural image captioning model with caption-to-images semantic constructor. Neurocomputing, 367, 144-151.
[45] Poleak, C., & Kwon, J. (2019). Parallel image captioning using 2D masked convolution. Applied Sciences, 9(9), 1871.
[46] Tan, Y. H., & Chan, C. S. (2019). Phrase-based image caption generator with hierarchical LSTM network. Neurocomputing, 333, 86-100.
[47] Wu, L., Xu, M., Wang, J., & Perry, S. (2019). Recall what you see continually using GridLSTM in image captioning. IEEE Transactions on Multimedia.
[48] Guan, J., & Wang, E. (2018). Repeated review based image captioning for image evidence review. Signal Processing: Image Communication, 63, 141-148.
[49] Xu, N., Liu, A. A., Liu, J., Nie, W., & Su, Y. (2019). Scene graph captioner: Image captioning based on structural visual representation. Journal of Visual Communication and Image Representation, 58, 477-485.
[50] Xian, Y., & Tian, Y. (2019). Self-guiding multimodal LSTM—when we do not have a perfect training dataset for image captioning. IEEE Transactions on Image Processing, 28(11), 5241-5252.
[51] Vinyals, O., Toshev, A., Bengio, S., & Erhan, D. (2016). Show and tell: Lessons learned from the 2015 MSCOCO image captioning challenge. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(4), 652-663.
[52] Ding, S., Qu, S., Xi, Y., & Wan, S. (2019). Stimulus-driven and concept-driven analysis for image caption generation. Neurocomputing.
[53] Lu, S., Hu, R., Liu, J., Guo, L., & Zheng, F. (2019). Structure preserving convolutional attention for image captioning. Applied Sciences, 9(14), 2888.
[54] Yu, N., Hu, X., Song, B., Yang, J., & Zhang, J. (2018). Topic-oriented image captioning based on order-embedding. IEEE Transactions on Image Processing, 28(6), 2743-2754.
[55] Park, C. C., Kim, B., & Kim, G. (2018). Towards personalized image captioning via multimodal memory networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(4), 999-1012.
[56] Yang, L., & Hu, H. (2017). TVPRNN for image caption generation. Electronics Letters, 53(22), 1471-1473.
[57] Zhang, Z., Zhang, W., Diao, W., Yan, M., Gao, X., & Sun, X. (2019). VAA: Visual aligning attention model for remote sensing image captioning. IEEE Access, 7, 137355-137364.
[58] He, X., Yang, Y., Shi, B., & Bai, X. (2019). VD-SAN: Visual-densely semantic attention network for image caption generation. Neurocomputing, 328, 48-55.
[59] Li, X., Yuan, A., & Lu, X. (2019). Vision-to-language tasks based on attributes and attention mechanism. IEEE Transactions on Cybernetics.
[60] Yang, L., & Hu, H. (2019). Visual skeleton and reparative attention for part-of-speech image captioning system. Computer Vision and Image Understanding, 189, 102819.
[61] Staniūtė, R., & Šešok, D. (2019). A systematic literature review on image captioning. Applied Sciences, 9(10), 2024.