AN END-TO-END TRAINING BASED APPROACH FOR CAPTIONING
COMPLEX IMAGES USING NEURAL NETWORKS
By
Shreyas V Kashyap - 1BG13CS097
Swaroop Gupta B.A -1BG13CS113
Spandana Rao B.A - 1BG13CS103
Under the Guidance of
Smt.Jebah Jaykumar
AREA OF RESEARCH
We have hereby chose the area of research to be a combination of the following
research extensive areas.
Machine Learning
Image Processing
CORE AREA OF PROBLEM
Our core area of problem is
Computer Vision
Program a computer to "understand" a scene or features in an image.
Concerned with the automatic extraction, analysis and understanding of useful
information from a single image or a sequence of images
RELEVANCE OF THE PROBLEM
Object detection
Currently the existing state of the art methods can detect a single object or
multiple non overlapping objects
This makes the detection useless for any analysis of an entire scene
Object labelling
The labelling of single prominent object in an image
Unable to describe a scene in a complex dense detailed image
APPLICATION OF THE PROBLEM
Currently the problem persists in the following applications of Computer Vision
Image search
The existing image search works only on the file name of an image, and not on the
details of the scene
This should be overcome and the search should happen only on the basis of what is
there in the image, instead of the filename
Image scene detection
The existing Image detection, detects a single prominent part of the image and
cannot detect if there are variations of viewpoint.
This should be overcome and multiple objects need to be detected and the whole
scene has to be described.
PROBLEM STATEMENT
Our ability to effortlessly describe all aspects of an image relies on a strong semantic
understanding of a visual scene and all of its elements. However, despite numerous
potential applications, this ability remains a challenge for our state of the art visual
recognition systems
Our goal is to design an architecture that jointly localizes regions of interest and the
describes each with natural language
Architecture is composed of a Convolutional Network, an efficient dense localization
layer, and Recurrent Neural Network language model that generates the label
sequences for the complex images.
INPUT / OUTPUT EXAMPLE
Sample Input :
Output of existing System :
Output Expected :
THANK YOU