Aligned Dual Channel Graph Convolutional Network for Visual Question Answering

Qingbao Huang1,2, Jielong Wei2, Yi Cai1*, Changmeng Zheng1, Junying Chen1, Ho-fung Leung3, Qing Li4
1 School of Software Engineering, South China University of Technology, Guangzhou, China
2 School of Electrical Engineering, Guangxi University, Nanning, Guangxi, China
3 The Chinese University of Hong Kong, Hong Kong SAR, China
4 The Hong Kong Polytechnic University, Hung Hom, Kowloon, Hong Kong, China
[email protected], [email protected], [email protected]
* Corresponding author: Yi Cai ([email protected]).

Abstract

Visual question answering aims to answer a natural language question about a given image. Existing graph-based methods focus only on the relations between objects in an image and neglect the importance of the syntactic dependency relations between words in a question. To simultaneously capture the relations between objects in an image and the syntactic dependency relations between words in a question, we propose a novel dual channel graph convolutional network (DC-GCN) that better combines the advantages of the visual and textual modalities. The DC-GCN model consists of three parts: an I-GCN module to capture the relations between objects in an image, a Q-GCN module to capture the syntactic dependency relations between words in a question, and an attention alignment module to align the image and question representations. Experimental results show that our model achieves comparable performance with the state-of-the-art approaches.

Figure 1: (a) The question and the ground-truth answer. (b) The wrong answer predicted by a state-of-the-art model, which focuses on the highlighted region in the image. The depth of the color indicates the weights of the words in the question, where a deeper color represents a higher weight. The question is processed by syntactic dependency parsing. (c) The dependency parse of the question, obtained with the universal Stanford Dependencies tool (De Marneffe et al., 2014).
1 Introduction

As a form of visual Turing test, visual question answering (VQA) has drawn much attention. The goal of VQA (Antol et al., 2015; Goyal et al., 2017) is to answer a natural language question related to the contents of a given image. Attention mechanisms serve as the backbone of the previous mainstream approaches (Lu et al., 2016; Yang et al., 2016; Yu et al., 2017); however, they tend to catch only the most discriminative information and ignore other rich complementary clues (Liu et al., 2019).

Recent VQA studies have been exploring higher-level semantic representations of images, notably graph-based structures for better image understanding, such as scene graph generation (Xu et al., 2017; Yang et al., 2018), visual relationship detection (Yao et al., 2018), object counting (Zhang et al., 2018a), and relation reasoning (Cao et al., 2018; Li et al., 2019; Cadene et al., 2019a). Representing images as graphs allows one to explicitly model the interactions between two objects in an image, so as to seamlessly transfer information between graph nodes (e.g., objects in an image).

Very recent methods (Li et al., 2019; Cadene et al., 2019a; Yu et al., 2019) have achieved remarkable performance, but there is still a big gap between them and humans. As shown in Figure 1(a), given an image of a group of persons and the corresponding question, a VQA system needs to not only recognize the objects in the image (e.g., batter, umpire and catcher), but also grasp the textual information in the question "what color is the umpire's shirt". However, even many competitive VQA models, including the state-of-the-art methods, struggle to process such cases accurately and as a result predict the incorrect answer (black) rather than the correct answer (blue).

Although the relations between two objects in an image have been considered, the attention-based VQA models lack building blocks to explicitly capture the syntactic dependency relations between words in a question. As shown in Figure 1(c), these dependency relations can reflect which object is being asked about (e.g., the word umpire's modifies the word shirt) and which aspect of the object is being asked about (e.g., the word color is the direct object of the word is). If a VQA model only knows the word shirt rather than the relation between the words umpire's and shirt in a question, it is difficult to distinguish which object is being asked about. In fact, we do need such modifier relations to discriminate the correct object from multiple similar objects. Therefore, we consider it necessary to explore the relations between words at the linguistic level in addition to constructing the relations between objects at the visual level.

Motivated by this, we propose a dual channel graph convolutional network (DC-GCN) to simultaneously capture the relations between objects in an image and the syntactic dependency relations between words in a question. Our proposed DC-GCN model consists of an Image-GCN (I-GCN) module, a Question-GCN (Q-GCN) module, and an attention alignment module. The I-GCN module captures the relations between objects in an image, the Q-GCN module captures the syntactic dependency relations between words in a question, and the attention alignment module is used to align the two representations of image and question. The contributions of this work are summarized as follows:

1) We propose a dual channel graph convolutional network (DC-GCN) to simultaneously capture the visual and textual relations, and design the attention alignment module to align the multimodal representations, thus reducing the semantic gaps between vision and language.

2) We explore how to construct the syntactic dependency relations between words at the linguistic level via graph convolutional networks, as well as the relations between objects at the visual level.

3) We conduct extensive experiments and ablation studies on the VQA-v2 and VQA-CP-v2 datasets to examine the effectiveness of our DC-GCN model. Experimental results show that the DC-GCN model achieves competitive performance with the state-of-the-art approaches.

2 Related Works

Visual Question Answering. Attention mechanisms have been proven effective on many tasks, such as machine translation (Bahdanau et al., 2014) and image captioning (Pedersoli et al., 2017). A number of methods have been developed so far, in which question-guided attention on image regions is commonly used. These can be categorized into two classes according to the types of employed image features. One class uses visual features from region proposals, which are generated by a Region Proposal Network (Ren et al., 2015). The other class uses convolutional features (i.e., activations of convolutional layers).

To learn a better representation of the question, the Stacked Attention Network (Yang et al., 2016), which can search question-related image regions, is designed by performing multi-step visual attention operations. A co-attention mechanism that jointly performs question-guided visual attention and image-guided question attention is proposed to solve the problems of which regions to look at and what words to listen to (Shih et al., 2016). To obtain more fine-grained interaction between image and question, some researchers introduce rather sophisticated fusion strategies. Bilinear pooling methods (Kim et al., 2018; Yu et al., 2017, 2018) are among the pioneering works that efficiently and expressively combine multimodal features by using an outer product of two vectors.

Recently, some researchers have devoted themselves to overcoming the priors in the VQA dataset and proposed methods such as GVQA (Agrawal et al., 2018), UpDn + Q-Adv + DoE (Ramakrishnan et al., 2018), and RUBi (Cadene et al., 2019b) to address the language biases on the VQA-CP-v2 dataset.

Graph Networks. Graph networks are powerful models that can perform relational inference through message passing. The core idea is to enable communication between image regions to build contextualized representations of these regions. Below we review some of the recent works that rely on graph networks and other contextualized representations for VQA.
Recent research works (Cadene et al., 2019a; Li et al., 2019) focus on how to deal with complex scenes and relation reasoning to obtain better image representations. Based on multimodal attentional networks, Cadene et al. (2019a) introduce an atomic reasoning primitive that represents interactions between the question and an image region by a rich vectorial representation and models region relations with pairwise combinations. GCNs, which can better explore the visual relations between objects and aggregate each node's own features with its neighbors' features, have been applied to various tasks, such as text classification (Yao et al., 2019), relation extraction (Guo et al., 2019; Zhang et al., 2018b), and scene graph generation (Yang et al., 2018; Yao et al., 2018). To answer complicated questions about an image, a relation-aware graph attention network (ReGAT) (Li et al., 2019) is proposed to encode each image into a graph and model multi-type inter-object relations, such as spatial, semantic and implicit relations, via a graph attention mechanism. One limitation of ReGAT (Li et al., 2019) lies in the fact that it solely considers the relations between objects in an image while neglecting the importance of textual information. In contrast, our DC-GCN simultaneously captures visual relations in an image and textual relations in a question.

Figure 2: Illustration of our proposed Dual Channel Graph Convolutional Network (DC-GCN) for the VQA task. The dependency parsing constructs the semantic relations between words in a question, and the Q-GCN module updates every word's features by aggregating the adjacent word features. In addition, the I-GCN module builds the relations between image objects, and the attention alignment module uses a question-guided image attention mechanism to learn a new object representation and thus align the images and questions. All punctuation and upper-case letters have been preprocessed. The numbers in red are the weight scores of image objects and words.

3 Model

3.1 Feature Extraction

Similar to (Anderson et al., 2018), we extract the image features by using a pretrained Faster R-CNN (Ren et al., 2015). We select µ object proposals for each image, where each object proposal is represented by a 2048-dimensional feature vector. The obtained visual region features are denoted as h_v = {h_{v_i}}_{i=0}^{µ} ∈ R^{µ×2048}.

To extract the question features, each word is embedded into a 300-dimensional GloVe vector (Pennington et al., 2014). The word embeddings are fed into an LSTM (Hochreiter and Schmidhuber, 1997) for encoding, which produces the initial question representation h_q = {h_{q_j}}_{j=0}^{λ} ∈ R^{λ×d_q}.
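The sketch below illustrates, under stated assumptions, how these two feature streams could be produced in PyTorch. The QuestionEncoder class, the vocabulary size, and the random stand-in for the detector output are illustrative only; the paper relies on the bottom-up-attention Faster R-CNN features of Anderson et al. (2018), which are not reproduced here.

import torch
import torch.nn as nn

class QuestionEncoder(nn.Module):
    """Embeds words (GloVe vectors would be copied into self.embed.weight
    in practice) and runs an LSTM, as described in Sec. 3.1."""
    def __init__(self, vocab_size, emb_dim=300, hidden_dim=512, max_len=14):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True)

    def forward(self, tokens):            # tokens: (batch, max_len) word indices
        emb = self.embed(tokens)          # (batch, max_len, 300)
        h_q, _ = self.lstm(emb)           # (batch, max_len, hidden_dim) = initial H_q
        return h_q

# Visual side: the paper takes mu in [10, 100] region features of size 2048
# from a pre-trained Faster R-CNN; here they are just a placeholder tensor.
mu, batch = 36, 2
h_v = torch.randn(batch, mu, 2048)        # stand-in for detector output
h_q = QuestionEncoder(vocab_size=20000)(torch.randint(1, 20000, (batch, 14)))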
3.2 Relation Extraction and Encoding

3.2.1 I-GCN Module

Image Fully-connected Relations Graph. By treating each object region in an image as a vertex, we can construct a fully-connected undirected graph, as shown in Figure 3(b). Each edge represents a relation between two object regions.

Pruned Image Graph with Spatial Relations. Spatial relations represent an object's position in an image, which corresponds to a 4-dimensional spatial coordinate [x1, y1, x2, y2]. Note that (x1, y1) is the coordinate of the top-left point of the bounding box and (x2, y2) is the coordinate of the bottom-right point of the bounding box.

Identifying the correlation between objects is a key step. We calculate the correlation between objects by using spatial relations. The steps are as follows: (1) The features of the two nodes are fed into a multi-layer perceptron respectively, and the corresponding elements are then multiplied to obtain a relatedness score. (2) The intersection over union of the two object regions is calculated. According to the overlapping part of the two object regions, spatial relations are classified into 11 different categories, such as inside, cover, and overlap (Yao et al., 2018). Following (Yao et al., 2018), we utilize the overlapping region between two object regions to judge whether there is an edge between the two regions. If two object regions have a large overlapping part, there is a strong correlation between these two objects. If two object regions do not have any overlapping part, we consider the two objects to have a weak correlation, which means there is no edge connecting these two nodes. According to the spatial relations, we prune some irrelevant relations between objects and obtain a sparse graph, as shown in Figure 3(c); a minimal code sketch of this pruning step is given after Figure 3 below.

Figure 3: (a) Region proposals generated by a pretrained model (Anderson et al., 2018). For display purposes, we only highlight some object regions. (b) Construct the relations between objects. (c) Prune the irrelevant object edges and calculate the weight between objects. The numbers in red are the weights of edges.
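Below is a minimal PyTorch sketch of how the pruned adjacency matrix A could be derived from bounding boxes via intersection over union. The function name and the iou_threshold value are assumptions for illustration; the 11-category spatial classification of Yao et al. (2018) used in the paper is not reproduced here, only the overlap test.

import torch

def spatial_adjacency(boxes, iou_threshold=0.1):
    """Builds a pruned adjacency matrix A from [x1, y1, x2, y2] boxes.
    Regions with no overlap are treated as weakly correlated and left
    unconnected; the threshold is an assumed hyperparameter, not a value
    reported in the paper."""
    x1 = torch.max(boxes[:, None, 0], boxes[None, :, 0])
    y1 = torch.max(boxes[:, None, 1], boxes[None, :, 1])
    x2 = torch.min(boxes[:, None, 2], boxes[None, :, 2])
    y2 = torch.min(boxes[:, None, 3], boxes[None, :, 3])
    inter = (x2 - x1).clamp(min=0) * (y2 - y1).clamp(min=0)
    area = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    union = area[:, None] + area[None, :] - inter
    iou = inter / union.clamp(min=1e-6)
    adj = (iou > iou_threshold).float()
    adj.fill_diagonal_(0)                 # no self-loop edges in A
    return adj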
Image Graph Convolutions. Following the previous studies (Li et al., 2019; Zhang et al., 2018b; Yang et al., 2018), we use a GCN to update the representations of objects. Given a graph with µ nodes, each object region in an image is a node. We represent the graph structure with a µ × µ adjacency matrix A, where A_{ij} = 1 if there is an overlapping region between node i and node j, and A_{ij} = 0 otherwise.

Given a target node i and a neighboring node j ∈ N(i) in an image, where N(i) is the set of nodes neighboring node i, the representations of node i and node j are h_{v_i} and h_{v_j}, respectively. To obtain the correlation score s_{ij} between nodes i and j, we learn a fully connected layer over the concatenated node features h_{v_i} and h_{v_j}:

    s_{ij} = w_a^{\top} \sigma(W_a [h_{v_i}^{(l)}, h_{v_j}^{(l)}]),    (1)

where w_a and W_a are learned parameters, σ is the non-linear activation function, and [h_{v_i}^{(l)}, h_{v_j}^{(l)}] denotes the concatenation operation. We apply a softmax function over the correlation scores s_{ij} to obtain the weight α_{ij}, as shown in Figure 3(c), where the numbers in red represent the weight scores:

    \alpha_{ij} = \frac{\exp(s_{ij})}{\sum_{j \in N(i)} \exp(s_{ij})}.    (2)

The l-th layer representations of the neighboring nodes h_{v_j}^{(l)} are first transformed via a learned linear transformation W_b. Those transformed representations are then gathered with weight α_{ij}, followed by a non-linear function σ. This layer-wise propagation can be denoted as:

    h_{v_i}^{(l+1)} = \sigma\Big(h_{v_i}^{(l)} + \sum_{j \in N(i)} A_{ij} \alpha_{ij} W_b h_{v_j}^{(l)}\Big).    (3)

Following the stacked L-layer GCN, the output of the I-GCN module H_v can be denoted as:

    H_v = h_{v_i}^{(l+1)} \quad (l < L).    (4)
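A minimal PyTorch sketch of one such layer, following Equations (1)-(3), is shown below. The class name, the scorer depth, and the choice of ReLU for σ are assumptions, since the paper does not specify them; the same layer can be reused for the Q-GCN of Section 3.2.2 by feeding it the question adjacency matrix B instead of A.

import torch
import torch.nn as nn

class GCNRelationLayer(nn.Module):
    """One graph-convolution layer in the spirit of Eqs. (1)-(3): score every
    edge with an MLP over concatenated node features, normalise the scores
    with a masked softmax, then aggregate the transformed neighbours."""
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, 1))
        self.transform = nn.Linear(dim, dim, bias=False)     # plays the role of W_b

    def forward(self, h, adj):
        # h: (batch, n, dim) node features, adj: (batch, n, n) 0/1 adjacency
        n = h.size(1)
        pairs = torch.cat([h.unsqueeze(2).expand(-1, -1, n, -1),
                           h.unsqueeze(1).expand(-1, n, -1, -1)], dim=-1)
        s = self.score(pairs).squeeze(-1)                    # Eq. (1): (batch, n, n)
        s = s.masked_fill(adj == 0, float('-inf'))           # keep only pruned edges
        alpha = torch.softmax(s, dim=-1)                     # Eq. (2)
        alpha = torch.nan_to_num(alpha)                      # rows with no neighbours
        msg = torch.bmm(adj * alpha, self.transform(h))      # gather W_b h_j with weights
        return torch.relu(h + msg)                           # Eq. (3), sigma = ReLU here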
3.2.2 Q-GCN Module

In practice, we observe that two words in a sentence usually hold certain relations. Such relations can be identified by the universal Stanford Dependencies (De Marneffe et al., 2014). In Table 1, we list some commonly-used dependency relations. For example, the sentence what color is the umpire's shirt is parsed to obtain the relations between words (e.g., cop, det and nmod), as shown in Figure 4. The words in blue are the dependency relations. The end of an arrow indicates that the word is a modifier. The word root in purple indicates which word is the root node of the dependency relations.

    Relation    Description
    det         determiner
    nsubj       nominal subject
    case        prepositions, postpositions
    nmod        nominal modifier
    cop         copula
    dobj        direct object
    amod        adjective modifier
    aux         auxiliary
    advmod      adverbial modifier
    compound    compound
    dep         dependent
    acl         clausal modifier of noun
    nsubjpass   passive nominal subject
    auxpass     passive auxiliary
    root        root node

Table 1: The main categories of relations classified by the dependency parsing tool (De Marneffe et al., 2014).

Figure 4: The question is processed by syntactic dependency parsing. The word is is the root node of the dependency relations, while the words in blue (e.g., det, dobj) are dependency relations. The direction of an arrow indicates that a relation exists between the two words.

Question Fully-connected Relations Graph. By treating each word in a question as a node, we construct a fully-connected undirected graph, as shown in Figure 5(a). Each edge represents a relation between two words.

Pruned Question Graph with Dependency Relations. Irrelevant relations between two words may bring noise. Therefore, we need to prune some unrelated relations to reduce the noise. By parsing the dependency relations of a question, we obtain the relations between words (cf. Figure 4). According to the dependency relations, we prune the edges between two nodes that do not have a dependency relation. A sparse graph is obtained, as shown in Figure 5(b).

Figure 5: (a) A fully-connected graph network is built where each word is a node and each word may have relations with other words. (b) The Stanford syntactic parsing tool (De Marneffe et al., 2014) is used to obtain the dependency relations between words. According to these relations, we can prune the unrelated edges and obtain a sparse graph. (c) The numbers in red are the weight scores. For the node umpire's, the weight of the word the is 0.1 while the weight of the word shirt is 0.9. The weight scores reflect the importance of words. The phrase umpire's shirt describes an object, thus the word shirt is more important than the word the.
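The sketch below shows, under stated assumptions, how such a pruned question graph could be turned into the λ × λ adjacency matrix B used by the Q-GCN. spaCy is used here only as a stand-in parser; the paper relies on the universal Stanford Dependencies tool (De Marneffe et al., 2014), and the function name and padding length are illustrative.

import numpy as np
import spacy   # stand-in parser; the paper uses the Stanford Dependencies tool

nlp = spacy.load("en_core_web_sm")

def dependency_adjacency(question, max_len=14):
    """Builds the lambda x lambda matrix B of Sec. 3.2.2: B[i, j] = 1 only when a
    dependency arc links word i and word j (treated as undirected), else 0."""
    doc = nlp(question)
    adj = np.zeros((max_len, max_len), dtype=np.float32)
    for tok in doc:
        i, j = tok.i, tok.head.i
        if i == j or i >= max_len or j >= max_len:
            continue                        # the root points to itself; ignore overflow
        adj[i, j] = adj[j, i] = 1.0         # undirected edge for each dependency relation
    return adj

print(dependency_adjacency("what color is the umpire's shirt"))

The resulting matrix B can then be fed, together with the word features, to the same graph-convolution layer sketched for the I-GCN above.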
Question Graph Convolutions. Following the previous works (Li et al., 2019; Zhang et al., 2018b; Yang et al., 2018), we use a GCN to update the node representations of words. Given a graph with λ nodes, each word in a question is a node. We represent the graph structure with a λ × λ adjacency matrix B, where B_{ij} = 1 if there is a dependency relation between node i and node j, and B_{ij} = 0 otherwise.

Given a target node i and a neighboring node j ∈ Ω(i) in a question, where Ω(i) is the set of nodes neighboring node i, the representations of node i and j are h_{q_i} and h_{q_j}, respectively. To obtain the correlation score t_{ij} between nodes i and j, we learn a fully connected layer over the concatenated node features h_{q_i} and h_{q_j}:

    t_{ij} = w_c^{\top} \sigma(W_c [h_{q_i}^{(l)}, h_{q_j}^{(l)}]),    (5)

where w_c and W_c are learned parameters, σ is the non-linear activation function, and [h_{q_i}^{(l)}, h_{q_j}^{(l)}] denotes the concatenation operation. We apply a softmax function over the correlation scores t_{ij} to obtain the weight β_{ij}:

    \beta_{ij} = \frac{\exp(t_{ij})}{\sum_{j \in \Omega(i)} \exp(t_{ij})}.    (6)

As shown in Figure 5(c), the numbers in red are the weight scores. The l-th layer representations of the neighboring nodes h_{q_j}^{(l)} are first transformed via a learned linear transformation W_d. Those transformed representations are gathered with weight β_{ij}, followed by a non-linear function σ. This layer-wise propagation can be denoted as:

    h_{q_i}^{(l+1)} = \sigma\Big(h_{q_i}^{(l)} + \sum_{j \in \Omega(i)} B_{ij} \beta_{ij} W_d h_{q_j}^{(l)}\Big).    (7)

Following the stacked L-layer GCN, the output of the Q-GCN module H_q is denoted as:

    H_q = h_{q_i}^{(l+1)} \quad (l < L).    (8)

3.3 Attention Alignment Module

Based on the previous works (Gao et al., 2019; Yu et al., 2019), we use a self-attention mechanism (Vaswani et al., 2017) to enhance the correlation between words in a question and the correlation between objects in an image, respectively.

To enhance the correlation between words and highlight the important words, we utilize the self-attention mechanism to update the question representation H_q. The updated question representation \tilde{H}_q is obtained as follows:

    \tilde{H}_q = \mathrm{softmax}\Big(\frac{H_q H_q^{\top}}{\sqrt{d_q}}\Big) H_q,    (9)

where H_q^{\top} is the transpose of H_q and d_q is the dimension of H_q. The number of levels of this self-attention is set to 4.

To obtain the image representation related to the question representation, we align the image representation H_v by utilizing the question representation \tilde{H}_q as the guiding vector. The similarity score r between H_v and \tilde{H}_q is calculated as follows:

    r = \frac{\tilde{H}_q H_v^{\top}}{\sqrt{d_v}},    (10)

where H_v^{\top} is the transpose of H_v and d_v is the dimension of H_v. A softmax function is used to normalize the score r to obtain the weight score \tilde{r}:

    \tilde{r} = [\tilde{r}_1, \cdots, \tilde{r}_i] = \frac{\exp(r_i)}{\sum_{j \in \mu} \exp(r_j)},    (11)
where µ is the number of image regions. By multiplying the weight \tilde{r} with the image representation H_v, the updated image representation \tilde{H}_v is obtained:

    \tilde{H}_v = \tilde{r} \cdot H_v.    (12)

The number of levels of this question-guided image attention is set to 4. The final outputs of the attention alignment module are \tilde{H}_q and \tilde{H}_v.
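The following PyTorch sketch puts Equations (9)-(12) together in a single module; it would be stacked four times as described above. The tensor shapes follow one plausible reading of the equations, and the class name and the absence of learned projections are assumptions for illustration, not the authors' implementation.

import torch
import torch.nn as nn

class AttentionAlignment(nn.Module):
    """Sketch of the alignment step in Sec. 3.3: question self-attention (Eq. 9)
    followed by question-guided attention over image regions (Eqs. 10-12)."""
    def forward(self, h_q, h_v):
        d_q = h_q.size(-1)
        d_v = h_v.size(-1)
        # Eq. (9): self-attention over the question representation
        att_q = torch.softmax(h_q @ h_q.transpose(-2, -1) / d_q ** 0.5, dim=-1)
        h_q_tilde = att_q @ h_q
        # Eqs. (10)-(11): similarity between question and image regions, normalised
        r = h_q_tilde @ h_v.transpose(-2, -1) / d_v ** 0.5
        r_tilde = torch.softmax(r, dim=-1)
        # Eq. (12): reweight the image representation with the question-guided scores
        h_v_tilde = r_tilde @ h_v
        return h_q_tilde, h_v_tilde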
3.4 Answer Prediction

We apply a linear multimodal fusion method to fuse the two representations \tilde{H}_q and \tilde{H}_v as follows:

    H_r = W_v^{\top} \tilde{H}_v + W_q^{\top} \tilde{H}_q,    (13)

    pred = \mathrm{softmax}(W_e H_r + b_e),    (14)

where W_v, W_q, W_e, and b_e are learned parameters, and pred denotes the probabilities of the classified answers over the answer vocabulary, which contains M candidate answers. Following (Yu et al., 2019), we use a binary cross-entropy loss function to train the answer classifier.
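A minimal sketch of this prediction head is given below. The pooling of the aligned representations to a single vector, the class name, and the use of BCEWithLogitsLoss on raw logits (rather than the softmax of Eq. 14) are assumptions that reflect common practice with soft VQA targets, not details stated in the paper.

import torch
import torch.nn as nn

class AnswerPredictor(nn.Module):
    """Sketch of Sec. 3.4: linear multimodal fusion (Eq. 13) followed by a
    classifier over the M = 3,129 candidate answers (Eq. 14)."""
    def __init__(self, dim=512, num_answers=3129):
        super().__init__()
        self.w_v = nn.Linear(dim, dim, bias=False)
        self.w_q = nn.Linear(dim, dim, bias=False)
        self.classifier = nn.Linear(dim, num_answers)

    def forward(self, h_v_tilde, h_q_tilde):
        # h_v_tilde, h_q_tilde: (batch, dim) pooled representations
        # (pooling over regions/words is an assumption here)
        h_r = self.w_v(h_v_tilde) + self.w_q(h_q_tilde)       # Eq. (13)
        return self.classifier(h_r)                           # answer logits

# Following Yu et al. (2019), training uses a binary cross-entropy objective
# against soft answer scores derived from the 10 human answers per question.
criterion = nn.BCEWithLogitsLoss()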
4 Experiments

4.1 Datasets

VQA-v2 (Goyal et al., 2017) is the most commonly used VQA benchmark dataset, which is split into train, val, and test-standard sets; 25% of the test-standard set serves as the test-dev set. Each question has 10 answers from different annotators, and the answer with the highest frequency is treated as the ground truth. All answer types can be divided into Yes/No, Number, and Other. VQA-CP-v2 (Agrawal et al., 2018) is a derivation of the VQA-v2 dataset, which is introduced to evaluate and reduce the question-oriented bias in VQA models. Due to the significant difference in distribution between the train set and the test set, the VQA-CP-v2 dataset is harder than the VQA-v2 dataset.
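Since each question comes with 10 annotator answers, training with the binary cross-entropy loss of Section 3.4 typically uses soft answer scores rather than a single one-hot label. The helper below follows the standard VQA accuracy convention, min(#matching answers / 3, 1); it is not described in the paper and is included only to make the target construction concrete.

from collections import Counter

def soft_answer_scores(annotator_answers, answer_vocab):
    """Maps the 10 human answers of a VQA question to soft targets in [0, 1]:
    an answer counts as fully correct once at least 3 annotators gave it."""
    counts = Counter(annotator_answers)
    scores = [0.0] * len(answer_vocab)
    for answer, idx in answer_vocab.items():
        if answer in counts:
            scores[idx] = min(counts[answer] / 3.0, 1.0)
    return scores

vocab = {"blue": 0, "black": 1, "red": 2}
print(soft_answer_scores(["blue"] * 8 + ["black"] * 2, vocab))  # [1.0, 0.666..., 0.0]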
4.2 Experimental Setup

We use the Adam optimizer (Kingma and Ba, 2014) with parameters α = 0.0001, β1 = 0.9, and β2 = 0.99. The size of the answer vocabulary is set to M = 3,129, as used in (Anderson et al., 2018). The base learning rate is set to 0.0001. After 15 epochs, the learning rate is decayed by 1/5 every 2 epochs. All models are trained for up to 20 epochs with the same batch size of 64 and hidden size of 512. Each image has µ ∈ [10, 100] object regions, and all questions are padded and truncated to the same length of 14, i.e., λ = 14. The numbers of stacked GCN layers L and of attention alignment levels are both 4.
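A minimal sketch of this optimization setup is given below. The placeholder model, the LambdaLR formulation, and the exact epochs at which the decay fires reflect one reading of the schedule described above, not the authors' released training code.

import torch

# Adam with lr = 1e-4, betas = (0.9, 0.99); after epoch 15 the learning rate
# is multiplied by 1/5 every 2 epochs, and training runs for 20 epochs.
model = torch.nn.Linear(512, 3129)            # placeholder for the full DC-GCN
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, betas=(0.9, 0.99))

def lr_lambda(epoch):
    return 1.0 if epoch < 15 else 0.2 ** ((epoch - 15) // 2 + 1)

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

for epoch in range(20):
    # ... one training epoch over VQA-v2 with batch size 64 would go here ...
    scheduler.step()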
4.3 Experimental Results

Table 2 shows the performance of our DC-GCN model and baseline models trained on the widely-used VQA-v2 dataset. All results in our paper are based on single-model performance. For a fair comparison, we also train our model with the extra Visual Genome dataset (Krishna et al., 2017).

    Model                                   Y/N     Num     Other   All (test-dev)   All (test-std)
    Bottom-Up (Anderson et al., 2018)       81.82   44.21   56.05   65.32            65.67
    DCN (Nguyen and Okatani, 2018)          83.51   46.61   57.26   66.87            66.97
    Counter (Zhang et al., 2018a)           83.14   51.62   58.97   68.09            68.41
    BAN (Kim et al., 2018)                  85.31   50.93   60.26   69.52            -
    DFAF (Gao et al., 2019)                 86.09   53.32   60.49   70.22            70.34
    Erase-Att (Liu et al., 2019)            85.87   50.28   61.10   70.07            70.36
    ReGAT (Li et al., 2019)                 86.08   54.42   60.33   70.27            70.58
    MCAN (Yu et al., 2019)                  86.82   53.26   60.72   70.63            70.90
    DC-GCN (ours)                           87.32   53.75   61.45   71.21            71.54

Table 2: Comparison with previous state-of-the-art methods on the VQA-v2 test dataset. "-" means the data is absent. Answer types consist of Yes/No, Num and Other categories; All means the total accuracy rate. All results in our paper are based on single-model performance.

Bottom-Up (Anderson et al., 2018) proposes to use features based on Faster R-CNN (Ren et al., 2015) instead of ResNet (He et al., 2016). The Dense Co-Attention Network (DCN) (Nguyen and Okatani, 2018) utilizes a dense stack of multiple co-attention layers. The Counter method (Zhang et al., 2018a) is good at counting questions by utilizing the information of bounding boxes. DFAF (Gao et al., 2019) dynamically fuses intra- and inter-modality information. ReGAT (Li et al., 2019) models semantic, spatial, and implicit relations via a graph attention network. MCAN (Yu et al., 2019) utilizes deep modular networks to learn the multimodal feature representations and is a state-of-the-art approach on the VQA-v2 dataset. As shown in Table 2, our model increases the overall accuracy over DFAF and MCAN by 1.2% and 0.6% on the test-std set, respectively. Although it still cannot achieve comparable performance in the Num category with respect to ReGAT (which is the best one in the counting sub-task), our DC-GCN outperforms it in the other categories (e.g., Y/N by 1.2%, Other by 1.1% and Overall by 0.9%). This shows that DC-GCN has relation-capturing ability in answering all kinds of questions by sufficiently exploring the semantics in both object appearances and object relations. In summary, our DC-GCN achieves outstanding performance on the VQA-v2 dataset.

To demonstrate the generalizability of our DC-GCN model, we also conduct experiments on the VQA-CP-v2 dataset. To overcome the language biases of the VQA-v2 dataset, the research work (Agrawal et al., 2018) designed the VQA-CP-v2 dataset and specifically proposed the GVQA model for reducing the influence of language biases. Table 3 shows the results on the VQA-CP-v2 test split. Murel (Cadene et al., 2019a) and ReGAT (Li et al., 2019), which build the relations between objects to perform reasoning and question answering, are the state-of-the-art models. Our DC-GCN model surpasses both Murel and ReGAT on VQA-CP-v2 (41.47 vs. 39.54 and 41.47 vs. 40.42), a performance gain of up to +1.05%. Although our proposed method is not designed for the VQA-CP-v2 dataset, our model has a slight advantage over the UpDn + Q-Adv + DoE model. The results on the VQA-CP-v2 dataset show that dependency parsing and the DC-GCN can effectively reduce question-based overfitting.

    Model                                              Acc. (%)
    RAMEN (Robik Shrestha, 2019)                       39.21
    BAN (Kim et al., 2018) *                           39.31
    Murel (Cadene et al., 2019a)                       39.54
    ReGAT-Sem (Li et al., 2019)                        39.54
    ReGAT-Imp (Li et al., 2019)                        39.58
    ReGAT-Spa (Li et al., 2019)                        40.30
    ReGAT (Li et al., 2019)                            40.42
    GVQA (Agrawal et al., 2018) #                      31.30
    UpDn (Anderson et al., 2018) **                    39.74
    UpDn + Q-Adv + DoE (Ramakrishnan et al., 2018) #   41.17
    DC-GCN (ours)                                      41.47

Table 3: Model accuracy on the VQA-CP-v2 benchmark (open-ended setting on the test split). The results of models with * and ** are obtained from (Robik Shrestha, 2019) and (Ramakrishnan et al., 2018), respectively. Models with # are designed for solving the language biases. The ReGAT model consists of Semantic (Sem), Implicit (Imp), and Spatial (Spa) relation encoders.

4.4 Qualitative Analysis

Figure 6: Visualizations of the learned attention maps of the Q-GCN module, I-GCN module and Attention Alignment module from some typical layers. We regard the correlation score between nodes as the attention score. Q-GCN(l) and I-GCN(l) denote the question GCN attention maps and image GCN attention maps from the l-th layer, respectively, as shown in (a), (b), (c) and (d); (e) and (f) show the question-guided image attention weights of the Attention Alignment module in the l-th layer. For the sake of presentation, we only consider 20 object regions in an image. The index within [1, 20] shown on the axes of the attention maps corresponds to each object in the image. For better visualization, we highlight in the image the objects which correspond to the 4-th, 6-th, 9-th, and 12-th objects, respectively.

In Figure 6, we visualize the learned attentions from the I-GCN module, Q-GCN module and Attention Alignment module.
Due to space limitations, we only show one example and visualize six attention maps from different attention units and different layers. From the results, we have the following observations.

Question GCN Module: The attention maps of Q-GCN(2) focus on the words color and shirt, as shown in Figure 6(a), while the attention maps of Q-GCN(4) correctly focus on the words color, umpire's, and shirt, as shown in Figure 6(b). These words have larger weights than the others; that is to say, the keywords color, umpire's and shirt are identified correctly.

Image GCN Module: For the sake of presentation, we only consider 20 object regions in an image. The index within [1, 20] shown on the axes of the attention maps corresponds to each object in the image. Among these indexes, indexes 4, 6, 9, and 12 are the most relevant ones for the question. Compared with I-GCN(2), which focuses on the 4-th, 6-th, 9-th, 12-th, and 14-th objects (cf. Figure 6(c)), I-GCN(4) focuses more on the 4-th, 6-th, and 12-th objects, where the 4-th object has a larger weight than the 6-th and 12-th objects, as shown in Figure 6(d). The 4-th object region is the ground-truth region, while the 6-th, 9-th, and 12-th object regions are the most relevant ones.

Attention Alignment Module: Given a specific question, a model needs to align image objects guided by the question to update the representations of objects. As shown in Figure 6(e), the focus regions are rather scattered, where the key regions are mainly the 4-th, 9-th and 12-th object regions. Through the guidance of the identified words color, umpire's and shirt, the DC-GCN model gradually pays more attention to the 4-th, 9-th, and 12-th object regions rather than other irrelevant object regions, as shown in Figure 6(f). This alignment process demonstrates that our model can capture the relations among multiple similar objects.

We also visualize some negative examples predicted by our DC-GCN model, as shown in Figure 7. They can be classified into three categories: (1) limitation of object detection; (2) text semantic understanding in scenarios; (3) subjective judgment.

Figure 7: We summarize three types of incorrect examples: limitation of object detection, text semantic understanding, and subjective judgment, which correspond to (a), (b), and (c), respectively.

In Figure 7(a), although the question how many sheep are pictured is not so difficult, the image content is really confusing. Without careful observation, it is rather easy to obtain the wrong answer 2 instead of 3. The reasons for this error include object occlusion, varying distances of objects, and the limitation of object detection. The image feature extractor is based on the Faster R-CNN model (Ren et al., 2015), so the accuracy of object detection can indirectly affect the accuracy of feature extraction. The counting subtask in VQA still has large room for improvement.

In Figure 7(b), the question what time should you pay can only be answered by understanding the text in the image. Text semantic understanding belongs to another task, namely text visual question answering (Biten et al., 2019), which requires recognizing the numbers, symbols and proper nouns in a scene. In Figure 7(c), subjective judgment is needed to answer the question is this man happy. Making this judgment requires some common-sense knowledge and real-life experience: someone is holding a banana against the man as if holding a gun towards him, so he is unhappy. Our model cannot make such an analysis like a human being to form a subjective judgment and predict the correct answer yes.
Finally, to understand the distribution of the three error types, we randomly pick 100 samples from the dev set of VQA-v2. The numbers of the three error types (i.e., overlapping objects, text semantic understanding, and subjective judgment) are 3, 3, and 29, respectively. The predicted answers of the first two question types are all incorrect. The last type has 12 incorrect answers, which means the error rate of this question type is 41.4%. These observations are helpful for making further improvements in the future.

4.5 Ablation Study

We perform extensive ablation studies on the VQA-v2 validation dataset (cf. Table 4). The experimental results are based on one block of our DC-GCN model. All modules inside DC-GCN have the same dimension of 512. The learning rate is 0.0001 and the batch size is 32.

    Component              Setting                              Acc. (%)
    Bottom-Up              Bottom-Up (Anderson et al., 2018)    63.15
    Default                DC-GCN                               66.57
    GCN Types              DC-GCN                               66.57
                           w/o I-GCN                            65.52
                           w/o Q-GCN                            66.15
    Dependency relations   - det                                66.50
                           - case                               66.42
                           - cop                                66.01
                           - aux                                66.48
                           - advmod                             66.53
                           - compound                           66.35
                           - det case                           65.23
                           - det case cop                       64.11

Table 4: Ablation studies of our proposed model on the VQA-v2 validation dataset. The experimental results are based on one block of our DC-GCN model. w/o means removing a certain module from the DC-GCN model. The detailed descriptions of the dependency relations are shown in Table 1.

Firstly, we investigate the influence of the GCN types. There are two GCN types: I-GCN and Q-GCN, as shown in Table 4. When removing the I-GCN, the performance of our model decreases from 66.57% to 65.52% (p-value = 3.22E-08 < 0.05). When removing the Q-GCN, the performance of our model slightly decreases from 66.57% to 66.15% (p-value = 2.04E-07 < 0.05). We consider that there are two reasons. One is that the image content is more complex than the question content and hence has richer semantic information; building the relations between objects helps clarify what the image represents and helps align it with the question representations. The other is that questions are short and contain less information (e.g., what animal is this? and what color is the man's shirt?).

Then, we perform an ablation study on the influence of the dependency relations (cf. Table 1). Relations like nsubj, nmod, dobj and amod are crucial to the semantic representations, therefore we do not remove them from the sentence. As shown in Table 4, removing relations like det, case, aux and advmod individually has a trivial influence on the semantic representations of the question. However, the accuracy decreases significantly when we simultaneously remove the relations det, case and cop. The reason may be that the sentence loses too much information and can no longer fully express the meaning of the original sentence. For example, consider the two phrases on the table and under the table. If we remove the relation case, which means that the words on and under are removed, then it will be hard to distinguish whether it is on the table or under the table.

5 Conclusion

In this paper, we propose a dual channel graph convolutional network to explore the relations between objects in an image and the syntactic dependency relations between words in a question. Furthermore, we explicitly construct the relations between words with a dependency tree and align the image and question representations with an attention alignment module to reduce the gaps between vision and language. Extensive experiments on the VQA-v2 and VQA-CP-v2 datasets demonstrate that our model achieves comparable performance with the state-of-the-art approaches. We will explore more complicated object relation modeling in future work.

Acknowledgements

We thank the anonymous reviewers for valuable comments and thoughtful suggestions. We would also like to thank Professor Yuzhang Lin from the University of Massachusetts Lowell for helpful discussions.

This work was supported by the Fundamental Research Funds for the Central Universities, SCUT (No. 2017ZD048, D2182480), the Science and Technology Planning Project of Guangdong Province (No. 2017B050506004), the Science and Technology Programs of Guangzhou (No. 201704030076, 201802010027, 201902010046), the collaborative research grants from the Guangxi Natural Science Foundation (2017GXNSFAA198225) and the Hong Kong Research Grants Council (project no. PolyU 1121417 and project no. C1031-18G), and an internal research grant from the Hong Kong Polytechnic University (project 1.9B0V).
References

Aishwarya Agrawal, Dhruv Batra, Devi Parikh, and Aniruddha Kembhavi. 2018. Don't just assume; look and answer: Overcoming priors for visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4971–4980.

Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, and Lei Zhang. 2018. Bottom-up and top-down attention for image captioning and visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6077–6086.

Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, and Devi Parikh. 2015. VQA: Visual question answering. In Proceedings of the IEEE International Conference on Computer Vision, pages 2425–2433.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.

Ali Furkan Biten, Ruben Tito, Andres Mafla, Lluis Gomez, Marçal Rusiñol, Ernest Valveny, C. V. Jawahar, and Dimosthenis Karatzas. 2019. Scene text visual question answering. CoRR abs/1905.13648.

Remi Cadene, Hedi Ben-Younes, Matthieu Cord, and Nicolas Thome. 2019a. MUREL: Multimodal relational reasoning for visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1989–1998.

Remi Cadene, Corentin Dancette, Hedi Ben-Younes, Matthieu Cord, and Devi Parikh. 2019b. RUBi: Reducing unimodal biases for visual question answering. In Advances in Neural Information Processing Systems, pages 841–852.

Qingxing Cao, Xiaodan Liang, Bailing Li, Guanbin Li, and Liang Lin. 2018. Visual question reasoning on general dependency tree. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7249–7257.

Marie-Catherine De Marneffe, Timothy Dozat, Natalia Silveira, Katri Haverinen, Filip Ginter, Joakim Nivre, and Christopher D. Manning. 2014. Universal Stanford dependencies: A cross-linguistic typology. In LREC, volume 14, pages 4585–4592.

Peng Gao, Zhengkai Jiang, Haoxuan You, Pan Lu, Steven C. H. Hoi, Xiaogang Wang, and Hongsheng Li. 2019. Dynamic fusion with intra- and inter-modality attention flow for visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6639–6648.

Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. 2017. Making the V in VQA matter: Elevating the role of image understanding in visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6904–6913.

Zhijiang Guo, Yan Zhang, and Wei Lu. 2019. Attention guided graph convolutional networks for relation extraction. In Proceedings of the 57th Conference of the Association for Computational Linguistics (ACL), pages 241–251.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

Jin-Hwa Kim, Jaehyun Jun, and Byoung-Tak Zhang. 2018. Bilinear attention networks. In Advances in Neural Information Processing Systems, pages 1564–1574.

Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A. Shamma, et al. 2017. Visual Genome: Connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision, 123(1):32–73.

Linjie Li, Zhe Gan, Yu Cheng, and Jingjing Liu. 2019. Relation-aware graph attention network for visual question answering. In Proceedings of the IEEE International Conference on Computer Vision, pages 10313–10322.

Xihui Liu, Zihao Wang, Jing Shao, Xiaogang Wang, and Hongsheng Li. 2019. Improving referring expression grounding with cross-modal attention-guided erasing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1950–1959.

Jiasen Lu, Jianwei Yang, Dhruv Batra, and Devi Parikh. 2016. Hierarchical question-image co-attention for visual question answering. In Advances in Neural Information Processing Systems, pages 289–297.

Duy-Kien Nguyen and Takayuki Okatani. 2018. Improved fusion of visual and language representations by dense symmetric co-attention for visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6087–6096.

Marco Pedersoli, Thomas Lucas, Cordelia Schmid, and Jakob Verbeek. 2017. Areas of attention for image captioning. In Proceedings of the IEEE International Conference on Computer Vision, pages 1242–1250.
Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.

Sainandan Ramakrishnan, Aishwarya Agrawal, and Stefan Lee. 2018. Overcoming language priors in visual question answering with adversarial regularization. In Advances in Neural Information Processing Systems, pages 1541–1551.

Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. 2015. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, pages 91–99.

Robik Shrestha, Kushal Kafle, and Christopher Kanan. 2019. Answer them all! Toward universal visual question answering models. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 10472–10481.

Kevin J. Shih, Saurabh Singh, and Derek Hoiem. 2016. Where to look: Focus regions for visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4613–4621.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.

Danfei Xu, Yuke Zhu, Christopher B. Choy, and Li Fei-Fei. 2017. Scene graph generation by iterative message passing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5410–5419.

Jianwei Yang, Jiasen Lu, Stefan Lee, Dhruv Batra, and Devi Parikh. 2018. Graph R-CNN for scene graph generation. In Proceedings of the European Conference on Computer Vision (ECCV), pages 670–685.

Zichao Yang, Xiaodong He, Jianfeng Gao, Li Deng, and Alex Smola. 2016. Stacked attention networks for image question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 21–29.

Liang Yao, Chengsheng Mao, and Yuan Luo. 2019. Graph convolutional networks for text classification. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 7370–7377.

Ting Yao, Yingwei Pan, Yehao Li, and Tao Mei. 2018. Exploring visual relationship for image captioning. In Proceedings of the European Conference on Computer Vision (ECCV), pages 684–699.

Zhou Yu, Jun Yu, Yuhao Cui, Dacheng Tao, and Qi Tian. 2019. Deep modular co-attention networks for visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6281–6290.

Zhou Yu, Jun Yu, Jianping Fan, and Dacheng Tao. 2017. Multi-modal factorized bilinear pooling with co-attention learning for visual question answering. In Proceedings of the IEEE International Conference on Computer Vision, pages 1821–1830.

Zhou Yu, Jun Yu, Chenchao Xiang, Jianping Fan, and Dacheng Tao. 2018. Beyond bilinear: Generalized multimodal factorized high-order pooling for visual question answering. IEEE Transactions on Neural Networks and Learning Systems, 29(12):5947–5959.

Yan Zhang, Jonathon Hare, and Adam Prügel-Bennett. 2018a. Learning to count objects in natural images for visual question answering. In International Conference on Learning Representations (ICLR).

Yuhao Zhang, Peng Qi, and Christopher D. Manning. 2018b. Graph convolution over pruned dependency trees improves relation extraction. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 2205–2215.