Aligned Dual Channel Graph Convolutional Network for Visual Question Answering

Qingbao Huang1,2, Jielong Wei2, Yi Cai1*, Changmeng Zheng1, Junying Chen1, Ho-fung Leung3, Qing Li4
1 School of Software Engineering, South China University of Technology, Guangzhou, China
2 School of Electrical Engineering, Guangxi University, Nanning, Guangxi, China
3 The Chinese University of Hong Kong, Hong Kong SAR, China
4 The Hong Kong Polytechnic University, Hung Hom, Kowloon, Hong Kong, China
[email protected], [email protected], [email protected]
* Corresponding author: Yi Cai ([email protected]).

Abstract

Visual question answering aims to answer a natural language question about a given image. Existing graph-based methods focus only on the relations between objects in an image and neglect the importance of the syntactic dependency relations between words in a question. To simultaneously capture the relations between objects in an image and the syntactic dependency relations between words in a question, we propose a novel dual channel graph convolutional network (DC-GCN) that better combines the advantages of the visual and textual modalities. The DC-GCN model consists of three parts: an I-GCN module to capture the relations between objects in an image, a Q-GCN module to capture the syntactic dependency relations between words in a question, and an attention alignment module to align the image and question representations. Experimental results show that our model achieves comparable performance with the state-of-the-art approaches.

Figure 1: (a) The question and the ground-truth answer. (b) The wrong answer predicted by a state-of-the-art model, which focuses on the highlighted region in the image. The depth of the color indicates the weights of the words in the question, where a deeper color represents a higher weight. The question is processed by syntactic dependency parsing. (c) The dependency parse of the question, obtained with the universal Stanford Dependencies tool (De Marneffe et al., 2014).
1 Introduction

As a form of visual Turing test, visual question answering (VQA) has drawn much attention. The goal of VQA (Antol et al., 2015; Goyal et al., 2017) is to answer a natural language question related to the contents of a given image. Attention mechanisms serve as the backbone of the previous mainstream approaches (Lu et al., 2016; Yang et al., 2016; Yu et al., 2017); however, they tend to catch only the most discriminative information and ignore other rich complementary clues (Liu et al., 2019).

Recent VQA studies have been exploring higher-level semantic representations of images, notably graph-based structures for better image understanding, such as scene graph generation (Xu et al., 2017; Yang et al., 2018), visual relationship detection (Yao et al., 2018), object counting (Zhang et al., 2018a), and relation reasoning (Cao et al., 2018; Li et al., 2019; Cadene et al., 2019a). Representing images as graphs allows one to explicitly model the interactions between two objects in an image, so as to seamlessly transfer information between graph nodes (e.g., objects in an image).

Very recent methods (Li et al., 2019; Cadene et al., 2019a; Yu et al., 2019) have achieved remarkable performance, but there is still a big gap between them and humans. As shown in Figure 1(a), given an image of a group of persons and the corresponding question, a VQA system needs to not only recognize the objects in the image (e.g., batter, umpire and catcher), but also grasp the textual information in the question "what color is the umpire's shirt". However, even many competitive VQA models, including the state-of-the-art methods, struggle to process such cases accurately and as a result predict the incorrect answer (black) rather than the correct answer (blue).

Although the relations between two objects in an image have been considered, the attention-based VQA models lack building blocks to explicitly capture the syntactic dependency relations between words in a question. As shown in Figure 1(c), these dependency relations can reflect which object is being asked about (e.g., the word umpire's modifies the word shirt) and which aspect of the object is being asked about (e.g., the word color is the direct object of the word is). If a VQA model only knows the word shirt rather than the relation between the words umpire's and shirt in a question, it is difficult to distinguish which object is being asked about. In fact, we do need such modifier relations to discriminate the correct object from multiple similar objects. Therefore, we consider it necessary to explore the relations between words at the linguistic level in addition to constructing the relations between objects at the visual level.

Motivated by this, we propose a dual channel graph convolutional network (DC-GCN) to simultaneously capture the relations between objects in an image and the syntactic dependency relations between words in a question. Our proposed DC-GCN model consists of an Image-GCN (I-GCN) module, a Question-GCN (Q-GCN) module, and an attention alignment module. The I-GCN module captures the relations between objects in an image, the Q-GCN module captures the syntactic dependency relations between words in a question, and the attention alignment module is used to align the two representations of image and question. The contributions of this work are summarized as follows:

1) We propose a dual channel graph convolutional network (DC-GCN) to simultaneously capture the visual and textual relations, and design the attention alignment module to align the multimodal representations, thus reducing the semantic gaps between vision and language.

2) We explore how to construct the syntactic dependency relations between words at the linguistic level via graph convolutional networks, as well as the relations between objects at the visual level.

3) We conduct extensive experiments and ablation studies on the VQA-v2 and VQA-CP-v2 datasets to examine the effectiveness of our DC-GCN model. Experimental results show that the DC-GCN model achieves competitive performance with the state-of-the-art approaches.

2 Related Works

Visual Question Answering. Attention mechanisms have been proven effective on many tasks, such as machine translation (Bahdanau et al., 2014) and image captioning (Pedersoli et al., 2017). A number of methods have been developed so far, in which question-guided attention on image regions is commonly used. These can be categorized into two classes according to the types of employed image features. One class uses visual features from region proposals, which are generated by a Region Proposal Network (Ren et al., 2015). The other class uses convolutional features (i.e., activations of convolutional layers).

To learn a better representation of the question, the Stacked Attention Network (Yang et al., 2016), which can search question-related image regions, is designed by performing multi-step visual attention operations. A co-attention mechanism that jointly performs question-guided visual attention and image-guided question attention is proposed to solve the problems of which regions to look at and what words to listen to (Shih et al., 2016). To obtain more fine-grained interaction between image and question, some researchers introduce rather sophisticated fusion strategies. Bilinear pooling methods (Kim et al., 2018; Yu et al., 2017, 2018) are among the pioneering works that efficiently and expressively combine multimodal features by using an outer product of two vectors.

Recently, some researchers have devoted themselves to overcoming the priors in the VQA dataset and proposed methods such as GVQA (Agrawal et al., 2018), UpDn + Q-Adv + DoE (Ramakrishnan et al., 2018), and RUBi (Cadene et al., 2019b) to address the language biases on the VQA-CP-v2 dataset.

Graph Networks. Graph networks are powerful models that can perform relational inference through message passing. The core idea is to enable communication between image regions to build contextualized representations of these regions. Below we review some of the recent works that rely on graph networks and other contextualized representations for VQA.
Recent research works (Cadene et al., 2019a; Li et al., 2019) focus on how to deal with complex scenes and relation reasoning to obtain better image representations. Based on multimodal attentional networks, Cadene et al. (2019a) introduce an atomic reasoning primitive that represents interactions between the question and an image region by a rich vectorial representation and models region relations with pairwise combinations. GCNs, which can better explore the visual relations between objects and aggregate each node's own features with its neighbors' features, have been applied to various tasks, such as text classification (Yao et al., 2019), relation extraction (Guo et al., 2019; Zhang et al., 2018b), and scene graph generation (Yang et al., 2018; Yao et al., 2018). To answer complicated questions about an image, a relation-aware graph attention network (ReGAT) (Li et al., 2019) is proposed to encode each image into a graph and model multi-type inter-object relations, such as spatial, semantic and implicit relations, via a graph attention mechanism. One limitation of ReGAT (Li et al., 2019) lies in the fact that it solely considers the relations between objects in an image while neglecting the importance of textual information. In contrast, our DC-GCN simultaneously captures visual relations in an image and textual relations in a question.

Figure 2: Illustration of our proposed Dual Channel Graph Convolutional Network (DC-GCN) for the VQA task. The dependency parsing constructs the semantic relations between words in a question, and the Q-GCN module updates every word's features by aggregating the adjacent word features. In addition, the I-GCN module builds the relations between image objects, and the attention alignment module uses a question-guided image attention mechanism to learn a new object representation and thus align the images and questions. All punctuation and upper-case letters have been preprocessed. The numbers in red are the weight scores of image objects and words.

3 Model

3.1 Feature Extraction

Similar to (Anderson et al., 2018), we extract the image features by using a pretrained Faster R-CNN (Ren et al., 2015). We select µ object proposals for each image, where each object proposal is represented by a 2048-dimensional feature vector. The obtained visual region features are denoted as h_v = {h_{v_i}}_{i=0}^{µ} ∈ R^{µ×2048}.

To extract the question features, each word is embedded into a 300-dimensional GloVe vector (Pennington et al., 2014). The word embeddings are fed into an LSTM (Hochreiter and Schmidhuber, 1997) for encoding, which produces the initial question representation h_q = {h_{q_j}}_{j=0}^{λ} ∈ R^{λ×d_q}.
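The sketch below illustrates, under stated assumptions, how these two feature streams could be produced in PyTorch. The QuestionEncoder class, the vocabulary size, and the random stand-in for the detector output are illustrative only; the paper relies on the bottom-up-attention Faster R-CNN features of Anderson et al. (2018), which are not reproduced here.

import torch
import torch.nn as nn

class QuestionEncoder(nn.Module):
    """Embeds words (GloVe vectors would be copied into self.embed.weight
    in practice) and runs an LSTM, as described in Sec. 3.1."""
    def __init__(self, vocab_size, emb_dim=300, hidden_dim=512, max_len=14):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True)

    def forward(self, tokens):            # tokens: (batch, max_len) word indices
        emb = self.embed(tokens)          # (batch, max_len, 300)
        h_q, _ = self.lstm(emb)           # (batch, max_len, hidden_dim) = initial H_q
        return h_q

# Visual side: the paper takes mu in [10, 100] region features of size 2048
# from a pre-trained Faster R-CNN; here they are just a placeholder tensor.
mu, batch = 36, 2
h_v = torch.randn(batch, mu, 2048)        # stand-in for detector output
h_q = QuestionEncoder(vocab_size=20000)(torch.randint(1, 20000, (batch, 14)))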
3.2 Relation Extraction and Encoding

3.2.1 I-GCN Module

Image Fully-connected Relations Graph. By treating each object region in an image as a vertex, we can construct a fully-connected undirected graph, as shown in Figure 3(b). Each edge represents a relation between two object regions.

Pruned Image Graph with Spatial Relations. Spatial relations represent an object's position in an image, which corresponds to a 4-dimensional spatial coordinate [x1, y1, x2, y2]. Note that (x1, y1) is the coordinate of the top-left point of the bounding box and (x2, y2) is the coordinate of the bottom-right point of the bounding box.

Identifying the correlation between objects is a key step. We calculate the correlation between objects by using spatial relations. The steps are as follows: (1) The features of the two nodes are fed into a multi-layer perceptron respectively, and the corresponding elements are then multiplied to obtain a relatedness score. (2) The intersection over union of the two object regions is calculated. According to the overlapping part of the two object regions, spatial relations are classified into 11 different categories, such as inside, cover, and overlap (Yao et al., 2018). Following (Yao et al., 2018), we utilize the overlapping region between two object regions to judge whether there is an edge between the two regions. If two object regions have a large overlapping part, there is a strong correlation between these two objects. If two object regions do not have any overlapping part, we consider the two objects to have a weak correlation, which means there is no edge connecting these two nodes. According to the spatial relations, we prune some irrelevant relations between objects and obtain a sparse graph, as shown in Figure 3(c); a minimal code sketch of this pruning step is given after Figure 3 below.

Figure 3: (a) Region proposals generated by a pretrained model (Anderson et al., 2018). For display purposes, we only highlight some object regions. (b) Construct the relations between objects. (c) Prune the irrelevant object edges and calculate the weight between objects. The numbers in red are the weights of edges.
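Below is a minimal PyTorch sketch of how the pruned adjacency matrix A could be derived from bounding boxes via intersection over union. The function name and the iou_threshold value are assumptions for illustration; the 11-category spatial classification of Yao et al. (2018) used in the paper is not reproduced here, only the overlap test.

import torch

def spatial_adjacency(boxes, iou_threshold=0.1):
    """Builds a pruned adjacency matrix A from [x1, y1, x2, y2] boxes.
    Regions with no overlap are treated as weakly correlated and left
    unconnected; the threshold is an assumed hyperparameter, not a value
    reported in the paper."""
    x1 = torch.max(boxes[:, None, 0], boxes[None, :, 0])
    y1 = torch.max(boxes[:, None, 1], boxes[None, :, 1])
    x2 = torch.min(boxes[:, None, 2], boxes[None, :, 2])
    y2 = torch.min(boxes[:, None, 3], boxes[None, :, 3])
    inter = (x2 - x1).clamp(min=0) * (y2 - y1).clamp(min=0)
    area = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    union = area[:, None] + area[None, :] - inter
    iou = inter / union.clamp(min=1e-6)
    adj = (iou > iou_threshold).float()
    adj.fill_diagonal_(0)                 # no self-loop edges in A
    return adj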
Image Graph Convolutions. Following the previous studies (Li et al., 2019; Zhang et al., 2018b; Yang et al., 2018), we use a GCN to update the representations of objects. Given a graph with µ nodes, each object region in an image is a node. We represent the graph structure with a µ × µ adjacency matrix A, where A_{ij} = 1 if there is an overlapping region between node i and node j, and A_{ij} = 0 otherwise.

Given a target node i and a neighboring node j ∈ N(i) in an image, where N(i) is the set of nodes neighboring node i, the representations of node i and node j are h_{v_i} and h_{v_j}, respectively. To obtain the correlation score s_{ij} between nodes i and j, we learn a fully connected layer over the concatenated node features h_{v_i} and h_{v_j}:

    s_{ij} = w_a^{\top} \sigma(W_a [h_{v_i}^{(l)}, h_{v_j}^{(l)}]),    (1)

where w_a and W_a are learned parameters, σ is the non-linear activation function, and [h_{v_i}^{(l)}, h_{v_j}^{(l)}] denotes the concatenation operation. We apply a softmax function over the correlation scores s_{ij} to obtain the weight α_{ij}, as shown in Figure 3(c), where the numbers in red represent the weight scores:

    \alpha_{ij} = \frac{\exp(s_{ij})}{\sum_{j \in N(i)} \exp(s_{ij})}.    (2)

The l-th layer representations of the neighboring nodes h_{v_j}^{(l)} are first transformed via a learned linear transformation W_b. Those transformed representations are then gathered with weight α_{ij}, followed by a non-linear function σ. This layer-wise propagation can be denoted as:

    h_{v_i}^{(l+1)} = \sigma\Big(h_{v_i}^{(l)} + \sum_{j \in N(i)} A_{ij} \alpha_{ij} W_b h_{v_j}^{(l)}\Big).    (3)

Following the stacked L-layer GCN, the output of the I-GCN module H_v can be denoted as:

    H_v = h_{v_i}^{(l+1)} \quad (l < L).    (4)
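A minimal PyTorch sketch of one such layer, following Equations (1)-(3), is shown below. The class name, the scorer depth, and the choice of ReLU for σ are assumptions, since the paper does not specify them; the same layer can be reused for the Q-GCN of Section 3.2.2 by feeding it the question adjacency matrix B instead of A.

import torch
import torch.nn as nn

class GCNRelationLayer(nn.Module):
    """One graph-convolution layer in the spirit of Eqs. (1)-(3): score every
    edge with an MLP over concatenated node features, normalise the scores
    with a masked softmax, then aggregate the transformed neighbours."""
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, 1))
        self.transform = nn.Linear(dim, dim, bias=False)     # plays the role of W_b

    def forward(self, h, adj):
        # h: (batch, n, dim) node features, adj: (batch, n, n) 0/1 adjacency
        n = h.size(1)
        pairs = torch.cat([h.unsqueeze(2).expand(-1, -1, n, -1),
                           h.unsqueeze(1).expand(-1, n, -1, -1)], dim=-1)
        s = self.score(pairs).squeeze(-1)                    # Eq. (1): (batch, n, n)
        s = s.masked_fill(adj == 0, float('-inf'))           # keep only pruned edges
        alpha = torch.softmax(s, dim=-1)                     # Eq. (2)
        alpha = torch.nan_to_num(alpha)                      # rows with no neighbours
        msg = torch.bmm(adj * alpha, self.transform(h))      # gather W_b h_j with weights
        return torch.relu(h + msg)                           # Eq. (3), sigma = ReLU here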
3.2.2 Q-GCN Module

In practice, we observe that two words in a sentence usually hold certain relations. Such relations can be identified by the universal Stanford Dependencies (De Marneffe et al., 2014). In Table 1, we list some commonly-used dependency relations. For example, the sentence what color is the umpire's shirt is parsed to obtain the relations between words (e.g., cop, det and nmod), as shown in Figure 4. The words in blue are the dependency relations. The end of an arrow indicates that the word is a modifier. The word root in purple indicates which word is the root node of the dependency relations.

    Relation    Description
    det         determiner
    nsubj       nominal subject
    case        prepositions, postpositions
    nmod        nominal modifier
    cop         copula
    dobj        direct object
    amod        adjective modifier
    aux         auxiliary
    advmod      adverbial modifier
    compound    compound
    dep         dependent
    acl         clausal modifier of noun
    nsubjpass   passive nominal subject
    auxpass     passive auxiliary
    root        root node

Table 1: The main categories of relations classified by the dependency parsing tool (De Marneffe et al., 2014).

Figure 4: The question is processed by syntactic dependency parsing. The word is is the root node of the dependency relations, while the words in blue (e.g., det, dobj) are dependency relations. The direction of an arrow indicates that a relation exists between the two words.

Question Fully-connected Relations Graph. By treating each word in a question as a node, we construct a fully-connected undirected graph, as shown in Figure 5(a). Each edge represents a relation between two words.

Pruned Question Graph with Dependency Relations. Irrelevant relations between two words may bring noise. Therefore, we need to prune some unrelated relations to reduce the noise. By parsing the dependency relations of a question, we obtain the relations between words (cf. Figure 4). According to the dependency relations, we prune the edges between two nodes that do not have a dependency relation. A sparse graph is obtained, as shown in Figure 5(b).

Figure 5: (a) A fully-connected graph network is built where each word is a node and each word may have relations with other words. (b) The Stanford syntactic parsing tool (De Marneffe et al., 2014) is used to obtain the dependency relations between words. According to these relations, we can prune the unrelated edges and obtain a sparse graph. (c) The numbers in red are the weight scores. For the node umpire's, the weight of the word the is 0.1 while the weight of the word shirt is 0.9. The weight scores reflect the importance of words. The phrase umpire's shirt describes an object, thus the word shirt is more important than the word the.
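The sketch below shows, under stated assumptions, how such a pruned question graph could be turned into the λ × λ adjacency matrix B used by the Q-GCN. spaCy is used here only as a stand-in parser; the paper relies on the universal Stanford Dependencies tool (De Marneffe et al., 2014), and the function name and padding length are illustrative.

import numpy as np
import spacy   # stand-in parser; the paper uses the Stanford Dependencies tool

nlp = spacy.load("en_core_web_sm")

def dependency_adjacency(question, max_len=14):
    """Builds the lambda x lambda matrix B of Sec. 3.2.2: B[i, j] = 1 only when a
    dependency arc links word i and word j (treated as undirected), else 0."""
    doc = nlp(question)
    adj = np.zeros((max_len, max_len), dtype=np.float32)
    for tok in doc:
        i, j = tok.i, tok.head.i
        if i == j or i >= max_len or j >= max_len:
            continue                        # the root points to itself; ignore overflow
        adj[i, j] = adj[j, i] = 1.0         # undirected edge for each dependency relation
    return adj

print(dependency_adjacency("what color is the umpire's shirt"))

The resulting matrix B can then be fed, together with the word features, to the same graph-convolution layer sketched for the I-GCN above.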
Question Graph Convolutions. Following the previous works (Li et al., 2019; Zhang et al., 2018b; Yang et al., 2018), we use a GCN to update the node representations of words. Given a graph with λ nodes, each word in a question is a node. We represent the graph structure with a λ × λ adjacency matrix B, where B_{ij} = 1 if there is a dependency relation between node i and node j, and B_{ij} = 0 otherwise.

Given a target node i and a neighboring node j ∈ Ω(i) in a question, where Ω(i) is the set of nodes neighboring node i, the representations of node i and j are h_{q_i} and h_{q_j}, respectively. To obtain the correlation score t_{ij} between nodes i and j, we learn a fully connected layer over the concatenated node features h_{q_i} and h_{q_j}:

    t_{ij} = w_c^{\top} \sigma(W_c [h_{q_i}^{(l)}, h_{q_j}^{(l)}]),    (5)

where w_c and W_c are learned parameters, σ is the non-linear activation function, and [h_{q_i}^{(l)}, h_{q_j}^{(l)}] denotes the concatenation operation. We apply a softmax function over the correlation scores t_{ij} to obtain the weight β_{ij}:

    \beta_{ij} = \frac{\exp(t_{ij})}{\sum_{j \in \Omega(i)} \exp(t_{ij})}.    (6)

As shown in Figure 5(c), the numbers in red are the weight scores. The l-th layer representations of the neighboring nodes h_{q_j}^{(l)} are first transformed via a learned linear transformation W_d. Those transformed representations are gathered with weight β_{ij}, followed by a non-linear function σ. This layer-wise propagation can be denoted as:

    h_{q_i}^{(l+1)} = \sigma\Big(h_{q_i}^{(l)} + \sum_{j \in \Omega(i)} B_{ij} \beta_{ij} W_d h_{q_j}^{(l)}\Big).    (7)

Following the stacked L-layer GCN, the output of the Q-GCN module H_q is denoted as:

    H_q = h_{q_i}^{(l+1)} \quad (l < L).    (8)

3.3 Attention Alignment Module

Based on the previous works (Gao et al., 2019; Yu et al., 2019), we use a self-attention mechanism (Vaswani et al., 2017) to enhance the correlation between words in a question and the correlation between objects in an image, respectively.

To enhance the correlation between words and highlight the important words, we utilize the self-attention mechanism to update the question representation H_q. The updated question representation \tilde{H}_q is obtained as follows:

    \tilde{H}_q = \mathrm{softmax}\Big(\frac{H_q H_q^{\top}}{\sqrt{d_q}}\Big) H_q,    (9)

where H_q^{\top} is the transpose of H_q and d_q is the dimension of H_q. The number of levels of this self-attention is set to 4.

To obtain the image representation related to the question representation, we align the image representation H_v by utilizing the question representation \tilde{H}_q as the guiding vector. The similarity score r between H_v and \tilde{H}_q is calculated as follows:

    r = \frac{\tilde{H}_q H_v^{\top}}{\sqrt{d_v}},    (10)

where H_v^{\top} is the transpose of H_v and d_v is the dimension of H_v. A softmax function is used to normalize the score r to obtain the weight score \tilde{r}:

    \tilde{r} = [\tilde{r}_1, \cdots, \tilde{r}_i] = \frac{\exp(r_i)}{\sum_{j \in \mu} \exp(r_j)},    (11)
where µ is the number of image regions. By multiplying the weight \tilde{r} with the image representation H_v, the updated image representation \tilde{H}_v is obtained:

    \tilde{H}_v = \tilde{r} \cdot H_v.    (12)

The number of levels of this question-guided image attention is set to 4. The final outputs of the attention alignment module are \tilde{H}_q and \tilde{H}_v.
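The following PyTorch sketch puts Equations (9)-(12) together in a single module; it would be stacked four times as described above. The tensor shapes follow one plausible reading of the equations, and the class name and the absence of learned projections are assumptions for illustration, not the authors' implementation.

import torch
import torch.nn as nn

class AttentionAlignment(nn.Module):
    """Sketch of the alignment step in Sec. 3.3: question self-attention (Eq. 9)
    followed by question-guided attention over image regions (Eqs. 10-12)."""
    def forward(self, h_q, h_v):
        d_q = h_q.size(-1)
        d_v = h_v.size(-1)
        # Eq. (9): self-attention over the question representation
        att_q = torch.softmax(h_q @ h_q.transpose(-2, -1) / d_q ** 0.5, dim=-1)
        h_q_tilde = att_q @ h_q
        # Eqs. (10)-(11): similarity between question and image regions, normalised
        r = h_q_tilde @ h_v.transpose(-2, -1) / d_v ** 0.5
        r_tilde = torch.softmax(r, dim=-1)
        # Eq. (12): reweight the image representation with the question-guided scores
        h_v_tilde = r_tilde @ h_v
        return h_q_tilde, h_v_tilde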
3.4 Answer Prediction

We apply a linear multimodal fusion method to fuse the two representations \tilde{H}_q and \tilde{H}_v as follows:

    H_r = W_v^{\top} \tilde{H}_v + W_q^{\top} \tilde{H}_q,    (13)

    pred = \mathrm{softmax}(W_e H_r + b_e),    (14)

where W_v, W_q, W_e, and b_e are learned parameters, and pred denotes the probabilities of the classified answers over the answer vocabulary, which contains M candidate answers. Following (Yu et al., 2019), we use a binary cross-entropy loss function to train the answer classifier.
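A minimal sketch of this prediction head is given below. The pooling of the aligned representations to a single vector, the class name, and the use of BCEWithLogitsLoss on raw logits (rather than the softmax of Eq. 14) are assumptions that reflect common practice with soft VQA targets, not details stated in the paper.

import torch
import torch.nn as nn

class AnswerPredictor(nn.Module):
    """Sketch of Sec. 3.4: linear multimodal fusion (Eq. 13) followed by a
    classifier over the M = 3,129 candidate answers (Eq. 14)."""
    def __init__(self, dim=512, num_answers=3129):
        super().__init__()
        self.w_v = nn.Linear(dim, dim, bias=False)
        self.w_q = nn.Linear(dim, dim, bias=False)
        self.classifier = nn.Linear(dim, num_answers)

    def forward(self, h_v_tilde, h_q_tilde):
        # h_v_tilde, h_q_tilde: (batch, dim) pooled representations
        # (pooling over regions/words is an assumption here)
        h_r = self.w_v(h_v_tilde) + self.w_q(h_q_tilde)       # Eq. (13)
        return self.classifier(h_r)                           # answer logits

# Following Yu et al. (2019), training uses a binary cross-entropy objective
# against soft answer scores derived from the 10 human answers per question.
criterion = nn.BCEWithLogitsLoss()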
4 Experiments

4.1 Datasets

VQA-v2 (Goyal et al., 2017) is the most commonly used VQA benchmark dataset, which is split into train, val, and test-standard sets; 25% of the test-standard set serves as the test-dev set. Each question has 10 answers from different annotators, and the answer with the highest frequency is treated as the ground truth. All answer types can be divided into Yes/No, Number, and Other. VQA-CP-v2 (Agrawal et al., 2018) is a derivation of the VQA-v2 dataset, which is introduced to evaluate and reduce the question-oriented bias in VQA models. Due to the significant difference in distribution between the train set and the test set, the VQA-CP-v2 dataset is harder than the VQA-v2 dataset.
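Since each question comes with 10 annotator answers, training with the binary cross-entropy loss of Section 3.4 typically uses soft answer scores rather than a single one-hot label. The helper below follows the standard VQA accuracy convention, min(#matching answers / 3, 1); it is not described in the paper and is included only to make the target construction concrete.

from collections import Counter

def soft_answer_scores(annotator_answers, answer_vocab):
    """Maps the 10 human answers of a VQA question to soft targets in [0, 1]:
    an answer counts as fully correct once at least 3 annotators gave it."""
    counts = Counter(annotator_answers)
    scores = [0.0] * len(answer_vocab)
    for answer, idx in answer_vocab.items():
        if answer in counts:
            scores[idx] = min(counts[answer] / 3.0, 1.0)
    return scores

vocab = {"blue": 0, "black": 1, "red": 2}
print(soft_answer_scores(["blue"] * 8 + ["black"] * 2, vocab))  # [1.0, 0.666..., 0.0]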
4.2 Experimental Setup

We use the Adam optimizer (Kingma and Ba, 2014) with parameters α = 0.0001, β1 = 0.9, and β2 = 0.99. The size of the answer vocabulary is set to M = 3,129, as used in (Anderson et al., 2018). The base learning rate is set to 0.0001. After 15 epochs, the learning rate is decayed by 1/5 every 2 epochs. All models are trained for up to 20 epochs with the same batch size of 64 and hidden size of 512. Each image has µ ∈ [10, 100] object regions, and all questions are padded and truncated to the same length of 14, i.e., λ = 14. The numbers of stacked GCN layers L and of attention alignment levels are both 4.
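A minimal sketch of this optimization setup is given below. The placeholder model, the LambdaLR formulation, and the exact epochs at which the decay fires reflect one reading of the schedule described above, not the authors' released training code.

import torch

# Adam with lr = 1e-4, betas = (0.9, 0.99); after epoch 15 the learning rate
# is multiplied by 1/5 every 2 epochs, and training runs for 20 epochs.
model = torch.nn.Linear(512, 3129)            # placeholder for the full DC-GCN
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, betas=(0.9, 0.99))

def lr_lambda(epoch):
    return 1.0 if epoch < 15 else 0.2 ** ((epoch - 15) // 2 + 1)

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

for epoch in range(20):
    # ... one training epoch over VQA-v2 with batch size 64 would go here ...
    scheduler.step()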
4.3 Experimental Results

Table 2 shows the performance of our DC-GCN model and baseline models trained on the widely-used VQA-v2 dataset. All results in our paper are based on single-model performance. For a fair comparison, we also train our model with the extra Visual Genome dataset (Krishna et al., 2017).

    Model                                   Y/N     Num     Other   All (test-dev)   All (test-std)
    Bottom-Up (Anderson et al., 2018)       81.82   44.21   56.05   65.32            65.67
    DCN (Nguyen and Okatani, 2018)          83.51   46.61   57.26   66.87            66.97
    Counter (Zhang et al., 2018a)           83.14   51.62   58.97   68.09            68.41
    BAN (Kim et al., 2018)                  85.31   50.93   60.26   69.52            -
    DFAF (Gao et al., 2019)                 86.09   53.32   60.49   70.22            70.34
    Erase-Att (Liu et al., 2019)            85.87   50.28   61.10   70.07            70.36
    ReGAT (Li et al., 2019)                 86.08   54.42   60.33   70.27            70.58
    MCAN (Yu et al., 2019)                  86.82   53.26   60.72   70.63            70.90
    DC-GCN (ours)                           87.32   53.75   61.45   71.21            71.54

Table 2: Comparison with previous state-of-the-art methods on the VQA-v2 test dataset. "-" means the data is absent. Answer types consist of Yes/No, Num and Other categories; All means the total accuracy rate. All results in our paper are based on single-model performance.

Bottom-Up (Anderson et al., 2018) proposes to use features based on Faster R-CNN (Ren et al., 2015) instead of ResNet (He et al., 2016). The Dense Co-Attention Network (DCN) (Nguyen and Okatani, 2018) utilizes a dense stack of multiple co-attention layers. The Counter method (Zhang et al., 2018a) is good at counting questions by utilizing the information of bounding boxes. DFAF (Gao et al., 2019) dynamically fuses intra- and inter-modality information. ReGAT (Li et al., 2019) models semantic, spatial, and implicit relations via a graph attention network. MCAN (Yu et al., 2019) utilizes deep modular networks to learn the multimodal feature representations and is a state-of-the-art approach on the VQA-v2 dataset. As shown in Table 2, our model increases the overall accuracy over DFAF and MCAN by 1.2% and 0.6% on the test-std set, respectively. Although it still cannot achieve comparable performance in the Num category with respect to ReGAT (which is the best one in the counting sub-task), our DC-GCN outperforms it in the other categories (e.g., Y/N by 1.2%, Other by 1.1% and Overall by 0.9%). This shows that DC-GCN has relation-capturing ability in answering all kinds of questions by sufficiently exploring the semantics in both object appearances and object relations. In summary, our DC-GCN achieves outstanding performance on the VQA-v2 dataset.

To demonstrate the generalizability of our DC-GCN model, we also conduct experiments on the VQA-CP-v2 dataset. To overcome the language biases of the VQA-v2 dataset, the research work (Agrawal et al., 2018) designed the VQA-CP-v2 dataset and specifically proposed the GVQA model for reducing the influence of language biases. Table 3 shows the results on the VQA-CP-v2 test split. Murel (Cadene et al., 2019a) and ReGAT (Li et al., 2019), which build the relations between objects to perform reasoning and question answering, are the state-of-the-art models. Our DC-GCN model surpasses both Murel and ReGAT on VQA-CP-v2 (41.47 vs. 39.54 and 41.47 vs. 40.42), a performance gain of up to +1.05%. Although our proposed method is not designed for the VQA-CP-v2 dataset, our model has a slight advantage over the UpDn + Q-Adv + DoE model. The results on the VQA-CP-v2 dataset show that dependency parsing and the DC-GCN can effectively reduce question-based overfitting.

    Model                                              Acc. (%)
    RAMEN (Robik Shrestha, 2019)                       39.21
    BAN (Kim et al., 2018) *                           39.31
    Murel (Cadene et al., 2019a)                       39.54
    ReGAT-Sem (Li et al., 2019)                        39.54
    ReGAT-Imp (Li et al., 2019)                        39.58
    ReGAT-Spa (Li et al., 2019)                        40.30
    ReGAT (Li et al., 2019)                            40.42
    GVQA (Agrawal et al., 2018) #                      31.30
    UpDn (Anderson et al., 2018) **                    39.74
    UpDn + Q-Adv + DoE (Ramakrishnan et al., 2018) #   41.17
    DC-GCN (ours)                                      41.47

Table 3: Model accuracy on the VQA-CP-v2 benchmark (open-ended setting on the test split). The results of models with * and ** are obtained from (Robik Shrestha, 2019) and (Ramakrishnan et al., 2018), respectively. Models with # are designed for solving the language biases. The ReGAT model consists of Semantic (Sem), Implicit (Imp), and Spatial (Spa) relation encoders.

4.4 Qualitative Analysis

Figure 6: Visualizations of the learned attention maps of the Q-GCN module, I-GCN module and Attention Alignment module from some typical layers. We regard the correlation score between nodes as the attention score. Q-GCN(l) and I-GCN(l) denote the question GCN attention maps and image GCN attention maps from the l-th layer, respectively, as shown in (a), (b), (c) and (d); (e) and (f) show the question-guided image attention weights of the Attention Alignment module in the l-th layer. For the sake of presentation, we only consider 20 object regions in an image. The index within [1, 20] shown on the axes of the attention maps corresponds to each object in the image. For better visualization, we highlight in the image the objects which correspond to the 4-th, 6-th, 9-th, and 12-th objects, respectively.

In Figure 6, we visualize the learned attentions from the I-GCN module, Q-GCN module and Attention Alignment module.
Due to space limitations, we only show one example and visualize six attention maps from different attention units and different layers. From the results, we have the following observations.

Question GCN Module: The attention maps of Q-GCN(2) focus on the words color and shirt, as shown in Figure 6(a), while the attention maps of Q-GCN(4) correctly focus on the words color, umpire's, and shirt, as shown in Figure 6(b). These words have larger weights than the others; that is to say, the keywords color, umpire's and shirt are identified correctly.

Image GCN Module: For the sake of presentation, we only consider 20 object regions in an image. The index within [1, 20] shown on the axes of the attention maps corresponds to each object in the image. Among these indexes, indexes 4, 6, 9, and 12 are the most relevant ones for the question. Compared with I-GCN(2), which focuses on the 4-th, 6-th, 9-th, 12-th, and 14-th objects (cf. Figure 6(c)), I-GCN(4) focuses more on the 4-th, 6-th, and 12-th objects, where the 4-th object has a larger weight than the 6-th and 12-th objects, as shown in Figure 6(d). The 4-th object region is the ground-truth region, while the 6-th, 9-th, and 12-th object regions are the most relevant ones.

Attention Alignment Module: Given a specific question, a model needs to align image objects guided by the question to update the representations of objects. As shown in Figure 6(e), the focus regions are rather scattered, where the key regions are mainly the 4-th, 9-th and 12-th object regions. Through the guidance of the identified words color, umpire's and shirt, the DC-GCN model gradually pays more attention to the 4-th, 9-th, and 12-th object regions rather than other irrelevant object regions, as shown in Figure 6(f). This alignment process demonstrates that our model can capture the relations among multiple similar objects.

We also visualize some negative examples predicted by our DC-GCN model, as shown in Figure 7. They can be classified into three categories: (1) limitation of object detection; (2) text semantic understanding in scenarios; (3) subjective judgment.

Figure 7: We summarize three types of incorrect examples: limitation of object detection, text semantic understanding, and subjective judgment, which correspond to (a), (b), and (c), respectively.

In Figure 7(a), although the question how many sheep are pictured is not so difficult, the image content is really confusing. Without careful observation, it is rather easy to obtain the wrong answer 2 instead of 3. The reasons for this error include object occlusion, varying distances of objects, and the limitation of object detection. The image feature extractor is based on the Faster R-CNN model (Ren et al., 2015), so the accuracy of object detection can indirectly affect the accuracy of feature extraction. The counting subtask in VQA still has large room for improvement.

In Figure 7(b), the question what time should you pay can only be answered by understanding the text in the image. Text semantic understanding belongs to another task, namely text visual question answering (Biten et al., 2019), which requires recognizing the numbers, symbols and proper nouns in a scene. In Figure 7(c), subjective judgment is needed to answer the question is this man happy. Making this judgment requires some common-sense knowledge and real-life experience: someone is holding a banana against the man as if holding a gun towards him, so he is unhappy. Our model cannot make such an analysis like a human being to form a subjective judgment and predict the correct answer yes.
Finally, to understand the distribution of the three error types, we randomly pick 100 samples from the dev set of VQA-v2. The numbers of the three error types (i.e., overlapping objects, text semantic understanding, and subjective judgment) are 3, 3, and 29, respectively. The predicted answers of the first two question types are all incorrect. The last type has 12 incorrect answers, which means the error rate of this question type is 41.4%. These observations are helpful for making further improvements in the future.

4.5 Ablation Study

We perform extensive ablation studies on the VQA-v2 validation dataset (cf. Table 4). The experimental results are based on one block of our DC-GCN model. All modules inside DC-GCN have the same dimension of 512. The learning rate is 0.0001 and the batch size is 32.

    Component              Setting                              Acc. (%)
    Bottom-Up              Bottom-Up (Anderson et al., 2018)    63.15
    Default                DC-GCN                               66.57
    GCN Types              DC-GCN                               66.57
                           w/o I-GCN                            65.52
                           w/o Q-GCN                            66.15
    Dependency relations   - det                                66.50
                           - case                               66.42
                           - cop                                66.01
                           - aux                                66.48
                           - advmod                             66.53
                           - compound                           66.35
                           - det case                           65.23
                           - det case cop                       64.11

Table 4: Ablation studies of our proposed model on the VQA-v2 validation dataset. The experimental results are based on one block of our DC-GCN model. w/o means removing a certain module from the DC-GCN model. The detailed descriptions of the dependency relations are shown in Table 1.

Firstly, we investigate the influence of the GCN types. There are two GCN types: I-GCN and Q-GCN, as shown in Table 4. When removing the I-GCN, the performance of our model decreases from 66.57% to 65.52% (p-value = 3.22E-08 < 0.05). When removing the Q-GCN, the performance of our model slightly decreases from 66.57% to 66.15% (p-value = 2.04E-07 < 0.05). We consider that there are two reasons. One is that the image content is more complex than the question content and hence has richer semantic information; building the relations between objects helps clarify what the image represents and helps align it with the question representations. The other is that questions are short and contain less information (e.g., what animal is this? and what color is the man's shirt?).

Then, we perform an ablation study on the influence of the dependency relations (cf. Table 1). Relations like nsubj, nmod, dobj and amod are crucial to the semantic representations, therefore we do not remove them from the sentence. As shown in Table 4, removing relations like det, case, aux and advmod individually has a trivial influence on the semantic representations of the question. However, the accuracy decreases significantly when we simultaneously remove the relations det, case and cop. The reason may be that the sentence loses too much information and can no longer fully express the meaning of the original sentence. For example, consider the two phrases on the table and under the table. If we remove the relation case, which means that the words on and under are removed, then it will be hard to distinguish whether it is on the table or under the table.

5 Conclusion

In this paper, we propose a dual channel graph convolutional network to explore the relations between objects in an image and the syntactic dependency relations between words in a question. Furthermore, we explicitly construct the relations between words with a dependency tree and align the image and question representations with an attention alignment module to reduce the gaps between vision and language. Extensive experiments on the VQA-v2 and VQA-CP-v2 datasets demonstrate that our model achieves comparable performance with the state-of-the-art approaches. We will explore more complicated object relation modeling in future work.

Acknowledgements

We thank the anonymous reviewers for valuable comments and thoughtful suggestions. We would also like to thank Professor Yuzhang Lin from the University of Massachusetts Lowell for helpful discussions.

This work was supported by the Fundamental Research Funds for the Central Universities, SCUT (No. 2017ZD048, D2182480), the Science and Technology Planning Project of Guangdong Province (No. 2017B050506004), the Science and Technology Programs of Guangzhou (No. 201704030076, 201802010027, 201902010046), the collaborative research grants from the Guangxi Natural Science Foundation (2017GXNSFAA198225) and the Hong Kong Research Grants Council (project no. PolyU 1121417 and project no. C1031-18G), and an internal research grant from the Hong Kong Polytechnic University (project 1.9B0V).
References

Aishwarya Agrawal, Dhruv Batra, Devi Parikh, and Aniruddha Kembhavi. 2018. Don't just assume; look and answer: Overcoming priors for visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4971–4980.

Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, and Lei Zhang. 2018. Bottom-up and top-down attention for image captioning and visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6077–6086.

Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, and Devi Parikh. 2015. VQA: Visual question answering. In Proceedings of the IEEE International Conference on Computer Vision, pages 2425–2433.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.

Ali Furkan Biten, Ruben Tito, Andres Mafla, Lluis Gomez, Marçal Rusiñol, Ernest Valveny, C. V. Jawahar, and Dimosthenis Karatzas. 2019. Scene text visual question answering. CoRR abs/1905.13648.

Remi Cadene, Hedi Ben-Younes, Matthieu Cord, and Nicolas Thome. 2019a. MUREL: Multimodal relational reasoning for visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1989–1998.

Remi Cadene, Corentin Dancette, Hedi Ben-Younes, Matthieu Cord, and Devi Parikh. 2019b. RUBi: Reducing unimodal biases for visual question answering. In Advances in Neural Information Processing Systems, pages 841–852.

Qingxing Cao, Xiaodan Liang, Bailing Li, Guanbin Li, and Liang Lin. 2018. Visual question reasoning on general dependency tree. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7249–7257.

Marie-Catherine De Marneffe, Timothy Dozat, Natalia Silveira, Katri Haverinen, Filip Ginter, Joakim Nivre, and Christopher D. Manning. 2014. Universal Stanford dependencies: A cross-linguistic typology. In LREC, volume 14, pages 4585–4592.

Peng Gao, Zhengkai Jiang, Haoxuan You, Pan Lu, Steven C. H. Hoi, Xiaogang Wang, and Hongsheng Li. 2019. Dynamic fusion with intra- and inter-modality attention flow for visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6639–6648.

Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. 2017. Making the V in VQA matter: Elevating the role of image understanding in visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6904–6913.

Zhijiang Guo, Yan Zhang, and Wei Lu. 2019. Attention guided graph convolutional networks for relation extraction. In Proceedings of the 57th Conference of the Association for Computational Linguistics (ACL), pages 241–251.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

Jin-Hwa Kim, Jaehyun Jun, and Byoung-Tak Zhang. 2018. Bilinear attention networks. In Advances in Neural Information Processing Systems, pages 1564–1574.

Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A. Shamma, et al. 2017. Visual Genome: Connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision, 123(1):32–73.

Linjie Li, Zhe Gan, Yu Cheng, and Jingjing Liu. 2019. Relation-aware graph attention network for visual question answering. In Proceedings of the IEEE International Conference on Computer Vision, pages 10313–10322.

Xihui Liu, Zihao Wang, Jing Shao, Xiaogang Wang, and Hongsheng Li. 2019. Improving referring expression grounding with cross-modal attention-guided erasing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1950–1959.

Jiasen Lu, Jianwei Yang, Dhruv Batra, and Devi Parikh. 2016. Hierarchical question-image co-attention for visual question answering. In Advances in Neural Information Processing Systems, pages 289–297.

Duy-Kien Nguyen and Takayuki Okatani. 2018. Improved fusion of visual and language representations by dense symmetric co-attention for visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6087–6096.

Marco Pedersoli, Thomas Lucas, Cordelia Schmid, and Jakob Verbeek. 2017. Areas of attention for image captioning. In Proceedings of the IEEE International Conference on Computer Vision, pages 1242–1250.
Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.

Sainandan Ramakrishnan, Aishwarya Agrawal, and Stefan Lee. 2018. Overcoming language priors in visual question answering with adversarial regularization. In Advances in Neural Information Processing Systems, pages 1541–1551.

Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. 2015. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, pages 91–99.

Robik Shrestha, Kushal Kafle, and Christopher Kanan. 2019. Answer them all! Toward universal visual question answering models. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 10472–10481.

Kevin J. Shih, Saurabh Singh, and Derek Hoiem. 2016. Where to look: Focus regions for visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4613–4621.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.

Danfei Xu, Yuke Zhu, Christopher B. Choy, and Li Fei-Fei. 2017. Scene graph generation by iterative message passing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5410–5419.

Jianwei Yang, Jiasen Lu, Stefan Lee, Dhruv Batra, and Devi Parikh. 2018. Graph R-CNN for scene graph generation. In Proceedings of the European Conference on Computer Vision (ECCV), pages 670–685.

Zichao Yang, Xiaodong He, Jianfeng Gao, Li Deng, and Alex Smola. 2016. Stacked attention networks for image question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 21–29.

Liang Yao, Chengsheng Mao, and Yuan Luo. 2019. Graph convolutional networks for text classification. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 7370–7377.

Ting Yao, Yingwei Pan, Yehao Li, and Tao Mei. 2018. Exploring visual relationship for image captioning. In Proceedings of the European Conference on Computer Vision (ECCV), pages 684–699.

Zhou Yu, Jun Yu, Yuhao Cui, Dacheng Tao, and Qi Tian. 2019. Deep modular co-attention networks for visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6281–6290.

Zhou Yu, Jun Yu, Jianping Fan, and Dacheng Tao. 2017. Multi-modal factorized bilinear pooling with co-attention learning for visual question answering. In Proceedings of the IEEE International Conference on Computer Vision, pages 1821–1830.

Zhou Yu, Jun Yu, Chenchao Xiang, Jianping Fan, and Dacheng Tao. 2018. Beyond bilinear: Generalized multimodal factorized high-order pooling for visual question answering. IEEE Transactions on Neural Networks and Learning Systems, 29(12):5947–5959.

Yan Zhang, Jonathon Hare, and Adam Prügel-Bennett. 2018a. Learning to count objects in natural images for visual question answering. In International Conference on Learning Representations (ICLR).

Yuhao Zhang, Peng Qi, and Christopher D. Manning. 2018b. Graph convolution over pruned dependency trees improves relation extraction. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 2205–2215.