DeepPrimitive: Layered Image Decomposition
Research Article
Jiahui Huang1, Jun Gao2, Vignesh Ganapathi-Subramanian3, Hao Su4, Yin Liu5, Chengcheng Tang3, and Leonidas J. Guibas3
© The Author(s) 2018. This article is published with open access at Springerlink.com
representations of complex images. Labeling image pixels with high-level primitive information also aids in vectorizing rasterized images.

Complex images have multiple layers of information embedded in them. It is shown in Ref. [6] that human analysis of an image is always performed in a top–down manner. For example, when given an image of a room, the biggest objects, such as desks, beds, and chairs, are observed first. The focus then shifts to specific objects, e.g., objects on the desk such as books and a monitor; this analysis is performed recursively. When analyzing an image of a window, humans tend to focus on the border of the window first; the inner structure within the window and decorations are considered later. However, existing object detection networks neglect this layered search and treat objects from different information layers the same. Layered detection has added value when there are internal occlusions in the image, which make traditional object detection more difficult to perform.

In this work, we develop a deep network that separates multiple information layers, as in Fig. 1, and is able to detect the positions of the primitives in each layer as well as estimate their parameters (e.g., the width, height, and orientation of a rectangle, or the number and positions of the control points of a spline). The proposed method is shown to be more accurate than traditional methods and other learning-based approaches.

This paper is organized as follows. We consider related work in Section 2 and provide an analysis of the novelty of our work. Then, in Section 3, we propose a framework based on the traditional YOLOv2 network [2] to provide parameters that are fully interpretable and high-level. We also tackle the problem of regressing parameters for primitives with a variable number of unknowns. We then propose a layered architecture in Section 4, which can learn to separate the different information layers of the image and regress parameters in each layer separately. In Section 6, we give experiments used to evaluate the performance of our network against existing traditional state-of-the-art techniques, and in Section 7, we show how this framework can be applied to image editing and recognition by components. We also discuss the limitations of our framework. Finally, in Section 9, we envisage how the framework provided in this work could help to solve the important problem of primitive-based representations, which has applications at the intersection of vision, AI, and robotics.

To sum up, our contributions in this paper include:
• A framework based on the YOLOv2 network that enables class-wise parameter regression for different primitives.
• An RNN model to estimate a sequence of a variable number of control points representing a closed spline curve in a single 2D image.
• A layered primitive detection model to extract relationship information from an image.

2 Related work

Our task of decomposing an input image into layers of correlated and possibly overlapping geometric primitives is inherently linked to three categories of problems, which have been treated and studied independently in the traditional setting: object detection and high-level vision; regression and reconstruction of geometric components such as splines and primitives; and understanding the relationships and layout of objects and entities. These problems provide information at different scales, and all are of great importance to the computer vision and graphics communities. After considering these three categories of applications, we conclude the discussion of related work with relevant machine learning methodologies, with a focus on recurrent neural networks.
2.1 Object detection and high-level vision
Fig. 1 Motivation: given an image composed of abstract shapes, our framework can decompose overlapping primitives into multiple layers and estimate their parameters.

Among the traditional model-driven approaches to object detection, the generalized Hough transform [7] is a classical technique applicable to detecting particular classes of shapes up to rigid transformations. Variability of shapes as well as input nuances are tackled by deep-learning based techniques; faster-RCNN [8] utilizes region proposal networks (RPN) to locate objects and fast-RCNN to determine the semantic class of each object. Recent works like YOLO [1, 2] and SSD [9] formulate the task of detection as a regression problem and propose end-to-end trainable solutions. We use the detection framework of the efficient YOLOv2 [2] as the backbone of our framework. However, unlike YOLO or YOLOv2, as well as providing bounding boxes and class labels, our framework also regresses geometric parameters and handles the problem of occlusion, in a layered fashion.

To construct high-level objects using simple primitives, Biederman [5] introduced the idea of visual composition. Recently, SCAN [10] tries to compose visual primitives in a hierarchical way and learn an implicit hierarchy of concepts as well as their logical relations using a β-VAE network. While they build their hierarchy over concepts, our work is based on visual containment relationships for different shapes. Lake et al. [11] proposed a probabilistic program induction scheme to parse hand-writing images into several strokes and sub-strokes using a few images as training data, but their method is limited to the specific domain of hand-written characters.

2.2 Spline fitting and vectorization

Primitives and splines are widely used for representing geometry or images due to their succinctness and precision. Thus, recovering them by fitting input data is a long-standing problem in graphics. The idea of iteratively minimizing a distance metric [12–14], serving as a foundation of many studies, has been improved by either more effective distance metrics [15] or more efficient optimization techniques [16]. However, most previous works fail due to lack of decent initialization, which is overcome by a learning-based algorithm in our case. It is worth noting that vectorizing rasterized images [17, 18] also aims to solve a related problem. However, since previous works do not decompose an image into assemblies of clean primitives, there is a loss of high-level information about shape and layering.

2.3 Layered object detection

Multiple works have of late attempted to introduce composable layers into the process of object detection. Liu et al. [9] attempt to use feature hierarchies and detect objects based on different feature maps. Lin et al. [19] further improve this elegant idea by adding top–down convolutional layers and skip connections. However, these works only focus on how to combine features at different scales, regardless of the relationships between objects and the associated layers composing the original image. The work by Bellver et al. [6] formulates detection as a reinforcement learning problem and represents an image as a predefined hierarchical tree, leaving the agent to iteratively select subsequent parts to look at. The work most relevant to ours is CSGNet [20], a recursive neural network model which generates a structured program defining the relationships between a sparse set of primitives. However, the possible positions and sizes of the primitives are limited to the size of a finite action space. In contrast, our work allows more detailed transformations of primitives, and our layered representation is less prone to redundancy.

2.4 Recurrent neural networks

The recurrent neural network (RNN), along with its variants LSTM [21] and GRU [22], is a common model widely used in natural language processing which has recently been applied to computer vision tasks. One key inspiration for our work is polygon-RNN [23], in which a sequence of vertices forming a polygon is predicted in a recurrent manner. One of the key differences in our work is that we aim to abstract the simplest types of representation on different layers, based on general splines instead of the polylines or interpolating cubic Bézier curves used in polygon-RNN.

The discussion above only samples the studies most relevant to our work. There are many other relevant areas such as image parsing, dense captioning, structure-aware geometry processing, and more. Despite the richness of relevant works across a wide range, which manifests the importance of the topic, we believe that the problem of understanding images as abstract compositions is underexplored.

3 Basic model

In this section, we propose a framework based on a standard modification of the YOLOv2 model [2], inspired by Ref. [24], to perform parameter regression. The parameters regressed by the model, as opposed to those in Ref. [24], are fully interpretable and high-level.
3.1 Adapting YOLO for parameter regression
The primary idea of this model is to extend the architecture of the state-of-the-art object detector YOLOv2 to detect primitives in an image and, in addition, to estimate the parameters of each primitive. The deep neural network architecture is capable of extracting more detailed descriptors of detected objects, as well as the bounding box location. Providing additional structural information about the object to the YOLOv2 architecture aids in augmenting the learned features.

The YOLOv2 network in the original paper consumes an entire image and segments it into a grid of size S × S. Each square in the grid can contain multiple primitives. The network models this multiplicity by containing up to B possible anchors (primitives in this case). Thus, traditional YOLOv2 networks learn S × S × B × (K + 5) different parameters; the K + 5 term arises since, in addition to the class labels for the K different primitive classes, the network also predicts 1 object probability value and 4 bounding-box related values [2]. While regressing parameters for the bounding boxes, the regressor needs to predict M extra variables for each bounding box being predicted. The M variables are the total number of possible parameters from all different primitive categories. This increases the number of parameters predicted by the network to S × S × B × (5 + K + M).
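For concreteness, this bookkeeping can be sketched in a few lines of PyTorch (illustrative only: the 1024-channel input and the sizes below are placeholders, not our released implementation):

```python
import torch
import torch.nn as nn

# Illustrative sizes: an S x S grid, B anchors per cell, K primitive
# classes, and M pooled geometric parameters over all primitive types.
S, B, K, M = 13, 5, 4, 16

# A 1x1 conv head maps backbone features to per-anchor predictions:
# 4 box values + 1 object probability + K class scores + M parameters.
head = nn.Conv2d(in_channels=1024, out_channels=B * (5 + K + M), kernel_size=1)

features = torch.randn(1, 1024, S, S)          # stand-in backbone output
out = head(features).view(1, B, 5 + K + M, S, S)

box = out[:, :, 0:4]                           # bounding-box values
obj = torch.sigmoid(out[:, :, 4:5])            # object probability
cls_scores = out[:, :, 5:5 + K]                # primitive class scores
params = torch.sigmoid(out[:, :, 5 + K:])      # parameters squashed to [0, 1]
print(box.shape, cls_scores.shape, params.shape)
```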
To achieve this end, a new loss term is added to the loss function previously proposed in Ref. [24]. The new term, L_p, feeds information about the primitive parameters into the network. This term is defined as

  L_p = Σ_{i=0}^{S} Σ_{j=0}^{S} Σ_{k=0}^{B} Σ_{l=0}^{K} Σ_{m∈X(l)} 1_{i,j}^{(k)} 1_{(i,j),k}^{(l)} L(t_{(i,j),k}^{(m)}, t̂_{(i,j),k}^{(m)})    (1)

where 1_{i,j}^{(k)} is an indicator function that determines whether grid square (i, j) is assigned a positive object label for bounding box k. The indicator 1_{(i,j),k}^{(l)} determines whether bounding box k of grid square (i, j) belongs to the primitive defined by l. The purpose of introducing this term is to include a weighting for a primitive in the loss only when the primitive is plausible for the image. X(l) is the set of parameters for primitive l. The terms t and t̂ denote the target and predicted parameters respectively.
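A minimal sketch of how this masked loss could be assembled (hypothetical tensor layout; the two indicator functions become 0/1 masks that zero out cells, anchors, and parameter slots not belonging to the ground-truth primitive):

```python
import torch

def primitive_param_loss(t_hat, t, obj_mask, cls_mask, param_mask):
    """Masked squared-error over primitive parameters, mirroring Eq. (1).

    t_hat, t:   (S, S, B, M) predicted / target parameters
    obj_mask:   (S, S, B)    1 if box k of cell (i, j) has a positive label
    cls_mask:   (S, S, B)    1 if that box belongs to primitive class l
    param_mask: (S, S, B, M) 1 on the slots in X(l) used by primitive l
    """
    mask = (obj_mask * cls_mask).unsqueeze(-1) * param_mask
    return ((t_hat - t) ** 2 * mask).sum()
```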
3.2 Definition of primitive parameters

Primitives with fixed number of parameters. Simple primitives like rectangles or circles have fixed numbers of parameters, and so the values of these parameters can be used directly as ground truth for training. For parameters lying within [0, 1], we can further increase the network training stability by applying a sigmoid function to the network output to constrain the estimated parameters. Readers are referred to Section S1 in the Electronic Supplementary Material (ESM) for detailed definitions of primitive parameters.

Primitives with variable number of parameters. Some of the primitives discussed in this paper, including closed B-spline curves, have a variable number of control points. This permits primitives to represent different kinds of shapes, but it is not compatible with the previously defined model. This incompatibility is solved by learning a fixed-length embedding of the control point positions. In addition, a recurrent neural network (RNN) is appended to the model, to serve as a decoder that outputs the control points in a sequential manner. At time step i, the model predicts the position of the ith control point c_i, and a stop probability p_i ∈ [0, 1] that indicates the end of the curve. We apply a cross-entropy loss to the stop probability while training the RNN.

The loss functions for the RNN-based model must be designed with care. Naively, one can use a simple mean-squared error (MSE) loss for control point position prediction and a cross-entropy loss for probability prediction. However, this only handles the situation where the sequence of control points is fixed and well-defined. Note that every point in the control point sequence C = (c_1, …, c_N) of a closed spline curve can be viewed as the starting point of the sequence. Thus, in order to predict a control point sequence invariant to the position of the starting point, a circular loss similar to that used in Ref. [23] is defined as follows:

  L_circ = min_{k∈[1,N]} ( min( L(C, G_k), L(C, Ḡ_k) ) )    (2)

where L is the MSE loss and G_k is the ground truth control point sequence rotated by k places, i.e., if g_i denotes the ith control point in the ground truth, then G_k is the sequence (g_k, …, g_N, g_1, …, g_{k−1}), and Ḡ_k is the reversed sequence of G_k.
In this way, the ground truth sequence that leads to the minimum MSE loss is taken as the target sequence, making the loss function rotation-invariant. Also note that the introduction of Ḡ_k guarantees that the loss is invariant to clockwise and anti-clockwise sequencing.
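Equation (2) transcribes almost directly into code; a sketch, assuming the control points of one curve are stored as an N × 2 tensor:

```python
import torch

def circular_loss(pred, gt):
    """Rotation- and orientation-invariant MSE between two closed
    control-point sequences of shape (N, 2), following Eq. (2)."""
    best = None
    n = gt.shape[0]
    for k in range(n):
        g_k = torch.roll(gt, shifts=-k, dims=0)       # G_k: rotate by k places
        for seq in (g_k, torch.flip(g_k, dims=[0])):  # ...and its reverse
            loss = torch.mean((pred - seq) ** 2)
            best = loss if best is None else torch.minimum(best, loss)
    return best
```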
…both faster and cognizant of previous learning. We perform region of interest (RoI) pooling [25] on the intermediate output of our network. This enables us to extract regions in the image to focus on, in order to perform detection at the next level.

Fig. 2 The detection process in our layered model. Cuboids denote input images or feature maps. Dark blue arrows, dark green arrows, and dark purple arrows represent conv layers, RoI pooling layers, and detection blocks, respectively; notation is consistent with that in the text. The final output of our network is a layered primitive tree containing both shape information and layer information.

4.2 Architecture
In this way, the layering is represented explicitly by cropping within the interior of an image. This model can be expressed as

  B(1) = d_1(J_1)    (3)
  B(i) = d_i(R[J_i; B(i−1)]),  i ≥ 2    (4)

where R[J; B(i)] represents feature map J cropped using the bounding box information from B(i), which is fed to an RoI pooling layer to obtain a uniform size output for future processing.
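A sketch of this recursion using torchvision's RoI-Align as the pooling operator R (the detection blocks d_i are stand-in callables assumed to return (x1, y1, x2, y2) boxes for a single input image; this is an illustration, not our exact pipeline):

```python
import torch
from torchvision.ops import roi_align

def layered_detect(feature_maps, detect_blocks, levels=3):
    """Detect level by level: crop each level's feature map with the
    boxes predicted at the previous level, as in Eqs. (3) and (4)."""
    boxes = detect_blocks[0](feature_maps[0])        # B(1) = d1(J1)
    outputs = [boxes]
    for i in range(1, levels):
        # R[J_i; B(i-1)]: pool one fixed-size window per parent box;
        # the first RoI column is the batch index (0 for a single image).
        rois = torch.cat([torch.zeros(len(boxes), 1), boxes], dim=1)
        crops = roi_align(feature_maps[i], rois, output_size=(13, 13))
        boxes = detect_blocks[i](crops)              # B(i) = d_i(R[...])
        outputs.append(boxes)
    return outputs
```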
Lower level feature maps are employed for deeper layer detection, since deeper layer primitives are usually smaller in size and thus clearer feature maps are required to perform accurate detection. For consistency between different regions in the image, we perform training using local coordinates within the parent bounding box as the ground truth for B(i). For example, consider an image with a rectangle inside a circle. Then, the ground truth coordinates for the rectangle should lie within the local coordinate system with respect to the circle. Therefore, predicted coordinates are transformed before calculating the loss functions. These local coordinates are used for ground truth since RoI pooling is known to capture partial information in the image, as testified by faster-RCNN [8]. Meanwhile, since there are multiple layers of convolutional operations, the feature map can encode some information outside the bounding box, thus providing the model with the capability to correct mistakes made in outer layers, by considering both local and global information while making detections in inner layers.
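The conversion to parent-relative coordinates is straightforward; a sketch, with boxes given as (x1, y1, x2, y2) in a shared global frame:

```python
def to_local(child, parent):
    """Express a child box in the frame of its parent box, normalized
    so that the parent spans the unit square."""
    px1, py1, px2, py2 = parent
    w, h = px2 - px1, py2 - py1
    x1, y1, x2, y2 = child
    return ((x1 - px1) / w, (y1 - py1) / h,
            (x2 - px1) / w, (y2 - py1) / h)

# A rectangle inside a circle's bounding box:
print(to_local((40, 40, 60, 60), parent=(20, 20, 80, 80)))
# -> (0.333..., 0.333..., 0.666..., 0.666...)
```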
It is worth noting that the information passed from higher to lower layers is not simply restricted to the explicit bounding box position. The feature map in shallower convolutional layers is used to predict both higher and lower level primitives (e.g., in Fig. 2, J_2 affects both B(1) and B(2)). Although we only pass the bounding box information explicitly, knowledge from higher layers can be passed implicitly via these related feature maps.
5 Implementation

In this section, we present our implementation details.

5.1 Primitive and parameter selection

Four types of primitives are used in our experiments: rectangles, triangles, ellipses, and closed spline curves. We observed that the predicted bounding box position is usually more accurate than the regressed parameters. Hence, a local parameter with respect to the bounding box is defined for each primitive so as to be able to perform better reconstruction. Readers are referred to Section S1 in the ESM for detailed descriptions of the parameters used.

5.2 Network architecture

Our code is adapted from an open source PyTorch implementation (https://github.com/longcw/yolo2-pytorch). The backbone network uses the Darknet-19 architecture configured as in Redmon and Farhadi [2]. We set the depth of our layered detection model to 3, using three detection blocks. Detailed configuration of detection block d_i (i = 1, 2, 3) is provided in Section S2 of the ESM.

5.3 Training

The entire hierarchical model can be trained fully end-to-end. Additionally, we adopt a method similar to scheduled sampling [26] to enhance training stability and testing performance. The predicted information B(i−1) from level i−1, which is fed into level i, is substituted by the ground truth value for level i−1 with probability p. The value of p is set to 0.9 in the first 10 epochs and is subsequently decreased by 0.05 every 2 epochs.
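The schedule itself is simple; a sketch of the teacher-forcing probability just described (function names are placeholders):

```python
import random

def teacher_forcing_prob(epoch):
    """p = 0.9 for the first 10 epochs, then 0.05 lower every 2 epochs."""
    if epoch < 10:
        return 0.9
    return max(0.0, 0.9 - 0.05 * ((epoch - 10) // 2 + 1))

def pick_level_input(gt_box, pred_box, epoch):
    # Feed the ground-truth B(i-1) with probability p, else the prediction.
    return gt_box if random.random() < teacher_forcing_prob(epoch) else pred_box
```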
An RNN decoder model is pre-trained separately to regress a fixed-length embedding for control point positions. While training this RNN model, the grid number S is set to 1 in the YOLOv2 detection framework, and the features of closed spline curve images are extracted with our backbone Darknet-19 network. The pre-trained RNN decoder learns to decode the fixed-length embedding and output positions of control points sequentially. When the layered model is being trained, the value of the embedding is used as direct supervision. In the first 5 epochs the embedding is supervised, and in subsequent epochs the network is trained with the positions of control points instead. Note that the RNNs share the same weights across different levels of the hierarchy.
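A minimal sketch of such a decoder (a GRU cell is chosen here for illustration; the embedding size, hidden size, and point cap are placeholders, not our exact configuration):

```python
import torch
import torch.nn as nn

class SplineDecoder(nn.Module):
    """Decode a fixed-length curve embedding into control point
    positions c_i plus a stop probability p_i at each step."""
    def __init__(self, embed_dim=256, hidden_dim=256, max_points=10):
        super().__init__()
        self.cell = nn.GRUCell(2, hidden_dim)   # feeds back the last point
        self.init_h = nn.Linear(embed_dim, hidden_dim)
        self.point_head = nn.Linear(hidden_dim, 2)
        self.stop_head = nn.Linear(hidden_dim, 1)
        self.max_points = max_points

    def forward(self, embedding):
        h = torch.tanh(self.init_h(embedding))
        pt = embedding.new_zeros(embedding.shape[0], 2)
        points, stops = [], []
        for _ in range(self.max_points):
            h = self.cell(pt, h)
            pt = torch.sigmoid(self.point_head(h))          # c_i in [0, 1]^2
            points.append(pt)
            stops.append(torch.sigmoid(self.stop_head(h)))  # p_i
        return torch.stack(points, 1), torch.stack(stops, 1)

pts, stop_probs = SplineDecoder()(torch.randn(4, 256))
print(pts.shape, stop_probs.shape)   # (4, 10, 2) and (4, 10, 1)
```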
5.4 Data synthesis

Following previous works [10, 27], we use synthetic datasets due to the lack of annotated datasets. The hierarchical model was trained with 150,000 synthetic
pictures of size 416 × 416. When we generated the training data, we kept the containment relationships across layers; there may be multiple primitives in each layer. The number of primitives in a single image is restricted to 8, the maximum number of layers to 3, and the number of control points of closed spline curves varies from 5 to 7. In order to test the robustness of our method, noise was added to the shapes of the primitives, as well as hatching patterns for primitives and some skewing of the image itself. Selected dataset images are shown in Fig. 3.

6 Experiments and results

6.1 Ablation study for circular loss

During the pretraining process for the RNN decoder to predict control point positions, we compare the training and validation losses using two different loss functions, i.e., the previously defined L_circ and a simple MSE loss. As shown in Table 1, training with the circular loss leads to better convergence and thus better prediction results. Figure 4 shows two examples comparing the prediction results given the same curve image as input. We found that using the circular loss eliminates the ambiguity of starting point and clock direction in the training data, and leads to more accurate fitting results.

Table 1 Error and accuracy measures during training and testing with two different loss functions. Loss denotes the MSE distance between the ground truth and predicted positions of control points (distances are normalized to lie in the unit interval). # Point Acc. denotes the frequency of predicting the number of control points correctly

            Training                Validation
            Loss      # Point Acc.  Loss      # Point Acc.
  L_MSE     0.12203   74.60         0.12210   74.93
  L_circ    0.04365   76.32         0.04369   75.83

6.2 Comparisons to other methods

Although our model detects primitives in a layered manner, simple object detection measurements including precision and recall rate (or mAP for methods with confidence score output) can be applied to test model accuracy. Meanwhile, we define our reconstruction loss as the pixel-wise RMSE between the input picture and the re-rendered picture using the predicted results from the network. There are multiple approaches to shape detection; we set up 5 independent baselines for comparison. The first two baselines are traditional methods, while the last three are learning-based approaches:
• Contour method. In this method, edge detection is first applied to the input image and each independent contour is separated. A post-processing approximation step is then employed to replace almost collinear segments with a single line segment, with a parameter q controlling the strength of approximation. The type of shape is determined by counting the number of line segments (i.e., its number of edges). This method is implemented using the findContours and approxPolyDP functions of OpenCV [28]; a sketch is given after this list.
• Hough transform [29]. This is widely used to find imperfect shape instances in images by a voting procedure in parameter space. For rectangles and triangles, whose edges are straight line segments, we first use the Hough line transform to detect all possible lines and then recover the parameters of the primitives by solving a set of linear equations. For ellipses, we use the method described in Ref. [30].
• CSGNet [20]. In 2D, this takes a single image as input and generates a program defining the shapes presented. This model allows for more
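The contour baseline can be approximated as follows (a sketch using OpenCV 4.x; how q maps onto approxPolyDP's tolerance, and the Canny thresholds, are our assumptions here, not a specification of the exact baseline):

```python
import cv2

def contour_baseline(path, q=2e-3):
    """Detect edges, trace each contour, collapse near-collinear runs,
    and classify each shape by its number of remaining edges."""
    img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    edges = cv2.Canny(img, 50, 150)
    contours, _ = cv2.findContours(edges, cv2.RETR_LIST,
                                   cv2.CHAIN_APPROX_SIMPLE)
    names = {3: "triangle", 4: "rectangle"}
    shapes = []
    for c in contours:
        eps = q * cv2.arcLength(c, True)     # assume q scales the tolerance
        poly = cv2.approxPolyDP(c, eps, True)
        shapes.append(names.get(len(poly), "ellipse/spline"))
    return shapes
```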
Fig. 3 Examples drawn from our synthetic training dataset. For the Pure dataset, we synthesized simple binary images for training. The
Pure+Noise dataset modified the Pure dataset by adding noise and random affine transformations to each image. The Tex. (short for
“Textured”) dataset allows testing of the robustness of shape detection methods by adding hatching patterns to the shapes. The Textured+Noise
dataset imitates real world hand drawn shape pictures. The Natural dataset imitates colored versions of real world images.
Table 2 Precision, recall, and reconstruction loss measures using various methods on the datasets described in Fig. 3. Prec and Recall denote the precision and recall values as percentages respectively, while Recon measures the RMSE loss between the original picture and the reconstructed picture using the layered prediction results

                           Pure                 Pure+Noise    Textured      Textured+Noise  Natural
  Method                   Prec  Recall  Recon  Prec  Recall  Prec  Recall  Prec  Recall    Prec  Recall
  Contour (q = 4 × 10−4)   78.8  42.9    1.44   10.1  37.7    10.8  54.6    10.0  47.5      5.9   62.2
  Contour (q = 2 × 10−3)   94.0  72.8    1.70   32.5  60.1    16.8  88.0    15.6  73.2      6.4   70.3
  Hough transform          32.6  78.6    1.61   5.1   73.7    —     —       —     —         —     —
  CSGNet (optimized) [20]  37.1  65.4    28.7   —     —       —     —       —     —         —     —
  Flat model               99.7  91.0    —      99.5  90.0    99.6  91.2    99.4  91.0      57.9  62.2
  Recursive model          96.1  72.4    1.64   60.1  61.2    74.0  60.1    95.8  49.9      98.9  84.5
  Our model                99.7  96.1    1.61   99.5  95.0    99.6  95.8    99.5  95.4      97.9  87.6
  Our model (optimized*)   99.7  96.1    1.39   99.6  95.0    —     —       —     —         —     —

* It is impossible to measure reconstruction loss for images with texture or noise, making it unclear how to define the optimization target.
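The reconstruction metric in Table 2 could be computed as follows (a sketch; re-rendering an image from the predicted primitive parameters is assumed to happen elsewhere):

```python
import numpy as np

def reconstruction_rmse(original, rerendered):
    """Pixel-wise RMSE between the input picture and the picture
    re-rendered from predicted primitives (same-shape float arrays)."""
    a = np.asarray(original, dtype=np.float64)
    b = np.asarray(rerendered, dtype=np.float64)
    return float(np.sqrt(np.mean((a - b) ** 2)))
```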
Table 3 Average precision (AP) measures of learning-based shape detection methods. Values are presented as percentages

              Mean   Parallelogram   Triangle   Oval   Spline
  Flat        87.2   87.2            86.3       84.4   90.9
  Recursive   54.3   43.8            53.8       76.0   43.6
  Ours        90.5   88.2            90.7       90.9   92.0

…well-reconstructed. Unlike the baselines, our method can extract high-level shape information as well as containment relationships. Our model outperforms the others both quantitatively and qualitatively, except for the reconstruction loss. However, after appending a simple local optimizer to our model, denoted Our model (optimized) in Table 2, the reconstruction loss is further decreased.
The trained model was applied directly to Google Material icons [31] (lines 1–4 of Fig. 6, using the Pure model) and a small real world dataset containing 150 images selected from the PASCAL VOC2012 dataset [32] and the Internet (lines 5–8 of Fig. 6, using the Natural model). To the best of our knowledge, no public dataset exists that provides ground truth annotations at the geometric primitive level, so we manually annotated the 150 images of this small real world dataset. Testing using our trained model reached an mAP (the metric used in all experiments) of 54.5%. Readers are referred to Sections S3 and S4 in the ESM for further results.
While DeepPrimitive manages to decompose the real world images into relevant primitives, it is to be remembered that this is not the primary focus of our work. Our current model is trained only on synthetic images, but adapting synthetic images to real images with domain adaptation techniques is one trend in the vision community. A few recent vision models have been trained and tested on purely synthetic datasets (e.g., Ref. [27]).

7 Applications

Once an image has been decomposed into several layers and the high-level parameters defining the primitives in the image acquired, one can utilize this information for a variety of applications. In this paper, we demonstrate the use of these parameters in two example applications.

The first application we present is image editing. It is usually very difficult for an artist to modify the shapes in a rasterized image directly. With a low reconstruction loss, our model can decompose an image into several manipulable components with high fidelity and flexibility. For example, in Fig. 7, it is easy for an icon designer to modify parameters of the shapes, changing the angle between the hands of the clock, or tweaking the shape of the paint brush head. For the real world images in Fig. 8, we can directly manipulate the positions of the parts in an image using high-level editing tools (e.g., as in Ref. [33]).

Fig. 7 Image editing on a rasterized image at a primitive level. Primitive detection is performed on the image, followed by editing of the primitives.

Another potential application is recognition-by-components [5]. Usually, state-of-the-art classifiers based on deep networks need large amounts of training data, and its lack hampers accuracy. Once primitives in an image have been recognized, one can easily define classification rules using the layered information obtained. Additional training data is not needed and only a single shape detection model has to be trained. The idea is illustrated in Fig. 9. Given an image, pre-processing steps such as denoising and thresholding are performed to extract the borders of shapes. The proposed model is then applied to detect the primitives and generate a shape parsing tree (in XML format in the figure, for demonstration purposes), with which a handcrafted classifier can easily predict the class of an object in the image by top–down traversal of the tree.
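A toy version of such a handcrafted rule (the XML layout, shape names, and the rule itself are invented for illustration; Fig. 9's exact format is not reproduced here):

```python
import xml.etree.ElementTree as ET

# A hypothetical parse tree: an outer circle containing one rectangle.
doc = ET.fromstring(
    "<primitive type='circle'><primitive type='rectangle'/></primitive>")

def classify(node):
    """Top-down traversal: apply one handcrafted rule per level."""
    child_types = sorted(c.get("type") for c in node)
    if node.get("type") == "circle" and child_types == ["rectangle"]:
        return "wall clock"     # invented rule, for illustration only
    return "unknown"

print(classify(doc))            # -> wall clock
```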
8 Limitations

As an explorative study aiming to understand and reconstruct images as primitives composed layer-wise, there are several limitations left to be resolved in future work. For images with highly-overlapping primitives within the same layer, our model cannot distinguish between them: the output will either be a single primitive or misclassified primitives. Our model discovers only containment relationships: if one higher-level primitive intersects multiple lower-level primitives, duplicate detections of the higher-level primitive are possible. The last two images of line 4 in Fig. 6 demonstrate such failures. These limitations restrict the layer decomposability of our model. Meanwhile, only synthetic images are used for training. Annotated real world data would make the model more generalizable.

Fig. 8 High-level image editing of real world images based on detected primitives. The first two columns of each group show the original image and its layered decomposition, while the last two columns of each group show manipulated results.
References
[14] Chen, Y.; Medioni, G. Object modeling by registration of multiple range images. In: Proceedings of the IEEE International Conference on Robotics and Automation, 2724–2729, 1991.
[15] Wang, W.; Pottmann, H.; Liu, Y. Fitting B-spline curves to point clouds by curvature-based squared distance minimization. ACM Transactions on Graphics Vol. 25, No. 2, 214–238, 2006.
[16] Zheng, W.; Bo, P.; Liu, Y.; Wang, W. Fast B-spline curve fitting by L-BFGS. Computer Aided Geometric Design Vol. 29, No. 7, 448–462, 2012.
[17] Sun, J.; Liang, L.; Wen, F.; Shum, H.-Y. Image vectorization using optimized gradient meshes. ACM Transactions on Graphics Vol. 26, No. 3, Article No. 11, 2007.
[18] Lecot, G.; Levy, B. Ardeco: Automatic region detection and conversion. In: Proceedings of the 17th Eurographics Symposium on Rendering Techniques, 349–360, 2006.
[19] Lin, T.-Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2117–2125, 2017.
[20] Sharma, G.; Goyal, R.; Liu, D.; Kalogerakis, E.; Maji, S. CSGNet: Neural shape parser for constructive solid geometry. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 5515–5523, 2018.
[21] Gers, F. A.; Schraudolph, N. N.; Schmidhuber, J. Learning precise timing with LSTM recurrent networks. Journal of Machine Learning Research Vol. 3, No. 1, 115–143, 2002.
[22] Cho, K.; Merriënboer, B. V.; Gulcehre, C.; Bahdanau, D.; Bougares, F.; Schwenk, H.; Bengio, Y. Learning phrase representations using RNN encoder–decoder for statistical machine translation. arXiv preprint arXiv:1406.1078, 2014.
[23] Castrejón, L.; Kundu, K.; Urtasun, R.; Fidler, S. Annotating object instances with a polygon-RNN. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 5230–5238, 2017.
[24] Jetley, S.; Sapienza, M.; Golodetz, S.; Torr, P. H. S. Straight to shapes: Real-time detection of encoded shapes. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 4207–4216, 2017.
[25] Girshick, R. Fast R-CNN. In: Proceedings of the IEEE International Conference on Computer Vision, 1440–1448, 2015.
[26] Bengio, S.; Vinyals, O.; Jaitly, N.; Shazeer, N. Scheduled sampling for sequence prediction with recurrent neural networks. In: Advances in Neural Information Processing Systems 28. Cortes, C.; Lawrence, N. D.; Lee, D. D.; Sugiyama, M.; Garnett, R. Eds. Curran Associates, Inc., 1171–1179, 2015.
[27] Wu, J.; Tenenbaum, J. B.; Kohli, P. Neural scene de-rendering. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
[28] Itseez. Open source computer vision library. 2015. Available at https://github.com/itseez/opencv.
[29] Duda, R. O.; Hart, P. E. Use of the Hough transformation to detect lines and curves in pictures. Communications of the ACM Vol. 15, No. 1, 11–15, 1972.
[30] Xie, Y.; Ji, Q. A new efficient ellipse detection method. In: Proceedings of the IEEE International Conference on Pattern Recognition, Vol. 2, 957–960, 2002.
[31] Google. Google material icon. 2017. Available at https://material.io/icons/.
[32] Everingham, M. The PASCAL Visual Object Classes Challenge 2012 (VOC2012). Available at http://www.pascal-network.org/challenges/VOC/voc2012/workshop/index.html.
[33] Barnes, C.; Shechtman, E.; Finkelstein, A.; Goldman, D. B. PatchMatch: A randomized correspondence algorithm for structural image editing. ACM Transactions on Graphics Vol. 28, No. 3, Article No. 24, 2009.

Jiahui Huang received his B.S. degree in computer science and technology from Tsinghua University in 2018. He is currently a Ph.D. candidate in computer science at Tsinghua University. His research interests include computer vision and computer graphics.

Jun Gao received his B.S. degree in computer science from Peking University in 2018. He is a graduate student in the Machine Learning Group at the University of Toronto and is also affiliated with the Vector Institute. His research interests are in deep learning and computer vision.

Vignesh G. Subramanian is a Ph.D. candidate in the Department of Electrical Engineering, Stanford University. He previously obtained his dual degrees (B.Tech. in EE and M.Tech. in communication engineering) from IIT Madras, India. His research interests include shape correspondences, 3D geometry, graphics, and vision.