
Computational Visual Media
Vol. 4, No. 4, December 2018, 385–397
https://doi.org/10.1007/s41095-018-0128-6

Research Article

DeepPrimitive: Image decomposition by layered primitive detection

Jiahui Huang1 (✉), Jun Gao2, Vignesh Ganapathi-Subramanian3, Hao Su4, Yin Liu5, Chengcheng Tang3, and Leonidas J. Guibas3

© The Author(s) 2018. This article is published with open access at Springerlink.com

Abstract  The perception of the visual world through basic building blocks, such as cubes, spheres, and cones, gives human beings a parsimonious understanding of the visual world. Thus, efforts to find primitive-based geometric interpretations of visual data date back to 1970s studies of visual media. However, due to the difficulty of primitive fitting in the pre-deep learning age, this research approach faded from the main stage, and the vision community turned primarily to semantic image understanding. In this paper, we revisit the classical problem of building geometric interpretations of images, using supervised deep learning tools. We build a framework to detect primitives from images in a layered manner by modifying the YOLO network; an RNN with a novel loss function is then used to equip this network with the capability to predict primitives with a variable number of parameters. We compare our pipeline to traditional and other baseline learning methods, demonstrating that our layered detection model has higher accuracy and performs better reconstruction.

Keywords  layered image decomposition; primitive detection; biologically inspired vision; deep learning

1 Tsinghua University, Beijing, 100084, China. E-mail: [email protected] (✉).
2 Computer Science Department, University of Toronto, Toronto, M5S2E4, Canada. E-mail: [email protected].
3 Stanford University, Stanford, 94305, United States. E-mail: V. Ganapathi-Subramanian, [email protected]; C. Tang, [email protected]; L. J. Guibas, [email protected].
4 University of California San Diego, La Jolla, 92093, United States. E-mail: [email protected].
5 University of Wisconsin-Madison, Madison, 53715, United States. E-mail: [email protected].
Manuscript received: 2018-11-30; accepted: 2018-12-03

1 Introduction

The computer vision community has been interested in performing detection tasks on images for a long time. The success of object detection techniques has been a shot in the arm for better image understanding. The potent combination of deep learning techniques with traditional techniques [1, 2] has yielded state-of-the-art techniques which focus on detecting objects in an image through bounding box proposals. While this works well for tasks that require strong object localization, other applications in robotics and autonomous systems require a more detailed understanding of the objects in the image. Thus, another well-studied task in visual media processing is that of instance segmentation, where a per-pixel class label is assigned to an input image. Such dense labeling schemes are too redundant, and an intermediate representation needs to be developed.

Understanding images or shapes in terms of basic primitives is a very natural human abstraction. The parsimonious nature of primitive-based descriptions, especially when the task at hand does not require fine-grained knowledge of the image, makes them easy to use and a good choice. This has been explored extensively in the realms of both computer vision and graphics. Various traditional approaches exist for modeling images and objects, such as blocks world [3], generalized cylinders [4], and geons [5]. While primitive-based modeling generally uses classical techniques, using machine learning techniques to extract these primitives can help us to attack more complex images, with multiple layers of information in them. Basic primitive elements such as rectangles, circles, triangles, and spline curves are usually the building blocks of objects in images, and in combination, provide simple, yet extremely informative
representations of complex images. Labeling image pixels with high-level primitive information also aids in vectorizing rasterized images.

Complex images have multiple layers of information embedded in them. It is shown in Ref. [6] that human analysis of an image is always performed in a top–down manner. For example, when given an image of a room, the biggest objects such as desks, beds, chairs, etc., are observed. Then the focus shifts to specific objects, e.g., objects on the desk such as books and monitor; this analysis is performed recursively. When analyzing an image of a window, humans tend to focus on the border of the window first; the inner structure within the window and decorations are considered later. However, original object detection networks neglect this layered search and treat objects from different information layers the same. Layered detection has added value when there are internal occlusions in the image, which make traditional object detection more difficult to perform. In this work, we attempt to generate a deep network that separates multiple information layers as in Fig. 1, and is able to detect the positions of the primitives in each layer as well as estimating their parameters (e.g., the width, height, and orientation of a rectangle or the number and positions of control points of a spline). The proposed method is shown to be more accurate than traditional methods and other learning-based approaches.

This paper is organized as follows. We consider related work in Section 2, and provide an analysis of the novelty of our work. Then, in Section 3, we propose a framework based on the traditional YOLOv2 network [2], to provide parameters that are fully interpretable and high-level. We also tackle the problem of regressing parameters for primitives with a variable number of unknowns. Then, we propose a layered architecture in Section 4, which can learn to separate different information layers of the image and regress parameters in each layer separately. In Section 6, we give experiments used to evaluate the performance of our network against existing traditional state-of-the-art techniques, and in Section 7, we show how this framework could be applied to image editing and recognition by components. We also discuss the limitations of our framework. Finally, in Section 9, we attempt to envisage how the framework provided in this work would help to solve the important problem of primitive-based representations, which has applications that lie at the intersection of vision, AI, and robotics.

To sum up, our contributions in this paper include:
• A framework based on the YOLOv2 network that enables class-wise parameter regression for different primitives.
• An RNN model to estimate a sequence of a variable number of control points representing a closed spline curve in a single 2D image.
• A layered primitive detection model to extract relationship information from an image.

Fig. 1 Motivation: given an image composed of abstract shapes, our framework can decompose overlapping primitives into multiple layers and estimate their parameters.

2 Related work

Our task of decomposing an input image into layers of correlated and possibly overlapping geometric primitives is inherently linked to three categories of problems, which have been treated and studied independently in the traditional setting. Object detection and high-level vision, regression and reconstruction of geometric components such as splines and primitives, and finally, understanding relationships and layout of objects and entities are problems that provide information at different scales, all of great importance to the computer vision and graphics communities. After considering these three categories of applications, we conclude the discussion of related work with relevant machine learning methodologies, with a focus on recurrent neural networks.

2.1 Object detection and high-level vision

Among the traditional model-driven approaches to object detection, the generalized Hough transform [7] is a classical technique applicable to detecting particular classes of shapes up to rigid
transformations. Variability of shapes as well as input nuances are tackled by deep-learning based techniques; faster-RCNN [8] utilizes region proposal networks (RPN) to locate objects and fast-RCNN to determine the semantic class of each object. Recent works like YOLO [1, 2] and SSD [9] formulate the task of detection as a regression problem and propose end-to-end trainable solutions. We use the detection framework of the efficient YOLOv2 [2] as the backbone of our framework. However, unlike YOLO or YOLOv2, as well as providing bounding boxes and class labels, our framework also regresses geometric parameters and handles the problem of occlusion, in layered fashion.

To construct high-level objects using simple primitives, Biederman [5] introduced the idea of visual composition. Recently, SCAN [10] tries to compose visual primitives in a hierarchical way and learn an implicit hierarchy of concepts as well as their logical relations using a β-VAE network. While they build their hierarchy over concepts, our work is based on visual containment relationships for different shapes. Lake et al. [11] proposed a probabilistic program induction scheme to parse hand-writing images into several strokes and sub-strokes using a few images as training data, but their method is limited to the specific domain of hand-written characters.

2.2 Spline fitting and vectorization

Primitives and splines are widely used for representing geometry or images due to their succinctness and precision. Thus, recovering them by fitting input data is a long-standing problem in graphics. The idea of iteratively minimizing a distance metric [12–14], serving as a foundation of many studies, has been improved by either more effective distance metrics [15] or more efficient optimization techniques [16]. However, most previous works fail due to lack of decent initialization, which is overcome by a learning-based algorithm in our case. It is worth noting that vectorizing rasterized images [17, 18] also aims to solve a related problem. However, since previous works do not decompose an image into assemblies of clean primitives, there is a loss of high-level information about shape and layering.

2.3 Layered object detection

Multiple works have of late attempted to introduce composable layers into the process of object detection. Liu et al. [9] attempt to use feature hierarchies and detect objects based on different feature maps. Lin et al. [19] further improve this elegant idea by adding top–down convolutional layers and skip connections. However, these works only focus on how to combine features at different scales regardless of the relationships between objects and the associated layers composing the original image. The work by Bellver et al. [6] formulates detection as a reinforcement learning problem and represents an image as a predefined hierarchical tree, leaving the agent to iteratively select subsequent parts to look at. The work most relevant to ours is CSGNet [20], a recursive neural network model which generates a structured program defining the relationships between a sparse set of primitives. However, the possible positions and sizes of the primitives are limited to the size of a finite action space. In contrast, our work allows more detailed transformations of primitives, and our layered representation is less prone to redundancy.

2.4 Recurrent neural networks

The recurrent neural network (RNN) (and its variants LSTM [21], GRU [22]) is a common model widely used in natural language processing which has recently been applied to computer vision tasks. One key inspiration for our work is polygon-RNN [23], in which a sequence of vertices forming a polygon is predicted in a recurrent manner. One of the key differences in our work is that we aim to abstract the simplest types of representation on different layers, based on general splines instead of polylines, or interpolating cubic Bézier curves as in the polygon-RNN.

The discussion above only samples the studies most relevant to our work. There are many other relevant areas such as image parsing, dense captioning, structure-aware geometry processing, and more. Despite the richness of relevant works across a wide range, which manifests the importance of the topic, we believe that the problem of understanding images as abstract compositions is underexplored.

3 Basic model

In this section, we propose a framework based on a standard modification of the YOLOv2 model [2], inspired by Ref. [24], to perform parameter regression. The parameters regressed by the model, as opposed to those in Ref. [24], are fully interpretable and high-level.
3.1 Adapting YOLO for parameter regression

The primary idea of this model is to extend the architecture of the state-of-the-art object detector YOLOv2 to detect primitives in an image, and in addition, to estimate the parameters of each primitive. The deep neural network architecture is capable of extracting more detailed descriptors of detected objects, as well as the bounding box location. Providing additional structural information about the object to the YOLOv2 architecture aids in augmenting the learned features.

The YOLOv2 network in the original paper consumes an entire image and segments it into a grid of size S × S. Each square in the grid can contain multiple primitives. The network models this multiplicity by containing up to B possible anchors (primitives in this case). Thus, traditional YOLOv2 networks learn S × S × B × (K + 5) different parameters; the K + 5 term arises since, in addition to the class labels for the K different primitive classes, the network also predicts 1 object probability value and 4 bounding-box related values [2]. While regressing parameters for the bounding boxes, the regressor needs to predict M extra variables for each bounding box being predicted. The M variables are the total number of possible parameters from all different primitive categories. This increases the number of parameters predicted by the network to S × S × B × (5 + K + M).
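To make this bookkeeping concrete, the following is a minimal PyTorch sketch of such an enlarged prediction head. It is our illustration, not the authors' released code, and the dimension values are placeholders rather than the paper's actual configuration.

```python
import torch
import torch.nn as nn

# Placeholder dimensions: grid size, anchors per cell, primitive classes,
# and the pooled number of primitive-parameter slots (paper: S, B, K, M).
S, B, K, M = 13, 5, 4, 16

# Each anchor predicts 4 box values + 1 objectness + K class scores + M parameters.
head = nn.Conv2d(in_channels=1024, out_channels=B * (5 + K + M), kernel_size=1)

features = torch.randn(1, 1024, S, S)       # assumed backbone output for one image
pred = head(features)                       # (1, B * (5 + K + M), S, S)
pred = pred.view(1, B, 5 + K + M, S, S)     # split out the per-anchor fields
```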
To achieve this end, a new loss term is added to the loss function previously proposed in Ref. [24]. The new term, Lp, feeds information about the primitive parameters into the network. This term is defined as

Lp = Σ_{i=0}^{S} Σ_{j=0}^{S} Σ_{k=0}^{B} Σ_{l=0}^{K} Σ_{m∈X(l)} 1_{i,j}^{(k)} 1_{(i,j),k}^{(l)} L(t_{(i,j),k}^{(m)}, t̂_{(i,j),k}^{(m)})    (1)

where 1_{i,j}^{(k)} is an indicator function that determines if grid square (i, j) is assigned a positive object label for bounding box k. The indicator 1_{(i,j),k}^{(l)} is a function that determines if bounding box k of grid square (i, j) belongs to the primitive defined by l. The purpose of introducing this term is to include a weighting for a primitive in the loss only when the primitive is plausible for the image. X(l) is the set of parameters for primitive l. The terms t and t̂ denote the target and predicted parameters respectively.
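As an illustration of how the two indicators and X(l) gate the regression targets, here is a hedged PyTorch sketch of Eq. (1). The tensor layout, the helper name, and the use of a squared error for L are our assumptions, not the paper's implementation.

```python
import torch

def primitive_param_loss(t_hat, t, obj_mask, class_mask, param_mask):
    """Sketch of Eq. (1). Shapes (all assumed):
    t_hat, t:    (S, S, B, M) predicted / target primitive parameters
    obj_mask:    (S, S, B)    1 where anchor k of cell (i, j) holds an object
    class_mask:  (S, S, B, K) 1 where that anchor is assigned primitive class l
    param_mask:  (K, M)       1 where parameter slot m belongs to X(l)
    """
    # Fold the class assignment and X(l) membership into one per-slot weight.
    weight = torch.einsum('ijbk,km->ijbm', class_mask, param_mask)
    weight = weight * obj_mask.unsqueeze(-1)
    # A squared error stands in for the per-parameter loss L of Eq. (1).
    return (weight * (t_hat - t) ** 2).sum() / weight.sum().clamp(min=1)
```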
3.2 Definition of primitive parameters

Primitives with fixed number of parameters. Simple primitives like rectangles or circles have fixed numbers of parameters, and so the values of these parameters can be used directly as ground truth for training. For parameters lying within [0, 1], we can further increase the network training stability by applying a sigmoid function to the network output to constrain the estimated parameters. Readers are referred to Section S1 in the Electronic Supplementary Material (ESM) for detailed definitions of primitive parameters.

Primitives with variable number of parameters. Some of the primitives discussed in this paper, including closed B-spline curves, have a variable number of control points. This permits primitives to represent different kinds of shapes, but it is not compatible with the previously defined model. This incompatibility is solved by learning a fixed-length embedding of the control point positions. In addition, a recurrent neural network (RNN) is appended to the model, to serve as a decoder to output the control points in a sequential manner. At time step i, the model predicts the position of the ith control point ci, and a stop probability pi ∈ [0, 1] that indicates the end of the curve. We apply a cross-entropy loss to the stop probability while training the RNN.
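A minimal sketch of such a decoder, assuming a GRU cell; the layer names, sizes, and fixed point budget are our own choices (the paper's exact decoder configuration is given in its ESM).

```python
import torch
import torch.nn as nn

class SplineDecoder(nn.Module):
    """Emits control point c_i and stop probability p_i at each step."""
    def __init__(self, embed_dim=128, hidden_dim=256, max_points=7):
        super().__init__()
        self.init_h = nn.Linear(embed_dim, hidden_dim)  # embedding -> initial state
        self.rnn = nn.GRUCell(2, hidden_dim)            # consumes the previous point
        self.point_head = nn.Linear(hidden_dim, 2)      # (x, y) of the next point
        self.stop_head = nn.Linear(hidden_dim, 1)       # p_i in [0, 1]
        self.max_points = max_points

    def forward(self, embedding):
        h = torch.tanh(self.init_h(embedding))
        prev = torch.zeros(embedding.size(0), 2, device=embedding.device)
        points, stops = [], []
        for _ in range(self.max_points):
            h = self.rnn(prev, h)
            prev = torch.sigmoid(self.point_head(h))    # coordinates in [0, 1]
            points.append(prev)
            stops.append(torch.sigmoid(self.stop_head(h)))
        return torch.stack(points, 1), torch.stack(stops, 1)
```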
The loss functions for the RNN-based model must be designed with care. Naively, one can use a simple mean-squared error (MSE) loss for control point position prediction and a cross-entropy loss for probability prediction. However, this only handles the situation where the sequence of control points is fixed and well-defined. Note that every point in the control point sequence C = (c1, ..., cN) of a closed spline curve can be viewed as the starting point of the sequence. Thus, in order to predict a control point sequence invariant to the position of the starting point, a circular loss similar to that used in Ref. [23] is defined as follows:

Lcirc = min_{k∈[1,N]} min( L(C, Gk), L(C, Ḡk) )    (2)

where L is the MSE loss, Gk is the ground truth control point sequence rotated by k places, i.e., if gi denotes the ith control point in the ground truth, then Gk is the sequence (gk, ..., gN, g1, ..., gk−1), and Ḡk is the reversed (inverse) sequence of Gk. In this way, the ground truth sequence that leads to minimum MSE loss is considered to be the target sequence, making the loss function rotation-invariant. Also note that the introduction of Ḡk guarantees the loss to be invariant to clockwise and anti-clockwise sequencing.
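Since N is small (5–7 control points in our data), Eq. (2) can be evaluated by brute force over every rotation and both orientations of the ground truth; a minimal PyTorch sketch (ours) for a single curve:

```python
import torch

def circular_loss(pred, gt):
    """Sketch of Eq. (2). pred, gt: (N, 2) control point sequences.
    Returns the smallest MSE over all rotations G_k of the ground truth
    and their reversals, making the loss invariant to the starting point
    and to clockwise/anti-clockwise ordering."""
    n = gt.size(0)
    candidates = []
    for k in range(n):
        g_k = torch.roll(gt, shifts=-k, dims=0)   # rotate by k places
        candidates.append(((pred - g_k) ** 2).mean())
        g_k_rev = torch.flip(g_k, dims=[0])       # reversed sequence
        candidates.append(((pred - g_k_rev) ** 2).mean())
    return torch.stack(candidates).min()
```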
4 Layered detection model

4.1 Layered detection

We use a layered model to capture the nested structure of primitives in an image. The idea is inspired by two observations. Our first observation is from how multiple layers in design tools, such as Adobe Photoshop and Illustrator, can help create a vector graphics image. With layers, artists can plan the arrangement of items in the space in a top–down manner. The fact that all vector icon images can be decomposed into multiple layers, as shown in Fig. 1, serves as inspiration to extend the model proposed in Section 3 to include layered detection. Secondly, for the detection of each layer, it allows one to focus on a specific part of the image, instead of working on the entire image. For example, in Fig. 1, the white rectangle in the lower-right of the image is completely inside the black disk: one can focus on the interior of the disk, where the only accessible primitive is the rectangle.

However, training separate networks for different levels of detection is a redundant and time-consuming process, since intuitively, the parameters regressed by these networks are likely to be related. Therefore, we propose a layered detection model to perform this regression task, thereby making the training process both faster and cognizant of previous learning. We perform region of interest (RoI) pooling [25] on the intermediate output of our network. This enables us to extract regions in the image to focus on, to perform detection at the next level.

4.2 Architecture

After an image is forwarded through the backbone network, simple post-processing steps including thresholding and non-maximal suppression are performed to obtain the final prediction results. The backbone network is the previously discussed YOLO network with modified loss; the difference lies in that the backbone network is intended to only predict primitives in the top layer, i.e., the outermost primitives in the image. Following this, the coordinates of the bounding boxes of detected primitives are fed into an RoI pooling layer. The RoI pooling layers consume the intermediate output of the network and pool it into a uniform sized feature map for detection following the layering. Figure 2 illustrates this model.

Fig. 2 The detection process in our layered model. Cuboids denote input images or feature maps. Dark blue arrows, dark green arrows, and dark purple arrows represent conv layers, RoI pooling layers, and detection blocks, respectively; notation is consistent with that in the text. The final output of our network is a layered primitive tree containing both shape information and layer information.

Specifically, the architecture of the backbone network can be treated as multiple consecutive modules, which contain several convolution layers with ReLU activation; each module is combined with pooling layers. We denote the modules by f1, ..., fM (from shallow layers to deep layers). The deepest layer fM has output J1 that is processed by the detection block d1. Subsequent detection blocks di process the output of convolutional layer fM−i+1. We do not use the whole feature map Ji as the input to di, but instead, we crop the feature map using the prediction results from di−1 and resize it to a uniform size.
In this way, the layering is represented explicitly by cropping within the interior of an image. This model can be expressed as

B(1) = d1(J1)    (3)
B(i) = di(R[Ji; B(i−1)]),  i ⩾ 2    (4)

where R[J; B(i)] represents feature map J cropped using bounding box information from B(i), which is fed to an RoI pooling layer to obtain a uniform size output for future processing.
Lower level feature maps are employed for deeper layer detection since deeper layer primitives are usually smaller in size and thus clearer feature maps are required to perform accurate detection. For consistency within different regions of the image, we perform training using local coordinates within the parent bounding box as the ground truth for B(i). For example, consider an image with a rectangle inside a circle. Then, the ground truth coordinates for the rectangle should lie within the local coordinate system with respect to the circle. Therefore, predicted coordinates are transformed before calculating the loss functions. These local coordinates are used for ground truth since RoI pooling is known to capture partial information in the image, as testified by faster-RCNN [8]. Meanwhile, since there are multiple layers of convolutional operations, the feature map can encode some information outside the bounding box, thus providing the model with the capability to correct mistakes made in outer layers, by considering both local and global information while making detections in inner layers.

It is worth noting that the information passed from higher to lower layers is not simply restricted to the explicit bounding box position. The feature map in shallower convolutional layers is used to predict both higher and lower level primitives (e.g., in Fig. 2, J2 affects both B(1) and B(2)). Although we only pass the bounding box information explicitly, knowledge from higher layers can be passed implicitly via these related feature maps.

5 Implementation

In this section, we present our implementation details.

5.1 Primitive and parameter selection

Four types of primitives are used in our experiments: rectangles, triangles, ellipses, and closed spline curves. We observed that the predicted bounding box position is usually more accurate than the regressed parameters. Hence, a local parameter with respect to the bounding box is defined for each primitive so as to be able to perform better reconstruction. Readers are referred to Section S1 in the ESM for detailed descriptions of the parameters used.

5.2 Network architecture

Our code is adapted from an open source PyTorch implementation (https://github.com/longcw/yolo2-pytorch). The backbone network uses the Darknet-19 architecture configured as in Redmon and Farhadi [2]. We set the depth of our layered detection model to 3, using three detection blocks. Detailed configuration of detection block di (i = 1, 2, 3) is provided in Section S2 of the ESM.

5.3 Training

The entire hierarchical model can be trained fully end-to-end. Additionally, we adopt a method similar to scheduled sampling [26] to enhance training stability and testing performance. The predicted information B(i−1) from level i−1, which is fed into level i, is substituted by the ground truth value for level i−1 with probability p. The value of p is set to 0.9 in the first 10 epochs and is subsequently decreased by 0.05 every 2 epochs.
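One reading of this schedule, written out as a sketch (the flooring at zero and the exact step boundaries are our guesses; the function names are ours):

```python
import random

def teacher_forcing_prob(epoch):
    """p = 0.9 for the first 10 epochs, then -0.05 every 2 epochs."""
    if epoch < 10:
        return 0.9
    return max(0.0, 0.9 - 0.05 * ((epoch - 10) // 2 + 1))

def pick_level_input(pred_boxes, gt_boxes, epoch):
    # With probability p, feed ground-truth level-(i-1) boxes into level i.
    return gt_boxes if random.random() < teacher_forcing_prob(epoch) else pred_boxes
```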
An RNN decoder model is pre-trained separately to regress a fixed length embedding for control point positions. While training this RNN model, the grid number S is set to 1 in the YOLOv2 detection framework and the features of closed spline curve images are extracted with our backbone Darknet-19 network. The pre-trained RNN decoder learns to decode the fixed length embedding and output positions of control points sequentially. When the layered model is being trained, the value of the embedding is used as direct supervision. In the first 5 epochs, the embedding is supervised, and in subsequent epochs, the network is trained with the positions of control points instead. Note that the RNNs share the same weights across different levels of the hierarchy.

5.4 Data synthesis

Following previous works [10, 27], we use synthetic datasets due to the lack of annotated datasets. The hierarchical model was trained with 150,000 synthetic pictures of size 416 × 416. When we generated the training data, we kept the containment relationships across layers; there may be multiple primitives in each layer. The number of primitives in a single image is restricted to 8, the maximum number of layers to 3, and the number of control points of closed spline curves varies from 5 to 7. In order to test the robustness of our method, noise was added to the shapes of the primitives, as well as hatching patterns for primitives and some skewing of the image itself. Selected dataset images are shown in Fig. 3.

Fig. 3 Examples drawn from our synthetic training dataset. For the Pure dataset, we synthesized simple binary images for training. The Pure+Noise dataset modified the Pure dataset by adding noise and random affine transformations to each image. The Tex. (short for "Textured") dataset allows testing of the robustness of shape detection methods by adding hatching patterns to the shapes. The Textured+Noise dataset imitates real world hand drawn shape pictures. The Natural dataset imitates colored versions of real world images.
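For illustration, a sketch of a sampler that obeys the ranges above; rasterization, hatching, noise, and the containment bookkeeping between layers are all omitted, and the field names are ours:

```python
import random

def sample_image_spec(max_primitives=8, max_depth=3):
    """Draw a random specification for one synthetic training image."""
    kinds = ['rectangle', 'triangle', 'ellipse', 'spline']
    spec = []
    for _ in range(random.randint(1, max_primitives)):
        kind = random.choice(kinds)
        spec.append({
            'kind': kind,
            'layer': random.randint(1, max_depth),   # containment depth, 1 = outermost
            'control_points': random.randint(5, 7) if kind == 'spline' else None,
        })
    return spec
```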
6 Experiments and results

6.1 Ablation study for circular loss

During the pretraining process for the RNN decoder to predict control point positions, we compare the training and validation losses using two different loss functions, i.e., the previously defined Lcirc and a simple MSE loss. As shown in Table 1, training with circular loss leads to better convergence loss and thus better prediction results. Figure 4 shows two examples comparing the prediction results given the same curve image as input. We found that using circular loss eliminates the ambiguity of starting point and clock direction in the training data, and leads to more accurate fitting results.

Table 1 Error and accuracy measures during training and testing with two different loss functions. Loss denotes the MSE distance between the ground truth and predicted positions of control points (distances are normalized to lie in the unit interval). # Point Acc. denotes the frequency of predicting the number of control points correctly.

  Loss fn | Training: Loss / # Point Acc. | Validation: Loss / # Point Acc.
  L_MSE   | 0.12203 / 74.60               | 0.12210 / 74.93
  L_circ  | 0.04365 / 76.32               | 0.04369 / 75.83

Fig. 4 Two closed spline curve fitting cases using circular loss and MSE loss.

6.2 Comparisons to other methods

Although our model detects primitives in a layered manner, simple object detection measurements including precision and recall rate (or mAP for methods with confidence score output) can be applied to test model accuracy. Meanwhile, we define our reconstruction loss as the pixel-wise RMSE between the input picture and the re-rendered picture using the predicted results from the network. There are multiple approaches to shape detection; we set up 5 independent baselines for comparison. The first two baselines are traditional methods while the last three are learning-based approaches:
• Contour method. In this method, edge detection is first applied to the input image; each independent contour is separated. A post-processing approximation step is then employed to replace almost collinear segments with a single line segment, with a parameter q controlling the strength of approximation. The type of shape is determined by counting the number of line segments (i.e., its number of edges). This method is implemented using the findContours and approxPolyDP functions of OpenCV [28]; a minimal sketch is given after this list.
• Hough transform [29]. This is widely used to find imperfect shape instances in images by a voting procedure in parameter space. For rectangles and triangles, whose edges are straight line segments, we first use the Hough line transform to detect all possible lines and then recover the parameters of the primitives by solving a set of linear equations. For ellipses, we use the method described in Ref. [30].
• CSGNet [20]. In 2D, this takes a single image as input and generates a program defining the shapes presented. This model allows for more complex Boolean operations between shapes, but the sizes and positions of the primitives are highly discretized. We use the post-processed (optimized) top-1 prediction as the output of this algorithm.
• Flat model. This method uses a learning approach trained using the YOLOv2 architecture. The ground truth of the detector is directly set to all primitives in the canvas, regardless of their hierarchical information.
• Recursive model. We train only one detector to detect the primitives in the first hierarchy (i.e., the outermost primitives at the current level). Once the detector successfully detects some primitives in the current level, we crop the detected region, resize the cropped region to the network input size, and feed the image into the same network again.
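As promised above, a minimal sketch of the contour baseline. It assumes OpenCV 4's two-value findContours return, and reads q as a fraction of the contour perimeter, which is one plausible interpretation of the approximation-strength parameter:

```python
import cv2

def classify_shapes(image_path, q=2e-3):
    """Approximate each contour with approxPolyDP and classify it by the
    number of surviving line segments."""
    img = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    _, binary = cv2.threshold(img, 127, 255, cv2.THRESH_BINARY_INV)
    contours, _ = cv2.findContours(binary, cv2.RETR_TREE, cv2.CHAIN_APPROX_SIMPLE)
    results = []
    for c in contours:
        eps = q * cv2.arcLength(c, True)          # tolerance relative to perimeter
        poly = cv2.approxPolyDP(c, eps, True)
        n = len(poly)
        if n == 3:
            kind = 'triangle'
        elif n == 4:
            kind = 'rectangle'
        else:
            kind = 'ellipse or spline'            # many segments: treat as a curve
        results.append((kind, poly))
    return results
```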
Results from these different models are compared in Table 2 (precision–recall–reconstruction comparison) and Table 3 (primitive–reconstruction comparison). Some of the prediction results from different methods are shown in Fig. 5, using the same input in each case.

The contour method with a small q value traces the pixels on the contour precisely but ignores the high-level shape information of the shape boundary, leading to high reconstruction performance but low precision and recall accuracy in shape classification tasks. Using a greater q value simply approximates continuous curves with polygons, leading to poor reconstruction performance. It is also observed that the contour method cannot separate overlapping primitives since it only attempts to detect boundaries in images. The Hough transform-based method for line segment detection and circle detection requires a careful choice of parameters; it generally leads to higher recall values than the contour method. This method partially solves the overlap problem by extending detected line segments and finding intersections, but cannot effectively distinguish extremely short line segments and segments of a circle.

The above problems can be overcome by learning-based models. Learning-based models generally have better performance across all different datasets, and the gap in performance widens as we add more noise to our dataset, which is partially due to the fact that the learned features extracted from the image using our data-driven method are more effective and representative in comparison to the hand-crafted features of traditional methods. Despite the feature improvement, the absence of effective shape and relationship representations can be fatal to the final detection results. Using CSGNet [20], the possible locations and sizes of primitives are restricted due to the size limitation of the action space. In order to compose the target shape, redundant shapes and expressions are generated.

Table 2 Precision, recall, and reconstruction loss measures using various methods on the datasets described in Fig. 3. Prec and Recall denote the precision and recall values as percentages respectively, while Recon measures the RMSE loss between the original picture and the reconstructed picture using the layered prediction results.

  Method                  | Pure: Prec / Recall / Recon | Pure+Noise: Prec / Recall | Textured: Prec / Recall | Textured+Noise: Prec / Recall | Natural: Prec / Recall
  Contour (q = 4 × 10−4)  | 78.8 / 42.9 / 1.44          | 10.1 / 37.7               | 10.8 / 54.6             | 10.0 / 47.5                   | 5.9 / 62.2
  Contour (q = 2 × 10−3)  | 94.0 / 72.8 / 1.70          | 32.5 / 60.1               | 16.8 / 88.0             | 15.6 / 73.2                   | 6.4 / 70.3
  Hough transform         | 32.6 / 78.6 / 1.61          | 5.1 / 73.7                | — / —                   | — / —                         | — / —
  CSGNet (optimized) [20] | 37.1 / 65.4 / 28.7          | — / —                     | — / —                   | — / —                         | — / —
  Flat model              | 99.7 / 91.0 / —             | 99.5 / 90.0               | 99.6 / 91.2             | 99.4 / 91.0                   | 57.9 / 62.2
  Recursive model         | 96.1 / 72.4 / 1.64          | 60.1 / 61.2               | 74.0 / 60.1             | 95.8 / 49.9                   | 98.9 / 84.5
  Our model               | 99.7 / 96.1 / 1.61          | 99.5 / 95.0               | 99.6 / 95.8             | 99.5 / 95.4                   | 97.9 / 87.6
  Our model (optimized*)  | 99.7 / 96.1 / 1.39          | 99.6 / 95.0               | — / —                   | — / —                         | — / —

* It is impossible to measure reconstruction loss for images with texture or noise, making it unclear how to define the optimization target.
Table 3 Average precision (AP) measures of learning-based shape detection methods. Values are presented as percentages.

  Method    | Mean | Parallelogram | Triangle | Oval | Spline
  Flat      | 87.2 | 87.2          | 86.3     | 84.4 | 90.9
  Recursive | 54.3 | 43.8          | 53.8     | 76.0 | 43.6
  Ours      | 90.5 | 88.2          | 90.7     | 90.9 | 92.0

Fig. 5 Detection result examples. Shapes detected at different levels are marked in different colors: level 1, pink; level 2, orange; level 3, blue. For the flat model, there is no predicted layer information, so all shapes are marked in green.

Other learning-based baselines fix this with simple containment representations, but problems still occur due to a lack of layering or incorrect layering. The flat model detects almost all primitives regardless of their layer. However, in cases where two primitives of the same kind (e.g., concentric circles forming an annulus) overlap, the post-processing step (non-maxima suppression) eliminates one of them and predicts the median result, which is undesirable. It is also difficult to reconstruct the original image using the detected primitives due to the loss of layering information. In the recursive model, the layering information is preserved, but if the detection in an outer layer is not accurate enough, the error snowballs and the inner layer primitives cannot be well-reconstructed. Unlike the baselines, our method can extract high-level shape information as well as containment relationships. Our model outperforms the others both quantitatively and qualitatively, except for the reconstruction loss. However, after appending a simple local optimizer to our model, denoted Our model (optimized) in Table 2, the reconstruction loss is further decreased.

The trained model was applied directly to Google Material icons [31] (lines 1–4 of Fig. 6, using the Pure model) and a small real world dataset containing 150 images selected from the PASCAL VOC2012 dataset [32] and the Internet (lines 5–8 of Fig. 6, using the Natural model). To the best of our knowledge, no public dataset exists that provides ground truth annotations at the geometric primitive level, so we manually annotated the 150 images from this small real world dataset. Testing using our trained model reached an mAP (the metric used in all experiments) of 54.5%. Readers are referred to Sections S3 and S4 in the ESM for further results.

Fig. 6 Selected test results for our layered detection model. In each pair of columns, the left picture shows the original input image as well as the detection result, while the right picture reconstructs the input image using the detection result (different instances of primitives within the same hierarchy vary slightly in color for clarity). More test results are available in Sections S3 and S4 in the ESM.

While DeepPrimitive manages to decompose the real world images into relevant primitives, it is to be remembered that this is not the primary focus of our work.
Our current model is trained only on synthetic images, but adapting synthetic images to real images with domain adaptation techniques is one trend in the vision community. A few recent vision papers have been trained and tested on purely synthetic datasets (e.g., Ref. [27]).

7 Applications

Once an image has been decomposed into several layers and high-level parameters defining the primitives in the image acquired, one can utilize this information for a variety of applications. In this paper, we demonstrate the use of these parameters in two example applications.

The first application we present is image editing. It is usually very difficult for an artist to modify the shapes in a rasterized image directly. With a low reconstruction loss, our model can decompose an image into several manipulable components with high fidelity and flexibility. For example, in Fig. 7, it is easy for an icon designer to modify parameters of the shapes, changing the angle between the hands of the clock, or tweaking the shape of the paint brush head. For real world images in Fig. 8, we can directly manage the position of the parts in an image using high-level editing tools (e.g., as in Ref. [33]).

Fig. 7 Image editing on a rasterized image at a primitive level. Primitive detection is performed on the image, followed by editing of the primitives.

Fig. 8 High-level image editing of real world images based on detected primitives. The first two columns of each group show the original image and its layered decomposition, while the last two columns of each group show manipulated results.

Another potential application is recognition-by-components [5]. Usually, state-of-the-art classifiers based on deep networks need a great deal of data for training, and its lack hampers accuracy. Once primitives in an image have been recognized, one can easily define classification rules using the layered information obtained. Additional training data is not needed and only a single shape detection model has to be trained. The idea is illustrated in Fig. 9. Given an image, pre-processing steps such as denoising and thresholding are performed to extract the borders of shapes. The proposed model is then applied to detect the primitives and generate a shape parsing tree (in XML format in the figure, for demonstration purposes), with which a handcrafted classifier could easily predict the class of an object in the image by top–down traversal of the tree.
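To make the idea concrete, a toy sketch of such a parsing tree and a handcrafted rule; the node layout and the rules are entirely our own illustration, not the classifier used in Fig. 9:

```python
class PrimNode:
    """One detected primitive plus the primitives contained inside it."""
    def __init__(self, kind, params=None, children=None):
        self.kind = kind
        self.params = params or {}
        self.children = children or []

def classify(node):
    """Top-down rule-based labeling over the parsing tree (toy rules)."""
    child_kinds = sorted(child.kind for child in node.children)
    if node.kind == 'rectangle' and child_kinds == ['ellipse', 'ellipse']:
        return 'bus'            # a body containing two wheels, say
    if node.kind == 'ellipse' and child_kinds == ['rectangle', 'rectangle']:
        return 'clock'          # a dial containing two hands, say
    return 'unknown'
```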
8 Limitations

As an explorative study aiming to understand and reconstruct images as primitives composed layer-wise, there are several limitations left to be resolved in future work. For images with highly-overlapping primitives within the same layer, our model cannot distinguish between them: the output will either be a single primitive or misclassified primitives. Our model discovers only containment relationships: if one higher-level primitive intersects multiple lower-level primitives, duplicate detections of the higher-level primitive are possible. The last two images of line 4 in Fig. 6 demonstrate such failures. These limitations restrict the layer decomposability of our model. Meanwhile, only synthetic images are used for training. Annotated real world data would make the model more generalizable.
Fig. 9 Recognition-by-components demonstration using our proposed hierarchical primitive detection model.

9 Conclusions

This paper demonstrates a data-driven approach to layered detection of primitives in images, and subsequent 2D reconstruction. As noted, abstraction of objects into primitives is a very natural way for humans to understand objects. As artificial intelligence moves towards performing tasks in human-like fashion, there is value in trying to perform these tasks in the way a human would.

Such tasks often also fall at the intersection of robotics and computer vision, e.g., in the cases of autonomous driving and robotics. In such tasks, building environment-awareness into cars or robots based on their field of vision is key, and primitive-level reconstruction would be useful. Primitive-level understanding would also help in understanding physical interactions with objects in manipulation tasks. While there are many such avenues where this understanding could be applied, there is a lack of open datasets for training on real world data. A good direction for future study would involve learning tasks of an unsupervised or self-supervised kind.

Acknowledgements

Chengcheng Tang would like to acknowledge NSF grant IIS-1528025, a Google Focused Research award, a gift from the Adobe Corporation, and a gift from the NVIDIA Corporation.

Electronic Supplementary Material  Supplementary material with detailed experimental configuration and results is available in the online version of this article at https://doi.org/10.1007/s41095-018-0128-6.

References

[1] Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 779–788, 2016.
[2] Redmon, J.; Farhadi, A. YOLO9000: Better, faster, stronger. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 6517–6525, 2017.
[3] Roberts, L. G. Machine perception of three-dimensional solids. Ph.D. Thesis. Massachusetts Institute of Technology, 1963.
[4] Binford, T. O. Visual perception by computer. In: Proceedings of the IEEE Conference on Systems and Control, 1971.
[5] Biederman, I. Recognition-by-components: A theory of human image understanding. Psychological Review Vol. 94, No. 2, 115–147, 1987.
[6] Bellver, M.; Giro-i-Nieto, X.; Marques, F.; Torres, J. Hierarchical object detection with deep reinforcement learning. In: Proceedings of the Deep Reinforcement Learning Workshop, NIPS, 2016.
[7] Ballard, D. H. Generalizing the Hough transform to detect arbitrary shapes. Pattern Recognition Vol. 13, No. 2, 111–122, 1981.
[8] Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Transactions on Pattern Analysis and Machine Intelligence Vol. 39, No. 6, 1137–1149, 2017.
[9] Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.-Y.; Berg, A. C. SSD: Single shot multibox detector. In: Computer Vision – ECCV 2016. Lecture Notes in Computer Science, Vol. 9905. Leibe, B.; Matas, J.; Sebe, N.; Welling, M. Eds. Springer Cham, 21–37, 2016.
[10] Higgins, I.; Sonnerat, N.; Matthey, L.; Pal, A.; Burgess, C.; Botvinick, M.; Hassabis, D.; Lerchner, A. SCAN: Learning abstract hierarchical compositional visual concepts. arXiv preprint arXiv:1707.03389, 2017.
[11] Lake, B. M.; Salakhutdinov, R.; Tenenbaum, J. B. Human-level concept learning through probabilistic program induction. Science Vol. 350, No. 6266, 1332–1338, 2015.
[12] Rogers, D. F.; Fog, N. Constrained B-spline curve and surface fitting. Computer-Aided Design Vol. 21, No. 10, 641–648, 1989.
[13] Besl, P. J.; McKay, N. D. A method for registration of 3-D shapes. IEEE Transactions on Pattern Analysis and Machine Intelligence Vol. 14, No. 2, 239–256, 1992.
[14] Chen, Y.; Medioni, G. Object modeling by registration of multiple range images. In: Proceedings of the IEEE International Conference on Robotics and Automation, 2724–2729, 1991.
[15] Wang, W.; Pottmann, H.; Liu, Y. Fitting B-spline curves to point clouds by curvature-based squared distance minimization. ACM Transactions on Graphics Vol. 25, No. 2, 214–238, 2006.
[16] Zheng, W.; Bo, P.; Liu, Y.; Wang, W. Fast B-spline curve fitting by L-BFGS. Computer Aided Geometric Design Vol. 29, No. 7, 448–462, 2012.
[17] Sun, J.; Liang, L.; Wen, F.; Shum, H.-Y. Image vectorization using optimized gradient meshes. ACM Transactions on Graphics Vol. 26, No. 3, Article No. 11, 2007.
[18] Lecot, G.; Levy, B. Ardeco: Automatic region detection and conversion. In: Proceedings of the 17th Eurographics Symposium on Rendering Techniques, 349–360, 2006.
[19] Lin, T.-Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2117–2125, 2017.
[20] Sharma, G.; Goyal, R.; Liu, D.; Kalogerakis, E.; Maji, S. CSGNet: Neural shape parser for constructive solid geometry. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 5515–5523, 2018.
[21] Gers, F. A.; Schraudolph, N. N.; Schmidhuber, J. Learning precise timing with LSTM recurrent networks. Journal of Machine Learning Research Vol. 3, No. 1, 115–143, 2002.
[22] Cho, K.; Merriënboer, B. V.; Gulcehre, C.; Bahdanau, D.; Bougares, F.; Schwenk, H.; Bengio, Y. Learning phrase representations using RNN encoder–decoder for statistical machine translation. arXiv preprint arXiv:1406.1078, 2014.
[23] Castrejón, L.; Kundu, K.; Urtasun, R.; Fidler, S. Annotating object instances with a polygon-RNN. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 5230–5238, 2017.
[24] Jetley, S.; Sapienza, M.; Golodetz, S.; Torr, P. H. S. Straight to shapes: Real-time detection of encoded shapes. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 4207–4216, 2017.
[25] Girshick, R. Fast R-CNN. In: Proceedings of the IEEE International Conference on Computer Vision, 1440–1448, 2015.
[26] Bengio, S.; Vinyals, O.; Jaitly, N.; Shazeer, N. Scheduled sampling for sequence prediction with recurrent neural networks. In: Advances in Neural Information Processing Systems 28. Cortes, C.; Lawrence, N. D.; Lee, D. D.; Sugiyama, M.; Garnett, R. Eds. Curran Associates, Inc., 1171–1179, 2015.
[27] Wu, J.; Tenenbaum, J. B.; Kohli, P. Neural scene de-rendering. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
[28] Itseez. Open source computer vision library. 2015. Available at https://github.com/itseez/opencv.
[29] Duda, R. O.; Hart, P. E. Use of the Hough transformation to detect lines and curves in pictures. Communications of the ACM Vol. 15, No. 1, 11–15, 1972.
[30] Xie, Y.; Ji, Q. A new efficient ellipse detection method. In: Proceedings of the IEEE International Conference on Pattern Recognition, Vol. 2, 957–960, 2002.
[31] Google. Google material icon. 2017. Available at https://material.io/icons/.
[32] Everingham, M. The PASCAL Visual Object Classes Challenge 2012 (VOC2012). Available at http://www.pascal-network.org/challenges/VOC/voc2012/workshop/index.html.
[33] Barnes, C.; Shechtman, E.; Finkelstein, A.; Goldman, D. B. PatchMatch: A randomized correspondence algorithm for structural image editing. ACM Transactions on Graphics Vol. 28, No. 3, Article No. 24, 2009.

Jiahui Huang received his B.S. degree in computer science and technology from Tsinghua University in 2018. He is currently a Ph.D. candidate in computer science at Tsinghua University. His research interests include computer vision and computer graphics.

Jun Gao received his B.S. degree in computer science from Peking University in 2018. He is a graduate student in the Machine Learning Group at the University of Toronto and is also affiliated with the Vector Institute. His research interests are in deep learning and computer vision.

Vignesh G. Subramanian is a Ph.D. candidate in the Department of Electrical Engineering, Stanford University. He previously obtained his dual degrees (B.Tech. in EE and M.Tech. in communication engineering) from IIT Madras, India. His research interests include shape correspondences, 3D geometry, graphics, and vision.
Hao Su received his Ph.D. degree from Stanford University, under the supervision of Leonidas Guibas. He joined UC San Diego in 2017 and is currently an assistant professor of computer science and engineering. His research interests include computer vision, computer graphics, machine learning, robotics, and optimization. More details of his research can be found at http://ai.ucsd.edu/haosu.

Yin Liu received his B.S. degree from the Department of Automation of Tsinghua University in 2018. He is currently a Ph.D. candidate in computer science at the University of Wisconsin-Madison. His research interest is in machine learning.

Chengcheng Tang received his Ph.D. and M.S. degrees from King Abdullah University of Science and Technology (KAUST) in 2015 and 2011, respectively, and his bachelor degree from Jilin University in 2009. He is currently a postdoctoral scholar in the Computer Science Department at Stanford University. His research interests include computer graphics, geometric computing, computational design, and machine learning.

Leonidas J. Guibas received his Ph.D. degree from Stanford University in 1976, under the supervision of Donald Knuth. His main subsequent employers were Xerox PARC, MIT, and DEC/SRC. Since 1984, he has been at Stanford University, where he is a professor of computer science. His research interests include computational geometry, geometric modeling, computer graphics, computer vision, sensor networks, robotics, and discrete algorithms. He is a senior member of the IEEE and the IEEE Computer Society. More details about his research can be found at http://geometry.stanford.edu/member/guibas/.

Open Access  The articles published in this journal are distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.

Other papers from this open access journal are available free of charge from http://www.springer.com/journal/41095. To submit a manuscript, please go to https://www.editorialmanager.com/cvmj.