Selective Search for Object Recognition
DOI 10.1007/s11263-013-0620-5
Received: 5 May 2012 / Accepted: 11 March 2013 / Published online: 2 April 2013
© Springer Science+Business Media New York 2013
Abstract This paper addresses the problem of generating possible object locations for use in object recognition. We introduce selective search which combines the strength of both an exhaustive search and segmentation. Like segmentation, we use the image structure to guide our sampling process. Like exhaustive search, we aim to capture all possible object locations. Instead of a single technique to generate possible object locations, we diversify our search and use a variety of complementary image partitionings to deal with as many image conditions as possible. Our selective search results in a small set of data-driven, class-independent, high quality locations, yielding 99 % recall and a Mean Average Best Overlap of 0.879 at 10,097 locations. The reduced number of locations compared to an exhaustive search enables the use of stronger machine learning techniques and stronger appearance models for object recognition. In this paper we show that our selective search enables the use of the powerful Bag-of-Words model for recognition. The selective search software is made publicly available (Software: http://disi.unitn.it/~uijlings/SelectiveSearch.html).

J. R. R. Uijlings (B)
University of Trento, Trento, Italy
e-mail: [email protected]; [email protected]

K. E. A. van de Sande · T. Gevers · A. W. M. Smeulders
University of Amsterdam, Amsterdam, The Netherlands

1 Introduction

For a long time, objects were sought to be delineated before their identification. This gave rise to segmentation, which aims for a unique partitioning of the image through a generic algorithm, where there is one part for all object silhouettes in the image. Research on this topic has yielded tremendous progress over the past years (Arbeláez et al. 2011; Comaniciu and Meer 2002; Felzenszwalb and Huttenlocher 2004; Shi and Malik 2000). But images are intrinsically hierarchical: In Fig. 1a the salad and spoons are inside the salad bowl, which in turn stands on the table. Furthermore, depending on the context the term table in this picture can refer to only the wood or include everything on the table. Therefore both the nature of images and the different uses of an object category are hierarchical. This prohibits the unique partitioning of objects for all but the most specific purposes. Hence for most tasks multiple scales in a segmentation are a necessity. This is most naturally addressed by using a hierarchical partitioning, as done for example by Arbeláez et al. (2011).

Besides that a segmentation should be hierarchical, a generic solution for segmentation using a single strategy may not exist at all. There are many conflicting reasons why a region should be grouped together: In Fig. 1b the cats can be separated using colour, but their texture is the same. Conversely, in Fig. 1c the chameleon is similar to its surrounding leaves in terms of colour, yet its texture differs. Finally, in Fig. 1d, the wheels are wildly different from the car in terms of both colour and texture, yet are enclosed by the car. Individual visual features therefore cannot resolve the ambiguity of segmentation.

And, finally, there is a more fundamental problem. Regions with very different characteristics, such as a face over a sweater, can only be combined into one object after it has been established that the object at hand is a human. Hence without prior recognition it is hard to decide that a face and a sweater are part of one object (Tu et al. 2005).

This has led to the opposite of the traditional approach: to do localisation through the identification of an object. This recent approach in object recognition has made enormous progress in less than a decade (Dalal and Triggs 2005; Felzenszwalb et al. 2010; Harzallah et al. 2009; Viola and
Int J Comput Vis (2013) 104:154–171 155
Fig. 1 There is a high variety of reasons that an image region forms an object. In (b) the cats can be distinguished by colour, not texture. In (c) the chameleon can be distinguished from the surrounding leaves by texture, not colour. In (d) the wheels can be part of the car because they are enclosed, not because they are similar in texture or colour. Therefore, to find objects in a structured way it is necessary to use a variety of diverse strategies. Furthermore, an image is intrinsically hierarchical as there is no single scale for which the complete table, salad bowl, and salad spoon can be found in (a)
Jones 2001). With an appearance model learned from examples, an exhaustive search is performed where every location within the image is examined so as not to miss any potential object location (Dalal and Triggs 2005; Felzenszwalb et al. 2010; Harzallah et al. 2009; Viola and Jones 2001).

However, the exhaustive search itself has several drawbacks. Searching every possible location is computationally infeasible. The search space has to be reduced by using a regular grid, fixed scales, and fixed aspect ratios. In most cases the number of locations to visit remains huge, so much so that alternative restrictions need to be imposed. The classifier is simplified and the appearance model needs to be fast. Furthermore, a uniform sampling yields many boxes for which it is immediately clear that they are not supportive of an object. Rather than sampling locations blindly using an exhaustive search, a key question is: can we steer the sampling by a data-driven analysis?

In this paper, we aim to combine the best of the intuitions of segmentation and exhaustive search and propose a data-driven selective search. Inspired by bottom-up segmentation, we aim to exploit the structure of the image to generate object locations. Inspired by exhaustive search, we aim to capture all possible object locations. Therefore, instead of using a single sampling technique, we aim to diversify the sampling techniques to account for as many image conditions as possible. Specifically, we use a data-driven grouping-based strategy where we increase diversity by using a variety of complementary grouping criteria and a variety of complementary colour spaces with different invariance properties. The set of locations is obtained by combining the locations of these complementary partitionings. Our goal is to generate a class-independent, data-driven, selective search strategy that generates a small set of high-quality object locations.

Our application domain of selective search is object recognition. We therefore evaluate on the most commonly used dataset for this purpose, the Pascal VOC detection challenge, which consists of 20 object classes. The size of this dataset yields computational constraints for our selective search. Furthermore, the use of this dataset means that the quality of locations is mainly evaluated in terms of bounding boxes. However, our selective search applies to regions as well and is also applicable to concepts such as "grass".

In this paper we propose selective search for object recognition. Our main research questions are: (1) What are good diversification strategies for adapting segmentation as a selective search strategy? (2) How effective is selective search in creating a small set of high-quality locations within an image? (3) Can we use selective search to employ more powerful classifiers and appearance models for object recognition?

2 Related Work

We confine the related work to the domain of object recognition and divide it into three categories: exhaustive search, segmentation, and other sampling strategies that do not fall in either category.

2.1 Exhaustive Search

As an object can be located at any position and scale in the image, it is natural to search everywhere (Dalal and Triggs 2005; Harzallah et al. 2009; Viola and Jones 2004). However, the visual search space is huge, making an exhaustive search computationally expensive. This imposes constraints on the evaluation cost per location and/or the number of locations considered. Hence most of these sliding window techniques use a coarse search grid and fixed aspect ratios, using weak classifiers and economic image features such as HOG (Dalal and Triggs 2005; Harzallah et al. 2009; Viola and Jones 2004). This method is often used as a preselection step in a cascade of classifiers (Harzallah et al. 2009; Viola and Jones 2004).

Related to the sliding window technique is the highly successful part-based object localisation method of Felzenszwalb et al. (2010). Their method also performs an exhaustive search using a linear SVM and HOG features. However, they search for objects and object parts, whose combination results in an impressive object detection performance.
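To make the combinatorics of a sliding window search concrete, the following sketch enumerates box candidates on a regular grid. It is illustrative only and not from the paper: the image size, scales, aspect ratios, and stride are hypothetical choices, yet even this coarse grid already produces thousands of windows, which is why exhaustive methods must restrict scales and aspect ratios.

```python
# Illustrative sketch (not the paper's method): counting sliding-window
# candidates on a regular grid with fixed scales and aspect ratios.

def sliding_windows(img_w, img_h, scales, aspect_ratios, stride):
    """Enumerate (x, y, w, h) boxes on a regular grid inside the image."""
    boxes = []
    for s in scales:                      # s = window height in pixels
        for ar in aspect_ratios:          # ar = width / height
            h = s
            w = int(round(s * ar))
            if w > img_w or h > img_h:
                continue                  # window does not fit at this scale
            for y in range(0, img_h - h + 1, stride):
                for x in range(0, img_w - w + 1, stride):
                    boxes.append((x, y, w, h))
    return boxes

# A coarse grid on a typical Pascal-sized image already yields thousands
# of boxes; a per-pixel grid over many scales would yield millions.
coarse = sliding_windows(500, 375, scales=[64, 128, 256],
                         aspect_ratios=[0.5, 1.0, 2.0], stride=16)
print(len(coarse))
```

Shrinking the stride or adding scales multiplies the count, which is the pressure towards weak classifiers and cheap features discussed above.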
Lampert et al. (2009) proposed using the appearance model to guide the search. This both alleviates the constraints of using a regular grid, fixed scales, and fixed aspect ratio, while at the same time reducing the number of locations visited. This is done by directly searching for the optimal window within the image using a branch and bound technique. While they obtain impressive results for linear classifiers, Alexe et al. (2010) found that for non-linear classifiers the method in practice still visits over 100,000 windows per image.

Instead of a blind exhaustive search or a branch and bound search, we propose selective search. We use the underlying image structure to generate object locations. In contrast to the discussed methods, this yields a completely class-independent set of locations. Furthermore, because we do not use a fixed aspect ratio, our method is not limited to objects but should be able to find stuff like "grass" and "sand" as well (this also holds for Lampert et al. (2009)). Finally, we hope to generate fewer locations, which should make the problem easier as the variability of samples becomes lower. More importantly, it frees up computational power which can be used for stronger machine learning techniques and more powerful appearance models.

2.2 Segmentation

Both Carreira and Sminchisescu (2010) and Endres and Hoiem (2010) propose to generate a set of class-independent object hypotheses using segmentation. Both methods generate multiple foreground/background segmentations, learn to predict the likelihood that a foreground segment is a complete object, and use this to rank the segments. Both algorithms show a promising ability to accurately delineate objects within images, confirmed by Li et al. (2010) who achieve state-of-the-art results on pixel-wise image classification using Carreira and Sminchisescu (2010). As common in segmentation, both methods rely on a single strong algorithm for identifying good regions. They obtain a variety of locations by using many randomly initialised foreground and background seeds. In contrast, we explicitly deal with a variety of image conditions by using different grouping criteria and different representations. This means a lower computational investment as we do not have to invest in the single best segmentation strategy, such as using the excellent yet expensive contour detector of Arbeláez et al. (2011). Furthermore, as we deal with different image conditions separately, we expect our locations to have a more consistent quality. Finally, our selective search paradigm dictates that the most interesting question is not how our regions compare to Carreira and Sminchisescu (2010), Endres and Hoiem (2010), but rather how they can complement each other.

Gu et al. (2009) address the problem of carefully segmenting and recognizing objects based on their parts. They first generate a set of part hypotheses using a grouping method based on Arbeláez et al. (2011). Each part hypothesis is described by both appearance and shape features. Then, an object is recognized and carefully delineated by using its parts, achieving good results for shape recognition. In their work, the segmentation is hierarchical and yields segments at all scales. However, they use a single grouping strategy whose power of discovering parts or objects is left unevaluated. In this work, we use multiple complementary strategies to deal with as many image conditions as possible. We include the locations generated using Arbeláez et al. (2011) in our evaluation.

2.3 Other Sampling Strategies

Alexe et al. (2012) address the problem of the large sampling space of an exhaustive search by proposing to search for any object, independent of its class. In their method they train a classifier on the object windows of those objects which have a well-defined shape (as opposed to stuff like "grass" and "sand"). Then instead of a full exhaustive search they randomly sample boxes to which they apply their classifier. The boxes with the highest "objectness" measure serve as a set of object hypotheses. This set is then used to greatly reduce the number of windows evaluated by class-specific object detectors. We compare our method with their work.

Another strategy is to use visual words of the Bag-of-Words model to predict the object location. Vedaldi (2009) uses jumping windows (Chum and Zisserman 2007), in which the relation between individual visual words and the object location is learned to predict the object location in new images. Maji and Malik (2009) combine multiple of these relations to predict the object location using a Hough transform, after which they randomly sample windows close to the Hough maximum. In contrast to learning, we use the image structure to sample a set of class-independent object hypotheses.

To summarize, our novelty is as follows. Instead of an exhaustive search (Dalal and Triggs 2005; Felzenszwalb et al. 2010; Harzallah et al. 2009; Viola and Jones 2004) we use segmentation as selective search, yielding a small set of class-independent object locations. In contrast to the segmentation of Carreira and Sminchisescu (2010), Endres and Hoiem (2010), instead of focusing on the best segmentation algorithm (Arbeláez et al. 2011), we use a variety of strategies to deal with as many image conditions as possible, thereby severely reducing computational costs while potentially capturing more objects accurately. Instead of learning an objectness measure on randomly sampled boxes (Alexe et al. 2012), we use a bottom-up grouping procedure to generate good object locations.
Fig. 2 Two examples of our selective search showing the necessity of different scales. On the left we find many objects at different scales. On the
right we necessarily find the objects at different scales as the girl is contained by the tv
We take a hierarchical grouping algorithm to form the basis of our selective search. Bottom-up grouping is a popular approach to segmentation (Comaniciu and Meer 2002; Felzenszwalb and Huttenlocher 2004), hence we adapt it for

The second design criterion for selective search is to diversify the sampling and create a set of complementary strategies whose locations are combined afterwards. We diversify our selective search (1) by using a variety of colour spaces with
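The bottom-up grouping idea can be sketched as follows. This is a minimal illustration and not the paper's implementation: regions carry only a size, the neighbourhood graph is given (e.g. from an oversegmentation), and for brevity only the size term of the paper's similarity measures, s_size(ri, rj) = 1 − (size(ri) + size(rj)) / size(im), is used; all function and field names are ours.

```python
# Minimal sketch of greedy bottom-up hierarchical grouping: repeatedly
# merge the most similar pair of neighbouring regions; every region
# created along the way is kept as an object-location hypothesis.

def s_size(ri, rj, im_size):
    # size similarity: encourages small regions to merge early
    return 1.0 - (ri["size"] + rj["size"]) / im_size

def hierarchical_grouping(regions, neighbours, im_size):
    """regions: {id: {'size': int}}; neighbours: set of (i, j) pairs, i < j.
    Returns (ids of created regions, remaining regions)."""
    regions = {k: dict(v) for k, v in regions.items()}
    neighbours = set(neighbours)
    hypotheses = []
    next_id = max(regions) + 1
    while neighbours:
        # pick the most similar neighbouring pair and merge it
        i, j = max(neighbours, key=lambda p: s_size(regions[p[0]], regions[p[1]], im_size))
        regions[next_id] = {"size": regions[i]["size"] + regions[j]["size"]}
        hypotheses.append(next_id)
        # rewire: neighbours of i or j become neighbours of the merged region
        new_nb = set()
        for a, b in neighbours:
            if a in (i, j) or b in (i, j):
                other = b if a in (i, j) else a
                if other not in (i, j):
                    new_nb.add((min(other, next_id), max(other, next_id)))
            else:
                new_nb.add((a, b))
        neighbours = new_nb
        del regions[i], regions[j]
        next_id += 1
    return hypotheses, regions
```

In the paper's full algorithm the merged region would also carry colour and texture histograms and a bounding box, and the similarity would combine all four measures; the greedy merge loop itself has the shape shown here.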
Table 1 The invariance properties of both the individual colour channels and the colour spaces used in this paper, sorted by degree of invariance
Colour channels R G B I V L a b S r g C H
fill(r_i, r_j) = 1 − (size(BB_ij) − size(r_i) − size(r_j)) / size(im)   (5)

We divide by size(im) for consistency with Eq. 4. Note that this measure can be efficiently calculated by keeping track of the bounding boxes around each region, as the bounding box around two regions can be easily derived from these.

In this paper, our final similarity measure is a combination of the above four:

s(r_i, r_j) = a_1 s_colour(r_i, r_j) + a_2 s_texture(r_i, r_j) + a_3 s_size(r_i, r_j) + a_4 s_fill(r_i, r_j),   (6)

where a_i ∈ {0, 1} denotes if the similarity measure is used or not. As we aim to diversify our strategies, we do not consider any weighted similarities.

Complementary Starting Regions. A third diversification strategy is varying the complementary starting regions. To the best of our knowledge, the method of Felzenszwalb and Huttenlocher (2004) is the fastest, publicly available algorithm that yields high-quality starting locations. We could not find any other algorithm with similar computational efficiency, so we use only this oversegmentation in this paper. But note that different starting regions are (already) obtained by varying the colour spaces, each of which has different invariance properties. Additionally, we vary the threshold parameter k in Felzenszwalb and Huttenlocher (2004).

3.3 Combining Locations

In this paper, we combine the object hypotheses of several variations of our hierarchical grouping algorithm. Ideally, we want to order the object hypotheses in such a way that the locations which are most likely to be an object come first. This enables one to find a good trade-off between the quality and quantity of the resulting object hypothesis set, depending on the computational efficiency of the subsequent feature extraction and classification method.

We choose to order the combined object hypotheses set based on the order in which the hypotheses were generated in each individual grouping strategy. However, as we combine results from up to 80 different strategies, such an order would too heavily emphasize large regions. To prevent this, we include some randomness as follows. Given a grouping strategy j, let r_i^j be the region which is created at position i in the hierarchy, where i = 1 represents the top of the hierarchy (whose corresponding region covers the complete image). We now calculate the position value v_i^j as RND × i, where RND is a random number in range [0, 1]. The final ranking is obtained by ordering the regions using v_i^j.

When we use locations in terms of bounding boxes, we first rank all the locations as detailed above. Only afterwards do we filter out lower-ranked duplicates. This ensures that duplicate boxes have a better chance of obtaining a high rank. This is desirable because if multiple grouping strategies suggest the same box location, it is likely to come from a visually coherent part of the image.
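The randomized ranking and duplicate filtering described above can be sketched as follows. Assumptions: each strategy's hypotheses come ordered from the top of its merge hierarchy (position i = 1 is the root covering the whole image), locations are hashable values, and the function name is ours.

```python
# Sketch of the combination step: position value v_i^j = RND * i randomizes
# the cross-strategy ordering while still favouring regions created high in
# each hierarchy; duplicate boxes are filtered only AFTER ranking, so that
# boxes proposed by several strategies keep their best (lowest) rank.

import random

def rank_and_filter(strategies, seed=0):
    """strategies: list of per-strategy hypothesis lists, each ordered from
    the top of its hierarchy (i = 1) downwards. Returns a ranked list."""
    rng = random.Random(seed)            # seeded here only for repeatability
    scored = []
    for hierarchy in strategies:
        for i, box in enumerate(hierarchy, start=1):
            scored.append((rng.random() * i, box))   # v_i^j = RND * i
    scored.sort(key=lambda t: t[0])      # lower position value = better rank
    ranked, seen = [], set()
    for _, box in scored:                # drop lower-ranked duplicates
        if box not in seen:
            seen.add(box)
            ranked.append(box)
    return ranked
```

Because i grows as regions get smaller, multiplying by a uniform random number interleaves the strategies without letting any single one dominate the front of the list.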
Fig. 3 The training procedure of our object recognition pipeline. As positive learning examples we use the ground truth. As negatives we use
examples that have a 20–50 % overlap with the positive examples. We iteratively add hard negatives using a retraining phase
of colour-SIFT descriptors (van de Sande et al. 2010) and a finer spatial pyramid division (Lazebnik et al. 2006).

Specifically we sample descriptors at each pixel on a single scale (σ = 1.2). Using software from van de Sande et al. (2010), we extract SIFT (Lowe 2004) and the two colour SIFTs which were found to be the most sensitive for detecting image structures: Extended OpponentSIFT (van de Sande et al. 2012) and RGB-SIFT (van de Sande et al. 2010). We use a visual codebook of size 4,000 and a spatial pyramid with 4 levels using a 1×1, 2×2, 3×3 and 4×4 division. This gives a total feature vector length of 360,000. In image classification, features of this size are already used (Perronnin et al. 2010; Zhou et al. 2010). Because a spatial pyramid results in a coarser spatial subdivision than the cells which make up a HOG descriptor, our features contain less information about the specific spatial layout of the object. Therefore, HOG is better suited for rigid objects and our features are better suited for deformable object types.

As classifier we employ a Support Vector Machine with a histogram intersection kernel using the Shogun Toolbox (Sonnenburg et al. 2010). To apply the trained classifier, we use the fast, approximate classification strategy of Maji et al. (2008), which was shown to work well for Bag-of-Words in Uijlings et al. (2010).

Our training procedure is illustrated in Fig. 3. The initial positive examples consist of all ground truth object windows. As initial negative examples we select those object locations generated by our selective search that have an overlap of 20–50 % with a positive example. To avoid near-duplicate negative examples, a negative example is excluded if it has more than 70 % overlap with another negative. To keep the number of initial negatives per class below 20,000, we randomly drop half of the negatives for the classes car, cat, dog and person. Intuitively, this set of examples can be seen as difficult negatives which are close to the positive examples. This means they are close to the decision boundary and are therefore likely to become support vectors even when the complete set of negatives would be considered. Indeed, we found that this selection of training examples gives reasonably good initial classification models.

Then we enter a retraining phase to iteratively add hard negative examples (e.g. Felzenszwalb et al. (2010)): we apply the learned models to the training set using the locations generated by our selective search. For each negative image we add the highest scoring location. As our initial training set already yields good models, our models converge in only two iterations.

For the test set, the final model is applied to all locations generated by our selective search. The windows are sorted by classifier score; windows which have more than 30 % overlap with a higher scoring window are considered near-duplicates and are removed.

5 Evaluation

In this section we evaluate the quality of our selective search. We divide our experiments into four parts, each spanning a separate subsection:

Diversification Strategies. We experiment with a variety of colour spaces, similarity measures, and thresholds of the initial regions, all of which were detailed in Sect. 3.2. We seek a trade-off between the number of generated object hypotheses, computation time, and the quality of object locations. We do this in terms of bounding boxes. This results in a selection of complementary techniques which together serve as our final selective search method.

Quality of Locations. We test the quality of the object location hypotheses resulting from the selective search.

Object Recognition. We use the locations of our selective search in the object recognition framework detailed in Sect. 4. We evaluate performance on the Pascal VOC detection challenge.

An upper bound of location quality. We investigate how well our object recognition framework performs when using an object hypothesis set of "perfect" quality. How does this compare to the locations that our selective search generates?
To evaluate the quality of our object hypotheses we define the Average Best Overlap (ABO) and Mean Average Best Overlap (MABO) scores, which slightly generalise the measure used in Endres and Hoiem (2010). To calculate the Average Best Overlap for a specific class c, we calculate the best overlap between each ground truth annotation g_i^c ∈ G^c and the object hypotheses L generated for the corresponding image, and average:

ABO = (1 / |G^c|) Σ_{g_i^c ∈ G^c} max_{l_j ∈ L} Overlap(g_i^c, l_j).   (7)

The Overlap score is taken from Everingham et al. (2010) and measures the area of the intersection of two regions divided by its union:

Overlap(g_i^c, l_j) = (area(g_i^c) ∩ area(l_j)) / (area(g_i^c) ∪ area(l_j)).   (8)

Analogously to Average Precision and Mean Average Precision, Mean Average Best Overlap is now defined as the mean ABO over all classes.

Other work often uses the recall derived from the Pascal Overlap Criterion to measure the quality of the boxes (Alexe et al. 2010; Harzallah et al. 2009; Vedaldi 2009). This criterion considers an object to be found when the Overlap of Eq. 8 is larger than 0.5. However, in many of our experiments we obtain a recall between 95 and 100 % for most classes, making this measure too insensitive for this paper. However, we do report this measure when comparing with other work.

To avoid overfitting, we perform the diversification strategies experiments on the Pascal VOC 2007 train+val set. Other experiments are done on the Pascal VOC 2007 test set. Additionally, our object recognition system is benchmarked on the Pascal VOC 2010 detection challenge, using the independent evaluation server.

5.1 Diversification Strategies

5.1.1 Flat Versus Hierarchy

In the description of our method we claim that using a full hierarchy is more natural than using multiple flat partitionings by changing a threshold. In this section we test whether the use of a hierarchy also leads to better results. We therefore compare the use of Felzenszwalb and Huttenlocher (2004) with multiple thresholds against our proposed algorithm. Specifically, we perform both strategies in RGB colour space. For Felzenszwalb and Huttenlocher (2004), we vary the threshold from k = 50 to k = 1,000 in steps of 50. This range captures both small and large regions. Additionally, as a special type of threshold, we include the whole image as an object location because quite a few images contain a single large object only. Furthermore, we also take a coarser range from k = 50 to k = 950 in steps of 100. For our algorithm, to create initial regions we use a threshold of k = 50, ensuring that both strategies have an identical smallest scale. Additionally, as we generate fewer regions, we combine results using k = 50 and k = 100. As similarity measure S we use the addition of all four similarities as defined in Eq. 6. Results are in Table 2.

As can be seen, the quality of object hypotheses is better for our hierarchical strategy than for multiple flat partitionings: at a similar number of regions, our MABO score is consistently higher. Moreover, the increase in MABO achieved by combining the locations of two variants of our hierarchical grouping algorithm is much higher than the increase achieved by adding extra thresholds for the flat partitionings. We conclude that using all locations from a hierarchical grouping algorithm is not only more natural but also more effective than using multiple flat partitionings.

5.1.2 Individual Diversification Strategies
Table 2 A comparison of multiple flat partitionings against hierarchical partitionings for generating box locations shows that for the hierarchical strategy the Mean Average Best Overlap (MABO) score is consistently higher at a similar number of locations

Strategy (threshold k in Felzenszwalb and Huttenlocher (2004))      MABO    No. of windows
Flat Felzenszwalb and Huttenlocher (2004), k = 50, 150, …, 950      0.659   387
Hierarchical (this paper), k = 50                                   0.676   395
Flat Felzenszwalb and Huttenlocher (2004), k = 50, 100, …, 1000     0.673   597
Hierarchical (this paper), k = 50, 100                              0.719   625
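For axis-aligned boxes, the ABO and MABO scores of Eqs. 7 and 8 reduce to a few lines of code. This is a sketch with names of our own choosing; the `overlap` helper implements Eq. 8's intersection over union on box coordinates.

```python
# Sketch of the evaluation measures: Average Best Overlap (Eq. 7) for one
# class, and Mean Average Best Overlap as the mean ABO over classes.
# Boxes are (x1, y1, x2, y2) tuples.

def overlap(g, l):
    """Eq. 8: intersection area divided by union area."""
    ix = max(0, min(g[2], l[2]) - max(g[0], l[0]))
    iy = max(0, min(g[3], l[3]) - max(g[1], l[1]))
    inter = ix * iy
    area = lambda b: (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area(g) + area(l) - inter)

def abo(ground_truth, hypotheses):
    # Eq. 7: average over ground-truth boxes of the best overlap achieved
    return sum(max(overlap(g, l) for l in hypotheses)
               for g in ground_truth) / len(ground_truth)

def mabo(per_class_gt, hypotheses):
    # mean ABO over all classes, analogous to mean Average Precision
    return sum(abo(gt, hypotheses) for gt in per_class_gt.values()) / len(per_class_gt)
```

The Pascal Overlap Criterion mentioned above would then simply count a ground-truth box as found when its best overlap exceeds 0.5.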
Table 3 Mean Average Best Overlap for box-based object hypotheses histograms of the individual colour channels. However, we
using a variety of segmentation strategies obtained similar results (MABO of 0.577). We believe that
Similarities MABO No. of Box one reason of the weakness of texture is because of object
boundaries: When two segments are separated by an object
C 0.635 356
boundary, both sides of this boundary will yield similar edge-
T 0.581 303
responses, which inadvertently increases similarity.
S 0.640 466 While the texture similarity yields relatively few object
F 0.634 449 locations, at 300 locations the other similarity measures still
C+T 0.635 346 yield a MABO higher than 0.628. This suggests that when
C+S 0.660 383 comparing individual strategies the final MABO scores in
C+F 0.660 389 Table 3 are good indicators of trade-off between quality and
T+S 0.650 406 quantity of the object hypotheses. Another observation is that
T+F 0.638 400 combinations of similarity measures generally outperform
S+F 0.638 449 the single measures. In fact, using all four similarity measures
C+T+S 0.662 377 perform best yielding a MABO of 0.676.
C+T+F 0.659 381 Looking at variations in the colour space in the top-right
C+S+F 0.674 401 of Table 3, we observe large differences in results, ranging
T+S+F 0.655 427 from a MABO of 0.615 with 125 locations for the C colour
C+T+S+F 0.676 395 space to a MABO of 0.693 with 463 locations for the HSV
colour space. We note that Lab-space has a particularly good
Colours MABO No. of Box
MABO score of 0.690 using only 328 boxes. Furthermore,
HSV 0.693 463 the order of each hierarchy is effective: using the first 328
I 0.670 399 boxes of HSV colour space yields 0.690 MABO, while using
RGB 0.676 395 the first 100 boxes yields 0.647 MABO. This shows that
rgI 0.693 362 when comparing single strategies we can use only the MABO
Lab 0.690 328 scores to represent the trade-off between quality and quantity
Table 3 (fragment). (C)olour, (S)ize, and (F)ill perform similar. (T)exture by itself is weak. The best combination is as many diverse sources as possible

Colour space   MABO    No. of Boxes
H              0.644   322
rgb            0.647   207
C              0.615   125

Threshold      MABO    No. of Boxes
50             0.676   395
100            0.671   239
150            0.668   168
250            0.647   102
500            0.585   46
1,000          0.477   19

measures, and threshold k = 50. Each time we vary a single parameter. Results are given in Table 3.

We start examining the combination of similarity measures on the left part of Table 3. Looking first at colour, texture, size, and fill individually, we see that the texture similarity performs worst with a MABO of 0.581, while the other measures range between 0.63 and 0.64. To test if the relatively low score of texture is due to our choice of feature, we also tried to represent texture by Local Binary Patterns (Ojala et al. 2002). We experimented with 4 and 8 neighbours on different scales using different uniformity/consistency of the patterns (see Ojala et al. (2002)), where we concatenate LBP

of the object hypotheses set. We will use this in the next section when finding good combinations.

Experiments on the thresholds of Felzenszwalb and Huttenlocher (2004) to generate the starting regions show, in the bottom-right of Table 3, that a lower initial threshold results in a higher MABO using more object locations.

5.1.3 Combinations of Diversification Strategies

We combine object location hypotheses using a variety of complementary grouping strategies in order to get a good quality set of object locations. As a full search for the best combination is computationally expensive, we perform a greedy search using the MABO score only as optimization criterion. We have earlier observed that this score is representative for the trade-off between the number of locations and their quality.

From the resulting ordering we create three configurations: a single best strategy, a fast selective search, and a quality selective search using all combinations of individual components, i.e. colour space, similarities, and thresholds, as detailed in Table 4. The greedy search emphasizes variation in the combination of similarity measures. This confirms our diversification hypothesis: in the quality version, next to the combination of all similarities, Fill and Size are taken separately. The remainder of this paper uses the three strategies in Table 4.
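The greedy combination search of Sect. 5.1.3 can be sketched as a standard forward-selection loop. This is an illustrative sketch, not the authors' implementation; `evaluate_mabo` is a hypothetical callback that scores a set of strategies (e.g. the MABO of their pooled hypotheses) on validation data:

```python
def greedy_strategy_search(strategies, evaluate_mabo, max_size=None):
    """Greedy forward selection: repeatedly add the diversification
    strategy that most improves the score of the combined set."""
    chosen, remaining = [], list(strategies)
    best_score = 0.0
    while remaining and (max_size is None or len(chosen) < max_size):
        # Score every candidate extension of the current combination.
        scored = [(evaluate_mabo(chosen + [s]), s) for s in remaining]
        score, best = max(scored, key=lambda t: t[0])
        if score <= best_score:      # no candidate improves the score: stop
            break
        chosen.append(best)
        remaining.remove(best)
        best_score = score
    return chosen, best_score
```

In the paper, the ordering produced by such a search is then used to define the three configurations of Table 4 (single best strategy, "fast", and "quality").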
Int J Comput Vis (2013) 104:154–171 163
Table 5 Comparison of recall, Mean Average Best Overlap (MABO) and number of window locations for a variety of methods on the Pascal 2007
test set
Method Recall MABO No. of Windows
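The recall and MABO scores compared throughout this section are based on the standard Pascal intersection-over-union overlap. A minimal sketch of the computation (boxes as (x1, y1, x2, y2) tuples; an illustration, not the authors' code):

```python
def overlap(a, b):
    """Pascal overlap (intersection-over-union) of two boxes (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def abo(ground_truth, hypotheses):
    """Average Best Overlap: for each ground-truth box, take the best
    overlap achieved by any hypothesis, then average. MABO is the mean
    of the per-class ABO scores; recall counts ground-truth objects
    whose best overlap exceeds 0.5."""
    return sum(max(overlap(g, h) for h in hypotheses)
               for g in ground_truth) / len(ground_truth)
```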
5.2 Quality of Locations

In this section we evaluate our selective search algorithms in terms of both Average Best Overlap and the number of locations on the Pascal VOC 2007 test set. We first evaluate box-based locations and afterwards briefly evaluate region-based locations.

5.2.1 Box-Based Locations

We compare with the sliding window search of Harzallah et al. (2009), the sliding window search of Felzenszwalb et al. (2010) using the window ratios of their models, the jumping windows of Vedaldi (2009), the "objectness" boxes of Alexe et al. (2012), the boxes around the hierarchical segmentation algorithm of Arbeláez et al. (2011), the boxes around the regions of Endres and Hoiem (2010), and the boxes around the regions of Carreira and Sminchisescu (2010). Of these algorithms, only Arbeláez et al. (2011) is not designed for finding object locations. Yet Arbeláez et al. (2011) is one of the best contour detectors publicly available, and results in a natural hierarchy of regions. We include it in our evaluation to see if this algorithm, designed for segmentation, also performs well at finding good object locations. Furthermore, Carreira and Sminchisescu (2010) and Endres and Hoiem (2010) are designed to find good object regions rather than boxes. Results are shown in Table 5 and Fig. 4.

As shown in Table 5, our "Fast" and "Quality" selective search methods yield a close to optimal recall of 98 and 99 % respectively. In terms of MABO, we achieve 0.804 and 0.879 respectively. To appreciate what a Best Overlap of 0.879 means, Fig. 5 shows for bike, cow, and person an example location which has an overlap score between 0.874 and 0.884. This illustrates that our selective search yields high quality object locations.

Furthermore, note that the standard deviation of our MABO scores is relatively low: 0.046 for the fast selective search, and 0.039 for the quality selective search. This shows that selective search is robust to differences in object properties, and also to image conditions often related to specific objects (one example is indoor/outdoor lighting).

If we compare with other algorithms, the second highest recall is 0.940, achieved by the jumping windows (Vedaldi 2009) using 10,000 boxes per class. As we do not have the exact boxes, we were unable to obtain the MABO score. This is followed by the exhaustive search of Felzenszwalb et al. (2010), which achieves a recall of 0.933 and a MABO of 0.829 at 100,352 boxes per class (this number is the average over all classes). This is significantly lower than our method while using at least a factor of 10 more object locations.

Note furthermore that the segmentation methods of Carreira and Sminchisescu (2010) and Endres and Hoiem (2010) have a relatively high standard deviation. This illustrates that
Fig. 4 Trade-off between quality and quantity of the object hypotheses in terms of bounding boxes on the Pascal 2007 test set. The dashed lines are for those methods whose quantity is expressed in the number of boxes per class. In terms of recall "Fast" selective search has the best trade-off. In terms of Mean Average Best Overlap the "Quality" selective search is comparable with Carreira and Sminchisescu (2010) and Endres and Hoiem (2010), yet is much faster to compute and goes on longer, resulting in a higher final MABO of 0.879
a single strategy cannot work equally well for all classes. Instead, using multiple complementary strategies leads to more stable and reliable results.

If we compare the segmentation of Arbeláez et al. (2011) with the single best strategy of our method, they achieve a recall of 0.752 and a MABO of 0.649 at 418 boxes, while we achieve 0.875 recall and 0.698 MABO using 286 boxes. This suggests that a good segmentation algorithm does not automatically result in good object locations in terms of bounding boxes.

Figure 4 explores the trade-off between the quality and quantity of the object hypotheses. In terms of recall, our "Fast" method outperforms all other methods. The method of Harzallah et al. (2009) seems competitive for the 200 locations they use, but in their method the number of boxes is per class, while for our method the same boxes are used for all classes. In terms of MABO, both the object hypotheses generation methods of Carreira and Sminchisescu (2010) and Endres and Hoiem (2010) have a good quantity/quality trade-off for the up to 790 object-box locations per image they generate. However, these algorithms are computationally 114 and 59 times more expensive than our "Fast" method.

Interestingly, the "objectness" method of Alexe et al. (2012) performs quite well in terms of recall, but much worse in terms of MABO. This is most likely caused by their non-maximum suppression, which suppresses windows that have more than a 0.5 overlap score with an existing, higher ranked window. And while this significantly improves results when a 0.5 overlap score is the definition of finding an object, for the general problem of finding the highest quality locations this strategy is less effective and can even be harmful by eliminating better locations.

Figure 6 shows for several methods the Average Best Overlap per class. It shows that the exhaustive search of Felzenszwalb et al. (2010), which uses 10 times more locations that are class specific, performs similarly to our method for the classes bike, table, chair, and sofa; for the other classes our method yields the best score. In general, the classes with the highest scores are cat, dog, horse, and sofa, which are easy largely because the instances in the dataset tend to be
Fig. 6 Average Best Overlap per class (y-axis roughly 0.5 to 0.65) for Alexe et al., Endres and Hoiem, Carreira and Sminchisescu, Felzenszwalb et al., and selective search "Fast" and "Quality", over the 20 Pascal classes
big. The classes with the lowest scores are bottle, person, and plant, which are difficult because instances tend to be small. Nevertheless, cow, sheep, and tv are not bigger than person and yet can be found quite well by our algorithm.

To summarize, selective search is very effective in finding a high quality set of object hypotheses using a limited number of boxes, where the quality is reasonably consistent over the object classes. The methods of Carreira and Sminchisescu (2010) and Endres and Hoiem (2010) have a similar quality/quantity trade-off for up to 790 object locations. However, they have more variation over the object classes. Furthermore, they are at least 59 and 13 times more expensive to compute than our "Fast" and "Quality" selective search methods respectively, which is a problem for current dataset sizes for object recognition. In general, we conclude that selective search yields the best quality locations at 0.879 MABO while using a reasonable number of 10,097 class-independent object locations.

5.2.2 Region-Based Locations

In this section we examine how well the regions that our selective search generates capture object locations. We do this on the segmentation part of the Pascal VOC 2007 test set. We compare with the segmentation of Arbeláez et al. (2011) and with the object hypothesis regions of both Carreira and Sminchisescu (2010) and Endres and Hoiem (2010). Table 6 shows the results. Note that the number of regions is larger than the number of boxes as there are almost no exact duplicates.

The object regions of both Carreira and Sminchisescu (2010) and Endres and Hoiem (2010) are of similar quality as our "Fast" selective search: 0.665 MABO and 0.679 MABO respectively, where our "Fast" search yields 0.666 MABO. While Carreira and Sminchisescu (2010) and Endres and Hoiem (2010) use fewer regions, these algorithms are respectively 114 and 59 times computationally more expensive. Our "Quality" selective search generates 22,491 regions and is respectively 25 and 13 times faster than Carreira and Sminchisescu (2010) and Endres and Hoiem (2010), and has by far the highest score of 0.730 MABO.

Figure 7 shows the Average Best Overlap of the regions per class. For all classes except bike, our selective search consistently has relatively high ABO scores. The performance for bike is disproportionately lower for region-locations than for box-locations, because bike is a wire-frame object and hence very difficult to accurately delineate.

If we compare our method to others, the method of Endres and Hoiem (2010) is better for train; for the other classes our "Quality" method yields similar or better scores. For bird, boat, bus, chair, person, plant, and tv scores are 0.05 ABO better. For car we obtain 0.12 higher ABO and for bottle even 0.17 higher ABO. Looking at the variation in ABO scores in Table 6, we see that selective search has a slightly lower variation than the other methods: 0.093 MABO for "quality" and 0.108 for Endres and Hoiem (2010). However, this score is biased because of the wire-framed bicycle: without bicycle the difference becomes more apparent. The standard deviation for the "quality" selective search becomes 0.058, and 0.100 for Endres and Hoiem (2010). Again, this shows
Table 6 Comparison of algorithms to find a good set of potential object locations in terms of regions on the segmentation part of the Pascal 2007 test set

Method   Recall   MABO   No. of Regions   Time (s)
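The combinations reported in the lower part of Table 6 amount to pooling the hypothesis sets of the different methods and dropping exact duplicates (which, as noted in the text, are rare). A minimal sketch under that assumption, not the authors' code:

```python
def combine_hypothesis_sets(*hypothesis_sets):
    """Union of object hypothesis sets from complementary methods,
    keeping the first occurrence of each exact duplicate."""
    combined, seen = [], set()
    for hypotheses in hypothesis_sets:
        for h in hypotheses:
            key = tuple(h)          # hashable signature of the location
            if key not in seen:
                seen.add(key)
                combined.append(h)
    return combined
```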
Fig. 7 Average Best Overlap of the regions per class (y-axis: overlap scores up to roughly 0.6) for Carreira and Sminchisescu, Endres and Hoiem, and selective search "Fast" and "Quality", over the 20 Pascal classes
that relying on multiple complementary strategies instead of a single strategy yields more stable results.

Figure 8 shows several example segmentations from our method and from Carreira and Sminchisescu (2010) and Endres and Hoiem (2010). In the first image, the other methods have problems keeping the white label of the bottle and the book apart. In our case, one of our strategies ignores colour, while the "fill" similarity (Eq. 5) helps grouping the bottle and label together. The missing bottle part, which is dusty, is already merged with the table before this bottle segment is formed, hence "fill" will not help here. The second image is an example of a dark image on which our algorithm generally has strong results due to using a variety of colour spaces. In this particular image, the partially intensity invariant Lab colour space helps to isolate the car. As we do not use the contour detection method of Arbeláez et al. (2011), our method sometimes generates segments with an irregular border, which is illustrated by the third image of a cat. The final image shows a very difficult example, for which only Carreira and Sminchisescu (2010) provides an accurate segment.

Because of the nature of selective search, rather than pitting methods against each other, it is more interesting to see how they can complement each other. As Carreira and Sminchisescu (2010) and Endres and Hoiem (2010) both use very different algorithms, the combination should prove effective according to our diversification hypothesis. Indeed, as can be seen in the lower part of Table 6, combination with our "Fast" selective search leads to 0.737 MABO at 6,438 locations. This is a higher MABO using fewer locations than our "quality" selective search. A combination of Carreira and Sminchisescu (2010) and Endres and Hoiem (2010) with our "quality" sampling leads to 0.758 MABO at 25,355 locations. This is a good increase at only a modest extra number of locations.

To conclude, selective search is highly effective for generating object locations in terms of regions. The use of a variety of strategies makes it robust against various image conditions as well as the object class. The combination of Carreira and Sminchisescu (2010), Endres and Hoiem (2010) and our grouping algorithms into a single selective
Fig. 8 A qualitative comparison of selective search, Carreira and Sminchisescu (2010), and Endres and Hoiem (2010). For our method we observe: ignoring colour allows finding the bottle, multiple colour spaces help in dark images (car), and not using Arbeláez et al. (2011) sometimes results in irregular borders such as the cat
search showed promising improvements. Given these improvements, and given that there are many more different partitioning algorithms out there to use in a selective search, it will be interesting to see how far our selective search paradigm can still go in terms of computational efficiency, number of object locations, and the quality of object locations.

5.3 Object Recognition

In this section we evaluate our selective search strategy for object recognition using the Pascal VOC 2010 detection task.

Our selective search strategy enables the use of expensive and powerful image representations and machine learning techniques. In this section we use selective search inside the Bag-of-Words based object recognition framework described in Sect. 4. The reduced number of object locations compared to an exhaustive search makes it feasible to use such a strong Bag-of-Words implementation.

To give an indication of computational requirements: the pixel-wise extraction of three SIFT variants plus visual word assignment takes around 10 s and is done once per image. The final round of SVM learning takes around 8 h per class on a GPU for approximately 30,000 training examples (van de Sande et al. 2011) resulting from two rounds of mining negatives on Pascal VOC 2010. Mining hard negatives is done in parallel and takes around 11 h on 10 machines for a single round, which is around 40 s per image. This is divided into 30 s for counting visual word frequencies and 0.5 s per class for classification. Testing takes 40 s for extracting features, visual word assignment, and counting visual word frequencies, after which 0.5 s is needed per class for classification. For comparison, the code of Felzenszwalb et al. (2010) (without cascade, just like our version) needs slightly less than 4 s per image per class for testing. For the 20 Pascal classes this makes our framework faster during testing.

We evaluate results using the official evaluation server. This evaluation is independent as the test data has not been released. We compare with the top-4 of the competition. Note that while all methods in the top-4 are based on an exhaustive search using variations on the part-based model of Felzenszwalb et al. (2010) with HOG features, our method differs substantially by using selective search and Bag-of-Words features. Results are shown in Table 7.
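The test-time comparison for the 20 Pascal classes in Sect. 5.3 works out as simple arithmetic (numbers as reported in the text):

```python
n_classes = 20

# Bag-of-Words on selective search locations: features are computed once
# per image; only the 0.5 s classification step is repeated per class.
ours = 40.0 + 0.5 * n_classes   # 40 s features + 0.5 s per class

# Part-based exhaustive search (Felzenszwalb et al. 2010, no cascade):
# slightly under 4 s per image for every class, so ~4 s is an upper bound.
theirs = 4.0 * n_classes

print(ours, theirs)  # 50.0 80.0 -- faster at test time for 20 classes
```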
Table 7 Results from the Pascal VOC 2010 detection task test set

System                              Plane Bike  Bird  Boat  Bottle Bus   Car   Cat   Chair Cow   Table Dog   Horse Motor Person Plant Sheep Sofa  Train TV
NLPR                                .533  .553  .192  .210  .300   .544  .467  .412  .200  .315  .207  .303  .486  .553  .465   .102  .344  .265  .503  .403
MIT UCLA (Zhu et al. 2010)          .542  .485  .157  .192  .292   .555  .435  .417  .169  .285  .267  .309  .483  .550  .417   .097  .358  .308  .472  .408
NUS                                 .491  .524  .178  .120  .306   .535  .328  .373  .177  .306  .277  .295  .519  .563  .442   .096  .148  .279  .495  .384
UoCTTI (Felzenszwalb et al. 2010)   .524  .543  .130  .156  .351   .542  .491  .318  .155  .262  .135  .215  .454  .516  .475   .091  .351  .194  .466  .380
This paper                          .562  .424  .153  .126  .218   .493  .368  .461  .129  .321  .300  .365  .435  .529  .329   .153  .411  .318  .470  .448

Our method is the only object recognition system based on Bag-of-Words. It has the best scores for 9, mostly non-rigid, object categories, where the difference is up to 0.056 AP. The other methods are based on part-based HOG features, and perform better on most rigid object classes. Best results are displayed in bold
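The notes beneath Table 7 (best scores for 9 mostly non-rigid classes, with a difference of up to 0.056 AP) can be checked directly against the table entries; a small verification sketch using the numbers as printed:

```python
# AP per class from Table 7 (rows: system; columns: the 20 Pascal classes).
classes = ["plane", "bike", "bird", "boat", "bottle", "bus", "car", "cat",
           "chair", "cow", "table", "dog", "horse", "motor", "person",
           "plant", "sheep", "sofa", "train", "tv"]
ap = {
    "NLPR":       [.533, .553, .192, .210, .300, .544, .467, .412, .200, .315,
                   .207, .303, .486, .553, .465, .102, .344, .265, .503, .403],
    "MIT UCLA":   [.542, .485, .157, .192, .292, .555, .435, .417, .169, .285,
                   .267, .309, .483, .550, .417, .097, .358, .308, .472, .408],
    "NUS":        [.491, .524, .178, .120, .306, .535, .328, .373, .177, .306,
                   .277, .295, .519, .563, .442, .096, .148, .279, .495, .384],
    "UoCTTI":     [.524, .543, .130, .156, .351, .542, .491, .318, .155, .262,
                   .135, .215, .454, .516, .475, .091, .351, .194, .466, .380],
    "This paper": [.562, .424, .153, .126, .218, .493, .368, .461, .129, .321,
                   .300, .365, .435, .529, .329, .153, .411, .318, .470, .448],
}

# Classes where the Bag-of-Words system has the single best AP, with margin.
wins = {}
for i, cls in enumerate(classes):
    best_other = max(ap[s][i] for s in ap if s != "This paper")
    if ap["This paper"][i] > best_other:
        wins[cls] = round(ap["This paper"][i] - best_other, 3)

# 9 winning classes; the largest margin is 0.056 AP (dog).
```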
Table 8 Results for the ImageNet Large Scale Visual Recognition Challenge 2011 (ILSVRC2011)

Participant                      Flat error   Hierarchical error
University of Amsterdam (ours)   0.425        0.285
ISI lab., University of Tokyo    0.565        0.410

Hierarchical error penalises mistakes less if the predicted class is semantically similar to the real class according to the WordNet hierarchy

Table 9 Quality of locations on Pascal VOC 2012 train+val

Boxes train+val 2012      MABO    No. of Locations
"Fast"                    0.814   2,006
"Quality"                 0.886   10,681

Segments train+val 2012   MABO    No. of Locations
"Fast"                    0.512   3,482
"Quality"                 0.559   22,073

It is shown that our method yields the best results for the classes plane, cat, cow, table, dog, plant, sheep, sofa, and tv. Except for table, sofa, and tv, these classes are all non-rigid. This is expected, as Bag-of-Words is theoretically better suited for these classes than the HOG features. Indeed, for the rigid classes bike, bottle, bus, car, person, and train the HOG-based methods perform better. The exception is the rigid class tv. This is presumably because our selective search performs well in locating tvs, see Fig. 6.

In the Pascal 2011 challenge there are several entries which achieve significantly higher scores than our entry. These methods use Bag-of-Words as additional information on the locations found by their part-based model, yielding better detection accuracy. Interestingly, however, by using Bag-of-Words to detect locations our method achieves a higher total recall for many classes (Everingham et al. 2011).

Finally, our selective search enabled participation in the detection task of the ImageNet Large Scale Visual Recognition Challenge 2011 (ILSVRC2011), as shown in Table 8. This dataset contains 1,229,413 training images and 100,000 test images with 1,000 different object categories. Testing can be accelerated as features extracted from the locations of selective search can be reused for all classes. For example, using the fast Bag-of-Words framework of Uijlings et al. (2010), the time to extract SIFT descriptors plus two colour variants takes 6.7 s and assignment to visual words takes 1.7 s.¹ Using a 1 × 1, 2 × 2, and 3 × 3 spatial pyramid division, it takes 14 s to get all 172,032-dimensional features. Classification in a cascade on the pyramid levels then takes 0.3 s per class. For 1,000 classes, the total process then takes 323 s per image for testing. In contrast, using the part-based framework of Felzenszwalb et al. (2010) takes 3.9 s per class per image, resulting in 3,900 s per image for testing. This clearly shows that the reduced number of locations helps scaling towards more classes.

We conclude that compared to an exhaustive search, selective search enables the use of more expensive features and classifiers and scales better as the number of classes increases.

5.4 Pascal VOC 2012

Because Pascal VOC 2012 is the latest and perhaps final VOC dataset, we briefly present results on this dataset to facilitate comparison with our work in the future. We present the quality of boxes using the train+val set, the quality of segments on the segmentation part of train+val, and our localisation framework using a Spatial Pyramid of 1×1, 2×2, 3×3, and 4×4 on the test set using the official evaluation server.

Results for the location quality are presented in Table 9. We see that for the box-locations the results are slightly higher than on Pascal VOC 2007. For the segments, however, results are worse. This is mainly because the 2012 segmentation set is considerably more difficult.

For the 2012 detection challenge, the Mean Average Precision is 0.350. This is similar to the 0.351 MAP obtained on Pascal VOC 2010.

5.5 An Upper Bound of Location Quality

In this experiment we investigate how close our selective search locations are to the optimal locations in terms of recognition accuracy for Bag-of-Words features. We do this on the Pascal VOC 2007 test set.

The red line in Fig. 9 shows the MAP score of our object recognition system when the top n boxes of our "quality" selective search method are used. The performance starts at 0.283 MAP using the first 500 object locations with a MABO of 0.758. It rapidly increases to 0.356 MAP using the first 3,000 object locations with a MABO of 0.855, and then ends at 0.360 MAP using all 10,097 object locations with a MABO of 0.883.

The magenta line shows the performance of our object recognition system if we include the ground truth object locations in our hypotheses set, representing an object hypothesis set of "perfect" quality with a MABO score of 1. When only the ground truth boxes are used, a MAP of 0.592 is achieved, which is an upper bound of our object recognition system. However, this score rapidly declines to 0.437 MAP using as few as 500 locations per image. Remarkably, when all 10,097 boxes are used the performance drops to 0.377 MAP, only

¹ We found no difference in recognition accuracy when using the Random Forest assignment of Uijlings et al. (2010) or k-means nearest neighbour assignment of van de Sande et al. (2010) on the Pascal dataset.
References

Arbeláez, P., Maire, M., Fowlkes, C., & Malik, J. (2011). Contour detection and hierarchical image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(5), 898–916.
Carreira, J., & Sminchisescu, C. (2010). Constrained parametric min-cuts for automatic object segmentation. In CVPR.
Chum, O., & Zisserman, A. (2007). An exemplar model for learning object classes. In CVPR.
Comaniciu, D., & Meer, P. (2002). Mean shift: A robust approach toward feature space analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24, 603–619.
Csurka, G., Dance, C. R., Fan, L., Willamowski, J., & Bray, C. (2004). Visual categorization with bags of keypoints. In ECCV workshop on statistical learning in computer vision.
Dalal, N., & Triggs, B. (2005). Histograms of oriented gradients for human detection. In CVPR.
Endres, I., & Hoiem, D. (2010). Category independent object proposals. In ECCV.
Everingham, M., Gool, L. V., Williams, C., Winn, J., & Zisserman, A. (2011). The Pascal visual object classes challenge workshop: Overview and results of the detection challenge.
Everingham, M., van Gool, L., Williams, C. K. I., Winn, J., & Zisserman, A. (2010). The Pascal visual object classes (VOC) challenge. International Journal of Computer Vision, 88, 303–338.
Felzenszwalb, P. F., Girshick, R. B., McAllester, D., & Ramanan, D. (2010). Object detection with discriminatively trained part based models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32, 1627–1645.
Felzenszwalb, P. F., & Huttenlocher, D. P. (2004). Efficient graph-based image segmentation. International Journal of Computer Vision, 59, 167–181.
Geusebroek, J. M., van den Boomgaard, R., Smeulders, A. W. M., & Geerts, H. (2001). Color invariance. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23, 1338–1350.
Gu, C., Lim, J. J., Arbeláez, P., & Malik, J. (2009). Recognition using regions. In CVPR.
Harzallah, H., Jurie, F., & Schmid, C. (2009). Combining efficient object localization and image classification. In ICCV.
Lampert, C. H., Blaschko, M. B., & Hofmann, T. (2009). Efficient subwindow search: A branch and bound framework for object localization. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31, 2129–2142.
Lazebnik, S., Schmid, C., & Ponce, J. (2006). Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In CVPR.
Li, F., Carreira, J., & Sminchisescu, C. (2010). Object recognition as ranking holistic figure-ground hypotheses. In CVPR.
Liu, C., Sharan, L., Adelson, E. H., & Rosenholtz, R. (2010). Exploring features in a bayesian framework for material recognition. In CVPR.
Lowe, D. G. (2004). Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60, 91–110.
Maji, S., Berg, A. C., & Malik, J. (2008). Classification using intersection kernel support vector machines is efficient. In CVPR.
Maji, S., & Malik, J. (2009). Object detection using a max-margin hough transform. In CVPR.
Ojala, T., Pietikainen, M., & Maenpaa, T. (2002). Multiresolution gray-scale and rotation invariant texture classification with local binary patterns. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(7), 971–987.
Perronnin, F., Sánchez, J., & Thomas, M. (2010). Improving the Fisher kernel for large-scale image classification. In ECCV.
Shi, J., & Malik, J. (2000). Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22, 888–905.
Sivic, J., & Zisserman, A. (2003). Video Google: A text retrieval approach to object matching in videos. In ICCV.
Sonnenburg, S., Raetsch, G., Henschel, S., Widmer, C., Behr, J., Zien, A., et al. (2010). The SHOGUN machine learning toolbox. Journal of Machine Learning Research, 11, 1799–1802.
Tu, Z., Chen, X., Yuille, A. L., & Zhu, S. (2005). Image parsing: Unifying segmentation, detection and recognition. International Journal of Computer Vision (Marr Prize Issue).
Uijlings, J. R. R., Smeulders, A. W. M., & Scha, R. J. H. (2010). Real-time visual concept classification. IEEE Transactions on Multimedia, 12(7), 665–681.
van de Sande, K. E. A., & Gevers, T. (2012). Illumination-invariant descriptors for discriminative visual object categorization. Technical report, University of Amsterdam.
van de Sande, K. E. A., Gevers, T., & Snoek, C. G. M. (2010). Evaluating color descriptors for object and scene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32, 1582–1596.
van de Sande, K. E. A., Gevers, T., & Snoek, C. G. M. (2011). Empowering visual categorization with the GPU. IEEE Transactions on Multimedia, 13(1), 60–70.
Vedaldi, A., Gulshan, V., Varma, M., & Zisserman, A. (2009). Multiple kernels for object detection. In ICCV.
Viola, P., & Jones, M. (2001). Rapid object detection using a boosted cascade of simple features. In CVPR (Vol. 1, pp. 511–518).
Viola, P., & Jones, M. J. (2004). Robust real-time face detection. International Journal of Computer Vision, 57, 137–154.
Zhou, X., Kai, Y., Zhang, T., & Huang, T. S. (2010). Image classification using super-vector coding of local image descriptors. In ECCV.
Zhu, L., Chen, Y., Yuille, A., & Freeman, W. (2010). Latent hierarchical structural learning for object detection. In CVPR.