Selective Search for Object Recognition
DOI 10.1007/s11263-013-0620-5
Received: 5 May 2012 / Accepted: 11 March 2013 / Published online: 2 April 2013
© Springer Science+Business Media New York 2013
Abstract This paper addresses the problem of generating possible object locations for use in object recognition. We introduce selective search which combines the strength of both an exhaustive search and segmentation. Like segmentation, we use the image structure to guide our sampling process. Like exhaustive search, we aim to capture all possible object locations. Instead of a single technique to generate possible object locations, we diversify our search and use a variety of complementary image partitionings to deal with as many image conditions as possible. Our selective search results in a small set of data-driven, class-independent, high quality locations, yielding 99 % recall and a Mean Average Best Overlap of 0.879 at 10,097 locations. The reduced number of locations compared to an exhaustive search enables the use of stronger machine learning techniques and stronger appearance models for object recognition. In this paper we show that our selective search enables the use of the powerful Bag-of-Words model for recognition. The selective search software is made publicly available (Software: http://disi.unitn.it/~uijlings/SelectiveSearch.html).

J. R. R. Uijlings (B)
University of Trento, Trento, Italy
e-mail: [email protected]; [email protected]

K. E. A. van de Sande · T. Gevers · A. W. M. Smeulders
University of Amsterdam, Amsterdam, The Netherlands

1 Introduction

For a long time, objects were sought to be delineated before their identification. This gave rise to segmentation, which aims for a unique partitioning of the image through a generic algorithm, where there is one part for all object silhouettes in the image. Research on this topic has yielded tremendous progress over the past years (Arbeláez et al. 2011; Comaniciu and Meer 2002; Felzenszwalb and Huttenlocher 2004; Shi and Malik 2000). But images are intrinsically hierarchical: In Fig. 1a the salad and spoons are inside the salad bowl, which in turn stands on the table. Furthermore, depending on the context the term table in this picture can refer to only the wood or include everything on the table. Therefore both the nature of images and the different uses of an object category are hierarchical. This prohibits the unique partitioning of objects for all but the most specific purposes. Hence for most tasks multiple scales in a segmentation are a necessity. This is most naturally addressed by using a hierarchical partitioning, as done for example by Arbeláez et al. (2011).

Besides that a segmentation should be hierarchical, a generic solution for segmentation using a single strategy may not exist at all. There are many conflicting reasons why a region should be grouped together: In Fig. 1b the cats can be separated using colour, but their texture is the same. Conversely, in Fig. 1c the chameleon is similar to its surrounding leaves in terms of colour, yet its texture differs. Finally, in Fig. 1d, the wheels are wildly different from the car in terms of both colour and texture, yet are enclosed by the car. Individual visual features therefore cannot resolve the ambiguity of segmentation.

And, finally, there is a more fundamental problem. Regions with very different characteristics, such as a face over a sweater, can only be combined into one object after it has been established that the object at hand is a human. Hence without prior recognition it is hard to decide that a face and a sweater are part of one object (Tu et al. 2005).

This has led to the opposite of the traditional approach: to do localisation through the identification of an object. This recent approach in object recognition has made enormous progress in less than a decade (Dalal and Triggs 2005; Felzenszwalb et al. 2010; Harzallah et al. 2009; Viola and
Int J Comput Vis (2013) 104:154–171 155
Fig. 1 There is a high variety of reasons that an image region forms an object. In (b) the cats can be distinguished by colour, not texture. In (c) the chameleon can be distinguished from the surrounding leaves by texture, not colour. In (d) the wheels can be part of the car because they are enclosed, not because they are similar in texture or colour. Therefore, to find objects in a structured way it is necessary to use a variety of diverse strategies. Furthermore, an image is intrinsically hierarchical as there is no single scale for which the complete table, salad bowl, and salad spoon can be found in (a)
Jones 2001). With an appearance model learned from examples, an exhaustive search is performed where every location within the image is examined so as not to miss any potential object location (Dalal and Triggs 2005; Felzenszwalb et al. 2010; Harzallah et al. 2009; Viola and Jones 2001).

However, the exhaustive search itself has several drawbacks. Searching every possible location is computationally infeasible. The search space has to be reduced by using a regular grid, fixed scales, and fixed aspect ratios. In most cases the number of locations to visit remains huge, so much so that alternative restrictions need to be imposed. The classifier is simplified and the appearance model needs to be fast. Furthermore, a uniform sampling yields many boxes for which it is immediately clear that they are not supportive of an object. Rather than sampling locations blindly using an exhaustive search, a key question is: can we steer the sampling by a data-driven analysis?

In this paper, we aim to combine the best of the intuitions of segmentation and exhaustive search and propose a data-driven selective search. Inspired by bottom-up segmentation, we aim to exploit the structure of the image to generate object locations. Inspired by exhaustive search, we aim to capture all possible object locations. Therefore, instead of using a single sampling technique, we aim to diversify the sampling techniques to account for as many image conditions as possible. Specifically, we use a data-driven grouping-based strategy where we increase diversity by using a variety of complementary grouping criteria and a variety of complementary colour spaces with different invariance properties. The set of locations is obtained by combining the locations of these complementary partitionings. Our goal is to generate a class-independent, data-driven, selective search strategy that generates a small set of high-quality object locations.

Our application domain of selective search is object recognition. We therefore evaluate on the most commonly used dataset for this purpose, the Pascal VOC detection challenge, which consists of 20 object classes. The size of this dataset yields computational constraints for our selective search. Furthermore, the use of this dataset means that the quality of locations is mainly evaluated in terms of bounding boxes. However, our selective search applies to regions as well and is also applicable to concepts such as "grass".

In this paper we propose selective search for object recognition. Our main research questions are: (1) What are good diversification strategies for adapting segmentation as a selective search strategy? (2) How effective is selective search in creating a small set of high-quality locations within an image? (3) Can we use selective search to employ more powerful classifiers and appearance models for object recognition?

2 Related Work

We confine the related work to the domain of object recognition and divide it into three categories: exhaustive search, segmentation, and other sampling strategies that do not fall in either category.

2.1 Exhaustive Search

As an object can be located at any position and scale in the image, it is natural to search everywhere (Dalal and Triggs 2005; Harzallah et al. 2009; Viola and Jones 2004). However, the visual search space is huge, making an exhaustive search computationally expensive. This imposes constraints on the evaluation cost per location and/or the number of locations considered. Hence most of these sliding window techniques use a coarse search grid and fixed aspect ratios, using weak classifiers and economic image features such as HOG (Dalal and Triggs 2005; Harzallah et al. 2009; Viola and Jones 2004). This method is often used as a preselection step in a cascade of classifiers (Harzallah et al. 2009; Viola and Jones 2004).

Related to the sliding window technique is the highly successful part-based object localisation method of Felzenszwalb et al. (2010). Their method also performs an exhaustive search using a linear SVM and HOG features. However, they search for objects and object parts, whose combination results in an impressive object detection performance.
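To make the combinatorics of a sliding window search concrete, the following sketch enumerates box candidates on a regular grid. It is illustrative only and not from the paper: the image size, scales, aspect ratios, and stride are hypothetical choices, yet even this coarse grid already produces thousands of windows, which is why exhaustive methods must restrict scales and aspect ratios.

```python
# Illustrative sketch (not the paper's method): counting sliding-window
# candidates on a regular grid with fixed scales and aspect ratios.

def sliding_windows(img_w, img_h, scales, aspect_ratios, stride):
    """Enumerate (x, y, w, h) boxes on a regular grid inside the image."""
    boxes = []
    for s in scales:                      # s = window height in pixels
        for ar in aspect_ratios:          # ar = width / height
            h = s
            w = int(round(s * ar))
            if w > img_w or h > img_h:
                continue                  # window does not fit at this scale
            for y in range(0, img_h - h + 1, stride):
                for x in range(0, img_w - w + 1, stride):
                    boxes.append((x, y, w, h))
    return boxes

# A coarse grid on a typical Pascal-sized image already yields thousands
# of boxes; a per-pixel grid over many scales would yield millions.
coarse = sliding_windows(500, 375, scales=[64, 128, 256],
                         aspect_ratios=[0.5, 1.0, 2.0], stride=16)
print(len(coarse))
```

Shrinking the stride or adding scales multiplies the count, which is the pressure towards weak classifiers and cheap features discussed above.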
Lampert et al. (2009) proposed using the appearance model to guide the search. This both alleviates the constraints of using a regular grid, fixed scales, and fixed aspect ratio, while at the same time reducing the number of locations visited. This is done by directly searching for the optimal window within the image using a branch and bound technique. While they obtain impressive results for linear classifiers, Alexe et al. (2010) found that for non-linear classifiers the method in practice still visits over 100,000 windows per image.

Instead of a blind exhaustive search or a branch and bound search, we propose selective search. We use the underlying image structure to generate object locations. In contrast to the discussed methods, this yields a completely class-independent set of locations. Furthermore, because we do not use a fixed aspect ratio, our method is not limited to objects but should be able to find stuff like "grass" and "sand" as well (this also holds for Lampert et al. (2009)). Finally, we hope to generate fewer locations, which should make the problem easier as the variability of samples becomes lower. More importantly, it frees up computational power which can be used for stronger machine learning techniques and more powerful appearance models.

2.2 Segmentation

Both Carreira and Sminchisescu (2010) and Endres and Hoiem (2010) propose to generate a set of class-independent object hypotheses using segmentation. Both methods generate multiple foreground/background segmentations, learn to predict the likelihood that a foreground segment is a complete object, and use this to rank the segments. Both algorithms show a promising ability to accurately delineate objects within images, confirmed by Li et al. (2010) who achieve state-of-the-art results on pixel-wise image classification using Carreira and Sminchisescu (2010). As common in segmentation, both methods rely on a single strong algorithm for identifying good regions. They obtain a variety of locations by using many randomly initialised foreground and background seeds. In contrast, we explicitly deal with a variety of image conditions by using different grouping criteria and different representations. This means a lower computational investment as we do not have to invest in the single best segmentation strategy, such as using the excellent yet expensive contour detector of Arbeláez et al. (2011). Furthermore, as we deal with different image conditions separately, we expect our locations to have a more consistent quality. Finally, our selective search paradigm dictates that the most interesting question is not how our regions compare to Carreira and Sminchisescu (2010), Endres and Hoiem (2010), but rather how they can complement each other.

Gu et al. (2009) address the problem of carefully segmenting and recognizing objects based on their parts. They first generate a set of part hypotheses using a grouping method based on Arbeláez et al. (2011). Each part hypothesis is described by both appearance and shape features. Then, an object is recognized and carefully delineated by using its parts, achieving good results for shape recognition. In their work, the segmentation is hierarchical and yields segments at all scales. However, they use a single grouping strategy whose power of discovering parts or objects is left unevaluated. In this work, we use multiple complementary strategies to deal with as many image conditions as possible. We include the locations generated using Arbeláez et al. (2011) in our evaluation.

2.3 Other Sampling Strategies

Alexe et al. (2012) address the problem of the large sampling space of an exhaustive search by proposing to search for any object, independent of its class. In their method they train a classifier on the object windows of those objects which have a well-defined shape (as opposed to stuff like "grass" and "sand"). Then instead of a full exhaustive search they randomly sample boxes to which they apply their classifier. The boxes with the highest "objectness" measure serve as a set of object hypotheses. This set is then used to greatly reduce the number of windows evaluated by class-specific object detectors. We compare our method with their work.

Another strategy is to use visual words of the Bag-of-Words model to predict the object location. Vedaldi (2009) uses jumping windows (Chum and Zisserman 2007), in which the relation between individual visual words and the object location is learned to predict the object location in new images. Maji and Malik (2009) combine multiple of these relations to predict the object location using a Hough transform, after which they randomly sample windows close to the Hough maximum. In contrast to learning, we use the image structure to sample a set of class-independent object hypotheses.

To summarize, our novelty is as follows. Instead of an exhaustive search (Dalal and Triggs 2005; Felzenszwalb et al. 2010; Harzallah et al. 2009; Viola and Jones 2004) we use segmentation as selective search, yielding a small set of class-independent object locations. In contrast to the segmentation of Carreira and Sminchisescu (2010), Endres and Hoiem (2010), instead of focusing on the best segmentation algorithm (Arbeláez et al. 2011), we use a variety of strategies to deal with as many image conditions as possible, thereby severely reducing computational costs while potentially capturing more objects accurately. Instead of learning an objectness measure on randomly sampled boxes (Alexe et al. 2012), we use a bottom-up grouping procedure to generate good object locations.
Fig. 2 Two examples of our selective search showing the necessity of different scales. On the left we find many objects at different scales. On the
right we necessarily find the objects at different scales as the girl is contained by the tv
We take a hierarchical grouping algorithm to form the basis of our selective search. Bottom-up grouping is a popular approach to segmentation (Comaniciu and Meer 2002; Felzenszwalb and Huttenlocher 2004), hence we adapt it for

The second design criterion for selective search is to diversify the sampling and create a set of complementary strategies whose locations are combined afterwards. We diversify our selective search (1) by using a variety of colour spaces with
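The bottom-up grouping idea can be sketched as follows. This is a minimal illustration and not the paper's implementation: regions carry only a size, the neighbourhood graph is given (e.g. from an oversegmentation), and for brevity only the size term of the paper's similarity measures, s_size(ri, rj) = 1 − (size(ri) + size(rj)) / size(im), is used; all function and field names are ours.

```python
# Minimal sketch of greedy bottom-up hierarchical grouping: repeatedly
# merge the most similar pair of neighbouring regions; every region
# created along the way is kept as an object-location hypothesis.

def s_size(ri, rj, im_size):
    # size similarity: encourages small regions to merge early
    return 1.0 - (ri["size"] + rj["size"]) / im_size

def hierarchical_grouping(regions, neighbours, im_size):
    """regions: {id: {'size': int}}; neighbours: set of (i, j) pairs, i < j.
    Returns (ids of created regions, remaining regions)."""
    regions = {k: dict(v) for k, v in regions.items()}
    neighbours = set(neighbours)
    hypotheses = []
    next_id = max(regions) + 1
    while neighbours:
        # pick the most similar neighbouring pair and merge it
        i, j = max(neighbours, key=lambda p: s_size(regions[p[0]], regions[p[1]], im_size))
        regions[next_id] = {"size": regions[i]["size"] + regions[j]["size"]}
        hypotheses.append(next_id)
        # rewire: neighbours of i or j become neighbours of the merged region
        new_nb = set()
        for a, b in neighbours:
            if a in (i, j) or b in (i, j):
                other = b if a in (i, j) else a
                if other not in (i, j):
                    new_nb.add((min(other, next_id), max(other, next_id)))
            else:
                new_nb.add((a, b))
        neighbours = new_nb
        del regions[i], regions[j]
        next_id += 1
    return hypotheses, regions
```

In the paper's full algorithm the merged region would also carry colour and texture histograms and a bounding box, and the similarity would combine all four measures; the greedy merge loop itself has the shape shown here.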
Table 1 The invariance properties of both the individual colour channels and the colour spaces used in this paper, sorted by degree of invariance
Colour channels R G B I V L a b S r g C H
fill(r_i, r_j) = 1 − (size(BB_ij) − size(r_i) − size(r_j)) / size(im)   (5)

We divide by size(im) for consistency with Eq. 4. Note that this measure can be efficiently calculated by keeping track of the bounding boxes around each region, as the bounding box around two regions can be easily derived from these.

In this paper, our final similarity measure is a combination of the above four:

s(r_i, r_j) = a_1 s_colour(r_i, r_j) + a_2 s_texture(r_i, r_j) + a_3 s_size(r_i, r_j) + a_4 s_fill(r_i, r_j),   (6)

where a_i ∈ {0, 1} denotes if the similarity measure is used or not. As we aim to diversify our strategies, we do not consider any weighted similarities.

Complementary Starting Regions. A third diversification strategy is varying the complementary starting regions. To the best of our knowledge, the method of Felzenszwalb and Huttenlocher (2004) is the fastest, publicly available algorithm that yields high-quality starting locations. We could not find any other algorithm with similar computational efficiency, so we use only this oversegmentation in this paper. But note that different starting regions are (already) obtained by varying the colour spaces, each of which has different invariance properties. Additionally, we vary the threshold parameter k in Felzenszwalb and Huttenlocher (2004).

3.3 Combining Locations

In this paper, we combine the object hypotheses of several variations of our hierarchical grouping algorithm. Ideally, we want to order the object hypotheses in such a way that the locations which are most likely to be an object come first. This enables one to find a good trade-off between the quality and quantity of the resulting object hypothesis set, depending on the computational efficiency of the subsequent feature extraction and classification method.

We choose to order the combined object hypotheses set based on the order in which the hypotheses were generated in each individual grouping strategy. However, as we combine results from up to 80 different strategies, such an order would too heavily emphasize large regions. To prevent this, we include some randomness as follows. Given a grouping strategy j, let r_i^j be the region which is created at position i in the hierarchy, where i = 1 represents the top of the hierarchy (whose corresponding region covers the complete image). We now calculate the position value v_i^j as RND × i, where RND is a random number in range [0, 1]. The final ranking is obtained by ordering the regions using v_i^j.

When we use locations in terms of bounding boxes, we first rank all the locations as detailed above. Only afterwards do we filter out lower-ranked duplicates. This ensures that duplicate boxes have a better chance of obtaining a high rank. This is desirable because if multiple grouping strategies suggest the same box location, it is likely to come from a visually coherent part of the image.
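The randomized ranking and duplicate filtering described above can be sketched as follows. Assumptions: each strategy's hypotheses come ordered from the top of its merge hierarchy (position i = 1 is the root covering the whole image), locations are hashable values, and the function name is ours.

```python
# Sketch of the combination step: position value v_i^j = RND * i randomizes
# the cross-strategy ordering while still favouring regions created high in
# each hierarchy; duplicate boxes are filtered only AFTER ranking, so that
# boxes proposed by several strategies keep their best (lowest) rank.

import random

def rank_and_filter(strategies, seed=0):
    """strategies: list of per-strategy hypothesis lists, each ordered from
    the top of its hierarchy (i = 1) downwards. Returns a ranked list."""
    rng = random.Random(seed)            # seeded here only for repeatability
    scored = []
    for hierarchy in strategies:
        for i, box in enumerate(hierarchy, start=1):
            scored.append((rng.random() * i, box))   # v_i^j = RND * i
    scored.sort(key=lambda t: t[0])      # lower position value = better rank
    ranked, seen = [], set()
    for _, box in scored:                # drop lower-ranked duplicates
        if box not in seen:
            seen.add(box)
            ranked.append(box)
    return ranked
```

Because i grows as regions get smaller, multiplying by a uniform random number interleaves the strategies without letting any single one dominate the front of the list.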
Fig. 3 The training procedure of our object recognition pipeline. As positive learning examples we use the ground truth. As negatives we use
examples that have a 20–50 % overlap with the positive examples. We iteratively add hard negatives using a retraining phase
of colour-SIFT descriptors (van de Sande et al. 2010) and a finer spatial pyramid division (Lazebnik et al. 2006).

Specifically we sample descriptors at each pixel on a single scale (σ = 1.2). Using software from van de Sande et al. (2010), we extract SIFT (Lowe 2004) and the two colour SIFTs which were found to be the most sensitive for detecting image structures: Extended OpponentSIFT (van de Sande et al. 2012) and RGB-SIFT (van de Sande et al. 2010). We use a visual codebook of size 4,000 and a spatial pyramid with 4 levels using a 1×1, 2×2, 3×3 and 4×4 division. This gives a total feature vector length of 360,000. In image classification, features of this size are already used (Perronnin et al. 2010; Zhou et al. 2010). Because a spatial pyramid results in a coarser spatial subdivision than the cells which make up a HOG descriptor, our features contain less information about the specific spatial layout of the object. Therefore, HOG is better suited for rigid objects and our features are better suited for deformable object types.

As classifier we employ a Support Vector Machine with a histogram intersection kernel using the Shogun Toolbox (Sonnenburg et al. 2010). To apply the trained classifier, we use the fast, approximate classification strategy of Maji et al. (2008), which was shown to work well for Bag-of-Words in Uijlings et al. (2010).

Our training procedure is illustrated in Fig. 3. The initial positive examples consist of all ground truth object windows. As initial negative examples we select those object locations generated by our selective search that have an overlap of 20–50 % with a positive example. To avoid near-duplicate negative examples, a negative example is excluded if it has more than 70 % overlap with another negative. To keep the number of initial negatives per class below 20,000, we randomly drop half of the negatives for the classes car, cat, dog and person. Intuitively, this set of examples can be seen as difficult negatives which are close to the positive examples. This means they are close to the decision boundary and are therefore likely to become support vectors even when the complete set of negatives would be considered. Indeed, we found that this selection of training examples gives reasonably good initial classification models.

Then we enter a retraining phase to iteratively add hard negative examples (e.g. Felzenszwalb et al. (2010)): we apply the learned models to the training set using the locations generated by our selective search. For each negative image we add the highest scoring location. As our initial training set already yields good models, our models converge in only two iterations.

For the test set, the final model is applied to all locations generated by our selective search. The windows are sorted by classifier score; windows which have more than 30 % overlap with a higher scoring window are considered near-duplicates and are removed.

5 Evaluation

In this section we evaluate the quality of our selective search. We divide our experiments into four parts, each spanning a separate subsection:

Diversification Strategies. We experiment with a variety of colour spaces, similarity measures, and thresholds of the initial regions, all of which were detailed in Sect. 3.2. We seek a trade-off between the number of generated object hypotheses, computation time, and the quality of object locations. We do this in terms of bounding boxes. This results in a selection of complementary techniques which together serve as our final selective search method.

Quality of Locations. We test the quality of the object location hypotheses resulting from the selective search.

Object Recognition. We use the locations of our selective search in the object recognition framework detailed in Sect. 4. We evaluate performance on the Pascal VOC detection challenge.

An upper bound of location quality. We investigate how well our object recognition framework performs when using an object hypothesis set of "perfect" quality. How does this compare to the locations that our selective search generates?
To evaluate the quality of our object hypotheses we define the Average Best Overlap (ABO) and Mean Average Best Overlap (MABO) scores, which slightly generalise the measure used in Endres and Hoiem (2010). To calculate the Average Best Overlap for a specific class c, we calculate the best overlap between each ground truth annotation g_i^c ∈ G^c and the object hypotheses L generated for the corresponding image, and average:

ABO = (1 / |G^c|) Σ_{g_i^c ∈ G^c} max_{l_j ∈ L} Overlap(g_i^c, l_j).   (7)

The Overlap score is taken from Everingham et al. (2010) and measures the area of the intersection of two regions divided by its union:

Overlap(g_i^c, l_j) = (area(g_i^c) ∩ area(l_j)) / (area(g_i^c) ∪ area(l_j)).   (8)

Analogously to Average Precision and Mean Average Precision, Mean Average Best Overlap is now defined as the mean ABO over all classes.

Other work often uses the recall derived from the Pascal Overlap Criterion to measure the quality of the boxes (Alexe et al. 2010; Harzallah et al. 2009; Vedaldi 2009). This criterion considers an object to be found when the Overlap of Eq. 8 is larger than 0.5. However, in many of our experiments we obtain a recall between 95 and 100 % for most classes, making this measure too insensitive for this paper. However, we do report this measure when comparing with other work.

To avoid overfitting, we perform the diversification strategies experiments on the Pascal VOC 2007 train+val set. Other experiments are done on the Pascal VOC 2007 test set. Additionally, our object recognition system is benchmarked on the Pascal VOC 2010 detection challenge, using the independent evaluation server.

5.1 Diversification Strategies

5.1.1 Flat Versus Hierarchy

In the description of our method we claim that using a full hierarchy is more natural than using multiple flat partitionings by changing a threshold. In this section we test whether the use of a hierarchy also leads to better results. We therefore compare the use of Felzenszwalb and Huttenlocher (2004) with multiple thresholds against our proposed algorithm. Specifically, we perform both strategies in RGB colour space. For Felzenszwalb and Huttenlocher (2004), we vary the threshold from k = 50 to k = 1,000 in steps of 50. This range captures both small and large regions. Additionally, as a special type of threshold, we include the whole image as an object location because quite a few images contain a single large object only. Furthermore, we also take a coarser range from k = 50 to k = 950 in steps of 100. For our algorithm, to create initial regions we use a threshold of k = 50, ensuring that both strategies have an identical smallest scale. Additionally, as we generate fewer regions, we combine results using k = 50 and k = 100. As similarity measure S we use the addition of all four similarities as defined in Eq. 6. Results are in Table 2.

As can be seen, the quality of object hypotheses is better for our hierarchical strategy than for multiple flat partitionings: at a similar number of regions, our MABO score is consistently higher. Moreover, the increase in MABO achieved by combining the locations of two variants of our hierarchical grouping algorithm is much higher than the increase achieved by adding extra thresholds for the flat partitionings. We conclude that using all locations from a hierarchical grouping algorithm is not only more natural but also more effective than using multiple flat partitionings.

5.1.2 Individual Diversification Strategies
Table 2 A comparison of multiple flat partitionings against hierarchical partitionings for generating box locations shows that for the hierarchical strategy the Mean Average Best Overlap (MABO) score is consistently higher at a similar number of locations

Strategy (threshold k in Felzenszwalb and Huttenlocher (2004))      MABO    No. of windows
Flat Felzenszwalb and Huttenlocher (2004), k = 50, 150, …, 950      0.659   387
Hierarchical (this paper), k = 50                                   0.676   395
Flat Felzenszwalb and Huttenlocher (2004), k = 50, 100, …, 1000     0.673   597
Hierarchical (this paper), k = 50, 100                              0.719   625
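For axis-aligned boxes, the ABO and MABO scores of Eqs. 7 and 8 reduce to a few lines of code. This is a sketch with names of our own choosing; the `overlap` helper implements Eq. 8's intersection over union on box coordinates.

```python
# Sketch of the evaluation measures: Average Best Overlap (Eq. 7) for one
# class, and Mean Average Best Overlap as the mean ABO over classes.
# Boxes are (x1, y1, x2, y2) tuples.

def overlap(g, l):
    """Eq. 8: intersection area divided by union area."""
    ix = max(0, min(g[2], l[2]) - max(g[0], l[0]))
    iy = max(0, min(g[3], l[3]) - max(g[1], l[1]))
    inter = ix * iy
    area = lambda b: (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area(g) + area(l) - inter)

def abo(ground_truth, hypotheses):
    # Eq. 7: average over ground-truth boxes of the best overlap achieved
    return sum(max(overlap(g, l) for l in hypotheses)
               for g in ground_truth) / len(ground_truth)

def mabo(per_class_gt, hypotheses):
    # mean ABO over all classes, analogous to mean Average Precision
    return sum(abo(gt, hypotheses) for gt in per_class_gt.values()) / len(per_class_gt)
```

The Pascal Overlap Criterion mentioned above would then simply count a ground-truth box as found when its best overlap exceeds 0.5.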
Table 3 Mean Average Best Overlap for box-based object hypotheses histograms of the individual colour channels. However, we
using a variety of segmentation strategies obtained similar results (MABO of 0.577). We believe that
Similarities MABO No. of Box one reason of the weakness of texture is because of object
boundaries: When two segments are separated by an object
C 0.635 356
boundary, both sides of this boundary will yield similar edge-
T 0.581 303
responses, which inadvertently increases similarity.
S 0.640 466 While the texture similarity yields relatively few object
F 0.634 449 locations, at 300 locations the other similarity measures still
C+T 0.635 346 yield a MABO higher than 0.628. This suggests that when
C+S 0.660 383 comparing individual strategies the final MABO scores in
C+F 0.660 389 Table 3 are good indicators of trade-off between quality and
T+S 0.650 406 quantity of the object hypotheses. Another observation is that
T+F 0.638 400 combinations of similarity measures generally outperform
S+F 0.638 449 the single measures. In fact, using all four similarity measures
C+T+S 0.662 377 perform best yielding a MABO of 0.676.
C+T+F 0.659 381 Looking at variations in the colour space in the top-right
C+S+F 0.674 401 of Table 3, we observe large differences in results, ranging
T+S+F 0.655 427 from a MABO of 0.615 with 125 locations for the C colour
C+T+S+F 0.676 395 space to a MABO of 0.693 with 463 locations for the HSV
colour space. We note that Lab-space has a particularly good
Colours MABO No. of Box
MABO score of 0.690 using only 328 boxes. Furthermore,
HSV 0.693 463 the order of each hierarchy is effective: using the first 328
I 0.670 399 boxes of HSV colour space yields 0.690 MABO, while using
RGB 0.676 395 the first 100 boxes yields 0.647 MABO. This shows that
rgI 0.693 362 when comparing single strategies we can use only the MABO
Lab 0.690 328 scores to represent the trade-off between quality and quantity
Table 3 (fragment). (C)olour, (S)ize, and (F)ill perform similar. (T)exture by itself is weak. The best combination is as many diverse sources as possible

Colour space   MABO    No. of Boxes
H              0.644   322
rgb            0.647   207
C              0.615   125

Threshold      MABO    No. of Boxes
50             0.676   395
100            0.671   239
150            0.668   168
250            0.647   102
500            0.585   46
1,000          0.477   19

measures, and threshold k = 50. Each time we vary a single parameter. Results are given in Table 3.

We start examining the combination of similarity measures on the left part of Table 3. Looking first at colour, texture, size, and fill individually, we see that the texture similarity performs worst with a MABO of 0.581, while the other measures range between 0.63 and 0.64. To test if the relatively low score of texture is due to our choice of feature, we also tried to represent texture by Local Binary Patterns (Ojala et al. 2002). We experimented with 4 and 8 neighbours on different scales using different uniformity/consistency of the patterns (see Ojala et al. (2002)), where we concatenate LBP

of the object hypotheses set. We will use this in the next section when finding good combinations.

Experiments on the thresholds of Felzenszwalb and Huttenlocher (2004) to generate the starting regions show, in the bottom-right of Table 3, that a lower initial threshold results in a higher MABO using more object locations.

5.1.3 Combinations of Diversification Strategies

We combine object location hypotheses using a variety of complementary grouping strategies in order to get a good quality set of object locations. As a full search for the best combination is computationally expensive, we perform a greedy search using the MABO score only as optimization criterion. We have earlier observed that this score is representative for the trade-off between the number of locations and their quality.

From the resulting ordering we create three configurations: a single best strategy, a fast selective search, and a quality selective search using all combinations of individual components, i.e. colour space, similarities, and thresholds, as detailed in Table 4. The greedy search emphasizes variation in the combination of similarity measures. This confirms our diversification hypothesis: in the quality version, next to the combination of all similarities, Fill and Size are taken separately. The remainder of this paper uses the three strategies in Table 4.
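The greedy combination search of Sect. 5.1.3 can be sketched as a standard forward-selection loop. This is an illustrative sketch, not the authors' implementation; `evaluate_mabo` is a hypothetical callback that scores a set of strategies (e.g. the MABO of their pooled hypotheses) on validation data:

```python
def greedy_strategy_search(strategies, evaluate_mabo, max_size=None):
    """Greedy forward selection: repeatedly add the diversification
    strategy that most improves the score of the combined set."""
    chosen, remaining = [], list(strategies)
    best_score = 0.0
    while remaining and (max_size is None or len(chosen) < max_size):
        # Score every candidate extension of the current combination.
        scored = [(evaluate_mabo(chosen + [s]), s) for s in remaining]
        score, best = max(scored, key=lambda t: t[0])
        if score <= best_score:      # no candidate improves the score: stop
            break
        chosen.append(best)
        remaining.remove(best)
        best_score = score
    return chosen, best_score
```

In the paper, the ordering produced by such a search is then used to define the three configurations of Table 4 (single best strategy, "fast", and "quality").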
Int J Comput Vis (2013) 104:154–171 163
Table 5 Comparison of recall, Mean Average Best Overlap (MABO) and number of window locations for a variety of methods on the Pascal 2007
test set
Method Recall MABO No. of Windows
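The recall and MABO scores compared throughout this section are based on the standard Pascal intersection-over-union overlap. A minimal sketch of the computation (boxes as (x1, y1, x2, y2) tuples; an illustration, not the authors' code):

```python
def overlap(a, b):
    """Pascal overlap (intersection-over-union) of two boxes (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def abo(ground_truth, hypotheses):
    """Average Best Overlap: for each ground-truth box, take the best
    overlap achieved by any hypothesis, then average. MABO is the mean
    of the per-class ABO scores; recall counts ground-truth objects
    whose best overlap exceeds 0.5."""
    return sum(max(overlap(g, h) for h in hypotheses)
               for g in ground_truth) / len(ground_truth)
```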
5.2 Quality of Locations

In this section we evaluate our selective search algorithms in terms of both Average Best Overlap and the number of locations on the Pascal VOC 2007 test set. We first evaluate box-based locations and afterwards briefly evaluate region-based locations.

5.2.1 Box-Based Locations

We compare with the sliding window search of Harzallah et al. (2009), the sliding window search of Felzenszwalb et al. (2010) using the window ratios of their models, the jumping windows of Vedaldi (2009), the "objectness" boxes of Alexe et al. (2012), the boxes around the hierarchical segmentation algorithm of Arbeláez et al. (2011), the boxes around the regions of Endres and Hoiem (2010), and the boxes around the regions of Carreira and Sminchisescu (2010). Of these algorithms, only Arbeláez et al. (2011) is not designed for finding object locations. Yet Arbeláez et al. (2011) is one of the best contour detectors publicly available, and results in a natural hierarchy of regions. We include it in our evaluation to see if this algorithm, designed for segmentation, also performs well at finding good object locations. Furthermore, Carreira and Sminchisescu (2010) and Endres and Hoiem (2010) are designed to find good object regions rather than boxes. Results are shown in Table 5 and Fig. 4.

As shown in Table 5, our "Fast" and "Quality" selective search methods yield a close to optimal recall of 98 and 99 % respectively. In terms of MABO, we achieve 0.804 and 0.879 respectively. To appreciate what a Best Overlap of 0.879 means, Fig. 5 shows for bike, cow, and person an example location which has an overlap score between 0.874 and 0.884. This illustrates that our selective search yields high quality object locations.

Furthermore, note that the standard deviation of our MABO scores is relatively low: 0.046 for the fast selective search, and 0.039 for the quality selective search. This shows that selective search is robust to differences in object properties, and also to image conditions often related to specific objects (one example is indoor/outdoor lighting).

If we compare with other algorithms, the second highest recall is 0.940, achieved by the jumping windows (Vedaldi 2009) using 10,000 boxes per class. As we do not have the exact boxes, we were unable to obtain the MABO score. This is followed by the exhaustive search of Felzenszwalb et al. (2010), which achieves a recall of 0.933 and a MABO of 0.829 at 100,352 boxes per class (this number is the average over all classes). This is significantly lower than our method while using at least a factor of 10 more object locations.

Note furthermore that the segmentation methods of Carreira and Sminchisescu (2010) and Endres and Hoiem (2010) have a relatively high standard deviation. This illustrates that
Fig. 4 Trade-off between quality and quantity of the object hypotheses in terms of bounding boxes on the Pascal 2007 test set. The dashed lines are for those methods whose quantity is expressed in the number of boxes per class. In terms of recall "Fast" selective search has the best trade-off. In terms of Mean Average Best Overlap the "Quality" selective search is comparable with Carreira and Sminchisescu (2010) and Endres and Hoiem (2010), yet is much faster to compute and goes on longer, resulting in a higher final MABO of 0.879
a single strategy cannot work equally well for all classes. Instead, using multiple complementary strategies leads to more stable and reliable results.

If we compare the segmentation of Arbeláez et al. (2011) with the single best strategy of our method, they achieve a recall of 0.752 and a MABO of 0.649 at 418 boxes, while we achieve 0.875 recall and 0.698 MABO using 286 boxes. This suggests that a good segmentation algorithm does not automatically result in good object locations in terms of bounding boxes.

Figure 4 explores the trade-off between the quality and quantity of the object hypotheses. In terms of recall, our "Fast" method outperforms all other methods. The method of Harzallah et al. (2009) seems competitive for the 200 locations they use, but in their method the number of boxes is per class, while for our method the same boxes are used for all classes. In terms of MABO, both the object hypotheses generation methods of Carreira and Sminchisescu (2010) and Endres and Hoiem (2010) have a good quantity/quality trade-off for the up to 790 object-box locations per image they generate. However, these algorithms are computationally 114 and 59 times more expensive than our "Fast" method.

Interestingly, the "objectness" method of Alexe et al. (2012) performs quite well in terms of recall, but much worse in terms of MABO. This is most likely caused by their non-maximum suppression, which suppresses windows that have more than a 0.5 overlap score with an existing, higher ranked window. And while this significantly improves results when a 0.5 overlap score is the definition of finding an object, for the general problem of finding the highest quality locations this strategy is less effective and can even be harmful by eliminating better locations.

Figure 6 shows for several methods the Average Best Overlap per class. It shows that the exhaustive search of Felzenszwalb et al. (2010), which uses 10 times more locations that are class specific, performs similarly to our method for the classes bike, table, chair, and sofa; for the other classes our method yields the best score. In general, the classes with the highest scores are cat, dog, horse, and sofa, which are easy largely because the instances in the dataset tend to be
Fig. 6 Average Best Overlap per class (y-axis roughly 0.5 to 0.65) for Alexe et al., Endres and Hoiem, Carreira and Sminchisescu, Felzenszwalb et al., and selective search "Fast" and "Quality", over the 20 Pascal classes
big. The classes with the lowest scores are bottle, person, and plant, which are difficult because instances tend to be small. Nevertheless, cow, sheep, and tv are not bigger than person and yet can be found quite well by our algorithm.

To summarize, selective search is very effective in finding a high quality set of object hypotheses using a limited number of boxes, where the quality is reasonably consistent over the object classes. The methods of Carreira and Sminchisescu (2010) and Endres and Hoiem (2010) have a similar quality/quantity trade-off for up to 790 object locations. However, they have more variation over the object classes. Furthermore, they are at least 59 and 13 times more expensive to compute than our "Fast" and "Quality" selective search methods respectively, which is a problem for current dataset sizes for object recognition. In general, we conclude that selective search yields the best quality locations at 0.879 MABO while using a reasonable number of 10,097 class-independent object locations.

5.2.2 Region-Based Locations

In this section we examine how well the regions that our selective search generates capture object locations. We do this on the segmentation part of the Pascal VOC 2007 test set. We compare with the segmentation of Arbeláez et al. (2011) and with the object hypothesis regions of both Carreira and Sminchisescu (2010) and Endres and Hoiem (2010). Table 6 shows the results. Note that the number of regions is larger than the number of boxes as there are almost no exact duplicates.

The object regions of both Carreira and Sminchisescu (2010) and Endres and Hoiem (2010) are of similar quality as our "Fast" selective search: 0.665 MABO and 0.679 MABO respectively, where our "Fast" search yields 0.666 MABO. While Carreira and Sminchisescu (2010) and Endres and Hoiem (2010) use fewer regions, these algorithms are respectively 114 and 59 times computationally more expensive. Our "Quality" selective search generates 22,491 regions and is respectively 25 and 13 times faster than Carreira and Sminchisescu (2010) and Endres and Hoiem (2010), and has by far the highest score of 0.730 MABO.

Figure 7 shows the Average Best Overlap of the regions per class. For all classes except bike, our selective search consistently has relatively high ABO scores. The performance for bike is disproportionately lower for region-locations than for box-locations, because bike is a wire-frame object and hence very difficult to accurately delineate.

If we compare our method to others, the method of Endres and Hoiem (2010) is better for train; for the other classes our "Quality" method yields similar or better scores. For bird, boat, bus, chair, person, plant, and tv scores are 0.05 ABO better. For car we obtain 0.12 higher ABO and for bottle even 0.17 higher ABO. Looking at the variation in ABO scores in Table 6, we see that selective search has a slightly lower variation than the other methods: 0.093 MABO for "quality" and 0.108 for Endres and Hoiem (2010). However, this score is biased because of the wire-framed bicycle: without bicycle the difference becomes more apparent. The standard deviation for the "quality" selective search becomes 0.058, and 0.100 for Endres and Hoiem (2010). Again, this shows
Table 6 Comparison of algorithms to find a good set of potential object locations in terms of regions on the segmentation part of the Pascal 2007 test set

Method   Recall   MABO   No. of Regions   Time (s)
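The combinations reported in the lower part of Table 6 amount to pooling the hypothesis sets of the different methods and dropping exact duplicates (which, as noted in the text, are rare). A minimal sketch under that assumption, not the authors' code:

```python
def combine_hypothesis_sets(*hypothesis_sets):
    """Union of object hypothesis sets from complementary methods,
    keeping the first occurrence of each exact duplicate."""
    combined, seen = [], set()
    for hypotheses in hypothesis_sets:
        for h in hypotheses:
            key = tuple(h)          # hashable signature of the location
            if key not in seen:
                seen.add(key)
                combined.append(h)
    return combined
```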
Fig. 7 Average Best Overlap of the regions per class (y-axis: overlap scores up to roughly 0.6) for Carreira and Sminchisescu, Endres and Hoiem, and selective search "Fast" and "Quality", over the 20 Pascal classes
that relying on multiple complementary strategies instead of a single strategy yields more stable results.

Figure 8 shows several example segmentations from our method and from Carreira and Sminchisescu (2010) and Endres and Hoiem (2010). In the first image, the other methods have problems keeping the white label of the bottle and the book apart. In our case, one of our strategies ignores colour, while the "fill" similarity (Eq. 5) helps grouping the bottle and label together. The missing bottle part, which is dusty, is already merged with the table before this bottle segment is formed, hence "fill" will not help here. The second image is an example of a dark image on which our algorithm generally has strong results due to using a variety of colour spaces. In this particular image, the partially intensity invariant Lab colour space helps to isolate the car. As we do not use the contour detection method of Arbeláez et al. (2011), our method sometimes generates segments with an irregular border, which is illustrated by the third image of a cat. The final image shows a very difficult example, for which only Carreira and Sminchisescu (2010) provides an accurate segment.

Because of the nature of selective search, rather than pitting methods against each other, it is more interesting to see how they can complement each other. As Carreira and Sminchisescu (2010) and Endres and Hoiem (2010) both use very different algorithms, the combination should prove effective according to our diversification hypothesis. Indeed, as can be seen in the lower part of Table 6, combination with our "Fast" selective search leads to 0.737 MABO at 6,438 locations. This is a higher MABO using fewer locations than our "quality" selective search. A combination of Carreira and Sminchisescu (2010) and Endres and Hoiem (2010) with our "quality" sampling leads to 0.758 MABO at 25,355 locations. This is a good increase at only a modest extra number of locations.

To conclude, selective search is highly effective for generating object locations in terms of regions. The use of a variety of strategies makes it robust against various image conditions as well as the object class. The combination of Carreira and Sminchisescu (2010), Endres and Hoiem (2010) and our grouping algorithms into a single selective
Fig. 8 A qualitative comparison of selective search, Carreira and Sminchisescu (2010), and Endres and Hoiem (2010). For our method we observe: ignoring colour allows finding the bottle, multiple colour spaces help in dark images (car), and not using Arbeláez et al. (2011) sometimes results in irregular borders such as the cat
search showed promising improvements. Given these improvements, and given that there are many more different partitioning algorithms out there to use in a selective search, it will be interesting to see how far our selective search paradigm can still go in terms of computational efficiency, number of object locations, and the quality of object locations.

5.3 Object Recognition

In this section we evaluate our selective search strategy for object recognition using the Pascal VOC 2010 detection task.

Our selective search strategy enables the use of expensive and powerful image representations and machine learning techniques. In this section we use selective search inside the Bag-of-Words based object recognition framework described in Sect. 4. The reduced number of object locations compared to an exhaustive search makes it feasible to use such a strong Bag-of-Words implementation.

To give an indication of computational requirements: the pixel-wise extraction of three SIFT variants plus visual word assignment takes around 10 s and is done once per image. The final round of SVM learning takes around 8 h per class on a GPU for approximately 30,000 training examples (van de Sande et al. 2011) resulting from two rounds of mining negatives on Pascal VOC 2010. Mining hard negatives is done in parallel and takes around 11 h on 10 machines for a single round, which is around 40 s per image. This is divided into 30 s for counting visual word frequencies and 0.5 s per class for classification. Testing takes 40 s for extracting features, visual word assignment, and counting visual word frequencies, after which 0.5 s is needed per class for classification. For comparison, the code of Felzenszwalb et al. (2010) (without cascade, just like our version) needs slightly less than 4 s per image per class for testing. For the 20 Pascal classes this makes our framework faster during testing.

We evaluate results using the official evaluation server. This evaluation is independent as the test data has not been released. We compare with the top-4 of the competition. Note that while all methods in the top-4 are based on an exhaustive search using variations on the part-based model of Felzenszwalb et al. (2010) with HOG features, our method differs substantially by using selective search and Bag-of-Words features. Results are shown in Table 7.
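The test-time comparison for the 20 Pascal classes in Sect. 5.3 works out as simple arithmetic (numbers as reported in the text):

```python
n_classes = 20

# Bag-of-Words on selective search locations: features are computed once
# per image; only the 0.5 s classification step is repeated per class.
ours = 40.0 + 0.5 * n_classes   # 40 s features + 0.5 s per class

# Part-based exhaustive search (Felzenszwalb et al. 2010, no cascade):
# slightly under 4 s per image for every class, so ~4 s is an upper bound.
theirs = 4.0 * n_classes

print(ours, theirs)  # 50.0 80.0 -- faster at test time for 20 classes
```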
Table 7 Results from the Pascal VOC 2010 detection task test set

System                              Plane Bike  Bird  Boat  Bottle Bus   Car   Cat   Chair Cow   Table Dog   Horse Motor Person Plant Sheep Sofa  Train TV
NLPR                                .533  .553  .192  .210  .300   .544  .467  .412  .200  .315  .207  .303  .486  .553  .465   .102  .344  .265  .503  .403
MIT UCLA (Zhu et al. 2010)          .542  .485  .157  .192  .292   .555  .435  .417  .169  .285  .267  .309  .483  .550  .417   .097  .358  .308  .472  .408
NUS                                 .491  .524  .178  .120  .306   .535  .328  .373  .177  .306  .277  .295  .519  .563  .442   .096  .148  .279  .495  .384
UoCTTI (Felzenszwalb et al. 2010)   .524  .543  .130  .156  .351   .542  .491  .318  .155  .262  .135  .215  .454  .516  .475   .091  .351  .194  .466  .380
This paper                          .562  .424  .153  .126  .218   .493  .368  .461  .129  .321  .300  .365  .435  .529  .329   .153  .411  .318  .470  .448

Our method is the only object recognition system based on Bag-of-Words. It has the best scores for 9, mostly non-rigid, object categories, where the difference is up to 0.056 AP. The other methods are based on part-based HOG features, and perform better on most rigid object classes. Best results are displayed in bold
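The notes beneath Table 7 (best scores for 9 mostly non-rigid classes, with a difference of up to 0.056 AP) can be checked directly against the table entries; a small verification sketch using the numbers as printed:

```python
# AP per class from Table 7 (rows: system; columns: the 20 Pascal classes).
classes = ["plane", "bike", "bird", "boat", "bottle", "bus", "car", "cat",
           "chair", "cow", "table", "dog", "horse", "motor", "person",
           "plant", "sheep", "sofa", "train", "tv"]
ap = {
    "NLPR":       [.533, .553, .192, .210, .300, .544, .467, .412, .200, .315,
                   .207, .303, .486, .553, .465, .102, .344, .265, .503, .403],
    "MIT UCLA":   [.542, .485, .157, .192, .292, .555, .435, .417, .169, .285,
                   .267, .309, .483, .550, .417, .097, .358, .308, .472, .408],
    "NUS":        [.491, .524, .178, .120, .306, .535, .328, .373, .177, .306,
                   .277, .295, .519, .563, .442, .096, .148, .279, .495, .384],
    "UoCTTI":     [.524, .543, .130, .156, .351, .542, .491, .318, .155, .262,
                   .135, .215, .454, .516, .475, .091, .351, .194, .466, .380],
    "This paper": [.562, .424, .153, .126, .218, .493, .368, .461, .129, .321,
                   .300, .365, .435, .529, .329, .153, .411, .318, .470, .448],
}

# Classes where the Bag-of-Words system has the single best AP, with margin.
wins = {}
for i, cls in enumerate(classes):
    best_other = max(ap[s][i] for s in ap if s != "This paper")
    if ap["This paper"][i] > best_other:
        wins[cls] = round(ap["This paper"][i] - best_other, 3)

# 9 winning classes; the largest margin is 0.056 AP (dog).
```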
Table 8 Results for the ImageNet Large Scale Visual Recognition Challenge 2011 (ILSVRC2011)

Participant                      Flat error   Hierarchical error
University of Amsterdam (ours)   0.425        0.285
ISI lab., University of Tokyo    0.565        0.410

Hierarchical error penalises mistakes less if the predicted class is semantically similar to the real class according to the WordNet hierarchy

Table 9 Quality of locations on Pascal VOC 2012 train+val

Boxes train+val 2012      MABO    No. of Locations
"Fast"                    0.814   2,006
"Quality"                 0.886   10,681

Segments train+val 2012   MABO    No. of Locations
"Fast"                    0.512   3,482
"Quality"                 0.559   22,073

It is shown that our method yields the best results for the classes plane, cat, cow, table, dog, plant, sheep, sofa, and tv. Except for table, sofa, and tv, these classes are all non-rigid. This is expected, as Bag-of-Words is theoretically better suited for these classes than the HOG features. Indeed, for the rigid classes bike, bottle, bus, car, person, and train the HOG-based methods perform better. The exception is the rigid class tv. This is presumably because our selective search performs well in locating tvs, see Fig. 6.

In the Pascal 2011 challenge there are several entries which achieve significantly higher scores than our entry. These methods use Bag-of-Words as additional information on the locations found by their part-based model, yielding better detection accuracy. Interestingly, however, by using Bag-of-Words to detect locations our method achieves a higher total recall for many classes (Everingham et al. 2011).

Finally, our selective search enabled participation in the detection task of the ImageNet Large Scale Visual Recognition Challenge 2011 (ILSVRC2011), as shown in Table 8. This dataset contains 1,229,413 training images and 100,000 test images with 1,000 different object categories. Testing can be accelerated as features extracted from the locations of selective search can be reused for all classes. For example, using the fast Bag-of-Words framework of Uijlings et al. (2010), the time to extract SIFT descriptors plus two colour variants takes 6.7 s and assignment to visual words takes 1.7 s.¹ Using a 1 × 1, 2 × 2, and 3 × 3 spatial pyramid division, it takes 14 s to get all 172,032-dimensional features. Classification in a cascade on the pyramid levels then takes 0.3 s per class. For 1,000 classes, the total process then takes 323 s per image for testing. In contrast, using the part-based framework of Felzenszwalb et al. (2010) takes 3.9 s per class per image, resulting in 3,900 s per image for testing. This clearly shows that the reduced number of locations helps scaling towards more classes.

We conclude that compared to an exhaustive search, selective search enables the use of more expensive features and classifiers and scales better as the number of classes increases.

5.4 Pascal VOC 2012

Because Pascal VOC 2012 is the latest and perhaps final VOC dataset, we briefly present results on this dataset to facilitate comparison with our work in the future. We present the quality of boxes using the train+val set, the quality of segments on the segmentation part of train+val, and our localisation framework using a Spatial Pyramid of 1×1, 2×2, 3×3, and 4×4 on the test set using the official evaluation server.

Results for the location quality are presented in Table 9. We see that for the box-locations the results are slightly higher than on Pascal VOC 2007. For the segments, however, results are worse. This is mainly because the 2012 segmentation set is considerably more difficult.

For the 2012 detection challenge, the Mean Average Precision is 0.350. This is similar to the 0.351 MAP obtained on Pascal VOC 2010.

5.5 An Upper Bound of Location Quality

In this experiment we investigate how close our selective search locations are to the optimal locations in terms of recognition accuracy for Bag-of-Words features. We do this on the Pascal VOC 2007 test set.

The red line in Fig. 9 shows the MAP score of our object recognition system when the top n boxes of our "quality" selective search method are used. The performance starts at 0.283 MAP using the first 500 object locations with a MABO of 0.758. It rapidly increases to 0.356 MAP using the first 3,000 object locations with a MABO of 0.855, and then ends at 0.360 MAP using all 10,097 object locations with a MABO of 0.883.

The magenta line shows the performance of our object recognition system if we include the ground truth object locations in our hypotheses set, representing an object hypothesis set of "perfect" quality with a MABO score of 1. When only the ground truth boxes are used, a MAP of 0.592 is achieved, which is an upper bound of our object recognition system. However, this score rapidly declines to 0.437 MAP using as few as 500 locations per image. Remarkably, when all 10,097 boxes are used the performance drops to 0.377 MAP, only

¹ We found no difference in recognition accuracy when using the Random Forest assignment of Uijlings et al. (2010) or k-means nearest neighbour assignment of van de Sande et al. (2010) on the Pascal dataset.
References

Arbeláez, P., Maire, M., Fowlkes, C., & Malik, J. (2011). Contour detection and hierarchical image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(5), 898–916.
Carreira, J., & Sminchisescu, C. (2010). Constrained parametric min-cuts for automatic object segmentation. In CVPR.
Chum, O., & Zisserman, A. (2007). An exemplar model for learning object classes. In CVPR.
Comaniciu, D., & Meer, P. (2002). Mean shift: A robust approach toward feature space analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24, 603–619.
Csurka, G., Dance, C. R., Fan, L., Willamowski, J., & Bray, C. (2004). Visual categorization with bags of keypoints. In ECCV workshop on statistical learning in computer vision.
Dalal, N., & Triggs, B. (2005). Histograms of oriented gradients for human detection. In CVPR.
Endres, I., & Hoiem, D. (2010). Category independent object proposals. In ECCV.
Everingham, M., Gool, L. V., Williams, C., Winn, J., & Zisserman, A. (2011). The Pascal visual object classes challenge workshop: Overview and results of the detection challenge.
Everingham, M., van Gool, L., Williams, C. K. I., Winn, J., & Zisserman, A. (2010). The Pascal visual object classes (VOC) challenge. International Journal of Computer Vision, 88, 303–338.
Felzenszwalb, P. F., Girshick, R. B., McAllester, D., & Ramanan, D. (2010). Object detection with discriminatively trained part based models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32, 1627–1645.
Felzenszwalb, P. F., & Huttenlocher, D. P. (2004). Efficient graph-based image segmentation. International Journal of Computer Vision, 59, 167–181.
Geusebroek, J. M., van den Boomgaard, R., Smeulders, A. W. M., & Geerts, H. (2001). Color invariance. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23, 1338–1350.
Gu, C., Lim, J. J., Arbeláez, P., & Malik, J. (2009). Recognition using regions. In CVPR.
Harzallah, H., Jurie, F., & Schmid, C. (2009). Combining efficient object localization and image classification. In ICCV.
Lampert, C. H., Blaschko, M. B., & Hofmann, T. (2009). Efficient subwindow search: A branch and bound framework for object localization. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31, 2129–2142.
Lazebnik, S., Schmid, C., & Ponce, J. (2006). Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In CVPR.
Li, F., Carreira, J., & Sminchisescu, C. (2010). Object recognition as ranking holistic figure-ground hypotheses. In CVPR.
Liu, C., Sharan, L., Adelson, E. H., & Rosenholtz, R. (2010). Exploring features in a bayesian framework for material recognition. In CVPR.
Lowe, D. G. (2004). Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60, 91–110.
Maji, S., Berg, A. C., & Malik, J. (2008). Classification using intersection kernel support vector machines is efficient. In CVPR.
Maji, S., & Malik, J. (2009). Object detection using a max-margin hough transform. In CVPR.
Ojala, T., Pietikainen, M., & Maenpaa, T. (2002). Multiresolution gray-scale and rotation invariant texture classification with local binary patterns. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(7), 971–987.
Perronnin, F., Sánchez, J., & Thomas, M. (2010). Improving the Fisher kernel for large-scale image classification. In ECCV.
Shi, J., & Malik, J. (2000). Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22, 888–905.
Sivic, J., & Zisserman, A. (2003). Video Google: A text retrieval approach to object matching in videos. In ICCV.
Sonnenburg, S., Raetsch, G., Henschel, S., Widmer, C., Behr, J., Zien, A., et al. (2010). The SHOGUN machine learning toolbox. Journal of Machine Learning Research, 11, 1799–1802.
Tu, Z., Chen, X., Yuille, A. L., & Zhu, S. (2005). Image parsing: Unifying segmentation, detection and recognition. International Journal of Computer Vision (Marr Prize Issue).
Uijlings, J. R. R., Smeulders, A. W. M., & Scha, R. J. H. (2010). Real-time visual concept classification. IEEE Transactions on Multimedia, 12(7), 665–681.
van de Sande, K. E. A., & Gevers, T. (2012). Illumination-invariant descriptors for discriminative visual object categorization. Technical report, University of Amsterdam.
van de Sande, K. E. A., Gevers, T., & Snoek, C. G. M. (2010). Evaluating color descriptors for object and scene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32, 1582–1596.
van de Sande, K. E. A., Gevers, T., & Snoek, C. G. M. (2011). Empowering visual categorization with the GPU. IEEE Transactions on Multimedia, 13(1), 60–70.
Vedaldi, A., Gulshan, V., Varma, M., & Zisserman, A. (2009). Multiple kernels for object detection. In ICCV.
Viola, P., & Jones, M. (2001). Rapid object detection using a boosted cascade of simple features. In CVPR (Vol. 1, pp. 511–518).
Viola, P., & Jones, M. J. (2004). Robust real-time face detection. International Journal of Computer Vision, 57, 137–154.
Zhou, X., Kai, Y., Zhang, T., & Huang, T. S. (2010). Image classification using super-vector coding of local image descriptors. In ECCV.
Zhu, L., Chen, Y., Yuille, A., & Freeman, W. (2010). Latent hierarchical structural learning for object detection. In CVPR.