AI-Driven Fruit Detection
Abstract: This paper presents a novel approach to fruit detection using deep convolutional
neural networks. The aim is to build an accurate, fast and reliable fruit detection system, which
is a vital element of an autonomous agricultural robotic platform; it is a key element for fruit
yield estimation and automated harvesting. Recent work in deep neural networks has led to
the development of a state-of-the-art object detector termed Faster Region-based CNN (Faster
R-CNN). We adapt this model, through transfer learning, for the task of fruit detection using imagery
obtained from two modalities: colour (RGB) and Near-Infrared (NIR). Early and late fusion methods
are explored for combining the multi-modal (RGB and NIR) information. This leads to a novel
multi-modal Faster R-CNN model, which achieves state-of-the-art results compared to prior work: the F1 score, which takes into account both precision and recall, improves from 0.807 to 0.838 for the detection of sweet pepper. In addition to improved accuracy, this approach
is also much quicker to deploy for new fruits, as it requires bounding box annotation rather than
pixel-level annotation (annotating bounding boxes is approximately an order of magnitude quicker
to perform). The model is retrained to perform the detection of seven fruits, with the entire process
taking four hours to annotate and train the new model per fruit.
Keywords: visual fruit detection; deep convolutional neural network; multi-modal; rapid training;
real-time performance; harvesting robots; horticulture; agricultural robotics
1. Introduction
According to [1], sourcing skilled farm labour is one of the most cost-demanding factors in the agriculture industry (especially horticulture). This comes on top of the rising costs of inputs such as power, irrigation water and agrochemicals, which together put farm enterprises and the horticultural industry under pressure with small profit margins. Under these challenges, food production still needs to meet the demands of an ever-growing world population, and this poses a critical problem for the years to come.
Robotic harvesting can provide a potential solution to this problem by reducing the costs of labour
(longer endurance and high repeatability) and increasing fruit quality. For these reasons, there has
been growing interest in the use of agricultural robots for harvesting fruit and vegetables over the past
three decades [2,3]. The development of such platforms includes numerous challenging tasks, such
as manipulation and picking. However, the development of an accurate fruit detection system is a
crucial step toward fully-automated harvesting robots, as this is the front-end perception system before
subsequent manipulation and grasping systems; if fruit is not detected or seen, it cannot be picked.
This step is challenging due to various factors, among which are illumination variation, occlusions, as
well as the cases when the fruit exhibits a similar visual appearance to the background, as shown in
Figure 1. To overcome these challenges, a well-generalised model that is invariant and robust to brightness and viewpoint changes, together with highly discriminative feature representations, is required.
Figure 1. Example images of the detection of two fruits. (a) and (b) show sweet pepper detections (red bounding boxes) in a colour (RGB) image and a Near-Infrared (NIR) image, respectively. (c) and (d) show the detection of rock melon.
In this work, we present a rapid training (about 2 h on a K40 GPU) and real-time fruit
detection system based on Deep Convolutional Neural Networks (DCNN) that can generalise well
to various tasks with pre-trained parameters. It can also be easily adapted to different types of fruits
with a minimum number of training images. In addition, we introduce approaches that combine
multiple modalities of information (colour and near-infrared images) with early and late fusion.
For the evaluation, we demonstrate both quantitative and qualitative results compared to previous
work [4]. The contributions of this paper are therefore:
• Developing a high-performance fruit detection system that can be rapidly trained with a
small number of images using a DCNN that has been pre-trained on a large dataset, such as
ImageNet [5].
• Proposing multi-modal fusion approaches that combine information from colour (RGB) and
Near-Infrared (NIR) images, leading to state-of-the-art detection performance.
• Returning our findings to the community through open datasets and tutorial documentation [6].
To the best of our knowledge, this is the first attempt to fuse RGB and NIR multi-modal images
within a DCNN framework for fruit detection. We use standard evaluation metrics, precision-recall
curves and the F1 score [7] (i.e., the harmonic mean of precision and recall), to perform extensive
evaluations using data collected from three commercial sites acquired during day and night.
This dataset, along with the annotated ground truth imagery and labelling tool will be distributed
upon the publication of this work to encourage further research use in the relevant area.
The remainder of the paper is organised as follows. Section 2 introduces related work and
the background. Section 3 presents the descriptive comparisons between our previous works using
the Conditional Random Field (CRF) with hand-crafted features and the proposed approach using
Faster Region-based Convolutional Neural Network (R-CNN) for fruit detection. Multi-modal fusion
schemes are also addressed in this section. We demonstrate the experimental results in Section 4.
Conclusions are drawn in Section 6.
2. Related Work/Background
Although many researchers have tackled the problem of fruit detection, such as the works
presented in [8–13], the problem of creating a fast and reliable fruit detection system persists, as found
in the survey by [14]. This is due to high variation in the appearance of the fruits in field settings,
including colour, shape, size, texture and reflectance properties. Furthermore, in the majority of
these settings, the fruits are partially occluded and subject to continually-changing illumination and
shadow conditions.
Various works presented in the literature address the problem of fruit detection as an image
segmentation problem (i.e., fruit vs. background). Wang et al. [11] examined the issue of apple
detection for yield prediction. They developed a system that detected apples based on their colour and
distinctive specular reflection pattern. Further information, such as the average size of apples, was used
to either remove erroneous detections or to split regions that could contain multiple apples. Another
heuristic employed was to accept as detections only those regions that were mostly round. Bac et al. [12]
proposed a segmentation approach for sweet peppers. They used a six band multi-spectral camera
and used a range of features, including the raw multispectral data, normalised difference indices, as
well as entropy-based texture features. Experiments in a highly controlled glasshouse environment
showed that this approach produced reasonably accurate segmentation results. However, the authors
noted that it was not accurate enough to build a reliable obstacle map.
Hung et al. [13] proposed the use of conditional random fields for almond segmentation.
They proposed a five-class segmentation approach, which learned features using a Sparse
Autoencoder (SAE). These features were then used within a CRF framework and were shown to outperform previous work. They achieved impressive segmentation performance, but did not perform
object detection. Furthermore, they noted that occlusion presented a major challenge. Intuitively, such
an approach is only able to cope with low levels of occlusion.
More recently, Yamamoto et al. [10] performed tomato detection by first performing colour-based segmentation. Then, colour and shape features were used to train a Classification and Regression Trees (CART) classifier, which produced a segmentation map and grouped connected pixels into regions; each region was declared to be a detection. To reduce the number of false alarms, they trained a non-fruit classifier using a random forest, in controlled glasshouse environments.
In all of the above-mentioned works, a pixel-level segmentation approach for object detection
has been adopted, and most of these works have examined fruit detection predominantly for yield
estimation [8,11]. The limited studies that have conducted accurate fruit detection have done so for
fruits in controlled glasshouse environments. As such, the issue of fruit detection in highly challenging
conditions remains unsolved. This is due to the high variability in the appearance of the target objects in agricultural settings, which means that classic sliding-window approaches, although showing good performance when tested on datasets of selected images [15], cannot handle the variability in scale and appearance of the target objects when deployed in real farm settings.
Recently, deep neural networks have made considerable progress in object classification and
detection [5,16,17]. The state-of-the-art detection framework on PASCAL-VOC [18] consists of
two stages. The first stage of the pipeline applies a region proposal method, such as selective search [19] or EdgeBoxes [20], to extract regions of interest from an image; these regions are then fed to a deep neural network for classification. Although it has high recall performance, this pipeline is computationally expensive, which prevents it from being used in real time for a robotic application. Region Proposal Networks (RPNs) [21–23] solve this problem by combining a classification deep convolutional network with the object proposal network, so that the system can simultaneously predict object bounds and classify them at each position. The parameters of the two networks are shared, which results in much faster performance, making the approach suitable for robotic applications.
In real outdoor farm settings, a single sensor modality can rarely provide the needed information
to detect the target fruits under a wide range of variations in illumination, partial occlusions and
different appearances. This makes a great case for the use of multi-modal fruit detection systems
because varying types of sensors can provide complementary information regarding different aspects
of the fruits. Deep neural networks have already shown great promise when used for multi-modal
systems in domains outside agricultural automation, such as in [24], where audio/video has been used
very successfully, and in [25,26], where image/depth demonstrate a better performance compared to
the utilisation of each modality alone. This work follows the same approach and demonstrates the use
of a multi-modal region-based fruit detection system and how it outperforms pixel-level segmentation
systems, as we show in the following sections.
3. Methodologies
Fruit segmentation is an essential step in order to distinguish the fruits from the background
(leaves and stems). This task is challenging due to variation in fruit colour and illumination, as well as
high levels of occlusion.
In this section, we present the state-of-the-art fruit detection system [4], which performs pixel-wise
segmentation, against which we compare. We then describe the DCNN approach, Faster R-CNN, which
forms the basis of our proposed method. The details behind adapting this model for fruit detection are
then given, followed by a description of the fusion methods we propose for this DCNN architecture.
end of 13 convolutional nets. It can be observed that the filters have reddish and greenish colours that correspond to red and green sweet peppers; other filters represent edge filters in varying orientations. Figure 4b shows the input data layer, and Figure 4c shows one of the feature maps from the conv5 layer. It can be seen that the regions for red sweet peppers (cyan boxes) are strongly activated, and this information is highly useful for the RPN and the subsequent classification process.
[Figure 3 diagram: RGB input image → 13 convolutional layers → feature map → Region Proposal Network (RoIs) → RoI pooling → Fc6, Fc7 → softmax classification and bounding box prediction → NMS → output O1:N.]
Figure 3. Illustration of the Faster Region-based Convolutional Neural Network (R-CNN) at test time. There are 13 convolutional layers, 2 fully-connected layers (Fc6 and Fc7) and one softmax classifier layer. N denotes the number of proposals and is set to 300. O1:N is the output that contains N bounding boxes and their scores. Non-Maximum Suppression (NMS) with a threshold of 0.3 removes duplicate predictions. BK is the bounding box of the K-th detection, a 4 × 1 vector containing the coordinates of the top-left and bottom-right points. xK is a scalar representing an object being detected.
Figure 4. (a) The 64 conv1 filters (3 × 3 pixels) of the RGB network from VGG; (b) the input data; and (c) one of the feature activations from the conv5 layer. The cyan boxes in (b) are manually labelled in the data input layer to highlight the corresponding fruits of the feature map.
[Scatter plot: t-SNE embedding of Fc7 features; legend classes: background, capsicum, rockmelon.]
Figure 5. t-SNE feature visualisation of 3 classes. The 4 k-dimensional features are extracted from the Fc7 layer and visualised in 2D. For the visualisation, 86 images are randomly selected from the dataset and processed by the network shown in Figure 3.
Note that the key contribution of this model (VGG-16) is in demonstrating that the depth of the network plays a significant role in detection performance; despite its slightly inferior classification power, the features generated from its architecture outperform those of other state-of-the-art networks, such as AlexNet [17], ZF [33] and GoogLeNet [34]. It is, therefore, at the time of writing this article, the most popular choice in the computer vision and machine learning communities for the front-end feature extraction module. Faster R-CNN also makes use of these feature maps as guidance for where to look. We present how to train the VGG-16 net and deploy it for fruit detection in the following section.
available online [36], and we have also made our implementation and tutorial document publicly available [6].
Table 1 shows the number of training images used by the CRF and Faster R-CNN for the performance evaluation. We can only use a relatively small number of images due to the limited pixel-wise image annotation available from [4]. For a fair comparison, the same training and testing images are utilised, and the experimental results are presented in Section 4.2. We also conduct further experiments, increasing the number of classes and training images, to detect another fruit and to demonstrate the generalisation of the approach.
Table 1. Number of images used for training and testing for CRF and Faster R-CNN.

                           Train (RGB + NIR)    Test (RGB + NIR)    Total
CRF and Faster R-CNN       100 (82%)            22 (18%)            122
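The fine-tuning workflow summarised above (a detector pre-trained on a large generic dataset, adapted to a fruit class with on the order of a hundred bounding-box-annotated images) can be sketched as follows. The original system was built on a Caffe-based Faster R-CNN implementation with a VGG-16 backbone [36]; the snippet below uses the torchvision API with a ResNet-50 backbone purely as an illustrative stand-in, and the synthetic training example is a placeholder for the annotated farm images.

```python
import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

NUM_CLASSES = 2  # background + sweet pepper

# Start from a detector pre-trained on a large generic dataset (COCO here),
# analogous to initialising VGG-16 from ImageNet in the paper.
model = fasterrcnn_resnet50_fpn(weights="DEFAULT")

# Replace the classification head so it predicts the fruit classes.
in_features = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, NUM_CLASSES)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device).train()
optimizer = torch.optim.SGD(model.parameters(), lr=0.005, momentum=0.9)

# A single synthetic training example; a real pipeline would iterate over the
# bounding-box-annotated farm images instead.
images = [torch.rand(3, 480, 640, device=device)]
targets = [{"boxes": torch.tensor([[100.0, 120.0, 180.0, 200.0]], device=device),
            "labels": torch.tensor([1], device=device)}]

for step in range(5):
    losses = model(images, targets)      # dict of RPN and detection-head losses
    loss = sum(losses.values())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```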
After the training, we deploy the trained fruit detector on a laptop that has an Intel i7 64-bit 2.90 GHz quad-core CPU, a GeForce GTX 980M 8 GB GPU (1536 CUDA cores) and 16 GB of memory, running Ubuntu 14.04 Linux. Input images are obtained from a multi-spectral camera, the JAI AD-130GE, and a Microsoft Kinect 2, with resolutions of 1296 × 964 and 1920 × 1080, respectively. Processing for the detection takes an average of 341 ms with a 4 ms standard deviation for the JAI images and 393 ms with a 3 ms standard deviation for the Kinect 2 images. The processing time gap is caused by an external library used for reading images of different resolutions.
$$s_p = \sum_{m=1}^{M} s_{m,p} \qquad (1)$$
to the NIR channel’s wavelength (750–1400 nm). This early fusion network is then fine-tuned as
previously described.
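A common way to realise the 4-channel early-fusion input while still reusing ImageNet-pretrained weights is to keep the pretrained 3-channel first-layer filters and initialise the extra NIR channel from them, for example from the red channel, whose wavelength is closest to NIR. The sketch below illustrates this idea on a VGG-16 first convolution layer using PyTorch; the initialisation choice is an assumption for illustration, not the authors' exact Caffe procedure.

```python
import torch
import torch.nn as nn
from torchvision.models import vgg16

# Pre-trained VGG-16; its first conv layer expects 3-channel (RGB) input.
backbone = vgg16(weights="DEFAULT")
old_conv1 = backbone.features[0]            # Conv2d(3, 64, kernel_size=3, padding=1)

# New first layer accepting 4 channels (R, G, B, NIR).
new_conv1 = nn.Conv2d(4, old_conv1.out_channels,
                      kernel_size=old_conv1.kernel_size,
                      stride=old_conv1.stride,
                      padding=old_conv1.padding)

with torch.no_grad():
    # Keep the pretrained RGB filters unchanged.
    new_conv1.weight[:, :3] = old_conv1.weight
    # Initialise the NIR channel from the red-channel filters (index 0),
    # assuming red is spectrally closest to NIR.
    new_conv1.weight[:, 3] = old_conv1.weight[:, 0]
    new_conv1.bias.copy_(old_conv1.bias)

backbone.features[0] = new_conv1
# The whole network is then fine-tuned on 4-channel (RGB + NIR) images.
```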
[Figure 6 diagram: (a) an early fusion Faster R-CNN taking a 4-channel (RGB + NIR) input, whose first layer differs from that of Figure 3; (b) a late fusion scheme that combines the outputs of a 3-channel RGB network and a 1-channel NIR network.]
Figure 6. A diagram of the early and late fusion networks. (a) The early fusion network concatenates a 1-channel NIR image with a 3-channel RGB image; (b) the late fusion network stacks the outputs, O^{RGB+NIR}_{1:2N}, from two Faster R-CNN networks. O^{RGB}_{1:N} and O^{NIR}_{1:N} represent the outputs containing N = 300 bounding boxes and their scores from the RGB and NIR networks, respectively. K is the number of objects being detected. Note that the Faster R-CNNs of the late fusion are identical to that of Figure 3, whereas the first layer of the early fusion network differs in order to cope with the 4-channel input.
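The late fusion in Figure 6b stacks the N detections from each modality-specific network into a single list of 2N candidates. One plausible way to realise this stacking, shown below as a minimal NumPy sketch reusing the `nms` helper from the earlier sketch, is to pool the boxes and scores of both networks and suppress cross-modality duplicates; the exact merging rule used by the authors (for instance, whether the scores of overlapping detections are summed as in Equation (1)) is not reproduced here.

```python
import numpy as np

def late_fusion(boxes_rgb, scores_rgb, boxes_nir, scores_nir, iou_thresh=0.3):
    """Combine the outputs of the RGB and NIR detectors.

    Each input is the output of one Faster R-CNN network: boxes_* is (N, 4),
    scores_* is (N,). The 2N candidates are pooled and cross-modality
    duplicates are removed with a further NMS pass.
    """
    boxes = np.vstack([boxes_rgb, boxes_nir])        # (2N, 4)
    scores = np.concatenate([scores_rgb, scores_nir])
    keep = nms(boxes, scores, iou_thresh)            # greedy NMS from the earlier sketch
    return boxes[keep], scores[keep]
```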
4. Experimental Results
In this section, we qualitatively and quantitatively evaluate our proposed method in five experimental settings: (1) we compare the early and late fusion performance; (2) we compare the performance of the baseline algorithm (CRF) and the proposed method; (3) we inspect the performance of the RPN; (4) we examine the generalisation of the proposed method by performing experiments under spatially and temporally independent conditions; (5) we evaluate the extensibility of the proposed approach by applying it to several other fruits.
Prior to presenting the experimental results, we describe the creation of the ground truth for the dataset. Figure 7a,b depict hand-labelled bounding boxes (yellow) based on the colour image and the NIR image, respectively. In Figure 7b, the cyan box, which is missing from Figure 7a, highlights the missing annotation of a sweet pepper in the NIR image due to its poor visibility, whereas the sweet pepper is more obvious in the RGB image because of the reflection from its surface. This also happens the other way around: a fruit in the dark is difficult to see in the RGB image, but can be identified easily in the NIR image. We thus merge these two ground truth sources, using both RGB and NIR images, by computing the pairwise Intersection over Union (IoU) of the bounding boxes, as shown in Figure 7c. The remainder of this article refers to the merged ground truth as the merged GT, and the other two ground truths are referred to as the RGB GT and NIR GT, based on the image sources used for making the ground truth.
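The merging of the RGB GT and NIR GT described above can be sketched as follows: compute the pairwise IoU between the two sets of annotations, treat strongly overlapping pairs as the same fruit, and keep boxes that appear in only one modality. The function names and the 0.5 merge threshold below are illustrative assumptions, not the authors' exact procedure.

```python
import numpy as np

def iou(a, b):
    """IoU of two boxes given as [x1, y1, x2, y2]."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def merge_ground_truth(rgb_boxes, nir_boxes, iou_thresh=0.5):
    """Union of the two annotation sets; boxes that overlap strongly are
    counted once (the RGB annotation is kept for matched pairs)."""
    merged = list(rgb_boxes)
    for nb in nir_boxes:
        if all(iou(nb, rb) < iou_thresh for rb in rgb_boxes):
            merged.append(nb)      # fruit visible only in the NIR annotation
    return merged
```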
In this paper, we utilise the precision-recall curve with the corresponding F1 score as the evaluation metric for fruit detection. A fruit is considered detected if the IoU between the prediction and ground truth bounding boxes is greater than 0.4, following [5]. It is worth noting that we choose a threshold smaller than that of the ImageNet challenge (0.5) due to the relatively small fruit size with respect to the image resolution. Although the threshold affects the performance evaluations (the smaller the threshold, the higher the F1 score produced), we consistently use the identical threshold for all experiments and comparisons presented in this paper.
Figure 7. (a,b) The hand-labelled ground truth using an RGB image and an NIR image respectively;
(c) A merged ground truth bounding box. The cyan box displays a bounding box that is correctly
annotated using the RGB image, but missed in the NIR image, due to the poor visibility of a fruit.
Given this threshold, the precision (P), recall (R) and F1 score are computed as:

$$P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}, \qquad F_1 = \frac{2 \cdot P \cdot R}{P + R} \qquad (2)$$
where TP is the number of true positives (correct detections), FP is the number of false positives
(false detection), FN is the number of false negatives (miss) and TN is the number of true
negatives (correct rejection).
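For concreteness, the evaluation protocol above can be written in a few lines: a prediction counts as a true positive if its IoU with an unmatched ground-truth box exceeds 0.4, and the resulting counts feed Equation (2). The greedy matching below (which reuses the `iou` helper from the previous sketch) is a simplified illustration, not necessarily the exact evaluation script used by the authors.

```python
def evaluate(pred_boxes, gt_boxes, iou_thresh=0.4):
    """Greedily match predictions to ground truth, then apply Equation (2)."""
    matched_gt = set()
    tp = 0
    for pb in pred_boxes:                      # ideally sorted by descending score
        best_j, best_iou = None, iou_thresh
        for j, gb in enumerate(gt_boxes):
            if j not in matched_gt and iou(pb, gb) > best_iou:
                best_j, best_iou = j, iou(pb, gb)
        if best_j is not None:                 # IoU > 0.4 with an unmatched GT box
            matched_gt.add(best_j)
            tp += 1
    fp = len(pred_boxes) - tp                  # unmatched predictions
    fn = len(gt_boxes) - tp                    # missed fruits
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```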
4.2. Fruit Detection Performance Comparison with CRF and Faster R-CNN
As previously mentioned in Section 3.1, fruit detection performance evaluation is conducted
between CRF and the fine-tuned Faster R-CNN. We use the same training and test settings as described
in Table 1. The only difference is that the pixel-annotated training set is utilised as the ground truth for
CRF training, while bounding box annotations are used for Faster R-CNN (see Figure 2). The ground
truth for the test images remains identical. We should note that the output of the CRF is a pixel-level likelihood map representing how likely each pixel is to belong to a specific label. In order to have a fair comparison with the bounding box outputs of Faster R-CNN, we use a Laplacian of Gaussian (LoG) multi-scale blob detector [37] for the CRF-based method to produce detected fruit regions (i.e., bounding boxes).
Figure 9 shows the precision-recall curves for the CRF and fused networks. The CRF has a performance similar to early fusion, but does not reach the performance of late fusion (shown in Table 3). Note that we can only compute the F1 score of the CRF at a point where precision and recall are slightly different (see the black markers in Figure 9), because at the equilibrium point the denominators of precision and recall from Equation (2) are all zero. This implies that there are no sweet peppers in the ground truth (TP = 0); therefore, no false detections (FP = 0) or misdetections (FN = 0) are reported.
Although the CRF shows impressive performance, it has a couple of challenges: the difficulty of pixel-level ground truthing and the large processing time. For example, producing the results shown in Figure 9 takes 331 s/frame with a 17 s standard deviation for featurisation, which extracts and prepares features for the subsequent detection, and 0.819 s with a 0.067 s standard deviation for the detection, while Faster R-CNN runs at 393 ms/frame including all procedures (842-times faster than the CRF).
Unfortunately, we are unable to measure the time spent on pixel-level and bounding box annotation, because it is highly subjective with the human-in-the-loop and because of the institutes' internal ethics and integrity requirements. However, from empirical experience, bounding box annotation is much faster than pixel-level annotation.
Figure 8. Precision-recall curves of four networks. The marks indicate the point where precision and
recall are identical, and F1 scores are computed at these points.
Figure 9. Precision-recall curves of the CRF baseline, early and late fusion networks. All make use of
RGB and NIR images as inputs. Due to the performance issue of CRF, we calculate the F1 score slightly
offset from the equilibrium point.
Network \ # Proposals    10       50       100      300      500      1000
RGB network (in s)       0.305    0.315    0.325    0.347    0.367    0.425
Early fusion (in s)      0.263    0.268    0.291    0.309    0.317    0.374
[Plot: detection rate versus the number of object proposals (10, 50, 100, 300, 500, 1000) for the RGB network, NIR network, early fusion and late fusion.]
Figure 10. Performance evaluation of the region proposals of four different networks.
[Plots: (a) precision-recall curves for 1, 10, 25, 50 and 100 training images; (b) F1 score versus the number of training images.]
Figure 11. (a) Precision-recall curves with a varying number of training images, as denoted by different colours. The marks indicate the points where precision and recall are identical; (b) The F1 scores versus the number of images used for fine-tuning.
Table 5. Number of images used for training and testing for different fruits.

Name of Fruit     Train (# Images), 80%    Test (# Images), 20%    Total, 100%
Sweet pepper      100                      22                      122
Rock melon        109                      26                      135
Apple             51                       13                      64
Avocado           43                       11                      54
Mango             136                      34                      170
Orange            45                       12                      57
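Table 5 reports roughly 80%/20% train/test splits per fruit. A random partition of the annotated image list, as in the sketch below, is one straightforward way to produce such a split; the file names, seed and rounding are illustrative, and the exact split procedure used for Table 5 is not specified here.

```python
import random

def split_dataset(image_names, train_fraction=0.8, seed=0):
    """Randomly partition annotated images into training and test sets."""
    names = list(image_names)
    random.Random(seed).shuffle(names)
    n_train = round(train_fraction * len(names))
    return names[:n_train], names[n_train:]

# Example with 122 hypothetical sweet pepper image names; the splits reported
# in Table 5 are fixed by the authors, this is only an illustrative partition.
train, test = split_dataset([f"sweetpepper_{i:03d}.png" for i in range(122)])
```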
Figure 12. Instances of detection performance using the same camera setup as the training dataset
and the same location. Above each detection is the classification confidence output from the DCNN.
(a,b) The outputs from the RGB and NIR networks, respectively. It can be seen that there are noticeable FNs (misses) in the NIR image, and that colour and surface reflections play important roles in detection for this example.
Figure 13. Instances of detection performance using a different camera setup (Kinect 2) and a
different location. Above each detection is the classification confidence output from the DCNN.
(a,b) Without/with a Sun screen shed, respectively. Despite the brightness being obviously different in
the two scenes, the proposed algorithm impressively generalises well to this dataset.
[Plot: precision-recall curves per fruit, with F1 scores: rock melon 0.848, strawberry 0.948, apple 0.938, avocado 0.932, mango 0.942, orange 0.915, sweet pepper 0.828.]
Figure 14. Quantitative performance evaluation for different fruits. The marks indicate the point where
F1 scores are computed.
Figure 15. Four instances of sweet pepper detection. (a) and (b) are obtained from a farm site using a
JAI camera, and (c) and (d) are collected using a Kinect 2 camera at a different farm site. Above each
detection is the classification confidence output from the DCNN.
Figure 16. Four instances of rock melon detection. (a) and (b) are obtained from a farm site using a JAI
camera, and (c) and (d) are from Google Images. Above each detection is the classification confidence
output from the DCNN.
Figure 17. Eight instances of red (a,e–h) and green (b–d) apple detection (different varieties). Images are obtained from Google Images. Above each detection is the classification confidence output from the DCNN.
Figure 18. Eight instances (a–h) of avocado detection (varying levels of ripeness). Images are obtained from Google Images.
Figure 19. Eight instances (a–h) of mango detection (varying levels of ripeness). Images are obtained from Google Images.
Figure 20. Eight instances (a–h) of orange detection (varying levels of ripeness). Images are obtained from Google Images.
Figure 21. Eight instances (a–h) of strawberry detection (varying levels of ripeness). Images are obtained from Google Images.
Figure 22. Detection result when two fruits are present in the scene. Two images are manually cropped,
stitched and then fed to the RGB network.
In the last experiment, we evaluate the ability of our detector to find several fruits in the same scene, as depicted in Figure 22 (using a multi-class detector). There are five fruits in total in the scene, three sweet peppers and two rock melons, and TP = 4 (hits), FP = 3 (false detections) and FN = 1 (miss). In this example, precision is 0.57 with a 0.8 recall rate at a score threshold of 0.8. Note that all FPs have relatively low scores (i.e., lower than 0.85), whereas the fruits being detected all score above 0.9. If the score threshold is set to 0.85, then precision becomes 1.0 with a 0.8 recall rate. This score threshold should be properly adjusted depending on the circumstances and applications.
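The effect of the score threshold in this example can be checked directly by filtering the detections by confidence before counting. The scores below are hypothetical values consistent with the description above (true positives above 0.9, false positives between 0.8 and 0.85), used only to reproduce the quoted precision and recall.

```python
def precision_recall(scores, is_true_positive, n_gt, threshold):
    """Precision/recall after discarding detections scored below `threshold`."""
    kept = [t for s, t in zip(scores, is_true_positive) if s >= threshold]
    tp = sum(kept)
    fp = len(kept) - tp
    precision = tp / (tp + fp) if kept else 0.0
    recall = tp / n_gt
    return precision, recall

# Hypothetical scores for the 7 detections in Figure 22: 4 true positives
# above 0.9 and 3 false positives below 0.85, with 5 fruits in the scene.
scores = [0.95, 0.93, 0.92, 0.91, 0.84, 0.83, 0.81]
is_tp  = [True, True, True, True, False, False, False]

print(precision_recall(scores, is_tp, n_gt=5, threshold=0.80))  # ~ (0.57, 0.8)
print(precision_recall(scores, is_tp, n_gt=5, threshold=0.85))  # (1.0, 0.8)
```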
In order to deploy the developed system into real robot systems (e.g., unmanned ground vehicles),
the only bottleneck is processing performance, which requires an NVIDIA GPU (Graphics Processing
Unit) device that has more than 8 GB of memory for model testing.
6. Conclusions
We have presented approaches for a vision-based fruit detection system that achieves an F1 score of up to 0.83 on a field farm dataset, while maintaining fast detection and a low burden for ground truth annotation. This is a competitive result compared to our previous pixel-based detector, which scored 0.80. We also demonstrated qualitative results showing how well the model, trained using a small dataset, generalises to entirely independent (unseen) environments.
In developing this system, we performed fine-tuning of the VGG16 network based on
the pre-trained ImageNet model. The novel use of RGB and NIR multi-modal information
within early and late fusion networks provides improvements over a single DCNN. Furthermore,
we investigated the performance of region proposal networks to narrow down a possible bottleneck of
performance degradation. Our findings are returned to the relevant communities through an open
dataset and tutorial documentation.
Future work involves the integration of the proposed algorithm with our custom-built harvesting robot and the collection of a much larger set of ground truth annotations for a variety of fruits, by utilising Amazon Mechanical Turk or other outsourcing services, in order to achieve more accurate performance.
Acknowledgments: The authors would like to thank Chris Lehnert, Andrew English and David Hall, for their
invaluable assistance with data collection, and Raymond Russell and Jason Kulk of the QUT Strategic Investment
in Farm Robotics Program for their key contributions to the design and fabrication of the harvesting platform.
We also would like to acknowledge Elio Jovicich and Heidi Wiggenhauser from the Queensland Department of
Agriculture and Fisheries for their advice and support during the data collection field trips.
Author Contributions: Inkyu Sa contributed to the development of the systems, including implementation, farm
site data collection and the manuscript writing. ZongYan Ge contributed to the development of the systems and
manuscript writing. Feras Dayoub provided significant suggestions on the development and contributed to the
manuscript writing and performance evaluation. Chris McCool contributed to the system development. Inkyu Sa,
Feras Dayoub, ZongYan Ge and Chris McCool analysed the results. Chris McCool supported field data collection
and contributed to the manuscript writing. All authors wrote the manuscript together, and Ben Upcroft and
Tristan Perez guided the whole study.
Conflicts of Interest: The authors declare no conflict of interest.
Abbreviations
The following abbreviations are used in this manuscript:
References
1. ABARE. Australian Vegetable Growing Farms: An Economic Survey, 2013–14 and 2014–15; Research report;
Australian Bureau of Agricultural and Resource Economics (ABARE): Canberra, Australia, 2015.
2. Kondo, N.; Monta, M.; Noguchi, N. Agricultural Robots: Mechanisms and Practice; Trans Pacific Press: Balwyn
North Victoria, Australia, 2011.
3. Bac, C.W.; van Henten, E.J.; Hemming, J.; Edan, Y. Harvesting Robots for High-Value Crops: State-of-the-Art
Review and Challenges Ahead. J. Field Robot. 2014, 31, 888–911.
4. McCool, C.; Sa, I.; Dayoub, F.; Lehnert, C.; Perez, T.; Upcroft, B. Visual Detection of Occluded Crop:
For automated harvesting. In Proceedings of the International Conference on Robotics and Automation,
Stockholm, Sweden, 16–21 May 2016.
5. Russakovsky, O.; Deng, J.; Su, H.; Krause, J.; Satheesh, S.; Ma, S.; Huang, Z.; Karpathy, A.; Khosla, A.;
Bernstein, M.; et al. Imagenet large scale visual recognition challenge. Int. J. Comput. Vis. 2015, 115, 211–252.
6. Ge, Z.Y.; Sa, I. Open datasets and tutorial documentation, 2016. Available online: http://goo.gl/9LmmOU
(accessed on 31 July 2016).
7. Wikipedia. F1 Score, 2016. Available online: https://en.wikipedia.org/wiki/F1_score (accessed on
31 July 2016).
8. Nuske, S.T.; Achar, S.; Bates, T.; Narasimhan, S.G.; Singh, S. Yield Estimation in Vineyards by Visual
Grape Detection. In Proceedings of the 2011 IEEE/RSJ International Conference on Intelligent Robots and
Systems (IROS ’11), San Francisco, CA, USA, 25–30 September 2011.
9. Nuske, S.; Wilshusen, K.; Achar, S.; Yoder, L.; Narasimhan, S.; Singh, S. Automated visual yield estimation
in vineyards. J. Field Robot. 2014, 31, 837–860.
10. Yamamoto, K.; Guo, W.; Yoshioka, Y.; Ninomiya, S. On plant detection of intact tomato fruits using image
analysis and machine learning methods. Sensors 2014, 14, 12191–12206.
11. Wang, Q.; Nuske, S.T.; Bergerman, M.; Singh, S. Automated Crop Yield Estimation for Apple Orchards.
In Proceedings of the 13th International Symposium on Experimental Robotics (ISER 2012), Québec City, QC,
Canada, 17–22 June 2012.
12. Bac, C.W.; Hemming, J.; van Henten, E.J. Robust pixel-based classification of obstacles for robotic harvesting
of sweet-pepper. Comput. Electron. Agric. 2013, 96, 148–162.
13. Hung, C.; Nieto, J.; Taylor, Z.; Underwood, J.; Sukkarieh, S. Orchard fruit segmentation using multi-spectral
feature learning. In Proceedings of the 2013 IEEE/RSJ International Conference on Intelligent Robots and
Systems (IROS), Tokyo, Japan, 3–7 November 2013; pp. 5314–5320.
14. Kapach, K.; Barnea, E.; Mairon, R.; Edan, Y.; Ben-Shahar, O. Computer vision for fruit harvesting robots-state
of the art and challenges ahead. Int. J. Comput. Vis. Robot. 2012, 3, 4–34.
15. Song, Y.; Glasbey, C.; Horgan, G.; Polder, G.; Dieleman, J.; van der Heijden, G. Automatic fruit recognition
and counting from multiple images. Biosyst. Eng. 2014, 118, 203–215.
16. Simonyan, K.; Zisserman, A. Two-stream convolutional networks for action recognition in videos.
In Proceedings of the Advances in Neural Information Processing Systems, Montréal, QC, Canada,
8–13 December 2014; pp. 568–576.
17. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. Imagenet classification with deep convolutional neural networks.
In Proceedings of the Advances in Neural Information Processing Systems, Tahoe City, CA, USA,
3–8 December 2012; pp. 1097–1105.
18. Everingham, M.; Eslami, S.M.A.; van Gool, L.; Williams, C.K.I.; Winn, J.; Zisserman, A. The pascal visual
object classes challenge: A retrospective. Int. J. Comput. Vis. 2015, 111, 98–136.
19. Uijlings, J.R.; van de Sande, K.E.; Gevers, T.; Smeulders, A.W. Selective search for object recognition. Int. J.
Comput. Vis. 2013, 104, 154–171.
20. Zitnick, C.L.; Dollár, P. Edge boxes: Locating object proposals from edges. In Computer Vision–ECCV 2014;
Springer: Zurich, Switzerland, 2014; pp. 391–405.
21. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region
proposal networks. In Proceedings of the Advances in Neural Information Processing Systems, Montréal,
QC, Canada, 7–12 December 2015; pp. 91–99.
22. He, K.; Zhang, X.; Ren, S.; Sun, J. Spatial pyramid pooling in deep convolutional networks for
visual recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 37, 1904–1916.
23. Girshick, R. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, Santiago,
Chile, 13–16 December 2015; pp. 1440–1448.
24. Ngiam, J.; Khosla, A.; Kim, M.; Nam, J.; Lee, H.; Ng, A.Y. Multimodal deep learning. In Proceedings of
the 28th international conference on machine learning (ICML-11), Bellevue, WA, USA, 28 June–2 July 2011;
pp. 689–696.
25. Eitel, A.; Springenberg, J.T.; Spinello, L.; Riedmiller, M.; Burgard, W. Multimodal deep learning for robust
RGB-D object recognition. In Proceedings of the 2015 IEEE/RSJ International Conference on Intelligent
Robots and Systems (IROS), Hamburg, Germany, 28 September–2 October 2015; pp. 681–687.
26. Lenz, I.; Lee, H.; Saxena, A. Deep learning for detecting robotic grasps. Int. J. Robot. Res. 2015, 34, 705–724.
27. Domke, J. Learning graphical model parameters with approximate marginal inference. IEEE Trans. Pattern
Anal. Mach. Intell. 2013, 35, 2454–2467.
28. Ojala, T.; Pietikainen, M.; Maenpaa, T. Multiresolution gray-scale and rotation invariant texture classification
with local binary patterns. IEEE Trans. Pattern Anal. Mach. Intell. 2002, 24, 971–987.
29. Dalal, N.; Triggs, B. Histograms of oriented gradients for human detection. In Proceedings of the IEEE
Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2005), San Diego, CA,
USA, 25 June 2005; Volume 1, pp. 886–893.
30. Simonyan, K.; Zisserman, A. Very Deep Convolutional Networks for Large-Scale Image Recognition, 2014.
Available online: https://arxiv.org/abs/1409.1556 (accessed on 31 July 2016).
31. Donahue, J.; Jia, Y.; Vinyals, O.; Hoffman, J.; Zhang, N.; Tzeng, E.; Darrell, T. DeCAF: A Deep Convolutional
Activation Feature for Generic Visual Recognition, 2013. Available online: https://arxiv.org/abs/1310.1531
(accessed on 31 July 2016).
32. Van der Maaten, L.; Hinton, G. Visualizing data using t-SNE. J. Mach. Learn. Res. 2008, 9, 2579–2605.
33. Zeiler, M.D.; Fergus, R. Visualizing and understanding convolutional networks. In Computer
Vision–ECCV 2014; Springer: Zurich, Switzerland, 2014; pp. 818–833.
34. Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; Rabinovich, A.
Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition, Boston, MA, USA, 7–12 June 2015; pp. 1–9.
35. Stanford University. CS231n: Convolutional Neural Networks for Visual Recognition (2016). Available
online: http://cs231n.github.io/transfer-learning/ (accessed on 31 July 2016).
36. University of California, Berkeley. Fine-Tuning CaffeNet for Style Recognition on Flickr Style Data (2016),
2016. Available online: http://caffe.berkeleyvision.org/gathered/examples/finetune_flickr_style.html
(accessed on 31 July 2016).
37. Lindeberg, T. Detecting salient blob-like image structures and their scales with a scale-space primal sketch:
A method for focus-of-attention. Int. J. Comput. Vis. 1993, 11, 283–318.
38. Razavian, A.; Azizpour, H.; Sullivan, J.; Carlsson, S. CNN features off-the-shelf: an astounding baseline
for recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition
Workshops, Columbus, OH, USA, 23–28 June 2014; pp. 806–813.
© 2016 by the authors; licensee MDPI, Basel, Switzerland. This article is an open access
article distributed under the terms and conditions of the Creative Commons Attribution
(CC-BY) license (http://creativecommons.org/licenses/by/4.0/).