
2017 IEEE Third International Conference on Multimedia Big Data

Ego-motion Classification for Driving Vehicle

Li Du∗, Wenhui Jiang∗, Zhicheng Zhao∗, Fei Su∗

∗Beijing University of Posts and Telecommunications, Beijing, China

Abstract—Accurate prediction of vehicle ego-motion in real time is crucial for an autonomous driving system. In this paper, we formulate the problem of ego-motion classification as video event detection, and we propose an end-to-end deep model to address this problem. In this model, we utilize Convolutional Neural Networks (CNNs) to extract semantic visual features from each video frame, and employ a Long Short Term Memory (LSTM) network to model the temporal correlation of the video streams. To study the performance of ego-motion classification, we constructed a video dataset, Campus20, which was captured under general driving conditions. Experimental results on Campus20 verify the superior performance of our proposed model over well-established baselines.

Keywords-ego-motion classification; Convolutional Neural Networks (CNNs); Long Short Term Memory (LSTM)

Figure 1. The deep learning architecture of our model.

I. INTRODUCTION
Autonomous driving has become a promising research topic in the field of public transport. Accurate and efficient classification of vehicle ego-motion is one of the most important components of a self-driving system. Compared with recognizing video events from a static camera, detecting vehicle ego-motion is more challenging, because videos recorded by a vehicle-mounted camera contain large numbers of fast-moving objects due to the continuous motion of the driving car.
In this paper, we formulate the problem of ego-motion classification as event detection in video streams: given video streams recorded in real time, we categorize each frame into one of the possible action states (turning, lane-changing, reversing, lane-following, crossing, turn-left and turn-right). Towards this goal, we propose an end-to-end deep learning architecture based on a CNN and a single-layer LSTM. In this model, each video frame is fed into the CNN part in sequential order to extract semantic visual features individually. The extracted visual features are then passed to an LSTM part to model the temporal correlation of the video sequence. Finally, we estimate a probability distribution over all possible ego-motion actions. A brief illustration of our proposed model is shown in Figure 1.

Our architecture design is based on the following observations. Firstly, CNNs excel at capturing the complex spatial characteristics of input images with their multi-layer receptive fields, which enables them to obtain a better image representation. Secondly, Long Short Term Memory (LSTM) networks are good at capturing long-range temporal relationships from an input sequence by using memory cells to store, modify and access internal state in visual-feature-based tasks [13].

Figure 2. The ego-motion distribution probability bar image of the dataset.

To provide a better benchmark for ego-motion classification, we collect a new video dataset named Campus20. The data was captured by a single front-facing automobile data recorder mounted on a vehicle driving around BUPT in Beijing. We compare our method with various baselines (e.g., SVMs [10] and ELM [6] based on traditional hand-crafted features) on Campus20. Experimental results show that our model contributes to significant performance improvements compared with traditional feature-based baselines.

II. RELATED WORK

Ego-motion, also named camera motion, is one of the most important research areas in computer vision; it benefits from advances in image sensor research and has potential for completing ego-motion related tasks in autonomous driving systems. Gluckman and Nayar [5] proposed a method to recover an observer's ego-motion using omnidirectional cameras, which showed the possibility of ego-motion calculation based on an efficient visual scene capture device.
In 2000, Mobileye [14] proposed a robust approach to calculating the vehicle ego-motion relative to the road from camera videos, and built a reliable method that can ignore the large numbers of outliers that occur in real driving conditions. In 2015, to find out how images of objects and scenes behave in response to specific ego-motion, Jayaraman et al. [7] used motor signals as unsupervised information in CNNs for learning visual representations from egocentric video. To address the problem of visual ego-motion estimation, or briefly Visual Odometry (VO), an approach to learn both the best visual features and the best estimator based on neural networks was proposed in [2].

So far, newly emerging autonomous driving research platforms such as the KITTI Vision Benchmark Suite [4], Cityscapes [1] and the Oxford RobotCar dataset [11] have all developed relatively novel and challenging real-world computer vision benchmarks related to autonomous driving. To accomplish a more accurate real-time ego-motion prediction task, we come up with a simple deep learning architecture based on CNNs and an LSTM.

The contributions of this paper are as follows. First, we propose an efficient ego-motion classification method based on traditional hand-crafted features, using low-dimensional temporal combination features chosen from practical experience to complete the ego-motion classification task for driving; it achieves relatively stable performance on a randomly recorded dataset and demonstrates that visual features can be used in a driving event classification task. Second, ELM is used to verify the potential of neural networks in dealing with the ego-motion classification task under complex dynamic conditions. Third, a stacked deep learning architecture based on a CNN and an RNN (LSTM) is proposed to complete the same classification task.
III. DATASET

To our knowledge, there is no publicly available dataset for ego-motion classification. In this paper, to provide a better benchmark for ego-motion classification, we collected a new video dataset named Campus20. The videos were recorded on typical roads on 20 different days. Data was collected on clear days, during the daytime.

The dataset consists of 15 videos for training and 5 videos for testing, each lasting 5 minutes. The frame rate is 18 FPS, and the spatial resolution is 320×240 pixels. Each frame is labeled with a specific ego-motion action state. As shown in Figure 2, from a statistical view, the distribution probabilities of ego-motion states in Campus20 such as lane-driving, lane-changing, crossing, turn-left, turn-right and reversing are 84.381%, 3.469%, 1.585%, 3.923%, 5.610%, 1.028% and 17.133%.
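For a rough sense of scale, the stated durations and frame rate imply the approximate frame counts sketched below (a back-of-the-envelope calculation assuming every clip runs the full 5 minutes at 18 FPS; the paper does not report exact frame counts):

```python
# Approximate size of Campus20 from the figures given in the text,
# assuming each of the 20 clips runs the full 5 minutes at 18 FPS.
FPS = 18
SECONDS_PER_VIDEO = 5 * 60

frames_per_video = FPS * SECONDS_PER_VIDEO   # 5,400 frames per clip
train_frames = 15 * frames_per_video         # 81,000 training frames
test_frames = 5 * frames_per_video           # 27,000 testing frames
print(frames_per_video, train_frames, test_frames)
```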
Figure 3. The architecture of our model based on CNN+LSTM, when applied to process a 25-frame video sequence.

IV. MODEL DESCRIPTION

We propose an end-to-end deep model to address the problem of ego-motion classification. The overall architecture is shown in Figure 3. It takes raw video sequences as input, and the output is the probability distribution over action states for the corresponding video frames. The network consists of two parts: a CNN part which works as a visual feature extractor, and an LSTM part which acts as a temporal feature extractor. We explain both parts in detail in the following subsections.

A. CNN Part

We build the CNN part from one of the popular backbone architectures, i.e., AlexNet [9]. As shown in Figure 3, it mainly contains 5 convolutional layers and one fully connected layer. The kernel sizes of the convolutional layers are 7×7, 5×5, 3×3, 3×3 and 3×3. A max-pooling layer with a 3×3 kernel and a stride of 2 is connected to the 1st, 2nd and 5th ReLU layers. A dropout layer is appended to the fully connected layer to prevent over-fitting.

B. LSTM Part

The LSTM part is designed to extract continuous temporal features of video sequences given the visual features extracted from the CNN part. As depicted in Figure 4, each LSTM cell remembers a single floating point value c(t). This value may be diminished or erased through a multiplicative interaction with the forget gate f(t), or additively modified by the current input x(t) multiplied by the activation of the input gate i(t). The output gate o(t) controls the emission of the cell output h(t) [15]. When fed a visual feature sequence, the LSTM part computes the hidden vector sequence and outputs a continuous temporal feature sequence.

In this part, a single-layer LSTM with 256 hidden units is used to extract temporal relationship features within each frame sequence. In addition, a single fully connected layer is designed to function as the classifier.
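To make the two-part design concrete, below is a minimal PyTorch sketch of the pipeline. It is not the authors' Caffe implementation: the AlexNet-style channel widths, the adaptive pooling before the fully connected layer and the dropout rate are assumptions, while the kernel sizes, the placement of the 3×3 stride-2 max pooling, the 4096-dimensional frame feature, the single-layer 256-unit LSTM, the 25-frame sequence and the seven output states follow the text.

```python
# Minimal sketch of the described CNN+LSTM pipeline (assumptions noted above).
import torch
import torch.nn as nn

NUM_CLASSES = 7   # turning, lane-changing, reversing, lane-following,
                  # crossing, turn-left, turn-right
SEQ_LEN = 25      # frames processed per sequence

class CNNPart(nn.Module):
    def __init__(self, feat_dim=4096):
        super().__init__()
        def block(cin, cout, k, pool):
            layers = [nn.Conv2d(cin, cout, kernel_size=k, padding=k // 2),
                      nn.ReLU(inplace=True)]
            if pool:  # 3x3 max pooling, stride 2, after ReLU 1, 2 and 5
                layers.append(nn.MaxPool2d(kernel_size=3, stride=2))
            return layers
        self.features = nn.Sequential(
            *block(3, 96, 7, pool=True),
            *block(96, 256, 5, pool=True),
            *block(256, 384, 3, pool=False),
            *block(384, 384, 3, pool=False),
            *block(384, 256, 3, pool=True),
            nn.AdaptiveAvgPool2d((6, 6)),   # assumption: pool to a fixed grid
        )
        self.fc = nn.Sequential(nn.Flatten(),
                                nn.Linear(256 * 6 * 6, feat_dim),
                                nn.ReLU(inplace=True),
                                nn.Dropout(p=0.5))

    def forward(self, x):                       # x: (batch*seq, 3, H, W)
        return self.fc(self.features(x))        # -> (batch*seq, 4096)

class EgoMotionNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.cnn = CNNPart()
        self.lstm = nn.LSTM(input_size=4096, hidden_size=256,
                            num_layers=1, batch_first=True)
        self.classifier = nn.Linear(256, NUM_CLASSES)

    def forward(self, clips):                   # clips: (batch, seq, 3, H, W)
        b, t = clips.shape[:2]
        feats = self.cnn(clips.flatten(0, 1)).view(b, t, -1)  # (b, 25, 4096)
        hidden, _ = self.lstm(feats)                          # (b, 25, 256)
        return self.classifier(hidden)                        # (b, 25, 7)

model = EgoMotionNet()
clips = torch.randn(2, SEQ_LEN, 3, 240, 320)    # (batch, 25 frames, C, H, W)
logits = model(clips)                            # per-frame class scores
```

Each frame is encoded independently by the CNN part; only the LSTM mixes information across time, which matches the per-frame feature extraction and temporal modelling described above.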

Figure 4. A simple LSTM block with only input, output, and forget gates.

C. Objective Function

The whole network is supervised by a softmax loss; the loss function is shown in Equation (1) below:

$$L = \frac{1}{N}\sum_{i=0}^{N-1}\left[\log\left(\sum_{j=0}^{C-1} e^{z_{i,j}}\right) - z_{i,y_i}\right] \qquad (1)$$

In the equation above, N is the length of an LSTM sequence and C is the number of action states. We define z as the output of our model: z_{i,j} is the score the model assigns to the jth state for the ith frame, and z_{i,y_i} is the score assigned to the ground-truth state y_i of the ith frame.

The optimization can be done by applying the Stochastic Gradient Descent (SGD) algorithm with Back-propagation (BP); we explain the training details in the next section.
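Equation (1) is the standard softmax cross-entropy applied per frame and averaged over the sequence. The following sketch writes it out explicitly (variable names are illustrative, not from the paper):

```python
# Per-frame softmax loss of Equation (1), averaged over an LSTM sequence.
# logits: (N, C) model outputs z; labels: (N,) ground-truth states y_i.
import torch

def sequence_softmax_loss(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    # log(sum_j exp(z_ij)) - z_{i, y_i}, computed stably with logsumexp
    log_partition = torch.logsumexp(logits, dim=1)                   # (N,)
    true_scores = logits.gather(1, labels.unsqueeze(1)).squeeze(1)   # (N,)
    return (log_partition - true_scores).mean()

# Equivalent to torch.nn.functional.cross_entropy(logits, labels) for a
# 25-frame sequence with 7 action states.
logits = torch.randn(25, 7)
labels = torch.randint(0, 7, (25,))
print(sequence_softmax_loss(logits, labels))
```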
V. EXPERIMENT

A. Training Details

We use CNN parameters pre-trained on UCF-101 [3] to initialize the visual CNN part of our model. We used a GTX 1080 GPU and implemented the network in Caffe [8]. We trained our model jointly with the stochastic gradient descent (SGD) algorithm, with an initial learning rate of 10^-3. The learning rate is decreased to 10^-4 after one epoch, and training lasts for 4 epochs.

As shown in Section III, the total distribution probability of lane-driving is more than 80%. To avoid biasing the learned model towards 'lane-driving', we adopt a balanced sampling strategy. Specifically, we set the ratio of negative samples (lane-driving) to positive samples (crossing, reversing, lane-changing, turn-left and turn-right) to 3:1, and each mini-batch contains 100 video frames.
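The balanced sampling strategy can be pictured as follows. This is a schematic sketch rather than the authors' Caffe data layer, and how the sampled frames are subsequently grouped into 25-frame sequences is not specified in the paper and is omitted here:

```python
# Schematic 3:1 balanced sampling of a 100-frame mini-batch:
# 75 'lane-driving' (negative) frames and 25 frames from the other states.
import random

NEGATIVE = "lane-driving"
NEG_PER_BATCH = 75   # 3 parts negative ...
POS_PER_BATCH = 25   # ... to 1 part positive

def sample_batch(frames_by_label):
    """frames_by_label maps a label name to a list of frame indices."""
    negatives = random.sample(frames_by_label[NEGATIVE], NEG_PER_BATCH)
    positive_pool = [f for label, frames in frames_by_label.items()
                     if label != NEGATIVE for f in frames]
    positives = random.sample(positive_pool, POS_PER_BATCH)
    batch = negatives + positives
    random.shuffle(batch)
    return batch        # 100 frame indices per mini-batch
```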
The CNN part of our model outputs 25×4096 features when fed a 25-frame video sequence at each time step, and the LSTM part, which follows the CNN part, outputs a sequence consisting of 25×7 values. The output at each time step is thus a class sequence consisting of 25 numbers, each ranging from 0 to 6.

1) Baseline: To reveal the effectiveness of deep neural networks in dealing with the ego-motion classification task under complex dynamic conditions, traditional methods based on kernel SVMs and an ELM with 150 hidden nodes were chosen as the baselines.

A synthesized model based on four hand-crafted temporal features, namely motion [10], optical flow [12], velocity and improved dense trajectories [12], is used. Each kind of feature is normalized separately and concatenated to form the final feature vectors.

The primitive classifiers for the seven action states (turning, lane-changing, reversing, lane-following, crossing, turn-left and turn-right) are trained by kernel Support Vector Machines (SVMs), and multiple cross-validation is used to select the optimal parameters, including the cost C and the relative weight of the positive and negative sample sets. To verify the higher computational efficiency of ELM over SVM in dealing with the ego-motion classification task under complex dynamic conditions, ELM is also used as a baseline, backed by its rigorous theoretical analysis [6]. Besides, the ego-motion classification result from a similar classification model in [3] is also used as a comparison to verify the efficiency of our model.
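For readers unfamiliar with the ELM baseline, the basic model of [6] is a single hidden layer with random, untrained weights whose output weights are solved in closed form by least squares. The sketch below illustrates that idea with 150 hidden nodes; the input feature dimension, the tanh activation and the one-hot target encoding are assumptions, and this is not the exact baseline implementation used in the paper:

```python
# Minimal ELM classifier in the spirit of [6]: random hidden layer, output
# weights fit by least squares (pseudo-inverse). Illustrative only.
import numpy as np

class ELM:
    def __init__(self, n_hidden=150, n_classes=7, seed=0):
        self.n_hidden, self.n_classes = n_hidden, n_classes
        self.rng = np.random.default_rng(seed)

    def fit(self, X, y):                       # X: (n, d) features, y: (n,) labels
        d = X.shape[1]
        self.W = self.rng.standard_normal((d, self.n_hidden))
        self.b = self.rng.standard_normal(self.n_hidden)
        H = np.tanh(X @ self.W + self.b)       # random hidden activations
        T = np.eye(self.n_classes)[y]          # one-hot targets
        self.beta = np.linalg.pinv(H) @ T      # closed-form output weights
        return self

    def predict(self, X):
        H = np.tanh(X @ self.W + self.b)
        return (H @ self.beta).argmax(axis=1)
```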
2) Analysis: We evaluate the efficiency of the proposed model on Campus20, quantifying the effect of the number of processed frames on ego-motion classification performance and revealing the importance of temporal relationships in continuous sequences for accurate ego-motion classification. The performance is evaluated by the classification error rate, which is defined as follows:

$$\mathrm{Err} = N_i / N_t \qquad (2)$$

where N_i is the number of incorrectly classified frames and N_t is the total number of frames in the corresponding classification dataset.

B. Experimental Results

As shown in Table I, the ELM with 150 hidden nodes and the kernel SVMs obtain nearly equal results on the ego-motion classification task, and when processing data at the same scale the ELM requires far less processing time per frame (4×10^-7 s versus 0.006 s) than the SVMs. These results demonstrate that, as one kind of single-layer feedforward neural network, ELM is more efficient in dealing with complex computation problems than SVMs, and that a single-layer neural network can also perform our task.

Because real-world data is almost always complex, efficient traditional hand-crafted feature extraction and representation approaches are not always reliable, for they rely heavily on the expert knowledge of researchers. So traditional feature-based methods may hold little potential for meeting the practical needs of the ego-motion classification task.

Table I
OVERVIEW OF BASELINE AND PROPOSED MODELS

  Method            Error Rate   Testing Time
  SVM               0.2375       0.006 s
  ELM               0.2473       4×10^-7 s
  CNN+LSTM [3]      0.1368       0.04 s
  CNN+LSTM (ours)   0.0417       0.034 s

In comparison, the multi-class ego-motion classification error rate of our CNN+LSTM architecture is 0.0417, while the result of the similar model in [3] is 0.1368, and the results of the SVMs and the ELM are 0.2375 and 0.2473, respectively. These results show that models based on deep learning have better visual and temporal feature classification ability than traditional methods, and that the classification ability of our model is superior to the similar model architecture in [3].

As verified in the experiments above, deep learning methods may work well in dealing with the complex ego-motion classification task, owing to their more efficient learning process and the powerful computing ability of GPUs.

VI. CONCLUSION

We propose a real-time ego-motion prediction task related to autonomous driving. To complete the task, we come up with a CNN+LSTM model and collect a new dataset to demonstrate the efficiency of our model. Comparison experiments between traditional methods (SVMs and ELM) and our model verify the powerful classification ability of our method; the proposed architecture obtains more reliable features thanks to the powerful automatic feature learning ability of the CNN and LSTM, which contributes to accurate ego-motion classification.

ACKNOWLEDGMENT

This work is supported by the Chinese National Natural Science Foundation (61372169 and 61532018).

REFERENCES

[1] Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The Cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.

[2] G. Costante, M. Mancini, P. Valigi, and T. A. Ciarfuglia. Exploring representation learning with CNNs for frame-to-frame ego-motion estimation. IEEE Robotics and Automation Letters, 1(1):18-25, 2016.

[3] Jeffrey Donahue, Lisa Anne Hendricks, Sergio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan, Kate Saenko, and Trevor Darrell. Long-term recurrent convolutional networks for visual recognition and description. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2625-2634, 2015.

[4] Andreas Geiger, Philip Lenz, and Raquel Urtasun. Are we ready for autonomous driving? The KITTI vision benchmark suite. In Conference on Computer Vision and Pattern Recognition (CVPR), 2012.

[5] Joshua Gluckman and Shree K. Nayar. Ego-motion and omnidirectional cameras. In Sixth International Conference on Computer Vision, pages 999-1005. IEEE, 1998.

[6] Guang-Bin Huang, Qin-Yu Zhu, and Chee-Kheong Siew. Extreme learning machine: theory and applications. Neurocomputing, 70(1):489-501, 2006.

[7] Dinesh Jayaraman and Kristen Grauman. Learning image representations tied to ego-motion. In Proceedings of the IEEE International Conference on Computer Vision, pages 1413-1421, 2015.

[8] Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross B. Girshick, Sergio Guadarrama, and Trevor Darrell. Caffe: Convolutional architecture for fast feature embedding. CoRR, abs/1408.5093, 2014.

[9] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. In International Conference on Neural Information Processing Systems, pages 1097-1105, 2012.

[10] Jianjiang Lu, Yulong Tian, Yang Li, Yafei Zhang, and Zining Lu. A framework for video event detection using weighted SVM classifiers. In International Conference on Artificial Intelligence and Computational Intelligence (AICI'09), volume 4, pages 255-259. IEEE, 2009.

[11] Will Maddern, Geoff Pascoe, Chris Linegar, and Paul Newman. 1 year, 1000 km: The Oxford RobotCar dataset. The International Journal of Robotics Research (IJRR), 36(1):3-15, 2017.

[12] Moritz Menze and Andreas Geiger. Object scene flow for autonomous vehicles. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3061-3070, 2015.

[13] Volodymyr Mnih, Nicolas Heess, Alex Graves, et al. Recurrent models of visual attention. In Advances in Neural Information Processing Systems, pages 2204-2212, 2014.

[14] Gideon P. Stein, Ofer Mano, and Amnon Shashua. A robust method for computing vehicle ego-motion. In Proceedings of the IEEE Intelligent Vehicles Symposium (IV 2000), pages 362-368. IEEE, 2000.

[15] Joe Yue-Hei Ng, Matthew Hausknecht, Sudheendra Vijayanarasimhan, Oriol Vinyals, Rajat Monga, and George Toderici. Beyond short snippets: Deep networks for video classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4694-4702, 2015.
