
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2020.2990355, IEEE Access.

Date of publication xxxx 00, 0000, date of current version xxxx 00, 0000.

Abnormal Crowd Behavior Detection using Motion Information Images and Convolutional Neural Networks
Cem Direkoğlu
Electrical and Electronics Engineering Department, Middle East Technical University Northern Cyprus Campus,
Kalkanli, Guzelyurt, North Cyprus, via Mersin 10, Turkey.

Corresponding author: Cem Direkoğlu (e-mail: [email protected]).

ABSTRACT We introduce a novel method for abnormal crowd event detection in surveillance videos. In particular, our work focuses on detecting panic and escape behavior that may arise because of violent events and natural disasters. First, optical flow vectors are computed to generate a motion information image (MII) for each frame, and then the MIIs are used to train a convolutional neural network (CNN) for abnormal crowd event detection. The proposed MII is a new formulation that provides a visual appearance of crowd motion and makes the discrimination between normal and abnormal behaviors easier. The MII is mainly based on the optical flow magnitude and the angle difference computed between the optical flow vectors in consecutive frames. A CNN is employed to learn normal and abnormal crowd behaviors using MIIs. The generation of MIIs and their combination with a CNN is a new approach in the context of abnormal crowd behavior detection. Experiments are performed on the commonly used UMN and PETS2009 datasets. Evaluation indicates that our method achieves the best results.

INDEX TERMS Crowd behavior analysis, anomaly detection, motion information image, convolutional
neural network.

I. INTRODUCTION

Analysis of crowd behavior has become a popular research field in recent years. Crowd behavior analysis can be utilized in a variety of applications, for example the automatic detection of panic and escape behavior resulting from violence, riots, natural disasters, and so forth. It is generally challenging to find effective features for crowds, since people inside the crowd may be positioned at different locations and may move in diverse directions. As a result, higher-level analysis becomes difficult.
According to [1], abnormal event detection can be classified into local and global abnormal events. Local abnormal events involve an individual who acts differently from the rest of the individuals within a crowded scene. In global abnormal events, the crowd behavior in the whole scene is abnormal, such as the sudden escape of people during an earthquake. This work focuses on global abnormal crowd behavior detection.

A. RELATED WORK
For global crowd behavior analysis, holistic and object-based methods are usually utilized. In object-based approaches, the crowd consists of groups of individuals (objects) [2][3], and these objects are detected and tracked in order to understand the global crowd behavior [4]. The major challenges of object-based methods are accurate object identification, tracking and action recognition in dense crowds, since occlusions affect the whole process. Alternatively, in holistic methods [5][6][7], the crowd is considered as a global unit. Thus, these approaches analyze the whole crowd itself to extract useful features (e.g. applying optical flow to the entire frame) in order to detect the crowd behavior.
In this research, we concentrate on global abnormal crowd event detection in surveillance videos, for example the sudden escape of people in the same or diverse directions. Anomaly detection consists of two main phases: event representation and anomaly measurement. For abnormal event representation, spatial-temporal information can be used, for example the social force model [8], Histogram of Optical Flow (HOF) [9], Histogram of Motion Direction (HMD) [10], spatial-temporal gradients [11], chaotic invariants [12], mixtures of dynamic textures [13], sparse representation [14] and the behaviour entropy model (BE) [15].


For anomaly measurement, most approaches employ a one-class learning method to learn normal samples. As a one-class learner, a Hidden Markov Model [11], Gaussian Mixture Model, one-class Support Vector Machine (SVM) [16], Replicator Neural Networks [17], Convolutional Neural Networks [18][19] or a Bayesian model [20] can be utilized. Then, during testing, if the test sample is significantly different from the normal samples, it is accepted as abnormal.
There are also very recent works on abnormal crowd behaviour detection based on the distribution of magnitude of optical flow (DMOF) [21], context location and motion-rich spatio-temporal volumes (CL and MSV) [22], generative adversarial nets (GAN) [23], the temporal convolutional neural network pattern (TCP) [24], the global event influence model (GEIM) [25], and histograms of optical flow orientation and magnitude (HOFO) [26]. Reviews on crowd behaviour analysis can be found in [27][28]. Recently, survey papers have also appeared on deep learning based crowd behaviour analysis [29][30]. Below, in part B, we summarize existing optical flow based methods for both global and local crowd anomalies, since our work is also based on optical flow; then, in part C, we explain our method and its differences from existing works.

B. OPTICAL FLOW BASED METHODS
Here, we summarize optical flow based methods for both global and local crowd anomalies. The social force model [8] focuses on global anomalies. A grid of particles is placed over the image plane, and the particles are advected with the space-time average of the optical flow. The interaction force between particles is then estimated using the social force model and mapped into the image plane to obtain a Force Flow for every pixel in every frame. Normal crowd behaviour is modelled using the Force Flow frames, and finally a bag-of-words approach is used to classify frames as normal or abnormal. In [9], a motion feature is obtained by binning the current optical flow distribution into angular bins, yielding a one-dimensional vector of flow directions for local anomaly detection. In the chaotic invariants method [12], the process begins with particle advection using optical flow. Particle trajectories are then clustered to obtain representative trajectories for a crowd flow, and the chaotic dynamics of all representative trajectories are extracted. A probabilistic model is learned from this chaotic feature set, and finally a maximum likelihood estimation criterion is utilized to identify global abnormal or normal behaviour; the method can also predict the location of the anomaly. The sparse representation method [14] uses a multi-scale histogram of optical flow (MHOF) that also preserves spatial contextual information to identify local and global anomalies; it concatenates optical flow direction and energy (magnitude) information at multiple scales to generate a motion histogram. The behaviour entropy model (BE) [15] uses optical flow magnitude information in local regions to model behavior certainty, behavior entropy and scene behavior entropy in order to analyse crowd behaviour for global and local anomaly detection. Gnouma et al. [21] present a method based on the local distribution of magnitude of optical flow (DMOF) for global anomaly detection. Patil et al. [22] also proposed a method for global anomaly detection: first, optical flow is computed at each frame of a video; then the video is divided into spatio-temporal volumes (STVs), and in each volume the mean value of the optical flow magnitudes is computed. Next, the STVs with the higher mean values are used for testing the anomaly, and a histogram of flow orientation information, together with the mean flow magnitude of the volume, is used as the feature vector for abnormal crowd behaviour detection. Generative adversarial nets (GAN) [23] use optical flow magnitude images for global and local abnormal behaviour detection. Ravanbakhsh et al. [24] fuse appearance and optical flow magnitude images using a convolutional neural network for global and local abnormal crowd behaviour detection. Pan et al. [25] perform global abnormal behaviour detection using a combination of scale, velocity and disorder features; in their work, the velocity feature is based on optical flow magnitude. Colque et al. [26] also proposed an optical flow based feature descriptor for global and local anomaly detection, represented as a three-dimensional histogram over optical flow orientation, magnitude and entropy of orientation.

C. CONTRIBUTION
We present a new method for abnormal crowd event detection. The key contribution is the generation of a new motion information image (MII) using optical flow. The proposed MIIs can represent and discriminate normal and abnormal events well, and when MIIs are input to a CNN for training and testing, very promising results are achieved in this domain. Both normal and abnormal MIIs are used to train the CNN, which therefore has two categories. According to our observation, during an abnormal event people start to run. Especially in the motion regions, this abnormal behavior increases the angle difference between the optical flow vectors computed in the previous frame and in the current frame at each pixel location. In addition, we observe that the optical flow magnitude increases as well. We introduce a mathematical formulation to produce an MII. As a first step, optical flow angle differences are computed for each pixel location based on the current frame and the previous frame. However, some optical flow measurements are small and noisy, and their angle differences affect the observation. To overcome this problem, the angle difference is multiplied by the optical flow magnitude computed in the current frame to form the MII. We compute an MII for each frame. Finally, a CNN is used to learn normal and abnormal crowd behaviors using MIIs. In the testing phase, the CNN classifies the input MII image.


It is important to emphasize that, although many optical flow based algorithms have been introduced for crowd behavior understanding, MII generation is a completely new concept based on the angle difference between optical flow vectors in consecutive frames and the optical flow magnitude in the current frame. Our studies show that when MIIs are combined with a CNN for classification, the approach outperforms existing methods in abnormal crowd behavior detection. Our experiments are performed on two commonly used public datasets, UMN [31] and PETS2009 [32]. Results illustrate that our method achieves the best results on both datasets.
In our preliminary work [16], optical flow-based features are used together with a one-class SVM for abnormal crowd behaviour detection. In [16], we created a one-dimensional feature vector based on a combination of optical flow magnitude and optical flow angle difference information. The proposed feature vectors are extracted for frames representing normal behaviour, and a one-class SVM is then trained on these feature vectors. Finally, if a test frame deviates significantly from the normal type, it is labelled as abnormal. Our earlier work is significantly different from the current work, since in this paper we generate a novel MII representation that provides a visual appearance of crowd motion. The MIIs are input to a CNN for training and testing of two classes: normal and abnormal crowd behaviours. It is also important to note that Hatirnaz et al. [33] adopted our preliminary work [16] to develop a concept-based semantic search interface. They use semantic web technologies to improve video retrieval for abnormal crowd behaviors in a surveillance system. The novelty of their work lies in using semantic web technologies for annotating surveillance videos and developing an intelligent semantic search interface; they use the existing work in [16] for crowd behavior feature extraction.
In this paper, Section II introduces motion information image (MII) generation. Section III presents abnormal crowd event detection using a CNN. Section IV presents experiments on the UMN and PETS2009 datasets, and discusses parameter selection and computational complexity evaluations. Section V concludes the paper.

II. MOTION INFORMATION IMAGE GENERATION

The proposed motion information image (MII) generation is based on optical flow. The optical flow at each frame is computed using the Lucas-Kanade algorithm [34]. In a panic situation, each person in the crowd may move in a different direction or in the same direction. Therefore, the MII must be invariant to the direction of movement, and it must be discriminative enough so that normal and abnormal events can be separated at every time frame. For example, in Fig. 2 (a), when we look at the first and third images, we can observe that each person is moving (e.g. scattering) mostly in different directions. On the other hand, in the second image in Fig. 2 (a), everyone is moving in the same direction (i.e. to the right side). All of these situations are panic and escape situations, regardless of the direction of movement of each person (whether everyone moves in different directions or in the same direction). Thus, the MII must be invariant to the direction of motion, and it must be discriminative so that the normal and abnormal events can be identified at every time frame.
In an unusual situation, people panic and scatter around. In such a situation, we observe that, especially in motion areas, the angle difference between the optical flow vectors in consecutive frames increases at each pixel location. The angle difference between two vectors, at each pixel location, is calculated as follows:

\theta_t(x,y) = \arccos\left(\frac{u_{t-1}(x,y)\,u_t(x,y) + v_{t-1}(x,y)\,v_t(x,y)}{\sqrt{u_{t-1}^2(x,y) + v_{t-1}^2(x,y)}\,\sqrt{u_t^2(x,y) + v_t^2(x,y)}}\right)   (1)

where o_{t-1}(x, y) = (u_{t-1}(x, y), v_{t-1}(x, y)) and o_t(x, y) = (u_t(x, y), v_t(x, y)) are the optical flow vectors in the previous frame (t-1) and in the current frame (t), respectively, at each pixel location (x, y), and \theta_t is the angle difference at the current frame. The optical flow angle difference between these two vectors is also shown in Fig. 1 (a) and (b). In our observation, the angle difference tends to be higher, as shown in Fig. 1 (a), when there is abnormal behaviour (i.e. an escape or panic situation), and smaller, as in Fig. 1 (b), when the behavior is normal. However, some optical flow measurements appear in the image not because of object motion but because of noise or lighting change in still areas (no-motion areas). In still areas, under ideal conditions, the optical flow measurements should be zero (the magnitude is zero and the angle difference is zero). In practice, on real-world images, optical flow measurements usually have small magnitudes in still areas because of noise or lighting change, while the angle difference between the vectors in consecutive frames may be high there. We do not want these noisy measurements to affect our observation, since MIIs are based on the angle difference of vectors in consecutive frames. To overcome this problem, the angle difference is multiplied by the optical flow magnitude computed in the current frame, as shown below:

I_t(x,y) = \sqrt{u_t^2(x,y) + v_t^2(x,y)}\cdot\theta_t(x,y)   (2)

where \sqrt{u_t^2(x,y) + v_t^2(x,y)} is the optical flow magnitude in the current frame (t) at each pixel location (x, y), \theta_t(x, y) is the angle difference calculated in Equation 1, and I_t represents the motion information image (MII) for the current frame (t).


If the magnitude and the angle difference are both high, the multiplication output will also be high (this is the case in motion regions). If the magnitude is small and the angle difference is high, the multiplication output will be smaller (this may be the case in still regions). If both the magnitude and the angle difference are small, the multiplication output will be even smaller. Therefore, multiplying the magnitude by the angle difference generates a significant difference between the abnormal and normal Motion Information Images (MIIs). In Section IV (E), we demonstrate that this multiplication improves the performance considerably.
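To make Equations 1 and 2 concrete, the following is a minimal NumPy/OpenCV sketch of MII generation for a single frame. It is an illustrative re-implementation under stated assumptions, not the authors' code: the paper computes flow with the Lucas-Kanade algorithm [34], whereas this sketch uses OpenCV's dense Farneback flow as a convenient stand-in, and the function names are hypothetical.

```python
# Sketch of MII generation (Eqs. 1 and 2) from dense optical flow fields.
# Assumption: Farneback dense flow is used here in place of the
# Lucas-Kanade flow [34] used in the paper.
import cv2
import numpy as np

def dense_flow(prev_gray, curr_gray):
    """Per-pixel optical flow (u, v) between two consecutive grayscale frames."""
    flow = cv2.calcOpticalFlowFarneback(prev_gray, curr_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    return flow[..., 0], flow[..., 1]

def motion_information_image(u_prev, v_prev, u_curr, v_curr, eps=1e-8):
    """MII: current-frame flow magnitude times the angle difference of consecutive flows."""
    dot = u_prev * u_curr + v_prev * v_curr                       # numerator of Eq. 1
    mag_prev = np.sqrt(u_prev ** 2 + v_prev ** 2)
    mag_curr = np.sqrt(u_curr ** 2 + v_curr ** 2)
    cos_theta = np.clip(dot / (mag_prev * mag_curr + eps), -1.0, 1.0)
    theta = np.arccos(cos_theta)                                  # Eq. 1 (radians)
    mii = mag_curr * theta                                        # Eq. 2
    return cv2.resize(mii.astype(np.float32), (75, 75))           # CNN input size

# Usage: for frame t, compute the flow of frame t-1 (between frames t-2 and t-1)
# and the flow of frame t (between frames t-1 and t), then call
# motion_information_image(u_prev, v_prev, u_curr, v_curr).
```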
Fig. 2 (a) shows some example frames representing abnormal behaviour and their corresponding MIIs, while Fig. 2 (b) illustrates some example frames representing normal behaviour and their corresponding MIIs. All of the MIIs are resized to dimensions of 75 × 75 before being input to the CNN. In addition, for better illustration, the MIIs are inverted in Fig. 2. It can be observed that the MIIs produced by abnormal behaviours are significantly different from the MIIs produced by normal behaviours.

FIGURE 1. Optical Flow Angle Difference. (a) Observed behaviour in an abnormal situation, and (b) in a normal situation.

FIGURE 2. Some example frames representing abnormal and normal behaviours and their corresponding MIIs. (a) Abnormal behaviour and corresponding MII. (b) Normal behaviour and corresponding MII.

III. CNN TRAINING AND CLASSIFICATION

We use a simple 2D CNN structure and train the CNN network with MIIs to achieve abnormal crowd behavior detection. In the CNN network, there are two classes: Normal and Abnormal behavior.

FIGURE 3. The CNN Structure.


A. CNN ARCHITECTURE AND TRAINING

MIIs are experimented with simple CNN architectures with varying numbers of convolutional layers and channels, various filter sizes, and pooling layers to achieve the best accuracy on both datasets. We observe that a suitable CNN architecture that uses MIIs for abnormal crowd behavior detection is the one with three convolutional layers shown in Fig. 3. It is also important to note that we also tested the MIIs with popular deep networks used in image recognition; those results are presented in Section IV-C.
In the simple CNN structure shown in Fig. 3, MII inputs are resized to dimensions of 75 × 75. The first convolutional layer uses 5 × 5 filters with 8 channels. After that we perform a batch normalization, a rectified linear unit (ReLU) activation, and a max pooling (3 × 3) operation. The second convolutional layer uses 3 × 3 filters with 16 feature maps, followed by a batch normalization, a ReLU activation, and 3 × 3 max pooling. The last convolutional layer utilizes 3 × 3 filters with 32 feature maps, followed by a batch normalization, a ReLU, and 2 × 2 max pooling. Then, we form a fully connected layer with two nodes, since we have two classes, and finally employ the softmax layer for predictions. The input MII image is recognized as normal or abnormal using the classification layer. During training, stochastic gradient descent with momentum is used as the solver. The learning rate is 0.01, the mini-batch size is 50, and the maximum number of epochs is 10. These parameter values are determined experimentally to achieve the best performance with the proposed MIIs. The same CNN structure and parameter values are utilized for both the UMN and PETS2009 datasets. TABLE I also shows the details of the network, such as the activation map resolution at each layer and the total learnable parameters at each layer and in the whole network.

TABLE I
ANALYSIS OF THE CNN NETWORK

 #   Layer Name                            Activation Resolution   Learnables                         Total Learnables
 1   MII input                             75x75x1                 -                                  0
 2   Conv_1 (8 5x5x1 convolutions)         71x71x8                 Weights: 5x5x1x8, Bias: 1x1x8      208
 3   BatchNorm_1 (with 8 channels)         71x71x8                 Offset: 1x1x8, Scale: 1x1x8        16
 4   ReLU_1                                71x71x8                 -                                  0
 5   Maxpool_1 (Block 3x3, Stride 3x3)     23x23x8                 -                                  0
 6   Conv_2 (16 3x3x8 convolutions)        21x21x16                Weights: 3x3x8x16, Bias: 1x1x16    1168
 7   BatchNorm_2 (with 16 channels)        21x21x16                Offset: 1x1x16, Scale: 1x1x16      32
 8   ReLU_2                                21x21x16                -                                  0
 9   Maxpool_2 (Block 2x2, Stride 2x2)     10x10x16                -                                  0
10   Conv_3 (32 3x3x16 convolutions)       8x8x32                  Weights: 3x3x16x32, Bias: 1x1x32   4640
11   BatchNorm_3 (with 32 channels)        8x8x32                  Offset: 1x1x32, Scale: 1x1x32      64
12   ReLU_3                                8x8x32                  -                                  0
13   Maxpool_3 (Block 2x2, Stride 2x2)     4x4x32                  -                                  0
14   Fully-Connect (2 units)               1x1x2                   Weights: 2x512, Bias: 2x1          1026
15   Softmax                               1x1x2                   -                                  0
16   Class Output (Normal or Abnormal)     -                       -                                  0

Total Learnable Parameters in the Network = 7154
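For illustration, the layer stack of Fig. 3 and TABLE I can be expressed as the following PyTorch sketch. The original experiments were run in Matlab, so this re-implementation is an assumption rather than the authors' code; the layer sizes follow TABLE I, and the momentum value, which the paper does not state, is also assumed.

```python
# PyTorch sketch of the simple CNN in Fig. 3 / TABLE I (assumed re-implementation).
import torch
import torch.nn as nn

class MiiCnn(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 8, kernel_size=5),          # 75x75x1 -> 71x71x8
            nn.BatchNorm2d(8),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=3),   # -> 23x23x8
            nn.Conv2d(8, 16, kernel_size=3),         # -> 21x21x16
            nn.BatchNorm2d(16),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=2, stride=2),   # -> 10x10x16
            nn.Conv2d(16, 32, kernel_size=3),        # -> 8x8x32
            nn.BatchNorm2d(32),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=2, stride=2),   # -> 4x4x32 = 512 features
        )
        self.classifier = nn.Linear(4 * 4 * 32, 2)   # two classes: Normal / Abnormal

    def forward(self, x):                            # x: (N, 1, 75, 75) batch of MIIs
        return self.classifier(self.features(x).flatten(1))

# Training setup mirroring the text: SGD with momentum, learning rate 0.01,
# mini-batch size 50, 10 epochs (momentum value assumed, not stated in the paper).
model = MiiCnn()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
criterion = nn.CrossEntropyLoss()                    # softmax + classification layer
```

The trainable parameter count of this sketch (208 + 16 + 1168 + 32 + 4640 + 64 + 1026 = 7154) matches the count reported in TABLE I.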
B. CLASSIFICATION USING THE MIIs AND CNN
The MIIs of the test frames are obtained as explained in Section II. In the UMN dataset, a test frame is recognized using its 28-by-28 neighbourhood frames, which means the window size is 57 (including the test frame). Each of the frames in the window is labelled with the CNN classifier, and then the most frequent class represents the behaviour (normal or abnormal) of the test frame. In the PETS2009 dataset, a test frame is recognized using its 21-by-21 neighbourhood frames, which means the window size is 43 (including the test frame). Again, each frame in the window is labelled with the CNN classifier, and the most frequent class represents the behaviour (normal or abnormal) of the test frame. The window sizes for the UMN and PETS datasets are determined experimentally, as discussed in the evaluation section.
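A minimal sketch of this temporal majority vote is given below, assuming per-frame CNN predictions are already available; the function name and the treatment of frames near the start and end of a video are assumptions.

```python
# Sketch of the majority vote over a temporal window of W frames centred on a
# test frame (W = 57 for UMN, W = 43 for PETS2009).
import numpy as np

def window_vote(frame_labels, index, window=57):
    """frame_labels: per-frame CNN predictions (0 = normal, 1 = abnormal)."""
    half = window // 2                                   # e.g. 28 neighbours per side for W = 57
    lo, hi = max(0, index - half), min(len(frame_labels), index + half + 1)
    votes = np.asarray(frame_labels[lo:hi])
    return int(np.bincount(votes).argmax())              # most frequent class in the window

# Example: smooth the raw per-frame decisions of one test video.
# smoothed = [window_vote(raw_preds, i, window=57) for i in range(len(raw_preds))]
```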
IV. EVALUATION AND RESULTS


Experiments are performed on two commonly used and publicly available datasets in this domain: the UMN and PETS2009 datasets. The proposed work is also compared to the existing works in this domain (global anomaly detection), namely Optical Flow Features (OFF) [16], the Bayesian model (BM) [20], sparse reconstruction cost (SRC) [1], chaotic invariants (CI) [12], the social force model (SF) [8], the force field model (FF) [35], the behaviour entropy model (BE) [15], the distribution of magnitude of optical flow (DMOF) [21], context location and motion-rich spatio-temporal volumes (CL and MSV) [22], generative adversarial nets (GAN) [23], the temporal CNN pattern (TCP) [24], the global event influence model (GEIM) [25], and histograms of optical flow orientation and magnitude (HOFO) [26]. We measure the accuracy of the methods, which is the percentage of correctly classified frames in comparison to the ground truth. The same accuracy measurement has been employed by the methods above. It is also important to note that, recently, Sultani et al. [36] also proposed a framework for anomaly detection. However, they need a large dataset for training, since they use the complex C3D [37] network to learn spatiotemporal features with 3D convolutions. They constructed a large dataset (1900 videos) consisting of surveillance videos of abnormal events. However, the anomalies in their dataset involve behaviors by individuals (one or a few people performing abnormal actions, such as one person abusing an animal, one person breaking the glass of a shop, or two people stealing something from a car); their dataset is not about crowd behavior. Although the method looks effective on their dataset, they did not evaluate their work on UMN [31] and PETS2009 [32] for crowd behavior analysis, mainly because they need large datasets for training.

A. EVALUATION ON UMN DATASET

The UMN dataset [31] consists of 11 videos, and each video contains normal and abnormal crowd behaviors. There are three different scenes in this dataset (two outdoor scenes and one indoor scene). Scene 1 is an outdoor scene that consists of two videos (Video 1 and Video 2). For testing Video 1, we use the MIIs of Video 2 to train the CNN. Similarly, for testing Video 2, the MIIs of Video 1 are used for training the CNN.
Scene 2 consists of six videos captured in an indoor environment. While testing a particular video in Scene 2, we leave the testing video out and use the MIIs of the rest of the videos in Scene 2 to train the CNN.
Scene 3 has three videos captured in an outdoor environment. While testing a particular video in Scene 3, we leave the testing video out and use the MIIs of the rest of the videos in Scene 3 for training.
TABLE II illustrates the accuracy of twelve methods for the three different scenes. Overall, the proposed method (MII+CNN) outperforms the existing methods on the UMN dataset with an accuracy of 99.08 %. The overall accuracy of the proposed method is better than that of recently published works such as GAN (99 %), BE (99 %), TCP (98.8 %), GEIM (98.47 %) and DMOF (98.42 %). The GAN, TCP and BE methods did not report their performance for individual scenes. Our earlier work, OFF, achieves 97.32 %. We improve on the performance of the earlier work, and we also perform better than the older works on this dataset: the accuracies of BM (96.40 %), FF (81.04 %), CI (87.91 %), SF (85.09 %) and SRC (84.70 %) stay below the proposed method.

FIGURE 4. Crowd escape behaviour detection results for Scene #1 in the UMN dataset. (a) Ground truth, (b) detection by the proposed work, (c) detection by OFF, (d) detection by BM, (e) detection by FF, (f) detection by CI, (g) detection by SF, (h) detection by SRC, (i) detection by DMOF, and (k) detection by GEIM.

FIGURE 5. Crowd escape behaviour detection results for Scene #2 in the UMN dataset. (a) Ground truth, (b) detection by the proposed work, (c) detection by OFF, (d) detection by BM, (e) detection by FF, (f) detection by CI, (g) detection by SF, (h) detection by SRC, and (i) detection by GEIM.

To summarize, our work provides the best results on the UMN dataset. In addition to these results, visual accuracy comparisons of the methods for the UMN dataset are illustrated in Fig. 4 and Fig. 5. These illustrations show that our work performs very accurate segmentation in comparison to the other works. The proposed method can detect the start and the end times of the abnormal event very well.
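The leave-one-video-out protocol and the frame-level accuracy measure used in this section can be sketched as follows; the helpers train_cnn and predict_video are hypothetical placeholders for the MII+CNN pipeline of Sections II and III.

```python
# Sketch of the per-scene evaluation protocol: hold one video out for testing,
# train the CNN on the MIIs of the remaining videos, and report the percentage
# of frames classified in agreement with the ground truth.
import numpy as np

def frame_accuracy(pred, gt):
    """Percentage of correctly classified frames."""
    return 100.0 * np.mean(np.asarray(pred) == np.asarray(gt))

def evaluate_scene(videos, labels, train_cnn, predict_video):
    """videos: list of per-video MII stacks; labels: matching per-frame ground truth."""
    scores = []
    for i, (test_miis, test_gt) in enumerate(zip(videos, labels)):
        train_miis = [v for j, v in enumerate(videos) if j != i]   # leave one video out
        train_gt = [l for j, l in enumerate(labels) if j != i]
        model = train_cnn(train_miis, train_gt)
        scores.append(frame_accuracy(predict_video(model, test_miis), test_gt))
    return float(np.mean(scores))
```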


TABLE II
ACCURACY COMPARISON OF METHODS IN THE UMN DATASET

            MII+CNN      OFF     BM      FF      CI      SF      SRC     DMOF    GAN    TCP    GEIM    BE
            (Our Work)
Scene 1     98.55        99.10   99.03   88.69   90.62   84.41   90.52   98.84   -      -      99.18   -
Scene 2     98.91        94.85   95.36   80.00   85.06   82.35   78.48   97.72   -      -      98.03   -
Scene 3     99.77        97.76   96.63   77.92   91.58   90.83   92.70   98.7    -      -      98.19   -
Overall     99.08        96.46   96.40   81.04   87.91   85.09   84.70   98.42   99     98.8   98.47   99
Accuracy

TABLE III
ACCURACY COMPARISON OF THE METHODS IN THE PETS2009 DATASET

            MII+CNN      OFF     BM      FF      CI      SF      DMOF
            (Our Work)
View 1      99.12        98.66   96.01   94.50   94.95   91.22   99.00
View 2      99.73        99.20   94.15   63.83   92.02   89.36   99.70
View 3      99.12        99.47   95.21   95.48   94.15   94.68   99.34
View 4      95.45        89.57   91.49   96.81   89.36   64.63   87.72
Overall     98.39        96.72   94.22   87.66   92.62   84.97   96.44
Accuracy

B. EVALUATION ON PETS DATASET

In the PETS2009 dataset [32], there is one scenario involving abnormal crowd behaviour. This scenario was captured from four different camera locations (four different viewpoints), resulting in four videos. In this scenario, people come to the centre from different directions, wait there for a while, and then suddenly start to run around in random directions. Although the same action is performed, there are significant differences between the viewpoints. For example, the distance between the camera and the crowd is different for each viewpoint, and the lighting conditions and the distribution and location of people and objects differ as well. Therefore, we evaluate the accuracy for these four different viewpoints. Each video consists of 374 frames. For testing the view #1 video of the scenario, the MIIs of the other views are used for training the CNN. For testing view #2 of the scenario, the MIIs of the other views are used for training the CNN. Similarly, we test view #3 and view #4.
TABLE III illustrates the accuracy of seven methods for this scenario. Overall, our work (MII+CNN) outperforms the other methods with an accuracy of 98.39 %. The other methods, OFF (96.72 %), BM (94.22 %), FF (87.66 %), CI (92.62 %), SF (84.97 %) and DMOF (96.44 %), achieve worse results than the proposed method. Only in view #4 does our work rank slightly behind the FF method. This is mainly because of the low resolution in view #4, which is why almost all of the methods have a lower performance there in comparison to the other views.
A visual accuracy comparison is also illustrated in Fig. 6 for the PETS2009 dataset. Our work performs very accurate anomaly detection in comparison to the existing methods.

FIGURE 6. Crowd escape behaviour detection results for the view #2 video of the PETS2009 dataset. (a) Ground truth, (b) detection by the proposed work, (c) detection by OFF, (d) detection by BM, (e) detection by FF, (f) detection by CI and (g) detection by SF.

C. EVALUATION OF MIIs WITH OTHER NETWORKS

We also evaluate the proposed MIIs with popular deep networks for anomaly detection, namely ResNet-50 [38], GoogleNet-V3 [39], DenseNet-250 [40], and the CNN network proposed by Oquab et al. [41], which is an improved version of AlexNet proposed by Krizhevsky et al. [42]. We performed transfer learning to tune the network parameters for the two possible classes: Normal and Abnormal actions. Overall accuracy results on both the UMN and PETS datasets are presented in TABLE IV.


We also include the results obtained with the simple CNN presented in Section III. During the transfer learning, we adjusted the input image size according to each pre-trained network and replaced the final layers so that there are only two classes. Results show that all networks achieve similar results with the MIIs.
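As an illustration of this transfer-learning setup, the sketch below adapts a pre-trained ResNet-50 from torchvision to the two-class problem. The paper does not specify how the single-channel MIIs are preprocessed for the pre-trained networks, so resizing to 224 × 224 and channel replication are assumptions.

```python
# Sketch of transfer learning with one of the pre-trained backbones (ResNet-50):
# replace the final layer with a two-class head and adapt MIIs to the input format.
import torch
import torch.nn as nn
from torchvision import models

backbone = models.resnet50(pretrained=True)
backbone.fc = nn.Linear(backbone.fc.in_features, 2)        # Normal / Abnormal head
backbone.eval()

def prepare_mii(mii_75x75):
    """Resize a single-channel MII to 224x224 and replicate it to 3 channels (assumed)."""
    x = torch.as_tensor(mii_75x75, dtype=torch.float32)[None, None]   # (1, 1, 75, 75)
    x = nn.functional.interpolate(x, size=(224, 224),
                                  mode='bilinear', align_corners=False)
    return x.repeat(1, 3, 1, 1)                                       # (1, 3, 224, 224)

with torch.no_grad():
    logits = backbone(prepare_mii(torch.zeros(75, 75)))     # toy forward pass
```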
TABLE IV
THE PROPOSED MIIs WITH DIFFERENT NETWORKS

Overall        MII + Simple CNN   MII + ResNet-50         MII + GoogleNet-V3      MII + DenseNet-250      MII + Oquab et al.
Accuracy       (7154 weights)     (~25 Million weights)   (~23 Million weights)   (~15 Million weights)   (~61 Million weights)
UMN Dataset    99.08              98.99                   99.04                   98.49                   98.22
PETS Dataset   98.39              97.04                   98.25                   98.31                   97.12

D. THE IMPACT OF WINDOW SIZE

The impact of changing the window size is evaluated on both the UMN and PETS2009 datasets. Fig. 7 (a) and (b) show the overall performance of the proposed method with changing window size on the UMN and PETS2009 datasets, respectively. It is seen that the best window size for our approach (MII+CNN) on the UMN dataset is 57. On the PETS2009 dataset, the optimal window size for MII+CNN is 43.

FIGURE 7. The effect of window size on accuracy. (a) UMN dataset. (b) PETS2009 dataset.

E. THE INFLUENCE OF ANGLE DIFFERENCE AND OPTICAL FLOW MAGNITUDE

We examine the effect of the angle difference and the optical flow magnitude on detection accuracy and compare them with their combination. In particular, we expect to observe higher detection accuracy for the combination of the optical flow angle difference and the optical flow magnitude (i.e. their multiplication, as shown in Equation 2) in comparison to using only the angle difference or only the magnitude. Results for the UMN dataset are presented in TABLE V. Using only the angle difference yields 91.30 %, using only the optical flow magnitude yields 94.34 %, and the combination yields 99.08 %. These results show that the combination increases the accuracy remarkably.
Results for the PETS2009 dataset are presented in TABLE VI. Using only the angle difference yields 93.25 %, using only the optical flow magnitude yields 85.49 %, and the combination yields 98.39 %. Similar to the results on the UMN dataset, the results on the PETS2009 dataset also confirm that the combination improves the accuracy significantly.
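A small sketch of the three input variants compared in this ablation is given below, reusing the per-pixel flow quantities defined in Section II; it is an assumed illustration of the ablation inputs, not the authors' implementation.

```python
# The three input images compared in Tables V and VI: angle difference only,
# current-frame magnitude only, and their product (the proposed MII, Eq. 2).
import numpy as np

def ablation_images(u_prev, v_prev, u_curr, v_curr, eps=1e-8):
    dot = u_prev * u_curr + v_prev * v_curr
    mag_prev = np.sqrt(u_prev ** 2 + v_prev ** 2)
    mag_curr = np.sqrt(u_curr ** 2 + v_curr ** 2)
    theta = np.arccos(np.clip(dot / (mag_prev * mag_curr + eps), -1.0, 1.0))
    return {
        "angle_only": theta,           # angle-difference image
        "magnitude_only": mag_curr,    # optical flow magnitude image
        "combined": mag_curr * theta,  # proposed MII (Eq. 2)
    }
```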

TABLE V
THE INFLUENCE OF ANGLE DIFFERENCE AND OPTICAL FLOW MAGNITUDE ON ACCURACY (%) IN THE UMN DATASET

            Only Angle Difference      Only Magnitude             Combined
            (Optimal Window Size 45)   (Optimal Window Size 53)   (Optimal Window Size 57)
Scene 1     90.49                      94.42                      98.55
Scene 2     88.44                      94.01                      98.91
Scene 3     97.38                      94.90                      99.77
Overall     91.30                      94.34                      99.08
Accuracy


TABLE VI
THE INFLUENCE OF ANGLE DIFFERENCE AND OPTICAL FLOW MAGNITUDE ON ACCURACY (%) IN THE PETS DATASET

            Only Angle Difference      Only Magnitude             Combined
            (Optimal Window Size 25)   (Optimal Window Size 55)   (Optimal Window Size 43)
View 1      98.39                      89.84                      99.12
View 2      89.84                      82.35                      99.73
View 3      94.92                      89.84                      99.12
View 4      89.84                      79.95                      95.45
Overall     93.25                      85.49                      98.39
Accuracy

F. COMPUTATION TIME

The computational time for each phase of our method on the UMN and PETS2009 datasets is shown in TABLE VII and TABLE VIII, respectively. Results are obtained using Matlab 2018 on a Windows 7 operating system with an Intel Core i7-6700 at 2.60 GHz and 16 GB RAM. The results show that MII formation needs a considerable amount of time compared to the other stages. In addition, we observe that once the CNN is trained, the testing stage is very fast.

TABLE VII
COMPUTATION TIME ON UMN DATASET
(average of 5 independent runs)

Performance    MII Formation              CNN Training    Testing
Scene 1        1810.38 sec                11.50 sec       3.07 sec
(2 videos)     (2 videos, 1451 frames)    (1 video)       (1 video)
Scene 2        6262.88 sec                33.73 sec       3.42 sec
(6 videos)     (6 videos, 4140 frames)    (5 videos)      (1 video)
Scene 3        2657.12 sec                17.13 sec       3.12 sec
(3 videos)     (3 videos, 2137 frames)    (2 videos)      (1 video)

TABLE VIII
COMPUTATION TIME ON PETS2009 DATASET
(average of 5 independent runs)

Performance    MII Formation                     CNN Training                       Testing
View 1         2421.6 sec                        14.39 sec                          2.13 sec
               (4 videos/views, 1496 frames)     (3 videos: view #2, #3 and #4)     (1 video: view #1)
View 2         2421.6 sec                        13.32 sec                          2.09 sec
               (4 videos, 1496 frames)           (3 videos: view #1, #3 and #4)     (1 video: view #2)
View 3         2421.6 sec                        14.73 sec                          2.10 sec
               (4 videos, 1496 frames)           (3 videos: view #1, #2 and #4)     (1 video: view #3)

V. CONCLUSIONS

We presented an approach for abnormal crowd behaviour detection. The proposed approach is based on a new Motion Information Image (MII) model that is formulated using optical flow. The MII depends on the angle difference calculated between the optical flow vectors in consecutive frames. Some optical flow measurements are small, and their angle differences may affect the observation. To overcome this problem, the angle difference is multiplied by the optical flow magnitude in the current frame to generate the MIIs. A convolutional neural network (CNN) is used to learn normal and abnormal events, and when a test sample is input to the CNN, it is assigned to one of the two classes (Normal or Abnormal). Evaluations are conducted on the publicly available UMN and PETS2009 datasets. The results indicate that the proposed work is very effective.

REFERENCES
[1] Y. Cong, J. Yuan and J. Liu. "Abnormal Event Detection in Crowded Scenes using Sparse Representation". Pattern Recognition, vol. 46, no. 7, pp. 1851-1864, 2013.
[2] J. S. Marques, P. M. Jorge, A. J. Abrantes, and J. M. Lemos. "Tracking Groups of Pedestrians in Video Sequences". Presented at Computer Vision and Pattern Recognition Workshop, 9:101, 2003.
[3] P. Tu, T. Sebastian, G. Doretto, N. Krahnstoever, J. Rittscher, and T. Yu. "Unified Crowd Segmentation". Presented at European Conference on Computer Vision, vol. 5305, pp. 691-704, 2008.
[4] G. Brostow and R. Cipolla. "Unsupervised Bayesian Detection of Independent Motion in Crowds". IEEE Computer Vision and Pattern Recognition, no. 1, pp. 594-601, 2006.
[5] E. L. Andrade, S. Blunsden and R. B. Fisher. "Modelling Crowd Scenes for Event Detection". Presented at IEEE International Conference on Pattern Recognition, vol. 1, pp. 175-178, 2006.
[6] S. Ali and M. Shah. "A Lagrangian Particle Dynamics Approach for Crowd Flow Segmentation and Stability Analysis". Presented at IEEE Conference on Computer Vision and Pattern Recognition, pp. 1-6, 2007.
[7] R. Mehran, A. Oyama, M. Shah. "Abnormal Crowd Behavior Detection using Social Force Model". Presented at IEEE Conference on Computer Vision and Pattern Recognition, pp. 935-942, 2009.
[8] R. Mehran, A. Oyama, M. Shah. "Abnormal Crowd Behaviour Detection using Social Force Model". Presented at IEEE Conference on Computer Vision and Pattern Recognition, pp. 935-942, 2009.
[9] A. Adam, E. Rivlin, I. Shimshoni, and D. Reinitz. "Robust Real-Time Unusual Event Detection using Multiple Fixed-Location Monitors". IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 30, pp. 555-560, 2008.
[10] H. M. Dee and A. Caplier. "Crowd behaviour analysis using histograms of motion direction". Presented at IEEE International Conference on Image Processing, pp. 1545-1548, 2010.
[11] L. Kratz and K. Nishino. "Anomaly Detection in Extremely Crowded Scenes using Spatio-Temporal Motion Pattern Models". Presented at IEEE Conference on Computer Vision and Pattern Recognition, pp. 1446-1453, 2009.
[12] S. Wu, B. E. Moore, M. Shah. "Chaotic Invariants of Lagrangian Particle Trajectories for Anomaly Detection in Crowded Scenes". Presented at IEEE Conference on Computer Vision and Pattern Recognition, pp. 2054-2060, 2010.
[13] L. Weixin, V. Mahadevan, and N. Vasconcelos. "Anomaly Detection and Localization in Crowded Scenes". IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 36, no. 1, pp. 18-32, 2014.


[14] Y. Cong, J. Yuan and J. Liu. "Abnormal Event Detection in Crowded Scenes using Sparse Representation". Pattern Recognition, vol. 46, no. 7, pp. 1851-1864, 2013.
[15] W. Ren, G.-H. Li, J. Chen and H.-Z. Liang. "Abnormal crowd behavior detection using behavior entropy model". Presented at International Conference on Wavelet Analysis and Pattern Recognition, pp. 212-221, 2012.
[16] C. Direkoglu, M. Sah and N. E. O'Connor. "Abnormal crowd behaviour detection using novel optical flow-based features". Presented at IEEE International Conference on Advanced Video and Signal based Surveillance (AVSS2017), 2017.
[17] S. Hawkins, H. He, G. Williams, and R. Baxter. "Outlier Detection Using Replicator Neural Networks". Presented at International Conference in Data Warehousing and Knowledge Discovery, 2002.
[18] Z. Fang, F. Fei, Y. Fang, C. Lee, N. Xiong, L. Shu and S. Chen. "Abnormal event detection in crowded scenes based on deep learning". Multimedia Tools and Applications, vol. 75, no. 22, pp. 14617-14639, November 2016.
[19] Y. Feng, Y. Yuan, and X. Lu. "Deep Representation for Abnormal Event Detection in Crowded Scenes". Presented at ACM on Multimedia Conference (MM '16), pp. 591-595, 2016.
[20] S. Wu, H. S. Wong and Z. Yu. "A Bayesian Model for Crowd Escape Behaviour Detection". IEEE Transactions on Circuits and Systems for Video Technology, vol. 24, no. 1, pp. 85-98, January 2014.
[21] M. Gnouma, R. Ejbali, and M. Zaied. "Abnormal events' detection in crowded scenes". Multimedia Tools and Applications, vol. 77, no. 19, pp. 24843-24864, 2018.
[22] N. Patil and P. K. Biswas. "Global abnormal events detection in crowded scenes using context location and motion-rich spatio-temporal volumes". IET Image Processing, vol. 12, no. 4, pp. 596-604, 2018.
[23] M. Ravanbakhsh, M. Nabi, E. Sangineto, L. Marcenaro, C. Regazzoni and N. Sebe. "Abnormal event detection in videos using generative adversarial nets". Presented at IEEE International Conference on Image Processing (ICIP 2017), 2017.
[24] M. Ravanbakhsh, M. Nabi, H. Mousavi, E. Sangineto and N. Sebe. "Plug-and-Play CNN for Crowd Motion Analysis: An Application in Abnormal Event Detection". Presented at IEEE Winter Conference on Applications of Computer Vision (WACV 2018), pp. 1689-1698, 2018.
[25] L. Pan, H. Zhou, Y. Liu, and M. Wang. "Global event influence model: integrating crowd motion and social psychology for global anomaly detection in dense crowds". Journal of Electronic Imaging, vol. 28, no. 2, 023-033, April 2019.
[26] R. V. H. M. Colque, C. A. C. Junior and W. R. Schwartz. "Histograms of optical flow orientation and magnitude to detect anomalous events in videos". IEEE Transactions on Circuits and Systems for Video Technology, vol. 27, no. 3, pp. 673-682, 2017.
[27] T. Li, H. Chang, M. Wang, B. Ni, R. Hong and S. Yan. "Crowded Scene Analysis: A Survey". IEEE Transactions on Circuits and Systems for Video Technology, vol. 25, no. 3, pp. 367-386, 2015.
[28] V. J. Kok, M. K. Lim, C. S. Chan. "Crowd behaviour analysis: A review where physics meets biology". Neurocomputing, 2015.
[29] G. Tripathi, K. Singh and D. K. Vishwakarma. "Convolutional neural networks for crowd behaviour analysis: a survey". The Visual Computer, vol. 35, pp. 753-776, https://doi.org/10.1007/s00371-018-1499-5, 2019.
[30] X. Wang and C. C. Loy. "Deep Learning for Scene Independent Crowd Analysis". Group and Crowd Behavior for Computer Vision, pp. 209-252, 2017.
[31] University of Minnesota, available from http://mha.cs.umn.edu/movies/crowdactivity-all.avi.
[32] University of Reading, PETS 2009 Dataset S3 Rapid Dispersion, available from http://www.cvg.rdg.ac.uk/PETS2009/a.html#s2l1
[33] E. Hatirnaz, M. Sah and C. Direkoglu. "A novel framework and concept-based semantic search interface for abnormal crowd behaviour analysis in surveillance videos". Multimedia Tools and Applications, 2020. https://doi.org/10.1007/s11042-020-08659-2
[34] B. D. Lucas and T. Kanade. "An Iterative Image Registration Technique with an Application to Stereo Vision". Presented at International Joint Conference on Artificial Intelligence, pp. 674-679, 1981.
[35] D. Chen and P. Huang. "Motion-based unusual event detection in human crowds". Journal of Visual Communication and Image Representation, vol. 22, no. 2, pp. 178-186, 2011.
[36] W. Sultani, C. Chen and M. Shah. "Real-World Anomaly Detection in Surveillance Videos". IEEE Conference on Computer Vision and Pattern Recognition, pp. 6479-6488, 2018.
[37] D. Tran, L. Bourdev, R. Fergus, L. Torresani and M. Paluri. "Learning Spatiotemporal Features with 3D Convolutional Networks". IEEE International Conference on Computer Vision (ICCV), pp. 4489-4497, 2015.
[38] K. He, X. Zhang, S. Ren and J. Sun. "Deep Residual Learning for Image Recognition". IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770-778, 2016.
[39] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens and Z. Wojna. "Rethinking the Inception Architecture for Computer Vision". IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2818-2826, 2016.
[40] G. Huang, Z. Liu, L. van der Maaten and K. Q. Weinberger. "Densely Connected Convolutional Networks". IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4700-4708, 2017.
[41] M. Oquab, L. Bottou, I. Laptev, and J. Sivic. "Learning and transferring mid-level image representations using convolutional neural networks". IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1717-1724, June 2014.
[42] A. Krizhevsky, I. Sutskever, and G. E. Hinton. "ImageNet classification with deep convolutional neural networks". In Proc. Advances in Neural Information Processing Systems, pp. 1097-1105, 2012.
