Paper 9
https://doi.org/10.1007/s10489-024-05775-6
Abstract
In recent years, crowd behavior detection has posed significant challenges in the realm of public safety and security, even
with the advancements in surveillance technologies. The ability to perform real-time surveillance and accurately identify
crowd behavior by considering factors such as crowd size and violence levels can avert potential crowd-related disasters and
hazards to a considerable extent. However, most existing approaches are ill-equipped to deal with the complexities of crowd
dynamics and fail to distinguish different violence levels within crowds. Moreover, the prevailing approach to crowd behavior
recognition, which solely relies on the analysis of closed-circuit television (CCTV) footage and overlooks the integration of
online social media video content, leads to a primarily reactive methodology. This paper proposes a crowd behavior detection
framework based on the swin transformer architecture, which leverages crowd counting maps and optical flow maps to detect
crowd behavior across various sizes and violence levels. To support this framework, we created a dataset of videos, sourced
from CCTV camera footage and online videos, that captures crowd behaviors across different sizes and violence levels.
Experimental analysis conducted on benchmark datasets and our proposed dataset substantiates the superiority of our
proposed approach over existing state-of-the-art methods, showcasing its ability to effectively distinguish crowd behaviors
concerning size and violence level. Our method’s validation through Nvidia’s DeepStream Software Development Kit (SDK)
highlights its competitive performance and potential for real-time intelligent surveillance applications.
Keywords Crowd behavior detection · Swin transformer · DeepStream · Crowd size · Violence Level
seen as a reactive system and requires manual monitoring of events.

In this context, efficient, autonomous, and real-time analysis of video data can enable effective and proactive monitoring over large geographical areas, and can assist public safety officials in proactive decision-making in areas that exhibit large crowds. Thus, intelligent and smart crowd behavior recognition has emerged as an indispensable area in computer vision research. Since the advent of deep learning (DL) algorithms, the processing of enormous amounts of unstructured data has led to many human behavior recognition methods being developed using CNNs [3-5], LSTMs [5], GANs [6], Autoencoders [3], ResNet [7], etc. However, most of these methods classify crowds only as violent/nonviolent [5, 7] or normal/abnormal [3, 4, 6]. For law enforcement agencies, the size and violence level of the crowd are also crucial to making decisions in practical scenarios [8]. For example, if the model identifies the existence of a small violent crowd, then the authorities can prioritize containment and swift intervention to minimize the impact and prevent escalation. At the same time, the identification of large violent crowds changes the reaction strategies of officials to deploy additional resources to maintain crowd control. The development of such systems requires training a model using classes that characterize crowd size and violence levels.

To the best of our knowledge, no such dataset exists in the literature, nor has such a problem been addressed by researchers. To this end, we first present a novel dataset consisting of videos representative of typical public gatherings. The video database contains videos of normal public daily activity, small-scale violent events, large-scale violent events, and large-scale peaceful events. This distinction allows for crowd behavior classification based on the size of the crowd within the frame and the level of violence. In addition, the dataset contains videos taken from CCTV footage, where the camera is stationary and at a distance from the event, and from social media uploads, where the video is taken via a mobile camera, introducing motion into the video. We have deliberately introduced social media video content to develop a system that can not only identify crowd behavior from CCTV cameras managed by authorities but also analyze video content that is uploaded via social media by the public. The latter allows governments to expand their monitoring regions and identify potential threats, suspicious behavior, or illegal activities that might be shared or discussed in these videos. The proposed system is a proactive approach to public safety monitoring that enables the initiation of appropriate actions to prevent crimes before they occur or escalate.

The need to identify and classify crowd behavior in both CCTV video and social media streams makes crowd behavior classification more challenging. Considering all these aspects, we propose a DL model based on a video swin transformer to classify crowd behavior into Natural (N), Large Peaceful Gathering (LPG), Large Violent Gathering (LVG), and Fighting (F), classes that can distinguish crowd dynamics and the extent of violence. To facilitate the learning and prediction of crowd behavior classes, we have employed crowd-counting maps and optical flow maps as influential components within our proposed model. The crowd-counting maps aid the model in distinguishing between large and small events, whereas the optical flow maps enhance the analysis of temporal violent patterns of the crowd. Finally, to demonstrate the outcomes of the proposed model in real time and on real videos, we leverage Nvidia's DeepStream Software Development Kit (SDK) [9], an intelligent application framework for processing real-time video data. Thus, our main contributions are:

• A swin transformer-based DL model is developed for the purpose of classifying crowd behavior into four discrete categories characterized by varying levels of violence and crowd sizes.
• Additional semantic knowledge pertaining to crowd density and violence levels is augmented into the swin transformer framework by the integration of crowd-counting maps and optical flow maps.
• We have curated a large dataset that can serve as a benchmark resource for training models dedicated to monitoring crowd-related events through the analysis of data originating from public CCTV surveillance cameras and online social media platforms. Furthermore, we have extracted a subset of the dataset comprising exclusively CCTV footage. This dedicated subset is instrumental in the development of models for public CCTV surveillance applications.
• Experimental analysis has been executed employing the DeepStream SDK to ascertain the viability and practicality of our proposed methodology within an actual real-time surveillance environment.

The rest of the paper is structured as follows: In Section 2, a comprehensive review of the existing literature is presented, while Section 3 delineates the proposed crowd behavior detection model and elucidates the processes involved in dataset creation. Section 4 is dedicated to discussing experimental analysis and its outcomes, and elaborates on real-time analysis employing DeepStream. Finally, the paper is concluded in Section 5.

2 Related work

Accurate detection and precise prediction of crowd behavior are essential for effective crowd management within smart surveillance systems. The increase in crowd-related mishaps in the past decades has led to significant advances
in computer vision research, which actively drives efficient and proactive crowd surveillance. This section provides an outlook of recent DL approaches for video data analysis, various methods employed for analyzing video data derived from the internet and CCTV sources, as well as existing publicly available datasets for tasks related to crowd control and human activity recognition.

2.1 Advances in DL methods for video analysis

DL has revolutionized video analysis by enabling the extraction of high-level representations from raw video data. The breakthrough in video analysis was mainly due to the power of Convolutional Neural Networks (CNN), which are successful in object detection [10], tracking [11], and action recognition [12]. CNNs are widely used for crowd analysis as well. A cascade of 3D CNN and 3D autoencoder was proposed by Sabokrou et al. [3] for crowd anomaly classification. Zhou et al. [4] utilized a spatiotemporal CNN to detect panic situations in a crowd. 3DCNNs were employed in [13] and [14] to detect various crowd behaviors.

Recently, ResNet, a variant of CNN that eliminates the vanishing gradient problem and allows easier training [15], has been widely used for video processing. Ng et al. [16] proposed a ResNet-based architecture, namely ActionFlowNet, for classifying human actions. The long-term and short-term features in action videos are segregated using ResNet in [17], and a 3D Loop ResNet was utilized by Kakamu et al. [18] for predicting various human actions. ResNet was also employed in [7] for violent behavior detection, crowd density classification, and crowd counting. Abnormal crowd event detection in small-scale and large-scale crowds was proposed in [19], and in [20], features for crowd behavior pattern analysis were extracted using ResNet.

Other widely used DL methods for video analysis include Recurrent Neural Networks (RNNs) and their variant, Long Short-Term Memory (LSTM) networks. Chen et al. [21] and Ebrahimi et al. [22] proposed RNN-based algorithms to identify various emotions of a crowd. Moreover, many studies have explored the properties of LSTM networks, which can remember long-term dependencies and solve the vanishing and exploding gradient problem of RNNs [23]. The sequences of group activities were recognized in [24] using a 2-stage LSTM model, and crowd behaviors based on psychological properties were predicted in [25] using a convolutional LSTM.

In the recent past, attention mechanisms have been applied to video analysis to focus on relevant spatiotemporal regions or frames. Vaswani et al. [26] put forth the idea of self-attention with the Transformer architecture for language translation. Inspired by the success of transformers in natural language processing (NLP), transformer-based architectures have been adapted for video analysis. These models capture long-range dependencies and facilitate parallel processing of frames in videos. Furthermore, transformers are more scalable to very large-capacity models [27] and assume less prior knowledge about the structure of the problem compared to CNNs and RNNs [28]. These advantages have led to their success in many computer vision tasks such as image recognition [29] and object detection [30]. Dosovitskiy et al. [29] proposed the Vision Transformer (ViT), which achieved promising results in image classification tasks by modeling the relationship (attention) between the spatial patches of an image using the standard transformer encoder [26]. After ViT, many transformer-based video recognition methods [31, 32] have been proposed. In these works, different techniques have been developed for temporal attention as well as spatial attention. Subsequently, attention mechanisms similar to those of Transformers were used with convLSTM for action recognition [33], crowd behavior prediction [25], and gesture recognition [34] from videos.

In a nutshell, Transformer-based approaches have led to significant advancements in the realm of computer vision. The performance improvements are quite impressive and represent a major step forward in this field. Among the Transformer frameworks discussed above, the swin transformer [31] has been a game changer in the field of computer vision. It has set new records in object detection and semantic segmentation benchmarks [35], and has shown that Transformer-based approaches are the future of visual modeling. In addition, the swin transformer uses shifted non-overlapping windows, which makes it fast and hardware friendly, and this inspired us to use the framework as the backbone of our proposed model (details of the swin transformer framework are given in Section 3).

2.2 Existing video analysis methods for online videos and CCTV footage

Online videos constitute multimedia content accessible for either streaming or downloading via the internet. This category spans diverse content genres, including but not limited to movies, TV shows, documentaries, music videos, tutorials, vlogs, and more. In some cases, surveillance of social media videos can contribute to public safety efforts. Wang et al. [36] proposed a deep recurrent neural network to extract temporal features to classify audio frames for event detection from videos such as sandwich making, flash mob gathering, etc. Complex events from web videos were classified using a two-stage CNN in [37], and in [38] a CNN was utilized to extract features from the video content, and a concept library
using a Support Vector Machine (SVM) was created to organize the events.

Conversely, analyzing CCTV videos is a common practice in various domains, including security, safety, transportation, and retail, to enhance situational awareness, improve operational efficiency, and enable proactive decision-making. CCTV footage is typically captured by stationary cameras strategically placed at specific locations for surveillance purposes. Since these cameras have a fixed field of view and do not move, they provide a continuous stream of video footage from a particular perspective. In [39], suspicious activities inside a campus were detected from CCTV footage by employing VGG-16 as the feature extractor and LSTM as the classifier. The method proposed by Khan et al. [40] utilized a CNN to find anomalies such as accidents in traffic videos. Anomaly detection was also proposed by Aboah et al. [41] using a decision tree-based approach. Moreover, CCTV footage has been used to analyze a crowd's real-time behaviors, which helps in reliable and proactive crowd management. Baqui et al. [42] studied cross-correlation and optical flow patterns to analyze pedestrian flows from real-time CCTV videos. The crowd density and the parameters of pedestrian flow, such as direction and speed, from Hajj videos collected using CCTV cameras were also explored in [43] to display the crowd movement in 3D animation form for better crowd control. The camera's rotation, focal length, and position arguments, together with a CSRNet-based head tracking AI algorithm, were used to detect the position of persons in the crowd.

Although many works have been proposed for analyzing online video content for captioning, event detection, sentiment analysis, etc., such videos have largely remained unused by law enforcement agencies and public surveillance systems due to the lack of suitable models and datasets for training and evaluation. Despite the pervasive utilization of DL models in the analysis of online and CCTV videos across various domains, none of these models exhibit promising capabilities for the discernment of crowd behavior predicated on criteria such as crowd size and violence level. Hence, the exigent requirement is the development of an intelligent surveillance system with global applicability, notably crucial for governmental agencies facing diverse challenges, especially in cases of emergencies, such as widespread unrest, and during large-scale public events, such as concerts, national holidays, and sports tournaments. Furthermore, the prevailing literature lacks comprehensive methodologies supported by real-time experimentation, which is essential in pre-empting situations from spiraling out of control due to delayed or inadequate security responses. Therefore, we propose a DL framework alongside a diligently created dataset customized for the classification of crowd behaviors contingent upon crowd size and violence levels. Additionally, we furnish empirical validation through real-time experiments, thereby rendering our system aptly suited for smart surveillance in real-world scenarios.

2.3 Existing human activity recognition (HAR) and crowd datasets

The most important part of an AI-based smart surveillance system for crowd behavior detection is the availability of benchmark datasets for training purposes. Here, we provide a review of existing publicly available datasets for crowd management and HAR closely related to our work.

• Movie Actions Dataset [44]: The dataset provides annotated movie clips. Each clip in the dataset belongs to one of 51 classes covering various actions such as GetOutCar, HandShake, HugPerson, Kiss, SitDown, SitUp, StandUp, etc.
• UCF50 [45] & UCF101 [46]: The UCF50 and UCF101 datasets consist of YouTube clips grouped into one of 50 and 101 action categories, respectively. Examples of action classes in the UCF50 dataset include Basketball Shooting and Pull-Ups, while the action classes in UCF101 cover a wider spectrum subdivided into five categories, namely body motion, human-human interactions, human-object interactions, playing musical instruments, and sports.
• Kinetics Dataset: This dataset consists of three versions: Kinetics-400 [47], Kinetics-600 [48], and Kinetics-700 [49]. The Kinetics-400 dataset is a large-scale action recognition dataset that contains around 240,000 video clips categorized into 400 action classes. Each video clip has an average duration of around 10 seconds. This dataset was designed for the task of action recognition in videos. An extension of Kinetics-400, Kinetics-600 includes additional action classes for video action recognition. It provides a broader range of actions for more comprehensive research and evaluation. Another extension, Kinetics-700 extends the action classes even further, providing a more diverse and challenging dataset for action recognition tasks.
• Violent Flows [50]: Focuses on crowd violence and comprises 246 crowd videos extracted from YouTube, divided into two classes: violence and non-violence.
• UCF Crime Dataset [51]: A collection of long surveillance videos from YouTube and LiveLeak that covers thirteen crime classes (e.g., road accident, burglary, robbery, etc.).
• CCTV-fights [52]: A dataset of 1000 videos, whose cumulative length exceeds 8 hours, of real fights caught by CCTV cameras, annotated as fight and non-fight.
• Surveillance Camera Fight Dataset [53]: Contains 300 videos collected from movies and hockey games, divided equally into two classes: fight and non-fight.
• UMN [54]: The dataset comprises eleven videos and intends to classify the crowd as either normal or abnormal. The normal and abnormal classes are distinguished based on the running patterns of people in the crowd.
• UCF Normal/Abnormal Web Dataset [55]: A collection of twenty videos with normal, escape panic, clash, and fight as crowd classes.

In short, although the HAR datasets are useful for testing different DL architectures, they are not necessarily useful for specific practical tasks, such as surveillance, which likely requires the distinction between a limited number of specific action classes. Furthermore, to the best of our knowledge, no video dataset in the literature contains large gatherings, such as protests, as an action class. For instance, protest datasets in the literature are limited to image datasets [56] and protest metadata [57], which document protester demands, government responses, protest location, and protester identities. Thus, the novelty of our developed video dataset is that it is specifically aimed toward identifying scenarios of public unrest (violent protests, fights, etc.) or scenarios that have the potential to develop into public unrest (large gatherings, peaceful protests, etc.). Large gatherings are particularly interesting and important to monitor carefully, as they can lead to unruly events. Large gatherings that seem peaceful can evolve into a violent scenario with fighting, destruction of property, etc. In addition, the scale of violence captured can inform the scale of the response from law enforcement. Thus, for the current task, we divide violence into small-scale violence (i.e., F) and large-scale violence (i.e., LVG). To our knowledge, these aspects have been largely neglected in existing datasets, which motivates this work.

3 Proposed framework and dataset

This section describes the proposed model for analyzing internet and surveillance videos as well as the dataset used to train that model. Figure 1 depicts the overall system architecture of the proposed framework.

3.1 Video swin transformer

The main backbone of our framework is the swin transformer, more precisely, the variant known as the video swin transformer. The swin transformer is characterized by its hierarchical architecture, which partitions images into smaller patches at the initial layers of the transformer structure and progressively merges adjacent patches at deeper levels to create larger patches. It leverages the concept of shifted windows during inference, thereby enhancing its capacity for representation and contributing to its remarkable recent state-of-the-art performance [58]. Beyond its state-of-the-art performance, the swin transformer demonstrates superior computational efficiency compared to other models. Notably, the computational demands of the model grow linearly with the input image resolution, contrasting with other models where computation time escalates quadratically with increasing image resolution. Among the multiple versions of the video swin transformer, we adopt Swin-T, the tiny version of swin, as it is designed to be more
Fig. 2 Architecture of Swin-T [35]. The input video is represented by a tensor of shape T × H × W × 3, where T is the number of frames and
H × W is the height and width of each frame having 3 channels (RGB)
efficient and faster than the other versions, making it well-suited for scenarios where computational resources are limited or inference speed is crucial. The architecture of Swin-T is provided in Fig. 2.

The framework consists of four stages, where each stage except stage 1 has three components: Patch Merging, a Linear Layer, and a Video Swin Transformer block. In stage 1, each frame in the input video, V = {f_1, f_2, ..., f_T}, is divided into 3D patches/tokens of size 2 × 4 × 4 × 3 by the 3D patch partition layer, which results in T/2 × H/4 × W/4 tokens. These tokens are given to the linear embedding layer, where the features of each token are projected to an arbitrary dimension C (for Swin-T, C = 96). The patch merging layers of each stage perform the spatial downsampling and concatenation of 2 × 2 neighboring patches, where a linear layer is utilized to project the concatenated patches to half of the input dimension. The significant block in each stage is the video swin transformer block, which comprises a 2-layer multi-layer perceptron (MLP) with a Gaussian Error Linear Unit (GELU) activation and a 3D shifted window-based multi-head self-attention (3DWMSA) module, as shown in Fig. 3.

A residual connection is established after each module to overcome vanishing gradients, and layer normalization (LN) is applied to the inputs of the MLP and 3DWMSA modules to control covariate shift. A block of the video swin transformer, as illustrated in Fig. 3, is given by

ẑ^l = 3DWMSA(LN(z^{l-1})) + z^{l-1},   z^l = MLP(LN(ẑ^l)) + ẑ^l   (1)

and

ẑ^{l+1} = 3DSWMSA(LN(z^l)) + z^l,   z^{l+1} = MLP(LN(ẑ^{l+1})) + ẑ^{l+1},   (2)

where z^l denotes the output of the l-th block and 3DSWMSA is the shifted-window counterpart of 3DWMSA applied in alternating blocks. Each 3DWMSA module partitions its input into non-overlapping 3D windows of size P × M × M. That is, the MSA of the first layer generates T/P × H/M × W/M non-overlapping 3D windows, and the window partition of the second layer is shifted by (P/2, M/2, M/2) tokens. An example of 3DWMSA is provided in Fig. 4. Finally, self-attention is computed by including a 3D relative position bias, B, and is given by

Attention(q, k, v) = softmax(qk^T / √d + B) v,   (3)

where q represents the query matrix with dimension d, and k and v denote the key and value matrices, respectively, for the self-attention calculation of the T frames.
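To illustrate the attention computation in (3), the following is a minimal PyTorch sketch of windowed self-attention with an additive relative position bias. The window size (P = 2, M = 7) and the zero-initialized bias are illustrative placeholders; in the actual swin transformer the bias is gathered from a learned relative-position table and the computation is split across multiple heads.

import torch
import torch.nn.functional as F

def windowed_attention(q, k, v, bias):
    # q, k, v: (num_windows, tokens_per_window, d); bias: (tokens_per_window, tokens_per_window)
    d = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d ** 0.5 + bias   # qk^T / sqrt(d) + B
    return F.softmax(scores, dim=-1) @ v                 # softmax(.) v

tokens = 2 * 7 * 7                        # a hypothetical P x M x M = 2 x 7 x 7 window
q = torch.randn(4, tokens, 96)            # 4 windows, feature dimension C = 96 (Swin-T)
k, v = torch.randn_like(q), torch.randn_like(q)
B = torch.zeros(tokens, tokens)           # stand-in for the learned 3D relative position bias
out = windowed_attention(q, k, v, B)      # shape (4, 98, 96)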
Finally, after stage 4, a softmax layer is employed to calculate the probability distribution over the crowd behavior labels.

The proposed framework leverages crowd counting maps (CC_Maps) and optical flow patterns (Opt_Flow) as important components that augment supplementary semantic knowledge to classify crowd behavior based on attributes including crowd size and violence level. The Opt_Flow and CC_Maps are computed for every two consecutive frames of each sample. For a sample with frames {f_1, ..., f_T}, we compute a CC_Map, C, for the frames {f_1, f_3, f_5, ..., f_{T-1}}, skipping one frame at a time. Additionally, we compute the Opt_Flow, O, for each frame pair {(f_i, f_{i+1}) | i ∈ [1, T]}. Consequently, one sample of input to the swin transformer is the result of the concatenation of the T input frames V = (f_1, f_2, ..., f_T), the T/2 CC_Maps, C = (c_1, c_2, ..., c_{T/2}), and the T-1 Opt_Flow maps, O = (o_1, o_2, ..., o_{T-1}), and is represented as

I_j = V_j ⊕ C_j ⊕ O_j,   (4)

where j = 1, 2, 3, ..., n indexes the samples of each video and ⊕ is the concatenation operation. The overall procedure of the proposed model is illustrated in Algorithm 1. The following subsection furnishes a detailed explanation of the processes involved in generating CC_Maps and Opt_Flow.

Algorithm 1 Crowd Behavior Detection Model
1: Notations used: T ← number of frames in a video sample, n ← number of samples, ⊕ ← concatenation operation
Require: Input samples V_1, V_2, ..., V_n; input video frames per sample f_1, f_2, ..., f_T
Ensure: Behavior B
2: procedure CCMaps(V)
3:   for each j in V do
4:     for each i in f do
5:       Crowd_Density_Matrix ← PreTrainVGG19(f_i)
6:       Convert Crowd_Density_Matrix to C_i
7:       if i is odd then
8:         CrowdCountingMaps, C ← Append(C_i)
9:       end if
10:     end for
11:   end for
12: end procedure
13: procedure OFlow(V)
14:   for each j in V do
15:     for each i in f do
16:       image_1 ← f_i
17:       image_2 ← f_{i+1}
18:       O_i ← PretrainedRAFT(image_1, image_2)
19:     end for
20:     OpticalFlow, O ← Append(O_i)
21:   end for
22: end procedure
23: procedure CrowdBehavior(V, C, O)
24:   for each j in V do
25:     I_j ← V_j ⊕ C_j ⊕ O_j
26:   end for
27:   Train VideoSwin(I)
28:   Behavior, B ← VideoSwin(I)
29: end procedure
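To make Algorithm 1 and (4) concrete, the following is a minimal Python sketch of how one input sample can be assembled. The functions crowd_density_map and optical_flow are placeholders standing in for the pre-trained VGG19-based counting model [59] and the pre-trained RAFT model [64], and the channel-wise packing of I_j shown here is only one possible layout, since the exact packing is not spelled out above.

import torch

def crowd_density_map(frame):                # placeholder for the VGG19-based crowd counter [59]
    return torch.rand(1, frame.shape[-2], frame.shape[-1])

def optical_flow(frame_a, frame_b):          # placeholder for the pre-trained RAFT model [64]
    return torch.rand(2, frame_a.shape[-2], frame_a.shape[-1])

def build_sample(frames):
    # frames: (T, 3, H, W) -> I_j = V_j (+) C_j (+) O_j, concatenated along the channel axis
    T = frames.shape[0]
    cc_maps = [crowd_density_map(frames[i]) for i in range(0, T, 2)]         # T/2 maps (odd frames)
    flows = [optical_flow(frames[i], frames[i + 1]) for i in range(T - 1)]   # T-1 flow maps
    return torch.cat([frames.flatten(0, 1), torch.cat(cc_maps), torch.cat(flows)], dim=0)

sample = build_sample(torch.rand(20, 3, 224, 224))
print(sample.shape)   # torch.Size([108, 224, 224]): 60 frame channels + 10 CC maps + 38 flow channels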
3.2 Crowd counting and optical flow maps 29: end procedure
The proposed model relies on two auxiliary representations: CC_Maps, which can store information about the spatial distribution of the crowd, and Opt_Flow, which can store information about crowd movement. This section describes how optical flow and crowd-counting maps are extracted for the videos in the datasets and how they are utilized for training and validation.

3.2.1 Computation of CC_Maps

Crowd counting and localization have drawn significant attention in the literature for their usefulness in surveillance, tracking, and crowd management applications [59]. Crowd counting can also be useful in our application since it can inform us about the size of the crowd, which would help in distinguishing between LPG and N, as well as between LVG and F. There are two ways in which crowd counting could be helpful for our purposes. One way would be to get the number of people present in a video [60, 61] and use it as a feature of the input video to aid in classification. This approach has two potential drawbacks. First, the total number of people does not always inform us about the number of people involved in the action. In other words, a large number of people could be in the background of the scene while the relevant action in the foreground is taking place, meaning that the distribution of the people in the crowd also matters. Secondly, since we are dealing with video data and the number of people is just a single feature, its influence during inference might be greatly diminished by the thousands of other features extracted and used to obtain a final classification of an input video.

Rather than relying solely on headcount as a feature, our approach is geared towards the computation of crowd density maps. These maps serve as continuous, smoothed heat maps, functioning as a visual representation of the crowd's distribution and intensity. We employed the idea proposed by Wan et al. [59] to generate CC_Maps: each frame of V = {f_1, f_2, ..., f_T} is passed to a pre-trained VGG19-based model [62], which returns a 2-dimensional crowd density matrix with values between 0 and 1 that can be transformed into a grey-scale image, C. In this grey-scale image, a higher value at a pixel indicates a higher crowd concentration at that pixel. For each frame f_i, we produce a crowd density estimation, C_i, in the form of a grey-scale image. An example of a sequence of 3 frames and their respective CC_Maps is shown in Fig. 5. Instead of processing crowd-density maps independently of the image sequences, we opt for concatenating both sets of images and processing them through the swin transformer at once. This allows the network to learn the complex relationship between the frames and the crowd densities, and how those two change with time.

3.2.2 Generation of Opt_Flow

Optical flow is the distribution of velocities of brightness patterns in an image [63]. The velocities of these brightness patterns arise as a result of relative movement between the objects in the video and the video's point of view, such as a change in the position or orientation of the camera. Optical flow maps are image representations that can be computed for two consecutive video frames. The adjacent frames f_i and f_{i+1} of the input video frames {f_i}_{i=1}^T are utilized to generate the Opt_Flow, O_i, using the pre-trained Recurrent All-Pairs Field Transforms (RAFT) model described in [64].

RAFT is a deep learning architecture that addresses the problem of estimating optical flow by predicting per-pixel displacements between two frames. Unlike traditional optical flow methods that often rely on handcrafted features and assumptions about brightness constancy, RAFT takes a learning-based approach. It utilizes a recurrent neural network (RNN) to model the interactions between pixels in a pair of frames and predict the flow field that best explains the observed motion. RAFT computes pixel-wise feature vectors and uses these vectors to find, for each pixel in the first image, the corresponding pixel in the second image.
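One way to obtain the Opt_Flow maps described above is with the pre-trained RAFT model shipped in torchvision; the snippet below is a sketch under that assumption, not necessarily the exact configuration used in this work.

import torch
from torchvision.models.optical_flow import raft_large, Raft_Large_Weights
from torchvision.utils import flow_to_image

weights = Raft_Large_Weights.DEFAULT
raft = raft_large(weights=weights).eval()
preprocess = weights.transforms()                     # normalizes both frames consistently

@torch.no_grad()
def opt_flow(frame1, frame2):
    # frame1, frame2: float tensors (3, H, W) in [0, 1]; H and W must be divisible by 8
    img1, img2 = preprocess(frame1.unsqueeze(0), frame2.unsqueeze(0))
    flow = raft(img1, img2)[-1]                       # final refinement iteration, (1, 2, H, W)
    return flow_to_image(flow)[0]                     # RGB rendering of the flow field, (3, H, W)

o_i = opt_flow(torch.rand(3, 224, 224), torch.rand(3, 224, 224))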
The product of this operation is a field of vectors, one for each pixel, that shows the "movement" of each pixel. Each neighborhood of pixels that moves together is colored homogeneously. An example of a sample of three consecutive frames and their two corresponding Opt_Flow maps is shown in Fig. 6. Homogeneous color patterns in the figure represent regions in the video frame where motion is relatively uniform in both direction and magnitude. These patterns can help identify regions within a crowd where people are moving collectively or uniformly, potentially indicating activities like large-scale gatherings or synchronized movement.

Since optical flow shows the movement of objects relative to the camera, we must guarantee that the camera is sufficiently stationary for Opt_Flow to be useful. Thus, the use of Opt_Flow maps is appropriate mostly for CCTV footage, not for internet videos, since internet videos tend to be fast-moving videos taken from the ground. This is in contrast to CCTV cameras, which are almost always stationary and usually film from a high point of view.

3.3 Dataset collection

Given the absence of pre-existing datasets aligning with our specific class criteria, we embarked on the creation of a novel dataset. This dataset serves as the training foundation for our model designed to monitor internet and CCTV videos. Subsequently, a distinct subset extracted from this dataset is employed for the exclusive training of a model designated for the analysis of CCTV footage. It is important to emphasize that our dataset is uniquely customized to comprise the four distinct classes of behavior requisite for monitoring both internet and CCTV videos, spanning large-scale and small-scale peaceful as well as violent events. This custom dataset fulfills the precise requirements essential for our research objectives. To this end, a large set of YouTube videos and videos from pre-existing datasets that contain one or more of the classes of interest were identified. The videos were given unique IDs that indicate the order in which the videos were obtained. Then, the start and end time stamps of the occurrences of each class in each of the collected videos were recorded. The record of the occurrence of a class in a video would be as follows: "video i contains an instance of class c from the time stamp h_i : m_i : s_i to the time stamp h_f : m_f : s_f", where h, m, and s represent the hours, minutes, and seconds of the timestamp, respectively.

In order to prepare a video for being fed into the proposed framework, the frames of the time periods where the classes occur must be extracted. Before that, we guarantee that the time difference between every two consecutive frames is the same for all videos by setting the frame rate of each video to 10 frames per second (FPS). 10 FPS was chosen since it is a reasonable frame rate that allows the model to analyze the videos in sufficient detail without needing excessive storage space for the frames of each video. Then, we extract the frames of each occurrence of each class in each of the collected videos. Note that the number of frames for each occurrence of each of the classes can be different, since the time periods during which an instance of a class occurs in a video can vary in length, thus changing the number of frames of that instance. For example, an instance that occurs from time stamp 0:0:0 to 0:0:10 has 11 seconds × 10 frames/second = 110 frames, while an instance that occurs from time stamp 0:0:5 to 0:0:8 has 4 seconds × 10 frames/second = 40 frames.

However, the number of frames taken by a DL model must be constant and set before training. To resolve this, we have determined that our model will take 20-frame sequences as input. This is because 20 frames is the minimum number of frames for any possible occurrence of one of the chosen classes of behavior, given the way we record these occurrences. The shortest occurrence of a class begins at a time stamp h : m : s and ends at the time stamp h : m : (s + 1), meaning that it will have 2 seconds × 10 frames/second = 20 frames. Occurrences that are more than 2 seconds long will then be used to produce more than one 20-frame sequence using a sliding window.
Fig. 6 Three consecutive frames and their two corresponding Optical Flow maps
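As a rough illustration of the annotation-driven frame extraction described above, the snippet below resamples one recorded occurrence at 10 FPS with OpenCV. The video file name and the annotation record are hypothetical, and the authors' actual extraction tooling is not specified.

import cv2

def extract_instance_frames(video_path, start_sec, end_sec, fps=10, size=(224, 224)):
    # Resample an annotated occurrence at a fixed 10 FPS (Section 3.3); inclusive of both endpoints,
    # so a 0:0:0-0:0:10 instance yields 11 s x 10 frames/s = 110 frames.
    cap = cv2.VideoCapture(video_path)
    frames = []
    for i in range((end_sec - start_sec + 1) * fps):
        cap.set(cv2.CAP_PROP_POS_MSEC, (start_sec + i / fps) * 1000.0)
        ok, frame = cap.read()
        if not ok:
            break
        frames.append(cv2.resize(frame, size))
    cap.release()
    return frames

# Hypothetical record: video 1 contains an LVG instance from 00:00:30 to 00:01:30
frames = extract_instance_frames("video_1.mp4", start_sec=30, end_sec=90)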
For instance, if an occurrence of one of the classes that starts at time stamp h_i : m_i : s_i and ends at time stamp h_f : m_f : s_f has k frames {f_j, f_{j+1}, ..., f_{j+k-1}}, a sliding window of size 20 will slide through the frames, taking a 20-frame sequence at each step.

Specifically, the 20-frame sequences that would be extracted from the occurrence with frames {f_j, f_{j+1}, ..., f_{j+k-1}} are {f_i, ..., f_{i+19}} ∀i ∈ [0, k - 20]. That is, every two consecutive 20-frame sequences would share 19 frames. Note that consecutive 20-frame sequences sharing some frames are valuable, as this trains the model to be somewhat time-invariant. For example, a 20-frame sequence with a punch must be categorized as Fighting no matter where the punch occurs in the 20-frame sequence. However, sharing 19 out of 20 consecutive frames is inefficient because the dataset would require excessive storage space. Instead, we use a sliding window that jumps 10 frames at each step, meaning that consecutive 20-frame sequences will only share 10 frames. In particular, for an occurrence with k frames {f_j, f_{j+1}, ..., f_{j+k-1}}, the 20-frame sequences that will be used in the dataset are {f_{10i}, ..., f_{10i+19}} ∀i ∈ [0, m], where m = k/10 - 2. The 20-frame sequences, which we call samples, are extracted for each class occurrence in each video collected and added to our dataset. Overall, 2,570 different videos were collected, and the cumulative duration of the occurrences recorded amounted to 68 hours.

3.4 Model training

To train the swin transformer model effectively, we partitioned the videos into distinct training and validation sets. However, it is important to clarify that our division was based on samples, not entire videos, with the aim of allocating 80% of the samples for training and reserving 20% for validation. Achieving this 80-20 sample split, while ensuring the uniqueness of training and validation videos, was accomplished through a random search procedure as follows: Initially, a random selection of videos, with random sizes, was chosen from the video dataset for training, while the remaining videos were designated for validation. The number of training and validation samples for each class within the training and validation videos was tallied, and the per-class training/validation ratio was calculated. After 2 hours of searching, the training and validation sets that achieve the per-class split closest to 80-20 were selected. As a result of this procedure, we arrived at a specific set of 1977 training videos and 593 validation videos. These two sets of videos yielded the following per-class training/validation ratios: N: 80.69% (training) / 19.31% (validation), LPG: 78.62% / 21.38%, LVG: 79.51% / 20.49%, and F: 79.18% / 20.82%. This process ensured a well-balanced and representative distribution of samples across classes for both training and validation, thereby contributing to the robustness of our model training.

As mentioned in Section 3.2.2, Opt_Flow maps prove most effective when applied to videos featuring a stable camera viewpoint. In cases where the camera is in motion, the use of optical flow maps can lead to potential confusion, as the model might interpret camera-induced movement as object motion within the video. To mitigate this issue, we extracted a subset from our dataset comprising samples characterized by minimal changes in the camera's perspective. This subset closely resembles typical CCTV footage, where cameras are typically stationary and not mobile. Our approach involves a detailed examination of each recorded occurrence within the various classes. If a segment of the video demonstrates "significant" camera movement, we exclude that particular occurrence record from the dataset. This process yielded a 25-hour dataset primarily consisting of stationary samples, which we refer to as the Static dataset. Conversely, the broader dataset, of which the Static dataset is a subset, is termed the Original dataset. Note that the proposed model, when trained on the Original dataset, can be used for monitoring internet videos. In contrast, when trained on the Static dataset, it becomes well-suited for CCTV monitoring applications. For the Static dataset, the same random search procedure was adopted for splitting the dataset into training and validation videos. This process resulted in 1121 videos allocated for training and 279 videos designated for validation. The training/validation ratios achieved for the Static dataset were 79.72% / 20.28%, 79.03% / 20.97%, 80.01% / 19.99%, and 79.80% / 20.20% for N, LPG, LVG, and F, respectively.

4 Experimental analysis

The experiments were conducted using a novel dataset that collected videos from YouTube and existing crowd datasets. The details of the video collection for the dataset are provided in Section 3.3. We define four crowd behavior classes based on size and violence level: Natural (N), Large Peaceful Gathering (LPG), Large Violent Gathering (LVG), and Fighting (F). LPG depicts a large number of individuals gathered for a unique purpose, like peaceful protests or sports spectators, whereas LVG represents a large group of individuals of whom a significant number are engaged in violent action that includes clashes with police, fighting between members of the crowd, property destruction, etc. On the other hand, F refers to a small group of individuals fighting each other, and if the footage shows no relation to the above-described behaviors, it is classified as N. Figure 7 portrays sample frames from each class. Extracted videos were annotated carefully by identifying when behaviors of interest occurred.
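The 20-frame, stride-10 sampling scheme described in Section 3.3 can be written compactly as below; this is a minimal sketch that operates on an already extracted list of instance frames.

def extract_samples(frames, window=20, stride=10):
    # Split an occurrence of k frames into 20-frame samples that overlap by 10 frames,
    # i.e. samples {f_10i, ..., f_10i+19} for i in [0, m] with m = k/10 - 2 (Section 3.3).
    k = len(frames)
    return [frames[start:start + window] for start in range(0, k - window + 1, stride)]

# An occurrence of 110 frames (11 s at 10 FPS) yields 110/10 - 2 + 1 = 10 samples
assert len(extract_samples(list(range(110)))) == 10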
This was done by recording the start and end time stamps within which interesting behaviors were observed, as shown in Table 1. Each video is assigned a unique ID, and the occurrence of a class recorded in the annotation table is denoted as an instance of that class.

Table 1 A sample annotation table: 5 instances of the behaviors in 3 separate videos

Video ID   Starting Time   Ending Time   Class
1          00:00:30        00:01:30      LVG
1          00:02:03        00:02:21      N
2          00:00:35        00:00:36      LPG
2          00:01:25        00:01:29      F
3          00:00:00        00:00:03      N

Our dataset consists of 68 hours of videos and is referred to as the Original dataset in the rest of the paper. The Original dataset comprises videos from both static CCTV cameras and moving cameras. To perform experiments on video footage from stationary CCTV cameras, a key component of city-wide surveillance, we extract a subset of videos that match the CCTV footage, called the Static dataset, consisting of 25 hours of video. For training and validation, the videos from both datasets were converted to non-overlapping frames, with 224 × 224 as the frame size. As explained in Section 3.3, the videos were converted to 20-frame samples, and hence the input to the swin transformer is a tensor of size 20 × 3 × 224 × 224. The training and validation of the proposed model were performed with a training-validation ratio of 8:2, as discussed in Section 3.4. All the experiments were done using Python's PyTorch framework on an NVIDIA GeForce GPU with CUDA 11.4.

The training process of the proposed model was performed by minimizing the categorical cross-entropy loss using the Stochastic Gradient Descent (SGD) optimizer with an initial learning rate of 0.0001, a momentum of 0.9, and a weight decay of 0.0001. The hyperparameters used for training are compiled in Table 2. Figure 8 depicts the average loss values during the training and validation of crowd behavior classification. The decreasing behavior detection loss demonstrates that the proposed approach successfully detects the correct behaviors similar to the ground truth labels.

We validated the model by calculating the average accuracy and mean average precision (mAP) for the instance videos as well as the sample videos. Instance videos are sets of frames whose starting and ending timestamps are identified and recorded for a specific class, as given in Table 1. Sample videos are equal-sized image sequences extracted from the instance videos, and we set the sample size as 20 frames. The details of sample extraction are given in Section 3.3, and the number of instances and samples used for training and validation for each crowd behavior is portrayed in Table 3. The average accuracy and mAP of the proposed model are shown in Table 4, and the confusion matrix is portrayed in Fig. 9.

Table 2 Hyperparameters used for training the proposed model

Hyperparameter                    Value
Input size                        224 × 224
Initial learning rate             0.0001
Learning rate update frequency    0.3
Momentum                          0.9
Batch size                        16
Weight decay                      0.0001
No. of frames in a sample         20
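The settings of Table 2 translate into a standard PyTorch training step such as the minimal sketch below. The model here is a toy stand-in for the video swin backbone of Section 3.1, and the interpretation of the "learning rate update frequency" of 0.3 as a decay factor is an assumption.

import torch
import torch.nn as nn

model = nn.Sequential(nn.Flatten(), nn.Linear(20 * 3 * 8 * 8, 4))        # toy stand-in, 4 classes
criterion = nn.CrossEntropyLoss()                                        # categorical cross-entropy
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4,
                            momentum=0.9, weight_decay=1e-4)             # Table 2 values
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.3) # assumed decay factor

inputs = torch.rand(16, 20, 3, 8, 8)     # one toy batch: batch size 16, 20-frame samples
labels = torch.randint(0, 4, (16,))      # N / LPG / LVG / F
optimizer.zero_grad()
loss = criterion(model(inputs), labels)
loss.backward()
optimizer.step()
scheduler.step()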
Fig. 8 Average loss during the training and validation of the proposed model
In terms of accuracy and mAP, the Static dataset yields better results than the Original dataset. This improvement is due to the nature of the Original dataset, which includes both online and CCTV videos. When the camera is in motion, optical flow maps can be misleading, causing the model to mistakenly interpret camera movement as object motion. Conversely, the Static dataset closely mirrors typical CCTV footage, where cameras are usually stationary, thus providing more consistent and reliable data for the model.

Furthermore, we recorded two types of accuracy scores: "sample accuracy" and "instance accuracy." Sample accuracy is calculated by performing inference on all samples in the validation set and then dividing the number of correctly classified validation samples by the total number of validation samples. Conversely, instance accuracy is measured by performing inference on all samples within an instance. If the class to which most samples in an instance are assigned matches the instance's label, the number of correctly classified instances is increased by one. The total number of correctly classified instances is then divided by the total number of instances in the validation set to obtain the instance accuracy of the model. Hence, in most cases, sample accuracy is higher than instance accuracy. In the proposed dataset, the number of LVG videos is lower compared to other classes. This class imbalance results in reduced performance for the LVG class relative to the others.

Table 3 Number of instances and samples for each crowd behavior used for training and validation

Class   Training Samples   Validation Samples   Training Instances   Validation Instances
N       23,152             5,889                816                  103
LPG     27,952             7,418                1,240                223
LVG     6,478              1,618                865                  222
F       6,584              1,667                1,194                344
Total   64,166             16,592               4,115                892

The experimental analysis demonstrated the efficacy of the proposed model in effectively detecting crowd behavior, considering both crowd size and the degree of violence, using data from CCTV and online video sources. Since the videos in the dataset are rich in diverse crowd scenes, captured under multiple climatic conditions and with varying degrees of occlusion, the performance analysis indicates that the model is robust to variations such as weather conditions, occlusion, and video quality. Moreover, the swin transformer's ability to capture both local and global context through shifted windows and hierarchical processing helps in maintaining performance despite these global changes. The use of optical flow maps and the attention mechanism of the swin transformer can reweight the attention distribution to focus on visible and relevant parts of the image, thereby mitigating the impact of occlusions and video quality.

4.1 Impact of crowd counting maps and optical flow in crowd behavior detection

Crowd behavior is recognized using our proposed swin transformer model that takes crowd counting maps and optical flow maps as input along with the original input frames. Given our classification task's focus on differentiating crowd behavior based on size and violence level, the integration of crowd counting maps plays a pivotal role in enhancing the precision of crowd behavior detection by effectively discriminating between large-scale and small-scale events. On the other hand, optical flow maps help analyze the temporal patterns of significant motions of objects in a sample video.

Table 4 Average Accuracy (%) and mAP (%) of the proposed approach

Dataset            Sample Accuracy   Instance Accuracy   mAP
Original Dataset   88.89             88.48               94.46
Static Dataset     92.24             91.17               95.15
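The two scores reported in Table 4 differ only in how predictions are aggregated: sample accuracy counts each 20-frame sample independently, while instance accuracy takes a majority vote over all samples of an instance, as sketched below.

from collections import Counter

def instance_accuracy(sample_preds, sample_instance_ids, instance_labels):
    # Majority-vote instance accuracy as defined in Section 4
    votes = {}
    for pred, inst in zip(sample_preds, sample_instance_ids):
        votes.setdefault(inst, []).append(pred)
    correct = sum(Counter(v).most_common(1)[0][0] == instance_labels[inst]
                  for inst, v in votes.items())
    return correct / len(instance_labels)

# Toy example: two instances with three samples each -> 1.0
acc = instance_accuracy(["F", "F", "N", "LPG", "LPG", "LVG"],
                        [0, 0, 0, 1, 1, 1],
                        {0: "F", 1: "LPG"})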
Fig. 9 Confusion Matrix portraying the per class accuracy in the Static and Original datasets
We performed experiments to examine the effect of crowd-counting maps and optical flow on the Original and Static datasets. The analysis was done by estimating the mAP, sample accuracy, and instance accuracy while varying the input patterns to the swin transformer in the following four ways: (1) Swin Only: frames from the video sample were given as input; (2) Swin+OptFlow: input frames were concatenated with optical flow maps; (3) Swin+CCmaps: input frames were concatenated with crowd-counting maps; (4) Swin+CCmaps+OptFlow: our proposed approach, where crowd-counting maps and optical flow maps were concatenated with the original input frames. The results are portrayed in Figs. 10 and 11, which display the comparison of average accuracy (%) and mAP (%) on the Original and Static datasets. It is clear from the figures that the combination of crowd-counting maps and optical flow patterns has a considerable impact on behavior detection. Figure 12 shows sample frames representing the four classes that were correctly classified by the proposed approach. Furthermore, Fig. 13 presents three scenarios aimed at illustrating the significance of crowd counting maps and optical flow maps in the context of distinguishing crowd behavior with respect to size and violence levels. These figures exemplify instances where our approach outperformed alternative methods in accurately categorizing crowd behavior.
Fig. 10 Impact of crowd counting maps and optical flow in Original Dataset
Fig. 11 Impact of crowd counting maps and optical flow in Static Dataset
Fig. 12 Sample frames from the four classes that were correctly classified by the proposed approach
Fig. 13 Three scenarios to show the importance of crowd counting maps and optical flow in the proposed approach. Crowd counting maps help in the differentiation of LPG and N or LVG and F. On the other hand, optical flow maps assist in identifying temporal patterns within a video, aiding in the discrimination between categories such as LPG and LVG, or F and N. (a) A fight scene in a largely empty area is misclassified as N without optical flow. (b) A large crowd with a violent scene at the end is classified as LPG when optical flow is not considered. (c) Crowd counting maps help to correctly distinguish LPG from N
Table 5 Comparison of Accuracy (%) in the Original and Static Datasets

Methods                  Original Dataset        Static Dataset
                         Sample     Instance     Sample     Instance
AdaFocusV2 [65]          82.54      81.65        83.77      83.3
Swin Transformer [35]    86.03      83.9         88.19      87.15
R(2+1)D [66]             86.84      83.23        89.34      85.43
ResNet3D [66]            87.37      83.74        88.37      82.74
CoViFocus [67]           87.59      86.69        89.57      88.35
Proposed Approach        88.89      88.48        92.24      91.17

Table 6 Comparison of Accuracy (%) in the Hockey Fight Dataset

Methods                                Accuracy (%)
Violent Flow Descriptor (ViF) [50]     82.9
ViF + Oriented ViF [70]                87.5
Jebur et al. [71]                      92.89
I3D-ConvNet [48]                       93.4
Three streams + LSTM [72]              93.9
MoSIFT + KDE [73]                      94.3
Su et al. [74]                         96.8
Convolutional LSTM [75]                97.1
Obregón et al. [76]                    97.4
CNN + LSTM [77]                        98
Proposed Approach                      98.5

4.2 Comparison with state-of-the-art approaches

We compared the proposed approach with state-of-the-art methods on our Original and Static datasets (Table 5) and on four benchmark datasets: the Hockey Fight, Surveillance Camera Fight, RWF-2000, and Violent Flows datasets (Tables 6, 7, 8, and 9).
Table 7 Comparison of Accuracy (%) in the Surveillance Camera Fight Dataset

Methods                                    Accuracy (%)
VGG16 + Bi-LSTM [53]                       52
Xception CNN + LSTM [53]                   55
VGG16 + LSTM [53]                          61.67
Xception CNN + Bi-LSTM [53]                63
Xception CNN + Bi-LSTM + attention [53]    68
Aktı et al. [53]                           72
Ullah et al. [78]                          75.9
Proposed Approach                          81.51

Table 8 Comparison of Accuracy (%) in the RWF-2000 Dataset

Methods                    Accuracy (%)
ConvLSTM [75]              77.00
3DCNN (TwoStream) [48]     81.50
Vijeikis et al. [79]       82
C3D [80]                   82.75
3DCNN [48]                 85.75
Chaturvedi et al. [81]     85.97
Cheng et al. [69]          87.25
Zhou et al. [82]           89.45
Pan et al. [83]            89.7
Proposed Approach          90.75

Table 9 Comparison of Accuracy (%) in the Violent Flows Dataset

Methods                                      Accuracy (%)
Violent Flow Descriptor (ViF) [50]           81.3
Local Binary Tracklets (LBT) [84]            81.90
Histogram of Oriented Tracklets (HOT) [85]   82.30
Xu et al. [73]                               89.05
ViF + Deep Neural Network [86]               90.17
3DCNN + SVM [14]                             90.6
Varghese et al. [25]                         92.9
Zhang et al. [87]                            93.19
Hachiuma et al. [88]                         94.7
3DCNN [89]                                   98
Zhenhua et al. [90]                          98
Ullah et al. [91]                            98.21
Proposed Approach                            98.5

The Violent Flows Dataset and the Surveillance Camera Fight Dataset contain scenes depicting both violent and non-violent scenarios, similar to our LPG, LVG, and N classes. Figure 14 illustrates sample frames from these datasets, and Tables 6, 7, 8, and 9 present the quantitative outcomes in terms of accuracy. The results substantiate the efficiency of our proposed approach in discerning patterns associated with violent and fight scenarios.

The evaluated benchmark datasets consist of classes representing fight/no-fight or violence/non-violence scenarios. The Hockey Fight Dataset predominantly features fight sequences characterized by clearer visuals and a smaller number of individuals. In contrast, the Violent Flows Dataset presents more distinct patterns of both violent and non-violent behavior, thereby facilitating the model's learning and generalization of patterns. Conversely, the Surveillance Camera Fight Dataset and the RWF-2000 Dataset present more diverse and challenging scenarios. These datasets comprise variations in lighting conditions, camera angles, and crowd dynamics, which posed challenges for our model's performance. Despite these complexities, our approach consistently demonstrated superior performance compared to existing methods.

4.3 DeepStream for real-time analysis

A DL model for smart surveillance is considered efficient when it exhibits real-time inference capabilities that align with the demands of a surveillance environment. As a result, we perform model validation within a real surveillance ecosystem utilizing the DeepStream SDK [9]. This SDK serves as a powerful tool for deploying real-time video classification deep learning models. The deployment process involves the provision of a video source to DeepStream, which can be either an MPEG-4 (MP4) video stored locally or a video stream originating from a camera via the Real-Time Streaming Protocol (RTSP) [92].

Table 10 Inference time of the proposed model using the DeepStream SDK

Video Source   Inference Time (in seconds)
MP4 Video      0.3
RTSP Stream    5

DeepStream mandates that the DL model be in the Open Neural Network Exchange (ONNX) [93] format. To achieve this, we employed the "onnx" module provided by PyTorch to convert the pre-trained proposed model into the ONNX format. Subsequently, the ONNX file representing the proposed DL model is specified within DeepStream's configuration file, facilitating the generation of an inference engine file. This inference engine file is crucial for subsequent executions of the DeepStream SDK, enabling real-time video classification.
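The PyTorch-to-ONNX conversion mentioned above follows the usual torch.onnx route; the sketch below uses a toy stand-in model, the input shape mirrors the 20 × 3 × 224 × 224 samples fed to the network, and the tensor names and opset version are illustrative choices rather than values reported in the paper.

import torch

model = torch.nn.Sequential(torch.nn.Flatten(),
                            torch.nn.Linear(20 * 3 * 224 * 224, 4)).eval()   # stand-in for the trained network
dummy = torch.rand(1, 20, 3, 224, 224)                                       # one 20-frame sample

torch.onnx.export(model, dummy, "crowd_behavior.onnx",
                  input_names=["frames"], output_names=["behavior_logits"],
                  opset_version=17)
# The resulting .onnx file is then referenced from the DeepStream configuration,
# which builds the inference engine from it on first run.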
As part of our configuration process, we defined the following critical DeepStream parameters: (1) T, which denotes the number of video frames processed by the DL model during each inference cycle. For optimal performance, T has been configured to 20, aligning with the number of frames constituting a single sample. (2) H and W, which represent the dimensions of each video frame, specified as H × W. In our setup, both H and W have been set to 224; this dimension is mandated by the backbone swin transformer model, which requires input frames of size 224×224.

Once the ONNX file and the path to the MP4 video or the RTSP stream link are supplied to DeepStream, the framework initiates video playback while concurrently applying the proposed DL model to the frames. The visualization process varies depending on the video source. (a) Local MP4 video: when processing a local MP4 video, DeepStream displays the video at its native frame rate and simultaneously overlays the inference results for the most recent 20 frames in the top-left corner of the video display. This dynamic display simulates real-time inference, providing users with up-to-date classification information as the video plays. (b) RTSP stream from a camera: in the case of an RTSP stream sourced from a camera, DeepStream generates the class inference result in the top-left corner based on the last 20 frames received from the stream. This ensures that the displayed inference information reflects the most recent data processed from the live camera feed.

To execute the DeepStream SDK, we utilized an NVIDIA GeForce RTX 3080 GPU. The SDK is configured to capture either the local MP4 video or the most recent 20 frames from the RTSP stream. These captured frames are then processed by the proposed pre-trained model residing on the GPU. Subsequently, DeepStream incorporates the model's output label, which can signify "Natural (N)," "Large Peaceful Gathering (LPG)," "Large Violent Gathering (LVG)," or "Fighting (F)," into the incoming video feed, rendering it visible on screen for real-time monitoring and analysis.

The proposed approach was tested for real-time inference using DeepStream, which accepted videos stored locally in MP4 format and video streams from a camera via RTSP. The locally stored video yielded inference results in 0.3 seconds, whereas the RTSP stream exhibited a delay of 5 seconds in displaying the behavior inference, as reported in Table 10.

For visual reference, Fig. 15 illustrates sample frames from each class as displayed within the DeepStream environment, where the input is given in MP4 format. This visualization offers insight into how DeepStream seamlessly integrates real-time video processing and DL inference, and highlights the effectiveness and value of our proposed approach for crowd behavior recognition.

Fig. 15 Sample frames displayed using DeepStream. The detected behaviors are shown in the top-left corner of each frame
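To make the windowed-inference behavior concrete outside of DeepStream, the sketch below buffers the most recent T = 20 frames from an MP4 file or RTSP URL, resizes them to 224×224, classifies the window with the exported ONNX model via ONNX Runtime, and overlays the predicted label in the top-left corner. The file names, label order, and input layout are illustrative assumptions, and preprocessing such as BGR-to-RGB conversion and normalization is omitted for brevity; the actual deployment relies on the DeepStream pipeline rather than this loop.

```python
# Minimal sketch of sliding-window inference: keep the last T=20 frames,
# classify the window, and overlay the label on the displayed frame.
# The ONNX file name, label order, and input layout are assumptions.
from collections import deque

import cv2
import numpy as np
import onnxruntime as ort

LABELS = ["N", "LPG", "LVG", "F"]            # assumed class order
T, H, W = 20, 224, 224

session = ort.InferenceSession("crowd_behavior.onnx")
cap = cv2.VideoCapture("input.mp4")          # or an rtsp:// stream URL
window = deque(maxlen=T)                     # always holds the most recent 20 frames
label = ""

while True:
    ok, frame = cap.read()
    if not ok:
        break
    window.append(cv2.resize(frame, (W, H)))
    if len(window) == T:
        clip = np.stack(window).astype(np.float32) / 255.0   # (T, H, W, 3)
        clip = clip.transpose(3, 0, 1, 2)[None]              # (1, 3, T, H, W)
        scores = session.run(None, {"frames": clip})[0]
        label = LABELS[int(scores.argmax())]
    cv2.putText(frame, label, (10, 30), cv2.FONT_HERSHEY_SIMPLEX, 1.0, (0, 255, 0), 2)
    cv2.imshow("crowd behavior", frame)
    if cv2.waitKey(1) == 27:                 # press Esc to quit
        break

cap.release()
cv2.destroyAllWindows()
```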
5 Conclusions

In a public surveillance system, proactive real-time analysis of crowds can be challenging due to the difficulties authorities face in promptly assessing crowd scale and potential violence levels. Furthermore, the current practice of conducting crowd behavior recognition by exclusively analyzing CCTV footage while neglecting the incorporation of online social media video content results in a predominantly reactive methodology. This underscores the need for datasets and models specifically designed to facilitate the analysis of both CCTV footage and online videos, with the capability to detect and classify crowd behavior along two essential dimensions: violence and crowd size. In this paper, we introduced a large dataset comprising 68 hours of data, including both stationary CCTV feeds and online social media content. We developed a subset of this extensive dataset, which includes only CCTV footage, to serve as a foundation for developing dedicated models suitable for CCTV video data analysis. A DL model based on the swin transformer architecture was trained to capture crowd behaviors consisting of regular events, large peaceful gatherings, large violent gatherings, and small-scale fighting. In addition, we enhanced the model's understanding of the dataset's dynamics and violence patterns by incorporating crowd-counting maps and optical flow maps as auxiliary data sources. The experimental analysis proved the efficacy of the proposed model in effectively detecting crowd behavior, taking into account both crowd size and the degree of violence, across data derived from both CCTV and online video sources. The proposed model was also tested on benchmark datasets, which further demonstrated its proficiency in distinguishing fight and violence patterns within video data. Finally, the real-time performance analysis of the proposed model, trained on our dataset and deployed through the DeepStream SDK, provides compelling evidence of the model's efficiency in real-time surveillance environments. In the future, we intend to develop multi-attention spatiotemporal DL models capable of detecting and predicting fine-grained crowd behavior within a single scenario.

Acknowledgements This publication was made possible by AICC03-0324-200005 from the Qatar National Research Fund (a member of the Qatar Foundation). The findings herein reflect the work and are solely the responsibility of the authors.

Author Contributions Conceptualization: [Marwa Qaraqe, Yin David Yang, Emrah Basaran]; Methodology: [Elizabeth B Varghese, Emrah Basaran, Almiqdad Elzein]; Formal analysis and investigation: [Elizabeth B Varghese, Emrah Basaran, Almiqdad Elzein]; Writing - original draft preparation: [Elizabeth B Varghese, Almiqdad Elzein]; Writing - review and editing: [Marwa Qaraqe, Yin David Yang]; Funding acquisition: [Marwa Qaraqe, Yin David Yang]; Resources: [Marwa Qaraqe]; Supervision: [Marwa Qaraqe, Yin David Yang].

Funding Open Access funding provided by the Qatar National Library. This publication was made possible by AICC03-0324-200005 from the Qatar National Research Fund (a member of Qatar Foundation). The findings herein reflect the work and are solely the responsibility of the authors.

Availability of data and materials The datasets generated during the current study are available from the corresponding author upon reasonable request.

Code availability The code of the current study is available from the corresponding author upon reasonable request.

Declarations

Competing interests The authors declare they have no competing interests to report regarding the present study.

Ethics approval No ethics approval was required for the study.

Consent for publication All authors have approved the manuscript and agree with its publication.

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

References

1. Regazzoni CS, Cavallaro A, Wu Y, Konrad J, Hampapur A (2010) Video analytics for surveillance: Theory and practice [from the guest editors]. IEEE Signal Process Mag 27(5):16–17
2. Varghese EB, Thampi SM (2023) A comprehensive review of crowd behavior and social group analysis techniques in smart surveillance. Intell Image Video Anal 57–84
3. Sabokrou M, Fayyaz M, Fathy M, Klette R (2017) Deep-cascade: Cascading 3d deep neural networks for fast anomaly detection and localization in crowded scenes. IEEE Trans Image Process 26(4):1992–2004
4. Zhou S, Shen W, Zeng D, Fang M, Wei Y, Zhang Z (2016) Spatial-temporal convolutional neural networks for anomaly detection and localization in crowded scenes. Signal Process: Image Commun 47:358–368
5. Sumon SA, Shahria MT, Goni MR, Hasan N, Almarufuzzaman A, Rahman RM (2019) Violent crowd flow detection using deep learning. In: Intelligent information and database systems: 11th Asian Conference, ACIIDS 2019, Yogyakarta, Indonesia, April 8–11, 2019, Proceedings, Part I 11, Springer, pp 613–625
6. Ravanbakhsh M, Nabi M, Sangineto E, Marcenaro L, Regazzoni C, Sebe N (2017) Abnormal event detection in videos using generative adversarial nets. In: 2017 IEEE International conference on image processing (ICIP), IEEE, pp 1577–1581
7. Marsden M, McGuinness K, Little S, O'Connor NE (2017) Resnetcrowd: A residual deep learning architecture for crowd counting, violent behaviour detection and crowd density level classification. In: 2017 14th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), IEEE, pp 1–7
8. Glenesk J, Strang L, Disley E (2018) How Can Crowd Behaviour Modelling Be Used to Prevent and Respond to Violence and Antisocial Behaviour at Qatar 2022? RAND Corporation, Cambridge, UK
9. DeepStream SDK, NVIDIA Developer. https://developer.nvidia.com/deepstream-sdk. Last accessed 5 July 2023
10. Zhiqiang W, Jun L (2017) A review of object detection based on convolutional neural network. In: 2017 36th Chinese Control Conference (CCC), IEEE, pp 11104–11109
11. Chu Q, Ouyang W, Li H, Wang X, Liu B, Yu N (2017) Online multi-object tracking using cnn-based single object tracker with spatial-temporal attention mechanism. In: Proceedings of the IEEE international conference on computer vision, pp 4836–4845
12. Yao G, Lei T, Zhong J (2019) A review of convolutional-neural-network-based action recognition. Pattern Recognit Lett 118:14–22
13. Dupont C, Tobias L, Luvison B (2017) Crowd-11: A dataset for fine grained crowd behaviour analysis. In: Proceedings of the IEEE conference on computer vision and pattern recognition workshops, pp 9–16
14. Varghese EB, Thampi SM (2018) A deep learning approach to predict crowd behavior based on emotion. In: Smart Multimedia: First International Conference, ICSM 2018, Toulon, France, August 24–26, 2018, Revised Selected Papers 1, Springer, pp 296–307
15. He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 770–778
16. Ng JY-H, Choi J, Neumann J, Davis LS (2018) Actionflownet: Learning motion representation for action recognition. In: 2018 IEEE Winter conference on applications of computer vision (WACV), IEEE, pp 1616–1624
17. Zhang J, Wang X, Wan Y, Wang L, Wang J, Philip SY (2023) Sor-tc: Self-attentive octave resnet with temporal consistency for compressed video action recognition. Neurocomputing 533:191–205
18. Kakamu Y, Hotta K (2022) Predicting human behavior using 3d loop resnet. In: 2022 26th International conference on pattern recognition (ICPR), IEEE, pp 3259–3264
19. Alafif T, Hadi A, Allahyani M, Alzahrani B, Alhothali A, Alotaibi R, Barnawi A (2023) Hybrid classifiers for spatio-temporal abnormal behavior detection, tracking, and recognition in massive hajj crowds. Electronics 12(5):1165
20. Mandal B, Fajtl J, Argyriou V, Monekosso D, Remagnino P (2018) Deep residual network with subclass discriminant analysis for crowd behavior recognition. In: 2018 25th IEEE International Conference on Image Processing (ICIP), IEEE, pp 938–942
21. Chen Y (2020) Crowd behaviour recognition using enhanced butterfly optimization algorithm based recurrent neural network. Multimed Res 3(3):1–20
22. Ebrahimi Kahou S, Michalski V, Konda K, Memisevic R, Pal C (2015) Recurrent neural networks for emotion recognition in video. In: Proceedings of the 2015 ACM on international conference on multimodal interaction, pp 467–474
23. Sharma V, Gupta M, Kumar A, Mishra D (2021) Video processing using deep learning techniques: A systematic literature review. IEEE Access 9:139489–139507
24. Ibrahim MS, Muralidharan S, Deng Z, Vahdat A, Mori G (2016) A hierarchical deep temporal model for group activity recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 1971–1980
25. Varghese EB, Thampi SM, Berretti S (2020) A psychologically inspired fuzzy cognitive deep learning framework to predict crowd behavior. IEEE Trans Affective Comput 13(2):1005–1022
26. Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser Ł, Polosukhin I (2017) Attention is all you need. Adv Neural Inf Process Syst 30
27. Lepikhin D, Lee H, Xu Y, Chen D, Firat O, Huang Y, Krikun M, Shazeer N, Chen Z (2020) Gshard: Scaling giant models with conditional computation and automatic sharding. In: International conference on learning representations
28. Goodfellow I, Bengio Y, Courville A (2016) Deep Learning. MIT Press, London
29. Dosovitskiy A, Beyer L, Kolesnikov A, Weissenborn D, Zhai X, Unterthiner T, Dehghani M, Minderer M, Heigold G, Gelly S, Uszkoreit J, Houlsby N (2021) An image is worth 16x16 words: Transformers for image recognition at scale. In: International conference on learning representations
30. Zhu X, Su W, Lu L, Li B, Wang X, Dai J (2021) Deformable detr: Deformable transformers for end-to-end object detection. In: International conference on learning representations
31. Liu Z, Lin Y, Cao Y, Hu H, Wei Y, Zhang Z, Lin S, Guo B (2021) Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF international conference on computer vision, pp 10012–10022
32. Arnab A, Dehghani M, Heigold G, Sun C, Lučić M, Schmid C (2021) Vivit: A video vision transformer. In: Proceedings of the IEEE/CVF international conference on computer vision, pp 6836–6846
33. Li Z, Gavrilyuk K, Gavves E, Jain M, Snoek CG (2018) Videolstm convolves, attends and flows for action recognition. Comput Vision Image Underst 166:41–50
34. Zhang L, Zhu G, Mei L, Shen P, Shah SAA, Bennamoun M (2018) Attention in convolutional lstm for gesture recognition. Adv Neural Inf Process Syst 31
35. Liu Z, Ning J, Cao Y, Wei Y, Zhang Z, Lin S, Hu H (2022) Video swin transformer. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 3202–3211
36. Wang Y, Neves L, Metze F (2016) Audio-based multimedia event detection using deep recurrent neural networks. In: 2016 IEEE International conference on acoustics, speech and signal processing (ICASSP), IEEE, pp 2742–2746
37. Lan Z (2017) Towards usable multimedia event detection. PhD thesis, Carnegie Mellon University
38. Ye G, Li Y, Xu H, Liu D, Chang S-F (2015) Eventnet: A large scale structured concept library for complex event detection in video. In: Proceedings of the 23rd ACM international conference on multimedia, pp 471–480
39. Amrutha C, Jyotsna C, Amudha J (2020) Deep learning approach for suspicious activity detection from surveillance video. In: 2020 2nd International conference on innovative mechanisms for industry applications (ICIMIA), IEEE, pp 335–339
40. Khan SW, Hafeez Q, Khalid MI, Alroobaea R, Hussain S, Iqbal J, Almotiri J, Ullah SS (2022) Anomaly detection in traffic surveillance videos using deep learning. Sensors 22(17):6563
41. Aboah A (2021) A vision-based system for traffic anomaly detection using deep learning and decision trees. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 4207–4212
42. Baqui M, Löhner R (2017) Real-time crowd safety and comfort management from cctv images. In: Real-Time image and video processing 2017, SPIE, vol. 10223, pp 10–23
43. Yu Q, Hu L, Alzahrani B, Baranawi A, Alhindi A, Chen M (2021) Intelligent visual-iot-enabled real-time 3d visualization for autonomous crowd management. IEEE Wireless Commun 28(4):34–41
44. Movie Actions Dataset. https://www.di.ens.fr/~laptev/actions/. Last accessed 16 July 2023
45. U.C. for Research in Computer Vision. https://www.crcv.ucf.edu/data/UCF50.php. Last accessed 16 July 2023
46. Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11)
47. Kay W, Carreira J, Simonyan K, Zhang B, Hillier C, Vijayanarasimhan S, Viola F, Green T, Back T, Natsev P, et al (2017) The kinetics human action video dataset. arXiv:1705.06950
48. Carreira J, Noland E, Banki-Horvath A, Hillier C, Zisserman A (2018) A short note about kinetics-600. arXiv:1808.01340
49. Carreira J, Noland E, Hillier C, Zisserman A (2019) A short note on the kinetics-700 human action dataset. arXiv:1907.06987
50. Hassner T, Itcher Y, Kliper-Gross O (2012) Violent flows: Real-time detection of violent crowd behavior. In: 2012 IEEE Computer society conference on computer vision and pattern recognition workshops, IEEE, pp 1–6
51. Sultani W, Chen C, Shah M (2018) Real-world anomaly detection in surveillance videos. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 6479–6488
52. Perez M, Kot AC, Rocha A (2019) Detection of real-world fights in surveillance videos. In: ICASSP 2019-2019 IEEE International conference on acoustics, speech and signal processing (ICASSP), IEEE, pp 2662–2666
53. Aktı Ş, Tataroğlu GA, Ekenel HK (2019) Vision-based fight detection from surveillance cameras. In: 2019 Ninth international conference on image processing theory, tools and applications (IPTA), IEEE, pp 1–6
54. Detection of unusual crowd activity. http://mha.cs.umn.edu/proj_events.shtml#crowd. Last accessed 16 July 2023
55. Mehran R, Oyama A, Shah M (2009) Abnormal crowd behavior detection using social force model. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, IEEE, pp 935–942
56. Political protest movements: Data. https://guides.library.yale.edu/c.php?g=956915&p=6961578. Last accessed 16 July 2023
57. Political protest movements: MetaData. https://dataverse.harvard.edu/dataverse/MMdata. Last accessed 16 July 2023
58. Chromiak M (2021) Exploring recent advancements of transformer based architectures in computer vision. Selected Topics Appl Comput Sci 59–75
59. Wan J, Liu Z, Chan AB (2021) A generalized loss function for crowd counting and localization. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 1974–1983
60. Leibe B, Seemann E, Schiele B (2005) Pedestrian detection in crowded scenes. In: 2005 IEEE Computer society conference on computer vision and pattern recognition (CVPR'05), IEEE, vol. 1:878–885
61. Chan AB, Liang Z-SJ, Vasconcelos N (2008) Privacy preserving crowd monitoring: Counting people without people models or tracking. In: 2008 IEEE Conference on computer vision and pattern recognition, IEEE, pp 1–7
62. Simonyan K, Zisserman A (2015) Very deep convolutional networks for large-scale image recognition. In: 3rd International conference on learning representations (ICLR 2015). Computational and Biological Learning Society
63. Horn B, Schunck B (1981) Determining optical flow. Artif Intell 17:185–203
64. Teed Z, Deng J (2020) Raft: Recurrent all-pairs field transforms for optical flow. In: Vedaldi A, Bischof H, Brox T, Frahm J-M (eds) Computer Vision - ECCV 2020. Springer, Cham, pp 402–419
65. Wang Y, Yue Y, Lin Y, Jiang H, Lai Z, Kulikov V, Orlov N, Shi H, Huang G (2022) Adafocus v2: End-to-end training of spatial dynamic networks for video recognition. In: 2022 IEEE/CVF Conference on computer vision and pattern recognition (CVPR), IEEE, pp 20030–20040
66. Tran D, Wang H, Torresani L, Ray J, LeCun Y, Paluri M (2018) A closer look at spatiotemporal convolutions for action recognition. In: Proceedings of the IEEE Conference on computer vision and pattern recognition, pp 6450–6459
67. Zheng Z, Yang L, Wang Y, Zhang M, He L, Huang G, Li F (2024) Dynamic spatial focus for efficient compressed video action recognition. IEEE Trans Circuits Syst Video Technol 34(2):695–708
68. Nievas EB, Suarez OD, Garcia GB, Sukthankar R (2011) Hockey fight detection dataset. In: Computer Analysis of Images and Patterns, Springer, pp 332–339. http://visilab.etsii.uclm.es/personas/oscar/FightDetection/
69. Cheng M, Cai K, Li M (2021) Rwf-2000: an open large scale video database for violence detection. In: 2020 25th International conference on pattern recognition (ICPR), IEEE, pp 4183–4190
70. Gao Y, Liu H, Sun X, Wang C, Liu Y (2016) Violence detection using oriented violent flows. Image Vision Comput 48:37–41
71. Jebur SA, Hussein KA, Hoomod HK, Alzubaidi L (2023) Novel deep feature fusion framework for multi-scenario violence detection. Computers 12(9):175
72. Dong Z, Qin J, Wang Y (2016) Multi-stream deep networks for person to person violence detection in videos. In: Pattern Recognition: 7th Chinese Conference, CCPR 2016, Chengdu, China, November 5-7, 2016, Proceedings, Part I 7, Springer, pp 517–531
73. Xu L, Gong C, Yang J, Wu Q, Yao L (2014) Violent video detection based on mosift feature and sparse coding. In: 2014 IEEE International conference on acoustics, speech and signal processing (ICASSP), IEEE, pp 3538–3542
74. Su Y, Lin G, Zhu J, Wu Q (2020) Human interaction learning on 3d skeleton point clouds for video violence recognition. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part IV 16, Springer, pp 74–90
75. Sudhakaran S, Lanz O (2017) Learning to detect violent videos using convolutional long short-term memory. In: 2017 14th IEEE International conference on advanced video and signal based surveillance (AVSS), IEEE, pp 1–6
76. Freire-Obregón D, Barra P, Castrillón-Santana M, Marsico MD (2022) Inflated 3d convnet context analysis for violence detection. Mach Vision Appl 33:1–13
77. Abdali A-MR, Al-Tuma RF (2019) Robust real-time violence detection in video using cnn and lstm. In: 2019 2nd Scientific Conference of Computer Sciences (SCCS), IEEE, pp 104–108
78. Ullah FUM, Muhammad K, Haq IU, Khan N, Heidari AA, Baik SW, de Albuquerque VHC (2021) Ai-assisted edge vision for violence detection in iot-based industrial surveillance networks. IEEE Trans Indust Inf 18(8):5359–5370
79. Vijeikis R, Raudonis V, Dervinis G (2022) Efficient violence detection in surveillance. Sensors 22(6):2216
80. Tran D, Bourdev L, Fergus R, Torresani L, Paluri M (2015) Learning spatiotemporal features with 3d convolutional networks. In: Proceedings of the IEEE international conference on computer vision, pp 4489–4497
81. Chaturvedi K, Dhiman C, Vishwakarma DK (2024) Fight detection with spatial and channel wise attention-based convlstm model. Expert Syst 41(1):13474
82. Zhou L (2022) End-to-end video violence detection with transformer. In: 2022 5th International conference on pattern recognition and artificial intelligence (PRAI), IEEE, pp 880–884
83. Pan C, Fei S (2022) Violence detection based on attention mechanism. In: 2022 41st Chinese Control Conference (CCC), IEEE, pp 6438–6443
84. Ravanbakhsh M, Mousavi H, Nabi M, Marcenaro L, Regazzoni C (2018) Fast but not deep: Efficient crowd abnormality detection with local binary tracklets. In: 2018 15th IEEE International conference on advanced video and signal based surveillance (AVSS), IEEE, pp 1–6
85. Mousavi H, Mohammadi S, Perina A, Chellali R, Murino V (2015) Analyzing tracklets for the detection of abnormal crowd behavior. In: 2015 IEEE Winter conference on applications of computer vision, IEEE, pp 148–155
86. Gao M, Jiang J, Ma L, Zhou S, Zou G, Pan J, Liu Z (2019) Violent crowd behavior detection using deep learning and compressive sensing. In: 2019 Chinese control and decision conference (CCDC), IEEE, pp 5329–5333
87. Zhang T, Jia W, He X, Yang J (2016) Discriminative dictionary learning with motion weber local descriptor for violence detection. IEEE Trans Circuits Syst Video Technol 27(3):696–709
88. Hachiuma R, Sato F, Sekii T (2023) Unified keypoint-based action recognition framework via structured keypoint pooling. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 22962–22971
89. Ullah FUM, Ullah A, Muhammad K, Haq IU, Baik SW (2019) Violence detection using spatiotemporal features with 3d convolutional neural network. Sensors 19(11):2472
90. Zhenhua T, Zhenche X, Pengfei W, Chang D, Weichao Z (2023) Ftcf: Full temporal cross fusion network for violence detection in videos. Appl Intell 53(4):4218–4230
91. Ullah FUM, Obaidat MS, Muhammad K, Ullah A, Baik SW, Cuzzolin F, Rodrigues JJ, de Albuquerque VHC (2022) An intelligent system for complex violence pattern analysis and detection. Int J Intell Syst 37(12):10400–10422
92. Rao A, Lanphier R (1996) Real Time Streaming Protocol (RTSP). Internet-Draft draft-rao-rtsp-00, Internet Engineering Task Force. Work in Progress. https://datatracker.ietf.org/doc/draft-rao-rtsp/00/
93. Bai J, Lu F, Zhang K, et al (2019) ONNX: Open Neural Network Exchange. GitHub

Publisher's Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Marwa Qaraqe received the bachelor's degree from Texas A&M University at Qatar, Doha, Qatar, in 2010, and the M.Sc. and Ph.D. degrees from Texas A&M University, College Station, TX, USA, in 2012 and 2016, respectively, all in electrical engineering. She is currently an Associate Professor with the Division of Information and Communication Technology, College of Science and Engineering, Hamad Bin Khalifa University, Doha. She is an advocate of predictive health analytics and personalized learning and is active in research on both proactive management of health disorders and developing personalized learning and intervention strategies for children through the strategic coupling of artificial intelligence and human-based data. Her research interests include wireless communication, signal processing, and machine learning, and their application in multidisciplinary fields, including but not limited to security, the Internet of Things, and health. Her particular interests are in physical layer security, federated learning over wireless networks, and machine learning for wireless communication, security, and health.

Yin David Yang received the B.Eng. degree in computer science from the Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai, China, in 2004, and the Ph.D. degree in computer science from the Department of Computer Science and Engineering, The Hong Kong University of Science and Technology, Hong Kong, in 2009. He is currently an Associate Professor with the College of Science and Engineering, Hamad bin Khalifa University, Doha, Qatar. He has published extensively in top venues on differentially private data publication and analysis and on query authentication in outsourced databases. He is also working actively on cloud-based big data analytics, with a focus on fast-streaming data. His main research interests include cloud computing, database security and privacy, and query optimization.

Elizabeth B Varghese received the Ph.D. degree in computer science from the Indian Institute of Information Technology and Management, Kerala (IIITM-K), Cochin University of Science and Technology, India, in 2022. She is currently a Postdoctoral Researcher with the Division of Information and Computing Technology, College of Science and Engineering, Hamad bin Khalifa University. She was awarded the prestigious Women Scientist Scheme A (WOS-A) Ph.D. Fellowship by the Department of Science and Technology (DST), Government of India. Her research interests include computer vision, deep learning, machine learning, image processing, and human-computer interaction.