
MIST: Multiple Instance Self-Training Framework for Video Anomaly Detection

Jia-Chang Feng 1,3,4, Fa-Ting Hong 1,3, and Wei-Shi Zheng 1,2,3,∗

1 School of Computer Science and Engineering, Sun Yat-Sen University
2 Peng Cheng Laboratory, Shenzhen, China
3 Key Laboratory of Machine Intelligence and Advanced Computing, Ministry of Education, China
4 Pazhou Lab, Guangzhou, China

[email protected], [email protected], [email protected]
∗ Corresponding author

Abstract

Weakly supervised video anomaly detection (WS-VAD) is to distinguish anomalies from normal events based on discriminative representations. Most existing works are limited by insufficient video representations. In this work, we develop a multiple instance self-training framework (MIST) to efficiently refine task-specific discriminative representations with only video-level annotations. In particular, MIST is composed of 1) a multiple instance pseudo label generator, which adapts a sparse continuous sampling strategy to produce more reliable clip-level pseudo labels, and 2) a self-guided attention boosted feature encoder that aims to automatically focus on anomalous regions in frames while extracting task-specific representations. Moreover, we adopt a self-training scheme to optimize both components and finally obtain a task-specific feature encoder. Extensive experiments on two public datasets demonstrate the efficacy of our method, and our method performs comparably to or even better than existing supervised and weakly supervised methods, specifically obtaining a frame-level AUC of 94.83% on ShanghaiTech.

Figure 1: Our proposed MIST first assigns clip-level pseudo labels Ŷ^a = {ŷ_i^a} to anomaly videos with the help of a pseudo label generator G. Then, MIST leverages information from all videos to refine a self-guided attention boosted feature encoder E_SGA. (The diagram shows G turning a video-level labeled abnormal video into a pseudo clip-level labeled abnormal video, which, together with clip-level labeled normal videos, refines the decision boundary of E_SGA in feature space.)

1. Introduction

Video anomaly detection (VAD) aims to temporally or spatially localize anomalous events in videos [33]. As increasingly more surveillance cameras are deployed, VAD is playing an increasingly important role in intelligent surveillance systems to reduce the manual work of live monitoring.

Although VAD has been researched for years, developing a model to detect anomalies in videos remains challenging, as it requires the model to understand the inherent differences between normal and abnormal events, especially anomalous events that are rare and vary substantially. Previous works treat VAD as an unsupervised learning task [29, 14, 7, 15, 13, 5, 32], which encodes the usual pattern with only normal training samples and then detects the distinctive encoded patterns as anomalies. Here, we aim to address the weakly supervised video anomaly detection (WS-VAD) problem [20, 31, 28, 34, 24] because obtaining video-level labels is more realistic and can produce more reliable results than unsupervised methods. More specifically, existing methods in WS-VAD can be categorized into two classes, i.e., encoder-agnostic and encoder-based methods. The encoder-agnostic methods [20, 28, 24] utilize task-agnostic features of videos extracted from a vanilla feature encoder denoted as E (e.g., C3D [21] or I3D [2]) to estimate anomaly scores.
The encoder-based methods [34, 31] train both the feature encoder and the classifier simultaneously. The state-of-the-art encoder-based method is Zhong et al. [31], which formulates WS-VAD as a label noise learning problem and learns from the noisy labels filtered by a label noise cleaner network. However, label noise results from assigning video-level labels to each clip. Even though the cleaner network corrects some of the noisy labels in the time-consuming iterative optimization, the refinement of representations progresses slowly, as these models are mistaught by seriously noisy pseudo labels at the beginning.

We find that the existing methods have not considered training a task-specific feature encoder efficiently, which offers discriminative representations for events under surveillance cameras. To overcome this problem for WS-VAD, we develop a two-stage self-training procedure (Figure 1) that aims to train a task-specific feature encoder with only video-level weak labels. In particular, we propose a Multiple Instance Self-Training framework (MIST) that consists of a multiple instance pseudo label generator and a self-guided attention boosted feature encoder E_SGA. 1) MIL-based pseudo label generator. The MIL framework is well verified in weakly supervised learning. MIL-based methods can generate pseudo labels more accurately than those simply assigning video-level labels to each clip [31]. Moreover, we adopt a sparse continuous sampling strategy that can force the network to pay more attention to the context around the most anomalous part. 2) Self-guided attention boosted feature encoder. Anomalous events in surveillance videos may occur in any place and with any size [11], while in commonly used action recognition videos, the action usually appears with large motion [3, 4]. Therefore, we utilize the proposed self-guided attention module in our proposed feature encoder to emphasize the anomalous regions without any external annotation [11], using only clip-level annotations of normal videos and clip-level pseudo labels of anomalous videos. For our WS-VAD modelling, we introduce a deep MIL ranking loss to effectively train the multiple instance pseudo label generator. In particular, for the deep MIL ranking loss, we adopt a sparse continuous sampling strategy to focus more on the context around the anomalous instance.

To obtain a task-specific feature encoder with a smaller domain gap, we introduce an efficient two-stage self-training scheme to optimize the proposed framework. We use the features extracted from the original feature encoder to produce the corresponding clip-level pseudo labels for anomalous videos by the generator G. Then, we adopt these pseudo labels and their corresponding abnormal videos as well as normal videos to refine our improved feature encoder E_SGA (as demonstrated in Figure 1). Therefore, we can acquire a task-specific feature encoder that provides discriminative representations for surveillance videos.

Extensive experiments based on two different feature encoders, i.e., C3D [21] and I3D [2], show that our framework MIST is able to produce a task-specific feature encoder. We also compare the proposed framework with other encoder-agnostic methods on two large datasets, i.e., UCF-Crime [20] and ShanghaiTech [15]. In addition, we run ablation studies to evaluate our proposed sparse continuous sampling strategy and self-guided attention module. We also illustrate some visualized results to provide a more intuitive understanding of our approach. Our experiments demonstrate the effectiveness and efficiency of MIST.

2. Related Works

Weakly supervised video anomaly detection. VAD aims to detect anomalous events in a given video and has been researched for years [9, 29, 14, 7, 15, 13, 12, 32, 31, 5, 24]. Unsupervised learning methods [9, 29, 7, 30, 15, 13, 32, 5] encode the usual pattern with only normal training samples and then detect the distinctive encoded patterns as anomalies. Weakly supervised learning methods [20, 31, 28, 34, 24] with video-level labels are more applicable to distinguishing abnormal events from normal events. Existing weakly supervised VAD methods can be categorized into two classes, i.e., encoder-agnostic methods and encoder-based methods. 1) Encoder-agnostic methods train only the classifier. Sultani et al. [20] proposed a deep MIL ranking framework to detect anomalies; Zhang et al. [28] further introduced inner-bag score gap regularization; Wan et al. [24] introduced a dynamic MIL loss and center-guided regularization. 2) Encoder-based methods train both a feature encoder and a classifier. Zhu et al. [34] proposed an attention-based MIL model combined with an optical-flow-based autoencoder to encode motion-aware features. Zhong et al. [31] took weakly supervised VAD as a label noise learning task and proposed GCNs to filter label noise for iterative model training, but the iterative optimization was inefficient and progressed slowly. Some works focus on detecting anomalies in an offline manner [23, 25] or a coarse-grained manner [20, 28, 34, 23, 25], which does not meet the real-time monitoring requirements of real-world applications.

Here, our work is also an encoder-based method and works in an online fine-grained manner, but we use the learned pseudo labels to optimize our feature encoder E_SGA rather than using video-level labels as pseudo labels directly. Moreover, we design a two-stage self-training scheme to efficiently optimize our feature encoder and pseudo label generator instead of iterative optimization [31].

Multiple Instance Learning. MIL is a popular method for weakly supervised learning. In video-related tasks, MIL takes a video as a bag and the clips in the video as instances [20, 17, 8]. With a specific feature/score aggregation function, video-level labels can be used to indirectly supervise instance-level learning. The aggregation functions vary, e.g., max pooling [20, 28, 34] and attention pooling [17, 8].

Figure 2: Illustration of our proposed MIST framework. MIST includes a multiple instance pseudo label generator G and a self-guided attention boosted feature encoder E_SGA followed by a weighted-classification head H_c. We first train G and then generate pseudo labels for E_SGA fine-tuning. (The diagram shows Stage I, pseudo label generation, where G scores the clips of normal bags B^n (Y = 0) and abnormal bags B^a (Y = 1) and the scores are refined by a moving mean and min-max normalization, and Stage II, feature encoder fine-tuning, where E_SGA with the self-guided attention module is refined under the losses on Y^n and Ŷ^a.)

In this paper, we adopt a sparse continuous sampling strategy in our multiple instance pseudo label generator to force the network to pay more attention to the context around the most anomalous part.

Self-training. Self-training has been widely investigated in semi-supervised learning [1, 10, 6, 27, 22, 35]. Self-training methods increase labeled data via pseudo label generation on unlabeled data to leverage the information in both labeled and unlabeled data. Recent deep self-training involves representation learning of the feature encoder and classifier refinement, mostly adopted in semi-supervised learning [10] and domain adaptation [36, 35]. In unsupervised VAD, Pang et al. [18] introduced a self-training framework deployed on the testing video directly, assuming the existence of an anomaly in the given video.

Here, we propose a multiple instance self-training framework that assigns clip-level pseudo labels to all clips in abnormal videos via a multiple instance pseudo label generator. Then, we leverage information from all videos to fine-tune a self-guided attention boosted feature encoder.

3. Approach

VAD depends on discriminative representations that clearly represent the events in a scene, while feature encoders pretrained on action recognition datasets are not perfect for surveillance videos because of the existence of a domain gap [11, 3, 4]. To address this problem, we introduce a self-training strategy to refine the proposed improved feature encoder E_SGA. An illustration of our method, shown in Figure 2, is detailed in the following.

3.1. Overview

Given a video V = {v_i}_{i=1}^N with N clips, the annotated video-level label Y ∈ {1, 0} indicates whether an anomalous event exists in this video. We take a video V as a bag and the clips v_i in the video as instances. Specifically, a negative bag (i.e., Y = 0), marked as B^n = {v_i^n}_{i=1}^N, has no anomalous instance, while a positive bag (i.e., Y = 1), denoted as B^a = {v_i^a}_{i=1}^N, has at least one.

In this work, given a pair of bags (i.e., a positive bag B^a and a negative bag B^n), we first pre-extract the features (i.e., {f_i^a}_{i=1}^N and {f_i^n}_{i=1}^N for B^a and B^n, respectively) for each clip in the video V = {v_i}_{i=1}^N using a pretrained vanilla feature encoder, C3D or I3D, forming bags of features B^a and B^n. We then feed the pseudo label generator the extracted features to estimate the anomaly scores of the clips (i.e., {s_i^a}_{i=1}^N and {s_i^n}_{i=1}^N). Then, we produce pseudo labels Ŷ^a = {ŷ_i^a}_{i=1}^N for each anomalous video by performing smoothing and normalization on the estimated scores to supervise the learning of the proposed self-guided attention boosted feature encoder, forming a two-stage self-training scheme [10, 36, 35].

Algorithm 1: Multiple instance self-training framework
Input: Clip-level labeled normal videos V^n = {v_i^n}_{i=1}^N and corresponding clip-level labels Y^n, video-level labeled abnormal videos V^a = {v_i^a}_{i=1}^N, pretrained vanilla feature encoder E.
Output: Self-guided attention boosted feature encoder E_SGA, multiple instance pseudo label generator G, clip-level pseudo labels Ŷ^a for V^a.
Stage I. Pseudo Label Generation.
1: Extract features of V^a and V^n from E as {f_i^a}_{i=1}^N and {f_i^n}_{i=1}^N.
2: Train G with {f_i^a}_{i=1}^N and {f_i^n}_{i=1}^N and their corresponding video-level labels according to Eq. 7.
3: Predict clip-level pseudo labels for each clip of V^a via the trained G as Ŷ^a.
Stage II. Feature Encoder Fine-tuning.
4: Combine E with the self-guided attention module as E_SGA, then fine-tune E_SGA with supervision of Y^n ∪ Ŷ^a.
Figure 3: The workflow of our multiple instance pseudo label generator. Each bag contains L sub-bags, and each sub-bag is composed of T continuous clips. (The diagram shows the sub-bags of an abnormal bag B^a and a normal bag B^n being scored by the MIL network.)

As shown in Figure 2, our proposed feature encoder E_SGA, adapted from the vanilla feature encoder E (e.g., I3D or C3D) by adding our proposed self-guided attention module, can be optimized with the estimated pseudo labels to eliminate the domain gap and produce task-specific representations. Actually, our proposed approach can be viewed as a two-stage method (see Algorithm 1): 1) we first generate clip-level pseudo labels for anomalous videos that have only video-level labels via the pseudo label generator, while the parameters of the pseudo label generator are updated by means of the deep MIL ranking loss; 2) after obtaining the clip-level pseudo labels of anomalous videos, our feature encoder E_SGA can be trained on both normal and anomalous video data. Thus, we form a self-training scheme to optimize both the feature encoder E_SGA and the pseudo label generator G. The illustration shown in Figure 2 provides an overview of our proposed method.

To better distinguish anomalous clips from normal ones, we introduce a self-guided attention module in the feature encoder, i.e., E_SGA, to capture the anomalous regions in videos and help the feature encoder produce more discriminative representations (see Section 3.3). Moreover, we introduce a sparse continuous sampling strategy in the pseudo label generator to enforce the network to pay more attention to the context around the most anomalous part (see Section 3.2). Finally, we introduce the deep MIL ranking loss to optimize the learning of the pseudo label generator, and we use the cross entropy loss to train our proposed feature encoder E_SGA supervised by the pseudo labels of anomalous videos and the clip-level annotations of normal videos.

3.2. Pseudo Label Generation via Multiple Instance Learning

In contrast to [31], which simply assigns video-level labels to each clip and then trains the vanilla feature encoder at the very beginning, we introduce an MLP-based structure as the pseudo label generator trained under the MIL paradigm to generate pseudo labels, which are utilized in the refinement process of our feature encoder E_SGA.

Even though recent MIL-based methods [20, 28] have made considerable progress, slicing a video into fixed segments in a coarse-grained manner regardless of its duration is prone to bury abnormal patterns among the normal frames that usually constitute the majority, even in abnormal videos [24]. However, by sampling with a smaller temporal scale in a fine-grained manner, the network may overemphasize the most intense part of an anomaly but ignore the context around it. In reality, anomalous events often last for a while. With the assumption of a minimum duration of anomalies, the MIL network is forced to pay more attention to the context around the most anomalous part.

Moreover, to adapt to the variation in duration of untrimmed videos and the class imbalance in amount, we introduce a sparse continuous sampling strategy: given the features of each clip extracted by a vanilla feature encoder E from a video, {f_i}_{i=1}^N, we uniformly sample L subsets from these video clips, and each subset contains T consecutive clips, forming L sub-bags B = {f_{l,t}}_{l=1,t=1}^{L,T}, as shown in Figure 3. Remarkably, T, a hyperparameter to be tuned, also acts as the assumed minimum duration of anomalies, as discussed in the previous paragraph. Here, we combine the MIL model with our continuous sampling strategy, as shown in Figure 3. We feed the extracted features into our pseudo label generator to produce the corresponding anomaly scores {s_{l,t}}_{l=1,t=1}^{L,T}. Next, we perform average pooling of the predicted instance-level scores s_{l,t} of each sub-bag to obtain the sub-bag score S_l below, which can be utilized in Eq. 7:

    S_l = \frac{1}{T} \sum_{t=1}^{T} s_{l,t}.    (1)

After training, the trained multiple instance pseudo label generator predicts clip-level scores for all abnormal videos, marked as S^a = {s_i^a}_{i=1}^N. By performing temporal smoothing with a moving average filter of kernel size k to relieve the jitter of the anomaly scores,

    \tilde{s}_i^a = \frac{1}{2k} \sum_{j=i-k}^{i+k} s_j^a,    (2)

and min-max normalization,

    \hat{y}_i^a = \left( \tilde{s}_i^a - \min \tilde{S}^a \right) / \left( \max \tilde{S}^a - \min \tilde{S}^a \right), \quad i \in [1, N],    (3)

we refine the anomaly scores into Ŷ^a = {ŷ_i^a}_{i=1}^N. Specifically, ŷ_i^a is in [0, 1] and acts as a soft pseudo label. Then, the pseudo labeled data {V^a, Ŷ^a} are combined with the clip-level labeled data {V^n, Y^n} as {V, Y} to fine-tune the proposed feature encoder E_SGA.
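As a concrete illustration of the sampling and refinement steps above, here is a minimal NumPy sketch of the sparse continuous sub-bag sampling and of the smoothing and normalization of Eqs. 2–3 that turn the generator's clip scores into soft pseudo labels. L = 32 and T = 3 follow Section 4.2, while k = 2, the boundary handling, and the helper names are assumptions of this sketch rather than the authors' implementation.

```python
import numpy as np

def sample_subbags(num_clips: int, L: int = 32, T: int = 3) -> np.ndarray:
    """Sparse continuous sampling: L uniformly spaced windows of T consecutive clips."""
    starts = np.linspace(0, max(num_clips - T, 0), num=L).astype(int)
    return np.stack([np.arange(s, s + T) for s in starts])   # (L, T) clip indices

def refine_scores(clip_scores: np.ndarray, k: int = 2) -> np.ndarray:
    """Moving-average smoothing (Eq. 2) followed by min-max normalization (Eq. 3).
    Note: Eq. 2 as printed normalizes by 2k; a standard mean over 2k+1 terms is used here."""
    padded = np.pad(clip_scores, k, mode="edge")
    window = 2 * k + 1
    smoothed = np.convolve(padded, np.ones(window) / window, mode="valid")
    lo, hi = smoothed.min(), smoothed.max()
    return (smoothed - lo) / (hi - lo + 1e-8)                # soft pseudo labels in [0, 1]

# Toy usage for one abnormal video with 64 clips.
idx = sample_subbags(64)                                     # (32, 3) sub-bag clip indices
pseudo = refine_scores(np.random.rand(64))                   # (64,) clip-level pseudo labels
print(idx.shape, pseudo.shape, float(pseudo.min()), float(pseudo.max()))
```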
Figure 4: The structure of the self-guided attention boosted feature encoder E_SGA. GAP denotes the global average pooling operation, while Avg denotes the K channel-wise average pooling used to produce the guided anomaly scores in the guided-classification head H_g. A is the attention map. F_1, F_2, F_3 are three encoding units constructed by convolutional layers.

3.3. Self-Guided Attention in Feature Encoder

In contrast to the vanilla feature encoder E, which provides only task-agnostic representations for the downstream task, we propose a self-guided attention boosted feature encoder E_SGA adapted from E, which optimizes attention map generation via pseudo label supervision to enhance the learning of task-specific representations.

As Figure 4 shows, the self-guided attention module (SGA) takes feature maps M_{b-4} and M_{b-5} as input, which are produced by the 4th and 5th blocks of the vanilla feature encoder E, respectively. SGA includes three encoding units, namely F_1, F_2 and F_3, which are all constructed by convolutional layers. M_{b-4} is encoded as M*_{b-4} and then applied to attention map A generation, denoted as

    A = F_1(F_2(M_{b-4})).    (4)

Finally, we obtain M_A via the attention mechanism below:

    M_A = M_{b-5} + A \circ M_{b-5},    (5)

where \circ is element-wise multiplication, and M_A is applied to the final anomaly score prediction via the weighted-classification head H_c, a fully connected layer.

To assist the learning of the attention map, we introduce a guided-classification head H_g that uses the pseudo labels as supervision. In H_g, F_3 transforms M*_{b-4} into M. Specifically, M*_{b-4} and M have 2K channels, acting as K detectors for each class, i.e., normal and abnormal, to enhance the guided supervision [26]. Then, we deploy spatiotemporal average pooling, K channel-wise average pooling on M and a Softmax activation to obtain the guided anomaly scores for each class.

Remarkably, there are two classification heads in E_SGA, i.e., the weighted-classification head H_c and the guided-classification head H_g, which are both supervised by pseudo labels via L_1 and L_2, respectively. That is, we optimize E_SGA with the pseudo labels (see Section 3.2). Therefore, the feature encoder E_SGA can update its parameters on video anomaly datasets and eliminate the domain gap from the pretrained parameters.

3.4. Optimization Process

- Deep MIL Ranking Loss: Considering that the positive bag contains at least one anomalous clip, we assume that the clip from a positive bag with the highest anomaly score is the most likely to be an anomaly [8]. To adapt to our sparse continuous sampling in Section 3.2, we treat a sub-bag as an instance and acquire a reliable relative comparison between the most likely anomalous sub-bag and the most likely normal sub-bag:

    \max_{1 \le l \le L} S_l^n < \max_{1 \le l \le L} S_l^a.    (6)

Specifically, to avoid too many false positive instances in positive bags, we introduce a sparse constraint on positive bags, which instantiates Eq. 6 as a deep MIL ranking loss with sparse regularization:

    \mathcal{L}_{MIL} = \left( \epsilon - \max_{1 \le l \le L} S_l^a + \max_{1 \le l \le L} S_l^n \right)_+ + \frac{\lambda}{L} \sum_{l=1}^{L} S_l^a,    (7)

where (·)_+ means max(0, ·), and the first term in Eq. 7 ensures that \max_{1 \le l \le L} S_l^a is larger than \max_{1 \le l \le L} S_l^n with a margin of ε. ε is a hyperparameter that is equal to 1 in this work. The last term in Eq. 7 is the sparse regularization, indicating that only a few sub-bags may contain the anomaly, while λ is another hyperparameter used to balance the ranking loss with the sparsity regularization.

- Classification Loss: After obtaining the pseudo labels of an abnormal video via Eq. 3, we obtain the training pairs {V^a, Ŷ^a}, which are further combined with {V^n, Y^n} to train our feature encoder E_SGA. For this purpose, we apply the cross entropy loss function to the two classification heads (H_c and H_g) in E_SGA, i.e., L_1 and L_2 in Figure 4.

Finally, we train a task-specific feature encoder E_SGA with the combination of L_1 and L_2. In the inference stage, we use E_SGA to predict clip-level scores for videos via the weighted-classification head H_c.
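To make Stage I concrete, the following is a minimal PyTorch sketch that pairs an MLP clip scorer of the kind used as the pseudo label generator (the 512–32–1 layout follows Section 4.2) with the deep MIL ranking loss of Eq. 7 computed on sub-bag scores. The input feature dimension, the random toy data, and the single optimization step are illustrative assumptions; this is not the authors' released training code.

```python
import torch
import torch.nn as nn

class GeneratorSketch(nn.Module):
    """3-layer MLP that maps a clip feature to an anomaly score in [0, 1]."""
    def __init__(self, in_dim: int = 1024, dropout: float = 0.6):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 512), nn.ReLU(), nn.Dropout(dropout),
            nn.Linear(512, 32), nn.Dropout(dropout),
            nn.Linear(32, 1), nn.Sigmoid())

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (L, T, in_dim) sub-bag features -> (L,) sub-bag scores S_l (Eq. 1)
        clip_scores = self.net(feats).squeeze(-1)      # (L, T) instance scores s_{l,t}
        return clip_scores.mean(dim=1)                 # average pooling per sub-bag

def mil_ranking_loss(s_a: torch.Tensor, s_n: torch.Tensor,
                     eps: float = 1.0, lam: float = 0.01) -> torch.Tensor:
    """Eq. 7: hinge between the max sub-bag scores plus sparsity on the positive bag."""
    hinge = torch.clamp(eps - s_a.max() + s_n.max(), min=0.0)
    return hinge + lam * s_a.mean()                    # mean == (1/L) * sum_l S_l^a

# Toy training step on one abnormal/normal bag pair (L = 32 sub-bags, T = 3 clips).
gen = GeneratorSketch()
opt = torch.optim.Adagrad(gen.parameters(), lr=0.01)
bag_a, bag_n = torch.randn(32, 3, 1024), torch.randn(32, 3, 1024)
loss = mil_ranking_loss(gen(bag_a), gen(bag_n))
loss.backward()
opt.step()
print(float(loss))
```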
Figure 5: Comparisons with the state-of-the-art encoder-based method Zhong et al. [31] on ShanghaiTech. (The bar chart reports frame-level AUC for the MIL stage and fine-tuning iterations 1–3 of Zhong et al. (C3D-RGB), Ours (C3D-RGB) and Ours (I3D-RGB); our models reach 93.13% and 94.83%.)

Table 1: Quantitative comparisons with existing online methods on UCF-Crime under different levels of supervision and fineness of prediction. The results in (·) are tested with 10-crop, while those marked by ∗ are tested without.

| Method | Supervision | Grained | Encoder | AUC (%) | FAR (%) |
| Hasan et al. [7] | Un | Coarse | AE_RGB | 50.6 | 27.2 |
| Lu et al. [14] | Un | Coarse | Dictionary | 65.51 | 3.1 |
| SVM | Weak | Coarse | C3D_RGB | 50 | - |
| Sultani et al. [20] | Weak | Coarse | C3D_RGB | 75.4 | 1.9 |
| Zhang et al. [28] | Weak | Coarse | C3D_RGB | 78.7 | - |
| Zhu et al. [34] | Weak | Coarse | AE_Flow | 79.0 | - |
| Zhong et al. [31] | Weak | Fine | C3D_RGB | 80.67∗ (81.08) | 3.3∗ (2.2) |
| Liu et al. [11] | Full (T) | Fine | C3D_RGB | 70.1 | - |
| Liu et al. [11] | Full (S+T) | Fine | NLN_RGB | 82.0 | - |
| MIST | Weak | Fine | C3D_RGB | 81.40 | 2.19 |
| MIST | Weak | Fine | I3D_RGB | 82.30 | 0.13 |

Table 2: Quantitative comparisons with existing methods on ShanghaiTech. The results with ∗ are re-implemented.

| Method | Feature Encoder | Grained | AUC (%) | FAR (%) |
| Sultani et al. [20] | C3D_RGB | Coarse | 86.30 | 0.15 |
| Zhang et al. [28] | C3D_RGB | Coarse | 82.50 | 0.10 |
| Zhong et al. [31] | C3D_RGB | Fine | 76.44 | - |
| AR-Net [24] | C3D_RGB | Fine | 85.01∗ | 0.57∗ |
| AR-Net [24] | I3D_RGB | Fine | 85.38 | 0.27 |
| AR-Net [24] | I3D_RGB+Flow | Fine | 91.24 | 0.10 |
| MIST | C3D_RGB | Fine | 93.13 | 1.71 |
| MIST | I3D_RGB | Fine | 94.83 | 0.05 |

4. Experiments

4.1. Datasets and Metrics

We conduct experiments on two large datasets, i.e., UCF-Crime [20] and ShanghaiTech [15], with two feature encoders, i.e., C3D [21] and I3D [2].

UCF-Crime is a large-scale dataset of real-world surveillance videos, including 13 types of anomalous events in 1900 long untrimmed videos, where 1610 videos are training videos and the others are test videos. Liu et al. [11] manually annotated bounding boxes of anomalous regions in one image per 16 frames for each abnormal video, and we use their annotation of the test videos only to evaluate our model's capacity to identify anomalous regions.

ShanghaiTech is a dataset of 437 campus surveillance videos. It has 130 abnormal events in 13 scenes, but all abnormal videos are in the test set, as the dataset was proposed for unsupervised learning. To adapt it to the weakly supervised setting, Zhong et al. [31] re-organized the videos into 238 training videos and 199 testing videos.

Evaluation Metrics. Following previous works [13, 11, 20, 24], we compute the area under the curve (AUC) of the frame-level receiver operating characteristic (ROC) as the main metric, where a larger AUC implies higher distinguishing ability. We also follow [20, 24] to evaluate robustness by the false alarm rate (FAR) of anomaly videos.

4.2. Implementation Details

The multiple instance pseudo label generator is a 3-layer MLP, where the numbers of units are 512, 32 and 1, respectively, regularized by dropout with a probability of 0.6 between each layer. ReLU and Sigmoid functions are deployed after the first and last layer, respectively. We adopt the hyperparameters L = 32, T = 3, and λ = 0.01 and train the generator with the Adagrad optimizer with a learning rate of 0.01. While fine-tuning, we adopt the Adam optimizer with a learning rate of 1e-4 and a weight decay of 0.0005 and train for 300 epochs. More details about the implementation are reported in the Supplementary Material.

4.3. Comparisons with Related Methods

In Table 1, we present the AUC and FAR to compare our MIST with related state-of-the-art online methods in terms of accuracy and robustness. We find that MIST outperforms or performs similarly to all other methods on all evaluation metrics in Table 1, which confirms the efficacy of MIST. Specifically, the results of Zhong et al. [31] marked with ∗ are re-tested from the officially released models¹ without deploying 10-crop² for fair comparison, while the results in brackets are reported in [31] using 10-crop augmentation. However, 10-crop augmentation may improve the performance but requires 10 times the computation. Notably, the result of our MIST still slightly overtakes that of Zhong et al. [31] using 10-crop augmentation (81.08% vs. 81.40% in terms of AUC and 2.2% vs. 2.19% for FAR). Moreover, our method outperforms the supervised method of Liu et al. [11], which trains C3D_RGB with external temporal annotations and NLN_RGB with external spatiotemporal annotations. These results verify that our proposed MIST is more effective than previous works.

For the ShanghaiTech results in Table 2, our MIST far outperforms the other RGB-based methods [20, 28, 31, 24], which validates the capacity of MIST. Remarkably, MIST also surpasses the multi-modal AR-Net [24] (I3D_RGB+Flow) on AUC by more than 4%, reaching 94.83%, and gains a much lower FAR of 0.05%.

We detail the comparison with the state-of-the-art encoder-based method [31] on ShanghaiTech in Figure 5. The multiple instance pseudo label generator performs much better than Zhong et al. [31], which indicates the drawback of utilizing video-level labels as clip-level labels. Even though Zhong et al. [31] optimizes for three iterations, it falls far behind our MIST by 16.69% AUC on C3D, which solidly verifies the efficiency and efficacy of MIST. Moreover, our MIST is much faster in the inference stage, as Zhong et al. [31] applies 10-crop augmentation.

¹ https://github.com/jx-zhong-for-academic-purpose/GCN-Anomaly-Detection
² 10-crop is a test-time augmentation of cropping images into the center, the four corners and their mirrored counterparts.
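For reference, a minimal sketch of the two frame-level metrics used throughout this section, the AUC of the ROC and the false alarm rate, is given below. The 0.5 threshold for the FAR and the array-based interface are assumptions of this sketch rather than details specified in the paper.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def frame_level_metrics(scores: np.ndarray, labels: np.ndarray, thr: float = 0.5):
    """scores: per-frame anomaly scores in [0, 1]; labels: per-frame 0/1 ground truth."""
    auc = roc_auc_score(labels, scores)        # frame-level AUC of the ROC curve
    negatives = scores[labels == 0]            # frames annotated as normal
    far = float((negatives >= thr).mean())     # fraction of normal frames raising an alarm
    return auc, far

# Toy usage on synthetic data.
rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=1000)
scores = np.clip(labels * 0.6 + rng.normal(0.2, 0.2, size=1000), 0, 1)
print(frame_level_metrics(scores, labels))
```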
Table 3: Quantitative comparisons between the features from the pretrained vanilla feature encoder and those from MIST on the UCF-Crime and ShanghaiTech datasets by adopting encoder-agnostic methods (frame-level AUC, %).

| Encoder-Agnostic Method | UCF-Crime pretrained | UCF-Crime fine-tuned | ShanghaiTech pretrained | ShanghaiTech fine-tuned |
| Sultani et al. [20] | 78.43 | 81.42 | 86.92 | 92.63 |
| Zhang et al. [28] | 78.11 | 81.58 | 88.87 | 92.50 |
| AR-Net [24] | 78.96 | 82.62 | 85.38 | 92.27 |
| Our MIL generator | 79.37 | 81.55 | 89.15 | 92.24 |

Table 4: Performance comparisons of sparse continuous sampling and uniform sampling for MIL generator training (frame-level AUC, %).

| Dataset | Feature | Uniform | Sparse Continuous | ΔAUC |
| UCF-Crime | C3D_RGB | 74.29 | 75.51 | +1.22 |
| UCF-Crime | I3D_RGB | 78.72 | 79.37 | +0.65 |
| ShanghaiTech | C3D_RGB | 83.68 | 86.61 | +2.93 |
| ShanghaiTech | I3D_RGB | 83.10 | 89.15 | +6.05 |

Figure 6: Feature space visualization of the pretrained vanilla feature encoder I3D and the MIST fine-tuned encoder via t-SNE [16] on UCF-Crime testing videos (panels: Pretrained vs. MIST for Robbery137 and Fighting003). The red dots denote anomalous regions while the blue ones are normal.

4.4. Task-Specific Feature Encoder

To verify that our feature encoder can produce task-specific representations that facilitate other encoder-agnostic methods, we also conduct related experiments with I3D, as presented in Table 3. It is noticeable that all results of the encoder-agnostic methods are boosted after using our MIST fine-tuned features, showing a reduction in the domain gap. For example, AR-Net [24] increases from 78.96% to 82.62% on the UCF-Crime dataset and achieves an improvement of 6.89% (from 85.38% to 92.27%) on the ShanghaiTech dataset. Therefore, our MIST can produce a more powerful task-specific feature encoder that can be utilized in other approaches. We visualize the feature space of the pretrained I3D vanilla feature encoder and the MIST fine-tuned encoder via t-SNE [16] in Figure 6, which also indicates the refinement of the feature representations.

4.5. Ablation Study

First, we introduce another evaluation metric, i.e., the score gap, which is the gap between the average scores of abnormal clips and normal clips. A larger score gap indicates that the network is more capable of distinguishing anomalies from normal events [13]. We conduct ablation studies on UCF-Crime to analyze the impact of the generated pseudo labels (PLs), the self-guided attention module (SGA), and the classifier head H_g in SGA of the proposed feature encoder E_SGA in Table 5. Compared with the baseline and MIST w/o PLs, our MIST achieves a significant improvement when the generated pseudo labels are utilized. In particular, we observe an 8.17% improvement in AUC and an approximately 17% score gap, which shows the efficacy of our multiple instance pseudo label generator with the sparse continuous sampling strategy. Pseudo labels also play an important role: compared with MIST, the performance of MIST w/o PLs drops seriously, even below the baseline, because the low-quality supervision influences the attention map A generation in SGA.

Moreover, SGA enhances the feature encoder in emphasizing the informative regions and distinguishing abnormal events from normal ones. Compared with MIST w/o SGA, MIST increases by 2% in AUC and 5% in the score gap. Specifically, the guided-classification branch in SGA plays an important role in guiding the attention map generation, and there is a drop of more than 2% if such a branch is removed.

Ablation studies are also conducted on the sparse continuous sampling strategy on UCF-Crime and ShanghaiTech with C3D_RGB and I3D_RGB features. As shown in Table 4, when sampling the same number of clips for a bag and selecting the same number of top clips to represent the bag, our sparse continuous sampling strategy pays more attention to the context and does better than uniform sampling. Especially on ShanghaiTech, sparse continuous sampling gains 2.93% and 6.05% on the two kinds of features.
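A minimal sketch of the score gap metric introduced above, the difference between the mean predicted score of abnormal clips and that of normal clips, assuming clip-level scores and binary clip labels are available as arrays:

```python
import numpy as np

def score_gap(clip_scores: np.ndarray, clip_labels: np.ndarray) -> float:
    """Mean score of abnormal clips minus mean score of normal clips."""
    return float(clip_scores[clip_labels == 1].mean()
                 - clip_scores[clip_labels == 0].mean())

# Toy usage.
print(score_gap(np.array([0.9, 0.8, 0.1, 0.2]), np.array([1, 1, 0, 0])))  # ~0.7
```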
Figure 7: Visualization of the testing results on UCF-Crime (better viewed in color; panels: Shooting008, Vandalism028, Normal877, Arrest001, Burglary079). The red blocks in the graphs are the temporal ground truths of anomalous events. The orange circle shows the wrongly labeled ground truth, the blue circle indicates a wrongly predicted clip, and the red circle indicates a correctly predicted clip.

Table 5: Ablation studies on UCF-Crime with I3D_RGB. Baseline is the original I3D trained with video-level labels [31]. MIST is our whole model. MIST w/o PLs is trained without pseudo labels but with video-level labels. MIST w/o Hg is MIST trained without H_g. MIST w/o SGA is trained without the self-guided attention module.

| Method | AUC (%) | Score Gap (%) |
| Baseline | 74.13 | 0.375 |
| MIST w/o PLs | 73.33 | 0.443 |
| MIST w/o Hg | 81.97 | 15.37 |
| MIST w/o SGA | 80.28 | 12.74 |
| MIST | 82.30 | 17.71 |

Figure 8: Visualization results of anomaly activation maps (better viewed in color; rows: original frames, MIST w/o SGA, MIST w/o Hg and MIST, on Vandalism015 and Assault010).

4.6. Visual Results

To further evaluate the performance of our model, we visualize the temporal predictions of the models. As presented in Figure 7, our model exactly localizes the anomalous events and predicts anomaly scores very close to zero on normal videos, showing the effectiveness and robustness of our model. We collect some failed samples in the right part of Figure 7. In addition, our model predicts the highest score at the end of Arrest001, where a man walks across the scene with his arm pointing forward as if brandishing a gun. As the videos in UCF-Crime are low-resolution, it is difficult to judge such a confusing action without any other context information. Furthermore, the bottom-right part of Figure 7 shows another failed case: our model successfully localizes the major part of the anomalous burglary event and raises an alarm when the thieves are rushing out of the house, which should be treated as an anomaly but is wrongly labeled as a normal event in the ground truth.

We also visualize the spatial activation map via Grad-CAM on M_A [19] for spatial explanation. As Figure 8 shows, our model is able to sensitively focus on informative regions that help decide whether the scene is anomalous. This verifies that our self-guided attention module can boost the feature encoder to focus on anomalous regions. Additionally, compared with the activation maps generated from the MIST without the guided-classification head H_g and the MIST without the SGA module, the results of MIST are concentrated on the anomalous regions, which shows the rationality and effectiveness of our self-guided attention module.

4.7. Discussions

The key of our MIST is to design a two-stage self-training strategy to train a task-specific feature encoder for video anomaly detection. Each component of our framework can be replaced by any other advanced module, e.g., replacing C3D with I3D, or a stronger pseudo label generator taking the place of the multiple instance pseudo label generator. Additionally, the scheme of our framework can be adapted to other tasks, such as weakly supervised video action localization and video highlight detection.

5. Conclusions

In this work, we propose a multiple instance self-training framework (MIST) to fine-tune a task-specific feature encoder efficiently. We adopt a sparse continuous sampling strategy in the multiple instance pseudo label generator to produce more reliable pseudo labels. With the estimated pseudo labels, our proposed feature encoder learns to focus on the most probable anomalous regions in frames, facilitated by the proposed self-guided attention module. Finally, after a two-stage self-training process, we train a task-specific feature encoder with discriminative representations that can also boost other existing methods. Remarkably, our MIST makes significant improvements on two public datasets.

Acknowledgement

This work was supported partially by the National Key Research and Development Program of China (2018YFB1004903), NSFC (U1911401, U1811461), Guangdong Province Science and Technology Innovation Leading Talents (2016TX03X157), Guangdong NSF Project (Nos. 2020B1515120085, 2018B030312002), Guangzhou Research Project (201902010037), Research Projects of Zhejiang Lab (No. 2019KD0AB03), and the Key-Area Research and Development Program of Guangzhou (202007030004).
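Supplementary to Section 4.6: the activation maps in Figure 8 come from Grad-CAM on M_A [19]. The snippet below is a generic, minimal Grad-CAM sketch that weights the channels of a chosen feature map by the spatially averaged gradients of the anomaly score. The hook wiring, the stand-in model, and the 2D (rather than spatiotemporal) feature-map layout are assumptions for illustration; it does not reproduce the authors' exact visualization code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def grad_cam(model: nn.Module, target_layer: nn.Module, frames: torch.Tensor) -> torch.Tensor:
    """Return an (H, W) activation map for the anomaly score of one input sample."""
    feats, grads = [], []
    h1 = target_layer.register_forward_hook(lambda m, i, o: feats.append(o))
    h2 = target_layer.register_full_backward_hook(lambda m, gi, go: grads.append(go[0]))
    score = model(frames).squeeze()        # scalar anomaly score for this sample
    model.zero_grad()
    score.backward()
    h1.remove(); h2.remove()
    fmap, grad = feats[0], grads[0]        # (1, C, H, W) feature map and its gradient
    weights = grad.mean(dim=(2, 3), keepdim=True)            # channel importance
    cam = F.relu((weights * fmap).sum(dim=1, keepdim=True))  # weighted combination
    cam = cam / (cam.max() + 1e-8)
    return cam[0, 0].detach()

# Toy usage with a stand-in model whose first conv output plays the role of M_A.
model = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
                      nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 1))
print(grad_cam(model, model[0], torch.randn(1, 3, 64, 64)).shape)  # torch.Size([64, 64])
```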
References

[1] Massih-Reza Amini and Patrick Gallinari. Semi-supervised logistic regression. In ECAI, 2002.
[2] Joao Carreira and Andrew Zisserman. Quo vadis, action recognition? A new model and the Kinetics dataset. In IEEE Conf. Comput. Vis. Pattern Recog., 2017.
[3] Jinwoo Choi, Chen Gao, Joseph C. E. Messou, and Jia-Bin Huang. Why can't I dance in the mall? Learning to mitigate scene bias in action recognition. In Adv. Neural Inform. Process. Syst., 2019.
[4] Jinwoo Choi, Gaurav Sharma, Manmohan Chandraker, and Jia-Bin Huang. Unsupervised and semi-supervised domain adaptation for action recognition from drones. 2020.
[5] Dong Gong, Lingqiao Liu, Vuong Le, Budhaditya Saha, Moussa Reda Mansour, Svetha Venkatesh, and Anton van den Hengel. Memorizing normality to detect anomaly: Memory-augmented deep autoencoder for unsupervised anomaly detection. In Int. Conf. Comput. Vis., 2019.
[6] Yves Grandvalet and Yoshua Bengio. Semi-supervised learning by entropy minimization. In Adv. Neural Inform. Process. Syst., 2005.
[7] Mahmudul Hasan, Jonghyun Choi, Jan Neumann, Amit K. Roy-Chowdhury, and Larry S. Davis. Learning temporal regularity in video sequences. In IEEE Conf. Comput. Vis. Pattern Recog., 2016.
[8] Fa-Ting Hong, Xuanteng Huang, Wei-Hong Li, and Wei-Shi Zheng. Mini-Net: Multiple instance ranking network for video highlight detection. arXiv preprint arXiv:2007.09833, 2020.
[9] Timothy Hospedales, Shaogang Gong, and Tao Xiang. A Markov clustering topic model for mining behaviour in video. In Int. Conf. Comput. Vis., 2009.
[10] Dong-Hyun Lee. Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks. In Workshop on Challenges in Representation Learning, ICML, volume 3, 2013.
[11] Kun Liu and Huadong Ma. Exploring background-bias for anomaly detection in surveillance videos. In ACM Int. Conf. Multimedia, 2019.
[12] Wen Liu, Weixin Luo, Zhengxin Li, Peilin Zhao, Shenghua Gao, et al. Margin learning embedded prediction for video anomaly detection with a few anomalies. In IJCAI, 2019.
[13] Wen Liu, Weixin Luo, Dongze Lian, and Shenghua Gao. Future frame prediction for anomaly detection — a new baseline. In IEEE Conf. Comput. Vis. Pattern Recog., 2018.
[14] Cewu Lu, Jianping Shi, and Jiaya Jia. Abnormal event detection at 150 fps in MATLAB. In Int. Conf. Comput. Vis., 2013.
[15] Weixin Luo, Wen Liu, and Shenghua Gao. A revisit of sparse coding based anomaly detection in stacked RNN framework. In Int. Conf. Comput. Vis., 2017.
[16] Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(Nov):2579–2605, 2008.
[17] Phuc Nguyen, Ting Liu, Gautam Prasad, and Bohyung Han. Weakly supervised action localization by sparse temporal pooling network. In IEEE Conf. Comput. Vis. Pattern Recog., 2018.
[18] Guansong Pang, Cheng Yan, Chunhua Shen, Anton van den Hengel, and Xiao Bai. Self-trained deep ordinal regression for end-to-end video anomaly detection. In IEEE Conf. Comput. Vis. Pattern Recog., pages 12173–12182, 2020.
[19] Ramprasaath R. Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra. Grad-CAM: Visual explanations from deep networks via gradient-based localization. In Int. Conf. Comput. Vis., 2017.
[20] Waqas Sultani, Chen Chen, and Mubarak Shah. Real-world anomaly detection in surveillance videos. In IEEE Conf. Comput. Vis. Pattern Recog., 2018.
[21] Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. Learning spatiotemporal features with 3D convolutional networks. In Int. Conf. Comput. Vis., 2015.
[22] Isaac Triguero, Salvador García, and Francisco Herrera. Self-labeled techniques for semi-supervised learning: taxonomy, software and empirical study. Knowledge and Information Systems, 42(2):245–284, 2015.
[23] Waseem Ullah, Amin Ullah, Ijaz Ul Haq, Khan Muhammad, Muhammad Sajjad, and Sung Wook Baik. CNN features with bi-directional LSTM for real-time anomaly detection in surveillance networks. Multimedia Tools and Applications, pages 1–17, 2020.
[24] Boyang Wan, Yuming Fang, Xue Xia, and Jiajie Mei. Weakly supervised video anomaly detection via center-guided discriminative learning. In Int. Conf. Multimedia and Expo, 2020.
[25] Peng Wu, Jing Liu, Yujia Shi, Yujia Sun, Fangtao Shao, Zhaoyang Wu, and Zhiwei Yang. Not only look, but also listen: Learning multimodal violence detection under weak supervision. In Eur. Conf. Comput. Vis., pages 322–339, 2020.
[26] Jufeng Yang, Dongyu She, Yu-Kun Lai, Paul L. Rosin, and Ming-Hsuan Yang. Weakly supervised coupled networks for visual sentiment analysis. In IEEE Conf. Comput. Vis. Pattern Recog., 2018.
[27] David Yarowsky. Unsupervised word sense disambiguation rivaling supervised methods. In 33rd Annual Meeting of the Association for Computational Linguistics, pages 189–196, 1995.
[28] Jiangong Zhang, Laiyun Qing, and Jun Miao. Temporal convolutional network with complementary inner bag loss for weakly supervised anomaly detection. In IEEE Int. Conf. Image Process., 2019.
[29] Bin Zhao, Li Fei-Fei, and Eric P. Xing. Online detection of unusual events in videos via dynamic sparse coding. In IEEE Conf. Comput. Vis. Pattern Recog., 2011.
[30] Yiru Zhao, Bing Deng, Chen Shen, Yao Liu, Hongtao Lu, and Xian-Sheng Hua. Spatio-temporal autoencoder for video anomaly detection. In ACM Int. Conf. Multimedia, 2017.
[31] Jia-Xing Zhong, Nannan Li, Weijie Kong, Shan Liu, Thomas H. Li, and Ge Li. Graph convolutional label noise cleaner: Train a plug-and-play action classifier for anomaly detection. In IEEE Conf. Comput. Vis. Pattern Recog., 2019.
[32] Joey Tianyi Zhou, Jiawei Du, Hongyuan Zhu, Xi Peng, Yong Liu, and Rick Siow Mong Goh. AnomalyNet: An anomaly detection network for video surveillance. IEEE Transactions on Information Forensics and Security, 14(10):2537–2550, 2019.
[33] Sijie Zhu, Chen Chen, and Waqas Sultani. Video anomaly detection for smart surveillance. arXiv preprint arXiv:2004.00222, 2020.
[34] Yi Zhu and Shawn Newsam. Motion-aware feature for improved video anomaly detection. In Brit. Mach. Vis. Conf., 2019.
[35] Yang Zou, Zhiding Yu, Xiaofeng Liu, B. V. K. Kumar, and Jinsong Wang. Confidence regularized self-training. In Int. Conf. Comput. Vis., 2019.
[36] Yang Zou, Zhiding Yu, B. V. K. Vijaya Kumar, and Jinsong Wang. Unsupervised domain adaptation for semantic segmentation via class-balanced self-training. In Eur. Conf. Comput. Vis., 2018.
