MIST: Multiple Instance Self-Training Framework for Video Anomaly Detection

Abstract

Weakly supervised video anomaly detection (WS-VAD) is to distinguish anomalies from normal events based on discriminative representations. Most existing works are limited by insufficient video representations. In this work, we develop a multiple instance self-training framework (MIST) to efficiently refine task-specific discriminative representations with only video-level annotations. In particular, MIST is composed of 1) a multiple instance pseudo label generator, which adapts a sparse continuous sampling strategy to produce more reliable clip-level pseudo labels, and 2) a self-guided attention boosted feature encoder that aims to automatically focus on anomalous regions in frames while extracting task-specific representations. Moreover, we adopt a self-training scheme to optimize both components and finally obtain a task-specific feature encoder. Extensive experiments on two public datasets demonstrate the efficacy of our method: it performs comparably to or even better than existing supervised and weakly supervised methods, specifically obtaining a frame-level AUC of 94.83% on ShanghaiTech.

Figure 1: Our proposed MIST first assigns clip-level pseudo labels Ŷ^a = {ŷ_i^a} to anomalous videos with the help of a pseudo label generator G. Then, MIST leverages information from all videos to refine a self-guided attention boosted feature encoder E_SGA.
1. Introduction
Video anomaly detection (VAD) aims to temporally or spatially localize anomalous events in videos [33]. As increasingly more surveillance cameras are deployed, VAD is playing an increasingly important role in intelligent surveillance systems to reduce the manual work of live monitoring.

Although VAD has been researched for years, developing a model to detect anomalies in videos remains challenging, as it requires the model to understand the inherent differences between normal and abnormal events, especially anomalous events that are rare and vary substantially. Previous works treat VAD as an unsupervised learning task [29, 14, 7, 15, 13, 5, 32], which encodes the usual pattern with only normal training samples and then detects the distinctive encoded patterns as anomalies. Here, we aim to address the weakly supervised video anomaly detection (WS-VAD) problem [20, 31, 28, 34, 24], because obtaining video-level labels is more realistic and can produce more reliable results than unsupervised methods. More specifically, existing methods in WS-VAD can be categorized into two classes, i.e., encoder-agnostic and encoder-based methods. The encoder-agnostic methods [20, 28, 24] utilize task-agnostic features of videos extracted from a vanilla feature encoder E (e.g., C3D [21] or I3D [2]) to estimate anomaly scores. The encoder-based methods [34, 31]
train both the feature encoder and the classifier simultaneously. The state-of-the-art encoder-based method is that of Zhong et al. [31], which formulates WS-VAD as a label noise learning problem and learns from noisy labels filtered by a label noise cleaner network. However, label noise results from assigning video-level labels to each clip. Even though the cleaner network corrects some of the noisy labels in its time-consuming iterative optimization, the refinement of representations progresses slowly, as these models are mistaught by seriously noisy pseudo labels at the beginning.

We find that existing methods have not considered training a task-specific feature encoder efficiently, one that offers discriminative representations for events under surveillance cameras. To overcome this problem for WS-VAD, we develop a two-stage self-training procedure (Figure 1) that aims to train a task-specific feature encoder with only video-level weak labels. In particular, we propose a Multiple Instance Self-Training framework (MIST) that consists of a multiple instance pseudo label generator and a self-guided attention boosted feature encoder E_SGA. 1) Multiple instance pseudo label generator. The MIL framework is well verified in weakly supervised learning. MIL-based methods can generate pseudo labels more accurately than simply assigning video-level labels to each clip [31]. Moreover, we adopt a sparse continuous sampling strategy that forces the network to pay more attention to the context around the most anomalous part. 2) Self-guided attention boosted feature encoder. Anomalous events in surveillance videos may occur in any place and at any size [11], while in commonly used action recognition videos, the action usually appears with large motion [3, 4]. Therefore, we utilize the proposed self-guided attention module in our feature encoder to emphasize the anomalous regions without any external annotation [11], using only clip-level annotations of normal videos and clip-level pseudo labels of anomalous videos. For our WS-VAD modelling, we introduce a deep MIL ranking loss to effectively train the multiple instance pseudo label generator; in particular, we adopt the sparse continuous sampling strategy so that the loss focuses more on the context around the anomalous instance.

To obtain a task-specific feature encoder with a smaller domain gap, we introduce an efficient two-stage self-training scheme to optimize the proposed framework. We use the features extracted from the original feature encoder to produce the corresponding clip-level pseudo labels for anomalous videos via the generator G. Then, we adopt these pseudo labels and their corresponding abnormal videos, as well as normal videos, to refine our improved feature encoder E_SGA (as demonstrated in Figure 1). Therefore, we can acquire a task-specific feature encoder that provides discriminative representations for surveillance videos.

Extensive experiments based on two different feature encoders, i.e., C3D [21] and I3D [2], show that our framework MIST is able to produce a task-specific feature encoder. We also compare the proposed framework with other encoder-agnostic methods on two large datasets, i.e., UCF-Crime [20] and ShanghaiTech [15]. In addition, we run ablation studies to evaluate our proposed sparse continuous sampling strategy and self-guided attention module. We also illustrate some visualized results to provide a more intuitive understanding of our approach. Our experiments demonstrate the effectiveness and efficiency of MIST.

2. Related Works

Weakly supervised video anomaly detection. VAD aims to detect anomalous events in a given video and has been researched for years [9, 29, 14, 7, 15, 13, 12, 32, 31, 5, 24]. Unsupervised learning methods [9, 29, 7, 30, 15, 13, 32, 5] encode the usual pattern with only normal training samples and then detect the distinctive encoded patterns as anomalies. Weakly supervised learning methods [20, 31, 28, 34, 24] with video-level labels are more applicable for distinguishing abnormal events from normal events. Existing weakly supervised VAD methods can be categorized into two classes, i.e., encoder-agnostic methods and encoder-based methods. 1) Encoder-agnostic methods train only the classifier. Sultani et al. [20] proposed a deep MIL ranking framework to detect anomalies; Zhang et al. [28] further introduced an inner-bag score gap regularization; Wan et al. [24] introduced a dynamic MIL loss and center-guided regularization. 2) Encoder-based methods train both a feature encoder and a classifier. Zhu et al. [34] proposed an attention-based MIL model combined with an optical flow based autoencoder to encode motion-aware features. Zhong et al. [31] took weakly supervised VAD as a label noise learning task and proposed GCNs to filter label noise for iterative model training, but the iterative optimization was inefficient and progressed slowly. Some works detect anomalies in an offline manner [23, 25] or a coarse-grained manner [20, 28, 34, 23, 25], which does not meet the real-time monitoring requirements of real-world applications.

Here, our work is also an encoder-based method and works in an online, fine-grained manner, but we use the learned pseudo labels to optimize our feature encoder E_SGA rather than using video-level labels as pseudo labels directly. Moreover, we design a two-stage self-training scheme to efficiently optimize our feature encoder and pseudo label generator instead of iterative optimization [31].

Multiple Instance Learning. MIL is a popular method for weakly supervised learning. In video-related tasks, MIL takes a video as a bag and the clips in the video as instances [20, 17, 8]. With a specific feature/score aggregation function, video-level labels can be used to indirectly supervise instance-level learning. The aggregation functions vary, e.g., max pooling [20, 28, 34] and attention pooling [17, 8].
Figure 2: Illustration of our proposed MIST framework. MIST includes a multiple instance pseudo label generator G and a self-guided attention boosted feature encoder E_SGA followed by a weighted-classification head H_c. We first train G and then generate pseudo labels for E_SGA fine-tuning.
In this paper, we adopt a sparse continuous sampling strategy in our multiple instance pseudo label generator to force the network to pay more attention to the context around the most anomalous part.

Self-training. Self-training has been widely investigated in semi-supervised learning [1, 10, 6, 27, 22, 35]. Self-training methods increase the labeled data via pseudo label generation on unlabeled data, so as to leverage the information of both labeled and unlabeled data. Recent deep self-training involves representation learning of the feature encoder and classifier refinement, mostly adopted in semi-supervised learning [10] and domain adaptation [36, 35]. In unsupervised VAD, Pang et al. [18] introduced a self-training framework deployed on the testing video directly, assuming the existence of an anomaly in the given video.

Here, we propose a multiple instance self-training framework that assigns clip-level pseudo labels to all clips in abnormal videos via a multiple instance pseudo label generator. Then, we leverage information from all videos to fine-tune a self-guided attention boosted feature encoder.

3. Approach

VAD depends on discriminative representations that clearly represent the events in a scene, while feature encoders pretrained on action recognition datasets are not perfect for surveillance videos because of the existence of a domain gap [11, 3, 4]. To address this problem, we introduce a self-training strategy to refine the proposed improved feature encoder E_SGA. An illustration of our method, shown in Figure 2, is detailed in the following.

3.1. Overview

Given a video V = {v_i}_{i=1}^N with N clips, the annotated video-level label Y ∈ {0, 1} indicates whether an anomalous event exists in this video. We take a video V as a bag and the clips v_i in the video as instances. Specifically, a negative bag (i.e., Y = 0), marked as B^n = {v_i^n}_{i=1}^N, contains no anomalous instance, while a positive bag (i.e., Y = 1), denoted as B^a = {v_i^a}_{i=1}^N, contains at least one.

In this work, given a pair of bags (i.e., a positive bag B^a and a negative bag B^n), we first pre-extract the features (i.e., {f_i^a}_{i=1}^N and {f_i^n}_{i=1}^N for B^a and B^n, respectively) for each clip in the video V = {v_i}_{i=1}^N using a pretrained vanilla feature encoder, C3D or I3D, forming bags of features B^a and B^n. We then feed the extracted features to the pseudo label generator to estimate the anomaly scores of the clips (i.e., {s_i^a}_{i=1}^N and {s_i^n}_{i=1}^N). Then, we produce pseudo labels Ŷ^a = {ŷ_i^a}_{i=1}^N for each anomalous video by performing smoothing and normalization on the estimated scores; these pseudo labels supervise the learning of the proposed self-guided attention boosted feature encoder, forming a two-stage self-training scheme [10, 36, 35].

Algorithm 1 Multiple instance self-training framework
Input: Clip-level labeled normal videos V^n = {v_i^n}_{i=1}^N with corresponding clip-level labels Y^n; video-level labeled abnormal videos V^a = {v_i^a}_{i=1}^N; pretrained vanilla feature encoder E.
Output: Self-guided attention boosted feature encoder E_SGA; multiple instance pseudo label generator G; clip-level pseudo labels Ŷ^a for V^a.
Stage I. Pseudo Labels Generation.
1: Extract features of V^a and V^n from E as {f_i^a}_{i=1}^N and {f_i^n}_{i=1}^N.
2: Train G with {f_i^a}_{i=1}^N and {f_i^n}_{i=1}^N and their corresponding video-level labels according to Eq. 7.
3: Predict clip-level pseudo labels for each clip of V^a via the trained G as Ŷ^a.
Stage II. Feature Encoder Fine-tuning.
4: Combine E with the self-guided attention module as E_SGA, then fine-tune E_SGA under the supervision of Y^n ∪ Ŷ^a.
Figure 3: The workflow of our multiple instance pseudo label generator. Each bag contains L sub-bags, and each sub-bag is composed of T continuous clips.

As shown in Figure 2, our proposed feature encoder E_SGA, adapted from the vanilla feature encoder E (e.g., I3D or C3D) by adding our proposed self-guided attention module, can be optimized with the estimated pseudo labels to eliminate the domain gap and produce task-specific representations. Our approach can thus be viewed as a two-stage method (see Algorithm 1): 1) we first generate clip-level pseudo labels for anomalous videos, which have only video-level labels, via the pseudo label generator, whose parameters are updated by means of the deep MIL ranking loss; 2) after obtaining the clip-level pseudo labels of the anomalous videos, our feature encoder E_SGA can be trained on both normal and anomalous video data. Thus, we form a self-training scheme to optimize both the feature encoder E_SGA and the pseudo label generator G. The illustration in Figure 2 provides an overview of our proposed method.

To better distinguish anomalous clips from normal ones, we introduce a self-guided attention module in the feature encoder E_SGA to capture the anomalous regions in videos and help the feature encoder produce more discriminative representations (see Section 3.3). Moreover, we introduce a sparse continuous sampling strategy in the pseudo label generator to enforce the network to pay more attention to the context around the most anomalous part (see Section 3.2). Finally, we introduce the deep MIL ranking loss to optimize the learning of the pseudo label generator, and we use a cross entropy loss to train our proposed feature encoder E_SGA, supervised by the pseudo labels of anomalous videos and the clip-level annotations of normal videos.
3.2. Pseudo Label Generation via Multiple Instance Learning

In contrast to [31], which simply assigns video-level labels to each clip and then trains the vanilla feature encoder at the very beginning, we introduce an MLP-based structure as the pseudo label generator, trained under the MIL paradigm, to generate pseudo labels, which are utilized in the refinement process of our feature encoder E_SGA.

Even though recent MIL-based methods [20, 28] have made considerable progress, the process of slicing a video into fixed segments in a coarse-grained manner, regardless of its duration, is prone to bury abnormal patterns, as normal frames usually constitute the majority even in abnormal videos [24]. However, by sampling with a smaller temporal scale in a fine-grained manner, the network may overemphasize the most intense part of an anomaly but ignore the context around it. In reality, anomalous events often last for a while. With the assumption of a minimum duration of anomalies, the MIL network is forced to pay more attention to the context around the most anomalous part.

Moreover, to adapt to the variation in duration of untrimmed videos and the class imbalance in amount, we introduce a sparse continuous sampling strategy: given the features {f_i}_{i=1}^N of the clips extracted by a vanilla feature encoder E from a video, we uniformly sample L subsets from these video clips, each containing T consecutive clips, forming L sub-bags B = {f_{l,t}}_{l=1,t=1}^{L,T}, as shown in Figure 3. Remarkably, T, a hyperparameter to be tuned, also serves as the assumed minimum duration of anomalies discussed in the previous paragraph. Here, we combine the MIL model with our continuous sampling strategy, as shown in Figure 3. We feed the extracted features into our pseudo label generator to produce the corresponding anomaly scores {s_{l,t}}_{l=1,t=1}^{L,T}. Next, we average the predicted instance-level scores s_{l,t} of each sub-bag into the sub-bag score S_l, which is utilized in Eq. 7:

S_l = \frac{1}{T} \sum_{t=1}^{T} s_{l,t}.  (1)
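To make the sampling and aggregation concrete, the following is a minimal PyTorch sketch of the sparse continuous sampling strategy and the sub-bag score of Eq. 1. The function names, the stand-in MLP used in place of the generator G, and the values of L and T are our own illustrative assumptions, not the paper's exact configuration.

import torch
import torch.nn as nn

def sample_sub_bags(features, L, T):
    # Uniformly sample L sub-bags of T consecutive clips from the (N, D)
    # clip features of one video, as in Figure 3.
    N = features.shape[0]
    starts = torch.linspace(0, max(N - T, 0), L).long().tolist()
    return torch.stack([features[s:s + T] for s in starts])  # (L, T, D)

# Stand-in MLP for the pseudo label generator G; the paper's exact layer
# sizes are not reproduced here.
scorer = nn.Sequential(nn.Linear(2048, 512), nn.ReLU(),
                       nn.Linear(512, 1), nn.Sigmoid())

def sub_bag_scores(features, L=32, T=4):
    # Score every clip s_{l,t}, then average within each sub-bag (Eq. 1).
    sub_bags = sample_sub_bags(features, L, T)   # (L, T, D)
    s = scorer(sub_bags).squeeze(-1)             # (L, T) instance scores
    return s.mean(dim=1)                         # (L,) sub-bag scores S_l

video_feats = torch.randn(120, 2048)  # e.g. 120 clips of C3D/I3D features
S = sub_bag_scores(video_feats)       # one anomaly score per sub-bag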
After training, the multiple instance pseudo label generator predicts clip-level scores for all abnormal videos, marked as S^a = {s_i^a}_{i=1}^N. We perform temporal smoothing with a moving average filter of kernel size k to relieve the jitter of the anomaly scores,

\tilde{s}_i^a = \frac{1}{2k} \sum_{j=i-k}^{i+k} s_j^a,  (2)

followed by min-max normalization,

\hat{y}_i^a = \frac{\tilde{s}_i^a - \min \tilde{S}^a}{\max \tilde{S}^a - \min \tilde{S}^a}, \quad i \in [1, N],  (3)

to refine the anomaly scores into Ŷ^a = {ŷ_i^a}_{i=1}^N. Specifically, ŷ_i^a lies in [0, 1] and acts as a soft pseudo label. Then, the pseudo labeled data {V^a, Ŷ^a} are combined with the clip-level labeled data {V^n, Y^n} as {V, Y} to fine-tune the proposed feature encoder E_SGA.
3.3. Self-Guided Attention in Feature Encoder
Figure 4: The structure of the self-guided attention boosted feature encoder E_SGA. GAP denotes the global average pooling operation, while Avg denotes the K channel-wise average pooling used to produce the guided anomaly scores in the guided classification head H_g. A is the attention map. F_1, F_2, F_3 are three encoding units constructed from convolutional layers.

In contrast to the vanilla feature encoder E, which provides only task-agnostic representations for the down-stream task, E_SGA augments E with the self-guided attention module and two classification heads, H_c and H_g (Figure 4), which are supervised via losses L_1 and L_2, respectively. That is, we optimize E_SGA with the pseudo labels (see Section 3.2). Therefore, the feature encoder E_SGA can update its parameters on video anomaly datasets and eliminate the domain gap inherited from the pretrained parameters.
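As a loose sketch grounded only in the Figure 4 caption, a self-guided attention block might look as follows; the layer shapes, the attention reweighting, and the two-way head H_c are all assumptions rather than the authors' exact module, and the guided head H_g is omitted.

import torch
import torch.nn as nn

class SelfGuidedAttention(nn.Module):
    # Assumed reading of Figure 4: an encoding unit produces a spatial
    # attention map A via softmax, which reweights the backbone feature map
    # before global average pooling and the classification head H_c.
    def __init__(self, channels=512):
        super().__init__()
        self.encode = nn.Conv2d(channels, 1, kernel_size=1)  # stand-in for F_*
        self.head_c = nn.Linear(channels, 2)                 # H_c: normal/abnormal

    def forward(self, feat_map):                  # (B, C, H, W) backbone features
        logits = self.encode(feat_map)            # (B, 1, H, W)
        A = torch.softmax(logits.flatten(2), dim=-1).view_as(logits)  # attention map A
        attended = feat_map * (1 + A)             # emphasize attended regions
        pooled = attended.mean(dim=(2, 3))        # GAP
        return self.head_c(pooled), A

m = SelfGuidedAttention()
scores, A = m(torch.randn(4, 512, 7, 7))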
3.4. Optimization Process

Deep MIL Ranking Loss: Considering that the positive bag contains at least one anomalous clip, we assume that the clip from a positive bag with the highest anomaly score is the most likely to be an anomaly [8]. To adapt to our sparse continuous sampling in Section 3.2, we treat a sub-bag as an instance and acquire a reliable relative comparison between the most likely anomalous sub-bag and the most likely normal sub-bag:

\max_{1 \le l \le L} S_l^n < \max_{1 \le l \le L} S_l^a.  (6)

Specifically, to avoid too many false positive instances in positive bags, we introduce a sparse constraint on positive bags, which instantiates Eq. 6 as a deep MIL ranking loss with sparse regularization (Eq. 7).
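Eq. 7 itself does not survive in this excerpt; as a hedged sketch, the loss below instantiates the ranking constraint of Eq. 6 as a hinge loss with an L1 sparsity term on the positive bag, following the deep MIL ranking loss of Sultani et al. [20]. The margin of 1 and the weight lambda_sparse are assumed values.

import torch

def mil_ranking_loss(S_a, S_n, lambda_sparse=8e-5):
    # Hinge ranking term: push the top sub-bag score of the positive bag
    # above that of the negative bag (Eq. 6).
    hinge = torch.relu(1.0 - S_a.max() + S_n.max())
    # Sparse constraint: most sub-bags of a positive bag should stay normal.
    sparsity = S_a.sum()
    return hinge + lambda_sparse * sparsity

loss = mil_ranking_loss(torch.rand(32), torch.rand(32))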
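Putting the pieces together, Stage I of Algorithm 1 could look like the toy loop below, reusing scorer, sub_bag_scores, soft_pseudo_labels, and mil_ranking_loss from the sketches above; the bag pairing, optimizer settings, and epoch count are illustrative assumptions.

import torch

optimizer = torch.optim.Adam(scorer.parameters(), lr=1e-4)
abnormal_feats = [torch.randn(150, 2048) for _ in range(8)]  # toy {f_i^a}
normal_feats = [torch.randn(150, 2048) for _ in range(8)]    # toy {f_i^n}

for epoch in range(5):
    for f_a, f_n in zip(abnormal_feats, normal_feats):
        # One positive/negative bag pair per step (Eq. 6 / Eq. 7).
        loss = mil_ranking_loss(sub_bag_scores(f_a), sub_bag_scores(f_n))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

# The trained generator then scores every clip of each abnormal video, and
# Eqs. 2-3 convert those scores into the soft pseudo labels Ŷ^a that
# supervise the fine-tuning of E_SGA in Stage II.
with torch.no_grad():
    pseudo = [soft_pseudo_labels(scorer(f).squeeze(-1)) for f in abnormal_feats]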
4. Experiments

[Figure: frame-level AUC (%) over self-training iterations (MIL, iter 1, iter 2, iter 3); plotted values 86.61, 89.15, 93.13, and 94.83.]

[Table (truncated): comparison of methods by supervision, granularity, and encoder.]

Method             Supervised  Grained  Encoder      AUC (%)  FAR (%)
Hasan et al. [7]   Un          Coarse   AE (RGB)     50.6     27.2
Lu et al. [14]     Un          Coarse   Dictionary   65.51    3.1
SVM                Weak        Coarse   C3D (RGB)    50       -
Table 3: Quantitative comparisons between the features from the pretrained vanilla feature encoder and those from MIST on the UCF-Crime and ShanghaiTech datasets, obtained by adopting encoder-agnostic methods.

Encoder-Agnostic     UCF-Crime AUC (%)       ShanghaiTech AUC (%)
Methods              pretrained  fine-tuned  pretrained  fine-tuned
Sultani et al. [20]  78.43       81.42       86.92       92.63
Zhang et al. [28]    78.11       81.58       88.87       92.50
AR-Net [24]          78.96       82.62       85.38       92.27
Our MIL generator    79.37       81.55       89.15       92.24

Table 4: Performance comparison of sparse continuous sampling and uniform sampling for MIL generator training.

Dataset       Feature    Uniform AUC (%)  Sparse Continuous AUC (%)  ΔAUC (%)
UCF-Crime     C3D (RGB)  74.29            75.51                      +1.22
UCF-Crime     I3D (RGB)  78.72            79.37                      +0.65
ShanghaiTech  C3D (RGB)  83.68            86.61                      +2.93
ShanghaiTech  I3D (RGB)  83.10            89.15                      +6.05

A larger score gap indicates that the network is more capable of distinguishing anomalies from normal events [13]. We conduct ablation studies on UCF-Crime to analyze the impact of the generated pseudo labels (PLs), the self-guided attention module (SGA), and the guided classification head H_g.
[Figure 7 panels: Robbery137, Shooting008, Vandalism028, Normal877, Arrest001, Burglary079.]

Figure 7: Visualization of the testing results on UCF-Crime (better viewed in color). The red blocks in the graphs are the temporal ground truths of anomalous events. The orange circle shows a wrongly labeled ground truth, the blue circle indicates a wrongly predicted clip, and the red circle indicates a correctly predicted clip.
Method        AUC (%)  Score Gap (%)
Baseline      74.13    0.375
MIST w/o PLs  73.33    0.443
MIST w/o Hg   81.97    15.37
MIST w/o SGA  80.28    12.74
MIST          82.30    17.71

Additionally, compared with the activation maps generated from MIST without the guided classification head H_g and from MIST without the SGA module, the results of MIST are concentrated on the anomalous regions, which shows the rationality and effectiveness of our self-guided attention module.
References

[1] Massih-Reza Amini and Patrick Gallinari. Semi-supervised logistic regression. In ECAI, 2002.
[2] Joao Carreira and Andrew Zisserman. Quo vadis, action recognition? A new model and the kinetics dataset. In IEEE Conf. Comput. Vis. Pattern Recog., 2017.
[3] Jinwoo Choi, Chen Gao, Joseph C. E. Messou, and Jia-Bin Huang. Why can't I dance in the mall? Learning to mitigate scene bias in action recognition. In Adv. Neural Inform. Process. Syst., 2019.
[4] Jinwoo Choi, Gaurav Sharma, Manmohan Chandraker, and Jia-Bin Huang. Unsupervised and semi-supervised domain adaptation for action recognition from drones. 2020.
[5] Dong Gong, Lingqiao Liu, Vuong Le, Budhaditya Saha, Moussa Reda Mansour, Svetha Venkatesh, and Anton van den Hengel. Memorizing normality to detect anomaly: Memory-augmented deep autoencoder for unsupervised anomaly detection. In Int. Conf. Comput. Vis., 2019.
[6] Yves Grandvalet and Yoshua Bengio. Semi-supervised learning by entropy minimization. In Adv. Neural Inform. Process. Syst., 2005.
[7] Mahmudul Hasan, Jonghyun Choi, Jan Neumann, Amit K. Roy-Chowdhury, and Larry S. Davis. Learning temporal regularity in video sequences. In IEEE Conf. Comput. Vis. Pattern Recog., 2016.
[8] Fa-Ting Hong, Xuanteng Huang, Wei-Hong Li, and Wei-Shi Zheng. Mini-Net: Multiple instance ranking network for video highlight detection. arXiv preprint arXiv:2007.09833, 2020.
[9] Timothy Hospedales, Shaogang Gong, and Tao Xiang. A Markov clustering topic model for mining behaviour in video. In Int. Conf. Comput. Vis., 2009.
[10] Dong-Hyun Lee. Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks. In ICML Workshop on Challenges in Representation Learning, 2013.
[11] Kun Liu and Huadong Ma. Exploring background-bias for anomaly detection in surveillance videos. In ACM Int. Conf. Multimedia, 2019.
[12] Wen Liu, Weixin Luo, Zhengxin Li, Peilin Zhao, Shenghua Gao, et al. Margin learning embedded prediction for video anomaly detection with a few anomalies. In IJCAI, 2019.
[13] Wen Liu, Weixin Luo, Dongze Lian, and Shenghua Gao. Future frame prediction for anomaly detection – a new baseline. In IEEE Conf. Comput. Vis. Pattern Recog., 2018.
[14] Cewu Lu, Jianping Shi, and Jiaya Jia. Abnormal event detection at 150 fps in MATLAB. In Int. Conf. Comput. Vis., 2013.
[15] Weixin Luo, Wen Liu, and Shenghua Gao. A revisit of sparse coding based anomaly detection in stacked RNN framework. In Int. Conf. Comput. Vis., 2017.
[16] Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(Nov):2579–2605, 2008.
[17] Phuc Nguyen, Ting Liu, Gautam Prasad, and Bohyung Han. Weakly supervised action localization by sparse temporal pooling network. In IEEE Conf. Comput. Vis. Pattern Recog., 2018.
[18] Guansong Pang, Cheng Yan, Chunhua Shen, Anton van den Hengel, and Xiao Bai. Self-trained deep ordinal regression for end-to-end video anomaly detection. In IEEE Conf. Comput. Vis. Pattern Recog., 2020.
[19] Ramprasaath R. Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra. Grad-CAM: Visual explanations from deep networks via gradient-based localization. In Int. Conf. Comput. Vis., 2017.
[20] Waqas Sultani, Chen Chen, and Mubarak Shah. Real-world anomaly detection in surveillance videos. In IEEE Conf. Comput. Vis. Pattern Recog., 2018.
[21] Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. Learning spatiotemporal features with 3D convolutional networks. In Int. Conf. Comput. Vis., 2015.
[22] Isaac Triguero, Salvador García, and Francisco Herrera. Self-labeled techniques for semi-supervised learning: taxonomy, software and empirical study. Knowledge and Information Systems, 42(2):245–284, 2015.
[23] Waseem Ullah, Amin Ullah, Ijaz Ul Haq, Khan Muhammad, Muhammad Sajjad, and Sung Wook Baik. CNN features with bi-directional LSTM for real-time anomaly detection in surveillance networks. Multimedia Tools and Applications, pages 1–17, 2020.
[24] Boyang Wan, Yuming Fang, Xue Xia, and Jiajie Mei. Weakly supervised video anomaly detection via center-guided discriminative learning. In Int. Conf. Multimedia and Expo, 2020.
[25] Peng Wu, Jing Liu, Yujia Shi, Yujia Sun, Fangtao Shao, Zhaoyang Wu, and Zhiwei Yang. Not only look, but also listen: Learning multimodal violence detection under weak supervision. In Eur. Conf. Comput. Vis., 2020.
[26] Jufeng Yang, Dongyu She, Yu-Kun Lai, Paul L. Rosin, and Ming-Hsuan Yang. Weakly supervised coupled networks for visual sentiment analysis. In IEEE Conf. Comput. Vis. Pattern Recog., 2018.
[27] David Yarowsky. Unsupervised word sense disambiguation rivaling supervised methods. In Annual Meeting of the Association for Computational Linguistics, pages 189–196, 1995.
[28] Jiangong Zhang, Laiyun Qing, and Jun Miao. Temporal convolutional network with complementary inner bag loss for weakly supervised anomaly detection. In IEEE Int. Conf. Image Process., 2019.
[29] Bin Zhao, Li Fei-Fei, and Eric P. Xing. Online detection of unusual events in videos via dynamic sparse coding. In IEEE Conf. Comput. Vis. Pattern Recog., 2011.
[30] Yiru Zhao, Bing Deng, Chen Shen, Yao Liu, Hongtao Lu, and Xian-Sheng Hua. Spatio-temporal autoencoder for video anomaly detection. In ACM Int. Conf. Multimedia, 2017.
[31] Jia-Xing Zhong, Nannan Li, Weijie Kong, Shan Liu, Thomas H. Li, and Ge Li. Graph convolutional label noise cleaner: Train a plug-and-play action classifier for anomaly detection. In IEEE Conf. Comput. Vis. Pattern Recog., 2019.
[32] Joey Tianyi Zhou, Jiawei Du, Hongyuan Zhu, Xi Peng, Yong Liu, and Rick Siow Mong Goh. AnomalyNet: An anomaly detection network for video surveillance. IEEE Transactions on Information Forensics and Security, 14(10):2537–2550, 2019.
[33] Sijie Zhu, Chen Chen, and Waqas Sultani. Video anomaly detection for smart surveillance. arXiv preprint arXiv:2004.00222, 2020.
[34] Yi Zhu and Shawn Newsam. Motion-aware feature for improved video anomaly detection. In Brit. Mach. Vis. Conf., 2019.
[35] Yang Zou, Zhiding Yu, Xiaofeng Liu, B. V. K. Kumar, and Jinsong Wang. Confidence regularized self-training. In Int. Conf. Comput. Vis., 2019.
[36] Yang Zou, Zhiding Yu, B. V. K. Vijaya Kumar, and Jinsong Wang. Unsupervised domain adaptation for semantic segmentation via class-balanced self-training. In Eur. Conf. Comput. Vis., 2018.