
Temporal Sentiment Localization: Listen and Look in Untrimmed Videos

Zhicheng Zhang and Jufeng Yang


Key motivation: a video may convey multiple sentiments and each sentiment appears with varying lengths and locations. Images come from "The Wolf of Wall Street".

This repository contains the official implementation of our work in ACM MM 2022. The TSL-300 dataset and PyTorch training/validation code for the weakly-supervised framework TSL-Net are released. More details can be found in our paper. [PDF] [Video]

Abstract

Video sentiment analysis aims to uncover the underlying attitudes of viewers and has a wide range of real-world applications. Existing works simply classify a video into a single sentiment category, ignoring the fact that sentiment in untrimmed videos may appear in multiple segments with varying lengths and unknown locations. To address this, we propose a challenging task, i.e., Temporal Sentiment Localization (TSL), to find which parts of the video convey sentiment. To systematically investigate fully- and weakly-supervised settings for TSL, we first build a benchmark dataset named TSL-300, which consists of 300 videos with a total length of 1,291 minutes. Each video is labeled in two ways: one is frame-by-frame annotation for the fully-supervised setting, and the other is single-frame annotation, i.e., only a single frame with strong sentiment is labeled per segment, for the weakly-supervised setting. Due to the high cost of densely annotating a dataset, we propose TSL-Net, which employs single-frame supervision to localize sentiment in videos. In detail, we generate pseudo labels for unlabeled frames using a greedy search strategy, and fuse the affective features of both visual and audio modalities to predict the temporal sentiment distribution. Here, a reverse mapping strategy is designed for feature fusion, and a contrastive loss is utilized to maintain the consistency between the original feature and the reverse prediction. Extensive experiments show the superiority of our method against state-of-the-art approaches.
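
The consistency idea between the original feature and the reverse prediction can be illustrated with a generic feature-consistency contrastive loss. The sketch below is only an illustration under assumed tensor shapes (T temporal segments of dimension D); it is not the exact formulation used in TSL-Net, and the function name, argument names, and temperature are placeholders.

import torch
import torch.nn.functional as F

def consistency_contrastive_loss(orig_feat, reverse_pred, temperature=0.1):
    # Generic InfoNCE-style consistency loss: each original segment feature is
    # pulled toward its reverse-mapped prediction and pushed away from the others.
    # orig_feat, reverse_pred: (T, D) tensors; shapes and the temperature are
    # illustrative assumptions, not values from the paper.
    orig = F.normalize(orig_feat, dim=-1)
    rev = F.normalize(reverse_pred, dim=-1)
    logits = orig @ rev.t() / temperature               # (T, T) similarity matrix
    targets = torch.arange(orig.size(0), device=orig.device)
    return F.cross_entropy(logits, targets)             # matching segments are positives

For example, consistency_contrastive_loss(torch.randn(16, 512), torch.randn(16, 512)) returns a scalar loss that decreases as each segment's feature and its reverse prediction become more similar relative to the other segments.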

Dependencies

You can set up the environment by running pip3 install -r requirements.txt.

Recommended Environment

  • Python 3.6.13
  • Pytorch 1.10.2
  • CUDA 11.3
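
As a quick sanity check after installation, you can confirm the PyTorch and CUDA versions from Python (the version strings in the comments are just the recommended ones above):

import torch

print(torch.__version__)           # expected: 1.10.2
print(torch.version.cuda)          # expected: 11.3
print(torch.cuda.is_available())   # True if a GPU is visible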

TSL-300 dataset

If you need the TSL-300 dataset for academic purposes, please download the application form and fill out the request information, then send it to [email protected]. We will process your application as soon as possible. Please make sure that the email used comes from your educational institution.

Data Preparation

  1. Prepare TSL-300 dataset.

    • We provide the constructed dataset and pre-extracted features.
  2. Extract features with two-stream I3D networks

    • We recommend extracting features using this repo.
    • For convenience, we provide the features we used, which are also included in our dataset.
    • Link the features folder by using sudo ln -s path-to-feature ./dataset/VideoSenti/.
  3. Place the features inside the dataset folder.

    • Please ensure the data structure matches the tree below; a minimal feature-loading sketch follows the tree.
├── dataset
   └── VideoSenti
       ├── gt.json
       ├── split_train.txt
       ├── split_test.txt
       ├── fps_dict.json
       ├── time.json
       ├── videosenti_gt.json
       ├── point_gaussian
            ├── train
            └── point_labels.csv
       └── features
           ├── train
               ├── rgb
                   ├── 1_Ekman6_disgust_3.npy
                   ├── 2_Ekman6_joy_1308.npy
                   └── ...
               └── logmfcc
                   ├── 1_Ekman6_disgust_3.npy
                   ├── 2_Ekman6_joy_1308.npy
                   └── ...
           └── test
               ├── rgb
                   ├── 9_CMU_MOSEI_lzVA--tIse0.npy
                   ├── 17_CMU_MOSEI_CbRexsp1HKw.npy
                   └── ...
               └── logmfcc
                   ├── 9_CMU_MOSEI_lzVA--tIse0.npy
                   ├── 17_CMU_MOSEI_CbRexsp1HKw.npy
                   └── ...
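
For reference, the snippet below is a minimal sketch of how the pre-extracted features of one training video might be loaded; only the file layout comes from the tree above, while the feature dimensions and the helper name are assumptions.

import os
import numpy as np

FEATURE_ROOT = "./dataset/VideoSenti/features"

def load_video_features(split, video_id):
    # Load the two-stream I3D RGB and log-MFCC features stored as .npy files.
    # Each array is roughly (num_snippets, feat_dim); the exact dimensions
    # depend on the released features.
    rgb = np.load(os.path.join(FEATURE_ROOT, split, "rgb", f"{video_id}.npy"))
    audio = np.load(os.path.join(FEATURE_ROOT, split, "logmfcc", f"{video_id}.npy"))
    return rgb, audio

rgb, audio = load_video_features("train", "1_Ekman6_disgust_3")   # requires the dataset in place
print(rgb.shape, audio.shape)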

Model Zoo

Metric  | mAP@0.1 | mAP@0.2 | mAP@0.3 | mAP@AVG | Recall@AVG | F2@AVG | Url
--------|---------|---------|---------|---------|------------|--------|----------------------------
TSL-Net | 27.27   | 20.53   | 12.06   | 19.85   | 75.24      | 33.69  | Baidu drive / Google drive
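
The F2 column is, to our reading, the F-beta score with beta = 2, which weights recall more heavily than precision; the sketch below shows the standard definition and is an assumption about the metric, not code from this repository.

def f_beta(precision, recall, beta=2.0):
    # F-beta score: the weighted harmonic mean of precision and recall.
    # With beta = 2, recall counts roughly four times as much as precision.
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)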

Running

You can easily train and evaluate the model by running the script below.

You can configure more details, such as the number of epochs and the batch size; please refer to options.py.

$ bash run_train.sh

Evaluation

The pre-trained model can be downloaded from the links in the Model Zoo above.

You can evaluate the model by running the command below.

$ bash run_eval.sh
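
Before evaluating, you may want to inspect the downloaded checkpoint; the snippet below is a minimal sketch assuming it is a standard PyTorch state_dict (the file name is a placeholder).

import torch

ckpt = torch.load("path/to/pretrained_model.pkl", map_location="cpu")
# Some checkpoints wrap the weights, e.g. {"state_dict": ...}; unwrap if needed.
state_dict = ckpt.get("state_dict", ckpt) if isinstance(ckpt, dict) else ckpt
for name, tensor in list(state_dict.items())[:5]:
    print(name, tuple(tensor.shape))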

References

We referenced the repos below for the code.

Citation

If you find this repo useful in your project or research, please consider citing the relevant publication.

@inproceedings{zhang2022temporal,
  title={Temporal Sentiment Localization: Listen and Look in Untrimmed Videos},
  author={Zhang, Zhicheng and Yang, Jufeng},
  booktitle={Proceedings of the 30th ACM International Conference on Multimedia},
  year={2022}
}
