| Source Video | Compressed Video | Future Prediction | 
|---|---|---|
| source_video.mp4 | compressed_video.mp4 | generated.mp4 | 
A world model is a model that can predict the next state of the world given the observed previous states and actions.
World models are useful for training many kinds of intelligent agents, especially self-driving models.
commaVQ contains:
- encoder/decoder models used to heavily compress driving scenes
- a world model trained on 3,000,000 minutes of driving videos
- a dataset of 100,000 minutes of compressed driving videos
Losslessly compress 5,000 minutes of driving video "tokens". Go to ./compression/ to start.
Prize: highest compression rate on 5,000 minutes of driving video (~915MB). The challenge ended on July 1st, 2024, at 11:59pm AoE.
Submit a single zip file containing the compressed data and a python script to decompress it into its original form. Top solutions are listed on comma's official leaderboard.
| Implementation | Compression rate | 
|---|---|
| pkourouklidis (arithmetic coding with GPT) | 2.6 | 
| anonymous (zpaq) | 2.3 | 
| rostislav (zpaq) | 2.3 | 
| anonymous (zpaq) | 2.2 | 
| anonymous (zpaq) | 2.2 | 
| 0x41head (zpaq) | 2.2 | 
| tillinf (zpaq) | 2.2 | 
| baseline (lzma) | 1.6 | 
A VQ-VAE [1,2] was used to heavily compress each video frame into 128 "tokens" of 10 bits each. Each entry of the dataset is a "segment" of compressed driving video, i.e. 1 minute of frames at 20 FPS. Each file has shape 1200x8x16 (1200 frames, each an 8x16 grid of tokens) and is saved as int16.
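The numbers above can be checked with a little arithmetic. The sketch below uses only the figures stated in this README (20 FPS, 128 tokens per frame, 10 bits per token, int16 storage); the variable names are illustrative. It also suggests that the ~915MB quoted for the challenge matches 5,000 segments packed at 10 bits per token, if that figure is read as MiB, which is an assumption.

```python
FPS = 20
SECONDS_PER_SEGMENT = 60
TOKENS_PER_FRAME = 8 * 16   # 128 tokens per frame
BITS_PER_TOKEN = 10         # token ids fit in 10 bits

frames_per_segment = FPS * SECONDS_PER_SEGMENT             # 1200 frames
tokens_per_segment = frames_per_segment * TOKENS_PER_FRAME

# Bitrate of the token stream.
bitrate_kbps = TOKENS_PER_FRAME * BITS_PER_TOKEN * FPS / 1000

# Size of one segment when stored as int16 (2 bytes per token) on disk.
int16_bytes_per_segment = tokens_per_segment * 2

# Size of one segment if every token were packed into exactly 10 bits.
packed_bytes_per_segment = tokens_per_segment * BITS_PER_TOKEN // 8

# 5,000 minutes of video is 5,000 one-minute segments.
challenge_mib = 5000 * packed_bytes_per_segment / 2**20

print(f"{frames_per_segment} frames per segment")
print(f"{bitrate_kbps} kbit/s token stream")
print(f"{int16_bytes_per_segment} bytes per segment as int16")
print(f"{packed_bytes_per_segment} bytes per segment at 10 bits per token")
print(f"{challenge_mib:.1f} MiB for 5,000 segments at 10 bits per token")
```

The gap between the int16 size and the 10-bit packed size exists because int16 spends 16 bits on a value that only needs 10.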
A world model [3] was trained to predict the next token given a context of past tokens. This world model is a Generative Pre-trained Transformer (GPT) [4] trained on 3,000,000 minutes of driving videos following a similar recipe to [5].
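As a rough illustration of "predict the next token given a context of past tokens", here is a minimal autoregressive sampling loop. It is not the commaVQ GPT or its API: `toy_next_token_logits` is a hypothetical stand-in for a trained model, the context is synthetic, and frames are assumed to be flattened row by row with no special tokens.

```python
import numpy as np

VOCAB_SIZE = 1024        # 10-bit tokens
TOKENS_PER_FRAME = 128   # 8 x 16 tokens per frame

def toy_next_token_logits(context: np.ndarray) -> np.ndarray:
    """Hypothetical stand-in for a trained model: one logit per token id."""
    # Crude prior: favor the token at the same position one frame earlier.
    logits = np.zeros(VOCAB_SIZE)
    if context.size >= TOKENS_PER_FRAME:
        logits[context[-TOKENS_PER_FRAME]] = 5.0
    return logits

def sample_next_frame(context: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Sample the 128 tokens of the next frame one at a time."""
    context = context.copy()
    for _ in range(TOKENS_PER_FRAME):
        logits = toy_next_token_logits(context)
        # Softmax over the vocabulary, shifted by the max for numerical stability.
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        next_token = rng.choice(VOCAB_SIZE, p=probs)
        # Each sampled token becomes part of the context for the next step.
        context = np.append(context, next_token)
    return context[-TOKENS_PER_FRAME:].reshape(8, 16)

rng = np.random.default_rng(0)
past_frames = rng.integers(0, VOCAB_SIZE, size=(2, 8, 16))  # synthetic, not real data
next_frame = sample_next_frame(past_frames.reshape(-1), rng)
print(next_frame.shape)
```

With a real model, the logits function would be replaced by a forward pass of the trained transformer over the context window.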
See ./notebooks/encode.ipynb and ./notebooks/decode.ipynb for an example of how to visualize the dataset using a segment of driving video from comma's drive to Taco Bell.
See ./notebooks/gpt.ipynb for an example of how to use the world model to imagine future frames.
See ./compression/compress.py for an example of how to compress the tokens using lzma.
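For a feel of what an lzma baseline involves, here is a minimal round-trip sketch using Python's standard `lzma` module. It is not the contents of ./compression/compress.py. The segment is synthetic random data, the transpose step is just one plausible layout choice, and the rate is assumed to be measured against the 10-bit packed size.

```python
import lzma
import numpy as np

# Synthetic stand-in for one segment; real segments come from the dataset.
rng = np.random.default_rng(0)
tokens = rng.integers(0, 1024, size=(1200, 8, 16), dtype=np.int16)

def compress_segment(tokens: np.ndarray) -> bytes:
    """Losslessly compress one segment of int16 tokens with LZMA."""
    # Assumed layout: put each token position's values over time next to
    # each other, which may expose more repetition on real video.
    flat = np.ascontiguousarray(tokens.reshape(tokens.shape[0], -1).T)
    return lzma.compress(flat.tobytes())

def decompress_segment(blob: bytes, shape=(1200, 8, 16)) -> np.ndarray:
    """Invert compress_segment and restore the original shape."""
    flat = np.frombuffer(lzma.decompress(blob), dtype=np.int16)
    return flat.reshape(-1, shape[0]).T.reshape(shape)

blob = compress_segment(tokens)
restored = decompress_segment(blob)

# Compression rate relative to storing each token in exactly 10 bits (an assumption).
packed_bytes = tokens.size * 10 // 8
rate = packed_bytes / len(blob)
print(f"lossless: {np.array_equal(tokens, restored)}, rate: {rate:.2f}")
```

Random tokens are essentially incompressible, so the rate printed here says nothing about real driving data; the point is the lossless round trip and how the rate is computed.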
- Using huggingface datasets
```python
import numpy as np
from datasets import load_dataset

num_proc = 40  # number of parallel worker processes; CPUs go brrrr
ds = load_dataset('commaai/commavq', num_proc=num_proc)
tokens = np.load(ds['0'][0]['path'])  # first segment from the first data shard
```
- Manually download from huggingface datasets repository: https://huggingface.co/datasets/commaai/commavq
[1] Van Den Oord, Aaron, and Oriol Vinyals. "Neural discrete representation learning." Advances in neural information processing systems 30 (2017).
[2] Esser, Patrick, Robin Rombach, and Bjorn Ommer. "Taming transformers for high-resolution image synthesis." Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2021.
[3] https://worldmodels.github.io/
[4] Vaswani, Ashish, et al. "Attention is all you need." Advances in neural information processing systems 30 (2017).
[5] Micheli, Vincent, Eloi Alonso, and François Fleuret. "Transformers are Sample-Efficient World Models." The Eleventh International Conference on Learning Representations. 2022.