This is the fork of the official Video-LaVIT repository for the fine-tuning it on https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K dataset.
We updated the weights of the model images and video tokenizers (), sinse the rest of the model remain unchanged.
We use PyTorch Lightning framework for fine-tuning and processed conversations between human and assistant using the Chat template from VideoLLaMA2 with minor changes
2025.08.01The notebook was updated (bug about visual token in dataset, unfrozen weights)