Stars
Sotopia: an Open-ended Social Learning Environment (ICLR 2024 spotlight)
Kimi-VL: Mixture-of-Experts Vision-Language Model for Multimodal Reasoning, Long-Context Understanding, and Strong Agent Capabilities
[CVPR 2024] OneLLM: One Framework to Align All Modalities with Language
ToMATO: Verbalizing the Mental States of Role-Playing LLMs for Benchmarking Theory of Mind (AAAI 2025)
Tarsier -- a family of large-scale video-language models designed to generate high-quality video descriptions, together with strong general video understanding capabilities.
We propose MMAD, a novel automated pipeline for precise AD generation. MMAD introduces ambient music alongside visual and linguistic modalities, enhancing the model's multimodal representation learning throug…
Repo for active speaker detection in media videos.
An optimized, production-ready implementation of active speaker detection
MuMA-ToM: Multi-modal Multi-Agent Theory of Mind
Official repository for the "Powerset multi-class cross entropy loss for neural speaker diarization" paper published in Interspeech 2023.
(ICLR'25) A Comprehensive Framework for Developing and Evaluating Multimodal Role-Playing Agents
Multi-model analysis of sentiment and emotion in multi-speaker conversations.
[NAACL 2025] The implementation of paper "Hello Again! LLM-powered Personalized Agent for Long-term Dialogue".
Wellbeing and Emotion Prediction (NeurIPS 2022)
A self-alignment method for role-play. Benchmark for role-play. Resources for "Large Language Models are Superpositions of All Characters: Attaining Arbitrary Role-play via Self-Alignment".
Repo for Benchmarking Multimodal Retrieval Augmented Generation with Dynamic VQA Dataset and Self-adaptive Planning Agent
RoleInteract: Evaluating the Social Interaction of Role-Playing Agents
This repository contains a multi-agent chat application built using the autogen library. The application sets up various conversational agents with distinct personas and allows them to engage in gr…
VILA is a family of state-of-the-art vision language models (VLMs) for diverse multimodal AI tasks across the edge, data center, and cloud.
[AAAI 2025] VTG-LLM: Integrating Timestamp Knowledge into Video LLMs for Enhanced Video Temporal Grounding
The implementation of FINER-MLLM, accepted at ACM MM 2024.
[ACL 2024 Findings] "TempCompass: Do Video LLMs Really Understand Videos?", Yuanxin Liu, Shicheng Li, Yi Liu, Yuxiang Wang, Shuhuai Ren, Lei Li, Sishuo Chen, Xu Sun, Lu Hou
This repository contains a curated list of research papers and resources focusing on saliency and scanpath prediction, human attention, and human visual search.
[ICCV'23] Official PyTorch implementation for paper "Exploring Predicate Visual Context in Detecting Human-Object Interactions"
[ECCV 2020] DRG: Dual Relation Graph for Human-Object Interaction Detection