Stars
LLaVE: Large Language and Vision Embedding Models with Hardness-Weighted Contrastive Learning
Awesome-RAG-Vision: a curated list of advanced retrieval augmented generation (RAG) for Computer Vision
[ICLR 2025] See What You Are Told: Visual Attention Sink in Large Multimodal Models
This repo contains the code for "VLM2Vec: Training Vision-Language Models for Massive Multimodal Embedding Tasks" [ICLR 2025]
[ACM MM25] The official code of "Breaking the Modality Barrier: Universal Embedding Learning with Multimodal LLMs"
Collection of Composed Image Retrieval (CIR) papers.
GLM-4.5V and GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning
[ICCV 2025] Where, What, Why: Towards Explainable Driver Attention Prediction
MMSearch-R1 is an end-to-end RL framework that enables LMMs to perform on-demand, multi-turn search with real-world multimodal search tools.
Awesome things about generative recommendation models.
[NeurIPS 2025] Official code for paper: Beyond Attention or Similarity: Maximizing Conditional Diversity for Token Pruning in MLLMs.
Official code for paper "UniIR: Training and Benchmarking Universal Multimodal Information Retrievers" (ECCV 2024)
Collection of AWESOME vision-language models for vision tasks
SkyRover, a modular and extensible simulator tailored for cross-domain pathfinding research.
Physical simulation of Marsupial UAV-UGV Systems Connected by a Variable-Length Hanging Tether
This repository provides the code and model checkpoints for AIMv1 and AIMv2 research projects.
✨✨Latest Advances on Multimodal Large Language Models
A PyTorch implementation of ACRNet based on ICME 2023 paper "Weakly-supervised Temporal Action Localization with Adaptive Clustering and Refining Network"
[🚀ICML 2025] "Taming Rectified Flow for Inversion and Editing" Using FLUX and HunyuanVideo for image and video editing!
[CVPR'25] Tora: Trajectory-oriented Diffusion Transformer for Video Generation
Wan: Open and Advanced Large-Scale Video Generative Models
[WWW 2025] Official PyTorch Code for "CTR-Driven Advertising Image Generation with Multimodal Large Language Models"