Stars
Official implementation of "Urban Socio-Semantic Segmentation with Vision-Language Reasoning"
Thinking with Map: Reinforced Parallel Map-Augmented Agent for Geolocalization
Eevee: Towards Close-up High-resolution Video-based Virtual Try-on
[EMNLP 2025] Official code for "HS-STaR: Hierarchical Sampling for Self-Taught Reasoners via Difficulty Estimation and Budget Reallocation"
Official code for "POSITION BIAS MITIGATES POSITION BIAS: Mitigate Position Bias Through Inter-Position Knowledge Distillation"
Tree Search for LLM Agent Reinforcement Learning
Code of "DrVideo: Document Retrieval Based Long Video Understanding"
UniVG-R1: Reasoning Guided Universal Visual Grounding with Reinforcement Learning
Training-free Stylized Text-to-Image Generation with Fast Inference
GPG: A Simple and Strong Reinforcement Learning Baseline for Model Reasoning
Official Code of "GeReA: Question-Aware Prompt Captions for Knowledge-based Visual Question Answering"
🍳 [CVPR'25] PanSplat: 4K Panorama Synthesis with Feed-Forward Gaussian Splatting
🍳 [CVPR'24 Highlight] Pytorch implementation of "Taming Stable Diffusion for Text to 360° Panorama Image Generation"
🕸️ [ICCV'21 Oral] Official PyTorch code of DeepPanoContext: Panoramic 3D Scene Understanding with Holistic Scene Context Graph and Relation-based Optimization
🕸️ [CVPR'21] Official PyTorch code of Holistic 3D Scene Understanding from a Single Image with Implicit Representation. Also includes a PyTorch implementation of the decoder of LDIF (from 3D Shape …
[ICLR2025] Official code implementation of Video-UTR: Unhackable Temporal Rewarding for Scalable Video MLLMs
Famous Vision Language Models and Their Architectures
Code for the paper "AMEGO: Active Memory from long EGOcentric videos" published at ECCV 2024
A collection of awesome autoregressive visual generation models
[CVPR 2025] Consistent and Controllable Image Animation with Motion Diffusion Models
Awesome-Remote-Sensing-Vision-Language-Models
✨✨[CVPR 2025] Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis
[NeurIPS'23 Oral] Visual Instruction Tuning (LLaVA) built towards GPT-4V level capabilities and beyond.
Superagent protects your AI applications against prompt injections, data leaks, and harmful outputs. Embed safety directly into your app and prove compliance to your customers.
Build ChatGPT over your data, all with natural language
[TMLR 2025] Latte: Latent Diffusion Transformer for Video Generation.
An Empirical Study of GPT-3 for Few-Shot Knowledge-Based VQA, AAAI 2022 (Oral)
A lightweight, scalable, and general framework for visual question answering research
A modular framework for vision & language multimodal research from Facebook AI Research (FAIR)