Stars
Code for the paper "SPG: Sandwiched Policy Gradient for Masked Diffusion Language Models"
SGLang is a fast serving framework for large language models and multi-modality models.
[AAAI26] LongLLaDA: Unlocking Long Context Capabilities in Diffusion LLMs
dInfer: An Efficient Inference Framework for Diffusion Language Models
My learning notes for ML SYS.
Thinking with Videos from Open-Source Priors. We reproduce chain-of-frames visual reasoning by fine-tuning open-source video models. Give it a star 🌟 if you find it useful.
The official implementation of the paper “VChain: Chain-of-Visual-Thought for Reasoning in Video Generation”
🚀 LLM-I: Transform LLMs into natural interleaved multimodal creators! ✨ Tool-use framework supporting image search, generation, code execution & editing
Fully Open Framework for Democratized Multimodal Training
Official implementation of "Diffusion Language Models Know the Answer Before Decoding"
Project Page For "Seg-Zero: Reasoning-Chain Guided Segmentation via Cognitive Reinforcement"
[AAAI 2026] ✨ TSPO: Temporal Sampling Policy Optimization for Long-form Video Language Understanding
Official repo for "PAPO: Perception-Aware Policy Optimization for Multimodal Reasoning"
📰 Must-read papers and blogs on LLM-based Long Context Modeling 🔥
[NeurIPS 2025] Efficient Reasoning Vision Language Models
Agent-R1: Training Powerful LLM Agents with End-to-End Reinforcement Learning
A MemAgent framework that can extrapolate to 3.5M tokens, along with a framework for RL training of any agent workflow.
Long-RL: Scaling RL to Long Sequences (NeurIPS 2025)
[NeurIPS 2025] OST-Bench: Evaluating the Capabilities of MLLMs in Online Spatio-temporal Scene Understanding
[ICCV 2025] Auto Interpretation Pipeline and many other functionalities for Multimodal SAE Analysis.
**Deep Video Discovery (DVD)** is a deep-research style question answering agent designed for understanding extra-long videos.
verl-agent is an extension of veRL, designed for training LLM/VLM agents via RL. It is also the official code for the paper "Group-in-Group Policy Optimization for LLM Agent Training"
This is the official implementation of the ICCV 2025 paper "Flash-VStream: Efficient Real-Time Understanding for Long Video Streams"
StreamingBench: Assessing the Gap for MLLMs to Achieve Streaming Video Understanding
[ECCV2024] Video Foundation Models & Data for Multimodal Understanding