yanwei-li
🎯
Focusing
  • The Chinese University of Hong Kong
  • Hong Kong, China

Organizations

@dvlab-research


This is a project on visual spatial reasoning tasks-SIBench

Python 14 Updated Oct 20, 2025

Fully Open Framework for Democratized Multimodal Training

Python 578 40 Updated Oct 21, 2025

Official Code for "Mini-o3: Scaling Up Reasoning Patterns and Interaction Turns for Visual Search"

Python 346 15 Updated Sep 15, 2025

MGM-Omni: Scaling Omni LLMs to Personalized Long-Horizon Speech

Python 234 17 Updated Oct 11, 2025

verl: Volcano Engine Reinforcement Learning for LLMs

Python 14,731 2,346 Updated Oct 25, 2025

Code and dataset link for "DenseWorld-1M: Towards Detailed Dense Grounded Caption in the Real World"

112 2 Updated Oct 2, 2025

The implementation of Extreme Viewpoint 4D Video Generation

Python 245 17 Updated Sep 6, 2025

Open-source unified multimodal model

Python 5,202 449 Updated Aug 22, 2025

Seed1.5-VL, a vision-language foundation model designed to advance general-purpose multimodal understanding and reasoning, achieving state-of-the-art performance on 38 out of 60 public benchmarks.

Jupyter Notebook 1,472 58 Updated Jun 14, 2025

Qwen3 is the large language model series developed by Qwen team, Alibaba Cloud.

Python 25,126 1,749 Updated Oct 13, 2025

Open source repo for Locate 3D Model, 3D-JEPA and Locate 3D Dataset

Python 378 32 Updated Jun 3, 2025

[CVPR 2024 & NeurIPS 2024] EmbodiedScan: A Holistic Multi-Modal 3D Perception Suite Towards Embodied AI

Python 636 46 Updated Jun 13, 2025

✨✨Long-VITA: Scaling Large Multi-modal Models to 1 Million Tokens with Leading Short-Context Accuracy

Python 302 28 Updated May 14, 2025

The Next Step Forward in Multimodal LLM Alignment

Python 184 8 Updated May 1, 2025

Official Repo For "Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos"

Python 1,359 94 Updated Oct 24, 2025

Official repo and evaluation implementation of VSI-Bench

Python 607 37 Updated Aug 5, 2025

A generative world for general-purpose robotics & embodied AI learning.

Python 27,451 2,523 Updated Oct 25, 2025
Python 183 4 Updated Dec 17, 2024

[ICCV 2025] Official Implementation for "Lyra: An Efficient and Speech-Centric Framework for Omni-Cognition"

Python 301 29 Updated Jan 9, 2025

AllenAI's post-training codebase

Python 3,265 453 Updated Oct 25, 2025

[NeurIPS 2024] Depth Anything V2. A More Capable Foundation Model for Monocular Depth Estimation

Python 6,850 681 Updated Jan 22, 2025

Next-Token Prediction is All You Need

Python 2,219 84 Updated Mar 17, 2025

Robust Speech Recognition via Large-Scale Weak Supervision

Python 89,920 11,243 Updated Sep 8, 2025

[ICLR 2025] The First Multimodal Search Engine Pipeline and Benchmark for LMMs

Python 479 31 Updated Jan 23, 2025

Qwen3-VL is the multimodal large language model series developed by Qwen team, Alibaba Cloud.

Jupyter Notebook 15,481 1,205 Updated Oct 22, 2025

[TPAMI 2025] ViewCrafter: Taming Video Diffusion Models for High-fidelity Novel View Synthesis

Python 1,447 52 Updated Sep 18, 2025

✨✨VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction

Python 2,428 176 Updated Mar 28, 2025

The repository provides code for running inference with the Meta Segment Anything Model 2 (SAM 2), links for downloading the trained model checkpoints, and example notebooks that show how to use th…

Jupyter Notebook 17,393 2,161 Updated Dec 25, 2024

Controllable video and image Generation, SVD, Animate Anyone, ControlNet, ControlNeXt, LoRA

Python 1,618 80 Updated Sep 25, 2024

Cambrian-1 is a family of multimodal LLMs with a vision-centric design.

Python 1,955 131 Updated Oct 30, 2024