Stars
One-for-All Multimodal Evaluation Toolkit Across Text, Image, Video, and Audio Tasks
Official Repo for CVPR 2024 Paper "FACT: Frame-Action Cross-Attention Temporal Modeling for Efficient Fully-Supervised Action Segmentation"
Holistic Evaluation of Language Models (HELM) is an open source Python framework created by the Center for Research on Foundation Models (CRFM) at Stanford for holistic, reproducible and transparen…
🤗 Transformers: the model-definition framework for state-of-the-art machine learning models in text, vision, audio, and multimodal models, for both inference and training.
Unified Efficient Fine-Tuning of 100+ LLMs & VLMs (ACL 2024)
Scenic: A JAX Library for Computer Vision Research and Beyond
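A practical interactive interface for LLMs such as GPT and GLM, specially optimized for reading, polishing, and writing papers. Modular design with support for custom shortcut buttons and function plugins, project analysis and self-explanation for Python, C++, and other projects, PDF/LaTeX paper translation and summarization, parallel querying of multiple LLMs, and local models such as chatglm3. Integrates Tongyi Qianwen, deepseekcoder, iFlytek Spark, ERNIE Bot, llama2, rwkv, claude2, m…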
The official repository for the paper "ThinkMorph: Emergent Properties in Multimodal Interleaved Chain-of-Thought Reasoning"
Open-source evaluation toolkit for large multi-modality models (LMMs), supporting 220+ LMMs and 80+ benchmarks
Qwen3 is the large language model series developed by Qwen team, Alibaba Cloud.
This is a collection of recent papers on reasoning in video generation models.
Fixes the following messages that Cursor shows during the free subscription period: Your request has been blocked as our system has detected suspicious activity / You've reached your trial request limit. / Too many free trial accounts used on this machine.
PyTorch implementation of Transfusion, "Predict the Next Token and Diffuse Images with One Multi-Modal Model", from Meta AI
A framework for few-shot evaluation of language models.
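[Lumina Embodied AI] Embodied AI Technical Guide (Embodied-AI-Guide)
An AI-native PPT generator based on nano banana pro🍌, moving toward a true "Vibe PPT": supports uploading any template image, uploading any material with intelligent parsing, automatic PPT generation from a single sentence, an outline, or page descriptions, verbal edits to specified regions, and one-click export to editable PPT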
Official repo of "Chain-of-Visual-Thought: Teaching VLMs to See and Think Better with Continuous Visual Tokens"
Dense-Localizing Audio-Visual Events in Untrimmed Videos: A Large-Scale Benchmark and Baseline (CVPR 2023)
📰 Must-read papers and blogs on LLM based Long Context Modeling 🔥
Qwen3-VL is the multimodal large language model series developed by Qwen team, Alibaba Cloud.
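Deep learning for image processing, including classification, object detection, etc.
This project shares the technical principles behind large models and hands-on experience with them (large-model engineering and real-world deployment of large-model applications)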
Glance: Accelerating Diffusion Models with 1 Sample
Temporal Action Detection & Weakly Supervised Temporal Action Detection & Temporal Action Proposal Generation
This repository provides a valuable reference for researchers in the field of multimodality. Start exploring RL-based Reasoning MLLMs here!
Code for paper "Towards Better & Faster Autoregressive Image Generation: From the Perspective of Entropy" [NeurIPS 2025].