Stars
📑 PageIndex: Document Index for Vectorless, Reasoning-based RAG
Your own personal AI assistant. Any OS. Any Platform. The lobster way. 🦞
STEP-GUI: The top GUI agent solution in the galaxy. Developed by the StepFun-GELab team and powered by StepFun’s cutting-edge research capabilities.
A natural language interface for computers
The Open-Source Multimodal AI Agent Stack: Connecting Cutting-Edge AI Models and Agent Infra
verl: Volcano Engine Reinforcement Learning for LLMs
MAI-UI: Real-World Centric Foundation GUI Agents ranging from 2B to 235B
Intriguing Properties of Data Attribution on Diffusion Models (ICLR 2024)
AutoSplice: A Text-prompt Manipulated Image Dataset for Media Forensics, WMF@CVPR2023
Implementation of DALL-E 2, OpenAI's updated text-to-image synthesis neural network, in Pytorch
✨✨[NeurIPS 2025] This is the official implementation of our paper "Video-RAG: Visually-aligned Retrieval-Augmented Long Video Comprehension"
🔥[AAAI 2026, Official Code] Regression Over Classification: Assessing Image Aesthetics via Multimodal Large Language Models. Addresses the insensitivity of large models to scores during aesthetic assessment.
A comprehensive collection of IQA papers
UniPercept: Towards Unified Perceptual-Level Image Understanding across Aesthetics, Quality, Structure, and Texture
[CVPR 2025 Oral] OverLoCK: An Overview-first-Look-Closely-next ConvNet with Context-Mixing Dynamic Kernels
[TMM 2025🔥] Mixture-of-Experts for Large Vision-Language Models
Image Polygonal Annotation with Python (polygon, rectangle, circle, line, point and image-level flag annotation).
COCO API - Dataset @ http://cocodataset.org/
Ongoing research training transformer models at scale
GLIDE: a diffusion-based text-conditional image synthesis model
[AAAI 2025] Official repository of paper “Mesoscopic Insights: Orchestrating Multi-scale & Hybrid Architecture for Image Manipulation Localization”
Official code for CAT-Net: Compression Artifact Tracing Network. Image manipulation detection and localization.
The official repo for RGCL: Improving Hateful Meme Detection through Retrieval-Guided Contrastive Learning and RA-HMD: Robust Adaptation of Large Multimodal Models for Retrieval Augmented Hateful Me…
🚀 A simple way to launch, train, and use PyTorch models on almost any device and distributed configuration, automatic mixed precision (including fp8), and easy-to-configure FSDP and DeepSpeed support
Optimizing MLLM-based Scoring via a Score-Token + Decoder Paradigm. This paper proposes a unified scoring paradigm for Multimodal Large Language Models (MLLMs).
Code for the Fine-grained Textual Inversion network for Zero-Shot Composed Image Retrieval
High-performance inference framework for large language models, focusing on efficiency, flexibility, and availability.
[CVPR 2025] Teaching Large Language Models to Regress Accurate Image Quality Scores using Score Distribution