- University of California, Berkeley
- Berkeley, CA
- https://andy-yang-1.github.io/
Stars
A benchmark for evaluating LLMs on open-ended CS problems. Exploring the Next Frontier of Computer Science.
Accelerating MoE with IO and Tile-aware Optimizations
TurboDiffusion: 100–200× Acceleration for Video Diffusion Models
Accelerating Large-Scale Reasoning Model Inference with Sparse Self-Speculative Decoding
Real-Time VLAs via Future-state-aware Asynchronous Inference.
Miles is an enterprise-facing reinforcement learning framework for large-scale MoE post-training and production workloads, forked from and co-evolving with slime.
RouterArena: An open framework for evaluating LLM routers with standardized datasets, metrics, an automated framework, and a live leaderboard.
StreamDiffusion, a live-stream app
LM engine is a library for pretraining/finetuning LLMs
A bunch of kernels that might make stuff slower 😉
[NeurIPS'25 Spotlight] Adaptive Attention Sparsity with Hierarchical Top-p Pruning
SLA: Beyond Sparsity in Diffusion Transformers via Fine-Tunable Sparse–Linear Attention
StreamingVLM: Real-Time Understanding for Infinite Video Streams
DC-VideoGen: Efficient Video Generation with Deep Compression Video Autoencoder
DC-Gen: Post-Training Diffusion Acceleration with Deeply Compressed Latent Space
Fast and memory-efficient exact k-means
This project aims to collect the latest "call for reviewers" links from various top CS/ML/AI conferences/journals
Sequence Parallelism for Sparse VideoGen
A Survey of Efficient Attention Methods: Hardware-efficient, Sparse, Compact, and Linear Attention
RAG on Everything with LEANN. Enjoy 97% storage savings while running a fast, accurate, and 100% private RAG application on your personal device.
[NeurIPS 2025] Radial Attention: O(n log n) Sparse Attention with Energy Decay for Long Video Generation
Virtualized Elastic KV Cache for Dynamic GPU Sharing and Beyond
UCCL is an efficient communication library for GPUs, covering collectives, P2P (e.g., KV cache transfer, RL weight transfer), and EP (e.g., GPU-driven)
[ICML2025] SpargeAttention: A training-free sparse attention that accelerates any model inference.
SkyRL: A Modular Full-stack RL Library for LLMs
MAGI-1: Autoregressive Video Generation at Scale
Let's make video diffusion practical!
Distributed Compiler based on Triton for Parallel Systems