ranjiewwen
🎯 Focusing
  • Algorithmic engineer
  • Chengdu

Organizations

@DIP-ML-AI


Accelerate inference without tears

Python · 372 stars · 22 forks · Updated Jan 23, 2026

Efficient AI Inference & Serving

Python · 480 stars · 31 forks · Updated Jan 8, 2024

📰 Must-read papers and blogs on Speculative Decoding ⚡️

1,102 stars · 63 forks · Updated Jan 24, 2026
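
As background for the papers listed above: speculative decoding has a small draft model propose several tokens that the larger target model then verifies in one pass, accepting or rejecting each proposal so the output distribution still matches the target model exactly. Below is a minimal, purely illustrative sketch with toy softmax distributions standing in for the two models (all names are made up, not taken from any listed paper's code):

```python
import numpy as np

VOCAB = 8  # toy vocabulary size

def toy_probs(prefix, temperature):
    """Deterministic toy 'model': a softmax distribution over VOCAB given the prefix."""
    seed = hash(tuple(prefix)) % (2**32)
    logits = np.random.default_rng(seed).normal(size=VOCAB) / temperature
    e = np.exp(logits - logits.max())
    return e / e.sum()

def draft_probs(prefix):
    return toy_probs(prefix, temperature=1.5)   # cheap "draft" model

def target_probs(prefix):
    return toy_probs(prefix, temperature=1.0)   # expensive "target" model we want to match

def speculative_step(prefix, k=4, seed=0):
    """Draft k tokens, then verify them so accepted samples follow the target model."""
    rng = np.random.default_rng(seed)

    # 1) The draft model proposes k tokens autoregressively, remembering each distribution q.
    ctx, proposals = list(prefix), []
    for _ in range(k):
        q = draft_probs(ctx)
        token = int(rng.choice(VOCAB, p=q))
        proposals.append((token, q))
        ctx.append(token)

    # 2) The target model scores the same positions (a single batched forward pass in practice).
    ctx, accepted = list(prefix), []
    for token, q in proposals:
        p = target_probs(ctx)
        if rng.random() < min(1.0, p[token] / q[token]):
            accepted.append(token)               # accept with probability min(1, p/q)
            ctx.append(token)
        else:
            # 3) On rejection, resample from the residual max(p - q, 0) and stop.
            residual = np.maximum(p - q, 0.0)
            accepted.append(int(rng.choice(VOCAB, p=residual / residual.sum())))
            break
    return accepted

print(speculative_step([1, 2, 3]))
```

A real implementation batches step 2 into one forward pass of the target model and samples one extra target-model token when every proposal is accepted.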

Simple and efficient PyTorch-native transformer text generation in <1000 LOC of Python.

Python · 6,180 stars · 568 forks · Updated Aug 22, 2025

[NeurIPS'23 Oral] Visual Instruction Tuning (LLaVA) built towards GPT-4V level capabilities and beyond.

Python · 24,382 stars · 2,716 forks · Updated Aug 12, 2024

[ICLR 2024] Efficient Streaming Language Models with Attention Sinks

Python · 7,176 stars · 394 forks · Updated Jul 11, 2024

Official code implementation of Vary-toy (Small Language Model Meets with Reinforced Vision Vocabulary)

Python · 629 stars · 43 forks · Updated Dec 30, 2024

TinyGPT-V: Efficient Multimodal Large Language Model via Small Backbones

Python · 1,306 stars · 79 forks · Updated Apr 18, 2024

llm-export can export LLM models to ONNX.

Python · 342 stars · 38 forks · Updated Oct 24, 2025
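
I have not verified llm-export's own command line, so as a generic illustration of the kind of ONNX export such a tool automates, here is plain `torch.onnx.export` on a tiny stand-in module (the module, tensor names, and file name are invented for the example):

```python
import torch
import torch.nn as nn

# Tiny stand-in model; a real LLM export would trace the decoder with KV-cache inputs as well.
class TinyModel(nn.Module):
    def __init__(self, vocab=100, dim=32):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.proj = nn.Linear(dim, vocab)

    def forward(self, ids):
        return self.proj(self.embed(ids))

model = TinyModel().eval()
dummy_ids = torch.randint(0, 100, (1, 8))            # example input used for tracing

torch.onnx.export(
    model, (dummy_ids,), "tiny_model.onnx",
    input_names=["input_ids"], output_names=["logits"],
    dynamic_axes={"input_ids": {0: "batch", 1: "seq"}},  # allow variable batch/sequence length
    opset_version=17,
)
```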

Medusa: Simple Framework for Accelerating LLM Generation with Multiple Decoding Heads

Jupyter Notebook · 2,697 stars · 192 forks · Updated Jun 25, 2024

Official Implementation of EAGLE-1 (ICML'24), EAGLE-2 (EMNLP'24), and EAGLE-3 (NeurIPS'25).

Python · 2,156 stars · 250 forks · Updated Jan 13, 2026

📚A curated list of Awesome LLM/VLM Inference Papers with Codes: Flash-Attention, Paged-Attention, WINT8/4, Parallelism, etc.🎉

Python · 4,945 stars · 337 forks · Updated Jan 18, 2026

Official inference library for Mistral models

Jupyter Notebook · 10,640 stars · 1,006 forks · Updated Nov 21, 2025

Inference Llama 2 in one file of pure C

C · 19,136 stars · 2,441 forks · Updated Aug 6, 2024

High-speed Large Language Model Serving for Local Deployment

C++ · 8,602 stars · 479 forks · Updated Jan 24, 2026

Optimized BERT transformer inference on NVIDIA GPUs. https://arxiv.org/abs/2210.03052

C++ · 476 stars · 37 forks · Updated Mar 15, 2024

[ICML 2024] Break the Sequential Dependency of LLM Inference Using Lookahead Decoding

Python · 1,316 stars · 78 forks · Updated Mar 6, 2025

A lightweight LLM inference framework.

C++ · 748 stars · 94 forks · Updated Apr 7, 2024

A series of large language models developed by Baichuan Intelligent Technology

Python · 4,117 stars · 293 forks · Updated Nov 8, 2024

Chinese translation of llm-numbers

129 stars · 6 forks · Updated Dec 25, 2023

Numbers every LLM developer should know

4,279 stars · 140 forks · Updated Jan 16, 2024
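
The back-of-envelope figures these lists collect can be reproduced with a few lines of arithmetic; a worked example, assuming Llama-2-7B-like shapes (32 layers, hidden size 4096, fp16) purely for illustration:

```python
# Rough memory math for a 7B-parameter decoder in fp16 (2 bytes per value).
params = 7e9
weight_bytes = params * 2                      # ~14 GB of weights
print(f"weights: {weight_bytes / 1e9:.1f} GB")

# KV cache per token: 2 (K and V) * layers * hidden_size * bytes_per_element.
layers, hidden, dtype_bytes = 32, 4096, 2      # Llama-2-7B-like shapes (assumed)
kv_per_token = 2 * layers * hidden * dtype_bytes
print(f"KV cache: {kv_per_token / 1e6:.2f} MB per token")                # ~0.5 MB/token
print(f"KV cache for 4096 tokens: {kv_per_token * 4096 / 1e9:.1f} GB")   # ~2.1 GB
```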

TensorRT LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and supports state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. Tensor…

Python · 12,733 stars · 2,043 forks · Updated Jan 27, 2026
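
A sketch of how the Python API mentioned above is typically used, assuming the project's high-level `LLM` interface; the import path and argument names are my assumption and have changed across releases, so check the current TensorRT-LLM docs before relying on this:

```python
# Assumption: TensorRT-LLM's high-level "LLM API"; exact names and arguments vary by release.
from tensorrt_llm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")        # HF id or local checkpoint (placeholder)
sampling = SamplingParams(temperature=0.8, max_tokens=64)  # argument names are an assumption

for output in llm.generate(["What is paged attention?"], sampling):
    print(output.outputs[0].text)
```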

Code and documents of LongLoRA and LongAlpaca (ICLR 2024 Oral)

Python · 2,698 stars · 294 forks · Updated Aug 14, 2024

The official Python library for the OpenAI API

Python · 29,777 stars · 4,521 forks · Updated Jan 27, 2026
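
For reference, a basic chat-completion call with this library looks roughly like this (the model name and prompt are placeholders; an OPENAI_API_KEY must be set in the environment):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[{"role": "user", "content": "Summarize speculative decoding in two sentences."}],
)
print(response.choices[0].message.content)
```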

Simple, safe way to store and distribute tensors

Python · 3,602 stars · 293 forks · Updated Jan 14, 2026
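
A minimal save/load round trip with the library's PyTorch helpers (the tensor names and file path are arbitrary):

```python
import torch
from safetensors.torch import save_file, load_file

tensors = {"embedding.weight": torch.zeros(4, 8), "lm_head.weight": torch.ones(8, 4)}
save_file(tensors, "model.safetensors")        # safe, mmap-friendly on-disk format

restored = load_file("model.safetensors")
print(restored["embedding.weight"].shape)      # torch.Size([4, 8])
```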

[MLSys 2024 Best Paper Award] AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration

Python · 3,424 stars · 289 forks · Updated Jul 17, 2025

LMDeploy is a toolkit for compressing, deploying, and serving LLMs.

Python · 7,554 stars · 649 forks · Updated Jan 26, 2026

LightLLM is a Python-based LLM (Large Language Model) inference and serving framework, notable for its lightweight design, easy scalability, and high-speed performance.

Python · 3,861 stars · 296 forks · Updated Jan 27, 2026