vLLM for gfx906
===================

This is a modified version of vLLM that works with (and only works with) AMD
gfx906 GPUs such as the Radeon VII / Radeon Pro VII / Instinct MI50 /
Instinct MI60.

This fork was (and still is) just a passion project shared for fun. I won't be
putting much effort into it. Use it at your own risk, and please don't use it
as a reference for your GPU purchasing decisions.


RUN WITH DOCKER
-------------------

Please install ROCm 6.3 first; only the kernel-mode driver is required. Refer
to AMD's official documentation.
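
For illustration only: on Ubuntu, the kernel-mode driver alone can typically be
installed with AMD's installer package. The exact package name and steps depend
on your distro, so treat the following as a sketch and follow AMD's guide.

```
# Sketch only: install the amdgpu DKMS driver without the full ROCm userspace.
sudo apt install ./amdgpu-install_*.deb   # installer .deb downloaded from AMD
sudo amdgpu-install --usecase=dkms        # kernel-mode driver only
```

With the driver in place, pull and run the image: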

```
docker pull nalanzeyu/vllm-gfx906
docker run -it --rm --shm-size=2g --device=/dev/kfd --device=/dev/dri \
    --group-add video -p 8000:8000 -v <YOUR_MODEL_PATH>:/model \
    nalanzeyu/vllm-gfx906 vllm serve /model
```
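
Once the container is up, the server exposes vLLM's OpenAI-compatible API on
port 8000. A minimal sanity check, assuming the default served model name
(which is the /model path passed to vllm serve above):

```
curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "/model", "prompt": "Hello, my name is", "max_tokens": 32}'
```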


SUPPORTED QUANTIZATIONS
-------------------

See nlzy#29

GPTQ and AWQ are the primary recommended quantization formats.

vLLM's llm-compressor with the W4A16 INT format is also recommended. Other
llm-compressor formats are not supported.

All quantized MoE models are significantly slower, and all unquantized models
are slightly slower; neither is recommended.
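
Quantized checkpoints are served the same way as any other model; vLLM reads
the quantization format from the checkpoint's config, so no extra flag is
normally needed. A minimal sketch, assuming a GPTQ model directory
(--max-model-len and --gpu-memory-utilization are standard vLLM flags, shown
here only as knobs worth tuning on gfx906 cards):

```
vllm serve <YOUR_GPTQ_MODEL_PATH> --max-model-len 8192 --gpu-memory-utilization 0.9
```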


NEWS
-------------------

2025-09-12:

Minor optimization in triton_unified_attention; reduced the VRAM usage of AWQ.


2025-09-10:

Updated vLLM to v0.10.1.

Re-enabled the V1 engine and Automatic Prefix Caching by default, and adjusted
some of the V1 engine's default parameters to better suit gfx906.

Optimized unquantized FP16 GEMM / GEMV by enabling the skinny unquantized GEMV
kernel; a Triton unquantized GEMM implementation was also added. Token
generation is ~90% faster on unquantized models and ~10% faster on GPTQ / AWQ
models at low batch sizes.


2025-08-24:

Backported support for GLM 4.5 (thanks to @anikifoss).

Added support for quantized MoE models packed in AWQ / GPTQ formats, and for
asymmetric quantization in ExllamaLinearKernel.

Optimized the exllama 8-bit dequant kernel, making 8-bit GPTQ dense models 10%
faster at low batch sizes. Disabled bf16, so explicitly adding --dtype float16
is no longer necessary.


2025-08-16:

Added support for quantized MoE models and fixed support for vLLM's
llm-compressor.

Switched back to the V0 engine by default; many users reported a significant
performance drop after upgrading to V1.


2025-07-08:

Updated vLLM to 0.9.2.

Starting with this version, the V1 engine is the default; startup takes longer
than with V0. Automatic Prefix Caching is off by default due to performance
issues that still need investigation.


2025-06-10:

Made some optimizations to the GPTQ and AWQ kernels: single-batch requests are
about 5% faster, and batch sizes between 8 and 32 are about 30%-50% faster.

Updated vLLM to 0.9.1.


2025-05-27:

Added support for AWQ quantization on gfx906 without Triton. It uses the same
kernel as GPTQ, so AWQ performance should be on par with GPTQ.

Updated vLLM to 0.9.0.


2025-05-02:

Updated vLLM to 0.8.5.

Upstream vLLM 0.8.5 has many issues on the ROCm platform that have already
been fixed in its main branch; I cherry-picked those fixes.

I also fixed garbled output for GPTQ desc_act=True models.


2025-04-29:

Fixed the GGUF batched-request performance issue. GGUF is now usable, but
still not as fast as GPTQ.

I also added some autotune configs to `triton_flash_attention.py` by
increasing `num_stages`.


2025-04-28:

Updated ROCm to 6.3
Updated torch to 2.7.0
Updated Triton to 3.3.0


2025-04-22:

Fixed the GPTQ Int4/Int8 GEMV kernel by changing the dot-product accumulator
type from FP16 to FP32 to avoid overflow. Thanks to the fdot2 intrinsic
introduced in Vega 12/20, using FP32 accumulators remains fast while
guaranteeing no overflow.


2025-04-21:

Updated vLLM to v0.8.4.


2025-04-20:

Changed the reconstruct threshold in the GPTQ GEMM kernel as a temporary fix
for Qwen2 GPTQ models outputting an endless stream of "!!!!!!!!!!!!!!!!!!!".


2025-04-19:

Attempted to optimize AWQ by adding `@triton.autotune` to triton_awq.py. This
improved performance by about 50%, but it is still very slow on gfx906 GPUs.


2025-04-01:

Optimized the GEMV kernel for GGUF q4_1 and q8_0 quantization, achieving a
10%~20% performance improvement.


BUILD
-------------------

Please install ROCm 6.3 first; you need both the kernel-mode driver and the
ROCm packages. Refer to AMD's official documentation.

You MUST install triton-gfx906 v3.3.0+gfx906 first, see:
https://github.com/nlzy/triton-gfx906/tree/v3.3.0+gfx906

```
cd vllm-gfx906

python3 -m venv vllmenv
source vllmenv/bin/activate

pip3 install 'torch==2.7' torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm6.3
pip3 install -r requirements/rocm-build.txt
pip3 install -r requirements/rocm.txt

pip3 install --no-build-isolation .
```
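
After the build finishes, a quick smoke test (run from the same virtualenv;
the model path is a placeholder) might look like this:

```
python3 -c "import vllm; print(vllm.__version__)"
vllm serve <YOUR_MODEL_PATH>
```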


CREDITS
-------------------

https://github.com/Said-Akbar/vllm-rocm
