
llama-cpp-python 0.3.8 with CUDA #2010


Open
SeBL4RD opened this issue May 1, 2025 · 2 comments

Comments


SeBL4RD commented May 1, 2025

When will we have a recent version of llama-cpp-python, working with CUDA, available via pip?
It's a real nightmare to get it working any other way.
0.3.4 works with CUDA, but it doesn't support newer models such as quantized Qwen 3.

With kind regards


JamePeng commented May 2, 2025

You can try to compile the new code I maintain here: https://github.com/JamePeng/llama-cpp-python, but I have only pre-compiled the Windows and Linux versions based on the recent code.
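
Building the fork from source might look like this (a minimal sketch, assuming the fork keeps upstream llama-cpp-python's CMake flags and that the CUDA toolkit and a C++ compiler are installed):

# Untested sketch: build JamePeng's fork from source with CUDA enabled
CMAKE_ARGS="-DGGML_CUDA=on" FORCE_CMAKE=1 pip install git+https://github.com/JamePeng/llama-cpp-python --no-cache-dir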

@m-from-space commented

When will we have a recent version of llama-cpp-python, functional with CUDA, via pip?

I probably misunderstand your question, but I am using 0.3.9 with CUDA via pip. I can use Qwen3-related models as well (using i-quants, for example, or flash attention; is that what you are referring to?). Sorry if my comment is useless.

I built using...
CMAKE_ARGS="-DGGML_CUDA=on -DLLAVA_BUILD=off -DCMAKE_CUDA_ARCHITECTURES=native" FORCE_CMAKE=1 pip install llama-cpp-python --no-cache-dir --force-reinstall --upgrade
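
Once built, a quick way to confirm the wheel actually offloads to the GPU (a sketch only; the model path is a placeholder, and n_gpu_layers/flash_attn are standard Llama constructor arguments):

from llama_cpp import Llama

# Load a quantized Qwen 3 GGUF; n_gpu_layers=-1 offloads all layers to the GPU.
llm = Llama(
    model_path="./Qwen3-8B-Q4_K_M.gguf",  # placeholder path, swap in your model
    n_gpu_layers=-1,
    flash_attn=True,   # optional; the flash attention mentioned above
    verbose=True,      # startup log should list the CUDA device and offloaded layers
)
print(llm("Q: What is 2+2? A:", max_tokens=8)["choices"][0]["text"])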
