Practical Qwen2 inference implemented in a single Java file.
This project is the successor of llama2.java, which is based on llama2.c by Andrej Karpathy and his excellent educational videos.
Besides the educational value, this project will be used to test and tune compiler optimizations and features on the JVM, particularly for the Graal compiler.
- Single file, no dependencies
- GGUF format parser
- Qwen 2 tokenizer based on minbpe
- Qwen 2 inference with Grouped-Query Attention
- Support for Q8_0 and Q4_0 quantizations
- Simple CLI with `--chat` and `--instruct` modes
- Compatible with GraalVM's `native-image`
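To give a feel for the GGUF parsing involved, here is a minimal, illustrative sketch (not the project's actual parser; class and record names are hypothetical) that decodes the fixed GGUF header, which per the GGUF spec consists of a 4-byte magic `"GGUF"`, a uint32 version, a uint64 tensor count, and a uint64 metadata key/value count, all little-endian:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Illustrative sketch (not the project's actual code): decode the fixed
// GGUF header -- 4-byte magic "GGUF", uint32 version, uint64 tensor count,
// uint64 metadata key/value count, all little-endian.
public class GgufHeader {
    static final int GGUF_MAGIC = 0x46554747; // "GGUF" read as a little-endian uint32

    public record Header(int version, long tensorCount, long metadataKvCount) {}

    public static Header parse(byte[] bytes) {
        ByteBuffer buf = ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN);
        if (buf.getInt() != GGUF_MAGIC) {
            throw new IllegalArgumentException("not a GGUF file");
        }
        return new Header(buf.getInt(), buf.getLong(), buf.getLong());
    }

    public static void main(String[] args) {
        // Synthetic 24-byte header: version 3, 2 tensors, 5 metadata entries.
        ByteBuffer b = ByteBuffer.allocate(24).order(ByteOrder.LITTLE_ENDIAN);
        b.putInt(GGUF_MAGIC).putInt(3).putLong(2).putLong(5);
        System.out.println(parse(b.array()));
    }
}
```

After the header come the metadata key/value pairs and tensor descriptors, which is where the real parser does its work.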
Download pure Q4_0 and/or Q8_0 quantized .gguf files from:
https://huggingface.co/collections/mukel/qwen2-666644562f3762a838f035de
Please be gentle with huggingface.co servers:

```shell
# Download the 1.5B parameter Q8_0 quantized model
curl -L -O https://huggingface.co/mukel/Qwen2-1.5B-Instruct-GGUF/resolve/main/Qwen2-1.5B-Instruct-Q8_0.gguf
```
In the wild, Q8_0 quantizations are fine, but Q4_0 quantizations are rarely pure, e.g. the `output.weights` tensor is quantized with Q6_K instead of Q4_0.
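For context, a Q4_0 block packs 32 weights into 18 bytes: a float16 scale followed by 16 bytes of packed 4-bit quants. A minimal, illustrative sketch of dequantizing one such block (class and method names are hypothetical; the layout follows ggml's `block_q4_0`):

```java
// Illustrative sketch (not the project's actual code): dequantize one
// ggml Q4_0 block. Layout: 2-byte little-endian float16 scale, then
// 16 bytes holding 32 packed 4-bit quants. Requires Java 20+ for
// Float.float16ToFloat.
public class Q40Block {
    public static final int BLOCK_SIZE = 32;   // weights per block
    public static final int BYTES = 2 + 16;    // 18 bytes per block

    public static float[] dequantize(byte[] block) {
        // Decode the float16 scale d.
        short bits = (short) ((block[0] & 0xFF) | ((block[1] & 0xFF) << 8));
        float d = Float.float16ToFloat(bits);
        float[] out = new float[BLOCK_SIZE];
        for (int j = 0; j < 16; j++) {
            int b = block[2 + j] & 0xFF;
            out[j]      = d * ((b & 0x0F) - 8); // low nibble -> first half
            out[j + 16] = d * ((b >>> 4) - 8);  // high nibble -> second half
        }
        return out;
    }

    public static void main(String[] args) {
        byte[] block = new byte[BYTES];
        block[0] = 0x00; block[1] = 0x3C; // float16 1.0
        block[2] = 0x09;                  // low nibble 9 -> +1.0, high nibble 0 -> -8.0
        float[] w = dequantize(block);
        System.out.println(w[0] + " " + w[16]); // 1.0 -8.0
    }
}
```

Q6_K uses a larger, more elaborate super-block layout, which is why a mixed-type file complicates a decoder that only handles pure Q4_0/Q8_0.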
A pure Q4_0 quantization can be generated from a high-precision (F32, F16, BFLOAT16) .gguf source
with the quantize utility from llama.cpp as follows:

```shell
./llama-quantize --pure Qwen2-1.5B-Instruct-F32.gguf Qwen2-1.5B-Instruct-Q4_0.gguf Q4_0
```

jbang is a perfect fit for this use case, just:
```shell
jbang Qwen2.java --help
```

Or execute directly, also via jbang:

```shell
chmod +x Qwen2.java
./Qwen2.java --help
```

Or run from source with plain `java`:

```shell
java Qwen2.java --model Qwen2-1.5B-Instruct-Q8_0.gguf --chat
```

A simple Makefile is provided; run `make` to produce `qwen2.jar`, or manually:
```shell
javac -g -d target/classes Qwen2.java
jar -cvfe qwen2.jar com.llama4j.Qwen2 LICENSE -C target/classes .
```

Run the resulting `qwen2.jar` as follows:
```shell
java -jar qwen2.jar --help
java -jar qwen2.jar --model Qwen2-1.5B-Instruct-Q8_0.gguf --chat
```
Build a native image:

```shell
native-image -jar qwen2.jar -o qwen2
```

Run:

```shell
./qwen2 --help
```

For example:

```shell
./qwen2 --model Qwen2-1.5B-Instruct-Q8_0.gguf --chat
```

License: MIT