I can achieve around 1 token per second on a Ryzen 7 3700X on Linux with the 65B model and 4-bit quantization.
If we used 8-bit instead, would it run faster? I have 128GB RAM. Is 8-bit already supported?
```
$ ./main -m models/65B/ggml-model-q4_0.bin -t 8 -n 128
main: mem per token = 70897348 bytes
main:     load time = 14010.35 ms
main:   sample time =   335.09 ms
main:  predict time = 140527.48 ms / 1089.36 ms per token
main:    total time = 157951.48 ms
```
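A rough back-of-envelope suggests the opposite of a speedup: CPU token generation is typically memory-bandwidth bound, since every weight must be streamed from RAM once per token, so doubling the weight size (q4 to q8) should roughly double the time per token. The sketch below assumes a ~40 GB/s memory bandwidth for a dual-channel DDR4 setup like the 3700X's; the real figure depends on your RAM configuration.

```python
# Bandwidth-bound lower bound on time per token, 4-bit vs 8-bit weights.
# All numbers here are assumptions for illustration, not measurements.

params = 65e9        # 65B parameters
bandwidth = 40e9     # ~40 GB/s, assumed dual-channel DDR4 (varies by system)

for bits in (4, 8):
    bytes_per_token = params * bits / 8   # each weight read once per token
    secs = bytes_per_token / bandwidth
    print(f"q{bits}: ~{secs:.2f} s/token (bandwidth-bound lower bound)")
```

The q4 estimate (~0.8 s/token) is in the same ballpark as the ~1.09 s/token measured above, which supports the bandwidth-bound reading; under that assumption q8 would be slower, not faster.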