Flaky server responses with llama 3 #6785

@kurnevsky

Description

I noticed that some of the responses I get from the llama-cpp server (latest master) are unnaturally fast for a 70B model, and it happens randomly. When this happens, the response quality is noticeably worse. The model I'm using is https://huggingface.co/NousResearch/Meta-Llama-3-70B-Instruct-GGUF/blob/main/Meta-Llama-3-70B-Instruct-Q5_K_M.gguf, launched with the command line llama-server -m Meta-Llama-3-70B-Instruct-Q5_K_M.gguf -c 0 -t 24 -ngl 24. The model is only partially offloaded to the GPU (via ROCm on Linux), so perhaps llama-cpp somehow doesn't use all layers when it responds quickly.
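One way to narrow this down is to send the exact same request twice with sampling pinned to be deterministic; if the fast responses come from a different computation path, the two completions should diverge. A minimal sketch against the server's /completion endpoint, assuming the default port 8080 (the prompt text is just a placeholder):

```sh
# Send an identical request twice with greedy sampling and a fixed seed.
# With a stable computation path, both runs should produce the same output.
for i in 1 2; do
  curl -s http://localhost:8080/completion \
    -H 'Content-Type: application/json' \
    -d '{
          "prompt": "Explain what a GGUF file is in one sentence.",
          "n_predict": 64,
          "temperature": 0,
          "seed": 42
        }'
  echo
done
```

Comparing the two outputs (and the per-token timings the server reports) on a "fast" run versus a normal run would show whether the speedup coincides with a change in the generated text.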
