[CPU Embedding Inference] 2x slower when **Allocated** (not used) memory / RAM is lower. #5846
Comments
By "allocating more memory" do you mean creating a VM with more memory? |
Right. I just figured out the behavioural pattern: on certain cloud instances, the vCPU count increases along with memory, hence the better performance. So the ultimate problem is that embedding via llama.cpp takes far more CPU cycles when run in CPU mode. All my attempts are on x86_64 instances. Is this something you have experienced? I am asking because these models are built for resource-constrained devices, but this behaviour is the exact opposite of that. Btw, thanks for all your work!
I tried running the same command as above with a systemd process level memory limit. It was fine all the way down to around 150M, where the speed went down by 3x, presumably due to swapping. Maybe you could try this on your high memory instance and see if it's the same? If you have systemd, the command is:
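(The exact command from this comment was not preserved in the thread; a plausible form, assuming a transient `systemd-run` scope with a `MemoryMax` cap, and with the model path and prompt as placeholders, would be:)

```bash
# Sketch only: run the embedding binary under a transient systemd scope with
# a hard memory cap. The cap value, model path and prompt are placeholders.
systemd-run --scope -p MemoryMax=150M \
  ./embedding -m models/ggml-model-q4_0.gguf -p "some test sentence"
```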
Sorry if I was not clear; as mentioned in the last comment, it's more to do with CPU cycles. (It appeared that more memory helped, but it's really the extra CPU that came with more memory; look at my machine details above, we have 8 CPUs.) Here is the hypothesis I am working on: token for token, GGML models are the best for $ spent on compute if we run end-to-end inference in C++ with no Python bindings. So I am comparing 4-bit quantised GGML models with 8-bit quantised ONNX models.
What I did: to test this, I wrote C++ wrappers to serve bert.cpp (and tried the llama.cpp server as well) on serverless platforms: AWS Lambda (Lambda-specific) and Google Cloud Run (HTTP server).
Observation: for text of the same token count (tried 64 to 512), the q4_0 model takes far more CPU cycles than the 8-bit ONNX model. The same behaviour can be seen when running the C++ binaries as CLIs: ./embedding (llama.cpp) or ./main (bert.cpp).
Variants I tried before concluding: to save time I loaded the model only once and instrumented the code to measure tokenize + inference time.
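(One simpler way to compare CPU cycles rather than wall-clock time for the two CLI runs, offered here as a sketch rather than the instrumentation described above, is GNU time; the binary name, model path and prompt below are placeholders:)

```bash
# Report user/system CPU time and peak RSS for a single embedding run.
/usr/bin/time -v ./embedding -m models/ggml-model-q4_0.gguf -p "the quick brown fox"
# Relevant fields in the output: "User time (seconds)",
# "System time (seconds)" and "Maximum resident set size (kbytes)".
```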
The heavy compute appetite of GGML models is consistent across operating systems, serverless offerings and compute instances. I am not saying this is the right conclusion, but if it is true, GGML models are not economically viable (at least for embeddings). Any pointers? Anything I am missing?
This issue was closed because it has been inactive for 14 days since being marked as stale.
Machine details
Bug details:
When I run the embedding CLI command (a representative form is sketched after these numbers):
- with 50 GB RAM, the run is roughly 2x faster
- with 12 GB RAM, the same run is roughly 2x slower
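(The original command block was not preserved in this issue text; a representative llama.cpp CPU embedding invocation, with model path, thread count and prompt as placeholders, would look like:)

```bash
# Representative example only -- the exact command from the report is not
# preserved. Model path, thread count and prompt are placeholders.
./embedding -m models/ggml-model-q4_0.gguf -t 8 -p "sample text to embed"
```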
Observations:
The used memory does not change or increase; merely allocating more memory makes it run faster. Why? (One way to watch memory and swap during a run is sketched below.)
This is not just one run; I have done several runs in a loop with different texts and the behaviour persists.
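(Since swapping was suggested as a possible cause in an earlier comment, one hedged way to check what memory and swap are doing while the runs are in progress is to watch them from a second terminal:)

```bash
# Print memory, buffer/cache and swap-in/swap-out counters once per second
# while the embedding loop is running.
vmstat -w 1
# or, for a simpler free/used/cached view refreshed every second:
free -m -s 1
```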
Expected behaviour
Runtime should not change based on the amount of allocated memory, should it?