Why llama.cpp runs substantially faster #17
Comments
Hello! I haven't done a head-to-head benchmark, actually; it's mostly just based on brief testing and some theoretical hunches. But basically, […] The other thing is just accuracy. It turns out that tokenization is really complicated, and […]
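A rough way to sanity-check that accuracy point is to compare the token IDs a converted model produces against the original HuggingFace tokenizer. The sketch below is only an illustration under assumptions not stated in this thread: it uses the `llama-cpp-python` bindings, a placeholder GGUF file `model.gguf`, and a placeholder HF checkpoint name.

```python
# Hypothetical sketch: compare llama.cpp tokenization with the reference
# HuggingFace tokenizer for the same model (paths/names are placeholders).
from llama_cpp import Llama
from transformers import AutoTokenizer

text = "Tokenization is surprisingly tricky."

# vocab_only=True loads just the tokenizer/vocab, not the full weights.
llm = Llama(model_path="model.gguf", vocab_only=True)
hf = AutoTokenizer.from_pretrained("some/original-checkpoint")

llama_ids = llm.tokenize(text.encode("utf-8"), add_bos=False)
hf_ids = hf.encode(text, add_special_tokens=False)

print("llama.cpp:", llama_ids)
print("HF       :", hf_ids)
print("match    :", llama_ids == hf_ids)
```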
Hello! Thank you very much for your explanation; it answers most of my questions. Regarding tokenization, I want to build DeBERTa v3, which uses a SentencePiece tokenizer, so I plan to use the implementation from Google. Does llama.cpp have its own implementation of the SentencePiece tokenizer?
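For reference, using Google's `sentencepiece` package directly is only a few lines. This is a minimal sketch, assuming you have the SentencePiece model file that ships with a DeBERTa v3 checkpoint; the `spm.model` filename is an assumption, adjust it for your setup.

```python
# Minimal sketch using Google's sentencepiece library directly.
# "spm.model" is assumed to be the SentencePiece model file from a
# DeBERTa v3 checkpoint; change the path to match your files.
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="spm.model")

text = "DeBERTa v3 uses a SentencePiece tokenizer."
print(sp.encode(text, out_type=str))  # subword pieces
print(sp.encode(text, out_type=int))  # corresponding token IDs
```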
Yup, llama.cpp has its own SentencePiece tokenizer. Though I'm not sure how well it works with embedding models, which have historically used more of a WordPiece tokenizer. I've been trying to get […] Anyway, let me know if you need any help with the DeBERTa implementation!
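To make the WordPiece vs. SentencePiece distinction concrete, here's a small illustrative comparison using HuggingFace tokenizers; the checkpoint names are just common examples, not something from this project.

```python
# Illustrative only: contrast WordPiece (BERT-style) and SentencePiece
# (DeBERTa v3-style) segmentation of the same sentence.
from transformers import AutoTokenizer

text = "Tokenizers handle unbelievably long words differently."

wordpiece_tok = AutoTokenizer.from_pretrained("bert-base-uncased")
sentencepiece_tok = AutoTokenizer.from_pretrained("microsoft/deberta-v3-base")

# WordPiece marks word continuations with "##";
# SentencePiece marks word starts with "▁" and operates on raw text.
print(wordpiece_tok.tokenize(text))
print(sentencepiece_tok.tokenize(text))
```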
I got it, thank you for the answer. Considering this, I think it is safer to use the Google implementation right now. And thank you for offering to help with the DeBERTa implementation; I will get back to you if I have any questions.
I am thinking about creating a DeBERTa version of this project. Initially I thought to use it as a backbone, because it's easier to modify than llama.cpp, but performance is really important for my case. The readme mentions that the llama.cpp implementation is substantially faster; I am a beginner with ggml and llama.cpp and I don't understand why. Can someone explain it?