We could use std::unordered_map over std::map #305
Conversation
…d::map<id, token> id_to_token; to std::vector<token> id_to_token;
…token_to_id.size());
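A minimal sketch of the intended conversion, assuming the id/token aliases and the token_to_id / id_to_token members quoted later in this thread; the build_id_to_token helper and its loop are illustrative, not part of the actual patch:

#include <cstdint>
#include <map>
#include <string>
#include <vector>

using id    = int32_t;
using token = std::string;

std::map<token, id> token_to_id;  // filled while loading the vocab
std::vector<token>  id_to_token;  // replaces std::map<id, token>; the index is the id

void build_id_to_token() {
    // token ids are dense (0..n-1), so a vector indexed by id acts as a direct lookup table
    id_to_token.resize(token_to_id.size());
    for (const auto & kv : token_to_id) {
        id_to_token[kv.second] = kv.first;
    }
}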
The token vector should probably be a struct now which also includes the score (see 074bea2). |
I pushed a commit merging with the new changes using a struct; I am not sure about the names or the organization of the struct. |
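For reference, a rough sketch of such a struct, assuming it simply pairs the token text with its score; the field names here are illustrative and may not match the commit exactly:

#include <string>
#include <vector>

struct token_score {
    std::string tok;    // the token text
    float       score;  // the score loaded alongside the token
};

// one entry per token id: id_to_token[i].tok and id_to_token[i].score
std::vector<token_score> id_to_token;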
Are you able to measure the performance gains from these changes? Interested to see how much of an impact they have. Great work! |
Thanks! Sadly it's hard to do consistent tests, but what I got is:
last commit (bd4b46d) + this pull request: (same command above)
These differences should be more noticeable with larger datasets and more tokens; I am using the 7B model. |
There should not really be any; none of the code is particularly hot. |
utils.h
@@ -52,19 +52,24 @@ std::string gpt_random_prompt(std::mt19937 & rng);
// Vocab utils
//

struct token_score {
This is confusingly named, same with token_t. The type is only used inside gpt_vocab, so why not nest it?
Also, gpt_vocab is token_t already in this case.
Hm, my first thought was token_t, but that is too close to token, so just leave it as token_score.
This compiler version does not seem to accept token token;:
/home/runner/work/llama.cpp/llama.cpp/utils.h:68:15: error: declaration of ‘gpt_vocab::token gpt_vocab::token_score::token’ changes meaning of ‘token’ [-fpermissive]
68 | token token;
| ^~~~~
/home/runner/work/llama.cpp/llama.cpp/utils.h:65:11: note: ‘token’ declared here as ‘using token = std::string’
65 | using token = std::string;
Should I rename the using token = std::string; to token_t?
The quickest and simplest fix to that would be to just rename the data member to tok.
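Putting the comments above together, a sketch of the nested struct with the renamed member; this illustrates why the rename sidesteps the quoted "changes meaning of token" error, and is not necessarily the exact code that was merged:

#include <cstdint>
#include <map>
#include <string>
#include <vector>

struct gpt_vocab {
    using id    = int32_t;
    using token = std::string;

    struct token_score {
        token tok;   // renamed from "token" so it no longer shadows the type alias above
        float score;
    };

    std::map<token, id>      token_to_id;
    std::vector<token_score> id_to_token;
};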
Apologies for the conflicts - let's resolve and merge. Regarding the C++ standard question from the other thread: overall, my experience tells me this is the better way - or at least it is better in my view. And if there ever appears a very good reason to bump the standard, we will do it. But at the moment, there is no good enough reason to do so. |
Thanks! I think I resolved the conflicts; if there are any problems, I am happy to fix them. |
If sorted maps are not necessary, change std::map to std::unordered_map.
std::unordered_map is a hash table, so lookups are O(1) on average rather than O(log n), and it should be faster than std::map when storing many items.
std::map<id, token> can be a std::vector, since the vector index can be equal to the token id.
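A compact sketch of the token_to_id side of this suggestion (the id_to_token-as-vector side is sketched earlier in the thread); the lookup helper is purely illustrative:

#include <cstdint>
#include <string>
#include <unordered_map>

using id    = int32_t;
using token = std::string;

std::unordered_map<token, id> token_to_id;  // hash table: average O(1) lookup instead of O(log n)

id lookup(const token & t) {
    auto it = token_to_id.find(t);
    return it == token_to_id.end() ? -1 : it->second;  // -1 as an illustrative "not found" value
}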