
server : add more clean up when cancel_tasks is called #11340


Merged 4 commits into ggml-org:master on Jan 23, 2025

Conversation

@ngxson ngxson (Collaborator) commented Jan 21, 2025

This PR fixes an edge case where multiple requests are sent to the server at once and get queued, but cancel_tasks fails to clean them up before they are processed.

Also fixes a small issue with condition_results.wait_for: the predicate overload returns the value of pred(), not the timeout state (std::cv_status::timeout). See the docs for std::condition_variable::wait_for.

ggerganov
ggerganov previously approved these changes Jan 22, 2025
Comment on lines 1603 to 1608
for (size_t i = 0; i < queue_results.size(); i++) {
if (queue_results[i]->id == id_task) {
queue_results.erase(queue_results.begin() + i);
i--;
}
}
@ggerganov ggerganov (Member) Jan 22, 2025

Either use std::remove_if or avoid the decrement of an unsigned integer like this:

Suggested change
for (size_t i = 0; i < queue_results.size(); i++) {
if (queue_results[i]->id == id_task) {
queue_results.erase(queue_results.begin() + i);
i--;
}
}
for (size_t i = 0; i < queue_results.size(); ) {
if (queue_results[i]->id == id_task) {
queue_results.erase(queue_results.begin() + i);
} else {
i++;
}
}

@ggerganov (Member)

Hm, I just started testing this branch using llama.vim and it deadlocked after a short while. This was the last message in the log:

srv  remove_waiti: remove task 2009 from waiting list. current waiting = 1 (before remove)

I will now do some testing with master branch to see if this bug is something caused by recent changes.

@ggerganov ggerganov dismissed their stale review January 22, 2025 09:31

llama-server deadlocks in certain cases

@ngxson (Collaborator, PR author)

ngxson commented Jan 22, 2025

Yeah, I realize there's a problem with my logic in the recv_with_timeout loop; having a look at it.

@ggerganov (Member)

ggerganov commented Jan 22, 2025

So, I think it was a false alarm, though I'm not 100% sure. Let me test this branch some more and I'll report back if any issue occurs.

Before (predicate overload, cr_res is the bool returned by pred()):

    });
    if (!cr_res) {

After (plain overload, cr_res reports the timeout directly):

    std::cv_status cr_res = condition_results.wait_for(lock, std::chrono::seconds(timeout));
    if (cr_res == std::cv_status::timeout) {
@ngxson ngxson (Collaborator, PR author) Jan 22, 2025

The problem is that if the result arrives at the exact moment the wait times out, this loop will wait again for another result, skipping the result already sitting in queue_results. It then times out again, and again, without ever checking for a result.

The simple fix is to move this check to the bottom, so we read the result queue before calling wait_for:

  • If we get a result, return it without calling wait_for
  • If we don't get any result, call wait_for
    • If wait_for times out, return nullptr so the caller can check the HTTP connection state
    • If it does NOT time out, continue to the next iteration of while (true), which again checks for an incoming result

@ggerganov (Member)

Ok, I'm testing the new version now. The previous one was definitely locking.

@ngxson (Collaborator, PR author)

@ggerganov Were you able to test this? Feel free to let me know if you need help btw

@ggerganov (Member)

Yup, been running without any issues for a while. It's good to merge.

@ngxson ngxson requested a review from ggerganov January 22, 2025 12:06
@ngxson ngxson merged commit 5845661 into ggml-org:master Jan 23, 2025
45 checks passed
anagri pushed a commit to BodhiSearch/llama.cpp that referenced this pull request Jan 26, 2025
* server : add more clean up when cancel_tasks is called

* fix recv_with_timeout

* std::remove_if

* fix std::remove_if
tinglou pushed a commit to tinglou/llama.cpp that referenced this pull request Feb 13, 2025 (same four commits as above)
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Feb 26, 2025 (same four commits as above)
mglambda pushed a commit to mglambda/llama.cpp that referenced this pull request Mar 8, 2025 (same four commits as above)
2 participants