Load management for GraphQL queries #1762
Conversation
I did some testing with real-world data, and load management seems to behave the way you'd expect. Jailing queries seems very effective; with … For real-world uses, we still need to determine what the right settings for these parameters are. In my test I used a …
And for posterity's sake, this is what the …
That3Percent left a comment:
This was a pleasure to review. It shows a great deal of craftsmanship. I've added some non-critical comments which you may address.
    .decline(query.shape_hash, query.query_text.as_ref())
{
    let err = SubscriptionError::GraphQLError(vec![QueryExecutionError::TooExpensive]);
    return Box::new(future::result(Err(err)));
If we're in a function using the style of the old futures, let's try migrating it to async/await.
I still feel that I don't understand what that migration entails well enough to not screw things up in the process :(
Feel free to ask questions sometime. A couple of us have been through the fire.
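For anyone following the thread, here is a rough sketch of what such a migration typically looks like. The names and types below are made up for illustration (this is not the code in this PR), and the first function assumes the futures 0.1 crate: the hand-boxed `future::result(...)` style becomes an `async fn` with ordinary control flow.

```rust
use futures::{future, Future}; // futures 0.1

// futures 0.1 style: build and box the future by hand
// (illustrative names; not the actual graph-node code)
fn execute_old(decline: bool) -> Box<dyn Future<Item = String, Error = String> + Send> {
    if decline {
        let err: Result<String, String> = Err("too expensive".to_string());
        return Box::new(future::result(err));
    }
    Box::new(future::ok::<String, String>("result".to_string()))
}

// async/await style: plain early returns; the compiler builds the future
async fn execute_new(decline: bool) -> Result<String, String> {
    if decline {
        return Err("too expensive".to_string());
    }
    Ok("result".to_string())
}
```

The calling convention changes too: callers either `.await` the new function from async code or wrap it with a compatibility layer, which is usually where most of the migration effort goes.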
    }
};
let wait_avg = {
    let mut wait_stats = self.wait_stats.write().unwrap();
I really like this style of using braces to constrain the lock scope while evaluating an expression.
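For readers skimming the thread, the pattern being praised looks roughly like this generic sketch (invented names, not the actual `wait_stats` code): the guard only lives inside the block, so the lock is released as soon as the value has been computed.

```rust
use std::sync::RwLock;

// Generic sketch of the brace-scoped lock pattern: the write guard is
// dropped at the closing brace, before `wait_avg` is used any further.
fn record_and_average(wait_stats: &RwLock<Vec<u64>>, wait_ms: u64) -> f64 {
    let wait_avg = {
        let mut stats = wait_stats.write().unwrap();
        stats.push(wait_ms);
        stats.iter().sum::<u64>() as f64 / stats.len() as f64
    }; // lock released here
    wait_avg
}
```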
struct QueryEffortInner {
    window_size: Duration,
    bin_size: Duration,
    effort: HashMap<u64, MovingStats>,
effort may grow without bound
Yes, very true. I thought briefly about how to address that, but all of the options I could come up with seemed pretty involved. The most performant approach I can think of is 'double-buffering': periodically, start a new HashMap and record measurements in both the old and the new map for at least window_size, then drop the old one and use the new one. (The issue is that I don't want to do the reaping inline with responding to a query, and we don't have any infrastructure for running periodic cleanups.)
In a week, we see about 6500 distinct queries (by shape hash), which take < 100MB of storage, so the accumulation of unused query stats should be tolerable.
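A rough sketch of the double-buffering idea described above (all names invented for illustration; the rotation would have to be kicked off by exactly the kind of periodic task graph-node does not have yet):

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

// Illustrative sketch: while a rotation is in progress, measurements go into
// both maps; once the new map covers a full window, it replaces the old one,
// and any shape hashes that were never seen again are dropped with it.
struct EffortBuffers {
    window_size: Duration,
    current: HashMap<u64, u64>,
    // replacement map plus the time we started filling it
    next: Option<(Instant, HashMap<u64, u64>)>,
}

impl EffortBuffers {
    fn record(&mut self, shape_hash: u64, effort_ms: u64) {
        *self.current.entry(shape_hash).or_insert(0) += effort_ms;
        let mut rotate = false;
        if let Some((started, next)) = self.next.as_mut() {
            *next.entry(shape_hash).or_insert(0) += effort_ms;
            rotate = started.elapsed() >= self.window_size;
        }
        if rotate {
            // the new map now covers a full window; make it current
            let (_, next) = self.next.take().unwrap();
            self.current = next;
        }
    }

    // would be called by some periodic cleanup task to start a rotation
    fn start_rotation(&mut self) {
        if self.next.is_none() {
            self.next = Some((Instant::now(), HashMap::new()));
        }
    }
}
```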
    /// proportion of the work while the system was overloaded. Currently,
    /// there is no way for a query to get out of jail other than
    /// restarting the process
    jailed_queries: RwLock<HashSet<u64>>,
I'm not a big fan of this jail, for the reasons described in the comment. I'd almost rather oscillate, or assign an additional penalty, but disabled forever seems harsh.
The motivation behind it is that this would have ended the load issue we had last week almost immediately by jailing the 2 most expensive queries, which truly were the cause of the issue. When a query lands in jail, some human intervention will be needed: if the query truly is the culprit for the high load, it should be configured to be banned; if not, the jail threshold might need to be adjusted. (You can turn jailing off completely by setting the threshold to a value > 1.)
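To make the discussion concrete, the jailing decision amounts to something like the following sketch. The struct, method, and field names here are invented; the real LoadManager tracks per-query and total effort with moving averages rather than taking them as arguments.

```rust
use std::collections::HashSet;

// Invented sketch of the jailing decision discussed above. A jail_threshold
// greater than 1.0 can never be exceeded by a share of the total effort,
// so it effectively turns jailing off.
struct JailSketch {
    jail_threshold: f64,
    jailed_queries: HashSet<u64>,
}

impl JailSketch {
    /// Returns true if the query should be declined.
    fn check(&mut self, shape_hash: u64, query_effort: f64, total_effort: f64, overloaded: bool) -> bool {
        if self.jailed_queries.contains(&shape_hash) {
            return true;
        }
        if overloaded && total_effort > 0.0 && query_effort / total_effort > self.jail_threshold {
            // this query did a disproportionate share of the work while the
            // system was overloaded; jail it until the process restarts
            self.jailed_queries.insert(shape_hash);
            return true;
        }
        false
    }
}
```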
Comments are in Zac's review of #1762
Comments are in Zac's review of #1762
Instead of logging every time we have to wait more than 100ms for a connection (which we log _a lot_), log the moving average of wait times, but at most every 10s
Make sure that both the kill_rate and the probability with which we drop a query stay between 0 and 1. Trying to call thread_rng().gen_bool(p) with a p outside of [0,1] causes a panic (a small sketch of the clamping follows these commit notes)
Comments are in Zac's review of #1762
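Regarding the kill_rate commit above, the fix boils down to clamping before handing the probability to rand, since `gen_bool` panics outside [0, 1]. The function below is a hypothetical illustration, not the actual graph-node code; `overload_factor` is an invented stand-in for whatever the kill rate is multiplied by.

```rust
use rand::{thread_rng, Rng};

// Sketch of the clamping: rand's gen_bool(p) panics if p is outside [0, 1].
fn should_drop_query(kill_rate: f64, overload_factor: f64) -> bool {
    let p = (kill_rate * overload_factor).max(0.0).min(1.0);
    thread_rng().gen_bool(p)
}
```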
This pull request adds facilities for managing and limiting the load that GraphQL queries put on the system (really, the database) in an automated fashion. When we detect that the system is overloaded, as evidenced by long DB connection wait times, we try to shed load adaptively. The heart of the algorithm for that is in graph::data::graphql::effort::LoadManager::decline, and the details are explained there. Effort (both the effort that queries cause and the DB load) is measured as a simple moving average over a configurable time window.
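For a mental model of the effort measurement, here is a simplified sketch of a moving average over a time window, using fixed-size bins so old measurements age out. All names are invented; the actual MovingStats type in this PR may be structured differently.

```rust
use std::collections::VecDeque;
use std::time::{Duration, Instant};

// Simplified sketch: measurements are grouped into bins of bin_size, and
// bins older than window_size are dropped, so the average only reflects
// the recent window.
struct WindowedAverage {
    window_size: Duration,
    bin_size: Duration,
    // (bin start, sum of measurements in the bin, number of measurements)
    bins: VecDeque<(Instant, Duration, u32)>,
}

impl WindowedAverage {
    fn add(&mut self, now: Instant, measurement: Duration) {
        // open a new bin if there is none or the newest one is full
        let need_new_bin = match self.bins.back() {
            Some((start, _, _)) => now.duration_since(*start) >= self.bin_size,
            None => true,
        };
        if need_new_bin {
            self.bins.push_back((now, Duration::ZERO, 0));
        }
        if let Some((_, sum, count)) = self.bins.back_mut() {
            *sum += measurement;
            *count += 1;
        }
        // expire bins that have fallen out of the window
        while let Some((start, _, _)) = self.bins.front() {
            if now.duration_since(*start) > self.window_size {
                self.bins.pop_front();
            } else {
                break;
            }
        }
    }

    fn average(&self) -> Option<Duration> {
        let mut sum = Duration::ZERO;
        let mut count: u32 = 0;
        for (_, bin_sum, bin_count) in &self.bins {
            sum += *bin_sum;
            count += *bin_count;
        }
        if count > 0 {
            Some(sum / count)
        } else {
            None
        }
    }
}
```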
The code unfortunately has to take quite a few locks on the normal path of executing a GraphQL query. For those locks, we need to ensure that there is no lock contention under load, when this code matters most, and that readers and writers are treated fairly; it might be worth switching to parking_lot for these locks. The code tries very hard to (1) only do a constant amount of work while holding a lock and (2) not do any I/O while holding a lock.
The default values for the various tunables are a reasonable guess at good values, but they still need to be validated with concrete load tests.
It is possible to turn load management, and all the locks associated with it, off by setting GRAPH_LOAD_THRESHOLD=0 (the default value).
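As a rough illustration of that kill switch (invented names, not the PR's actual code), the check can simply return early before touching any lock-protected state:

```rust
use std::time::Duration;

// Invented sketch: with the threshold at zero (the default), load management
// never declines anything and never touches the lock-protected statistics.
struct LoadManagerSketch {
    load_threshold: Duration,
}

impl LoadManagerSketch {
    fn decline(&self, _shape_hash: u64) -> bool {
        if self.load_threshold.is_zero() {
            return false; // load management is off
        }
        // a real implementation would consult connection wait times,
        // per-query effort, and the jail here
        false
    }
}
```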