
Conversation

@lutter (Collaborator) commented Jun 29, 2020

This pull request adds facilities for managing and limiting the load that GraphQL queries put on the system (really, the database) in an automated fashion. When we detect that the system is overloaded, as evidenced by long DB connection wait times, we try to shed load adaptively. The heart of the algorithm for that is in graph::data::graphql::effort::LoadManager::decline and the details are explained there.
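
Roughly, the decision has this shape (a simplified sketch, not the actual code; the signature and bookkeeping in the real `decline` differ):

```rust
use rand::{thread_rng, Rng};

/// Illustrative only: the real LoadManager::decline looks queries up by their
/// shape hash and tracks effort, overload state, kill_rate, and the jail
/// internally; this sketch only shows the shape of the decision the executor
/// acts on.
enum Decision {
    Proceed,
    TooExpensive,
}

fn decline(jailed: bool, overloaded: bool, kill_rate: f64) -> Decision {
    if jailed {
        // Queries identified as the cause of a past overload are refused outright
        return Decision::TooExpensive;
    }
    if overloaded && thread_rng().gen_bool(kill_rate) {
        // While overloaded, shed a fraction of incoming queries given by
        // kill_rate (which is kept in [0, 1] elsewhere)
        return Decision::TooExpensive;
    }
    Decision::Proceed
}
```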

Effort (both the effort that queries cause and the DB load) is measured as a simple moving average over a configurable time window.
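
As a sketch of what such a binned moving average looks like (the actual MovingStats type in the code may differ in its details):

```rust
use std::collections::VecDeque;
use std::time::{Duration, Instant};

/// Sketch of a simple moving average over a time window, bucketed into bins.
struct MovingAvg {
    window_size: Duration,
    bin_size: Duration,
    /// (bin start, accumulated duration, number of samples)
    bins: VecDeque<(Instant, Duration, u32)>,
}

impl MovingAvg {
    fn add(&mut self, now: Instant, sample: Duration) {
        // Start a new bin if there is none or the current one is full
        let start_new_bin = match self.bins.back() {
            Some((start, _, _)) => now.duration_since(*start) >= self.bin_size,
            None => true,
        };
        if start_new_bin {
            self.bins.push_back((now, Duration::from_secs(0), 0));
        }
        if let Some((_, total, count)) = self.bins.back_mut() {
            *total += sample;
            *count += 1;
        }
        // Expire bins that have fallen out of the window
        while let Some((start, _, _)) = self.bins.front() {
            if now.duration_since(*start) > self.window_size {
                self.bins.pop_front();
            } else {
                break;
            }
        }
    }

    fn average(&self) -> Option<Duration> {
        let (total, count) = self.bins.iter().fold(
            (Duration::from_secs(0), 0u32),
            |(t, c), (_, total, count)| (t + *total, c + count),
        );
        if count > 0 {
            Some(total / count)
        } else {
            None
        }
    }
}
```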

The code unfortunately has to take quite a few locks on the normal path of executing a GraphQL query, and we need to ensure that there is no lock contention under load, when this code matters most, and that those locks treat readers and writers fairly. It might be worth switching to parking_lot for these locks. The code tries very hard to (1) only do a constant amount of work while holding a lock and (2) not do any I/O while holding a lock.
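
As a small illustration of those two rules (the Stats type here is made up for the example, not code from this PR):

```rust
use std::collections::HashMap;
use std::sync::RwLock;
use std::time::Duration;

/// Hypothetical stats holder, only to illustrate the two locking rules above.
struct Stats {
    effort: RwLock<HashMap<u64, Duration>>,
}

impl Stats {
    fn record(&self, shape_hash: u64, duration: Duration) {
        // (1) Only a constant amount of work while holding the lock:
        // update one entry and copy out the value we need afterwards.
        let total = {
            let mut effort = self.effort.write().unwrap();
            let entry = effort.entry(shape_hash).or_default();
            *entry += duration;
            *entry
        }; // write lock is released here
        // (2) No I/O while holding the lock: logging happens only after
        // the write guard has been dropped.
        println!("query {:x} has used {:?} so far", shape_hash, total);
    }
}
```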

The default values for the various tunables are a reasonable guess at good values, but still need to be validated with concrete load tests.

It is possible to turn load management, and all the locks associated with it, off by setting GRAPH_LOAD_THRESHOLD=0 (the default value).
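
Reading that setting boils down to something like the following (a sketch only; it assumes the value is interpreted as milliseconds):

```rust
use std::env;
use std::time::Duration;

/// Sketch: read GRAPH_LOAD_THRESHOLD, assumed to be given in milliseconds;
/// 0 (the default) means load management is disabled entirely.
fn load_threshold() -> Option<Duration> {
    let ms: u64 = env::var("GRAPH_LOAD_THRESHOLD")
        .ok()
        .and_then(|s| s.parse().ok())
        .unwrap_or(0);
    if ms > 0 {
        Some(Duration::from_millis(ms))
    } else {
        None
    }
}
```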

lutter commented Jul 2, 2020

I did some testing with real-world data, and load management seems to behave the way you'd expect. Jailing queries seems very effective. With GRAPH_LOAD_JAIL_THRESHOLD set so high that jailing is effectively disabled, the node oscillates between dropping queries and running all of them, keeping connection wait times around GRAPH_LOAD_THRESHOLD, as one would expect with this algorithm.

For real-world use, we still need to determine the right settings for these parameters. In my test I used a GRAPH_LOAD_WINDOW_SIZE of 60s and a GRAPH_LOAD_THRESHOLD of 20ms. The window size determines how quickly we react to an overload, but also how much we smooth out load spikes. The threshold sets a target load on the system; before fixing it, we should run for a while without any management and collect data to understand what a normal average connection wait time looks like. We also need to set GRAPH_LOAD_JAIL_THRESHOLD to a value that lets us identify the one or two bad queries that are causing an overload (if they exist) without unnecessarily jailing queries that might cause a lot of work but are unremarkable in normal operation.

lutter commented Jul 2, 2020

And for posterity's sake, this is what the LoadManager logs when things get dicey:

```
Jul 02 02:01:25.018 WARN Query overload, event: start, wait_ms: 62, component: LoadManager
Jul 02 02:02:09.520 INFO Query overload still happening, event: ongoing, kill_rate: 0.271, wait_ms: 31, duration_ms: 44501, component: LoadManager
Jul 02 02:02:45.055 INFO Query overload resolved, event: resolved, wait_ms: 15, duration_ms: 80036, component: LoadManager
Jul 02 02:03:56.496 WARN Query overload, event: start, wait_ms: 30, component: LoadManager
Jul 02 02:04:38.214 INFO Query overload still happening, event: ongoing, kill_rate: 0.3030031, wait_ms: 21, duration_ms: 41717, component: LoadManager
Jul 02 02:04:51.905 INFO Query overload resolved, event: resolved, wait_ms: 17, duration_ms: 55409, component: LoadManager
Jul 02 02:05:55.679 WARN Query overload, event: start, wait_ms: 30, component: LoadManager
Jul 02 02:06:32.968 INFO Query overload still happening, event: ongoing, kill_rate: 0.27318925990000004, wait_ms: 27, duration_ms: 37290, component: LoadManager
Jul 02 02:06:40.243 INFO Query overload resolved, event: resolved, wait_ms: 19, duration_ms: 44564, component: LoadManager
```

@That3Percent (Contributor) left a comment:

This was a pleasure to review. It shows a great deal of craftsmanship. I've added some non-critical comments which you may address.

```rust
            .decline(query.shape_hash, query.query_text.as_ref())
        {
            let err = SubscriptionError::GraphQLError(vec![QueryExecutionError::TooExpensive]);
            return Box::new(future::result(Err(err)));
```

Contributor:

If we're in a function using the style of the old futures, let's try migrating it to async/await.

Collaborator Author:

I still feel that I don't understand what that migration entails enough to not screw things up in the process :(

Contributor:

Feel free to ask questions sometime. A couple of us have been through the fire.

```rust
            }
        };
        let wait_avg = {
            let mut wait_stats = self.wait_stats.write().unwrap();
```

Contributor:

I really like this style of using the braces to constrain the lock and evaluating an expression.

```rust
struct QueryEffortInner {
    window_size: Duration,
    bin_size: Duration,
    effort: HashMap<u64, MovingStats>,
```

Contributor:

effort may grow without bound

Collaborator Author:

Yes, very true. I thought briefly about how to address that, but all of the options I could come up with seemed pretty involved. The most performant approach I can think of is to do 'double-buffering': periodically, start a new HashMap and record measurements in both the old and new map for at least window_size, and then drop the old one and use the new one. (The issue here is that I don't want to do the 'reaping' in line with responding to a query, and we don't have any infrastructure for running periodic cleanups)

In a week, we see about 6500 distinct queries (by shape hash), which take < 100MB of storage, so the accumulation of unused query stats should be tolerable.
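
A rough sketch of the double-buffering idea (names and structure here are illustrative, not a concrete plan):

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Sketch of the double-buffering idea: write to both maps, read from the
/// old one, and once the new map has seen at least a full window of data,
/// drop the old map and promote the new one.
struct DoubleBuffered<V> {
    window_size: Duration,
    swapped_at: Instant,
    old: HashMap<u64, V>,
    new: HashMap<u64, V>,
}

impl<V: Default> DoubleBuffered<V> {
    fn record(&mut self, key: u64, update: impl Fn(&mut V)) {
        // Measurements go into both maps during the overlap period
        update(self.old.entry(key).or_default());
        update(self.new.entry(key).or_default());
    }

    fn get(&self, key: u64) -> Option<&V> {
        // Reads always use the map with the full window of history
        self.old.get(&key)
    }

    /// Would be driven by a periodic cleanup task, which we currently
    /// have no infrastructure for.
    fn maybe_swap(&mut self, now: Instant) {
        if now.duration_since(self.swapped_at) >= self.window_size {
            self.old = std::mem::take(&mut self.new);
            self.swapped_at = now;
        }
    }
}
```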

```rust
    /// proportion of the work while the system was overloaded. Currently,
    /// there is no way for a query to get out of jail other than
    /// restarting the process
    jailed_queries: RwLock<HashSet<u64>>,
```

Contributor:

I'm not a big fan of this jail, for the reasons described in the comment. I'd almost rather oscillate, or assign an additional penalty, but disabled forever seems harsh.

Collaborator Author:

The motivation behind it is that this would have ended the load issue we had last week almost immediately by jailing the 2 most expensive queries, which truly were the cause of the issue. When a query lands in jail, some human intervention will be needed: if the query truly is the culprit for high load, it should be configured to be banned; if not, the jail threshold might need to be adjusted (you can turn jailing off completely by setting the threshold to a value > 1).
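
For illustration, the jailing decision amounts to something like this (a sketch, assuming we can attribute to each query its share of the effort spent while the system was overloaded; the actual logic in LoadManager::decline differs in detail):

```rust
use std::collections::HashSet;
use std::sync::RwLock;
use std::time::Duration;

struct Jail {
    /// GRAPH_LOAD_JAIL_THRESHOLD; a value > 1 effectively disables jailing
    jail_threshold: f64,
    jailed_queries: RwLock<HashSet<u64>>,
}

impl Jail {
    /// Returns true if the query should be declined
    fn check(&self, shape_hash: u64, query_effort: Duration, total_effort: Duration) -> bool {
        if self.jailed_queries.read().unwrap().contains(&shape_hash) {
            return true; // already jailed; needs human intervention to undo
        }
        if total_effort.as_secs_f64() == 0.0 {
            return false;
        }
        let share = query_effort.as_secs_f64() / total_effort.as_secs_f64();
        if share > self.jail_threshold {
            // This query did more than its allowed share of the work while
            // we were overloaded: jail it
            self.jailed_queries.write().unwrap().insert(shape_hash);
            return true;
        }
        false
    }
}
```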

lutter added a commit that referenced this pull request Jul 3, 2020
lutter added a commit that referenced this pull request Jul 3, 2020
lutter added 20 commits July 2, 2020 19:06
Instead of logging every time we have to wait more than 100ms for a connection (which we log _a lot_), log the moving average of wait times, but at most every 10s
Make sure that both the kill_rate and the probability with which we drop a query stay between 0 and 1. Trying to call thread_rng().gen_bool(p) with a p outside of [0,1] causes a panic
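
For illustration, the clamping amounts to roughly this:

```rust
use rand::{thread_rng, Rng};

/// Sketch: gen_bool panics if its argument is outside [0, 1], so the
/// computed kill_rate is clamped before we roll the dice.
fn should_drop(kill_rate: f64) -> bool {
    let p = kill_rate.clamp(0.0, 1.0);
    thread_rng().gen_bool(p)
}
```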
@lutter merged commit 75ef747 into master on Jul 3, 2020
@lutter deleted the lutter/load branch on July 3, 2020 02:07