
Conversation

@lutter (Collaborator) commented Jun 29, 2020

This pull request adds facilities for managing and limiting the load that GraphQL queries put on the system (really, the database) in an automated fashion. When we detect that the system is overloaded, as evidenced by long DB connection wait times, we try to shed load adaptively. The heart of the algorithm for that is in graph::data::graphql::effort::LoadManager::decline and the details are explained there.
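
Roughly, the decision has this shape (a simplified sketch, not the actual code; the signature and bookkeeping in the real `decline` differ):

```rust
use rand::{thread_rng, Rng};

/// Illustrative only: the real LoadManager::decline looks queries up by their
/// shape hash and tracks effort, overload state, kill_rate, and the jail
/// internally; this sketch only shows the shape of the decision the executor
/// acts on.
enum Decision {
    Proceed,
    TooExpensive,
}

fn decline(jailed: bool, overloaded: bool, kill_rate: f64) -> Decision {
    if jailed {
        // Queries identified as the cause of a past overload are refused outright
        return Decision::TooExpensive;
    }
    if overloaded && thread_rng().gen_bool(kill_rate) {
        // While overloaded, shed a fraction of incoming queries given by
        // kill_rate (which is kept in [0, 1] elsewhere)
        return Decision::TooExpensive;
    }
    Decision::Proceed
}
```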

Effort (both the effort that queries cause and the DB load) is measured as a simple moving average over a configurable time window.
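
As a sketch of what such a binned moving average looks like (the actual MovingStats type in the code may differ in its details):

```rust
use std::collections::VecDeque;
use std::time::{Duration, Instant};

/// Sketch of a simple moving average over a time window, bucketed into bins.
struct MovingAvg {
    window_size: Duration,
    bin_size: Duration,
    /// (bin start, accumulated duration, number of samples)
    bins: VecDeque<(Instant, Duration, u32)>,
}

impl MovingAvg {
    fn add(&mut self, now: Instant, sample: Duration) {
        // Start a new bin if there is none or the current one is full
        let start_new_bin = match self.bins.back() {
            Some((start, _, _)) => now.duration_since(*start) >= self.bin_size,
            None => true,
        };
        if start_new_bin {
            self.bins.push_back((now, Duration::from_secs(0), 0));
        }
        if let Some((_, total, count)) = self.bins.back_mut() {
            *total += sample;
            *count += 1;
        }
        // Expire bins that have fallen out of the window
        while let Some((start, _, _)) = self.bins.front() {
            if now.duration_since(*start) > self.window_size {
                self.bins.pop_front();
            } else {
                break;
            }
        }
    }

    fn average(&self) -> Option<Duration> {
        let (total, count) = self.bins.iter().fold(
            (Duration::from_secs(0), 0u32),
            |(t, c), (_, total, count)| (t + *total, c + count),
        );
        if count > 0 {
            Some(total / count)
        } else {
            None
        }
    }
}
```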

The code unfortunately has to take quite a few locks on the normal path of executing a GraphQL query, and we need to ensure that there is no lock contention under load, when this code matters most, and that those locks treat readers and writers fairly. It might be worth switching to parking_lot for these locks. The code tries very hard to (1) only do a constant amount of work while holding a lock and (2) not do any I/O while holding a lock.
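
As a small illustration of those two rules (the Stats type here is made up for the example, not code from this PR):

```rust
use std::collections::HashMap;
use std::sync::RwLock;
use std::time::Duration;

/// Hypothetical stats holder, only to illustrate the two locking rules above.
struct Stats {
    effort: RwLock<HashMap<u64, Duration>>,
}

impl Stats {
    fn record(&self, shape_hash: u64, duration: Duration) {
        // (1) Only a constant amount of work while holding the lock:
        // update one entry and copy out the value we need afterwards.
        let total = {
            let mut effort = self.effort.write().unwrap();
            let entry = effort.entry(shape_hash).or_default();
            *entry += duration;
            *entry
        }; // write lock is released here
        // (2) No I/O while holding the lock: logging happens only after
        // the write guard has been dropped.
        println!("query {:x} has used {:?} so far", shape_hash, total);
    }
}
```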

The default values for the various tunables are a reasonable guess at good values, but still need to be validated with concrete load tests.

It is possible to turn load management, and all the locks associated with it, off by setting GRAPH_LOAD_THRESHOLD=0 (the default value).
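
Reading that setting boils down to something like the following (a sketch only; it assumes the value is interpreted as milliseconds):

```rust
use std::env;
use std::time::Duration;

/// Sketch: read GRAPH_LOAD_THRESHOLD, assumed to be given in milliseconds;
/// 0 (the default) means load management is disabled entirely.
fn load_threshold() -> Option<Duration> {
    let ms: u64 = env::var("GRAPH_LOAD_THRESHOLD")
        .ok()
        .and_then(|s| s.parse().ok())
        .unwrap_or(0);
    if ms > 0 {
        Some(Duration::from_millis(ms))
    } else {
        None
    }
}
```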

lutter commented Jul 2, 2020

I did some testing with real-world data, and load management seems to behave the way you'd expect. Jailing queries seems very effective. With GRAPH_LOAD_JAIL_THRESHOLD set so high that jailing is effectively disabled, the node oscillates between dropping queries and running all of them, keeping connection wait times around GRAPH_LOAD_THRESHOLD, as one would expect with this algorithm.

For real-world use, we still need to determine the right settings for these parameters. In my test I used a GRAPH_LOAD_WINDOW_SIZE of 60s and a GRAPH_LOAD_THRESHOLD of 20ms. The window size determines how quickly we react to an overload, but also how much we smooth out load spikes. The threshold sets a target load on the system; before fixing it, we should run for a while without any management and collect data to understand what a normal average connection wait time looks like. We also need to set GRAPH_LOAD_JAIL_THRESHOLD to a value that lets us identify the one or two bad queries that are causing an overload (if they exist) without unnecessarily jailing queries that might cause a lot of work but are unremarkable in normal operation.

lutter commented Jul 2, 2020

And for posterity's sake, this is what the LoadManager logs when things get dicey:

```
Jul 02 02:01:25.018 WARN Query overload, event: start, wait_ms: 62, component: LoadManager
Jul 02 02:02:09.520 INFO Query overload still happening, event: ongoing, kill_rate: 0.271, wait_ms: 31, duration_ms: 44501, component: LoadManager
Jul 02 02:02:45.055 INFO Query overload resolved, event: resolved, wait_ms: 15, duration_ms: 80036, component: LoadManager
Jul 02 02:03:56.496 WARN Query overload, event: start, wait_ms: 30, component: LoadManager
Jul 02 02:04:38.214 INFO Query overload still happening, event: ongoing, kill_rate: 0.3030031, wait_ms: 21, duration_ms: 41717, component: LoadManager
Jul 02 02:04:51.905 INFO Query overload resolved, event: resolved, wait_ms: 17, duration_ms: 55409, component: LoadManager
Jul 02 02:05:55.679 WARN Query overload, event: start, wait_ms: 30, component: LoadManager
Jul 02 02:06:32.968 INFO Query overload still happening, event: ongoing, kill_rate: 0.27318925990000004, wait_ms: 27, duration_ms: 37290, component: LoadManager
Jul 02 02:06:40.243 INFO Query overload resolved, event: resolved, wait_ms: 19, duration_ms: 44564, component: LoadManager
```

@That3Percent (Contributor) left a comment:

This was a pleasure to review. It shows a great deal of craftsmanship. I've added some non-critical comments which you may address.

```rust
            .decline(query.shape_hash, query.query_text.as_ref())
        {
            let err = SubscriptionError::GraphQLError(vec![QueryExecutionError::TooExpensive]);
            return Box::new(future::result(Err(err)));
```

Contributor:

If we're in a function using the style of the old futures, let's try migrating it to async/await.

Collaborator Author:

I still feel that I don't understand what that migration entails enough to not screw things up in the process :(

Contributor:

Feel free to ask questions sometime. A couple of us have been through the fire.

```rust
            }
        };
        let wait_avg = {
            let mut wait_stats = self.wait_stats.write().unwrap();
```

Contributor:

I really like this style of using the braces to constrain the lock and evaluating an expression.

```rust
struct QueryEffortInner {
    window_size: Duration,
    bin_size: Duration,
    effort: HashMap<u64, MovingStats>,
```

Contributor:

effort may grow without bound

Collaborator Author:

Yes, very true. I thought briefly about how to address that, but all of the options I could come up with seemed pretty involved. The most performant approach I can think of is to do 'double-buffering': periodically, start a new HashMap and record measurements in both the old and new map for at least window_size, and then drop the old one and use the new one. (The issue here is that I don't want to do the 'reaping' in line with responding to a query, and we don't have any infrastructure for running periodic cleanups)

In a week, we see about 6500 distinct queries (by shape hash), which take < 100MB of storage, so the accumulation of unused query stats should be tolerable.
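
A rough sketch of the double-buffering idea (names and structure here are illustrative, not a concrete plan):

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Sketch of the double-buffering idea: write to both maps, read from the
/// old one, and once the new map has seen at least a full window of data,
/// drop the old map and promote the new one.
struct DoubleBuffered<V> {
    window_size: Duration,
    swapped_at: Instant,
    old: HashMap<u64, V>,
    new: HashMap<u64, V>,
}

impl<V: Default> DoubleBuffered<V> {
    fn record(&mut self, key: u64, update: impl Fn(&mut V)) {
        // Measurements go into both maps during the overlap period
        update(self.old.entry(key).or_default());
        update(self.new.entry(key).or_default());
    }

    fn get(&self, key: u64) -> Option<&V> {
        // Reads always use the map with the full window of history
        self.old.get(&key)
    }

    /// Would be driven by a periodic cleanup task, which we currently
    /// have no infrastructure for.
    fn maybe_swap(&mut self, now: Instant) {
        if now.duration_since(self.swapped_at) >= self.window_size {
            self.old = std::mem::take(&mut self.new);
            self.swapped_at = now;
        }
    }
}
```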

```rust
    /// proportion of the work while the system was overloaded. Currently,
    /// there is no way for a query to get out of jail other than
    /// restarting the process
    jailed_queries: RwLock<HashSet<u64>>,
```

Contributor:

I'm not a big fan of this jail, for the reasons described in the comment. I'd almost rather oscillate, or assign an additional penalty, but disabled forever seems harsh.

Collaborator Author:

The motivation behind it is that this would have ended the load issue we had last week almost immediately by jailing the 2 most expensive queries, which truly were the cause of the issue. When a query lands in jail, some human intervention will be needed: if the query truly is the culprit for high load, it should be configured to be banned; if not, the jail threshold might need to be adjusted (you can turn jailing off completely by setting the threshold to a value > 1).
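
For illustration, the jailing decision amounts to something like this (a sketch, assuming we can attribute to each query its share of the effort spent while the system was overloaded; the actual logic in LoadManager::decline differs in detail):

```rust
use std::collections::HashSet;
use std::sync::RwLock;
use std::time::Duration;

struct Jail {
    /// GRAPH_LOAD_JAIL_THRESHOLD; a value > 1 effectively disables jailing
    jail_threshold: f64,
    jailed_queries: RwLock<HashSet<u64>>,
}

impl Jail {
    /// Returns true if the query should be declined
    fn check(&self, shape_hash: u64, query_effort: Duration, total_effort: Duration) -> bool {
        if self.jailed_queries.read().unwrap().contains(&shape_hash) {
            return true; // already jailed; needs human intervention to undo
        }
        if total_effort.as_secs_f64() == 0.0 {
            return false;
        }
        let share = query_effort.as_secs_f64() / total_effort.as_secs_f64();
        if share > self.jail_threshold {
            // This query did more than its allowed share of the work while
            // we were overloaded: jail it
            self.jailed_queries.write().unwrap().insert(shape_hash);
            return true;
        }
        false
    }
}
```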

lutter added a commit that referenced this pull request Jul 3, 2020
lutter added a commit that referenced this pull request Jul 3, 2020
lutter added 20 commits July 2, 2020 19:06
Instead of logging every time we have to wait more than 100ms for a connection (which we log _a lot_), log the moving average of wait times, but at most every 10s
Make sure that both the kill_rate and the probability with which we drop a query stay between 0 and 1. Trying to call thread_rng().gen_bool(p) with a p outside of [0,1] causes a panic
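
For illustration, the clamping amounts to roughly this:

```rust
use rand::{thread_rng, Rng};

/// Sketch: gen_bool panics if its argument is outside [0, 1], so the
/// computed kill_rate is clamped before we roll the dice.
fn should_drop(kill_rate: f64) -> bool {
    let p = kill_rate.clamp(0.0, 1.0);
    thread_rng().gen_bool(p)
}
```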
@lutter merged commit 75ef747 into master on Jul 3, 2020
@lutter deleted the lutter/load branch on July 3, 2020 02:07