
Commit 622456f

docs: Add autoscale recommendations docs (coder#7617)

* Add autoscale recommendations
* review updates

Signed-off-by: Spike Curtis <[email protected]>

1 parent 4a32061 commit 622456f

File tree

1 file changed: +29 −0

docs/admin/scale.md

@@ -64,6 +64,35 @@ The test does the following:
Concurrency is configurable. `concurrency 0` means the scale test will attempt to create & connect to all workspaces immediately.
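
For example, a sketch of a scale test invocation with unlimited concurrency; apart from `--concurrency`, which the note above describes, the `create-workspaces` subcommand and the `--count`/`--template` flags here are assumptions and may differ by Coder version:

```shell
# Create 100 workspaces from the template "my-template", connecting to
# all of them immediately (concurrency 0 disables throttling).
coder scaletest create-workspaces \
  --count 100 \
  --template my-template \
  --concurrency 0
```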

## Autoscaling

We generally do not recommend using an autoscaler that modifies the number of coderd replicas. In particular, scale-down events can interrupt a large number of users.

Coderd differs from a simple request-response HTTP service in that it services long-lived connections whenever it proxies HTTP applications like IDEs or terminals that rely on WebSockets, or when it relays tunneled connections to workspaces. Losing a coderd replica drops these long-lived connections and interrupts users. For example, if you have 4 coderd replicas behind a load balancer and an autoscaler decides to reduce that to 3, roughly 25% of the connections will drop. An even larger proportion of users could be affected if their applications hold more than one WebSocket connection.

The severity of the interruption varies by application. Coder's web terminal, for example, will reconnect to the same session and continue. So this guidance should not be read as saying coderd replicas can never be taken down.

We recommend running enough coderd replicas to comfortably meet your weekly high-water-mark load, monitoring coderd's peak CPU & memory utilization over the long term, and reevaluating periodically. When scaling down (or performing upgrades), schedule these operations outside normal working hours to minimize user interruptions.
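
As a starting point, a point-in-time spot check of coderd resource usage on Kubernetes might look like the sketch below; the label selector and deployment name are assumptions about your install, and long-term peak tracking belongs in your metrics stack rather than ad-hoc commands:

```shell
# Spot-check current CPU & memory usage of coderd pods (requires metrics-server).
kubectl top pods -l app.kubernetes.io/name=coder

# Confirm how many replicas are serving traffic right now.
kubectl get deployment coder
```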
### A note for Kubernetes users

When running on Kubernetes on cloud infrastructure (i.e. not bare metal), many operators choose to employ a _cluster_ autoscaler that adds and removes Kubernetes _nodes_ according to load. Coder can coexist with such cluster autoscalers, but we recommend you take steps to prevent the autoscaler from evicting coderd pods, as an eviction will cause the same interruptions described above. For example, if you are using the [Kubernetes cluster autoscaler](https://kubernetes.io/docs/reference/labels-annotations-taints/#cluster-autoscaler-kubernetes-io-safe-to-evict), you may wish to set `cluster-autoscaler.kubernetes.io/safe-to-evict: "false"` as an annotation on the coderd pods (via the pod template in the coderd deployment).
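
One way to apply that annotation, as a sketch (the deployment name `coder` is an assumption; substitute your own):

```shell
# Annotate the pod template so the cluster autoscaler will not evict
# coderd pods when it scales nodes down.
kubectl patch deployment coder --type merge -p \
  '{"spec":{"template":{"metadata":{"annotations":{"cluster-autoscaler.kubernetes.io/safe-to-evict":"false"}}}}}'
```

If the deployment is managed declaratively (e.g. with Helm), set the annotation in your manifests or chart values instead, so it is not overwritten on the next upgrade.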
## Troubleshooting

If a load test fails or if you are experiencing performance issues during day-to-day use, you can leverage Coder's [Prometheus metrics](./prometheus.md) to identify bottlenecks during scale tests. Additionally, you can use your existing cloud monitoring stack to measure load, view server logs, etc.
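
Coder's Prometheus endpoint must be enabled before those metrics can be scraped; a minimal sketch, assuming the environment variables described in the linked doc:

```shell
# Enable the Prometheus endpoint on coder server.
# Verify variable names and the default address against the linked doc.
export CODER_PROMETHEUS_ENABLE=true
export CODER_PROMETHEUS_ADDRESS=0.0.0.0:2112
coder server
```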
