
Commit 6451c29

copy edit
1 parent f8c5158 commit 6451c29

File tree

1 file changed: +67 −75 lines changed

docs/tutorials/best-practices/scale-coder.md

@@ -1,34 +1,33 @@

# Scale Coder

This best practice guide helps you prepare a low-scale Coder deployment that you can
scale up to a high-scale deployment as use grows, and keep it operating smoothly with a
high number of active users and workspaces.

## Observability

Observability is one of the most important aspects to a scalable Coder deployment.
When you have visibility into performance and usage metrics, you can make informed
decisions about what changes you should make.

[Monitor your Coder deployment](../../admin/monitoring/index.md) with log output
and metrics to identify potential bottlenecks before they negatively affect the
end-user experience and measure the effects of modifications you make to your
deployment.

- Log output
  - Capture log output from Loki, CloudWatch logs, and other tools on your Coder Server
    instances and external provisioner daemons and store them in a searchable log store.
  - Retain logs for a minimum of thirty days, ideally ninety days.
    This allows you to look back to see when anomalous behaviors began.

- Metrics
  - Capture infrastructure metrics like CPU, memory, open files, and network I/O for all
    Coder Server, external provisioner daemon, workspace proxy, and PostgreSQL instances.

### Capture Coder server metrics with Prometheus

Edit your Helm `values.yaml` to capture metrics from Coder Server and external
provisioner daemons with [Prometheus](../../admin/integrations/prometheus.md):

1. Enable Prometheus metrics:
@@ -49,40 +48,35 @@ To capture metrics from Coder Server and external provisioner daemons with
   CODER_PROMETHEUS_AGGREGATE_AGENT_STATS_BY=agent_name
   ```

   - To disable agent stats:

     ```yaml
     CODER_PROMETHEUS_COLLECT_AGENT_STATS=false
     ```

Retain metric time series for at least six months. This allows you to see
performance trends relative to user growth.

For a more comprehensive overview, integrate metrics with an observability
dashboard like [Grafana](../../admin/monitoring/index.md).
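
For example, a Prometheus scrape job for Coder Server might look like the sketch
below. The target address is an assumption based on the metrics listener configured
by `CODER_PROMETHEUS_ADDRESS`; point it at wherever your deployment exposes metrics.

```yaml
# prometheus.yml (sketch): scrape Coder Server metrics.
# The job name and target are placeholders; use the address set by
# CODER_PROMETHEUS_ADDRESS in your deployment.
scrape_configs:
  - job_name: coder
    scrape_interval: 30s
    static_configs:
      - targets: ["coder.example.com:2112"]
```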

### Observability key metrics

Configure alerting based on these metrics to ensure you surface problems before
they affect the end-user experience.

- CPU and Memory Utilization
  - Monitor the utilization as a fraction of the available resources on the instance.

    Utilization will vary with use throughout the course of a day, week, and longer timelines. Monitor trends and pay special attention to the daily and weekly peak utilization. Use long-term trends to plan infrastructure upgrades.

- Tail latency of Coder Server API requests
  - High tail latency can indicate Coder Server or the PostgreSQL database is low on resources.
  - Use the `coderd_api_request_latencies_seconds` metric.

- Tail latency of database queries
  - High tail latency can indicate the PostgreSQL database is low on resources.
  - Use the `coderd_db_query_latencies_seconds` metric.
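
For example, you could alert on sustained p95 latency with Prometheus rules along
these lines. The thresholds are illustrative, and the `_bucket` series names assume
these metrics are exposed as Prometheus histograms; verify both against your deployment.

```yaml
# coder-latency-rules.yml (sketch): tune thresholds for your environment.
groups:
  - name: coder-latency
    rules:
      - alert: CoderAPITailLatencyHigh
        # p95 Coder Server API request latency over the last 5 minutes.
        expr: histogram_quantile(0.95, sum(rate(coderd_api_request_latencies_seconds_bucket[5m])) by (le)) > 1
        for: 10m
      - alert: CoderDBQueryTailLatencyHigh
        # p95 database query latency over the last 5 minutes.
        expr: histogram_quantile(0.95, sum(rate(coderd_db_query_latencies_seconds_bucket[5m])) by (le)) > 0.1
        for: 10m
```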

## Coder Server

@@ -116,8 +110,7 @@ Coder's
[validated architectures](../../admin/infrastructure/validated-architectures/index.md)
give specific sizing recommendations for various user scales. These are a useful
starting point, but very few deployments will remain stable at a predetermined
user level over the long term. We recommend monitoring and adjusting resources as needed.

We don't recommend that you autoscale the Coder Servers. Instead, scale the
deployment for peak weekly usage.
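
For example, with the official Helm chart you might pin a fixed replica count and
resource envelope sized for peak usage. This is a sketch; the `coder.replicaCount`
and `coder.resources` keys are assumptions to check against your chart version.

```yaml
# values.yaml (sketch): size Coder Server for peak weekly usage instead of autoscaling.
coder:
  replicaCount: 3
  resources:
    requests:
      cpu: "2"
      memory: 4Gi
    limits:
      cpu: "4"
      memory: 8Gi
```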

@@ -128,75 +121,74 @@ users to their workspaces in two capacities:
1. As an HTTP proxy when they access workspace applications in their browser via
   the Coder Dashboard

1. As a DERP proxy when establishing tunneled connections with CLI tools like
   `coder ssh`, `coder port-forward`, and others, and with desktop IDEs.

Stopping a Coder Server instance will (momentarily) disconnect any users
currently connecting through that instance. Adding a new instance is not
disruptive, but you should remove instances and perform upgrades during a
maintenance window to minimize disruption.

## Provisioner daemons

### Locality

We recommend that you run one or more
[provisioner daemon deployments external to Coder Server](../../admin/provisioners.md)
and disable provisioner daemons within your Coder Server.
This allows you to scale them independently of the Coder Server:

```yaml
CODER_PROVISIONER_DAEMONS=0
```
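
If you run Coder Server from the Helm chart, one way to apply this is through the
chart's environment list. A minimal sketch, assuming your chart version exposes
`coder.env`:

```yaml
# values.yaml (sketch): disable built-in provisioner daemons on Coder Server
# so that only your external provisioner deployments run builds.
coder:
  env:
    - name: CODER_PROVISIONER_DAEMONS
      value: "0"
```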

We recommend deploying provisioner daemons within the same cluster as the
workspaces they will provision or are hosted in.

- This gives them a low-latency connection to the APIs they will use to
  provision workspaces and can speed builds.

- It allows provisioner daemons to use in-cluster mechanisms (for example
  Kubernetes service account tokens, AWS IAM Roles, and others) to authenticate with
  the infrastructure APIs.

- If you deploy workspaces in multiple clusters, run multiple provisioner daemon
  deployments and use template tags to select the correct set of provisioner
  daemons.

- Provisioner daemons need to be able to connect to Coder Server, but this does not need
  to be a low-latency connection.

Provisioner daemons make no direct connections to the PostgreSQL database, so
there's no need for locality to the Postgres database.

### Scaling

Each provisioner daemon instance can handle a single workspace build job at a
time. Therefore, the maximum number of simultaneous builds your Coder deployment
can handle is equal to the number of provisioner daemon instances within a tagged
deployment.

If users experience unacceptably long queues for workspace builds to start,
consider increasing the number of provisioner daemon instances in the affected
cluster.

You might need to automatically scale the number of provisioner daemon instances
throughout the day to meet demand.

If you stop instances with `SIGHUP`, they will complete their current build job
and exit. `SIGINT` will cancel the current job, which will result in a failed build.
Ensure your autoscaler waits long enough for your build jobs to complete before
it kills the provisioner daemon process.

If you deploy in Kubernetes, we recommend a single provisioner daemon per pod.
On a virtual machine (VM), you can deploy multiple provisioner daemons, ensuring
each has a unique `CODER_CACHE_DIRECTORY` value.
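
A hypothetical Kubernetes Deployment for external provisioner daemons is sketched
below: one daemon per pod, and a termination grace period long enough for in-flight
builds to finish before the autoscaler removes a pod. The start subcommand, URL, and
grace period are assumptions to adapt, and authentication is omitted; follow
[the provisioners guide](../../admin/provisioners.md) for the exact setup.

```yaml
# provisioner-deployment.yaml (sketch): one provisioner daemon per pod.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: coder-provisioner
spec:
  replicas: 4                              # max simultaneous builds for this tagged set
  selector:
    matchLabels:
      app: coder-provisioner
  template:
    metadata:
      labels:
        app: coder-provisioner
    spec:
      terminationGracePeriodSeconds: 1800  # wait for in-flight builds before the pod is killed
      containers:
        - name: provisioner
          image: ghcr.io/coder/coder:latest
          args: ["provisionerd", "start"]  # subcommand is an assumption; check your CLI version
          env:
            - name: CODER_URL
              value: "https://coder.example.com"
            - name: CODER_CACHE_DIRECTORY
              value: /tmp/coder-cache      # safe here because each pod runs a single daemon
          # Provisioner authentication (key or PSK) is omitted; see the provisioners guide.
```

On a VM, the equivalent concern is giving each daemon process its own cache path, for
example `/var/cache/coder/provisioner-1` and `/var/cache/coder/provisioner-2`.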

Coder's
[validated architectures](../../admin/infrastructure/validated-architectures/index.md)
give specific sizing recommendations for various user scales. Since the
complexity of builds varies significantly depending on the workspace template,
consider this a starting point. Monitor queue times and build times and adjust
the number and size of your provisioner daemon instances.

## PostgreSQL

@@ -232,20 +224,22 @@ path.

### Scaling

Workspace proxy load is determined by the amount of traffic they proxy.

Monitor CPU, memory, and network I/O utilization to decide when to resize
the number of proxy instances.
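
If you already scrape the proxy hosts with node_exporter, a couple of Prometheus
recording rules can make those resize decisions easier to read at a glance. The
metric names below assume node_exporter defaults; substitute your own exporters'
series if you collect these differently.

```yaml
# proxy-recording-rules.yml (sketch): per-instance utilization for workspace proxy hosts.
groups:
  - name: workspace-proxy-utilization
    rules:
      - record: instance:cpu_utilization:ratio
        expr: 1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))
      - record: instance:network_transmit_bytes:rate5m
        expr: sum by (instance) (rate(node_network_transmit_bytes_total[5m]))
```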

Scale for peak demand and scale down or upgrade during a maintenance window.

We do not recommend autoscaling the workspace proxies because many applications
use long-lived connections such as websockets, which would be disrupted by
stopping the proxy.

## Workspaces

Workspaces represent the vast majority of resources in most Coder deployments.
Because they are defined by templates, there is no one-size-fits-all advice for
scaling workspaces.

### Hard and soft cluster limits

@@ -259,25 +253,23 @@ utilization against the limits, so that a new influx of users doesn't encounter
failed builds. Monitoring these is outside the scope of Coder, but we recommend
that you set up dashboards and alerts for each kind of limited resource.
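
For example, if your workspaces run on Kubernetes and you export kube-state-metrics,
you might warn as the node count approaches a quota agreed with your cloud provider.
The quota value here is a placeholder.

```yaml
# cluster-limits-rules.yml (sketch): warn before a node-count quota is reached.
groups:
  - name: cluster-limits
    rules:
      - alert: NodeCountNearQuota
        expr: count(kube_node_info) > 90   # replace 90 with ~90% of your actual quota
        for: 15m
```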

As you approach soft limits, you can increase limits to keep growing.

As you approach hard limits, consider deploying to additional cluster(s).

### Workspaces per node

Many development workloads are "spiky" in their CPU and memory requirements, for
example, they peak during build/test and then ebb while editing code.
This leads to an opportunity to efficiently use compute resources by packing multiple
workspaces onto a single node. This can lead to better experience (more CPU and
memory available during brief bursts) and lower cost.
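
In Kubernetes terms, packing usually means setting workspace pod resource requests
well below their limits so that several workspaces fit on one node while each can
still burst. A sketch of the idea, with illustrative values that would live in your
template's pod spec:

```yaml
# Workspace pod resources (sketch): requests govern packing, limits govern bursts.
resources:
  requests:
    cpu: 250m       # what the scheduler reserves per workspace
    memory: 1Gi
  limits:
    cpu: "4"        # what a workspace may burst to during builds and tests
    memory: 8Gi
```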

There are a number of things you should consider before you decide how many
workspaces you should allow per node:

- "Noisy neighbor" issues: Users share the node's CPU and memory resources and might
  be susceptible to a user or process consuming shared resources.

- If the shared nodes are a provisioned resource, for example, Kubernetes nodes
  running on VMs in a public cloud, then it can sometimes be a challenge to
