chore(docs): tweak replica verbiage on reference architectures #16076


Merged: 1 commit merged into `main` from `scaletesting-typo` on Jan 14, 2025

Conversation

@stirby (Collaborator) commented Jan 8, 2025

A seller noted that the `/` separator made the node count hard to interpret.

```diff
 | Users       | Node capacity       | Replicas                 | GCP             | AWS        | Azure             |
 |-------------|---------------------|--------------------------|-----------------|------------|-------------------|
-| Up to 1,000 | 2 vCPU, 8 GB memory | 1-2 / 1 coderd each      | `n1-standard-2` | `t3.large` | `Standard_D2s_v3` |
+| Up to 1,000 | 2 vCPU, 8 GB memory | 1-2 nodes, 1 coderd each | `n1-standard-2` | `t3.large` | `Standard_D2s_v3` |
```
Member

Suggested change:

```diff
-| Up to 1,000 | 2 vCPU, 8 GB memory | 1-2 nodes, 1 coderd each | `n1-standard-2` | `t3.large` | `Standard_D2s_v3` |
+| Up to 1,000 | 2 vCPU, 8 GB memory | 1-2 nodes                | `n1-standard-2` | `t3.large` | `Standard_D2s_v3` |
```

Is it technically possible to run more than 1 coderd on each node? If yes, does this benefit any of the use cases or customers? Why would someone run multiple coderd on a single node?

Member

> Is it technically possible to run more than 1 coderd on each node?

Yes, this can happen automatically during a rollout or during node unavailability.
Note that we do set a pod anti-affinity rule [1] in our Helm chart to prefer spreading out replicas across multiple nodes.
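For illustration, a "preferred" pod anti-affinity rule of this general shape looks something like the sketch below. The weight, topology key, and labels here are assumptions on my part; the chart's actual rule is at [1]:

```yaml
affinity:
  podAntiAffinity:
    # "Preferred" is a soft rule: the scheduler tries to put replicas on
    # different nodes, but will still co-locate them if no other node fits.
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 1
        podAffinityTerm:
          topologyKey: kubernetes.io/hostname   # spread across nodes
          labelSelector:
            matchExpressions:
              - key: app.kubernetes.io/instance
                operator: In
                values:
                  - coder   # assumed release label; the chart's selector may differ
```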

> If yes, does this benefit any of the use cases or customers?
> Why would someone run multiple coderd on a single node?

As far as I'm aware, the main reason to do this would be redundancy in case one or more pods become unavailable for whatever reason.

The only other reason I could imagine for running multiple replicas on a single node is to spread out connections across more coderd replicas to minimize the user-facing impact of a single pod failing. However, this won't protect against a failure of the underlying node.
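As a sketch, running multiple replicas is just a matter of raising the replica count in the Helm values; `replicaCount` is the conventional name for this knob, so check the chart's values.yaml for the actual key:

```yaml
coder:
  # Two coderd replicas; with the preferred anti-affinity rule above,
  # the scheduler will try to place them on different nodes.
  replicaCount: 2
```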

I'll defer to @spikecurtis to weigh in more on the pros and cons of running multiple replicas per node.

[1] https://github.com/coder/coder/blob/main/helm/coder/values.yaml#L223-L237

Contributor

In any reference architecture we should always recommend having 1 coderd per node.

There are generally two reasons to run multiple replicas: fault tolerance and scale.

For fault tolerance, you want the replicas spread out across different failure domains. Having all replicas on the same node means you aren't tolerant of node-level faults. There might still be some residual value in being tolerant to replica-level faults (e.g. software crashes, OOM kills), but most people would rather have the higher fault tolerance.

For scale, coderd is written to take advantage of multiple CPU cores in one process, so there is no scale advantage to putting multiple coderd instances on a single node. In fact, it's likely bad for scale, since you have multiple processes competing for resources plus the extra overhead of coderd-to-coderd communication.
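If someone did want to enforce one coderd per node rather than just prefer it, a required anti-affinity rule is the generic Kubernetes way to express that. This is a hypothetical sketch, not something the chart ships, and the label selector is assumed:

```yaml
affinity:
  podAntiAffinity:
    # "Required" is a hard rule: the scheduler will never place two matching
    # pods on the same node, leaving any extra replica Pending instead.
    requiredDuringSchedulingIgnoredDuringExecution:
      - topologyKey: kubernetes.io/hostname
        labelSelector:
          matchLabels:
            app.kubernetes.io/name: coder   # assumed pod label; match the chart's labels
```

The trade-off is that a hard rule can leave replicas unschedulable during rollouts or node maintenance, which is presumably why the chart defaults to the softer "preferred" variant.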

@stirby merged commit 5380690 into main on Jan 14, 2025 (28 checks passed).
@stirby deleted the scaletesting-typo branch on January 14, 2025.
The github-actions bot locked and limited conversation to collaborators on Jan 14, 2025.