From 2dc5f2fc8b8825d38b84f9d7c6afc272a6403d50 Mon Sep 17 00:00:00 2001 From: Marcin Tojek Date: Mon, 11 Mar 2024 17:00:58 +0100 Subject: [PATCH 01/39] docs: hardware recommendations for reference architectures --- docs/admin/reference-architectures.md | 48 +++++++++++++++++++++++++++ 1 file changed, 48 insertions(+) diff --git a/docs/admin/reference-architectures.md b/docs/admin/reference-architectures.md index 6f9cbd8d11753..1e8810212d83d 100644 --- a/docs/admin/reference-architectures.md +++ b/docs/admin/reference-architectures.md @@ -171,3 +171,51 @@ Database: metadata. - Memory utilization averages at 40%. - `write_ops_count` between 6.7 and 8.4 operations per second. + +## Hardware recommendation + +### Control plane + +When considering the control plane, it's essential to focus on node sizing, +resource limits, and the number of replicas. We recommend referencing public +cloud providers such as AWS, GCP, and Azure for guidance on optimal +configurations. A reasonable approach involves using scaling formulas based on +factors like CPU, memory, and the number of users. + +#### Up to 1,000 users + +The 1k architecture is designed to cover a wide range of workflows. While the +minimum requirements specify 1 CPU core and 2 GB of memory per `coderd` replica, +it is recommended to allocate additional resources to ensure deployment +stability: + +| Users | Configuration | Replicas | GCP | AWS | Azure | +| ----------- | ------------------- | -------- | --------------- | ---------- | ----------------- | +| Up to 1,000 | 2 vCPU, 8 GB memory | 2 | `n1-standard-2` | `t3.large` | `Standard_D2s_v3` | + +The memory consumption may increase with enabled agent stats collection by the +Prometheus metrics aggregator (optional). 
+ +#### Up to 2,000 users + +TODO + +#### Up to 3,000 users + +TODO + +#### Scaling formula + +reasonable ratio/formula: CPU x memory x users reasonable ratio/formula: +provisionerd x users API latency/response time average number of HTTP requests +advice: + +### Workspaces + +TODO + +### Database + +TODO + +### From d2e1f4268bf13ccb826647349b643a93e575327d Mon Sep 17 00:00:00 2001 From: Marcin Tojek Date: Tue, 12 Mar 2024 11:30:15 +0100 Subject: [PATCH 02/39] Achrs --- docs/admin/reference-architectures.md | 48 +++++++++++++++++++++++---- 1 file changed, 41 insertions(+), 7 deletions(-) diff --git a/docs/admin/reference-architectures.md b/docs/admin/reference-architectures.md index 1e8810212d83d..2f0cc487b0dd5 100644 --- a/docs/admin/reference-architectures.md +++ b/docs/admin/reference-architectures.md @@ -172,6 +172,35 @@ Database: - Memory utilization averages at 40%. - `write_ops_count` between 6.7 and 8.4 operations per second. +## Available reference architectures + +### Up to 1,000 users + +The 1,000 users architecture is designed to cover a wide range of workflows. +Examples of subjects that might utilize this architecture include medium-sized +tech startups, educational units, or small to mid-sized enterprises. + +### Up to 2,000 users + +In the 2,000 users architecture, there is a moderate increase in traffic, +suggesting a growing user base or expanding operations. This setup is +well-suited for mid-sized companies experiencing growth or for universities +seeking to accommodate their expanding user populations. + +Users can be evenly distributed between 2 regions or be attached to different +clusters. + +The High Available mode is disabled in this setup, but administrators may +consider enabling it. + +### Up to 3,000 users + +The 3,000 users architecture targets large-scale enterprises, possibly with +on-premises network and cloud deployments. 
+ +Typically, such scale requires a fully-managed HA PostgreSQL service, and all +Coder observability features enabled for operational purposes. + ## Hardware recommendation ### Control plane @@ -182,20 +211,25 @@ cloud providers such as AWS, GCP, and Azure for guidance on optimal configurations. A reasonable approach involves using scaling formulas based on factors like CPU, memory, and the number of users. +**Notice about CPU and memory usage** + +The memory consumption may increase with enabled agent stats collection by the +Prometheus metrics aggregator (optional). + +Enabling direct connections between users and workspace agents (apps or SSH +traffic) can help prevent an increase in CPU usage. It is recommended to keep +this option enabled unless there are compelling reasons to disable it. + #### Up to 1,000 users -The 1k architecture is designed to cover a wide range of workflows. While the -minimum requirements specify 1 CPU core and 2 GB of memory per `coderd` replica, -it is recommended to allocate additional resources to ensure deployment -stability: +While the minimum requirements specify 1 CPU core and 2 GB of memory per +`coderd` replica, it is recommended to allocate additional resources to ensure +deployment stability: | Users | Configuration | Replicas | GCP | AWS | Azure | | ----------- | ------------------- | -------- | --------------- | ---------- | ----------------- | | Up to 1,000 | 2 vCPU, 8 GB memory | 2 | `n1-standard-2` | `t3.large` | `Standard_D2s_v3` | -The memory consumption may increase with enabled agent stats collection by the -Prometheus metrics aggregator (optional). 
- #### Up to 2,000 users TODO From 842ed5891e08b3de5a2d7823e1cb4742f4fb5c96 Mon Sep 17 00:00:00 2001 From: Marcin Tojek Date: Tue, 12 Mar 2024 12:01:55 +0100 Subject: [PATCH 03/39] WIP --- docs/admin/reference-architectures.md | 20 +++++++++++++------- 1 file changed, 13 insertions(+), 7 deletions(-) diff --git a/docs/admin/reference-architectures.md b/docs/admin/reference-architectures.md index 2f0cc487b0dd5..fe1d56f820412 100644 --- a/docs/admin/reference-architectures.md +++ b/docs/admin/reference-architectures.md @@ -211,6 +211,10 @@ cloud providers such as AWS, GCP, and Azure for guidance on optimal configurations. A reasonable approach involves using scaling formulas based on factors like CPU, memory, and the number of users. +While the minimum requirements specify 1 CPU core and 2 GB of memory per +`coderd` replica, it is recommended to allocate additional resources to ensure +deployment stability. + **Notice about CPU and memory usage** The memory consumption may increase with enabled agent stats collection by the @@ -220,23 +224,25 @@ Enabling direct connections between users and workspace agents (apps or SSH traffic) can help prevent an increase in CPU usage. It is recommended to keep this option enabled unless there are compelling reasons to disable it. -#### Up to 1,000 users +Inactive users do not consume Coder resources. 
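Since inactive users do not consume Coder resources, capacity planning can target expected peak concurrency rather than total seats. A minimal sketch, assuming a made-up 40% peak-activity share (not a measured Coder figure):

```python
# Illustrative only: estimate concurrently active users for sizing purposes.
# The 40% peak-activity share is an assumption, not a documented benchmark.

def peak_active_users(total_seats: int, peak_activity_pct: int = 40) -> int:
    # Integer ceiling division, avoiding float rounding surprises.
    return -(-total_seats * peak_activity_pct // 100)

print(peak_active_users(2500))  # 1000 -> plan against the 1,000-user tier
```

With 2,500 total seats and that assumed activity share, the deployment would be sized against the 1,000-user tier described above.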
-While the minimum requirements specify 1 CPU core and 2 GB of memory per -`coderd` replica, it is recommended to allocate additional resources to ensure -deployment stability: +#### Up to 1,000 users -| Users | Configuration | Replicas | GCP | AWS | Azure | +| Users | Cluster capacity | Replicas | GCP | AWS | Azure | | ----------- | ------------------- | -------- | --------------- | ---------- | ----------------- | | Up to 1,000 | 2 vCPU, 8 GB memory | 2 | `n1-standard-2` | `t3.large` | `Standard_D2s_v3` | #### Up to 2,000 users -TODO +| Users | Cluster capacity | Replicas | GCP | AWS | Azure | +| ----------- | -------------------- | -------- | --------------- | ----------- | ----------------- | +| Up to 2,000 | 4 vCPU, 16 GB memory | 2 | `n1-standard-4` | `t3.xlarge` | `Standard_D4s_v3` | #### Up to 3,000 users -TODO +| Users | Cluster capacity | Replicas | GCP | AWS | Azure | +| ----------- | -------------------- | -------- | --------------- | ----------- | ----------------- | +| Up to 3,000 | 8 vCPU, 32 GB memory | 4 | `n1-standard-4` | `t3.xlarge` | `Standard_D4s_v3` | #### Scaling formula From 46f3dc2f49df50d36cc3abcd4a9d109c7b7f1525 Mon Sep 17 00:00:00 2001 From: Marcin Tojek Date: Tue, 12 Mar 2024 12:08:53 +0100 Subject: [PATCH 04/39] remodelled --- docs/admin/reference-architectures.md | 21 ++++++++++++++------- 1 file changed, 14 insertions(+), 7 deletions(-) diff --git a/docs/admin/reference-architectures.md b/docs/admin/reference-architectures.md index fe1d56f820412..995ea599e6a35 100644 --- a/docs/admin/reference-architectures.md +++ b/docs/admin/reference-architectures.md @@ -215,7 +215,7 @@ While the minimum requirements specify 1 CPU core and 2 GB of memory per `coderd` replica, it is recommended to allocate additional resources to ensure deployment stability. -**Notice about CPU and memory usage** +#### CPU and memory usage The memory consumption may increase with enabled agent stats collection by the Prometheus metrics aggregator (optional). 
@@ -226,6 +226,19 @@ this option enabled unless there are compelling reasons to disable it. Inactive users do not consume Coder resources. +#### HTTP API + +API latency/response time average number of HTTP requests + +TODO + +#### Scaling formula + +reasonable ratio/formula: CPU x memory x users reasonable ratio/formula: +provisionerd x users API latency/response time average number of HTTP requests + +TODO + #### Up to 1,000 users | Users | Cluster capacity | Replicas | GCP | AWS | Azure | @@ -244,12 +257,6 @@ Inactive users do not consume Coder resources. | ----------- | -------------------- | -------- | --------------- | ----------- | ----------------- | | Up to 3,000 | 8 vCPU, 32 GB memory | 4 | `n1-standard-4` | `t3.xlarge` | `Standard_D4s_v3` | -#### Scaling formula - -reasonable ratio/formula: CPU x memory x users reasonable ratio/formula: -provisionerd x users API latency/response time average number of HTTP requests -advice: - ### Workspaces TODO From 59654fd1e2e2fc56a11138fc8abe90c82103a9d0 Mon Sep 17 00:00:00 2001 From: Marcin Tojek Date: Tue, 12 Mar 2024 12:34:32 +0100 Subject: [PATCH 05/39] subpages --- docs/admin/architectures/1k-users.md | 13 +++++ docs/admin/architectures/2k-users.md | 20 ++++++++ docs/admin/architectures/3k-users.md | 15 ++++++ .../index.md} | 47 ++----------------- docs/manifest.json | 22 ++++++++- 5 files changed, 73 insertions(+), 44 deletions(-) create mode 100644 docs/admin/architectures/1k-users.md create mode 100644 docs/admin/architectures/2k-users.md create mode 100644 docs/admin/architectures/3k-users.md rename docs/admin/{reference-architectures.md => architectures/index.md} (81%) diff --git a/docs/admin/architectures/1k-users.md b/docs/admin/architectures/1k-users.md new file mode 100644 index 0000000000000..225035e7429bd --- /dev/null +++ b/docs/admin/architectures/1k-users.md @@ -0,0 +1,13 @@ +# Reference Architecture: up to 1,000 users + +The 1,000 users architecture is designed to cover a wide range of workflows. 
+Examples of subjects that might utilize this architecture include medium-sized +tech startups, educational units, or small to mid-sized enterprises. + +## Hardware recommendations + +### Coderd nodes + +| Users | Cluster capacity | Replicas | GCP | AWS | Azure | +| ----------- | ------------------- | -------- | --------------- | ---------- | ----------------- | +| Up to 1,000 | 2 vCPU, 8 GB memory | 2 | `n1-standard-2` | `t3.large` | `Standard_D2s_v3` | diff --git a/docs/admin/architectures/2k-users.md b/docs/admin/architectures/2k-users.md new file mode 100644 index 0000000000000..580f64a4b8ea1 --- /dev/null +++ b/docs/admin/architectures/2k-users.md @@ -0,0 +1,20 @@ +# Reference Architecture: up to 2,000 users + +In the 2,000 users architecture, there is a moderate increase in traffic, +suggesting a growing user base or expanding operations. This setup is +well-suited for mid-sized companies experiencing growth or for universities +seeking to accommodate their expanding user populations. + +Users can be evenly distributed between 2 regions or be attached to different +clusters. + +The High Available mode is disabled in this setup, but administrators may +consider enabling it. + +## Hardware recommendations + +### Coderd nodes + +| Users | Cluster capacity | Replicas | GCP | AWS | Azure | +| ----------- | -------------------- | -------- | --------------- | ----------- | ----------------- | +| Up to 2,000 | 4 vCPU, 16 GB memory | 2 | `n1-standard-4` | `t3.xlarge` | `Standard_D4s_v3` | diff --git a/docs/admin/architectures/3k-users.md b/docs/admin/architectures/3k-users.md new file mode 100644 index 0000000000000..2a9f97ad1b7a2 --- /dev/null +++ b/docs/admin/architectures/3k-users.md @@ -0,0 +1,15 @@ +# Reference Architecture: up to 3,000 users + +The 3,000 users architecture targets large-scale enterprises, possibly with +on-premises network and cloud deployments. 
+ +Typically, such scale requires a fully-managed HA PostgreSQL service, and all +Coder observability features enabled for operational purposes. + +## Hardware recommendations + +### Coderd nodes + +| Users | Cluster capacity | Replicas | GCP | AWS | Azure | +| ----------- | -------------------- | -------- | --------------- | ----------- | ----------------- | +| Up to 3,000 | 8 vCPU, 32 GB memory | 4 | `n1-standard-4` | `t3.xlarge` | `Standard_D4s_v3` | diff --git a/docs/admin/reference-architectures.md b/docs/admin/architectures/index.md similarity index 81% rename from docs/admin/reference-architectures.md rename to docs/admin/architectures/index.md index 995ea599e6a35..52a288da7c3f5 100644 --- a/docs/admin/reference-architectures.md +++ b/docs/admin/architectures/index.md @@ -1,4 +1,4 @@ -# Reference architectures +# Reference Architectures This document provides prescriptive solutions and reference architectures to support successful deployments of up to 2000 users and outlines at a high-level @@ -174,32 +174,11 @@ Database: ## Available reference architectures -### Up to 1,000 users +[Up to 1,000 users](1k-users.md) -The 1,000 users architecture is designed to cover a wide range of workflows. -Examples of subjects that might utilize this architecture include medium-sized -tech startups, educational units, or small to mid-sized enterprises. +[Up to 2,000 users](2k-users.md) -### Up to 2,000 users - -In the 2,000 users architecture, there is a moderate increase in traffic, -suggesting a growing user base or expanding operations. This setup is -well-suited for mid-sized companies experiencing growth or for universities -seeking to accommodate their expanding user populations. - -Users can be evenly distributed between 2 regions or be attached to different -clusters. - -The High Available mode is disabled in this setup, but administrators may -consider enabling it. 
- -### Up to 3,000 users - -The 3,000 users architecture targets large-scale enterprises, possibly with -on-premises network and cloud deployments. - -Typically, such scale requires a fully-managed HA PostgreSQL service, and all -Coder observability features enabled for operational purposes. +[Up to 3,000 users](3k-users.md) ## Hardware recommendation @@ -239,24 +218,6 @@ provisionerd x users API latency/response time average number of HTTP requests TODO -#### Up to 1,000 users - -| Users | Cluster capacity | Replicas | GCP | AWS | Azure | -| ----------- | ------------------- | -------- | --------------- | ---------- | ----------------- | -| Up to 1,000 | 2 vCPU, 8 GB memory | 2 | `n1-standard-2` | `t3.large` | `Standard_D2s_v3` | - -#### Up to 2,000 users - -| Users | Cluster capacity | Replicas | GCP | AWS | Azure | -| ----------- | -------------------- | -------- | --------------- | ----------- | ----------------- | -| Up to 2,000 | 4 vCPU, 16 GB memory | 2 | `n1-standard-4` | `t3.xlarge` | `Standard_D4s_v3` | - -#### Up to 3,000 users - -| Users | Cluster capacity | Replicas | GCP | AWS | Azure | -| ----------- | -------------------- | -------- | --------------- | ----------- | ----------------- | -| Up to 3,000 | 8 vCPU, 32 GB memory | 4 | `n1-standard-4` | `t3.xlarge` | `Standard_D4s_v3` | - ### Workspaces TODO diff --git a/docs/manifest.json b/docs/manifest.json index 1b70f9147d950..362820a79da2a 100644 --- a/docs/manifest.json +++ b/docs/manifest.json @@ -375,10 +375,30 @@ }, { "title": "Scaling Coder", - "description": "Reference architecture and load testing tools", + "description": "Learn how to use load testing tools", "path": "./admin/scale.md", "icon_path": "./images/icons/scale.svg" }, + { + "title": "Reference Architectures", + "description": "Learn about reference architectures for Coder", + "path": "./admin/architectures/index.md", + "icon_path": "./images/icons/scale.svg", + "children": [ + { + "title": "Up to 1,000 users", + "path": 
"./admin/architectures/1k-users.md" + }, + { + "title": "Up to 2,000 users", + "path": "./admin/architectures/2k-users.md" + }, + { + "title": "Up to 3,000 users", + "path": "./admin/architectures/3k-users.md" + } + ] + }, { "title": "External Provisioners", "description": "Run provisioners isolated from the Coder server", From 894cddbcdffdc2dbeef64da39d1a5420acadf8cb Mon Sep 17 00:00:00 2001 From: Marcin Tojek Date: Tue, 12 Mar 2024 12:55:09 +0100 Subject: [PATCH 06/39] Target load --- docs/admin/architectures/1k-users.md | 4 ++++ docs/admin/architectures/2k-users.md | 6 ++++-- docs/admin/architectures/3k-users.md | 7 +++++-- docs/admin/architectures/index.md | 8 +++++--- 4 files changed, 18 insertions(+), 7 deletions(-) diff --git a/docs/admin/architectures/1k-users.md b/docs/admin/architectures/1k-users.md index 225035e7429bd..fb44ae54f696c 100644 --- a/docs/admin/architectures/1k-users.md +++ b/docs/admin/architectures/1k-users.md @@ -4,6 +4,10 @@ The 1,000 users architecture is designed to cover a wide range of workflows. Examples of subjects that might utilize this architecture include medium-sized tech startups, educational units, or small to mid-sized enterprises. +**Target load**: API: up to 180 RPS + +**High Availability**: non-essential for small deployments + ## Hardware recommendations ### Coderd nodes diff --git a/docs/admin/architectures/2k-users.md b/docs/admin/architectures/2k-users.md index 580f64a4b8ea1..5f479fc2f45a0 100644 --- a/docs/admin/architectures/2k-users.md +++ b/docs/admin/architectures/2k-users.md @@ -8,8 +8,10 @@ seeking to accommodate their expanding user populations. Users can be evenly distributed between 2 regions or be attached to different clusters. -The High Available mode is disabled in this setup, but administrators may -consider enabling it. +**Target load**: API: up to 300 RPS + +**High Availability**: The mode is _disabled_, but administrators may consider +enabling it for deployment reliability. 
## Hardware recommendations diff --git a/docs/admin/architectures/3k-users.md b/docs/admin/architectures/3k-users.md index 2a9f97ad1b7a2..aa4fe92cd8469 100644 --- a/docs/admin/architectures/3k-users.md +++ b/docs/admin/architectures/3k-users.md @@ -3,8 +3,11 @@ The 3,000 users architecture targets large-scale enterprises, possibly with on-premises network and cloud deployments. -Typically, such scale requires a fully-managed HA PostgreSQL service, and all -Coder observability features enabled for operational purposes. +**Target load**: API: up to 550 RPS + +**High Availability**: Typically, such scale requires a fully-managed HA +PostgreSQL service, and all Coder observability features enabled for operational +purposes. ## Hardware recommendations diff --git a/docs/admin/architectures/index.md b/docs/admin/architectures/index.md index 52a288da7c3f5..998015a897f3c 100644 --- a/docs/admin/architectures/index.md +++ b/docs/admin/architectures/index.md @@ -156,8 +156,8 @@ Coder: - Median CPU usage for _coderd_: 3 vCPU, peaking at 3.7 vCPU during dashboard tests. -- Median API request rate: 350 req/s during dashboard tests, 250 req/s during - Web Terminal and workspace apps tests. +- Median API request rate: 350 RPS during dashboard tests, 250 RPS during Web + Terminal and workspace apps tests. - 2000 agent API connections with latency: p90 at 60 ms, p95 at 220 ms. - on average 2400 Web Socket connections during dashboard tests. @@ -205,10 +205,12 @@ this option enabled unless there are compelling reasons to disable it. Inactive users do not consume Coder resources. 
-#### HTTP API +#### HTTP API latency API latency/response time average number of HTTP requests +depending on database perf + TODO #### Scaling formula From fa1215f11d7bd663dc7229da2512d24e08eac09b Mon Sep 17 00:00:00 2001 From: Marcin Tojek Date: Tue, 12 Mar 2024 13:34:54 +0100 Subject: [PATCH 07/39] now workspaces --- docs/admin/architectures/1k-users.md | 2 +- docs/admin/architectures/2k-users.md | 2 +- docs/admin/architectures/3k-users.md | 2 +- docs/admin/architectures/index.md | 55 ++++++++++++++++++++++------ 4 files changed, 47 insertions(+), 14 deletions(-) diff --git a/docs/admin/architectures/1k-users.md b/docs/admin/architectures/1k-users.md index fb44ae54f696c..4199a2edcc1f7 100644 --- a/docs/admin/architectures/1k-users.md +++ b/docs/admin/architectures/1k-users.md @@ -12,6 +12,6 @@ tech startups, educational units, or small to mid-sized enterprises. ### Coderd nodes -| Users | Cluster capacity | Replicas | GCP | AWS | Azure | +| Users | Node capacity | Replicas | GCP | AWS | Azure | | ----------- | ------------------- | -------- | --------------- | ---------- | ----------------- | | Up to 1,000 | 2 vCPU, 8 GB memory | 2 | `n1-standard-2` | `t3.large` | `Standard_D2s_v3` | diff --git a/docs/admin/architectures/2k-users.md b/docs/admin/architectures/2k-users.md index 5f479fc2f45a0..0d1adcc25f8cd 100644 --- a/docs/admin/architectures/2k-users.md +++ b/docs/admin/architectures/2k-users.md @@ -17,6 +17,6 @@ enabling it for deployment reliability. 
### Coderd nodes -| Users | Cluster capacity | Replicas | GCP | AWS | Azure | +| Users | Node capacity | Replicas | GCP | AWS | Azure | | ----------- | -------------------- | -------- | --------------- | ----------- | ----------------- | | Up to 2,000 | 4 vCPU, 16 GB memory | 2 | `n1-standard-4` | `t3.xlarge` | `Standard_D4s_v3` | diff --git a/docs/admin/architectures/3k-users.md b/docs/admin/architectures/3k-users.md index aa4fe92cd8469..38da3c4ff3cd3 100644 --- a/docs/admin/architectures/3k-users.md +++ b/docs/admin/architectures/3k-users.md @@ -13,6 +13,6 @@ purposes. ### Coderd nodes -| Users | Cluster capacity | Replicas | GCP | AWS | Azure | +| Users | Node capacity | Replicas | GCP | AWS | Azure | | ----------- | -------------------- | -------- | --------------- | ----------- | ----------------- | | Up to 3,000 | 8 vCPU, 32 GB memory | 4 | `n1-standard-4` | `t3.xlarge` | `Standard_D4s_v3` | diff --git a/docs/admin/architectures/index.md b/docs/admin/architectures/index.md index 998015a897f3c..c8b76f749055f 100644 --- a/docs/admin/architectures/index.md +++ b/docs/admin/architectures/index.md @@ -205,27 +205,60 @@ this option enabled unless there are compelling reasons to disable it. Inactive users do not consume Coder resources. -#### HTTP API latency +#### Scaling formula -API latency/response time average number of HTTP requests +When determining scaling requirements, consider the following factors: -depending on database perf +- 1 vCPU x 2 GB memory x 250 users: A reasonable formula to determine resource + allocation based on the number of users and their expected usage patterns. +- API latency/response time: Monitor API latency and response times to ensure + optimal performance under varying loads. +- Average number of HTTP requests: Track the average number of HTTP requests to + gauge system usage and identify potential bottlenecks. -TODO +**Node Autoscaling** -#### Scaling formula +We recommend to disable autoscaling for `coderd` nodes. 
Autoscaling can cause +interruptions for user connections, see [Autoscaling](../scale.md#autoscaling) +for more details. + +### Workspaces -reasonable ratio/formula: CPU x memory x users reasonable ratio/formula: -provisionerd x users API latency/response time average number of HTTP requests +Assumptions: -TODO +workspaces also run on the same Kubernetes cluster (recommend a different +namespace/node pool) -### Workspaces +developers can pick between 4-8 CPU and 4-16 GB RAM workspaces (limits) + +developers have a resource quota of 16 GPU 32 GB RAM (2-maxed out workspaces). -TODO +However, the Coder agent itself requires at minimum 0.1 CPU cores and 256 MB to +run inside a workspace. + +web microservice development use case: resources are mostly underutilized but +spike during builds + +Case study: + +Developers for up to 2000+ users architecture are in 2 regions (a different +cluster) and are evenly split. In practice, this doesn’t change much besides the +diagram and workspaces node pool autoscaling config as it still uses the central +provisioner. Recommend multiple provisioner groups for zero-trust and +multi-cloud use cases. Developers for up to 3000+ users architecture are also in +an on-premises network. Document a provisioner running in a different cloud +environment, and the zero-trust benefits of that. + +scaling formula + +provisionerd x users: Another formula to consider, focusing on the capacity of +provisioner nodes relative to the number of workspace builds, triggered by +users. 
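The `provisionerd x users` idea above can be sketched as: estimate peak concurrent workspace builds, then divide by per-replica daemon capacity. Both numbers below, the 5% peak-build share and the 30 daemons per replica, are illustrative assumptions rather than documented guidance:

```python
# Hypothetical sizing sketch for provisioner capacity; the 5% peak-build share
# and 30 daemons per replica are assumptions, not documented Coder defaults.

def provisioner_replicas(users: int,
                         peak_build_pct: int = 5,
                         daemons_per_replica: int = 30) -> int:
    concurrent_builds = -(-users * peak_build_pct // 100)  # ceiling division
    return -(-concurrent_builds // daemons_per_replica)    # ceiling division

print(provisioner_replicas(2000))  # 100 concurrent builds -> 4 replicas
```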
### Database -TODO +PostgreSQL database + +measure and document the impact of dbcrypt ### From 43812e6d73be4ee141cdb5c37a6d24a5fb4bbfa1 Mon Sep 17 00:00:00 2001 From: Marcin Tojek Date: Tue, 12 Mar 2024 14:15:59 +0100 Subject: [PATCH 08/39] HTTP API latency --- docs/admin/architectures/index.md | 12 ++++++++++++ 1 file changed, 12 insertions(+) diff --git a/docs/admin/architectures/index.md b/docs/admin/architectures/index.md index c8b76f749055f..65870b9d2d435 100644 --- a/docs/admin/architectures/index.md +++ b/docs/admin/architectures/index.md @@ -216,6 +216,18 @@ When determining scaling requirements, consider the following factors: - Average number of HTTP requests: Track the average number of HTTP requests to gauge system usage and identify potential bottlenecks. +**HTTP API latency** + +For a reliable Coder deployment dealing with medium to high loads, it's +important that API calls for workspace/template queries and workspace build +operations respond within 300 ms. However, API template insights calls, which +involve browsing workspace agent stats and user activity data, may require more +time. + +Also, if the Coder deployment expects traffic from developers spread across the +globe, keep in mind that customer-facing latency might be higher because of the +distance between users and the load balancer. + **Node Autoscaling** We recommend to disable autoscaling for `coderd` nodes. Autoscaling can cause From f68ed34b639ad59445102611cd3eac660ed4da5d Mon Sep 17 00:00:00 2001 From: Marcin Tojek Date: Tue, 12 Mar 2024 14:56:02 +0100 Subject: [PATCH 09/39] fix --- docs/admin/architectures/index.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/admin/architectures/index.md b/docs/admin/architectures/index.md index 65870b9d2d435..a86378ff1a872 100644 --- a/docs/admin/architectures/index.md +++ b/docs/admin/architectures/index.md @@ -230,7 +230,7 @@ distance between users and the load balancer. 
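One way to verify the 300 ms response-time target discussed above is to compute a high percentile over collected latency samples. A small sketch; how the samples are gathered (load-test tool, Prometheus histograms, and so on) is left open:

```python
from statistics import quantiles

def meets_latency_target(samples_ms, target_ms=300.0):
    """Check the 95th-percentile latency against a millisecond budget."""
    p95 = quantiles(samples_ms, n=100)[94]  # 95th-percentile cut point
    return p95 <= target_ms

# 100 synthetic samples, all comfortably under the budget.
calm = [120, 140, 150, 160, 180, 190, 200, 210, 220, 250] * 10
print(meets_latency_target(calm))  # True
```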
**Node Autoscaling** -We recommend to disable autoscaling for `coderd` nodes. Autoscaling can cause +We recommend disabling the autoscaling for `coderd` nodes. Autoscaling can cause interruptions for user connections, see [Autoscaling](../scale.md#autoscaling) for more details. From 2987193bcae38dd01bed00e9b52444929cf868a0 Mon Sep 17 00:00:00 2001 From: Marcin Tojek Date: Tue, 12 Mar 2024 16:20:55 +0100 Subject: [PATCH 10/39] WIP --- docs/admin/architectures/1k-users.md | 4 +++ docs/admin/architectures/2k-users.md | 10 ++++++ docs/admin/architectures/3k-users.md | 8 +++++ docs/admin/architectures/index.md | 54 +++++++++++++--------------- 4 files changed, 47 insertions(+), 29 deletions(-) diff --git a/docs/admin/architectures/1k-users.md b/docs/admin/architectures/1k-users.md index 4199a2edcc1f7..154e73437baed 100644 --- a/docs/admin/architectures/1k-users.md +++ b/docs/admin/architectures/1k-users.md @@ -15,3 +15,7 @@ tech startups, educational units, or small to mid-sized enterprises. | Users | Node capacity | Replicas | GCP | AWS | Azure | | ----------- | ------------------- | -------- | --------------- | ---------- | ----------------- | | Up to 1,000 | 2 vCPU, 8 GB memory | 2 | `n1-standard-2` | `t3.large` | `Standard_D2s_v3` | + +### Workspace nodes + +TODO diff --git a/docs/admin/architectures/2k-users.md b/docs/admin/architectures/2k-users.md index 0d1adcc25f8cd..4af7f3193683a 100644 --- a/docs/admin/architectures/2k-users.md +++ b/docs/admin/architectures/2k-users.md @@ -20,3 +20,13 @@ enabling it for deployment reliability. | Users | Node capacity | Replicas | GCP | AWS | Azure | | ----------- | -------------------- | -------- | --------------- | ----------- | ----------------- | | Up to 2,000 | 4 vCPU, 16 GB memory | 2 | `n1-standard-4` | `t3.xlarge` | `Standard_D4s_v3` | + +### Workspace nodes + +TODO + +Developers for up to 2000+ users architecture are in 2 regions (a different +cluster) and are evenly split. 
In practice, this doesn’t change much besides the +diagram and workspaces node pool autoscaling config as it still uses the central +provisioner. Recommend multiple provisioner groups for zero-trust and +multi-cloud use cases. diff --git a/docs/admin/architectures/3k-users.md b/docs/admin/architectures/3k-users.md index 38da3c4ff3cd3..e05e89c97d96b 100644 --- a/docs/admin/architectures/3k-users.md +++ b/docs/admin/architectures/3k-users.md @@ -16,3 +16,11 @@ purposes. | Users | Node capacity | Replicas | GCP | AWS | Azure | | ----------- | -------------------- | -------- | --------------- | ----------- | ----------------- | | Up to 3,000 | 8 vCPU, 32 GB memory | 4 | `n1-standard-4` | `t3.xlarge` | `Standard_D4s_v3` | + +### Workspace nodes + +TODO + +Developers for up to 3000+ users architecture are also in an on-premises +network. Document a provisioner running in a different cloud environment, and +the zero-trust benefits of that. diff --git a/docs/admin/architectures/index.md b/docs/admin/architectures/index.md index a86378ff1a872..ccbeefa7225e4 100644 --- a/docs/admin/architectures/index.md +++ b/docs/admin/architectures/index.md @@ -184,11 +184,11 @@ Database: ### Control plane -When considering the control plane, it's essential to focus on node sizing, -resource limits, and the number of replicas. We recommend referencing public -cloud providers such as AWS, GCP, and Azure for guidance on optimal -configurations. A reasonable approach involves using scaling formulas based on -factors like CPU, memory, and the number of users. +To ensure stability and reliability of the Coder control plane, it's essential +to focus on node sizing, resource limits, and the number of replicas. We +recommend referencing public cloud providers such as AWS, GCP, and Azure for +guidance on optimal configurations. A reasonable approach involves using scaling +formulas based on factors like CPU, memory, and the number of users. 
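The `1 vCPU x 2 GB memory x 250 users` ratio quoted in these notes can be turned into a quick estimator, intended as a starting point for capacity planning rather than a hard sizing rule:

```python
USERS_PER_VCPU = 250     # from the "1 vCPU x 2 GB memory x 250 users" ratio
MEMORY_GB_PER_VCPU = 2

def control_plane_resources(users: int):
    """Rough total vCPU and memory (GB) estimate for the control plane."""
    vcpus = max(1, -(-users // USERS_PER_VCPU))  # integer ceiling division
    return vcpus, vcpus * MEMORY_GB_PER_VCPU

print(control_plane_resources(1000))  # (4, 8): 4 vCPUs and 8 GB for 1,000 users
```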
While the minimum requirements specify 1 CPU core and 2 GB of memory per `coderd` replica, it is recommended to allocate additional resources to ensure @@ -209,7 +209,7 @@ Inactive users do not consume Coder resources. When determining scaling requirements, consider the following factors: -- 1 vCPU x 2 GB memory x 250 users: A reasonable formula to determine resource +- `1 vCPU x 2 GB memory x 250 users`: A reasonable formula to determine resource allocation based on the number of users and their expected usage patterns. - API latency/response time: Monitor API latency and response times to ensure optimal performance under varying loads. @@ -236,37 +236,33 @@ for more details. ### Workspaces -Assumptions: +To determine workspace resource limits and keep the best developer experience +for workspace users, administrators must be aware of a few assumptions. -workspaces also run on the same Kubernetes cluster (recommend a different -namespace/node pool) +- Workspace pods run on the same Kubernetes cluster, but possible in a different + namespace or a node pool. +- Workspace limits (per workspace user): + - Developers can choose between 4-8 vCPUs, and 4-16 GB memory. + - Evaluate the workspace utilization pattern. For instance, a regular web + development does not require high CPU capacity all the time, but only during + project builds or load tests. + - Minimum requirements for Coder agent running in an idle workspace are 0.1 + vCPU and 256 MB. -developers can pick between 4-8 CPU and 4-16 GB RAM workspaces (limits) - -developers have a resource quota of 16 GPU 32 GB RAM (2-maxed out workspaces). - -However, the Coder agent itself requires at minimum 0.1 CPU cores and 256 MB to -run inside a workspace. - -web microservice development use case: resources are mostly underutilized but -spike during builds - -Case study: - -Developers for up to 2000+ users architecture are in 2 regions (a different -cluster) and are evenly split. 
In practice, this doesn’t change much besides the -diagram and workspaces node pool autoscaling config as it still uses the central -provisioner. Recommend multiple provisioner groups for zero-trust and -multi-cloud use cases. Developers for up to 3000+ users architecture are also in -an on-premises network. Document a provisioner running in a different cloud -environment, and the zero-trust benefits of that. +#### Scaling formula -scaling formula +TODO provisionerd x users: Another formula to consider, focusing on the capacity of provisioner nodes relative to the number of workspace builds, triggered by users. +- Guidance for reasonable ratio of CPU limits/requests +- Guidance for reasonable ratio for memory requests/limits + +Mention that as users onboard, the autoscaling config should take care of +ongoing workspaces + ### Database PostgreSQL database From 1a4dfb99bed572998ab4a47e9d8e8509138dbe1a Mon Sep 17 00:00:00 2001 From: Marcin Tojek Date: Tue, 12 Mar 2024 17:27:12 +0100 Subject: [PATCH 11/39] More TODOs --- docs/admin/architectures/1k-users.md | 4 ++++ docs/admin/architectures/2k-users.md | 19 +++++++++++++++++++ docs/admin/architectures/index.md | 2 ++ 3 files changed, 25 insertions(+) diff --git a/docs/admin/architectures/1k-users.md b/docs/admin/architectures/1k-users.md index 154e73437baed..722094cd4a96a 100644 --- a/docs/admin/architectures/1k-users.md +++ b/docs/admin/architectures/1k-users.md @@ -19,3 +19,7 @@ tech startups, educational units, or small to mid-sized enterprises. ### Workspace nodes TODO + +### Provisioner nodes + +TODO diff --git a/docs/admin/architectures/2k-users.md b/docs/admin/architectures/2k-users.md index 4af7f3193683a..715bc08a75fb0 100644 --- a/docs/admin/architectures/2k-users.md +++ b/docs/admin/architectures/2k-users.md @@ -23,10 +23,29 @@ enabling it for deployment reliability. 
### Workspace nodes +| Users | Node capacity | Replicas | GCP | AWS | Azure | +| ----------- | -------------------- | -------- | ---------------- | ------------ | ----------------- | +| Up to 2,000 | 8 vCPU, 32 GB memory | 2 | `t2d-standard-8` | `t3.2xlarge` | `Standard_D8s_v3` | + TODO +Max pods per node 256 + Developers for up to 2000+ users architecture are in 2 regions (a different cluster) and are evenly split. In practice, this doesn’t change much besides the diagram and workspaces node pool autoscaling config as it still uses the central provisioner. Recommend multiple provisioner groups for zero-trust and multi-cloud use cases. + +### Provisioner nodes + +TODO + +For example, to support 120 concurrent workspace builds: + +- Create a cluster/nodepool with 4 nodes, 8-core each (AWS: `t3.2xlarge` GCP: + `e2-highcpu-8`) +- Run coderd with 4 replicas, 30 provisioner daemons each. + (`CODER_PROVISIONER_DAEMONS=30`) +- Ensure Coder's [PostgreSQL server](./configure.md#postgresql-database) can use + up to 2 cores and 4 GB RAM diff --git a/docs/admin/architectures/index.md b/docs/admin/architectures/index.md index ccbeefa7225e4..a6e0d2bd06801 100644 --- a/docs/admin/architectures/index.md +++ b/docs/admin/architectures/index.md @@ -263,6 +263,8 @@ users. 
Mention that as users onboard, the autoscaling config should take care of ongoing workspaces +0.25 cores and 256 MB per provisioner daemon + ### Database PostgreSQL database From 4721204d83799c26924353aff89b534a67e53bff Mon Sep 17 00:00:00 2001 From: Marcin Tojek Date: Wed, 13 Mar 2024 10:35:02 +0100 Subject: [PATCH 12/39] WIP --- docs/admin/architectures/2k-users.md | 36 +++++++++++++++------------- docs/admin/architectures/index.md | 16 +++++++++++-- 2 files changed, 34 insertions(+), 18 deletions(-) diff --git a/docs/admin/architectures/2k-users.md b/docs/admin/architectures/2k-users.md index 715bc08a75fb0..8addc3c75e28f 100644 --- a/docs/admin/architectures/2k-users.md +++ b/docs/admin/architectures/2k-users.md @@ -21,26 +21,14 @@ enabling it for deployment reliability. | ----------- | -------------------- | -------- | --------------- | ----------- | ----------------- | | Up to 2,000 | 4 vCPU, 16 GB memory | 2 | `n1-standard-4` | `t3.xlarge` | `Standard_D4s_v3` | -### Workspace nodes - -| Users | Node capacity | Replicas | GCP | AWS | Azure | -| ----------- | -------------------- | -------- | ---------------- | ------------ | ----------------- | -| Up to 2,000 | 8 vCPU, 32 GB memory | 2 | `t2d-standard-8` | `t3.2xlarge` | `Standard_D8s_v3` | - -TODO - -Max pods per node 256 - -Developers for up to 2000+ users architecture are in 2 regions (a different -cluster) and are evenly split. In practice, this doesn’t change much besides the -diagram and workspaces node pool autoscaling config as it still uses the central -provisioner. Recommend multiple provisioner groups for zero-trust and -multi-cloud use cases. - ### Provisioner nodes TODO +In practice, this doesn’t change much besides the diagram and workspaces node +pool autoscaling config as it still uses the central provisioner. Recommend +multiple provisioner groups for zero-trust and multi-cloud use cases. 
+ For example, to support 120 concurrent workspace builds: - Create a cluster/nodepool with 4 nodes, 8-core each (AWS: `t3.2xlarge` GCP: @@ -49,3 +37,19 @@ For example, to support 120 concurrent workspace builds: (`CODER_PROVISIONER_DAEMONS=30`) - Ensure Coder's [PostgreSQL server](./configure.md#postgresql-database) can use up to 2 cores and 4 GB RAM + +### Workspace nodes + +| Users | Node capacity | Replicas | GCP | AWS | Azure | +| ----------- | -------------------- | -------- | ---------------- | ------------ | ----------------- | +| Up to 2,000 | 8 vCPU, 32 GB memory | 128 | `t2d-standard-8` | `t3.2xlarge` | `Standard_D8s_v3` | + +**Assumptions**: + +- Workspace user needs 2 GB memory to perform + +**Footnotes**: + +- Maximum number of Kubernetes pods per node: 256 +- Nodes can be distributed in 2 regions, not necessarily evenly split, depending + on developer team sizes diff --git a/docs/admin/architectures/index.md b/docs/admin/architectures/index.md index a6e0d2bd06801..1835154f0445f 100644 --- a/docs/admin/architectures/index.md +++ b/docs/admin/architectures/index.md @@ -182,7 +182,7 @@ Database: ## Hardware recommendation -### Control plane +### Control plane: coderd To ensure stability and reliability of the Coder control plane, it's essential to focus on node sizing, resource limits, and the number of replicas. We @@ -234,7 +234,19 @@ We recommend disabling the autoscaling for `coderd` nodes. Autoscaling can cause interruptions for user connections, see [Autoscaling](../scale.md#autoscaling) for more details. -### Workspaces +### Control plane: provisionerd + +TODO + +#### Scaling formula + +TODO + +**Node Autoscaling** + +TODO + +### Data plane: Workspaces To determine workspace resource limits and keep the best developer experience for workspace users, administrators must be aware of a few assumptions. 
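Under the assumptions above, the packing of workspaces onto nodes can be
sketched as a small capacity check. This is a hypothetical helper: the 0.5 vCPU
and 2 GB per-workspace figures and the 256-pods-per-node ceiling are planning
assumptions used in this document, not guarantees:

```python
def workspaces_per_node(node_vcpu: int = 8, node_mem_gb: int = 32,
                        ws_vcpu: float = 0.5, ws_mem_gb: int = 2,
                        max_pods: int = 256) -> int:
    """How many workspaces fit on one node, limited by whichever
    resource (CPU, memory, or the pod ceiling) runs out first."""
    by_cpu = int(node_vcpu / ws_vcpu)
    by_mem = node_mem_gb // ws_mem_gb
    return min(by_cpu, by_mem, max_pods)

print(workspaces_per_node())  # 16 - memory is the binding constraint here
```

With these defaults an 8 vCPU / 32 GB node hosts 16 workspaces, which is why
the per-node workspace counts in the sizing tables are multiples of 16.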
From 233866f0631b94c327b6eca0c213726aee4fcefa Mon Sep 17 00:00:00 2001
From: Marcin Tojek
Date: Wed, 13 Mar 2024 11:55:37 +0100
Subject: [PATCH 13/39] 2k

---
 docs/admin/architectures/2k-users.md | 25 +++++++++----------------
 docs/admin/architectures/index.md    | 24 +++++++++++++++++++++---
 2 files changed, 30 insertions(+), 19 deletions(-)

diff --git a/docs/admin/architectures/2k-users.md b/docs/admin/architectures/2k-users.md
index 8addc3c75e28f..92fed77e9892f 100644
--- a/docs/admin/architectures/2k-users.md
+++ b/docs/admin/architectures/2k-users.md
@@ -23,20 +23,16 @@ enabling it for deployment reliability.

### Provisioner nodes

-TODO
+| Users       | Node capacity        | Replicas                 | GCP              | AWS          | Azure             |
+| ----------- | -------------------- | ------------------------ | ---------------- | ------------ | ----------------- |
+| Up to 2,000 | 8 vCPU, 32 GB memory | 4 / 30 provisioners each | `t2d-standard-8` | `t3.2xlarge` | `Standard_D8s_v3` |

-In practice, this doesn’t change much besides the diagram and workspaces node
-pool autoscaling config as it still uses the central provisioner. Recommend
-multiple provisioner groups for zero-trust and multi-cloud use cases.
-
-For example, to support 120 concurrent workspace builds:
+**Footnotes**:

-- Create a cluster/nodepool with 4 nodes, 8-core each (AWS: `t3.2xlarge` GCP:
-  `e2-highcpu-8`)
-- Run coderd with 4 replicas, 30 provisioner daemons each.
-  (`CODER_PROVISIONER_DAEMONS=30`)
-- Ensure Coder's [PostgreSQL server](./configure.md#postgresql-database) can use
-  up to 2 cores and 4 GB RAM
+- An external provisioner is deployed as a Kubernetes pod.
+- It is not recommended to run provisioner daemons on `coderd` nodes.
+- Consider separating provisioners into different namespaces in favor of
+  zero-trust or multi-cloud deployments.
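Since each provisioner daemon handles one workspace build at a time, total
build concurrency is simply replicas multiplied by daemons per replica. The
sketch below illustrates the arithmetic behind the 120-concurrent-builds
example (4 replicas with `CODER_PROVISIONER_DAEMONS=30` each); it is an
illustrative helper, not part of Coder:

```python
def concurrent_builds(replicas: int, daemons_per_replica: int = 30) -> int:
    """Maximum workspace builds that can run at once:
    one build per provisioner daemon across all replicas."""
    return replicas * daemons_per_replica

# 4 replicas x 30 daemons each -> 120 concurrent workspace builds
print(concurrent_builds(4))  # 120
```

Builds beyond this concurrency limit are queued, so the observed build queue
time is a good signal for whether more provisioners are needed.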
### Workspace nodes @@ -44,12 +40,9 @@ For example, to support 120 concurrent workspace builds: | ----------- | -------------------- | -------- | ---------------- | ------------ | ----------------- | | Up to 2,000 | 8 vCPU, 32 GB memory | 128 | `t2d-standard-8` | `t3.2xlarge` | `Standard_D8s_v3` | -**Assumptions**: - -- Workspace user needs 2 GB memory to perform - **Footnotes**: +- Assumed that a workspace user needs 2 GB memory to perform - Maximum number of Kubernetes pods per node: 256 - Nodes can be distributed in 2 regions, not necessarily evenly split, depending on developer team sizes diff --git a/docs/admin/architectures/index.md b/docs/admin/architectures/index.md index 1835154f0445f..18c07b6ec0f74 100644 --- a/docs/admin/architectures/index.md +++ b/docs/admin/architectures/index.md @@ -236,15 +236,31 @@ for more details. ### Control plane: provisionerd -TODO +Each provisioner can run a single concurrent workspace build. For example, +running 10 provisioner containers will allow 10 users to start workspaces at the +same time. + +By default, the Coder server runs built-in provisioner daemons, but the +_Enterprise_ Coder release allows for running external provisioners to separate +the load caused by workspace provisioning on the `coderd` nodes. #### Scaling formula -TODO +When determining scaling requirements, consider the following factors: + +- `0.5 vCPU x 512 MB memory x concurrent workspace build`: A formula to + determine resource allocation based on the number of concurrent workspace + builds, and standard complexity of a Terraform template. _The rule of thumb_: + the more provisioners are free/available, the more concurrent workspace builds + can be performed. **Node Autoscaling** -TODO +Autoscaling provisioners is not an easy problem to solve unless it can be +predicted when a number of concurrent workspace builds increases. 
+ +We recommend disabling autoscaling and adjusting the number of provisioners to +developer needs based on the workspace build queuing time. ### Data plane: Workspaces @@ -279,6 +295,8 @@ ongoing workspaces ### Database +TODO + PostgreSQL database measure and document the impact of dbcrypt From 17e543127ff7e6a8a4ba7ef2b43e1c43811b995d Mon Sep 17 00:00:00 2001 From: Marcin Tojek Date: Wed, 13 Mar 2024 11:58:03 +0100 Subject: [PATCH 14/39] WIP --- docs/admin/architectures/index.md | 16 +++++----------- 1 file changed, 5 insertions(+), 11 deletions(-) diff --git a/docs/admin/architectures/index.md b/docs/admin/architectures/index.md index 18c07b6ec0f74..983cfed05b1cf 100644 --- a/docs/admin/architectures/index.md +++ b/docs/admin/architectures/index.md @@ -248,11 +248,11 @@ the load caused by workspace provisioning on the `coderd` nodes. When determining scaling requirements, consider the following factors: -- `0.5 vCPU x 512 MB memory x concurrent workspace build`: A formula to - determine resource allocation based on the number of concurrent workspace - builds, and standard complexity of a Terraform template. _The rule of thumb_: - the more provisioners are free/available, the more concurrent workspace builds - can be performed. +- `1 vCPU x 1 GB memory x 2 concurrent workspace build`: A formula to determine + resource allocation based on the number of concurrent workspace builds, and + standard complexity of a Terraform template. _The rule of thumb_: the more + provisioners are free/available, the more concurrent workspace builds can be + performed. **Node Autoscaling** @@ -281,18 +281,12 @@ for workspace users, administrators must be aware of a few assumptions. TODO -provisionerd x users: Another formula to consider, focusing on the capacity of -provisioner nodes relative to the number of workspace builds, triggered by -users. 
- - Guidance for reasonable ratio of CPU limits/requests - Guidance for reasonable ratio for memory requests/limits Mention that as users onboard, the autoscaling config should take care of ongoing workspaces -0.25 cores and 256 MB per provisioner daemon - ### Database TODO From ab95dddbc7bcbfe7b2c55bdd863e103658cfa2e6 Mon Sep 17 00:00:00 2001 From: Marcin Tojek Date: Wed, 13 Mar 2024 12:42:07 +0100 Subject: [PATCH 15/39] 1k 2k 3k --- docs/admin/architectures/1k-users.md | 24 +++++++++++++++++++++--- docs/admin/architectures/2k-users.md | 8 ++++---- docs/admin/architectures/3k-users.md | 27 +++++++++++++++++++++++---- docs/admin/architectures/index.md | 7 ++++--- 4 files changed, 52 insertions(+), 14 deletions(-) diff --git a/docs/admin/architectures/1k-users.md b/docs/admin/architectures/1k-users.md index 722094cd4a96a..616d25a5e76c4 100644 --- a/docs/admin/architectures/1k-users.md +++ b/docs/admin/architectures/1k-users.md @@ -16,10 +16,28 @@ tech startups, educational units, or small to mid-sized enterprises. | ----------- | ------------------- | -------- | --------------- | ---------- | ----------------- | | Up to 1,000 | 2 vCPU, 8 GB memory | 2 | `n1-standard-2` | `t3.large` | `Standard_D2s_v3` | -### Workspace nodes +**Footnotes**: -TODO +- For small deployments (ca. 100 users, 10 concurrent workspace builds), it is + acceptable to deploy provisioners on `coderd` nodes. ### Provisioner nodes -TODO +| Users | Node capacity | Replicas | GCP | AWS | Azure | +| ----------- | -------------------- | ------------------------ | ---------------- | ------------ | ----------------- | +| Up to 1,000 | 8 vCPU, 32 GB memory | 2 / 30 provisioners each | `t2d-standard-8` | `t3.2xlarge` | `Standard_D8s_v3` | + +**Footnotes**: + +- An external provisioner is deployed as Kubernetes pod. 
+ +### Workspace nodes + +| Users | Node capacity | Replicas | GCP | AWS | Azure | +| ----------- | -------------------- | ----------------------- | ---------------- | ------------ | ----------------- | +| Up to 1,000 | 8 vCPU, 32 GB memory | 64 / 16 workspaces each | `t2d-standard-8` | `t3.2xlarge` | `Standard_D8s_v3` | + +**Footnotes**: + +- Assumed that a workspace user needs 2 GB memory to perform +- Maximum number of Kubernetes workspace pods per node: 256 diff --git a/docs/admin/architectures/2k-users.md b/docs/admin/architectures/2k-users.md index 92fed77e9892f..932f3b5b31512 100644 --- a/docs/admin/architectures/2k-users.md +++ b/docs/admin/architectures/2k-users.md @@ -36,13 +36,13 @@ enabling it for deployment reliability. ### Workspace nodes -| Users | Node capacity | Replicas | GCP | AWS | Azure | -| ----------- | -------------------- | -------- | ---------------- | ------------ | ----------------- | -| Up to 2,000 | 8 vCPU, 32 GB memory | 128 | `t2d-standard-8` | `t3.2xlarge` | `Standard_D8s_v3` | +| Users | Node capacity | Replicas | GCP | AWS | Azure | +| ----------- | -------------------- | ------------------------ | ---------------- | ------------ | ----------------- | +| Up to 2,000 | 8 vCPU, 32 GB memory | 128 / 16 workspaces each | `t2d-standard-8` | `t3.2xlarge` | `Standard_D8s_v3` | **Footnotes**: - Assumed that a workspace user needs 2 GB memory to perform -- Maximum number of Kubernetes pods per node: 256 +- Maximum number of Kubernetes workspace pods per node: 256 - Nodes can be distributed in 2 regions, not necessarily evenly split, depending on developer team sizes diff --git a/docs/admin/architectures/3k-users.md b/docs/admin/architectures/3k-users.md index e05e89c97d96b..e206a8a41635e 100644 --- a/docs/admin/architectures/3k-users.md +++ b/docs/admin/architectures/3k-users.md @@ -17,10 +17,29 @@ purposes. 
| ----------- | -------------------- | -------- | --------------- | ----------- | ----------------- | | Up to 3,000 | 8 vCPU, 32 GB memory | 4 | `n1-standard-4` | `t3.xlarge` | `Standard_D4s_v3` | +### Provisioner nodes + +| Users | Node capacity | Replicas | GCP | AWS | Azure | +| ----------- | -------------------- | ------------------------ | ---------------- | ------------ | ----------------- | +| Up to 3,000 | 8 vCPU, 32 GB memory | 8 / 30 provisioners each | `t2d-standard-8` | `t3.2xlarge` | `Standard_D8s_v3` | + +**Footnotes**: + +- An external provisioner is deployed as Kubernetes pod. +- It is strongly discouraged to run provisioner daemons on `coderd` nodes. +- Separate provisioners into different namespaces in favor of zero-trust or + multi-cloud deployments. + ### Workspace nodes -TODO +| Users | Node capacity | Replicas | GCP | AWS | Azure | +| ----------- | -------------------- | ------------------------ | ---------------- | ------------ | ----------------- | +| Up to 3,000 | 8 vCPU, 32 GB memory | 256 / 12 workspaces each | `t2d-standard-8` | `t3.2xlarge` | `Standard_D8s_v3` | + +**Footnotes**: -Developers for up to 3000+ users architecture are also in an on-premises -network. Document a provisioner running in a different cloud environment, and -the zero-trust benefits of that. +- Assumed that a workspace user needs 2 GB memory to perform +- Maximum number of Kubernetes workspace pods per node: 256 +- As workspace nodes can be distributed between regions, on-premises networks + and cloud areas, consider different namespaces in favor of zero-trust or + multi-cloud deployments. diff --git a/docs/admin/architectures/index.md b/docs/admin/architectures/index.md index 983cfed05b1cf..b9f9bd8b53544 100644 --- a/docs/admin/architectures/index.md +++ b/docs/admin/architectures/index.md @@ -270,12 +270,13 @@ for workspace users, administrators must be aware of a few assumptions. 
- Workspace pods run on the same Kubernetes cluster, but possibly in a different
  namespace or node pool.
- Workspace limits (per workspace user):
-  - Developers can choose between 4-8 vCPUs, and 4-16 GB memory.
  - Evaluate the workspace utilization pattern. For instance, regular web
    development does not require high CPU capacity all the time, but only during
    project builds or load tests.
-  - Minimum requirements for Coder agent running in an idle workspace are 0.1
-    vCPU and 256 MB.
+  - Evaluate minimal limits for a single workspace. Include in the calculation
+    the requirements for the Coder agent running in an idle workspace: 0.1 vCPU
+    and 256 MB of memory. For instance, developers can choose between 0.5-8
+    vCPUs, and 1-16 GB memory.

#### Scaling formula
From 0937f361056b2163dc35d90bd144db461da045a9 Mon Sep 17 00:00:00 2001
From: Marcin Tojek
Date: Wed, 13 Mar 2024 13:35:42 +0100
Subject: [PATCH 16/39] Workspaces covered

---
 docs/admin/architectures/index.md | 24 ++++++++++++++++++------
 1 file changed, 18 insertions(+), 6 deletions(-)

diff --git a/docs/admin/architectures/index.md b/docs/admin/architectures/index.md
index b9f9bd8b53544..ffec5126158a6 100644
--- a/docs/admin/architectures/index.md
+++ b/docs/admin/architectures/index.md
@@ -250,7 +250,7 @@ When determining scaling requirements, consider the following factors:

 - `1 vCPU x 1 GB memory x 2 concurrent workspace build`: A formula to determine
   resource allocation based on the number of concurrent workspace builds, and
-  standard complexity of a Terraform template. _The rule of thumb_: the more
+  standard complexity of a Terraform template. _Rule of thumb_: the more
   provisioners are free/available, the more concurrent workspace builds can be
   performed.
@@ -280,13 +280,25 @@ for workspace users, administrators must be aware of a few assumptions.
#### Scaling formula -TODO +When determining scaling requirements, consider the following factors: + +- `1 vCPU x 2 GB memory x 1 workspace`: A formula to determine resource + allocation based on the minimal requirements for an idle workspace with a + running Coder agent and occasional CPU and memory bursts for building + projects. + +**Node Autoscaling** + +Workspace nodes can be set to operate in autoscaling mode to mitigate the risk +of prolonged high resource utilization. -- Guidance for reasonable ratio of CPU limits/requests -- Guidance for reasonable ratio for memory requests/limits +One approach is to scale up workspace nodes when total CPU usage or memory +consumption reaches 80%. Another option is to scale based on metrics such as the +number of workspaces or active users. It's important to note that as new users +onboard, the autoscaling configuration should account for ongoing workspaces. -Mention that as users onboard, the autoscaling config should take care of -ongoing workspaces +Scaling down workspace nodes to zero is not recommended, as it will result in +longer wait times for workspace provisioning by users. ### Database From 67c4604bd54499a28892f75a56fa9aa62eb34b7c Mon Sep 17 00:00:00 2001 From: Marcin Tojek Date: Wed, 13 Mar 2024 13:55:16 +0100 Subject: [PATCH 17/39] More TODOs --- docs/admin/architectures/1k-users.md | 4 ++++ docs/admin/architectures/2k-users.md | 4 ++++ docs/admin/architectures/3k-users.md | 4 ++++ 3 files changed, 12 insertions(+) diff --git a/docs/admin/architectures/1k-users.md b/docs/admin/architectures/1k-users.md index 616d25a5e76c4..de14fca213235 100644 --- a/docs/admin/architectures/1k-users.md +++ b/docs/admin/architectures/1k-users.md @@ -41,3 +41,7 @@ tech startups, educational units, or small to mid-sized enterprises. 
- Assumed that a workspace user needs 2 GB memory to perform - Maximum number of Kubernetes workspace pods per node: 256 + +### Database nodes + +TODO diff --git a/docs/admin/architectures/2k-users.md b/docs/admin/architectures/2k-users.md index 932f3b5b31512..6dc4a899d1dd6 100644 --- a/docs/admin/architectures/2k-users.md +++ b/docs/admin/architectures/2k-users.md @@ -46,3 +46,7 @@ enabling it for deployment reliability. - Maximum number of Kubernetes workspace pods per node: 256 - Nodes can be distributed in 2 regions, not necessarily evenly split, depending on developer team sizes + +### Database nodes + +TODO diff --git a/docs/admin/architectures/3k-users.md b/docs/admin/architectures/3k-users.md index e206a8a41635e..2ff323e6b4699 100644 --- a/docs/admin/architectures/3k-users.md +++ b/docs/admin/architectures/3k-users.md @@ -43,3 +43,7 @@ purposes. - As workspace nodes can be distributed between regions, on-premises networks and cloud areas, consider different namespaces in favor of zero-trust or multi-cloud deployments. + +### Database nodes + +TODO From 11dbdd7b477d24478e613c60103d9443cef0ed1a Mon Sep 17 00:00:00 2001 From: Marcin Tojek Date: Wed, 13 Mar 2024 14:38:09 +0100 Subject: [PATCH 18/39] Database requirements --- docs/admin/architectures/index.md | 34 ++++++++++++++++++++++++++----- 1 file changed, 29 insertions(+), 5 deletions(-) diff --git a/docs/admin/architectures/index.md b/docs/admin/architectures/index.md index ffec5126158a6..f24f11df6751d 100644 --- a/docs/admin/architectures/index.md +++ b/docs/admin/architectures/index.md @@ -300,12 +300,36 @@ onboard, the autoscaling configuration should account for ongoing workspaces. Scaling down workspace nodes to zero is not recommended, as it will result in longer wait times for workspace provisioning by users. -### Database +### External database -TODO +While running in production, Coder deployment requires an access to an external +PostgreSQL database. 
Depending on the scale of the user-base, workspace +activity, and High Availability requirements, the amount of CPU and memory +resources may differ. -PostgreSQL database +#### Scaling formula + +When determining scaling requirements, take into account the following +considerations: + +- `2 vCPU x 8 GB RAM x 512 GB storage`: A baseline for database requirements for + Coder deployment with less than 1000 users, and low activity level (30% active + users). This capacity should be sufficient to support 100 external + provisioners. +- Allocate an additional CPU core to the database instance for every 1000 active + users. +- Enable _High Availability_ mode for database engine for large scale + deployments. + +With enabled database encryption feature in Coder, consider allocating an +additional CPU core to every `coderd` replica. + +#### Performance optimization guidelines -measure and document the impact of dbcrypt +We provide the following general recommendations for PostgreSQL settings: -### +- Increase number of vCPU if CPU utilization or database latency is high. +- Allocate extra GB memory if database performance is poor and CPU utilization + is low. +- Utilize faster disk options such as SSDs or NVMe drives for optimal + performance enhancement. From 776d4c61dcca06d4180a7685447ac2fb3c4c276e Mon Sep 17 00:00:00 2001 From: Marcin Tojek Date: Wed, 13 Mar 2024 14:38:45 +0100 Subject: [PATCH 19/39] Fix --- docs/admin/architectures/index.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/admin/architectures/index.md b/docs/admin/architectures/index.md index f24f11df6751d..749f76da77a79 100644 --- a/docs/admin/architectures/index.md +++ b/docs/admin/architectures/index.md @@ -300,7 +300,7 @@ onboard, the autoscaling configuration should account for ongoing workspaces. Scaling down workspace nodes to zero is not recommended, as it will result in longer wait times for workspace provisioning by users. 
-### External database
+### Data plane: External database

 While running in production, Coder deployment requires access to an external
 PostgreSQL database. Depending on the scale of the user-base, workspace
From 6a87a93620f8b253ac6708c288205e88049a2709 Mon Sep 17 00:00:00 2001
From: Marcin Tojek
Date: Wed, 13 Mar 2024 15:15:12 +0100
Subject: [PATCH 20/39] 1k 2k 3k

---
 docs/admin/architectures/1k-users.md | 4 +++-
 docs/admin/architectures/2k-users.md | 9 ++++++++-
 docs/admin/architectures/3k-users.md | 9 ++++++++-
 docs/admin/architectures/index.md    | 2 ++
 docs/admin/scale.md                  | 2 +-
 5 files changed, 22 insertions(+), 4 deletions(-)

diff --git a/docs/admin/architectures/1k-users.md b/docs/admin/architectures/1k-users.md
index 7750d7513185d..25fef1e89f3dc 100644
--- a/docs/admin/architectures/1k-users.md
+++ b/docs/admin/architectures/1k-users.md
@@ -44,4 +44,6 @@ tech startups, educational units, or small to mid-sized enterprises.

### Database nodes

-TODO
+| Users       | Node capacity       | Replicas | Storage | GCP                | AWS           | Azure             |
+| ----------- | ------------------- | -------- | ------- | ------------------ | ------------- | ----------------- |
+| Up to 1,000 | 2 vCPU, 8 GB memory | 1        | 512 GB  | `db-custom-2-7680` | `db.t3.large` | `Standard_D2s_v3` |
diff --git a/docs/admin/architectures/2k-users.md b/docs/admin/architectures/2k-users.md
index 6dc4a899d1dd6..b86d85c93afc0 100644
--- a/docs/admin/architectures/2k-users.md
+++ b/docs/admin/architectures/2k-users.md
@@ -49,4 +49,11 @@ enabling it for deployment reliability.

### Database nodes

-TODO
+| Users       | Node capacity        | Replicas | Storage | GCP                 | AWS            | Azure             |
+| ----------- | -------------------- | -------- | ------- | ------------------- | -------------- | ----------------- |
+| Up to 2,000 | 4 vCPU, 16 GB memory | 1        | 1 TB    | `db-custom-4-15360` | `db.t3.xlarge` | `Standard_D4s_v3` |
+
+**Footnotes**:
+
+- Consider adding more replicas if the workspace activity is higher than 500
+  workspace builds per day.
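The database sizing guidance above (a 2 vCPU baseline plus one extra core per
1,000 active users) can be sketched as a small calculator. This is an
illustrative helper under those stated assumptions, not an official sizing
tool; note that it counts active users, not total seats, and database
encryption or High Availability add further overhead:

```python
def db_vcpus(active_users: int, baseline_vcpu: int = 2) -> int:
    """Rough PostgreSQL vCPU estimate: baseline of 2 vCPU for fewer than
    1,000 active users, plus one extra core per 1,000 active users
    (assumed planning rule from the guidance above)."""
    return baseline_vcpu + active_users // 1000

print(db_vcpus(900))   # 2
print(db_vcpus(3000))  # 5
```

As with the other formulas, treat the output as a starting point and adjust
based on observed CPU utilization and query latency.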
diff --git a/docs/admin/architectures/3k-users.md b/docs/admin/architectures/3k-users.md index 2ff323e6b4699..975d91baf417d 100644 --- a/docs/admin/architectures/3k-users.md +++ b/docs/admin/architectures/3k-users.md @@ -46,4 +46,11 @@ purposes. ### Database nodes -TODO +| Users | Node capacity | Replicas | Storage | GCP | AWS | Azure | +| ----------- | -------------------- | -------- | ------- | ------------------- | --------------- | ----------------- | +| Up to 3,000 | 8 vCPU, 32 GB memory | 2 | 1.5 TB | `db-custom-8-30720` | `db.t3.2xlarge` | `Standard_D8s_v3` | + +**Footnotes**: + +- Consider adding more replicas if the workspace activity is higher than 1500 + workspace builds per day. diff --git a/docs/admin/architectures/index.md b/docs/admin/architectures/index.md index 749f76da77a79..b72edbe6d3cd0 100644 --- a/docs/admin/architectures/index.md +++ b/docs/admin/architectures/index.md @@ -316,6 +316,8 @@ considerations: Coder deployment with less than 1000 users, and low activity level (30% active users). This capacity should be sufficient to support 100 external provisioners. +- Storage size depends on user activity, workspace builds, log verbosity, + overhead on database encryption, etc. - Allocate an additional CPU core to the database instance for every 1000 active users. 
- Enable _High Availability_ mode for database engine for large scale diff --git a/docs/admin/scale.md b/docs/admin/scale.md index 2d7635f5c9ff3..43ca8f0d0a781 100644 --- a/docs/admin/scale.md +++ b/docs/admin/scale.md @@ -111,7 +111,7 @@ For example, to support 120 concurrent workspace builds: | Kubernetes (GKE) | 3 cores | 12 GB | 1 | db-f1-micro | 200 | 3 | 200 simulated | `v0.24.1` | Jun 26, 2023 | | Kubernetes (GKE) | 4 cores | 8 GB | 1 | db-custom-1-3840 | 1500 | 20 | 1,500 simulated | `v0.24.1` | Jun 27, 2023 | | Kubernetes (GKE) | 2 cores | 4 GB | 1 | db-custom-1-3840 | 500 | 20 | 500 simulated | `v0.27.2` | Jul 27, 2023 | -| Kubernetes (GKE) | 2 cores | 4 GB | 2 | db-custom-2-7680 | 1000 | 20 | 1000 simulated | `v2.2.1` | Oct 9, 2023 | +| Kubernetes (GKE) | 2 cores | 8 GB | 2 | db-custom-2-7680 | 1000 | 20 | 1000 simulated | `v2.2.1` | Oct 9, 2023 | > Note: a simulated connection reads and writes random data at 40KB/s per > connection. From 813688e92d15b094dcc7f7889b58af34ec06f8d2 Mon Sep 17 00:00:00 2001 From: Marcin Tojek Date: Thu, 14 Mar 2024 11:11:50 +0100 Subject: [PATCH 21/39] WIP --- docs/admin/architectures/1k-users.md | 2 +- docs/admin/architectures/index.md | 8 ++++---- 2 files changed, 5 insertions(+), 5 deletions(-) diff --git a/docs/admin/architectures/1k-users.md b/docs/admin/architectures/1k-users.md index 7750d7513185d..25fef1e89f3dc 100644 --- a/docs/admin/architectures/1k-users.md +++ b/docs/admin/architectures/1k-users.md @@ -14,7 +14,7 @@ tech startups, educational units, or small to mid-sized enterprises. 
| Users | Node capacity | Replicas | GCP | AWS | Azure | | ----------- | ------------------- | -------- | --------------- | ---------- | ----------------- | -| Up to 1,000 | 2 vCPU, 8 GB memory | 2 | `n1-standard-2` | `t3.large` | `Standard_D2s_v3` | +| Up to 1,000 | 2 vCPU, 8 GB memory | 1-2 | `n1-standard-2` | `t3.large` | `Standard_D2s_v3` | **Footnotes**: diff --git a/docs/admin/architectures/index.md b/docs/admin/architectures/index.md index b72edbe6d3cd0..e6a4bb49e23ce 100644 --- a/docs/admin/architectures/index.md +++ b/docs/admin/architectures/index.md @@ -236,11 +236,11 @@ for more details. ### Control plane: provisionerd -Each provisioner can run a single concurrent workspace build. For example, -running 10 provisioner containers will allow 10 users to start workspaces at the -same time. +Each external provisioner can run a single concurrent workspace build. For +example, running 10 provisioner containers will allow 10 users to start +workspaces at the same time. -By default, the Coder server runs built-in provisioner daemons, but the +By default, the Coder server runs 3 built-in provisioner daemons, but the _Enterprise_ Coder release allows for running external provisioners to separate the load caused by workspace provisioning on the `coderd` nodes. From cf29c2647003eacb5d89bae4e521b83b3970b2d1 Mon Sep 17 00:00:00 2001 From: Marcin Tojek Date: Thu, 14 Mar 2024 11:14:37 +0100 Subject: [PATCH 22/39] WIP --- docs/admin/architectures/1k-users.md | 10 ++++++---- docs/admin/architectures/2k-users.md | 6 +++--- docs/admin/architectures/3k-users.md | 6 +++--- 3 files changed, 12 insertions(+), 10 deletions(-) diff --git a/docs/admin/architectures/1k-users.md b/docs/admin/architectures/1k-users.md index 25fef1e89f3dc..8c4bf576fd016 100644 --- a/docs/admin/architectures/1k-users.md +++ b/docs/admin/architectures/1k-users.md @@ -23,9 +23,9 @@ tech startups, educational units, or small to mid-sized enterprises. 
 ### Provisioner nodes
 
-| Users       | Node capacity        | Replicas                 | GCP              | AWS          | Azure             |
-| ----------- | -------------------- | ------------------------ | ---------------- | ------------ | ----------------- |
-| Up to 1,000 | 8 vCPU, 32 GB memory | 2 / 30 provisioners each | `t2d-standard-8` | `t3.2xlarge` | `Standard_D8s_v3` |
+| Users       | Node capacity        | Replicas                       | GCP              | AWS          | Azure             |
+| ----------- | -------------------- | ------------------------------ | ---------------- | ------------ | ----------------- |
+| Up to 1,000 | 8 vCPU, 32 GB memory | 2 nodes / 30 provisioners each | `t2d-standard-8` | `t3.2xlarge` | `Standard_D8s_v3` |
 
 **Footnotes**:
 
@@ -39,7 +39,9 @@ tech startups, educational units, or small to mid-sized enterprises.
 
 **Footnotes**:
 
-- Assumed that a workspace user needs 2 GB memory to perform
+- Assumed that a workspace user needs at minimum 2 GB of memory to perform. We
+  recommend against over-provisioning memory for developer workloads, as this
+  may lead to OOMKiller invocations.
 - Maximum number of Kubernetes workspace pods per node: 256
 
 ### Database nodes
diff --git a/docs/admin/architectures/2k-users.md b/docs/admin/architectures/2k-users.md
index b86d85c93afc0..72cf8b48fcc7b 100644
--- a/docs/admin/architectures/2k-users.md
+++ b/docs/admin/architectures/2k-users.md
@@ -23,9 +23,9 @@ enabling it for deployment reliability.
### Provisioner nodes -| Users | Node capacity | Replicas | GCP | AWS | Azure | -| ----------- | -------------------- | ------------------------ | ---------------- | ------------ | ----------------- | -| Up to 2,000 | 8 vCPU, 32 GB memory | 4 / 30 provisioners each | `t2d-standard-8` | `t3.2xlarge` | `Standard_D8s_v3` | +| Users | Node capacity | Replicas | GCP | AWS | Azure | +| ----------- | -------------------- | ------------------------------ | ---------------- | ------------ | ----------------- | +| Up to 2,000 | 8 vCPU, 32 GB memory | 4 nodes / 30 provisioners each | `t2d-standard-8` | `t3.2xlarge` | `Standard_D8s_v3` | **Footnotes**: diff --git a/docs/admin/architectures/3k-users.md b/docs/admin/architectures/3k-users.md index 975d91baf417d..ede0c7fde1d23 100644 --- a/docs/admin/architectures/3k-users.md +++ b/docs/admin/architectures/3k-users.md @@ -32,9 +32,9 @@ purposes. ### Workspace nodes -| Users | Node capacity | Replicas | GCP | AWS | Azure | -| ----------- | -------------------- | ------------------------ | ---------------- | ------------ | ----------------- | -| Up to 3,000 | 8 vCPU, 32 GB memory | 256 / 12 workspaces each | `t2d-standard-8` | `t3.2xlarge` | `Standard_D8s_v3` | +| Users | Node capacity | Replicas | GCP | AWS | Azure | +| ----------- | -------------------- | ------------------------------ | ---------------- | ------------ | ----------------- | +| Up to 3,000 | 8 vCPU, 32 GB memory | 256 nodes / 12 workspaces each | `t2d-standard-8` | `t3.2xlarge` | `Standard_D8s_v3` | **Footnotes**: From 18bd4d2692c735de7d6ae4e9be649fd1690ed945 Mon Sep 17 00:00:00 2001 From: Marcin Tojek Date: Thu, 14 Mar 2024 11:25:36 +0100 Subject: [PATCH 23/39] WIP --- docs/admin/architectures/index.md | 13 ++++++------- 1 file changed, 6 insertions(+), 7 deletions(-) diff --git a/docs/admin/architectures/index.md b/docs/admin/architectures/index.md index e6a4bb49e23ce..090a389922d7f 100644 --- a/docs/admin/architectures/index.md +++ 
b/docs/admin/architectures/index.md @@ -196,8 +196,7 @@ deployment stability. #### CPU and memory usage -The memory consumption may increase with enabled agent stats collection by the -Prometheus metrics aggregator (optional). +Enabling agent stats collection (optional) may increase memory consumption. Enabling direct connections between users and workspace agents (apps or SSH traffic) can help prevent an increase in CPU usage. It is recommended to keep @@ -267,12 +266,12 @@ developer needs based on the workspace build queuing time. To determine workspace resource limits and keep the best developer experience for workspace users, administrators must be aware of a few assumptions. -- Workspace pods run on the same Kubernetes cluster, but possible in a different - namespace or a node pool. +- Workspace pods run on the same Kubernetes cluster, but possibly in a different + namespace or on a separate set of nodes. - Workspace limits (per workspace user): - - Evaluate the workspace utilization pattern. For instance, a regular web - development does not require high CPU capacity all the time, but only during - project builds or load tests. + - Evaluate the workspace utilization pattern. For instance, web application + development does not require high CPU capacity at all times, but will spike + during builds or testing. - Evaluate minimal limits for single workspace. Include in the calculation requirements for Coder agent running in an idle workspace - 0.1 vCPU and 256 MB. 
For instance, developers can choose between 0.5-8 vCPUs, and 1-16 GB

From d36e893a9750cd63cc874e8cb379ed8801d047c2 Mon Sep 17 00:00:00 2001
From: Marcin Tojek
Date: Thu, 14 Mar 2024 11:27:50 +0100
Subject: [PATCH 24/39] WIP

---
 docs/admin/architectures/index.md | 16 +++++++++-------
 1 file changed, 9 insertions(+), 7 deletions(-)

diff --git a/docs/admin/architectures/index.md b/docs/admin/architectures/index.md
index 090a389922d7f..b5beaa9e4edda 100644
--- a/docs/admin/architectures/index.md
+++ b/docs/admin/architectures/index.md
@@ -297,14 +297,16 @@ number of workspaces or active users. It's important to note that as new users
 onboard, the autoscaling configuration should account for ongoing workspaces.
 
 Scaling down workspace nodes to zero is not recommended, as it will result in
-longer wait times for workspace provisioning by users.
+longer wait times for workspace provisioning by users. However, this may be
+necessary for workspaces with special resource requirements (e.g. GPUs) that
+incur significant cost overheads.
 
 ### Data plane: External database
 
-While running in production, Coder deployment requires an access to an external
-PostgreSQL database. Depending on the scale of the user-base, workspace
-activity, and High Availability requirements, the amount of CPU and memory
-resources may differ.
+While running in production, Coder requires access to an external PostgreSQL
+database. Depending on the scale of the user-base, workspace activity, and High
+Availability requirements, the amount of CPU and memory resources required by
+Coder's database may differ.
 
 #### Scaling formula
 
@@ -322,8 +324,8 @@ considerations:
 - Enable _High Availability_ mode for database engine for large scale
   deployments.
 
-With enabled database encryption feature in Coder, consider allocating an
-additional CPU core to every `coderd` replica.
+If you enable [database encryption](../encryption.md) in Coder, consider
+allocating an additional CPU core to every `coderd` replica.
#### Performance optimization guidelines From 066d6ff03c66e027a0c5ec6a6a7d914f03b417a3 Mon Sep 17 00:00:00 2001 From: Marcin Tojek Date: Thu, 14 Mar 2024 11:30:44 +0100 Subject: [PATCH 25/39] WIP --- docs/admin/architectures/index.md | 9 +++++---- 1 file changed, 5 insertions(+), 4 deletions(-) diff --git a/docs/admin/architectures/index.md b/docs/admin/architectures/index.md index b5beaa9e4edda..db3a1223b3e75 100644 --- a/docs/admin/architectures/index.md +++ b/docs/admin/architectures/index.md @@ -191,8 +191,8 @@ guidance on optimal configurations. A reasonable approach involves using scaling formulas based on factors like CPU, memory, and the number of users. While the minimum requirements specify 1 CPU core and 2 GB of memory per -`coderd` replica, it is recommended to allocate additional resources to ensure -deployment stability. +`coderd` replica, it is recommended to allocate additional resources depending +on the workload size to ensure deployment stability. #### CPU and memory usage @@ -332,7 +332,8 @@ allocating an additional CPU core to every `coderd` replica. We provide the following general recommendations for PostgreSQL settings: - Increase number of vCPU if CPU utilization or database latency is high. -- Allocate extra GB memory if database performance is poor and CPU utilization - is low. +- Allocate extra memory if database performance is poor and CPU utilization is + low. For maximum performance, the entire database should be able to fit in + RAM. - Utilize faster disk options such as SSDs or NVMe drives for optimal performance enhancement. 
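The baseline stated above — at least 1 CPU core and 2 GB of memory per `coderd` replica, plus one extra core per replica when database encryption is enabled — lends itself to a quick back-of-the-envelope check. A minimal sketch; the function name and the 2x headroom multiplier are illustrative assumptions, not official guidance:

```python
# Back-of-the-envelope control-plane sizing from the guidance above:
# at least 1 vCPU and 2 GB of memory per coderd replica, plus one extra
# vCPU per replica when database encryption is enabled. The helper name
# and the headroom multiplier are illustrative assumptions.

def coderd_minimums(replicas: int, encryption: bool = False, headroom: int = 2):
    """Return (total_vcpus, total_memory_gb) across all coderd replicas."""
    vcpus = (1 + (1 if encryption else 0)) * replicas * headroom
    memory_gb = 2 * replicas * headroom
    return vcpus, memory_gb

print(coderd_minimums(2))                   # 2 replicas, no encryption -> (4, 8)
print(coderd_minimums(2, encryption=True))  # extra core per replica -> (8, 8)
```

With 2 replicas and the doubled headroom, this lands in the same range as the 2 vCPU / 8 GB nodes recommended for the 1,000-user architecture.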
From d774ed5f0d2a5c8f1286bc66e5057a5f96f3f869 Mon Sep 17 00:00:00 2001
From: Marcin Tojek
Date: Thu, 14 Mar 2024 13:00:20 +0100
Subject: [PATCH 26/39] WIP: long lived

---
 docs/admin/architectures/index.md | 9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/docs/admin/architectures/index.md b/docs/admin/architectures/index.md
index db3a1223b3e75..a594ffb243159 100644
--- a/docs/admin/architectures/index.md
+++ b/docs/admin/architectures/index.md
@@ -221,11 +221,12 @@ For a reliable Coder deployment dealing with medium to high loads, it's
 important that API calls for workspace/template queries and workspace build
 operations respond within 300 ms. However, API template insights calls, which
 involve browsing workspace agent stats and user activity data, may require more
-time.
+time. Moreover, the Coder API exposes long-lived WebSocket connections for the
+Web Terminal (bidirectional) and workspace events/logs (unidirectional).
 
-Also, if the Coder deployment expects traffic from developers spread across the
-globe, keep in mind that customer-facing latency might be higher because of the
-distance between users and the load balancer.
+If the Coder deployment expects traffic from developers spread across the globe,
+be aware that customer-facing latency might be higher because of the distance
+between users and the load balancer.
 
 **Node Autoscaling**

From 395d300fc8d30387270e828628e33be0d59cfb93 Mon Sep 17 00:00:00 2001
From: Marcin Tojek
Date: Thu, 14 Mar 2024 13:14:46 +0100
Subject: [PATCH 27/39] CODER_BLOCK_DIRECT

---
 docs/admin/architectures/index.md | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/docs/admin/architectures/index.md b/docs/admin/architectures/index.md
index a594ffb243159..d7907ee7bb1aa 100644
--- a/docs/admin/architectures/index.md
+++ b/docs/admin/architectures/index.md
@@ -200,7 +200,8 @@ Enabling agent stats collection (optional) may increase memory consumption.
Enabling direct connections between users and workspace agents (apps or SSH traffic) can help prevent an increase in CPU usage. It is recommended to keep -this option enabled unless there are compelling reasons to disable it. +[this option enabled](../../cli.md#--disable-direct-connections) unless there +are compelling reasons to disable it. Inactive users do not consume Coder resources. From 9ae4b61972f91cae8f734915896697e3d5290c58 Mon Sep 17 00:00:00 2001 From: Marcin Tojek Date: Thu, 14 Mar 2024 13:18:26 +0100 Subject: [PATCH 28/39] scale.md --- docs/admin/scale.md | 13 +++++++------ 1 file changed, 7 insertions(+), 6 deletions(-) diff --git a/docs/admin/scale.md b/docs/admin/scale.md index 43ca8f0d0a781..eaa64b5951bf5 100644 --- a/docs/admin/scale.md +++ b/docs/admin/scale.md @@ -106,12 +106,13 @@ For example, to support 120 concurrent workspace builds: > Note: the below information is for reference purposes only, and are not > intended to be used as guidelines for infrastructure sizing. 
-| Environment | Coder CPU | Coder RAM | Coder Replicas | Database | Users | Concurrent builds | Concurrent connections (Terminal/SSH) | Coder Version | Last tested | -| ---------------- | --------- | --------- | -------------- | ---------------- | ----- | ----------------- | ------------------------------------- | ------------- | ------------ | -| Kubernetes (GKE) | 3 cores | 12 GB | 1 | db-f1-micro | 200 | 3 | 200 simulated | `v0.24.1` | Jun 26, 2023 | -| Kubernetes (GKE) | 4 cores | 8 GB | 1 | db-custom-1-3840 | 1500 | 20 | 1,500 simulated | `v0.24.1` | Jun 27, 2023 | -| Kubernetes (GKE) | 2 cores | 4 GB | 1 | db-custom-1-3840 | 500 | 20 | 500 simulated | `v0.27.2` | Jul 27, 2023 | -| Kubernetes (GKE) | 2 cores | 8 GB | 2 | db-custom-2-7680 | 1000 | 20 | 1000 simulated | `v2.2.1` | Oct 9, 2023 | +| Environment | Coder CPU | Coder RAM | Coder Replicas | Database | Users | Concurrent builds | Concurrent connections (Terminal/SSH) | Coder Version | Last tested | +| ---------------- | --------- | --------- | -------------- | ----------------- | ----- | ----------------- | ------------------------------------- | ------------- | ------------ | +| Kubernetes (GKE) | 3 cores | 12 GB | 1 | db-f1-micro | 200 | 3 | 200 simulated | `v0.24.1` | Jun 26, 2023 | +| Kubernetes (GKE) | 4 cores | 8 GB | 1 | db-custom-1-3840 | 1500 | 20 | 1,500 simulated | `v0.24.1` | Jun 27, 2023 | +| Kubernetes (GKE) | 2 cores | 4 GB | 1 | db-custom-1-3840 | 500 | 20 | 500 simulated | `v0.27.2` | Jul 27, 2023 | +| Kubernetes (GKE) | 2 cores | 8 GB | 2 | db-custom-2-7680 | 1000 | 20 | 1000 simulated | `v2.2.1` | Oct 9, 2023 | +| Kubernetes (GKE) | 4 cores | 16 GB | 2 | db-custom-8-30720 | 2000 | na. (provisioned) | 2000 simulated | `v2.8.4` | Feb 28, 2024 | > Note: a simulated connection reads and writes random data at 40KB/s per > connection. 
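Since each simulated connection reads and writes random data at 40 KB/s, the aggregate traffic behind each table row is simple to estimate. A minimal sketch; the function name is an illustrative assumption:

```python
# Aggregate traffic implied by the simulated-connection note above:
# each simulated connection reads and writes random data at 40 KB/s.
# The helper name is an illustrative assumption.

def simulated_traffic_mb_s(connections: int, kb_per_conn: int = 40) -> float:
    """One-directional throughput in MB/s across all simulated connections."""
    return connections * kb_per_conn / 1000

print(simulated_traffic_mb_s(2000))  # 2,000 simulated connections -> 80.0
```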
From 701a205dc816cd359bcc1eaacc5708f219b1358e Mon Sep 17 00:00:00 2001 From: Marcin Tojek Date: Thu, 14 Mar 2024 13:27:24 +0100 Subject: [PATCH 29/39] mention proxies --- docs/admin/architectures/index.md | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/docs/admin/architectures/index.md b/docs/admin/architectures/index.md index d7907ee7bb1aa..d5a585c4e0fa0 100644 --- a/docs/admin/architectures/index.md +++ b/docs/admin/architectures/index.md @@ -227,7 +227,8 @@ Terminal (bidirectional), and Workspace events/logs (unidirectional). If the Coder deployment expects traffic from developers spread across the globe, be aware that customer-facing latency might be higher because of the distance -between users and the load balancer. +between users and the load balancer. Fortunately, the latency can be improved +with a deployment of Coder [workspace proxies](../workspace-proxies.md). **Node Autoscaling** From 088395a24c23b1281dec8e61e780bd6099ef7c4c Mon Sep 17 00:00:00 2001 From: Marcin Tojek Date: Thu, 14 Mar 2024 13:31:26 +0100 Subject: [PATCH 30/39] Link to CLI for now --- docs/admin/architectures/index.md | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/docs/admin/architectures/index.md b/docs/admin/architectures/index.md index d5a585c4e0fa0..2d44026183e47 100644 --- a/docs/admin/architectures/index.md +++ b/docs/admin/architectures/index.md @@ -196,7 +196,8 @@ on the workload size to ensure deployment stability. #### CPU and memory usage -Enabling agent stats collection (optional) may increase memory consumption. +Enabling [agent stats collection](../../cli.md#--prometheus-collect-agent-stats) +(optional) may increase memory consumption. Enabling direct connections between users and workspace agents (apps or SSH traffic) can help prevent an increase in CPU usage. 
It is recommended to keep From 34c4903659ecaacaf76e42c4442026b9de89969c Mon Sep 17 00:00:00 2001 From: Marcin Tojek Date: Fri, 15 Mar 2024 11:45:11 +0100 Subject: [PATCH 31/39] WIP --- docs/admin/architectures/2k-users.md | 2 +- docs/admin/architectures/3k-users.md | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/admin/architectures/2k-users.md b/docs/admin/architectures/2k-users.md index 72cf8b48fcc7b..894c511ed8ec1 100644 --- a/docs/admin/architectures/2k-users.md +++ b/docs/admin/architectures/2k-users.md @@ -56,4 +56,4 @@ enabling it for deployment reliability. **Footnotes**: - Consider adding more replicas if the workspace activity is higher than 500 - workspace builds per day. + workspace builds per day or to achieve higher RPS. diff --git a/docs/admin/architectures/3k-users.md b/docs/admin/architectures/3k-users.md index ede0c7fde1d23..d7f42d95e6f68 100644 --- a/docs/admin/architectures/3k-users.md +++ b/docs/admin/architectures/3k-users.md @@ -53,4 +53,4 @@ purposes. **Footnotes**: - Consider adding more replicas if the workspace activity is higher than 1500 - workspace builds per day. + workspace builds per day or to achieve higher RPS. From 13dee4cb944bd18e0631cc40baadf7140b9ab097 Mon Sep 17 00:00:00 2001 From: Marcin Tojek Date: Fri, 15 Mar 2024 11:50:21 +0100 Subject: [PATCH 32/39] WIP: coderd each --- docs/admin/architectures/1k-users.md | 6 +++--- docs/admin/architectures/2k-users.md | 6 +++--- docs/admin/architectures/3k-users.md | 6 +++--- 3 files changed, 9 insertions(+), 9 deletions(-) diff --git a/docs/admin/architectures/1k-users.md b/docs/admin/architectures/1k-users.md index 8c4bf576fd016..158eb10392e79 100644 --- a/docs/admin/architectures/1k-users.md +++ b/docs/admin/architectures/1k-users.md @@ -12,9 +12,9 @@ tech startups, educational units, or small to mid-sized enterprises. 
### Coderd nodes -| Users | Node capacity | Replicas | GCP | AWS | Azure | -| ----------- | ------------------- | -------- | --------------- | ---------- | ----------------- | -| Up to 1,000 | 2 vCPU, 8 GB memory | 1-2 | `n1-standard-2` | `t3.large` | `Standard_D2s_v3` | +| Users | Node capacity | Replicas | GCP | AWS | Azure | +| ----------- | ------------------- | ------------------- | --------------- | ---------- | ----------------- | +| Up to 1,000 | 2 vCPU, 8 GB memory | 1-2 / 1 coderd each | `n1-standard-2` | `t3.large` | `Standard_D2s_v3` | **Footnotes**: diff --git a/docs/admin/architectures/2k-users.md b/docs/admin/architectures/2k-users.md index 894c511ed8ec1..04e5332bfdfcd 100644 --- a/docs/admin/architectures/2k-users.md +++ b/docs/admin/architectures/2k-users.md @@ -17,9 +17,9 @@ enabling it for deployment reliability. ### Coderd nodes -| Users | Node capacity | Replicas | GCP | AWS | Azure | -| ----------- | -------------------- | -------- | --------------- | ----------- | ----------------- | -| Up to 2,000 | 4 vCPU, 16 GB memory | 2 | `n1-standard-4` | `t3.xlarge` | `Standard_D4s_v3` | +| Users | Node capacity | Replicas | GCP | AWS | Azure | +| ----------- | -------------------- | ----------------------- | --------------- | ----------- | ----------------- | +| Up to 2,000 | 4 vCPU, 16 GB memory | 2 nodes / 1 coderd each | `n1-standard-4` | `t3.xlarge` | `Standard_D4s_v3` | ### Provisioner nodes diff --git a/docs/admin/architectures/3k-users.md b/docs/admin/architectures/3k-users.md index d7f42d95e6f68..13ee7ce5b2a42 100644 --- a/docs/admin/architectures/3k-users.md +++ b/docs/admin/architectures/3k-users.md @@ -13,9 +13,9 @@ purposes. 
### Coderd nodes -| Users | Node capacity | Replicas | GCP | AWS | Azure | -| ----------- | -------------------- | -------- | --------------- | ----------- | ----------------- | -| Up to 3,000 | 8 vCPU, 32 GB memory | 4 | `n1-standard-4` | `t3.xlarge` | `Standard_D4s_v3` | +| Users | Node capacity | Replicas | GCP | AWS | Azure | +| ----------- | -------------------- | ----------------- | --------------- | ----------- | ----------------- | +| Up to 3,000 | 8 vCPU, 32 GB memory | 4 / 1 coderd each | `n1-standard-4` | `t3.xlarge` | `Standard_D4s_v3` | ### Provisioner nodes From bb268004c765e81d2299bf784c86e5784d89e1a7 Mon Sep 17 00:00:00 2001 From: Marcin Tojek Date: Fri, 15 Mar 2024 11:52:31 +0100 Subject: [PATCH 33/39] WIP --- docs/admin/architectures/2k-users.md | 4 ++-- docs/admin/architectures/3k-users.md | 3 ++- docs/admin/architectures/index.md | 8 +++++--- 3 files changed, 9 insertions(+), 6 deletions(-) diff --git a/docs/admin/architectures/2k-users.md b/docs/admin/architectures/2k-users.md index 04e5332bfdfcd..04ff5bf4ec19a 100644 --- a/docs/admin/architectures/2k-users.md +++ b/docs/admin/architectures/2k-users.md @@ -10,8 +10,8 @@ clusters. **Target load**: API: up to 300 RPS -**High Availability**: The mode is _disabled_, but administrators may consider -enabling it for deployment reliability. +**High Availability**: The mode is _enabled_; multiple replicas provide higher +deployment reliability under load. ## Hardware recommendations diff --git a/docs/admin/architectures/3k-users.md b/docs/admin/architectures/3k-users.md index 13ee7ce5b2a42..13a31908a7b5a 100644 --- a/docs/admin/architectures/3k-users.md +++ b/docs/admin/architectures/3k-users.md @@ -26,7 +26,8 @@ purposes. **Footnotes**: - An external provisioner is deployed as Kubernetes pod. -- It is strongly discouraged to run provisioner daemons on `coderd` nodes. +- It is strongly discouraged to run provisioner daemons on `coderd` nodes at + this level of scale. 
 - Separate provisioners into different namespaces in favor of zero-trust or
   multi-cloud deployments.
diff --git a/docs/admin/architectures/index.md b/docs/admin/architectures/index.md
index 2d44026183e47..0a9f5f6f1512d 100644
--- a/docs/admin/architectures/index.md
+++ b/docs/admin/architectures/index.md
@@ -1,7 +1,7 @@
 # Reference Architectures
 
 This document provides prescriptive solutions and reference architectures to
-support successful deployments of up to 2000 users and outlines at a high-level
+support successful deployments of up to 3000 users and outlines at a high-level
 the methodology currently used to scale-test Coder.
 
 ## General concepts
@@ -168,7 +168,7 @@ Provisionerd:
 Database:
 
 - Median CPU utilization is 80%, with a significant portion dedicated to writing
-  metadata.
+  workspace agent metadata.
 - Memory utilization averages at 40%.
 - `write_ops_count` between 6.7 and 8.4 operations per second.
@@ -215,7 +215,9 @@ When determining scaling requirements, consider the following factors:
 - API latency/response time: Monitor API latency and response times to ensure
   optimal performance under varying loads.
 - Average number of HTTP requests: Track the average number of HTTP requests to
-  gauge system usage and identify potential bottlenecks.
+  gauge system usage and identify potential bottlenecks.
+- The number of proxied connections: for a very high number of proxied
+  connections, more memory is required.
 
 **HTTP API latency**

From 40def43a9474d5fd78daf64f0cb616b09d37a001 Mon Sep 17 00:00:00 2001
From: Marcin Tojek
Date: Fri, 15 Mar 2024 11:55:56 +0100
Subject: [PATCH 34/39] WIP

---
 docs/admin/architectures/index.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/docs/admin/architectures/index.md b/docs/admin/architectures/index.md
index 0a9f5f6f1512d..e13b27961d1de 100644
--- a/docs/admin/architectures/index.md
+++ b/docs/admin/architectures/index.md
@@ -154,8 +154,8 @@ seconds.
 Here are the resulting metrics:
 
 Coder:
 
-- Median CPU usage for _coderd_: 3 vCPU, peaking at 3.7 vCPU during dashboard
-  tests.
+- Median CPU usage for _coderd_: 3 vCPU, peaking at 3.7 vCPU while all tests are
+  running concurrently.
 - Median API request rate: 350 RPS during dashboard tests, 250 RPS during Web
   Terminal and workspace apps tests.
 - 2000 agent API connections with latency: p90 at 60 ms, p95 at 220 ms.

From 19ea38167a76576d6996692241d6b911f193647a Mon Sep 17 00:00:00 2001
From: Marcin Tojek
Date: Fri, 15 Mar 2024 12:00:40 +0100
Subject: [PATCH 35/39] WIP

---
 docs/admin/architectures/index.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/docs/admin/architectures/index.md b/docs/admin/architectures/index.md
index e13b27961d1de..387e1f060e8aa 100644
--- a/docs/admin/architectures/index.md
+++ b/docs/admin/architectures/index.md
@@ -325,8 +325,8 @@ considerations:
   provisioners.
 - Storage size depends on user activity, workspace builds, log verbosity,
   overhead on database encryption, etc.
-- Allocate an additional CPU core to the database instance for every 1000 active
-  users.
+- Allocate two additional CPU cores to the database instance for every 1000
+  active users.
 - Enable _High Availability_ mode for database engine for large scale
   deployments.
From 627e26fc8bb4ca1544db79ba5234868c5edd9707 Mon Sep 17 00:00:00 2001 From: Marcin Tojek Date: Fri, 15 Mar 2024 12:02:18 +0100 Subject: [PATCH 36/39] WIP --- docs/admin/scale.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/admin/scale.md b/docs/admin/scale.md index eaa64b5951bf5..58fcd93373dad 100644 --- a/docs/admin/scale.md +++ b/docs/admin/scale.md @@ -112,7 +112,7 @@ For example, to support 120 concurrent workspace builds: | Kubernetes (GKE) | 4 cores | 8 GB | 1 | db-custom-1-3840 | 1500 | 20 | 1,500 simulated | `v0.24.1` | Jun 27, 2023 | | Kubernetes (GKE) | 2 cores | 4 GB | 1 | db-custom-1-3840 | 500 | 20 | 500 simulated | `v0.27.2` | Jul 27, 2023 | | Kubernetes (GKE) | 2 cores | 8 GB | 2 | db-custom-2-7680 | 1000 | 20 | 1000 simulated | `v2.2.1` | Oct 9, 2023 | -| Kubernetes (GKE) | 4 cores | 16 GB | 2 | db-custom-8-30720 | 2000 | na. (provisioned) | 2000 simulated | `v2.8.4` | Feb 28, 2024 | +| Kubernetes (GKE) | 4 cores | 16 GB | 2 | db-custom-8-30720 | 2000 | 50 | 2000 simulated | `v2.8.4` | Feb 28, 2024 | > Note: a simulated connection reads and writes random data at 40KB/s per > connection. From 8d87b346786201431f7e19e83947b62b809c9b3b Mon Sep 17 00:00:00 2001 From: Marcin Tojek Date: Fri, 15 Mar 2024 12:05:15 +0100 Subject: [PATCH 37/39] WIP --- docs/admin/architectures/index.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/admin/architectures/index.md b/docs/admin/architectures/index.md index 387e1f060e8aa..89f2435260187 100644 --- a/docs/admin/architectures/index.md +++ b/docs/admin/architectures/index.md @@ -341,5 +341,5 @@ We provide the following general recommendations for PostgreSQL settings: - Allocate extra memory if database performance is poor and CPU utilization is low. For maximum performance, the entire database should be able to fit in RAM. -- Utilize faster disk options such as SSDs or NVMe drives for optimal - performance enhancement. 
+- Utilize faster disk options (higher IOPS) such as SSDs or NVMe drives for + optimal performance enhancement and possibly reduce database load. From d0c9fd66405ab45037378b85631ba3f443cf78fe Mon Sep 17 00:00:00 2001 From: Marcin Tojek Date: Fri, 15 Mar 2024 12:07:23 +0100 Subject: [PATCH 38/39] WIP --- docs/admin/architectures/index.md | 5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-) diff --git a/docs/admin/architectures/index.md b/docs/admin/architectures/index.md index 89f2435260187..c4f2c21ef7aac 100644 --- a/docs/admin/architectures/index.md +++ b/docs/admin/architectures/index.md @@ -338,8 +338,7 @@ allocating an additional CPU core to every `coderd` replica. We provide the following general recommendations for PostgreSQL settings: - Increase number of vCPU if CPU utilization or database latency is high. -- Allocate extra memory if database performance is poor and CPU utilization is - low. For maximum performance, the entire database should be able to fit in - RAM. +- Allocate extra memory if database performance is poor, CPU utilization is low, + and memory utilization is high. - Utilize faster disk options (higher IOPS) such as SSDs or NVMe drives for optimal performance enhancement and possibly reduce database load. From a34ae19bdca225df65ac84443a22f9832bd44677 Mon Sep 17 00:00:00 2001 From: Marcin Tojek Date: Fri, 15 Mar 2024 14:59:56 +0100 Subject: [PATCH 39/39] Observability --- docs/admin/architectures/3k-users.md | 5 +++++ 1 file changed, 5 insertions(+) diff --git a/docs/admin/architectures/3k-users.md b/docs/admin/architectures/3k-users.md index 13a31908a7b5a..093ec21c5c52c 100644 --- a/docs/admin/architectures/3k-users.md +++ b/docs/admin/architectures/3k-users.md @@ -9,6 +9,11 @@ on-premises network and cloud deployments. PostgreSQL service, and all Coder observability features enabled for operational purposes. 
+**Observability**: Deploy monitoring solutions to gather Prometheus metrics and +visualize them with Grafana to gain detailed insights into infrastructure and +application behavior. This allows operators to respond quickly to incidents and +continuously improve the reliability and performance of the platform. + ## Hardware recommendations ### Coderd nodes
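The provisioner sizing rule that recurs throughout this series — each external provisioner runs a single concurrent workspace build — reduces build-capacity planning to simple arithmetic. A minimal sketch; the function name is an illustrative assumption, not part of the Coder codebase:

```python
# Concurrent workspace-build capacity, per the rule that each external
# provisioner runs exactly one concurrent build. The helper name is an
# illustrative assumption.

def concurrent_builds(nodes: int, provisioners_per_node: int) -> int:
    """Total workspace builds that can run at the same time."""
    return nodes * provisioners_per_node

# e.g. the 1,000-user architecture: 2 nodes with 30 provisioners each
print(concurrent_builds(2, 30))  # -> 60
```

The same arithmetic gives 120 concurrent builds for the 2,000-user architecture's 4 nodes of 30 provisioners each.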