|
| 1 | +# Reference architectures |
| 2 | + |
| 3 | +This document provides prescriptive solutions and reference architectures to |
| 4 | +support successful deployments of up to 2000 users and outlines at a high level |
| 5 | +the methodology currently used to scale-test Coder. |
| 6 | + |
| 7 | +## General concepts |
| 8 | + |
| 9 | +This section outlines core concepts and terminology essential for understanding |
| 10 | +Coder's architecture and deployment strategies. |
| 11 | + |
| 12 | +### Administrator |
| 13 | + |
| 14 | +An administrator is a user role within the Coder platform with elevated |
| 15 | +privileges. Admins have access to administrative functions such as user |
| 16 | +management, template definitions, insights, and deployment configuration. |
| 17 | + |
| 18 | +### Coder |
| 19 | + |
| 20 | +Coder, also known as _coderd_, is the main service. We recommend deploying it |
| 21 | +with multiple replicas to ensure high availability. It provides an API for |
| 22 | +managing workspaces and templates. Each _coderd_ replica can host multiple |
| 23 | +[provisioners](#provisioner). |
| 24 | + |
| 25 | +### User |
| 26 | + |
| 27 | +A user is an individual who utilizes the Coder platform to develop, test, and |
| 28 | +deploy applications using workspaces. Users can select available templates to |
| 29 | +provision workspaces. They interact with Coder using the web interface, the CLI |
| 30 | +tool, or by calling API methods directly. |
| 31 | + |
| 32 | +### Workspace |
| 33 | + |
| 34 | +A workspace refers to an isolated development environment where users can write, |
| 35 | +build, and run code. Workspaces are fully configurable and can be tailored to |
| 36 | +specific project requirements, providing developers with a consistent and |
| 37 | +efficient development environment. Workspaces can be autostarted and |
| 38 | +autostopped, enabling efficient resource management. |
| 39 | + |
| 40 | +Users can connect to workspaces using SSH or via workspace applications like |
| 41 | +`code-server`, facilitating collaboration and remote access. Additionally, |
| 42 | +workspaces can be parameterized, allowing users to customize settings and |
| 43 | +configurations based on their unique needs. Workspaces are instantiated using |
| 44 | +Coder templates and deployed on resources created by provisioners. |
| 45 | + |
| 46 | +### Template |
| 47 | + |
| 48 | +A template in Coder is a predefined configuration for creating workspaces. |
| 49 | +Templates streamline the process of workspace creation by providing |
| 50 | +pre-configured settings, tooling, and dependencies. They are built by template |
| 51 | +administrators on top of Terraform, allowing for efficient management of |
| 52 | +infrastructure resources. Additionally, templates can utilize Coder modules to |
| 53 | +leverage existing features shared with other templates, enhancing flexibility |
| 54 | +and consistency across deployments. Templates describe provisioning rules for |
| 55 | +infrastructure resources offered by Terraform providers. |
| 56 | + |
| 57 | +### Workspace Proxy |
| 58 | + |
| 59 | +A workspace proxy serves as a relay connection option for developers connecting |
| 60 | +to their workspace over SSH, a workspace app, or through port forwarding. It |
| 61 | +helps reduce network latency for geo-distributed teams by minimizing the |
| 62 | +distance network traffic needs to travel. Notably, workspace proxies do not |
| 63 | +handle dashboard connections or API calls. |
| 64 | + |
| 65 | +### Provisioner |
| 66 | + |
| 67 | +Provisioners in Coder execute Terraform during workspace and template builds. |
| 68 | +While the platform includes built-in provisioner daemons by default, there are |
| 69 | +advantages to employing external provisioners. These external daemons provide |
| 70 | +secure build environments and reduce server load, improving performance and |
| 71 | +scalability. Each provisioner can handle a single concurrent workspace build, |
| 72 | +allowing for efficient resource allocation and workload management. |
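
Because each provisioner handles one build at a time, the number of provisioners bounds how many builds can run in parallel. The following is a rough back-of-the-envelope sketch, not a Coder API: the function name `estimate_build_waves` and the 60-second build duration are hypothetical, and it assumes that all builds are queued at once and take the same time.

```python
import math


def estimate_build_waves(pending_builds: int, provisioners: int) -> int:
    """Return how many sequential "waves" are needed to drain the build queue.

    Assumes each provisioner runs exactly one build at a time.
    """
    if provisioners <= 0:
        raise ValueError("at least one provisioner is required")
    return math.ceil(pending_builds / provisioners)


# Illustrative values only: 2000 queued builds, 50 provisioners,
# and a hypothetical 60 seconds per build.
ASSUMED_BUILD_SECONDS = 60
waves = estimate_build_waves(pending_builds=2000, provisioners=50)
total_minutes = waves * ASSUMED_BUILD_SECONDS / 60
print(f"{waves} waves, roughly {total_minutes:.0f} minutes in total")
```

Real build times vary by template and infrastructure, so treat the result as a planning aid rather than a prediction.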
| 73 | + |
| 74 | +### Registry |
| 75 | + |
| 76 | +The Coder Registry is a platform where you can find starter templates and |
| 77 | +_Modules_ for various cloud services and platforms. |
| 78 | + |
| 79 | +Templates help create self-service development environments using |
| 80 | +Terraform-defined infrastructure, while _Modules_ simplify template creation by |
| 81 | +providing common features like workspace applications, third-party integrations, |
| 82 | +or helper scripts. |
| 83 | + |
| 84 | +Please note that the Registry is a hosted service and isn't available for |
| 85 | +offline use. |
| 86 | + |
| 87 | +## Scale-testing methodology |
| 88 | + |
| 89 | +Scaling Coder involves planning and testing to ensure it can handle more load |
| 90 | +without compromising service. This process encompasses infrastructure setup, |
| 91 | +traffic projections, and aggressive testing to identify and mitigate potential |
| 92 | +bottlenecks. |
| 93 | + |
| 94 | +A dedicated Kubernetes cluster for Coder is a Kubernetes cluster specifically |
| 95 | +configured to host and manage Coder workloads. Kubernetes provides container |
| 96 | +orchestration capabilities, allowing Coder to efficiently deploy, scale, and |
| 97 | +manage workspaces across a distributed infrastructure. This ensures high |
| 98 | +availability, fault tolerance, and scalability for Coder deployments. Coder is |
| 99 | +deployed on this cluster using the |
| 100 | +[Helm chart](../install/kubernetes#install-coder-with-helm). |
| 101 | + |
| 102 | +Our scale tests include the following stages: |
| 103 | + |
| 104 | +1. Prepare environment: create expected users and provision workspaces. |
| 105 | + |
| 106 | +2. SSH connections: establish user connections with agents, verifying their |
| 107 | + ability to echo back received content. |
| 108 | + |
| 109 | +3. Web Terminal: verify the PTY connection used for communication with Web |
| 110 | + Terminal. |
| 111 | + |
| 112 | +4. Workspace application traffic: assess the handling of user connections with |
| 113 | + specific workspace apps, confirming their capability to echo back received |
| 114 | + content effectively. |
| 115 | + |
| 116 | +5. Dashboard evaluation: verify the responsiveness and stability of Coder |
| 117 | + dashboards under varying load conditions. This is achieved by simulating user |
| 118 | + interactions using instances of headless Chromium browsers. |
| 119 | + |
| 120 | +6. Cleanup: delete workspaces and users created in step 1. |
| 121 | + |
| 122 | +### Infrastructure and setup requirements |
| 123 | + |
| 124 | +The scale test runner can distribute the workload so that individual scenarios |
| 125 | +overlap, based on the workflow configuration: |
| 126 | + |
| 127 | +| | T0 | T1 | T2 | T3 | T4 | T5 | T6 | |
| 128 | +| -------------------- | --- | --- | --- | --- | --- | --- | --- | |
| 129 | +| SSH connections | X | X | X | X | | | | |
| 130 | +| Web Terminal (PTY) | | X | X | X | X | | | |
| 131 | +| Workspace apps | | | X | X | X | X | | |
| 132 | +| Dashboard (headless) | | | | X | X | X | X | |
| 133 | + |
| 134 | +This pattern closely reflects how our customers naturally use the system. SSH |
| 135 | +connections are heavily utilized because they're the primary communication |
| 136 | +channel for IDEs with VS Code and JetBrains plugins. |
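
The overlap in the table above can be expressed as a small data structure, which makes it easy to see how many scenarios run in each time slot. This is only an illustrative sketch: the names `SCHEDULE` and `active_scenarios` are hypothetical and are not part of the scale test runner.

```python
SLOTS = ["T0", "T1", "T2", "T3", "T4", "T5", "T6"]

# Scenario -> time slots in which it runs, copied from the table above.
SCHEDULE = {
    "SSH connections": {"T0", "T1", "T2", "T3"},
    "Web Terminal (PTY)": {"T1", "T2", "T3", "T4"},
    "Workspace apps": {"T2", "T3", "T4", "T5"},
    "Dashboard (headless)": {"T3", "T4", "T5", "T6"},
}


def active_scenarios(slot: str) -> list[str]:
    """Return the scenarios running in the given time slot."""
    return [name for name, slots in SCHEDULE.items() if slot in slots]


for slot in SLOTS:
    running = active_scenarios(slot)
    print(f"{slot}: {len(running)} running -> {', '.join(running)}")

# The slot with the most overlapping scenarios carries the heaviest combined load.
peak_slot = max(SLOTS, key=lambda s: len(active_scenarios(s)))
print(f"Peak overlap at {peak_slot}")
```

With this schedule, T3 is the only slot in which all four scenarios run at the same time.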
| 137 | + |
| 138 | +The basic setup of the scale test environment involves: |
| 139 | + |
| 140 | +1. Scale tests runner (32 vCPU, 128 GB RAM) |
| 141 | +2. Coder: 2 replicas (4 vCPU, 16 GB RAM) |
| 142 | +3. Database: 1 instance (2 vCPU, 32 GB RAM) |
| 143 | +4. Provisioner: 50 instances (0.5 vCPU, 512 MB RAM) |
| 144 | + |
| 145 | +The test is deemed successful if users did not experience interruptions in their |
| 146 | +workflows, `coderd` did not crash or require restarts, and no other internal |
| 147 | +errors were observed. |
| 148 | + |
| 149 | +### Traffic Projections |
| 150 | + |
| 151 | +In our scale tests, we simulate activity from 2000 users, 2000 workspaces, and |
| 152 | +2000 agents, with two items of workspace agent metadata being sent every 10 |
| 153 | +seconds. Here are the resulting metrics: |
| 154 | + |
| 155 | +Coder: |
| 156 | + |
| 157 | +- Median CPU usage for _coderd_: 3 vCPU, peaking at 3.7 vCPU during dashboard |
| 158 | + tests. |
| 159 | +- Median API request rate: 350 req/s during dashboard tests, 250 req/s during |
| 160 | + Web Terminal and workspace apps tests. |
| 161 | +- 2000 agent API connections with latency: p90 at 60 ms, p95 at 220 ms. |
| 162 | +- On average, 2400 WebSocket connections during dashboard tests. |
| 163 | + |
| 164 | +Provisionerd: |
| 165 | + |
| 166 | +- Median CPU usage is 0.35 vCPU during workspace provisioning. |
| 167 | + |
| 168 | +Database: |
| 169 | + |
| 170 | +- Median CPU utilization is 80%, with a significant portion dedicated to writing |
| 171 | + metadata. |
| 172 | +- Memory utilization averages 40%. |
| 173 | +- `write_ops_count` between 6.7 and 8.4 operations per second. |
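
To relate the simulated load to these metrics, the arithmetic behind the metadata traffic can be written out. This is a minimal sketch with hypothetical function names. The per-replica figure assumes that requests are spread evenly across the 2 _coderd_ replicas, which this document does not state.

```python
def metadata_items_per_second(
    agents: int, items_per_agent: int, interval_seconds: float
) -> float:
    """Raw number of metadata items reported per second across all agents."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    return agents * items_per_agent / interval_seconds


def requests_per_replica(total_req_per_second: float, replicas: int) -> float:
    """Per-replica request rate, ASSUMING perfectly even load balancing."""
    if replicas <= 0:
        raise ValueError("replicas must be positive")
    return total_req_per_second / replicas


# Values taken from the scale test description: 2000 agents,
# 2 metadata items each, sent every 10 seconds.
metadata_rate = metadata_items_per_second(2000, 2, 10)

# 350 req/s median during dashboard tests, 2 coderd replicas.
per_replica = requests_per_replica(350, 2)

print(f"Metadata items per second: {metadata_rate:.0f}")
print(f"API requests per replica (even split assumed): {per_replica:.0f}")
```

These figures describe offered load only. They say nothing about how Coder batches or persists the data internally.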