Commit fb7f1ac

docs: update reference architecture: glossary, scale tests methodology (#12438)
1 parent 8427998 commit fb7f1ac

1 file changed

+173
-0
lines changed

docs/admin/reference-architectures.md

+173
@@ -0,0 +1,173 @@
1+
# Reference architectures
2+
3+
This document provides prescriptive solutions and reference architectures to
4+
support successful deployments of up to 2000 users and outlines at a high level
5+
the methodology currently used to scale-test Coder.
6+
7+
## General concepts
8+
9+
This section outlines core concepts and terminology essential for understanding
10+
Coder's architecture and deployment strategies.
11+
12+
### Administrator
13+
14+
An administrator is a user role within the Coder platform with elevated
15+
privileges. Admins have access to administrative functions such as user
16+
management, template definitions, insights, and deployment configuration.
17+
18+
### Coder
19+
20+
Coder, also known as _coderd_, is the main service recommended for deployment
21+
with multiple replicas to ensure high availability. It provides an API for
22+
managing workspaces and templates. Each _coderd_ replica has the capability to
23+
host multiple [provisioners](#provisioner).
24+
25+
### User
26+
27+
A user is an individual who utilizes the Coder platform to develop, test, and
28+
deploy applications using workspaces. Users can select available templates to
29+
provision workspaces. They interact with Coder using the web interface, the CLI
30+
tool, or by calling API methods directly.
31+
32+
### Workspace
33+
34+
A workspace refers to an isolated development environment where users can write,
35+
build, and run code. Workspaces are fully configurable and can be tailored to
36+
specific project requirements, providing developers with a consistent and
37+
efficient development environment. Workspaces can be autostarted and
38+
autostopped, enabling efficient resource management.
39+
40+
Users can connect to workspaces using SSH or via workspace applications like
41+
`code-server`, facilitating collaboration and remote access. Additionally,
42+
workspaces can be parameterized, allowing users to customize settings and
43+
configurations based on their unique needs. Workspaces are instantiated using
44+
Coder templates and deployed on resources created by provisioners.
45+
46+
### Template
47+
48+
A template in Coder is a predefined configuration for creating workspaces.
49+
Templates streamline the process of workspace creation by providing
50+
pre-configured settings, tooling, and dependencies. They are built by template
51+
administrators on top of Terraform, allowing for efficient management of
52+
infrastructure resources. Additionally, templates can utilize Coder modules to
53+
leverage existing features shared with other templates, enhancing flexibility
54+
and consistency across deployments. Templates describe provisioning rules for
55+
infrastructure resources offered by Terraform providers.
56+
57+
### Workspace Proxy
58+
59+
A workspace proxy serves as a relay connection option for developers connecting
60+
to their workspace over SSH, a workspace app, or through port forwarding. It
61+
helps reduce network latency for geo-distributed teams by minimizing the
62+
distance network traffic needs to travel. Notably, workspace proxies do not
63+
handle dashboard connections or API calls.
64+
65+
### Provisioner
66+
67+
Provisioners in Coder execute Terraform during workspace and template builds.
68+
While the platform includes built-in provisioner daemons by default, there are
69+
advantages to employing external provisioners. These external daemons provide
70+
secure build environments and reduce server load, improving performance and
71+
scalability. Each provisioner can handle a single concurrent workspace build,
72+
allowing for efficient resource allocation and workload management.
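
Because each provisioner handles one build at a time, the number of provisioners bounds how many builds can run in parallel. The following is a minimal sketch of that arithmetic, not part of Coder itself. The function names and the `minutes_per_build` value are hypothetical and chosen only for illustration.

```python
import math


def build_waves(total_builds: int, provisioners: int) -> int:
    """Sequential waves needed when each provisioner runs one build at a time."""
    if provisioners <= 0:
        raise ValueError("at least one provisioner is required")
    return math.ceil(total_builds / provisioners)


def estimated_minutes(total_builds: int, provisioners: int,
                      minutes_per_build: float) -> float:
    """Rough wall-clock estimate, assuming every build takes the same time."""
    return build_waves(total_builds, provisioners) * minutes_per_build


# 2000 workspace builds spread over 50 provisioners.
waves = build_waves(2000, 50)
# minutes_per_build=2.0 is an assumed figure, not a measured one.
minutes = estimated_minutes(2000, 50, minutes_per_build=2.0)
print(f"{waves} waves, roughly {minutes:.0f} minutes")
```

Real build times vary by template and infrastructure, so treat the estimate as a lower bound on queueing rather than a prediction.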
73+
74+
### Registry
75+
76+
The Coder Registry is a platform where you can find starter templates and
77+
_Modules_ for various cloud services and platforms.
78+
79+
Templates help create self-service development environments using
80+
Terraform-defined infrastructure, while _Modules_ simplify template creation by
81+
providing common features like workspace applications, third-party integrations,
82+
or helper scripts.
83+
84+
Please note that the Registry is a hosted service and isn't available for
85+
offline use.
86+
87+
## Scale-testing methodology
88+
89+
Scaling Coder involves planning and testing to ensure it can handle more load
90+
without compromising service. This process encompasses infrastructure setup,
91+
traffic projections, and aggressive testing to identify and mitigate potential
92+
bottlenecks.
93+
94+
A dedicated Kubernetes cluster for Coder is a Kubernetes cluster specifically
95+
configured to host and manage Coder workloads. Kubernetes provides container
96+
orchestration capabilities, allowing Coder to efficiently deploy, scale, and
97+
manage workspaces across a distributed infrastructure. This ensures high
98+
availability, fault tolerance, and scalability for Coder deployments. Coder is
99+
deployed on this cluster using the
100+
[Helm chart](../install/kubernetes#install-coder-with-helm).
101+
102+
Our scale tests include the following stages:
103+
104+
1. Prepare environment: create expected users and provision workspaces.
105+
106+
2. SSH connections: establish user connections with agents, verifying their
107+
ability to echo back received content.
108+
109+
3. Web Terminal: verify the PTY connection used for communication with Web
110+
Terminal.
111+
112+
4. Workspace application traffic: assess the handling of user connections with
113+
specific workspace apps, confirming their capability to echo back received
114+
content effectively.
115+
116+
5. Dashboard evaluation: verify the responsiveness and stability of Coder
117+
dashboards under varying load conditions. This is achieved by simulating user
118+
interactions using instances of headless Chromium browsers.
119+
120+
6. Cleanup: delete workspaces and users created in step 1.
121+
122+
### Infrastructure and setup requirements
123+
124+
The scale tests runner can distribute the workload so that individual scenarios overlap,
125+
based on the workflow configuration:
126+
127+
|                      | T0  | T1  | T2  | T3  | T4  | T5  | T6  |
| -------------------- | --- | --- | --- | --- | --- | --- | --- |
| SSH connections      | X   | X   | X   | X   |     |     |     |
| Web Terminal (PTY)   |     | X   | X   | X   | X   |     |     |
| Workspace apps       |     |     | X   | X   | X   | X   |     |
| Dashboard (headless) |     |     |     | X   | X   | X   | X   |
133+
134+
This pattern closely reflects how our customers naturally use the system. SSH
135+
connections are heavily utilized because they're the primary communication
136+
channel for IDEs with VS Code and JetBrains plugins.
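
The overlap pattern in the table above can be reproduced with a small sketch: each scenario starts one time slot after the previous one and stays active for four slots. This is an illustration only, not the scale tests runner's implementation. The `duration` and `stagger` parameters are inferred from the table rather than taken from any runner configuration.

```python
SCENARIOS = [
    "SSH connections",
    "Web Terminal (PTY)",
    "Workspace apps",
    "Dashboard (headless)",
]


def staggered_schedule(scenarios, duration=4, stagger=1):
    """Map each scenario name to the set of time slots in which it is active."""
    return {
        name: set(range(i * stagger, i * stagger + duration))
        for i, name in enumerate(scenarios)
    }


def render(schedule, slots):
    """Render the schedule as text rows, one per scenario."""
    rows = []
    for name, active in schedule.items():
        cells = ["X" if t in active else " " for t in range(slots)]
        rows.append(f"{name:<20} | " + " | ".join(cells))
    return "\n".join(rows)


schedule = staggered_schedule(SCENARIOS)
total_slots = max(max(active) for active in schedule.values()) + 1
print(render(schedule, total_slots))

# Slots in which every scenario runs at once (the peak-load window).
peak = [t for t in range(total_slots)
        if all(t in active for active in schedule.values())]
print("all scenarios overlap at:", [f"T{t}" for t in peak])
```

With these defaults, T3 is the only slot where all four scenarios run simultaneously, which is where the combined load is highest.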
137+
138+
The basic setup of the scale tests environment involves:
139+
140+
1. Scale tests runner (32 vCPU, 128 GB RAM)
141+
2. Coder: 2 replicas (4 vCPU, 16 GB RAM)
142+
3. Database: 1 instance (2 vCPU, 32 GB RAM)
143+
4. Provisioner: 50 instances (0.5 vCPU, 512 MB RAM)
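
To get a sense of the overall footprint of this setup, the listed figures can be summed. The sketch below assumes that the vCPU and RAM values are per instance (the list does not say so explicitly) and that 512 MB is counted as 0.5 GB. The `Component` type is a hypothetical helper introduced only for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Component:
    name: str
    instances: int
    vcpu: float    # assumed to be per instance
    ram_gb: float  # assumed to be per instance


SETUP = [
    Component("Scale tests runner", 1, 32, 128),
    Component("Coder", 2, 4, 16),
    Component("Database", 1, 2, 32),
    Component("Provisioner", 50, 0.5, 0.5),  # 512 MB counted as 0.5 GB
]


def totals(components):
    """Return (total vCPU, total RAM in GB) across all instances."""
    vcpu = sum(c.instances * c.vcpu for c in components)
    ram_gb = sum(c.instances * c.ram_gb for c in components)
    return vcpu, ram_gb


total_vcpu, total_ram_gb = totals(SETUP)
print(f"total: {total_vcpu:g} vCPU, {total_ram_gb:g} GB RAM")
```

Under these assumptions the provisioners together account for 25 vCPU, more than the Coder replicas and the database combined.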
144+
145+
The test is deemed successful if users did not experience interruptions in their
146+
workflows, `coderd` did not crash or require restarts, and no other internal
147+
errors were observed.
148+
149+
### Traffic Projections
150+
151+
In our scale tests, we simulate activity from 2000 users, 2000 workspaces, and
152+
2000 agents, with two items of workspace agent metadata being sent every 10
153+
seconds. Here are the resulting metrics:
154+
155+
Coder:
156+
157+
- Median CPU usage for _coderd_: 3 vCPU, peaking at 3.7 vCPU during dashboard
158+
tests.
159+
- Median API request rate: 350 req/s during dashboard tests, 250 req/s during
160+
Web Terminal and workspace apps tests.
161+
- 2000 agent API connections with latency: p90 at 60 ms, p95 at 220 ms.
162+
- An average of 2400 WebSocket connections during dashboard tests.
163+
164+
Provisionerd:
165+
166+
- Median CPU usage is 0.35 vCPU during workspace provisioning.
167+
168+
Database:
169+
170+
- Median CPU utilization is 80%, with a significant portion dedicated to writing
171+
metadata.
172+
- Memory utilization averages 40%.
173+
- `write_ops_count` between 6.7 and 8.4 operations per second.
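
The agent metadata load described at the start of this section (2000 agents, two items every 10 seconds) can be turned into an aggregate rate with simple arithmetic. The function below is a hypothetical helper for that calculation. It counts metadata items sent by agents and makes no claim about how those items map to database writes.

```python
def metadata_items_per_second(agents: int, items_per_agent: int,
                              interval_seconds: float) -> float:
    """Aggregate rate of metadata items sent by all agents."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    return agents * items_per_agent / interval_seconds


rate = metadata_items_per_second(agents=2000, items_per_agent=2,
                                 interval_seconds=10)
print(f"{rate:g} metadata items per second")
```

The same helper can be used to project the rate for a different user count or reporting interval before running a test.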
