Description
I have Coder 0.9.2 running on Amazon EKS. I'm using a Kubernetes template which allocates workspace pods on nodes controlled by an autoscaler. When no workspaces are running, these nodes are brought down. As expected, creating the first workspace can take a few minutes because the autoscaler needs time to react and bring a node up.
Workspace creation works as long as the user simply waits, but if they press Cancel before completion, their workspace might get out of sync with the Kubernetes resources. If the user later tries to start the workspace again, Terraform complains that the resource already exists.
I can't reproduce the issue consistently, but it happens often enough to be a problem. Here's what triggers it for me:
- Launch a Kubernetes workspace which takes a few minutes to be successfully scheduled on a node.
- Hit the Cancel button in the Coder UI while the pod is still pending.
- After a 1-minute timeout, the workspace is marked as failed, but the Kubernetes operation doesn't stop: the pod eventually reaches the Running state anyway.
- Attempting to start the workspace again fails with `Error: pods "coder-bob-work" already exists` or `Error: persistentvolumeclaims "coder-bob-work-home" already exists`. Deleting the workspace succeeds, but the Kubernetes resources are left behind.
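As a manual workaround, the leaked resources can be removed directly with kubectl so the next build starts from a clean slate. This is a sketch using the `coder` namespace and the resource names from the errors above; substitute your own workspace's names:

```shell
# Delete the pod that Terraform lost track of after the canceled build.
kubectl delete pod coder-bob-work --namespace coder

# Only delete the PVC if losing the home volume is acceptable;
# otherwise leave it in place for the next build to reuse.
kubectl delete pvc coder-bob-work-home --namespace coder
```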
Template:
terraform {
  required_providers {
    coder = {
      source  = "coder/coder"
      version = "0.5.0"
    }
    kubernetes = {
      source  = "hashicorp/kubernetes"
      version = "~> 2.12.1"
    }
  }
}

variable "namespace" {
  type        = string
  sensitive   = true
  description = "The namespace to create workspaces in (must exist prior to creating workspaces)"
  default     = "coder"
}

variable "gpu" {
  type        = bool
  description = "Allocate a GPU?"
  default     = false
}

data "coder_workspace" "me" {}

resource "coder_agent" "main" {
  os             = "linux"
  arch           = "amd64"
  startup_script = <<EOT
#!/bin/bash

# home folder can be empty, so copy the default bash settings
if [ ! -f ~/.profile ]; then
  cp /etc/skel/.profile $HOME
fi
if [ ! -f ~/.bashrc ]; then
  cp /etc/skel/.bashrc $HOME
fi

# install code-server
curl -fsSL https://code-server.dev/install.sh | sh | tee code-server-install.log

# install extensions
code-server --install-extension ms-python.python

# start code-server (log to its own file rather than overwriting the install log)
code-server --auth none --port 13337 | tee code-server.log &
EOT
}

# code-server
resource "coder_app" "code-server" {
  agent_id  = coder_agent.main.id
  name      = "code-server"
  icon      = "/icon/code.svg"
  url       = "http://localhost:13337?folder=/home/coder"
  subdomain = false

  healthcheck {
    url       = "http://localhost:13337/healthz"
    interval  = 3
    threshold = 30
  }
}

resource "kubernetes_persistent_volume_claim" "home" {
  count            = 1 # don't delete this volume after stopping the workspace
  wait_until_bound = false

  metadata {
    name      = lower("coder-${data.coder_workspace.me.owner}-${data.coder_workspace.me.name}-home")
    namespace = var.namespace
  }
  spec {
    access_modes       = ["ReadWriteOnce"]
    storage_class_name = "efs-workspacehomes"
    resources {
      requests = {
        # Storage specs are irrelevant for EFS volumes
        storage = "10Gi"
      }
    }
  }
}

resource "kubernetes_pod" "main" {
  count = data.coder_workspace.me.start_count

  metadata {
    name      = lower("coder-${data.coder_workspace.me.owner}-${data.coder_workspace.me.name}")
    namespace = var.namespace
    # Prevent the autoscaler from evicting this pod.
    labels = {
      "cluster-autoscaler.kubernetes.io/safe-to-evict" = "false"
    }
  }
  spec {
    security_context {
      run_as_user = "1000"
      fs_group    = "1000"
    }
    affinity {
      node_affinity {
        required_during_scheduling_ignored_during_execution {
          node_selector_term {
            match_expressions {
              key      = "workspace_node"
              operator = "In"
              values   = ["true"]
            }
            match_expressions {
              key      = "k8s.amazonaws.com/accelerator"
              operator = var.gpu == true ? "Exists" : "DoesNotExist"
            }
          }
        }
      }
    }
    container {
      name    = "dev"
      image   = "codercom/enterprise-base:ubuntu"
      command = ["sh", "-c", coder_agent.main.init_script]
      security_context {
        run_as_user = "1000"
      }
      env {
        name  = "CODER_AGENT_TOKEN"
        value = coder_agent.main.token
      }
      volume_mount {
        mount_path = "/home/coder"
        name       = "home"
        read_only  = false
      }
      volume_mount {
        mount_path = "/datashare"
        name       = "datashare"
        read_only  = true
      }
      resources {
        limits = var.gpu == true ? { "nvidia.com/gpu" = "1" } : {}
      }
    }
    volume {
      name = "home"
      persistent_volume_claim {
        claim_name = kubernetes_persistent_volume_claim.home[0].metadata.0.name
        read_only  = false
      }
    }
    volume {
      name = "datashare"
      persistent_volume_claim {
        claim_name = "efs-claim"
        read_only  = true
      }
    }
  }
}
Output of the first workspace build:
Initializing the backend...
Initializing provider plugins...
- Finding hashicorp/kubernetes versions matching "~> 2.12.1"...
- Finding coder/coder versions matching "0.5.0"...
- Using hashicorp/kubernetes v2.12.1 from the shared cache directory
- Using coder/coder v0.5.0 from the shared cache directory
Terraform has created a lock file .terraform.lock.hcl to record the provider
selections it made above. Include this file in your version control repository
so that Terraform can guarantee to make the same selections by default when
you run "terraform init" in the future.
Warning: Incomplete lock file information for providers
Due to your customized provider installation methods, Terraform was forced to
calculate lock file checksums locally for the following providers:
- coder/coder
- hashicorp/kubernetes
The current .terraform.lock.hcl file only includes checksums for linux_amd64,
so Terraform running on another platform will fail to install these
providers.
To calculate additional checksums for another platform, run:
terraform providers lock -platform=linux_amd64
(where linux_amd64 is the platform to generate)
Terraform has been successfully initialized!
You may now begin working with Terraform. Try running "terraform plan" to see
any changes that are required for your infrastructure. All Terraform commands
should now work.
If you ever set or change modules or backend configuration for Terraform,
rerun this command to reinitialize your working directory. If you forget, other
commands will detect it and remind you to do so if necessary.
Terraform 1.3.0
data.coder_workspace.me: Refreshing...
data.coder_workspace.me: Refresh complete after 0s [id=6bf26e72-25a3-435a-93a3-286f7e7b265c]
coder_agent.main: Plan to create
coder_app.code-server: Plan to create
kubernetes_pod.main[0]: Plan to create
Plan: 3 to add, 0 to change, 0 to destroy.
coder_agent.main: Creating...
coder_agent.main: Creation complete after 0s [id=b5f1c784-052c-4eb9-b238-faad2922bd76]
coder_app.code-server: Creating...
coder_app.code-server: Creation complete after 0s [id=6cb793cb-b63a-40b7-949d-b3ad41e56bcf]
kubernetes_pod.main[0]: Creating...
kubernetes_pod.main[0]: Still creating... [11s elapsed]
kubernetes_pod.main[0]: Still creating... [21s elapsed]
kubernetes_pod.main[0]: Still creating... [31s elapsed]
kubernetes_pod.main[0]: Still creating... [41s elapsed]
kubernetes_pod.main[0]: Still creating... [51s elapsed]
kubernetes_pod.main[0]: Still creating... [1m1s elapsed]
Interrupt received. Please wait for Terraform to exit or data loss may occur. Gracefully shutting down...
Stopping operation...
kubernetes_pod.main[0]: Still creating... [1m11s elapsed]
kubernetes_pod.main[0]: Still creating... [1m21s elapsed]
kubernetes_pod.main[0]: Still creating... [1m31s elapsed]
kubernetes_pod.main[0]: Still creating... [1m41s elapsed]
kubernetes_pod.main[0]: Still creating... [1m51s elapsed]
kubernetes_pod.main[0]: Still creating... [2m1s elapsed]
Log messages from the main Coder pod:
2022-10-17 11:19:22.648 [INFO] <./provisionerd/provisionerd.go:256> (*Server).acquireJob acquired job {"initiator_bob": "bob", "provisioner": "terraform", "job_id": "b4c90df2-b4a9-4779-bff5-a53c0495488c"}
2022-10-17 11:19:22.703 [INFO] <./provisionerd/runner/runner.go:328> (*Runner).do unpacking template source archive {"job_id": "b4c90df2-b4a9-4779-bff5-a53c0495488c", "size_bytes": 4608}
2022-10-17 11:19:22.751 [WARN] (coderd) <./coderd/authorize.go:99> (*API).checkAuthorization check-auth {"request_id": "85316c03-c695-4ee9-895f-7b84a4c2014b", "my_id": "674bc8a1-25a1-4e88-9cf0-41ba591da155", "got_id": "674bc8a1-25a1-4e88-9cf0-41ba591da155", "name": "bob", "roles": ["owner", "member", "organization-admin:6e7dda79-e2b1-488b-9262-e712ecea4616", "organization-member:6e7dda79-e2b1-488b-9262-e712ecea4616"], "scope": "all"}
2022-10-17 11:20:25.311 [WARN] (coderd) <./coderd/authorize.go:99> (*API).checkAuthorization check-auth {"request_id": "b92c8240-fd88-4d2c-85b6-c4462db1a287", "my_id": "674bc8a1-25a1-4e88-9cf0-41ba591da155", "got_id": "674bc8a1-25a1-4e88-9cf0-41ba591da155", "name": "bob", "roles": ["owner", "member", "organization-admin:6e7dda79-e2b1-488b-9262-e712ecea4616", "organization-member:6e7dda79-e2b1-488b-9262-e712ecea4616"], "scope": "all"}
2022-10-17 11:20:26.703 [INFO] <./provisionerd/runner/runner.go:437> (*Runner).heartbeat attempting graceful cancelation {"job_id": "b4c90df2-b4a9-4779-bff5-a53c0495488c"}
2022-10-17 11:21:26.703 [WARN] <./provisionerd/runner/runner.go:443> (*Runner).heartbeat Cancel timed out {"job_id": "b4c90df2-b4a9-4779-bff5-a53c0495488c"}
2022-10-17 11:21:26.758 [WARN] <./provisionerd/runner/runner.go:281> (*Runner).doCleanFinish.func2 failed to log cleanup {"job_id": "b4c90df2-b4a9-4779-bff5-a53c0495488c"}
Output of the second workspace build (starting from a failed state):
Initializing the backend...
Initializing provider plugins...
- Finding coder/coder versions matching "0.5.0"...
- Finding hashicorp/kubernetes versions matching "~> 2.12.1"...
- Using coder/coder v0.5.0 from the shared cache directory
- Using hashicorp/kubernetes v2.12.1 from the shared cache directory
Terraform has created a lock file .terraform.lock.hcl to record the provider
selections it made above. Include this file in your version control repository
so that Terraform can guarantee to make the same selections by default when
you run "terraform init" in the future.
Warning: Incomplete lock file information for providers
Due to your customized provider installation methods, Terraform was forced to
calculate lock file checksums locally for the following providers:
- coder/coder
- hashicorp/kubernetes
The current .terraform.lock.hcl file only includes checksums for linux_amd64,
so Terraform running on another platform will fail to install these
providers.
To calculate additional checksums for another platform, run:
terraform providers lock -platform=linux_amd64
(where linux_amd64 is the platform to generate)
Terraform has been successfully initialized!
You may now begin working with Terraform. Try running "terraform plan" to see
any changes that are required for your infrastructure. All Terraform commands
should now work.
If you ever set or change modules or backend configuration for Terraform,
rerun this command to reinitialize your working directory. If you forget, other
commands will detect it and remind you to do so if necessary.
Terraform 1.3.0
data.coder_workspace.me: Refreshing...
data.coder_workspace.me: Refresh complete after 0s [id=6bf26e72-25a3-435a-93a3-286f7e7b265c]
coder_agent.main: Plan to create
coder_app.code-server: Plan to create
kubernetes_pod.main[0]: Plan to create
Plan: 3 to add, 0 to change, 0 to destroy.
coder_agent.main: Creating...
coder_agent.main: Creation complete after 0s [id=8a963259-bfa4-4505-a3d4-a85f23a3178b]
coder_app.code-server: Creating...
coder_app.code-server: Creation complete after 0s [id=538ad593-46bd-48af-87cd-b9765a681882]
kubernetes_pod.main[0]: Creating...
kubernetes_pod.main[0]: Creation errored after 0s
Error: pods "coder-bob-work" already exists
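If you have access to the build's Terraform working directory, `terraform import` could in principle re-adopt the leaked resources into state instead of deleting them (the kubernetes provider imports pods and PVCs by `namespace/name`). This is only a sketch: Coder starts each build from a fresh state, so the imports would have to happen before the apply, and the resource addresses below are taken from the template above:

```shell
# Adopt the orphaned pod and PVC into the current Terraform state
# so the next apply sees them as existing rather than conflicting.
terraform import 'kubernetes_pod.main[0]' coder/coder-bob-work
terraform import 'kubernetes_persistent_volume_claim.home[0]' coder/coder-bob-work-home
```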