
Failed cancellation of workspace build may leave orphan resources behind #4586

Closed

Description

@enasca

I have Coder 0.9.2 running on Amazon EKS. I'm using a Kubernetes template which allocates workspace pods on nodes controlled by an autoscaler. When no workspaces are running, these nodes are brought down. As expected, creating the first workspace can take a few minutes because the autoscaler needs time to react and bring a node up.

Workspace creation works as long as the user simply waits, but if they press Cancel before completion, their workspace might get out of sync with the Kubernetes resources. If the user later tries to start the workspace again, Terraform complains that the resource already exists.

I can't reproduce the issue consistently, but it happens often enough to be a problem. Here's what triggers it for me:

  1. Launch a Kubernetes workspace which takes a few minutes to be successfully scheduled on a node.
  2. Hit the Cancel button in the Coder UI while the pod is still pending.
  3. After a one-minute timeout, the workspace is marked as failed, but the Kubernetes operation doesn't stop: eventually, the pod reaches the Running state.
  4. Attempting to start the workspace fails with Error: pods "coder-bob-work" already exists or Error: persistentvolumeclaims "coder-bob-work-home" already exists. It's possible to delete the workspace, but the Kubernetes resources are not deleted; they have to be cleaned up by hand, as shown below.
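
After step 4, the leftover resources are still visible in the cluster, and deleting them by hand is the only recovery I've found. A sketch of the verification and cleanup (resource names are taken from the error messages above; the namespace is the template's default):

# Confirm the orphaned resources exist (names from the errors above):
kubectl -n coder get pod coder-bob-work
kubectl -n coder get pvc coder-bob-work-home

# Manual cleanup so the next build can recreate them:
kubectl -n coder delete pod coder-bob-work
kubectl -n coder delete pvc coder-bob-work-home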

Template:

terraform {
  required_providers {
    coder = {
      source  = "coder/coder"
      version = "0.5.0"
    }
    kubernetes = {
      source  = "hashicorp/kubernetes"
      version = "~> 2.12.1"
    }
  }
}

variable "namespace" {
  type        = string
  sensitive   = true
  description = "The namespace to create workspaces in (must exist prior to creating workspaces)"
  default     = "coder"
}

variable "gpu" {
  type        = bool
  description = "Allocate a GPU?"
  default     = false
}

data "coder_workspace" "me" {}

resource "coder_agent" "main" {
  os             = "linux"
  arch           = "amd64"
  startup_script = <<-EOT
    #!/bin/bash

    # home folder can be empty, so copying default bash settings
    if [ ! -f ~/.profile ]; then
      cp /etc/skel/.profile $HOME
    fi
    if [ ! -f ~/.bashrc ]; then
      cp /etc/skel/.bashrc $HOME
    fi

    # install code-server
    curl -fsSL https://code-server.dev/install.sh | sh | tee code-server-install.log

    # install extensions
    code-server --install-extension ms-python.python

    # start code-server (logging to its own file rather than the install log)
    code-server --auth none --port 13337 | tee code-server.log &
  EOT
}

# code-server
resource "coder_app" "code-server" {
  agent_id  = coder_agent.main.id
  name      = "code-server"
  icon      = "/icon/code.svg"
  url       = "http://localhost:13337?folder=/home/coder"
  subdomain = false

  healthcheck {
    url       = "http://localhost:13337/healthz"
    interval  = 3
    threshold = 30
  }
}

resource "kubernetes_persistent_volume_claim" "home" {
  count = 1  # don't delete this volume after stopping the workspace
  wait_until_bound = false
  metadata {
    name      = lower("coder-${data.coder_workspace.me.owner}-${data.coder_workspace.me.name}-home")
    namespace = var.namespace
  }
  spec {
    access_modes       = ["ReadWriteOnce"]
    storage_class_name = "efs-workspacehomes"
    resources {
      requests = {
        # Storage specs are irrelevant for EFS volumes
        storage = "10Gi"
      }
    }
  }
}

resource "kubernetes_pod" "main" {
  count = data.coder_workspace.me.start_count
  metadata {
    name      = lower("coder-${data.coder_workspace.me.owner}-${data.coder_workspace.me.name}")
    namespace = var.namespace
    # Prevent the autoscaler from evicting this pod.
    labels = {
      "cluster-autoscaler.kubernetes.io/safe-to-evict" = "false"
    }
  }
  spec {
    security_context {
      run_as_user = "1000"
      fs_group    = "1000"
    }
    affinity {
      node_affinity {
        required_during_scheduling_ignored_during_execution {
          node_selector_term {
            match_expressions {
              key      = "workspace_node"
              operator = "In"
              values   = [ "true" ]
            }
            match_expressions {
              key      = "k8s.amazonaws.com/accelerator"
              operator = var.gpu == true ? "Exists" : "DoesNotExist"
            }
          }
        }
      }
    }
    container {
      name    = "dev"
      image   = "codercom/enterprise-base:ubuntu"
      command = ["sh", "-c", coder_agent.main.init_script]
      security_context {
        run_as_user = "1000"
      }
      env {
        name  = "CODER_AGENT_TOKEN"
        value = coder_agent.main.token
      }
      volume_mount {
        mount_path = "/home/coder"
        name       = "home"
        read_only  = false
      }
      volume_mount {
        mount_path = "/datashare"
        name       = "datashare"
        read_only  = true
      }
      resources {
        limits = var.gpu == true ? {"nvidia.com/gpu" = "1"} : {}
      }
    }

    volume {
      name = "home"
      persistent_volume_claim {
        claim_name = kubernetes_persistent_volume_claim.home[0].metadata.0.name
        read_only  = false
      }
    }
    volume {
      name = "datashare"
      persistent_volume_claim {
        claim_name = "efs-claim"
        read_only  = true
      }
    }
  }
}

Output of the first workspace build:

Initializing the backend...
Initializing provider plugins...
- Finding hashicorp/kubernetes versions matching "~> 2.12.1"...
- Finding coder/coder versions matching "0.5.0"...
- Using hashicorp/kubernetes v2.12.1 from the shared cache directory
- Using coder/coder v0.5.0 from the shared cache directory
Terraform has created a lock file .terraform.lock.hcl to record the provider
selections it made above. Include this file in your version control repository
so that Terraform can guarantee to make the same selections by default when
you run "terraform init" in the future.
Warning: Incomplete lock file information for providers
Due to your customized provider installation methods, Terraform was forced to
calculate lock file checksums locally for the following providers:
 - coder/coder
 - hashicorp/kubernetes
The current .terraform.lock.hcl file only includes checksums for linux_amd64,
so Terraform running on another platform will fail to install these
providers.
To calculate additional checksums for another platform, run:
 terraform providers lock -platform=linux_amd64
(where linux_amd64 is the platform to generate)
Terraform has been successfully initialized!
You may now begin working with Terraform. Try running "terraform plan" to see
any changes that are required for your infrastructure. All Terraform commands
should now work.
If you ever set or change modules or backend configuration for Terraform,
rerun this command to reinitialize your working directory. If you forget, other
commands will detect it and remind you to do so if necessary.
Terraform 1.3.0
data.coder_workspace.me: Refreshing...
data.coder_workspace.me: Refresh complete after 0s [id=6bf26e72-25a3-435a-93a3-286f7e7b265c]
coder_agent.main: Plan to create
coder_app.code-server: Plan to create
kubernetes_pod.main[0]: Plan to create
Plan: 3 to add, 0 to change, 0 to destroy.
coder_agent.main: Creating...
coder_agent.main: Creation complete after 0s [id=b5f1c784-052c-4eb9-b238-faad2922bd76]
coder_app.code-server: Creating...
coder_app.code-server: Creation complete after 0s [id=6cb793cb-b63a-40b7-949d-b3ad41e56bcf]
kubernetes_pod.main[0]: Creating...
kubernetes_pod.main[0]: Still creating... [11s elapsed]
kubernetes_pod.main[0]: Still creating... [21s elapsed]
kubernetes_pod.main[0]: Still creating... [31s elapsed]
kubernetes_pod.main[0]: Still creating... [41s elapsed]
kubernetes_pod.main[0]: Still creating... [51s elapsed]
kubernetes_pod.main[0]: Still creating... [1m1s elapsed]
 Interrupt received. Please wait for Terraform to exit or data loss may occur. Gracefully shutting down...
Stopping operation...
kubernetes_pod.main[0]: Still creating... [1m11s elapsed]
kubernetes_pod.main[0]: Still creating... [1m21s elapsed]
kubernetes_pod.main[0]: Still creating... [1m31s elapsed]
kubernetes_pod.main[0]: Still creating... [1m41s elapsed]
kubernetes_pod.main[0]: Still creating... [1m51s elapsed]
kubernetes_pod.main[0]: Still creating... [2m1s elapsed]

Log messages from the main Coder pod (note the graceful cancellation attempt timing out after exactly one minute):

2022-10-17 11:19:22.648 [INFO]  <./provisionerd/provisionerd.go:256>    (*Server).acquireJob    acquired job    {"initiator_bob": "bob", "provisioner": "terraform", "job_id": "b4c90df2-b4a9-4779-bff5-a53c0495488c"}
2022-10-17 11:19:22.703 [INFO]  <./provisionerd/runner/runner.go:328>   (*Runner).do    unpacking template source archive       {"job_id": "b4c90df2-b4a9-4779-bff5-a53c0495488c", "size_bytes": 4608}
2022-10-17 11:19:22.751 [WARN]  (coderd)        <./coderd/authorize.go:99>      (*API).checkAuthorization       check-auth      {"request_id": "85316c03-c695-4ee9-895f-7b84a4c2014b", "my_id": "674bc8a1-25a1-4e88-9cf0-41ba591da155", "got_id": "674bc8a1-25a1-4e88-9cf0-41ba591da155", "name": "bob", "roles": ["owner", "member", "organization-admin:6e7dda79-e2b1-488b-9262-e712ecea4616", "organization-member:6e7dda79-e2b1-488b-9262-e712ecea4616"], "scope": "all"}
2022-10-17 11:20:25.311 [WARN]  (coderd)        <./coderd/authorize.go:99>      (*API).checkAuthorization       check-auth      {"request_id": "b92c8240-fd88-4d2c-85b6-c4462db1a287", "my_id": "674bc8a1-25a1-4e88-9cf0-41ba591da155", "got_id": "674bc8a1-25a1-4e88-9cf0-41ba591da155", "name": "bob", "roles": ["owner", "member", "organization-admin:6e7dda79-e2b1-488b-9262-e712ecea4616", "organization-member:6e7dda79-e2b1-488b-9262-e712ecea4616"], "scope": "all"}
2022-10-17 11:20:26.703 [INFO]  <./provisionerd/runner/runner.go:437>   (*Runner).heartbeat     attempting graceful cancelation {"job_id": "b4c90df2-b4a9-4779-bff5-a53c0495488c"}
2022-10-17 11:21:26.703 [WARN]  <./provisionerd/runner/runner.go:443>   (*Runner).heartbeat     Cancel timed out        {"job_id": "b4c90df2-b4a9-4779-bff5-a53c0495488c"}
2022-10-17 11:21:26.758 [WARN]  <./provisionerd/runner/runner.go:281>   (*Runner).doCleanFinish.func2   failed to log cleanup   {"job_id": "b4c90df2-b4a9-4779-bff5-a53c0495488c"}
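
Putting the two logs together, the provisioner appears to send Terraform an interrupt, wait one minute for a graceful cancellation, and then give up while the kubernetes_pod creation is still in flight, so the pod is created server-side but never recorded in the build's state. A rough sketch of the equivalent sequence outside Coder (the timings are guesses based on the logs above):

# Sketch only: mimic what the cancel path appears to do (timings guessed from the logs).
terraform apply -auto-approve &
tf_pid=$!

sleep 70              # wait until kubernetes_pod.main[0] is "Still creating..."
kill -INT "$tf_pid"   # graceful cancel: "Interrupt received" in the build output
sleep 60              # the provisioner's cancel timeout is one minute per the logs
kill -KILL "$tf_pid"  # hard stop; the pod is still scheduled, but no state is written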

Output of the second workspace build (starting from a failed state):

Initializing the backend...
Initializing provider plugins...
- Finding coder/coder versions matching "0.5.0"...
- Finding hashicorp/kubernetes versions matching "~> 2.12.1"...
- Using coder/coder v0.5.0 from the shared cache directory
- Using hashicorp/kubernetes v2.12.1 from the shared cache directory
Terraform has created a lock file .terraform.lock.hcl to record the provider
selections it made above. Include this file in your version control repository
so that Terraform can guarantee to make the same selections by default when
you run "terraform init" in the future.
Warning: Incomplete lock file information for providers
Due to your customized provider installation methods, Terraform was forced to
calculate lock file checksums locally for the following providers:
 - coder/coder
 - hashicorp/kubernetes
The current .terraform.lock.hcl file only includes checksums for linux_amd64,
so Terraform running on another platform will fail to install these
providers.
To calculate additional checksums for another platform, run:
 terraform providers lock -platform=linux_amd64
(where linux_amd64 is the platform to generate)
Terraform has been successfully initialized!
You may now begin working with Terraform. Try running "terraform plan" to see
any changes that are required for your infrastructure. All Terraform commands
should now work.
If you ever set or change modules or backend configuration for Terraform,
rerun this command to reinitialize your working directory. If you forget, other
commands will detect it and remind you to do so if necessary.
Terraform 1.3.0
data.coder_workspace.me: Refreshing...
data.coder_workspace.me: Refresh complete after 0s [id=6bf26e72-25a3-435a-93a3-286f7e7b265c]
coder_agent.main: Plan to create
coder_app.code-server: Plan to create
kubernetes_pod.main[0]: Plan to create
Plan: 3 to add, 0 to change, 0 to destroy.
coder_agent.main: Creating...
coder_agent.main: Creation complete after 0s [id=8a963259-bfa4-4505-a3d4-a85f23a3178b]
coder_app.code-server: Creating...
coder_app.code-server: Creation complete after 0s [id=538ad593-46bd-48af-87cd-b9765a681882]
kubernetes_pod.main[0]: Creating...
kubernetes_pod.main[0]: Creation errored after 0s
Error: pods "coder-bob-work" already exists
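
For completeness: if one had direct access to the failed build's Terraform working directory (Coder's provisioner manages it internally, so this is purely hypothetical), importing the orphaned resources into the state should reconcile things instead of requiring manual deletion:

# Hypothetical recovery, assuming access to the build's Terraform state;
# the kubernetes provider imports these resources by "namespace/name".
terraform import 'kubernetes_pod.main[0]' coder/coder-bob-work
terraform import 'kubernetes_persistent_volume_claim.home[0]' coder/coder-bob-work-home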
