Bug: Agent script timings are returning more than one script timing for script #16124

Closed
BrunoQuaresma opened this issue Jan 13, 2025 · 5 comments · Fixed by #16203
Labels
bug risk Prone to bugs needs-triage Issue that require triage

Comments

@BrunoQuaresma
Collaborator

BrunoQuaresma commented Jan 13, 2025

I believe an agent script timing should only be reported once per build since the script only runs once per build. Considering this, the response from the /api/v2/workspacebuilds/build-id/timings endpoint becomes confusing when it returns multiple agent script timings for the same script in the same build.

For instance, here's an example of the response:

Response
{
  "provisioner_timings": [
    {
      "job_id": "70dc09b7-199b-4853-9e03-5d3b8b6c007d",
      "started_at": "2024-12-18T16:07:09.36658Z",
      "ended_at": "2024-12-18T16:07:12.056761Z",
      "stage": "init",
      "source": "terraform",
      "action": "initializing terraform",
      "resource": "state file"
    },
    {
      "job_id": "70dc09b7-199b-4853-9e03-5d3b8b6c007d",
      "started_at": "2024-12-18T16:07:13.248206Z",
      "ended_at": "2024-12-18T16:07:13.25545Z",
      "stage": "plan",
      "source": "coder",
      "action": "read",
      "resource": "data.coder_parameter.git_email"
    },
    {
      "job_id": "70dc09b7-199b-4853-9e03-5d3b8b6c007d",
      "started_at": "2024-12-18T16:07:13.248394Z",
      "ended_at": "2024-12-18T16:07:13.255321Z",
      "stage": "plan",
      "source": "coder",
      "action": "read",
      "resource": "data.coder_parameter.gpu"
    },
    {
      "job_id": "70dc09b7-199b-4853-9e03-5d3b8b6c007d",
      "started_at": "2024-12-18T16:07:13.249603Z",
      "ended_at": "2024-12-18T16:07:13.255382Z",
      "stage": "plan",
      "source": "coder",
      "action": "read",
      "resource": "data.coder_parameter.git_user_name"
    },
    {
      "job_id": "70dc09b7-199b-4853-9e03-5d3b8b6c007d",
      "started_at": "2024-12-18T16:07:13.252024Z",
      "ended_at": "2024-12-18T16:07:13.256101Z",
      "stage": "plan",
      "source": "coder",
      "action": "read",
      "resource": "data.coder_parameter.splunk_version"
    },
    {
      "job_id": "70dc09b7-199b-4853-9e03-5d3b8b6c007d",
      "started_at": "2024-12-18T16:07:13.25204Z",
      "ended_at": "2024-12-18T16:07:13.256586Z",
      "stage": "plan",
      "source": "coder",
      "action": "read",
      "resource": "data.coder_parameter.home_disk_size"
    },
    {
      "job_id": "70dc09b7-199b-4853-9e03-5d3b8b6c007d",
      "started_at": "2024-12-18T16:07:13.252068Z",
      "ended_at": "2024-12-18T16:07:13.255598Z",
      "stage": "plan",
      "source": "coder",
      "action": "read",
      "resource": "data.coder_workspace.me"
    },
    {
      "job_id": "70dc09b7-199b-4853-9e03-5d3b8b6c007d",
      "started_at": "2024-12-18T16:07:13.253162Z",
      "ended_at": "2024-12-18T16:07:13.257632Z",
      "stage": "plan",
      "source": "coder",
      "action": "read",
      "resource": "data.coder_parameter.memory"
    },
    {
      "job_id": "70dc09b7-199b-4853-9e03-5d3b8b6c007d",
      "started_at": "2024-12-18T16:07:13.253177Z",
      "ended_at": "2024-12-18T16:07:13.258667Z",
      "stage": "plan",
      "source": "coder",
      "action": "read",
      "resource": "data.coder_parameter.cpu"
    },
    {
      "job_id": "70dc09b7-199b-4853-9e03-5d3b8b6c007d",
      "started_at": "2024-12-18T16:07:13.262178Z",
      "ended_at": "2024-12-18T16:07:13.266056Z",
      "stage": "plan",
      "source": "coder",
      "action": "state refresh",
      "resource": "coder_agent.main"
    },
    {
      "job_id": "70dc09b7-199b-4853-9e03-5d3b8b6c007d",
      "started_at": "2024-12-18T16:07:13.277698Z",
      "ended_at": "2024-12-18T16:07:13.281774Z",
      "stage": "plan",
      "source": "coder",
      "action": "state refresh",
      "resource": "coder_script.splunk_ansible"
    },
    {
      "job_id": "70dc09b7-199b-4853-9e03-5d3b8b6c007d",
      "started_at": "2024-12-18T16:07:13.277881Z",
      "ended_at": "2024-12-18T16:07:13.281838Z",
      "stage": "plan",
      "source": "coder",
      "action": "state refresh",
      "resource": "coder_app.code-server"
    },
    {
      "job_id": "70dc09b7-199b-4853-9e03-5d3b8b6c007d",
      "started_at": "2024-12-18T16:07:13.281617Z",
      "ended_at": "2024-12-18T16:07:13.283034Z",
      "stage": "plan",
      "source": "coder",
      "action": "state refresh",
      "resource": "coder_app.splunk"
    },
    {
      "job_id": "70dc09b7-199b-4853-9e03-5d3b8b6c007d",
      "started_at": "2024-12-18T16:07:13.290473Z",
      "ended_at": "2024-12-18T16:07:13.309158Z",
      "stage": "plan",
      "source": "kubernetes",
      "action": "state refresh",
      "resource": "kubernetes_persistent_volume_claim.home"
    },
    {
      "job_id": "70dc09b7-199b-4853-9e03-5d3b8b6c007d",
      "started_at": "2024-12-18T16:07:13.406039Z",
      "ended_at": "2024-12-18T16:07:14.77887Z",
      "stage": "graph",
      "source": "terraform",
      "action": "building terraform dependency graph",
      "resource": "state file"
    },
    {
      "job_id": "70dc09b7-199b-4853-9e03-5d3b8b6c007d",
      "started_at": "2024-12-18T16:07:15.726965Z",
      "ended_at": "2024-12-18T16:07:15.810591Z",
      "stage": "apply",
      "source": "kubernetes",
      "action": "create",
      "resource": "kubernetes_deployment.main[0]"
    }
  ],
  "agent_script_timings": [
    {
      "started_at": "2024-12-18T16:07:49.769585Z",
      "ended_at": "2024-12-18T16:07:51.118721Z",
      "exit_code": 0,
      "stage": "start",
      "status": "ok",
      "display_name": "Startup Script",
      "workspace_agent_id": "137a736b-a619-437d-b653-6a117f85069d",
      "workspace_agent_name": "main"
    },
    {
      "started_at": "2024-12-19T06:48:24.52297Z",
      "ended_at": "2024-12-19T06:48:30.545644Z",
      "exit_code": 0,
      "stage": "start",
      "status": "ok",
      "display_name": "Startup Script",
      "workspace_agent_id": "137a736b-a619-437d-b653-6a117f85069d",
      "workspace_agent_name": "main"
    },
    {
      "started_at": "2024-12-19T09:18:35.94846Z",
      "ended_at": "2024-12-19T09:18:37.509152Z",
      "exit_code": 0,
      "stage": "start",
      "status": "ok",
      "display_name": "Startup Script",
      "workspace_agent_id": "137a736b-a619-437d-b653-6a117f85069d",
      "workspace_agent_name": "main"
    },
    {
      "started_at": "2025-01-04T17:38:34.907521Z",
      "ended_at": "2025-01-04T17:38:41.363145Z",
      "exit_code": 0,
      "stage": "start",
      "status": "ok",
      "display_name": "Startup Script",
      "workspace_agent_id": "137a736b-a619-437d-b653-6a117f85069d",
      "workspace_agent_name": "main"
    },
    {
      "started_at": "2025-01-04T17:52:38.905293Z",
      "ended_at": "2025-01-04T17:52:40.977108Z",
      "exit_code": 0,
      "stage": "start",
      "status": "ok",
      "display_name": "Startup Script",
      "workspace_agent_id": "137a736b-a619-437d-b653-6a117f85069d",
      "workspace_agent_name": "main"
    },
    {
      "started_at": "2024-12-18T16:07:49.769517Z",
      "ended_at": "2024-12-18T16:09:15.998633Z",
      "exit_code": 0,
      "stage": "start",
      "status": "ok",
      "display_name": "splunk_ansible",
      "workspace_agent_id": "137a736b-a619-437d-b653-6a117f85069d",
      "workspace_agent_name": "main"
    },
    {
      "started_at": "2024-12-19T06:48:24.523009Z",
      "ended_at": "2024-12-19T06:49:51.191554Z",
      "exit_code": 0,
      "stage": "start",
      "status": "ok",
      "display_name": "splunk_ansible",
      "workspace_agent_id": "137a736b-a619-437d-b653-6a117f85069d",
      "workspace_agent_name": "main"
    },
    {
      "started_at": "2024-12-19T09:18:35.948436Z",
      "ended_at": "2024-12-19T09:20:04.274409Z",
      "exit_code": 0,
      "stage": "start",
      "status": "ok",
      "display_name": "splunk_ansible",
      "workspace_agent_id": "137a736b-a619-437d-b653-6a117f85069d",
      "workspace_agent_name": "main"
    },
    {
      "started_at": "2025-01-04T17:52:38.949353Z",
      "ended_at": "2025-01-04T17:56:30.692371Z",
      "exit_code": 0,
      "stage": "start",
      "status": "ok",
      "display_name": "splunk_ansible",
      "workspace_agent_id": "137a736b-a619-437d-b653-6a117f85069d",
      "workspace_agent_name": "main"
    }
  ],
  "agent_connection_timings": [
    {
      "started_at": "2024-12-18T16:07:17.575081Z",
      "ended_at": "2024-12-18T16:07:49.708293Z",
      "stage": "connect",
      "workspace_agent_id": "137a736b-a619-437d-b653-6a117f85069d",
      "workspace_agent_name": "main"
    }
  ]
}

Multiple timings for the "Startup Script" are reported, spanning different dates within the same build, which doesn’t align with expected behavior.
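One quick way to confirm the duplication is to count the entries per script in the `agent_script_timings` array. The sketch below is only an illustration and is not part of the Coder codebase. It assumes the response has already been parsed into a dict. Because the payload above carries no script ID, it uses the pair (`workspace_agent_id`, `display_name`) as a stand-in key, which is an assumption. The helper names `count_script_timings` and `duplicated_scripts` are hypothetical.

```python
from collections import Counter


def count_script_timings(response: dict) -> Counter:
    """Count agent script timings per (agent ID, script display name)."""
    return Counter(
        (t["workspace_agent_id"], t["display_name"])
        for t in response.get("agent_script_timings", [])
    )


def duplicated_scripts(response: dict) -> dict:
    """Return only the scripts that were reported more than once."""
    counts = count_script_timings(response)
    return {key: n for key, n in counts.items() if n > 1}


# Trimmed-down payload reusing values from the response above.
AGENT = "137a736b-a619-437d-b653-6a117f85069d"
sample = {
    "agent_script_timings": [
        {"started_at": "2024-12-18T16:07:49.769585Z",
         "display_name": "Startup Script", "workspace_agent_id": AGENT},
        {"started_at": "2024-12-19T06:48:24.52297Z",
         "display_name": "Startup Script", "workspace_agent_id": AGENT},
        {"started_at": "2024-12-18T16:07:49.769517Z",
         "display_name": "splunk_ansible", "workspace_agent_id": AGENT},
    ]
}
duplicates = duplicated_scripts(sample)
```

Applied to the full response above, the same count gives five entries for "Startup Script" and four for "splunk_ansible".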

Possible Causes:

  1. Incorrect Build ID Insertion: Agent script timings may be incorrectly associated with the wrong build ID.
  2. Query Error: The API query for fetching agent script timings might be flawed.

Additional Context:

  • This issue is related to #15921.
  • A workaround for a similar problem was implemented in this PR.
  • Fixes for agent script timings in this PR might have introduced this bug again.
@coder-labeler coder-labeler bot added bug risk Prone to bugs needs-triage Issue that require triage labels Jan 13, 2025
@BrunoQuaresma
Collaborator Author

BrunoQuaresma commented Jan 13, 2025

Another strange observation is that the agent connection time and provisioner timings are from 2024-12-18, which seems to be an unusually old date. 🤔 This makes me wonder if there might be a date-related issue in the codebase. This might be a different bug.

@DanielleMaywood
Contributor

To summarize a call I had with @mafredri

We believe that one script having multiple script timings is expected when an agent crashes and restarts. In that case the agent re-runs the startup scripts, resulting in a second set of script runs ending up in the database.

We discussed a couple of options:

  1. We could investigate if this agent behavior makes sense, and possibly change it
  2. We could add a "run ID" to each run of the scripts so we could split the runs
  3. We could throw away any script timings after the first run
  4. We could only return the first script timing for each script in the database

We've decided that the final option makes the most sense for now, as it requires the fewest changes.
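The chosen option (return only the first timing per script) can be sketched on the client side as follows. This is an illustration under stated assumptions, not the server-side query change. Timestamps are assumed to be UTC with a trailing `Z`. Scripts are keyed by (`workspace_agent_id`, `display_name`) because the response shown earlier has no script ID. The helper names `parse_timestamp` and `first_timing_per_script` are hypothetical.

```python
from datetime import datetime, timezone


def parse_timestamp(value: str) -> datetime:
    """Parse a UTC timestamp such as 2024-12-19T06:48:24.52297Z.

    The fractional part may have any number of digits, so it is padded
    or truncated to microseconds before parsing.
    """
    head, _, frac = value.rstrip("Z").partition(".")
    frac = (frac + "000000")[:6]
    parsed = datetime.strptime(f"{head}.{frac}", "%Y-%m-%dT%H:%M:%S.%f")
    return parsed.replace(tzinfo=timezone.utc)


def first_timing_per_script(timings: list) -> list:
    """Keep only the earliest timing for each (agent, script) pair."""
    first = {}
    for timing in timings:
        key = (timing["workspace_agent_id"], timing["display_name"])
        current = first.get(key)
        if current is None or (
            parse_timestamp(timing["started_at"])
            < parse_timestamp(current["started_at"])
        ):
            first[key] = timing
    return list(first.values())


# Deliberately out of order, reusing values from the response above.
AGENT = "137a736b-a619-437d-b653-6a117f85069d"
timings = [
    {"started_at": "2024-12-19T06:48:24.52297Z",
     "display_name": "Startup Script", "workspace_agent_id": AGENT},
    {"started_at": "2024-12-18T16:07:49.769585Z",
     "display_name": "Startup Script", "workspace_agent_id": AGENT},
    {"started_at": "2024-12-18T16:07:49.769517Z",
     "display_name": "splunk_ansible", "workspace_agent_id": AGENT},
    {"started_at": "2025-01-04T17:52:38.949353Z",
     "display_name": "splunk_ansible", "workspace_agent_id": AGENT},
]
first = first_timing_per_script(timings)
```

Comparing parsed timestamps rather than raw strings matters here because the fractional seconds in the payload vary in length.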

@dannykopping
Contributor

Sounds like a reasonable approach 👍

I think option 2 is worth doing later on, though.

@mafredri
Member

> I think option 2 is worth doing later on, though.

Expanding on 2, we can infer a pseudo-run ID via UNIQUE(script_id, start_time), although that’s not enforced on the DB level. But if we had a UUID that persists for the runtime of the agent (process, generated at agent or script runner instantiation), and every script run has this UUID, that could be useful if we ever need to discern if the script was run by a restarted agent.
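The run-ID idea can be sketched roughly as follows: a UUID is generated once when the script runner is instantiated and stamped on every timing it records, so timings from a restarted agent carry a different ID. Everything here is hypothetical, including the `ScriptRunner` class, the `run_id` field, and the `record` and `group_by_run` helpers. None of it reflects the actual agent code.

```python
import uuid
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class ScriptRunner:
    """Hypothetical stand-in for the agent's script runner."""

    # Generated once per instance, so a restarted agent gets a new value.
    run_id: uuid.UUID = field(default_factory=uuid.uuid4)
    timings: list = field(default_factory=list)

    def record(self, script_id: str, started_at: str, ended_at: str) -> dict:
        """Store one script timing tagged with this runner's run ID."""
        timing = {
            "run_id": str(self.run_id),
            "script_id": script_id,
            "started_at": started_at,
            "ended_at": ended_at,
        }
        self.timings.append(timing)
        return timing


def group_by_run(timings: list) -> dict:
    """Split a flat list of timings into one list per run ID."""
    runs = defaultdict(list)
    for timing in timings:
        runs[timing["run_id"]].append(timing)
    return dict(runs)


# First agent process runs both scripts.
first_boot = ScriptRunner()
first_boot.record("startup", "2024-12-18T16:07:49.769585Z",
                  "2024-12-18T16:07:51.118721Z")
first_boot.record("splunk_ansible", "2024-12-18T16:07:49.769517Z",
                  "2024-12-18T16:09:15.998633Z")

# A restarted agent creates a new runner and re-runs a script.
after_restart = ScriptRunner()
after_restart.record("startup", "2024-12-19T06:48:24.52297Z",
                     "2024-12-19T06:48:30.545644Z")

runs = group_by_run(first_boot.timings + after_restart.timings)
```

With an explicit ID, separating runs no longer depends on comparing start times, which, as noted above, is not enforced at the database level.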

@dannykopping
Contributor

> > I think option 2 is worth doing later on, though.
>
> Expanding on 2, we can infer a pseudo-run ID via UNIQUE(script_id, start_time), although that’s not enforced on the DB level. But if we had a UUID that persists for the runtime of the agent (process, generated at agent or script runner instantiation), and every script run has this UUID, that could be useful if we ever need to discern if the script was run by a restarted agent.

This might be useful soon for prebuilds, since we'll have the agent reinitialize after being claimed by a user: https://www.notion.so/coderhq/Prebuilds-176d579be59280398851fe6473badfe7?pvs=4#17cd579be592805db784d9bc8fc54a0e

gcp-cherry-pick-bot bot pushed a commit that referenced this issue Jan 21, 2025
…16203)

Fixes #16124

If a workspace agent crashes, it is possible for any startup scripts to
be run again. This PR makes it so that the
`GetWorkspaceAgentScriptTimingsByBuildID` query only returns the first
timing recorded per-script.
matifali pushed a commit that referenced this issue Jan 21, 2025
SasSwart pushed a commit that referenced this issue Jan 22, 2025