Improve agent connection troubleshooting #15423

Open
matifali opened this issue Nov 7, 2024 · 8 comments
Comments

@matifali
Member

matifali commented Nov 7, 2024

Problem Description

Coder agents sometimes fail to connect to the Coder server due to a variety of issues, including network restrictions (e.g., DNS issues, firewalls), missing permissions (e.g., CAP_NET_ADMIN), OS or architecture mismatches, and missing tools for downloading the agent binary. Currently, there’s limited guidance in the UI to help users diagnose and resolve these issues effectively, leading to delays in troubleshooting.

Image

For example, failures in the agent bootstrap script can result in non-connecting agents without a clear indication of the root cause. When checking the workspace logs (e.g., with docker logs <container name or container id>), a typical DNS failure log might look like this:

+ trap waitonexit EXIT
+ mktemp -d -t coder.XXXXXX
+ BINARY_DIR=/tmp/coder.1uZgEp
+ BINARY_NAME=coder
+ BINARY_URL=https://coder.example.com:3000/bin/coder-linux-amd64
+ cd /tmp/coder.1uZgEp
+ :
+ status=
+ command -v curl
+ curl -fsSL --compressed http://coder.example.com:3000/bin/coder-linux-amd64 -o coder
curl: (6) Could not resolve host: coder.example.com
+ status=6
+ echo error: failed to download coder agent
+ echo        command returned: 6
+ echo Trying again in 30 seconds...
+ sleep 30
error: failed to download coder agent
command returned: 6
Trying again in 30 seconds...

Desired Solution

Implement enhanced diagnostics and UI hints that provide actionable guidance to users based on the detected issue. By giving users specific suggestions directly in the UI, they can resolve connectivity issues faster and with less frustration. This includes:

  1. Enhanced Error Logging and Diagnostics

    • Log detailed error messages for each failure point, covering:
      • Network/DNS issues, with suggestions to verify DNS configuration or consult network administrators.
      • Download tool availability (e.g., curl or wget), with instructions on how to install the required tool.
      • OS/architecture mismatches with a link to supported environments in the documentation.
  2. UI Hints for Diagnosed Issues (see footnote 1)

    • Network/DNS Issue: If a DNS or network error is detected, show a UI message like:
      “It appears there’s a DNS or firewall issue preventing the agent from connecting to the server. Learn more about network configuration.”
    • Download Tool Missing: If the required download tools (curl, wget) are unavailable, suggest a hint:
      “Required download tool not found. Please install either curl or wget.”
    • Unsupported OS/Architecture: If OS or architecture compatibility issues arise, prompt users to check supported platforms:
      “This environment may be unsupported. Review supported OS and architectures.”
    • Download Logs: If the agent doesn't connect, link to docs to show how to fetch agent logs outside of Coder.
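
The hint selection described in the list above could be sketched as a small lookup from the download command's exit status to a message. This is an illustration only, not how Coder implements it: the names `HINTS`, `curl_status_from_log`, and `hint_for_status` are hypothetical, and the fallback wording is an assumption. The hint texts are taken from the list above; 6 and 7 are curl's documented exit codes for "could not resolve host" and "failed to connect", and 127 is the shell's conventional "command not found" status.

```python
import re
from typing import Optional

DNS_HINT = (
    "It appears there's a DNS or firewall issue preventing the agent "
    "from connecting to the server."
)
TOOL_HINT = "Required download tool not found. Please install either curl or wget."
# Hypothetical fallback wording, based on the "Download Logs" item.
FALLBACK_HINT = (
    "The agent did not connect. See the docs for how to fetch agent logs "
    "outside of Coder."
)

# Hypothetical table: exit status of the download command -> UI hint.
HINTS = {
    6: DNS_HINT,     # curl: could not resolve host
    7: DNS_HINT,     # curl: failed to connect to host
    127: TOOL_HINT,  # shell: command not found
}

# Matches lines such as "curl: (6) Could not resolve host: ...".
CURL_ERROR = re.compile(r"^curl: \((\d+)\)", re.MULTILINE)


def curl_status_from_log(log: str) -> Optional[int]:
    """Return the last curl error code found in a bootstrap log, if any."""
    matches = CURL_ERROR.findall(log)
    return int(matches[-1]) if matches else None


def hint_for_status(status: Optional[int]) -> str:
    """Pick a hint for an exit status, falling back to a generic message."""
    return HINTS.get(status, FALLBACK_HINT)
```

Applied to the DNS failure log shown earlier, `curl_status_from_log` would find the `curl: (6)` line and `hint_for_status` would select the DNS/firewall hint. Getting that status to the UI is a separate problem, as footnote 1 points out.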

Proposed Implementation

  • Backend Logging: Improve diagnostic logging in the agent bootstrap script to provide clearer insights into why each specific failure occurs.
  • UI Updates (see footnote 1): Implement conditional pop-ups or error messages in the Coder UI that guide users based on diagnosed connectivity issues.
  • Documentation Update: Expand documentation with a troubleshooting section that covers all major connectivity blockers, including example configurations.

Footnotes

  1. This may not be possible currently, as we do not have any way to expose these logs to the UI without the agent running.

@coder-labeler coder-labeler bot added the docs Area: coder.com/docs label Nov 7, 2024
@matifali matifali changed the title from "Imrpove error" to "Improve agent connection troubleshooting" Nov 7, 2024
@matifali matifali removed the docs Area: coder.com/docs label Nov 7, 2024
@ethanndickson
Member

ethanndickson commented Nov 7, 2024

Related to #6711.

I spent a little time on that issue not long ago and we decided it wouldn't be unreasonable for the agent script to attempt to log errors to /api/v2/workspaceagents/me/logs*, if it was unable to start the agent. This would let issues like a weird environment (no curl, wget) or an unsupported OS/arch get displayed on the Web UI. This is kinda tricky since (afaik) the script can start before the agent token gets inserted into the DB, so it needs to do this on a retry.

If there's a network or DNS issue, I don't think we're gonna be able to propagate that to the UI in any way. If the script can't download the agent binary, it's not going to be able to post logs to the agent route.

In any case, I think we should definitely improve the log output to make some of these common cases easily identifiable and fixable.

*We'd also need to un-deprecate the route, as there's no way we can do a dRPC call from a bash script.
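
The retry behaviour described in this comment could look roughly like the sketch below. It is not the bootstrap script's actual logic: the real script is shell and would presumably use curl or wget, and the helper names `build_log_request` and `send_with_retry` are hypothetical. Only the route path comes from the comment above; the HTTP method, the header name, and the JSON field names are assumptions that this thread does not confirm.

```python
import json
import time
import urllib.request
from typing import Callable


def build_log_request(base_url: str, token: str, message: str) -> urllib.request.Request:
    """Build (but do not send) a request reporting a bootstrap error."""
    # ASSUMPTION: the body shape, PATCH method, and header name are illustrative.
    body = json.dumps({"logs": [{"level": "error", "output": message}]}).encode()
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/api/v2/workspaceagents/me/logs",
        data=body,
        method="PATCH",
        headers={
            "Coder-Session-Token": token,
            "Content-Type": "application/json",
        },
    )


def send_with_retry(
    send: Callable[[], None],
    attempts: int = 5,
    delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call send() until it succeeds or the attempts run out.

    Retrying covers the case where the script starts before the agent
    token has been inserted into the database.
    """
    for attempt in range(1, attempts + 1):
        try:
            send()
            return True
        except OSError:  # urllib.error.URLError and HTTPError subclass OSError
            if attempt < attempts:
                sleep(delay)
    return False
```

A caller would pass something like `lambda: urllib.request.urlopen(req).close()` as `send`. Injecting `sleep` keeps the retry loop testable without waiting. None of this helps with the network or DNS case, since the request cannot reach the server either.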

@bpmct
Member

bpmct commented Nov 7, 2024

> I spent a little time on that issue not long ago and we decided it wouldn't be unreasonable for the agent script to attempt to log errors to /api/v2/workspaceagents/me/logs*, if it was unable to start the agent.

Nice. It seems like a combination of logs sent to the server, improved logging of the provisioner script, and better documentation is the key here.

> If there's a network or DNS issue, I don't think we're gonna be able to propagate that to the UI in any way. If the script can't download the agent binary, it's not going to be able to post logs to the agent route.

Yep, ideally, we can have better hints in the logs of our agent download & install script or even hint that there is a DNS error 🤞🏼

> UI Hints for Diagnosed Issues
> Documentation Update

I think it'd be awesome to drill down a bit more on the current gaps in the docs and UI.

@matifali
Member Author

matifali commented Nov 12, 2024

> I think it'd be awesome to drill down a bit more on the current gaps in the docs and UI.

@EdwardAngert, could you help find the shortcomings in our getting-started docs?

@bpmct
Member

bpmct commented Nov 18, 2024

From #15462:

Currently if an agent is not able to establish a connection to the control plane we display a warning that the agent isn't healthy

Image

Unfortunately, we don't provide any additional information as to why the agent is unhealthy. We should either add context to the existing icon or add an additional icon on the agent resource itself explaining why that specific agent isn't healthy. We supply the reason in the payload.

"agents": [
    {
        "id": "b446b044-0b0d-4e65-806c-ef30c10b535a",
        "name": "dev",
        "apps": [],
        "connection_timeout_seconds": 120,
        "troubleshooting_url": "https://coder.com/docs/templates/troubleshooting",
        "subsystems": [],
        "health": {
            "healthy": false,
            "reason": "agent is not running"
        }
    }
]
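
As a rough illustration of how a client might surface that reason, the sketch below walks a payload shaped like the excerpt above and builds one message per unhealthy agent. The function name `unhealthy_agent_messages` is hypothetical, and the excerpt is assumed to sit inside a JSON object with an `agents` key. The Coder UI is not written in Python, so this shows the logic only.

```python
import json
from typing import List


def unhealthy_agent_messages(payload: dict) -> List[str]:
    """Return one human-readable message per unhealthy agent."""
    messages = []
    for agent in payload.get("agents", []):
        health = agent.get("health", {})
        # Treat a missing health block as healthy; only report explicit failures.
        if health.get("healthy", True):
            continue
        reason = health.get("reason") or "unknown reason"
        message = f"Agent {agent.get('name', '?')} is unhealthy: {reason}"
        url = agent.get("troubleshooting_url", "")
        if url:
            message += f" (see {url})"
        messages.append(message)
    return messages


# The excerpt from the comment above, wrapped in an object (an assumption).
SAMPLE = json.loads("""
{
    "agents": [
        {
            "id": "b446b044-0b0d-4e65-806c-ef30c10b535a",
            "name": "dev",
            "apps": [],
            "connection_timeout_seconds": 120,
            "troubleshooting_url": "https://coder.com/docs/templates/troubleshooting",
            "subsystems": [],
            "health": {
                "healthy": false,
                "reason": "agent is not running"
            }
        }
    ]
}
""")

for line in unhealthy_agent_messages(SAMPLE):
    print(line)
```

Showing `reason` next to the existing warning icon, with `troubleshooting_url` as the link target, would use only fields already present in the payload.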

@bpmct
Member

bpmct commented Nov 18, 2024

@johnstcn @SasSwart - related to the envbuilder git clone failed thing that happened today 😉

@johnstcn
Member

We discussed this today in stand-up. There are a few different sorts of issues here:

  1. Agent bootstrap script is unable to download the agent (e.g. due to a network/DNS/certs issue)
  2. Agent bootstrap script is unable to start the agent (could be any number of reasons)
  3. Agent init scripts fail for some reason (again, could be any number of reasons)

With regard to 1), we don't have many options for surfacing errors here.
For 2) and 3), a simple API endpoint to send logs back to Coderd that can also be accessible via cURL could provide some way of exposing this information more easily.

@bpmct
Member

bpmct commented Nov 25, 2024

Did we also discuss how we can improve the docs and UX to point people to docs, or is this already sufficient?

@johnstcn
Member

> Did we also discuss how we can improve the docs and UX to point people to docs, or is this already sufficient?

No, we were mostly focused on how to improve the troubleshooting situation. We didn't focus much on docs.
