feat: add datadog-error-monitor skill#336
Draft
tofarr wants to merge 13 commits into
Initial draft of a cron automation skill that polls Datadog logs, maintains a regex-based error pattern library, and triggers OpenHands investigation conversations on new or spiking errors. Co-authored-by: openhands <[email protected]>
Fixes test_marketplace_includes_all_skills — every skill with a SKILL.md must have a corresponding marketplace entry. Co-authored-by: openhands <[email protected]>
Runs build-skills-catalog.mjs to include datadog-error-monitor in the generated catalog. Fixes test_index_is_up_to_date. Co-authored-by: openhands <[email protected]>
- skills/datadog-error-monitor/.plugin/plugin.json (required by test_all_marketplace_skills_have_plugin_json)
- .claude-plugin and .codex-plugin symlinks (required by test_all_marketplace_skills_have_vendor_symlinks)
- README.md catalog regenerated via sync_extensions.py catalog
Co-authored-by: openhands <[email protected]>
Adds automations/catalog/datadog-error-monitor.json and updates automations/index.js so the automation appears in the agent-canvas beta automations list alongside linear-triage, standup-digest, etc. Co-authored-by: openhands <[email protected]>
- integrations/catalog/datadog.json: new HTTP integration entry with iconBg #632CA6 (Datadog brand purple)
- automations/catalog/datadog-error-monitor.json: add "datadog" as the first requiredIntegrationId so the automation card shows the Datadog logo rather than the Slack logo
Datadog is kind: "http" (not mcp) since the automation uses the Datadog REST API directly via DD_API_KEY / DD_APP_KEY secrets.
Co-authored-by: openhands <[email protected]>
…oyment correlation
Implements all discussed improvements to main.py:
* archive_stale_patterns() — patterns not seen in 30 days are moved to
dd_monitor_{id}_archive.json (separate file, not deleted)
* Pattern schema gains first_seen, total_events, and description fields;
total_events is incremented on every matched log event
* EXAMPLES_PER_PATTERN lowered 5 → 3
* Investigation prompt restructured into 4 tasks:
Task 1 — Categorize unknown logs (with deduplication check against existing
patterns before creating new ones)
Task 2 — Correlate first_seen against git tags to surface likely deploy
(with explicit note asking user to confirm their deployment signal)
Task 3 — Investigate spiking patterns within a hard tool-call budget
(INVESTIGATION_BUDGET=10); step-by-step with one permitted
follow-up Datadog query; explicit "declare inconclusive" escape hatch
Task 4 — Post Slack summary including inconclusive patterns
Co-authored-by: openhands <[email protected]>
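As a rough illustration of the archiving step described in this commit, the sketch below splits a pattern list into active and archived entries by age. It is not the code from main.py: the `last_seen` field name, the ISO-8601 timestamp format, and the function signature are assumptions made for the example, and writing the result to dd_monitor_{id}_archive.json is left out.

```python
from datetime import datetime, timedelta, timezone

ARCHIVE_AFTER_DAYS = 30  # from the commit message: not seen in 30 days


def archive_stale_patterns(patterns, archive, now=None):
    """Return (active, archived) lists; stale patterns are moved, never deleted.

    Assumption: each pattern dict carries a hypothetical `last_seen` field
    holding a timezone-aware ISO-8601 timestamp string.
    """
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=ARCHIVE_AFTER_DAYS)
    active = []
    archived = list(archive)  # copy so the caller's archive list is untouched
    for pattern in patterns:
        last_seen = datetime.fromisoformat(pattern["last_seen"])
        if last_seen < cutoff:
            archived.append(pattern)
        else:
            active.append(pattern)
    return active, archived
```

Returning both lists rather than mutating in place keeps the step easy to test; the caller would then persist each list to its own state file.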
SKILL.md (parameter table + substitution table) and references/agent-prompt-template.md were still showing the old default of 5 after the value was changed in main.py. Co-authored-by: openhands <[email protected]>
SKILL.md content changed (EXAMPLES_PER_PATTERN doc fix); rerun build-skills-catalog.mjs to keep index in sync. Co-authored-by: openhands <[email protected]>
Co-authored-by: openhands <[email protected]>
…ey is not masked
Without this header, /api/settings returns the llm.api_key as a redacted placeholder. That placeholder flows into the spawned conversation payload, causing LiteLLM to fail with "Missing credentials". Matches the pattern already used in github-repo-monitor.
Co-authored-by: openhands <[email protected]>
…not permanently blocked
When a conversation fails silently (e.g. due to LiteLLM errors), it can remain in 'running' state indefinitely. The active-conversation guard then exits early on every subsequent run, skipping both the unknown-log and spike triggers entirely.
This adds STUCK_CONVERSATION_MINUTES = 45. Any conversation that has been in a non-terminal state for longer than that is treated as stuck, the active slot is cleared, and trigger evaluation proceeds normally on the same run.
Co-authored-by: openhands <[email protected]>
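The guard described in this commit could look roughly like the sketch below. Only STUCK_CONVERSATION_MINUTES = 45 comes from the commit message; the state names in `TERMINAL_STATES`, the `status` and `started_at` fields, and the function name are hypothetical and chosen only for illustration.

```python
from datetime import datetime, timedelta, timezone

STUCK_CONVERSATION_MINUTES = 45  # value from the commit message
TERMINAL_STATES = {"finished", "error", "stopped"}  # assumed state names


def active_slot_is_blocking(conversation, now=None):
    """Return True if the recorded conversation should block new triggers.

    Assumption: `conversation` is None or a dict with a `status` string and
    a timezone-aware ISO-8601 `started_at` timestamp.
    """
    if conversation is None:
        return False
    if conversation["status"] in TERMINAL_STATES:
        return False
    now = now or datetime.now(timezone.utc)
    started = datetime.fromisoformat(conversation["started_at"])
    # Non-terminal for longer than the limit: treat as stuck and stop
    # blocking, so trigger evaluation can proceed on this same run.
    return now - started <= timedelta(minutes=STUCK_CONVERSATION_MINUTES)
```

When this returns False for a non-terminal conversation, the caller would clear the active slot before evaluating the unknown-log and spike triggers.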
Summary
Adds a new datadog-error-monitor skill: a cron automation that polls Datadog logs every 15 minutes, maintains a self-evolving regex-based error pattern library, and triggers targeted OpenHands investigation conversations when new or spiking errors are detected.
How it works
Token efficiency: The cron script is 100% deterministic — zero LLM calls on quiet runs. A conversation is only started when triggered, and only one conversation runs at a time.
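To make the deterministic, LLM-free trigger check concrete, here is a minimal sketch of the spike rule listed under Key design decisions: current_count > mean(last 3 runs) × SPIKE_MULTIPLIER, active only once a pattern has 3 or more history entries. The default multiplier of 3.0 and the function signature are assumptions; the PR text does not state them.

```python
from statistics import mean

SPIKE_MULTIPLIER = 3.0  # assumed default; not stated in the PR text
MIN_HISTORY = 3         # rule activates only after 3+ history entries


def is_spiking(current_count, history, multiplier=SPIKE_MULTIPLIER):
    """Apply current_count > mean(last 3 runs) * multiplier.

    `history` is a list of per-run counts for one pattern, oldest first.
    """
    if len(history) < MIN_HISTORY:
        # Too little history: skip to avoid false positives on new patterns.
        return False
    baseline = mean(history[-MIN_HISTORY:])
    return current_count > baseline * multiplier
```

For example, with a history of [10, 10, 10] and the assumed multiplier, a run with 40 events counts as a spike while a run with 30 does not, because the comparison is strict.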
Files
- SKILL.md
- scripts/main.py
- references/state-schema.md
- references/datadog-api.md
- references/agent-prompt-template.md
- README.md
Key design decisions
current_count > mean(last 3 runs) × SPIKE_MULTIPLIER: activates only after 3+ history entries to avoid false positives on newly added patterns.
Open questions / known gaps for review
Pattern bootstrapping UX: On first run everything is uncategorized. Users may want to run a one-time manual bootstrap query to see what the first agent conversation will receive. Should the SKILL.md include an optional pre-run step that previews a sample of recent errors?
min_cluster_size not yet configurable: Currently hardcoded to 1 (any single unmatched log triggers). Should this be a configurable setup parameter?
Investigation conversation workspace isolation: The agent's workspace is set to the first configured repo path. For multi-repo setups this is a minor limitation. A dedicated investigations directory could be used instead. Opinions?
No .plugin metadata directory: Other skills have a .plugin/ directory with plugin manifests. This PR doesn't include it yet. Should it follow the same pattern?
Screenshots
Skill appears in beta list:

Setup works as expected:

Monitors create conversations
