feat(iac): Add Ansible role for Netdata deployment #21668
ktsaou wants to merge 1 commit into netdata:master
Add Infrastructure as Code support for deploying Netdata with Ansible:
- Reusable role with profile-based configuration
- Support for standalone, parent, child, and child_minimal profiles
- Cloud claiming via claim.conf
- Streaming configuration for parent/child architectures
- Managed files with automatic restart on changes
- Example inventory with group variables
- E2E testing framework (libvirt + Docker)
- Documentation for end-users and internal guidelines
5 issues found across 34 files
Confidence score: 3/5
- Missing checksum verification for the Netdata kickstart download in `src/IaC/ansible/roles/netdata/tasks/install.yml` could allow a tampered script to be executed, which is a concrete security risk.
- `netdata_claim_enabled` defaulting to true and the undefined `netdata_managed_files_dest_profiles` access in `src/IaC/ansible/roles/netdata/defaults/main.yml` and `src/IaC/ansible/roles/netdata/tasks/managed-files.yml` can cause the role to fail out of the box.
- Score reflects multiple medium-to-high severity reliability/security issues, so there is some merge risk despite the changes being localized.
- Pay close attention to `src/IaC/ansible/roles/netdata/tasks/install.yml`, `src/IaC/ansible/roles/netdata/defaults/main.yml`, and `src/IaC/ansible/roles/netdata/tasks/managed-files.yml`: checksum verification and default/undefined variable handling.
Prompt for AI agents (all issues)
Check if these issues are valid — if so, understand the root cause of each and fix them.
<file name="src/IaC/ansible/roles/netdata/tasks/install.yml">
<violation number="1" location="src/IaC/ansible/roles/netdata/tasks/install.yml:28">
P1: Add checksum verification for the kickstart script download to prevent executing a tampered artifact.</violation>
</file>
<file name="src/IaC/ansible/roles/netdata/defaults/main.yml">
<violation number="1" location="src/IaC/ansible/roles/netdata/defaults/main.yml:5">
P2: Defaulting `netdata_claim_enabled` to true causes the role to fail unless callers always set a claim token. Set the default to false so the role works out of the box when no token is provided.</violation>
</file>
<file name="src/IaC/ansible/e2e/run.sh">
<violation number="1" location="src/IaC/ansible/e2e/run.sh:16">
P2: The `run()` wrapper never prints its error block when a command fails because `set -e` exits immediately on `"$@"`. Wrap the command in an `if` so failures are handled inside the function.</violation>
</file>
<file name="src/IaC/README.md">
<violation number="1" location="src/IaC/README.md:3">
P3: Use the product naming convention (“Netdata Agent”) in documentation.</violation>
</file>
<file name="src/IaC/ansible/roles/netdata/tasks/managed-files.yml">
<violation number="1" location="src/IaC/ansible/roles/netdata/tasks/managed-files.yml:32">
P2: Avoid indexing netdata_managed_files_dest_profiles before it exists; the first iteration will fail because the variable is undefined. Default the dictionary before indexing it.</violation>
</file>
Architecture diagram
sequenceDiagram
participant Ctrl as Ansible Controller
participant Host as Target Host (Agent)
participant Parent as Parent Agent (Optional)
participant Cloud as Netdata Cloud
Note over Ctrl,Host: NEW: Provisioning & Configuration Flow
Ctrl->>Host: NEW: Run kickstart.sh (non-interactive)
Host->>Host: Install Netdata binaries & packages
Ctrl->>Ctrl: NEW: Resolve profiles (Standalone, Parent, Child)
Note over Ctrl: Check for file collisions between profiles
Ctrl->>Host: NEW: Deploy managed files (netdata.conf, stream.conf)
Note over Host: Automatic service restart on file changes
alt Profile: child / child_minimal
Host->>Parent: NEW: Initiate streaming (Port 19999)
Parent-->>Host: Accept metrics (API Key validation)
else Profile: parent
Host->>Host: NEW: Configure [web] static-threaded
Host->>Host: NEW: Define allowed Child API Keys
end
Note over Ctrl,Cloud: NEW: Cloud Claiming Flow (State-based)
opt netdata_claim_enabled: true
Ctrl->>Host: NEW: Write claim.conf (Token, Rooms, Proxy)
alt netdata_reclaim: true OR not claimed
Host->>Host: NEW: netdatacli reload-claiming-state
Host->>Cloud: NEW: Authenticate & Claim node
Cloud-->>Host: Return claimed_id marker
else Already Claimed
Host->>Host: Skip reload (idempotent)
end
end
Ctrl->>Host: Verify netdata.service state (Started/Enabled)
Host-->>Ctrl: Provisioning Complete
        - ansible_facts.os_family == "Debian"

    - name: Download kickstart script
      get_url:
P1: Add checksum verification for the kickstart script download to prevent executing a tampered artifact.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At src/IaC/ansible/roles/netdata/tasks/install.yml, line 28:
<comment>Add checksum verification for the kickstart script download to prevent executing a tampered artifact.</comment>
<file context>
@@ -0,0 +1,61 @@
+ - ansible_facts.os_family == "Debian"
+
+- name: Download kickstart script
+ get_url:
+ url: "{{ netdata_kickstart_url }}"
+ dest: /tmp/netdata-kickstart.sh
</file context>
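The check requested in this finding can be sketched in shell, shown here as an illustration of the idea and not as the role's actual task. The function compares a file's SHA-256 digest with an expected value and succeeds only on a match. `verify_sha256` and `EXPECTED_SHA256` are hypothetical names, and where a trusted digest comes from is outside this sketch. Inside the role itself, the `checksum` parameter of `get_url` (in the form `sha256:<hex>`) is likely the more direct place to enforce the same check.

```shell
#!/usr/bin/env bash
set -euo pipefail

# verify_sha256 FILE EXPECTED_HEX
# Succeeds only if the SHA-256 digest of FILE equals EXPECTED_HEX.
# EXPECTED_HEX must be lowercase hex, which is what both tools print.
verify_sha256() {
    local file=$1 expected=$2 actual
    if command -v sha256sum >/dev/null 2>&1; then
        actual=$(sha256sum < "$file" | awk '{print $1}')
    else
        # Fallback for systems that ship shasum instead of sha256sum.
        actual=$(shasum -a 256 < "$file" | awk '{print $1}')
    fi
    [[ "${actual}" == "${expected}" ]]
}

# Hypothetical usage: run the downloaded script only if the digest matches.
# verify_sha256 /tmp/netdata-kickstart.sh "${EXPECTED_SHA256}" \
#     && sh /tmp/netdata-kickstart.sh
```

Pinning a digest means it has to be updated whenever the upstream script changes, which is a trade-off against always fetching the latest kickstart.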
    netdata_kickstart_url: "https://get.netdata.cloud/kickstart.sh"
    netdata_release_channel: "stable"

    netdata_claim_enabled: true
P2: Defaulting netdata_claim_enabled to true causes the role to fail unless callers always set a claim token. Set the default to false so the role works out of the box when no token is provided.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At src/IaC/ansible/roles/netdata/defaults/main.yml, line 5:
<comment>Defaulting `netdata_claim_enabled` to true causes the role to fail unless callers always set a claim token. Set the default to false so the role works out of the box when no token is provided.</comment>
<file context>
@@ -0,0 +1,51 @@
+netdata_kickstart_url: "https://get.netdata.cloud/kickstart.sh"
+netdata_release_channel: "stable"
+
+netdata_claim_enabled: true
+netdata_claim_token: ""
+netdata_claim_rooms: ""
</file context>
      printf >&2 "${GRAY}$(pwd) >${NC} ${YELLOW}"
      printf >&2 "%q " "$@"
      printf >&2 "${NC}\n"
      "$@"
P2: The run() wrapper never prints its error block when a command fails because set -e exits immediately on "$@". Wrap the command in an if so failures are handled inside the function.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At src/IaC/ansible/e2e/run.sh, line 16:
<comment>The `run()` wrapper never prints its error block when a command fails because `set -e` exits immediately on `"$@"`. Wrap the command in an `if` so failures are handled inside the function.</comment>
<file context>
@@ -0,0 +1,522 @@
+ printf >&2 "${GRAY}$(pwd) >${NC} ${YELLOW}"
+ printf >&2 "%q " "$@"
+ printf >&2 "${NC}\n"
+ "$@"
+ local exit_code=$?
+ if [[ ${exit_code} -ne 0 ]]; then
</file context>
    {{
      netdata_managed_files_dest_profiles | default({})
      | combine({
        item.dest_resolved: (netdata_managed_files_dest_profiles[item.dest_resolved] | default([])) + [ item.__profile | default('host') ]
P2: Avoid indexing netdata_managed_files_dest_profiles before it exists; the first iteration will fail because the variable is undefined. Default the dictionary before indexing it.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At src/IaC/ansible/roles/netdata/tasks/managed-files.yml, line 32:
<comment>Avoid indexing netdata_managed_files_dest_profiles before it exists; the first iteration will fail because the variable is undefined. Default the dictionary before indexing it.</comment>
<file context>
@@ -0,0 +1,83 @@
+ {{
+ netdata_managed_files_dest_profiles | default({})
+ | combine({
+ item.dest_resolved: (netdata_managed_files_dest_profiles[item.dest_resolved] | default([])) + [ item.__profile | default('host') ]
+ }, recursive=True)
+ }}
</file context>
    @@ -0,0 +1,99 @@
    # Netdata Infrastructure as Code (IaC)

    Deploy and configure Netdata agents at scale using your preferred configuration management tool.
P3: Use the product naming convention (“Netdata Agent”) in documentation.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At src/IaC/README.md, line 3:
<comment>Use the product naming convention (“Netdata Agent”) in documentation.</comment>
<file context>
@@ -0,0 +1,99 @@
+# Netdata Infrastructure as Code (IaC)
+
+Deploy and configure Netdata agents at scale using your preferred configuration management tool.
+
+## Supported Tools
</file context>
Pull request overview
Adds an Ansible-based IaC workflow for installing and configuring Netdata with profile-based configuration, Cloud claiming, streaming support, and an accompanying E2E test harness.
Changes:
- Introduces a reusable Ansible role (`roles/netdata`) to install Netdata via kickstart, manage config files, configure streaming, and handle Cloud claiming.
- Adds example inventories/group vars and profile-based file bundles to demonstrate common topologies (standalone/parent/child).
- Adds libvirt + Docker E2E test harness plus documentation, and updates `.gitignore` for secrets and test artifacts.
Reviewed changes
Copilot reviewed 33 out of 34 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| src/IaC/ansible/roles/netdata/templates/claim.conf.j2 | Jinja template for Netdata claim.conf generation. |
| src/IaC/ansible/roles/netdata/tasks/stream.yml | Streaming configuration via stream.conf ini updates (child + parent patterns). |
| src/IaC/ansible/roles/netdata/tasks/stream-parent-options.yml | Helper task include for per-parent-key stream options. |
| src/IaC/ansible/roles/netdata/tasks/service.yml | Ensures netdata service is enabled and started. |
| src/IaC/ansible/roles/netdata/tasks/profiles.yml | Resolves profile definitions into an effective managed-file list. |
| src/IaC/ansible/roles/netdata/tasks/profile-files.yml | Collects managed files for each selected profile. |
| src/IaC/ansible/roles/netdata/tasks/managed-files.yml | Copies/templates managed files, detects destination collisions, and triggers restart. |
| src/IaC/ansible/roles/netdata/tasks/main.yml | Orchestrates role execution order (install → config-dir → profiles → files → config → stream → service → claim). |
| src/IaC/ansible/roles/netdata/tasks/install.yml | Installs Netdata via downloaded kickstart script with configurable args. |
| src/IaC/ansible/roles/netdata/tasks/configure.yml | Optional ini-style tweaks for netdata.conf. |
| src/IaC/ansible/roles/netdata/tasks/config-dir.yml | Auto-detects/configures Netdata config directory and ensures it exists. |
| src/IaC/ansible/roles/netdata/tasks/claim.yml | Cloud claiming workflow (template claim.conf + netdatacli reload + marker waits). |
| src/IaC/ansible/roles/netdata/handlers/main.yml | Defines Netdata restart handler. |
| src/IaC/ansible/roles/netdata/defaults/main.yml | Provides default role variables (claiming, streaming, managed files, etc.). |
| src/IaC/ansible/playbooks/netdata.yml | Playbook entrypoint applying the netdata role to hosts. |
| src/IaC/ansible/inventories/example/inventory.yml | Example inventory demonstrating profile assignment per host. |
| src/IaC/ansible/inventories/example/group_vars/netdata_standalone.yml | Deprecated placeholder group vars for older layout. |
| src/IaC/ansible/inventories/example/group_vars/netdata_parent.yml | Deprecated placeholder group vars for older layout. |
| src/IaC/ansible/inventories/example/group_vars/netdata_child_minimal.yml | Deprecated placeholder group vars for older layout. |
| src/IaC/ansible/inventories/example/group_vars/netdata_child.yml | Deprecated placeholder group vars for older layout. |
| src/IaC/ansible/inventories/example/group_vars/all.yml | Example end-user configuration (claiming + managed files + profile definitions). |
| src/IaC/ansible/inventories/example/files/profiles/parent/stream.conf | Example parent streaming config file (profile-managed). |
| src/IaC/ansible/inventories/example/files/profiles/parent/netdata.conf | Example parent Netdata config file (profile-managed). |
| src/IaC/ansible/inventories/example/files/profiles/child_minimal/stream.conf | Example minimal child streaming config file (profile-managed). |
| src/IaC/ansible/inventories/example/files/profiles/child_minimal/netdata.conf | Example minimal child Netdata config file (profile-managed). |
| src/IaC/ansible/inventories/example/files/profiles/child/stream.conf | Example child streaming config file (profile-managed). |
| src/IaC/ansible/inventories/example/files/global/health.d/custom.conf | Example global health override file (managed file example). |
| src/IaC/ansible/e2e/run.sh | E2E automation to provision targets (libvirt + Docker) and validate claiming/streaming. |
| src/IaC/ansible/e2e/README.md | E2E usage/prerequisites documentation. |
| src/IaC/ansible/README.md | End-user guide for deploying Netdata with Ansible (profiles, claiming, streaming). |
| src/IaC/ansible/AGENTS.md | Technical reference for the Ansible implementation details. |
| src/IaC/README.md | Top-level IaC overview and concepts. |
| src/IaC/AGENTS.md | Cross-tool IaC guidelines and conventions. |
| .gitignore | Ignores IaC claim env file and E2E artifacts (plus additional generated artifacts). |
    run() {
      printf >&2 "${GRAY}$(pwd) >${NC} ${YELLOW}"
      printf >&2 "%q " "$@"
      printf >&2 "${NC}\n"
      "$@"
      local exit_code=$?
      if [[ ${exit_code} -ne 0 ]]; then
        echo -e >&2 "${RED}━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━${NC}"
        echo -e >&2 "${RED}[ERROR]${NC} Command failed with exit code ${exit_code}: ${YELLOW}$1${NC}"
        echo -e >&2 "${RED}        Full command:${NC} $*"
        echo -e >&2 "${RED}        Working dir:${NC} $(pwd)"
        echo -e >&2 "${RED}━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━${NC}"
        return $exit_code
      fi
    }
run() attempts to capture the failing command’s exit code and print a custom error, but with set -e enabled the script will exit immediately on a failing "$@" before reaching the error-handling block. Update run() to prevent set -e from aborting (e.g., use if ! "$@"; then ... fi or temporarily set +e around the command) so errors are reported as intended.
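A minimal sketch of the fix described here, assuming the script runs under `set -e`. Color variables are dropped for brevity, so this is not the PR's exact function. It uses `"$@" || exit_code=$?` and not `if ! "$@"`, because after a negated command `$?` holds the negated status (0) and the original exit code would be lost.

```shell
#!/usr/bin/env bash
set -euo pipefail

run() {
    # Trace the command to stderr before running it.
    printf >&2 '%s > ' "$(pwd)"
    printf >&2 '%q ' "$@"
    printf >&2 '\n'

    local exit_code=0
    # A command on the left side of || is exempt from set -e, so a failure
    # is recorded here and control still reaches the error block below.
    "$@" || exit_code=$?

    if [[ ${exit_code} -ne 0 ]]; then
        echo >&2 "[ERROR] Command failed with exit code ${exit_code}: $1"
        echo >&2 "        Full command: $*"
        echo >&2 "        Working dir:  $(pwd)"
        return "${exit_code}"
    fi
}
```

A bare `run some-command` still aborts the script on failure, since the non-zero `return` triggers `set -e` in the caller, but the error block is printed first.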
      file:
        path: "{{ netdata_lib_dir }}/cloud.d/claimed_id"
        state: absent
      when: netdata_reclaim | bool
netdata_reclaim currently removes {{ netdata_lib_dir }}/cloud.d/claimed_id even when claiming is disabled. This contradicts the intended behavior of preserving Cloud state when netdata_claim_enabled: false. Gate the reclaim deletion behind netdata_claim_enabled | bool as well (or otherwise ensure disabling claim never removes existing claim markers).
Suggested change:

    -  when: netdata_reclaim | bool
    +  when:
    +    - netdata_reclaim | bool
    +    - netdata_claim_enabled | bool
          - not netdata_claim_enabled or (netdata_claim_token | length > 0)
        fail_msg: "netdata_claim_token is required when netdata_claim_enabled=true"
Claim validation only asserts netdata_claim_token when claiming is enabled, but the docs/examples treat netdata_claim_rooms as required too. Either update the validation to require rooms when claiming is enabled, or update the docs to clearly mark rooms as optional (so users don’t get a harder-to-debug claim failure later).
Suggested change:

    -      - not netdata_claim_enabled or (netdata_claim_token | length > 0)
    -    fail_msg: "netdata_claim_token is required when netdata_claim_enabled=true"
    +      - not netdata_claim_enabled or ((netdata_claim_token | length > 0) and (netdata_claim_rooms | length > 0))
    +    fail_msg: "netdata_claim_token and netdata_claim_rooms are required when netdata_claim_enabled=true"
      - Managed file changes
      - Configuration changes via `ini_file`
      - Service state changes

    - `Reload Netdata health` - Triggered by:
      - Health config file changes (type: health)
This doc section references a Reload Netdata health handler and describes when it triggers, but roles/netdata/handlers/main.yml only defines Restart Netdata. Update the documentation to match the actual handlers, or add the missing handler + wiring if health reload is intended.
Suggested change:

    -  - Managed file changes
    -  - Configuration changes via `ini_file`
    -  - Service state changes
    -- `Reload Netdata health` - Triggered by:
    -  - Health config file changes (type: health)
    +  - Managed file changes (including health config files of type `health`)
    +  - Configuration changes via `ini_file`
    +  - Service state changes
Summary
What's Included
Role (`src/IaC/ansible/roles/netdata/`)
Documentation
- `src/IaC/README.md` - End-user overview of IaC provisioning
- `src/IaC/AGENTS.md` - Internal guidelines for all provisioning systems
- `src/IaC/ansible/README.md` - User guide for Ansible deployment
- `src/IaC/ansible/AGENTS.md` - Technical reference for the implementation
Example Inventory (`src/IaC/ansible/inventories/example/`)
E2E Testing (`src/IaC/ansible/e2e/`)
Test plan
Summary by cubic
Adds an Ansible role to install and configure Netdata with profiles (standalone, parent, child, child_minimal), Cloud claiming, and streaming. Includes example inventory, documentation, and an E2E test harness for Ubuntu 22.04, Debian 12, and Rocky 9.
New Features
Migration
Written for commit 1adc8f1. Summary will update on new commits.