Open
Labels
Integration:fleet_server, Stalled, Team:Elastic-Agent-Control-Plane, Team:Fleet, enhancement, estimation:Week
Description
Let's add a dashboard for Fleet Server operators to help them scale the server when needed and troubleshoot issues.
It should show:
- System metrics such as CPU and memory usage by host over time. These should be accurate for both VMs and containers. Helps identify infrastructure limits. This should work both for agents with system monitoring enabled and for hosted agents with only internal monitoring.
- Fleet Server process metrics such as CPU and memory usage by host over time. These should be accurate for both VMs and containers. Shows how much capacity Fleet Server uses compared to other processes.
- Status by host over time. Provides a history of when Fleet Server was offline, updating, unhealthy, etc.
- A log stream component showing errors. Useful for troubleshooting.
- A note with a link to the Stack Monitoring app, where users can monitor APM Server and standalone Filebeat/Metricbeat (FB/MB), which run in the same container.
- A filter on hostname. Lets operators isolate metrics from particular Fleet Server hosts.
Stretch:
- Number of active connections by host over time. Lets operators see resource usage as a function of capacity.
- Number of rejected connections by host over time. Lets operators see when limits are reached and the impact on clients.
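As a rough illustration of the "by host over time" panels listed above, here is a minimal sketch of an Elasticsearch search body that averages one metric per host in time buckets, with an optional hostname filter. The helper name `build_metric_by_host_query` is hypothetical, and the field names (`host.name`, `@timestamp`, and the example metric `system.cpu.total.norm.pct` from the Metricbeat system module) are assumptions about how the monitoring data is indexed. The actual dashboard would most likely be built in Kibana rather than with raw queries.

```python
from typing import Any, Dict, List, Optional


def build_metric_by_host_query(
    metric_field: str,
    hostnames: Optional[List[str]] = None,
    time_range: str = "now-24h",
    interval: str = "5m",
) -> Dict[str, Any]:
    """Build a search body: average of one metric, per host, over time.

    All field names are assumptions, not confirmed by the issue.
    """
    # Restrict to the time window and to documents that carry the metric.
    filters: List[Dict[str, Any]] = [
        {"range": {"@timestamp": {"gte": time_range}}},
        {"exists": {"field": metric_field}},
    ]
    # Optional hostname filter, mirroring the "filter on hostname" requirement.
    if hostnames:
        filters.append({"terms": {"host.name": hostnames}})

    return {
        "size": 0,  # only aggregations are needed, not the raw documents
        "query": {"bool": {"filter": filters}},
        "aggs": {
            "by_host": {
                "terms": {"field": "host.name", "size": 50},
                "aggs": {
                    "over_time": {
                        "date_histogram": {
                            "field": "@timestamp",
                            "fixed_interval": interval,
                        },
                        "aggs": {"avg_metric": {"avg": {"field": metric_field}}},
                    }
                },
            }
        },
    }


# Example: CPU usage for two hypothetical Fleet Server hosts.
cpu_query = build_metric_by_host_query(
    "system.cpu.total.norm.pct", hostnames=["fleet-server-1", "fleet-server-2"]
)
```

The same shape would apply to memory or to the connection counts in the stretch goals, once the corresponding field names are known.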
Related issues:
- Logs, metrics and status info for Fleet Server: beats#24415
- Elastic Agent dashboard: https://github.com/elastic/beats/issues/23948
- Enabling Metricbeat on ESS/ECE: https://github.com/elastic/cloud/issues/74800
Open questions:
- Should we have a separate dashboard for Fleet Server or combine it with the Elastic Agent dashboard?
  - They should have separate dashboards. They have different metrics to visualize; for example, only Fleet Server has a connection count. System metrics are particularly useful for Fleet Server because the goal is to maximize utilization and observe when it's necessary to scale the infrastructure. The Elastic Agent running on an endpoint should have low utilization, so it will be easier to visualize these use cases separately. Also, it's a standard pattern to include dashboards for each integration, so the dashboard will be more discoverable as part of the Fleet Server integration.
- Confirm system metrics are enabled on cloud
  - We are not planning to enable them on cloud, but we should reconsider that. I created this issue to discuss it: [Fleet] Enable system metrics on Elastic Cloud agent policy kibana#96248
- How can we distinguish the Fleet Server hosts from the other hosts in the dashboards?