[Fleet Server] Fleet Server Observability

Let's add a dashboard for Fleet Server operators to help them scale the server when needed and troubleshoot issues.

It should show:
- [ ] System metrics such as CPU and memory usage by host over time. Should be accurate for VMs and containers. Helps to identify infrastructure limits. This should work for agents with system monitoring enabled and for hosted agents with only internal monitoring.
- [ ] Fleet Server process metrics such as CPU and memory usage by host over time. Should be accurate for VMs and containers. Shows how much capacity Fleet Server uses compared to other processes.
- [ ] Status by host over time. Provides a history of when Fleet Server was offline, updating, unhealthy, etc.
- [ ] Log stream component showing errors. Useful for troubleshooting.
- [ ] Add a note with a link to the Stack Monitoring app, where users can monitor APM Server and standalone Filebeat/Metricbeat (FB/MB), which run in the same container.
- [ ] Filter on hostname. Lets operators isolate metrics from particular Fleet Server hosts.
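
As a starting point for discussion, here is a minimal sketch of the kind of Elasticsearch query the process-metrics panels and the hostname filter could be built on. The field names (`host.name`, `system.process.cpu.total.norm.pct`, `system.process.memory.rss.bytes`) and the helper `build_usage_query` are assumptions for illustration only; the fields Fleet Server monitoring actually emits need to be confirmed, and the real dashboard would be built in Kibana rather than in code.

```python
from typing import Any, Dict, List, Optional

# Assumed field names; verify them against the actual monitoring indices.
HOST_FIELD = "host.name"
CPU_FIELD = "system.process.cpu.total.norm.pct"
MEMORY_FIELD = "system.process.memory.rss.bytes"


def build_usage_query(
    hostnames: Optional[List[str]] = None,
    time_range: str = "now-24h",
    interval: str = "5m",
) -> Dict[str, Any]:
    """Build a query body for average CPU and memory usage by host over time.

    hostnames: optional list of hosts to isolate (the "filter on hostname" item).
    """
    # Always restrict to the requested time window.
    filters: List[Dict[str, Any]] = [
        {"range": {"@timestamp": {"gte": time_range}}}
    ]
    # Only add the hostname filter when the operator selected specific hosts.
    if hostnames:
        filters.append({"terms": {HOST_FIELD: hostnames}})

    return {
        "size": 0,  # only aggregations are needed, not raw documents
        "query": {"bool": {"filter": filters}},
        "aggs": {
            "by_host": {
                "terms": {"field": HOST_FIELD},
                "aggs": {
                    "over_time": {
                        "date_histogram": {
                            "field": "@timestamp",
                            "fixed_interval": interval,
                        },
                        "aggs": {
                            "cpu": {"avg": {"field": CPU_FIELD}},
                            "memory": {"avg": {"field": MEMORY_FIELD}},
                        },
                    }
                },
            }
        },
    }
```

Calling `build_usage_query(["fleet-server-1"])` (a made-up hostname) yields one time series per selected host, which matches the "by host over time" wording of the items above.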

Stretch:
- [ ] Number of active connections by host over time. Lets operators see resource usage as a function of capacity.
- [ ] Number of rejected connections by host over time. Lets operators see when limits are reached and the impact on clients.

Related issues:
- Logs, metrics and status info for Fleet Server https://github.com/elastic/beats/issues/24415
- Elastic Agent dashboard https://github.com/elastic/beats/issues/23948
- Enabling metricbeat on ESS/ECE https://github.com/elastic/cloud/issues/74800

Open questions:
1. Should we have a separate dashboard for Fleet Server or combine it with the Elastic Agent dashboard? 
   - They should have separate dashboards. They have different metrics to visualize; for example, only Fleet Server has a connection count. System metrics are particularly useful for Fleet Server because the goal is to maximize utilization and observe when it's necessary to scale the infrastructure. The Elastic Agent running on an endpoint should have low utilization, so it will be easier to visualize these use cases separately. Also, it's a standard pattern to include dashboards for each integration, so the dashboard will be more discoverable as part of the Fleet Server integration.
2. Confirm system metrics are enabled on cloud
   - We are not planning to enable them on cloud, but we should reconsider that. I created this issue to discuss it: https://github.com/elastic/kibana/issues/96248
3. How can we filter the Fleet Server hosts from the other hosts in the dashboards?

[Fleet Server] Fleet Server Observability #812

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

[Fleet Server] Fleet Server Observability #812

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions