[Serve] fix bug in monitoring docs #59571
Conversation
Signed-off-by: abrar <[email protected]>
Code Review
This pull request fixes a bug in the monitoring documentation and ensures a key autoscaling metric is always up-to-date. The documentation for three autoscaling metrics has been corrected from Histogram to Gauge to reflect their actual implementation. Additionally, the ray_serve_autoscaling_target_replicas gauge is now emitted on every check, even when there are no changes to the deployment, preventing the metric from becoming stale.
I've added a couple of suggestions for future improvements:
- In `monitoring.md`, I suggested that the delay metrics might be more useful as `Histogram`s instead of `Gauge`s to provide more detailed observability into latency distributions.
- In `deployment_state.py`, I pointed out a similar place in the `autoscale` method where the target replicas gauge might not be emitted, which could be fixed for consistency.
Overall, the changes in this PR are correct and improve the accuracy of Serve's monitoring capabilities. I approve these changes.
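For context on the staleness issue being fixed, here is a minimal sketch using `ray.util.metrics` (hypothetical names, tags, and helper structure — not Serve's actual controller code) of why emitting the gauge only when the deployment info changes lets the time series go stale, and how re-emitting it on every check avoids that:

```python
from ray.util import metrics

# Hypothetical, simplified version of the controller's reconciliation check.
target_replicas_gauge = metrics.Gauge(
    "example_autoscaling_target_replicas",
    description="Target number of replicas for a deployment (example).",
    tag_keys=("deployment", "application"),
)

def deployment_info_changed(target_state, new_info) -> bool:
    tags = {"deployment": "my_deployment", "application": "my_app"}
    if new_info == target_state.info:
        # Re-emit the current target even when nothing changed; otherwise the
        # gauge is only updated on changes and its time series goes stale.
        target_replicas_gauge.set(target_state.target_num_replicas, tags=tags)
        return False
    target_replicas_gauge.set(new_info.target_num_replicas, tags=tags)
    return True
```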
| `ray_serve_autoscaling_policy_execution_time_ms` | Gauge | `deployment`, `application`, `policy_scope` | Time taken to execute the autoscaling policy in milliseconds. `policy_scope` is `deployment` or `application`. |
| `ray_serve_autoscaling_replica_metrics_delay_ms` | Gauge | `deployment`, `application`, `replica` | Time taken for replica metrics to reach the controller in milliseconds. High values may indicate controller overload. |
| `ray_serve_autoscaling_handle_metrics_delay_ms` | Gauge | `deployment`, `application`, `handle` | Time taken for handle metrics to reach the controller in milliseconds. High values may indicate controller overload. |
While this change correctly aligns the documentation with the current implementation (these metrics are indeed `Gauge`s), it's worth considering whether these metrics would be more useful as `Histogram`s. Latency and delay metrics are often more insightful as histograms, as they allow observing distributions (e.g., p50, p99 latencies) rather than just the last reported value. Other latency metrics in Serve (like `serve_replica_startup_latency_ms`) are histograms. This could be a valuable future enhancement for better autoscaling observability.
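To illustrate the distinction the reviewer is pointing at, here is a rough sketch using `ray.util.metrics` (the metric names and tags below are placeholders, not Serve's actual definitions): a gauge keeps only the last reported value per tag set, while a histogram buckets every observation so percentiles can be derived downstream.

```python
from ray.util import metrics

# Gauge: the backend only sees the most recent value per tag set.
delay_gauge = metrics.Gauge(
    "example_metrics_delay_ms",
    description="Most recently observed metrics delay in milliseconds.",
    tag_keys=("deployment",),
)
delay_gauge.set(12.5, tags={"deployment": "my_deployment"})

# Histogram: every observation is counted into buckets, so p50/p99 latencies
# can be computed from the exported data (e.g. in Prometheus).
delay_histogram = metrics.Histogram(
    "example_metrics_delay_ms_histogram",
    description="Distribution of metrics delay in milliseconds.",
    boundaries=[1, 5, 10, 50, 100, 500, 1000],
    tag_keys=("deployment",),
)
delay_histogram.observe(12.5, tags={"deployment": "my_deployment"})
```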
# Emit target replicas metric when the deployment info hasn't changed.
self.target_replicas_gauge.set(self._target_state.target_num_replicas)
return False
This is a good fix to ensure the target_replicas_gauge is emitted periodically, preventing it from becoming stale.
A similar issue might exist in the autoscale method in this file (around line 2646). It also has an early return if the replica count doesn't change, and it doesn't emit the gauge in that path. For consistency and to prevent stale metrics during autoscaling, you might consider applying a similar fix there in a follow-up.
# L2646 in python/ray/serve/_private/deployment_state.py
if decision_num_replicas == self._target_state.target_num_replicas:
    # Add this line
    self.target_replicas_gauge.set(self._target_state.target_num_replicas)
    return False

| `ray_serve_autoscaling_policy_execution_time_ms` | Histogram | `deployment`, `application`, `policy_scope` | Time taken to execute the autoscaling policy in milliseconds. `policy_scope` is `deployment` or `application`. |
| `ray_serve_autoscaling_replica_metrics_delay_ms` | Histogram | `deployment`, `application`, `replica` | Time taken for replica metrics to reach the controller in milliseconds. High values may indicate controller overload. |
| `ray_serve_autoscaling_handle_metrics_delay_ms` | Histogram | `deployment`, `application`, `handle` | Time taken for handle metrics to reach the controller in milliseconds. High values may indicate controller overload. |
| `ray_serve_autoscaling_policy_execution_time_ms` | Gauge | `deployment`, `application`, `policy_scope` | Time taken to execute the autoscaling policy in milliseconds. `policy_scope` is `deployment` or `application`. |
why gauge instead of histogram?
These metrics are emitted once per control loop iteration; they are low-volume and sequential, and we care about instantaneous values, not distributions.
I think histograms are better suited to high-volume, concurrent requests, where the questions we are asking are about the distribution.
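As a rough illustration of that usage pattern (hypothetical loop and metric names, not the controller's actual code): one sequential measurement per control-loop iteration, where only the latest value matters, so setting a gauge each iteration is sufficient.

```python
import time

from ray.util import metrics

policy_time_gauge = metrics.Gauge(
    "example_policy_execution_time_ms",
    description="Time taken by the most recent autoscaling policy run (ms).",
    tag_keys=("deployment",),
)

def run_control_loop_iteration(run_policy, deployment: str) -> None:
    # Low volume and sequential: one measurement per iteration, so the
    # instantaneous (last) value is what we want to look at.
    start = time.monotonic()
    run_policy()
    elapsed_ms = (time.monotonic() - start) * 1000
    policy_time_gauge.set(elapsed_ms, tags={"deployment": deployment})
```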
ray-project#59218: emitting target replicas on every update cycle so that we can compare it with actual replicas on a time series.
Signed-off-by: abrar <[email protected]>