Performance and Scalability #8445
Replies: 15 comments · 9 replies
-
Hi @Tommolo, thanks for opening the issue. What version of Kiali are you using?
-
I'm using Kiali v2.0.0.
-
This probably should not be a separate issue. We already have an epic on this with related issues, so unless you have a specific issue where you can pinpoint what is causing the slowdown, I would recommend participating in the already existing epic/issues for performance-related work. Please see:
-
As for this question: see the link above to the performance test doc page - that's what it is geared to answer.
-
@jmazzitelli at least it's another helpful data point.

@Tommolo for performance issues it's also helpful to include as much context as you can, to help us get an idea of where the bottleneck is. You've included how many workloads/services this is for, which is very helpful. How many namespaces are selected on the graph page? Roughly how many Istio configuration objects (DestinationRules, VirtualServices, etc.) are there? For graph generation, the bottleneck could be rendering the graph (UI), generating the graph (Kiali backend), Prometheus, or the connection between your browser and the Kiali API. We're actively working on making these issues easier to diagnose and report; see #8345. In the meantime, Kiali emits some metrics about graph generation that you can query in Prometheus.

There's not really a magic configuration knob to make performance better. You can try scaling Kiali up (more CPU/memory) or scaling up Prometheus, but without knowing exactly where the bottleneck is I can't guarantee that will help.
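One low-effort way to narrow down the UI-vs-backend question is to time the graph API call outside the browser. Below is a hypothetical sketch (not from the thread); the base URL, token, and namespace values are placeholders you would replace for your own environment.

```python
# Hypothetical sketch: time the Kiali graph API call directly, bypassing the UI,
# to separate backend/Prometheus time from browser rendering time.
# KIALI_URL, TOKEN, and the namespace value are placeholders for your environment.
import time
import requests

KIALI_URL = "https://kiali.example.com/kiali"  # assumption: your Kiali base URL
TOKEN = "..."                                  # assumption: a valid bearer token
params = {
    "namespaces": "my-namespace",              # the namespace(s) selected on the graph page
    "duration": "60s",
    "graphType": "versionedApp",
}

start = time.time()
resp = requests.get(
    f"{KIALI_URL}/api/namespaces/graph",
    params=params,
    headers={"Authorization": f"Bearer {TOKEN}"},
)
elapsed = time.time() - start
print(f"HTTP {resp.status_code}, {len(resp.content)} bytes in {elapsed:.2f}s")
```

If this call returns quickly but the page still takes a long time, the bottleneck is likely UI rendering; if the call itself is slow, look at the Kiali backend and Prometheus.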
-
I'm also having issues with this and had a couple of questions:
-
That would seem likely, though I can't give you a reason why. I don't think we have had any users up to this point with 10 clusters in the mesh (none that told us about it anyway).
No, the installation mechanism shouldn't make a difference when looking at the performance of the server.
-
I'd also suggest always using the most recent version of Kiali that is compatible with your Istio version. Also, Kiali performance is very tied to Prometheus query performance, so you may also want to look at https://kiali.io/docs/configuration/p8s-jaeger-grafana/prometheus/#prometheus-tuning.
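If you suspect Prometheus is the slow half, its own self-metrics can indicate how long its queries are taking. A small sketch, assuming Prometheus scrapes itself (most default installs do) and is reachable at the placeholder URL below:

```python
# Sketch: inspect Prometheus's own query latency to see whether Kiali's queries
# are slow because Prometheus itself is slow. The URL is a placeholder.
import requests

PROM_URL = "http://prometheus.istio-system:9090"  # assumption: in-cluster Prometheus

# prometheus_engine_query_duration_seconds is a built-in Prometheus summary metric.
resp = requests.get(
    f"{PROM_URL}/api/v1/query",
    params={"query": 'prometheus_engine_query_duration_seconds{quantile="0.9"}'},
)
for series in resp.json()["data"]["result"]:
    print(series["metric"].get("slice"), series["value"][1])
```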
-
There is no specific issue here, and nothing we can really act on. I'm going to convert this to a discussion...
-
@jmazzitelli, in our multi-cluster configuration we have Thanos connected to Kiali, with about ten clusters. When we try to access the Kiali dashboard without having selected any namespace yet, the browser freezes. At this point we wondered if it could be a problem related to the number of metrics, so we tried connecting a local cluster's Prometheus instead of Thanos, with significantly fewer metrics, and despite some slowness it worked better. We were even able to load the traffic graph of a namespace.

Given that, is there a mechanism that performs preemptive queries, and if so, is it possible to disable or configure it somehow?

P.S. We followed the instructions you sent us and reduced the number of metrics to a minimum, but even leaving only istio_requests_total, istio_tcp_received_bytes_total, and istio_tcp_sent_bytes_total, the performance remains unchanged.
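To put a number on the "too many metrics" hypothesis, you can compare how many istio_requests_total series Kiali has to work with when pointed at Thanos versus a single cluster's Prometheus. A hypothetical sketch; both URLs are placeholders for your environment:

```python
# Hypothetical sketch: compare series cardinality seen through Thanos vs. a
# single cluster's Prometheus, using the Prometheus-compatible query API.
import requests

def series_count(prom_url: str, metric: str = "istio_requests_total") -> int:
    """Count the currently active series for a metric via /api/v1/query."""
    resp = requests.get(f"{prom_url}/api/v1/query", params={"query": f"count({metric})"})
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    return int(float(result[0]["value"][1])) if result else 0

endpoints = {
    "thanos": "http://thanos-query.monitoring:9090",      # assumption: Thanos query endpoint
    "local-prom": "http://prometheus.istio-system:9090",  # assumption: one cluster's Prometheus
}
for name, url in endpoints.items():
    print(f"{name}: {series_count(url)} series")
```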
-
I'll add that with fewer connected clusters the performance was sufficient for general use.
-
We've always had Kiali metrics, but Prometheus and Kiali need to be configured to collect them (check your Prometheus data and see if you have any metrics with the `kiali_` prefix).

In 2.10, you can now set ...
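A quick way to check whether those server metrics are actually being collected is to list the metric names Prometheus knows about and filter for the Kiali prefix. A small sketch, assuming the `kiali_` prefix and a placeholder Prometheus URL:

```python
# Hypothetical sketch: list the metric names Prometheus currently knows about and
# keep those that look like Kiali server metrics. The kiali_ prefix and the URL
# are assumptions about a typical setup.
import requests

PROM_URL = "http://prometheus.istio-system:9090"  # placeholder

names = requests.get(f"{PROM_URL}/api/v1/label/__name__/values").json()["data"]
kiali_metrics = sorted(n for n in names if n.startswith("kiali_"))
print("\n".join(kiali_metrics) or "no kiali_* metrics found - check your scrape config")
```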
-
I'm the same person (@RobyBobby24). Upon analyzing the traffic, we identified that the call slowing down the dashboard is the one that retrieves the JSON containing the information required to reconstruct the traffic graph.

From issue #5743 it is clear that the call consists of a request to Kubernetes (k8s) and one to Prometheus. We tested, as explained in that issue, the URL that makes the call only to Prometheus. This raises the following questions:
-
I'm a colleague of @ro-distefano working on the same project. We tried using an interceptor that, through a regex, extracts the parameters from the HTTP request Kiali uses to generate the graph. We immediately noticed a significant improvement: although it no longer loads the Kubernetes information, the graph is still fully rendered and easy to interpret. With the new request, despite the high load (13 namespaces selected), the graph loads in about 1.12 seconds.
-
If I interpret what you are saying correctly (and I may not - I don't think I fully grok what you are doing with that interceptor), it looks like all you did was remove the bulk of the appenders from the appenders query parameter in the URL. Removing appenders is definitely expected to speed up the request (the appenders do a lot of work, and many access the k8s API). Start removing appenders one-by-one in your URL and see which one(s) cause the slowdown. That can help the Kiali devs narrow down where the performance issue is, and maybe something can be done to speed up that appender in the code.

Some appenders are turned off by de-selecting the checkbox options in the graph's Display menu. Or you could bookmark those URLs with the small appenders list and just use that bookmark to access the large graphs.
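For anyone wanting to run that experiment systematically, here is a hypothetical sketch of the "remove appenders one-by-one" loop. The base URL, token, namespaces, and appender names are all placeholders; copy the real appenders list from the graph request your browser makes (DevTools → Network), since the exact names vary by Kiali version.

```python
# Hypothetical sketch of the "remove appenders one-by-one" experiment.
# KIALI_URL, TOKEN, NAMESPACES, and the appender names are placeholders.
import time
import requests

KIALI_URL = "https://kiali.example.com/kiali"
TOKEN = "..."
NAMESPACES = "ns-a,ns-b"
APPENDERS = ["deadNode", "istio", "serviceEntry", "sidecarsCheck", "responseTime"]  # example list

def time_graph(appenders):
    """Time one graph API request with the given appenders enabled."""
    start = time.time()
    requests.get(
        f"{KIALI_URL}/api/namespaces/graph",
        params={"namespaces": NAMESPACES, "duration": "60s", "appenders": ",".join(appenders)},
        headers={"Authorization": f"Bearer {TOKEN}"},
    ).raise_for_status()
    return time.time() - start

print(f"all appenders: {time_graph(APPENDERS):.2f}s")
for a in APPENDERS:
    print(f"without {a}: {time_graph([x for x in APPENDERS if x != a]):.2f}s")
```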
-
I'm trying to use Kiali in an environment where I have a namespace with a high number of deployments and services (around 320 deployments and 200 services). I've noticed that Kiali is very slow when loading the traffic graph. Is there a way to reduce the loading time?