Replies: 24 comments 80 replies
-
That's a lot to ingest, but these are indeed impressive results, thanks for sharing! Did your changes also improve the performance of managing large numbers of clients (> 10000)?
-
This is a really interesting optimisation! Quite some time ago, we also pinpointed another issue with Keycloak, when it has a high number of Identity Providers within a realm.
-
@davoustp You have been busy :) Indeed, it is very detailed work and we should definitely benefit from it. FYI, some time ago we did a similar analysis. At that time, the goal was to reach 5k realms and evaluate the effort needed to achieve that. In a nutshell, we ended up with a few PRs merged, but the main one wasn't. Please see #7712 and check the description for a bit more detail about the changes. I think the PR above contains some of the suggestions/changes you are making, and that is great. I would love to resume this discussion and the related changes, because at that time we gave up on the "multi-realm" story to focus on other things. The main argument was the new store and how it is going to improve this area, including changes to how we should be doing SaaS. If we can continue this effort in parallel with our current priorities, that would be awesome.
-
@pedroigor @ahus1 What I did:
Output from the task with the branch containing the proposed improvements:
The creation of 3,000 realms took 426 seconds in my environment. I suppose that this provider is more a provisioning facility than a benchmarking-focused endpoint. Would you suggest some more moves using
-
@davoustp I couldn't find your PR related to these changes. What's the final decision? Is your PR merged, or are we not merging and instead waiting for the new store implementation in Keycloak 18 to fix the multiple-realms issue? I am evaluating the multiple-realms vs. single-realm + multiple-groups approach, as we would need to create around 20k realms. We are currently in the process of migrating to Keycloak 17, so any input from @pedroigor and you would help us make an informed decision.
-
@mposolda @pedroigor @hmlnarik please review this and provide some input on how we can move this forward.
-
This is an awesome effort and a great investigation! Your work on this is appreciated. I have a few points below (some probably already mentioned and discussed before):
Regarding this point, I am not saying that we should not optimize at the store layer as well (optimizing store performance is also beneficial). I am just thinking that if we optimize the "logic" layer, we will automatically optimize Keycloak for both the old and the new store. Conclusion: I am personally not sure how to move forward. Hopefully we can soon have some feedback from the new store team on whether the changes in the model SPI (e.g. especially the new methods for retrieving the list of IDs for particular objects and retrieving the list of objects by a list of IDs, etc.) make sense for the new store. In the meantime, you can try to send an initial PR against the Keycloak main branch with a smaller/isolated subset of your optimizations, and we can then discuss in a more focused way just that particular/smaller change. In parallel, we can rethink how to optimize the
-
@ahus1 @pedroigor
-
@davoustp - thank you for the PR. I'll have a look; please feel free to mention me directly on PRs.
-
The performance analysis of the PRs showed that the slowdown stems from the one client per realm that is added to the master realm, which slows down the evaluation of roles and composite roles. I'd like to analyze whether those roles could be removed, or only kept temporarily to set up an admin per realm. Please join the discussion in #12332 to dive deeper into that topic. Thanks!
-
@davoustp Hi, is there any guide we can follow to get your forked branch up and running locally, or maybe to build it through a Docker build?
-
Any chance this improvement becomes production-ready soon?
-
@davoustp thanks for the detailed investigation and all the work. As @mposolda pointed out here, the next step is really to start breaking this into smaller issues and corresponding PRs. That will help further discussions and reviews, and we can start getting some of the improvements merged bit by bit. It may be an idea to create an epic in GH issues that links to the individual issues, so we can track everything that needs to be solved for a full solution to this problem. Seeing as it's been a while since there have been any updates from you, I wanted to ask if you're still looking into this? If you are blocked waiting for us, I'll do what I can here to unblock you so you can move ahead with this.
-
@davoustp @stianst
-
Has anyone tried the new store, to see how much of this is addressed there? I will give it a try again when I find time, but I am eagerly looking for ways to scale Keycloak natively and for zero-downtime upgrades. Presently, I scale by running multiple Keycloak clusters and putting a front proxy (Envoy) in front of all of them, so that to the external world it all looks like one single Keycloak instance :)
-
I'm using this Keycloak discussion about scalability issues to point out that there are also issues generating the user's claims in /userinfo when there are thousands of roles in the realm, especially composite roles, even with just one realm. More info in this ticket.
-
Thinking out loud about many-realms scalability in the admin console. There seem to be two primary issues:
There are a few other things that may be issues when scaling to large numbers of realms, but fixing those two should be a good start.
-
Also, in the context of a large number of realms, it's not just about how many realms we are able to support; concurrency is also a big concern. I am not able to create more than 2 or 3 realms at a time when I have more than 100 realms. We also get into super ugly deadlock problems if realm creation and deletion are in progress at the same time.
-
We are considering adding an organization concept within realms. An organization will be something you can associate users and clients with; it will probably have an organization-level admin, the ability to link IdPs to it, etc. I am wondering if that would solve most of the needs for having larger amounts of realms, and whether we may want to focus on such an org concept more than on supporting 1000s of realms?
-
It makes sense to have support for both.
From my experience, use case 1) is more common than 2) and should be prioritized if possible.
-
We also face this issue on our platform; do you have any plan to solve/improve it soon?
-
For what it's worth, I've got a PoC for a 'sharded' Keycloak which provides a way to scale Keycloak horizontally beyond 200 realms. The PoC has a single KC & DB for every shard, but you could easily use a KC cluster and a replicated DB for every shard: https://github.com/lordvlad/keycloak-host-and-realm-chooser. Not a full solution, but it might give some people an idea to build upon. I'm not very proud of messing with the redirect_uri; maybe there's a better option.
-
Do the performance improvements here get priority now that the map store is discontinued? I think the low scalability of realms is the toughest challenge with Keycloak at the moment.
-
Hi, while investigating a solution for the Keycloak scalability issue we found this discussion and the great work that @davoustp has done on this investigation. Sadly, it was done on an older version of Keycloak and the fix is not easily applied to the latest Keycloak version. While looking into the solution that @davoustp proposes, we found out that if we increase the cache settings beyond the defaults, like it was done in the original implementation, namely these environment variables:
Keycloak becomes usable again: we can navigate through the admin console without huge delays, even with more than 1k realms in it. What we would like to know is whether we can expect issues to happen because of that huge increase of the cache, other than the increased memory usage by Keycloak. Thank you in advance.
-
Update 2025-10-05: With Keycloak 26.4, it should be fine to run Keycloak with 1k+ realms as long as you keep increasing the realm cache. See #11074 (reply in thread) for details.
Hi,
As described in KEYCLOAK-4593, Keycloak struggles to scale beyond 100-200 realms. This proves to be a road-block to embracing Keycloak as the main component of a large-scale multi-tenanted solution.
Regardless of the efforts on the under-development Map-based storage subsystem (which should fix some of the underlying issues at least), I performed a quite intensive investigation session around this scalability problem, which is described here with findings and proposals.
Reproducing the issues
Running Keycloak
Keycloak is run using Docker and a Postgres engine:
Heapdump on OOM is enabled as well as JMX to monitor the instance.
Increasing access token lifespan
The same access token is used across the entire session. As a result, you'll need to extend the lifespan of access tokens (including implicit flow) in the `master` realm to a value larger than default (Settings / Token), for example 1 day. You also need to set the same lifespan onto the `admin-cli` client (Client / admin-cli / Settings / Advanced Settings).
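As a side note, the same settings can also be applied programmatically. Below is a minimal, hedged sketch using the `keycloak-admin-client` library rather than the console; the server URL and credentials are placeholders, and the `access.token.lifespan` client attribute key is an assumption about how the "Advanced Settings" override is stored, so double-check against your version.

```java
// Hedged sketch, not part of the original setup: raise the access-token lifespans
// via the keycloak-admin-client library.
import java.util.HashMap;
import java.util.Map;

import org.keycloak.admin.client.Keycloak;
import org.keycloak.admin.client.KeycloakBuilder;
import org.keycloak.representations.idm.ClientRepresentation;
import org.keycloak.representations.idm.RealmRepresentation;

public class ExtendTokenLifespan {

    public static void main(String[] args) {
        Keycloak keycloak = KeycloakBuilder.builder()
                .serverUrl("http://localhost:8080")   // adjust to your instance
                .realm("master")
                .clientId("admin-cli")
                .username("admin")
                .password("admin")
                .build();

        // Realm-level lifespans (Settings / Token): one day, expressed in seconds.
        RealmRepresentation master = keycloak.realm("master").toRepresentation();
        master.setAccessTokenLifespan(86400);
        master.setAccessTokenLifespanForImplicitFlow(86400);
        keycloak.realm("master").update(master);

        // Client-level override for admin-cli (Advanced Settings); the attribute key
        // "access.token.lifespan" is an assumption about how this override is stored.
        ClientRepresentation adminCli = keycloak.realm("master")
                .clients().findByClientId("admin-cli").get(0);
        Map<String, String> attributes =
                adminCli.getAttributes() != null ? adminCli.getAttributes() : new HashMap<>();
        attributes.put("access.token.lifespan", "86400");
        adminCli.setAttributes(attributes);
        keycloak.realm("master").clients().get(adminCli.getId()).update(adminCli);
    }
}
```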
Measuring tenant creation duration
Creating the realms is performed using shell scripting (useful to run against any existing Keycloak, even though this could be done using a Java-based project like the ones existing in the codebase):
keycloak-create-realms.zip
This script just needs `curl`, `jq` (from https://stedolan.github.io/jq/) and GNU `gdate`. Running it is straightforward:
It will authenticate as the root admin (change the credentials if they differ from the above `KEYCLOAK_ADMIN` and `KEYCLOAK_ADMIN_PASSWORD` values) and attempt to create 1000 realms. It outputs the time spent on each operation for each realm ordinal.
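If you prefer to stay in Java rather than shell, a rough equivalent of the script could look like the hedged sketch below (the realm naming scheme, URL and credentials are illustrative and not taken from the actual script):

```java
// Hedged Java alternative to the shell script above (not the actual script):
// authenticate as the root admin, create realms in a loop and print per-realm timings.
import org.keycloak.admin.client.Keycloak;
import org.keycloak.admin.client.KeycloakBuilder;
import org.keycloak.representations.idm.RealmRepresentation;

public class CreateRealms {

    public static void main(String[] args) {
        Keycloak keycloak = KeycloakBuilder.builder()
                .serverUrl("http://localhost:8080")   // adjust to your instance
                .realm("master")
                .clientId("admin-cli")
                .username("admin")                    // KEYCLOAK_ADMIN
                .password("admin")                    // KEYCLOAK_ADMIN_PASSWORD
                .build();

        for (int i = 1; i <= 1000; i++) {
            RealmRepresentation realm = new RealmRepresentation();
            realm.setRealm("tenant-" + i);            // illustrative naming scheme
            realm.setEnabled(true);

            long start = System.nanoTime();
            keycloak.realms().create(realm);
            long millis = (System.nanoTime() - start) / 1_000_000;
            System.out.println("realm #" + i + " created in " + millis + " ms");
        }
    }
}
```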
Measuring Admin console load time
With Keycloak cold started, the amount of time required to log in as the root admin to manage realms is measured manually for various numbers of tenants in Keycloak, using the browser developer console (the `Finish` duration measure in Chrome), from the moment the login form is submitted to the moment the console page is fully loaded.
Measures with version 17.0.0
All measures below were performed with stock version 17.0.0.
Note that the same measures have also been performed against the latest codebase (commit `fa87d462108899d38f22242a8abdb36e08cd1af0`), with the same results.
Realm creation time with version 17.0.0
Grabbing the output from the script above allows graphing the time spent creating a new realm against the total number of realms already created:

The tenant creation duration diverges exponentially as the number of tenants grows.
You can also note that the measurement was stopped at ~620 tenants, because it was taking ages at that point.
Admin console load time with version 17.0.0
Here is the outcome for stock Keycloak 17.0.0:

Here again, the load time is increasing exponentially with the number of realms.
Findings and proposals
These are the main findings; I may have skipped some minor ones, but they all fall along the same lines.
Adding/removing child roles to/from a parent role triggers a full JPA collection load
A very well-known issue with JPA/Hibernate: adding a role to or removing it from the list of child roles actually triggers loading of the whole OneToMany relationship before inserting the proper record into the `COMPOSITE_ROLE` join table.
Proposed optimizations:
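To make this concrete, here is a minimal, hedged sketch of one possible shape of such an optimization (not necessarily the exact change in the investigation branch), assuming the stock `COMPOSITE_ROLE(COMPOSITE, CHILD_ROLE)` join-table layout: manipulate the join table directly so that Hibernate never has to initialize the parent's full collection of children.

```java
// Hedged illustration only: add/remove a composite-role link with targeted statements
// on the COMPOSITE_ROLE join table (column names assumed from the stock schema),
// so the parent's OneToMany collection is never initialized.
import javax.persistence.EntityManager;

public class CompositeRoleLink {

    private final EntityManager em;

    public CompositeRoleLink(EntityManager em) {
        this.em = em;
    }

    /** Adds childRoleId as a composite of parentRoleId without loading the collection. */
    public void addChild(String parentRoleId, String childRoleId) {
        em.createNativeQuery(
                "insert into COMPOSITE_ROLE (COMPOSITE, CHILD_ROLE) values (?1, ?2)")
          .setParameter(1, parentRoleId)
          .setParameter(2, childRoleId)
          .executeUpdate();
    }

    /** Removes the link with a targeted delete instead of a collection removal. */
    public void removeChild(String parentRoleId, String childRoleId) {
        em.createNativeQuery(
                "delete from COMPOSITE_ROLE where COMPOSITE = ?1 and CHILD_ROLE = ?2")
          .setParameter(1, parentRoleId)
          .setParameter(2, childRoleId)
          .executeUpdate();
    }
}
```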
Role composition suffers from N+1 load issue
Because of the composite nature of roles, the set of roles is actually a directed graph, implemented within the database schema using the `COMPOSITE_ROLE` join table. The child roles are loaded from their parent using the standard JPA/Hibernate OneToMany mapping, which goes through the join table.
As a result, when attempting to expand all roles with their direct or indirect child roles (implemented within `RoleUtils` as a recursive traversal of roles in the current codebase), this generates a huge number of queries, especially visible on the `${role_admin}` role (the root role of many other roles). Computing this set of roles (which is the transitive closure of the initial role set, in graph/set-theory wording) requires a different approach in order to scale.
Proposed optimizations:
Using a transitive closure table would probably be the ideal choice, but this would require quite significant changes and book-keeping, so I ruled it out.
Instead, the proposal is to introduce bulk loading for both roles and role composition.
The role expansion algorithm can be depicted as follows (a code sketch is also given below):
As a result, for a graph of `n` roles composed to a depth of `d`, the expansion now requires only `d` (simple) lookups instead of `n` (complex, joined) lookups. As a bonus, the role composition structures collected during the first phase allow avoiding another N+1 issue in the caching layer (which eagerly caches role composites).
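For illustration, here is a minimal sketch of this level-by-level expansion; `RoleBulkLoader` is a hypothetical helper (one `where COMPOSITE in :ids`-style query per call), not an existing Keycloak interface.

```java
// Hedged sketch of the level-by-level ("breadth-first") expansion described above;
// RoleBulkLoader is an assumed helper, not an existing Keycloak API.
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class CompositeRoleExpansion {

    /** Maps a set of parent role ids to the ids of their direct children, in one bulk query. */
    public interface RoleBulkLoader {
        Map<String, Set<String>> loadDirectChildren(Set<String> parentRoleIds);
    }

    /** Returns the transitive closure of the given role ids. */
    public static Set<String> expand(Set<String> initialRoleIds, RoleBulkLoader loader) {
        Set<String> expanded = new HashSet<>(initialRoleIds);
        Set<String> frontier = new HashSet<>(initialRoleIds);

        // One bulk lookup per composition level: d lookups for a graph of depth d.
        while (!frontier.isEmpty()) {
            Map<String, Set<String>> children = loader.loadDirectChildren(frontier);
            Set<String> next = new HashSet<>();
            for (Set<String> childIds : children.values()) {
                for (String childId : childIds) {
                    if (expanded.add(childId)) {   // skip roles already visited (cycles, diamonds)
                        next.add(childId);
                    }
                }
            }
            frontier = next;
        }
        return expanded;
    }
}
```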
There are quite a few things around this one:
- Bulk-load roles by id using a `select ... from ... where id in :ids` JPA query (as an alternative, the Hibernate-specific `loadByMultipleIds` can be used instead); this requires enabling Hibernate's `in_clause_padding` to avoid cluttering statement caches
- Split the `select ... from ... where id in :ids` JPA query into chunks, as Oracle and other engines do have a limit on the number of `in` elements (1000 for Oracle)
Bulk-loading large collections by ids to optimize caching layer hit ratio
This is especially true for retrieving the list of realms: the current implementation cannot do anything but hit the storage layer to retrieve all the realms (and cache the resulting realms in the process).
Proposed optimizations:
Perform this as a two-step lookup: first collect all entity ids, then bulk load them (which gives the opportunity to leverage the caching layer).
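A hedged sketch of what such a two-step lookup could look like at the JPA level follows; entity and property names are assumed from the stock `RealmEntity` mapping, the cache interaction (only loading ids that are not cached yet) is omitted for brevity, and the chunking also illustrates the IN-clause limit mentioned earlier.

```java
// Hedged sketch (not the actual Keycloak code): two-step realm lookup with chunked
// "in" queries.
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;

import javax.persistence.EntityManager;

import org.keycloak.models.jpa.entities.RealmEntity;

public class BulkRealmLookup {

    private static final int CHUNK_SIZE = 500;   // stays below Oracle's 1000-element IN limit

    private final EntityManager em;

    public BulkRealmLookup(EntityManager em) {
        this.em = em;
    }

    /** Step 1: collect only the realm ids (a cheap projection query). */
    public List<String> findAllRealmIds() {
        return em.createQuery("select r.id from RealmEntity r", String.class).getResultList();
    }

    /** Step 2: bulk-load the realm entities by id, chunk by chunk. */
    public List<RealmEntity> loadRealmsByIds(Collection<String> ids) {
        List<String> idList = new ArrayList<>(ids);
        List<RealmEntity> result = new ArrayList<>(idList.size());
        for (int from = 0; from < idList.size(); from += CHUNK_SIZE) {
            List<String> chunk = idList.subList(from, Math.min(from + CHUNK_SIZE, idList.size()));
            result.addAll(em.createQuery(
                        "select r from RealmEntity r where r.id in :ids", RealmEntity.class)
                    .setParameter("ids", chunk)
                    .getResultList());
        }
        return result;
    }
}
```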
Accessing entities just for grabbing their ids
This happens quite frequently: role mappings, clients, default role, client scopes...
This means loading a lot of entities while only their ids are required.
Proposed optimizations:
Extend the model to get access to these readily available ids when appropriate.
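As an illustration of the kind of model extension meant here, a hedged sketch of an id-only lookup; the `realmId` property on `ClientEntity` is assumed from the stock mapping, and `getClientIdsStream` is a hypothetical method name, not an existing model-SPI method.

```java
// Hedged sketch: fetch only the client ids of a realm instead of materializing full
// ClientEntity instances; a scalar projection does not populate the persistence context.
import java.util.stream.Stream;

import javax.persistence.EntityManager;

public class ClientIdLookup {

    private final EntityManager em;

    public ClientIdLookup(EntityManager em) {
        this.em = em;
    }

    /** Returns only the ids of the realm's clients. */
    public Stream<String> getClientIdsStream(String realmId) {
        return em.createQuery(
                    "select c.id from ClientEntity c where c.realmId = :realmId", String.class)
                .setParameter("realmId", realmId)
                .getResultList()
                .stream();
    }
}
```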
Loading large JPA datasets clutters the persistence context and degrades flush-time efficiency
When loading a large set of entities through the JPA layer, these entities are kept in the persistence context (first-level cache, in Hibernate wording): most of the time, they are converted to application-level model instances without retaining the underlying JPA entity, which nevertheless stays in the persistence context.
As a result, any time a `flush` is required (explicit, JPQL, HQL, native query), the Hibernate layer needs to scan the persistence context and perform a dirty check on each appropriate object, in order to actually execute the proper persistence operation(s) if need be. The thing is that this dirty-checking mechanism is expensive, and its cost grows quadratically with the number of items to scan. See https://stackoverflow.com/a/18948517 for a pretty decent explanation.
Proposed optimizations:
This one is a little bit tricky, because it involves detaching JPA-loaded entities: detaching an entity which is otherwise referred to by another part of the code leads to failing inserts or detached-entity exceptions.
To do this in a safe-enough manner, it first requires analyzing the object model and cascading the `DETACH` operation to relationships down the graph. In other words, careful code analysis is required.
Second, detaching an entity during an operation that loads it implies that it was not already loaded previously (otherwise the graph of objects may have been modified prior to the operation, and a `detach` would actually lose all those modifications). This can be checked by looking into the Hibernate persistence context, through the Hibernate SPI, to see whether the identifier of an object about to be loaded already exists there.
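A hedged sketch of that persistence-context check against the Hibernate 5.x SPI (the version bundled with Keycloak 17/18) is shown below; the exact SPI calls may differ between Hibernate versions, so treat it as an outline rather than a drop-in helper.

```java
// Hedged sketch against the Hibernate 5.x SPI: check whether the persistence context
// already manages an entity for a given id, so a freshly loaded instance is only
// detached when it was NOT managed beforehand.
import javax.persistence.EntityManager;

import org.hibernate.engine.spi.EntityKey;
import org.hibernate.engine.spi.SessionImplementor;
import org.hibernate.persister.entity.EntityPersister;

public final class SafeDetach {

    private SafeDetach() {
    }

    /** Returns true if an entity of the given type and id is already in the persistence context. */
    public static boolean isAlreadyManaged(EntityManager em, Class<?> entityClass, String id) {
        SessionImplementor session = em.unwrap(SessionImplementor.class);
        EntityPersister persister = session.getFactory().getMetamodel().entityPersister(entityClass);
        return session.getPersistenceContext().containsEntity(new EntityKey(id, persister));
    }

    /** Detaches a freshly loaded entity only when it was not already managed before the load. */
    public static void detachIfFreshlyLoaded(EntityManager em, Object entity, boolean wasManagedBefore) {
        if (!wasManagedBefore) {
            em.detach(entity);
        }
    }
}
```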
Investigation outcome
Performing these optimisations as a measure-optimize-repeat process led to the results shown in this section.
Note that the same vertical scales are used, to allow a proper comparison with the initial performance baseline.
Realm creation time with optimisations (forked from main 18.x branch)
The realm creation time with the optimisation-enabled branch is depicted below:

This now exhibits a linear, almost constant creation time, within the sub-second range.
Going up to 3k created realms:

The low-slope linear profile is confirmed as the number of realms grows.
Admin console load time with optimisations (forked from main 18.x branch)
The admin console load time using the optimisation-enabled branch is shown below:

Again, this now shows a linear admin console load time (20 seconds to load with 1000 realms).
Going up to 3k created realms:

Linearity still looks ok, with a load time of 50 seconds for 3000 realms.
Code changes
All the code changes have been done in the fork https://github.com/davoustp/keycloak/tree/KEYCLOAK-4593-investigations
Please note that this is an investigation branch, so I did NOT collapse the various commits; I find it valuable to keep them so that the community can have a look into it.
I'd say that I hate such large code changes, which I usually break apart into smaller, self-sufficient units, but doing so while performing these optimisation loops (and discovering the codebase internals as well) was too high a challenge.
Nevertheless, all base integration tests pass, so it looks like I did not break too many things along the way. ;-)
I did not attempt to go too deep into the map storage layer, since this is a hot topic right now (not sure how to run the test suite against a map storage provider, by the way), so a deeper look is probably needed.
Please let me know if this is of interest to you guys, and let's discuss it to see if and how it can be contributed to Keycloak. :-)