Hello, I am an engineer at Zalando, a fashion player in Europe with over 5 billion euros in revenue every year. I represent the Merchant Operations department, where we have been using Keycloak as our IAM solution for over 2 years. We have hundreds of users and clients, with an average of 74 requests per minute and spikes of 1 to 2 thousand requests per second.
We use Kubernetes to operate the stack, and one of the biggest challenges is the use of Infinispan and JGroups as a data grid to store the sessions of authenticated users and clients. Keycloak doesn't allow a direct connection to a remote, externalized Infinispan cluster. Instead, it has a custom implementation that synchronizes between a local Infinispan and a remote one: you have to create one cluster for the local nodes and another for the remote ones, and the two clusters interact via the HotRod protocol.
We already have a well-instrumented infrastructure with capabilities for service discovery, zero-downtime deployment and lifecycle management of instances. Because Keycloak prevents us from changing certain behaviours when operating the stack and from becoming truly stateless, we cannot take full advantage of the investments in our internal tools or follow the trends of the industry. It also requires developers to acquire very specific knowledge, which adds costs for onboarding and training new joiners.
To increase the scalability and resilience of the system, significantly reduce startup and shutdown time, take advantage of our internal tools and reduce the training needed to operate the stack, we are aiming to create an external cache layer that doesn't require clustering or rebalancing, which would allow the stack to achieve stateless behaviour.
We use multiple SPIs to inject our extensions, such as login flows, new entities that relate to the original ones (users, clients, etc.), new ways of resetting a password or registering OTP, control over the issuing and claims of tokens, and so on. Inside our department Keycloak is known as Merchant IAM: a repository that uses Maven to compile several modules into a JAR. It has multiple Keycloak modules as dependencies, including the test suite, which we also use to write our own functional tests with Arquillian. These dependencies are linked with the provided scope. The main module yields a JAR which is deployed to WildFly (deployments folder) in production, based on the official image https://hub.docker.com/r/jboss/keycloak/.
We want to implement the set of interfaces and SPIs needed to create a new model module for Keycloak (https://github.com/keycloak/keycloak/tree/master/model) that uses an external cache system. The project will use Maven to either output a JAR or be consumed as a dependency, so that it can be easily injected into Keycloak to switch the model layer to a different one.
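To make the idea more concrete, here is a minimal sketch of the shape we have in mind. It is not Keycloak's actual SPI: `ExternalSessionStore`, `InMemorySessionStore` and `SessionService` are hypothetical names invented for this illustration, and the in-memory class only stands in for a remote store so that the example runs without any infrastructure. The point is that the session logic depends only on a small contract, so a node keeps no session state of its own.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

/** Hypothetical contract for the external store; this is NOT a Keycloak interface. */
interface ExternalSessionStore {
    void put(String key, String value, Duration ttl);
    Optional<String> get(String key);
    void remove(String key);
}

/** In-memory stand-in for a remote store, used only to keep the sketch runnable. */
final class InMemorySessionStore implements ExternalSessionStore {
    private static final class Entry {
        final String value;
        final Instant expiresAt;

        Entry(String value, Instant expiresAt) {
            this.value = value;
            this.expiresAt = expiresAt;
        }
    }

    private final Map<String, Entry> entries = new ConcurrentHashMap<>();
    private final Supplier<Instant> clock; // injected so expiry can be tested

    InMemorySessionStore(Supplier<Instant> clock) {
        this.clock = clock;
    }

    @Override
    public void put(String key, String value, Duration ttl) {
        entries.put(key, new Entry(value, clock.get().plus(ttl)));
    }

    @Override
    public Optional<String> get(String key) {
        Entry entry = entries.get(key);
        if (entry == null) {
            return Optional.empty();
        }
        // An entry is expired once the clock reaches its expiry instant.
        if (!clock.get().isBefore(entry.expiresAt)) {
            entries.remove(key);
            return Optional.empty();
        }
        return Optional.of(entry.value);
    }

    @Override
    public void remove(String key) {
        entries.remove(key);
    }
}

/** Session logic that depends only on the contract, so the node itself stays stateless. */
final class SessionService {
    private final ExternalSessionStore store;
    private final Duration idleTimeout;

    SessionService(ExternalSessionStore store, Duration idleTimeout) {
        this.store = store;
        this.idleTimeout = idleTimeout;
    }

    void login(String sessionId, String userId) {
        store.put(key(sessionId), userId, idleTimeout);
    }

    Optional<String> userFor(String sessionId) {
        return store.get(key(sessionId));
    }

    void logout(String sessionId) {
        store.remove(key(sessionId));
    }

    private static String key(String sessionId) {
        return "session:" + sessionId; // hypothetical key layout
    }
}
```

Two `SessionService` instances that share the same store behave like two nodes behind a load balancer: a session created through one is visible through the other, and either node can be stopped without any rebalancing.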
Keycloak already has some extensions that will help us understand how best to distribute and code this extension (https://www.keycloak.org/extensions.html). Our goal is to make the project as easy to inject as possible, with a maximum of one to two steps. To test the integration of the project with Keycloak from the start, we will host this project in a separate repository inside our enterprise GitHub. The project will be integrated with our current Merchant IAM as a Maven dependency, which lets us test from day zero how easy it is to integrate with a Keycloak project and helps us avoid breaking changes as the extension evolves.
The most natural step for a web application that wants to increase its performance is to reduce access to the relational database layer by introducing a cache layer, which normally holds already-prepared data in memory, indexed by key. Various systems developed over the years deliver such capabilities for an in-memory data grid layer, the most popular being Redis, Memcached and Apache Ignite. We are choosing Redis as the system that powers our data grid since, for our use case, it provides the best balance between capabilities, performance and complexity to integrate/operate when compared to the other two options.
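The pattern described above is commonly called cache-aside. A minimal sketch follows, under clear assumptions: `CacheAsideReader` is a hypothetical class, the `ConcurrentHashMap` stands in for the key-value store, and the `Function` stands in for the relational lookup. Nothing here reflects how Keycloak itself wires its caches.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

/** Cache-aside reader: checks the cache first and falls back to the database on a miss. */
final class CacheAsideReader {
    // Stand-in for the external key-value store (assumption for this sketch).
    private final Map<String, String> cache = new ConcurrentHashMap<>();
    // Stand-in for the relational database lookup (assumption for this sketch).
    private final Function<String, Optional<String>> database;
    private final AtomicInteger databaseReads = new AtomicInteger();

    CacheAsideReader(Function<String, Optional<String>> database) {
        this.database = database;
    }

    Optional<String> find(String key) {
        String cached = cache.get(key);
        if (cached != null) {
            return Optional.of(cached); // hit: the database is not touched
        }
        databaseReads.incrementAndGet();
        Optional<String> loaded = database.apply(key);
        loaded.ifPresent(value -> cache.put(key, value)); // populate for later reads
        return loaded;
    }

    /** Call after writing to the database so that stale data is not served. */
    void invalidate(String key) {
        cache.remove(key);
    }

    int databaseReads() {
        return databaseReads.get();
    }
}
```

Repeated calls to `find` with the same key reach the database only once. The price is that every write path has to call `invalidate` (or use an expiry), otherwise readers keep seeing the old value.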
The final goal is to release this extension to the community, and to reach this point we will be guided by the following timeline:
- Open a JIRA ticket in the official Keycloak tracker to collect feedback on the proposal from the community.
- Create a repository for the project.
- Set up the build and the test suite integration for the extension.
- Implement the set of SPIs and have 80% or more coverage with functional tests.
- Integrate the solution with Merchant IAM as an experiment.
- Release Merchant IAM to production with the extension.
- Collect data on the impact.
- Move the repository and open source the project with collected materials.
- Contact Keycloak Community to provide visibility.
We will start the project with one engineer working two days per week (16 hours per week), giving a boost to the groundwork necessary for the project to take off. This engineer can be rotated under the supervision of the main maintainer. During this phase of the project, we will invite engineers to review pull requests and join architectural discussions on an ad-hoc basis, making sure that we follow the four-eyes principle.
Once we reach the final step of the timeline, we hope to start leveraging the community and other engineers to help with the maintenance of the project. At this point the project should be "gaining life", and we will evaluate the next steps.
We can already foresee a few challenges when developing and maintaining this project. The main technical challenge will be to keep the same level of performance with an external cache layer that we have today with a cache layer that sits side-by-side in memory. Even if we are in a VPC, the data and communication will have to go through the network, which is a significantly slower medium than communicating with modules loaded in memory and storing data in memory. To tackle this, we will have to make sure we only execute read/write operations against the cache when necessary.
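One possible way to limit round trips, shown here only as a sketch, is to buffer a session for the duration of a request: read it from the remote store at most once, and write it back only if something actually changed. `RemoteStore`, `CountingRemoteStore` and `RequestScopedSession` are hypothetical names for this illustration; the counting store is an in-memory stand-in where each call represents one network round trip.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical remote store contract; each call stands for one network round trip. */
interface RemoteStore {
    Map<String, String> load(String sessionId);
    void save(String sessionId, Map<String, String> attributes);
}

/** In-memory stand-in that counts round trips so the effect of buffering is visible. */
final class CountingRemoteStore implements RemoteStore {
    private final Map<String, Map<String, String>> data = new HashMap<>();
    int loads;
    int saves;

    @Override
    public Map<String, String> load(String sessionId) {
        loads++;
        Map<String, String> copy = new HashMap<>();
        Map<String, String> stored = data.get(sessionId);
        if (stored != null) {
            copy.putAll(stored);
        }
        return copy;
    }

    @Override
    public void save(String sessionId, Map<String, String> attributes) {
        saves++;
        data.put(sessionId, new HashMap<>(attributes));
    }
}

/** Buffers one session for one request: a single lazy read, and a single write only if changed. */
final class RequestScopedSession {
    private final RemoteStore store;
    private final String sessionId;
    private Map<String, String> attributes; // null until the first access
    private boolean dirty;

    RequestScopedSession(RemoteStore store, String sessionId) {
        this.store = store;
        this.sessionId = sessionId;
    }

    String get(String name) {
        return attributes().get(name);
    }

    void set(String name, String value) {
        String previous = attributes().put(name, value);
        if (!value.equals(previous)) {
            dirty = true; // only a real change schedules a write
        }
    }

    /** Call once at the end of the request; does nothing if the session was not modified. */
    void flush() {
        if (dirty) {
            store.save(sessionId, attributes);
            dirty = false;
        }
    }

    private Map<String, String> attributes() {
        if (attributes == null) {
            attributes = store.load(sessionId); // the only read for this request
        }
        return attributes;
    }
}
```

With this shape, a request that sets several attributes costs one load and one save, a read-only request costs one load, and a request that never touches the session costs nothing. The trade-off is that concurrent requests on the same session can overwrite each other's changes, which a real implementation would have to address.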
Another challenge present in any extension of this size is maintenance. Even though we will be implementing interfaces and the project already "communicates via contracts", whenever the Keycloak team changes one of these interfaces we will have to react quickly. We should closely follow the changelog and the set of tickets resolved in Keycloak's JIRA on every release, to make sure we can quickly release a version that is compatible with the newest Keycloak version.
As stated in the timeline above, we would like to hear from the specialists here about possible blockers and challenges that we are not foreseeing in jumping into this project. We would also like to understand whether there is interest in doing this project in partnership with core maintainers, or who would be interested in contributing their expertise, at least for architectural decisions and code reviews.
We thank you in advance for your time to read and answer, and we hope to work together for an even better Keycloak.