Narrative
Customers run on-premise or cloud stretched clusters and use them as their way to approach high availability. A stretched cluster eases day-two operations: Pods are re-scheduled in another site when one site fails.
Simplified setup: Customers might prefer a stretched cluster as it simplifies operations compared to managing multiple clusters. For a Keycloak setup, it also removes the need to deploy, configure, and keep up to date an external Infinispan. Given the reduced complexity in day-to-day operations, a budget- and resource-constrained team running the service might achieve better availability with a simpler setup based on a stretched cluster.
Meeting customers where they are: A customer that wants to deploy Keycloak needs to use the setup that is available to them. If a company decided years ago to run a stretched OpenShift setup, that decision will usually not be overturned by the wish to use Keycloak. Meeting customers where they are, with their needs and abilities, supporting a stretched cluster is both good enough and sometimes the only option for them.
For cloud environments, a multi-AZ cluster is the de facto standard: This has been the case since OpenShift 4.1. An AZ is defined by “low latency, and full control over the network”, which is true for all cloud providers. The guidance to avoid spanning across regions doesn’t apply to multi-AZ setups (https://access.redhat.com/articles/3220991, https://access.redhat.com/articles/3221001).
Goals
- Make deployment as simple as possible (no external ISPN/DG)
- Extend documentation to cover Pods spread across sites:
- Validate that a failover works as expected when nodes and sites fail
- Show how to do a seamless switchover from one site to another
- Document the affinity rules necessary to do this
- Test and document monitoring to detect problems like split-brain scenarios in Keycloak
- The documentation for multi-AZ stretched clusters should be clearly separated in the guides
- The documentation is enhanced with an introduction to the different deployment types (single site, stretched, multi-AZ)
- Rely only on Kubernetes features so that this is also supported on-premise
- The load balancer will be provided by OpenShift
- Understand how OpenShift handles split-brain scenarios and latency spikes
- Performance tests
Proposal of Design spec to BU / customers:
- Design:
- Deployment with RHBK Operator on OpenShift
- Relying on standard affinity/anti-affinity scheduling of Pods (see the scheduling sketch after this list)
- Pods can run in any site thanks to transparent networking inside an OpenShift cluster
- Recommended setup for low-latency responses: run all Pods in the site where the primary database resides, and let OpenShift automatically schedule them in other sites on failover and switchover (see the node-affinity sketch after this list)
- Load balancing is handled by the standard OpenShift router/ingress, as in a single-site deployment
- All user/client sessions are stored in the database and survive a restart of all nodes (same as single-site today)
- Authentication sessions, single-use tokens (used during login), and information about brute force prevention are stored in memory on two nodes (same as single-site today)
- Supported environments:
- Tested with ROSA multi-AZ HCP with three AZs (as the SRE team can rebuild this kind of cluster automatically via scripts)
- Validated with OpenShift SMEs to understand the limitations of stretched clusters
- No use of functionality outside OpenShift, so supported on all cloud providers and on-premise
- Documentation provided:
- How to deploy (focus on affinity/anti-affinity setups)
- How to monitor
- How to switch over from one site to another / how to evacuate one site (focus on affinity/anti-affinity setups; see the disruption-budget sketch after this list)
- Behavior:
- When one Pod fails, no data is lost. After a short reconfiguration of the load balancer (less than 1 minute), operations continue as normal (same as single-site today)
- When two or more Pods fail simultaneously, some in-memory data might be lost, but all users are still logged in as the session data is stored in the database (same as single-site today)
A longer description is available here: https://docs.google.com/document/d/1LoXBgKtHy7VJC7pvzL0h_A1uuL05pA4hjKox8S5mXrw/edit?tab=t.0#heading=h.26g1paipwcme
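The scheduling sketch referenced above: a minimal fragment, assuming the Pods carry an app: keycloak label, that spreads Keycloak Pods across availability zones with a standard topology spread constraint and additionally keeps them off the same node with a preferred pod anti-affinity. Where exactly this fragment is placed (for example, via the Operator's pod template customization) depends on the Operator version and is an assumption here.

  # Sketch only: spread Keycloak Pods across zones; the app=keycloak selector is an assumption
  topologySpreadConstraints:
    - maxSkew: 1
      topologyKey: topology.kubernetes.io/zone    # standard zone label on OpenShift/cloud nodes
      whenUnsatisfiable: ScheduleAnyway           # keep scheduling even if a whole zone is lost
      labelSelector:
        matchLabels:
          app: keycloak
  affinity:
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 100
          podAffinityTerm:
            topologyKey: kubernetes.io/hostname   # additionally avoid two Pods on the same node
            labelSelector:
              matchLabels:
                app: keycloak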
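The node-affinity sketch referenced above: for the recommended low-latency setup, a preferred (not required) node affinity can pull the Pods toward the zone where the primary database runs while still allowing OpenShift to place them elsewhere on failover or switchover. The zone value below is a placeholder, and the fragment is a sketch under the same placement assumptions as above.

  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 100
          preference:
            matchExpressions:
              - key: topology.kubernetes.io/zone
                operator: In
                values:
                  - us-east-1a    # placeholder: zone/site where the primary database currently resides

Because this is only a preference, Pods are still scheduled in the remaining zones when that zone becomes unavailable; for a planned switchover, the preferred zone can be changed and the Pods rolled.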
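The disruption-budget sketch referenced above: for a controlled switchover or evacuation of one site (for example via node drains), a PodDisruptionBudget can keep a minimum number of Keycloak Pods serving while the nodes of that site are drained. The name, selector, and minAvailable value are assumptions for illustration.

  apiVersion: policy/v1
  kind: PodDisruptionBudget
  metadata:
    name: keycloak-pdb          # hypothetical name
  spec:
    minAvailable: 2             # assumption: keep at least two Pods available during a drain
    selector:
      matchLabels:
        app: keycloak           # assumed Pod label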
Non-Goals
- Support non-persisted sessions with a stretched cluster
- Server hinting plus an additional number of owners to prevent losing data in failover scenarios. Challenge: how to make the site/zone information available to the Pod, as it isn't available in the downward API today. It might arrive in a year or two, and then we can use it in a simple way; until then it would be complicated.
- Make the number of owners configurable (strange behavior if mixed values exist in the cluster)
- Network partition handling by using a quorum
Implementation notes
- Review ISPN strategies for cluster merges
- Future downward API that could be used for server hinting in 12+ months: https://github.com/kubernetes/enhancements/blob/master/keps%2Fsig-node%2F4742-node-topology-downward-api%2FREADME.md