Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 8.0.2, 9.0.3, 10.0.1, 10.0.2
    • Fix Version/s: Backlog
    • Component/s: Clustering
    • Labels: None
    • Steps to Reproduce:
      • Install a cluster with 3 nodes
      • Configure them to run in TCP mode: /subsystem=jgroups/channel=ee:write-attribute(name="stack", value="tcp")
      • Bring the cluster up so it's in sync
      • Shut one node down again (shut down the server; only stopping the process isn't enough)
      • Create or delete a realm on one of the remaining nodes; it is now really slow
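      To double-check that the stack switch was applied, the attribute can be read back with the same CLI (a minimal check, assuming the default ee channel as above):

        /subsystem=jgroups/channel=ee:read-attribute(name=stack)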
    • Docs QE Status: NEW
    • QE Status: NEW

      Description

      We run a Keycloak cluster in Kubernetes, and every time after deploying new containers (with the same Keycloak version as before), the cluster was really slow for several minutes (Keycloak requests ended in timeouts). At first I thought Keycloak was simply slow after startup, but after some more testing I saw that this is not the case: Keycloak itself is fast, but the whole cluster becomes slow (even an already running member of the cluster gets slow).

      The strange thing was that I couldn't reproduce it outside of Kubernetes: restarting a Keycloak cluster on normal VMs worked fine, even with the rest of the configuration similar to our k8s setup. But after some more testing and debugging, I finally found what triggers the slowdown and could also reproduce it without k8s.

      So here is what's going on:

      • The "slow mode" starts after rebalancing when a member joins/leaves and the new cluster has still 2 or more members (so 1 member to 2 members is a problem, 2 members down to 1 member is ok ... if your cluster has 3 members, then 2 to 3 and 3 to 2 is both a problem). In my k8s setup with 2 replicas, this meant that is was slow after the first instance restarted and joined again, but since the second instance was killed to be restarted as well directly after the first was ready again, it was fast again, as soon as the second left the cluster and there was again only one member in the cluster. But then after the second joined again, it was slow again, because there were two members again.
      • The "slow mode" ends again after more or less exactly 4 minutes after a node left the cluster. So if you have two nodes and one leaves the cluster, the 4 minutes start counting from here. If at this point there are still 2 members left (if you had 3), the cluster is slow for the complete 4 minutes. But if you had only two nodes, the remaining node isn't affected, because it's the only one left. If after two minutes another node joins again, it rebalanced with that node, and since there are now two members again, it's slow now. But since the time already started two minutes ago when the other node left the cluster, it's only slow for the remaining two minutes.
      • It only happens if the cluster runs in TCP mode (in k8s UDP doesn't work, so TCP is the only option, but when running in VMs it uses UDP by default)
      • The leaving node needs to be shutdown completely (halt the server, the IP needs to be unreachable ... this was the main reason why I couldn't reproduce it first with VMs). The 4-minute timer starts when the Cluster rebalances because the node left, but if only the process was stopped, nothing happens. If you then shutdown the complete server of the leaving node after a minute (and there are still two other nodes left in the cluster), the cluster gets slow for the remaining 3 minutes of the 4-minute timeslot, and if you start the server again before, it gets fast again as soon as the IP is reachable again.

      So if there are two or more members in the cluster, they still try to talk to departed members for about 4 minutes after those members left. And if the cluster is in TCP mode but the IP of the departed member isn't reachable anymore, it gets really slow.
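      While a node is in this state, its view of the cluster membership can be inspected through the management CLI by dumping the channel's runtime attributes (a diagnostic sketch, assuming the default ee channel; which runtime attributes are exposed depends on the WildFly version underneath the Keycloak release):

        /subsystem=jgroups/channel=ee:read-resource(include-runtime=true)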

      The problem with that is that in Kubernetes the pods get a new IP every time, so when an instance restarts, the old IP isn't reachable anymore; but when it joins again (even with the same name/ID as the member that left), the old IP is still known somewhere and causes the slowdown of both the old node that was never restarted and the newly restarted node (which never knew the old IP). But it's also a problem outside of Kubernetes, for example if you have 3 nodes and want to reboot one of them for security updates.

      The attached keycloak-slowdown-timing.png shows the timing of some scenarios (in the two scenarios where instance 3 doesn't exist at the beginning but joins later, it could be a completely new instance or, as in my k8s scenario, the same instance that was stopped before, just with a different IP).

      The slowdown is most noticeable when creating/deleting realms, but logins/logouts are also about 10 times slower than normal.

      In thread dumps taken during the slow period I saw that requests often hang in "org.keycloak.cluster.infinispan.InfinispanNotificationsManager.notify(InfinispanNotificationsManager.java:155)", and I verified that by adding New Relic with a custom instrumentation for that method. As you can see in the two attached New Relic screenshots, the additional time comes from exactly that method; everything else is still fast. I had a script running that created a realm and deleted it again in a loop (using kcadm.sh), and during the slow period everything took at least 10 times as long as normal.
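      Such a loop is easy to reproduce with the admin CLI (a minimal sketch, not the exact script I used; server URL, credentials and realm name are placeholders):

        bin/kcadm.sh config credentials --server http://localhost:8080/auth --realm master --user admin --password admin
        while true; do
          time bin/kcadm.sh create realms -s realm=slowtest -s enabled=true
          time bin/kcadm.sh delete realms/slowtest
        done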

      I don't understand why the old IP needs to stay reachable after a member has cleanly left the cluster; this shouldn't be necessary. Even stranger, a few times the slowdown didn't happen, and I was really confused, but then I saw in the logs that the remaining node had logged that it lost the other member and that there might be a split-brain situation. So it looks like the problem doesn't occur if a member drops out of the cluster without leaving it cleanly. I couldn't trigger the split-brain situation in my local tests though, so I'm not 100% sure about that; it could also be something else that prevented the slowdown in that case.

People

    • Assignee: Unassigned
    • Reporter: supertux88 (Benjamin Neff)
    • Votes: 0
    • Watchers: 6

Dates

    • Created:
    • Updated: