Uploaded image for project: 'Red Hat build of Keycloak'
  1. Red Hat build of Keycloak
  2. RHBK-4322

Keycloak cluster with 3 nodes and jdbc-ping stack fails to rejoin after temporary network partition [GHI#45980]

XMLWordPrintable

    • False
    • Hide

      None

      Show
      None
    • False

      Before reporting an issue

      [x] I have read and understood the above terms for submitting issues, and I understand that my issue may be closed without action if I do not follow them.

      Area

      infinispan

      Describe the bug

      Environment

      • 3 separate VMs (Docker containers)
      • Shared PostgreSQL 16.4 on a separate VM
      • Cache stack: default jdbc-ping (TCP + JDBC_PING2)
      • No custom JGroups/Infinispan XML (using Quarkus/Keycloak defaults)

      Description

      A 3-node Keycloak cluster using the default jdbc-ping stack forms successfully at startup:

      • 3 entries appear in the jgroups_ping table in PostgreSQL.
      • One node becomes coordinator.
      • Keycloak works normally, and caches (sessions, etc.) synchronize between nodes.

      When one node temporarily loses network connectivity to the other nodes and to the shared PostgreSQL DB:

      • The isolated node is correctly suspected and removed from the cluster view (via FD_ALL3).
      • Network-related errors appear in logs (no physical address, dropping messages).

      After network connectivity is restored (DB access works again):

      • The node detects the DB and attempts to rejoin the cluster.
      • It receives a new cluster view.
      • However, it does not merge into the existing cluster and remains isolated (no full rejoin).
      • Scheduled tasks (e.g., ClearExpiredAdminEvents, ClearExpiredUserSessions) fail with Infinispan timeouts, as the cluster view is inconsistent.

      This leads to split-brain-like behavior, failed cache operations, and stuck nodes until full restart.

      Docker compose file

      services:
        keycloak:
      

      image: keycloak/keycloak:26.4.2
      restart: always
      command:

      • "start-dev"
      • "--spi-events-listener-jboss-logging-success-level=info"
        environment:
      • KC_PROXY=edge
      • KC_PROXY_HEADERS=xforwarded
      • KC_HOSTNAME_STRICT=false
      • KC_DB=postgres
      • KC_DB_URL=jdbc:postgresql://192.168.0.20:5432/keycloak?connectTimeout=5&socketTimeout=30&currentSchema=public
      • KC_DB_USERNAME=keycloak
      • KC_DB_PASSWORD=keycloak
      • KC_BOOTSTRAP_ADMIN_USERNAME=admin
      • KC_BOOTSTRAP_ADMIN_PASSWORD=password
      • KC_HEALTH_ENABLED=true
      • KC_METRICS_ENABLED=true
      • KC_HTTP_ENABLED=true
      • KC_LOG_LEVEL=INFO,org.keycloak.events:debug,org.infinispan:debug,org.jgroups:debug
      • KC_FEATURES=scripts
      • KC_METRICS_STATS_ENABLED=true
      • KC_EVENT_METRICS_USER_ENABLED=true
      • JAVA_OPTS_APPEND='-Xms1g' '-Xmx2g' '-Djgroups.external_addr=192.168.0.10'
      • KC_CACHE=ispn
      • KC_CACHE_STACK=jdbc-ping
        volumes:
        -./cache-ispn.xml:/opt/keycloak/conf/cache-ispn.xml
        ports:
      • 8080:8080
      • 7800:7800
      • 57800:57800
      • 9000:9000
        networks:
      • keycloak
        networks:
          keycloak:
        

        driver: bridge

        
        

        Infinispan cache config file

        Use default file with parameter node-name="demo-kc-01"

      Logs start Keycloak

      2026-02-03 14:15:43,434 DEBUG [org.infinispan.remoting.transport.jgroups.JGroupsTransport] (main) JGroups protocol stack: TCP(defer_client_bind_addr=false;time_service_interval=500;log_details=true;buffered_output_stream_size=65536;ergonomics=true;port_range=10;log_accept_error=true;conn_expire_time=0;recv_buf_size=0;bind_port=7800;loopback_copy=false;thread_naming_pattern=pl;non_blocking_sends=false;reaper_interval=0;send_buf_size=640000;tcp_nodelay=true;use_acks=false;stats=true;buffered_input_stream_size=65536;external_addr=192.168.0.10;suppress_time_different_version_warnings=60000;loopback_separate_thread=true;id=57;logical_addr_cache_expiration=360000;logical_addr_cache_max_size=2000;suppress_time_different_cluster_warnings=60000;max_length=0;peer_addr_read_timeout=1000;max_send_queue=128;sock_conn_timeout=300;linger=-1;client_bind_port=0;receive_on_all_interfaces=false;use_virtual_threads=true;external_port=0;bundler_type=per-destination;log_discard_msgs=true;bind_addr=172.17.0.2;logical_addr_cache_reaper_interval=60000;who_has_cache_timeout=2000;log_discard_msgs_version=true)
      :RED(min_threshold=0.5;max_threshold=1.0;stats=true;ergonomics=true;weight_factor=1.0;id=0;enabled=true)
      :org.keycloak.jgroups.protocol.KEYCLOAK_JDBC_PING2(async_discovery_use_separate_thread_per_request=false;update_store_on_view_change=true;ergonomics=true;insert_single_sql=INSERT INTO JGROUPS_PING values (?, ?, ?, ?, ?);max_rank_to_reply=0;info_writer_sleep_time=10000;remove_all_data_on_view_change=true;select_all_pingdata_sql=SELECT address, name, ip, coord FROM JGROUPS_PING WHERE cluster_name=?;send_cache_on_join=true;break_on_coord_rsp=true;stats=true;use_disk_cache=false;clear_sql=DELETE from JGROUPS_PING WHERE cluster_name=?;discovery_rsp_expiry_time=60000;id=1025;info_writer_max_writes_after_view=2;register_shutdown_hook=false;stagger_timeout=0;delete_single_sql=DELETE from JGROUPS_PING WHERE address=?;async_discovery=false;remove_old_coords_on_view_change=false;max_members_in_discovery_request=500;num_discovery_runs=1;initialize_sql=;location=/tmp/jgroups;write_data_on_find=true;return_entire_cache=false)
      :MERGE3(check_interval=48000;stats=true;min_interval=10000;ergonomics=true;id=38;max_participants_in_merge=100;max_interval=30000)
      :FD_SOCK2(offset=50000;connect_timeout=1000;linger=-1;client_bind_port=0;ergonomics=true;port_range=3;suspect_msg_interval=5000;max_port=65536;stats=true;external_port=0;external_addr=192.168.0.10;id=71;min_port=1024)
      :FD_ALL3(stats=true;ergonomics=true;interval=8000;id=70;timeout=40000)
      :VERIFY_SUSPECT2(num_msgs=1;use_mcast_rsps=false;stats=true;ergonomics=true;id=72;timeout=1000)
      :pbcast.NAKACK2(resend_last_seqno_max_times=1;use_mcast_xmit=false;ergonomics=true;xmit_table_msgs_per_row=1024;xmit_table_max_compaction_time=30000;become_server_queue_size=50;xmit_interval=200;resend_last_seqno=true;max_xmit_req_size=1024;discard_delivered_msgs=true;suppress_time_non_member_warnings=60000;xmit_table_num_rows=50;stats=true;xmit_from_random_member=false;log_discard_msgs=true;log_not_found_msgs=true;xmit_table_resize_factor=1.2;id=41;max_rebroadcast_timeout=2000;max_batch_size=0;use_mcast_xmit_req=false)
      :UNICAST3(ergonomics=true;xmit_table_msgs_per_row=1024;xmit_table_max_compaction_time=30000;ack_threshold=100;sync_min_interval=2000;max_retransmit_time=60000;xmit_interval=200;max_xmit_req_size=511600;conn_close_timeout=5000;conn_expiry_timeout=120000;xmit_table_num_rows=50;xmits_enabled=true;stats=true;xmit_table_resize_factor=1.2;log_not_found_msgs=true;id=48;max_batch_size=0;loopback=false)
      :pbcast.STABLE(stats=true;ergonomics=true;desired_avg_gossip=5000;max_bytes=1000000;id=13)
      :pbcast.GMS(max_join_attempts=10;print_local_addr=false;leave_timeout=1000;all_clients_retry_timeout=100;log_view_warnings=true;ergonomics=true;use_delta_views=true;use_flush_if_present=true;print_view_details=true;print_physical_addrs=true;join_timeout=2000;max_leave_attempts=1;view_ack_collection_timeout=2000;stats=true;num_prev_views=10;merge_timeout=5000;num_prev_mbrs=50;id=12;log_collect_msgs=false;membership_change_policy=org.jgroups.protocols.pbcast.GMS$DefaultMembershipPolicy@3d8c55ff)
      :UFC(min_threshold=0.4;stats=true;ergonomics=true;max_credits=4000000;min_credits=1600000;id=31;max_block_time=5000)
      :MFC(min_threshold=0.4;stats=true;ergonomics=true;max_credits=4000000;min_credits=1600000;id=30;max_block_time=5000)
      :FRAG4(frag_size=60000;stats=true;ergonomics=true;id=69)
      
      
      2026-02-03 14:15:43,435 INFO  [org.infinispan.CLUSTER] (main) ISPN000078: Starting JGroups channel `ISPN` with stack `jdbc-ping`
      2026-02-03 14:15:43,440 INFO  [org.jgroups.JChannel] (main) local_addr: 00000000-0000-0000-0000-00000000000a, name: demo-kc-01
      2026-02-03 14:15:43,464 INFO  [org.jgroups.protocols.FD_SOCK2] (main) server listening on *:57800
      2026-02-03 14:15:43,470 DEBUG [org.jgroups.protocols.pbcast.GMS] (main) address=demo-kc-01, cluster=ISPN, physical address=192.168.0.10:7800
      2026-02-03 14:15:43,510 DEBUG [org.jgroups.protocols.pbcast.GMS] (main) demo-kc-01: sending JOIN(demo-kc-01) to demo-kc-02
      2026-02-03 14:15:43,710 DEBUG [org.infinispan.metrics.impl.MetricsRegistryImpl] (Notification Thread) Registered metric MeterId{name='jvm.gc.pause', tags=[tag(action=end of minor GC),tag(cause=Metadata GC Threshold),tag(gc=G1 Young Generation)]}
      2026-02-03 14:15:43,904 DEBUG [org.infinispan.metrics.impl.MetricsRegistryImpl] (Notification Thread) Registered metric MeterId{name='jvm.gc.concurrent.phase.time', tags=[tag(action=end of concurrent GC pause),tag(cause=No GC),tag(gc=G1 Concurrent GC)]}
      2026-02-03 14:15:44,036 DEBUG [org.jgroups.protocols.pbcast.NAKACK2] (main) 
      [demo-kc-01 setDigest()]
      existing digest:  []
      new digest:       demo-kc-02: [33 (33)], demo-kc-03: [0 (0)], demo-kc-01: [0 (0)]
      resulting digest: demo-kc-02: [33 (33)], demo-kc-03: [0 (0)], demo-kc-01: [0 (0)]
      2026-02-03 14:15:44,040 DEBUG [org.jgroups.protocols.pbcast.GMS] (main) demo-kc-01: installing view [demo-kc-02|4] (3) [demo-kc-02, demo-kc-03, demo-kc-01] ([demo-kc-02, demo-kc-03, demo-kc-01] joined)
      2026-02-03 14:15:44,105 DEBUG [org.jgroups.protocols.FD_SOCK2] (main) demo-kc-01: trying to connect to demo-kc-02
      2026-02-03 14:15:44,115 DEBUG [org.jgroups.protocols.FD_SOCK2] (main) demo-kc-01: connected successfully to demo-kc-02 (192.168.0.11:57800) in 9 ms
      2026-02-03 14:15:44,131 INFO  [org.infinispan.CLUSTER] (main) ISPN000094: Received new cluster view for channel ISPN: [demo-kc-02|4] (3) [demo-kc-02, demo-kc-03, demo-kc-01]
      2026-02-03 14:15:44,132 DEBUG [org.infinispan.remoting.transport.jgroups.JGroupsTransport] (main) Joined: [demo-kc-02, demo-kc-03, demo-kc-01], Left: []
      2026-02-03 14:15:44,136 INFO  [org.keycloak.jgroups.certificates.CertificateReloadManager] (main) Reloading JGroups Certificate
      

      Version

      Keycloak 26.4.2 (Docker images)

      Regression

      [ ] The issue is a regression

      Expected behavior

      After network restoration, the node should cleanly rejoin the cluster (via JDBC_PING2 discovery + MERGE3), update the view, and resume cache synchronization without timeouts or isolation.

      Actual behavior

      Node remains out of sync. Infinispan throws repeated ISPN000476: Timed out waiting for responses on 'work' cache operations (scheduled tasks). Cluster is effectively broken until node restart.

      Keycloak logs after network connectivity is restored

      2026-02-03 15:09:44,264 DEBUG [org.jgroups.protocols.FD_SOCK2] () demo-kc-01: broadcasting unsuspect(demo-kc-02)
      2026-02-03 15:09:44,264 DEBUG [org.jgroups.protocols.FD_SOCK2] () demo-kc-01: broadcasting unsuspect(demo-kc-02)
      2026-02-03 15:09:44,264 DEBUG [org.jgroups.protocols.FD_SOCK2] () demo-kc-01: broadcasting unsuspect(demo-kc-03)
      2026-02-03 15:10:32,264 DEBUG [org.jgroups.protocols.FD_ALL3] (Timer runner-2,demo-kc-01) demo-kc-01: haven't received a heartbeat from demo-kc-02 in timeout period (40000 ms), adding it to suspect list
      2026-02-03 15:10:32,264 DEBUG [org.jgroups.protocols.FD_ALL3] (Timer runner-2,demo-kc-01) demo-kc-01: haven't received a heartbeat from demo-kc-03 in timeout period (40000 ms), adding it to suspect list
      2026-02-03 15:10:32,265 DEBUG [org.jgroups.protocols.FD_ALL3] (Timer runner-2,demo-kc-01) demo-kc-01: suspecting [demo-kc-02, demo-kc-03]
      2026-02-03 15:10:32,267 DEBUG [org.jgroups.protocols.FD_ALL3] () demo-kc-01: unsuspecting demo-kc-03
      2026-02-03 15:10:32,267 DEBUG [org.jgroups.protocols.FD_SOCK2] () demo-kc-01: broadcasting unsuspect(demo-kc-03)
      2026-02-03 15:10:32,267 DEBUG [org.jgroups.protocols.FD_ALL3] () demo-kc-01: unsuspecting demo-kc-02
      2026-02-03 15:10:32,267 DEBUG [org.jgroups.protocols.FD_SOCK2] () demo-kc-01: broadcasting unsuspect(demo-kc-02)
      2026-02-03 15:10:32,267 DEBUG [org.jgroups.protocols.FD_SOCK2] () demo-kc-01: broadcasting unsuspect(demo-kc-03)
      2026-02-03 15:10:32,267 DEBUG [org.jgroups.protocols.FD_SOCK2] () demo-kc-01: broadcasting unsuspect(demo-kc-02)
      2026-02-03 15:10:50,899 WARN  [org.jgroups.protocols.TCP] (pd-bundler-15,demo-kc-01) JGRP000032: demo-kc-01: no physical address for demo-kc-02, dropping message
      2026-02-03 15:10:51,308 WARN  [org.keycloak.cookie.DefaultCookieProvider] (executor-thread-5) Non-secure context detected; cookies are not secured, and will not be available in cross-origin POST requests. Please review whether this direct HTTP usage is expected.
      2026-02-03 15:10:51,676 WARN  [org.jgroups.protocols.TCP] (pd-bundler-16,demo-kc-01) JGRP000032: demo-kc-01: no physical address for demo-kc-03, dropping message
      2026-02-03 15:11:06,995 ERROR [org.infinispan.interceptors.impl.InvocationContextInterceptor] (non-blocking-thread-demo-kc-01-p2-t4) ISPN000136: Error executing command PutKeyValueCommand on Cache 'authenticationSessions', writing keys [fdbfc215-d05a-5040-9547-01f3ea995d58]: org.infinispan.commons.TimeoutException: ISPN000638: Timeout waiting for topology 20 transaction data
      

      at org.infinispan.statetransfer.StateTransferLockImpl.transactionDataTimeoutException(StateTransferLockImpl.java:133)
      at org.infinispan.statetransfer.StateTransferLockImpl.lambda$transactionDataFuture$1(StateTransferLockImpl.java:126)
      at org.infinispan.util.concurrent.ConditionFuture.lambda$newConditionStage$0(ConditionFuture.java:60)
      at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:317)
      at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
      at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
      at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
      at java.base/java.lang.VirtualThread.run(VirtualThread.java:329)

      2026-02-03 15:11:06,996 ERROR [org.keycloak.services.error.KeycloakErrorHandler] (executor-thread-5) Uncaught server error: java.util.concurrent.CompletionException: org.infinispan.commons.TimeoutException: ISPN000638: Timeout waiting for topology 20 transaction data
      

      at org.infinispan.commons.util.concurrent.CompletionStages.join(CompletionStages.java:91)
      at org.keycloak.models.sessions.infinispan.transaction.DefaultInfinispanTransactionProvider.commitImpl(DefaultInfinispanTransactionProvider.java:73)
      at org.keycloak.models.AbstractKeycloakTransaction.commit(AbstractKeycloakTransaction.java:46)
      at org.keycloak.services.DefaultKeycloakTransactionManager.lambda$commitWithTracing$1(DefaultKeycloakTransactionManager.java:170)
      at org.keycloak.tracing.NoopTracingProvider.trace(NoopTracingProvider.java:59)
      at org.keycloak.tracing.NoopTracingProvider.trace(NoopTracingProvider.java:69)
      at org.keycloak.services.DefaultKeycloakTransactionManager.commitWithTracing(DefaultKeycloakTransactionManager.java:169)
      at org.keycloak.services.DefaultKeycloakTransactionManager.lambda$commit$0(DefaultKeycloakTransactionManager.java:146)
      at org.keycloak.tracing.NoopTracingProvider.trace(NoopTracingProvider.java:59)
      at org.keycloak.tracing.NoopTracingProvider.trace(NoopTracingProvider.java:69)
      at org.keycloak.services.DefaultKeycloakTransactionManager.commit(DefaultKeycloakTransactionManager.java:121)
      at org.keycloak.services.DefaultKeycloakSession.closeTransactionManager(DefaultKeycloakSession.java:392)
      at org.keycloak.services.DefaultKeycloakSession.close(DefaultKeycloakSession.java:357)
      at org.keycloak.models.KeycloakBeanProducer_ProducerMethod_getKeycloakSession_XoSEUTXOsE3bpqXlGMAykCiECUM_ClientProxy.close(Unknown Source)
      at org.keycloak.quarkus.runtime.transaction.TransactionalSessionHandler.close(TransactionalSessionHandler.java:60)
      at org.keycloak.quarkus.runtime.integration.jaxrs.CloseSessionFilter.closeSession(CloseSessionFilter.java:67)
      at org.keycloak.quarkus.runtime.integration.jaxrs.CloseSessionFilter.filter(CloseSessionFilter.java:63)
      at org.jboss.resteasy.reactive.server.handlers.ResourceResponseFilterHandler.handle(ResourceResponseFilterHandler.java:25)
      at io.quarkus.resteasy.reactive.server.runtime.QuarkusResteasyReactiveRequestContext.invokeHandler(QuarkusResteasyReactiveRequestContext.java:192)
      at org.jboss.resteasy.reactive.common.core.AbstractResteasyReactiveContext.run(AbstractResteasyReactiveContext.java:147)
      at io.quarkus.vertx.core.runtime.VertxCoreRecorder$15.runWith(VertxCoreRecorder.java:645)
      at org.jboss.threads.EnhancedQueueExecutor$Task.doRunWith(EnhancedQueueExecutor.java:2651)
      at org.jboss.threads.EnhancedQueueExecutor$Task.run(EnhancedQueueExecutor.java:2630)
      at org.jboss.threads.EnhancedQueueExecutor.runThreadBody(EnhancedQueueExecutor.java:1622)
      at org.jboss.threads.EnhancedQueueExecutor$ThreadBody.run(EnhancedQueueExecutor.java:1589)
      at org.jboss.threads.DelegatingRunnable.run(DelegatingRunnable.java:11)
      at org.jboss.threads.ThreadLocalResettingRunnable.run(ThreadLocalResettingRunnable.java:11)
      at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
      at java.base/java.lang.Thread.run(Thread.java:1583)

      Caused by: org.infinispan.commons.TimeoutException: ISPN000638: Timeout waiting for topology 20 transaction data
      

      at org.infinispan.statetransfer.StateTransferLockImpl.transactionDataTimeoutException(StateTransferLockImpl.java:133)
      at org.infinispan.statetransfer.StateTransferLockImpl.lambda$transactionDataFuture$1(StateTransferLockImpl.java:126)
      at org.infinispan.util.concurrent.ConditionFuture.lambda$newConditionStage$0(ConditionFuture.java:60)
      at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:317)
      at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
      at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
      at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
      at java.base/java.lang.VirtualThread.run(VirtualThread.java:329)

      
      

      How to Reproduce?

      1. Start 3 Keycloak nodes in Docker with KC_CACHE_STACK=jdbc-ping.
      2. Confirm cluster formation (jgroups_ping table has 3 rows, logs show "new cluster view with 3 members").
      3. Simulate network partition on one node demo-kc-01:

      • ip route add blackhole 192.168.0.11
      • ip route add blackhole 192.168.0.12
      • ip route add blackhole 192.168.0.20
        4. Wait for ~5 min.
        5. Restore network:
      • ip route del blackhole 192.168.0.11
      • ip route del blackhole 192.168.0.12
      • ip route del blackhole 192.168.0.20
        Observe logs: node sees DB, gets new view, but no merge/rejoin.

      Anything else?

      • FD_ALL3 timeout=40s, MERGE3 present in default config.
      • No firewall blocks after restoration; direct TCP between nodes + DB works.

              Unassigned Unassigned
              pvlha Pavel Vlha
              Votes:
              0 Vote for this issue
              Watchers:
              1 Start watching this issue

                Created:
                Updated:
                Resolved: