-
Bug
-
Resolution: Done
-
Undefined
-
None
-
None
Before reporting an issue
[x] I have read and understood the above terms for submitting issues, and I understand that my issue may be closed without action if I do not follow them.
Area
infinispan
Describe the bug
Environment
- 3 separate VMs (Docker containers)
- Shared PostgreSQL 16.4 on a separate VM
- Cache stack: default jdbc-ping (TCP + JDBC_PING2)
- No custom JGroups/Infinispan XML (using Quarkus/Keycloak defaults)
Description
A 3-node Keycloak cluster using the default jdbc-ping stack forms successfully at startup:
- 3 entries appear in the jgroups_ping table in PostgreSQL.
- One node becomes coordinator.
- Keycloak works normally, and caches (sessions, etc.) synchronize between nodes.
When one node temporarily loses network connectivity to the other nodes and to the shared PostgreSQL DB:
- The isolated node is correctly suspected and removed from the cluster view (via FD_ALL3).
- Network-related errors appear in logs (no physical address, dropping messages).
After network connectivity is restored (DB access works again):
- The node detects the DB and attempts to rejoin the cluster.
- It receives a new cluster view.
- However, it does not merge into the existing cluster and remains isolated (no full rejoin).
- Scheduled tasks (e.g., ClearExpiredAdminEvents, ClearExpiredUserSessions) fail with Infinispan timeouts, as the cluster view is inconsistent.
This leads to split-brain-like behavior, failed cache operations, and stuck nodes until full restart.
Docker compose file
services: keycloak:
image: keycloak/keycloak:26.4.2
restart: always
command:
- "start-dev"
- "--spi-events-listener-jboss-logging-success-level=info"
environment: - KC_PROXY=edge
- KC_PROXY_HEADERS=xforwarded
- KC_HOSTNAME_STRICT=false
- KC_DB=postgres
- KC_DB_URL=jdbc:postgresql://192.168.0.20:5432/keycloak?connectTimeout=5&socketTimeout=30¤tSchema=public
- KC_DB_USERNAME=keycloak
- KC_DB_PASSWORD=keycloak
- KC_BOOTSTRAP_ADMIN_USERNAME=admin
- KC_BOOTSTRAP_ADMIN_PASSWORD=password
- KC_HEALTH_ENABLED=true
- KC_METRICS_ENABLED=true
- KC_HTTP_ENABLED=true
- KC_LOG_LEVEL=INFO,org.keycloak.events:debug,org.infinispan:debug,org.jgroups:debug
- KC_FEATURES=scripts
- KC_METRICS_STATS_ENABLED=true
- KC_EVENT_METRICS_USER_ENABLED=true
- JAVA_OPTS_APPEND='-Xms1g' '-Xmx2g' '-Djgroups.external_addr=192.168.0.10'
- KC_CACHE=ispn
- KC_CACHE_STACK=jdbc-ping
volumes:
-./cache-ispn.xml:/opt/keycloak/conf/cache-ispn.xml
ports: - 8080:8080
- 7800:7800
- 57800:57800
- 9000:9000
networks: - keycloak
networks: keycloak:
driver: bridge
Infinispan cache config file
Use default file with parameter node-name="demo-kc-01"
Logs start Keycloak
2026-02-03 14:15:43,434 DEBUG [org.infinispan.remoting.transport.jgroups.JGroupsTransport] (main) JGroups protocol stack: TCP(defer_client_bind_addr=false;time_service_interval=500;log_details=true;buffered_output_stream_size=65536;ergonomics=true;port_range=10;log_accept_error=true;conn_expire_time=0;recv_buf_size=0;bind_port=7800;loopback_copy=false;thread_naming_pattern=pl;non_blocking_sends=false;reaper_interval=0;send_buf_size=640000;tcp_nodelay=true;use_acks=false;stats=true;buffered_input_stream_size=65536;external_addr=192.168.0.10;suppress_time_different_version_warnings=60000;loopback_separate_thread=true;id=57;logical_addr_cache_expiration=360000;logical_addr_cache_max_size=2000;suppress_time_different_cluster_warnings=60000;max_length=0;peer_addr_read_timeout=1000;max_send_queue=128;sock_conn_timeout=300;linger=-1;client_bind_port=0;receive_on_all_interfaces=false;use_virtual_threads=true;external_port=0;bundler_type=per-destination;log_discard_msgs=true;bind_addr=172.17.0.2;logical_addr_cache_reaper_interval=60000;who_has_cache_timeout=2000;log_discard_msgs_version=true) :RED(min_threshold=0.5;max_threshold=1.0;stats=true;ergonomics=true;weight_factor=1.0;id=0;enabled=true) :org.keycloak.jgroups.protocol.KEYCLOAK_JDBC_PING2(async_discovery_use_separate_thread_per_request=false;update_store_on_view_change=true;ergonomics=true;insert_single_sql=INSERT INTO JGROUPS_PING values (?, ?, ?, ?, ?);max_rank_to_reply=0;info_writer_sleep_time=10000;remove_all_data_on_view_change=true;select_all_pingdata_sql=SELECT address, name, ip, coord FROM JGROUPS_PING WHERE cluster_name=?;send_cache_on_join=true;break_on_coord_rsp=true;stats=true;use_disk_cache=false;clear_sql=DELETE from JGROUPS_PING WHERE cluster_name=?;discovery_rsp_expiry_time=60000;id=1025;info_writer_max_writes_after_view=2;register_shutdown_hook=false;stagger_timeout=0;delete_single_sql=DELETE from JGROUPS_PING WHERE address=?;async_discovery=false;remove_old_coords_on_view_change=false;max_members_in_discovery_request=500;num_discovery_runs=1;initialize_sql=;location=/tmp/jgroups;write_data_on_find=true;return_entire_cache=false) :MERGE3(check_interval=48000;stats=true;min_interval=10000;ergonomics=true;id=38;max_participants_in_merge=100;max_interval=30000) :FD_SOCK2(offset=50000;connect_timeout=1000;linger=-1;client_bind_port=0;ergonomics=true;port_range=3;suspect_msg_interval=5000;max_port=65536;stats=true;external_port=0;external_addr=192.168.0.10;id=71;min_port=1024) :FD_ALL3(stats=true;ergonomics=true;interval=8000;id=70;timeout=40000) :VERIFY_SUSPECT2(num_msgs=1;use_mcast_rsps=false;stats=true;ergonomics=true;id=72;timeout=1000) :pbcast.NAKACK2(resend_last_seqno_max_times=1;use_mcast_xmit=false;ergonomics=true;xmit_table_msgs_per_row=1024;xmit_table_max_compaction_time=30000;become_server_queue_size=50;xmit_interval=200;resend_last_seqno=true;max_xmit_req_size=1024;discard_delivered_msgs=true;suppress_time_non_member_warnings=60000;xmit_table_num_rows=50;stats=true;xmit_from_random_member=false;log_discard_msgs=true;log_not_found_msgs=true;xmit_table_resize_factor=1.2;id=41;max_rebroadcast_timeout=2000;max_batch_size=0;use_mcast_xmit_req=false) :UNICAST3(ergonomics=true;xmit_table_msgs_per_row=1024;xmit_table_max_compaction_time=30000;ack_threshold=100;sync_min_interval=2000;max_retransmit_time=60000;xmit_interval=200;max_xmit_req_size=511600;conn_close_timeout=5000;conn_expiry_timeout=120000;xmit_table_num_rows=50;xmits_enabled=true;stats=true;xmit_table_resize_factor=1.2;log_not_found_msgs=true;id=48;max_batch_size=0;loopback=false) :pbcast.STABLE(stats=true;ergonomics=true;desired_avg_gossip=5000;max_bytes=1000000;id=13) :pbcast.GMS(max_join_attempts=10;print_local_addr=false;leave_timeout=1000;all_clients_retry_timeout=100;log_view_warnings=true;ergonomics=true;use_delta_views=true;use_flush_if_present=true;print_view_details=true;print_physical_addrs=true;join_timeout=2000;max_leave_attempts=1;view_ack_collection_timeout=2000;stats=true;num_prev_views=10;merge_timeout=5000;num_prev_mbrs=50;id=12;log_collect_msgs=false;membership_change_policy=org.jgroups.protocols.pbcast.GMS$DefaultMembershipPolicy@3d8c55ff) :UFC(min_threshold=0.4;stats=true;ergonomics=true;max_credits=4000000;min_credits=1600000;id=31;max_block_time=5000) :MFC(min_threshold=0.4;stats=true;ergonomics=true;max_credits=4000000;min_credits=1600000;id=30;max_block_time=5000) :FRAG4(frag_size=60000;stats=true;ergonomics=true;id=69) 2026-02-03 14:15:43,435 INFO [org.infinispan.CLUSTER] (main) ISPN000078: Starting JGroups channel `ISPN` with stack `jdbc-ping` 2026-02-03 14:15:43,440 INFO [org.jgroups.JChannel] (main) local_addr: 00000000-0000-0000-0000-00000000000a, name: demo-kc-01 2026-02-03 14:15:43,464 INFO [org.jgroups.protocols.FD_SOCK2] (main) server listening on *:57800 2026-02-03 14:15:43,470 DEBUG [org.jgroups.protocols.pbcast.GMS] (main) address=demo-kc-01, cluster=ISPN, physical address=192.168.0.10:7800 2026-02-03 14:15:43,510 DEBUG [org.jgroups.protocols.pbcast.GMS] (main) demo-kc-01: sending JOIN(demo-kc-01) to demo-kc-02 2026-02-03 14:15:43,710 DEBUG [org.infinispan.metrics.impl.MetricsRegistryImpl] (Notification Thread) Registered metric MeterId{name='jvm.gc.pause', tags=[tag(action=end of minor GC),tag(cause=Metadata GC Threshold),tag(gc=G1 Young Generation)]} 2026-02-03 14:15:43,904 DEBUG [org.infinispan.metrics.impl.MetricsRegistryImpl] (Notification Thread) Registered metric MeterId{name='jvm.gc.concurrent.phase.time', tags=[tag(action=end of concurrent GC pause),tag(cause=No GC),tag(gc=G1 Concurrent GC)]} 2026-02-03 14:15:44,036 DEBUG [org.jgroups.protocols.pbcast.NAKACK2] (main) [demo-kc-01 setDigest()] existing digest: [] new digest: demo-kc-02: [33 (33)], demo-kc-03: [0 (0)], demo-kc-01: [0 (0)] resulting digest: demo-kc-02: [33 (33)], demo-kc-03: [0 (0)], demo-kc-01: [0 (0)] 2026-02-03 14:15:44,040 DEBUG [org.jgroups.protocols.pbcast.GMS] (main) demo-kc-01: installing view [demo-kc-02|4] (3) [demo-kc-02, demo-kc-03, demo-kc-01] ([demo-kc-02, demo-kc-03, demo-kc-01] joined) 2026-02-03 14:15:44,105 DEBUG [org.jgroups.protocols.FD_SOCK2] (main) demo-kc-01: trying to connect to demo-kc-02 2026-02-03 14:15:44,115 DEBUG [org.jgroups.protocols.FD_SOCK2] (main) demo-kc-01: connected successfully to demo-kc-02 (192.168.0.11:57800) in 9 ms 2026-02-03 14:15:44,131 INFO [org.infinispan.CLUSTER] (main) ISPN000094: Received new cluster view for channel ISPN: [demo-kc-02|4] (3) [demo-kc-02, demo-kc-03, demo-kc-01] 2026-02-03 14:15:44,132 DEBUG [org.infinispan.remoting.transport.jgroups.JGroupsTransport] (main) Joined: [demo-kc-02, demo-kc-03, demo-kc-01], Left: [] 2026-02-03 14:15:44,136 INFO [org.keycloak.jgroups.certificates.CertificateReloadManager] (main) Reloading JGroups Certificate
Version
Keycloak 26.4.2 (Docker images)
Regression
[ ] The issue is a regression
Expected behavior
After network restoration, the node should cleanly rejoin the cluster (via JDBC_PING2 discovery + MERGE3), update the view, and resume cache synchronization without timeouts or isolation.
Actual behavior
Node remains out of sync. Infinispan throws repeated ISPN000476: Timed out waiting for responses on 'work' cache operations (scheduled tasks). Cluster is effectively broken until node restart.
Keycloak logs after network connectivity is restored
2026-02-03 15:09:44,264 DEBUG [org.jgroups.protocols.FD_SOCK2] () demo-kc-01: broadcasting unsuspect(demo-kc-02) 2026-02-03 15:09:44,264 DEBUG [org.jgroups.protocols.FD_SOCK2] () demo-kc-01: broadcasting unsuspect(demo-kc-02) 2026-02-03 15:09:44,264 DEBUG [org.jgroups.protocols.FD_SOCK2] () demo-kc-01: broadcasting unsuspect(demo-kc-03) 2026-02-03 15:10:32,264 DEBUG [org.jgroups.protocols.FD_ALL3] (Timer runner-2,demo-kc-01) demo-kc-01: haven't received a heartbeat from demo-kc-02 in timeout period (40000 ms), adding it to suspect list 2026-02-03 15:10:32,264 DEBUG [org.jgroups.protocols.FD_ALL3] (Timer runner-2,demo-kc-01) demo-kc-01: haven't received a heartbeat from demo-kc-03 in timeout period (40000 ms), adding it to suspect list 2026-02-03 15:10:32,265 DEBUG [org.jgroups.protocols.FD_ALL3] (Timer runner-2,demo-kc-01) demo-kc-01: suspecting [demo-kc-02, demo-kc-03] 2026-02-03 15:10:32,267 DEBUG [org.jgroups.protocols.FD_ALL3] () demo-kc-01: unsuspecting demo-kc-03 2026-02-03 15:10:32,267 DEBUG [org.jgroups.protocols.FD_SOCK2] () demo-kc-01: broadcasting unsuspect(demo-kc-03) 2026-02-03 15:10:32,267 DEBUG [org.jgroups.protocols.FD_ALL3] () demo-kc-01: unsuspecting demo-kc-02 2026-02-03 15:10:32,267 DEBUG [org.jgroups.protocols.FD_SOCK2] () demo-kc-01: broadcasting unsuspect(demo-kc-02) 2026-02-03 15:10:32,267 DEBUG [org.jgroups.protocols.FD_SOCK2] () demo-kc-01: broadcasting unsuspect(demo-kc-03) 2026-02-03 15:10:32,267 DEBUG [org.jgroups.protocols.FD_SOCK2] () demo-kc-01: broadcasting unsuspect(demo-kc-02) 2026-02-03 15:10:50,899 WARN [org.jgroups.protocols.TCP] (pd-bundler-15,demo-kc-01) JGRP000032: demo-kc-01: no physical address for demo-kc-02, dropping message 2026-02-03 15:10:51,308 WARN [org.keycloak.cookie.DefaultCookieProvider] (executor-thread-5) Non-secure context detected; cookies are not secured, and will not be available in cross-origin POST requests. Please review whether this direct HTTP usage is expected. 2026-02-03 15:10:51,676 WARN [org.jgroups.protocols.TCP] (pd-bundler-16,demo-kc-01) JGRP000032: demo-kc-01: no physical address for demo-kc-03, dropping message 2026-02-03 15:11:06,995 ERROR [org.infinispan.interceptors.impl.InvocationContextInterceptor] (non-blocking-thread-demo-kc-01-p2-t4) ISPN000136: Error executing command PutKeyValueCommand on Cache 'authenticationSessions', writing keys [fdbfc215-d05a-5040-9547-01f3ea995d58]: org.infinispan.commons.TimeoutException: ISPN000638: Timeout waiting for topology 20 transaction data
at org.infinispan.statetransfer.StateTransferLockImpl.transactionDataTimeoutException(StateTransferLockImpl.java:133)
at org.infinispan.statetransfer.StateTransferLockImpl.lambda$transactionDataFuture$1(StateTransferLockImpl.java:126)
at org.infinispan.util.concurrent.ConditionFuture.lambda$newConditionStage$0(ConditionFuture.java:60)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:317)
at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
at java.base/java.lang.VirtualThread.run(VirtualThread.java:329)
2026-02-03 15:11:06,996 ERROR [org.keycloak.services.error.KeycloakErrorHandler] (executor-thread-5) Uncaught server error: java.util.concurrent.CompletionException: org.infinispan.commons.TimeoutException: ISPN000638: Timeout waiting for topology 20 transaction data
at org.infinispan.commons.util.concurrent.CompletionStages.join(CompletionStages.java:91)
at org.keycloak.models.sessions.infinispan.transaction.DefaultInfinispanTransactionProvider.commitImpl(DefaultInfinispanTransactionProvider.java:73)
at org.keycloak.models.AbstractKeycloakTransaction.commit(AbstractKeycloakTransaction.java:46)
at org.keycloak.services.DefaultKeycloakTransactionManager.lambda$commitWithTracing$1(DefaultKeycloakTransactionManager.java:170)
at org.keycloak.tracing.NoopTracingProvider.trace(NoopTracingProvider.java:59)
at org.keycloak.tracing.NoopTracingProvider.trace(NoopTracingProvider.java:69)
at org.keycloak.services.DefaultKeycloakTransactionManager.commitWithTracing(DefaultKeycloakTransactionManager.java:169)
at org.keycloak.services.DefaultKeycloakTransactionManager.lambda$commit$0(DefaultKeycloakTransactionManager.java:146)
at org.keycloak.tracing.NoopTracingProvider.trace(NoopTracingProvider.java:59)
at org.keycloak.tracing.NoopTracingProvider.trace(NoopTracingProvider.java:69)
at org.keycloak.services.DefaultKeycloakTransactionManager.commit(DefaultKeycloakTransactionManager.java:121)
at org.keycloak.services.DefaultKeycloakSession.closeTransactionManager(DefaultKeycloakSession.java:392)
at org.keycloak.services.DefaultKeycloakSession.close(DefaultKeycloakSession.java:357)
at org.keycloak.models.KeycloakBeanProducer_ProducerMethod_getKeycloakSession_XoSEUTXOsE3bpqXlGMAykCiECUM_ClientProxy.close(Unknown Source)
at org.keycloak.quarkus.runtime.transaction.TransactionalSessionHandler.close(TransactionalSessionHandler.java:60)
at org.keycloak.quarkus.runtime.integration.jaxrs.CloseSessionFilter.closeSession(CloseSessionFilter.java:67)
at org.keycloak.quarkus.runtime.integration.jaxrs.CloseSessionFilter.filter(CloseSessionFilter.java:63)
at org.jboss.resteasy.reactive.server.handlers.ResourceResponseFilterHandler.handle(ResourceResponseFilterHandler.java:25)
at io.quarkus.resteasy.reactive.server.runtime.QuarkusResteasyReactiveRequestContext.invokeHandler(QuarkusResteasyReactiveRequestContext.java:192)
at org.jboss.resteasy.reactive.common.core.AbstractResteasyReactiveContext.run(AbstractResteasyReactiveContext.java:147)
at io.quarkus.vertx.core.runtime.VertxCoreRecorder$15.runWith(VertxCoreRecorder.java:645)
at org.jboss.threads.EnhancedQueueExecutor$Task.doRunWith(EnhancedQueueExecutor.java:2651)
at org.jboss.threads.EnhancedQueueExecutor$Task.run(EnhancedQueueExecutor.java:2630)
at org.jboss.threads.EnhancedQueueExecutor.runThreadBody(EnhancedQueueExecutor.java:1622)
at org.jboss.threads.EnhancedQueueExecutor$ThreadBody.run(EnhancedQueueExecutor.java:1589)
at org.jboss.threads.DelegatingRunnable.run(DelegatingRunnable.java:11)
at org.jboss.threads.ThreadLocalResettingRunnable.run(ThreadLocalResettingRunnable.java:11)
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
at java.base/java.lang.Thread.run(Thread.java:1583)
Caused by: org.infinispan.commons.TimeoutException: ISPN000638: Timeout waiting for topology 20 transaction data
at org.infinispan.statetransfer.StateTransferLockImpl.transactionDataTimeoutException(StateTransferLockImpl.java:133)
at org.infinispan.statetransfer.StateTransferLockImpl.lambda$transactionDataFuture$1(StateTransferLockImpl.java:126)
at org.infinispan.util.concurrent.ConditionFuture.lambda$newConditionStage$0(ConditionFuture.java:60)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:317)
at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
at java.base/java.lang.VirtualThread.run(VirtualThread.java:329)
How to Reproduce?
1. Start 3 Keycloak nodes in Docker with KC_CACHE_STACK=jdbc-ping.
2. Confirm cluster formation (jgroups_ping table has 3 rows, logs show "new cluster view with 3 members").
3. Simulate network partition on one node demo-kc-01:
- ip route add blackhole 192.168.0.11
- ip route add blackhole 192.168.0.12
- ip route add blackhole 192.168.0.20
4. Wait for ~5 min.
5. Restore network: - ip route del blackhole 192.168.0.11
- ip route del blackhole 192.168.0.12
- ip route del blackhole 192.168.0.20
Observe logs: node sees DB, gets new view, but no merge/rejoin.
Anything else?
- FD_ALL3 timeout=40s, MERGE3 present in default config.
- No firewall blocks after restoration; direct TCP between nodes + DB works.
- links to