Uploaded image for project: 'Infinispan'
  1. Infinispan
  2. ISPN-13538

Rebalance confirmation gets lost after node leave

    XMLWordPrintable

Details

    Description

      Cluster with 3 nodes: A, B, C, starting topology 10
      NodeD joins, rebalance starts with topology 11
      While NodeD is installing cluster listeners, B leaves

      C confirms the rebalance with topology 11, which included B
      A already has topology 12, without B, and rejects, C's confirmation
      B then receives topology 12 but doesn't confirm it because it's supposed to be confirmed already

      The issue causes random failures in ClusterListenerDistAddListenerTest.testMemberJoinsAndRetrievesClusterListenersButMainListenerNodeDiesBeforeInstalled:

      10:14:12,996 ERROR (testng-Test:[]) [TestSuiteProgress] Test failed: org.infinispan.notifications.cachelistener.cluster.ClusterListenerDistAddListenerTest.testMemberJoinsAndRetrievesClusterListenersButMainListenerNodeDiesBeforeInstalled
      org.infinispan.commons.CacheException: Initial state transfer timed out for cache cluster-listener on D
      	at org.infinispan.statetransfer.StateTransferManagerImpl.waitForInitialStateTransferToComplete(StateTransferManagerImpl.java:248) ~[classes/:?]
      	at org.infinispan.cache.impl.CacheImpl.start(CacheImpl.java:1017) ~[classes/:?]
      	at org.infinispan.cache.impl.AbstractDelegatingCache.start(AbstractDelegatingCache.java:512) ~[classes/:?]
      	at org.infinispan.manager.DefaultCacheManager.wireAndStartCache(DefaultCacheManager.java:700) ~[classes/:?]
      	at org.infinispan.manager.DefaultCacheManager.createCache(DefaultCacheManager.java:646) ~[classes/:?]
      	at org.infinispan.manager.DefaultCacheManager.internalGetCache(DefaultCacheManager.java:535) ~[classes/:?]
      	at org.infinispan.manager.DefaultCacheManager.getCache(DefaultCacheManager.java:513) ~[classes/:?]
      	at org.infinispan.test.MultipleCacheManagersTest.cache(MultipleCacheManagersTest.java:530) ~[test-classes/:?]
      	at org.infinispan.notifications.cachelistener.cluster.AbstractClusterListenerDistAddListenerTest.lambda$testMemberJoinsAndRetrievesClusterListenersButMainListenerNodeDiesBeforeInstalled$2(AbstractClusterListenerDistAddListenerTest.java:165) ~[test-classes/:?]
      
      10:10:12,997 TRACE (non-blocking-thread-C-p30645-t4:[]) [JGroupsTransport] C sending command to A: ConfirmRebalancePhaseCommand{cacheName='cluster-listener', origin=C, throwable=null, topologyId=11, viewId=3}
      10:10:12,997 DEBUG (non-blocking-thread-C-p30645-t4:[]) [LocalTopologyManagerImpl] Updating local topology for cache cluster-listener: CacheTopology{id=12, phase=READ_OLD_WRITE_ALL, rebalanceId=5, currentCH=DefaultConsistentHash{ns=256, owners = (2)[A: 132+34, C: 124+48]}, pendingCH=DefaultConsistentHash{ns=256, owners = (3)[A: 83+91, C: 87+85, D: 86+80]}, unionCH=null, actualMembers=[A, C, D], persistentUUIDs=[11c24a2d-77d1-448d-9ca0-1367b87fa9e8, 8012efb8-5aac-4092-b85f-0d49d12fecdf, 3db53eb6-67c7-48bd-9156-7d094fb2c0f2]}
      10:10:12,997 TRACE (jgroups-7,A:[]) [JGroupsTransport] A received command from C: ConfirmRebalancePhaseCommand{cacheName='cluster-listener', origin=null, throwable=null, topologyId=11, viewId=3}
      10:10:12,997 TRACE (jgroups-7,A:[]) [GlobalInboundInvocationHandler] Attempting to execute non-CacheRpcCommand: ConfirmRebalancePhaseCommand{cacheName='cluster-listener', origin=C, throwable=null, topologyId=11, viewId=3} [sender=C]
      10:10:12,997 DEBUG (jgroups-7,A:[]) [ClusterCacheStatus] Ignoring rebalance confirmation from C for cache cluster-listener because the topology id is old (11, expected 12)
      10:10:12,997 DEBUG (non-blocking-thread-C-p30645-t4:[]) [LocalTopologyManagerImpl] Starting local rebalance for cache cluster-listener, topology = CacheTopology{id=12, phase=READ_OLD_WRITE_ALL, rebalanceId=5, currentCH=DefaultConsistentHash{ns=256, owners = (2)[A: 132+34, C: 124+48]}, pendingCH=DefaultConsistentHash{ns=256, owners = (3)[A: 83+91, C: 87+85, D: 86+80]}, unionCH=null, actualMembers=[A, C, D], persistentUUIDs=[11c24a2d-77d1-448d-9ca0-1367b87fa9e8, 8012efb8-5aac-4092-b85f-0d49d12fecdf, 3db53eb6-67c7-48bd-9156-7d094fb2c0f2]}
      10:10:12,998 TRACE (non-blocking-thread-C-p30645-t4:[]) [StateConsumerImpl] Received new topology for cache cluster-listener, isRebalance = true, isMember = true, topology = CacheTopology{id=12, phase=READ_OLD_WRITE_ALL, rebalanceId=5, currentCH=DefaultConsistentHash{ns=256, owners = (2)[A: 132+34, C: 124+48]}, pendingCH=DefaultConsistentHash{ns=256, owners = (3)[A: 83+91, C: 87+85, D: 86+80]}, unionCH=DefaultConsistentHash{ns=256, owners = (3)[A: 132+62, C: 124+79, D: 0+166]}, actualMembers=[A, C, D], persistentUUIDs=[11c24a2d-77d1-448d-9ca0-1367b87fa9e8, 8012efb8-5aac-4092-b85f-0d49d12fecdf, 3db53eb6-67c7-48bd-9156-7d094fb2c0f2]}
      10:10:12,998 TRACE (non-blocking-thread-C-p30645-t4:[]) [StateConsumerImpl] On cache cluster-listener we have: new segments: {0-76 97-117 138-180 187-190 193-198 204-255}; old segments: {0-76 97-110 138-167 202 204-255}
      10:10:12,998 TRACE (non-blocking-thread-C-p30645-t4:[]) [StateConsumerImpl] On cache cluster-listener we have: added segments: {111-117 168-180 187-190 193-198}; removed segments: {202}
      10:10:12,998 TRACE (jgroups-8,C:[]) [StateConsumerImpl] Inbound transfer finished: InboundTransferTask{segments={145 202}, unfinishedSegments={}, source=B, isCancelled=true, completionFuture=java.util.concurrent.CompletableFuture@1529aa6a[Completed normally], topologyId=10, timeout=240000, cacheName=cluster-listener}
      10:10:12,998 DEBUG (jgroups-8,C:[]) [StateConsumerImpl] Finished receiving of segments for cache cluster-listener for topology 10.
      10:10:12,998 TRACE (non-blocking-thread-C-p30645-t4:[]) [StateConsumerImpl] No end of state transfer notification, waitingForState already set to false by another thread
      10:10:12,998 TRACE (jgroups-8,C:[]) [StateConsumerImpl] No end of state transfer notification, waitingForState already set to false by another thread
      

      Attachments

        Activity

          People

            Unassigned Unassigned
            dberinde@redhat.com Dan Berindei (Inactive)
            Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

              Created:
              Updated: