Uploaded image for project: 'Infinispan'
  1. Infinispan
  2. ISPN-9187

Lost segments logged when node leaves during rebalance

    XMLWordPrintable

Details

    • Bug
    • Status: Closed (View Workflow)
    • Major
    • Resolution: Done
    • 9.2.3.Final, 9.3.0.Beta1
    • 9.3.0.Final
    • Core
    • None

    Description

      When a node leaves during rebalance, we remove the leaver from the current CH and from the pending CH with ConsistentHashFactory.updateMembers(). However, since the 4-phase rebalance changes joiners are also removed from the pending CH.

      In the following log numOwners=2 and pending CH had one segment (239) owned by a joiner (D) and a leaver (B). The updated pending CH doesn't have either B or D, this means both owners are lost and random owners (C and A) are elected. A sees that it was allocated new segments from updateMembers and logs that the segment has been lost:

      08:53:01,528 INFO  (remote-thread-test-NodeA-p2-t6:[cluster-listener]) [CLUSTER] [Context=cluster-listener] ISPN100002: Starting rebalance with members [test-NodeA-15001, test-NodeB-55628, test-NodeC-62395, test-NodeD-29215], phase READ_OLD_WRITE_ALL, topology id 10
      08:53:01,554 TRACE (transport-thread-test-NodeD-p28-t2:[Topology-cluster-listener]) [CacheTopology] Current consistent hash's routing table: test-NodeA-15001 primary: {... 237-238 244-245 248-250 252 254}, backup: {... 235-236 243 246-247 251 253}
        test-NodeB-55628 primary: {... 231 234 240 242}, backup: {... 232-233 239 241 255}
        test-NodeC-62395 primary: {... 232-233 235-236 239 241 243 246-247 251 253 255}, backup: {... 231 234 237-238 240 242 244-245 248-250 252 254}
      08:53:01,554 TRACE (transport-thread-test-NodeD-p28-t2:[Topology-cluster-listener]) [CacheTopology] Pending consistent hash's routing table: test-NodeA-15001 primary: {... 237-238 245 248 252}, backup: {... 235-236 244 246-247 251}
        test-NodeB-55628 primary: {... 231 240 242}, backup: {... 230 239 241}
        test-NodeC-62395 primary: {... 232-233 235-236 241 243 246-247 251 253 255}, backup: {... 231 234 240 242 245 249-250 252 254}
        test-NodeD-29215 primary: {... 234 239 244 249-250 254}, backup: {... 232-233 237-238 243 248 253 255}
      
      08:53:01,606 TRACE (remote-thread-test-NodeA-p2-t5:[cluster-listener]) [ClusterCacheStatus] Removed node test-NodeB-55628 from cache cluster-listener: members = [test-NodeA-15001, test-NodeC-62395, test-NodeD-29215], joiners = [test-NodeD-29215]
      08:53:01,611 TRACE (remote-thread-test-NodeA-p2-t5:[cluster-listener]) [CacheTopology] Current consistent hash's routing table: test-NodeA-15001 primary: {... 237-238 244-245 248-250 252 254}, backup: {... 235-236 243 246-247 251 253}
        test-NodeC-62395 primary: {... 230-236 239-243 246-247 251 253 255}, backup: {... 237-238 244-245 248-250 252 254}
      08:53:01,611 TRACE (remote-thread-test-NodeA-p2-t5:[cluster-listener]) [CacheTopology] Pending consistent hash's routing table: test-NodeA-15001 primary: {... 237-238 244-245 248 252}, backup: {... 235-236 239 246-247 251}
        test-NodeC-62395 primary: {... 227-236 239-243 246-247 249-251 253-255}, backup: {... 226 245 252}
      
      08:53:01,613 TRACE (transport-thread-test-NodeA-p4-t1:[Topology-cluster-listener]) [StateTransferManagerImpl] Installing new cache topology CacheTopology{id=11, phase=READ_OLD_WRITE_ALL, rebalanceId=4, currentCH=DefaultConsistentHash{ns=256, owners = (2)[test-NodeA-15001: 134+45, test-NodeC-62395: 122+50]}, pendingCH=DefaultConsistentHash{ns=256, owners = (2)[test-NodeA-15001: 129+45, test-NodeC-62395: 127+28]}, unionCH=DefaultConsistentHash{ns=256, owners = (2)[test-NodeA-15001: 134+62, test-NodeC-62395: 122+68]}, actualMembers=[test-NodeA-15001, test-NodeC-62395, test-NodeD-29215], persistentUUIDs=[0506cc27-9762-4703-ad56-6a3bf7953529, c365b93f-e46c-4f11-ab46-6cafa2b2d92b, d3f21b0d-07f2-4089-b160-f754e719de83]} on cache cluster-listener
      08:53:01,662 TRACE (transport-thread-test-NodeA-p4-t1:[Topology-cluster-listener]) [StateConsumerImpl] On cache cluster-listener we have: new segments: {1-18 21 30-53 56-63 67 69-73 75-78 81-85 88-101 107-115 117-118 125-134 136-139 142-156 158-165 168-170 172-174 177-179 181-188 193-211 214-226 228-229 235-239 243-254}; old segments: {1-15 30-53 56-63 69-73 75-77 81-85 88-101 107-115 117-118 125-131 136-139 142-145 150-156 158-165 168-170 172-174 177-179 183-188 193-211 214-218 220-226 228-229 235-238 243-254}
      08:53:01,663 TRACE (transport-thread-test-NodeA-p4-t1:[Topology-cluster-listener]) [StateConsumerImpl] On cache cluster-listener we have: added segments: {16-18 21 67 78 132-134 146-149 181-182 219 239}; removed segments: {}
      08:53:01,663 DEBUG (transport-thread-test-NodeA-p4-t1:[Topology-cluster-listener]) [StateConsumerImpl] Not requesting segments {16-18 21 67 78 132-134 146-149 181-182 219 239} because the last owner left the cluster
      

      There isn't any visible inconsistency: A only owns segment 239 for writing, and the coordinator immediately starts a new rebalance, ignoring the pending CH that it sent out earlier. However, the new rebalance causes its own problems: ISPN-8240.

      Attachments

        Issue Links

          Activity

            People

              dberinde@redhat.com Dan Berindei
              dberinde@redhat.com Dan Berindei
              Votes:
              0 Vote for this issue
              Watchers:
              1 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: