Infinispan / ISPN-15191

Cache startup failures on individual nodes can cause other caches to enter DEGRADED mode on restart


      If a cache encounters a fatal exception on startup, the server stops all other caches and terminates with a FATAL error. However, those other caches are stopped with a plain call to EmbeddedCacheManager#stop, which means the caches' state is never persisted. Consequently, if another cache manages to form a cluster before the exception is thrown and it is configured with PartitionHandling.DENY_READ_WRITES, then after the node restarts it is never possible for the cluster to become AVAILABLE again, as the UUID of the restarted node differs from the original.
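
      A minimal embedded-API sketch of that distinction, using a hypothetical cache name and state location: Cache#shutdown performs a clustered, graceful shutdown that persists the cache state (provided global state is enabled), whereas EmbeddedCacheManager#stop on its own does not.

      import org.infinispan.Cache;
      import org.infinispan.configuration.cache.CacheMode;
      import org.infinispan.configuration.cache.ConfigurationBuilder;
      import org.infinispan.configuration.global.GlobalConfigurationBuilder;
      import org.infinispan.manager.DefaultCacheManager;
      import org.infinispan.manager.EmbeddedCacheManager;
      import org.infinispan.partitionhandling.PartitionHandling;

      public class GracefulShutdownSketch {
         public static void main(String[] args) {
            // Global state must be enabled for cache topology to survive a restart
            GlobalConfigurationBuilder global = GlobalConfigurationBuilder.defaultClusteredBuilder();
            global.globalState().enable().persistentLocation("/tmp/ispn-state"); // hypothetical location

            ConfigurationBuilder cfg = new ConfigurationBuilder();
            cfg.clustering().cacheMode(CacheMode.DIST_SYNC)
               .partitionHandling().whenSplit(PartitionHandling.DENY_READ_WRITES);

            EmbeddedCacheManager cm = new DefaultCacheManager(global.build());
            cm.defineConfiguration("example", cfg.build()); // hypothetical cache name
            Cache<String, String> cache = cm.getCache("example");

            // Graceful: persists the cache state so the same topology can be
            // restored when the node comes back with a new UUID.
            cache.shutdown();

            // Not graceful on its own: stopping the manager does not persist
            // per-cache state, which is what leaves DENY_READ_WRITES caches
            // permanently DEGRADED after the restart described above.
            cm.stop();
         }
      }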

      The internal org.infinispan.LOCKS cache utilises PartitionHandling.DENY_READ_WRITES, therefore any code attempting to utilise a ClusteredLock will fail even if the server starts up correctly after the restart.
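
      As a sketch of how this surfaces, assuming a hypothetical lock name: every ClusteredLock operation reads and writes the org.infinispan.LOCKS cache, so while that cache is DEGRADED the returned future completes exceptionally with an AvailabilityException like the ISPN000306 error shown below.

      import java.util.concurrent.TimeUnit;

      import org.infinispan.lock.EmbeddedClusteredLockManagerFactory;
      import org.infinispan.lock.api.ClusteredLock;
      import org.infinispan.lock.api.ClusteredLockManager;
      import org.infinispan.manager.EmbeddedCacheManager;

      public class ClusteredLockSketch {
         static void acquire(EmbeddedCacheManager cm) throws Exception {
            // Lock state lives in the internal org.infinispan.LOCKS cache
            ClusteredLockManager lockManager = EmbeddedClusteredLockManagerFactory.from(cm);
            lockManager.defineLock("example-lock"); // hypothetical lock name
            ClusteredLock lock = lockManager.get("example-lock");

            // While org.infinispan.LOCKS is DEGRADED, this get() throws an
            // ExecutionException wrapping an AvailabilityException (ISPN000306),
            // even though the server itself restarted successfully.
            if (lock.tryLock(1, TimeUnit.SECONDS).get()) {
               try {
                  // ... guarded work ...
               } finally {
                  lock.unlock().get();
               }
            }
         }
      }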

      This issue was encountered in the Operator testsuite: a single node was failing due to ISPN-15089, and k8s automatically restarts the server pod on failure. Once the cluster successfully forms, attempts to perform a Backup restore fail with:

      ISPN000136: Error executing command GetKeyValueCommand on Cache 'org.infinispan.LOCKS', writing keys [] org.infinispan.partitionhandling.AvailabilityException: ISPN000306: Key 'ClusteredLockKey{name=BackupManagerImpl-restore}' is not available. Not all owners are in this partition
      

      As a cache needs to lose at least half of its members, or all owners of a segment, a cluster is only affected by this issue if it meets one of the following (see the configuration sketch after this list):

      • The cluster only has 2 nodes
      • The cluster has > 2 nodes, but a cache has num_owners=1
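
      For the second case, a minimal sketch of a vulnerable cache configuration: with num_owners=1, the loss of any single node loses all owners of that node's segments, so even a larger cluster is pushed into DEGRADED mode.

      import org.infinispan.configuration.cache.CacheMode;
      import org.infinispan.configuration.cache.Configuration;
      import org.infinispan.configuration.cache.ConfigurationBuilder;
      import org.infinispan.partitionhandling.PartitionHandling;

      public class SingleOwnerConfigSketch {
         // Each segment has exactly one owner, so a single crashed node is enough
         // to lose all owners of its segments and degrade the cache
         static Configuration singleOwnerConfig() {
            return new ConfigurationBuilder()
                  .clustering().cacheMode(CacheMode.DIST_SYNC)
                  .hash().numOwners(1)
                  .partitionHandling().whenSplit(PartitionHandling.DENY_READ_WRITES)
                  .build();
         }
      }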

      Assignee: Jose Bolina
      Reporter: Ryan Emerson
