Red Hat Data Grid: JDG-7637

GracefulShutdown upgrades should tolerate pods that have already been stopped


      Problem

      During a GracefulShutdown the Operator does the following:

      1. Ping the server to determine the version so that we can create the correct client
      2. Disable global rebalancing via the REST endpoint by calling the -0 pod
      3. For each pod in the cluster, call the shutdown endpoint

      If the Operator's progress is interrupted* during step 3, subsequent attempts to perform the GracefulShutdown can fail because the pod's cache-container has already been shut down.

      Furthermore, the pod list is not guaranteed to be in the same order on each attempt, which introduces further non-determinism.

      *Progress may be interrupted by the Operator pod being restarted or rescheduled, or by an unexpected error from the server.
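The three steps above can be sketched as a single sequential procedure. This is a minimal illustration, not the Operator's actual code: the `Client` interface and all function names are hypothetical stand-ins for the REST calls described above.

```go
package main

import "fmt"

// Client abstracts the REST calls the Operator performs during a
// GracefulShutdown. All names here are hypothetical illustrations,
// not the Operator's actual API.
type Client interface {
	ServerVersion() (string, error)      // step 1: ping to create the correct client
	DisableRebalancing() error           // step 2: REST call against the -0 pod
	ShutdownContainer(pod string) error  // step 3: per-pod shutdown endpoint
}

// gracefulShutdown performs the three steps in order. Note that any
// per-pod error aborts the loop, so a retry re-runs step 3 from the
// beginning against pods that may already be stopped -- the failure
// mode this issue describes.
func gracefulShutdown(c Client, pods []string) error {
	if _, err := c.ServerVersion(); err != nil {
		return fmt.Errorf("unable to determine server version: %w", err)
	}
	if err := c.DisableRebalancing(); err != nil {
		return fmt.Errorf("unable to disable rebalancing: %w", err)
	}
	for _, pod := range pods {
		if err := c.ShutdownContainer(pod); err != nil {
			return fmt.Errorf("pod %s: shutdown failed: %w", pod, err)
		}
	}
	return nil
}

// fakeClient is a test double that records which pods were shut down.
type fakeClient struct{ stopped []string }

func (f *fakeClient) ServerVersion() (string, error) { return "14.0", nil }
func (f *fakeClient) DisableRebalancing() error      { return nil }
func (f *fakeClient) ShutdownContainer(pod string) error {
	f.stopped = append(f.stopped, pod)
	return nil
}

func main() {
	f := &fakeClient{}
	if err := gracefulShutdown(f, []string{"infinispan-0", "infinispan-1"}); err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("stopped:", f.stopped)
}
```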

      Solution

      1. When attempting the GracefulShutdown, we should continue to the next pod if a pod returns an error response indicating that its cache-container has already been stopped, and emit an appropriate log message stating that the pod was already stopped.
      2. We should make sure that all error logs associated with pod-specific requests include the name of the pod, to ease future debugging.
      3. We should ensure that the order of pod names returned by ctx.InfinispanPods() is deterministic, sorted from the lowest to highest ordinal.
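The three solution points can be sketched together: a deterministic sort by pod ordinal, and a shutdown loop that skips already-stopped pods while naming the pod in every log and error. This is a hedged sketch, not the Operator's implementation; `errContainerStopped` is a hypothetical sentinel standing in for the server's "already stopped" error response, and the real code would inspect the REST response instead.

```go
package main

import (
	"errors"
	"fmt"
	"sort"
	"strconv"
	"strings"
)

// errContainerStopped stands in for the server's error response when its
// cache-container has already been shut down (hypothetical sentinel).
var errContainerStopped = errors.New("cache-container already stopped")

// sortByOrdinal orders StatefulSet pod names (name-0, name-1, ...) from
// lowest to highest ordinal, making the shutdown order deterministic
// regardless of how the pod list was produced.
func sortByOrdinal(pods []string) {
	ordinal := func(name string) int {
		n, _ := strconv.Atoi(name[strings.LastIndex(name, "-")+1:])
		return n
	}
	sort.Slice(pods, func(i, j int) bool { return ordinal(pods[i]) < ordinal(pods[j]) })
}

// shutdownPods shuts down every pod in ordinal order, continuing past
// pods whose cache-container is already stopped and including the pod
// name in every log and error message.
func shutdownPods(pods []string, shutdown func(pod string) error) error {
	sortByOrdinal(pods)
	for _, pod := range pods {
		switch err := shutdown(pod); {
		case err == nil:
		case errors.Is(err, errContainerStopped):
			fmt.Printf("pod %s already shutdown, skipping\n", pod)
		default:
			return fmt.Errorf("pod %s: shutdown failed: %w", pod, err)
		}
	}
	return nil
}

func main() {
	// Simulate a retry where infinispan-0 was stopped on a previous attempt.
	pods := []string{"infinispan-2", "infinispan-0", "infinispan-1"}
	err := shutdownPods(pods, func(pod string) error {
		if pod == "infinispan-0" {
			return errContainerStopped
		}
		return nil
	})
	fmt.Println("pods:", pods, "err:", err)
}
```

With this shape, a retried GracefulShutdown walks the same pod order every time and treats "already stopped" as progress rather than failure.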

              Ryan Emerson (remerson@redhat.com)
              Alan Field (rhn-support-afield)
              Pavel Drobek
