-
Bug
-
Resolution: Done
-
Major
Problem
During a GracefulShutdown, the Operator does the following:
1. Ping the server to determine its version so that the correct client can be created
2. Disable global rebalancing via the REST endpoint by calling the -0 pod
3. For each pod in the cluster, call the shutdown endpoint
If the Operator's progress is interrupted* during step 3 (the per-pod shutdown calls), subsequent attempts to perform the GracefulShutdown can fail because a pod's cache-container has already been shut down.
Furthermore, the pod list is not guaranteed to be in the same order each time, which adds further non-determinism.
*Progress may be interrupted because the Operator pod is restarted or rescheduled, or because the server returns an unexpected error.
Solution
- When attempting the GracefulShutdown, we should continue to the next pod if the current pod returns an error response indicating that its cache-container has already been stopped. We should log an appropriate message indicating that the pod has already been stopped.
- We should make sure that all error logs associated with pod-specific requests include the name of the pod, to ease future debugging.
- We should ensure that the order of pod names returned by ctx.InfinispanPods() is deterministic, sorted from the lowest to highest ordinal.