Type: Task
Resolution: Done
Priority: Major
Original issue discussion:
On Fri, Apr 22, 2022 at 4:49 AM Ali Bokhari <abokhari@redhat.com> wrote:
Hello team,
I finally executed the test case where we fail one of the nodes (power it off) in our 3-node OCP cluster that hosts the OpenStack control plane. The results were not what I expected, so I wanted to check whether you have tried this scenario or have any thoughts on what I observed:
- I failed the node that had the OpenStackClient pod running on it. It took about 5 minutes for the client to be restarted on one of the other nodes; I think the actual failure was detected after about 3 minutes.
- That seems like a long time to detect the failure.
Check the node conditions section [1] for the behavior when a node is down.
~~~
The default eviction timeout duration is five minutes. In some cases when the node is unreachable, the API server is unable to communicate with the kubelet on the node. The decision to delete the pods cannot be communicated to the kubelet until communication with the API server is re-established. In the meantime, the pods that are scheduled for deletion may continue to run on the partitioned node.
~~~
[1] https://kubernetes.io/docs/concepts/architecture/nodes/#condition
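To put rough numbers on that, here is a minimal sketch (mine, not part of the original test) that uses the Python kubernetes client to watch each node's Ready condition. The Ready status flips to Unknown once the node controller stops hearing from the kubelet (node-monitor-grace-period, roughly 40 seconds by default), while pod eviction only starts after the five-minute eviction timeout quoted above. The polling interval and kubeconfig-based access are assumptions.
~~~
# Minimal sketch, not from the original report: watch node Ready conditions
# to see when a powered-off node is actually reported as NotReady/Unknown.
import time
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside the cluster
v1 = client.CoreV1Api()

while True:
    for node in v1.list_node().items:
        ready = next((c for c in node.status.conditions if c.type == "Ready"), None)
        if ready is not None and ready.status != "True":
            # "Unknown" means the node controller has stopped hearing from the
            # kubelet; pod eviction only starts after the eviction timeout.
            print(f"{node.metadata.name}: Ready={ready.status} "
                  f"since {ready.last_transition_time}")
    time.sleep(10)
~~~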
- The controller VM that was running on the failed node was marked for termination, but even after waiting 20 minutes the pod was not terminated. Since the pod was not terminated, a new pod/VMI was not spawned.
- It seemed like the pod was waiting for some event before it could be terminated, but I could not figure out what it was waiting for.
- Finally, I powered the failed node back on, and a little bit after that the pod was terminated and a new instance was created on one of the 2 nodes that had not failed.
~~~
The node controller does not force delete pods until it is confirmed that they have stopped running in the cluster. You can see the pods that might be running on an unreachable node as being in the Terminating or Unknown state. In cases where Kubernetes cannot deduce from the underlying infrastructure if a node has permanently left a cluster, the cluster administrator may need to delete the node object by hand. Deleting the node object from Kubernetes causes all the Pod objects running on the node to be deleted from the API server and frees up their names.
~~~
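As a hedged illustration of that last point: if the node is known to be permanently gone, deleting the node object (or force-deleting the stuck pod) is what unblocks rescheduling. The sketch below uses the Python kubernetes client; the node, pod, and namespace names are hypothetical, and this should only be done when you are certain the workload is not still running on the partitioned node.
~~~
# Sketch only, with hypothetical names: force the cleanup that the node
# controller will not do on its own for an unreachable node.
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

# Option 1: delete the node object; all Pod objects bound to it are then
# removed from the API server (only safe if the node is really gone).
v1.delete_node(name="worker-2")                      # hypothetical node name

# Option 2: force-delete just the stuck Terminating pod, skipping the
# graceful termination that the unreachable kubelet can never acknowledge.
v1.delete_namespaced_pod(
    name="virt-launcher-controller-0-abcde",         # hypothetical pod name
    namespace="openstack",                           # hypothetical namespace
    grace_period_seconds=0,
)
~~~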
- This final one is the most critical. By chance, the node that I failed was hosting the controller VM with the active VIP configured on it. The VM went down, but the VIP did not move to one of the remaining 2 controller VMs. As a result, the overcloud API was not available.
- I had to power the OCP node back on and then wait for the VM to be re-created. Once the VM was re-created, the VIP came back alive on that same VM and the overcloud API started to work again.
- I would have expected the VIP to be moved to one of the remaining 2 VMs.
I have not done a test like this myself. Was fencing enabled in the deployment? What was fence-kubevirt reporting in that situation? If it could not fence the VM on the down worker node, Pacemaker will not move the service, as it does not know whether it is safe to do so without confirmation that the VM was successfully fenced.
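For context on what the fence agent would need to see, here is a rough sketch, assuming fence-kubevirt acts through the KubeVirt API: checking the VirtualMachineInstance status directly shows roughly what has to be confirmed before Pacemaker considers the peer safely down. The namespace and VMI name below are hypothetical.
~~~
# Rough sketch, assuming fence-kubevirt acts through the KubeVirt API;
# namespace and VMI name below are hypothetical.
from kubernetes import client, config

config.load_kube_config()
custom = client.CustomObjectsApi()

vmi = custom.get_namespaced_custom_object(
    group="kubevirt.io",
    version="v1",
    namespace="openstack",               # hypothetical namespace
    plural="virtualmachineinstances",
    name="controller-0",                 # hypothetical VMI name
)
# Until something can confirm this VM is actually off, Pacemaker has to treat
# the peer's state as unknown and will not relocate the VIP.
print(vmi["status"].get("phase"), vmi["status"].get("nodeName"))
~~~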