- Bug
- Resolution: Can't Do
- Normal
- None
- 4.13.z
- Moderate
- No
- False
Description of problem:
On an Azure Red Hat OpenShift cluster that was in the middle of upgrading from 4.13.40 to 4.14.21, the upgrade became stuck on the network CO. Upon investigation, I noticed that:

- A particular VM's openshift-ovn-kubernetes "ovnkube-node-xxxxx" Pod was unable to start due to the following error from the kube-scheduler shown in the Pod's Events:

  Warning FailedScheduling 102m default-scheduler 0/7 nodes are available: 1 Insufficient memory. preemption: not eligible due to a terminating pod on the nominated node..

- Sure enough, there was a Pod on the node stuck Terminating, with this warning in its Events:

  Warning FailedKillPod 89s (x140 over 106m) kubelet error killing pod: failed to "KillPodSandbox" for "REDACTED" with KillPodSandboxError: "rpc error: code = Unknown desc = failed to destroy network for pod sandbox REDACTED: error removing pod REDACTED from CNI network \"multus-cni-network\": plugin type=\"multus\" name=\"multus-cni-network\" failed (delete): Multus: [REDACTED]: PollImmediate error waiting for ReadinessIndicatorFile (on del): timed out waiting for the condition"

  This was a customer Pod in a customer Namespace, which is why some data is redacted from the message above.

- The same node's openshift-multus "multus-xxxxx" Pod was also unable to start, apparently because the openshift-ovn-kubernetes Pod was not ready.

Given my very limited knowledge of the network CO, this looked like it could be the result of a race condition. After I redeployed the affected node's underlying VM, all network-related Pods started up and the upgrade resumed progressing.
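For illustration only, here is a minimal client-go sketch of how one might programmatically spot a Pod stuck Terminating on the affected node. The kubeconfig path and the node name ("worker-0") are placeholders, not values from this cluster.

```go
package main

import (
	"context"
	"fmt"
	"os"
	"path/filepath"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Load kubeconfig from the default location; adjust as needed.
	kubeconfig := filepath.Join(os.Getenv("HOME"), ".kube", "config")
	config, err := clientcmd.BuildConfigFromFlags("", kubeconfig)
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	// Placeholder: substitute the node hosting the failing ovnkube-node Pod.
	nodeName := "worker-0"

	// List all Pods scheduled to that node, across namespaces.
	pods, err := clientset.CoreV1().Pods("").List(context.TODO(), metav1.ListOptions{
		FieldSelector: "spec.nodeName=" + nodeName,
	})
	if err != nil {
		panic(err)
	}

	// A Pod stuck Terminating has a deletion timestamp set but is still present.
	for _, pod := range pods.Items {
		if pod.DeletionTimestamp != nil {
			fmt.Printf("stuck terminating: %s/%s (deletion requested %s)\n",
				pod.Namespace, pod.Name, pod.DeletionTimestamp)
		}
	}
}
```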
Version-Release number of selected component (if applicable):
Unsure
How reproducible:
Unsure
Steps to Reproduce:
1. Upgrade an ARO cluster from 4.13 to 4.14, and wait until the upgrade progresses to the network CO.
2. Once the upgrade gets stuck on the network CO (if it does), check whether any particular node's openshift-ovn-kubernetes "ovnkube-node-xxxxx" Pod is unable to start and whether a Pod is stuck Terminating on the same node, with the same Event messages given in the problem description (a query sketch follows below).
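A minimal sketch, assuming the same client-go setup as the previous example, of how the Event check in step 2 could be done programmatically. The Namespace ("customer-ns") and Pod name ("customer-pod") are placeholders for the stuck Pod identified in step 2.

```go
package main

import (
	"context"
	"fmt"
	"os"
	"path/filepath"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	kubeconfig := filepath.Join(os.Getenv("HOME"), ".kube", "config")
	config, err := clientcmd.BuildConfigFromFlags("", kubeconfig)
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	// Placeholders for the Namespace and name of the Pod stuck Terminating.
	namespace, podName := "customer-ns", "customer-pod"

	// List Events referencing the Pod so warnings such as FailedScheduling or
	// FailedKillPod (including the Multus ReadinessIndicatorFile timeout) are visible.
	events, err := clientset.CoreV1().Events(namespace).List(context.TODO(), metav1.ListOptions{
		FieldSelector: "involvedObject.kind=Pod,involvedObject.name=" + podName,
	})
	if err != nil {
		panic(err)
	}
	for _, e := range events.Items {
		fmt.Printf("%s\t%s\t%s\n", e.Type, e.Reason, e.Message)
	}
}
```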
Actual results:
The cluster upgrade was stuck on the network CO and unable to progress until I redeployed the affected node's underlying VM.
Expected results:
The cluster upgrade proceeds without getting stuck on the network CO.
Additional info:
N/A