OCPBUGS-33059: Cluster upgrade stuck: openshift-ovn-kubernetes Pod unable to start due to Pod stuck Terminating on same node


      Description of problem:

      On an Azure Red Hat OpenShift (ARO) cluster that was in the middle of upgrading from 4.13.40 to 4.14.21, the upgrade became stuck on the network cluster operator (CO).
      
      Upon investigation, I noticed that:
      
      - A particular VM's openshift-ovn-kubernetes "ovnkube-node-xxxxx" Pod was unable to start due to the following error from the kube-scheduler shown in the Pod's Events:
      
      Warning  FailedScheduling  102m  default-scheduler  0/7 nodes are available: 1 Insufficient memory. preemption: not eligible due to a terminating pod on the nominated node..
      
      - Sure enough, there was a Pod on the node stuck Terminating, with this warning in its Events:
      
      Warning  FailedKillPod   89s (x140 over 106m)  kubelet            error killing pod: failed to "KillPodSandbox" for "REDACTED" with KillPodSandboxError: "rpc error: code = Unknown desc = failed to destroy network for pod sandbox REDACTED: error removing pod REDACTED from CNI network \"multus-cni-network\": plugin type=\"multus\" name=\"multus-cni-network\" failed (delete): Multus: [REDACTED]: PollImmediate error waiting for ReadinessIndicatorFile (on del): timed out waiting for the condition"
      
        This was a customer Pod in a customer Namespace, which is why some data is redacted from the message as I've typed it here.
      
      - The same node's openshift-multus "multus-xxxxx" Pod was also unable to start, apparently because the openshift-ovn-kubernetes Pod was not ready. (Example oc commands for confirming this state are sketched after this list.)
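
      For reference, the state described above can be checked with standard oc commands along these lines. This is only a rough sketch; <node-name>, <customer-namespace>, and <stuck-pod> are placeholders, not values from the affected cluster:

        # Confirm the network CO is the one stuck
        oc get co network

        # Find the affected node's ovnkube-node Pod and its FailedScheduling event
        oc -n openshift-ovn-kubernetes get pods -o wide | grep <node-name>
        oc -n openshift-ovn-kubernetes describe pod ovnkube-node-xxxxx

        # Find the Pod stuck Terminating on the same node and its FailedKillPod event
        oc get pods -A --field-selector spec.nodeName=<node-name> | grep Terminating
        oc -n <customer-namespace> describe pod <stuck-pod>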
      
      ---
      
      Given my very limited knowledge of the network CO, this seemed like it could have been the result of a race condition. As a workaround, I redeployed the affected node's underlying VM; after that, all network-related Pods started up and the upgrade resumed progressing (a rough sketch of the workaround follows).
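
      In case it is useful to anyone hitting the same issue, here is an approximation of that workaround, assuming Azure CLI access to the cluster resource group; <node-name>, <cluster-resource-group>, and <vm-name> are placeholders, and I am not certain the drain completes cleanly given the Pod stuck Terminating:

        # Cordon (and, if possible, drain) the affected node
        oc adm cordon <node-name>
        oc adm drain <node-name> --ignore-daemonsets --delete-emptydir-data --force

        # Redeploy the node's underlying Azure VM, then bring the node back into service
        az vm redeploy --resource-group <cluster-resource-group> --name <vm-name>
        oc adm uncordon <node-name>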

      Version-Release number of selected component (if applicable):

      Unsure

      How reproducible:

      Unsure

      Steps to Reproduce:

          1. Upgrade an ARO cluster from 4.13 to 4.14 (a sketch of the upgrade commands follows these steps), and wait until the upgrade progresses to the network CO.
          2. If the upgrade gets stuck on the network CO, check whether any particular node's openshift-ovn-kubernetes "ovnkube-node-xxxxx" Pod is unable to start while another Pod is stuck Terminating on the same node, with the same Event messages given in the problem description.
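
      For step 1, a minimal sketch of driving and watching the upgrade from the CLI (on ARO the upgrade may instead be driven through the managed upgrade tooling):

        oc adm upgrade --to=4.14.21    # or whichever 4.14.z is available on the cluster's channel
        oc get clusterversion -w       # watch overall upgrade progress
        oc get co network -w           # watch for the network CO getting stuck

      For step 2, the same oc commands sketched under the problem description can be used to check for the stuck ovnkube-node Pod and the Pod stuck Terminating.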
          

      Actual results:

      The cluster upgrade was stuck on the network CO and unable to progress until I redeployed the affected node's underlying VM.

      Expected results:

      The cluster upgrade proceeds without getting stuck on the network CO.

      Additional info:

      N/A
