[OCPBUGS-14355] [4.11.z] [GCP] worker node with Sriov operator installed fails to come up online after reboot - Red Hat Issue Tracker

Type: Bug
Resolution: Done
Priority: Major
Fix Version/s: None
Affects Version/s: 4.11.z
Component/s: Networking / SR-IOV
Labels:
None

Regression:
No
Story Points:
3
Sprint:
NHE Sprint 237
sprint_count:
1
Blocked:
False
Blocked Reason:

None
Target Version:

4.11.z

SFDC Cases Counter:
SFDC Cases Open:
SFDC Cases Links:

This is a clone of issue ~~OCPBUGS-13284~~. The following is the description of the original issue:
—
This is a clone of issue ~~OCPBUGS-7039~~. The following is the description of the original issue:
—
Description of problem:

After the SR-IOV operator is installed, a worker node that is rebooted gracefully never comes back online and stays stuck in the "NotReady,SchedulingDisabled" state. The issue only happens on GCP.

oc get nodes
NAME                                       STATUS                        ROLES                  AGE     VERSION
ci-ln-yn53m6t-72292-5fnnc-master-0         Ready                         control-plane,master   3h18m   v1.26.0+083e3f3
ci-ln-yn53m6t-72292-5fnnc-master-1         Ready                         control-plane,master   3h18m   v1.26.0+083e3f3
ci-ln-yn53m6t-72292-5fnnc-master-2         Ready                         control-plane,master   3h19m   v1.26.0+083e3f3
ci-ln-yn53m6t-72292-5fnnc-worker-a-nvqrj   NotReady,SchedulingDisabled   worker                 3h7m    v1.26.0+083e3f3
ci-ln-yn53m6t-72292-5fnnc-worker-b-knwfn   Ready                         worker                 3h7m    v1.26.0+083e3f3
ci-ln-yn53m6t-72292-5fnnc-worker-c-npfrb   Ready                         worker                 3h7m    v1.26.0+083e3f3
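
A minimal sketch for picking the affected node out of output like the listing above. The helper name `not_ready_nodes` is hypothetical, and the filter assumes the default `oc get nodes` column layout (NAME first, STATUS second):

```shell
# Hypothetical helper: reads `oc get nodes` output on stdin and prints
# the names of nodes whose STATUS column does not start with "Ready".
not_ready_nodes() {
  # NR > 1 skips the header row; $2 is the STATUS column.
  awk 'NR > 1 && $2 !~ /^Ready/ { print $1 }'
}

# Usage (needs a logged-in oc session):
#   oc get nodes | not_ready_nodes
```

A node that is cordoned but healthy ("Ready,SchedulingDisabled") is deliberately not reported, since its status still starts with "Ready".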

Version-Release number of selected component (if applicable):

How reproducible:

100%

Steps to Reproduce:

1. Use Cluster Bot to bring up an OCP cluster with 3 masters and 3 workers on GCP, using the latest 4.13 CI build.
   - Send the message "launch ci gcp,ovn" to Cluster Bot on Slack.
2. Install the SR-IOV operator via OLM and wait for the installation to complete.
3. Gracefully reboot a worker node:
   - oc adm cordon <node>
   - oc adm drain <node> --ignore-daemonsets --delete-emptydir-data --force
   - oc debug node/<node>
   - chroot /host
   - systemctl reboot
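
The reboot in step 3 can be sketched as a small script. This is an illustration, not part of the original report: the function names `run` and `graceful_reboot` and the `DRY_RUN` variable are hypothetical, and the interactive `oc debug` / `chroot /host` / `systemctl reboot` sequence is collapsed into one non-interactive `oc debug ... -- <command>` call on the assumption that it behaves the same as typing the commands in the debug shell.

```shell
# Hypothetical wrapper: with DRY_RUN=1 the command is printed, not executed.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Hypothetical helper: cordon, drain, then reboot a single node.
graceful_reboot() {
  local node="${1:-}"
  if [ -z "$node" ]; then
    echo "usage: graceful_reboot <node>" >&2
    return 1
  fi
  # Mark the node unschedulable.
  run oc adm cordon "$node"
  # Evict workloads, using the same flags as in the steps above.
  run oc adm drain "$node" --ignore-daemonsets --delete-emptydir-data --force
  # Start a debug pod on the node and reboot the host from inside it.
  run oc debug "node/$node" -- chroot /host systemctl reboot
}

# Usage: preview the commands first, then run them for real.
#   DRY_RUN=1 graceful_reboot <node>
#   graceful_reboot <node>
```

The dry-run mode is only there so the command sequence can be reviewed before it touches a cluster.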

Actual results:

The worker node fails to become Ready.

Expected results:

The worker node is rebooted and becomes Ready.

Additional info:

We were running the ran profile CI in the cnf-features-deploy repo on GCP, and it had been broken for a while because the worker nodes couldn't come back online after reboots triggered by MCs. After investigation and more testing with the Cluster Bot tool, which gives the same environment as Prow/CI, the problem appears to be related to the SR-IOV operator on GCP: without SR-IOV, applying MCs and rebooting works without issue. We ended up switching to AWS to unblock our CI.

More gathered info of the failed cluster is available here:
https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-origin-installer-launch-gcp-modern/1621581330145873920/artifacts/launch/

blocks

OCPBUGS-14359 [4.10.z] [GCP] worker node with Sriov operator installed fails to come up online after reboot

Closed

is blocked by

OCPBUGS-14353 [4.12.z] [GCP] worker node with Sriov operator installed fails to come up online after reboot

Closed

is cloned by

OCPBUGS-14359 [4.10.z] [GCP] worker node with Sriov operator installed fails to come up online after reboot

Closed

links to

openshift/sriov-network-operator#783: [release-4.11] OCPBUGS-14355: Skip the Redhat virtual nic from udev rule

Assignee: William Zhao

Reporter: OpenShift Prow Bot

QA Contact: Zhanqi Zhao

Votes: 0

Watchers: 5

Created: 2023/05/31 2:34 PM

Updated: 2023/07/06 2:22 AM

Resolved: 2023/07/06 2:22 AM
