Type: Bug
Resolution: Done-Errata
Priority: Undefined
Fix Version/s: None
Affects Version/s: 4.13
Component/s: Networking / SR-IOV
Labels:
- closed-loop-ecosys

Activity Type:
Quality / Stability / Reliability
Blocked:
False
Blocked Reason:

None
Story Points:
None
Severity:
None
Regression:
None

Target Backport Versions:
None
Target Version:

4.14.0
Release Blocker:
None
Sprint:
CNF Network Sprint 232, CNF Network Sprint 233, CNF Network Sprint 235, CNF Network Sprint 236
sprint_count:
4

Customer Impact:

Customer Facing
Internal Whiteboard:
RH Private Keywords:

SFDC Cases Counter:
SFDC Cases Open:
SFDC Cases Links:

PX Review Complete:
PX Priority Data:
PX Technical Impact:
PX Impact Range:

Release Note Status:
None
Release Note Type:
None
Release Note Text:
None

Escape Reason:
None
Escape Impact:
None
Corrective Measures:
None
SDLC stage when should've been found:
None

Description of problem:

After installing the SR-IOV operator, gracefully rebooting a worker node leaves the node stuck in the "NotReady,SchedulingDisabled" state; it never comes back online. The issue only happens on GCP.

oc get nodes
NAME                                       STATUS                        ROLES                  AGE     VERSION
ci-ln-yn53m6t-72292-5fnnc-master-0         Ready                         control-plane,master   3h18m   v1.26.0+083e3f3
ci-ln-yn53m6t-72292-5fnnc-master-1         Ready                         control-plane,master   3h18m   v1.26.0+083e3f3
ci-ln-yn53m6t-72292-5fnnc-master-2         Ready                         control-plane,master   3h19m   v1.26.0+083e3f3
ci-ln-yn53m6t-72292-5fnnc-worker-a-nvqrj   NotReady,SchedulingDisabled   worker                 3h7m    v1.26.0+083e3f3
ci-ln-yn53m6t-72292-5fnnc-worker-b-knwfn   Ready                         worker                 3h7m    v1.26.0+083e3f3
ci-ln-yn53m6t-72292-5fnnc-worker-c-npfrb   Ready                         worker                 3h7m    v1.26.0+083e3f3
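
To pick the affected node out of output like the listing above, a small filter may help. This is a minimal sketch and not part of the original report: the function name `not_ready_nodes` is hypothetical, and it assumes the default `oc get nodes` column layout (header row, NAME in column 1, STATUS in column 2).

```shell
# Hypothetical helper: print the names of nodes whose STATUS column is
# anything other than plain "Ready" (this also matches cordoned nodes such
# as "Ready,SchedulingDisabled").
# Reads `oc get nodes` output on stdin; skips the header row and blank lines.
not_ready_nodes() {
  awk 'NR > 1 && NF >= 2 && $2 != "Ready" { print $1 }'
}

# Usage (requires a logged-in oc session):
#   oc get nodes | not_ready_nodes
```

Run against the listing above, this should print only the `worker-a` node.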

Version-Release number of selected component (if applicable):

How reproducible:

100%

Steps to Reproduce:

1. Use Cluster Bot to bring up an OCP cluster with 3 masters and 3 workers on GCP, using the latest 4.13 CI build.
   - Send message "launch ci gcp,ovn" to Cluster Bot on Slack.
2. Install the SR-IOV operator via OLM and wait for the installation to complete.
3. Gracefully reboot a worker node
   - oc adm cordon <node>
   - oc adm drain <node> --ignore-daemonsets --delete-emptydir-data --force
   - oc debug node/<node>
   - chroot /host
   - systemctl reboot
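
For repeated reproduction runs, the graceful-reboot steps above can be sketched as a single function. This is an illustrative sketch, not the reporter's tooling: the function name `graceful_reboot`, the `REBOOT_GRACE_SECONDS` variable, the default timeout, and the use of `oc wait` as the readiness check are all assumptions.

```shell
# Hypothetical wrapper around the manual steps: cordon, drain, reboot through
# a debug pod, then wait for the node to report Ready again.
# The body runs in a subshell "( ... )" so its variables stay local.
graceful_reboot() (
  node="$1"
  timeout="${2:-15m}"                   # assumed default; tune as needed
  grace="${REBOOT_GRACE_SECONDS:-120}"  # assumed pause, see below

  oc adm cordon "$node" || return 1
  oc adm drain "$node" --ignore-daemonsets --delete-emptydir-data --force || return 1

  # The debug session is expected to be cut off by the reboot, so its exit
  # status is ignored.
  oc debug "node/$node" -- chroot /host systemctl reboot || true

  # The node can keep reporting Ready for a short while after the reboot
  # starts, so pause before checking (a heuristic, not a guarantee).
  sleep "$grace"

  # On an affected cluster this wait is expected to time out.
  oc wait --for=condition=Ready "node/$node" --timeout="$timeout"
)

# Usage (requires a logged-in oc session):
#   graceful_reboot <node> 20m
```

The function returns the exit status of `oc wait`, so a non-zero result after the timeout would correspond to the failure described in this report. The node remains cordoned afterwards, as in the manual steps.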

Actual results:

The worker node fails to become Ready.

Expected results:

The worker node is rebooted and becomes Ready.

Additional info:

We were running the RAN profile CI in the cnf-features-deploy repo on GCP, and it had been broken for a while because the worker nodes could not come back online after reboots triggered by MachineConfigs (MCs). After investigation and further testing with the Cluster Bot tool, which provides the same environment as Prow/CI, the problem appears to be related to the SR-IOV operator on GCP: without SR-IOV installed, applying MCs and rebooting works without issue. We ended up switching to AWS to unblock our CI.

More gathered info of the failed cluster is available here:
https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-origin-installer-launch-gcp-modern/1621581330145873920/artifacts/launch/

blocks

OCPBUGS-13284 [GCP] worker node with Sriov operator installed fails to come up online after reboot

Closed

is cloned by

OCPBUGS-13284 [GCP] worker node with Sriov operator installed fails to come up online after reboot

Closed

links to

https://github.com/openshift/sriov-network-operator/pull/761

RHEA-2023:5005 rpm

Assignee: Sebastian Scheinkman

Reporter: Angie Wang

QA Contact: Zhanqi Zhao

Need Info From: None

Votes: 0

Watchers: 6

Created: 2023/02/03 10:50 PM

Updated: 2025/09/13 9:27 PM

Resolved: 2023/10/31 10:41 AM
