-
Bug
-
Resolution: Done-Errata
-
Undefined
-
None
-
4.13
-
None
-
CNF Network Sprint 232, CNF Network Sprint 233, CNF Network Sprint 235, CNF Network Sprint 236
-
4
-
False
-
-
Customer Facing
-
-
Description of problem:
After installing the Sriov operator, gracefully reboot a worker node. The worker node never comes back online and remains stuck in the "NotReady,SchedulingDisabled" state. The issue only happens on GCP.

oc get nodes
NAME                                       STATUS                        ROLES                  AGE     VERSION
ci-ln-yn53m6t-72292-5fnnc-master-0         Ready                         control-plane,master   3h18m   v1.26.0+083e3f3
ci-ln-yn53m6t-72292-5fnnc-master-1         Ready                         control-plane,master   3h18m   v1.26.0+083e3f3
ci-ln-yn53m6t-72292-5fnnc-master-2         Ready                         control-plane,master   3h19m   v1.26.0+083e3f3
ci-ln-yn53m6t-72292-5fnnc-worker-a-nvqrj   NotReady,SchedulingDisabled   worker                 3h7m    v1.26.0+083e3f3
ci-ln-yn53m6t-72292-5fnnc-worker-b-knwfn   Ready                         worker                 3h7m    v1.26.0+083e3f3
ci-ln-yn53m6t-72292-5fnnc-worker-c-npfrb   Ready                         worker                 3h7m    v1.26.0+083e3f3
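To pick the affected node out of a node listing like the one above, a small filter can help. This is only a sketch: the function name `not_ready_nodes` is hypothetical and not part of the report, and it assumes the input is in the `oc get nodes --no-headers` column layout (name first, status second).

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from the original report).
# Reads `oc get nodes --no-headers` style output on stdin and prints the
# name and status of every node whose STATUS column is not exactly "Ready".
# Note: a cordoned but healthy node ("Ready,SchedulingDisabled") is also printed.
not_ready_nodes() {
  awk '$2 != "Ready" { print $1, $2 }'
}

# Usage against a live cluster (requires a logged-in oc client):
#   oc get nodes --no-headers | not_ready_nodes
```

Run against the listing in this report, the filter would print only the worker-a node with its "NotReady,SchedulingDisabled" status.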
Version-Release number of selected component (if applicable):
How reproducible:
100%
Steps to Reproduce:
1. Use Cluster Bot to bring up a 3 masters + 3 workers OCP cluster on GCP with the latest 4.13 CI build.
   - Send the message "launch ci gcp,ovn" to Cluster Bot on Slack.
2. Install the sriov operator via OLM and wait for it to complete.
3. Gracefully reboot a worker node:
   - oc adm cordon <node>
   - oc adm drain <node> --ignore-daemonsets --delete-emptydir-data --force
   - oc debug node/<node>
   - chroot /host
   - systemctl reboot
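The graceful reboot in step 3 could be scripted roughly as follows. This is a sketch under assumptions rather than the reporter's exact procedure: the function name `graceful_reboot` is hypothetical, and the interactive `oc debug` session (followed by `chroot /host` and `systemctl reboot`) is collapsed into a single non-interactive `oc debug ... -- <command>` invocation.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from the original report): cordon, drain, and
# reboot a single node, mirroring the manual commands in step 3.
graceful_reboot() {
  local node="$1"
  if [ -z "$node" ]; then
    echo "usage: graceful_reboot <node>" >&2
    return 2
  fi

  # Mark the node unschedulable, then evict its workloads.
  oc adm cordon "$node" || return 1
  oc adm drain "$node" --ignore-daemonsets --delete-emptydir-data --force || return 1

  # Assumed non-interactive equivalent of:
  #   oc debug node/<node>, then chroot /host, then systemctl reboot
  # The exit status is not checked because the debug session may be cut
  # off by the reboot itself.
  oc debug "node/$node" -- chroot /host systemctl reboot
}

# Usage (requires a logged-in oc client with cluster-admin rights):
#   graceful_reboot ci-ln-yn53m6t-72292-5fnnc-worker-a-nvqrj
```

After the reboot, `oc get nodes` shows whether the node returned to Ready. The script does not uncordon the node, since the steps in this report do not include that either.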
Actual results:
The worker node fails to become Ready.
Expected results:
The worker node is rebooted and becomes Ready.
Additional info:
We were running the RAN profile CI in the cnf-features-deploy repo on GCP, and it had been broken for a while because the worker nodes could not come back online after reboots triggered by MachineConfigs (MCs). After investigation and further testing with the Cluster Bot tool, which provides the same environment as Prow/CI, the problem appears to be related to the sriov operator on GCP: without sriov, applying MCs and rebooting works without issue. We ended up switching to AWS to unblock our CI. More information gathered from the failed cluster is available here: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-origin-installer-launch-gcp-modern/1621581330145873920/artifacts/launch/
- blocks
-
OCPBUGS-13284 [GCP] worker node with Sriov operator installed fails to come up online after reboot
- Closed
- is cloned by
-
OCPBUGS-13284 [GCP] worker node with Sriov operator installed fails to come up online after reboot
- Closed
- links to
-
RHEA-2023:5005 rpm