  1. OpenShift Bugs
  2. OCPBUGS-7039

[GCP] worker node with Sriov operator installed fails to come up online after reboot


    • None
    • CNF Network Sprint 232, CNF Network Sprint 233, CNF Network Sprint 235, CNF Network Sprint 236
    • 4
    • False
    • Customer Facing

      Description of problem:

      After the SR-IOV operator is installed, a worker node that is gracefully rebooted never comes back online and stays stuck in the "NotReady,SchedulingDisabled" state. The issue only happens on GCP.
      
      oc get nodes
      NAME                                       STATUS                        ROLES                  AGE     VERSION
      ci-ln-yn53m6t-72292-5fnnc-master-0         Ready                         control-plane,master   3h18m   v1.26.0+083e3f3
      ci-ln-yn53m6t-72292-5fnnc-master-1         Ready                         control-plane,master   3h18m   v1.26.0+083e3f3
      ci-ln-yn53m6t-72292-5fnnc-master-2         Ready                         control-plane,master   3h19m   v1.26.0+083e3f3
      ci-ln-yn53m6t-72292-5fnnc-worker-a-nvqrj   NotReady,SchedulingDisabled   worker                 3h7m    v1.26.0+083e3f3
      ci-ln-yn53m6t-72292-5fnnc-worker-b-knwfn   Ready                         worker                 3h7m    v1.26.0+083e3f3
      ci-ln-yn53m6t-72292-5fnnc-worker-c-npfrb   Ready                         worker                 3h7m    v1.26.0+083e3f3
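
When checking a larger cluster, the listing above can be filtered down to the affected nodes. The following is a minimal sketch, assuming the default column layout of `oc get nodes` (name first, status second); the function name `not_ready_nodes` is hypothetical and not part of the original report.

```shell
# Print the names of nodes whose STATUS column does not start with "Ready".
# Reads `oc get nodes` output on stdin; the header row is skipped.
not_ready_nodes() {
  awk '$1 != "NAME" && $2 !~ /^Ready/ { print $1 }'
}
```

A possible invocation is `oc get nodes | not_ready_nodes`, which for the listing above would be expected to print only the worker-a node.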

      Version-Release number of selected component (if applicable):

       

      How reproducible:

      100%

      Steps to Reproduce:

      1. Use Cluster Bot to bring up an OCP cluster with 3 masters and 3 workers on GCP, using the latest 4.13 CI build.
         - Send the message "launch ci gcp,ovn" to Cluster Bot on Slack.
      2. Install the SR-IOV operator via OLM and wait for the installation to complete.
      3. Gracefully reboot a worker node:
         - oc adm cordon <node>
         - oc adm drain <node> --ignore-daemonsets --delete-emptydir-data --force
         - oc debug node/<node>
         - chroot /host
         - systemctl reboot
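
The reboot steps above can be sketched as a single shell function for repeated reproduction attempts. This is an illustrative sketch, not the reporter's script: the function name `graceful_reboot` and the variables `OC`, `SETTLE_SECONDS`, and `READY_TIMEOUT` are assumptions introduced here, and the final `oc wait` step is an addition that makes the failure show up as a timeout.

```shell
# Hypothetical knobs (not from the original report).
OC="${OC:-oc}"                           # CLI binary to use
SETTLE_SECONDS="${SETTLE_SECONDS:-120}"  # pause so the node can drop out of Ready first
READY_TIMEOUT="${READY_TIMEOUT:-15m}"    # how long to wait for the node to return

graceful_reboot() {
  local node="$1"
  # Stop new pods from landing on the node, then evict the existing ones.
  "$OC" adm cordon "$node" || return 1
  "$OC" adm drain "$node" --ignore-daemonsets --delete-emptydir-data --force || return 1
  # The debug session is typically cut off when the host goes down,
  # so a non-zero exit status is tolerated here.
  "$OC" debug "node/$node" -- chroot /host systemctl reboot || true
  # Right after the reboot the API may still report the node as Ready,
  # so pause before waiting on the condition.
  sleep "$SETTLE_SECONDS"
  # On an affected cluster this is expected to time out.
  "$OC" wait --for=condition=Ready "node/$node" --timeout="$READY_TIMEOUT"
}
```

A call such as `graceful_reboot <node>` returns non-zero if the node does not report Ready within the timeout. The sketch leaves the node cordoned, as in the manual steps.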

      Actual results:

      The worker node fails to become Ready.

      Expected results:

      The worker node is rebooted and becomes Ready.

      Additional info:

      We were running the ran profile CI in the cnf-features-deploy repo on GCP, and it had been broken for a while because the worker nodes could not come back online after reboots triggered by MCs. After investigation and further testing with Cluster Bot, which provides the same environment as Prow/CI, the problem appears to be related to the SR-IOV operator on GCP: without SR-IOV, applying MCs and rebooting works without issue. We ended up switching to AWS to unblock our CI.
      
      More information gathered from the failed cluster is available here:
      https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-origin-installer-launch-gcp-modern/1621581330145873920/artifacts/launch/

       

              sscheink@redhat.com Sebastian Scheinkman
              angwang@redhat.com Angie Wang
              Zhanqi Zhao Zhanqi Zhao