OpenShift Bugs: OCPBUGS-646

The worker-rt nodes flip between two rendered-worker-rt* currentConfigs and desiredConfigs after upgrade to OCP4.11.1


    • Important
    • None
    • CNF Compute Sprint 237
    • 1
    • Rejected
    • False

      Description of problem:

      After upgrading the OCP environment from OCP4.10.26 to OCP4.11.1, the worker-rt nodes flip between two rendered-worker-rt* configs in their currentConfig and desiredConfig annotations, which puts the nodes in the pool into a reboot loop. This happens both during and after the upgrade. During the upgrade, some ClusterOperators could not finish upgrading because of the issue, so we deleted the PerformanceProfile to let the upgrade complete. After the upgrade, once everything had settled (ClusterOperators and MCPs all healthy), we reapplied the PerformanceProfile and the issue reproduced.
      
      [ocohen@ocohen ~]$ date && oc get nodes -l node-role.kubernetes.io/worker-rt= -o json | jq -r '.items[] | "\(.metadata.name) \(.metadata.annotations."machineconfiguration.openshift.io/desiredConfig")"'
      Fri Aug 26 14:32:31 IDT 2022
      zeus08.lab.eng.tlv2.redhat.com rendered-worker-rt-485e0aca2182afaaac3a28c45c29b725
      zeus10.lab.eng.tlv2.redhat.com rendered-worker-rt-1a5dc54b55d1f005ec37240578ec90da
      [ocohen@ocohen ~]$ 
      [ocohen@ocohen ~]$ date && oc get nodes -l node-role.kubernetes.io/worker-rt= -o json | jq -r '.items[] | "\(.metadata.name) \(.metadata.annotations."machineconfiguration.openshift.io/currentConfig")"'
      Fri Aug 26 14:32:36 IDT 2022
      zeus08.lab.eng.tlv2.redhat.com rendered-worker-rt-1a5dc54b55d1f005ec37240578ec90da
      zeus10.lab.eng.tlv2.redhat.com rendered-worker-rt-1a5dc54b55d1f005ec37240578ec90da
      [ocohen@ocohen ~]$ 
      [ocohen@ocohen ~]$ date && oc get nodes -l node-role.kubernetes.io/worker-rt= -o json | jq -r '.items[] | "\(.metadata.name) \(.metadata.annotations."machineconfiguration.openshift.io/desiredConfig")"'
      Fri Aug 26 14:42:51 IDT 2022
      zeus08.lab.eng.tlv2.redhat.com rendered-worker-rt-1a5dc54b55d1f005ec37240578ec90da
      zeus10.lab.eng.tlv2.redhat.com rendered-worker-rt-1a5dc54b55d1f005ec37240578ec90da
      [ocohen@ocohen ~]$ 
      [ocohen@ocohen ~]$ 
      [ocohen@ocohen ~]$ date && oc get nodes -l node-role.kubernetes.io/worker-rt= -o json | jq -r '.items[] | "\(.metadata.name) \(.metadata.annotations."machineconfiguration.openshift.io/currentConfig")"'
      Fri Aug 26 14:42:57 IDT 2022
      zeus08.lab.eng.tlv2.redhat.com rendered-worker-rt-1a5dc54b55d1f005ec37240578ec90da
      zeus10.lab.eng.tlv2.redhat.com rendered-worker-rt-1a5dc54b55d1f005ec37240578ec90da
      [ocohen@ocohen ~]$ 
      [ocohen@ocohen ~]$ date && oc get nodes -l node-role.kubernetes.io/worker-rt= -o json | jq -r '.items[] | "\(.metadata.name) \(.metadata.annotations."machineconfiguration.openshift.io/desiredConfig")"'
      Fri Aug 26 14:51:48 IDT 2022
      zeus08.lab.eng.tlv2.redhat.com rendered-worker-rt-1a5dc54b55d1f005ec37240578ec90da
      zeus10.lab.eng.tlv2.redhat.com rendered-worker-rt-1a5dc54b55d1f005ec37240578ec90da
      [ocohen@ocohen ~]$ 
      [ocohen@ocohen ~]$ date && oc get nodes -l node-role.kubernetes.io/worker-rt= -o json | jq -r '.items[] | "\(.metadata.name) \(.metadata.annotations."machineconfiguration.openshift.io/currentConfig")"'
      Fri Aug 26 14:51:51 IDT 2022
      zeus08.lab.eng.tlv2.redhat.com rendered-worker-rt-485e0aca2182afaaac3a28c45c29b725
      zeus10.lab.eng.tlv2.redhat.com rendered-worker-rt-1a5dc54b55d1f005ec37240578ec90da
      [ocohen@ocohen ~]$ 
      [ocohen@ocohen ~]$ date && oc get nodes -l node-role.kubernetes.io/worker-rt= -o json | jq -r '.items[] | "\(.metadata.name) \(.metadata.annotations."machineconfiguration.openshift.io/desiredConfig")"'
      Fri Aug 26 15:41:21 IDT 2022
      zeus08.lab.eng.tlv2.redhat.com rendered-worker-rt-485e0aca2182afaaac3a28c45c29b725
      zeus10.lab.eng.tlv2.redhat.com rendered-worker-rt-485e0aca2182afaaac3a28c45c29b725
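The desiredConfig and currentConfig samples above have to be compared by eye. As a minimal sketch, assuming each command's output has been saved to a file of "<node> <rendered-config>" lines, a small awk helper can list the nodes whose current config differs from the desired one. The function name find_pending_nodes and the file names desired.txt and current.txt are hypothetical, chosen only for illustration.

```shell
#!/bin/sh
# find_pending_nodes DESIRED_FILE CURRENT_FILE
# Both files hold "<node> <rendered-config>" lines, in the same format as the
# jq output shown above. Prints "<node> <current> -> <desired>" for every node
# whose currentConfig does not match its desiredConfig.
find_pending_nodes() {
  awk '
    # First file: remember the desired config of each node.
    NR == FNR { desired[$1] = $2; next }
    # Second file: report nodes whose current config differs.
    ($1 in desired) && desired[$1] != $2 { print $1, $2, "->", desired[$1] }
  ' "$1" "$2"
}

# Hypothetical usage (requires cluster access, so it is left commented out):
#   oc get nodes -l node-role.kubernetes.io/worker-rt= -o json \
#     | jq -r '.items[] | "\(.metadata.name) \(.metadata.annotations."machineconfiguration.openshift.io/desiredConfig")"' > desired.txt
#   oc get nodes -l node-role.kubernetes.io/worker-rt= -o json \
#     | jq -r '.items[] | "\(.metadata.name) \(.metadata.annotations."machineconfiguration.openshift.io/currentConfig")"' > current.txt
#   find_pending_nodes desired.txt current.txt
```

Applied to the 14:32 samples, this would report only zeus08, which was still on the 1a5d... config while the 485e... config was desired.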

      Version-Release number of selected component (if applicable):

      ocp4.11.1

      How reproducible:

       

      Steps to Reproduce:

      1. Upgrade the environment from OCP4.10.26 to OCP4.11.1.
      2. During the upgrade, check the environment status (operators, nodes, etc.).
      3. Delete the PerformanceProfile and continue checking the environment status.
      4. After the upgrade finishes, reapply the PerformanceProfile.

      Actual results:

      In step 2, the two real-time nodes were found rebooting in turn, and three ClusterOperators were unable to complete the upgrade because of that; see the attachment 'oc get clusteroperators.txt' for details.
      In step 3, the upgrade was able to finish.
      In step 4, the two real-time nodes began rebooting again, i.e. they flip between two rendered-worker-rt* currentConfigs and desiredConfigs.
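To make the flipping easier to see over time, the periodic samples can be reduced to just the moments where a node's config changes. The following is a rough sketch, not part of the original report: the function name config_transitions and the "<timestamp> <node> <rendered-config>" input format are assumptions made for illustration.

```shell
#!/bin/sh
# config_transitions
# Reads "<timestamp> <node> <rendered-config>" lines on stdin (timestamp must
# not contain spaces) and prints "<timestamp> <node> <old> -> <new>" whenever
# a node's config differs from that node's previous sample.
config_transitions() {
  awk '
    ($2 in last) && last[$2] != $3 { print $1, $2, last[$2], "->", $3 }
    { last[$2] = $3 }
  '
}

# Hypothetical sampling loop (requires cluster access, so it is left commented out):
#   while true; do
#     ts=$(date +%FT%T)
#     oc get nodes -l node-role.kubernetes.io/worker-rt= -o json \
#       | jq -r --arg ts "$ts" '.items[] | "\($ts) \(.metadata.name) \(.metadata.annotations."machineconfiguration.openshift.io/desiredConfig")"'
#     sleep 60
#   done | config_transitions
```

Fed the four desiredConfig samples for zeus08 from the description (14:32:31, 14:42:51, 14:51:48, 15:41:21), this would print two transitions: 485e... to 1a5d... at 14:42:51, and back to 485e... at 15:41:21.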
      
      

      Expected results:

       

      Additional info:

       

        1. MCD logs.txt
          9 kB
          Nini Gu
        2. oc get clusteroperators.txt
          4 kB
          Nini Gu
        3. oc logs tuned-75flp.txt
          16 kB
          Nini Gu
        4. PerformanceProfile-rt.yaml
          2 kB
          Nini Gu
        5. rendered-worker-rt-1a5dc54b55d1f005ec37240578ec90da.yaml
          155 kB
          Oren Cohen
        6. rendered-worker-rt-485e0aca2182afaaac3a28c45c29b725.yaml
          155 kB
          Oren Cohen

              msivak@redhat.com Martin Sivak
              ngu@redhat.com Nini Gu
              Shereen Haj