-
Bug
-
Resolution: Done
-
Major
-
4.11
-
None
-
Important
-
None
-
CNF Compute Sprint 237
-
1
-
Rejected
-
False
-
Description of problem:
After upgrading the OCP environment from OCP 4.10.26 to OCP 4.11.1, the worker-rt nodes flip between two rendered-worker-rt* currentConfigs and desiredConfigs, putting the nodes in the pool into a reboot loop. This happens both during and after the upgrade. Because of the issue, some ClusterOperators could not finish the upgrade during the process, so we deleted the PerformanceProfile to allow the upgrade to finish. After the upgrade, once everything had settled (ClusterOperators and MCPs were all good), we reapplied the PerformanceProfile and the issue reproduced:

[ocohen@ocohen ~]$ date && oc get nodes -l node-role.kubernetes.io/worker-rt= -o json | jq -r '.items[] | "\(.metadata.name) \(.metadata.annotations."machineconfiguration.openshift.io/desiredConfig")"'
Fri Aug 26 14:32:31 IDT 2022
zeus08.lab.eng.tlv2.redhat.com rendered-worker-rt-485e0aca2182afaaac3a28c45c29b725
zeus10.lab.eng.tlv2.redhat.com rendered-worker-rt-1a5dc54b55d1f005ec37240578ec90da

[ocohen@ocohen ~]$ date && oc get nodes -l node-role.kubernetes.io/worker-rt= -o json | jq -r '.items[] | "\(.metadata.name) \(.metadata.annotations."machineconfiguration.openshift.io/currentConfig")"'
Fri Aug 26 14:32:36 IDT 2022
zeus08.lab.eng.tlv2.redhat.com rendered-worker-rt-1a5dc54b55d1f005ec37240578ec90da
zeus10.lab.eng.tlv2.redhat.com rendered-worker-rt-1a5dc54b55d1f005ec37240578ec90da

[ocohen@ocohen ~]$ date && oc get nodes -l node-role.kubernetes.io/worker-rt= -o json | jq -r '.items[] | "\(.metadata.name) \(.metadata.annotations."machineconfiguration.openshift.io/desiredConfig")"'
Fri Aug 26 14:42:51 IDT 2022
zeus08.lab.eng.tlv2.redhat.com rendered-worker-rt-1a5dc54b55d1f005ec37240578ec90da
zeus10.lab.eng.tlv2.redhat.com rendered-worker-rt-1a5dc54b55d1f005ec37240578ec90da

[ocohen@ocohen ~]$ date && oc get nodes -l node-role.kubernetes.io/worker-rt= -o json | jq -r '.items[] | "\(.metadata.name) \(.metadata.annotations."machineconfiguration.openshift.io/currentConfig")"'
Fri Aug 26 14:42:57 IDT 2022
zeus08.lab.eng.tlv2.redhat.com rendered-worker-rt-1a5dc54b55d1f005ec37240578ec90da
zeus10.lab.eng.tlv2.redhat.com rendered-worker-rt-1a5dc54b55d1f005ec37240578ec90da

[ocohen@ocohen ~]$ date && oc get nodes -l node-role.kubernetes.io/worker-rt= -o json | jq -r '.items[] | "\(.metadata.name) \(.metadata.annotations."machineconfiguration.openshift.io/desiredConfig")"'
Fri Aug 26 14:51:48 IDT 2022
zeus08.lab.eng.tlv2.redhat.com rendered-worker-rt-1a5dc54b55d1f005ec37240578ec90da
zeus10.lab.eng.tlv2.redhat.com rendered-worker-rt-1a5dc54b55d1f005ec37240578ec90da

[ocohen@ocohen ~]$ date && oc get nodes -l node-role.kubernetes.io/worker-rt= -o json | jq -r '.items[] | "\(.metadata.name) \(.metadata.annotations."machineconfiguration.openshift.io/currentConfig")"'
Fri Aug 26 14:51:51 IDT 2022
zeus08.lab.eng.tlv2.redhat.com rendered-worker-rt-485e0aca2182afaaac3a28c45c29b725
zeus10.lab.eng.tlv2.redhat.com rendered-worker-rt-1a5dc54b55d1f005ec37240578ec90da

[ocohen@ocohen ~]$ date && oc get nodes -l node-role.kubernetes.io/worker-rt= -o json | jq -r '.items[] | "\(.metadata.name) \(.metadata.annotations."machineconfiguration.openshift.io/desiredConfig")"'
Fri Aug 26 15:41:21 IDT 2022
zeus08.lab.eng.tlv2.redhat.com rendered-worker-rt-485e0aca2182afaaac3a28c45c29b725
zeus10.lab.eng.tlv2.redhat.com rendered-worker-rt-485e0aca2182afaaac3a28c45c29b725
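The flapping in the transcript above comes down to a node's currentConfig annotation disagreeing with its desiredConfig annotation. A minimal sketch of that comparison, using sample values from the transcript (the report_drift helper is hypothetical; on a live cluster the two values would come from the machineconfiguration.openshift.io annotations queried with the oc/jq commands shown above):

```shell
#!/bin/sh
# Hypothetical helper: report whether a node's currentConfig matches
# its desiredConfig. Prints "in sync" when they match, "DRIFT" otherwise.
report_drift() {
  node=$1; current=$2; desired=$3
  if [ "$current" = "$desired" ]; then
    echo "$node: in sync ($current)"
  else
    echo "$node: DRIFT current=$current desired=$desired"
  fi
}

# Sample values taken from the 14:32 snapshot in the transcript:
# zeus08 shows drift (it will reboot to converge), zeus10 is in sync.
report_drift zeus08 \
  rendered-worker-rt-1a5dc54b55d1f005ec37240578ec90da \
  rendered-worker-rt-485e0aca2182afaaac3a28c45c29b725
report_drift zeus10 \
  rendered-worker-rt-1a5dc54b55d1f005ec37240578ec90da \
  rendered-worker-rt-1a5dc54b55d1f005ec37240578ec90da
```

In a healthy pool a node drifts once per rendered-config rollout and then stays in sync; the bug here is that desiredConfig itself keeps changing, so the nodes never converge.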
Version-Release number of selected component (if applicable):
OCP 4.11.1
How reproducible:
Steps to Reproduce:
1. Upgrade the environment from OCP 4.10.26 to OCP 4.11.1.
2. During the upgrade, check the status of the OCP environment (operators, nodes, etc.).
3. Delete the PerformanceProfile and continue checking the environment status.
4. After the upgrade finishes, reapply the PerformanceProfile.
Actual results:
In step 2, the two real-time nodes were rebooting in turn, and three ClusterOperators were unable to complete the upgrade as a result; see the attachment 'oc get clusteroperators.txt' for details. In step 3, the upgrade was able to finish. In step 4, the two real-time nodes began rebooting again, i.e. flipping between two rendered-worker-rt* currentConfigs and desiredConfigs.
Expected results:
Additional info: