Bug
Resolution: Won't Do
Normal
4.13.z
Quality / Stability / Reliability
Description of problem:
Attempted an upgrade of 3480 SNOs, originally deployed with 4.13.11, to 4.14.0-rc.0. 2 of the SNOs ended up with a degraded etcd cluster operator and a partially applied upgrade.

Example:

# oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.13.11   True        True          15h     Unable to apply 4.14.0-rc.0: wait has exceeded 40 minutes for these operators: etcd

# oc get co
NAME                                       VERSION       AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
authentication                             4.13.11       True        False         False      15h
baremetal                                  4.13.11       True        False         False      36h
cloud-controller-manager                   4.13.11       True        False         False      36h
cloud-credential                           4.13.11       True        False         False      36h
cluster-autoscaler                         4.13.11       True        False         False      36h
config-operator                            4.14.0-rc.0   True        False         False      36h
console                                    4.13.11       True        False         False      36h
control-plane-machine-set                  4.13.11       True        False         False      36h
csi-snapshot-controller                    4.13.11       True        False         False      36h
dns                                        4.13.11       True        False         False      36h
etcd                                       4.13.11       True        True          True       36h     MissingStaticPodControllerDegraded: static pod lifecycle failure - static pod: "etcd" in namespace: "openshift-etcd" for revision: 4 on node: "vm02837" didn't show up, waited: 3m30s
image-registry                             4.13.11       True        False         False      36h
ingress                                    4.13.11       True        False         False      36h
insights                                   4.13.11       True        False         False      36h
kube-apiserver                             4.14.0-rc.0   True        False         False      36h
kube-controller-manager                    4.13.11       True        False         False      36h
kube-scheduler                             4.13.11       True        False         False      36h
kube-storage-version-migrator              4.13.11       True        False         False      36h
machine-api                                4.13.11       True        False         False      36h
machine-approver                           4.13.11       True        False         False      36h
machine-config                             4.13.11       True        False         False      36h
marketplace                                4.13.11       True        False         False      36h
monitoring                                 4.13.11       True        False         False      36h
network                                    4.13.11       True        False         False      36h
node-tuning                                4.13.11       True        False         False      36h
openshift-apiserver                        4.13.11       True        False         False      13h
openshift-controller-manager               4.13.11       True        False         False      13h
openshift-samples                          4.13.11       True        False         False      36h
operator-lifecycle-manager                 4.13.11       True        False         False      36h
operator-lifecycle-manager-catalog         4.13.11       True        False         False      36h
operator-lifecycle-manager-packageserver   4.13.11       True        False         False      36h
service-ca                                 4.13.11       True        False         False      36h
storage                                    4.13.11       True        False         False      36h
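
When triaging many SNOs, the degraded operator can be picked out of the `oc get co` output with a small filter. The following is a minimal sketch, not part of the original report: the function name `degraded_operators` is hypothetical, and it assumes the default column order (NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE MESSAGE) with one operator per line.

```shell
# Hypothetical helper: print the names of cluster operators whose
# DEGRADED column (5th field) is "True".
# Assumes the default `oc get co` column order:
#   NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE MESSAGE
degraded_operators() {
  # NR > 1 skips the header row; $1 is the operator name.
  awk 'NR > 1 && $5 == "True" { print $1 }'
}

# Usage on a live cluster (requires a logged-in oc client):
#   oc get co | degraded_operators
```

The filter reads from standard input, so it can also be run against saved output collected from each managed cluster.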
Version-Release number of selected component (if applicable):
SNO OCP (managed clusters being upgraded): 4.13.11 upgraded to 4.14.0-rc.0
Hub OCP: 4.13.12
ACM: 2.9.0-DOWNSTREAM-2023-09-07-04-47-52
How reproducible:
Rare: 2 out of 3480 upgrades (~0.06%). These account for 2 of the 41 failed upgrades (~4.9% of failures).
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Restarting crio resolves the issue. This may be related to https://issues.redhat.com/browse/OCPBUGS-2474