- Bug
- Resolution: Done-Errata
- Normal
- 4.15.0
- Important
- No
- SDN Sprint 249, SDN Sprint 250, SDN Sprint 251, SDN Sprint 252, SDN Sprint 253
- 5
- Rejected
- False
- affects hypershift on IBMCloud with kubevirt
-
Description of problem:
Upgrading OCP from 4.14.7 to a 4.15.0 nightly build failed on the Provider cluster, which is part of a provider-client setup. Platform: IBM Cloud Bare Metal cluster.

Steps done:

Step 1:
$ oc patch clusterversions/version -p '{"spec":{"channel":"stable-4.15"}}' --type=merge
clusterversion.config.openshift.io/version patched

Step 2:
$ oc adm upgrade --to-image=registry.ci.openshift.org/ocp/release:4.15.0-0.nightly-2024-01-18-050837 --allow-explicit-upgrade --force
warning: Using by-tag pull specs is dangerous, and while we still allow it in combination with --force for backward compatibility, it would be much safer to pass a by-digest pull spec instead
warning: The requested upgrade image is not one of the available updates.You have used --allow-explicit-upgrade for the update to proceed anyway
warning: --force overrides cluster verification of your supplied release image and waives any update precondition failures.
Requesting update to release image registry.ci.openshift.org/ocp/release:4.15.0-0.nightly-2024-01-18-050837

The cluster was not upgraded successfully.

$ oc get clusteroperator | grep -v "4.15.0-0.nightly-2024-01-18-050837 True False False"
NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE MESSAGE
authentication 4.15.0-0.nightly-2024-01-18-050837 True False True 111s APIServerDeploymentDegraded: 1 of 3 requested instances are unavailable for apiserver.openshift-oauth-apiserver ()...
console 4.15.0-0.nightly-2024-01-18-050837 False False False 111s RouteHealthAvailable: console route is not admitted
dns 4.15.0-0.nightly-2024-01-18-050837 True True False 12d DNS "default" reports Progressing=True: "Have 4 available DNS pods, want 5.\nHave 5 available node-resolver pods, want 6."
etcd 4.15.0-0.nightly-2024-01-18-050837 True False True 12d EtcdEndpointsDegraded: EtcdEndpointsController can't evaluate whether quorum is safe: etcd cluster has quorum of 2 and 2 healthy members which is not fault tolerant: [{Member:ID:14147288297306253147 name:"baremetal2-06.qe.rh-ocs.com" peerURLs:"https://52.116.161.167:2380" clientURLs:"https://52.116.161.167:2379" Healthy:false Took: Error:create client failure: failed to make etcd client for endpoints [https://52.116.161.167:2379]: context deadline exceeded} {Member:ID:15369339084089827159 name:"baremetal2-03.qe.rh-ocs.com" peerURLs:"https://52.116.161.164:2380" clientURLs:"https://52.116.161.164:2379" Healthy:true Took:9.617293ms Error:<nil>} {Member:ID:17481226479420161008 name:"baremetal2-04.qe.rh-ocs.com" peerURLs:"https://52.116.161.165:2380" clientURLs:"https://52.116.161.165:2379" Healthy:true Took:9.090133ms Error:<nil>}]...
image-registry 4.15.0-0.nightly-2024-01-18-050837 True True False 12d Progressing: All registry resources are removed...
machine-config 4.14.7 True True True 7d22h Unable to apply 4.15.0-0.nightly-2024-01-18-050837: error during syncRequiredMachineConfigPools: [context deadline exceeded, failed to update clusteroperator: [client rate limiter Wait returned an error: context deadline exceeded, MachineConfigPool master has not progressed to latest configuration: controller version mismatch for rendered-master-9b7e02d956d965d0906def1426cb03b5 expected eaab8f3562b864ef0cc7758a6b19cc48c6d09ed8 has 7649b9274cde2fb50a61a579e3891c8ead2d79c5: 0 (ready 0) out of 3 nodes are updating to latest configuration rendered-master-34b4781f1a0fe7119765487c383afbb3, retrying]]
monitoring 4.15.0-0.nightly-2024-01-18-050837 False True True 7m54s UpdatingUserWorkloadPrometheus: client rate limiter Wait returned an error: context deadline exceeded, UpdatingUserWorkloadThanosRuler: waiting for ThanosRuler object changes failed: waiting for Thanos Ruler openshift-user-workload-monitoring/user-workload: context deadline exceeded
network 4.15.0-0.nightly-2024-01-18-050837 True True False 12d DaemonSet "/openshift-network-diagnostics/network-check-target" is not available (awaiting 2 nodes)...
node-tuning 4.15.0-0.nightly-2024-01-18-050837 True True False 98m Working towards "4.15.0-0.nightly-2024-01-18-050837"

$ oc get machineconfigpool
NAME CONFIG UPDATED UPDATING DEGRADED MACHINECOUNT READYMACHINECOUNT UPDATEDMACHINECOUNT DEGRADEDMACHINECOUNT AGE
master rendered-master-9b7e02d956d965d0906def1426cb03b5 False True True 3 0 0 1 12d
worker rendered-worker-4f54b43e9f934f0659761929f55201a1 False True True 3 1 1 1 12d

$ oc get clusterversion
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.14.7 True True 120m Unable to apply 4.15.0-0.nightly-2024-01-18-050837: an unknown error has occurred: MultipleErrors

$ oc get nodes
NAME STATUS ROLES AGE VERSION
baremetal2-01.qe.rh-ocs.com Ready worker 12d v1.27.8+4fab27b
baremetal2-02.qe.rh-ocs.com Ready worker 12d v1.27.8+4fab27b
baremetal2-03.qe.rh-ocs.com Ready control-plane,master,worker 12d v1.27.8+4fab27b
baremetal2-04.qe.rh-ocs.com Ready control-plane,master,worker 12d v1.27.8+4fab27b
baremetal2-05.qe.rh-ocs.com Ready worker 12d v1.28.5+c84a6b8
baremetal2-06.qe.rh-ocs.com Ready,SchedulingDisabled control-plane,master,worker 12d v1.27.8+4fab27b

----------------------------------------------------
During the efforts to bring the cluster back to a good state, these steps were done:

The node baremetal2-06.qe.rh-ocs.com was uncordoned. Tried to upgrade using the command:

$ oc adm upgrade --to-image=registry.ci.openshift.org/ocp/release:4.15.0-0.nightly-2024-01-22-051500 --allow-explicit-upgrade --force --allow-upgrade-with-warnings=true
warning: Using by-tag pull specs is dangerous, and while we still allow it in combination with --force for backward compatibility, it would be much safer to pass a by-digest pull spec instead
warning: The requested upgrade image is not one of the available updates.You have used --allow-explicit-upgrade for the update to proceed anyway
warning: --force overrides cluster verification of your supplied release image and waives any update precondition failures.
warning: --allow-upgrade-with-warnings is bypassing: the cluster is already upgrading:
  Reason: ClusterOperatorsDegraded
  Message: Unable to apply 4.15.0-0.nightly-2024-01-18-050837: wait has exceeded 40 minutes for these operators: etcd, kube-apiserver
Requesting update to release image registry.ci.openshift.org/ocp/release:4.15.0-0.nightly-2024-01-22-051500

The upgrade to 4.15.0-0.nightly-2024-01-22-051500 was also not successful. Node baremetal2-01.qe.rh-ocs.com was drained manually to see if that works. Some clusteroperators stayed on the previous version; some moved to a Degraded state.

$ oc get machineconfigpool
NAME CONFIG UPDATED UPDATING DEGRADED MACHINECOUNT READYMACHINECOUNT UPDATEDMACHINECOUNT DEGRADEDMACHINECOUNT AGE
master rendered-master-9b7e02d956d965d0906def1426cb03b5 False True False 3 1 1 0 13d
worker rendered-worker-4f54b43e9f934f0659761929f55201a1 False True True 3 1 1 1 13d

$ oc get pdb -n openshift-storage
NAME MIN AVAILABLE MAX UNAVAILABLE ALLOWED DISRUPTIONS AGE
rook-ceph-mds-ocs-storagecluster-cephfilesystem 1 N/A 1 11d
rook-ceph-mon-pdb N/A 1 1 11d
rook-ceph-osd N/A 1 1 3h17m

$ oc rsh rook-ceph-tools-57fd4d4d68-p2psh ceph osd tree
ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
-1 5.23672 root default
-5 1.74557 host baremetal2-01-qe-rh-ocs-com
 1 ssd 0.87279 osd.1 up 1.00000 1.00000
 4 ssd 0.87279 osd.4 up 1.00000 1.00000
-7 1.74557 host baremetal2-02-qe-rh-ocs-com
 3 ssd 0.87279 osd.3 up 1.00000 1.00000
 5 ssd 0.87279 osd.5 up 1.00000 1.00000
-3 1.74557 host baremetal2-05-qe-rh-ocs-com
 0 ssd 0.87279 osd.0 up 1.00000 1.00000
 2 ssd 0.87279 osd.2 up 1.00000 1.00000

OCP must-gather logs - http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/hcp414-aaa/hcp414-aaa_20240112T084548/logs/must-gather-ibm-bm2-provider/must-gather.local.1079362865726528648/
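For further triage, the degraded worker pool and the node left in SchedulingDisabled can be inspected with standard commands (listed here only as a pointer; their output was not captured in this report):

$ oc describe mcp worker                                        # degraded reason and which node the pool is waiting on
$ oc describe node baremetal2-06.qe.rh-ocs.com                  # cordon/drain events on the stuck control-plane node
$ oc get events -n openshift-storage --sort-by=.lastTimestamp   # eviction/PDB activity around the rook-ceph components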
Version-Release number of selected component (if applicable):
Initial versions:
OCP 4.14.7
ODF 4.14.4-5.fusion-hci
OpenShift Virtualization: kubevirt-hyperconverged-operator.4.16.0-380
Local Storage: local-storage-operator.v4.14.0-202312132033
OpenShift Data Foundation Client: ocs-client-operator.v4.14.4-5.fusion-hci
How reproducible:
Reporting the first occurrence of the issue.
Steps to Reproduce:
1. On a provider-client HCI setup, upgrade the provider cluster to a nightly build of OCP (command sketch below).
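The upgrade in step 1 follows the same command sequence shown in the description; roughly (the release image tag is a placeholder for the nightly build under test):

$ oc patch clusterversions/version -p '{"spec":{"channel":"stable-4.15"}}' --type=merge
$ oc adm upgrade --to-image=registry.ci.openshift.org/ocp/release:<nightly-build-tag> --allow-explicit-upgrade --force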
Actual results:
OCP upgrade is not successful. Some cluster operators become degraded, and the worker machineconfigpool reports 1 degraded machine.
Expected results:
OCP upgrade from 4.14.7 to the nightly build should succeed.
Additional info:
There are 3 hosted client clusters present.
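The hosted clusters can be listed on the provider with the standard HyperShift resource (namespace scope may vary per setup):

$ oc get hostedclusters -A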
- blocks
- OCPBUGS-29316 OCP upgrade to nightly build failed on provider cluster - OVN-K fails to process annotation on live-migratable VM pods
- Closed
- is cloned by
- OCPBUGS-29316 OCP upgrade to nightly build failed on provider cluster - OVN-K fails to process annotation on live-migratable VM pods
- Closed
- links to
- RHEA-2024:0041 OpenShift Container Platform 4.16.z bug fix update