Bug
Resolution: Done-Errata
Normal
4.15.0
Quality / Stability / Reliability
Important
Rejected
SDN Sprint 249, SDN Sprint 250, SDN Sprint 253
3
This is a clone of issue OCPBUGS-27853. The following is the description of the original issue:
—
Description of problem:
Upgrading OCP from 4.14.7 to a 4.15.0 nightly build failed on a provider cluster which is part of a provider-client setup.
Platform: IBM Cloud Bare Metal cluster.
Steps done:
Step 1.
$ oc patch clusterversions/version -p '{"spec":{"channel":"stable-4.15"}}' --type=merge
clusterversion.config.openshift.io/version patched
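For reference, a quick sanity check that the channel patch took effect is to read the value back, e.g.:
$ oc get clusterversion version -o jsonpath='{.spec.channel}'
which should print stable-4.15 after step 1.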
Step 2:
$ oc adm upgrade --to-image=registry.ci.openshift.org/ocp/release:4.15.0-0.nightly-2024-01-18-050837 --allow-explicit-upgrade --force
warning: Using by-tag pull specs is dangerous, and while we still allow it in combination with --force for backward compatibility, it would be much safer to pass a by-digest pull spec instead
warning: The requested upgrade image is not one of the available updates.You have used --allow-explicit-upgrade for the update to proceed anyway
warning: --force overrides cluster verification of your supplied release image and waives any update precondition failures.
Requesting update to release image registry.ci.openshift.org/ocp/release:4.15.0-0.nightly-2024-01-18-050837
The cluster was not upgraded successfully.
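Upgrade progress can be followed with the standard status commands, for example:
$ oc adm upgrade
$ oc get clusterversion -w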
$ oc get clusteroperator | grep -v "4.15.0-0.nightly-2024-01-18-050837 True False False"
NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE MESSAGE
authentication 4.15.0-0.nightly-2024-01-18-050837 True False True 111s APIServerDeploymentDegraded: 1 of 3 requested instances are unavailable for apiserver.openshift-oauth-apiserver ()...
console 4.15.0-0.nightly-2024-01-18-050837 False False False 111s RouteHealthAvailable: console route is not admitted
dns 4.15.0-0.nightly-2024-01-18-050837 True True False 12d DNS "default" reports Progressing=True: "Have 4 available DNS pods, want 5.\nHave 5 available node-resolver pods, want 6."
etcd 4.15.0-0.nightly-2024-01-18-050837 True False True 12d EtcdEndpointsDegraded: EtcdEndpointsController can't evaluate whether quorum is safe: etcd cluster has quorum of 2 and 2 healthy members which is not fault tolerant: [{Member:ID:14147288297306253147 name:"baremetal2-06.qe.rh-ocs.com" peerURLs:"https://52.116.161.167:2380" clientURLs:"https://52.116.161.167:2379" Healthy:false Took: Error:create client failure: failed to make etcd client for endpoints [https://52.116.161.167:2379]: context deadline exceeded} {Member:ID:15369339084089827159 name:"baremetal2-03.qe.rh-ocs.com" peerURLs:"https://52.116.161.164:2380" clientURLs:"https://52.116.161.164:2379" Healthy:true Took:9.617293ms Error:<nil>} {Member:ID:17481226479420161008 name:"baremetal2-04.qe.rh-ocs.com" peerURLs:"https://52.116.161.165:2380" clientURLs:"https://52.116.161.165:2379" Healthy:true Took:9.090133ms Error:<nil>}]...
image-registry 4.15.0-0.nightly-2024-01-18-050837 True True False 12d Progressing: All registry resources are removed...
machine-config 4.14.7 True True True 7d22h Unable to apply 4.15.0-0.nightly-2024-01-18-050837: error during syncRequiredMachineConfigPools: [context deadline exceeded, failed to update clusteroperator: [client rate limiter Wait returned an error: context deadline exceeded, MachineConfigPool master has not progressed to latest configuration: controller version mismatch for rendered-master-9b7e02d956d965d0906def1426cb03b5 expected eaab8f3562b864ef0cc7758a6b19cc48c6d09ed8 has 7649b9274cde2fb50a61a579e3891c8ead2d79c5: 0 (ready 0) out of 3 nodes are updating to latest configuration rendered-master-34b4781f1a0fe7119765487c383afbb3, retrying]]
monitoring 4.15.0-0.nightly-2024-01-18-050837 False True True 7m54s UpdatingUserWorkloadPrometheus: client rate limiter Wait returned an error: context deadline exceeded, UpdatingUserWorkloadThanosRuler: waiting for ThanosRuler object changes failed: waiting for Thanos Ruler openshift-user-workload-monitoring/user-workload: context deadline exceeded
network 4.15.0-0.nightly-2024-01-18-050837 True True False 12d DaemonSet "/openshift-network-diagnostics/network-check-target" is not available (awaiting 2 nodes)...
node-tuning 4.15.0-0.nightly-2024-01-18-050837 True True False 98m Working towards "4.15.0-0.nightly-2024-01-18-050837"
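To drill into any single degraded operator, the usual approach is to read its conditions, for example:
$ oc describe clusteroperator etcd
$ oc get clusteroperator etcd -o yaml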
$ oc get machineconfigpool
NAME CONFIG UPDATED UPDATING DEGRADED MACHINECOUNT READYMACHINECOUNT UPDATEDMACHINECOUNT DEGRADEDMACHINECOUNT AGE
master rendered-master-9b7e02d956d965d0906def1426cb03b5 False True True 3 0 0 1 12d
worker rendered-worker-4f54b43e9f934f0659761929f55201a1 False True True 3 1 1 1 12d
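The reason for the degraded machine counts can be read from the pool conditions and from the machine-config-daemon pod running on the stuck node, for example (the daemon pod name is node-specific and shown here as a placeholder):
$ oc describe mcp master
$ oc -n openshift-machine-config-operator get pods -o wide | grep baremetal2-06
$ oc -n openshift-machine-config-operator logs <machine-config-daemon-pod> -c machine-config-daemon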
$ oc get clusterversion
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.14.7 True True 120m Unable to apply 4.15.0-0.nightly-2024-01-18-050837: an unknown error has occurred: MultipleErrors
$ oc get nodes
NAME STATUS ROLES AGE VERSION
baremetal2-01.qe.rh-ocs.com Ready worker 12d v1.27.8+4fab27b
baremetal2-02.qe.rh-ocs.com Ready worker 12d v1.27.8+4fab27b
baremetal2-03.qe.rh-ocs.com Ready control-plane,master,worker 12d v1.27.8+4fab27b
baremetal2-04.qe.rh-ocs.com Ready control-plane,master,worker 12d v1.27.8+4fab27b
baremetal2-05.qe.rh-ocs.com Ready worker 12d v1.28.5+c84a6b8
baremetal2-06.qe.rh-ocs.com Ready,SchedulingDisabled control-plane,master,worker 12d v1.27.8+4fab27b
----------------------------------------------------
During the efforts to bring the cluster back to a good state, these steps were done:
The node baremetal2-06.qe.rh-ocs.com was uncordoned.
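(The uncordon was presumably done with the standard command, e.g.:
$ oc adm uncordon baremetal2-06.qe.rh-ocs.com)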
Tried to upgrade to a newer nightly build using the command:
$ oc adm upgrade --to-image=registry.ci.openshift.org/ocp/release:4.15.0-0.nightly-2024-01-22-051500 --allow-explicit-upgrade --force --allow-upgrade-with-warnings=true
warning: Using by-tag pull specs is dangerous, and while we still allow it in combination with --force for backward compatibility, it would be much safer to pass a by-digest pull spec instead
warning: The requested upgrade image is not one of the available updates.You have used --allow-explicit-upgrade for the update to proceed anyway
warning: --force overrides cluster verification of your supplied release image and waives any update precondition failures.
warning: --allow-upgrade-with-warnings is bypassing: the cluster is already upgrading: Reason: ClusterOperatorsDegraded
Message: Unable to apply 4.15.0-0.nightly-2024-01-18-050837: wait has exceeded 40 minutes for these operators: etcd, kube-apiserver
Requesting update to release image registry.ci.openshift.org/ocp/release:4.15.0-0.nightly-2024-01-22-051500
Upgrade to 4.15.0-0.nightly-2024-01-22-051500 also was not successful.
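The failure reason can also be read directly from the ClusterVersion conditions, for example:
$ oc get clusterversion version -o jsonpath='{.status.conditions[?(@.type=="Failing")].message}'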
Node baremetal2-01.qe.rh-ocs.com was drained manually to see if that works.
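(The manual drain would be along the lines of the following; the exact flags used were not recorded:
$ oc adm drain baremetal2-01.qe.rh-ocs.com --ignore-daemonsets --delete-emptydir-data)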
Some clusteroperators stayed on the previous version, and some moved to a Degraded state.
$ oc get machineconfigpool
NAME CONFIG UPDATED UPDATING DEGRADED MACHINECOUNT READYMACHINECOUNT UPDATEDMACHINECOUNT DEGRADEDMACHINECOUNT AGE
master rendered-master-9b7e02d956d965d0906def1426cb03b5 False True False 3 1 1 0 13d
worker rendered-worker-4f54b43e9f934f0659761929f55201a1 False True True 3 1 1 1 13d
$ oc get pdb -n openshift-storage
NAME MIN AVAILABLE MAX UNAVAILABLE ALLOWED DISRUPTIONS AGE
rook-ceph-mds-ocs-storagecluster-cephfilesystem 1 N/A 1 11d
rook-ceph-mon-pdb N/A 1 1 11d
rook-ceph-osd N/A 1 1 3h17m
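The ALLOWED DISRUPTIONS values come from the PDB status and can be inspected in more detail with, for example:
$ oc describe pdb rook-ceph-osd -n openshift-storage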
$ oc rsh rook-ceph-tools-57fd4d4d68-p2psh ceph osd tree
ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
-1 5.23672 root default
-5 1.74557 host baremetal2-01-qe-rh-ocs-com
1 ssd 0.87279 osd.1 up 1.00000 1.00000
4 ssd 0.87279 osd.4 up 1.00000 1.00000
-7 1.74557 host baremetal2-02-qe-rh-ocs-com
3 ssd 0.87279 osd.3 up 1.00000 1.00000
5 ssd 0.87279 osd.5 up 1.00000 1.00000
-3 1.74557 host baremetal2-05-qe-rh-ocs-com
0 ssd 0.87279 osd.0 up 1.00000 1.00000
2 ssd 0.87279 osd.2 up 1.00000 1.00000
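Overall Ceph health can be checked from the same toolbox pod, e.g.:
$ oc rsh rook-ceph-tools-57fd4d4d68-p2psh ceph status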
OCP must-gather logs - http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/hcp414-aaa/hcp414-aaa_20240112T084548/logs/must-gather-ibm-bm2-provider/must-gather.local.1079362865726528648/
Version-Release number of selected component (if applicable):
Initial versions:
OCP: 4.14.7
ODF: 4.14.4-5.fusion-hci
OpenShift Virtualization: kubevirt-hyperconverged-operator.4.16.0-380
Local Storage: local-storage-operator.v4.14.0-202312132033
OpenShift Data Foundation Client: ocs-client-operator.v4.14.4-5.fusion-hci
How reproducible:
Reporting the first occurrence of the issue.
Steps to Reproduce:
1. On a provider-client HCI setup, upgrade the provider cluster to a nightly build of OCP.
Actual results:
The OCP upgrade was not successful. Some operators became degraded, and the worker machineconfigpool has a degraded machine count of 1.
Expected results:
The OCP upgrade from 4.14.7 to the nightly build should succeed.
Additional info:
There are 3 hosted client clusters present.
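(Assuming the hosted clients are HyperShift HostedClusters, they can be listed with:
$ oc get hostedclusters -A)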
clones:
OCPBUGS-27853 OCP upgrade to nightly build failed on provider cluster - OVN-K fails to process annotation on live-migratable VM pods (Closed)
is blocked by:
OCPBUGS-27853 OCP upgrade to nightly build failed on provider cluster - OVN-K fails to process annotation on live-migratable VM pods (Closed)
links to:
RHBA-2024:2865 OpenShift Container Platform 4.15.z bug fix update