- Bug
- Resolution: Unresolved
- Major
- None
- 4.13.z
- Critical
- No
- Rejected
- False
Description of problem:
profile: 02_UPI_on_Baremetal-packet_OVN-dual-stack_Disk-encryption_Disk-mirroring_Etcd-encryption
Wanted to upgrade from 4.12.41-x86_64 -> 4.13.19-x86_64 -> 4.14.0-x86_64.
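(For context: the ClusterVersion output below records a forced, by-image update. A forced update of this shape would typically be requested with something like the following; the exact invocation used by the upgrade pipeline is an assumption.)

$ oc adm upgrade --allow-explicit-upgrade --force \
    --to-image quay.io/openshift-release-dev/ocp-release@sha256:f8ba6f54eae419aba17926417d950ae18e06021beae9d7947a8b8243ad48353a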
The 4.12.41 to 4.13.19 upgrade failed:
$ oc get co
NAME                                       VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
authentication                             4.13.19   True        False         True       172m    APIServerDeploymentDegraded: 1 of 3 requested instances are unavailable for apiserver.openshift-oauth-apiserver ()...
baremetal                                  4.13.19   True        False         False      4h55m
cloud-controller-manager                   4.13.19   True        False         False      4h58m
cloud-credential                           4.13.19   True        False         False      4h59m
cluster-autoscaler                         4.13.19   True        False         False      4h54m
config-operator                            4.13.19   True        False         False      4h55m
console                                    4.13.19   True        False         False      4h43m
control-plane-machine-set                  4.13.19   True        False         False      4h55m
csi-snapshot-controller                    4.13.19   True        False         False      4h55m
dns                                        4.13.19   True        True          False      4h54m   DNS "default" reports Progressing=True: "Have 4 available node-resolver pods, want 6."
etcd                                       4.13.19   True        False         True       4h53m   ClusterMemberControllerDegraded: unhealthy members found during reconciling members...
image-registry                             4.13.19   True        True          False      175m    Progressing: The registry is ready...
ingress                                    4.13.19   True        False         False      4h44m
insights                                   4.13.19   True        False         False      4h49m
kube-apiserver                             4.13.19   True        False         True       4h50m   NodeControllerDegraded: The master nodes not ready: node "master-00.juzhao-44510.qe.devcluster.openshift.com" not ready since 2023-10-27 11:13:46 +0000 UTC because NodeStatusUnknown (Kubelet stopped posting node status.)
kube-controller-manager                    4.13.19   True        False         True       4h52m   NodeControllerDegraded: The master nodes not ready: node "master-00.juzhao-44510.qe.devcluster.openshift.com" not ready since 2023-10-27 11:13:46 +0000 UTC because NodeStatusUnknown (Kubelet stopped posting node status.)
kube-scheduler                             4.13.19   True        False         True       4h52m   NodeControllerDegraded: The master nodes not ready: node "master-00.juzhao-44510.qe.devcluster.openshift.com" not ready since 2023-10-27 11:13:46 +0000 UTC because NodeStatusUnknown (Kubelet stopped posting node status.)
kube-storage-version-migrator              4.13.19   True        False         False      89m
machine-api                                4.13.19   True        False         False      4h55m
machine-approver                           4.13.19   True        False         False      4h54m
machine-config                             4.12.41   False       True          True       69m     Cluster not available for [{operator 4.12.41}]: failed to apply machine config daemon manifests: error during waitForDaemonsetRollout: [timed out waiting for the condition, daemonset machine-config-daemon is not ready. status: (desired: 6, updated: 6, ready: 4, unavailable: 2)]
marketplace                                4.13.19   True        False         False      4h54m
monitoring                                 4.13.19   False       True          True       66m     reconciling node-exporter DaemonSet failed: updating DaemonSet object failed: waiting for DaemonSetRollout of openshift-monitoring/node-exporter: Too many daemonset pods are unavailable (2 > 1 max unavailable).
network                                    4.13.19   True        True          False      4h54m   DaemonSet "/openshift-multus/multus" is not available (awaiting 2 nodes)...
node-tuning                                4.13.19   True        True          False      121m    Working towards "4.13.19"
openshift-apiserver                        4.13.19   True        False         True       4h48m   APIServerDeploymentDegraded: 1 of 3 requested instances are unavailable for apiserver.openshift-apiserver ()
openshift-controller-manager               4.13.19   True        False         False      4h51m
openshift-samples                          4.13.19   True        False         False      121m
operator-lifecycle-manager                 4.13.19   True        False         False      4h54m
operator-lifecycle-manager-catalog         4.13.19   True        False         False      4h54m
operator-lifecycle-manager-packageserver   4.13.19   True        False         False      4h48m
service-ca                                 4.13.19   True        False         False      4h55m
storage                                    4.13.19   True        False         False      4h55m
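machine-config is the only operator still at 4.12.41, blocked on the machine-config-daemon DaemonSet (4 of 6 pods ready). A natural next step, not captured in this report, would be to check the DaemonSet and locate the unavailable pods, e.g.:

$ oc -n openshift-machine-config-operator get daemonset machine-config-daemon
$ oc -n openshift-machine-config-operator get pods -o wide | grep machine-config-daemon

The two unavailable daemon pods should correspond to the two NotReady nodes shown below.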
The MCPs are stuck updating and two nodes are NotReady:
$ oc get mcp
NAME     CONFIG                                              UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-4abc568f4584b75998ea779c818f03c4   False     True       False      3              0                   0                     0                      4h56m
worker   rendered-worker-b9f20eedfe223ba9bb8cd926ea93db77   False     True       False      3              1                   1                     0                      4h56m

$ oc get node
NAME                                                 STATUS                        ROLES                  AGE     VERSION
master-00.juzhao-44510.qe.devcluster.openshift.com   NotReady,SchedulingDisabled   control-plane,master   4h59m   v1.25.14+31e0558
master-01.juzhao-44510.qe.devcluster.openshift.com   Ready                         control-plane,master   4h58m   v1.25.14+31e0558
master-02.juzhao-44510.qe.devcluster.openshift.com   Ready                         control-plane,master   4h58m   v1.25.14+31e0558
worker-00.juzhao-44510.qe.devcluster.openshift.com   Ready                         worker                 4h44m   v1.25.14+31e0558
worker-01.juzhao-44510.qe.devcluster.openshift.com   Ready                         worker                 4h46m   v1.26.9+636f2be
worker-02.juzhao-44510.qe.devcluster.openshift.com   NotReady,SchedulingDisabled   worker                 4h45m   v1.25.14+31e0558
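Both NotReady nodes report NodeStatusUnknown (kubelet stopped posting node status), so the usual follow-up would be to inspect the node object and the kubelet logs. Note that oc adm node-logs is served through the kubelet, so if the kubelet is truly down, console or SSH access to the host may be required instead:

$ oc describe node master-00.juzhao-44510.qe.devcluster.openshift.com
$ oc adm node-logs master-00.juzhao-44510.qe.devcluster.openshift.com -u kubelet | tail -n 100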
ClusterVersion info:
$ oc get clusterversion version -oyaml
apiVersion: config.openshift.io/v1
kind: ClusterVersion
metadata:
  creationTimestamp: "2023-10-27T07:39:05Z"
  generation: 3
  name: version
  resourceVersion: "191974"
  uid: 6b3d8f94-8948-4756-a0f9-0b9fb39bb91d
spec:
  channel: stable-4.12
  clusterID: 0b4919c9-3b65-4c77-9e0c-cea5933494ca
  desiredUpdate:
    force: true
    image: quay.io/openshift-release-dev/ocp-release@sha256:f8ba6f54eae419aba17926417d950ae18e06021beae9d7947a8b8243ad48353a
    version: ""
status:
  availableUpdates: null
  capabilities:
    enabledCapabilities:
    - CSISnapshot
    - Console
    - Insights
    - NodeTuning
    - Storage
    - baremetal
    - marketplace
    - openshift-samples
    knownCapabilities:
    - CSISnapshot
    - Console
    - Insights
    - NodeTuning
    - Storage
    - baremetal
    - marketplace
    - openshift-samples
  conditions:
  - lastTransitionTime: "2023-10-27T07:39:07Z"
    message: 'Unable to retrieve available updates: currently reconciling cluster version 4.13.19 not found in the "stable-4.12" channel'
    reason: VersionNotFound
    status: "False"
    type: RetrievedUpdates
  - lastTransitionTime: "2023-10-27T10:10:34Z"
    message: Capabilities match configured spec
    reason: AsExpected
    status: "False"
    type: ImplicitlyEnabledCapabilities
  - lastTransitionTime: "2023-10-27T07:39:07Z"
    message: Payload loaded version="4.13.19" image="quay.io/openshift-release-dev/ocp-release@sha256:f8ba6f54eae419aba17926417d950ae18e06021beae9d7947a8b8243ad48353a" architecture="amd64"
    reason: PayloadLoaded
    status: "True"
    type: ReleaseAccepted
  - lastTransitionTime: "2023-10-27T08:03:40Z"
    message: Done applying 4.12.41
    status: "True"
    type: Available
  - lastTransitionTime: "2023-10-27T11:15:58Z"
    message: Cluster operators etcd, kube-apiserver are degraded
    reason: ClusterOperatorsDegraded
    status: "True"
    type: Failing
  - lastTransitionTime: "2023-10-27T10:10:20Z"
    message: 'Unable to apply 4.13.19: wait has exceeded 40 minutes for these operators: etcd, kube-apiserver'
    reason: ClusterOperatorsDegraded
    status: "True"
    type: Progressing
  - lastTransitionTime: "2023-10-27T11:14:04Z"
    message: 'Cluster operator machine-config should not be upgraded between minor versions: One or more machine config pools are updating, please see `oc get mcp` for further details'
    reason: PoolUpdating
    status: "False"
    type: Upgradeable
  desired:
    image: quay.io/openshift-release-dev/ocp-release@sha256:f8ba6f54eae419aba17926417d950ae18e06021beae9d7947a8b8243ad48353a
    url: https://access.redhat.com/errata/RHSA-2023:6130
    version: 4.13.19
  history:
  - acceptedRisks: |-
      Forced through blocking failures:
      Multiple precondition checks failed:
      * Precondition "EtcdRecentBackup" failed because of "ControllerStarted": RecentBackup: The etcd backup controller is starting, and will decide if recent backups are available or if a backup is required
      * Precondition "ClusterVersionRecommendedUpdate" failed because of "UnknownUpdate": RetrievedUpdates=False (VersionNotFound), so the recommended status of updating from 4.12.41 to 4.13.19 is unknown.
    completionTime: null
    image: quay.io/openshift-release-dev/ocp-release@sha256:f8ba6f54eae419aba17926417d950ae18e06021beae9d7947a8b8243ad48353a
    startedTime: "2023-10-27T10:10:20Z"
    state: Partial
    verified: true
    version: 4.13.19
  - completionTime: "2023-10-27T08:03:40Z"
    image: quay.io/openshift-release-dev/ocp-release@sha256:59c93fdfff4ecca2ca6d6bb0ec722bca2bb08152252ae10ce486a9fc80c82dcf
    startedTime: "2023-10-27T07:39:07Z"
    state: Completed
    verified: false
    version: 4.12.41
  observedGeneration: 3
  versionHash: K84dirQ2oDM=
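To pull just the ClusterVersion conditions (RetrievedUpdates, ReleaseAccepted, Available, Failing, Progressing, Upgradeable) without the full YAML, a jsonpath query along these lines works:

$ oc get clusterversion version -o jsonpath='{range .status.conditions[*]}{.type}{"\t"}{.status}{"\t"}{.message}{"\n"}{end}'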
The same issue also appears in a prow job: https://amd64.ocp.releases.ci.openshift.org/releasestream/4-stable/release/4.13.19 (click the "Failed" link of the 4.12.41 to 4.13.19 upgrade and check its logs):
Oct 26 14:31:54.647: INFO: Pool worker is still reporting (Updated: false, Updating: true, Degraded: false)
Oct 26 14:32:04.645: INFO: Pool worker is still reporting (Updated: false, Updating: true, Degraded: false)
Oct 26 14:32:14.646: INFO: Pool worker is still reporting (Updated: false, Updating: true, Degraded: false)
Oct 26 14:32:24.646: INFO: Pool worker is still reporting (Updated: false, Updating: true, Degraded: false)
Oct 26 14:32:24.832: INFO: Pool worker is still reporting (Updated: false, Updating: true, Degraded: false)
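The test above is polling the worker pool's status conditions; the same check can be run by hand against this cluster, e.g.:

$ oc get mcp worker -o jsonpath='{.status.conditions[?(@.type=="Updated")].status}{"\n"}'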
Version-Release number of selected component (if applicable):
4.12.41 to 4.13.19 upgrade
How reproducible:
Not always.
Steps to Reproduce:
1. Upgrade a cluster from 4.12.41 to 4.13.19 (one way to watch the run is sketched below).
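One way to observe the upgrade as it runs and catch the hang described above (the exact cadence and checks used by the QE pipeline are assumptions):

$ watch -n 30 'oc get clusterversion version; oc get co; oc get mcp; oc get nodes'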
Actual results:
The upgrade failed: the machine-config operator and both MCPs are stuck, and two nodes are NotReady.
Expected results:
The upgrade should succeed.
Additional info:
There is a successful upgrade from 4.11.52-x86_64 -> 4.12.41-x86_64 -> 4.13.19-x86_64 -> 4.14.0-x86_64 (https://mastern-jenkins-csb-openshift-qe.apps.ocp-c1.prod.psi.redhat.com/job/ocp-upgrade/job/upgrade-pipeline/44490/), so it seems the issue does not always happen.