Bug
Resolution: Not a Bug
Undefined
None
4.14.0
No
False
Description of problem:
When opening a support case, bugzilla, or issue please include the following summary data along with any other requested information:

ClusterID: 513b8753-04c6-4a3a-988a-2d92b95e48f9
ClusterVersion: Updating to "4.14.0-rc.1" from "4.14.0-rc.0" for 5 hours: Unable to apply 4.14.0-rc.1: wait has exceeded 40 minutes for these operators: authentication, openshift-apiserver
ClusterOperators:
  clusteroperator/authentication is degraded because APIServerDeploymentDegraded: 1 of 3 requested instances are unavailable for apiserver.openshift-oauth-apiserver ()
    OAuthServerDeploymentDegraded: 1 of 3 requested instances are unavailable for oauth-openshift.openshift-authentication ()
  clusteroperator/machine-config is degraded because Unable to apply 4.14.0-rc.1: error during syncRequiredMachineConfigPools: [context deadline exceeded, failed to update clusteroperator: [client rate limiter Wait returned an error: context deadline exceeded, error MachineConfigPool master is not ready, retrying. Status: (pool degraded: true total: 3, ready 2, updated: 2, unavailable: 1)]]
  clusteroperator/network is progressing: Deployment "/openshift-ovn-kubernetes/ovnkube-control-plane" is not available (awaiting 1 nodes)
  clusteroperator/openshift-apiserver is degraded because APIServerDeploymentDegraded: 1 of 3 requested instances are unavailable for apiserver.openshift-apiserver ()

error: gather never finished for pod must-gather-wvxzn: pods "must-gather-wvxzn" not found
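For anyone triaging a similar hang, the checks below surface the same symptoms before digging into the full output in Additional info. This is a minimal sketch assuming cluster-admin access; the pool and namespace names are the ones from this report:

  oc get mcp                                   # shows the master pool degraded / not ready
  oc describe mcp master                       # identifies the node the pool is stuck on and why
  oc get pods -n openshift-apiserver -o wide   # reveals the Pending apiserver replica
  oc get events -n openshift-apiserver         # shows the FailedScheduling / anti-affinity reason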
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1.
2.
3.
Actual results:
Expected results:
Additional info:
oc get clusterversion -o yaml
apiVersion: v1
items:
- apiVersion: config.openshift.io/v1
  kind: ClusterVersion
  metadata:
    creationTimestamp: "2023-09-14T19:20:36Z"
    generation: 3
    name: version
    resourceVersion: "4005495"
    uid: 7672d053-f9ed-43e3-a324-9a9bb85db483
  spec:
    channel: stable-4.14
    clusterID: 513b8753-04c6-4a3a-988a-2d92b95e48f9
    desiredUpdate:
      architecture: ""
      force: true
      image: registry.kni-qe-31.lab.eng.rdu2.redhat.com:5000/openshift-release-dev:4.14.0-rc.1-x86_64
      version: ""
  status:
    availableUpdates: null
    capabilities:
      enabledCapabilities:
      - Build
      - CSISnapshot
      - Console
      - DeploymentConfig
      - ImageRegistry
      - Insights
      - MachineAPI
      - NodeTuning
      - Storage
      - baremetal
      - marketplace
      - openshift-samples
      knownCapabilities:
      - Build
      - CSISnapshot
      - Console
      - DeploymentConfig
      - ImageRegistry
      - Insights
      - MachineAPI
      - NodeTuning
      - Storage
      - baremetal
      - marketplace
      - openshift-samples
    conditions:
    - lastTransitionTime: "2023-09-14T19:20:40Z"
      message: 'Unable to retrieve available updates: currently reconciling cluster version 4.14.0-rc.1 not found in the "stable-4.14" channel'
      reason: VersionNotFound
      status: "False"
      type: RetrievedUpdates
    - lastTransitionTime: "2023-09-14T19:20:40Z"
      message: Capabilities match configured spec
      reason: AsExpected
      status: "False"
      type: ImplicitlyEnabledCapabilities
    - lastTransitionTime: "2023-09-14T19:20:40Z"
      message: Payload loaded version="4.14.0-rc.1" image="registry.kni-qe-31.lab.eng.rdu2.redhat.com:5000/openshift-release-dev:4.14.0-rc.1-x86_64" architecture="amd64"
      reason: PayloadLoaded
      status: "True"
      type: ReleaseAccepted
    - lastTransitionTime: "2023-09-14T20:08:35Z"
      message: Done applying 4.14.0-rc.0
      status: "True"
      type: Available
    - lastTransitionTime: "2023-09-19T14:55:40Z"
      message: Cluster operators authentication, openshift-apiserver are degraded
      reason: ClusterOperatorsDegraded
      status: "True"
      type: Failing
    - lastTransitionTime: "2023-09-19T13:14:45Z"
      message: 'Unable to apply 4.14.0-rc.1: wait has exceeded 40 minutes for these operators: authentication, openshift-apiserver'
      reason: ClusterOperatorsDegraded
      status: "True"
      type: Progressing
    - lastTransitionTime: "2023-09-19T13:57:22Z"
      message: 'Cluster operator machine-config should not be upgraded between minor versions: One or more machine config pools are degraded, please see `oc get mcp` for further details and resolve before upgrading'
      reason: DegradedPool
      status: "False"
      type: Upgradeable
    desired:
      image: registry.kni-qe-31.lab.eng.rdu2.redhat.com:5000/openshift-release-dev:4.14.0-rc.1-x86_64
      url: https://access.redhat.com/errata/RHSA-2023:5006
      version: 4.14.0-rc.1
    history:
    - acceptedRisks: |-
        Target release version="" image="registry.kni-qe-31.lab.eng.rdu2.redhat.com:5000/openshift-release-dev:4.14.0-rc.1-x86_64" cannot be verified, but continuing anyway because the update was forced: release images that are not accessed via digest cannot be verified
        Precondition "ClusterVersionRecommendedUpdate" failed because of "UnknownUpdate": RetrievedUpdates=False (VersionNotFound), so the recommended status of updating from 4.14.0-rc.0 to 4.14.0-rc.1 is unknown.
      completionTime: null
      image: registry.kni-qe-31.lab.eng.rdu2.redhat.com:5000/openshift-release-dev:4.14.0-rc.1-x86_64
      startedTime: "2023-09-19T13:14:45Z"
      state: Partial
      verified: false
      version: 4.14.0-rc.1
    - completionTime: "2023-09-14T20:08:35Z"
      image: quay.io/openshift-release-dev/ocp-release@sha256:1d2cc38cbd94c532dc822ff793f46b23a93b76b400f7d92b13c1e1da042c88fe
      startedTime: "2023-09-14T19:20:40Z"
      state: Completed
      verified: false
      version: 4.14.0-rc.0
    observedGeneration: 3
    versionHash: MQnicHcnnoQ=
kind: List
metadata:
  resourceVersion: ""

oc get nodes
NAME       STATUS                     ROLES                         AGE     VERSION
master-0   Ready                      control-plane,master,worker   4d22h   v1.27.4+2c287eb
master-1   Ready,SchedulingDisabled   control-plane,master,worker   4d23h   v1.27.4+2c83a9f
master-2   Ready                      control-plane,master,worker   4d23h   v1.27.4+2c287eb

oc get co
NAME                                       VERSION       AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
authentication                             4.14.0-rc.1   True        False         True       4d22h   APIServerDeploymentDegraded: 1 of 3 requested instances are unavailable for apiserver.openshift-oauth-apiserver ()...
baremetal                                  4.14.0-rc.1   True        False         False      4d22h
cloud-controller-manager                   4.14.0-rc.1   True        False         False      4d23h
cloud-credential                           4.14.0-rc.1   True        False         False      4d23h
cluster-autoscaler                         4.14.0-rc.1   True        False         False      4d22h
config-operator                            4.14.0-rc.1   True        False         False      4d22h
console                                    4.14.0-rc.1   True        False         False      4d22h
control-plane-machine-set                  4.14.0-rc.1   True        False         False      4d22h
csi-snapshot-controller                    4.14.0-rc.1   True        False         False      4d22h
dns                                        4.14.0-rc.1   True        False         False      4d22h
etcd                                       4.14.0-rc.1   True        False         False      4d22h
image-registry                             4.14.0-rc.1   True        False         False      4h23m
ingress                                    4.14.0-rc.1   True        False         False      4d22h
insights                                   4.14.0-rc.1   True        False         False      4d22h
kube-apiserver                             4.14.0-rc.1   True        False         False      4d22h
kube-controller-manager                    4.14.0-rc.1   True        False         False      4d22h
kube-scheduler                             4.14.0-rc.1   True        False         False      4d22h
kube-storage-version-migrator              4.14.0-rc.1   True        False         False      4h35m
machine-api                                4.14.0-rc.1   True        False         False      4d22h
machine-approver                           4.14.0-rc.1   True        False         False      4d22h
machine-config                             4.14.0-rc.0   True        True          True       4d22h   Unable to apply 4.14.0-rc.1: error during syncRequiredMachineConfigPools: [context deadline exceeded, failed to update clusteroperator: [client rate limiter Wait returned an error: context deadline exceeded, error MachineConfigPool master is not ready, retrying. Status: (pool degraded: true total: 3, ready 2, updated: 2, unavailable: 1)]]
marketplace                                4.14.0-rc.1   True        False         False      4d22h
monitoring                                 4.14.0-rc.1   True        False         False      4d22h
network                                    4.14.0-rc.1   True        True          False      4d22h   Deployment "/openshift-ovn-kubernetes/ovnkube-control-plane" is not available (awaiting 1 nodes)
node-tuning                                4.14.0-rc.1   True        False         False      4d22h
openshift-apiserver                        4.14.0-rc.1   True        False         True       4d22h   APIServerDeploymentDegraded: 1 of 3 requested instances are unavailable for apiserver.openshift-apiserver ()
openshift-controller-manager               4.14.0-rc.1   True        False         False      4d22h
openshift-samples                          4.14.0-rc.1   True        False         False      5h4m
operator-lifecycle-manager                 4.14.0-rc.1   True        False         False      4d22h
operator-lifecycle-manager-catalog         4.14.0-rc.1   True        False         False      4d22h
operator-lifecycle-manager-packageserver   4.14.0-rc.1   True        False         False      4d22h
service-ca                                 4.14.0-rc.1   True        False         False      4d22h
storage                                    4.14.0-rc.1   True        False         False      4d22h

oc get pods -n openshift-apiserver
NAME                         READY   STATUS    RESTARTS   AGE
apiserver-85c5fb6d7c-25mqf   2/2     Running   0          5h10m
apiserver-85c5fb6d7c-bzgft   0/2     Pending   0          4h53m
apiserver-85c5fb6d7c-ms9nk   2/2     Running   0          5h2m

[kni@registry.kni-qe-31 post-config]$ oc get events -n openshift-apiserver
LAST SEEN   TYPE      REASON             OBJECT                         MESSAGE
12m         Warning   FailedScheduling   pod/apiserver-85c5fb6d7c-bzgft   0/3 nodes are available: 1 node(s) were unschedulable, 2 node(s) didn't match pod anti-affinity rules. preemption: 0/3 nodes are available: 1 Preemption is not helpful for scheduling, 2 node(s) didn't match pod anti-affinity rules..

oc get events -A | grep machine-config
openshift-machine-config-operator   22m   Warning   OperatorDegraded: RequiredPoolsFailed   /machine-config   Unable to apply 4.14.0-rc.1: error during syncRequiredMachineConfigPools: [context deadline exceeded, failed to update clusteroperator: [client rate limiter Wait returned an error: context deadline exceeded, error MachineConfigPool master is not ready, retrying. Status: (pool degraded: true total: 3, ready 2, updated: 2, unavailable: 1)]]

oc logs machine-config-daemon-drb4p -c machine-config-daemon -n openshift-machine-config-operator

oc adm drain master-1 --grace-period=20 --ignore-daemonsets --force=true --delete-emptydir-data --timeout=60s
node/master-1 already cordoned
Warning: ignoring DaemonSet-managed Pods: openshift-cluster-node-tuning-operator/tuned-lvg5p, openshift-dns/dns-default-l8kzl, openshift-dns/node-resolver-vnk7g, openshift-image-registry/node-ca-kwtw4, openshift-ingress-canary/ingress-canary-jcnff, openshift-local-storage/diskmaker-manager-7scmt, openshift-logging/collector-fjbwq, openshift-machine-api/ironic-proxy-6hdx6, openshift-machine-config-operator/machine-config-daemon-drb4p, openshift-machine-config-operator/machine-config-server-zzsfq, openshift-monitoring/node-exporter-dt26h, openshift-multus/multus-7fblb, openshift-multus/multus-additional-cni-plugins-wssh8, openshift-multus/network-metrics-daemon-ptf2k, openshift-network-diagnostics/network-check-target-gdpch, openshift-ovn-kubernetes/ovnkube-node-pw7w6, openshift-sriov-network-operator/network-resources-injector-lv5kj, openshift-sriov-network-operator/operator-webhook-ntnn6, openshift-sriov-network-operator/sriov-device-plugin-896fr, openshift-sriov-network-operator/sriov-network-config-daemon-ctmc4, openshift-storage/csi-cephfsplugin-2xgf4, openshift-storage/csi-rbdplugin-ct9lp
evicting pod nqldh/mypod-nqldh
evicting pod c21gn/mypod-c21gn
evicting pod hxm70/mypod-hxm70
There are pending pods in node "master-1" when an error occurred: [error when waiting for pod "mypod-c21gn" terminating: global timeout reached: 1m0s, error when waiting for pod "mypod-hxm70" terminating: global timeout reached: 1m0s, error when waiting for pod "mypod-nqldh" terminating: global timeout reached: 1m0s]
pod/mypod-c21gn
pod/mypod-hxm70
pod/mypod-nqldh
error: unable to drain node "master-1" due to error: [error when waiting for pod "mypod-c21gn" terminating: global timeout reached: 1m0s, error when waiting for pod "mypod-hxm70" terminating: global timeout reached: 1m0s, error when waiting for pod "mypod-nqldh" terminating: global timeout reached: 1m0s], continuing command...
There are pending nodes to be drained: master-1
error when waiting for pod "mypod-c21gn" terminating: global timeout reached: 1m0s
error when waiting for pod "mypod-hxm70" terminating: global timeout reached: 1m0s
error when waiting for pod "mypod-nqldh" terminating: global timeout reached: 1m0s

The drain was forced, but these pods had to be force-deleted. Then:

oc adm drain master-1 --grace-period=20 --ignore-daemonsets --force=true --delete-emptydir-data --timeout=60s
node/master-1 already cordoned
Warning: ignoring DaemonSet-managed Pods: openshift-cluster-node-tuning-operator/tuned-lvg5p, openshift-dns/dns-default-l8kzl, openshift-dns/node-resolver-vnk7g, openshift-image-registry/node-ca-kwtw4, openshift-ingress-canary/ingress-canary-jcnff, openshift-local-storage/diskmaker-manager-7scmt, openshift-logging/collector-fjbwq, openshift-machine-api/ironic-proxy-6hdx6, openshift-machine-config-operator/machine-config-daemon-drb4p, openshift-machine-config-operator/machine-config-server-zzsfq, openshift-monitoring/node-exporter-dt26h, openshift-multus/multus-7fblb, openshift-multus/multus-additional-cni-plugins-wssh8, openshift-multus/network-metrics-daemon-ptf2k, openshift-network-diagnostics/network-check-target-gdpch, openshift-ovn-kubernetes/ovnkube-node-pw7w6, openshift-sriov-network-operator/network-resources-injector-lv5kj, openshift-sriov-network-operator/operator-webhook-ntnn6, openshift-sriov-network-operator/sriov-device-plugin-896fr, openshift-sriov-network-operator/sriov-network-config-daemon-ctmc4, openshift-storage/csi-cephfsplugin-2xgf4, openshift-storage/csi-rbdplugin-ct9lp
node/master-1 drained

oc adm uncordon master-1
node/master-1 uncordoned

[kni@registry.kni-qe-31 post-config]$ oc get nodes
NAME       STATUS     ROLES                         AGE     VERSION
master-0   Ready      control-plane,master,worker   4d23h   v1.27.4+2c287eb
master-1   NotReady   control-plane,master,worker   5d      v1.27.4+2c83a9f
master-2   Ready      control-plane,master,worker   5d      v1.27.4+2c287eb

oc get clusterversion
NAME      VERSION       AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.14.0-rc.1   True        False         27s     Cluster version is 4.14.0-rc.1

The upgrade finally completed, but why does the upgrade not handle stuck pods?
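For context on the question above: the stall looks like pods that do not exit within the drain timeout. `oc adm drain --timeout=60s` gave up after 1m0s per pod, and the machine-config operator's own drain also does not force-delete application pods, so it kept retrying until the pool was reported degraded. A pod whose main process ignores SIGTERM reproduces this behavior. The manifest below is a hypothetical sketch, not taken from this cluster; the pod name, image, and grace period are my assumptions:

apiVersion: v1
kind: Pod
metadata:
  name: mypod-stuck        # hypothetical name, mirroring the mypod-* pods above
  namespace: default
spec:
  nodeName: master-1       # pin the pod to the node being drained
  terminationGracePeriodSeconds: 3600   # eviction waits this long before SIGKILL
  containers:
  - name: sleeper
    image: registry.access.redhat.com/ubi9/ubi-minimal
    # PID 1 ignores SIGTERM, so eviction cannot complete within a short drain timeout
    command: ["/bin/sh", "-c", "trap '' TERM; sleep 360000"]

With such a pod on the node, `oc adm drain master-1 --timeout=60s` fails exactly as shown in the output above, and the drain stays blocked until someone shortens the grace period or force-deletes the pod, e.g. `oc delete pod mypod-stuck --grace-period=0 --force`.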