Bug
Resolution: Not a Bug
Undefined
None
4.14.0
No
False
Description of problem:
When opening a support case, bugzilla, or issue please include the following summary data along with any other requested information:

ClusterID: 513b8753-04c6-4a3a-988a-2d92b95e48f9
ClusterVersion: Updating to "4.14.0-rc.1" from "4.14.0-rc.0" for 5 hours: Unable to apply 4.14.0-rc.1: wait has exceeded 40 minutes for these operators: authentication, openshift-apiserver
ClusterOperators:
  clusteroperator/authentication is degraded because APIServerDeploymentDegraded: 1 of 3 requested instances are unavailable for apiserver.openshift-oauth-apiserver ()
    OAuthServerDeploymentDegraded: 1 of 3 requested instances are unavailable for oauth-openshift.openshift-authentication ()
  clusteroperator/machine-config is degraded because Unable to apply 4.14.0-rc.1: error during syncRequiredMachineConfigPools: [context deadline exceeded, failed to update clusteroperator: [client rate limiter Wait returned an error: context deadline exceeded, error MachineConfigPool master is not ready, retrying. Status: (pool degraded: true total: 3, ready 2, updated: 2, unavailable: 1)]]
  clusteroperator/network is progressing: Deployment "/openshift-ovn-kubernetes/ovnkube-control-plane" is not available (awaiting 1 nodes)
  clusteroperator/openshift-apiserver is degraded because APIServerDeploymentDegraded: 1 of 3 requested instances are unavailable for apiserver.openshift-apiserver ()

error: gather never finished for pod must-gather-wvxzn: pods "must-gather-wvxzn" not found
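For anyone triaging a similar hang, the checks below surface the same symptoms before digging into the full output in Additional info. This is a minimal sketch assuming cluster-admin access; the pool and namespace names are the ones from this report:

  oc get mcp                                   # shows the master pool degraded / not ready
  oc describe mcp master                       # identifies the node the pool is stuck on and why
  oc get pods -n openshift-apiserver -o wide   # reveals the Pending apiserver replica
  oc get events -n openshift-apiserver         # shows the FailedScheduling / anti-affinity reason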
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1.
2.
3.
Actual results:
Expected results:
Additional info:
oc get clusterversion -o yaml
apiVersion: v1
items:
- apiVersion: config.openshift.io/v1
  kind: ClusterVersion
  metadata:
    creationTimestamp: "2023-09-14T19:20:36Z"
    generation: 3
    name: version
    resourceVersion: "4005495"
    uid: 7672d053-f9ed-43e3-a324-9a9bb85db483
  spec:
    channel: stable-4.14
    clusterID: 513b8753-04c6-4a3a-988a-2d92b95e48f9
    desiredUpdate:
      architecture: ""
      force: true
      image: registry.kni-qe-31.lab.eng.rdu2.redhat.com:5000/openshift-release-dev:4.14.0-rc.1-x86_64
      version: ""
  status:
    availableUpdates: null
    capabilities:
      enabledCapabilities:
      - Build
      - CSISnapshot
      - Console
      - DeploymentConfig
      - ImageRegistry
      - Insights
      - MachineAPI
      - NodeTuning
      - Storage
      - baremetal
      - marketplace
      - openshift-samples
      knownCapabilities:
      - Build
      - CSISnapshot
      - Console
      - DeploymentConfig
      - ImageRegistry
      - Insights
      - MachineAPI
      - NodeTuning
      - Storage
      - baremetal
      - marketplace
      - openshift-samples
    conditions:
    - lastTransitionTime: "2023-09-14T19:20:40Z"
      message: 'Unable to retrieve available updates: currently reconciling cluster version 4.14.0-rc.1 not found in the "stable-4.14" channel'
      reason: VersionNotFound
      status: "False"
      type: RetrievedUpdates
    - lastTransitionTime: "2023-09-14T19:20:40Z"
      message: Capabilities match configured spec
      reason: AsExpected
      status: "False"
      type: ImplicitlyEnabledCapabilities
    - lastTransitionTime: "2023-09-14T19:20:40Z"
      message: Payload loaded version="4.14.0-rc.1" image="registry.kni-qe-31.lab.eng.rdu2.redhat.com:5000/openshift-release-dev:4.14.0-rc.1-x86_64" architecture="amd64"
      reason: PayloadLoaded
      status: "True"
      type: ReleaseAccepted
    - lastTransitionTime: "2023-09-14T20:08:35Z"
      message: Done applying 4.14.0-rc.0
      status: "True"
      type: Available
    - lastTransitionTime: "2023-09-19T14:55:40Z"
      message: Cluster operators authentication, openshift-apiserver are degraded
      reason: ClusterOperatorsDegraded
      status: "True"
      type: Failing
    - lastTransitionTime: "2023-09-19T13:14:45Z"
      message: 'Unable to apply 4.14.0-rc.1: wait has exceeded 40 minutes for these operators: authentication, openshift-apiserver'
      reason: ClusterOperatorsDegraded
      status: "True"
      type: Progressing
    - lastTransitionTime: "2023-09-19T13:57:22Z"
      message: 'Cluster operator machine-config should not be upgraded between minor versions: One or more machine config pools are degraded, please see `oc get mcp` for further details and resolve before upgrading'
      reason: DegradedPool
      status: "False"
      type: Upgradeable
    desired:
      image: registry.kni-qe-31.lab.eng.rdu2.redhat.com:5000/openshift-release-dev:4.14.0-rc.1-x86_64
      url: https://access.redhat.com/errata/RHSA-2023:5006
      version: 4.14.0-rc.1
    history:
    - acceptedRisks: |-
        Target release version="" image="registry.kni-qe-31.lab.eng.rdu2.redhat.com:5000/openshift-release-dev:4.14.0-rc.1-x86_64" cannot be verified, but continuing anyway because the update was forced: release images that are not accessed via digest cannot be verified
        Precondition "ClusterVersionRecommendedUpdate" failed because of "UnknownUpdate": RetrievedUpdates=False (VersionNotFound), so the recommended status of updating from 4.14.0-rc.0 to 4.14.0-rc.1 is unknown.
      completionTime: null
      image: registry.kni-qe-31.lab.eng.rdu2.redhat.com:5000/openshift-release-dev:4.14.0-rc.1-x86_64
      startedTime: "2023-09-19T13:14:45Z"
      state: Partial
      verified: false
      version: 4.14.0-rc.1
    - completionTime: "2023-09-14T20:08:35Z"
      image: quay.io/openshift-release-dev/ocp-release@sha256:1d2cc38cbd94c532dc822ff793f46b23a93b76b400f7d92b13c1e1da042c88fe
      startedTime: "2023-09-14T19:20:40Z"
      state: Completed
      verified: false
      version: 4.14.0-rc.0
    observedGeneration: 3
    versionHash: MQnicHcnnoQ=
kind: List
metadata:
  resourceVersion: ""

oc get nodes
NAME       STATUS                     ROLES                         AGE     VERSION
master-0   Ready                      control-plane,master,worker   4d22h   v1.27.4+2c287eb
master-1   Ready,SchedulingDisabled   control-plane,master,worker   4d23h   v1.27.4+2c83a9f
master-2   Ready                      control-plane,master,worker   4d23h   v1.27.4+2c287eb

oc get co
NAME                                       VERSION       AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
authentication                             4.14.0-rc.1   True        False         True       4d22h   APIServerDeploymentDegraded: 1 of 3 requested instances are unavailable for apiserver.openshift-oauth-apiserver ()...
baremetal                                  4.14.0-rc.1   True        False         False      4d22h
cloud-controller-manager                   4.14.0-rc.1   True        False         False      4d23h
cloud-credential                           4.14.0-rc.1   True        False         False      4d23h
cluster-autoscaler                         4.14.0-rc.1   True        False         False      4d22h
config-operator                            4.14.0-rc.1   True        False         False      4d22h
console                                    4.14.0-rc.1   True        False         False      4d22h
control-plane-machine-set                  4.14.0-rc.1   True        False         False      4d22h
csi-snapshot-controller                    4.14.0-rc.1   True        False         False      4d22h
dns                                        4.14.0-rc.1   True        False         False      4d22h
etcd                                       4.14.0-rc.1   True        False         False      4d22h
image-registry                             4.14.0-rc.1   True        False         False      4h23m
ingress                                    4.14.0-rc.1   True        False         False      4d22h
insights                                   4.14.0-rc.1   True        False         False      4d22h
kube-apiserver                             4.14.0-rc.1   True        False         False      4d22h
kube-controller-manager                    4.14.0-rc.1   True        False         False      4d22h
kube-scheduler                             4.14.0-rc.1   True        False         False      4d22h
kube-storage-version-migrator              4.14.0-rc.1   True        False         False      4h35m
machine-api                                4.14.0-rc.1   True        False         False      4d22h
machine-approver                           4.14.0-rc.1   True        False         False      4d22h
machine-config                             4.14.0-rc.0   True        True          True       4d22h   Unable to apply 4.14.0-rc.1: error during syncRequiredMachineConfigPools: [context deadline exceeded, failed to update clusteroperator: [client rate limiter Wait returned an error: context deadline exceeded, error MachineConfigPool master is not ready, retrying. Status: (pool degraded: true total: 3, ready 2, updated: 2, unavailable: 1)]]
marketplace                                4.14.0-rc.1   True        False         False      4d22h
monitoring                                 4.14.0-rc.1   True        False         False      4d22h
network                                    4.14.0-rc.1   True        True          False      4d22h   Deployment "/openshift-ovn-kubernetes/ovnkube-control-plane" is not available (awaiting 1 nodes)
node-tuning                                4.14.0-rc.1   True        False         False      4d22h
openshift-apiserver                        4.14.0-rc.1   True        False         True       4d22h   APIServerDeploymentDegraded: 1 of 3 requested instances are unavailable for apiserver.openshift-apiserver ()
openshift-controller-manager               4.14.0-rc.1   True        False         False      4d22h
openshift-samples                          4.14.0-rc.1   True        False         False      5h4m
operator-lifecycle-manager                 4.14.0-rc.1   True        False         False      4d22h
operator-lifecycle-manager-catalog         4.14.0-rc.1   True        False         False      4d22h
operator-lifecycle-manager-packageserver   4.14.0-rc.1   True        False         False      4d22h
service-ca                                 4.14.0-rc.1   True        False         False      4d22h
storage                                    4.14.0-rc.1   True        False         False      4d22h

oc get pods -n openshift-apiserver
NAME                         READY   STATUS    RESTARTS   AGE
apiserver-85c5fb6d7c-25mqf   2/2     Running   0          5h10m
apiserver-85c5fb6d7c-bzgft   0/2     Pending   0          4h53m
apiserver-85c5fb6d7c-ms9nk   2/2     Running   0          5h2m

[kni@registry.kni-qe-31 post-config]$ oc get events -n openshift-apiserver
LAST SEEN   TYPE      REASON             OBJECT                         MESSAGE
12m         Warning   FailedScheduling   pod/apiserver-85c5fb6d7c-bzgft   0/3 nodes are available: 1 node(s) were unschedulable, 2 node(s) didn't match pod anti-affinity rules. preemption: 0/3 nodes are available: 1 Preemption is not helpful for scheduling, 2 node(s) didn't match pod anti-affinity rules..

oc get events -A | grep machine-config
openshift-machine-config-operator   22m   Warning   OperatorDegraded: RequiredPoolsFailed   /machine-config   Unable to apply 4.14.0-rc.1: error during syncRequiredMachineConfigPools: [context deadline exceeded, failed to update clusteroperator: [client rate limiter Wait returned an error: context deadline exceeded, error MachineConfigPool master is not ready, retrying. Status: (pool degraded: true total: 3, ready 2, updated: 2, unavailable: 1)]]

oc logs machine-config-daemon-drb4p -c machine-config-daemon -n openshift-machine-config-operator

oc adm drain master-1 --grace-period=20 --ignore-daemonsets --force=true --delete-emptydir-data --timeout=60s
node/master-1 already cordoned
Warning: ignoring DaemonSet-managed Pods: openshift-cluster-node-tuning-operator/tuned-lvg5p, openshift-dns/dns-default-l8kzl, openshift-dns/node-resolver-vnk7g, openshift-image-registry/node-ca-kwtw4, openshift-ingress-canary/ingress-canary-jcnff, openshift-local-storage/diskmaker-manager-7scmt, openshift-logging/collector-fjbwq, openshift-machine-api/ironic-proxy-6hdx6, openshift-machine-config-operator/machine-config-daemon-drb4p, openshift-machine-config-operator/machine-config-server-zzsfq, openshift-monitoring/node-exporter-dt26h, openshift-multus/multus-7fblb, openshift-multus/multus-additional-cni-plugins-wssh8, openshift-multus/network-metrics-daemon-ptf2k, openshift-network-diagnostics/network-check-target-gdpch, openshift-ovn-kubernetes/ovnkube-node-pw7w6, openshift-sriov-network-operator/network-resources-injector-lv5kj, openshift-sriov-network-operator/operator-webhook-ntnn6, openshift-sriov-network-operator/sriov-device-plugin-896fr, openshift-sriov-network-operator/sriov-network-config-daemon-ctmc4, openshift-storage/csi-cephfsplugin-2xgf4, openshift-storage/csi-rbdplugin-ct9lp
evicting pod nqldh/mypod-nqldh
evicting pod c21gn/mypod-c21gn
evicting pod hxm70/mypod-hxm70
There are pending pods in node "master-1" when an error occurred: [error when waiting for pod "mypod-c21gn" terminating: global timeout reached: 1m0s, error when waiting for pod "mypod-hxm70" terminating: global timeout reached: 1m0s, error when waiting for pod "mypod-nqldh" terminating: global timeout reached: 1m0s]
pod/mypod-c21gn
pod/mypod-hxm70
pod/mypod-nqldh
error: unable to drain node "master-1" due to error: [error when waiting for pod "mypod-c21gn" terminating: global timeout reached: 1m0s, error when waiting for pod "mypod-hxm70" terminating: global timeout reached: 1m0s, error when waiting for pod "mypod-nqldh" terminating: global timeout reached: 1m0s], continuing command...
There are pending nodes to be drained: master-1
error when waiting for pod "mypod-c21gn" terminating: global timeout reached: 1m0s
error when waiting for pod "mypod-hxm70" terminating: global timeout reached: 1m0s
error when waiting for pod "mypod-nqldh" terminating: global timeout reached: 1m0s

The drain was forced, but these pods had to be force-deleted. Then:

oc adm drain master-1 --grace-period=20 --ignore-daemonsets --force=true --delete-emptydir-data --timeout=60s
node/master-1 already cordoned
Warning: ignoring DaemonSet-managed Pods: openshift-cluster-node-tuning-operator/tuned-lvg5p, openshift-dns/dns-default-l8kzl, openshift-dns/node-resolver-vnk7g, openshift-image-registry/node-ca-kwtw4, openshift-ingress-canary/ingress-canary-jcnff, openshift-local-storage/diskmaker-manager-7scmt, openshift-logging/collector-fjbwq, openshift-machine-api/ironic-proxy-6hdx6, openshift-machine-config-operator/machine-config-daemon-drb4p, openshift-machine-config-operator/machine-config-server-zzsfq, openshift-monitoring/node-exporter-dt26h, openshift-multus/multus-7fblb, openshift-multus/multus-additional-cni-plugins-wssh8, openshift-multus/network-metrics-daemon-ptf2k, openshift-network-diagnostics/network-check-target-gdpch, openshift-ovn-kubernetes/ovnkube-node-pw7w6, openshift-sriov-network-operator/network-resources-injector-lv5kj, openshift-sriov-network-operator/operator-webhook-ntnn6, openshift-sriov-network-operator/sriov-device-plugin-896fr, openshift-sriov-network-operator/sriov-network-config-daemon-ctmc4, openshift-storage/csi-cephfsplugin-2xgf4, openshift-storage/csi-rbdplugin-ct9lp
node/master-1 drained

oc adm uncordon master-1
node/master-1 uncordoned

[kni@registry.kni-qe-31 post-config]$ oc get nodes
NAME       STATUS     ROLES                         AGE     VERSION
master-0   Ready      control-plane,master,worker   4d23h   v1.27.4+2c287eb
master-1   NotReady   control-plane,master,worker   5d      v1.27.4+2c83a9f
master-2   Ready      control-plane,master,worker   5d      v1.27.4+2c287eb

oc get clusterversion
NAME      VERSION       AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.14.0-rc.1   True        False         27s     Cluster version is 4.14.0-rc.1

The upgrade finally completed, but why does the upgrade not handle stuck pods?
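For context on the question above: the stall looks like pods that do not exit within the drain timeout. `oc adm drain --timeout=60s` gave up after 1m0s per pod, and the machine-config operator's own drain also does not force-delete application pods, so it kept retrying until the pool was reported degraded. A pod whose main process ignores SIGTERM reproduces this behavior. The manifest below is a hypothetical sketch, not taken from this cluster; the pod name, image, and grace period are my assumptions:

apiVersion: v1
kind: Pod
metadata:
  name: mypod-stuck        # hypothetical name, mirroring the mypod-* pods above
  namespace: default
spec:
  nodeName: master-1       # pin the pod to the node being drained
  terminationGracePeriodSeconds: 3600   # eviction waits this long before SIGKILL
  containers:
  - name: sleeper
    image: registry.access.redhat.com/ubi9/ubi-minimal
    # PID 1 ignores SIGTERM, so eviction cannot complete within a short drain timeout
    command: ["/bin/sh", "-c", "trap '' TERM; sleep 360000"]

With such a pod on the node, `oc adm drain master-1 --timeout=60s` fails exactly as shown in the output above, and the drain stays blocked until someone shortens the grace period or force-deletes the pod, e.g. `oc delete pod mypod-stuck --grace-period=0 --force`.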