OCPBUGS-36667 (OpenShift Bugs)

Upgrade from 4.15 to 4.16 hits an etcd degraded issue


    • Type: Bug
    • Resolution: Unresolved
    • Priority: Undefined
    • None
    • 4.16
    • Etcd
    • None
    • No
    • False

      Description of problem:
      periodic-ci-openshift-openshift-tests-private-release-4.16-multi-nightly-4.16-upgrade-from-stable-4.15-azure-ipi-fullyprivate-proxy-arm-f28

      https://qe-private-deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gs/qe-private-deck/logs/periodic-ci-openshift-openshift-tests-private-release-4.16-multi-nightly-4.16-upgrade-from-stable-4.15-azure-ipi-fullyprivate-proxy-arm-f28/1807425918356951040

      Running command: oc adm upgrade --to-image=registry.build02.ci.openshift.org/ci-op-y87p0c68/release@sha256:0453dcf90e6f1e6ba6b8eb197d520c95f47cb2e3906dc4d98902f10412f90ceb --allow-explicit-upgrade --force=true
      warning: The requested upgrade image is not one of the available updates. You have used --allow-explicit-upgrade for the update to proceed anyway
      warning: --force overrides cluster verification of your supplied release image and waives any update precondition failures.
      Requested update to release image registry.build02.ci.openshift.org/ci-op-y87p0c68/release@sha256:0453dcf90e6f1e6ba6b8eb197d520c95f47cb2e3906dc4d98902f10412f90ceb
      Upgrading cluster to registry.build02.ci.openshift.org/ci-op-y87p0c68/release@sha256:0453dcf90e6f1e6ba6b8eb197d520c95f47cb2e3906dc4d98902f10412f90ceb gets started...
      Starting the upgrade checking on 2024-06-30 18:24:46
      Running command: oc get clusterversion
      NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
      version   4.15.19   True        True          4m44s   Working towards 4.16.0-0.nightly-multi-2024-06-27-053432: 110 of 894 done (12% complete), waiting on etcd, kube-apiserver
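The STATUS column above packs the target version, the update progress, and the blocking operators into one string. A minimal Python sketch for pulling those fields out when triaging CI logs follows. The helper name `parse_cv_status` and the regular expression are illustrative assumptions derived only from the single line shown in this report, not from a documented output format.

```python
import re

# Hypothetical pattern: built only from the one STATUS line in this report,
# so it may not match other ClusterVersion messages.
_STATUS_RE = re.compile(
    r"Working towards (?P<target>\S+): "
    r"(?P<done>\d+) of (?P<total>\d+) done \((?P<pct>\d+)% complete\)"
    r"(?:, waiting on (?P<waiting>.+))?$"
)


def parse_cv_status(status: str) -> dict:
    """Split a ClusterVersion STATUS string into target, progress, and blockers."""
    m = _STATUS_RE.search(status.strip())
    if m is None:
        raise ValueError(f"unrecognized status: {status!r}")
    waiting = m.group("waiting")
    return {
        "target": m.group("target"),
        "done": int(m.group("done")),
        "total": int(m.group("total")),
        "percent": int(m.group("pct")),
        # Operators the update is blocked on; empty if the clause is absent.
        "waiting_on": [op.strip() for op in waiting.split(",")] if waiting else [],
    }


status = (
    "Working towards 4.16.0-0.nightly-multi-2024-06-27-053432: "
    "110 of 894 done (12% complete), waiting on etcd, kube-apiserver"
)
info = parse_cv_status(status)
print(info["waiting_on"])
```

For the line in this report, the blocking operators are already etcd and kube-apiserver less than five minutes into the update.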
          

      Version-Release number of selected component (if applicable):

      4.16.0-0.nightly-multi-2024-06-27-053432
          

      How reproducible:

      
          

      Steps to Reproduce:

          1.
          2.
          3.
          

      Actual results:

      # oc adm upgrade status
      
      Unable to fetch alerts, ignoring alerts in 'Update Health':  failed to get alerts from Thanos: no token is currently in use for this session
      = Control Plane =
      Assessment:      Progressing
      Target Version:  4.16.0-0.nightly-multi-2024-06-27-053432 (from 4.15.19)
      Completion:      97%
      Duration:        2h10m21.917016007s
      Operator Status: 25 Healthy, 1 Unavailable, 7 Available but degraded
      
      Control Plane Nodes
      NAME                                  ASSESSMENT    PHASE      VERSION                                    EST    MESSAGE
      ci-op-y87p0c68-80996-qqdwx-master-2   Progressing   Updating   4.15.19                                    +20m   
      ci-op-y87p0c68-80996-qqdwx-master-0   Completed     Updated    4.16.0-0.nightly-multi-2024-06-27-053432   -      
      ci-op-y87p0c68-80996-qqdwx-master-1   Completed     Updated    4.16.0-0.nightly-multi-2024-06-27-053432   -      
      
      = Worker Upgrade =
      
      = Worker Pool =
      Worker Pool:     worker
      Assessment:      Completed
      Completion:      100%
      Worker Status:   4 Total, 4 Available, 0 Progressing, 0 Outdated, 0 Draining, 0 Excluded, 0 Degraded
      
      Worker Pool Nodes
      NAME                                                      ASSESSMENT   PHASE     VERSION                                    EST   MESSAGE
      ci-op-y87p0c68-80996-qqdwx-41804-k7z6m                    Completed    Updated   4.16.0-0.nightly-multi-2024-06-27-053432   -     
      ci-op-y87p0c68-80996-qqdwx-worker-southcentralus1-w4dc4   Completed    Updated   4.16.0-0.nightly-multi-2024-06-27-053432   -     
      ci-op-y87p0c68-80996-qqdwx-worker-southcentralus2-8fg97   Completed    Updated   4.16.0-0.nightly-multi-2024-06-27-053432   -     
      ci-op-y87p0c68-80996-qqdwx-worker-southcentralus3-2kskb   Completed    Updated   4.16.0-0.nightly-multi-2024-06-27-053432   -     
      
      = Worker Pool =
      Worker Pool:     worker-pao
      Assessment:      Completed
      Completion:      100%
      Worker Status:   1 Total, 1 Available, 0 Progressing, 0 Outdated, 0 Draining, 0 Excluded, 0 Degraded
      
      Worker Pool Node
      NAME                                                      ASSESSMENT   PHASE     VERSION                                    EST   MESSAGE
      ci-op-y87p0c68-80996-qqdwx-worker-southcentralus1-w4dc4   Completed    Updated   4.16.0-0.nightly-multi-2024-06-27-053432   -     
      
      = Update Health =
      Message: Cluster Operator authentication is degraded (APIServerDeployment_UnavailablePod::OAuthServerDeployment_UnavailablePod)
        Since:       43m58s
        Level:       Error
        Impact:      API Availability
        Reference:   https://github.com/openshift/runbooks/blob/master/alerts/cluster-monitoring-operator/ClusterOperatorDegraded.md
        Resources:
          clusteroperators.config.openshift.io: authentication
        Description: APIServerDeploymentDegraded: 1 of 3 requested instances are unavailable for apiserver.openshift-oauth-apiserver ()
                     , OAuthServerDeploymentDegraded: 1 of 3 requested instances are unavailable for oauth-openshift.openshift-authentication ()
      
      Message: Cluster Operator machine-config is degraded (MachineConfigDaemonFailed)
        Since:       44m28s
        Level:       Error
        Impact:      API Availability
        Reference:   https://github.com/openshift/runbooks/blob/master/alerts/cluster-monitoring-operator/ClusterOperatorDegraded.md
        Resources:
          clusteroperators.config.openshift.io: machine-config
        Description: Unable to apply 4.16.0-0.nightly-multi-2024-06-27-053432: error during waitForDaemonsetRollout: [context deadline exceeded, daemonset machine-config-daemon is not ready. status: (desired: 7, updated: 7, ready: 6, unavailable: 1)]
      
      Message: Cluster Operator openshift-apiserver is degraded (APIServerDeployment_UnavailablePod)
        Since:       44m32s
        Level:       Error
        Impact:      API Availability
        Reference:   https://github.com/openshift/runbooks/blob/master/alerts/cluster-monitoring-operator/ClusterOperatorDegraded.md
        Resources:
          clusteroperators.config.openshift.io: openshift-apiserver
        Description: APIServerDeploymentDegraded: 1 of 3 requested instances are unavailable for apiserver.openshift-apiserver ()
      
      Message: Cluster Operator etcd is degraded (EtcdCertSignerController_Error::EtcdEndpoints_ErrorUpdatingEtcdEndpoints::EtcdMembers_UnhealthyMembers::NodeController_MasterNodesReady)
        Since:       1h8m47s
        Level:       Error
        Impact:      API Availability
        Reference:   https://github.com/openshift/runbooks/blob/master/alerts/cluster-monitoring-operator/ClusterOperatorDegraded.md
        Resources:
          clusteroperators.config.openshift.io: etcd
        Description: EtcdCertSignerControllerDegraded: EtcdCertSignerController can't evaluate whether quorum is safe: etcd cluster has quorum of 2 and 2 healthy members which is not fault tolerant: [{Member:ID:1612627219685692483 name:"ci-op-y87p0c68-80996-qqdwx-master-2" peerURLs:"https://10.0.0.7:2380" clientURLs:"https://10.0.0.7:2379"  Healthy:false Took: Error:create client failure: failed to make etcd client for endpoints [https://10.0.0.7:2379]: context deadline exceeded} {Member:ID:7859641474667542729 name:"ci-op-y87p0c68-80996-qqdwx-master-0" peerURLs:"https://10.0.0.8:2380" clientURLs:"https://10.0.0.8:2379"  Healthy:true Took:1.638123ms Error:<nil>} {Member:ID:16394202218457999149 name:"ci-op-y87p0c68-80996-qqdwx-master-1" peerURLs:"https://10.0.0.6:2380" clientURLs:"https://10.0.0.6:2379"  Healthy:true Took:2.867525ms Error:<nil>}]
                     , EtcdEndpointsDegraded: EtcdEndpointsController can't evaluate whether quorum is safe: etcd cluster has quorum of 2 and 2 healthy members which is not fault tolerant: [{Member:ID:1612627219685692483 name:"ci-op-y87p0c68-80996-qqdwx-master-2" peerURLs:"https://10.0.0.7:2380" clientURLs:"https://10.0.0.7:2379"  Healthy:false Took: Error:create client failure: failed to make etcd client for endpoints [https://10.0.0.7:2379]: context deadline exceeded} {Member:ID:7859641474667542729 name:"ci-op-y87p0c68-80996-qqdwx-master-0" peerURLs:"https://10.0.0.8:2380" clientURLs:"https://10.0.0.8:2379"  Healthy:true Took:3.586207ms Error:<nil>} {Member:ID:16394202218457999149 name:"ci-op-y87p0c68-80996-qqdwx-master-1" peerURLs:"https://10.0.0.6:2380" clientURLs:"https://10.0.0.6:2379"  Healthy:true Took:1.774003ms Error:<nil>}]
                     , EtcdMembersDegraded: 2 of 3 members are available, ci-op-y87p0c68-80996-qqdwx-master-2 is unhealthy
                     , NodeControllerDegraded: The master nodes not ready: node "ci-op-y87p0c68-80996-qqdwx-master-2" not ready since 2024-06-30 19:24:33 +0000 UTC because NodeStatusUnknown (Kubelet stopped posting node status.)
      
      Message: Cluster Operator kube-apiserver is degraded (NodeController_MasterNodesReady)
        Since:       1h8m50s
        Level:       Error
        Impact:      API Availability
        Reference:   https://github.com/openshift/runbooks/blob/master/alerts/cluster-monitoring-operator/ClusterOperatorDegraded.md
        Resources:
          clusteroperators.config.openshift.io: kube-apiserver
        Description: NodeControllerDegraded: The master nodes not ready: node "ci-op-y87p0c68-80996-qqdwx-master-2" not ready since 2024-06-30 19:24:33 +0000 UTC because NodeStatusUnknown (Kubelet stopped posting node status.)
      
      Message: Cluster Operator kube-controller-manager is degraded (NodeController_MasterNodesReady)
        Since:       1h8m50s
        Level:       Error
        Impact:      API Availability
        Reference:   https://github.com/openshift/runbooks/blob/master/alerts/cluster-monitoring-operator/ClusterOperatorDegraded.md
        Resources:
          clusteroperators.config.openshift.io: kube-controller-manager
        Description: NodeControllerDegraded: The master nodes not ready: node "ci-op-y87p0c68-80996-qqdwx-master-2" not ready since 2024-06-30 19:24:33 +0000 UTC because NodeStatusUnknown (Kubelet stopped posting node status.)
      
      Message: Cluster Operator kube-scheduler is degraded (NodeController_MasterNodesReady)
        Since:       1h8m50s
        Level:       Error
        Impact:      API Availability
        Reference:   https://github.com/openshift/runbooks/blob/master/alerts/cluster-monitoring-operator/ClusterOperatorDegraded.md
        Resources:
          clusteroperators.config.openshift.io: kube-scheduler
        Description: NodeControllerDegraded: The master nodes not ready: node "ci-op-y87p0c68-80996-qqdwx-master-2" not ready since 2024-06-30 19:24:33 +0000 UTC because NodeStatusUnknown (Kubelet stopped posting node status.)
      
      Message: Cluster Operator control-plane-machine-set is unavailable (UnavailableReplicas)
        Since:       19m45s
        Level:       Warning
        Impact:      API Availability
        Reference:   https://github.com/openshift/runbooks/blob/master/alerts/cluster-monitoring-operator/ClusterOperatorDown.md
        Resources:
          clusteroperators.config.openshift.io: control-plane-machine-set
        Description: Missing 1 available replica(s)
      
      Message: Cluster Version version is failing to proceed with the update (ClusterOperatorsDegraded)
        Since:       28m41s
        Level:       Warning
        Impact:      Update Stalled
        Reference:   https://github.com/openshift/runbooks/blob/master/alerts/cluster-monitoring-operator/ClusterOperatorDegraded.md
        Resources:
          clusterversions.config.openshift.io: version
        Description: Cluster operators etcd, kube-apiserver are degraded
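The etcd messages above report that the cluster "has quorum of 2 and 2 healthy members which is not fault tolerant". A minimal Python sketch of the arithmetic behind that wording follows. It assumes the usual Raft majority rule (quorum = floor(n/2) + 1) and treats a cluster as fault tolerant only when at least one healthy member exists beyond quorum. The function names are illustrative, and this is not the etcd operator's actual code.

```python
def quorum(member_count: int) -> int:
    """Majority needed for an etcd (Raft) cluster of the given size."""
    return member_count // 2 + 1


def is_fault_tolerant(member_count: int, healthy: int) -> bool:
    """True if losing one more healthy member would still leave a quorum.

    Assumption: this mirrors the wording of the operator message,
    not the operator's implementation.
    """
    return healthy - quorum(member_count) >= 1


# Values taken from this report: 3 members, master-2 unhealthy.
members, healthy = 3, 2
print(quorum(members))                      # 2
print(is_fault_tolerant(members, healthy))  # False
print(is_fault_tolerant(members, 3))        # True
```

With master-2 unreachable, the cluster still has quorum (2 of 3 members), but one more member failure would lose it, which matches the "not fault tolerant" wording in the message.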
          

      Expected results:

      The upgrade should complete successfully.
          

      Additional info:

      
          

            dwest@redhat.com Dean West
            rhn-support-jianl Jian Li
            Ge Liu Ge Liu