[OCPBUGS-36667] Upgrade from 4.15 to 4.16 stalls with etcd degraded

Type: Bug
Resolution: Cannot Reproduce
Priority: Undefined
Fix Version/s: None
Affects Version/s: 4.16
Component/s: Etcd
Labels: None
Regression: No
Blocked: False
Blocked Reason: None

Description of problem:
periodic-ci-openshift-openshift-tests-private-release-4.16-multi-nightly-4.16-upgrade-from-stable-4.15-azure-ipi-fullyprivate-proxy-arm-f28

https://qe-private-deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gs/qe-private-deck/logs/periodic-ci-openshift-openshift-tests-private-release-4.16-multi-nightly-4.16-upgrade-from-stable-4.15-azure-ipi-fullyprivate-proxy-arm-f28/1807425918356951040

Running command: oc adm upgrade --to-image=registry.build02.ci.openshift.org/ci-op-y87p0c68/release@sha256:0453dcf90e6f1e6ba6b8eb197d520c95f47cb2e3906dc4d98902f10412f90ceb --allow-explicit-upgrade --force=true
warning: The requested upgrade image is not one of the available updates. You have used --allow-explicit-upgrade for the update to proceed anyway
warning: --force overrides cluster verification of your supplied release image and waives any update precondition failures.
Requested update to release image registry.build02.ci.openshift.org/ci-op-y87p0c68/release@sha256:0453dcf90e6f1e6ba6b8eb197d520c95f47cb2e3906dc4d98902f10412f90ceb
Upgrading cluster to registry.build02.ci.openshift.org/ci-op-y87p0c68/release@sha256:0453dcf90e6f1e6ba6b8eb197d520c95f47cb2e3906dc4d98902f10412f90ceb gets started...
Starting the upgrade checking on 2024-06-30 18:24:46
Running command: oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.15.19   True        True          4m44s   Working towards 4.16.0-0.nightly-multi-2024-06-27-053432: 110 of 894 done (12% complete), waiting on etcd, kube-apiserver

Version-Release number of selected component (if applicable):

4.16.0-0.nightly-multi-2024-06-27-053432

How reproducible:

Steps to Reproduce:

    1.
    2.
    3.

Actual results:

# oc adm upgrade status

Unable to fetch alerts, ignoring alerts in 'Update Health':  failed to get alerts from Thanos: no token is currently in use for this session
= Control Plane =
Assessment:      Progressing
Target Version:  4.16.0-0.nightly-multi-2024-06-27-053432 (from 4.15.19)
Completion:      97%
Duration:        2h10m21.917016007s
Operator Status: 25 Healthy, 1 Unavailable, 7 Available but degraded

Control Plane Nodes
NAME                                  ASSESSMENT    PHASE      VERSION                                    EST    MESSAGE
ci-op-y87p0c68-80996-qqdwx-master-2   Progressing   Updating   4.15.19                                    +20m   
ci-op-y87p0c68-80996-qqdwx-master-0   Completed     Updated    4.16.0-0.nightly-multi-2024-06-27-053432   -      
ci-op-y87p0c68-80996-qqdwx-master-1   Completed     Updated    4.16.0-0.nightly-multi-2024-06-27-053432   -      

= Worker Upgrade =

= Worker Pool =
Worker Pool:     worker
Assessment:      Completed
Completion:      100%
Worker Status:   4 Total, 4 Available, 0 Progressing, 0 Outdated, 0 Draining, 0 Excluded, 0 Degraded

Worker Pool Nodes
NAME                                                      ASSESSMENT   PHASE     VERSION                                    EST   MESSAGE
ci-op-y87p0c68-80996-qqdwx-41804-k7z6m                    Completed    Updated   4.16.0-0.nightly-multi-2024-06-27-053432   -     
ci-op-y87p0c68-80996-qqdwx-worker-southcentralus1-w4dc4   Completed    Updated   4.16.0-0.nightly-multi-2024-06-27-053432   -     
ci-op-y87p0c68-80996-qqdwx-worker-southcentralus2-8fg97   Completed    Updated   4.16.0-0.nightly-multi-2024-06-27-053432   -     
ci-op-y87p0c68-80996-qqdwx-worker-southcentralus3-2kskb   Completed    Updated   4.16.0-0.nightly-multi-2024-06-27-053432   -     

= Worker Pool =
Worker Pool:     worker-pao
Assessment:      Completed
Completion:      100%
Worker Status:   1 Total, 1 Available, 0 Progressing, 0 Outdated, 0 Draining, 0 Excluded, 0 Degraded

Worker Pool Node
NAME                                                      ASSESSMENT   PHASE     VERSION                                    EST   MESSAGE
ci-op-y87p0c68-80996-qqdwx-worker-southcentralus1-w4dc4   Completed    Updated   4.16.0-0.nightly-multi-2024-06-27-053432   -     

= Update Health =
Message: Cluster Operator authentication is degraded (APIServerDeployment_UnavailablePod::OAuthServerDeployment_UnavailablePod)
  Since:       43m58s
  Level:       Error
  Impact:      API Availability
  Reference:   https://github.com/openshift/runbooks/blob/master/alerts/cluster-monitoring-operator/ClusterOperatorDegraded.md
  Resources:
    clusteroperators.config.openshift.io: authentication
  Description: APIServerDeploymentDegraded: 1 of 3 requested instances are unavailable for apiserver.openshift-oauth-apiserver ()
               , OAuthServerDeploymentDegraded: 1 of 3 requested instances are unavailable for oauth-openshift.openshift-authentication ()

Message: Cluster Operator machine-config is degraded (MachineConfigDaemonFailed)
  Since:       44m28s
  Level:       Error
  Impact:      API Availability
  Reference:   https://github.com/openshift/runbooks/blob/master/alerts/cluster-monitoring-operator/ClusterOperatorDegraded.md
  Resources:
    clusteroperators.config.openshift.io: machine-config
  Description: Unable to apply 4.16.0-0.nightly-multi-2024-06-27-053432: error during waitForDaemonsetRollout: [context deadline exceeded, daemonset machine-config-daemon is not ready. status: (desired: 7, updated: 7, ready: 6, unavailable: 1)]

Message: Cluster Operator openshift-apiserver is degraded (APIServerDeployment_UnavailablePod)
  Since:       44m32s
  Level:       Error
  Impact:      API Availability
  Reference:   https://github.com/openshift/runbooks/blob/master/alerts/cluster-monitoring-operator/ClusterOperatorDegraded.md
  Resources:
    clusteroperators.config.openshift.io: openshift-apiserver
  Description: APIServerDeploymentDegraded: 1 of 3 requested instances are unavailable for apiserver.openshift-apiserver ()

Message: Cluster Operator etcd is degraded (EtcdCertSignerController_Error::EtcdEndpoints_ErrorUpdatingEtcdEndpoints::EtcdMembers_UnhealthyMembers::NodeController_MasterNodesReady)
  Since:       1h8m47s
  Level:       Error
  Impact:      API Availability
  Reference:   https://github.com/openshift/runbooks/blob/master/alerts/cluster-monitoring-operator/ClusterOperatorDegraded.md
  Resources:
    clusteroperators.config.openshift.io: etcd
  Description: EtcdCertSignerControllerDegraded: EtcdCertSignerController can't evaluate whether quorum is safe: etcd cluster has quorum of 2 and 2 healthy members which is not fault tolerant: [{Member:ID:1612627219685692483 name:"ci-op-y87p0c68-80996-qqdwx-master-2" peerURLs:"https://10.0.0.7:2380" clientURLs:"https://10.0.0.7:2379"  Healthy:false Took: Error:create client failure: failed to make etcd client for endpoints [https://10.0.0.7:2379]: context deadline exceeded} {Member:ID:7859641474667542729 name:"ci-op-y87p0c68-80996-qqdwx-master-0" peerURLs:"https://10.0.0.8:2380" clientURLs:"https://10.0.0.8:2379"  Healthy:true Took:1.638123ms Error:<nil>} {Member:ID:16394202218457999149 name:"ci-op-y87p0c68-80996-qqdwx-master-1" peerURLs:"https://10.0.0.6:2380" clientURLs:"https://10.0.0.6:2379"  Healthy:true Took:2.867525ms Error:<nil>}]
               , EtcdEndpointsDegraded: EtcdEndpointsController can't evaluate whether quorum is safe: etcd cluster has quorum of 2 and 2 healthy members which is not fault tolerant: [{Member:ID:1612627219685692483 name:"ci-op-y87p0c68-80996-qqdwx-master-2" peerURLs:"https://10.0.0.7:2380" clientURLs:"https://10.0.0.7:2379"  Healthy:false Took: Error:create client failure: failed to make etcd client for endpoints [https://10.0.0.7:2379]: context deadline exceeded} {Member:ID:7859641474667542729 name:"ci-op-y87p0c68-80996-qqdwx-master-0" peerURLs:"https://10.0.0.8:2380" clientURLs:"https://10.0.0.8:2379"  Healthy:true Took:3.586207ms Error:<nil>} {Member:ID:16394202218457999149 name:"ci-op-y87p0c68-80996-qqdwx-master-1" peerURLs:"https://10.0.0.6:2380" clientURLs:"https://10.0.0.6:2379"  Healthy:true Took:1.774003ms Error:<nil>}]
               , EtcdMembersDegraded: 2 of 3 members are available, ci-op-y87p0c68-80996-qqdwx-master-2 is unhealthy
               , NodeControllerDegraded: The master nodes not ready: node "ci-op-y87p0c68-80996-qqdwx-master-2" not ready since 2024-06-30 19:24:33 +0000 UTC because NodeStatusUnknown (Kubelet stopped posting node status.)

Message: Cluster Operator kube-apiserver is degraded (NodeController_MasterNodesReady)
  Since:       1h8m50s
  Level:       Error
  Impact:      API Availability
  Reference:   https://github.com/openshift/runbooks/blob/master/alerts/cluster-monitoring-operator/ClusterOperatorDegraded.md
  Resources:
    clusteroperators.config.openshift.io: kube-apiserver
  Description: NodeControllerDegraded: The master nodes not ready: node "ci-op-y87p0c68-80996-qqdwx-master-2" not ready since 2024-06-30 19:24:33 +0000 UTC because NodeStatusUnknown (Kubelet stopped posting node status.)

Message: Cluster Operator kube-controller-manager is degraded (NodeController_MasterNodesReady)
  Since:       1h8m50s
  Level:       Error
  Impact:      API Availability
  Reference:   https://github.com/openshift/runbooks/blob/master/alerts/cluster-monitoring-operator/ClusterOperatorDegraded.md
  Resources:
    clusteroperators.config.openshift.io: kube-controller-manager
  Description: NodeControllerDegraded: The master nodes not ready: node "ci-op-y87p0c68-80996-qqdwx-master-2" not ready since 2024-06-30 19:24:33 +0000 UTC because NodeStatusUnknown (Kubelet stopped posting node status.)

Message: Cluster Operator kube-scheduler is degraded (NodeController_MasterNodesReady)
  Since:       1h8m50s
  Level:       Error
  Impact:      API Availability
  Reference:   https://github.com/openshift/runbooks/blob/master/alerts/cluster-monitoring-operator/ClusterOperatorDegraded.md
  Resources:
    clusteroperators.config.openshift.io: kube-scheduler
  Description: NodeControllerDegraded: The master nodes not ready: node "ci-op-y87p0c68-80996-qqdwx-master-2" not ready since 2024-06-30 19:24:33 +0000 UTC because NodeStatusUnknown (Kubelet stopped posting node status.)

Message: Cluster Operator control-plane-machine-set is unavailable (UnavailableReplicas)
  Since:       19m45s
  Level:       Warning
  Impact:      API Availability
  Reference:   https://github.com/openshift/runbooks/blob/master/alerts/cluster-monitoring-operator/ClusterOperatorDown.md
  Resources:
    clusteroperators.config.openshift.io: control-plane-machine-set
  Description: Missing 1 available replica(s)

Message: Cluster Version version is failing to proceed with the update (ClusterOperatorsDegraded)
  Since:       28m41s
  Level:       Warning
  Impact:      Update Stalled
  Reference:   https://github.com/openshift/runbooks/blob/master/alerts/cluster-monitoring-operator/ClusterOperatorDegraded.md
  Resources:
    clusterversions.config.openshift.io: version
  Description: Cluster operators etcd, kube-apiserver are degraded
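
The Update Health entries above follow a regular shape: a "Message:" line naming the cluster operator, its state, and a "::"-separated list of reasons. As a triage aid, the following minimal sketch pulls those three parts out of the text. The helper name `summarize_health` and the regular expression are hypothetical, derived only from the output pasted in this report; the format of `oc adm upgrade status` may differ in other versions.

```python
import re

# Hypothetical pattern, based only on the "Message:" lines shown in this report.
MESSAGE_RE = re.compile(
    r"^Message: Cluster Operator (?P<name>\S+) is "
    r"(?P<state>degraded|unavailable) \((?P<reasons>[^)]*)\)$"
)


def summarize_health(status_text):
    """Return (operator, state, [reasons]) for each Cluster Operator message."""
    rows = []
    for line in status_text.splitlines():
        match = MESSAGE_RE.match(line.strip())
        if match:
            # Reasons are joined with "::" in the status output.
            rows.append((match["name"], match["state"], match["reasons"].split("::")))
    return rows


# Sample lines copied from the Update Health section of this report.
SAMPLE = """\
Message: Cluster Operator etcd is degraded (EtcdCertSignerController_Error::EtcdEndpoints_ErrorUpdatingEtcdEndpoints::EtcdMembers_UnhealthyMembers::NodeController_MasterNodesReady)
Message: Cluster Operator kube-apiserver is degraded (NodeController_MasterNodesReady)
Message: Cluster Operator control-plane-machine-set is unavailable (UnavailableReplicas)
Message: Cluster Version version is failing to proceed with the update (ClusterOperatorsDegraded)
"""

for name, state, reasons in summarize_health(SAMPLE):
    print(name, state, reasons)
```

Grouping the result by reason makes it easier to see that several operators report the same NodeController_MasterNodesReady condition for master-2. The "Cluster Version" line is skipped on purpose because it does not describe a single operator.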

Expected results:

The upgrade should complete successfully.

Additional info:
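
Note on the etcd message "quorum of 2 and 2 healthy members which is not fault tolerant": for a 3-member etcd cluster, a majority is 2 members, so with master-2 unhealthy the cluster still has quorum but cannot lose another member. The arithmetic can be sketched as follows; the function names are illustrative and are not taken from the etcd operator code.

```python
def quorum_size(members: int) -> int:
    """Majority needed for a cluster with `members` voting members."""
    return members // 2 + 1


def failures_tolerated(members: int, healthy: int) -> int:
    """Further member losses the cluster can absorb while keeping quorum.

    Zero means quorum holds but there is no margin left;
    a negative value means quorum is already lost.
    """
    return healthy - quorum_size(members)


# Values from this report: 3 etcd members, master-2 unhealthy.
members, healthy = 3, 2
print(quorum_size(members))                  # 2
print(failures_tolerated(members, healthy))  # 0
```

A result of 0 matches the state reported by the operator: quorum is intact, but the cluster is not fault tolerant until master-2 rejoins.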

Assignee: Dean West
Reporter: Jian Li
QA Contact: Ge Liu
Votes: 0
Watchers: 3
Created: 2024/07/08 7:25 AM
Updated: 2024/10/01 1:55 PM
Resolved: 2024/10/01 1:55 PM
