-
Bug
-
Resolution: Done
-
Undefined
-
None
-
4.14.z
-
No
-
False
-
Description of problem:
While deploying a brand new hub cluster for ACM ZTP scale testing, I have now observed several times that etcd is degraded upon initial deployment of the cluster. The node which served as the bootstrap node seems to always host the problematic etcd pod.

# oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version             False       False         6m25s   Error while reconciling 4.14.4: the cluster operator etcd is degraded

# oc get no
NAME               STATUS   ROLES                         AGE   VERSION
e27-h02-000-r650   Ready    control-plane,master,worker   41m   v1.27.6+d548052
e27-h03-000-r650   Ready    control-plane,master,worker   44m   v1.27.6+d548052
e27-h05-000-r650   Ready    control-plane,master,worker   12m   v1.27.6+d548052

# oc get po -n openshift-etcd
NAME                            READY   STATUS                  RESTARTS      AGE
etcd-e27-h02-000-r650           4/4     Running                 0             29m
etcd-e27-h03-000-r650           4/4     Running                 0             24m
etcd-e27-h05-000-r650           0/4     Init:CrashLoopBackOff   4 (66s ago)   2m39s
etcd-guard-e27-h02-000-r650     1/1     Running                 0             32m
etcd-guard-e27-h03-000-r650     1/1     Running                 0             31m
etcd-guard-e27-h05-000-r650     0/1     Running                 0             2m23s
installer-2-e27-h02-000-r650    0/1     Completed               0             33m
installer-3-e27-h03-000-r650    0/1     Completed               0             32m
installer-4-e27-h02-000-r650    0/1     Completed               0             31m
installer-4-e27-h03-000-r650    0/1     Completed               0             27m
installer-4-e27-h05-000-r650    0/1     Completed               0             3m15s
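For anyone triaging a similar listing, the unhealthy pods can be picked out by filtering on the READY column. This is only a minimal sketch and was not part of the original investigation; the function name `not_ready_pods` is hypothetical, and it assumes the default `oc get po` column order (NAME, READY, STATUS, ...).

```shell
# Hypothetical helper: read `oc get po` output on stdin and print the pods
# whose READY count is incomplete, skipping Completed installer pods.
not_ready_pods() {
  awk 'NR > 1 && $3 != "Completed" {
         split($2, r, "/")              # READY has the form "ready/total"
         if (r[1] != r[2]) print $1, $2, $3
       }'
}
```

Possible usage: `oc get po -n openshift-etcd | not_ready_pods`. Against the listing in this report it would be expected to flag the etcd and etcd-guard pods on e27-h05-000-r650.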
Version-Release number of selected component (if applicable):
OCP 4.14.4
How reproducible:
This seems to occur on almost every new rebuild of a cluster in this environment.
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Restarting the node that is missing or showing a degraded etcd pod resolves the issue eventually.

# oc describe co etcd
Name:         etcd
Namespace:
Labels:       <none>
Annotations:  exclude.release.openshift.io/internal-openshift-hosted: true
              include.release.openshift.io/self-managed-high-availability: true
              include.release.openshift.io/single-node-developer: true
API Version:  config.openshift.io/v1
Kind:         ClusterOperator
Metadata:
  Creation Timestamp:  2023-11-30T14:23:11Z
  Generation:          1
  Owner References:
    API Version:     config.openshift.io/v1
    Controller:      true
    Kind:            ClusterVersion
    Name:            version
    UID:             9c941210-059f-456a-ad36-55bdf6ce778b
  Resource Version:  44648
  UID:               abd34179-4d49-41bb-b2b1-c2f2dbdf2ef4
Spec:
Status:
  Conditions:
    Last Transition Time:  2023-11-30T14:43:36Z
    Message:               The etcd backup controller is starting, and will decide if recent backups are available or if a backup is required
    Reason:                ControllerStarted
    Status:                Unknown
    Type:                  RecentBackup
    Last Transition Time:  2023-11-30T14:56:28Z
    Message:               EtcdEndpointsDegraded: EtcdEndpointsController can't evaluate whether quorum is safe: etcd cluster has quorum of 2 which is not fault tolerant: [{Member:ID:1524130501508512834 name:"e27-h02-000-r650" peerURLs:"https://[fc00:1004::5]:2380" clientURLs:"https://[fc00:1004::5]:2379" Healthy:true Took:502.844µs Error:<nil>} {Member:ID:15648185913586081636 name:"e27-h03-000-r650" peerURLs:"https://[fc00:1004::6]:2380" clientURLs:"https://[fc00:1004::6]:2379" Healthy:true Took:597.25µs Error:<nil>}]
    Reason:                EtcdEndpoints_ErrorUpdatingEtcdEndpoints
    Status:                True
    Type:                  Degraded
    Last Transition Time:  2023-11-30T15:14:50Z
    Message:               NodeInstallerProgressing: 1 nodes are at revision 0; 2 nodes are at revision 4
    Reason:                NodeInstaller
    Status:                True
    Type:                  Progressing
    Last Transition Time:  2023-11-30T14:45:32Z
    Message:               StaticPodsAvailable: 2 nodes are active; 1 nodes are at revision 0; 2 nodes are at revision 4
                           EtcdMembersAvailable: 2 members are available
    Reason:                AsExpected
    Status:                True
    Type:                  Available
    Last Transition Time:  2023-11-30T14:44:06Z
    Message:               All is well
    Reason:                AsExpected
    Status:                True
    Type:                  Upgradeable
  Extension:               <nil>
  Related Objects:
    Group:     operator.openshift.io
    Name:      cluster
    Resource:  etcds
    Group:
    Name:      openshift-config
    Resource:  namespaces
    Group:
    Name:      openshift-config-managed
    Resource:  namespaces
    Group:
    Name:      openshift-etcd-operator
    Resource:  namespaces
    Group:
    Name:      openshift-etcd
    Resource:  namespaces
  Versions:
    Name:     raw-internal
    Version:  4.14.4
    Name:     etcd
    Version:  4.14.4
    Name:     operator
    Version:  4.14.4
Events:       <none>
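
For context on the Degraded message ("quorum of 2 which is not fault tolerant"), the sketch below shows the standard majority-quorum arithmetic for an etcd cluster. It is an illustration only; the function names `quorum` and `fault_tolerance` are hypothetical, and how the operator computes its own check internally is not shown in this report.

```shell
# Majority quorum for an etcd cluster with N voting members.
quorum() { echo $(( $1 / 2 + 1 )); }

# Number of members that can fail while a quorum is still reachable.
fault_tolerance() { echo $(( $1 - ($1 / 2 + 1) )); }

for n in 1 2 3; do
  echo "members=$n quorum=$(quorum "$n") tolerates=$(fault_tolerance "$n")"
done
```

With only the two healthy members listed in the condition, the quorum is 2 and no member failure can be tolerated; once the third member joins, the quorum stays at 2 and one failure can be tolerated.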