
Type: Bug
Resolution: Done
Priority: Undefined
Fix Version/s: None
Affects Version/s: 4.14.z
Component/s: Etcd
Labels:
- perfscale-telco-5g
- telco-5g

Regression: No
Blocked: False
Blocked Reason: None

SFDC Cases Counter:
SFDC Cases Links:

Description of problem:

While deploying a brand-new hub cluster for ACM ZTP scale testing, I have now observed several times that etcd is degraded immediately after the initial deployment of the cluster. The node that served as the bootstrap node always seems to host the problematic etcd pod.

# oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version             False       False         6m25s   Error while reconciling 4.14.4: the cluster operator etcd is degraded

# oc get no
NAME               STATUS   ROLES                         AGE   VERSION
e27-h02-000-r650   Ready    control-plane,master,worker   41m   v1.27.6+d548052
e27-h03-000-r650   Ready    control-plane,master,worker   44m   v1.27.6+d548052
e27-h05-000-r650   Ready    control-plane,master,worker   12m   v1.27.6+d548052  

# oc get po -n openshift-etcd
NAME                           READY   STATUS                  RESTARTS      AGE
etcd-e27-h02-000-r650          4/4     Running                 0             29m
etcd-e27-h03-000-r650          4/4     Running                 0             24m
etcd-e27-h05-000-r650          0/4     Init:CrashLoopBackOff   4 (66s ago)   2m39s
etcd-guard-e27-h02-000-r650    1/1     Running                 0             32m
etcd-guard-e27-h03-000-r650    1/1     Running                 0             31m
etcd-guard-e27-h05-000-r650    0/1     Running                 0             2m23s
installer-2-e27-h02-000-r650   0/1     Completed               0             33m
installer-3-e27-h03-000-r650   0/1     Completed               0             32m
installer-4-e27-h02-000-r650   0/1     Completed               0             31m
installer-4-e27-h03-000-r650   0/1     Completed               0             27m
installer-4-e27-h05-000-r650   0/1     Completed               0             3m15s
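
To spot the affected node quickly in a listing like the one above, the READY column can be checked programmatically. The following is a minimal sketch, not part of the original report: the function name `not_ready_pods` and the `SAMPLE` text (a subset of the rows above) are illustrative assumptions, and the parser assumes the default whitespace-separated column layout of `oc get po`.

```python
def not_ready_pods(listing: str) -> list[tuple[str, str, str]]:
    """Return (name, ready, status) for pods that are not fully ready.

    Pods in the Completed state (for example, finished installer pods)
    are skipped, since 0/1 is their normal READY value.
    """
    flagged = []
    for line in listing.strip().splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 3:
            continue  # ignore blank or malformed rows
        name, ready, status = fields[0], fields[1], fields[2]
        ready_count, total = (int(n) for n in ready.split("/"))
        if status != "Completed" and ready_count < total:
            flagged.append((name, ready, status))
    return flagged


# Illustrative input: a subset of the `oc get po -n openshift-etcd` rows above.
SAMPLE = """\
NAME                           READY   STATUS                  RESTARTS      AGE
etcd-e27-h02-000-r650          4/4     Running                 0             29m
etcd-e27-h03-000-r650          4/4     Running                 0             24m
etcd-e27-h05-000-r650          0/4     Init:CrashLoopBackOff   4 (66s ago)   2m39s
etcd-guard-e27-h05-000-r650    0/1     Running                 0             2m23s
installer-4-e27-h05-000-r650   0/1     Completed               0             3m15s
"""

for name, ready, status in not_ready_pods(SAMPLE):
    print(f"{name}: {ready} {status}")
```

On the sample rows this flags only the two pods on e27-h05-000-r650, which matches the node described above as hosting the problematic etcd pod.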

Version-Release number of selected component (if applicable):

OCP 4.14.4

How reproducible:

This seems to occur on almost every rebuild of a cluster in this environment.

Steps to Reproduce:

1.
2.
3.

Actual results:

Expected results:

Additional info:

Restarting the node whose etcd pod is missing or degraded eventually resolves the issue.

# oc describe co etcd 
Name:         etcd
Namespace:    
Labels:       <none>
Annotations:  exclude.release.openshift.io/internal-openshift-hosted: true
              include.release.openshift.io/self-managed-high-availability: true
              include.release.openshift.io/single-node-developer: true
API Version:  config.openshift.io/v1
Kind:         ClusterOperator
Metadata:
  Creation Timestamp:  2023-11-30T14:23:11Z
  Generation:          1
  Owner References:
    API Version:     config.openshift.io/v1
    Controller:      true
    Kind:            ClusterVersion
    Name:            version
    UID:             9c941210-059f-456a-ad36-55bdf6ce778b
  Resource Version:  44648
  UID:               abd34179-4d49-41bb-b2b1-c2f2dbdf2ef4
Spec:
Status:
  Conditions:
    Last Transition Time:  2023-11-30T14:43:36Z
    Message:               The etcd backup controller is starting, and will decide if recent backups are available or if a backup is required
    Reason:                ControllerStarted
    Status:                Unknown
    Type:                  RecentBackup
    Last Transition Time:  2023-11-30T14:56:28Z
    Message:               EtcdEndpointsDegraded: EtcdEndpointsController can't evaluate whether quorum is safe: etcd cluster has quorum of 2 which is not fault tolerant: [{Member:ID:1524130501508512834 name:"e27-h02-000-r650" peerURLs:"https://[fc00:1004::5]:2380" clientURLs:"https://[fc00:1004::5]:2379"  Healthy:true Took:502.844µs Error:<nil>} {Member:ID:15648185913586081636 name:"e27-h03-000-r650" peerURLs:"https://[fc00:1004::6]:2380" clientURLs:"https://[fc00:1004::6]:2379"  Healthy:true Took:597.25µs Error:<nil>}]
    Reason:                EtcdEndpoints_ErrorUpdatingEtcdEndpoints
    Status:                True
    Type:                  Degraded
    Last Transition Time:  2023-11-30T15:14:50Z
    Message:               NodeInstallerProgressing: 1 nodes are at revision 0; 2 nodes are at revision 4
    Reason:                NodeInstaller
    Status:                True
    Type:                  Progressing
    Last Transition Time:  2023-11-30T14:45:32Z
    Message:               StaticPodsAvailable: 2 nodes are active; 1 nodes are at revision 0; 2 nodes are at revision 4
EtcdMembersAvailable: 2 members are available
    Reason:                AsExpected
    Status:                True
    Type:                  Available
    Last Transition Time:  2023-11-30T14:44:06Z
    Message:               All is well
    Reason:                AsExpected
    Status:                True
    Type:                  Upgradeable
  Extension:               <nil>
  Related Objects:
    Group:     operator.openshift.io
    Name:      cluster
    Resource:  etcds
    Group:     
    Name:      openshift-config
    Resource:  namespaces
    Group:     
    Name:      openshift-config-managed
    Resource:  namespaces
    Group:     
    Name:      openshift-etcd-operator
    Resource:  namespaces
    Group:     
    Name:      openshift-etcd
    Resource:  namespaces
  Versions:
    Name:     raw-internal
    Version:  4.14.4
    Name:     etcd
    Version:  4.14.4
    Name:     operator
    Version:  4.14.4
Events:       <none>
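
For context on the Degraded message above ("quorum of 2 which is not fault tolerant"): an etcd cluster needs a majority of its members to agree, so with only two members registered, both must stay healthy. The sketch below works through that arithmetic using the standard majority formula. The function names are illustrative assumptions and this is not the etcd operator's own code.

```python
def quorum(members: int) -> int:
    """Smallest majority of a cluster with the given member count."""
    return members // 2 + 1


def fault_tolerance(members: int) -> int:
    """How many members can fail while a majority is still reachable."""
    return members - quorum(members)


# Compare the two-member state reported above with the intended three members.
for n in (2, 3):
    print(f"members={n} quorum={quorum(n)} tolerated_failures={fault_tolerance(n)}")
```

With two members the quorum is 2 and no failure can be tolerated, whereas the intended three-member cluster keeps the same quorum of 2 but survives the loss of one member.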


Attachments:

etcd-degraded-deployment.tar.gz (31.62 MB, 2023/11/30 3:52 PM)
