OCPBUGS-24270

etcd is degraded on bootstrap node when new cluster is deployed via the assisted-installer


      Description of problem:

      While deploying a brand-new hub cluster for ACM ZTP scale testing, I have observed several times that etcd is degraded right after the initial deployment. The problematic etcd pod always seems to be on the node that served as the bootstrap node.
      
      # oc get clusterversion
      NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
      version             False       False         6m25s   Error while reconciling 4.14.4: the cluster operator etcd is degraded
      
      # oc get no
      NAME               STATUS   ROLES                         AGE   VERSION
      e27-h02-000-r650   Ready    control-plane,master,worker   41m   v1.27.6+d548052
      e27-h03-000-r650   Ready    control-plane,master,worker   44m   v1.27.6+d548052
      e27-h05-000-r650   Ready    control-plane,master,worker   12m   v1.27.6+d548052  
      
      # oc get po -n openshift-etcd
      NAME                           READY   STATUS                  RESTARTS      AGE
      etcd-e27-h02-000-r650          4/4     Running                 0             29m
      etcd-e27-h03-000-r650          4/4     Running                 0             24m
      etcd-e27-h05-000-r650          0/4     Init:CrashLoopBackOff   4 (66s ago)   2m39s
      etcd-guard-e27-h02-000-r650    1/1     Running                 0             32m
      etcd-guard-e27-h03-000-r650    1/1     Running                 0             31m
      etcd-guard-e27-h05-000-r650    0/1     Running                 0             2m23s
      installer-2-e27-h02-000-r650   0/1     Completed               0             33m
      installer-3-e27-h03-000-r650   0/1     Completed               0             32m
      installer-4-e27-h02-000-r650   0/1     Completed               0             31m
      installer-4-e27-h03-000-r650   0/1     Completed               0             27m
      installer-4-e27-h05-000-r650   0/1     Completed               0             3m15s
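
To pick the unhealthy pods out of a listing like the one above, the READY column can be compared programmatically. This is a minimal sketch, assuming the default `oc get po` column layout shown here. The function name `not_ready_pods` and the `SAMPLE` constant are hypothetical and not part of any OpenShift tooling; the sample rows are copied from the output above.

```python
def not_ready_pods(listing: str) -> list[str]:
    """Return names of pods whose READY column shows fewer ready containers than total."""
    unhealthy = []
    for line in listing.strip().splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 3 or "/" not in fields[1]:
            continue  # not a pod row in the expected layout
        name, ready, status = fields[0], fields[1], fields[2]
        if status == "Completed":
            continue  # finished installer pods legitimately report 0/1
        ready_count, _, total = ready.partition("/")
        if ready_count != total:
            unhealthy.append(name)
    return unhealthy


# Rows copied from the listing above (subset).
SAMPLE = """\
NAME                           READY   STATUS                  RESTARTS      AGE
etcd-e27-h02-000-r650          4/4     Running                 0             29m
etcd-e27-h05-000-r650          0/4     Init:CrashLoopBackOff   4 (66s ago)   2m39s
etcd-guard-e27-h05-000-r650    0/1     Running                 0             2m23s
installer-4-e27-h05-000-r650   0/1     Completed               0             3m15s
"""

print(not_ready_pods(SAMPLE))
# prints ['etcd-e27-h05-000-r650', 'etcd-guard-e27-h05-000-r650']
```

Both pods it reports are on e27-h05-000-r650, the node that is only 12 minutes old in the node listing.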

      Version-Release number of selected component (if applicable):

      OCP 4.14.4

      How reproducible:

      This seems to occur on almost every rebuild of a cluster in this environment.

      Steps to Reproduce:

      1.
      2.
      3.
      

      Actual results:

       

      Expected results:

       

      Additional info:

      Restarting the node whose etcd pod is missing or degraded eventually resolves the issue.
      
      # oc describe co etcd 
      Name:         etcd
      Namespace:    
      Labels:       <none>
      Annotations:  exclude.release.openshift.io/internal-openshift-hosted: true
                    include.release.openshift.io/self-managed-high-availability: true
                    include.release.openshift.io/single-node-developer: true
      API Version:  config.openshift.io/v1
      Kind:         ClusterOperator
      Metadata:
        Creation Timestamp:  2023-11-30T14:23:11Z
        Generation:          1
        Owner References:
          API Version:     config.openshift.io/v1
          Controller:      true
          Kind:            ClusterVersion
          Name:            version
          UID:             9c941210-059f-456a-ad36-55bdf6ce778b
        Resource Version:  44648
        UID:               abd34179-4d49-41bb-b2b1-c2f2dbdf2ef4
      Spec:
      Status:
        Conditions:
          Last Transition Time:  2023-11-30T14:43:36Z
          Message:               The etcd backup controller is starting, and will decide if recent backups are available or if a backup is required
          Reason:                ControllerStarted
          Status:                Unknown
          Type:                  RecentBackup
          Last Transition Time:  2023-11-30T14:56:28Z
          Message:               EtcdEndpointsDegraded: EtcdEndpointsController can't evaluate whether quorum is safe: etcd cluster has quorum of 2 which is not fault tolerant: [{Member:ID:1524130501508512834 name:"e27-h02-000-r650" peerURLs:"https://[fc00:1004::5]:2380" clientURLs:"https://[fc00:1004::5]:2379"  Healthy:true Took:502.844µs Error:<nil>} {Member:ID:15648185913586081636 name:"e27-h03-000-r650" peerURLs:"https://[fc00:1004::6]:2380" clientURLs:"https://[fc00:1004::6]:2379"  Healthy:true Took:597.25µs Error:<nil>}]
          Reason:                EtcdEndpoints_ErrorUpdatingEtcdEndpoints
          Status:                True
          Type:                  Degraded
          Last Transition Time:  2023-11-30T15:14:50Z
          Message:               NodeInstallerProgressing: 1 nodes are at revision 0; 2 nodes are at revision 4
          Reason:                NodeInstaller
          Status:                True
          Type:                  Progressing
          Last Transition Time:  2023-11-30T14:45:32Z
          Message:               StaticPodsAvailable: 2 nodes are active; 1 nodes are at revision 0; 2 nodes are at revision 4
      EtcdMembersAvailable: 2 members are available
          Reason:                AsExpected
          Status:                True
          Type:                  Available
          Last Transition Time:  2023-11-30T14:44:06Z
          Message:               All is well
          Reason:                AsExpected
          Status:                True
          Type:                  Upgradeable
        Extension:               <nil>
        Related Objects:
          Group:     operator.openshift.io
          Name:      cluster
          Resource:  etcds
          Group:     
          Name:      openshift-config
          Resource:  namespaces
          Group:     
          Name:      openshift-config-managed
          Resource:  namespaces
          Group:     
          Name:      openshift-etcd-operator
          Resource:  namespaces
          Group:     
          Name:      openshift-etcd
          Resource:  namespaces
        Versions:
          Name:     raw-internal
          Version:  4.14.4
          Name:     etcd
          Version:  4.14.4
          Name:     operator
          Version:  4.14.4
      Events:       <none>
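
The Degraded message in the output above is consistent with etcd's majority-quorum rule: a cluster of n voting members needs floor(n/2) + 1 of them to agree, so it can lose n minus that quorum before becoming unavailable. The following is a short sketch of that arithmetic only. The helper names are illustrative and are not taken from the etcd operator's code.

```python
def quorum(members: int) -> int:
    """Smallest majority of a cluster with the given number of voting members."""
    return members // 2 + 1


def fault_tolerance(members: int) -> int:
    """How many members can fail while a majority still remains."""
    return members - quorum(members)


for n in (1, 2, 3):
    print(n, quorum(n), fault_tolerance(n))
# prints:
# 1 1 0
# 2 2 0
# 3 2 1
```

With only the two members listed in the message (e27-h02-000-r650 and e27-h03-000-r650), the quorum is 2 and no failure can be tolerated. Once a third member joins, the quorum stays at 2 and one failure becomes tolerable.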
      
      

            dwest@redhat.com Dean West
            akrzos@redhat.com Alex Krzos
            ge liu ge liu