  1. OpenShift Bugs
  2. OCPBUGS-6878

SNO failed to deploy because etcd is in degraded state

    Description

      Description of problem:

      While deploying 3510 SNOs via ACM and ZTP, 5 out of 19 install failures occurred because the etcd operator reported a degraded state.

      Version-Release number of selected component (if applicable):

      Hub and SNO OCP 4.12.1
      ACM 2.7.0-DOWNSTREAM-2023-01-26-20-15-10

      How reproducible:

      These 5 failures account for more than 25% of the 19 install failures, but for only 5 of the 3510 SNOs attempted (< 0.15% of all clusters attempted).
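
      The two rates quoted above can be checked with a quick calculation. This is a minimal sketch using awk; the helper name `pct` is hypothetical, and only the counts 5, 19 and 3510 come from this report.

```shell
#!/bin/sh
# pct NUM DEN: print NUM/DEN as a percentage with three decimal places.
# Hypothetical helper for illustration, not part of any OpenShift tooling.
pct() {
    awk -v n="$1" -v d="$2" 'BEGIN { printf "%.3f\n", 100 * n / d }'
}

pct 5 19     # share of the 19 install failures attributed to etcd
pct 5 3510   # share of all 3510 SNOs attempted
```

      The first value comes out above 25 and the second below 0.15, matching the figures stated in this section.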

      Steps to Reproduce:

      1.
      2.
      3.
      

      Actual results:

       

      Expected results:

       

      Additional info:

      # cat install_failed_etcd | xargs -I % sh -c "echo -n '% '; oc --kubeconfig=/root/hv-vm/sno/manifests/%/kubeconfig get clusterversion --no-headers"
      sno00389 version         False   False   4d    Error while reconciling 4.12.1: the cluster operator etcd is degraded
      sno00540 version         False   False   3d23h   Error while reconciling 4.12.1: the cluster operator etcd is degraded
      sno01227 version         False   False   3d21h   Error while reconciling 4.12.1: the cluster operator etcd is degraded
      sno01544 version         False   False   3d20h   Error while reconciling 4.12.1: the cluster operator etcd is degraded
      sno01958 version         False   False   3d21h   Error while reconciling 4.12.1: the cluster operator etcd is degraded
      sno03301 version         False   False   3d18h   Error while reconciling 4.12.1: the cluster operator etcd is degraded
      
      
      # cat install_failed_etcd | xargs -I % sh -c "echo -n '% '; oc --kubeconfig=/root/hv-vm/sno/manifests/%/kubeconfig get co etcd --no-headers"
      sno00389 etcd   4.12.1   True   True   True   4d    MissingStaticPodControllerDegraded: static pod lifecycle failure - static pod: "etcd" in namespace: "openshift-etcd" for revision: 4 on node: "sno00389" didn't show up, waited: 3m30s
      sno00540 etcd   4.12.1   True   True   True   3d23h   MissingStaticPodControllerDegraded: static pod lifecycle failure - static pod: "etcd" in namespace: "openshift-etcd" for revision: 3 on node: "sno00540" didn't show up, waited: 3m30s
      sno01227 etcd   4.12.1   True   True   True   3d22h   MissingStaticPodControllerDegraded: static pod lifecycle failure - static pod: "etcd" in namespace: "openshift-etcd" for revision: 4 on node: "sno01227" didn't show up, waited: 3m30s
      sno01544 etcd   4.12.1   True   True   True   3d21h   MissingStaticPodControllerDegraded: static pod lifecycle failure - static pod: "etcd" in namespace: "openshift-etcd" for revision: 3 on node: "sno01544" didn't show up, waited: 3m30s
      sno01958 etcd   4.12.1   True   True   True   3d21h   MissingStaticPodControllerDegraded: static pod lifecycle failure - static pod: "etcd" in namespace: "openshift-etcd" for revision: 3 on node: "sno01958" didn't show up, waited: 3m30s
      sno03301 etcd   4.12.1   True   True   True   3d18h   MissingStaticPodControllerDegraded: static pod lifecycle failure - static pod: "etcd" in namespace: "openshift-etcd" for revision: 3 on node: "sno03301" didn't show up, waited: 3m30s
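
      To see at a glance which revision each cluster is stuck on, the output above can be reduced to a cluster/revision pair per line. This is a rough sketch that assumes the exact line layout shown above; the function name `summarize_revisions` is hypothetical.

```shell
#!/bin/sh
# summarize_revisions: read "<cluster> etcd <version> ... <message>" lines on
# stdin (the layout printed by the xargs loop above) and print
# "<cluster> <revision>" for each MissingStaticPodControllerDegraded message.
# Hypothetical helper; it relies on the word "revision:" being followed by
# the revision number, as in the messages shown in this report.
summarize_revisions() {
    awk '/MissingStaticPodControllerDegraded/ {
        for (i = 1; i < NF; i++) {
            if ($i == "revision:") { print $1, $(i + 1); break }
        }
    }'
}

# Usage: pipe the output of the "get co etcd --no-headers" loop into it, e.g.
#   <loop from above> | summarize_revisions
```

      Matching on the literal token `revision:` keeps the sketch independent of the column count, which varies with the age field and message text.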
      
      

      And a describe on the operator from one of the affected SNOs:

      # oc --kubeconfig /root/hv-vm/sno/manifests/sno00389/kubeconfig describe co etcd
      Name:         etcd
      Namespace:
      Labels:       <none>
      Annotations:  exclude.release.openshift.io/internal-openshift-hosted: true
                    include.release.openshift.io/self-managed-high-availability: true
                    include.release.openshift.io/single-node-developer: true
      API Version:  config.openshift.io/v1
      Kind:         ClusterOperator
      Metadata:
        Creation Timestamp:  2023-01-27T20:55:29Z
        Generation:          1
        Managed Fields:
          API Version:  config.openshift.io/v1
          Fields Type:  FieldsV1
          fieldsV1:
            f:metadata:
              f:annotations:
                .:
                f:exclude.release.openshift.io/internal-openshift-hosted:
                f:include.release.openshift.io/self-managed-high-availability:
                f:include.release.openshift.io/single-node-developer:
              f:ownerReferences:
                .:
                k:{"uid":"d0009f5d-f6f8-45f1-9f5d-c33493a90ac6"}:
            f:spec:
          Manager:      cluster-version-operator
          Operation:    Update
          Time:         2023-01-27T20:55:29Z
          API Version:  config.openshift.io/v1
          Fields Type:  FieldsV1
          fieldsV1:
            f:status:
              .:
              f:extension:
              f:relatedObjects:
          Manager:      cluster-version-operator
          Operation:    Update
          Subresource:  status
          Time:         2023-01-27T20:55:30Z
          API Version:  config.openshift.io/v1
          Fields Type:  FieldsV1
          fieldsV1:
            f:status:
              f:conditions:
              f:versions:
          Manager:      cluster-etcd-operator
          Operation:    Update
          Subresource:  status
          Time:         2023-01-28T16:09:58Z
        Owner References:
          API Version:     config.openshift.io/v1
          Kind:            ClusterVersion
          Name:            version
          UID:             d0009f5d-f6f8-45f1-9f5d-c33493a90ac6
        Resource Version:  239508
        UID:               e1b17331-afc0-4a50-9cd5-7b7fa6dc3b7b
      Spec:
      Status:
        Conditions:
          Last Transition Time:  2023-01-27T21:33:41Z
          Message:               MissingStaticPodControllerDegraded: static pod lifecycle failure - static pod: "etcd" in namespace: "openshift-etcd" for revision: 4 on node: "sno00389" didn't show up, waited: 3m30s
          Reason:                MissingStaticPodController_SyncError
          Status:                True
          Type:                  Degraded
          Last Transition Time:  2023-01-27T21:27:21Z
          Message:               NodeInstallerProgressing: 1 nodes are at revision 3; 0 nodes have achieved new revision 4
          Reason:                NodeInstaller
          Status:                True
          Type:                  Progressing
          Last Transition Time:  2023-01-27T21:27:12Z
          Message:               StaticPodsAvailable: 1 nodes are active; 1 nodes are at revision 3; 0 nodes have achieved new revision 4
      EtcdMembersAvailable: 1 members are available
          Reason:                AsExpected
          Status:                True
          Type:                  Available
          Last Transition Time:  2023-01-27T21:24:10Z
          Message:               All is well
          Reason:                AsExpected
          Status:                True
          Type:                  Upgradeable
          Last Transition Time:  2023-01-27T21:23:20Z
          Message:               The etcd backup controller is starting, and will decide if recent backups are available or if a backup is required
          Reason:                ControllerStarted
          Status:                Unknown
          Type:                  RecentBackup
        Extension:               <nil>
        Related Objects:
          Group:     operator.openshift.io
          Name:      cluster
          Resource:  etcds
          Group:
          Name:      openshift-config
          Resource:  namespaces
          Group:
          Name:      openshift-config-managed
          Resource:  namespaces
          Group:
          Name:      openshift-etcd-operator
          Resource:  namespaces
          Group:
          Name:      openshift-etcd
          Resource:  namespaces
        Versions:
          Name:     raw-internal
          Version:  4.12.1
          Name:     operator
          Version:  4.12.1
          Name:     etcd
          Version:  4.12.1
      Events:       <none>
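
      When only the Degraded condition is of interest, a JSONPath query avoids reading the full describe output. This is a minimal sketch, assuming `oc` is on the PATH and the kubeconfig path is passed as the first argument; the function name `degraded_message` is hypothetical.

```shell
#!/bin/sh
# degraded_message KUBECONFIG: print only the message of the Degraded
# condition on the etcd ClusterOperator. Hypothetical helper name; the
# JSONPath filter selects the condition whose type is "Degraded".
degraded_message() {
    oc --kubeconfig "$1" get clusteroperator etcd \
        -o jsonpath='{.status.conditions[?(@.type=="Degraded")].message}'
}

# Example, using a kubeconfig path from this report:
#   degraded_message /root/hv-vm/sno/manifests/sno00389/kubeconfig
```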
      

            People

              rphillip@redhat.com Ryan Phillips
              akrzos@redhat.com Alex Krzos
              ge liu ge liu