OpenShift Bugs: OCPBUGS-5988

Degraded etcd on assisted-installer installation: bootstrap etcd is not removed properly


    • Type: Bug
    • Resolution: Done
    • Priority: Major
    • None
    • 4.13, 4.12
    • Etcd
    • None
    • 3
    • ETCD Sprint 230, ETCD Sprint 231
    • 2
    • Rejected
    • False
    • N/A
    • Bug Fix
    • Done

      Description of problem:

      The etcd operator is in a degraded state because one of the masters cannot connect.
      The master that fails to connect was previously the bootstrap node and was pivoted to a master as part of the assisted-installer installation.
      
      Etcd log:
      2023-01-17T23:09:26.523562312Z 28dcf1b0a44481b0, started, test-infra-cluster-04bf4418-master-1, https://192.168.127.11:2380, https://192.168.127.11:2379, false
      2023-01-17T23:09:26.523562312Z 30600b5b86e23c8e, started, etcd-bootstrap, https://192.168.127.12:2380, https://192.168.127.12:2379, false
      2023-01-17T23:09:26.523562312Z 73f00626fee34a87, started, test-infra-cluster-04bf4418-master-0, https://192.168.127.10:2380, https://192.168.127.10:2379, false
      2023-01-17T23:09:26.541214220Z #### attempt 0
      2023-01-17T23:09:26.547811132Z       member={name="test-infra-cluster-04bf4418-master-1", peerURLs=[https://192.168.127.11:2380}, clientURLs=[https://192.168.127.11:2379]
      2023-01-17T23:09:26.547811132Z       member={name="etcd-bootstrap", peerURLs=[https://192.168.127.12:2380}, clientURLs=[https://192.168.127.12:2379]
      2023-01-17T23:09:26.547811132Z       member={name="test-infra-cluster-04bf4418-master-0", peerURLs=[https://192.168.127.10:2380}, clientURLs=[https://192.168.127.10:2379]
      2023-01-17T23:09:26.547811132Z       target={name="etcd-bootstrap", peerURLs=[https://192.168.127.12:2380}, clientURLs=[https://192.168.127.12:2379]
      2023-01-17T23:09:26.547846508Z member "https://192.168.127.12:2380" dataDir has been destroyed and must be removed from the cluster
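
To make the member list above easier to check in CI, here is a minimal Go sketch that parses rows in the format shown in the log (ID, status, name, peer URL, client URL, learner flag) and reports whether a member named "etcd-bootstrap" is still registered. The type `member` and the functions `parseMemberLine` and `findMember` are hypothetical helpers written for this illustration; they are not part of cluster-etcd-operator or etcd.

```go
package main

import (
	"fmt"
	"strings"
)

// member mirrors one row of the member list in the log above.
// Hypothetical type for this illustration only.
type member struct {
	ID, Status, Name, PeerURL, ClientURL string
	IsLearner                            bool
}

// parseMemberLine parses a row such as
// "<timestamp> <id>, <status>, <name>, <peerURL>, <clientURL>, <isLearner>".
// The leading container-log timestamp is optional.
func parseMemberLine(line string) (member, error) {
	parts := strings.Split(strings.TrimSpace(line), ", ")
	if len(parts) != 6 {
		return member{}, fmt.Errorf("expected 6 comma-separated fields, got %d", len(parts))
	}
	// The first field may carry a timestamp prefix; the ID is its last token.
	idTokens := strings.Fields(parts[0])
	if len(idTokens) == 0 {
		return member{}, fmt.Errorf("missing member ID")
	}
	return member{
		ID:        idTokens[len(idTokens)-1],
		Status:    parts[1],
		Name:      parts[2],
		PeerURL:   parts[3],
		ClientURL: parts[4],
		IsLearner: parts[5] == "true",
	}, nil
}

// findMember returns the first parsed member with the given name.
// Lines that do not parse as member rows are skipped.
func findMember(lines []string, name string) (member, bool) {
	for _, l := range lines {
		m, err := parseMemberLine(l)
		if err == nil && m.Name == name {
			return m, true
		}
	}
	return member{}, false
}

func main() {
	lines := []string{
		"2023-01-17T23:09:26.523562312Z 28dcf1b0a44481b0, started, test-infra-cluster-04bf4418-master-1, https://192.168.127.11:2380, https://192.168.127.11:2379, false",
		"2023-01-17T23:09:26.523562312Z 30600b5b86e23c8e, started, etcd-bootstrap, https://192.168.127.12:2380, https://192.168.127.12:2379, false",
		"2023-01-17T23:09:26.523562312Z 73f00626fee34a87, started, test-infra-cluster-04bf4418-master-0, https://192.168.127.10:2380, https://192.168.127.10:2379, false",
	}
	if m, ok := findMember(lines, "etcd-bootstrap"); ok {
		fmt.Printf("bootstrap member still registered: id=%s peer=%s\n", m.ID, m.PeerURL)
	}
}
```

In this log the bootstrap member is still listed under the same peer URL that the pivoted master is trying to use, which matches the "dataDir has been destroyed" error.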
      
      We see a couple of problems:
      1. For an unknown reason, the etcd operator's BootstrapTeardownController fails to start because it cannot see the "openshift-etcd" namespace, although according to the logs the namespace exists.
      2023-01-17T21:39:43.323928903Z E0117 21:39:43.323917       1 base_controller.go:272] BootstrapTeardownController reconciliation failed: failed to get bootstrap scaling strategy: failed to get openshift-etcd names
      
      2. The DelayStrategy code was changed by https://github.com/openshift/cluster-etcd-operator/pull/964/files and now requires 3 healthy members before the bootstrap member is removed. This can cause issues because etcd and cluster-bootstrap (bootkube) are not synchronized: nothing actually prevents the bootstrap node from stopping its etcd, and once it has stopped, removal of the bootstrap etcd member is blocked (at least as I understand the flow).
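
The suspected race in item 2 can be illustrated with a small Go sketch. It is a simplified model of the rule as described in this report, not the operator's actual code: `memberHealth`, `countHealthy` and `canRemoveBootstrap` are hypothetical names, and counting the bootstrap member towards the healthy total is an assumption.

```go
package main

import "fmt"

// memberHealth is a hypothetical, simplified view of one etcd member.
type memberHealth struct {
	Name    string
	Healthy bool
}

// countHealthy returns how many members report healthy.
// Assumption: the bootstrap member counts like any other member.
func countHealthy(members []memberHealth) int {
	n := 0
	for _, m := range members {
		if m.Healthy {
			n++
		}
	}
	return n
}

// canRemoveBootstrap models the rule described in this report:
// removal is only allowed once at least `required` members are healthy.
func canRemoveBootstrap(members []memberHealth, required int) bool {
	return countHealthy(members) >= required
}

func main() {
	// Bootstrap etcd is still running when the check happens.
	inSync := []memberHealth{
		{Name: "master-0", Healthy: true},
		{Name: "master-1", Healthy: true},
		{Name: "etcd-bootstrap", Healthy: true},
	}
	// Bootstrap etcd was stopped before the operator removed the member.
	stoppedEarly := []memberHealth{
		{Name: "master-0", Healthy: true},
		{Name: "master-1", Healthy: true},
		{Name: "etcd-bootstrap", Healthy: false},
	}
	fmt.Println("bootstrap still running:", canRemoveBootstrap(inSync, 3))
	fmt.Println("bootstrap stopped early:", canRemoveBootstrap(stoppedEarly, 3))
}
```

Under these assumptions, the second case never reaches 3 healthy members: the pivoted master cannot join while the stale bootstrap member is registered, and the stale member is not removed while fewer than 3 members are healthy.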
      
      
      

      Version-Release number of selected component (if applicable):

       

      How reproducible:

      As far as I understand it is a race, but it reproduces readily in our CI when installing 4.12 nightlies.

      Steps to Reproduce:

      1.
      2.
      3.
      

      Actual results:

      Etcd is degraded because etcd on the third master to join cannot start.

      Expected results:

      Etcd is healthy

      Additional info:

       

        1. etcd_operator.log
          790 kB
        2. must-gather.tar
          19.83 MB

            tjungblu@redhat.com Thomas Jungblut
            itsoiref@redhat.com Igal Tsoiref
            Sandeep Kundu Sandeep Kundu