OpenShift Bugs / OCPBUGS-6935

[4.12] Degraded etcd on assisted-installer installation: bootstrap etcd is not removed properly


    • ETCD Sprint 231, ETCD Sprint 232



      This is a clone of issue OCPBUGS-5988. The following is the description of the original issue:

      Description of problem:

      The etcd operator is in a degraded state because one of the masters cannot connect.
      The master that fails to connect was previously the bootstrap node and pivoted to a master as part of the assisted-installer installation.
      Etcd log:
      2023-01-17T23:09:26.523562312Z 28dcf1b0a44481b0, started, test-infra-cluster-04bf4418-master-1,,, false
      2023-01-17T23:09:26.523562312Z 30600b5b86e23c8e, started, etcd-bootstrap,,, false
      2023-01-17T23:09:26.523562312Z 73f00626fee34a87, started, test-infra-cluster-04bf4418-master-0,,, false
      2023-01-17T23:09:26.541214220Z #### attempt 0
      2023-01-17T23:09:26.547811132Z       member={name="test-infra-cluster-04bf4418-master-1", peerURLs=[}, clientURLs=[]
      2023-01-17T23:09:26.547811132Z       member={name="etcd-bootstrap", peerURLs=[}, clientURLs=[]
      2023-01-17T23:09:26.547811132Z       member={name="test-infra-cluster-04bf4418-master-0", peerURLs=[}, clientURLs=[]
      2023-01-17T23:09:26.547811132Z       target={name="etcd-bootstrap", peerURLs=[}, clientURLs=[]
      2023-01-17T23:09:26.547846508Z member "" dataDir has been destroyed and must be removed from the cluster
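The member list in the log above still contains `etcd-bootstrap`. As a minimal sketch of how such log lines could be checked automatically (for example in a CI triage script), the Go snippet below parses lines of the form `<timestamp> <id>, <status>, <name>,,, <isLearner>` and reports whether a member named `etcd-bootstrap` is still present. The `Member` type and the `parseMemberLine` and `hasBootstrapMember` helpers are hypothetical names introduced for illustration; they are not part of cluster-etcd-operator.

```go
package main

import (
	"fmt"
	"strings"
)

// Member is a hypothetical, simplified view of one member-list log line.
type Member struct {
	ID     string
	Status string
	Name   string
}

// parseMemberLine parses a line shaped like
// "<timestamp> <id>, <status>, <name>,,, <isLearner>".
// It returns false for lines that do not have at least three comma-separated fields.
func parseMemberLine(line string) (Member, bool) {
	fields := strings.Split(line, ",")
	if len(fields) < 3 {
		return Member{}, false
	}
	// The first field is "<timestamp> <id>"; the ID is its last token.
	head := strings.Fields(fields[0])
	if len(head) == 0 {
		return Member{}, false
	}
	return Member{
		ID:     head[len(head)-1],
		Status: strings.TrimSpace(fields[1]),
		Name:   strings.TrimSpace(fields[2]),
	}, true
}

// hasBootstrapMember reports whether any parsed line names the member "etcd-bootstrap".
func hasBootstrapMember(lines []string) bool {
	for _, l := range lines {
		if m, ok := parseMemberLine(l); ok && m.Name == "etcd-bootstrap" {
			return true
		}
	}
	return false
}

func main() {
	// The three member-list lines from the etcd log in this report.
	lines := []string{
		"2023-01-17T23:09:26.523562312Z 28dcf1b0a44481b0, started, test-infra-cluster-04bf4418-master-1,,, false",
		"2023-01-17T23:09:26.523562312Z 30600b5b86e23c8e, started, etcd-bootstrap,,, false",
		"2023-01-17T23:09:26.523562312Z 73f00626fee34a87, started, test-infra-cluster-04bf4418-master-0,,, false",
	}
	fmt.Println(hasBootstrapMember(lines)) // prints "true"
}
```
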
      We see a couple of problems:
      1. For an unknown reason, the etcd operator's BootstrapTeardownController fails to start because it cannot see the "openshift-etcd" namespace, although according to the logs the namespace exists.
      2023-01-17T21:39:43.323928903Z E0117 21:39:43.323917       1 base_controller.go:272] BootstrapTeardownController reconciliation failed: failed to get bootstrap scaling strategy: failed to get openshift-etcd names
      2. The DelayStrategy code was changed by https://github.com/openshift/cluster-etcd-operator/pull/964/files and now requires 3 healthy members before the bootstrap member is removed. This can cause problems because etcd and cluster-bootstrap (bootkube) are not synchronized: nothing actually prevents the bootstrap node from stopping its etcd, which in turn blocks removal of the bootstrap etcd member (at least as I understand the flow).
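The race described in point 2 can be illustrated with a small sketch. This is not the operator's actual code: `member` and `canRemoveBootstrap` are hypothetical names, and the sketch assumes the gate simply counts healthy members across the whole member list (the report does not say whether the real check counts the bootstrap member). It shows that once the bootstrap etcd has stopped, a "3 healthy members" requirement cannot be met by the two remaining masters.

```go
package main

import "fmt"

// member is a hypothetical, simplified etcd member with a health flag.
type member struct {
	name    string
	healthy bool
}

// canRemoveBootstrap is an illustrative gate: removal is allowed only
// when at least `required` members are healthy.
func canRemoveBootstrap(members []member, required int) bool {
	healthy := 0
	for _, m := range members {
		if m.healthy {
			healthy++
		}
	}
	return healthy >= required
}

func main() {
	// Bootstrap etcd still running alongside two masters: the gate passes.
	before := []member{
		{"etcd-bootstrap", true},
		{"test-infra-cluster-04bf4418-master-0", true},
		{"test-infra-cluster-04bf4418-master-1", true},
	}
	// Bootstrap etcd already stopped: only two healthy members remain, and
	// the pivoted node cannot join while etcd-bootstrap is still a member.
	after := []member{
		{"etcd-bootstrap", false},
		{"test-infra-cluster-04bf4418-master-0", true},
		{"test-infra-cluster-04bf4418-master-1", true},
	}
	fmt.Println(canRemoveBootstrap(before, 3)) // prints "true"
	fmt.Println(canRemoveBootstrap(after, 3))  // prints "false"
}
```

In the second case the sketch never reaches three healthy members, which matches the stuck state reported here: the bootstrap member stays in the list and the third master's etcd cannot start.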

      Version-Release number of selected component (if applicable):


      How reproducible:

      As far as I understand it is a race, but it reproduces frequently in our CI when installing 4.12 nightlies.

      Steps to Reproduce:


      Actual results:

      Etcd is degraded because etcd on the third master to join cannot start.

      Expected results:

      Etcd is healthy

      Additional info:


            tjungblu@redhat.com Thomas Jungblut
            openshift-crt-jira-prow OpenShift Prow Bot
            Sandeep Kundu Sandeep Kundu