OpenShift Bugs / OCPBUGS-62856

TNF - etcd discovery fails due to a race condition during back-to-back node reboots


    • Type: Bug
    • Resolution: Unresolved
    • Priority: Normal
    • 4.20
    • Two Node Fencing
    • Quality / Stability / Reliability
    • False
    • Important
    • Rejected

      Description of problem:

      On master-1, podman-etcd restarts and decides to start etcd as a learner (good). It then sees that it is already in the member list (CIB learner_node=master-1) and starts (bad). Meanwhile, etcd on master-0 is shutting down, so etcd discovery on master-1 fails.

      Version-Release number of selected component (if applicable):

          

      How reproducible:

          75%

      Steps to Reproduce:

      1. Deploy a TNF cluster
      2. Reboot master-0; wait for the node to come back up, but not for etcd to start completely
      3. Poll podman etcd on master-0 to determine what state it is in
      4. Reboot master-1 while etcd on master-0 is still coming up
      5. Poll podman etcd on master-1 to view the failure
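
      The polling steps above can be sketched as a small shell helper. This is only an illustration, not part of the reproducer as reported: the function name poll_members, the MEMBER_LIST_CMD variable, and the default attempt count and interval are assumptions. The only command taken from this report is the member list call itself.

```shell
#!/bin/sh
# Sketch only: poll the etcd member list a fixed number of times.
# The default command is the one used in this report; MEMBER_LIST_CMD is a
# hypothetical override so the loop can be tried without a cluster.
MEMBER_LIST_CMD="${MEMBER_LIST_CMD:-sudo podman exec etcd etcdctl member list -w table}"

# poll_members [attempts] [interval_seconds]
poll_members() {
    pm_attempts=${1:-30}   # assumed default, not from the report
    pm_interval=${2:-5}    # assumed default, not from the report
    pm_i=1
    while [ "$pm_i" -le "$pm_attempts" ]; do
        echo "--- attempt ${pm_i}/${pm_attempts} ---"
        # Right after a reboot the etcd container may not exist yet,
        # so a failing call is reported and polling continues.
        $MEMBER_LIST_CMD 2>&1 || echo "member list unavailable (exit $?)"
        if [ "$pm_i" -lt "$pm_attempts" ]; then
            sleep "$pm_interval"
        fi
        pm_i=$((pm_i + 1))
    done
}
```

      For example, running `poll_members 60 2` on master-1 right after its reboot would print the member table (or the failure) every two seconds for about two minutes.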
          

      Actual results:

      [core@master-0 ~]$ sudo podman exec etcd etcdctl member list -w table
      +------------------+-----------+----------+-----------------------------+-----------------------------+------------+
      |        ID        |  STATUS   |   NAME   |         PEER ADDRS          |        CLIENT ADDRS         | IS LEARNER |
      +------------------+-----------+----------+-----------------------------+-----------------------------+------------+
      | 1437c5d88d1a66e3 | unstarted |          | https://192.168.111.21:2380 |                             |       true |
      | e42e3c3a55c27ed6 |   started | master-0 | https://192.168.111.20:2380 | https://192.168.111.20:2379 |      false |
      +------------------+-----------+----------+-----------------------------+-----------------------------+------------+    

      Expected results:

      [core@master-1 ~]$ sudo podman exec etcd etcdctl member list -w table
      +------------------+---------+----------+-----------------------------+-----------------------------+------------+
      |        ID        | STATUS  |   NAME   |         PEER ADDRS          |        CLIENT ADDRS         | IS LEARNER |
      +------------------+---------+----------+-----------------------------+-----------------------------+------------+
      | 3d9d9acb04427a83 | started | master-0 | https://192.168.111.20:2380 | https://192.168.111.20:2379 |      false |
      | caedfca0719c8594 | started | master-1 | https://192.168.111.21:2380 | https://192.168.111.21:2379 |      false |
      +------------------+---------+----------+-----------------------------+-----------------------------+------------+
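
      To tell the failing state from the expected one without reading the table by eye, the output can be filtered for members that are both unstarted and learners. The following is a rough sketch that assumes the column layout shown in the tables above; the function name find_unstarted_learners is hypothetical.

```shell
#!/bin/sh
# Sketch only: read `etcdctl member list -w table` output on stdin and print
# the peer address of every member that is both "unstarted" and a learner.
# Assumes the column order shown in this report:
#   ID | STATUS | NAME | PEER ADDRS | CLIENT ADDRS | IS LEARNER
# Exit status is 0 if at least one such member is found, 1 otherwise.
find_unstarted_learners() {
    awk -F'|' '
        NF >= 7 {
            status = $3; peer = $5; learner = $7
            gsub(/ /, "", status)
            gsub(/ /, "", peer)
            gsub(/ /, "", learner)
            if (status == "unstarted" && learner == "true") {
                print peer
                found = 1
            }
        }
        END { exit (found ? 0 : 1) }
    '
}
```

      A possible invocation is `sudo podman exec etcd etcdctl member list -w table | find_unstarted_learners`. With the table under "Actual results" it prints the master-1 peer address; with the table under "Expected results" it prints nothing and returns a non-zero status.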
      

      Additional info:

      https://redhat-internal.slack.com/archives/C07ABRBBDK3/p1759864654684679

              Carlo Lobrano (rh-ee-clobrano)
              Douglas Hensel (rh-ee-dhensel)
              Douglas Hensel
              Srikanth R