OCPBUGS-77946

TNF podman-etcd should ignore learners when considering which node has higher revision

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Critical
    • 4.20.z, 4.21.z, 4.22
    • Two Node Fencing
    • Proposed

      Description of problem:

      podman-etcd can attempt recovery in a state where the etcd revision on a learner is higher than the revision on a voter. This seems to be especially prevalent in environments with slow disks, where the voter gets starved for bandwidth, times out, and is then fenced.
      

      Version-Release number of selected component (if applicable):

      Current podman-etcd in 9.6 + 9.8
      

      How reproducible:

      The easiest way to reproduce is to spoof the revision numbers and then fence the voter.
      

      Steps to Reproduce:

          1. Start one of your nodes as a learner
          2. Spoof the revision so it's higher than the voting member node
          3. Fence the voting member node
      

      Actual results:

      Recovery deadlock: no node can recover, because the voter waits for the higher-revision node to start, and the learner crashes because it has no voting members.
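
The deadlock can be modeled with a small sketch. This is not the podman-etcd agent code; the `Node` type, the node names, the revision values, and the "highest revision starts first" rule are all assumptions made only to illustrate the reported behavior.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Node:
    """Hypothetical view of one node's etcd state at recovery time."""
    name: str
    revision: int
    is_learner: bool


def pick_by_highest_revision(nodes: Sequence[Node]) -> Node:
    """Assumed naive rule: the node with the highest revision starts first,
    regardless of whether it is a voter or a learner."""
    return max(nodes, key=lambda n: n.revision)


def can_recover(nodes: Sequence[Node]) -> bool:
    """In this model, recovery only proceeds if the chosen node is a voter.
    A learner chosen to start first crashes, while the voter keeps waiting."""
    chosen = pick_by_highest_revision(nodes)
    return not chosen.is_learner


# Illustrative values only: the learner's revision was spoofed above the voter's.
nodes = [
    Node(name="node-a", revision=100, is_learner=False),
    Node(name="node-b", revision=120, is_learner=True),
]

print(pick_by_highest_revision(nodes).name)
print(can_recover(nodes))
```

In this model the learner is selected to start first, so neither node makes progress.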
      

      Expected results:

      Learners are not real members, so their revision numbers don't matter. We need to check the revision numbers of voting members only when deciding which node to start from.
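
A minimal sketch of the expected selection rule, assuming each member can be described by a name, a revision, and a learner flag. The `Member` type and `pick_start_node` function are hypothetical and do not reflect how the agent actually gathers or stores this data.

```python
from typing import NamedTuple, Optional, Sequence


class Member(NamedTuple):
    """Hypothetical record for one etcd member."""
    name: str
    revision: int
    is_learner: bool


def pick_start_node(members: Sequence[Member]) -> Optional[Member]:
    """Return the voting member with the highest revision.

    Learners are filtered out before comparing revisions, so a learner with
    a higher revision can never be chosen. Returns None if there are no
    voters at all. Ties are not treated specially: max() returns the first
    voter with the highest revision.
    """
    voters = [m for m in members if not m.is_learner]
    if not voters:
        return None
    return max(voters, key=lambda m: m.revision)


# Illustrative values only: the learner has the higher revision.
members = [
    Member(name="node-a", revision=100, is_learner=False),
    Member(name="node-b", revision=120, is_learner=True),
]

chosen = pick_start_node(members)
print(chosen.name if chosen else "no voting member available")
```

With the learner excluded from the comparison, the voter is chosen even though its revision is lower.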
      

      Additional info:

      We observed a similar case in which the revision numbers were equal, so we were able to start the cluster. However, one of the nodes (the learner) crashed immediately, and pacemaker was left in a state where it thought that node was healthy despite etcd not running there. I would have expected the monitor code to catch this, but I didn't watch long enough to observe whether it did.
      

              rh-ee-clobrano Carlo Lobrano
              jpoulin Jeremy Poulin
              Douglas Hensel Douglas Hensel