OpenShift Bugs / OCPBUGS-59238

[TNF] Quick podman-etcd restart results in failure to start


    • Bug
    • Resolution: Unresolved
    • Major
    • 4.19.0, 4.20
    • Two Node Fencing
    • Quality / Stability / Reliability
    • OCPEDGE Sprint 278, OCPEDGE Sprint 279, OCPEDGE Sprint 280
    • 3
    • In Progress
    • Bug Fix
      *Cause*: Rapid restart of a podman etcd container in a two member etcd cluster
      *Consequence*: etcd cluster doesn't recover from the node loss and ends up with only one member
      *Fix*: Fixed the resource count according to Pacemaker guidelines
      *Result*: Pacemaker can progress with handling the loss of a member

      Description of problem:

      A rapid restart of podman-etcd fails, most likely because of a misalignment in the clone notification environment variables[1] that count the number of active and inactive agents. This also stalls cluster recovery.

      Jul 11 09:26:59 master-0 pacemaker-controld[1885]:  notice: Result of stop operation for etcd on master-0: ok
      Jul 11 09:26:59 master-0 pacemaker-controld[1885]:  notice: Requesting local execution of start operation for etcd on master-0
      Jul 11 09:27:00 master-0 podman-etcd(etcd)[9729]: NOTICE: podman-etcd start
      Jul 11 09:27:00 master-0 podman-etcd(etcd)[9762]: INFO: ensure etcd pod is not running (retries: 60, interval: 10)
      Jul 11 09:27:00 master-0 podman-etcd(etcd)[9896]: ERROR: Unexpected active resource count: 2
      Jul 11 09:27:00 master-0 pacemaker-controld[1885]:  notice: Result of start operation for etcd on master-0: error
      

      [1]: https://clusterlabs.org/projects/pacemaker/doc/2.1/Pacemaker_Administration/html/agents.html#clone-notifications
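
      As a minimal bash sketch (not the actual podman-etcd agent code, only an assumption about how such an agent might count instances), this shows how the clone notification variables from [1] can yield an unexpected active count when a start races with a quick stop:

      #!/bin/bash
      # Illustrative sketch only -- not taken from the podman-etcd resource agent.
      # Pacemaker passes space-separated lists of clone instances (e.g. "etcd:0 etcd:1")
      # to agents of clones that use notifications.
      active="${OCF_RESKEY_CRM_meta_notify_active_resource:-}"
      inactive="${OCF_RESKEY_CRM_meta_notify_inactive_resource:-}"

      active_count=$(echo "$active" | wc -w)
      inactive_count=$(echo "$inactive" | wc -w)

      # If the instance that was just stopped is still listed as active, a start on a
      # two-member cluster sees active_count=2 and bails out, matching the
      # "Unexpected active resource count: 2" error in the log above.
      if [ "$active_count" -gt 1 ]; then
          echo "ERROR: Unexpected active resource count: $active_count" >&2
          exit 1  # OCF_ERR_GENERIC
      fi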

       

      Steps to reproduce

      In a stable Two Node Fencing cluster (any version; the problem is in the underlying RHCOS packages) whose etcd cluster has 2 members, abruptly killing one of the members triggers this bug.

      "sudo podman kill etcd" in one of the nodes is enough, but any way of stopping it ungracefully will suffice. The logs in the description above will appear in pacemaker logs (journalctl -u pacemaker).

      When checking with "sudo pcs status", the resource will look like this:

        * Clone Set: etcd-clone [etcd]:
          * Started: [ javier-master-0-1 ]
          * Stopped: [ javier-master-0-0 ] 
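
      Condensed into commands (all taken from the steps above), the reproduction on one node of a healthy cluster is:

      # On one of the two nodes, ungracefully stop the local etcd container:
      sudo podman kill etcd

      # Watch Pacemaker handle the restart; the error from the description shows up here:
      sudo journalctl -u pacemaker -f

      # The etcd clone is left with only one started instance:
      sudo pcs status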

      Verification information

      The bug fix is merged in https://github.com/ClusterLabs/resource-agents/pull/2082.
      NOTE: Even with this bug fix, the etcd cluster might still fail to recover. A second, previously hidden bug is fixed in https://github.com/ClusterLabs/resource-agents/pull/2089. If the system is verified with the first fix but not the second, the problem described here should not occur, but the cluster might still be unable to start properly.
      With both fixes applied, the cluster should recover properly. This can be verified by running "sudo pcs status" and checking that the etcd clone looks like this:

       

        * Clone Set: etcd-clone [etcd]:
          * Started: [ javier-master-0-0 javier-master-0-1 ]
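
      A quick way to check this from a shell (node names are from this test environment; substitute your own):

      sudo pcs status | grep -A 2 'Clone Set: etcd-clone'
      # Expected output once both fixes are applied:
      #   * Clone Set: etcd-clone [etcd]:
      #     * Started: [ javier-master-0-0 javier-master-0-1 ]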

              rh-ee-pfontani Pablo Fontanilla
              rh-ee-clobrano Carlo Lobrano
              Douglas Hensel