RHEL-14826

A stop action for a stonith device timed out, leading to a cluster node being fenced

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major
    • Fix Version/s: rhel-9.6
    • Affects Version/s: rhel-8.6.0, rhel-9.4
    • Component/s: pacemaker
    • Medium
    • sst_high_availability
    • ssg_filesystems_storage_and_HA
    • 13
    • 23
    • 5
    • Dev ack
    • False
    • Yes
    • Red Hat Enterprise Linux
    • Bug Fix
    • Proposed

      What were you trying to do that didn't work?

      A cluster node was fenced because the "stop" action of a stonith resource timed out. A "stop" action should not time out and lead to a cluster node being fenced.

      First, the recurring 60-second monitor action of the stonith device (clvmfence) timed out:

        Sep 28 22:52:29 clprod2 pacemaker-controld[1628]: error: Node clprod1.unix.cwtcloud.com did not send monitor result (via controller) within 80000ms (action timeout plus cluster-delay)
        Sep 28 22:52:29 clprod2 pacemaker-controld[1628]: error: [Action    1]: In-flight resource op clvmfence_monitor_60000      on clprod1.unix.cwtcloud.com (priority: 0, waiting: (null))
        Sep 28 22:52:29 clprod2 pacemaker-controld[1628]: notice: Transition 24 aborted: Action lost
        Sep 28 22:52:29 clprod2 pacemaker-controld[1628]: warning: rsc_op 1: clvmfence_monitor_60000 on clprod1.unix.cwtcloud.com timed out
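
      The 80000ms deadline in these messages is, as the log itself says, the action timeout plus cluster-delay. The log reports only the total, so the split below is an assumption: a minimal sketch that assumes the cluster uses Pacemaker's default cluster-delay of 60s and the default 20s operation timeout. The helper name result_deadline_ms is hypothetical.

```python
# Hypothetical helper: the DC waits for a remote action's result for
# (action timeout + cluster-delay) before declaring the action lost.
def result_deadline_ms(action_timeout_ms: int, cluster_delay_ms: int) -> int:
    return action_timeout_ms + cluster_delay_ms


# Assumed values (Pacemaker defaults); the logs only show the 80000ms total.
ASSUMED_OP_TIMEOUT_MS = 20_000     # default operation timeout: 20s
ASSUMED_CLUSTER_DELAY_MS = 60_000  # default cluster-delay: 60s

deadline = result_deadline_ms(ASSUMED_OP_TIMEOUT_MS, ASSUMED_CLUSTER_DELAY_MS)
print(deadline)  # 80000, the value reported in the logs
```

      If the cluster does use these defaults, the device itself had only 20s to answer; the remaining 60s is slack for network and controller delays.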

      Pacemaker then tried to stop the stonith device, and the stop action also timed out:

        Sep 28 22:53:51 clprod2 pacemaker-controld[1628]: error: Node clprod1.unix.cwtcloud.com did not send stop result (via controller) within 80000ms (action timeout plus cluster-delay)
        Sep 28 22:53:51 clprod2 pacemaker-controld[1628]: error: [Action    2]: In-flight resource op clvmfence_stop_0             on clprod1.unix.cwtcloud.com (priority: 0, waiting: (null))
        Sep 28 22:53:51 clprod2 pacemaker-controld[1628]: notice: Transition 26 aborted: Action lost
        Sep 28 22:53:51 clprod2 pacemaker-controld[1628]: warning: rsc_op 2: clvmfence_stop_0 on clprod1.unix.cwtcloud.com timed out
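
      Because the deadline is counted from when the controller sends the action, the logged timestamps can be worked backwards to estimate when the stop was sent. A rough sketch, assuming the 80000ms timer starts when the controller dispatches the stop (an assumption; the logs also have only one-second resolution):

```python
from datetime import datetime, timedelta

# Deadline quoted in the log: "within 80000ms (action timeout plus cluster-delay)".
RESULT_DEADLINE = timedelta(milliseconds=80_000)

# Timestamps copied from the logs above (both on Sep 28; the date is omitted).
monitor_lost = datetime.strptime("22:52:29", "%H:%M:%S")
stop_lost = datetime.strptime("22:53:51", "%H:%M:%S")

# Assumption: the 80 s timer started when the controller dispatched the stop.
stop_dispatched = stop_lost - RESULT_DEADLINE

print(stop_dispatched.strftime("%H:%M:%S"))              # 22:52:31
print((stop_dispatched - monitor_lost).total_seconds())  # 2.0
```

      Under that assumption, the stop was sent roughly two seconds after the monitor result was declared lost.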

      This stop failure caused the cluster node on which the stop was running (clprod1.unix.cwtcloud.com) to be fenced.
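
      The sequence above (a lost monitor, then a lost stop, then fencing) matches Pacemaker's documented default on-fail behavior. A failed monitor is recovered by restarting the resource, which begins with a stop. A failed stop escalates to fencing when stonith-enabled is true, and to block otherwise. The sketch below models only those defaults; explicit on-fail settings on an operation would override them. It also splits operation keys, which have the form <resource>_<action>_<interval in ms>. The helper names parse_op_key and default_on_fail are hypothetical.

```python
from typing import Tuple


def parse_op_key(op_key: str) -> Tuple[str, str, int]:
    """Split '<resource>_<action>_<interval_ms>' from the right,
    because resource IDs may themselves contain underscores."""
    rsc, action, interval = op_key.rsplit("_", 2)
    return rsc, action, int(interval)


def default_on_fail(action: str, stonith_enabled: bool = True) -> str:
    """Simplified model of the default on-fail policy (ignores explicit settings)."""
    if action == "stop":
        # A failed stop leaves the resource state unknown, so the node is fenced
        # when fencing is enabled; otherwise the resource is blocked.
        return "fence" if stonith_enabled else "block"
    # Other failed operations (such as a monitor) default to a restart: stop, then start.
    return "restart"


for key in ("clvmfence_monitor_60000", "clvmfence_stop_0"):
    rsc, action, interval_ms = parse_op_key(key)
    print(rsc, action, interval_ms, default_on_fail(action))
# clvmfence monitor 60000 restart
# clvmfence stop 0 fence
```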

      Please provide the package NVR for which the bug is seen:

      pacemaker-2.1.2-4.el8_6.2

      How reproducible:

      Unknown

      Steps to reproduce

      1. Unknown

      Expected results

      Stopping a stonith device should be quick. Its "stop" action should not time out and cause the cluster node to be fenced.

      Actual results

      The "stop" action of the stonith resource timed out, and the cluster node was fenced as a result.

            Reid Wahl (rhn-support-nwahl)
            Shane Bradley (rhn-support-sbradley)
            Kenneth Gaillot
            Cluster QE
            Votes: 0
            Watchers: 2
