What were you trying to do that didn't work?
A cluster node was fenced because the "stop" action of a stonith resource timed out. The "stop" action of a stonith resource should not time out, and it should not lead to a cluster node being fenced.
The monitor action of the stonith device timed out:
Sep 28 22:52:29 clprod2 pacemaker-controld[1628]: error: Node clprod1.unix.cwtcloud.com did not send monitor result (via controller) within 80000ms (action timeout plus cluster-delay)
Sep 28 22:52:29 clprod2 pacemaker-controld[1628]: error: [Action 1]: In-flight resource op clvmfence_monitor_60000 on clprod1.unix.cwtcloud.com (priority: 0, waiting: (null))
Sep 28 22:52:29 clprod2 pacemaker-controld[1628]: notice: Transition 24 aborted: Action lost
Sep 28 22:52:29 clprod2 pacemaker-controld[1628]: warning: rsc_op 1: clvmfence_monitor_60000 on clprod1.unix.cwtcloud.com timed out
The stop action of the stonith device timed out:
Sep 28 22:53:51 clprod2 pacemaker-controld[1628]: error: Node clprod1.unix.cwtcloud.com did not send stop result (via controller) within 80000ms (action timeout plus cluster-delay)
Sep 28 22:53:51 clprod2 pacemaker-controld[1628]: error: [Action 2]: In-flight resource op clvmfence_stop_0 on clprod1.unix.cwtcloud.com (priority: 0, waiting: (null))
Sep 28 22:53:51 clprod2 pacemaker-controld[1628]: notice: Transition 26 aborted: Action lost
Sep 28 22:53:51 clprod2 pacemaker-controld[1628]: warning: rsc_op 2: clvmfence_stop_0 on clprod1.unix.cwtcloud.com timed out
The stop action failure caused the cluster node to be fenced.
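The timing in the log excerpts above can be broken down with a short script. This is only a sketch for reading the logs, not part of Pacemaker. It assumes that cluster-delay is at its default of 60s (the report does not state the configured value) and that both events occurred on the same day. Under that assumption, the 80000ms wait implies a 20s action timeout for both the monitor and the stop operation.

```python
import re
from datetime import datetime

# The two "did not send ... result" lines copied from the report above.
LOG = """\
Sep 28 22:52:29 clprod2 pacemaker-controld[1628]: error: Node clprod1.unix.cwtcloud.com did not send monitor result (via controller) within 80000ms (action timeout plus cluster-delay)
Sep 28 22:53:51 clprod2 pacemaker-controld[1628]: error: Node clprod1.unix.cwtcloud.com did not send stop result (via controller) within 80000ms (action timeout plus cluster-delay)
"""

# Assumption: cluster-delay is left at its default of 60s on this cluster.
ASSUMED_CLUSTER_DELAY_MS = 60_000

PATTERN = re.compile(
    r"^\w{3} +\d+ (?P<time>\d\d:\d\d:\d\d) .*did not send (?P<action>\w+) result "
    r".*within (?P<wait>\d+)ms"
)


def parse_lost_actions(text):
    """Return (time, action, wait_ms) for each lost-action log line."""
    events = []
    for line in text.splitlines():
        m = PATTERN.match(line)
        if m:
            # Only the time of day is parsed; both lines are from Sep 28.
            ts = datetime.strptime(m.group("time"), "%H:%M:%S")
            events.append((ts, m.group("action"), int(m.group("wait"))))
    return events


events = parse_lost_actions(LOG)
for ts, action, wait_ms in events:
    implied_timeout_ms = wait_ms - ASSUMED_CLUSTER_DELAY_MS
    print(f"{ts:%H:%M:%S} {action}: waited {wait_ms} ms, "
          f"implied action timeout {implied_timeout_ms} ms")

# Seconds between the monitor being declared lost and the stop being declared lost.
gap = (events[1][0] - events[0][0]).total_seconds()
print(f"monitor loss to stop loss: {gap:.0f} s")
```

The 82 second gap between the two errors is consistent with the stop being scheduled shortly after the monitor was declared lost and then waiting the full 80000ms without a reply from clprod1.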
Please provide the package NVR for which the bug is seen:
pacemaker-2.1.2-4.el8_6.2
How reproducible:
Unknown
Steps to reproduce
- Unknown
Expected results
The "stop" action of a stonith device should complete quickly. It should not time out, and a timeout should not cause the cluster node to be fenced.
Actual results
The "stop" action of the stonith resource clvmfence on clprod1.unix.cwtcloud.com timed out after 80000ms, and a cluster node was fenced as a result.
- relates to
RHEL-29861 The "pcmk_monitor_timeout" default value in multiple documentation is listed as 60s, but should be 20s
- In Progress