Bug
Resolution: Won't Do
rhel-9.4
rhel-sst-high-availability
ssg_filesystems_storage_and_HA
What were you trying to do that didn't work? In a host reboot scenario that involves fencing, the fencing action can be blocked waiting for an in-progress resource start action to complete. This is a problem when the resource start action indirectly depends on the fencing action being run, introducing a circular dependency.
Here is the sequence of events:
- The cluster has 4 nodes (p10rhel092, p10rhel093, p10rhel094, p10rhel095) and a quorum device
- Two nodes were rebooted (p10rhel092, then p10rhel094)
- Fencing was triggered for p10rhel092
- A db2 member resource that had been running on p10rhel092 was restarted on p10rhel093
- The start action for the db2 member resource could not complete in a timely fashion, because it depends on fencing of p10rhel094
The main question is: can we configure Pacemaker so that node fencing is triggered when needed, without waiting for an in-progress resource action to complete?
Timeline:
- 2024-10-02-19.22.52: p10rhel092.rtp.raleigh.ibm.com was rebooted
- 2024-10-02-19.22.54: p10rhel094.rtp.raleigh.ibm.com was rebooted
- p10rhel092 (member 0) was fenced by p10rhel093 at 19:22:55; the operation completed at 19:23:06 (took 11 seconds)
Oct 02 19:22:55.526 p10rhel093 pacemaker-fenced [3089] (initiate_remote_stonith_op) notice: Requesting peer fencing targeting p10rhel092 | id=b455b228 state=querying base_timeout=60
Oct 02 19:23:06.156 p10rhel093 pacemaker-fenced [3089] (log_async_result) notice: Operation 'off' [187490] targeting p10rhel092 using db2_fence returned 0 | call 7 from pacemaker-controld.3093
>>> There was a delay of more than 2 minutes, between 19:23:06 and 19:25:17, before p10rhel094 was fenced <<<
- This delay appears to have been caused by the db2_member_jstamko2_0 start action hanging/blocking for about 2 minutes.
Oct 02 19:23:08.456 p10rhel093 pacemaker-controld [3093] (do_lrm_rsc_op) notice: Requesting local execution of start operation for db2_member_jstamko2_0 on p10rhel093 | transition_key=35:61:0:263f31a1-6ab7-444a-aa40-802a1a11824a op_key=db2_member_jstamko2_0_start_0
Oct 02 19:25:17.856 p10rhel093 pacemaker-execd [3090] (log_finished) info: db2_member_jstamko2_0 start (call 131, PID 189476) exited with status 0 (execution time 2m9.400s)
- p10rhel094 (CF 129) was fenced by p10rhel093 at 19:25:17; the operation completed at 19:25:19 (took 2 seconds)
Oct 02 19:25:17.966 p10rhel093 pacemaker-fenced [3089] (initiate_remote_stonith_op) notice: Requesting peer fencing targeting p10rhel094 | id=ae98075e state=querying base_timeout=60
Oct 02 19:25:19.496 p10rhel093 pacemaker-fenced [3089] (log_async_result) notice: Operation 'off' [205224] targeting p10rhel094 using db2_fence returned 0 | call 8 from pacemaker-controld.3093
- p10rhel094 was unfenced at 19:26:21; the operation completed at 19:26:21 (less than 1 second)
Oct 02 19:26:21.896 p10rhel093 pacemaker-fenced [3089] (initiate_remote_stonith_op) notice: Requesting peer fencing targeting p10rhel094 | id=fc118962 state=querying base_timeout=120
Oct 02 19:26:21.896 p10rhel093 pacemaker-fenced [3089] (request_peer_fencing) notice: Requesting that p10rhel093 perform 'on' action targeting p10rhel094 | for client stonith_admin.211560 (144s, 0s)
- p10rhel092 was unfenced at 19:27:01; the operation completed at 19:27:03 (took 2 seconds)
Oct 02 19:27:01.366 p10rhel093 pacemaker-fenced [3089] (initiate_remote_stonith_op) notice: Requesting peer fencing targeting p10rhel092 | id=51625729 state=querying base_timeout=120
Oct 02 19:27:03.006 p10rhel093 pacemaker-fenced [3089] (log_async_result) notice: Operation 'on' [214348] targeting p10rhel092 using db2_fence returned 0 | call 2 from stonith_admin.214347
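The request/result pairs in the timeline above can be reconstructed by filtering the fencer's log lines. A minimal sketch (the sample lines below stand in for the real log file, which on RHEL 9 defaults to /var/log/pacemaker/pacemaker.log):

```shell
# Stand-in for the real Pacemaker log; in practice, point grep at
# /var/log/pacemaker/pacemaker.log instead.
log=$(mktemp)
cat > "$log" <<'EOF'
Oct 02 19:22:55.526 p10rhel093 pacemaker-fenced [3089] (initiate_remote_stonith_op) notice: Requesting peer fencing targeting p10rhel092
Oct 02 19:23:08.456 p10rhel093 pacemaker-controld [3093] (do_lrm_rsc_op) notice: Requesting local execution of start operation
Oct 02 19:23:06.156 p10rhel093 pacemaker-fenced [3089] (log_async_result) notice: Operation 'off' [187490] targeting p10rhel092
EOF
# Keep only the fencer's request and result lines; everything else
# (controld, execd, etc.) is filtered out.
grep -E 'pacemaker-fenced \[[0-9]+\] \((initiate_remote_stonith_op|log_async_result)\)' "$log"
rm -f "$log"
```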
------------
What is the impact of this issue to you? The cluster was unable to recover automatically after multiple host failures.
Please provide the package NVR for which the bug is seen:
Pacemaker 2.1.7-4.db2pcmk.el9
How reproducible is this bug?: Hit only once so far
Steps to reproduce:
- Set up a cluster with 4 nodes and a quorum device
- Configure the Db2 resource model and Db2 fencing agent
- Reboot 2 nodes
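The generic part of the setup above can be sketched with pcs (node names and the quorum-device host are placeholders; the Db2 resource model and db2_fence agent are configured by IBM's Db2 pureScale tooling and are not shown here):

```shell
# Hypothetical node names; substitute the real cluster hosts.
pcs host auth node1 node2 node3 node4 -u hacluster
pcs cluster setup mycluster node1 node2 node3 node4
pcs cluster start --all
# Add a corosync quorum device; "qdev" is a placeholder for a host
# running corosync-qnetd.
pcs quorum device add model net host=qdev algorithm=lms
```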
Expected results: Fencing actions are triggered for both down nodes in a timely fashion.
Actual results: The fencing action for the second node was blocked waiting for a resource start action to complete first.
The key question here is: is it by design that a fencing action can be blocked waiting for any in-progress resource action? If yes, is there a way to configure the cluster so that fencing actions are independent of resource actions and can run concurrently?
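For reference, Pacemaker does expose a concurrent-fencing cluster property that lets the fencer execute multiple fencing actions in parallel (it defaults to true on 2.1.x). Whether it helps here is unclear, since the delay in this report appears to come from the scheduler waiting on the in-flight transition rather than from the fencer serializing actions; a sketch of checking and setting it:

```shell
# Query the current value; crm_attribute defaults to cluster properties
# (crm_config), so no --type is needed. An error means the property is
# unset and the built-in default applies.
crm_attribute --query --name concurrent-fencing
# Explicitly enable parallel execution of fencing actions.
crm_attribute --name concurrent-fencing --update true
```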