-
Bug
-
Resolution: Done
-
Major
-
None
-
rhel-9.4
-
None
-
No
-
Critical
-
rhel-sst-high-availability
-
ssg_filesystems_storage_and_HA
-
None
-
False
-
-
None
-
None
-
None
-
None
-
ppc64le
-
None
What were you trying to do that didn't work?
The STONITH fence agent failed to unfence (fence "on") a host when the host rejoined the cluster.
It appears that the fence "on" request did not invoke the fence agent script to perform the unfence action:
Cluster consists of 4 hosts: p10rhel092, p10rhel093, p10rhel094, p10rhel095 and a quorum device
Initially p10rhel094 was the DC
After Pacemaker shut down on p10rhel094, p10rhel093 became the new DC
- At Sep 23 23:37:11 - both hosts p10rhel092 and p10rhel094 were rebooted
- At 23:37:14 - p10rhel092 was fenced by p10rhel095.
- From the p10rhel095 host, the fence agent script was invoked to perform the fence off action
- From old DC (p10rhel094) host:
Sep 23 23:37:14.285 p10rhel094 pacemaker-fenced [3080546] (handle_fence_request) notice: Client pacemaker-controld.3080550 wants to fence p10rhel092 using any device
Sep 23 23:37:14.285 p10rhel094 pacemaker-fenced [3080546] (initiate_remote_stonith_op) notice: Requesting peer fencing targeting p10rhel092 | id=b99c9447 state=querying base_timeout=120
Sep 23 23:37:14.285 p10rhel094 pacemaker-fenced [3080546] (can_fence_host_with_device) info: db2_fence is eligible to fence p10rhel092: none
Sep 23 23:37:14.285 p10rhel094 pacemaker-fenced [3080546] (process_remote_stonith_query) info: Query result 1 of 4 from p10rhel094 for p10rhel092/off (1 device) b99c9447-6050-4e2a-be74-571e4369bce7
Sep 23 23:37:14.285 p10rhel094 pacemaker-fenced [3080546] (process_remote_stonith_query) info: Query result 2 of 4 from p10rhel092 for p10rhel092/off (0 devices) b99c9447-6050-4e2a-be74-571e4369bce7
Sep 23 23:37:14.285 p10rhel094 pacemaker-fenced [3080546] (process_remote_stonith_query) info: Query result 3 of 4 from p10rhel093 for p10rhel092/off (1 device) b99c9447-6050-4e2a-be74-571e4369bce7
Sep 23 23:37:14.285 p10rhel094 pacemaker-fenced [3080546] (process_remote_stonith_query) info: Query result 4 of 4 from p10rhel095 for p10rhel092/off (1 device) b99c9447-6050-4e2a-be74-571e4369bce7
Sep 23 23:37:14.285 p10rhel094 pacemaker-fenced [3080546] (request_peer_fencing) info: Total timeout set to 144 for peer's fencing targeting p10rhel092 for pacemaker-controld.3080550|id=b99c9447
Sep 23 23:37:14.285 p10rhel094 pacemaker-fenced [3080546] (request_peer_fencing) notice: Requesting that p10rhel095 perform 'off' action targeting p10rhel092 | for client pacemaker-controld.3080550 (144s, 0s)
Sep 23 23:37:48.125 p10rhel094 pacemaker-fenced [3080546] (finalize_op) notice: Operation 'off' targeting p10rhel092 by p10rhel095 for pacemaker-controld.3080550@p10rhel094: OK (complete) | id=b99c9447
- From DC (p10rhel093) host:
September-23 23:37:33 db2fence_ps(670288): Function: set_fence_status Line: 200 WARNING:Entry. Parameters: off
September-23 23:37:33 db2fence_ps(670288): Function: fence_node Line: 506 WARNING:Entry. Parameters: p10rhel092, jstamko2
September-23 23:37:33 db2fence_ps(670288): Function: fence_node Line: 510 WARNING:Remote cleanup: db2remotecleanup p10rhel092 jstamko2
September-23 23:37:48 db2fence_ps(670288): Function: fence_node Line: 514 WARNING:Exit. Return code: 0
September-23 23:37:48 db2fence_ps(670288): Function: is_node_fenced_off Line: 293 INFO:Expelled node: p10rhel092. Return Code: 0
Sep 23 23:37:48.120 p10rhel093 pacemaker-fenced [2108653] (finalize_op) notice: Operation 'off' targeting p10rhel092 by p10rhel095 for pacemaker-controld.3080550@p10rhel094: OK (complete) | id=b99c9447
- At 23:37:50: Pacemaker was shut down on p10rhel094
Sep 23 23:37:50.725 p10rhel094 pacemakerd [3080544] (pcmk_shutdown_worker) notice: Shutting down Pacemaker
Sep 23 23:37:50.725 p10rhel094 pacemakerd [3080544] (pcmk_shutdown_worker) notice: Still waiting for pacemaker-controld to terminate | pid=3080550
Sep 23 23:37:50.725 p10rhel094 pacemaker-fenced [3080546] (crm_signal_dispatch) notice: Caught 'Terminated' signal | 15 (invoking handler)
Sep 23 23:37:50.725 p10rhel094 pacemaker-execd [3080547] (crm_signal_dispatch) notice: Caught 'Terminated' signal | 15 (invoking handler)
- Now p10rhel093 is the new DC
- At 23:38:04 - p10rhel094 was fenced by p10rhel095, completed at 23:38:19
- Fence agent script was invoked on p10rhel095 to fence off host p10rhel094
Sep 23 23:38:04.580 p10rhel093 pacemaker-fenced [2108653] (handle_fence_request) notice: Client pacemaker-controld.2108657 wants to fence p10rhel094 using any device
Sep 23 23:38:19.740 p10rhel093 pacemaker-fenced [2108653] (finalize_op) notice: Operation 'off' targeting p10rhel094 by p10rhel095 for pacemaker-controld.2108657@p10rhel093: OK (complete) | id=2486d142
- DC (p10rhel093) detected that host p10rhel092 rejoined
September-23 23:38:04 db2fence_ps(675837): Function: set_fence_status Line: 200 WARNING:Entry. Parameters: off
September-23 23:38:04 db2fence_ps(675837): Function: fence_node Line: 506 WARNING:Entry. Parameters: p10rhel094, jstamko2
September-23 23:38:19 db2fence_ps(675837): Function: fence_node Line: 514 WARNING:Exit. Return code: 0
September-23 23:38:19 db2fence_ps(675837): Function: is_node_fenced_off Line: 293 INFO:Expelled node: p10rhel094. Return Code: 0
- DC requests host p10rhel095 to fence p10rhel092
Sep 23 23:40:24.680 p10rhel093 pacemaker-fenced [2108653] (pcmk__get_peer) info: Created entry 81f8faad-5e12-4a39-8b2c-06d5938b0be4/0x16f90d530 for node p10rhel092/2 (3 total)
Sep 23 23:40:24.680 p10rhel093 pacemaker-fenced [2108653] (pcmk__get_peer) info: Node 2 is now known as p10rhel092
Sep 23 23:40:24.680 p10rhel093 pacemaker-fenced [2108653] (pcmk__get_peer) info: Node 2 has uuid 2
- Host p10rhel095 saw the fence on request, but it never invoked the fence agent script to perform the fence on action. Instead it just returned success (0).
Sep 23 23:40:36.730 p10rhel093 pacemaker-fenced [2108653] (can_fence_host_with_device) info: db2_fence is eligible to fence p10rhel092: none
Sep 23 23:40:36.810 p10rhel093 pacemaker-fenced [2108653] (finalize_op) notice: Operation 'on' targeting p10rhel092 by p10rhel095 for stonith_admin.688755@p10rhel095: OK (complete) | id=603a5d33
- At 23:40:54 the DC detected that host p10rhel094 rejoined the cluster
Sep 23 23:40:36.724 p10rhel095 pacemaker-fenced [3410737] (handle_fence_request) notice: Client stonith_admin.688755 wants to fence p10rhel092 using any device
Sep 23 23:40:36.724 p10rhel095 pacemaker-fenced [3410737] (initiate_remote_stonith_op) notice: Requesting peer fencing targeting p10rhel092 | id=603a5d33 state=querying base_timeout=120
Sep 23 23:40:36.724 p10rhel095 pacemaker-fenced [3410737] (can_fence_host_with_device) info: db2_fence is eligible to fence p10rhel092: none
Sep 23 23:40:36.724 p10rhel095 pacemaker-fenced [3410737] (process_remote_stonith_query) info: Query result 1 of 3 from p10rhel095 for p10rhel092/on (1 device) 603a5d33-2fa6-472b-94b6-9cc6ddca0b8c
Sep 23 23:40:36.724 p10rhel095 pacemaker-fenced [3410737] (request_peer_fencing) info: Total timeout set to 144 for peer's fencing targeting p10rhel092 for stonith_admin.688755|id=603a5d33
Sep 23 23:40:36.724 p10rhel095 pacemaker-fenced [3410737] (request_peer_fencing) notice: Requesting that p10rhel095 perform 'on' action targeting p10rhel092 | for client stonith_admin.688755 (144s, 0s)
Sep 23 23:40:36.724 p10rhel095 pacemaker-fenced [3410737] (process_remote_stonith_query) info: Query result 2 of 3 from p10rhel092 for p10rhel092/on (1 device) 603a5d33-2fa6-472b-94b6-9cc6ddca0b8c
Sep 23 23:40:36.724 p10rhel095 pacemaker-fenced [3410737] (process_remote_stonith_query) info: Query result 3 of 3 from p10rhel093 for p10rhel092/on (1 device) 603a5d33-2fa6-472b-94b6-9cc6ddca0b8c
Sep 23 23:40:36.724 p10rhel095 pacemaker-fenced [3410737] (can_fence_host_with_device) info: db2_fence is eligible to fence p10rhel092: none
Sep 23 23:40:36.724 p10rhel095 pacemaker-fenced [3410737] (stonith_fence_get_devices_cb) info: Found 1 matching device for target 'p10rhel092'
Sep 23 23:40:36.804 p10rhel095 pacemaker-fenced [3410737] (log_async_result) notice: Operation 'on' [688756] targeting p10rhel092 using db2_fence returned 0 | call 2 from stonith_admin.688755
- Then at 23:41:02 fence "on" for p10rhel094 was successful
Sep 23 23:40:54.980 p10rhel093 pacemaker-fenced [2108653] (pcmk__get_peer) info: Created entry c9abff4a-efcb-414e-b907-9c7b9ce497cc/0x16f9e6bf0 for node p10rhel094/1 (4 total)
Sep 23 23:40:54.980 p10rhel093 pacemaker-fenced [2108653] (pcmk__get_peer) info: Node 1 is now known as p10rhel094
Sep 23 23:40:54.980 p10rhel093 pacemaker-fenced [2108653] (pcmk__get_peer) info: Node 1 has uuid 1
- Host p10rhel095 saw the fence on request for host p10rhel094 and was able to invoke the fence agent script
Sep 23 23:41:04.780 p10rhel093 pacemaker-fenced [2108653] (finalize_op) notice: Operation 'on' targeting p10rhel094 by p10rhel095 for stonith_admin.690473@p10rhel095: OK (complete) | id=4869624d
Sep 23 23:41:04.780 p10rhel093 pacemaker-controld [2108657] (handle_fence_notification) notice: p10rhel094 was unfenced by p10rhel095 at the request of stonith_admin.690473@p10rhel095
Sep 23 23:41:02.094 p10rhel095 pacemaker-fenced [3410737] (handle_fence_request) notice: Client stonith_admin.690473 wants to fence p10rhel094 using any device
Sep 23 23:41:02.094 p10rhel095 pacemaker-fenced [3410737] (initiate_remote_stonith_op) notice: Requesting peer fencing targeting p10rhel094 | id=4869624d state=querying base_timeout=120
Sep 23 23:41:02.094 p10rhel095 pacemaker-fenced [3410737] (can_fence_host_with_device) info: db2_fence is eligible to fence p10rhel094: none
Sep 23 23:41:02.094 p10rhel095 pacemaker-fenced [3410737] (process_remote_stonith_query) info: Query result 1 of 4 from p10rhel095 for p10rhel094/on (1 device) 4869624d-3372-4561-a0ab-64cff37bf843
Sep 23 23:41:02.094 p10rhel095 pacemaker-fenced [3410737] (request_peer_fencing) info: Total timeout set to 144 for peer's fencing targeting p10rhel094 for stonith_admin.690473|id=4869624d
Sep 23 23:41:02.094 p10rhel095 pacemaker-fenced [3410737] (request_peer_fencing) notice: Requesting that p10rhel095 perform 'on' action targeting p10rhel094 | for client stonith_admin.690473 (144s, 0s)
Sep 23 23:41:02.094 p10rhel095 pacemaker-fenced [3410737] (process_remote_stonith_query) info: Query result 2 of 4 from p10rhel094 for p10rhel094/on (1 device) 4869624d-3372-4561-a0ab-64cff37bf843
Sep 23 23:41:02.094 p10rhel095 pacemaker-fenced [3410737] (process_remote_stonith_query) info: Query result 3 of 4 from p10rhel092 for p10rhel094/on (1 device) 4869624d-3372-4561-a0ab-64cff37bf843
Sep 23 23:41:02.094 p10rhel095 pacemaker-fenced [3410737] (process_remote_stonith_query) info: Query result 4 of 4 from p10rhel093 for p10rhel094/on (1 device) 4869624d-3372-4561-a0ab-64cff37bf843
Sep 23 23:41:02.094 p10rhel095 pacemaker-fenced [3410737] (can_fence_host_with_device) info: db2_fence is eligible to fence p10rhel094: none
Sep 23 23:41:02.094 p10rhel095 pacemaker-fenced [3410737] (stonith_fence_get_devices_cb) info: Found 1 matching device for target 'p10rhel094'
September-23 23:41:02 db2fence_ps(690474): Function: is_node_fenced_off Line: 293 INFO:Expelled node: p10rhel094. Return Code: 0
September-23 23:41:02 db2fence_ps(690474): Function: set_fence_status Line: 200 WARNING:Entry. Parameters: on
September-23 23:41:02 db2fence_ps(690474): Function: unfence_node Line: 475 WARNING:Entry. Parameters: p10rhel094, jstamko2
September-23 23:41:04 db2fence_ps(690474): Function: unfence_node Line: 486 WARNING:Exit. Return code: 0
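The comparison made above (a fence "on" that returned 0 with no matching agent log entries) can be checked mechanically. The following is a minimal Python sketch, not part of Pacemaker or the Db2 agent. It assumes that the bracketed number in a `log_async_result` line is the PID of the agent process and that the same PID appears in `db2fence_ps(PID)` log entries, as the excerpts in this report suggest. The function name `operations_without_agent_trace` is hypothetical.

```python
import re

# Pacemaker line format, as seen in the log excerpts in this report.
ASYNC_RESULT = re.compile(
    r"\(log_async_result\)\s+\w+: Operation '(?P<action>\w+)' \[(?P<pid>\d+)\] "
    r"targeting (?P<target>\S+) using (?P<device>\S+) returned (?P<rc>-?\d+)"
)
# Agent line format, as seen in the db2fence_ps excerpts in this report.
AGENT_ENTRY = re.compile(r"db2fence_ps\((?P<pid>\d+)\): Function: (?P<func>\w+)")


def operations_without_agent_trace(pacemaker_lines, agent_lines):
    """Return (action, target, pid) for operations whose PID never logs in the agent.

    Assumption: the PID in brackets equals the PID the agent writes in its log.
    """
    agent_pids = {
        m.group("pid") for line in agent_lines if (m := AGENT_ENTRY.search(line))
    }
    missing = []
    for line in pacemaker_lines:
        m = ASYNC_RESULT.search(line)
        if m and m.group("pid") not in agent_pids:
            missing.append((m.group("action"), m.group("target"), int(m.group("pid"))))
    return missing


# Sample lines copied from this report.
PACEMAKER_SAMPLE = [
    "Sep 23 23:40:36.804 p10rhel095 pacemaker-fenced [3410737] (log_async_result) "
    "notice: Operation 'on' [688756] targeting p10rhel092 using db2_fence returned 0 "
    "| call 2 from stonith_admin.688755",
]
AGENT_SAMPLE = [
    "September-23 23:41:02 db2fence_ps(690474): Function: set_fence_status "
    "Line: 200 WARNING:Entry. Parameters: on",
    "September-23 23:41:02 db2fence_ps(690474): Function: unfence_node "
    "Line: 475 WARNING:Entry. Parameters: p10rhel094, jstamko2",
]

print(operations_without_agent_trace(PACEMAKER_SAMPLE, AGENT_SAMPLE))
```

With the sample lines, the only operation reported is the "on" for p10rhel092, which matches the observation that no agent entries exist for that request.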
What is the impact of this issue to you?
The cluster was unable to recover from the failure.
Please provide the package NVR for which the bug is seen:
Pacemaker 2.1.7-4.db2pcmk.el9
How reproducible is this bug?:
Hit on first iteration on a PPCLE cluster.
Steps to reproduce
- Set up a cluster consisting of 4 hosts and a quorum device
- Configure it with the Db2 pureScale resource model and the Db2 fence agent
- Reboot 2 hosts: first a non-DC host, then the DC
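When re-running the steps above, it may help to condense the Pacemaker log into a list of completed fencing operations before comparing it against the agent log. Below is a rough Python sketch that extracts `finalize_op` results. The regular expression is based only on the line format visible in this report, and the names `FenceResult` and `fencing_results` are hypothetical helpers, not Pacemaker APIs.

```python
import re
from typing import NamedTuple


class FenceResult(NamedTuple):
    time: str
    action: str
    target: str
    executor: str
    result: str
    op_id: str


# Matches pacemaker-fenced finalize_op lines as they appear in this report.
FINALIZE_OP = re.compile(
    r"^(?P<time>\w{3} \d+ [\d:.]+) \S+ pacemaker-fenced\s+\[\d+\] \(finalize_op\)\s+\w+: "
    r"Operation '(?P<action>\w+)' targeting (?P<target>\S+) by (?P<executor>\S+) "
    r"for \S+: (?P<result>.+?) \| id=(?P<op_id>\w+)"
)


def fencing_results(lines):
    """Collect one FenceResult per finalize_op line, in log order."""
    results = []
    for line in lines:
        m = FINALIZE_OP.search(line)
        if m:
            results.append(FenceResult(**m.groupdict()))
    return results


# Sample lines copied from this report.
SAMPLE_LOG = [
    "Sep 23 23:37:48.125 p10rhel094 pacemaker-fenced [3080546] (finalize_op) notice: "
    "Operation 'off' targeting p10rhel092 by p10rhel095 for "
    "pacemaker-controld.3080550@p10rhel094: OK (complete) | id=b99c9447",
    "Sep 23 23:40:36.810 p10rhel093 pacemaker-fenced [2108653] (finalize_op) notice: "
    "Operation 'on' targeting p10rhel092 by p10rhel095 for "
    "stonith_admin.688755@p10rhel095: OK (complete) | id=603a5d33",
]

for r in fencing_results(SAMPLE_LOG):
    print(r.time, r.action, r.target, "by", r.executor, r.result, r.op_id)
```

Each "on" entry in this summary can then be checked for a corresponding `unfence_node` entry in the agent log on the executing host.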