RHEL-60122

Fencing "on" does not invoke the fence agent script to perform the action

    • Type: Bug
    • Resolution: Done
    • Priority: Major
    • rhel-9.4
    • pacemaker
    • No
    • Critical
    • rhel-sst-high-availability
    • ssg_filesystems_storage_and_HA
    • False
    • ppc64le

      What were you trying to do that didn't work?

      The STONITH fence agent failed to unfence (fence "on") a host when the host rejoined the cluster.
      It appears that the fence "on" action did not invoke the fence agent script to perform the unfence.

      The cluster consists of 4 hosts (p10rhel092, p10rhel093, p10rhel094, p10rhel095) and a quorum device.
      Initially p10rhel094 was the DC.
      After Pacemaker was shut down on p10rhel094, p10rhel093 became the new DC.

      • At Sep 23 23:37:11 - both hosts p10rhel092 and p10rhel094 were rebooted
      • At 23:37:14 - p10rhel092 was fenced by p10rhel095.

       

      From old DC (p10rhel094) host:
      Sep 23 23:37:14.285 p10rhel094 pacemaker-fenced    [3080546] (handle_fence_request)     notice: Client pacemaker-controld.3080550 wants to fence p10rhel092 using any device
      Sep 23 23:37:14.285 p10rhel094 pacemaker-fenced    [3080546] (initiate_remote_stonith_op)       notice: Requesting peer fencing targeting p10rhel092 | id=b99c9447 state=querying base_timeout=120
      Sep 23 23:37:14.285 p10rhel094 pacemaker-fenced    [3080546] (can_fence_host_with_device)       info: db2_fence is eligible to fence p10rhel092: none
      Sep 23 23:37:14.285 p10rhel094 pacemaker-fenced    [3080546] (process_remote_stonith_query)     info: Query result 1 of 4 from p10rhel094 for p10rhel092/off (1 device) b99c9447-6050-4e2a-be74-571e4369bce7
      Sep 23 23:37:14.285 p10rhel094 pacemaker-fenced    [3080546] (process_remote_stonith_query)     info: Query result 2 of 4 from p10rhel092 for p10rhel092/off (0 devices) b99c9447-6050-4e2a-be74-571e4369bce7
      Sep 23 23:37:14.285 p10rhel094 pacemaker-fenced    [3080546] (process_remote_stonith_query)     info: Query result 3 of 4 from p10rhel093 for p10rhel092/off (1 device) b99c9447-6050-4e2a-be74-571e4369bce7
      Sep 23 23:37:14.285 p10rhel094 pacemaker-fenced    [3080546] (process_remote_stonith_query)     info: Query result 4 of 4 from p10rhel095 for p10rhel092/off (1 device) b99c9447-6050-4e2a-be74-571e4369bce7
      Sep 23 23:37:14.285 p10rhel094 pacemaker-fenced    [3080546] (request_peer_fencing)     info: Total timeout set to 144 for peer's fencing targeting p10rhel092 for pacemaker-controld.3080550|id=b99c9447
      Sep 23 23:37:14.285 p10rhel094 pacemaker-fenced    [3080546] (request_peer_fencing)     notice: Requesting that p10rhel095 perform 'off' action targeting p10rhel092 | for client pacemaker-controld.3080550 (144s, 0s)

      Sep 23 23:37:48.125 p10rhel094 pacemaker-fenced    [3080546] (finalize_op)      notice: Operation 'off' targeting p10rhel092 by p10rhel095 for pacemaker-controld.3080550@p10rhel094: OK (complete) | id=b99c9447
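To check how long a fencing operation took against the logged total timeout, the request and completion lines can be paired by operation id. The following is a minimal sketch, not part of the original report: the function name `op_durations` and the regular expression are illustrative assumptions, and it assumes both log lines were written on the same day.

```python
import re
from datetime import datetime

# Two lines copied from the p10rhel094 log above (whitespace condensed).
LOG = """\
Sep 23 23:37:14.285 p10rhel094 pacemaker-fenced [3080546] (initiate_remote_stonith_op) notice: Requesting peer fencing targeting p10rhel092 | id=b99c9447 state=querying base_timeout=120
Sep 23 23:37:48.125 p10rhel094 pacemaker-fenced [3080546] (finalize_op) notice: Operation 'off' targeting p10rhel092 by p10rhel095 for pacemaker-controld.3080550@p10rhel094: OK (complete) | id=b99c9447
"""

# Captures: time of day, logging function name, short operation id.
LINE = re.compile(
    r"^\w{3} \d+ (\d\d:\d\d:\d\d\.\d+) \S+ pacemaker-fenced\s+\[\d+\] "
    r"\((\w+)\).*\bid=(\w+)"
)

def op_durations(text):
    """Return {op_id: seconds from initiate_remote_stonith_op to finalize_op}.

    Hypothetical helper; assumes start and end fall on the same day.
    """
    started, durations = {}, {}
    for line in text.splitlines():
        m = LINE.match(line.strip())
        if not m:
            continue
        stamp = datetime.strptime(m.group(1), "%H:%M:%S.%f")
        func, op_id = m.group(2), m.group(3)
        if func == "initiate_remote_stonith_op":
            started[op_id] = stamp
        elif func == "finalize_op" and op_id in started:
            durations[op_id] = (stamp - started[op_id]).total_seconds()
    return durations

print(op_durations(LOG))
```

For the excerpt above, operation b99c9447 took roughly 33.8 seconds, well inside the 144-second total timeout reported by request_peer_fencing.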

      - On host p10rhel095, the fence agent script was invoked to perform the fence "off" action

      September-23 23:37:33 db2fence_ps(670288): Function: set_fence_status Line: 200 WARNING:Entry. Parameters: off
      September-23 23:37:33 db2fence_ps(670288): Function: fence_node Line: 506 WARNING:Entry. Parameters: p10rhel092, jstamko2
      September-23 23:37:33 db2fence_ps(670288): Function: fence_node Line: 510 WARNING:Remote cleanup: db2remotecleanup p10rhel092 jstamko2
      September-23 23:37:48 db2fence_ps(670288): Function: fence_node Line: 514 WARNING:Exit. Return code: 0
      September-23 23:37:48 db2fence_ps(670288): Function: is_node_fenced_off Line: 293 INFO:Expelled node: p10rhel092. Return Code: 0

      - From DC (p10rhel093) host:

      Sep 23 23:37:48.120 p10rhel093 pacemaker-fenced    [2108653] (finalize_op)      notice: Operation 'off' targeting p10rhel092 by p10rhel095 for pacemaker-controld.3080550@p10rhel094: OK (complete) | id=b99c9447

      • At 23:37:50 - Pacemaker was shut down on p10rhel094

        Sep 23 23:37:50.725 p10rhel094 pacemakerd          [3080544] (pcmk_shutdown_worker)     notice: Shutting down Pacemaker
        Sep 23 23:37:50.725 p10rhel094 pacemakerd          [3080544] (pcmk_shutdown_worker)     notice: Still waiting for pacemaker-controld to terminate | pid=3080550
        Sep 23 23:37:50.725 p10rhel094 pacemaker-fenced    [3080546] (crm_signal_dispatch)      notice: Caught 'Terminated' signal | 15 (invoking handler)
        Sep 23 23:37:50.725 p10rhel094 pacemaker-execd     [3080547] (crm_signal_dispatch)      notice: Caught 'Terminated' signal | 15 (invoking handler)

      • p10rhel093 is now the new DC

       

      • At 23:38:04 - p10rhel094 was fenced by p10rhel095, completed at 23:38:19

      Sep 23 23:38:04.580 p10rhel093 pacemaker-fenced    [2108653] (handle_fence_request)     notice: Client pacemaker-controld.2108657 wants to fence p10rhel094 using any device
      Sep 23 23:38:19.740 p10rhel093 pacemaker-fenced    [2108653] (finalize_op)      notice: Operation 'off' targeting p10rhel094 by p10rhel095 for pacemaker-controld.2108657@p10rhel093: OK (complete) | id=2486d142

      - Fence agent script was invoked on p10rhel095 to fence off host p10rhel094

      September-23 23:38:04 db2fence_ps(675837): Function: set_fence_status Line: 200 WARNING:Entry. Parameters: off
      September-23 23:38:04 db2fence_ps(675837): Function: fence_node Line: 506 WARNING:Entry. Parameters: p10rhel094, jstamko2
      September-23 23:38:19 db2fence_ps(675837): Function: fence_node Line: 514 WARNING:Exit. Return code: 0
      September-23 23:38:19 db2fence_ps(675837): Function: is_node_fenced_off Line: 293 INFO:Expelled node: p10rhel094. Return Code: 0

      - DC (p10rhel093) detected that host p10rhel092 rejoined

      Sep 23 23:40:24.680 p10rhel093 pacemaker-fenced    [2108653] (pcmk__get_peer)   info: Created entry 81f8faad-5e12-4a39-8b2c-06d5938b0be4/0x16f90d530 for node p10rhel092/2 (3 total)
      Sep 23 23:40:24.680 p10rhel093 pacemaker-fenced    [2108653] (pcmk__get_peer)   info: Node 2 is now known as p10rhel092
      Sep 23 23:40:24.680 p10rhel093 pacemaker-fenced    [2108653] (pcmk__get_peer)   info: Node 2 has uuid 2

      - The DC log shows a fence "on" operation targeting p10rhel092, performed by p10rhel095

      Sep 23 23:40:36.730 p10rhel093 pacemaker-fenced    [2108653] (can_fence_host_with_device)       info: db2_fence is eligible to fence p10rhel092: none
      Sep 23 23:40:36.810 p10rhel093 pacemaker-fenced    [2108653] (finalize_op)      notice: Operation 'on' targeting p10rhel092 by p10rhel095 for stonith_admin.688755@p10rhel095: OK (complete) | id=603a5d33

      - Host p10rhel095 saw the fence "on" request but NEVER INVOKED the fence agent script to perform the fence "on" action. Instead it just returned success (0)

      Sep 23 23:40:36.724 p10rhel095 pacemaker-fenced    [3410737] (handle_fence_request)     notice: Client stonith_admin.688755 wants to fence p10rhel092 using any device
      Sep 23 23:40:36.724 p10rhel095 pacemaker-fenced    [3410737] (initiate_remote_stonith_op)       notice: Requesting peer fencing targeting p10rhel092 | id=603a5d33 state=querying base_timeout=120
      Sep 23 23:40:36.724 p10rhel095 pacemaker-fenced    [3410737] (can_fence_host_with_device)       info: db2_fence is eligible to fence p10rhel092: none
      Sep 23 23:40:36.724 p10rhel095 pacemaker-fenced    [3410737] (process_remote_stonith_query)     info: Query result 1 of 3 from p10rhel095 for p10rhel092/on (1 device) 603a5d33-2fa6-472b-94b6-9cc6ddca0b8c
      Sep 23 23:40:36.724 p10rhel095 pacemaker-fenced    [3410737] (request_peer_fencing)     info: Total timeout set to 144 for peer's fencing targeting p10rhel092 for stonith_admin.688755|id=603a5d33
      Sep 23 23:40:36.724 p10rhel095 pacemaker-fenced    [3410737] (request_peer_fencing)     notice: Requesting that p10rhel095 perform 'on' action targeting p10rhel092 | for client stonith_admin.688755 (144s, 0s)
      Sep 23 23:40:36.724 p10rhel095 pacemaker-fenced    [3410737] (process_remote_stonith_query)     info: Query result 2 of 3 from p10rhel092 for p10rhel092/on (1 device) 603a5d33-2fa6-472b-94b6-9cc6ddca0b8c
      Sep 23 23:40:36.724 p10rhel095 pacemaker-fenced    [3410737] (process_remote_stonith_query)     info: Query result 3 of 3 from p10rhel093 for p10rhel092/on (1 device) 603a5d33-2fa6-472b-94b6-9cc6ddca0b8c
      Sep 23 23:40:36.724 p10rhel095 pacemaker-fenced    [3410737] (can_fence_host_with_device)       info: db2_fence is eligible to fence p10rhel092: none
      Sep 23 23:40:36.724 p10rhel095 pacemaker-fenced    [3410737] (stonith_fence_get_devices_cb)     info: Found 1 matching device for target 'p10rhel092'
      Sep 23 23:40:36.804 p10rhel095 pacemaker-fenced    [3410737] (log_async_result)         notice: Operation 'on' [688756] targeting p10rhel092 using db2_fence returned 0 | call 2 from stonith_admin.688755

      - At 23:40:54 the DC detected that host p10rhel094 rejoined the cluster

      Sep 23 23:40:54.980 p10rhel093 pacemaker-fenced    [2108653] (pcmk__get_peer)   info: Created entry c9abff4a-efcb-414e-b907-9c7b9ce497cc/0x16f9e6bf0 for node p10rhel094/1 (4 total)
      Sep 23 23:40:54.980 p10rhel093 pacemaker-fenced    [2108653] (pcmk__get_peer)   info: Node 1 is now known as p10rhel094
      Sep 23 23:40:54.980 p10rhel093 pacemaker-fenced    [2108653] (pcmk__get_peer)   info: Node 1 has uuid 1

      - Then the fence "on" for p10rhel094, requested at 23:41:02, completed successfully at 23:41:04

      Sep 23 23:41:04.780 p10rhel093 pacemaker-fenced    [2108653] (finalize_op)      notice: Operation 'on' targeting p10rhel094 by p10rhel095 for stonith_admin.690473@p10rhel095: OK (complete) | id=4869624d
      Sep 23 23:41:04.780 p10rhel093 pacemaker-controld  [2108657] (handle_fence_notification)        notice: p10rhel094 was unfenced by p10rhel095 at the request of stonith_admin.690473@p10rhel095

      - Host p10rhel095 saw the fence "on" request for host p10rhel094 and was able to invoke the fence agent script

      Sep 23 23:41:02.094 p10rhel095 pacemaker-fenced    [3410737] (handle_fence_request)     notice: Client stonith_admin.690473 wants to fence p10rhel094 using any device
      Sep 23 23:41:02.094 p10rhel095 pacemaker-fenced    [3410737] (initiate_remote_stonith_op)       notice: Requesting peer fencing targeting p10rhel094 | id=4869624d state=querying base_timeout=120
      Sep 23 23:41:02.094 p10rhel095 pacemaker-fenced    [3410737] (can_fence_host_with_device)       info: db2_fence is eligible to fence p10rhel094: none
      Sep 23 23:41:02.094 p10rhel095 pacemaker-fenced    [3410737] (process_remote_stonith_query)     info: Query result 1 of 4 from p10rhel095 for p10rhel094/on (1 device) 4869624d-3372-4561-a0ab-64cff37bf843
      Sep 23 23:41:02.094 p10rhel095 pacemaker-fenced    [3410737] (request_peer_fencing)     info: Total timeout set to 144 for peer's fencing targeting p10rhel094 for stonith_admin.690473|id=4869624d
      Sep 23 23:41:02.094 p10rhel095 pacemaker-fenced    [3410737] (request_peer_fencing)     notice: Requesting that p10rhel095 perform 'on' action targeting p10rhel094 | for client stonith_admin.690473 (144s, 0s)
      Sep 23 23:41:02.094 p10rhel095 pacemaker-fenced    [3410737] (process_remote_stonith_query)     info: Query result 2 of 4 from p10rhel094 for p10rhel094/on (1 device) 4869624d-3372-4561-a0ab-64cff37bf843
      Sep 23 23:41:02.094 p10rhel095 pacemaker-fenced    [3410737] (process_remote_stonith_query)     info: Query result 3 of 4 from p10rhel092 for p10rhel094/on (1 device) 4869624d-3372-4561-a0ab-64cff37bf843
      Sep 23 23:41:02.094 p10rhel095 pacemaker-fenced    [3410737] (process_remote_stonith_query)     info: Query result 4 of 4 from p10rhel093 for p10rhel094/on (1 device) 4869624d-3372-4561-a0ab-64cff37bf843
      Sep 23 23:41:02.094 p10rhel095 pacemaker-fenced    [3410737] (can_fence_host_with_device)       info: db2_fence is eligible to fence p10rhel094: none
      Sep 23 23:41:02.094 p10rhel095 pacemaker-fenced    [3410737] (stonith_fence_get_devices_cb)     info: Found 1 matching device for target 'p10rhel094'

      September-23 23:41:02 db2fence_ps(690474): Function: is_node_fenced_off Line: 293 INFO:Expelled node: p10rhel094. Return Code: 0
      September-23 23:41:02 db2fence_ps(690474): Function: set_fence_status Line: 200 WARNING:Entry. Parameters: on
      September-23 23:41:02 db2fence_ps(690474): Function: unfence_node Line: 475 WARNING:Entry. Parameters: p10rhel094, jstamko2
      September-23 23:41:04 db2fence_ps(690474): Function: unfence_node Line: 486 WARNING:Exit. Return code: 0
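The comparison made above (a fencer result with no matching agent log entry) can be automated. The sketch below is illustrative only and rests on two assumptions that the report does not confirm: that the bracketed number in the `log_async_result` line is the PID of the process the fencer started for the agent, and that the number in `db2fence_ps(...)` is the script's own PID. The function name `silent_operations` is hypothetical.

```python
import re

# Excerpts copied from the p10rhel095 logs above (whitespace condensed).
FENCED_LOG = """\
Sep 23 23:40:36.804 p10rhel095 pacemaker-fenced [3410737] (log_async_result) notice: Operation 'on' [688756] targeting p10rhel092 using db2_fence returned 0 | call 2 from stonith_admin.688755
"""
AGENT_LOG = """\
September-23 23:41:02 db2fence_ps(690474): Function: set_fence_status Line: 200 WARNING:Entry. Parameters: on
September-23 23:41:02 db2fence_ps(690474): Function: unfence_node Line: 475 WARNING:Entry. Parameters: p10rhel094, jstamko2
September-23 23:41:04 db2fence_ps(690474): Function: unfence_node Line: 486 WARNING:Exit. Return code: 0
"""

# Captures: action, bracketed PID (assumed agent PID), target, device, return code.
RESULT = re.compile(
    r"\(log_async_result\).*Operation '(\w+)' \[(\d+)\] "
    r"targeting (\S+) using (\S+) returned (-?\d+)"
)
# Captures the PID the agent script logs about itself (assumption).
AGENT = re.compile(r"db2fence_ps\((\d+)\)")

def silent_operations(fenced_log, agent_log):
    """List fencer results whose bracketed PID never appears in the agent log."""
    agent_pids = set(AGENT.findall(agent_log))
    silent = []
    for m in RESULT.finditer(fenced_log):
        action, pid, target, device, rc = m.groups()
        if pid not in agent_pids:
            silent.append((action, target, device, pid, int(rc)))
    return silent

for action, target, device, pid, rc in silent_operations(FENCED_LOG, AGENT_LOG):
    print(f"'{action}' on {target} via {device}: rc={rc}, no agent log entry for PID {pid}")
```

With these excerpts, the only operation reported is the fence "on" for p10rhel092. If the PID assumption holds, running the same check over the full logs may help distinguish an agent that was never started from one that ran without logging.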

      What is the impact of this issue to you?  The cluster was unable to recover from the failure.

      Please provide the package NVR for which the bug is seen:

      Pacemaker 2.1.7-4.db2pcmk.el9

      How reproducible is this bug?:

      Hit on the first iteration on a ppc64le cluster.

      Steps to reproduce

      1. Set up a cluster consisting of 4 hosts and a quorum device
      2. Configure it with the Db2 pureScale resource model and the Db2 fence agent
      3. Reboot 2 hosts: the first host is a non-DC and the second host is the DC

      Expected results: For each rebooting host, the fence "off" and fence "on" actions invoke the corresponding fence agent script function to perform the fencing operations.

      Actual results: Fence "off" was performed successfully on both rebooting hosts, but fence "on" failed to invoke the fence agent script function for one of the 2 hosts when the hosts rejoined.

              kgaillot@redhat.com Kenneth Gaillot
              lpham@ca.ibm.com Lan Pham
              IBM zSeries Confidential Group - deprecated
              Kenneth Gaillot
              Cluster QE