Type: Bug
Resolution: Won't Do
Priority: Undefined
Fix Version/s: None
Affects Version/s: rhel-9.4
Component/s: pacemaker
Labels: None
Regression: No
Severity: None
Pool Team: rhel-sst-high-availability
Sub-System Group: ssg_filesystems_storage_and_HA
Story Points: None
Blocked: False
Blocked Reason: None
Product Documentation Required: None
Sprint: None
Preliminary Testing: None
Test Coverage: None
SFDC Cases Counter:
SFDC Cases Open:
SFDC Cases Links:
Planning: None
What were you trying to do that didn't work? In a host reboot scenario that involves fencing, a fencing action can be blocked while it waits for an in-progress resource START action to complete. This is a problem when the START action itself depends indirectly on that fencing action having run, because the two then form a circular dependency.

Here is the sequence of events:

The cluster has 4 nodes (p10rhel092, p10rhel093, p10rhel094, p10rhel095) and a quorum device.
2 nodes were rebooted (p10rhel092, then p10rhel094).
Fencing was triggered for p10rhel092.
A db2 member resource that had been running on p10rhel092 was restarted on p10rhel093.
The START action for the db2 member resource could not complete in a timely fashion because it depends on fencing of p10rhel094.
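The circular wait described in the sequence above can be sketched as a small wait-for graph. This is only an illustration of the reporter's description: the action labels and the `WAITS_FOR` mapping are made up for this example and are not Pacemaker's internal transition graph. Note that in the logged incident the START action eventually exited after about two minutes, so the wait was bounded rather than a permanent deadlock.

```python
# Toy wait-for graph built from the report's description. The labels are
# illustrative assumptions, not Pacemaker's internal action names.
WAITS_FOR = {
    "start db2_member on p10rhel093": ["fence p10rhel094"],
    "fence p10rhel094": ["start db2_member on p10rhel093"],
    "fence p10rhel092": [],
}


def find_cycle(graph):
    """Return one cycle as a list of nodes, or None if the graph is acyclic."""
    visiting = []  # current depth-first path
    done = set()   # nodes fully explored without finding a cycle

    def visit(node):
        if node in done:
            return None
        if node in visiting:
            # Back edge: slice out the loop and close it on the repeated node.
            return visiting[visiting.index(node):] + [node]
        visiting.append(node)
        for dep in graph.get(node, []):
            cycle = visit(dep)
            if cycle:
                return cycle
        visiting.pop()
        done.add(node)
        return None

    for node in graph:
        cycle = visit(node)
        if cycle:
            return cycle
    return None


cycle = find_cycle(WAITS_FOR)
print(" -> ".join(cycle) if cycle else "no cycle")
```

The first fencing action (p10rhel092) has no outstanding dependencies in this model, which is consistent with it completing promptly in the timeline.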

The main question is: Can we configure Pacemaker so that node fencing can be triggered when needed, without waiting for an in-progress resource action to complete?

Timeline:

2024-10-02-19.22.52: 2024-10-02-19.22.52: p10rhel092.rtp.raleigh.ibm.com is being REBOOT
2024-10-02-19.22.54: 2024-10-02-19.22.54: p10rhel094.rtp.raleigh.ibm.com is being REBOOT

– p10rhel092 (member 0) was fenced by p10rhel093 at 19:22:55 and completed at 19:23:06 (took 11 seconds)

Oct 02 19:22:55.526 p10rhel093 pacemaker-fenced [3089] (initiate_remote_stonith_op) notice: Requesting peer fencing targeting p10rhel092 | id=b455b228 state=querying base_timeout=60

Oct 02 19:23:06.156 p10rhel093 pacemaker-fenced [3089] (log_async_result) notice: Operation 'off' [187490] targeting p10rhel092 using db2_fence returned 0 | call 7 from pacemaker-controld.3093

>> There was a delay of more than 2 minutes, between 19:23:06 and 19:25:17, before p10rhel094 was fenced <<<

– It seems like this delay was caused by db2_member_jstamko2_0 start hanging/blocking for 2 minutes.

Oct 02 19:23:08.456 p10rhel093 pacemaker-controld [3093] (do_lrm_rsc_op) notice: Requesting local execution of start operation for db2_member_jstamko2_0 on p10rhel093 | transition_key=35:61:0:263f31a1-6ab7-444a-aa40-802a1a11824a op_key=db2_member_jstamko2_0_start_0

Oct 02 19:25:17.856 p10rhel093 pacemaker-execd [3090] (log_finished) info: db2_member_jstamko2_0 start (call 131, PID 189476) exited with status 0 (execution time 2m9.400s)

– p10rhel094 (CF 129) was fenced by p10rhel093 at 19:25:17 and completed at 19:25:19 (took 2 seconds)

Oct 02 19:25:17.966 p10rhel093 pacemaker-fenced [3089] (initiate_remote_stonith_op) notice: Requesting peer fencing targeting p10rhel094 | id=ae98075e state=querying base_timeout=60

Oct 02 19:25:19.496 p10rhel093 pacemaker-fenced [3089] (log_async_result) notice: Operation 'off' [205224] targeting p10rhel094 using db2_fence returned 0 | call 8 from pacemaker-controld.3093

– p10rhel094 was unfenced at 19:26:21 and completed at 19:26:21 (less than 1s)

Oct 02 19:26:21.896 p10rhel093 pacemaker-fenced [3089] (initiate_remote_stonith_op) notice: Requesting peer fencing targeting p10rhel094 | id=fc118962 state=querying base_timeout=120

Oct 02 19:26:21.896 p10rhel093 pacemaker-fenced [3089] (request_peer_fencing) notice: Requesting that p10rhel093 perform 'on' action targeting p10rhel094 | for client stonith_admin.211560 (144s, 0s)

– p10rhel092 was unfenced at 19:27:01 and completed at 19:27:03 (took 2s)

Oct 02 19:27:01.366 p10rhel093 pacemaker-fenced [3089] (initiate_remote_stonith_op) notice: Requesting peer fencing targeting p10rhel092 | id=51625729 state=querying base_timeout=120

Oct 02 19:27:03.006 p10rhel093 pacemaker-fenced [3089] (log_async_result) notice: Operation 'on' [214348] targeting p10rhel092 using db2_fence returned 0 | call 2 from stonith_admin.214347
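The intervals in the timeline above can be recomputed directly from the log timestamps. The following is a minimal sketch: the event labels are shorthand invented for this example, the timestamps are copied from the log excerpts, and the year 2024 is an assumption taken from the reboot entries because the Pacemaker log lines do not include it.

```python
from datetime import datetime

# Timestamps copied from the log excerpts in the timeline; labels are
# shorthand for this example only.
EVENTS = [
    ("fence p10rhel092 requested", "Oct 02 19:22:55.526"),
    ("fence p10rhel092 completed", "Oct 02 19:23:06.156"),
    ("db2 member start requested", "Oct 02 19:23:08.456"),
    ("db2 member start finished", "Oct 02 19:25:17.856"),
    ("fence p10rhel094 requested", "Oct 02 19:25:17.966"),
    ("fence p10rhel094 completed", "Oct 02 19:25:19.496"),
]


def parse(stamp, year=2024):
    """Parse a log timestamp; the year is assumed, not present in the log."""
    return datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S.%f")


TIMES = {label: parse(stamp) for label, stamp in EVENTS}


def seconds_between(start_label, end_label):
    """Elapsed seconds between two labelled events."""
    return (TIMES[end_label] - TIMES[start_label]).total_seconds()


INTERVALS = {
    "fencing of p10rhel092": seconds_between(
        "fence p10rhel092 requested", "fence p10rhel092 completed"),
    "db2 member start": seconds_between(
        "db2 member start requested", "db2 member start finished"),
    "gap between the two fencing operations": seconds_between(
        "fence p10rhel092 completed", "fence p10rhel094 requested"),
    "start finished to fence p10rhel094 requested": seconds_between(
        "db2 member start finished", "fence p10rhel094 requested"),
    "fencing of p10rhel094": seconds_between(
        "fence p10rhel094 requested", "fence p10rhel094 completed"),
}

for name, secs in INTERVALS.items():
    print(f"{name}: {secs:.3f}s")
```

The computed start duration of 129.400s matches the 2m9.400s execution time reported by pacemaker-execd, and the fencing request for p10rhel094 follows the end of the start action by about 0.11s, which supports the reading that the second fencing operation was waiting on the start action.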

------------

What is the impact of this issue to you? The cluster was unable to recover automatically after multiple host failures.

Please provide the package NVR for which the bug is seen:

Pacemaker 2.1.7-4.db2pcmk.el9

How reproducible is this bug?: Hit only once so far

Steps to reproduce

Set up a cluster with 4 nodes and a q-device
Configure with Db2 resource model and Db2 fencing agent
Reboot 2 nodes

Expected results: Fencing actions are triggered for both down nodes in a timely fashion.

Actual results: Fencing action for the second node was blocked waiting for a resource START action to complete first.

The key question here is: Is it working as designed that a fencing action can be blocked waiting for an in-progress resource action? If so, is there a way to configure the cluster so that fencing actions are independent of resource actions and can run concurrently?


fencing-resource-action-serialized.tar.bz2
6.54 MB
2024/10/03 7:59 PM

links to

ClusterLabs T900

Assignee: Kenneth Gaillot

Reporter: Lan Pham

Contributing Groups: IBM Confidential Group

Developer: Kenneth Gaillot

QA Contact: Cluster QE

Votes: 0

Watchers: 6

Created: 2024/10/03 7:58 PM

Updated: 2024/10/16 10:00 PM

Resolved: 2024/10/16 8:38 PM
