Type: Bug
Resolution: Unresolved
Priority: Normal
Fix Version/s: None
Affects Version/s: rhel-9.2.0
Component/s: pacemaker
Labels: None
Regression: No
Severity: Low
Pool Team: rhel-sst-high-availability
Story Points: 5
Blocked: False
Blocked Reason: None
Product Documentation Required: None
Sprint: None
Preliminary Testing: None
Test Coverage: None
SFDC Cases Counter:
SFDC Cases Open:
SFDC Cases Links:
Planning: None

What were you trying to do that didn't work?

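I was periodically running crm_resource --refresh --resource <resource_name> and noticed that after one of the refreshes the resource failed over to another host in the cluster. The Pacemaker logs show that the refresh cancelled the recurring monitor while it was still in flight, the monitor process was killed, and a monitor failure was then recorded: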
Aug 29 13:42:35.400 R9GHADR-srv-2 pacemaker-execd [2447551] (cancel_recurring_action) info: Cancelling ocf operation db2_gerry_gerry_TESTDB_monitor_9000
Aug 29 13:42:35.400 R9GHADR-srv-2 pacemaker-execd [2447551] (services_action_cancel) info: Terminating in-flight op db2_gerry_gerry_TESTDB_monitor_9000[713660] early because it was cancelled
Aug 29 13:42:35.401 R9GHADR-srv-2 pacemaker-execd [2447551] (async_action_complete) info: db2_gerry_gerry_TESTDB_monitor_9000[713660] terminated with signal 9 (Killed)
Aug 29 13:42:35.401 R9GHADR-srv-2 pacemaker-execd [2447551] (cancel_recurring_action) info: Cancelling ocf operation db2_gerry_gerry_TESTDB_monitor_9000
<...>
Aug 29 13:42:35.403 R9GHADR-srv-2 pacemaker-attrd [2447552] (update_attr_on_host) notice: Setting last-failure-db2_gerry_gerry_TESTDB#monitor_9000[R9GHADR-srv-2] in instance_attributes: (unset) -> 1724964155 | from R9GHADR-srv-2 with no write delay

Please provide the package NVR for which bug is seen:

How reproducible: Very easy.

Steps to reproduce

Make the resource's monitor take a long time, e.g. add sleep 10 to it.
Set the migration-threshold for the resource to 1. This is optional, but it is what I use, and it makes the failure very obvious because a single monitor failure results in takeover.
Continuously issue crm_resource --refresh --resource <resource_name> until the failure is observed.
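
The refresh loop in the last step can be sketched as a small shell function. This is only a rough sketch, not the exact script used for this report: the function name refresh_until_failure, the iteration cap, the delay between refreshes, and the log path /var/log/pacemaker/pacemaker.log are all assumptions. It detects the failure by watching for new "Setting last-failure-<resource>#monitor" lines like the pacemaker-attrd line quoted above.

```shell
#!/bin/sh
# Sketch: refresh a resource repeatedly until a new monitor failure is logged.
# Assumptions (not from the report): the default log path, the cap of 100
# refreshes, and the 5 second delay between refreshes.
#
# Usage: refresh_until_failure <resource_name> [max_tries] [delay_seconds]
refresh_until_failure() {
    resource=$1
    max_tries=${2:-100}
    delay=${3:-5}
    log=${PACEMAKER_LOG:-/var/log/pacemaker/pacemaker.log}

    if [ -z "$resource" ]; then
        echo "usage: refresh_until_failure <resource_name> [max_tries] [delay_seconds]" >&2
        return 2
    fi

    # Fixed-string pattern matching the attrd "last-failure" log line.
    pattern="Setting last-failure-${resource}#monitor"

    # Count failure records already present so only new ones are reported.
    before=$(grep -F -c "$pattern" "$log" 2>/dev/null || true)
    before=${before:-0}

    try=0
    while [ "$try" -lt "$max_tries" ]; do
        try=$((try + 1))
        # Same command as in the report; give up if the command itself fails.
        crm_resource --refresh --resource "$resource" || return 2
        sleep "$delay"

        now=$(grep -F -c "$pattern" "$log" 2>/dev/null || true)
        now=${now:-0}
        if [ "$now" -gt "$before" ]; then
            echo "monitor failure recorded after $try refresh(es)"
            return 0
        fi
    done

    echo "no monitor failure recorded after $max_tries refreshes"
    return 1
}
```

After defining the function, run it on a cluster node as refresh_until_failure <resource_name>. A delay shorter than the slowed-down monitor makes it more likely that a refresh lands while the monitor is still in flight.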

Expected results

Pacemaker should rerun the cancelled monitor without failing over the resource to another host.

Actual results

Pacemaker kills the running monitor and counts it as a monitor failure. This is problematic when the migration-threshold is set to 1, because that single failure causes the resource to fail over to another host.

links to

ClusterLabs T872

Assignee: Christopher Lumens

Reporter: Gerry Sommerville

Developer: Kenneth Gaillot

QA Contact: Cluster QE

Votes: 0

Watchers: 5

Created: 2024/08/30 12:21 AM

Updated: 2025/01/08 9:10 PM
