RHEL-126940

Recurring monitor is immediately cancelled/merged by execd, but controld still expects a result and times out → transition abort


    • Type: Bug
    • Resolution: Unresolved
    • Priority: Normal
    • Affects Version: rhel-9.6
    • Component: pacemaker
    • Assigned Team: rhel-ha-pacemaker
    • Architecture: x86_64

      Environment

      • Pacemaker 2.1.9 (built from source)
      • 2-node cluster: pcmk7-p (was primary) and pcmk7-s (standby)
      • DB2 HADR resource: db2_svtdbm_svtdbm_HADRDB1 with recurring monitor ..._monitor_9000

      Description
      We hit a situation where pacemaker-controld scheduled a recurring monitor for a promoted DB2/HADR resource, but pacemaker-execd immediately treated that monitor as a duplicate and cancelled/merged it. Afterwards, pacemaker-controld never received a result for the monitor and, after the action timeout + cluster-delay, reported the monitor as timed out and aborted the transition.

      This looks like a race between controld's expectation ("I started a recurring op, I will get a result") and execd's logic to cancel/merge a recurring op that it considers a duplicate.
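      To make the suspected interleaving concrete, here is a minimal standalone C sketch; it is not Pacemaker source, and all types and function names in it are hypothetical simplifications. It only illustrates why a silently merged op leaves controld waiting for a result that never arrives:

      #include <stdbool.h>
      #include <stdio.h>

      /* Hypothetical, simplified state for each daemon. */
      struct pending_op { char key[64]; bool result_seen; };

      static struct pending_op in_flight;            /* controld: op awaiting a result */
      static bool old_recurring_entry_exists = true; /* execd: previous monitor entry */

      /* controld: "Requesting local execution of monitor operation ..." */
      static void controld_start_monitor(const char *op_key) {
          snprintf(in_flight.key, sizeof(in_flight.key), "%s", op_key);
          in_flight.result_seen = false;   /* from now on, a result is expected */
      }

      /* execd: treats the request as a duplicate, merges it into the existing
       * entry, and returns without scheduling anything that would ever
       * produce a result for this particular request. */
      static bool execd_schedule_monitor(const char *op_key) {
          if (old_recurring_entry_exists) {
              printf("execd: Duplicate recurring op entry detected (%s), merging\n", op_key);
              return false;   /* merged: no completion event for the caller */
          }
          return true;
      }

      int main(void) {
          const char *key = "db2_svtdbm_svtdbm_HADRDB1_monitor_9000";
          controld_start_monitor(key);
          if (!execd_schedule_monitor(key) && !in_flight.result_seen) {
              /* Nothing ever delivers a result, so after action timeout plus
               * cluster-delay controld declares a timeout and aborts. */
              printf("controld: no result for %s -> timed out\n", in_flight.key);
          }
          return 0;
      }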

      Relevant log lines

      From pcmk7-p:

      Oct 27 09:25:30.687 pacemaker-controld ... notice: Requesting local execution of monitor operation for db2_svtdbm_svtdbm_HADRDB1 on pcmk7-p | ... op_key=db2_svtdbm_svtdbm_HADRDB1_monitor_9000
      Oct 27 09:25:30.687 pacemaker-execd ... info: Cancelling ocf operation db2_svtdbm_svtdbm_HADRDB1_monitor_9000
      Oct 27 09:25:30.687 pacemaker-execd ... warning: Duplicate recurring op entry detected (db2_svtdbm_svtdbm_HADRDB1_monitor_9000), merging with previous op entry

      About 2 minutes later, controld times out:

      Oct 27 09:27:30.687 pacemaker-controld ... error: Node pcmk7-p did not send monitor result (via controller) within 120000ms (action timeout plus cluster-delay)
      Oct 27 09:27:30.687 pacemaker-controld ... error: [Action    2]: In-flight resource op db2_svtdbm_svtdbm_HADRDB1_monitor_9000 on pcmk7-p
      Oct 27 09:27:30.687 pacemaker-controld ... warning: rsc_op 2: db2_svtdbm_svtdbm_HADRDB1_monitor_9000 on pcmk7-p timed out 
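      For reference, the 120000ms figure matches the controller's wait of action timeout plus cluster-delay: assuming a 60-second monitor timeout and Pacemaker's default cluster-delay of 60s, 60000ms + 60000ms = 120000ms. (The configured monitor timeout is not visible in this excerpt, so the 60s value is an assumption.)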

      After this, the scheduler starts recovery and moves the promoted instance.

      Actual result

      • Execd says "I cancelled / merged that recurring monitor."
      • Controld never gets a result.
      • Controld times out the monitor and aborts the transition.
      • This cascades into resource demote/stop/move, which in our case interacted badly with DB2 HADR takeover.

      Expected result
      One of the following should happen:

      1. If execd decides to cancel/merge the recurring monitor, controld should be told explicitly so it does not wait for the result (a sketch of this option follows the list); or
      2. Execd should not merge into an op that has just been cancelled / is being torn down; the new monitor request should become an actual runnable recurring op; or
      3. Scheduler/controld should re-issue the monitor after the cancel so that controller and executor stay in sync.
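      As a sketch of option 1 (not a real patch; notify_controller() and op_reply are invented stand-ins for the actual execd IPC reply), the merge path could report back explicitly instead of staying silent:

      #include <stdbool.h>
      #include <stdio.h>

      /* Invented stand-ins for the reply execd would send to the requester. */
      enum op_status { OP_SCHEDULED, OP_MERGED };
      struct op_reply { const char *op_key; enum op_status status; };

      static void notify_controller(struct op_reply reply) {
          printf("execd -> controld: %s %s\n", reply.op_key,
                 (reply.status == OP_MERGED)
                 ? "merged into existing entry (stop waiting for a result)"
                 : "scheduled");
      }

      static void schedule_recurring(const char *op_key, bool duplicate_exists) {
          if (duplicate_exists) {
              /* Merge as today, but say so, so that controld can drop the op
               * from its in-flight list instead of waiting out the timeout. */
              notify_controller((struct op_reply){ op_key, OP_MERGED });
              return;
          }
          notify_controller((struct op_reply){ op_key, OP_SCHEDULED });
      }

      int main(void) {
          schedule_recurring("db2_svtdbm_svtdbm_HADRDB1_monitor_9000", true);
          return 0;
      }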

      Suspected code areas

      • daemons/execd/execd_commands.c - merge_recurring_duplicate(...)
      • lib/services/services.c - services_action_cancel() / cancel_recurring_action()
      • Controller-side LRM handling in daemons/controld, to make sure a cancelled/merged recurring op does not stay in the "expected" list (see the sketch after this list).
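      A minimal sketch of that controller-side idea, assuming a hypothetical pending-op table (controld's real bookkeeping lives elsewhere and uses different structures):

      #include <stdio.h>
      #include <string.h>

      #define MAX_PENDING 8

      /* Hypothetical in-flight table; controld's real bookkeeping differs. */
      static const char *pending_ops[MAX_PENDING];
      static int n_pending;

      static void add_pending(const char *op_key) {
          if (n_pending < MAX_PENDING) {
              pending_ops[n_pending++] = op_key;
          }
      }

      /* To be called on any execd confirmation that an op will never produce
       * a result (cancelled, or merged into an existing recurring entry), so
       * the op cannot later be reported as timed out. */
      static void drop_pending(const char *op_key) {
          for (int i = 0; i < n_pending; i++) {
              if (strcmp(pending_ops[i], op_key) == 0) {
                  pending_ops[i] = pending_ops[--n_pending];
                  printf("controld: %s cancelled/merged, no longer waiting\n", op_key);
                  return;
              }
          }
      }

      int main(void) {
          add_pending("db2_svtdbm_svtdbm_HADRDB1_monitor_9000");
          drop_pending("db2_svtdbm_svtdbm_HADRDB1_monitor_9000"); /* timeout timer never fires */
          printf("%d op(s) still in flight\n", n_pending);
          return 0;
      }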

      Attachments

        1. primary1 pacemaker.log
          17.36 MB
        2. primary journal.out
          141.22 MB
        3. primary messages
          19.10 MB
        4. secondary journal.out
          83.01 MB
        5. secondary messages
          13.34 MB
        6. secondary pacemaker.log
          8.56 MB

      Assignee: Christopher Lumens (rhn-support-clumens)
      Reporter: Mehrdad Mehraban (mehrdadid)
      QA Contact: Cluster QE
      Votes: 0
      Watchers: 4