- Bug
- Resolution: Unresolved
- Normal
- rhel-9.6
- rhel-ha-pacemaker
- x86_64
Environment
- Pacemaker 2.1.9 (built from source)
- 2-node cluster: pcmk7-p (was primary) and pcmk7-s (standby)
- DB2 HADR resource: db2_svtdbm_svtdbm_HADRDB1 with recurring monitor ..._monitor_9000
Description
We hit a situation where pacemaker-controld scheduled a recurring monitor for a promoted DB2/HADR resource, but pacemaker-execd immediately treated that monitor as a duplicate and cancelled/merged it. Afterwards, pacemaker-controld never received a result for the monitor and, after the action timeout + cluster-delay, reported the monitor as timed out and aborted the transition.
This looks like a race between controld's expectation ("I started a recurring op, I will get a result") and execd's logic to cancel/merge a recurring op that it considers a duplicate.
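The suspected race can be illustrated with a small toy model. This is only a sketch of the behaviour described above, not Pacemaker code: ToyExecutor, ToyController, and the "cancelling" state are hypothetical names, and the real executor/controller protocol is more involved.

```python
from dataclasses import dataclass, field


@dataclass
class ToyExecutor:
    """Hypothetical stand-in for the executor's table of recurring ops."""
    recurring: dict = field(default_factory=dict)  # op_key -> state string

    def execute(self, op_key, reply):
        if op_key in self.recurring:
            # Duplicate detected: merge into the existing entry.
            # Modelled bug: nothing is sent back for the new request.
            return "merged"
        self.recurring[op_key] = "active"
        reply(op_key, "ok")
        return "scheduled"


@dataclass
class ToyController:
    """Hypothetical stand-in for the controller's in-flight action list."""
    pending: set = field(default_factory=set)

    def request(self, executor, op_key):
        # The controller records the op as in flight before asking for it.
        self.pending.add(op_key)
        return executor.execute(op_key, self.on_result)

    def on_result(self, op_key, status):
        self.pending.discard(op_key)


op_key = "db2_svtdbm_svtdbm_HADRDB1_monitor_9000"
# Assumption: an older entry for the same key is still being torn down.
executor = ToyExecutor(recurring={op_key: "cancelling"})
controller = ToyController()
outcome = controller.request(executor, op_key)
print(outcome, controller.pending)
```

In this model the request is merged while the op stays in the controller's pending set indefinitely, which matches the symptom of waiting until the action timeout plus cluster-delay expires.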
Relevant log lines
From pcmk7-p:
Oct 27 09:25:30.687 pacemaker-controld ... notice: Requesting local execution of monitor operation for db2_svtdbm_svtdbm_HADRDB1 on pcmk7-p | ... op_key=db2_svtdbm_svtdbm_HADRDB1_monitor_9000
Oct 27 09:25:30.687 pacemaker-execd ... info: Cancelling ocf operation db2_svtdbm_svtdbm_HADRDB1_monitor_9000
Oct 27 09:25:30.687 pacemaker-execd ... warning: Duplicate recurring op entry detected (db2_svtdbm_svtdbm_HADRDB1_monitor_9000), merging with previous op entry
About 2 minutes later, controld times out:
Oct 27 09:27:30.687 pacemaker-controld ... error: Node pcmk7-p did not send monitor result (via controller) within 120000ms (action timeout plus cluster-delay)
Oct 27 09:27:30.687 pacemaker-controld ... error: [Action 2]: In-flight resource op db2_svtdbm_svtdbm_HADRDB1_monitor_9000 on pcmk7-p
Oct 27 09:27:30.687 pacemaker-controld ... warning: rsc_op 2: db2_svtdbm_svtdbm_HADRDB1_monitor_9000 on pcmk7-p timed out
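As a quick cross-check, the gap between the request and the timeout can be computed from the two log timestamps. The helper below is a minimal sketch; elapsed_ms is a hypothetical name, and because the log lines carry no year, an arbitrary placeholder year is prepended purely so the timestamps parse.

```python
from datetime import datetime

# Assumption: the log omits the year, so a placeholder year is prepended.
# It cancels out as long as both timestamps fall in the same year.
PLACEHOLDER_YEAR = "2000"
FMT = "%Y %b %d %H:%M:%S.%f"


def elapsed_ms(start, end):
    """Return the whole milliseconds between two 'Oct 27 09:25:30.687' stamps."""
    t0 = datetime.strptime(f"{PLACEHOLDER_YEAR} {start}", FMT)
    t1 = datetime.strptime(f"{PLACEHOLDER_YEAR} {end}", FMT)
    return round((t1 - t0).total_seconds() * 1000)


requested = "Oct 27 09:25:30.687"   # controld requests the monitor
timed_out = "Oct 27 09:27:30.687"   # controld reports the timeout
print(elapsed_ms(requested, timed_out))  # prints 120000
```

The result equals the 120000ms window quoted in the error message, so the timeout fired exactly one window after the request and no result arrived in between.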
After this, the scheduler starts doing recovery / moving the promoted instance.
Actual result
- Execd says "I cancelled / merged that recurring monitor."
- Controld never gets a result.
- Controld times out the monitor and aborts the transition.
- This cascades into resource demote/stop/move, which in our case interacted badly with DB2 HADR takeover.
Expected result
One of the following should happen:
- If execd decides to cancel/merge the recurring monitor, controld should be told explicitly so it does not wait for the result; or
- Execd should not merge into an op that has just been cancelled / is being torn down; the new monitor request should become an actual runnable recurring op; or
- Scheduler/controld should re-issue the monitor after the cancel so that controller and executor stay in sync.
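The first option (explicit notification on merge) can be sketched with a toy model. This is illustrative only and assumes nothing about Pacemaker's real IPC: ToyExecutor, ToyController, and the "merged" status string are hypothetical.

```python
class ToyExecutor:
    """Hypothetical executor that always answers, even when it merges."""

    def __init__(self, recurring=()):
        self.recurring = set(recurring)

    def execute(self, op_key, reply):
        if op_key in self.recurring:
            # Proposed behaviour: tell the requester about the merge
            # instead of silently dropping the new request.
            reply(op_key, "merged")
            return
        self.recurring.add(op_key)
        reply(op_key, "scheduled")


class ToyController:
    """Hypothetical controller that clears its in-flight entry on any reply."""

    def __init__(self):
        self.pending = set()
        self.replies = []

    def request(self, executor, op_key):
        self.pending.add(op_key)
        executor.execute(op_key, self.on_reply)

    def on_reply(self, op_key, status):
        self.pending.discard(op_key)
        self.replies.append((op_key, status))


op_key = "db2_svtdbm_svtdbm_HADRDB1_monitor_9000"
executor = ToyExecutor(recurring=[op_key])  # an entry for this key already exists
controller = ToyController()
controller.request(executor, op_key)
print(controller.pending, controller.replies)
```

With an explicit reply on the merge path, the controller's pending set is empty after the request, so there is nothing left to time out. Whether a merge should be reported as a distinct status or as an ordinary acknowledgement is a design question this sketch does not settle.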
Suspected code areas
- daemons/execd/execd_commands.c - merge_recurring_duplicate(...)
- lib/services/services.c - services_action_cancel() / cancel_recurring_action()
- daemons/controld - controller-side LRM handling, to make sure a cancelled/merged recurring op does not stay in the "expected" list.