RHEL-126940

Recurring monitor is immediately cancelled/merged by execd, but controld still expects a result and times out → transition abort


    • Type: Bug
    • Resolution: Unresolved
    • Priority: Normal
    • Affects Version: rhel-9.6
    • Component: pacemaker
    • Assigned Team: rhel-ha-pacemaker
    • Architecture: x86_64

      Environment

      • Pacemaker 2.1.9 (built from source)
      • 2-node cluster: pcmk7-p (was primary) and pcmk7-s (standby)
      • DB2 HADR resource: db2_svtdbm_svtdbm_HADRDB1 with recurring monitor ..._monitor_9000

      Description
      We hit a situation where pacemaker-controld scheduled a recurring monitor for a promoted DB2/HADR resource, but pacemaker-execd immediately treated that monitor as a duplicate and cancelled/merged it. Afterwards, pacemaker-controld never received a result for the monitor and, after the action timeout + cluster-delay, reported the monitor as timed out and aborted the transition.

      This looks like a race between controld's expectation ("I started a recurring op, I will get a result") and execd's logic to cancel/merge a recurring op that it considers a duplicate.
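      To make the suspected interleaving concrete, here is a minimal standalone C sketch; it is not Pacemaker source, and all types and function names in it are hypothetical simplifications. It only illustrates why a silently merged op leaves controld waiting for a result that never arrives:

      #include <stdbool.h>
      #include <stdio.h>

      /* Hypothetical, simplified state for each daemon. */
      struct pending_op { char key[64]; bool result_seen; };

      static struct pending_op in_flight;            /* controld: op awaiting a result */
      static bool old_recurring_entry_exists = true; /* execd: previous monitor entry */

      /* controld: "Requesting local execution of monitor operation ..." */
      static void controld_start_monitor(const char *op_key) {
          snprintf(in_flight.key, sizeof(in_flight.key), "%s", op_key);
          in_flight.result_seen = false;   /* from now on, a result is expected */
      }

      /* execd: treats the request as a duplicate, merges it into the existing
       * entry, and returns without scheduling anything that would ever
       * produce a result for this particular request. */
      static bool execd_schedule_monitor(const char *op_key) {
          if (old_recurring_entry_exists) {
              printf("execd: Duplicate recurring op entry detected (%s), merging\n", op_key);
              return false;   /* merged: no completion event for the caller */
          }
          return true;
      }

      int main(void) {
          const char *key = "db2_svtdbm_svtdbm_HADRDB1_monitor_9000";
          controld_start_monitor(key);
          if (!execd_schedule_monitor(key) && !in_flight.result_seen) {
              /* Nothing ever delivers a result, so after action timeout plus
               * cluster-delay controld declares a timeout and aborts. */
              printf("controld: no result for %s -> timed out\n", in_flight.key);
          }
          return 0;
      }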

      Relevant log lines

      From pcmk7-p:

      Oct 27 09:25:30.687 pacemaker-controld ... notice: Requesting local execution of monitor operation for db2_svtdbm_svtdbm_HADRDB1 on pcmk7-p | ... op_key=db2_svtdbm_svtdbm_HADRDB1_monitor_9000
      Oct 27 09:25:30.687 pacemaker-execd ... info: Cancelling ocf operation db2_svtdbm_svtdbm_HADRDB1_monitor_9000
      Oct 27 09:25:30.687 pacemaker-execd ... warning: Duplicate recurring op entry detected (db2_svtdbm_svtdbm_HADRDB1_monitor_9000), merging with previous op entry

      About 2 minutes later, controld times out:

      Oct 27 09:27:30.687 pacemaker-controld ... error: Node pcmk7-p did not send monitor result (via controller) within 120000ms (action timeout plus cluster-delay)
      Oct 27 09:27:30.687 pacemaker-controld ... error: [Action    2]: In-flight resource op db2_svtdbm_svtdbm_HADRDB1_monitor_9000 on pcmk7-p
      Oct 27 09:27:30.687 pacemaker-controld ... warning: rsc_op 2: db2_svtdbm_svtdbm_HADRDB1_monitor_9000 on pcmk7-p timed out 
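      For reference, the 120000ms figure matches the controller's wait of action timeout plus cluster-delay: assuming a 60-second monitor timeout and Pacemaker's default cluster-delay of 60s, 60000ms + 60000ms = 120000ms. (The configured monitor timeout is not visible in this excerpt, so the 60s value is an assumption.)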

      After this, the scheduler starts recovery and moves the promoted instance.

      Actual result

      • Execd says "I cancelled / merged that recurring monitor."
      • Controld never gets a result.
      • Controld times out the monitor and aborts the transition.
      • This cascades into resource demote/stop/move, which in our case interacted badly with DB2 HADR takeover.

      Expected result
      One of the following should happen:

      1. If execd decides to cancel/merge the recurring monitor, controld should be told explicitly so it does not wait for the result (a sketch of this option follows the list); or
      2. Execd should not merge into an op that has just been cancelled / is being torn down; the new monitor request should become an actual runnable recurring op; or
      3. Scheduler/controld should re-issue the monitor after the cancel so that controller and executor stay in sync.
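      As a sketch of option 1 (not a real patch; notify_controller() and op_reply are invented stand-ins for the actual execd IPC reply), the merge path could report back explicitly instead of staying silent:

      #include <stdbool.h>
      #include <stdio.h>

      /* Invented stand-ins for the reply execd would send to the requester. */
      enum op_status { OP_SCHEDULED, OP_MERGED };
      struct op_reply { const char *op_key; enum op_status status; };

      static void notify_controller(struct op_reply reply) {
          printf("execd -> controld: %s %s\n", reply.op_key,
                 (reply.status == OP_MERGED)
                 ? "merged into existing entry (stop waiting for a result)"
                 : "scheduled");
      }

      static void schedule_recurring(const char *op_key, bool duplicate_exists) {
          if (duplicate_exists) {
              /* Merge as today, but say so, so that controld can drop the op
               * from its in-flight list instead of waiting out the timeout. */
              notify_controller((struct op_reply){ op_key, OP_MERGED });
              return;
          }
          notify_controller((struct op_reply){ op_key, OP_SCHEDULED });
      }

      int main(void) {
          schedule_recurring("db2_svtdbm_svtdbm_HADRDB1_monitor_9000", true);
          return 0;
      }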

      Suspected code areas

      • daemons/execd/execd_commands.c - merge_recurring_duplicate(...)
      • lib/services/services.c - services_action_cancel() / cancel_recurring_action()
      • Controller-side LRM handling in daemons/controld, to make sure a cancelled/merged recurring op does not stay in the "expected" list (see the sketch after this list).
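      A minimal sketch of that controller-side idea, assuming a hypothetical pending-op table (controld's real bookkeeping lives elsewhere and uses different structures):

      #include <stdio.h>
      #include <string.h>

      #define MAX_PENDING 8

      /* Hypothetical in-flight table; controld's real bookkeeping differs. */
      static const char *pending_ops[MAX_PENDING];
      static int n_pending;

      static void add_pending(const char *op_key) {
          if (n_pending < MAX_PENDING) {
              pending_ops[n_pending++] = op_key;
          }
      }

      /* To be called on any execd confirmation that an op will never produce
       * a result (cancelled, or merged into an existing recurring entry), so
       * the op cannot later be reported as timed out. */
      static void drop_pending(const char *op_key) {
          for (int i = 0; i < n_pending; i++) {
              if (strcmp(pending_ops[i], op_key) == 0) {
                  pending_ops[i] = pending_ops[--n_pending];
                  printf("controld: %s cancelled/merged, no longer waiting\n", op_key);
                  return;
              }
          }
      }

      int main(void) {
          add_pending("db2_svtdbm_svtdbm_HADRDB1_monitor_9000");
          drop_pending("db2_svtdbm_svtdbm_HADRDB1_monitor_9000"); /* timeout timer never fires */
          printf("%d op(s) still in flight\n", n_pending);
          return 0;
      }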

      Attachments

        1. primary1 pacemaker.log
          17.36 MB
        2. primary journal.out
          141.22 MB
        3. primary messages
          19.10 MB
        4. secondary journal.out
          83.01 MB
        5. secondary messages
          13.34 MB
        6. secondary pacemaker.log
          8.56 MB

      Assignee: Christopher Lumens (rhn-support-clumens)
      Reporter: Mehrdad Mehraban (mehrdadid)
      QA Contact: Cluster QE
      Votes: 0
      Watchers: 4