RHEL-56721

Running crm_resource --refresh during monitor execution results in monitor failure.

    • Bug
    • Resolution: Unresolved
    • Normal
    • None
    • rhel-9.2.0
    • pacemaker
    • None
    • No
    • Low
    • rhel-sst-high-availability
    • ssg_filesystems_storage_and_HA
    • 5
    • False

      What were you trying to do that didn't work?

      I was periodically running crm_resource --refresh --resource <resource_name> and noticed that, after one of the refreshes, the resource failed over to another host in the cluster. The Pacemaker logs on the affected node show the following:
      Aug 29 13:42:35.400 R9GHADR-srv-2 pacemaker-execd [2447551] (cancel_recurring_action) info: Cancelling ocf operation db2_gerry_gerry_TESTDB_monitor_9000
      Aug 29 13:42:35.400 R9GHADR-srv-2 pacemaker-execd [2447551] (services_action_cancel) info: Terminating in-flight op db2_gerry_gerry_TESTDB_monitor_9000[713660] early because it was cancelled
      Aug 29 13:42:35.401 R9GHADR-srv-2 pacemaker-execd [2447551] (async_action_complete) info: db2_gerry_gerry_TESTDB_monitor_9000[713660] terminated with signal 9 (Killed)
      Aug 29 13:42:35.401 R9GHADR-srv-2 pacemaker-execd [2447551] (cancel_recurring_action) info: Cancelling ocf operation db2_gerry_gerry_TESTDB_monitor_9000
      <...>
      Aug 29 13:42:35.403 R9GHADR-srv-2 pacemaker-attrd [2447552] (update_attr_on_host) notice: Setting last-failure-db2_gerry_gerry_TESTDB#monitor_9000[R9GHADR-srv-2] in instance_attributes: (unset) -> 1724964155 | from R9GHADR-srv-2 with no write delay
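To check whether a recorded failure follows the same pattern as the excerpt above, the relevant messages for one resource can be pulled out of the log together. This is a minimal sketch, not an official tool: the function name is hypothetical, and the default log path /var/log/pacemaker/pacemaker.log is an assumption that may not match a given cluster's logging configuration.

```shell
#!/bin/sh
# Hypothetical helper: print, in log order, the execd messages about an
# in-flight operation being killed and the attrd last-failure updates
# for a single resource.
#   $1 = resource name
#   $2 = log file (default path is an assumption; adjust as needed)
show_cancelled_monitor_failures() {
    resource=$1
    logfile=${2:-/var/log/pacemaker/pacemaker.log}
    # First narrow to lines mentioning the resource, then keep only the
    # three message types seen in the excerpt.
    grep -F "$resource" "$logfile" |
        grep -E 'Terminating in-flight op|terminated with signal|Setting last-failure-'
}
```

For example, show_cancelled_monitor_failures db2_gerry_gerry_TESTDB would print the "Terminating in-flight op", "terminated with signal 9" and "Setting last-failure-" lines if they are present in the log.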
       

      Please provide the package NVR for which bug is seen:

      How reproducible: Very easy.

      Steps to reproduce

      1. Make the monitor take a long time, e.g. add sleep 10 to the resource agent's monitor action.
      2. In my case the migration-threshold for the resource is set to 1, which makes the failure very obvious because a single failure results in takeover.
      3. Continuously issue crm_resource --refresh --resource <resource_name> until a failure is observed.
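Step 3 can be automated with a small loop. The sketch below is one possible way to do it and makes several assumptions: the function names are hypothetical, the log path is a guess at the default location, the retry cap and interval are arbitrary, and a failure is detected by counting the attrd "Setting last-failure-" messages shown in the log excerpt rather than by querying the cluster.

```shell
#!/bin/sh
# Assumed log location; override by exporting LOGFILE before sourcing.
LOGFILE=${LOGFILE:-/var/log/pacemaker/pacemaker.log}

# Count last-failure updates for the resource's monitor in the log.
# Prints 0 if the log has no matches or cannot be read.
count_monitor_failures() {
    n=$(grep -c -F "Setting last-failure-$1#monitor" "$LOGFILE" 2>/dev/null) || true
    echo "${n:-0}"
}

# Refresh the resource repeatedly until a new monitor failure is logged.
#   $1 = resource name
#   $2 = maximum number of refreshes (arbitrary default)
#   $3 = seconds to wait after each refresh (arbitrary default)
refresh_until_failure() {
    resource=$1
    max_tries=${2:-100}
    interval=${3:-1}
    before=$(count_monitor_failures "$resource")
    i=0
    while [ "$i" -lt "$max_tries" ]; do
        i=$((i + 1))
        crm_resource --refresh --resource "$resource"
        sleep "$interval"
        if [ "$(count_monitor_failures "$resource")" -gt "$before" ]; then
            echo "monitor failure recorded after $i refreshes"
            return 0
        fi
    done
    echo "no failure after $max_tries refreshes"
    return 1
}
```

Calling refresh_until_failure <resource_name> on the node running the resource would then stop at the first refresh that coincides with an in-flight monitor. Counting log lines is used here because a later refresh may clear the recorded failure from the cluster state, while the log entry remains.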

      Expected results

      Pacemaker should rerun the cancelled monitor without failing over the resource to another host.

      Actual results

      Pacemaker kills the running monitor and counts the killed operation as a monitor failure, which is problematic when migration-threshold is set to 1 because that single failure causes the resource to fail over.

              kgaillot@redhat.com Kenneth Gaillot
              gerrysommerville Gerry Sommerville
              Kenneth Gaillot
              Cluster QE