RHEL-59527

Pacemaker ignores monitor failure

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Minor
    • None
    • rhel-9.4
    • pacemaker
    • None
    • No
    • Low
    • rhel-sst-high-availability
    • ssg_filesystems_storage_and_HA
    • 1
    • False
    • None
    • None
    • None
    • None
    • x86_64
    • None

      What were you trying to do that didn't work?

      A monitor failure should have caused Pacemaker to restart the resource, but Pacemaker ignored the failure until a resource refresh was run manually.

       

      What is the impact of this issue to you?

      Automated resource recovery does not happen correctly: the failed resource is not restarted without manual intervention.

       

      Please provide the package NVR for which the bug is seen:

      How reproducible is this bug?:

      Yes

      Steps to reproduce

      1. Set up two resources, each with its own monitor script, and add a sleep (300 seconds here) to one action of one of the resources.
      2. Intentionally fail both resources: first fail the resource whose action has the 300-second sleep, then kill the second resource.
      3. Observe that the monitor of the first resource keeps reporting a failure, but Pacemaker ignores it until I manually run crm_resource --refresh on that resource.
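
      The manual workaround in step 3 can be sketched as a small helper that builds the crm_resource call. This is only an illustration: the helper names and the dry-run switch are hypothetical, and the resource ID used below is an assumption taken from the db2hadr(...) prefix in the log excerpt. The --refresh, --resource and --node options are the standard crm_resource options.

```python
import shlex
import subprocess


def refresh_command(resource, node=None):
    """Build the crm_resource argument list for a manual refresh.

    `resource` is the Pacemaker resource ID; `node` optionally limits
    the refresh to a single cluster node.
    """
    cmd = ["crm_resource", "--refresh", "--resource", resource]
    if node is not None:
        cmd += ["--node", node]
    return cmd


def refresh(resource, node=None, dry_run=True):
    """Return the command line as a string; run it unless dry_run is set."""
    cmd = refresh_command(resource, node)
    if not dry_run:
        # Must be run on a cluster node with sufficient privileges.
        subprocess.run(cmd, check=True)
    return shlex.join(cmd)


if __name__ == "__main__":
    # Assumed resource ID, copied from the log prefix db2hadr(...).
    print(refresh("db2_regress1_regress1_HARA"))
```

      With dry_run left at its default, the helper only prints the command, so it can be checked before anything is run on a cluster node.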

      ```
      Sep 16 15:00:11  db2hadr(db2_regress1_regress1_HARA)[110540]:    INFO: demote: 1063: regress1: 0: HARA: db2hadr_demote() sleep exit.
      Sep 16 15:00:11  db2hadr(db2_regress1_regress1_HARA)[110540]:    ERROR: demote: 494: No db2sysc process detected in ps output
      Sep 16 15:00:11  db2hadr(db2_regress1_regress1_HARA)[110540]:    ERROR: demote: 737: regress1: 0: HARA: Instance is not up. db2hadr_instance_monitor() failed with rc=7, db2hadr_monitor() exit with rc=7.
      ```

      The lines above mark the beginning of the failure. From then on the monitor reports a failure every 10 seconds, yet Pacemaker ignores it: the initial failure was reported 300 seconds earlier and was ignored because Pacemaker was busy running the db2hadr_promote() action.
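
      To check how often the monitor reported a failure in the attached logs, the db2hadr lines can be filtered with a small parser. The sketch below is not part of the resource agent. The regular expression is an assumption based only on the three lines quoted above, the field names are my own labels, and the year is assumed from the attachment dates because the log lines carry none. A return code of 7 is the OCF "not running" code.

```python
import re
from datetime import datetime

# Layout inferred from the quoted log lines; an optional leading number
# (a viewer line number) is skipped if present.
LINE_RE = re.compile(
    r"^\s*(?:\d+\s+)?(?P<stamp>\w{3}\s+\d{1,2}\s+\d{2}:\d{2}:\d{2})\s+"
    r"(?P<agent>\w+)\((?P<resource>[^)]+)\)\[(?P<pid>\d+)\]:\s+"
    r"(?P<level>[A-Z]+):\s+(?P<action>\w+):\s+(?P<lineno>\d+):\s+"
    r"(?P<message>.*)$"
)
RC_RE = re.compile(r"db2hadr_monitor\(\) exit with rc=(\d+)")


def parse_line(line, year=2024):
    """Parse one db2hadr log line into a dict, or return None if it differs."""
    m = LINE_RE.match(line.rstrip())
    if m is None:
        return None
    rec = m.groupdict()
    # The log has no year, so one is supplied (assumed 2024 here).
    rec["time"] = datetime.strptime(f"{year} {rec.pop('stamp')}",
                                    "%Y %b %d %H:%M:%S")
    rc = RC_RE.search(rec["message"])
    rec["monitor_rc"] = int(rc.group(1)) if rc else None
    return rec


def monitor_failures(lines):
    """Return parsed records whose monitor return code is non-zero."""
    records = (parse_line(line) for line in lines)
    return [r for r in records if r and r["monitor_rc"] not in (None, 0)]


if __name__ == "__main__":
    sample = [
        "Sep 16 15:00:11  db2hadr(db2_regress1_regress1_HARA)[110540]:    "
        "ERROR: demote: 494: No db2sysc process detected in ps output",
        "Sep 16 15:00:11  db2hadr(db2_regress1_regress1_HARA)[110540]:    "
        "ERROR: demote: 737: regress1: 0: HARA: Instance is not up. "
        "db2hadr_instance_monitor() failed with rc=7, "
        "db2hadr_monitor() exit with rc=7.",
    ]
    for rec in monitor_failures(sample):
        print(rec["time"], rec["resource"], "rc =", rec["monitor_rc"])
```

      Run over a full log, the timestamps of the returned records show how long the monitor kept reporting rc=7 before the manual refresh.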

      Expected results

      Regardless of how long ago the failure was first reported, if the monitor script is still reporting a failure, Pacemaker should recover the resource.

      Actual results

      Even though the monitor script keeps reporting a failure, Pacemaker does not recover the resource.

       

      pcmk-Wed-18-Sep-2024.tar.bz2
      pcmk-Wed-18-Sep-2024-srv-2.tar.bz2

        1. pcmk-Thu-17-Oct-2024.tar.bz2 (246 kB, Dongho Han)
        2. pcmk-Wed-18-Sep-2024.tar.bz2 (7.25 MB, Dongho Han)
        3. pcmk-Wed-18-Sep-2024-srv-2.tar.bz2 (7.26 MB, Dongho Han)

              kgaillot@redhat.com Kenneth Gaillot
              donghohan@ibm.com Dongho Han
              Chris Feist, Gerry Sommerville, Lan Pham
              Kenneth Gaillot
              Cluster QE
              Votes: 0
              Watchers: 6