RHEL-56321

Unexpected resource move to a different node

    • Type: Bug
    • Resolution: Not a Bug
    • Priority: Normal
    • None
    • rhel-9.2.0
    • pacemaker
    • None
    • No
    • None
    • rhel-sst-high-availability
    • ssg_filesystems_storage_and_HA
    • 1
    • False
    • x86_64
    • None

      What were you trying to do that didn't work?  While running network-down testing on one host, a resource running on a different host was unexpectedly reassigned and started on a new host.

      Here is the sequence of events:

      • Cluster consists of 4 hosts (ps-1, ps-2, ps-3, ps-4) and a QDevice
      • At 2024-08-26-21.21.49, the ethernet interface on host ps-1 was taken down with ip link.  Node ps-1 was fenced and the member resource failed over to host ps-2.  Everything worked as expected.
      • At 2024-08-26-21.24.20, the ethernet interface on host ps-1 was brought back up with ip link.  Node ps-1 rejoined the cluster and resources restarted as expected.
      • At 2024-08-26-21.25.05, the scheduler unassigned resource 'db2_cfprimary_db2inst1' and then started it on a different host, ps-4.  The resource was running on node ps-3 at the time.  This was not expected.

       

      Aug 26 21:25:05.710 ps-3 pacemaker-schedulerd[10161] (pcmk__unassign_resource)  info: Unassigning db2_cfprimary_db2inst1

      Aug 26 21:25:05.716 ps-3 pacemaker-schedulerd[10161] (log_list_item)    notice: Actions: Start      db2_cfprimary_db2inst1                 (         ps-4 )

       

      • As a result, db2_cfprimary_db2inst1 was stopped on ps-3 and then restarted on ps-4.  This had the side effect of causing an error on another resource, db2_cf_db2inst1_128.

       

       

      • At 2024-08-26-21.25.17, the monitor for resource db2_cf_db2inst1_128 failed.  This was expected because the primary failover had occurred earlier.

       

      Aug 26 21:25:17.562 ps-3 pacemaker-controld  [10163] (log_executor_event)       notice: Result of monitor operation for db2_cf_db2inst1_128 on ps-3: not running | graph action unconfirmed; call=143 key=db2_cf_db2inst1_128_monitor_10000 rc=7

       

      • But the recovery action for the db2_cf_db2inst1_128 resource was delayed by 108 seconds.  This was not expected.

       

      Aug 26 21:27:05.047 ps-3 pacemaker-schedulerd[10161] (log_list_item)    notice: Actions: Recover    db2_cf_db2inst1_128                    (         ps-3 )
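
To line up the events above, the quoted log lines can be split into fields and filtered by daemon. The following is a minimal sketch, not an official Pacemaker tool: the regular expression is inferred only from the four lines quoted in this report, the names `LOG_LINE` and `parse_line` are hypothetical, and the year 2024 is taken from the report's timestamps because the log lines themselves omit it.

```python
import re
from datetime import datetime

# Field layout inferred from the lines quoted in this report (an assumption):
# "<Mon DD HH:MM:SS.mmm> <host> <daemon>[pid] (<function>) <level>: <message>"
LOG_LINE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+ \d\d:\d\d:\d\d\.\d+) "
    r"(?P<host>\S+) "
    r"(?P<daemon>[\w-]+)\s*\[(?P<pid>\d+)\] "
    r"\((?P<func>\w+)\)\s+"
    r"(?P<level>\w+): "
    r"(?P<msg>.*)$"
)

YEAR = "2024"  # the log lines carry no year; taken from the report text


def parse_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    m = LOG_LINE.match(line.strip())
    if m is None:
        return None
    rec = m.groupdict()
    rec["time"] = datetime.strptime(f"{YEAR} {rec['ts']}", "%Y %b %d %H:%M:%S.%f")
    return rec


SAMPLE = [
    "Aug 26 21:25:05.710 ps-3 pacemaker-schedulerd[10161] (pcmk__unassign_resource)  info: Unassigning db2_cfprimary_db2inst1",
    "Aug 26 21:25:05.716 ps-3 pacemaker-schedulerd[10161] (log_list_item)    notice: Actions: Start      db2_cfprimary_db2inst1                 (         ps-4 )",
    "Aug 26 21:25:17.562 ps-3 pacemaker-controld  [10163] (log_executor_event)       notice: Result of monitor operation for db2_cf_db2inst1_128 on ps-3: not running | graph action unconfirmed; call=143 key=db2_cf_db2inst1_128_monitor_10000 rc=7",
    "Aug 26 21:27:05.047 ps-3 pacemaker-schedulerd[10161] (log_list_item)    notice: Actions: Recover    db2_cf_db2inst1_128                    (         ps-3 )",
]

records = [r for r in map(parse_line, SAMPLE) if r is not None]

# Keep only the scheduler's "Actions:" summaries, i.e. what it decided to do.
scheduler_actions = [
    r for r in records
    if r["daemon"] == "pacemaker-schedulerd" and r["msg"].startswith("Actions:")
]

for r in scheduler_actions:
    action, resource = r["msg"].split()[1:3]
    print(r["ts"], action, resource)
```

Run against a full log file, the same filter would list every scheduler decision in order, which makes it easier to see which transition introduced the unexpected move.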

       

      Please provide the package NVR for which bug is seen: 

      Pacemaker 2.1.7-4.db2pcmk.el9.2

      How reproducible: Intermittent

      Steps to reproduce

      1. Set up a cluster with 4 nodes and QDevice
      2. Set up the Db2 pureScale resource model 
      3. Take down the ethernet interface on a member host

      Expected results: All resources recover and restart successfully in a timely fashion.

      Actual results: In this case, one resource, db2_cfprimary_db2inst1, moved to a different node unexpectedly, causing another resource to fail.  A second issue was that it took 108 seconds for the recovery action to be triggered after the monitor failure for the db2_cf_db2inst1_128 resource.
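
The delay figure can be checked directly from the two log timestamps quoted earlier (the monitor failure at 21:25:17.562 and the Recover action at 21:27:05.047). This is a small sketch of that arithmetic; the helper name `log_time` is hypothetical, and the year is assumed to be 2024 as in the report text.

```python
from datetime import datetime

YEAR = "2024"  # assumption: the log lines omit the year; the report gives 2024
FMT = "%Y %b %d %H:%M:%S.%f"


def log_time(stamp):
    """Parse a 'Mon DD HH:MM:SS.mmm' log timestamp into a datetime."""
    return datetime.strptime(f"{YEAR} {stamp}", FMT)


monitor_failed = log_time("Aug 26 21:25:17.562")     # monitor result rc=7
recover_scheduled = log_time("Aug 26 21:27:05.047")  # scheduler "Recover" action

delay = (recover_scheduled - monitor_failed).total_seconds()
print(f"Recovery scheduled {delay:.3f} s after the monitor failure")
```

The difference works out to about 107.5 seconds, consistent with the roughly 108 seconds reported.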

              kgaillot@redhat.com Kenneth Gaillot
              lpham@ca.ibm.com Lan Pham
              IBM Confidential Group
              Kenneth Gaillot
              Cluster QE