RHEL-78393

Cluster gets stuck when deleting a failed, disabled resource that has a constraint


    • Type: Bug
    • Resolution: Unresolved
    • Priority: Normal
    • rhel-10.0
    • pacemaker
    • No
    • rhel-ha
    • 8
    • False
    • False
    • x86_64

      Please provide the package NVR for which the bug is seen:

      pacemaker-3.0.0-5.el10.x86_64
      pcs-0.12.0-2.el10.x86_64

      How reproducible is this bug?:

      always, easily

      Steps to reproduce

      • Create a Dummy resource whose op_sleep (15 s) is longer than the monitor interval (10 s) and equal to the monitor timeout (15 s), so the monitor operation times out:
      [root@virt-246 ~]# pcs resource create dummy1 ocf:pacemaker:Dummy op_sleep=15 op monitor interval=10 timeout=15
      
      • Create a location constraint on that resource:
      [root@virt-246 ~]# pcs constraint location dummy1 prefers virt-245=INFINITY
      
      • Wait until the resource fails, then disable it:
      [root@virt-246 ~]# pcs resource disable dummy1
      
      [root@virt-246 ~]# pcs status --full
      Cluster name: STSRHTS2031
      Cluster Summary:
        * Stack: corosync (Pacemaker is running)
        * Current DC: virt-245 (1) (version 3.0.0-5.el10-5b53b7e) - partition with quorum
        * Last updated: Fri Feb  7 15:54:28 2025 on virt-246
        * Last change:  Fri Feb  7 15:54:09 2025 by root via root on virt-246
        * 2 nodes configured
        * 3 resource instances configured (1 DISABLED)
      
      Node List:
        * Node virt-245 (1): online, feature set 3.20.0
        * Node virt-246 (2): online, feature set 3.20.0
      
      Full List of Resources:
        * fence-virt-245	(stonith:fence_xvm):	 Started virt-245
        * fence-virt-246	(stonith:fence_xvm):	 Started virt-246
        * dummy1	(ocf:pacemaker:Dummy):	 FAILED virt-245 (disabled)
      
      Migration Summary:
        * Node: virt-245 (1):
          * dummy1: migration-threshold=1000000 fail-count=2 last-failure='Fri Feb  7 15:54:14 2025'
      
      Failed Resource Actions:
        * dummy1_monitor_10000 on virt-245 'Error occurred' (1): call=21, status='Timed out', exitreason='Resource agent did not complete within 15s', last-rc-change='Fri Feb  7 15:54:14 2025', queued=0ms, exec=14847ms
      
      Tickets:
      
      PCSD Status:
        virt-245: Online
        virt-246: Online
      
      Daemon Status:
        corosync: active/enabled
        pacemaker: active/enabled
        pcsd: active/enabled
      
      • Delete the resource:
      [root@virt-246 ~]# pcs resource delete dummy1
      Removing dependant element:
        Location constraint: 'location-dummy1-virt-245-INFINITY'
      Stopping resource 'dummy1' before deleting
      Waiting for the cluster to apply configuration changes...
      
      • While the delete command is waiting, check for pending actions from the other node:
      [root@virt-245 ~]# crm_resource --wait -T 1
      Pending actions:
      crm_resource: Error performing operation: Timeout occurred
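
The steps above can be replayed from a script. The following is a minimal sketch, not a verified reproducer: the resource name, node name, and option values are copied from the transcript, while the function names, the polling loop, and the `POLL_INTERVAL` variable are assumptions introduced here for illustration. It detects the failure by looking for the `dummy1_monitor_10000 on virt-245` entry that `pcs status --full` prints under "Failed Resource Actions".

```shell
#!/bin/sh
# Sketch: replay the reproduction steps from this report.
# RESOURCE and NODE come from the transcript; everything else
# (function names, polling, POLL_INTERVAL) is hypothetical.

RESOURCE=dummy1
NODE=virt-245

wait_for_monitor_failure() {
    # Poll the status output until the failed recurring monitor is listed.
    until pcs status --full | grep -q "${RESOURCE}_monitor_10000 on ${NODE}"; do
        sleep "${POLL_INTERVAL:-5}"
    done
}

reproduce() {
    # Each step runs only if the previous one succeeded.
    pcs resource create "$RESOURCE" ocf:pacemaker:Dummy op_sleep=15 \
        op monitor interval=10 timeout=15 &&
    pcs constraint location "$RESOURCE" prefers "$NODE"=INFINITY &&
    wait_for_monitor_failure &&
    pcs resource disable "$RESOURCE" &&
    pcs resource delete "$RESOURCE"   # the step that hangs in this report
}
```

Calling `reproduce` on a cluster node runs the sequence; on an affected build the final `pcs resource delete` is expected to stay at "Waiting for the cluster to apply configuration changes...".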
      

      Expected results

      The resource is deleted.

      Actual results

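      The resource is not deleted. `pcs resource delete` stays at "Waiting for the cluster to apply configuration changes..." and `crm_resource --wait -T 1` reports "Timeout occurred".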
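
The stuck state can also be checked from a script with the same `crm_resource --wait -T 1` call used in the steps. This is a small sketch under one assumption: that `crm_resource` exits with a non-zero status when the wait times out, which the "Error performing operation: Timeout occurred" message suggests but this report does not show directly. The function name `cluster_settled` is hypothetical.

```shell
#!/bin/sh
# Sketch: report whether the cluster has settled within a given timeout.
# Assumes crm_resource exits non-zero when the wait times out.

cluster_settled() {
    # $1: value passed to -T (defaults to 1, as in the report).
    crm_resource --wait -T "${1:-1}" >/dev/null 2>&1
}
```

For example, `cluster_settled 1 || echo "cluster still has pending actions"` prints the message only if the wait did not complete.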

      Additional info:

      If I run `pcs resource refresh` on the disabled resource before deleting it, the resource is deleted after a few seconds and the cluster does not get stuck.
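
The workaround described above amounts to running the refresh before the delete. A minimal sketch, assuming both commands are run on a cluster node as root; the helper name `delete_with_refresh` is hypothetical and not part of pcs:

```shell
#!/bin/sh
# Workaround sketch: refresh the failed resource first, then delete it.
# delete_with_refresh is a hypothetical helper, not a pcs command.

delete_with_refresh() {
    # $1: resource ID, e.g. dummy1
    pcs resource refresh "$1" &&
    pcs resource delete "$1"   # only attempted if the refresh succeeded
}
```

With the resource from this report, the call would be `delete_with_refresh dummy1`.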

        1. pacemaker.log
          114 kB
        2. crm_resource.log
          190 kB
        3. cib.xml
          12 kB

              rhn-support-clumens Christopher Lumens
              rhn-support-msmazova Marketa Smazova
              Christopher Lumens Christopher Lumens
              Cluster QE Cluster QE