RHEL-13216

Revert broken attempt to fix "cluster got stuck while stopping" [rhel-9]

    • Bug
    • Resolution: Done-Errata
    • Critical
    • rhel-9.4
    • rhel-9.2.0, rhel-9.3.0
    • pacemaker
    • pacemaker-2.1.7-4.el9
    • Yes
    • Important
    • ZStream, Regression
    • sst_high_availability
    • ssg_filesystems_storage_and_HA
    • 22
    • 26
    • 8
    • QE ack, Dev ack
    • False
    • None
    • None
    • Red Hat Enterprise Linux
    • None
    • Approved Blocker
    • All
    • All
    • 2.1.7
    • None

      What were you trying to do that didn't work?

      I tried to remove the stonith devices and stop the cluster, so I could set up sbd.

      Please provide the package NVR for which bug is seen:

      since pacemaker-2.1.6-7.el9.x86_64

      How reproducible:

      Sometimes, 50% chance

      Steps to reproduce

      1.  Set up a two-node cluster.
      2.  Check which node is the DC.
      3.  On the DC node, remove the stonith devices and stop the cluster:
        pcs stonith delete fence-virt-252; pcs stonith delete fence-virt-253; pcs cluster stop --all
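The steps above could be scripted roughly as follows. This is only a sketch: `current_dc` and `reproduce` are hypothetical helper names, the parsing assumes the `Current DC:` line has the format shown in the `pcs status` output in this report, and the fence device names are the ones from this particular cluster.

```shell
#!/bin/sh
# Hypothetical helper: print the DC node name from `pcs status` output on stdin.
# Assumes a line like "  * Current DC: virt-253 (2) (version ...) - ...".
current_dc() {
    sed -n 's/^.*Current DC: \([^ ]*\).*$/\1/p' | head -n 1
}

# Hypothetical helper: run the reproduction commands, but only on the DC node.
reproduce() {
    dc=$(pcs status --full | current_dc)
    # crm_node -n prints the local node name as Pacemaker knows it.
    if [ "$dc" != "$(crm_node -n)" ]; then
        echo "Not the DC; run this on $dc" >&2
        return 1
    fi
    # Same commands as in step 3, run one after another regardless of failures.
    pcs stonith delete fence-virt-252
    pcs stonith delete fence-virt-253
    pcs cluster stop --all
}
```

Calling `reproduce` on each node in turn would then only act on the DC. Since the problem shows up only about half the time, several runs may be needed.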

      Expected results

      Stonith devices are deleted, cluster stops.

      Actual results

      Cluster is stuck while stopping:

      [root@virt-253 ~]# pcs stonith delete fence-virt-252; pcs stonith delete fence-virt-253; pcs cluster stop --all
      Attempting to stop: fence-virt-252... Stopped
      Attempting to stop: fence-virt-253... Stopped
      virt-252: Stopping Cluster (pacemaker)...
      
      [root@virt-253 ~]# pcs status --full
      Cluster name: STSRHTS14392
      
      WARNINGS:
      No stonith devices and stonith-enabled is not false
      
      Cluster Summary:
        * Stack: corosync (Pacemaker daemons are shutting down)
        * Current DC: virt-253 (2) (version 2.1.6-9.el9-6fdc9deea29) - MIXED-VERSION partition with quorum
        * Last updated: Fri Oct 13 13:16:22 2023 on virt-253
        * Last change:  Fri Oct 13 13:15:18 2023 by root via cibadmin on virt-252
        * 2 nodes configured
        * 0 resource instances configured
      
      Node List:
        * Node virt-252 (1): pending, feature set <3.15.1
        * Node virt-253 (2): online, feature set 3.17.4
      
      Full List of Resources:
        * No resources
      
      Migration Summary:
      
      Tickets:
      
      PCSD Status:
        virt-252: Online
        virt-253: Online
      
      Daemon Status:
        corosync: active/enabled
        pacemaker: inactive/enabled
        pcsd: active/enabled
      
      

      After being stuck for 15 minutes (the `cluster-recheck-interval`, I assume), the cluster finally stops.

      I created a crm_report from the incident and attached it. The cluster got stuck on the stop action around Oct 13 13:15.
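To spot the stuck state while it lasts, the status output could be checked for the two symptoms shown above. This is a minimal sketch: `shutdown_stuck` is a hypothetical helper name, and it assumes the status text looks exactly like the `pcs status --full` output in this report.

```shell
#!/bin/sh
# Hypothetical helper: succeed if `pcs status --full` output on stdin shows
# the symptom from this report, i.e. the Pacemaker daemons are shutting down
# while at least one node is still listed as pending.
shutdown_stuck() {
    awk '
        /Pacemaker daemons are shutting down/ { stopping = 1 }
        /^ *\* Node .*: pending/              { pending = 1 }
        END { exit !(stopping && pending) }
    '
}
```

For example, `pcs status --full | shutdown_stuck && echo "stop appears stuck"`. To check the assumption about the interval, `crm_attribute --type crm_config --name cluster-recheck-interval --query` should show the configured value; if the property is unset, the query may report an error, and Pacemaker's default of 15 minutes applies.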

      cluster-froze-when-stop.tar.bz2

            Marketa Smazova (rhn-support-msmazova)
            Kenneth Gaillot