RHEL-86147: Avoid "shutdown" node attribute persisting after shutdown [rhel-9]


    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major
    • rhel-9.6
    • pacemaker
    • Moderate
    • rhel-ha
    • 8
    • Dev ack
    • Yes
    • Red Hat Enterprise Linux
    • Approved Blocker
    • Bug Fix
      .Nodes no longer unexpectedly leave the cluster after rejoining

      Before this update, when a node left a cluster, the cleanup of its transient attributes was handled by two separate components. As a consequence, a node's shutdown attribute might not have been cleared before the node attempted to rejoin the cluster, causing the node to immediately leave again.

      With this release, the responsibility for clearing all transient node attributes has been consolidated into a single component.

      As a result, these timing issues are no longer possible, and nodes can rejoin the cluster without being immediately removed due to stale `shutdown` attributes.
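      On builds without the fix, a stale transient `shutdown` attribute can be inspected and cleared manually before the node rejoins; a minimal sketch using standard pacemaker tools, assuming `virt-252` is the node that keeps leaving:

      # Query the transient `shutdown` node attribute on the affected node
      attrd_updater --query --name shutdown --node virt-252
      # Delete the stale value from the transient (status) section of the CIB
      crm_attribute --type status --node virt-252 --name shutdown --delete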
    • Proposed
    • Done
    • Done
    • Not Required
    • All
    • All
    • 2.1.7

      What were you trying to do that didn't work?

      I tried to remove the stonith devices and stop the cluster so that I could set up sbd.
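
      For context, enabling sbd typically follows a flow like the one below; a minimal sketch, assuming a watchdog-only setup (illustrative only, not part of the reproducer; add `--device=<path>` for disk-based SBD):

      # Stop the cluster first; a cluster restart is needed for SBD to take effect
      pcs cluster stop --all
      # Enable watchdog-only SBD on all nodes
      pcs stonith sbd enable
      # Start the cluster again so the new fencing configuration is used
      pcs cluster start --all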

      Please provide the package NVR for which the bug is seen:

      Since pacemaker-2.1.6-7.el9.x86_64

      How reproducible:

      Sometimes, 50% chance

      Steps to reproduce

      1.  Set up a two-node cluster.
      2.  Check which node is the DC (see the note after this list).
      3.  On the DC node, remove the stonith devices and stop the cluster:

        pcs stonith delete fence-virt-252; pcs stonith delete fence-virt-253; pcs cluster stop --all
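
      For step 2, the DC can be identified from the cluster status; a minimal sketch (the grep pattern is illustrative):

      # The "Current DC" line of the status output names the designated controller
      pcs status | grep "Current DC"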

      Expected results

      Stonith devices are deleted, cluster stops.

      Actual results

      Cluster is stuck while stopping:

      [root@virt-253 ~]# pcs stonith delete fence-virt-252; pcs stonith delete fence-virt-253; pcs cluster stop --all
      Attempting to stop: fence-virt-252... Stopped
      Attempting to stop: fence-virt-253... Stopped
      virt-252: Stopping Cluster (pacemaker)...
      
      [root@virt-253 ~]# pcs status --full
      Cluster name: STSRHTS14392
      
      WARNINGS:
      No stonith devices and stonith-enabled is not false
      
      Cluster Summary:
        * Stack: corosync (Pacemaker daemons are shutting down)
        * Current DC: virt-253 (2) (version 2.1.6-9.el9-6fdc9deea29) - MIXED-VERSION partition with quorum
        * Last updated: Fri Oct 13 13:16:22 2023 on virt-253
        * Last change:  Fri Oct 13 13:15:18 2023 by root via cibadmin on virt-252
        * 2 nodes configured
        * 0 resource instances configured
      
      Node List:
        * Node virt-252 (1): pending, feature set <3.15.1
        * Node virt-253 (2): online, feature set 3.17.4
      
      Full List of Resources:
        * No resources
      
      Migration Summary:
      
      Tickets:
      
      PCSD Status:
        virt-252: Online
        virt-253: Online
      
      Daemon Status:
        corosync: active/enabled
        pacemaker: inactive/enabled
        pcsd: active/enabled
      
      

      After the cluster has been stuck for 15 minutes (the `cluster-recheck-interval`, I assume), it finally stops.
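
      For reference, whether that property is set explicitly can be checked directly; a minimal sketch (the built-in default for `cluster-recheck-interval` is 15 minutes):

      # Show the configured value, if any; this errors out when the property is unset
      # and the 15-minute default is in effect
      crm_attribute --type crm_config --query --name cluster-recheck-interval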

      I created a crm_report from the incident and attached it. The cluster got stuck on the stop action around Oct 13 13:15.

      [^cluster-froze-when-stop.tar.bz2]
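
      A report like the attached one can be gathered with crm_report; a minimal sketch of an equivalent invocation (time window and output name are illustrative):

      # Collect logs and cluster configuration around the incident into a tarball
      crm_report --from "2023-10-13 13:00" --to "2023-10-13 13:30" cluster-froze-when-stop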

              rhn-support-clumens Christopher Lumens
              rhn-support-msmazova Marketa Smazova
              Christopher Lumens
              Jana Rehova
              Michal Stubna