What were you trying to do that didn't work?
I tried to remove the stonith devices and stop the cluster so that I could set up sbd.
Please provide the package NVR for which the bug is seen:
since pacemaker-2.1.6-7.el9.x86_64
How reproducible:
Intermittently, roughly a 50% chance
Steps to Reproduce:
- set up a two-node cluster
- check which node is the DC
- on the DC node, remove the stonith devices and stop the cluster:
pcs stonith delete fence-virt-252; pcs stonith delete fence-virt-253; pcs cluster stop --all
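The steps above can be sketched as the following commands (a reproduction sketch, not output from the affected cluster; the fence device names `fence-virt-252`/`fence-virt-253` are the ones used in this report and assume an already-running two-node cluster):

```shell
# Identify the current DC (Designated Controller); run on any cluster node.
pcs status | grep "Current DC"

# On the node reported as the DC, remove the stonith devices,
# then stop the whole cluster:
pcs stonith delete fence-virt-252
pcs stonith delete fence-virt-253
pcs cluster stop --all
```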
Expected results:
Stonith devices are deleted and the cluster stops.
Actual results:
The cluster gets stuck while stopping:
[root@virt-253 ~]# pcs stonith delete fence-virt-252; pcs stonith delete fence-virt-253; pcs cluster stop --all
Attempting to stop: fence-virt-252... Stopped
Attempting to stop: fence-virt-253... Stopped
virt-252: Stopping Cluster (pacemaker)...
[root@virt-253 ~]# pcs status --full
Cluster name: STSRHTS14392
WARNINGS:
No stonith devices and stonith-enabled is not false
Cluster Summary:
* Stack: corosync (Pacemaker daemons are shutting down)
* Current DC: virt-253 (2) (version 2.1.6-9.el9-6fdc9deea29) - MIXED-VERSION partition with quorum
* Last updated: Fri Oct 13 13:16:22 2023 on virt-253
* Last change: Fri Oct 13 13:15:18 2023 by root via cibadmin on virt-252
* 2 nodes configured
* 0 resource instances configured
Node List:
* Node virt-252 (1): pending, feature set <3.15.1
* Node virt-253 (2): online, feature set 3.17.4
Full List of Resources:
* No resources
Migration Summary:
Tickets:
PCSD Status:
virt-252: Online
virt-253: Online
Daemon Status:
corosync: active/enabled
pacemaker: inactive/enabled
pcsd: active/enabled
After about 15 minutes of being stuck (the `cluster-recheck-interval`, I assume), the cluster finally stops.
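The 15-minute delay matches Pacemaker's default `cluster-recheck-interval`. The property can be inspected and, for testing, lowered (a general pcs usage sketch, not commands run on the affected cluster; the 2-minute value is an arbitrary example):

```shell
# Show the current value of cluster-recheck-interval
# (Pacemaker's default is 15min if the property is unset).
pcs property show cluster-recheck-interval

# Lower it, e.g. for testing, so a stuck transition is re-evaluated sooner:
pcs property set cluster-recheck-interval=2min
```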
I created a crm_report from the incident and attached it. The cluster got stuck on the stop action around Oct 13 13:15.
[^cluster-froze-when-stop.tar.bz2]
Clones:
- RHEL-13216 Revert broken attempt to fix "cluster got stuck while stopping" [rhel-9] (Closed)