Task · Resolution: Done · rhel-ha · HA-PCMK Sprint #5
What were you trying to do that didn't work?
On Azure, we are validating the RHEL 10 OS for SAP workloads. We have set up a 2-node cluster with SBD as the stonith mechanism. The sbd systemd service logs the fatal error message shown below on both nodes.
We don't see any issue in SBD behavior during our testing, but we want to understand what this fatal internal error is about.
root@rh0dhdb00l025:~# systemctl status sbd
● sbd.service - Shared-storage based fencing daemon
Loaded: loaded (/usr/lib/systemd/system/sbd.service; enabled; preset: disabled)
Drop-In: /etc/systemd/system/sbd.service.d
└─sbd_delay_start.conf
Active: active (running) since Wed 2025-11-12 23:38:15 UTC; 21h ago
Invocation: 0bea9a513c454f56ab7309e3f64f6f5f
Docs: man:sbd(8)
Main PID: 7242 (sbd)
Tasks: 6 (limit: 1025784)
Memory: 19.6M (peak: 20.6M)
CPU: 1min 30.612s
CGroup: /system.slice/sbd.service
├─7242 "sbd: inquisitor"
├─7243 "sbd: watcher: /dev/disk/by-id/scsi-3600140568f22b8820e6462d8ed2d256e - slot: 1 - uuid: 3fc1f9d7-3af2-4592-8e45-c98897e67d51"
├─7244 "sbd: watcher: /dev/disk/by-id/scsi-36001405aed93b0201c940629159f2230 - slot: 1 - uuid: 606c9cf0-900b-4eb8-95d9-9a2e933a7250"
├─7245 "sbd: watcher: /dev/disk/by-id/scsi-3600140544c9ccfd0f134917b0d547ed6 - slot: 1 - uuid: 6003d801-8249-4361-acd1-ccd04cd51624"
├─7246 "sbd: watcher: Pacemaker"
└─7247 "sbd: watcher: Cluster"
Nov 12 23:38:14 rh0dhdb00l025 sbd[7244]: /dev/disk/by-id/scsi-36001405aed93b0201c940629159f2230: notice: servant_md: Monitoring slot 1 on disk /dev/disk/by-id/scsi-36001405aed93b0201c940629159f2230
Nov 12 23:38:14 rh0dhdb00l025 sbd[7245]: /dev/disk/by-id/scsi-3600140544c9ccfd0f134917b0d547ed6: notice: servant_md: Monitoring slot 1 on disk /dev/disk/by-id/scsi-3600140544c9ccfd0f134917b0d547ed6
Nov 12 23:38:14 rh0dhdb00l025 sbd[7247]: cluster: notice: servant_cluster: Monitoring corosync cluster health
Nov 12 23:38:14 rh0dhdb00l025 sbd[7247]: cluster: notice: verify_against_cmap_config: Corosync is in 2Node-mode
Nov 12 23:38:14 rh0dhdb00l025 sbd[7247]: cluster: error: log_assertion_as: pcmk_server_message_type: Triggered fatal assertion at servers.c:164 : (server > 0) && (server < PCMK_NELEM(server_info))
Nov 12 23:38:14 rh0dhdb00l025 sbd[7247]: cluster: notice: update_peer_state_iter: Node rh0dhdb00l025 state is now member | nodeid=1 previous=unknown source=crm_update_peer_proc
Nov 12 23:38:14 rh0dhdb00l025 sbd[7242]: notice: inquisitor_child: Servant cluster is healthy (age: 0)
Nov 12 23:38:15 rh0dhdb00l025 sbd[7242]: notice: watchdog_init: Using watchdog device '/dev/watchdog'
Nov 12 23:38:15 rh0dhdb00l025 systemd[1]: Started sbd.service - Shared-storage based fencing daemon.
Nov 12 23:38:19 rh0dhdb00l025 sbd[7242]: notice: inquisitor_child: Servant pcmk is healthy (age: 0)
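The assertion itself comes from pacemaker's servers.c: pcmk_server_message_type() bounds-checks an IPC server identifier against its server_info lookup table ((server > 0) && (server < PCMK_NELEM(server_info))), and the cluster watcher trips it once while connecting during startup. To confirm the assertion fires only at start and that the servants stay healthy afterwards, the journal can be filtered (a minimal check, assuming the unit is named sbd.service as shown above):
root@rh0dhdb00l025:~# journalctl -u sbd.service -b -p err --no-pager
root@rh0dhdb00l025:~# journalctl -u sbd.service -b --no-pager | grep -E 'inquisitor_child|servant'
The first command shows error-priority messages from the unit for the current boot (the assertion should appear exactly once, at service start); the second shows the ongoing health notices from the inquisitor and its servants.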
What is the impact of this issue to you?
We currently don't see any impact, but we don't know whether this fatal error message could cause problems in some edge case.
Please provide the package NVR for which the bug is seen:
root@rh0dhdb00l025:~# rpm -qa | grep -Ei "pacemaker|corosync|sbd|fence-agents-sbd"
corosynclib-3.1.9-1.el10_0.1.x86_64
pacemaker-schemas-3.0.0-5.1.el10_0.noarch
pacemaker-libs-3.0.0-5.1.el10_0.x86_64
pacemaker-cluster-libs-3.0.0-5.1.el10_0.x86_64
corosync-3.1.9-1.el10_0.1.x86_64
pacemaker-3.0.0-5.1.el10_0.x86_64
pacemaker-cli-3.0.0-5.1.el10_0.x86_64
sbd-1.5.2-1.el10.5.x86_64
fence-agents-sbd-4.16.0-5.el10_0.6.noarch
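To check whether a later pacemaker build already notes a fix for this assertion, the changelog of the installed packages can be scanned (a quick heuristic only; the split-off issue linked at the end of this report and its erratum are the authoritative record):
root@rh0dhdb00l025:~# rpm -q --changelog pacemaker-libs | head -n 30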
root@rh0dhdb00l025:~# more /etc/os-release
NAME="Red Hat Enterprise Linux"
VERSION="10.0 (Coughlan)"
ID="rhel"
ID_LIKE="centos fedora"
VERSION_ID="10.0"
PLATFORM_ID="platform:el10"
PRETTY_NAME="Red Hat Enterprise Linux 10.0 (Coughlan)"
ANSI_COLOR="0;31"
LOGO="fedora-logo-icon"
CPE_NAME="cpe:/o:redhat:enterprise_linux:10.0"
HOME_URL="https://www.redhat.com/"
VENDOR_NAME="Red Hat"
VENDOR_URL="https://www.redhat.com/"
DOCUMENTATION_URL="https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/10"
BUG_REPORT_URL="https://issues.redhat.com/"
REDHAT_BUGZILLA_PRODUCT="Red Hat Enterprise Linux 10"
REDHAT_BUGZILLA_PRODUCT_VERSION=10.0
REDHAT_SUPPORT_PRODUCT="Red Hat Enterprise Linux"
REDHAT_SUPPORT_PRODUCT_VERSION="10.0"
How reproducible is this bug?:
Every time
Steps to reproduce
- Attach the shared LUN(s) to both nodes of a two-node cluster
- Configure SBD (a verification sketch follows this list):
root@rh0dhdb00l025:~# more /etc/sysconfig/sbd | grep -v '#'
SBD_PACEMAKER=yes
SBD_STARTMODE=always
SBD_DELAY_START=186
SBD_WATCHDOG_DEV=/dev/watchdog
SBD_WATCHDOG_TIMEOUT=5
SBD_TIMEOUT_ACTION=flush,reboot
SBD_MOVE_TO_ROOT_CGROUP=auto
SBD_SYNC_RESOURCE_STARTUP=yes
SBD_OPTS=
SBD_DEVICE="/dev/disk/by-id/scsi-3600140568f22b8820e6462d8ed2d256e;/dev/disk/by-id/scsi-36001405aed93b0201c940629159f2230;/dev/disk/by-id/scsi-3600140544c9ccfd0f134917b0d547ed6"
- Set up the cluster
- Enable the SBD service: "systemctl enable sbd"
- Start the cluster; this also starts the SBD service
- Check the SBD service with "systemctl status sbd": the fatal error message appears
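Before starting the cluster, the watchdog and the on-disk SBD metadata can be verified per device (a minimal sketch using the first device from SBD_DEVICE above; repeat for the other two):
root@rh0dhdb00l025:~# sbd query-watchdog
root@rh0dhdb00l025:~# sbd -d /dev/disk/by-id/scsi-3600140568f22b8820e6462d8ed2d256e dump
root@rh0dhdb00l025:~# sbd -d /dev/disk/by-id/scsi-3600140568f22b8820e6462d8ed2d256e list
query-watchdog lists the watchdog devices sbd can use; dump prints the on-disk header (timeouts, slot count); list shows slot allocation and any pending messages per node.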
Expected results
The sbd service starts without logging a fatal assertion.
Actual results
The cluster watcher logs "Triggered fatal assertion at servers.c:164" on every start, on both nodes.
Split to: RHEL-128442 Fatal error message in SBD service