Bug
Resolution: Unresolved
Normal
rhel-8.10.z, rhel-9.2.0.z
Low
rhel-ha
What were you trying to do that didn't work?
Customer is operating their OpenStack "scale-out" HA instance as normal, but with cron jobs that regularly check "pcs status" output.
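For illustration, the kind of periodic check assumed here would be a cron entry along these lines (the schedule, output path, and exact command on the customer side are assumptions):

# Hypothetical cron entry: poll cluster status every minute and keep the last output
* * * * * /usr/sbin/pcs status > /var/tmp/pcs-status.last 2>&1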
What is the impact of this issue to you?
When the issue occurs, it blocks pacemaker-controld's "liveness" check and can result in a restart of pacemaker-controld:
Dec 2 08:16:40 controller-0 pacemakerd[698941]: notice: pacemaker-controld[896317] is unresponsive to ipc after 1 tries
Dec 2 08:16:47 controller-0 pacemakerd[698941]: notice: pacemaker-controld[896317] is unresponsive to ipc after 2 tries
Dec 2 08:16:54 controller-0 pacemakerd[698941]: notice: pacemaker-controld[896317] is unresponsive to ipc after 3 tries
Dec 2 08:17:01 controller-0 pacemakerd[698941]: notice: pacemaker-controld[896317] is unresponsive to ipc after 4 tries
Dec 2 08:17:08 controller-0 pacemakerd[698941]: error: pacemaker-controld[896317] is unresponsive to ipc after 5 tries but we found the pid so have it killed that we can restart
-----------------------------------------8<-----------------------------------------
Dec 2 08:17:08 controller-0 pacemakerd[698941]: warning: pacemaker-controld[896317] terminated with signal 9 (Killed)
Dec 2 08:17:08 controller-0 pacemakerd[698941]: notice: Respawning pacemaker-controld subdaemon after unexpected exit
The above "liveness" failures appear to be blocked behind "stonith history" checks ran in a synchronous fashion:
Dec 02 08:16:00.477 controller-1 pacemaker-fenced [498872] (handle_request@fenced_commands.c:3584) debug: Processed st_fence_history request from client crm_mon.210990: complete
Dec 02 08:16:00.477 controller-1 pacemaker-fenced [498872] (stonith_command@fenced_commands.c:3640) debug: Processing synchronous st_fence_history request 90875 from client crm_mon.210990
Dec 02 08:16:00.480 controller-1 pacemaker-fenced [498872] (handle_request@fenced_commands.c:3584) debug: Processed st_fence_history request from client crm_mon.210990: complete
Dec 02 08:16:00.480 controller-1 pacemaker-fenced [498872] (stonith_command@fenced_commands.c:3640) debug: Processing synchronous st_fence_history request 90883 from client crm_mon.210990
Dec 02 08:16:00.483 controller-1 pacemaker-fenced [498872] (handle_request@fenced_commands.c:3584) debug: Processed st_fence_history request from client crm_mon.210990: complete
Dec 02 08:16:00.483 controller-1 pacemaker-fenced [498872] (stonith_command@fenced_commands.c:3640) debug: Processing synchronous st_fence_history request 90874 from client crm_mon.210990
Dec 02 08:16:00.486 controller-1 pacemaker-fenced [498872] (handle_request@fenced_commands.c:3584) debug: Processed st_fence_history request from client crm_mon.210990: complete
Dec 02 08:16:00.486 controller-1 pacemaker-fenced [498872] (stonith_command@fenced_commands.c:3640) debug: Processing synchronous st_fence_history request 90876 from client crm_mon.210990
Dec 02 08:16:00.489 controller-1 pacemaker-fenced [498872] (handle_request@fenced_commands.c:3584) debug: Processed st_fence_history request from client crm_mon.210990: complete
Dec 02 08:16:00.489 controller-1 pacemaker-fenced [498872] (stonith_command@fenced_commands.c:3640) debug: Processing synchronous st_fence_history request 90882 from client crm_mon.210990
Dec 02 08:16:00.492 controller-1 pacemaker-fenced [498872] (handle_request@fenced_commands.c:3584) debug: Processed st_fence_history request from client crm_mon.210990: complete
Dec 02 08:16:00.492 controller-1 pacemaker-fenced [498872] (stonith_command@fenced_commands.c:3640) debug: Processing synchronous st_fence_history request 90878 from client crm_mon.210990
Dec 02 08:16:00.495 controller-1 pacemaker-fenced [498872] (handle_request@fenced_commands.c:3584) debug: Processed st_fence_history request from client crm_mon.210990: complete
Dec 02 08:16:00.495 controller-1 pacemaker-fenced [498872] (stonith_command@fenced_commands.c:3640) debug: Processing synchronous st_fence_history request 90879 from client crm_mon.210990
Dec 02 08:16:00.498 controller-1 pacemaker-fenced [498872] (handle_request@fenced_commands.c:3584) debug: Processed st_fence_history request from client crm_mon.210990: complete
Dec 02 08:16:00.498 controller-1 pacemaker-fenced [498872] (stonith_command@fenced_commands.c:3640) debug: Processing synchronous st_fence_history request 90880 from client crm_mon.210990
Dec 02 08:16:00.501 controller-1 pacemaker-fenced [498872] (handle_request@fenced_commands.c:3584) debug: Processed st_fence_history request from client crm_mon.210990: complete
Dec 02 08:16:00.501 controller-1 pacemaker-fenced [498872] (stonith_command@fenced_commands.c:3640) debug: Processing synchronous st_fence_history request 90881 from client crm_mon.210990
Dec 02 08:16:00.503 controller-1 pacemaker-fenced [498872] (handle_request@fenced_commands.c:3584) debug: Processed st_fence_history request from client crm_mon.210990: complete
Dec 02 08:16:00.504 controller-1 pacemaker-fenced [498872] (stonith_command@fenced_commands.c:3640) debug: Processing synchronous st_fence_history request 90877 from client crm_mon.210990
Dec 02 08:16:00.506 controller-1 pacemaker-fenced [498872] (handle_request@fenced_commands.c:3584) debug: Processed st_fence_history request from client crm_mon.210990: complete
Dec 02 08:16:00.506 controller-1 pacemaker-fenced [498872] (stonith_command@fenced_commands.c:3640) debug: Processing synchronous st_fence_history request 90884 from client crm_mon.210990
Dec 02 08:16:00.509 controller-1 pacemaker-fenced [498872] (handle_request@fenced_commands.c:3584) debug: Processed st_fence_history request from client crm_mon.210990: complete
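As a rough way to quantify how tightly packed these requests are, the fencer debug messages can be bucketed per second (log path assumed to be the default /var/log/pacemaker/pacemaker.log; adjust for the sosreport layout):

# Count synchronous st_fence_history requests per second
$ grep 'Processing synchronous st_fence_history' /var/log/pacemaker/pacemaker.log | cut -d. -f1 | sort | uniq -c | sort -rn | head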
The restart of pacemaker-controld following the above issue can additionally trigger recovery actions against any active resources on the node. These recovery actions can result in extended downtime for the resources if they are not successful.
Please provide the package NVR for which the bug is seen:
$ grep -h -e kernel-[0-9] -e pacemaker-[0-9] */installed-rpms
kernel-5.14.0-284.40.1.el9_2.x86_64    Fri Dec 15 00:14:28 2023
pacemaker-2.1.5-9.el9_2.3.x86_64       Fri Dec 15 02:16:53 2023
How reproducible is this bug?:
A form of the bug is always reproducible; it is simple enough to create the IPC liveness check failures. Although in my reproducer I am not able to create the `pacemaker-controld` failure the customer is consistently seeing, I do get another component failure and pacemaker daemons killed (`pacemaker-attrd` is mainly affected in my reproducer):
# pacemaker.log snippets:
Dec 04 08:44:29 rhel8-node1 pacemakerd [2570] (check_next_subdaemon@pcmkd_subdaemons.c:168) notice: pacemaker-attrd[2574] is unresponsive to ipc after 1 tries
Dec 04 08:47:25 rhel8-node1 pacemakerd [2570] (check_next_subdaemon@pcmkd_subdaemons.c:168) notice: pacemaker-attrd[8317] is unresponsive to ipc after 1 tries
-----------------------------------------8<-----------------------------------------
Dec 04 08:44:31 rhel8-node1 pacemakerd [2570] (pcmk_child_exit@pcmkd_subdaemons.c:282) error: pacemaker-attrd[2574] exited with status 102 (Not connected)
Dec 04 08:44:31 rhel8-node1 pacemakerd [2570] (pcmk__ipc_is_authentic_process_active@ipc_client.c:1617) info: Could not connect to attrd IPC: Connection refused
Dec 04 08:44:31 rhel8-node1 pacemakerd [2570] (pcmk_process_exit@pcmkd_subdaemons.c:321) notice: Not respawning pacemaker-attrd subdaemon until cluster returns
Dec 04 08:44:31 rhel8-node1 pacemakerd [2570] (cluster_connect_cfg@pcmkd_corosync.c:154) crit: Could not connect to Corosync CFG: CS_ERR_LIBRARY | rc=2
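A quick way to confirm which subdaemons were actually killed during the reproducer is a simple process listing on the node (any equivalent check works):

# List the pacemaker subdaemons still running on the node
$ ps -eo pid,cmd | grep '[p]acemaker-'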
Steps to reproduce
On a standard cluster, run the below loop to generate many simultaneous "pcs stonith history" reports (a simple 2-node cluster is sufficient):
# Generate 40 concurrent "pcs stonith history" checks every 3s, for 10 iterations (~30 seconds):
$ for i in $( seq 1 10 ) ; do for j in $( seq 1 40 ) ; do pcs stonith history & done ; sleep 3 ; done
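While the loop runs, the liveness-check failures can be watched for on the same node (default log path assumed):

# Watch for failed subdaemon liveness checks during the load loop
$ tail -f /var/log/pacemaker/pacemaker.log | grep --line-buffered 'unresponsive to ipc'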
Expected results
We get "pcs stonith history" output, without any issues to the cluster.
Actual results
The pacemaker service kills itself and recovers due to failed "liveness" checks against its subcomponents:
Dec 04 08:44:29 rhel8-node1 pacemakerd [2570] (cluster_connect_cfg@pcmkd_corosync.c:154) crit: Could not connect to Corosync CFG: CS_ERR_LIBRARY | rc=2
Dec 04 08:44:29 rhel8-node1 pacemakerd [2570] (cluster_reconnect_cb@pcmkd_corosync.c:90) info: Cluster reconnect failed (connection will be reattempted once per second)
Dec 04 08:44:29 rhel8-node1 pacemakerd [2570] (pcmk__ipc_is_authentic_process_active@ipc_client.c:1617) info: Could not connect to attrd IPC: Connection refused
Dec 04 08:44:29 rhel8-node1 pacemakerd [2570] (check_next_subdaemon@pcmkd_subdaemons.c:140) trace: Checked pacemaker-attrd[2574]: IPC server process is active but not accepting connections (-1020)
Dec 04 08:44:29 rhel8-node1 pacemakerd [2570] (check_next_subdaemon@pcmkd_subdaemons.c:168) notice: pacemaker-attrd[2574] is unresponsive to ipc after 1 tries
-----------------------------------------8<-----------------------------------------
Dec 04 08:44:31 rhel8-node1 pacemakerd [2570] (cluster_connect_cfg@pcmkd_corosync.c:154) crit: Could not connect to Corosync CFG: CS_ERR_LIBRARY | rc=2
Dec 04 08:44:31 rhel8-node1 pacemakerd [2570] (cluster_reconnect_cb@pcmkd_corosync.c:90) info: Cluster reconnect failed (connection will be reattempted once per second)
Dec 04 08:44:31 rhel8-node1 pacemakerd [2570] (pcmk__ipc_is_authentic_process_active@ipc_client.c:1617) info: Could not connect to crmd IPC: Connection refused
Dec 04 08:44:31 rhel8-node1 pacemakerd [2570] (check_next_subdaemon@pcmkd_subdaemons.c:140) trace: Checked pacemaker-controld[0]: IPC server is unresponsive (-1019)
Additional Notes:
Something I notice is that all of these stonith history checks are being run as "synchronous" calls:
Dec 04 09:04:39.145 clustera-rhel9 pacemaker-fenced [21516] (stonith_command) debug: Processing synchronous st_fence_history request 2 from client stonith_admin.327947
Dec 04 09:04:39.145 clustera-rhel9 pacemaker-fenced [21516] (handle_request) debug: Processed st_fence_history request from client stonith_admin.327947: complete
Dec 04 09:04:39.398 clustera-rhel9 pacemaker-fenced [21516] (stonith_command) debug: Processing synchronous st_fence_history request 2 from client stonith_admin.327948
Dec 04 09:04:39.398 clustera-rhel9 pacemaker-fenced [21516] (handle_request) debug: Processed st_fence_history request from client stonith_admin.327948: complete
Dec 04 09:04:39.945 clustera-rhel9 pacemaker-fenced [21516] (stonith_command) debug: Processing synchronous st_fence_history request 2 from client stonith_admin.327967
Dec 04 09:04:39.945 clustera-rhel9 pacemaker-fenced [21516] (handle_request) debug: Processed st_fence_history request from client stonith_admin.327967: complete
Dec 04 09:04:41.072 clustera-rhel9 pacemaker-fenced [21516] (stonith_command) debug: Processing synchronous st_fence_history request 2 from client stonith_admin.327951
Dec 04 09:04:41.072 clustera-rhel9 pacemaker-fenced [21516] (handle_request) debug: Processed st_fence_history request from client stonith_admin.327951: complete
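The clients in this log show up as stonith_admin processes, which is consistent with "pcs stonith history" wrapping stonith_admin, so the same load can presumably be generated without pcs. The exact invocation below is an assumption (see stonith_admin --help):

# Generate the history queries directly with stonith_admin ('*' = all nodes)
$ for j in $( seq 1 40 ) ; do stonith_admin --history '*' & done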
Talking with narwahl++, I understand there were some efforts in the past to make calls like this asynchronous to avoid impact. Just wondering here if the sync call may be making the issue worse, and if this can be run as an "async" call instead.