Bug
Resolution: Unresolved
Normal
rhel-8.10.z, rhel-9.2.0.z
Low
rhel-ha
What were you trying to do that didn't work?
Customer is operating their OpenStack "scale-out" HA instance as normal, but with cron jobs that regularly check "pcs status" output.
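For illustration, the kind of periodic check assumed here would be a cron entry along these lines (the schedule, output path, and exact command on the customer side are assumptions):

# Hypothetical cron entry: poll cluster status every minute and keep the last output
* * * * * /usr/sbin/pcs status > /var/tmp/pcs-status.last 2>&1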
What is the impact of this issue to you?
When the issue occurs, it blocks pacemaker-controld's "liveness" check and can result in a restart of pacemaker-controld:
Dec 2 08:16:40 controller-0 pacemakerd[698941]: notice: pacemaker-controld[896317] is unresponsive to ipc after 1 tries
Dec 2 08:16:47 controller-0 pacemakerd[698941]: notice: pacemaker-controld[896317] is unresponsive to ipc after 2 tries
Dec 2 08:16:54 controller-0 pacemakerd[698941]: notice: pacemaker-controld[896317] is unresponsive to ipc after 3 tries
Dec 2 08:17:01 controller-0 pacemakerd[698941]: notice: pacemaker-controld[896317] is unresponsive to ipc after 4 tries
Dec 2 08:17:08 controller-0 pacemakerd[698941]: error: pacemaker-controld[896317] is unresponsive to ipc after 5 tries but we found the pid so have it killed that we can restart
-----------------------------------------8<-----------------------------------------
Dec 2 08:17:08 controller-0 pacemakerd[698941]: warning: pacemaker-controld[896317] terminated with signal 9 (Killed)
Dec 2 08:17:08 controller-0 pacemakerd[698941]: notice: Respawning pacemaker-controld subdaemon after unexpected exit
The above "liveness" failures appear to be blocked behind "stonith history" checks ran in a synchronous fashion:
Dec 02 08:16:00.477 controller-1 pacemaker-fenced [498872] (handle_request@fenced_commands.c:3584) debug: Processed st_fence_history request from client crm_mon.210990: complete
Dec 02 08:16:00.477 controller-1 pacemaker-fenced [498872] (stonith_command@fenced_commands.c:3640) debug: Processing synchronous st_fence_history request 90875 from client crm_mon.210990
Dec 02 08:16:00.480 controller-1 pacemaker-fenced [498872] (handle_request@fenced_commands.c:3584) debug: Processed st_fence_history request from client crm_mon.210990: complete
Dec 02 08:16:00.480 controller-1 pacemaker-fenced [498872] (stonith_command@fenced_commands.c:3640) debug: Processing synchronous st_fence_history request 90883 from client crm_mon.210990
Dec 02 08:16:00.483 controller-1 pacemaker-fenced [498872] (handle_request@fenced_commands.c:3584) debug: Processed st_fence_history request from client crm_mon.210990: complete
Dec 02 08:16:00.483 controller-1 pacemaker-fenced [498872] (stonith_command@fenced_commands.c:3640) debug: Processing synchronous st_fence_history request 90874 from client crm_mon.210990
Dec 02 08:16:00.486 controller-1 pacemaker-fenced [498872] (handle_request@fenced_commands.c:3584) debug: Processed st_fence_history request from client crm_mon.210990: complete
Dec 02 08:16:00.486 controller-1 pacemaker-fenced [498872] (stonith_command@fenced_commands.c:3640) debug: Processing synchronous st_fence_history request 90876 from client crm_mon.210990
Dec 02 08:16:00.489 controller-1 pacemaker-fenced [498872] (handle_request@fenced_commands.c:3584) debug: Processed st_fence_history request from client crm_mon.210990: complete
Dec 02 08:16:00.489 controller-1 pacemaker-fenced [498872] (stonith_command@fenced_commands.c:3640) debug: Processing synchronous st_fence_history request 90882 from client crm_mon.210990
Dec 02 08:16:00.492 controller-1 pacemaker-fenced [498872] (handle_request@fenced_commands.c:3584) debug: Processed st_fence_history request from client crm_mon.210990: complete
Dec 02 08:16:00.492 controller-1 pacemaker-fenced [498872] (stonith_command@fenced_commands.c:3640) debug: Processing synchronous st_fence_history request 90878 from client crm_mon.210990
Dec 02 08:16:00.495 controller-1 pacemaker-fenced [498872] (handle_request@fenced_commands.c:3584) debug: Processed st_fence_history request from client crm_mon.210990: complete
Dec 02 08:16:00.495 controller-1 pacemaker-fenced [498872] (stonith_command@fenced_commands.c:3640) debug: Processing synchronous st_fence_history request 90879 from client crm_mon.210990
Dec 02 08:16:00.498 controller-1 pacemaker-fenced [498872] (handle_request@fenced_commands.c:3584) debug: Processed st_fence_history request from client crm_mon.210990: complete
Dec 02 08:16:00.498 controller-1 pacemaker-fenced [498872] (stonith_command@fenced_commands.c:3640) debug: Processing synchronous st_fence_history request 90880 from client crm_mon.210990
Dec 02 08:16:00.501 controller-1 pacemaker-fenced [498872] (handle_request@fenced_commands.c:3584) debug: Processed st_fence_history request from client crm_mon.210990: complete
Dec 02 08:16:00.501 controller-1 pacemaker-fenced [498872] (stonith_command@fenced_commands.c:3640) debug: Processing synchronous st_fence_history request 90881 from client crm_mon.210990
Dec 02 08:16:00.503 controller-1 pacemaker-fenced [498872] (handle_request@fenced_commands.c:3584) debug: Processed st_fence_history request from client crm_mon.210990: complete
Dec 02 08:16:00.504 controller-1 pacemaker-fenced [498872] (stonith_command@fenced_commands.c:3640) debug: Processing synchronous st_fence_history request 90877 from client crm_mon.210990
Dec 02 08:16:00.506 controller-1 pacemaker-fenced [498872] (handle_request@fenced_commands.c:3584) debug: Processed st_fence_history request from client crm_mon.210990: complete
Dec 02 08:16:00.506 controller-1 pacemaker-fenced [498872] (stonith_command@fenced_commands.c:3640) debug: Processing synchronous st_fence_history request 90884 from client crm_mon.210990
Dec 02 08:16:00.509 controller-1 pacemaker-fenced [498872] (handle_request@fenced_commands.c:3584) debug: Processed st_fence_history request from client crm_mon.210990: complete
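As a rough way to quantify how tightly packed these requests are, the fencer debug messages can be bucketed per second (log path assumed to be the default /var/log/pacemaker/pacemaker.log; adjust for the sosreport layout):

# Count synchronous st_fence_history requests per second
$ grep 'Processing synchronous st_fence_history' /var/log/pacemaker/pacemaker.log | cut -d. -f1 | sort | uniq -c | sort -rn | head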
The restart of pacemaker-controld following the above issue can additionally trigger recovery actions against any active resources on the node. These recovery actions can result in extended downtime for the resources if they are not successful.
Please provide the package NVR for which the bug is seen:
$ grep -h -e kernel-[0-9] -e pacemaker-[0-9] */installed-rpms
kernel-5.14.0-284.40.1.el9_2.x86_64    Fri Dec 15 00:14:28 2023
pacemaker-2.1.5-9.el9_2.3.x86_64       Fri Dec 15 02:16:53 2023
How reproducible is this bug?:
A form of the bug is always reproducible; it is simple enough to create the IPC liveness check failures. Although in my reproducer I am not able to create the `pacemaker-controld` failure the customer is consistently seeing, I do get another component failure and pacemaker daemons killed (`pacemaker-attrd` is mainly affected in my reproducer):
# pacemaker.log snippets:
Dec 04 08:44:29 rhel8-node1 pacemakerd [2570] (check_next_subdaemon@pcmkd_subdaemons.c:168) notice: pacemaker-attrd[2574] is unresponsive to ipc after 1 tries
Dec 04 08:47:25 rhel8-node1 pacemakerd [2570] (check_next_subdaemon@pcmkd_subdaemons.c:168) notice: pacemaker-attrd[8317] is unresponsive to ipc after 1 tries
-----------------------------------------8<-----------------------------------------
Dec 04 08:44:31 rhel8-node1 pacemakerd [2570] (pcmk_child_exit@pcmkd_subdaemons.c:282) error: pacemaker-attrd[2574] exited with status 102 (Not connected)
Dec 04 08:44:31 rhel8-node1 pacemakerd [2570] (pcmk__ipc_is_authentic_process_active@ipc_client.c:1617) info: Could not connect to attrd IPC: Connection refused
Dec 04 08:44:31 rhel8-node1 pacemakerd [2570] (pcmk_process_exit@pcmkd_subdaemons.c:321) notice: Not respawning pacemaker-attrd subdaemon until cluster returns
Dec 04 08:44:31 rhel8-node1 pacemakerd [2570] (cluster_connect_cfg@pcmkd_corosync.c:154) crit: Could not connect to Corosync CFG: CS_ERR_LIBRARY | rc=2
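A quick way to confirm which subdaemons were actually killed during the reproducer is a simple process listing on the node (any equivalent check works):

# List the pacemaker subdaemons still running on the node
$ ps -eo pid,cmd | grep '[p]acemaker-'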
Steps to reproduce
On a standard cluster, run the below loop to generate many simultaneous "pcs stonith history" reports (a simple 2-node cluster is sufficient):
# Generate 40 concurrent "pcs stonith history" checks every 3s, for 10 iterations (~30 seconds):
$ for i in $( seq 1 10 ) ; do for j in $( seq 1 40 ) ; do pcs stonith history & done ; sleep 3 ; done
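While the loop runs, the liveness-check failures can be watched for on the same node (default log path assumed):

# Watch for failed subdaemon liveness checks during the load loop
$ tail -f /var/log/pacemaker/pacemaker.log | grep --line-buffered 'unresponsive to ipc'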
Expected results
We get "pcs stonith history" output, without any issues to the cluster.
Actual results
The pacemaker service kills itself and recovers due to failed "liveness" checks against its subcomponents:
Dec 04 08:44:29 rhel8-node1 pacemakerd [2570] (cluster_connect_cfg@pcmkd_corosync.c:154) crit: Could not connect to Corosync CFG: CS_ERR_LIBRARY | rc=2
Dec 04 08:44:29 rhel8-node1 pacemakerd [2570] (cluster_reconnect_cb@pcmkd_corosync.c:90) info: Cluster reconnect failed (connection will be reattempted once per second)
Dec 04 08:44:29 rhel8-node1 pacemakerd [2570] (pcmk__ipc_is_authentic_process_active@ipc_client.c:1617) info: Could not connect to attrd IPC: Connection refused
Dec 04 08:44:29 rhel8-node1 pacemakerd [2570] (check_next_subdaemon@pcmkd_subdaemons.c:140) trace: Checked pacemaker-attrd[2574]: IPC server process is active but not accepting connections (-1020)
Dec 04 08:44:29 rhel8-node1 pacemakerd [2570] (check_next_subdaemon@pcmkd_subdaemons.c:168) notice: pacemaker-attrd[2574] is unresponsive to ipc after 1 tries
-----------------------------------------8<-----------------------------------------
Dec 04 08:44:31 rhel8-node1 pacemakerd [2570] (cluster_connect_cfg@pcmkd_corosync.c:154) crit: Could not connect to Corosync CFG: CS_ERR_LIBRARY | rc=2
Dec 04 08:44:31 rhel8-node1 pacemakerd [2570] (cluster_reconnect_cb@pcmkd_corosync.c:90) info: Cluster reconnect failed (connection will be reattempted once per second)
Dec 04 08:44:31 rhel8-node1 pacemakerd [2570] (pcmk__ipc_is_authentic_process_active@ipc_client.c:1617) info: Could not connect to crmd IPC: Connection refused
Dec 04 08:44:31 rhel8-node1 pacemakerd [2570] (check_next_subdaemon@pcmkd_subdaemons.c:140) trace: Checked pacemaker-controld[0]: IPC server is unresponsive (-1019)
Additional Notes:
Something I notice is that all of these stonith history checks are being run as "synchronous" calls:
Dec 04 09:04:39.145 clustera-rhel9 pacemaker-fenced [21516] (stonith_command) debug: Processing synchronous st_fence_history request 2 from client stonith_admin.327947
Dec 04 09:04:39.145 clustera-rhel9 pacemaker-fenced [21516] (handle_request) debug: Processed st_fence_history request from client stonith_admin.327947: complete
Dec 04 09:04:39.398 clustera-rhel9 pacemaker-fenced [21516] (stonith_command) debug: Processing synchronous st_fence_history request 2 from client stonith_admin.327948
Dec 04 09:04:39.398 clustera-rhel9 pacemaker-fenced [21516] (handle_request) debug: Processed st_fence_history request from client stonith_admin.327948: complete
Dec 04 09:04:39.945 clustera-rhel9 pacemaker-fenced [21516] (stonith_command) debug: Processing synchronous st_fence_history request 2 from client stonith_admin.327967
Dec 04 09:04:39.945 clustera-rhel9 pacemaker-fenced [21516] (handle_request) debug: Processed st_fence_history request from client stonith_admin.327967: complete
Dec 04 09:04:41.072 clustera-rhel9 pacemaker-fenced [21516] (stonith_command) debug: Processing synchronous st_fence_history request 2 from client stonith_admin.327951
Dec 04 09:04:41.072 clustera-rhel9 pacemaker-fenced [21516] (handle_request) debug: Processed st_fence_history request from client stonith_admin.327951: complete
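The clients in this log show up as stonith_admin processes, which is consistent with "pcs stonith history" wrapping stonith_admin, so the same load can presumably be generated without pcs. The exact invocation below is an assumption (see stonith_admin --help):

# Generate the history queries directly with stonith_admin ('*' = all nodes)
$ for j in $( seq 1 40 ) ; do stonith_admin --history '*' & done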
Talking with narwahl++, I understand there were some efforts in the past to make calls like this asynchronous to avoid impact. Just wondering here if the sync call may be making the issue worse, and if this can be run as an "async" call instead.