RHEL-69994: Synchronous Stonith History calls cause IPC communication timeouts for "pacemaker-controld"


    • Bug
    • Resolution: Unresolved
    • Normal
    • rhel-8.10.z, rhel-9.2.0.z
    • pacemaker
    • Low
    • rhel-ha

      What were you trying to do that didn't work?

      The customer is operating their OpenStack "scale-out" HA instance as normal, but with cron jobs that regularly check "pcs status" output.
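
      The exact cron job isn't included in the case, but a minimal sketch of the pattern (the schedule and log path below are hypothetical) might look like:

      # /etc/cron.d/pcs-status-check -- hypothetical monitoring job
      * * * * * root /usr/sbin/pcs status --full >> /var/log/pcs-status-check.log 2>&1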

      What is the impact of this issue to you?

      When the issue occurs, it blocks pacemaker-controld's "liveness" check and can result in a restart of pacemaker-controld:

      Dec  2 08:16:40 controller-0 pacemakerd[698941]: notice: pacemaker-controld[896317] is unresponsive to ipc after 1 tries
      Dec  2 08:16:47 controller-0 pacemakerd[698941]: notice: pacemaker-controld[896317] is unresponsive to ipc after 2 tries
      Dec  2 08:16:54 controller-0 pacemakerd[698941]: notice: pacemaker-controld[896317] is unresponsive to ipc after 3 tries
      Dec  2 08:17:01 controller-0 pacemakerd[698941]: notice: pacemaker-controld[896317] is unresponsive to ipc after 4 tries
      Dec  2 08:17:08 controller-0 pacemakerd[698941]: error: pacemaker-controld[896317] is unresponsive to ipc after 5 tries but we found the pid so have it killed that we can restart
      -----------------------------------------8<----------------------------------------- 
      Dec  2 08:17:08 controller-0 pacemakerd[698941]: warning: pacemaker-controld[896317] terminated with signal 9 (Killed)
      Dec  2 08:17:08 controller-0 pacemakerd[698941]: notice: Respawning pacemaker-controld subdaemon after unexpected exit
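
      These messages land in the standard system log, so a quick way to confirm the behavior on an affected node (assuming the default syslog location) is:

      $ grep -E 'unresponsive to ipc|Respawning pacemaker' /var/log/messages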

      The above "liveness" failures appear to be blocked behind "stonith history" checks ran in a synchronous fashion:

       

      Dec 02 08:16:00.477 controller-1 pacemaker-fenced    [498872] (handle_request@fenced_commands.c:3584)   debug: Processed st_fence_history request from client crm_mon.210990: complete
      Dec 02 08:16:00.477 controller-1 pacemaker-fenced    [498872] (stonith_command@fenced_commands.c:3640)  debug: Processing synchronous st_fence_history request 90875 from client crm_mon.210990
      Dec 02 08:16:00.480 controller-1 pacemaker-fenced    [498872] (handle_request@fenced_commands.c:3584)   debug: Processed st_fence_history request from client crm_mon.210990: complete
      Dec 02 08:16:00.480 controller-1 pacemaker-fenced    [498872] (stonith_command@fenced_commands.c:3640)  debug: Processing synchronous st_fence_history request 90883 from client crm_mon.210990
      Dec 02 08:16:00.483 controller-1 pacemaker-fenced    [498872] (handle_request@fenced_commands.c:3584)   debug: Processed st_fence_history request from client crm_mon.210990: complete
      Dec 02 08:16:00.483 controller-1 pacemaker-fenced    [498872] (stonith_command@fenced_commands.c:3640)  debug: Processing synchronous st_fence_history request 90874 from client crm_mon.210990
      Dec 02 08:16:00.486 controller-1 pacemaker-fenced    [498872] (handle_request@fenced_commands.c:3584)   debug: Processed st_fence_history request from client crm_mon.210990: complete
      Dec 02 08:16:00.486 controller-1 pacemaker-fenced    [498872] (stonith_command@fenced_commands.c:3640)  debug: Processing synchronous st_fence_history request 90876 from client crm_mon.210990
      Dec 02 08:16:00.489 controller-1 pacemaker-fenced    [498872] (handle_request@fenced_commands.c:3584)   debug: Processed st_fence_history request from client crm_mon.210990: complete
      Dec 02 08:16:00.489 controller-1 pacemaker-fenced    [498872] (stonith_command@fenced_commands.c:3640)  debug: Processing synchronous st_fence_history request 90882 from client crm_mon.210990
      Dec 02 08:16:00.492 controller-1 pacemaker-fenced    [498872] (handle_request@fenced_commands.c:3584)   debug: Processed st_fence_history request from client crm_mon.210990: complete
      Dec 02 08:16:00.492 controller-1 pacemaker-fenced    [498872] (stonith_command@fenced_commands.c:3640)  debug: Processing synchronous st_fence_history request 90878 from client crm_mon.210990
      Dec 02 08:16:00.495 controller-1 pacemaker-fenced    [498872] (handle_request@fenced_commands.c:3584)   debug: Processed st_fence_history request from client crm_mon.210990: complete
      Dec 02 08:16:00.495 controller-1 pacemaker-fenced    [498872] (stonith_command@fenced_commands.c:3640)  debug: Processing synchronous st_fence_history request 90879 from client crm_mon.210990
      Dec 02 08:16:00.498 controller-1 pacemaker-fenced    [498872] (handle_request@fenced_commands.c:3584)   debug: Processed st_fence_history request from client crm_mon.210990: complete
      Dec 02 08:16:00.498 controller-1 pacemaker-fenced    [498872] (stonith_command@fenced_commands.c:3640)  debug: Processing synchronous st_fence_history request 90880 from client crm_mon.210990
      Dec 02 08:16:00.501 controller-1 pacemaker-fenced    [498872] (handle_request@fenced_commands.c:3584)   debug: Processed st_fence_history request from client crm_mon.210990: complete
      Dec 02 08:16:00.501 controller-1 pacemaker-fenced    [498872] (stonith_command@fenced_commands.c:3640)  debug: Processing synchronous st_fence_history request 90881 from client crm_mon.210990
      Dec 02 08:16:00.503 controller-1 pacemaker-fenced    [498872] (handle_request@fenced_commands.c:3584)   debug: Processed st_fence_history request from client crm_mon.210990: complete
      Dec 02 08:16:00.504 controller-1 pacemaker-fenced    [498872] (stonith_command@fenced_commands.c:3640)  debug: Processing synchronous st_fence_history request 90877 from client crm_mon.210990
      Dec 02 08:16:00.506 controller-1 pacemaker-fenced    [498872] (handle_request@fenced_commands.c:3584)   debug: Processed st_fence_history request from client crm_mon.210990: complete
      Dec 02 08:16:00.506 controller-1 pacemaker-fenced    [498872] (stonith_command@fenced_commands.c:3640)  debug: Processing synchronous st_fence_history request 90884 from client crm_mon.210990
      Dec 02 08:16:00.509 controller-1 pacemaker-fenced    [498872] (handle_request@fenced_commands.c:3584)   debug: Processed st_fence_history request from client crm_mon.210990: complete
      

       

       

      The restart of pacemaker-controld following the above issue can additionally trigger recovery actions against any active resources on the node. These recovery actions can result in extended downtime for the resources if they are not successful.

       

      Please provide the package NVR for which the bug is seen:

       

      $ grep -h -e kernel-[0-9] -e pacemaker-[0-9] */installed-rpms
      kernel-5.14.0-284.40.1.el9_2.x86_64                         Fri Dec 15 00:14:28 2023
      pacemaker-2.1.5-9.el9_2.3.x86_64                            Fri Dec 15 02:16:53 2023
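
      For reference, the equivalent check on a live node (rather than the sosreport installed-rpms data above) is simply:

      $ rpm -q pacemaker
      $ uname -r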
      

      How reproducible is this bug?:

      A form of the bug is always reproducible; it's simple enough to create the IPC liveness check failures. Although in my reproducer I am not able to recreate the `pacemaker-controld` failure the customer is consistently seeing, I do get a failure of another subdaemon and a resulting kill (mainly `pacemaker-attrd` is affected in my reproducer):

      # pacemaker.log snippets:
      Dec 04 08:44:29 rhel8-node1 pacemakerd          [2570] (check_next_subdaemon@pcmkd_subdaemons.c:168)     notice: pacemaker-attrd[2574] is unresponsive to ipc after 1 tries
      Dec 04 08:47:25 rhel8-node1 pacemakerd          [2570] (check_next_subdaemon@pcmkd_subdaemons.c:168)     notice: pacemaker-attrd[8317] is unresponsive to ipc after 1 tries
      -----------------------------------------8<----------------------------------------- 
      Dec 04 08:44:31 rhel8-node1 pacemakerd          [2570] (pcmk_child_exit@pcmkd_subdaemons.c:282)         error: pacemaker-attrd[2574] exited with status 102 (Not connected)
      Dec 04 08:44:31 rhel8-node1 pacemakerd          [2570] (pcmk__ipc_is_authentic_process_active@ipc_client.c:1617)        info: Could not connect to attrd IPC: Connection refused
      Dec 04 08:44:31 rhel8-node1 pacemakerd          [2570] (pcmk_process_exit@pcmkd_subdaemons.c:321)       notice: Not respawning pacemaker-attrd subdaemon until cluster returns
      Dec 04 08:44:31 rhel8-node1 pacemakerd          [2570] (cluster_connect_cfg@pcmkd_corosync.c:154)       crit: Could not connect to Corosync CFG: CS_ERR_LIBRARY | rc=2

      Steps to reproduce

      On a standard cluster, run the loop below to generate many simultaneous "pcs stonith history" reports (a simple 2-node cluster is sufficient):

      # Generate 10 "pcs stonith history" checks every 3s for 30 seconds:
      $ for i in $( seq 1 10 ) ; do for j in $( seq 1 40 ) ; do pcs stonith history & done ; sleep 3; done 
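
      While the loop runs, the effect on the subdaemons can be watched in real time (assuming the default pacemaker detail log location):

      $ tail -f /var/log/pacemaker/pacemaker.log | grep -E 'unresponsive to ipc|terminated with signal|Respawning'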

      Expected results

      We get "pcs stonith history" output, without any issues to the cluster.

      Actual results

      The pacemaker service kills itself and recovers, due to failed "liveness" checks against its subdaemons:

      Dec 04 08:44:29 rhel8-node1 pacemakerd          [2570] (cluster_connect_cfg@pcmkd_corosync.c:154)       crit: Could not connect to Corosync CFG: CS_ERR_LIBRARY | rc=2
      Dec 04 08:44:29 rhel8-node1 pacemakerd          [2570] (cluster_reconnect_cb@pcmkd_corosync.c:90)       info: Cluster reconnect failed (connection will be reattempted once per second)
      Dec 04 08:44:29 rhel8-node1 pacemakerd          [2570] (pcmk__ipc_is_authentic_process_active@ipc_client.c:1617)        info: Could not connect to attrd IPC: Connection refused
      Dec 04 08:44:29 rhel8-node1 pacemakerd          [2570] (check_next_subdaemon@pcmkd_subdaemons.c:140)    trace: Checked pacemaker-attrd[2574]: IPC server process is active but not accepting connections (-1020)
      Dec 04 08:44:29 rhel8-node1 pacemakerd          [2570] (check_next_subdaemon@pcmkd_subdaemons.c:168)    notice: pacemaker-attrd[2574] is unresponsive to ipc after 1 tries
      -----------------------------------------8<----------------------------------------- 
      Dec 04 08:44:31 rhel8-node1 pacemakerd          [2570] (cluster_connect_cfg@pcmkd_corosync.c:154)       crit: Could not connect to Corosync CFG: CS_ERR_LIBRARY | rc=2
      Dec 04 08:44:31 rhel8-node1 pacemakerd          [2570] (cluster_reconnect_cb@pcmkd_corosync.c:90)       info: Cluster reconnect failed (connection will be reattempted once per second)
      Dec 04 08:44:31 rhel8-node1 pacemakerd          [2570] (pcmk__ipc_is_authentic_process_active@ipc_client.c:1617)        info: Could not connect to crmd IPC: Connection refused
      Dec 04 08:44:31 rhel8-node1 pacemakerd          [2570] (check_next_subdaemon@pcmkd_subdaemons.c:140)    trace: Checked pacemaker-controld[0]: IPC server is unresponsive (-1019)

      Additional Notes:

      Something I notice is that all of these stonith history checks are being run as "synchronous" calls:

       

      Dec 04 09:04:39.145 clustera-rhel9 pacemaker-fenced    [21516] (stonith_command)     debug: Processing synchronous st_fence_history request 2 from client stonith_admin.327947
      Dec 04 09:04:39.145 clustera-rhel9 pacemaker-fenced    [21516] (handle_request)     debug: Processed st_fence_history request from client stonith_admin.327947: complete
      Dec 04 09:04:39.398 clustera-rhel9 pacemaker-fenced    [21516] (stonith_command)     debug: Processing synchronous st_fence_history request 2 from client stonith_admin.327948
      Dec 04 09:04:39.398 clustera-rhel9 pacemaker-fenced    [21516] (handle_request)     debug: Processed st_fence_history request from client stonith_admin.327948: complete
      Dec 04 09:04:39.945 clustera-rhel9 pacemaker-fenced    [21516] (stonith_command)     debug: Processing synchronous st_fence_history request 2 from client stonith_admin.327967
      Dec 04 09:04:39.945 clustera-rhel9 pacemaker-fenced    [21516] (handle_request)     debug: Processed st_fence_history request from client stonith_admin.327967: complete
      Dec 04 09:04:41.072 clustera-rhel9 pacemaker-fenced    [21516] (stonith_command)     debug: Processing synchronous st_fence_history request 2 from client stonith_admin.327951
      Dec 04 09:04:41.072 clustera-rhel9 pacemaker-fenced    [21516] (handle_request)     debug: Processed st_fence_history request from client stonith_admin.327951: complete
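
      Note that the pacemaker-fenced messages above are debug level, so they only show up with debugging enabled. On my reproducer I believe the relevant knob is PCMK_debug in /etc/sysconfig/pacemaker (followed by a restart of the cluster), for example:

      # /etc/sysconfig/pacemaker -- enable debug logging for the fencer only
      PCMK_debug=pacemaker-fenced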

      Talking with narwahl++ I understand there were some efforts in the past to make calls like this asynchronous to avoid impact. Just wondering here if the synchronous call may be making the issue worse, and if this could be run as an "async" call instead.
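
      In the meantime, one thing that might soften the impact on the customer side (assuming the standard pcs subcommands, and assuming the cost of each synchronous request grows with the amount of history the fencer has to serialize) would be to narrow each query to a single node and periodically prune old fencing history:

      # Query only one node's history instead of the whole cluster:
      $ pcs stonith history show controller-0

      # Periodically clear accumulated fencing history so each request stays small:
      $ pcs stonith history cleanup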

              rhn-support-clumens Christopher Lumens
              rhn-support-jobaker Joshua Baker
              Kenneth Gaillot (Inactive)
              Cluster QE