  1. RHEL
  2. RHEL-50910

pacemaker-controld is unresponsive to ipc

    • Bug
    • Resolution: Duplicate
    • Critical
    • None
    • rhel-9.2.0
    • pacemaker
    • None
    • No
    • None
    • sst_high_availability
    • ssg_filesystems_storage_and_HA
    • None
    • False
    • None
    • None
    • None
    • None
    • ppc64le
    • None

      What were you trying to do that didn't work?

      In a cluster consisting of 2 cluster nodes and 2 Pacemaker remote nodes, taking down the public network on one of the Pacemaker remote nodes caused the Pacemaker control daemon (pacemaker-controld) on the DC to hang and subsequently be killed.  This is not expected.  The expected behaviour is that the DC detects the remote node as OFFLINE and performs a fencing operation.

      Please provide the package NVR for which the bug is seen:

      Pacemaker 2.1.6-4.db2pcmk.el9

      How reproducible:

      Hit the issue on the second attempt.

      Steps to reproduce

      1. Create a cluster that consists of 2 cluster nodes and 2 Pacemaker remote nodes.
      2. Run ifconfig <interface> down on the public interface of one of the 2 Pacemaker remote nodes.
      3. Shortly after the interface goes down, the DC node goes OFFLINE and Pacemaker is restarted on the DC host.

      Expected results:  The remote node is detected as OFFLINE and the DC performs node recovery.

      Actual results:  The Pacemaker control daemon on the DC timed out with the errors below and was then terminated.  The DC role then restarted on a different host:

      Jul 17 15:59:49.331 p10rhel094 pacemakerd          [2271] (pcmk__ipc_is_authentic_process_active)       info: Could not connect to crmd IPC: timeout
      Jul 17 15:59:49.331 p10rhel094 pacemakerd          [2271] (check_next_subdaemon)        notice: pacemaker-controld[2504] is unresponsive to ipc after 1 tries

      Jul 17 16:00:17.331 p10rhel094 pacemakerd          [2271] (pcmk__ipc_is_authentic_process_active)       info: Could not connect to crmd IPC: timeout
      Jul 17 16:00:17.331 p10rhel094 pacemakerd          [2271] (check_next_subdaemon)        error: pacemaker-controld[2504] is unresponsive to ipc after 5 tries but we found the pid so have it killed that we can restart

      Jul 17 16:00:17.331 p10rhel094 pacemakerd          [2271] (pcmk_child_exit)     warning: pacemaker-controld[2504] terminated with signal 9 (Killed)
      Jul 17 16:00:17.331 p10rhel094 pacemakerd          [2271] (pcmk__ipc_is_authentic_process_active)       info: Could not connect to crmd IPC: Connection refused
      Jul 17 16:00:17.331 p10rhel094 pacemakerd          [2271] (pcmk_process_exit)   notice: Respawning pacemaker-controld subdaemon after unexpected exit
      Jul 17 16:00:17.331 p10rhel094 pacemakerd          [2271] (start_child)         info: Using uid=189 and group=189 for process pacemaker-controld
      Jul 17 16:00:17.331 p10rhel094 pacemakerd          [2271] (start_child)         info: Forked child 3602000 for process pacemaker-controld
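To make the timeline in the excerpt above easier to read, here is a minimal sketch that pulls the "is unresponsive to ipc" messages out of pacemakerd log text and measures the time between the first report and the last one. It is not part of Pacemaker. The helper names (`unresponsive_events`, `seconds_between_first_and_last`) and the `year` parameter are assumptions made for illustration; the log lines carry no year, so a placeholder year is supplied only so the timestamps can be parsed.

```python
import re
from datetime import datetime

# Three lines copied verbatim from the log excerpt in this report.
LOG = """\
Jul 17 15:59:49.331 p10rhel094 pacemakerd          [2271] (check_next_subdaemon)        notice: pacemaker-controld[2504] is unresponsive to ipc after 1 tries
Jul 17 16:00:17.331 p10rhel094 pacemakerd          [2271] (check_next_subdaemon)        error: pacemaker-controld[2504] is unresponsive to ipc after 5 tries but we found the pid so have it killed that we can restart
Jul 17 16:00:17.331 p10rhel094 pacemakerd          [2271] (pcmk_child_exit)     warning: pacemaker-controld[2504] terminated with signal 9 (Killed)
"""

# Matches the timestamp, subdaemon name, PID and try count of an
# "is unresponsive to ipc after N tries" message.
UNRESPONSIVE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+ \d{2}:\d{2}:\d{2}\.\d+) .*?"
    r"(?P<daemon>pacemaker-\w+)\[(?P<pid>\d+)\] "
    r"is unresponsive to ipc after (?P<tries>\d+) tries"
)


def unresponsive_events(text, year=2000):
    """Return (timestamp, daemon, pid, tries) for each matching log line.

    `year` is a placeholder (assumption): the log format omits the year.
    """
    events = []
    for line in text.splitlines():
        m = UNRESPONSIVE.search(line)
        if m:
            ts = datetime.strptime(f"{year} {m['ts']}", "%Y %b %d %H:%M:%S.%f")
            events.append((ts, m["daemon"], int(m["pid"]), int(m["tries"])))
    return events


def seconds_between_first_and_last(events):
    """Seconds elapsed between the first and last unresponsive report."""
    return (events[-1][0] - events[0][0]).total_seconds()


events = unresponsive_events(LOG)
for ts, daemon, pid, tries in events:
    print(f"{ts:%H:%M:%S} {daemon}[{pid}] unresponsive after {tries} tries")
print("seconds from first report to kill:", seconds_between_first_and_last(events))
```

On the excerpt in this report, the sketch shows that pacemaker-controld[2504] was reported unresponsive after 1 try at 15:59:49 and after 5 tries at 16:00:17, the same timestamp at which pacemakerd logged the process as killed.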

            kgaillot@redhat.com Kenneth Gaillot
            lpham@ca.ibm.com Lan Pham
            IBM Confidential Group
            Kenneth Gaillot
            Cluster QE