  1. RHEL
  2. RHEL-34276

Pacemaker remote resource migration failed

    • pacemaker-2.1.9-1.el9
    • None
    • Important
    • sst_high_availability
    • ssg_filesystems_storage_and_HA
    • 13
    • 5
    • Dev ack
    • False
    • None
    • Yes
    • None
    • Bug Fix
    • Cause (the user action or circumstances that trigger the bug): If network communication is interrupted between a Pacemaker Remote node and the cluster node hosting its connection while the Pacemaker Remote daemon is reading certain data, such as the TLS handshake portion of the initial connection, the daemon could block for a long time waiting for more data.

      Consequence (what the user experience is when the bug occurs): The Pacemaker Remote connection cannot be recovered on another cluster node, and that node may log a "key was rejected" error message.

      Fix (what has changed to fix the bug; do not include overly technical details):
      The TLS handshake is now asynchronous.

      Result (what happens now that the patch is applied): The Pacemaker Remote connection is successfully recovered elsewhere.
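      To illustrate the difference between a blocking and an asynchronous handshake, here is a minimal Python sketch. It is not Pacemaker's code (Pacemaker is written in C and uses GnuTLS); the function names `handshake_with_deadline` and `demo_silent_peer` are hypothetical, and the silent peer is simulated with a local socket pair. The point is only that a non-blocking handshake lets the caller give up after a deadline instead of waiting indefinitely for data that never arrives.

```python
import select
import socket
import ssl
import time


def handshake_with_deadline(tls_sock, timeout):
    """Drive a TLS handshake on a non-blocking socket; give up after `timeout` seconds."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            tls_sock.do_handshake()
            return  # handshake finished
        except ssl.SSLWantReadError:
            want_read = True   # need more data from the peer
        except ssl.SSLWantWriteError:
            want_read = False  # need room to send more data
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            raise TimeoutError("TLS handshake did not complete in time")
        # Wait for the socket to become ready, but never past the deadline.
        if want_read:
            select.select([tls_sock], [], [], remaining)
        else:
            select.select([], [tls_sock], [], remaining)


def demo_silent_peer(timeout=0.5):
    """Simulate a peer that goes silent mid-handshake; return True if we timed out."""
    client_raw, silent_peer = socket.socketpair()
    client_raw.setblocking(False)
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
    ctx.check_hostname = False        # demo only: no certificate checks
    ctx.verify_mode = ssl.CERT_NONE
    tls = ctx.wrap_socket(client_raw, do_handshake_on_connect=False)
    try:
        handshake_with_deadline(tls, timeout)
        return False
    except TimeoutError:
        return True
    finally:
        tls.close()
        silent_peer.close()
```

      With a blocking socket, the same `do_handshake()` call would wait for the silent peer until some lower-level timeout fired, which corresponds to the behavior described under Cause.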
    • Proposed
    • x86_64
    • None

      What were you trying to do that didn't work?  The cluster consists of 2 quorum nodes and 2 Pacemaker remote nodes.  When a quorum node was fenced (this is part of the testing), the Pacemaker remote resource that was running on that node migrated to the other quorum node.  However, it failed to start there, with this error in the pacemaker.log file:

      Apr 25 11:40:18.247 ps-4 pacemaker-schedulerd[3072816] (unpack_rsc_op) error: Preventing ps-1 from restarting on ps-4 because of hard failure (invalid parameter: Key was rejected by service) | ps-1_last_failure_0
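
      When searching the attached logs for other occurrences, a small script can pull out the resource, node, and reason from lines like the one above. This is a sketch only: the pattern is derived from this single log line, the name `parse_hard_failure` is hypothetical, and the message wording may differ in other Pacemaker versions.

```python
import re

# Pattern based on the one scheduler message quoted in this report.
HARD_FAILURE = re.compile(
    r"error: Preventing (?P<resource>\S+) from restarting on (?P<node>\S+) "
    r"because of hard failure \((?P<reason>.*)\)"
)


def parse_hard_failure(line):
    """Return resource, node, and reason from a hard-failure log line, or None."""
    match = HARD_FAILURE.search(line)
    return match.groupdict() if match else None


line = (
    "Apr 25 11:40:18.247 ps-4 pacemaker-schedulerd[3072816] (unpack_rsc_op) "
    "error: Preventing ps-1 from restarting on ps-4 because of hard failure "
    "(invalid parameter: Key was rejected by service) | ps-1_last_failure_0"
)
info = parse_hard_failure(line)
```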

      Please provide the package NVR for which bug is seen: 

      Pacemaker 2.1.6-4.db2pcmk.el9

      How reproducible: Frequent

      Steps to reproduce

      1. Set up a cluster with 2 quorum nodes and 2 Pacemaker remote nodes, with a fencing agent defined to perform the fencing.
      2. Run ifconfig <eth> down to take down the main interface on one quorum node.
      3. The Pacemaker remote resource that was running on the fenced node migrates to the other quorum node but fails to start.

      Expected results: The Pacemaker remote resource migrates and starts successfully.

      Actual results: The Pacemaker remote resource failed to start and stayed in the stopped state.


        1. pacemaker-remote-start-failure.tar.bz2
          1.70 MB
          Lan Pham
        2. pcmk-Mon-29-Apr-2024.tar.bz2
          397 kB
          Lan Pham
        3. pcmk-Thu-25-Apr-2024.tar.bz2
          1.23 MB
          Lan Pham

            rhn-support-clumens Christopher Lumens
            lpham@ca.ibm.com Lan Pham
            IBM Confidential Group
            Kenneth Gaillot Kenneth Gaillot
            Jana Rehova Jana Rehova