• pacemaker-2.1.9-1.el9
    • None
    • Important
    • rhel-sst-high-availability
    • 13
    • 5
    • Dev ack
    • False
    • Yes
    • None
    • Bug Fix
      .Successful recovery of an interrupted Pacemaker remote connection

      Before this update, when network communication was interrupted between a Pacemaker remote node and the cluster node hosting its connection during the TLS handshake portion of the initial connection, the connection in some cases blocked and could not be recovered on another cluster node. With this update, the TLS handshake is asynchronous and a remote connection is successfully recovered elsewhere.
    • Done
    • x86_64
    • None

      What were you trying to do that didn't work?  

      The cluster consists of 2 quorum nodes and 2 Pacemaker remote nodes. When a quorum node was fenced (this is part of the testing), the Pacemaker remote resource that was running on that node migrated to run on the other quorum node. However, it failed to restart with this error, as seen in the pacemaker.log file:
      Apr 25 11:40:18.247 ps-4 pacemaker-schedulerd[3072816] (unpack_rsc_op) error: Preventing ps-1 from restarting on ps-4 because of hard failure (invalid parameter: Key was rejected by service) | ps-1_last_failure_0

      Please provide the package NVR for which the bug is seen: 

      Pacemaker 2.1.6-4.db2pcmk.el9

      How reproducible: Frequent

      Steps to reproduce

      1. Set up a cluster with 2 quorum nodes and 2 Pacemaker remote nodes, with a fencing agent defined to perform fencing.
      2. Run ifconfig <eth> down to take down the main interface on one quorum node (a shell sketch of these steps follows the list).
      3. The Pacemaker remote resource that was running on the fenced node migrated to run on the other quorum node but failed to start.
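
      A minimal shell sketch of steps 2 and 3, assuming hypothetical names (ps-4 as the quorum node whose interface is taken down, eth1 as its main interface); actual node, interface, and resource names depend on your cluster:

      # On the quorum node to be failed (test cluster only; this is disruptive)
      ifconfig eth1 down                # or: ip link set eth1 down

      # From a surviving cluster node, watch the remote connection resource attempt recovery
      pcs status                        # or: crm_mon --one-shot
      grep 'Key was rejected by service' /var/log/pacemaker/pacemaker.log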

      Expected results: Pacemaker remote resource migrated successfully

      Actual results: Pacemaker remote resource failed to start and stayed in stopped state.

      pcmk-Thu-25-Apr-2024.tar.bz2

            [RHEL-34276] Pacemaker remote resource migration failed

            Pinned comments

            Kenneth Gaillot (Inactive) added a comment -

            Bringing the target node's network interface down does not drop active TCP connections. From the remote node's point of view, the connection remains intact, but no data is arriving. (TCP timeouts will eventually kick in but are irrelevant for the time frame we're interested in, and even lowering the timeouts wouldn't likely help due to the next factor.)

            The remote daemon happens to be reading a message from the connection when the interface goes down. Currently, the remote daemon reads one message from the connection synchronously using a 60-second timeout, meaning that the daemon process can block for up to 60 seconds if no data is arriving.

            I believe the problem is that the remote can't accept the new connection while it's blocked on reading from the old connection. The solution would need to be in code, to make the read asynchronous (likely difficult but doable).

            Unfortunately there's no good workaround. The 60-second read timeout is hardcoded in Pacemaker. A hacky approach would be to group each remote connection with an ocf:heartbeat:Delay dummy resource with a 60-second start delay, but of course that would slow down recovery in more typical failure scenarios.
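
            A rough pcs sketch of the Delay-based grouping described above; the resource names (ps-1 for the remote connection resource, ps-1-delay for the dummy) are illustrative only, and the delay is placed first in the group so that it runs out the 60-second read timeout before the remote connection is restarted:

            # ps-1 is assumed to be an existing ocf:pacemaker:remote connection resource
            pcs resource create ps-1-delay ocf:heartbeat:Delay startdelay=60 op start timeout=90s
            pcs resource group add ps-1-group ps-1-delay ps-1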

            As an aside, bringing a network interface down does not simulate real-world networking problems well. Blocking inbound and outbound packets on the interface at the firewall level would be a better simulation of a cable pull scenario, though Pacemaker's behavior might not be any better.
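
            For the firewall-level simulation, something like the following (iptables shown; nftables or firewalld rich rules would work as well, and eth1 is an assumed interface name) drops traffic in both directions without taking the link down:

            # Simulate a cable pull on eth1 without dropping the link
            iptables -A INPUT  -i eth1 -j DROP
            iptables -A OUTPUT -o eth1 -j DROP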

            As another aside, it should not be necessary to change cluster-recheck-interval. Since Pacemaker 2.0.3 (RHEL 8.2), the recheck interval is dynamically calculated for everything besides rules with date_spec elements.


            All comments

            Kenneth Gaillot (Inactive) added a comment -

            slevine@redhat.com,

            > When a Pacemaker remote connection is interrupted and blocked during a TLS handshake, the remote connection is now successfully recovered elsewhere

            drop "and blocked" (it was blocked previously, and now is not)

            > Previously, when network communication was interrupted between a Pacemaker remote node and the cluster node hosting its connection while reading [data such as] the TLS handshake portion of the initial connection, the connection could block and could not be recovered on another cluster node. With this fix, the TLS handshake is now asynchronous and a remote connection is successfully recovered elsewhere.

            You could replace "while reading data such as" with "during"


            Steven Levine added a comment -

            kgaillot@redhat.com: I'm trying to come up with a release note description that's just a summary overview – a user who wants to know the details can go to the bug ticket.

            As a release note, does this make sense technically, and does it need more detail? Also, that first sentence is very long but I'm not sure how to split it. Could I eliminate the phrase "data such as", since the fix here seems to involve only the TLS handshake?

            .When a Pacemaker remote connection is interrupted and blocked during a TLS handshake, the remote connection is now successfully recovered elsewhere

            Previously, when network communication was interrupted between a Pacemaker remote node and the cluster node hosting its connection while reading [data such as] the TLS handshake portion of the initial connection, the connection could block and could not be recovered on another cluster node. With this fix, the TLS handshake is now asynchronous and a remote connection is successfully recovered elsewhere.


            Christopher Lumens added a comment -

            This project got more complicated than I anticipated. I'm going to use this issue to track just the portion that we were able to complete for RHEL-9.6 (handshaking, remote CIB ops, and some other minor stuff). I'll use the cloned issue RHEL-65544 for tracking the rest of it. This will also be covered in upstream issue T901.


            Kenneth Gaillot (Inactive) added a comment -

            I think it'll be better just to work on the async communication. Getting info from the CIB to remote nodes is a bit of a pain (especially keeping it in sync if it changes), and lowering the timeout wouldn't completely eliminate the problem, just reduce the chance of it.

            The 9.5 cycle is wrapping up right now, but hopefully we can get it for 9.6. It should be feasible to backport it too, since it will likely be purely on the daemon side.


            Lan Pham added a comment - - edited

            Would you consider making the timeout value of the Pacemaker remote daemon configurable?

            • Add support for a new Pacemaker remote resource instance attribute, daemon_check_timeout (a hypothetical configuration sketch follows this list).
            • Once the remote node successfully connects to the cluster, the daemon would read this value and configure itself accordingly. If the parameter is not set, the default value would be used.
            • Alternatively, read the monitor timeout value from the Pacemaker remote resource and use the same value for the daemon read timeout.
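
            If such an attribute were added, configuring it might look roughly like this; note that daemon_check_timeout is only the proposal above and is not an attribute Pacemaker currently supports, and ps-1 is a made-up remote connection resource name:

            # Hypothetical only: daemon_check_timeout is proposed here, not implemented in Pacemaker
            pcs resource update ps-1 daemon_check_timeout=30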


            Kenneth Gaillot (Inactive) added a comment -

            Bringing the target node's network interface down does not drop active TCP connections. From the remote node's point of view, the connection remains intact, but no data is arriving. (TCP timeouts will eventually kick in but are irrelevant for the time frame we're interested in, and even lowering the timeouts wouldn't likely help due to the next factor.)

            The remote daemon happens to be reading a message from the connection when the interface goes down. Currently, the remote daemon reads one message from the connection synchronously using a 60-second timeout, meaning that the daemon process can block for up to 60 seconds if no data is arriving.

            I believe the problem is that the remote can't accept the new connection while it's blocked on reading from the old connection. The solution would need to be in code, to make the read asynchronous (likely difficult but doable).

            Unfortunately there's no good workaround. The 60-second read timeout is hardcoded in Pacemaker. A hacky approach would be to group each remote connection with an ocf:heartbeat:Delay dummy resource with a 60-second start delay, but of course that would slow down recovery in more typical failure scenarios.

            As an aside, bringing a network interface down does not simulate real-world networking problems well. Blocking inbound and outbound packets on the interface at the firewall level would be a better simulation of a cable pull scenario, though Pacemaker's behavior might not be any better.

            As another aside, it should not be necessary to change cluster-recheck-interval. Since Pacemaker 2.0.3 (RHEL 8.2), the recheck interval is dynamically calculated for everything besides rules with date_spec elements.


            Lan Pham added a comment -

            Please post any update to the case.  We are blocked by this issue.


            Lan Pham added a comment - - edited

            I was able to reproduce the issue using RHEL 9.2 standard Pacemaker packages:

            Version: Pacemaker 2.1.5-9.el9_2.4
            Time of failure: Jul 16 15:51:

            I uploaded the latest crm_report collection file, named: pacemaker-remote-start-failure.tar.bz2 

            Cluster configuration:

            • RHEL 9.2
            • 2 full cluster nodes: lphamps-srv-1, lphamps-srv-2
            • 2 Pacemaker remote nodes: lphamps-srv-3, lphamps-srv-4

            Sequence of events:

            • At 15:51:51, the ethernet interface was brought down on host lphamps-srv-2 (via ifconfig eth1 down)
            • At 15:51:54, from the DC host lphamps-srv-1, the other cluster node, lphamps-srv-2, was detected as having left the cluster
            • At 15:52:02, the Pacemaker remote resource lphamps-srv-3 failed to start on cluster node lphamps-srv-1

            Jul 16 15:52:02.578 lphamps-srv-1 pacemaker-controld  [7137] (log_executor_event)       error: Result of start operation for lphamps-srv-3 on lphamps-srv-1: Error (Key was rejected by service) | CIB update 70, graph action confirmed; call=3 key=lphamps-srv-3_start_0

            • At 15:52:07, the Pacemaker remote resource lphamps-srv-4 failed to start on cluster node lphamps-srv-1

            Jul 16 15:52:07.592 lphamps-srv-1 pacemaker-controld  [7137] (log_executor_event)       error: Result of start operation for lphamps-srv-4 on lphamps-srv-1: Error (Key was rejected by service) | CIB update 76, graph action confirmed; call=4 key=lphamps-srv-4_start_0

            • Then both lphamps-srv-3 and lphamps-srv-4 hosts were fenced:

            Jul 16 15:52:07.628 lphamps-srv-1 pacemaker-schedulerd[7136] (pe_fence_node)    warning: Remote node lphamps-srv-4 will be fenced: db2_member_regress1_1 is thought to be active there

            Jul 16 15:52:07.631 lphamps-srv-1 pacemaker-schedulerd[7136] (pe_fence_node)    warning: Remote node lphamps-srv-3 will be fenced: db2_idle_regress1_999_lphamps-srv-3 is thought to be active there
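
            To pull these events out of the logs (a minimal sketch; /var/log/pacemaker/pacemaker.log is the default detail log on RHEL, adjust if PCMK_logfile points elsewhere):

            # Find the failed remote connection starts and the resulting fencing decisions
            grep -E 'Key was rejected by service|pe_fence_node' /var/log/pacemaker/pacemaker.log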


            Lan Pham added a comment -

            Ken, any further update to the case? We are blocked by this issue.


            Lan Pham added a comment -

            So we tried both workarounds (rough pcs sketches of both follow at the end of this comment):

            • The first workaround was to configure start-failure-is-fatal=false to let Pacemaker retry the failed start on the same host. With this setting, the failed start was retried and succeeded on the retry, BUT the cluster manager still triggered fencing of the remote node on the first failure. This is not what we want, as it would stop all resources on that host.
            • The second workaround was to configure an order constraint so that the remote resources would be started sequentially. After running for a few iterations, we still hit the original error "error: Result of start operation for ps-1 on ps-3: Error (Key was rejected by service) | graph action confirmed; call=3 key=ps-1_start_0". From the log, I saw clearly that ps-1 was started before ps-2, but the start of ps-1 still failed (and that triggered node fencing). ps-2 was started later and succeeded. So the error wasn't caused by starting multiple remote resources at the same time.

            We need a solution that would not cause the remote node to be fenced.

            Is there another workaround, or do we need to wait for the fix?
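
            For reference, a rough pcs sketch of the two workarounds described in this comment, assuming remote connection resources named ps-1 and ps-2:

            # Workaround 1: allow start retries on the same node (cluster-wide property)
            pcs property set start-failure-is-fatal=false

            # Workaround 2: start the remote connection resources sequentially
            pcs constraint order start ps-1 then start ps-2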


              rhn-support-clumens Christopher Lumens
              lpham@ca.ibm.com Lan Pham
              IBM Confidential Group
              Kenneth Gaillot (Inactive)
              Jana Rehova
              Steven Levine