• pacemaker-2.1.9-1.el9
    • None
    • Important
    • rhel-sst-high-availability
    • 13
    • 5
    • Dev ack
    • False
    • Yes
    • None
    • Bug Fix
      .Successful recovery of an interrupted Pacemaker remote connection

      Before this update, when network communication was interrupted between a Pacemaker remote node and the cluster node hosting its connection during the TLS handshake portion of the initial connection, the connection in some cases blocked and could not be recovered on another cluster node. With this update, the TLS handshake is asynchronous and a remote connection is successfully recovered elsewhere.
    • Done
    • x86_64
    • None

      What were you trying to do that didn't work?  

      The cluster consists of 2 quorum nodes and 2 Pacemaker remote nodes. When a quorum node was fenced (this is part of the testing), the Pacemaker remote resource that was running on that node migrated to run on the other quorum node. However, it failed to restart with this error, as seen in the pacemaker.log file:
      Apr 25 11:40:18.247 ps-4 pacemaker-schedulerd[3072816] (unpack_rsc_op) error: Preventing ps-1 from restarting on ps-4 because of hard failure (invalid parameter: Key was rejected by service) | ps-1_last_failure_0

      Please provide the package NVR for which the bug is seen: 

      Pacemaker 2.1.6-4.db2pcmk.el9

      How reproducible: Frequent

      Steps to reproduce

      1. Set up a cluster with 2 quorum nodes and 2 Pacemaker remote nodes, with a fencing agent defined to perform fencing.
      2. Run ifconfig <eth> down to take down the main interface on one quorum node (a shell sketch of these steps follows the list).
      3. The Pacemaker remote resource that was running on the fenced node migrated to run on the other quorum node but failed to start.
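
      A minimal shell sketch of steps 2 and 3, assuming hypothetical names (ps-4 as the quorum node whose interface is taken down, eth1 as its main interface); actual node, interface, and resource names depend on your cluster:

      # On the quorum node to be failed (test cluster only; this is disruptive)
      ifconfig eth1 down                # or: ip link set eth1 down

      # From a surviving cluster node, watch the remote connection resource attempt recovery
      pcs status                        # or: crm_mon --one-shot
      grep 'Key was rejected by service' /var/log/pacemaker/pacemaker.log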

      Expected results: Pacemaker remote resource migrated successfully

      Actual results: Pacemaker remote resource failed to start and stayed in stopped state.

      pcmk-Thu-25-Apr-2024.tar.bz2

            [RHEL-34276] Pacemaker remote resource migration failed

            Pinned comments

            Kenneth Gaillot (Inactive) added a comment -

            Bringing the target node's network interface down does not drop active TCP connections. From the remote node's point of view, the connection remains intact, but no data is arriving. (TCP timeouts will eventually kick in but are irrelevant for the time frame we're interested in, and even lowering the timeouts wouldn't likely help due to the next factor.)

            The remote daemon happens to be reading a message from the connection when the interface goes down. Currently, the remote daemon reads one message from the connection synchronously using a 60-second timeout, meaning that the daemon process can block for up to 60 seconds if no data is arriving.

            I believe the problem is that the remote can't accept the new connection while it's blocked on reading from the old connection. The solution would need to be in code, to make the read asynchronous (likely difficult but doable).

            Unfortunately there's no good workaround. The 60-second read timeout is hardcoded in Pacemaker. A hacky approach would be to group each remote connection with an ocf:heartbeat:Delay dummy resource with a 60-second start delay, but of course that would slow down recovery in more typical failure scenarios.
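
            A rough pcs sketch of the Delay-based grouping described above; the resource names (ps-1 for the remote connection resource, ps-1-delay for the dummy) are illustrative only, and the delay is placed first in the group so that it runs out the 60-second read timeout before the remote connection is restarted:

            # ps-1 is assumed to be an existing ocf:pacemaker:remote connection resource
            pcs resource create ps-1-delay ocf:heartbeat:Delay startdelay=60 op start timeout=90s
            pcs resource group add ps-1-group ps-1-delay ps-1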

            As an aside, bringing a network interface down does not simulate real-world networking problems well. Blocking inbound and outbound packets on the interface at the firewall level would be a better simulation of a cable pull scenario, though Pacemaker's behavior might not be any better.
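
            For the firewall-level simulation, something like the following (iptables shown; nftables or firewalld rich rules would work as well, and eth1 is an assumed interface name) drops traffic in both directions without taking the link down:

            # Simulate a cable pull on eth1 without dropping the link
            iptables -A INPUT  -i eth1 -j DROP
            iptables -A OUTPUT -o eth1 -j DROP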

            As another aside, it should not be necessary to change cluster-recheck-interval. Since Pacemaker 2.0.3 (RHEL 8.2), the recheck interval is dynamically calculated for everything besides rules with date_spec elements.


            All comments

            Kenneth Gaillot (Inactive) added a comment -

            slevine@redhat.com,

            > When a Pacemaker remote connection is interrupted and blocked during a TLS handshake, the remote connection is now successfully recovered elsewhere

            drop "and blocked" (it was blocked previously, and now is not)

            > Previously, when network communication was interrupted between a Pacemaker remote node and the cluster node hosting its connection while reading [data such as] the TLS handshake portion of the initial connection, the connection could block and could not be recovered on another cluster node. With this fix, the TLS handshake is now asynchronous and a remote connection is successfully recovered elsewhere.

            You could replace "while reading data such as" with "during"


            Steven Levine added a comment -

            kgaillot@redhat.com: I'm trying to come up with a release note description that's just a summary overview – a user who wants to know the details can go to the bug ticket.

            As a release note, does this make sense technically, and does it need more detail? Also, that first sentence is very long but I'm not sure how to split it. Could I eliminate the phrase "data such as", since the fix here seems to involve only the TLS handshake?

            .When a Pacemaker remote connection is interrupted and blocked during a TLS handshake, the remote connection is now successfully recovered elsewhere

            Previously, when network communication was interrupted between a Pacemaker remote node and the cluster node hosting its connection while reading [data such as] the TLS handshake portion of the initial connection, the connection could block and could not be recovered on another cluster node. With this fix, the TLS handshake is now asynchronous and a remote connection is successfully recovered elsewhere.


            Christopher Lumens added a comment -

            This project got more complicated than I anticipated. I'm going to use this issue to track just the portion that we were able to complete for RHEL-9.6 (handshaking, remote CIB ops, and some other minor stuff). I'll use the cloned issue RHEL-65544 for tracking the rest of it. This will also be covered in upstream issue T901.


            Kenneth Gaillot (Inactive) added a comment -

            I think it'll be better just to work on the async communication. Getting info from the CIB to remote nodes is a bit of a pain (especially keeping it in sync if it changes), and lowering the timeout wouldn't completely eliminate the problem, just reduce the chance of it.

            The 9.5 cycle is wrapping up right now, but hopefully we can get it for 9.6. It should be feasible to backport it too, since it will likely be purely on the daemon side.


            Lan Pham added a comment - - edited

            Would you consider making the timeout value of the Pacemaker remote daemon configurable?

            • Add support for a new Pacemaker remote resource instance attribute, daemon_check_timeout (a hypothetical configuration sketch follows this list).
            • Once the remote node successfully connects to the cluster, the daemon would read this value and configure itself accordingly. If the parameter is not set, the default value would be used.
            • Alternatively, read the monitor timeout value from the Pacemaker remote resource and use the same value for the daemon read timeout.
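
            If such an attribute were added, configuring it might look roughly like this; note that daemon_check_timeout is only the proposal above and is not an attribute Pacemaker currently supports, and ps-1 is a made-up remote connection resource name:

            # Hypothetical only: daemon_check_timeout is proposed here, not implemented in Pacemaker
            pcs resource update ps-1 daemon_check_timeout=30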


            Kenneth Gaillot (Inactive) added a comment -

            Bringing the target node's network interface down does not drop active TCP connections. From the remote node's point of view, the connection remains intact, but no data is arriving. (TCP timeouts will eventually kick in but are irrelevant for the time frame we're interested in, and even lowering the timeouts wouldn't likely help due to the next factor.)

            The remote daemon happens to be reading a message from the connection when the interface goes down. Currently, the remote daemon reads one message from the connection synchronously using a 60-second timeout, meaning that the daemon process can block for up to 60 seconds if no data is arriving.

            I believe the problem is that the remote can't accept the new connection while it's blocked on reading from the old connection. The solution would need to be in code, to make the read asynchronous (likely difficult but doable).

            Unfortunately there's no good workaround. The 60-second read timeout is hardcoded in Pacemaker. A hacky approach would be to group each remote connection with an ocf:heartbeat:Delay dummy resource with a 60-second start delay, but of course that would slow down recovery in more typical failure scenarios.

            As an aside, bringing a network interface down does not simulate real-world networking problems well. Blocking inbound and outbound packets on the interface at the firewall level would be a better simulation of a cable pull scenario, though Pacemaker's behavior might not be any better.

            As another aside, it should not be necessary to change cluster-recheck-interval. Since Pacemaker 2.0.3 (RHEL 8.2), the recheck interval is dynamically calculated for everything besides rules with date_spec elements.


            Lan Pham added a comment -

            Please post any update to the case.  We are blocked by this issue.


            Lan Pham added a comment - - edited

            I was able to reproduce the issue using RHEL 9.2 standard Pacemaker packages:

            Version: Pacemaker 2.1.5-9.el9_2.4
            Time of failure: Jul 16 15:51:

            I uploaded the latest crm_report collection file, named: pacemaker-remote-start-failure.tar.bz2 

            Cluster configuration:

            • RHEL 9.2
            • 2 full cluster nodes: lphamps-srv-1, lphamps-srv-2
            • 2 Pacemaker remote nodes: lphamps-srv-3, lphamps-srv-4

            Sequence of events:

            • At 15:51:51, the ethernet interface was brought down on host lphamps-srv-2 (via ifconfig eth1 down)
            • At 15:51:54, from the DC host lphamps-srv-1, the other cluster node, lphamps-srv-2, was detected as having left the cluster
            • At 15:52:02, the Pacemaker remote resource lphamps-srv-3 failed to start on cluster node lphamps-srv-1

            Jul 16 15:52:02.578 lphamps-srv-1 pacemaker-controld  [7137] (log_executor_event)       error: Result of start operation for lphamps-srv-3 on lphamps-srv-1: Error (Key was rejected by service) | CIB update 70, graph action confirmed; call=3 key=lphamps-srv-3_start_0

            • At 15:52:07, the Pacemaker remote resource lphamps-srv-4 failed to start on cluster node lphamps-srv-1

            Jul 16 15:52:07.592 lphamps-srv-1 pacemaker-controld  [7137] (log_executor_event)       error: Result of start operation for lphamps-srv-4 on lphamps-srv-1: Error (Key was rejected by service) | CIB update 76, graph action confirmed; call=4 key=lphamps-srv-4_start_0

            • Then both lphamps-srv-3 and lphamps-srv-4 hosts were fenced:

            Jul 16 15:52:07.628 lphamps-srv-1 pacemaker-schedulerd[7136] (pe_fence_node)    warning: Remote node lphamps-srv-4 will be fenced: db2_member_regress1_1 is thought to be active there

            Jul 16 15:52:07.631 lphamps-srv-1 pacemaker-schedulerd[7136] (pe_fence_node)    warning: Remote node lphamps-srv-3 will be fenced: db2_idle_regress1_999_lphamps-srv-3 is thought to be active there
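
            To pull these events out of the logs (a minimal sketch; /var/log/pacemaker/pacemaker.log is the default detail log on RHEL, adjust if PCMK_logfile points elsewhere):

            # Find the failed remote connection starts and the resulting fencing decisions
            grep -E 'Key was rejected by service|pe_fence_node' /var/log/pacemaker/pacemaker.log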


            Lan Pham added a comment -

            Ken, any further update to the case? We are blocked by this issue.


            Lan Pham added a comment -

            So we tried both workarounds (rough pcs sketches of both follow at the end of this comment):

            • The first workaround was to configure start-failure-is-fatal=false to let Pacemaker retry the failed start on the same host. With this setting, the failed start was retried and succeeded on the retry, BUT the cluster manager still triggered fencing of the remote node on the first failure. This is not what we want, as it would stop all resources on that host.
            • The second workaround was to configure an order constraint so that the remote resources would be started sequentially. After running for a few iterations, we still hit the original error "error: Result of start operation for ps-1 on ps-3: Error (Key was rejected by service) | graph action confirmed; call=3 key=ps-1_start_0". From the log, I saw clearly that ps-1 was started before ps-2, but the start of ps-1 still failed (and that triggered node fencing). ps-2 was started later and succeeded. So the error wasn't caused by starting multiple remote resources at the same time.

            We need a solution that would not cause the remote node to be fenced.

            Is there another workaround, or do we need to wait for the fix?
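
            For reference, a rough pcs sketch of the two workarounds described in this comment, assuming remote connection resources named ps-1 and ps-2:

            # Workaround 1: allow start retries on the same node (cluster-wide property)
            pcs property set start-failure-is-fatal=false

            # Workaround 2: start the remote connection resources sequentially
            pcs constraint order start ps-1 then start ps-2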


              rhn-support-clumens Christopher Lumens
              lpham@ca.ibm.com Lan Pham
              IBM Confidential Group
              Kenneth Gaillot (Inactive)
              Jana Rehova
              Steven Levine