Type: Bug
Resolution: Unresolved
Priority: Major
Fix Version/s: None
Affects Version/s: rhel-8.1.0
Component/s: fence-agents
Labels:

Regression:
None
Severity:
Important

Pool Team:

rhel-sst-high-availability
Sub-System Group:

ssg_filesystems_storage_and_HA

Story Points:
8
Blocked:
False
Blocked Reason:

None
Product Documentation Required:
None
Sprint:
None

Preliminary Testing:
None
Test Coverage:
None

Release Note Type:
If docs needed, set a value

Experience:
Architecture:

All
Bugzilla Bug:
RHBZ: 1780515


Planning:
None

Description of problem:

Neither a delay attribute nor a pcmk_delay_max attribute properly prevents a fence race in an AWS Pacemaker cluster.

This may be because the EC2 StopInstances endpoint initiates a graceful shutdown rather than a hard reboot. I have yet to find a way to perform a hard power-off of an EC2 instance.

https://docs.aws.amazon.com/AWSEC2/latest/APIReference/API_StopInstances.html
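The suspected mechanism can be illustrated with a toy timeline model. This is only a sketch under stated assumptions, not a description of how Pacemaker or fence_aws is implemented: both nodes lose contact at t=0, each node requests fencing of its peer once the delay configured for that peer expires, and a node that has been asked to stop keeps running (and can still fence) for `shutdown_time` seconds. The function name, node labels, and timing values are hypothetical.

```python
def fenced_nodes(delay_a, delay_b, shutdown_time):
    """Return the labels of the nodes that end up fenced.

    delay_a / delay_b: seconds the peer waits before fencing node A / node B.
    shutdown_time: seconds a node stays alive after the stop request
                   (0 models a hard power-off; this is an assumption).
    """
    # The node with the shorter delay is always fenced first.
    first, second = sorted([("A", delay_a), ("B", delay_b)], key=lambda n: n[1])
    fenced = [first[0]]

    # The first victim stays alive until its graceful shutdown completes.
    first_down_at = first[1] + shutdown_time

    # If the second node's delay expires while the first victim is still
    # running, the first victim fences it too: the fence race is not prevented.
    if second[1] <= first_down_at:
        fenced.append(second[0])
    return fenced


# Illustrative values only: delay=60 on node B's device, none on node A's.
print(fenced_nodes(0, 60, shutdown_time=0))   # hard power-off
print(fenced_nodes(0, 60, shutdown_time=90))  # slow graceful shutdown
```

In this model a delay only helps if it is longer than the time the graceful shutdown takes, which matches the observation that a long enough delay may avoid the issue.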

Version-Release number of selected component (if applicable):

fence-agents-aws-4.2.1-30.el8_1.1.noarch
pacemaker-2.0.2-3.el8.x86_64

How reproducible:

Always so far. The issue may be avoidable with a long enough delay, but that's not practical in production.

Steps to Reproduce:
1. In a two-node cluster, configure one fence_aws stonith device for each node. The issue can also be reproduced with a shared stonith device and pcmk_delay_max, but using one device per node allows a static delay to be set for a single node, which makes the test results consistent.

2. Set delay=60 on one of the stonith devices.

3. Simulate a heartbeat network failure between nodes.
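Steps 1 and 2 might look roughly like the commands built below. This is a minimal sketch, not the reporter's exact configuration: the device IDs, node names, instance IDs, and region are hypothetical placeholders, and credential options are left out. The script only prints the `pcs` commands; it does not run them.

```python
import shlex


def stonith_create_cmd(device_id, node, instance_id, region, delay=None):
    """Build a 'pcs stonith create' command line for one fence_aws device."""
    args = [
        "pcs", "stonith", "create", device_id, "fence_aws",
        f"region={region}",
        # Map the cluster node name to its EC2 instance ID.
        f"pcmk_host_map={node}:{instance_id}",
    ]
    if delay is not None:
        # Static delay before this device fences its target (step 2).
        args.append(f"delay={delay}")
    return shlex.join(args)


# Hypothetical names and IDs, for illustration only.
commands = [
    stonith_create_cmd("fence-node1", "node1", "i-0123456789abcdef0", "us-east-1"),
    stonith_create_cmd("fence-node2", "node2", "i-0fedcba9876543210", "us-east-1", delay=60),
]
for cmd in commands:
    print(cmd)
```

With this layout, fencing of node2 is the delayed action, so node1 would be expected to be the only node fenced if the delay worked as intended.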

Actual results:

Both nodes get fenced. The node with the delay does not get rebooted until after its delay expires.

Expected results:

Only the node without the delay gets fenced.

Additional info:

This issue was reported by a user working for AWS. They've stated, "I'm more than happy to work side by side with Red Hat development to test and help to provide code for an improved agent version."

external trackers

PnT-DevOps Jira RHELPLAN-34360

Red Hat Customer Portal 02535181

Red Hat Customer Portal 02535182

Red Hat Customer Portal 02842249

Red Hat Customer Portal 03122081

Red Hat Issue Tracker RHELPLAN-34360

Red Hat Knowledge Base (Solution) 4642491

Red Hat Knowledge Base (Solution) 4644971

links to

Support Policies for RHEL High Availability Clusters - sbd and fence_sbd
