-
Bug
-
Resolution: Unresolved
-
Major
-
None
-
rhel-8.1.0
-
None
-
Important
-
rhel-sst-high-availability
-
ssg_filesystems_storage_and_HA
-
8
Description of problem:
Neither the delay attribute nor the pcmk_delay_max attribute prevents a fence race in an AWS Pacemaker cluster.
This may be because the EC2 StopInstances endpoint initiates a graceful shutdown rather than a hard power-off, so the node being fenced may stay up long enough for its own delayed fence request to run. I have not yet found a way to perform a hard power-off of an EC2 instance.
https://docs.aws.amazon.com/AWSEC2/latest/APIReference/API_StopInstances.html
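The suspected mechanism can be illustrated with a small timing model. This is only a sketch of the hypothesis above, not a description of how Pacemaker or fence_aws is implemented: the node names, the function name, and the single "seconds until the instance actually stops" parameter are all assumptions made for illustration.

```python
def fenced_nodes(delay_on_a: float, stop_duration: float) -> set:
    """Return which nodes end up fenced after a heartbeat failure.

    Hypothetical two-node model:
    - The stonith device that fences node A carries a static delay, so
      node B's request to fence A only fires after delay_on_a seconds.
    - Node A's request to fence node B fires immediately (t = 0).
    - A fenced instance keeps running for stop_duration seconds after the
      stop request (0 for a hard power-off, larger for a graceful shutdown).
    """
    fenced = {"B"}  # A's undelayed request always reaches B first

    # If B is still running when its delayed request fires, A is fenced too.
    if stop_duration > delay_on_a:
        fenced.add("A")
    return fenced


# Hard power-off: B is gone before its delayed request can run.
print(sorted(fenced_nodes(delay_on_a=60, stop_duration=0)))
# Graceful shutdown that outlasts the delay: both nodes are fenced.
print(sorted(fenced_nodes(delay_on_a=60, stop_duration=90)))
```

Under this model, a delay only helps if it is longer than the time the fenced instance takes to stop, which matches the observation that a long enough delay might avoid the issue.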
Version-Release number of selected component (if applicable):
fence-agents-aws-4.2.1-30.el8_1.1.noarch
pacemaker-2.0.2-3.el8.x86_64
How reproducible:
Always so far. The issue may be avoidable with a long enough delay, but that's not practical in production.
Steps to Reproduce:
1. In a two-node cluster, configure one fence_aws stonith device for each node. The issue can also be reproduced with a single shared stonith device and pcmk_delay_max, but using per-node devices allows a static delay to be set for one node, which makes the test consistent.
2. Set delay=60 on one of the stonith devices.
3. Simulate a heartbeat network failure between nodes.
Actual results:
Both nodes get fenced. The node with the delay does not get rebooted until after its delay expires.
Expected results:
Only the node without the delay gets fenced.
Additional info:
This issue was reported by a user working for AWS. They've stated, "I'm more than happy to work side by side with Red Hat development to test and help to provide code for an improved agent version."