- Bug
- Resolution: Unresolved
- rhel-8.4.0
- sst_high_availability
- ssg_filesystems_storage_and_HA
Description of problem:
During a hardware failure in a PCS cluster, fencing via IPMI failed because the node was physically broken, and as a result all resources on that node became UNCLEAN. However, the node's BMC was still correctly reporting that the node was OFF. The cluster should therefore have restarted those resources on a surviving node instead of leaving them UNCLEAN.
Version-Release number of selected component (if applicable):
fence-agents-ipmilan-4.2.1-89.el8
How reproducible:
I reproduced the customer's issue in my lab using KVM and VirtualBMC. The customer environment, however, used physical nodes in an OpenStack 16.1 deployment.
Steps to Reproduce:
1. Create a cluster that uses ipmilan for fencing
2. Ensure some resources are running on the node that will "fail"
3. Cause an issue on the node that prevents it from booting (in my lab I simply renamed the qcow2 image used for booting; in the customer environment the mainboard had failed).
4. Trigger fencing on the node
5. After some time, the node and resources will become UNCLEAN
6. 'fence_ipmilan ... -o status' will show that the node is "OFF"
Actual results:
Even though fence_ipmilan is able to query the status of the broken node, and the node is OFF, the resources still become UNCLEAN and do not fail over until the user manually confirms the fencing action.
Expected results:
As long as the BMC/iLO is able to report the node's status as "OFF", the cluster should restart that node's resources on surviving nodes without manual intervention.
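The expected decision logic can be sketched as follows. This is a minimal illustration only; the function and parameter names are hypothetical and are not part of the Pacemaker or fence-agents API:

```python
def can_recover(fence_action_succeeded: bool, bmc_power_state: str) -> bool:
    """Return True if resources can safely be restarted on another node.

    Hypothetical sketch of the requested behavior: even when the fence
    action itself fails (e.g. a dead mainboard prevents the node from
    completing a power cycle), a BMC power state of "OFF" proves the
    node cannot be running resources, so recovery should proceed
    without waiting for manual confirmation.
    """
    return fence_action_succeeded or bmc_power_state.strip().upper() == "OFF"
```

In the scenario described above, the fence action fails but the BMC reports "OFF", so recovery would proceed; only when the power state is genuinely unknown would manual confirmation still be required.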
Additional info: