- Bug
- Resolution: Unresolved
- rhel-8.4.0
- sst_high_availability
- ssg_filesystems_storage_and_HA
Description of problem:
During a hardware failure in a PCS cluster, fencing via IPMI failed because the node was physically broken, and as a result all resources on that node became UNCLEAN. However, the node's BMC was still correctly reporting that the node was OFF. The cluster should therefore have restarted those resources on a surviving node instead of leaving them UNCLEAN.
Version-Release number of selected component (if applicable):
fence-agents-ipmilan-4.2.1-89.el8
How reproducible:
I reproduced the customer's issue in my lab using KVM and VirtualBMC. The customer environment, however, used physical nodes in an OpenStack 16.1 deployment.
Steps to Reproduce:
1. Create a cluster that uses ipmilan for fencing
2. Ensure some resources are running on the node that will "fail"
3. Cause an issue on the node that prevents it from booting (in my lab I simply renamed the qcow2 image used for booting; in the customer environment the mainboard had failed).
4. Trigger fencing on the node
5. After some time, the node and resources will become UNCLEAN
6. 'fence_ipmilan ... -o status' will show that the node is "OFF"
Actual results:
Even though fence_ipmilan is able to query the status of the broken node, and the node is OFF, the resources still become UNCLEAN and do not fail over until the user manually confirms the fencing action.
Expected results:
As long as the BMC/iLO is able to report the node's status as "OFF", the cluster should restart that node's resources on surviving nodes without manual intervention.
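The expected decision logic can be sketched as follows. This is a minimal illustration only; the function and parameter names are hypothetical and are not part of the Pacemaker or fence-agents API:

```python
def can_recover(fence_action_succeeded: bool, bmc_power_state: str) -> bool:
    """Return True if resources can safely be restarted on another node.

    Hypothetical sketch of the requested behavior: even when the fence
    action itself fails (e.g. a dead mainboard prevents the node from
    completing a power cycle), a BMC power state of "OFF" proves the
    node cannot be running resources, so recovery should proceed
    without waiting for manual confirmation.
    """
    return fence_action_succeeded or bmc_power_state.strip().upper() == "OFF"
```

In the scenario described above, the fence action fails but the BMC reports "OFF", so recovery would proceed; only when the power state is genuinely unknown would manual confirmation still be required.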
Additional info: