RHEL-7662

fence_ipmilan should not leave resources in UNCLEAN state when status of failed node is known

    • Bug
    • Resolution: Unresolved
    • Undefined
    • None
    • rhel-8.4.0
    • fence-agents
    • sst_high_availability
    • ssg_filesystems_storage_and_HA
    • 5
    • False

      Description of problem:
      During a hardware failure in a PCS (Pacemaker) cluster, fencing via IPMI failed because the node was physically broken. As a result, all resources on that node were left in the UNCLEAN state. However, the node's BMC was still correctly reporting that the node was OFF. The cluster should therefore have restarted the resources on a surviving node instead of leaving them UNCLEAN.

      Version-Release number of selected component (if applicable):
      fence-agents-ipmilan-4.2.1-89.el8

      How reproducible:
      I have reproduced the customer issue in my lab using KVM and VirtualBMC. The customer environment, however, used physical nodes in an OpenStack 16.1 environment.

      Steps to Reproduce:
      1. Create a cluster that uses ipmilan for fencing
      2. Ensure some resources are running on the node that will "fail"
      3. Cause a failure on the node that prevents it from booting (in my lab, I simply renamed the qcow2 image used for booting; in the customer environment, the mainboard failed).
      4. Trigger fencing on the node
      5. After some time, the node and resources will become UNCLEAN
      6. 'fence_ipmilan ... -o status' will show that the node is "OFF"
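The status check in step 6 can be scripted. The following is a minimal sketch, not part of the original report: `build_status_command` and `parse_power_status` are hypothetical helper names, the long options are the ones I believe `fence_ipmilan` accepts (the report itself only shows `-o status`), and the `Status: OFF` output format is an assumption that should be verified against the installed fence-agents version.

```python
import re
from typing import List, Optional


def build_status_command(ip: str, username: str, password: str,
                         lanplus: bool = True) -> List[str]:
    """Build an argv list for a fence_ipmilan power-status query.

    Assumption: the agent accepts these long options. Passing the
    password on the command line exposes it in the process list, so
    this is only suitable for a lab setup.
    """
    cmd = [
        "fence_ipmilan",
        f"--ip={ip}",
        f"--username={username}",
        f"--password={password}",
        "--action=status",
    ]
    if lanplus:
        cmd.append("--lanplus")
    return cmd


# Assumed output format: a line such as "Status: OFF" or "Status: ON".
_STATUS_RE = re.compile(r"^\s*Status:\s*(ON|OFF)\s*$",
                        re.IGNORECASE | re.MULTILINE)


def parse_power_status(output: str) -> Optional[str]:
    """Return "ON" or "OFF" if a status line is found, otherwise None."""
    match = _STATUS_RE.search(output)
    return match.group(1).upper() if match else None
```

In a lab, the list returned by `build_status_command` could be passed to `subprocess.run(..., capture_output=True, text=True)` and the captured stdout handed to `parse_power_status`.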

      Actual results:
      Even though ipmilan is able to query the status of the broken node, and the node is OFF, the resources still become UNCLEAN and do not fail over until the user manually confirms the fencing action.

      Expected results:
      As long as the BMC/iLO is able to report the status of the node as "OFF", the cluster resources should be restarted on surviving nodes without manual intervention.
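The difference between the actual and expected results can be written as a small decision rule. This is only an illustrative sketch of the behavior requested in this report, not the logic of `fence_ipmilan` or Pacemaker; the names `Outcome`, `reported_outcome`, and `requested_outcome` are hypothetical.

```python
from enum import Enum
from typing import Optional


class Outcome(Enum):
    FENCED = "fenced"    # resources may be recovered on surviving nodes
    UNCLEAN = "unclean"  # resources stay blocked until manual confirmation


def reported_outcome(off_succeeded: bool, bmc_status: Optional[str]) -> Outcome:
    """Behavior as described under 'Actual results'.

    Only the result of the power-off action matters; the BMC status
    is ignored even when it says the node is OFF.
    """
    return Outcome.FENCED if off_succeeded else Outcome.UNCLEAN


def requested_outcome(off_succeeded: bool, bmc_status: Optional[str]) -> Outcome:
    """Behavior asked for under 'Expected results'.

    A failed power-off action is still treated as successful fencing
    when the BMC positively reports the node as OFF. An unknown status
    (None) never counts as OFF.
    """
    if off_succeeded or bmc_status == "OFF":
        return Outcome.FENCED
    return Outcome.UNCLEAN
```

For the scenario in this report (power-off action fails, BMC reports "OFF"), `reported_outcome(False, "OFF")` gives `Outcome.UNCLEAN`, while `requested_outcome(False, "OFF")` gives `Outcome.FENCED`. If the BMC itself is unreachable, both rules leave the node UNCLEAN.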

      Additional info:

            rhn-engineering-oalbrigt Oyvind Albrigtsen
            msecaur@redhat.com Matthew Secaur (Inactive)
            Oyvind Albrigtsen
            Cluster QE