OpenShift Request For Enhancement / RFE-6315

Support to recover from HW/BIOS failures on baremetal MNO cluster

    • Type: Feature Request
    • Resolution: Unresolved
    • Priority: Normal
    • openshift-4.18

      1. Proposed title of this feature request

      Support to recover from HW/BIOS failures on baremetal MNO cluster.

      2. What is the nature and description of the request?

      This RFE concerns baremetal clusters with a relatively large number of nodes (potentially 100+). These clusters are composed of multiple worker nodes in the field hosting telco workloads that consume additional secondary HW NICs and require specific HW such as GPUs.

      One example is the failure of secondary HW NICs, which can be caused by a firmware error, a driver error, a NIC port going down because of a misconfigured ToR switch, etc. A specific case of BIOS failure is a node in the telco field that does not have SecureBoot enabled.

      In these cases the node is healthy from an OCP perspective but not from the customer's perspective, because the workloads hosted on the node are affected.

      The request is to support:

      • Detecting HW/BIOS failures (i.e., detecting that a worker node needs to go into maintenance mode due to a HW/BIOS failure)
      • Setting/unsetting maintenance mode on a worker node (start cordoning and draining the node once it is marked as "unhealthy")
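
      The two requested capabilities could be sketched roughly as follows. This is only an illustration under stated assumptions, not a proposed implementation: the NodeReport structure, the check names (secondary_nic_link, secure_boot_enabled) and the plan_actions helper are hypothetical, and the source of the HW/BIOS check results is left open. The only real commands referenced are oc adm cordon, oc adm drain and oc adm uncordon; the sketch builds them but does not run them.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class NodeReport:
    """Hypothetical per-node report; all field names are illustrative."""
    name: str
    ocp_ready: bool  # what OCP reports (Node "Ready" condition)
    hw_checks: Dict[str, bool] = field(default_factory=dict)  # check name -> passed
    in_maintenance: bool = False


def failed_checks(report: NodeReport) -> List[str]:
    """Return the names of the HW/BIOS checks that did not pass, sorted."""
    return sorted(name for name, passed in report.hw_checks.items() if not passed)


def plan_actions(report: NodeReport) -> List[List[str]]:
    """Return the oc commands (as argv lists) needed for this node.

    A node can be Ready from the OCP side and still need maintenance,
    so the decision deliberately ignores report.ocp_ready.
    """
    failures = failed_checks(report)
    if failures and not report.in_maintenance:
        # Unhealthy from the customer's perspective: cordon, then drain.
        return [
            ["oc", "adm", "cordon", report.name],
            ["oc", "adm", "drain", report.name,
             "--ignore-daemonsets", "--delete-emptydir-data"],
        ]
    if not failures and report.in_maintenance:
        # All checks pass again: make the node schedulable.
        return [["oc", "adm", "uncordon", report.name]]
    return []


if __name__ == "__main__":
    report = NodeReport(
        name="worker-17",  # hypothetical node name
        ocp_ready=True,
        hw_checks={"secondary_nic_link": False, "secure_boot_enabled": True},
    )
    for command in plan_actions(report):
        print(" ".join(command))
```

      In this sketch a node that OCP reports as Ready is still cordoned and drained when any HW/BIOS check fails, which is the gap the request describes. Returning the commands instead of executing them keeps the decision logic separate from whatever component would carry out the maintenance.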

      3. Why does the customer need this? (List the business requirements here)

      This request will eliminate the need for the customer to automate the management of baremetal state resources or to manually check the BIOS of the nodes in order to determine which nodes are healthy enough to host their applications.

      4. List any affected packages or components.

      OpenShift, ACM

            fbaudin@redhat.com Franck Baudin
            jnunez@redhat.com Jose Nuñez