OpenShift Request For Enhancement · RFE-3923

Make the bare-metal node fencing process faster


    • Type: Feature Request
    • Resolution: Done
    • Priority: Critical
    • Component: API

      1. Proposed title of this feature request

      Fast baremetal node fencing.

      2. What is the nature and description of the request?

      Currently the metal3 components running in the openshift-machine-api namespace are configured to run with a single replica and with tolerations for node.kubernetes.io/not-ready and node.kubernetes.io/unreachable set to 120 seconds.
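
      As an illustration, the toleration configuration being described corresponds roughly to the following pod spec fragment (a minimal sketch; the exact operator-rendered manifests may differ):

      tolerations:
      - key: node.kubernetes.io/not-ready
        operator: Exists
        effect: NoExecute
        tolerationSeconds: 120
      - key: node.kubernetes.io/unreachable
        operator: Exists
        effect: NoExecute
        tolerationSeconds: 120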

       

      This makes node fencing unnecessarily long when the outage affects the node where these resources were running: they have to be recreated on the remaining nodes, with the extra 120 seconds of delay, before the affected node can be fenced.

       

      3. Why does the customer need this? (List the business requirements here)

      During the PoC we were running a bare-metal compact cluster with 3 master nodes, installed with the Agent-Based Installer. The customer has defined one of the use cases as ensuring that OCP-V based VMs are rescheduled quickly in case of a single master node failure. To fence the failed node and allow the VMs to restart on the remaining nodes I am using a MachineHealthCheck as described here:
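
      For context, a MachineHealthCheck of this kind looks roughly as follows; the name, selector labels and timeout values are illustrative assumptions, not the exact object used in the PoC:

      apiVersion: machine.openshift.io/v1beta1
      kind: MachineHealthCheck
      metadata:
        name: control-plane-fencing          # illustrative name
        namespace: openshift-machine-api
      spec:
        selector:
          matchLabels:
            machine.openshift.io/cluster-api-machine-role: master   # illustrative selector
        unhealthyConditions:
        - type: Ready
          status: Unknown
          timeout: 300s
        - type: Ready
          status: "False"
          timeout: 300s
        maxUnhealthy: 1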

      The problem I observe is the time required to fence the node when that node is hosting pods that are crucial to the fencing process, for instance:

      $ oc get pods -n openshift-machine-api -o wide
      NAME                                                  READY   STATUS    RESTARTS         AGE     IP             NODE    NOMINATED NODE   READINESS GATES
      cluster-autoscaler-operator-74f9b6b57c-f9r7t          2/2     Running   10 (2d19h ago)   3d      10.132.0.89    node2   <none>           <none>
      cluster-baremetal-operator-79c464cc4d-v7hdl           2/2     Running   4                3d      10.132.0.96    node2   <none>           <none>
      control-plane-machine-set-operator-5f94b56df6-lhlhx   1/1     Running   8 (2d19h ago)    3d      10.132.0.125   node2   <none>           <none>
      ironic-proxy-b6qbh                                    1/1     Running   0                2d21h   10.90.26.21    node1   <none>           <none>
      ironic-proxy-lfmq8                                    1/1     Running   0                2d19h   10.90.26.23    node3   <none>           <none>
      ironic-proxy-sdtfc                                    1/1     Running   2                3d1h    10.90.26.22    node2   <none>           <none>
      machine-api-controllers-7f65fc86-npn98                7/7     Running   43 (2d19h ago)   3d1h    10.132.0.8     node2   <none>           <none>
      machine-api-operator-768598bf68-gqs7d                 2/2     Running   6 (2d19h ago)    3d1h    10.132.0.49    node2   <none>           <none>
      metal3-55fc56d99b-tcbk6                               5/5     Running   10               3d1h    10.90.26.22    node2   <none>           <none>
      metal3-image-customization-dc549f75b-wwb9n            1/1     Running   2                3d1h    10.132.0.45    node2   <none>           <none>

      If a crash affects node2, these pods first have to be rescheduled onto the other nodes before they can fence node2.

      Couldn't we run all the resources necessary for the fencing process on more nodes simultaneously? This would speed up the fencing process a lot. If that is difficult to achieve due to the initial design, would it be possible to minimise the delay before the metal3 pods are rescheduled once the node hosting them becomes not-ready or unreachable?
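
      For reference, the tolerations currently applied can be inspected directly on the deployment (assuming the deployment is named metal3, as the pod listing above suggests):

      $ oc get deployment metal3 -n openshift-machine-api \
          -o jsonpath='{.spec.template.spec.tolerations}{"\n"}'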

       

      4. List any affected packages or components.

      openshift-machine-api 

            wcabanba@redhat.com William Caban
            rszmigie Rafal Szmigiel