OpenShift Cloud / OCPCLOUD-1539

Create separate Node Drain Controller for Machine API Operator

    • Type: Story
    • Resolution: Done
    • Priority: Major
    • CLOUD Sprint 219, CLOUD Sprint 220, CLOUD Sprint 221

      User Story

      As a user of OpenShift, I want the draining of Machines with long-running workloads not to block the other actions of the Machine controller, so that when I delete a large number of Machines, the Machine controller can quickly create new Machines without waiting for the other Nodes to drain.

      Background

      This came from Slack: https://coreos.slack.com/archives/CFUGK0K9R/p1651807500538259

       

      A MachineSet on the build cluster was accidentally scaled down from 55 replicas to 0. Each of these Machines was running long-running test jobs protected by PDBs (PodDisruptionBudgets) that blocked eviction indefinitely.

      Because the Machine controller waits 20s on each reconcile before cancelling the drain, it took a very long time to process the requests to create new Machines.

      A single pass over the 55 deleting Machines took 55 × 20s = 18m20s. Each new Machine needs at least 3 reconciles before it is actually created (one to add the finalizer, one to set the phase to Provisioning, one to call Create on the provider), and each of those reconciles can queue behind a full pass, so a new Machine in this scenario takes around an hour (3 × 18m20s = 55m) to be created.

      To improve this, we should make the drain operation asynchronous from the other Machine operations.

      Steps

      We should move the drain logic into a separate controller that:

      • Checks if the Machine is to be deleted
      • Checks for the exclude node draining annotation, skips if set
      • Handles PreDrainHooks
      • Drains the Machine
      • Sets the Drained condition to true when it has finished

      Once this is done:

      • Modify the Machine controller to requeue the Machine when the Drained condition is False or missing
      • Remove logic around draining from the core of the Machine controller
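
      On the Machine controller side, the requeue behaviour might look roughly like the sketch below. The `Condition` and `Result` types, the `reconcileDelete` function, and the 20s requeue delay are all hypothetical and chosen only for illustration; the real controller would use the operator's own condition helpers and its controller framework's result type.

```go
package main

import (
	"fmt"
	"time"
)

// Condition is a hypothetical, simplified status condition.
type Condition struct {
	Type   string
	Status string // "True", "False" or "Unknown"
}

// Result is a hypothetical stand-in for a controller reconcile result.
type Result struct {
	RequeueAfter time.Duration
}

// Assumed delay, for illustration only.
const requeueDelay = 20 * time.Second

// isDrained reports whether a Drained condition exists and is True.
// A missing condition counts as not drained.
func isDrained(conds []Condition) bool {
	for _, c := range conds {
		if c.Type == "Drained" {
			return c.Status == "True"
		}
	}
	return false
}

// reconcileDelete returns immediately with a requeue while the drain is
// pending, so other Machines are not held up, and deletes the instance
// only once the drain controller has reported Drained=True.
func reconcileDelete(conds []Condition, deleteInstance func() error) (Result, error) {
	if !isDrained(conds) {
		return Result{RequeueAfter: requeueDelay}, nil
	}
	return Result{}, deleteInstance()
}

func main() {
	del := func() error { fmt.Println("deleting instance"); return nil }

	res, err := reconcileDelete(nil, del)
	fmt.Println("pending drain:", res.RequeueAfter, err)

	res, err = reconcileDelete([]Condition{{Type: "Drained", Status: "True"}}, del)
	fmt.Println("drained:", res.RequeueAfter, err)
}
```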

      Stakeholders

      • Cluster Infra
      • DPTP

      Definition of Done

      • Machine draining is asynchronous to other Machine operations
      • Docs
          • Likely need to update the Machine docs to explain this change; a release note is required
      • Testing
          • Delete a large number of Machines with PDBs blocking their removal and check that new Machines come up

              dmoiseev Denis Moiseev (Inactive)
              joelspeed Joel Speed
              Zhaohua Sun Zhaohua Sun