Type: Story
Resolution: Done
Priority: Major
Fix Version/s: None
Affects Version/s: None
Labels:
- groomed

Story Points: 5
Blocked: False
Blocked Reason: None
Ready: False
Epic Link: OCPCLOUD-737

Sprint: CLOUD Sprint 219, CLOUD Sprint 220, CLOUD Sprint 221

Target Version: openshift-4.12

User Story

As a user of OpenShift, I want Machines that are draining long-running workloads not to block the other actions of the Machine controller, so that when I delete a large number of Machines, the Machine controller can quickly create new Machines without waiting for the other Nodes to drain.

Background

This came from Slack: https://coreos.slack.com/archives/CFUGK0K9R/p1651807500538259

A MachineSet on the build cluster was accidentally scaled down from 55 replicas to 0. Each of these Machines was running long-running test jobs protected by PDBs (PodDisruptionBudgets) that block eviction indefinitely.

Because the Machine controller waits 20s on each reconcile before cancelling the drain, the requests to create new Machines took a very long time to process.

Reconciling each of the 55 draining Machines took 20s, so one pass over all of them took 18m20s (55 × 20s). Each new Machine needs at least 3 reconciles before it is actually created (one to add the finalizer, one to set the phase to provisioning, one to call Create on the provider), so a new Machine in this scenario takes around an hour (3 × 18m20s = 55m) to be created.
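The estimate above can be reproduced with a few lines of arithmetic. This is only a back-of-the-envelope sketch: the constant names are made up for illustration, and it assumes each of a new Machine's reconciles waits behind one full pass over the draining Machines.

```python
from datetime import timedelta

# Figures taken from the ticket; the names are illustrative only.
DRAIN_WAIT = timedelta(seconds=20)  # wait per reconcile before the drain is cancelled
DRAINING_MACHINES = 55              # Machines stuck draining behind blocking PDBs
RECONCILES_TO_CREATE = 3            # finalizer, provisioning phase, provider Create

# Assumption: every reconcile of a new Machine is queued behind one full
# pass over all of the draining Machines.
one_pass = DRAIN_WAIT * DRAINING_MACHINES
time_to_create = one_pass * RECONCILES_TO_CREATE

print(one_pass)        # 0:18:20
print(time_to_create)  # 0:55:00
```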

To improve this, we should make the drain operation asynchronous from the other Machine operations.

Steps

We should move the drain logic into a separate controller that:

- Checks if the Machine is to be deleted
- Checks for the exclude node draining annotation, skips if set
- Handles PreDrainHooks
- Drains the Machine
- Sets the Drained condition to true when it has finished
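The steps above can be sketched roughly as follows. This is a simplified model, not the controller's actual code: the `Machine` dataclass, the annotation key, the `drain_node` callback and the returned outcome strings are all hypothetical. It also assumes that a skipped drain still sets the Drained condition, which the ticket does not state.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional

# Hypothetical annotation key; the real key is not given in this ticket.
EXCLUDE_DRAIN_ANNOTATION = "example.com/exclude-node-draining"


@dataclass
class Machine:
    """Simplified stand-in for a Machine, not the real API type."""
    name: str
    deletion_timestamp: Optional[str] = None
    annotations: Dict[str, str] = field(default_factory=dict)
    pre_drain_hooks: List[str] = field(default_factory=list)
    conditions: Dict[str, bool] = field(default_factory=dict)


def reconcile_drain(machine: Machine, drain_node: Callable[[Machine], bool]) -> str:
    """Run one pass of the drain controller and report what happened."""
    # Only Machines marked for deletion are of interest.
    if machine.deletion_timestamp is None:
        return "not-deleting"
    # Nothing to do once the Drained condition is already true.
    if machine.conditions.get("Drained", False):
        return "already-drained"
    # Assumption: a skipped drain still sets Drained, so the Machine
    # controller is not left waiting on a condition that never arrives.
    if EXCLUDE_DRAIN_ANNOTATION in machine.annotations:
        machine.conditions["Drained"] = True
        return "drain-skipped"
    # Outstanding PreDrainHooks hold the drain back until they are removed.
    if machine.pre_drain_hooks:
        return "waiting-on-pre-drain-hooks"
    # drain_node returns False while eviction is still blocked (e.g. by a PDB).
    if not drain_node(machine):
        machine.conditions["Drained"] = False
        return "drain-incomplete"
    machine.conditions["Drained"] = True
    return "drained"
```

Because each call returns immediately with an outcome, a blocked drain on one Machine only delays that Machine's own next attempt rather than the reconciles of other Machines.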

Once this is done:

- Modify the Machine controller to requeue the Machine when the Drained condition is False or missing
- Remove the drain logic from the core of the Machine controller
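A minimal sketch of how the Machine controller's deletion path might gate on the Drained condition once the drain logic has moved out. The `Machine` dataclass, the `reconcile_delete` function, the action strings and the requeue interval are all hypothetical, chosen only to illustrate the requeue behaviour described above.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple

REQUEUE_AFTER_SECONDS = 20  # hypothetical requeue interval


@dataclass
class Machine:
    """Simplified stand-in for a Machine, not the real API type."""
    name: str
    deletion_timestamp: Optional[str] = None
    conditions: Dict[str, bool] = field(default_factory=dict)


def reconcile_delete(machine: Machine) -> Tuple[str, Optional[int]]:
    """Return (action, requeue_after_seconds) for one reconcile."""
    # Machines that are not being deleted take no part in this path.
    if machine.deletion_timestamp is None:
        return ("noop", None)
    # A missing Drained condition is treated the same as False: come back
    # later instead of waiting in place for the drain to finish.
    if not machine.conditions.get("Drained", False):
        return ("requeue", REQUEUE_AFTER_SECONDS)
    # The drain has finished elsewhere, so deletion can proceed.
    return ("delete-instance", None)
```

The point of the requeue is that the reconcile returns straight away, so the worker is free to handle other Machines, such as newly created ones, while the drain continues separately.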

Stakeholders

Cluster Infra
DPTP

Definition of Done

Machine draining is asynchronous with respect to other Machine operations

Docs

The Machine docs will likely need updating to explain this change. A release note is required.

Testing

Delete a large number of Machines whose removal is blocked by PDBs and check that new Machines still come up

Assignee: Denis Moiseev (Inactive)

Reporter: Joel Speed

QA Contact: Zhaohua Sun

Votes: 0

Watchers: 2

Created: 2022/05/06 2:36 PM

Updated: 2022/08/26 2:05 PM

Resolved: 2022/06/28 4:18 PM
