Epic Goal:

To prevent disruptive operations (such as node updates, fencing, and maintenance windows) from happening at the same time
Use k8s node leases to provide an integration point that ensures disruptive operations do not conflict with each other
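
As a rough illustration of the integration point described above, the sketch below models a per-node lease in memory. It is not the Kubernetes Lease API: the names LeaseTable, try_acquire, release, and holder_of are hypothetical. The expiry behaviour is an assumption, loosely modelled on a lease having a duration.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class NodeLease:
    holder: str        # component currently holding the lease
    expires_at: float  # clock value after which the lease may be taken over


class LeaseTable:
    """In-memory stand-in for per-node leases (hypothetical, not the k8s API)."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._leases: Dict[str, NodeLease] = {}

    def try_acquire(self, node: str, holder: str, duration_s: float) -> bool:
        """Take (or renew) the lease unless another holder has an unexpired one."""
        now = self._clock()
        current = self._leases.get(node)
        if current is not None and current.holder != holder and current.expires_at > now:
            return False
        self._leases[node] = NodeLease(holder, now + duration_s)
        return True

    def release(self, node: str, holder: str) -> bool:
        """Drop the lease; only the current holder may release it."""
        current = self._leases.get(node)
        if current is None or current.holder != holder:
            return False
        del self._leases[node]
        return True

    def holder_of(self, node: str) -> Optional[str]:
        """Return the holder of an unexpired lease, or None if the node is free."""
        current = self._leases.get(node)
        if current is None or current.expires_at <= self._clock():
            return None
        return current.holder
```

In this sketch a component would call try_acquire before a disruptive operation and skip or retry the operation when it returns False.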

Why is this important?

Fencing while a node is partially updated could corrupt the node
Software and configuration updates during maintenance activities may complicate triage

Problem:

There are a number of node actions that could produce unexpected or suboptimal results if performed in parallel:

Rebooting a node
Draining or Cordoning a node
Updating a node's configuration
Powering off a node for maintenance
Deleting/deprovisioning a node, possibly due to failed health checks

Why is this important:

Being able to ensure exclusive access to a node allows these operations to be more robust and provides a better user experience.

During these maintenance windows, the machine should be left alone.

If power were restored while an admin is physically handling a machine, it would create a health and safety issue. Additionally, triage activities become significantly more challenging if the machine's software or configuration is modified, or the machine is rebooted, during the maintenance window.

User stories

As an OpenShift admin, I want exclusive access to a machine under maintenance, so that poorly timed automated updates and/or reboots do not create additional support challenges.
As an OpenShift admin, I want machines to stay off when requested, so that they do not come to life during hardware changes.
As an OpenShift admin, I do not want the system to update a node that I am trying to delete.
As an OpenShift admin, I do not want the system to power off a node that I am trying to drain.

As an OpenShift admin, I want to prevent the system from upgrading machines flagged for maintenance, so that the system is not changing during triage activities.

As an OpenShift admin, I want to be warned that machines are flagged for maintenance prior to initiating an upgrade, so that I can make an informed decision.

As an OpenShift admin, I want to be warned that an upgrade is in progress prior to initiating host maintenance, so that I can make an informed decision.

Dependencies (internal and external):

Upstream NodeLease enhancement

Prioritized epics + deliverables (in scope / not in scope):

Proposal needs to be accepted upstream
Proposal can then be brought downstream
Identify sources of disruptive actions and update components to require the lease first

Estimate (XS, S, M, L, XL, XXL): L

Previous Work:

A Node Maintenance Operator (NMO) was developed and shipped as part of the CNV project in order to fast-track the ability to drain machines from the UI. The implementation utilized a CRD for tracking who was requesting the drain and why, as well as which workloads were yet to be moved away (so that the UI could indicate the remaining work to the admin).

Customers:

Verizon

Acceptance Criteria

Operator obtains a node lock/lease before entering maintenance mode
Operator releases the node lock/lease after exiting maintenance mode
Machine healthcheck (fencing) requests the same lock/lease prior to remediating a node
MachineConfigDaemon requests the same lock/lease prior to updating the node's configuration
?CVO? requests the same lock/lease prior to upgrading the node's software
Any other automated paths that request a node power off or reboot first request the same lock/lease
A node in maintenance mode can be shut down or rebooted by an admin
Documentation
Testing

Assignee:: Unassigned

Reporter:: Andrew Beekhof

Votes:: 0

Watchers:: 4

Created:: 2020/12/17 11:26 AM

Updated:: 2023/03/07 7:44 PM

Resolved:: 2023/03/07 7:44 PM
