-
Epic
-
Resolution: Obsolete
-
Undefined
-
None
-
None
-
disruptive-node-leases
-
False
-
False
-
To Do
-
OCPPLAN-2827 - OCP clusters life-cycle infrastructure
-
Undefined
-
0
-
0
Epic Goal:
- To prevent disruptive operations (such as node updates, fencing, and maintenance windows) from happening at the same time
- Use k8s node leases to provide an integration point that ensures disruptive operations do not conflict with each other
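The lease-based coordination described above can be sketched as follows. This is a minimal in-memory model, not the proposed implementation: the field names mirror the `coordination.k8s.io/v1` Lease spec (`holderIdentity`, `leaseDurationSeconds`, `renewTime`), while `NodeLease`, `try_acquire`, `release`, and the holder names are hypothetical. A real component would read and update the Lease object through the API server and rely on its optimistic concurrency checks rather than on local state.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class NodeLease:
    """In-memory stand-in for a per-node Lease object (hypothetical model)."""
    holder_identity: Optional[str] = None
    lease_duration_seconds: int = 0
    renew_time: float = 0.0  # seconds since an arbitrary epoch

    def expired(self, now: float) -> bool:
        # An unheld lease, or one past renew_time + duration, is free to take.
        if self.holder_identity is None:
            return True
        return now >= self.renew_time + self.lease_duration_seconds


def try_acquire(lease: NodeLease, holder: str, duration: int, now: float) -> bool:
    """Take or renew the lease for `holder`; return False if someone else holds it."""
    if lease.holder_identity == holder or lease.expired(now):
        lease.holder_identity = holder
        lease.lease_duration_seconds = duration
        lease.renew_time = now
        return True
    return False


def release(lease: NodeLease, holder: str) -> bool:
    """Give up the lease; only the current holder may release it."""
    if lease.holder_identity != holder:
        return False
    lease.holder_identity = None
    return True


# Assumed scenario: an updater takes the lease, so fencing must wait.
lease = NodeLease()
updater_ok = try_acquire(lease, "node-updater", duration=600, now=0.0)
fencing_ok = try_acquire(lease, "fencing-agent", duration=60, now=30.0)
# Once the updater's lease has lapsed without renewal, fencing can proceed.
fencing_after_expiry = try_acquire(lease, "fencing-agent", duration=60, now=700.0)
```

In this model, expiry is what lets a second actor recover a node whose lease holder has died without releasing it. Whether that is acceptable for long maintenance windows depends on the lease duration and renewal policy the proposal settles on.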
Why is this important?
- Fencing while a node is partially updated could corrupt the node
- Software and configuration updates during maintenance activities may complicate triage
Problem:
A number of node actions could produce unexpected or suboptimal results if performed in parallel:
- Rebooting a node
- Draining or cordoning a node
- Updating a node's configuration
- Powering off a node for maintenance
- Deleting/deprovisioning a node, possibly due to failed health checks
Why is this important:
Being able to ensure exclusive access to a node allows these operations to be more robust and provides a better user experience.
During these maintenance windows, the machine should be left alone.
If the power were to be restored while an admin is physically handling a machine, it would create a health and safety issue. Additionally, triage activities become significantly more challenging if the machine's software or configuration is modified, or the machine is rebooted, during the maintenance window.
User stories
- As an OpenShift admin, I want exclusive access to a machine under maintenance, so that poorly timed automated updates and/or reboots do not create additional support challenges.
- As an OpenShift admin, I want machines to stay off when requested, so that they do not come to life during hardware changes.
- As an OpenShift admin, I do not want the system to update a node that I am trying to delete.
- As an OpenShift admin, I do not want the system to power off a node that I am trying to drain.
- As an OpenShift admin, I want to prevent the system from upgrading machines flagged for maintenance, so that the system is not changing during triage activities.
- As an OpenShift admin, I want to be warned that machines are flagged for maintenance prior to initiating an upgrade, so that I can make an informed decision.
- As an OpenShift admin, I want to be warned that an upgrade is in progress prior to initiating host maintenance, so that I can make an informed decision.
Dependencies (internal and external):
Prioritized epics + deliverables (in scope / not in scope):
- Proposal needs to be accepted upstream
- Proposal can then be brought downstream
- Identify sources of disruptive actions and update components to require the lease first
Estimate (XS, S, M, L, XL, XXL): L
Previous Work:
- https://jira.coreos.com/browse/MGMT-298
- https://jira.coreos.com/browse/KNIP-865
- https://docs.google.com/document/d/1aoL8vKTdoL8t9Ynr3HsfE8QzixFD2G99L4fc82cs0Dc/edit?ts=5dd3b252#heading=h.jfc876tne6m9
A Node Maintenance Operator (NMO) was developed and shipped as part of the CNV project in order to fast-track the ability to drain machines from the UI. The implementation utilized a CRD for tracking who was requesting the drain and why, as well as which workloads were yet to be moved away (so that the UI could indicate the remaining work to the admin).
Customers:
- Verizon
Acceptance Criteria