-
Story
-
Resolution: Unresolved
-
Major
-
None
-
None
-
None
-
5
-
False
-
None
-
False
-
OCPSTRAT-1823 - [TP] 'oc adm upgrade status' command and status API
-
-
-
OTA 265, OTA 266, OTA 267
Implement a new informer controller in the Update Status Controller that watches Node resources in the cluster and maintains an update status insight for each of them. The informer will need to read additional resources such as MachineConfigPools and MachineConfigs, e.g. to discover the OCP version tied to the config being reconciled on the Node, but it should not attempt to maintain MachineConfigPool status insights. In general, the Node status insight should carry enough data for any client to render a line like the ones that oc adm upgrade status currently shows:
NAME                                      ASSESSMENT    PHASE      VERSION       EST   MESSAGE
build0-gstfj-ci-prowjobs-worker-b-9lztv   Degraded      Draining   4.16.0-ec.2   ?     failed to drain node: <node> after 1 hour. Please see machine-config-controller logs for more informatio
build0-gstfj-ci-prowjobs-worker-d-ddnxd   Unavailable   Pending    ?             ?     Machine Config Daemon is processing the node
build0-gstfj-ci-tests-worker-b-d9vz2      Unavailable   Pending    ?             ?     Not ready
build0-gstfj-ci-tests-worker-c-jq5rk      Unavailable   Updated    4.16.0-ec.3   -     Node is marked unschedulable
The basic expectations for Node status insights are described in the design docs, but the current source of truth for the data structure is the NodeStatusInsight structure from https://github.com/openshift/api/pull/2012 .
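As an illustration of "enough data to render a line", here is a minimal Go sketch that prints rows in the column layout shown above. The nodeRow type and renderRows function are hypothetical names introduced only for this example; nodeRow is a flattened display view, not the NodeStatusInsight structure from the API pull request, and the sketch does not reproduce how oc adm upgrade status derives the ASSESSMENT or PHASE values.

```go
package main

import (
	"fmt"
	"strings"
	"text/tabwriter"
)

// nodeRow is a hypothetical, display-only view of one Node status insight.
// It is NOT the NodeStatusInsight API type; field names are assumptions.
type nodeRow struct {
	Name       string
	Assessment string // e.g. Degraded, Unavailable
	Phase      string // e.g. Pending, Draining, Updated
	Version    string // "?" when unknown
	Est        string // estimate to completion, "?" or "-" when not applicable
	Message    string
}

// renderRows formats the rows as an aligned table with the same header
// columns as the example output in this card.
func renderRows(rows []nodeRow) string {
	var sb strings.Builder
	// minwidth 0, tabwidth 0, padding 3, pad with spaces, no flags.
	w := tabwriter.NewWriter(&sb, 0, 0, 3, ' ', 0)
	fmt.Fprintln(w, "NAME\tASSESSMENT\tPHASE\tVERSION\tEST\tMESSAGE")
	for _, r := range rows {
		fmt.Fprintf(w, "%s\t%s\t%s\t%s\t%s\t%s\n",
			r.Name, r.Assessment, r.Phase, r.Version, r.Est, r.Message)
	}
	w.Flush()
	return sb.String()
}

func main() {
	rows := []nodeRow{
		{"build0-gstfj-ci-tests-worker-b-d9vz2", "Unavailable", "Pending", "?", "?", "Not ready"},
		{"build0-gstfj-ci-tests-worker-c-jq5rk", "Unavailable", "Updated", "4.16.0-ec.3", "-", "Node is marked unschedulable"},
	}
	fmt.Print(renderRows(rows))
}
```

If a client cannot fill one of these columns from the insight alone, that is a hint that the insight is missing data the card expects it to carry.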
Definition of Done
- During the upgrade, the status API contains a Node status insight for each Node in the cluster
- Do not bother with the status insight lifecycle: when a Node is removed from the cluster, its status insight should technically disappear, but do not address that in this card. A suitable lifecycle mechanism does not exist yet; OTA-1418 will address it.
- Overall, the functionality should match the client-based checks that oc adm upgrade status performs
- The NodeStatusInsight should have correctly populated: name, resource, poolResource, scopeType, version, estToComplete and message fields, following the existing logic from oc adm upgrade status
- Health insights are out of scope
- Status insights for MCPs are out of scope
- The Updating condition has the same meaning and interpretation as in the other insights.
  - When its status is False, it will contain a reason which needs to be interpreted. Three known reasons are Pending, Updated and Paused:
    - Pending: Node will eventually be updated but has not started yet
    - Updated: Node has already been updated
    - Paused: Node is running an outdated version but something is pausing the process (like the parent MCP .spec.paused field)
  - When Updating=True, there are also three known reasons: Draining, Updating and Rebooting:
    - Draining: MCO drains the node so it can be updated and rebooted
    - Updating: MCO applies the new config and prepares the node to be rebooted into the new OS version
    - Rebooting: MCO is rebooting the node, after which it (hopefully) becomes ready again
- The Degraded and Unavailable condition logic should match the existing assessment logic from oc adm upgrade status
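The interpretation of the Updating condition described above can be sketched as a small Go function. This is only an illustration: UpdatingCondition and describeUpdating are hypothetical names, the condition is reduced to a boolean status and a reason string rather than using the real API types, and the fallback for unrecognized reasons is an assumption of this sketch.

```go
package main

import "fmt"

// UpdatingCondition is a hypothetical, simplified stand-in for the Updating
// condition on a Node status insight. It is not the openshift/api type.
type UpdatingCondition struct {
	Status bool   // true when Updating=True
	Reason string // one of the reasons listed in this card
}

// describeUpdating returns a short interpretation of the condition based on
// the six known reasons. Unknown reasons are reported as such (an assumption
// of this sketch; the card does not specify a fallback).
func describeUpdating(c UpdatingCondition) string {
	if c.Status {
		switch c.Reason {
		case "Draining":
			return "MCO is draining the node"
		case "Updating":
			return "MCO is applying the new config"
		case "Rebooting":
			return "MCO is rebooting the node"
		}
		return "updating, unrecognized reason: " + c.Reason
	}
	switch c.Reason {
	case "Pending":
		return "update has not started yet"
	case "Updated":
		return "node is already updated"
	case "Paused":
		return "node is outdated but the update is paused"
	}
	return "not updating, unrecognized reason: " + c.Reason
}

func main() {
	conditions := []UpdatingCondition{
		{Status: false, Reason: "Pending"},
		{Status: true, Reason: "Draining"},
		{Status: false, Reason: "Paused"},
	}
	for _, c := range conditions {
		fmt.Printf("Updating=%t reason=%s: %s\n", c.Status, c.Reason, describeUpdating(c))
	}
}
```

Keying the interpretation on both status and reason matters because "Updating" appears as a reason only under Updating=True, while Pending, Updated and Paused are only meaningful under Updating=False.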