[OTA-1427] USC: Maintain status insights for Nodes

Type: Story
Resolution: Done
Priority: Major
Fix Version/s: None
Affects Version/s: None
Component/s: None
Labels:
- groomed

Story Points:
5
Blocked:
False
Blocked Reason:
None
Ready:
False
Epic Link:
Update Status API
Feature Link:
OCPSTRAT-1823 - [TP] 'oc adm upgrade status' command and status API

Sprint:
OTA 265, OTA 266, OTA 267


Implement a new informer controller in the Update Status Controller (USC) that watches Node resources in the cluster and maintains an update status insight for each. The controller will need to consult additional resources such as MachineConfigPools and MachineConfigs, e.g. to discover the OCP version tied to the config that is being reconciled on the Node, but it should not attempt to maintain the MachineConfigPool status insights. In general, the Node status insight should carry enough data for any client to render a line like the ones that oc adm upgrade status currently shows:

NAME                                      ASSESSMENT    PHASE      VERSION       EST   MESSAGE
build0-gstfj-ci-prowjobs-worker-b-9lztv   Degraded      Draining   4.16.0-ec.2   ?     failed to drain node: <node> after 1 hour. Please see machine-config-controller logs for more informatio
build0-gstfj-ci-prowjobs-worker-d-ddnxd   Unavailable   Pending    ?             ?     Machine Config Daemon is processing the node
build0-gstfj-ci-tests-worker-b-d9vz2      Unavailable   Pending    ?             ?     Not ready
build0-gstfj-ci-tests-worker-c-jq5rk      Unavailable   Updated    4.16.0-ec.3   -     Node is marked unschedulable

The basic expectations for Node status insights are described in the design docs, but the current source of truth for the data structure is the NodeStatusInsight structure from https://github.com/openshift/api/pull/2012 .

Definition of Done

- During the upgrade, the status API contains a Node status insight for each Node in the cluster.
- Do not bother with the status insight lifecycle in this card. When a Node is removed from the cluster, its status insight should technically disappear, but a suitable lifecycle mechanism does not exist yet and OTA-1418 will address it.
- Overall, the functionality should match the client-based checks that oc adm upgrade status performs.
- The NodeStatusInsight should have correctly populated name, resource, poolResource, scopeType, version, estToComplete and message fields, following the existing logic from oc adm upgrade status.
- Health insights are out of scope.
- Status insights for MCPs are out of scope.
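
To make the field list above more concrete, here is a minimal Go sketch of how a client might render one line of the table from a Node status insight. It is only an illustration: nodeInsight is a hypothetical, simplified stand-in for the real NodeStatusInsight structure (the field types, the "WorkerPool" scope value and the resource strings are assumptions), and the "?" and "-" placeholders are inferred from the sample output rather than taken from the actual oc adm upgrade status logic. The assessment and phase are passed in separately because they are not among the fields listed above.

```go
package main

import (
	"fmt"
	"time"
)

// nodeInsight is a hypothetical, simplified stand-in for NodeStatusInsight.
// Field names follow the Definition of Done; the types are assumptions.
type nodeInsight struct {
	Name          string         // Node name
	Resource      string         // reference to the Node, simplified to a string
	PoolResource  string         // reference to the owning MachineConfigPool
	ScopeType     string         // assumed values: "ControlPlane" or "WorkerPool"
	Version       string         // empty when the target version is unknown
	EstToComplete *time.Duration // nil when no estimate is available
	Message       string
}

// rowFormat mirrors the column widths of the sample output.
const rowFormat = "%-42s%-14s%-11s%-14s%-6s%s"

// orUnknown returns "?" for an empty value, as the sample output does.
func orUnknown(s string) string {
	if s == "" {
		return "?"
	}
	return s
}

// formatEstimate renders the EST column. Treating nil as "?" and zero as "-"
// is an assumption based on the sample output, not the real client logic.
func formatEstimate(d *time.Duration) string {
	switch {
	case d == nil:
		return "?"
	case *d == 0:
		return "-"
	default:
		return d.Round(time.Minute).String()
	}
}

// renderRow formats one table line from the insight plus an assessment and phase.
func renderRow(assessment, phase string, n nodeInsight) string {
	return fmt.Sprintf(rowFormat, n.Name, assessment, phase,
		orUnknown(n.Version), formatEstimate(n.EstToComplete), n.Message)
}

func main() {
	done := time.Duration(0)
	n := nodeInsight{
		Name:          "build0-gstfj-ci-tests-worker-c-jq5rk",
		Resource:      "nodes/build0-gstfj-ci-tests-worker-c-jq5rk", // illustrative
		PoolResource:  "machineconfigpools/worker",                  // illustrative
		ScopeType:     "WorkerPool",
		Version:       "4.16.0-ec.3",
		EstToComplete: &done,
		Message:       "Node is marked unschedulable",
	}
	fmt.Printf(rowFormat+"\n", "NAME", "ASSESSMENT", "PHASE", "VERSION", "EST", "MESSAGE")
	fmt.Println(renderRow("Unavailable", "Updated", n))
}
```

Keeping the estimate as a pointer lets the sketch distinguish "no estimate available" from "nothing left to do", which the sample output appears to show as "?" and "-" respectively.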

The Updating condition has a similar meaning and interpretation as in the other insights.
- When its status is False, it carries a reason that needs to be interpreted. The three known reasons are Pending, Updated and Paused:
  - Pending: Node will eventually be updated but has not started yet
  - Updated: Node already underwent the update
  - Paused: Node is running an outdated version but something is pausing the process (such as the parent MCP's .spec.paused field)
- When Updating=True, there are also three known reasons: Draining, Updating and Rebooting.
  - Draining: MCO drains the node so it can be updated and rebooted
  - Updating: MCO applies the new config and prepares the node to be rebooted into the new OS version
  - Rebooting: MCO is rebooting the node, after which it (hopefully) becomes ready again
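
The interpretation rules above can be sketched as a small lookup keyed by condition status and reason. This is a hedged illustration only: updatingCondition, knownReasons and interpret are hypothetical names, the condition is reduced to two plain strings instead of a Kubernetes metav1.Condition, and the handling of unrecognized combinations is an assumption.

```go
package main

import "fmt"

// updatingCondition is a hypothetical, reduced view of the Updating condition.
type updatingCondition struct {
	Status string // "True", "False" or "Unknown"
	Reason string
}

// knownReasons lists the reasons described in this card, grouped by status.
var knownReasons = map[string]map[string]string{
	"False": {
		"Pending": "Node will eventually be updated but has not started yet",
		"Updated": "Node already underwent the update",
		"Paused":  "Node runs an outdated version but something is pausing the process",
	},
	"True": {
		"Draining":  "MCO drains the node so it can be updated and rebooted",
		"Updating":  "MCO applies the new config and prepares the node for reboot",
		"Rebooting": "MCO is rebooting the node",
	},
}

// interpret returns a description for a known status/reason pair. The boolean
// is false when the reason is not one of the known reasons for that status.
func interpret(c updatingCondition) (string, bool) {
	desc, ok := knownReasons[c.Status][c.Reason]
	if !ok {
		return fmt.Sprintf("unrecognized reason %q for Updating=%s", c.Reason, c.Status), false
	}
	return desc, true
}

func main() {
	for _, c := range []updatingCondition{
		{Status: "True", Reason: "Draining"},
		{Status: "False", Reason: "Paused"},
		{Status: "False", Reason: "Draining"}, // a True-only reason under False
	} {
		desc, known := interpret(c)
		fmt.Printf("Updating=%s reason=%s known=%t: %s\n", c.Status, c.Reason, known, desc)
	}
}
```

Keying the lookup by status first means a reason is only accepted under the status it belongs to, so a reason such as Draining reported with Updating=False is flagged rather than silently rendered.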