• Story
    • Resolution: Unresolved
    • Major
    • OCPSTRAT-1823 - [TP] 'oc adm upgrade status' command and status API
    • OTA 265, OTA 266, OTA 267

      Implement a new informer controller in the Update Status Controller that watches Node resources in the cluster and maintains an update status insight for each. The informer will need to interact with additional resources such as MachineConfigPools and MachineConfigs, e.g. to discover the OCP version tied to the config being reconciled on the Node, but it should not attempt to maintain MachineConfigPool status insights. In general, the Node status insight should carry enough data for any client to render a line like the ones oc adm upgrade status currently shows:

      NAME                                      ASSESSMENT    PHASE      VERSION       EST   MESSAGE
      build0-gstfj-ci-prowjobs-worker-b-9lztv   Degraded      Draining   4.16.0-ec.2   ?     failed to drain node: <node> after 1 hour. Please see machine-config-controller logs for more informatio
      build0-gstfj-ci-prowjobs-worker-d-ddnxd   Unavailable   Pending    ?             ?     Machine Config Daemon is processing the node
      build0-gstfj-ci-tests-worker-b-d9vz2      Unavailable   Pending    ?             ?     Not ready
      build0-gstfj-ci-tests-worker-c-jq5rk      Unavailable   Updated    4.16.0-ec.3   -     Node is marked unschedulable
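
The version lookup mentioned above could look roughly like the following minimal sketch. It assumes the Node records its MachineConfig name in an annotation and that the MachineConfig carries its OCP version in another annotation. The annotation keys, the object type and the configVersion helper are assumptions made for illustration and should be checked against the MCO and oc sources.

```go
package main

import "fmt"

// Annotation keys assumed for this sketch; verify them against the MCO source.
const (
	currentConfigAnnotation  = "machineconfiguration.openshift.io/currentConfig"
	desiredConfigAnnotation  = "machineconfiguration.openshift.io/desiredConfig"
	releaseVersionAnnotation = "machineconfiguration.openshift.io/release-image-version"
)

// object is a hypothetical stand-in for the metadata of a Node or a MachineConfig.
type object struct {
	Name        string
	Annotations map[string]string
}

// configVersion resolves the OCP version of the MachineConfig that the node
// references through the given annotation key. The boolean is false when the
// annotation, the MachineConfig, or its version annotation is missing.
func configVersion(node object, annotationKey string, machineConfigs map[string]object) (string, bool) {
	mcName, ok := node.Annotations[annotationKey]
	if !ok {
		return "", false
	}
	mc, ok := machineConfigs[mcName]
	if !ok {
		return "", false
	}
	version, ok := mc.Annotations[releaseVersionAnnotation]
	if !ok || version == "" {
		return "", false
	}
	return version, true
}

func main() {
	machineConfigs := map[string]object{
		"rendered-worker-abc": {
			Name:        "rendered-worker-abc",
			Annotations: map[string]string{releaseVersionAnnotation: "4.16.0-ec.2"},
		},
	}
	node := object{
		Name:        "worker-a",
		Annotations: map[string]string{currentConfigAnnotation: "rendered-worker-abc"},
	}
	if version, ok := configVersion(node, currentConfigAnnotation, machineConfigs); ok {
		fmt.Println(node.Name, "runs", version)
	} else {
		fmt.Println(node.Name, "version unknown")
	}
}
```

A failed lookup is reported through the boolean rather than a placeholder string, so the client can decide how to render an unknown version (the table above shows it as ?).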
      

      The basic expectations for Node status insights are described in the design docs, but the current source of truth for the data structure is the NodeStatusInsight structure from https://github.com/openshift/api/pull/2012.
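
To illustrate what "enough data to render a line" means, here is a small sketch that prints rows in the shape of the table above. The nodeLine type and the renderNodeLines function are hypothetical and are not the NodeStatusInsight API type; they only mirror the six columns the client prints.

```go
package main

import (
	"fmt"
	"strings"
	"text/tabwriter"
)

// nodeLine is a hypothetical, flattened view of one node status insight,
// holding only what the table needs.
type nodeLine struct {
	Name, Assessment, Phase, Version, Est, Message string
}

// renderNodeLines formats a header and one aligned row per node.
func renderNodeLines(lines []nodeLine) string {
	var sb strings.Builder
	// Columns are padded with three spaces, similar to the sample output.
	w := tabwriter.NewWriter(&sb, 0, 0, 3, ' ', 0)
	fmt.Fprintln(w, "NAME\tASSESSMENT\tPHASE\tVERSION\tEST\tMESSAGE")
	for _, l := range lines {
		fmt.Fprintf(w, "%s\t%s\t%s\t%s\t%s\t%s\n",
			l.Name, l.Assessment, l.Phase, l.Version, l.Est, l.Message)
	}
	w.Flush()
	return sb.String()
}

func main() {
	fmt.Print(renderNodeLines([]nodeLine{
		{"worker-a", "Unavailable", "Pending", "?", "?", "Not ready"},
		{"worker-b", "Unavailable", "Updated", "4.16.0-ec.3", "-", "Node is marked unschedulable"},
	}))
}
```

If the insight carries these values (or the conditions they are derived from), a client does not need to query Nodes, MachineConfigPools or MachineConfigs itself.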

      Definition of Done

      • During the upgrade, the status API contains a Node status insight for each Node in the cluster
      • Do not bother with the status insight lifecycle: when a Node is removed from the cluster, its status insight should technically disappear, but this card does not address that. A suitable lifecycle mechanism does not exist yet, and OTA-1418 will address it
      • Overall, the functionality should match what the client-based checks in oc adm upgrade status do
      • The NodeStatusInsight should have the following fields correctly populated: name, resource, poolResource, scopeType, version, estToComplete and message, following the existing logic from oc adm upgrade status
      • Health insights are out of scope
      • Status insights for MCPs are out of scope
      • The Updating condition has the same meaning and interpretation as in the other insights.
        • When its status is False, it will contain a reason which needs to be interpreted. Three known reasons are Pending, Updated and Paused:
          • Pending: Node will eventually be updated but the update has not started yet
          • Updated: Node has already been updated
          • Paused: Node is running an outdated version but something is pausing the process (such as the parent MCP's .spec.paused field)
        • When Updating=True, there are also three known reasons: Draining, Updating and Rebooting.
          • Draining: MCO drains the node so it can be updated and rebooted
          • Updating: MCO applies the new config and prepares the node to be rebooted into the new OS version
          • Rebooting: MCO is rebooting the node, after which it (hopefully) becomes ready again
      • The logic for the Degraded and Unavailable conditions should match the existing assessment logic from oc adm upgrade status
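
The interpretation of the Updating condition listed above can be sketched as a small lookup. The condition type and the phaseFromUpdating function are hypothetical names used for illustration. The status values follow the usual Kubernetes True/False/Unknown convention, and the reasons are the six known ones from the list.

```go
package main

import "fmt"

// condition is a hypothetical, reduced form of a Kubernetes-style condition.
type condition struct {
	Status string // "True", "False" or "Unknown"
	Reason string
}

// Known reasons for each status of the Updating condition.
var knownReasons = map[string]map[string]bool{
	"True":  {"Draining": true, "Updating": true, "Rebooting": true},
	"False": {"Pending": true, "Updated": true, "Paused": true},
}

// phaseFromUpdating returns the phase to display for a node. The boolean is
// false when the status/reason combination is not one of the known ones.
func phaseFromUpdating(c condition) (string, bool) {
	if knownReasons[c.Status][c.Reason] {
		return c.Reason, true
	}
	return "", false
}

func main() {
	for _, c := range []condition{
		{Status: "True", Reason: "Draining"},
		{Status: "False", Reason: "Paused"},
		{Status: "False", Reason: "Rebooting"}, // not a known combination
	} {
		phase, ok := phaseFromUpdating(c)
		fmt.Printf("Updating=%s reason=%s -> phase=%q known=%t\n", c.Status, c.Reason, phase, ok)
	}
}
```

Treating unknown combinations as a separate case keeps a client from mislabeling a node if further reasons are introduced later.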

              hongkliu Hongkai Liu
              afri@afri.cz Petr Muller