OTA-1393

status: recognize the process of migration to multi-arch

    • Type: Story
    • Resolution: Unresolved
    • Priority: Normal
    • Parent: OCPSTRAT-1823 - [GA] 'oc adm upgrade status' command and status API
    • Sprint: OTA 263

      After OTA-960 is fixed, ClusterVersion/version and oc adm upgrade can be used to monitor the process of migrating a cluster to multi-arch.

      $ oc adm upgrade
      info: An upgrade is in progress. Working towards 4.18.0-ec.3: 761 of 890 done (85% complete), waiting on machine-config
      
      Upgradeable=False
      
        Reason: PoolUpdating
        Message: Cluster operator machine-config should not be upgraded between minor versions: One or more machine config pools are updating, please see `oc get mcp` for further details
      
      Upstream: https://api.integration.openshift.com/api/upgrades_info/graph
      Channel: candidate-4.18 (available channels: candidate-4.18)
      No updates available. You may still upgrade to a specific release image with --to-image or wait for new updates to be available.
      

      But oc adm upgrade status reports 100% completion while the migration/upgrade is still ongoing.

      $ OC_ENABLE_CMD_UPGRADE_STATUS=true oc adm upgrade status
      Unable to fetch alerts, ignoring alerts in 'Update Health':  failed to get alerts from Thanos: no token is currently in use for this session
      = Control Plane =
      Assessment:      Completed
      Target Version:  4.18.0-ec.3 (from 4.18.0-ec.3)
      Completion:      100% (33 operators updated, 0 updating, 0 waiting)
      Duration:        15m
      Operator Status: 33 Healthy
      
      Control Plane Nodes
      NAME                                        ASSESSMENT    PHASE     VERSION       EST   MESSAGE
      ip-10-0-95-224.us-east-2.compute.internal   Unavailable   Updated   4.18.0-ec.3   -     Node is unavailable
      ip-10-0-33-81.us-east-2.compute.internal    Completed     Updated   4.18.0-ec.3   -
      ip-10-0-45-170.us-east-2.compute.internal   Completed     Updated   4.18.0-ec.3   -
      
      = Worker Upgrade =
      
      WORKER POOL   ASSESSMENT   COMPLETION   STATUS
      worker        Completed    100%         3 Total, 2 Available, 0 Progressing, 0 Outdated, 0 Draining, 0 Excluded, 0 Degraded
      
      Worker Pool Nodes: worker
      NAME                                        ASSESSMENT    PHASE     VERSION       EST   MESSAGE
      ip-10-0-72-40.us-east-2.compute.internal    Unavailable   Updated   4.18.0-ec.3   -     Node is unavailable
      ip-10-0-17-117.us-east-2.compute.internal   Completed     Updated   4.18.0-ec.3   -
      ip-10-0-22-179.us-east-2.compute.internal   Completed     Updated   4.18.0-ec.3   -
      
      = Update Health =
      SINCE   LEVEL     IMPACT         MESSAGE
      -       Warning   Update Speed   Node ip-10-0-95-224.us-east-2.compute.internal is unavailable
      -       Warning   Update Speed   Node ip-10-0-72-40.us-east-2.compute.internal is unavailable
      
      Run with --details=health for additional description and links to related online documentation
      
      $ oc get clusterversion version
      NAME      VERSION       AVAILABLE   PROGRESSING   SINCE   STATUS
      version   4.18.0-ec.3   True        True          14m     Working towards 4.18.0-ec.3: 761 of 890 done (85% complete), waiting on machine-config
      
      $ oc get co machine-config
      NAME             VERSION       AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
      machine-config   4.18.0-ec.3   True        True          False      63m     Working towards 4.18.0-ec.3
      

      The reason is that the status command does not detect PROGRESSING=True for co/machine-config: it checks only ClusterOperator.Status.Versions[name=="operator"], and during a multi-arch migration the version string does not change, only the operator image does, so it needs to check ClusterOperator.Status.Versions[name=="operator-image"] as well.
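      A minimal sketch of the extended check, assuming the openshift/api config/v1 types; the helper name and the way targetImagePullSpec is obtained are illustrative, not the actual oc implementation:

      import configv1 "github.com/openshift/api/config/v1"

      // updated reports whether a ClusterOperator has fully reached the target
      // release. Checking only the "operator" version misses the multi-arch
      // migration, where the version string stays the same and only the operator
      // image digest changes, so "operator-image" has to be compared as well.
      func updated(co *configv1.ClusterOperator, targetVersion, targetImagePullSpec string) bool {
          var version, image string
          for _, v := range co.Status.Versions {
              switch v.Name {
              case "operator":
                  version = v.Version
              case "operator-image":
                  image = v.Version
              }
          }
          if version != targetVersion {
              return false
          }
          // Same version string: the operator may still be rolling out a
          // different (multi-arch) image.
          return targetImagePullSpec == "" || image == targetImagePullSpec
      }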


      For grooming:

      It will be challenging for the status command to check the operator image's pull spec because it does not know the expected value. The CVO knows it because it holds the manifests (which contain the expected value) from the multi-arch payload.

      One "hacky" workaround is for the status command to get the pull spec from the MCO deployment:

      $ oc get deployment -n openshift-machine-config-operator machine-config-operator -o json | jq -r '.spec.template.spec.containers[]|select(.name=="machine-config-operator")|.image'
      quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:787a505ca594b0a727549353c503dec9233a9d3c2dcd6b64e3de5f998892a1d5

      Note this co/machine-config -> deployment/machine-config-operator trick may not be feasible if we want to extend it to all cluster operators, but it should work as a hacky workaround for checking only the MCO.
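      A sketch of that workaround in client-go terms; getMCOImagePullSpec is the hypothetical helper named in the review notes below, and the wiring is illustrative:

      import (
          "context"
          "fmt"

          metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
          "k8s.io/client-go/kubernetes"
      )

      // getMCOImagePullSpec mirrors the jq one-liner above: it reads the expected
      // operator image from the machine-config-operator Deployment, because the
      // status command does not hold the payload manifests that the CVO has.
      func getMCOImagePullSpec(ctx context.Context, client kubernetes.Interface) (string, error) {
          deploy, err := client.AppsV1().Deployments("openshift-machine-config-operator").
              Get(ctx, "machine-config-operator", metav1.GetOptions{})
          if err != nil {
              return "", err
          }
          for _, c := range deploy.Spec.Template.Spec.Containers {
              if c.Name == "machine-config-operator" {
                  return c.Image, nil
              }
          }
          return "", fmt.Errorf("container machine-config-operator not found in deployment")
      }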

      We may claim that the status command is not designed for monitoring the multi-arch migration and suggest using oc adm upgrade instead. In that case, we can close this card as Obsolete/Won't Do.


      manifests.zip has the mockData/manifests for the status command, captured during the migration.


      oc#1920 started the work for the status command to recognize the migration. We need to extend that work to cover the following comments from Petr's review:

      • "Target Version: 4.18.0-ec.3 (from 4.18.0-ec.3)" is confusing. We should indicate the multi-arch migration somehow, or better, show the move from the current architecture to multi-arch, for example "Target Version: 4.18.0-ec.3 multi (from x86_64)", if we can get the original architecture from the ClusterVersion or somewhere else.
        • We have had spec.desiredUpdate.architecture since forever, and can use its being Multi as a partial hint. MULTIARCH-4559 is adding tech-preview status properties around architecture in 4.18, but as tech preview they may not be worth bothering with in oc code. Two history entries with the same version string but different digests, coupled with the spec-side hint, is probably a reliable-enough heuristic (see the sketch after this list).
      • "Duration: 6m55s (Est. Time Remaining: 1h4m)": We will see if we can find a simple way to handle this special case. I do not understand "the 97% completion will be reached so fast." as I am not familiar with the algorithm, but it seems acceptable to Petr that we show N/A for the migration.
      • Node status lines such as "All control plane nodes successfully updated to 4.18.0-ec.3" for the control plane and "ip-10-0-17-117.us-east-2.compute.internal   Completed" for individual nodes: it is technically hard to detect the transition during the migration because the MCO annotates nodes only with the version. This may become a separate card if it is too big to finish within the current one.
      • "targetImagePullSpec := getMCOImagePullSpec(mcoDeployment)" should be computed just once; currently it runs in each iteration of the for loop. We should also add a comment explaining why we obtain the pull spec in this hacky way.

              Assignee: Hongkai Liu (hongkliu)
              Reporter: Hongkai Liu (hongkliu)