Loading...

Type: Story
Resolution: Done
Priority: Normal
Fix Version/s: None
Affects Version/s: None
Component/s: None
Labels:
None

Activity Type:
Product / Portfolio Work
Blocked:
False
Blocked Reason:

Hide

None

Show
None
Ready:
False
Epic Link:
Update Status API
Story Points:
5

Target Version:
None
Release Blocker:
None
Sprint:
OTA 263, OTA 264, OTA 265, OTA 266, OTA 267

After ~~OTA-960~~ is fixed, ClusterVersion/version and oc adm upgrade can be used to monitor the process of migrating a cluster to multi-arch.

$ oc adm upgrade
info: An upgrade is in progress. Working towards 4.18.0-ec.3: 761 of 890 done (85% complete), waiting on machine-config

Upgradeable=False

  Reason: PoolUpdating
  Message: Cluster operator machine-config should not be upgraded between minor versions: One or more machine config pools are updating, please see `oc get mcp` for further details

Upstream: https://api.integration.openshift.com/api/upgrades_info/graph
Channel: candidate-4.18 (available channels: candidate-4.18)
No updates available. You may still upgrade to a specific release image with --to-image or wait for new updates to be available.

But oc adm upgrade status reports COMPLETION 100% while the migration/upgrade is still ongoing.

$ OC_ENABLE_CMD_UPGRADE_STATUS=true oc adm upgrade status
Unable to fetch alerts, ignoring alerts in 'Update Health':  failed to get alerts from Thanos: no token is currently in use for this session
= Control Plane =
Assessment:      Completed
Target Version:  4.18.0-ec.3 (from 4.18.0-ec.3)
Completion:      100% (33 operators updated, 0 updating, 0 waiting)
Duration:        15m
Operator Status: 33 Healthy

Control Plane Nodes
NAME                                        ASSESSMENT    PHASE     VERSION       EST   MESSAGE
ip-10-0-95-224.us-east-2.compute.internal   Unavailable   Updated   4.18.0-ec.3   -     Node is unavailable
ip-10-0-33-81.us-east-2.compute.internal    Completed     Updated   4.18.0-ec.3   -
ip-10-0-45-170.us-east-2.compute.internal   Completed     Updated   4.18.0-ec.3   -

= Worker Upgrade =

WORKER POOL   ASSESSMENT   COMPLETION   STATUS
worker        Completed    100%         3 Total, 2 Available, 0 Progressing, 0 Outdated, 0 Draining, 0 Excluded, 0 Degraded

Worker Pool Nodes: worker
NAME                                        ASSESSMENT    PHASE     VERSION       EST   MESSAGE
ip-10-0-72-40.us-east-2.compute.internal    Unavailable   Updated   4.18.0-ec.3   -     Node is unavailable
ip-10-0-17-117.us-east-2.compute.internal   Completed     Updated   4.18.0-ec.3   -
ip-10-0-22-179.us-east-2.compute.internal   Completed     Updated   4.18.0-ec.3   -

= Update Health =
SINCE   LEVEL     IMPACT         MESSAGE
-       Warning   Update Speed   Node ip-10-0-95-224.us-east-2.compute.internal is unavailable
-       Warning   Update Speed   Node ip-10-0-72-40.us-east-2.compute.internal is unavailable

Run with --details=health for additional description and links to related online documentation

$ oc get clusterversion version
NAME      VERSION       AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.18.0-ec.3   True        True          14m     Working towards 4.18.0-ec.3: 761 of 890 done (85% complete), waiting on machine-config

$ oc get co machine-config
NAME             VERSION       AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
machine-config   4.18.0-ec.3   True        True          False      63m     Working towards 4.18.0-ec.3

The reason is that PROGRESSING=True is not detected for co/machine-config as the status command checks only operator.Status.Versions[name=="operator"] and it needs to check ClusterOperator.Status.Versions[name=="operator-image"] as well.

For grooming:

It will be challenging for the status command to check the operator image's pull spec because it does not know the expected value. CVO knows it because CVO holds the manifests (containing the expected value) from the multi-arch payload.

One "hacky" workaround is that the status command gets the pull spec from the MCO deployment:

oc get deployment -n openshift-machine-config-operator machine-config-operator -o json | jq -r '.spec.template.spec.containers[]|select(.name=="machine-config-operator")|.image'
quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:787a505ca594b0a727549353c503dec9233a9d3c2dcd6b64e3de5f998892a1d5

Note this co/machine-config -> deployment/machine-config-operator trick may not be feasible if we want to extend it to all cluster operators. But it should work as a hacky workaround to check only MCO.

We may claim that the status command is not designed for monitoring the multi-arch migration and suggest to use oc adm upgrade instead. In that case, we can close this card as Obsolete/Won'tDo.

manifests.ziphas the mockData/manifests for the status cmd that are taken during the migration.

oc#1920 started the work for the status command to recognize the migration and we need to extend the work to cover (the comments from Petr's review):

"Target Version: 4.18.0-ec.3 (from 4.18.0-ec.3)": confusing. We should tell "multi-arch" migration somehow. Or even better: from the current arch to multi-arch, for example "Target Version: 4.18.0-ec.3 multi (from x86_64)" if we could get the origin arch from CV or somewhere else.
- We have spec.desiredUpdate.architecture since forever, and can use that being Multi as a partial hint. ~~MULTIARCH-4559~~ is adding tech-preview status properties around architecture in 4.18, but tech-preview, so may not be worth bothering with in oc code. Two history entries with the same version string but different digests is probably a reliable-enough heuristic, coupled with the spec-side hint.

"Duration: 6m55s (Est. Time Remaining: 1h4m)": We will see if we could find a simple way to hand this special case. I do not understand "the 97% completion will be reached so fast." as I am not familiar with the algorithm. But it seems acceptable to Petr that we show N/A for the migration.
- I think I get "the 97% completion will be reached so fast." now as only MCO has the operator-image pull spec. Other COs claim the completeness immaturely. With that said, "N/A" sounds like the most possible way for now.

Node status like "All control plane nodes successfully updated to 4.18.0-ec.3" for control planes and "ip-10-0-17-117.us-east-2.compute.internal Completed". It is technically hard to detect the transaction during migration as MCO annotates only the version. This may become a separate card if it is too big to finish with the current one.

"targetImagePullSpec := getMCOImagePullSpec(mcoDeployment)" should be computed just once. Now it is in the each iteration of the for loop. We should also comment about why we do it with this hacky way.