- Story
- Resolution: Unresolved
- Normal
After OTA-960 is fixed, ClusterVersion/version and oc adm upgrade can be used to monitor the process of migrating a cluster to multi-arch.
$ oc adm upgrade
info: An upgrade is in progress. Working towards 4.18.0-ec.3: 761 of 890 done (85% complete), waiting on machine-config

Upgradeable=False

  Reason: PoolUpdating
  Message: Cluster operator machine-config should not be upgraded between minor versions: One or more machine config pools are updating, please see `oc get mcp` for further details

Upstream: https://api.integration.openshift.com/api/upgrades_info/graph
Channel: candidate-4.18 (available channels: candidate-4.18)
No updates available. You may still upgrade to a specific release image with --to-image or wait for new updates to be available.
But oc adm upgrade status reports COMPLETION 100% while the migration/upgrade is still ongoing.
$ OC_ENABLE_CMD_UPGRADE_STATUS=true oc adm upgrade status
Unable to fetch alerts, ignoring alerts in 'Update Health': failed to get alerts from Thanos: no token is currently in use for this session
= Control Plane =
Assessment:      Completed
Target Version:  4.18.0-ec.3 (from 4.18.0-ec.3)
Completion:      100% (33 operators updated, 0 updating, 0 waiting)
Duration:        15m
Operator Status: 33 Healthy

Control Plane Nodes
NAME                                        ASSESSMENT    PHASE     VERSION       EST   MESSAGE
ip-10-0-95-224.us-east-2.compute.internal   Unavailable   Updated   4.18.0-ec.3   -     Node is unavailable
ip-10-0-33-81.us-east-2.compute.internal    Completed     Updated   4.18.0-ec.3   -
ip-10-0-45-170.us-east-2.compute.internal   Completed     Updated   4.18.0-ec.3   -

= Worker Upgrade =

WORKER POOL   ASSESSMENT   COMPLETION   STATUS
worker        Completed    100%         3 Total, 2 Available, 0 Progressing, 0 Outdated, 0 Draining, 0 Excluded, 0 Degraded

Worker Pool Nodes: worker
NAME                                        ASSESSMENT    PHASE     VERSION       EST   MESSAGE
ip-10-0-72-40.us-east-2.compute.internal    Unavailable   Updated   4.18.0-ec.3   -     Node is unavailable
ip-10-0-17-117.us-east-2.compute.internal   Completed     Updated   4.18.0-ec.3   -
ip-10-0-22-179.us-east-2.compute.internal   Completed     Updated   4.18.0-ec.3   -

= Update Health =
SINCE   LEVEL     IMPACT         MESSAGE
-       Warning   Update Speed   Node ip-10-0-95-224.us-east-2.compute.internal is unavailable
-       Warning   Update Speed   Node ip-10-0-72-40.us-east-2.compute.internal is unavailable

Run with --details=health for additional description and links to related online documentation

$ oc get clusterversion version
NAME      VERSION       AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.18.0-ec.3   True        True          14m     Working towards 4.18.0-ec.3: 761 of 890 done (85% complete), waiting on machine-config

$ oc get co machine-config
NAME             VERSION       AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
machine-config   4.18.0-ec.3   True        True          False      63m     Working towards 4.18.0-ec.3
The reason is that PROGRESSING=True is not detected for co/machine-config. During the migration the version string stays at 4.18.0-ec.3 and only the image changes, but the status command checks only operator.Status.Versions[name=="operator"]. It needs to check ClusterOperator.Status.Versions[name=="operator-image"] as well.
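A minimal sketch of the combined check, for illustration only. The types below are simplified, hypothetical stand-ins for the ClusterOperator status fields (not the real openshift/api types), isUpdated and versionOf are hypothetical helper names, and the image pull specs in main are placeholders. How to treat operators that do not report an "operator-image" entry is an assumption here: they are judged on the version alone.

```go
package main

import "fmt"

// OperandVersion and ClusterOperatorStatus are simplified, hypothetical
// stand-ins for the ClusterOperator status fields discussed above.
type OperandVersion struct {
	Name    string
	Version string
}

type ClusterOperatorStatus struct {
	Versions []OperandVersion
}

// versionOf returns the value reported for the named operand, if present.
func versionOf(status ClusterOperatorStatus, name string) (string, bool) {
	for _, v := range status.Versions {
		if v.Name == name {
			return v.Version, true
		}
	}
	return "", false
}

// isUpdated reports whether the operator reached the target version and,
// when an expected image is known, the target image as well.
func isUpdated(status ClusterOperatorStatus, targetVersion, targetImage string) bool {
	if v, ok := versionOf(status, "operator"); !ok || v != targetVersion {
		return false
	}
	if targetImage == "" {
		return true // no expected image to compare against
	}
	img, ok := versionOf(status, "operator-image")
	// Assumption: operators without an "operator-image" entry are judged
	// on the version string alone.
	return !ok || img == targetImage
}

func main() {
	// Same version string, but the operator still reports the old image.
	status := ClusterOperatorStatus{Versions: []OperandVersion{
		{Name: "operator", Version: "4.18.0-ec.3"},
		{Name: "operator-image", Version: "example.invalid/mco@sha256:old"},
	}}
	fmt.Println(isUpdated(status, "4.18.0-ec.3", "example.invalid/mco@sha256:new"))
}
```

With only the "operator" entry checked, the status in main would count as updated; adding the image comparison keeps it in the updating bucket until the image matches.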
For grooming:
It will be challenging for the status command to check the operator image's pull spec because it does not know the expected value. CVO knows it because CVO holds the manifests (containing the expected value) from the multi-arch payload.
One "hacky" workaround is that the status command gets the pull spec from the MCO deployment:
$ oc get deployment -n openshift-machine-config-operator machine-config-operator -o json | jq -r '.spec.template.spec.containers[]|select(.name=="machine-config-operator")|.image'
quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:787a505ca594b0a727549353c503dec9233a9d3c2dcd6b64e3de5f998892a1d5
Note this co/machine-config -> deployment/machine-config-operator trick may not be feasible if we want to extend it to all cluster operators. But it should work as a hacky workaround to check only MCO.
We may claim that the status command is not designed for monitoring the multi-arch migration and suggest using oc adm upgrade instead. In that case, we can close this card as Obsolete/Won'tDo.
manifests.zip has the mockData/manifests for the status cmd that were taken during the migration.
oc#1920 started the work for the status command to recognize the migration and we need to extend the work to cover (the comments from Petr's review):
- "Target Version: 4.18.0-ec.3 (from 4.18.0-ec.3)" is confusing. We should indicate somehow that this is a multi-arch migration, or even better show the transition from the current arch to multi-arch, for example "Target Version: 4.18.0-ec.3 multi (from x86_64)", if we can get the origin arch from CV or somewhere else.
- We have spec.desiredUpdate.architecture since forever, and can use that being Multi as a partial hint.
MULTIARCH-4559 is adding tech-preview status properties around architecture in 4.18, but since they are tech-preview they may not be worth bothering with in oc code. Two history entries with the same version string but different digests is probably a reliable-enough heuristic, coupled with the spec-side hint.
- "Duration: 6m55s (Est. Time Remaining: 1h4m)": We will see whether we can find a simple way to handle this special case. I do not understand "the 97% completion will be reached so fast." as I am not familiar with the algorithm, but it seems acceptable to Petr that we show N/A for the migration.
- Node status like "All control plane nodes successfully updated to 4.18.0-ec.3" for the control plane and "ip-10-0-17-117.us-east-2.compute.internal Completed". It is technically hard to detect the transition during migration as MCO annotates only the version. This may become a separate card if it is too big to finish with the current one.
- "targetImagePullSpec := getMCOImagePullSpec(mcoDeployment)" should be computed just once. Currently it is recomputed in each iteration of the for loop. We should also add a comment explaining why we do it in this hacky way.
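The history heuristic mentioned in the review comments above (same version string, different digests, plus the spec-side architecture hint) could be sketched roughly as follows. The types are simplified, hypothetical stand-ins for the ClusterVersion fields, looksLikeMultiArchMigration is a hypothetical name, the pull specs in main are placeholders, and the sketch assumes status.history is ordered newest first.

```go
package main

import "fmt"

// UpdateHistory and ClusterVersion are simplified, hypothetical stand-ins
// for the ClusterVersion fields referenced in the review comments.
type UpdateHistory struct {
	Version string
	Image   string
}

type ClusterVersion struct {
	DesiredArchitecture string          // spec.desiredUpdate.architecture
	History             []UpdateHistory // status.history, assumed newest first
}

// looksLikeMultiArchMigration combines the spec-side hint (architecture set
// to Multi) with the history heuristic: the two most recent entries share a
// version string but point at different images.
func looksLikeMultiArchMigration(cv ClusterVersion) bool {
	if cv.DesiredArchitecture != "Multi" {
		return false
	}
	if len(cv.History) < 2 {
		return false
	}
	current, previous := cv.History[0], cv.History[1]
	return current.Version == previous.Version && current.Image != previous.Image
}

func main() {
	cv := ClusterVersion{
		DesiredArchitecture: "Multi",
		History: []UpdateHistory{
			{Version: "4.18.0-ec.3", Image: "example.invalid/release@sha256:multi"},
			{Version: "4.18.0-ec.3", Image: "example.invalid/release@sha256:single"},
		},
	}
	fmt.Println(looksLikeMultiArchMigration(cv))
}
```

This is only a heuristic: it says nothing about the origin architecture, so it would support a "multi-arch migration" label but not the "(from x86_64)" part of the proposed output.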