Bug · Resolution: Unresolved · Normal · 4.20 · Quality / Stability / Reliability · Low
Description of problem
The cluster-version operator can be slow to update its RetrievedUpdates condition. For example, this tech-preview CI run failed on:
: [Serial][sig-cli] oc adm upgrade recommend When the update service has no recommendations runs successfully [Suite:openshift/conformance/serial]	19s
{  fail [github.com/openshift/origin/test/extended/cli/adm_upgrade/recommend.go:107]: Unexpected error:
      <*errors.errorString | 0xc007fc9920>:
      expected:
          warning: Cannot refresh available updates:
            Reason: NoChannel
            Message: The update channel has not been configured.

          Upstream update service: http://172.30.47.137:8000/graph
          Channel: test-channel

          No updates available. You may still upgrade to a specific release image with --to-image or wait for new updates to be available.
      to match regular expression:
          ...
But simultaneously claiming Channel: test-channel and The update channel has not been configured doesn't make sense.
Version-Release number of selected component
Seen in 4.20 CI, but the test-case that's turning it up didn't exist in 4.19, so the behavior could be older.
How reproducible
Sippy shows that test-case succeeding over 99% of the time, so whatever is going on seems rare.
Steps to Reproduce
1. Set up a custom update service (OTA-520), but don't point ClusterVersion's upstream at it yet, and clear the cluster's channel with oc adm upgrade channel.
2. Get an appropriate NoChannel reason in ClusterVersion's RetrievedUpdates condition.
3. Set the cluster's channel again with oc adm upgrade channel $ACTUAL_CHANNEL.
4. Patch upstream to point at the custom update service from (1). This is likely the racy bit, and you'll probably need to land this patch within milliseconds of the channel bump in order to trigger this issue (see the sketch after this list).
5. Give the cluster at least 16s to form opinions about the new channel.
6. Check ClusterVersion's RetrievedUpdates condition again.
For (2) and (6), you can use:
$ oc get -o jsonpath='{.status.conditions[?(.type=="RetrievedUpdates")]}{"\n"}' clusterversion version
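In case it helps to script the racy part, here's a minimal sketch of steps (3) through (6), assuming the update service from (1) is already serving a graph at $UPDATE_SERVICE_URL and that $ACTUAL_CHANNEL names a channel that service knows about (both variables are placeholders, not anything taken from the failing job):
# (3) and (4): bump the channel and patch upstream back-to-back; the suspected race window is probably only milliseconds wide.
$ oc adm upgrade channel "$ACTUAL_CHANNEL"
$ oc patch clusterversion version --type json -p '[{"op": "add", "path": "/spec/upstream", "value": "'"$UPDATE_SERVICE_URL"'"}]'
# (5): give the CVO time to react to the channel and upstream changes.
$ sleep 16
# (6): check whether RetrievedUpdates recovered, or is still stuck on NoChannel.
$ oc get -o jsonpath='{.status.conditions[?(.type=="RetrievedUpdates")]}{"\n"}' clusterversion version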
Actual results
{"lastTransitionTime":"...","message":"The update channel has not been configured","reason":"NoChannel","status":"False","type":"RetrievedUpdates"}
Expected results
{"lastTransitionTime":"...","status":"True","type":"RetrievedUpdates"}
Additional info
From the test-case stdout in the job I opened this bug with:
I0818 01:20:42.406557   52322 client.go:1022] Running 'oc --namespace=e2e-oc-adm-upgrade-recommend-2867 --kubeconfig=/tmp/kubeconfig-2727234894 adm upgrade channel test-channel'
warning: No channels known to be compatible with the current version "4.20.0-0.nightly-2025-08-17-232035"; unable to validate "test-channel". Setting the update channel to "test-channel" anyway.
I0818 01:20:42.536347   52322 client.go:1022] Running 'oc --namespace=e2e-oc-adm-upgrade-recommend-2867 --kubeconfig=/tmp/kubeconfig-2727234894 patch clusterversions.config.openshift.io version --type json -p [{"op": "add", "path": "/spec/upstream", "value": "http://172.30.47.137:8000/graph"}]'
clusterversion.config.openshift.io/version patched
I0818 01:20:58.722682   52322 client.go:1022] Running 'oc --namespace=e2e-oc-adm-upgrade-recommend-2867 --kubeconfig=/tmp/kubeconfig-2727234894 adm upgrade recommend'
[FAILED] in [It] - github.com/openshift/origin/test/extended/cli/adm_upgrade/recommend.go:107 @ 08/18/25 01:20:58.857
I0818 01:20:58.858116   52322 client.go:1022] Running 'oc --namespace=e2e-oc-adm-upgrade-recommend-2867 --kubeconfig=/tmp/kubeconfig-2727234894 adm upgrade channel '
warning: Clearing channel "test-channel"; cluster will no longer request available update recommendations.
So on the test-suite side, the timeline is:
- 1:20:42.406, set channel to test-channel.
- 1:20:42.536, set upstream to point to a local Pod serving a dummy update service.
- Waited 16s for the CVO to process those changes.
- 1:20:58.722, ran recommend and saw ClusterVersion still complaining about NoChannel.
During that time, the CVO logs (https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.20-e2e-vsphere-ovn-techpreview-serial/1957221994252472320/artifacts/e2e-vsphere-ovn-techpreview-serial/gather-extra/artifacts/pods/) have:
$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.20-e2e-vsphere-ovn-techpreview-serial/1957221994252472320/artifacts/e2e-vsphere-ovn-techpreview-serial/gather-extra/artifacts/pods/openshift-cluster-version_cluster-version-operator-86b5f6885b-mzm6l_cluster-version-operator.log | grep '0818 01:2[01]:.*\(cincinnati\|availableupdates\)'
I0818 01:20:18.894857       1 availableupdates.go:98] Available updates were recently retrieved, with less than 3m42.992944812s elapsed since 2025-08-18T01:16:36Z, will try later.
I0818 01:20:42.526149       1 availableupdates.go:77] Retrieving available updates again, because the channel has changed from "" to "test-channel"
I0818 01:20:42.529936       1 cincinnati.go:114] Using a root CA pool with 0 root CA subjects to request updates from https://api.openshift.com/api/upgrades_info/v1/graph?arch=amd64&channel=test-channel&id=1b8e4fd0-ab6d-4e19-8393-ec99ea639b0e&version=4.20.0-0.nightly-2025-08-17-232035
I0818 01:21:12.805171       1 availableupdates.go:398] Update service https://api.openshift.com/api/upgrades_info/v1/graph could not return available updates: VersionNotFound: currently reconciling cluster version 4.20.0-0.nightly-2025-08-17-232035 not found in the "test-channel" channel
I0818 01:21:12.805240       1 availableupdates.go:77] Retrieving available updates again, because the channel has changed from "test-channel" to ""
I0818 01:21:12.819094       1 availableupdates.go:98] Available updates were recently retrieved, with less than 3m42.992944812s elapsed since 2025-08-18T01:21:12Z, will try later.
So there's a 1:20:42.529 test-channel retrieval attempt, but it's using the default api.openshift.com update service, not our custom local Pod. And there doesn't seem to be a retry once upstream is set to point at the local Pod.
Way out at 01:32, I do see the CVO triggering a new fetch on an upstream change, although it's a different IP address for a different test-case:
$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.20-e2e-vsphere-ovn-techpreview-serial/1957221994252472320/artifacts/e2e-vsphere-ovn-techpreview-serial/gather-extra/artifacts/pods/openshift-cluster-version_cluster-version-operator-86b5f6885b-mzm6l_cluster-version-operator.log | grep upstream
I0818 01:32:03.778657       1 availableupdates.go:103] Retrieving available updates again, because the update service has changed from "" to "http://172.30.151.226:8000/graph" from ClusterVersion spec.upstream
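To double-check that nothing upstream-related fired during the failed test-case's window, the same log can be grepped with a narrower pattern; given that the full-log grep above only turns up the 01:32 hit, this should presumably come back empty:
$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.20-e2e-vsphere-ovn-techpreview-serial/1957221994252472320/artifacts/e2e-vsphere-ovn-techpreview-serial/gather-extra/artifacts/pods/openshift-cluster-version_cluster-version-operator-86b5f6885b-mzm6l_cluster-version-operator.log | grep '0818 01:2[01]:.*update service has changed'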
The bug here is that this test-case run failed to trigger a retrieval after the upstream bump, likely because of some kind of race between upstream-change detection and the channel-bump-induced retrieval attempt.
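For anyone attempting the reproduction above on a live cluster, it may help to watch the CVO's retrieval triggers directly while bumping the channel and upstream, to see whether the upstream change ever produces a fetch (the grep string comes from the log lines quoted above; deploy/cluster-version-operator is the usual CVO Deployment in openshift-cluster-version):
$ oc -n openshift-cluster-version logs -f deploy/cluster-version-operator | grep 'Retrieving available updates again'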