OpenShift Bugs / OCPBUGS-61289

Cluster-version operator should always attempt retrieval soon after an 'upstream' config change

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Normal
    • Affects Version: 4.20
    • Severity: Low

      Description of problem

      The cluster-version operator can be slow to update its RetrievedUpdates condition. For example, this tech-preview CI run failed on:

      : [Serial][sig-cli] oc adm upgrade recommend When the update service has no recommendations runs successfully [Suite:openshift/conformance/serial]	19s
      {  fail [github.com/openshift/origin/test/extended/cli/adm_upgrade/recommend.go:107]: Unexpected error:
          <*errors.errorString | 0xc007fc9920>: 
          expected:
            warning: Cannot refresh available updates:
              Reason: NoChannel
              Message: The update channel has not been configured.
            
            Upstream update service: http://172.30.47.137:8000/graph
            Channel: test-channel
            No updates available. You may still upgrade to a specific release image with --to-image or wait for new updates to be available.
          to match regular expression:
      ...
      

      But simultaneously claiming 'Channel: test-channel' and 'The update channel has not been configured' doesn't make sense.
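
      For anyone poking at a live cluster, the contradictory pieces of state can be pulled side by side with a jsonpath query along these lines (just one way to slice the ClusterVersion object):

      $ oc get clusterversion version -o jsonpath='{.spec.channel}{"\n"}{.status.conditions[?(@.type=="RetrievedUpdates")].message}{"\n"}'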

      Version-Release number of selected component

      Seen in 4.20 CI, but the test-case that's turning it up didn't exist in 4.19, so the behavior could be older.

      How reproducible

      Sippy shows that test-case succeeding over 99% of the time, so whatever is going on seems rare.

      Steps to Reproduce

      1. Set up a custom update service (OTA-520), but don't point ClusterVersion upstream at it yet.
      2. Clear the cluster's channel with oc adm upgrade channel
      3. Get an appropriate NoChannel reason in ClusterVersion's RetrievedUpdates condition
      4. Set the cluster's channel again with oc adm upgrade channel $ACTUAL_CHANNEL
      5. Patch upstream to point at the custom update service from (1). This is likely the racy bit, and you'll probably need to land this patch within milliseconds of the channel bump in order to trigger this issue.
      6. Give the cluster at least 16s to form opinions about the new channel
      7. Check ClusterVersion's RetrievedUpdates condition again

      For (3) and (7), you can use:

      $ oc get -o jsonpath='{.status.conditions[?(.type=="RetrievedUpdates")]}{"\n"}' clusterversion version
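
      Stringing the steps together, a rough, untested sketch of the sequence looks like this, with $UPDATE_SERVICE_URI standing in for the custom update service from (1) and $ACTUAL_CHANNEL for the channel (both are placeholders, not values from the report); the back-to-back channel and upstream changes in the middle are the part that seems to matter:

      $ oc adm upgrade channel                    # (2) clear the channel
      $ oc get clusterversion version -o jsonpath='{.status.conditions[?(@.type=="RetrievedUpdates")]}{"\n"}'  # (3) expect reason NoChannel
      $ oc adm upgrade channel "$ACTUAL_CHANNEL"  # (4) set the channel again
      $ oc patch clusterversion version --type json -p '[{"op": "add", "path": "/spec/upstream", "value": "'"$UPDATE_SERVICE_URI"'"}]'  # (5) immediately point upstream at the custom service
      $ sleep 16                                  # (6) give the CVO time to react
      $ oc get clusterversion version -o jsonpath='{.status.conditions[?(@.type=="RetrievedUpdates")]}{"\n"}'  # (7) check the condition again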
      
      Actual results
      {"lastTransitionTime":"...","message":"The update channel has not been configured","reason":"NoChannel","status":"False","type":"RetrievedUpdates"}
      
      Expected results
      {"lastTransitionTime":"...","status":"True","type":"RetrievedUpdates"}
      
      Additional info

      From the test-case stdout in the job I opened this bug with:

      I0818 01:20:42.406557 52322 client.go:1022] Running 'oc --namespace=e2e-oc-adm-upgrade-recommend-2867 --kubeconfig=/tmp/kubeconfig-2727234894 adm upgrade channel test-channel'
      warning: No channels known to be compatible with the current version "4.20.0-0.nightly-2025-08-17-232035"; unable to validate "test-channel". Setting the update channel to "test-channel" anyway.
      I0818 01:20:42.536347 52322 client.go:1022] Running 'oc --namespace=e2e-oc-adm-upgrade-recommend-2867 --kubeconfig=/tmp/kubeconfig-2727234894 patch clusterversions.config.openshift.io version --type json -p [{"op": "add", "path": "/spec/upstream", "value": "http://172.30.47.137:8000/graph"}]'
      clusterversion.config.openshift.io/version patched
      I0818 01:20:58.722682 52322 client.go:1022] Running 'oc --namespace=e2e-oc-adm-upgrade-recommend-2867 --kubeconfig=/tmp/kubeconfig-2727234894 adm upgrade recommend'
        [FAILED] in [It] - github.com/openshift/origin/test/extended/cli/adm_upgrade/recommend.go:107 @ 08/18/25 01:20:58.857
      I0818 01:20:58.858116 52322 client.go:1022] Running 'oc --namespace=e2e-oc-adm-upgrade-recommend-2867 --kubeconfig=/tmp/kubeconfig-2727234894 adm upgrade channel '
      warning: Clearing channel "test-channel"; cluster will no longer request available update recommendations.
      

      So on the test-suite side, the timeline is:

      • 1:20:42.406, set channel to test-channel.
      • 1:20:42.536, set upstream to point to a local Pod serving a dummy update service.
      • Waited 16s for the CVO to process those changes.
      • 1:20:58.722, ran recommend and saw ClusterVersion still complaining about NoChannel.

      During that time, the CVO logs (https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.20-e2e-vsphere-ovn-techpreview-serial/1957221994252472320/artifacts/e2e-vsphere-ovn-techpreview-serial/gather-extra/artifacts/pods/) have:

      $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.20-e2e-vsphere-ovn-techpreview-serial/1957221994252472320/artifacts/e2e-vsphere-ovn-techpreview-serial/gather-extra/artifacts/pods/openshift-cluster-version_cluster-version-operator-86b5f6885b-mzm6l_cluster-version-operator.log | grep '0818 01:2[01]:.*\(cincinnati\|availableupdates\)'
      I0818 01:20:18.894857       1 availableupdates.go:98] Available updates were recently retrieved, with less than 3m42.992944812s elapsed since 2025-08-18T01:16:36Z, will try later.
      I0818 01:20:42.526149       1 availableupdates.go:77] Retrieving available updates again, because the channel has changed from "" to "test-channel"
      I0818 01:20:42.529936       1 cincinnati.go:114] Using a root CA pool with 0 root CA subjects to request updates from https://api.openshift.com/api/upgrades_info/v1/graph?arch=amd64&channel=test-channel&id=1b8e4fd0-ab6d-4e19-8393-ec99ea639b0e&version=4.20.0-0.nightly-2025-08-17-232035
      I0818 01:21:12.805171       1 availableupdates.go:398] Update service https://api.openshift.com/api/upgrades_info/v1/graph could not return available updates: VersionNotFound: currently reconciling cluster version 4.20.0-0.nightly-2025-08-17-232035 not found in the "test-channel" channel
      I0818 01:21:12.805240       1 availableupdates.go:77] Retrieving available updates again, because the channel has changed from "test-channel" to ""
      I0818 01:21:12.819094       1 availableupdates.go:98] Available updates were recently retrieved, with less than 3m42.992944812s elapsed since 2025-08-18T01:21:12Z, will try later.
      

      So there's a 1:20:42.529 test-channel retrieval attempt, but it's using the default api.openshift.com upstream, and not our custom local Pod. And there doesn't seem to be a retry when the local Pod's upstream is set.
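
      As a sanity check, grepping the same CVO log for the test's update-service address (172.30.47.137, from the failure output above) shows whether that Pod was queried at all during the 01:20 window:

      $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.20-e2e-vsphere-ovn-techpreview-serial/1957221994252472320/artifacts/e2e-vsphere-ovn-techpreview-serial/gather-extra/artifacts/pods/openshift-cluster-version_cluster-version-operator-86b5f6885b-mzm6l_cluster-version-operator.log | grep '172.30.47.137'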

      Way out at 01:32, I do see the CVO triggering a new fetch on an upstream change, although it's a different IP address for a different test-case:

      $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.20-e2e-vsphere-ovn-techpreview-serial/1957221994252472320/artifacts/e2e-vsphere-ovn-techpreview-serial/gather-extra/artifacts/pods/openshift-cluster-version_cluster-version-operator-86b5f6885b-mzm6l_cluster-version-operator.log | grep upstream
      I0818 01:32:03.778657       1 availableupdates.go:103] Retrieving available updates again, because the update service has changed from "" to "http://172.30.151.226:8000/graph" from ClusterVersion spec.upstream
      

      The bug here is that this test-case run failed to trigger a retrieval after the upstream bump, likely because of some kind of race between upstream-change detection and the channel-bump-induced retrieval attempt.
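
      For anyone reproducing this or verifying a fix, the CVO log line to look for after an upstream change is the one from the 01:32 entry above; something along these lines (the command is illustrative, not from the original report):

      $ oc -n openshift-cluster-version logs deploy/cluster-version-operator | grep 'update service has changed'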

              Assignee: W. Trevor King (trking)
              Reporter: W. Trevor King (trking)
              QA Contact: Jia Liu