OpenShift Bugs / OCPBUGS-49321

Cluster API ClusterOperator should manage status.versions on unsupported platforms

      Description of problem

      On unsupported platforms, the Cluster API operator currently sets its ClusterOperator Available=True with a "Cluster API is not yet implemented on this platform" message. That's nice, but Available is only one of several conditions that the cluster-version operator (CVO) considers when deciding whether a ClusterOperator is sufficiently happy to proceed to later manifests in the ordered manifest-graph update flow. So on TechPreviewNoUpgrade clusters (where the cluster-api ClusterOperator resource exists) running on platforms that Cluster API does not yet support, all updates, including patch updates, currently get stuck, with the cluster-version operator waiting forever for the cluster-api ClusterOperator to bump its versions[name=operator].

      Version-Release number of selected component

      Seen in an Azure cluster updating from 4.15.12 to 4.15.42, but the relevant operator code is still current.

      How reproducible

      I haven't tried, but I'd assume 100%.

      Steps to Reproduce

      1. Install a TechPreviewNoUpgrade cluster on an infrastructure platform Cluster API doesn't yet support.
      2. Launch a patch update from 4.y.z to 4.y.z'. TechPreviewNoUpgrade makes the cluster Upgradeable=False, which will cause the CVO to reject attempts to update between minor versions (4.y to 4.(y+1)).
      3. Wait an hour or two for the update to complete.

      Actual results

      The update hangs, with oc adm upgrade's Progressing condition saying ClusterOperatorUpdating: Working towards ..., waiting on cluster-api.

      Expected results

      Completed update.

      Additional context

      In addition to managing Available here, the controller should also ensure Degraded=False (because with no Cluster API operands on that platform, there's nothing that could be degraded) and bump versions[name=operator] (because with no Cluster API operands on that platform, there's nothing that could be depending on the component version getting bumped). Hmm, or maybe I was misunderstanding the scope of ClusterOperatorStatusClient.SetStatusAvailable, because it seems to cover Available, Degraded, versions, and more. And it always sets the release version. So I'm not clear on where in the chain things were getting lost, but the release version comes in via the RELEASE_VERSION environment variable, and should percolate through from there, even on unsupported platforms.
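As a rough sketch of the status the paragraph above asks for, the following Go program builds Available=True, Degraded=False, and versions[name=operator] in one place, taking the version from the RELEASE_VERSION environment variable. The types and the `unsupportedPlatformStatus` helper are hypothetical and only illustrate the intended end state. They do not reproduce ClusterOperatorStatusClient.SetStatusAvailable or any real operator code.

```go
package main

import (
	"fmt"
	"os"
)

// Hypothetical, simplified stand-ins for ClusterOperator status fields.
type Condition struct {
	Type    string
	Status  string
	Message string
}

type OperandVersion struct {
	Name    string
	Version string
}

type ClusterOperatorStatus struct {
	Conditions []Condition
	Versions   []OperandVersion
}

// unsupportedPlatformStatus returns the complete status an operator could
// publish when it has no operands on the current platform: it is available,
// nothing can be degraded, and the operator version matches the release.
func unsupportedPlatformStatus(releaseVersion string) ClusterOperatorStatus {
	const msg = "Cluster API is not yet implemented on this platform"
	return ClusterOperatorStatus{
		Conditions: []Condition{
			{Type: "Available", Status: "True", Message: msg},
			{Type: "Degraded", Status: "False", Message: msg},
		},
		Versions: []OperandVersion{
			{Name: "operator", Version: releaseVersion},
		},
	}
}

func main() {
	// RELEASE_VERSION is the variable named in this bug report; an empty
	// value here would mean it never reached the operator process.
	release := os.Getenv("RELEASE_VERSION")
	if release == "" {
		fmt.Println("RELEASE_VERSION is not set; the version could not be bumped")
		return
	}
	status := unsupportedPlatformStatus(release)
	for _, c := range status.Conditions {
		fmt.Printf("%s=%s (%s)\n", c.Type, c.Status, c.Message)
	}
	for _, v := range status.Versions {
		fmt.Printf("versions[name=%s]=%s\n", v.Name, v.Version)
	}
}
```

Building all three pieces together means the unsupported-platform path cannot publish the conditions while leaving versions stale.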

      It also looks like CAPI grew Azure coverage in 4.18 via OCPCLOUD-1577 and capi-o#115, so if you want to reproduce in 4.18 or later, you'll need to try a provider that still lacks CAPI support. AzureStackCloud and Nutanix seem like they're still not supported.

      A workaround for anyone bitten by this is for the cluster admin to patch the status subresource of the cluster-api ClusterOperator and bump status.versions themselves, filling in for the action that the CAPI operator will automate once this bug is fixed.
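For the workaround, a small Go sketch can generate the JSON merge-patch body that bumps status.versions. The `versionsPatch` helper is hypothetical, and the target version is a placeholder the admin would replace with the release being updated to. Note that a JSON merge patch replaces the whole versions list, so this assumes operator is the only entry the ClusterOperator reports.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// versionsPatch returns a JSON merge-patch body that sets
// status.versions to a single operator entry at releaseVersion.
// A merge patch replaces arrays wholesale, so any other entries
// under status.versions would be dropped.
func versionsPatch(releaseVersion string) (string, error) {
	patch := map[string]interface{}{
		"status": map[string]interface{}{
			"versions": []map[string]string{
				{"name": "operator", "version": releaseVersion},
			},
		},
	}
	body, err := json.Marshal(patch)
	if err != nil {
		return "", err
	}
	return string(body), nil
}

func main() {
	// Placeholder target; use the version the update is working towards.
	body, err := versionsPatch("4.15.42")
	if err != nil {
		fmt.Println("building patch failed:", err)
		return
	}
	fmt.Println(body)
}
```

Assuming an oc client recent enough to support the --subresource flag on oc patch, the printed body could then be applied with something like `oc patch clusteroperator cluster-api --subresource=status --type=merge -p '<body>'`.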

              rh-ee-nbrubake Nolan Brubaker
              trking W. Trevor King
              Zhaohua Sun Zhaohua Sun