OpenShift Bugs / OCPBUGS-46379

HyperShift should preserve older versions when appending to FeatureGate status


    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major
    • None
    • 4.14
    • HyperShift

      Description of problem

      HyperShift currently seems to maintain only one version at a time in the status of a FeatureGate resource. For example, in a HostedControlPlane that had been installed a while back and recently updated 4.14.37 > 4.14.38 > 4.14.39, the only version in FeatureGate status was 4.14.39:

      $ jq -r '.status.featureGates[].version' featuregates.yaml
      4.14.39
      

      Compare that with standalone clusters, where a new entry is added to FeatureGate status with each release while older entries are kept. For example, in this 4.18.0-rc.0 to 4.18.0-rc.1 CI run:

      $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/release-openshift-origin-installer-e2e-aws-upgrade/1865110488958898176/artifacts/e2e-aws-upgrade/must-gather.tar | tar -xOz quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-b7fd0a8ff4df55c00e9e4e676d8c06fad2222fe83282fbbea3dad3ff9aca1ebb/cluster-scoped-resources/config.openshift.io/featuregates/cluster.yaml | yaml2json | jq -r '.status.featureGates[].version'
      4.18.0-rc.1
      4.18.0-rc.0
      

      The append approach allows consumers to gracefully transition over time, as they each update from the outgoing version to the incoming version. With the current HyperShift logic, there's a race between the FeatureGate status bump and the consuming component bumps:

      1. HCP running vA
      2. HostedControlPlane spec bumped to request vB, and vB control-plane operator launched.
      3. CPO (or some other HyperShift component?) pushes vB status to FeatureGate.
      4. All the vA components looking for vA in FeatureGate status break.
      5. Dangerous race period, hopefully the CPO doesn't get stuck here.
      6. CPO bumps the other components to vB.
      7. All the vB components looking for vB in FeatureGate status are happy.
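
The race in the steps above can be illustrated with a minimal sketch. It models FeatureGate status as the parsed JSON that the jq commands in this report read, and compares a replace-style write with an append-style write. All function names are hypothetical and chosen for illustration; this is not the HyperShift or library-go implementation, and the error text is only copied from the logs quoted in this report.

```python
# Hypothetical model: status is a dict shaped like .status of the FeatureGate resource.
# None of these helpers are real HyperShift or library-go functions.

def entry(version):
    """Build a minimal featureGates entry for one version."""
    return {"version": version, "enabled": [], "disabled": []}

def find_version(status, desired_version):
    """Look up the entry for desired_version, as a consuming component would."""
    for item in status.get("featureGates", []):
        if item.get("version") == desired_version:
            return item
    raise LookupError(
        f'missing desired version "{desired_version}" '
        "in featuregates.config.openshift.io/cluster"
    )

def replace_version(status, new_entry):
    """Replace-style write: only the incoming version survives."""
    return {"featureGates": [new_entry]}

def append_version(status, new_entry):
    """Append-style write: incoming version first, older versions preserved."""
    kept = [
        item for item in status.get("featureGates", [])
        if item.get("version") != new_entry["version"]
    ]
    return {"featureGates": [new_entry] + kept}

def consumers_ok(status, versions):
    """Report, per consumer version, whether its lookup succeeds."""
    result = {}
    for version in versions:
        try:
            find_version(status, version)
            result[version] = True
        except LookupError:
            result[version] = False
    return result

before = {"featureGates": [entry("4.14.38")]}
replaced = replace_version(before, entry("4.14.39"))
appended = append_version(before, entry("4.14.39"))

# During the race window, both vA (4.14.38) and vB (4.14.39) consumers are running.
print(consumers_ok(replaced, ["4.14.38", "4.14.39"]))  # {'4.14.38': False, '4.14.39': True}
print(consumers_ok(appended, ["4.14.38", "4.14.39"]))  # {'4.14.38': True, '4.14.39': True}
```

With the replace-style write, the vA consumers fail for as long as step 5 lasts. With the append-style write, there is no window in which either set of consumers is missing its version.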

      In this bug, I'm asking for HyperShift to adopt the standalone approach of appending to FeatureGate status instead of dropping the outgoing version, at least until there's some assurance that the update to the incoming version has completely rolled out, to avoid that kind of race window. Standalone pruning removes versions that no longer exist in ClusterVersion history. Checking a long-lived standalone cluster I have access to, I see:

      $ oc get -o json featuregate cluster | jq -r '.status.featureGates[].version'
      4.18.0-ec.4
      4.18.0-ec.3
      ...
      4.14.0-ec.1
      4.14.0-ec.0
      $ oc get -o json featuregate cluster | jq -r '.status.featureGates[].version' | wc -l
      27
      

      so it seems like pruning is currently either non-existent, or pretty relaxed.

      Version-Release number of selected component

      Seen in a 4.14.38 to 4.14.39 HostedCluster update. May or may not apply to more recent 4.y.

      How reproducible

      Unclear

      Steps to Reproduce

      1. Install vA HostedCluster.
      2. Watch the cluster FeatureGate's status.
      3. Update to vB.
      4. Wait for the update to complete.

      Actual results

      When vB is added to FeatureGate status, vA is dropped.

      If the CPO gets stuck during the transition, some management-cluster-side pods (cloud-network-config-controller, cluster-network-operator, ingress-operator, cluster-storage-operator, etc.) crash loop with logs like:

      E1211 15:43:58.314619       1 simple_featuregate_reader.go:290] cluster failed with : unable to determine features: missing desired version "4.14.38" in featuregates.config.openshift.io/cluster
      E1211 15:43:58.635080       1 simple_featuregate_reader.go:290] cluster failed with : unable to determine features: missing desired version "4.14.38" in featuregates.config.openshift.io/cluster
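
When triaging, it can help to cross-check the versions named in these log lines against the versions actually present in FeatureGate status (for example, the output of the jq command from the description). The following is a rough sketch; the helper names are hypothetical, and the regular expression assumes the message format shown in the two log lines above.

```python
import re

# Pattern derived from the log lines quoted in this report; helper names are illustrative.
MISSING_VERSION = re.compile(
    r'missing desired version "([^"]+)" in featuregates\.config\.openshift\.io/cluster'
)

def wanted_versions(log_lines):
    """Collect the versions that crash-looping consumers report as missing."""
    found = set()
    for line in log_lines:
        match = MISSING_VERSION.search(line)
        if match:
            found.add(match.group(1))
    return sorted(found)

def unsatisfied(log_lines, status_versions):
    """Return versions requested in the logs but absent from FeatureGate status."""
    present = set(status_versions)
    return [v for v in wanted_versions(log_lines) if v not in present]

logs = [
    'E1211 15:43:58.314619       1 simple_featuregate_reader.go:290] cluster failed with : '
    'unable to determine features: missing desired version "4.14.38" in '
    "featuregates.config.openshift.io/cluster",
    'E1211 15:43:58.635080       1 simple_featuregate_reader.go:290] cluster failed with : '
    'unable to determine features: missing desired version "4.14.38" in '
    "featuregates.config.openshift.io/cluster",
]

print(unsatisfied(logs, ["4.14.39"]))             # ['4.14.38']
print(unsatisfied(logs, ["4.14.39", "4.14.38"]))  # []
```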
      

      Expected results

      vB is added to FeatureGate status early in the update, and vA is preserved through most of the update. vA is removed only once it seems there are no more consumers (for example, when the version is dropped from ClusterVersion history, if you want to match the current standalone handling).
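
One way to picture the requested behaviour is a single reconcile step that adds the incoming version and prunes only versions that have left the version history. This is a sketch under stated assumptions: `reconcile_feature_gates` and its parameters are hypothetical names, status is modelled as parsed JSON, and `history_versions` stands in for whatever version history HyperShift would consult (ClusterVersion history in the standalone case).

```python
# Hypothetical sketch of append-then-prune; not the HyperShift or standalone implementation.

def reconcile_feature_gates(status, incoming, history_versions):
    """Add the incoming entry and keep older entries still named in history.

    status: dict shaped like .status of the FeatureGate resource.
    incoming: featureGates entry for the version being rolled out (vB).
    history_versions: versions still present in the cluster's version history.
    """
    keep = set(history_versions) | {incoming["version"]}
    merged = [incoming]  # newest first, matching the ordering shown in this report
    for item in status.get("featureGates", []):
        if item["version"] != incoming["version"] and item["version"] in keep:
            merged.append(item)
    return {"featureGates": merged}

def versions(status):
    """Equivalent of jq -r '.status.featureGates[].version' on this model."""
    return [item["version"] for item in status["featureGates"]]

current = {
    "featureGates": [
        {"version": "4.14.38", "enabled": [], "disabled": []},
        {"version": "4.14.37", "enabled": [], "disabled": []},
    ]
}
incoming = {"version": "4.14.39", "enabled": [], "disabled": []}

# 4.14.37 has dropped out of history in this example, so only it is pruned.
updated = reconcile_feature_gates(current, incoming, ["4.14.39", "4.14.38"])
print(versions(updated))  # ['4.14.39', '4.14.38']
```

Running the same step again with an unchanged history leaves the status as it is, so it could be applied on every reconcile rather than only at the start of an update.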

      Additional info

      None yet.

              Unassigned Unassigned
              trking W. Trevor King
              Jie Zhao Jie Zhao