Description of problem
HyperShift currently seems to maintain only one version at a time in a FeatureGate resource's status. For example, in a HostedControlPlane that had been installed a while back and had recently been updated 4.14.37 > 4.14.38 > 4.14.39, the only version in FeatureGate status was 4.14.39:
$ jq -r '.status.featureGates[].version' featuregates.yaml
4.14.39
Compare that with standalone clusters, where FeatureGate status is appended to with each release. For example, in this 4.18.0-rc.0 to 4.18.0-rc.1 CI run:
$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/release-openshift-origin-installer-e2e-aws-upgrade/1865110488958898176/artifacts/e2e-aws-upgrade/must-gather.tar | tar -xOz quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-b7fd0a8ff4df55c00e9e4e676d8c06fad2222fe83282fbbea3dad3ff9aca1ebb/cluster-scoped-resources/config.openshift.io/featuregates/cluster.yaml | yaml2json | jq -r '.status.featureGates[].version'
4.18.0-rc.1
4.18.0-rc.0
The append approach allows consumers to transition gracefully over time, as each of them updates from the outgoing version to the incoming version. With the current HyperShift logic, there's instead a race between the FeatureGate status bump and the consuming-component bumps (a lookup check is sketched after this list):
- HCP running vA
- HostedControlPlane spec bumped to request vB, and vB control-plane operator launched.
- CPO (or some other HyperShift component?) pushes vB status to FeatureGate.
- All the vA components looking for vA in FeatureGate status break.
- Dangerous race period, hopefully the CPO doesn't get stuck here.
- CPO bumps the other components to vB.
- All the vB components looking for vB in FeatureGate status are happy.
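For context on the breaking step: the consumers only look for an entry whose version matches their own desired version (see the "missing desired version" errors under Actual results). A minimal sketch of that check, run against the same extracted resource as above; no output means components wanting that version fail:
$ jq -r --arg v 4.14.38 '.status.featureGates[] | select(.version == $v) | .version' featuregates.yaml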
In this bug, I'm asking for HyperShift to adopt the standalone approach of appending to FeatureGate status instead of dropping the outgoing version, at least until there's some assurance that the update to the incoming version has completely rolled out, so that this kind of race window is avoided. Standalone pruning removes versions that no longer exist in ClusterVersion history. Checking a long-lived standalone cluster I have access to, I see:
$ oc get -o json featuregate cluster | jq -r '.status.featureGates[].version'
4.18.0-ec.4
4.18.0-ec.3
...
4.14.0-ec.1
4.14.0-ec.0
$ oc get -o json featuregate cluster | jq -r '.status.featureGates[].version' | wc -l
27
so it seems like pruning is currently either non-existent or pretty relaxed.
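As a quick way to quantify that on a standalone cluster, this comparison lists FeatureGate status versions that no longer appear in ClusterVersion history, i.e. the entries pruning would be expected to have removed (a sketch, assuming oc access and a shell with process substitution):
$ comm -13 <(oc get -o json clusterversion version | jq -r '.status.history[].version' | sort -u) \
           <(oc get -o json featuregate cluster | jq -r '.status.featureGates[].version' | sort -u)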
Version-Release number of selected component
Seen in a 4.14.38 to 4.14.39 HostedCluster update. May or may not apply to more recent 4.y.
How reproducible
Unclear
Steps to Reproduce
- Install a vA HostedCluster.
- Watch the cluster FeatureGate's status (see the watch sketched after these steps).
- Update to vB.
- Wait for the update to complete.
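For step 2, something like this keeps a running view of the versions in status (a sketch; --watch re-emits the object on every change):
$ oc get featuregate cluster -o json --watch | jq -r '[.status.featureGates[].version] | join(", ")'
With the current HyperShift behavior the list flips from vA to vB in a single step; with the requested behavior it would grow to include both before vA is eventually pruned.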
Actual results
When vB is added to FeatureGate status, vA is dropped.
If the CPO gets stuck during the transition, some management-cluster-side pods (cloud-network-config-controller, cluster-network-operator, ingress-operator, cluster-storage-operator, etc.) crash-loop with logs like:
E1211 15:43:58.314619 1 simple_featuregate_reader.go:290] cluster failed with : unable to determine features: missing desired version "4.14.38" in featuregates.config.openshift.io/cluster
E1211 15:43:58.635080 1 simple_featuregate_reader.go:290] cluster failed with : unable to determine features: missing desired version "4.14.38" in featuregates.config.openshift.io/cluster
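To confirm the symptom on an affected management cluster, grepping the hosted control plane pods' logs works; a sketch, with $HCP_NAMESPACE standing in for the hosted control plane namespace:
$ oc -n "$HCP_NAMESPACE" logs deployment/cluster-network-operator | grep 'missing desired version'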
Expected results
vB is added to FeatureGate status early in the update, vA is preserved through most of the update, and vA is only removed once there are unlikely to be any remaining consumers (e.g. when the version is dropped from ClusterVersion history, if you want to match the current standalone handling on this).
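To make the expected semantics concrete, here is a jq sketch of the reconciliation I have in mind. It is illustrative only: the history list is a literal, the placeholder entry for the incoming version is version-only, and real entries also carry the enabled/disabled feature lists, which an actual implementation would populate:
$ jq --arg incoming 4.14.39 --argjson history '["4.14.39","4.14.38"]' '
    # keep existing entries whose version is still in ClusterVersion history
    [.status.featureGates[] | select(.version as $v | $history | index($v))] as $kept
    # prepend a (simplified, version-only) entry for the incoming version unless already present
    | if any($kept[]; .version == $incoming) then $kept
      else [{version: $incoming}] + $kept end
  ' featuregates.yaml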
Additional info
None yet.
- relates to RFE-6872: Surface control-plane operator errors in HostedControlPlane conditions (Backlog)