Description of problem
HyperShift currently seems to maintain only one version at a time in a FeatureGate resource's status. For example, in a HostedControlPlane that had been installed a while back and had recently been updated 4.14.37 > 4.14.38 > 4.14.39, the only version in FeatureGate status was 4.14.39:
$ jq -r '.status.featureGates[].version' featuregates.yaml
4.14.39
Compare that with standalone clusters, where FeatureGate status is appended to with each release. For example, in this 4.18.0-rc.0 to 4.18.0-rc.1 CI run:
$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/release-openshift-origin-installer-e2e-aws-upgrade/1865110488958898176/artifacts/e2e-aws-upgrade/must-gather.tar | tar -xOz quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-b7fd0a8ff4df55c00e9e4e676d8c06fad2222fe83282fbbea3dad3ff9aca1ebb/cluster-scoped-resources/config.openshift.io/featuregates/cluster.yaml | yaml2json | jq -r '.status.featureGates[].version'
4.18.0-rc.1
4.18.0-rc.0
The append approach allows consumers to transition gracefully over time, as each of them updates from the outgoing version to the incoming version. With the current HyperShift logic, there's instead a race between the FeatureGate status bump and the consuming-component bumps (a lookup check is sketched after this list):
- HCP running vA
- HostedControlPlane spec bumped to request vB, and vB control-plane operator launched.
- CPO (or some other HyperShift component?) pushes vB status to FeatureGate.
- All the vA components looking for vA in FeatureGate status break.
- Dangerous race period, hopefully the CPO doesn't get stuck here.
- CPO bumps the other components to vB.
- All the vB components looking for vB in FeatureGate status are happy.
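For context on the breaking step: the consumers only look for an entry whose version matches their own desired version (see the "missing desired version" errors under Actual results). A minimal sketch of that check, run against the same extracted resource as above; no output means components wanting that version fail:
$ jq -r --arg v 4.14.38 '.status.featureGates[] | select(.version == $v) | .version' featuregates.yaml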
In this bug, I'm asking for HyperShift to adopt the standalone approach of appending to FeatureGate status instead of dropping the outgoing version, at least until there's some assurance that the update to the incoming version has completely rolled out, so that this kind of race window is avoided. Standalone pruning removes versions that no longer exist in ClusterVersion history. Checking a long-lived standalone cluster I have access to, I see:
$ oc get -o json featuregate cluster | jq -r '.status.featureGates[].version'
4.18.0-ec.4
4.18.0-ec.3
...
4.14.0-ec.1
4.14.0-ec.0
$ oc get -o json featuregate cluster | jq -r '.status.featureGates[].version' | wc -l
27
so it seems like pruning is currently either non-existent or pretty relaxed.
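As a quick way to quantify that on a standalone cluster, this comparison lists FeatureGate status versions that no longer appear in ClusterVersion history, i.e. the entries pruning would be expected to have removed (a sketch, assuming oc access and a shell with process substitution):
$ comm -13 <(oc get -o json clusterversion version | jq -r '.status.history[].version' | sort -u) \
           <(oc get -o json featuregate cluster | jq -r '.status.featureGates[].version' | sort -u)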
Version-Release number of selected component
Seen in a 4.14.38 to 4.14.39 HostedCluster update. May or may not apply to more recent 4.y.
How reproducible
Unclear
Steps to Reproduce
- Install a vA HostedCluster.
- Watch the cluster FeatureGate's status (see the watch sketched after these steps).
- Update to vB.
- Wait for the update to complete.
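For step 2, something like this keeps a running view of the versions in status (a sketch; --watch re-emits the object on every change):
$ oc get featuregate cluster -o json --watch | jq -r '[.status.featureGates[].version] | join(", ")'
With the current HyperShift behavior the list flips from vA to vB in a single step; with the requested behavior it would grow to include both before vA is eventually pruned.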
Actual results
When vB is added to FeatureGate status, vA is dropped.
If the CPO gets stuck during the transition, some management-cluster-side pods (cloud-network-config-controller, cluster-network-operator, ingress-operator, cluster-storage-operator, etc.) crash-loop with logs like:
E1211 15:43:58.314619 1 simple_featuregate_reader.go:290] cluster failed with : unable to determine features: missing desired version "4.14.38" in featuregates.config.openshift.io/cluster
E1211 15:43:58.635080 1 simple_featuregate_reader.go:290] cluster failed with : unable to determine features: missing desired version "4.14.38" in featuregates.config.openshift.io/cluster
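To confirm the symptom on an affected management cluster, grepping the hosted control plane pods' logs works; a sketch, with $HCP_NAMESPACE standing in for the hosted control plane namespace:
$ oc -n "$HCP_NAMESPACE" logs deployment/cluster-network-operator | grep 'missing desired version'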
Expected results
vB is added to FeatureGate status early in the update, vA is preserved through most of the update, and vA is only removed once there are unlikely to be any remaining consumers (e.g. when the version is dropped from ClusterVersion history, if you want to match the current standalone handling on this).
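To make the expected semantics concrete, here is a jq sketch of the reconciliation I have in mind. It is illustrative only: the history list is a literal, the placeholder entry for the incoming version is version-only, and real entries also carry the enabled/disabled feature lists, which an actual implementation would populate:
$ jq --arg incoming 4.14.39 --argjson history '["4.14.39","4.14.38"]' '
    # keep existing entries whose version is still in ClusterVersion history
    [.status.featureGates[] | select(.version as $v | $history | index($v))] as $kept
    # prepend a (simplified, version-only) entry for the incoming version unless already present
    | if any($kept[]; .version == $incoming) then $kept
      else [{version: $incoming}] + $kept end
  ' featuregates.yaml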
Additional info
None yet.
- relates to RFE-6872: Surface control-plane operator errors in HostedControlPlane conditions (Backlog)