Bug
Resolution: Done-Errata
Normal
None
4.15, 4.16
Moderate
Yes
CFE Sprint 252
1
Rejected
False
N/A
Release Note Not Required
In Progress
Description of problem
sjenning noticed that HyperShift HostedCluster update CI is struggling with new DNS operators running against old ClusterRoles, since CFE-852 landed both the configInformers.Config().V1().FeatureGates() calls in the 0000_70_dns-operator_02-deployment.yaml operator and the matching FeatureGate RBAC in the 0000_70_dns-operator_00-cluster-role.yaml ClusterRole. In standalone clusters, the cluster-version operator ensures the ClusterRole is reconciled before bumping the operator Deployment, so all goes smoothly. In HyperShift, the HostedControlPlane controller rolls out the new Deployment in parallel with the cluster-version operator rolling out the new ClusterRole, and when the Deployment wins that race, there can be a few rounds of crash-looping like:
TestUpgradeControlPlane/Main/EnsureNoCrashingPods 0s
{Failed === RUN   TestUpgradeControlPlane/Main/EnsureNoCrashingPods
    util.go:488: Container dns-operator in pod dns-operator-687bd5d756-c48qm has a restartCount > 0 (3)
--- FAIL: TestUpgradeControlPlane/Main/EnsureNoCrashingPods (0.02s)
}
with pod logs like:
...
W0410 22:58:02.495248       1 reflector.go:535] github.com/openshift/client-go/config/informers/externalversions/factory.go:116: failed to list *v1.FeatureGate: featuregates.config.openshift.io is forbidden: User "system:serviceaccount:openshift-dns-operator:dns-operator" cannot list resource "featuregates" in API group "config.openshift.io" at the cluster scope
E0410 22:58:02.495277       1 reflector.go:147] github.com/openshift/client-go/config/informers/externalversions/factory.go:116: Failed to watch *v1.FeatureGate: failed to list *v1.FeatureGate: featuregates.config.openshift.io is forbidden: User "system:serviceaccount:openshift-dns-operator:dns-operator" cannot list resource "featuregates" in API group "config.openshift.io" at the cluster scope
time="2024-04-10T22:58:22Z" level=error msg="<nil>timed out waiting for FeatureGate detection"
time="2024-04-10T22:58:22Z" level=fatal msg="failed to create operator: timed out waiting for FeatureGate detection"
Eventually RBAC will catch up, and the cluster will heal. But the crash-looping fails the CI test-case, which is expecting a more elegant transition.
Version-Release number of selected component
Updates that cross CFE-852, e.g. 4.15 to new 4.16 nightlies.
How reproducible
Racy. Sometimes the CVO gets the ClusterRole bumped quickly enough for the Deployment bump to happen smoothly. I'm unclear on the odds of the race.
Steps to Reproduce
Run a bunch of HyperShift e2e.
Actual results
Racy failures for the TestUpgradeControlPlane/Main/EnsureNoCrashingPods test case.
Expected results
Reliable success for this test case, with smooth updates.
Additional info
There are a number of possible approaches to make HyperShift e2e happier about these updates. Personally I think we want something like OTA-951 long-term, so HyperShift would have the same "ClusterRole will be bumped first" handling that standalone is getting today. But that's a bigger architectural lift.

One simpler pivot to cover the current HyperShift approach would be to backport the RBAC additions to the 4.15.z ClusterRole and raise minor_min to push clusters through that newer 4.15.z or later, where they'd pick up the new RBAC, before they were recommended to head off to 4.16 releases that require the new RBAC to be in place. That 4.15.z 0000_70_dns-operator_00-cluster-role.yaml RBAC backport is what this ticket is asking for. Although if folks have even less invasive ideas for denoising these HyperShift updates, that would be great.
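For illustration, the backported rule would need to grant the verbs the informer uses. A minimal sketch, with the resource and API group taken from the "forbidden" error in the pod logs (assumed rule shape only, not the literal contents of 0000_70_dns-operator_00-cluster-role.yaml):

```yaml
# Sketch: FeatureGate access for the dns-operator ServiceAccount.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: openshift-dns-operator
rules:
- apiGroups:
  - config.openshift.io
  resources:
  - featuregates
  verbs:
  - get
  - list
  - watch
```

Whether the RBAC has landed on a given cluster can be checked with `oc auth can-i list featuregates.config.openshift.io --as=system:serviceaccount:openshift-dns-operator:dns-operator`.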
clones: OCPBUGS-32093 Accessing FeatureGates in 4.15-to-4.16 updates with 4.15 RBAC (Closed)
is depended on by: OCPBUGS-32093 Accessing FeatureGates in 4.15-to-4.16 updates with 4.15 RBAC (Closed)
links to: RHEA-2024:0041 OpenShift Container Platform 4.16.z bug fix update