Bug
Resolution: Done-Errata
Normal
None
4.15, 4.16
Moderate
Yes
CFE Sprint 252
1
Rejected
False
N/A
Release Note Not Required
In Progress
Description of problem
sjenning noticed that HyperShift HostedCluster update CI is struggling with new DNS operators running against old ClusterRoles, since CFE-852 landed both the configInformers.Config().V1().FeatureGates() calls in the 0000_70_dns-operator_02-deployment.yaml operator and the matching FeatureGate RBAC in the 0000_70_dns-operator_00-cluster-role.yaml ClusterRole. In standalone clusters, the cluster-version operator ensures the ClusterRole is reconciled before bumping the operator Deployment, so all goes smoothly. In HyperShift, the HostedControlPlane controller rolls out the new Deployment in parallel with the cluster-version operator rolling out the new ClusterRole, and when the Deployment wins that race, there can be a few rounds of crash-looping like:
TestUpgradeControlPlane/Main/EnsureNoCrashingPods 0s
{Failed === RUN   TestUpgradeControlPlane/Main/EnsureNoCrashingPods
    util.go:488: Container dns-operator in pod dns-operator-687bd5d756-c48qm has a restartCount > 0 (3)
--- FAIL: TestUpgradeControlPlane/Main/EnsureNoCrashingPods (0.02s)
}
with pod logs like:
...
W0410 22:58:02.495248       1 reflector.go:535] github.com/openshift/client-go/config/informers/externalversions/factory.go:116: failed to list *v1.FeatureGate: featuregates.config.openshift.io is forbidden: User "system:serviceaccount:openshift-dns-operator:dns-operator" cannot list resource "featuregates" in API group "config.openshift.io" at the cluster scope
E0410 22:58:02.495277       1 reflector.go:147] github.com/openshift/client-go/config/informers/externalversions/factory.go:116: Failed to watch *v1.FeatureGate: failed to list *v1.FeatureGate: featuregates.config.openshift.io is forbidden: User "system:serviceaccount:openshift-dns-operator:dns-operator" cannot list resource "featuregates" in API group "config.openshift.io" at the cluster scope
time="2024-04-10T22:58:22Z" level=error msg="<nil>timed out waiting for FeatureGate detection"
time="2024-04-10T22:58:22Z" level=fatal msg="failed to create operator: timed out waiting for FeatureGate detection"
Eventually RBAC will catch up, and the cluster will heal. But the crash-looping fails the CI test-case, which is expecting a more elegant transition.
Version-Release number of selected component
Updates that cross CFE-852, e.g. 4.15 to new 4.16 nightlies.
How reproducible
Racy. Sometimes the CVO gets the ClusterRole bumped quickly enough for the Deployment bump to happen smoothly. I'm unclear on the odds of the race.
Steps to Reproduce
Run a bunch of HyperShift e2e.
Actual results
Racy failures for the TestUpgradeControlPlane/Main/EnsureNoCrashingPods test case.
Expected results
Reliable success for this test case, with smooth updates.
Additional info
There are a number of possible approaches to make HyperShift e2e happier about these updates. Personally I think we want something like OTA-951 long-term, so HyperShift would have the same "ClusterRole will be bumped first" handling that standalone is getting today. But that's a bigger architectural lift.

One simpler pivot to cover the current HyperShift approach would be to backport the RBAC additions to the 4.15.z ClusterRole and raise minor_min to push clusters through that newer 4.15.z or later, where they'd pick up the new RBAC, before they were recommended to head off to 4.16 releases that require the new RBAC to be in place. That 4.15.z 0000_70_dns-operator_00-cluster-role.yaml RBAC backport is what this ticket is asking for. Although if folks have even less invasive ideas for denoising these HyperShift updates, that would be great.
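For illustration, the backported rule would need to grant the verbs the informer uses. A minimal sketch, with the resource and API group taken from the "forbidden" error in the pod logs (assumed rule shape only, not the literal contents of 0000_70_dns-operator_00-cluster-role.yaml):

```yaml
# Sketch: FeatureGate access for the dns-operator ServiceAccount.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: openshift-dns-operator
rules:
- apiGroups:
  - config.openshift.io
  resources:
  - featuregates
  verbs:
  - get
  - list
  - watch
```

Whether the RBAC has landed on a given cluster can be checked with `oc auth can-i list featuregates.config.openshift.io --as=system:serviceaccount:openshift-dns-operator:dns-operator`.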
clones: OCPBUGS-32093 Accessing FeatureGates in 4.15-to-4.16 updates with 4.15 RBAC (Closed)
is depended on by: OCPBUGS-32093 Accessing FeatureGates in 4.15-to-4.16 updates with 4.15 RBAC (Closed)
links to: RHEA-2024:0041 OpenShift Container Platform 4.16.z bug fix update