OpenShift Bugs: OCPBUGS-32093

Accessing FeatureGates in 4.15-to-4.16 updates with 4.15 RBAC


    • Moderate
    • Yes
    • CFE Sprint 252
    • 1
    • Rejected
    • False
      * {product-title} {product-version} adds a permission to the role-based access control (RBAC) rule so that the DNS Operator can read the `featureGates` resource. Without this permission, an upgrade operation to a later version of {product-title} could fail. (link:https://issues.redhat.com/browse/OCPBUGS-32093[*OCPBUGS-32093*])
    • Enhancement
    • Done

      Description of problem

      sjenning noticed that HyperShift HostedCluster update CI is struggling with new DNS operators running against old ClusterRoles. CFE-852 landed two changes together: configInformers.Config().V1().FeatureGates() calls in the operator deployed by 0000_70_dns-operator_02-deployment.yaml, and FeatureGate RBAC in the 0000_70_dns-operator_00-cluster-role.yaml ClusterRole. In standalone clusters, the cluster-version operator ensures that the ClusterRole is reconciled before bumping the operator Deployment, so all goes smoothly. In HyperShift, the HostedControlPlane controller rolls out the new Deployment in parallel with the cluster-version operator rolling out the new ClusterRole. When the Deployment wins that race, there can be a few rounds of crash-looping like:

      : TestUpgradeControlPlane/Main/EnsureNoCrashingPods	0s
      {Failed  === RUN   TestUpgradeControlPlane/Main/EnsureNoCrashingPods
          util.go:488: Container dns-operator in pod dns-operator-687bd5d756-c48qm has a restartCount > 0 (3)
              --- FAIL: TestUpgradeControlPlane/Main/EnsureNoCrashingPods (0.02s)
      }
      

      with pod logs like:

      ...
      W0410 22:58:02.495248       1 reflector.go:535] github.com/openshift/client-go/config/informers/externalversions/factory.go:116: failed to list *v1.FeatureGate: featuregates.config.openshift.io is forbidden: User "system:serviceaccount:openshift-dns-operator:dns-operator" cannot list resource "featuregates" in API group "config.openshift.io" at the cluster scope
      E0410 22:58:02.495277       1 reflector.go:147] github.com/openshift/client-go/config/informers/externalversions/factory.go:116: Failed to watch *v1.FeatureGate: failed to list *v1.FeatureGate: featuregates.config.openshift.io is forbidden: User "system:serviceaccount:openshift-dns-operator:dns-operator" cannot list resource "featuregates" in API group "config.openshift.io" at the cluster scope
      time="2024-04-10T22:58:22Z" level=error msg="<nil>timed out waiting for FeatureGate detection"
      time="2024-04-10T22:58:22Z" level=fatal msg="failed to create operator: timed out waiting for FeatureGate detection"
      

      Eventually RBAC will catch up, and the cluster will heal. But the crash-looping fails the CI test case, which expects a smoother transition.
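
      The forbidden errors above come down to a rule lookup: the old ClusterRole has no rule covering "featuregates" in "config.openshift.io", so the informer's list and watch calls are denied. A minimal sketch of that lookup follows. It uses a local PolicyRule type that only mirrors the general shape of an RBAC rule, not the real Kubernetes API types. The contents of rulesBefore are a hypothetical placeholder rather than the actual 4.15 ClusterRole, and the exact verb set in rulesAfter is an assumption (the logs only confirm that list and watch are needed).

```go
package main

import "fmt"

// PolicyRule is a simplified local stand-in for the shape of an RBAC rule.
// It is not the real k8s.io/api/rbac/v1 type.
type PolicyRule struct {
	APIGroups []string
	Resources []string
	Verbs     []string
}

// contains reports whether list includes want or the "*" wildcard.
func contains(list []string, want string) bool {
	for _, s := range list {
		if s == want || s == "*" {
			return true
		}
	}
	return false
}

// Allows reports whether any rule grants verb on resource in apiGroup.
func Allows(rules []PolicyRule, apiGroup, resource, verb string) bool {
	for _, r := range rules {
		if contains(r.APIGroups, apiGroup) &&
			contains(r.Resources, resource) &&
			contains(r.Verbs, verb) {
			return true
		}
	}
	return false
}

// rulesBefore is a hypothetical placeholder for a ClusterRole that lacks
// FeatureGate access; the real 4.15 rule set is not reproduced here.
var rulesBefore = []PolicyRule{
	{APIGroups: []string{"config.openshift.io"}, Resources: []string{"infrastructures"}, Verbs: []string{"get"}},
}

// rulesAfter adds FeatureGate read access. The verb set is an assumption.
var rulesAfter = append(append([]PolicyRule(nil), rulesBefore...), PolicyRule{
	APIGroups: []string{"config.openshift.io"},
	Resources: []string{"featuregates"},
	Verbs:     []string{"get", "list", "watch"},
})

func main() {
	for _, verb := range []string{"list", "watch"} {
		fmt.Printf("%s featuregates: before=%v after=%v\n", verb,
			Allows(rulesBefore, "config.openshift.io", "featuregates", verb),
			Allows(rulesAfter, "config.openshift.io", "featuregates", verb))
	}
}
```

      With the placeholder rule sets, the lookup denies both verbs before the RBAC addition and allows them afterwards, which corresponds to the window in which a new operator pod running against the old ClusterRole crash-loops.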

      Version-Release number of selected component

      Updates that cross CFE-852, e.g. 4.15 to new 4.16 nightlies.

      How reproducible

      Racy. Sometimes the CVO gets the ClusterRole bumped quickly enough for the Deployment bump to happen smoothly. I'm unclear on odds for the race.
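
      For contrast, the standalone ordering can be pictured as a simple sort. The sketch below assumes that manifests for a component are applied in lexical filename order, which is consistent with the 00-cluster-role and 02-deployment prefixes but is a simplification of how the cluster-version operator actually schedules work. ApplyOrder and AppliedBefore are hypothetical helper names.

```go
package main

import (
	"fmt"
	"sort"
)

// ApplyOrder returns a sorted copy of the manifest filenames.
// Assumption: lexical filename order approximates the apply order.
func ApplyOrder(manifests []string) []string {
	ordered := append([]string(nil), manifests...)
	sort.Strings(ordered)
	return ordered
}

// AppliedBefore reports whether first precedes second in ordered.
// It returns false if either name is missing.
func AppliedBefore(ordered []string, first, second string) bool {
	i, j := -1, -1
	for idx, name := range ordered {
		switch name {
		case first:
			i = idx
		case second:
			j = idx
		}
	}
	return i >= 0 && j >= 0 && i < j
}

func main() {
	ordered := ApplyOrder([]string{
		"0000_70_dns-operator_02-deployment.yaml",
		"0000_70_dns-operator_00-cluster-role.yaml",
	})
	for _, name := range ordered {
		fmt.Println(name)
	}
	fmt.Println("ClusterRole first:", AppliedBefore(ordered,
		"0000_70_dns-operator_00-cluster-role.yaml",
		"0000_70_dns-operator_02-deployment.yaml"))
}
```

      In HyperShift there is no such single ordered pass: the two manifests are rolled out by different controllers, so neither is guaranteed to land first.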

      Steps to Reproduce

      Run a bunch of HyperShift e2e.

      Actual results

      Racy failures for the TestUpgradeControlPlane/Main/EnsureNoCrashingPods test case.

      Expected results

      Reliable success for this test case, with smooth updates.

      Additional info

      There are a number of possible approaches to make HyperShift e2e happier about these updates. Personally I think we want something like OTA-951 long-term, so HyperShift would have the same "ClusterRole will be bumped first" handling that standalone is getting today. But that's a bigger architectural lift. One simpler pivot to cover the current HyperShift approach would be to backport the RBAC additions to the 4.15.z ClusterRole and raise minor_min to push clusters through that newer 4.15.z or later, where they'd pick up the new RBAC, before they are recommended to head off to 4.16 releases that require the new RBAC to be in place. That 4.15.z 0000_70_dns-operator_00-cluster-role.yaml RBAC backport is what this ticket is asking for. Although if folks have even less invasive ideas for denoising these HyperShift updates, that would be great.
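
      The minor_min idea amounts to a version floor: a 4.15 cluster is only recommended to move to 4.16 once it is at or above the 4.15.z that carries the backported RBAC. A minimal sketch of that comparison follows. The floor value "4.15.100" is a hypothetical placeholder, not the real release that would carry the backport, and the parser only handles plain MAJOR.MINOR.PATCH strings, so nightly or other pre-release version strings would be rejected.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// Version is a plain MAJOR.MINOR.PATCH triple.
type Version struct{ Major, Minor, Patch int }

// Parse converts "4.15.3" into a Version. Pre-release suffixes are rejected.
func Parse(s string) (Version, error) {
	parts := strings.Split(s, ".")
	if len(parts) != 3 {
		return Version{}, fmt.Errorf("expected MAJOR.MINOR.PATCH, got %q", s)
	}
	var nums [3]int
	for i, p := range parts {
		n, err := strconv.Atoi(p)
		if err != nil {
			return Version{}, fmt.Errorf("invalid component %q in %q", p, s)
		}
		nums[i] = n
	}
	return Version{nums[0], nums[1], nums[2]}, nil
}

// Less reports whether v sorts strictly before o.
func (v Version) Less(o Version) bool {
	if v.Major != o.Major {
		return v.Major < o.Major
	}
	if v.Minor != o.Minor {
		return v.Minor < o.Minor
	}
	return v.Patch < o.Patch
}

// MeetsMinorMin reports whether current is at or above the floor.
func MeetsMinorMin(current, floor string) (bool, error) {
	c, err := Parse(current)
	if err != nil {
		return false, err
	}
	f, err := Parse(floor)
	if err != nil {
		return false, err
	}
	return !c.Less(f), nil
}

func main() {
	const hypotheticalMinorMin = "4.15.100" // placeholder, not a real release
	for _, current := range []string{"4.15.3", "4.15.100", "4.15.101"} {
		ok, err := MeetsMinorMin(current, hypotheticalMinorMin)
		fmt.Println(current, ok, err)
	}
}
```

      Clusters below the floor would first update within 4.15.z, picking up the new ClusterRole, and only then be offered 4.16 targets.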

            rh-ee-arsen Arkadeep Sen
            trking W. Trevor King
            Melvin Joseph Melvin Joseph