OCPBUGS-32373

Accessing FeatureGates in 4.15-to-4.16 updates with 4.15 RBAC

Details

    • Bug
    • Resolution: Unresolved
    • Normal
    • None
    • 4.15, 4.16
    • Networking / DNS
    • Moderate
    • Yes
    • CFE Sprint 252
    • 1
    • Rejected
    • False
      Cause: In OpenShift 4.16, cluster-dns-operator requires permission to read the featuregates resource. On standalone OpenShift clusters, cluster-version-operator upgrades the RBAC manifests prior to upgrading the operator itself, so the required permission is in place by the time the operator needs it. On HyperShift, the operator can sometimes be upgraded before the RBAC manifests are upgraded.

      Consequence: If the RBAC manifests are not updated to the 4.16 level before the operator is upgraded to the 4.16 level, then by design, the operator crashloops until the RBAC manifests are upgraded.

      Fix: The permission to read the featuregates resource is added to the OpenShift 4.15 RBAC manifests for cluster-dns-operator.

      Result: The operator no longer crashloops during the upgrade of HyperShift clusters from 4.15 to 4.16.

    Description

      Description of problem

      sjenning noticed that HyperShift HostedCluster update CI is struggling with new DNS operators running against old ClusterRoles. CFE-852 added both the configInformers.Config().V1().FeatureGates() calls to the operator deployed by 0000_70_dns-operator_02-deployment.yaml and the matching FeatureGate RBAC to the 0000_70_dns-operator_00-cluster-role.yaml ClusterRole. In standalone clusters, the cluster-version operator ensures that the ClusterRole is reconciled before bumping the operator Deployment, so all goes smoothly. In HyperShift, the HostedControlPlane controller rolls out the new Deployment in parallel with the cluster-version operator rolling out the new ClusterRole, and when the Deployment wins that race, there can be a few rounds of crash-looping like:

      : TestUpgradeControlPlane/Main/EnsureNoCrashingPods	0s
      {Failed  === RUN   TestUpgradeControlPlane/Main/EnsureNoCrashingPods
          util.go:488: Container dns-operator in pod dns-operator-687bd5d756-c48qm has a restartCount > 0 (3)
              --- FAIL: TestUpgradeControlPlane/Main/EnsureNoCrashingPods (0.02s)
      }
      

      with pod logs like:

      ...
      W0410 22:58:02.495248       1 reflector.go:535] github.com/openshift/client-go/config/informers/externalversions/factory.go:116: failed to list *v1.FeatureGate: featuregates.config.openshift.io is forbidden: User "system:serviceaccount:openshift-dns-operator:dns-operator" cannot list resource "featuregates" in API group "config.openshift.io" at the cluster scope
      E0410 22:58:02.495277       1 reflector.go:147] github.com/openshift/client-go/config/informers/externalversions/factory.go:116: Failed to watch *v1.FeatureGate: failed to list *v1.FeatureGate: featuregates.config.openshift.io is forbidden: User "system:serviceaccount:openshift-dns-operator:dns-operator" cannot list resource "featuregates" in API group "config.openshift.io" at the cluster scope
      time="2024-04-10T22:58:22Z" level=error msg="<nil>timed out waiting for FeatureGate detection"
      time="2024-04-10T22:58:22Z" level=fatal msg="failed to create operator: timed out waiting for FeatureGate detection"
      

      Eventually RBAC will catch up, and the cluster will heal. But the crash-looping fails the CI test case, which expects a smoother transition.
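
      The forbidden errors in the pod logs come down to the operator's ServiceAccount lacking list and watch on featuregates in the config.openshift.io API group, which an informer needs. The following is a minimal sketch of that check, not the API server's actual authorizer: the PolicyRule class, the matching logic (which ignores resourceNames, subresources, and non-resource URLs), and the contents of both rule sets are assumptions made for illustration, since the ticket does not show the real ClusterRole.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PolicyRule:
    """Simplified stand-in for one RBAC rule (hypothetical, not a real client type)."""
    api_groups: tuple
    resources: tuple
    verbs: tuple


def allows(rules, api_group, resource, verb):
    """Return True if any rule grants `verb` on `resource` in `api_group`."""
    def matches(values, wanted):
        return "*" in values or wanted in values

    return any(
        matches(rule.api_groups, api_group)
        and matches(rule.resources, resource)
        and matches(rule.verbs, verb)
        for rule in rules
    )


def missing_informer_verbs(rules):
    """List the verbs a FeatureGate informer needs that `rules` do not grant."""
    return [
        verb
        for verb in ("list", "watch")
        if not allows(rules, "config.openshift.io", "featuregates", verb)
    ]


# Hypothetical rule sets: the real ClusterRole contents are not shown in the ticket.
OLD_RULES = [
    PolicyRule(("config.openshift.io",), ("infrastructures",), ("get", "list", "watch")),
]
NEW_RULES = OLD_RULES + [
    PolicyRule(("config.openshift.io",), ("featuregates",), ("get", "list", "watch")),
]

if __name__ == "__main__":
    print("old ClusterRole is missing:", missing_informer_verbs(OLD_RULES))
    print("new ClusterRole is missing:", missing_informer_verbs(NEW_RULES))
```

      Under these assumptions, the old rule set leaves both verbs missing, which corresponds to the "cannot list resource" errors above, while the new rule set leaves none missing.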

      Version-Release number of selected component

      Updates that cross CFE-852, e.g. 4.15 to new 4.16 nightlies.

      How reproducible

      Racy. Sometimes the CVO gets the ClusterRole bumped quickly enough for the Deployment bump to happen smoothly. I'm unclear on odds for the race.
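
      The race can be sketched as an ordering problem. The example below assumes that standalone ordering follows the lexical order of the manifest file names (the ticket only states that the ClusterRole is reconciled first, not how), and it models HyperShift as two uncoordinated writers where either manifest may land first. The function names are hypothetical.

```python
from itertools import permutations

# File names taken from the ticket.
CLUSTER_ROLE = "0000_70_dns-operator_00-cluster-role.yaml"
DEPLOYMENT = "0000_70_dns-operator_02-deployment.yaml"


def standalone_apply_order(manifests):
    """Assumed standalone behavior: apply manifests in lexical file-name order."""
    return sorted(manifests)


def operator_starts_cleanly(apply_order):
    """The new operator starts cleanly only if its ClusterRole landed first."""
    return apply_order.index(CLUSTER_ROLE) < apply_order.index(DEPLOYMENT)


def hypershift_outcomes(manifests):
    """With two uncoordinated controllers, either manifest may land first."""
    return {
        order: operator_starts_cleanly(list(order))
        for order in permutations(manifests)
    }


if __name__ == "__main__":
    manifests = [DEPLOYMENT, CLUSTER_ROLE]
    print("standalone:", standalone_apply_order(manifests))
    for order, ok in hypershift_outcomes(manifests).items():
        print(order, "clean start" if ok else "crash-loops until RBAC catches up")
```

      The sketch says nothing about the odds of each ordering; it only shows that one of the two possible orderings produces the crash-loop.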

      Steps to Reproduce

      Run HyperShift e2e update jobs repeatedly. The failure is racy, so it may take several runs to hit.

      Actual results

      Racy failures for the TestUpgradeControlPlane/Main/EnsureNoCrashingPods test case.

      Expected results

      Reliable success for this test case, with smooth updates.

      Additional info

      There are a number of possible approaches to make HyperShift e2e happier about these updates. Personally I think we want something like OTA-951 long-term, so HyperShift would have the same "ClusterRole will be bumped first" handling that standalone has today. But that's a bigger architectural lift. One simpler pivot to cover the current HyperShift approach would be to backport the RBAC additions to the 4.15.z ClusterRole and raise minor_min to push clusters through that newer 4.15.z or later, where they'd pick up the new RBAC before being recommended updates to 4.16 releases that require the new RBAC to be in place. That 4.15.z 0000_70_dns-operator_00-cluster-role.yaml RBAC backport is what this ticket is asking for. Although if folks have even less invasive ideas for denoising these HyperShift updates, that would be great.
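
      The minor_min idea amounts to a version gate: a cluster is only recommended an update to 4.16 once it is at or above the 4.15.z that carries the backported RBAC. Below is a rough sketch of that comparison. The function names are hypothetical, the version strings used with it are made up for illustration (the ticket does not name the 4.15.z), and the real update-recommendation tooling is not modeled.

```python
def parse_version(version):
    """Split 'X.Y.Z' into a tuple of ints so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def may_update_to_next_minor(current, minor_min):
    """True if `current` is at or above the minimum previous-minor release.

    `minor_min` stands for the first 4.15.z that ships the backported RBAC;
    its actual value is not given in the ticket.
    """
    return parse_version(current) >= parse_version(minor_min)


if __name__ == "__main__":
    # Made-up versions, for illustration only.
    hypothetical_minor_min = "4.15.9"
    for current in ("4.15.3", "4.15.9", "4.15.10"):
        print(current, may_update_to_next_minor(current, hypothetical_minor_min))
```

      Comparing integer tuples rather than raw strings matters here, because a plain string comparison would sort "4.15.10" before "4.15.9".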

            People

              mmasters1@redhat.com Miciah Masters
              trking W. Trevor King
              Melvin Joseph Melvin Joseph