OpenShift Bugs: OCPBUGS-70339

Storage operator should configure prometheus-k8s Role(Binding) for the openshift-cluster-csi-drivers namespace

    • Moderate
    • Approved
    • In Progress
    • Bug Fix
    • This issue prevented metrics from being collected from the openshift-cluster-csi-drivers namespace on HyperShift hosted clusters. The fix grants the prometheus-k8s ServiceAccount permission to collect metrics from this namespace.

      Description of problem

      The storage operator creates the openshift-cluster-csi-drivers Namespace with the openshift.io/cluster-monitoring label, but it diverges from the monitoring configuration docs by not creating a Role(Binding) that allows system:serviceaccount:openshift-monitoring:prometheus-k8s to list/watch Pods, Endpoints, or Services in that Namespace. This generates wasteful API traffic and breaks metrics scraping from that namespace.
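      For reference, the monitoring configuration docs prescribe roughly the following Role and RoleBinding pair. This is a minimal sketch; the prometheus-k8s object names below are illustrative, not necessarily what the storage operator would use:

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: prometheus-k8s  # illustrative name
  namespace: openshift-cluster-csi-drivers
rules:
- apiGroups: [""]
  resources: ["services", "endpoints", "pods"]
  verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: prometheus-k8s  # illustrative name
  namespace: openshift-cluster-csi-drivers
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: prometheus-k8s
subjects:
- kind: ServiceAccount
  name: prometheus-k8s
  namespace: openshift-monitoring
```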

      Version-Release number of selected component

      Seen in 4.20 through 4.22 HyperShift AKS runs. Unclear how broadly the issue extends beyond those jobs, if at all.

      $ w3m -dump -cols 200 'https://search.dptools.openshift.org/?maxAge=24h&type=junit&context=0&search=alert+PrometheusKubernetesListWatchFailures+fired+for.*with+labels' | grep '^periodic.*failures match'
      periodic-ci-openshift-hypershift-release-4.20-periodics-e2e-azure-aks-ovn-conformance (all) - 7 runs, 43% failed, 233% of failures match = 100% impact
      periodic-ci-openshift-hypershift-release-4.21-periodics-e2e-azure-aks-ovn-conformance (all) - 18 runs, 72% failed, 46% of failures match = 33% impact
      periodic-ci-openshift-hypershift-release-4.22-periodics-e2e-azure-aks-ovn-conformance (all) - 16 runs, 25% failed, 375% of failures match = 94% impact
      

      How reproducible

      Every time.

      Steps to Reproduce

      1. Launch a HyperShift AKS cluster.
      2. Check Kube API server audit logs for prometheus-k8s requests involving the openshift-cluster-csi-drivers namespace.

      Actual results

      In a HyperShift AKS run, gathered Kube API server audit logs show:

      $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-hypershift-release-4.21-periodics-e2e-azure-aks-ovn-conformance/2007965597874262016/artifacts/e2e-azure-aks-ovn-conformance/dump/artifacts/namespaces/clusters-32729036fdd5a24a5029/core/pods/logs/kube-apiserver-654cf8b4f4-j4sw9-audit-logs.log | grep '"username":"system:serviceaccount:openshift-monitoring:prometheus-k8s"' | jq -r '(.objectRef | .namespace + " " + .resource) + " " + .verb + " " + (.responseStatus.code | tostring)' | sort | uniq -c | grep -1 openshift-cluster-csi-drivers
            2 openshift-cloud-credential-operator services watch 200
            6 openshift-cluster-csi-drivers endpoints list 403
            6 openshift-cluster-csi-drivers pods list 403
            7 openshift-cluster-csi-drivers services list 403
            2 openshift-cluster-machine-approver endpoints watch 200
      

      Expected results

      Successful prometheus-k8s access to the openshift-cluster-csi-drivers Namespace, without Kube API server audit logs reporting 403s for that username.

      Additional information

      More traditional cluster structures do have the RoleBinding, and work well. For example, looking at https://amd64.ocp.releases.ci.openshift.org/ 4.21.0-ec.3 > aws-ovn-serial-1of2 Artifacts > ... -> gather-extra artifacts shows the Namespace setting the label:

      $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.21-e2e-aws-ovn-serial-1of2/1990973748538249216/artifacts/e2e-aws-ovn-serial/gather-extra/artifacts/namespaces.json | jq -r '.items[].metadata | select(.name == "openshift-cluster-csi-drivers").labels["openshift.io/cluster-monitoring"]'
      true
      

      And there is a RoleBinding about the prometheus-k8s ServiceAccount:

      $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.21-e2e-aws-ovn-serial-1of2/1990973748538249216/artifacts/e2e-aws-ovn-serial/gather-extra/artifacts/rolebindings.json | jq '.items[] | select(.metadata.namespace == "openshift-cluster-csi-drivers" and (.subjects | tostring | contains("prometheus-k8s")))'
      {
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "RoleBinding",
        "metadata": {
          "creationTimestamp": "2025-11-19T03:06:43Z",
          "name": "aws-ebs-csi-driver-prometheus",
          "namespace": "openshift-cluster-csi-drivers",
          "resourceVersion": "10259",
          "uid": "a563a0f8-9ab5-455f-9f35-b9f9b9fef0e1"
        },
        "roleRef": {
          "apiGroup": "rbac.authorization.k8s.io",
          "kind": "Role",
          "name": "aws-ebs-csi-driver-prometheus"
        },
        "subjects": [
          {
            "kind": "ServiceAccount",
            "name": "prometheus-k8s",
            "namespace": "openshift-monitoring"
          }
        ]
      }
      

      And gathered audit logs show successful access:

      $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.21-e2e-aws-ovn-serial-1of2/1990973748538249216/artifacts/e2e-aws-ovn-serial/gather-audit-logs/artifacts/audit-logs.tar | tar -xz --strip-components=2
      $ zgrep -h '"username":"system:serviceaccount:openshift-monitoring:prometheus-k8s"' kube-apiserver/*audit*.log.gz | jq -r '(.objectRef | .namespace + " " + .resource) + " " + .verb + " " + (.responseStatus.code | tostring)' | sort | uniq -c | grep -1 openshift-cluster-csi-drivers
           74 openshift-cloud-credential-operator services watch 200
            4 openshift-cluster-csi-drivers endpoints list 200
           78 openshift-cluster-csi-drivers endpoints watch 200
            4 openshift-cluster-csi-drivers pods list 200
           74 openshift-cluster-csi-drivers pods watch 200
            4 openshift-cluster-csi-drivers services list 200
           76 openshift-cluster-csi-drivers services watch 200
            4 openshift-cluster-machine-approver endpoints list 200
      

      Checking audit logs to see where that aws-ebs-csi-driver-prometheus RoleBinding comes from, its creation seems to predate the captured logs:

      $ zgrep -h '"verb":"create".*"resource":"rolebindings","namespace":"openshift-cluster-csi-drivers"' kube-apiserver/*audit*.log.gz | jq -r '.objectRef.name + " " + .user.username'
      system:image-builders system:serviceaccount:openshift-infra:default-rolebindings-controller
      system:image-pullers system:serviceaccount:openshift-infra:default-rolebindings-controller
      system:deployers system:serviceaccount:openshift-infra:default-rolebindings-controller
      

      Possibly it's coming from the AWS EBS CSI driver.

      It's not clear to me why nothing sets up that RoleBinding in the HyperShift AKS run. Maybe there isn't a relevant CSI driver installed, and that exposes a previously-reliable assumption that some monitoring-aware operator would be around to create the Role(Binding)? If so, the fix could be shifting the Role(Binding) up from the specific CSI drivers to the generic storage operator. Or something...
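      For triage on other runs, a quick way to check a gathered rolebindings.json artifact (gather-extra layout, as linked above) for a binding granting the prometheus-k8s ServiceAccount access. This is a sketch mirroring the jq filter earlier in this report; the function name and the sample document are made up for illustration:

```python
import json  # load a real artifact with: rb_doc = json.load(open("rolebindings.json"))

def prometheus_bindings(rb_doc, namespace="openshift-cluster-csi-drivers"):
    """Return names of RoleBindings in `namespace` whose subjects include
    the openshift-monitoring prometheus-k8s ServiceAccount."""
    return [
        rb["metadata"]["name"]
        for rb in rb_doc.get("items", [])
        if rb["metadata"].get("namespace") == namespace
        and any(
            s.get("kind") == "ServiceAccount"
            and s.get("name") == "prometheus-k8s"
            and s.get("namespace") == "openshift-monitoring"
            for s in rb.get("subjects", [])
        )
    ]

# Minimal sample shaped like the gather-extra rolebindings.json artifact:
sample = {
    "items": [
        {
            "metadata": {"name": "aws-ebs-csi-driver-prometheus",
                         "namespace": "openshift-cluster-csi-drivers"},
            "subjects": [{"kind": "ServiceAccount", "name": "prometheus-k8s",
                          "namespace": "openshift-monitoring"}],
        },
        {
            "metadata": {"name": "system:deployers",
                         "namespace": "openshift-cluster-csi-drivers"},
            "subjects": [{"kind": "ServiceAccount", "name": "deployer",
                          "namespace": "openshift-cluster-csi-drivers"}],
        },
    ]
}
print(prometheus_bindings(sample))  # ['aws-ebs-csi-driver-prometheus']
```

      An empty result against a HyperShift AKS artifact would confirm the missing grant without digging through audit logs.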

      It's hard to tell much about what's going on in the namespace on the hosted HyperShift cluster without a must-gather step in the hypershift-azure-aks-conformance workflow that job uses.

              jdobson@redhat.com Jonathan Dobson
              trking W. Trevor King
              Wei Duan