-
Bug
-
Resolution: Unresolved
-
Undefined
-
None
-
4.20, 4.21, 4.22
-
None
-
False
-
-
None
-
Moderate
-
None
-
In Progress
-
Bug Fix
-
This issue prevented metrics from being collected from the openshift-cluster-csi-drivers namespace on HyperShift hosted clusters. The fix grants Prometheus permission to collect metrics from this namespace.
-
None
-
None
-
None
-
None
Description of problem
The storage operator creates the openshift-cluster-csi-drivers Namespace with the openshift.io/cluster-monitoring label. But it diverges from the monitoring configuration docs in not creating a Role(Binding) to allow system:serviceaccount:openshift-monitoring:prometheus-k8s to list/watch Pods, Endpoints, or Services in that Namespace. This creates wasteful API traffic and breaks the ability to scrape metrics from that namespace.
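For reference, the pattern the monitoring configuration docs describe amounts to a Role plus RoleBinding in that Namespace granting the prometheus-k8s ServiceAccount read access. A minimal sketch of that shape (the resource names below are placeholders, not necessarily what any operator or the eventual fix will create):
$ # placeholder names; only the verbs/resources and the prometheus-k8s subject matter
$ oc -n openshift-cluster-csi-drivers create role prometheus-k8s --verb=get,list,watch --resource=pods,services,endpoints
$ oc -n openshift-cluster-csi-drivers create rolebinding prometheus-k8s --role=prometheus-k8s --serviceaccount=openshift-monitoring:prometheus-k8s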
Version-Release number of selected component
Seen in 4.20 through 4.22 HyperShift AKS runs. Unclear how broadly it spreads beyond that, if at all.
$ w3m -dump -cols 200 'https://search.dptools.openshift.org/?maxAge=24h&type=junit&context=0&search=alert+PrometheusKubernetesListWatchFailures+fired+for.*with+labels' | grep '^periodic.*failures match'
periodic-ci-openshift-hypershift-release-4.20-periodics-e2e-azure-aks-ovn-conformance (all) - 7 runs, 43% failed, 233% of failures match = 100% impact
periodic-ci-openshift-hypershift-release-4.21-periodics-e2e-azure-aks-ovn-conformance (all) - 18 runs, 72% failed, 46% of failures match = 33% impact
periodic-ci-openshift-hypershift-release-4.22-periodics-e2e-azure-aks-ovn-conformance (all) - 16 runs, 25% failed, 375% of failures match = 94% impact
How reproducible
Every time.
Steps to Reproduce
1. Launch a HyperShift AKS cluster.
2. Check Kube API server audit logs for prometheus-k8s requests involving the openshift-cluster-csi-drivers namespace.
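Assuming access to the hosted cluster's API and impersonation rights, an oc auth can-i check as that ServiceAccount is a quicker way to see the same denial; on an affected cluster it should report no (sketch):
$ oc --as=system:serviceaccount:openshift-monitoring:prometheus-k8s -n openshift-cluster-csi-drivers auth can-i list pods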
Actual results
In a HyperShift AKS run, gathered Kube API server audit logs show:
$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-hypershift-release-4.21-periodics-e2e-azure-aks-ovn-conformance/2007965597874262016/artifacts/e2e-azure-aks-ovn-conformance/dump/artifacts/namespaces/clusters-32729036fdd5a24a5029/core/pods/logs/kube-apiserver-654cf8b4f4-j4sw9-audit-logs.log | grep '"username":"system:serviceaccount:openshift-monitoring:prometheus-k8s"' | jq -r '(.objectRef | .namespace + " " + .resource) + " " + .verb + " " + (.responseStatus.code | tostring)' | sort | uniq -c | grep -1 openshift-cluster-csi-drivers
2 openshift-cloud-credential-operator services watch 200
6 openshift-cluster-csi-drivers endpoints list 403
6 openshift-cluster-csi-drivers pods list 403
7 openshift-cluster-csi-drivers services list 403
2 openshift-cluster-machine-approver endpoints watch 200
Expected results
Successful prometheus-k8s access to the openshift-cluster-csi-drivers Namespace, without Kube API server audit logs reporting 403s for that username.
Additional information
More traditional cluster structures do have the RoleBinding, and work well. For example, in https://amd64.ocp.releases.ci.openshift.org/ > 4.21.0-ec.3 > aws-ovn-serial-1of2 > Artifacts > ... > gather-extra artifacts, the Namespace sets the label:
$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.21-e2e-aws-ovn-serial-1of2/1990973748538249216/artifacts/e2e-aws-ovn-serial/gather-extra/artifacts/namespaces.json | jq -r '.items[].metadata | select(.name == "openshift-cluster-csi-drivers").labels["openshift.io/cluster-monitoring"]'
true
And there is a RoleBinding about the prometheus-k8s ServiceAccount:
$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.21-e2e-aws-ovn-serial-1of2/1990973748538249216/artifacts/e2e-aws-ovn-serial/gather-extra/artifacts/rolebindings.json | jq '.items[] | select(.metadata.namespace == "openshift-cluster-csi-drivers" and (.subjects | tostring | contains("prometheus-k8s")))'
{
  "apiVersion": "rbac.authorization.k8s.io/v1",
  "kind": "RoleBinding",
  "metadata": {
    "creationTimestamp": "2025-11-19T03:06:43Z",
    "name": "aws-ebs-csi-driver-prometheus",
    "namespace": "openshift-cluster-csi-drivers",
    "resourceVersion": "10259",
    "uid": "a563a0f8-9ab5-455f-9f35-b9f9b9fef0e1"
  },
  "roleRef": {
    "apiGroup": "rbac.authorization.k8s.io",
    "kind": "Role",
    "name": "aws-ebs-csi-driver-prometheus"
  },
  "subjects": [
    {
      "kind": "ServiceAccount",
      "name": "prometheus-k8s",
      "namespace": "openshift-monitoring"
    }
  ]
}
And gathered audit logs show happy access:
$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.21-e2e-aws-ovn-serial-1of2/1990973748538249216/artifacts/e2e-aws-ovn-serial/gather-audit-logs/artifacts/audit-logs.tar | tar -xz --strip-components=2
$ zgrep -h '"username":"system:serviceaccount:openshift-monitoring:prometheus-k8s"' kube-apiserver/*audit*.log.gz | jq -r '(.objectRef | .namespace + " " + .resource) + " " + .verb + " " + (.responseStatus.code | tostring)' | sort | uniq -c | grep -1 openshift-cluster-csi-drivers
74 openshift-cloud-credential-operator services watch 200
4 openshift-cluster-csi-drivers endpoints list 200
78 openshift-cluster-csi-drivers endpoints watch 200
4 openshift-cluster-csi-drivers pods list 200
74 openshift-cluster-csi-drivers pods watch 200
4 openshift-cluster-csi-drivers services list 200
76 openshift-cluster-csi-drivers services watch 200
4 openshift-cluster-machine-approver endpoints list 200
Checking audit logs to see where that aws-ebs-csi-driver-prometheus RoleBinding is coming from, it seems to predate the captured logs:
$ zgrep -h '"verb":"create".*"resource":"rolebindings","namespace":"openshift-cluster-csi-drivers"' kube-apiserver/*audit*.log.gz | jq -r '.objectRef.name + " " + .user.username' system:image-builders system:serviceaccount:openshift-infra:default-rolebindings-controller system:image-pullers system:serviceaccount:openshift-infra:default-rolebindings-controller system:deployers system:serviceaccount:openshift-infra:default-rolebindings-controller
Possibly it's coming from the AWS EBS CSI driver.
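Extending that grep from create to any verb against the RoleBinding by name would catch an update or patch writer even when the create predates the captured window; a sketch along the same lines as the grep above:
$ zgrep -h '"resource":"rolebindings".*"name":"aws-ebs-csi-driver-prometheus"' kube-apiserver/*audit*.log.gz | jq -r '.verb + " " + .user.username' | sort | uniq -c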
Not clear to me why nothing is setting up that RoleBinding in the HyperShift AKS run. Maybe there isn't a relevant CSI driver installed, and that exposes a previously-reliable assumption that there would be some monitoring operator around to create the Role(Binding)? And the fix would be shifting the Role(Binding) up from specific CSI drivers over to the generic storage operator? Or something...
It's hard to tell much about what's going on in the namespace on the hosted HyperShift cluster without a must-gather step in the hypershift-azure-aks-conformance workflow used by that job.
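Absent must-gather, pointing oc at the hosted cluster's kubeconfig and listing the RBAC in that namespace would at least show whether any prometheus-k8s Role(Binding) exists there (sketch; the kubeconfig path is a placeholder):
$ # HOSTED_CLUSTER_KUBECONFIG is a placeholder for the hosted cluster's admin kubeconfig
$ oc --kubeconfig "${HOSTED_CLUSTER_KUBECONFIG}" -n openshift-cluster-csi-drivers get role,rolebinding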
- blocks
-
OCPBUGS-72509 Storage operator should configure kubernetes-k8s Role(Binding) for the openshift-cluster-csi-drivers namespace
-
- MODIFIED
-
- is cloned by
-
OCPBUGS-72509 Storage operator should configure kubernetes-k8s Role(Binding) for the openshift-cluster-csi-drivers namespace
-
- MODIFIED
-
- links to