OpenShift Bugs: OCPBUGS-70339

Storage operator should configure prometheus-k8s Role(Binding) for the openshift-cluster-csi-drivers namespace

    • Moderate
    • Approved
    • In Progress
    • Bug Fix
    • This issue prevented metrics from being collected from the openshift-cluster-csi-drivers namespace on HyperShift hosted clusters. The fix grants the prometheus-k8s ServiceAccount permission to collect metrics from this namespace.

      Description of problem

      The storage operator creates the openshift-cluster-csi-drivers Namespace with the openshift.io/cluster-monitoring label, but it diverges from the monitoring configuration docs by not creating a Role(Binding) that allows system:serviceaccount:openshift-monitoring:prometheus-k8s to list/watch Pods, Endpoints, or Services in that Namespace. This generates wasteful API traffic and breaks metrics scraping from that namespace.
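      For reference, the monitoring configuration docs prescribe roughly the following Role and RoleBinding pair. This is a minimal sketch; the prometheus-k8s object names below are illustrative, not necessarily what the storage operator would use:

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: prometheus-k8s  # illustrative name
  namespace: openshift-cluster-csi-drivers
rules:
- apiGroups: [""]
  resources: ["services", "endpoints", "pods"]
  verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: prometheus-k8s  # illustrative name
  namespace: openshift-cluster-csi-drivers
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: prometheus-k8s
subjects:
- kind: ServiceAccount
  name: prometheus-k8s
  namespace: openshift-monitoring
```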

      Version-Release number of selected component

      Seen in 4.20 through 4.22 HyperShift AKS runs. Unclear how broadly the issue extends beyond those jobs, if at all.

      $ w3m -dump -cols 200 'https://search.dptools.openshift.org/?maxAge=24h&type=junit&context=0&search=alert+PrometheusKubernetesListWatchFailures+fired+for.*with+labels' | grep '^periodic.*failures match'
      periodic-ci-openshift-hypershift-release-4.20-periodics-e2e-azure-aks-ovn-conformance (all) - 7 runs, 43% failed, 233% of failures match = 100% impact
      periodic-ci-openshift-hypershift-release-4.21-periodics-e2e-azure-aks-ovn-conformance (all) - 18 runs, 72% failed, 46% of failures match = 33% impact
      periodic-ci-openshift-hypershift-release-4.22-periodics-e2e-azure-aks-ovn-conformance (all) - 16 runs, 25% failed, 375% of failures match = 94% impact
      

      How reproducible

      Every time.

      Steps to Reproduce

      1. Launch a HyperShift AKS cluster.
      2. Check Kube API server audit logs for prometheus-k8s requests involving the openshift-cluster-csi-drivers namespace.

      Actual results

      In a HyperShift AKS run, gathered Kube API server audit logs show:

      $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-hypershift-release-4.21-periodics-e2e-azure-aks-ovn-conformance/2007965597874262016/artifacts/e2e-azure-aks-ovn-conformance/dump/artifacts/namespaces/clusters-32729036fdd5a24a5029/core/pods/logs/kube-apiserver-654cf8b4f4-j4sw9-audit-logs.log | grep '"username":"system:serviceaccount:openshift-monitoring:prometheus-k8s"' | jq -r '(.objectRef | .namespace + " " + .resource) + " " + .verb + " " + (.responseStatus.code | tostring)' | sort | uniq -c | grep -1 openshift-cluster-csi-drivers
            2 openshift-cloud-credential-operator services watch 200
            6 openshift-cluster-csi-drivers endpoints list 403
            6 openshift-cluster-csi-drivers pods list 403
            7 openshift-cluster-csi-drivers services list 403
            2 openshift-cluster-machine-approver endpoints watch 200
      

      Expected results

      Successful prometheus-k8s access to the openshift-cluster-csi-drivers Namespace, without Kube API server audit logs reporting 403s for that username.

      Additional information

      More traditional cluster structures do have the RoleBinding, and work well. For example, looking at https://amd64.ocp.releases.ci.openshift.org/ 4.21.0-ec.3 > aws-ovn-serial-1of2 Artifacts > ... -> gather-extra artifacts shows the Namespace setting the label:

      $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.21-e2e-aws-ovn-serial-1of2/1990973748538249216/artifacts/e2e-aws-ovn-serial/gather-extra/artifacts/namespaces.json | jq -r '.items[].metadata | select(.name == "openshift-cluster-csi-drivers").labels["openshift.io/cluster-monitoring"]'
      true
      

      And there is a RoleBinding about the prometheus-k8s ServiceAccount:

      $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.21-e2e-aws-ovn-serial-1of2/1990973748538249216/artifacts/e2e-aws-ovn-serial/gather-extra/artifacts/rolebindings.json | jq '.items[] | select(.metadata.namespace == "openshift-cluster-csi-drivers" and (.subjects | tostring | contains("prometheus-k8s")))'
      {
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "RoleBinding",
        "metadata": {
          "creationTimestamp": "2025-11-19T03:06:43Z",
          "name": "aws-ebs-csi-driver-prometheus",
          "namespace": "openshift-cluster-csi-drivers",
          "resourceVersion": "10259",
          "uid": "a563a0f8-9ab5-455f-9f35-b9f9b9fef0e1"
        },
        "roleRef": {
          "apiGroup": "rbac.authorization.k8s.io",
          "kind": "Role",
          "name": "aws-ebs-csi-driver-prometheus"
        },
        "subjects": [
          {
            "kind": "ServiceAccount",
            "name": "prometheus-k8s",
            "namespace": "openshift-monitoring"
          }
        ]
      }
      

      And gathered audit logs show successful access:

      $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.21-e2e-aws-ovn-serial-1of2/1990973748538249216/artifacts/e2e-aws-ovn-serial/gather-audit-logs/artifacts/audit-logs.tar | tar -xz --strip-components=2
      $ zgrep -h '"username":"system:serviceaccount:openshift-monitoring:prometheus-k8s"' kube-apiserver/*audit*.log.gz | jq -r '(.objectRef | .namespace + " " + .resource) + " " + .verb + " " + (.responseStatus.code | tostring)' | sort | uniq -c | grep -1 openshift-cluster-csi-drivers
           74 openshift-cloud-credential-operator services watch 200
            4 openshift-cluster-csi-drivers endpoints list 200
           78 openshift-cluster-csi-drivers endpoints watch 200
            4 openshift-cluster-csi-drivers pods list 200
           74 openshift-cluster-csi-drivers pods watch 200
            4 openshift-cluster-csi-drivers services list 200
           76 openshift-cluster-csi-drivers services watch 200
            4 openshift-cluster-machine-approver endpoints list 200
      

      Checking audit logs to see where that aws-ebs-csi-driver-prometheus RoleBinding comes from, its creation seems to predate the captured logs:

      $ zgrep -h '"verb":"create".*"resource":"rolebindings","namespace":"openshift-cluster-csi-drivers"' kube-apiserver/*audit*.log.gz | jq -r '.objectRef.name + " " + .user.username'
      system:image-builders system:serviceaccount:openshift-infra:default-rolebindings-controller
      system:image-pullers system:serviceaccount:openshift-infra:default-rolebindings-controller
      system:deployers system:serviceaccount:openshift-infra:default-rolebindings-controller
      

      Possibly it's coming from the AWS EBS CSI driver.

      It's not clear to me why nothing sets up that RoleBinding in the HyperShift AKS run. Maybe there isn't a relevant CSI driver installed, and that exposes a previously-reliable assumption that some monitoring-aware operator would be around to create the Role(Binding)? If so, the fix could be shifting the Role(Binding) up from the specific CSI drivers to the generic storage operator. Or something...
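      For triage on other runs, a quick way to check a gathered rolebindings.json artifact (gather-extra layout, as linked above) for a binding granting the prometheus-k8s ServiceAccount access. This is a sketch mirroring the jq filter earlier in this report; the function name and the sample document are made up for illustration:

```python
import json  # load a real artifact with: rb_doc = json.load(open("rolebindings.json"))

def prometheus_bindings(rb_doc, namespace="openshift-cluster-csi-drivers"):
    """Return names of RoleBindings in `namespace` whose subjects include
    the openshift-monitoring prometheus-k8s ServiceAccount."""
    return [
        rb["metadata"]["name"]
        for rb in rb_doc.get("items", [])
        if rb["metadata"].get("namespace") == namespace
        and any(
            s.get("kind") == "ServiceAccount"
            and s.get("name") == "prometheus-k8s"
            and s.get("namespace") == "openshift-monitoring"
            for s in rb.get("subjects", [])
        )
    ]

# Minimal sample shaped like the gather-extra rolebindings.json artifact:
sample = {
    "items": [
        {
            "metadata": {"name": "aws-ebs-csi-driver-prometheus",
                         "namespace": "openshift-cluster-csi-drivers"},
            "subjects": [{"kind": "ServiceAccount", "name": "prometheus-k8s",
                          "namespace": "openshift-monitoring"}],
        },
        {
            "metadata": {"name": "system:deployers",
                         "namespace": "openshift-cluster-csi-drivers"},
            "subjects": [{"kind": "ServiceAccount", "name": "deployer",
                          "namespace": "openshift-cluster-csi-drivers"}],
        },
    ]
}
print(prometheus_bindings(sample))  # ['aws-ebs-csi-driver-prometheus']
```

      An empty result against a HyperShift AKS artifact would confirm the missing grant without digging through audit logs.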

      It's hard to tell much about what's going on in the namespace on the hosted HyperShift cluster without a must-gather step in the hypershift-azure-aks-conformance workflow that job uses.

              jdobson@redhat.com Jonathan Dobson
              trking W. Trevor King
              Wei Duan