Type: Bug
Resolution: Done
Priority: Undefined
Fix Version/s: 4.17.z
Affects Version/s: 4.15
Component/s: Storage / Operators
Labels:
- qe-premerge-tested

Activity Type:
Quality / Stability / Reliability
Blocked:
False
Blocked Reason:

None
Story Points:
None
Severity:
None
Regression:
None

Target Backport Versions:
None
Target Version:

4.17.z
Release Blocker:
None
Sprint:
None

SFDC Cases Counter:
SFDC Cases Open:
SFDC Cases Links:

PX Impact Score:

Release Note Status:
In Progress
Release Note Type:
Bug Fix
Release Note Text:

Before this update, the `AzureDiskCSIDriverOperator` entered a degraded state after its pod experienced a panic, specifically an "assignment to entry in nil map" error and a Remote Procedure Call (RPC) keepalive ping timeout. This failure prevented the Operator from reconciling its static resources, creating a significant risk of failures during future cluster upgrades. With this release, the `clustercsidriver` custom resource is deleted, forcing the Operator to recreate and reconcile the object, resolving the panics and ensuring the cluster's stability. (link:https://issues.redhat.com/browse/OCPBUGS-60597[OCPBUGS-60597])

Escape Reason:
None
Escape Impact:
None
Corrective Measures:
None
SDLC stage when should've been found:
None

This is a clone of issue OCPBUGS-60464. The following is the description of the original issue:
—
Description of problem:

The AzureDiskCSIDriverOperator is in a degraded state. The logs for the operator pod show a panic with the message "assignment to entry in nil map" and an error message of "AzureDiskDriverStaticResourcesControllerDegraded: "csidriver.yaml" (string): rpc error: code = Unavailable desc = keepalive ping failed to receive ACK within timeout". This degraded state indicates the operator is failing to reconcile its static resources, specifically the `CSIDriver` for `disk.csi.azure.com`, which could lead to issues during future cluster upgrades.

Version-Release number of selected component (if applicable):

ARO v4.15.49

How reproducible:

Not sure, because the root cause of the issue is not clear.

Steps to Reproduce:

    1.
    2.
    3.

Actual results:

The operator entered a degraded state on June 1, 2025. Restarting the CSI operator pods did not resolve the panics or the degraded status. Deleting the `clustercsidrivers` object did resolve it.

Expected results:

The `AzureDiskCSIDriverOperator` should be in a healthy, non-degraded state, and there should be no panics in the operator pod logs. It should be able to successfully reconcile all its static resources, including the `csidriver.yaml` manifest.

Additional info:

- The degradation has been observed since June 1, 2025.
- The panics were observed around 09:49 UTC on the same day.
- There are no relevant entries in the audit logs immediately preceding the degradation.
- No apparent performance issues were noted.
- The issue persists regardless of the master node the operator pod is running on.
- A workaround attempt of restarting the CSI operator pods was unsuccessful as panics were still observed afterwards.
- The two symptoms (panic and degradation) may or may not be directly linked.
- This issue is similar to OCPBUGS-57395, but we are not sure the root cause is the same. The workaround of deleting the `clustercsidrivers` object so that the operator recreates and reconciles it works: `$ oc delete clustercsidriver disk.csi.azure.com`.
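
For context, "assignment to entry in nil map" is the Go runtime's panic for writing to a map that was never allocated. The following minimal sketch reproduces that class of panic and shows the usual guard. The `object` type and the `setLabelUnsafe`, `setLabelSafe`, and `panicMessage` helpers are hypothetical names invented for illustration. This is not the code from library-go or csi-operator, and it makes no claim about how the linked pull requests implement the fix.

```go
package main

import "fmt"

// object is a hypothetical stand-in for an API object's metadata.
// It is not a type from library-go.
type object struct {
	Labels map[string]string
}

// setLabelUnsafe writes to the map without checking it. If Labels is nil,
// the Go runtime panics with "assignment to entry in nil map".
func setLabelUnsafe(o *object, key, value string) {
	o.Labels[key] = value
}

// setLabelSafe allocates the map on first use, so the write cannot panic.
func setLabelSafe(o *object, key, value string) {
	if o.Labels == nil {
		o.Labels = map[string]string{}
	}
	o.Labels[key] = value
}

// panicMessage runs f and returns the recovered panic message,
// or an empty string if f did not panic.
func panicMessage(f func()) (msg string) {
	defer func() {
		if r := recover(); r != nil {
			msg = fmt.Sprint(r)
		}
	}()
	f()
	return ""
}

func main() {
	// Labels is nil here because the struct was created without a map.
	unsafeObj := &object{}
	fmt.Println("unsafe write:", panicMessage(func() {
		setLabelUnsafe(unsafeObj, "example.com/label", "true")
	}))

	safeObj := &object{}
	fmt.Println("safe write:", panicMessage(func() {
		setLabelSafe(safeObj, "example.com/label", "true")
	}), safeObj.Labels)
}
```

Reading from a nil map is allowed in Go and returns the zero value. Only writes panic, which is why an object that arrives without labels can pass through read paths and fail only when a label is added.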

links to

openshift/csi-operator#422: [4.17] OCPBUGS-60597: Bump library-go to fix assignment to nil map issue

openshift/library-go#1999: OCPBUGS-60597: Fix panic in required labels for csidriver object

Assignee: Hemant Kumar

Reporter: Natalia Garea Garcia

Need Info From: None

Contributors: None

QA Contact: Penghao Wang

Doc Contact: None

Votes: 0

Watchers: 5

Created: 2025/08/18 2:52 PM

Updated: 2025/10/10 8:07 AM

Resolved: 2025/09/10 11:07 AM