OpenShift Bugs / OCPBUGS-54382

Azure stack: storage azure disk csi driver node pods CrashLoopBackOff


      Description of problem:

      The storage operator is degraded on 4.19 installs to Azure Stack:

      storage                                    4.19.0-0.ci.test-2025-03-26-183004-ci-ln-q7lhwk2-latest   False       True          False      76s     AzureDiskCSIDriverOperatorCRAvailable: AzureDiskDriverNodeServiceControllerAvailable: Waiting for the DaemonSet to deploy the CSI Node Service
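      The azure-disk-csi-driver-node DaemonSet pods in openshift-cluster-csi-drivers are in CrashLoopBackOff. A quick way to see the state (the DaemonSet name is inferred from the pod name in the logs below):

      # oc get daemonset azure-disk-csi-driver-node -n openshift-cluster-csi-drivers
      # oc get pods -n openshift-cluster-csi-drivers | grep azure-disk-csi-driver-node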
      
      

      Pod logs show a permission error:

      [root@fedora auth]# oc logs azure-disk-csi-driver-node-6c44h -n openshift-cluster-csi-drivers csi-driver | tail -n 1
      E0331 00:57:40.711688       1 utils.go:110] GRPC error: rpc error: code = Internal desc = getNodeInfoFromLabels on node(padillon03271650-4x84l-worker-mtcazs-r6584) failed with get node(padillon03271650-4x84l-worker-mtcazs-r6584) failed with nodes "padillon03271650-4x84l-worker-mtcazs-r6584" is forbidden: User "system:serviceaccount:openshift-cluster-csi-drivers:azure-disk-csi-driver-node-sa" cannot get resource "nodes" in API group "" at the cluster scope 
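      The denial can be confirmed directly with impersonation (oc auth can-i is a standard subcommand; the service account name is taken from the error above). While the bug is present this should print "no":

      # oc auth can-i get nodes --as=system:serviceaccount:openshift-cluster-csi-drivers:azure-disk-csi-driver-node-sa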

      I did not track down the source of this call to get nodes. Checking the RBAC:

      # oc describe clusterrole azure-disk-privileged-role
      Name:         azure-disk-privileged-role
      Labels:       <none>
      Annotations:  <none>
      PolicyRule:
        Resources                                         Non-Resource URLs  Resource Names  Verbs
        ---------                                         -----------------  --------------  -----
        securitycontextconstraints.security.openshift.io  []                 [privileged]    [use] 
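      To enumerate which subjects do hold that permission, and which ClusterRoles exist for the driver, something like the following should work (both are standard oc subcommands):

      # oc policy who-can get nodes
      # oc get clusterrole -o name | grep azure-disk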

      Not sure what to make of this. Perhaps an upstream change? It could also be Azure Stack weirdness; more context below.
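      As a diagnostic-only workaround (the role/binding name below is made up, and the storage operator may reconcile manual RBAC changes away), granting the node service account read access to nodes should unblock the DaemonSet:

      # oc create clusterrole azure-disk-node-nodes-reader --verb=get,list,watch --resource=nodes
      # oc create clusterrolebinding azure-disk-node-nodes-reader --clusterrole=azure-disk-node-nodes-reader --serviceaccount=openshift-cluster-csi-drivers:azure-disk-csi-driver-node-sa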

       

      Version-Release number of selected component (if applicable):

          4.19ec3

      How reproducible:

          Always

      Steps to Reproduce:

      Install 4.19 on Azure Stack; all Azure Stack installs hit this.
          

      Actual results:

          Degraded operator

      Expected results:

          Available

      Additional info:

      1. https://issues.redhat.com/browse/OCPBUGS-51090 tracks an upstream bug in cloud-provider-azure, for which I have an upstream WIP fix: https://github.com/kubernetes-sigs/cloud-provider-azure/pull/8755. Upstream switched the cloud provider SDK to the v2 implementation, which still has spotty (at best) support for Azure Stack.
      
      If the storage operator depends on node labels, this cloud provider bug could be the cause. 
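      One way to check whether the cloud provider populated the labels the driver reads (an assumption on my part that getNodeInfoFromLabels reads topology/instance-type labels off the Node object; the node name comes from the error log above):

      # oc get node padillon03271650-4x84l-worker-mtcazs-r6584 --show-labels | tr ',' '\n' | grep -Ei 'topology|instance-type'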

       

      2. CI is down because new security measures were put in place for our environment. Manual token validation is now required. They are meeting on Monday about enabling access from the fixed IP address we have given them.

       

      Must-gather attached.

              Assignee: Penghao Wang (rhn-support-pewang)
              Reporter: Patrick Dillon (padillon)