OCPBUGS-18581

[Azure-File-CSI-Driver] Storage account created by the driver sometimes only allows worker subnets, which leads to mounts being denied from master nodes

Details

    • Bug
    • Resolution: Won't Do
    • Undefined
    • None
    • 4.14
    • Storage / Operators
    • None
    • Important
    • No
    • Rejected
    • False
    • None
    • * Creating pods with Azure File NFS volumes that are scheduled to the control plane node causes the mount to be denied. (link:https://issues.redhat.com/browse/OCPBUGS-18581[*OCPBUGS-18581*])
      +
      To work around this issue: if your control plane nodes are schedulable and the pods can run on worker nodes, use `nodeSelector` or affinity to schedule the pods on worker nodes (see the sketch below).
    • Known Issue
    • Done
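
      A minimal sketch of the workaround from the release note above, assuming the worker nodes can run the pod; the pod name, image, and PVC name are illustrative, not from this report:

        apiVersion: v1
        kind: Pod
        metadata:
          name: azurefile-nfs-app            # illustrative
        spec:
          # Pin the pod to worker nodes so the NFS mount originates from the
          # worker subnet that the storage account's network rules allow.
          nodeSelector:
            node-role.kubernetes.io/worker: ""
          containers:
          - name: app
            image: registry.access.redhat.com/ubi9/ubi-minimal
            command: ["sleep", "infinity"]
            volumeMounts:
            - name: data
              mountPath: /mnt/data
          volumes:
          - name: data
            persistentVolumeClaim:
              claimName: azurefile-nfs-pvc   # illustrative PVC name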

    Description

      Description of problem:

      In an Azure compact cluster (only 3 master nodes, but all of them also have the worker role), I created a StorageClass with skuname: Premium_LRS (I found this easier to reproduce with than other SKU types) and a PVC/pod. The CSI driver creates a storage account when provisioning the volume, and sometimes that storage account allows "all public network" access, as below:

            "networkAcls": {
                  "bypass": "AzureServices",
                  "virtualNetworkRules": [],
                  "ipRules": [],
                  "defaultAction": "Allow"
              },
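
      (For reference, a minimal sketch of the StorageClass used here; the object name is illustrative, while skuname: Premium_LRS and protocol: nfs come from this report, since the failing mounts shown below are NFS.)

        apiVersion: storage.k8s.io/v1
        kind: StorageClass
        metadata:
          name: azurefile-csi-premium-nfs   # illustrative name
        provisioner: file.csi.azure.com
        parameters:
          skuname: Premium_LRS   # from the report; easiest SKU to reproduce with
          protocol: nfs          # the denied mounts below use NFS
        reclaimPolicy: Delete
        volumeBindingMode: Immediate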
      

       

      But in some cases it only allows "selected virtual networks and IP addresses", with "*.worker-subnet" as the only allowed subnet, as below:

       

             "networkAcls": {
                  "bypass": "AzureServices",
                  "virtualNetworkRules": [
                      {
                          "id": "/subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/wduan-0906a-az-p95c4-rg/providers/Microsoft.Network/virtualNetworks/wduan-0906a-az-p95c4-vnet/subnets/wduan-0906a-az-p95c4-worker-subnet",
                          "action": "Allow",
                          "state": "Succeeded"
                      }
                  ],
                  "ipRules": [],
                  "defaultAction": "Deny"
              },
      

      But the scheduled node is actually a master node, which only sits on "*.master-subnet", so the azure-file mount fails with access denied from the master node, as below:

       

      Mounting arguments: -t nfs -o vers=4,minorversion=1,sec=sys f79137987692a4afea86fb6.file.core.windows.net:/f79137987692a4afea86fb6/pvcn-5dcfcd81-4b29-4876-b2eb-1a778657a35c /var/lib/kubelet/plugins/kubernetes.io/csi/file.csi.azure.com/091066f6c53b5709246f64097bd117917b9daedba792ff9a507b72e6f2cbb4b9/globalmount
        Output: mount.nfs: access denied by server while mounting f79137987692a4afea86fb6.file.core.windows.net:/f79137987692a4afea86fb6/pvcn-5dcfcd81-4b29-4876-b2eb-1a778657a35c
      

      I checked with the installer team: it makes sense for "*.worker-subnet" to exist even when there are no worker nodes yet, since it might be used for compute provisioning as a day-2 action. Still, this might impact several scenarios:

      1. compact/SNO clusters, as mentioned above
      2. regular clusters, when trying to schedule a pod with an Azure File PVC on a master node

       

      So I think we need to check how the Azure File CSI driver generates the network access rules when creating the storage account. I think "allow all" might be better, or at least both the ".master-subnet" and ".worker-subnet" subnets should be allowed.

      I'm not sure if this is the right code: https://github.com/openshift/azure-file-csi-driver/blob/master/vendor/sigs.k8s.io/cloud-provider-azure/pkg/provider/azure_storageaccount.go#L314
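
      As a possible mitigation to experiment with (an assumption based on the upstream azurefile-csi-driver parameter documentation, not verified against this OpenShift build): the NFS path appears to accept vnetResourceGroup/vnetName/subnetName StorageClass parameters that control which subnet lands in the virtual network rules, so pinning the master subnet explicitly might sidestep the nondeterminism:

        # Sketch only: assumes the upstream vnetResourceGroup/vnetName/subnetName
        # parameters (documented upstream for protocol: nfs) exist in this build.
        # The vnet and resource-group names are copied from the output above;
        # the master subnet name is inferred and may differ.
        apiVersion: storage.k8s.io/v1
        kind: StorageClass
        metadata:
          name: azurefile-csi-premium-nfs-pinned
        provisioner: file.csi.azure.com
        parameters:
          skuname: Premium_LRS
          protocol: nfs
          vnetResourceGroup: wduan-0906a-az-p95c4-rg
          vnetName: wduan-0906a-az-p95c4-vnet
          subnetName: wduan-0906a-az-p95c4-master-subnet   # inferred name
        reclaimPolicy: Delete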

       

      Again, it doesn't always happen, so in a regular cluster I think we might try the following (a sketch of the PVC/pod follows the list):

      1. create a PVC (with the sc skuname: Premium_LRS) and a pod (make it schedule to a master node only)

      2. check whether the pod is running, and check the storage account it uses in the portal

      3. remove the storage account and try again if the issue does not reproduce
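
      A sketch of step 1, with hypothetical object names; the nodeSelector/toleration force the pod onto a master node, and 100Gi is the minimum provisioned size for a premium file share:

        apiVersion: v1
        kind: PersistentVolumeClaim
        metadata:
          name: azurefile-nfs-pvc            # hypothetical
        spec:
          accessModes: ["ReadWriteMany"]
          storageClassName: azurefile-csi-premium-nfs   # sketch SC from above
          resources:
            requests:
              storage: 100Gi                 # premium file share minimum
        ---
        apiVersion: v1
        kind: Pod
        metadata:
          name: azurefile-nfs-tester         # hypothetical
        spec:
          # Force scheduling onto a master node to hit the denied subnet.
          nodeSelector:
            node-role.kubernetes.io/master: ""
          tolerations:
          - key: node-role.kubernetes.io/master
            operator: Exists
            effect: NoSchedule
          containers:
          - name: tester
            image: registry.access.redhat.com/ubi9/ubi-minimal
            command: ["sleep", "infinity"]
            volumeMounts:
            - name: data
              mountPath: /mnt/data
          volumes:
          - name: data
            persistentVolumeClaim:
              claimName: azurefile-nfs-pvc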

      See  https://issues.redhat.com/browse/OCPBUGS-18581?focusedId=22953323&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-22953323

       

      Version-Release number of selected component (if applicable):

      4.14.0-0.nightly-arm64-2023-09-05-140644 (found and checked on an arm64 build, but I assume it is the same on x86)

      Also reproduced in 4.14.0-0.nightly-2023-09-02-132842.

       

      How reproducible:

      Sometimes

       

      Steps to Reproduce:

      See Description

       

      Actual results:

      The mount fails and the pod is not running.

       

      Expected results:

      The mount succeeds and the pod is running.

      People

        fbertina@redhat.com Fabio Bertinatto
        wduan@redhat.com Wei Duan
        Wei Duan
        Lisa Pettyjohn