OpenShift Bugs / OCPBUGS-10762

Machine should be Failed if Machine has a Failed state on Azure


    • Low
    • No
    • CLOUD Sprint 234, CLOUD Sprint 235, CLOUD Sprint 236, CLOUD Sprint 237
    • 4
    • False

      Description of problem:

      When creating a machine with Azure Ultra Disks attached as data disks in an Arm cluster, the machine reaches the Provisioned phase, but the Azure web console shows that the instance failed with the error ZonalAllocationFailed.

      Version-Release number of selected component (if applicable):

      4.13.0-0.nightly-arm64-2023-03-22-204044

      How reproducible:

      Always

      Steps to Reproduce:

      
      /// Steps 1 to 5 are not needed to reproduce the issue; skip to step 6 ///
      
      1. Make sure the StorageClass is already present
      kind: StorageClass
      apiVersion: storage.k8s.io/v1
      metadata:
        name: ultra-disk-sc
      provisioner: disk.csi.azure.com # replace with "kubernetes.io/azure-disk" if aks version is less than 1.21
      volumeBindingMode: WaitForFirstConsumer # optional, but recommended if you want to wait until the pod that will use this disk is created 
      parameters:
        skuname: UltraSSD_LRS
        kind: managed
        cachingMode: None
        diskIopsReadWrite: "2000"  # minimum value: 2 IOPS/GiB 
        diskMbpsReadWrite: "320"   # minimum value: 0.032/GiB
      2. Create a new custom secret based on the worker-user-data secret
      $ oc -n openshift-machine-api get secret worker-user-data --template='{{index .data.userData | base64decode}}' | jq > userData.txt
      3. Edit userData.txt by adding the part below just before the closing '}', with a comma after the preceding entry
      "storage": {
        "disks": [
          {
            "device": "/dev/disk/azure/scsi1/lun0",
            "partitions": [
              {
                "label": "lun0p1",
                "sizeMiB": 1024,
                "startMiB": 0
              }
            ]
          }
        ],
        "filesystems": [
          {
            "device": "/dev/disk/by-partlabel/lun0p1",
            "format": "xfs",
            "path": "/var/lib/lun0p1"
          }
        ]
      },
      "systemd": {
        "units": [
          {
            "contents": "[Unit]\nBefore=local-fs.target\n[Mount]\nWhere=/var/lib/lun0p1\nWhat=/dev/disk/by-partlabel/lun0p1\nOptions=defaults,pquota\n[Install]\nWantedBy=local-fs.target\n",
            "enabled": true,
            "name": "var-lib-lun0p1.mount"
          }
        ]
      }
      4. Extract the disableTemplating value using the command below
      $ oc -n openshift-machine-api get secret worker-user-data --template='{{index .data.disableTemplating | base64decode}}' | jq > disableTemplating.txt
      5. Combine the two files into a new user data secret for the machineset to use
      $ oc -n openshift-machine-api create secret generic worker-user-data-x5 --from-file=userData=userData.txt --from-file=disableTemplating=disableTemplating.txt 
      
      
      /// Not needed up to here ///
      
      6. Modify the new machineset YAML, adding the dataDisks below as a separate field alongside osDisk
                dataDisks:
                - nameSuffix: ultrassd
                  lun: 0
                  diskSizeGB: 4 # The same issue on the machine status fields is reproducible on x86_64 by setting this to 65535, which exceeds the maximum limits of the Azure accounts we use.
                  cachingType: None
                  deletionPolicy: Delete
                  managedDisk:
                    storageAccountType: UltraSSD_LRS
      7. Scale up the machineset or delete an existing machine to force reprovisioning.
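
For anyone who prefers not to hand-edit userData.txt in step 3, the merge can be scripted. The following is a minimal sketch, assuming userData.txt holds a single JSON object that does not yet contain "storage" or "systemd" keys. The helper name `merge_user_data` and the placeholder input in the usage section are hypothetical and not part of this report.

```python
import json

# Fragment from step 3: one partition on the data disk at LUN 0, an XFS
# filesystem on that partition, and a systemd mount unit for /var/lib/lun0p1.
FRAGMENT = {
    "storage": {
        "disks": [
            {
                "device": "/dev/disk/azure/scsi1/lun0",
                "partitions": [
                    {"label": "lun0p1", "sizeMiB": 1024, "startMiB": 0}
                ],
            }
        ],
        "filesystems": [
            {
                "device": "/dev/disk/by-partlabel/lun0p1",
                "format": "xfs",
                "path": "/var/lib/lun0p1",
            }
        ],
    },
    "systemd": {
        "units": [
            {
                "contents": (
                    "[Unit]\nBefore=local-fs.target\n"
                    "[Mount]\nWhere=/var/lib/lun0p1\n"
                    "What=/dev/disk/by-partlabel/lun0p1\n"
                    "Options=defaults,pquota\n"
                    "[Install]\nWantedBy=local-fs.target\n"
                ),
                "enabled": True,
                "name": "var-lib-lun0p1.mount",
            }
        ]
    },
}


def merge_user_data(user_data_text: str, fragment: dict) -> str:
    """Return the user data JSON with the fragment's top-level keys added.

    Hypothetical helper: refuses to overwrite keys that already exist so
    that an existing "storage" or "systemd" section is never silently lost.
    """
    config = json.loads(user_data_text)
    clashes = sorted(set(config) & set(fragment))
    if clashes:
        raise ValueError(f"user data already defines: {', '.join(clashes)}")
    config.update(fragment)
    return json.dumps(config, indent=2)


if __name__ == "__main__":
    # Placeholder input (assumption); in practice read the userData.txt
    # produced in step 2 instead.
    sample = '{"ignition": {"version": "3.2.0"}}'
    print(merge_user_data(sample, FRAGMENT))
```

Writing the result back to userData.txt leaves steps 4 and 5 unchanged.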

      Actual results:

      The machine is stuck in the Provisioned phase, but Azure reports the instance as failed
      $ oc get machine -o wide                
      NAME                                        PHASE         TYPE               REGION      ZONE   AGE     NODE                                        PROVIDERID                                                                                                                                                                              STATE
      zhsunaz3231-lds8h-master-0                  Running       Standard_D8ps_v5   centralus   1      4h15m   zhsunaz3231-lds8h-master-0                  azure:///subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/zhsunaz3231-lds8h-rg/providers/Microsoft.Compute/virtualMachines/zhsunaz3231-lds8h-master-0                  Running
      zhsunaz3231-lds8h-master-1                  Running       Standard_D8ps_v5   centralus   2      4h15m   zhsunaz3231-lds8h-master-1                  azure:///subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/zhsunaz3231-lds8h-rg/providers/Microsoft.Compute/virtualMachines/zhsunaz3231-lds8h-master-1                  Running
      zhsunaz3231-lds8h-master-2                  Running       Standard_D8ps_v5   centralus   3      4h15m   zhsunaz3231-lds8h-master-2                  azure:///subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/zhsunaz3231-lds8h-rg/providers/Microsoft.Compute/virtualMachines/zhsunaz3231-lds8h-master-2                  Running
      zhsunaz3231-lds8h-worker-centralus1-sfhs7   Provisioned   Standard_D4ps_v5   centralus   1      3m23s                                               azure:///subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/zhsunaz3231-lds8h-rg/providers/Microsoft.Compute/virtualMachines/zhsunaz3231-lds8h-worker-centralus1-sfhs7   Creating
      
      $ oc get machine zhsunaz3231-lds8h-worker-centralus1-sfhs7 -o yaml
        - lastTransitionTime: "2023-03-23T06:07:32Z"
          message: 'Failed to check if machine exists: vm for machine zhsunaz3231-lds8h-worker-centralus1-sfhs7
            exists, but has unexpected ''Failed'' provisioning state'
          reason: ErrorCheckingProvider
          status: Unknown
          type: InstanceExists
        - lastTransitionTime: "2023-03-23T06:07:05Z"
          status: "True"
          type: Terminable
        lastUpdated: "2023-03-23T06:07:32Z"
        phase: Provisioned
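
The mismatch in the status above can also be detected mechanically. The following is a rough sketch, assuming the machine has been fetched as JSON (for example with `oc get machine <name> -o json`) and parsed into a dict, and that the conditions shown above live under `status.conditions`. The function name `is_stuck_after_cloud_failure` is hypothetical.

```python
def is_stuck_after_cloud_failure(machine: dict) -> bool:
    """Return True when the Machine is still Provisioned although its
    InstanceExists condition reports a 'Failed' provisioning state."""
    status = machine.get("status", {})
    if status.get("phase") != "Provisioned":
        return False
    for condition in status.get("conditions", []):
        if (
            condition.get("type") == "InstanceExists"
            and condition.get("status") == "Unknown"
            and "'Failed' provisioning state" in condition.get("message", "")
        ):
            return True
    return False


if __name__ == "__main__":
    # Status fields copied from the output in this report.
    machine = {
        "status": {
            "phase": "Provisioned",
            "conditions": [
                {
                    "type": "InstanceExists",
                    "status": "Unknown",
                    "reason": "ErrorCheckingProvider",
                    "message": (
                        "Failed to check if machine exists: vm for machine "
                        "zhsunaz3231-lds8h-worker-centralus1-sfhs7 exists, "
                        "but has unexpected 'Failed' provisioning state"
                    ),
                },
                {"type": "Terminable", "status": "True"},
            ],
        }
    }
    print(is_stuck_after_cloud_failure(machine))
```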

      Expected results:

      The machine should be in the Failed phase if the instance failed in Azure
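
To make the expectation concrete, the requested behaviour could be stated as a small decision rule. This is only an illustration of the expected outcome, not the controller's actual code. The function name `expected_phase` and its parameters are hypothetical, and every provisioning state other than Failed is treated the way the output in this report shows it (a VM in the Creating state is listed as Provisioned).

```python
def expected_phase(vm_provisioning_state: str, has_node: bool) -> str:
    """Machine phase this report expects for a VM that already exists.

    vm_provisioning_state: the state Azure reports for the VM, such as
        "Creating", "Succeeded" or "Failed".
    has_node: whether a Node is already linked to the Machine.
    """
    if vm_provisioning_state == "Failed":
        # Requested change: surface the cloud failure on the Machine
        # instead of leaving it in Provisioned.
        return "Failed"
    # Any other state keeps the behaviour seen in the report.
    return "Running" if has_node else "Provisioned"


if __name__ == "__main__":
    for state, node in [("Creating", False), ("Succeeded", True), ("Failed", False)]:
        print(state, node, expected_phase(state, node))
```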

      Additional info:

      must-gather: https://drive.google.com/file/d/1z1gyJg4NBT8JK2-aGvQCruJidDHs0DV6/view?usp=sharing

            dodvarka@redhat.com Daniel Odvarka (Inactive)
            rhn-support-zhsun Zhaohua Sun
            Alessandro Di Stefano Alessandro Di Stefano