-
Bug
-
Resolution: Done-Errata
-
Normal
-
4.14
-
None
-
Quality / Stability / Reliability
-
False
-
-
None
-
Low
-
No
-
None
-
None
-
CLOUD Sprint 234, CLOUD Sprint 235, CLOUD Sprint 236, CLOUD Sprint 237
-
4
Description of problem:
When creating a machine with an Azure Ultra Disk attached as a data disk in an Arm cluster, the machine reaches the Provisioned phase, but the Azure web console shows that the instance failed with the error ZonalAllocationFailed.
Version-Release number of selected component (if applicable):
4.13.0-0.nightly-arm64-2023-03-22-204044
How reproducible:
Always
Steps to Reproduce:
/// Steps 1-5 are not needed; skip to step 6 ///
1. Make sure the StorageClass is already present
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: ultra-disk-sc
provisioner: disk.csi.azure.com # replace with "kubernetes.io/azure-disk" if aks version is less than 1.21
volumeBindingMode: WaitForFirstConsumer # optional, but recommended if you want to wait until the pod that will use this disk is created
parameters:
  skuname: UltraSSD_LRS
  kind: managed
  cachingMode: None
  diskIopsReadWrite: "2000" # minimum value: 2 IOPS/GiB
  diskMbpsReadWrite: "320" # minimum value: 0.032/GiB
2. Create a new custom user-data secret based on the existing worker-user-data secret. Start by extracting its userData:
$ oc -n openshift-machine-api get secret worker-user-data --template='{{index .data.userData | base64decode}}' | jq > userData.txt
3. Edit userData.txt: add a comma after the last existing entry, then add the section below just before the closing '}'
  "storage": {
    "disks": [
      {
        "device": "/dev/disk/azure/scsi1/lun0",
        "partitions": [
          {
            "label": "lun0p1",
            "sizeMiB": 1024,
            "startMiB": 0
          }
        ]
      }
    ],
    "filesystems": [
      {
        "device": "/dev/disk/by-partlabel/lun0p1",
        "format": "xfs",
        "path": "/var/lib/lun0p1"
      }
    ]
  },
  "systemd": {
    "units": [
      {
        "contents": "[Unit]\nBefore=local-fs.target\n[Mount]\nWhere=/var/lib/lun0p1\nWhat=/dev/disk/by-partlabel/lun0p1\nOptions=defaults,pquota\n[Install]\nWantedBy=local-fs.target\n",
        "enabled": true,
        "name": "var-lib-lun0p1.mount"
      }
    ]
  }
4. Extract the disableTemplating value with the command below
$ oc -n openshift-machine-api get secret worker-user-data --template='{{index .data.disableTemplating | base64decode}}' | jq > disableTemplating.txt
5. Create the new user-data secret from the two files
$ oc -n openshift-machine-api create secret generic worker-user-data-x5 --from-file=userData=userData.txt --from-file=disableTemplating=disableTemplating.txt
/// Not needed up to here ///
6. Modify the new MachineSet YAML, adding the dataDisks below as a separate field alongside osDisk
dataDisks:
- nameSuffix: ultrassd
  lun: 0
  diskSizeGB: 4 # The same issue on the machine status fields is reproducible on x86_64 by setting 65535 to overcome the maximum limits of the Azure accounts we use.
  cachingType: None
  deletionPolicy: Delete
  managedDisk:
    storageAccountType: UltraSSD_LRS
7. Scale up the MachineSet or delete an existing machine to force reprovisioning.
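The manual edit in step 3 can also be scripted. The following is a minimal Python sketch, assuming userData.txt holds a single JSON object (the Ignition config) that does not yet have a top-level "storage" or "systemd" key. The function names are hypothetical and not part of any OpenShift tooling; the section values are copied from step 3.

```python
import json

# Extra Ignition sections from step 3 of this report.
EXTRA_SECTIONS = {
    "storage": {
        "disks": [
            {
                "device": "/dev/disk/azure/scsi1/lun0",
                "partitions": [
                    {"label": "lun0p1", "sizeMiB": 1024, "startMiB": 0}
                ],
            }
        ],
        "filesystems": [
            {
                "device": "/dev/disk/by-partlabel/lun0p1",
                "format": "xfs",
                "path": "/var/lib/lun0p1",
            }
        ],
    },
    "systemd": {
        "units": [
            {
                "contents": (
                    "[Unit]\nBefore=local-fs.target\n"
                    "[Mount]\nWhere=/var/lib/lun0p1\n"
                    "What=/dev/disk/by-partlabel/lun0p1\n"
                    "Options=defaults,pquota\n"
                    "[Install]\nWantedBy=local-fs.target\n"
                ),
                "enabled": True,
                "name": "var-lib-lun0p1.mount",
            }
        ]
    },
}


def add_data_disk_sections(user_data: dict) -> dict:
    """Return a copy of the config with the step 3 sections added."""
    merged = dict(user_data)
    for key, value in EXTRA_SECTIONS.items():
        # Assumption: refuse to overwrite an existing section rather than guess
        # how the two should be combined.
        if key in merged:
            raise ValueError(f"userData already has a {key!r} section; merge it by hand")
        merged[key] = value
    return merged


def rewrite_user_data(path: str) -> None:
    """Load the JSON file at `path`, add the sections, and write it back."""
    with open(path) as f:
        config = json.load(f)
    with open(path, "w") as f:
        json.dump(add_data_disk_sections(config), f, indent=2)
```

Calling rewrite_user_data("userData.txt") after step 2 would replace the hand edit; step 5 then consumes the file unchanged.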
Actual results:
The machine is stuck in the Provisioned phase, but in Azure the instance has failed.
$ oc get machine -o wide
NAME                                        PHASE         TYPE               REGION      ZONE   AGE     NODE                         PROVIDERID   STATE
zhsunaz3231-lds8h-master-0                  Running       Standard_D8ps_v5   centralus   1      4h15m   zhsunaz3231-lds8h-master-0   azure:///subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/zhsunaz3231-lds8h-rg/providers/Microsoft.Compute/virtualMachines/zhsunaz3231-lds8h-master-0   Running
zhsunaz3231-lds8h-master-1                  Running       Standard_D8ps_v5   centralus   2      4h15m   zhsunaz3231-lds8h-master-1   azure:///subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/zhsunaz3231-lds8h-rg/providers/Microsoft.Compute/virtualMachines/zhsunaz3231-lds8h-master-1   Running
zhsunaz3231-lds8h-master-2                  Running       Standard_D8ps_v5   centralus   3      4h15m   zhsunaz3231-lds8h-master-2   azure:///subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/zhsunaz3231-lds8h-rg/providers/Microsoft.Compute/virtualMachines/zhsunaz3231-lds8h-master-2   Running
zhsunaz3231-lds8h-worker-centralus1-sfhs7   Provisioned   Standard_D4ps_v5   centralus   1      3m23s                                azure:///subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/zhsunaz3231-lds8h-rg/providers/Microsoft.Compute/virtualMachines/zhsunaz3231-lds8h-worker-centralus1-sfhs7   Creating
$ oc get machine zhsunaz3231-lds8h-worker-centralus1-sfhs7 -o yaml
  - lastTransitionTime: "2023-03-23T06:07:32Z"
    message: 'Failed to check if machine exists: vm for machine zhsunaz3231-lds8h-worker-centralus1-sfhs7 exists, but has unexpected ''Failed'' provisioning state'
    reason: ErrorCheckingProvider
    status: Unknown
    type: InstanceExists
  - lastTransitionTime: "2023-03-23T06:07:05Z"
    status: "True"
    type: Terminable
  lastUpdated: "2023-03-23T06:07:32Z"
  phase: Provisioned
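The symptom shown above (phase Provisioned together with an InstanceExists condition whose status is Unknown) can be checked for mechanically. The following is a rough Python sketch, assuming the input is the parsed JSON list produced by `oc get machine -n openshift-machine-api -o json`; the helper name is hypothetical, and the field names are the ones visible in the status excerpt of this report.

```python
def find_stuck_machines(machine_list: dict) -> list:
    """Return (name, message) pairs for machines that show this bug's symptom.

    A machine is reported when its phase is Provisioned while its
    InstanceExists condition has status Unknown.
    """
    stuck = []
    for machine in machine_list.get("items", []):
        status = machine.get("status", {})
        if status.get("phase") != "Provisioned":
            continue
        for cond in status.get("conditions", []):
            if cond.get("type") == "InstanceExists" and cond.get("status") == "Unknown":
                stuck.append((machine["metadata"]["name"], cond.get("message", "")))
    return stuck
```

One possible use is to save the command output to a file, load it with `json.load`, and print each returned pair. A machine that is still provisioning normally is not reported, because its InstanceExists condition would not be Unknown.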
Expected results:
The machine should move to the Failed phase when the instance has failed in Azure.
Additional info:
must-gather: https://drive.google.com/file/d/1z1gyJg4NBT8JK2-aGvQCruJidDHs0DV6/view?usp=sharing