Bug
Resolution: Cannot Reproduce
4.18
Quality / Stability / Reliability
Critical
Description of problem:
The Local Storage Operator fails to make a Persistent Volume (PV) available again after use if the underlying device is an IBM ECKD DASD used in Block mode.
The PV remains in the following state:
# oc get pv
NAME               CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS     CLAIM                STORAGECLASS   VOLUMEATTRIBUTESCLASS   REASON   AGE
[...]
local-pv-78f4a9a   70Gi       RWO            Delete           Released   default/vm-pvc-fio   local-dasd     <unset>                          11m

# oc describe pv local-pv-78f4a9a
Name: local-pv-78f4a9a
[...]
Events:
  Type    Reason        Age    From                            Message
  ----    ------        ----   ----                            -------
  Normal  VolumeDelete  2m47s  localvolume-symlink-controller  Starting cleanup of Block PV "local-pv-78f4a9a", this may take a while

The operator log reveals the underlying issue: mkfs.xfs crashes with a segmentation fault:

# oc logs diskmaker-manager-prwxv -n openshift-local-storage
I0321 14:56:28.260731 311971 deleter.go:324] Cleanup pv "local-pv-78f4a9a": StderrBuf - "Metadata CRC error detected at 0x2aa35d15890, xfs_agf block 0x8/0x1000"
I0321 14:56:28.262276 311971 deleter.go:324] Cleanup pv "local-pv-78f4a9a": StderrBuf - "Metadata CRC error detected at 0x2aa35d32800, xfs_agi block 0x10/0x1000"
I0321 14:56:28.262861 311971 deleter.go:324] Cleanup pv "local-pv-78f4a9a": StderrBuf - "Metadata CRC error detected at 0x2aa35d4059a, xfs_sb block 0x0/0x1000"
I0321 14:56:28.798599 311971 deleter.go:324] Cleanup pv "local-pv-78f4a9a": StderrBuf - "/scripts/quick_reset.sh: line 36: 317053 Segmentation fault (core dumped) ionice -c 3 mkfs.xfs -f $LOCAL_PV_BLKDEVICE"
This behavior is caused by a requirement specific to IBM ECKD DASD devices:

# https://www.ibm.com/docs/en/linux-on-systems?topic=wd-preparing-eckd-1
Before you can use an ECKD type DASD as a disk for Linux® on IBM® Z, you must format it with a suitable disk layout.

A suitable disk layout is mandatory before the DASD can be used. The `quick_reset.sh` script does not check whether the disk is a DASD and therefore runs mkfs.xfs directly on the DASD without a partition.
This behavior makes it impossible to use IBM ECKD DASD with Local-Storage-Operator in Block mode without manual cleanup.
Furthermore, the `quick_reset.sh` script cannot be disabled by setting `forceWipeDevicesAndDestroyAllData: false`. Is using a PV in Block mode expected to be a destructive action?
Therefore, the `quick_reset.sh` script must not execute mkfs.xfs on an ECKD DASD device with no partition. The device driver exposed at OS level reveals the device type and could be used to implement a check, e.g. `/sys/class/block/dasdb/device/subsystem/drivers/dasd-eckd`
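A minimal sketch of such a check follows. It is not taken from the shipped `quick_reset.sh`; the function name `is_eckd_dasd` and the `SYSFS_ROOT` override are hypothetical and exist only for illustration. It also assumes that the `device/driver` symlink in sysfs points at the bound driver (`dasd-eckd` for ECKD DASDs), which is a slightly different path from the one quoted above and should be verified on a real IBM Z worker.

```shell
#!/bin/sh
# Hypothetical helper, not part of the shipped quick_reset.sh.
# SYSFS_ROOT can be overridden so the check can be exercised without DASD hardware.
SYSFS_ROOT="${SYSFS_ROOT:-/sys}"

# is_eckd_dasd DEVICE
# Returns 0 if the whole-disk block device (e.g. /dev/dasdb) is bound to the
# dasd-eckd driver, 1 otherwise. Assumption: device/driver is a symlink whose
# last path component is the driver name.
is_eckd_dasd() {
    # Resolve symlinks (e.g. a by-id path) to the kernel device node if possible.
    resolved=$(readlink -f "$1" 2>/dev/null) || resolved=$1
    dev_name=$(basename "$resolved")
    driver_link="$SYSFS_ROOT/class/block/$dev_name/device/driver"
    [ -L "$driver_link" ] || return 1
    driver=$(basename "$(readlink "$driver_link")")
    [ "$driver" = "dasd-eckd" ]
}
```

A cleanup script could call `is_eckd_dasd "$LOCAL_PV_BLKDEVICE"` before the mkfs.xfs step and skip or replace the step when it returns 0. What the correct replacement action is for an ECKD DASD is left open here.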
I have attached all necessary logs on my Red Hat Google Drive, including must-gather and sosreport of the worker with the IBM ECKD DASD: https://drive.google.com/drive/folders/1rfmvebgrLjTzU6oKLI3M2ctpF3kUWZ2w?usp=drive_link
Thank you very much for looking into this issue. Please let me know if you need further data or input.
Version-Release number of selected component (if applicable):
4.18.4
How reproducible:
Always
Steps to Reproduce:
- Attach ECKD DASD on Worker
- Make it available via LocalVolume
- Use it in Block mode with a PVC
- Delete PVC
Actual results:
- PV is stuck in the Released state forever (due to the segmentation fault of mkfs.xfs in the quick_reset.sh hack script)
Expected results:
- PV returns to the Available state
Additional info:
Setting forceWipeDevicesAndDestroyAllData: false does not help