Description of problem:
This is similar to the issue reported in https://issues.redhat.com/browse/OCPBUGS-29474, where the basic problem is disk name swapping causing issues across reboots. Traditionally, the Linux kernel used in RHEL, as in all other distributions, does not guarantee that disk names stay persistent across reboots.
At Red Hat IBM COC, another of our clients is facing a similar issue (not exactly the same scenario of dedicated storage for etcd, but a separate disk for Ceph storage):
- Bare-metal installation on vSphere ESXi 7.0
- Three master VMs
- Three worker VMs with two HDDs each
The primary 192 GB disk is for the OS and the other 1 TB disk is for ODF storage.
The installation went through properly, detecting the disks for the OS and for ODF (1 TB) as expected, but reboots flipped the names. When this happens, CoreOS itself boots without problems, but the rook-ceph-osd-x-xxxx pod scheduled on such a node fails.
# ssh core@worker0.xxxx lsblk
NAME   MAJ:MIN RM  SIZE RO TYPE MOUNTPOINTS
loop0    7:0    0    1T  0 loop
sda      8:0    0  192G  0 disk
├─sda1   8:1    0    1M  0 part
├─sda2   8:2    0  127M  0 part
├─sda3   8:3    0  384M  0 part /boot
....
....
sdb      8:16   0    1T  0 disk
sr0     11:0    1 1024M  0 rom
The rook-ceph-osd-x-xxxx pod works fine on a node that has the correct device paths, as shown above. But when /dev/sda and /dev/sdb are swapped, the pod stops working. This happens randomly whenever a node is rebooted.
$ cat proc/partitions
major minor  #blocks  name
   8     0  201326592 sda
   8     1       1024 sda1
   8     2     130048 sda2
   8     3     393216 sda3
   8     4  200801263 sda4
   8    16 1073741824 sdb

$ cat sos_commands/block/blkid_-c_.dev.null | grep sd
/dev/sdb: TYPE="ceph_bluestore"
/dev/sda4: LABEL="root" UUID="d4e71bf1-e1b1-4061-8820-94d415bac740" TYPE="xfs" PARTLABEL="root" PARTUUID="7f8ab0e2-e8a2-4678-a692-8420fa0da9ee"
/dev/sda2: SEC_TYPE="msdos" LABEL_FATBOOT="EFI-SYSTEM" LABEL="EFI-SYSTEM" UUID="E3C4-265C" TYPE="vfat" PARTLABEL="EFI-SYSTEM" PARTUUID="162c85ea-992b-440a-b0bd-a817347eaecc"
/dev/sda3: LABEL="boot" UUID="9fb2c61c-5556-46aa-9836-69af853caa4b" TYPE="ext4" PARTLABEL="boot" PARTUUID="60e64345-8fcd-454a-82c3-e20b79aaa68a"
/dev/sda1: PARTLABEL="BIOS-BOOT" PARTUUID="637709e2-87f5-4e3d-ac30-1d6996e85259"
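Regardless of which kernel name the data disk receives, udev normally exposes stable identifiers for it. A quick check on an affected node (a sketch; note that on vSphere, /dev/disk/by-id entries for virtual disks generally only appear when disk.EnableUUID = "TRUE" is set in the VM's advanced configuration):

$ ls -l /dev/disk/by-id/ /dev/disk/by-path/
$ udevadm info --query=property /dev/sdb | grep -E 'ID_SERIAL|ID_WWN|ID_PATH'

If these symlinks resolve correctly on the affected nodes, they give a reboot-safe way to address the 1 TB disk even while sda/sdb keep swapping.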
We need a solution/fix that guarantees the secondary disk used for ODF is always detected as sdb in this specific case.
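As one possible direction (a hedged sketch, not a supported fix): a udev rule can pin a stable symlink to the data disk so that consumers no longer depend on the sda/sdb ordering. The serial value below is a placeholder, and on RHCOS such a rule would need to be delivered via a MachineConfig rather than edited on the node directly:

# Read the data disk's serial first (device name is whatever it currently is):
$ udevadm info --query=property /dev/sdb | grep ID_SERIAL

# Hypothetical rule: create /dev/ceph-data pointing at whichever sdX has that serial.
$ cat <<'EOF' | sudo tee /etc/udev/rules.d/99-ceph-data.rules
KERNEL=="sd?", SUBSYSTEM=="block", ENV{ID_SERIAL}=="<SERIAL-OF-1TB-DISK>", SYMLINK+="ceph-data"
EOF
$ sudo udevadm control --reload-rules && sudo udevadm trigger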
Version-Release number of selected component (if applicable):
4.x releases
How reproducible:
Random, but roughly a 50% chance on each reboot of a node that has multiple disks dedicated to such purposes.
Steps to Reproduce:
1. Install OCP.
2. Install ODF and dedicate a separate disk (sdb) to Ceph storage.
3. Reboot the nodes and observe that the disk names change randomly across reboots.
Actual results:
The disk names aren't persistent.
Expected results:
The disk names should be persistent.
Additional info:
Disk naming is not guaranteed by design, hence we might need a custom-tailored solution for this scenario.
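If the disks are handed to ODF through the Local Storage Operator, one option worth evaluating is referencing the data disk by its stable /dev/disk/by-id path instead of /dev/sdb. A minimal sketch, assuming the Local Storage Operator is installed (the resource name and the by-id value are placeholders):

$ oc apply -f - <<'EOF'
apiVersion: local.storage.openshift.io/v1
kind: LocalVolume
metadata:
  name: local-block
  namespace: openshift-local-storage
spec:
  storageClassDevices:
    - storageClassName: localblock
      volumeMode: Block
      devicePaths:
        - /dev/disk/by-id/<STABLE-ID-OF-1TB-DISK>
EOF

This sidesteps the kernel-name swap entirely, because the by-id symlink follows the device's serial/WWN rather than its probe order.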
- Impacts account
- Related: OCPBUGS-29474 "Persistent disk naming issues persist across reboots in CoreOS, challenging conventional fixes, impacting various environments and requiring robust solutions." (Closed)