Bug
Resolution: Not a Bug
Affects Versions: 4.16.z, 4.18.z, 4.20.z
Quality / Stability / Reliability
contract-priority
This is a follow-up to the original OCPBUGS-56857 (original bug).

Description of problem:
Deploying two bare-metal 4.16.24 clusters using ZTP (advanced-cluster-management.v2.11.3, openshift-gitops-operator.v1.14.1). A lab (19 nodes) deployed successfully, but the same procedure on two other sites (20 nodes each, with the exact same hardware) fails: some random nodes fail to deploy with the error:

ostree-prepare-root: Couldn't find specified OSTree root '/sysroot//ostree/boot.0/rhcos/xxx.../0': No such file or directory

The role of the failing node does not matter: sometimes it is a master, which halts the whole deployment, and sometimes a storage, gateway, or worker node. The customer has been able to deploy by role: all 3 masters, then 2 gateways, then 4 storage nodes; but when deploying the 11 workers, one of them failed. The failing nodes drop into emergency mode; there we checked /sysroot and found it empty. Our suspicion was that something in the hardware settings changes the order of the disks, because after a reboot the node is unable to find the boot disk unless it is rebooted several times, at which point Ignition starts again. However, the last logs provided show everything working as expected. The related case contains must-gathers, sosreports, SiteConfig files, deployment logs, and the log from the failing RHCOS deployment.
Version-Release number of selected component (if applicable):
The customer reports facing the issue in 4.16 and 4.18, and also when testing 4.20 RC.
How reproducible:
Not consistently; the issue appears to be very random.
Additional info:
It looks like the customer has made some progress troubleshooting the issue; we are adding some context here.
They report that creating the ConfigMap that defines the partitioning schema without defining the root device inside it seems to make the deployment work without any ostree issue 100% of the time.
This is the original Ignition override, which randomly leads to ostree/boot issues:
{
  "ignition": { "version": "3.2.0" },
  "storage": {
    "disks": [
      {
        "device": "/dev/disk/by-path/pci-0000:4a:00.0-scsi-0:2:0:0",
        "wipeTable": true,
        "partitions": [
          { "number": 1, "sizeMiB": 1, "wipePartitionEntry": true },
          { "number": 2, "sizeMiB": 127, "wipePartitionEntry": true },
          { "number": 3, "sizeMiB": 384, "wipePartitionEntry": true },
          { "number": 4, "sizeMiB": 0, "wipePartitionEntry": true }
        ]
      },
      {
        "device": "/dev/disk/by-path/pci-0000:4a:00.0-scsi-0:2:1:0",
        "wipeTable": true,
        "partitions": [
          { "number": 1, "label": "var-lib-containers", "startMiB": 0, "sizeMiB": 0, "wipePartitionEntry": true }
        ]
      },
      {
        "device": "/dev/disk/by-path/pci-0000:4a:00.0-scsi-0:2:2:0",
        "wipeTable": true,
        "partitions": [
          { "number": 1, "label": "var-lib-etcd", "startMiB": 0, "sizeMiB": 0, "wipePartitionEntry": true }
        ]
      },
      {
        "device": "/dev/disk/by-path/pci-0000:4a:00.0-scsi-0:2:3:0",
        "wipeTable": true,
        "partitions": [
          { "number": 1, "label": "var-lib-prometheus-data", "startMiB": 0, "sizeMiB": 0, "wipePartitionEntry": true }
        ]
      }
    ],
    "filesystems": [
      {
        "device": "/dev/disk/by-partlabel/var-lib-containers",
        "wipeFilesystem": true,
        "format": "xfs",
        "mountOptions": [ "defaults", "prjquota" ],
        "path": "/var/lib/var-lib-containers"
      },
      {
        "device": "/dev/disk/by-partlabel/var-lib-etcd",
        "wipeFilesystem": true,
        "format": "xfs",
        "mountOptions": [ "defaults", "prjquota" ],
        "path": "/var/lib/var-lib-etcd"
      },
      {
        "device": "/dev/disk/by-partlabel/var-lib-prometheus-data",
        "wipeFilesystem": true,
        "format": "xfs"
      }
    ]
  }
}
And this is the one that completely omits the root device, which the customer reports as working 100% of the time (4.16, 4.18, 4.20 RC):
{
  "ignition": { "version": "3.2.0" },
  "storage": {
    "disks": [
      {
        "device": "/dev/disk/by-path/pci-0000:4a:00.0-scsi-0:2:1:0",
        "wipeTable": true,
        "partitions": [
          { "number": 1, "label": "var-lib-containers", "startMiB": 0, "sizeMiB": 0, "wipePartitionEntry": true }
        ]
      },
      {
        "device": "/dev/disk/by-path/pci-0000:4a:00.0-scsi-0:2:2:0",
        "wipeTable": true,
        "partitions": [
          { "number": 1, "label": "var-lib-etcd", "startMiB": 0, "sizeMiB": 0, "wipePartitionEntry": true }
        ]
      },
      {
        "device": "/dev/disk/by-path/pci-0000:4a:00.0-scsi-0:2:3:0",
        "wipeTable": true,
        "partitions": [
          { "number": 1, "label": "var-lib-prometheus-data", "startMiB": 0, "sizeMiB": 0, "wipePartitionEntry": true }
        ]
      }
    ],
    "filesystems": [
      {
        "device": "/dev/disk/by-partlabel/var-lib-containers",
        "wipeFilesystem": true,
        "format": "xfs",
        "mountOptions": [ "defaults", "prjquota" ],
        "path": "/var/lib/var-lib-containers"
      },
      {
        "device": "/dev/disk/by-partlabel/var-lib-etcd",
        "wipeFilesystem": true,
        "format": "xfs",
        "mountOptions": [ "defaults", "prjquota" ],
        "path": "/var/lib/var-lib-etcd"
      },
      {
        "device": "/dev/disk/by-partlabel/var-lib-prometheus-data",
        "wipeFilesystem": true,
        "format": "xfs"
      }
    ]
  }
}
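The working override is identical to the failing one apart from the root-disk entry. As a minimal sketch (Python; the function name and the two-disk abridged input are mine, the device paths and partition sizes are from the report), the delta between the two configs is just this transformation:

```python
import json

# Stable by-path name of the root/boot disk, taken from the failing override.
ROOT_DISK = "/dev/disk/by-path/pci-0000:4a:00.0-scsi-0:2:0:0"

def drop_root_disk(override, root_disk=ROOT_DISK):
    """Return a copy of an Ignition override with the root disk's 'disks'
    entry removed; data-disk entries and 'filesystems' are left intact."""
    out = json.loads(json.dumps(override))  # deep copy via JSON round-trip
    disks = out.get("storage", {}).get("disks", [])
    out["storage"]["disks"] = [d for d in disks if d.get("device") != root_disk]
    return out

# Abridged version of the failing override (root disk plus one data disk).
failing = {
    "ignition": {"version": "3.2.0"},
    "storage": {"disks": [
        {"device": ROOT_DISK, "wipeTable": True, "partitions": [
            {"number": 1, "sizeMiB": 1, "wipePartitionEntry": True},
            {"number": 2, "sizeMiB": 127, "wipePartitionEntry": True},
            {"number": 3, "sizeMiB": 384, "wipePartitionEntry": True},
            {"number": 4, "sizeMiB": 0, "wipePartitionEntry": True}]},
        {"device": "/dev/disk/by-path/pci-0000:4a:00.0-scsi-0:2:1:0",
         "wipeTable": True, "partitions": [
            {"number": 1, "label": "var-lib-containers", "startMiB": 0,
             "sizeMiB": 0, "wipePartitionEntry": True}]}]},
}

working = drop_root_disk(failing)
print([d["device"] for d in working["storage"]["disks"]])
# -> ['/dev/disk/by-path/pci-0000:4a:00.0-scsi-0:2:1:0']
```

Everything else in the override (the data-disk partitions and the filesystems) is unchanged between the failing and working versions.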
Also, skoksal@redhat.com got similar feedback from jhernand-rh on another Slack thread [1] that seems to point in the same direction:
I believe it isn't necessary to define the partitions for the first disk again. The CoreOS installer (not exactly the Assisted Installer) will automatically create that same partition scheme by default. It would only be needed if you wanted to change the size of some partition, or create additional partitions on the same disk. But it looks like in your scenario you aren't changing anything on that disk, only on the other disks.
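This point can be cross-checked against the override itself: the four partitions it declares on the first disk restate the default RHCOS layout (1 MiB BIOS boot, 127 MiB EFI system, 384 MiB /boot, root taking the remaining space), so re-declaring them changes nothing. A small sanity check (Python; the default sizes are the well-known RHCOS defaults, not taken from this report):

```python
# sizeMiB 0 means "use the remaining space" in Ignition.
DEFAULT_RHCOS_MIB = [1, 127, 384, 0]  # BIOS boot, EFI, /boot, root

# The root-disk partitions declared in the failing override:
declared = [p["sizeMiB"] for p in [
    {"number": 1, "sizeMiB": 1, "wipePartitionEntry": True},
    {"number": 2, "sizeMiB": 127, "wipePartitionEntry": True},
    {"number": 3, "sizeMiB": 384, "wipePartitionEntry": True},
    {"number": 4, "sizeMiB": 0, "wipePartitionEntry": True},
]]

print(declared == DEFAULT_RHCOS_MIB)  # the override merely restates the default
# -> True
```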
But again, the random ostree issue might still need to be addressed in case the customer needs to customize, e.g., the partitions.
That said, even if the customer is able to work around the original blocker as described, there are still some grey areas, and they are looking for guidance in order to properly configure their templates/nodes:
Is something wrong in their Ignition config override? Maybe some labeling was introduced for these partitions that they did not know about, or should they skip it completely and let the installer do its job? (In 4.18 with Dell hardware they sometimes faced the same issue.) What is not clear now is how they should define the Ignition config override for agent-based installation and also for ACM-based installation.
Thank you