Bug
Resolution: Not a Bug
Affects Versions: 4.16.z, 4.18.z, 4.20.z
Quality / Stability / Reliability
contract-priority
This is a follow-up to the original OCPBUGS-56857 (original bug).

Description of problem:
Deploying two bare-metal 4.16.24 clusters using ZTP (advanced-cluster-management.v2.11.3, openshift-gitops-operator.v1.14.1). A lab (19 nodes) deployed successfully, but the same procedure on two other sites (20 nodes each, with the exact same hardware) fails: some random nodes fail to deploy with the error:

ostree-prepare-root: Couldn't find specified OSTree root '/sysroot//ostree/boot.0/rhcos/xxx.../0': No such file or directory

The role of the failing node does not matter: sometimes it is a master, which halts the whole deployment, and sometimes a storage, gateway, or worker node. The customer has been able to deploy by role: all 3 masters, then 2 gateways, then 4 storage nodes; but when deploying the 11 workers, one of them failed. The failing nodes drop into emergency mode; there we checked /sysroot and found it empty. Our suspicion was that something in the hardware settings changes the order of the disks, because after a reboot the node is unable to find the boot disk unless it is rebooted several times, at which point Ignition starts again. However, the last logs provided show everything working as expected. The related case contains must-gathers, sosreports, SiteConfig files, deployment logs, and the log from the failing RHCOS deployment.
Version-Release number of selected component (if applicable):
The customer reports facing the issue in 4.16 and 4.18, and also when testing 4.20 RC.
How reproducible:
Not consistently; the issue appears to be very random.
Additional info:
It looks like the customer has made some progress troubleshooting the issue; we are adding some context here.
They report that creating the ConfigMap that defines the partitioning schema without defining the root device inside it seems to make the deployment work without any ostree issue 100% of the time.
This is the original Ignition override, which randomly leads to ostree/boot issues:
{
  "ignition": { "version": "3.2.0" },
  "storage": {
    "disks": [
      {
        "device": "/dev/disk/by-path/pci-0000:4a:00.0-scsi-0:2:0:0",
        "wipeTable": true,
        "partitions": [
          { "number": 1, "sizeMiB": 1, "wipePartitionEntry": true },
          { "number": 2, "sizeMiB": 127, "wipePartitionEntry": true },
          { "number": 3, "sizeMiB": 384, "wipePartitionEntry": true },
          { "number": 4, "sizeMiB": 0, "wipePartitionEntry": true }
        ]
      },
      {
        "device": "/dev/disk/by-path/pci-0000:4a:00.0-scsi-0:2:1:0",
        "wipeTable": true,
        "partitions": [
          { "number": 1, "label": "var-lib-containers", "startMiB": 0, "sizeMiB": 0, "wipePartitionEntry": true }
        ]
      },
      {
        "device": "/dev/disk/by-path/pci-0000:4a:00.0-scsi-0:2:2:0",
        "wipeTable": true,
        "partitions": [
          { "number": 1, "label": "var-lib-etcd", "startMiB": 0, "sizeMiB": 0, "wipePartitionEntry": true }
        ]
      },
      {
        "device": "/dev/disk/by-path/pci-0000:4a:00.0-scsi-0:2:3:0",
        "wipeTable": true,
        "partitions": [
          { "number": 1, "label": "var-lib-prometheus-data", "startMiB": 0, "sizeMiB": 0, "wipePartitionEntry": true }
        ]
      }
    ],
    "filesystems": [
      {
        "device": "/dev/disk/by-partlabel/var-lib-containers",
        "wipeFilesystem": true,
        "format": "xfs",
        "mountOptions": [ "defaults", "prjquota" ],
        "path": "/var/lib/var-lib-containers"
      },
      {
        "device": "/dev/disk/by-partlabel/var-lib-etcd",
        "wipeFilesystem": true,
        "format": "xfs",
        "mountOptions": [ "defaults", "prjquota" ],
        "path": "/var/lib/var-lib-etcd"
      },
      {
        "device": "/dev/disk/by-partlabel/var-lib-prometheus-data",
        "wipeFilesystem": true,
        "format": "xfs"
      }
    ]
  }
}
And this is the one that completely omits the root device, which the customer reports as working 100% of the time (4.16, 4.18, 4.20 RC):
{
  "ignition": { "version": "3.2.0" },
  "storage": {
    "disks": [
      {
        "device": "/dev/disk/by-path/pci-0000:4a:00.0-scsi-0:2:1:0",
        "wipeTable": true,
        "partitions": [
          { "number": 1, "label": "var-lib-containers", "startMiB": 0, "sizeMiB": 0, "wipePartitionEntry": true }
        ]
      },
      {
        "device": "/dev/disk/by-path/pci-0000:4a:00.0-scsi-0:2:2:0",
        "wipeTable": true,
        "partitions": [
          { "number": 1, "label": "var-lib-etcd", "startMiB": 0, "sizeMiB": 0, "wipePartitionEntry": true }
        ]
      },
      {
        "device": "/dev/disk/by-path/pci-0000:4a:00.0-scsi-0:2:3:0",
        "wipeTable": true,
        "partitions": [
          { "number": 1, "label": "var-lib-prometheus-data", "startMiB": 0, "sizeMiB": 0, "wipePartitionEntry": true }
        ]
      }
    ],
    "filesystems": [
      {
        "device": "/dev/disk/by-partlabel/var-lib-containers",
        "wipeFilesystem": true,
        "format": "xfs",
        "mountOptions": [ "defaults", "prjquota" ],
        "path": "/var/lib/var-lib-containers"
      },
      {
        "device": "/dev/disk/by-partlabel/var-lib-etcd",
        "wipeFilesystem": true,
        "format": "xfs",
        "mountOptions": [ "defaults", "prjquota" ],
        "path": "/var/lib/var-lib-etcd"
      },
      {
        "device": "/dev/disk/by-partlabel/var-lib-prometheus-data",
        "wipeFilesystem": true,
        "format": "xfs"
      }
    ]
  }
}
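The working override is identical to the failing one apart from the root-disk entry. As a minimal sketch (Python; the function name and the two-disk abridged input are mine, the device paths and partition sizes are from the report), the delta between the two configs is just this transformation:

```python
import json

# Stable by-path name of the root/boot disk, taken from the failing override.
ROOT_DISK = "/dev/disk/by-path/pci-0000:4a:00.0-scsi-0:2:0:0"

def drop_root_disk(override, root_disk=ROOT_DISK):
    """Return a copy of an Ignition override with the root disk's 'disks'
    entry removed; data-disk entries and 'filesystems' are left intact."""
    out = json.loads(json.dumps(override))  # deep copy via JSON round-trip
    disks = out.get("storage", {}).get("disks", [])
    out["storage"]["disks"] = [d for d in disks if d.get("device") != root_disk]
    return out

# Abridged version of the failing override (root disk plus one data disk).
failing = {
    "ignition": {"version": "3.2.0"},
    "storage": {"disks": [
        {"device": ROOT_DISK, "wipeTable": True, "partitions": [
            {"number": 1, "sizeMiB": 1, "wipePartitionEntry": True},
            {"number": 2, "sizeMiB": 127, "wipePartitionEntry": True},
            {"number": 3, "sizeMiB": 384, "wipePartitionEntry": True},
            {"number": 4, "sizeMiB": 0, "wipePartitionEntry": True}]},
        {"device": "/dev/disk/by-path/pci-0000:4a:00.0-scsi-0:2:1:0",
         "wipeTable": True, "partitions": [
            {"number": 1, "label": "var-lib-containers", "startMiB": 0,
             "sizeMiB": 0, "wipePartitionEntry": True}]}]},
}

working = drop_root_disk(failing)
print([d["device"] for d in working["storage"]["disks"]])
# -> ['/dev/disk/by-path/pci-0000:4a:00.0-scsi-0:2:1:0']
```

Everything else in the override (the data-disk partitions and the filesystems) is unchanged between the failing and working versions.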
Also, skoksal@redhat.com got similar feedback from jhernand-rh on another Slack thread [1] that seems to point in the same direction:
I believe it isn't necessary to define the partitions for the first disk again. The CoreOS installer (not exactly the Assisted Installer) will automatically create that same partition scheme by default. It would only be needed if you wanted to change the size of some partition, or create additional partitions on the same disk. But it looks like in your scenario you aren't changing anything on that disk, only on the other disks.
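This point can be cross-checked against the override itself: the four partitions it declares on the first disk restate the default RHCOS layout (1 MiB BIOS boot, 127 MiB EFI system, 384 MiB /boot, root taking the remaining space), so re-declaring them changes nothing. A small sanity check (Python; the default sizes are the well-known RHCOS defaults, not taken from this report):

```python
# sizeMiB 0 means "use the remaining space" in Ignition.
DEFAULT_RHCOS_MIB = [1, 127, 384, 0]  # BIOS boot, EFI, /boot, root

# The root-disk partitions declared in the failing override:
declared = [p["sizeMiB"] for p in [
    {"number": 1, "sizeMiB": 1, "wipePartitionEntry": True},
    {"number": 2, "sizeMiB": 127, "wipePartitionEntry": True},
    {"number": 3, "sizeMiB": 384, "wipePartitionEntry": True},
    {"number": 4, "sizeMiB": 0, "wipePartitionEntry": True},
]]

print(declared == DEFAULT_RHCOS_MIB)  # the override merely restates the default
# -> True
```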
But again, the random ostree issue might still need to be addressed in case the customer needs to customize, e.g., the partitions.
That said, even if the customer is able to work around the original blocker as described, there are still some grey areas, and they are looking for guidance in order to properly configure their templates/nodes:
Is something wrong in their Ignition config override? Maybe some labeling was introduced for these partitions that they did not know about, or should they skip it completely and let the installer do its job? (In 4.18 with Dell hardware they sometimes faced the same issue.) What is not clear now is how they should define the Ignition config override for agent-based installation and also for ACM-based installation.
Thank you