-
Bug
-
Resolution: Unresolved
-
Normal
-
None
-
4.16
-
Quality / Stability / Reliability
-
False
-
-
None
-
Moderate
Description of problem:
There are multiple issues with root device hints in the agent-based installer when host configuration is applied on the rendezvous node. On top of that, the errors reported by the services are unclear and do not help to identify the underlying problem.
For example, deploying a UPI bare metal/agnostic cluster with 5 VMs on KVM, each with a 60GB disk, and the following configuration:
apiVersion: v1beta1
kind: AgentConfig
metadata:
  name: agentbased-cluster
rendezvousIP: 172.23.191.161
hosts:
  - role: master
    interfaces:
      - name: enp1s0
        macAddress: 52:54:00:1f:a6:4e
    rootDeviceHints:
      deviceName: /dev/vda
      minSizeGigabytes: 40
  - role: master
    interfaces:
      - name: enp1s0
        macAddress: 52:54:00:bd:ce:a7
    rootDeviceHints:
      deviceName: /dev/vda
      minSizeGigabytes: 40
  - role: master
    interfaces:
      - name: enp1s0
        macAddress: 52:54:00:4c:fd:35
    rootDeviceHints:
      deviceName: /dev/vda
      minSizeGigabytes: 40
  - role: worker
    interfaces:
      - name: enp1s0
        macAddress: 52:54:00:3f:16:e3
    rootDeviceHints:
      deviceName: /dev/vda
      minSizeGigabytes: 40
  - role: worker
    interfaces:
      - name: enp1s0
        macAddress: 52:54:00:04:ee:db
    rootDeviceHints:
      deviceName: /dev/vda
      minSizeGigabytes: 40
The deployment fails every time with the following errors:
- From openshift-install:
WARNING Host master0.agentbased-cluster validation: No eligible disks were found, please check specific disks to see why they are not eligible.
- On the rendezvous node, the console shows an error that is confusing and makes no sense to me:
6088" go-id=241 host_id=c5b961bb-0266-466d-b740-5babbe40507e infra_env_id=a9061380-e1da-45a5-b0f8-eb150860b0a7 pkg=Inventory request_id=41051eea-f1d8-4d5a-8a21-9f243a758be5
Apr 11 09:58:03 master0.agentbased-cluster.redhatrules.local service[2775]: time="2025-04-11T09:58:03Z" level=error msg="failed to set installation disk path </dev/not-found-by-hints> host <c5b961bb-0266-466d-b740-5babbe40507e> infra env <a9061380-e1da-45a5-b0f8-eb150860b0a7>" func="github.com/openshift/assisted-service/internal/bminventory.(*bareMetalInventory).updateHostDisksSelectionConfig" file="/src/internal/bminventory/inventory.go:6091" error="Requested installation disk is not part of the host's valid disks" go-id=241 host_id=c5b961bb-0266-466d-b740-5babbe40507e infra_env_id=a9061380-e1da-45a5-b0f8-eb150860b0a7 pkg=Inventory request_id=41051eea-f1d8-4d5a-8a21-9f243a758be5
Nothing in this error is helpful. The service logs do not help either:
Apr 11 09:59:39 master0.agentbased-cluster.redhatrules.local podman[3211]: time="2025-04-11T09:59:39Z" level=info msg="Checking configuration for host d30e714e-4dfb-4149-a8ff-7a24193e42aa"
Apr 11 09:59:39 master0.agentbased-cluster.redhatrules.local podman[3211]: time="2025-04-11T09:59:39Z" level=info msg="Searching for config for host d30e714e-4dfb-4149-a8ff-7a24193e42aa"
Apr 11 09:59:39 master0.agentbased-cluster.redhatrules.local podman[3211]: time="2025-04-11T09:59:39Z" level=info msg="Found host config in /etc/assisted/hostconfig/host-1"
Apr 11 09:59:39 master0.agentbased-cluster.redhatrules.local apply-host-config[3252]: time="2025-04-11T09:59:39Z" level=info msg="Searching for config for host d30e714e-4dfb-4149-a8ff-7a24193e42aa"
Apr 11 09:59:39 master0.agentbased-cluster.redhatrules.local podman[3211]: time="2025-04-11T09:59:39Z" level=info msg="Read root device hints file"
Apr 11 09:59:39 master0.agentbased-cluster.redhatrules.local podman[3211]: time="2025-04-11T09:59:39Z" level=info msg="No disk found matching root device hints"
Apr 11 09:59:39 master0.agentbased-cluster.redhatrules.local podman[3211]: time="2025-04-11T09:59:39Z" level=info msg="Found role master"
Apr 11 09:59:39 master0.agentbased-cluster.redhatrules.local podman[3211]: time="2025-04-11T09:59:39Z" level=info msg="Host role master already configured"
Apr 11 09:59:39 master0.agentbased-cluster.redhatrules.local podman[3211]: time="2025-04-11T09:59:39Z" level=info msg="Updating host"
Apr 11 09:59:39 master0.agentbased-cluster.redhatrules.local apply-host-config[3252]: time="2025-04-11T09:59:39Z" level=info msg="Found host config in /etc/assisted/hostconfig/host-1"
Apr 11 09:59:39 master0.agentbased-cluster.redhatrules.local apply-host-config[3252]: time="2025-04-11T09:59:39Z" level=info msg="Read root device hints file"
Apr 11 09:59:39 master0.agentbased-cluster.redhatrules.local apply-host-config[3252]: time="2025-04-11T09:59:39Z" level=info msg="No disk found matching root device hints"
Apr 11 09:59:39 master0.agentbased-cluster.redhatrules.local apply-host-config[3252]: time="2025-04-11T09:59:39Z" level=info msg="Found role master"
Apr 11 09:59:39 master0.agentbased-cluster.redhatrules.local apply-host-config[3252]: time="2025-04-11T09:59:39Z" level=info msg="Host role master already configured"
Apr 11 09:59:39 master0.agentbased-cluster.redhatrules.local apply-host-config[3252]: time="2025-04-11T09:59:39Z" level=info msg="Updating host"
Apr 11 09:59:39 master0.agentbased-cluster.redhatrules.local apply-host-config[3252]: time="2025-04-11T09:59:39Z" level=error msg="Host master1.agentbased-cluster.redhatrules.local update refused: AssistedServiceError Code: 409 Href: ID: 409 Kind: Error Reason: Requested installation disk is not part of the host's valid disks"
Apr 11 09:59:39 master0.agentbased-cluster.redhatrules.local apply-host-config[3252]: time="2025-04-11T09:59:39Z" level=info msg="All expected hosts found"
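To pick the relevant entries out of journal output like the above, the key=value fields can be parsed and filtered by level. This is only a minimal sketch: parse_fields is a hypothetical helper written for this report, not part of any OpenShift tooling, and the two sample lines are copied from the log above.

```python
import re

# Matches key=value pairs where the value is either quoted or a bare token.
FIELD = re.compile(r'([\w-]+)=("(?:[^"\\]|\\.)*"|\S+)')

def parse_fields(line):
    """Hypothetical helper: extract key=value fields from one journal line."""
    fields = {}
    for key, value in FIELD.findall(line):
        if len(value) >= 2 and value.startswith('"') and value.endswith('"'):
            value = value[1:-1]  # strip the surrounding quotes
        fields[key] = value
    return fields

sample = r'''
Apr 11 09:59:39 master0.agentbased-cluster.redhatrules.local apply-host-config[3252]: time="2025-04-11T09:59:39Z" level=info msg="No disk found matching root device hints"
Apr 11 09:59:39 master0.agentbased-cluster.redhatrules.local apply-host-config[3252]: time="2025-04-11T09:59:39Z" level=error msg="Host master1.agentbased-cluster.redhatrules.local update refused: AssistedServiceError Code: 409 Href: ID: 409 Kind: Error Reason: Requested installation disk is not part of the host's valid disks"
'''.strip().splitlines()

parsed = [parse_fields(line) for line in sample]
errors = [f["msg"] for f in parsed if f.get("level") == "error"]
for msg in errors:
    print(msg)
```

Note that the message that actually explains the failure ("No disk found matching root device hints") is logged at level=info, so filtering on errors alone hides it.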
Looking at the files, everything seems to match what was configured in the agent-config:
[root@master0 ~]# cat /etc/assisted/hostconfig/host-1/root-device-hints.yaml
deviceName: /dev/vda
minSizeGigabytes: 40
This matches the VM's disks:
[root@master0 ~]# lsblk
NAME  MAJ:MIN RM  SIZE RO TYPE MOUNTPOINTS
loop0   7:0    0  9.5G  0 loop /var/lib/containers/storage/overlay
                               /var
                               /etc
                               /run/ephemeral
loop1   7:1    0    1G  0 loop /usr
                               /boot
                               /
                               /sysroot
vda   252:0    0   60G  0 disk
[root@master0 ~]# ls -l /dev/disk/by-path/
total 0
lrwxrwxrwx. 1 root root 9 Apr 11 09:46 pci-0000:04:00.0 -> ../../vda
lrwxrwxrwx. 1 root root 9 Apr 11 09:46 virtio-pci-0000:04:00.0 -> ../../vda
Setting a /dev/disk/by-path path instead produces exactly the same issue, so the problem does not seem to be the use of the device name.
Looking at the code, the minSizeGigabytes key seems to override the default and allow discovery of disks of equal or larger size, which in this config means 40GB or more. A 60GB disk should therefore be fine.
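The expectation above can be checked with quick arithmetic. The sketch below is not the assisted-service matching code: matches_hints is a hypothetical helper that models only the two hints used in this report, and it assumes minSizeGigabytes is compared in GiB (2^30 bytes). The disk values are taken from the lsblk and by-path output above.

```python
GIB = 1024 ** 3

def matches_hints(disk, hints):
    """Hypothetical model of the deviceName and minSizeGigabytes hints."""
    name = hints.get("deviceName")
    if name is not None and name not in (disk["path"], disk.get("by_path")):
        return False
    min_gib = hints.get("minSizeGigabytes")
    if min_gib is not None and disk["size_bytes"] < min_gib * GIB:
        return False
    return True

# The 60G vda disk reported by lsblk.
disk = {
    "path": "/dev/vda",
    "by_path": "/dev/disk/by-path/pci-0000:04:00.0",
    "size_bytes": 60 * GIB,
}
hints = {"deviceName": "/dev/vda", "minSizeGigabytes": 40}

result = matches_hints(disk, hints)
print(disk["size_bytes"], result)
```

Under these assumptions the 60GB disk satisfies both hints, so a failure to match would have to come from some other check.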
Even more confusing, when we check the validations, the disk is present and marked as "eligible":true:
Apr 11 10:03:57 master0.agentbased-cluster.redhatrules.local objective_shtern[10255]: {"bmc_address":"0.0.0.0","bmc_v6address":"::/0","boot":{"command_line":"coreos.live.rootfs_url=http://192.168.13.184:9480/data/agent.x86_64-rootfs.img rw ignition.firstboot ignition.platform.id=metal\n","current_boot_mode":"bios"},"cpu":{"architecture":"x86_64","count":8,"flags":["fpu","vme","de","pse","tsc","msr","pae","mce","cx8","apic","sep","mtrr","pge","mca","cmov","pat","pse36","clflush","mmx","fxsr","sse","sse2","ht","syscall","nx","pdpe1gb","rdtscp","lm","constant_tsc","rep_good","nopl","xtopology","cpuid","tsc_known_freq","pni","vmx","ssse3","cx16","pcid","sse4_1","sse4_2","x2apic","popcnt","tsc_deadline_timer","hypervisor","lahf_lm","cpuid_fault","pti","ssbd","ibrs","ibpb","stibp","tpr_shadow","flexpriority","ept","vpid","tsc_adjust","arat","vnmi","umip","flush_l1d","arch_capabilities"],"model_name":"Westmere E56xx/L56xx/X56xx (Nehalem-C)"},"disks":[{"by_path":"/dev/disk/by-path/pci-0000:04:00.0","drive_type":"HDD","id":"/dev/disk/by-path/pci-0000:04:00.0","installation_eligibility":{"eligible":true,"not_eligible_reasons":null},"name":"vda","path":"/dev/vda","size_bytes":64424509440,"vendor":"0x1af4"}],"gpus":[{"address":"0000:00:01.0"}],"hostname":"master0.agentbased-cluster.redhatrules.local","interfaces":[{"flags":["up","loopback","running"],"has_carrier":true,"ipv4_addresses":["127.0.0.1/8"],"ipv6_addresses":["::1/128"],"mtu":65536,"name":"lo","type":"device"},{"flags":["up","broadcast","multicast","running"],"has_carrier":true,"ipv4_addresses":["172.23.191.161/24"],"ipv6_addresses":["2001:db8:ca2:2:1::ab/64"],"mac_address":"52:54:00:1f:a6:4e","mtu":9000,"name":"enp1s0","product":"0x0001","speed_mbps":-1,"type":"physical","vendor":"0x1af4"},{"flags":["up","broadcast","multicast","running"],"has_carrier":true,"ipv4_addresses":["10.88.0.1/16"],"ipv6_addresses":[],"mac_address":"62:0c:99:4f:6c:51","mtu":1500,"name":"cni-podman0","speed_mbps":10000,"type":"bridge"},{"flag
s":["up","broadcast","multicast","running"],"has_carrier":true,"ipv4_addresses":[],"ipv6_addresses":[],"mac_address":"92:a5:58:54:71:3e","mtu":1500,"name":"vethc6e91971","speed_mbps":10000,"type":"veth"}],"memory":{"physical_bytes":20971520000,"physical_bytes_method":"dmidecode","usable_bytes":20474601472},"routes":[{"destination":"0.0.0.0","family":2,"gateway":"172.23.191.1","interface":"enp1s0","metric":100},{"destination":"10.88.0.0","family":2,"interface":"cni-podman0"},{"destination":"172.23.191.0","family":2,"interface":"enp1s0","metric":100},{"destination":"::1","family":10,"interface":"lo","metric":256},{"destination":"2001:db8:ca2:2:1::ab","family":10,"interface":"enp1s0","metric":100},{"destination":"2001:db8:ca2:2::","family":10,"interface":"enp1s0","metric":100},{"destination":"fe80::","family":10,"interface":"cni-podman0","metric":256},{"destination":"fe80::","family":10,"interface":"vethc6e91971","metric":256},{"destination":"fe80::","family":10,"interface":"enp1s0","metric":1024},{"destination":"::","family":10,"gateway":"fe80::5054:ff:fe93:6562","interface":"enp1s0","metric":100}],"system_vendor":{"manufacturer":"QEMU","product_name":"Standard PC (Q35 + ICH9, 2009)","virtual":true},"tpm_version":"none"}
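Since the inventory is a single very long JSON line, the disk section is easier to read once extracted. The following is a minimal sketch: summarize_disks is a hypothetical helper, and inventory_json is a trimmed copy of the inventory above that keeps only the disks and hostname keys.

```python
import json

# Trimmed copy of the inventory above (only "disks" and "hostname" kept).
inventory_json = '''{"disks":[{"by_path":"/dev/disk/by-path/pci-0000:04:00.0","drive_type":"HDD","id":"/dev/disk/by-path/pci-0000:04:00.0","installation_eligibility":{"eligible":true,"not_eligible_reasons":null},"name":"vda","path":"/dev/vda","size_bytes":64424509440,"vendor":"0x1af4"}],"hostname":"master0.agentbased-cluster.redhatrules.local"}'''

def summarize_disks(raw):
    """Hypothetical helper: list (path, size in GiB, eligible, reasons) per disk."""
    inventory = json.loads(raw)
    rows = []
    for disk in inventory.get("disks", []):
        eligibility = disk.get("installation_eligibility") or {}
        rows.append((
            disk["path"],
            disk["size_bytes"] / 1024 ** 3,
            eligibility.get("eligible"),
            eligibility.get("not_eligible_reasons") or [],
        ))
    return rows

rows = summarize_disks(inventory_json)
for path, size_gib, eligible, reasons in rows:
    print(f"{path}: {size_gib:.0f} GiB eligible={eligible} reasons={reasons}")
```

For this host the only disk is /dev/vda at 60 GiB, reported as eligible with no rejection reasons, which is what makes the "not part of the host's valid disks" error hard to reconcile.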
After many tries, I figured out that the issue was the disk size: when I create the VMs with 120GB disks, no issue occurs and the cluster installation proceeds.
So either minSizeGigabytes is meant for something else and my understanding of the code is incorrect, or it is not being checked. If 120GB is mandatory, I will have to open a docs bug to change the wording, since we only say it is recommended, and it would mean that I cannot override this for a small test cluster.
Version-Release number of selected component (if applicable):
OCP 4.16 (may affect other versions)
How reproducible:
Every time
Steps to Reproduce:
1. Configure a similar agent-config and hosts
2. Start the deployment using either the ISO or the PXE files
Actual results:
Expected results:
Additional info: