  1. OpenShift Bugs
  2. OCPBUGS-54898

[AGENT INSTALLER] Issues with root device hints and bad error reporting

    • Quality / Stability / Reliability
    • Moderate

      Description of problem:

          The agent-based installer has multiple issues with root device hints when applying host configuration on the rendezvous node. In addition, the errors reported by the services are confusing and do not help identify the underlying problem.
      
      For example, deploying a UPI bare metal/agnostic cluster with 5 KVM VMs, each with a 60GB disk, and the following configuration:
      
      apiVersion: v1beta1
      kind: AgentConfig
      metadata:
        name: agentbased-cluster
      rendezvousIP: 172.23.191.161
      hosts:
        - role: master
          interfaces:
            - name: enp1s0
              macAddress: 52:54:00:1f:a6:4e
          rootDeviceHints:
            deviceName: /dev/vda
            minSizeGigabytes: 40
        - role: master
          interfaces:
            - name: enp1s0
              macAddress: 52:54:00:bd:ce:a7
          rootDeviceHints:
            deviceName: /dev/vda
            minSizeGigabytes: 40
        - role: master
          interfaces:
            - name: enp1s0
              macAddress: 52:54:00:4c:fd:35
          rootDeviceHints:
            deviceName: /dev/vda
            minSizeGigabytes: 40
        - role: worker
          interfaces:
            - name: enp1s0
              macAddress: 52:54:00:3f:16:e3
          rootDeviceHints:
            deviceName: /dev/vda
            minSizeGigabytes: 40
        - role: worker
          interfaces:
            - name: enp1s0
              macAddress: 52:54:00:04:ee:db
          rootDeviceHints:
            deviceName: /dev/vda
            minSizeGigabytes: 40
      
      The deployment fails every time with the errors below:
      
       - From openshift-install:
      
      WARNING Host master0.agentbased-cluster validation: No eligible disks were found, please check specific disks to see why they are not eligible.
      
       - On the rendezvous node, the console shows an error that is confusing and makes no sense to me:
      
      6088" go-id=241 host_id=c5b961bb-0266-466d-b740-5babbe40507e infra_env_id=a9061380-e1da-45a5-b0f8-eb150860b0a7 pkg=Inventory request_id=41051eea-f1d8-4d5a-8a21-9f243a758be5
      Apr 11 09:58:03 master0.agentbased-cluster.redhatrules.local service[2775]: time="2025-04-11T09:58:03Z" level=error msg="failed to set installation disk path </dev/not-found-by-hints> host <c5b961bb-0266-466d-b740-5babbe40507e> infra env <a9061380-e1da-45a5-b0f8-eb150860b0a7>" func="github.com/openshift/assisted-service/internal/bminventory.(*bareMetalInventory).updateHostDisksSelectionConfig" file="/src/internal/bminventory/inventory.go:6091" error="Requested installation disk is not part of the host's valid disks" go-id=241 host_id=c5b961bb-0266-466d-b740-5babbe40507e infra_env_id=a9061380-e1da-45a5-b0f8-eb150860b0a7 pkg=Inventory request_id=41051eea-f1d8-4d5a-8a21-9f243a758be5
      
      Nothing in this error is helpful. The service logs do not help either:
      
      Apr 11 09:59:39 master0.agentbased-cluster.redhatrules.local podman[3211]: time="2025-04-11T09:59:39Z" level=info msg="Checking configuration for host d30e714e-4dfb-4149-a8ff-7a24193e42aa"
      Apr 11 09:59:39 master0.agentbased-cluster.redhatrules.local podman[3211]: time="2025-04-11T09:59:39Z" level=info msg="Searching for config for host d30e714e-4dfb-4149-a8ff-7a24193e42aa"
      Apr 11 09:59:39 master0.agentbased-cluster.redhatrules.local podman[3211]: time="2025-04-11T09:59:39Z" level=info msg="Found host config in /etc/assisted/hostconfig/host-1"
      Apr 11 09:59:39 master0.agentbased-cluster.redhatrules.local apply-host-config[3252]: time="2025-04-11T09:59:39Z" level=info msg="Searching for config for host d30e714e-4dfb-4149-a8ff-7a24193e42aa"
      Apr 11 09:59:39 master0.agentbased-cluster.redhatrules.local podman[3211]: time="2025-04-11T09:59:39Z" level=info msg="Read root device hints file"
      Apr 11 09:59:39 master0.agentbased-cluster.redhatrules.local podman[3211]: time="2025-04-11T09:59:39Z" level=info msg="No disk found matching root device hints"
      Apr 11 09:59:39 master0.agentbased-cluster.redhatrules.local podman[3211]: time="2025-04-11T09:59:39Z" level=info msg="Found role master"
      Apr 11 09:59:39 master0.agentbased-cluster.redhatrules.local podman[3211]: time="2025-04-11T09:59:39Z" level=info msg="Host role master already configured"
      Apr 11 09:59:39 master0.agentbased-cluster.redhatrules.local podman[3211]: time="2025-04-11T09:59:39Z" level=info msg="Updating host"
      Apr 11 09:59:39 master0.agentbased-cluster.redhatrules.local apply-host-config[3252]: time="2025-04-11T09:59:39Z" level=info msg="Found host config in /etc/assisted/hostconfig/host-1"
      Apr 11 09:59:39 master0.agentbased-cluster.redhatrules.local apply-host-config[3252]: time="2025-04-11T09:59:39Z" level=info msg="Read root device hints file"
      Apr 11 09:59:39 master0.agentbased-cluster.redhatrules.local apply-host-config[3252]: time="2025-04-11T09:59:39Z" level=info msg="No disk found matching root device hints"
      Apr 11 09:59:39 master0.agentbased-cluster.redhatrules.local apply-host-config[3252]: time="2025-04-11T09:59:39Z" level=info msg="Found role master"
      Apr 11 09:59:39 master0.agentbased-cluster.redhatrules.local apply-host-config[3252]: time="2025-04-11T09:59:39Z" level=info msg="Host role master already configured"
      Apr 11 09:59:39 master0.agentbased-cluster.redhatrules.local apply-host-config[3252]: time="2025-04-11T09:59:39Z" level=info msg="Updating host"
      Apr 11 09:59:39 master0.agentbased-cluster.redhatrules.local apply-host-config[3252]: time="2025-04-11T09:59:39Z" level=error msg="Host master1.agentbased-cluster.redhatrules.local update refused: AssistedServiceError Code: 409 Href:  ID: 409 Kind: Error Reason: Requested installation disk is not part of the host's valid disks"
      Apr 11 09:59:39 master0.agentbased-cluster.redhatrules.local apply-host-config[3252]: time="2025-04-11T09:59:39Z" level=info msg="All expected hosts found"
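
      Most of the journal output above is info-level noise. As a minimal sketch for triage, the following pulls out only the error-level messages. It assumes the lines use the logfmt-style level=... msg="..." layout seen above; the names LOGFMT_RE, error_messages, and SAMPLE are illustrative and not part of any installer tooling.

```python
import re

# Assumption: each line carries level=<word> followed by msg="<text>",
# as in the journal excerpts in this report.
LOGFMT_RE = re.compile(r'level=(?P<level>\w+) msg="(?P<msg>(?:[^"\\]|\\.)*)"')


def error_messages(lines):
    """Return the msg field of every line logged at level=error."""
    found = []
    for line in lines:
        match = LOGFMT_RE.search(line)
        if match and match.group("level") == "error":
            found.append(match.group("msg"))
    return found


# Two lines copied from the apply-host-config output in this report.
SAMPLE = """\
Apr 11 09:59:39 master0.agentbased-cluster.redhatrules.local apply-host-config[3252]: time="2025-04-11T09:59:39Z" level=info msg="No disk found matching root device hints"
Apr 11 09:59:39 master0.agentbased-cluster.redhatrules.local apply-host-config[3252]: time="2025-04-11T09:59:39Z" level=error msg="Host master1.agentbased-cluster.redhatrules.local update refused: AssistedServiceError Code: 409 Href:  ID: 409 Kind: Error Reason: Requested installation disk is not part of the host's valid disks"
"""

if __name__ == "__main__":
    for message in error_messages(SAMPLE.splitlines()):
        print(message)
```

      On a live node the same function could be fed the saved journal output instead of SAMPLE.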
      
      Looking at the files, everything seems to match what was configured in the agent-config:
      [root@master0 ~]# cat /etc/assisted/hostconfig/host-1/root-device-hints.yaml 
      deviceName: /dev/vda
      minSizeGigabytes: 40
      
      This matches the VM's disks:
      
      [root@master0 ~]# lsblk
      NAME  MAJ:MIN RM  SIZE RO TYPE MOUNTPOINTS
      loop0   7:0    0  9.5G  0 loop /var/lib/containers/storage/overlay
                                     /var
                                     /etc
                                     /run/ephemeral
      loop1   7:1    0    1G  0 loop /usr
                                     /boot
                                     /
                                     /sysroot
      vda   252:0    0   60G  0 disk 
      
      [root@master0 ~]# ls -l /dev/disk/by-path/
      total 0
      lrwxrwxrwx. 1 root root 9 Apr 11 09:46 pci-0000:04:00.0 -> ../../vda
      lrwxrwxrwx. 1 root root 9 Apr 11 09:46 virtio-pci-0000:04:00.0 -> ../../vda
      
      
      Using the /dev/disk/by-path path in the hint produces exactly the same errors, so the problem does not seem to be the use of the device name.
      
      Looking at the code, the minSizeGigabytes key seems to override the default and allow discovery of disks of equal or greater size, which in this configuration is 40GB. A 60GB disk should therefore be accepted.
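
      To make that expectation concrete, here is a minimal sketch of the matching behavior as I understand it. This is not the installer's actual implementation: the Disk and RootDeviceHints types and the matches_hints function are hypothetical, and comparing the hint in GiB (2**30 bytes) is an assumption.

```python
from dataclasses import dataclass
from typing import Optional

GIB = 1024 ** 3  # assumption: the hint is compared in GiB (2**30 bytes)


@dataclass
class Disk:
    path: str
    size_bytes: int


@dataclass
class RootDeviceHints:
    # Hypothetical mirror of the two hint keys used in this report.
    device_name: Optional[str] = None
    min_size_gigabytes: Optional[int] = None


def matches_hints(disk: Disk, hints: RootDeviceHints) -> bool:
    """Return True when the disk satisfies every hint that is set."""
    if hints.device_name is not None and disk.path != hints.device_name:
        return False
    if hints.min_size_gigabytes is not None:
        if disk.size_bytes < hints.min_size_gigabytes * GIB:
            return False
    return True


if __name__ == "__main__":
    vda = Disk(path="/dev/vda", size_bytes=60 * GIB)
    hints = RootDeviceHints(device_name="/dev/vda", min_size_gigabytes=40)
    print(matches_hints(vda, hints))  # True under the assumptions above
```

      Under this reading, the 60GB /dev/vda disk satisfies both hints, which is why the "No disk found matching root device hints" message is unexpected.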
      
      What makes these errors even more confusing is that, when we check the validations, the disk is present and marked as "eligible":true:
      
      Apr 11 10:03:57 master0.agentbased-cluster.redhatrules.local objective_shtern[10255]: {"bmc_address":"0.0.0.0","bmc_v6address":"::/0","boot":{"command_line":"coreos.live.rootfs_url=http://192.168.13.184:9480/data/agent.x86_64-rootfs.img rw ignition.firstboot ignition.platform.id=metal\n","current_boot_mode":"bios"},"cpu":{"architecture":"x86_64","count":8,"flags":["fpu","vme","de","pse","tsc","msr","pae","mce","cx8","apic","sep","mtrr","pge","mca","cmov","pat","pse36","clflush","mmx","fxsr","sse","sse2","ht","syscall","nx","pdpe1gb","rdtscp","lm","constant_tsc","rep_good","nopl","xtopology","cpuid","tsc_known_freq","pni","vmx","ssse3","cx16","pcid","sse4_1","sse4_2","x2apic","popcnt","tsc_deadline_timer","hypervisor","lahf_lm","cpuid_fault","pti","ssbd","ibrs","ibpb","stibp","tpr_shadow","flexpriority","ept","vpid","tsc_adjust","arat","vnmi","umip","flush_l1d","arch_capabilities"],"model_name":"Westmere E56xx/L56xx/X56xx (Nehalem-C)"},"disks":[{"by_path":"/dev/disk/by-path/pci-0000:04:00.0","drive_type":"HDD","id":"/dev/disk/by-path/pci-0000:04:00.0","installation_eligibility":{"eligible":true,"not_eligible_reasons":null},"name":"vda","path":"/dev/vda","size_bytes":64424509440,"vendor":"0x1af4"}],"gpus":[{"address":"0000:00:01.0"}],"hostname":"master0.agentbased-cluster.redhatrules.local","interfaces":[{"flags":["up","loopback","running"],"has_carrier":true,"ipv4_addresses":["127.0.0.1/8"],"ipv6_addresses":["::1/128"],"mtu":65536,"name":"lo","type":"device"},{"flags":["up","broadcast","multicast","running"],"has_carrier":true,"ipv4_addresses":["172.23.191.161/24"],"ipv6_addresses":["2001:db8:ca2:2:1::ab/64"],"mac_address":"52:54:00:1f:a6:4e","mtu":9000,"name":"enp1s0","product":"0x0001","speed_mbps":-1,"type":"physical","vendor":"0x1af4"},{"flags":["up","broadcast","multicast","running"],"has_carrier":true,"ipv4_addresses":["10.88.0.1/16"],"ipv6_addresses":[],"mac_address":"62:0c:99:4f:6c:51","mtu":1500,"name":"cni-podman0","speed_mbps":10000,"type":"bridge"},
{"flags":["up","broadcast","multicast","running"],"has_carrier":true,"ipv4_addresses":[],"ipv6_addresses":[],"mac_address":"92:a5:58:54:71:3e","mtu":1500,"name":"vethc6e91971","speed_mbps":10000,"type":"veth"}],"memory":{"physical_bytes":20971520000,"physical_bytes_method":"dmidecode","usable_bytes":20474601472},"routes":[{"destination":"0.0.0.0","family":2,"gateway":"172.23.191.1","interface":"enp1s0","metric":100},{"destination":"10.88.0.0","family":2,"interface":"cni-podman0"},{"destination":"172.23.191.0","family":2,"interface":"enp1s0","metric":100},{"destination":"::1","family":10,"interface":"lo","metric":256},{"destination":"2001:db8:ca2:2:1::ab","family":10,"interface":"enp1s0","metric":100},{"destination":"2001:db8:ca2:2::","family":10,"interface":"enp1s0","metric":100},{"destination":"fe80::","family":10,"interface":"cni-podman0","metric":256},{"destination":"fe80::","family":10,"interface":"vethc6e91971","metric":256},{"destination":"fe80::","family":10,"interface":"enp1s0","metric":1024},{"destination":"::","family":10,"gateway":"fe80::5054:ff:fe93:6562","interface":"enp1s0","metric":100}],"system_vendor":{"manufacturer":"QEMU","product_name":"Standard PC (Q35 + ICH9, 2009)","virtual":true},"tpm_version":"none"}
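
      The inventory dump is hard to read by eye. As a rough sketch, the disk entries can be summarized as follows. The INVENTORY_JSON string holds only the "disks" array copied from the output above (the other keys are omitted), and summarize_disks is an illustrative helper, not an existing tool.

```python
import json

# Only the "disks" array from the inventory above; other keys omitted.
INVENTORY_JSON = """
{"disks": [{"by_path": "/dev/disk/by-path/pci-0000:04:00.0",
            "drive_type": "HDD",
            "id": "/dev/disk/by-path/pci-0000:04:00.0",
            "installation_eligibility": {"eligible": true,
                                         "not_eligible_reasons": null},
            "name": "vda",
            "path": "/dev/vda",
            "size_bytes": 64424509440,
            "vendor": "0x1af4"}]}
"""


def summarize_disks(raw: str):
    """Return path, size in GiB, and eligibility for each reported disk."""
    inventory = json.loads(raw)
    rows = []
    for disk in inventory.get("disks", []):
        eligibility = disk.get("installation_eligibility") or {}
        rows.append({
            "path": disk["path"],
            "size_gib": disk["size_bytes"] / 1024 ** 3,
            "eligible": eligibility.get("eligible", False),
            "reasons": eligibility.get("not_eligible_reasons") or [],
        })
    return rows


if __name__ == "__main__":
    for row in summarize_disks(INVENTORY_JSON):
        print(row)
```

      The reported size_bytes of 64424509440 is exactly 60 * 1024**3, so the inventory sees a 60 GiB disk that is eligible with no rejection reasons.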
      
      After many tries I figured out that the issue was the disk size: when I create the VMs with 120GB disks, no error occurs and the cluster installation proceeds.
      
      So either minSizeGigabytes is meant for something else and my understanding of the code is incorrect, or it is not being checked. If 120GB is mandatory, then I will have to open a bug against the docs to change the wording, since we only say it is recommended. It would also mean that I cannot override this for a small test cluster.

      Version-Release number of selected component (if applicable):

          OCP 4.16 (may affect other versions)

      How reproducible:

          Every time

      Steps to Reproduce:

          1. Configure a similar agent-config and hosts
          2. Start the deployment using either the ISO or the PXE files

      Actual results:

          

      Expected results:

          

      Additional info:

          

              bfournie@redhat.com Robert Fournier
              rhn-support-andcosta Andre Costa
              Manoj Hans Manoj Hans