Uploaded image for project: 'OpenShift Bugs'
  1. OpenShift Bugs
  2. OCPBUGS-63200

OCP install using ZTP with Gitops fail - Worker Nodes BMH Inspection Timeout Failures

XMLWordPrintable

    • Quality / Stability / Reliability
    • False
    • Hide

      None

      Show
      None
    • None
    • Important
    • None
    • x86_64
    • None
    • None
    • None
    • None
    • None
    • None
    • None
    • None
    • None
    • None
    • None

      Description of problem:

      BareMetalHost inspection consistently fails with timeout errors on Dell PowerEdge R750 worker nodes during OpenShift cluster deployment using ClusterInstance/ZTP. All three worker nodes fail hardware inspection after successfully booting from virtual media, while master nodes with identical hardware complete inspection successfully. The nodes transition from "inspect wait" to "inspect failed" state after exceeding the default inspection timeout period.

      Version-Release number of selected component (if applicable):

      OpenShift Version: 4.16.0-0.nightly-2025-10-13-172705
      Deployment Method: Zero Touch Provisioning (ZTP) with GitOps

      How reproducible:

      Always (100% reproducible)
      - Occurs on every deployment attempt
      - Affects all 3 worker nodes consistently  
      - Master nodes never affected (0% failure rate)
      - Reproducible across multiple deployment attempts

      Steps to Reproduce:

      #### Environment Setup
      Hardware: Dell PowerEdge R750 servers (6 nodes total: 3 masters + 3 workers)
      Firmware: iDRAC 7.20.60.50, BIOS 1.18.1
      
      1. Create ClusterInstance resource
      2. Apply ClusterInstance to trigger ZTP deployment
      3. Monitor BareMetalHost resources in the cluster namespace
      4. Observe inspection process through Metal3 logs:
            

      Actual results:

      #### Reproduction Timeline 
      1. T+0min: ClusterInstance applied, BareMetalHost resources created 
      2. T+5min: Virtual media boot successful, nodes power on 
      3. T+10min: Inspection agent starts, hardware enumeration begins 
      4. T+30min: Master nodes complete inspection successfully (state: "manageable") 
      5. T+30-45min: Worker nodes still in "inspect wait" state 
      6. T+45-60min: Worker nodes timeout and transition to "inspect failed" 

      Expected results:

          Successful OCP install and Allow cluster deployment to proceed to BMH provisioning

      Additional info:

      ### Additional Info
      ### Hardware Specifications
      Server Model: Dell PowerEdge R750
      CPU: Intel Xeon Gold 6330N @ 2.20GHz (112 cores, 2 sockets)
      Memory: 512GB DDR4
      Storage: Dell PERC H745 with NVMe drives
      Network: Multiple 25GbE interfaces
      BMC: iDRAC 7.20.60.50
      BIOS: 1.18.1

              eterrell@redhat.com Eduardo Otubo
              rh-ee-rchiluve Rajesh Chiluveru
              Rajesh Chiluveru
              None
              Jad Haj Yahya Jad Haj Yahya
              None
              Votes:
              0 Vote for this issue
              Watchers:
              5 Start watching this issue

                Created:
                Updated: