  1. Red Hat Advanced Cluster Management
  2. ACM-25778

SiteConfig Operator can overwrite BMH.spec.online field to false resulting in failed deployment of SNO


    • Important
    • None

      Description of problem:

      While building a proof-of-concept test in which ACM deploys many SNOs via IBI, with AAP and EDA firing off a "day2" playbook, I observed roughly 5 out of 324 SNOs fail to deploy, and found the affected SNO's machine powered down. The log files suggest that the SiteConfig Operator applied the BMH manifest twice, and its default template sets spec.online to false. In the logs, the SiteConfig Operator applies the BMH, IBIO then sets spec.online to true to power the machine on, and the SiteConfig Operator later applies the BMH a second time, which prevents the SNO hardware from powering on fully and therefore from installing.
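      The suspected sequence can be sketched with a toy overlay model. This is only an illustration of the ordering problem, not the real Server-Side Apply field-ownership logic; the `overlay` helper is hypothetical, and the manifest contains only the fields named in this report.

```python
import copy


def overlay(live, manifest):
    """Return a copy of `live` with every field present in `manifest` overwritten."""
    result = copy.deepcopy(live)
    for key, value in manifest.items():
        if isinstance(value, dict) and isinstance(result.get(key), dict):
            result[key] = overlay(result[key], value)
        else:
            result[key] = copy.deepcopy(value)
    return result


# Rendered BMH as described in the report: the default template sets spec.online to false.
rendered_bmh = {
    "kind": "BareMetalHost",
    "metadata": {"name": "vm00058", "namespace": "vm00058"},
    "spec": {"online": False},
}

live = overlay({}, rendered_bmh)        # SiteConfig Operator applies BMH, 1st time
live["spec"]["online"] = True           # IBIO sets spec.online to true
state_after_ibio = copy.deepcopy(live)
live = overlay(live, rendered_bmh)      # SiteConfig Operator applies BMH, 2nd time

print("after IBIO:", state_after_ibio["spec"]["online"])
print("after 2nd apply:", live["spec"]["online"])
```

      In this simplified model the second apply puts spec.online back to false, which matches the powered-down machine observed in the failed deployments.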

      In the attached logs, vm00058 is the ClusterInstance that showed this behavior:

      1st Dry-run
      2025-10-31T03:56:44.296Z	INFO	ClusterInstanceController.validateRenderedManifests	controller/clusterinstance_controller.go:646	Executing a dry-run validation on the rendered manifests	{"name": "vm00058", "namespace": "vm00058", "version": "172880"}
      2025-10-31T03:56:44.296Z	DEBUG	ClusterInstanceController.validateRenderedManifests.applyObject	controller/clusterinstance_controller.go:579	Applying object using Server-Side Apply	{"name": "vm00058", "namespace": "vm00058", "version": "172880", "name": "vm00058", "namespace": "vm00058", "kind": "BareMetalHost"}
      2025-10-31T03:56:44.301Z	DEBUG	ClusterInstanceController.validateRenderedManifests.applyObject	controller/clusterinstance_controller.go:587	Object applied using Server-Side Apply	{"name": "vm00058", "namespace": "vm00058", "version": "172880", "name": "vm00058", "namespace": "vm00058", "kind": "BareMetalHost"}
      
      SiteConfig Operator applies BMH 1st time
      2025-10-31T03:56:44.398Z	INFO	ClusterInstanceController.applyRenderedManifests	controller/clusterinstance_controller.go:692	Applying the rendered manifests	{"name": "vm00058", "namespace": "vm00058", "version": "172880"}
      2025-10-31T03:56:44.398Z	DEBUG	ClusterInstanceController.applyRenderedManifests.applyObject	controller/clusterinstance_controller.go:579	Applying object using Server-Side Apply	{"name": "vm00058", "namespace": "vm00058", "version": "172880", "name": "vm00058", "namespace": "vm00058", "kind": "BareMetalHost"}
      2025-10-31T03:56:44.403Z	DEBUG	ClusterInstanceController.applyRenderedManifests.applyObject	controller/clusterinstance_controller.go:587	Object applied using Server-Side Apply	{"name": "vm00058", "namespace": "vm00058", "version": "172880", "name": "vm00058", "namespace": "vm00058", "kind": "BareMetalHost"}
      
      IBIO sets spec.online (Only time)
      time="2025-10-31T03:57:02Z" level=info msg="Setting BareMetalHost (vm00058/vm00058) spec.Online to true" func="github.com/openshift/image-based-install-operator/controllers.(*ImageClusterInstallReconciler).updateBMHProvisioningState" file="/opt/app-root/src/controllers/imageclusterinstall_controller.go:715" name=vm00058 namespace=vm00058
      
      2nd Dry-run
      2025-10-31T03:58:33.720Z	INFO	ClusterInstanceController.validateRenderedManifests	controller/clusterinstance_controller.go:646	Executing a dry-run validation on the rendered manifests	{"name": "vm00058", "namespace": "vm00058", "version": "182349"}
      2025-10-31T03:58:33.720Z	DEBUG	ClusterInstanceController.validateRenderedManifests.applyObject	controller/clusterinstance_controller.go:579	Applying object using Server-Side Apply	{"name": "vm00058", "namespace": "vm00058", "version": "182349", "name": "vm00058", "namespace": "vm00058", "kind": "BareMetalHost"}
      2025-10-31T03:58:33.728Z	DEBUG	ClusterInstanceController.validateRenderedManifests.applyObject	controller/clusterinstance_controller.go:587	Object applied using Server-Side Apply	{"name": "vm00058", "namespace": "vm00058", "version": "182349", "name": "vm00058", "namespace": "vm00058", "kind": "BareMetalHost"}
      
      SiteConfig Operator applies BMH 2nd time
      2025-10-31T03:58:33.869Z	INFO	ClusterInstanceController.applyRenderedManifests	controller/clusterinstance_controller.go:692	Applying the rendered manifests	{"name": "vm00058", "namespace": "vm00058", "version": "182349"}
      2025-10-31T03:58:33.869Z	DEBUG	ClusterInstanceController.applyRenderedManifests.applyObject	controller/clusterinstance_controller.go:579	Applying object using Server-Side Apply	{"name": "vm00058", "namespace": "vm00058", "version": "182349", "name": "vm00058", "namespace": "vm00058", "kind": "BareMetalHost"}
      2025-10-31T03:58:33.876Z	DEBUG	ClusterInstanceController.applyRenderedManifests.applyObject	controller/clusterinstance_controller.go:587	Object applied using Server-Side Apply	{"name": "vm00058", "namespace": "vm00058", "version": "182349", "name": "vm00058", "namespace": "vm00058", "kind": "BareMetalHost"}
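      To look for the same ordering across many clusters, the two log formats above could be scanned with a small script. This is a rough sketch under assumptions: the regular expressions are written against the exact lines quoted in this report, the function name is hypothetical, and a match only shows that an apply happened after IBIO powered the host on, not that spec.online was actually reset.

```python
import re
from datetime import datetime

# SiteConfig Operator: the INFO line logged when rendered manifests are applied
# (dry-run validation lines use validateRenderedManifests and are not matched).
SITECONFIG_APPLY = re.compile(
    r'^(?P<ts>\S+)\s+INFO\s+ClusterInstanceController\.applyRenderedManifests\s+.*'
    r'Applying the rendered manifests\s+\{"name": "(?P<name>[^"]+)"'
)
# IBIO: the line logged when it sets spec.online to true.
IBIO_ONLINE = re.compile(
    r'^time="(?P<ts>[^"]+)".*'
    r'msg="Setting BareMetalHost \((?P<ns>[^/]+)/(?P<name>[^)]+)\) spec\.Online to true"'
)


def parse_ts(text):
    # Both logs use UTC timestamps ending in "Z".
    return datetime.fromisoformat(text.replace("Z", "+00:00"))


def applied_after_power_on(lines):
    """Return names of hosts that SiteConfig applied again after IBIO powered them on."""
    events = []
    for line in lines:
        line = line.strip()
        m = IBIO_ONLINE.match(line)
        if m:
            events.append((parse_ts(m["ts"]), "ibio", m["name"]))
            continue
        m = SITECONFIG_APPLY.match(line)
        if m:
            events.append((parse_ts(m["ts"]), "apply", m["name"]))

    powered_on, flagged = set(), []
    for _, kind, name in sorted(events):  # the two logs come from different pods
        if kind == "ibio":
            powered_on.add(name)
        elif name in powered_on and name not in flagged:
            flagged.append(name)
    return flagged


# Lines taken from this report (IBIO line abridged: func= and file= fields omitted).
SAMPLE = [
    '2025-10-31T03:56:44.398Z\tINFO\tClusterInstanceController.applyRenderedManifests\tcontroller/clusterinstance_controller.go:692\tApplying the rendered manifests\t{"name": "vm00058", "namespace": "vm00058", "version": "172880"}',
    'time="2025-10-31T03:57:02Z" level=info msg="Setting BareMetalHost (vm00058/vm00058) spec.Online to true" name=vm00058 namespace=vm00058',
    '2025-10-31T03:58:33.869Z\tINFO\tClusterInstanceController.applyRenderedManifests\tcontroller/clusterinstance_controller.go:692\tApplying the rendered manifests\t{"name": "vm00058", "namespace": "vm00058", "version": "182349"}',
]

flagged_hosts = applied_after_power_on(SAMPLE)
print(flagged_hosts)
```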
      

       

      Version-Release number of selected component (if applicable):

      Hub OCP 4.20.1

      Deployed OCP 4.19.14

      ACM - 2.15.0-DOWNSTREAM-2025-10-11-01-13-51

      How reproducible:

      Intermittent: roughly 5 out of 324 SNOs in the test described above.

      Steps to Reproduce:

      1.  
      2.  
      3. ...

      Actual results:

      Expected results:

      Additional info:

      As a manual workaround, if you catch the ClusterInstance not progressing before the timeout, you can manually edit the BMH to power it on (set spec.online to true), and the cluster install will then succeed.
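      The manual edit could also be scripted. The sketch below only builds an `oc patch` command line and does not run it. It assumes the `oc` CLI is logged in to the hub cluster and that the BMH is addressable as `baremetalhost`; the helper name is hypothetical.

```python
import json
import shlex


def power_on_patch_command(name, namespace):
    """Build an `oc patch` command that sets spec.online to true on a BareMetalHost."""
    patch = {"spec": {"online": True}}
    return [
        "oc", "patch", "baremetalhost", name,
        "-n", namespace,
        "--type", "merge",
        "-p", json.dumps(patch),
    ]


cmd = power_on_patch_command("vm00058", "vm00058")
print(shlex.join(cmd))  # review the command, then run it by hand or via subprocess.run(cmd)
```

      If the SiteConfig Operator applies the BMH again afterwards, the patch may need to be repeated.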

              sakhoury@redhat.com Sharat Akhoury
              akrzos@redhat.com Alex Krzos
              Ting Xue Ting Xue