Bug
Resolution: Done
Description of problem:
While building a proof-of-concept test for ACM that deploys many SNOs via IBI, with AAP and EDA firing off a "day2" playbook, I observed roughly 5 out of 324 SNOs fail to deploy. In each case I found the SNO's machine powered down. From the log files, it appears that the SiteConfig Operator applied the BMH manifest twice, and its default template sets spec.online to false. The sequence in the logs is: the SiteConfig Operator applies the BMH, IBIO then sets spec.online to true to power the host on, and eventually the SiteConfig Operator applies the BMH a 2nd time. The second apply appears to reset spec.online to false, which prevents the SNO hardware from powering on fully and thus from installing.
In the attached logs, vm00058 is the ClusterInstance that showed this behavior:
1st Dry-run
2025-10-31T03:56:44.296Z INFO ClusterInstanceController.validateRenderedManifests controller/clusterinstance_controller.go:646 Executing a dry-run validation on the rendered manifests {"name": "vm00058", "namespace": "vm00058", "version": "172880"}
2025-10-31T03:56:44.296Z DEBUG ClusterInstanceController.validateRenderedManifests.applyObject controller/clusterinstance_controller.go:579 Applying object using Server-Side Apply {"name": "vm00058", "namespace": "vm00058", "version": "172880", "name": "vm00058", "namespace": "vm00058", "kind": "BareMetalHost"}
2025-10-31T03:56:44.301Z DEBUG ClusterInstanceController.validateRenderedManifests.applyObject controller/clusterinstance_controller.go:587 Object applied using Server-Side Apply {"name": "vm00058", "namespace": "vm00058", "version": "172880", "name": "vm00058", "namespace": "vm00058", "kind": "BareMetalHost"}
SiteConfig Operator applies BMH 1st time
2025-10-31T03:56:44.398Z INFO ClusterInstanceController.applyRenderedManifests controller/clusterinstance_controller.go:692 Applying the rendered manifests {"name": "vm00058", "namespace": "vm00058", "version": "172880"}
2025-10-31T03:56:44.398Z DEBUG ClusterInstanceController.applyRenderedManifests.applyObject controller/clusterinstance_controller.go:579 Applying object using Server-Side Apply {"name": "vm00058", "namespace": "vm00058", "version": "172880", "name": "vm00058", "namespace": "vm00058", "kind": "BareMetalHost"}
2025-10-31T03:56:44.403Z DEBUG ClusterInstanceController.applyRenderedManifests.applyObject controller/clusterinstance_controller.go:587 Object applied using Server-Side Apply {"name": "vm00058", "namespace": "vm00058", "version": "172880", "name": "vm00058", "namespace": "vm00058", "kind": "BareMetalHost"}
IBIO sets spec.online to true (only time)
time="2025-10-31T03:57:02Z" level=info msg="Setting BareMetalHost (vm00058/vm00058) spec.Online to true" func="github.com/openshift/image-based-install-operator/controllers.(*ImageClusterInstallReconciler).updateBMHProvisioningState" file="/opt/app-root/src/controllers/imageclusterinstall_controller.go:715" name=vm00058 namespace=vm00058
2nd Dry-run
2025-10-31T03:58:33.720Z INFO ClusterInstanceController.validateRenderedManifests controller/clusterinstance_controller.go:646 Executing a dry-run validation on the rendered manifests {"name": "vm00058", "namespace": "vm00058", "version": "182349"}
2025-10-31T03:58:33.720Z DEBUG ClusterInstanceController.validateRenderedManifests.applyObject controller/clusterinstance_controller.go:579 Applying object using Server-Side Apply {"name": "vm00058", "namespace": "vm00058", "version": "182349", "name": "vm00058", "namespace": "vm00058", "kind": "BareMetalHost"}
2025-10-31T03:58:33.728Z DEBUG ClusterInstanceController.validateRenderedManifests.applyObject controller/clusterinstance_controller.go:587 Object applied using Server-Side Apply {"name": "vm00058", "namespace": "vm00058", "version": "182349", "name": "vm00058", "namespace": "vm00058", "kind": "BareMetalHost"}
SiteConfig Operator applies BMH 2nd time
2025-10-31T03:58:33.869Z INFO ClusterInstanceController.applyRenderedManifests controller/clusterinstance_controller.go:692 Applying the rendered manifests {"name": "vm00058", "namespace": "vm00058", "version": "182349"}
2025-10-31T03:58:33.869Z DEBUG ClusterInstanceController.applyRenderedManifests.applyObject controller/clusterinstance_controller.go:579 Applying object using Server-Side Apply {"name": "vm00058", "namespace": "vm00058", "version": "182349", "name": "vm00058", "namespace": "vm00058", "kind": "BareMetalHost"}
2025-10-31T03:58:33.876Z DEBUG ClusterInstanceController.applyRenderedManifests.applyObject controller/clusterinstance_controller.go:587 Object applied using Server-Side Apply {"name": "vm00058", "namespace": "vm00058", "version": "182349", "name": "vm00058", "namespace": "vm00058", "kind": "BareMetalHost"}
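The sequence in the log excerpts above can be replayed as a toy model to make the suspected race easier to see. This is only a sketch and not the operators' code: the event names and the `replay` helper are hypothetical, the timestamps are transcribed from the logs, and the spec.online value written by each SiteConfig Operator apply is assumed to be the template default (false) described above. It treats spec.online as a plain last-writer-wins field and ignores Server-Side Apply field ownership.

```python
from datetime import datetime

# Events transcribed from the log excerpts (timestamps are UTC).
# Assumption: each SiteConfig apply writes spec.online=false (template default);
# the IBIO update writes spec.online=true, as its log message states.
EVENTS = [
    ("2025-10-31T03:56:44.403", "siteconfig-apply", False),
    ("2025-10-31T03:57:02.000", "ibio-update", True),
    ("2025-10-31T03:58:33.876", "siteconfig-apply", False),
]


def replay(events):
    """Replay writers in timestamp order; the last write to spec.online wins."""
    online = None
    history = []
    for ts, writer, value in sorted(events):
        online = value
        history.append((datetime.fromisoformat(ts), writer, online))
    return online, history


final_online, history = replay(EVENTS)

# Time between IBIO requesting power-on and the second SiteConfig apply.
online_window = (history[2][0] - history[1][0]).total_seconds()

print(f"final spec.online = {final_online}")
print(f"seconds between IBIO power-on and 2nd apply = {online_window:.3f}")
```

Under these assumptions the host ends with spec.online set to false, and IBIO never sets it again because it only does so once.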
Version-Release number of selected component (if applicable):
Hub OCP 4.20.1
Deployed OCP 4.19.14
ACM - 2.15.0-DOWNSTREAM-2025-10-11-01-13-51
How reproducible:
Steps to Reproduce:
- ...
Actual results:
Expected results:
Additional info:
As a manual workaround, if you catch the ClusterInstance not progressing before the timeout expires, you can manually edit the BMH to power the host on, and the cluster install then succeeds.
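One way to script that workaround is sketched below. It assumes that "edit the BMH to power on" means setting spec.online to true, that the command is run against the hub cluster, and that `bmh` resolves to the BareMetalHost resource there. The helper name `bmh_power_on_command` is hypothetical. The snippet only builds and prints the `oc patch` command, it does not execute it.

```python
import json
import shlex


def bmh_power_on_command(name, namespace):
    """Build an `oc patch` argv that sets spec.online=true on a BareMetalHost."""
    # JSON merge patch touching only spec.online.
    patch = json.dumps({"spec": {"online": True}}, separators=(",", ":"))
    return [
        "oc", "patch", "bmh", name,
        "-n", namespace,
        "--type", "merge",
        "-p", patch,
    ]


# vm00058 is the affected ClusterInstance from the logs in this report.
argv = bmh_power_on_command("vm00058", "vm00058")
print(shlex.join(argv))
```

The printed command can be reviewed and pasted into a shell, or passed to `subprocess.run(argv, check=True)` once it has been verified in the target environment.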