Type: Bug
Resolution: Unresolved
Affects Version: ACM 2.15.0
Severity: Important
Impact: Quality / Stability / Reliability
Description of problem:
During a 3-node cluster deployment via ClusterInstance with hub-side templating, the bootstrap node (helix28) became stuck in the "Waiting for bootkube" stage.
The Cluster Version Operator (CVO) remained in Pending state indefinitely, causing bootkube.service to repeatedly fail and restart.
CVO stuck in Pending, bootkube repeatedly fails:
Pod Status: openshift-cluster-version/cluster-version-operator Pending
Error: error while checking pod status: timed out waiting for the condition
bootkube.service: Failed with result 'exit-code'
Agent validation shows:
Id: api-domain-name-resolved-correctly
Message: Domain name resolution for the api.kni-qe-51.lab.eng.tlv2.redhat.com domain was successful or not required
Status: success
Id: api-int-domain-name-resolved-correctly
Message: Domain name resolution for the api-int.kni-qe-51.lab.eng.tlv2.redhat.com domain was successful or not required
Status: success
Installation proceeds and bootstrap node logs show:
Jan 05 01:10:03 bootkube.sh[6646]: Unable to resolve API_INT_URL api-int.kni-qe-51.lab.eng.tlv2.redhat.com
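For reference, api-int resolution can be cross-checked from the bootstrap node with standard tooling (a minimal sketch; the SSH target and record name below are taken from this failure and may need adjusting):
# log in to the bootstrap node as the core user
ssh core@helix28.lab.eng.tlv2.redhat.com
# check resolution through the system resolver and directly via DNS
getent hosts api-int.kni-qe-51.lab.eng.tlv2.redhat.com
dig +short api-int.kni-qe-51.lab.eng.tlv2.redhat.com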
Version-Release number of selected component (if applicable):
HUB OCP VERSION 4.21.0-ec.3
SPOKE OCP VERSION 4.21.0-rc.0
ACM 2.15.0
MCE 2.10.0
How reproducible:
Often (intermittent; see Additional info)
Steps to Reproduce:
- Configure a 3-node cluster deployment via ClusterInstance
- Deploy the cluster
- Observe agents pass validation and installation begins
- Bootstrap node gets stuck in the "Waiting for bootkube" stage (hub-side status can be checked as sketched below)
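A minimal way to watch progress from the hub while reproducing (a sketch; the kni-qe-51 namespace is inferred from the cluster name in the logs and may differ):
# on the hub cluster: watch agent state and installation progress
oc get agents -n kni-qe-51 -o wide
oc get agentclusterinstall -n kni-qe-51
oc get clusterinstance -n kni-qe-51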
Actual results:
The agent remains stuck until the timeout is reached, causing the installation to fail:
State: error
State Info: Host failed to install because its installation stage Waiting for bootkube took longer than expected 1h0m0s
Expected results:
bootkube completes successfully.
Additional info:
Bootstrap node journal showing repeated failures:
Jan 05 01:30:21 api.kni-qe-51.lab.eng.tlv2.redhat.com bootkube.sh[10132]: Error: error while checking pod status: timed out waiting for the condition
Jan 05 01:30:22 api.kni-qe-51.lab.eng.tlv2.redhat.com systemd[1]: bootkube.service: Failed with result 'exit-code'.
Jan 05 01:50:39 api.kni-qe-51.lab.eng.tlv2.redhat.com bootkube.sh[16374]: Error: error while checking pod status: timed out waiting for the condition
Jan 05 01:50:39 api.kni-qe-51.lab.eng.tlv2.redhat.com systemd[1]: bootkube.service: Failed with result 'exit-code'.
Jan 05 02:10:56 api.kni-qe-51.lab.eng.tlv2.redhat.com bootkube.sh[22133]: Error: error while checking pod status: timed out waiting for the condition
Jan 05 02:10:57 api.kni-qe-51.lab.eng.tlv2.redhat.com systemd[1]: bootkube.service: Failed with result 'exit-code'.
Jan 05 02:31:13 api.kni-qe-51.lab.eng.tlv2.redhat.com bootkube.sh[27849]: Error: error while checking pod status: timed out waiting for the condition
Jan 05 02:31:14 api.kni-qe-51.lab.eng.tlv2.redhat.com systemd[1]: bootkube.service: Failed with result 'exit-code'.
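The journal above can be collected with standard commands (a sketch; the hostname is the bootstrap node from this run):
# on the bootstrap node: full bootkube history for the current boot
ssh core@helix28.lab.eng.tlv2.redhat.com \
    sudo journalctl -b -u bootkube.service --no-pager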
The bootstrap node's hostname kept flipping between the correct hostname and the API VIP hostname:
Jan 05 00:57:26 helix28.lab.eng.tlv2.redhat.com NetworkManager[1784]: <info> [1767574646.7445] policy: set-hostname: set hostname to 'helix28.lab.eng.tlv2.redhat.com' (from address lookup)
Jan 05 01:10:36 helix28.lab.eng.tlv2.redhat.com NetworkManager[5981]: <info> [1767575436.1749] policy: set-hostname: set hostname to 'api.kni-qe-51.lab.eng.tlv2.redhat.com' (from address lookup)
Jan 05 01:10:36 api.kni-qe-51.lab.eng.tlv2.redhat.com systemd-hostnamed[10626]: Hostname set to <api.kni-qe-51.lab.eng.tlv2.redhat.com> (transient)
Jan 05 01:17:41 api.kni-qe-51.lab.eng.tlv2.redhat.com NetworkManager[5981]: <info> [1767575861.7222] policy: set-hostname: set hostname to 'helix28.lab.eng.tlv2.redhat.com' (from address lookup)
Jan 05 01:17:45 helix28.lab.eng.tlv2.redhat.com NetworkManager[5981]: <info> [1767575865.4495] policy: set-hostname: set hostname to 'api.kni-qe-51.lab.eng.tlv2.redhat.com' (from address lookup)
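Since NetworkManager reports the hostname as coming "from address lookup", the transient hostname and the PTR records involved can be inspected as follows (a sketch; the IP addresses are placeholders for the node address and the API VIP):
# on the bootstrap node: static vs. transient hostname as currently set
hostnamectl status
nmcli general hostname
# reverse lookups for the node IP and the API VIP (placeholder addresses)
dig +short -x 10.0.0.10    # node address, expected PTR: helix28.lab.eng.tlv2.redhat.com
dig +short -x 10.0.0.5     # API VIP, expected PTR: api.kni-qe-51.lab.eng.tlv2.redhat.com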
- Failure can occur on any of the servers in the 3-node cluster.
- Failure is intermittent: in the last CI iteration it occurred with spoke 4.18 and 4.21, but not with 4.20, using the same ACM/MCE versions.