
Type: Bug
Resolution: Unresolved
Priority: Undefined
Fix Version/s: None
Affects Version/s: ACM 2.15.0
Component/s: Infrastructure Operator
Labels: None

Activity Type: Quality / Stability / Reliability
Blocked: False
Blocked Reason: None
Ready: False
Intelligence Requested:
Market:
RH Private Keywords:

Severity: Important
Regression: Yes

SFDC Cases Links:
SFDC Cases Open:
SFDC Cases Counter:

PX Impact Score:

Description of problem:

During a 3-node cluster deployment via ClusterInstance with hub-side templating, the bootstrap node (helix28) became stuck in the "Waiting for bootkube" stage.
The Cluster Version Operator (CVO) remained in Pending state indefinitely, causing bootkube.service to repeatedly fail and restart.

CVO stuck in Pending, bootkube repeatedly fails:

Pod Status:openshift-cluster-version/cluster-version-operator  Pending
Error: error while checking pod status: timed out waiting for the condition
bootkube.service: Failed with result 'exit-code'

Agent validation shows:

Id: api-domain-name-resolved-correctly
Message: Domain name resolution for the api.kni-qe-51.lab.eng.tlv2.redhat.com domain was successful or not required
Status: success

Id: api-int-domain-name-resolved-correctly
Message: Domain name resolution for the api-int.kni-qe-51.lab.eng.tlv2.redhat.com domain was successful or not required
Status: success

Installation then proceeds, but the bootstrap node log shows that api-int cannot be resolved:

Jan 05 01:10:03 bootkube.sh[6646]: Unable to resolve API_INT_URL api-int.kni-qe-51.lab.eng.tlv2.redhat.com
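
To compare the agent validation result with what the bootstrap node actually sees, the same two names can be resolved from the node itself. The following is a minimal sketch, not part of the installer: the function names `resolve` and `check_api_names` are hypothetical, and it only asks the local resolver via `socket.getaddrinfo`, which may differ from the lookup bootkube.sh performs.

```python
import socket
from typing import Callable, Dict, List


def resolve(hostname: str) -> List[str]:
    """Return the sorted unique addresses the local resolver gives for
    hostname, or an empty list if resolution fails."""
    try:
        infos = socket.getaddrinfo(hostname, None)
    except socket.gaierror:
        return []
    # Each entry is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the address string for both IPv4 and IPv6.
    return sorted({info[4][0] for info in infos})


def check_api_names(
    cluster_domain: str,
    resolver: Callable[[str], List[str]] = resolve,
) -> Dict[str, List[str]]:
    """Resolve api.<cluster_domain> and api-int.<cluster_domain>.

    The resolver argument is injectable (hypothetical helper design) so the
    check can be exercised without network access.
    """
    names = [f"api.{cluster_domain}", f"api-int.{cluster_domain}"]
    return {name: resolver(name) for name in names}
```

Calling `check_api_names("kni-qe-51.lab.eng.tlv2.redhat.com")` on the bootstrap node returns a dictionary keyed by the two names; an empty list for the api-int entry would match the "Unable to resolve API_INT_URL" message above.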

Version-Release number of selected component (if applicable):

HUB OCP VERSION 4.21.0-ec.3
SPOKE OCP VERSION 4.21.0-rc.0
ACM 2.15.0
MCE 2.10.0

How reproducible:

Often

Steps to Reproduce:

1. Configure a 3-node cluster deployment via ClusterInstance.
2. Deploy the cluster.
3. Observe that the agents pass validation and installation begins.
4. The bootstrap node gets stuck in the "Waiting for bootkube" stage.

Actual results:

The agent remains stuck until the timeout expires, causing the installation to fail:

  State: error
  State Info: Host failed to install because its installation stage Waiting for bootkube took longer than expected 1h0m0s

Expected results:

bootkube completes successfully.

Additional info:

Bootstrap node journal showing repeated failures:

Jan 05 01:30:21 api.kni-qe-51.lab.eng.tlv2.redhat.com bootkube.sh[10132]: Error: error while checking pod status: timed out waiting for the condition
Jan 05 01:30:22 api.kni-qe-51.lab.eng.tlv2.redhat.com systemd[1]: bootkube.service: Failed with result 'exit-code'.
Jan 05 01:50:39 api.kni-qe-51.lab.eng.tlv2.redhat.com bootkube.sh[16374]: Error: error while checking pod status: timed out waiting for the condition
Jan 05 01:50:39 api.kni-qe-51.lab.eng.tlv2.redhat.com systemd[1]: bootkube.service: Failed with result 'exit-code'.
Jan 05 02:10:56 api.kni-qe-51.lab.eng.tlv2.redhat.com bootkube.sh[22133]: Error: error while checking pod status: timed out waiting for the condition
Jan 05 02:10:57 api.kni-qe-51.lab.eng.tlv2.redhat.com systemd[1]: bootkube.service: Failed with result 'exit-code'.
Jan 05 02:31:13 api.kni-qe-51.lab.eng.tlv2.redhat.com bootkube.sh[27849]: Error: error while checking pod status: timed out waiting for the condition
Jan 05 02:31:14 api.kni-qe-51.lab.eng.tlv2.redhat.com systemd[1]: bootkube.service: Failed with result 'exit-code'.
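
The attached journals are large, so a short script can help confirm how regularly bootkube.service fails. This is a minimal sketch under stated assumptions: the helper names are hypothetical, the regular expression is written against the line format shown in the excerpt above, and the year (2026) is an assumption taken from the attachment dates because journal timestamps omit it.

```python
import re
from datetime import datetime, timedelta
from typing import Iterable, List

# Matches the systemd failure lines as they appear in the excerpt above.
FAILED = re.compile(
    r"^(\w{3}\s+\d{1,2} \d{2}:\d{2}:\d{2}) \S+ systemd\[1\]: "
    r"bootkube\.service: Failed with result"
)


def bootkube_failure_times(lines: Iterable[str], year: int = 2026) -> List[datetime]:
    """Return the timestamps of bootkube.service failures found in lines.

    year is an assumption: the journal's short timestamp format omits it.
    """
    times = []
    for line in lines:
        match = FAILED.match(line)
        if match:
            stamp = f"{year} {match.group(1)}"
            times.append(datetime.strptime(stamp, "%Y %b %d %H:%M:%S"))
    return times


def failure_gaps(times: List[datetime]) -> List[timedelta]:
    """Return the time elapsed between each pair of consecutive failures."""
    return [later - earlier for earlier, later in zip(times, times[1:])]
```

Applied to the four failures in the excerpt above, the gaps come out at a little over 20 minutes each, which suggests that bootkube.sh hits the same pod-status timeout on every restart.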

The bootstrap node's hostname kept flipping between its correct hostname (helix28) and the API VIP hostname:

Jan 05 00:57:26 helix28.lab.eng.tlv2.redhat.com NetworkManager[1784]: <info>  [1767574646.7445] policy: set-hostname: set hostname to 'helix28.lab.eng.tlv2.redhat.com' (from address lookup)
Jan 05 01:10:36 helix28.lab.eng.tlv2.redhat.com NetworkManager[5981]: <info>  [1767575436.1749] policy: set-hostname: set hostname to 'api.kni-qe-51.lab.eng.tlv2.redhat.com' (from address lookup)
Jan 05 01:10:36 api.kni-qe-51.lab.eng.tlv2.redhat.com systemd-hostnamed[10626]: Hostname set to <api.kni-qe-51.lab.eng.tlv2.redhat.com> (transient)
Jan 05 01:17:41 api.kni-qe-51.lab.eng.tlv2.redhat.com NetworkManager[5981]: <info>  [1767575861.7222] policy: set-hostname: set hostname to 'helix28.lab.eng.tlv2.redhat.com' (from address lookup)
Jan 05 01:17:45 helix28.lab.eng.tlv2.redhat.com NetworkManager[5981]: <info>  [1767575865.4495] policy: set-hostname: set hostname to 'api.kni-qe-51.lab.eng.tlv2.redhat.com' (from address lookup)
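
To check the other nodes' journals for the same symptom, the NetworkManager set-hostname events can be extracted and counted. This is a minimal sketch: the helper names are hypothetical, and the pattern is written against the NetworkManager message format shown in the excerpt above.

```python
import re
from typing import Iterable, List

# Matches the NetworkManager policy message shown in the excerpt above.
SET_HOSTNAME = re.compile(r"policy: set-hostname: set hostname to '([^']+)'")


def hostname_changes(lines: Iterable[str]) -> List[str]:
    """Return, in order, every hostname NetworkManager set in the journal lines."""
    names = []
    for line in lines:
        match = SET_HOSTNAME.search(line)
        if match:
            names.append(match.group(1))
    return names


def count_flips(names: List[str]) -> int:
    """Count how many times the hostname changed from one value to another."""
    return sum(1 for prev, cur in zip(names, names[1:]) if prev != cur)
```

On the excerpt above, `hostname_changes` yields helix28, api, helix28, api, which `count_flips` reports as three flips. A node that never took the API VIP hostname would report zero.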

The failure can occur on any of the servers in a 3-node cluster.
The failure is intermittent: in the last CI iteration it occurred with spoke versions 4.18 and 4.21, but not with 4.20, using the same ACM/MCE versions.


Attachments:

helix28-journal.log  2026/01/05 4:16 PM  12.58 MB  Bahaa Azem
helix27-journal.log  2026/01/05 4:16 PM  19.42 MB  Bahaa Azem
helix26-journal.log  2026/01/05 4:16 PM  31.06 MB  Bahaa Azem
