Bug
Resolution: Cannot Reproduce
Undefined
None
4.17
Moderate
Yes
False
Description of problem:
Environment: Hub Cluster: OCP 4.16.6, ACM 2.11.2, MCE 2.6.2
Attempting to install the latest 4.17 green nightly build (4.17.0-0.nightly-2024-08-01-213905) on a 5-node baremetal environment via ZTP/GitOps using the Assisted Installer. The installation consistently fails with one node (not always the same node) stuck in the Provisioning state. Errors from the node are as follows:
$ journalctl -b -f -u release-image.service -u bootkube.service
Aug 07 18:57:56 api.kni-qe-26.lab.eng.tlv2.redhat.com bootkube.sh[6984]: Checking if api.kni-qe-26.lab.eng.tlv2.redhat.com of type API_URL reachable
Aug 07 18:57:56 api.kni-qe-26.lab.eng.tlv2.redhat.com bootkube.sh[6984]: Unable to reach API_URL's https endpoint at https://api.kni-qe-26.lab.eng.tlv2.redhat.com:6443/readyz
Aug 07 18:57:56 api.kni-qe-26.lab.eng.tlv2.redhat.com bootkube.sh[6984]: Unable to validate. https://api.kni-qe-26.lab.eng.tlv2.redhat.com:6443/readyz is currently unreachable.
Aug 07 18:57:56 api.kni-qe-26.lab.eng.tlv2.redhat.com bootkube.sh[6984]: Check if API-Int URL is reachable during bootstrap
Aug 07 18:57:56 api.kni-qe-26.lab.eng.tlv2.redhat.com bootkube.sh[6984]: Checking if api-int.kni-qe-26.lab.eng.tlv2.redhat.com of type API_INT_URL reachable
Aug 07 18:57:56 api.kni-qe-26.lab.eng.tlv2.redhat.com bootkube.sh[6984]: Unable to reach API_INT_URL's https endpoint at https://api-int.kni-qe-26.lab.eng.tlv2.redhat.com:6443/readyz
Aug 07 18:57:56 api.kni-qe-26.lab.eng.tlv2.redhat.com bootkube.sh[6984]: Unable to validate. https://api-int.kni-qe-26.lab.eng.tlv2.redhat.com:6443/readyz is currently unreachable.
Aug 07 18:57:56 api.kni-qe-26.lab.eng.tlv2.redhat.com bootkube.sh[6984]: bootkube.service complete
Aug 07 18:57:56 api.kni-qe-26.lab.eng.tlv2.redhat.com systemd[1]: bootkube.service: Deactivated successfully.
Aug 07 18:57:56 api.kni-qe-26.lab.eng.tlv2.redhat.com systemd[1]: bootkube.service: Consumed 34.010s CPU time.
Aug 07 19:53:24 helix26.lab.eng.tlv2.redhat.com kubelet.sh[6957]: E0807 19:53:24.209981 6957 pod_workers.go:1298] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"etcd\" with CrashLoopBackOff: \"back-off 5m0s restarting failed container=etcd pod=etcd-bootstrap-member-api-int.helix26.lab.eng.tlv2.redhat.com_openshift-etcd(0a1902434a202329b9007efe5b197061)\"" pod="openshift-etcd/etcd-bootstrap-member-api-int.helix26.lab.eng.tlv2.redhat.com" podUID="0a1902434a202329b9007efe5b197061"
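The bootkube.sh checks above can be reproduced manually from the affected node. A minimal sketch, assuming the endpoints from this cluster's logs and that `curl` is available on the host (the `check_readyz` helper name is hypothetical, not part of bootkube.sh):

```shell
# Probe the /readyz endpoints that bootkube.sh reports as unreachable.
# --insecure is used because the bootstrap serving certs are not yet
# trusted; --max-time keeps each probe short.
check_readyz() {
  local url="$1"
  if curl --insecure --silent --max-time 5 "${url}/readyz" | grep -q ok; then
    echo "reachable: ${url}"
  else
    echo "UNREACHABLE: ${url}"
  fi
}

check_readyz "https://api.kni-qe-26.lab.eng.tlv2.redhat.com:6443"
check_readyz "https://api-int.kni-qe-26.lab.eng.tlv2.redhat.com:6443"
```

Both endpoints reporting UNREACHABLE is consistent with the etcd bootstrap member crash-looping (last log line above), since the bootstrap apiserver cannot become ready without etcd.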
I have not seen this error when installing a compact 3-node cluster using the same hub cluster and three of the same baremetal servers.
Version-Release number of selected component (if applicable):
ACM 2.11.2, MCE 2.6.2
How reproducible:
Always
Steps to Reproduce:
1. Install hub cluster with OCP 4.16.6, ACM 2.11.2, MCE 2.6.2
2. Start spoke cluster installation using the ZTP/GitOps workflow
3. Observe that "oc get bmh" shows 4/5 BMHs as "Provisioned" with one stuck in "Provisioning"
Actual results:
One node (not always the same node) is stuck in Provisioning with the errors shown above.
Expected results:
All BMHs provisioned.
Additional info:
See links in comments for must-gathers.