Bug
Resolution: Cannot Reproduce
Undefined
None
4.17
Moderate
Yes
False
Description of problem:
Environment: Hub Cluster: OCP 4.16.6, ACM 2.11.2, MCE 2.6.2
Attempting to install the latest 4.17 green nightly build (4.17.0-0.nightly-2024-08-01-213905) on a 5-node baremetal environment via ZTP/GitOps using the Assisted Installer. The installation consistently fails with one node (not always the same node) stuck in the Provisioning state. Errors from the node are as follows:
$ journalctl -b -f -u release-image.service -u bootkube.service
Aug 07 18:57:56 api.kni-qe-26.lab.eng.tlv2.redhat.com bootkube.sh[6984]: Checking if api.kni-qe-26.lab.eng.tlv2.redhat.com of type API_URL reachable
Aug 07 18:57:56 api.kni-qe-26.lab.eng.tlv2.redhat.com bootkube.sh[6984]: Unable to reach API_URL's https endpoint at https://api.kni-qe-26.lab.eng.tlv2.redhat.com:6443/readyz
Aug 07 18:57:56 api.kni-qe-26.lab.eng.tlv2.redhat.com bootkube.sh[6984]: Unable to validate. https://api.kni-qe-26.lab.eng.tlv2.redhat.com:6443/readyz is currently unreachable.
Aug 07 18:57:56 api.kni-qe-26.lab.eng.tlv2.redhat.com bootkube.sh[6984]: Check if API-Int URL is reachable during bootstrap
Aug 07 18:57:56 api.kni-qe-26.lab.eng.tlv2.redhat.com bootkube.sh[6984]: Checking if api-int.kni-qe-26.lab.eng.tlv2.redhat.com of type API_INT_URL reachable
Aug 07 18:57:56 api.kni-qe-26.lab.eng.tlv2.redhat.com bootkube.sh[6984]: Unable to reach API_INT_URL's https endpoint at https://api-int.kni-qe-26.lab.eng.tlv2.redhat.com:6443/readyz
Aug 07 18:57:56 api.kni-qe-26.lab.eng.tlv2.redhat.com bootkube.sh[6984]: Unable to validate. https://api-int.kni-qe-26.lab.eng.tlv2.redhat.com:6443/readyz is currently unreachable.
Aug 07 18:57:56 api.kni-qe-26.lab.eng.tlv2.redhat.com bootkube.sh[6984]: bootkube.service complete
Aug 07 18:57:56 api.kni-qe-26.lab.eng.tlv2.redhat.com systemd[1]: bootkube.service: Deactivated successfully.
Aug 07 18:57:56 api.kni-qe-26.lab.eng.tlv2.redhat.com systemd[1]: bootkube.service: Consumed 34.010s CPU time.
Aug 07 19:53:24 helix26.lab.eng.tlv2.redhat.com kubelet.sh[6957]: E0807 19:53:24.209981 6957 pod_workers.go:1298] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"etcd\" with CrashLoopBackOff: \"back-off 5m0s restarting failed container=etcd pod=etcd-bootstrap-member-api-int.helix26.lab.eng.tlv2.redhat.com_openshift-etcd(0a1902434a202329b9007efe5b197061)\"" pod="openshift-etcd/etcd-bootstrap-member-api-int.helix26.lab.eng.tlv2.redhat.com" podUID="0a1902434a202329b9007efe5b197061"
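The bootkube.sh checks above can be reproduced manually from the affected node. A minimal sketch, assuming the endpoints from this cluster's logs and that `curl` is available on the host (the `check_readyz` helper name is hypothetical, not part of bootkube.sh):

```shell
# Probe the /readyz endpoints that bootkube.sh reports as unreachable.
# --insecure is used because the bootstrap serving certs are not yet
# trusted; --max-time keeps each probe short.
check_readyz() {
  local url="$1"
  if curl --insecure --silent --max-time 5 "${url}/readyz" | grep -q ok; then
    echo "reachable: ${url}"
  else
    echo "UNREACHABLE: ${url}"
  fi
}

check_readyz "https://api.kni-qe-26.lab.eng.tlv2.redhat.com:6443"
check_readyz "https://api-int.kni-qe-26.lab.eng.tlv2.redhat.com:6443"
```

Both endpoints reporting UNREACHABLE is consistent with the etcd bootstrap member crash-looping (last log line above), since the bootstrap apiserver cannot become ready without etcd.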
I have not seen this error when installing a compact 3-node cluster using the same hub cluster and three of the same baremetal servers.
Version-Release number of selected component (if applicable):
ACM 2.11.2, MCE 2.6.2
How reproducible:
Always
Steps to Reproduce:
1. Install hub cluster with OCP 4.16.6, ACM 2.11.2, MCE 2.6.2
2. Start spoke cluster installation using the ZTP/GitOps workflow
3. Observe that "oc get bmh" shows 4/5 BMHs as "Provisioned" with one stuck in "Provisioning"
Actual results:
One node (not always the same node) is stuck in Provisioning with the errors shown above.
Expected results:
All BMHs provisioned.
Additional info:
See links in comments for must-gathers.