Bug
Resolution: Unresolved
Normal
4.16
Quality / Stability / Reliability
Moderate
Description of problem:
This happens intermittently in the baremetal CI. During installation, once both masters and the three workers become ready, we expect the rendezvous host to eventually reboot, install, and join the cluster as the third master to conclude the installation. However, it seems that one of the two masters is not detected as ready, even though `oc get nodes` shows both masters as joined and ready. As a result, the rendezvous host stays stuck and never reboots to finish the installation by joining the cluster.
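To make the `oc get nodes` side of this comparison concrete, here is a minimal sketch that counts the masters reporting `Ready=True` in the JSON form of the node list. It assumes the masters carry the `node-role.kubernetes.io/master` label, and it reflects only the node view from the Kubernetes API, not whatever check the installer itself performs. The function names are illustrative.

```python
import json
import subprocess

# Assumption: masters carry this label on the cluster under test.
MASTER_LABEL = "node-role.kubernetes.io/master"


def ready_masters(node_list: dict) -> list[str]:
    """Return the names of master nodes whose Ready condition is True."""
    names = []
    for node in node_list.get("items", []):
        labels = node.get("metadata", {}).get("labels", {})
        if MASTER_LABEL not in labels:
            continue
        conditions = node.get("status", {}).get("conditions", [])
        if any(c.get("type") == "Ready" and c.get("status") == "True"
               for c in conditions):
            names.append(node["metadata"]["name"])
    return names


def fetch_nodes() -> dict:
    """Fetch the node list as JSON; requires `oc` and a valid kubeconfig."""
    result = subprocess.run(
        ["oc", "get", "nodes", "-o", "json"],
        check=True, capture_output=True, text=True,
    )
    return json.loads(result.stdout)
```

Running `ready_masters(fetch_nodes())` against the failing cluster would be expected to return two names, which is the state that should let the rendezvous host proceed.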
Version-Release number of selected component (if applicable):
4.16
How reproducible:
Sometimes. Most of the time it works fine; when it fails, it always fails the same way.
Steps to Reproduce:
1. Install an agent-based cluster. The configuration I hit this with is:
   - DHCP
   - IPv4 only
   - BMO enabled (platform: baremetal)
   - 3 masters + 3 workers
Actual results:
The installation fails
Expected results:
The rendezvous host reboots, joins the cluster, and the installation succeeds.
Additional info:
According to `oc get nodes` and `oc get co`, everything looked good, except for the operators that were degraded because of the missing third master.

Example job (from a PR, but no changes are made to the automation for the agent installation):
https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_release/54483/rehearse-54483-periodic-ci-openshift-hypershift-release-4.16-periodics-mce-e2e-agent-connected-ovn-ipv4-metal3-conformance/1817908001202245632

Also, in the journal, right after the timeout of the cluster installation:

"Error syncing pod, skipping" err="failed to \"StartContainer\" for \"etcd\" with CrashLoopBackOff: \"back-off 5m0s restarting failed container=etcd pod=etcd-bootstrap-member-master-00_openshift-etcd(77a8963015ba959007010dc81931348d)\"" pod="openshift-etcd/etcd-bootstrap-member-master-00" podUID="77a8963015ba959007010dc81931348d"
Jul 29 14:48:21 master-00 start-cluster-installation.sh[7044]: Cluster status: error
Jul 29 14:48:26 master-00 start-cluster-installation.sh[7044]: Cluster status: error

Also see attachments.

The etcd-bootstrap-member-master-00 pod is in CrashLoopBackOff. Log:

aft.go:77","msg":"8f4fcab0df4f7c44 switched to configuration voters=(7372168020071371606 10326695331593419844 18105834420489811888)"}
{"level":"info","ts":"2024-07-29T14:49:41.373886Z","caller":"membership/cluster.go:537","msg":"promote member","cluster-id":"cf7ed821fb17c7fa","local-member-id":"8f4fcab0df4f7c44"}
{"level":"warn","ts":"2024-07-29T14:49:41.374503Z","caller":"etcdserver/server.go:1149","msg":"server error","error":"the member has been permanently removed from the cluster"}
{"level":"warn","ts":"2024-07-29T14:49:41.374538Z","caller":"etcdserver/server.go:1150","msg":"data-dir used by this member must be removed"}
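
For triaging other failed runs, the following is a rough sketch of how the etcd log could be scanned for the same "permanently removed" error. It assumes the log is available as text with one JSON object per line, as in the excerpt above; the function name is illustrative, and lines that are truncated or not valid JSON are skipped.

```python
import json

# Error string copied from the etcd log excerpt in this report.
REMOVED_MSG = "the member has been permanently removed from the cluster"


def find_removed_member_events(log_text: str) -> list[dict]:
    """Return parsed log entries whose "error" field matches REMOVED_MSG."""
    events = []
    for line in log_text.splitlines():
        line = line.strip()
        if not line.startswith("{"):
            continue  # truncated or non-JSON line
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue
        if isinstance(entry, dict) and entry.get("error") == REMOVED_MSG:
            events.append(entry)
    return events
```

A non-empty result for the bootstrap etcd member's log would indicate that the run hit the same condition as the one reported here.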