- Bug
- Resolution: Duplicate
- Undefined
- None
- 4.14
- Quality / Stability / Reliability
- False
Description of problem:
While installing clusters of various sizes at scale with ACM/ZTP via the assisted-service, several clusters failed to complete installation, in each case apparently because the additional nodes failed to pull their ignition config. 4 clusters out of 34 were in this state: compact-00031, compact-00034, compact-00098, compact-00340. At the conclusion of the test these clusters were still in InstallationInProgress, as the installs ran beyond 12 hours, and they later went to the error state.

Example cluster:

  # oc get aci -n compact-00034
  NAME            CLUSTER         STATE
  compact-00034   compact-00034   error

  # oc get bmh -n compact-00034
  NAME      STATE          CONSUMER   ONLINE   ERROR   AGE
  vm01307   provisioned               true             39h
  vm01308   provisioned               true             39h
  vm01309   provisioning              true             39h

  # ssh core@vm01307
  ssh: connect to host vm01307 port 22: Connection refused
  # ssh core@vm01308
  ssh: connect to host vm01308 port 22: Connection refused
  # ssh core@vm01309
  ...
  This is the bootstrap node; it will be destroyed when the master is fully up.
  The primary services are release-image.service followed by bootkube.service.
  To watch their status, run e.g.
    journalctl -b -f -u release-image.service -u bootkube.service
  Last login: Wed Oct 4 13:40:40 2023 from fc00:1004::1
  [core@vm01309 ~]$

Most notably, a cluster in this condition can be identified by 2 of the 3 control-plane BMHs being in the provisioned state while the last one remains in provisioning. The two provisioned nodes cannot be reached over SSH because ignition, which applies the SSH key for access, never ran on them. Journal logs and agentclusterinstall logs have been collected from the reachable machines to assist in debugging this issue.
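The signature described above (2 of 3 control-plane BMHs provisioned, the third stuck in provisioning) can be turned into a quick triage check. A minimal sketch, assuming the output of `oc get bmh -n <namespace>` is piped in; the helper name `matches_signature` is hypothetical, not part of any tooling mentioned in this report:

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads `oc get bmh -n <namespace>` table output on
# stdin and exits 0 if the cluster matches the failure signature seen in
# this report (exactly 2 BMHs provisioned, 1 stuck in provisioning).
matches_signature() {
  awk 'NR > 1 { states[$2]++ }   # $2 is the STATE column; skip the header row
       END { exit !(states["provisioned"] == 2 && states["provisioning"] == 1) }'
}

# Demo against the compact-00034 output copied from this report:
if matches_signature <<'EOF'
NAME      STATE          CONSUMER   ONLINE   ERROR   AGE
vm01307   provisioned               true             39h
vm01308   provisioned               true             39h
vm01309   provisioning              true             39h
EOF
then
  echo "cluster matches the stuck-ignition signature"
fi
```

In a live hub this could be run per spoke namespace, e.g. `oc get bmh -n compact-00034 | matches_signature`, to shortlist clusters for log collection.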
Version-Release number of selected component (if applicable):
Hub OCP: 4.14.0-rc.2
Deployed clusters: 4.14.0-rc.2
ACM: 2.9.0-DOWNSTREAM-2023-09-27-22-12-46
How reproducible:
So far, 4 out of 34 cluster installs appear to have failed in this way.
Steps to Reproduce:
1.
2.
3.
Actual results:
4 of 34 clusters never complete installation: they remain in InstallationInProgress beyond 12 hours and later go to the error state, with one control-plane BMH stuck in provisioning and the other two unreachable over SSH.
Expected results:
All 34 clusters complete installation successfully.
Additional info:
VM to cluster BMHs:

  # oc get bmh -n compact-00031
  NAME      STATE          CONSUMER   ONLINE   ERROR   AGE
  vm01298   provisioned               true             40h
  vm01299   provisioned               true             40h
  vm01300   provisioning              true             40h

  # oc get bmh -n compact-00034
  NAME      STATE          CONSUMER   ONLINE   ERROR   AGE
  vm01307   provisioned               true             40h
  vm01308   provisioned               true             40h
  vm01309   provisioning              true             40h

  # oc get bmh -n compact-00098
  NAME      STATE          CONSUMER   ONLINE   ERROR   AGE
  vm01499   provisioned               true             40h
  vm01500   provisioning              true             40h
  vm01501   provisioned               true             40h

  # oc get bmh -n compact-00340
  NAME      STATE          CONSUMER   ONLINE   ERROR   AGE
  vm02225   provisioning              true             40h
  vm02226   provisioned               true             40h
  vm02227   provisioned               true             40h