-
Bug
-
Resolution: Duplicate
-
Major
-
None
-
4.14.0
-
None
-
Important
-
No
-
Rejected
-
False
-
Description of problem:
When installing with the openshift-install-4.14.0-rc.5 installer on vSphere IPI, two out of four attempts failed because one of the nodes was never powered on. The node is created correctly from a clone, but it never gets powered on. The machine-controller logs show that the node was cloned and exists, but is not powered on:
~~~
2023-10-16T22:54:15.340197257Z I1016 22:54:15.340172 1 controller.go:156] ocp4-vmware-nwvhs-worker-0-hx657: reconciling Machine
2023-10-16T22:54:15.340243798Z I1016 22:54:15.340228 1 actuator.go:113] ocp4-vmware-nwvhs-worker-0-hx657: actuator checking if machine exists
2023-10-16T22:54:15.351400450Z I1016 22:54:15.351376 1 reconciler.go:304] ocp4-vmware-nwvhs-worker-0-hx657: already exists, but was not powered on after clone, requeue
~~~
After that, the controller tries to create the same node again, which fails because the name already exists:
~~~
2023-10-16T22:54:16.881836503Z E1016 22:54:16.881831 1 actuator.go:60] ocp4-vmware-nwvhs-worker-0-hx657 error: ocp4-vmware-nwvhs-worker-0-hx657: reconciler failed to Create machine: The name 'ocp4-vmware-nwvhs-worker-0-hx657' already exists.
2023-10-16T22:54:16.881870489Z I1016 22:54:16.881846 1 machine_scope.go:104] ocp4-vmware-nwvhs-worker-0-hx657: patching machine
2023-10-16T22:54:16.881980574Z I1016 22:54:16.881953 1 recorder.go:104] events "msg"="ocp4-vmware-nwvhs-worker-0-hx657: reconciler failed to Create machine: The name 'ocp4-vmware-nwvhs-worker-0-hx657' already exists." "object"={"kind":"Machine","namespace":"openshift-machine-api","name":"ocp4-vmware-nwvhs-worker-0-hx657","uid":"ec9fbc15-b85e-45cd-938a-bf668a83e058","apiVersion":"machine.openshift.io/v1beta1","resourceVersion":"22809"} "reason"="FailedCreate" "type"="Warning"
~~~
The machine-controller then loops over this node again and again, trying to reconcile it, until the install process times out and fails.
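The loop described above can be sketched as follows. This is a minimal illustrative model, not the actual machine-api controller code; the function and variable names are hypothetical, and the real actuator talks to vCenter rather than a dict:

```python
# Illustrative sketch of the reconcile decision the logs suggest
# (hypothetical names; NOT the real machine-api implementation).
# vms maps VM name -> powered_on flag, standing in for vCenter state.

def reconcile(vms: dict, name: str) -> str:
    """Return the action the controller appears to take for one pass."""
    if name not in vms:
        vms[name] = False  # clone is created powered off
        return "created"
    if not vms[name]:
        # VM exists after the clone but was never powered on: the
        # controller requeues, and a later Create attempt collides
        # with the existing name ("already exists").
        return "requeue: exists but not powered on"
    return "running"

vms = {}
print(reconcile(vms, "worker-0-hx657"))  # first pass clones the VM
print(reconcile(vms, "worker-0-hx657"))  # stuck: requeued, never powered on
```

In this model the second and every later pass returns the requeue result, which matches the repeated "already exists, but was not powered on after clone, requeue" lines in the logs: nothing ever flips the power state to on, so the machine never leaves Provisioning.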
The final state of the cluster: the control plane is correctly provisioned, the bootstrap node is removed, one worker node is available, and the second worker is stuck in Provisioning:
~~~
NAME                               STATUS   ROLES                  AGE    VERSION
ocp4-vmware-nwvhs-master-0         Ready    control-plane,master   131m   v1.27.6+98158f9
ocp4-vmware-nwvhs-master-1         Ready    control-plane,master   131m   v1.27.6+98158f9
ocp4-vmware-nwvhs-master-2         Ready    control-plane,master   131m   v1.27.6+98158f9
ocp4-vmware-nwvhs-worker-0-2w5b7   Ready    worker                 96m    v1.27.6+98158f9

NAME                               PHASE          TYPE   REGION   ZONE   AGE
ocp4-vmware-nwvhs-master-0         Running                               137m
ocp4-vmware-nwvhs-master-1         Running                               137m
ocp4-vmware-nwvhs-master-2         Running                               137m
ocp4-vmware-nwvhs-worker-0-2w5b7   Running                               119m
ocp4-vmware-nwvhs-worker-0-hx657   Provisioning                          119m
~~~
Version-Release number of selected component (if applicable):
~~~
openshift-installer: openshift-install-4.14.0-rc.5
4.14.0-rc.5
built from commit e170cbcd2461b3d72a1ea177dc5cbb08d8063559
release image quay.io/openshift-release-dev/ocp-release@sha256:042899f17f33259ed9f2cfc179930af283733455720f72ea3483fd1905f9b301
release architecture amd64
~~~
vCenter: 7.0.3 build 18778458
How reproducible:
Intermittent; about two out of four install attempts in testing.
Steps to Reproduce:
1. Create an install-config.yaml file for an IPI/vSphere cluster.
2. Run openshift-install create cluster.
3. Wait until it finishes.
Actual results:
The installation times out with an error. The control plane is fine and one worker node is correct; the second worker is left powered off and stuck in Provisioning.
Expected results:
Installation should succeed.
Additional info:
- duplicates
-
OCPBUGS-1735 [vsphere] Machine stuck in Provisioning status when machine is power off
-
- Closed
-