OpenShift Bugs: OCPBUGS-21770

Install process fails. Machine controller fails with the error "reconciler failed to Create machine: The name 'xxxxxxx' already exists"


    • Bug
    • Resolution: Duplicate
    • Major
    • None
    • 4.14.0
    • None
    • Important
    • No
    • Rejected
    • False

      Description of problem:

      When installing with the openshift-install-4.14.0-rc.5 installer on vSphere (IPI), two out of four attempts failed because one of the nodes was never powered on. The VM is created correctly from a clone, but it never gets powered on.
      
      The machine-controller logs show that the node was cloned and exists, but was not powered on:
      
      ~~~
      2023-10-16T22:54:15.340197257Z I1016 22:54:15.340172       1 controller.go:156] ocp4-vmware-nwvhs-worker-0-hx657: reconciling Machine
      2023-10-16T22:54:15.340243798Z I1016 22:54:15.340228       1 actuator.go:113] ocp4-vmware-nwvhs-worker-0-hx657: actuator checking if machine exists
      2023-10-16T22:54:15.351400450Z I1016 22:54:15.351376       1 reconciler.go:304] ocp4-vmware-nwvhs-worker-0-hx657: already exists, but was not powered on after clone, requeue
      ~~~
      
      After that, the controller appears to try to create the same node again, which fails because the name already exists:
      
      ~~~
      2023-10-16T22:54:16.881836503Z E1016 22:54:16.881831       1 actuator.go:60] ocp4-vmware-nwvhs-worker-0-hx657 error: ocp4-vmware-nwvhs-worker-0-hx657: reconciler failed to Create machine: The name 'ocp4-vmware-nwvhs-worker-0-hx657' already exists.
      2023-10-16T22:54:16.881870489Z I1016 22:54:16.881846       1 machine_scope.go:104] ocp4-vmware-nwvhs-worker-0-hx657: patching machine
      2023-10-16T22:54:16.881980574Z I1016 22:54:16.881953       1 recorder.go:104] events "msg"="ocp4-vmware-nwvhs-worker-0-hx657: reconciler failed to Create machine: The name 'ocp4-vmware-nwvhs-worker-0-hx657' already exists." "object"={"kind":"Machine","namespace":"openshift-machine-api","name":"ocp4-vmware-nwvhs-worker-0-hx657","uid":"ec9fbc15-b85e-45cd-938a-bf668a83e058","apiVersion":"machine.openshift.io/v1beta1","resourceVersion":"22809"} "reason"="FailedCreate" "type"="Warning"
      ~~~
      
      The machine-controller then keeps looping over the node, trying to reconcile it, until the install process times out and fails.
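The pattern described above can be checked mechanically in a saved machine-controller log. The following is a minimal Python sketch, not part of any OpenShift tooling: the function name `summarize`, the regular expressions, and the symptom labels are assumptions derived only from the log excerpts in this report, and the log format may differ in other versions.

```python
import re
from collections import Counter

# Both patterns are assumptions based solely on the log lines quoted in this
# report; they are not a documented log format.
NOT_POWERED_ON = re.compile(
    r"\] (?P<name>\S+): already exists, but was not powered on after clone"
)
# Only the actuator.go line is matched, so the recorder.go event that repeats
# the same message is not counted a second time.
CREATE_NAME_CONFLICT = re.compile(
    r"actuator\.go:\d+\] (?P<name>\S+) error: .*"
    r"reconciler failed to Create machine: The name '[^']+' already exists"
)

# Three lines copied from the excerpts in this report.
SAMPLE_LOG = """\
2023-10-16T22:54:15.351400450Z I1016 22:54:15.351376       1 reconciler.go:304] ocp4-vmware-nwvhs-worker-0-hx657: already exists, but was not powered on after clone, requeue
2023-10-16T22:54:16.881836503Z E1016 22:54:16.881831       1 actuator.go:60] ocp4-vmware-nwvhs-worker-0-hx657 error: ocp4-vmware-nwvhs-worker-0-hx657: reconciler failed to Create machine: The name 'ocp4-vmware-nwvhs-worker-0-hx657' already exists.
2023-10-16T22:54:16.881870489Z I1016 22:54:16.881846       1 machine_scope.go:104] ocp4-vmware-nwvhs-worker-0-hx657: patching machine
"""


def summarize(log_text):
    """Count, per machine name, how often each symptom appears in the log."""
    summary = {}
    for line in log_text.splitlines():
        for label, pattern in (
            ("not_powered_on", NOT_POWERED_ON),
            ("create_name_conflict", CREATE_NAME_CONFLICT),
        ):
            match = pattern.search(line)
            if match:
                summary.setdefault(match["name"], Counter())[label] += 1
    return summary


if __name__ == "__main__":
    for machine, counts in summarize(SAMPLE_LOG).items():
        print(machine, dict(counts))
```

A machine that shows both symptoms, with counts that keep growing over the length of the log, would be consistent with the reconcile loop described here.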
      
      
      The final state of the cluster: the control plane is correctly provisioned, the bootstrap node is removed, one worker node is available, and the second worker is stuck in Provisioning:
      
      NAME                               STATUS   ROLES                  AGE    VERSION
      ocp4-vmware-nwvhs-master-0         Ready    control-plane,master   131m   v1.27.6+98158f9
      ocp4-vmware-nwvhs-master-1         Ready    control-plane,master   131m   v1.27.6+98158f9
      ocp4-vmware-nwvhs-master-2         Ready    control-plane,master   131m   v1.27.6+98158f9
      ocp4-vmware-nwvhs-worker-0-2w5b7   Ready    worker                 96m    v1.27.6+98158f9
      
      
      NAME                               PHASE          TYPE   REGION   ZONE   AGE
      ocp4-vmware-nwvhs-master-0         Running                               137m
      ocp4-vmware-nwvhs-master-1         Running                               137m
      ocp4-vmware-nwvhs-master-2         Running                               137m
      ocp4-vmware-nwvhs-worker-0-2w5b7   Running                               119m
      ocp4-vmware-nwvhs-worker-0-hx657   Provisioning                          119m  
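For repeated install attempts, the stuck machine can be picked out of the machine listing automatically. The sketch below is illustrative only: it assumes the second table above was saved as plain text (it looks like `oc get machines` output, which is a guess), that every row has a one-word PHASE, and that AGE is the last column. The helper name `stuck_machines` is hypothetical.

```python
# The machine table exactly as shown in this report.
MACHINES_TABLE = """\
NAME                               PHASE          TYPE   REGION   ZONE   AGE
ocp4-vmware-nwvhs-master-0         Running                               137m
ocp4-vmware-nwvhs-master-1         Running                               137m
ocp4-vmware-nwvhs-master-2         Running                               137m
ocp4-vmware-nwvhs-worker-0-2w5b7   Running                               119m
ocp4-vmware-nwvhs-worker-0-hx657   Provisioning                          119m
"""


def stuck_machines(table, healthy_phase="Running"):
    """Return (name, phase, age) for every machine not in healthy_phase.

    Assumes NAME is the first field, PHASE the second, and AGE the last, so
    empty TYPE/REGION/ZONE columns do not matter. A row with an empty PHASE
    would be misread; that case is not handled in this sketch.
    """
    stuck = []
    lines = [line for line in table.splitlines() if line.strip()]
    for line in lines[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 3:
            continue  # not a complete machine row
        name, phase, age = fields[0], fields[1], fields[-1]
        if phase != healthy_phase:
            stuck.append((name, phase, age))
    return stuck


if __name__ == "__main__":
    for name, phase, age in stuck_machines(MACHINES_TABLE):
        print(f"{name} is in phase {phase} after {age}")
```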

      Version-Release number of selected component (if applicable):

      openshift-installer:
      openshift-install-4.14.0-rc.5 4.14.0-rc.5
      built from commit e170cbcd2461b3d72a1ea177dc5cbb08d8063559
      release image quay.io/openshift-release-dev/ocp-release@sha256:042899f17f33259ed9f2cfc179930af283733455720f72ea3483fd1905f9b301
      release architecture amd64
      
      vCenter: 7.0.3
      build: 18778458
      
      
      

      How reproducible:

      Intermittent: 2 out of 4 installation attempts failed.

      Steps to Reproduce:

      1. Create an install-config.yaml file for an IPI/vSphere cluster.
      2. Run openshift-install create cluster.
      3. Wait until it finishes.
      

      Actual results:

      Installation times out with an error. The control plane is OK and one worker node is healthy, but the second worker remains powered off.

      Expected results:

      Installation should succeed. 

      Additional info:

       

              rmanak@redhat.com Radek Manak
              rhn-gps-alfredo Alfredo Pizarro
              Zhaohua Sun Zhaohua Sun