OCPBUGS-5018

Upgrade from 4.11 to 4.12 with Windows machine workers (Spot Instances) failing due to: hcnCreateEndpoint failed in Win32: The object already exists.


Details

    • CLOUD Sprint 230
    • 1
    • Proposed
    • False

    Description

      Description of problem:

      While upgrading an IPI AWS cluster that included both MachineSet and BYOH Windows nodes from 4.11 to 4.12, the upgrade hung on the machine-api component:
      
      $ oc get clusterversion                                                                              
      NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS                                      
      version   4.11.0-0.nightly-2022-12-16-190443   True        True          117m    Working towards 4.12.0-rc.5: 214 of 827 done (25% complete), waiting on machine-api
      
      $ oc get co                                                                                                                                                                                                                              
      NAME                                       VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE                                                                                                                                   
      authentication                             4.11.0-0.nightly-2022-12-16-190443   True        False         False      4h47m   
      baremetal                                  4.11.0-0.nightly-2022-12-16-190443   True        False         False      4h59m   
      cloud-controller-manager                   4.12.0-rc.5                          True        False         False      5h3m    
      cloud-credential                           4.11.0-0.nightly-2022-12-16-190443   True        False         False      5h4m                                                                                                                                              
      cluster-autoscaler                         4.11.0-0.nightly-2022-12-16-190443   True        False         False      4h59m   
      config-operator                            4.12.0-rc.5                          True        False         False      5h1m    
      console                                    4.11.0-0.nightly-2022-12-16-190443   True        False         False      4h43m   
      csi-snapshot-controller                    4.11.0-0.nightly-2022-12-16-190443   True        False         False      5h      
      dns                                        4.11.0-0.nightly-2022-12-16-190443   True        False         False      4h59m   
      etcd                                       4.12.0-rc.5                          True        False         False      4h58m         
      image-registry                             4.11.0-0.nightly-2022-12-16-190443   True        False         False      4h54m         
      ingress                                    4.11.0-0.nightly-2022-12-16-190443   True        False         False      4h55m   
      insights                                   4.11.0-0.nightly-2022-12-16-190443   True        False         False      4h53m         
      kube-apiserver                             4.12.0-rc.5                          True        False         False      4h50m         
      kube-controller-manager                    4.12.0-rc.5                          True        False         False      4h57m                                                                                                                                             
      kube-scheduler                             4.12.0-rc.5                          True        False         False      4h57m
      kube-storage-version-migrator              4.11.0-0.nightly-2022-12-16-190443   True        False         False      5h
      machine-api                                4.11.0-0.nightly-2022-12-16-190443   True        True          False      4h56m   Progressing towards operator: 4.12.0-rc.5
      machine-approver                           4.11.0-0.nightly-2022-12-16-190443   True        False         False      5h
      machine-config                             4.11.0-0.nightly-2022-12-16-190443   True        False         False      4h59m
      marketplace                                4.11.0-0.nightly-2022-12-16-190443   True        False         False      4h59m
      monitoring                                 4.11.0-0.nightly-2022-12-16-190443   True        False         False      4h53m                                                                                                                                             
      network                                    4.11.0-0.nightly-2022-12-16-190443   True        False         False      5h3m          
      node-tuning                                4.11.0-0.nightly-2022-12-16-190443   True        False         False      4h59m                                                                                                                                             
      openshift-apiserver                        4.11.0-0.nightly-2022-12-16-190443   True        False         False      4h53m         
      openshift-controller-manager               4.11.0-0.nightly-2022-12-16-190443   True        False         False      4h56m                                                                                                                                             
      openshift-samples                          4.11.0-0.nightly-2022-12-16-190443   True        False         False      4h55m                                                                                                                                             
      operator-lifecycle-manager                 4.11.0-0.nightly-2022-12-16-190443   True        False         False      5h                                                                                                                                                
      operator-lifecycle-manager-catalog         4.11.0-0.nightly-2022-12-16-190443   True        False         False      5h                                                                                                                                                
      operator-lifecycle-manager-packageserver   4.11.0-0.nightly-2022-12-16-190443   True        False         False      4h55m                                                                                                                                             
      service-ca                                 4.11.0-0.nightly-2022-12-16-190443   True        False         False      5h                                                                                                                                                
      storage                                    4.11.0-0.nightly-2022-12-16-190443   True        False         False      5h      
      
      Digging deeper, we found that the blocking component was machine-api-termination-handler. Its pods running on the Machine (MachineSet) Windows workers were in ImagePullBackOff state:
      
      $ oc get pods -n openshift-machine-api                                                                                                                                                                                                   
      NAME                                           READY   STATUS             RESTARTS   AGE                                                                                                                                                                               
      cluster-autoscaler-operator-6ff66b6655-kpgp9   2/2     Running            0          5h5m                                                                                                                                                                              
      cluster-baremetal-operator-6dbcd6f76b-d9dwd    2/2     Running            0          5h5m                                          
      machine-api-controllers-cdb8d979b-79xlh        7/7     Running            0          94m                                                                                                                                                                               
      machine-api-operator-86bf4f6d79-g2vwm          2/2     Running            0          97m                                           
      machine-api-termination-handler-fcfq2          0/1     ImagePullBackOff   0          94m                                                                                                                                                                               
      machine-api-termination-handler-gj4pf          1/1     Running            0          4h57m                                                                                                                                                                             
      machine-api-termination-handler-krwdg          0/1     ImagePullBackOff   0          94m                                                                                                                                                                               
      machine-api-termination-handler-l95x2          1/1     Running            0          4h54m                                                                                                                                                                             
      machine-api-termination-handler-p6sw6          1/1     Running            0          4h57m   
      
      $ oc describe pods machine-api-termination-handler-fcfq2 -n openshift-machine-api                                                                                                                                                        
      Name:                 machine-api-termination-handler-fcfq2
      Namespace:            openshift-machine-api
      Priority:             2000001000
      Priority Class Name:  system-node-critical
      .....................................................................
      Events:
        Type     Reason                  Age                    From               Message
        ----     ------                  ----                   ----               -------
        Normal   Scheduled               94m                    default-scheduler  Successfully assigned openshift-machine-api/machine-api-termination-handler-fcfq2 to ip-10-0-145-114.us-east-2.compute.internal
        Warning  FailedCreatePodSandBox  94m                    kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "7b80f84cc547310f5370a7dde7c651ca661dd40ebd0730296329d1cbe8981b37": plugin type="win-overlay" name="OVNKubernetesHybridOverlayNetwork" failed (add): error while adding HostComputeEndpoint: failed to create the new HostComputeEndpoint: hcnCreateEndpoint failed in Win32: The object already exists. (0x1392) {"Success":false,"Error":"The object already exists. ","ErrorCode":2147947410}
        Warning  FailedCreatePodSandBox  94m                    kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "6b3e020a419dde8359a31b56129c65821011e232467d712f9f5081f32fe380c9": plugin type="win-overlay" name="OVNKubernetesHybridOverlayNetwork" failed (add): error while adding HostComputeEndpoint: failed to create the new HostComputeEndpoint: hcnCreateEndpoint failed in Win32: The object already exists. (0x1392) {"Success":false,"Error":"The object already exists. ","ErrorCode":2147947410}
        Normal   Pulling                 93m (x4 over 94m)      kubelet            Pulling image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:9aa96cb22047b62f785b87bf81ec1762703c1489079dd33008085b5585adc258"
        Warning  Failed                  93m (x4 over 94m)      kubelet            Error: ErrImagePull
        Normal   BackOff                 4m39s (x393 over 94m)  kubelet            Back-off pulling image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:9aa96cb22047b62f785b87bf81ec1762703c1489079dd33008085b5585adc258"
      
      
      $ oc get pods -n openshift-machine-api -o wide
      NAME                                           READY   STATUS             RESTARTS   AGE     IP             NODE                                         NOMINATED NODE   READINESS GATES
      cluster-autoscaler-operator-6ff66b6655-kpgp9   2/2     Running            0          5h8m    10.130.0.10    ip-10-0-180-35.us-east-2.compute.internal    <none>           <none>
      cluster-baremetal-operator-6dbcd6f76b-d9dwd    2/2     Running            0          5h8m    10.130.0.8     ip-10-0-180-35.us-east-2.compute.internal    <none>           <none>
      machine-api-controllers-cdb8d979b-79xlh        7/7     Running            0          97m     10.128.0.144   ip-10-0-138-246.us-east-2.compute.internal   <none>           <none>
      machine-api-operator-86bf4f6d79-g2vwm          2/2     Running            0          100m    10.128.0.143   ip-10-0-138-246.us-east-2.compute.internal   <none>           <none>
      machine-api-termination-handler-fcfq2          0/1     ImagePullBackOff   0          97m     10.129.0.7     ip-10-0-145-114.us-east-2.compute.internal   <none>           <none>
      machine-api-termination-handler-gj4pf          1/1     Running            0          5h      10.0.223.37    ip-10-0-223-37.us-east-2.compute.internal    <none>           <none>
      machine-api-termination-handler-krwdg          0/1     ImagePullBackOff   0          97m     10.128.0.4     ip-10-0-143-111.us-east-2.compute.internal   <none>           <none>
      machine-api-termination-handler-l95x2          1/1     Running            0          4h57m   10.0.172.211   ip-10-0-172-211.us-east-2.compute.internal   <none>           <none>
      machine-api-termination-handler-p6sw6          1/1     Running            0          5h      10.0.146.227   ip-10-0-146-227.us-east-2.compute.internal   <none>           <none>
      [jfrancoa@localhost byoh-auto]$ oc get nodes -o wide | grep ip-10-0-143-111.us-east-2.compute.internal
      ip-10-0-143-111.us-east-2.compute.internal   Ready    worker   4h24m   v1.24.0-2566+5157800f2a3bc3   10.0.143.111   <none>        Windows Server 2019 Datacenter                                  10.0.17763.3770                containerd://1.18
      [jfrancoa@localhost byoh-auto]$ oc get nodes -o wide | grep ip-10-0-145-114.us-east-2.compute.internal
      ip-10-0-145-114.us-east-2.compute.internal   Ready    worker   4h18m   v1.24.0-2566+5157800f2a3bc3   10.0.145.114   <none>        Windows Server 2019 Datacenter                                  10.0.17763.3770                containerd://1.18
      [jfrancoa@localhost byoh-auto]$ oc get machine.machine.openshift.io -n openshift-machine-api -o wide | grep ip-10-0-145-114.us-east-2.compute.internal
      jfrancoa-1912-aws-rvkrp-windows-worker-us-east-2a-v57sh   Running   m5a.large    us-east-2   us-east-2a   4h37m   ip-10-0-145-114.us-east-2.compute.internal   aws:///us-east-2a/i-0b69d52c625c46a6a   running
      [jfrancoa@localhost byoh-auto]$ oc get machine.machine.openshift.io -n openshift-machine-api -o wide | grep ip-10-0-143-111.us-east-2.compute.internal
      jfrancoa-1912-aws-rvkrp-windows-worker-us-east-2a-j6gkc   Running   m5a.large    us-east-2   us-east-2a   4h37m   ip-10-0-143-111.us-east-2.compute.internal   aws:///us-east-2a/i-05e422c0051707d16   running
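
The cross-referencing done by hand above (failing pod, then node) could be scripted. The following is a minimal sketch, not part of the original report: it parses the plain-text table printed by `oc get pods -o wide` and maps every pod whose STATUS is not Running to its node. The helper name `failing_pods` is hypothetical, the sample rows are copied from the output above, and the sketch assumes columns are separated by at least two spaces, as they are in that output.

```python
import re

# Sample rows copied from the `oc get pods -n openshift-machine-api -o wide` output in this report.
PODS_WIDE = """\
NAME                                           READY   STATUS             RESTARTS   AGE     IP             NODE                                         NOMINATED NODE   READINESS GATES
machine-api-operator-86bf4f6d79-g2vwm          2/2     Running            0          100m    10.128.0.143   ip-10-0-138-246.us-east-2.compute.internal   <none>           <none>
machine-api-termination-handler-fcfq2          0/1     ImagePullBackOff   0          97m     10.129.0.7     ip-10-0-145-114.us-east-2.compute.internal   <none>           <none>
machine-api-termination-handler-gj4pf          1/1     Running            0          5h      10.0.223.37    ip-10-0-223-37.us-east-2.compute.internal    <none>           <none>
machine-api-termination-handler-krwdg          0/1     ImagePullBackOff   0          97m     10.128.0.4     ip-10-0-143-111.us-east-2.compute.internal   <none>           <none>
machine-api-termination-handler-l95x2          1/1     Running            0          4h57m   10.0.172.211   ip-10-0-172-211.us-east-2.compute.internal   <none>           <none>
"""


def failing_pods(table: str) -> dict[str, str]:
    """Map each pod whose STATUS is not 'Running' to the node it is scheduled on.

    Assumes columns are separated by two or more spaces, so that headers
    such as 'NOMINATED NODE' stay in one field.
    """
    lines = [line for line in table.splitlines() if line.strip()]
    header = re.split(r"\s{2,}", lines[0].strip())
    result = {}
    for line in lines[1:]:
        row = dict(zip(header, re.split(r"\s{2,}", line.strip())))
        if row["STATUS"] != "Running":
            result[row["NAME"]] = row["NODE"]
    return result


if __name__ == "__main__":
    for pod, node in failing_pods(PODS_WIDE).items():
        print(f"{pod} -> {node}")
```

The node names returned this way can then be checked against `oc get nodes -o wide`, as was done above, to confirm that the affected nodes are the Windows workers.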
      
      This blocks the whole upgrade, because the upgrade cannot progress past the machine-api component.
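
For reference, the two numbers in the FailedCreatePodSandBox events (`0x1392` and `"ErrorCode":2147947410`) appear to describe the same failure. The sketch below is illustrative only and not part of the original report: it splits the decimal ErrorCode into the standard HRESULT fields. The function name `decode_hresult` is hypothetical. Facility 7 is FACILITY_WIN32, and Win32 error 0x1392 (5010) is ERROR_OBJECT_ALREADY_EXISTS, which matches the message text "The object already exists."

```python
def decode_hresult(value: int) -> dict:
    """Split a 32-bit HRESULT into its failure bit, facility, and code fields."""
    value &= 0xFFFFFFFF  # treat the input as an unsigned 32-bit value
    return {
        "hex": f"0x{value:08X}",
        "failure": bool(value >> 31),        # severity bit: set means failure
        "facility": (value >> 16) & 0x7FF,   # 11-bit facility field
        "code": value & 0xFFFF,              # low 16 bits: the original error code
    }


FACILITY_WIN32 = 7  # facility used when a Win32 error is wrapped in an HRESULT

# ErrorCode value copied from the event message in this report.
decoded = decode_hresult(2147947410)

if __name__ == "__main__":
    print(decoded["hex"], decoded["facility"] == FACILITY_WIN32, hex(decoded["code"]))
```

In other words, the JSON ErrorCode is the HRESULT wrapping of the Win32 error that hcnCreateEndpoint reports as `0x1392`; it is not a second, independent error.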
      
      

      Version-Release number of selected component (if applicable):

      $ oc get clusterversion
      NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
      version   4.11.0-0.nightly-2022-12-16-190443   True        True          141m    Working towards 4.12.0-rc.5: 214 of 827 done (25% complete), waiting on machine-api
      $ oc version
      Client Version: 4.11.0-0.ci-2022-06-09-065118
      Kustomize Version: v4.5.4
      Server Version: 4.11.0-0.nightly-2022-12-16-190443
      Kubernetes Version: v1.25.4+77bec7a
      

      How reproducible:

      Always
      

      Steps to Reproduce:

      1. Deploy a 4.11 IPI AWS cluster with Windows workers using a MachineSet
      2. Perform the upgrade to 4.12
      3. Wait for the upgrade to hang on the machine-api component
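
To detect step 3 without watching the console, the STATUS column of `oc get clusterversion` can be parsed. This is a minimal sketch, not part of the original report: the function name `parse_upgrade_status` is hypothetical, and the regular expression assumes the message has exactly the shape shown in this report ("Working towards X: N of M done (P% complete), waiting on ..."); other message shapes simply return None.

```python
import re

# Pattern modelled only on the status message shown in this report.
STATUS_RE = re.compile(
    r"Working towards (?P<target>\S+): "
    r"(?P<done>\d+) of (?P<total>\d+) done "
    r"\((?P<percent>\d+)% complete\)"
    r"(?:, waiting on (?P<waiting>.+))?"
)


def parse_upgrade_status(status: str):
    """Return target version, progress counters, and blocking operators, or None if no match."""
    match = STATUS_RE.search(status)
    if match is None:
        return None
    waiting = match.group("waiting")
    return {
        "target": match.group("target"),
        "done": int(match.group("done")),
        "total": int(match.group("total")),
        "percent": int(match.group("percent")),
        # Split on commas in case more than one operator is listed (an assumption).
        "waiting_on": [w.strip() for w in waiting.split(",")] if waiting else [],
    }


# Status string copied from the `oc get clusterversion` output in this report.
STATUS = "Working towards 4.12.0-rc.5: 214 of 827 done (25% complete), waiting on machine-api"
parsed = parse_upgrade_status(STATUS)

if __name__ == "__main__":
    print(parsed)
```

If `waiting_on` still contains `machine-api` and `done` has not changed across repeated checks, the upgrade has reached the hung state described in this report.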
      

      Actual results:

      The upgrade hangs when upgrading the machine-api component.
      

      Expected results:

      The upgrade succeeds.
      

      Additional info:

      
      


            People

              ddonati@redhat.com Damiano Donati
              rhn-engineering-jfrancoa Jose Luis Franco Arza (Inactive)
              Zhaohua Sun Zhaohua Sun