OpenShift Bugs / OCPBUGS-5417

Upgrade from 4.11 to 4.12 with Windows machine workers (Spot Instances) failing due to: hcnCreateEndpoint failed in Win32: The object already exists.


      This is a clone of issue OCPBUGS-5018. The following is the description of the original issue:
      Description of problem:

      When upgrading an IPI AWS cluster from 4.11 to 4.12, with Windows nodes added both through a MachineSet and as BYOH instances, the upgrade hung while updating the machine-api component:
      
      $ oc get clusterversion                                                                              
      NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS                                      
      version   4.11.0-0.nightly-2022-12-16-190443   True        True          117m    Working towards 4.12.0-rc.5: 214 of 827 done (25% complete), waiting on machine-api
      
      $ oc get co                                                                                                                                                                                                                              
      NAME                                       VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE                                                                                                                                   
      authentication                             4.11.0-0.nightly-2022-12-16-190443   True        False         False      4h47m   
      baremetal                                  4.11.0-0.nightly-2022-12-16-190443   True        False         False      4h59m   
      cloud-controller-manager                   4.12.0-rc.5                          True        False         False      5h3m    
      cloud-credential                           4.11.0-0.nightly-2022-12-16-190443   True        False         False      5h4m                                                                                                                                              
      cluster-autoscaler                         4.11.0-0.nightly-2022-12-16-190443   True        False         False      4h59m   
      config-operator                            4.12.0-rc.5                          True        False         False      5h1m    
      console                                    4.11.0-0.nightly-2022-12-16-190443   True        False         False      4h43m   
      csi-snapshot-controller                    4.11.0-0.nightly-2022-12-16-190443   True        False         False      5h      
      dns                                        4.11.0-0.nightly-2022-12-16-190443   True        False         False      4h59m   
      etcd                                       4.12.0-rc.5                          True        False         False      4h58m         
      image-registry                             4.11.0-0.nightly-2022-12-16-190443   True        False         False      4h54m         
      ingress                                    4.11.0-0.nightly-2022-12-16-190443   True        False         False      4h55m   
      insights                                   4.11.0-0.nightly-2022-12-16-190443   True        False         False      4h53m         
      kube-apiserver                             4.12.0-rc.5                          True        False         False      4h50m         
      kube-controller-manager                    4.12.0-rc.5                          True        False         False      4h57m                                                                                                                                             
      kube-scheduler                             4.12.0-rc.5                          True        False         False      4h57m
      kube-storage-version-migrator              4.11.0-0.nightly-2022-12-16-190443   True        False         False      5h
      machine-api                                4.11.0-0.nightly-2022-12-16-190443   True        True          False      4h56m   Progressing towards operator: 4.12.0-rc.5
      machine-approver                           4.11.0-0.nightly-2022-12-16-190443   True        False         False      5h
      machine-config                             4.11.0-0.nightly-2022-12-16-190443   True        False         False      4h59m
      marketplace                                4.11.0-0.nightly-2022-12-16-190443   True        False         False      4h59m
      monitoring                                 4.11.0-0.nightly-2022-12-16-190443   True        False         False      4h53m                                                                                                                                             
      network                                    4.11.0-0.nightly-2022-12-16-190443   True        False         False      5h3m          
      node-tuning                                4.11.0-0.nightly-2022-12-16-190443   True        False         False      4h59m                                                                                                                                             
      openshift-apiserver                        4.11.0-0.nightly-2022-12-16-190443   True        False         False      4h53m         
      openshift-controller-manager               4.11.0-0.nightly-2022-12-16-190443   True        False         False      4h56m                                                                                                                                             
      openshift-samples                          4.11.0-0.nightly-2022-12-16-190443   True        False         False      4h55m                                                                                                                                             
      operator-lifecycle-manager                 4.11.0-0.nightly-2022-12-16-190443   True        False         False      5h                                                                                                                                                
      operator-lifecycle-manager-catalog         4.11.0-0.nightly-2022-12-16-190443   True        False         False      5h                                                                                                                                                
      operator-lifecycle-manager-packageserver   4.11.0-0.nightly-2022-12-16-190443   True        False         False      4h55m                                                                                                                                             
      service-ca                                 4.11.0-0.nightly-2022-12-16-190443   True        False         False      5h                                                                                                                                                
      storage                                    4.11.0-0.nightly-2022-12-16-190443   True        False         False      5h      
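
In a listing this long, the blocking operator is easy to miss. A minimal sketch for picking it out, assuming the column order shown above (NAME, VERSION, AVAILABLE, PROGRESSING, DEGRADED); the function name `progressing_operators` is hypothetical and not part of `oc`:

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads `oc get co` table output on stdin and prints the
# name of every cluster operator whose PROGRESSING column (4th field) is True.
# The header row is skipped implicitly, because its 4th field is "PROGRESSING".
progressing_operators() {
  awk '$4 == "True" { print $1 }'
}

# Intended use against a live cluster (not run here):
#   oc get co | progressing_operators
```

Applied to the output above, this should print only machine-api, which matches the "waiting on machine-api" status reported by the cluster version.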
      
      Looking closer at the hanging component, we found that the machine-api-termination-handler pods scheduled on the MachineSet-provisioned Windows workers were in the ImagePullBackOff state:
      
      $ oc get pods -n openshift-machine-api                                                                                                                                                                                                   
      NAME                                           READY   STATUS             RESTARTS   AGE                                                                                                                                                                               
      cluster-autoscaler-operator-6ff66b6655-kpgp9   2/2     Running            0          5h5m                                                                                                                                                                              
      cluster-baremetal-operator-6dbcd6f76b-d9dwd    2/2     Running            0          5h5m                                          
      machine-api-controllers-cdb8d979b-79xlh        7/7     Running            0          94m                                                                                                                                                                               
      machine-api-operator-86bf4f6d79-g2vwm          2/2     Running            0          97m                                           
      machine-api-termination-handler-fcfq2          0/1     ImagePullBackOff   0          94m                                                                                                                                                                               
      machine-api-termination-handler-gj4pf          1/1     Running            0          4h57m                                                                                                                                                                             
      machine-api-termination-handler-krwdg          0/1     ImagePullBackOff   0          94m                                                                                                                                                                               
      machine-api-termination-handler-l95x2          1/1     Running            0          4h54m                                                                                                                                                                             
      machine-api-termination-handler-p6sw6          1/1     Running            0          4h57m   
      
      $ oc describe pods machine-api-termination-handler-fcfq2 -n openshift-machine-api                                                                                                                                                        
      Name:                 machine-api-termination-handler-fcfq2
      Namespace:            openshift-machine-api
      Priority:             2000001000
      Priority Class Name:  system-node-critical
      .....................................................................
      Events:
        Type     Reason                  Age                    From               Message
        ----     ------                  ----                   ----               -------
        Normal   Scheduled               94m                    default-scheduler  Successfully assigned openshift-machine-api/machine-api-termination-handler-fcfq2 to ip-10-0-145-114.us-east-2.compute.internal
        Warning  FailedCreatePodSandBox  94m                    kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "7b80f84cc547310f5370a7dde7c651ca661dd40ebd0730296329d1cbe8981b37": plugin type="win-overlay" name="OVNKubernetesHybridOverlayNetwork" failed (add): error while adding HostComputeEndpoint: failed to create the new HostComputeEndpoint: hcnCreateEndpoint failed in Win32: The object already exists. (0x1392) {"Success":false,"Error":"The object already exists. ","ErrorCode":2147947410}
        Warning  FailedCreatePodSandBox  94m                    kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "6b3e020a419dde8359a31b56129c65821011e232467d712f9f5081f32fe380c9": plugin type="win-overlay" name="OVNKubernetesHybridOverlayNetwork" failed (add): error while adding HostComputeEndpoint: failed to create the new HostComputeEndpoint: hcnCreateEndpoint failed in Win32: The object already exists. (0x1392) {"Success":false,"Error":"The object already exists. ","ErrorCode":2147947410}
        Normal   Pulling                 93m (x4 over 94m)      kubelet            Pulling image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:9aa96cb22047b62f785b87bf81ec1762703c1489079dd33008085b5585adc258"
        Warning  Failed                  93m (x4 over 94m)      kubelet            Error: ErrImagePull
        Normal   BackOff                 4m39s (x393 over 94m)  kubelet            Back-off pulling image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:9aa96cb22047b62f785b87bf81ec1762703c1489079dd33008085b5585adc258"
      
      
      $ oc get pods -n openshift-machine-api -o wide
      NAME                                           READY   STATUS             RESTARTS   AGE     IP             NODE                                         NOMINATED NODE   READINESS GATES
      cluster-autoscaler-operator-6ff66b6655-kpgp9   2/2     Running            0          5h8m    10.130.0.10    ip-10-0-180-35.us-east-2.compute.internal    <none>           <none>
      cluster-baremetal-operator-6dbcd6f76b-d9dwd    2/2     Running            0          5h8m    10.130.0.8     ip-10-0-180-35.us-east-2.compute.internal    <none>           <none>
      machine-api-controllers-cdb8d979b-79xlh        7/7     Running            0          97m     10.128.0.144   ip-10-0-138-246.us-east-2.compute.internal   <none>           <none>
      machine-api-operator-86bf4f6d79-g2vwm          2/2     Running            0          100m    10.128.0.143   ip-10-0-138-246.us-east-2.compute.internal   <none>           <none>
      machine-api-termination-handler-fcfq2          0/1     ImagePullBackOff   0          97m     10.129.0.7     ip-10-0-145-114.us-east-2.compute.internal   <none>           <none>
      machine-api-termination-handler-gj4pf          1/1     Running            0          5h      10.0.223.37    ip-10-0-223-37.us-east-2.compute.internal    <none>           <none>
      machine-api-termination-handler-krwdg          0/1     ImagePullBackOff   0          97m     10.128.0.4     ip-10-0-143-111.us-east-2.compute.internal   <none>           <none>
      machine-api-termination-handler-l95x2          1/1     Running            0          4h57m   10.0.172.211   ip-10-0-172-211.us-east-2.compute.internal   <none>           <none>
      machine-api-termination-handler-p6sw6          1/1     Running            0          5h      10.0.146.227   ip-10-0-146-227.us-east-2.compute.internal   <none>           <none>
      [jfrancoa@localhost byoh-auto]$ oc get nodes -o wide | grep ip-10-0-143-111.us-east-2.compute.internal
      ip-10-0-143-111.us-east-2.compute.internal   Ready    worker   4h24m   v1.24.0-2566+5157800f2a3bc3   10.0.143.111   <none>        Windows Server 2019 Datacenter                                  10.0.17763.3770                containerd://1.18
      [jfrancoa@localhost byoh-auto]$ oc get nodes -o wide | grep ip-10-0-145-114.us-east-2.compute.internal
      ip-10-0-145-114.us-east-2.compute.internal   Ready    worker   4h18m   v1.24.0-2566+5157800f2a3bc3   10.0.145.114   <none>        Windows Server 2019 Datacenter                                  10.0.17763.3770                containerd://1.18
      [jfrancoa@localhost byoh-auto]$ oc get machine.machine.openshift.io -n openshift-machine-api -o wide | grep ip-10-0-145-114.us-east-2.compute.internal
      jfrancoa-1912-aws-rvkrp-windows-worker-us-east-2a-v57sh   Running   m5a.large    us-east-2   us-east-2a   4h37m   ip-10-0-145-114.us-east-2.compute.internal   aws:///us-east-2a/i-0b69d52c625c46a6a   running
      [jfrancoa@localhost byoh-auto]$ oc get machine.machine.openshift.io -n openshift-machine-api -o wide | grep ip-10-0-143-111.us-east-2.compute.internal
      jfrancoa-1912-aws-rvkrp-windows-worker-us-east-2a-j6gkc   Running   m5a.large    us-east-2   us-east-2a   4h37m   ip-10-0-143-111.us-east-2.compute.internal   aws:///us-east-2a/i-05e422c0051707d16   running
      
      This blocks the whole upgrade, because the upgrade cannot progress past the machine-api component.
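
The pod-to-node lookup done by hand above can be sketched as a small filter. This assumes the `-o wide` column order shown in the listing (STATUS as the 3rd field, NODE as the 7th) and a RESTARTS column that is a single field; the function name `backoff_pods` is hypothetical:

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads `oc get pods -o wide` output on stdin and prints
# "<pod> <node>" for every pod whose STATUS is ImagePullBackOff or ErrImagePull.
# Assumes STATUS is field 3 and NODE is field 7, as in the listing above.
backoff_pods() {
  awk '$3 == "ImagePullBackOff" || $3 == "ErrImagePull" { print $1, $7 }'
}

# Intended use against a live cluster (not run here):
#   oc get pods -n openshift-machine-api -o wide | backoff_pods
#   oc get nodes -l kubernetes.io/os=windows -o wide   # compare node names
```

Comparing the printed node names with the nodes carrying the `kubernetes.io/os=windows` label should show whether the failing pods are the ones placed on Windows workers, as was the case here.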
      
      

      Version-Release number of selected component (if applicable):

      $ oc get clusterversion
      NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
      version   4.11.0-0.nightly-2022-12-16-190443   True        True          141m    Working towards 4.12.0-rc.5: 214 of 827 done (25% complete), waiting on machine-api
      $ oc version
      Client Version: 4.11.0-0.ci-2022-06-09-065118
      Kustomize Version: v4.5.4
      Server Version: 4.11.0-0.nightly-2022-12-16-190443
      Kubernetes Version: v1.25.4+77bec7a
      

      How reproducible:

      Always
      

      Steps to Reproduce:

      1. Deploy a 4.11 IPI AWS cluster with Windows workers using a MachineSet
      2. Perform the upgrade to 4.12
      3. Wait for the upgrade to hang on the machine-api component
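
Step 3 can be checked without reading the status line by eye. A minimal sketch, assuming the status message keeps the "waiting on <component>" wording seen in the output above; the function name `waiting_on` is hypothetical:

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads `oc get clusterversion` output on stdin and prints
# the first component named after "waiting on". Prints nothing when the status
# message does not contain that phrase.
waiting_on() {
  sed -n 's/.*waiting on \([^ ,]*\).*/\1/p'
}

# Intended use against a live cluster (not run here):
#   oc get clusterversion | waiting_on
```

If the same component keeps being printed over a long period, as machine-api was in this report, the upgrade is likely stuck on it.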
      

      Actual results:

      The upgrade hangs when upgrading the machine-api component.
      

      Expected results:

      The upgrade succeeds.
      

      Additional info:

      
      


            Errata Tool added a comment -

            Since the problem described in this issue should be resolved in a recent advisory, it has been closed.

            For information on the advisory, and where to find the updated files, follow the link below.

            If the solution does not work for you, open a new bug report.
            https://access.redhat.com/errata/RHSA-2022:7399


            Zhaohua Sun added a comment -

            This was verified before merge; moving it to Done.


            Zhaohua Sun added a comment -

            Verified before PR merge: the upgrade from 4.11.0-0.nightly-2023-01-04-235653 to 0.0.1-0.test-2023-01-05-135126-ci-ln-ki46j5k-latest (build 4.12,openshift/machine-api-operator#1105) was successful.

            $ oc get clusterversion                                                                                                  
            NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
            version   4.11.0-0.nightly-2023-01-04-235653   True        False         44m     Cluster version is 4.11.0-0.nightly-2023-01-04-235653
            $ oc adm upgrade --to-image='registry.build05.ci.openshift.org/ci-ln-ki46j5k/release:latest' --force --allow-explicit-upgrade
            Requesting update to release image registry.build05.ci.openshift.org/ci-ln-ki46j5k/release:latest
            $ oc get clusterversion      
            NAME      VERSION                                               AVAILABLE   PROGRESSING   SINCE   STATUS
            version   0.0.1-0.test-2023-01-05-135126-ci-ln-ki46j5k-latest   True        False         3m19s   Cluster version is 0.0.1-0.test-2023-01-05-135126-ci-ln-ki46j5k-latest
            $ oc get po     
            NAME                                                  READY   STATUS    RESTARTS   AGE
            cluster-autoscaler-operator-74755fdcc4-dnf8r          2/2     Running   0          17m
            cluster-baremetal-operator-646574547f-s5tgd           2/2     Running   0          17m
            control-plane-machine-set-operator-7b9998968c-rhgks   1/1     Running   0          17m
            machine-api-controllers-6b9985b47-chndj               7/7     Running   0          10m
            machine-api-operator-755668d696-79xht                 2/2     Running   0          10m
            machine-api-termination-handler-drmqc                 1/1     Running   1          52m
            machine-api-termination-handler-prbdt                 1/1     Running   1          52m
            machine-api-termination-handler-qwn6j                 1/1     Running   1          52m 


              joelspeed Joel Speed
              openshift-crt-jira-prow OpenShift Prow Bot
              Zhaohua Sun Zhaohua Sun