Bug
Resolution: Done
Normal
4.12
Quality / Stability / Reliability
False
This is a clone of issue OCPBUGS-5018. The following is the description of the original issue:
—
Description of problem:
When upgrading an IPI AWS cluster from 4.11 to 4.12 that included both MachineSet and BYOH Windows nodes, the upgrade hung while trying to upgrade the machine-api component:
$ oc get clusterversion
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.11.0-0.nightly-2022-12-16-190443 True True 117m Working towards 4.12.0-rc.5: 214 of 827 done (25% complete), waiting on machine-api
$ oc get co
NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE MESSAGE
authentication 4.11.0-0.nightly-2022-12-16-190443 True False False 4h47m
baremetal 4.11.0-0.nightly-2022-12-16-190443 True False False 4h59m
cloud-controller-manager 4.12.0-rc.5 True False False 5h3m
cloud-credential 4.11.0-0.nightly-2022-12-16-190443 True False False 5h4m
cluster-autoscaler 4.11.0-0.nightly-2022-12-16-190443 True False False 4h59m
config-operator 4.12.0-rc.5 True False False 5h1m
console 4.11.0-0.nightly-2022-12-16-190443 True False False 4h43m
csi-snapshot-controller 4.11.0-0.nightly-2022-12-16-190443 True False False 5h
dns 4.11.0-0.nightly-2022-12-16-190443 True False False 4h59m
etcd 4.12.0-rc.5 True False False 4h58m
image-registry 4.11.0-0.nightly-2022-12-16-190443 True False False 4h54m
ingress 4.11.0-0.nightly-2022-12-16-190443 True False False 4h55m
insights 4.11.0-0.nightly-2022-12-16-190443 True False False 4h53m
kube-apiserver 4.12.0-rc.5 True False False 4h50m
kube-controller-manager 4.12.0-rc.5 True False False 4h57m
kube-scheduler 4.12.0-rc.5 True False False 4h57m
kube-storage-version-migrator 4.11.0-0.nightly-2022-12-16-190443 True False False 5h
machine-api 4.11.0-0.nightly-2022-12-16-190443 True True False 4h56m Progressing towards operator: 4.12.0-rc.5
machine-approver 4.11.0-0.nightly-2022-12-16-190443 True False False 5h
machine-config 4.11.0-0.nightly-2022-12-16-190443 True False False 4h59m
marketplace 4.11.0-0.nightly-2022-12-16-190443 True False False 4h59m
monitoring 4.11.0-0.nightly-2022-12-16-190443 True False False 4h53m
network 4.11.0-0.nightly-2022-12-16-190443 True False False 5h3m
node-tuning 4.11.0-0.nightly-2022-12-16-190443 True False False 4h59m
openshift-apiserver 4.11.0-0.nightly-2022-12-16-190443 True False False 4h53m
openshift-controller-manager 4.11.0-0.nightly-2022-12-16-190443 True False False 4h56m
openshift-samples 4.11.0-0.nightly-2022-12-16-190443 True False False 4h55m
operator-lifecycle-manager 4.11.0-0.nightly-2022-12-16-190443 True False False 5h
operator-lifecycle-manager-catalog 4.11.0-0.nightly-2022-12-16-190443 True False False 5h
operator-lifecycle-manager-packageserver 4.11.0-0.nightly-2022-12-16-190443 True False False 4h55m
service-ca 4.11.0-0.nightly-2022-12-16-190443 True False False 5h
storage 4.11.0-0.nightly-2022-12-16-190443 True False False 5h
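For reference, a hedged way to read the Progressing condition message of the stuck operator directly (this check is not part of the original report; the jsonpath filter is standard oc/kubectl syntax and the output will vary):
$ oc get co machine-api -o jsonpath='{.status.conditions[?(@.type=="Progressing")].message}'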
Digging a little deeper into the exact component that was hanging, we observed that the machine-api-termination-handler pods running on the MachineSet Windows workers were the ones in ImagePullBackOff state:
$ oc get pods -n openshift-machine-api
NAME READY STATUS RESTARTS AGE
cluster-autoscaler-operator-6ff66b6655-kpgp9 2/2 Running 0 5h5m
cluster-baremetal-operator-6dbcd6f76b-d9dwd 2/2 Running 0 5h5m
machine-api-controllers-cdb8d979b-79xlh 7/7 Running 0 94m
machine-api-operator-86bf4f6d79-g2vwm 2/2 Running 0 97m
machine-api-termination-handler-fcfq2 0/1 ImagePullBackOff 0 94m
machine-api-termination-handler-gj4pf 1/1 Running 0 4h57m
machine-api-termination-handler-krwdg 0/1 ImagePullBackOff 0 94m
machine-api-termination-handler-l95x2 1/1 Running 0 4h54m
machine-api-termination-handler-p6sw6 1/1 Running 0 4h57m
$ oc describe pods machine-api-termination-handler-fcfq2 -n openshift-machine-api
Name: machine-api-termination-handler-fcfq2
Namespace: openshift-machine-api
Priority: 2000001000
Priority Class Name: system-node-critical
.....................................................................
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 94m default-scheduler Successfully assigned openshift-machine-api/machine-api-termination-handler-fcfq2 to ip-10-0-145-114.us-east-2.compute.internal
Warning FailedCreatePodSandBox 94m kubelet Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "7b80f84cc547310f5370a7dde7c651ca661dd40ebd0730296329d1cbe8981b37": plugin type="win-overlay" name="OVNKubernetesHybridOverlayNetwork" failed (add): error while adding HostComputeEndpoint: failed to create the new HostComputeEndpoint: hcnCreateEndpoint failed in Win32: The object already exists. (0x1392) {"Success":false,"Error":"The object already exists. ","ErrorCode":2147947410}
Warning FailedCreatePodSandBox 94m kubelet Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "6b3e020a419dde8359a31b56129c65821011e232467d712f9f5081f32fe380c9": plugin type="win-overlay" name="OVNKubernetesHybridOverlayNetwork" failed (add): error while adding HostComputeEndpoint: failed to create the new HostComputeEndpoint: hcnCreateEndpoint failed in Win32: The object already exists. (0x1392) {"Success":false,"Error":"The object already exists. ","ErrorCode":2147947410}
Normal Pulling 93m (x4 over 94m) kubelet Pulling image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:9aa96cb22047b62f785b87bf81ec1762703c1489079dd33008085b5585adc258"
Warning Failed 93m (x4 over 94m) kubelet Error: ErrImagePull
Normal BackOff 4m39s (x393 over 94m) kubelet Back-off pulling image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:9aa96cb22047b62f785b87bf81ec1762703c1489079dd33008085b5585adc258"
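For triage (this check is not part of the original report), the digest being pulled in the events above can be inspected to see which OS/architecture it provides; a Linux-only manifest would be consistent with the pull failing on a Windows node. A registry pull secret may be needed via -a/--registry-config:
$ oc image info quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:9aa96cb22047b62f785b87bf81ec1762703c1489079dd33008085b5585adc258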
$ oc get pods -n openshift-machine-api -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
cluster-autoscaler-operator-6ff66b6655-kpgp9 2/2 Running 0 5h8m 10.130.0.10 ip-10-0-180-35.us-east-2.compute.internal <none> <none>
cluster-baremetal-operator-6dbcd6f76b-d9dwd 2/2 Running 0 5h8m 10.130.0.8 ip-10-0-180-35.us-east-2.compute.internal <none> <none>
machine-api-controllers-cdb8d979b-79xlh 7/7 Running 0 97m 10.128.0.144 ip-10-0-138-246.us-east-2.compute.internal <none> <none>
machine-api-operator-86bf4f6d79-g2vwm 2/2 Running 0 100m 10.128.0.143 ip-10-0-138-246.us-east-2.compute.internal <none> <none>
machine-api-termination-handler-fcfq2 0/1 ImagePullBackOff 0 97m 10.129.0.7 ip-10-0-145-114.us-east-2.compute.internal <none> <none>
machine-api-termination-handler-gj4pf 1/1 Running 0 5h 10.0.223.37 ip-10-0-223-37.us-east-2.compute.internal <none> <none>
machine-api-termination-handler-krwdg 0/1 ImagePullBackOff 0 97m 10.128.0.4 ip-10-0-143-111.us-east-2.compute.internal <none> <none>
machine-api-termination-handler-l95x2 1/1 Running 0 4h57m 10.0.172.211 ip-10-0-172-211.us-east-2.compute.internal <none> <none>
machine-api-termination-handler-p6sw6 1/1 Running 0 5h 10.0.146.227 ip-10-0-146-227.us-east-2.compute.internal <none> <none>
[jfrancoa@localhost byoh-auto]$ oc get nodes -o wide | grep ip-10-0-143-111.us-east-2.compute.internal
ip-10-0-143-111.us-east-2.compute.internal Ready worker 4h24m v1.24.0-2566+5157800f2a3bc3 10.0.143.111 <none> Windows Server 2019 Datacenter 10.0.17763.3770 containerd://1.18
[jfrancoa@localhost byoh-auto]$ oc get nodes -o wide | grep ip-10-0-145-114.us-east-2.compute.internal
ip-10-0-145-114.us-east-2.compute.internal Ready worker 4h18m v1.24.0-2566+5157800f2a3bc3 10.0.145.114 <none> Windows Server 2019 Datacenter 10.0.17763.3770 containerd://1.18
[jfrancoa@localhost byoh-auto]$ oc get machine.machine.openshift.io -n openshift-machine-api -o wide | grep ip-10-0-145-114.us-east-2.compute.internal
jfrancoa-1912-aws-rvkrp-windows-worker-us-east-2a-v57sh Running m5a.large us-east-2 us-east-2a 4h37m ip-10-0-145-114.us-east-2.compute.internal aws:///us-east-2a/i-0b69d52c625c46a6a running
[jfrancoa@localhost byoh-auto]$ oc get machine.machine.openshift.io -n openshift-machine-api -o wide | grep ip-10-0-143-111.us-east-2.compute.internal
jfrancoa-1912-aws-rvkrp-windows-worker-us-east-2a-j6gkc Running m5a.large us-east-2 us-east-2a 4h37m ip-10-0-143-111.us-east-2.compute.internal aws:///us-east-2a/i-05e422c0051707d16 running
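A hedged way to confirm which workload owns the failing pods and why they land on the Windows workers (this assumes the pods belong to a machine-api-termination-handler DaemonSet; the exact nodeSelector and node labels will vary by cluster):
$ oc get pod machine-api-termination-handler-krwdg -n openshift-machine-api -o jsonpath='{.metadata.ownerReferences[0].kind}/{.metadata.ownerReferences[0].name}'
$ oc get daemonset machine-api-termination-handler -n openshift-machine-api -o jsonpath='{.spec.template.spec.nodeSelector}'
$ oc get nodes -l kubernetes.io/os=windows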
This is blocking the whole upgrade process, as the upgrade is not able to move further from this component.
Version-Release number of selected component (if applicable):
$ oc get clusterversion
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.11.0-0.nightly-2022-12-16-190443 True True 141m Working towards 4.12.0-rc.5: 214 of 827 done (25% complete), waiting on machine-api
$ oc version
Client Version: 4.11.0-0.ci-2022-06-09-065118
Kustomize Version: v4.5.4
Server Version: 4.11.0-0.nightly-2022-12-16-190443
Kubernetes Version: v1.25.4+77bec7a
How reproducible:
Always
Steps to Reproduce:
1. Deploy a 4.11 IPI AWS cluster with Windows workers using a MachineSet
2. Perform the upgrade to 4.12 (an example command is shown after this list)
3. Wait for the upgrade to hang on the machine-api component
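A sketch of step 2, assuming the upgrade is triggered with an explicit release image (the pullspec shown is an assumption and may differ for the environment under test):
$ oc adm upgrade --to-image=quay.io/openshift-release-dev/ocp-release:4.12.0-rc.5-x86_64 --allow-explicit-upgrade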
Actual results:
The upgrade hangs when upgrading the machine-api component.
Expected results:
The upgrade succeeds.
Additional info:
- clones: OCPBUGS-5018 Upgrade from 4.11 to 4.12 with Windows machine workers (Spot Instances) failing due to: hcnCreateEndpoint failed in Win32: The object already exists. (Closed)
- is blocked by: OCPBUGS-5018 Upgrade from 4.11 to 4.12 with Windows machine workers (Spot Instances) failing due to: hcnCreateEndpoint failed in Win32: The object already exists. (Closed)
- links to