Bug
Resolution: Done
Normal
4.12
Quality / Stability / Reliability
False
This is a clone of issue OCPBUGS-5018. The following is the description of the original issue:
—
Description of problem:
When upgrading an IPI AWS cluster from 4.11 to 4.12 that included both MachineSet and BYOH Windows nodes, the upgrade hung while trying to upgrade the machine-api component:
$ oc get clusterversion
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.11.0-0.nightly-2022-12-16-190443 True True 117m Working towards 4.12.0-rc.5: 214 of 827 done (25% complete), waiting on machine-api
$ oc get co
NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE MESSAGE
authentication 4.11.0-0.nightly-2022-12-16-190443 True False False 4h47m
baremetal 4.11.0-0.nightly-2022-12-16-190443 True False False 4h59m
cloud-controller-manager 4.12.0-rc.5 True False False 5h3m
cloud-credential 4.11.0-0.nightly-2022-12-16-190443 True False False 5h4m
cluster-autoscaler 4.11.0-0.nightly-2022-12-16-190443 True False False 4h59m
config-operator 4.12.0-rc.5 True False False 5h1m
console 4.11.0-0.nightly-2022-12-16-190443 True False False 4h43m
csi-snapshot-controller 4.11.0-0.nightly-2022-12-16-190443 True False False 5h
dns 4.11.0-0.nightly-2022-12-16-190443 True False False 4h59m
etcd 4.12.0-rc.5 True False False 4h58m
image-registry 4.11.0-0.nightly-2022-12-16-190443 True False False 4h54m
ingress 4.11.0-0.nightly-2022-12-16-190443 True False False 4h55m
insights 4.11.0-0.nightly-2022-12-16-190443 True False False 4h53m
kube-apiserver 4.12.0-rc.5 True False False 4h50m
kube-controller-manager 4.12.0-rc.5 True False False 4h57m
kube-scheduler 4.12.0-rc.5 True False False 4h57m
kube-storage-version-migrator 4.11.0-0.nightly-2022-12-16-190443 True False False 5h
machine-api 4.11.0-0.nightly-2022-12-16-190443 True True False 4h56m Progressing towards operator: 4.12.0-rc.5
machine-approver 4.11.0-0.nightly-2022-12-16-190443 True False False 5h
machine-config 4.11.0-0.nightly-2022-12-16-190443 True False False 4h59m
marketplace 4.11.0-0.nightly-2022-12-16-190443 True False False 4h59m
monitoring 4.11.0-0.nightly-2022-12-16-190443 True False False 4h53m
network 4.11.0-0.nightly-2022-12-16-190443 True False False 5h3m
node-tuning 4.11.0-0.nightly-2022-12-16-190443 True False False 4h59m
openshift-apiserver 4.11.0-0.nightly-2022-12-16-190443 True False False 4h53m
openshift-controller-manager 4.11.0-0.nightly-2022-12-16-190443 True False False 4h56m
openshift-samples 4.11.0-0.nightly-2022-12-16-190443 True False False 4h55m
operator-lifecycle-manager 4.11.0-0.nightly-2022-12-16-190443 True False False 5h
operator-lifecycle-manager-catalog 4.11.0-0.nightly-2022-12-16-190443 True False False 5h
operator-lifecycle-manager-packageserver 4.11.0-0.nightly-2022-12-16-190443 True False False 4h55m
service-ca 4.11.0-0.nightly-2022-12-16-190443 True False False 5h
storage 4.11.0-0.nightly-2022-12-16-190443 True False False 5h
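For reference, a hedged way to read the Progressing condition message of the stuck operator directly (this check is not part of the original report; the jsonpath filter is standard oc/kubectl syntax and the output will vary):
$ oc get co machine-api -o jsonpath='{.status.conditions[?(@.type=="Progressing")].message}'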
Digging a little deeper into the exact component that was hanging, we observed that the machine-api-termination-handler pods running on the MachineSet Windows workers were the ones in ImagePullBackOff state:
$ oc get pods -n openshift-machine-api
NAME READY STATUS RESTARTS AGE
cluster-autoscaler-operator-6ff66b6655-kpgp9 2/2 Running 0 5h5m
cluster-baremetal-operator-6dbcd6f76b-d9dwd 2/2 Running 0 5h5m
machine-api-controllers-cdb8d979b-79xlh 7/7 Running 0 94m
machine-api-operator-86bf4f6d79-g2vwm 2/2 Running 0 97m
machine-api-termination-handler-fcfq2 0/1 ImagePullBackOff 0 94m
machine-api-termination-handler-gj4pf 1/1 Running 0 4h57m
machine-api-termination-handler-krwdg 0/1 ImagePullBackOff 0 94m
machine-api-termination-handler-l95x2 1/1 Running 0 4h54m
machine-api-termination-handler-p6sw6 1/1 Running 0 4h57m
$ oc describe pods machine-api-termination-handler-fcfq2 -n openshift-machine-api
Name: machine-api-termination-handler-fcfq2
Namespace: openshift-machine-api
Priority: 2000001000
Priority Class Name: system-node-critical
.....................................................................
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 94m default-scheduler Successfully assigned openshift-machine-api/machine-api-termination-handler-fcfq2 to ip-10-0-145-114.us-east-2.compute.internal
Warning FailedCreatePodSandBox 94m kubelet Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "7b80f84cc547310f5370a7dde7c651ca661dd40ebd0730296329d1cbe8981b37": plugin type="win-overlay" name="OVNKubernetesHybridOverlayNetwork" failed (add): error while adding HostComputeEndpoint: failed to create the new HostComputeEndpoint: hcnCreateEndpoint failed in Win32: The object already exists. (0x1392) {"Success":false,"Error":"The object already exists. ","ErrorCode":2147947410}
Warning FailedCreatePodSandBox 94m kubelet Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "6b3e020a419dde8359a31b56129c65821011e232467d712f9f5081f32fe380c9": plugin type="win-overlay" name="OVNKubernetesHybridOverlayNetwork" failed (add): error while adding HostComputeEndpoint: failed to create the new HostComputeEndpoint: hcnCreateEndpoint failed in Win32: The object already exists. (0x1392) {"Success":false,"Error":"The object already exists. ","ErrorCode":2147947410}
Normal Pulling 93m (x4 over 94m) kubelet Pulling image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:9aa96cb22047b62f785b87bf81ec1762703c1489079dd33008085b5585adc258"
Warning Failed 93m (x4 over 94m) kubelet Error: ErrImagePull
Normal BackOff 4m39s (x393 over 94m) kubelet Back-off pulling image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:9aa96cb22047b62f785b87bf81ec1762703c1489079dd33008085b5585adc258"
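For triage (this check is not part of the original report), the digest being pulled in the events above can be inspected to see which OS/architecture it provides; a Linux-only manifest would be consistent with the pull failing on a Windows node. A registry pull secret may be needed via -a/--registry-config:
$ oc image info quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:9aa96cb22047b62f785b87bf81ec1762703c1489079dd33008085b5585adc258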
$ oc get pods -n openshift-machine-api -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
cluster-autoscaler-operator-6ff66b6655-kpgp9 2/2 Running 0 5h8m 10.130.0.10 ip-10-0-180-35.us-east-2.compute.internal <none> <none>
cluster-baremetal-operator-6dbcd6f76b-d9dwd 2/2 Running 0 5h8m 10.130.0.8 ip-10-0-180-35.us-east-2.compute.internal <none> <none>
machine-api-controllers-cdb8d979b-79xlh 7/7 Running 0 97m 10.128.0.144 ip-10-0-138-246.us-east-2.compute.internal <none> <none>
machine-api-operator-86bf4f6d79-g2vwm 2/2 Running 0 100m 10.128.0.143 ip-10-0-138-246.us-east-2.compute.internal <none> <none>
machine-api-termination-handler-fcfq2 0/1 ImagePullBackOff 0 97m 10.129.0.7 ip-10-0-145-114.us-east-2.compute.internal <none> <none>
machine-api-termination-handler-gj4pf 1/1 Running 0 5h 10.0.223.37 ip-10-0-223-37.us-east-2.compute.internal <none> <none>
machine-api-termination-handler-krwdg 0/1 ImagePullBackOff 0 97m 10.128.0.4 ip-10-0-143-111.us-east-2.compute.internal <none> <none>
machine-api-termination-handler-l95x2 1/1 Running 0 4h57m 10.0.172.211 ip-10-0-172-211.us-east-2.compute.internal <none> <none>
machine-api-termination-handler-p6sw6 1/1 Running 0 5h 10.0.146.227 ip-10-0-146-227.us-east-2.compute.internal <none> <none>
[jfrancoa@localhost byoh-auto]$ oc get nodes -o wide | grep ip-10-0-143-111.us-east-2.compute.internal
ip-10-0-143-111.us-east-2.compute.internal Ready worker 4h24m v1.24.0-2566+5157800f2a3bc3 10.0.143.111 <none> Windows Server 2019 Datacenter 10.0.17763.3770 containerd://1.18
[jfrancoa@localhost byoh-auto]$ oc get nodes -o wide | grep ip-10-0-145-114.us-east-2.compute.internal
ip-10-0-145-114.us-east-2.compute.internal Ready worker 4h18m v1.24.0-2566+5157800f2a3bc3 10.0.145.114 <none> Windows Server 2019 Datacenter 10.0.17763.3770 containerd://1.18
[jfrancoa@localhost byoh-auto]$ oc get machine.machine.openshift.io -n openshift-machine-api -o wide | grep ip-10-0-145-114.us-east-2.compute.internal
jfrancoa-1912-aws-rvkrp-windows-worker-us-east-2a-v57sh Running m5a.large us-east-2 us-east-2a 4h37m ip-10-0-145-114.us-east-2.compute.internal aws:///us-east-2a/i-0b69d52c625c46a6a running
[jfrancoa@localhost byoh-auto]$ oc get machine.machine.openshift.io -n openshift-machine-api -o wide | grep ip-10-0-143-111.us-east-2.compute.internal
jfrancoa-1912-aws-rvkrp-windows-worker-us-east-2a-j6gkc Running m5a.large us-east-2 us-east-2a 4h37m ip-10-0-143-111.us-east-2.compute.internal aws:///us-east-2a/i-05e422c0051707d16 running
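A hedged way to confirm which workload owns the failing pods and why they land on the Windows workers (this assumes the pods belong to a machine-api-termination-handler DaemonSet; the exact nodeSelector and node labels will vary by cluster):
$ oc get pod machine-api-termination-handler-krwdg -n openshift-machine-api -o jsonpath='{.metadata.ownerReferences[0].kind}/{.metadata.ownerReferences[0].name}'
$ oc get daemonset machine-api-termination-handler -n openshift-machine-api -o jsonpath='{.spec.template.spec.nodeSelector}'
$ oc get nodes -l kubernetes.io/os=windows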
This is blocking the whole upgrade process, as the upgrade is not able to move further from this component.
Version-Release number of selected component (if applicable):
$ oc get clusterversion
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.11.0-0.nightly-2022-12-16-190443 True True 141m Working towards 4.12.0-rc.5: 214 of 827 done (25% complete), waiting on machine-api
$ oc version
Client Version: 4.11.0-0.ci-2022-06-09-065118
Kustomize Version: v4.5.4
Server Version: 4.11.0-0.nightly-2022-12-16-190443
Kubernetes Version: v1.25.4+77bec7a
How reproducible:
Always
Steps to Reproduce:
1. Deploy a 4.11 IPI AWS cluster with Windows workers using a MachineSet
2. Perform the upgrade to 4.12 (an example command is shown after this list)
3. Wait for the upgrade to hang on the machine-api component
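A sketch of step 2, assuming the upgrade is triggered with an explicit release image (the pullspec shown is an assumption and may differ for the environment under test):
$ oc adm upgrade --to-image=quay.io/openshift-release-dev/ocp-release:4.12.0-rc.5-x86_64 --allow-explicit-upgrade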
Actual results:
The upgrade hangs when upgrading the machine-api component.
Expected results:
The upgrade succeeds.
Additional info:
- clones: OCPBUGS-5018 Upgrade from 4.11 to 4.12 with Windows machine workers (Spot Instances) failing due to: hcnCreateEndpoint failed in Win32: The object already exists. (Closed)
- is blocked by: OCPBUGS-5018 Upgrade from 4.11 to 4.12 with Windows machine workers (Spot Instances) failing due to: hcnCreateEndpoint failed in Win32: The object already exists. (Closed)
- links to