Bug
Resolution: Done-Errata
Major
4.12.z
None
No
SDN Sprint 242, SDN Sprint 243
2
Rejected
False
N/A
Release Note Not Required
Description of problem:
During OCP 4.12 to 4.13 upgrades, some pods cannot reach the default kubernetes service 172.30.0.1 and hang indefinitely until they are manually restarted. This mainly affects dns-default-* pods, but sometimes also dns-operator-* pods.
Version-Release number of selected component (if applicable):
4.12
How reproducible:
Very often, 8 out of 10 upgrades.
Steps to Reproduce:
1. Deploy OCP 4.12 with the latest GA on a baremetal cluster with IPI and OVN-K
2. Upgrade to the latest 4.13 GA
3. Check the cluster version status during the upgrade; at some point the upgrade hangs for a long time, usually with the message "Working towards 4.13.X: 694 of 842 done (82% complete), waiting on dns"
4. Check for non-running pods and you might see pods in Crashing status
5. Check the pod logs; they will show "https://172.30.0.1:443/api?timeout=32: dial tcp 172.30.0.1:443: i/o timeout"
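For reference, a minimal sketch of the commands behind steps 2-5 (the target version 4.13.10 and the pod name dns-default-k8hfl are examples taken from our latest run):

$ oc adm upgrade --to=4.13.10                       # step 2: trigger the upgrade
$ watch -n 30 oc get clusterversion                 # step 3: watch the upgrade progress
$ oc get pods -A | grep -Ev 'Running|Completed'     # step 4: look for pods that are not healthy
$ oc -n openshift-dns get pods                      #         ...and for dns-default pods stuck at 1/2 READY
$ oc -n openshift-dns logs dns-default-k8hfl -c dns --tail=20   # step 5: check the crashing container's logs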
Actual results:
The upgrade gets stuck when pods remain in Crashing status and requires manual intervention to continue.
Expected results:
The upgrade should complete without issues, and pods should not remain stuck in Crashing status.
Additional info:
- We tested this today with the latest GA versions (4.12.31 to 4.13.10), but we have been observing the issue since 4.12.28.
- Our deployments use dual-stack networking, but we have also observed the issue with single-stack IPv4.
- The workaround has been to identify the pods in Crashing status and restart them, after which the upgrade continues. We have not found additional errors in the journal logs of the nodes hosting the crashing pods, nor any other misbehaving pods.
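The workaround boils down to the following (a sketch, assuming the affected pods are the dns-default-* ones reporting 1/2 READY, as in the run below):

$ # list DNS pods whose containers are not all ready
$ oc -n openshift-dns get pods --no-headers | awk '/^dns-default/ && $2 != "2/2"'
$ # delete them so the DaemonSet recreates them; the upgrade then resumes
$ oc -n openshift-dns get pods --no-headers \
    | awk '/^dns-default/ && $2 != "2/2" {print $1}' \
    | xargs -r oc -n openshift-dns delete pod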
This is an example from the latest run, upgrading 4.12.31 to 4.13.10. After some minutes the upgrade got stuck because the dns operator was Degraded, and when checking the pods of the openshift-dns namespace, the pod dns-default-k8hfl was Running but with only one of its two containers ready:
$ oc get clusterversion
NAME                                         VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
clusterversion.config.openshift.io/version   4.12.31   True        True          102m    Working towards 4.13.10: 694 of 842 done (82% complete), waiting on dns

$ oc get co | grep 4.12.31
dns              4.12.31   True   True    False   3h45m   DNS "default" reports Progressing=True: "Have 6 available DNS pods, want 7.\nHave 6 up-to-date DNS pods, want 7."...
machine-config   4.12.31   True   False   False   161m

$ oc -n openshift-dns get pods -o wide
NAME                  READY   STATUS    RESTARTS       AGE     IP              NODE       NOMINATED NODE   READINESS GATES
dns-default-7pghb     2/2     Running   0              14m     10.128.0.5      master-1   <none>           <none>
dns-default-b25vj     2/2     Running   0              15m     10.129.0.4      master-0   <none>           <none>
dns-default-k8hfl     1/2     Running   5 (119s ago)   12m     10.130.0.3      master-2   <none>           <none>
dns-default-mrvh9     2/2     Running   0              13m     10.128.2.5      worker-3   <none>           <none>
dns-default-pnf8w     2/2     Running   0              15m     10.130.2.4      worker-1   <none>           <none>
dns-default-px4cn     2/2     Running   4              3h16m   10.129.2.6      worker-2   <none>           <none>
dns-default-rvj6k     2/2     Running   0              14m     10.131.0.4      worker-0   <none>           <none>
node-resolver-p6465   1/1     Running   0              16m     192.168.22.24   worker-0   <none>           <none>
node-resolver-q8t6l   1/1     Running   0              16m     192.168.22.23   master-2   <none>           <none>
node-resolver-qb8sm   1/1     Running   0              16m     192.168.22.21   master-0   <none>           <none>
node-resolver-rklnq   1/1     Running   0              16m     192.168.22.25   worker-1   <none>           <none>
node-resolver-rlbxc   1/1     Running   0              16m     192.168.22.22   master-1   <none>           <none>
node-resolver-w7x4b   1/1     Running   0              16m     192.168.22.27   worker-3   <none>           <none>
node-resolver-wb8tt   1/1     Running   0              16m     192.168.22.26   worker-2   <none>           <none>
Checking the pod logs, we see the dns container complaining that it cannot reach the endpoint https://172.30.0.1:443/version, and testing from the other container (kube-rbac-proxy) confirmed that we cannot reach that URL. However, when we test directly from the node running that pod, we do reach the endpoint, and other pods running on the same node have no issues.
$ oc -n openshift-dns logs dns-default-k8hfl
Defaulted container "dns" out of: dns, kube-rbac-proxy
[INFO] plugin/kubernetes: waiting for Kubernetes API before starting server
[INFO] plugin/kubernetes: waiting for Kubernetes API before starting server
[INFO] plugin/kubernetes: waiting for Kubernetes API before starting server
[INFO] plugin/kubernetes: waiting for Kubernetes API before starting server
[INFO] plugin/kubernetes: waiting for Kubernetes API before starting server
[INFO] plugin/kubernetes: waiting for Kubernetes API before starting server
[INFO] plugin/kubernetes: waiting for Kubernetes API before starting server
[INFO] plugin/kubernetes: waiting for Kubernetes API before starting server
[INFO] plugin/kubernetes: waiting for Kubernetes API before starting server
[WARNING] plugin/kubernetes: starting server with unsynced Kubernetes API
.:5353
hostname.bind.:5353
[INFO] plugin/reload: Running configuration SHA512 = e100c1081a47648310f72de96fbdbe31f928f02784eda1155c53be749ad04c434e50da55f960a800606274fb080d8a1f79df7effa47afa9a02bddd9f96192e18
CoreDNS-1.10.1
linux/amd64, go1.19.10 X:strictfipsruntime,
[WARNING] plugin/kubernetes: Kubernetes API connection failure: Get "https://172.30.0.1:443/version": dial tcp 172.30.0.1:443: i/o timeout

$ oc -n openshift-dns exec -ti dns-default-k8hfl -c kube-rbac-proxy -- /bin/bash
bash-4.4$ curl https://172.30.0.1:443/readyz
curl: (7) Failed to connect to 172.30.0.1 port 443: Connection timed out

[core@master-2 ~]$ curl -k https://172.30.0.1:443/readyz
ok
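The same comparison can be reproduced non-interactively (a sketch; the pod and node names are the ones from this run):

$ # from inside the affected pod's network namespace: times out
$ oc -n openshift-dns exec dns-default-k8hfl -c kube-rbac-proxy -- curl -sk --max-time 10 https://172.30.0.1:443/readyz
$ # from the host network of the node running that pod: returns "ok"
$ oc debug node/master-2 -- chroot /host curl -sk https://172.30.0.1:443/readyz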
If we delete that pod, it gets recreated, and this time both containers in the pod are running. This unblocks the upgrade, which continues with the next cluster operator until it finishes.
$ oc -n openshift-dns delete pod dns-default-k8hfl
pod "dns-default-k8hfl" deleted

[kni@provisioner.cluster2.dfwt5g.lab ~]$ oc -n openshift-dns get pods -o wide
NAME                  READY   STATUS    RESTARTS   AGE   IP              NODE       NOMINATED NODE   READINESS GATES
dns-default-7pghb     2/2     Running   0          21m   10.128.0.5      master-1   <none>           <none>
dns-default-b25vj     2/2     Running   0          22m   10.129.0.4      master-0   <none>           <none>
dns-default-l7v9r     2/2     Running   0          51s   10.130.0.5      master-2   <none>           <none>
dns-default-mrvh9     2/2     Running   0          20m   10.128.2.5      worker-3   <none>           <none>
dns-default-pnf8w     2/2     Running   0          23m   10.130.2.4      worker-1   <none>           <none>
dns-default-vlgtb     2/2     Running   0          14s   10.129.2.6      worker-2   <none>           <none>
dns-default-rvj6k     2/2     Running   0          22m   10.131.0.4      worker-0   <none>           <none>
node-resolver-p6465   1/1     Running   0          23m   192.168.22.24   worker-0   <none>           <none>
node-resolver-q8t6l   1/1     Running   0          23m   192.168.22.23   master-2   <none>           <none>
node-resolver-qb8sm   1/1     Running   0          23m   192.168.22.21   master-0   <none>           <none>
node-resolver-rklnq   1/1     Running   0          23m   192.168.22.25   worker-1   <none>           <none>
node-resolver-rlbxc   1/1     Running   0          23m   192.168.22.22   master-1   <none>           <none>
node-resolver-w7x4b   1/1     Running   0          23m   192.168.22.27   worker-3   <none>           <none>
node-resolver-wb8tt   1/1     Running   0          23m   192.168.22.26   worker-2   <none>           <none>

$ oc get co | grep 4.12.31
Cluster Operators
machine-config   4.12.31   True   False   False   169m

$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.12.31   True        True          112m    Working towards 4.13.10: 714 of 842 done (84% complete), waiting on machine-config
blocks: OCPBUGS-20241 OCP upgrade 4.12 to 4.13 fails because some pods can't connect to k8s default svc 172.30.0.1 (Closed)
is cloned by: OCPBUGS-20241 OCP upgrade 4.12 to 4.13 fails because some pods can't connect to k8s default svc 172.30.0.1 (Closed)
links to: RHBA-2023:5672 OpenShift Container Platform 4.13.z bug fix update