OCPBUGS-18472

OCP upgrade 4.12 to 4.13 fails because some pods can't connect to k8s default svc 172.30.0.1


    • Sprint: SDN Sprint 242, SDN Sprint 243
    • Release Note Not Required

      Description of problem:

      During OCP 4.12 to 4.13 upgrades, some pods are unable to reach the default kubernetes service at 172.30.0.1 and hang indefinitely until they are manually restarted. This mainly affects dns-default-* pods, but sometimes also dns-operator-* pods.
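
      A quick way to confirm the symptom (the same check used further down in this report) is to curl the API service from inside an affected pod and compare with the same request made directly from the node hosting it; the pod name below is just a placeholder:

      $ oc -n openshift-dns exec -ti <dns-default-pod> -c kube-rbac-proxy -- curl -k --max-time 10 https://172.30.0.1:443/version
      (and from the node hosting the pod)
      $ curl -k --max-time 10 https://172.30.0.1:443/version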
      

      Version-Release number of selected component (if applicable):

      4.12
      

      How reproducible:

      Very often, 8 out of 10 upgrades.
      

      Steps to Reproduce:

      1. Deploy OCP 4.12 (latest GA) on a bare-metal cluster with IPI and OVN-Kubernetes
      2. Upgrade to the latest 4.13 GA
      3. Check the cluster version status during the upgrade; at some point the upgrade hangs for a long time, usually with the message "Working towards 4.13.X: 694 of 842 done (82% complete), waiting on dns"
      4. Check for non-running pods; you might see pods in a crashing state
      5. Check the pod logs; they show "https://172.30.0.1:443/api?timeout=32: dial tcp 172.30.0.1:443: i/o timeout" (see the command sketch below)
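
      A minimal sketch of the checks in steps 3-5 (the dns pod name is just an example; any crashing pod applies):

      $ oc get clusterversion
      $ oc get pods -A -o wide | grep -vE 'Running|Completed'
      $ oc -n openshift-dns get pods        # look for pods that are not fully READY, e.g. 1/2
      $ oc -n openshift-dns logs <dns-default-pod> -c dns | grep 'i/o timeout'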
      
      

      Actual results:

      The upgrade gets stuck and requires manual intervention to continue while pods remain in a crashing state.
      

      Expected results:

      The upgrade should complete without issues, and pods should not remain stuck in a crashing state.
      

      Additional info:

      • We tested this today with the latest GA versions, 4.12.31 to 4.13.10, but we have been observing the issue since 4.12.28.
      • Our deployments are dual-stack, but we have also observed the issue with single-stack IPv4.
      • The workaround has been to identify the crashing pods and restart them, after which the upgrade continues (a sketch follows this list). We have not found additional errors in the journal logs of the nodes hosting the crashing pods, nor any other misbehaving pods.
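
      A minimal sketch of that workaround, assuming the affected pods are the dns-default ones (the DaemonSet recreates a deleted pod automatically):

      $ oc -n openshift-dns get pods                 # find pods that are not fully READY or keep restarting
      $ oc -n openshift-dns delete pod <pod-name>    # e.g. dns-default-k8hfl; a new pod is created
      $ oc get clusterversion                        # the upgrade should resume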

      This is an example from the latest run, upgrading 4.12.31 to 4.13.10. After some minutes the upgrade got stuck because the dns operator was Degraded, and when checking the pods in the openshift-dns namespace, the pod dns-default-k8hfl was Running but with only one of its two containers ready:

      $ oc get clusterversion
      NAME                                         VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS                                                              
      clusterversion.config.openshift.io/version   4.12.31   True        True          102m    Working towards 4.13.10: 694 of 842 done (82% complete), waiting on dns
      
      $  oc get co | grep 4.12.31
      dns                                        4.12.31   True        True          False      3h45m   DNS "default" reports Progressing=True: "Have 6 available DNS pods, want 7.\nHave 6 up-to-date DNS pods, want 7."...
      machine-config                             4.12.31   True        False         False      161m
      
      $ oc -n openshift-dns get pods -o wide
      NAME                  READY   STATUS    RESTARTS       AGE     IP              NODE       NOMINATED NODE   READINESS GATES
      dns-default-7pghb     2/2     Running   0              14m     10.128.0.5      master-1   <none>           <none>
      dns-default-b25vj     2/2     Running   0              15m     10.129.0.4      master-0   <none>           <none>
      dns-default-k8hfl     1/2     Running   5 (119s ago)   12m     10.130.0.3      master-2   <none>           <none>
      dns-default-mrvh9     2/2     Running   0              13m     10.128.2.5      worker-3   <none>           <none>
      dns-default-pnf8w     2/2     Running   0              15m     10.130.2.4      worker-1   <none>           <none>
      dns-default-px4cn     2/2     Running   4              3h16m   10.129.2.6      worker-2   <none>           <none>
      dns-default-rvj6k     2/2     Running   0              14m     10.131.0.4      worker-0   <none>           <none>
      node-resolver-p6465   1/1     Running   0              16m     192.168.22.24   worker-0   <none>           <none>
      node-resolver-q8t6l   1/1     Running   0              16m     192.168.22.23   master-2   <none>           <none>
      node-resolver-qb8sm   1/1     Running   0              16m     192.168.22.21   master-0   <none>           <none>
      node-resolver-rklnq   1/1     Running   0              16m     192.168.22.25   worker-1   <none>           <none>
      node-resolver-rlbxc   1/1     Running   0              16m     192.168.22.22   master-1   <none>           <none>
      node-resolver-w7x4b   1/1     Running   0              16m     192.168.22.27   worker-3   <none>           <none>
      node-resolver-wb8tt   1/1     Running   0              16m     192.168.22.26   worker-2   <none>           <none>
      

      When checking the pod logs, we see the dns container complaining that it cannot reach the endpoint https://172.30.0.1:443/version, and when testing from the other container in the same pod we confirmed that the URL is unreachable. However, when we test directly from the node running that pod we can reach the endpoint, and other pods running on that node have no issues.

      $ oc -n openshift-dns logs dns-default-k8hfl
      Defaulted container "dns" out of: dns, kube-rbac-proxy
      [INFO] plugin/kubernetes: waiting for Kubernetes API before starting server
      [INFO] plugin/kubernetes: waiting for Kubernetes API before starting server
      [INFO] plugin/kubernetes: waiting for Kubernetes API before starting server
      [INFO] plugin/kubernetes: waiting for Kubernetes API before starting server
      [INFO] plugin/kubernetes: waiting for Kubernetes API before starting server
      [INFO] plugin/kubernetes: waiting for Kubernetes API before starting server
      [INFO] plugin/kubernetes: waiting for Kubernetes API before starting server
      [INFO] plugin/kubernetes: waiting for Kubernetes API before starting server
      [INFO] plugin/kubernetes: waiting for Kubernetes API before starting server
      [WARNING] plugin/kubernetes: starting server with unsynced Kubernetes API
      .:5353
      hostname.bind.:5353
      [INFO] plugin/reload: Running configuration SHA512 = e100c1081a47648310f72de96fbdbe31f928f02784eda1155c53be749ad04c434e50da55f960a800606274fb080d8a1f79df7effa47afa9a02bddd9f96192e18
      CoreDNS-1.10.1
      linux/amd64, go1.19.10 X:strictfipsruntime, 
      [WARNING] plugin/kubernetes: Kubernetes API connection failure: Get "https://172.30.0.1:443/version": dial tcp 172.30.0.1:443: i/o timeout
      
      $ oc -n openshift-dns exec -ti dns-default-k8hfl -c kube-rbac-proxy -- /bin/bash                                      
      bash-4.4$ curl https://172.30.0.1:443/readyz
      curl: (7) Failed to connect to 172.30.0.1 port 443: Connection timed out
      
      [core@master-2 ~]$ curl -k https://172.30.0.1:443/readyz
      ok
      

      If we delete that pod, it is recreated, and this time both containers in the pod are running. This unblocks the upgrade, which continues with the next cluster operator until it finishes.

      $ oc -n openshift-dns delete pod dns-default-k8hfl
      pod "dns-default-k8hfl" deleted 
      
      [kni@provisioner.cluster2.dfwt5g.lab ~]$ oc -n openshift-dns get pods -o wide                                                                                                                                                                
      NAME                  READY   STATUS    RESTARTS   AGE     IP              NODE       NOMINATED NODE   READINESS GATES                                                                                                                       
      dns-default-7pghb     2/2     Running   0          21m     10.128.0.5      master-1   <none>           <none>                                                                                                                                
      dns-default-b25vj     2/2     Running   0          22m     10.129.0.4      master-0   <none>           <none>                                                                                                                                
      dns-default-l7v9r     2/2     Running   0          51s     10.130.0.5      master-2   <none>           <none>                                                                                                                                
      dns-default-mrvh9     2/2     Running   0          20m     10.128.2.5      worker-3   <none>           <none>                                                                                                                                
      dns-default-pnf8w     2/2     Running   0          23m     10.130.2.4      worker-1   <none>           <none>
      dns-default-vlgtb     2/2     Running   0          14s     10.129.2.6      worker-2   <none>           <none>
      dns-default-rvj6k     2/2     Running   0          22m     10.131.0.4      worker-0   <none>           <none>                                                                                                                                
      node-resolver-p6465   1/1     Running   0          23m     192.168.22.24   worker-0   <none>           <none>                                                                                                                                
      node-resolver-q8t6l   1/1     Running   0          23m     192.168.22.23   master-2   <none>           <none>                                                                                                                                
      node-resolver-qb8sm   1/1     Running   0          23m     192.168.22.21   master-0   <none>           <none>                                                                                                                                
      node-resolver-rklnq   1/1     Running   0          23m     192.168.22.25   worker-1   <none>           <none>                                                                                                                                
      node-resolver-rlbxc   1/1     Running   0          23m     192.168.22.22   master-1   <none>           <none>                                                                                                                                
      node-resolver-w7x4b   1/1     Running   0          23m     192.168.22.27   worker-3   <none>           <none>                                                                                                                                
      node-resolver-wb8tt   1/1     Running   0          23m     192.168.22.26   worker-2   <none>           <none> 
      
      $ oc get co | grep 4.12.31
      machine-config                             4.12.31   True        False         False      169m
      
      $ oc get clusterversion
      NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
      version   4.12.31   True        True          112m    Working towards 4.13.10: 714 of 842 done (84% complete), waiting on machine-config
      

            jcaamano@redhat.com Jaime Caamaño Ruiz
            rhn-gps-manrodri Manuel Rodriguez
            Anurag Saxena