OpenShift Bugs / OCPBUGS-20241

OCP upgrade 4.12 to 4.13 fails because some pods can't connect to k8s default svc 172.30.0.1



      This is a clone of issue OCPBUGS-18472. The following is the description of the original issue:

      Description of problem:

      During OCP 4.12 to 4.13 upgrades, some pods are not able to reach the default kubernetes service 172.30.0.1 and hang forever until they are manually restarted. This affects mainly dns-default-* pods, but sometimes also dns-operator-* pods.
      

      Version-Release number of selected component (if applicable):

      4.12
      

      How reproducible:

      Very often, 8 out of 10 upgrades.
      

      Steps to Reproduce:

      1. Deploy OCP 4.12 with latest GA on a baremetal cluster with IPI and OVN-K
      2. Upgrade to latest 4.13 GA
      3. Check the cluster version status during the upgrade; at some point the upgrade hangs for a long time, usually with the message "Working towards 4.13.X: 694 of 842 done (82% complete), waiting on dns"
      4. Check for non-running pods; you might see pods in a crashing status
      5. Check the pod logs; they show "https://172.30.0.1:443/api?timeout=32: dial tcp 172.30.0.1:443: i/o timeout" (example commands for these checks are sketched below)
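
      The checks in steps 3-5 can be run with commands along the following lines (a sketch only; the pod name is a placeholder and the grep pattern is illustrative):

      # Step 3: watch upgrade progress
      $ oc get clusterversion

      # Step 4: look for pods whose STATUS is not Running/Completed
      $ oc get pods -A -o wide | grep -Ev 'Running|Completed'
      # ...and for dns pods that are not fully ready or are restarting
      $ oc -n openshift-dns get pods -o wide

      # Step 5: check the dns container logs of a suspect pod
      $ oc -n openshift-dns logs dns-default-<xxxxx> -c dns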
      
      

      Actual results:

      The upgrade gets stuck, or requires manual intervention to continue, when pods remain in a crashing status.
      

      Expected results:

      The upgrade should complete without issues, and pods should not remain stuck in a crashing status.
      

      Additional info:

      • We have tested this today with the latest GA versions, 4.12.31 to 4.13.10, but we have been observing the issue since 4.12.28.
      • Our deployments are dual-stack, but we have observed the issue with single-stack IPv4 as well.
      • The workaround has been to identify the pods in a crashing status and restart them, after which the upgrade continues (see the sketch below). We haven't found additional errors in the journal logs of the nodes hosting the crashing pods, nor other pods misbehaving.
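
      A minimal sketch of that workaround, assuming the affected pods are the dns-default-* pods in openshift-dns as in this report (the awk filter is just one way to pick out pods that are not 2/2 ready):

      # Restart any dns-default pod that is not fully ready; the DaemonSet recreates it
      $ oc -n openshift-dns get pods --no-headers \
          | awk '$1 ~ /^dns-default/ && $2 != "2/2" {print $1}' \
          | xargs -r oc -n openshift-dns delete pod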

      This is an example from the latest run, upgrading 4.12.31 to 4.13.10. After some minutes the upgrade got stuck because the dns operator was Degraded, and when checking the pods in the openshift-dns namespace, the pod dns-default-k8hfl was Running but with only one of its two containers ready:

      $ oc get clusterversion
      NAME                                         VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS                                                              
      clusterversion.config.openshift.io/version   4.12.31   True        True          102m    Working towards 4.13.10: 694 of 842 done (82% complete), waiting on dns
      
      $  oc get co | grep 4.12.31
      dns                                        4.12.31   True        True          False      3h45m   DNS "default" reports Progressing=True: "Have 6 available DNS pods, want 7.\nHave 6 up-to-date DNS pods, want 7."...
      machine-config                             4.12.31   True        False         False      161m
      
      $ oc -n openshift-dns get pods -o wide
      NAME                  READY   STATUS    RESTARTS       AGE     IP              NODE       NOMINATED NODE   READINESS GATES
      dns-default-7pghb     2/2     Running   0              14m     10.128.0.5      master-1   <none>           <none>
      dns-default-b25vj     2/2     Running   0              15m     10.129.0.4      master-0   <none>           <none>
      dns-default-k8hfl     1/2     Running   5 (119s ago)   12m     10.130.0.3      master-2   <none>           <none>
      dns-default-mrvh9     2/2     Running   0              13m     10.128.2.5      worker-3   <none>           <none>
      dns-default-pnf8w     2/2     Running   0              15m     10.130.2.4      worker-1   <none>           <none>
      dns-default-px4cn     2/2     Running   4              3h16m   10.129.2.6      worker-2   <none>           <none>
      dns-default-rvj6k     2/2     Running   0              14m     10.131.0.4      worker-0   <none>           <none>
      node-resolver-p6465   1/1     Running   0              16m     192.168.22.24   worker-0   <none>           <none>
      node-resolver-q8t6l   1/1     Running   0              16m     192.168.22.23   master-2   <none>           <none>
      node-resolver-qb8sm   1/1     Running   0              16m     192.168.22.21   master-0   <none>           <none>
      node-resolver-rklnq   1/1     Running   0              16m     192.168.22.25   worker-1   <none>           <none>
      node-resolver-rlbxc   1/1     Running   0              16m     192.168.22.22   master-1   <none>           <none>
      node-resolver-w7x4b   1/1     Running   0              16m     192.168.22.27   worker-3   <none>           <none>
      node-resolver-wb8tt   1/1     Running   0              16m     192.168.22.26   worker-2   <none>           <none>
      

      When checking the pod logs we see that the dns container complains it cannot reach the endpoint https://172.30.0.1:443/version, and when testing from the other container (kube-rbac-proxy) we confirmed that the URL cannot be reached from there. However, when we test directly from the node running that pod, the endpoint is reachable, and other pods running on the same node do not have issues.

      $ oc -n openshift-dns logs dns-default-k8hfl
      Defaulted container "dns" out of: dns, kube-rbac-proxy
      [INFO] plugin/kubernetes: waiting for Kubernetes API before starting server
      [INFO] plugin/kubernetes: waiting for Kubernetes API before starting server
      [INFO] plugin/kubernetes: waiting for Kubernetes API before starting server
      [INFO] plugin/kubernetes: waiting for Kubernetes API before starting server
      [INFO] plugin/kubernetes: waiting for Kubernetes API before starting server
      [INFO] plugin/kubernetes: waiting for Kubernetes API before starting server
      [INFO] plugin/kubernetes: waiting for Kubernetes API before starting server
      [INFO] plugin/kubernetes: waiting for Kubernetes API before starting server
      [INFO] plugin/kubernetes: waiting for Kubernetes API before starting server
      [WARNING] plugin/kubernetes: starting server with unsynced Kubernetes API
      .:5353
      hostname.bind.:5353
      [INFO] plugin/reload: Running configuration SHA512 = e100c1081a47648310f72de96fbdbe31f928f02784eda1155c53be749ad04c434e50da55f960a800606274fb080d8a1f79df7effa47afa9a02bddd9f96192e18
      CoreDNS-1.10.1
      linux/amd64, go1.19.10 X:strictfipsruntime, 
      [WARNING] plugin/kubernetes: Kubernetes API connection failure: Get "https://172.30.0.1:443/version": dial tcp 172.30.0.1:443: i/o timeout
      
      $ oc -n openshift-dns exec -ti dns-default-k8hfl -c kube-rbac-proxy -- /bin/bash                                      
      bash-4.4$ curl https://172.30.0.1:443/readyz
      curl: (7) Failed to connect to 172.30.0.1 port 443: Connection timed out
      
      [core@master-2 ~]$ curl -k https://172.30.0.1:443/readyz
      ok
      

      If we delete that pod, it gets recreated, and this time both containers in the pod are running. This unblocks the upgrade, which continues with the next cluster operator until it finishes.

      $ oc -n openshift-dns delete pod dns-default-k8hfl
      pod "dns-default-k8hfl" deleted 
      
      [kni@provisioner.cluster2.dfwt5g.lab ~]$ oc -n openshift-dns get pods -o wide                                                                                                                                                                
      NAME                  READY   STATUS    RESTARTS   AGE     IP              NODE       NOMINATED NODE   READINESS GATES                                                                                                                       
      dns-default-7pghb     2/2     Running   0          21m     10.128.0.5      master-1   <none>           <none>                                                                                                                                
      dns-default-b25vj     2/2     Running   0          22m     10.129.0.4      master-0   <none>           <none>                                                                                                                                
      dns-default-l7v9r     2/2     Running   0          51s     10.130.0.5      master-2   <none>           <none>                                                                                                                                
      dns-default-mrvh9     2/2     Running   0          20m     10.128.2.5      worker-3   <none>           <none>                                                                                                                                
      dns-default-pnf8w     2/2     Running   0          23m     10.130.2.4      worker-1   <none>           <none>
      dns-default-vlgtb     2/2     Running   0          14s     10.129.2.6      worker-2   <none>           <none>
      dns-default-rvj6k     2/2     Running   0          22m     10.131.0.4      worker-0   <none>           <none>                                                                                                                                
      node-resolver-p6465   1/1     Running   0          23m     192.168.22.24   worker-0   <none>           <none>                                                                                                                                
      node-resolver-q8t6l   1/1     Running   0          23m     192.168.22.23   master-2   <none>           <none>                                                                                                                                
      node-resolver-qb8sm   1/1     Running   0          23m     192.168.22.21   master-0   <none>           <none>                                                                                                                                
      node-resolver-rklnq   1/1     Running   0          23m     192.168.22.25   worker-1   <none>           <none>                                                                                                                                
      node-resolver-rlbxc   1/1     Running   0          23m     192.168.22.22   master-1   <none>           <none>                                                                                                                                
      node-resolver-w7x4b   1/1     Running   0          23m     192.168.22.27   worker-3   <none>           <none>                                                                                                                                
      node-resolver-wb8tt   1/1     Running   0          23m     192.168.22.26   worker-2   <none>           <none> 
      
      $ oc get co | grep 4.12.31
      machine-config                             4.12.31   True        False         False      169m
      
      $ oc get clusterversion
      NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
      version   4.12.31   True        True          112m    Working towards 4.13.10: 714 of 842 done (84% complete), waiting on machine-config
      
