OCPBUGS-12968

Upgrades are failing at scale - possible network/dns lookup issues

      Description of problem:

      During regular release-to-release performance and scale testing, an attempt to upgrade a loaded 249-node classic ROSA cluster from 4.13.0-rc.4 to 4.13.0-rc.6 fails. Initially, all the control plane operators upgrade (including the network operator, which takes a long time because of the sheer size of the cluster and an SBDB of around 225 MB), and the cluster then moves on to upgrading the machine config pools (MCPs) for the control plane and workers. All the machine config pools are eventually upgraded, but the upgrade remains stuck in Progressing for several hours because a few cluster operators become degraded at some point during the MCP upgrades.
      
      Operators such as authentication, console, and insights report i/o timeouts because they are unable to reach certain URLs (most likely DNS resolution failures). The resolution failures appear on certain control plane/worker nodes without any clear pattern.
      
       The operator has some internal errors: Unable to report: unable to build request to connect to Insights server: Post "https://console.stage.redhat.com/api/ingress/v1/upload": dial tcp: lookup console.stage.redhat.com: i/o timeout
      For the record, it took about 3 hours 10 minutes for the control plane to upgrade initially, but the cluster eventually gets stuck upgrading because of the DNS resolution failures / i/o timeouts that start at some point while the control plane and worker MCPs upgrade.
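
      For reference, the degraded operators and the messages above can be seen with standard oc commands along these lines (a sketch, not the exact commands captured during the run):

      oc get clusteroperators                  # which operators are Degraded / not Available
      oc describe clusteroperator insights     # status conditions and error messages for one of them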

       

      Version-Release number of selected component (if applicable):

      4.13.0-rc.4 -> 4.13.0-rc.6

      How reproducible:

      100%

      Steps to Reproduce:

      1. Create a 249 node classic ROSA cluster
      2. Load up the cluster using cluster-density-v1 with ITERATIONS=4000 and gc=false (https://github.com/cloud-bulldozer/e2e-benchmarking/tree/master/workloads/kube-burner-ocp-wrapper); an example invocation is sketched below
      3. Upgrade the cluster
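
      For completeness, the load in step 2 was driven with the kube-burner-ocp-wrapper from the repository linked above. The invocation was roughly the following; the WORKLOAD/GC variable names are assumptions based on that wrapper's interface and may differ between versions (the report itself only records ITERATIONS=4000 and gc=false):

      # run from workloads/kube-burner-ocp-wrapper with KUBECONFIG pointing at the 249-node cluster
      WORKLOAD=cluster-density-v1 ITERATIONS=4000 GC=false ./run.sh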
      

      Actual results:

      Upgrade is still stuck in Progressing even after 24 hours

      Expected results:

      Upgrade completes

      Additional info:

      Picking one or two of the operators that are degraded because they cannot reach certain URLs, in the case of the insights operator we can see:
      I0430 03:08:13.334836       1 controller.go:220] Number of last upload failures 59 exceeded the threshold 5. Marking as degraded.
      I0430 03:08:13.334889       1 controller.go:428] The operator has some internal errors: Unable to report: unable to build request to connect to Insights server: Post "https://console.stage.redhat.com/api/ingress/v1/upload": dial tcp: lookup console.stage.redhat.com: i/o timeout
      I0430 03:08:13.334942       1 controller.go:325] No status update necessary, objects are identical
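
      The messages above are from the insights operator logs, gathered with something like:

      oc -n openshift-insights logs deployment/insights-operator | grep -E 'Marking as degraded|i/o timeout'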

      Looking at the insights operator pod by rsh'ing into it:

      NAME                                 READY   STATUS    RESTARTS   AGE   IP            NODE                                        NOMINATED NODE   READINESS GATES
      insights-operator-64fcd46588-6ssk6   1/1     Running   0          25h   10.129.0.35   ip-10-0-171-52.us-west-2.compute.internal   <none>           <none>
      sh-4.4$ nslookup kubernetes
      ;; connection timed out; no servers could be reached
      
      sh-4.4$ nslookup console.stage.redhat.com
      ;; connection timed out; no servers could be reached 
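
      The shell above was obtained by rsh'ing into the insights operator pod from the listing, i.e. roughly:

      oc -n openshift-insights rsh insights-operator-64fcd46588-6ssk6
      # then, inside the pod:
      nslookup kubernetes
      nslookup console.stage.redhat.com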

      Next, looking at console:

      NAME                        READY   STATUS             RESTARTS       AGE     IP             NODE                                         NOMINATED NODE   READINESS GATES
      console-7b9596cf96-4b789    0/1     CrashLoopBackOff   51 (68s ago)   4h46m   10.130.1.131   ip-10-0-214-54.us-west-2.compute.internal    <none>           <none>
      console-7b9596cf96-q4ll2    1/1     Running            0              25h     10.128.0.22    ip-10-0-156-134.us-west-2.compute.internal   <none>           <none>
      downloads-8b57f44bb-6jvxw   1/1     Running            0              24h     10.128.0.52    ip-10-0-156-134.us-west-2.compute.internal   <none>           <none>
      downloads-8b57f44bb-7nfnb   1/1     Running            0              25h     10.129.0.29    ip-10-0-171-52.us-west-2.compute.internal    <none>           <none> 

      Only one of the console pods is crash-looping; the other one, running on a different control plane node, is fine. Deleting the crash-looping pod causes it to be rescheduled on another control plane node, which still has resolution issues:

      smalleni-mac:playground smalleni$ oc delete pod/console-7b9596cf96-4b789
      smalleni-mac:playground smalleni$ oc get pods -o wide
      NAME                        READY   STATUS    RESTARTS        AGE     IP            NODE                                         NOMINATED NODE   READINESS GATES
      console-7b9596cf96-fwvbc    0/1     Running   2 (3m10s ago)   9m55s   10.129.0.64   ip-10-0-171-52.us-west-2.compute.internal    <none>           <none>
      console-7b9596cf96-q4ll2    1/1     Running   0               26h     10.128.0.22   ip-10-0-156-134.us-west-2.compute.internal   <none>           <none>
      downloads-8b57f44bb-6jvxw   1/1     Running   0               25h     10.128.0.52   ip-10-0-156-134.us-west-2.compute.internal   <none>           <none>
      downloads-8b57f44bb-7nfnb   1/1     Running   0               25h     10.129.0.29   ip-10-0-171-52.us-west-2.compute.internal    <none>           <none>
      smalleni-mac:playground smalleni$ oc rsh downloads-8b57f44bb-7nfnb
      sh-4.4$ nslookup kubernetes
      ;; connection timed out; no servers could be reached 

      In the output above, the console-7b9596cf96-fwvbc and downloads-8b57f44bb-7nfnb pods are running on the same node. downloads-8b57f44bb-7nfnb also cannot look up the kubernetes service, but it keeps running fine because it does not need to do any lookups.
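
      Since the failures track nodes rather than individual pods, a quick way to map out which nodes are affected is to run the same lookup from a pod on each node. A rough sketch, assuming the pod images ship nslookup (the insights and downloads images above do):

      for pod in $(oc -n openshift-console get pods -o name); do
        node=$(oc -n openshift-console get "$pod" -o jsonpath='{.spec.nodeName}')
        echo "== $pod on $node"
        oc -n openshift-console rsh "$pod" nslookup -timeout=2 kubernetes.default.svc.cluster.local || echo "   lookup FAILED"
      done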

       

      The control plane node ip-10-0-171-52.us-west-2.compute.internal, on which pods are having DNS lookup issues, does have a DNS pod running on it and the appropriate load balancers created in the OVN northbound database (nbdb):

      bash-3.2$ oc get pods -o wide -n openshift-dns  | grep ip-10-0-171-52.us-west-2.compute.internal
      dns-default-lcnxr     2/2     Running   2             27h   10.129.0.4     ip-10-0-171-52.us-west-2.compute.internal    <none>           <none>
      node-resolver-zm4fb   1/1     Running   1             27h   10.0.171.52    ip-10-0-171-52.us-west-2.compute.internal    <none>           <none> 

      From a debug pod on a worker node (not using host network, just a sample pod I created on the pod network), I tried to resolve the kubernetes service directly against the DNS pod running on the ip-10-0-171-52.us-west-2.compute.internal control plane node, and that works.
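
      The query was issued directly against the CoreDNS endpoint IP and port; the command line, reconstructed from the dig header below, was roughly:

      dig @10.129.0.4 -p 5353 kubernetes.default.svc.cluster.local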

       

      ; <<>> DiG 9.16.32-RH <<>> @10.129.0.4 kubernetes.default.svc.cluster.local -p 5353
      ; (1 server found)
      ;; global options: +cmd
      ;; Got answer:
      ;; WARNING: .local is reserved for Multicast DNS
      ;; You are currently testing what happens when an mDNS query is leaked to DNS
      ;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 6007
      ;; flags: qr aa rd; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1
      ;; WARNING: recursion requested but not available
      
      ;; OPT PSEUDOSECTION:
      ; EDNS: version: 0, flags:; udp: 512
      ; COOKIE: f4817742eeb85e07 (echoed)
      ;; QUESTION SECTION:
      ;kubernetes.default.svc.cluster.local. IN A
      
      ;; ANSWER SECTION:
      kubernetes.default.svc.cluster.local. 5	IN A	172.30.0.1
      
      ;; Query time: 2 msec
      ;; SERVER: 10.129.0.4#5353(10.129.0.4)
      ;; WHEN: Sun Apr 30 04:05:49 UTC 2023
      ;; MSG SIZE  rcvd: 129 
      sh-5.1# ovn-nbctl --no-leader-only ls-lb-list ip-10-0-171-52.us-west-2.compute.internal | grep 172.30.0.10
      3b408453-6b0d-457f-b325-67c7a087901f    Service_openshif    tcp        172.30.0.10:53          10.129.0.4:5353
                                                                  tcp        172.30.0.10:9154        10.129.0.4:9154
      215df548-5765-43f6-9c1e-36994f11a57d    Service_openshif    udp        172.30.0.10:53          10.129.0.4:5353 

      DNS lookups from some pods I created for debugging (not using host network) also fail or succeed depending on which worker node they are on. The /etc/resolv.conf in those pods is:

      sh-4.4$ cat /etc/resolv.conf
      search openshift-insights.svc.cluster.local svc.cluster.local cluster.local us-west-2.compute.internal
      nameserver 172.30.0.10
      options ndots:5 
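
      Since querying the CoreDNS endpoint IP directly works (dig output above) while lookups through resolv.conf time out, the complementary check on an affected node is to query the service VIP from resolv.conf explicitly; this is effectively what the failing nslookup calls above are already doing, and it points at the ClusterIP / OVN load balancer path rather than at CoreDNS itself:

      # from a pod on an affected node; times out there, succeeds on healthy nodes
      dig @172.30.0.10 +time=2 +tries=1 kubernetes.default.svc.cluster.local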

      One additional detail worth mentioning: since this is a ROSA cluster, a new machineset per AZ is created during the upgrade.

       

      mac:playground smalleni$ oc get machinesets
      NAME                                       DESIRED   CURRENT   READY   AVAILABLE   AGE
      sai-test-hd8wq-infra-us-west-2a            1         1         1       1           37h
      sai-test-hd8wq-infra-us-west-2b            1         1         1       1           37h
      sai-test-hd8wq-infra-us-west-2c            1         1         1       1           37h
      sai-test-hd8wq-worker-us-west-2a           83        83        83      83          37h
      sai-test-hd8wq-worker-us-west-2a-upgrade   1         1         1       1           29h
      sai-test-hd8wq-worker-us-west-2b           83        83        82      82          37h
      sai-test-hd8wq-worker-us-west-2b-upgrade   1         1         1       1           29h
      sai-test-hd8wq-worker-us-west-2c           83        83        83      83          37h
      sai-test-hd8wq-worker-us-west-2c-upgrade   1         1         1       1           29h 

      I understand the issue could lie anywhere between core networking and DNS, but I would like to open this initially against OVN-Kubernetes (perhaps missing flows?) and see where we go from there.
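
      If missing flows are the suspicion, one possible next step (not yet done here) would be to look for the DNS service VIP in the OpenFlow tables on an affected node and compare against a healthy one, e.g.:

      oc debug node/ip-10-0-171-52.us-west-2.compute.internal
      # inside the debug shell:
      chroot /host
      ovs-ofctl -O OpenFlow13 dump-flows br-int | grep 172.30.0.10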
