Type: Bug
Priority: Normal
Affects Version: 4.13.0
Component: Quality / Stability / Reliability
Status: Rejected
Resolution: Cannot Reproduce
Description of problem:
During regular release-to-release performance and scale testing, upgrading a 249-node loaded classic ROSA cluster from 4.13.0-rc.4 to 4.13.0-rc.6 fails. Initially, all the control plane operators upgrade (including network, which takes a long time because of the sheer size of the cluster and an SBDB of around 225M), and the cluster then moves into the stage of upgrading the MachineConfigPools for the control plane and workers. All the machine pools are eventually upgraded, but the upgrade is stuck progressing for several hours because a few cluster operators become degraded at some point during the MCP upgrades. Operators such as auth, console, and insights show i/o timeouts because they are unable to reach certain URLs (likely DNS resolution issues). The resolution issues seem to happen on certain control plane/worker nodes without a clear pattern. The operator has some internal errors: Unable to report: unable to build request to connect to Insights server: Post "https://console.stage.redhat.com/api/ingress/v1/upload": dial tcp: lookup console.stage.redhat.com: i/o timeout
For the purposes of capturing timing: it took about 3h 10m for the control plane to initially upgrade, but the cluster eventually gets stuck upgrading due to resolution/i/o timeouts that happen at some point during the control plane MCP/worker MCP upgrades.
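For reference, a minimal sketch of the checks used to observe the stuck upgrade (standard oc commands; the grep filter is just a rough way to surface unhealthy operators, and the exact set of degraded operators varies between runs):
# Upgrade status: stuck in Progressing for hours
oc get clusterversion
# Cluster operators that are not Available=True / Progressing=False / Degraded=False
# (auth, console and insights were the degraded ones in this run)
oc get clusteroperators | grep -Ev 'True +False +False'
# MachineConfigPool rollout progress for control plane and workers
oc get mcp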
Version-Release number of selected component (if applicable):
4.13.0-rc.4 -> 4.13.0-rc.6
How reproducible:
100%
Steps to Reproduce:
1. Create a 249 node classic ROSA cluster
2. Load up the cluster using cluster-density-v1 with ITERATIONS=4000 and gc=false (https://github.com/cloud-bulldozer/e2e-benchmarking/tree/master/workloads/kube-burner-ocp-wrapper); a hedged invocation sketch follows below
3. Upgrade the cluster
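A rough sketch of step 2; the exact run.sh interface and variable names should be checked against the kube-burner-ocp-wrapper README, but ITERATIONS=4000 and gc=false come from the reproduction above:
git clone https://github.com/cloud-bulldozer/e2e-benchmarking
cd e2e-benchmarking/workloads/kube-burner-ocp-wrapper
WORKLOAD=cluster-density-v1 ITERATIONS=4000 GC=false ./run.sh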
Actual results:
Upgrade is stuck progressing even after 24 hours
Expected results:
Upgrade completes
Additional info:
Picking one or two of the operators that are degraded due to not being able to reach certain URLs, in the case of the insights operator we can see:
I0430 03:08:13.334836 1 controller.go:220] Number of last upload failures 59 exceeded the threshold 5. Marking as degraded.
I0430 03:08:13.334889 1 controller.go:428] The operator has some internal errors: Unable to report: unable to build request to connect to Insights server: Post "https://console.stage.redhat.com/api/ingress/v1/upload": dial tcp: lookup console.stage.redhat.com: i/o timeout
I0430 03:08:13.334942 1 controller.go:325] No status update necessary, objects are identical
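The log lines above can be pulled with something like the following sketch (namespace and deployment names are the standard ones for the insights operator):
oc -n openshift-insights logs deployment/insights-operator | grep -E 'exceeded the threshold|i/o timeout'
# Degraded condition carries the same message
oc get clusteroperator insights -o yaml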
Looking at the insights operator pod by rsh'ing into it:
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
insights-operator-64fcd46588-6ssk6 1/1 Running 0 25h 10.129.0.35 ip-10-0-171-52.us-west-2.compute.internal <none> <none>
sh-4.4$ nslookup kubernetes
;; connection timed out; no servers could be reached
sh-4.4$ nslookup console.stage.redhat.com
;; connection timed out; no servers could be reached
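For reference, the checks above amount to roughly the following (pod name taken from the listing above):
oc -n openshift-insights rsh insights-operator-64fcd46588-6ssk6
nslookup kubernetes                  # times out inside this pod
nslookup console.stage.redhat.com    # times out as well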
Next, looking at console:
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
console-7b9596cf96-4b789 0/1 CrashLoopBackOff 51 (68s ago) 4h46m 10.130.1.131 ip-10-0-214-54.us-west-2.compute.internal <none> <none>
console-7b9596cf96-q4ll2 1/1 Running 0 25h 10.128.0.22 ip-10-0-156-134.us-west-2.compute.internal <none> <none>
downloads-8b57f44bb-6jvxw 1/1 Running 0 24h 10.128.0.52 ip-10-0-156-134.us-west-2.compute.internal <none> <none>
downloads-8b57f44bb-7nfnb 1/1 Running 0 25h 10.129.0.29 ip-10-0-171-52.us-west-2.compute.internal <none> <none>
Only one of the console pods is crashlooping; the other one, on a different control plane node, is fine. Deleting the crashlooping pod leads to it being scheduled on a different control plane node which also has resolution issues.
smalleni-mac:playground smalleni$ oc delete pod/console-7b9596cf96-4b789
smalleni-mac:playground smalleni$ oc get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
console-7b9596cf96-fwvbc 0/1 Running 2 (3m10s ago) 9m55s 10.129.0.64 ip-10-0-171-52.us-west-2.compute.internal <none> <none>
console-7b9596cf96-q4ll2 1/1 Running 0 26h 10.128.0.22 ip-10-0-156-134.us-west-2.compute.internal <none> <none>
downloads-8b57f44bb-6jvxw 1/1 Running 0 25h 10.128.0.52 ip-10-0-156-134.us-west-2.compute.internal <none> <none>
downloads-8b57f44bb-7nfnb 1/1 Running 0 25h 10.129.0.29 ip-10-0-171-52.us-west-2.compute.internal <none> <none>
smalleni-mac:playground smalleni$ oc rsh downloads-8b57f44bb-7nfnb
sh-4.4$ nslookup kubernetes
;; connection timed out; no servers could be reached
In the above output, you can see that the console-7b9596cf96-fwvbc and downloads-8b57f44bb-7nfnb pods are running on the same node. downloads-8b57f44bb-7nfnb also cannot look up the kubernetes service, but it keeps running fine because it does not need to do any lookups.
The control plane node ip-10-0-171-52.us-west-2.compute.internal, on which pods are having DNS lookup issues, appears to have a DNS pod running on it and the appropriate load balancers created in the NBDB.
bash-3.2$ oc get pods -o wide -n openshift-dns | grep ip-10-0-171-52.us-west-2.compute.internal
dns-default-lcnxr 2/2 Running 2 27h 10.129.0.4 ip-10-0-171-52.us-west-2.compute.internal <none> <none>
node-resolver-zm4fb 1/1 Running 1 27h 10.0.171.52 ip-10-0-171-52.us-west-2.compute.internal <none> <none>
From a debug pod on a worker node (not using host network, just a sample pod created on the pod network), I tried to resolve the kubernetes service using the DNS pod running on the ip-10-0-171-52.us-west-2.compute.internal control plane node, and it works.
; <<>> DiG 9.16.32-RH <<>> @10.129.0.4 kubernetes.default.svc.cluster.local -p 5353
; (1 server found)
;; global options: +cmd
;; Got answer:
;; WARNING: .local is reserved for Multicast DNS
;; You are currently testing what happens when an mDNS query is leaked to DNS
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 6007
;; flags: qr aa rd; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1
;; WARNING: recursion requested but not available
;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 512
; COOKIE: f4817742eeb85e07 (echoed)
;; QUESTION SECTION:
;kubernetes.default.svc.cluster.local. IN A
;; ANSWER SECTION:
kubernetes.default.svc.cluster.local. 5 IN A 172.30.0.1
;; Query time: 2 msec
;; SERVER: 10.129.0.4#5353(10.129.0.4)
;; WHEN: Sun Apr 30 04:05:49 UTC 2023
;; MSG SIZE rcvd: 129
sh-5.1# ovn-nbctl --no-leader-only ls-lb-list ip-10-0-171-52.us-west-2.compute.internal | grep 172.30.0.10
3b408453-6b0d-457f-b325-67c7a087901f Service_openshif tcp 172.30.0.10:53 10.129.0.4:5353
tcp 172.30.0.10:9154 10.129.0.4:9154
215df548-5765-43f6-9c1e-36994f11a57d Service_openshif udp 172.30.0.10:53 10.129.0.4:5353
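Since the NBDB side looks correct, a possible next step would be to check whether the corresponding OpenFlow flows for the DNS service VIP actually made it onto the affected node. A sketch, assuming ovs-ofctl is run on the host via oc debug and that grepping for the VIP is a reasonable way to spot the load-balancer flows:
oc debug node/ip-10-0-171-52.us-west-2.compute.internal
chroot /host
ovs-ofctl -O OpenFlow13 dump-flows br-int | grep 172.30.0.10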
DNS lookups from some pods I created for debugging (not using host network) also fail or pass depending on the worker node they are on. The /etc/resolv.conf in those pods is:
sh-4.4$ cat /etc/resolv.conf
search openshift-insights.svc.cluster.local svc.cluster.local cluster.local us-west-2.compute.internal
nameserver 172.30.0.10
options ndots:5
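To map which nodes are affected, a pod can be pinned to a given node (pod network, not host network) and used for a lookup against the nameserver from resolv.conf. A rough sketch; the image is a placeholder for anything that ships nslookup/dig, and the node name is the one under test:
oc apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: dns-test
spec:
  nodeName: ip-10-0-171-52.us-west-2.compute.internal   # node under test
  restartPolicy: Never
  containers:
  - name: dns-test
    image: registry.access.redhat.com/ubi8/ubi          # any image with nslookup/dig
    command: ["sleep", "3600"]
EOF
oc exec dns-test -- nslookup kubernetes.default.svc.cluster.local 172.30.0.10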
One additional detail worth mentioning: since this is a ROSA cluster, a new machineset per AZ is created during the upgrade.
mac:playground smalleni$ oc get machinesets
NAME DESIRED CURRENT READY AVAILABLE AGE
sai-test-hd8wq-infra-us-west-2a 1 1 1 1 37h
sai-test-hd8wq-infra-us-west-2b 1 1 1 1 37h
sai-test-hd8wq-infra-us-west-2c 1 1 1 1 37h
sai-test-hd8wq-worker-us-west-2a 83 83 83 83 37h
sai-test-hd8wq-worker-us-west-2a-upgrade 1 1 1 1 29h
sai-test-hd8wq-worker-us-west-2b 83 83 82 82 37h
sai-test-hd8wq-worker-us-west-2b-upgrade 1 1 1 1 29h
sai-test-hd8wq-worker-us-west-2c 83 83 83 83 37h
sai-test-hd8wq-worker-us-west-2c-upgrade 1 1 1 1 29h
I understand the issue could be somewhere between core networking and DNS, but I would like to open this up against OVN-Kubernetes initially (perhaps missing flows?) and see where we go from there.