OCPBUGS-12968

Upgrades are failing at scale - possible network/dns lookup issues

      Description of problem:

      During regular release-to-release performance and scale testing, an attempt to upgrade a loaded 249-node classic ROSA cluster from 4.13.0-rc.4 to 4.13.0-rc.6 fails. Initially, all the control plane operators upgrade (including the network operator, which takes a long time because of the sheer size of the cluster and an SBDB of around 225 MB), and the cluster then moves on to upgrading the machine config pools (MCPs) for the control plane and workers. All the machine config pools are eventually upgraded, but the upgrade remains stuck in Progressing for several hours because a few cluster operators become degraded at some point during the MCP upgrades.
      
      Operators such as authentication, console, and insights report i/o timeouts because they are unable to reach certain URLs (most likely DNS resolution failures). The resolution failures appear on certain control plane/worker nodes without any clear pattern.
      
       The operator has some internal errors: Unable to report: unable to build request to connect to Insights server: Post "https://console.stage.redhat.com/api/ingress/v1/upload": dial tcp: lookup console.stage.redhat.com: i/o timeout
      For the record, it took about 3 hours 10 minutes for the control plane to upgrade initially, but the cluster eventually gets stuck upgrading because of the DNS resolution failures / i/o timeouts that start at some point while the control plane and worker MCPs upgrade.
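
      For reference, the degraded operators and the messages above can be seen with standard oc commands along these lines (a sketch, not the exact commands captured during the run):

      oc get clusteroperators                  # which operators are Degraded / not Available
      oc describe clusteroperator insights     # status conditions and error messages for one of them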

       

      Version-Release number of selected component (if applicable):

      4.13.0-rc.4 -> 4.13.0-rc.6

      How reproducible:

      100%

      Steps to Reproduce:

      1. Create a 249 node classic ROSA cluster
      2. Load up the cluster using cluster-density-v1 with ITERATIONS=4000 and gc=false (https://github.com/cloud-bulldozer/e2e-benchmarking/tree/master/workloads/kube-burner-ocp-wrapper); an example invocation is sketched below
      3. Upgrade the cluster
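
      For completeness, the load in step 2 was driven with the kube-burner-ocp-wrapper from the repository linked above. The invocation was roughly the following; the WORKLOAD/GC variable names are assumptions based on that wrapper's interface and may differ between versions (the report itself only records ITERATIONS=4000 and gc=false):

      # run from workloads/kube-burner-ocp-wrapper with KUBECONFIG pointing at the 249-node cluster
      WORKLOAD=cluster-density-v1 ITERATIONS=4000 GC=false ./run.sh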
      

      Actual results:

      Upgrade is still stuck in Progressing even after 24 hours

      Expected results:

      Upgrade completes

      Additional info:

      Picking one or two of the operators that are degraded because they cannot reach certain URLs, in the case of the insights operator we can see:
      I0430 03:08:13.334836       1 controller.go:220] Number of last upload failures 59 exceeded the threshold 5. Marking as degraded.
      I0430 03:08:13.334889       1 controller.go:428] The operator has some internal errors: Unable to report: unable to build request to connect to Insights server: Post "https://console.stage.redhat.com/api/ingress/v1/upload": dial tcp: lookup console.stage.redhat.com: i/o timeout
      I0430 03:08:13.334942       1 controller.go:325] No status update necessary, objects are identical
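
      The messages above are from the insights operator logs, gathered with something like:

      oc -n openshift-insights logs deployment/insights-operator | grep -E 'Marking as degraded|i/o timeout'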

      Looking at the insights operator pod by rsh'ing into it:

      NAME                                 READY   STATUS    RESTARTS   AGE   IP            NODE                                        NOMINATED NODE   READINESS GATES
      insights-operator-64fcd46588-6ssk6   1/1     Running   0          25h   10.129.0.35   ip-10-0-171-52.us-west-2.compute.internal   <none>           <none>
      sh-4.4$ nslookup kubernetes
      ;; connection timed out; no servers could be reached
      
      sh-4.4$ nslookup console.stage.redhat.com
      ;; connection timed out; no servers could be reached 
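
      The shell above was obtained by rsh'ing into the insights operator pod from the listing, i.e. roughly:

      oc -n openshift-insights rsh insights-operator-64fcd46588-6ssk6
      # then, inside the pod:
      nslookup kubernetes
      nslookup console.stage.redhat.com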

      Next, looking at console:

      NAME                        READY   STATUS             RESTARTS       AGE     IP             NODE                                         NOMINATED NODE   READINESS GATES
      console-7b9596cf96-4b789    0/1     CrashLoopBackOff   51 (68s ago)   4h46m   10.130.1.131   ip-10-0-214-54.us-west-2.compute.internal    <none>           <none>
      console-7b9596cf96-q4ll2    1/1     Running            0              25h     10.128.0.22    ip-10-0-156-134.us-west-2.compute.internal   <none>           <none>
      downloads-8b57f44bb-6jvxw   1/1     Running            0              24h     10.128.0.52    ip-10-0-156-134.us-west-2.compute.internal   <none>           <none>
      downloads-8b57f44bb-7nfnb   1/1     Running            0              25h     10.129.0.29    ip-10-0-171-52.us-west-2.compute.internal    <none>           <none> 

      Only one of the console pods is crash-looping; the other one, running on a different control plane node, is fine. Deleting the crash-looping pod causes it to be rescheduled on another control plane node, which still has resolution issues:

      smalleni-mac:playground smalleni$ oc delete pod/console-7b9596cf96-4b789
      smalleni-mac:playground smalleni$ oc get pods -o wide
      NAME                        READY   STATUS    RESTARTS        AGE     IP            NODE                                         NOMINATED NODE   READINESS GATES
      console-7b9596cf96-fwvbc    0/1     Running   2 (3m10s ago)   9m55s   10.129.0.64   ip-10-0-171-52.us-west-2.compute.internal    <none>           <none>
      console-7b9596cf96-q4ll2    1/1     Running   0               26h     10.128.0.22   ip-10-0-156-134.us-west-2.compute.internal   <none>           <none>
      downloads-8b57f44bb-6jvxw   1/1     Running   0               25h     10.128.0.52   ip-10-0-156-134.us-west-2.compute.internal   <none>           <none>
      downloads-8b57f44bb-7nfnb   1/1     Running   0               25h     10.129.0.29   ip-10-0-171-52.us-west-2.compute.internal    <none>           <none>
      smalleni-mac:playground smalleni$ oc rsh downloads-8b57f44bb-7nfnb
      sh-4.4$ nslookup kubernetes
      ;; connection timed out; no servers could be reached 

      In the output above, the console-7b9596cf96-fwvbc and downloads-8b57f44bb-7nfnb pods are running on the same node. downloads-8b57f44bb-7nfnb also cannot look up the kubernetes service, but it keeps running fine because it does not need to do any lookups.
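
      Since the failures track nodes rather than individual pods, a quick way to map out which nodes are affected is to run the same lookup from a pod on each node. A rough sketch, assuming the pod images ship nslookup (the insights and downloads images above do):

      for pod in $(oc -n openshift-console get pods -o name); do
        node=$(oc -n openshift-console get "$pod" -o jsonpath='{.spec.nodeName}')
        echo "== $pod on $node"
        oc -n openshift-console rsh "$pod" nslookup -timeout=2 kubernetes.default.svc.cluster.local || echo "   lookup FAILED"
      done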

       

      The control plane node ip-10-0-171-52.us-west-2.compute.internal, on which pods are having DNS lookup issues, does have a DNS pod running on it and the appropriate load balancers created in the OVN northbound database (nbdb):

      bash-3.2$ oc get pods -o wide -n openshift-dns  | grep ip-10-0-171-52.us-west-2.compute.internal
      dns-default-lcnxr     2/2     Running   2             27h   10.129.0.4     ip-10-0-171-52.us-west-2.compute.internal    <none>           <none>
      node-resolver-zm4fb   1/1     Running   1             27h   10.0.171.52    ip-10-0-171-52.us-west-2.compute.internal    <none>           <none> 

      From a debug pod on a worker node (not using host network, just a sample pod I created on the pod network), I tried to resolve the kubernetes service directly against the DNS pod running on the ip-10-0-171-52.us-west-2.compute.internal control plane node, and that works.
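
      The query was issued directly against the CoreDNS endpoint IP and port; the command line, reconstructed from the dig header below, was roughly:

      dig @10.129.0.4 -p 5353 kubernetes.default.svc.cluster.local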

       

      ; <<>> DiG 9.16.32-RH <<>> @10.129.0.4 kubernetes.default.svc.cluster.local -p 5353
      ; (1 server found)
      ;; global options: +cmd
      ;; Got answer:
      ;; WARNING: .local is reserved for Multicast DNS
      ;; You are currently testing what happens when an mDNS query is leaked to DNS
      ;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 6007
      ;; flags: qr aa rd; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1
      ;; WARNING: recursion requested but not available
      
      ;; OPT PSEUDOSECTION:
      ; EDNS: version: 0, flags:; udp: 512
      ; COOKIE: f4817742eeb85e07 (echoed)
      ;; QUESTION SECTION:
      ;kubernetes.default.svc.cluster.local. IN A
      
      ;; ANSWER SECTION:
      kubernetes.default.svc.cluster.local. 5	IN A	172.30.0.1
      
      ;; Query time: 2 msec
      ;; SERVER: 10.129.0.4#5353(10.129.0.4)
      ;; WHEN: Sun Apr 30 04:05:49 UTC 2023
      ;; MSG SIZE  rcvd: 129 
      sh-5.1# ovn-nbctl --no-leader-only ls-lb-list ip-10-0-171-52.us-west-2.compute.internal | grep 172.30.0.10
      3b408453-6b0d-457f-b325-67c7a087901f    Service_openshif    tcp        172.30.0.10:53          10.129.0.4:5353
                                                                  tcp        172.30.0.10:9154        10.129.0.4:9154
      215df548-5765-43f6-9c1e-36994f11a57d    Service_openshif    udp        172.30.0.10:53          10.129.0.4:5353 

      DNS lookups from some pods I created for debugging (not using host network) also fail or succeed depending on which worker node they are on. The /etc/resolv.conf in those pods is:

      sh-4.4$ cat /etc/resolv.conf
      search openshift-insights.svc.cluster.local svc.cluster.local cluster.local us-west-2.compute.internal
      nameserver 172.30.0.10
      options ndots:5 
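
      Since querying the CoreDNS endpoint IP directly works (dig output above) while lookups through resolv.conf time out, the complementary check on an affected node is to query the service VIP from resolv.conf explicitly; this is effectively what the failing nslookup calls above are already doing, and it points at the ClusterIP / OVN load balancer path rather than at CoreDNS itself:

      # from a pod on an affected node; times out there, succeeds on healthy nodes
      dig @172.30.0.10 +time=2 +tries=1 kubernetes.default.svc.cluster.local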

      One additional detail worth mentioning: since this is a ROSA cluster, a new machineset per AZ is created during the upgrade.

       

      mac:playground smalleni$ oc get machinesets
      NAME                                       DESIRED   CURRENT   READY   AVAILABLE   AGE
      sai-test-hd8wq-infra-us-west-2a            1         1         1       1           37h
      sai-test-hd8wq-infra-us-west-2b            1         1         1       1           37h
      sai-test-hd8wq-infra-us-west-2c            1         1         1       1           37h
      sai-test-hd8wq-worker-us-west-2a           83        83        83      83          37h
      sai-test-hd8wq-worker-us-west-2a-upgrade   1         1         1       1           29h
      sai-test-hd8wq-worker-us-west-2b           83        83        82      82          37h
      sai-test-hd8wq-worker-us-west-2b-upgrade   1         1         1       1           29h
      sai-test-hd8wq-worker-us-west-2c           83        83        83      83          37h
      sai-test-hd8wq-worker-us-west-2c-upgrade   1         1         1       1           29h 

      I understand the issue could lie anywhere between core networking and DNS, but I would like to open this initially against OVN-Kubernetes (perhaps missing flows?) and see where we go from there.
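
      If missing flows are the suspicion, one possible next step (not yet done here) would be to look for the DNS service VIP in the OpenFlow tables on an affected node and compare against a healthy one, e.g.:

      oc debug node/ip-10-0-171-52.us-west-2.compute.internal
      # inside the debug shell:
      chroot /host
      ovs-ofctl -O OpenFlow13 dump-flows br-int | grep 172.30.0.10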
