OCPBUGS-36710 (OpenShift Bugs): Packet drop around OVN container restarts when upgrading from 4.13 to 4.14 with routingViaHost: true

    • Critical
    • Customer Escalated
    • 07/29 RHOCPRIO, px score +4000. BQI: Excellent

      Description of problem:
      Following a partner report of packet drop during the 4.13 to 4.14 phase of an EUS upgrade, I tried to reproduce their issue. The partner reported 10 seconds of impact for their application, whereas I can personally reproduce only a couple of dropped packets during the upgrade. The partner's issue may therefore be different from what I'm seeing, but it is notable that their application sees drops during the OVN-Kubernetes upgrade stage and that I can consistently detect packet drop during the same stage.
      I can reproduce this packet drop every time I upgrade an OCP AWS cluster from 4.13.44 to 4.14.31.

      The worker pools are paused to avoid worker reboots, and the packet drop occurs during the network cluster operator upgrade stage, around the time when the ovnkube pods restart and before the master nodes are rebooted. I'll attach output, a must-gather, and sosreports from my test cluster.
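
      For reference, the gateway mode and the pool pause state can be checked with standard oc commands; a minimal sketch (the API paths follow the OpenShift Network operator CR, the jsonpath expressions are only illustrative):

      # confirm local gateway mode (routingViaHost: true) in the cluster network config
      oc get network.operator.openshift.io cluster \
        -o jsonpath='{.spec.defaultNetwork.ovnKubernetesConfig.gatewayConfig.routingViaHost}{"\n"}'
      # pause the worker MachineConfigPool so that worker nodes do not reboot during the upgrade
      oc patch machineconfigpool/worker --type merge -p '{"spec":{"paused":true}}'
      # verify the pause took effect
      oc get machineconfigpool/worker -o jsonpath='{.spec.paused}{"\n"}'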

      Version-Release number of selected component (if applicable):
      upgrade 4.13.44 to 4.14.31

      How reproducible:

      Deploy a cluster with cluster bot:

      launch 4.13.44 aws
      

      Check out the repo https://github.com/andreaskaris/network-check/ , then inspect and run the steps from deploy.sh (see the comments below for what the script does), and monitor progress with monitor.sh:

      # deploy.sh enables debug logging and routingViaHost: true, and deploys pods on the
      # worker nodes that ping each other and google.com and that curl their own service
      # via an ingress route.
      # It then pauses the worker pool (so that the worker nodes will not reboot) and
      # kicks off the upgrade.
      bash -x deploy.sh
      # Once the cluster starts upgrading, run:
      bash monitor.sh
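
      The snapshots below are produced by monitor.sh. Conceptually it is a loop along the following lines (a simplified sketch, not the script from the repo; the network-check namespace name is an assumption):

      while true; do
        echo "===================="
        date
        echo "===================="
        # upgrade progress, node versions, checker pods, and ovnkube pods
        oc get clusterversion
        oc get nodes
        oc get pods -n network-check -o wide
        oc get pods -n openshift-ovn-kubernetes -o wide
        # print any pings that the checker pods logged as lost
        for pod in $(oc get pods -n network-check -o name); do
          echo "=== ${pod} ==="
          oc logs -n network-check "${pod}" | grep lost
        done
        sleep 5
      done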
      

      When the network-check pods come up, they will initially report some packet drop because they are pinging pods that are not up yet; ignore that.
      However, around the time when the OVN-Kubernetes pods restart, I consistently see a bit of packet drop, though not on all workers. Sometimes it is one worker, sometimes a couple, that show this:

      ====================
      Mon Jul  8 04:33:42 PM CEST 2024   # <---  14:33:42 UTC
      ====================
      NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
      version   4.13.44   True        True          49m     Working towards 4.14.31: 700 of 860 done (81% complete), waiting on network
      NAME                                         STATUS   ROLES                  AGE    VERSION
      ip-10-0-131-36.us-east-2.compute.internal    Ready    worker                 108m   v1.26.15+4818370
      ip-10-0-142-32.us-east-2.compute.internal    Ready    control-plane,master   128m   v1.26.15+4818370
      ip-10-0-143-3.us-east-2.compute.internal     Ready    control-plane,master   128m   v1.26.15+4818370
      ip-10-0-176-162.us-east-2.compute.internal   Ready    worker                 71m    v1.26.15+4818370
      ip-10-0-223-228.us-east-2.compute.internal   Ready    control-plane,master   128m   v1.26.15+4818370
      ip-10-0-232-133.us-east-2.compute.internal   Ready    worker                 115m   v1.26.15+4818370
      NAME                  READY   STATUS    RESTARTS   AGE   IP           NODE                                         NOMINATED NODE   READINESS GATES
      network-check-66htn   1/1     Running   0          71m   10.131.0.7   ip-10-0-176-162.us-east-2.compute.internal   <none>           <none>
      network-check-lrrvm   1/1     Running   0          91m   10.129.2.3   ip-10-0-232-133.us-east-2.compute.internal   <none>           <none>
      network-check-mmh64   1/1     Running   0          91m   10.130.2.3   ip-10-0-131-36.us-east-2.compute.internal    <none>           <none>
      NAME                   READY   STATUS    RESTARTS   AGE     IP             NODE                                         NOMINATED NODE   READINESS GATES
      ovnkube-master-5lft4   6/6     Running   0          3m43s   10.0.142.32    ip-10-0-142-32.us-east-2.compute.internal    <none>           <none>
      ovnkube-master-ptwk7   6/6     Running   0          7m47s   10.0.143.3     ip-10-0-143-3.us-east-2.compute.internal     <none>           <none>
      ovnkube-master-rc6xq   6/6     Running   0          5m45s   10.0.223.228   ip-10-0-223-228.us-east-2.compute.internal   <none>           <none>
      ovnkube-node-67qds     5/5     Running   0          9m6s    10.0.131.36    ip-10-0-131-36.us-east-2.compute.internal    <none>           <none>
      ovnkube-node-9ljzd     8/8     Running   0          40s     10.0.143.3     ip-10-0-143-3.us-east-2.compute.internal     <none>           <none>
      ovnkube-node-dnmmq     5/5     Running   0          11m     10.0.223.228   ip-10-0-223-228.us-east-2.compute.internal   <none>           <none>
      ovnkube-node-fjjjb     5/8     Running   0          17s     10.0.176.162   ip-10-0-176-162.us-east-2.compute.internal   <none>           <none>
      ovnkube-node-rp48s     5/5     Running   0          9m38s   10.0.232.133   ip-10-0-232-133.us-east-2.compute.internal   <none>           <none>
      ovnkube-node-trz8j     5/5     Running   0          10m     10.0.142.32    ip-10-0-142-32.us-east-2.compute.internal    <none>           <none>
      === pod/network-check-66htn ===
      Mon Jul  8 13:22:41 UTC 2024: Ping to network-check-66htn (<none>) lost
      Mon Jul  8 14:33:37 UTC 2024: Ping to network-check-lrrvm (10.129.2.3) lost       # <--- this
      Mon Jul  8 14:33:37 UTC 2024: Ping to network-check-mmh64 (10.130.2.3) lost   # <--- this
      === pod/network-check-lrrvm ===
      Mon Jul  8 13:22:32 UTC 2024: Ping to network-check-jlsfc (10.131.0.9) lost
      Mon Jul  8 13:22:32 UTC 2024: Ping to network-check-66htn (<none>) lost
      Mon Jul  8 13:22:33 UTC 2024: Ping to network-check-66htn (<none>) lost
      Mon Jul  8 13:22:34 UTC 2024: Ping to network-check-66htn (<none>) lost
      Mon Jul  8 13:22:35 UTC 2024: Ping to network-check-66htn (<none>) lost
      Mon Jul  8 13:22:36 UTC 2024: Ping to network-check-66htn (<none>) lost
      Mon Jul  8 13:22:38 UTC 2024: Ping to network-check-66htn (<none>) lost
      Mon Jul  8 13:22:39 UTC 2024: Ping to network-check-66htn (<none>) lost
      Mon Jul  8 13:22:40 UTC 2024: Ping to network-check-66htn (<none>) lost
      Mon Jul  8 13:22:41 UTC 2024: Ping to network-check-66htn (<none>) lost
      === pod/network-check-mmh64 ===
      Mon Jul  8 13:22:31 UTC 2024: Ping to network-check-66htn (<none>) lost
      Mon Jul  8 13:22:32 UTC 2024: Ping to network-check-jlsfc (10.131.0.9) lost
      Mon Jul  8 13:22:32 UTC 2024: Ping to network-check-66htn (<none>) lost
      Mon Jul  8 13:22:33 UTC 2024: Ping to network-check-66htn (<none>) lost
      Mon Jul  8 13:22:35 UTC 2024: Ping to network-check-66htn (<none>) lost
      Mon Jul  8 13:22:36 UTC 2024: Ping to network-check-66htn (<none>) lost
      Mon Jul  8 13:22:37 UTC 2024: Ping to network-check-66htn (<none>) lost
      Mon Jul  8 13:22:38 UTC 2024: Ping to network-check-66htn (<none>) lost
      Mon Jul  8 13:22:39 UTC 2024: Ping to network-check-66htn (<none>) lost
      Mon Jul  8 13:22:40 UTC 2024: Ping to network-check-66htn (<none>) lost
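
      For context, the "Ping ... lost" lines above come from a simple ping loop inside each network-check pod; roughly something like the following sketch (the PEERS variable and its discovery are placeholders, the actual check lives in the network-check repo):

      # once per second, ping every peer pod and log any loss
      while true; do
        for peer in ${PEERS}; do   # PEERS holds "name=ip" pairs; discovery omitted here
          name="${peer%%=*}"
          ip="${peer#*=}"
          if ! ping -c 1 -W 1 "${ip}" > /dev/null 2>&1; then
            echo "$(date): Ping to ${name} (${ip}) lost"
          fi
        done
        sleep 1
      done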
      

      Steps to Reproduce:

      1. Launch a 4.13.44 AWS cluster (e.g. with cluster bot: launch 4.13.44 aws).

      2. Run deploy.sh from https://github.com/andreaskaris/network-check/ to enable debug logging and routingViaHost: true, deploy the network-check pods, pause the worker pool, and kick off the upgrade to 4.14.31.

      3. Run monitor.sh and watch for lost pings around the time the ovnkube pods restart during the network cluster operator upgrade.

      Additional info:

      Is packet drop something that we test for in our CI lanes? This is very easy to reproduce, so I wonder whether we purposefully tolerate the loss of a few packets.
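
      If we wanted to assert on this in a CI lane, the check could be as simple as failing whenever any checker pod logged a lost ping during the upgrade window (a hypothetical sketch, reusing the network-check pods and namespace assumed above):

      # fail the lane if any network-check pod logged a lost ping during the upgrade window
      lost=0
      for pod in $(oc get pods -n network-check -o name); do
        n=$(oc logs -n network-check "${pod}" | grep -c lost)
        lost=$(( lost + n ))
      done
      if [ "${lost}" -gt 0 ]; then
        echo "FAIL: ${lost} lost pings observed during the upgrade"
        exit 1
      fi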

            Assignee: jtanenba@redhat.com Jacob Tanenbaum
            Reporter: akaris@redhat.com Andreas Karis
            QA Contact: Anurag Saxena