OCPBUGS-36710 (OpenShift Bugs): Packet drop around OVN container restarts when upgrading from 4.13 to 4.14 with routingViaHost: true

    • Critical
    • Customer Escalated
    • 07/29 RHOCPRIO, px score +4000. BQI: Excellent

      Description of problem:
      Following a partner report of packet drop during the 4.13 to 4.14 phase of an EUS upgrade, I tried to reproduce their issue. The partner reported 10 seconds of impact for their application, whereas I can personally reproduce only a couple of dropped packets during the upgrade. The partner's issue may therefore be different from what I'm seeing, but it is notable that their application sees drops during the OVN-Kubernetes upgrade stage and that I can consistently detect packet drop during the same stage.
      I can reproduce this packet drop every time I upgrade an OCP AWS cluster from 4.13.44 to 4.14.31.

      The worker pools are paused to avoid worker reboots, and the packet drop occurs during the network cluster operator upgrade stage, around the time when the ovnkube pods restart and before the master nodes are rebooted. I'll attach output, a must-gather, and sosreports from my test cluster.
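
      For reference, the gateway mode and the pool pause state can be checked with standard oc commands; a minimal sketch (the API paths follow the OpenShift Network operator CR, the jsonpath expressions are only illustrative):

      # confirm local gateway mode (routingViaHost: true) in the cluster network config
      oc get network.operator.openshift.io cluster \
        -o jsonpath='{.spec.defaultNetwork.ovnKubernetesConfig.gatewayConfig.routingViaHost}{"\n"}'
      # pause the worker MachineConfigPool so that worker nodes do not reboot during the upgrade
      oc patch machineconfigpool/worker --type merge -p '{"spec":{"paused":true}}'
      # verify the pause took effect
      oc get machineconfigpool/worker -o jsonpath='{.spec.paused}{"\n"}'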

      Version-Release number of selected component (if applicable):
      upgrade 4.13.44 to 4.14.31

      How reproducible:

      Deploy a cluster with cluster bot:

      launch 4.13.44 aws
      

      Check out the repo https://github.com/andreaskaris/network-check/ , then inspect and run the steps from deploy.sh (see the comments below for what the script does), and monitor progress with monitor.sh:

      # deploy.sh enables debug logging and routingViaHost: true, and deploys pods on the
      # worker nodes that ping each other and google.com and that curl their own service
      # via an ingress route.
      # It then pauses the worker pool (so that the worker nodes will not reboot) and
      # kicks off the upgrade.
      bash -x deploy.sh
      # Once the cluster starts upgrading, run:
      bash monitor.sh
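
      The snapshots below are produced by monitor.sh. Conceptually it is a loop along the following lines (a simplified sketch, not the script from the repo; the network-check namespace name is an assumption):

      while true; do
        echo "===================="
        date
        echo "===================="
        # upgrade progress, node versions, checker pods, and ovnkube pods
        oc get clusterversion
        oc get nodes
        oc get pods -n network-check -o wide
        oc get pods -n openshift-ovn-kubernetes -o wide
        # print any pings that the checker pods logged as lost
        for pod in $(oc get pods -n network-check -o name); do
          echo "=== ${pod} ==="
          oc logs -n network-check "${pod}" | grep lost
        done
        sleep 5
      done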
      

      When the network-check pods come up, they will initially report some packet drop because they are pinging pods that are not up yet; ignore that.
      However, around the time when the OVN-Kubernetes pods restart, I consistently see a bit of packet drop, though not on all workers. Sometimes it is one worker, sometimes a couple, that show this:

      ====================
      Mon Jul  8 04:33:42 PM CEST 2024   # <---  14:33:42 UTC
      ====================
      NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
      version   4.13.44   True        True          49m     Working towards 4.14.31: 700 of 860 done (81% complete), waiting on network
      NAME                                         STATUS   ROLES                  AGE    VERSION
      ip-10-0-131-36.us-east-2.compute.internal    Ready    worker                 108m   v1.26.15+4818370
      ip-10-0-142-32.us-east-2.compute.internal    Ready    control-plane,master   128m   v1.26.15+4818370
      ip-10-0-143-3.us-east-2.compute.internal     Ready    control-plane,master   128m   v1.26.15+4818370
      ip-10-0-176-162.us-east-2.compute.internal   Ready    worker                 71m    v1.26.15+4818370
      ip-10-0-223-228.us-east-2.compute.internal   Ready    control-plane,master   128m   v1.26.15+4818370
      ip-10-0-232-133.us-east-2.compute.internal   Ready    worker                 115m   v1.26.15+4818370
      NAME                  READY   STATUS    RESTARTS   AGE   IP           NODE                                         NOMINATED NODE   READINESS GATES
      network-check-66htn   1/1     Running   0          71m   10.131.0.7   ip-10-0-176-162.us-east-2.compute.internal   <none>           <none>
      network-check-lrrvm   1/1     Running   0          91m   10.129.2.3   ip-10-0-232-133.us-east-2.compute.internal   <none>           <none>
      network-check-mmh64   1/1     Running   0          91m   10.130.2.3   ip-10-0-131-36.us-east-2.compute.internal    <none>           <none>
      NAME                   READY   STATUS    RESTARTS   AGE     IP             NODE                                         NOMINATED NODE   READINESS GATES
      ovnkube-master-5lft4   6/6     Running   0          3m43s   10.0.142.32    ip-10-0-142-32.us-east-2.compute.internal    <none>           <none>
      ovnkube-master-ptwk7   6/6     Running   0          7m47s   10.0.143.3     ip-10-0-143-3.us-east-2.compute.internal     <none>           <none>
      ovnkube-master-rc6xq   6/6     Running   0          5m45s   10.0.223.228   ip-10-0-223-228.us-east-2.compute.internal   <none>           <none>
      ovnkube-node-67qds     5/5     Running   0          9m6s    10.0.131.36    ip-10-0-131-36.us-east-2.compute.internal    <none>           <none>
      ovnkube-node-9ljzd     8/8     Running   0          40s     10.0.143.3     ip-10-0-143-3.us-east-2.compute.internal     <none>           <none>
      ovnkube-node-dnmmq     5/5     Running   0          11m     10.0.223.228   ip-10-0-223-228.us-east-2.compute.internal   <none>           <none>
      ovnkube-node-fjjjb     5/8     Running   0          17s     10.0.176.162   ip-10-0-176-162.us-east-2.compute.internal   <none>           <none>
      ovnkube-node-rp48s     5/5     Running   0          9m38s   10.0.232.133   ip-10-0-232-133.us-east-2.compute.internal   <none>           <none>
      ovnkube-node-trz8j     5/5     Running   0          10m     10.0.142.32    ip-10-0-142-32.us-east-2.compute.internal    <none>           <none>
      === pod/network-check-66htn ===
      Mon Jul  8 13:22:41 UTC 2024: Ping to network-check-66htn (<none>) lost
      Mon Jul  8 14:33:37 UTC 2024: Ping to network-check-lrrvm (10.129.2.3) lost       # <--- this
      Mon Jul  8 14:33:37 UTC 2024: Ping to network-check-mmh64 (10.130.2.3) lost   # <--- this
      === pod/network-check-lrrvm ===
      Mon Jul  8 13:22:32 UTC 2024: Ping to network-check-jlsfc (10.131.0.9) lost
      Mon Jul  8 13:22:32 UTC 2024: Ping to network-check-66htn (<none>) lost
      Mon Jul  8 13:22:33 UTC 2024: Ping to network-check-66htn (<none>) lost
      Mon Jul  8 13:22:34 UTC 2024: Ping to network-check-66htn (<none>) lost
      Mon Jul  8 13:22:35 UTC 2024: Ping to network-check-66htn (<none>) lost
      Mon Jul  8 13:22:36 UTC 2024: Ping to network-check-66htn (<none>) lost
      Mon Jul  8 13:22:38 UTC 2024: Ping to network-check-66htn (<none>) lost
      Mon Jul  8 13:22:39 UTC 2024: Ping to network-check-66htn (<none>) lost
      Mon Jul  8 13:22:40 UTC 2024: Ping to network-check-66htn (<none>) lost
      Mon Jul  8 13:22:41 UTC 2024: Ping to network-check-66htn (<none>) lost
      === pod/network-check-mmh64 ===
      Mon Jul  8 13:22:31 UTC 2024: Ping to network-check-66htn (<none>) lost
      Mon Jul  8 13:22:32 UTC 2024: Ping to network-check-jlsfc (10.131.0.9) lost
      Mon Jul  8 13:22:32 UTC 2024: Ping to network-check-66htn (<none>) lost
      Mon Jul  8 13:22:33 UTC 2024: Ping to network-check-66htn (<none>) lost
      Mon Jul  8 13:22:35 UTC 2024: Ping to network-check-66htn (<none>) lost
      Mon Jul  8 13:22:36 UTC 2024: Ping to network-check-66htn (<none>) lost
      Mon Jul  8 13:22:37 UTC 2024: Ping to network-check-66htn (<none>) lost
      Mon Jul  8 13:22:38 UTC 2024: Ping to network-check-66htn (<none>) lost
      Mon Jul  8 13:22:39 UTC 2024: Ping to network-check-66htn (<none>) lost
      Mon Jul  8 13:22:40 UTC 2024: Ping to network-check-66htn (<none>) lost
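
      For context, the "Ping ... lost" lines above come from a simple ping loop inside each network-check pod; roughly something like the following sketch (the PEERS variable and its discovery are placeholders, the actual check lives in the network-check repo):

      # once per second, ping every peer pod and log any loss
      while true; do
        for peer in ${PEERS}; do   # PEERS holds "name=ip" pairs; discovery omitted here
          name="${peer%%=*}"
          ip="${peer#*=}"
          if ! ping -c 1 -W 1 "${ip}" > /dev/null 2>&1; then
            echo "$(date): Ping to ${name} (${ip}) lost"
          fi
        done
        sleep 1
      done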
      

      Steps to Reproduce:

      1. Launch a 4.13.44 AWS cluster (e.g. with cluster bot: launch 4.13.44 aws).

      2. Run deploy.sh from https://github.com/andreaskaris/network-check/ to enable debug logging and routingViaHost: true, deploy the network-check pods, pause the worker pool, and kick off the upgrade to 4.14.31.

      3. Run monitor.sh and watch for lost pings around the time the ovnkube pods restart during the network cluster operator upgrade.

      Additional info:

      Is packet drop something that we test for in our CI lanes? This is very easy to reproduce, so I wonder whether we purposefully tolerate the loss of a few packets.
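
      If we wanted to assert on this in a CI lane, the check could be as simple as failing whenever any checker pod logged a lost ping during the upgrade window (a hypothetical sketch, reusing the network-check pods and namespace assumed above):

      # fail the lane if any network-check pod logged a lost ping during the upgrade window
      lost=0
      for pod in $(oc get pods -n network-check -o name); do
        n=$(oc logs -n network-check "${pod}" | grep -c lost)
        lost=$(( lost + n ))
      done
      if [ "${lost}" -gt 0 ]; then
        echo "FAIL: ${lost} lost pings observed during the upgrade"
        exit 1
      fi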

            Assignee: jtanenba@redhat.com Jacob Tanenbaum
            Reporter: akaris@redhat.com Andreas Karis
            QA Contact: Anurag Saxena