OCPBUGS-52503

[4.16] [OVN+CNV] Cached route with incorrect MTU causes huge instability in the cluster

    • Moderate
    • Done
    • Bug Fix
      * Previously, during the restart of OVN-Kubernetes containers, routes for the internal ovn-k8s-mp0 interface were removed and re-added, resulting in a temporary traffic outage. With this release, the traffic paths flowing across the ovn-k8s-mp0 interface are not interrupted and the routes are not removed during the OVN-Kubernetes pod restart. (link:https://issues.redhat.com/browse/OCPBUGS-52503[OCPBUGS-52503])

      This is a clone of issue OCPBUGS-48777. The following is the description of the original issue:

      Description of problem:
      This issue started on a cluster belonging to a customer that our partner manages, where route cache entries appeared with incorrect MTUs, very similar to the issue we worked a few months ago (see the Jira below). However, the root cause does not appear to be the same, since the customer's version includes all the fixes, and the issue seems very particular to their usage of the cluster. It is also very hard (almost impossible) to reproduce, unlike before we introduced the fixes from the Jira below.

      So far, the partner and I were only able to reproduce it by performing upgrades, in this case EUS to EUS from 4.12 to 4.14. In that regard, I ran several reproducers of this upgrade to see when the issue appears.

      Another difference from the previous bug is that flushing the cache does not help: immediately after being flushed, the route cache entries reappear. The only solution is to reboot the nodes.

      The results of this are practically the same as we have seen before: TCP connections are severely affected, causing many network connectivity failures and degraded network performance, along with a high number of packet errors and drops on the Geneve port, ICMP "need to frag" messages seen in tcpdumps, etc.
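      The ICMP "need to frag" symptom mentioned above can be confirmed on an affected node. A minimal sketch follows; the capture line below is hypothetical, modeled on the symptoms described in this report, and the interface name br-ex is an assumption:

```shell
# On a live node, ICMP "fragmentation needed" errors could be captured with
# something like (interface name is an assumption):
#   tcpdump -nni br-ex 'icmp[icmptype] == 3 and icmp[icmpcode] == 4'

# Hypothetical capture line of the kind described in this bug.
line='12:00:01.000000 IP 172.23.184.1 > 172.23.184.101: ICMP 10.128.0.5 unreachable - need to frag (mtu 1400), length 556'

# Extract the MTU advertised in the ICMP error; a value far below the
# cluster MTU of 9000 points at a stale cached route.
echo "$line" | sed -n 's/.*need to frag (mtu \([0-9]*\)).*/\1/p'
```

      The advertised MTU in these errors is what ends up locked into the route cache entries.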

      These clusters are bundles created by our partner that consist of:

      • 3 master/node cluster with optionally 2 or 3 additional worker nodes;
      • IPI bare metal installation;
      • ODF, CNV, SRIOV, MetalLB and NMState operators as a base;
      • Additional bridges and VLANs used for SRIOV to be used on VMs;
      • Default subdomain and public API routing done via MetalLB LoadBalancer service;
      • Main network bond interface has two VLANs: one VLAN on the machine network for internal traffic only, and a second VLAN on a different network for public access. Both services above are exposed on this second VLAN, and it holds the default route (changed after installation).
      • OVN gateway router set to local mode.
      • Cluster MTU set to 9000/8900
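      The two settings from the list above most relevant to this report, the OVN gateway mode and the cluster MTU, can be checked from the Network operator config. A minimal sketch, assuming a saved copy of `oc get network.operator.openshift.io cluster -o yaml`; the snippet below is a hypothetical excerpt matching the specs above:

```shell
# Hypothetical excerpt of the Network operator config; on a live cluster it
# would come from:
#   oc get network.operator.openshift.io cluster -o yaml
cat > /tmp/network-operator.yaml <<'EOF'
spec:
  defaultNetwork:
    type: OVNKubernetes
    ovnKubernetesConfig:
      mtu: 8900
      gatewayConfig:
        routingViaHost: true
EOF

# routingViaHost: true corresponds to local gateway mode; mtu 8900 is the
# overlay MTU (host MTU 9000 minus 100 bytes of Geneve overhead).
grep -E 'routingViaHost|mtu' /tmp/network-operator.yaml
```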

      Tests I made to try to narrow down the components involved in causing the issue, starting with the ones where I did not see the issue:

      • Cluster with the specs mentioned above, apart from SRIOV. Instead, I'm using another interface in my VMs where I create a bridge using the NMState operator. This bridge is then used for the cnv-bridge net-attach-def (an example similar to what we have in our OpenShift Virtualization docs). This test ran with VMs, some of them using interfaces on the cnv-bridge. ODF was used for rbd and cephfs. OVN gateway in shared mode.
      • Cluster with all the same specs but no VMs and no additional bridge configured (whether CNV alone is configured makes no difference). No issue, as expected, which seems to prove there is no regression from the Jira below.

      The ones where I see the issue seem to have one thing in common: VMs with additional networks, additional bridges, or a combination of both, plus the OVN gateway in local mode:

      • Tested cluster with VMs with interfaces on cnv-bridge with and without live migration enabled.
      • Tested cluster with CNV configured including cnv-bridge ready for VMs, but no VMs created.
      • Tested cluster with CNV configured including cnv-bridge and VMs running on the cluster network and on this bridge. External VLAN was also configured with MTU 9000 the same way as the internal VLAN.
      These tests produce results like the following (taken from several runs):

      https://privatebin.corp.redhat.com/?47cb4fde3d39062f#98gvHAC7ptsoqcLfGfj2LXSd1hg9oygenkukU6NVVBHW

      The output below is from when I try to run `ip route flush cache`. A second or two later the cache is immediately recreated, and network issues on the cluster do not get any better:

      [WARNING]: Platform linux on host master-node0.ocp4-aio-
      cluster.redhatrules.local is using the discovered Python interpreter at
      /usr/bin/python3.9, but future installation of another Python interpreter could
      change the meaning of that path. See https://docs.ansible.com/ansible-
      core/2.16/reference_appendices/interpreter_discovery.html for more information.
      master-node0.ocp4-aio-cluster.redhatrules.local | CHANGED | rc=0 >>
      unicast 172.23.184.101 dev br-ex
      cache users 1 age 15sec
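      Stale entries like the one above can be scanned for programmatically. A minimal sketch follows; the first cache entry is the real output quoted above, while the second is a hypothetical example added to show what a bad-MTU entry looks like, and 9000 is the expected MTU from the cluster specs above:

```shell
# First entry matches the real output quoted above; the second is a
# hypothetical example of an entry stuck with a lowered MTU.
sample='unicast 172.23.184.101 dev br-ex
    cache users 1 age 15sec
unicast 172.23.184.102 dev br-ex
    cache expires 599sec mtu 1400 users 2 age 3sec'

# On a live node the input would come from: ip route show cache
# Flag any cached MTU below the expected cluster MTU.
printf '%s\n' "$sample" | awk -v want=9000 '
  {
    for (i = 1; i < NF; i++)
      if ($i == "mtu" && ($(i+1) + 0) < want)
        print "cached MTU " $(i+1) " below expected " want
  }'
```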

      I have a lot of data gathered from my clusters, and the customer has several uploads attached to the case linked to this Jira.
      Let me know what is the best way to share the data.

      Version-Release number of selected component (if applicable):
      So far seen in OCP 4.12, 4.13 and 4.14.

      How reproducible:
      Sometimes. So far we could only reproduce during upgrades.

      Steps to Reproduce:

      1. Several ways to reproduce. So far only with upgrades, and the issue starts when the CNO begins the upgrade and restarts OVN.

      Actual results:

      Expected results:

      Additional info:
      This is very similar to an old bug we had in Jira - https://issues.redhat.com/browse/OCPBUGS-26700 - but nothing indicates a regression, since this issue is not reproducible on a plain OCP cluster even with the same VLAN configurations and cluster network setup.

              trozet@redhat.com Tim Rozet
              openshift-crt-jira-prow OpenShift Prow Bot
              Anurag Saxena Anurag Saxena