OpenShift Bugs / OCPBUGS-57396

Stale routes to the join switch subnet cause intermittent drops during egress


    • Quality / Stability / Reliability
    • Important
    • CORENET Sprint 272, CORENET Sprint 273
    • 2
    • +
    • Done
    • Bug Fix
    • Previously, in certain situations, the gateway IP address for a node changed, and the `OVN` cluster router, which manages the static route to the cluster subnet, added a new static route with the new gateway IP address without deleting the original one. As a result, a stale route still pointed to the switch subnet, which caused intermittent drops in egress traffic. With this release, a patch applied to the `OVN` cluster router ensures that if the gateway IP address changes, the `OVN` cluster router updates the existing static route with the new gateway IP address. No stale route remains on the `OVN` cluster router, so egress traffic is no longer dropped. (link:https://issues.redhat.com/browse/OCPBUGS-32754[OCPBUGS-32754])
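To make the difference between the old and new behaviour concrete, here is a minimal sketch. It is not the actual patch: routes are modelled as plain `(prefix, nexthop)` tuples, and the function names and all addresses are hypothetical, chosen only for illustration.

```python
def add_route_append_only(routes, prefix, nexthop):
    """Pre-fix behaviour as described in the release note: a new route is
    added for the new gateway IP and the original route is left in place."""
    updated = list(routes)
    if (prefix, nexthop) not in updated:
        updated.append((prefix, nexthop))
    return updated


def upsert_route(routes, prefix, nexthop):
    """Post-fix behaviour as described: the existing route for the prefix
    is replaced, so exactly one route for that prefix remains."""
    updated = [(p, h) for (p, h) in routes if p != prefix]
    updated.append((prefix, nexthop))
    return updated


# Illustrative values only: the gateway IP changes from .9 to .4.
before = [("10.128.2.0/23", "100.64.0.9")]
after_bug = add_route_append_only(before, "10.128.2.0/23", "100.64.0.4")
after_fix = upsert_route(before, "10.128.2.0/23", "100.64.0.4")
print(after_bug)
print(after_fix)
```

With the append-only variant the old next hop survives alongside the new one, which is the stale-route condition this issue describes.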
    • None

      This is a clone of issue OCPBUGS-56443. The following is the description of the original issue:

      Description of problem:

      Under some circumstances (not clear exactly which ones), the OVN databases of 2 nodes ended up having 2 src-ip static routes in ovn_cluster_router instead of one: one of them points to the correct IP of the rtoj-GR_${NODE_NAME} LRP and one points to a wrong IP on the join subnet (that IP is not used in any other LRP or LSP).
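A check for this condition could look roughly like the sketch below. It assumes the routes and port IPs have already been collected by some other means (for example from the output of `ovn-nbctl lr-route-list ovn_cluster_router`); the `StaticRoute` type, the function name, and every address shown are hypothetical and do not come from the affected cluster.

```python
from dataclasses import dataclass
from ipaddress import ip_address, ip_network


@dataclass(frozen=True)
class StaticRoute:
    """Hypothetical, simplified view of one static route on a logical router."""
    policy: str     # "src-ip" or "dst-ip"
    ip_prefix: str  # source prefix the route matches
    nexthop: str    # next hop, expected to be a router port IP on the join subnet


def find_stale_routes(routes, join_subnet, known_port_ips):
    """Return src-ip routes whose next hop lies in the join subnet but is
    not assigned to any known logical router or switch port."""
    join = ip_network(join_subnet)
    known = {ip_address(ip) for ip in known_port_ips}
    stale = []
    for route in routes:
        hop = ip_address(route.nexthop)
        if route.policy == "src-ip" and hop in join and hop not in known:
            stale.append(route)
    return stale


# Illustrative values only.
routes = [
    StaticRoute("src-ip", "10.128.2.0/23", "100.64.0.4"),  # owned by a live port
    StaticRoute("src-ip", "10.128.2.0/23", "100.64.0.9"),  # no port owns this IP
]
stale = find_stale_routes(routes, "100.64.0.0/16", ["100.64.0.4"])
print(stale)
```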

      Both static routes are taken into consideration while routing traffic out from the cluster, so packets that use the right route are able to egress while the packets that use the wrong route are dropped.
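The intermittent nature of the drops can be illustrated with a toy simulation. It assumes flows are spread across the two routes by a stable per-flow hash, which this report does not confirm; the function names and all addresses are hypothetical.

```python
import hashlib


def pick_nexthop(flow, nexthops):
    """Pick one next hop per flow with a stable hash. This is an assumed
    stand-in for however the router spreads flows across equal routes."""
    digest = hashlib.sha256(repr(flow).encode()).digest()
    return nexthops[int.from_bytes(digest[:4], "big") % len(nexthops)]


def simulate(flows, nexthops, live_nexthops):
    """Count flows that reach a live next hop versus flows that are dropped."""
    delivered = dropped = 0
    for flow in flows:
        if pick_nexthop(flow, nexthops) in live_nexthops:
            delivered += 1
        else:
            dropped += 1
    return delivered, dropped


# Illustrative values only: 1000 flows from one pod, differing in source port.
flows = [("10.128.2.15", sport, "203.0.113.10", 443) for sport in range(40000, 41000)]
delivered, dropped = simulate(flows, ["100.64.0.4", "100.64.0.9"], {"100.64.0.4"})
print(delivered, dropped)
```

Under this assumption some flows always succeed and others always fail, which would look like intermittent packet loss from the point of view of a workload opening many connections.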

      Version-Release number of selected component (if applicable):

      Reproduced in 4.14.20

      How reproducible:

      Observed at least once, on only 2 nodes of the cluster.

      Steps to Reproduce:

      (Not sure, it was just found after investigation of strange packet drop)

      Actual results:

      Wrong static route to some non-existent IP in the join subnet. Intermittent packet drop.

      Expected results:

      No wrong static routes. No packet drop.

      Additional info:

      This can be worked around by wiping the OVN databases of the impacted node.

              Riccardo Ravaioli (rravaiol@redhat.com)
              OpenShift Prow Bot (openshift-crt-jira-prow)
              Jean Chen