  1. OpenShift SDN
  2. SDN-4442

Address MTU issues in shared gw mode


    • Story
    • Resolution: Unresolved
    • Major
    • OVN Kubernetes
    • SDN Sprint 251, SDN Sprint 252, SDN Sprint 253
    • Rejected

      Description of problem:

      We have various MTU issues with ovnkube in shared gateway mode because pod networking sits on an MTU boundary 100 bytes lower than the physical network. For example, when an OVN-networked pod contacts an external entity and that entity replies with a packet larger than the pod MTU, an ICMP needs frag is sent back to the host. Additionally, OVS does not support IP fragmentation, so even if the "don't fragment" bit is 0, OVN will always send ICMP needs frag.
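
      The behavior described above can be modeled in a few lines. This is only an illustrative sketch, not ovnkube or OVN code: the function and class names are hypothetical, and the 1500-byte physical MTU in the usage lines is an assumed value. The one number taken from this issue is the 100-byte gap between pod and physical MTU.

```python
from dataclasses import dataclass
from typing import Optional

# From the description: pod MTU is 100 bytes below the physical network MTU.
POD_MTU_GAP = 100


@dataclass(frozen=True)
class Verdict:
    forwarded: bool                 # packet delivered to the pod
    icmp_needs_frag: bool           # ICMP needs frag sent back instead
    advertised_mtu: Optional[int]   # MTU reported in the ICMP message, if any


def pod_mtu(physical_mtu: int, gap: int = POD_MTU_GAP) -> int:
    """Pod MTU derived from the physical MTU (hypothetical helper)."""
    return physical_mtu - gap


def handle_reply(packet_len: int, physical_mtu: int, dont_fragment: bool) -> Verdict:
    """Model what happens to an external reply headed for a pod.

    dont_fragment is accepted only to show that it does not matter:
    OVS cannot fragment, so an oversized packet always yields needs frag.
    """
    limit = pod_mtu(physical_mtu)
    if packet_len <= limit:
        return Verdict(forwarded=True, icmp_needs_frag=False, advertised_mtu=None)
    return Verdict(forwarded=False, icmp_needs_frag=True, advertised_mtu=limit)


if __name__ == "__main__":
    # Assumed 1500-byte physical network; a full-size reply with DF=0.
    print(handle_reply(1500, physical_mtu=1500, dont_fragment=False))
```

      Under these assumptions a 1500-byte reply is rejected with an advertised MTU of 1400 whether or not the "don't fragment" bit is set, which is the case the workarounds try to avoid.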


      We have various workarounds in place that mitigate the issue, such as setting a 1400 MTU on host routes towards the service network and recommending that customers use local gateway mode, which leverages the kernel to do fragmentation. However, we need to address this more holistically.


      Our current line of thinking is that we can make the pod MTUs the same as the physical network, thus eliminating a difference in MTU boundary between pods and the physical network. This means the pod egress and egress reply traffic can operate at the higher MTU, which will also improve throughput.
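
      As a rough back-of-the-envelope check on the throughput point, the sketch below compares the TCP payload carried per packet at the two MTUs. It assumes IPv4 and TCP headers without options (20 bytes each) and an example 1500-byte physical MTU; it says nothing about measured throughput, only about per-packet payload.

```python
IPV4_HEADER = 20  # bytes, assuming no IP options
TCP_HEADER = 20   # bytes, assuming no TCP options


def tcp_payload(mtu: int) -> int:
    """Largest TCP payload that fits in one packet of the given MTU."""
    return mtu - IPV4_HEADER - TCP_HEADER


def payload_gain(old_mtu: int, new_mtu: int) -> float:
    """Relative increase in per-packet payload when moving to new_mtu."""
    return tcp_payload(new_mtu) / tcp_payload(old_mtu) - 1


if __name__ == "__main__":
    # Assumed example: pods at 1400 today, 1500 once they match the physical MTU.
    print(tcp_payload(1400), tcp_payload(1500))   # 1360 1460
    print(f"{payload_gain(1400, 1500):.1%}")      # 7.4%
```

      With these assumptions each egress packet carries roughly 7% more payload; the relative gain is smaller on jumbo-frame networks because the 100-byte gap is a smaller share of the packet.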


      The exception is packets routed over geneve. On this path, a pod sending a packet that is too large to another pod would result in ICMP needs frag generated by the geneve kernel module. We need support from OVN to route these back to the pod:


      As an additional safeguard against pods sending packets that are too large in the first place, we can set per-route MTUs inside each pod on the routes towards the pod subnet and the service subnet.
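
      A minimal sketch of what such per-route MTUs could look like, expressed as `ip route` commands built in Python. Everything concrete here is an assumption for illustration: the helper name, the `eth0` interface, the gateway address, and the example pod and service CIDRs are hypothetical, and this is not how ovnkube actually programs pod routes. The `mtu` attribute on `ip route` is standard iproute2.

```python
import ipaddress
from typing import List


def pod_route_commands(
    physical_mtu: int,
    gateway: str,
    subnets: List[str],
    gap: int = 100,        # 100-byte gap taken from the description
    dev: str = "eth0",     # assumed pod interface name
) -> List[List[str]]:
    """Build `ip route replace` argv lists that cap the MTU per destination.

    The pod interface itself would stay at physical_mtu; only routes that
    traverse geneve (pod and service subnets) get the lower MTU.
    """
    route_mtu = physical_mtu - gap
    commands = []
    for cidr in subnets:
        net = ipaddress.ip_network(cidr)  # raises ValueError on a bad CIDR
        commands.append(
            ["ip", "route", "replace", str(net),
             "via", gateway, "dev", dev, "mtu", str(route_mtu)]
        )
    return commands


if __name__ == "__main__":
    # Hypothetical cluster: pod subnet 10.128.0.0/14, service subnet 172.30.0.0/16.
    for argv in pod_route_commands(1500, "10.128.0.1",
                                   ["10.128.0.0/14", "172.30.0.0/16"]):
        print(" ".join(argv))
```

      The sketch only builds the commands; running them would need to happen inside the pod's network namespace with sufficient privileges.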


      After these changes, the only path that can still result in the MTU being lowered is ingress traffic that hits a service and is proxied to a pod on another node (for example, a NodePort service). In this case, the ICMP needs frag is unavoidable.


      We will need to consider how upgrades will work here.

            rravaiol@redhat.com Riccardo Ravaioli
            trozet@redhat.com Tim Rozet
            Huiran Wang