OpenShift Bugs / OCPBUGS-33960

High Egress IP failover latency during scale testing


    • Sprint: SDN Sprint 253, SDN Sprint 254
    • Release Note Text:

      * Previously, in certain situations, when an Egress IP address was transferred from one node to another node, the OVN-Kubernetes network failed to send gratuitous Address Resolution Protocol (ARP) requests to peers to inform them of the new node's medium access control (MAC) address. As a result, peers temporarily sent reply traffic to the old node, which increased failover time. With this release, the OVN-Kubernetes network correctly sends a gratuitous ARP to peers to announce the MAC address of the new Egress IP node, so that peers send reply traffic to the new node without the added failover latency. (link:https://issues.redhat.com/browse/OCPBUGS-33960[*OCPBUGS-33960*])
    • Release Note Type: Bug Fix
    • Done
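
      The release note above describes the missing gratuitous ARP. As a rough way to check the behavior on a live cluster during an Egress IP failover, the gratuitous ARP and the resulting neighbour-cache update can be observed from a peer host. This is only a sketch; the interface name and Egress IP address are placeholders, not values from this bug.

      # On a peer host on the same L2 segment, watch for the gratuitous ARP that
      # announces the Egress IP from the new node's MAC address during failover.
      sudo tcpdump -ni ens1f0 'arp and host 192.168.111.50'

      # After failover, the peer's neighbour cache should resolve the Egress IP
      # to the new node's MAC address.
      ip -4 neigh show | grep 192.168.111.50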

      This is a clone of issue OCPBUGS-32161. The following is the description of the original issue:

      Description of problem:

          We created 24,000 EIPs for 24,000 pods (each namespace has 1 EIP and 1 pod) on a 120-node bare-metal environment, then failed over the node that hosts 200 EIPs by blocking port 9107 with iptables. We observed high pod connection latencies, ranging from 221 msec to 41 sec, for the pods whose EIPs failed over to other nodes.
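
      For context, each namespace in this test is selected by its own EgressIP object. The sketch below shows what one such namespace/EIP pair could look like; the object name, namespace name, and address are placeholders rather than values from the test, and egress nodes are assumed to carry the k8s.ovn.org/egress-assignable label.

      # egressip-client-1.yaml -- illustrative sketch only
      apiVersion: k8s.ovn.org/v1
      kind: EgressIP
      metadata:
        name: egressip-client-1
      spec:
        egressIPs:
          - 192.168.111.50                      # placeholder address
        namespaceSelector:
          matchLabels:
            kubernetes.io/metadata.name: client-1

      # Apply with: oc apply -f egressip-client-1.yaml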
      
      
      Pod                                EIP failover latency
      client-1-13103-78c6585bbb-jkr8h4   41.0 sec
      client-1-2777-7d86cd47bf-djgnf     38.0 sec
      client-1-2609-79cfd5ff55-7z446     23.2 sec
      client-1-22868-7bf96cd49-fjrtj     16.0 sec
      client-1-23491-56f499cc69-w5hbr    9.01 sec
      client-1-11301-78b5bbc987-vrs8s    9.01 sec
      client-1-6098-64b7d9d4f4-b62zm     2.00 sec
      client-1-22599-5975f8bdc4-hgng2    2.00 sec
      client-1-15570-86b979d584-j7cpb    221 msec
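
      The latencies above were collected by the scale-test tooling. As a rough standalone way to approximate the same measurement, each client pod can repeatedly open a TCP connection to an external target and log timestamps; the failover latency is the length of the window of failed attempts around the simulated node failure. This is only a sketch, not the tooling used for the numbers above, and the external address and port are placeholders.

      # Run inside a client pod: one connection attempt per second to an external
      # target; the gap between the last "ok" before the failure and the first
      # "ok" after it approximates the EIP failover latency for this pod.
      while true; do
        if timeout 1 bash -c '</dev/tcp/198.51.100.10/8080' 2>/dev/null; then
          echo "$(date +%s.%N) ok"
        else
          echo "$(date +%s.%N) fail"
        fi
        sleep 1
      done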

      CPU usage and OVS flow metrics are available in the Grafana dashboard: https://grafana.rdu2.scalelab.redhat.com:3000/d/FwPsenbaa/kube-burner-report-eip?orgId=1&from=1712835501022&to=1712857101023&var-Datasource=AWS+Pro+-+ripsaw-kube-burner&var-workload=egressip&var-uuid=7f8a09af-8ed6-4027-bbc7-0583aa18db10&var-master=f20-h02-000-r640.rdu2.scalelab.redhat.com&var-worker=f20-h11-000-r640.rdu2.scalelab.redhat.com&var-infra=f36-h10-000-r640.rdu2.scalelab.redhat.com&var-namespace=All&var-latencyPercentile=P99

      must-gather: http://storage.scalelab.redhat.com/anilvenkata/eip_failover_mg/must-gather.local.2880304935723177257.tgz

      All the resources were already created before we initiated the node failover. The node on which port 9107 is blocked also hosts 200 pods and 200 EIPs. To simulate the failover, we only issued the following iptables command to block port 9107:

      sudo iptables -A INPUT -p tcp --dport 9107 -j DROP

      We did not delete any conntrack entries, OVS flows, or other state as part of the failover simulation.
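
      For completeness, a sketch of the full simulation cycle around the command above is shown below; it is not part of the original report. Port 9107 is the OVN-Kubernetes Egress IP node health-check port, so dropping inbound traffic to it makes the node appear unreachable to the Egress IP controller without touching conntrack entries or OVS flows.

      # Start the simulated failure on the chosen egress node (as above).
      sudo iptables -A INPUT -p tcp --dport 9107 -j DROP

      # From a host with cluster access, watch the 200 EIPs being re-assigned
      # to other nodes while the failure is in effect.
      oc get egressip -w

      # End the simulated failure by removing the rule again.
      sudo iptables -D INPUT -p tcp --dport 9107 -j DROP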
