[OCPBUGS-34570] High Egress IP failover latency during scale testing

Type: Bug
Resolution: Done-Errata
Priority: Normal
Fix Version/s: 4.14.z
Affects Version/s: 4.16
Component/s: Networking / ovn-kubernetes
Labels:
- SDN:Backport
- SDN:OVNK:EgressIP

Regression:
No
Sprint:
SDN Sprint 254, SDN Sprint 255
sprint_count:
2
Release Blocker:
Rejected
Blocked:
False
Blocked Reason:

None
Release Note Text:

* Previously, certain situations caused the transfer of an Egress IP address from one node to a different node to fail, and this failure impacted the OVN-Kubernetes network. The network failed to send gratuitous Address Resolution Protocol (ARP) requests to peers to inform them of the new node’s medium access control (MAC) address. As a result, peers temporarily sent reply traffic to the old node, which led to long failover times. With this release, the OVN-Kubernetes network correctly sends a gratuitous ARP to peers to inform them of the MAC address of the new Egress IP node, so that each peer sends reply traffic to the new node and failover is no longer delayed. (link:https://issues.redhat.com/browse/OCPBUGS-34570[*OCPBUGS-34570*])
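
To illustrate what the release note means by a gratuitous ARP, here is a minimal sketch in Python that only builds the frame bytes. It is not the OVN-Kubernetes implementation; the function names, the example address 192.0.2.10, and the example MAC 02:00:00:00:00:01 are assumptions chosen for illustration. The defining property is that the sender and target protocol addresses are both the Egress IP and the frame is broadcast, so peers that already hold an entry for that IP can refresh it with the new node's MAC.

```python
import socket
import struct

BROADCAST_MAC = b"\xff" * 6
ETHERTYPE_ARP = 0x0806


def mac_to_bytes(mac):
    """Convert 'aa:bb:cc:dd:ee:ff' into 6 raw bytes."""
    return bytes(int(part, 16) for part in mac.split(":"))


def build_garp_request(egress_ip, node_mac):
    """Build an Ethernet frame carrying a gratuitous ARP request.

    Hypothetical helper: sender and target protocol addresses are both
    the Egress IP, and the Ethernet destination is the broadcast MAC.
    """
    sender_mac = mac_to_bytes(node_mac)
    ip = socket.inet_aton(egress_ip)
    ethernet = BROADCAST_MAC + sender_mac + struct.pack("!H", ETHERTYPE_ARP)
    arp = struct.pack(
        "!HHBBH6s4s6s4s",
        1,            # hardware type: Ethernet
        0x0800,       # protocol type: IPv4
        6,            # hardware address length
        4,            # protocol address length
        1,            # opcode: request
        sender_mac,   # sender hardware address (the new node)
        ip,           # sender protocol address (the Egress IP)
        b"\x00" * 6,  # target hardware address (unset in a request)
        ip,           # target protocol address (the same Egress IP)
    )
    return ethernet + arp


# Example values are placeholders, not taken from the report.
frame = build_garp_request("192.0.2.10", "02:00:00:00:00:01")
print(len(frame), frame.hex())
```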

Release Note Type:
Bug Fix
Release Note Status:
Done
Target Version:

4.14.z

SFDC Cases Counter:
SFDC Cases Open:
SFDC Cases Links:

This is a clone of issue OCPBUGS-33960. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-32161. The following is the description of the original issue:
—
Description of problem:

    We created 24,000 EIPs for 24,000 pods (each namespace has 1 EIP and 1 pod) on a 120-node bare-metal environment. We then failed over a node that hosts 200 EIPs by blocking port 9107 with iptables, and observed high pod connection latencies (ranging from 221 msec to 41 sec) for the pods whose EIPs failed over to other nodes.

Pod                                 EIP failover latency
client-1-13103-78c6585bbb-jkr8h4    41.0 sec
client-1-2777-7d86cd47bf-djgnf      38.0 sec
client-1-2609-79cfd5ff55-7z446      23.2 sec
client-1-22868-7bf96cd49-fjrtj      16.0 sec
client-1-23491-56f499cc69-w5hbr     9.01 sec
client-1-11301-78b5bbc987-vrs8s     9.01 sec
client-1-6098-64b7d9d4f4-b62zm      2.00 sec
client-1-22599-5975f8bdc4-hgng2     2.00 sec
client-1-15570-86b979d584-j7cpb     221 msec
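
Because the table mixes seconds and milliseconds, a quick way to compare the samples is to normalize them to seconds first. The following is a minimal sketch, assuming only the nine values listed above; the names `REPORTED`, `UNIT_DIVISOR`, and `to_seconds` are illustrative and not part of any test tooling used in the report.

```python
import statistics

# Latencies copied from the table above, in the order listed.
REPORTED = [
    "41.0 sec", "38.0 sec", "23.2 sec", "16.0 sec", "9.01 sec",
    "9.01 sec", "2.00 sec", "2.00 sec", "221 msec",
]

# Divisor that converts a value in the given unit into seconds.
UNIT_DIVISOR = {"sec": 1, "msec": 1000}


def to_seconds(text):
    """Parse a string such as '221 msec' into a float number of seconds."""
    value, unit = text.split()
    return float(value) / UNIT_DIVISOR[unit]


latencies = [to_seconds(entry) for entry in REPORTED]
print(
    f"min={min(latencies):.3f}s "
    f"median={statistics.median(latencies):.2f}s "
    f"max={max(latencies):.1f}s"
)
```

These nine pods are only the sample shown in the report, so the summary describes the sample and not all 200 EIPs that failed over.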

CPU usage and OVS flow metrics are available in the Grafana dashboard: https://grafana.rdu2.scalelab.redhat.com:3000/d/FwPsenbaa/kube-burner-report-eip?orgId=1&from=1712835501022&to=1712857101023&var-Datasource=AWS+Pro+-+ripsaw-kube-burner&var-workload=egressip&var-uuid=7f8a09af-8ed6-4027-bbc7-0583aa18db10&var-master=f20-h02-000-r640.rdu2.scalelab.redhat.com&var-worker=f20-h11-000-r640.rdu2.scalelab.redhat.com&var-infra=f36-h10-000-r640.rdu2.scalelab.redhat.com&var-namespace=All&var-latencyPercentile=P99

must-gather: http://storage.scalelab.redhat.com/anilvenkata/eip_failover_mg/must-gather.local.2880304935723177257.tgz

All the resources were already created before we triggered the node failover. The node on which port 9107 is blocked hosts 200 pods and 200 EIPs. The only action we took was to block port 9107 with the following iptables command:

sudo iptables -A INPUT -p tcp --dport 9107 -j DROP

We did not delete any conntrack entries, OVS flows, or similar state to simulate the failover.
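
For anyone scripting this simulation, the sketch below builds the same rule as an argument list and also the matching delete rule. The report does not say how the rule was removed afterwards; using `-D` with the identical rule specification is an assumption about how the block could be reverted, and `iptables_rule` and `BLOCKED_PORT` are hypothetical names. The code only prints the commands and does not execute them.

```python
import shlex

BLOCKED_PORT = 9107  # port blocked in the report


def iptables_rule(action, port=BLOCKED_PORT):
    """Return the argv for appending ('block') or deleting ('unblock') the DROP rule."""
    flags = {"block": "-A", "unblock": "-D"}
    return [
        "sudo", "iptables", flags[action], "INPUT",
        "-p", "tcp", "--dport", str(port), "-j", "DROP",
    ]


# Print both commands so they can be reviewed before being run by hand.
print(shlex.join(iptables_rule("block")))
print(shlex.join(iptables_rule("unblock")))
```

Building the list first keeps the block and unblock rules identical apart from the `-A`/`-D` flag, which matters because `iptables -D` removes a rule only when the specification matches.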

clones

OCPBUGS-33960 High Egress IP failover latency during scale testing

Closed

is blocked by

OCPBUGS-33960 High Egress IP failover latency during scale testing

Closed

links to

openshift/ovn-kubernetes#2188: [release-4.14] OCPBUGS-34570: Egressip garp fix 4.15

RHBA-2024:3881 OpenShift Container Platform 4.14.z bug fix update

Errata Tool added a comment - 2024/06/19 2:37 PM

Since the problem described in this issue should be resolved in a recent advisory, it has been closed.

For information on the advisory (Moderate: OpenShift Container Platform 4.14.30 bug fix and security update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHSA-2024:3881

Sachin Ninganure added a comment - 2024/06/12 6:06 AM

Verified!

Assignee: Jaime Caamaño Ruiz
Reporter: OpenShift Prow Bot
QA Contact: Sachin Ninganure
Votes: 0
Watchers: 7
Created: 2024/05/29 8:09 AM
Updated: 2025/03/19 8:41 AM
Resolved: 2024/06/19 2:37 PM
