Type: Bug
Resolution: Done-Errata
Priority: Undefined
Fix Version/s: 4.15.z
Affects Version/s: 4.16
Component/s: Networking / ovn-kubernetes
Labels:
- SDN:Backport

Regression:
No
Sprint:
SDN Sprint 253, SDN Sprint 254
sprint_count:
2
Blocked:
False
Blocked Reason:

None
Release Note Text:

* Previously, in certain situations, the transfer of an Egress IP address from one node to a different node failed, and this failure impacted the OVN-Kubernetes network. The network failed to send gratuitous Address Resolution Protocol (ARP) requests to peers to inform them of the new node's medium access control (MAC) address. As a result, peers temporarily sent reply traffic to the old node, which led to failover issues. With this release, the OVN-Kubernetes network correctly sends a gratuitous ARP to peers to inform them of the new Egress IP node MAC address, so that each peer can send reply traffic to the new node without causing failover time issues. (link:https://issues.redhat.com/browse/OCPBUGS-33960[*OCPBUGS-33960*])

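For readers unfamiliar with gratuitous ARP, the following is a minimal sketch of what such a frame looks like on the wire. It is not how OVN-Kubernetes emits the packet (this issue does not describe that); the function name `build_gratuitous_arp` and the MAC and IP values are hypothetical and chosen only for illustration. The defining property is that the sender and target protocol addresses are both the announced IP, so peers update their ARP cache entry for that IP to the sender's MAC.

```python
import socket
import struct


def build_gratuitous_arp(mac: str, ip: str) -> bytes:
    """Return a 42-byte Ethernet frame carrying a gratuitous ARP request.

    Illustrative only: mac is the MAC of the node that now hosts the IP,
    ip is the address being announced.
    """
    mac_bytes = bytes.fromhex(mac.replace(":", ""))
    ip_bytes = socket.inet_aton(ip)
    broadcast = b"\xff" * 6

    # Ethernet header: broadcast destination, source MAC, EtherType 0x0806 (ARP).
    eth_header = broadcast + mac_bytes + struct.pack("!H", 0x0806)

    # Fixed ARP fields: htype=1 (Ethernet), ptype=0x0800 (IPv4),
    # hlen=6, plen=4, oper=1 (request).
    arp_fixed = struct.pack("!HHBBH", 1, 0x0800, 6, 4, 1)

    # Gratuitous ARP: sender IP equals target IP; target MAC is zeroed.
    arp_addrs = mac_bytes + ip_bytes + b"\x00" * 6 + ip_bytes

    return eth_header + arp_fixed + arp_addrs


# Hypothetical values: a locally administered MAC and a documentation-range IP.
frame = build_gratuitous_arp("02:00:00:00:00:01", "192.0.2.10")
print(len(frame), frame.hex())
```

Until peers receive such an announcement (or their ARP entry expires), they keep sending reply traffic to the MAC address of the old node.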
Release Note Type:
Bug Fix
Release Note Status:
Done
Target Version:

4.15.z

SFDC Cases Counter:
SFDC Cases Open:
SFDC Cases Links:

This is a clone of issue OCPBUGS-32161. The following is the description of the original issue:
—
Description of problem:

We created 24000 EIPs for 24000 pods (one EIP and one pod per namespace) on a 120-node bare-metal environment. We then failed over a node hosting 200 EIPs by blocking port 9107 with iptables and observed high pod connection latencies (ranging from 221 msec to 41 sec) for the pods whose EIPs failed over to other nodes.

Pod	EIP failover latency
client-1-13103-78c6585bbb-jkr8h4	41.0 sec
client-1-2777-7d86cd47bf-djgnf	38.0 sec
client-1-2609-79cfd5ff55-7z446	23.2 sec
client-1-22868-7bf96cd49-fjrtj	16.0 sec
client-1-23491-56f499cc69-w5hbr	9.01 sec
client-1-11301-78b5bbc987-vrs8s	9.01 sec
client-1-6098-64b7d9d4f4-b62zm	2.00 sec
client-1-22599-5975f8bdc4-hgng2	2.00 sec
client-1-15570-86b979d584-j7cpb	221 msec
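Because the sample mixes seconds and milliseconds, it can help to normalize the values before comparing them. The sketch below does that for the nine rows listed above and prints the minimum, median, and maximum. The helper name `to_seconds` is hypothetical, and the summary covers only these listed pods, not the full set of 200 failed-over EIPs.

```python
import statistics

# The nine sample rows from the report, as "pod value unit".
ROWS = """\
client-1-13103-78c6585bbb-jkr8h4 41.0 sec
client-1-2777-7d86cd47bf-djgnf 38.0 sec
client-1-2609-79cfd5ff55-7z446 23.2 sec
client-1-22868-7bf96cd49-fjrtj 16.0 sec
client-1-23491-56f499cc69-w5hbr 9.01 sec
client-1-11301-78b5bbc987-vrs8s 9.01 sec
client-1-6098-64b7d9d4f4-b62zm 2.00 sec
client-1-22599-5975f8bdc4-hgng2 2.00 sec
client-1-15570-86b979d584-j7cpb 221 msec
"""


def to_seconds(value: str, unit: str) -> float:
    """Convert a latency given in 'sec' or 'msec' to seconds."""
    if unit == "sec":
        return float(value)
    if unit == "msec":
        return float(value) / 1000.0
    raise ValueError(f"unknown unit: {unit}")


latencies = {}
for line in ROWS.splitlines():
    pod, value, unit = line.split()
    latencies[pod] = to_seconds(value, unit)

values = list(latencies.values())
print(f"min={min(values):.3f}s "
      f"median={statistics.median(values):.2f}s "
      f"max={max(values):.1f}s")
```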

CPU usage and OVS flow metrics are available in the Grafana dashboard: https://grafana.rdu2.scalelab.redhat.com:3000/d/FwPsenbaa/kube-burner-report-eip?orgId=1&from=1712835501022&to=1712857101023&var-Datasource=AWS+Pro+-+ripsaw-kube-burner&var-workload=egressip&var-uuid=7f8a09af-8ed6-4027-bbc7-0583aa18db10&var-master=f20-h02-000-r640.rdu2.scalelab.redhat.com&var-worker=f20-h11-000-r640.rdu2.scalelab.redhat.com&var-infra=f36-h10-000-r640.rdu2.scalelab.redhat.com&var-namespace=All&var-latencyPercentile=P99

must-gather: http://storage.scalelab.redhat.com/anilvenkata/eip_failover_mg/must-gather.local.2880304935723177257.tgz

All resources were created before we triggered the node failover. The node on which port 9107 is blocked hosts 200 pods and 200 EIPs. We only issued the following iptables command to block port 9107:

sudo iptables -A INPUT -p tcp --dport 9107 -j DROP

We did not delete any conntrack entries, OVS flows, or other state for the failover simulation.
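To repeat the simulation and then restore the node, the same rule can be removed with the `-D` form of the command. The sketch below only builds and prints the command lines; it does not execute them. The helper name `iptables_rule` and the constant `BLOCKED_PORT` are hypothetical, and reverting the rule is an assumption about a sensible cleanup step, not something this report describes.

```python
import shlex
from typing import List

BLOCKED_PORT = 9107  # port blocked in the report


def iptables_rule(action: str, port: int = BLOCKED_PORT) -> List[str]:
    """Build the iptables argv for the DROP rule used in the report.

    action: "-A" appends the rule, "-C" checks whether it exists,
    "-D" deletes it again.
    """
    if action not in ("-A", "-C", "-D"):
        raise ValueError(f"unsupported action: {action}")
    return ["sudo", "iptables", action, "INPUT",
            "-p", "tcp", "--dport", str(port), "-j", "DROP"]


# Print the commands rather than running them; run them by hand on the
# node under test, or pass the lists to subprocess.run if desired.
print(shlex.join(iptables_rule("-A")))  # start the simulated failure
print(shlex.join(iptables_rule("-D")))  # end it by removing the rule
```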

blocks

OCPBUGS-34570 High Egress IP failover latency during scale testing

Closed

clones

OCPBUGS-32161 High Egress IP failover latency during scale testing

Closed

is blocked by

OCPBUGS-32161 High Egress IP failover latency during scale testing

Closed

is cloned by

OCPBUGS-34570 High Egress IP failover latency during scale testing

Closed

links to

openshift/ovn-kubernetes#2175: OCPBUGS-33960: Egressip garp fix 4.15

RHSA-2024:3327 OpenShift Container Platform 4.15.z security update


Assignee: Jaime Caamaño Ruiz

Reporter: OpenShift Prow Bot

QA Contact: Sachin Ninganure

Votes: 0

Watchers: 6

Created: 2024/05/20 11:19 AM

Updated: 2024/05/30 8:45 AM

Resolved: 2024/05/29 3:43 PM
