- Bug
- Resolution: Done-Errata
- Undefined
- 4.16
- No
- SDN Sprint 253, SDN Sprint 254
- 2
- False
- Bug Fix
- Done
This is a clone of issue OCPBUGS-32161. The following is the description of the original issue:
—
Description of problem:
We created 24000 EIPs for 24000 pods (each namespace has 1 EIP and 1 pod) on a 120-node bare-metal environment. We then failed over a node hosting 200 EIPs by blocking port 9107 with iptables, and observed high pod connection latencies (ranging from 221 msec to 41 sec) for the pods whose EIPs failed over to other nodes.
| Pod | EIP failover latency |
|---|---|
| client-1-13103-78c6585bbb-jkr8h4 | 41.0 sec |
| client-1-2777-7d86cd47bf-djgnf | 38.0 sec |
| client-1-2609-79cfd5ff55-7z446 | 23.2 sec |
| client-1-22868-7bf96cd49-fjrtj | 16.0 sec |
| client-1-23491-56f499cc69-w5hbr | 9.01 sec |
| client-1-11301-78b5bbc987-vrs8s | 9.01 sec |
| client-1-6098-64b7d9d4f4-b62zm | 2.00 sec |
| client-1-22599-5975f8bdc4-hgng2 | 2.00 sec |
| client-1-15570-86b979d584-j7cpb | 221 msec |
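To compare the listed samples on one scale, the following minimal Python sketch converts the nine latencies from the table to seconds and reports the minimum, median, and maximum. The names `SAMPLES`, `to_seconds`, and `summarize` are illustrative and not part of the original report, and the summary covers only the pods listed here, not all 200 EIPs on the failed node.

```python
from statistics import median

# Latencies copied from the table above as (value, unit) pairs.
SAMPLES = [
    (41.0, "sec"), (38.0, "sec"), (23.2, "sec"), (16.0, "sec"),
    (9.01, "sec"), (9.01, "sec"), (2.00, "sec"), (2.00, "sec"),
    (221, "msec"),
]

# How many of each unit make up one second.
UNITS_PER_SECOND = {"sec": 1, "msec": 1000}


def to_seconds(value, unit):
    """Convert a latency given in 'sec' or 'msec' to seconds."""
    return value / UNITS_PER_SECOND[unit]


def summarize(samples):
    """Return min, median, and max latency in seconds."""
    secs = sorted(to_seconds(v, u) for v, u in samples)
    return {"min": secs[0], "median": median(secs), "max": secs[-1]}


summary = summarize(SAMPLES)
print(summary)
```

For these nine samples the spread between the fastest and slowest failover is roughly two orders of magnitude.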
CPU usage and OVS flow metrics are available in the Grafana dashboard: https://grafana.rdu2.scalelab.redhat.com:3000/d/FwPsenbaa/kube-burner-report-eip?orgId=1&from=1712835501022&to=1712857101023&var-Datasource=AWS+Pro+-+ripsaw-kube-burner&var-workload=egressip&var-uuid=7f8a09af-8ed6-4027-bbc7-0583aa18db10&var-master=f20-h02-000-r640.rdu2.scalelab.redhat.com&var-worker=f20-h11-000-r640.rdu2.scalelab.redhat.com&var-infra=f36-h10-000-r640.rdu2.scalelab.redhat.com&var-namespace=All&var-latencyPercentile=P99
must-gather: http://storage.scalelab.redhat.com/anilvenkata/eip_failover_mg/must-gather.local.2880304935723177257.tgz
All resources had already been created before we triggered the node failover. The node on which port 9107 is blocked hosts 200 pods and also has 200 EIPs. The only action we took was to issue the following iptables command to block port 9107:
sudo iptables -A INPUT -p tcp --dport 9107 -j DROP
We did not delete any conntrack entries, OVS flows, etc. for the failover simulation.
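The report does not describe how the per-pod failover latency was derived. As one possible approach (an assumption, not the method used in this test), a client could probe its egress path at a fixed interval and take the longest span between the last successful probe before a failure and the next successful probe. The function name `longest_outage` and the probe data below are hypothetical.

```python
def longest_outage(probes):
    """Return the longest outage in seconds.

    `probes` is a time-ordered list of (timestamp_seconds, ok) pairs.
    An outage spans from the last successful probe before a run of
    failures to the first successful probe after it. Returns 0.0 if
    no such bracketed run of failures exists.
    """
    longest = 0.0
    last_ok = None       # timestamp of the most recent successful probe
    saw_failure = False  # True if a failure occurred since last_ok
    for ts, ok in probes:
        if ok:
            if saw_failure and last_ok is not None:
                longest = max(longest, ts - last_ok)
            last_ok = ts
            saw_failure = False
        else:
            saw_failure = True
    return longest


# Hypothetical probe log: connectivity is lost after t=1.0 and restored at t=4.0.
example = [(0.0, True), (1.0, True), (2.0, False), (3.0, False),
           (4.0, True), (5.0, True)]
outage = longest_outage(example)
print(outage)
```

The resolution of such a measurement is bounded by the probe interval, so sub-second values like the 221 msec sample would require probing more often than once per second.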
- blocks: OCPBUGS-34570 High Egress IP failover latency during scale testing (Closed)
- clones: OCPBUGS-32161 High Egress IP failover latency during scale testing (Closed)
- is blocked by: OCPBUGS-32161 High Egress IP failover latency during scale testing (Closed)
- is cloned by: OCPBUGS-34570 High Egress IP failover latency during scale testing (Closed)
- links to: RHSA-2024:3327 OpenShift Container Platform 4.15.z security update