Bug
Resolution: Unresolved
Major
4.18.z
Quality / Stability / Reliability
False
Critical
Yes
CORENET Sprint 273, CORENET Sprint 274, CORENET Sprint 275, CORENET Sprint 277, CORENET Sprint 278, CORENET Sprint 279
6
Customer Escalated
Description of problem:
After upgrading an Azure Red Hat OpenShift (ARO) cluster to OCP version 4.18.13, nodes assigned EgressIP addresses respond to Azure Load Balancer health probes for ingress NodePort services. As a result, the Azure Load Balancer marks these EgressIP addresses as healthy backends for ingress traffic. When the load balancer forwards external connections to an EgressIP, connections are immediately refused (TCP RST), since no router pod is listening on the EgressIP. This results in intermittent “connection refused” errors for all OpenShift ingress routes exposed via the default ingress controller. This behavior was not present in OCP 4.17.25 (in ARO), where only the primary node IPs responded to health probes and served ingress traffic.
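The unexpected backends described above can be spotted by diffing the LB backend pool against the node primary IPs. A minimal sketch, assuming the IP lists are gathered beforehand (the values below are illustrative, modeled on the DEV environment figures later in this report; in a live cluster they would come from `oc get nodes -o wide` and the Azure Load Balancer backend pool):

```shell
# Sketch: flag LB backends that are not node primary IPs.
# Illustrative values; replace with real cluster/LB data.
node_ips="10.71.1.133 10.71.1.134"
lb_backends="10.71.1.133 10.71.1.134 10.71.1.245 10.71.1.239 10.71.1.240"
suspect=""
for b in $lb_backends; do
  case " $node_ips " in
    *" $b "*) ;;                    # primary node IP: a router pod may listen here
    *) suspect="$suspect $b" ;;     # likely an EgressIP with no listener
  esac
done
echo "Backends with no router listener:$suspect"
```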
Version-Release number of selected component (if applicable):
OpenShift Container Platform (OCP) 4.18.13, running on Azure Red Hat OpenShift (ARO)
How reproducible:
Not independently confirmed; presumed always reproducible after upgrading to OCP 4.18.13 when EgressIP addresses are assigned to nodes hosting router pods.
Steps to Reproduce:
1. Deploy an ARO cluster running OCP 4.18.13 multi-node.
2. Assign EgressIP addresses to nodes that also host default ingress router pods (label nodes with k8s.ovn.org/egress-assignable=true).
3. Observe Azure Load Balancer backend health for the ingress service. Note that both primary node IPs and EgressIPs are marked as healthy.
4. From an external client, attempt to access an application via a cluster ingress route repeatedly.
Actual results:
1. External access to ingress routes is intermittently refused (TCP RST) by the Azure Load Balancer.
2. Packet captures show that connections forwarded by the LB to an EgressIP are immediately refused, as no router pod is listening on the EgressIP.
3. Internal access from within the cluster always succeeds.
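Assuming the load balancer distributes new flows roughly uniformly across the backends it considers healthy, the observed intermittency matches simple arithmetic. A back-of-envelope sketch using the DEV backend counts from this report (5 backends marked "Up", 2 actually hosting router pods):

```shell
# With 5 backends marked "Up" but only 2 hosting router pods, roughly
# 3 out of 5 new external connections land on an EgressIP and get reset.
# This assumes uniform distribution; Azure LB hashing is flow-based,
# so the real rate is an approximation, not a guarantee.
total=5
listening=2
refused_pct=$(( (total - listening) * 100 / total ))
echo "expected refusal rate: ~${refused_pct}%"
```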
Expected results:
1. Only primary node IPs (of nodes with router pods) should respond to Azure LB health probes and be marked healthy for ingress.
2. EgressIP addresses should not respond to ingress health probes or be used as LB backends for ingress, since they are intended for outbound connections only.
3. External access to ingress routes should always succeed, with no intermittent connection refusals.
Additional info:
1. Removing the k8s.ovn.org/egress-assignable label from nodes hosting router pods resolves the issue, but disables EgressIP assignment for those nodes (not a viable long-term solution if SNAT/EgressIP is needed for workloads).
2. This regression is not present in OCP 4.17.25 (on ARO).
3. See attached packet captures, Azure LB backend health screenshots, and relevant oc get outputs for reference:

$ arping -I br-ex 10.71.1.254
(returned only 1 MAC address consistently)
Unicast reply from 10.71.1.254 [12:34:56:78:9A:BC] 1.193ms
Unicast reply from 10.71.1.254 [12:34:56:78:9A:BC] 1.072ms
Unicast reply from 10.71.1.254 [12:34:56:78:9A:BC] 1.250ms
Sent 873 probes (1 broadcast(s))
Received 873 response(s)

* Issue pattern observed:
- External access: Intermittent connection refused errors occur when connecting from any external source (jumphost VM, VPN connections, other subnets)
- Internal access: 100% success rate when connecting from within the load balancer subnet (worker nodes) or from pods inside the ARO cluster
- Scope: Affects all ingress routes, including the ARO API endpoints (DEV)

* Current Environment (ARO 4.18.13):
- 5 backends marked as "Up" for the default router: 10.71.1.133 (worker IP), 10.71.1.134 (worker IP), 10.71.1.245, 10.71.1.239, 10.71.1.240

* UAT Environment (ARO 4.17.25):
- 2 backends marked as "Up" for the default router: 10.71.5.135, 10.71.5.136 (router nodes' actual worker IPs only)

* Pcap checks revealed:

$ tshark -nr 0060-jumpbox-capture-20250610-000451.pcap -Y "tcp.stream == 13" -T fields -e tcp.stream -e ip.id -e ip.ttl -e tcp.time_delta -e frame.time -e ip.src -e ip.dst -e _ws.col.Protocol -e _ws.col.Info
13 0x8158 64 0.000000000 Jun 9, 2025 22:05:20.938123000 UTC 10.71.0.30 10.71.1.254 TCP 50366 → 443 [SYN] Seq=0 Win=64240 Len=0 MSS=1460 SACK_PERM TSval=2905919441 TSecr=0 WS=128
13 0x0000 64 0.001239000 Jun 9, 2025 22:05:20.939362000 UTC 10.71.1.254 10.71.0.30 TCP 443 → 50366 [RST, ACK] Seq=1 Ack=1 Win=0 Len=0

4. Cluster networking: OVN-Kubernetes.
5. The ingress service is configured with externalTrafficPolicy: Local.
6. Customer needs a safe path to upgrade their production ARO cluster to OCP 4.18.z without facing this issue.
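To quantify the refusals across a whole capture rather than a single stream, tshark field output like the excerpt above can simply be grepped for RST replies. A minimal sketch over sample lines mirroring the stream-13 excerpt (fields abbreviated for readability):

```shell
# Count RST replies in tshark field output; sample mirrors the
# stream-13 excerpt from this report, with trailing fields trimmed.
cat > /tmp/stream13.txt <<'EOF'
13 0x8158 64 0.000000000 10.71.0.30 10.71.1.254 TCP 50366 -> 443 [SYN]
13 0x0000 64 0.001239000 10.71.1.254 10.71.0.30 TCP 443 -> 50366 [RST, ACK]
EOF
rst_count=$(grep -c '\[RST' /tmp/stream13.txt)
echo "RST replies seen: $rst_count"
```

Against a real capture, the same grep over the full `tshark -T fields` output gives the RST count per stream or per destination IP.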
UPDATE: attaching KCS: https://access.redhat.com/solutions/7128717