Bug
Resolution: Unresolved
Major
4.18.z
Quality / Stability / Reliability
False
Critical
Yes
CORENET Sprint 273, CORENET Sprint 274, CORENET Sprint 275, CORENET Sprint 277, CORENET Sprint 278, CORENET Sprint 279
6
Customer Escalated
Description of problem:
After upgrading an Azure Red Hat OpenShift (ARO) cluster to OCP version 4.18.13, nodes assigned EgressIP addresses respond to Azure Load Balancer health probes for ingress NodePort services. As a result, the Azure Load Balancer marks these EgressIP addresses as healthy backends for ingress traffic. When the load balancer forwards external connections to an EgressIP, connections are immediately refused (TCP RST), since no router pod is listening on the EgressIP. This results in intermittent “connection refused” errors for all OpenShift ingress routes exposed via the default ingress controller. This behavior was not present in OCP 4.17.25 (in ARO), where only the primary node IPs responded to health probes and served ingress traffic.
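The unexpected backends described above can be spotted by diffing the LB backend pool against the node primary IPs. A minimal sketch, assuming the IP lists are gathered beforehand (the values below are illustrative, modeled on the DEV environment figures later in this report; in a live cluster they would come from `oc get nodes -o wide` and the Azure Load Balancer backend pool):

```shell
# Sketch: flag LB backends that are not node primary IPs.
# Illustrative values; replace with real cluster/LB data.
node_ips="10.71.1.133 10.71.1.134"
lb_backends="10.71.1.133 10.71.1.134 10.71.1.245 10.71.1.239 10.71.1.240"
suspect=""
for b in $lb_backends; do
  case " $node_ips " in
    *" $b "*) ;;                    # primary node IP: a router pod may listen here
    *) suspect="$suspect $b" ;;     # likely an EgressIP with no listener
  esac
done
echo "Backends with no router listener:$suspect"
```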
Version-Release number of selected component (if applicable):
OpenShift Container Platform (OCP) 4.18.13, running on Azure Red Hat OpenShift (ARO)
How reproducible:
Not independently confirmed; presumed always reproducible after upgrading to OCP 4.18.13 when EgressIP addresses are assigned to nodes hosting router pods.
Steps to Reproduce:
1. Deploy an ARO cluster running OCP 4.18.13 multi-node.
2. Assign EgressIP addresses to nodes that also host default ingress router pods (label nodes with k8s.ovn.org/egress-assignable=true).
3. Observe Azure Load Balancer backend health for the ingress service. Note that both primary node IPs and EgressIPs are marked as healthy.
4. From an external client, attempt to access an application via a cluster ingress route repeatedly.
Actual results:
1. External access to ingress routes is intermittently refused (TCP RST) by the Azure Load Balancer.
2. Packet captures show that connections forwarded by the LB to an EgressIP are immediately refused, as no router pod is listening on the EgressIP.
3. Internal access from within the cluster always succeeds.
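Assuming the load balancer distributes new flows roughly uniformly across the backends it considers healthy, the observed intermittency matches simple arithmetic. A back-of-envelope sketch using the DEV backend counts from this report (5 backends marked "Up", 2 actually hosting router pods):

```shell
# With 5 backends marked "Up" but only 2 hosting router pods, roughly
# 3 out of 5 new external connections land on an EgressIP and get reset.
# This assumes uniform distribution; Azure LB hashing is flow-based,
# so the real rate is an approximation, not a guarantee.
total=5
listening=2
refused_pct=$(( (total - listening) * 100 / total ))
echo "expected refusal rate: ~${refused_pct}%"
```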
Expected results:
1. Only primary node IPs (of nodes with router pods) should respond to Azure LB health probes and be marked healthy for ingress.
2. EgressIP addresses should not respond to ingress health probes or be used as LB backends for ingress, since they are intended for outbound connections only.
3. External access to ingress routes should always succeed, with no intermittent connection refusals.
Additional info:
1. Removing the k8s.ovn.org/egress-assignable label from nodes hosting router pods resolves the issue, but disables EgressIP assignment for those nodes (not a viable long-term solution if SNAT/EgressIP is needed for workloads).
2. This regression is not present in OCP 4.17.25 (on ARO).
3. See attached packet captures, Azure LB backend health screenshots, and relevant oc get outputs for reference:

$ arping -I br-ex 10.71.1.254
(returned only 1 MAC address consistently)
Unicast reply from 10.71.1.254 [12:34:56:78:9A:BC] 1.193ms
Unicast reply from 10.71.1.254 [12:34:56:78:9A:BC] 1.072ms
Unicast reply from 10.71.1.254 [12:34:56:78:9A:BC] 1.250ms
Sent 873 probes (1 broadcast(s))
Received 873 response(s)

* Issue pattern observed:
- External access: Intermittent connection refused errors occur when connecting from any external source (jumphost VM, VPN connections, other subnets)
- Internal access: 100% success rate when connecting from within the load balancer subnet (worker nodes) or from pods inside the ARO cluster
- Scope: Affects all ingress routes, including the ARO API endpoints (DEV)

* Current Environment (ARO 4.18.13):
- 5 backends marked as "Up" for the default router: 10.71.1.133 (worker IP), 10.71.1.134 (worker IP), 10.71.1.245, 10.71.1.239, 10.71.1.240

* UAT Environment (ARO 4.17.25):
- 2 backends marked as "Up" for the default router: 10.71.5.135, 10.71.5.136 (router nodes' actual worker IPs only)

* Pcap checks revealed:

$ tshark -nr 0060-jumpbox-capture-20250610-000451.pcap -Y "tcp.stream == 13" -T fields -e tcp.stream -e ip.id -e ip.ttl -e tcp.time_delta -e frame.time -e ip.src -e ip.dst -e _ws.col.Protocol -e _ws.col.Info
13 0x8158 64 0.000000000 Jun 9, 2025 22:05:20.938123000 UTC 10.71.0.30 10.71.1.254 TCP 50366 → 443 [SYN] Seq=0 Win=64240 Len=0 MSS=1460 SACK_PERM TSval=2905919441 TSecr=0 WS=128
13 0x0000 64 0.001239000 Jun 9, 2025 22:05:20.939362000 UTC 10.71.1.254 10.71.0.30 TCP 443 → 50366 [RST, ACK] Seq=1 Ack=1 Win=0 Len=0

4. Cluster networking: OVN-Kubernetes.
5. The ingress service is configured with externalTrafficPolicy: Local.
6. Customer needs a safe path to upgrade their production ARO cluster to OCP 4.18.z without facing this issue.
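To quantify the refusals across a whole capture rather than a single stream, tshark field output like the excerpt above can simply be grepped for RST replies. A minimal sketch over sample lines mirroring the stream-13 excerpt (fields abbreviated for readability):

```shell
# Count RST replies in tshark field output; sample mirrors the
# stream-13 excerpt from this report, with trailing fields trimmed.
cat > /tmp/stream13.txt <<'EOF'
13 0x8158 64 0.000000000 10.71.0.30 10.71.1.254 TCP 50366 -> 443 [SYN]
13 0x0000 64 0.001239000 10.71.1.254 10.71.0.30 TCP 443 -> 50366 [RST, ACK]
EOF
rst_count=$(grep -c '\[RST' /tmp/stream13.txt)
echo "RST replies seen: $rst_count"
```

Against a real capture, the same grep over the full `tshark -T fields` output gives the RST count per stream or per destination IP.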
UPDATE: attaching KCS: https://access.redhat.com/solutions/7128717