  2. OCPBUGS-72569

EgressIP Test failures in 4.20, 4.21 && 4.22

    • Critical
    • Approved
    • CORENET Sprint 282

      Description of problem:

      Slack thread

      EgressIP test failures in 4.20, 4.21, and 4.22 were first seen in the following payloads:

      4.20.0-0.nightly-2026-01-09-015458
      4.21.0-0.nightly-2026-01-09-013516
      4.22.0-0.nightly-2026-01-09-023214
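
To illustrate how close together these payloads were built, here is a minimal Python sketch that parses the three tags and computes the spread between the earliest and latest. It assumes the tag suffix is a build timestamp in `YYYY-MM-DD-HHMMSS` form; that layout is inferred from the tags themselves and is not confirmed in this report. The names `TAG_RE` and `parse_payload` are hypothetical helpers.

```python
import re
from datetime import datetime

# Payload tags copied from the report above.
PAYLOADS = [
    "4.20.0-0.nightly-2026-01-09-015458",
    "4.21.0-0.nightly-2026-01-09-013516",
    "4.22.0-0.nightly-2026-01-09-023214",
]

# Assumed layout: <x.y.z>-0.nightly-<YYYY-MM-DD>-<HHMMSS>
TAG_RE = re.compile(r"^(\d+\.\d+)\.\d+-0\.nightly-(\d{4}-\d{2}-\d{2}-\d{6})$")


def parse_payload(tag):
    """Split a nightly payload tag into (release stream, build timestamp)."""
    match = TAG_RE.match(tag)
    if match is None:
        raise ValueError(f"unrecognized payload tag: {tag}")
    stream = match.group(1)
    built = datetime.strptime(match.group(2), "%Y-%m-%d-%H%M%S")
    return stream, built


parsed = dict(parse_payload(tag) for tag in PAYLOADS)
# Spread between the earliest and latest of the three builds.
window = max(parsed.values()) - min(parsed.values())
print(f"all three payloads were built within {window}")
```

If the timestamp assumption holds, all three streams first failed in payloads built on the same day, which fits the hypothesis of a shared image rebuild rather than a per-release code change.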

      We do not see a common code change across those releases, but the payloads did contain rebuilt images that we believe included a Go version bump.

      We saw a similar issue in OCPBUGS-72411, where a CVE fix in Go exposed a latent issue in CNO code that had worked before the CVE update.

      weliang1@redhat.com provided the following analysis of the failures:

      RCA on one 4.21 job:
      Executive Summary

      Four EgressIP test cases failed because pod traffic egressed with the node's internal IP address (10.0.171.187) instead of the assigned EgressIP (10.0.160.5). Despite correct EgressIP configuration, OVN-Kubernetes failed to SNAT traffic through the designated EgressIP node.

      Root Cause Analysis (RCA)

      Technical Breakdown:
      What Happened:
      EgressIP 10.0.160.5 was correctly assigned to node ip-10-0-165-72
      Test prober pod was scheduled on a different node: ip-10-0-171-187
      Expected: Pod traffic should be SNATed to EgressIP 10.0.160.5
      Actual: Traffic egressed with source IP 10.0.171.187 (the node's own IP)
      Result: Packet sniffer found map[10.0.171.187:10] instead of expected map[10.0.160.5:...]

      How EgressIP Should Work (Expected Flow)

      ┌─────────────────────────────────────────────────────────────────┐
      │ Step-by-Step Expected Traffic Flow: │
      └─────────────────────────────────────────────────────────────────┘

      1. Prober Pod (10.128.2.133) on Node ip-10-0-171-187

      │ HTTP GET to external target (54.68.26.160)


      2. OVN Logical Router on ip-10-0-171-187

      │ EgressIP policy match: Redirect to ip-10-0-165-72


      3. Geneve Tunnel to Node ip-10-0-165-72

      │ Traffic tunneled to EgressIP-designated node


      4. OVN SNAT on Node ip-10-0-165-72

      │ Source NAT: 10.128.2.133 → 10.0.160.5 (EgressIP)


      5. External Network via br-ex

      │ Packet egresses with source IP: 10.0.160.5 :white_check_mark:


      6. Packet Sniffer on ip-10-0-165-72 Captures Traffic

      └─ Expected: "10.0.160.5" in tcpdump logs

      What Actually Happened (Failure)

      ┌─────────────────────────────────────────────────────────────────┐
      │ ACTUAL (BROKEN) Traffic Flow: │
      └─────────────────────────────────────────────────────────────────┘

      1. Prober Pod (10.128.2.133) on Node ip-10-0-171-187

      │ HTTP GET to external target (54.68.26.160)


      2. OVN Logical Router on ip-10-0-171-187

      │ :x: EgressIP policy FAILED - No redirect to ip-10-0-165-72


      3. Default Route on Node ip-10-0-171-187

      │ Traffic uses normal egress path (not EgressIP)


      4. SNAT to Node's Own IP

      │ Source NAT: 10.128.2.133 → 10.0.171.187 (NODE IP, NOT EgressIP!)


      5. External Network via br-ex

      │ Packet egresses with source IP: 10.0.171.187 :x: WRONG!


      6. Packet Sniffer on ip-10-0-171-187 (wrong node!) Captures Traffic

      └─ Actual: "10.0.171.187" in tcpdump logs (FAILURE)
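
The difference between the two flows above comes down to whether the EgressIP redirect is applied on the pod's node. The following Python sketch is a toy model of that decision only, not OVN-Kubernetes code: `EgressAssignment`, `observed_source_ip`, and the `policy_applied` flag are hypothetical names introduced for illustration, and the IPs are the ones from this report.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class EgressAssignment:
    egress_ip: str    # EgressIP that matching pod traffic should be SNATed to
    egress_node: str  # node hosting that EgressIP


def observed_source_ip(pod_node_ip: str,
                       assignment: Optional[EgressAssignment],
                       policy_applied: bool) -> str:
    """Return the source IP an external observer would see in this toy model."""
    if assignment is not None and policy_applied:
        # Expected flow: traffic is redirected to the egress node and
        # SNATed to the EgressIP there.
        return assignment.egress_ip
    # Failure flow: traffic takes the default path and is SNATed to the
    # IP of the node the pod runs on.
    return pod_node_ip


assignment = EgressAssignment(egress_ip="10.0.160.5", egress_node="ip-10-0-165-72")
expected = observed_source_ip("10.0.171.187", assignment, policy_applied=True)
actual = observed_source_ip("10.0.171.187", assignment, policy_applied=False)
print(f"expected source: {expected}, observed in failing job: {actual}")
```

The model makes the symptom explicit: the assignment itself is correct in both cases, so the only input that differs between the expected and the failing flow is whether the redirect takes effect.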

      Evidence from the Logs

      Expected EgressIP Assignment:

      Line 1545: map[ip-10-0-165-72:[10.0.160.5] ip-10-0-171-187:[10.0.160.6]]
      Line 1556: Egress IP object does have all IPs for map[10.0.160.5:ip-10-0-165-72]

      :white_check_mark: EgressIP 10.0.160.5 correctly assigned to node ip-10-0-165-72

      Prober Pod Placement:

      Line 2022: prober-podplq5v: Successfully assigned to ip-10-0-171-187
      Line 2033: prober-podplq5v: eth0 [10.128.2.133/23]

      :white_check_mark: Pod scheduled on node ip-10-0-171-187 (different node - correct for test)

      Traffic Verification (FAILURE):

      Line 1710: Found map is: map[10.0.171.187:10]

      :x: Packet sniffer found traffic from 10.0.171.187 (the pod's node IP)
      :x: Expected to find traffic from 10.0.160.5 (the EgressIP)

      Test Timeout:

      Line 1557: Making sure that 10 requests with EgressIPs map[10.0.160.5:ip-10-0-165-72] were seen
      Line 2055: Timed out after 120.533s. Expected <bool>: false to be true

      :x: Test waited 120 seconds but never saw traffic from the EgressIP address
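
The traffic check in the evidence above can be reproduced when triaging other job logs. This is a minimal Python sketch, assuming the `Found map is: map[ip:count ...]` line keeps the Go map print format shown here and that the test needs at least the stated number of requests from the EgressIP (the exact comparison the test uses is an assumption). `parse_found_map` and `egress_ip_seen` are hypothetical helper names, not part of the test suite.

```python
import re

# Matches the Go-style map print, e.g. "map[10.0.171.187:10]".
MAP_RE = re.compile(r"map\[([^\]]*)\]")


def parse_found_map(line):
    """Parse a 'Found map is: map[ip:count ...]' log line into {ip: count}."""
    match = MAP_RE.search(line)
    if match is None:
        return {}
    counts = {}
    # Go prints map entries separated by spaces, each as key:value.
    for entry in match.group(1).split():
        ip, _, count = entry.rpartition(":")
        counts[ip] = int(count)
    return counts


def egress_ip_seen(line, egress_ip, want):
    """True if at least `want` requests were captured from `egress_ip`."""
    return parse_found_map(line).get(egress_ip, 0) >= want


log_line = "Line 1710: Found map is: map[10.0.171.187:10]"
print(parse_found_map(log_line))
print(egress_ip_seen(log_line, "10.0.160.5", want=10))
```

Applied to line 1710 of the failing job, the parsed map contains only the node IP, so the check for 10.0.160.5 fails, matching the timeout reported at line 2055.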

      Impacted tests

      [sig-network][Feature:EgressIP][apigroup:operator.openshift.io] [external-targets][apigroup:user.openshift.io][apigroup:security.openshift.io] pods should have the assigned EgressIPs and EgressIPs can be updated [Serial] [Suite:openshift/conformance/serial]
      
      [sig-network][Feature:EgressIP][apigroup:operator.openshift.io] [external-targets][apigroup:user.openshift.io][apigroup:security.openshift.io] pods should keep the assigned EgressIPs when being rescheduled to another node [Serial] [Suite:openshift/conformance/serial]
      
      [sig-network][Feature:EgressIP][apigroup:operator.openshift.io] [external-targets][apigroup:user.openshift.io][apigroup:security.openshift.io] pods should have the assigned EgressIPs and EgressIPs can be deleted and recreated [Skipped:azure][apigroup:route.openshift.io] [Serial] [Suite:openshift/conformance/serial]
      
      [sig-network][Feature:EgressIP][apigroup:operator.openshift.io] [external-targets][apigroup:user.openshift.io][apigroup:security.openshift.io] only pods matched by the pod selector should have the EgressIPs [Serial] [Suite:openshift/conformance/serial]
      

      Version-Release number of selected component (if applicable):

      How reproducible:
      Permafailing AWS serial jobs in payloads for 4.20, 4.21, and 4.22.

      Examples
      4.21-e2e-aws-ovn-serial-1of2/2010377540270559232
      4.21-e2e-aws-ovn-techpreview-serial-1of3/2010389507437760512
      4.21-e2e-aws-ovn-techpreview-serial-2of3/2010375508918800384
      4.21-e2e-aws-ovn-techpreview-serial-3of3/2010375240890191872

              rhn-support-arghosh Arnab Ghosh
              rh-ee-fbabcock Forrest Babcock
              Weibin Liang Weibin Liang