OpenShift Bugs / OCPBUGS-18555

Failover tests for application fail after deletion of primary f5-tmm


    Status 09/16: Possible RCA provided. Waiting for the customer to test `net.ipv4.tcp_retries2 = 8` to see whether it resolves the issue.

      Description of problem:

      Customer has the following namespaces for their application: spk-rchltxekvzwcamf-y-ec-x-003 and rchltxekvzwcamf-y-ec-x-003
      
      This application is for handing off calls to different cell towers as the cell phone moves. 
      
      In the 'spk-rchltxekvzwcamf-y-ec-x-003' namespace there are two f5-tmm pods, and only one acts as the primary. If the primary dies, disappears, or is deleted, we expect traffic to fail over to the non-primary pod as soon as possible.
      
      From the customer's observations, every time they delete the primary f5-tmm pod in the 'spk-rchltxekvzwcamf-y-ec-x-003' namespace, the 'mobility' pods in the 'rchltxekvzwcamf-y-ec-x-003' namespace lose connectivity for about 16 minutes, because their application keeps trying to reach the old primary f5-tmm.
      
      During this event, customer sees the following log message in their alert system:
      
      
      11898 amfN14PathFailure 2023-08-31 10:45:52 amf_set_awareness_sm minor communications 11898 The AMF has lost the N14 connection towards AMF = 1: rchltxekvzwcamf-y-ec-x-002.amf.5gc.mnc480.mcc311.3gppnetwork.org.

      Version-Release number of selected component (if applicable):

      4.7.55

      How reproducible:

      This is reproducible every time by deleting the primary 'f5-tmm' pod.
      

      Steps to Reproduce:

      1. Find primary 'f5-tmm' pod in 'spk-rchltxekvzwcamf-y-ec-x-003' namespace
      2. Delete that pod
      3. Observe the alert 'The AMF has lost the N14 connection towards AMF' (a scripted version of these steps is sketched below)
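
      For reference, a minimal sketch of the reproduction using the Python kubernetes client. The 'app=f5-tmm' label selector is an assumption (the report does not list the pod labels), and since the primary-election mechanism is not described here, the pod to delete is passed in explicitly after identifying the primary out of band:

          #!/usr/bin/env python3
          """Sketch: list the f5-tmm pods, delete the named one, and record the
          deletion time so it can be correlated with the N14 alert."""
          import sys
          from datetime import datetime, timezone

          from kubernetes import client, config

          NAMESPACE = "spk-rchltxekvzwcamf-y-ec-x-003"

          def main(pod_name: str) -> None:
              config.load_kube_config()  # uses the current oc/kubectl context
              v1 = client.CoreV1Api()

              # List the f5-tmm pods; 'app=f5-tmm' is an assumed label.
              pods = v1.list_namespaced_pod(NAMESPACE, label_selector="app=f5-tmm")
              for pod in pods.items:
                  print(f"{pod.metadata.name}\t{pod.status.phase}\t{pod.status.pod_ip}")

              # Delete the pod currently acting as primary (identified out of band).
              v1.delete_namespaced_pod(name=pod_name, namespace=NAMESPACE)
              print(f"deleted {pod_name} at {datetime.now(timezone.utc).isoformat()}")

          if __name__ == "__main__":
              main(sys.argv[1])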

      Actual results:

      Connections from the mobility pods hang for about 16 minutes before the issue auto-mitigates

      Expected results:

      Near-instant failover 

      Additional info:

      I initially thought this might be related to stale UDP conntrack entries, but I no longer believe that is the case: we ran the following script and it returned no stale entries: https://github.com/RHsyseng/openshift-checks/blob/main/scripts/ovn_cleanConntrack.sh
      
      As a side note, there is an 'spk-coredns' namespace that provides the NAT46 translation required for the pods to resolve these FQDNs.
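
      Regarding the possible RCA in the status note above: the ~16-minute outage lines up with the kernel's TCP retransmission timeout. With the default net.ipv4.tcp_retries2 = 15, an established connection whose peer has silently vanished keeps being retransmitted for roughly 924 seconds (the hypothetical figure quoted in the kernel ip-sysctl documentation) before being aborted, which is about when the mobility pods would give up on the old f5-tmm and reconnect. A rough back-of-the-envelope sketch, assuming the RTO starts at its 200 ms minimum (the real window depends on the measured RTT):

          # Approximate worst-case time an established TCP connection keeps
          # retransmitting before the kernel aborts it: each wait doubles from
          # TCP_RTO_MIN (200 ms) up to the TCP_RTO_MAX cap (120 s), and
          # tcp_retries2 bounds the number of retransmissions.
          TCP_RTO_MIN = 0.2    # seconds
          TCP_RTO_MAX = 120.0  # seconds

          def retransmission_window(tcp_retries2: int) -> float:
              return sum(min(TCP_RTO_MIN * 2 ** k, TCP_RTO_MAX)
                         for k in range(tcp_retries2 + 1))

          for retries in (15, 8):
              t = retransmission_window(retries)
              print(f"tcp_retries2={retries}: ~{t:.1f}s (~{t / 60:.1f} min)")

          # tcp_retries2=15: ~924.6s (~15.4 min)  <- matches the observed ~16 minutes
          # tcp_retries2=8:  ~102.2s (~1.7 min)   <- the value the customer is testing

      If the hypothesis holds, lowering the value to 8 would shrink the window to roughly 100 seconds, which is the RFC 1122 minimum referenced in the kernel documentation.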
