-
Bug
-
Resolution: Obsolete
-
Major
-
None
-
4.7
-
No
-
CNF Network Sprint 241, CNF Network Sprint 242, CNF Network Sprint 243
-
3
-
Rejected
-
False
-
-
-
09/16 Possible RCA provided. Waiting for the customer to test `net.ipv4.tcp_retries2 = 8` to see whether it resolves the issue.
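If the possible RCA is the TCP retransmission timeout, the arithmetic below shows why the outage would last roughly 16 minutes and why lowering `net.ipv4.tcp_retries2` should shorten it. This is a rough sketch, assuming the Linux defaults of a 200 ms minimum RTO, a 120 s maximum RTO, and `tcp_retries2 = 15`. The function name is hypothetical, and the real timeout depends on the RTT measured on the connection.

```python
def hypothetical_timeout(retries, rto_min=0.2, rto_max=120.0):
    """Approximate seconds before the kernel drops an unacknowledged TCP connection.

    Assumes the RTO starts at rto_min and doubles on each retransmission,
    capped at rto_max (assumed Linux defaults: 200 ms and 120 s).
    """
    # `retries` retransmissions plus the final wait give retries + 1 backoff intervals.
    return sum(min(rto_min * 2 ** i, rto_max) for i in range(retries + 1))


for retries in (15, 8):
    seconds = hypothetical_timeout(retries)
    print(f"tcp_retries2={retries}: ~{seconds:.1f} s (~{seconds / 60:.1f} min)")
```

Under these assumptions the default of 15 gives about 924.6 s (around 15.4 minutes), which is close to the observed 16 minutes, while a value of 8 gives about 102.2 s.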
-
Description of problem:
The customer has the following namespaces for their application: 'spk-rchltxekvzwcamf-y-ec-x-003' and 'rchltxekvzwcamf-y-ec-x-003'. The application hands off calls to different cell towers as the cell phone moves.
In the 'spk-rchltxekvzwcamf-y-ec-x-003' namespace there are 2x f5-tmm pods, only one of which serves as the primary. If the primary dies, disappears, or is deleted, we expect traffic to fail over to the non-primary pod as soon as possible.
From the customer's observations, when they delete the primary-acting f5-tmm pod in the 'spk-rchltxekvzwcamf-y-ec-x-003' namespace, the 'mobility' pods in the 'rchltxekvzwcamf-y-ec-x-003' namespace lose connection for about 16 minutes every time, because their application keeps trying to reach the old primary 'f5-tmm'.
During this event, the customer sees the following log message in their alert system:
11898 amfN14PathFailure 2023-08-31 10:45:52 amf_set_awareness_sm minor communications 11898 The AMF has lost the N14 connection towards AMF = 1: rchltxekvzwcamf-y-ec-x-002.amf.5gc.mnc480.mcc311.3gppnetwork.org.
Version-Release number of selected component (if applicable):
4.7.55
How reproducible:
This is reproducible by deleting the 'f5-tmm' pod.
Steps to Reproduce:
1. Find the primary 'f5-tmm' pod in the 'spk-rchltxekvzwcamf-y-ec-x-003' namespace
2. Delete that pod
3. Observe the alert 'The AMF has lost the N14 connection towards AMF'
Actual results:
The 'mobility' pods lose connection for about 16 minutes before the issue auto-mitigates
Expected results:
Near-instant failover
Additional info:
I initially suspected stale UDP conntrack entries, but I no longer think that is the cause. We ran the following script and it returned no stale entries: https://github.com/RHsyseng/openshift-checks/blob/main/scripts/ovn_cleanConntrack.sh
As a side note, there is an 'spk-coredns' namespace that provides the NAT46 translations required for the pods to resolve these FQDNs.