OpenShift Bugs / OCPBUGS-18555

Failover tests for application fail after deletion of primary f5-tmm


    Status 09/16: Possible RCA provided. Waiting for the customer to test `net.ipv4.tcp_retries2 = 8` to see whether it resolves the issue.

      Description of problem:

      Customer has the following namespaces for their application: spk-rchltxekvzwcamf-y-ec-x-003 and rchltxekvzwcamf-y-ec-x-003
      
      This application is for handing off calls to different cell towers as the cell phone moves. 
      
      In the 'spk-rchltxekvzwcamf-y-ec-x-003' namespace there are two f5-tmm pods, and only one acts as the primary. If the primary dies, disappears, or is deleted, we expect traffic to fail over to the non-primary pod as soon as possible.
      
      From the customer's observations, every time they delete the primary f5-tmm pod in the 'spk-rchltxekvzwcamf-y-ec-x-003' namespace, the 'mobility' pods in the 'rchltxekvzwcamf-y-ec-x-003' namespace lose connectivity for about 16 minutes, because their application keeps trying to reach the old primary f5-tmm.
      
      During this event, customer sees the following log message in their alert system:
      
      
      11898 amfN14PathFailure 2023-08-31 10:45:52 amf_set_awareness_sm minor communications 11898 The AMF has lost the N14 connection towards AMF = 1: rchltxekvzwcamf-y-ec-x-002.amf.5gc.mnc480.mcc311.3gppnetwork.org.

      Version-Release number of selected component (if applicable):

      4.7.55

      How reproducible:

      This is reproducible every time by deleting the primary 'f5-tmm' pod.
      

      Steps to Reproduce:

      1. Find primary 'f5-tmm' pod in 'spk-rchltxekvzwcamf-y-ec-x-003' namespace
      2. Delete that pod
      3. Observe the alert 'The AMF has lost the N14 connection towards AMF' (a scripted version of these steps is sketched below)
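
      For reference, a minimal sketch of the reproduction using the Python kubernetes client. The 'app=f5-tmm' label selector is an assumption (the report does not list the pod labels), and since the primary-election mechanism is not described here, the pod to delete is passed in explicitly after identifying the primary out of band:

          #!/usr/bin/env python3
          """Sketch: list the f5-tmm pods, delete the named one, and record the
          deletion time so it can be correlated with the N14 alert."""
          import sys
          from datetime import datetime, timezone

          from kubernetes import client, config

          NAMESPACE = "spk-rchltxekvzwcamf-y-ec-x-003"

          def main(pod_name: str) -> None:
              config.load_kube_config()  # uses the current oc/kubectl context
              v1 = client.CoreV1Api()

              # List the f5-tmm pods; 'app=f5-tmm' is an assumed label.
              pods = v1.list_namespaced_pod(NAMESPACE, label_selector="app=f5-tmm")
              for pod in pods.items:
                  print(f"{pod.metadata.name}\t{pod.status.phase}\t{pod.status.pod_ip}")

              # Delete the pod currently acting as primary (identified out of band).
              v1.delete_namespaced_pod(name=pod_name, namespace=NAMESPACE)
              print(f"deleted {pod_name} at {datetime.now(timezone.utc).isoformat()}")

          if __name__ == "__main__":
              main(sys.argv[1])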

      Actual results:

      Connections from the mobility pods hang for about 16 minutes before the issue auto-mitigates

      Expected results:

      Near-instant failover 

      Additional info:

      I initially thought this might be related to stale UDP conntrack entries, but I no longer believe that is the case: we ran the following script and it returned no stale entries: https://github.com/RHsyseng/openshift-checks/blob/main/scripts/ovn_cleanConntrack.sh
      
      As a side note, there is an 'spk-coredns' namespace that provides the NAT46 translation required for the pods to resolve these FQDNs.
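
      Regarding the possible RCA in the status note above: the ~16-minute outage lines up with the kernel's TCP retransmission timeout. With the default net.ipv4.tcp_retries2 = 15, an established connection whose peer has silently vanished keeps being retransmitted for roughly 924 seconds (the hypothetical figure quoted in the kernel ip-sysctl documentation) before being aborted, which is about when the mobility pods would give up on the old f5-tmm and reconnect. A rough back-of-the-envelope sketch, assuming the RTO starts at its 200 ms minimum (the real window depends on the measured RTT):

          # Approximate worst-case time an established TCP connection keeps
          # retransmitting before the kernel aborts it: each wait doubles from
          # TCP_RTO_MIN (200 ms) up to the TCP_RTO_MAX cap (120 s), and
          # tcp_retries2 bounds the number of retransmissions.
          TCP_RTO_MIN = 0.2    # seconds
          TCP_RTO_MAX = 120.0  # seconds

          def retransmission_window(tcp_retries2: int) -> float:
              return sum(min(TCP_RTO_MIN * 2 ** k, TCP_RTO_MAX)
                         for k in range(tcp_retries2 + 1))

          for retries in (15, 8):
              t = retransmission_window(retries)
              print(f"tcp_retries2={retries}: ~{t:.1f}s (~{t / 60:.1f} min)")

          # tcp_retries2=15: ~924.6s (~15.4 min)  <- matches the observed ~16 minutes
          # tcp_retries2=8:  ~102.2s (~1.7 min)   <- the value the customer is testing

      If the hypothesis holds, lowering the value to 8 would shrink the window to roughly 100 seconds, which is the RFC 1122 minimum referenced in the kernel documentation.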
