  1. OpenShift Bugs
  2. OCPBUGS-42079

Only one MetalLB speaker pod has its BFD connection DOWN when BFD is configured in passive-mode


    • Important
    • None
    • CNF Network Sprint 260, CNF Network Sprint 262
    • 2
    • False

      Description of problem:

      Our customer deployed MetalLB in BGP mode with BFD, but exactly one speaker pod is unable to bring its BFD session up. The BFD profile is configured in passive-mode; all the other speakers have BFD UP, and this specific pod is the only one with BFD in DOWN status.
      We collected a pcap on the worker node where that pod is running, and it shows the router sending "Control Detection Time Expired, State Down" packets.
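
For anyone reading the capture: the diagnostic and state that the router reports are carried in the first two bytes of the BFD control packet (RFC 5880, section 4.1). A minimal Python sketch of decoding that header, assuming the UDP payload has already been extracted from the pcap. The function name `decode_bfd_control` and the sample packet values are illustrative and are not taken from the customer capture:

```python
import struct

# Diagnostic codes and session states as defined in RFC 5880, section 4.1.
DIAG = {
    0: "No Diagnostic",
    1: "Control Detection Time Expired",
    2: "Echo Function Failed",
    3: "Neighbor Signaled Session Down",
    4: "Forwarding Plane Reset",
    5: "Path Down",
    6: "Concatenated Path Down",
    7: "Administratively Down",
    8: "Reverse Concatenated Path Down",
}
STATE = {0: "AdminDown", 1: "Down", 2: "Init", 3: "Up"}


def decode_bfd_control(payload: bytes) -> dict:
    """Decode the mandatory 24-byte section of a BFD control packet."""
    if len(payload) < 24:
        raise ValueError("BFD control packet is at least 24 bytes long")
    (vers_diag, sta_flags, detect_mult, length,
     my_disc, your_disc, min_tx, min_rx, min_echo_rx) = struct.unpack(
        "!BBBBIIIII", payload[:24])
    diag = vers_diag & 0x1F  # low 5 bits of byte 0
    return {
        "version": vers_diag >> 5,        # high 3 bits of byte 0
        "diag": DIAG.get(diag, f"Reserved ({diag})"),
        "state": STATE[sta_flags >> 6],   # high 2 bits of byte 1
        "detect_mult": detect_mult,
        "my_discriminator": my_disc,
        "your_discriminator": your_disc,
    }


# Hypothetical packet: version 1, diag 1, state Down, detect multiplier 3.
sample = struct.pack("!BBBBIIIII", 0x21, 0x40, 3, 24,
                     0x1234, 0, 1_000_000, 1_000_000, 0)
decoded = decode_bfd_control(sample)
print(decoded["diag"], "/", decoded["state"])
```

A "Your Discriminator" of 0 alongside state Down, as in the hypothetical sample, is what a peer sends before it has learned the remote discriminator; comparing that field in the real capture may help narrow down which side never saw the other.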
      In the speaker pod there are the following error messages:

      2024-09-11T22:40:24.725357249Z 2024/09/11 22:40:24.725 BFD: control-packet: no session found [mhop:no peer:10.47.38.35 local:10.47.38.43 port:13]
      2024-09-11T22:40:26.435398584Z 2024/09/11 22:40:26.435 BFD: control-packet: no session found [mhop:no peer:10.47.38.35 local:10.47.38.43 port:13]
      2024-09-11T22:40:28.138654975Z 2024/09/11 22:40:28.138 BFD: control-packet: no session found [mhop:no peer:10.47.38.35 local:10.47.38.43 port:13]
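
To see how often the speaker fails to match an incoming control packet to a session, the messages above can be tallied per peer/local address pair. A minimal Python sketch, assuming the log format is exactly as shown in the three lines above. The regular expression and the helper name `count_no_session` are illustrative and are not part of MetalLB tooling:

```python
import re
from collections import Counter

# Matches the "no session found" message in the form shown in this report.
NO_SESSION = re.compile(
    r"BFD: control-packet: no session found "
    r"\[mhop:(?P<mhop>\S+) peer:(?P<peer>\S+) "
    r"local:(?P<local>\S+) port:(?P<port>\d+)\]"
)


def count_no_session(lines):
    """Count 'no session found' messages per (peer, local) address pair."""
    counts = Counter()
    for line in lines:
        match = NO_SESSION.search(line)
        if match:
            counts[(match["peer"], match["local"])] += 1
    return counts


# The three log lines quoted in this report.
log_lines = [
    "2024-09-11T22:40:24.725357249Z 2024/09/11 22:40:24.725 BFD: control-packet: no session found [mhop:no peer:10.47.38.35 local:10.47.38.43 port:13]",
    "2024-09-11T22:40:26.435398584Z 2024/09/11 22:40:26.435 BFD: control-packet: no session found [mhop:no peer:10.47.38.35 local:10.47.38.43 port:13]",
    "2024-09-11T22:40:28.138654975Z 2024/09/11 22:40:28.138 BFD: control-packet: no session found [mhop:no peer:10.47.38.35 local:10.47.38.43 port:13]",
]

counts = count_no_session(log_lines)
for (peer, local), n in counts.items():
    print(f"peer={peer} local={local} no_session={n}")
```

Run against the full speaker log from the must-gather, the same tally would show whether the message is limited to this one peer/local pair.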
      

      We tried restarting the pod, but it didn't help, so we increased the log level and collected a must-gather.
      During the last maintenance window, the customer removed passive-mode from the configuration and collected a pcap. With passive-mode removed, the BFD session came UP for the last speaker pod as well.

      Additional info:
      The following files are attached to the support case:

      1. sos-report collected with MetalLB in passive-mode and BFD session DOWN: 0040-sosreport-NLEIN01SP4OW009-03890679-2024-07-31-nmajagn.tar.xz
      2. must-gather with LogLevel debug and BFD session DOWN: 0170-must-gather-1d65ee83-848f-4ff2-be71-338ed4c6e00c.tar.gz
      3. pcap when the session was DOWN: 0080-bfd-capture.pcap
      4. sos-report after removing the passive-mode: 0190-sosreport-NLEIN01SP4OW009-2024-09-16-vfepmuo.tar.xz
      5. pcap after removing the passive-mode and with the BFD session UP: 0190-sosreport-NLEIN01SP4OW009-2024-09-16-vfepmuo.tar.xz

      Our customer would like to know why the speaker pod can't establish the BFD session in passive-mode. This is a production cluster with applications handling business traffic.

              fpaoline@redhat.com Federico Paolinelli
              dnessill@redhat.com Daniele Nessilli
              Jad Haj Yahya Jad Haj Yahya