  1. OpenShift Bugs
  2. OCPBUGS-18700

MetalLB bgpd crashes with SIGSEGV; speaker pod needs to be restarted manually

    • Important
    • No
    • CNF Network Sprint 242, CNF Network Sprint 243
    • 2
    • Rejected
    • False
    • Customer Escalated
    • 9/18: green

      Description of problem:

      Since upgrading from OCP 4.10.x to OCP 4.12.29, we have been seeing an issue with the MetalLB speaker pods.
      
      We have to restart the speaker pods at regular intervals to restore service.
      
      The frr container logs show the following stack traces: a SIGSEGV (signal 11) in the bgpd daemon running in the speaker pod, followed by one in bfdd.
      
      
      
      $ for pod in $(omg get pod -A -o wide | grep -i speaker | awk '{print $2}'); do echo "$pod"; omg logs -n metallb-system -c frr "$pod" | grep -A30 "Received signal 11"; done | less
      
      speaker-98mxj
      2023-09-04T14:47:49.455643366Z BGP: Received signal 11 at 1693838869 (si_addr 0x2, PC 0x7fcaec333c75); aborting...
      2023-09-04T14:47:49.455940093Z BGP: /usr/lib64/frr/libfrr.so.0(zlog_backtrace_sigsafe+0x5b) [0x7fcaee3c129b]
      2023-09-04T14:47:49.455940093Z BGP: /usr/lib64/frr/libfrr.so.0(zlog_signal+0xe1) [0x7fcaee3c1491]
      2023-09-04T14:47:49.455940093Z BGP: /usr/lib64/frr/libfrr.so.0(+0x83b28) [0x7fcaee3e5b28]
      2023-09-04T14:47:49.455940093Z BGP: /lib64/libpthread.so.0(+0x12ce0) [0x7fcaec63ece0]
      2023-09-04T14:47:49.455940093Z BGP: /lib64/libc.so.6(+0xccc75) [0x7fcaec333c75]
      2023-09-04T14:47:49.455940093Z BGP: /lib64/libyang.so.1(lydict_remove+0x49) [0x7fcaee079709]
      2023-09-04T14:47:49.455952425Z BGP: /lib64/libyang.so.1(lyd_free_attr+0x7b) [0x7fcaee0e11fb]
      2023-09-04T14:47:49.455952425Z BGP: /lib64/libyang.so.1(+0x7d0f1) [0x7fcaee0e40f1]
      2023-09-04T14:47:49.455952425Z BGP: /lib64/libyang.so.1(+0x7d209) [0x7fcaee0e4209]
      2023-09-04T14:47:49.455952425Z BGP: /lib64/libyang.so.1(+0x7d221) [0x7fcaee0e4221]
      2023-09-04T14:47:49.455952425Z BGP: /lib64/libyang.so.1(+0x7d221) [0x7fcaee0e4221]
      2023-09-04T14:47:49.455952425Z BGP: /lib64/libyang.so.1(+0x7d221) [0x7fcaee0e4221]
      2023-09-04T14:47:49.455959853Z BGP: /usr/lib64/frr/libfrr.so.0(nb_config_replace+0x32) [0x7fcaee3ca702]
      2023-09-04T14:47:49.455959853Z BGP: /usr/lib64/frr/libfrr.so.0(nb_candidate_commit_apply+0x61) [0x7fcaee3cd3b1]
      2023-09-04T14:47:49.455959853Z BGP: /usr/lib64/frr/libfrr.so.0(nb_candidate_commit+0x9e) [0x7fcaee3cd4be]
      2023-09-04T14:47:49.455981286Z BGP: /usr/lib64/frr/libfrr.so.0(+0x6b8dc) [0x7fcaee3cd8dc]
      2023-09-04T14:47:49.455981286Z BGP: /usr/lib64/frr/libfrr.so.0(nb_cli_apply_changes+0x619) [0x7fcaee3d0959]
      2023-09-04T14:47:49.455981286Z BGP: /usr/lib64/frr/libfrr.so.0(+0x4a7c5) [0x7fcaee3ac7c5]
      2023-09-04T14:47:49.455981286Z BGP: /usr/lib64/frr/libfrr.so.0(+0x4ab81) [0x7fcaee3acb81]
      2023-09-04T14:47:49.455981286Z BGP: /usr/lib64/frr/libfrr.so.0(+0x39525) [0x7fcaee39b525]
      2023-09-04T14:47:49.455981286Z BGP: /usr/lib64/frr/libfrr.so.0(cmd_execute_command+0x71) [0x7fcaee39d6f1]
      2023-09-04T14:47:49.455981286Z BGP: /usr/lib64/frr/libfrr.so.0(cmd_execute+0xd0) [0x7fcaee39d910]
      2023-09-04T14:47:49.455981286Z BGP: /usr/lib64/frr/libfrr.so.0(+0x98da5) [0x7fcaee3fada5]
      2023-09-04T14:47:49.455981286Z BGP: /usr/lib64/frr/libfrr.so.0(+0x98f80) [0x7fcaee3faf80]
      2023-09-04T14:47:49.455981286Z BGP: /usr/lib64/frr/libfrr.so.0(+0x9b9c0) [0x7fcaee3fd9c0]
      2023-09-04T14:47:49.455990616Z BGP: /usr/lib64/frr/libfrr.so.0(thread_call+0x5a) [0x7fcaee3f52aa]
      2023-09-04T14:47:49.455990616Z BGP: /usr/lib64/frr/libfrr.so.0(frr_run+0xe8) [0x7fcaee3bfe18]
      2023-09-04T14:47:49.456010473Z BGP: /usr/lib/frr/bgpd(main+0x30c) [0x5571a251f9fc]
      2023-09-04T14:47:49.456010473Z BGP: /lib64/libc.so.6(__libc_start_main+0xf3) [0x7fcaec2a1cf3]
      2023-09-04T14:47:49.456010473Z BGP: /usr/lib/frr/bgpd(_start+0x2e) [0x5571a2521c2e]
      ---
      2023-09-04T14:47:49.459087210Z BFD: Received signal 11 at 1693838869 (si_addr 0xffffffff8daa35f1, PC 0x7f605cf2ac92); aborting...
      2023-09-04T14:47:49.459662341Z BFD: /usr/lib64/frr/libfrr.so.0(zlog_backtrace_sigsafe+0x5b) [0x7f605cf1829b]
      2023-09-04T14:47:49.459662341Z BFD: /usr/lib64/frr/libfrr.so.0(zlog_signal+0xe1) [0x7f605cf18491]
      2023-09-04T14:47:49.459662341Z BFD: /usr/lib64/frr/libfrr.so.0(+0x83b28) [0x7f605cf3cb28]
      2023-09-04T14:47:49.459662341Z BFD: /lib64/libpthread.so.0(+0x12ce0) [0x7f605b195ce0]
      2023-09-04T14:47:49.459662341Z BFD: /usr/lib64/frr/libfrr.so.0(+0x71c92) [0x7f605cf2ac92]
      2023-09-04T14:47:49.459662341Z BFD: /usr/lib64/frr/libfrr.so.0(+0x7225d) [0x7f605cf2b25d]
      2023-09-04T14:47:49.459662341Z BFD: /usr/lib64/frr/libfrr.so.0(prefix_list_entry_update_finish+0x6c) [0x7f605cf2d26c]
      2023-09-04T14:47:49.459662341Z BFD: /usr/lib64/frr/libfrr.so.0(+0x4f037) [0x7f605cf08037]
      2023-09-04T14:47:49.459662341Z BFD: /usr/lib64/frr/libfrr.so.0(+0x6acc5) [0x7f605cf23cc5]
      2023-09-04T14:47:49.459662341Z BFD: /usr/lib64/frr/libfrr.so.0(+0x6b06e) [0x7f605cf2406e]
      2023-09-04T14:47:49.459662341Z BFD: /usr/lib64/frr/libfrr.so.0(nb_candidate_commit_apply+0x37) [0x7f605cf24387]
      2023-09-04T14:47:49.459662341Z BFD: /usr/lib64/frr/libfrr.so.0(nb_candidate_commit+0x9e) [0x7f605cf244be]
      2023-09-04T14:47:49.459662341Z BFD: /usr/lib64/frr/libfrr.so.0(+0x6b8dc) [0x7f605cf248dc]
      2023-09-04T14:47:49.459662341Z BFD: /usr/lib64/frr/libfrr.so.0(nb_cli_apply_changes+0x619) [0x7f605cf27959]
      2023-09-04T14:47:49.459662341Z BFD: /usr/lib64/frr/libfrr.so.0(+0x4a7c5) [0x7f605cf037c5]
      2023-09-04T14:47:49.459662341Z BFD: /usr/lib64/frr/libfrr.so.0(+0x4ab81) [0x7f605cf03b81]
      2023-09-04T14:47:49.459662341Z BFD: /usr/lib64/frr/libfrr.so.0(+0x39525) [0x7f605cef2525]
      2023-09-04T14:47:49.459662341Z BFD: /usr/lib64/frr/libfrr.so.0(cmd_execute_command+0x71) [0x7f605cef46f1]
      2023-09-04T14:47:49.459662341Z BFD: /usr/lib64/frr/libfrr.so.0(cmd_execute+0xd0) [0x7f605cef4910]
      2023-09-04T14:47:49.459662341Z BFD: /usr/lib64/frr/libfrr.so.0(+0x98da5) [0x7f605cf51da5]
      2023-09-04T14:47:49.459662341Z BFD: /usr/lib64/frr/libfrr.so.0(+0x98f80) [0x7f605cf51f80]
      2023-09-04T14:47:49.459662341Z BFD: /usr/lib64/frr/libfrr.so.0(+0x9b9c0) [0x7f605cf549c0]
      2023-09-04T14:47:49.459662341Z BFD: /usr/lib64/frr/libfrr.so.0(thread_call+0x5a) [0x7f605cf4c2aa]
      2023-09-04T14:47:49.459662341Z BFD: /usr/lib64/frr/libfrr.so.0(frr_run+0xe8) [0x7f605cf16e18]
      2023-09-04T14:47:49.459662341Z BFD: /usr/lib/frr/bfdd(main+0x27b) [0x5628836673db]
      2023-09-04T14:47:49.459662341Z BFD: /lib64/libc.so.6(__libc_start_main+0xf3) [0x7f605adf8cf3]
      2023-09-04T14:47:49.459662341Z BFD: /usr/lib/frr/bfdd(_start+0x2e) [0x5628836675be]
      2023-09-04T14:47:49.459662341Z BFD: in thread vtysh_read scheduled from lib/vty.c:2682
      2023-09-04T14:47:49.460284519Z 2023/09/04 14:47:49 WATCHFRR: [EC 268435457] bfdd state -> down : read returned EOF
      2023-09-04T14:47:49.460405905Z 2023/09/04 14:47:49.460 ZEBRA: [EC 4043309122] Client 'bfd' encountered an error and is shutting down.
      
      Counting the occurrences shows that the symptom affects nearly all speaker pods (13 of the 17 listed below).
      
      $ for pod in $(omg get pod -A -o wide | grep -i speaker | awk '{print $2}'); do SIGSEGV_COUNT=$(omg logs -n metallb-system -c frr "$pod" | grep -ci "Received signal 11"); printf "%s %d\n" "$pod" "$SIGSEGV_COUNT"; done
      speaker-2hz45 7
      speaker-66wcc 0
      speaker-6fjrr 14
      speaker-7pnzf 7
      speaker-8bqvp 0
      speaker-8fprv 0
      speaker-98mxj 5
      speaker-d4btd 5
      speaker-dd95v 13
      speaker-dl2x7 3
      speaker-gwqg8 11
      speaker-kbl82 21
      speaker-rcfjx 28
      speaker-s6xx5 3
      speaker-v6mmk 0
      speaker-vnf28 3
      speaker-zlkt9 10
      
      Restarting an affected pod seems to restore service only for a short time: from a few hours to a day.
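
The backtraces above come from two different FRR daemons (lines prefixed `BGP:` and `BFD:`), while the per-pod count lumps every "Received signal" line together. As a minimal sketch, assuming the log line layout shown above (timestamp, daemon prefix, message), the crashes could be broken down per daemon as follows. The function name `count_crashes_by_daemon` is hypothetical and not part of omg or FRR.

```shell
#!/bin/sh
# Count FRR crash lines per daemon from log text read on stdin.
# Assumes lines shaped like: "<timestamp> BGP: Received signal 11 at ...".
count_crashes_by_daemon() {
    awk '
        /Received signal [0-9]+ at/ {
            daemon = $2
            sub(/:$/, "", daemon)    # strip trailing colon: "BGP:" -> "BGP"
            count[daemon]++
        }
        END {
            for (d in count) printf "%s %d\n", d, count[d]
        }
    ' | sort
}

# Example use against a must-gather with omg (pod name taken from this report):
# omg logs -n metallb-system -c frr speaker-98mxj | count_crashes_by_daemon
```

The output is one line per daemon prefix with its crash count, which would show whether bgpd and bfdd always crash together or independently.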

      Version-Release number of selected component (if applicable):

      metallb-operator.4.12.0-202308071502
      OCP 4.12.29

      How reproducible:

      Quite often, without any change being made to the cluster.
      Roughly once every day or every few days.

      Steps to Reproduce:

      1.
      2.
      3.
      

      Actual results:

       

      Expected results:

      No crash of bgpd

      Additional info:

      This looks quite similar to:
      https://issues.redhat.com/browse/OCPBUGS-16795
      
      We are opening this issue to get confirmation that we are hitting the same problem.

        1. frr.tgz
          4 kB
          Federico Paolinelli
        2. master.tgz
          1 kB
          Federico Paolinelli

            fpaoline@redhat.com Federico Paolinelli
            rhn-support-jpeyrard Johann Peyrard
            Arti Sood Arti Sood