  1. OpenShift Bugs
  2. OCPBUGS-34761

metallb frr starts with incomplete config


    • Bug
    • Resolution: Done
    • Undefined
    • 4.17
    • 4.14.z
    • Networking / Metal LB
    • None
    • No
    • CNF Network Sprint 256
    • 1
    • False

      Description of problem:

      When the system is under load, frr can restart without its full config. In particular, the zebra setting ```ip nht resolve-via-default``` is missing, which breaks the connection to peers.
          

      Version-Release number of selected component (if applicable):

      OCP 4.14.25
      metallb-operator.v4.14.0-202405201438 
          

      How reproducible:

      Easily reproduced on a standalone frr installation, here with frr-8.5.3-4.el9.x86_64.
          

      Steps to Reproduce:

          1. Rebuild frr with this patch, which artificially slows down zebra startup:
      diff --git a/zebra/main.c b/zebra/main.c
      index 87f3de2..04f4d55 100644
      --- a/zebra/main.c
      +++ b/zebra/main.c
      @@ -471,6 +471,7 @@ int main(int argc, char **argv)
              /* Error init */
              zebra_error_init();
      
      +       sleep(30);
              frr_run(zrouter.master);
      
              /* Not reached... */
      
      Also provide this configuration for frr in /etc/frr/frr.conf:
      frr version 8.3.1
      frr defaults traditional
      hostname stream
      log file /tmp/frr.log
      log timestamp precision 3
      service integrated-vtysh-config
      !
      debug zebra events
      debug zebra kernel
      debug zebra rib
      debug zebra nht
      debug zebra nexthop
      !
      ip nht resolve-via-default
      !
      ipv6 nht resolve-via-default
      !
      
          2. systemctl restart frr
          3. journalctl -t watchfrr  -e ; vtysh -c "show run zebra"
          

      Actual results:

      watchfrr logs these errors:
      juin 02 10:47:21 stream watchfrr[82055]: [ZE9RA-19PS5] restart all child process 82056 still running after 20 seconds, sending signal 15
      juin 02 10:47:21 stream watchfrr[82055]: [SK7QP-A2GT9] restart all process 82056 terminated due to signal 15
      
      zebra is left running with this incomplete config:
      # vtysh -c "show run zebra"
      Building configuration...
      
      Current configuration:
      !
      frr version 8.5.3
      frr defaults traditional
      hostname stream
      no ipv6 forwarding
      !
      end
          

      Expected results:

      I'd expect frr either to restart with the proper config, or the frr service not to become "active".
      
      From the point of view of metallb, the liveness probe should not claim that frr is ready.
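
As an illustration of the kind of check such a probe could perform, here is a minimal Python sketch. It is not MetalLB's actual probe; it assumes the two `nht resolve-via-default` directives from the frr.conf above are the ones that must be present, and it reads the running config with the same `vtysh -c "show run zebra"` command used in step 3. The names `REQUIRED_DIRECTIVES`, `missing_directives`, `read_running_zebra_config` and `main` are hypothetical.

```python
import subprocess
import sys

# Assumption: these directives, taken from the frr.conf in the reproduction
# steps, are the ones whose absence signals the incomplete config.
REQUIRED_DIRECTIVES = (
    "ip nht resolve-via-default",
    "ipv6 nht resolve-via-default",
)


def missing_directives(running_config, required=REQUIRED_DIRECTIVES):
    """Return the required directives that are absent from the running config."""
    # Compare whole stripped lines so that a negated form such as
    # "no ip nht resolve-via-default" does not count as a match.
    present = {line.strip() for line in running_config.splitlines()}
    return [d for d in required if d not in present]


def read_running_zebra_config():
    """Fetch the running zebra config with the vtysh command from step 3."""
    result = subprocess.run(
        ["vtysh", "-c", "show run zebra"],
        capture_output=True, text=True, check=True, timeout=10,
    )
    return result.stdout


def main():
    """Return 0 when every required directive is loaded, 1 otherwise."""
    try:
        missing = missing_directives(read_running_zebra_config())
    except (OSError, subprocess.SubprocessError) as exc:
        print(f"cannot read zebra config: {exc}", file=sys.stderr)
        return 1
    if missing:
        print("zebra is missing: " + ", ".join(missing), file=sys.stderr)
        return 1
    return 0
```

A probe wrapper would call `main()` and use its return value as the exit status. Run against the incomplete `show run zebra` output under "Actual results", `missing_directives` reports both directives as missing, so the check would fail instead of reporting frr as ready.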
          

      Additional info:

      frr issue reported on https://github.com/FRRouting/frr/issues/15799
          

              fpaoline@redhat.com Federico Paolinelli
              frigault Francois Rigault