• Bug
    • Resolution: Done-Errata
    • Critical
    • ovn23.09
    • 3
    • False
    • False
    • Given a system administrator sets up a basic OVN environment,

      When they start the ovn-controller service,

      Then the ovn-controller should start successfully and continue running without crashing.
    • rhel-sst-network-fastdatapath-ovn
    • ssg_networking
    • Critical
    • +

      description:

      With a simple OVN setup, ovn-controller crashes.

      version:

      ovn23.09-23.09.6-4.el9fdp.x86_64

      steps:

      systemctl start openvswitch                          
      systemctl start ovn-northd
      ovn-nbctl set-connection ptcp:6641
      ovn-sbctl set-connection ptcp:6642
      ovs-vsctl set open . external_ids:system-id=hv1 external_ids:ovn-remote=tcp:20.0.88.25:6642 external_ids:ovn-encap-type=geneve external_ids:ovn-encap-ip=20.0.88.25
      systemctl restart ovn-controller
      ovs-vsctl add-br br-ext
      ovs-vsctl set Open_vSwitch . external-ids:ovn-bridge-mappings=phynet:br-ext
      ovs-vsctl add-port br-ext ens1f1np1
      ip link set ens1f1np1 up
      ovn-nbctl lr-add lr1
      ovn-nbctl lrp-add lr1 lr1-ls1 00:00:01:ff:02:03 192.168.1.254/24 1111::a/64
      ovn-nbctl ls-add ls1
      ovn-nbctl lsp-add ls1 ls1p1
      ovn-nbctl lsp-set-addresses ls1p1 "00:00:01:01:01:01 192.168.1.1 1111::1"
      ovn-nbctl lsp-add ls1 ls1p2
      ovn-nbctl lsp-set-addresses ls1p2 "00:00:01:01:01:02 192.168.1.12 1111::2"
      ovn-nbctl lsp-add ls1 ls1-lr1
      ovn-nbctl lsp-set-type ls1-lr1 router
      ovn-nbctl lsp-set-options ls1-lr1 router-port=lr1-ls1
      ovn-nbctl lsp-set-addresses ls1-lr1 router
      ovn-nbctl ls-add ls2
      ovn-nbctl lsp-add ls2 ls2p1
      ovn-nbctl lsp-set-addresses ls2p1 "00:00:01:01:02:01 192.168.2.1 1112::1"
      ovn-nbctl lsp-add ls2 ls2p2
      ovn-nbctl lsp-set-addresses ls2p2 "00:00:01:01:02:02 192.168.2.2 1112::2"
      ovn-nbctl lrp-add lr1 lr1-ls2 00:00:01:ff:22:03 192.168.2.254/24 1112::a/64
      ovn-nbctl lsp-add ls2 ls2-lr1
      ovn-nbctl lsp-set-type ls2-lr1 router
      ovn-nbctl lsp-set-options ls2-lr1 router-port=lr1-ls2
      ovn-nbctl lsp-set-addresses ls2-lr1 router
      ovn-nbctl ls-add pub
      ovn-nbctl lrp-add lr1 lr1-pub 00:00:01:ff:01:03 172.16.1.254/24 172:16::a/64
      ovn-nbctl lrp-set-gateway-chassis lr1-pub hv1
      ovn-nbctl lsp-add pub pub-lr1
      ovn-nbctl lsp-set-type pub-lr1 router
      ovn-nbctl lsp-set-addresses pub-lr1 router
      ovn-nbctl lsp-set-options pub-lr1 router-port=lr1-pub
      ovn-nbctl lsp-add pub pub-ln
      ovn-nbctl lsp-set-type pub-ln localnet
      ovn-nbctl lsp-set-addresses pub-ln unknown
      ovn-nbctl lsp-set-options pub-ln network_name=phynet
      ovn-nbctl lsp-add ls1 ls1-ln
      ovn-nbctl lsp-set-type ls1-ln localnet
      ovn-nbctl lsp-set-addresses ls1-ln unknown
      ovn-nbctl lsp-set-options ls1-ln network_name=phynet
      ovn-nbctl lsp-add ls2 ls2-ln
      ovn-nbctl lsp-set-type ls2-ln localnet
      ovn-nbctl lsp-set-addresses ls2-ln unknown
      ovn-nbctl lsp-set-options ls2-ln network_name=phynet
      ovn-nbctl set logical_switch_port ls2-ln tag_request=50
      ovn-nbctl lr-nat-add lr1 dnat_and_snat 172.16.1.21 192.168.2.1 ls2p1 00:00:0f:01:02:01
      ovn-nbctl lr-nat-add lr1 dnat_and_snat 172.16.1.22 192.168.2.2 ls2p2 00:00:0f:01:02:02
      ovs-vsctl add-port br-int ls1p1 -- set interface ls1p1 type=internal external_ids:iface-id=ls1p1
      ip netns add ls1p1
      ip link set ls1p1 netns ls1p1
      ip netns exec ls1p1 ip link set ls1p1 address 00:00:01:01:01:01
      ip netns exec ls1p1 ip link set ls1p1 up
      ip netns exec ls1p1 ip addr add 192.168.1.1/24 dev ls1p1
      ip netns exec ls1p1 ip route add default via 192.168.1.254
      ip netns exec ls1p1 ip addr add 1111::1/64 dev ls1p1
      ip netns exec ls1p1 ip -6 route add default via 1111::a
      ovs-vsctl add-port br-int ls2p1 -- set interface ls2p1 type=internal external_ids:iface-id=ls2p1
      ip netns add ls2p1
      ip link set ls2p1 netns ls2p1
      ip netns exec ls2p1 ip link set ls2p1 address 00:00:01:01:02:01
      ip netns exec ls2p1 ip link set ls2p1 up
      ip netns exec ls2p1 ip addr add 192.168.2.1/24 dev ls2p1
      ip netns exec ls2p1 ip route add default via 192.168.2.254
      ip netns exec ls2p1 ip addr add 1112::1/64 dev ls2p1
      ip netns exec ls2p1 ip -6 route add default via 1112::a
      ovs-vsctl add-port br-ext ext1 -- set interface ext1 type=internal
      ip netns add ext1
      ip link set ext1 netns ext1
      ip netns exec ext1 ip link set lo up
      ip netns exec ext1 ip link set ext1 up
      ip netns exec ext1 ip addr add 172.16.1.11/24 dev ext1
      ip netns exec ext1 ip addr add 172:16::11/64 dev ext1 

      actual result:

      ovn-controller crashes:

      [root@wsfd-advnetlab20 test]# systemctl status ovn-controller
      × ovn-controller.service - OVN controller daemon
           Loaded: loaded (/usr/lib/systemd/system/ovn-controller.service; disabled; preset: disabled)
           Active: failed (Result: signal) since Thu 2024-10-31 02:25:27 EDT; 3min 22s ago
         Duration: 40ms
          Process: 57340 ExecStart=/usr/share/ovn/scripts/ovn-ctl --no-monitor --ovn-user=${OVN_USER_ID} start_controller $OVN_CONTROLLER_OPTS (code=exited, status=0/SUCCESS)
         Main PID: 57367 (code=killed, signal=ABRT)
              CPU: 145ms
      
      
      Oct 31 02:25:27 wsfd-advnetlab20.anl.eng.rdu2.dc.redhat.com systemd[1]: ovn-controller.service: Scheduled restart job, restart counter is at 5.
      Oct 31 02:25:27 wsfd-advnetlab20.anl.eng.rdu2.dc.redhat.com systemd[1]: Stopped OVN controller daemon.
      Oct 31 02:25:27 wsfd-advnetlab20.anl.eng.rdu2.dc.redhat.com systemd[1]: ovn-controller.service: Start request repeated too quickly.
      Oct 31 02:25:27 wsfd-advnetlab20.anl.eng.rdu2.dc.redhat.com systemd[1]: ovn-controller.service: Failed with result 'signal'.
      Oct 31 02:25:27 wsfd-advnetlab20.anl.eng.rdu2.dc.redhat.com systemd[1]: Failed to start OVN controller daemon.
      [root@wsfd-advnetlab20 test]# journalctl -xe -u ovn-controller --no-page
      Oct 31 02:25:24 wsfd-advnetlab20.anl.eng.rdu2.dc.redhat.com systemd[1]: Starting OVN controller daemon...
      ░░ Subject: A start job for unit ovn-controller.service has begun execution
      ░░ Defined-By: systemd
      ░░ Support: https://access.redhat.com/support
      ░░ 
      ░░ A start job for unit ovn-controller.service has begun execution.
      ░░ 
      ░░ The job identifier is 10782.
      Oct 31 02:25:24 wsfd-advnetlab20.anl.eng.rdu2.dc.redhat.com ovn-ctl[57058]: Starting ovn-controller.
      Oct 31 02:25:24 wsfd-advnetlab20.anl.eng.rdu2.dc.redhat.com systemd[1]: Started OVN controller daemon.
      ░░ Subject: A start job for unit ovn-controller.service has finished successfully
      ░░ Defined-By: systemd
      ░░ Support: https://access.redhat.com/support
      ░░ 
      ░░ A start job for unit ovn-controller.service has finished successfully.
      ░░ 
      ░░ The job identifier is 10782.
      Oct 31 02:25:25 wsfd-advnetlab20.anl.eng.rdu2.dc.redhat.com systemd[1]: ovn-controller.service: Main process exited, code=killed, status=6/ABRT
      ░░ Subject: Unit process exited
      ░░ Defined-By: systemd
      ░░ Support: https://access.redhat.com/support
      ░░ 
      ░░ An ExecStart= process belonging to unit ovn-controller.service has exited.
      ░░ 
      ░░ The process' exit code is 'killed' and its exit status is 6.
      Oct 31 02:25:25 wsfd-advnetlab20.anl.eng.rdu2.dc.redhat.com systemd[1]: ovn-controller.service: Failed with result 'signal'.
      ░░ Subject: Unit failed
      ░░ Defined-By: systemd
      ░░ Support: https://access.redhat.com/support 

      expected result:

      ovn-controller doesn't crash

       

      The issue did not happen on the earlier release ovn23.09-23.09.4-38.el9fdp.
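
The failure signature in the status output above is the `Main PID: ... (code=killed, signal=ABRT)` line. As a minimal sketch, the check can be scripted with a small filter. `killed_by_abrt` is a hypothetical helper name, not part of OVN or systemd, and it only matches the text format shown in this report.

```shell
#!/bin/sh
# Hypothetical helper (not part of OVN): reads `systemctl status` text on
# stdin and succeeds only if the main process was killed by SIGABRT,
# matching the "Main PID: <pid> (code=killed, signal=ABRT)" line format
# seen in this report.
killed_by_abrt() {
    grep -Eq 'Main PID: [0-9]+ \(code=killed, signal=ABRT\)'
}

# Example use on a live host (assumes systemd and the ovn-controller unit):
#   systemctl status ovn-controller | killed_by_abrt \
#       && echo "ovn-controller aborted"
```

The function only inspects text, so it can also be run against a saved copy of the status output.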

            [FDP-926] ovn-controller would crash with basic ovn setup

            Errata Tool added a comment -

            Since the problem described in this issue should be resolved in a recent advisory, it has been closed.

            For information on the advisory (ovn23.09 bug fix and enhancement update), and where to find the updated files, follow the link below.

            If the solution does not work for you, open a new bug report.
            https://access.redhat.com/errata/RHBA-2024:10896

            Jianlin Shi added a comment -

            Verified on ovn23.09-23.09.6-6:

            [root@dell-per740-69 FDP-926]# systemctl status ovn-controller                           
            ● ovn-controller.service - OVN controller daemon                                         
                 Loaded: loaded (/usr/lib/systemd/system/ovn-controller.service; disabled; preset: disabled)
                 Active: active (running) since Tue 2024-11-05 21:36:38 EST; 37s ago                 
                Process: 34551 ExecStart=/usr/share/ovn/scripts/ovn-ctl --no-monitor --ovn-user=${OVN_USER_ID} start_controller $OVN_CONTROLLER_OPTS (code=exited, status=0/SUCCESS)
               Main PID: 34578 (ovn-controller)                                                      
                  Tasks: 5 (limit: 250502)                                                           
                 Memory: 6.8M                                                                        
                    CPU: 112ms                                                                       
                 CGroup: /system.slice/ovn-controller.service                                        
                         └─34578 ovn-controller unix:/run/openvswitch/db.sock -vconsole:emer -vsyslog:err -vfile:info --user openvswitch:openvswitch --no-chdir --log-file=/var/log/ovn/ovn-c>
                                                                                                     
            Nov 05 21:36:38 dell-per740-69.rhts.eng.pek2.redhat.com systemd[1]: Starting OVN controller daemon...
            Nov 05 21:36:38 dell-per740-69.rhts.eng.pek2.redhat.com ovn-ctl[34551]: Starting ovn-controller.
            Nov 05 21:36:38 dell-per740-69.rhts.eng.pek2.redhat.com systemd[1]: Started OVN controller daemon.
            [root@dell-per740-69 FDP-926]# ovn-nbctl --wait=hv sync                                  
            [root@dell-per740-69 FDP-926]# rpm -qa | grep -E "openvswitch3.3|ovn23.09"               
            openvswitch3.3-3.3.0-54.el9fdp.x86_64                                                    
            ovn23.09-23.09.6-6.el9fdp.x86_64                                                         
            ovn23.09-central-23.09.6-6.el9fdp.x86_64                                                 
            ovn23.09-host-23.09.6-6.el9fdp.x86_64 

            set Verified
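
This report names three ovn23.09 builds: one where the crash was not seen, one where it was reported, and one where the fix was verified. A rough sketch of a lookup for those builds follows. `fdp926_status` is a hypothetical helper, and any build not named in this report is deliberately classified as unknown rather than guessed.

```shell
#!/bin/sh
# Hypothetical helper: maps an ovn23.09 version-release string to what this
# report says about it. Only the three builds named in the report are known;
# everything else is reported as "unknown".
fdp926_status() {
    case "$1" in
        23.09.4-38.el9fdp) echo "not affected (reported as working)" ;;
        23.09.6-4.el9fdp)  echo "affected (crash reported)" ;;
        23.09.6-6.el9fdp)  echo "fixed (verified)" ;;
        *)                 echo "unknown" ;;
    esac
}

# Example use on an RPM-based host:
#   fdp926_status "$(rpm -q --qf '%{VERSION}-%{RELEASE}' ovn23.09)"
```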


            OVN Team added a comment -

            A review mentioning this issue has been posted to https://patchwork.ozlabs.org/project/ovn/list/?series=431167.


            Dumitru Ceara added a comment -

            I bisected this to the following culprit commit:

            https://github.com/ovn-org/ovn/commit/edc064b4c589ab1bb69352523481bd6d997aa1ca

            edc064b4c589ab1bb69352523481bd6d997aa1ca is the first bad commit
            commit edc064b4c589ab1bb69352523481bd6d997aa1ca
            Author: Xavier Simonart <xsimonar@redhat.com>
            Date:   Tue Oct 1 17:17:04 2024 +0200

                controller: Properly handle localnet flows in I+P.

                Delete flows on localnet port deletion, and add localnet
                related flows when peer ports are added. This was properly done when
                recomputing, but not when doing IP.

                When peer ports are added, some flows such as chassis_mac flows
                must be added.

                Signed-off-by: Xavier Simonart <xsimonar@redhat.com>
                Acked-by: Ales Musil <amusil@redhat.com>
                Signed-off-by: Numan Siddique <numans@ovn.org>

            The backtrace is:

            #0  0x00007ffff77cd834 in __pthread_kill_implementation () from /lib64/libc.so.6
            #1  0x00007ffff777b8ee in raise () from /lib64/libc.so.6
            #2  0x00007ffff77638ff in abort () from /lib64/libc.so.6
            #3  0x000000000042f861 in flow_is_preferred (a=0xafe9a0, b=0xaf9380) at controller/ofctrl.c:966
            #4  0x000000000042f340 in link_installed_to_desired (i=0xb2eaf0, d=0xafe9a0) at controller/ofctrl.c:987
            #5  0x000000000042c17c in update_installed_flows_by_track (flow_table=0x813a80, bc=0x7ffffffcc740, installed_flows=0x7894e0 <installed_pflows>, msgs=0x7ffffffcc790) at controller/ofctrl.c:2583
            #6  0x000000000042af14 in ofctrl_put (lflow_table=0x810180, pflow_table=0x813a80, pending_ct_zones=0x8129b0, pending_lb_tuples=0x80e030, sbrec_meter_by_name=0x7e6840, req_cfg=0, lflows_changed=true, pflows_changed=true)
                at controller/ofctrl.c:2826
            #7  0x000000000045015c in main (argc=1, argv=0x7fffffffe218) at controller/ovn-controller.c:5788
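
To compare a backtrace like this one against other crash reports, it can help to reduce it to the call chain alone. The sketch below assumes gdb-style frame lines of the form `#N  0xADDR in FUNCTION (...)`, as in this backtrace. `bt_functions` is a hypothetical helper name, and frames printed without an address are not handled.

```shell
#!/bin/sh
# Hypothetical helper: reads a gdb-style backtrace on stdin and prints one
# function name per frame, innermost frame first. Only lines shaped like
# "#N  0xADDR in FUNCTION (...)" are matched; continuation lines such as
# "    at controller/ofctrl.c:2826" are skipped.
bt_functions() {
    sed -n 's/^[[:space:]]*#[0-9][0-9]*[[:space:]][[:space:]]*0x[0-9a-f][0-9a-f]* in \([^ ]*\) .*/\1/p'
}

# Example use with a backtrace saved to a file (file name is illustrative):
#   bt_functions < backtrace.txt
```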

              xsimonar@redhat.com Xavier Simonart
              rhn-support-jishi Jianlin Shi
              Jianlin Shi Jianlin Shi