  1. Red Hat OpenStack Services on OpenShift
  2. OSPRH-11704

VM connectivity lost after openvswitch.service restart


    • Bug
    • Resolution: Unresolved
    • rhos-18.0.5
    • rhos-18.0 Feature Release 1 (Nov 2024)
    • neutron-operator
    • Critical

      Given a workload like this:

      sh-5.1$ openstack server list --all --long
      +--------------------------------------+-----------+--------+------------+-------------+----------------------------------------------------------+---------------------------------+--------------------------------------+----------------------------+-------------------+--------------------------------+------------+-------------+
      | ID                                   | Name      | Status | Task State | Power State | Networks                                                 | Image Name                      | Image ID                             | Flavor                     | Availability Zone | Host                           | Properties | Host Status |
      +--------------------------------------+-----------+--------+------------+-------------+----------------------------------------------------------+---------------------------------+--------------------------------------+----------------------------+-------------------+--------------------------------+------------+-------------+
      | 9da89876-b12b-44ce-998a-e15f3bd2c10c | instance6 | ACTIVE | None       | Running     | data=10.10.166.117; dpdkmgmt=10.10.10.139, 10.46.141.161 | rhel-guest-image-9.5-20241009.2 | 44cdcd14-8101-4653-8cf0-29a7bd7d218e | m1_medium_huge_pages_host1 | nova              | compute-1.ctlplane.example.com |            | UP          |
      | 118a85e1-fec4-409d-b445-b94d6a7f6f2e | instance5 | ACTIVE | None       | Running     | data=10.10.166.118; dpdkmgmt=10.10.10.181, 10.46.141.165 | rhel-guest-image-9.5-20241009.2 | 44cdcd14-8101-4653-8cf0-29a7bd7d218e | m1_medium_huge_pages_host0 | nova              | compute-0.ctlplane.example.com |            | UP          |
      | bbf172cb-a2d8-4ce1-8f42-185c450aff62 | instance4 | ACTIVE | None       | Running     | data=10.10.166.125; dpdkmgmt=10.10.10.200, 10.46.141.170 | rhel-guest-image-9.5-20241009.2 | 44cdcd14-8101-4653-8cf0-29a7bd7d218e | m1_medium_huge_pages_host1 | nova              | compute-1.ctlplane.example.com |            | UP          |
      | 6ea17395-14d7-4eed-89e7-525460990ee8 | instance3 | ACTIVE | None       | Running     | data=10.10.166.144; dpdkmgmt=10.10.10.148, 10.46.141.167 | rhel-guest-image-9.5-20241009.2 | 44cdcd14-8101-4653-8cf0-29a7bd7d218e | m1_medium_huge_pages_host0 | nova              | compute-0.ctlplane.example.com |            | UP          |
      | f3d82b52-01c2-4d03-9bc2-e7bdd372b832 | instance2 | ACTIVE | None       | Running     | data=10.10.166.171; dpdkmgmt=10.10.10.124, 10.46.141.162 | rhel-guest-image-9.5-20241009.2 | 44cdcd14-8101-4653-8cf0-29a7bd7d218e | m1_medium_huge_pages_host1 | nova              | compute-1.ctlplane.example.com |            | UP          |
      | 98f17950-dd1e-4861-9877-2ad6d42c8110 | instance1 | ACTIVE | None       | Running     | data=10.10.166.170; dpdkmgmt=10.10.10.159, 10.46.141.169 | rhel-guest-image-9.5-20241009.2 | 44cdcd14-8101-4653-8cf0-29a7bd7d218e | m1_medium_huge_pages_host0 | nova              | compute-0.ctlplane.example.com |            | UP          |
      +--------------------------------------+-----------+--------+------------+-------------+----------------------------------------------------------+---------------------------------+--------------------------------------+----------------------------+-------------------+--------------------------------+------------+-------------+
      

      Ping works normally for VMs:

      [zuul@controller-0 ~]$ ping 10.46.141.169
      PING 10.46.141.169 (10.46.141.169) 56(84) bytes of data.
      64 bytes from 10.46.141.169: icmp_seq=1 ttl=61 time=0.973 ms
      64 bytes from 10.46.141.169: icmp_seq=2 ttl=61 time=0.460 ms
      ...
      

      Then we restart openvswitch.service on one of the compute nodes (compute-0 in this case):

      [zuul@panther06 ~]$ ssh -ostricthostkeychecking=no -ouserknownhostsfile=/dev/null -i /tmp/k cloud-admin@192.168.122.101
      Warning: Permanently added '192.168.122.101' (ED25519) to the list of known hosts.
      Register this system with Red Hat Insights: insights-client --register
      Create an account or view all your systems at https://red.ht/insights-dashboard
      Last login: Mon Nov 18 08:24:22 2024 from 192.168.122.1
      [cloud-admin@compute-0 ~]$ sudo ovs-vsctl show
      d49151e9-54fb-448a-a157-a719a460cabe
          Manager "ptcp:6640:127.0.0.1"
              is_connected: true
          Bridge br-link0
              fail_mode: standalone
              datapath_type: netdev
              Port br-link0
                  tag: 164
                  Interface br-link0
                      type: internal
              Port dpdkbond0
                  Interface dpdk0
                      type: dpdk
                      options: {dpdk-devargs="0000:06:00.0", n_rxq="2"}
                  Interface dpdk1
                      type: dpdk
                      options: {dpdk-devargs="0000:06:00.1", n_rxq="2"}
          Bridge br-dpdk1
              fail_mode: standalone
              datapath_type: netdev
              Port dpdk4
                  Interface dpdk4
                      type: dpdk
                      options: {dpdk-devargs="0000:82:00.1", n_rxq="3"}
              Port br-dpdk1
                  Interface br-dpdk1
                      type: internal
          Bridge br-int
              fail_mode: secure
              datapath_type: netdev
              Port patch-br-int-to-provnet-55a983c0-4995-4d06-95bd-efcd7b6077a7
                  Interface patch-br-int-to-provnet-55a983c0-4995-4d06-95bd-efcd7b6077a7
                      type: patch
                      options: {peer=patch-provnet-55a983c0-4995-4d06-95bd-efcd7b6077a7-to-br-int}
              Port ovn-71cf59-0
                  Interface ovn-71cf59-0
                      type: geneve
                      options: {csum="true", key=flow, remote_ip="172.19.0.31", tos="0"}
                      bfd_status: {diagnostic="No Diagnostic", flap_count="1", forwarding="true", remote_diagnostic="Control Detection Time Expired", remote_state=up, state=up}
              Port tap10c517b9-c0
                  Interface tap10c517b9-c0
              Port vhuca97fc52-8a
                  Interface vhuca97fc52-8a
                      type: dpdkvhostuserclient
                      options: {vhost-server-path="/var/lib/vhost_sockets/vhuca97fc52-8a"}
              Port vhu430dd602-36
                  Interface vhu430dd602-36
                      type: dpdkvhostuserclient
                      options: {vhost-server-path="/var/lib/vhost_sockets/vhu430dd602-36"}
              Port br-int
                  Interface br-int
                      type: internal
              Port vhu63995776-9f
                  Interface vhu63995776-9f
                      type: dpdkvhostuserclient
                      options: {vhost-server-path="/var/lib/vhost_sockets/vhu63995776-9f"}
              Port vhuc4181b43-96
                  Interface vhuc4181b43-96
                      type: dpdkvhostuserclient
                      options: {vhost-server-path="/var/lib/vhost_sockets/vhuc4181b43-96"}
              Port tap3bc86f39-c0
                  Interface tap3bc86f39-c0
              Port vhu0996cf4b-fe
                  Interface vhu0996cf4b-fe
                      type: dpdkvhostuserclient
                      options: {vhost-server-path="/var/lib/vhost_sockets/vhu0996cf4b-fe"}
              Port ovn-a4fb3c-0
                  Interface ovn-a4fb3c-0
                      type: geneve
                      options: {csum="true", key=flow, remote_ip="172.19.0.100", tos="0"}
              Port vhu7c1125d6-c9
                  Interface vhu7c1125d6-c9
                      type: dpdkvhostuserclient
                      options: {vhost-server-path="/var/lib/vhost_sockets/vhu7c1125d6-c9"}
              Port ovn-64b940-0
                  Interface ovn-64b940-0
                      type: geneve
                      options: {csum="true", key=flow, remote_ip="172.19.0.32", tos="0"}
                      bfd_status: {diagnostic="No Diagnostic", flap_count="1", forwarding="true", remote_diagnostic="Control Detection Time Expired", remote_state=up, state=up}
              Port ovn-07eed2-0
                  Interface ovn-07eed2-0
                      type: geneve
                      options: {csum="true", key=flow, remote_ip="172.19.0.30", tos="0"}
                      bfd_status: {diagnostic="No Diagnostic", flap_count="1", forwarding="true", remote_diagnostic="Control Detection Time Expired", remote_state=up, state=up}
          Bridge br-dpdk0
              fail_mode: standalone
              datapath_type: netdev
              Port dpdkbond1
                  Interface dpdk2
                      type: dpdk
                      options: {dpdk-devargs="0000:82:00.2", n_rxq="3"}
                  Interface dpdk3
                      type: dpdk
                      options: {dpdk-devargs="0000:82:00.3", n_rxq="3"}
              Port br-dpdk0
                  Interface br-dpdk0
                      type: internal
              Port patch-provnet-55a983c0-4995-4d06-95bd-efcd7b6077a7-to-br-int
                  Interface patch-provnet-55a983c0-4995-4d06-95bd-efcd7b6077a7-to-br-int
                      type: patch
                      options: {peer=patch-br-int-to-provnet-55a983c0-4995-4d06-95bd-efcd7b6077a7}
          ovs_version: "3.3.3-49.el9fdp"
      [cloud-admin@compute-0 ~]$ systemctl -a |grep openvs
        openvswitch.service                                                                                                                  loaded    active   exited    Open vSwitch
      [cloud-admin@compute-0 ~]$ systemctl status openvswitch.service
      ● openvswitch.service - Open vSwitch
           Loaded: loaded (/usr/lib/systemd/system/openvswitch.service; enabled; preset: disabled)
           Active: active (exited) since Mon 2024-11-18 11:52:54 UTC; 2h 9min ago
          Process: 245106 ExecStart=/bin/true (code=exited, status=0/SUCCESS)
         Main PID: 245106 (code=exited, status=0/SUCCESS)
              CPU: 2ms
      [cloud-admin@compute-0 ~]$ 
      [cloud-admin@compute-0 ~]$ sudo systemctl restart openvswitch.service
      
      [cloud-admin@compute-0 ~]$ rpm -qi openvswitch3.3
      Name        : openvswitch3.3
      Version     : 3.3.0
      Release     : 49.el9fdp
      Architecture: x86_64
      Install Date: Thu 14 Nov 2024 08:33:15 AM UTC
      Group       : System Environment/Daemons daemon/database/utilities
      Size        : 24895143
      License     : ASL 2.0 and LGPLv2+ and SISSL
      Signature   : RSA/SHA256, Mon 16 Sep 2024 04:17:54 PM UTC, Key ID 199e2f91fd431d51
      Source RPM  : openvswitch3.3-3.3.0-49.el9fdp.src.rpm
      Build Date  : Mon 16 Sep 2024 08:40:03 AM UTC
      Build Host  : x86-64-04.build.eng.rdu2.redhat.com
      Packager    : Red Hat, Inc. <http://bugzilla.redhat.com/bugzilla>
      Vendor      : Red Hat, Inc.
      URL         : http://www.openvswitch.org/
      Summary     : Open vSwitch
      Description :
      Open vSwitch provides standard network bridging functions and
      support for the OpenFlow protocol for remote per-flow control of
      traffic.
      

      After the openvswitch service restart, we lose connectivity to the VMs hosted on that compute node (the ping loss begins at the moment the openvswitch service was restarted):

      [zuul@controller-0 ~]$ ping 10.46.141.169                   
      PING 10.46.141.169 (10.46.141.169) 56(84) bytes of data.
      64 bytes from 10.46.141.169: icmp_seq=1 ttl=61 time=0.973 ms
      64 bytes from 10.46.141.169: icmp_seq=2 ttl=61 time=0.460 ms
      64 bytes from 10.46.141.169: icmp_seq=3 ttl=61 time=0.492 ms
      64 bytes from 10.46.141.169: icmp_seq=4 ttl=61 time=0.415 ms
      64 bytes from 10.46.141.169: icmp_seq=5 ttl=61 time=0.448 ms
      64 bytes from 10.46.141.169: icmp_seq=6 ttl=61 time=0.457 ms
      64 bytes from 10.46.141.169: icmp_seq=7 ttl=61 time=0.449 ms
      64 bytes from 10.46.141.169: icmp_seq=8 ttl=61 time=0.519 ms
      64 bytes from 10.46.141.169: icmp_seq=9 ttl=61 time=0.460 ms
      64 bytes from 10.46.141.169: icmp_seq=10 ttl=61 time=0.457 ms
      64 bytes from 10.46.141.169: icmp_seq=11 ttl=61 time=0.507 ms
      64 bytes from 10.46.141.169: icmp_seq=12 ttl=61 time=0.489 ms
      64 bytes from 10.46.141.169: icmp_seq=13 ttl=61 time=0.532 ms
      64 bytes from 10.46.141.169: icmp_seq=14 ttl=61 time=0.593 ms
      64 bytes from 10.46.141.169: icmp_seq=15 ttl=61 time=0.482 ms
      64 bytes from 10.46.141.169: icmp_seq=16 ttl=61 time=0.485 ms
      64 bytes from 10.46.141.169: icmp_seq=17 ttl=61 time=0.600 ms
      64 bytes from 10.46.141.169: icmp_seq=18 ttl=61 time=0.569 ms
      64 bytes from 10.46.141.169: icmp_seq=19 ttl=61 time=0.556 ms
      64 bytes from 10.46.141.169: icmp_seq=20 ttl=61 time=0.553 ms
      64 bytes from 10.46.141.169: icmp_seq=21 ttl=61 time=0.503 ms
      64 bytes from 10.46.141.169: icmp_seq=22 ttl=61 time=0.495 ms
      64 bytes from 10.46.141.169: icmp_seq=23 ttl=61 time=27.4 ms
      64 bytes from 10.46.141.169: icmp_seq=39 ttl=61 time=1.28 ms
      ^C
      --- 10.46.141.169 ping statistics ---
      65 packets transmitted, 24 received, 63.0769% packet loss, time 65486ms
      rtt min/avg/max/mdev = 0.415/1.673/27.380/5.363 ms
      [zuul@controller-0 ~]$ ping -c1 10.46.141.169
      PING 10.46.141.169 (10.46.141.169) 56(84) bytes of data.
      
      --- 10.46.141.169 ping statistics ---
      1 packets transmitted, 0 received, 100% packet loss, time 0ms
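
To line up the loss window with the restart time, the gap in `icmp_seq` values can be extracted from a saved ping transcript. The following is a minimal sketch, not part of the original report; the helper name `find_seq_gaps` is hypothetical, and it assumes the default Linux `ping` output format shown above.

```python
import re

# Matches the sequence number in lines such as
# "64 bytes from 10.46.141.169: icmp_seq=23 ttl=61 time=27.4 ms"
SEQ_RE = re.compile(r"icmp_seq=(\d+)")


def find_seq_gaps(ping_output: str) -> list[tuple[int, int]]:
    """Return (first_missing, last_missing) ranges between received replies."""
    seqs = [int(m.group(1)) for m in SEQ_RE.finditer(ping_output)]
    gaps = []
    for prev, cur in zip(seqs, seqs[1:]):
        if cur - prev > 1:
            gaps.append((prev + 1, cur - 1))
    return gaps


# Last three replies from the transcript above.
sample = """\
64 bytes from 10.46.141.169: icmp_seq=22 ttl=61 time=0.495 ms
64 bytes from 10.46.141.169: icmp_seq=23 ttl=61 time=27.4 ms
64 bytes from 10.46.141.169: icmp_seq=39 ttl=61 time=1.28 ms
"""
print(find_seq_gaps(sample))  # [(24, 38)]
```

The helper only sees gaps that are followed by a later reply, so the trailing loss after `icmp_seq=39` has to be read from the `packets transmitted` summary line instead.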
      

      We then need to reboot the VM (from the virsh console) to get ping working again (here the ping loss corresponds to the time the VM takes to start after the reboot, after which ping replies resume):

      [zuul@controller-0 ~]$ ping 10.46.141.169                                                                                                                       
      PING 10.46.141.169 (10.46.141.169) 56(84) bytes of data.
      64 bytes from 10.46.141.169: icmp_seq=21 ttl=61 time=1.63 ms
      64 bytes from 10.46.141.169: icmp_seq=22 ttl=61 time=0.953 ms
      64 bytes from 10.46.141.169: icmp_seq=23 ttl=61 time=0.522 ms
      ^C
      --- 10.46.141.169 ping statistics ---
      23 packets transmitted, 3 received, 86.9565% packet loss, time 22504ms
      rtt min/avg/max/mdev = 0.522/1.035/1.632/0.456 ms
      

      However, ping to the rest of the VMs hosted on compute-0 still doesn't work:

      [zuul@controller-0 ~]$ ping -c1 10.46.141.167
      PING 10.46.141.167 (10.46.141.167) 56(84) bytes of data.
      
      --- 10.46.141.167 ping statistics ---
      1 packets transmitted, 0 received, 100% packet loss, time 0ms
      
      [zuul@controller-0 ~]$ ping -c1 10.46.141.165
      PING 10.46.141.165 (10.46.141.165) 56(84) bytes of data.
      
      --- 10.46.141.165 ping statistics ---
      1 packets transmitted, 0 received, 100% packet loss, time 0ms
      
      

      This issue has also been reproduced with RHEL 8.4 as guest.

      Find attached:
      ovs-vswitchd-post_reboot_vm.log: ovs-vswitchd.log before the service restart
      ovs-vswitchd-pre_reboot_vm.log: ovs-vswitchd.log before the service restart and the VM reboot
      instance1_messages: messages of instance1

      Reproduction procedure:
      1. Deploy VMs
      2. Ping works
      3. Restart the openvswitch service on the compute node
      4. Ping stops working
      5. Reboot the VM
      6. Ping works again
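
The ping checks in steps 2, 4 and 6 can be scripted so that every VM on the affected compute node is probed at once. This is a rough sketch under stated assumptions: the helper names `is_reachable` and `check_vms` are hypothetical, the script runs on a host that can reach the VM addresses (controller-0 in this report), and the address list is the three compute-0 instances pinged above. The service restart and the VM reboot are left as manual steps.

```python
import subprocess


def is_reachable(ip: str, count: int = 1, timeout_s: int = 2) -> bool:
    """Return True if ping exits 0, i.e. at least one reply was received."""
    result = subprocess.run(
        ["ping", "-c", str(count), "-W", str(timeout_s), ip],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def check_vms(ips, probe=is_reachable):
    """Map each address to its reachability, using the given probe function."""
    return {ip: probe(ip) for ip in ips}


if __name__ == "__main__":
    # Addresses of instance1, instance3 and instance5 (all on compute-0).
    compute0_vms = ["10.46.141.169", "10.46.141.167", "10.46.141.165"]
    for ip, ok in check_vms(compute0_vms).items():
        print(f"{ip}: {'reachable' if ok else 'UNREACHABLE'}")
```

Running it before step 3, after step 4 and after step 5 gives a per-VM view of which instances recovered and which are still unreachable.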

              Unassigned Unassigned
              rdiazcam@redhat.com Ricardo Diaz Campos
              rhos-dfg-nfv