Fast Datapath Product / FDP-692

During RHOSP upgrade, a fresh version of the OVN controller failed to establish relationships with the old control plane

    • Type: Bug
    • Resolution: Not a Bug
    • Priority: Major
    • ovn-2021, ovn23.09
    • Important

      VMs in the customer's RHOSP deployment lost network connectivity during the FFU 16.2 (ovn-2021) -> 17.1 (ovn23.09) procedure: after the ovn-controller container was updated and restarted, it failed to establish relationships with the control plane. On all nodes running the ovn-controller process, the same sequence of log messages is observed:

      2024-06-10T18:20:22.655Z|00001|vlog|INFO|opened log file /var/log/openvswitch/ovn-controller.log
      2024-06-10T18:20:22.655Z|00002|reconnect|INFO|unix:/run/openvswitch/db.sock: connecting...
      2024-06-10T18:20:22.655Z|00003|reconnect|INFO|unix:/run/openvswitch/db.sock: connected
      2024-06-10T18:20:22.662Z|00004|main|INFO|OVN internal version is : [22.12.2-20.27.0-70.6]
      2024-06-10T18:20:22.662Z|00005|main|INFO|OVS IDL reconnected, force recompute.
      2024-06-10T18:20:22.775Z|00006|reconnect|INFO|ssl:IP_ADDR:6642: connecting...
      2024-06-10T18:20:22.775Z|00007|main|INFO|OVNSB IDL reconnected, force recompute.
      2024-06-10T18:20:23.776Z|00008|reconnect|INFO|ssl:IP_ADDR:6642: connection attempt timed out
      2024-06-10T18:20:24.777Z|00009|reconnect|INFO|ssl:IP_ADDR:6642: connecting...
      2024-06-10T18:20:25.779Z|00010|reconnect|INFO|ssl:IP_ADDR:6642: connection attempt timed out
      2024-06-10T18:20:25.779Z|00011|reconnect|INFO|ssl:IP_ADDR:6642: waiting 2 seconds before reconnect
      2024-06-10T18:20:27.782Z|00012|reconnect|INFO|ssl:IP_ADDR:6642: connecting...
      2024-06-10T18:20:29.785Z|00013|reconnect|INFO|ssl:IP_ADDR:6642: connection attempt timed out
      2024-06-10T18:20:29.785Z|00014|reconnect|INFO|ssl:IP_ADDR:6642: waiting 4 seconds before reconnect
      2024-06-10T18:20:33.789Z|00015|memory|INFO|11080 kB peak resident set size after 11.1 seconds
      2024-06-10T18:20:33.789Z|00016|memory|INFO|idl-cells-Open_vSwitch:8014
      2024-06-10T18:20:33.790Z|00017|reconnect|INFO|ssl:IP_ADDR:6642: connecting...
      2024-06-10T18:20:37.795Z|00018|reconnect|INFO|ssl:IP_ADDR:6642: connection attempt timed out
      2024-06-10T18:20:37.795Z|00019|reconnect|INFO|ssl:IP_ADDR:6642: continuing to reconnect in the background but suppressing further logging
      2024-06-10T18:21:06.243Z|00020|reconnect|INFO|ssl:IP_ADDR:6642: connected
      2024-06-10T18:21:06.251Z|00021|ovsdb_idl|WARN|OVN_Southbound database lacks Chassis_Template_Var table (database needs upgrade?)
      2024-06-10T18:21:06.251Z|00022|ovsdb_idl|WARN|Load_Balancer table in OVN_Southbound database lacks datapath_group column (database needs upgrade?)
      2024-06-10T18:21:06.251Z|00023|ovsdb_idl|WARN|MAC_Binding table in OVN_Southbound database lacks timestamp column (database needs upgrade?)
      2024-06-10T18:21:06.251Z|00024|ovsdb_idl|WARN|OVN_Southbound database lacks Mirror table (database needs upgrade?)
      2024-06-10T18:21:06.251Z|00025|ovsdb_idl|WARN|Port_Binding table in OVN_Southbound database lacks additional_chassis column (database needs upgrade?)
      2024-06-10T18:21:06.251Z|00026|ovsdb_idl|WARN|Port_Binding table in OVN_Southbound database lacks additional_encap column (database needs upgrade?)
      2024-06-10T18:21:06.251Z|00027|ovsdb_idl|WARN|Port_Binding table in OVN_Southbound database lacks mirror_rules column (database needs upgrade?)
      2024-06-10T18:21:06.251Z|00028|ovsdb_idl|WARN|Port_Binding table in OVN_Southbound database lacks port_security column (database needs upgrade?)
      2024-06-10T18:21:06.251Z|00029|ovsdb_idl|WARN|Port_Binding table in OVN_Southbound database lacks requested_additional_chassis column (database needs upgrade?)
      2024-06-10T18:21:06.251Z|00030|ovsdb_idl|WARN|OVN_Southbound database lacks Static_MAC_Binding table (database needs upgrade?)
      2024-06-10T18:21:12.362Z|00031|memory|INFO|peak resident set size grew 217% in last 38.6 seconds, from 11080 kB to 35104 kB
      2024-06-10T18:21:12.362Z|00032|memory|INFO|idl-cells-Open_vSwitch:8014
      2024-06-10T18:21:20.899Z|00033|features|INFO|unix:/var/run/openvswitch/br-int.mgmt: connecting to switch
      2024-06-10T18:21:20.900Z|00034|rconn|INFO|unix:/var/run/openvswitch/br-int.mgmt: connecting...
      2024-06-10T18:21:20.908Z|00035|rconn|INFO|unix:/var/run/openvswitch/br-int.mgmt: connected
      2024-06-10T18:21:20.911Z|00036|features|INFO|OVS Feature: ct_zero_snat, state: supported
      2024-06-10T18:21:20.912Z|00037|main|INFO|OVS feature set changed, force recompute.
      2024-06-10T18:21:20.913Z|00038|ofctrl|INFO|unix:/var/run/openvswitch/br-int.mgmt: connecting to switch
      2024-06-10T18:21:20.915Z|00039|rconn|INFO|unix:/var/run/openvswitch/br-int.mgmt: connecting...
      2024-06-10T18:21:20.917Z|00040|rconn|INFO|unix:/var/run/openvswitch/br-int.mgmt: connected
      2024-06-10T18:21:20.918Z|00041|ofctrl|INFO|ofctrl-wait-before-clear is now 8000 ms (was 0 ms)
      2024-06-10T18:21:20.921Z|00042|main|INFO|OVS OpenFlow connection reconnected,force recompute.
      2024-06-10T18:21:20.965Z|00001|pinctrl(ovn_pinctrl0)|INFO|unix:/var/run/openvswitch/br-int.mgmt: connecting to switch
      2024-06-10T18:21:20.965Z|00002|rconn(ovn_pinctrl0)|INFO|unix:/var/run/openvswitch/br-int.mgmt: connecting...
      2024-06-10T18:21:20.965Z|00003|rconn(ovn_pinctrl0)|INFO|unix:/var/run/openvswitch/br-int.mgmt: connected
      2024-06-10T18:21:20.989Z|00043|timeval|WARN|Unreasonably long 1128ms poll interval (923ms user, 198ms system)
      2024-06-10T18:21:20.989Z|00044|timeval|WARN|faults: 88544 minor, 0 major
      2024-06-10T18:21:20.990Z|00045|timeval|WARN|disk: 0 reads, 8 writes
      2024-06-10T18:21:20.991Z|00046|timeval|WARN|context switches: 0 voluntary, 10 involuntary
      2024-06-10T18:21:20.994Z|00047|coverage|INFO|Event coverage, avg rate over last: 5 seconds, last minute, last hour,  hash=97cf558d:
      2024-06-10T18:21:20.994Z|00048|coverage|INFO|cmap_expand                0.0/sec     0.000/sec        0.0000/sec   total: 3
      2024-06-10T18:21:20.994Z|00049|coverage|INFO|netlink_sent               0.0/sec     0.000/sec        0.0000/sec   total: 12
      2024-06-10T18:21:20.994Z|00050|coverage|INFO|netlink_recv_jumbo         0.0/sec     0.000/sec        0.0000/sec   total: 1
      2024-06-10T18:21:20.994Z|00051|coverage|INFO|netlink_received           0.0/sec     0.000/sec        0.0000/sec   total: 12
      2024-06-10T18:21:20.994Z|00052|coverage|INFO|vconn_sent                 0.0/sec     0.000/sec        0.0000/sec   total: 5
      2024-06-10T18:21:20.994Z|00053|coverage|INFO|vconn_received             0.0/sec     0.000/sec        0.0000/sec   total: 4
      2024-06-10T18:21:20.994Z|00054|coverage|INFO|vconn_open                 0.0/sec     0.000/sec        0.0000/sec   total: 3
      2024-06-10T18:21:20.994Z|00055|coverage|INFO|util_xalloc              396182.8/sec 53981.183/sec      899.6864/sec   total: 5873685
      2024-06-10T18:21:20.994Z|00056|coverage|INFO|stream_open                0.0/sec     0.117/sec        0.0019/sec   total: 10
      2024-06-10T18:21:20.994Z|00057|coverage|INFO|pstream_open               0.0/sec     0.017/sec        0.0003/sec   total: 1
      2024-06-10T18:21:20.994Z|00058|coverage|INFO|seq_change                22.8/sec     3.400/sec        0.0567/sec   total: 243
      2024-06-10T18:21:20.994Z|00059|coverage|INFO|rconn_sent                 0.0/sec     0.000/sec        0.0000/sec   total: 2
      2024-06-10T18:21:20.994Z|00060|coverage|INFO|rconn_queued               0.0/sec     0.000/sec        0.0000/sec   total: 2
      2024-06-10T18:21:20.994Z|00061|coverage|INFO|poll_zero_timeout         22.0/sec     2.833/sec        0.0472/sec   total: 196
      2024-06-10T18:21:20.994Z|00062|coverage|INFO|poll_create_node          69.2/sec    10.300/sec        0.1717/sec   total: 719
      2024-06-10T18:21:20.994Z|00063|coverage|INFO|txn_incomplete             0.0/sec     0.000/sec        0.0000/sec   total: 2
      2024-06-10T18:21:20.994Z|00064|coverage|INFO|txn_unchanged             22.8/sec     3.267/sec        0.0544/sec   total: 223
      2024-06-10T18:21:20.994Z|00065|coverage|INFO|hmap_reserve               0.0/sec     0.000/sec        0.0000/sec   total: 76
      2024-06-10T18:21:20.994Z|00066|coverage|INFO|hmap_expand              8996.4/sec  1176.017/sec       19.6003/sec   total: 91957
      2024-06-10T18:21:20.994Z|00067|coverage|INFO|hmap_pathological          0.4/sec     0.817/sec        0.0136/sec   total: 257
      2024-06-10T18:21:20.994Z|00068|coverage|INFO|miniflow_malloc            0.0/sec     0.000/sec        0.0000/sec   total: 272
      2024-06-10T18:21:20.994Z|00069|coverage|INFO|physical_run               0.0/sec     0.000/sec        0.0000/sec   total: 1
      2024-06-10T18:21:20.994Z|00070|coverage|INFO|consider_logical_flow      0.0/sec     0.000/sec        0.0000/sec   total: 70000
      2024-06-10T18:21:20.994Z|00071|coverage|INFO|lflow_run                  0.0/sec     0.000/sec        0.0000/sec   total: 1
      2024-06-10T18:21:20.994Z|00072|coverage|INFO|116 events never hit
      2024-06-10T18:21:25.915Z|00073|memory|INFO|peak resident set size grew 653% in last 13.6 seconds, from 35104 kB to 264164 kB
      2024-06-10T18:21:25.915Z|00074|memory|INFO|idl-cells-OVN_Southbound:862796 idl-cells-Open_vSwitch:8014 ofctrl_desired_flow_usage-KB:65 ofctrl_sb_flow_ref_usage-KB:20
      2024-06-10T18:21:52.209Z|00004|fatal_signal(ovn_pinctrl0)|WARN|terminating with signal 15 (Terminated)
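
      The "database needs upgrade?" warnings above show that the ovn23.09 ovn-controller connected to a Southbound database that was still serving the older ovn-2021 schema. One way to confirm such a mismatch is to compare the schema version served by the Southbound endpoint with the one shipped in the new package. This is only a sketch: the SSL certificate paths, the schema file path, and IP_ADDR are placeholders for this deployment.

      # Schema version currently served by the Southbound DB (SSL paths are placeholders).
      ovsdb-client -p /path/to/key.pem -c /path/to/cert.pem -C /path/to/cacert.pem \
          get-schema-version "ssl:IP_ADDR:6642" OVN_Southbound
      # Schema version shipped with the ovn23.09 package (path may differ per build).
      ovsdb-tool schema-version /usr/share/ovn/ovn-sb.ovsschema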

       

            [FDP-692] During RHOSP upgrade, a fresh version of the OVN controller failed to establish relationships with the old control plane

            Mark Michelson added a comment - The OSP folks found that this was not a problem in OVN and have requested this issue be closed.

            Jakub Libosvar added a comment - I think we can close this ticket. The issue is in the way TripleO generates PID file paths for the systemd service file and for conmon: they are not the same, which leads to a situation where conmon writes a PID file that systemd does not see. systemd then keeps restarting ovn-controller. Since the restarts are periodic, at some point ovn-controller is restarted while the OVN databases are being upgraded and hence down. ovn-controller then loses the data about services such as DNS and DHCP, so workloads that use them cannot resolve hostnames, or lose connectivity completely if their DHCP lease expires.

            The right fix should go to TripleO.
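
            To illustrate the mismatch described above, one way to check an affected node is to compare the PID file the systemd unit watches with the one conmon actually writes. This is a sketch, not the TripleO fix itself: the unit name (tripleo_ovn_controller.service) and container name (ovn_controller) follow common TripleO conventions and may differ per release.

            # PID file path the systemd unit expects.
            systemctl cat tripleo_ovn_controller.service | grep PIDFile
            # PID file path conmon was asked to write for the container.
            podman inspect ovn_controller --format '{{ .ConmonPidFile }}'

            If the two paths differ, systemd never sees the container as successfully started and keeps restarting it, which matches the periodic ovn-controller restarts described above.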

            Alex Stupnikov added a comment - Hello. Thank you very much for taking a look. I want to ask one follow-up question before proceeding with a reproducer (the environment was fixed and the upgrade was completed some time ago, so we will have to figure out whether we can set up something similar).

            When I looked at the logs, I saw SSL errors first (I am not sure whether they were connectivity or SSL related), but then the OVN controllers started complaining about the OVN DB contents (I assumed that the OVN controllers were able to establish a connection to the control plane and that the actual problem was triggered afterwards). Is that wrong, and were the controllers unable to communicate with the control plane after the initial errors?
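
            For reference when building the reproducer: one way to tell plain connectivity problems apart from SSL/certificate problems is to probe the Southbound endpoint directly from a compute node. This is only a sketch; the certificate paths and IP_ADDR are placeholders for this deployment.

            # If this lists OVN_Southbound, both TCP connectivity and the SSL handshake work.
            ovsdb-client -p /path/to/key.pem -c /path/to/cert.pem -C /path/to/cacert.pem \
                list-dbs "ssl:IP_ADDR:6642"

            In the log above, the failures are reconnect timeouts with no SSL handshake errors, and the schema warnings only appear after "connected", so such a probe mainly helps confirm whether the endpoint was reachable at all during the upgrade window.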

              ovnteam@redhat.com OVN Team
              rhn-support-astupnik Alex Stupnikov
              Jianlin Shi
              Votes: 0
              Watchers: 9

                Created:
                Updated:
                Resolved: