Red Hat OpenStack Services on OpenShift / OSPRH-9899

ovn-controller loses the connection to the ovsdb servers after nmstate is automatically upgraded to a newer version

    • Type: Bug
    • Resolution: Done
    • Priority: Critical
    • Affects Version: rhos-18.0.0
    • Component: ovn-operator
    • Sprint: Neutron Sprint 3, Neutron Sprint 4, Neutron Sprint 5
    • Severity: Important

      (I am not sure which component to choose for this report, but the affected component is ovn-controller, so for now I chose ovn-operator.)

      I've been chasing this problem for almost 6 weeks on a deployment that has been up and running for 2 months, because I couldn't find the cause and did not know how to reproduce it easily.

      I had two different RHOSO 18 Beta deployments which suddenly became dysfunctional in mid-July: the networking of newly created VMs did not work (ports could not be properly created and assigned). It turned out that the ovn-controller instances had lost their connection to the ovsdb server. The reason was that there was no network interface with the IP address that was supposed to serve the ovsdb server endpoint (ovsdbserver-sb.openstack.svc) on the ovsdbserver-sb-0/ovsdbserver-nb-0 pods, even though the output of "oc describe pod ovsdbserver-sb-0" still showed the interface in k8s.v1.cni.cncf.io/network-status and the interface had been created in the pod at pod creation time.
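
      For reference, a minimal way to cross-check this (standard oc and ip commands only; namespace and pod name as used in this report) is to compare the CNI network-status annotation with the interfaces that are actually present in the pod:

      $ oc -n openstack get pod ovsdbserver-sb-0 -o yaml | grep -A 20 'k8s.v1.cni.cncf.io/network-status'
      $ oc -n openstack rsh ovsdbserver-sb-0 ip -o addr show
      # In the broken state the annotation still lists the "internalapi" interface with 172.17.0.30,
      # while "ip addr" inside the pod only shows lo and eth0.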

      Based on the logs, it seemed that the interface loss in the ovsdbserver pods happened at the same time that metallb was automatically updated to a newer version (I do not know how that update happens automatically), but I could not reproduce it manually by downgrading metallb and letting it be updated automatically, or by anything else I tried. So I let the deployment idle.

      The problem reoccurred once a newer version of metallb was released and it got automatically updated on my setup about 2 weeks ago. The metallb version it got updated to was: https://catalog.redhat.com/software/containers/openshift4/metallb-rhel9/6528009bdb21f9aee03ebf69?image=66b4fa511db8d828526ac531&container-tabs=gti
      The first time I experienced the problem was when the deployment got updated to the version: https://catalog.redhat.com/software/containers/openshift4/metallb-rhel9/6528009bdb21f9aee03ebf69?image=668bc3a9a0eef2d338fcfc28&container-tabs=gti in mid-July. The OCP version is 4.15 because the deployment was created 2 months ago.
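
      (As an aside, operator updates like this are normally driven by an OLM Subscription with installPlanApproval set to Automatic. A sketch of how to see what the operator was updated to and when, assuming the default metallb-system namespace, which may differ on a given cluster:)

      $ oc -n metallb-system get csv
      $ oc -n metallb-system get subscription -o yaml | grep -E 'installPlanApproval|currentCSV|installedCSV'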

      The current situation after metallb got updated is the following:
      1. The ovn-controller instances lost their connection to ovsdb servers at the time of metallb update:
      2024-08-22T12:23:26.041Z|00054|reconnect|ERR|ssl:ovsdbserver-sb.openstack.svc:6642: no response to inactivity probe after 60 seconds, disconnecting
      2024-08-22T12:23:26.041Z|00055|reconnect|INFO|ssl:ovsdbserver-sb.openstack.svc:6642: connection dropped
      2024-08-22T12:23:26.042Z|00056|main|INFO|OVNSB commit failed, force recompute next time.
      2024-08-22T12:23:27.044Z|00057|reconnect|INFO|ssl:ovsdbserver-sb.openstack.svc:6642: connecting...
      2024-08-22T12:23:28.044Z|00058|reconnect|INFO|ssl:ovsdbserver-sb.openstack.svc:6642: connection attempt timed out
      2024-08-22T12:23:28.045Z|00059|reconnect|INFO|ssl:ovsdbserver-sb.openstack.svc:6642: waiting 2 seconds before reconnect
      2024-08-22T12:23:30.049Z|00060|reconnect|INFO|ssl:ovsdbserver-sb.openstack.svc:6642: connecting...
      2024-08-22T12:23:32.051Z|00061|reconnect|INFO|ssl:ovsdbserver-sb.openstack.svc:6642: connection attempt timed out
      2024-08-22T12:23:32.052Z|00062|reconnect|INFO|ssl:ovsdbserver-sb.openstack.svc:6642: waiting 4 seconds before reconnect
      2024-08-22T12:23:36.057Z|00063|reconnect|INFO|ssl:ovsdbserver-sb.openstack.svc:6642: connecting...
      2024-08-22T12:23:38.237Z|00064|reconnect|INFO|ssl:ovsdbserver-sb.openstack.svc:6642: connection attempt failed (No route to host)
      2024-08-22T12:23:38.237Z|00065|reconnect|INFO|ssl:ovsdbserver-sb.openstack.svc:6642: continuing to reconnect in the background but suppressing further logging
      Because the endpoint is not reachable:

      1. ping ovsdbserver-sb.openstack.svc
        PING ovsdbserver-sb.openstack.svc (172.17.0.30) 56(84) bytes of data.
        From 172.17.10.1 (172.17.10.1) icmp_seq=3 Destination Host Unreachable
        From 172.17.10.1 (172.17.10.1) icmp_seq=4 Destination Host Unreachable
        From 172.17.10.1 (172.17.10.1) icmp_seq=5 Destination Host Unreachable

      2. That's because there is no such interface on the ovsdbserver-sb pod:
      $ oc rsh ovsdbserver-sb-0
      sh-5.1$ ip a
      1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
          link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
          inet 127.0.0.1/8 scope host lo
             valid_lft forever preferred_lft forever
          inet6 ::1/128 scope host 
             valid_lft forever preferred_lft forever
      2: eth0@if3773: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1400 qdisc noqueue state UP group default 
          link/ether 0a:58:c0:a8:1b:6b brd ff:ff:ff:ff:ff:ff link-netnsid 0
          inet 192.168.27.107/22 brd 192.168.27.255 scope global eth0
             valid_lft forever preferred_lft forever
          inet6 fe80::858:c0ff:fea8:1b6b/64 scope link 
             valid_lft forever preferred_lft forever
      sh-5.1$
      Even though I am sure it was functional for a month and an internalapi interface with an IP from the internalapi network range was assigned, as described in the pod definition:
      $ oc describe pod ovsdbserver-sb-0
      Name:             ovsdbserver-sb-0
      Namespace:        openstack
      Priority:         0
      Service Account:  ovncluster-ovndbcluster-sb
      Node:             master-2/192.168.111.22
      Start Time:       Mon, 29 Jul 2024 16:29:11 -0400
      Labels:           apps.kubernetes.io/pod-index=0
                        controller-revision-hash=ovsdbserver-sb-c5bc68d98
                        service=ovsdbserver-sb
                        statefulset.kubernetes.io/pod-name=ovsdbserver-sb-0
      Annotations:      k8s.ovn.org/pod-networks:
                          {"default":{"ip_addresses":["192.168.27.107/22"],"mac_address":"0a:58:c0:a8:1b:6b","gateway_ips":["192.168.24.1"],"routes":[{"dest":"192.1...
                        k8s.v1.cni.cncf.io/network-status:
                          [{
                              "name": "ovn-kubernetes",
                              "interface": "eth0",
                              "ips": [
                                  "192.168.27.107"
                              ],
                              "mac": "0a:58:c0:a8:1b:6b",
                              "default": true,
                              "dns": {}
                          },{
                              "name": "openstack/internalapi",
                              "interface": "internalapi",
                              "ips": [
                                  "172.17.0.30"
                              ],
                              "mac": "5a:1a:e8:59:f0:d7",
                              "dns": {}
                          }]
                        k8s.v1.cni.cncf.io/networks: [{"name":"internalapi","namespace":"openstack","interface":"internalapi"}]
                        openshift.io/scc: restricted-v2
                        seccomp.security.alpha.kubernetes.io/pod: runtime/default
      Status:           Running
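
      The internalapi interface itself comes from the openstack/internalapi NetworkAttachmentDefinition, and the node networking underneath it is typically managed by kubernetes-nmstate on these deployments (the issue title points at an nmstate upgrade). A sketch of additional state that may be worth capturing, assuming the standard CRD short names and the default openshift-nmstate namespace:

      $ oc -n openstack get network-attachment-definitions internalapi -o yaml
      $ oc get nncp,nnce                    # NodeNetworkConfigurationPolicy / Enactment status
      $ oc -n openshift-nmstate get pods    # nmstate handler pods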

      I am not able to reproduce the problem if I try to restart/recreate/downgrade metallb manually, nor do I really understand how this automatic update happens, but I am quite sure that the automatic upgrade of metallb is the trigger for the problem on the setups I had/have (which are 2-month-old RHOSO 18 deployments on OCP 4.15).
      If I delete the ovsdbserver pods and let them get recreated, it fixes the problem and the internalapi interface is recreated (see the sketch below).
      I am not sure which logs to provide, if any, so please let me know.
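
      A sketch of the workaround and of the logs/events that are probably the most relevant to collect (the pod label comes from the describe output above; namespaces other than openstack are assumptions):

      # Workaround: delete the ovsdbserver pods; the StatefulSet recreates them and the internalapi interface comes back
      $ oc -n openstack delete pod ovsdbserver-nb-0 ovsdbserver-sb-0
      $ oc -n openstack get pods -l service=ovsdbserver-sb -w

      # Candidate logs/events to attach
      $ oc -n openstack logs ovsdbserver-sb-0 --all-containers
      $ oc -n openstack get events --field-selector involvedObject.name=ovsdbserver-sb-0
      $ oc -n metallb-system get events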

              Yatin Karel (ykarel@redhat.com)
              Marian Krcmarik (mkrcmari@redhat.com)
              Renjing Xiao
              rhos-dfg-networking-squad-neutron