- Bug
- Resolution: Done
- Critical
- None
- rhos-18.0.0
- None
(I am not sure which component to choose for this report, but the affected component is ovn-controller, so for now I chose ovn-operator.)
I've been chasing this problem for almost 6 weeks on a deployment that has been up and running for 2 months, because I couldn't find the cause and did not know how to reproduce it easily.
I had two different RHOSO18 Beta deployments which suddenly became dysfunctional in mid-July: the networking of newly created VMs did not work (ports could not be properly created and assigned). It turned out that the ovn-controller instances had lost their connection to the ovsdb server. The reason was that there was no network interface with the IP address that is supposed to serve as the ovsdb server endpoint (ovsdbserver-sb.openstack.svc) on the ovsdbserver-sb-0/ovsdbserver-nb-0 pods, even though the output of "oc describe pod ovsdbserver-sb-0" still showed the interface in k8s.v1.cni.cncf.io/network-status and the interface had been created in the pod at pod creation time.
Based on the logs, it seemed that the interface loss in the ovsdbserver pods happened at the same time metallb was automatically updated to a newer version (I do not know how that update happens automatically). I could not reproduce it manually, by downgrading metallb and letting it be updated automatically again, or by anything else I tried, so I let the deployment idle.
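As an aside, I assume the "automatic" update happens because the metallb operator's OLM Subscription has installPlanApproval: Automatic. A rough way to check that and the update history, assuming the operator lives in the default metallb-system namespace:
$ oc -n metallb-system get subscription -o yaml | grep installPlanApproval
$ oc -n metallb-system get installplan,csv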
The problem reoccurred about 2 weeks ago, when a newer version of metallb was released and my setup got automatically updated to it. The metallb version it got updated to was: https://catalog.redhat.com/software/containers/openshift4/metallb-rhel9/6528009bdb21f9aee03ebf69?image=66b4fa511db8d828526ac531&container-tabs=gti
The first time I experienced the problem was in mid-July, when the deployment got updated to the version: https://catalog.redhat.com/software/containers/openshift4/metallb-rhel9/6528009bdb21f9aee03ebf69?image=668bc3a9a0eef2d338fcfc28&container-tabs=gti The OCP cluster is on 4.15 because the deployment was created 2 months ago.
The current situation, after metallb got updated, is the following:
1. The ovn-controller instances lost their connection to the ovsdb servers at the time of the metallb update:
2024-08-22T12:23:26.041Z|00054|reconnect|ERR|ssl:ovsdbserver-sb.openstack.svc:6642: no response to inactivity probe after 60 seconds, disconnecting
2024-08-22T12:23:26.041Z|00055|reconnect|INFO|ssl:ovsdbserver-sb.openstack.svc:6642: connection dropped
2024-08-22T12:23:26.042Z|00056|main|INFO|OVNSB commit failed, force recompute next time.
2024-08-22T12:23:27.044Z|00057|reconnect|INFO|ssl:ovsdbserver-sb.openstack.svc:6642: connecting...
2024-08-22T12:23:28.044Z|00058|reconnect|INFO|ssl:ovsdbserver-sb.openstack.svc:6642: connection attempt timed out
2024-08-22T12:23:28.045Z|00059|reconnect|INFO|ssl:ovsdbserver-sb.openstack.svc:6642: waiting 2 seconds before reconnect
2024-08-22T12:23:30.049Z|00060|reconnect|INFO|ssl:ovsdbserver-sb.openstack.svc:6642: connecting...
2024-08-22T12:23:32.051Z|00061|reconnect|INFO|ssl:ovsdbserver-sb.openstack.svc:6642: connection attempt timed out
2024-08-22T12:23:32.052Z|00062|reconnect|INFO|ssl:ovsdbserver-sb.openstack.svc:6642: waiting 4 seconds before reconnect
2024-08-22T12:23:36.057Z|00063|reconnect|INFO|ssl:ovsdbserver-sb.openstack.svc:6642: connecting...
2024-08-22T12:23:38.237Z|00064|reconnect|INFO|ssl:ovsdbserver-sb.openstack.svc:6642: connection attempt failed (No route to host)
2024-08-22T12:23:38.237Z|00065|reconnect|INFO|ssl:ovsdbserver-sb.openstack.svc:6642: continuing to reconnect in the background but suppressing further logging
This is because the endpoint is not reachable (note that the name still resolves to 172.17.0.30; it is the address itself that is unreachable):
$ ping ovsdbserver-sb.openstack.svc
PING ovsdbserver-sb.openstack.svc (172.17.0.30) 56(84) bytes of data.
From 172.17.10.1 (172.17.10.1) icmp_seq=3 Destination Host Unreachable
From 172.17.10.1 (172.17.10.1) icmp_seq=4 Destination Host Unreachable
From 172.17.10.1 (172.17.10.1) icmp_seq=5 Destination Host Unreachable
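Since the name still resolves, it may also be worth checking what the Service maps to. The service name here is just inferred from the DNS name above, so take this as a sketch:
$ oc -n openstack get svc ovsdbserver-sb
$ oc -n openstack get endpoints ovsdbserver-sb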
2. That is because there is no interface with that IP address on the ovsdbserver-sb pod:
$ oc rsh ovsdbserver-sb-0
sh-5.1$ ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: eth0@if3773: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1400 qdisc noqueue state UP group default
link/ether 0a:58:c0:a8:1b:6b brd ff:ff:ff:ff:ff:ff link-netnsid 0
inet 192.168.27.107/22 brd 192.168.27.255 scope global eth0
valid_lft forever preferred_lft forever
inet6 fe80::858:c0ff:fea8:1b6b/64 scope link
valid_lft forever preferred_lft forever
sh-5.1$
Even though I am sure it was functional for a month, and an internalapi interface with an IP from the internalapi network range was assigned, as described in the pod definition:
$ oc describe pod ovsdbserver-sb-0
Name: ovsdbserver-sb-0
Namespace: openstack
Priority: 0
Service Account: ovncluster-ovndbcluster-sb
Node: master-2/192.168.111.22
Start Time: Mon, 29 Jul 2024 16:29:11 -0400
Labels: apps.kubernetes.io/pod-index=0
controller-revision-hash=ovsdbserver-sb-c5bc68d98
service=ovsdbserver-sb
statefulset.kubernetes.io/pod-name=ovsdbserver-sb-0
Annotations: k8s.ovn.org/pod-networks:
{"default":{"ip_addresses":["192.168.27.107/22"],"mac_address":"0a:58:c0:a8:1b:6b","gateway_ips":["192.168.24.1"],"routes":[{"dest":"192.1...
k8s.v1.cni.cncf.io/network-status:
[{
"name": "ovn-kubernetes",
"interface": "eth0",
"ips": [
"192.168.27.107"
],
"mac": "0a:58:c0:a8:1b:6b",
"default": true,
"dns": {}
},{
"name": "openstack/internalapi",
"interface": "internalapi",
"ips": [
"172.17.0.30"
],
"mac": "5a:1a:e8:59:f0:d7",
"dns": {}
}]
k8s.v1.cni.cncf.io/networks: [{"name":"internalapi","namespace":"openstack","interface":"internalapi"}]
openshift.io/scc: restricted-v2
seccomp.security.alpha.kubernetes.io/pod: runtime/default
Status: Running
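For completeness, the network-status annotation can also be pulled directly instead of going through describe, e.g.:
$ oc -n openstack get pod ovsdbserver-sb-0 -o jsonpath='{.metadata.annotations.k8s\.v1\.cni\.cncf\.io/network-status}'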
I am not able to reproduce the problem if I restart/recreate/downgrade metallb manually, and I don't really understand how this automatic update happens, but I am quite sure that the automatic upgrade of metallb is the trigger for the problem on the setups I had/have (which are 2-month-old RHOSO18 deployments on OCP 4.15).
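For reference, the manual restarts I mean were along these lines (the controller/speaker names are from the default metallb-system layout, so treat this as a sketch):
$ oc -n metallb-system rollout restart deployment/controller
$ oc -n metallb-system rollout restart daemonset/speaker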
If I delete the ovsdbserver pods, they get recreated, which fixes the problem: the internalapi interface is recreated as well.
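In other words, the workaround is simply deleting the pods and letting the StatefulSet recreate them:
$ oc -n openstack delete pod ovsdbserver-sb-0 ovsdbserver-nb-0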
I am not sure which logs to provide, if any, so please let me know.
- is triggering: OCPBUGS-45476 With multiple routes (and no table-id set) vlan interfaces are recreated on nmstate-operator update or nmstate-handler restart (New)
- links to: OSPRH-11003 Root Cause and Refine OSPRH-9899 (Closed)