Task
Resolution: Unresolved
Normal
rhel-9
This task tracks the test case writing activities to cover the bug described below.
Configuring multiple nodes with the same ovn-encap-ip causes the OVS db (conf.db) to increase rapidly in size.
Let's suppose multiple nodes are properly configured, each using a different local_ip, e.g. hv1 using ip1, hv2 using ip2 and hv3 using ip3.
If one node (e.g. hv1) is then (re)configured with the same ovn-encap-ip as another node (e.g. hv2), i.e. on hv1: ovs-vsctl set open . external-ids:ovn-encap-ip=ip2, the current ovn-controller will:
- Delete existing geneve tunnels (using ip1).
- Recreate new tunnels (using ip2).
- Create Encap with ip2 and chassis=hv1 in sb.
This commit will fail, as there is already an Encap with ip2 and chassis=hv2. When detecting the commit failure, ovn-controller will:
- Delete existing geneve tunnels (using ip2).
- Recreate new tunnels using ip1.
In the next run, ovn-controller will try again to use ip2, deleting existing tunnels (using ip1), recreating new ones (using ip2), and trying again to commit to sb. And so on.
ovn-controller could detect that an Encap with ip2, the same type, and chassis != hv1 already exists in sb. In that case it could skip deleting/recreating the tunnels, skip the (failing) attempt to write the Encap to sb, and log a rate-limited error (e.g. at most 1 per 10 seconds).
This is a configuration issue, so it should be fixed by the CMS/user. However, while the issue is present, it causes high CPU usage on SB, ovn-controller and ovs. In addition, the OVS db (conf.db) increases quickly in size (a local test with a few chassis shows an increase of 1 MB per second).
When the issue is resolved by CMS (corrected ovn-encap-ip on hv1 or hv2), then ovn-controller wakes up due to ovs or sb change and properly handles the change.
Reproduction Steps: Provide detailed steps or scripts to replicate the issue.
There might be simpler ways, but it can be easily reproduced using ovn-fake-multinode:
Find local_ip for ovn-chassis-2: ip2=$(podman exec ovn-chassis-2 ovs-vsctl get open . external_ids:ovn-encap-ip)
Use this ip in chassis-3: podman exec ovn-chassis-3 ovs-vsctl set open . external-ids:ovn-encap-ip=$ip2
Expected Behavior: Describe what should happen under normal circumstances.
ovn-controller should log a rate-limited error and avoid looping through tunnel deletion and creation and failed commits to sb.
Observed Behavior: Explain what actually happens.
conf.db on ovn-chassis-3 increases dramatically in size, commit failures appear in sb, and ovn-controller repeatedly deletes and recreates tunnels.
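For a test case, the conf.db growth can be turned into a number that is easy to assert on. The sketch below samples a file size twice and prints the average growth in bytes per second. The function name `db_growth_rate`, the default interval, and the conf.db path mentioned in the comment are assumptions, not values taken from this report.

```shell
# Hypothetical helper: print the average growth of FILE in bytes per
# second, measured over INTERVAL seconds (default 5, whole seconds only).
db_growth_rate() {
    file=$1
    interval=${2:-5}
    before=$(wc -c < "$file")
    sleep "$interval"
    after=$(wc -c < "$file")
    # Integer division; a negative value means the file shrank
    # (e.g. after a database compaction).
    echo $(( (after - before) / interval ))
}

# Possible usage inside the ovn-chassis-3 container, assuming conf.db
# lives at /etc/openvswitch/conf.db there:
#   db_growth_rate /etc/openvswitch/conf.db 10
```

With the bug present the printed rate should stay clearly above zero for as long as the duplicate ovn-encap-ip is configured; after a fix it should drop to roughly zero. The threshold a test uses is a judgment call and would need tuning for the environment.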