Type: Bug
Resolution: Done-Errata
Priority: Normal
Affects versions: 4.12.z, 4.13.z, 4.14.z
This is a clone of issue OCPBUGS-48777. The following is the description of the original issue:
—
Description of problem:
This issue started on one of the clusters that our partner manages for a customer, where route cache entries were appearing with incorrect MTUs, very similar to the issue we worked on a few months ago (linked below). However, the root cause does not look to be the same: the customer's version has all the fixes, and the problem seems very particular to how they use the cluster. It is also very hard (almost impossible) to reproduce, unlike before we introduced the fixes from the Jira below.
So far, the partner and I were only able to reproduce it by doing upgrades, in this case EUS to EUS from 4.12 to 4.14. I ran several reproducers doing this upgrade to see when the issue appears.
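For reference, the reproducers followed the documented EUS-to-EUS flow; a minimal sketch of the commands I used (the channel name matches my lab, adjust as needed):
$ oc patch mcp/worker --type merge -p '{"spec":{"paused":true}}'    # pause workers, per the EUS procedure
$ oc adm upgrade channel eus-4.14
$ oc adm upgrade --to-latest=true                                   # 4.12 -> 4.13, then run again for 4.14
$ oc patch mcp/worker --type merge -p '{"spec":{"paused":false}}'   # unpause once the control plane is on 4.14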
Another difference from the previous bug is that flushing the cache doesn't help: immediately after being flushed, the route cache entries appear again. The only solution is to reboot the nodes.
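To illustrate what "flushing doesn't help" looks like, this is the check I run from a debug shell on an affected node (plain iproute2, nothing specific to our tooling):
$ ip route show cache     # bad entries with a wrong mtu attribute show up here
$ ip route flush cache
$ sleep 2
$ ip route show cache     # entries are already back; only a node reboot clears them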
The results are practically the same as we have seen before: TCP connections are severely affected, causing many network connectivity failures and degraded network performance, along with a high number of packet errors and drops on the geneve port, ICMP "fragmentation needed" messages seen in tcpdumps, etc.
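A quick way to confirm those symptoms on a node (genev_sys_6081 is the usual OVN-Kubernetes geneve interface name, adjust if yours differs; the tcpdump filter matches ICMP type 3 code 4, "fragmentation needed"):
$ ip -s link show genev_sys_6081                    # RX/TX errors and drops counters keep growing
$ tcpdump -ni any 'icmp[0] == 3 and icmp[1] == 4'   # ICMP fragmentation needed seen on the wire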
These clusters are bundles created by our partner that consist of:
- 3-node master/worker cluster, optionally with 2 or 3 additional worker nodes;
- IPI bare metal installation;
- ODF, CNV, SRIOV, MetalLB and NMState operators as base;
- Additional bridges and VLANs used for SRIOV to be used on VMs;
- Default subdomain and public API routing done via MetalLB LoadBalancer service;
- Main network bond interface has 2 VLANs: one VLAN on the machine network for internal traffic only, and a second VLAN on a different network for public access. Both services above are exposed on this second VLAN, and it also holds the default route, which is changed after installation.
- OVN gateway router set to local mode.
- Cluster MTU set to 9000/8900
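For reference, the last two items map to the Network operator config; a minimal sketch of how I verify them, assuming OVN-Kubernetes defaults (host MTU 9000 leaves 8900 for the overlay after the geneve overhead, and routingViaHost: true corresponds to the local gateway mode):
$ oc get network.operator cluster -o jsonpath='{.spec.defaultNetwork.ovnKubernetesConfig.mtu}'
8900
$ oc get network.operator cluster -o jsonpath='{.spec.defaultNetwork.ovnKubernetesConfig.gatewayConfig.routingViaHost}'
true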
Tests I made to try to narrow down the components involved in causing the issue, starting with the ones where I didn't see the issue:
- Cluster with the specs mentioned above, apart from SRIOV. Instead, I used another interface in my VMs where I create a bridge using the NMState operator. This bridge is then used for the cnv-bridge net-attach-def (similar to the example in our OpenShift Virtualization docs). Tested with VMs, some of them using interfaces on the cnv-bridge. ODF used for rbd and cephfs. OVN gateway in shared mode.
- Cluster with all the same specs but no VMs and no additional bridge configured (whether CNV alone is configured or not makes no difference). No issue, as expected, which seems to prove there is no regression from the Jira below.
The configurations where I do see the issue seem to have one thing in common: either VMs with additional networks, or additional bridges, or a combination of both, plus the OVN gateway in local mode:
- Tested cluster with VMs with interfaces on cnv-bridge with and without live migration enabled.
- Tested cluster with CNV configured including cnv-bridge ready for VMs, but no VMs created.
- Tested cluster with CNV configured including cnv-bridge and VMs running on the cluster network and on this bridge. External VLAN was also configured with MTU 9000 the same way as the internal VLAN.
- These tests produce results like this (taken from several runs):
https://privatebin.corp.redhat.com/?47cb4fde3d39062f#98gvHAC7ptsoqcLfGfj2LXSd1hg9oygenkukU6NVVBHW
Results like the one below are from when I try to run `ip route flush cache`. A second or two later the cache is immediately recreated, and the network issues on the cluster don't get any better:
master-node0.ocp4-aio-cluster.redhatrules.local | CHANGED | rc=0 >>
unicast 172.23.184.101 dev br-ex
cache users 1 age 15sec
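For comparison, a quick filter I use to list only the problematic entries; the bad ones carry an explicit mtu attribute lower than the cluster MTU (the grep is just my shortcut, not an official check):
$ ip route show cache | grep -B1 'mtu'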
I have a lot of data gathered from my clusters, and the customer has uploaded several data sets to the case attached to this Jira.
Let me know what the best way to share the data is.
Version-Release number of selected component (if applicable):
So far seen in OCP 4.12, 4.13 and 4.14.
How reproducible:
Sometimes. So far we could only reproduce it during upgrades.
Steps to Reproduce:
1. There are several ways to reproduce, but so far only with upgrades; the issue starts when the CNO begins its part of the upgrade and restarts OVN.
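To catch the exact moment the bad entries appear, a simple watch on each node while the CNO rolls OVN out is enough (a hedged sketch; the grep count is just my heuristic for "bad entries present"):
$ watch -n 5 'ip route show cache | grep -c mtu'    # count jumps above 0 once OVN restarts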
Actual results:
Expected results:
Additional info:
This is very similar to an old bug we had in Jira - https://issues.redhat.com/browse/OCPBUGS-26700 - but nothing indicates it is a regression, since this issue is not reproducible on a plain OCP cluster even with the same VLAN configuration and cluster network setup.
clones: OCPBUGS-48777 [4.17] [OVN+CNV] Cached route with incorrect MTU causes huge instability in the cluster (Closed)
is blocked by: OCPBUGS-48777 [4.17] [OVN+CNV] Cached route with incorrect MTU causes huge instability in the cluster (Closed)
links to: RHBA-2025:8556 OpenShift Container Platform 4.16.42 bug fix update