Type: Bug | Resolution: Unresolved | Target release: 4.16.z | Quality / Stability / Reliability | Priority: Critical | Keywords: Customer Escalated
Description of problem:
On OpenShift 4.16, provision 4 VLAN interfaces with the configs in the reproducer steps below.
- Virtual machines are provisioned properly and come online successfully on the target networks.
- After a day or so, the VMs are offline and the VLAN interfaces can no longer reach the gateway.
- Rebuilding the interfaces and configs grants another day or so of connectivity.
- It is unclear what is killing the interfaces, but this is blocking go-live for a major project.
Version-Release number of selected component (if applicable):
4.16.10
How reproducible:
Twice so far, in the customer environment.
Steps to Reproduce:
1. Deploy the following template to create a NodeNetworkConfigurationPolicy (NNCP) and a NetworkAttachmentDefinition (an instantiation sketch follows the YAML):
~~~
apiVersion: nmstate.io/v1
kind: NodeNetworkConfigurationPolicy
metadata:
  name: ovs-br${VLANTAG}-vlan
spec:
  nodeSelector:
    node-role.kubernetes.io/worker: ''
  desiredState:
    interfaces:
      - name: ovs-br-${VLANTAG}
        description: |-
          A dedicated OVS bridge with eth1 as a port
          allowing all VLANs and untagged traffic
        type: ovs-bridge
        state: up
        bridge:
          allow-extra-patch-ports: true
          options:
            stp: false
          port:
            - name: bond0
    ovn:
      bridge-mappings:
        - localnet: localnet-vlan${VLANTAG}
          bridge: ovs-br-${VLANTAG}
          state: present
---
apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  name: localnet-vlan${VLANTAG}
  namespace: ${VMPROJECT}
spec:
  config: |2
    {
      "cniVersion": "0.4.0",
      "name": "localnet-vlan${VLANTAG}",
      "type": "ovn-k8s-cni-overlay",
      "topology": "localnet",
      "netAttachDefName": "${VMPROJECT}/localnet-vlan${VLANTAG}",
      "vlanID": ${VLANTAG}
    }
~~~
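The report does not state how the customer rendered the ${VLANTAG}/${VMPROJECT} placeholders; a minimal sketch, assuming envsubst and a file named vlan-template.yaml (both hypothetical), would look like:
~~~
# Hypothetical rendering of the template above for one VLAN; the actual
# templating mechanism and file name used by the customer are assumptions.
export VLANTAG=16
export VMPROJECT=vmtest
envsubst < vlan-template.yaml | oc apply -f -
# Repeat for the other VLANs referenced in the report (18, 20, 23).
~~~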
Actual results:
- VMs come online and are accessible, then fail after `n` hours and are not recoverable.
Expected results:
Virtual machines and their secondary (VLAN) network interfaces should remain stable after provisioning.
To the customer's knowledge, no nodes, VMs, or network resources were power cycled or migrated during this timeframe (a weekend).
Additional info:
Observing the OVS outputs (ovs-appctl/ovs-vsctl) on the nodes, we see that VLAN 23 remained accessible, but VLANs 16, 18, and 20 are unavailable:
~~~
$ less 0070-sosreport-w-7/sosreport/sos_commands/openvswitch/ovs-appctl_fdb.stats-show_ovs-br-23
Statistics for bridge "ovs-br-23":
Current/maximum MAC entries in the table: 1168/8192
Current static MAC entries in the table : 0
Total number of learned MAC entries : 377359
Total number of expired MAC entries : 376191
Total number of evicted MAC entries : 0
Total number of port moved MAC entries : 0
$ less 0070-sosreport-w-7/sosreport/sos_commands/openvswitch/ovs-appctl_fdb.show_ovs-br-23 | wc -l
1169
~~~
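To compare the MAC learning tables of the healthy and unhealthy bridges on a live node (rather than from the sosreport), something like the following could be run on an affected worker; the bridge names are assumed from the sosreport file names above.
~~~
# e.g. via `oc debug node/<node>` followed by `chroot /host`
for br in ovs-br-16 ovs-br-18 ovs-br-20 ovs-br-23; do
  echo "== $br =="
  ovs-appctl fdb/stats-show "$br"     # learned/expired MAC counters
  ovs-appctl fdb/show "$br" | wc -l   # current number of learned entries
done
~~~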
On the other hand, ovs-br-16 has only a single learned MAC entry, on port 2. This explains why there is no communication with the outside:
~~~
$ cat 0070-sosreport-w-7/sosreport/sos_commands/openvswitch/ovs-appctl_fdb.show_ovs-br-16
port VLAN MAC Age
2 16 02:xx:xx:xx:xx:5d 0
~~~
We can also see differences in the port/patch configuration between ovs-br-23 and ovs-br-16:
~~~
$ less 0070-sosreport-w-7/sosreport/sos_commands/openvswitch/ovs-vsctl_-t_5_show
Bridge ovs-br-16
Port patch-localnet.vlan16_ovn_localnet_port-to-br-int
Interface patch-localnet.vlan16_ovn_localnet_port-to-br-int
type: patch
options: {peer=patch-br-int-to-localnet.vlan16_ovn_localnet_port}
Bridge ovs-br-23
Port bond0
Interface bond0
type: system
~~~
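Since the excerpt above shows bond0 attached to ovs-br-23 but only the patch port on ovs-br-16, one quick check on a live node would be whether bond0 is still a port on each of the VLAN bridges; again, the bridge names are assumptions taken from the sosreport output.
~~~
for br in ovs-br-16 ovs-br-18 ovs-br-20 ovs-br-23; do
  echo "== $br =="
  ovs-vsctl list-ports "$br"   # is bond0 still attached to this bridge?
done
~~~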