Bug
Resolution: Unresolved
4.16.z
Quality / Stability / Reliability
Critical
Customer Escalated
Description of problem:
On OpenShift 4.16, provision 4 VLAN interfaces with the configs in the reproducer steps below.
- Observe that the virtual machines are provisioned properly and come online successfully with the target networks.
- Observe that after a day or so the VMs are offline and the VLAN interfaces can no longer reach the gateway.
- Rebuilding the interfaces and configs grants another day or so of connectivity.
- It is unclear what is killing the interfaces, but this is blocking go-live for a major project.
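For reference, a minimal sketch of checking the reported symptom from inside an affected VM; the VM name, namespace, and gateway address are placeholders, not values taken from this report:
~~~
# Open a console on an affected VM (names are hypothetical placeholders).
virtctl console -n <vm-namespace> <vm-name>

# Inside the guest, check whether the VLAN interface can still reach its gateway.
ip -br addr show
ping -c 3 <vlan-gateway-ip>
~~~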
Version-Release number of selected component (if applicable):
4.16.10
How reproducible:
Twice so far, in the customer environment.
Steps to Reproduce:
1. Deploy the following template config to create a NetworkAttachmentDefinition and a NodeNetworkConfigurationPolicy (NNCP):
~~~
apiVersion: nmstate.io/v1
kind: NodeNetworkConfigurationPolicy
metadata:
  name: ovs-br${VLANTAG}-vlan
spec:
  nodeSelector:
    node-role.kubernetes.io/worker: ''
  desiredState:
    interfaces:
      - name: ovs-br-${VLANTAG}
        description: |-
          A dedicated OVS bridge with eth1 as a port
          allowing all VLANs and untagged traffic
        type: ovs-bridge
        state: up
        bridge:
          allow-extra-patch-ports: true
          options:
            stp: false
          port:
            - name: bond0
    ovn:
      bridge-mappings:
        - localnet: localnet-vlan${VLANTAG}
          bridge: ovs-br-${VLANTAG}
          state: present
---
apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  name: localnet-vlan${VLANTAG}
  namespace: ${VMPROJECT}
spec:
  config: |2
    {
      "cniVersion": "0.4.0",
      "name": "localnet-vlan${VLANTAG}",
      "type": "ovn-k8s-cni-overlay",
      "topology": "localnet",
      "netAttachDefName": "${VMPROJECT}/localnet-vlan${VLANTAG}",
      "vlanID": ${VLANTAG}
    }
~~~
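For completeness, a sketch of one way this template could be rendered and applied for a single VLAN; the VLANTAG/VMPROJECT values and the template filename are assumptions, not taken from the customer environment:
~~~
# Render the template for one VLAN and apply it (values are hypothetical).
export VLANTAG=16 VMPROJECT=vm-project
envsubst < localnet-template.yaml | oc apply -f -

# Verify the NNCP and that the NetworkAttachmentDefinition exists.
oc get nncp ovs-br${VLANTAG}-vlan
oc get net-attach-def -n ${VMPROJECT} localnet-vlan${VLANTAG}
~~~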
Actual results:
- VMs come online and are accessible, then fail after `n` hours and are not recoverable.
Expected results:
Virtualized machines and sub-network interfaces should remain stable after provisioning. To the customer's knowledge, no nodes, VMs, or network resources were power cycled or migrated during this timeframe (over the weekend).
Additional info:
Observing the OVS outputs (ovs-appctl / ovs-vsctl) on the nodes, we see that VLAN 23 remained accessible, while VLANs 18, 16, and 20 are unavailable:
~~~
$ less 0070-sosreport-w-7/sosreport/sos_commands/openvswitch/ovs-appctl_fdb.stats-show_ovs-br-23
Statistics for bridge "ovs-br-23":
  Current/maximum MAC entries in the table: 1168/8192
  Current static MAC entries in the table : 0
  Total number of learned MAC entries     : 377359
  Total number of expired MAC entries     : 376191
  Total number of evicted MAC entries     : 0
  Total number of port moved MAC entries  : 0

$ less 0070-sosreport-w-7/sosreport/sos_commands/openvswitch/ovs-appctl_fdb.show_ovs-br-23 | wc -l
1169
~~~
By contrast, ovs-br-16 has only a single learned MAC entry, on port 2, which explains why there is no communication outside:
~~~
$ cat 0070-sosreport-w-7/sosreport/sos_commands/openvswitch/ovs-appctl_fdb.show_ovs-br-16
 port  VLAN  MAC                Age
    2    16  02:xx:xx:xx:xx:5d    0
~~~
We can also see differences between ovs-br-23 and ovs-br-16 in the bridge and patch-port configuration reported by ovs-vsctl:
~~~
$ less 0070-sosreport-w-7/sosreport/sos_commands/openvswitch/ovs-vsctl_-t_5_show
    Bridge ovs-br-16
        Port patch-localnet.vlan16_ovn_localnet_port-to-br-int
            Interface patch-localnet.vlan16_ovn_localnet_port-to-br-int
                type: patch
                options: {peer=patch-br-int-to-localnet.vlan16_ovn_localnet_port}
    Bridge ovs-br-23
        Port bond0
            Interface bond0
                type: system
~~~
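Assuming node access is available, the same data can be compared live on an affected worker; the node name below is a placeholder and the commands are a sketch, not the exact collection used for the sosreport:
~~~
# From a debug shell on the node (node name is hypothetical).
oc debug node/<worker-node> -- chroot /host bash -c '
  # Ports attached to a healthy and an affected bridge.
  ovs-vsctl list-ports ovs-br-23
  ovs-vsctl list-ports ovs-br-16
  # Learned MAC entries per bridge; a near-empty table matches the failure.
  ovs-appctl fdb/show ovs-br-23 | wc -l
  ovs-appctl fdb/show ovs-br-16
'
~~~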