-
Bug
-
Resolution: Unresolved
-
Normal
-
4.19.z
-
Quality / Stability / Reliability
-
False
-
Critical
Description of problem:
We are seeing severe OVN degradation, to the point that OVN fails to attach networks to pods, causing them to fail to start.
This impacts both NetworkPolicy (NetPol) and MultiNetworkPolicy (MNP), but was found with an MNP use case.
It is triggered by an egress policy with a large number of except entries, for example:
apiVersion: k8s.cni.cncf.io/v1beta1
kind: MultiNetworkPolicy
metadata:
  annotations:
    k8s.v1.cni.cncf.io/policy-for: default/vlan530
  name: egressblock
spec:
  egress:
  - to:
    - ipBlock:
        cidr: 0.0.0.0/0
        except:
        - 10.6.153.254/32
        - 10.0.0.0/8
        - 10.128.8.1/32
        - 10.129.0.1/32
        - 10.128.0.1/32
        - 10.131.0.1/32
        - 10.130.0.1/32
        - 10.130.6.1/32
        - 10.129.4.1/32
        - 10.129.2.1/32
        - 10.130.2.1/32
        - 10.131.4.1/32
        - 10.128.6.1/32
        - 10.131.8.1/32
        - 10.129.6.1/32
        - 10.131.6.1/32
        - 10.128.4.1/32
        - 10.130.4.1/32
        - 10.129.8.1/32
        - 100.64.0.0/10
        - 129.0.1.4/32
        - 129.0.1.5/32
        - 129.0.2.153/32
        - 144.42.16.0/24
        - 144.42.27.0/24
        - 144.42.28.0/24
        - 144.42.3.0/24
        - 144.42.34.0/24
        - 144.42.56.0/24
        - 169.254.0.0/16
        - 170.40.0.0/17
        - 172.16.0.0/12
        - 144.42.16.0/24
        - 144.42.27.0/24
        - 144.42.28.0/24
        - 144.42.3.0/24
        - 144.42.34.0/24
        - 144.42.56.0/24
        - 152.161.230.196/30
        - 152.162.200.128/30
        - 152.181.136.88/30
        - 152.181.52.4/30
        - 152.181.57.100/30
        - 152.181.57.112/30
        - 152.181.57.92/30
        - 152.181.58.12/30
        - 152.181.59.208/30
        - 152.181.60.44/30
        - 152.181.60.56/30
        - 152.183.126.220/30
  podSelector:
    matchLabels:
      internet: "true"
  policyTypes:
  - Egress
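The NetworkAttachmentDefinition referenced by the policy-for annotation (default/vlan530) is not included in this report. For context, a minimal sketch of what such an OVN-Kubernetes secondary network could look like, assuming a localnet topology and VLAN 530 (both assumptions, not confirmed from the affected environment):
apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  name: vlan530
  namespace: default
spec:
  # OVN-Kubernetes localnet secondary network (assumed topology); MultiNetworkPolicy for
  # such a network is enforced by OVN, consistent with ovn-controller being the component affected
  config: |
    {
      "cniVersion": "0.3.1",
      "name": "vlan530",
      "type": "ovn-k8s-cni-overlay",
      "topology": "localnet",
      "netAttachDefName": "default/vlan530",
      "vlanID": 530
    }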
When the policy is applied, it has no immediate impact.
When a pod with a secondary NIC and the internet: "true" podSelector label is deployed and scheduled to nodeA, nodeA starts to fail.
ovn-controller CPU usage goes to 100% on that node.
If this is the first such deployment, the pod network is attached successfully but attaching the secondary network fails:
Warning FailedCreatePodSandBox 76s kubelet (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_virt-launcher-rhel9-turquoise-crawdad-1-xs7vr_mrobson-mnp-test_e08df928-c8f0-4b50-8be7-afad26f9f31a_0(f711bbf1bc912d168356060ccc355d2142d2290035b3ba26952e0ff9ffd2ee8b): error adding pod mrobson-mnp-test_virt-launcher-rhel9-turquoise-crawdad-1-xs7vr to CNI network "multus-cni-network": plugin type="multus-shim" name="multus-cni-network" failed (add): CmdAdd (shim): CNI request failed with status 400: 'ContainerID:"f711bbf1bc912d168356060ccc355d2142d2290035b3ba26952e0ff9ffd2ee8b" Netns:"/var/run/netns/003dff83-3e2b-4fd8-ac1d-e22965482b55" IfName:"eth0" Args:"IgnoreUnknown=1;K8S_POD_NAMESPACE=mrobson-mnp-test;K8S_POD_NAME=virt-launcher-rhel9-turquoise-crawdad-1-xs7vr;K8S_POD_INFRA_CONTAINER_ID=f711bbf1bc912d168356060ccc355d2142d2290035b3ba26952e0ff9ffd2ee8b;K8S_POD_UID=e08df928-c8f0-4b50-8be7-afad26f9f31a" Path:"" ERRORED: error configuring pod [mrobson-mnp-test/virt-launcher-rhel9-turquoise-crawdad-1-xs7vr] networking: [mrobson-mnp-test/virt-launcher-rhel9-turquoise-crawdad-1-xs7vr/e08df928-c8f0-4b50-8be7-afad26f9f31a:ovn-kubernetes]: error adding container to network "ovn-kubernetes": CNI request failed with status 400: '[mrobson-mnp-test/virt-launcher-rhel9-turquoise-crawdad-1-xs7vr f711bbf1bc912d168356060ccc355d2142d2290035b3ba26952e0ff9ffd2ee8b network default NAD default] [mrobson-mnp-test/virt-launcher-rhel9-turquoise-crawdad-1-xs7vr f711bbf1bc912d168356060ccc355d2142d2290035b3ba26952e0ff9ffd2ee8b network default NAD default] failed to configure pod interface: timed out waiting for OVS port binding (ovn-installed) for 0a:58:0a:82:06:27 [10.130.6.39/23]
Existing workloads continue to work, but any pod started or restarted on that node then fails earlier, while attaching the primary pod network.
This pushes the impact beyond secondary networks to all workloads on that node.
It appears ovn-controller is stuck in some kind of loop that prevents it from processing anything else.
Even if the pod and the policy are deleted, ovn-controller stays at 100% CPU and no networks can be attached.
The ovnkube-node pod on the affected node must be restarted to recover it.
Other nodes continue to operate normally until a workload with the podSelector label is deployed there.
Version-Release number of selected component (if applicable):
4.19.9 / 4.19.10
How reproducible:
Always
Steps to Reproduce:
1. Apply a MultiNetworkPolicy like the one above, with a 0.0.0.0/0 egress ipBlock and a large number of except entries.
2. Deploy a pod with a secondary NIC (the default/vlan530 NAD) and the internet: "true" label matching the podSelector; an example reproducer pod is sketched below.
3. Observe ovn-controller CPU on the node the pod is scheduled to, along with subsequent pod sandbox / network attachment failures on that node.
(Will add some more detailed steps.)
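A minimal reproducer pod sketch based on the description above; the pod name and image are placeholders, and the secondary network annotation assumes the default/vlan530 NAD referenced by the policy:
apiVersion: v1
kind: Pod
metadata:
  name: mnp-repro                                   # placeholder name
  labels:
    internet: "true"                                # matches the MultiNetworkPolicy podSelector
  annotations:
    k8s.v1.cni.cncf.io/networks: default/vlan530    # requests the secondary NIC via Multus
spec:
  containers:
  - name: test
    image: registry.access.redhat.com/ubi9/ubi      # placeholder image; any long-running workload works
    command: ["sleep", "infinity"]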
Actual results:
OVN stops attaching networks on the affected node.
Expected results:
Policies do not have such a detrimental impact on OVN, and OVN recovers when the failed policy is deleted.
Additional info:
Affected Platforms:
customer issue
internally reproduced
- relates to FDP-1713: ovn-controller may perform very cpu-heavy computations translating != prefix matches to openflow (status: Backlog)