-
Bug
-
Resolution: Unresolved
-
Normal
-
4.19.z
-
Quality / Stability / Reliability
-
False
-
Critical
Description of problem:
We are seeing severe OVN degradation, to the point that OVN fails to attach networks to pods, causing them to fail to start.
This impacts both NetworkPolicy (NetPol) and MultiNetworkPolicy (MNP), but was found with an MNP use case.
It is triggered by an egress policy with a large number of except entries, for example:
apiVersion: k8s.cni.cncf.io/v1beta1
kind: MultiNetworkPolicy
metadata:
  annotations:
    k8s.v1.cni.cncf.io/policy-for: default/vlan530
  name: egressblock
spec:
  egress:
  - to:
    - ipBlock:
        cidr: 0.0.0.0/0
        except:
        - 10.6.153.254/32
        - 10.0.0.0/8
        - 10.128.8.1/32
        - 10.129.0.1/32
        - 10.128.0.1/32
        - 10.131.0.1/32
        - 10.130.0.1/32
        - 10.130.6.1/32
        - 10.129.4.1/32
        - 10.129.2.1/32
        - 10.130.2.1/32
        - 10.131.4.1/32
        - 10.128.6.1/32
        - 10.131.8.1/32
        - 10.129.6.1/32
        - 10.131.6.1/32
        - 10.128.4.1/32
        - 10.130.4.1/32
        - 10.129.8.1/32
        - 100.64.0.0/10
        - 129.0.1.4/32
        - 129.0.1.5/32
        - 129.0.2.153/32
        - 144.42.16.0/24
        - 144.42.27.0/24
        - 144.42.28.0/24
        - 144.42.3.0/24
        - 144.42.34.0/24
        - 144.42.56.0/24
        - 169.254.0.0/16
        - 170.40.0.0/17
        - 172.16.0.0/12
        - 144.42.16.0/24
        - 144.42.27.0/24
        - 144.42.28.0/24
        - 144.42.3.0/24
        - 144.42.34.0/24
        - 144.42.56.0/24
        - 152.161.230.196/30
        - 152.162.200.128/30
        - 152.181.136.88/30
        - 152.181.52.4/30
        - 152.181.57.100/30
        - 152.181.57.112/30
        - 152.181.57.92/30
        - 152.181.58.12/30
        - 152.181.59.208/30
        - 152.181.60.44/30
        - 152.181.60.56/30
        - 152.183.126.220/30
  podSelector:
    matchLabels:
      internet: "true"
  policyTypes:
  - Egress
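The NetworkAttachmentDefinition referenced by the policy-for annotation (default/vlan530) is not included in this report. For context, a minimal sketch of what such an OVN-Kubernetes secondary network could look like, assuming a localnet topology and VLAN 530 (both assumptions, not confirmed from the affected environment):
apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  name: vlan530
  namespace: default
spec:
  # OVN-Kubernetes localnet secondary network (assumed topology); MultiNetworkPolicy for
  # such a network is enforced by OVN, consistent with ovn-controller being the component affected
  config: |
    {
      "cniVersion": "0.3.1",
      "name": "vlan530",
      "type": "ovn-k8s-cni-overlay",
      "topology": "localnet",
      "netAttachDefName": "default/vlan530",
      "vlanID": 530
    }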
When the policy is applied, it has no immediate impact.
When a pod with a secondary NIC and the internet: "true" podSelector label is deployed and scheduled to nodeA, nodeA starts to fail.
ovn-controller CPU usage goes to 100% on that node.
If this is the first such deployment, the pod network is attached successfully but attaching the secondary network fails:
Warning FailedCreatePodSandBox 76s kubelet (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_virt-launcher-rhel9-turquoise-crawdad-1-xs7vr_mrobson-mnp-test_e08df928-c8f0-4b50-8be7-afad26f9f31a_0(f711bbf1bc912d168356060ccc355d2142d2290035b3ba26952e0ff9ffd2ee8b): error adding pod mrobson-mnp-test_virt-launcher-rhel9-turquoise-crawdad-1-xs7vr to CNI network "multus-cni-network": plugin type="multus-shim" name="multus-cni-network" failed (add): CmdAdd (shim): CNI request failed with status 400: 'ContainerID:"f711bbf1bc912d168356060ccc355d2142d2290035b3ba26952e0ff9ffd2ee8b" Netns:"/var/run/netns/003dff83-3e2b-4fd8-ac1d-e22965482b55" IfName:"eth0" Args:"IgnoreUnknown=1;K8S_POD_NAMESPACE=mrobson-mnp-test;K8S_POD_NAME=virt-launcher-rhel9-turquoise-crawdad-1-xs7vr;K8S_POD_INFRA_CONTAINER_ID=f711bbf1bc912d168356060ccc355d2142d2290035b3ba26952e0ff9ffd2ee8b;K8S_POD_UID=e08df928-c8f0-4b50-8be7-afad26f9f31a" Path:"" ERRORED: error configuring pod [mrobson-mnp-test/virt-launcher-rhel9-turquoise-crawdad-1-xs7vr] networking: [mrobson-mnp-test/virt-launcher-rhel9-turquoise-crawdad-1-xs7vr/e08df928-c8f0-4b50-8be7-afad26f9f31a:ovn-kubernetes]: error adding container to network "ovn-kubernetes": CNI request failed with status 400: '[mrobson-mnp-test/virt-launcher-rhel9-turquoise-crawdad-1-xs7vr f711bbf1bc912d168356060ccc355d2142d2290035b3ba26952e0ff9ffd2ee8b network default NAD default] [mrobson-mnp-test/virt-launcher-rhel9-turquoise-crawdad-1-xs7vr f711bbf1bc912d168356060ccc355d2142d2290035b3ba26952e0ff9ffd2ee8b network default NAD default] failed to configure pod interface: timed out waiting for OVS port binding (ovn-installed) for 0a:58:0a:82:06:27 [10.130.6.39/23]
Existing workloads continue to work, but any pod started or restarted on that node then fails earlier, while attaching the primary pod network.
This pushes the impact beyond secondary networks to all workloads on that node.
It appears ovn-controller is stuck in some kind of loop that prevents it from processing anything else.
Even if the pod and the policy are deleted, ovn-controller stays at 100% CPU and no networks can be attached.
The ovnkube-node pod on the affected node must be restarted to recover it.
Other nodes continue to operate normally until a workload with the podSelector label is deployed there.
Version-Release number of selected component (if applicable):
4.19.9 / 4.19.10
How reproducible:
Always
Steps to Reproduce:
1. Apply a MultiNetworkPolicy like the one above, with a 0.0.0.0/0 egress ipBlock and a large number of except entries.
2. Deploy a pod with a secondary NIC (the default/vlan530 NAD) and the internet: "true" label matching the podSelector; an example reproducer pod is sketched below.
3. Observe ovn-controller CPU on the node the pod is scheduled to, along with subsequent pod sandbox / network attachment failures on that node.
(Will add some more detailed steps.)
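A minimal reproducer pod sketch based on the description above; the pod name and image are placeholders, and the secondary network annotation assumes the default/vlan530 NAD referenced by the policy:
apiVersion: v1
kind: Pod
metadata:
  name: mnp-repro                                   # placeholder name
  labels:
    internet: "true"                                # matches the MultiNetworkPolicy podSelector
  annotations:
    k8s.v1.cni.cncf.io/networks: default/vlan530    # requests the secondary NIC via Multus
spec:
  containers:
  - name: test
    image: registry.access.redhat.com/ubi9/ubi      # placeholder image; any long-running workload works
    command: ["sleep", "infinity"]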
Actual results:
OVN stops attaching networks on the affected node.
Expected results:
Policies do not have such a detrimental impact on OVN, and OVN recovers when the failed policy is deleted.
Additional info:
Affected Platforms:
customer issue
internally reproduced
- relates to FDP-1713: ovn-controller may perform very cpu-heavy computations translating != prefix matches to openflow (status: Backlog)