OCPBUGS-3050

Network policies are not implemented or updated by OVN-Kubernetes

Details

    Description

      This bug is a backport clone of [Bugzilla Bug 2115926](https://bugzilla.redhat.com/show_bug.cgi?id=2115926). The following is the description of the original bug:

      +++ This bug was initially created as a clone of Bug #2109442 +++

      An important commit was missed during the downstream merge
      Commit: https://github.com/openshift/ovn-kubernetes/pull/956/commits/96b2a2555a654d72a8546366032063a98a016f29
      Initial downstream merge to master branch: https://github.com/openshift/ovn-kubernetes/pull/956
      Downstream merge into the Release 4.10 branch: https://github.com/openshift/ovn-kubernetes/pull/971
      Pull request to include the missing commit in the Release 4.10 branch: https://github.com/openshift/ovn-kubernetes/pull/1195

      +++ This bug was initially created as a clone of Bug #2048538 +++

      Description of problem:

      In one of our customer's clusters we see that new network policies are not created or updated by OVN-Kubernetes.
      For one application this means it cannot reach the DNS service because the network policy that allows that is not being implemented.

      In our own test on this cluster, pods in a namespace CAN reach each other despite this network policy:
      ~~~
      apiVersion: networking.k8s.io/v1
      kind: NetworkPolicy
      metadata:
        creationTimestamp: "2022-01-27T14:41:05Z"
        generation: 2
        name: default-deny
        namespace: customer-debug
        resourceVersion: "311846645"
        uid: 87646222-c86d-4000-8997-7f0557ac34cf
      spec:
        podSelector: {}
        policyTypes:
        - Ingress
        - Egress
      ~~~

      In one of our dev clusters this network policy is enforced.

      Version-Release number of selected component (if applicable):

      OCP 4.8.25

      How reproducible:

      This happens randomly and is very difficult to predict.

      Steps to Reproduce:
      1.
      2.
      3.

      Actual results:

      Expected results:

      Additional info:

      The case has the must-gathers from the cluster attached.

      — Additional comment from Tim Rozet on 2022-02-03 16:04:01 UTC —

      After finishing my analysis of the logs, I see several bugs/errors happening here. Together they either make network policies fail to be enforced properly or may cause them to stay enforced when they shouldn't be:

      1. policy.go:818] Failed to set ports in PortGroup for network policy ie-st-montun-filebeat/default-deny: Reconnecting...Transaction Failed due to an error: syntax error details: expected ["set", <array>] in {update Port_Group map[name:a11253394058733577533 ports:0xc001f1a1b0] [] [] [] 0 [[name == a11253394058733577533]] }

      This is due to a bug in the go-ovn library that was fixed in 4.9. I'm going to backport the same fix to 4.8z.

      2. policy.go:1166] no pod IPs found on pod redhat-marketplace-brhvf: could not find OVN pod annotation in map[openshift.io/scc:anyuid operatorframework.io/managed-by:marketplace-operator]

      This error is spammed throughout the log, but is benign. On pod add we could fail to get the OVN annotation due to racing with pod handler. However, once the pod handler annotates the pod an update event will happen and this code will be executed again. I'm going to ignore printing this error on pod add.

      3. policy.go:733] logical port cd-argocd-cdteam_testssl2 not found in cache

      This is the same as https://bugzilla.redhat.com/show_bug.cgi?id=2037884. The bug references stateful sets, but this was really true about any pod being added. When the network policy is created or pods are added that belong to the network policy's namespace, we attempt to get the pod's information from an internal cache. This races with the pod being added to the cache by the pod handler. The fix makes the network policy handler wait until the pod is added to the cache. Otherwise the network policy is created and potentially skips being applied to some pods in the namespace. This is already fixed in 4.8.29

      4. policy.go:1166] failed to add IPs ... set contains duplicate value

      The duplicate value being added here is a VIP for a load balancer. In 4.9 and later there is a lower probability of this happening (because we no longer store an internal cache, so there shouldn't be duplicates). However, I'm still going to add checks to ensure we filter out any duplicate values before adding them to the cache or sending the RPC to OVN. I'm going to get a proper fix into master and then backport it to 4.8z.

      5. E0125 18:40:32.759129 1 policy.go:955] Failed to create port_group for network policy allow-prometheus in namespace ie-st-montun-filebeat

      This is the most egregious bug. First of all, the log is not printing the actual error. Second, this failure causes the network policy to fail creation, and it is then never retried (unless the policy is updated). We need a retry mechanism to attempt to recreate the policy, just like we do with pods. This will require a heavier fix in master and then a backport down to 4.8z.

      — Additional comment from Tim Rozet on 2022-02-03 21:59:27 UTC —

      Fix for number 2: https://github.com/ovn-org/ovn-kubernetes/pull/2792

      — Additional comment from Tim Rozet on 2022-02-03 22:41:21 UTC —

      Fix for number 4: https://github.com/ovn-org/ovn-kubernetes/pull/2794
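For readers who do not want to open the PR, the duplicate filtering described for issue 4 can be sketched as follows. This is not the code from PR 2794; `dedupeIPs` is a hypothetical helper that only shows the idea of dropping repeated values before they reach the cache or the OVN transaction.

```go
package main

import "fmt"

// dedupeIPs returns ips with duplicates removed, preserving first-seen order.
// It compares the strings as given and does not normalize address formats.
func dedupeIPs(ips []string) []string {
	seen := make(map[string]struct{}, len(ips))
	out := make([]string, 0, len(ips))
	for _, ip := range ips {
		if _, dup := seen[ip]; dup {
			continue // already queued for the set, skip it
		}
		seen[ip] = struct{}{}
		out = append(out, ip)
	}
	return out
}

func main() {
	// The repeated entry plays the role of the load balancer VIP (made-up addresses).
	fmt.Println(dedupeIPs([]string{"172.30.0.10", "10.128.2.5", "172.30.0.10"}))
}
```

Because the comparison is on raw strings, two spellings of the same IPv6 address would still count as distinct; parsing the addresses first would be needed to catch that case.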

      — Additional comment from Tim Rozet on 2022-02-04 23:23:59 UTC —

      Partial fix for number 5: https://github.com/ovn-org/ovn-kubernetes/pull/2797

      Will need a follow up part 2 after this is reviewed + accepted.

      — Additional comment from Tim Rozet on 2022-02-09 01:45:15 UTC —

      Posted https://github.com/ovn-org/ovn-kubernetes/pull/2809 which will supersede PR 2797. That should be the complete fix for issue number 5.

      — Additional comment from Andy Bartlett on 2022-02-09 10:33:15 UTC —

      @trozet@redhat.com Do you have a link for the BZ / PR for:

      1. policy.go:818] Failed to set ports in PortGroup for network policy ie-st-montun-filebeat/default-deny: Reconnecting...Transaction Failed due to an error: syntax error details: expected ["set", <array>] in {update Port_Group map[name:a11253394058733577533 ports:0xc001f1a1b0] [] [] [] 0 [[name == a11253394058733577533]] }

      This is due to a bug in the go-ovn library that was fixed in 4.9. I'm going to backport the same fix to 4.8z.

      Many thanks,

      Andy

      — Additional comment from Tim Rozet on 2022-02-14 16:55:30 UTC —

      Yeah the fix for number 1 is a one liner in the ebay/libovsdb library:

      https://github.com/openshift/ovn-kubernetes/commit/35677418d2bbfddb6229e1d776bba2064dde646b#diff-88e093886eb91e9ca5f9234d74a5f756c0251d685c141c902a7833d95bec5345R27

      @@ -24,7 +24,7 @@ func NewOvsSet(goSlice interface{}) (*OvsSet, error) {
                return nil, errors.New("OvsSet supports only Go Slice types")
            }
      -     var ovsSet []interface{}
      +     ovsSet := make([]interface{}, 0, v.Len())
            for i := 0; i < v.Len(); i++ {
                ovsSet = append(ovsSet, v.Index(i).Interface())
            }
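A likely reason this one-line change matters: in Go, a slice declared with `var` stays nil if nothing is appended to it, and `encoding/json` marshals a nil slice as `null`, while a slice created with `make` marshals as `[]`. With an empty port list, the old code would therefore send `["set", null]`, which matches the `expected ["set", <array>]` syntax error above. The sketch below demonstrates only the standard-library behavior; `encodeSet` is a hypothetical, simplified stand-in for the library's set encoding.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// encodeSet marshals elems in OVSDB set notation: ["set", [...]].
// Hypothetical helper for illustration; not the library's actual encoder.
func encodeSet(elems []interface{}) string {
	out, err := json.Marshal([]interface{}{"set", elems})
	if err != nil {
		return "error: " + err.Error()
	}
	return string(out)
}

func main() {
	// Old code path: the slice stays nil when there is nothing to append.
	var nilSet []interface{}
	fmt.Println(encodeSet(nilSet)) // ["set",null]

	// Patched code path: always a non-nil slice, even when it is empty.
	emptySet := make([]interface{}, 0)
	fmt.Println(encodeSet(emptySet)) // ["set",[]]
}
```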

      — Additional comment from Tim Rozet on 2022-02-15 14:51:40 UTC —

      Moving back to assigned, a small issue was found with the previous patch: https://github.com/ovn-org/ovn-kubernetes/pull/2823

      — Additional comment from Tim Rozet on 2022-02-16 17:21:06 UTC —

      Found another issue where a delete/recreate of a policy with the same name may not clean up the stale version. Pushed a fix here: https://github.com/ovn-org/ovn-kubernetes/pull/2826

      — Additional comment from anusaxen@redhat.com on 2022-07-21 20:30:21 UTC —

      Tested with cluster bot build referencing PR #1195

      All networkpolicy regression and checks passed in QE env

      — Additional comment from trozet@redhat.com on 2022-07-25 13:52:33 UTC —

      The description states the found in version is OCP 4.8. But the version on this bug is 4.10. It looks like the bug exists in 4.8 as well. Can we fix the version and make sure we backport to 4.8 and 4.9?

      — Additional comment from aos-team-art-private@bot.bugzilla.redhat.com on 2022-08-05 03:32:07 UTC —

      Elliott changed bug status from MODIFIED to ON_QA.
      This bug is expected to ship in the next 4.10 release.

          People

            ffernand@redhat.com Flavio Fernandes (Inactive)
            openshift-crt-jira-prow OpenShift Prow Bot
            Anurag Saxena Anurag Saxena