OpenShift Bugs / OCPBUGS-3778

[4.10] ovn-k network policy races


Details

    • Type: Bug
    • Resolution: Done
    • Priority: Major
    • Affects Version/s: 4.12, 4.11, 4.10
    • Sprint: SDN Sprint 228, SDN Sprint 229
    • Story Points: 2

    Description

      Description of problem:

      Network policy code has several problems, most of them races, so they can be difficult to reproduce and verify. Here is the list:
      
      1. All kinds of failures when adding/deleting a port to/from the default deny port group, possible symptoms:
        - a port should’ve been added to the default deny port group, but wasn’t: connections that should’ve been dropped are allowed
        - a port should’ve been deleted from the default deny port group, but wasn’t: connections that should be allowed are dropped
        - db ops failures when an attempt to add/delete a port to/from the default deny port group fails, e.g. because the operation was already done
      2. The default deny port group is overwritten when two network policies are created in a namespace at the same time. This can leave ports out of the default deny port group, so connections that should be denied are allowed.
      3. An error when getting a local pod from the cache is not handled, possible symptoms:
        - "Failed to get LSP after multiple retries for pod %s/%s for networkPolicy" log message
        - the pod is not added to netpol port groups, so the network policy is not applied to it
      4. A deleted namespace is re-created via ensureNamespaceLocked, symptom:
        - the namespace was deleted, but its address set is still present in the db
      5. A policy ACL loglevel update isn’t applied, possible symptom:
        - the netpol ACL log level isn’t set/updated to the namespace loglevel
      6. Netpol cleanup failures, symptoms:
        - the network policy fails to be deleted, something is still left in the db, with error messages like
        - "failed to destroy network policy"
        - "Rollback of default port groups and acls for policy: %s/%s failed, Unable to ensure namespace for network policy"
      7. Concurrent writes to a sets.String, which panic, so this one is hard to miss.
      8. The network policy handler retries after the network policy was deleted; you should see failures saying that some network-policy-related object is nil or doesn’t exist, e.g.
        - "peer AddressSet is nil, cannot add <object>"
      9. Host-network and completed pods selected by a network policy can produce error logs, with no real harm:
        - "Failed to get LSP for pod <namespace>/<name> for networkPolicy %s refetching err"
      10. Namespace pod handlers are never stopped, which can affect memory usage and look like a memory leak.
      11. Adding a local pod fails because the netpol port group is not committed to the db yet; the error looks like:
        - "Failed to create *factory.localPodSelector <name>, error: object not found"
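      The overwrite in item 2 is a classic lost-update race, and item 7 is its unsynchronized cousin. A minimal sketch of both, in Python for brevity (ovn-kubernetes itself is Go, and every name below is invented for illustration):

```python
import threading

# Lost update (item 2): two policy handlers read the same snapshot of the
# default deny port group, each adds its own port, and the later write-back
# clobbers the earlier one.
port_group = {"ports": set()}  # stands in for the OVN default deny port group

def racy_add(snapshot_ports, new_port):
    """Handler that writes back a full snapshot (read-modify-write)."""
    return {"ports": snapshot_ports | {new_port}}

# Both handlers read the same (empty) snapshot before either writes:
snap_a = set(port_group["ports"])
snap_b = set(port_group["ports"])
port_group = racy_add(snap_a, "pod-a-port")
port_group = racy_add(snap_b, "pod-b-port")   # overwrites pod-a-port!
assert port_group["ports"] == {"pod-b-port"}  # pod-a-port was lost

# The fix for both races: mutate the shared structure under a lock
# instead of writing back snapshots (or writing without synchronization).
lock = threading.Lock()
fixed_group = set()

def safe_add(port):
    with lock:
        fixed_group.add(port)

safe_add("pod-a-port")
safe_add("pod-b-port")
assert fixed_group == {"pod-a-port", "pod-b-port"}
```

In the real Go code the unguarded structure is a sets.String, where concurrent writes crash the process rather than silently losing data, which is why item 7 is described as impossible to miss.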
      
      

      Version-Release number of selected component (if applicable):

       

      How reproducible:

       

      Steps to Reproduce:

      Example 1
      1. Create a network policy with an ingress or egress selector that applies to a namespace labeled project: myproject
      apiVersion: networking.k8s.io/v1
      kind: NetworkPolicy
      metadata:
        name: test-network-policy
        namespace: test
      spec:
        podSelector: {}
        policyTypes:
          - Ingress
        ingress:
          - from:
              - namespaceSelector:
                  matchLabels:
                    project: myproject
      
      2. Delete the network policy (e.g. with oc delete) and create a pod in a namespace labeled project: myproject at the same time
      3. Check the ovnkube-master logs for "peer AddressSet is nil, cannot add peer pod(s)"; this retries with the same error 15 times
      4. This may not work on the first try, since a specific ordering of the network policy delete and pod add handling must be hit
      5. With the new version, no error messages should be present
      
      Example 2
      1. create network policy that applies to a namespace test
      apiVersion: networking.k8s.io/v1
      kind: NetworkPolicy
      metadata:
        name: test-network-policy
        namespace: test
      spec:
        podSelector: {}
        policyTypes:
          - Ingress
        ingress:
      2. Create a host-network pod in namespace test
      3. Check for 15 logs saying "Failed to get LSP for pod %s/%s for networkPolicy %s refetching err: "
      4. Check the final log "Failed to get LSP after multiple retries for pod %s/%s for networkPolicy"
      5. With the new version, no error messages should be present
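      The retry behavior in steps 3 and 4 can be modeled in a few lines. This is an illustrative Python sketch of the log pattern only, not the real handler code; the names and the retry constant are assumptions based on the messages above:

```python
MAX_RETRIES = 15  # assumed from "this should retry ... 15 times" above
logs = []

def get_lsp(pod):
    # Host-network pods never get a logical switch port, so every
    # refetch attempt fails.
    return None

def apply_netpol_to_pod(pod, policy):
    """Retry fetching the LSP, logging each failure, then give up."""
    for _ in range(MAX_RETRIES):
        if get_lsp(pod) is not None:
            return True
        logs.append(f"Failed to get LSP for pod {pod} for networkPolicy {policy} refetching err: ")
    logs.append(f"Failed to get LSP after multiple retries for pod {pod} for networkPolicy")
    return False

apply_netpol_to_pod("test/host-pod", "test-network-policy")
assert len(logs) == 16  # 15 retry logs plus the final failure log
```

The fix described above skips pods that will never have an LSP (host-network and completed pods), so neither message is emitted.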
      
      All the other cases are difficult to reproduce; running the standard network policy tests and making sure everything works should be good enough verification.
      

      Actual results:

       

      Expected results:

       

      Additional info:

      Only some parts were backported to 4.10 due to significant differences between the releases.
      The problems that are fixed, plus performance improvements:
      1. Don't retry unscheduled pods; wait for the update event instead.
      2. Clean up the pod handler for the namespaceAndPod handler on the namespace delete event.
      3. Only update localPods after a successful db transaction, and return fast from localPod handlers based on `np.localPods`.
      4. Don't retry fetching the lsp from the lspCache, which "stops" the handler for 1 second.
      5. Use the stored portUUID to delete local pods instead of getting that info from the lspCache.
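      Fix 3 is a commit-then-cache pattern: the in-memory localPods map is updated only after the db transaction succeeds, so a failed commit leaves no stale entry, and a repeated event for an already-handled pod returns fast without touching the db. A minimal Python sketch under those assumptions (all names invented, not ovn-kubernetes code):

```python
local_pods = {}  # port name -> port UUID; stands in for np.localPods

def db_commit(ops):
    """Stand-in for an OVSDB transaction; raises to simulate failure."""
    if ops.get("fail"):
        raise RuntimeError("db transaction failed")
    return "uuid-" + ops["port"]

def add_local_pod(port, ops):
    if port in local_pods:      # return fast: pod already handled
        return local_pods[port]
    uuid = db_commit(ops)       # commit first...
    local_pods[port] = uuid     # ...then record, so a failed commit
    return uuid                 # leaves no stale cache entry behind

add_local_pod("pod-a", {"port": "pod-a"})
try:
    add_local_pod("pod-b", {"port": "pod-b", "fail": True})
except RuntimeError:
    pass
# pod-b was never cached, so a later retry will redo the transaction:
assert local_pods == {"pod-a": "uuid-pod-a"}
# pod-a short-circuits without hitting the db at all:
assert add_local_pod("pod-a", {"port": "pod-a", "fail": True}) == "uuid-pod-a"
```

Updating the cache before the commit (the old order) would leave an entry for pod-b that no db row backs, which is exactly the stale-state class of bug listed in the description.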

      Attachments

        Issue Links

          Activity

            People

              Assignee: Nadia Pinaeva (npinaeva@redhat.com)
              Reporter: Nadia Pinaeva (npinaeva@redhat.com)
              QA Contact: Anurag Saxena
              Votes: 0
              Watchers: 5

              Dates

                Created:
                Updated:
                Resolved: