OpenShift Bugs / OCPBUGS-23450

[4.14 placeholder] High memory usage on one of the ovnkube-master pods on some clusters


    • Type: Bug
    • Resolution: Done-Errata
    • Priority: Normal
    • Affects Version/s: 4.12.z

      Description of problem:

      My customer is seeing high memory usage on one of the ovnkube-master pods, even though there are not many active NetworkPolicies.
      
      $ oc adm top pod
      NAME                   CPU(cores)   MEMORY(bytes)
      ovnkube-master-6wwl8   136m         355Mi
      ovnkube-master-785ft   408m         14613Mi
      ovnkube-master-xlrj7   33m          376Mi
      ovnkube-node-2v9g4     10m          188Mi
      ovnkube-node-4tn6j     12m          196Mi
      ovnkube-node-5ffn6     16m          200Mi
      ovnkube-node-67vrd     9m           184Mi
      ovnkube-node-7sw29     16m          184Mi
      ovnkube-node-9fsnc     14m          193Mi
      ovnkube-node-dngjh     13m          239Mi
      ovnkube-node-dphq9     15m          231Mi
      ovnkube-node-jpc27     5m           217Mi
      ovnkube-node-nfsvf     14m          196Mi
      ovnkube-node-sczr9     14m          199Mi
      ovnkube-node-shbqq     14m          187Mi
      ovnkube-node-tqftp     14m          218Mi
      ovnkube-node-w5747     12m          195Mi
      ovnkube-node-wdgb7     8m           154Mi
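
      To narrow the check to just the OVN-Kubernetes control-plane pods, something like the following can be used. The `openshift-ovn-kubernetes` namespace is where these pods run; the `app=ovnkube-master` label selector is an assumption and may need adjusting for the cluster at hand.
      ```
      # Show only the ovnkube-master pods, sorted by memory usage (label selector assumed)
      oc adm top pod -n openshift-ovn-kubernetes -l app=ovnkube-master --sort-by=memory

      # Break usage down per container for the busy pod (replace the pod name as needed)
      oc adm top pod -n openshift-ovn-kubernetes ovnkube-master-785ft --containers
      ```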

      Version-Release number of selected component (if applicable):

      4.12.26

      How reproducible:

      Only 2 clusters are affected, so it is very hard to say.

      Steps to Reproduce:

      1. Get a pprof trace from the ovnkube-master leader (for 4.14, pick any ovnkube-controller pod).
      You can use the network-tools ovn-pprof-forwarding command (https://github.com/openshift/network-tools) to forward the pprof port.
      Collect data by running
      curl "localhost:<choose port>/debug/pprof/trace?seconds=40" > trace
      2. Create and delete the following NetworkPolicy 3 times:
      ```
      apiVersion: networking.k8s.io/v1
      kind: NetworkPolicy
      metadata:
        name: test-policy
        namespace: default
      spec:
        podSelector: {}
        policyTypes:
        - Ingress
        ingress:
        - from:
          - namespaceSelector:
              matchLabels:
                kubernetes.io/metadata.name: default
            podSelector:
              matchLabels:
                app: test
      ```
      3. Collect one more trace, `trace2`.
      4. Now compare the traces. To do so, run
      go tool trace <trace file>
      It will open a browser window; go to "Goroutine analysis" and note the N value for either
      `github.com/ovn-org/ovn-kubernetes/go-controller/pkg/retry.(*RetryFramework).periodicallyRetryResources` or `github.com/ovn-org/ovn-kubernetes/go-controller/pkg/retry.(*RetryFramework).WatchResourceFiltered.func4`
      N is the number of goroutines created for that function. When the bug is present, `trace2` shows a higher N than `trace`; when the bug is fixed, both traces should show the same N. (A consolidated sketch of steps 1-4 is included after these steps.)
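
      For convenience, the steps above can be scripted roughly as follows. This is only a sketch: the forwarded pprof port (6060 here) and the manifest path `test-policy.yaml` (holding the NetworkPolicy from step 2) are assumptions and need to be adapted to the environment.
      ```
      #!/usr/bin/env bash
      set -euo pipefail

      # Assumptions: the pprof endpoint of the ovnkube-master leader (or of an
      # ovnkube-controller pod on 4.14) is already forwarded to localhost:6060,
      # e.g. via the network-tools ovn-pprof-forwarding command, and
      # test-policy.yaml contains the NetworkPolicy from step 2.
      PPROF_PORT=6060
      POLICY=test-policy.yaml

      # Step 1: baseline 40-second trace
      curl "localhost:${PPROF_PORT}/debug/pprof/trace?seconds=40" > trace

      # Step 2: create and delete the test NetworkPolicy 3 times
      for i in 1 2 3; do
        oc apply -f "${POLICY}"
        oc delete -f "${POLICY}"
      done

      # Step 3: second 40-second trace
      curl "localhost:${PPROF_PORT}/debug/pprof/trace?seconds=40" > trace2

      # Step 4: open each trace (one at a time; each opens a browser window)
      # and compare the goroutine counts (N) under "Goroutine analysis" for the
      # retry framework functions listed above.
      go tool trace trace
      go tool trace trace2
      ```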

      Actual results:

       

      Expected results:

       

      Additional info:

       

            Assignee: Nadia Pinaeva (npinaeva@redhat.com)
            Reporter: Andy Bartlett (rhn-support-andbartl)
            QA Contact: Arti Sood