  1. Openshift sandboxed containers
  2. KATA-2639

kata shim metrics: gathering too many metrics makes Prometheus containers OOM-killed

    • Type: Bug
    • Resolution: Done
    • Priority: High
    • None
    • OSC 1.5.0
    • kata-containers
    • False
    • None
    • False
    • .Excessive metric reporting causes Prometheus pods to fail

      Previously, the `kata_shim_netdev` metric reported an excessively large volume of metrics, which caused Prometheus pods to fail with `out of memory` errors. In the current release, the issue has been fixed.
    • Bug Fix
    • Done
    • Kata Sprint #250, Kata Sprint #252
    • 0
    • 0

      Description

      The kata shim gathers metrics for the OSC process.
      One of these metrics, `kata_shim_netdev`, is known to retrieve too much information compared to what's actually useful.
      This is described in upstream issue: https://github.com/kata-containers/kata-containers/issues/5738

      Collecting all of this information causes the Prometheus containers to increase their memory usage. As the network interfaces change (whenever containers are created or deleted), new metrics are added, and over time this can cause the Prometheus containers to fail for lack of memory.

      Steps to reproduce

      I'm using the following deployment and script to create and delete containers in a loop.

      Deployment:

      apiVersion: apps/v1
      kind: Deployment
      metadata:
        name: nginx-kata-deployment
      #  namespace: default
      spec:
        replicas: 3
        selector:
          matchLabels:
            app: nginx
        template:
          metadata:
            labels:
              app: nginx
              run: nginx
          spec:
            containers:
            - name: nginx-kata
              image: bitnami/nginx
      #      runtimeClassName: kata
      

      Note that runtimeClassName is commented out: the containers in this deployment don't need to run with kata, as long as at least one kata container is running elsewhere on the same node.
      The problem comes from the running kata container, which gathers data and keeps a history of it. The containers from the deployment are deleted regularly, so they cannot be the cause of the issue themselves: their data is lost whenever they are scaled down.

      Script:

      #!/bin/bash
      
      # Wait until the deployment reports $1 ready replicas out of $1 desired.
      function wait_for_scaling() {
        local deployed
        sleep 5
        # The second column of "oc get deployment" is READY, for example "3/10".
        deployed=$(oc get deployment nginx-kata-deployment | tail -n1 | awk '{print $2}')
        while [ "$deployed" != "$1/$1" ]; do
          echo "Deployed $deployed - waiting..."
          sleep 1
          deployed=$(oc get deployment nginx-kata-deployment | tail -n1 | awk '{print $2}')
        done
      }
      
      oc apply -f nginx_deployment.yaml
      
      # Scale up and down forever so that network interfaces keep changing.
      while true; do
        oc scale deployment nginx-kata-deployment --replicas=10
        wait_for_scaling 10
      
        oc scale deployment nginx-kata-deployment --replicas=1
        wait_for_scaling 1
      done
      

      This scales the deployment up to 10 containers, then back down to 1, in a loop.
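
      The wait_for_scaling function in the script polls forever if the deployment never reaches the requested size. A possible variant with a timeout is sketched below. The names is_scaled, wait_until and deployment_ready are hypothetical helpers, not part of the original script, and the sketch assumes the READY column keeps the "ready/desired" form printed by oc get deployment.

```shell
#!/bin/bash

# is_scaled READY TARGET
# READY is the value of the READY column (for example "3/10").
# Succeeds only when both the ready and the desired counts equal TARGET.
is_scaled() {
  local ready="$1" target="$2"
  [ "$ready" = "$target/$target" ]
}

# wait_until TIMEOUT_SECONDS COMMAND [ARGS...]
# Re-runs COMMAND once per second until it succeeds or the timeout expires.
# Returns 0 on success and 1 on timeout.
wait_until() {
  local timeout="$1"
  shift
  local deadline=$(( SECONDS + timeout ))
  until "$@"; do
    if [ "$SECONDS" -ge "$deadline" ]; then
      return 1
    fi
    sleep 1
  done
}

# deployment_ready TARGET
# Assumption: same deployment name as in the script above.
deployment_ready() {
  local ready
  ready=$(oc get deployment nginx-kata-deployment --no-headers | awk '{print $2}')
  is_scaled "$ready" "$1"
}
```

      With these helpers, a call such as `wait_until 300 deployment_ready 10` could stand in for `wait_for_scaling 10`; the 300-second limit is an arbitrary value chosen for illustration.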

      Expected result

      The memory usage of the Prometheus pods should not grow over time, and the pods should not be OOM-killed.

       

      How to check the fix

      The Prometheus pods that I've been looking at are named "prometheus-k8s-[number]". You can check their memory usage, and/or whether they are OOM-killed, but an OOM-kill can take a long time to occur.

       

      Here are the queries I used in the "Observe/Metrics" panel of the Openshift console:

          sum(container_memory_working_set_bytes{pod='prometheus-k8s-0',namespace='openshift-monitoring',container='',}) BY (pod, namespace)
          
          sum(container_memory_working_set_bytes{pod='prometheus-k8s-1',namespace='openshift-monitoring',container='',}) BY (pod, namespace)
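
      The two queries above differ only in the pod name. As a convenience, the query text can be generated for any pod with a small helper. This is only a sketch: memory_query is a hypothetical name, and the namespace defaults to the one used in the queries above.

```shell
#!/bin/bash

# memory_query POD [NAMESPACE]
# Prints the working-set memory query for the given pod.
# NAMESPACE defaults to openshift-monitoring, as in the queries above.
memory_query() {
  local pod="$1" ns="${2:-openshift-monitoring}"
  printf "sum(container_memory_working_set_bytes{pod='%s',namespace='%s',container=''}) by (pod, namespace)\n" "$pod" "$ns"
}
```

      For example, `memory_query prometheus-k8s-1` prints the second query, ready to paste into the console.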

       

      Alternatively, you can check that the "kata_shim_netdev" metric is no longer visible after the patch is applied, using this query:

       

          count(group(kata_shim_netdev) by (interface))
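
      The same count can also be read from the command line, which makes it easier to log while the reproduction script runs. The sketch below rests on several assumptions: that the cluster exposes the Prometheus HTTP API through the thanos-querier route in openshift-monitoring, that the logged-in user is allowed to query it, and that skipping TLS verification (curl -k) is acceptable on a test cluster. The names extract_value and count_netdev_series are hypothetical.

```shell
#!/bin/bash

# extract_value
# Reads a Prometheus instant-query JSON response on stdin and prints the
# sample value of the result. Prints 0 when the result is empty, which is
# what the query returns once the metric no longer exists.
extract_value() {
  local value
  value=$(tr -d ' \n' | sed -n 's/.*"value":\[[^,]*,"\([^"]*\)"].*/\1/p')
  echo "${value:-0}"
}

# count_netdev_series
# Assumption: the thanos-querier route and a bearer token from "oc whoami -t"
# give access to the /api/v1/query endpoint.
count_netdev_series() {
  local host token
  host=$(oc -n openshift-monitoring get route thanos-querier -o jsonpath='{.spec.host}')
  token=$(oc whoami -t)
  curl -sk -G -H "Authorization: Bearer $token" \
    --data-urlencode 'query=count(group(kata_shim_netdev) by (interface))' \
    "https://$host/api/v1/query" | extract_value
}
```

      Calling count_netdev_series at intervals during the test should show the value rising before the fix, and 0 once the metric is gone.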
       

      Actual result

      With the above script, I can see the number of kata_shim_netdev metric entries grow in Prometheus. The longer the test runs, the higher the value.
      I can see a related growth in memory usage for the Prometheus pods. The growth is not as strong when no kata container is running.

      I did not reproduce the OOM-kill; that would probably require running the test longer, and it also depends on the cluster's memory limits. Still, I feel that the mechanism is there.

      Env

      The first occurrence of this problem was found with OCP 4.12, using OSC 1.4.1.
      I've been running the test above with OCP 4.14 and OSC 1.5.1.

              jrope Julien ROPE
              jrope Julien ROPE
              Miriam Weiss Miriam Weiss