Bug
Resolution: Done
High
None
OSC 1.5.0
False
None
False
Bug Fix
Done
Kata Sprint #250, Kata Sprint #252
0
0
Description
The kata shim gathers metrics for the OSC process.
One of these metrics, `kata_shim_netdev`, is known to retrieve too much information compared to what's actually useful.
This is described in upstream issue: https://github.com/kata-containers/kata-containers/issues/5738
This causes Prometheus containers to increase their memory usage: as the network interfaces change (whenever containers are created/deleted), new metric series are added, and over time this can lead the Prometheus containers to fail due to lack of memory.
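For illustration, each network interface reported by the shim ends up as its own kata_shim_netdev time series, roughly along these lines (the label set is trimmed to the interface label used in the queries below; interface names and values are made up, and the real metric likely carries additional labels):

kata_shim_netdev{interface="lo"} 1
kata_shim_netdev{interface="eth0"} 1
kata_shim_netdev{interface="veth1a2b3c4"} 1
kata_shim_netdev{interface="veth5d6e7f8"} 1

Every time the set of interfaces on the node changes, new label combinations appear, each one a new series that Prometheus has to keep track of.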
Steps to reproduce
I'm using the following deployment and script to create and delete containers in a loop.
Deployment:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-kata-deployment
  # namespace: default
spec:
  replicas: 3
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
        run: nginx
    spec:
      # runtimeClassName: kata
      containers:
      - name: nginx-kata
        image: bitnami/nginx
Note that the runtimeClassName is commented out: I don't actually need the containers to be running kata, as long as there is at least one kata container running elsewhere on the same node.
The problem comes from the running kata container gathering data and keeping a history of it. Since the deployment's containers are deleted regularly, they won't be the cause of the issue themselves: their data is lost whenever they're scaled down.
Script:
#!/bin/bash

function wait_for_scaling() {
    sleep 5
    deployed=$(oc get deployment nginx-kata-deployment | tail -n1 | awk '{print $2}')
    while [ ! "$deployed" = "$1/$1" ]; do
        echo "Deployed $deployed - waiting..."
        sleep 1
        deployed=$(oc get deployment nginx-kata-deployment | tail -n1 | awk '{print $2}')
    done
}

oc apply -f nginx_deployment.yaml

while true; do
    oc scale deployment nginx-kata-deployment --replicas=10
    wait_for_scaling 10
    oc scale deployment nginx-kata-deployment --replicas=1
    wait_for_scaling 1
done
This will scale the deployment up to 10 containers, then back down to 1, in a loop.
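While the loop runs, you can optionally confirm that the set of network interfaces on the node keeps changing as pods come and go, since that churn is what produces the new metric series. A minimal sketch (the node name is a placeholder; pick a worker node where the deployment's pods are scheduled):

oc debug node/<node-name> -- chroot /host sh -c 'ip -o link | wc -l'

Running it a few times during the loop should show the interface count going up and down.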
Expected result
The Prometheus pods' memory usage should not grow over time, and they should not be OOM-killed.
How to check the fix
The Prometheus pods I've been looking at are named "prometheus-k8s-[number]". You can check their memory usage, and/or whether they are OOM-killed, but the OOM-kill can take a long time to happen.
Here are the queries I used in the "Observe/Metrics" panel of the OpenShift console:
sum(container_memory_working_set_bytes{pod='prometheus-k8s-0',namespace='openshift-monitoring',container='',}) BY (pod, namespace)
sum(container_memory_working_set_bytes{pod='prometheus-k8s-1',namespace='openshift-monitoring',container='',}) BY (pod, namespace)
Alternatively, you can check that the "kata_shim_netdev" metric is no longer visible after the patch is applied:
count(group(kata_shim_netdev) by (interface))
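If you would rather run these queries from the command line than from the console, here is a minimal sketch that polls the in-cluster Thanos querier for the series count once a minute. It assumes a default OpenShift monitoring setup (thanos-querier route in openshift-monitoring), a token with permission to query cluster monitoring (oc whoami -t), and jq installed; adjust as needed.

#!/bin/bash
# Poll the number of kata_shim_netdev series, grouped by interface, once a minute.
TOKEN=$(oc whoami -t)
HOST=$(oc get route thanos-querier -n openshift-monitoring -o jsonpath='{.spec.host}')
QUERY='count(group(kata_shim_netdev) by (interface))'
while true; do
    COUNT=$(curl -sk -H "Authorization: Bearer ${TOKEN}" \
        "https://${HOST}/api/v1/query" \
        --data-urlencode "query=${QUERY}" \
        | jq -r '.data.result[0].value[1] // "0"')
    echo "$(date -Is) kata_shim_netdev interface count: ${COUNT}"
    sleep 60
done

Before the fix, the reported count should keep growing while the scale loop runs; after the fix, the query should return no data and the script prints 0.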
Actual result
With the above script running, I can see the number of kata_shim_netdev series grow in Prometheus. The longer the test runs, the higher the count.
I can also see a growth in memory usage for the Prometheus pods that is linked to this. The growth is not as strong when no kata container is running.
I did not reproduce the OOM-kill; I would probably need to run the test longer, and it also depends on the cluster's memory limits. But the mechanism is there.
Env
The first occurrence of this problem was found with OCP 4.12, using OSC 1.4.1.
I've been running the test above with OCP 4.14 and OSC 1.5.1.
1. downstream: gathering too many metrics makes Prometheus containers OOM killed | Closed | Unassigned