Type: Bug
Resolution: Done-Errata
Priority: Major
Affects Version: 4.15
Severity: Important
Release Note Type: Release Note Not Required
Description of problem:
In the Reliability test (a loaded long run with a stable load), the leader openshift-kube-scheduler pod's memory increased from 100+ MiB to ~13 GB over 6 days. The other two openshift-kube-scheduler pods were fine.
Version-Release number of selected component (if applicable):
4.15.0-0.nightly-2023-10-31-054858
How reproducible:
Hit this for the first time; it was not seen in the 4.14 Reliability test.
Steps to Reproduce:
1. Install an AWS compact cluster with 3 masters (the workers are co-located on the master nodes).
2. Run the reliability-v2 test (https://github.com/openshift/svt/tree/master/reliability-v2). The test runs long-term and simulates multiple customers using the cluster. Config: 1 admin, 5 dev-test, 5 dev-prod, 1 dev-cron.
3. Monitor the performance dashboard (a sketch of sampling the scheduler memory directly from Prometheus follows these steps): http://dittybopper-dittybopper.apps.qili-comp-etcd.qe-lrc.devcluster.openshift.com/d/go4AGIVSk/openshift-performance?orgId=1&from=1698806511000&to=now&var-datasource=Cluster%20Prometheus&var-_master_node=ip-10-0-52-74.us-east-2.compute.internal&var-_master_node=ip-10-0-54-53.us-east-2.compute.internal&var-_master_node=ip-10-0-75-225.us-east-2.compute.internal&var-_worker_node=ip-10-0-52-74.us-east-2.compute.internal&var-_infra_node=&var-namespace=All&var-block_device=All&var-net_device=All&var-interval=2m
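A minimal sketch of that Prometheus sampling, assuming a reachable Prometheus route in PROM_URL and a bearer token in PROM_TOKEN; the metric and label names follow the usual cAdvisor/kubelet conventions and may need adjusting for the cluster:

{code:go}
// memwatch.go - sketch: poll the cluster Prometheus for the working-set memory
// of the openshift-kube-scheduler pods and print one line per pod per sample.
// PROM_URL and PROM_TOKEN are assumptions (route URL + bearer token).
package main

import (
	"crypto/tls"
	"encoding/json"
	"fmt"
	"net/http"
	"net/url"
	"os"
	"time"
)

// PromQL: per-pod working-set memory of the kube-scheduler containers.
const query = `sum by (pod) (container_memory_working_set_bytes{namespace="openshift-kube-scheduler",container="kube-scheduler"})`

type promResp struct {
	Data struct {
		Result []struct {
			Metric map[string]string `json:"metric"`
			Value  [2]interface{}    `json:"value"` // [unix timestamp, "value as string"]
		} `json:"result"`
	} `json:"data"`
}

func main() {
	base, token := os.Getenv("PROM_URL"), os.Getenv("PROM_TOKEN")
	client := &http.Client{Transport: &http.Transport{
		// Self-signed route cert on the test cluster; do not do this in production.
		TLSClientConfig: &tls.Config{InsecureSkipVerify: true},
	}}
	for {
		req, _ := http.NewRequest("GET", base+"/api/v1/query?query="+url.QueryEscape(query), nil)
		req.Header.Set("Authorization", "Bearer "+token)
		resp, err := client.Do(req)
		if err != nil {
			fmt.Fprintln(os.Stderr, "query failed:", err)
		} else {
			var pr promResp
			json.NewDecoder(resp.Body).Decode(&pr)
			resp.Body.Close()
			for _, r := range pr.Data.Result {
				fmt.Printf("%s %s %v bytes\n", time.Now().Format(time.RFC3339), r.Metric["pod"], r.Value[1])
			}
		}
		time.Sleep(5 * time.Minute) // 5-minute samples are plenty for a multi-day trend
	}
}
{code}

Running this alongside the reliability test gives a plain-text per-pod memory trend that can be correlated with the dashboard.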
Actual results:
The leader openshift-kube-scheduler pod's memory increased linearly from 100+ MiB to ~13 GB over 6 days, while the other two openshift-kube-scheduler pods remained normal. The peak CPU usage of the leader pod also increased from <10% to 40%+.
See the attached screenshot memory-cpu-on-leader-pod.png.
Expected results:
Memory usage should stay stable at a reasonable level under a stable workload.
Additional info:
oc adm top pod -n openshift-kube-scheduler --sort-by memory
NAME                                                                      CPU(cores)   MEMORY(bytes)
openshift-kube-scheduler-ip-10-0-54-53.us-east-2.compute.internal         3m           13031Mi
openshift-kube-scheduler-ip-10-0-52-74.us-east-2.compute.internal         4m           146Mi
openshift-kube-scheduler-ip-10-0-75-225.us-east-2.compute.internal        3m           136Mi
openshift-kube-scheduler-guard-ip-10-0-52-74.us-east-2.compute.internal   0m           0Mi
openshift-kube-scheduler-guard-ip-10-0-54-53.us-east-2.compute.internal   0m           0Mi
openshift-kube-scheduler-guard-ip-10-0-75-225.us-east-2.compute.internal  0m           0Mi
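Since only the leader pod grows, comparing heap profiles taken hours apart is the natural next step. A minimal sketch for collecting them on a schedule, assuming the scheduler's secure port (10259) is reachable from where this runs, profiling is left at its default (enabled), and TOKEN holds a bearer token authorized for /debug/pprof:

{code:go}
// heapdump.go - sketch: periodically save the kube-scheduler's heap profile so
// consecutive dumps can be compared later. SCHED_URL (e.g. https://<leader-node-ip>:10259)
// and TOKEN (bearer token authorized for /debug/pprof) are assumptions.
package main

import (
	"crypto/tls"
	"fmt"
	"io"
	"net/http"
	"os"
	"time"
)

func main() {
	base, token := os.Getenv("SCHED_URL"), os.Getenv("TOKEN")
	client := &http.Client{Transport: &http.Transport{
		// Scheduler serving cert is not trusted by default; test cluster only.
		TLSClientConfig: &tls.Config{InsecureSkipVerify: true},
	}}
	for {
		req, _ := http.NewRequest("GET", base+"/debug/pprof/heap", nil)
		req.Header.Set("Authorization", "Bearer "+token)
		resp, err := client.Do(req)
		if err != nil {
			fmt.Fprintln(os.Stderr, "heap fetch failed:", err)
		} else {
			if resp.StatusCode == http.StatusOK {
				name := fmt.Sprintf("heap-%s.pb.gz", time.Now().Format("20060102-150405"))
				out, cerr := os.Create(name)
				if cerr == nil {
					io.Copy(out, resp.Body)
					out.Close()
					fmt.Println("wrote", name)
				}
			} else {
				fmt.Fprintln(os.Stderr, "heap fetch: unexpected status", resp.Status)
			}
			resp.Body.Close()
		}
		time.Sleep(1 * time.Hour) // hourly dumps are enough for a leak that grows over days
	}
}
{code}

Two dumps taken a few hours apart can then be diffed with `go tool pprof -base heap-old.pb.gz heap-new.pb.gz` to see which allocation sites are growing.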
Study materials (see the heap-profile comparison sketch after this list)
- https://www.neteye-blog.com/2019/06/go-pprof-how-to-understand-where-there-is-memory-retention/
- https://github.com/google/pprof/blob/main/doc/README.md#interpreting-the-callgraph
- https://go101.org/article/memory-leaking.html
- https://jvns.ca/blog/2017/09/24/profiling-go-with-pprof/
- https://blog.detectify.com/industry-insights/how-we-tracked-down-a-memory-leak-in-one-of-our-go-microservices/
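As a companion to the pprof reading above, a small sketch (requires go get github.com/google/pprof) that compares the total inuse_space of two saved heap dumps; the file names are placeholders for profiles collected as in the earlier sketch:

{code:go}
// heapdiff.go - sketch: compare the total inuse_space of two saved heap profiles
// (file names are placeholders for dumps collected from the leaking pod).
// Requires: go get github.com/google/pprof
package main

import (
	"fmt"
	"log"
	"os"

	"github.com/google/pprof/profile"
)

// totalInuse sums the inuse_space values of all samples in a heap profile.
func totalInuse(path string) int64 {
	f, err := os.Open(path)
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	p, err := profile.Parse(f)
	if err != nil {
		log.Fatal(err)
	}

	// Heap profiles carry several value types; find the inuse_space column.
	idx := -1
	for i, st := range p.SampleType {
		if st.Type == "inuse_space" {
			idx = i
		}
	}
	if idx < 0 {
		log.Fatalf("%s: no inuse_space sample type", path)
	}

	var total int64
	for _, s := range p.Sample {
		total += s.Value[idx]
	}
	return total
}

func main() {
	before := totalInuse("heap-old.pb.gz")
	after := totalInuse("heap-new.pb.gz")
	fmt.Printf("inuse_space: %d -> %d bytes (delta %+d)\n", before, after, after-before)
}
{code}

In practice go tool pprof -base with the same two files gives the per-function breakdown; the programmatic version is only useful for automating the trend check.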
Links to: RHSA-2023:7198 OpenShift Container Platform 4.15 security update