OpenShift Bugs / OCPBUGS-22948

[Reliability][regression] openshift-kube-scheduler leader pod memory increased from 100+ MiB to 13+ GiB in 6 days


      Description of problem:

      In the Reliability (loaded long-run, with a stable load) test, the leader openshift-kube-scheduler pod's memory increased from 100+ MiB to ~13 GiB over 6 days. The other two openshift-kube-scheduler pods stayed at a normal level.

      Version-Release number of selected component (if applicable):

      4.15.0-0.nightly-2023-10-31-054858

      How reproducible:

      Encountered for the first time; this was not seen in the 4.14 Reliability tests.

      Steps to Reproduce:

      1. Install an AWS compact cluster with 3 masters; the workers are colocated on the master nodes.
      2. Run the reliability-v2 test https://github.com/openshift/svt/tree/master/reliability-v2. The test runs for an extended period and simulates multiple customers using the cluster.
      config: 1 admin, 5 dev-test, 5 dev-prod, 1 dev-cron.
      3. Monitor the performance dashboard (or query the scheduler pods' memory directly, as in the sketch after the dashboard link below).
      
      Performance dashboard: http://dittybopper-dittybopper.apps.qili-comp-etcd.qe-lrc.devcluster.openshift.com/d/go4AGIVSk/openshift-performance?orgId=1&from=1698806511000&to=now&var-datasource=Cluster%20Prometheus&var-_master_node=ip-10-0-52-74.us-east-2.compute.internal&var-_master_node=ip-10-0-54-53.us-east-2.compute.internal&var-_master_node=ip-10-0-75-225.us-east-2.compute.internal&var-_worker_node=ip-10-0-52-74.us-east-2.compute.internal&var-_infra_node=&var-namespace=All&var-block_device=All&var-net_device=All&var-interval=2m
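
      The same trend can also be checked directly against the in-cluster Prometheus, independent of the dashboard. A minimal query sketch (the namespace/container label values are assumptions based on the pod names in this report and may need adjusting):

      # Per-pod working-set memory of the kube-scheduler containers
      sum by (pod) (
        container_memory_working_set_bytes{namespace="openshift-kube-scheduler", container="kube-scheduler"}
      )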

      Actual results:

      The leader openshift-kube-scheduler pod's memory increased roughly linearly from 100+ MiB to ~13 GiB over 6 days, while the other two openshift-kube-scheduler pods stayed at a normal level.
      The peak CPU usage of the leader openshift-kube-scheduler pod also increased from under 10% to over 40%.

      See the attached screenshot memory-cpu-on-leader-pod.png.

      Expected results:

      Memory usage should remain stable at a reasonable level under a stable workload.

      Additional info:

      oc adm top pod -n openshift-kube-scheduler --sort-by memory 
      NAME                                                                       CPU(cores)   MEMORY(bytes)   
      openshift-kube-scheduler-ip-10-0-54-53.us-east-2.compute.internal          3m           13031Mi         
      openshift-kube-scheduler-ip-10-0-52-74.us-east-2.compute.internal          4m           146Mi           
      openshift-kube-scheduler-ip-10-0-75-225.us-east-2.compute.internal         3m           136Mi           
      openshift-kube-scheduler-guard-ip-10-0-52-74.us-east-2.compute.internal    0m           0Mi             
      openshift-kube-scheduler-guard-ip-10-0-54-53.us-east-2.compute.internal    0m           0Mi             
      openshift-kube-scheduler-guard-ip-10-0-75-225.us-east-2.compute.internal   0m           0Mi
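
      To confirm which instance currently holds the leader election lease, and to gather a heap profile for further debugging, something along these lines should work. This is a sketch only: the lease name/namespace and the pprof port are assumptions, and profiling must be enabled on the scheduler.

      # Which scheduler instance currently holds the leader election lease (lease name/namespace assumed)
      oc get lease kube-scheduler -n openshift-kube-scheduler -o jsonpath='{.spec.holderIdentity}'

      # Capture a heap profile from the leader pod (assumes pprof is served on the secure port 10259)
      oc -n openshift-kube-scheduler port-forward pod/openshift-kube-scheduler-ip-10-0-54-53.us-east-2.compute.internal 10259:10259 &
      curl -sk -H "Authorization: Bearer $(oc whoami -t)" https://localhost:10259/debug/pprof/heap -o heap.pprof
      go tool pprof -top heap.pprof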
      
