Type: Bug
Resolution: Unresolved
Priority: Normal
Affects Version: 4.20.z
Component: Quality / Stability / Reliability
Description of problem:
During recent ROSA 4.20 Perf & Scale testing, a memory usage regression was observed in the kube-controller-manager component of OpenShift 4.20 compared to versions 4.19 and 4.18, especially at high scale (using the cluster-density-v2 benchmark, which represents a scenario closer to customer workloads). The Average Resident Set Size (RSS) and Max Aggregated RSS for kube-controller-manager have increased substantially and disproportionately between releases. At the largest scale (249 workers), the Average RSS in 4.20 is roughly 29% higher than in 4.19 ((4.62 GiB - 3.58 GiB) / 3.58 GiB ≈ 29.05%), well beyond the normal infrastructure variability threshold of 10%.

Summary of memory regression (average trend):

24 workers
  Version | Average RSS Usage | Max Aggregated RSS Usage
  4.20    | 966 MiB           | 1.31 GiB
  4.19    | 826 MiB           | 1.28 GiB
  4.18    | 647 MiB           | 1.11 GiB
  Increase (4.20 vs 4.19): Average RSS: 16.9%; Max Aggregated RSS: 2.34%

120 workers
  Version | Average RSS Usage | Max Aggregated RSS Usage
  4.20    | 2.75 GiB          | 4.02 GiB
  4.19    | 1.88 GiB          | 3.58 GiB
  4.18    | 1.40 GiB          | 3.83 GiB
  Increase (4.20 vs 4.19): Average RSS: 46.3%; Max Aggregated RSS: 12.34%

249 workers
  Version | Average RSS Usage | Max Aggregated RSS Usage
  4.20    | 4.62 GiB          | 7.06 GiB
  4.19    | 3.58 GiB          | 6.50 GiB
  4.18    | 2.99 GiB          | 5.74 GiB
  Increase (4.20 vs 4.19): Average RSS: 29.0%; Max Aggregated RSS: 8.6%
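The percentage increases quoted above follow from the tabulated values. A minimal sketch of the arithmetic (the helper name `pct_increase` is illustrative, not part of any tooling; input values are the measured averages reported in the tables):

```python
def pct_increase(new: float, old: float) -> float:
    """Relative increase of `new` over `old`, in percent."""
    return (new - old) / old * 100

# Average RSS (4.20 vs 4.19) per worker scale; units as reported above
# (MiB for the 24-worker scale, GiB for the larger scales).
avg_rss = {
    24:  (966, 826),    # MiB
    120: (2.75, 1.88),  # GiB
    249: (4.62, 3.58),  # GiB
}

for workers, (v420, v419) in avg_rss.items():
    print(f"{workers} workers: {pct_increase(v420, v419):.1f}% Average RSS increase")
```

Small deviations from the quoted percentages (e.g. the 120-worker Max Aggregated RSS figure) are expected, since the tabulated GiB values are themselves rounded.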
Version-Release number of selected component (if applicable):
4.20.z
How reproducible: Reproducible at various scales, especially at higher worker counts.
Steps to Reproduce:
1. Deployment: Deploy ROSA Classic clusters for the target versions (4.20, 4.19, 4.18) with varying worker node counts (24, 120, and ≈249 workers).
2. Workload tool setup: Download and extract the OpenShift performance wrapper for kube-burner: https://github.com/kube-burner/kube-burner-ocp
3. Execute workload: Run the cluster-density-v2 workload on each cluster. For the 249-worker scale, the iteration count is set to 2241 (9 × 249 workers ≈ 2241 iterations). Example command (249-worker scale):
   ./kube-burner-ocp cluster-density-v2 --check-health=false --log-level=info --qps=20 --burst=20 --gc=true --churn-duration=20m --service-latency --gc-metrics=true --profile-type=reporting --iterations=2241 --churn=true
4. Observation: After the workload completes, query the monitoring system (Prometheus) for the Average RSS Usage and Max Aggregated RSS Usage of the kube-controller-manager pods across the run duration.
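For the observation step, queries along the following lines could be used against Prometheus. This is a sketch only: the namespace and container labels are assumptions based on the default OpenShift kube-controller-manager deployment, and the `[1h:30s]` subquery window is a placeholder that must be adjusted to the actual workload run duration.

```python
# Assumed namespace/container labels for the control-plane KCM pods.
NAMESPACE = "openshift-kube-controller-manager"

# RSS summed across the kube-controller-manager replicas.
aggregated_rss = (
    f'sum(container_memory_rss{{namespace="{NAMESPACE}", '
    f'container="kube-controller-manager"}})'
)

# Average and max over the run duration (subquery range is an example value;
# replace 1h with the measured workload window).
avg_rss_query = f"avg_over_time(({aggregated_rss})[1h:30s])"
max_rss_query = f"max_over_time(({aggregated_rss})[1h:30s])"

print(avg_rss_query)
print(max_rss_query)
```

These expressions can be pasted into the Prometheus UI or passed to its HTTP query API.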
Actual results:
The kube-controller-manager memory usage is substantially higher in OCP 4.20 compared to OCP 4.19 and 4.18 at all scale points, with the difference being most severe at the largest scale (249 workers), as documented in the table above.
Expected results:
Memory usage (Average RSS and Max Aggregated RSS) for kube-controller-manager should be consistent and stable across major/minor versions. The memory consumption of OCP 4.20 should be equal to or better than OCP 4.19 and 4.18.
Additional info:
Performance metrics: The data provided covers the average trend over the last 6 months.