Description of problem:
An etcd memory usage regression was identified during ROSA 4.21 Performance and Scale testing when comparing OCP 4.21 with 4.20 nightly builds. The same regression is observed when comparing 4.21 nightlies from around Nov 20, when etcd was updated from 3.5 to 3.6.
The regression scales with cluster size: larger clusters show a larger relative increase in etcd RSS, most visibly in max RSS.
24-node cluster dashboard screenshot
120-node cluster dashboard screenshot
250-node cluster dashboard screenshot
ROSA 250-node cluster dashboard screenshot
| Cluster Size | etcd 3.5 Avg RSS | etcd 3.6 Avg RSS | Avg Change | etcd 3.5 Max RSS | etcd 3.6 Max RSS | Max Change |
|---|---|---|---|---|---|---|
| 24 nodes | 359 MiB | 360 MiB | +0.3% | 468 MiB | 541 MiB | +15.6% |
| 120 nodes | 550 MiB | 601 MiB | +9.3% | 893 MiB | 1.24 GiB | +42.2% |
| 250 nodes | 682 MiB | 974 MiB | +42.8% | 1.13 GiB | 2.96 GiB | +162% |
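The percentage columns can be re-derived from the RSS values. The following is a minimal sketch, assuming 1 GiB = 1024 MiB for the entries reported in GiB; the helper name `pct_change` and the `rows` layout are illustrative and not part of any test tooling.

```python
# Re-derive the "Avg Change" and "Max Change" columns from the table above.
# Assumption: GiB values convert to MiB with 1 GiB = 1024 MiB.
MIB_PER_GIB = 1024


def pct_change(old_mib, new_mib):
    """Relative change from old to new, in percent."""
    return (new_mib - old_mib) / old_mib * 100


# (etcd 3.5, etcd 3.6) RSS in MiB, keyed by node count.
rows = {
    24: {"avg": (359, 360), "max": (468, 541)},
    120: {"avg": (550, 601), "max": (893, 1.24 * MIB_PER_GIB)},
    250: {"avg": (682, 974), "max": (1.13 * MIB_PER_GIB, 2.96 * MIB_PER_GIB)},
}

for nodes, rss in rows.items():
    avg = pct_change(*rss["avg"])
    peak = pct_change(*rss["max"])
    print(f"{nodes} nodes: avg {avg:+.1f}%, max {peak:+.1f}%")
```

Rounded to one decimal place (zero for the 250-node max), the results match the percentages in the table.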
Version-Release number of selected component (if applicable):
4.21
How reproducible:
Always - reproducible across multiple cluster sizes (24, 120, 250 nodes) and deployment types (self-managed and ROSA)
Steps to Reproduce:
1. Deploy an OCP cluster with etcd 3.5 (e.g., a 4.20 nightly)
2. Run the cluster-density-v2 workload using kube-burner
3. Record etcd RSS memory usage (average and max)
4. Deploy an OCP cluster with etcd 3.6 (e.g., a 4.21 nightly)
5. Run the same cluster-density-v2 workload
6. Compare etcd RSS memory usage and observe a significant increase, especially in max RSS
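For steps 3 and 6, one possible way to reduce a Prometheus range-query result to average and max RSS is sketched below. This is an assumption-laden illustration, not the method used to produce the numbers in this report: the PromQL selector in `ASSUMED_QUERY` (metric and label names) is a guess at how etcd RSS might be queried, `summarize_rss` is a hypothetical helper, and the sample payload is made up. Only the response shape (`data.result[].values` as `[timestamp, "value"]` pairs) follows the standard Prometheus HTTP API matrix format.

```python
# Sketch: summarize etcd RSS from a Prometheus range-query (matrix) response.
# ASSUMED_QUERY is an assumption; the report does not state the actual query.
ASSUMED_QUERY = 'container_memory_rss{namespace="openshift-etcd", container="etcd"}'

BYTES_PER_MIB = 1024 * 1024


def summarize_rss(response):
    """Return (avg_mib, max_mib) over every sample of every series."""
    samples = [
        float(value)  # Prometheus encodes sample values as strings
        for series in response["data"]["result"]
        for _timestamp, value in series["values"]
    ]
    if not samples:
        raise ValueError("no samples in response")
    avg_mib = sum(samples) / len(samples) / BYTES_PER_MIB
    max_mib = max(samples) / BYTES_PER_MIB
    return avg_mib, max_mib


# Made-up payload for illustration only (100, 200, and 300 MiB samples).
example_response = {
    "status": "success",
    "data": {
        "resultType": "matrix",
        "result": [
            {
                "metric": {"pod": "etcd-example-0"},
                "values": [[1700000000, "104857600"], [1700000030, "209715200"]],
            },
            {
                "metric": {"pod": "etcd-example-1"},
                "values": [[1700000000, "314572800"]],
            },
        ],
    },
}

avg_mib, max_mib = summarize_rss(example_response)
print(f"avg RSS: {avg_mib:.0f} MiB, max RSS: {max_mib:.0f} MiB")
```

Running the same summary on the etcd 3.5 and etcd 3.6 clusters over the workload window would give the two pairs of values to compare. Note that this sketch averages across all etcd pods together; a per-pod breakdown may be more appropriate depending on how the dashboards aggregate.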
Test Environment:
Platform: AWS
SDN: OVNKubernetes
Test workload: cluster-density-v2
Actual results:
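etcd 3.6 shows a significant memory usage regression compared to etcd 3.5: max RSS increases by 15.6% at 24 nodes, 42.2% at 120 nodes, and 162% at 250 nodes, and average RSS increases by up to 42.8%.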
Expected results:
etcd 3.6 memory usage should be comparable to etcd 3.5, with no significant regression.
Additional info: