-
Bug
-
Resolution: Done
-
ACM 2.8.0
-
GRC Sprint 2023-15, GRC Sprint 2023-16, GRC Sprint 2023-17, GRC Sprint 2023-18, GRC Sprint 2023-19
Description of problem:
While scale testing ACM 2.8 with 3500+ SNOs being deployed, managed, and having the DU profile applied, it has become apparent that the rate at which SiteConfigs are committed to git and applied to the cluster can reduce the number of policies that a ZTP CGU can apply successfully.
For example:
- Run 76 - Deploying 500 SNOs once every hour, where the DU profile consists of 13 policies, results in a ~97% success rate in SNOs becoming fully DU compliant (see Graph Run 76)
- Run 71 - Deploying 40 SNOs once every 5 minutes (~480 SNOs per hour), where the DU profile consists of 13 policies, results in 17% of SNOs failing to apply the DU profile (mostly at the tail end) and an overall ~81% success rate in SNOs becoming fully DU compliant (see Graph Run 71)
This is despite the fact that Run 71 (40/5m) is actually a slightly slower deployment rate: it takes 7.5 hours to commit all SiteConfigs, whereas 500/1hr takes only 7 hours to commit the same SiteConfigs. One would expect the slower, more gradual rate at which clusters complete to be easier on ACM components; however, the failure/success rates show that this is not the case.
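The rate comparison above can be checked with a small calculation. This is only a sketch: the total of 3600 SiteConfigs is an assumed round number (the report says only "3500+"), the function names are illustrative, and the first batch is assumed to be committed at time zero.

```python
import math


def hourly_rate(batch_size, interval_minutes):
    """Average number of SiteConfigs committed per hour."""
    return batch_size * 60 / interval_minutes


def commit_window_hours(total_clusters, batch_size, interval_minutes):
    """Hours between the first and last batch commit.

    Assumes the first batch is committed at t=0, so N batches
    span (N - 1) intervals.
    """
    batches = math.ceil(total_clusters / batch_size)
    return (batches - 1) * interval_minutes / 60


# ASSUMPTION: 3600 is a hypothetical total, not a figure from the report.
TOTAL_SITECONFIGS = 3600

for label, batch, interval in [("500/1hr", 500, 60), ("40/5m", 40, 5)]:
    rate = hourly_rate(batch, interval)
    window = commit_window_hours(TOTAL_SITECONFIGS, batch, interval)
    print(f"{label}: {rate:.0f} SNOs/hour, commit window {window:.2f} hours")
```

Under these assumptions the 500/1hr workload finishes committing in 7 hours and the 40/5m workload in a little under 7.5 hours, in line with the figures quoted above.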
Run 76 - resulted in 3506 SNOs becoming compliant with the 13 policy DU profile
Run 71 - resulted in only 2960 SNOs becoming compliant with the 13 policy DU profile
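To put the two results side by side, the following sketch computes the gap in compliant SNOs and the cluster totals implied by the reported success rates. The implied totals are rough estimates only, because the success rates in this report are approximate ("~97%", "~81%"); the function name is illustrative.

```python
def implied_total(compliant, success_rate):
    """Estimate attempted clusters from a compliant count and a rounded success rate."""
    return compliant / success_rate


# Figures taken from this report.
run_76_compliant = 3506  # 500/1hr workload, ~97% success
run_71_compliant = 2960  # 40/5m workload, ~81% success

compliant_gap = run_76_compliant - run_71_compliant
print(f"Run 76 reached {compliant_gap} more compliant SNOs than Run 71")

# Rough estimates only: the input rates are rounded.
print(f"Run 76 implied total: ~{implied_total(run_76_compliant, 0.97):.0f}")
print(f"Run 71 implied total: ~{implied_total(run_71_compliant, 0.81):.0f}")
```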
Version-Release number of selected component (if applicable):
Hub and Spoke OCP - 4.13.5
ACM - (run 71) - 2023-06-14-19-14-12
ACM - (run 76) - 2023-07-19-14-16-54
How reproducible:
Running the same build later with reduced policy counts shows that, under the "trickle" workload, a higher policy count results in CGUs taking longer to become compliant. In short, the 500/1hr workload is faster and more successful at applying the policies than the 40/5m workload.
Steps to Reproduce:
- ...