  1. Red Hat Advanced Cluster Management
  2. ACM-6928

ACM Policy slower to resolve policy compliance for TALM when workload is steady vs spikes


    • GRC Sprint 2023-15, GRC Sprint 2023-16, GRC Sprint 2023-17, GRC Sprint 2023-18, GRC Sprint 2023-19

      Description of problem:

      While scale testing ACM 2.8 with 3500+ SNOs being deployed, managed, and configured with the DU profile, it became apparent that the rate at which SiteConfigs are committed to git and applied to the cluster can reduce the number of policies that a ZTP CGU applies successfully.

       

      For example:

      • Run 76 - Deploying 500 SNOs once every hour, with a DU profile of 13 policies, results in a ~97% success rate of SNOs becoming fully DU compliant (see graph Run 76)
      • Run 71 - Deploying 40 SNOs once every 5 minutes (~480 SNOs per hour), with the same 13-policy DU profile, results in 17% of SNOs failing to apply the DU profile (mostly at the tail end) and an overall ~81% success rate of SNOs becoming fully DU compliant (see graph Run 71)

      This is despite the fact that Run 71 (40/5m) is actually a slightly slower deployment rate: it takes 7.5 hours to commit all SiteConfigs, whereas 500/1hr takes only 7 hours to commit the same SiteConfigs. One would expect the slower, more gradual rate at which clusters complete to be easier on ACM components, but the failure/success rates show that this is not the case.
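      The rate comparison above can be checked with a little arithmetic. The following is only a sketch: the helper name `clusters_per_hour` is made up for illustration, and the batch sizes and intervals are the ones quoted for Run 71 and Run 76.

```python
def clusters_per_hour(batch_size: int, interval_minutes: int) -> float:
    """Average number of SiteConfigs committed per hour for a batched workload."""
    return batch_size * 60 / interval_minutes


# Batch sizes and intervals as quoted for Run 71 (trickle) and Run 76 (burst).
trickle_rate = clusters_per_hour(batch_size=40, interval_minutes=5)
burst_rate = clusters_per_hour(batch_size=500, interval_minutes=60)

print(f"40/5m  : {trickle_rate:.0f} SNOs per hour")
print(f"500/1hr: {burst_rate:.0f} SNOs per hour")
print(f"trickle rate as a fraction of burst rate: {trickle_rate / burst_rate:.2f}")
```

      The trickle workload commits slightly fewer SiteConfigs per hour on average, which is why it would be expected to be the gentler of the two.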

       

      Run 76 - resulted in 3506 SNOs becoming compliant with the 13 policy DU profile

      Run 71 - resulted in only 2960 SNOs becoming compliant with the 13 policy DU Profile
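      As a rough way to quantify the gap between the two runs, the compliant counts above can be compared directly. This is a sketch only: the variable names are illustrative, and because the exact number of SNOs attempted in each run is not stated here, only the relative difference is computed.

```python
# Compliant SNO counts as reported for each run.
run76_compliant = 3506  # 500 SNOs once every hour
run71_compliant = 2960  # 40 SNOs once every 5 minutes

# Absolute shortfall of the trickle run relative to the burst run.
shortfall = run76_compliant - run71_compliant

# Run 71 compliant count expressed as a fraction of the Run 76 count.
relative = run71_compliant / run76_compliant

print(f"Run 71 ended with {shortfall} fewer compliant SNOs than Run 76")
print(f"Run 71 reached {relative:.1%} of the Run 76 compliant count")
```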

      Version-Release number of selected component (if applicable):

      Hub and Spoke OCP - 4.13.5

      ACM - (run 71) - 2023-06-14-19-14-12

      ACM - (run 76) - 2023-07-19-14-16-54

      How reproducible:

      Running the same build later with reduced policy counts shows that, under the "trickle" workload, a higher policy count results in a slower time to CGU compliance. In short, the 500/1hr workload applies the policies faster and more successfully than the 40/5m workload.

      Steps to Reproduce:

      1.  
      2.  
      3. ...

      Actual results:

      Expected results:

      Additional info:

        1. 0.log.20230810-215753.gz (3.17 MB)
        2. 0.log.20230810-225519.gz (3.17 MB)
        3. 0.log.20230810-234402.gz (3.20 MB)
        4. 0.log.20230811-001549.gz (3.22 MB)
        5. 0.log.gz (2.79 MB)
        6. grc-policy-propagator-5886b7768b-dt58k.recent.log.gz (1.65 MB)
        7. run71-share2-20230718-103713.png (123 kB)
        8. run76-share2-20230723-115934.png (112 kB)
        9. run82-share2-20230811-022738.png (121 kB)
        10. share2-20231128-142751.png (118 kB)

              Justin Kulikauskas (jkulikau@redhat.com)
              Alex Krzos (akrzos@redhat.com)
              Derek Ho