  1. OpenShift Bugs
  2. OCPBUGS-19415

Slowdown in provisioning SNOs with 4.14.0 Hub in ACM / ZTP provisioning of SNOs at scale


    • Quality / Stability / Reliability
    • Important
    • Approved

      Description of problem:

      While provisioning 3618 SNOs using ZTP with ACM 2.9, we observed a large slowdown in the time SNOs take to complete installation.  The slowdown seems to start at around 1800 managed clusters and continues for the remainder of the test. With the same deployed OCP and ACM versions but a 4.13 hub, the issue is not observed.
      
      Approximate cluster install times for the two hub OCP versions:
      
      4.14 Hub
      Count: 3571
      Min: 2915.0
      Average: 9191.5
      50 percentile: 6537.0
      95 percentile: 16942.5
      99 percentile: 17948.5
      Max: 21003.0
      
      4.13 Hub
      Count: 3555
      Min: 2579.0
      Average: 4036.5
      50 percentile: 4046.0
      95 percentile: 4868.3
      99 percentile: 5174.8
      Max: 5741.0
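To make the gap between the two runs easier to read, the figures above can be expressed as ratios of the 4.14 hub value to the 4.13 hub value. This is a minimal sketch: the dictionary keys and the `slowdown_ratios` helper are names chosen here for illustration, and the values are copied from the report as given (the report does not state their unit).

```python
# Summary statistics copied from the report. The unit is not stated there,
# so the ratios below are unitless comparisons only.
hub_414 = {
    "count": 3571, "min": 2915.0, "average": 9191.5,
    "p50": 6537.0, "p95": 16942.5, "p99": 17948.5, "max": 21003.0,
}
hub_413 = {
    "count": 3555, "min": 2579.0, "average": 4036.5,
    "p50": 4046.0, "p95": 4868.3, "p99": 5174.8, "max": 5741.0,
}


def slowdown_ratios(new, old, skip=("count",)):
    """Return new/old for every statistic except those listed in `skip`."""
    return {name: round(new[name] / old[name], 2)
            for name in new if name not in skip}


ratios = slowdown_ratios(hub_414, hub_413)
for name, ratio in ratios.items():
    print(f"{name}: {ratio}x")
```

Read this way, the minimum is nearly unchanged, the median is roughly 1.6 times higher, and the 95th and 99th percentiles are roughly 3.5 times higher on the 4.14 hub. That fits a slowdown that affects only the later part of the run.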
      
      

      Version-Release number of selected component (if applicable):

      Hub OCP 4.14.0-rc.0
      Deployed SNO - 4.14.0-rc.0 (similar slowdown also witnessed with 4.14.0-rc.1)
      ACM - 2.9.0-DOWNSTREAM-2023-09-07-04-47-52

      How reproducible:

       

      Steps to Reproduce:

      1.
      2.
      3.
      

      Actual results:

       

      Expected results:

       

      Additional info:

      Attached are graphs from both a 4.14 and a 4.13 run for comparison.  It isn't clear what is causing the slowdown; however, analysis performed by scripts shows a large growth in the time between when a cluster is registered and when its BMH starts provisioning.
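One way to quantify the gap between cluster registration and the start of BMH provisioning is sketched below. It is not the script used for the attached graphs. The record fields `registered_at` and `bmh_provisioning_at` are hypothetical names, and the sketch assumes both timestamps have already been collected per cluster as ISO 8601 strings with an explicit offset such as `+00:00`.

```python
from datetime import datetime
from statistics import mean


def stage_gap_seconds(registered_at, bmh_provisioning_at):
    """Seconds elapsed between two ISO 8601 timestamps.

    Both argument names are hypothetical; they stand for whatever
    timestamps the analysis collects for each cluster.
    """
    start = datetime.fromisoformat(registered_at)
    end = datetime.fromisoformat(bmh_provisioning_at)
    return (end - start).total_seconds()


def mean_gap_by_bucket(records, bucket_size=500):
    """Average the gap over consecutive groups of clusters.

    Clusters are ordered by registration time, so a rising sequence of
    averages indicates that the gap grows as more clusters are managed.
    """
    ordered = sorted(
        records, key=lambda r: datetime.fromisoformat(r["registered_at"])
    )
    gaps = [
        stage_gap_seconds(r["registered_at"], r["bmh_provisioning_at"])
        for r in ordered
    ]
    return [
        mean(gaps[i:i + bucket_size])
        for i in range(0, len(gaps), bucket_size)
    ]
```

Bucketing by registration order rather than by wall-clock time makes it easy to check whether the averages start climbing near a particular cluster count, such as the roughly 1800 clusters mentioned in the description.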
      
      Since the only difference is the hub OCP version, a component included in OCP must be driving this slowdown.

        1. 4.13hub-graph-per-cluster-stage_breakdown.png
          262 kB
          Alex Krzos
        2. 4.13hub-share2-20230913-060502.png
          113 kB
          Alex Krzos
        3. 4.14.0-rc.2-share2-20230926-101918.png
          112 kB
          Alex Krzos
        4. 4.14.0-rc.2-share2-20230927-093212.png
          112 kB
          Alex Krzos
        5. 4.14hub-graph-per-cluster-stage_breakdown.png
          232 kB
          Alex Krzos
        6. 4.14hub-share2-20230916-135528.png
          112 kB
          Alex Krzos

              hpokorny@redhat.com Honza Pokorny
              akrzos@redhat.com Alex Krzos
              Michael Brudnoy