OpenShift Bugs / OCPBUGS-73881

10m catalog sync interval contributes to unbounded etcd growth

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major
    • Affects Version: 4.21.0
    • Fix Versions: 4.18.z, 4.19.z, 4.20.z, 4.21.z, 4.22.0
    • Component: OLM
    • Severity: Critical
    • Sprint: Vaporeon Sprint 282, Weedle Sprint 283
    • Release Note Type: Bug Fix
      * Before this update, the catalog sync ran every 10 minutes and triggered high I/O on the control plane nodes where etcd runs. This could force etcd leader elections, which reset TTL counters on keys and caused etcd events to persist, degrading cluster performance. With this release, the default catalog polling interval is increased from 10 minutes to 4 hours, reducing I/O load and etcd event churn and improving overall cluster performance.

      The default catalog polling interval has been increased from 10 minutes to 4 hours to reduce load on catalog sources.

      This is a clone of issue OCPBUGS-69441. The following is the description of the original issue:

      Description of problem:

      It's been observed that the catalog sync triggers high I/O on the masters where etcd runs. This in turn triggers an etcd leader election, which resets TTL counters on keys; in particular, etcd events never clear.

      It seems unlikely that a 10-minute catalog update interval factors critically into anyone's operational plans. We should therefore increase the catalog source sync interval to four hours, avoiding the etcd knock-on effects in the local cluster while also reducing load on quay.io or local mirrors by ~95%.
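      For a grpc CatalogSource, the poll interval is configurable per catalog through `spec.updateStrategy.registryPoll.interval`. A minimal sketch of the proposed default (the catalog name and image below are placeholders, not the actual affected catalogs):

      ```yaml
      apiVersion: operators.coreos.com/v1alpha1
      kind: CatalogSource
      metadata:
        name: example-catalog            # hypothetical name
        namespace: openshift-marketplace
      spec:
        sourceType: grpc
        image: quay.io/example/catalog-index:latest  # placeholder image
        updateStrategy:
          registryPoll:
            interval: 240m               # proposed 4-hour default, up from 10m
      ```

      Admins who explicitly set a `registryPoll.interval` on their own catalogs would be unaffected by a change to the default.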

      Version-Release number of selected component (if applicable):

      All, but let's only bother with 4.18-4.22

      How reproducible:

      100% 

      Steps to Reproduce:

      Observe the catalog sync interval
          

      Actual results:

      Happens every 10 minutes

      Expected results:

      Happens every 240 minutes

      Additional info:

      While I suspect the backend load on our infrastructure or the customer's infrastructure isn't horrible, it would be good to ensure an appropriate jitter is added so that we avoid any thundering herd effect from a mass reboot, such as after a datacenter outage. A random sleep of up to 10 minutes is probably sufficient. We should consider whether an admin wishing to update the catalog right now needs a method to skip the jitter, but "restart the pod and wait up to 10 minutes" is probably not horrible.
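      The jitter idea above can be sketched as follows; the constants and function name are illustrative, not the actual OLM implementation (which is written in Go):

      ```python
      import random

      MAX_JITTER = 10 * 60      # seconds: random startup delay of up to 10 minutes
      BASE_INTERVAL = 240 * 60  # seconds: the proposed 4-hour sync interval

      def next_sync_delay(rng: random.Random) -> float:
          """Base interval plus uniform jitter so mass-rebooted pods desynchronize."""
          return BASE_INTERVAL + rng.uniform(0, MAX_JITTER)

      # Pods restarted at the same instant end up polling at different times.
      rng = random.Random(42)  # seeded only to make the sketch reproducible
      delays = [next_sync_delay(rng) for _ in range(2)]
      assert all(BASE_INTERVAL <= d <= BASE_INTERVAL + MAX_JITTER for d in delays)
      ```

      Uniform jitter spreads the first post-reboot sync across a 10-minute window, which bounds the worst case an impatient admin would wait after restarting the pod.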
      
      We should also make sure that our release notes mention this change and that we document the preferred path for updating the catalog source right now.

              Assignee: Rashmi Gottipati (rashmigottipati)
              Reporter: Scott Dodson (rhn-support-sdodson)
              QA Contact: Jian Zhang