Description of problem:
It has been observed that the catalog source sync triggers high I/O on the masters where etcd runs. That I/O can trigger an etcd leader election, which resets TTL counters on keys and in particular results in etcd events never clearing. It seems unlikely that a 10-minute catalog update interval factors critically into anyone's operational plans, so we should increase the catalog source sync interval to four hours (240 minutes). This avoids the etcd knock-on effects in the local cluster while also reducing load on quay.io or the customer's local mirrors by roughly 95%.
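For reference, the poll interval can already be overridden per CatalogSource through the documented spec.updateStrategy.registryPoll.interval field. The Go sketch below shows one way an admin could patch a catalog to a 240m interval today; the catalog name "certified-operators" and the "openshift-marketplace" namespace are examples only, and because the default catalogs are managed by the marketplace operator such an override may be reconciled away.

package main

import (
    "context"
    "fmt"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/apimachinery/pkg/runtime/schema"
    "k8s.io/apimachinery/pkg/types"
    "k8s.io/client-go/dynamic"
    "k8s.io/client-go/tools/clientcmd"
)

func main() {
    // Load the local kubeconfig; in-cluster config would work just as well.
    cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
    if err != nil {
        panic(err)
    }
    client, err := dynamic.NewForConfig(cfg)
    if err != nil {
        panic(err)
    }

    // CatalogSource is served by operators.coreos.com/v1alpha1.
    gvr := schema.GroupVersionResource{
        Group:    "operators.coreos.com",
        Version:  "v1alpha1",
        Resource: "catalogsources",
    }

    // Merge-patch the documented registryPoll interval up to 240m.
    patch := []byte(`{"spec":{"updateStrategy":{"registryPoll":{"interval":"240m"}}}}`)
    _, err = client.Resource(gvr).Namespace("openshift-marketplace").Patch(
        context.TODO(), "certified-operators", types.MergePatchType, patch, metav1.PatchOptions{})
    if err != nil {
        panic(err)
    }
    fmt.Println("certified-operators poll interval set to 240m")
}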
Version-Release number of selected component (if applicable):
All, but let's limit the fix to 4.18-4.22
How reproducible:
100%
Steps to Reproduce:
1. Observe the catalog source sync interval on any cluster
Actual results:
Catalog sources sync every 10 minutes
Expected results:
Catalog sources sync every 240 minutes (4 hours)
Additional info:
While I suspect the backend load on our infrastructure or the customer's infrastructure isn't severe, we should ensure an appropriate jitter is added so that we avoid any thundering herd effects from a mass reboot such as a datacenter outage. A random delay of up to 10 minutes is probably sufficient. We should consider whether an admin who wants to update the catalog immediately needs a method to skip the jitter, but "restart the pod and wait up to 10 minutes" is probably acceptable. We should also make sure that our release notes mention this change and that we document the preferred path for updating the catalog source on demand.
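A minimal sketch of the proposed jitter, assuming a 240-minute poll interval and a random startup delay of up to 10 minutes; pollCatalogWithJitter and syncCatalog are hypothetical names, not the actual OLM code.

package main

import (
    "fmt"
    "math/rand"
    "time"
)

// pollCatalogWithJitter delays the first sync by a random amount of up to
// maxJitter, then syncs every interval. Names and intervals here are
// illustrative only.
func pollCatalogWithJitter(interval, maxJitter time.Duration, syncCatalog func()) {
    jitter := time.Duration(rand.Int63n(int64(maxJitter)))
    fmt.Printf("delaying first catalog sync by %s to avoid a thundering herd\n", jitter.Round(time.Second))
    time.Sleep(jitter)

    syncCatalog() // first sync after the jittered delay

    ticker := time.NewTicker(interval)
    defer ticker.Stop()
    for range ticker.C {
        syncCatalog()
    }
}

func main() {
    pollCatalogWithJitter(240*time.Minute, 10*time.Minute, func() {
        fmt.Println("syncing catalog source at", time.Now().Format(time.RFC3339))
    })
}

Under this sketch, an admin restarting the pod would wait at most 10 minutes for the first sync, which matches the "restart the pod and wait up to 10 minutes" path described above.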
Relates to: OCPBUGS-57118 (New) - certified-operators are failing regularly due to startup probe timing out frequently and generating alert for KubePodCrashLooping