Description of problem:
Test Platform has detected a large increase in the time spent waiting for pull secrets to be initialized. Monitoring the audit log, we see nearly continuous updates to the SA pull secrets in the cluster (~2 per minute for every SA pull secret in the cluster).

The controller manager log is filled with entries like:

- "Internal registry pull secret auth data does not contain the correct number of entries" ns="ci-op-tpd3xnbx" name="deployer-dockercfg-p9j54" expected=5 actual=4
- "Observed image registry urls" urls=["172.30.228.83:5000","image-registry.openshift-image-registry.svc.cluster.local:5000","image-registry.openshift-image-registry.svc:5000","registry.build01.ci.openshift.org","registry.build01.ci.openshift.org"]

In the "Observed image registry urls" log line, notice the duplicate entries for "registry.build01.ci.openshift.org". We are not sure what causes the duplication, but when the URLs are actualized in a pull secret map, the double entry is reduced to one, so the controller-manager finds the cardinality mismatch (expected=5, actual=4) on the next check.

The duplication is evident in OpenShiftControllerManager/cluster:

    dockerPullSecret:
      internalRegistryHostname: image-registry.openshift-image-registry.svc:5000
      registryURLs:
      - registry.build01.ci.openshift.org
      - registry.build01.ci.openshift.org

But there is only one hostname in config.imageregistry.operator.openshift.io/cluster:

    routes:
    - hostname: registry.build01.ci.openshift.org
      name: public-routes
      secretName: public-route-tls
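To illustrate the suspected mechanism, here is a minimal Python sketch (not the controller's actual code, which is written in Go). It assumes the expected count is taken from the length of the observed URL list, while the secret's auth data is keyed by URL. The function names `expected_entry_count` and `build_auth_map` and the `<token>` placeholder are hypothetical.

```python
def expected_entry_count(internal_urls, registry_urls):
    # Assumption: the expected count is the raw number of observed URLs,
    # duplicates included.
    return len(internal_urls) + len(registry_urls)


def build_auth_map(internal_urls, registry_urls, token="<token>"):
    # Map keys are unique, so a URL listed twice yields a single entry.
    return {url: {"auth": token} for url in [*internal_urls, *registry_urls]}


internal = [
    "172.30.228.83:5000",
    "image-registry.openshift-image-registry.svc.cluster.local:5000",
    "image-registry.openshift-image-registry.svc:5000",
]
# registryURLs as reported in OpenShiftControllerManager/cluster.
external = [
    "registry.build01.ci.openshift.org",
    "registry.build01.ci.openshift.org",
]

expected = expected_entry_count(internal, external)
actual = len(build_auth_map(internal, external))
print(f"expected={expected} actual={actual}")  # expected=5 actual=4
```

Under these assumptions the secret can never satisfy the check, which would be consistent with it being rewritten on every reconcile.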
Version-Release number of selected component (if applicable):
4.17.0-rc.3
How reproducible:
Constant on build01 but not on other build farms
Steps to Reproduce:
1. Something ends up creating duplicate entries in the observed configuration of the openshift-controller-manager.
Actual results:
- Approximately 400K secret patches an hour on build01 vs ~40K on other build farms. Initialization times have increased by two orders of magnitude in new ci-operator namespaces.
- The openshift-controller-manager is hot looping and experiencing client throttling.
Expected results:
1. Initialization of pull secrets in a namespace should take < 1 second. On build01, it can take over 1.5 minutes.
2. openshift-controller-manager should not possess duplicate entries.
3. If duplicate entries are a configuration error, openshift-controller-manager should de-dupe the entries.
4. There should be alerting when the openshift-controller-manager experiences client-side throttling / pathological behavior.
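The de-duplication requested in item 3 could look roughly like the following Python sketch. It is an illustration only, not the operator's implementation, and `dedupe_urls` is a hypothetical helper name. It keeps the first occurrence of each URL and preserves the observed order.

```python
def dedupe_urls(urls):
    # dict.fromkeys keeps the first occurrence of each key and, in
    # Python 3.7+, preserves insertion order.
    return list(dict.fromkeys(urls))


observed = [
    "172.30.228.83:5000",
    "image-registry.openshift-image-registry.svc.cluster.local:5000",
    "image-registry.openshift-image-registry.svc:5000",
    "registry.build01.ci.openshift.org",
    "registry.build01.ci.openshift.org",
]

unique = dedupe_urls(observed)
print(len(observed), len(unique))  # 5 4
```

If the expected count were derived from the de-duplicated list, it would match the number of keys in the pull secret map.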
Additional info:
- blocks: OCPBUGS-42362 Continuous pull-secret updates / slow initialization on build01 (test platform infrastructure) (Closed)
- is blocked by: OCPBUGS-42237 Samples Operator Sync Breaks Build Suite Tests (Verified)
- is caused by: OCPBUGS-1689 Modifying a namespace or route label to opt-out of a router shard doesn't update the route admitted status (Closed)
- is cloned by: OCPBUGS-42362 Continuous pull-secret updates / slow initialization on build01 (test platform infrastructure) (Closed)
- links to: RHEA-2024:6122 OpenShift Container Platform 4.18.z bug fix update