OpenShift Bugs / OCPBUGS-42362

Continuous pull-secret updates / slow initialization on build01 (test platform infrastructure)

    • Icon: Bug Bug
    • Resolution: Done-Errata
    • Icon: Undefined Undefined
    • None
    • 4.17.0
    • ImageStreams
    • None
    • Important
    • None
    • Rejected
    • False
    • N/A
    • Release Note Not Required
    • Done

      This is a clone of issue OCPBUGS-42106. The following is the description of the original issue:

      Description of problem:

      Test Platform has detected a large increase in the amount of time spent waiting for pull secrets to be initialized.
      Monitoring the audit log, we can see nearly continuous updates to the SA pull secrets in the cluster (~2 per minute for every SA pull secret in the cluster).
      
      The controller manager log is filled with entries like:
      - "Internal registry pull secret auth data does not contain the correct number of entries" ns="ci-op-tpd3xnbx" name="deployer-dockercfg-p9j54" expected=5 actual=4
      - "Observed image registry urls" urls=["172.30.228.83:5000","image-registry.openshift-image-registry.svc.cluster.local:5000","image-registry.openshift-image-registry.svc:5000","registry.build01.ci.openshift.org","registry.build01.ci.openshift.org"
      
      In this "Observed image registry urls" log line, note the duplicate entry for "registry.build01.ci.openshift.org". We are not sure what causes the duplication, but when the URLs are written into a pull secret map, the duplicate collapses into a single entry. The secret therefore holds 4 entries where 5 are expected, so the controller-manager finds a cardinality mismatch on the next check.
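      The mismatch can be illustrated with a minimal sketch. It is not the controller's actual code: `build_auths` and `entry_count_matches` are hypothetical names, and the token value is a placeholder. It only shows why a list with a repeated URL can never agree in length with a map keyed by URL.

```python
# URLs copied from the "Observed image registry urls" log line in this report.
observed_urls = [
    "172.30.228.83:5000",
    "image-registry.openshift-image-registry.svc.cluster.local:5000",
    "image-registry.openshift-image-registry.svc:5000",
    "registry.build01.ci.openshift.org",
    "registry.build01.ci.openshift.org",  # duplicate entry
]


def build_auths(urls, token="<token>"):
    """Hypothetical stand-in: write one auth entry per registry URL.

    A map keyed by URL keeps only one entry per distinct URL.
    """
    return {url: {"auth": token} for url in urls}


def entry_count_matches(urls, auths):
    """Hypothetical check: compare the URL list length with the map size."""
    return len(urls) == len(auths)


auths = build_auths(observed_urls)
expected, actual = len(observed_urls), len(auths)
print(f"expected={expected} actual={actual} "
      f"in_sync={entry_count_matches(observed_urls, auths)}")
# prints: expected=5 actual=4 in_sync=False
```

      Under these assumptions the check fails every time it runs, no matter how often the secret is rewritten, which is consistent with the continuous updates seen in the audit log.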
      
      The duplication is evident in OpenShiftControllerManager/cluster:
            dockerPullSecret:
              internalRegistryHostname: image-registry.openshift-image-registry.svc:5000
              registryURLs:
              - registry.build01.ci.openshift.org
              - registry.build01.ci.openshift.org
      
      
      But there is only one hostname in config.imageregistry.operator.openshift.io/cluster:
        routes:
        - hostname: registry.build01.ci.openshift.org
          name: public-routes
          secretName: public-route-tls
      

      Version-Release number of selected component (if applicable):

      4.17.0-rc.3

      How reproducible:

      Constant on build01 but not on other build farms

      Steps to Reproduce:

          1. Something ends up creating duplicate entries in the observed configuration of the openshift-controller-manager.

      Actual results:

      - Approximately 400K secret patches an hour on build01 vs ~40K on other build farms. Initialization times have increased by two orders of magnitude in new ci-operator namespaces.
      - The openshift-controller-manager is hot looping and experiencing client throttling.

      Expected results:

      1. Initialization of pull secrets in a namespace should take less than 1 second. On build01, it can take over 1.5 minutes.
      2. The openshift-controller-manager observed configuration should not contain duplicate entries.
      3. If duplicate entries are a configuration error, openshift-controller-manager should de-dupe the entries.
      4. There should be alerting when the openshift-controller-manager experiences client-side throttling / pathological behavior.
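      Expected result 3 could look like the following sketch. It is an illustration only, not the fix that shipped: `dedupe_preserving_order` is a hypothetical helper, and it assumes the order of the remaining entries should stay as observed.

```python
def dedupe_preserving_order(urls):
    """Drop repeated URLs while keeping the first occurrence of each.

    dict.fromkeys keeps insertion order, so the result is stable.
    """
    return list(dict.fromkeys(urls))


# registryURLs as shown in OpenShiftControllerManager/cluster in this report.
registry_urls = [
    "registry.build01.ci.openshift.org",
    "registry.build01.ci.openshift.org",
]

unique_urls = dedupe_preserving_order(registry_urls)
print(unique_urls)
# prints: ['registry.build01.ci.openshift.org']
```

      With the list de-duplicated before the expected entry count is computed, the count and the number of keys in the pull secret map would agree, so the cardinality check would no longer fail on every pass.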

      Additional info:

          

            fmissi Flavian Missi
            openshift-crt-jira-prow OpenShift Prow Bot
            XiuJuan Wang XiuJuan Wang