  1. OpenShift Bugs
  2. OCPBUGS-7391

Monitoring operator long delay reconciling extension-apiserver-authentication


    • Bug
    • Resolution: Done
    • Normal
    • None
    • 4.13
    • Monitoring
    • None
    • -
    • Low
    • False


    • N/A
    • Bug Fix
    • Done

      Description of problem:

      As part of the effort to reduce single-node OpenShift (SNO) installation time, we've noticed that the monitoring operator takes a long* time to finish installing.
      It's hard for me to tell exactly what the monitoring operator is waiting for, but it becoming happy (as far as clusteroperator conditions are concerned) always seems to coincide with the operator finally noticing and reconciling** the 2 additional certificates that the apiserver operator adds to extension-apiserver-authentication.
      This usually happens minutes after the two certs are added. Ideally we'd like to cut that time down, because those minutes sometimes make the monitoring operator the last to roll out.
      *"Long" here means on the order of a few minutes, which is not a lot on its own, but it adds up. This ticket is one in a series of tickets we're opening for many other components.
      **The "marker" I use to tell when this happened is the moment the monitoring operator, among other things, replaces the old prometheus-adapter-<hash_x> secret, which contains just the original certs of extension-apiserver-authentication, with a new prometheus-adapter-<hash_y> secret that also contains the 2 new certs.
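One rough way to put a number on the delay described above is to compare the time the two certs were added with the creation time of the first prometheus-adapter-<hash> secret that appears afterwards. The Python sketch below is only an illustration of that idea, not something taken from the operator. It assumes the secret list comes from the `.items` array of `oc get secrets -n openshift-monitoring -o json`, and that the time the certs were added is known from elsewhere (for example, the apiserver operator logs). The function names, the sample secret names, and all timestamps are hypothetical.

```python
from datetime import datetime, timezone
from typing import Dict, List, Optional

# Name prefix of the secret used as the "marker" in this ticket.
ADAPTER_PREFIX = "prometheus-adapter-"


def parse_ts(ts: str) -> datetime:
    """Parse a Kubernetes-style UTC timestamp such as 2023-02-09T10:00:00Z."""
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def reconcile_delay_seconds(certs_added_at: str, secrets: List[Dict]) -> Optional[float]:
    """Seconds between the certs being added and the first prometheus-adapter-*
    secret created at or after that time. Returns None if no such secret exists."""
    added = parse_ts(certs_added_at)
    created = sorted(
        parse_ts(s["metadata"]["creationTimestamp"])
        for s in secrets
        if s["metadata"]["name"].startswith(ADAPTER_PREFIX)
    )
    for ts in created:
        if ts >= added:
            return (ts - added).total_seconds()
    return None


# Hypothetical data shaped like the "items" of `oc get secrets -o json`.
SAMPLE_SECRETS = [
    {"metadata": {"name": "prometheus-adapter-aaa", "creationTimestamp": "2023-02-09T09:50:00Z"}},
    {"metadata": {"name": "some-other-secret", "creationTimestamp": "2023-02-09T10:01:00Z"}},
    {"metadata": {"name": "prometheus-adapter-bbb", "creationTimestamp": "2023-02-09T10:04:00Z"}},
]

# Assumed time at which the 2 extra certs were added (hypothetical).
print(reconcile_delay_seconds("2023-02-09T10:00:00Z", SAMPLE_SECRETS))  # 240.0
```

With the made-up data above, the old secret predates the cert change and is ignored, so the reported delay is the 4 minutes until prometheus-adapter-bbb shows up. The old secret may already be deleted by the time the list is fetched, which does not affect the result.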

      Version-Release number of selected component (if applicable):

      nightly 4.13 OCP

      How reproducible:


      Steps to Reproduce:

      1. Install single-node-openshift

      Actual results:

      The monitoring operator takes minutes to reconcile extension-apiserver-authentication.

      Expected results:

      The monitoring operator reconciles extension-apiserver-authentication immediately.

      Additional info:

      Originally I suspected this might be due to API server downtime (which is a property of SNO), but the issue doesn't seem to correlate with API downtime.

        1. cmo_errors.log (8 kB)
        2. kaso.log (78 kB)
        3. my.tar.gz (9.87 MB)

            spasquie@redhat.com Simon Pasquier
            otuchfel@redhat.com Omer Tuchfeld
            Junqi Zhao