  1. OpenShift Bugs
  2. OCPBUGS-61288

When an Account Email is empty from OCM, the Console will consistently redeploy


    • Bug
    • Resolution: Duplicate
    • Undefined
    • None
    • 4.19
    • Management Console
    • None
    • Quality / Stability / Reliability
    • False
    • Important

      Description of problem:

      We recently had a customer notify us that their Managed OpenShift cluster was continually redeploying its console pods, which closed their open sessions and forced them to re-authenticate roughly every 5 minutes.
      
      During the incident, we traced the source of the problem to an empty email in the cluster owner's account: the cluster owner was a Service Account in OCM.
      
      The error originally presented as a 429 Request Limit Exceeded error, but we now believe that the 429 was just a symptom of the empty email. We believe this because the cache is not written when the values are empty, per these lines of code: https://github.com/openshift/console-operator/blob/76ae5dafe7640111fa8e9c7f745122e4844d4a5d/pkg/console/telemetry/telemetry.go#L124-L133.
      
      What I believe happens is this: because the email comes back from OCM as an empty string, the cache is not written. When the config object is then saved, it causes the operator to reconcile again. Normally this would be a no-op, but because the cached value is empty, the operator fetches it again and redeploys the console pod with the new config. This reconcile-on-change loop repeats until the operator is rate limited. It then keeps querying OCM until the rate limit expires, at which point the next successful query causes the console operator to redeploy the console pods again, and the cycle restarts until it hits the rate limit error once more.
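
The suspected loop can be illustrated with a minimal Go sketch. This is not the console-operator code: `orgMeta`, `cache`, `resolve`, and `countFetches` are hypothetical names, and the only behavior modeled is the assumption stated above, namely that the cache counts as populated only when both values are non-empty.

```go
package main

import "fmt"

// orgMeta is a hypothetical stand-in for the organization data returned by OCM.
type orgMeta struct {
	OrgID string
	Email string
}

// cache is treated as populated only when both fields are non-empty.
type cache struct {
	orgID, email string
}

// resolve returns the cached metadata when both fields are set. Otherwise it
// calls fetch and reports that a lookup was needed.
func resolve(c *cache, fetch func() orgMeta) (orgMeta, bool) {
	if c.orgID != "" && c.email != "" {
		return orgMeta{OrgID: c.orgID, Email: c.email}, false
	}
	m := fetch()
	// Assumption taken from the report: empty values are never cached.
	if m.OrgID != "" && m.Email != "" {
		c.orgID, c.email = m.OrgID, m.Email
	}
	return m, true
}

// countFetches runs n reconcile passes against a fetcher that always returns m
// and counts how many passes had to call it.
func countFetches(n int, m orgMeta) int {
	c := &cache{}
	fetches := 0
	for i := 0; i < n; i++ {
		if _, fetched := resolve(c, func() orgMeta { return m }); fetched {
			fetches++
		}
	}
	return fetches
}

func main() {
	fmt.Println(countFetches(5, orgMeta{OrgID: "org-1", Email: "owner@example.com"})) // 1
	fmt.Println(countFetches(5, orgMeta{OrgID: "org-1", Email: ""}))                  // 5
}
```

With a non-empty email only the first pass reaches the fetcher. With an empty email every pass does, which matches the repeated OCM queries and the eventual 429 described above.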
          

      Version-Release number of selected component (if applicable):

      We observed this on a cluster that is running 4.19.7, but I imagine that this would apply to all OpenShift versions at this time.
          

      How reproducible:

      I'm not sure - I don't think I have the resources to attempt a full reproduction, but I think this might be easy for your team to reproduce locally given your experience with the codebase.
          

      Steps to Reproduce:

          1. Get a response from OCM that includes an empty string in the {{fetchedOCMRespose.Creator.Email}} field
          2. Observe the console operator thrash the console pod deployment, creating many ReplicaSets
          

      Actual results:

      
          

      Expected results:

      
          

      Additional info:

      From further review of the code, when we saw this issue we were able to implement a workaround by adding ORGANIZATION_ID and ACCOUNT_MAIL (using another email address from their organization) to the telemetry-config configmap in the openshift-console-operator namespace. It ended up looking like the example below:
          
      $ oc get configmap -n openshift-console-operator telemetry-config -o yaml
      apiVersion: v1
      data:
        ACCOUNT_MAIL: <EMAIL FROM ANOTHER CLUSTER ADMIN IN THE ORG>
        ORGANIZATION_ID: <ORG ID FROM OCM>
        SEGMENT_API_HOST: [REDACTED]
        SEGMENT_JS_HOST: [REDACTED]
        SEGMENT_PUBLIC_API_KEY: [REDACTED]
      kind: ConfigMap
      metadata:
        name: telemetry-config
        namespace: openshift-console-operator
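
Why the workaround helps can be sketched as follows, assuming that values set in the configmap take precedence over any OCM lookup. The helper name `overrideFromConfig` is hypothetical and the real operator logic may differ. Only the keys `ORGANIZATION_ID` and `ACCOUNT_MAIL` come from the configmap above.

```go
package main

import "fmt"

// overrideFromConfig is a hypothetical helper. It returns the organization ID
// and account mail from the configmap data when both are present and
// non-empty, so the caller can skip the OCM lookup entirely.
func overrideFromConfig(data map[string]string) (string, string, bool) {
	orgID, hasOrg := data["ORGANIZATION_ID"]
	mail, hasMail := data["ACCOUNT_MAIL"]
	if hasOrg && hasMail && orgID != "" && mail != "" {
		return orgID, mail, true
	}
	return "", "", false
}

func main() {
	// Placeholder values for illustration only.
	withOverride := map[string]string{
		"ORGANIZATION_ID": "org-1",
		"ACCOUNT_MAIL":    "admin@example.com",
	}
	orgID, mail, ok := overrideFromConfig(withOverride)
	fmt.Println(orgID, mail, ok) // org-1 admin@example.com true

	_, _, ok = overrideFromConfig(map[string]string{"ORGANIZATION_ID": "org-1"})
	fmt.Println(ok) // false
}
```

Under that assumption, once both keys are set the empty email returned by OCM is never consulted, so the reconcile loop has nothing left to refetch.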
      

      While we believe we've successfully worked around the issue in this case, we believe that the Console Operator should be able to handle an empty email address when service accounts are used to create clusters.
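
One possible way to tolerate an empty email, shown purely as an illustration and not as the operator's actual or planned fix, is to record that a lookup succeeded separately from the values it returned. All names below (`metaCache`, `resolve`, `countCalls`) are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
)

// orgMeta is a hypothetical stand-in for the organization data returned by OCM.
type orgMeta struct {
	OrgID string
	Email string
}

// metaCache remembers that a lookup succeeded, even if a field came back empty.
type metaCache struct {
	meta    orgMeta
	fetched bool
}

// resolve returns the cached result after the first successful lookup.
// Failed lookups are not cached, so they are retried on the next pass.
func (c *metaCache) resolve(fetch func() (orgMeta, error)) (orgMeta, error) {
	if c.fetched {
		return c.meta, nil
	}
	m, err := fetch()
	if err != nil {
		return orgMeta{}, err
	}
	c.meta, c.fetched = m, true
	return m, nil
}

// countCalls runs n passes against a fetcher that fails `failures` times and
// then returns m. It reports how many times the fetcher was called.
func countCalls(n, failures int, m orgMeta) int {
	c := &metaCache{}
	calls := 0
	fetch := func() (orgMeta, error) {
		calls++
		if calls <= failures {
			return orgMeta{}, errors.New("simulated rate limit")
		}
		return m, nil
	}
	for i := 0; i < n; i++ {
		_, _ = c.resolve(fetch)
	}
	return calls
}

func main() {
	emptyEmail := orgMeta{OrgID: "org-1", Email: ""}
	fmt.Println(countCalls(5, 0, emptyEmail)) // 1
	fmt.Println(countCalls(5, 2, emptyEmail)) // 3
}
```

In this sketch an empty email is cached like any other successful response, so later passes become no-ops, while errors such as a rate limit still trigger a retry.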

      I've linked the other issues related to rate limiting to this ticket, but as noted before, I believe that the rate limit is a symptom of the issue and not the root cause.

              jhadvig@redhat.com Jakub Hadvig
              iamkirkbater Kirk Bater
              None
              None
              YaDan Pei YaDan Pei