OpenShift Bugs / OCPBUGS-63345

hypershift_cluster_invalid_aws_creds displays unknown state as invalid


    • Quality / Stability / Reliability
    • Important

      Description of problem:

      On Oct 20th, AWS had a major outage in us-east-1. During this outage, IAM services returned internal errors.

      While the outage lasted, the hypershift_cluster_invalid_aws_creds metric was set to 1, which means the credentials are invalid. This is not accurate: the state of the credentials was unknown, not invalid.

      SRE uses this metric to automate notifications to customers and to disable alerting for clusters, so reporting an unknown state as invalid can trigger those actions incorrectly.

      Version-Release number of selected component (if applicable):

          N/A

      How reproducible:

          100%

      Steps to Reproduce:

          1. Set ValidAWSCredentials to Unknown to simulate AWS returning HTTP 500 responses or health checks not working
          2. Check the value of the hypershift_cluster_invalid_aws_creds metric

      Actual results:

          The metric is set to 1 (invalid credentials) when the state is unknown

      Expected results:

          The metric should distinguish between the true, false, and unknown states
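
          One way to express the three states is the common Prometheus "state set" pattern: one sample per state, with exactly one sample set to 1. The sketch below is only an illustration of that idea, not the HyperShift implementation. The metric name hypershift_cluster_aws_creds_state, the state label, and the helper functions are hypothetical, and the status values are assumed to follow the usual True/False/Unknown convention.

```go
package main

import "fmt"

// ConditionStatus models the three values assumed for ValidAWSCredentials.
type ConditionStatus string

const (
	StatusTrue    ConditionStatus = "True"
	StatusFalse   ConditionStatus = "False"
	StatusUnknown ConditionStatus = "Unknown"
)

// credsStates lists the hypothetical values of the "state" label.
var credsStates = []string{"valid", "invalid", "unknown"}

// credsState maps a condition status to a state label. Anything that is not
// explicitly True or False (for example, an empty status) is reported as
// "unknown" instead of being folded into "invalid".
func credsState(s ConditionStatus) string {
	switch s {
	case StatusTrue:
		return "valid"
	case StatusFalse:
		return "invalid"
	default:
		return "unknown"
	}
}

// credsSamples returns one sample per state; only the active state is 1.
func credsSamples(s ConditionStatus) map[string]float64 {
	active := credsState(s)
	samples := make(map[string]float64, len(credsStates))
	for _, state := range credsStates {
		if state == active {
			samples[state] = 1
		} else {
			samples[state] = 0
		}
	}
	return samples
}

func main() {
	// Print the samples that would be exposed for each condition status.
	for _, s := range []ConditionStatus{StatusTrue, StatusFalse, StatusUnknown} {
		samples := credsSamples(s)
		for _, state := range credsStates {
			// The metric name below is hypothetical.
			fmt.Printf("hypershift_cluster_aws_creds_state{state=%q} %v  # condition=%s\n",
				state, samples[state], s)
		}
	}
}
```

          With a shape like this, automation that notifies customers could match only on state="invalid", while an AWS-side outage would show up as state="unknown" and could be handled separately.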

      Additional info:

          

              cbusse.openshift Claudio Busse
              Martin Gencur