  1. Managed Service - API
  2. MGDAPI-4031

No alert fires when RHOAM cannot complete the reconcile loop


    • Type: Bug
    • Resolution: Done
    • Priority: Blocker
    • 1.23.0
    • MGDAPI - Sprint 26

      What
      An existing alert, RHOAMInstallationControllerIsInReconcilingErrorState, is meant to fire when the operator is in a complete state and the controller has errors. That alert is itself broken for a separate reason, and it does not cover the case where the operator never reaches a complete state. A large number of unknown causes can leave the operator in an incomplete state.

      RHOAMInstallationControllerIsInReconcilingErrorState references controller="installation-controller", which does not exist. It is not clear what this alert is meant to alert on, but it is clearly not working correctly. See the alert definition.

      There should be an alert that fires if the operator is in an installed state but not a completed state for some time.

      How
      Recent changes to the metrics have added a "status" field to the rhoam_version metric, which states whether the operator is installing, upgrading or installed. The metric rhoam_status has a field called "stage", which reflects the operator's current stage. So if rhoam_version{status="Installed"} and rhoam_status{stage!="complete"} both hold for some length of time, the alert should fire.
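
      The condition described above could be sketched as a Prometheus alerting rule along the following lines. This is only an illustration under stated assumptions: the alert name, group name, 30m duration and severity are hypothetical, and the expression assumes that rhoam_status only exports a series for the current stage. Because the two metrics are unlikely to share an identical label set, the sketch joins them with "and on()", which may need to be narrowed to a shared label in practice.

```yaml
# Sketch only: names, duration and severity are assumptions, not the shipped rule.
groups:
  - name: rhoam-installation.rules            # hypothetical group name
    rules:
      - alert: RHOAMInstalledButNotComplete   # hypothetical alert name
        # Left side: the operator reports an installed version.
        # Right side: the current stage is anything other than "complete".
        # "and on()" ignores all labels when matching the two sides, so the
        # left-hand series is kept whenever any right-hand series exists.
        expr: |
          rhoam_version{status="Installed"}
            and on()
          rhoam_status{stage!="complete"}
        for: 30m                              # assumed; the ticket only says "some length of time"
        labels:
          severity: warning                   # assumed severity
        # Annotations (for example an SOP link) are omitted from this sketch.
```

      The "for" clause keeps the alert pending until the condition has held continuously for the chosen duration, which avoids firing during a normal, short reconcile pass.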

      SOP
      Whether this is a new alert or an update to the alert mentioned above, the SOP should be updated. This alert is a catch-all, so any other existing alerts should be addressed first. SRE may need to investigate, and/or ask engineering to help investigate, the root cause of the alert.

      Testing
      What is the best way to test alerts?
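
      One possible approach, offered as a sketch and not as the team's agreed method, is a promtool rule unit test. The file below assumes a hypothetical rule file rhoam-alerts.yml containing a hypothetical alert RHOAMInstalledButNotComplete with the expression rhoam_version{status="Installed"} and on() rhoam_status{stage!="complete"}, a "for" of 30m, a severity="warning" label and no annotations. The stage value "products" is also only an example.

```yaml
# Sketch of a promtool unit test; file name, alert name and stage value are assumptions.
rule_files:
  - rhoam-alerts.yml              # hypothetical rule file under test

evaluation_interval: 1m

tests:
  # Case 1: installed, but stuck in a stage other than "complete".
  - interval: 1m
    input_series:
      - series: 'rhoam_version{status="Installed"}'
        values: '1+0x60'          # constant 1 for one hour of samples
      - series: 'rhoam_status{stage="products"}'
        values: '1+0x60'
    alert_rule_test:
      - eval_time: 45m            # past the assumed 30m "for" duration
        alertname: RHOAMInstalledButNotComplete
        exp_alerts:
          - exp_labels:
              severity: warning
              status: Installed   # carried over from the left-hand series

  # Case 2: installed and complete, so no alert is expected.
  - interval: 1m
    input_series:
      - series: 'rhoam_version{status="Installed"}'
        values: '1+0x60'
      - series: 'rhoam_status{stage="complete"}'
        values: '1+0x60'
    alert_rule_test:
      - eval_time: 45m
        alertname: RHOAMInstalledButNotComplete
        exp_alerts: []            # explicitly expect no firing alerts
```

      Such a file can be run with "promtool test rules" followed by the test file name. It exercises only the rule expression and timing, not whether the operator actually emits these metrics on a cluster.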

      Done

      • An alert fires when the installed operator is not in a complete state.
      • The RHOAMInstallationControllerIsInReconcilingErrorState alert is fixed, removed or explained.

              mstoklus_rhmi Michal Stokluska
              jfitzpat_rhmi Jim Fitzpatrick (Inactive)
              Patryk Stefanski Patryk Stefanski