-
Bug
-
Resolution: Done
-
Blocker
-
None
-
None
-
None
-
3
-
False
-
None
-
False
-
Yes
-
MGDAPI - Sprint 26
What
There is an alert, RHOAMInstallationControllerIsInReconcilingErrorState, that is meant to fire when the operator is in a complete state and the controller has errors. While this alert is itself broken for a separate reason, it also does not cover the case where the operator never reaches a complete state. A large number of unknown causes can leave the operator in an incomplete state.
RHOAMInstallationControllerIsInReconcilingErrorState references controller="installation-controller", which does not exist. It is not clear what this alert is trying to alert on, but it is clearly not working correctly. Alert definition.
There should be an alert that fires if the operator is in an installed state but not a completed state for some time.
How
Recent changes to the metrics have added a "status" field to the rhoam_version metric. This states whether the operator is installing, upgrading or installed. The metric rhoam_status has a field called "stage", which reflects the current stage of the operator. So if rhoam_version reports the installed status and rhoam_status{stage!="complete"} holds for some length of time, the alert should fire.
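As a rough illustration of the condition described above, a Prometheus alerting rule could look something like the sketch below. This is not the operator's actual rule. The group name, alert name, severity, message and the 30m duration are all hypothetical placeholders, and the label value status="installed" is assumed from the description of the "status" field. The `and on()` join is used because the two metrics are not assumed to share labels.

```yaml
groups:
  - name: rhoam-installation.rules            # hypothetical group name
    rules:
      - alert: RHOAMInstalledButNotComplete   # hypothetical alert name
        # Fires when the operator reports an installed status while the
        # current stage is anything other than "complete".
        # Assumption: rhoam_status only exposes a series for the current
        # stage. If it exposes every stage with a 0/1 value, the right-hand
        # side would also need a value filter such as "== 1".
        expr: |
          rhoam_version{status="installed"}
            and on()
          rhoam_status{stage!="complete"}
        for: 30m                              # placeholder for "some length of time"
        labels:
          severity: warning                   # assumed severity
        annotations:
          message: RHOAM operator is installed but has not reached the complete stage.
```

The `for` clause is what turns the instantaneous condition into "for some length of time"; the right value depends on how long a normal installation or upgrade is expected to stay out of the complete stage.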
SOP
Whether this ends up as a new alert or an update to the alert mentioned above, the SOP should be updated. This alert is a catch-all, so any other firing alerts should be addressed first. It is possible that SRE will need to investigate, get engineering to help investigate the root cause of the alert, or both.
Testing
What is the best way to test alerts?
Done
- An alert fires when the installed operator is not in a complete state.
- The RHOAMInstallationControllerIsInReconcilingErrorState alert is fixed, removed or explained.