OpenShift Cloud / OCPCLOUD-1693

Investigate creating an alert for machines that are "missing in action"

    • Type: Story
    • Resolution: Unresolved

      User Story

      As a user, I would like to know when a Machine has a valid phase that does not match the state reported by the infrastructure provider. An alert that informs me of this condition would help me find these cases.

      Background

      A Slack discussion about a CI failure brought to light a condition where a Machine was in the `Running` phase but its instance had been terminated in the infrastructure console. The controller failed several times while updating the Machine, yet the Machine never moved from `Running` to `Failed` or a similar phase. In this case the provider was GCP, and the Machine had a `TERMINATED` instanceState in its providerStatus.

      We should investigate whether we can create an alert for the condition where a Machine reports a valid running phase that does not match the actual state in the infrastructure. The check may look different for each provider, so we would need a way to expose that metric, perhaps through the controller code in MAO.

      Is it feasible for us to have a provider specific metric/alert that would inform when Machines are in this state of having a `Running` phase, but are not actually running in the provider?
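
      To make the question concrete, here is a minimal sketch in Go of how such a condition could be detected and counted per provider. It is only an illustration under stated assumptions: the `Machine` struct, the `notRunningStates` table, and both function names are hypothetical and not taken from MAO. The only provider state that comes from this ticket is GCP's `TERMINATED`; entries for other providers would have to come out of the investigation.

```go
package main

import (
	"fmt"
	"strings"
)

// Machine is a hypothetical, trimmed-down view of a Machine object.
// It is NOT the real Machine API type; it only carries the fields
// this sketch needs.
type Machine struct {
	Name          string
	Provider      string // e.g. "gcp"
	Phase         string // Machine phase, e.g. "Running"
	InstanceState string // state reported in providerStatus
}

// notRunningStates maps a provider to the instance states that mean
// "the instance is not actually running". Assumption: only the GCP
// TERMINATED state is known from the ticket; other providers would
// need their own entries.
var notRunningStates = map[string]map[string]bool{
	"gcp": {"TERMINATED": true},
}

// isMissingInAction reports whether a Machine claims to be Running
// while its provider-reported state says the instance is gone.
// Providers without an entry in the table are never flagged.
func isMissingInAction(m Machine) bool {
	if m.Phase != "Running" {
		return false
	}
	states, ok := notRunningStates[strings.ToLower(m.Provider)]
	if !ok {
		return false
	}
	return states[strings.ToUpper(m.InstanceState)]
}

// countMissingInAction returns, per provider, how many Machines are
// in the mismatched state. A value like this could back a
// provider-labelled gauge metric that an alert fires on.
func countMissingInAction(machines []Machine) map[string]int {
	counts := make(map[string]int)
	for _, m := range machines {
		if isMissingInAction(m) {
			counts[strings.ToLower(m.Provider)]++
		}
	}
	return counts
}

func main() {
	machines := []Machine{
		{Name: "worker-a", Provider: "gcp", Phase: "Running", InstanceState: "TERMINATED"},
		{Name: "worker-b", Provider: "gcp", Phase: "Running", InstanceState: "RUNNING"},
		{Name: "worker-c", Provider: "gcp", Phase: "Failed", InstanceState: "TERMINATED"},
	}
	fmt.Println(countMissingInAction(machines))
}
```

      Keeping the provider-specific knowledge in a lookup table is one possible design; it keeps the detection logic common while leaving room for each provider to define which states count as "not running".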

      Steps

      • investigate providers to see if there is a common pattern that we could use to create a metric from
      • record findings in a document that we can use to evaluate further action

      Stakeholders

      • openshift engineering

      Definition of Done

      • team has a concise summary of the available options so it can decide on next steps
      • Docs: n/a
      • Testing: n/a

              Assignee: Unassigned
              Reporter: Michael McCune (mimccune@redhat.com)