OpenShift Cloud / OCPCLOUD-1614

Maintainability: Add an alert for when mapi_instance_create_failed is high for a long period of time

    • Type: Story
    • Resolution: Unresolved

      User Story

      As a user, I would like to know when a machineset has been failing to create machines for an extended period of time. An alert on the mapi_instance_create_failed metric, keyed on the machineset label, would help me spot these problems more quickly.

      Background

      To help OpenShift users and operators, we should add an alert that fires when mapi_instance_create_failed increases too rapidly for too long. The exact rate threshold and duration are still TBD; 30m may be a reasonable duration to start with. The alert should be keyed on the machineset label proposed in OCPCLOUD-1613.
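
      To make the intended condition concrete, here is a minimal Python sketch of the check such an alert would perform. It is not the actual rule. It assumes that mapi_instance_create_failed behaves as a monotonically increasing counter and that each sample carries the machineset label from OCPCLOUD-1613. The function name, the 30-minute window, and the threshold of 3 failures are hypothetical placeholders, since the ticket leaves rate and timing TBD. In PromQL terms this is roughly what an increase() over a range, grouped by machineset, would compute.

```python
from collections import defaultdict


def failing_machinesets(samples, now, window_s=30 * 60, min_increase=3):
    """Return machinesets whose failure counter grew too much in the window.

    samples: iterable of (machineset, timestamp_s, counter_value) tuples.
    window_s and min_increase are placeholder values, not agreed thresholds.
    Counter resets are ignored in this sketch.
    """
    by_set = defaultdict(list)
    for machineset, ts, value in samples:
        # Keep only the samples that fall inside the trailing window.
        if now - window_s <= ts <= now:
            by_set[machineset].append((ts, value))

    firing = []
    for machineset, points in by_set.items():
        points.sort()
        # Growth of the counter between the oldest and newest sample.
        increase = points[-1][1] - points[0][1]
        if increase >= min_increase:
            firing.append(machineset)
    return sorted(firing)


# Hypothetical data: workers-a keeps failing, workers-b is stable.
samples = [
    ("workers-a", 0, 0),
    ("workers-a", 900, 2),
    ("workers-a", 1800, 5),
    ("workers-b", 0, 1),
    ("workers-b", 1800, 1),
]
print(failing_machinesets(samples, now=1800))
```

      Grouping by machineset keeps one noisy machineset from hiding behind a cluster-wide total, which is the reason for keying the alert on that label.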

      Steps

      • Add the alert to the MAO (machine-api-operator) manifests.
      • Update the documentation to cover the new alert, and add a runbook describing how to resolve it.

      Stakeholders

      • OpenShift engineering

      Definition of Done

      • Add the new alert
      • Docs
        • Update the alert docs and add a runbook.
      • Testing
        • We don't currently have an e2e test for failed machine creations.

      Assignee: Unassigned
      Reporter: Michael McCune (mimccune@redhat.com)