- Story
- Resolution: Unresolved
User Story
As a user, I would like to know when a machineset has been failing to create machines for an extended period of time. An alert on the mapi_instance_create_failed metric, keyed on the machineset label, will help me see these problems more quickly.
Background
To help OpenShift users and operators, we should add an alert that fires when mapi_instance_create_failed increases too rapidly for too long. The rate threshold and duration are still to be determined; 30m may be a reasonable duration to start with. The alert should be keyed on the machineset label proposed in OCPCLOUD-1613.
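As a starting point for discussion, the alert described above might be sketched as a small Python script that builds a PrometheusRule manifest and prints it as JSON (which is also valid YAML). Everything specific here is an assumption rather than something this ticket defines: the label name `machineset` (pending OCPCLOUD-1613), the alert and rule names, the 15m rate window, the `> 0` threshold, and the `warning` severity. Only the metric name and the tentative 30m duration come from the ticket.

```python
import json

# Assumptions (not confirmed by this ticket): the label name, threshold,
# rate window, alert name, and severity below are placeholders.
MACHINESET_LABEL = "machineset"  # hypothetical label name, pending OCPCLOUD-1613
FAILURE_THRESHOLD = 0            # hypothetical: any failures in the window
RATE_WINDOW = "15m"              # hypothetical lookback window for increase()
FOR_DURATION = "30m"             # "maybe 30m to start", per the ticket


def build_alert_rule(label=MACHINESET_LABEL, threshold=FAILURE_THRESHOLD,
                     window=RATE_WINDOW, for_duration=FOR_DURATION):
    """Return one Prometheus alerting rule as a plain dict."""
    # Sum failures per machineset so the alert fires once per machineset.
    expr = (
        f"sum by ({label}) "
        f"(increase(mapi_instance_create_failed[{window}])) > {threshold}"
    )
    return {
        "alert": "MachineSetInstanceCreateFailing",  # hypothetical name
        "expr": expr,
        "for": for_duration,  # condition must hold this long before firing
        "labels": {"severity": "warning"},
        "annotations": {
            "summary": (
                "MachineSet {{ $labels." + label + " }} has been failing "
                "to create instances for an extended period."
            ),
        },
    }


def build_prometheus_rule():
    """Wrap the rule in a PrometheusRule manifest (monitoring.coreos.com/v1)."""
    return {
        "apiVersion": "monitoring.coreos.com/v1",
        "kind": "PrometheusRule",
        "metadata": {
            "name": "machineset-create-failures",  # hypothetical name
            "namespace": "openshift-machine-api",
        },
        "spec": {
            "groups": [
                {
                    "name": "machineset-create-failures",
                    "rules": [build_alert_rule()],
                }
            ]
        },
    }


if __name__ == "__main__":
    print(json.dumps(build_prometheus_rule(), indent=2))
```

Using `increase()` over a window together with a `for` duration is one way to express "increasing for too long"; the actual expression and thresholds would need tuning against real failure patterns before landing in the MAO manifests.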
Steps
- Add alert to MAO manifests
- Update documentation about the new alert, and add a runbook for solutions.
Stakeholders
- openshift engineering
Definition of Done
- Add new alert
- Docs
  - Update alert docs, add runbook.
- Testing
  - We don't currently have an e2e for failed machine creations.
- is blocked by
  - OCPCLOUD-1613 Maintainability: Add machineset name to mapi_instance_create_failed metric series (To Do)
- relates to
  - OCPCLOUD-1660 Improve error conditions for MachineSet failing to create new Machines (Closed)