- Story
- Resolution: Won't Do
User Story
As a user, I would like to be able to read error conditions on a MachineSet when it is failing to create new Machines, so that I can properly diagnose the failure.
Background
While investigating Bug 2104511, we found that it is possible to have a MachineSet with a providerSpec that will not pass webhook validation. When the replicas are increased on the MachineSet (using the scale subresource), a new Machine is created and rejected by the webhook, but no condition is ever surfaced for the user to inspect. The rejection can be determined by inspecting the events for the webhooks in the openshift-machine-api namespace. It would be convenient for users to see this information in the MachineSet conditions as well.
This might require some investigation into whether we can add a condition to the MachineSet during a webhook operation.
We should also investigate whether exporting the conditions from the MachineSet and creating alerts based on those conditions would be an improvement for users.
For reference on this issue, please read this thread: https://coreos.slack.com/archives/CBZHF4DHC/p1660837393467059
Steps
- investigate adding conditions to MachineSet from Machine webhook
- if possible/reasonable, add conditions to MachineSet when a validating webhook rejection occurs
- if reasonable, export conditions and create alerts for MachineSets based on error conditions
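The steps above could be sketched roughly as follows. This is only an illustration of the intended behaviour, not the actual machine-api implementation: it assumes the MachineSet controller (rather than the webhook itself) records the rejection it receives when a Machine create call is denied. The types are plain Python stand-ins for the real API objects, and the condition type "MachineCreationSucceeded", the reason strings, the WebhookRejection error, and the condition_gauge helper are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Callable, List, Optional

# Hypothetical condition type; the real name would come out of the investigation.
MACHINE_CREATION_SUCCEEDED = "MachineCreationSucceeded"


@dataclass
class Condition:
    type: str
    status: str  # "True", "False" or "Unknown"
    reason: str = ""
    message: str = ""
    last_transition_time: Optional[datetime] = None


@dataclass
class MachineSet:
    # Stand-in for the real MachineSet object; only the fields the sketch needs.
    name: str
    replicas: int
    conditions: List[Condition] = field(default_factory=list)


class WebhookRejection(Exception):
    """Hypothetical error for a Machine denied by the validating webhook."""


def set_condition(ms: MachineSet, new: Condition) -> None:
    """Insert or update a condition, keeping one entry per condition type."""
    now = datetime.now(timezone.utc)
    for i, old in enumerate(ms.conditions):
        if old.type == new.type:
            # Only move the transition time when the status actually changes.
            if old.status == new.status:
                new.last_transition_time = old.last_transition_time
            else:
                new.last_transition_time = now
            ms.conditions[i] = new
            return
    new.last_transition_time = now
    ms.conditions.append(new)


def create_machine(ms: MachineSet, create: Callable[[], None]) -> bool:
    """Try to create one Machine and record the outcome on the MachineSet."""
    try:
        create()
    except WebhookRejection as err:
        set_condition(ms, Condition(
            MACHINE_CREATION_SUCCEEDED, "False",
            "WebhookValidationFailed", str(err)))
        return False
    set_condition(ms, Condition(
        MACHINE_CREATION_SUCCEEDED, "True", "MachineCreated"))
    return True


def condition_gauge(ms: MachineSet, cond_type: str) -> float:
    """Return 1.0 while the condition is False, so an alert could fire on it."""
    for cond in ms.conditions:
        if cond.type == cond_type and cond.status == "False":
            return 1.0
    return 0.0
```

With a create callable that raises WebhookRejection, the MachineSet ends up with a single condition whose status is "False" and whose message carries the webhook's error text, and condition_gauge returns 1.0 until a later create succeeds. Whether the webhook can write this condition directly, instead of the controller, is exactly what the first step would need to establish.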
Stakeholders
- cloud infra team
Definition of Done
- user can observe Machine validation webhook failures on the MachineSet
- Docs: we might need to update the product docs; need to double-check whether we have any guidance here already
- Testing: should add unit tests, at the least, to ensure this transaction works
Issue Links
- is related to: OCPCLOUD-1614 Maintainability: Add an alert for when mapi_instance_create_failed is high for a long period of time (To Do)
- is related to: OCPCLOUD-1661 Investigate reporting on expected versus observed replicas for MachineSets (Closed)
- relates to: OCPCLOUD-1704 RFE: Alert on consistent ScaleUpTimedOut (To Do)