Details
- Bug
- Resolution: Won't Do
- Minor
- None
- 4.12
- No
- False
Description
Description of problem:
We experienced an issue where the customer attempted to create machines in a region where AWS did not have capacity for the requested instance type. The only way to debug this manually is to look through the machine-controller logs, where we saw messages similar to:
AWS does not have capacity in the [region] for [instance-type]. Please select a different instance type or region
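For reference, a minimal sketch of that manual log-grepping workaround using the Python kubernetes client; the pod label selector and the "machine-controller" container name are assumptions and may differ between OpenShift versions:

# Sketch of the manual workaround: grep the machine-controller logs for
# capacity errors. Assumes cluster credentials in ~/.kube/config and that
# the controller runs as the "machine-controller" container in pods labeled
# k8s-app=controller in openshift-machine-api (names are assumptions).
from kubernetes import client, config

config.load_kube_config()
core = client.CoreV1Api()

pods = core.list_namespaced_pod(
    "openshift-machine-api", label_selector="k8s-app=controller"
)
for pod in pods.items:
    logs = core.read_namespaced_pod_log(
        pod.metadata.name, "openshift-machine-api", container="machine-controller"
    )
    for line in logs.splitlines():
        if "does not have capacity" in line:
            print(f"{pod.metadata.name}: {line}")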
Version-Release number of selected component (if applicable):
N/A
How reproducible:
100% of the time when AWS is at capacity (or when other issues arise)
Steps to Reproduce:
1. Create a MachineSet with an instance type that AWS does not have capacity for.
2. Let the machine API hit the error and observe it in the machine-controller logs.
Actual results:
The error is only logged in the machine-api controller logs.
Expected results:
It would be nice to see this as a condition on the Machine, or in some other machine-parseable format. We could then draw metrics from it to better tune our alerting, or automate letting our managed customers know that they cannot spin up machines in a region because of AWS capacity issues (or other issues). The main thing we're looking for is a deterministic way to surface these errors that does not involve something like grepping through logs.
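To illustrate, a rough sketch of the kind of check we would like to be able to write, assuming a hypothetical condition type such as "InsufficientCapacity" were set on the Machine (the condition name is made up here; Machines are read through the CustomObjectsApi of the Python kubernetes client):

from kubernetes import client, config

config.load_kube_config()
api = client.CustomObjectsApi()

# Machines are a custom resource in machine.openshift.io/v1beta1,
# in the openshift-machine-api namespace.
machines = api.list_namespaced_custom_object(
    group="machine.openshift.io",
    version="v1beta1",
    namespace="openshift-machine-api",
    plural="machines",
)

for machine in machines["items"]:
    for cond in machine.get("status", {}).get("conditions", []):
        # "InsufficientCapacity" is a hypothetical condition type; today the
        # error is only visible in the machine-controller logs.
        if cond.get("type") == "InsufficientCapacity" and cond.get("status") == "True":
            print(machine["metadata"]["name"], cond.get("message"))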
Additional info:
Similarly, if possible, it would be nice to see this happen for other classes of errors. For example, if the node becomes unschedulable for whatever reason (e.g. an autoscaler scale-down event) and the machine is being replaced, but a pod is failing to drain, it would be nice to have a deterministic reason why the machine is stuck in "deleting", something we could write a script against, for example:

if machine is deleting:
    if machine conditions contain type == DrainFailed:
        get machine condition where type == DrainFailed
        get Pods out of condition message
        notify customer of pods that are failing to drain
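To make that concrete, a sketch of what such a script could look like, assuming a hypothetical "DrainFailed" condition whose message lists the stuck pods (both the condition type and the message format are assumptions, not an existing API):

from kubernetes import client, config

config.load_kube_config()
api = client.CustomObjectsApi()

machines = api.list_namespaced_custom_object(
    group="machine.openshift.io",
    version="v1beta1",
    namespace="openshift-machine-api",
    plural="machines",
)

for machine in machines["items"]:
    # A machine being deleted has metadata.deletionTimestamp set.
    if not machine["metadata"].get("deletionTimestamp"):
        continue
    for cond in machine.get("status", {}).get("conditions", []):
        # "DrainFailed" is a hypothetical condition type; the pods failing to
        # drain are assumed to be listed in the condition message.
        if cond.get("type") == "DrainFailed" and cond.get("status") == "True":
            print(
                f"Machine {machine['metadata']['name']} is stuck deleting; "
                f"pods failing to drain: {cond.get('message')}"
            )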