Details
- Bug
- Resolution: Won't Do
- Minor
- None
- 4.12
- No
- False
Description
Description of problem:
We experienced an issue where the customer attempted to create machines in a region where AWS did not have capacity for the requested instance type. The only way to debug this manually is to look through the machine-controller logs, where we saw messages similar to:
AWS does not have capacity in the [region] for [instance-type]. Please select a different instance type or region
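For reference, a minimal sketch of that manual log-grepping workaround using the Python kubernetes client; the pod label selector and the "machine-controller" container name are assumptions and may differ between OpenShift versions:

# Sketch of the manual workaround: grep the machine-controller logs for
# capacity errors. Assumes cluster credentials in ~/.kube/config and that
# the controller runs as the "machine-controller" container in pods labeled
# k8s-app=controller in openshift-machine-api (names are assumptions).
from kubernetes import client, config

config.load_kube_config()
core = client.CoreV1Api()

pods = core.list_namespaced_pod(
    "openshift-machine-api", label_selector="k8s-app=controller"
)
for pod in pods.items:
    logs = core.read_namespaced_pod_log(
        pod.metadata.name, "openshift-machine-api", container="machine-controller"
    )
    for line in logs.splitlines():
        if "does not have capacity" in line:
            print(f"{pod.metadata.name}: {line}")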
Version-Release number of selected component (if applicable):
N/A
How reproducible:
100% of the time when AWS is at capacity (or when other issues arise)
Steps to Reproduce:
1. Create a MachineSet with an instance type that AWS does not have capacity for.
2. Let the machine API hit the error and observe it in the machine-controller logs.
Actual results:
The error is only logged in the machine-api controller logs.
Expected results:
It would be nice to see this as a condition on the Machine, or in some other machine-parseable format. We could then draw metrics from it to better tune our alerting, or automate letting our managed customers know that they cannot spin up machines in a region because of AWS capacity issues (or other issues). The main thing we're looking for is a deterministic way to surface these errors that does not involve something like grepping through logs.
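To illustrate, a rough sketch of the kind of check we would like to be able to write, assuming a hypothetical condition type such as "InsufficientCapacity" were set on the Machine (the condition name is made up here; Machines are read through the CustomObjectsApi of the Python kubernetes client):

from kubernetes import client, config

config.load_kube_config()
api = client.CustomObjectsApi()

# Machines are a custom resource in machine.openshift.io/v1beta1,
# in the openshift-machine-api namespace.
machines = api.list_namespaced_custom_object(
    group="machine.openshift.io",
    version="v1beta1",
    namespace="openshift-machine-api",
    plural="machines",
)

for machine in machines["items"]:
    for cond in machine.get("status", {}).get("conditions", []):
        # "InsufficientCapacity" is a hypothetical condition type; today the
        # error is only visible in the machine-controller logs.
        if cond.get("type") == "InsufficientCapacity" and cond.get("status") == "True":
            print(machine["metadata"]["name"], cond.get("message"))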
Additional info:
Similarly, if possible, it would be nice to see this happen for other classes of errors. For example, if the node becomes unschedulable for whatever reason (e.g. an autoscaler scale-down event) and the machine is being replaced, but a pod is failing to drain, it would be nice to have a deterministic reason why the machine is stuck in "deleting", something we could write a script against, for example:

if machine is deleting:
    if machine conditions contain type == DrainFailed:
        get machine condition where type == DrainFailed
        get Pods out of condition message
        notify customer of pods that are failing to drain
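To make that concrete, a sketch of what such a script could look like, assuming a hypothetical "DrainFailed" condition whose message lists the stuck pods (both the condition type and the message format are assumptions, not an existing API):

from kubernetes import client, config

config.load_kube_config()
api = client.CustomObjectsApi()

machines = api.list_namespaced_custom_object(
    group="machine.openshift.io",
    version="v1beta1",
    namespace="openshift-machine-api",
    plural="machines",
)

for machine in machines["items"]:
    # A machine being deleted has metadata.deletionTimestamp set.
    if not machine["metadata"].get("deletionTimestamp"):
        continue
    for cond in machine.get("status", {}).get("conditions", []):
        # "DrainFailed" is a hypothetical condition type; the pods failing to
        # drain are assumed to be listed in the condition message.
        if cond.get("type") == "DrainFailed" and cond.get("status") == "True":
            print(
                f"Machine {machine['metadata']['name']} is stuck deleting; "
                f"pods failing to drain: {cond.get('message')}"
            )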