  1. OpenShift Bugs
  2. OCPBUGS-14306

Machine CRs should have machine-parseable representations of errors

    Description

      Description of problem:

      We hit an issue where a customer tried to create machines in a region where AWS did not have capacity. To debug this manually, you have to look in the machine-controller logs, where we saw messages similar to:

      AWS does not have capacity in the [region] for [instance-type]. Please select a different instance type or region
      

      Version-Release number of selected component (if applicable):

      N/A
      

      How reproducible:

      100% of the time that AWS is at capacity (or other issues arise)
      

      Steps to Reproduce:

      1. Create a MachineSet with an instance type that AWS does not have capacity for
      2. Let the Machine API hit the error, then find it in the machine-controller logs
      

      Actual results:

      The error is only written to the machine-api controller logs
      

      Expected results:

      It would be nice to see this in a condition on the machine, or in some other machine-parseable format. We could then draw metrics from it to better tune our alerting, or automate letting our managed customers know that they can't spin up machines in a region because of AWS capacity issues or other issues.
      
      The main thing we're looking for is a deterministic way to detect these issues that does not involve something like grepping through logs.
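
To illustrate the kind of consumer we have in mind, here is a minimal Python sketch that counts machines by failure reason so the counts could feed a metric or an alert. It assumes Machine objects have already been loaded as dictionaries (for example from JSON output) and that errors are exposed under `status.conditions`. The condition type `InstanceProvisioned` and the reason `InsufficientCapacity` are hypothetical names invented for this example, not existing Machine API values.

```python
from collections import Counter


def count_failure_reasons(machines, condition_type):
    """Count machines per reason where the given condition type is not "True"."""
    counts = Counter()
    for machine in machines:
        conditions = machine.get("status", {}).get("conditions", [])
        for cond in conditions:
            if cond.get("type") == condition_type and cond.get("status") != "True":
                counts[cond.get("reason", "Unknown")] += 1
    return counts


# Sample data. The condition type and reason below are hypothetical.
sample_machines = [
    {
        "metadata": {"name": "worker-a"},
        "status": {"conditions": [
            {"type": "InstanceProvisioned", "status": "False",
             "reason": "InsufficientCapacity"},
        ]},
    },
    {
        "metadata": {"name": "worker-b"},
        "status": {"conditions": [
            {"type": "InstanceProvisioned", "status": "True",
             "reason": "Provisioned"},
        ]},
    },
    {"metadata": {"name": "worker-c"}, "status": {}},
]

reasons = count_failure_reasons(sample_machines, "InstanceProvisioned")
for reason, count in reasons.items():
    print(f"{reason}: {count}")
```

Matching on a stable `reason` string rather than on the free-text message is what would make this deterministic.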
      

      Additional info:

      Similarly, if possible, it would be nice to see this happen for other classes of logged errors. For example, if a node becomes unschedulable for whatever reason (e.g., an autoscaling scale-down event), the machine is being replaced, and a pod is failing to drain, it would be nice to have a deterministic reason why the machine is stuck in "deleting", something we could write a script against, like:
      
      if machine is deleting:
        if machine conditions contain type == "DrainFailed":
          condition = machine condition where type == "DrainFailed"
          pods = pods parsed out of condition message
          notify customer of the pods that are failing to drain
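
The pseudocode above could look roughly like the following Python sketch. It is only an illustration under several assumptions: the `DrainFailed` condition type is the one proposed in this issue and does not exist today, the message format ("... pods: namespace/name, namespace/name") is invented for the example, and "deleting" is detected by the presence of `metadata.deletionTimestamp`.

```python
def failing_drain_pods(machine):
    """Return the pods named in a DrainFailed condition on a deleting machine.

    Returns an empty list if the machine is not being deleted or has no
    DrainFailed condition with status "True".
    """
    if not machine.get("metadata", {}).get("deletionTimestamp"):
        return []  # machine is not being deleted

    for cond in machine.get("status", {}).get("conditions", []):
        if cond.get("type") == "DrainFailed" and cond.get("status") == "True":
            # Assumed message format: "<text> pods: ns/name, ns/name"
            _, _, pod_list = cond.get("message", "").partition("pods:")
            return [pod.strip() for pod in pod_list.split(",") if pod.strip()]
    return []


# Sample machine. The condition and its message format are hypothetical.
stuck_machine = {
    "metadata": {
        "name": "worker-a",
        "deletionTimestamp": "2023-05-30T12:00:00Z",
    },
    "status": {"conditions": [
        {"type": "DrainFailed", "status": "True",
         "message": "failed to evict pods: default/web-0, payments/api-1"},
    ]},
}

for pod in failing_drain_pods(stuck_machine):
    print(f"notify customer: pod {pod} is failing to drain")
```

A structured field listing the pods would be more robust than parsing the message, but even a stable condition type would remove the need to grep logs.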
      

          People

            ddonati@redhat.com Damiano Donati
            iamkirkbater Kirk Bater
            Huali Liu Huali Liu