OpenShift Autoscaling / AUTOSCALE-96

Implement HCP Karpenter deletion events and metrics


    • Type: Story
    • Resolution: Obsolete

      As an HCP Karpenter management cluster admin, I want events and metrics exposed when a hosted cluster with autonode enabled is deleted, so that the deletion is easier to observe and monitor.

      Events:

      More research and experimentation is needed to find out what can actually cause a deletion to get stuck, but some general candidates are:

      • A Node/NodeClaim cannot be deleted and has been stuck for a long time
      • A NodePool cannot be deleted and has been stuck for a long time

      Most likely we won't be able to tell why a provisioned instance cannot be deleted (if we can, even better), so waiting a reasonably long time without updates is probably a good indicator that something went wrong.

      Metrics:

      This story should also cover the metrics the operator could expose once a deletion has been initiated.

      Likely candidates are:

      • How long a Karpenter teardown is taking
      • How many nodes are being deleted in the hosted cluster
      • Current progress of Nodes/NodeClaims deleted, e.g. 4/10 deleted

      Alerts:

      There may also be room to emit OpenShift alerts when a deletion has stalled or has completed.

              Assignee: Unassigned
              Reporter: Max Cao (rh-ee-macao)