-
Story
-
Resolution: Obsolete
-
Undefined
-
None
-
None
-
None
-
None
-
BU Product Work
-
False
-
-
False
-
-
As an HCP Karpenter management cluster admin, I want to expose events and metrics to improve observability and monitoring when deleting a hosted cluster with autonode enabled.
Events:
This requires a bit more research/experimentation to determine what can actually cause a deletion to get stuck, but some likely general causes are:
- A Node/NodeClaim cannot be deleted and has been stuck for a long time
- A NodePool cannot be deleted and has been stuck for a long time
Most likely we won't be able to know why a provisioned instance cannot be deleted (if we can determine the reason, even better!), so a reasonably long wait without any status updates is probably the best indicator that something went wrong.
Metrics:
This story should also cover the metrics that the operator could expose when initiating a deletion.
Some likely candidates are:
- How long a Karpenter tear down is taking
- How many nodes are being deleted in the hosted cluster
- Current progress of nodes/nodeclaims deleted, e.g. 4/10 deleted
Alerts:
There may also be potential to emit OpenShift alerts when a deletion has stalled or completed.
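A stalled-deletion alert could be expressed as a PrometheusRule once a suitable metric exists. The sketch below is purely illustrative: the metric name `karpenter_teardown_last_progress_timestamp_seconds` and the 15-minute threshold are placeholders, not existing HyperShift metrics.

```yaml
# Hypothetical alert rule; metric name and threshold are placeholders.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: karpenter-teardown-alerts
spec:
  groups:
  - name: karpenter-teardown
    rules:
    - alert: KarpenterTeardownStalled
      expr: time() - karpenter_teardown_last_progress_timestamp_seconds > 900
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: Hosted cluster Karpenter teardown has not progressed for 15 minutes.
```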
- depends on
-
AUTOSCALE-5 investigate which metrics we should expose
-
- Closed
-