-
Story
-
Resolution: Obsolete
-
Undefined
-
None
-
None
-
None
-
None
-
BU Product Work
-
False
-
-
False
-
-
As an HCP Karpenter management cluster admin, I want to expose events and metrics to improve observability and monitoring when deleting a hosted cluster with autonode enabled.
Events:
This requires a bit more research/experimentation to determine what can actually cause a deletion to get stuck, but some likely general causes are:
- A Node/NodeClaim cannot be deleted and has been stuck for a long time
- A NodePool cannot be deleted and has been stuck for a long time
Most likely we won't be able to know why a provisioned instance cannot be deleted (if we can determine the reason, even better!), so a reasonably long wait without any status updates is probably the best indicator that something went wrong.
Metrics:
This story should also cover the metrics that the operator could expose when initiating a deletion.
Some likely candidates are:
- How long a Karpenter tear down is taking
- How many nodes are being deleted in the hosted cluster
- Current progress of nodes/nodeclaims deleted, e.g. 4/10 deleted
Alerts:
There may also be potential to emit OpenShift alerts when a deletion has stalled or completed.
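A stalled-deletion alert could be expressed as a PrometheusRule once a suitable metric exists. The sketch below is purely illustrative: the metric name `karpenter_teardown_last_progress_timestamp_seconds` and the 15-minute threshold are placeholders, not existing HyperShift metrics.

```yaml
# Hypothetical alert rule; metric name and threshold are placeholders.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: karpenter-teardown-alerts
spec:
  groups:
  - name: karpenter-teardown
    rules:
    - alert: KarpenterTeardownStalled
      expr: time() - karpenter_teardown_last_progress_timestamp_seconds > 900
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: Hosted cluster Karpenter teardown has not progressed for 15 minutes.
```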
- depends on
-
AUTOSCALE-5 investigate which metrics we should expose
-
- Closed
-