Type: Story
Resolution: Obsolete
Priority: Normal
Fix Version/s: None
Affects Version/s: None
Component/s: None
Labels:

Blocked:
False
Blocked Reason:
None
Ready:
False
Epic Link:
SDE-3181
Intelligence Requested:
Market:

Cost of Delay:
0
WSJF:
0
Risk Score:
0

SFDC Cases Links:
SFDC Cases Counter:
SFDC Cases Open:

Context:

Currently, the HyperShift Operator (HO) tracks HostedCluster deletion only through the metric hypershift_cluster_deletion_duration_seconds. This metric is emitted only when a deletion succeeds, so it cannot give us an accurate idea of:

the number of HCs stuck in deletion
the reasons these HCs are stuck

Proposal:

A few changes could remediate this situation:

Track the deletion steps in hcluster.Status: this also simplifies investigations, since we would no longer need to search logs to understand the state of a deletion.
Introduce a timeout (or other logic) so the HO knows when an HC is stuck in a deletion step: the stuck state could also be reported in hcluster.Status.
Introduce logic in the HO to determine whether the stuck state is due to a known issue: in OSD, this is achieved by matching logs against regular expressions. This would greatly reduce SREP's investigation time on stuck clusters. hcluster.Status could be populated with the known OCPBUG number, allowing SREP to focus only on unknown issues.
Emit a hypershift_cluster_deletion_stuck metric with labels for cluster_id, cluster_version and stuck_reason: this would give us visibility in our monitoring.
(Optional) Trigger a MustGather for HCs stuck for unknown reasons: the MustGather Operator, with the right image, can help with this step.
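
The first four steps above could fit together roughly as in the sketch below. It is written in Python only to keep it short and self-contained, and it is not HyperShift code. Everything in it is an assumption made for illustration: the DeletionStatus shape, the 30-minute STEP_TIMEOUT, the log pattern, the step name and the placeholder issue key are all hypothetical, since the ticket does not specify any of them. Only the label names cluster_id, cluster_version and stuck_reason come from the proposal.

```python
import re
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Dict, List, Optional

# Hypothetical timeout: the ticket does not say how long a step may take.
STEP_TIMEOUT = timedelta(minutes=30)

# Hypothetical known-issue table: (log pattern, issue key).
# Both the pattern and the key are placeholders, not real bug references.
KNOWN_ISSUES = [
    (re.compile(r"DependencyViolation.*security group"), "OCPBUG-PLACEHOLDER"),
]


@dataclass
class DeletionStatus:
    """Assumed shape of the deletion info that could live in hcluster.Status."""
    step: str                  # deletion step currently in progress
    step_started_at: datetime  # when that step began
    stuck: bool = False
    stuck_reason: str = ""


def classify(log_lines: List[str]) -> str:
    """Return the first matching known-issue key, or "unknown"."""
    for line in log_lines:
        for pattern, issue in KNOWN_ISSUES:
            if pattern.search(line):
                return issue
    return "unknown"


def evaluate(status: DeletionStatus, log_lines: List[str],
             now: datetime) -> DeletionStatus:
    """Mark the deletion as stuck once the current step exceeds the timeout."""
    if now - status.step_started_at > STEP_TIMEOUT:
        status.stuck = True
        status.stuck_reason = classify(log_lines)
    return status


def stuck_metric_labels(cluster_id: str, cluster_version: str,
                        status: DeletionStatus) -> Optional[Dict[str, str]]:
    """Labels for a hypershift_cluster_deletion_stuck sample, or None if not stuck."""
    if not status.stuck:
        return None
    return {
        "cluster_id": cluster_id,
        "cluster_version": cluster_version,
        "stuck_reason": status.stuck_reason,
    }


if __name__ == "__main__":
    started = datetime(2023, 6, 13, 8, 0, tzinfo=timezone.utc)
    status = DeletionStatus(step="hypothetical-step", step_started_at=started)
    logs = ["DependencyViolation: resource has a dependent security group"]
    evaluate(status, logs, now=started + timedelta(minutes=45))
    print(stuck_metric_labels("example-id", "4.13.0", status))
```

Under these assumptions, a stuck_reason of "unknown" is the signal that would route a cluster to SREP (and, optionally, trigger a MustGather), while any other value points at an already tracked issue.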

Assignee: Unassigned

Reporter: Benson Ngoy

Votes: 0

Watchers: 3

Created: 2023/06/13 8:14 AM

Updated: 2024/06/16 11:16 PM

Resolved: 2023/11/20 8:09 AM
