- Story
- Resolution: Obsolete
- Normal
Context:
Currently, the HyperShift Operator (HO) tracks HostedCluster (HC) deletion only through the metric hypershift_cluster_deletion_duration_seconds. Unfortunately, this metric is emitted only when a deletion succeeds (see this). As a result, we have no accurate picture of:
- the number of HCs stuck in deletion
- the reasons for these issues
Proposal:
A few changes could remediate this situation:
- Track the deletion steps in the hcluster.Status: This also simplifies investigations since we won't need to search logs to understand the state of the deletion.
- Introduce a timeout (or other logic) so the HO knows when an HC is stuck in a deletion step: The stuck state could also be reported in the hcluster.Status.
- Introduce logic in the HO to determine whether the stuck state is due to a known issue: In OSD, this is achieved by matching logs against regular expressions. This will massively reduce SREP's investigation time on stuck clusters. The hcluster.Status could be populated with the known OCPBUG number, allowing SREP to focus only on unknown issues.
- Emit a hypershift_cluster_deletion_stuck metric with labels for cluster_id, cluster_version and stuck_reason: This will allow for visibility in our monitoring.
- (Optional) Trigger a MustGather for HCs stuck for unknown reasons: The MustGather Operator, with the right image, can help with this step.