OpenShift Hosted Control Plane / HOSTEDCP-1047

Hypershift Operator doesn't track HCs stuck in deletion


      Context:

      Currently, the Hypershift Operator (HO) only tracks HostedCluster (HC) deletion through the metric hypershift_cluster_deletion_duration_seconds. Unfortunately, this metric is only emitted when a deletion completes successfully. This prevents us from having an accurate idea of:

      • the number of HCs stuck in deletion
      • the reasons for these issues 

      Proposal:

      A few changes could remediate this situation: 

      • Track the deletion steps in the hcluster.Status: This also simplifies investigations since we won't need to search logs to understand the state of the deletion. 
      • Introduce a timeout (or other logic) for the HO to know when an HC is stuck in a deletion step: The stuck state could also be reported in the hcluster.Status.
      • Introduce logic in HO to know if the stuck state is due to a known issue: In OSD, this is achieved by matching logs to regexps. This would substantially reduce SREP's investigation time on stuck clusters. The hcluster.Status could be populated with the known OCPBUG number, allowing SREP to focus only on unknown issues.
      • Emit a hypershift_cluster_deletion_stuck metric with labels for cluster_id, cluster_version and stuck_reason: This will allow for visibility in our monitoring. 
      • (optional) Trigger a MustGather for HCs stuck for unknown reasons: MustGather Operator with the right image can help with this step.

              Assignee: Unassigned
              Reporter: Benson Ngoy (benson.ngoy)