- Feature Request
- Resolution: Done
- Major
1. Proposed title of this feature request
Machines/workers status metrics
2. What is the nature and description of the request?
Currently we report a condition on the NodePool saying that instance creation failed, but with no detail on why. Similarly, when nodes fail to join the cluster, we get "Minimum availability requires 2 replicas, current 0 available", which tells downstream systems or customers only that the nodes are not available, with no detail on what went wrong.
3. Why does the customer need this? (List the business requirements here)
- Many things can go wrong, including the cloud provider running out of capacity or the network not being configured. Right now, customers cannot find out what happened in managed HCP services because they do not have access to the management clusters where the NodePool controllers run.
- Every time this happens, the customer has to open a support case, support has to create an incident for RH SRE, and SRE has to debug and respond. In the managed (ROSA) context, SRE has limited permissions to view the customer account, so analysis is impossible.
- Customers make an API call to create an HCP cluster that has its control plane in the RH account and its worker nodes in the customer account. The first half succeeds while the second half fails, with either no action from the system or no useful information returned for taking corrective action.
4. List any affected packages or components.
Hosted Control Planes, CAPA/CAPI Machines
- depends on
- OCPSTRAT-1615 Enhanced Debuggability for HyperShift Cluster NodePool Failures
- New