-
Feature
-
Resolution: Unresolved
-
Major
-
None
-
None
-
BU Product Work
-
100% To Do, 0% In Progress, 0% Done
Feature Overview (aka. Goal Summary)
Facilitate debugging of failures related to machine pool creation, node joining, and node readiness in HyperShift. This feature aims to provide clear, actionable insights into what conditions or issues are blocking these critical operations, thereby significantly reducing the time and effort needed to troubleshoot and resolve these issues.
Background
Current debugging processes for clusters' machine pool failures are manual and time-consuming. This feature aims to streamline and improve the efficiency of the debugging process by providing automated, clear insights into blocking conditions.
Goals (aka. expected user outcomes)
The primary outcome is for users, particularly system administrators and cluster service providers (in HCP terminology), to quickly identify and resolve blockages related to machine pool creation, nodes failing to join the cluster, and nodes that cannot become ready in HyperShift clusters. This feature will expand the functionality of the existing status and metrics system to include detailed indicators of machine pool creation progress, node joining status, node readiness, and specific conditions blocking these processes.
Ideally we want to surface detailed error messages (like the ones found in the AWSMachine's InstanceReady condition) to the Machine's InfrastructureReady condition, making troubleshooting easier.
One option is to also use the NodeHealthCheck add-on (after confirming that it is compatible with HCP), for example by replicating its conditions into the NodePool CR:
    # skip other fields here...
    unhealthyNodes:
    - name: unhealthy-node-name
      remediations:
      - resource:
          apiVersion: self-node-remediation.medik8s.io/v1alpha1
          kind: SelfNodeRemediation
          namespace: <SNR namespace>
          name: unhealthy-node-name
          uid: abcd-1234...
        started: 2023-03-20T15:05:05Z01:00
        timedOut: 2023-03-20T15:10:05Z01:00 # timed out
      # when using `escalatingRemediations`, the next remediator will be appended:
      - resource:
          apiVersion: reprovison.example.com/v1
          kind: ReprovisionRemediation
          namespace: example
          name: unhealthy-node-name
          uid: bcde-2345...
        started: 2023-03-20T15:10:07Z01:00 # no timeout set: ongoing remediation
    unhealthyConditions:
    - type: NetworkUnavailable
      status: "True"
      duration: 300s
etc..
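The replication idea above could look roughly like the following sketch. It takes a NodeHealthCheck-style status (as a plain dictionary shaped like the YAML sample) and collapses `unhealthyNodes` into a single condition that a NodePool could carry. The function name and the condition type `NodeHealthCheckHealthy` and its reasons are hypothetical, chosen only for illustration; no existing HyperShift or NHC API is implied.

```python
from typing import Any


def summarize_unhealthy_nodes(nhc_status: dict[str, Any]) -> dict[str, str]:
    """Collapse NHC-style `unhealthyNodes` into one condition-shaped dict.

    The condition type and reasons below are hypothetical placeholders.
    """
    nodes = nhc_status.get("unhealthyNodes") or []
    if not nodes:
        return {
            "type": "NodeHealthCheckHealthy",
            "status": "True",
            "reason": "NoUnhealthyNodes",
            "message": "No unhealthy nodes reported",
        }

    parts = []
    for node in nodes:
        remediations = node.get("remediations") or []
        if not remediations:
            parts.append(f"{node['name']}: no remediation started")
            continue
        # With escalating remediations the newest entry is appended last.
        latest = remediations[-1]
        kind = latest["resource"]["kind"]
        state = "timed out" if "timedOut" in latest else "ongoing"
        parts.append(f"{node['name']}: {kind} {state} since {latest['started']}")

    return {
        "type": "NodeHealthCheckHealthy",
        "status": "False",
        "reason": "UnhealthyNodes",
        "message": "; ".join(parts),
    }
```

Reporting only the most recent remediation per node keeps the message short; the full history would remain available on the NodeHealthCheck resource itself.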
Use Cases:
- System administrators troubleshooting machine pool failures.
- Automated systems monitoring and alerting on machine pool progress and blockages.
- Post-mortem analysis of machine pool related issues to prevent future occurrences.
Challenges:
- failureReason and failureMessage in the Machine status are considered terminal and cannot be reset once set. This may limit their use for transient errors like permission issues.
- The Cluster API doesn't explicitly require an InstanceReady condition on the infrastructure provider, which could make standardizing this behavior across cloud providers challenging.
Potential Solutions:
- Enhance provider-specific Machine controllers: Modify the AWSMachine controller (and potentially others) to populate failureReason and failureMessage with more specific information when appropriate, even for potentially transient errors. This would be the most straightforward approach for AWS.
- Introduce a new condition or field: Propose a change to the Cluster API to either introduce a new condition (like InstanceProvisioning) or add a field to the InfrastructureReady condition to specifically capture detailed error messages. This would require collaboration with the Cluster API community and would be a more comprehensive solution across all providers.
- Leverage Events more effectively: While Events are not a perfect solution due to their ephemeral nature, we could explore tools or mechanisms to surface relevant Events alongside the Machine status in the HyperShift console or CLI. This could provide a temporary workaround while a more robust solution is developed.
- ...
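As a rough illustration of the condition-based direction, the sketch below copies the reason and message from a provider-level condition (for example the AWSMachine's InstanceReady) onto the Machine's InfrastructureReady condition while the infrastructure is not ready. It works on plain dictionaries shaped like Kubernetes status conditions; the helper names are hypothetical, and this is not how any existing controller is implemented. Unlike failureReason and failureMessage, a condition written this way is overwritten on each reconcile, so it suits transient errors.

```python
from typing import Any, Optional

Condition = dict[str, str]


def find_condition(conditions: list[Condition], cond_type: str) -> Optional[Condition]:
    """Return the first condition of the given type, or None."""
    for cond in conditions:
        if cond.get("type") == cond_type:
            return cond
    return None


def surface_infra_error(
    machine: dict[str, Any],
    infra_machine: dict[str, Any],
    source_type: str = "InstanceReady",
) -> list[Condition]:
    """Return a copy of the Machine's conditions in which InfrastructureReady
    carries the provider condition's reason and message while it is not True."""
    source = find_condition(
        infra_machine.get("status", {}).get("conditions", []), source_type
    )
    # Copy so the caller's Machine object is left untouched.
    conditions = [dict(c) for c in machine.get("status", {}).get("conditions", [])]
    if source is None or source.get("status") == "True":
        return conditions

    target = find_condition(conditions, "InfrastructureReady")
    if target is None:
        target = {"type": "InfrastructureReady", "status": "False"}
        conditions.append(target)
    if target.get("status") != "True":
        target["reason"] = source.get("reason", target.get("reason", ""))
        target["message"] = source.get("message", "")
    return conditions
```

Because `source_type` is a parameter, providers that expose a differently named condition could reuse the same logic, which partly addresses the lack of a required InstanceReady condition in the Cluster API contract.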
Requirements (aka. Acceptance Criteria):
- Provide clear status messages indicating the current stage of machine pool creation.
- Highlight specific conditions that are blocking machine pool creations, nodes joining the cluster or nodes becoming ready, such as quotas or instance availability.
- Integrate these indicators into the existing monitoring and metrics systems.
- Ensure compatibility with self-managed and managed deployments.
- Support all architectures: x86_64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x).
- Ensure the feature is secure, reliable, maintainable, and scalable.
- Backport to applicable versions as needed.
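The first two criteria (a clear stage indicator plus the specific blocking condition) could be met by something like the sketch below, which walks an ordered list of conditions and reports the first one that is not yet true. The stage ordering and the condition types `NodeJoined` and `NodeReady` are assumptions made for illustration and are not existing HyperShift condition names.

```python
# Ordered stages; a stage is complete once its condition is "True".
# NodeJoined and NodeReady are hypothetical condition types.
STAGES = [
    ("InfrastructureReady", "provisioning infrastructure"),
    ("NodeJoined", "waiting for the node to join the cluster"),
    ("NodeReady", "waiting for the node to become ready"),
]


def describe_progress(conditions: list[dict]) -> str:
    """Return a one-line status naming the current stage and its blocker."""
    by_type = {c["type"]: c for c in conditions}
    for cond_type, stage in STAGES:
        cond = by_type.get(cond_type)
        if cond is None or cond.get("status") != "True":
            detail = (cond or {}).get("message") or "no detail reported"
            return f"Stage: {stage}; blocked by {cond_type}: {detail}"
    return "Stage: ready; no blocking conditions"
```

The same string could feed a status message, while the blocking condition type alone is a natural label for a metric.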
Questions to Answer:
- What specific conditions are most commonly blocking machine pool creations, nodes joining the cluster, and nodes becoming ready?
- How can these conditions be automatically detected and reported?
- What level of detail is needed in the status messages to be most useful?
Out of Scope
- Debugging issues unrelated to machine pool creation.
- Integration with third-party monitoring tools beyond the scope of OpenShift Console and OCM.
Deployment considerations
Deployment considerations | List applicable specific needs (N/A = not applicable) |
---|---|
Self-managed, managed, or both | Both |
Classic (standalone cluster) | N/A |
Hosted control planes | Applicable |
Multi node, Compact (three node), or Single node (SNO), or all | N/A |
Connected / Restricted Network | Both |
Architectures, e.g. x86_64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x) | All architectures |
Operator compatibility | Must be compatible with relevant operators |
Backport needed (list applicable versions) | Specify versions as identified during refinement |
UI need (e.g. OpenShift Console, dynamic plugin, OCM) | Integration with OpenShift Console and OCM |
Other (please specify) | N/A |
Documentation Considerations
Detailed documentation on how to interpret new status messages and metrics will be required. This should include troubleshooting guides and examples. Any changes should be reflected in the existing HyperShift and OpenShift documentation.
Interoperability Considerations
This feature impacts HyperShift machine pool creation on ROSA, ARO, and self-managed HCP. Interoperability test scenarios should include these environments to ensure consistent behavior and reliability across the portfolio.
clones
- OCPSTRAT-1598 Enhanced Debuggability for HyperShift Cluster Installation Failures (Backlog)
is depended on by
- RFE-5574 Observability (Metrics) for Worker Node Creation (Accepted)
- RFE-6162 Facilitate debugging clusters stuck installing (Accepted)
is related to
- OCPSTRAT-1828 Enhance NodeHealthCheck (NHC) Functionality in Hosted Control Planes to Integrate with Upgrade Signals (New)