Type: Feature
Resolution: Unresolved
Priority: Major
Fix Version/s: None
Affects Version/s: None
Component/s: Hosted Control Planes
Labels:

Activity Type:
Product / Portfolio Work
Parent Link:
OCPSTRAT-1853 Enhanced Visibility into Control Plane and Data Plane Metrics
Blocked:
False
Blocked Reason:

None
Ready:
False
Size:
None

Business Value:
8

Target Version:

openshift-4.21
Release Blocker:
None

SFDC Cases Counter:
SFDC Cases Open:
SFDC Cases Links:

PX Review Complete:
None
PX Priority Data:
None
PX Impact Score:
PX Technical Impact:
None
PX Impact Range:
None
PX Scheduling Request:
None
PX Technical Impact Notes:
None

Intelligence Requested:
Market:

Feature Overview (aka. Goal Summary)

Facilitate debugging of failures related to machine pool creation, node joining, and node readiness in HyperShift. This feature aims to provide clear, actionable insights into what conditions or issues are blocking these critical operations, thereby significantly reducing the time and effort needed to troubleshoot and resolve these issues.

Background

Current debugging processes for clusters' machine pool failures are manual and time-consuming. This feature aims to streamline and improve the efficiency of the debugging process by providing automated, clear insights into blocking conditions.

Goals (aka. expected user outcomes)

The primary outcome is for users, particularly system administrators and cluster service providers (in HCP terminology), to quickly identify and resolve blockages related to machine pool creation, nodes failing to join the cluster, and nodes that cannot become ready in HyperShift clusters. This feature will expand the functionality of the existing status and metrics system to include detailed indicators of machine pool creation progress, node joining status, node readiness, and specific conditions blocking these processes.

Ideally we want to surface detailed error messages (like the ones found in the AWSMachine's InstanceReady condition) to the Machine's InfrastructureReady condition, making troubleshooting easier.
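
A minimal sketch of this propagation, assuming conditions are plain dictionaries with `type`, `status`, `reason`, and `message` keys, as in Kubernetes status conditions. The helper name `surface_infra_message` and the sample reason and message are hypothetical; this is not HyperShift or Cluster API code:

```python
from typing import Optional


def find_condition(conditions: list[dict], cond_type: str) -> Optional[dict]:
    """Return the first condition of the given type, or None."""
    for cond in conditions:
        if cond.get("type") == cond_type:
            return cond
    return None


def surface_infra_message(machine_conditions: list[dict],
                          infra_conditions: list[dict],
                          source_type: str = "InstanceReady") -> list[dict]:
    """Copy reason/message from a failing provider condition onto the
    Machine's InfrastructureReady condition (hypothetical helper)."""
    source = find_condition(infra_conditions, source_type)
    result = [dict(c) for c in machine_conditions]  # do not mutate the input
    if source is None or source.get("status") == "True":
        return result  # nothing to surface
    target = find_condition(result, "InfrastructureReady")
    if target is None:
        target = {"type": "InfrastructureReady"}
        result.append(target)
    target["status"] = "False"
    target["reason"] = source.get("reason", "")
    target["message"] = source.get("message", "")
    return result


# Illustrative data only; the reasons and messages are made up.
aws_machine = [{"type": "InstanceReady", "status": "False",
                "reason": "InstanceProvisionFailed",
                "message": "example: instance quota exceeded"}]
machine = [{"type": "InfrastructureReady", "status": "False",
            "reason": "WaitingForInfrastructure", "message": ""}]
updated = surface_infra_message(machine, aws_machine)
```

Because the result is a condition rather than a terminal field, it can be overwritten again once the provider condition recovers.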

One option is to also use the NodeHealthCheck addon (after confirming it is compatible with HCP) and replicate its conditions into the NodePool CR, for example:

  # skip other fields here...
  unhealthyNodes:
    - name: unhealthy-node-name
      remediations:
        - resource:
            apiVersion: self-node-remediation.medik8s.io/v1alpha1
            kind: SelfNodeRemediation
            namespace: <SNR namespace>
            name: unhealthy-node-name
            uid: abcd-1234...
          started: 2023-03-20T15:05:05Z
          timedOut: 2023-03-20T15:10:05Z # timed out
        # when using `escalatingRemediations`, the next remediator will be appended:
        - resource:
            apiVersion: reprovison.example.com/v1
            kind: ReprovisionRemediation
            namespace: example
            name: unhealthy-node-name
            uid: bcde-2345...
          started: 2023-03-20T15:10:07Z
          # no timeout set: ongoing remediation

unhealthyConditions:
  - type: NetworkUnavailable
    status: "True"
    duration: 300s

etc.
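
One way to picture the replication is a small function that folds the `unhealthyNodes` list above into a single NodePool-style condition. This is a sketch under assumptions: the condition type `AllNodesHealthy` and the reason strings are hypothetical, and the input is the status fragment already parsed into Python dictionaries:

```python
def summarize_unhealthy_nodes(unhealthy_nodes: list[dict]) -> dict:
    """Fold NodeHealthCheck-style unhealthyNodes entries into one condition.

    The condition type and reasons are hypothetical placeholders.
    """
    if not unhealthy_nodes:
        return {"type": "AllNodesHealthy", "status": "True",
                "reason": "AsExpected", "message": ""}
    parts = []
    for node in unhealthy_nodes:
        remediations = node.get("remediations", [])
        # A remediation without a timedOut field is treated as ongoing.
        ongoing = [r["resource"]["kind"] for r in remediations
                   if "timedOut" not in r]
        timed_out = [r["resource"]["kind"] for r in remediations
                     if "timedOut" in r]
        parts.append("{}: ongoing={}, timedOut={}".format(
            node["name"],
            ", ".join(ongoing) or "none",
            ", ".join(timed_out) or "none"))
    return {"type": "AllNodesHealthy", "status": "False",
            "reason": "UnhealthyNodesFound", "message": "; ".join(parts)}


# Trimmed version of the fragment above; only the fields the function reads.
status_fragment = [{
    "name": "unhealthy-node-name",
    "remediations": [
        {"resource": {"kind": "SelfNodeRemediation"},
         "timedOut": "2023-03-20T15:10:05Z"},
        {"resource": {"kind": "ReprovisionRemediation"}},
    ],
}]
condition = summarize_unhealthy_nodes(status_fragment)
```

Collapsing the per-node detail into one message keeps the NodePool status short while still naming the node and the remediation that is currently running.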

Use Cases:

System administrators troubleshooting machine pool failures.
Automated systems monitoring and alerting on machine pool progress and blockages.
Post-mortem analysis of machine pool related issues to prevent future occurrences.

Challenges:

failureReason and failureMessage in the Machine status are considered terminal and cannot be reset once set. This may limit their use for transient errors like permission issues.

The Cluster API doesn't explicitly require an InstanceReady condition on the infrastructure provider, which could make standardizing this behavior across cloud providers challenging.

Potential Solutions:

Enhance provider-specific Machine controllers: Modify the AWSMachine controller (and potentially others) to populate failureReason and failureMessage with more specific information when appropriate, even for potentially transient errors. This would be the most straightforward approach for AWS.
Introduce a new condition or field: Propose a change to the Cluster API to either introduce a new condition (like InstanceProvisioning) or add a field to the InfrastructureReady condition to specifically capture detailed error messages. This would require collaboration with the Cluster API community and would be a more comprehensive solution across all providers.
Leverage Events more effectively: While Events are not a perfect solution due to their ephemeral nature, we could explore tools or mechanisms to surface relevant Events alongside the Machine status in the HyperShift console or CLI. This could provide a temporary workaround while a more robust solution is developed.
...
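
As a rough illustration of the Events option, the sketch below picks the most recent Warning events for one Machine so they can be shown next to its status. It assumes events are dictionaries shaped like core/v1 Event objects (`involvedObject`, `type`, `reason`, `message`, `lastTimestamp`); the function name and the sample events are hypothetical:

```python
def recent_warnings(events: list[dict], kind: str, name: str,
                    limit: int = 3) -> list[str]:
    """Return up to `limit` newest Warning events for one object,
    formatted as 'reason: message'."""
    matching = [
        e for e in events
        if e.get("type") == "Warning"
        and e.get("involvedObject", {}).get("kind") == kind
        and e.get("involvedObject", {}).get("name") == name
    ]
    # UTC timestamps in one RFC 3339 layout sort correctly as strings;
    # events without lastTimestamp sort last (a simplification).
    matching.sort(key=lambda e: e.get("lastTimestamp") or "", reverse=True)
    return ["{}: {}".format(e.get("reason", ""), e.get("message", ""))
            for e in matching[:limit]]


# Made-up events for illustration.
events = [
    {"type": "Normal", "reason": "Created", "message": "example: created",
     "lastTimestamp": "2024-01-01T10:00:00Z",
     "involvedObject": {"kind": "Machine", "name": "worker-1"}},
    {"type": "Warning", "reason": "FailedCreate",
     "message": "example: insufficient capacity",
     "lastTimestamp": "2024-01-01T10:05:00Z",
     "involvedObject": {"kind": "Machine", "name": "worker-1"}},
    {"type": "Warning", "reason": "FailedCreate",
     "message": "example: permission denied",
     "lastTimestamp": "2024-01-01T10:01:00Z",
     "involvedObject": {"kind": "Machine", "name": "worker-1"}},
    {"type": "Warning", "reason": "FailedCreate",
     "message": "example: other machine",
     "lastTimestamp": "2024-01-01T10:06:00Z",
     "involvedObject": {"kind": "Machine", "name": "worker-2"}},
]
lines = recent_warnings(events, "Machine", "worker-1")
```

Since Events expire, a view like this only helps while the failure is recent, which is why the option is framed as a workaround.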

Requirements (aka. Acceptance Criteria):

Provide clear status messages indicating the current stage of machine pool creation.
Highlight specific conditions that are blocking machine pool creation, nodes joining the cluster, or nodes becoming ready, such as quotas or instance availability.
Integrate these indicators into the existing monitoring and metrics systems.
Ensure compatibility with self-managed and managed deployments.
Support for all architectures: x86_64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x).
Ensure the feature is secure, reliable, maintainable, and scalable.
Backport to applicable versions as needed.
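
To make the monitoring requirement concrete, here is a sketch that renders blocked node pools in the Prometheus text exposition format. The metric name `nodepool_blocked`, its labels, and the sample reason are hypothetical; this feature does not define any metric names yet:

```python
def escape_label(value: str) -> str:
    """Escape a label value for the Prometheus text format."""
    return (value.replace("\\", "\\\\")
                 .replace('"', '\\"')
                 .replace("\n", "\\n"))


def render_blocked_metrics(pools: list[dict]) -> str:
    """Emit one gauge sample per node pool: 1 if blocked, 0 otherwise."""
    lines = [
        "# HELP nodepool_blocked Whether a node pool is blocked (hypothetical metric).",
        "# TYPE nodepool_blocked gauge",
    ]
    for pool in pools:
        reason = pool.get("blocked_reason", "")
        value = 1 if reason else 0
        lines.append(
            'nodepool_blocked{{name="{}",reason="{}"}} {}'.format(
                escape_label(pool["name"]), escape_label(reason), value))
    return "\n".join(lines) + "\n"


# Made-up node pools for illustration.
sample = render_blocked_metrics([
    {"name": "workers-a", "blocked_reason": "QuotaExceeded"},
    {"name": "workers-b"},
])
```

Putting the blocking reason in a label lets an alert rule match on specific causes, at the cost of one time series per distinct reason.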

Questions to Answer:

What specific conditions are most commonly blocking machine pool creations, nodes joining the cluster, and nodes becoming ready?
How can these conditions be automatically detected and reported?
What level of detail is needed in the status messages to be most useful?

Out of Scope

Debugging issues unrelated to machine pool creation.
Integration with third-party monitoring tools beyond the scope of OpenShift Console and OCM.

Deployment considerations

Deployment considerations	List applicable specific needs (N/A = not applicable)
Self-managed, managed, or both	Both
Classic (standalone cluster)	N/A
Hosted control planes	Applicable
Multi node, Compact (three node), or Single node (SNO), or all	N/A
Connected / Restricted Network	Both
Architectures, e.g. x86_64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x)	All architectures
Operator compatibility	Must be compatible with relevant operators
Backport needed (list applicable versions)	Specify versions as identified during refinement
UI need (e.g. OpenShift Console, dynamic plugin, OCM)	Integration with OpenShift Console and OCM
Other (please specify)	N/A

Documentation Considerations

Detailed documentation on how to interpret new status messages and metrics will be required. This should include troubleshooting guides and examples. Any changes should be reflected in the existing HyperShift and OpenShift documentation.

Interoperability Considerations

This feature impacts HyperShift machine pool creation on ROSA, ARO, and self-managed HCP. Interoperability test scenarios should include these environments to ensure consistent behavior and reliability across the portfolio.

clones

OCPSTRAT-1598 [Observability] Enhanced Debuggability for HyperShift Cluster Installation Failures

Backlog

is depended on by

RFE-5574 Observability (Metrics) for Worker Node Creation

Approved

RFE-6162 Facilitate debugging clusters stuck installing

Approved

is related to

RFE-7188 Node drain reporting improvements

Backlog

RFE-7979 Enhance visibility into ROSA HCP instances

Backlog

ACM-24687 Integration MC failing to provision ROSA HCP workers (EC2 looping, pipelines blocked)

New

OCPSTRAT-1828 Enhance NodeHealthCheck (NHC) Functionality in Hosted Control Planes to Integrate with Upgrade Signals

New

relates to

RFE-7673 Enable Hosted Cluster users to monitor CMO stack

Approved

