OpenShift Container Platform (OCP) Strategy, OCPSTRAT-1615

Enhanced Debuggability for HyperShift Cluster NodePool Failures


      Feature Overview (aka. Goal Summary)

      Facilitate debugging of failures related to machine pool creation, node joining, and node readiness in HyperShift. This feature aims to provide clear, actionable insights into what conditions or issues are blocking these critical operations, thereby significantly reducing the time and effort needed to troubleshoot and resolve these issues.

      Background

      Current processes for debugging machine pool failures in clusters are manual and time-consuming. This feature aims to make debugging more efficient by providing automated, clear insights into blocking conditions.

      Goals (aka. expected user outcomes)

      The primary outcome is for users, particularly system administrators and cluster service providers (in HCP terminology), to quickly identify and resolve blockages related to machine pool creation, nodes failing to join the cluster, and nodes that cannot become ready in HyperShift clusters. This feature will expand the functionality of the existing status and metrics system to include detailed indicators of machine pool creation progress, node joining status, node readiness, and specific conditions blocking these processes.

       

      Ideally we want to surface detailed error messages (like the ones found in the AWSMachine's InstanceReady condition) to the Machine's InfrastructureReady condition, making troubleshooting easier.
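
      The propagation described above can be sketched as follows. This is a minimal illustration on plain dictionaries, not the Cluster API or HyperShift implementation: the helper names (`find_condition`, `surface_infra_error`) and the sample reason and message strings are assumptions made for the example.

```python
from typing import Optional


def find_condition(conditions: list[dict], cond_type: str) -> Optional[dict]:
    """Return the first condition of the given type, or None."""
    for cond in conditions:
        if cond.get("type") == cond_type:
            return cond
    return None


def surface_infra_error(machine_conditions: list[dict],
                        infra_conditions: list[dict]) -> list[dict]:
    """Return a copy of the Machine conditions in which InfrastructureReady
    carries the reason and message of a failing InstanceReady condition."""
    result = [dict(c) for c in machine_conditions]  # do not mutate the input
    instance = find_condition(infra_conditions, "InstanceReady")
    if instance is None or instance.get("status") == "True":
        return result  # nothing to surface
    target = find_condition(result, "InfrastructureReady")
    if target is None:
        target = {"type": "InfrastructureReady"}
        result.append(target)
    target["status"] = "False"
    target["reason"] = instance.get("reason", "")
    target["message"] = instance.get("message", "")
    return result


# Illustrative data only: the reason and message values are assumptions.
aws_machine_conditions = [
    {"type": "InstanceReady", "status": "False",
     "reason": "InstanceProvisionFailed",
     "message": "hypothetical: insufficient capacity in the requested zone"},
]
machine_conditions = [
    {"type": "InfrastructureReady", "status": "False",
     "reason": "WaitingForInfrastructure", "message": ""},
]

updated = surface_infra_error(machine_conditions, aws_machine_conditions)
print(updated[0]["reason"], "-", updated[0]["message"])
```

      With this shape, a user inspecting only the Machine would see the provider's reason and message without having to look up the AWSMachine.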

       

      One idea is to also use the NodeHealthCheck addon (after confirming that it is compatible with HCP) and, for example, replicate its conditions into the NodePool CR:

        # skip other fields here...
        unhealthyNodes:
          - name: unhealthy-node-name
            remediations:
              - resource:
                  apiVersion: self-node-remediation.medik8s.io/v1alpha1
                  kind: SelfNodeRemediation
                  namespace: <SNR namespace>
                  name: unhealthy-node-name
                  uid: abcd-1234...
                started: 2023-03-20T15:05:05Z01:00
                timedOut: 2023-03-20T15:10:05Z01:00 # timed out
              # when using `escalatingRemediations`, the next remediator will be appended:   
              - resource:
                  apiVersion: reprovision.example.com/v1
                  kind: ReprovisionRemediation
                  namespace: example
                  name: unhealthy-node-name
                  uid: bcde-2345...
                started: 2023-03-20T15:10:07Z01:00
                # no timeout set: ongoing remediation 
        unhealthyConditions:
          - type: NetworkUnavailable
            status: "True"
            duration: 300s

      etc.
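
      As a rough sketch of how such a status could be condensed into a single NodePool condition, the function below summarizes an `unhealthyNodes` list shaped like the example above. The input is a plain dictionary, and the condition type, reasons, and message format are assumptions for illustration rather than an existing HyperShift or NodeHealthCheck API.

```python
def summarize_unhealthy_nodes(nhc_status: dict) -> dict:
    """Build one condition-like dict from an NHC-style status.

    A remediation without a `timedOut` field is treated as still ongoing,
    following the comments in the example status.
    """
    nodes = nhc_status.get("unhealthyNodes", [])
    if not nodes:
        return {"type": "AllNodesHealthy", "status": "True",
                "reason": "AsExpected", "message": ""}

    parts = []
    for node in nodes:
        remediations = node.get("remediations", [])
        ongoing = [r["resource"]["kind"] for r in remediations if "timedOut" not in r]
        timed_out = [r["resource"]["kind"] for r in remediations if "timedOut" in r]
        details = []
        if ongoing:
            details.append("ongoing: " + ", ".join(ongoing))
        if timed_out:
            details.append("timed out: " + ", ".join(timed_out))
        detail_text = "; ".join(details) if details else "no remediation"
        parts.append(node["name"] + " (" + detail_text + ")")

    return {
        "type": "AllNodesHealthy",      # assumed condition type
        "status": "False",
        "reason": "UnhealthyNodes",     # assumed reason
        "message": str(len(nodes)) + " unhealthy node(s): " + ", ".join(parts),
    }


# Mirrors the example status; uids and namespaces are omitted for brevity.
example_status = {
    "unhealthyNodes": [{
        "name": "unhealthy-node-name",
        "remediations": [
            {"resource": {"kind": "SelfNodeRemediation"},
             "started": "t0", "timedOut": "t1"},
            {"resource": {"kind": "ReprovisionRemediation"}, "started": "t2"},
        ],
    }],
}

condition = summarize_unhealthy_nodes(example_status)
print(condition["message"])
```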

       

      Use Cases:

      • System administrator troubleshooting machine pool failures.
      • Automated systems monitoring and alerting on machine pool progress and blockages.
      • Post-mortem analysis of machine pool related issues to prevent future occurrences.

       

      Challenges:

      • failureReason and failureMessage in the Machine status are considered terminal and cannot be reset once set. This may limit their use for transient errors like permission issues.
      • The Cluster API doesn't explicitly require an InstanceReady condition on the infrastructure provider, which could make standardizing this behavior across cloud providers challenging.
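
      The first challenge can be illustrated with a small model. This is a simplified sketch of the semantics described above, not the Cluster API code: `MachineStatus`, `set_failure`, and `set_condition` are hypothetical names, and the messages are invented.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class MachineStatus:
    failure_reason: Optional[str] = None
    failure_message: Optional[str] = None
    conditions: dict = field(default_factory=dict)

    def set_failure(self, reason: str, message: str) -> None:
        # Terminal semantics: the first write wins and is never cleared.
        if self.failure_reason is None:
            self.failure_reason = reason
            self.failure_message = message

    def set_condition(self, cond_type: str, status: str,
                      reason: str = "", message: str = "") -> None:
        # Conditions are overwritten on every reconcile, so they can recover.
        self.conditions[cond_type] = {
            "status": status, "reason": reason, "message": message,
        }


status = MachineStatus()

# A transient permission problem is reported through both channels.
status.set_condition("InfrastructureReady", "False", "PermissionDenied",
                     "hypothetical: missing permission to create instances")
status.set_failure("CreateError",
                   "hypothetical: missing permission to create instances")

# The permission is fixed and the next reconcile succeeds.
status.set_condition("InfrastructureReady", "True")

print(status.conditions["InfrastructureReady"]["status"])  # recovered
print(status.failure_reason)                               # still set
```

      In this model the condition reflects the recovery while the failure fields keep reporting the old error, which is why conditions are the more natural carrier for transient problems.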

      Potential Solutions:

      • Enhance provider-specific Machine controllers: Modify the AWSMachine controller (and potentially others) to populate failureReason and failureMessage with more specific information when appropriate, even for potentially transient errors. This would be the most straightforward approach for AWS.
      • Introduce a new condition or field: Propose a change to the Cluster API to either introduce a new condition (like InstanceProvisioning) or add a field to the InfrastructureReady condition to specifically capture detailed error messages. This would require collaboration with the Cluster API community and would be a more comprehensive solution across all providers.
      • Leverage Events more effectively: While Events are not a perfect solution due to their ephemeral nature, we could explore tools or mechanisms to surface relevant Events alongside the Machine status in the HyperShift console or CLI. This could provide a temporary workaround while a more robust solution is developed.
      • ...
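
      For the Events option, a tool could select the most recent Warning events for a given Machine and show them next to its status. The sketch below assumes events are available as dictionaries with the core Event fields `type`, `reason`, `message`, `lastTimestamp`, and `involvedObject`; the function name and the sample data are hypothetical.

```python
def recent_warnings(events: list[dict], kind: str, name: str,
                    limit: int = 3) -> list[dict]:
    """Return up to `limit` Warning events for the object, newest first."""
    matching = [
        e for e in events
        if e.get("type") == "Warning"
        and e.get("involvedObject", {}).get("kind") == kind
        and e.get("involvedObject", {}).get("name") == name
    ]
    # Timestamps in the same UTC text format sort correctly as strings.
    matching.sort(key=lambda e: e.get("lastTimestamp", ""), reverse=True)
    return matching[:limit]


# Invented sample events for illustration.
sample_events = [
    {"type": "Normal", "reason": "Created", "message": "instance requested",
     "lastTimestamp": "2024-01-01T10:00:00Z",
     "involvedObject": {"kind": "Machine", "name": "pool-a-1"}},
    {"type": "Warning", "reason": "FailedCreate", "message": "quota exceeded",
     "lastTimestamp": "2024-01-01T10:05:00Z",
     "involvedObject": {"kind": "Machine", "name": "pool-a-1"}},
    {"type": "Warning", "reason": "FailedCreate", "message": "no capacity",
     "lastTimestamp": "2024-01-01T10:09:00Z",
     "involvedObject": {"kind": "Machine", "name": "pool-a-1"}},
    {"type": "Warning", "reason": "FailedCreate", "message": "other machine",
     "lastTimestamp": "2024-01-01T10:10:00Z",
     "involvedObject": {"kind": "Machine", "name": "pool-b-1"}},
]

top = recent_warnings(sample_events, "Machine", "pool-a-1", limit=2)
for event in top:
    print(event["lastTimestamp"], event["reason"], event["message"])
```

      Because Events expire, output like this would only complement the Machine status and could not replace a persisted condition.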

      Requirements (aka. Acceptance Criteria):

      1. Provide clear status messages indicating the current stage of machine pool creation.
      2. Highlight specific conditions that are blocking machine pool creation, nodes joining the cluster, or nodes becoming ready, such as quotas or instance availability.
      3. Integrate these indicators into the existing monitoring and metrics systems.
      4. Ensure compatibility with self-managed and managed deployments.
      5. Support all architectures: x86_64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x).
      6. Ensure the feature is secure, reliable, maintainable, and scalable.
      7. Backport to applicable versions as needed.
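
      Requirements 1 and 3 could be approached along the following lines. The sketch derives a coarse stage from replica counts and renders it as a line in the Prometheus text exposition format. The stage names, the function names, and the metric name `hypothetical_nodepool_stage` are all assumptions and do not correspond to an existing HyperShift metric.

```python
def nodepool_stage(desired: int, machines: int, joined: int, ready: int) -> str:
    """Map replica counts to the first stage that is not yet complete."""
    if machines < desired:
        return "CreatingMachines"
    if joined < desired:
        return "WaitingForNodesToJoin"
    if ready < desired:
        return "WaitingForNodesReady"
    return "Ready"


def stage_metric_line(nodepool: str, stage: str) -> str:
    """Render one gauge sample in the Prometheus text exposition format."""
    return f'hypothetical_nodepool_stage{{nodepool="{nodepool}",stage="{stage}"}} 1'


stage = nodepool_stage(desired=3, machines=3, joined=2, ready=1)
line = stage_metric_line("example-pool", stage)
print(line)
```

      Reporting the earliest incomplete stage keeps the message focused on what is blocking progress right now, and the blocking condition itself can then be attached as the reason.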

      Questions to Answer:

      • What specific conditions are most commonly blocking machine pool creations, nodes joining the cluster, and nodes becoming ready?
      • How can these conditions be automatically detected and reported?
      • What level of detail is needed in the status messages to be most useful?

      Out of Scope

      • Debugging issues unrelated to machine pool creation.
      • Integration with third-party monitoring tools beyond the scope of OpenShift Console and OCM.

      Deployment considerations

       

      Deployment consideration: Specific needs (N/A = not applicable)
      Self-managed, managed, or both: Both
      Classic (standalone cluster): N/A
      Hosted control planes: Applicable
      Multi node, Compact (three node), or Single node (SNO), or all: N/A
      Connected / Restricted Network: Both
      Architectures, e.g. x86_64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x): All architectures
      Operator compatibility: Must be compatible with relevant operators
      Backport needed (list applicable versions): Specify versions as identified during refinement
      UI need (e.g. OpenShift Console, dynamic plugin, OCM): Integration with OpenShift Console and OCM
      Other (please specify): N/A

      Documentation Considerations

      Detailed documentation on how to interpret new status messages and metrics will be required. This should include troubleshooting guides and examples. Any changes should be reflected in the existing HyperShift and OpenShift documentation.

      Interoperability Considerations

      This feature impacts HyperShift machinepool creation on ROSA, ARO, and self-managed HCP. Interoperability test scenarios should include these environments to ensure consistent behavior and reliability across the portfolio.

              Adel Zaalouk (azaalouk)
              W. Trevor King
              Matthew Werner
              Alberto Garcia Lamela