OpenShift Bugs / OCPBUGS-75887

[HCP] | Missing Node Identification in HCCO In-Place Upgrader Logs and NodePool Status

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Normal
    • Affects Version/s: 4.17.z, 4.18.z
    • Component/s: HyperShift

      Description of problem:

      The inplaceupgrader controller in the Hosted Cluster Config Operator (HCCO) logs an error when it encounters a degraded node during an upgrade, but it does not specify which node is degraded. Additionally, the NodePool CR status remains "Healthy" (e.g., AllNodesHealthy: True) even while the reconciler is blocked by the degraded node, so the upgrade failure is "silent" from a user's perspective.

      Version-Release number of selected component (if applicable):

      OCP management cluster version: 4.18.27
      ACM version: 2.13.5
      Platform: Hosted Control Planes (HCP), Agent provider
      Hosted cluster version: OpenShift 4.17.x (targeting 4.17.37)
      Upgrade strategy: InPlace

      How reproducible:

          NA

       

      During a customer HCP agent upgrade to 4.17.37, the upgrade stalled. The NodePool CR indicated that all nodes were updated and healthy, yet the HCCO logs showed a reconciliation error.     
      
      HCCO Log Output:
      
      {
        "level":"error",
        "ts":"2026-02-02T20:39:43Z",
        "msg":"Reconciler error",
        "controller":"inplaceupgrader",
        "object":{"name":"infra01-nodepool","namespace":"infra01-infra01"},
        "error":"degraded node found, cannot progress in-place upgrade. Degraded reason: disk validation failed: expected target osImageURL ... have ...",
        ...
      }
      
      
      

      Actual results:

      The log error "degraded node found" does not include the name of the degraded node, making it extremely difficult for admins to troubleshoot clusters with high node counts (43 nodes in this case).
      
      The NodePool status conditions (AllNodesHealthy, AllMachinesReady) reported True with the message "All is well," despite the upgrade being blocked by a validation failure.

      Expected results:

      The HCCO logs should explicitly state the name of the degraded node (e.g., "error": "degraded node [node-name] found...").
      
      The NodePool status conditions should reflect that the upgrade is stalled and identify the failing node/machine in the message field.
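
      A minimal sketch of how the upgrader's error could name the degraded node(s); `degradedNode` and `degradedNodeError` are hypothetical stand-ins, not the actual HyperShift types:

      ```go
      package main

      import (
      	"fmt"
      	"strings"
      )

      // degradedNode is a hypothetical stand-in for the controller's
      // per-node state: the node's name plus its degraded reason.
      type degradedNode struct {
      	Name   string
      	Reason string
      }

      // degradedNodeError builds an error that lists every degraded node
      // by name alongside its reason, instead of a generic message.
      func degradedNodeError(nodes []degradedNode) error {
      	if len(nodes) == 0 {
      		return nil
      	}
      	details := make([]string, 0, len(nodes))
      	for _, n := range nodes {
      		details = append(details, fmt.Sprintf("%s (%s)", n.Name, n.Reason))
      	}
      	return fmt.Errorf("degraded node(s) found, cannot progress in-place upgrade: %s",
      		strings.Join(details, ", "))
      }

      func main() {
      	err := degradedNodeError([]degradedNode{
      		{Name: "worker-17", Reason: "disk validation failed"},
      	})
      	// prints: degraded node(s) found, cannot progress in-place upgrade: worker-17 (disk validation failed)
      	fmt.Println(err)
      }
      ```

      The same formatted string could also be surfaced in the NodePool condition's message field, so both the logs and the CR status identify the failing node.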

      Additional info:

          Required data will be uploaded and shared in the comments.

              Assignee: Unassigned
              Reporter: Divyam Pateriya (rhn-support-dpateriy)
              QA Contact: Yu Li