Red Hat Advanced Cluster Management / ACM-27563

ARO HCP: Investigate and Root-Cause Maestro Bug Triggered by Wedged Cluster


    • Type: Bug
    • Resolution: Unresolved
    • Priority: Normal
    • Component: Maestro
    • Category: Quality / Stability / Reliability
    • Severity: Moderate

      Description of problem: An ARO HCP cluster was stuck in the uninstall state, and its logs show Maestro-related errors that need investigation. The cluster was nuked as part of a production release, but we would still like the failure to be root-caused.

      Version-Release number of selected component (if applicable):

      How reproducible: mukrishn@redhat.com can help with the reproduction steps

      Steps to Reproduce:

      1.  
      2.  
      3. ...

      Actual results: The cluster was not able to uninstall and remained stuck; see the logs below.

       

      TIMESTAMP                  MESSAGE
      2025-10-24T13:43:00.000Z   updating cluster '2m4j8qdj80ivcim53qhrb9m1hms4puin' state to 'pending'
      2025-10-24T13:43:00.000Z   Cluster '2m4j8qdj80ivcim53qhrb9m1hms4puin' created, now in 'validating' state
      2025-10-24T13:45:00.000Z   updating cluster '2m4j8qdj80ivcim53qhrb9m1hms4puin' state to 'installing'
      2025-10-24T13:49:00.000Z   updating cluster '2m4j8qdj80ivcim53qhrb9m1hms4puin' state to 'ready'
      2025-10-24T21:35:00.000Z   updating cluster '2m4j8qdj80ivcim53qhrb9m1hms4puin' state to 'uninstalling'
      2025-11-11T18:42:00.000Z   updating cluster '2m4j8qdj80ivcim53qhrb9m1hms4puin' state to 'error'
      

      The cluster was originally installed on 2025-10-24 and the uninstall was requested that same day; it only transitioned to the 'error' state on 2025-11-11, perhaps coinciding with a new rollout.

      MESSAGE                                                                                      COUNT
      Running chain deletion to clean deleted cluster '2m4j8qdj80ivcim53qhrb9m1hms4puin'.          2,893
      Finished destruct chain for cluster                                                          2,893
      Running destructor 'hypershift-managed-cluster-destructor' for cluster                       2,893
      checking if config changed for shard '7ddeb645-ebe1-5a21-82db-5b5cd39f0038'                  2,893
      Starting destruct chain for cluster                                                         2,893
      Not continuing to the next destructor for cluster                                            2,893
      The manifest's resource wrapped within Maestro's manifest bundle with id 'cfd2975d-b225-5199-b1f3-2afad9fdd400' does not have a status feedback value    2,010
      The resource status of the manifest wrapped within Maestro's manifest bundle with id 'cfd2975d-b225-5199-b1f3-2afad9fdd400' is not set. Maestro server currently not aware of it    2,010
      deleting managed cluster 'ocm-arohcpprod-2m4j8qdj80ivcim53qhrb9m1hms4puin' for cluster '2m4j8qdj80ivcim53qhrb9m1hms4puin'    2,010
      requested managed cluster '2m4j8qdj80ivcim53qhrb9m1hms4puin' deletion                        2,010
      Running destructor 'hypershift-manifest-work-destructor' for cluster                           883
      managed cluster does not exist for cluster '2m4j8qdj80ivcim53qhrb9m1hms4puin', skipping        883
      requested manifest work 'local-cluster/2m4j8qdj80ivcim53qhrb9m1hms4puin-np-static-2' deletion    240
      requested manifest work 'local-cluster/2m4j8qdj80ivcim53qhrb9m1hms4puin' deletion                240
      requested manifest work 'local-cluster/2m4j8qdj80ivcim53qhrb9m1hms4puin-np-static-3' deletion    240
      

       
      On the day it transitioned to the 'error' state, it looks like some error in Maestro became terminal.
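      The log counts above are consistent with a destruct chain that re-runs on every reconcile pass and always halts at the same destructor because Maestro never reports status feedback for the wrapped manifest. The following is a purely illustrative sketch of that retry pattern; the class and function names are assumptions, not the actual Clusters Service code:

      ```python
      # Illustrative sketch only (hypothetical names, not the real implementation):
      # a destruct chain that stops at the first destructor whose work is not
      # done, and is re-run on every reconcile, repeating the same log lines.
      from dataclasses import dataclass
      from typing import Callable, List

      @dataclass
      class Destructor:
          name: str
          # run() returns True when this destructor's work is complete.
          run: Callable[[], bool]

      def run_destruct_chain(destructors: List[Destructor], log: List[str]) -> bool:
          """Run destructors in order; stop at the first one that is not done."""
          log.append("Starting destruct chain for cluster")
          for d in destructors:
              log.append(f"Running destructor '{d.name}' for cluster")
              if not d.run():
                  log.append("Not continuing to the next destructor for cluster")
                  log.append("Finished destruct chain for cluster")
                  return False
          log.append("Finished destruct chain for cluster")
          return True

      # Simulate the stuck state: the managed-cluster destructor never finishes
      # because the Maestro manifest bundle has no status feedback (a stand-in
      # flag here), so later destructors never run -- matching the table above,
      # where the first destructor's count dwarfs the later ones.
      log: List[str] = []
      status_feedback_present = False  # hypothetical stand-in for Maestro's view
      chain = [
          Destructor("hypershift-managed-cluster-destructor",
                     lambda: status_feedback_present),
          Destructor("hypershift-manifest-work-destructor", lambda: True),
      ]
      for _ in range(3):  # three reconcile passes
          run_destruct_chain(chain, log)
      ```

      After three simulated reconciles, the first destructor has run three times and the second has never run, which mirrors the skew between the 2,893 and 883 counts in the production logs.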
       
      Slack threads:

      1. We reported this to the Maestro team here: https://redhat-internal.slack.com/archives/C08TM748CRW/p1762320453969019
      2. Reported in external channel: https://redhat-external.slack.com/archives/C076JQXM9PS/p1765464635773829?thread_ts=1765213783.694479&cid=C076JQXM9PS

      Additional info:

              Assignee: Unassigned
              Reporter: Sajeel Irkal (rhn-engineering-sirkal)