OCPBUGS-61176

HCP Clusters Stuck Installing/Uninstalling due to ACM not Coming Up [release-4.14]


    • Type: Bug
    • Resolution: Done
    • Priority: Undefined
    • 4.14.z
    • 4.14.z
    • HyperShift
    • Quality / Stability / Reliability
    • Done
    • Bug Fix
    • PR #6114 backported code that did not include the service network DNS entries for the kube apiserver. This was fine in releases 4.17 and newer, because those releases create a separate certificate to serve the service network DNS entries. In 4.16 and older, however, the kube apiserver has only one serving certificate. As a result, clients such as ACM, which use the service network endpoint to install the klusterlet in the hosted cluster, failed to communicate with the kube apiserver. This fix adds the missing entries back into the DNS names of the KAS serving certificate.
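The release note above comes down to a set-membership check on the serving certificate's DNS names. A minimal Python sketch of that check follows. The names in `SERVICE_DNS_NAMES` are an assumption: they are the conventional in-cluster names for the Kubernetes API service, not a list taken from the fix, and the actual entries may differ. The helper `missing_service_sans` is hypothetical and expects the DNS names to have been extracted from the certificate already (for example, by reading the output of `openssl x509 -noout -text`).

```python
# Assumed service-network DNS names for the kube apiserver. These are the
# conventional in-cluster names, NOT a list copied from the actual fix.
SERVICE_DNS_NAMES = (
    "kubernetes",
    "kubernetes.default",
    "kubernetes.default.svc",
    "kubernetes.default.svc.cluster.local",
)


def missing_service_sans(cert_dns_names):
    """Return the assumed service DNS names absent from a certificate.

    cert_dns_names: iterable of DNS SAN strings already extracted from
    the KAS serving certificate (hypothetical input for this sketch).
    """
    # Normalize case and any trailing dot before comparing.
    present = {name.lower().rstrip(".") for name in cert_dns_names}
    return [name for name in SERVICE_DNS_NAMES if name not in present]
```

An empty result would suggest the certificate can serve requests arriving on the service network endpoint; a non-empty result matches the failure mode the release note describes.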

      This is a clone of issue OCPBUGS-58506. The following is the description of the original issue:

      Description of problem:

      ITN-2025-00159

      Summary of the cause as known at this point: On install, certificate issues with the API endpoint in the affected versions prevent the ACM agent from coming up for that cluster, which hangs the installation. At uninstall time, those same ACM agents are responsible for removing finalizers, so clusters whose agents never came up get stuck uninstalling.

       

      Version-Release number of selected component (if applicable):

      Affected cluster versions: 
      4.15.{52,53,54}
      4.16.{43}
      4.17.{34}
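The brace notation above expands to five releases. As a rough illustration only (not the pending "check affected clusters" script mentioned under Additional info), a Python sketch of matching a cluster version against that list might look like this. The helper name `is_affected` is hypothetical.

```python
# Affected z-stream releases exactly as listed in this report,
# keyed by minor version.
AFFECTED = {
    "4.15": {52, 53, 54},
    "4.16": {43},
    "4.17": {34},
}


def is_affected(version: str) -> bool:
    """Return True if an x.y.z version string is in the affected list."""
    parts = version.strip().split(".")
    if len(parts) != 3 or not all(p.isdigit() for p in parts):
        # Pre-release or malformed strings are treated as not affected.
        return False
    minor = f"{parts[0]}.{parts[1]}"
    return int(parts[2]) in AFFECTED.get(minor, set())
```

The sketch deliberately rejects anything other than a plain `x.y.z` string, so nightly or release-candidate builds would need separate handling.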

       

      How reproducible:

      Reproduced on cluster installs

      Steps to Reproduce:

      1. Provision an HCP cluster with one of the affected versions listed above.
      2. Log in to the management cluster and describe the klusterlet, looking for "BootstrapSecretMissing,HubKubeConfigSecretMissing" in the output:
         oc describe klusterlet klusterlet-$id
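The manual check in step 2 could also be scripted. The Python sketch below assumes the klusterlet has been fetched as JSON (for example with `oc get klusterlet klusterlet-$id -o json`) and that the markers appear in the `reason` field of entries under `status.conditions`. That field layout is an assumption based on the usual Kubernetes condition shape, and the helper `degraded_conditions` is hypothetical.

```python
import json

# Marker strings quoted in the reproduction steps.
MARKERS = ("BootstrapSecretMissing", "HubKubeConfigSecretMissing")


def degraded_conditions(klusterlet_json: str):
    """Return the conditions whose reason mentions one of the markers.

    Assumes conditions live under status.conditions and carry a
    'reason' string, as Kubernetes conditions conventionally do.
    """
    obj = json.loads(klusterlet_json)
    conditions = obj.get("status", {}).get("conditions", [])
    return [
        cond
        for cond in conditions
        if any(marker in cond.get("reason", "") for marker in MARKERS)
    ]
```

A non-empty result for a cluster on one of the affected versions would be consistent with the symptom described in this report.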

      Actual results:

      The cluster hangs while coming up and logs the following errors:

      2025-07-07 19:45:14 +0000 UTC hostedclusters creed-hcp2-test configuration is invalid: NamedCertificates get secret: Invalid value: "cluster-api-cert": Secret "cluster-api-cert" not found
      2025-07-07 19:45:14 +0000 UTC hostedclusters creed-hcp2-test ValidConfiguration condition is false: NamedCertificates get secret: Invalid value: "cluster-api-cert": Secret "cluster-api-cert" not found 

       

      Expected results:

      The cluster installs successfully and ACM comes up.

       

      Additional info:

      Check affected clusters script [TODO ADD]

              Unassigned
              Dakota Long (dalong.openshift)
              He Liu