Red Hat Advanced Cluster Management: ACM-20337

Can't create ClusterInstance after upgrade ACM 2.12 to 2.13 with site-config enabled

    • Installer Sprint 2025-59, Installer Sprint 2025-60, Installer Sprint 2025-61
    • Important
    • None

      Description of problem:

      ACM was upgraded from 2.12 to 2.13.2 with the SiteConfig operator enabled. A cluster had been deployed and managed under 2.12. After the upgrade:

      • The SiteConfig operator pod in the open-cluster-management namespace was not recreated (its age showed 12d when all other pods were ~3h)
      • Editing or deleting the ClusterInstance CR for the previously deployed cluster failed with a webhook error, even though the webhook service exists:
        $ oc delete clusterinstance -n cnfdf02 cnfdf02
        Error from server (InternalError): Internal error occurred: failed calling webhook "clusterinstances.siteconfig.open-cluster-management.io": failed to call webhook: Post "https://webhook-clusterinstances-siteconfig-open-cluster-management-io.open-cluster-management.svc:443/validate-siteconfig-open-cluster-management-io-v1alpha1-clusterinstance?timeout=10s": no endpoints available for service "webhook-clusterinstances-siteconfig-open-cluster-management-io"
        $ oc get svc -n open-cluster-management
        <snip>
        webhook-clusterinstances-siteconfig-open-cluster-management-io   ClusterIP   172.30.110.105   <none>        443/TCP    7h40m
      The hub cluster is a 3-node cluster with dual-stack networking (IPv4 primary).
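The "no endpoints available" message means the Service exists but selects no ready pods. The following is a minimal diagnostic sketch, not part of the original report, that prints the Service selector, its Endpoints, and the pod labels in the namespace so a mismatch is visible. The function name inspect_webhook_service is hypothetical; the namespace and service name are taken from the error message above.

```shell
# Namespace and Service name as they appear in the webhook error.
NS=open-cluster-management
SVC=webhook-clusterinstances-siteconfig-open-cluster-management-io

# Hypothetical helper: show the Service selector, its Endpoints, and the
# labels of all pods in the namespace for a side-by-side comparison.
inspect_webhook_service() {
  echo "Service selector:"
  oc get svc "$SVC" -n "$NS" -o jsonpath='{.spec.selector}'
  echo
  echo "Endpoints:"
  oc get endpoints "$SVC" -n "$NS"
  echo "Pod labels:"
  oc get pods -n "$NS" --show-labels
}
```

Running inspect_webhook_service on the hub would be expected to show an Endpoints object with no addresses if no pod carries the label the Service selects on.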

      Version-Release number of selected component (if applicable):

      ACM Upgrade from 2.12 to 2.13

      How reproducible:

      100%

      Steps to Reproduce:

      1. Install ACM 2.12
      2. Enable the SiteConfig operator
      3. Upgrade to 2.13
      4. Try to create or delete a ClusterInstance CR

      Actual results:

      Error "failed calling webhook"

      Expected results:

      Success in creating/deleting ClusterInstance

      Additional info:

      https://access.redhat.com/solutions/7116347

      Resolution

      The upgrade from 2.12.x to 2.13.3 changes the label selector of the siteconfig-controller-manager deployment. Kubernetes treats a deployment's label selector as immutable, so the apply/patch failed; the field can only be changed by deleting and re-creating the resource. The same update also introduced a new label, which the webhook service uses to select the pod. Because the stale deployment's pod never received that label, the service had no endpoints and calls to the webhook failed. The fix adds a narrowly scoped check that deletes deployment/siteconfig-controller-manager only when upgrading from 2.12 with siteconfig enabled.
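The delete-and-recreate step described above can be sketched by hand as follows. This is an illustration only, not the code of the fix: it assumes the operator that owns the deployment recreates it after deletion, which this ticket does not confirm for manual use, and it assumes the selector key is control-plane as listed under To Test. The function name recreate_if_stale is hypothetical.

```shell
NS=open-cluster-management
DEPLOY=siteconfig-controller-manager
WANT=siteconfig-controller-manager   # selector value expected after the upgrade

# Hypothetical helper: delete the Deployment only if its (immutable) selector
# does not yet carry the new control-plane value.
# Assumption: the owning operator recreates the Deployment afterwards.
recreate_if_stale() {
  current=$(oc get deployment "$DEPLOY" -n "$NS" \
    -o jsonpath='{.spec.selector.matchLabels.control-plane}')
  if [ "$current" = "$WANT" ]; then
    echo "selector already control-plane=$WANT; nothing to do"
    return 0
  fi
  echo "stale selector control-plane=$current; deleting deployment/$DEPLOY"
  oc delete deployment "$DEPLOY" -n "$NS"
}
```

The guard keeps the deletion from running on a hub where the selector is already correct, mirroring the narrow scope of the check described in the resolution.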

      To Test

      1. Install ACM 2.12.x
      2. Enable siteconfig
      3. Upgrade to ACM 2.13.3
      4. See that the siteconfig-controller-manager deployment has the label control-plane: siteconfig-controller-manager (previously control-plane: controller-manager)
      5. This should match the service webhook-clusterinstances-siteconfig-open-cluster-management-io, which has the label selector control-plane: siteconfig-controller-manager
      6. This service is what was producing the webhook error mentioned in the ticket. Under these conditions, the error no longer occurs when attempting to create/edit/delete a ClusterInstance under ACM 2.13.3
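Steps 4 and 5 can be checked from the command line with a sketch like the one below. It assumes the label in step 4 is present on the deployment's pod template (the labels the service actually selects on); the function name verify_webhook_selector is hypothetical.

```shell
NS=open-cluster-management
SVC=webhook-clusterinstances-siteconfig-open-cluster-management-io
DEPLOY=siteconfig-controller-manager

# Hypothetical helper: succeed only if the Service's control-plane selector
# equals the control-plane label on the Deployment's pod template.
verify_webhook_selector() {
  svc_sel=$(oc get svc "$SVC" -n "$NS" \
    -o jsonpath='{.spec.selector.control-plane}')
  pod_lbl=$(oc get deployment "$DEPLOY" -n "$NS" \
    -o jsonpath='{.spec.template.metadata.labels.control-plane}')
  echo "service selector:   control-plane=$svc_sel"
  echo "pod template label: control-plane=$pod_lbl"
  [ -n "$svc_sel" ] && [ "$svc_sel" = "$pod_lbl" ]
}
```

A typical use would be `verify_webhook_selector && echo match || echo MISMATCH`, run after the upgrade to 2.13.3 and before retrying the ClusterInstance operation.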

              rh-ee-ngraham Nathaniel Graham
              rhn-support-imiller Ian Miller
              Matthew Smigielski Matthew Smigielski