Red Hat Advanced Cluster Management
ACM-4710

governance policy framework and observability failing on managed cluster after hub restore



      Description of problem:

      RHOCP 4.10.45 / ACM 2.6.3, with a new Hub Cluster build procedure prior to the restore. After the restore of the hub, all Assisted Install CRs for the SNOs are in good states and the GitOps applications are synced. However, none of the SNO policies show a compliance state, and metrics are not being sent from the SNOs to the Hub Cluster.

      After some number of hours (overnight, at some point), 3 of the 4 SNOs associated with the restored Hub Cluster had all of their policies showing Compliant as originally expected, and metrics were showing up for those 3 SNOs as well. The 4th SNO is still not syncing policies or metrics with the Hub Cluster (8+ hours now since the restore).

      What might be preventing the final SNO from syncing with the Hub Cluster, and why does whatever that mechanism is take so long to update after the restore (i.e. for the other 3 SNOs)? The out-of-sync SNO is not even showing any policies locally at this point.

       

      oc get policies -A
      No resources found

       

      Events:
        Type     Reason        Age                    From                   Message
        ----     ------        ----                   ----                   -------
        Warning  FailedCreate  6m48s (x211 over 20h)  replicaset-controller  Error creating: pods "governance-policy-framework-5668585b77-" is forbidden: autoscaling.openshift.io/ManagementCPUsOverride the pod namespace "open-cluster-management-agent-addon" does not allow the workload type management

       

      The above error was fixed with:

      oc annotate ns/open-cluster-management-agent-addon workload.openshift.io/allowed=management
      namespace/open-cluster-management-agent-addon annotated

      How did the Hub Cluster restore change that annotation on the SNO? Or could that somehow happen on a SNO when it is disconnected from the Hub Cluster for a while?
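A quick way to confirm whether the other SNOs have the same problem is to check each one for the annotation the ManagementCPUsOverride admission plugin requires. The sketch below is a hedged illustration: the namespace name comes from the event above, and re-applying the annotation when it is missing mirrors the fix that worked here; the `check_allowed` helper is introduced only for this example.

```shell
#!/bin/sh
# Sketch: verify that the addon namespace carries the workload annotation
# required by autoscaling.openshift.io/ManagementCPUsOverride on an SNO.

NS=open-cluster-management-agent-addon

# Returns "ok" when the annotation value permits management workloads,
# "missing" otherwise (helper introduced for this example).
check_allowed() {
    case "$1" in
        management) echo ok ;;
        *)          echo missing ;;
    esac
}

# Against a live SNO you would read the value and re-annotate if needed:
#   val=$(oc get ns "$NS" \
#         -o jsonpath='{.metadata.annotations.workload\.openshift\.io/allowed}')
#   [ "$(check_allowed "$val")" = ok ] || \
#       oc annotate ns "$NS" workload.openshift.io/allowed=management
```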

       
      It looks like the failure to sync metrics is due to this:
       

      oc logs -n open-cluster-management-addon-observability endpoint-observability-operator-8569c9d497-rhjqr
      ...
      2023-03-30T23:23:47.227Z        ERROR   controllers.ObservabilityAddon  Failed to get observabilityaddon        {"Request.Namespace": "open-cluster-management-addon-observability", "Request.Name": "hub-info-secret", "namespace": "hv-1-sno-1", "error": "Get \"https://api.############.#####.####.com:6443/apis/observability.open-cluster-management.io/v1beta1/namespaces/hv-1-sno-1/observabilityaddons/observability-addon\": x509: certificate signed by unknown authority (possibly because of \"crypto/rsa: verification error\" while trying to verify candidate authority certificate \"kube-apiserver-lb-signer\")"}
      sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
              /remote-source/deps/gomod/pkg/mod/sigs.k8s.io/controller-runtime@v0.10.0/pkg/internal/controller/controller.go:227
      2023-03-30T23:23:47.227Z        ERROR   controller.observabilityaddon   Reconciler error        {"reconciler group": "observability.open-cluster-management.io", "reconciler kind": "ObservabilityAddon", "name": "hub-info-secret", "namespace": "open-cluster-management-addon-observability", "error": "Get \"https://api.#######.#####.####.com:6443/apis/observability.open-cluster-management.io/v1beta1/namespaces/hv-1-sno-1/observabilityaddons/observability-addon\": x509: certificate signed by unknown authority (possibly because of \"crypto/rsa: verification error\" while trying to verify candidate authority certificate \"kube-apiserver-lb-signer\")"}
      sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
              /remote-source/deps/gomod/pkg/mod/sigs.k8s.io/controller-runtime@v0.10.0/pkg/internal/controller/controller.go:227
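The "kube-apiserver-lb-signer" error suggests the SNO is still trusting the CA of the old hub while the restored hub's API serves a certificate from a new signer. A hedged way to check is to compare the fingerprint of the certificate the hub API currently serves against the CA cached on the SNO. The secret key name (`ca.crt`) and the endpoint variable below are assumptions for illustration, not confirmed details of the hub-info-secret layout.

```shell
#!/bin/sh
# Sketch: compare the CA the restored hub API serves with the copy the SNO
# cached before the restore. A mismatch would explain the x509 errors above.

HUB_API=api.example.com:6443   # placeholder; the real endpoint is redacted above

# SHA-256 fingerprint of a PEM certificate read from stdin.
fingerprint() {
    openssl x509 -noout -fingerprint -sha256 | cut -d= -f2
}

# Compare two fingerprints (helper introduced for this example).
same_ca() {
    [ "$1" = "$2" ] && echo match || echo mismatch
}

# On the SNO you would compare the live chain with the cached copy, e.g.:
#   live=$(openssl s_client -connect "$HUB_API" -showcerts </dev/null 2>/dev/null \
#          | openssl x509 | fingerprint)
#   cached=$(oc -n open-cluster-management-addon-observability \
#            get secret hub-info-secret -o jsonpath='{.data.ca\.crt}' \
#            | base64 -d | fingerprint)   # key name is an assumption
#   same_ca "$live" "$cached"
```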
       
      

       

       

      Version-Release number of selected component (if applicable): ACM 2.6.3

      How reproducible: Currently seen in a customer environment

      Steps to Reproduce:

      1. Restore hub cluster and monitor SNO managed clusters

      Actual results: managed clusters take a long time to sync with the hub, or they have components failing with errors:

       - governance-policy

       - observability

      Expected results:

      After a hub restore, components on managed clusters are fully functional.

      Additional info:

              smeduri1@redhat.com Subbarao Meduri
              rhn-support-rspagnol Ryan Spagnola
              Xiang Yin Xiang Yin