Bug
Resolution: Test Pending
Normal
ACM 2.6.3
GRC Sprint 2023-07
Description of problem:
RHOCP 4.10.45 / ACM 2.6.3, with a new Hub Cluster build procedure used prior to the restore. After the restore of the hub, all Assisted Install CRs for the SNOs are in good states and the GitOps applications are synced. However, none of the SNO policies show a compliance state, and metrics are not being sent from the SNOs to the Hub Cluster.
After some number of hours (overnight), 3 of the 4 SNOs associated with the restored Hub Cluster had all of their policies showing Compliant as originally expected, and metrics were showing up for those 3 SNOs as well. The 4th SNO is still not syncing policies or metrics with the Hub Cluster (8+ hours since the restore).
What might be preventing the final SNO from syncing with the Hub Cluster, and why does whatever mechanism is involved take so long to update after the restore (i.e. for the other 3 SNOs)? The out-of-sync SNO is not even showing any policies locally at this point:
oc get policies -A
No resources found
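For comparison, the same missing compliance state should also be visible from the hub side. A minimal sketch of that check, assuming hub access and using a placeholder for the SNO's cluster namespace (not part of the original report):

# Hub-side view of policy compliance for the affected SNO (illustrative;
# replace <sno-cluster-namespace> with the SNO's namespace on the hub):
oc get policies -n <sno-cluster-namespace>
# Addon health for the same cluster as seen from the hub:
oc get managedclusteraddons -n <sno-cluster-namespace>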
Events:
  Type     Reason        Age                    From                   Message
  ----     ------        ----                   ----                   -------
  Warning  FailedCreate  6m48s (x211 over 20h)  replicaset-controller  Error creating: pods "governance-policy-framework-5668585b77-" is forbidden: autoscaling.openshift.io/ManagementCPUsOverride the pod namespace "open-cluster-management-agent-addon" does not allow the workload type management
The above error was fixed with:
oc annotate ns/open-cluster-management-agent-addon workload.openshift.io/allowed=management
namespace/open-cluster-management-agent-addon annotated
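A quick way to verify the fix, as a sketch (these commands are assumed, not taken from the original report), is to confirm the annotation is now present and that the governance-policy-framework pod can be created:

# Confirm the workload-partitioning annotation is set on the namespace:
oc get ns open-cluster-management-agent-addon \
  -o jsonpath='{.metadata.annotations.workload\.openshift\.io/allowed}{"\n"}'
# Confirm the governance-policy-framework pod is no longer blocked:
oc -n open-cluster-management-agent-addon get pods | grep governance-policy-framework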
How did the Hub Cluster restore change that annotation on the SNO? Or could that somehow happen on a SNO that has been disconnected from the Hub Cluster for a while?
It looks like the metrics not syncing is due to this:
oc logs -n open-cluster-management-addon-observability endpoint-observability-operator-8569c9d497-rhjqr
...
2023-03-30T23:23:47.227Z  ERROR  controllers.ObservabilityAddon  Failed to get observabilityaddon  {"Request.Namespace": "open-cluster-management-addon-observability", "Request.Name": "hub-info-secret", "namespace": "hv-1-sno-1", "error": "Get \"https://api.############.#####.####.com:6443/apis/observability.open-cluster-management.io/v1beta1/namespaces/hv-1-sno-1/observabilityaddons/observability-addon\": x509: certificate signed by unknown authority (possibly because of \"crypto/rsa: verification error\" while trying to verify candidate authority certificate \"kube-apiserver-lb-signer\")"}
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
	/remote-source/deps/gomod/pkg/mod/sigs.k8s.io/controller-runtime@v0.10.0/pkg/internal/controller/controller.go:227
2023-03-30T23:23:47.227Z  ERROR  controller.observabilityaddon  Reconciler error  {"reconciler group": "observability.open-cluster-management.io", "reconciler kind": "ObservabilityAddon", "name": "hub-info-secret", "namespace": "open-cluster-management-addon-observability", "error": "Get \"https://api.#######.#####.####.com:6443/apis/observability.open-cluster-management.io/v1beta1/namespaces/hv-1-sno-1/observabilityaddons/observability-addon\": x509: certificate signed by unknown authority (possibly because of \"crypto/rsa: verification error\" while trying to verify candidate authority certificate \"kube-apiserver-lb-signer\")"}
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
	/remote-source/deps/gomod/pkg/mod/sigs.k8s.io/controller-runtime@v0.10.0/pkg/internal/controller/controller.go:227
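The x509 error suggests the addon is still trusting certificate material from before the restore. A rough way to confirm that, assuming the secret name shown in the log above (exact fields may differ between ACM versions), is to compare the CA bundle the addon received from the hub with what the restored hub API server currently serves:

# Hub connection material the observability addon is using
# (secret name taken from the log above):
oc -n open-cluster-management-addon-observability get secret hub-info-secret -o yaml
# CA chain the restored hub API server actually presents on port 6443
# (substitute the redacted API hostname from the log):
openssl s_client -connect api.<hub-domain>:6443 -showcerts </dev/null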
Version-Release number of selected component (if applicable): ACM 2.6.3
How reproducible: Currently seen in a customer environment
Steps to Reproduce:
- Restore hub cluster and monitor SNO managed clusters (e.g. with the hub-side checks sketched after this list)
- ...
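A minimal set of hub-side checks for the monitoring step, assuming hub access (illustrative, not from the original report):

oc get managedclusters          # managed cluster availability / joined status
oc get policies -A              # per-cluster policy compliance as seen on the hub
oc get managedclusteraddons -A  # health of governance/observability addons per cluster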
Actual results: Managed clusters take a long time to sync with the hub, or they have components failing with errors:
- governance-policy
- observability
Expected results:
After the hub restore, components on the managed clusters work fully.