Red Hat Advanced Cluster Management
ACM-4710

governance policy framework and observability failing on managed cluster after hub restore



      Description of problem:

      RHOCP 4.10.45 / ACM 2.6.3, with a new Hub Cluster build procedure prior to the restore. After the restore of the hub, all Assisted Install CRs for the SNOs are in good states and the GitOps applications are synced. However, none of the SNO policies show a compliance state, and metrics are not being sent from the SNOs to the Hub Cluster.

      After some number of hours (overnight, at some point), 3 of the 4 SNOs associated with the restored Hub Cluster had all of their policies showing Compliant as originally expected, and metrics were showing up for those 3 SNOs as well. The 4th SNO is still not syncing policies or metrics with the Hub Cluster (8+ hours now since the restore).

      What might be preventing the final SNO from syncing with the Hub Cluster, and why does whatever that mechanism is take so long to update after the restore (i.e. for the other 3 SNOs)? The out-of-sync SNO is not even showing any policies locally at this point.

       

      oc get policies -A
      No resources found

       

      Events:
        Type     Reason        Age                    From                   Message
        ----     ------        ----                   ----                   -------
        Warning  FailedCreate  6m48s (x211 over 20h)  replicaset-controller  Error creating: pods "governance-policy-framework-5668585b77-" is forbidden: autoscaling.openshift.io/ManagementCPUsOverride the pod namespace "open-cluster-management-agent-addon" does not allow the workload type management

       

      The above error was fixed with:

      oc annotate ns/open-cluster-management-agent-addon workload.openshift.io/allowed=management
      namespace/open-cluster-management-agent-addon annotated

      How did the Hub Cluster restore change that annotation on the SNO? Or could that somehow happen on a SNO when it is disconnected from the Hub Cluster for a while?
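A quick way to confirm whether the other SNOs have the same problem is to check each one for the annotation the ManagementCPUsOverride admission plugin requires. The sketch below is a hedged illustration: the namespace name comes from the event above, and re-applying the annotation when it is missing mirrors the fix that worked here; the `check_allowed` helper is introduced only for this example.

```shell
#!/bin/sh
# Sketch: verify that the addon namespace carries the workload annotation
# required by autoscaling.openshift.io/ManagementCPUsOverride on an SNO.

NS=open-cluster-management-agent-addon

# Returns "ok" when the annotation value permits management workloads,
# "missing" otherwise (helper introduced for this example).
check_allowed() {
    case "$1" in
        management) echo ok ;;
        *)          echo missing ;;
    esac
}

# Against a live SNO you would read the value and re-annotate if needed:
#   val=$(oc get ns "$NS" \
#         -o jsonpath='{.metadata.annotations.workload\.openshift\.io/allowed}')
#   [ "$(check_allowed "$val")" = ok ] || \
#       oc annotate ns "$NS" workload.openshift.io/allowed=management
```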

       
      It looks like the failure to sync metrics is due to this:
       

      oc logs -n open-cluster-management-addon-observability endpoint-observability-operator-8569c9d497-rhjqr
      ...
      2023-03-30T23:23:47.227Z        ERROR   controllers.ObservabilityAddon  Failed to get observabilityaddon        {"Request.Namespace": "open-cluster-management-addon-observability", "Request.Name": "hub-info-secret", "namespace": "hv-1-sno-1", "error": "Get \"https://api.############.#####.####.com:6443/apis/observability.open-cluster-management.io/v1beta1/namespaces/hv-1-sno-1/observabilityaddons/observability-addon\": x509: certificate signed by unknown authority (possibly because of \"crypto/rsa: verification error\" while trying to verify candidate authority certificate \"kube-apiserver-lb-signer\")"}
      sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
              /remote-source/deps/gomod/pkg/mod/sigs.k8s.io/controller-runtime@v0.10.0/pkg/internal/controller/controller.go:227
      2023-03-30T23:23:47.227Z        ERROR   controller.observabilityaddon   Reconciler error        {"reconciler group": "observability.open-cluster-management.io", "reconciler kind": "ObservabilityAddon", "name": "hub-info-secret", "namespace": "open-cluster-management-addon-observability", "error": "Get \"https://api.#######.#####.####.com:6443/apis/observability.open-cluster-management.io/v1beta1/namespaces/hv-1-sno-1/observabilityaddons/observability-addon\": x509: certificate signed by unknown authority (possibly because of \"crypto/rsa: verification error\" while trying to verify candidate authority certificate \"kube-apiserver-lb-signer\")"}
      sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
              /remote-source/deps/gomod/pkg/mod/sigs.k8s.io/controller-runtime@v0.10.0/pkg/internal/controller/controller.go:227
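The "kube-apiserver-lb-signer" error suggests the SNO is still trusting the CA of the old hub while the restored hub's API serves a certificate from a new signer. A hedged way to check is to compare the fingerprint of the certificate the hub API currently serves against the CA cached on the SNO. The secret key name (`ca.crt`) and the endpoint variable below are assumptions for illustration, not confirmed details of the hub-info-secret layout.

```shell
#!/bin/sh
# Sketch: compare the CA the restored hub API serves with the copy the SNO
# cached before the restore. A mismatch would explain the x509 errors above.

HUB_API=api.example.com:6443   # placeholder; the real endpoint is redacted above

# SHA-256 fingerprint of a PEM certificate read from stdin.
fingerprint() {
    openssl x509 -noout -fingerprint -sha256 | cut -d= -f2
}

# Compare two fingerprints (helper introduced for this example).
same_ca() {
    [ "$1" = "$2" ] && echo match || echo mismatch
}

# On the SNO you would compare the live chain with the cached copy, e.g.:
#   live=$(openssl s_client -connect "$HUB_API" -showcerts </dev/null 2>/dev/null \
#          | openssl x509 | fingerprint)
#   cached=$(oc -n open-cluster-management-addon-observability \
#            get secret hub-info-secret -o jsonpath='{.data.ca\.crt}' \
#            | base64 -d | fingerprint)   # key name is an assumption
#   same_ca "$live" "$cached"
```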
       
      

       

       

      Version-Release number of selected component (if applicable): ACM 2.6.3

      How reproducible: Currently seen in a customer environment

      Steps to Reproduce:

      1. Restore hub cluster and monitor SNO managed clusters

      Actual results: managed clusters take a long time to sync with the hub, or they have components failing with errors:

       - governance-policy

       - observability

      Expected results:

      After a hub restore, components on managed clusters are fully functional.

      Additional info:

              smeduri1@redhat.com Subbarao Meduri
              rhn-support-rspagnol Ryan Spagnola
              Xiang Yin Xiang Yin