- Bug
- Resolution: Done
- Normal
- ACM 2.9.0
- 1
- True
- None
- False
- RHOBS Sprint 20
- Critical
- No
Description of problem: ACM Observability does not work on the passive hub after hub recovery on a Regional DR setup.
Version-Release number of selected component (if applicable):
OCP 4.14.0-0.nightly-2023-11-06-203803
advanced-cluster-management.v2.9.0-204
ACM 2.9.0-DOWNSTREAM-2023-11-03-14-27-40
Submariner brew.registry.redhat.io/rh-osbs/iib:615928
ODF 4.14.0-161
ceph version 17.2.6-148.el9cp (badc1d27cb07762bea48f6554ad4f92b9d3fbb6b) quincy (stable)
Latency 50ms RTT
How reproducible:
Steps to Reproduce:
1. On a Regional DR setup, set up ACM observability by whitelisting the following ODF and RBD mirror metrics (a sketch of the allowlist ConfigMap is given after the steps):
names:
- odf_system_health_status
- odf_system_map
- odf_system_raw_capacity_total_bytes
- odf_system_raw_capacity_used_bytes
- ceph_rbd_mirror_snapshot_sync_bytes
- ceph_rbd_mirror_snapshot_snapshots
Then prepare the setup for hub recovery, with multiple workloads of both appset and subscription types, backed by RBD and CephFS, running on one of the managed clusters (where ODF is installed).
2. Ensure that the Cluster operator is healthy and that graphs are being populated with values for RBD-backed workloads on the DR monitoring dashboard under the RHACM console.
3. Take the latest backup and bring the active hub completely down.
4. Restore the backup on the passive hub and ensure both managed clusters are successfully imported.
5. Wait for the DRPolicy to be validated (see the check sketched after the steps). Refresh the RHACM console and look for the DR monitoring dashboard.
6. Run oc label namespace openshift-operators openshift.io/cluster-monitoring='true' to enable monitoring.
7. Ensure that the Cluster operator is healthy and that graphs are being populated with values for RBD-backed workloads on the DR monitoring dashboard under the RHACM console on the passive hub as well, as described for the active hub in step 2.
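For step 1, a minimal sketch of the allowlist ConfigMap, assuming the standard RHACM custom-metrics mechanism (a ConfigMap named observability-metrics-custom-allowlist in open-cluster-management-observability, matching the object dumped under Actual results below):

apiVersion: v1
kind: ConfigMap
metadata:
  name: observability-metrics-custom-allowlist
  namespace: open-cluster-management-observability
data:
  # Metric names whitelisted in step 1
  metrics_list.yaml: |
    names:
      - odf_system_health_status
      - odf_system_map
      - odf_system_raw_capacity_total_bytes
      - odf_system_raw_capacity_used_bytes
      - ceph_rbd_mirror_snapshot_sync_bytes
      - ceph_rbd_mirror_snapshot_snapshots

For step 5, one way to confirm the DRPolicy became validated, assuming it exposes a Validated status condition:

# Should print "True" per DRPolicy once validation completes (condition type is an assumption)
oc get drpolicy -o jsonpath='{.items[*].status.conditions[?(@.type=="Validated")].status}'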
Actual results: ACM Observability doesn't work on the passive hub post hub recovery.
This error shows up in pod observability-observatorium-operator-76c6685b5c-lwnb6 on the passive hub:
W1107 19:31:38.738119 1 warnings.go:70] unknown field "metadata.ownerReferences[0].blockOwnerdeletion"
level=error ts=2023-11-07T19:31:38.751518804Z caller=resource.go:202 msg="sync failed" key=open-cluster-management-observability/observability err="Operation cannot be fulfilled on observatoria.core.observatorium.io \"observability\": the object has been modified; please apply your changes to the latest version and try again"
E1107 19:31:38.751594 1 resource.go:204] Sync "open-cluster-management-observability/observability" failed: Operation cannot be fulfilled on observatoria.core.observatorium.io "observability": the object has been modified; please apply your changes to the latest version and try again
W1107 19:31:43.836758 1 warnings.go:70] unknown field "metadata.ownerReferences[0].blockOwnerdeletion"
W1107 19:31:44.037350 1 warnings.go:70] unknown field "metadata.ownerReferences[0].blockOwnerdeletion"
W1107 19:31:44.037466 1 warnings.go:70] unknown field "metadata.ownerReferences[0].blockOwnerdeletion"
W1107 19:31:44.358086 1 warnings.go:70] unknown field "metadata.ownerReferences[0].blockOwnerdeletion"
I1107 19:31:44.544479 1 request.go:655] Throttling request took 1.002350033s, request: GET:https://172.30.0.1:443/api/v1?timeout=32s
W1107 19:31:44.561837 1 warnings.go:70] unknown field "metadata.ownerReferences[0].blockOwnerdeletion"
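The repeated warning points at a mis-cased field: the correct Kubernetes field is metadata.ownerReferences[0].blockOwnerDeletion (capital D), while the restored object apparently carries blockOwnerdeletion, which suggests the restore path wrote back a malformed owner reference. A way to inspect what the restored CR actually holds (resource name taken from the error above):

# Dump the ownerReferences on the restored Observatorium CR on the passive hub
oc get observatoria.core.observatorium.io observability -n open-cluster-management-observability -o jsonpath='{.metadata.ownerReferences}'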
amagrawa:~$ oc get MultiClusterObservability observability -o jsonpath='{.status.conditions[1].status}'
True
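The query above picks conditions[1] by position; a sketch that selects the condition by type instead (the Ready condition type is an assumption about the MultiClusterObservability status):

# Avoids depending on the ordering of status.conditions
oc get multiclusterobservability observability -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}'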
amagrawa:~$ oc get configmap observability-metrics-custom-allowlist -n open-cluster-management-observability -o yaml
apiVersion: v1
data:
  metrics_list.yaml: "names:\n - odf_system_health_status\n - odf_system_map\n -
    odf_system_raw_capacity_total_bytes\n - odf_system_raw_capacity_used_bytes\n
    \ - ceph_rbd_mirror_snapshot_sync_bytes\n - ceph_rbd_mirror_snapshot_snapshots\nmatches:\n
    \ - __name__=\"csv_succeeded\",exported_namespace=\"openshift-storage\",name=~\"odf-operator.*\"\n
    \ - __name__=\"csv_succeeded\",exported_namespace=\"openshift-dr-system\",name=~\"odr-cluster-operator.*\"\n
    \ - __name__=\"csv_succeeded\",exported_namespace=\"openshift-operators\",name=~\"volsync.*\"\nrecording_rules:\n
    \ - record: count_persistentvolumeclaim_total\n   expr: count(kube_persistentvolumeclaim_info)\n"
kind: ConfigMap
metadata:
  creationTimestamp: "2023-11-07T19:31:21Z"
  labels:
    cluster.open-cluster-management.io/backup: ""
    velero.io/backup-name: acm-credentials-schedule-20231107190047
    velero.io/restore-name: restore-acm-acm-credentials-schedule-20231107190047
  name: observability-metrics-custom-allowlist
  namespace: open-cluster-management-observability
  resourceVersion: "419433"
  uid: b9f88c0e-b1ad-49a0-8813-009f33de2717
amagrawa:~$ oc logs pod/metrics-collector-deployment-5d5554ff9f-mw2bp --tail 500
level=info caller=logger.go:50 ts=2023-11-07T19:32:27.178534629Z msg="metrics collector initialized"
W1107 19:32:27.179376 1 client_config.go:618] Neither --kubeconfig nor --master was specified. Using the inClusterConfig. This might not work.
W1107 19:32:27.214574 1 client_config.go:618] Neither --kubeconfig nor --master was specified. Using the inClusterConfig. This might not work.
level=info caller=logger.go:50 ts=2023-11-07T19:32:27.229287549Z msg=NewHypershiftTransformer HostedClustersize=0
level=warn caller=logger.go:55 ts=2023-11-07T19:32:27.229357656Z component=forwarder msg=https://observatorium-api-open-cluster-management-observability.apps.amagrawa-hub2-7no.qe.rh-ocs.com/api/metrics/v1/default/api/v1/receive
level=warn caller=logger.go:55 ts=2023-11-07T19:32:27.229373851Z component=forwarder msg="not anonymizing any labels"
W1107 19:32:27.243238 1 client_config.go:618] Neither --kubeconfig nor --master was specified. Using the inClusterConfig. This might not work.
level=info caller=logger.go:50 ts=2023-11-07T19:32:27.255634674Z msg="starting metrics collector" from=https://prometheus-k8s.openshift-monitoring.svc:9091 to=https://observatorium-api-open-cluster-management-observability.apps.amagrawa-hub2-7no.qe.rh-ocs.com/api/metrics/v1/default/api/v1/receive listen=localhost:9002
level=debug caller=logger.go:45 ts=2023-11-07T19:32:28.892679903Z component=forwarder component=metricsclient timeseriesnumber=7613
level=info caller=logger.go:50 ts=2023-11-07T19:32:28.913977598Z component=forwarder component=metricsclient msg="metrics pushed successfully"
Must-gather logs can be found here: http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/bz-aman/08nov23/
Grafana shows no graphs either; it is empty too.
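Since the managed-cluster collector reports "metrics pushed successfully" while the hub dashboards stay empty, the break looks to be on the hub-side receive/query path rather than on collection. A hedged first check (the grep pattern is illustrative; the exact pod set depends on the MCO deployment):

# Look for missing or crashing Observatorium receive/query pods on the passive hub
oc get pods -n open-cluster-management-observability | grep -E 'observatorium|receive|query|rule'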
Expected results: ACM Observability should work on the passive hub post hub recovery.
Additional info: Relevant thread: https://redhat-internal.slack.com/archives/CUU609ZQC/p1699428385565049
- documents: ACM-9681 Known issue: ACM Grafana does not show data after a hub restore procedure (Closed)