- Bug
- Resolution: Done
- Normal
- ACM 2.9.0
- 1
- True
- None
- False
- RHOBS Sprint 20
- Critical
- No
Description of problem: ACM Observability does not work on the passive hub after hub recovery on a Regional DR setup.
Version-Release number of selected component (if applicable):
OCP 4.14.0-0.nightly-2023-11-06-203803
advanced-cluster-management.v2.9.0-204
ACM 2.9.0-DOWNSTREAM-2023-11-03-14-27-40
Submariner brew.registry.redhat.io/rh-osbs/iib:615928
ODF 4.14.0-161
ceph version 17.2.6-148.el9cp (badc1d27cb07762bea48f6554ad4f92b9d3fbb6b) quincy (stable)
Latency 50ms RTT
How reproducible:
Steps to Reproduce:
1. On a Regional DR setup, set up ACM observability by whitelisting the following ODF and RBD mirror metrics (a sketch of the allowlist ConfigMap is given after the steps):
names:
- odf_system_health_status
- odf_system_map
- odf_system_raw_capacity_total_bytes
- odf_system_raw_capacity_used_bytes
- ceph_rbd_mirror_snapshot_sync_bytes
- ceph_rbd_mirror_snapshot_snapshots
Then prepare the setup for hub recovery, with multiple workloads of both appset and subscription types, backed by RBD and CephFS, running on one of the managed clusters (where ODF is installed).
2. Ensure that the Cluster operator is healthy and that graphs are being populated with values for RBD-backed workloads on the DR monitoring dashboard under the RHACM console.
3. Take the latest backup and bring the active hub completely down.
4. Restore the backup on the passive hub and ensure both managed clusters are successfully imported.
5. Wait for the DRPolicy to be validated (see the check sketched after the steps). Refresh the RHACM console and look for the DR monitoring dashboard.
6. Run oc label namespace openshift-operators openshift.io/cluster-monitoring='true' to enable monitoring.
7. Ensure that the Cluster operator is healthy and that graphs are being populated with values for RBD-backed workloads on the DR monitoring dashboard under the RHACM console on the passive hub as well, as described for the active hub in step 2.
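For step 1, a minimal sketch of the allowlist ConfigMap, assuming the standard RHACM custom-metrics mechanism (a ConfigMap named observability-metrics-custom-allowlist in open-cluster-management-observability, matching the object dumped under Actual results below):

apiVersion: v1
kind: ConfigMap
metadata:
  name: observability-metrics-custom-allowlist
  namespace: open-cluster-management-observability
data:
  # Metric names whitelisted in step 1
  metrics_list.yaml: |
    names:
      - odf_system_health_status
      - odf_system_map
      - odf_system_raw_capacity_total_bytes
      - odf_system_raw_capacity_used_bytes
      - ceph_rbd_mirror_snapshot_sync_bytes
      - ceph_rbd_mirror_snapshot_snapshots

For step 5, one way to confirm the DRPolicy became validated, assuming it exposes a Validated status condition:

# Should print "True" per DRPolicy once validation completes (condition type is an assumption)
oc get drpolicy -o jsonpath='{.items[*].status.conditions[?(@.type=="Validated")].status}'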
Actual results: ACM Observability doesn't work on the passive hub post hub recovery.
This error shows up in pod observability-observatorium-operator-76c6685b5c-lwnb6 on the passive hub:
W1107 19:31:38.738119 1 warnings.go:70] unknown field "metadata.ownerReferences[0].blockOwnerdeletion"
level=error ts=2023-11-07T19:31:38.751518804Z caller=resource.go:202 msg="sync failed" key=open-cluster-management-observability/observability err="Operation cannot be fulfilled on observatoria.core.observatorium.io \"observability\": the object has been modified; please apply your changes to the latest version and try again"
E1107 19:31:38.751594 1 resource.go:204] Sync "open-cluster-management-observability/observability" failed: Operation cannot be fulfilled on observatoria.core.observatorium.io "observability": the object has been modified; please apply your changes to the latest version and try again
W1107 19:31:43.836758 1 warnings.go:70] unknown field "metadata.ownerReferences[0].blockOwnerdeletion"
W1107 19:31:44.037350 1 warnings.go:70] unknown field "metadata.ownerReferences[0].blockOwnerdeletion"
W1107 19:31:44.037466 1 warnings.go:70] unknown field "metadata.ownerReferences[0].blockOwnerdeletion"
W1107 19:31:44.358086 1 warnings.go:70] unknown field "metadata.ownerReferences[0].blockOwnerdeletion"
I1107 19:31:44.544479 1 request.go:655] Throttling request took 1.002350033s, request: GET:https://172.30.0.1:443/api/v1?timeout=32s
W1107 19:31:44.561837 1 warnings.go:70] unknown field "metadata.ownerReferences[0].blockOwnerdeletion"
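The repeated warning points at a mis-cased field: the correct Kubernetes field is metadata.ownerReferences[0].blockOwnerDeletion (capital D), while the restored object apparently carries blockOwnerdeletion, which suggests the restore path wrote back a malformed owner reference. A way to inspect what the restored CR actually holds (resource name taken from the error above):

# Dump the ownerReferences on the restored Observatorium CR on the passive hub
oc get observatoria.core.observatorium.io observability -n open-cluster-management-observability -o jsonpath='{.metadata.ownerReferences}'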
amagrawa:~$ oc get MultiClusterObservability observability -o jsonpath='{.status.conditions[1].status}'
True
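The query above picks conditions[1] by position; a sketch that selects the condition by type instead (the Ready condition type is an assumption about the MultiClusterObservability status):

# Avoids depending on the ordering of status.conditions
oc get multiclusterobservability observability -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}'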
amagrawa:~$ oc get configmap observability-metrics-custom-allowlist -n open-cluster-management-observability -o yaml
apiVersion: v1
data:
  metrics_list.yaml: "names:\n - odf_system_health_status\n - odf_system_map\n -
    odf_system_raw_capacity_total_bytes\n - odf_system_raw_capacity_used_bytes\n
    \ - ceph_rbd_mirror_snapshot_sync_bytes\n - ceph_rbd_mirror_snapshot_snapshots\nmatches:\n
    \ - __name__=\"csv_succeeded\",exported_namespace=\"openshift-storage\",name=~\"odf-operator.*\"\n
    \ - __name__=\"csv_succeeded\",exported_namespace=\"openshift-dr-system\",name=~\"odr-cluster-operator.*\"\n
    \ - __name__=\"csv_succeeded\",exported_namespace=\"openshift-operators\",name=~\"volsync.*\"\nrecording_rules:\n
    \ - record: count_persistentvolumeclaim_total\n   expr: count(kube_persistentvolumeclaim_info)\n"
kind: ConfigMap
metadata:
  creationTimestamp: "2023-11-07T19:31:21Z"
  labels:
    cluster.open-cluster-management.io/backup: ""
    velero.io/backup-name: acm-credentials-schedule-20231107190047
    velero.io/restore-name: restore-acm-acm-credentials-schedule-20231107190047
  name: observability-metrics-custom-allowlist
  namespace: open-cluster-management-observability
  resourceVersion: "419433"
  uid: b9f88c0e-b1ad-49a0-8813-009f33de2717
amagrawa:~$ oc logs pod/metrics-collector-deployment-5d5554ff9f-mw2bp --tail 500
level=info caller=logger.go:50 ts=2023-11-07T19:32:27.178534629Z msg="metrics collector initialized"
W1107 19:32:27.179376 1 client_config.go:618] Neither --kubeconfig nor --master was specified. Using the inClusterConfig. This might not work.
W1107 19:32:27.214574 1 client_config.go:618] Neither --kubeconfig nor --master was specified. Using the inClusterConfig. This might not work.
level=info caller=logger.go:50 ts=2023-11-07T19:32:27.229287549Z msg=NewHypershiftTransformer HostedClustersize=0
level=warn caller=logger.go:55 ts=2023-11-07T19:32:27.229357656Z component=forwarder msg=https://observatorium-api-open-cluster-management-observability.apps.amagrawa-hub2-7no.qe.rh-ocs.com/api/metrics/v1/default/api/v1/receive
level=warn caller=logger.go:55 ts=2023-11-07T19:32:27.229373851Z component=forwarder msg="not anonymizing any labels"
W1107 19:32:27.243238 1 client_config.go:618] Neither --kubeconfig nor --master was specified. Using the inClusterConfig. This might not work.
level=info caller=logger.go:50 ts=2023-11-07T19:32:27.255634674Z msg="starting metrics collector" from=https://prometheus-k8s.openshift-monitoring.svc:9091 to=https://observatorium-api-open-cluster-management-observability.apps.amagrawa-hub2-7no.qe.rh-ocs.com/api/metrics/v1/default/api/v1/receive listen=localhost:9002
level=debug caller=logger.go:45 ts=2023-11-07T19:32:28.892679903Z component=forwarder component=metricsclient timeseriesnumber=7613
level=info caller=logger.go:50 ts=2023-11-07T19:32:28.913977598Z component=forwarder component=metricsclient msg="metrics pushed successfully"
Must-gather logs can be found here: http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/bz-aman/08nov23/
Grafana shows no graphs either; it is empty too.
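Since the managed-cluster collector reports "metrics pushed successfully" while the hub dashboards stay empty, the break looks to be on the hub-side receive/query path rather than on collection. A hedged first check (the grep pattern is illustrative; the exact pod set depends on the MCO deployment):

# Look for missing or crashing Observatorium receive/query pods on the passive hub
oc get pods -n open-cluster-management-observability | grep -E 'observatorium|receive|query|rule'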
Expected results: ACM Observability should work on the passive hub post hub recovery.
Additional info: Relevant thread: https://redhat-internal.slack.com/archives/CUU609ZQC/p1699428385565049
- documents: ACM-9681 Known issue: ACM Grafana does not show data after a hub restore procedure (Closed)