Red Hat Advanced Cluster Management / ACM-8543

[RDR] ACM Observability doesn't work on passive hub post hub recovery

    • Sprint: RHOBS Sprint 20
    • Priority: Critical

      Description of problem:

      Version-Release number of selected component (if applicable):

      OCP 4.14.0-0.nightly-2023-11-06-203803
      advanced-cluster-management.v2.9.0-204
      ACM 2.9.0-DOWNSTREAM-2023-11-03-14-27-40
      Submariner brew.registry.redhat.io/rh-osbs/iib:615928
      ODF 4.14.0-161
      ceph version 17.2.6-148.el9cp (badc1d27cb07762bea48f6554ad4f92b9d3fbb6b) quincy (stable)
      Latency 50ms RTT

      How reproducible:

      Steps to Reproduce:

      1. On a Regional DR setup, set up ACM observability by allowlisting the following ODF and RBD mirror metric names (a sketch of the allowlist ConfigMap follows these steps):

      • odf_system_health_status
      • odf_system_map
      • odf_system_raw_capacity_total_bytes
      • odf_system_raw_capacity_used_bytes
      • ceph_rbd_mirror_snapshot_sync_bytes
      • ceph_rbd_mirror_snapshot_snapshots

        Prepare the setup for hub recovery with multiple workloads of both appset and subscription types, backed by RBD and CephFS, running on one of the managed clusters (where ODF is installed).
        2. Ensure that the Cluster operator is healthy and that graphs are being populated with values for RBD-backed workloads on the DR monitoring dashboard in the RHACM console.
        3. Take the latest backup and bring the active hub completely down.
        4. Restore the backup on the passive hub and ensure both managed clusters are successfully imported.
        5. Wait for the DRPolicy to be validated. Refresh the RHACM console and look for the DR monitoring dashboard.
        6. Run oc label namespace openshift-operators openshift.io/cluster-monitoring='true' to enable monitoring.
        7. Ensure that the Cluster operator is healthy and that graphs are being populated with values for RBD-backed workloads on the DR monitoring dashboard in the RHACM console on the passive hub as well, as described in step 2 for the active hub.
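      For step 1, a minimal sketch of the custom allowlist ConfigMap (the name, namespace, and metric names match the restored object captured later in this report; the matches and recording_rules entries present on the live setup are omitted here):

      apiVersion: v1
      kind: ConfigMap
      metadata:
        name: observability-metrics-custom-allowlist
        namespace: open-cluster-management-observability
      data:
        metrics_list.yaml: |
          names:
            - odf_system_health_status
            - odf_system_map
            - odf_system_raw_capacity_total_bytes
            - odf_system_raw_capacity_used_bytes
            - ceph_rbd_mirror_snapshot_sync_bytes
            - ceph_rbd_mirror_snapshot_snapshots

      Apply it on the hub with oc apply -f <allowlist-file> once ACM observability is enabled.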

      Actual results: ACM Observability doesn't work on the passive hub post hub recovery.

      We see this error in the pod observability-observatorium-operator-76c6685b5c-lwnb6 on the passive hub:

      W1107 19:31:38.738119 1 warnings.go:70] unknown field "metadata.ownerReferences[0].blockOwnerdeletion"
      level=error ts=2023-11-07T19:31:38.751518804Z caller=resource.go:202 msg="sync failed" key=open-cluster-management-observability/observability err="Operation cannot be fulfilled on observatoria.core.observatorium.io \"observability\": the object has been modified; please apply your changes to the latest version and try again"
      E1107 19:31:38.751594 1 resource.go:204] Sync "open-cluster-management-observability/observability" failed: Operation cannot be fulfilled on observatoria.core.observatorium.io "observability": the object has been modified; please apply your changes to the latest version and try again
      W1107 19:31:43.836758 1 warnings.go:70] unknown field "metadata.ownerReferences[0].blockOwnerdeletion"
      W1107 19:31:44.037350 1 warnings.go:70] unknown field "metadata.ownerReferences[0].blockOwnerdeletion"
      W1107 19:31:44.037466 1 warnings.go:70] unknown field "metadata.ownerReferences[0].blockOwnerdeletion"
      W1107 19:31:44.358086 1 warnings.go:70] unknown field "metadata.ownerReferences[0].blockOwnerdeletion"
      I1107 19:31:44.544479 1 request.go:655] Throttling request took 1.002350033s, request: GET:https://172.30.0.1:443/api/v1?timeout=32s
      W1107 19:31:44.561837 1 warnings.go:70] unknown field "metadata.ownerReferences[0].blockOwnerdeletion"
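
      The repeated warnings about the unknown field "metadata.ownerReferences[0].blockOwnerdeletion" (the correct field name is blockOwnerDeletion) suggest the restored Observatorium CR carries a malformed ownerReference. A quick way to inspect it (a sketch; the resource and namespace are taken from the sync key in the log above):

      $ oc get observatoria.core.observatorium.io observability \
          -n open-cluster-management-observability \
          -o jsonpath='{.metadata.ownerReferences}{"\n"}'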

      amagrawa:~$ oc get MultiClusterObservability observability -o jsonpath='{.status.conditions[1].status}'
      True
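
      The indexed jsonpath above only reads a single condition; to list every condition type and status on the MultiClusterObservability resource at once (a sketch using the same object):

      $ oc get multiclusterobservability observability \
          -o jsonpath='{range .status.conditions[*]}{.type}={.status}{"\n"}{end}'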

      amagrawa:~$ oc get configmap observability-metrics-custom-allowlist -n open-cluster-management-observability -o yaml
      apiVersion: v1
      data:
        metrics_list.yaml: "names:\n - odf_system_health_status\n - odf_system_map\n -
          odf_system_raw_capacity_total_bytes\n - odf_system_raw_capacity_used_bytes\n
          \ - ceph_rbd_mirror_snapshot_sync_bytes\n - ceph_rbd_mirror_snapshot_snapshots\nmatches:\n
          \ - __name__=\"csv_succeeded\",exported_namespace=\"openshift-storage\",name=~\"odf-operator.*\"\n
          \ - __name__=\"csv_succeeded\",exported_namespace=\"openshift-dr-system\",name=~\"odr-cluster-operator.*\"
          \n - __name__=\"csv_succeeded\",exported_namespace=\"openshift-operators\",name=~\"volsync.*\"\nrecording_rules:
          \n - record: count_persistentvolumeclaim_total\n expr: count(kube_persistentvolumeclaim_info)\n"
      kind: ConfigMap
      metadata:
        creationTimestamp: "2023-11-07T19:31:21Z"
        labels:
          cluster.open-cluster-management.io/backup: ""
          velero.io/backup-name: acm-credentials-schedule-20231107190047
          velero.io/restore-name: restore-acm-acm-credentials-schedule-20231107190047
        name: observability-metrics-custom-allowlist
        namespace: open-cluster-management-observability
        resourceVersion: "419433"
        uid: b9f88c0e-b1ad-49a0-8813-009f33de2717
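
      For readability, the metrics_list.yaml payload above decodes to roughly the following allowlist (indentation normalized):

      names:
        - odf_system_health_status
        - odf_system_map
        - odf_system_raw_capacity_total_bytes
        - odf_system_raw_capacity_used_bytes
        - ceph_rbd_mirror_snapshot_sync_bytes
        - ceph_rbd_mirror_snapshot_snapshots
      matches:
        - __name__="csv_succeeded",exported_namespace="openshift-storage",name=~"odf-operator.*"
        - __name__="csv_succeeded",exported_namespace="openshift-dr-system",name=~"odr-cluster-operator.*"
        - __name__="csv_succeeded",exported_namespace="openshift-operators",name=~"volsync.*"
      recording_rules:
        - record: count_persistentvolumeclaim_total
          expr: count(kube_persistentvolumeclaim_info)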

      amagrawa:~$ oc logs pod/metrics-collector-deployment-5d5554ff9f-mw2bp --tail 500
      level=info caller=logger.go:50 ts=2023-11-07T19:32:27.178534629Z msg="metrics collector initialized"
      W1107 19:32:27.179376 1 client_config.go:618] Neither --kubeconfig nor --master was specified. Using the inClusterConfig. This might not work.
      W1107 19:32:27.214574 1 client_config.go:618] Neither --kubeconfig nor --master was specified. Using the inClusterConfig. This might not work.
      level=info caller=logger.go:50 ts=2023-11-07T19:32:27.229287549Z msg=NewHypershiftTransformer HostedClustersize=0
      level=warn caller=logger.go:55 ts=2023-11-07T19:32:27.229357656Z component=forwarder msg=https://observatorium-api-open-cluster-management-observability.apps.amagrawa-hub2-7no.qe.rh-ocs.com/api/metrics/v1/default/api/v1/receive
      level=warn caller=logger.go:55 ts=2023-11-07T19:32:27.229373851Z component=forwarder msg="not anonymizing any labels"
      W1107 19:32:27.243238 1 client_config.go:618] Neither --kubeconfig nor --master was specified. Using the inClusterConfig. This might not work.
      level=info caller=logger.go:50 ts=2023-11-07T19:32:27.255634674Z msg="starting metrics collector" from=https://prometheus-k8s.openshift-monitoring.svc:9091 to=https://observatorium-api-open-cluster-management-observability.apps.amagrawa-hub2-7no.qe.rh-ocs.com/api/metrics/v1/default/api/v1/receive listen=localhost:9002
      level=debug caller=logger.go:45 ts=2023-11-07T19:32:28.892679903Z component=forwarder component=metricsclient timeseriesnumber=7613
      level=info caller=logger.go:50 ts=2023-11-07T19:32:28.913977598Z component=forwarder component=metricsclient msg="metrics pushed successfully"

      Must-gather logs can be found here: http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/bz-aman/08nov23/

      There are no graphs in Grafana either; it is empty.
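
      One way to check whether the allowlisted metrics ever reach the hub's metric store, independently of Grafana (a sketch; the rbac-query-proxy route and query endpoint are assumptions based on a default ACM observability install):

      $ PROXY_ROUTE=$(oc get route rbac-query-proxy -n open-cluster-management-observability -o jsonpath='{.spec.host}')
      $ curl -k -H "Authorization: Bearer $(oc whoami -t)" \
          "https://${PROXY_ROUTE}/api/v1/query?query=odf_system_health_status"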

      Expected results: ACM Observability should work on the passive hub post hub recovery.

      Additional info: Relevant thread: https://redhat-internal.slack.com/archives/CUU609ZQC/p1699428385565049

            Assignee: Douglas Camata (rh-ee-doolivei)
            Reporter: Aman Agrawal (amagrawa@redhat.com)
            QA Contact: Xiang Yin
