-
Bug
-
Resolution: Unresolved
-
Normal
-
None
-
ACM 2.11.0
-
1
-
False
-
None
-
False
-
-
-
Moderate
-
None
Description of problem:
We observed that 3 of 3628 managed SNOs show the observabilityaddon as Degraded, as shown below.
# oc get observabilityaddon -A -ojson | jq -r '.items[] | "\(.status.conditions[] | select(.type=="Degraded" and .status=="True").lastTransitionTime) \(.metadata.namespace)"'
2024-07-11T20:02:51Z vm01681
2024-07-11T19:34:23Z vm03095
2024-07-11T16:47:40Z vm03544
#
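For anyone who wants to repeat the check above without jq, the same filter can be sketched in Python. This is only an illustration: the function name `degraded_addons` and the `sample` document are made up for the example (the namespace `vm00001` is hypothetical), and it assumes the JSON has the shape returned by `oc get observabilityaddon -A -ojson`.

```python
def degraded_addons(doc):
    """Return sorted (lastTransitionTime, namespace) pairs for addons with Degraded=True."""
    rows = []
    for item in doc.get("items", []):
        # An addon that has not reported yet may have no status or no conditions.
        conditions = (item.get("status") or {}).get("conditions") or []
        for cond in conditions:
            if cond.get("type") == "Degraded" and cond.get("status") == "True":
                rows.append((cond.get("lastTransitionTime"),
                             item["metadata"]["namespace"]))
    return sorted(rows)


# Illustrative input only; in practice this would be json.load() of the oc output.
sample = {
    "items": [
        {"metadata": {"namespace": "vm01681"},
         "status": {"conditions": [
             {"type": "Degraded", "status": "True",
              "lastTransitionTime": "2024-07-11T20:02:51Z"}]}},
        {"metadata": {"namespace": "vm03095"},
         "status": {"conditions": [
             {"type": "Degraded", "status": "True",
              "lastTransitionTime": "2024-07-11T19:34:23Z"}]}},
        # Hypothetical healthy cluster, included to show it is filtered out.
        {"metadata": {"namespace": "vm00001"},
         "status": {"conditions": [
             {"type": "Degraded", "status": "False",
              "lastTransitionTime": "2024-07-10T00:00:00Z"}]}},
    ]
}

for ts, ns in degraded_addons(sample):
    print(ts, ns)
```

Sorting by transition time makes it easier to see whether the degraded clusters failed around the same time.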
These three clusters are not shown in the Grafana UI. In the metrics-collector pod log (metrics-collector-deployment_pod.log), we see:
level=error caller=logger.go:60 ts=2024-07-11T19:34:23.212608332Z component=collectrule/evaluator msg="failed to evaluate collect rule" err="Get \"https://prometheus-k8s.openshift-monitoring.svc:9091/api/v1/query?query=%281+-avg%28rate%28node_cpu_seconds_total%7Bmode%3D%22idle%22%7D%5B5m%5D%29%29%29%2A+100+%3E+70\": tls: failed to verify certificate: x509: certificate signed by unknown authority" rule="(1 - avg(rate(node_cpu_seconds_total{mode=\"idle\"}[5m]))) * 100 > 70"
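The URL in the error above is percent-encoded, which makes it hard to compare with the rule text by eye. A minimal sketch of decoding it with the Python standard library is shown here; the helper name `decode_query` is made up for the example, and the URL is copied from the log line with the Go string escaping removed.

```python
from urllib.parse import parse_qs, urlsplit

# URL copied from the metrics-collector error message.
url = (
    "https://prometheus-k8s.openshift-monitoring.svc:9091/api/v1/query"
    "?query=%281+-avg%28rate%28node_cpu_seconds_total%7Bmode%3D%22idle%22%7D"
    "%5B5m%5D%29%29%29%2A+100+%3E+70"
)


def decode_query(u):
    """Return the decoded value of the 'query' parameter of a Prometheus API URL."""
    # parse_qs turns '+' into spaces and resolves %XX escapes.
    return parse_qs(urlsplit(u).query)["query"][0]


print(decode_query(url))
```

The decoded expression is the CPU-usage collect rule named in the `rule=` field, so the request itself looks well formed; the failure reported is the x509 verification error on the TLS connection.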
After talking to rh-ee-coquadro, we were advised to delete the observability-controller-open-cluster-management.io-observability-signer-client-cert. Once it was deleted, the pod was recreated and the cluster connected to the observability server.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
- ...