-
Bug
-
Resolution: Done-Errata
-
Normal
-
Logging 5.8.11
-
False
-
None
-
False
-
NEW
-
VERIFIED
-
Before this update, duplicate conditions in the LokiStack resource status caused Loki Operator to emit invalid metrics. With this update, the operator now removes duplicate conditions from the status.
-
Bug Fix
-
-
-
Log Storage - Sprint 257, Log Storage - Sprint 258
-
Moderate
Description of problem:
The `TargetDown` alert is firing for the `loki-operator-controller-manager-metrics-service`:
$ oc prometheus alertrule TargetDown -o yaml
data:
- alerts:
  - activeAt: "2024-06-03T17:16:35Z"
    annotations:
      description: 100% of the loki-operator-controller-manager-metrics-service/loki-operator-controller-manager-metrics-service targets in openshift-operators-redhat namespace have been unreachable for more than 15 minutes. This may be a symptom of network connectivity issues, down nodes, or failures within these components. Assess the health of the infrastructure and nodes running these targets and then contact support.
      summary: Some targets were not reachable from the monitoring server for an extended period of time.
    labels:
      alertname: TargetDown
      job: loki-operator-controller-manager-metrics-service
      namespace: openshift-operators-redhat
      service: loki-operator-controller-manager-metrics-service
      severity: warning
    state: firing
    value: "1e+02"
Running `curl` from a Prometheus pod against the `loki-operator-controller-manager-metrics-service` while the alert is firing returns the following:
$ oc -n openshift-monitoring rsh prometheus-k8s-0
$ token="<token>"
sh-4.4$ curl --cacert /etc/prometheus/certs/secret_openshift-operators-redhat_loki-operator-controller-manager-metrics-token_service-ca.crt -H "Authorization: Bearer $token" https://loki-operator-controller-manager-metrics-service.openshift-operators-redhat.svc:8443/metrics
An error has occurred while serving metrics:

4 error(s) occurred:
* collected metric "lokistack_status_condition" { label:{name:"condition" value:"Pending"} label:{name:"reason" value:"PendingComponents"} label:{name:"size" value:"1x.extra-small"} label:{name:"stack_name" value:"logging-loki"} label:{name:"stack_namespace" value:"openshift-logging"} label:{name:"status" value:"true"} gauge:{value:0}} was collected before with the same name and label values
* collected metric "lokistack_status_condition" { label:{name:"condition" value:"Pending"} label:{name:"reason" value:"PendingComponents"} label:{name:"size" value:"1x.extra-small"} label:{name:"stack_name" value:"logging-loki"} label:{name:"stack_namespace" value:"openshift-logging"} label:{name:"status" value:"false"} gauge:{value:1}} was collected before with the same name and label values
* collected metric "lokistack_status_condition" { label:{name:"condition" value:"Ready"} label:{name:"reason" value:"ReadyComponents"} label:{name:"size" value:"1x.extra-small"} label:{name:"stack_name" value:"logging-loki"} label:{name:"stack_namespace" value:"openshift-logging"} label:{name:"status" value:"true"} gauge:{value:0}} was collected before with the same name and label values
* collected metric "lokistack_status_condition" { label:{name:"condition" value:"Ready"} label:{name:"reason" value:"ReadyComponents"} label:{name:"size" value:"1x.extra-small"} label:{name:"stack_name" value:"logging-loki"} label:{name:"stack_namespace" value:"openshift-logging"} label:{name:"status" value:"false"} gauge:{value:1}} was collected before with the same name and label values
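The four errors above are consistent with two condition entries (`Pending` and `Ready`) each appearing twice in the LokiStack status. The following is a minimal Python sketch of that mechanism, not the operator's actual code (the operator is written in Go). It assumes, as a hypothetical model inferred from the label names in the error, that one `status="true"` and one `status="false"` series is emitted per condition entry; the function names `condition_series` and `colliding_label_sets` are invented for illustration.

```python
from collections import Counter


def condition_series(conditions, stack_name, stack_namespace, size):
    """Model (assumption): emit a status="true" and a status="false"
    series for every condition entry found in the status."""
    series = []
    for cond in conditions:
        is_true = cond["status"] == "True"
        for status_label, value in (("true", int(is_true)), ("false", int(not is_true))):
            labels = (
                ("condition", cond["type"]),
                ("reason", cond["reason"]),
                ("size", size),
                ("stack_name", stack_name),
                ("stack_namespace", stack_namespace),
                ("status", status_label),
            )
            series.append((labels, value))
    return series


def colliding_label_sets(series):
    """Return every label set that occurs more than once."""
    counts = Counter(labels for labels, _ in series)
    return [dict(labels) for labels, n in counts.items() if n > 1]


# Hypothetical status with each condition recorded twice.
conditions = [
    {"type": "Pending", "status": "False", "reason": "PendingComponents"},
    {"type": "Ready", "status": "False", "reason": "ReadyComponents"},
    {"type": "Pending", "status": "False", "reason": "PendingComponents"},
    {"type": "Ready", "status": "False", "reason": "ReadyComponents"},
]

series = condition_series(conditions, "logging-loki", "openshift-logging", "1x.extra-small")
collisions = colliding_label_sets(series)
print(len(collisions))  # 4 colliding label sets, matching "4 error(s) occurred"
```

Under this model, a metrics registry that rejects repeated label sets would fail the whole scrape, which would explain why Prometheus marks the target as down even though the endpoint is reachable.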
If the Loki Operator pod is restarted, the alert disappears, but it returns after some time.
When the Loki Operator pod has been restarted and the metrics are queried, they are returned successfully, with entries such as:
sh-4.4$ curl --cacert /etc/prometheus/certs/secret_openshift-operators-redhat_loki-operator-controller-manager-metrics-token_service-ca.crt -H "Authorization: Bearer $token" https://loki-operator-controller-manager-metrics-service.openshift-operators-redhat.svc:8443/metrics
...
# HELP lokistack_status_condition Counts the current status conditions of the LokiStack.
# TYPE lokistack_status_condition gauge
lokistack_status_condition{condition="Failed",reason="FailedComponents",size="1x.extra-small",stack_name="logging-loki",stack_namespace="openshift-logging",status="false"} 1
lokistack_status_condition{condition="Failed",reason="FailedComponents",size="1x.extra-small",stack_name="logging-loki",stack_namespace="openshift-logging",status="true"} 0
lokistack_status_condition{condition="Pending",reason="PendingComponents",size="1x.extra-small",stack_name="logging-loki",stack_namespace="openshift-logging",status="false"} 1
lokistack_status_condition{condition="Pending",reason="PendingComponents",size="1x.extra-small",stack_name="logging-loki",stack_namespace="openshift-logging",status="true"} 0
lokistack_status_condition{condition="Ready",reason="ReadyComponents",size="1x.extra-small",stack_name="logging-loki",stack_namespace="openshift-logging",status="false"} 0
lokistack_status_condition{condition="Ready",reason="ReadyComponents",size="1x.extra-small",stack_name="logging-loki",stack_namespace="openshift-logging",status="true"} 1
lokistack_status_condition{condition="Warning",reason="StorageNeedsSchemaUpdate",size="1x.extra-small",stack_name="logging-loki",stack_namespace="openshift-logging",status="false"} 1
lokistack_status_condition{condition="Warning",reason="StorageNeedsSchemaUpdate",size="1x.extra-small",stack_name="logging-loki",stack_namespace="openshift-logging",status="true"} 0
Version-Release number of selected component (if applicable):
OpenShift 4.14
RHOL 5.9.2 and RHOL 5.9.3
How reproducible:
Not able to reproduce
Steps to Reproduce:
Not able to reproduce. No pattern has been found for when the alert is triggered: it is not present in all clusters, and when the Loki Operator pod is restarted, the alert disappears and then reappears some time later.
Actual results:
The `TargetDown` alert fires for the `loki-operator-controller-manager-metrics-service` even though the service is reachable from the Prometheus pods, because the metrics endpoint returns an error indicating that the same metric was collected before.
Expected results:
The `loki-operator-controller-manager-metrics-service` target is reported as up, and the following error is not returned:
* collected metric "lokistack_status_condition" { label:{name:"condition" value:"Pending"} label:{name:"reason" value:"PendingComponents"} label:{name:"size" value:"1x.extra-small"} label:{name:"stack_name" value:"logging-loki"} label:{name:"stack_namespace" value:"openshift-logging"} label:{name:"status" value:"true"} gauge:{value:0}} was collected before with the same name and label values
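The release note on this ticket states that the fix removes duplicate conditions from the LokiStack status. A rough Python sketch of that idea follows; it is not the operator's implementation (which is in Go). It assumes that two entries count as duplicates when they share the same `type` and `reason`, and that the first occurrence is the one kept. Both are assumptions made for illustration, as is the function name `dedupe_conditions`.

```python
def dedupe_conditions(conditions):
    """Drop repeated condition entries, keeping the first occurrence.

    Assumption: a duplicate is an entry with the same (type, reason) pair.
    """
    seen = set()
    unique = []
    for cond in conditions:
        key = (cond["type"], cond["reason"])
        if key in seen:
            continue  # already recorded, skip the repeat
        seen.add(key)
        unique.append(cond)
    return unique


# Hypothetical status in which both conditions were recorded twice.
status_conditions = [
    {"type": "Pending", "status": "False", "reason": "PendingComponents"},
    {"type": "Ready", "status": "False", "reason": "ReadyComponents"},
    {"type": "Pending", "status": "False", "reason": "PendingComponents"},
    {"type": "Ready", "status": "False", "reason": "ReadyComponents"},
]

cleaned = dedupe_conditions(status_conditions)
print([c["type"] for c in cleaned])  # ['Pending', 'Ready']
```

With at most one entry per (type, reason) pair, each exported label set would be unique, so the "was collected before with the same name and label values" error should no longer occur.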
NOTE: If the alert is noisy, it can be silenced as indicated in https://docs.openshift.com/container-platform/4.14/observability/monitoring/managing-alerts.html#silencing-alerts_managing-alerts
- clones
-
LOG-5696 Duplicate conditions in LokiStack status cause invalid operator metrics
- Closed
- links to
-
RHBA-2024:5123 Logging for Red Hat OpenShift - 5.8.11