Type: Bug
Resolution: Done-Errata
Priority: Normal
Fix Version/s: Logging 5.8.11
Affects Version/s: Logging 5.8.11
Component/s: Log Storage
Labels:
- devel_ack+

Blocked:
False
Blocked Reason:
None
Ready:
False
Docs QE Status:
NEW
QE Status:
VERIFIED
Release Note Text:
Before this update, duplicate conditions in the LokiStack resource status caused Loki Operator to emit invalid metrics. With this update, the operator now removes duplicate conditions from the status.
Release Note Type:
Bug Fix
Intelligence Requested:
Market:

Sprint:
Log Storage - Sprint 257, Log Storage - Sprint 258
Severity:
Moderate

SFDC Cases Links:
SFDC Cases Open:
SFDC Cases Counter:

Description of problem:

The `TargetDown` alert is firing for the `loki-operator-controller-manager-metrics-service`:

$ oc prometheus alertrule TargetDown -o yaml 
data:
- alerts:
  - activeAt: "2024-06-03T17:16:35Z"
    annotations:
      description: 100% of the loki-operator-controller-manager-metrics-service/loki-operator-controller-manager-metrics-service
        targets in openshift-operators-redhat namespace have been unreachable for
        more than 15 minutes. This may be a symptom of network connectivity issues,
        down nodes, or failures within these components. Assess the health of the
        infrastructure and nodes running these targets and then contact support.
      summary: Some targets were not reachable from the monitoring server for an extended
        period of time.
    labels:
      alertname: TargetDown
      job: loki-operator-controller-manager-metrics-service
      namespace: openshift-operators-redhat
      service: loki-operator-controller-manager-metrics-service
      severity: warning
    state: firing
    value: "1e+02"

Running curl from the Prometheus pod against the `loki-operator-controller-manager-metrics-service` while the alert is firing returns the following:

$ oc -n openshift-monitoring rsh prometheus-k8s-0
$ token="<token>"
sh-4.4$ curl --cacert /etc/prometheus/certs/secret_openshift-operators-redhat_loki-operator-controller-manager-metrics-token_service-ca.crt -H "Authorization: Bearer $token" https://loki-operator-controller-manager-metrics-service.openshift-operators-redhat.svc:8443/metrics
An error has occurred while serving metrics:

4 error(s) occurred:
* collected metric "lokistack_status_condition" { label:{name:"condition" value:"Pending"} label:{name:"reason" value:"PendingComponents"} label:{name:"size" value:"1x.extra-small"} label:{name:"stack_name" value:"logging-loki"} label:{name:"stack_namespace" value:"openshift-logging"} label:{name:"status" value:"true"} gauge:{value:0}} was collected before with the same name and label values
* collected metric "lokistack_status_condition" { label:{name:"condition" value:"Pending"} label:{name:"reason" value:"PendingComponents"} label:{name:"size" value:"1x.extra-small"} label:{name:"stack_name" value:"logging-loki"} label:{name:"stack_namespace" value:"openshift-logging"} label:{name:"status" value:"false"} gauge:{value:1}} was collected before with the same name and label values
* collected metric "lokistack_status_condition" { label:{name:"condition" value:"Ready"} label:{name:"reason" value:"ReadyComponents"} label:{name:"size" value:"1x.extra-small"} label:{name:"stack_name" value:"logging-loki"} label:{name:"stack_namespace" value:"openshift-logging"} label:{name:"status" value:"true"} gauge:{value:0}} was collected before with the same name and label values
* collected metric "lokistack_status_condition" { label:{name:"condition" value:"Ready"} label:{name:"reason" value:"ReadyComponents"} label:{name:"size" value:"1x.extra-small"} label:{name:"stack_name" value:"logging-loki"} label:{name:"stack_namespace" value:"openshift-logging"} label:{name:"status" value:"false"} gauge:{value:1}} was collected before with the same name and label values
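
The four errors are consistent with the LokiStack status holding more than one condition with the same type and reason, since each condition is exported as one `status="true"` and one `status="false"` series. The following is a minimal Python sketch of that reasoning, not the operator's code: the function name, the dict layout of a condition, and the sample status are all hypothetical. It lists the label sets that would be emitted more than once.

```python
from collections import Counter


def colliding_label_sets(conditions, stack_name, stack_namespace, size):
    """Return the label sets that would be emitted more than once.

    Hypothetical helper: assumes every condition is exported as two
    series, one with status="true" and one with status="false".
    """
    counts = Counter()
    for cond in conditions:
        for status in ("true", "false"):
            labels = (
                ("condition", cond["type"]),
                ("reason", cond["reason"]),
                ("size", size),
                ("stack_name", stack_name),
                ("stack_namespace", stack_namespace),
                ("status", status),
            )
            counts[labels] += 1
    return [dict(labels) for labels, seen in counts.items() if seen > 1]


# Hypothetical status with Pending and Ready each listed twice.
conditions = [
    {"type": "Pending", "reason": "PendingComponents"},
    {"type": "Ready", "reason": "ReadyComponents"},
    {"type": "Pending", "reason": "PendingComponents"},
    {"type": "Ready", "reason": "ReadyComponents"},
]

dupes = colliding_label_sets(
    conditions, "logging-loki", "openshift-logging", "1x.extra-small"
)
for labels in dupes:
    print(labels["condition"], labels["status"])
```

With this sample input the sketch reports four colliding label sets (Pending and Ready, each with both status values), the same combinations named in the error output.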

If the Loki Operator pod is restarted, the alert disappears, but it returns after some time.

After the Loki Operator pod is restarted, querying the metrics succeeds and returns entries such as:

sh-4.4$ curl --cacert /etc/prometheus/certs/secret_openshift-operators-redhat_loki-operator-controller-manager-metrics-token_service-ca.crt -H "Authorization: Bearer $token" https://loki-operator-controller-manager-metrics-service.openshift-operators-redhat.svc:8443/metrics
...
# HELP lokistack_status_condition Counts the current status conditions of the LokiStack.
# TYPE lokistack_status_condition gauge
lokistack_status_condition{condition="Failed",reason="FailedComponents",size="1x.extra-small",stack_name="logging-loki",stack_namespace="openshift-logging",status="false"} 1
lokistack_status_condition{condition="Failed",reason="FailedComponents",size="1x.extra-small",stack_name="logging-loki",stack_namespace="openshift-logging",status="true"} 0
lokistack_status_condition{condition="Pending",reason="PendingComponents",size="1x.extra-small",stack_name="logging-loki",stack_namespace="openshift-logging",status="false"} 1
lokistack_status_condition{condition="Pending",reason="PendingComponents",size="1x.extra-small",stack_name="logging-loki",stack_namespace="openshift-logging",status="true"} 0
lokistack_status_condition{condition="Ready",reason="ReadyComponents",size="1x.extra-small",stack_name="logging-loki",stack_namespace="openshift-logging",status="false"} 0
lokistack_status_condition{condition="Ready",reason="ReadyComponents",size="1x.extra-small",stack_name="logging-loki",stack_namespace="openshift-logging",status="true"} 1
lokistack_status_condition{condition="Warning",reason="StorageNeedsSchemaUpdate",size="1x.extra-small",stack_name="logging-loki",stack_namespace="openshift-logging",status="false"} 1
lokistack_status_condition{condition="Warning",reason="StorageNeedsSchemaUpdate",size="1x.extra-small",stack_name="logging-loki",stack_namespace="openshift-logging",status="true"} 0

Version-Release number of selected component (if applicable):

OpenShift 4.14
RHOL 5.9.2 and RHOL 5.9.3

How reproducible:

Not able to reproduce

Steps to Reproduce:

Not able to reproduce. No pattern has been found for when the alert is triggered: it is not present in all clusters, and when the Loki Operator pod is restarted the alert disappears, only to reappear some time later.

Actual results:

The `TargetDown` alert fires for the `loki-operator-controller-manager-metrics-service` even though the service is reachable from the Prometheus pods, because the metrics endpoint returns an error indicating that the same metric was collected before.

Expected results:

The `loki-operator-controller-manager-metrics-service` target is reported as up, and the error below does not appear:

* collected metric "lokistack_status_condition" { label:{name:"condition" value:"Pending"} label:{name:"reason" value:"PendingComponents"} label:{name:"size" value:"1x.extra-small"} label:{name:"stack_name" value:"logging-loki"} label:{name:"stack_namespace" value:"openshift-logging"} label:{name:"status" value:"true"} gauge:{value:0}} was collected before with the same name and label values
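
The release note describes the fix as removing duplicate conditions from the status. A minimal Python sketch of that idea follows. It assumes a duplicate means the same `type` and `reason`, and that the first occurrence is the one kept. Both are assumptions, as is the function name, because the operator's actual rule is not shown in this report.

```python
def dedupe_conditions(conditions):
    """Drop repeated conditions, keeping the first of each (type, reason).

    Hypothetical helper; the operator may use a different key or keep a
    different occurrence.
    """
    seen = set()
    unique = []
    for cond in conditions:
        key = (cond["type"], cond["reason"])
        if key in seen:
            continue  # already have a condition with this type and reason
        seen.add(key)
        unique.append(cond)
    return unique


# Hypothetical status with Pending and Ready each listed twice.
status_conditions = [
    {"type": "Pending", "reason": "PendingComponents", "status": "False"},
    {"type": "Ready", "reason": "ReadyComponents", "status": "True"},
    {"type": "Pending", "reason": "PendingComponents", "status": "False"},
    {"type": "Ready", "reason": "ReadyComponents", "status": "True"},
]

cleaned = dedupe_conditions(status_conditions)
print([c["type"] for c in cleaned])
```

Once each type and reason pair appears only once, every exported label set is unique, so the "was collected before with the same name and label values" error would no longer be triggered by this cause.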