OpenShift Logging / LOG-5855

[release-5.9] Duplicate conditions in LokiStack status cause invalid operator metrics


    • False
    • None
    • False
    • NEW
    • VERIFIED
    • Before this update, duplicate conditions in the LokiStack resource status caused Loki Operator to emit invalid metrics. With this update, the operator now removes duplicate conditions from the status.
    • Bug Fix
    • Log Storage - Sprint 257
    • Moderate
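
The release note above says the operator now removes duplicate conditions from the LokiStack status. The following is a minimal sketch of that idea, not the operator's actual code: the `Condition` type is a simplified stand-in, `dedupeConditions` is a hypothetical helper, and treating two conditions as duplicates when they share the same type and reason (keeping the first occurrence) is an assumption.

```go
package main

import "fmt"

// Condition is a simplified stand-in for a Kubernetes status condition.
// It is NOT the operator's real type; only the fields relevant here are kept.
type Condition struct {
	Type   string
	Status string
	Reason string
}

// dedupeConditions (hypothetical helper) keeps the first occurrence of each
// (Type, Reason) pair and preserves the original order of the slice.
func dedupeConditions(in []Condition) []Condition {
	type key struct{ typ, reason string }
	seen := make(map[key]struct{}, len(in))
	out := make([]Condition, 0, len(in))
	for _, c := range in {
		k := key{c.Type, c.Reason}
		if _, dup := seen[k]; dup {
			continue // same type and reason already kept: drop the repeat
		}
		seen[k] = struct{}{}
		out = append(out, c)
	}
	return out
}

func main() {
	// Made-up sample data shaped like the duplicated Pending/Ready conditions
	// named in the error output of this issue.
	conditions := []Condition{
		{Type: "Pending", Status: "False", Reason: "PendingComponents"},
		{Type: "Ready", Status: "True", Reason: "ReadyComponents"},
		{Type: "Pending", Status: "False", Reason: "PendingComponents"},
		{Type: "Ready", Status: "True", Reason: "ReadyComponents"},
	}
	for _, c := range dedupeConditions(conditions) {
		fmt.Printf("%s/%s=%s\n", c.Type, c.Reason, c.Status)
	}
}
```

With one entry per type and reason, each `lokistack_status_condition` label set is produced only once, which is what the metrics endpoint requires.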

      Description of problem:

      The `TargetDown` alert is firing for the `loki-operator-controller-manager-metrics-service`:

      $ oc prometheus alertrule TargetDown -o yaml 
      data:
      - alerts:
        - activeAt: "2024-06-03T17:16:35Z"
          annotations:
            description: 100% of the loki-operator-controller-manager-metrics-service/loki-operator-controller-manager-metrics-service
              targets in openshift-operators-redhat namespace have been unreachable for
              more than 15 minutes. This may be a symptom of network connectivity issues,
              down nodes, or failures within these components. Assess the health of the
              infrastructure and nodes running these targets and then contact support.
            summary: Some targets were not reachable from the monitoring server for an extended
              period of time.
          labels:
            alertname: TargetDown
            job: loki-operator-controller-manager-metrics-service
            namespace: openshift-operators-redhat
            service: loki-operator-controller-manager-metrics-service
            severity: warning
          state: firing
          value: "1e+02" 

      Running curl from a Prometheus pod against the `loki-operator-controller-manager-metrics-service` while the error is present returns the following:

      $ oc -n openshift-monitoring rsh prometheus-k8s-0
      $ token="<token>"
      sh-4.4$ curl --cacert /etc/prometheus/certs/secret_openshift-operators-redhat_loki-operator-controller-manager-metrics-token_service-ca.crt -H "Authorization: Bearer $token" https://loki-operator-controller-manager-metrics-service.openshift-operators-redhat.svc:8443/metrics
      An error has occurred while serving metrics:
      
      4 error(s) occurred:
      * collected metric "lokistack_status_condition" { label:{name:"condition" value:"Pending"} label:{name:"reason" value:"PendingComponents"} label:{name:"size" value:"1x.extra-small"} label:{name:"stack_name" value:"logging-loki"} label:{name:"stack_namespace" value:"openshift-logging"} label:{name:"status" value:"true"} gauge:{value:0}} was collected before with the same name and label values
      * collected metric "lokistack_status_condition" { label:{name:"condition" value:"Pending"} label:{name:"reason" value:"PendingComponents"} label:{name:"size" value:"1x.extra-small"} label:{name:"stack_name" value:"logging-loki"} label:{name:"stack_namespace" value:"openshift-logging"} label:{name:"status" value:"false"} gauge:{value:1}} was collected before with the same name and label values
      * collected metric "lokistack_status_condition" { label:{name:"condition" value:"Ready"} label:{name:"reason" value:"ReadyComponents"} label:{name:"size" value:"1x.extra-small"} label:{name:"stack_name" value:"logging-loki"} label:{name:"stack_namespace" value:"openshift-logging"} label:{name:"status" value:"true"} gauge:{value:0}} was collected before with the same name and label values
      * collected metric "lokistack_status_condition" { label:{name:"condition" value:"Ready"} label:{name:"reason" value:"ReadyComponents"} label:{name:"size" value:"1x.extra-small"} label:{name:"stack_name" value:"logging-loki"} label:{name:"stack_namespace" value:"openshift-logging"} label:{name:"status" value:"false"} gauge:{value:1}} was collected before with the same name and label values
      

      If the Loki Operator Pod is restarted, the alert disappears, but it returns after some time.

      After the Loki Operator Pod is restarted, querying the metrics succeeds and returns entries like:

      sh-4.4$ curl --cacert /etc/prometheus/certs/secret_openshift-operators-redhat_loki-operator-controller-manager-metrics-token_service-ca.crt -H "Authorization: Bearer $token" https://loki-operator-controller-manager-metrics-service.openshift-operators-redhat.svc:8443/metrics
      ...
      # HELP lokistack_status_condition Counts the current status conditions of the LokiStack.
      # TYPE lokistack_status_condition gauge
      lokistack_status_condition{condition="Failed",reason="FailedComponents",size="1x.extra-small",stack_name="logging-loki",stack_namespace="openshift-logging",status="false"} 1
      lokistack_status_condition{condition="Failed",reason="FailedComponents",size="1x.extra-small",stack_name="logging-loki",stack_namespace="openshift-logging",status="true"} 0
      lokistack_status_condition{condition="Pending",reason="PendingComponents",size="1x.extra-small",stack_name="logging-loki",stack_namespace="openshift-logging",status="false"} 1
      lokistack_status_condition{condition="Pending",reason="PendingComponents",size="1x.extra-small",stack_name="logging-loki",stack_namespace="openshift-logging",status="true"} 0
      lokistack_status_condition{condition="Ready",reason="ReadyComponents",size="1x.extra-small",stack_name="logging-loki",stack_namespace="openshift-logging",status="false"} 0
      lokistack_status_condition{condition="Ready",reason="ReadyComponents",size="1x.extra-small",stack_name="logging-loki",stack_namespace="openshift-logging",status="true"} 1
      lokistack_status_condition{condition="Warning",reason="StorageNeedsSchemaUpdate",size="1x.extra-small",stack_name="logging-loki",stack_namespace="openshift-logging",status="false"} 1
      lokistack_status_condition{condition="Warning",reason="StorageNeedsSchemaUpdate",size="1x.extra-small",stack_name="logging-loki",stack_namespace="openshift-logging",status="true"} 0
      

      Version-Release number of selected component (if applicable):

      OpenShift 4.14
      RHOL 5.9.2 and RHOL 5.9.3

      How reproducible:

      Not able to reproduce

      Steps to Reproduce:

      Not able to reproduce. No pattern has been found for when the alert is triggered: it is not present in all clusters, and when the Loki Operator Pod is restarted the alert disappears, only to reappear some time later.

      Actual results:

      The `TargetDown` alert fires for the `loki-operator-controller-manager-metrics-service` even though the service is reachable from the Prometheus pods, because the metrics endpoint returns an error indicating that the same metric was collected before.

      Expected results:

      The `loki-operator-controller-manager-metrics-service` target is reported as up, and the error below is no longer returned:

      * collected metric "lokistack_status_condition" { label:{name:"condition" value:"Pending"} label:{name:"reason" value:"PendingComponents"} label:{name:"size" value:"1x.extra-small"} label:{name:"stack_name" value:"logging-loki"} label:{name:"stack_namespace" value:"openshift-logging"} label:{name:"status" value:"true"} gauge:{value:0}} was collected before with the same name and label values
      

       

      NOTE: If the alert is noisy, it can be silenced as described in https://docs.openshift.com/container-platform/4.14/observability/monitoring/managing-alerts.html#silencing-alerts_managing-alerts

              rojacob@redhat.com Robert Jacob
              rhn-support-dgautam Dhruv Gautam
              Kabir Bharti Kabir Bharti