OpenShift Logging / LOG-5405

AlertingRules `LokiStackWriteRequestErrors` and `LokiRequestErrors` do not fire when Loki Gateway Pods stay `Pending`

    • Type: Bug
    • Resolution: Not a Bug
    • Priority: Normal
    • None
    • Logging 5.8.4
    • Log Storage
    • False
    • None
    • False
    • NEW
    • NEW
    • Bug Fix
    • Important

      Description of problem:

      The customer deployed a LokiStack to replace Elasticsearch. Because of insufficient resources on the infrastructure nodes, two Pods (apparently the Loki Gateway Pods) stayed in "Pending", and the customer did not notice. As a result, no logs were stored, and no alerting rule fired to warn the customer.

      The expressions of the alerting rules `LokiStackWriteRequestErrors` and `LokiRequestErrors` both evaluate to "100" (meaning 100% of write requests return an error). An alert fires only when its threshold is exceeded continuously for more than 15 minutes. In the situation above, the series has recurring gaps within each 15-minute window, so the condition never holds for the full duration and the alerts are never triggered.
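
      The effect of those gaps can be illustrated with a minimal sketch of how a Prometheus-style "for" duration behaves. This is a simplified model, not the actual rule evaluator. The threshold of 10 and the 60-second evaluation interval are assumptions chosen for illustration; only the 15-minute duration and the value 100 come from this report. A missing sample is modelled as `None`.

```python
from typing import Optional, Sequence


def first_firing_time(
    samples: Sequence[Optional[float]],
    threshold: float,
    for_seconds: int,
    interval_seconds: int,
) -> Optional[int]:
    """Return the evaluation time (in seconds) at which the alert first
    fires, or None if it never fires.

    Simplified model: the alert becomes pending on the first evaluation
    above the threshold, fires once it has stayed above the threshold for
    `for_seconds`, and is reset by any evaluation that has no sample or
    is not above the threshold.
    """
    active_since: Optional[int] = None
    for index, value in enumerate(samples):
        now = index * interval_seconds
        if value is None or value <= threshold:
            # A gap (or a value below the threshold) resets the pending state.
            active_since = None
            continue
        if active_since is None:
            active_since = now
        if now - active_since >= for_seconds:
            return now
    return None


# Assumed values for illustration only.
THRESHOLD = 10.0
FOR_SECONDS = 15 * 60
INTERVAL_SECONDS = 60

# Continuous series: the expression returns 100 at every evaluation.
continuous = [100.0] * 20

# Broken series: every fifth evaluation returns no sample.
gappy = [None if i % 5 == 4 else 100.0 for i in range(60)]

print(first_firing_time(continuous, THRESHOLD, FOR_SECONDS, INTERVAL_SECONDS))
print(first_firing_time(gappy, THRESHOLD, FOR_SECONDS, INTERVAL_SECONDS))
```

      In this model the continuous series fires once the 15 minutes have elapsed, while the broken series never accumulates a long enough uninterrupted run, which matches the behaviour described above.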

      See screenshot attached.

      The expectation is that the rules fire when two Loki Pods stay "Pending" or are unavailable for other reasons.

      Version-Release number of selected component (if applicable):

      OpenShift Container Platform 4.12.46
      cluster-logging.v5.8.4

      How reproducible:

      Always

      Steps to Reproduce:

      1. Set up a cluster with OpenShift Logging 5.8.4
      2. Deploy OpenShift Logging 5.8 and create a LokiStack with "1x.small" sizing
      3. To simulate the issue, set the LokiStack field ".spec.template.gateway.nodeSelector" to 'does-not-exist: "true"', then delete the existing "gateway" ReplicaSet. The replacement Pods are scheduled with the non-existent node selector and therefore stay in "Pending".
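
      As a rough sketch of step 3, the node selector change could be expressed as a JSON merge patch and applied with `oc patch`. The LokiStack name "logging-loki" and the namespace "openshift-logging" are assumptions and must be replaced with the values used in the cluster. The script only prints the command; it does not run it.

```python
import json
import shlex

# Hypothetical names: adjust to the LokiStack deployed in the cluster.
LOKISTACK_NAME = "logging-loki"
NAMESPACE = "openshift-logging"


def gateway_node_selector_patch(selector: dict) -> str:
    """Build a JSON merge patch that sets .spec.template.gateway.nodeSelector."""
    patch = {"spec": {"template": {"gateway": {"nodeSelector": selector}}}}
    return json.dumps(patch)


# Node selector from the reproduction steps; no node carries this label.
patch = gateway_node_selector_patch({"does-not-exist": "true"})

command = [
    "oc", "-n", NAMESPACE,
    "patch", "lokistack", LOKISTACK_NAME,
    "--type=merge", "-p", patch,
]

# Print a shell-safe version of the command for review before running it.
print(shlex.join(command))
```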

      Actual results:

      All writes fail, as expected. However, the `LokiStackWriteRequestErrors` and `LokiRequestErrors` alerting rules do not fire, even after 15 minutes.

      The graph for the `LokiStackWriteRequestErrors` alerting rule shows a broken (discontinuous) series, as in the attached screenshots.

      Expected results:

      After 15 minutes, the `LokiStackWriteRequestErrors` and `LokiRequestErrors` alerting rules fire.

      Additional info:

      • Logging "must-gather" available in Support Case 03796198

            Assignee: Unassigned
            Reporter: Simon Krenger (rhn-support-skrenger)
            Votes: 0
            Watchers: 4