Type: Bug
Resolution: Not a Bug
Priority: Normal
Fix Version/s: None
Affects Version/s: Logging 5.8.4
Component/s: Log Storage
Labels:
- devel_ack-

Blocked: False
Blocked Reason: None
Ready: False
Docs QE Status: NEW
QE Status: NEW
Release Note Type: Bug Fix
Severity: Important

Description of problem:

The customer deployed a LokiStack to replace Elasticsearch. Because the infrastructure nodes lacked resources, two Loki Gateway (?) Pods stayed in "Pending", which the customer did not notice. No alerting rule fired to flag this, so no logs were stored.

The expressions of the alerting rules `LokiStackWriteRequestErrors` and `LokiRequestErrors` both return "100" (meaning 100% of write requests return an error). An alert is triggered only when the threshold is exceeded continuously for more than 15 minutes. In the situation above, the returned series has recurring gaps within every 15-minute window, so the alerts are never triggered.
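
The effect of those gaps can be illustrated with a small simulation. This is only a sketch of how a Prometheus-style "for" duration behaves, not the actual rule definition: the function name, the one-evaluation-per-minute interval, the threshold of 10, and the gap pattern are all assumptions chosen for illustration. An evaluation with no data is modelled as `None`, and it drops the pending state just like a value below the threshold would.

```python
from typing import Optional, Sequence


def first_firing_index(
    samples: Sequence[Optional[float]],
    threshold: float,
    for_evals: int,
) -> Optional[int]:
    """Return the index of the first evaluation at which the alert fires.

    Each entry in `samples` is the value of the rule expression at one
    evaluation, or None when the expression returned no data.
    Returns None if the alert never fires.
    """
    pending_since: Optional[int] = None
    for i, value in enumerate(samples):
        if value is None or value <= threshold:
            # Condition not met (or no data): the pending state is dropped.
            pending_since = None
            continue
        if pending_since is None:
            pending_since = i
        # Fire once the condition has held for the whole "for" duration.
        if i - pending_since >= for_evals:
            return i
    return None


# Assumed setup: one evaluation per minute, threshold 10, "for" of 15 minutes.
continuous = [100.0] * 20
# Hypothetical gap pattern: every fifth evaluation returns no data.
gappy = [100.0 if i % 5 != 4 else None for i in range(60)]

print(first_firing_index(continuous, 10.0, 15))  # 15
print(first_firing_index(gappy, 10.0, 15))       # None
```

With a continuous series the alert fires after 15 evaluations. With the gappy series the pending state restarts after every gap, so the alert never fires even though every returned value is 100.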

See screenshot attached.

The expectation is that the rules fire in such a case, when two Loki Pods stay "Pending" or are unavailable for other reasons.

Version-Release number of selected component (if applicable):

OpenShift Container Platform 4.12.46
cluster-logging.v5.8.4

How reproducible:

Always

Steps to Reproduce:

Set up a cluster with OpenShift Logging 5.8.4
Deploy OpenShift Logging 5.8 and create a LokiStack with "1x.small" sizing
To simulate the issue, update the LokiStack object field ".spec.template.gateway.nodeSelector" with 'does-not-exist: "true"'. Then delete the existing "gateway" ReplicaSet so that new Pods are scheduled with the non-existent nodeSelector. As a result, the "gateway" Pods stay in "Pending".
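
As a rough sketch of the nodeSelector change in the last step, the snippet below builds a JSON merge patch for the field path given above and prints an `oc patch` command that would apply it. The LokiStack name and namespace are hypothetical placeholders, since this report does not state them; adjust them to the actual deployment.

```python
import json

# Hypothetical values: the report does not give the LokiStack name or namespace.
LOKISTACK_NAME = "logging-loki"
NAMESPACE = "openshift-logging"

# Merge patch for .spec.template.gateway.nodeSelector, using a label
# that no node carries so the gateway Pods cannot be scheduled.
patch = {
    "spec": {
        "template": {
            "gateway": {
                "nodeSelector": {"does-not-exist": "true"},
            }
        }
    }
}
patch_json = json.dumps(patch)

# Print the command instead of running it, so the sketch has no side effects.
print(
    f"oc -n {NAMESPACE} patch lokistack {LOKISTACK_NAME} "
    f"--type=merge -p '{patch_json}'"
)
```

Deleting the existing "gateway" ReplicaSet afterwards remains a separate manual step, as described above.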

Actual results:

All writes fail, as expected. However, the `LokiStackWriteRequestErrors` and `LokiRequestErrors` alerting rules do not fire, even after 15 minutes.

The graph for the `LokiStackWriteRequestErrors` alerting rule is broken up by gaps, as shown in the attached screenshots.

Expected results:

After 15 minutes, the `LokiStackWriteRequestErrors` and `LokiRequestErrors` alerting rules fire.

Additional info:

Logging "must-gather" available in Support Case 03796198

Attachments:
- Screenshot_from_2024-04-18_18-40-04.png (97 kB, 2024/04/19 1:18 PM)
- Screenshot_from_2024-04-18_18-41-03.png (115 kB, 2024/04/19 1:18 PM)

Assignee: Unassigned
Reporter: Simon Krenger
Votes: 0
Watchers: 4
Created: 2024/04/19 1:17 PM
Updated: 2024/04/30 12:51 PM
Resolved: 2024/04/30 9:05 AM
