OpenShift Logging / LOG-2884

LokiRequestErrors alert not firing on 10 '502' query_range responses over 15 mins

    • Release Note Text: Before this update, when requests to an unavailable pod were sent through the gateway, no alert would fire to warn of the disruption. With this update, individual alerts fire if the gateway has issues completing a write or read request, respectively.
    • Log Storage - Sprint 223, Log Storage - Sprint 224, Log Storage - Sprint 225

      Description:

LokiRequestErrors alert is not firing after taking down the querier with the LokiStack in unmanaged mode. Query/query_range requests need to be run to trigger the 5xx errors.

      Steps to Reproduce:
      1) Deploy CLO and Loki Operator
      2) Create LokiStack CR and forward logs to gateway
3) Edit the LokiStack CR and set it to 'Unmanaged'
      4) Delete querier deployment (Querier should not be running at this point)

      $ oc get deployments lokistack-dev-querier -n openshift-logging
      Error from server (NotFound): deployments.apps "lokistack-dev-querier" not found
      

      5) Fire some queries

      logcli -o raw --tls-skip-verify --bearer-token="$(oc whoami -t)" --addr "https://lokistack-dev-openshift-logging.apps.kbharti-410-100.qe.devcluster.openshift.com/api/logs/v1/application" query '{log_type="application"}'
      logcli -o raw --tls-skip-verify --bearer-token="$(oc whoami -t)" --addr "https://lokistack-dev-openshift-logging.apps.kbharti-410-100.qe.devcluster.openshift.com/api/logs/v1/audit" query '{log_type="audit"}'
      logcli -o raw --tls-skip-verify --bearer-token="$(oc whoami -t)" --addr "https://lokistack-dev-openshift-logging.apps.kbharti-410-100.qe.devcluster.openshift.com/api/logs/v1/infrastructure" query '{log_type="infrastructure"}'

Queries are answered with a timeout error:

      Error response from server: <html><body><h1>504 Gateway Time-out</h1>
      The server didn't respond in time.
      </body></html>
       (<nil>) attempts remaining: 0
      Query failed: Run out of attempts while querying the server
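
      Since the alert only moves from Pending to Firing after its 15-minute 'for' window, the failing queries have to be sustained. A minimal sketch for doing that, reusing the logcli command above (each failing query takes ~30 s to hit the gateway timeout, so 32 iterations span roughly 16 minutes):

      # Sketch: keep the 5xx rate up for the whole 15-minute alert window.
      # '|| true' keeps the loop going even though every query fails.
      for i in $(seq 1 32); do
        logcli -o raw --tls-skip-verify --bearer-token="$(oc whoami -t)" \
          --addr "https://lokistack-dev-openshift-logging.apps.kbharti-410-100.qe.devcluster.openshift.com/api/logs/v1/application" \
          query '{log_type="application"}' || true
      done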
      

      Logs on Gateway:

level=warn name=lokistack-gateway ts=2022-08-04T17:38:53.784246849Z caller=stdlib.go:105 caller=reverseproxy.go:489 msg="http: proxy error: context canceled"
      level=warn name=lokistack-gateway ts=2022-08-04T17:38:53.784304735Z caller=instrumentation.go:33 request=lokistack-dev-gateway-84996dbb9-6dxwl/2jSiMJWyj5-010955 proto=HTTP/1.1 method=GET status=502 content= path=/api/logs/v1/infrastructure/loki/api/v1/query_range duration=30.001070673s bytes=0
      level=warn name=lokistack-gateway ts=2022-08-04T17:41:27.417207863Z caller=stdlib.go:105 caller=reverseproxy.go:489 msg="http: proxy error: context canceled"
      level=warn name=lokistack-gateway ts=2022-08-04T17:41:27.41727024Z caller=instrumentation.go:33 request=lokistack-dev-gateway-84996dbb9-6dxwl/2jSiMJWyj5-012258 proto=HTTP/1.1 method=GET status=502 content= path=/api/logs/v1/infrastructure/loki/api/v1/query_range duration=30.001531582s bytes=0
      level=warn name=lokistack-gateway ts=2022-08-04T22:28:43.749167657Z caller=stdlib.go:105 caller=reverseproxy.go:489 msg="http: proxy error: context canceled"
      level=warn name=lokistack-gateway ts=2022-08-04T22:28:43.749228224Z caller=instrumentation.go:33 request=lokistack-dev-gateway-84996dbb9-6dxwl/2jSiMJWyj5-186110 proto=HTTP/1.1 method=GET status=502 content= path=/api/logs/v1/infrastructure/loki/api/v1/query_range duration=30.001512461s bytes=0
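
      These can be pulled with something like the following (a sketch; the 'gateway' container name is an assumption):

      $ oc logs -n openshift-logging deployment/lokistack-dev-gateway -c gateway | grep status=502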

      OCP Version: 4.10

      How reproducible: Always

      Actual Result:
      Alert is not firing

      Expected Result:
The LokiRequestErrors alert should fire when most requests are answered with a 5xx error
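
      One way to check whether the alert is at least registering the errors is to query the in-cluster Prometheus directly instead of the console UI; a sketch (the prometheus-k8s-0 pod name and the jq filter are assumptions):

      $ oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- \
          curl -s http://localhost:9090/api/v1/alerts \
        | jq '.data.alerts[] | select(.labels.alertname | test("Loki")) | {alert: .labels.alertname, state: .state}'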


            Kabir Bharti added a comment:

            Verified on loki-operator.5.5.2.

            Kabir Bharti added a comment:

            Hey rojacob@redhat.com, after our discussion last week on Loki alerts, I still see the LokiStackReadRequestErrors alert in Pending, but its state is not changing to Firing.

            Robert Jacob added a comment:

            Just had another look at the alerting queries and metrics, tried reproducing the alert, and it looks OK to me (the 15 min have not passed yet, so it is still Pending, but it should reach Firing). Two things occurred to me during testing:

            • One odd thing is that the client gets a 504 while the gateway reports a 502, which makes me suspect the timeout is actually happening in the ingress rather than the gateway; but since both are 5xx errors, this should not change anything about the alert.
            • One minor improvement I seem to have overlooked in the queries is using irate instead of rate for the short range vectors, but that too does not change the outcome substantially (see the sketch below).

            Should we have a short talk today where we can look at how we each try to reproduce the alert and see what happens together?
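
            For illustration, the error-ratio expression under discussion can be evaluated by hand with either function. A sketch, assuming the upstream-mixin-style loki_request_duration_seconds_count metric and the 10%/15 min alert shape described in this thread (not necessarily the operator's exact rule):

            $ PROMQL='100 * sum by (namespace, job, route) (rate(loki_request_duration_seconds_count{status_code=~"5.."}[2m]))
                          / sum by (namespace, job, route) (rate(loki_request_duration_seconds_count[2m]))'
            $ oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- \
                curl -sG http://localhost:9090/api/v1/query --data-urlencode "query=${PROMQL}"
            # Swapping rate() for irate() only samples the last two points of each
            # range vector, so it reacts faster to bursts; as noted above, it does
            # not substantially change whether the 10% threshold is crossed.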

            Kabir Bharti added a comment (edited):

            gvanloo That's right. I do not expect the alert to fire after just 10+ failed requests; the failing requests were spread over more than 15 minutes.

            Per the firing condition for LokiStackReadRequestErrors, 'At least 10% of query requests to the lokistack-gateway are responded with 5xx server errors' over an interval of 15 mins.

            Steps to reproduce (commands for steps 2 and 3 are sketched after this list):

            1) Deploy LokiStack and forward logs to Loki as the log store.

            2) Edit the LokiStack CR and set the instance to Unmanaged.

            3) Delete the querier deployment.

            4) Query for logs. The query should fail with the error below:

            logcli -o raw --tls-skip-verify --bearer-token="$(oc whoami -t)" --addr "https://lokistack-dev-openshift-logging.apps.kbharti-clus7.qe.devcluster.openshift.com/api/logs/v1/infrastructure" query '{log_type="infrastructure"}'
            https://lokistack-dev-openshift-logging.apps.kbharti-clus7.qe.devcluster.openshift.com/api/logs/v1/infrastructure/loki/api/v1/query_range?direction=BACKWARD&end=1663097928028028396&limit=30&query=%7Blog_type%3D%22infrastructure%22%7D&start=1663094328028028396
            Error response from server: <html><body><h1>504 Gateway Time-out</h1>
            The server didn't respond in time.
            </body></html>
             (<nil>) attempts remaining: 0
            Query failed: Run out of attempts while querying the server

            5) Keep querying for logs for more than 15 mins so the 5xx errors are sustained.
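
            For reference, steps 2) and 3) as commands (a sketch; 'lokistack-dev' matches this cluster, and spec.managementState is assumed to be the LokiStack field controlling this):

            $ oc -n openshift-logging patch lokistack lokistack-dev --type=merge \
                -p '{"spec":{"managementState":"Unmanaged"}}'
            $ oc -n openshift-logging delete deployment lokistack-dev-querier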

            Gerard Vanloo added a comment:

            rhn-support-kbharti Hmm, I think this is one of two things: either the alert threshold is too high, or the rate of the failures you are hitting the gateway with is too low.

            Like `LokiRequestErrors`, the gateway-focused alerts only activate once the total failure rate reaches 10%, not just after 10+ failed requests. For example, 10 failed queries alongside 100 successful ones in the window is only a ~9% error ratio, below the threshold. Ideally, whatever test case you are using for that alert should be the same here after you remove whichever component.

            Can you confirm that is how this is being tested?

            Kabir Bharti added a comment (edited):

            gvanloo Retested this on the latest cpaas build for 5.5.2. I can see the LokiStackReadRequestErrors and LokiStackWriteRequestErrors alerts defined on the build, but I still don't see the alert firing in the UI.

            Gateway logs:

            This was triggered for more than 15 mins with 502 errors on the gateway.

            $ oc get csv
            NAME                           DISPLAY                            VERSION   REPLACES   PHASE
            cluster-logging.5.5.2          Red Hat OpenShift Logging          5.5.2                Succeeded
            elasticsearch-operator.5.5.2   OpenShift Elasticsearch Operator   5.5.2                Succeeded
            loki-operator.5.5.2            Loki Operator                      5.5.2                Succeeded

            Gerard Vanloo added a comment:

            rhn-support-kbharti Not yet. It has been merged into the Loki upstream, but hasn't made it to midstream yet.

            Kabir Bharti added a comment:

            gvanloo Is this fix included in the latest 5.5.1 cpaas build?

            Gerard Vanloo added a comment (edited):

            I agree with rojacob@redhat.com on this one.

            I have not confirmed this, but there might have been some communication from the ingester to the querier (perhaps through the index-gateway) which caused the ingester to emit a 5XX. That might be a factor here in terms of the alert's sensitivity.

            However, adding an alert for the gateway itself is a good idea and does catch this use case (I've tested this bit). I think it would be good anyway, in case there is some way in which a pod is not ready or available and is causing problems.

            Robert Jacob added a comment:

            While I have not validated this myself yet, I think the alert is working correctly, as it only measures errors in Loki components. The gateway is not a Loki component, so it does not have the loki_request_* metrics.

            This means the documentation probably needs some improvement to make this distinction clearer. Additionally, we should probably have an alert on the gateway metrics as well, because the gateway is the "most user-facing component" of them all, and leave LokiRequestErrors for diagnosing internal errors. (A sketch of such a check follows.)
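
            As a hedged sketch of what such a gateway-side read-error check could look like (the http_requests_total metric name and its labels here are illustrative assumptions, not the operator's actual rule):

            $ oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- \
                curl -sG http://localhost:9090/api/v1/query --data-urlencode \
                'query=sum by (namespace, job) (rate(http_requests_total{job=~".*gateway.*", code=~"5..", handler=~"query.*"}[5m]))
                     / sum by (namespace, job) (rate(http_requests_total{job=~".*gateway.*", handler=~"query.*"}[5m])) > 0.10'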
