OpenShift Logging / LOG-2884

LokiRequestErrors alert not firing on 10 '502' query_range responses over 15 mins

    • Release Note Text: Before this update, when requests to an unavailable pod were sent through the gateway, no alert would fire to warn of the disruption. With this update, individual alerts fire if the gateway has issues completing a write or read request, respectively.
    • Log Storage - Sprint 223, Log Storage - Sprint 224, Log Storage - Sprint 225

      Description:

LokiRequestErrors alert is not firing after taking down the querier with the LokiStack in unmanaged mode. Query/query_range requests need to be run to trigger the 5xx errors.

      Steps to Reproduce:
      1) Deploy CLO and Loki Operator
      2) Create LokiStack CR and forward logs to gateway
3) Edit the LokiStack CR and set it to 'Unmanaged'
      4) Delete querier deployment (Querier should not be running at this point)

      $ oc get deployments lokistack-dev-querier -n openshift-logging
      Error from server (NotFound): deployments.apps "lokistack-dev-querier" not found
      

      5) Fire some queries

      logcli -o raw --tls-skip-verify --bearer-token="$(oc whoami -t)" --addr "https://lokistack-dev-openshift-logging.apps.kbharti-410-100.qe.devcluster.openshift.com/api/logs/v1/application" query '{log_type="application"}'
      logcli -o raw --tls-skip-verify --bearer-token="$(oc whoami -t)" --addr "https://lokistack-dev-openshift-logging.apps.kbharti-410-100.qe.devcluster.openshift.com/api/logs/v1/audit" query '{log_type="audit"}'
      logcli -o raw --tls-skip-verify --bearer-token="$(oc whoami -t)" --addr "https://lokistack-dev-openshift-logging.apps.kbharti-410-100.qe.devcluster.openshift.com/api/logs/v1/infrastructure" query '{log_type="infrastructure"}'

Queries are answered with a timeout error:

      Error response from server: <html><body><h1>504 Gateway Time-out</h1>
      The server didn't respond in time.
      </body></html>
       (<nil>) attempts remaining: 0
      Query failed: Run out of attempts while querying the server
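
      Since the alert only moves from Pending to Firing after its 15-minute 'for' window, the failing queries have to be sustained. A minimal sketch for doing that, reusing the logcli command above (each failing query takes ~30 s to hit the gateway timeout, so 32 iterations span roughly 16 minutes):

      # Sketch: keep the 5xx rate up for the whole 15-minute alert window.
      # '|| true' keeps the loop going even though every query fails.
      for i in $(seq 1 32); do
        logcli -o raw --tls-skip-verify --bearer-token="$(oc whoami -t)" \
          --addr "https://lokistack-dev-openshift-logging.apps.kbharti-410-100.qe.devcluster.openshift.com/api/logs/v1/application" \
          query '{log_type="application"}' || true
      done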
      

      Logs on Gateway:

level=warn name=lokistack-gateway ts=2022-08-04T17:38:53.784246849Z caller=stdlib.go:105 caller=reverseproxy.go:489 msg="http: proxy error: context canceled"
      level=warn name=lokistack-gateway ts=2022-08-04T17:38:53.784304735Z caller=instrumentation.go:33 request=lokistack-dev-gateway-84996dbb9-6dxwl/2jSiMJWyj5-010955 proto=HTTP/1.1 method=GET status=502 content= path=/api/logs/v1/infrastructure/loki/api/v1/query_range duration=30.001070673s bytes=0
      level=warn name=lokistack-gateway ts=2022-08-04T17:41:27.417207863Z caller=stdlib.go:105 caller=reverseproxy.go:489 msg="http: proxy error: context canceled"
      level=warn name=lokistack-gateway ts=2022-08-04T17:41:27.41727024Z caller=instrumentation.go:33 request=lokistack-dev-gateway-84996dbb9-6dxwl/2jSiMJWyj5-012258 proto=HTTP/1.1 method=GET status=502 content= path=/api/logs/v1/infrastructure/loki/api/v1/query_range duration=30.001531582s bytes=0
      level=warn name=lokistack-gateway ts=2022-08-04T22:28:43.749167657Z caller=stdlib.go:105 caller=reverseproxy.go:489 msg="http: proxy error: context canceled"
      level=warn name=lokistack-gateway ts=2022-08-04T22:28:43.749228224Z caller=instrumentation.go:33 request=lokistack-dev-gateway-84996dbb9-6dxwl/2jSiMJWyj5-186110 proto=HTTP/1.1 method=GET status=502 content= path=/api/logs/v1/infrastructure/loki/api/v1/query_range duration=30.001512461s bytes=0
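
      These can be pulled with something like the following (a sketch; the 'gateway' container name is an assumption):

      $ oc logs -n openshift-logging deployment/lokistack-dev-gateway -c gateway | grep status=502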

      OCP Version: 4.10

      How reproducible: Always

      Actual Result:
      Alert is not firing

      Expected Result:
The LokiRequestErrors alert should fire when most requests are answered with a 5xx error
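
      One way to check whether the alert is at least registering the errors is to query the in-cluster Prometheus directly instead of the console UI; a sketch (the prometheus-k8s-0 pod name and the jq filter are assumptions):

      $ oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- \
          curl -s http://localhost:9090/api/v1/alerts \
        | jq '.data.alerts[] | select(.labels.alertname | test("Loki")) | {alert: .labels.alertname, state: .state}'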


            Kabir Bharti added a comment:

            Verified on loki-operator.5.5.2.

            Kabir Bharti added a comment:

            Hey rojacob@redhat.com, after our discussion last week on Loki alerts, I still see the LokiStackReadRequestErrors alert in Pending, but its state is not changing to Firing.

            Robert Jacob added a comment:

            Just had another look at the alerting queries and metrics, tried reproducing the alert, and it looks OK to me (the 15 min have not passed yet, so it is still Pending, but it should reach Firing). Two things occurred to me during testing:

            • One odd thing is that the client gets a 504 while the gateway reports a 502, which makes me suspect the timeout is actually happening in the ingress rather than the gateway; but since both are 5xx errors, this should not change anything about the alert.
            • One minor improvement I seem to have overlooked in the queries is using irate instead of rate for the short range vectors, but that too does not change the outcome substantially (see the sketch below).

            Should we have a short talk today where we can look at how we each try to reproduce the alert and see what happens together?
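
            For illustration, the error-ratio expression under discussion can be evaluated by hand with either function. A sketch, assuming the upstream-mixin-style loki_request_duration_seconds_count metric and the 10%/15 min alert shape described in this thread (not necessarily the operator's exact rule):

            $ PROMQL='100 * sum by (namespace, job, route) (rate(loki_request_duration_seconds_count{status_code=~"5.."}[2m]))
                          / sum by (namespace, job, route) (rate(loki_request_duration_seconds_count[2m]))'
            $ oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- \
                curl -sG http://localhost:9090/api/v1/query --data-urlencode "query=${PROMQL}"
            # Swapping rate() for irate() only samples the last two points of each
            # range vector, so it reacts faster to bursts; as noted above, it does
            # not substantially change whether the 10% threshold is crossed.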

            Kabir Bharti added a comment (edited):

            gvanloo That's right. I do not expect the alert to fire after just 10+ failed requests; the failing requests were spread over more than 15 minutes.

            Per the firing condition for LokiStackReadRequestErrors, 'At least 10% of query requests to the lokistack-gateway are responded with 5xx server errors' over an interval of 15 mins.

            Steps to reproduce (commands for steps 2 and 3 are sketched after this list):

            1) Deploy LokiStack and forward logs to Loki as the log store.

            2) Edit the LokiStack CR and set the instance to Unmanaged.

            3) Delete the querier deployment.

            4) Query for logs. The query should fail with the error below:

            logcli -o raw --tls-skip-verify --bearer-token="$(oc whoami -t)" --addr "https://lokistack-dev-openshift-logging.apps.kbharti-clus7.qe.devcluster.openshift.com/api/logs/v1/infrastructure" query '{log_type="infrastructure"}'
            https://lokistack-dev-openshift-logging.apps.kbharti-clus7.qe.devcluster.openshift.com/api/logs/v1/infrastructure/loki/api/v1/query_range?direction=BACKWARD&end=1663097928028028396&limit=30&query=%7Blog_type%3D%22infrastructure%22%7D&start=1663094328028028396
            Error response from server: <html><body><h1>504 Gateway Time-out</h1>
            The server didn't respond in time.
            </body></html>
             (<nil>) attempts remaining: 0
            Query failed: Run out of attempts while querying the server

            5) Keep querying for logs for more than 15 mins so the 5xx errors are sustained.
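
            For reference, steps 2) and 3) as commands (a sketch; 'lokistack-dev' matches this cluster, and spec.managementState is assumed to be the LokiStack field controlling this):

            $ oc -n openshift-logging patch lokistack lokistack-dev --type=merge \
                -p '{"spec":{"managementState":"Unmanaged"}}'
            $ oc -n openshift-logging delete deployment lokistack-dev-querier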

            Gerard Vanloo added a comment:

            rhn-support-kbharti Hmm, I think this is one of two things: either the alert threshold is too high, or the rate of the failures you are hitting the gateway with is too low.

            Like `LokiRequestErrors`, the gateway-focused alerts only activate once the total failure rate reaches 10%, not just after 10+ failed requests. For example, 10 failed queries alongside 100 successful ones in the window is only a ~9% error ratio, below the threshold. Ideally, whatever test case you are using for that alert should be the same here after you remove whichever component.

            Can you confirm that is how this is being tested?

            Kabir Bharti added a comment (edited):

            gvanloo Retested this on the latest cpaas build for 5.5.2. I can see the LokiStackReadRequestErrors and LokiStackWriteRequestErrors alerts defined on the build, but I still don't see the alert firing in the UI.

            Gateway logs:

            This was triggered for more than 15 mins with 502 errors on the gateway.

            $ oc get csv
            NAME                           DISPLAY                            VERSION   REPLACES   PHASE
            cluster-logging.5.5.2          Red Hat OpenShift Logging          5.5.2                Succeeded
            elasticsearch-operator.5.5.2   OpenShift Elasticsearch Operator   5.5.2                Succeeded
            loki-operator.5.5.2            Loki Operator                      5.5.2                Succeeded

            Gerard Vanloo added a comment:

            rhn-support-kbharti Not yet. It has been merged into the Loki upstream, but hasn't made it to midstream yet.

            Kabir Bharti added a comment:

            gvanloo Is this fix included in the latest 5.5.1 cpaas build?

            Gerard Vanloo added a comment (edited):

            I agree with rojacob@redhat.com on this one.

            I have not confirmed this, but there might have been some communication from the ingester to the querier (perhaps through the index-gateway) which caused the ingester to emit a 5XX. That might be a factor here in terms of the alert's sensitivity.

            However, adding an alert for the gateway itself is a good idea and does catch this use case (I've tested this bit). I think it would be good anyway, in case there is some way in which a pod is not ready or available and is causing problems.

            Robert Jacob added a comment:

            While I have not validated this myself yet, I think the alert is working correctly, as it only measures errors in Loki components. The gateway is not a Loki component, so it does not have the loki_request_* metrics.

            This means the documentation probably needs some improvement to make this distinction clearer. Additionally, we should probably have an alert on the gateway metrics as well, because the gateway is the "most user-facing component" of them all, and leave LokiRequestErrors for diagnosing internal errors. (A sketch of such a check follows.)
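
            As a hedged sketch of what such a gateway-side read-error check could look like (the http_requests_total metric name and its labels here are illustrative assumptions, not the operator's actual rule):

            $ oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- \
                curl -sG http://localhost:9090/api/v1/query --data-urlencode \
                'query=sum by (namespace, job) (rate(http_requests_total{job=~".*gateway.*", code=~"5..", handler=~"query.*"}[5m]))
                     / sum by (namespace, job) (rate(http_requests_total{job=~".*gateway.*", handler=~"query.*"}[5m])) > 0.10'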
