  1. Red Hat OpenShift Data Science
  2. RHODS-2101

"RHODS Route Error Burn Rate" (for 3h) alert fired after runing jupyterhub notebooks for 3 hours


    • IDH Sprint 13

      Description of problem:

      In summary, if you run a long Jupyter notebook test (3h), the alert "RHODS Route Error Burn Rate" (for 3h) fires for jupyterhub, even though jupyterhub works fine during that time.

       

      Prerequisites (if any, like setup, operators/versions):

      Steps to Reproduce

       

      We have seen that, if you run the jenkins job to check PRs (rhods-ci-pr-test), the prometheus alert "RHODS Route Error Burn Rate" (for 3h) is activated as PENDING (yellow color). 

      As the jenkins job takes 1h 30 mins, if you run it again the same alert is activated as FIRING (red color).

      The jenkins job doesn't do anything destructive to the cluster; it only runs jupyterhub notebooks and performs other actions. So, in my opinion, this alert shouldn't be firing, especially if it pages the SRE team (I'm not sure whether it does, but probably).

       

      IMPORTANT: even while the alert was firing, Jupyterhub seemed to work fine.

       

      I've done the test today in this cluster (adding Anish and Maulik as cluster-admins):

      https://console-openshift-console.apps.ods-qe-b2.sr2s.s1.devshift.org/

       

       

      More info:

      The first PR job was started at 13:31. Grafana shows that, a few minutes later, haproxy_backend_http_responses_total:burnrate1h, burnrate2h and burnrate6h rise above 0:

      • haproxy_backend_http_responses_total:burnrate1h {route="jupyterhub"} (purple color in the image)
      • haproxy_backend_http_responses_total:burnrate2h {route="jupyterhub"} (red color in the image)
      • haproxy_backend_http_responses_total:burnrate6h {route="jupyterhub"} (cyan color in the image)
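
      For readers unfamiliar with these metrics, here is a minimal sketch of how a burn rate over one window is commonly defined: the fraction of failed responses divided by the error budget (1 minus the SLO target). The actual RHODS recording rules behind burnrate1h, burnrate2h and burnrate6h are not shown in this report, so the function name, the response counts and the 99% SLO target below are assumptions for illustration only.

```python
def burn_rate(error_responses: int, total_responses: int, slo_target: float = 0.99) -> float:
    """Return how fast the error budget is consumed over one window.

    1.0 means errors arrive exactly at the rate the SLO allows;
    0.0 means no errors were observed in the window.
    The 0.99 default is an assumed SLO target, not the RHODS value.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_responses == 0:
        return 0.0  # no traffic in the window, so no budget is consumed
    error_ratio = error_responses / total_responses
    error_budget = 1.0 - slo_target
    return error_ratio / error_budget


# Hypothetical counts, not taken from the cluster in this report.
print(burn_rate(error_responses=0, total_responses=500))   # window without errors
print(burn_rate(error_responses=10, total_responses=500))  # 2% errors against a 1% budget
```

      Under this definition, a value above 0 only means that at least one response in the window was counted as an error; whether the alert is justified depends on the threshold the alert rule compares the burn rate against.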

      At 14:02, "RHODS Route Error Burn Rate" (for 3h) is activated as PENDING (yellow color). At 15:23 I started the second jenkins job. A few minutes later, the burnrateX values rise again.

      At 17:03, "RHODS Route Error Burn Rate" (for 3h) is activated as FIRING (red color).
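
      The timing above is consistent with how Prometheus handles a `for` clause: an alert becomes PENDING when its expression first evaluates to true and moves to FIRING only if the expression stays true at every evaluation for the whole `for` duration. The sketch below checks the reported times against that behavior. It assumes that the "(for 3h)" in the alert name corresponds to a `for: 3h` clause, which this report does not confirm, and it uses a placeholder calendar date because only clock times are given.

```python
from datetime import datetime, timedelta

# Clock times reported above; the date is a placeholder, not the real test date.
PLACEHOLDER_DATE = "2000-01-01"
pending_at = datetime.fromisoformat(f"{PLACEHOLDER_DATE} 14:02")
firing_at = datetime.fromisoformat(f"{PLACEHOLDER_DATE} 17:03")

# Assumption: the alert rule uses `for: 3h`, as its name suggests.
for_duration = timedelta(hours=3)

# Earliest moment the alert could move from PENDING to FIRING.
earliest_firing = pending_at + for_duration

# Gap between that moment and the observed FIRING time.
delay_beyond_for = firing_at - earliest_firing

print(earliest_firing.strftime("%H:%M"))
print(delay_beyond_for)
```

      If the assumption holds, the alert fired almost as early as it could, which suggests the alert expression remained true for the full three hours, including the gap between the two jenkins jobs.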

       

      Actual results:

      Expected results:

      Reproducibility (Always/Intermittent/Only Once):

      I found the same behavior in 2 different clusters.

      Build Details:

      RHODS 1.1.1-57 installed using our script

      Workaround:

      Additional info:

        1. grafana-alerts.png
          96 kB
          Jorge Garcia Oncins
        2. grafana-burnrate.png
          256 kB
          Jorge Garcia Oncins
        3. image-2021-11-04-12-52-42-653.png
          63 kB
          Anish Asthana
        4. prometheus-alerts-after-test.png
          197 kB
          Jorge Garcia Oncins
        5. Route Error Burn Rate-RHODS-1.3.0.png
          272 kB
          Jorge Garcia Oncins

              aasthana@redhat.com Anish Asthana
              rhn-support-jgarciao Jorge Garcia Oncins
              Jorge Garcia Oncins Jorge Garcia Oncins