  1. Red Hat OpenShift Data Science
  2. RHODS-5205

Alert "RHODS Jupyter Probe Success Burn Rate" does not fire when kubeflow pods are down


    • RHODS 1.18

      Description of problem:

      The alert "RHODS Jupyter Probe Success Burn Rate" does not fire when the Kubeflow notebook controller or ODH notebook controller pods are down, so it fails to flag an outage in which users cannot spawn notebooks.

      This was identified while verifying the PR with the monitoring changes for kubeflow:
      https://github.com/red-hat-data-services/odh-deployer/pull/256#discussion_r955190909

      The problem seems to be that:

      Prerequisites (if any, like setup, operators/versions):

      Steps to Reproduce

      • Scale rhods-operator down to 0 replicas
      • Scale notebook-controller-deployment down to 0 replicas
      • Scale odh-notebook-controller-manager down to 0 replicas
      • Wait 5 minutes until the alerts "Kubeflow notebook controller pod is not running" and "ODH notebook controller pod is not running" are firing
      • Try to spawn a notebook. You'll see the error "Failed to create a notebook, please try again later"
      • Verify that alert "RHODS Jupyter Probe Success Burn Rate" is not firing
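
The steps above can be sketched as a small Python helper. This is only an illustration, not the procedure used during verification: the namespaces (redhat-ods-operator, redhat-ods-applications) are assumptions that the issue does not state, and the helper names build_scale_commands, scale_all and is_firing are hypothetical. is_firing expects a JSON body shaped like the response of the Prometheus /api/v1/alerts endpoint, and assumes the alertname label matches the alert title quoted in this issue.

```python
import shlex
import subprocess

# Assumed namespaces; the issue does not state them, so adjust for your cluster.
DEPLOYMENTS = [
    ("redhat-ods-operator", "rhods-operator"),
    ("redhat-ods-applications", "notebook-controller-deployment"),
    ("redhat-ods-applications", "odh-notebook-controller-manager"),
]


def build_scale_commands(replicas=0):
    """Return one `oc scale` argv list per deployment, in the order of the steps."""
    return [
        ["oc", "scale", f"deployment/{name}", "-n", namespace, f"--replicas={replicas}"]
        for namespace, name in DEPLOYMENTS
    ]


def scale_all(replicas=0, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for argv in build_scale_commands(replicas):
        print(shlex.join(argv))
        if not dry_run:
            subprocess.run(argv, check=True)


def is_firing(alerts_response, alert_name):
    """Check a Prometheus /api/v1/alerts JSON body for a firing alert by name."""
    alerts = alerts_response.get("data", {}).get("alerts", [])
    return any(
        alert.get("labels", {}).get("alertname") == alert_name
        and alert.get("state") == "firing"
        for alert in alerts
    )


scale_all()  # dry run: only prints the commands
```

Calling scale_all(dry_run=False) would run the commands against the currently logged-in cluster. Scaling rhods-operator down first matters because the operator would otherwise restore the controller deployments.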

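      Actual results:

      The alert "RHODS Jupyter Probe Success Burn Rate" does not fire, even though the notebook controller pods are down and notebooks cannot be spawned.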

      Expected results:

      I think the alert "RHODS Jupyter Probe Success Burn Rate" should fire whenever Jupyter is not working properly, including when notebooks cannot be spawned.

      Reproducibility (Always/Intermittent/Only Once):

      Always

      Build Details:

      RHODS 1.16.0-hotfix-2fada07

      Workaround:

      Unknown

      Additional info:

      blackbox-exporter-logs.txt contains the blackbox exporter logs captured while the Kubeflow pods were down.

              rh-ee-atheodor Adriana Theodorakopoulou
              rhn-support-jgarciao Jorge Garcia Oncins
              Jorge Garcia Oncins Jorge Garcia Oncins