- Bug
- Resolution: Done
- Critical
- RHODS_1.16.0_GA
- 1.18.0-2
- RHODS 1.18
Description of problem:
The alert "RHODS Jupyter Probe Success Burn Rate" does not fire when the Kubeflow notebook controller or ODH notebook controller pods are down, so it fails to detect an outage in which users cannot spawn notebooks.
This was identified while verifying the PR with the monitoring changes for kubeflow:
https://github.com/red-hat-data-services/odh-deployer/pull/256#discussion_r955190909
The problem seems to be that:
- The alert checks the availability of this URL: https://rhods-dashboard-redhat-ods-applications.apps.qeaisrhods-mon.cos9.s1.devshift.org/notebookController/spawner
- Even when the Kubeflow pods are not running, the URL still renders the spawner page correctly and returns HTTP status code 200
- The failure only appears when the user clicks "Start Server", which shows the error "Failed to create a notebook, please try again later"
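The mismatch described above can be illustrated with a small model. This is only a sketch, not the actual probe or dashboard code: `ClusterState`, `spawner_page_status`, `status_only_probe`, and `can_spawn_notebook` are hypothetical names, and the 503 returned when the dashboard itself is down is an assumed value. The only behavior taken from this report is that the spawner page returns 200 while the notebook controllers are down.

```python
from dataclasses import dataclass


@dataclass
class ClusterState:
    """Hypothetical, simplified view of the components involved."""
    dashboard_up: bool
    notebook_controllers_up: bool


def spawner_page_status(state: ClusterState) -> int:
    # Per this report, the spawner page is rendered with HTTP 200 even when
    # the notebook controllers are down. The 503 is an assumed placeholder
    # for "dashboard unavailable".
    return 200 if state.dashboard_up else 503


def status_only_probe(state: ClusterState) -> bool:
    # A probe that only inspects the HTTP status treats any 2xx as success.
    return 200 <= spawner_page_status(state) < 300


def can_spawn_notebook(state: ClusterState) -> bool:
    # What the user actually cares about: spawning needs the controllers too.
    return state.dashboard_up and state.notebook_controllers_up


outage = ClusterState(dashboard_up=True, notebook_controllers_up=False)
print("probe success:", status_only_probe(outage))
print("notebook can be spawned:", can_spawn_notebook(outage))
```

In this model the probe reports success during the controller outage, so a burn-rate alert built on that probe's success metric would have no failures to count.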
Prerequisites (if any, like setup, operators/versions):
Steps to Reproduce
- Scale rhods-operator down to 0 replicas
- Scale notebook-controller-deployment down to 0 replicas
- Scale odh-notebook-controller-manager down to 0 replicas
- Wait 5 minutes until the alerts "Kubeflow notebook controller pod is not running" and "ODH notebook controller pod is not running" are firing
- Try to spawn a notebook. You'll see the error "Failed to create a notebook, please try again later"
- Verify that the alert "RHODS Jupyter Probe Success Burn Rate" is not firing
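The scale-down steps above could be scripted roughly as follows. This sketch only prints the `oc scale` commands rather than running them. The namespaces (`redhat-ods-operator`, `redhat-ods-applications`) and the assumption that all three workloads are Deployments are not stated in this report and should be checked against the cluster; `scale_command` is a hypothetical helper.

```python
import shlex

# (namespace, deployment name). The namespaces are assumptions, not taken
# from this report; adjust them to match the cluster under test.
DEPLOYMENTS = [
    ("redhat-ods-operator", "rhods-operator"),
    ("redhat-ods-applications", "notebook-controller-deployment"),
    ("redhat-ods-applications", "odh-notebook-controller-manager"),
]


def scale_command(namespace: str, name: str, replicas: int = 0) -> list[str]:
    """Build the `oc scale` argument list for one deployment."""
    return [
        "oc", "scale", f"deployment/{name}",
        f"--replicas={replicas}", "-n", namespace,
    ]


# The operator is listed first, matching the order of the steps above.
for namespace, name in DEPLOYMENTS:
    print(shlex.join(scale_command(namespace, name)))
```

Each argument list can be passed to `subprocess.run(..., check=True)` to execute the command instead of printing it.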
Actual results:
The alert "RHODS Jupyter Probe Success Burn Rate" does not fire, even though notebooks cannot be spawned.
Expected results:
I think the alert "RHODS Jupyter Probe Success Burn Rate" should fire whenever Jupyter is not working properly, including when notebooks cannot be spawned.
Reproducibility (Always/Intermittent/Only Once):
Always
Build Details:
RHODS 1.16.0-hotfix-2fada07
Workaround:
Unknown
Additional info:
blackbox-exporter-logs.txt contains the blackbox exporter logs captured while the Kubeflow pods were down.