Type: Bug
Resolution: Done
Priority: Critical
Fix Version/s: RHODS_1.18.0_GA
Affects Version/s: RHODS_1.16.0_GA
Component/s: Monitoring
Labels:
- eng
- groomed

Blocked:
False
Blocked Reason:
None
Ready:
False
Automated:
Yes
CDW devel_ack:
CDW docs_ack:
CDW pm_ack:
CDW qa_ack:
CDW release:
Fixed in Build:
1.18.0-2
Regression:
No
Target Release:

RHODS_1.18.0_GA
Test Blocker:
No
Test Coverage:

Yes
Watchlist Impact:
None
Git Pull Request:
https://github.com/red-hat-data-services/odh-deployer/pull/263
Test Link:
https://polarion.engineering.redhat.com/polarion/redirect/project/OpenDataHub/workitem?id=ODS-1700

Sprint:
RHODS 1.18

SFDC Cases Counter:
SFDC Cases Open:
SFDC Cases Links:

Description of problem:

The alert "RHODS Jupyter Probe Success Burn Rate" does not fire when the Kubeflow notebook controller or ODH notebook controller pods are down, which makes the alert of little use.

This was identified while verifying the PR with the monitoring changes for kubeflow:
https://github.com/red-hat-data-services/odh-deployer/pull/256#discussion_r955190909

The problem seems to be that:

- The alert checks the availability of this URL: https://rhods-dashboard-redhat-ods-applications.apps.qeaisrhods-mon.cos9.s1.devshift.org/notebookController/spawner
- Even when the Kubeflow pods are not running, the URL still renders the spawner page correctly and returns HTTP status 200.
- The failure only appears when the user clicks "Start Server", which shows the error "Failed to create a notebook, please try again later".
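
The following is a minimal sketch of why a probe on this URL cannot feed the alert. It assumes the probe judges success purely by a 2xx status code; `ProbeSample`, `probe_success`, and `failure_ratio` are hypothetical names used for illustration and are not taken from the RHODS monitoring configuration.

```python
from dataclasses import dataclass


@dataclass
class ProbeSample:
    controllers_running: bool  # state of the notebook controller pods
    status_code: int           # HTTP status returned by the spawner URL


def probe_success(sample: ProbeSample) -> int:
    """Status-only check (assumption): 1 for any 2xx response, 0 otherwise."""
    return 1 if 200 <= sample.status_code < 300 else 0


def failure_ratio(samples: list[ProbeSample]) -> float:
    """Fraction of failed probes, the kind of value a burn-rate alert looks at."""
    if not samples:
        return 0.0
    failed = sum(1 - probe_success(s) for s in samples)
    return failed / len(samples)


# Per the report, the spawner page still returns 200 while the controllers are down.
outage = [ProbeSample(controllers_running=False, status_code=200) for _ in range(10)]
print(failure_ratio(outage))  # 0.0
```

Because every sample during the outage still counts as a success, the failure ratio stays at zero and a threshold on it is never crossed, which matches the behaviour described above.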

Prerequisites (if any, like setup, operators/versions):

Steps to Reproduce

- Scale rhods-operator down to 0 replicas.
- Scale notebook-controller-deployment down to 0 replicas.
- Scale odh-notebook-controller-manager down to 0 replicas.
- Wait 5 minutes until the alerts "Kubeflow notebook controller pod is not running" and "ODH notebook controller pod is not running" are firing.
- Try to spawn a notebook. You'll see the error "Failed to create a notebook, please try again later".
- Verify that the alert "RHODS Jupyter Probe Success Burn Rate" is not firing.
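
The scale-down steps above could be scripted roughly as follows. This sketch only builds and prints the `oc scale` commands; the namespaces `redhat-ods-operator` and `redhat-ods-applications` are assumptions (the report does not name them), and `scale_command` is a hypothetical helper.

```python
import shlex

# (namespace, deployment) pairs in the order given in the steps.
# The namespaces are assumptions; adjust them to the cluster under test.
TARGETS = [
    ("redhat-ods-operator", "rhods-operator"),
    ("redhat-ods-applications", "notebook-controller-deployment"),
    ("redhat-ods-applications", "odh-notebook-controller-manager"),
]


def scale_command(namespace: str, deployment: str, replicas: int = 0) -> list[str]:
    """Build the argument list for scaling one deployment with `oc scale`."""
    return [
        "oc", "scale", f"deployment/{deployment}",
        f"--replicas={replicas}", "-n", namespace,
    ]


commands = [scale_command(ns, name) for ns, name in TARGETS]
for cmd in commands:
    # Printed for review; pass each list to subprocess.run(cmd, check=True) to execute.
    print(shlex.join(cmd))
```

Printing the commands first makes it easy to confirm the namespaces before anything is scaled down on a shared cluster.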

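Actual results:

The alert "RHODS Jupyter Probe Success Burn Rate" does not fire, even though notebooks cannot be spawned.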

Expected results:

I think the alert "RHODS Jupyter Probe Success Burn Rate" should fire whenever Jupyter is not working properly, for example when notebooks cannot be spawned.

Reproducibility (Always/Intermittent/Only Once):

Always

Build Details:

RHODS 1.16.0-hotfix-2fada07

Workaround:

Unknown

Additional info:

The attached blackbox-exporter-logs.txt contains the blackbox exporter logs captured while the Kubeflow pods were down.

Attachments:
- blackbox-exporter-logs.txt (7 kB, 2022/09/13 2:56 PM)
- error-spawning-notebook-when-kubeflow-pods-are-not-running.png (70 kB, 2022/09/13 2:52 PM)

links to

red-hat-data-services/odh-deployer#263: Fix RHODS Jupyter Probe Success Burn Rate

Assignee: Adriana Theodorakopoulou

Reporter: Jorge Garcia Oncins

QA Contact: Jorge Garcia Oncins

Votes: 0

Watchers: 4

Created: 2022/09/13 2:50 PM

Updated: 2022/11/11 2:20 PM

Resolved: 2022/10/10 5:37 PM
