Type: Bug
Resolution: Done
Priority: Major
Fix Version/s: RHODS_1.3.0_GA
Affects Version/s: RHODS_1.1_GA
Component/s: Monitoring
Labels:
- groomed
- idh-team

Story Points: 3
Blocked: False
Ready: False
Acceptance Criteria: None
Automated: No
CDW devel_ack:
CDW docs_ack:
CDW pm_ack:
CDW qa_ack:
CDW release:
Fixed in Build: 1.3.0-6
Regression: No
Target Release: RHODS_1.3.0_GA
Test Blocker: No
Test Coverage: Yes
Watchlist Impact: None
Git Pull Request: https://github.com/red-hat-data-services/odh-deployer/pull/190
Market:
Sprint: IDH Sprint 13
SFDC Cases Counter:
SFDC Cases Open:
SFDC Cases Links:

Description of problem:

In summary, running a long Jupyter notebook test (about 3h) appears to make the alert "RHODS Route Error Burn Rate (for 3h)" fire for JupyterHub, even though JupyterHub works fine during that time.

Prerequisites (if any, like setup, operators/versions):

Steps to Reproduce

We have seen that running the Jenkins job that checks PRs (rhods-ci-pr-test) puts the Prometheus alert "RHODS Route Error Burn Rate" (for 3h) into PENDING state (yellow).

The Jenkins job takes 1h 30 min. If you run it a second time, the same alert moves to FIRING (red).

The Jenkins job doesn't do anything destructive to the cluster; it runs JupyterHub notebooks and performs other actions. So, in my opinion, this alert shouldn't fire, especially if it pages the SRE team (I'm not sure whether it does, but it probably does).

IMPORTANT: even while the alert was firing, JupyterHub seemed to work fine.

I've done the test today in this cluster (adding Anish and Maulik as cluster-admins):

https://console-openshift-console.apps.ods-qe-b2.sr2s.s1.devshift.org/

More info:

The first PR job was started at 13:31. Grafana shows that a few minutes later haproxy_backend_http_responses_total:burnrate1h, burnrate2h and burnrate6h rise above 0:

haproxy_backend_http_responses_total:burnrate1h {route="jupyterhub"} (purple in the image)
haproxy_backend_http_responses_total:burnrate2h {route="jupyterhub"} (red in the image)
haproxy_backend_http_responses_total:burnrate6h {route="jupyterhub"} (cyan in the image)
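
For readers unfamiliar with burn-rate metrics, here is a minimal sketch of how such a value is commonly derived from response counts. It is not the recording rule shipped with RHODS, which this ticket does not show. The function name `burn_rate`, the 99% SLO target, and the idea that errored responses are counted against all responses are assumptions made for illustration only.

```python
def burn_rate(error_responses: float, total_responses: float,
              slo_target: float = 0.99) -> float:
    """Hypothetical helper: error-budget burn rate over one time window.

    A burn rate of 1.0 means errors arrive exactly at the rate the SLO allows;
    anything above 0 means at least one errored response in the window.
    The 0.99 default is an assumed SLO target, not a value from this ticket.
    """
    if total_responses <= 0:
        return 0.0  # no traffic in the window, nothing burned
    error_ratio = error_responses / total_responses
    error_budget = 1.0 - slo_target  # fraction of requests allowed to fail
    return error_ratio / error_budget


# With little traffic, a handful of errored responses is enough to push the
# burn rate well above 1, even if the service looks healthy to a user.
print(burn_rate(error_responses=2, total_responses=40))
print(burn_rate(error_responses=0, total_responses=40))
```

Under these assumptions, a route that sees few requests overall is sensitive to a small number of errored responses, which is one possible reading of the graphs, not a confirmed root cause.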

At 14:02 "RHODS Route Error Burn Rate" (for 3h) is activated as PENDING (yellow). At 15:23 I started the second Jenkins job. A few minutes later the burnrateX values rise again.

At 17:03 "RHODS Route Error Burn Rate" (for 3h) is activated as FIRING (red color)
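
The timeline above can be summarized as offsets from the start of the first job. This is only a small arithmetic sketch; the clock times are taken from this report, while the helper name `minutes_since_start` and the event labels are hypothetical.

```python
from datetime import datetime

# Clock times as reported in this ticket (same day, local time).
EVENTS = [
    ("13:31", "first PR job started"),
    ("14:02", "alert PENDING"),
    ("15:23", "second PR job started"),
    ("17:03", "alert FIRING"),
]


def minutes_since_start(events):
    """Return (label, minutes elapsed since the first event) for each event."""
    start = datetime.strptime(events[0][0], "%H:%M")
    offsets = []
    for clock, label in events:
        elapsed = datetime.strptime(clock, "%H:%M") - start
        offsets.append((label, int(elapsed.total_seconds() // 60)))
    return offsets


for label, minutes in minutes_since_start(EVENTS):
    print(f"{minutes:4d} min  {label}")
```

The alert went to PENDING 31 minutes after the first job started, and to FIRING 212 minutes (3h 32 min) after it started, that is, 100 minutes after the second job began.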

Actual results:

Expected results:

Reproducibility (Always/Intermittent/Only Once):

I found the same behavior in 2 different clusters.

Build Details:

RHODS 1.1.1-57 installed using our script

Workaround:

Additional info:


Attachments:
- grafana-alerts.png (96 kB), 2021/10/21 5:04 PM, Jorge Garcia Oncins
- grafana-burnrate.png (256 kB), 2021/10/21 5:04 PM, Jorge Garcia Oncins
- image-2021-11-04-12-52-42-653.png (63 kB), 2021/11/04 4:52 PM, Anish Asthana
- prometheus-alerts-after-test.png (197 kB), 2021/10/21 5:04 PM, Jorge Garcia Oncins
- Route Error Burn Rate-RHODS-1.3.0.png (272 kB), 2021/12/03 2:47 PM, Jorge Garcia Oncins

Assignee: Anish Asthana

Reporter: Jorge Garcia Oncins

QA Contact: Jorge Garcia Oncins

Votes: 0

Watchers: 6

Created: 2021/10/21 5:05 PM

Updated: 2022/11/03 8:51 AM

Resolved: 2021/11/11 1:22 PM
