Type: Task
Resolution: Done
Priority: Critical
Fix Version/s: None
Affects Version/s: None
Component/s: None
Labels: None
Story Points: 3
Blocked: False
Blocked Reason: None
Ready: False
Docs QE Status: NEW
QE Status: NEW
Intelligence Requested:
Market:
Sprint: MON Sprint 243, MON Sprint 245

SFDC Cases Links:
SFDC Cases Open:
SFDC Cases Counter:

We know of at least three clusters where the VPA recommender overwhelmed the Prometheus web server (the maximum number of TCP connections was reached), causing the liveness probes to fail and the Prometheus pods to restart continuously.

OCPBUGS-18971
OCPBUGS-15337
~~OCPBUGS-4186~~ (suspicion, to be confirmed)

We need to describe the issue exhaustively, along with how we can mitigate it. OCP already has a post-mortem template that we can use.

DoD

Post-mortem document reviewed by the team.

blocks: MON-3292 Make Prometheus flooding/DoS problems easier to detect (To Do)
is triggered by: OCPBUGS-4186 Prometheus ReadinessProbes failing after upgrade to OpenShift 4.10 (Closed)
links to: Post mortem document

Assignee: Simon Pasquier
Reporter: Simon Pasquier
Contributors: Ayoub Mrini, Daniel Mohr, Jan Fajerski, Simon Pasquier
Votes: 0
Watchers: 3
Created: 2023/09/25 7:58 AM
Updated: 2023/12/04 3:35 PM
Resolved: 2023/12/04 3:22 PM
