-
Task
-
Resolution: Done
-
Critical
MON Sprint 243, MON Sprint 245
We know of at least 3 clusters where the liveness probes failed because the VPA recommender overwhelmed the Prometheus web server (maximum number of TCP connections reached), leading to continuous restarts of the Prometheus pods.
- OCPBUGS-18971
- OCPBUGS-15337
- OCPBUGS-4186 (suspected, to be confirmed)
We need to describe the issue exhaustively, along with how it can be mitigated. OCP already has a post-mortem template that we can use.
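As a starting point for the post-mortem, the failure mode can be sketched as a simple saturation check: once the number of open connections reaches the web server limit, new requests (including probes) cannot be served. This is only an illustration, not the team's detection method. The function names and the 0.9 threshold are hypothetical, the default of 512 assumes Prometheus is running with its default `--web.max-connections` value, and the two counters are assumed to come from metrics such as `net_conntrack_listener_conn_accepted_total` and `net_conntrack_listener_conn_closed_total`, which should be verified against the affected clusters.

```python
def connection_saturation(accepted_total: float,
                          closed_total: float,
                          max_connections: int = 512) -> float:
    """Return the fraction of the connection limit currently in use.

    accepted_total / closed_total are cumulative counters of accepted and
    closed connections (assumed inputs); their difference approximates the
    number of connections that are open right now.
    max_connections defaults to 512, assuming the Prometheus default for
    --web.max-connections is unchanged.
    """
    if max_connections <= 0:
        raise ValueError("max_connections must be positive")
    # Clamp at zero in case the counters were scraped at slightly different times.
    open_connections = max(accepted_total - closed_total, 0.0)
    return open_connections / max_connections


def is_near_limit(accepted_total: float,
                  closed_total: float,
                  max_connections: int = 512,
                  threshold: float = 0.9) -> bool:
    """Flag the server as at risk when usage reaches the (hypothetical) threshold."""
    return connection_saturation(accepted_total, closed_total, max_connections) >= threshold
```

For example, `is_near_limit(1000, 500)` reports a server with 500 open connections out of 512 as at risk, which is the state in which probe requests would start to fail.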
DoD
- Post-mortem document reviewed by the team.
- blocks
  MON-3292 Make Prometheus flooding/DoS problems easier to detect (To Do)
- is triggered by
  OCPBUGS-4186 Prometheus ReadinessProbes failing after upgrade to OpenShift 4.10 (Closed)
- links to