Epic
Resolution: Unresolved
Major
prometheus-DoS
Quality / Stability / Reliability
NEW
To Do
50% To Do, 0% In Progress, 50% Done
75% (Medium)
3
For multiple clusters, see:
https://issues.redhat.com/browse/OCPBUGS-15337
https://issues.redhat.com/browse/OCPBUGS-4186
Prometheus was flooded: all of its --web.max-connections slots (512 by default) were continually occupied by query connections, and the network stack queues were also filled with query connections, so the probes could not run.
To make such problems easier to debug, we can:
- Check with the CCX team whether a rule can be added to detect SYN flooding (in general) from a sosreport: https://issues.redhat.com/browse/INSIGHTOCP-1307
- Add a Prometheus alert that fires when the number of connections Prometheus is processing approaches the maximum. Seen from another angle, the probes were failing because Prometheus could not accept() and process their connections while it was already handling its maximum (--web.max-connections).
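The condition such an alert would evaluate can be sketched as follows. This is only an illustration under stated assumptions, not the rule that will ship: it assumes the number of open connections can be derived from a pair of cumulative accepted/closed counters for the HTTP listener (for example metrics along the lines of net_conntrack_listener_conn_accepted_total and net_conntrack_listener_conn_closed_total, whose availability should be checked against the Prometheus version in use). The `ListenerCounters` type, the function names, and the 80% threshold are hypothetical.

```python
from dataclasses import dataclass

# Default of --web.max-connections, as stated in this issue.
DEFAULT_MAX_CONNECTIONS = 512


@dataclass
class ListenerCounters:
    """Cumulative counters for one listener (hypothetical container type)."""
    accepted_total: int
    closed_total: int


def open_connections(counters: ListenerCounters) -> int:
    """Connections accepted but not yet closed, clamped at zero."""
    return max(counters.accepted_total - counters.closed_total, 0)


def approaching_max(
    counters: ListenerCounters,
    max_connections: int = DEFAULT_MAX_CONNECTIONS,
    threshold: float = 0.8,  # assumed ratio; tune to taste
) -> bool:
    """Return True when open connections reach `threshold` of the maximum."""
    if max_connections <= 0:
        raise ValueError("max_connections must be positive")
    return open_connections(counters) / max_connections >= threshold


if __name__ == "__main__":
    sample = ListenerCounters(accepted_total=10_000, closed_total=9_500)
    print(open_connections(sample), approaching_max(sample))
```

In an actual alerting rule the same comparison would be written in PromQL, with a `for` duration so that short bursts do not fire the alert.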
- is blocked by: MON-3390 Write post-mortem document on liveness probes being unresponsive due to VPA (Closed)
- is related to: OCPBUGS-4186 Prometheus ReadinessProbes failing after upgrade to OpenShift 4.10 (Closed)