  1. OpenShift Monitoring
  2. MON-3390

Write post-mortem document on liveness probes being unresponsive due to VPA


    • Type: Task
    • Resolution: Done
    • Priority: Critical
    • MON Sprint 243, MON Sprint 245

      We know of at least 3 clusters where the liveness probes failed because the VPA recommender overwhelmed the Prometheus web server (maximum number of TCP connections reached), leading to continuous restarts of the Prometheus pods.

      • OCPBUGS-18971
      • OCPBUGS-15337
      • OCPBUGS-4186 (suspicion, to be confirmed)
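
      The failure mode described above can be illustrated with a toy model. This is only a sketch under stated assumptions, not how Prometheus or the kubelet are implemented: the class and function names are hypothetical, the connection cap, the per-round client load and the failure threshold are illustrative values, and the heavy client simply stands in for the VPA recommender holding connections open.

```python
from dataclasses import dataclass


@dataclass
class WebServer:
    """Toy model of an HTTP server with a hard cap on concurrent connections."""
    max_connections: int
    in_use: int = 0

    def try_accept(self) -> bool:
        """Accept a connection if a slot is free; otherwise refuse it."""
        if self.in_use >= self.max_connections:
            return False
        self.in_use += 1
        return True

    def release(self) -> None:
        """Free one slot when a connection closes."""
        self.in_use = max(0, self.in_use - 1)


def probe_rounds_until_restart(server, client_connections, failure_threshold, max_rounds=10):
    """Return the probe round at which a restart would be triggered, or None.

    Each round, a heavy client (hypothetical stand-in for the VPA recommender)
    tries to open `client_connections` long-lived connections, then the
    liveness probe tries to open one short-lived connection.
    """
    consecutive_failures = 0
    for round_number in range(1, max_rounds + 1):
        for _ in range(client_connections):
            server.try_accept()  # accepted connections stay open; refused ones are dropped
        if server.try_accept():
            server.release()  # probe succeeded and closed its connection
            consecutive_failures = 0
        else:
            consecutive_failures += 1  # no free slot: the probe fails
        if consecutive_failures >= failure_threshold:
            return round_number
    return None


# Illustrative values only: a cap of 512 connections, 200 new client
# connections per probe period, restart after 3 consecutive failures.
restart_round = probe_rounds_until_restart(WebServer(max_connections=512), 200, 3)
```

      In this model the server stays healthy, but once the client has exhausted the connection slots the probe can no longer connect, so the consecutive-failure threshold is reached and the pod would be restarted. A light client (for example 10 connections per round) never exhausts the slots within the simulated rounds, and the function returns None.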

      We need to describe the issue exhaustively, along with how we can mitigate it. OCP already has a post-mortem template that we can use.

      DoD

      • Post-mortem document reviewed by the team.

              spasquie@redhat.com Simon Pasquier
              Ayoub Mrini, Daniel Mohr, Jan Fajerski, Simon Pasquier
