  1. OpenShift Bugs
  2. OCPBUGS-14255

[4.14] Add Controller health to CEO liveness probe

    • Type: Bug
    • Resolution: Done
    • Priority: Minor
    • 4.14.0
    • 4.14
    • Etcd
    • None
    • Quality / Stability / Reliability
    • False
    • 1
    • Low
    • No
    • None
    • None
    • ETCD Sprint 238
    • 1
    • Done
    • Bug Fix
    • N/A
    • None
    • None
    • None
    • None

      We have already seen several forum cases and bugs where restarting the CEO (cluster-etcd-operator) fixed the problem. A liveness probe could have resolved these automatically.

      We previously traced these cases down to stuck or deadlocked controllers, missing timeouts in gRPC calls, and other causes we have not yet identified. Since the list of possible failures is large, we should add a liveness probe to the CEO that periodically checks the following:

      • every controller has completed a sync at least once in the last 5 to 10 minutes
      • on failure, produce a goroutine dump to analyse what went wrong

      This check should not indicate whether the etcd cluster itself is healthy; it only covers the health of the CEO process.

              tjungblu@redhat.com Thomas Jungblut
              tjungblu@redhat.com Thomas Jungblut
              None
              None
              Ge Liu Ge Liu
              None
              Votes: 0
              Watchers: 4