  1. OpenShift Bugs
  2. OCPBUGS-5806

kube-controller-manager crashes when kubernetes API is down


    • Type: Bug
    • Resolution: Duplicate
    • Priority: Major
    • None
    • 4.9
    • None
    • None
    • Rejected
    • False

      Description of problem:

      bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2083757
      
      If the Kubernetes API is down while kube-controller-manager is running the initOpCache function [1], it logs a fatal error and crashes.
      
      A separate bz, https://bugzilla.redhat.com/show_bug.cgi?id=2082628, was created to deal with the kubelet issues we observe following the crash; this bz is about the crash itself.
      
      Version-Release number of selected component (if applicable):
      At least as early as 4.9
      
      How reproducible:
      Not sure. See https://bugzilla.redhat.com/show_bug.cgi?id=2082628 for the context in which we noticed this crash. It is not entirely clear whether the crash is seen rarely because of kubelet's behavior or because the crash itself is rare.
      
      Steps to Reproduce:
      See https://bugzilla.redhat.com/show_bug.cgi?id=2082628
      
      Actual results:
      kube-controller-manager crashes
      
      Expected results:
      kube-controller-manager should tolerate API downtime without crashing, because crashes accumulate in metrics and events and trigger alerts and CI test failures.
      
      Additional info:
      None
      
      [1] https://github.com/openshift/kubernetes/blob/fe7796f337ea0d35bc3e6b5428d63685d1833cb5/pkg/controller/namespace/deletion/namespaced_resources_deleter.go#L159-L165
      
      
      It looks like this controller should behave more like GC does: it should try to scrape the data and, if it can't, simply retry after a while.
      This is definitely something that should be pursued upstream. I'll bump its priority so that it ideally lands in 1.25.


              fkrepins@redhat.com Filip Krepinsky
              ying zhou