  1. OpenShift Workloads
  2. WRKLDS-358 Expose additional information about GC and Quota under /debug endpoint
  3. WRKLDS-724

kube-controller-manager should be more tolerant of API downtime and not crash


Details

    • Sub-task
    • Resolution: Unresolved
    • OCPSTRAT-46 - Strategic Upstream Work - OCP Control Plane and Node Lifecycle Group
    • Workloads Sprint 208, Workloads Sprint 210, Workloads Sprint 211, Workloads Sprint 212, Workloads Sprint 214, Workloads Sprint 215, Workloads Sprint 216, Workloads Sprint 217, Workloads - 4.12, Workloads Sprint 225, Workloads Sprint 226, Workloads Sprint 227, Workloads Sprint 228, Workloads Sprint 229, Workloads Sprint 230, Workloads Sprint 231, Workloads Sprint 232, Workloads Sprint 233, Workloads Sprint 234, Workloads Sprint 235, Workloads Sprint 236, Workloads Sprint 237, Workloads Sprint 238, Workloads Sprint 239, Workloads Sprint 240, Workloads Sprint 241

    Description

      kube-controller-manager should be more tolerant of API downtime and not crash, as crashes add up in metrics/events and cause alerts / CI test failures (see https://issues.redhat.com/browse/OCPBUGS-5806 / https://bugzilla.redhat.com/show_bug.cgi?id=2083757)

      If the Kubernetes API is down while kube-controller-manager is running the initOpCache function [1], kube-controller-manager logs a fatal error and crashes.

      The https://bugzilla.redhat.com/show_bug.cgi?id=2082628 bz was created separately to deal with the kubelet issues we observe following the crash, while this bz is about the crash itself.

      Not sure, see https://bugzilla.redhat.com/show_bug.cgi?id=2082628 for the context in which we notice this crash. It's not entirely clear whether the rarity is due to kubelet's behavior or whether this crash itself is rare.

      [1] https://github.com/openshift/kubernetes/blob/fe7796f337ea0d35bc3e6b5428d63685d1833cb5/pkg/controller/namespace/deletion/namespaced_resources_deleter.go#L159-L165

      It looks like this controller should behave more like GC does: it should try to scrape the data and, if it can't, simply retry after a while.
      This is definitely something that should be pursued upstream. I'll bump the priority on it to get it, ideally, into 1.25.


          People

            Unassigned
            fkrepins@redhat.com Filip Krepinsky