OpenShift Bugs / OCPBUGS-59390

Kubelet's GC goes rogue on systems with lots of CPUs

    • Bug
    • Resolution: Not a Bug
    • Affects Version/s: 4.16.z, 4.18.z
    • Component/s: Node / Kubelet
    • Quality / Stability / Reliability
    • Low

      This is a low-priority follow-up to OCPBUGS-54565. On systems with many cores and without a PerformanceProfile, kubelet's Go garbage collector has a tendency to go rogue. TL;DR: the Go GC scales its work with GOMAXPROCS, which defaults to min(CPU core count of the affinity mask, CPU core count of the host), so garbage collection runs create a lot of load on systems with many cores (https://tip.golang.org/doc/gc-guide).

      E.g., we saw kubelet on an AMD system with 384 SMT cores easily exceed 4000% CPU load, and pprof showed evidence that garbage collection was responsible for up to nearly 90% of that.

      kubelet can be reined in with a PerformanceProfile, because that sets kubelet's CPU affinity, or with a systemd drop-in:

      /etc/systemd/system/kubelet.service.d/99-override.conf 
      [Service]
      Environment="GOMAXPROCS=4"
      

      FYI: the high heap allocation rate is believed to come from https://github.com/kubernetes/kubernetes/issues/104459 / https://github.com/prometheus/client_golang/issues/1702, but the problem is then amplified by the GC's default behavior.

      I'm creating this ticket because it may be worth setting GOMAXPROCS for kubelet (and other components?) to a sane default, either in code or via a systemd drop-in file. It might also be good to follow up on the cAdvisor memory allocation issue.
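
      As a sketch of the in-code option, a component could cap its own parallelism at startup; capGOMAXPROCS is a hypothetical helper, not existing kubelet code:

```go
package main

import (
	"fmt"
	"runtime"
)

// capGOMAXPROCS lowers GOMAXPROCS to limit if it is currently higher,
// leaving any lower (e.g. operator-configured) value untouched.
func capGOMAXPROCS(limit int) int {
	if limit >= 1 && runtime.GOMAXPROCS(0) > limit {
		runtime.GOMAXPROCS(limit)
	}
	return runtime.GOMAXPROCS(0)
}

func main() {
	// Cap GC/scheduler parallelism at 4 threads, matching the
	// drop-in example above.
	fmt.Println("effective GOMAXPROCS:", capGOMAXPROCS(4))
}
```

      A related existing approach is go.uber.org/automaxprocs, which sizes GOMAXPROCS from the container's CPU quota rather than a fixed cap.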

              aos-node@redhat.com (Node Team Bot Account)
              akaris@redhat.com (Andreas Karis)
              Min Li
              Votes: 0
              Watchers: 7
