RHEL-6869

tuned throughput-performance's scheduler plugin usage yields high CPU usage

    • Type: Bug
    • Resolution: Done-Errata
    • Priority: Undefined
    • Fix Version: rhel-8.10
    • Affects Version: rhel-8.6.0
    • Component: tuned
    • Fixed in Build: tuned-2.22.0-1.el8
    • Team: rhel-sst-cs-net-perf-services
    • Pool Team: ssg_core_services

      Description of problem:
      With RHEL 8.6, the tuned throughput-performance profile uses the scheduler plugin for some settings for which it previously (e.g. in RHEL 7.9) used the sysctl plugin.

      Version-Release number of selected component (if applicable):
      tuned-2.18.0-2.el8_6.1.noarch

      How reproducible:
      always

      Steps to Reproduce:
      1. make sure that the throughput-performance tuned profile is activated (otherwise: `tuned-adm profile throughput-performance`)
      2. increase the fork-rate of the system until the tuned process uses 30 % CPU or more (e.g. with the fork-rate sketch below)
      3. `perf trace -s -p $(pgrep tuned) -- sleep 60`
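
      For step 2, any fork-heavy workload will do; a minimal sketch of a fork-rate generator in Python (hypothetical, any equivalent stressor such as a shell loop spawning /bin/true works just as well):

      ```
      # crude fork-rate generator: spawns and reaps short-lived children
      # in a tight loop; stop it with Ctrl-C
      import os

      while True:
          pid = os.fork()
          if pid == 0:
              os._exit(0)       # child: exit immediately
          os.waitpid(pid, 0)    # parent: reap the child to avoid zombies
      ```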

      Actual results:
      tuned CPU usage increases with the fork rate, easily up to 30 % and more
      perf trace output shows high syscall rates for one tuned thread, i.e. for poll(), read(), openat(), lseek(), ioctl(), close() and fstat()

      Expected results:
      tuned CPU usage is very low (just a few percent) and is independent of the fork rate of the system.

      Additional info:
      This is caused by the way the scheduler plugin polls for process-creation events, which it does even when the plugin's configuration contains no process-matching declarations, as is the case with the throughput-performance profile. Each such event is then amplified by tuned invoking multiple syscalls on pseudo-files under /proc/$pid/.
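
      In other words (a simplified model, not tuned's actual code): the plugin receives an event for every fork/exec on the system and reacts by reading several pseudo-files under /proc/$pid/, so a per-event cost of a handful of syscalls is paid at the system's fork rate:

      ```
      # simplified illustration of the amplification: every process-creation
      # event triggers several /proc reads -- even when no process-matching
      # rule is configured that could make use of them
      def on_process_event(pid):
          for name in ("cmdline", "stat", "cgroup"):  # illustrative file names
              try:
                  with open(f"/proc/{pid}/{name}", "rb") as f:
                      f.read()
              except OSError:
                  pass  # the process may already have exited
      ```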

      Looking at a syscall trace in detail shows that a number of the syscalls used to read files under /proc/$pid/ are superfluous or even pointless (even if there were process-matching declarations in the config), e.g.:

      ```
      196436 openat(AT_FDCWD, "/proc/3678736/cmdline", O_RDONLY|O_CLOEXEC) = 28</proc/3678736/cmdline>
      196436 fstat(28</proc/3678736/cmdline>, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
      196436 ioctl(28</proc/3678736/cmdline>, TCGETS, 0x7f1113ffd410) = -1 ENOTTY (Inappropriate ioctl for device)
      196436 lseek(28</proc/3678736/cmdline>, 0, SEEK_CUR) = 0
      196436 ioctl(28</proc/3678736/cmdline>, TCGETS, 0x7f1113ffd3f0) = -1 ENOTTY (Inappropriate ioctl for device)
      196436 lseek(28</proc/3678736/cmdline>, 0, SEEK_CUR) = 0
      196436 read(28</proc/3678736/cmdline>, "/opt/xyz/bin/foobar\0foobar\0", 8192) = 23
      196436 read(28</proc/3678736/cmdline>, "", 8192) = 0
      196436 close(28</proc/3678736/cmdline>) = 0
      ```
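
      The fstat()/ioctl()/lseek() calls in this trace are typical of what a buffered, tty-aware open() adds on top of the actual read (glibc's isatty() is what issues the TCGETS ioctl). A hedged sketch of a leaner access pattern that reads the same data with only openat/read/close, assuming plain Python like tuned itself:

      ```
      import os

      def read_cmdline(pid):
          """Read /proc/<pid>/cmdline using raw file descriptors only,
          avoiding the extra fstat/ioctl/lseek of buffered I/O."""
          fd = os.open(f"/proc/{pid}/cmdline", os.O_RDONLY)
          try:
              chunks = []
              while True:
                  chunk = os.read(fd, 8192)
                  if not chunk:
                      return b"".join(chunks)
                  chunks.append(chunk)
          finally:
              os.close(fd)
      ```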

      A simple fix for the throughput-performance profile (which is activated by default on RHEL systems) is to convert the scheduler plugin settings back to sysctl ones, e.g. like this:

      ```
      --- /usr/lib/tuned/throughput-performance/tuned.conf 2022-06-08 11:48:16.000000000 +0200
      +++ new/throughput-performance/tuned.conf 2022-11-04 18:03:05.468461294 +0100
      @@ -58,12 +58,11 @@
       # and move them to swap cache
       vm.swappiness=10
       
      -[scheduler]
       # ktune sysctl settings for rhel6 servers, maximizing i/o throughput
       #
       # Minimal preemption granularity for CPU-bound tasks:
       # (default: 1 msec# (1 + ilog(ncpus)), units: nanoseconds)
      -sched_min_granularity_ns = 10000000
      +kernel.sched_min_granularity_ns = 10000000
       
       # SCHED_OTHER wake-up granularity.
       # (default: 1 msec# (1 + ilog(ncpus)), units: nanoseconds)
      @@ -71,7 +70,7 @@
       # This option delays the preemption effects of decoupled workloads
       # and reduces their over-scheduling. Synchronous workloads will still
       # have immediate wakeup/sleep latencies.
      -sched_wakeup_granularity_ns = 15000000
      +kernel.sched_wakeup_granularity_ns = 15000000
       
       # Marvell ThunderX
       [sysctl.thunderx]
      @@ -81,8 +80,8 @@
       kernel.numa_balancing=0
       
       # AMD
      -[scheduler.amd]
      -type=scheduler
      +[sysctl.amd]
      +type=sysctl
       uname_regex=x86_64
       cpuinfo_regex=${amd_cpuinfo_regex}
      -sched_migration_cost_ns=5000000
      +kernel.sched_migration_cost_ns=5000000
      ```
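
      After re-activating the profile (`tuned-adm profile throughput-performance`), the converted settings can be verified directly under /proc/sys; a small check, assuming the keys and values from the diff above (these sysctls exist on RHEL 8 kernels; on much newer kernels they moved to debugfs):

      ```
      # verify that the converted sysctl settings were applied
      expected = {
          "kernel.sched_min_granularity_ns": "10000000",
          "kernel.sched_wakeup_granularity_ns": "15000000",
      }
      for key, want in expected.items():
          path = "/proc/sys/" + key.replace(".", "/")
          with open(path) as f:
              got = f.read().strip()
          print(key, "=", got, "(ok)" if got == want else f"(expected {want})")
      ```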

        Assignee: Jaroslav Škarvada (jskarvad)
        Reporter: Georg Sauthoff (xeops-wu224, Inactive)
        Contacts: Jaroslav Škarvada, Robin Hack