  1. RHEL
  2. RHEL-7284

CPU steal seen in instance appears to originate in tuned handling of interrupts


    • Bug
    • Resolution: Not a Bug
    • rhel-8.4.0
    • tuned
    • Critical
    • rhel-net-perf
    • ssg_core_services

      What were you trying to do that didn't work?

      On an OSP 16.2.3 (RHEL 8.4) compute host, the tenant observes CPU steal from within an instance.

      We made sure that isolcpus was specified correctly [1].

      We also made sure that each instance's "emulatorpin cpuset" uses CPUs outside of isolcpus [2],
      and that there was no overlap in vCPU pin assignments between instances (XML not shown here).
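
      The overlap check described above can be sketched in a few lines of Python. This is only an
      illustration, not a tool used on the case: parse_cpu_list is a hypothetical helper, and the
      two CPU lists are copied from the cmdline [1] and the libvirt XML [2].

```python
def parse_cpu_list(spec):
    """Expand a kernel-style CPU list such as '0-1,24-25' into a set of CPU ids."""
    cpus = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-", 1)
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return cpus


# Values taken from [1] (isolcpus=) and [2] (emulatorpin cpuset=).
isolcpus = parse_cpu_list("2-23,26-47,50-71,74-95")
emulatorpin = parse_cpu_list("0-1,24-25,48-49,72-73")

# An empty intersection means the emulator threads stay off the isolated CPUs.
overlap = isolcpus & emulatorpin
print("overlap:", sorted(overlap))
```

      The same helper could be applied to the vcpu cpuset values of each instance to look for
      pairwise overlaps between instances.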

      In nova.conf, vcpu_pin_set [3] matches isolcpus from the cmdline. For some reason the customer
      has also specified values for cpu_shared_set, which we understand are overridden by vcpu_pin_set
      since cpu_dedicated_set is not specified.

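      File /etc/sysconfig/irqbalance shows: IRQBALANCE_BANNED_CPUS=fffffcff,fffcffff,fcfffffc
      which is the bitwise complement of tuned.non_isolcpus from the cmdline [1], so irqbalance is
      banned from the isolated CPUs and left with the non-isolcpus set.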
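
      To see how the two hex masks relate, they can be decoded into CPU lists. The sketch below
      uses a hypothetical helper, decode_cpumask, and assumes the usual kernel cpumask layout
      (comma-separated 32-bit hex words, most significant word first, each word zero-padded to
      8 digits as in the values here).

```python
def decode_cpumask(mask):
    """Decode a comma-separated hex cpumask into a sorted list of CPU ids.

    Assumes every word is zero-padded to 8 hex digits, so the words can be
    concatenated into a single hexadecimal number.
    """
    value = int(mask.replace(",", ""), 16)
    return [cpu for cpu in range(value.bit_length()) if (value >> cpu) & 1]


# tuned.non_isolcpus from the cmdline and IRQBALANCE_BANNED_CPUS from irqbalance.
non_isolcpus = decode_cpumask("00000300,00030000,03000003")
banned = decode_cpumask("fffffcff,fffcffff,fcfffffc")

print("non-isolated CPUs:", non_isolcpus)
print("banned CPU count:", len(banned))
# The banned set and the non-isolated set together cover CPUs 0-95 without overlap.
print("disjoint:", not set(banned) & set(non_isolcpus))
```

      Decoded this way, the non-isolated set is the same 8 CPUs that tuned reports as the
      expected IRQ affinity in [5].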

      Tuned on the compute host is set to the cpu-partitioning profile [4].

      When the tuned-adm verify command is run, verification fails for several IRQs [5], which suggests
      the CPU steal could be caused by something originating in tuned's handling of interrupts.

      [1] $ cat ./proc/cmdline 
      BOOT_IMAGE=(hd0,gpt3)/boot/vmlinuz-4.18.0-305.49.1.el8_4.x86_64 root=UUID=d34238e8-4842-40b3-9634-0d75169ead87 ro console=ttyS0 console=ttyS0,115200n81 no_timer_check crashkernel=auto rhgb quiet default_hugepagesz=1GB hugepagesz=1G hugepages=900 intel_iommu=on iommu=pt transparent_hugepage=never vt.handoff=1 nmi_watchdog=0 numa_balancing=disable intel_idle.max_cstate=0 nosoftlockup rcu_nocbs=2-23,26-47,50-71,74-95 nohz_full=2-23,26-47,50-71,74-95 isolcpus=2-23,26-47,50-71,74-95 tsx=off skew_tick=1 nohz=on nohz_full=2-23,26-47,50-71,74-95 rcu_nocbs=2-23,26-47,50-71,74-95 tuned.non_isolcpus=00000300,00030000,03000003 intel_pstate=disable nosoftlockup

      [2] $ grep "emulatorpin cpuset" ./etc/libvirt/qemu/*
      ./etc/libvirt/qemu/instance-000001b6.xml:    <emulatorpin cpuset='0-1,24-25,48-49,72-73'/>
      ./etc/libvirt/qemu/instance-000001b9.xml:    <emulatorpin cpuset='0-1,24-25,48-49,72-73'/>

      [3] from nova.conf:

      vcpu_pin_set=2-23,26-47,50-71,74-95

      #cpu_dedicated_set=<None>

      cpu_shared_set=0-1,24-25,48-49,72-73

      [4] $ cat ./etc/tuned/active_profile
      cpu-partitioning

      [5] $ grep ERROR ./var/log/tuned/tuned.log

       ... output truncated - many more errors are showing ...
       
      2023-09-20 12:45:08,064 ERROR    tuned.plugins.plugin_scheduler: verify: failed: 'SMP affinity of IRQ 0' = '[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95]', expected '[0, 1, 72, 73, 48, 49, 24, 25]'

        ...

      2023-09-20 12:45:08,069 ERROR    tuned.plugins.plugin_scheduler: verify: failed: 'SMP affinity of IRQ 322' = '[21]', expected '[0, 1, 72, 73, 48, 49, 24, 25]'
      2023-09-20 12:45:08,069 ERROR    tuned.plugins.plugin_scheduler: verify: failed: 'SMP affinity of IRQ 323' = '[22]', expected '[0, 1, 72, 73, 48, 49, 24, 25]'
      2023-09-20 12:45:08,069 ERROR    tuned.plugins.plugin_scheduler: verify: failed: 'SMP affinity of IRQ 324' = '[23]', expected '[0, 1, 72, 73, 48, 49, 24, 25]'
      2023-09-20 12:45:08,069 ERROR    tuned.plugins.plugin_scheduler: verify: failed: 'SMP affinity of IRQ 325' = '[64]', expected '[0, 1, 72, 73, 48, 49, 24, 25]'
      2023-09-20 12:45:08,069 ERROR    tuned.plugins.plugin_scheduler: verify: failed: 'SMP affinity of IRQ 326' = '[65]', expected '[0, 1, 72, 73, 48, 49, 24, 25]'
      2023-09-20 12:45:08,069 ERROR    tuned.plugins.plugin_scheduler: verify: failed: 'SMP affinity of IRQ 327' = '[66]', expected '[0, 1, 72, 73, 48, 49, 24, 25]'
      2023-09-20 12:45:08,070 ERROR    tuned.plugins.plugin_scheduler: verify: failed: 'SMP affinity of IRQ 328' = '[67]', expected '[0, 1, 72, 73, 48, 49, 24, 25]'
      2023-09-20 12:45:08,070 ERROR    tuned.plugins.plugin_scheduler: verify: failed: 'SMP affinity of IRQ 329' = '[68]', expected '[0, 1, 72, 73, 48, 49, 24, 25]'
      2023-09-20 12:45:08,070 ERROR    tuned.plugins.plugin_scheduler: verify: failed: 'SMP affinity of IRQ 330' = '[69]', expected '[0, 1, 72, 73, 48, 49, 24, 25]'
      2023-09-20 12:45:08,070 ERROR    tuned.plugins.plugin_scheduler: verify: failed: 'SMP affinity of IRQ 331' = '[70]', expected '[0, 1, 72, 73, 48, 49, 24, 25]'
      2023-09-20 12:45:08,070 ERROR    tuned.plugins.plugin_scheduler: verify: failed: 'SMP affinity of IRQ 332' = '[71]', expected '[0, 1, 72, 73, 48, 49, 24, 25]'
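
      With many such errors in tuned.log, it may help to summarize which IRQs have an affinity
      outside the expected set. The following is a rough sketch rather than an existing tool:
      the regular expression is written against the log format shown above, parse_verify_failures
      is a hypothetical helper, and the two sample lines are copied from the excerpt.

```python
import ast
import re

# Matches the "verify: failed" lines emitted for IRQ SMP affinity in the excerpt above.
VERIFY_RE = re.compile(
    r"verify: failed: 'SMP affinity of IRQ (\d+)' = '(\[.*?\])', expected '(\[.*?\])'"
)


def parse_verify_failures(lines):
    """Yield (irq, actual_cpus, expected_cpus) for each matching log line."""
    for line in lines:
        match = VERIFY_RE.search(line)
        if match:
            irq = int(match.group(1))
            actual = set(ast.literal_eval(match.group(2)))
            expected = set(ast.literal_eval(match.group(3)))
            yield irq, actual, expected


sample = [
    "2023-09-20 12:45:08,069 ERROR    tuned.plugins.plugin_scheduler: verify: failed: "
    "'SMP affinity of IRQ 322' = '[21]', expected '[0, 1, 72, 73, 48, 49, 24, 25]'",
    "2023-09-20 12:45:08,069 ERROR    tuned.plugins.plugin_scheduler: verify: failed: "
    "'SMP affinity of IRQ 325' = '[64]', expected '[0, 1, 72, 73, 48, 49, 24, 25]'",
]

failures = list(parse_verify_failures(sample))
for irq, actual, expected in failures:
    # CPUs in the actual affinity that are not part of the expected housekeeping set.
    stray = sorted(actual - expected)
    print(f"IRQ {irq}: affinity includes unexpected CPUs {stray}")
```

      To run it against the full log, the sample list could be replaced with the lines of
      ./var/log/tuned/tuned.log.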

      Please provide the package NVR for which bug is seen:

      How reproducible:

      Customer said this happens more often under high load.

      Steps to reproduce

      1. Can be reproduced on customer setup

      Expected results

      No CPU steal should be observed.

      Actual results

      CPU steal observed in tenant instance.

        1. tuned.log
          1.78 MB
          Jason Calhoun

              jskarvad Jaroslav Škarvada
              rhn-support-fpalin Francois Palin
              Jaroslav Škarvada Jaroslav Škarvada
              Robin Hack Robin Hack