- Bug
- Resolution: Not a Bug
- rhel-8.4.0
- Critical
- rhel-net-perf
- ssg_core_services
What were you trying to do that didn't work?
On an OSP 16.2.3 (RHEL 8.4) compute host, a tenant observes CPU steal from within an instance.
We made sure that isolcpus was specified correctly [1].
Also made sure that each instance's "emulatorpin cpuset" was set to use CPUs outside of isolcpus [2],
and that there was no overlap in vCPU pin assignments between instances (XML not shown here for that).
In nova.conf, vcpu_pin_set [3] matches isolcpus from the cmdline. For some reason the customer
has also specified values for cpu_shared_set, which we understand are overridden by vcpu_pin_set
since cpu_dedicated_set is not specified.
File /etc/sysconfig/irqbalance shows: IRQBALANCE_BANNED_CPUS=fffffcff,fffcffff,fcfffffc
which matches the isolcpus set, i.e. irqbalance is banned from placing IRQs on the isolated cores.
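For reference, a cpulist can be rendered in the comma-separated hex-mask format used by IRQBALANCE_BANNED_CPUS with a short helper (a minimal sketch; parse_cpulist and cpulist_to_mask are illustrative names, and the 96-CPU width is inferred from the ranges on the cmdline):

```python
def parse_cpulist(s):
    """Parse a kernel-style cpulist ("2-23,26-47") into a set of CPU ids."""
    cpus = set()
    for part in s.split(","):
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return cpus

def cpulist_to_mask(s, ncpus):
    """Render a cpulist as comma-separated 32-bit hex groups (most
    significant first), the format used by IRQBALANCE_BANNED_CPUS and
    /proc/irq/*/smp_affinity."""
    mask = 0
    for cpu in parse_cpulist(s):
        mask |= 1 << cpu
    ngroups = (ncpus + 31) // 32
    return ",".join(f"{(mask >> (32 * i)) & 0xFFFFFFFF:08x}"
                    for i in reversed(range(ngroups)))

# isolcpus from the cmdline in [1]; this host has 96 CPUs
print(cpulist_to_mask("2-23,26-47,50-71,74-95", 96))
```

Running this on the isolcpus value yields fffffcff,fffcffff,fcfffffc, the value in /etc/sysconfig/irqbalance; the complementary housekeeping list 0-1,24-25,48-49,72-73 renders as 00000300,00030000,03000003, matching tuned.non_isolcpus on the cmdline.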
Tuned on compute host is set to cpu-partitioning [4].
When the tuned-adm verify command is run, several IRQ affinity checks fail [5], which seems to suggest the CPU steal
could be caused by something originating in tuned's handling of interrupts.
[1] $ cat ./proc/cmdline
BOOT_IMAGE=(hd0,gpt3)/boot/vmlinuz-4.18.0-305.49.1.el8_4.x86_64 root=UUID=d34238e8-4842-40b3-9634-0d75169ead87 ro console=ttyS0 console=ttyS0,115200n81 no_timer_check crashkernel=auto rhgb quiet default_hugepagesz=1GB hugepagesz=1G hugepages=900 intel_iommu=on iommu=pt transparent_hugepage=never vt.handoff=1 nmi_watchdog=0 numa_balancing=disable intel_idle.max_cstate=0 nosoftlockup rcu_nocbs=2-23,26-47,50-71,74-95 nohz_full=2-23,26-47,50-71,74-95 isolcpus=2-23,26-47,50-71,74-95 tsx=off skew_tick=1 nohz=on nohz_full=2-23,26-47,50-71,74-95 rcu_nocbs=2-23,26-47,50-71,74-95 tuned.non_isolcpus=00000300,00030000,03000003 intel_pstate=disable nosoftlockup
[2] grep "emulatorpin cpuset" ./etc/libvirt/qemu/*
./etc/libvirt/qemu/instance-000001b6.xml: <emulatorpin cpuset='0-1,24-25,48-49,72-73'/>
./etc/libvirt/qemu/instance-000001b9.xml: <emulatorpin cpuset='0-1,24-25,48-49,72-73'/>
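The pairwise-disjointness check on pinned cpusets mentioned above can be sketched like this (a hypothetical helper; the real cpuset strings come from each instance's vcpupin XML, which is not reproduced here, so the example values below are made up):

```python
from itertools import combinations

def parse_cpuset(s):
    """Parse a libvirt cpuset attribute ("0-1,24-25") into a set of CPU ids."""
    cpus = set()
    for part in s.split(","):
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return cpus

def overlapping(pinnings):
    """Return (instance, instance, shared CPUs) for every pair of
    instances whose pinnings share at least one physical CPU."""
    clashes = []
    for (a, sa), (b, sb) in combinations(pinnings.items(), 2):
        shared = parse_cpuset(sa) & parse_cpuset(sb)
        if shared:
            clashes.append((a, b, sorted(shared)))
    return clashes

# made-up cpusets for illustration only
print(overlapping({"instance-000001b6": "2-5", "instance-000001b9": "6-9"}))
```

An empty result means no two instances contend for the same pinned core; any tuple returned names the clashing pair.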
[3] from nova.conf:
vcpu_pin_set=2-23,26-47,50-71,74-95
#cpu_dedicated_set=<None>
cpu_shared_set=0-1,24-25,48-49,72-73
[4] $ cat ./etc/tuned/active_profile
cpu-partitioning
[5] $ grep ERROR ./var/log/tuned/tuned.log
... output truncated - many more errors are showing ...
2023-09-20 12:45:08,064 ERROR tuned.plugins.plugin_scheduler: verify: failed: 'SMP affinity of IRQ 0' = '[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95]', expected '[0, 1, 72, 73, 48, 49, 24, 25]'
...
2023-09-20 12:45:08,069 ERROR tuned.plugins.plugin_scheduler: verify: failed: 'SMP affinity of IRQ 322' = '[21]', expected '[0, 1, 72, 73, 48, 49, 24, 25]'
2023-09-20 12:45:08,069 ERROR tuned.plugins.plugin_scheduler: verify: failed: 'SMP affinity of IRQ 323' = '[22]', expected '[0, 1, 72, 73, 48, 49, 24, 25]'
2023-09-20 12:45:08,069 ERROR tuned.plugins.plugin_scheduler: verify: failed: 'SMP affinity of IRQ 324' = '[23]', expected '[0, 1, 72, 73, 48, 49, 24, 25]'
2023-09-20 12:45:08,069 ERROR tuned.plugins.plugin_scheduler: verify: failed: 'SMP affinity of IRQ 325' = '[64]', expected '[0, 1, 72, 73, 48, 49, 24, 25]'
2023-09-20 12:45:08,069 ERROR tuned.plugins.plugin_scheduler: verify: failed: 'SMP affinity of IRQ 326' = '[65]', expected '[0, 1, 72, 73, 48, 49, 24, 25]'
2023-09-20 12:45:08,069 ERROR tuned.plugins.plugin_scheduler: verify: failed: 'SMP affinity of IRQ 327' = '[66]', expected '[0, 1, 72, 73, 48, 49, 24, 25]'
2023-09-20 12:45:08,070 ERROR tuned.plugins.plugin_scheduler: verify: failed: 'SMP affinity of IRQ 328' = '[67]', expected '[0, 1, 72, 73, 48, 49, 24, 25]'
2023-09-20 12:45:08,070 ERROR tuned.plugins.plugin_scheduler: verify: failed: 'SMP affinity of IRQ 329' = '[68]', expected '[0, 1, 72, 73, 48, 49, 24, 25]'
2023-09-20 12:45:08,070 ERROR tuned.plugins.plugin_scheduler: verify: failed: 'SMP affinity of IRQ 330' = '[69]', expected '[0, 1, 72, 73, 48, 49, 24, 25]'
2023-09-20 12:45:08,070 ERROR tuned.plugins.plugin_scheduler: verify: failed: 'SMP affinity of IRQ 331' = '[70]', expected '[0, 1, 72, 73, 48, 49, 24, 25]'
2023-09-20 12:45:08,070 ERROR tuned.plugins.plugin_scheduler: verify: failed: 'SMP affinity of IRQ 332' = '[71]', expected '[0, 1, 72, 73, 48, 49, 24, 25]'
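The failures above can be cross-checked directly against /proc on the compute host: list every IRQ whose current affinity touches an isolated CPU (a minimal sketch; misplaced_irqs and read_irq_affinities are illustrative names, and the live /proc read is only meaningful on the host itself):

```python
import glob, os

def parse_cpulist(s):
    """Parse a cpulist ("2-23,26-47") into a set of CPU ids."""
    cpus = set()
    for part in s.split(","):
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return cpus

def misplaced_irqs(affinities, isolated):
    """Given {irq: cpulist string}, return the IRQs whose affinity
    intersects the isolated set, with the offending CPUs."""
    return {irq: sorted(parse_cpulist(lst) & isolated)
            for irq, lst in affinities.items()
            if parse_cpulist(lst) & isolated}

def read_irq_affinities():
    """Read live IRQ affinities from /proc (run on the compute host)."""
    out = {}
    for path in glob.glob("/proc/irq/*/smp_affinity_list"):
        irq = os.path.basename(os.path.dirname(path))
        with open(path) as f:
            out[irq] = f.read().strip()
    return out

isolated = parse_cpulist("2-23,26-47,50-71,74-95")
# On the host itself: print(misplaced_irqs(read_irq_affinities(), isolated))
# Sample values mirroring the tuned.log entries above:
print(misplaced_irqs({"322": "21", "0": "0-95", "330": "0-1"}, isolated))
```

Any non-empty result mirrors the tuned-adm verify failures: those IRQs can fire on cores the guests' vCPUs are pinned to, which would show up in the instance as steal time.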
Please provide the package NVR for which the bug is seen:
How reproducible:
Customer said this happens more often under high load.
Steps to reproduce:
- Can be reproduced on customer setup
Expected results:
No CPU steal should be observed.
Actual results:
CPU steal observed in tenant instance.