Type: Bug
Status: CLOSED
Resolution: Duplicate
Priority: Major
Severity: High
Description of problem:
We are seeing high %steal CPU time in a VM running in OpenShift Virtualization configured with dedicated resources.
Version-Release number of selected component (if applicable):
OpenShift 4.12.14
OpenShift Virtualization 4.12.2
How reproducible:
Always
Steps to Reproduce:
I configured the environment following this article, skipping the real-time part: https://access.redhat.com/solutions/7007632
1. Label the worker MachineConfigPool with custom-kubelet=cpumanager-enabled
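For reference, the label can be applied with a command along these lines (pool name assumed to be `worker`):
```
oc label machineconfigpool worker custom-kubelet=cpumanager-enabled
```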
2. Create a KubeletConfig:
```
apiVersion: machineconfiguration.openshift.io/v1
kind: KubeletConfig
metadata:
  name: cpumanager-enabled
spec:
  machineConfigPoolSelector:
    matchLabels:
      custom-kubelet: cpumanager-enabled
  kubeletConfig:
    cpuManagerPolicy: static
    cpuManagerReconcilePeriod: 5s
```
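To verify the static policy took effect, the kubelet's CPU Manager checkpoint can be inspected on the node (default kubelet state path assumed):
```
# On the worker node: checkpointed CPU assignments per container
cat /var/lib/kubelet/cpu_manager_state
```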
3. Create a PerformanceProfile:
```
apiVersion: performance.openshift.io/v2
kind: PerformanceProfile
metadata:
  name: performance
spec:
  cpu:
    isolated: "5-15"
    reserved: "0-4"
  globallyDisableIrqLoadBalancing: true
  hugepages:
    defaultHugepagesSize: "1G"
    pages:
      - size: "1G"
        count: 3
        node: 0
  numa:
    topologyPolicy: single-numa-node
  nodeSelector:
    node-role.kubernetes.io/worker: ""
```
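As a sanity check (the exact arguments vary by version), the isolation-related kernel arguments generated by the profile should show up in the node's command line:
```
# On the node: look for the isolated CPU set in the kernel arguments
tr ' ' '\n' < /proc/cmdline | grep -E 'isolcpus|nohz_full|tuned'
```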
4. Create a VM with 2 CPUs and 2 GiB of memory. These are the relevant parts of the configuration:
```
apiVersion: kubevirt.io/v1
kind: VirtualMachine
spec:
  template:
    spec:
      domain:
        cpu:
          cores: 1
          dedicatedCpuPlacement: true
          isolateEmulatorThread: true
          model: host-passthrough
          numa:
            guestMappingPassthrough: {}
          sockets: 2
          threads: 1
        devices:
          autoattachGraphicsDevice: false
          autoattachMemBalloon: false
          autoattachSerialConsole: true
        ioThreadsPolicy: auto
        machine:
          type: pc-q35-rhel8.6.0
        memory:
          hugepages:
            pageSize: 1Gi
        resources:
          limits:
            memory: 2Gi
          requests:
            memory: 2Gi
```
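With dedicatedCpuPlacement and matching memory requests/limits, the virt-launcher pod should land in the Guaranteed QoS class; a quick check (the pod name is a placeholder):
```
oc get pod <virt-launcher-pod> -o jsonpath='{.status.qosClass}{"\n"}'
```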
5. Run a CPU-intensive load in the guest. I tested by running two `openssl speed` commands, each pinned to a vCPU:
```
for cpu in $(seq 0 1); do taskset -c "${cpu}" openssl speed >/dev/null 2>&1 & done
```
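To confirm each worker actually stays on its vCPU, the PSR column (last processor the thread ran on) can be checked inside the guest:
```
ps -eLo pid,psr,comm | grep openssl
```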
Actual results:
Using top I see consistently high steal time in the guest, between 10% and 30%:
```
%Cpu0 : 71.1 us, 0.0 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 28.9 st
%Cpu1 : 72.3 us, 0.0 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 27.7 st
```
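The same accounting can be read straight from /proc/stat in the guest, where steal is the eighth value after the cpuN label:
```
# Values per line: user nice system idle iowait irq softirq steal ...
grep -E '^cpu[0-9]' /proc/stat
```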
I don't see steal time in another VM with 2 CPUs and 2 GB of memory running the same workload but configured with the defaults (no dedicatedCpuPlacement, no limits, no hugepages, etc.).
Expected results:
No steal time.
Additional info:
In the virt-launcher pod I confirm that the vCPUs are pinned to pCPUs 5 and 6:
```
<cputune>
<vcpupin vcpu='0' cpuset='5'/>
<vcpupin vcpu='1' cpuset='6'/>
<emulatorpin cpuset='7'/>
<iothreadpin iothread='1' cpuset='7'/>
</cputune>
```
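From the node, the affinity of the QEMU vCPU threads can be cross-checked; this is a sketch that assumes a single qemu-kvm process is visible:
```
# Print the CPU affinity of each vCPU thread ("CPU N/KVM")
for tid in $(ps -T -p "$(pgrep -f qemu-kvm | head -1)" -o spid= -o comm= | awk '/KVM/ {print $1}'); do
  taskset -cp "$tid"
done
```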
The cpumask for CPU 5 is 20:
```
$ python3 -c 'cpu=5; x="%x" % (1<<cpu); print(",".join(x[(i-8 if i>8 else 0):i] for i in reversed(range(len(x), 0, -8))))'
20
```
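Equivalently, since the mask fits in 64 bits here, plain shell arithmetic gives the same value:
```
$ printf '%x\n' $((1 << 5))
20
```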
On the node where the VM is running, I trace the sched_switch and workqueue_execute_start events on CPU 5:
```
cd /sys/kernel/debug/tracing/
echo 20 > tracing_cpumask
echo > set_event
echo sched_switch >> set_event
echo workqueue_execute_start >> set_event
# (wait 30 seconds)
echo > set_event
cat trace
# tracer: nop
#
#                              _-----=> irqs-off
#                             / _----=> need-resched
#                            | / _---=> hardirq/softirq
#                            || / _--=> preempt-depth
#                            ||| /     delay
#           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
#              | |       |   ||||       |         |
CPU 0/KVM-30764 [005] d... 2766.297106: sched_switch: prev_comm=CPU 0/KVM prev_pid=30764 prev_prio=120 prev_state=R+ ==> next_comm=swapper/5 next_pid=0 next_prio=120
<idle>-0 [005] d... 2766.540965: sched_switch: prev_comm=swapper/5 prev_pid=0 prev_prio=120 prev_state=S ==> next_comm=CPU 0/KVM next_pid=30764 next_prio=120
CPU 0/KVM-30764 [005] d... 2767.321081: sched_switch: prev_comm=CPU 0/KVM prev_pid=30764 prev_prio=120 prev_state=R+ ==> next_comm=swapper/5 next_pid=0 next_prio=120
<idle>-0 [005] d... 2767.641028: sched_switch: prev_comm=swapper/5 prev_pid=0 prev_prio=120 prev_state=S ==> next_comm=CPU 0/KVM next_pid=30764 next_prio=120
CPU 0/KVM-30764 [005] d... 2768.345137: sched_switch: prev_comm=CPU 0/KVM prev_pid=30764 prev_prio=120 prev_state=R+ ==> next_comm=swapper/5 next_pid=0 next_prio=120
<idle>-0 [005] d... 2768.741041: sched_switch: prev_comm=swapper/5 prev_pid=0 prev_prio=120 prev_state=S ==> next_comm=CPU 0/KVM next_pid=30764 next_prio=120
CPU 0/KVM-30764 [005] d... 2769.369178: sched_switch: prev_comm=CPU 0/KVM prev_pid=30764 prev_prio=120 prev_state=R+ ==> next_comm=swapper/5 next_pid=0 next_prio=120
<idle>-0 [005] d... 2769.741063: sched_switch: prev_comm=swapper/5 prev_pid=0 prev_prio=120 prev_state=S ==> next_comm=CPU 0/KVM next_pid=30764 next_prio=120
```
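As a rough quantification (a sketch assuming the trace format above), the gaps during which the CPU sat idle between a preemption and the switch back to the vCPU can be extracted with awk:
```
# Print each gap during which CPU 5 was idle while the vCPU was runnable
grep sched_switch trace | awk '
  {
    # the timestamp is the field that looks like "2766.297106:"
    for (i = 1; i <= NF; i++)
      if ($i ~ /^[0-9]+\.[0-9]+:$/) ts = $i + 0
  }
  /next_comm=swapper/ { start = ts }
  /prev_comm=swapper/ && start { printf "idle gap: %.1f ms\n", (ts - start) * 1000; start = 0 }
'
```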
We can see that the CPU is switched to the idle task even though the vCPU thread is still runnable (prev_state=R+); the guest accounts this time as steal.
Duplicates:
- CNV-28792 [2203291] kubevirt should allow runtimeclass to be configured in a pod (Closed)