OpenShift Virtualization / CNV-28509

[2192636] High steal time in VMs using dedicated resources


      Description of problem:
      We are seeing high %steal CPU time in a VM running on OpenShift Virtualization configured with dedicated resources.

      Version-Release number of selected component (if applicable):
      OpenShift 4.12.14
      OpenShift Virtualization 4.12.2

      How reproducible:
      Always

      Steps to Reproduce:
      I've configured the environment following this article, skipping the real-time part: https://access.redhat.com/solutions/7007632

      1. Label the worker MachineConfigPool with custom-kubelet=cpumanager-enabled
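
      For reference, a minimal sketch of the labeling command for this step, assuming the default `worker` pool name:

      ```
      oc label machineconfigpool worker custom-kubelet=cpumanager-enabled
      ```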
      2. Create a KubeletConfig:

      ```
      apiVersion: machineconfiguration.openshift.io/v1
      kind: KubeletConfig
      metadata:
        name: cpumanager-enabled
      spec:
        machineConfigPoolSelector:
          matchLabels:
            custom-kubelet: cpumanager-enabled
        kubeletConfig:
          cpuManagerPolicy: static
          cpuManagerReconcilePeriod: 5s
      ```
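
      To check that the static CPU Manager policy is actually in effect on a node after the KubeletConfig rolls out, one option (a sketch; `<node>` is a placeholder for the worker node name) is to read the kubelet's CPU Manager state file:

      ```
      # The state file should report "policyName": "static" once the node has been
      # updated with the new kubelet configuration.
      oc debug node/<node> -- chroot /host cat /var/lib/kubelet/cpu_manager_state
      ```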

      3. Create a PerformanceProfile:

      ```
      apiVersion: performance.openshift.io/v2
      kind: PerformanceProfile
      metadata:
        name: performance
      spec:
        cpu:
          isolated: "5-15"
          reserved: "0-4"
        globallyDisableIrqLoadBalancing: true
        hugepages:
          defaultHugepagesSize: "1G"
          pages:
            - size: "1G"
              count: 3
              node: 0
        numa:
          topologyPolicy: single-numa-node
        nodeSelector:
          node-role.kubernetes.io/worker: ""
      ```
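
      Once the node comes back with the profile applied, the isolated CPUs and the 1 GiB hugepages can be checked from the node itself. A sketch, assuming `<node>` is the worker that will run the VM (the exact kernel arguments depend on the profile):

      ```
      # The kernel command line should reflect the isolated set 5-15 (e.g. nohz_full=5-15),
      # and NUMA node 0 should report three 1 GiB hugepages.
      oc debug node/<node> -- chroot /host sh -c \
        'cat /proc/cmdline; cat /sys/devices/system/node/node0/hugepages/hugepages-1048576kB/nr_hugepages'
      ```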

      4. Create a VM with 2 CPUs and 2 GB of memory. These are the relevant parts of its configuration:

      ```
      apiVersion: kubevirt.io/v1
      kind: VirtualMachine
      spec:
        template:
          spec:
            domain:
              cpu:
                cores: 1
                dedicatedCpuPlacement: true
                isolateEmulatorThread: true
                model: host-passthrough
                numa:
                  guestMappingPassthrough: {}
                sockets: 2
                threads: 1
              devices:
                autoattachGraphicsDevice: false
                autoattachMemBalloon: false
                autoattachSerialConsole: true
              ioThreadsPolicy: auto
              machine:
                type: pc-q35-rhel8.6.0
              memory:
                hugepages:
                  pageSize: 1Gi
              resources:
                limits:
                  memory: 2Gi
                requests:
                  memory: 2Gi
      ```

      5. Run a CPU-intensive load in the guest. I have tested this by running two `openssl speed` processes, each pinned to a vCPU:

      ```
      for cpu in $(seq 0 1); do taskset -c "${cpu}" openssl speed >/dev/null 2>&1 & done
      ```
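
      To confirm inside the guest that each `openssl speed` process stays on its assigned vCPU, a quick check (a sketch) is the processor column of ps:

      ```
      # PSR is the CPU the process last ran on; it should stay at 0 and 1 respectively.
      ps -o pid,psr,pcpu,comm -C openssl
      ```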

      Actual results:
      Using top I see consistently high steal time in the guest, between 10% and 30%:

      ```
      %Cpu0 : 71.1 us, 0.0 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 28.9 st
      %Cpu1 : 72.3 us, 0.0 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 27.7 st
      ```
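
      As a cross-check independent of top, the steal percentage can be computed directly from /proc/stat inside the guest; a minimal sketch (steal is the 8th value on the "cpu" line):

      ```
      # Sample the aggregate "cpu" line twice, one second apart, and report the share
      # of time the vCPUs were runnable but not running on the host.
      read -r _ u1 n1 s1 i1 w1 q1 sq1 st1 _ < /proc/stat
      sleep 1
      read -r _ u2 n2 s2 i2 w2 q2 sq2 st2 _ < /proc/stat
      t1=$((u1+n1+s1+i1+w1+q1+sq1+st1)); t2=$((u2+n2+s2+i2+w2+q2+sq2+st2))
      echo "steal%: $(( 100 * (st2 - st1) / (t2 - t1) ))"
      ```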

      I don't see steal time in another VM with 2 CPUs and 2 GB of memory running the same workload but configured with the defaults (no dedicatedCpuPlacement, no limits, no hugepages, etc.).

      Expected results:
      No steal time.

      Additional info:
      In the virt-launcher pod I confirm that the vCPUs are pinned to pCPUs 5 and 6:

      ```
      <cputune>
        <vcpupin vcpu='0' cpuset='5'/>
        <vcpupin vcpu='1' cpuset='6'/>
        <emulatorpin cpuset='7'/>
        <iothreadpin iothread='1' cpuset='7'/>
      </cputune>
      ```
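
      For completeness, this is roughly how the <cputune> section can be extracted from the running pod (a sketch; the pod and domain names are placeholders, and the libvirt domain is usually named `<namespace>_<vm-name>`):

      ```
      oc exec -n <vm-namespace> <virt-launcher-pod> -c compute -- \
        virsh dumpxml <namespace>_<vm-name> | sed -n '/<cputune>/,/<\/cputune>/p'
      ```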

      The cpumask for CPU 5 is 20 (hex, i.e. bit 5 set):

      ```
      $ python3 -c 'cpu=5; x=str("%x" % (1<<cpu)); print(",".join(x[i-8 if i>8 else 0:i] for i in reversed(range(len(x), 0, -8))))'
      20
      ```
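
      Equivalently, since only a single CPU is selected here, the mask is just bit 5 set; a quick sanity check:

      ```
      printf '%x\n' $(( 1 << 5 ))   # prints 20, i.e. 0x20, the mask for CPU 5
      ```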

      On the node where the VM is running, I trace the sched_switch and workqueue_execute_start events on CPU 5:

      ```
      # cd /sys/kernel/debug/tracing/
      # echo 20 > tracing_cpumask
      # echo > set_event
      # echo sched_switch >> set_event
      # echo workqueue_execute_start >> set_event
        (wait 30 seconds)
      # echo > set_event
      # cat trace
      # tracer: nop
      #
      #                              _-----=> irqs-off
      #                             / _----=> need-resched
      #                            | / _---=> hardirq/softirq
      #                            || / _--=> preempt-depth
      #                            ||| /     delay
      #           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
      #              | |       |   ||||       |         |
         CPU 0/KVM-30764 [005] d... 2766.297106: sched_switch: prev_comm=CPU 0/KVM prev_pid=30764 prev_prio=120 prev_state=R+ ==> next_comm=swapper/5 next_pid=0 next_prio=120
            <idle>-0     [005] d... 2766.540965: sched_switch: prev_comm=swapper/5 prev_pid=0 prev_prio=120 prev_state=S ==> next_comm=CPU 0/KVM next_pid=30764 next_prio=120
         CPU 0/KVM-30764 [005] d... 2767.321081: sched_switch: prev_comm=CPU 0/KVM prev_pid=30764 prev_prio=120 prev_state=R+ ==> next_comm=swapper/5 next_pid=0 next_prio=120
            <idle>-0     [005] d... 2767.641028: sched_switch: prev_comm=swapper/5 prev_pid=0 prev_prio=120 prev_state=S ==> next_comm=CPU 0/KVM next_pid=30764 next_prio=120
         CPU 0/KVM-30764 [005] d... 2768.345137: sched_switch: prev_comm=CPU 0/KVM prev_pid=30764 prev_prio=120 prev_state=R+ ==> next_comm=swapper/5 next_pid=0 next_prio=120
            <idle>-0     [005] d... 2768.741041: sched_switch: prev_comm=swapper/5 prev_pid=0 prev_prio=120 prev_state=S ==> next_comm=CPU 0/KVM next_pid=30764 next_prio=120
         CPU 0/KVM-30764 [005] d... 2769.369178: sched_switch: prev_comm=CPU 0/KVM prev_pid=30764 prev_prio=120 prev_state=R+ ==> next_comm=swapper/5 next_pid=0 next_prio=120
            <idle>-0     [005] d... 2769.741063: sched_switch: prev_comm=swapper/5 prev_pid=0 prev_prio=120 prev_state=S ==> next_comm=CPU 0/KVM next_pid=30764 next_prio=120
      ```

      We can see that the CPU is switched to the idle task even though the vCPU thread is still runnable (R+ state).
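
      From the host side, the involuntary preemption of the vCPU thread can also be corroborated by watching its context-switch counters (a sketch; 30764 is the `CPU 0/KVM` thread PID seen in the trace above):

      ```
      # nonvoluntary_ctxt_switches should keep climbing while the guest reports steal time.
      watch -n 1 'grep ctxt_switches /proc/30764/status'
      ```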

      Assignee: Itamar Holder (iholder@redhat.com)
      Reporter: Juan Orti (rhn-support-jortialc)
