Epic Name:
improve-contention-related-metrics
Story Points:
77
Acceptance Criteria:
Enable schedstats=enable by default (to be confirmed that we want this)

Fix kubevirt_vmi_vcpu_wait_seconds_total documentation to be IO specific

Add documentation for kubevirt_vmi_vcpu_delay_seconds_total to be CPU scheduling specific

Make sure the metric is displayed in a dashboard

Explore whether Pressure Stall Information (PSI) is valuable for detecting contended workloads

Compare CPU steal (guest) vs. vCPU wait (QEMU) vs. PSI (cgroup)
Current Status:
Green
Epic Status:
In Progress
Hierarchy Progress Bar:

40% To Do, 0% In Progress, 60% Done
Ready-Ready:

dev-ready, doc-ready, po-ready, qe-ready, ux-ready
Status Summary:

2025-07-14:
Implementation of this is delayed due to capacity, but is expected to land for 4.20....


Goal

Improve vCPU contention-related metrics by:

Making them work out of the box
Updating the documentation for the existing metric kubevirt_vmi_vcpu_wait_seconds_total in https://docs.redhat.com/en/documentation/openshift_container_platform/4.18/html/virtualization/monitoring#virt-promql-vcpu-metrics_virt-prometheus-queries
kubevirt_vmi_vcpu_wait_seconds_total is already documented; add that the metric is I/O specific (the _delay metric, kubevirt_vmi_vcpu_delay_seconds_total, is CPU scheduling specific)
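
To illustrate how the two metrics could be read side by side, here is a minimal sketch. It assumes both are monotonically increasing counters measured in seconds, as the `_seconds_total` suffix suggests. The function name `time_fraction` and the sample values are hypothetical and not taken from KubeVirt.

```python
def time_fraction(prev_seconds: float, curr_seconds: float, interval_seconds: float) -> float:
    """Return the share of the sampling interval covered by a seconds counter.

    prev_seconds and curr_seconds are two samples of the same counter
    (for example a wait or delay counter), taken interval_seconds apart.
    """
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    delta = curr_seconds - prev_seconds
    if delta < 0:
        # Assumption: a decreasing counter means it was reset (e.g. the VM
        # restarted), so only the time counted since the reset is used.
        delta = curr_seconds
    return delta / interval_seconds


# Hypothetical samples taken 60 seconds apart.
io_wait_share = time_fraction(10.0, 13.0, 60.0)     # I/O wait counter
sched_delay_share = time_fraction(4.0, 10.0, 60.0)  # CPU scheduling delay counter
```

A high scheduling-delay share with a low I/O-wait share would point at CPU contention rather than slow storage, which is the distinction the documentation update is meant to make explicit.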

In addition: a spike to explore how valuable PSI metrics are for VMs, and whether they complement vCPU ready.
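
As a starting point for the spike, here is a minimal sketch of reading PSI data. It assumes the standard Linux PSI text format (`some`/`full` lines with `avg10`, `avg60`, `avg300` and a `total` in microseconds), as exposed in `/proc/pressure/cpu` or a cgroup v2 `cpu.pressure` file. The helper name `parse_psi` and the sample values are hypothetical.

```python
def parse_psi(text: str) -> dict:
    """Parse PSI output into {"some": {...}, "full": {...}}."""
    result = {}
    for line in text.strip().splitlines():
        kind, *fields = line.split()
        values = dict(field.split("=", 1) for field in fields)
        result[kind] = {
            # Percentage of time stalled over the last 10/60/300 seconds.
            "avg10": float(values["avg10"]),
            "avg60": float(values["avg60"]),
            "avg300": float(values["avg300"]),
            # Cumulative stall time in microseconds.
            "total_us": int(values["total"]),
        }
    return result


# Hypothetical sample; on a real host the text would be read from a
# pressure file such as /proc/pressure/cpu.
SAMPLE = (
    "some avg10=1.50 avg60=0.75 avg300=0.20 total=123456\n"
    "full avg10=0.00 avg60=0.00 avg300=0.00 total=0\n"
)
psi = parse_psi(SAMPLE)
```

The cumulative `total` field behaves like a counter, so it can be compared over the same interval as the vCPU wait and delay counters.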

User Stories

As a cluster administrator, I want to know when there are vCPU performance issues with my VM, so that I can take action.

List of things not included in this epic, to alleviate any doubt raised during the grooming process.

is documented by

CNV-45694 Add documentation for kubevirt_vmi_vcpu_delay_seconds_total

is related to

CNV-46221 Add to the Red Hat docs a link to the CNV metrics documentation

relates to

VIRTSTRAT-65 CPU Load Aware balancing within a single cluster

1.	upstream roadmap issue	New	Unassigned
2.	upstream design	New	Unassigned
3.	upstream documentation	New	Unassigned
4.	upgrade consideration	New	Unassigned
5.	CEE/PX summary presentation	Closed	Unassigned
6.	test plans in polarion	New	Unassigned
7.	automated tests	New	Unassigned
8.	downstream documentation merged	New	Unassigned