Type: Bug
Resolution: Unresolved
Priority: Normal
Affects Version/s: 4.18.0, 4.20.0
Architecture: x86_64
Description of problem:
We detected a regression in the time pods need to reach the "ready" state as the number of pods deployed in OCP increases.
Version-Release number of selected component (if applicable):
The baseline version is OCP 4.16, with the regression (increasing times) detected in OCP 4.18.0 and 4.20.0.
How reproducible:
Create an increasing number of pods in the OCP cluster with kube-burner [1] and its canned rds-core workload, which in its current version creates 52 pods per namespace, scaled up to 1 namespace (1 iteration) per available worker node.
[1]: https://github.com/kube-burner/kube-burner-ocp
Steps to Reproduce:
1. Set up an OCP environment (4.16 as the baseline; 4.18.0 or 4.20.0 to reproduce the regression).
2. Run kube-burner-ocp with the rds-core workload.
3. kube-burner-ocp reports a metric called "podReadyLatency", which measures the time a pod needs to transition from creation to ready state, i.e., podReadyLatency = podReadyTime − podCreationTime. We track the p99 values of this metric (see the sketch after this list for how it can be computed independently).
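For reference, a minimal sketch of how podReadyLatency can be computed independently of kube-burner, assuming the standard kubernetes Python client and a reachable cluster; the namespace name is a placeholder for one of the namespaces the workload creates:

```python
from kubernetes import client, config

NAMESPACE = "rds-0"  # placeholder: adjust to a namespace created by the workload

config.load_kube_config()
v1 = client.CoreV1Api()

latencies = []
for pod in v1.list_namespaced_pod(NAMESPACE).items:
    created = pod.metadata.creation_timestamp
    # The Ready condition's lastTransitionTime marks when the pod became ready.
    ready = next((c for c in (pod.status.conditions or [])
                  if c.type == "Ready" and c.status == "True"), None)
    if ready:
        # podReadyLatency = podReadyTime - podCreationTime
        latencies.append((ready.last_transition_time - created).total_seconds())

latencies.sort()
if latencies:
    p99 = latencies[int(0.99 * (len(latencies) - 1))]  # simple nearest-rank p99
    print(f"pods: {len(latencies)}, p99 podReadyLatency: {p99:.2f}s")
```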
Actual results:
In this workload (rds-core) we have pods of different types (burstable and guaranteed) with different networking configurations, as follows:
- Burstable:
  - Client: 1x OVN
  - Server: 1x OVN and 1x SRIOV
- Guaranteed:
  - DPDK: 2x SRIOV and 1x OVN (restricted to some worker nodes, and not sharing node resources with any other pod type)
Across OCP 4.16, 4.18, and 4.20 we observe a regression in podReadyLatency, driven mainly by the client pods:
- OCP 4.16: 9.4s
- OCP 4.20: 9.84s
While the difference in the average values reported above (~5%) might not look significant across versions, in OCP 4.20 many client pods become ready within 2-3s while a significant number of pods of the same type take 12-14s. These slow client pods are pushing the podReadyLatency p99 values up, and this is happening exclusively to the client pods: server and DPDK pods' podReadyLatency even improved overall between OCP 4.18 and OCP 4.20.
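For illustration, a minimal sketch of how such a bimodal distribution moves the p99 far more than the average. The latency samples below are hypothetical, chosen only to mimic the observed 2-3s / 12-14s split, not measured data:

```python
import numpy as np

# Hypothetical client-pod podReadyLatency samples (seconds): the majority of
# pods become ready in 2-3s, but a sizeable tail lands at 12-14s, mimicking
# the split observed on OCP 4.20.
fast = np.random.uniform(2.0, 3.0, size=80)    # fast client pods
slow = np.random.uniform(12.0, 14.0, size=20)  # delayed client pods
latencies = np.concatenate([fast, slow])

print(f"avg: {latencies.mean():.2f}s")              # the average masks the bimodality
print(f"p99: {np.percentile(latencies, 99):.2f}s")  # the p99 lands in the 12-14s tail
```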
Expected results:
More consistent, homogeneous podReadyLatency values for any type of pod.
Additional info:
Readouts with the data and more details about the environment are available:
- OCP 4.16: https://docs.google.com/presentation/d/1qbochvGR4N_EToEYG_HBobKZldd6118xeeft1QFlT9A/edit?slide=id.g2efc8ddbd83_0_399#slide=id.g2efc8ddbd83_0_399
- OCP 4.18: https://docs.google.com/presentation/d/1LKsqYSA3fqL5WR_TKiIkVeO3kGc7omNGLjgX2px-xmk/edit?slide=id.g36869f6151d_0_53#slide=id.g36869f6151d_0_53
- OCP 4.20: https://docs.google.com/presentation/d/1vOxnZYRfwJ0RsKOFg9J5x74Y7aeo60-c8Oik95Yrw1Q/edit?slide=id.g36869f6151d_0_53#slide=id.g36869f6151d_0_53
Earlier, we have reported