Bug
Resolution: Unresolved
Critical
4.17.z
Quality / Stability / Reliability
Customer Escalated
Description of problem:
In a management cluster with 55 Hosted Control Planes (HCPs) running on 18 nodes (3 masters + 15 workers), 352 HCP pods from 27 different hosted clusters (49% of all hosted clusters) were scheduled onto a single worker node (we08d23). The cluster is experiencing severe latency, with up to 20+ second delays for basic system operations such as `busctl` calls to systemd-hostnamed. The HCP workloads are unevenly distributed among the 15 worker nodes, putting some nodes under heavy pressure; 19 kube-apiserver pods landed on that one worker.
$ cat sos_commands/crio/crictl_pods | grep "clusters-" | awk '{print $6}' | grep kube-apiserver | wc -l
19
$ cat sos_commands/crio/crictl_pods | grep "clusters-" | wc -l
352
$ cat sos_commands/crio/crictl_pods | grep "clusters-" | awk '{print $7}' | cut -d- -f1-2 | sort -u | wc -l
27
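For comparison, on a live management cluster the per-node spread of HCP pods can be listed directly. This is a rough sketch, assuming all HCP namespaces carry the clusters- prefix seen in the crictl output; column 8 of `oc get pods -o wide` is the NODE column:
$ oc get pods -A -o wide --no-headers | grep '^clusters-' | awk '{print $8}' | sort | uniq -c | sort -rn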
Environment:
OpenShift Version: 4.17
Worker Nodes: 15 total (256 CPUs each)
- Reserved CPUs for housekeeping: 32 (CPUs 0-31) - via Performance Profile
- Isolated CPUs: 224 (CPUs 32-255)
Hosted Control Planes: ~55
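The allocatable CPU the scheduler actually sees on the affected worker can be read back as follows (node name taken from the description; given 32 reserved CPUs the value should be roughly 256 - 32 = 224, minus any additional kube-reserved/system-reserved overhead):
$ oc get node we08d23 -o jsonpath='{.status.allocatable.cpu}{"\n"}'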
Steps to Reproduce:
1. Deploy 50+ Hosted Control Planes on a management cluster with the spec above.
2. Configure worker nodes with CPU isolation via a Performance Profile, which limits allocatable CPUs (see the check below).
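The configured partitioning from step 2 can be read back from the Performance Profile (the profile name is a placeholder; per the environment above, reserved should be 0-31 and isolated 32-255):
$ oc get performanceprofile <profile-name> -o jsonpath='{.spec.cpu.reserved}{" "}{.spec.cpu.isolated}{"\n"}'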
Actual results:
19 kube-apiserver pods scheduled onto a single worker node.
Expected results:
55 hosted clusters × 3 kube-apiserver replicas (HA) ÷ 15 nodes = 11 kube-apiserver pods per node, i.e. an even spread.
Additional info:
All 19 kube-apiservers belong to different hosted clusters, meaning the pod anti-affinity between replicas of the same hosted cluster is working correctly; the imbalance is across hosted clusters, not within one.
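This can be re-confirmed from the same sosreport data, mirroring the column usage of the earlier commands (where $6 is the pod name and $7 the namespace); the count of distinct hosted clusters among the node's kube-apiserver pods should equal 19:
$ cat sos_commands/crio/crictl_pods | grep "clusters-" | awk '$6 ~ /kube-apiserver/ {print $7}' | cut -d- -f1-2 | sort -u | wc -l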