Bug · Resolution: Unresolved · Major · Important · 4.18.z · Quality / Stability / Reliability · Rejected · NI&D Sprint 278
Description of problem:
On 4.18.24 and 4.18.22, the platform is running the following deployment stack with multiple replicas per deployment. All router pods are exposed as NodePort services on GCP, with one pod per host, session affinity rules, and externalTrafficPolicy: Local to enforce client IP persistence. The load balancer forwards traffic to the given infra node. Each pod has one host (no overlapping hosts).

[wrussell@supportshell-2 04264539]$ oc get deployment
NAME              READY   UP-TO-DATE   AVAILABLE   AGE
router-default    32/32   32           32          4y
router-external   16/16   16           16          4y
router-pci        32/32   32           32          3y
router-pci-2      4/4     4            4           266d
router-pii        8/8     8            8           3y

What we are seeing is that on 4.14 and 4.16 (earlier versions will be confirmed when the customer completes image validation testing) the cluster load was HIGHER than what is currently being sent to the platform. Router pods were crashing continually until threads were expanded to 8 (from 4) and maxconn was expanded to 400,000 (from 50,000). Since then, the pods are stable. HOWEVER, CPU utilization is now redlining (90% CPU consumption) on all nodes, and router-default pods are crashing due to resource exhaustion even at 70% of the potential throughput during load tests. (Reaching 100% throughput crashes pods even at the expanded values.)

IdleConnectionTerminationPolicy set to Immediate was tested and resulted in many I/O failures for client sessions without a corresponding reduction in memory/CPU pressure during the load test. Hard-stop-after was also set (with a very short window of 5m), which did reap connections, but a corresponding reduction in CPU was not observed. We also confirmed that the router pods average a process count of around 50-60 and a thread count of around 400-500, which is below kubelet's and CRI-O's max PID thresholds.

At peak load we expect 4.5K connections per second, which is LESS than the peaks observed/expected on previous versions, yet we are not able to reach that value at present - CPU pressure thresholds start to be hit at roughly 70% of that maximum. Nodes have 8 CPUs / 32 GB of memory and are tainted so that nothing other than routers is scheduled on them.

Multiple/all routes are impacted because the router pods fall over when their utilization climbs too high. We have already scaled out significantly; the platform used to run with roughly half this number of replicas while delivering more throughput with less CPU/memory usage, so something has significantly changed in resource consumption.
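For reference, a minimal sketch of how the tuning described above (8 threads, maxconn 400,000, idleConnectionTerminationPolicy, hard-stop-after) maps onto the IngressController API. This assumes the values were applied through the default IngressController's tuningOptions (and the equivalent controllers for the other router deployments) rather than by editing HAProxy configuration directly; the values shown are the ones quoted above, not independently verified against the customer's cluster:

# Sketch: apply thread count, max connections, and the idle-connection policy
# to the default IngressController (repeat for the other controllers by name).
oc -n openshift-ingress-operator patch ingresscontroller/default --type=merge \
  -p '{"spec":{"tuningOptions":{"threadCount":8,"maxConnections":400000},"idleConnectionTerminationPolicy":"Immediate"}}'

# Sketch: the hard-stop-after window tested above (5m) is set via annotation.
oc -n openshift-ingress-operator annotate ingresscontroller/default \
  ingress.operator.openshift.io/hard-stop-after=5m --overwrite

The operator rolls out new router pods after each change, so the in-pod haproxy.config should be checked afterwards to confirm the intended values were picked up.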
Version-Release number of selected component (if applicable):
haproxy version: 2.8.10-f28885f
ocp version: 4.18.22
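If the HAProxy build needs to be re-confirmed on a given cluster, it can be read directly from a running router pod (sketch; assumes the standard openshift-ingress namespace and the router-default deployment):

oc -n openshift-ingress exec deploy/router-default -- haproxy -v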
How reproducible:
Continual - the production platform is impacted. Scaling out to continually adapt to the problem is the current workaround; the customer can replicate the issue in a lower environment with load testing (see the sketch below).
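A sketch of the scale-out workaround mentioned above, using the default IngressController as an example (the replica count shown matches the current router-default deployment listed in the description; the other controllers would be scaled by name in the same way):

oc -n openshift-ingress-operator patch ingresscontroller/default \
  --type=merge -p '{"spec":{"replicas":32}}'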
Steps to Reproduce:
1. Deploy a 4.16 cluster.
2. Perform baseline load testing with multiple replicas of the router pods to confirm CPU/memory utilization rates (see the sketch after this list).
3. Upgrade to 4.18.22 and retest - observe that, with no corresponding increase in load, memory/CPU allocation is significantly higher and threatens the stability of the platform.
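A minimal sketch for collecting the baseline in step 2, assuming the router pods run in the standard openshift-ingress namespace and the dedicated router nodes carry the usual infra role label (both assumptions should be checked against the target cluster):

# Point-in-time CPU/memory usage per router pod
oc adm top pods -n openshift-ingress

# Node-level view; the label selector assumes the router nodes are labeled as infra nodes
oc adm top nodes -l node-role.kubernetes.io/infra=

Capturing the same numbers before and after the 4.18.22 upgrade, at the same offered load, gives the comparison described in step 3.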
Actual results:
The cluster is unable to handle peak load requirements and scaling demands.
Expected results:
HAProxy resource utilization is expected to increase somewhat between versions, but not this drastically.
Additional info:
See the first comment for data points, requested next steps, and the test flow.