Bug · Resolution: Unresolved · Major · Important · 4.18.z · Quality / Stability / Reliability · Rejected · NI&D Sprint 278
Description of problem:
On 4.18.24 and 4.18.22, the platform is running the following deployment stack with multiple replicas per deployment. All router pods are exposed as NodePort services on GCP, with one pod per host, session affinity rules, and externalTrafficPolicy: Local to enforce client IP persistence. The load balancer forwards traffic to the given infra node. Each pod has one host (no overlapping hosts).

[wrussell@supportshell-2 04264539]$ oc get deployment
NAME              READY   UP-TO-DATE   AVAILABLE   AGE
router-default    32/32   32           32          4y
router-external   16/16   16           16          4y
router-pci        32/32   32           32          3y
router-pci-2      4/4     4            4           266d
router-pii        8/8     8            8           3y

What we are seeing is that on 4.14 and 4.16 (earlier versions will be confirmed when the customer completes image validation testing) the cluster load was HIGHER than what is currently being sent to the platform. Router pods were crashing continually until threads were expanded to 8 (from 4) and maxconn was expanded to 400,000 (from 50,000). Since then, the pods are stable. HOWEVER, CPU utilization is now redlining (90% CPU consumption) on all nodes, and router-default pods are crashing due to resource exhaustion even at 70% of the potential throughput during load tests. (Reaching 100% throughput crashes pods even at the expanded values.)

IdleConnectionTerminationPolicy set to Immediate was tested and resulted in many I/O failures for client sessions without a corresponding reduction in memory/CPU pressure during the load test. Hard-stop-after was also set (with a very short window of 5m), which did reap connections, but a corresponding reduction in CPU was not observed. We also confirmed that the router pods average a process count of around 50-60 and a thread count of around 400-500, which is below kubelet's and CRI-O's max PID thresholds.

At peak load we expect 4.5K connections per second, which is LESS than the peaks observed/expected on previous versions, yet we are not able to reach that value at present - CPU pressure thresholds start to be hit at roughly 70% of that maximum. Nodes have 8 CPUs / 32 GB of memory and are tainted so that nothing other than routers is scheduled on them.

Multiple/all routes are impacted because the router pods fall over when their utilization climbs too high. We have already scaled out significantly; the platform used to run with roughly half this number of replicas while delivering more throughput with less CPU/memory usage, so something has significantly changed in resource consumption.
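For reference, a minimal sketch of how the tuning described above (8 threads, maxconn 400,000, idleConnectionTerminationPolicy, hard-stop-after) maps onto the IngressController API. This assumes the values were applied through the default IngressController's tuningOptions (and the equivalent controllers for the other router deployments) rather than by editing HAProxy configuration directly; the values shown are the ones quoted above, not independently verified against the customer's cluster:

# Sketch: apply thread count, max connections, and the idle-connection policy
# to the default IngressController (repeat for the other controllers by name).
oc -n openshift-ingress-operator patch ingresscontroller/default --type=merge \
  -p '{"spec":{"tuningOptions":{"threadCount":8,"maxConnections":400000},"idleConnectionTerminationPolicy":"Immediate"}}'

# Sketch: the hard-stop-after window tested above (5m) is set via annotation.
oc -n openshift-ingress-operator annotate ingresscontroller/default \
  ingress.operator.openshift.io/hard-stop-after=5m --overwrite

The operator rolls out new router pods after each change, so the in-pod haproxy.config should be checked afterwards to confirm the intended values were picked up.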
Version-Release number of selected component (if applicable):
haproxy version: 2.8.10-f28885f
ocp version: 4.18.22
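If the HAProxy build needs to be re-confirmed on a given cluster, it can be read directly from a running router pod (sketch; assumes the standard openshift-ingress namespace and the router-default deployment):

oc -n openshift-ingress exec deploy/router-default -- haproxy -v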
How reproducible:
Continual - the production platform is impacted. Scaling out to continually adapt to the problem is the current workaround; the customer can replicate the issue in a lower environment with load testing (see the sketch below).
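A sketch of the scale-out workaround mentioned above, using the default IngressController as an example (the replica count shown matches the current router-default deployment listed in the description; the other controllers would be scaled by name in the same way):

oc -n openshift-ingress-operator patch ingresscontroller/default \
  --type=merge -p '{"spec":{"replicas":32}}'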
Steps to Reproduce:
1. Deploy a 4.16 cluster.
2. Perform baseline load testing with multiple replicas of the router pods to confirm CPU/memory utilization rates (see the sketch after this list).
3. Upgrade to 4.18.22 and retest - observe that, with no corresponding increase in load, memory/CPU allocation is significantly higher and threatens the stability of the platform.
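A minimal sketch for collecting the baseline in step 2, assuming the router pods run in the standard openshift-ingress namespace and the dedicated router nodes carry the usual infra role label (both assumptions should be checked against the target cluster):

# Point-in-time CPU/memory usage per router pod
oc adm top pods -n openshift-ingress

# Node-level view; the label selector assumes the router nodes are labeled as infra nodes
oc adm top nodes -l node-role.kubernetes.io/infra=

Capturing the same numbers before and after the 4.18.22 upgrade, at the same offered load, gives the comparison described in step 3.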
Actual results:
The cluster is unable to handle peak load requirements and scaling demands.
Expected results:
HAProxy resource utilization is expected to increase somewhat between versions, but not this drastically.
Additional info:
See the first comment for data points, requested next steps, and the test flow.