Bug
Resolution: Unresolved
4.19.z
Critical
x86_64
Customer Escalated
Description of problem:
This was noticed after a reload of the router pods, which service many TLS routes using external secrets [1]. etcd latency caused the loading of this large number of external secrets to run past the pods' startup probe timeout, and the pods restarted non-stop.

The customer has this latency issue across a couple of clusters, most notably a brand new one. The average logged etcd latency is around 300-500ms, higher than our max target threshold of 200ms. It is significantly slower than expected despite:
- An average WAL time of <10ms
- etcd fio tests confirming the disk can be used for etcd
- iperf testing showing consistent usable bandwidth of about 10Gbps+ with effectively no retries (at most an occasional single retry)
- Low system CPU and memory utilization, leaving enough spare resources
- All etcd databases being defragged; the new cluster is brand new with nothing on it but the routes
- The number of routes they're testing with (about 2000) being within the ~9000 max we've tested for 4.19

The ask here is to determine why etcd latency is as bad as it is when it should in theory be below 200ms, ideally even lower. We're concerned that performance issues in the above context might indicate a problem with etcd.

[1] https://docs.redhat.com/en/documentation/openshift_container_platform/4.19/html/release_notes/ocp-4-19-release-notes#ocp-4-19-networking-support-load-secrets_release-notes
Version-Release number of selected component (if applicable):
4.19.z, but unsure if it would be seen on another version
How reproducible:
The customer can reproduce it on their clusters every time so far; I have had trouble reproducing it myself.
Steps to Reproduce:
n/a
Actual results:
etcd latency is averaging around 300ms-500ms
Expected results:
etcd latency should be 200ms or less
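To compare collected latency samples against the 200ms target in a consistent way, something like the following could be used. This is only a sketch: the function name `summarize_latency` and the sample values are hypothetical and are not taken from the customer's data or from any etcd tooling.

```python
from statistics import mean, quantiles

TARGET_MS = 200.0  # target threshold quoted in this report


def summarize_latency(samples_ms, target_ms=TARGET_MS):
    """Return the mean, p99, and share of samples above the target."""
    if len(samples_ms) < 2:
        raise ValueError("need at least two samples")
    # quantiles(..., n=100) returns 99 cut points; index 98 is the 99th percentile.
    p99 = quantiles(samples_ms, n=100, method="inclusive")[98]
    over = sum(1 for s in samples_ms if s > target_ms)
    return {
        "mean_ms": mean(samples_ms),
        "p99_ms": p99,
        "share_over_target": over / len(samples_ms),
    }


# Hypothetical latency samples in milliseconds, for illustration only.
samples = [310.0, 420.0, 180.0, 500.0, 350.0]
summary = summarize_latency(samples)
print(summary)
```

Reporting the share of samples over the target alongside the mean helps distinguish a uniformly slow cluster from one with occasional spikes.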
Additional info:
Attachments will be posted