Description of problem:
ARO SRE found that the presence of the openshift-marketplace pods on a master node somehow causes high overall disk write bandwidth that exceeds the Azure disk bandwidth limit. Azure then throttles disk operations on that master node, which causes very high etcd write latency (~5s). After further investigation we narrowed the problem down by stopping all pods in the openshift-marketplace namespace: this stopped the very high disk write bandwidth and brought etcd latencies back down to a normal level.
Version-Release number of selected component (if applicable):
How reproducible:
100%
Steps to Reproduce:
1. Install an ARO cluster with version 4.15.35 or 4.16.30.
2. In the OpenShift console, go to the Alerts page and look for the etcdGRPCRequestsSlow alert.
3. The etcdGRPCRequestsSlow alert should be flipping between pending and inactive.
4. Grab the alert's query, run it, and observe the etcd latencies.
Actual results:
etcd latencies are > 1s and can reach 5s or even 9s.
Expected results:
etcd latencies should be < 1s, and the alert should be neither pending nor firing.
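As a rough illustration of the expected-results check, the sketch below compares a set of latency samples against the 1s threshold stated above. The function name `summarize_latencies` and the sample values are hypothetical and only echo the magnitudes mentioned in this report (~5s, 9s). They are not real measurements from an affected cluster, and this is not the alert's actual rule.

```python
from typing import Dict, Iterable, Optional, Union

# Threshold from "Expected results": etcd latencies should stay below 1s.
THRESHOLD_SECONDS = 1.0


def summarize_latencies(
    samples: Iterable[float], threshold: float = THRESHOLD_SECONDS
) -> Dict[str, Union[int, bool, Optional[float]]]:
    """Summarize latency samples (in seconds) against a threshold.

    Hypothetical helper for illustration only.
    """
    values = list(samples)
    if not values:
        # No data: nothing violates the threshold.
        return {"count": 0, "max": None, "violations": 0, "ok": True}
    # A sample violates the expectation when it is not strictly below the threshold.
    violations = [v for v in values if v >= threshold]
    return {
        "count": len(values),
        "max": max(values),
        "violations": len(violations),
        "ok": not violations,
    }


# Hypothetical samples: a healthy node versus a throttled one.
healthy = [0.02, 0.15, 0.4]
throttled = [0.3, 1.2, 5.0, 9.0]

print(summarize_latencies(healthy))
print(summarize_latencies(throttled))
```

In practice the samples would come from running the alert's query as in step 4. How they are fetched is left out here because it depends on the cluster setup.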
Additional info:
- duplicates: OCPBUGS-48697 OLMv0: excessive catalog source snapshots cause severe performance regression [openshift-4.15.z] (Closed)
- is cloned by: OCPBUGS-58070 High latency etcd disk writes due to openshift-marketplace pods/OLM (Closed)