- Bug
- Resolution: Won't Do
- Major
- None
- 4.15.0
- No
- Rejected
- False
Image registry disruption surfaced again, specifically on vSphere serial jobs. It does not happen on every run, but it shows up in somewhere between 50% and 75% of runs.
We dug in on TRT-1318 and found that there are serial tests which taint nodes, and that only one registry replica runs on vSphere. If a test happens to taint the worker where the registry is running, the registry goes down for 10-50s (a reproduction sketch follows the test list below). This is likely why we see the disruption in 50-75% of runs; the actual rate is probably around 66%, due to 1 of 3 worker nodes being selected by these tests.
The tests we found running when the disruption occurred were things like:
[sig-node] NoExecuteTaintManager Single Pod [Serial] eventually evict pod with finite tolerations from tainted nodes [Skipped:SingleReplicaTopology] [Suite:openshift/conformance/serial] [Suite:k8s]
[sig-node] NoExecuteTaintManager Multiple Pods [Serial] only evicts pods without tolerations from tainted nodes [Skipped:SingleReplicaTopology] [Suite:openshift/conformance/serial] [Suite:k8s]
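For context, these tests apply NoExecute taints, and any pod on the tainted node without a matching toleration (including the single registry replica) is evicted. A minimal manual reproduction sketch, assuming a throwaway vSphere cluster and substituting the worker that currently hosts the registry pod; the taint key here is a made-up example, not the one the e2e tests use:
# Hypothetical reproduction, not part of the fix: taint the worker hosting the
# registry pod and watch the single replica get evicted, opening a disruption window.
oc adm taint nodes <worker-running-registry> example.com/demo-evict=value:NoExecute
oc -n openshift-image-registry get pods -o wide -w
# Remove the taint afterwards.
oc adm taint nodes <worker-running-registry> example.com/demo-evict=value:NoExecute-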
Other clouds run these tests in their serial suites, but I found that they were all running two registry replicas.
Similar to https://issues.redhat.com/browse/OCPBUGS-18596, we need a fix to ensure two registry replicas are running in the vSphere serial suites.
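The real fix presumably belongs wherever the registry gets configured for vSphere CI (as in OCPBUGS-18596), but as an illustration of the desired end state, bumping the cluster image registry operator config to two replicas looks like this:
# Sketch of the end state, not the merged change: two registry replicas so a
# single tainted worker cannot take the registry down by itself.
oc patch configs.imageregistry.operator.openshift.io/cluster \
  --type merge -p '{"spec":{"replicas":2}}'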
To verify, check this link a couple of days after the fix merges and goes live; we should see the P75 near 0 (today it is 35s for new connections and about 15s for reused ones).
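As a secondary check against a live vSphere serial cluster (assuming cluster access), the replica count and node spread can be confirmed directly:
# Expect replicas: 2, with the registry pods spread across different workers.
oc -n openshift-image-registry get deployment/image-registry -o jsonpath='{.spec.replicas}{"\n"}'
oc -n openshift-image-registry get pods -o wide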
- is related to
  - TRT-1318 Investigate image registry disruption on vSphere serial jobs (Closed)
  - OCPBUGS-18596 Metal CI clusters see disproportionate image registry disruption during upgrade (Closed)
- relates to
  - OCPBUGS-27323 Test failure in upgrade jobs - [bz-Image Registry] clusteroperator/image-registry should not change condition/Available (Closed)