OpenShift Bugs / OCPBUGS-22382

Image registry experiencing disruption during vSphere serial jobs





      Image registry disruption has surfaced again, specifically on vSphere serial jobs. It does not happen on every run, but it shows up in roughly 50-75% of runs.

      In TRT-1318 we found that some serial tests taint nodes, and that only one registry replica is being used on vSphere. If a test happens to pick the worker where the registry is running, the registry goes down for 10-50s. This is likely why we see disruption in 50-75% of runs; the actual rate is probably around 66%, given that one of the three worker nodes is selected.
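
      One way to sanity-check that estimate is a small probability sketch. It assumes three workers, that each tainting test picks its node uniformly at random and independently of the others, and that the registry pod stays on one worker. The report does not state how many tainting tests run per serial job, so that count is left as a parameter; the function name is hypothetical.

```python
def p_registry_hit(workers: int, tainting_tests: int) -> float:
    """Chance that at least one tainting test lands on the registry's node.

    Assumption (not from the report): each test picks one worker uniformly
    at random, independently of the other tests.
    """
    if workers < 1 or tainting_tests < 0:
        raise ValueError("workers must be >= 1 and tainting_tests >= 0")
    # Probability that a single test misses the registry's worker.
    miss_once = (workers - 1) / workers
    # All tests must miss for the registry to stay undisturbed.
    return 1 - miss_once ** tainting_tests


for n in (1, 2, 3):
    print(f"{n} tainting test(s): {p_registry_hit(3, n):.0%}")
# 1 tainting test(s): 33%
# 2 tainting test(s): 56%
# 3 tainting test(s): 70%
```

      Under these assumptions, two tainting tests per run give about 56% and three give about 70%, both inside the observed 50-75% range.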

      Tests that were running when the disruption occurred included:

      [sig-node] NoExecuteTaintManager Single Pod [Serial] eventually evict pod with finite tolerations from tainted nodes [Skipped:SingleReplicaTopology] [Suite:openshift/conformance/serial] [Suite:k8s]

      [sig-node] NoExecuteTaintManager Multiple Pods [Serial] only evicts pods without tolerations from tainted nodes [Skipped:SingleReplicaTopology] [Suite:openshift/conformance/serial] [Suite:k8s]

      Other clouds run these tests in their serial suites as well, but all of them run two registry replicas.

      Similar to https://issues.redhat.com/browse/OCPBUGS-18596, we need a fix to ensure two replicas are running in vSphere serial suites.
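
      For illustration only, the sketch below builds a merge patch that sets `spec.replicas` on the image registry operator config, plus the matching `oc patch` invocation. It is not necessarily how the fix here or in OCPBUGS-18596 was implemented, and it assumes the registry's storage on vSphere can back more than one replica. The helper names are hypothetical, and the command is only printed, not executed.

```python
import json
import shlex

# Cluster-scoped image registry operator config resource.
REGISTRY_CONFIG = "configs.imageregistry.operator.openshift.io/cluster"


def registry_replicas_patch(replicas: int = 2) -> str:
    """Return a JSON merge patch that sets spec.replicas on the registry config."""
    if replicas < 1:
        raise ValueError("replicas must be >= 1")
    return json.dumps({"spec": {"replicas": replicas}})


def oc_patch_command(replicas: int = 2) -> list[str]:
    """Return the argv for an `oc patch` call applying the merge patch."""
    return [
        "oc", "patch", REGISTRY_CONFIG,
        "--type=merge",
        "-p", registry_replicas_patch(replicas),
    ]


# Print a shell-quoted command line for review instead of running it.
print(shlex.join(oc_patch_command()))
```

      Keeping the patch construction separate from the command makes it easy to reuse the same payload in a CI step or a manifest instead of a manual `oc` call.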

      To verify, check this link a couple of days after the fix merges and goes live. We should see the P75 near 0 (today it is 35s for new connections and about 15s for reused ones).

            fmissi Flavian Missi
            rhn-engineering-dgoodwin Devan Goodwin
            Wen Wang Wen Wang