- Bug
- Resolution: Won't Do
- Major
- None
- 4.15.0
- No
- Rejected
- False
Image registry disruption surfaced again, specifically on vSphere serial jobs. It does not happen on every run, but it shows up in somewhere between 50% and 75% of runs.
We dug in on TRT-1318 and found that there are serial tests which taint nodes, and that only one registry replica runs on vSphere. If a test happens to taint the worker where the registry is running, the registry goes down for 10-50s (a reproduction sketch follows the test list below). This is likely why we see the disruption in 50-75% of runs; the actual rate is probably around 66%, due to 1 of 3 worker nodes being selected by these tests.
The tests we found running when the disruption occurred were things like:
[sig-node] NoExecuteTaintManager Single Pod [Serial] eventually evict pod with finite tolerations from tainted nodes [Skipped:SingleReplicaTopology] [Suite:openshift/conformance/serial] [Suite:k8s]
[sig-node] NoExecuteTaintManager Multiple Pods [Serial] only evicts pods without tolerations from tainted nodes [Skipped:SingleReplicaTopology] [Suite:openshift/conformance/serial] [Suite:k8s]
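For context, these tests apply NoExecute taints, and any pod on the tainted node without a matching toleration (including the single registry replica) is evicted. A minimal manual reproduction sketch, assuming a throwaway vSphere cluster and substituting the worker that currently hosts the registry pod; the taint key here is a made-up example, not the one the e2e tests use:
# Hypothetical reproduction, not part of the fix: taint the worker hosting the
# registry pod and watch the single replica get evicted, opening a disruption window.
oc adm taint nodes <worker-running-registry> example.com/demo-evict=value:NoExecute
oc -n openshift-image-registry get pods -o wide -w
# Remove the taint afterwards.
oc adm taint nodes <worker-running-registry> example.com/demo-evict=value:NoExecute-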
Other clouds run these tests in their serial suites, but I found that they were all running two registry replicas.
Similar to https://issues.redhat.com/browse/OCPBUGS-18596, we need a fix to ensure two registry replicas are running in the vSphere serial suites.
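The real fix presumably belongs wherever the registry gets configured for vSphere CI (as in OCPBUGS-18596), but as an illustration of the desired end state, bumping the cluster image registry operator config to two replicas looks like this:
# Sketch of the end state, not the merged change: two registry replicas so a
# single tainted worker cannot take the registry down by itself.
oc patch configs.imageregistry.operator.openshift.io/cluster \
  --type merge -p '{"spec":{"replicas":2}}'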
To verify, check this link a couple of days after the fix merges and goes live; we should see the P75 near 0 (today it is 35s for new connections and about 15s for reused ones).
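As a secondary check against a live vSphere serial cluster (assuming cluster access), the replica count and node spread can be confirmed directly:
# Expect replicas: 2, with the registry pods spread across different workers.
oc -n openshift-image-registry get deployment/image-registry -o jsonpath='{.spec.replicas}{"\n"}'
oc -n openshift-image-registry get pods -o wide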
- is related to
  - TRT-1318 Investigate image registry disruption on vSphere serial jobs (Closed)
  - OCPBUGS-18596 Metal CI clusters see disproportionate image registry disruption during upgrade (Closed)
- relates to
  - OCPBUGS-27323 Test failure in upgrade jobs - [bz-Image Registry] clusteroperator/image-registry should not change condition/Available (Closed)