OCPBUGS-18596: Metal CI clusters see disproportionate image registry disruption during upgrade


    • Sprint: Sprint 243, Sprint 246
    • Release Note Type: Release Note Not Required

      TRT has noticed that metal 4.14 clusters seem to be seeing an unacceptable amount of disruption to the image registry during upgrades.

      Graphs: https://grafana-loki.ci.openshift.org/d/ISnBj4LVk/disruption?orgId=1&var-platform=metal&var-percentile=P50&var-backend=image-registry-new-connections&var-backend=image-registry-reused-connections&var-backend=ingress-to-console-new-connections&var-backend=ingress-to-console-reused-connections&var-releases=4.14&var-upgrade_type=minor&var-upgrade_type=micro&var-networks=sdn&var-networks=ovn&var-topologies=ha&var-architectures=amd64&var-min_job_runs=10&var-lookback=7&var-master_node_updated=Y

      This shows a P50 of 12-45s over a week's worth of job runs. Compare to ingress-to-console, which sees at most 2s at the P50.
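
      For context, these numbers come from disruption monitors that poll each backend roughly once per second for the duration of the upgrade and record the intervals where requests fail; the new- vs reused-connections variants differ only in whether every request opens a fresh TCP connection. A minimal sketch of that idea is below (the real monitors live in openshift/origin; the URL, timeout, and success check here are placeholders, not the actual implementation):

      package main

      import (
          "fmt"
          "net/http"
          "time"
      )

      // pollBackend hits url once per second for the given duration and prints each
      // interval during which requests failed. With newConns=true every request opens
      // a fresh TCP connection (the "new-connections" backend); otherwise connections
      // are reused (the "reused-connections" backend).
      func pollBackend(url string, d time.Duration, newConns bool) {
          client := &http.Client{
              Timeout:   5 * time.Second,
              Transport: &http.Transport{DisableKeepAlives: newConns},
          }

          var outageStart time.Time
          deadline := time.Now().Add(d)
          for time.Now().Before(deadline) {
              resp, err := client.Get(url)
              healthy := err == nil && resp.StatusCode < 500
              if resp != nil {
                  resp.Body.Close()
              }

              switch {
              case !healthy && outageStart.IsZero():
                  outageStart = time.Now() // outage begins
              case healthy && !outageStart.IsZero():
                  fmt.Printf("disruption: %s for %s\n",
                      outageStart.Format(time.RFC3339), time.Since(outageStart).Round(time.Second))
                  outageStart = time.Time{} // outage ends
              }
              time.Sleep(time.Second)
          }
      }

      func main() {
          // Placeholder route; the CI monitors target the registry's health endpoint.
          pollBackend("https://image-registry.example.com/healthz", 30*time.Minute, true)
      }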

      The problem appears to affect both micro and minor upgrades, and both sdn and ovn networking. As such, this appears to be a registry/metal problem, not a networking issue.

      Sample job runs:

      https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.14-e2e-metal-ipi-upgrade-ovn-ipv6/1699317456872411136

      https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.14-upgrade-from-stable-4.13-e2e-metal-ipi-sdn-bm-upgrade/1699192726903328768

      If you expand the first "intervals" spyglass chart on each of these jobs and search for "registry", you will see that the disruption overlaps with the image-registry ClusterOperator reporting Available=False with a message of:

      condition/Available status/False reason/NoReplicasAvailable changed: Available: The deployment does not have available replicas
      NodeCADaemonAvailable: The daemon set node-ca has available replicas
      ImagePrunerAvailable: Pruner CronJob has been created
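
      For anyone watching this live outside CI, the same transition shows up with something like "oc get clusteroperator image-registry -w", or by watching the registry deployment's available replicas with "oc -n openshift-image-registry get deployment image-registry -w". A minimal sketch of the same check using the OpenShift config client follows (kubeconfig handling and error handling are abbreviated, and the 5s poll interval is arbitrary):

      package main

      import (
          "context"
          "fmt"
          "time"

          configv1 "github.com/openshift/api/config/v1"
          configclient "github.com/openshift/client-go/config/clientset/versioned"
          metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
          "k8s.io/client-go/tools/clientcmd"
      )

      func main() {
          // Build a client from the default kubeconfig location.
          cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
          if err != nil {
              panic(err)
          }
          client := configclient.NewForConfigOrDie(cfg)

          // Poll the image-registry ClusterOperator and log every transition of its
          // Available condition, roughly what the spyglass intervals chart renders.
          var last configv1.ConditionStatus
          for {
              co, err := client.ConfigV1().ClusterOperators().Get(context.TODO(), "image-registry", metav1.GetOptions{})
              if err == nil {
                  for _, c := range co.Status.Conditions {
                      if c.Type == configv1.OperatorAvailable && c.Status != last {
                          fmt.Printf("%s Available=%s reason=%s message=%q\n",
                              time.Now().Format(time.RFC3339), c.Status, c.Reason, c.Message)
                          last = c.Status
                      }
                  }
              }
              time.Sleep(5 * time.Second)
          }
      }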
      

      Is there some reason metal specifically would encounter a problem here?

      Filing against the registry component as I don't know how to reach the metal folks otherwise, but I will loop them in on Slack.

            Assignee: Flavian Missi (fmissi)
            Reporter: Devan Goodwin (rhn-engineering-dgoodwin)
            QA Contact: Wen Wang
            Votes: 0
            Watchers: 6
