-
Bug
-
Resolution: Done
-
Minor
-
4.14.0
-
No
-
Sprint 243, Sprint 246
-
2
-
False
-
-
N/A
-
Release Note Not Required
TRT has picked up on the fact that metal 4.14 clusters seem to be seeing an unacceptable amount of disruption to the image registry during upgrades.
Over a week's worth of job runs, the P50 disruption is between 12s and 45s. Compare this to ingress-to-console, which sees at most 2s at the P50.
The problem appears to affect both micro and minor upgrades, and both SDN and OVN networking. As such, this appears to be a registry/metal problem, not a networking issue.
Sample job runs:
If you expand the first "intervals" spyglass chart on each of these jobs and search for "registry", you will see that the disruption overlaps with the image-registry ClusterOperator reporting Available=False with a message of:
condition/Available status/False reason/NoReplicasAvailable changed: Available: The deployment does not have available replicas NodeCADaemonAvailable: The daemon set node-ca has available replicas ImagePrunerAvailable: Pruner CronJob has been created
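For anyone scanning many job runs for this condition, the message above can be split into its fields with a few lines of Python. This is only a rough sketch: the function names `parse_condition_message` and `registry_unavailable` are hypothetical, and it assumes the interval message always has the `key/value ... changed: detail` shape seen in the sample above.

```python
def parse_condition_message(message: str) -> dict:
    """Split an interval message into its key/value fields plus free-text detail.

    Assumes the shape shown in this report:
    "condition/X status/Y reason/Z changed: <detail>"
    """
    head, _, detail = message.partition(" changed: ")
    # Keep only tokens that look like key/value pairs; ignore anything else.
    fields = dict(tok.split("/", 1) for tok in head.split() if "/" in tok)
    fields["detail"] = detail
    return fields


def registry_unavailable(message: str) -> bool:
    """True when the message reports Available=False for lack of replicas."""
    fields = parse_condition_message(message)
    return (
        fields.get("condition") == "Available"
        and fields.get("status") == "False"
        and fields.get("reason") == "NoReplicasAvailable"
    )


if __name__ == "__main__":
    sample = (
        "condition/Available status/False reason/NoReplicasAvailable changed: "
        "Available: The deployment does not have available replicas"
    )
    print(parse_condition_message(sample))
    print(registry_unavailable(sample))
```

Matching on the reason field rather than the full message text keeps the check stable if the wording of the detail changes between releases.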
Is there some reason metal specifically would encounter a problem here?
Filing against registry as I don't know how to reach the metal folks otherwise, but I will loop them in on Slack.
- relates to
OCPBUGS-22382 Image registry experiencing disruption during vSphere serial jobs (Closed)