-
Bug
-
Resolution: Cannot Reproduce
-
Major
-
None
-
4.17.z, 4.16.z
-
None
-
Quality / Stability / Reliability
-
False
-
-
3
-
Important
Description of problem:
On clusters running OpenShift 4.17, container creation intermittently fails because the node pulls the wrong image architecture (e.g., amd64 instead of arm64). crio then logs an architecture mismatch and the container never starts.
The issue has occurred in multiple cases:
1. While using CrunchyData’s PGO operator (not from OperatorHub)
2. With internal images like cloud-network-ingress-operator:v1.0.3 during a node drain
Even after pulling the correct image with podman using --arch arm64, crio continues to try pulling the incorrect architecture.
The behavior became prominent after an upgrade from 4.17.23 to 4.17.24, but it is not limited to cluster upgrades or node drains: we have also observed it during normal pod scheduling.
Version-Release number of selected component (if applicable):
How reproducible:
Intermittent. Not deterministic, but has been reproduced across multiple nodes and clusters.
Steps to Reproduce:
1. Deploy a pod using a multi-arch image (e.g., crunchy-postgres:15.12.1).
2. On an arm64 node, observe whether it fails to start due to an architecture mismatch.
3. Inspect the crio logs and note whether it pulled the amd64 image instead of arm64.
Actual results:
crio pulls the image with the incorrect architecture:
Image operating system mismatch: image uses OS "linux"+architecture "amd64"+"", expecting one of "linux+arm64+\"v8\", linux+arm64+\"\""
The container fails to start. The issue is resolved only after manually deleting the incorrect image with crictl rmi, after which the node pulls the correct image again.
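To spot affected nodes more quickly, the mismatch message quoted above can be matched mechanically in collected logs. The following is a minimal sketch, assuming the message always has exactly the shape shown in this report and ends at the closing quote. The function name parse_mismatch and the regular expression are illustrative and not part of CRI-O or any OpenShift tooling.

```python
import re

# Pattern derived only from the single message quoted in this report;
# other CRI-O versions may word the message differently.
MISMATCH_RE = re.compile(
    r'image uses OS "(?P<os>[^"]*)"\+architecture "(?P<arch>[^"]*)"\+"(?P<variant>[^"]*)", '
    r'expecting one of "(?P<expected>.*)"'
)


def parse_mismatch(line):
    """Return the actual and expected platforms from a mismatch line, or None."""
    match = MISMATCH_RE.search(line)
    if match is None:
        return None
    expected = []
    # Expected entries look like: linux+arm64+\"v8\"
    for entry in match.group("expected").split(", "):
        parts = entry.replace('\\"', "").split("+")
        os_name, arch, variant = (parts + ["", ""])[:3]
        expected.append((os_name, arch, variant))
    actual = (match.group("os"), match.group("arch"), match.group("variant"))
    return {"actual": actual, "expected": expected}


# The message exactly as quoted in this report.
sample = (
    'Image operating system mismatch: image uses OS "linux"+architecture "amd64"+"", '
    r'expecting one of "linux+arm64+\"v8\", linux+arm64+\"\""'
)

result = parse_mismatch(sample)
print(result["actual"])
print(result["expected"])
```

A line that yields an "actual" architecture absent from the "expected" list points to a node that is holding the wrong image variant and is a candidate for the crictl rmi workaround.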
Expected results:
crio should always select the correct image architecture (arm64 on aarch64 nodes) from multi-arch manifests.
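The expected selection can be illustrated with a small sketch. It is not CRI-O's actual implementation; it only shows the behavior we expect when a node resolves a multi-arch image index. The field names follow the OCI image index layout (manifests, platform, os, architecture, variant). The function name select_manifest, the sample index, and the digests are hypothetical placeholders, and the variant preference order (v8 first, then empty) is taken from the "expecting one of" list in the error message.

```python
def select_manifest(index, os_name, arch, variants=("v8", "")):
    """Return the first index entry matching the node platform, or None.

    Variants are tried in order, mirroring the preference list in the
    mismatch message from this report (assumption).
    """
    for variant in variants:
        for entry in index.get("manifests", []):
            platform = entry.get("platform", {})
            if (
                platform.get("os") == os_name
                and platform.get("architecture") == arch
                and platform.get("variant", "") == variant
            ):
                return entry
    return None


# Hypothetical image index with placeholder digests (not real data).
sample_index = {
    "schemaVersion": 2,
    "mediaType": "application/vnd.oci.image.index.v1+json",
    "manifests": [
        {
            "digest": "sha256:<amd64-digest-placeholder>",
            "platform": {"os": "linux", "architecture": "amd64"},
        },
        {
            "digest": "sha256:<arm64-digest-placeholder>",
            "platform": {"os": "linux", "architecture": "arm64", "variant": "v8"},
        },
    ],
}

chosen = select_manifest(sample_index, "linux", "arm64")
print(chosen["digest"])
```

On an aarch64 node the arm64 entry should be chosen regardless of its position in the index; the failure described here corresponds to the amd64 entry being used instead.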
Additional info:
podman pull --arch arm64 fetches the correct image, which indicates that the registry and manifest are correct. We suspect a regression in the logic CRI-O uses to resolve image manifests for multi-arch images.
This affects multiple images (not just PGO):
docker.bin.sbb.ch/hive/crunchy-postgres:15.12.1
paas.docker.bin.sbb.ch/paas/cloud-network-ingress-operator:v1.0.3
The issue became more frequent after OpenShift 4.17.x updates, but is not strictly related to upgrades.