Bug
Resolution: Done-Errata
When deploying Service Mesh on SNO in a disconnected environment, the Jaeger Pod frequently gets stuck in a Pending state. Specifically, when the Jaeger Operator recreates the jaeger Pod in istio-system, the Pod goes into ImagePullBackOff and subsequently remains Pending.
The cause of the ImagePullBackOff is that the oauth-proxy container in the Jaeger Pod references its image by the 'latest' tag.
$ oc describe pod -n istio-system jaeger-6785d9fd7b-xxxxx
<...>
  Normal   Pulling  6m42s (x3 over 7m25s)   kubelet  Pulling image "registry.redhat.io/openshift4/ose-oauth-proxy:latest"
  Warning  Failed   6m42s (x3 over 7m25s)   kubelet  Failed to pull image "registry.redhat.io/openshift4/ose-oauth-proxy:latest": rpc error: code = Unknown desc = pinging container registry registry.redhat.io: Get "https://registry.redhat.io/v2/": dial tcp: lookup registry.redhat.io on 172.110.12.31:53: server misbehaving
  Warning  Failed   6m42s (x3 over 7m25s)   kubelet  Error: ErrImagePull
  Warning  Failed   6m28s (x6 over 7m25s)   kubelet  Error: ImagePullBackOff
  Normal   BackOff  2m24s (x23 over 7m25s)  kubelet  Back-off pulling image "registry.redhat.io/openshift4/ose-oauth-proxy:latest"

$ oc get deployment jaeger -n istio-system -o yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: jaeger
  namespace: istio-system
spec:
  template:
    metadata:
    spec:
      containers:
      - args:
        image: registry.redhat.io/openshift4/ose-oauth-proxy:latest #<------
The mirror registry used in the disconnected environment contains the images by digest only and carries no tag information, so the node attempts to pull the :latest image from the external registry and fails.
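For reference, this matches how the mirror configuration works: the registries.conf generated from an ImageContentSourcePolicy uses the mirror only for pulls by digest, so a :latest reference is not rewritten and still targets registry.redhat.io. A hedged way to confirm this on the node (assuming the mirror was configured through an ImageContentSourcePolicy that covers this repository):

$ NODE=$(oc get nodes -o name | head -n 1)   # SNO has a single node
$ oc debug "${NODE}" -- chroot /host \
    grep -A5 'registry.redhat.io/openshift4/ose-oauth-proxy' /etc/containers/registries.conf
# Expect a [[registry]] entry whose mirror is used for digest pulls only
# (mirror-by-digest-only = true, or pull-from-mirror = "digest-only" on newer releases),
# which is why the :latest tag falls through to registry.redhat.io.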
By contrast, the oauth-proxy ImageStream references the image by digest, so when the ImageStream is used the image pull succeeds and the Pod is deployed.
$ oc get is -n openshift oauth-proxy -o yaml
apiVersion: image.openshift.io/v1
kind: ImageStream
metadata:
  name: oauth-proxy
  namespace: openshift
spec:
  lookupPolicy:
    local: false
  tags:
  - annotations: null
    from:
      kind: DockerImage
      name: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:9c0cf8f4d56f16d74534c21a67cab0bbc524da3ef38a84116c4080bdc00e46ca #<------
    generation: 5
    importPolicy:
      scheduled: true
    name: v4.4
    referencePolicy:
      type: Source
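As a temporary workaround on an affected cluster, the digest recorded in the ImageStream can be copied into the Deployment by hand. A hedged sketch (the container name oauth-proxy is an assumption, and the Jaeger Operator may revert the change on its next reconcile):

$ DIGEST_REF=$(oc get is oauth-proxy -n openshift -o jsonpath='{.spec.tags[0].from.name}')
$ oc -n istio-system set image deployment/jaeger oauth-proxy="${DIGEST_REF}"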
On startup, the Jaeger Operator queries route.openshift.io to auto-detect platform capabilities. If this query fails, it concludes that it is not running on OpenShift, stops using the ImageStream, and deploys the oauth-proxy container with a tag reference instead.
$ oc -n openshift-operators logs jaeger-operator-57cc46697c-xxxxx -c jaeger-operator | head -n 5
2023-05-25T05:21:11.102501612Z 1.6849920711024122e+09 INFO Versions {"os": "linux", "arch": "amd64", "version": "go1.19.2", "jaeger-operator": "v1.39.0", "identity": "openshift-operators.jaeger-operator", "jaeger": "1.39.0"}
2023-05-25T05:21:11.102712914Z 1.6849920711027002e+09 INFO setup watching namespace(s) {"namespaces": ""}
2023-05-25T05:21:12.153493367Z I0525 05:21:12.153354 1 request.go:682] Waited for 1.035593138s due to client-side throttling, not priority and fairness, request: GET:https://172.30.0.1:443/apis/authorization.openshift.io/v1?timeout=32s
2023-05-25T05:21:18.906113267Z 1.684992078906025e+09 INFO controller-runtime.metrics Metrics server is starting to listen {"addr": "0.0.0.0:8383"}
2023-05-25T05:21:18.909317196Z 1.6849920789092264e+09 ERROR failed to determine the platform capabilities, auto-detected properties will remain the same until next cycle. {"error": "<nil>; Error getting resources for server group route.openshift.io: the server is currently unable to handle the request"} #<------
<...>
2023-05-25T05:21:19.007446090Z 1.6849920790074377e+09 INFO Not running on OpenShift, so won't configure OAuthProxy imagestream.
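Since the operator only re-evaluates platform capabilities on its next detection cycle, recreating the operator Pod once route.openshift.io is served again is the current recovery step. A hedged sketch (the deployment name jaeger-operator in openshift-operators is inferred from the pod name above):

$ oc get --raw /apis/route.openshift.io/v1 > /dev/null && \
    oc -n openshift-operators rollout restart deployment/jaeger-operator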
Because SNO is not an HA configuration, the openshift-apiserver does not respond while it is being recreated and initialized. The issue occurs when the initialization of the openshift-apiserver overlaps with the initialization of the Jaeger Operator.
$ oc get -n openshift-apiserver pods apiserver-7fc4749f56-xxxxx -o yaml
apiVersion: v1
kind: Pod
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2023-05-25T05:21:09Z"
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2023-05-25T05:21:19Z" #<------ overlaps with the initialization of the Jaeger Operator pod
    status: "True"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: "2023-05-25T05:21:19Z"
    status: "True"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: "2023-05-25T05:20:58Z"
    status: "True"
    type: PodScheduled
  startTime: "2023-05-25T05:20:58Z"
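To confirm the overlap on a given occurrence, the relevant timestamps can be compared directly. A hedged sketch that lists them without relying on specific pod names:

$ oc get pods -n openshift-apiserver \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\tReady: "}{.status.conditions[?(@.type=="Ready")].lastTransitionTime}{"\n"}{end}'
$ oc get pods -n openshift-operators \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\tstarted: "}{.status.startTime}{"\n"}{end}'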
According to the customer's reports, this happens quite frequently, forcing them to recreate the Jaeger Operator Pod each time the Jaeger Pod gets stuck in Pending. Could the platform auto-detection that the Jaeger Operator performs at startup be modified to take the characteristics of SNO into account? If any additional information is needed to understand the situation more comprehensively, please let me know.
links to: RHSA-2023:117866 Red Hat OpenShift distributed tracing 2.9.0 operator/operand containers