Distributed Tracing / TRACING-3312

When deploying Service Mesh on SNO in a disconnected environment, the Jaeger Pod frequently goes into Pending state


    • Bug
    • Resolution: Done-Errata
    • Undefined
    • rhosdt-2.9
    • None
    • Jaeger
    • None
    • Tracing Sprint # 240

      When deploying Service Mesh on SNO in a disconnected environment, the Jaeger Pod frequently goes into Pending state. Specifically, when the Jaeger Operator recreates the Jaeger Pod in istio-system, the Pod goes into ImagePullBackOff and subsequently enters a Pending state.

      The cause of the ImagePullBackOff is that the oauth-proxy container in the Jaeger Pod references its image by the 'latest' tag.

      $ oc describe pod -n istio-system jaeger-6785d9fd7b-xxxxx
        Normal   Pulling         6m42s (x3 over 7m25s)   kubelet            Pulling image "registry.redhat.io/openshift4/ose-oauth-proxy:latest"
        Warning  Failed          6m42s (x3 over 7m25s)   kubelet            Failed to pull image "registry.redhat.io/openshift4/ose-oauth-proxy:latest": rpc error: code = Unknown desc = pinging container registry registry.redhat.io: Get "https://registry.redhat.io/v2/": dial tcp: lookup registry.redhat.io on server misbehaving
        Warning  Failed          6m42s (x3 over 7m25s)   kubelet            Error: ErrImagePull
        Warning  Failed          6m28s (x6 over 7m25s)   kubelet            Error: ImagePullBackOff
        Normal   BackOff         2m24s (x23 over 7m25s)  kubelet            Back-off pulling image "registry.redhat.io/openshift4/ose-oauth-proxy:latest"
      $ oc get deployment jaeger -n istio-system -o yaml
      apiVersion: apps/v1
      kind: Deployment
        name: jaeger
        namespace: istio-system
            - args:
              image: registry.redhat.io/openshift4/ose-oauth-proxy:latest    #<------

      The mirror registry used in the disconnected environment holds images by digest and has no tag information. A tag reference therefore cannot be served from the mirror, so the pull attempts to reach the external registry and fails.
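
To make the tag-versus-digest distinction concrete, here is a minimal sketch in Python that classifies an image reference. The function names `is_digest_pinned` and `image_tag` are hypothetical helpers written for this illustration; they are not part of `oc`, the Jaeger Operator, or any registry client, and the parsing covers only the simple reference shapes that appear in this report.

```python
from typing import Optional


def is_digest_pinned(image_ref: str) -> bool:
    """Return True when the reference pins a content digest (name@algo:hex)."""
    return "@" in image_ref


def image_tag(image_ref: str) -> Optional[str]:
    """Return the tag of a tag-based reference, or None if there is no tag."""
    if is_digest_pinned(image_ref):
        return None
    # The tag separator is a colon in the last path segment; a colon in an
    # earlier segment would be a registry port, not a tag.
    last_segment = image_ref.rsplit("/", 1)[-1]
    if ":" not in last_segment:
        return None
    return last_segment.split(":", 1)[1]


# The two references quoted in this report.
deployment_image = "registry.redhat.io/openshift4/ose-oauth-proxy:latest"
imagestream_image = (
    "quay.io/openshift-release-dev/ocp-v4.0-art-dev"
    "@sha256:9c0cf8f4d56f16d74534c21a67cab0bbc524da3ef38a84116c4080bdc00e46ca"
)

for ref in (deployment_image, imagestream_image):
    kind = "digest" if is_digest_pinned(ref) else f"tag ({image_tag(ref)})"
    print(f"{ref}: {kind}")
```

A check of this kind could be run over the container images of a Deployment to flag references that a digest-only mirror would not be able to serve.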

      However, the oauth-proxy ImageStream references an image by digest. When the ImageStream is used, the image pull succeeds and the Pod is deployed.

      $ oc get is -n openshift oauth-proxy -o yaml
      apiVersion: image.openshift.io/v1
      kind: ImageStream
        name: oauth-proxy
        namespace: openshift
          local: false
        - annotations: null
            kind: DockerImage
            name: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:9c0cf8f4d56f16d74534c21a67cab0bbc524da3ef38a84116c4080bdc00e46ca    #<------
          generation: 5
            scheduled: true
          name: v4.4
            type: Source

      Upon startup, the Jaeger Operator queries the route.openshift.io API group to auto-detect platform features. If this query fails, the operator treats the platform as non-OpenShift, stops using the ImageStream, and deploys the oauth-proxy container with a tag reference instead.

      $ oc -n openshift-operators logs jaeger-operator-57cc46697c-xxxxx -c jaeger-operator | head -n 5 
      2023-05-25T05:21:11.102501612Z 1.6849920711024122e+09	INFO	Versions	{"os": "linux", "arch": "amd64", "version": "go1.19.2", "jaeger-operator": "v1.39.0", "identity": "openshift-operators.jaeger-operator", "jaeger": "1.39.0"}
      2023-05-25T05:21:11.102712914Z 1.6849920711027002e+09	INFO	setup	watching namespace(s)	{"namespaces": ""}
      2023-05-25T05:21:12.153493367Z I0525 05:21:12.153354       1 request.go:682] Waited for 1.035593138s due to client-side throttling, not priority and fairness, request: GET:
      2023-05-25T05:21:18.906113267Z 1.684992078906025e+09	INFO	controller-runtime.metrics	Metrics server is starting to listen	{"addr": ""}
      2023-05-25T05:21:18.909317196Z 1.6849920789092264e+09	ERROR	failed to determine the platform capabilities, auto-detected properties will remain the same until next cycle.	{"error": "<nil>; Error getting resources for server group route.openshift.io: the server is currently unable to handle the request"} #<------
      2023-05-25T05:21:19.007446090Z 1.6849920790074377e+09	INFO	Not running on OpenShift, so won't configure OAuthProxy imagestream.
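
The log above suggests that a failed discovery call and a genuinely absent API group lead to the same conclusion. The following Python sketch shows one way the two cases could be kept apart: retry while discovery is unavailable, and decide the platform only from a successful answer. It is an illustration of the idea, not the Jaeger Operator's code (the operator is written in Go). `detect_platform`, `DiscoveryUnavailable`, and the `list_api_groups` callable are all hypothetical names, and modelling the failure as an exception is an assumption.

```python
import time
from enum import Enum
from typing import Callable, Iterable


class Platform(Enum):
    OPENSHIFT = "openshift"
    KUBERNETES = "kubernetes"


class DiscoveryUnavailable(Exception):
    """Hypothetical error: API discovery could not be completed."""


def detect_platform(
    list_api_groups: Callable[[], Iterable[str]],
    attempts: int = 5,
    delay_seconds: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Platform:
    """Decide the platform only from a successful discovery response."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(attempts):
        try:
            groups = set(list_api_groups())
        except DiscoveryUnavailable as err:
            # "Unavailable" is not evidence of "absent": wait and ask again.
            last_error = err
            if attempt < attempts - 1:
                sleep(delay_seconds)
            continue
        if "route.openshift.io" in groups:
            return Platform.OPENSHIFT
        return Platform.KUBERNETES
    # Every attempt failed: surface the error instead of guessing a platform.
    raise last_error
```

The design choice here is that an exhausted retry budget raises an error rather than defaulting to non-OpenShift, so a caller can keep the previous state or retry later instead of switching the oauth-proxy image to a tag reference.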

      Because SNO is not an HA configuration, the openshift-apiserver does not respond while it is being recreated and initialized. The issue occurs when the initialization of the openshift-apiserver overlaps with the startup of the Jaeger Operator.

      $ oc get -n openshift-apiserver pods apiserver-7fc4749f56-xxxxx -o yaml
      apiVersion: v1
      kind: Pod
        - lastProbeTime: null
          lastTransitionTime: "2023-05-25T05:21:09Z"
          status: "True"
          type: Initialized
        - lastProbeTime: null
          lastTransitionTime: "2023-05-25T05:21:19Z"   #<------ overlaps with the initialization of the Jaeger Operator Pod
          status: "True"
          type: Ready
        - lastProbeTime: null
          lastTransitionTime: "2023-05-25T05:21:19Z"
          status: "True"
          type: ContainersReady
        - lastProbeTime: null
          lastTransitionTime: "2023-05-25T05:20:58Z"
          status: "True"
          type: PodScheduled
        startTime: "2023-05-25T05:20:58Z"
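
The overlap can be checked by comparing the timestamps quoted in this report. The Python sketch below parses the operator's error log timestamp and the openshift-apiserver Pod's `Initialized` and `Ready` transition times, then tests whether the failed detection fell between them. The helper names `parse_rfc3339` and `in_unready_window` are hypothetical. Note that `lastTransitionTime` has one-second granularity, so the comparison is only approximate.

```python
from datetime import datetime, timezone


def parse_rfc3339(ts: str) -> datetime:
    """Parse a UTC timestamp ending in 'Z', truncating fractions to microseconds."""
    body = ts.rstrip("Z")
    if "." in body:
        seconds, fraction = body.split(".", 1)
        micro = int(fraction[:6].ljust(6, "0"))
    else:
        seconds, micro = body, 0
    parsed = datetime.strptime(seconds, "%Y-%m-%dT%H:%M:%S")
    return parsed.replace(microsecond=micro, tzinfo=timezone.utc)


def in_unready_window(event: datetime, initialized: datetime, ready: datetime) -> bool:
    """True if the event happened after Initialized but before Ready."""
    return initialized <= event < ready


# Values copied from the operator log and the apiserver Pod status in this report.
detection_failed = parse_rfc3339("2023-05-25T05:21:18.909317196Z")
apiserver_initialized = parse_rfc3339("2023-05-25T05:21:09Z")
apiserver_ready = parse_rfc3339("2023-05-25T05:21:19Z")

overlap = in_unready_window(detection_failed, apiserver_initialized, apiserver_ready)
print("detection failed while apiserver was not Ready:", overlap)
```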

      According to client reports, this happens quite frequently, forcing them to recreate the Jaeger Operator Pod each time the Jaeger Pod goes into Pending state. Could the platform auto-detection that the Jaeger Operator performs at startup be changed to take the characteristics of SNO into account? If any additional information is needed to understand the situation better, please let me know.

            rhn-support-iblancas Israel Blancas Alvarez
            rhn-support-mmatsuta Masafumi Matsuta
            Ishwar Kanse Ishwar Kanse