Distributed Tracing / TRACING-3312

When deploying Service Mesh on SNO in a disconnected environment, the Jaeger Pod frequently goes into Pending state

    • Type: Bug
    • Resolution: Done-Errata
    • Priority: Undefined
    • Fix Version: rhosdt-2.9
    • Component: Jaeger
    • Sprint: Tracing Sprint # 240

      When deploying Service Mesh on SNO in a disconnected environment, the Jaeger Pod frequently goes into a Pending state. Specifically, when the Jaeger Operator recreates the jaeger Pod in istio-system, it goes into ImagePullBackOff and subsequently enters a Pending state.

      The cause of the ImagePullBackOff is that the image reference of the oauth-proxy container in the Jaeger Pod points to an image with the 'latest' tag.

      $ oc describe pod -n istio-system jaeger-6785d9fd7b-xxxxx
      <...>
        Normal   Pulling         6m42s (x3 over 7m25s)   kubelet            Pulling image "registry.redhat.io/openshift4/ose-oauth-proxy:latest"
        Warning  Failed          6m42s (x3 over 7m25s)   kubelet            Failed to pull image "registry.redhat.io/openshift4/ose-oauth-proxy:latest": rpc error: code = Unknown desc = pinging container registry registry.redhat.io: Get "https://registry.redhat.io/v2/": dial tcp: lookup registry.redhat.io on 172.110.12.31:53: server misbehaving
        Warning  Failed          6m42s (x3 over 7m25s)   kubelet            Error: ErrImagePull
        Warning  Failed          6m28s (x6 over 7m25s)   kubelet            Error: ImagePullBackOff
        Normal   BackOff         2m24s (x23 over 7m25s)  kubelet            Back-off pulling image "registry.redhat.io/openshift4/ose-oauth-proxy:latest"
      
      $ oc get deployment jaeger -n istio-system -o yaml
      apiVersion: apps/v1
      kind: Deployment
      metadata:
        name: jaeger
        namespace: istio-system
      spec:
        template:
          metadata:
          spec:
            containers:
            - args:
              image: registry.redhat.io/openshift4/ose-oauth-proxy:latest    #<------
      

      The mirror registry used in the disconnected environment holds the images only by digest and carries no tag information, so the tag-based pull is not served by the mirror; the node then tries to reach the external registry, which is unreachable, and the image pull fails.
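
      For reference (not part of the original report), this digest-only behavior can be confirmed on the cluster. The sketch below assumes a typical disconnected setup whose mirror is configured through an ImageContentSourcePolicy; the CRI-O configuration generated from it marks the mirror as mirror-by-digest-only, so a pull by tag such as ':latest' bypasses the mirror and is attempted against registry.redhat.io. The node name is a placeholder.

      $ oc get imagecontentsourcepolicy -o yaml | grep -B2 -A4 'registry.redhat.io/openshift4'
      $ oc debug node/<node-name> -- chroot /host cat /etc/containers/registries.conf
      # Mirror entries generated from an ImageContentSourcePolicy carry
      # mirror-by-digest-only = true, so tag-based pulls are not redirected
      # to the mirror registry and fall through to registry.redhat.io.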

      However, the oauth-proxy ImageStream references the image by digest, so when the ImageStream is used the image pull succeeds and the Pod is deployed.

      $ oc get is -n openshift oauth-proxy -o yaml
      apiVersion: image.openshift.io/v1
      kind: ImageStream
      metadata:
        name: oauth-proxy
        namespace: openshift
      spec:
        lookupPolicy:
          local: false
        tags:
        - annotations: null
          from:
            kind: DockerImage
            name: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:9c0cf8f4d56f16d74534c21a67cab0bbc524da3ef38a84116c4080bdc00e46ca    #<------
          generation: 5
          importPolicy:
            scheduled: true
          name: v4.4
          referencePolicy:
            type: Source
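
      As a quick check (assuming the v4.4 tag shown above), the tag can be resolved to its digest reference directly from the ImageStream, which is why pulls through the ImageStream can be served by the mirror:

      $ oc get istag -n openshift oauth-proxy:v4.4 -o jsonpath='{.image.dockerImageReference}'
      # Should print the quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:... reference
      # listed under spec.tags[].from.name above.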
      

      Upon startup, the Jaeger Operator queries route.openshift.io to auto-detect platform capabilities. If this query fails, it treats the platform as non-OpenShift and therefore does not configure the OAuthProxy ImageStream, deploying the oauth-proxy container with a tag reference instead.

      $ oc -n openshift-operators logs jaeger-operator-57cc46697c-xxxxx -c jaeger-operator | head -n 5 
      2023-05-25T05:21:11.102501612Z 1.6849920711024122e+09	INFO	Versions	{"os": "linux", "arch": "amd64", "version": "go1.19.2", "jaeger-operator": "v1.39.0", "identity": "openshift-operators.jaeger-operator", "jaeger": "1.39.0"}
      2023-05-25T05:21:11.102712914Z 1.6849920711027002e+09	INFO	setup	watching namespace(s)	{"namespaces": ""}
      2023-05-25T05:21:12.153493367Z I0525 05:21:12.153354       1 request.go:682] Waited for 1.035593138s due to client-side throttling, not priority and fairness, request: GET:https://172.30.0.1:443/apis/authorization.openshift.io/v1?timeout=32s
      2023-05-25T05:21:18.906113267Z 1.684992078906025e+09	INFO	controller-runtime.metrics	Metrics server is starting to listen	{"addr": "0.0.0.0:8383"}
      2023-05-25T05:21:18.909317196Z 1.6849920789092264e+09	ERROR	failed to determine the platform capabilities, auto-detected properties will remain the same until next cycle.	{"error": "<nil>; Error getting resources for server group route.openshift.io: the server is currently unable to handle the request"} #<------
      <...>
      2023-05-25T05:21:19.007446090Z 1.6849920790074377e+09	INFO	Not running on OpenShift, so won't configure OAuthProxy imagestream.
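
      As an additional check (not from the original report), the availability of the API group that the operator probes can be verified directly; while the openshift-apiserver is still initializing, the discovery call fails in the same way as the error in the log above:

      $ oc api-resources --api-group=route.openshift.io
      # Fails with "the server is currently unable to handle the request"
      # while the openshift-apiserver is not yet serving.
      $ oc get --raw /apis/route.openshift.io/v1 > /dev/null && echo "route.openshift.io is reachable"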
      

      Because SNO is not an HA configuration, the openshift-apiserver does not respond while its pod is being recreated and initialized. The issue occurs when the initialization of the openshift-apiserver overlaps with that of the Jaeger Operator.

      $ oc get -n openshift-apiserver pods apiserver-7fc4749f56-xxxxx -o yaml
      apiVersion: v1
      kind: Pod
      status:
        conditions:
        - lastProbeTime: null
          lastTransitionTime: "2023-05-25T05:21:09Z"
          status: "True"
          type: Initialized
        - lastProbeTime: null
          lastTransitionTime: "2023-05-25T05:21:19Z"   #<------ It is overlap the initialization of Jaeger Operator pod
          status: "True"
          type: Ready
        - lastProbeTime: null
          lastTransitionTime: "2023-05-25T05:21:19Z"
          status: "True"
          type: ContainersReady
        - lastProbeTime: null
          lastTransitionTime: "2023-05-25T05:20:58Z"
          status: "True"
          type: PodScheduled
        startTime: "2023-05-25T05:20:58Z"
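
      One way to confirm the overlap (a sketch using the pod names from this report; adjust them to the actual environment) is to compare the Ready timestamp of the openshift-apiserver pod with the start time of the Jaeger Operator pod:

      $ oc get pod -n openshift-apiserver apiserver-7fc4749f56-xxxxx \
          -o jsonpath='{.status.conditions[?(@.type=="Ready")].lastTransitionTime}'
      $ oc get pod -n openshift-operators jaeger-operator-57cc46697c-xxxxx \
          -o jsonpath='{.status.startTime}'
      # If the operator started before the apiserver became Ready,
      # the platform auto-detection fails as shown in the operator log.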
      
      

      According to client reports, this happens quite frequently, forcing them to recreate the Jaeger Operator Pod each time the Jaeger Pod goes into a Pending state. Could the platform auto-detection that the Jaeger Operator runs at startup be modified to take the characteristics of SNO into account? If any additional information is needed to understand the situation more comprehensively, please let me know.
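
      For reference, the manual recovery currently used is roughly the following (a sketch, not an official workaround; the pod name is the one from the operator log above and the wait step assumes the standard openshift-apiserver ClusterOperator). Once the openshift-apiserver is available again, recreating the Jaeger Operator Pod lets the auto-detection succeed, and the operator then reconciles the jaeger Deployment with the ImageStream-based oauth-proxy image:

      $ oc wait --for=condition=Available clusteroperator/openshift-apiserver --timeout=300s
      $ oc -n openshift-operators delete pod jaeger-operator-57cc46697c-xxxxx
      # The replacement operator pod re-runs the platform detection; once it
      # detects OpenShift, it configures the oauth-proxy image from the ImageStream.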

            [TRACING-3312] When deploying Service Mesh on SNO in a disconnected environment, the Jaeger Pod frequently goes into Pending state

            Errata Tool added a comment -

            Since the problem described in this issue should be resolved in a recent advisory, it has been closed.

            For information on the advisory (Moderate: Red Hat OpenShift Distributed Tracing 2.9.0 security update), and where to find the updated files, follow the link below.

            If the solution does not work for you, open a new bug report.
            https://access.redhat.com/errata/RHSA-2023:4986


            GitLab CEE Bot added a comment - Issue solved with fcd024cb6f95f1a331a3f325205aa30b29dc92e1.

            Israel Blancas Alvarez (Inactive) added a comment -

            The changes done as part of https://gitlab.cee.redhat.com/distributed-tracing/jaeger-midstream/-/merge_requests/246 will mitigate the issue, since the platform detection will no longer block the logic that obtains the information for the OAuth Proxy image. This fix will be included as part of RHOSDT 2.9.

            I also created https://github.com/jaegertracing/jaeger-operator/issues/2279, which will address the problem when auto-detection of the platform is enabled.
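
            After updating to RHOSDT 2.9, the fix can be verified on an affected cluster with a quick check (a sketch based on the deployment shown in the description; container order may differ): the oauth-proxy container of the jaeger Deployment should no longer reference the ':latest' tag.

            $ oc get deployment jaeger -n istio-system \
                -o jsonpath='{.spec.template.spec.containers[*].image}'
            # Expect a digest-based or ImageStream-resolved reference instead of
            # registry.redhat.io/openshift4/ose-oauth-proxy:latest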


            GitLab CEE Bot added a comment -

            Israel Blancas Alvarez mentioned this issue in a merge request of distributed-tracing / jaeger-midstream on branch TRACING-3312:

            Disable auto-detection of the platform and use always OpenShift. TRACING-3312


              rhn-support-iblancas Israel Blancas Alvarez (Inactive)
              rhn-support-mmatsuta Masafumi Matsuta
              Ishwar Kanse