[TRACING-3312] When deploying Service Mesh on SNO in a disconnected environment, the Jaeger Pod frequently goes into Pending state

Type: Bug
Resolution: Done-Errata
Priority: Undefined
Fix Version/s: rhosdt-2.9
Affects Version/s: None
Component/s: Jaeger
Labels: None
Story Points: 1
Epic Link: RHOSDT 2.9: Bug and CVE fixes tracker
Documentation Type: Troubleshoot
Sprint: Tracing Sprint # 240


When Service Mesh is deployed on single-node OpenShift (SNO) in a disconnected environment, the Jaeger Pod frequently gets stuck in the Pending state. Specifically, when the Jaeger Operator recreates the Jaeger Pod in istio-system, the Pod goes into ImagePullBackOff and then remains in the Pending state.

The ImagePullBackOff occurs because the oauth-proxy container in the Jaeger Pod references its image by the 'latest' tag.

$ oc describe pod -n istio-system jaeger-6785d9fd7b-xxxxx
<...>
  Normal   Pulling         6m42s (x3 over 7m25s)   kubelet            Pulling image "registry.redhat.io/openshift4/ose-oauth-proxy:latest"
  Warning  Failed          6m42s (x3 over 7m25s)   kubelet            Failed to pull image "registry.redhat.io/openshift4/ose-oauth-proxy:latest": rpc error: code = Unknown desc = pinging container registry registry.redhat.io: Get "https://registry.redhat.io/v2/": dial tcp: lookup registry.redhat.io on 172.110.12.31:53: server misbehaving
  Warning  Failed          6m42s (x3 over 7m25s)   kubelet            Error: ErrImagePull
  Warning  Failed          6m28s (x6 over 7m25s)   kubelet            Error: ImagePullBackOff
  Normal   BackOff         2m24s (x23 over 7m25s)  kubelet            Back-off pulling image "registry.redhat.io/openshift4/ose-oauth-proxy:latest"

$ oc get deployment jaeger -n istio-system -o yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: jaeger
  namespace: istio-system
spec:
  template:
    metadata:
    spec:
      containers:
      - args:
        image: registry.redhat.io/openshift4/ose-oauth-proxy:latest    #<------

The mirror registry used in the disconnected environment holds images by digest only and has no tag information. As a result, the tag reference is not served by the mirror, and the attempt to pull the image from the external registry fails.
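One way to check which containers are affected is to classify each image reference as a tag reference or a digest reference. The following is a minimal sketch, not part of the operator. The helper name image_ref_kind is hypothetical, and it assumes digests use the sha256 algorithm, as in the ImageStream output in this report.

```shell
# Hypothetical helper: print "digest" if the image reference pins a sha256
# digest, otherwise "tag" (an untagged reference implicitly means :latest).
image_ref_kind() {
  case "$1" in
    *@sha256:*) echo "digest" ;;
    *)          echo "tag" ;;
  esac
}

# Example with the reference from the events above.
image_ref_kind "registry.redhat.io/openshift4/ose-oauth-proxy:latest"

# Usage against a live cluster (the pod name is a placeholder):
#   oc get pod -n istio-system jaeger-6785d9fd7b-xxxxx \
#     -o jsonpath='{range .spec.containers[*]}{.image}{"\n"}{end}' |
#   while read -r ref; do echo "$(image_ref_kind "$ref") $ref"; done
```

Any container reported as "tag" would be a candidate for the pull failure in a digest-only mirror.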

However, the oauth-proxy ImageStream references the image by digest. When the ImageStream is used, the image pull succeeds and the Pod is deployed.

$ oc get is -n openshift oauth-proxy -o yaml
apiVersion: image.openshift.io/v1
kind: ImageStream
metadata:
  name: oauth-proxy
  namespace: openshift
spec:
  lookupPolicy:
    local: false
  tags:
  - annotations: null
    from:
      kind: DockerImage
      name: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:9c0cf8f4d56f16d74534c21a67cab0bbc524da3ef38a84116c4080bdc00e46ca    #<------
    generation: 5
    importPolicy:
      scheduled: true
    name: v4.4
    referencePolicy:
      type: Source

At startup, the Jaeger Operator queries the route.openshift.io API group to auto-detect platform capabilities. If that query fails, the operator treats the platform as non-OpenShift, stops using the ImageStream, and deploys the oauth-proxy container with a tag reference instead.

$ oc -n openshift-operators logs jaeger-operator-57cc46697c-xxxxx -c jaeger-operator | head -n 5 
2023-05-25T05:21:11.102501612Z 1.6849920711024122e+09	INFO	Versions	{"os": "linux", "arch": "amd64", "version": "go1.19.2", "jaeger-operator": "v1.39.0", "identity": "openshift-operators.jaeger-operator", "jaeger": "1.39.0"}
2023-05-25T05:21:11.102712914Z 1.6849920711027002e+09	INFO	setup	watching namespace(s)	{"namespaces": ""}
2023-05-25T05:21:12.153493367Z I0525 05:21:12.153354       1 request.go:682] Waited for 1.035593138s due to client-side throttling, not priority and fairness, request: GET:https://172.30.0.1:443/apis/authorization.openshift.io/v1?timeout=32s
2023-05-25T05:21:18.906113267Z 1.684992078906025e+09	INFO	controller-runtime.metrics	Metrics server is starting to listen	{"addr": "0.0.0.0:8383"}
2023-05-25T05:21:18.909317196Z 1.6849920789092264e+09	ERROR	failed to determine the platform capabilities, auto-detected properties will remain the same until next cycle.	{"error": "<nil>; Error getting resources for server group route.openshift.io: the server is currently unable to handle the request"} #<------
<...>
2023-05-25T05:21:19.007446090Z 1.6849920790074377e+09	INFO	Not running on OpenShift, so won't configure OAuthProxy imagestream.
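The fallback can be spotted from the operator log. The following is a minimal sketch, assuming the message text shown above is unchanged in the deployed operator version. The function name operator_fell_back is hypothetical.

```shell
# Hypothetical helper: read operator log lines on stdin and succeed (exit 0)
# if the operator reported that it is not running on OpenShift.
operator_fell_back() {
  grep -qF "Not running on OpenShift"
}

# Usage against a live cluster (the pod name is a placeholder):
#   oc -n openshift-operators logs jaeger-operator-57cc46697c-xxxxx -c jaeger-operator |
#     operator_fell_back && echo "operator is not using the ImageStream"
```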

Because SNO is not a highly available (HA) configuration, the openshift-apiserver does not respond while it is being recreated and initialized. The issue occurs when the initialization of the openshift-apiserver overlaps with the startup of the Jaeger Operator.

$ oc get -n openshift-apiserver pods apiserver-7fc4749f56-xxxxx -o yaml
apiVersion: v1
kind: Pod
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2023-05-25T05:21:09Z"
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2023-05-25T05:21:19Z"   #<------ overlaps with the initialization of the Jaeger Operator pod
    status: "True"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: "2023-05-25T05:21:19Z"
    status: "True"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: "2023-05-25T05:20:58Z"
    status: "True"
    type: PodScheduled
  startTime: "2023-05-25T05:20:58Z"
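Since the failure depends on whether the route.openshift.io API group is being served at that moment, one possible check is to poll it until it responds. The sketch below is a generic retry loop. The function name wait_for_probe and the attempt and delay values are assumptions for illustration, not operator behavior.

```shell
# Hypothetical helper: run a probe command up to ATTEMPTS times, sleeping
# DELAY seconds between failed tries. Succeeds as soon as the probe does.
wait_for_probe() {
  wfp_attempts=$1
  wfp_delay=$2
  shift 2
  wfp_i=1
  while [ "$wfp_i" -le "$wfp_attempts" ]; do
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    if [ "$wfp_i" -lt "$wfp_attempts" ]; then
      sleep "$wfp_delay"
    fi
    wfp_i=$((wfp_i + 1))
  done
  return 1
}

# Usage against a live cluster: wait up to about a minute for the API group.
#   wait_for_probe 30 2 oc get --raw /apis/route.openshift.io/v1
```

If the probe succeeds, the openshift-apiserver is serving the group again, which is the condition the operator's startup detection depends on.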

According to client reports, this happens quite frequently, and each time the Jaeger Pod goes into the Pending state they have to recreate the Jaeger Operator Pod. Could the platform auto-detection that the Jaeger Operator performs at startup be modified to account for the characteristics of SNO? If any additional information is needed to understand the situation, please let me know.

links to

openshift/openshift-docs#62410: Distributed tracing 2.9

RHSA-2023:117866 Red Hat OpenShift distributed tracing 2.9.0 operator/operand containers

mentioned on

Merge request - Disable auto-detection of the platform and use always OpenShift. TRACING-3312

Solved by commit fcd024cb6f95f1a331a3f325205aa30b29dc92e1.

Errata Tool added a comment - 2023/09/06 7:56 AM

Since the problem described in this issue should be resolved in a recent advisory, it has been closed.

For information on the advisory (Moderate: Red Hat OpenShift Distributed Tracing 2.9.0 security update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHSA-2023:4986


GitLab CEE Bot added a comment - 2023/08/01 2:30 PM

Issue solved with fcd024cb6f95f1a331a3f325205aa30b29dc92e1.


Israel Blancas Alvarez (Inactive) added a comment - 2023/08/01 2:24 PM

The changes made in https://gitlab.cee.redhat.com/distributed-tracing/jaeger-midstream/-/merge_requests/246 will mitigate the issue, because platform detection will no longer block the logic that gets the OAuth Proxy image information. This fix will be included in RHOSDT 2.9.

I also created this issue: https://github.com/jaegertracing/jaeger-operator/issues/2279
It will address the problem for the case where platform autodetection is enabled.


GitLab CEE Bot added a comment - 2023/08/01 11:46 AM

Israel Blancas Alvarez mentioned this issue in a merge request of distributed-tracing / jaeger-midstream on branch TRACING-3312:

Disable auto-detection of the platform and use always OpenShift. TRACING-3312

