  1. OpenShift Bugs
  2. OCPBUGS-74905

[DF] switch the catalogue source from exec probe to the built-in gRPC probe

    • Rejected

      • Description of problem:
        A CrashLoopBackOff condition is observed in the Red Hat Marketplace project when Dynatrace OneAgent is installed.
      • Dynatrace analysis:
        According to the Dynatrace response below [Ref 4], the CrashLoopBackOff condition observed with the Red Hat Marketplace is a known symptom of OneAgent injecting into an exec probe process, which can delay the probe's startup enough to violate the probe's timing window [Ref 1].
      • Dynatrace request:
        To address this issue, Dynatrace recommends switching from the exec probe to the built-in gRPC probe [Ref 2].
        However, this change does not appear to be supported through the catalogsource at present [Ref 3].
      • Help request related to the Dynatrace answer above:
        Is there a way (in 4.16 and later versions) to change the gRPC probe configuration settings that are applied through the catalogsource definitions?
        Thank you in advance for your help.

      ~~~
      [Ref 1]
      $ oc get pods redhat-marketplace-97vtb -n openshift-marketplace -o jsonpath='{range .spec.containers[*]}{"Container: "}{.name}{"\n Liveness: "}{.livenessProbe}{"\n Readiness: "}{.readinessProbe}{"\n"}{end}'
      Container: registry-server
      Liveness: {"exec":{"command":["grpc_health_probe","-addr=:50051"]},"failureThreshold":3,"initialDelaySeconds":10,"periodSeconds":10,"successThreshold":1,"timeoutSeconds":5}
      Readiness: {"exec":{"command":["grpc_health_probe","-addr=:50051"]},"failureThreshold":3,"initialDelaySeconds":5,"periodSeconds":10,"successThreshold":1,"timeoutSeconds":5}
      ~~~
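
The probe settings in [Ref 1] can be compared with the latencies reported in [Ref 4] (about 6 s with OneAgent active, about 14 ms without). The following is a minimal sketch of that comparison, not the kubelet's actual logic: it assumes that an exec probe attempt running longer than `timeoutSeconds` counts as a failure and that `failureThreshold` consecutive liveness failures trigger a container restart. The `Probe` class and the function names are hypothetical helpers introduced only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Probe:
    """Subset of the probe fields shown in [Ref 1] (hypothetical helper type)."""
    timeout_s: float
    failure_threshold: int


def probe_succeeds(probe: Probe, latency_s: float) -> bool:
    # Assumption: an attempt that takes longer than timeoutSeconds is a failure.
    return latency_s <= probe.timeout_s


def liveness_restart_triggered(probe: Probe, latencies_s: list[float]) -> bool:
    # Assumption: failureThreshold consecutive failures lead to a restart;
    # any success resets the counter.
    failures = 0
    for latency_s in latencies_s:
        if probe_succeeds(probe, latency_s):
            failures = 0
        else:
            failures += 1
            if failures >= probe.failure_threshold:
                return True
    return False


# Liveness settings from [Ref 1]: timeoutSeconds=5, failureThreshold=3.
liveness = Probe(timeout_s=5, failure_threshold=3)

# Approximate latencies quoted in [Ref 4].
with_oneagent = [6.0, 6.0, 6.0]
without_oneagent = [0.014, 0.014, 0.014]

print(liveness_restart_triggered(liveness, with_oneagent))     # True
print(liveness_restart_triggered(liveness, without_oneagent))  # False
```

Under these assumptions, a probe that consistently takes about 6 s exceeds the 5 s timeout on every attempt, which is consistent with the repeated restarts described in this issue.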

      ~~~
      [Ref 2]
      $ oc get catalogsource
      NAME                  DISPLAY               TYPE   PUBLISHER   AGE
      certified-operators   Certified Operators   grpc   Red Hat     13h
      community-operators   Community Operators   grpc   Red Hat     13h
      redhat-marketplace    Red Hat Marketplace   grpc   Red Hat     13h
      redhat-operators      Red Hat Operators     grpc   Red Hat     13h


      $ oc patch catalogsource redhat-operators -n openshift-marketplace --type='merge' -p '
      {
        "spec": {
          "grpcPodConfig": {
            "livenessProbe": {
              "grpc": { "port": 50051 },
              "initialDelaySeconds": 10,
              "periodSeconds": 10
            },
            "readinessProbe": {
              "grpc": { "port": 50051 },
              "periodSeconds": 5
            }
          }
        }
      }'
      Warning: unknown field "spec.grpcPodConfig.livenessProbe"
      Warning: unknown field "spec.grpcPodConfig.readinessProbe"
      catalogsource.operators.coreos.com/redhat-operators patched (no change)
      ~~~

      ~~~
      [Ref 3]
      $ oc get catalogsource redhat-operators -n openshift-marketplace -o jsonpath='{.spec.grpcPodConfig}{"\n"}' | jq .
      {
        "extractContent": {
          "cacheDir": "/tmp/cache",
          "catalogDir": "/configs"
        },
        "memoryTarget": "50Mi",
        "nodeSelector": {
          "kubernetes.io/os": "linux",
          "node-role.kubernetes.io/master": ""
        },
        "priorityClassName": "system-cluster-critical",
        "securityContextConfig": "restricted",
        "tolerations": [
          {
            "effect": "NoSchedule",
            "key": "node-role.kubernetes.io/master",
            "operator": "Exists"
          },
          {
            "effect": "NoExecute",
            "key": "node.kubernetes.io/unreachable",
            "operator": "Exists",
            "tolerationSeconds": 120
          },
          {
            "effect": "NoExecute",
            "key": "node.kubernetes.io/not-ready",
            "operator": "Exists",
            "tolerationSeconds": 120
          }
        ]
      }

      $ oc explain catalogsource.spec.grpcPodConfig | grep "^ [a-z]."
      affinity <Object>
      extractContent <Object>
      memoryTarget <Object>
      nodeSelector <map[string]string>
      priorityClassName <string>
      securityContextConfig <string>
      tolerations <[]Object>
      ~~~
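
The warnings in [Ref 2] follow from the field list in [Ref 3]: the patch sets keys under `spec.grpcPodConfig` that the schema does not define. The sketch below reproduces that check offline as a simplified illustration; it is not how the API server validates requests. The set of known fields is copied from the `oc explain` output in [Ref 3], and `unknown_fields` is a hypothetical helper.

```python
# Field names copied from the `oc explain catalogsource.spec.grpcPodConfig`
# output in [Ref 3].
KNOWN_GRPC_POD_CONFIG_FIELDS = {
    "affinity",
    "extractContent",
    "memoryTarget",
    "nodeSelector",
    "priorityClassName",
    "securityContextConfig",
    "tolerations",
}


def unknown_fields(patch: dict) -> list[str]:
    """Return the spec.grpcPodConfig keys in `patch` that are not in the known set."""
    grpc_pod_config = patch.get("spec", {}).get("grpcPodConfig", {})
    return sorted(
        f"spec.grpcPodConfig.{name}"
        for name in grpc_pod_config
        if name not in KNOWN_GRPC_POD_CONFIG_FIELDS
    )


# The merge patch attempted in [Ref 2].
patch = {
    "spec": {
        "grpcPodConfig": {
            "livenessProbe": {
                "grpc": {"port": 50051},
                "initialDelaySeconds": 10,
                "periodSeconds": 10,
            },
            "readinessProbe": {
                "grpc": {"port": 50051},
                "periodSeconds": 5,
            },
        }
    }
}

for field in unknown_fields(patch):
    print(f'Warning: unknown field "{field}"')
```

Both probe keys fall outside the listed fields, which matches the two warnings and the "patched (no change)" result in [Ref 2].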

      ~~~
      [Ref 4]
      TSNet answer from Dynatrace

      Kindly note that this looks like probe‑related startup latency introduced by OneAgent deep monitoring that's tipping the OpenShift Marketplace pods into CrashLoopBackOff when Dynatrace is active. Your timings make the pattern clear:

      · With Dynatrace ON, grpc_health_probe returns in ~6s real while the actual gRPC RPC is ~1–2 ms—so the process spends most of the time waiting before/around the probe call rather than in the service itself.

      · With Dynatrace OFF, the same probe returns in ~14 ms real.

      That is a known symptom of OneAgent injecting into an exec probe process and slowing its startup enough to violate the probe's timing window. Dynatrace documents this exact scenario and recommends excluding the probe executable from deep monitoring (or using Kubernetes' native gRPC probe to avoid exec); see "Fix probe timeouts due to OneAgent injection" in the Dynatrace Docs.

      Your DynaKube is classicFullStack (spec.oneAgent.classicFullStack), which instruments all processes on the host, including processes started by exec probes inside containers. In classic FS, namespace annotations don't control injection; they apply only to the webhook-based modes (Cloud-Native Full-Stack / Application-only). This is why labeling or annotating openshift-marketplace won't help in classic FS; see "Configure monitoring for namespaces and pods" in the Dynatrace Docs.

      You can stop injecting into the probe process as described above, which excludes only the grpc_health_probe process from deep monitoring so that probes are instant again.

      Alternatively, as per the grpc-ecosystem/grpc-health-probe repository on GitHub ("A command-line tool to perform health-checks for gRPC applications in Kubernetes and elsewhere"), you can prefer native gRPC probes (no exec), which remove the need to start a new process for health checks. Because native gRPC health probes are GA in OCP 4.16 / K8s 1.29, switch from exec to the built-in grpc: probe (which doesn't spawn grpc_health_probe and thus avoids OneAgent's process start hooks entirely):

      Example patch to replace exec-based liveness probe:

      livenessProbe:
        grpc:
          port: 50051            # your service port
        initialDelaySeconds: 5   # tune to your app
        periodSeconds: 10
        timeoutSeconds: 1
        failureThreshold: 3

      readinessProbe:
        grpc:
          port: 50051
        periodSeconds: 5
        timeoutSeconds: 1
        failureThreshold: 3

      If you must remain on exec, temporarily increase initialDelaySeconds / timeoutSeconds to cover the added startup latency, but excluding the probe process from deep monitoring, as described above, is the cleaner fix.
      ~~~
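
The change recommended in [Ref 4] amounts to replacing the `exec` handler of each probe with a `grpc` handler on the same port while keeping the timing fields. The sketch below shows only that transformation on the probe data from [Ref 1]. It does not apply anything to the cluster, and as [Ref 2] and [Ref 3] show, the catalogsource does not currently expose these fields. `exec_probe_to_grpc` is a hypothetical helper, and the `-addr=host:port` parsing assumes the argument form seen in [Ref 1].

```python
def exec_probe_to_grpc(probe: dict) -> dict:
    """Convert a grpc_health_probe exec probe into a native gRPC probe dict."""
    command = probe["exec"]["command"]
    if not command or command[0] != "grpc_health_probe":
        raise ValueError("not a grpc_health_probe exec probe")

    # Find the -addr=<host>:<port> argument and take the port after the last colon.
    addr = next((arg for arg in command[1:] if arg.startswith("-addr=")), None)
    if addr is None or ":" not in addr:
        raise ValueError("no -addr=<host>:<port> argument found")
    port = int(addr.rsplit(":", 1)[1])

    # Keep every timing field, drop the exec handler, add the grpc handler.
    converted = {key: value for key, value in probe.items() if key != "exec"}
    converted["grpc"] = {"port": port}
    return converted


# Liveness probe as reported in [Ref 1].
liveness = {
    "exec": {"command": ["grpc_health_probe", "-addr=:50051"]},
    "failureThreshold": 3,
    "initialDelaySeconds": 10,
    "periodSeconds": 10,
    "successThreshold": 1,
    "timeoutSeconds": 5,
}

print(exec_probe_to_grpc(liveness))
```

The timing values are carried over unchanged here; the example in [Ref 4] uses shorter timeouts, which is a separate tuning decision.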

              rh-ee-cchantse Catherine Chan-Tse
              rhn-support-rbruzzon Riccardo Bruzzone
              Jian Zhang Jian Zhang