
Type: Bug
Resolution: Unresolved
Priority: Major
Fix Version/s: None
Affects Version/s: 4.16.z
Component/s: OLM
Labels:
- dynatrace
- grpc
- latency
- marketplace
- olmv0
- triaged

Activity Type:
None
Blocked:
False
Blocked Reason:

None
Story Points:
None
Severity:
None
Regression:
None

Target Backport Versions:
None
Target Version:
None
Release Blocker:
Rejected
Sprint:
None

SFDC Cases Counter:
SFDC Cases Open:
SFDC Cases Links:

PX Impact Score:

Release Note Status:
None
Release Note Type:
None
Release Note Text:
None

Escape Reason:
None
Escape Impact:
None
Corrective Measures:
None
SDLC stage when should've been found:
None

Description of problem:
CrashLoopBackOff condition observed in the Red Hat Marketplace project with Dynatrace OneAgent installed

Dynatrace analysis:
Based on the Dynatrace response below [Ref 4], the CrashLoopBackOff condition observed with the Red Hat Marketplace is a known symptom of OneAgent injecting into an exec probe process, which can delay startup sufficiently to violate the probe’s timing window [Ref 1].

Dynatrace request:
Dynatrace recommends switching from the exec probe to the built-in gRPC probe [Ref 4].
However, this change does not appear to be supported through the catalogsource: a patch that adds the probe fields is ignored with "unknown field" warnings [Ref 2], and spec.grpcPodConfig exposes no probe settings [Ref 3].

Help request related to the Dynatrace answer above:
Is there a supported way, in 4.16 and later versions, to update the gRPC probe settings that are applied through the catalogsource definitions?
Thank you in advance for your help.

~~~
[Ref 1]
$ oc get pods redhat-marketplace-97vtb -n openshift-marketplace -o jsonpath='{range .spec.containers[*]}{"Container: "}{.name}{"\n Liveness: "}{.livenessProbe}{"\n Readiness: "}{.readinessProbe}{"\n"}{end}'
Container: registry-server
Liveness: {"exec":{"command":["grpc_health_probe","-addr=:50051"]},"failureThreshold":3,"initialDelaySeconds":10,"periodSeconds":10,"successThreshold":1,"timeoutSeconds":5}
Readiness: {"exec":{"command":["grpc_health_probe","-addr=:50051"]},"failureThreshold":3,"initialDelaySeconds":5,"periodSeconds":10,"successThreshold":1,"timeoutSeconds":5}
~~~

~~~
[Ref 2]
$ oc get catalogsource
NAME                  DISPLAY               TYPE   PUBLISHER   AGE
certified-operators   Certified Operators   grpc   Red Hat     13h
community-operators   Community Operators   grpc   Red Hat     13h
redhat-marketplace    Red Hat Marketplace   grpc   Red Hat     13h
redhat-operators      Red Hat Operators     grpc   Red Hat     13h

$ oc patch catalogsource redhat-operators -n openshift-marketplace --type='merge' -p '
{
  "spec": {
    "grpcPodConfig": {
      "livenessProbe": {
        "grpc": { "port": 50051 },
        "initialDelaySeconds": 10,
        "periodSeconds": 10
      },
      "readinessProbe": {
        "grpc": { "port": 50051 },
        "periodSeconds": 5
      }
    }
  }
}'
Warning: unknown field "spec.grpcPodConfig.livenessProbe"
Warning: unknown field "spec.grpcPodConfig.readinessProbe"
catalogsource.operators.coreos.com/redhat-operators patched (no change)
~~~

~~~
[Ref 3]
$ oc get catalogsource redhat-operators -n openshift-marketplace -o jsonpath='{.spec.grpcPodConfig}{"\n"}' | jq .
{
  "extractContent": {
    "cacheDir": "/tmp/cache",
    "catalogDir": "/configs"
  },
  "memoryTarget": "50Mi",
  "nodeSelector": {
    "kubernetes.io/os": "linux",
    "node-role.kubernetes.io/master": ""
  },
  "priorityClassName": "system-cluster-critical",
  "securityContextConfig": "restricted",
  "tolerations": [
    {
      "effect": "NoSchedule",
      "key": "node-role.kubernetes.io/master",
      "operator": "Exists"
    },
    {
      "effect": "NoExecute",
      "key": "node.kubernetes.io/unreachable",
      "operator": "Exists",
      "tolerationSeconds": 120
    },
    {
      "effect": "NoExecute",
      "key": "node.kubernetes.io/not-ready",
      "operator": "Exists",
      "tolerationSeconds": 120
    }
  ]
}

$ oc explain catalogsource.spec.grpcPodConfig | grep "^ [a-z]."
affinity <Object>
extractContent <Object>
memoryTarget <Object>
nodeSelector <map[string]string>
priorityClassName <string>
securityContextConfig <string>
tolerations <[]Object>
~~~

~~~
[Ref 4]
TSNet answer from Dynatrace

Kindly note that this looks like probe‑related startup latency introduced by OneAgent deep monitoring that's tipping the OpenShift Marketplace pods into CrashLoopBackOff when Dynatrace is active. Your timings make the pattern clear:

· With Dynatrace ON, grpc_health_probe returns in ~6s real while the actual gRPC RPC is ~1–2 ms—so the process spends most of the time waiting before/around the probe call rather than in the service itself.

· With Dynatrace OFF, the same probe returns in ~14 ms real.

That is a known symptom of OneAgent injecting into an exec probe process and slowing its startup enough to violate the probe's timing window. Dynatrace documents this exact scenario and recommends excluding the probe executable from deep monitoring (or using Kubernetes' native gRPC probe to avoid exec); see "Fix probe timeouts due to OneAgent injection" (Dynatrace Docs).

Your DynaKube is classicFullStack (spec.oneAgent.classicFullStack), which instruments all processes on the host, including processes started by exec probes inside containers. In classic FS, namespace annotations don't control injection; they apply to the webhook-based modes (Cloud-Native Full-Stack / Application-only). This is why labeling/annotating openshift-marketplace won't help in classic FS; see "Configure monitoring for namespaces and pods" (Dynatrace Docs).

As mentioned above, you can stop injection into the probe process. This excludes only the grpc_health_probe process from deep monitoring, so probes are instant again.

Per the grpc-ecosystem/grpc-health-probe project on GitHub ("A command-line tool to perform health-checks for gRPC applications in Kubernetes and elsewhere"), prefer native gRPC probes (no exec), which remove the need to start a new process for health checks. Since OCP 4.16 / K8s 1.29 has native gRPC health probes GA, switch from exec to the built-in grpc: probe (which doesn't spawn grpc_health_probe and thus avoids OneAgent's process start hooks entirely):

Example patch to replace exec-based liveness probe:

livenessProbe:
  grpc:
    port: 50051            # your service port
  initialDelaySeconds: 5   # tune to your app
  periodSeconds: 10
  timeoutSeconds: 1
  failureThreshold: 3

readinessProbe:
  grpc:
    port: 50051
  periodSeconds: 5
  timeoutSeconds: 1
  failureThreshold: 3
If you must remain on exec, temporarily increase initialDelaySeconds / timeoutSeconds to cover the added startup latency, but the exclusion in step 1 is the cleaner fix.
~~~
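
For reference, the timing argument can be checked with simple arithmetic. The sketch below is illustrative only: it takes the liveness probe settings from [Ref 1] and the probe durations quoted in [Ref 4] (about 6 s with OneAgent, about 14 ms without), and the function names `probe_times_out` and `restart_after_s` are made up for this example. The restart estimate assumes every probe times out and only approximates how the kubelet schedules probes.

```shell
#!/bin/sh
# Liveness probe settings copied from [Ref 1] (all in seconds).
initial_delay=10
period=10
timeout=5
failure_threshold=3

# Succeeds (exit 0) when a probe run lasting $1 milliseconds exceeds
# timeoutSeconds and is therefore counted as a failed probe.
probe_times_out() {
    [ "$1" -gt $((timeout * 1000)) ]
}

# Rough estimate, in seconds after container start, of when the container
# is restarted if every liveness probe times out. Assumption: the first
# probe starts at initialDelaySeconds and later probes follow every
# periodSeconds; real kubelet timing differs slightly.
restart_after_s() {
    echo $((initial_delay + (failure_threshold - 1) * period + timeout))
}

# Probe durations reported in [Ref 4]: ~6 s with OneAgent, ~14 ms without.
for ms in 6000 14; do
    if probe_times_out "$ms"; then
        echo "${ms} ms probe: exceeds ${timeout}s timeout, restart after ~$(restart_after_s)s"
    else
        echo "${ms} ms probe: within ${timeout}s timeout"
    fi
done
```

With these numbers the 6 s probe run exceeds the 5 s timeout on every attempt, which is consistent with the CrashLoopBackOff, while the 14 ms run is well within the limit.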
