Distributed Tracing / TRACING-4511

Tempo: Jaeger UI queries fail with a 504 Gateway Time-out depending on result size and number of traces sent to TempoStack.


    • Tracing Sprint # 258, Tracing Sprint # 260
    • Important

      tempoVersion: 2.4.1 for both customer cases I've seen so far.

      We seem to be hitting a limitation in Tempo with regard to Jaeger UI querying and the number of results it can handle.

      Unless the customer sets the Lookback and Limit Results query parameters to low values (Limit Results less than 20), a 504 is returned, as seen in the tempo-query logs:

       

      omg logs tempo-tempostack-query-frontend-8568765fcc-jvg6q -c tempo-query
      
      2024-07-26T07:59:11.561305708Z {"level":"error","ts":1721980751.561069,"caller":"app/http_handler.go:504","msg":"HTTP handler, Internal Server Error","error":"stream error: rpc error: code = Canceled desc = context canceled","stacktrace":"github.com/jaegertracing/jaeger/cmd/query/app.(*APIHandler).handleError\n\t/remote-source/jaeger/app/cmd/query/app/http_handler.go:504\ngithub.com/jaegertracing/jaeger/cmd/query/app.(*APIHandler).search\n\t/remote-source/jaeger/app/cmd/query/app/http_handler.go:259\nnet/http.HandlerFunc.ServeHTTP\n\t/usr/lib/golang/src/net/http/server.go:2166\ngithub.com/jaegertracing/jaeger/cmd/query/app.(*APIHandler).handleFunc.traceResponseHandler.func2\n\t/remote-source/jaeger/app/cmd/query/app/http_handler.go:548\nnet/http.HandlerFunc.ServeHTTP\n\t/usr/lib/golang/src/net/http/server.go:2166\ngo.opentelemetry.io/contrib/instrumentation/net/http/otelhttp.WithRouteTag.func1\n\t/remote-source/jaeger/deps/gomod/pkg/mod/go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp@v0.51.0/handler.go:249\nnet/http.HandlerFunc.ServeHTTP\n\t/usr/lib/golang/src/net/http/server.go:2166\ngo.opentelemetry.io/contrib/instrumentation/net/http/otelhttp.(*middleware).serveHTTP\n\t/remote-source/jaeger/deps/gomod/pkg/mod/go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp@v0.51.0/handler.go:212\ngo.opentelemetry.io/contrib/instrumentation/net/http/otelhttp.NewMiddleware.func1.1\n\t/remote-source/jaeger/deps/gomod/pkg/mod/go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp@v0.51.0/handler.go:73\nnet/http.HandlerFunc.ServeHTTP\n\t/usr/lib/golang/src/net/http/server.go:2166\nnet/http.HandlerFunc.ServeHTTP\n\t/usr/lib/golang/src/net/http/server.go:2166\ngithub.com/gorilla/mux.(*Router).ServeHTTP\n\t/remote-source/jaeger/deps/gomod/pkg/mod/github.com/gorilla/mux@v1.8.1/mux.go:212\ngithub.com/jaegertracing/jaeger/cmd/query/app.createHTTPServer.additionalHeadersHandler.func4\n\t/remote-source/jaeger/app/cmd/query/app/additional_headers_handler.go:28\nnet/http.HandlerFunc.ServeHTTP\n\t/usr/lib/golang/src/net/http/server.go:2166\ngithub.com/jaegertracing/jaeger/cmd/query/app.createHTTPServer.PropagationHandler.func5\n\t/remote-source/jaeger/app/pkg/bearertoken/http.go:52\nnet/http.HandlerFunc.ServeHTTP\n\t/usr/lib/golang/src/net/http/server.go:2166\ngithub.com/jaegertracing/jaeger/cmd/query/app.createHTTPServer.CompressHandler.CompressHandlerLevel.func6\n\t/remote-source/jaeger/deps/gomod/pkg/mod/github.com/gorilla/handlers@v1.5.1/compress.go:141\nnet/http.HandlerFunc.ServeHTTP\n\t/usr/lib/golang/src/net/http/server.go:2166\ngithub.com/gorilla/handlers.recoveryHandler.ServeHTTP\n\t/remote-source/jaeger/deps/gomod/pkg/mod/github.com/gorilla/handlers@v1.5.1/recovery.go:78\nnet/http.serverHandler.ServeHTTP\n\t/usr/lib/golang/src/net/http/server.go:3137\nnet/http.(*conn).serve\n\t/usr/lib/golang/src/net/http/server.go:2039"} 

       

       

      We've tried multiple options with the TempoStack CRD (sketched below), including:

      https://github.com/grafana/tempo-operator/blob/main/docs/spec/tempo.grafana.com_tempostacks.yaml

      1. Increasing replicas for the query-frontend, querier, and ingester
      2. Removing any resource limits or requests so the components can use as much as they need
      3. Adjusting the defaultResultLimit value, e.g. increasing it
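
      For reference, the kind of spec changes we tried looked roughly like the fragment below (values are illustrative, not the exact ones from the cases):

      # Illustrative TempoStack spec fragment showing the knobs we changed
      spec:
        resources: {}                 # left empty: no explicit requests/limits imposed
        search:
          defaultResultLimit: 50      # raised for testing (value illustrative)
        template:
          queryFrontend:
            component:
              replicas: 2             # scaled out
          querier:
            replicas: 3               # scaled out
          ingester:
            replicas: 3               # scaled out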

      None have had any effect. All of the Tempo components are working properly and no errors are reported about ingestion or the pushing of traces. I will note that both customer cases I see this in are using OTEL to push the traces.

      Neither of these clusters is large; one of them has only 6 nodes.

      Both of these clusters are using ODF with S3 functionality. 

      Here is a TempoStack resource spec from one customer:

      spec:
        observability:
          grafana:
            instanceSelector: {}
          metrics: {}
          tracing:
            jaeger_agent_endpoint: 'localhost:6831'
        resources: {}
        search:
          defaultResultLimit: 20
          maxDuration: 0s
        managementState: Managed
        limits:
          global:
            ingestion: {}
            query:
              maxSearchDuration: 0s
        serviceAccount: tempo-tempostack
        images: {}
        template:
          compactor:
            replicas: 2
          distributor:
            component:
              replicas: 2
            tls:
              enabled: false
          gateway:
            component: {}
            enabled: false
            ingress:
              route: {}
          ingester:
            replicas: 3
          querier:
            replicas: 3
          queryFrontend:
            component:
              replicas: 2
            jaegerQuery:
              enabled: true
              ingress:
                route:
                  termination: edge
                type: route
              monitorTab:
                enabled: false
                prometheusEndpoint: ''
              servicesQueryDuration: 72h0m0s
        replicationFactor: 1
        storage:
          secret:
            name: tempo-bucket
            type: s3
          tls:
            enabled: false
        storageSize: 10Gi
        hashRing:
          memberlist: {}
        retention:
          global:
            traces: 48h0m0s 
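
      Pulling the query-related knobs out of that spec for quick reference (the trailing comments are my reading of the CRD docs, so treat them as assumptions rather than confirmed behaviour):

      spec:
        search:
          defaultResultLimit: 20      # default number of search results the Jaeger UI requests
          maxDuration: 0s             # maximum allowed search time range; 0s seems to mean no explicit cap
        limits:
          global:
            query:
              maxSearchDuration: 0s   # global query limit; when unset/0s, spec.search.maxDuration applies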

      Either way, the key takeaway is that if we go above a certain number of results we get a 504 timeout after 30 seconds. This implies either a limitation or that the TempoStack can't return the results in time, but these clusters are not large.

      Wondering if we're just hitting a limit with Tempo or some sort of bug, and how we can continue to troubleshoot this, as we have exhausted the options we can configure via the TempoStack. Would increasing the timeout on the query-frontend route that's created help?
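
      On that last point: the 30-second cutoff matches the default HAProxy router timeout on OpenShift, so one experiment could be to raise the timeout on the Jaeger UI / query-frontend route with the standard router annotation. A minimal sketch, assuming the operator-generated route is named tempo-tempostack-query-frontend and does not get reconciled back (both assumptions on my part):

      apiVersion: route.openshift.io/v1
      kind: Route
      metadata:
        name: tempo-tempostack-query-frontend         # adjust to the actual route name in the namespace
        annotations:
          haproxy.router.openshift.io/timeout: 2m     # proxy timeout for this route; the router default is 30s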

      Please let me know if there are any metrics we can gather, and if there's anything more we can adjust, even unsupported options, to test with.

      I'll attach the cases via Jira.

       

              ploffay@redhat.com Pavol Loffay
              rhn-support-cruhm Courtney Ruhm