Bug
Resolution: Done
Major
Tracing Sprint # 258, Tracing Sprint # 260
Important
tempoVersion: 2.4.1 for both customer cases I've seen so far.
We seem to be hitting a limitation with Tempo around Jaeger UI querying and the number of results it can handle.
Unless the customer sets the Lookback and Limit Results query parameters to low values (less than 20), a 504 is returned, as seen in the tempo-query logs:
omg logs tempo-tempostack-query-frontend-8568765fcc-jvg6q -c tempo-query 2024-07-26T07:59:11.561305708Z {"level":"error","ts":1721980751.561069,"caller":"app/http_handler.go:504","msg":"HTTP handler, Internal Server Error","error":"stream error: rpc error: code = Canceled desc = context canceled","stacktrace":"github.com/jaegertracing/jaeger/cmd/query/app.(*APIHandler).handleError\n\t/remote-source/jaeger/app/cmd/query/app/http_handler.go:504\ngithub.com/jaegertracing/jaeger/cmd/query/app.(*APIHandler).search\n\t/remote-source/jaeger/app/cmd/query/app/http_handler.go:259\nnet/http.HandlerFunc.ServeHTTP\n\t/usr/lib/golang/src/net/http/server.go:2166\ngithub.com/jaegertracing/jaeger/cmd/query/app.(*APIHandler).handleFunc.traceResponseHandler.func2\n\t/remote-source/jaeger/app/cmd/query/app/http_handler.go:548\nnet/http.HandlerFunc.ServeHTTP\n\t/usr/lib/golang/src/net/http/server.go:2166\ngo.opentelemetry.io/contrib/instrumentation/net/http/otelhttp.WithRouteTag.func1\n\t/remote-source/jaeger/deps/gomod/pkg/mod/go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp@v0.51.0/handler.go:249\nnet/http.HandlerFunc.ServeHTTP\n\t/usr/lib/golang/src/net/http/server.go:2166\ngo.opentelemetry.io/contrib/instrumentation/net/http/otelhttp.(*middleware).serveHTTP\n\t/remote-source/jaeger/deps/gomod/pkg/mod/go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp@v0.51.0/handler.go:212\ngo.opentelemetry.io/contrib/instrumentation/net/http/otelhttp.NewMiddleware.func1.1\n\t/remote-source/jaeger/deps/gomod/pkg/mod/go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp@v0.51.0/handler.go:73\nnet/http.HandlerFunc.ServeHTTP\n\t/usr/lib/golang/src/net/http/server.go:2166\nnet/http.HandlerFunc.ServeHTTP\n\t/usr/lib/golang/src/net/http/server.go:2166\ngithub.com/gorilla/mux.(*Router).ServeHTTP\n\t/remote-source/jaeger/deps/gomod/pkg/mod/github.com/gorilla/mux@v1.8.1/mux.go:212\ngithub.com/jaegertracing/jaeger/cmd/query/app.createHTTPServer.additionalHeadersHandler.func4\n\t/remote-source/jaeger/app/cmd/query/app/additional_headers_handler.go:28\nnet/http.HandlerFunc.ServeHTTP\n\t/usr/lib/golang/src/net/http/server.go:2166\ngithub.com/jaegertracing/jaeger/cmd/query/app.createHTTPServer.PropagationHandler.func5\n\t/remote-source/jaeger/app/pkg/bearertoken/http.go:52\nnet/http.HandlerFunc.ServeHTTP\n\t/usr/lib/golang/src/net/http/server.go:2166\ngithub.com/jaegertracing/jaeger/cmd/query/app.createHTTPServer.CompressHandler.CompressHandlerLevel.func6\n\t/remote-source/jaeger/deps/gomod/pkg/mod/github.com/gorilla/handlers@v1.5.1/compress.go:141\nnet/http.HandlerFunc.ServeHTTP\n\t/usr/lib/golang/src/net/http/server.go:2166\ngithub.com/gorilla/handlers.recoveryHandler.ServeHTTP\n\t/remote-source/jaeger/deps/gomod/pkg/mod/github.com/gorilla/handlers@v1.5.1/recovery.go:78\nnet/http.serverHandler.ServeHTTP\n\t/usr/lib/golang/src/net/http/server.go:3137\nnet/http.(*conn).serve\n\t/usr/lib/golang/src/net/http/server.go:2039"}
We've tried multiple options with the TempoStack CRD (a rough sketch of these settings follows the list below), including:
https://github.com/grafana/tempo-operator/blob/main/docs/spec/tempo.grafana.com_tempostacks.yaml
- Increasing replicas for query-frontend, querier, and ingester
- Removing any resource limits or requests so the components can use as much as they need
- Adjusting the defaultResultLimit value, for example increasing it
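For reference, a minimal sketch of the kind of overrides we tried, using the field names from the linked CRD doc and the customer spec shown further down; the values here are illustrative, not the exact ones used:

spec:
  search:
    defaultResultLimit: 100   # example of raising the default result limit
  resources: {}               # no stack-wide requests/limits
  template:
    queryFrontend:
      component:
        replicas: 2           # more query-frontend replicas
    querier:
      replicas: 3             # more querier replicas
    ingester:
      replicas: 3             # more ingester replicas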
None of these have had any effect. All Tempo components are working properly, and no errors are being reported about ingestion or the pushing of the logs. I will note that both customer cases I see this in are using OTEL to push the logs.
Neither of these clusters is large; one of them is only a 6-node cluster.
Both clusters are using ODF with S3 functionality.
Here is a TempoStack resource spec from one customer:
spec:
  observability:
    grafana:
      instanceSelector: {}
    metrics: {}
    tracing:
      jaeger_agent_endpoint: 'localhost:6831'
  resources: {}
  search:
    defaultResultLimit: 20
    maxDuration: 0s
  managementState: Managed
  limits:
    global:
      ingestion: {}
      query:
        maxSearchDuration: 0s
  serviceAccount: tempo-tempostack
  images: {}
  template:
    compactor:
      replicas: 2
    distributor:
      component:
        replicas: 2
      tls:
        enabled: false
    gateway:
      component: {}
      enabled: false
      ingress:
        route: {}
    ingester:
      replicas: 3
    querier:
      replicas: 3
    queryFrontend:
      component:
        replicas: 2
      jaegerQuery:
        enabled: true
        ingress:
          route:
            termination: edge
          type: route
        monitorTab:
          enabled: false
          prometheusEndpoint: ''
        servicesQueryDuration: 72h0m0s
  replicationFactor: 1
  storage:
    secret:
      name: tempo-bucket
      type: s3
    tls:
      enabled: false
  storageSize: 10Gi
  hashRing:
    memberlist: {}
  retention:
    global:
      traces: 48h0m0s
Either way, the key takeaway is that if we go above a certain number of results, we get a 504 timeout after 30 seconds. This implies a limitation, or that the Tempo stack simply cannot return the results in time, yet these clusters are not large.
Wondering if we're just hitting a limit with Tempo or some sort of bug, and how we can continue to troubleshoot this, as we have exhausted the options we can configure via the TempoStack. Would increasing the timeout on the frontend route that's created help?
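On the route question: the 30-second cutoff matches the default HAProxy timeout for OpenShift routes, so one quick experiment could be to raise that timeout with the standard route annotation. The route name below is assumed from the stack name, and the operator may reconcile the annotation away, so this is purely a troubleshooting sketch:

apiVersion: route.openshift.io/v1
kind: Route
metadata:
  # name assumed from the stack name; check the actual routes in the namespace
  name: tempo-tempostack-query-frontend
  annotations:
    # default HAProxy timeout is 30s; raise it to rule the route out
    haproxy.router.openshift.io/timeout: 2m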
Please let me know if there are any metrics we can gather, and whether there is anything more we can adjust, even if unsupported, to test with.
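One possibility on the "even unsupported" front, assuming the installed tempo-operator version exposes an extraConfig passthrough for raw Tempo settings (I have not confirmed it does on these clusters), would be to raise Tempo's own query timeouts, which I believe default to 30s. A sketch under that assumption:

spec:
  # Assumption: this operator version supports an extraConfig passthrough for
  # raw Tempo settings; if it does not, ignore this sketch.
  extraConfig:
    tempo:
      server:
        http_server_read_timeout: 2m    # Tempo HTTP server timeouts
        http_server_write_timeout: 2m
      querier:
        search:
          query_timeout: 2m             # Tempo search query timeout (believed to default to 30s)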
I'll attach the cases via Jira.